33
The Use of Text Retrieval and Natural Language Processing in Software Engineering Sonia Haiduc Assistant Professor Department of Computer Science Florida State University

The Use of Text Retrieval and Natural Language Processing

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Use of Text Retrieval and Natural Language Processing

TheUseofTextRetrievalandNaturalLanguageProcessingin

SoftwareEngineering

SoniaHaiducAssistantProfessor

DepartmentofComputerScienceFloridaStateUniversity

Page 2: The Use of Text Retrieval and Natural Language Processing

2

SoniaHaiduc

• Academicbackground:

B.Sc inRomania MS,Ph.D inDetroit,USA

Page 3: The Use of Text Retrieval and Natural Language Processing

Tallahassee, FL

FloridaStateUniversity

Page 4: The Use of Text Retrieval and Natural Language Processing

4

SERENELab

JavierEscobarAvila ChrisMills EstebanParraRodriguez

• SoniaHaiduc,AssistantProfessor

• Ph.D.Students:

Page 5: The Use of Text Retrieval and Natural Language Processing

Mainresearchinterests

• Softwaremaintenanceandevolution

• Programcomprehension

• Sourcecodeanalysis

• Miningsoftwarerepositories

• Developerperformanceandefficiency

Page 6: The Use of Text Retrieval and Natural Language Processing

Ourgoal

• Helpsoftwaredeveloperstobuildandmaintainsoftwarefasterandbetter

• WeoftenleveragetechniquesfromoutsideSE– InformationRetrieval– NaturalLanguageProcessing– MachineLearning

Page 7: The Use of Text Retrieval and Natural Language Processing

TheUseofTextRetrievalandNaturalLanguageProcessingin

SoftwareEngineering

SoniaHaiducAssistantProfessor

DepartmentofComputerScienceFloridaStateUniversity

Page 8: The Use of Text Retrieval and Natural Language Processing

TextualInformationinSoftware• Capturesconceptsoftheproblemdomain,developer

intentions,developercommunication,etc.

• Foundinmanysoftwareartifacts:– Requirements– Designdocuments– Sourcecode(identifiers,comments)– Commitnotes– Documentation– Usermanuals– Q/Awebsites(StackOverflow,etc.)– Developercommunication(emails,chat,tweets,etc.)– …

Page 9: The Use of Text Retrieval and Natural Language Processing

TextRetrieval

• InformationRetrieval(IR):theprocessofactivelyseekingoutinformationrelevanttoatopicofinterest(vanRijsbergen)

• TextRetrieval(TR):abranchofIRwheretheinformationisintextformat– Searchspace:collectionofdocuments(corpus)– Document - generictermforaninformationunit

• book,chapter,article,webpage,etc.• class,method,interface,etc.• individualrequirement,bugdescription,testcase,e-mail,designdiagram,etc.

Page 10: The Use of Text Retrieval and Natural Language Processing

NaturalLanguageProcessing

• ReferstotheuseandabilityofsystemstoprocesssentencesinanaturallanguagesuchasEnglish(ratherthaninaspecialized,artificialcomputerlanguagesuchasC++)

• Combinestechniquesfromcomputerscience,artificialintelligence,computationallinguistics,probabilityandstatistics

Page 11: The Use of Text Retrieval and Natural Language Processing

TRandNLPinSoftwareEngineering

• Appliedtoover30differentSEtaskso TraceabilityLinkRecoveryo Feature/concept/concern/buglocationo Codereuseo Bugtriageo Programcomprehensiono Architecture/designrecoveryo Qualityassessmentandmeasuremento Softwareevolutionanalysiso Defectpredictionanddebuggingo Automaticdocumentationo Testing

o Requirementsanalysiso Restructuring/refactoringo Softwarecategorizationo Licensinganalysiso Impactanalysiso Clonedetectiono Effortprediction/estimationo Domainanalysiso Webservicesdiscoveryo Usecaseanalysiso Teammanagement,etc.

Page 12: The Use of Text Retrieval and Natural Language Processing

UsingTRandNLPforRetrievingSoftwareArtifacts

• FormulatetheSEtaskasaretrievalproblemandfindthesoftwareartifactsthatsatisfyaparticularinformationneed

• Someexamples:– BugLocation:retrieveallmethodsinthecoderelevantforaparticularbugreport;– BugReportDe-duplication:findallbugreportsthatalreadyexistandaresimilarto

anewbugreport,inordertopreventduplication.– BugTriage:givenanewbugreport,findthesolvedbugreportthatismostsimilar

tothenewoneandassignittothesamedeveloper.– FeatureLocation:findtheclassesinthecodethatimplementaparticularfeature

orconcept;– CodeReuse:retrievepiecesofcodeorentiresystemsonlinethatimplementa

particularfunctionality;– Cloneandplagiarismdetection:givenapieceofcode(e.g.,amethod),findsimilar

piecesofcodetoitandmarkthemaspotentialclones.– DefectPrediction:givenamethodorclass,estimatethenumberofitcontainsby

extrapolatingfromsimilarartifactsforwhichthenumberofdefectsisknown.– ImpactAnalysis:whenchangingamethod,determineothermethodsthatmaybe

impactedbythechangebyfindingthesimilarmethodstoit.

Page 13: The Use of Text Retrieval and Natural Language Processing

RetrievingRelevantSoftwareArtifacts

RelevantArtifacts

TR/NLPModel

Query

INPUT

SoftwareArtifacts

Page 14: The Use of Text Retrieval and Natural Language Processing

Steps1. CreateandpreprocesscorpususinglightNLP2. Indexcorpus– choosetheTRmodel3. Formulateaquery

• Manualorautomatic4. Computesimilarities betweenthequeryand

thedocumentsinthecorpus(i.e.,relevance)5. Rank thedocumentsbasedonthesimilarities6. ReturnthetopNdocumentsastheresultlist7. Inspecttheresults8. GOTO3.ifneededorSTOP

Page 15: The Use of Text Retrieval and Natural Language Processing

CreatingandPreprocessingaSoftwareCorpus

• Parsingsoftwareartifactsandextractingdocuments– corpus =acollectionofdocuments(e.g.,methods)

• Textnormalization(whitespaceandnon-textualtokensremoval,tokenization)

• Splitting:split_identifiers,SplitIdentifiers,etc.• Stopwordsremoval

– commonwordsinEnglish,programminglanguagekeywords,project-specificwords,etc.

• Abbreviationandacronymexpansion• Stemming

Page 16: The Use of Text Retrieval and Natural Language Processing

ExtractingDocuments• Documentscanbeofdifferenttypesandgranularities(e.g.,

methods,classes,files,emails,paragraphs,bugdescriptions,etc.)

Page 17: The Use of Text Retrieval and Natural Language Processing

• Documentscanbeofdifferenttypesandgranularities(e.g.,methods,classes,files,emails,paragraphs,bugdescriptions,etc.)

ExtractingDocuments

Page 18: The Use of Text Retrieval and Natural Language Processing

TransformSourceCodetoPlainText

public void run IProgressMonitor monitor throwsInvocationTargetException InterruptedException if m_iFlag

processCorpus monitor checkUpdate else if m_iFlagprocessCorpus monitor UD_UPDATECORPUS else

processQueryString monitor

Page 19: The Use of Text Retrieval and Natural Language Processing

TextNormalization

• Removewhitespaceandnon-textualcharacters

• Breakupthetextinmeaningful“tokens”andkeeponlywhatmakessense

• Payattentionto:– Numbers: “P450”,“2001”– Hyphenation: “MS-DOS”,“R2-D2”– Punctuation: “John’s”,“command.com”– Case: “us”,“US”– Phrases: “venetianblind”

Page 20: The Use of Text Retrieval and Natural Language Processing

Splitting

• Splitting:decomposingidentifiersintotheircompoundwords

• Identifiersmayusedivisionmarkers(e.g.,camelCaseand_),ormaynot

• Examples:– getName ->‘get’,‘Name’– getMAXstring ->‘get’,‘MAX’,‘string’– ASTNode ->‘AST’,‘Node’– account_number ->‘account’,‘number’– simpletypename ->‘simple’,‘type’,‘name’

Page 21: The Use of Text Retrieval and Natural Language Processing

StopWords

• Veryfrequentwords,withnopowerofdiscrimination(e.g.,programminglanguagekeywords,commonEnglishterms)

• Typicallyfunctionwords,notindicativeofcontent(e.g.,“the”,“class”)

• Thestopwordssetdependsonthedocumentcollectionandontheapplication

Page 22: The Use of Text Retrieval and Natural Language Processing

Abbreviation andAcronymExpansion

• Expandabbreviationsandacronymstothecorrespondingfullwords

• Examples:– mess ->‘message’

– src ->‘source’

– auth ->‘authenticate’OR‘author’?

Page 23: The Use of Text Retrieval and Natural Language Processing

Stemming

• Identifymorphologicalvariants,creating“classes”– system,systems– forget,forgetting,forgetful– analyse,analysis,analytical,analysing

• Replaceeachtermbytheclassrepresentative(rootormostcommonvariant)

Page 24: The Use of Text Retrieval and Natural Language Processing

TRandNLPModels

• TheTR/NLPmodelindexestheinformationinthecorpusforfastretrieval

• DifferentTR/NLPmodelsrepresentthesamecorpusdifferentlyandcanleadtodifferentsearchresults

• MostPopularTRandNLPModelsUsedinSE:– VectorSpaceModel(VSM)– LatentSemanticIndexing(LSI)– LatentDirichlet Allocation(LDA)– OkapiBM25andBM25F– LanguageModels

Page 25: The Use of Text Retrieval and Natural Language Processing

QueryFormulation

• Aqueryisformulatedthatcapturestheinformationneedofthedeveloper– canbemanuallyformulatedbythedeveloper(e.g.,“copypaste”– forfindingtheclassesthatimplementthecopy-pastefeatureinaneditor)

– canbeautomaticallyformulatedbasedonasoftwareartifactorwritteninformationneed(e.g.,extractaquerydirectlyfromabugreportwrittenbyauseroranotherdeveloper)

• Thequeryisthenpreprocessedusingthesameapproachusedonthecorpus

Page 26: The Use of Text Retrieval and Natural Language Processing

SimpleQueryImprovements

• Spellchecking->correctwords

• Comparewithsoftwarevocabulary– removewordsthatdonotappearinthesoftwaresystem

– usesoftwarethesaurustosuggestalternativewords(i.e.,synonyms)

Page 27: The Use of Text Retrieval and Natural Language Processing

QueryReformulation

• Howcanwereformulateabadquery?– Thesaurusexpansion:• Suggesttermssimilartoqueryterms

– Relevancefeedback:• Suggestterms(anddocuments)similartoretrieveddocumentsthathavebeenjudgedtoberelevant

–Moreadvanced:automaticbasedonqueryproperties,miningtermsfromsourcecode,etc.

Page 28: The Use of Text Retrieval and Natural Language Processing

Evaluation• Foragivenquery,producetherankedlistofdocuments.• Determineathresholdandcuttherankedlistsuchthatonly

theresultsuptothethresholdareconsideredasretrieved.

• Markeachdocumentinthetopresults(uptothethreshold)thatisrelevantaccordingtothegoldstandard.

• Note:differentthresholdsontherankedlistproducesdifferentsetsofretrieveddocuments.

Top

threshold

Page 29: The Use of Text Retrieval and Natural Language Processing

RankedListThresholds• Fixed

– e.g.,keepthefirst10results.

• Scorethreshold:– e.g.,keepresultswithscoreinthetop5%ofallscores.

• Gapthreshold:– traversetherankedlist(fromhighesttolowestscore)– findthewidestgapbetweenadjacentscores– thescoreimmediatelypriortothegapbecomesthethreshold tocut thelist

Page 30: The Use of Text Retrieval and Natural Language Processing

PrecisionandRecall

documents relevant of number Totalretrieved documents relevant of Number recall =

retrieved documents of number Totalretrieved documents relevant of Number precision=

Relevant documents

Retrieved documents

Entire document collection

retrieved & relevant

not retrieved but relevant

retrieved & irrelevant

not retrieved & irrelevant

retrieved not retrieved

rele

vant

irrel

evan

t

Page 31: The Use of Text Retrieval and Natural Language Processing

Trade-offBetweenRecallandPrecision

10

1

Recall

Prec

isio

nThe ideal

Returns relevant documents butmisses many useful ones too

Returns most relevantdocuments but includes

lots of junk

Page 32: The Use of Text Retrieval and Natural Language Processing

F-Measure

• ThetraditionalF-measureorbalancedF-score(F1score)istheharmonicmeanofprecisionandrecall

Page 33: The Use of Text Retrieval and Natural Language Processing

PrecisionandRecall:theHolyGrail

• Precisionandrecalldonottelltheentirestory

Good results

Top

Good results

Top

• Averageprecision: