Transcript
Page 1: The Use of Text Retrieval and Natural Language Processing

The Use of Text Retrieval and Natural Language Processing in Software Engineering

Sonia Haiduc, Assistant Professor

Department of Computer Science, Florida State University

Page 2: The Use of Text Retrieval and Natural Language Processing

Sonia Haiduc

• Academic background:
  – B.Sc. in Romania
  – M.S., Ph.D. in Detroit, USA

Page 3: The Use of Text Retrieval and Natural Language Processing

Tallahassee, FL

Florida State University

Page 4: The Use of Text Retrieval and Natural Language Processing

SERENE Lab

• Sonia Haiduc, Assistant Professor
• Ph.D. Students: Javier Escobar Avila, Chris Mills, Esteban Parra Rodriguez

Page 5: The Use of Text Retrieval and Natural Language Processing

Main research interests

• Software maintenance and evolution
• Program comprehension
• Source code analysis
• Mining software repositories
• Developer performance and efficiency

Page 6: The Use of Text Retrieval and Natural Language Processing

Our goal

• Help software developers to build and maintain software faster and better
• We often leverage techniques from outside SE:
  – Information Retrieval
  – Natural Language Processing
  – Machine Learning

Page 8: The Use of Text Retrieval and Natural Language Processing

Textual Information in Software

• Captures concepts of the problem domain, developer intentions, developer communication, etc.
• Found in many software artifacts:
  – Requirements
  – Design documents
  – Source code (identifiers, comments)
  – Commit notes
  – Documentation
  – User manuals
  – Q/A websites (StackOverflow, etc.)
  – Developer communication (emails, chat, tweets, etc.)
  – …

Page 9: The Use of Text Retrieval and Natural Language Processing

Text Retrieval

• Information Retrieval (IR): the process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
• Text Retrieval (TR): a branch of IR where the information is in text format
  – Search space: collection of documents (corpus)
  – Document: generic term for an information unit
    • book, chapter, article, web page, etc.
    • class, method, interface, etc.
    • individual requirement, bug description, test case, e-mail, design diagram, etc.

Page 10: The Use of Text Retrieval and Natural Language Processing

Natural Language Processing

• Refers to the use and ability of systems to process sentences in a natural language such as English (rather than in a specialized, artificial computer language such as C++)
• Combines techniques from computer science, artificial intelligence, computational linguistics, probability and statistics

Page 11: The Use of Text Retrieval and Natural Language Processing

TR and NLP in Software Engineering

• Applied to over 30 different SE tasks
  o Traceability link recovery
  o Feature/concept/concern/bug location
  o Code reuse
  o Bug triage
  o Program comprehension
  o Architecture/design recovery
  o Quality assessment and measurement
  o Software evolution analysis
  o Defect prediction and debugging
  o Automatic documentation
  o Testing
  o Requirements analysis
  o Restructuring/refactoring
  o Software categorization
  o Licensing analysis
  o Impact analysis
  o Clone detection
  o Effort prediction/estimation
  o Domain analysis
  o Web services discovery
  o Use case analysis
  o Team management, etc.

Page 12: The Use of Text Retrieval and Natural Language Processing

Using TR and NLP for Retrieving Software Artifacts

• Formulate the SE task as a retrieval problem and find the software artifacts that satisfy a particular information need
• Some examples:
  – Bug Location: retrieve all methods in the code relevant for a particular bug report.
  – Bug Report De-duplication: find all bug reports that already exist and are similar to a new bug report, in order to prevent duplication.
  – Bug Triage: given a new bug report, find the solved bug report that is most similar to the new one and assign it to the same developer.
  – Feature Location: find the classes in the code that implement a particular feature or concept.
  – Code Reuse: retrieve pieces of code or entire systems online that implement a particular functionality.
  – Clone and plagiarism detection: given a piece of code (e.g., a method), find pieces of code similar to it and mark them as potential clones.
  – Defect Prediction: given a method or class, estimate the number of defects it contains by extrapolating from similar artifacts for which the number of defects is known.
  – Impact Analysis: when changing a method, determine other methods that may be impacted by the change by finding the methods similar to it.

Page 13: The Use of Text Retrieval and Natural Language Processing

Retrieving Relevant Software Artifacts

[Diagram: Query and Software Artifacts (INPUT) → TR/NLP Model → Relevant Artifacts]

Page 14: The Use of Text Retrieval and Natural Language Processing

Steps

1. Create and preprocess corpus using light NLP
2. Index corpus – choose the TR model
3. Formulate a query
   • Manual or automatic
4. Compute similarities between the query and the documents in the corpus (i.e., relevance)
5. Rank the documents based on the similarities
6. Return the top N documents as the result list
7. Inspect the results
8. GO TO 3 if needed, or STOP

Page 15: The Use of Text Retrieval and Natural Language Processing

Creating and Preprocessing a Software Corpus

• Parsing software artifacts and extracting documents
  – corpus = a collection of documents (e.g., methods)
• Text normalization (whitespace and non-textual tokens removal, tokenization)
• Splitting: split_identifiers, SplitIdentifiers, etc.
• Stop words removal
  – common words in English, programming language keywords, project-specific words, etc.
• Abbreviation and acronym expansion
• Stemming

Page 16: The Use of Text Retrieval and Natural Language Processing

Extracting Documents

• Documents can be of different types and granularities (e.g., methods, classes, files, emails, paragraphs, bug descriptions, etc.)

Page 18: The Use of Text Retrieval and Natural Language Processing

Transform Source Code to Plain Text

public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag
processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else
processQueryString monitor
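The transformation shown above can be approximated with a regular expression. This is a minimal sketch: the function name `code_to_text` and the rule of dropping purely numeric tokens are assumptions made for illustration, not the tooling used in the talk.

```python
import re


def code_to_text(source: str) -> str:
    """Strip punctuation and operators from source code, keeping word-like tokens."""
    # Replace every run of characters that cannot be part of an identifier
    # (anything other than letters, digits, and underscore) with a single space.
    text = re.sub(r"[^A-Za-z0-9_]+", " ", source)
    # Assumption: tokens made only of digits carry no textual meaning and are dropped.
    tokens = [tok for tok in text.split() if not tok.isdigit()]
    return " ".join(tokens)


# Hypothetical input line, loosely modeled on the slide's example.
print(code_to_text("if (m_iFlag == 1) { processCorpus(monitor, checkUpdate); }"))
```

Identifiers such as `m_iFlag` survive intact here; breaking them into words is the job of the later splitting step.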

Page 19: The Use of Text Retrieval and Natural Language Processing

Text Normalization

• Remove whitespace and non-textual characters
• Break up the text in meaningful “tokens” and keep only what makes sense
• Pay attention to:
  – Numbers: “P450”, “2001”
  – Hyphenation: “MS-DOS”, “R2-D2”
  – Punctuation: “John’s”, “command.com”
  – Case: “us”, “US”
  – Phrases: “venetian blind”

Page 20: The Use of Text Retrieval and Natural Language Processing

Splitting

• Splitting: decomposing identifiers into their compound words
• Identifiers may use division markers (e.g., camelCase and _), or may not
• Examples:
  – getName -> ‘get’, ‘Name’
  – getMAXstring -> ‘get’, ‘MAX’, ‘string’
  – ASTNode -> ‘AST’, ‘Node’
  – account_number -> ‘account’, ‘number’
  – simpletypename -> ‘simple’, ‘type’, ‘name’
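For identifiers that do use division markers, splitting can be sketched with one regular expression. The function name `split_identifier` is hypothetical, and the sketch only follows the usual camelCase convention. It does not reproduce the slide's split of getMAXstring, where the casing is ambiguous, or of simpletypename, which has no markers; those cases need a dictionary-based or statistical splitter.

```python
import re

# Alternatives, tried in order at each position:
#   1. a run of capitals followed by a capitalized word ("AST" in "ASTNode")
#   2. an optionally capitalized lowercase word ("get", "Name")
#   3. any remaining run of capitals ("MAX")
#   4. a run of digits
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier: str) -> list[str]:
    """Split an identifier on underscores and camelCase boundaries."""
    # Underscores and other separators match no alternative, so findall skips them.
    return _WORD.findall(identifier)


for name in ["getName", "ASTNode", "account_number", "simpletypename"]:
    print(name, "->", split_identifier(name))
```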

Page 21: The Use of Text Retrieval and Natural Language Processing

Stop Words

• Very frequent words, with no power of discrimination (e.g., programming language keywords, common English terms)
• Typically function words, not indicative of content (e.g., “the”, “class”)
• The stop words set depends on the document collection and on the application
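Stop word removal amounts to filtering tokens against a set. In the sketch below, the contents of `STOP_WORDS` are a small illustrative sample, not a list from the talk; a real list would be tuned to the corpus and the task, as the last bullet notes.

```python
# Illustrative sample only: a few common English words plus a few Java keywords.
STOP_WORDS = {
    "the", "a", "an", "of", "and", "to", "in",
    "public", "private", "class", "void", "if", "else", "return",
}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


print(remove_stop_words(["public", "void", "run", "the", "Monitor"]))
```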

Page 22: The Use of Text Retrieval and Natural Language Processing

Abbreviation and Acronym Expansion

• Expand abbreviations and acronyms to the corresponding full words
• Examples:
  – mess -> ‘message’
  – src -> ‘source’
  – auth -> ‘authenticate’ OR ‘author’?

Page 23: The Use of Text Retrieval and Natural Language Processing

Stemming

• Identify morphological variants, creating “classes”
  – system, systems
  – forget, forgetting, forgetful
  – analyse, analysis, analytical, analysing
• Replace each term by the class representative (root or most common variant)

Page 24: The Use of Text Retrieval and Natural Language Processing

TR and NLP Models

• The TR/NLP model indexes the information in the corpus for fast retrieval
• Different TR/NLP models represent the same corpus differently and can lead to different search results
• Most popular TR and NLP models used in SE:
  – Vector Space Model (VSM)
  – Latent Semantic Indexing (LSI)
  – Latent Dirichlet Allocation (LDA)
  – Okapi BM25 and BM25F
  – Language Models
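As an illustration of one of these models, a Vector Space Model can be sketched with TF-IDF weights and cosine similarity. The slides do not specify a weighting scheme, so the formula `tf * (ln(N/df) + 1)` and the names `tfidf_index`, `cosine`, and `rank` are assumptions made for this example. Documents and queries are assumed to be already preprocessed into token lists.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Build one sparse TF-IDF vector per document, plus the IDF table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, top_n=10):
    """Return (document index, score) pairs, highest score first."""
    vectors, idf = tfidf_index(docs)
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scored = [(i, cosine(q_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]


# Toy corpus of three "methods", each reduced to a list of terms.
corpus = [["copy", "paste", "clipboard"], ["open", "file"], ["paste", "text"]]
for doc_id, score in rank(["copy", "paste"], corpus):
    print(doc_id, round(score, 3))
```

Swapping in a different model such as LSI or BM25 would change how the vectors or scores are computed, but the rank-and-cut step stays the same.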

Page 25: The Use of Text Retrieval and Natural Language Processing

Query Formulation

• A query is formulated that captures the information need of the developer
  – can be manually formulated by the developer (e.g., “copy paste” for finding the classes that implement the copy-paste feature in an editor)
  – can be automatically formulated based on a software artifact or written information need (e.g., extract a query directly from a bug report written by a user or another developer)
• The query is then preprocessed using the same approach used on the corpus

Page 26: The Use of Text Retrieval and Natural Language Processing

Simple Query Improvements

• Spell checking -> correct words
• Compare with software vocabulary
  – remove words that do not appear in the software system
  – use software thesaurus to suggest alternative words (i.e., synonyms)

Page 27: The Use of Text Retrieval and Natural Language Processing

Query Reformulation

• How can we reformulate a bad query?
  – Thesaurus expansion:
    • Suggest terms similar to query terms
  – Relevance feedback:
    • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
  – More advanced: automatic, based on query properties, mining terms from source code, etc.

Page 28: The Use of Text Retrieval and Natural Language Processing

Evaluation

• For a given query, produce the ranked list of documents.
• Determine a threshold and cut the ranked list such that only the results up to the threshold are considered as retrieved.
• Mark each document in the top results (up to the threshold) that is relevant according to the gold standard.
• Note: different thresholds on the ranked list produce different sets of retrieved documents.

[Figure: a ranked list running from the top down to the threshold]

Page 29: The Use of Text Retrieval and Natural Language Processing

Ranked List Thresholds

• Fixed:
  – e.g., keep the first 10 results.
• Score threshold:
  – e.g., keep results with score in the top 5% of all scores.
• Gap threshold:
  – traverse the ranked list (from highest to lowest score)
  – find the widest gap between adjacent scores
  – the score immediately prior to the gap becomes the threshold to cut the list
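The gap threshold is the least obvious of the three, so here is a minimal sketch of it. The function name `cut_at_widest_gap` is hypothetical, and the choice to cut at the first widest gap when several gaps tie is an assumption the slides do not address.

```python
def cut_at_widest_gap(scores):
    """Keep the scores that come before the widest gap in a descending list."""
    if len(scores) < 2:
        return list(scores)
    # gaps[i] is the drop between the score at rank i and the one at rank i + 1.
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    # Index of the score immediately prior to the widest gap (first one on ties).
    last_kept = gaps.index(max(gaps))
    return list(scores[: last_kept + 1])


ranked_scores = [0.9, 0.85, 0.8, 0.4, 0.35]
print(cut_at_widest_gap(ranked_scores))  # the widest drop is between 0.8 and 0.4
```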

Page 30: The Use of Text Retrieval and Natural Language Processing

Precision and Recall

\[ \text{recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}} \]

\[ \text{precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} \]

[Figure: the entire document collection, containing the set of relevant documents and the set of retrieved documents]

            | retrieved              | not retrieved
relevant    | retrieved & relevant   | not retrieved but relevant
irrelevant  | retrieved & irrelevant | not retrieved & irrelevant
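The two definitions translate directly into set operations. In this sketch the function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant in the gold standard, two in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(p, r)
```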

Page 31: The Use of Text Retrieval and Natural Language Processing

Trade-off Between Recall and Precision

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). Annotations: "The ideal"; "Returns relevant documents but misses many useful ones too"; "Returns most relevant documents but includes lots of junk".]

Page 32: The Use of Text Retrieval and Natural Language Processing

F-Measure

• The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall
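The formula itself did not survive in the transcript. Written out with the precision and recall defined earlier, the harmonic mean takes this standard form:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```

For example, a precision of 0.5 and a recall of 1.0 give F1 = 2 · 0.5 / 1.5 ≈ 0.67, which is lower than their arithmetic mean of 0.75 because the harmonic mean is pulled toward the smaller value.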

Page 33: The Use of Text Retrieval and Natural Language Processing

Precision and Recall: the Holy Grail

• Precision and recall do not tell the entire story

[Figure: two ranked lists, each labeled "Top" and "Good results"]

• Average precision:
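The formula that followed this bullet is missing from the transcript. A common definition, assumed here, averages the precision measured at each rank where a relevant document appears, and divides by the total number of relevant documents. The function name `average_precision` is hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision-at-rank over the ranks holding a relevant document."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / position  # precision of the list cut at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


relevant_docs = {"a", "b"}
# Both lists retrieve the same documents, so precision and recall are identical,
# but the first list places the good results at the top.
print(average_precision(["a", "b", "x", "y"], relevant_docs))
print(average_precision(["x", "y", "a", "b"], relevant_docs))
```

Unlike precision and recall computed at a single threshold, this measure rewards a list for placing its good results near the top.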

