The Use of Text Retrieval and Natural Language Processing in Software Engineering

Sonia Haiduc, Assistant Professor
Department of Computer Science, Florida State University
Sonia Haiduc
• Academic background: B.Sc. in Romania; M.S. and Ph.D. in Detroit, USA
• Currently: Florida State University, Tallahassee, FL
SERENE Lab
• Sonia Haiduc, Assistant Professor
• Ph.D. students: Javier Escobar Avila, Chris Mills, Esteban Parra Rodriguez
Main Research Interests
• Software maintenance and evolution
• Program comprehension
• Source code analysis
• Mining software repositories
• Developer performance and efficiency
Our Goal
• Help software developers to build and maintain software faster and better
• We often leverage techniques from outside SE:
– Information Retrieval
– Natural Language Processing
– Machine Learning
Textual Information in Software
• Captures concepts of the problem domain, developer intentions, developer communication, etc.
• Found in many software artifacts:
– Requirements
– Design documents
– Source code (identifiers, comments)
– Commit notes
– Documentation
– User manuals
– Q/A websites (Stack Overflow, etc.)
– Developer communication (emails, chat, tweets, etc.)
– …
Text Retrieval
• Information Retrieval (IR): the process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
• Text Retrieval (TR): a branch of IR where the information is in text format
– Search space: collection of documents (corpus)
– Document: a generic term for an information unit
• book, chapter, article, web page, etc.
• class, method, interface, etc.
• individual requirement, bug description, test case, e-mail, design diagram, etc.
Natural Language Processing
• Refers to the use and ability of systems to process sentences in a natural language such as English (rather than in a specialized, artificial computer language such as C++)
• Combines techniques from computer science, artificial intelligence, computational linguistics, probability and statistics
TR and NLP in Software Engineering
• Applied to over 30 different SE tasks:
o Traceability link recovery
o Feature/concept/concern/bug location
o Code reuse
o Bug triage
o Program comprehension
o Architecture/design recovery
o Quality assessment and measurement
o Software evolution analysis
o Defect prediction and debugging
o Automatic documentation
o Testing
o Requirements analysis
o Restructuring/refactoring
o Software categorization
o Licensing analysis
o Impact analysis
o Clone detection
o Effort prediction/estimation
o Domain analysis
o Web services discovery
o Use case analysis
o Team management, etc.
Using TR and NLP for Retrieving Software Artifacts
• Formulate the SE task as a retrieval problem and find the software artifacts that satisfy a particular information need
• Some examples:
– Bug Location: retrieve all methods in the code relevant for a particular bug report
– Bug Report De-duplication: find all bug reports that already exist and are similar to a new bug report, in order to prevent duplication
– Bug Triage: given a new bug report, find the solved bug report that is most similar to the new one and assign it to the same developer
– Feature Location: find the classes in the code that implement a particular feature or concept
– Code Reuse: retrieve pieces of code or entire systems online that implement a particular functionality
– Clone and plagiarism detection: given a piece of code (e.g., a method), find similar pieces of code and mark them as potential clones
– Defect Prediction: given a method or class, estimate the number of defects it contains by extrapolating from similar artifacts for which the number of defects is known
– Impact Analysis: when changing a method, determine other methods that may be impacted by the change by finding the methods similar to it
Retrieving Relevant Software Artifacts
[Figure: the Query and the collection of Software Artifacts are the INPUT to a TR/NLP Model, which returns the Relevant Artifacts]
Steps
1. Create and preprocess corpus using light NLP
2. Index corpus – choose the TR model
3. Formulate a query
• Manual or automatic
4. Compute similarities between the query and the documents in the corpus (i.e., relevance)
5. Rank the documents based on the similarities
6. Return the top N documents as the result list
7. Inspect the results
8. GOTO 3 if needed, or STOP
Creating and Preprocessing a Software Corpus
• Parsing software artifacts and extracting documents
– corpus = a collection of documents (e.g., methods)
• Text normalization (whitespace and non-textual token removal, tokenization)
• Splitting: split_identifiers, SplitIdentifiers, etc.
• Stop-word removal
– common words in English, programming language keywords, project-specific words, etc.
• Abbreviation and acronym expansion
• Stemming
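The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration: the stop-word set, the regular expressions, and the suffix-stripping stemmer are toy simplifications; real pipelines use curated stop-word lists and a proper stemmer such as Porter's.

```python
import re

# A tiny stop-word set mixing common English words and Java keywords (illustrative)
STOP_WORDS = {"the", "a", "of", "to", "if", "else", "public", "void", "int"}

def stem(word):
    # Toy stemmer: strip a few common suffixes when enough of the word remains
    if word.endswith("ing") and len(word) > 6:
        return word[:-3]
    if word.endswith("ed") and len(word) > 5:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Normalize, split identifiers, lowercase, drop stop words, stem."""
    words = []
    for token in re.findall(r"[A-Za-z_]+", text):        # normalization
        for chunk in token.split("_"):                   # underscore splitting
            words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+", chunk)
    words = [w.lower() for w in words]                   # case folding
    words = [w for w in words if w not in STOP_WORDS]    # stop-word removal
    return [stem(w) for w in words]                      # stemming

print(preprocess("public void processQueryString(int queryId)"))
# ['process', 'query', 'string', 'query', 'id']
```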
Extracting Documents
• Documents can be of different types and granularities (e.g., methods, classes, files, emails, paragraphs, bug descriptions, etc.)
Transform Source Code to Plain Text

(Example: a Java method after punctuation and other non-textual symbols are stripped)

public void run IProgressMonitor monitor throws
InvocationTargetException InterruptedException if m_iFlag
processCorpus monitor checkUpdate else if m_iFlag
processCorpus monitor UD_UPDATECORPUS else
processQueryString monitor
Text Normalization
• Remove whitespace and non-textual characters
• Break up the text into meaningful “tokens” and keep only what makes sense
• Pay attention to:
– Numbers: “P450”, “2001”
– Hyphenation: “MS-DOS”, “R2-D2”
– Punctuation: “John’s”, “command.com”
– Case: “us”, “US”
– Phrases: “venetian blind”
Splitting
• Splitting: decomposing identifiers into their compound words
• Identifiers may use division markers (e.g., camelCase and _), or may not
• Examples:
– getName -> ‘get’, ‘Name’
– getMAXstring -> ‘get’, ‘MAX’, ‘string’
– ASTNode -> ‘AST’, ‘Node’
– account_number -> ‘account’, ‘number’
– simpletypename -> ‘simple’, ‘type’, ‘name’
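A minimal splitter for the marker-based cases can be written with a regular expression. This is only a sketch: markerless identifiers like simpletypename, and mixed cases like getMAXstring, require dictionary-based or other more advanced techniques.

```python
import re

# Match an acronym followed by a Capitalized word (e.g., "AST" in "ASTNode"),
# or an optionally-capitalized lowercase run, or an uppercase run, or digits.
CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(identifier):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(CAMEL.findall(chunk))
    return parts

print(split_identifier("getName"))         # ['get', 'Name']
print(split_identifier("ASTNode"))         # ['AST', 'Node']
print(split_identifier("account_number"))  # ['account', 'number']
```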
Stop Words
• Very frequent words, with no power of discrimination (e.g., programming language keywords, common English terms)
• Typically function words, not indicative of content (e.g., “the”, “class”)
• The stop-word set depends on the document collection and on the application
Abbreviation and Acronym Expansion
• Expand abbreviations and acronyms to the corresponding full words
• Examples:
– mess -> ‘message’
– src -> ‘source’
– auth -> ‘authenticate’ OR ‘author’?
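Expansion is often driven by a dictionary mapping short forms to candidate full words. The entries below are illustrative only; the ambiguous auth case shows why context is needed to choose among candidates.

```python
# Hand-made abbreviation dictionary (illustrative; real approaches mine
# candidate expansions from the source code and its documentation).
ABBREVIATIONS = {
    "mess": ["message"],
    "src": ["source"],
    "auth": ["authenticate", "author"],  # ambiguous: context must disambiguate
}

def expand(token):
    """Return the candidate expansions for a token, or the token itself."""
    return ABBREVIATIONS.get(token, [token])

print(expand("src"))   # ['source']
print(expand("auth"))  # ['authenticate', 'author']
```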
Stemming
• Identify morphological variants, creating “classes”
– system, systems
– forget, forgetting, forgetful
– analyse, analysis, analytical, analysing
• Replace each term by the class representative (root or most common variant)
TR and NLP Models
• The TR/NLP model indexes the information in the corpus for fast retrieval
• Different TR/NLP models represent the same corpus differently and can lead to different search results
• Most popular TR and NLP models used in SE:
– Vector Space Model (VSM)
– Latent Semantic Indexing (LSI)
– Latent Dirichlet Allocation (LDA)
– Okapi BM25 and BM25F
– Language Models
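As an illustration of the simplest of these models, the Vector Space Model represents the query and each document as a term vector and ranks documents by cosine similarity. The corpus below is hypothetical, and plain term frequencies are used instead of the usual tf-idf weighting.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(w * b[t] for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical corpus: method names mapped to their preprocessed text
corpus = {
    "Editor.copy":  "copy selected text to clipboard".split(),
    "Editor.paste": "paste clipboard text at cursor".split(),
    "Parser.parse": "parse the input file into a tree".split(),
}
query = Counter("copy paste".split())

ranked = sorted(corpus,
                key=lambda d: cosine(query, Counter(corpus[d])),
                reverse=True)
print(ranked)  # the two Editor methods rank above Parser.parse
```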
Query Formulation
• A query is formulated that captures the information need of the developer
– can be manually formulated by the developer (e.g., “copy paste” – for finding the classes that implement the copy-paste feature in an editor)
– can be automatically formulated based on a software artifact or written information need (e.g., extract a query directly from a bug report written by a user or another developer)
• The query is then preprocessed using the same approach used on the corpus
Simple Query Improvements
• Spell checking -> correct words
• Compare with the software vocabulary
– remove words that do not appear in the software system
– use a software thesaurus to suggest alternative words (i.e., synonyms)
Query Reformulation
• How can we reformulate a bad query?
– Thesaurus expansion:
• Suggest terms similar to the query terms
– Relevance feedback:
• Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
– More advanced: automatic, based on query properties, mining terms from source code, etc.
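Relevance feedback is commonly implemented with a Rocchio-style update, which moves the query vector toward the centroid of the documents the developer judged relevant. The alpha and beta weights below are conventional defaults, not values from the slides.

```python
from collections import Counter

def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style relevance feedback: keep the original query terms
    (weighted by alpha) and add terms from the centroid of the
    judged-relevant documents (weighted by beta)."""
    new_query = Counter()
    for term, weight in Counter(query).items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in Counter(doc).items():
            new_query[term] += beta * weight / len(relevant_docs)
    return new_query

query = "copy paste".split()
relevant = ["copy selection to system clipboard".split()]  # judged relevant
expanded = rocchio(query, relevant)
print(expanded["copy"], expanded["clipboard"])  # 1.75 0.75
```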
Evaluation
• For a given query, produce the ranked list of documents.
• Determine a threshold and cut the ranked list such that only the results up to the threshold are considered as retrieved.
• Mark each document in the top results (up to the threshold) that is relevant according to the gold standard.
• Note: different thresholds on the ranked list produce different sets of retrieved documents.
[Figure: a ranked list cut by a threshold; only the top results count as retrieved]
Ranked List Thresholds
• Fixed:
– e.g., keep the first 10 results.
• Score threshold:
– e.g., keep results with scores in the top 5% of all scores.
• Gap threshold:
– traverse the ranked list (from highest to lowest score)
– find the widest gap between adjacent scores
– the score immediately prior to the gap becomes the threshold to cut the list
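The gap threshold described above can be sketched as follows (the scores are made-up similarity values):

```python
def gap_threshold(scores):
    """Cut a descending ranked list at the widest gap between adjacent
    scores; returns the number of results to keep."""
    if len(scores) < 2:
        return len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return widest + 1  # keep everything before the widest gap

scores = [0.91, 0.88, 0.85, 0.40, 0.38, 0.35]
print(gap_threshold(scores))  # 3: the widest gap is between 0.85 and 0.40
```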
Precision and Recall

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)
[Figure: Venn diagram over the entire document collection showing the retrieved and relevant document sets and their four regions: retrieved & relevant, relevant but not retrieved, retrieved & irrelevant, not retrieved & irrelevant]
Trade-off Between Recall and Precision
[Figure: precision vs. recall, both from 0 to 1; the ideal is high precision at high recall. A system with high precision but low recall returns relevant documents but misses many useful ones; one with high recall but low precision returns most relevant documents but includes lots of junk]
F-Measure
• The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:
F1 = 2 · (precision · recall) / (precision + recall)
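Precision, recall, and F1 can be computed directly from the sets of retrieved and relevant documents (the document ids below are made up):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                       # relevant & retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(retrieved={"d1", "d2", "d3", "d4"},
                               relevant={"d1", "d3", "d5"})
print(p, r, f1)  # precision 2/4, recall 2/3, F1 = 4/7
```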
Precision and Recall: the Holy Grail
• Precision and recall do not tell the entire story: two result lists can have identical precision and recall, while one places the good results at the top and the other buries them further down
[Figure: two ranked lists with the same precision and recall but the good results at different ranks]
• Average precision: the average of the precision values computed at each rank where a relevant document is retrieved, rewarding rankings that place relevant documents near the top
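Average precision can be sketched as follows. The example shows two result lists that retrieve the same documents (hence equal precision and recall at the full cutoff) but rank them differently; the document ids are made up.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant
    document appears in the ranked result list."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)   # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant documents ("r1", "r2") at the top vs. at the bottom of the list
print(average_precision(["r1", "r2", "x", "y"], {"r1", "r2"}))  # 1.0
print(average_precision(["x", "y", "r1", "r2"], {"r1", "r2"}))  # 5/12 ≈ 0.42
```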