The Use of Text Retrieval and Natural Language Processing in Software Engineering

Sonia Haiduc, Assistant Professor
Department of Computer Science, Florida State University
Sonia Haiduc
• Academic background: B.Sc. in Romania; M.S. and Ph.D. in Detroit, USA
• Currently: Florida State University, Tallahassee, FL
SERENE Lab
• Sonia Haiduc, Assistant Professor
• Ph.D. students: Javier Escobar Avila, Chris Mills, Esteban Parra Rodriguez
Main Research Interests
• Software maintenance and evolution
• Program comprehension
• Source code analysis
• Mining software repositories
• Developer performance and efficiency
Our Goal
• Help software developers to build and maintain software faster and better
• We often leverage techniques from outside SE:
– Information Retrieval
– Natural Language Processing
– Machine Learning
Textual Information in Software
• Captures concepts of the problem domain, developer intentions, developer communication, etc.
• Found in many software artifacts:
– Requirements
– Design documents
– Source code (identifiers, comments)
– Commit notes
– Documentation
– User manuals
– Q/A websites (Stack Overflow, etc.)
– Developer communication (emails, chat, tweets, etc.)
– …
Text Retrieval
• Information Retrieval (IR): the process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
• Text Retrieval (TR): a branch of IR where the information is in text format
– Search space: collection of documents (corpus)
– Document: a generic term for an information unit
• book, chapter, article, web page, etc.
• class, method, interface, etc.
• individual requirement, bug description, test case, e-mail, design diagram, etc.
Natural Language Processing
• Refers to the use and ability of systems to process sentences in a natural language such as English (rather than in a specialized, artificial computer language such as C++)
• Combines techniques from computer science, artificial intelligence, computational linguistics, probability and statistics
TR and NLP in Software Engineering
• Applied to over 30 different SE tasks:
o Traceability link recovery
o Feature/concept/concern/bug location
o Code reuse
o Bug triage
o Program comprehension
o Architecture/design recovery
o Quality assessment and measurement
o Software evolution analysis
o Defect prediction and debugging
o Automatic documentation
o Testing
o Requirements analysis
o Restructuring/refactoring
o Software categorization
o Licensing analysis
o Impact analysis
o Clone detection
o Effort prediction/estimation
o Domain analysis
o Web services discovery
o Use case analysis
o Team management, etc.
Using TR and NLP for Retrieving Software Artifacts
• Formulate the SE task as a retrieval problem and find the software artifacts that satisfy a particular information need
• Some examples:
– Bug Location: retrieve all methods in the code relevant for a particular bug report
– Bug Report De-duplication: find all bug reports that already exist and are similar to a new bug report, in order to prevent duplication
– Bug Triage: given a new bug report, find the solved bug report that is most similar to the new one and assign it to the same developer
– Feature Location: find the classes in the code that implement a particular feature or concept
– Code Reuse: retrieve pieces of code or entire systems online that implement a particular functionality
– Clone and plagiarism detection: given a piece of code (e.g., a method), find similar pieces of code and mark them as potential clones
– Defect Prediction: given a method or class, estimate the number of defects it contains by extrapolating from similar artifacts for which the number of defects is known
– Impact Analysis: when changing a method, determine other methods that may be impacted by the change by finding the methods similar to it
Retrieving Relevant Software Artifacts
[Figure: the Query and the collection of Software Artifacts are the INPUT to a TR/NLP Model, which returns the Relevant Artifacts]
Steps
1. Create and preprocess corpus using light NLP
2. Index corpus – choose the TR model
3. Formulate a query
• Manual or automatic
4. Compute similarities between the query and the documents in the corpus (i.e., relevance)
5. Rank the documents based on the similarities
6. Return the top N documents as the result list
7. Inspect the results
8. GOTO 3 if needed, or STOP
Creating and Preprocessing a Software Corpus
• Parsing software artifacts and extracting documents
– corpus = a collection of documents (e.g., methods)
• Text normalization (whitespace and non-textual token removal, tokenization)
• Splitting: split_identifiers, SplitIdentifiers, etc.
• Stop-word removal
– common words in English, programming language keywords, project-specific words, etc.
• Abbreviation and acronym expansion
• Stemming
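The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration: the stop-word set, the regular expressions, and the suffix-stripping stemmer are toy simplifications; real pipelines use curated stop-word lists and a proper stemmer such as Porter's.

```python
import re

# A tiny stop-word set mixing common English words and Java keywords (illustrative)
STOP_WORDS = {"the", "a", "of", "to", "if", "else", "public", "void", "int"}

def stem(word):
    # Toy stemmer: strip a few common suffixes when enough of the word remains
    if word.endswith("ing") and len(word) > 6:
        return word[:-3]
    if word.endswith("ed") and len(word) > 5:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Normalize, split identifiers, lowercase, drop stop words, stem."""
    words = []
    for token in re.findall(r"[A-Za-z_]+", text):        # normalization
        for chunk in token.split("_"):                   # underscore splitting
            words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+", chunk)
    words = [w.lower() for w in words]                   # case folding
    words = [w for w in words if w not in STOP_WORDS]    # stop-word removal
    return [stem(w) for w in words]                      # stemming

print(preprocess("public void processQueryString(int queryId)"))
# ['process', 'query', 'string', 'query', 'id']
```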
Extracting Documents
• Documents can be of different types and granularities (e.g., methods, classes, files, emails, paragraphs, bug descriptions, etc.)
Transform Source Code to Plain Text

(Example: a Java method after punctuation and other non-textual symbols are stripped)

public void run IProgressMonitor monitor throws
InvocationTargetException InterruptedException if m_iFlag
processCorpus monitor checkUpdate else if m_iFlag
processCorpus monitor UD_UPDATECORPUS else
processQueryString monitor
Text Normalization
• Remove whitespace and non-textual characters
• Break up the text into meaningful “tokens” and keep only what makes sense
• Pay attention to:
– Numbers: “P450”, “2001”
– Hyphenation: “MS-DOS”, “R2-D2”
– Punctuation: “John’s”, “command.com”
– Case: “us”, “US”
– Phrases: “venetian blind”
Splitting
• Splitting: decomposing identifiers into their compound words
• Identifiers may use division markers (e.g., camelCase and _), or may not
• Examples:
– getName -> ‘get’, ‘Name’
– getMAXstring -> ‘get’, ‘MAX’, ‘string’
– ASTNode -> ‘AST’, ‘Node’
– account_number -> ‘account’, ‘number’
– simpletypename -> ‘simple’, ‘type’, ‘name’
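A minimal splitter for the marker-based cases can be written with a regular expression. This is only a sketch: markerless identifiers like simpletypename, and mixed cases like getMAXstring, require dictionary-based or other more advanced techniques.

```python
import re

# Match an acronym followed by a Capitalized word (e.g., "AST" in "ASTNode"),
# or an optionally-capitalized lowercase run, or an uppercase run, or digits.
CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(identifier):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(CAMEL.findall(chunk))
    return parts

print(split_identifier("getName"))         # ['get', 'Name']
print(split_identifier("ASTNode"))         # ['AST', 'Node']
print(split_identifier("account_number"))  # ['account', 'number']
```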
Stop Words
• Very frequent words, with no power of discrimination (e.g., programming language keywords, common English terms)
• Typically function words, not indicative of content (e.g., “the”, “class”)
• The stop-word set depends on the document collection and on the application
Abbreviation and Acronym Expansion
• Expand abbreviations and acronyms to the corresponding full words
• Examples:
– mess -> ‘message’
– src -> ‘source’
– auth -> ‘authenticate’ OR ‘author’?
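Expansion is often driven by a dictionary mapping short forms to candidate full words. The entries below are illustrative only; the ambiguous auth case shows why context is needed to choose among candidates.

```python
# Hand-made abbreviation dictionary (illustrative; real approaches mine
# candidate expansions from the source code and its documentation).
ABBREVIATIONS = {
    "mess": ["message"],
    "src": ["source"],
    "auth": ["authenticate", "author"],  # ambiguous: context must disambiguate
}

def expand(token):
    """Return the candidate expansions for a token, or the token itself."""
    return ABBREVIATIONS.get(token, [token])

print(expand("src"))   # ['source']
print(expand("auth"))  # ['authenticate', 'author']
```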
Stemming
• Identify morphological variants, creating “classes”
– system, systems
– forget, forgetting, forgetful
– analyse, analysis, analytical, analysing
• Replace each term by the class representative (root or most common variant)
TR and NLP Models
• The TR/NLP model indexes the information in the corpus for fast retrieval
• Different TR/NLP models represent the same corpus differently and can lead to different search results
• Most popular TR and NLP models used in SE:
– Vector Space Model (VSM)
– Latent Semantic Indexing (LSI)
– Latent Dirichlet Allocation (LDA)
– Okapi BM25 and BM25F
– Language Models
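As an illustration of the simplest of these models, the Vector Space Model represents the query and each document as a term vector and ranks documents by cosine similarity. The corpus below is hypothetical, and plain term frequencies are used instead of the usual tf-idf weighting.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(w * b[t] for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical corpus: method names mapped to their preprocessed text
corpus = {
    "Editor.copy":  "copy selected text to clipboard".split(),
    "Editor.paste": "paste clipboard text at cursor".split(),
    "Parser.parse": "parse the input file into a tree".split(),
}
query = Counter("copy paste".split())

ranked = sorted(corpus,
                key=lambda d: cosine(query, Counter(corpus[d])),
                reverse=True)
print(ranked)  # the two Editor methods rank above Parser.parse
```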
Query Formulation
• A query is formulated that captures the information need of the developer
– can be manually formulated by the developer (e.g., “copy paste” – for finding the classes that implement the copy-paste feature in an editor)
– can be automatically formulated based on a software artifact or written information need (e.g., extract a query directly from a bug report written by a user or another developer)
• The query is then preprocessed using the same approach used on the corpus
Simple Query Improvements
• Spell checking -> correct words
• Compare with the software vocabulary
– remove words that do not appear in the software system
– use a software thesaurus to suggest alternative words (i.e., synonyms)
Query Reformulation
• How can we reformulate a bad query?
– Thesaurus expansion:
• Suggest terms similar to the query terms
– Relevance feedback:
• Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
– More advanced: automatic, based on query properties, mining terms from source code, etc.
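Relevance feedback is commonly implemented with a Rocchio-style update, which moves the query vector toward the centroid of the documents the developer judged relevant. The alpha and beta weights below are conventional defaults, not values from the slides.

```python
from collections import Counter

def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style relevance feedback: keep the original query terms
    (weighted by alpha) and add terms from the centroid of the
    judged-relevant documents (weighted by beta)."""
    new_query = Counter()
    for term, weight in Counter(query).items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in Counter(doc).items():
            new_query[term] += beta * weight / len(relevant_docs)
    return new_query

query = "copy paste".split()
relevant = ["copy selection to system clipboard".split()]  # judged relevant
expanded = rocchio(query, relevant)
print(expanded["copy"], expanded["clipboard"])  # 1.75 0.75
```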
Evaluation
• For a given query, produce the ranked list of documents.
• Determine a threshold and cut the ranked list such that only the results up to the threshold are considered as retrieved.
• Mark each document in the top results (up to the threshold) that is relevant according to the gold standard.
• Note: different thresholds on the ranked list produce different sets of retrieved documents.
[Figure: a ranked list cut by a threshold; only the top results count as retrieved]
Ranked List Thresholds
• Fixed:
– e.g., keep the first 10 results.
• Score threshold:
– e.g., keep results with scores in the top 5% of all scores.
• Gap threshold:
– traverse the ranked list (from highest to lowest score)
– find the widest gap between adjacent scores
– the score immediately prior to the gap becomes the threshold to cut the list
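The gap threshold described above can be sketched as follows (the scores are made-up similarity values):

```python
def gap_threshold(scores):
    """Cut a descending ranked list at the widest gap between adjacent
    scores; returns the number of results to keep."""
    if len(scores) < 2:
        return len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return widest + 1  # keep everything before the widest gap

scores = [0.91, 0.88, 0.85, 0.40, 0.38, 0.35]
print(gap_threshold(scores))  # 3: the widest gap is between 0.85 and 0.40
```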
Precision and Recall

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)
[Figure: Venn diagram over the entire document collection showing the retrieved and relevant document sets and their four regions: retrieved & relevant, relevant but not retrieved, retrieved & irrelevant, not retrieved & irrelevant]
Trade-off Between Recall and Precision
[Figure: precision vs. recall, both from 0 to 1; the ideal is high precision at high recall. A system with high precision but low recall returns relevant documents but misses many useful ones; one with high recall but low precision returns most relevant documents but includes lots of junk]
F-Measure
• The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:
F1 = 2 · (precision · recall) / (precision + recall)
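Precision, recall, and F1 can be computed directly from the sets of retrieved and relevant documents (the document ids below are made up):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                       # relevant & retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(retrieved={"d1", "d2", "d3", "d4"},
                               relevant={"d1", "d3", "d5"})
print(p, r, f1)  # precision 2/4, recall 2/3, F1 = 4/7
```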
Precision and Recall: the Holy Grail
• Precision and recall do not tell the entire story: two result lists can have identical precision and recall, while one places the good results at the top and the other buries them further down
[Figure: two ranked lists with the same precision and recall but the good results at different ranks]
• Average precision: the average of the precision values computed at each rank where a relevant document is retrieved, rewarding rankings that place relevant documents near the top
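Average precision can be sketched as follows. The example shows two result lists that retrieve the same documents (hence equal precision and recall at the full cutoff) but rank them differently; the document ids are made up.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant
    document appears in the ranked result list."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)   # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant documents ("r1", "r2") at the top vs. at the bottom of the list
print(average_precision(["r1", "r2", "x", "y"], {"r1", "r2"}))  # 1.0
print(average_precision(["x", "y", "r1", "r2"], {"r1", "r2"}))  # 5/12 ≈ 0.42
```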