The Use of Text Retrieval and Natural Language Processing in Software Engineering

Sonia Haiduc, Assistant Professor

Department of Computer Science, Florida State University

Sonia Haiduc

• Academic background:
– B.Sc. in Romania
– M.S., Ph.D. in Detroit, USA

Tallahassee, FL

Florida State University

SERENE Lab

• Sonia Haiduc, Assistant Professor

• Ph.D. Students: Javier Escobar Avila, Chris Mills, Esteban Parra Rodriguez

Main research interests

• Software maintenance and evolution

• Program comprehension

• Source code analysis

• Mining software repositories

• Developer performance and efficiency

Our goal

• Help software developers to build and maintain software faster and better

• We often leverage techniques from outside SE
– Information Retrieval
– Natural Language Processing
– Machine Learning


Textual Information in Software

• Captures concepts of the problem domain, developer intentions, developer communication, etc.

• Found in many software artifacts:
– Requirements
– Design documents
– Source code (identifiers, comments)
– Commit notes
– Documentation
– User manuals
– Q/A websites (StackOverflow, etc.)
– Developer communication (emails, chat, tweets, etc.)
– …

Text Retrieval

• Information Retrieval (IR): the process of actively seeking out information relevant to a topic of interest (van Rijsbergen)

• Text Retrieval (TR): a branch of IR where the information is in text format
– Search space: collection of documents (corpus)
– Document: generic term for an information unit
  • book, chapter, article, webpage, etc.
  • class, method, interface, etc.
  • individual requirement, bug description, test case, e-mail, design diagram, etc.

Natural Language Processing

• Refers to the use and ability of systems to process sentences in a natural language such as English (rather than in a specialized, artificial computer language such as C++)

• Combines techniques from computer science, artificial intelligence, computational linguistics, probability and statistics

TR and NLP in Software Engineering

• Applied to over 30 different SE tasks
o Traceability link recovery
o Feature/concept/concern/bug location
o Code reuse
o Bug triage
o Program comprehension
o Architecture/design recovery
o Quality assessment and measurement
o Software evolution analysis
o Defect prediction and debugging
o Automatic documentation
o Testing
o Requirements analysis
o Restructuring/refactoring
o Software categorization
o Licensing analysis
o Impact analysis
o Clone detection
o Effort prediction/estimation
o Domain analysis
o Web services discovery
o Use case analysis
o Team management, etc.

Using TR and NLP for Retrieving Software Artifacts

• Formulate the SE task as a retrieval problem and find the software artifacts that satisfy a particular information need

• Some examples:
– Bug Location: retrieve all methods in the code relevant for a particular bug report;
– Bug Report De-duplication: find all bug reports that already exist and are similar to a new bug report, in order to prevent duplication;
– Bug Triage: given a new bug report, find the solved bug report that is most similar to the new one and assign it to the same developer;
– Feature Location: find the classes in the code that implement a particular feature or concept;
– Code Reuse: retrieve pieces of code or entire systems online that implement a particular functionality;
– Clone and plagiarism detection: given a piece of code (e.g., a method), find pieces of code similar to it and mark them as potential clones;
– Defect Prediction: given a method or class, estimate the number of defects it contains by extrapolating from similar artifacts for which the number of defects is known;
– Impact Analysis: when changing a method, determine other methods that may be impacted by the change by finding the methods similar to it.

Retrieving Relevant Software Artifacts

[Figure: Software Artifacts, TR/NLP Model, Relevant Artifacts]

Steps
1. Create and preprocess corpus using light NLP
2. Index corpus – choose the TR model
3. Formulate a query
   • Manual or automatic
4. Compute similarities between the query and the documents in the corpus (i.e., relevance)
5. Rank the documents based on the similarities
6. Return the top N documents as the result list
7. Inspect the results
8. GO TO 3 if needed or STOP

Creating and Preprocessing a Software Corpus

• Parsing software artifacts and extracting documents
– corpus = a collection of documents (e.g., methods)

• Text normalization (whitespace and non-textual tokens removal, tokenization)

• Splitting: split_identifiers, SplitIdentifiers, etc.

• Stop words removal
– common words in English, programming language keywords, project-specific words, etc.

• Abbreviation and acronym expansion

• Stemming

Extracting Documents

• Documents can be of different types and granularities (e.g., methods, classes, files, emails, paragraphs, bug descriptions, etc.)

Transform Source Code to Plain Text

public void run IProgressMonitor monitor throws InvocationTargetException InterruptedException if m_iFlag

processCorpus monitor checkUpdate else if m_iFlag processCorpus monitor UD_UPDATECORPUS else

processQueryString monitor

Text Normalization

• Remove whitespace and non-textual characters

• Break up the text into meaningful “tokens” and keep only what makes sense

• Pay attention to:
– Numbers: “P450”, “2001”
– Hyphenation: “MS-DOS”, “R2-D2”
– Punctuation: “John’s”, “command.com”
– Case: “us”, “US”
– Phrases: “venetian blind”
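The cases listed above are where a tokenizer's policy matters. The sketch below is illustrative only: it contrasts a naive letters-only tokenizer with a more careful one that keeps digits and inner hyphens, dots, and apostrophes. The function names and both regular expressions are assumptions chosen for this example, not a prescribed normalization scheme.

```python
import re

# Assumption: two hypothetical tokenization policies, for comparison only.
_LETTERS_ONLY = re.compile(r"[A-Za-z]+")
_KEEP_INNER_PUNCT = re.compile(r"[A-Za-z0-9]+(?:[-.'][A-Za-z0-9]+)*")

def naive_tokens(text):
    """Keep runs of letters only; digits and punctuation split or vanish."""
    return [t.lower() for t in _LETTERS_ONLY.findall(text)]

def careful_tokens(text):
    """Keep digits, and keep '-', '.', "'" when they sit between word characters."""
    return [t.lower() for t in _KEEP_INNER_PUNCT.findall(text)]

sample = "MS-DOS was released; see command.com (P450, 2001)."
print(naive_tokens(sample))
# ['ms', 'dos', 'was', 'released', 'see', 'command', 'com', 'p']
print(careful_tokens(sample))
# ['ms-dos', 'was', 'released', 'see', 'command.com', 'p450', '2001']
```

Lowercasing in both functions also merges “us” and “US”, which shows that case folding is a policy decision rather than a neutral cleanup step.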

Splitting

• Splitting: decomposing identifiers into their compound words

• Identifiers may use division markers (e.g., camelCase and _), or may not

• Examples:
– getName -> ‘get’, ‘Name’
– getMAXstring -> ‘get’, ‘MAX’, ‘string’
– ASTNode -> ‘AST’, ‘Node’
– account_number -> ‘account’, ‘number’
– simpletypename -> ‘simple’, ‘type’, ‘name’
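A minimal sketch of marker-based splitting, assuming only underscores and case changes are used as division markers. The function name and the regular expression are illustrative. The sketch handles getName, ASTNode, and account_number, but it shows why getMAXstring and simpletypename need more than markers (for example, a dictionary-based splitter, which is not shown here).

```python
import re

# Assumption: a common camelCase pattern; an uppercase run followed by a
# capitalized word is treated as an acronym (AST + Node).
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(identifier):
    """Split on underscores first, then on case changes within each part."""
    words = []
    for part in identifier.split("_"):
        words.extend(_CAMEL.findall(part))
    return words

for name in ["getName", "ASTNode", "account_number", "getMAXstring", "simpletypename"]:
    print(name, "->", split_identifier(name))
# getName -> ['get', 'Name']
# ASTNode -> ['AST', 'Node']
# account_number -> ['account', 'number']
# getMAXstring -> ['get', 'MA', 'Xstring']   (marker-based rule guesses wrong)
# simpletypename -> ['simpletypename']       (no markers to split on)
```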

Stop Words

• Very frequent words, with no power of discrimination (e.g., programming language keywords, common English terms)

• Typically function words, not indicative of content (e.g., “the”, “class”)

• The stop words set depends on the document collection and on the application
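As a sketch, stop word removal is a set-membership filter. The two word lists below are tiny illustrative samples (assumptions, not standard lists); a real application would assemble them per collection, as noted above.

```python
# Assumption: small sample lists for illustration only.
ENGLISH_STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}
JAVA_KEYWORDS = {"public", "void", "class", "if", "else", "return", "throws"}

def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["public", "void", "run", "the", "progress", "monitor"]
print(remove_stop_words(tokens, ENGLISH_STOP_WORDS | JAVA_KEYWORDS))
# ['run', 'progress', 'monitor']
```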

Abbreviation and Acronym Expansion

• Expand abbreviations and acronyms to the corresponding full words

• Examples:
– mess -> ‘message’
– src -> ‘source’
– auth -> ‘authenticate’ OR ‘author’?
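One simple way to sketch expansion is a lookup table plus a tie-breaking rule for ambiguous entries such as auth. The table, the function name, and the rule "prefer a candidate that already occurs in the surrounding text" are all assumptions made for this example.

```python
# Assumption: a hand-written expansion table; ambiguous entries list all candidates.
EXPANSIONS = {
    "mess": ["message"],
    "src": ["source"],
    "auth": ["authenticate", "author"],
}

def expand(token, context=()):
    """Expand a token, using context words to resolve ambiguous abbreviations."""
    candidates = EXPANSIONS.get(token.lower())
    if not candidates:
        return token              # not a known abbreviation
    if len(candidates) == 1:
        return candidates[0]      # unambiguous
    for candidate in candidates:
        if candidate in context:  # hypothetical heuristic: reuse a word seen nearby
            return candidate
    return token                  # ambiguous and unresolved: leave as-is

print(expand("src"))                               # source
print(expand("auth", context={"author", "name"}))  # author
print(expand("auth"))                              # auth
```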

Stemming

• Identify morphological variants, creating “classes”
– system, systems
– forget, forgetting, forgetful
– analyse, analysis, analytical, analysing

• Replace each term by the class representative (root or most common variant)
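The sketch below groups words into classes with a deliberately naive suffix stripper. It is an assumption-laden toy, not the Porter algorithm or any published stemmer. It handles the system and forget examples but not the analyse/analysis class, which needs real morphological rules.

```python
from collections import defaultdict

def naive_stem(word):
    """Strip one of a few suffixes; a toy rule set for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ful", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # After "-ing", undo consonant doubling (forgett -> forget).
            if (suffix == "ing" and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word

def stem_classes(words):
    """Map each stem (class representative) to the variants that reduce to it."""
    classes = defaultdict(list)
    for word in words:
        classes[naive_stem(word)].append(word)
    return dict(classes)

print(stem_classes(["system", "systems", "forget", "forgetting", "forgetful"]))
# {'system': ['system', 'systems'], 'forget': ['forget', 'forgetting', 'forgetful']}
```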

TR and NLP Models

• The TR/NLP model indexes the information in the corpus for fast retrieval

• Different TR/NLP models represent the same corpus differently and can lead to different search results

• Most Popular TR and NLP Models Used in SE:
– Vector Space Model (VSM)
– Latent Semantic Indexing (LSI)
– Latent Dirichlet Allocation (LDA)
– Okapi BM25 and BM25F
– Language Models
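To make the first model in the list concrete, here is a minimal Vector Space Model sketch: documents and the query become TF-IDF weighted vectors, and relevance is their cosine similarity. The corpus, the document names, and the specific weighting (raw term frequency times log(N/df) + 1) are assumptions for illustration; many other weighting schemes exist.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """TF-IDF vector as a dict; terms unknown to the corpus are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def build_index(corpus):
    """corpus: dict of document name -> list of preprocessed tokens."""
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))
    idf = {t: math.log(len(corpus) / df) + 1.0 for t, df in doc_freq.items()}
    return idf, {name: weigh(tokens, idf) for name, tokens in corpus.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def search(query_tokens, idf, vectors, top_n=10):
    """Rank documents by cosine similarity to the query; return the top N."""
    query = weigh(query_tokens, idf)
    ranked = sorted(((cosine(query, vec), name) for name, vec in vectors.items()),
                    reverse=True)
    return [(name, score) for score, name in ranked[:top_n]]

# Hypothetical corpus of three already-preprocessed methods.
corpus = {
    "Editor.copy": ["copy", "selection", "clipboard"],
    "Editor.paste": ["paste", "clipboard", "insert"],
    "Parser.parse": ["parse", "token", "stream"],
}
idf, vectors = build_index(corpus)
results = search(["copy", "paste"], idf, vectors)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

In this toy corpus the two editor methods tie for the top rank and the parser method scores zero, because it shares no terms with the query.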

Query Formulation

• A query is formulated that captures the information need of the developer
– can be manually formulated by the developer (e.g., “copy paste” for finding the classes that implement the copy-paste feature in an editor)
– can be automatically formulated based on a software artifact or written information need (e.g., extract a query directly from a bug report written by a user or another developer)

• The query is then preprocessed using the same approach used on the corpus

Simple Query Improvements

• Spell checking -> correct words

• Compare with software vocabulary
– remove words that do not appear in the software system
– use software thesaurus to suggest alternative words (i.e., synonyms)
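These improvements can be sketched with the standard library alone. Below, spelling correction uses `difflib.get_close_matches` against the system vocabulary, and a second function filters and suggests terms. The vocabulary, the thesaurus, and both function names are hypothetical examples.

```python
import difflib

def correct_spelling(term, vocabulary):
    """Return the closest vocabulary word, or the term unchanged if none is close."""
    matches = difflib.get_close_matches(term, sorted(vocabulary), n=1)
    return matches[0] if matches else term

def refine_query(query_tokens, vocabulary, thesaurus=None):
    """Keep terms found in the system; suggest in-vocabulary synonyms for the rest."""
    thesaurus = thesaurus or {}
    kept, suggestions = [], {}
    for term in query_tokens:
        if term in vocabulary:
            kept.append(term)
        alternatives = [s for s in thesaurus.get(term, []) if s in vocabulary]
        if alternatives:
            suggestions[term] = alternatives
    return kept, suggestions

# Hypothetical vocabulary and thesaurus for a small editor project.
vocabulary = {"copy", "paste", "clipboard"}
thesaurus = {"duplicate": ["copy", "clone"]}

print(correct_spelling("clipbord", vocabulary))  # clipboard
print(refine_query(["duplicate", "paste", "text"], vocabulary, thesaurus))
# (['paste'], {'duplicate': ['copy']})
```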

Query Reformulation

• How can we reformulate a bad query?
– Thesaurus expansion:
  • Suggest terms similar to query terms
– Relevance feedback:
  • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
– More advanced: automatic based on query properties, mining terms from source code, etc.

Evaluation

• For a given query, produce the ranked list of documents.

• Determine a threshold and cut the ranked list such that only the results up to the threshold are considered as retrieved.

• Mark each document in the top results (up to the threshold) that is relevant according to the gold standard.

• Note: different thresholds on the ranked list produce different sets of retrieved documents.

Ranked List Thresholds

• Fixed:
– e.g., keep the first 10 results.

• Score threshold:
– e.g., keep results with score in the top 5% of all scores.

• Gap threshold:
– traverse the ranked list (from highest to lowest score)
– find the widest gap between adjacent scores
– the score immediately prior to the gap becomes the threshold to cut the list
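A sketch of the three cut-off strategies, assuming the ranked list is a list of `(document, score)` pairs sorted from highest to lowest score. The score threshold is read here as "within the top 5% of the observed score range", which is only one possible interpretation of the slide. The function names and the sample scores are made up.

```python
def fixed_threshold(ranked, n=10):
    """Keep the first n results."""
    return ranked[:n]

def score_threshold(ranked, fraction=0.05):
    """Keep results whose score lies in the top `fraction` of the score range
    (an assumed interpretation of 'top 5% of all scores')."""
    if not ranked:
        return []
    scores = [score for _, score in ranked]
    high, low = max(scores), min(scores)
    cutoff = high - fraction * (high - low)
    return [(doc, score) for doc, score in ranked if score >= cutoff]

def gap_threshold(ranked):
    """Cut the list just before the widest gap between adjacent scores."""
    if len(ranked) < 2:
        return list(ranked)
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return ranked[: widest + 1]

# Hypothetical ranked list, highest score first.
ranked = [("a", 0.92), ("b", 0.90), ("c", 0.40), ("d", 0.35), ("e", 0.30)]
print(fixed_threshold(ranked, n=3))  # [('a', 0.92), ('b', 0.9), ('c', 0.4)]
print(score_threshold(ranked))       # [('a', 0.92), ('b', 0.9)]
print(gap_threshold(ranked))         # [('a', 0.92), ('b', 0.9)]
```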

Precision and Recall

$$\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}$$

$$\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}$$

[Figure: the entire document collection, divided into retrieved and not retrieved documents and into relevant and irrelevant documents, giving four regions: retrieved & relevant; not retrieved but relevant; retrieved & irrelevant; not retrieved & irrelevant]

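Trade-off Between Recall and Precision

[Figure: precision versus recall. The ideal: high precision and high recall. High precision with low recall: returns relevant documents but misses many useful ones too. High recall with low precision: returns most relevant documents but includes lots of junk.]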

F-Measure

• The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall
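Written out, the harmonic mean is F1 = 2 · precision · recall / (precision + recall). The sketch below computes all three measures for one query, given the retrieved set (after applying a threshold) and the gold-standard relevant set. The function name and the sample document names are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical example: 4 methods retrieved, 3 relevant in the gold standard.
p, r, f1 = precision_recall_f1(["m1", "m2", "m3", "m4"], {"m1", "m3", "m7"})
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# precision=0.500 recall=0.667 F1=0.571
```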

Precision and Recall: the Holy Grail

• Precision and recall do not tell the entire story

• Average precision:
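The slide's own formula is not preserved here. The sketch below assumes the commonly used definition of average precision: the mean, over all relevant documents, of the precision measured at the rank where each relevant document appears (relevant documents that are never retrieved contribute zero). The function name and the sample rankings are illustrative.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where relevant documents occur."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

# Two hypothetical result lists with identical precision and recall
# at a cut-off of 4, but very different ordering of the relevant items.
relevant = {"a", "b"}
good = ["a", "x", "b", "y"]
poor = ["x", "y", "a", "b"]
print(round(average_precision(good, relevant), 3))  # 0.833
print(round(average_precision(poor, relevant), 3))  # 0.417
```

Both lists retrieve the same two relevant documents among four results, so precision and recall are equal, yet average precision rewards the list that ranks them earlier.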
