26
3/17/09 1 Text Processing CISC489/689‐010, Lecture #3 Monday, Feb. 16 Ben CartereFe Indexing An index is a list of things (keys) with pointers to other things (items). Keywords catalog numbers ( shelves). Concepts page numbers. Terms documents. Need for indexes: Ease of use. Speed. Scalability.

Text Processing - ir.cis.udel.edu

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Text Processing - ir.cis.udel.edu

3/17/09

1

TextProcessing

CISC489/689‐010,Lecture#3Monday,Feb.16

BenCartereFe

Indexing

•  Anindexisalistofthings(keys)withpointerstootherthings(items).– Keywordscatalognumbers(shelves).– Conceptspagenumbers.– Termsdocuments.

•  Needforindexes:– Easeofuse.– Speed.– Scalability.

Page 2: Text Processing - ir.cis.udel.edu

3/17/09

2

Manualvs.AutomaVcIndexing

•  Manual:– An“expert”assignskeystoeachitem.

– Example:cardcatalog.

•  AutomaVc:– KeysautomaVcallyidenVfiedandassigned.– Example:Google.

•  AutomaVcasgoodasmanualformostpurposes.

TextProcessing

•  FirststepinautomaVcindexing.•  ConverVngdocumentsintoindex terms. 

•  Termsarenotjustwords.– Notallwordsareofequalvalueinasearch.– SomeVmesnotclearwherewordsbeginandend.

•  Especiallywhennotspace‐separated,e.g.Chinese,Korean.

– Matchingtheexactwordstypedbytheuserdoesn’tworkverywellintermsofeffecVveness.

Page 3: Text Processing - ir.cis.udel.edu

3/17/09

3

TextProcessingSteps

•  Foreachdocument:– Parseittolocatethepartsthatareimportant.

– Segmentandtokenizethetextintheimportantpartstogetwords.

– Removestop words.– Stemwordstocommonroots.

•  Advancedprocessingmayincludedphrases,enVtytagging,link‐graphfeatures,andmore.

Parsing

•  Somepartsofadocumentaremoreimportantthanothers.

•  Documentparserrecognizesstructureusingmarkup suchasHTMLtags.– Headers,anchortext,boldedtextarelikelytobeimportant.

–  JavaScript,styleinformaVon,navigaVonlinkslesslikelytobeimportant.

– Metadatacanalsobeimportant.

Page 4: Text Processing - ir.cis.udel.edu

3/17/09

4

ExampleWikipediaPage

WikipediaMarkup<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} ’’’Tropical fish’’’ include [[fish]] found in [[Tropics|

topical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ’’tropical fish’’ to refer only those requiring fresh water, with saltwater tropical fish referred to as ’’[[list of marine aquarium fish species|marine fish]]’’.

Page 5: Text Processing - ir.cis.udel.edu

3/17/09

5

WikipediaHTML

DocumentParsing

•  HTMLpagesorganizeintotrees.

<HTML>

<HEAD>

<TITLE> Tropicalfish

<META>

<BODY>

<H1> Tropicalfish

<P>

<B> Tropicalfish

<A> fish

<A> tropical

includefoundinenvironmentsaroundtheworld

Nodes contain blocks of text.

Page 6: Text Processing - ir.cis.udel.edu

3/17/09

6

EndResultofParsing

•  Blocksoftextfromimportantpartsofpage.– Tropicalfishincludefishfoundintropicalenvironmentsaroundtheworld,includingbothfreshwaterandsaltwaterspecies.Fishkeepersoienusetheterm“tropicalfish”toreferonlythoserequiringfreshwater,withsaltwatertropicalfishreferredtoas“marinefish”.

•  Nextstep:segmenVngandtokenizing.

Tokenizing

•  Formingwordsfromsequenceofcharactersinblocksoftext.

•  SurprisinglycomplexinEnglish,canbeharderinotherlanguages.

•  EarlyIRsystems:– Anysequenceofalphanumericcharactersoflength3ormore.

– Terminatedbyaspaceorotherspecialcharacter.

– Upper‐casechangedtolower‐case.

Page 7: Text Processing - ir.cis.udel.edu

3/17/09

7

Tokenizing

•  Example:– “Bigcorp's2007bi‐annualreportshowedprofitsrose10%.”becomes

– “bigcorp2007annualreportshowedprofitsrose”•  ToosimpleforsearchapplicaVonsorevenlarge‐scaleexperiments

•  Why?ToomuchinformaVonlost– SmalldecisionsintokenizingcanhavemajorimpactoneffecVvenessofsomequeries

TokenizingProblems•  Smallwordscanbeimportantinsomequeries,usuallyincombinaVons

•  xp,ma,pm,beneking,elpaso,masterp,gm,jlo,worldwarII

•  Bothhyphenatedandnon‐hyphenatedformsofmanywordsarecommon–  SomeVmeshyphenisnotneeded

•  e‐bay,wal‐mart,acVve‐x,cd‐rom,t‐shirts

– AtotherVmes,hyphensshouldbeconsideredeitheraspartofthewordorawordseparator

•  winston‐salem,mazdarx‐7,e‐cards,pre‐diabetes,t‐mobile,spanish‐speaking

Page 8: Text Processing - ir.cis.udel.edu

3/17/09

8

TokenizingProblems

•  Specialcharactersareanimportantpartoftags,URLs,codeindocuments

•  Capitalizedwordscanhavedifferentmeaningfromlowercasewords–  Bush,Apple

•  Apostrophescanbeapartofaword,apartofapossessive,orjustamistake–  rosieo'donnell,can't,don't,80's,1890's,men'sstrawhats,master'sdegree,england'stenlargestciVes,shriner's

TokenizingProblems

•  Numberscanbeimportant,includingdecimals– nokia3250,top10courses,united93,quickVme6.5pro,92.3thebeat,288358

•  Periodscanoccurinnumbers,abbreviaVons,URLs,endsofsentences,andothersituaVons–  I.B.M.,Ph.D.,cis.udel.edu

•  Note:tokenizingstepsforqueriesmustbeidenVcaltostepsfordocuments

Page 9: Text Processing - ir.cis.udel.edu

3/17/09

9

TokenizingProcess

•  Assumewehaveusedtheparsertofindblocksofimportanttext.

•  Awordmaybeanysequenceofalphanumericcharactersterminatedbyaspaceorspecialcharacter.–  everythingconvertedtolowercase.–  everythingindexed.

•  Defercomplexdecisionstoothercomponents–  example:92.3→923butsearchfindsdocumentswith92and3adjacent

–  incorporatesomerulestoreducedependenceonquerytransformaVoncomponents

EndResultofTokenizaVon

•  Listofwordsinblocksoftext.–  tropicalfishincludefishfoundintropicalenvironmentsaroundtheworldincludingbothfreshwaterandsaltwaterspeciesfishkeepersoienusethetermtropicalfishtoreferonlythoserequiringfreshwaterwithsaltwatertropicalfishreferredtoasmarinefish

•  Nextstep:stopping.•  Butfirst:textstaVsVcs.

Page 10: Text Processing - ir.cis.udel.edu

3/17/09

10

TextStaVsVcs

•  Hugevarietyofwordsusedintextbut•  ManystaVsVcalcharacterisVcsofwordoccurrencesarepredictable– e.g.,distribuVonofwordcounts

•  RetrievalmodelsandrankingalgorithmsdependheavilyonstaVsVcalproperVesofwords– e.g.,importantwordsoccuroienindocumentsbutarenothighfrequencyincollecVon

Zipf’sLaw•  DistribuVonofwordfrequenciesisveryskewed 

–  afewwordsoccurveryoien,manywordshardlyeveroccur

–  e.g.,twomostcommonwords(“the”,“of”)makeupabout10%ofallwordoccurrencesintextdocuments

•  Zipf’s“law”:–  observaVonthatrank(r)ofawordVmesitsfrequency(f)isapproximatelyaconstant(k) 

•  assumingwordsarerankedinorderofdecreasingfrequency

–  i.e.,r.f ≈korr.Pr≈c,wherePrisprobabilityofwordoccurrenceandc≈ 0.1forEnglish 

Page 11: Text Processing - ir.cis.udel.edu

3/17/09

11

Zipf’sLaw

WikipediaStaVsVcs(wiki000subset)

Totaldocuments 5,001

Totalwordoccurrences 22,545,922

Vocabularysize 348,436

Wordsoccurring>1000Vmes 2,751

Wordsoccurringonce 163,404

Word Freq r Pr(%) r.Pr

poliVcian 5096 510 0.023 0.116

contractor 100 14,852 4.4∙10‐4 0.066

kickboxer 10 56,125 4.4∙10‐5 0.025

comdedian 1 185,035 4.4∙10‐6 0.008

Page 12: Text Processing - ir.cis.udel.edu

3/17/09

12

Top50Wordsfromwiki000Subset

Zipf’sLawforwiki000Subset

Rank

Pro

babi

lity

Page 13: Text Processing - ir.cis.udel.edu

3/17/09

13

Zipf’sLaw

•  WhatistheproporVonofwordswithagivenfrequency?– Wordthatoccursn Vmeshasrankrn = k/n – Numberofwordswithfrequencyn is

•  rn − rn+1  =  k/n − k/(n + 1)=  k/n(n + 1)– ProporVonfoundbydividingbytotalnumberofwords=highestrank=k 

– So,proporVonwithfrequencynis1/n(n+1)

Zipf’sLaw

•  Exampleword

frequencyranking

•  Tocomputenumberofwordswithfrequency493–  rankof“png”minustherankof“defend”

– 5005−5001=4

Rank Word Freq

4999 objecVve 494

5000 albany 494

5001 defend 494

5002 appeals 493

5003 125 493

5004 lasVng 493

5005 png 493

Page 14: Text Processing - ir.cis.udel.edu

3/17/09

14

Example

•  ProporVonsofwordsoccurringnVmesin5,001Wikipediadocuments

•  Vocabularysizeis348,436.

Num.occurrences(n)

Predictedpropor:on(1/n(n+1))

Actualpropor:on

Actualnumberofwords

1 .500 .469 163,404

2 .167 .151 52,672

3 .083 .070 24,272

4 .050 .045 15,685

5 .033 .030 10,437

6 .024 .022 7,832

7 .018 .017 5,962

8 .014 .014 4,890

9 .011 .011 3,886

10 .009 .009 3,291

VocabularyGrowth

•  Ascorpusgrows,sodoesvocabularysize– Fewernewwordswhencorpusisalreadylarge

•  ObservedrelaVonship(Heaps’ Law):

v=k.nβ

wherevisvocabularysize(numberofuniquewords),nisthenumberofwordsincorpus, k,β areparametersthatvaryfor

eachcorpus (typicalvaluesgivenare10≤ k ≤ 100 andβ ≈ 0.5)

Page 15: Text Processing - ir.cis.udel.edu

3/17/09

15

wiki000SubsetExample

Words in collection

Voca

bula

ry s

ize

v ≈ 18.61·n0.5819

Heaps’LawPredicVons

•  PredicVonsforTRECcollecVonsareaccurateforlargenumbersofwords– e.g.,first22,545,922wordsofwiki000scanned– predicVonis353,587uniquewords– actualnumberis348,436

•  PredicVonsforsmallnumbersofwords(i.e.<1000)aremuchworse

Page 16: Text Processing - ir.cis.udel.edu

3/17/09

16

Heaps’LawPredicVons

•  Heaps’Lawworkswithverylargecorpora– newwordsoccurringevenaierseeing30million!

•  Newwordscomefromavarietyofsources•  spellingerrors,inventedwords(e.g.product,companynames),code,otherlanguages,emailaddresses,etc.

•  Searchenginesmustdealwiththeselargeandgrowingvocabularies

Stopping

•  FuncVonwords(determiners,preposiVons)haveliFlemeaningontheirown

•  Highoccurrencefrequencies– Top6words:the, of, and, in, to, a

•  Treatedasstopwords (i.e.removed)–  reduceindexspace,improveresponseVme,improveeffecVveness

•  CanbeimportantincombinaVons– e.g.,“tobeornottobe”

Page 17: Text Processing - ir.cis.udel.edu

3/17/09

17

Stopping

•  Keeptrackofallverycommonwordsinastopwords list.

•  Duringtextprocessing,ignoreanywordonthelist.

•  Stopwordlistcanbecreatedfromhigh‐frequencywordsorbasedonastandardlist

•  ListsarecustomizedforapplicaVons,domains,andevenpartsofdocuments– e.g.,“click”isagoodstopwordforanchortext

Stopping

•  Whenstoragespaceisnotaconcern,itcanbebeFertonotstop.– Queriesarelessrestricted.– RemovestopwordsatqueryVmeunlessusersaystoincludethem.

•  Googledoesnotstop.– “tobeornottobe” returnsresults.– +thereturnsresults(over14billion).

Page 18: Text Processing - ir.cis.udel.edu

3/17/09

18

EndResultofStopping

•  Listofwordsminusthoseonthestoplist.–  tropicalfishincludefishfoundtropicalenvironmentsaroundworldincludingbothfreshwatersaltwaterspeciesfishkeepersoienusetermtropicalfishreferonlythoserequiringfreshwatersaltwatertropicalfishreferredmarinefish

•  Nextstep:stemming.

Stemming•  ManymorphologicalvariaVonsofwords

–  inflecFonal(plurals,tenses)– derivaFonal(makingverbsnounsetc.)

•  Inmostcases,thesehavethesameorverysimilarmeanings

•  StemmersaFempttoreducemorphologicalvariaVonsofwordstoacommonstem– usuallyinvolvesremovingsuffixes

•  CanbedoneatindexingVmeoraspartofqueryprocessing(likestopwords)

Page 19: Text Processing - ir.cis.udel.edu

3/17/09

19

Stemming

•  GenerallyasmallbutsignificanteffecVvenessimprovement– canbecrucialforsomelanguages– e.g.,5‐10%improvementforEnglish,upto50%inArabic

Words with the Arabic root ktb

Stemming

•  Twobasictypes– DicVonary‐based:useslistsofrelatedwords– Algorithmic:usesprogramtodeterminerelatedwords

•  Algorithmicstemmers– suffix‐s: remove‘s’endingsassumingplural

•  e.g.,cats→cat,lakes→lake

•  Manyfalse negaFves:supplies→supplie•  Somefalse posiFves:ups→up

Page 20: Text Processing - ir.cis.udel.edu

3/17/09

20

PorterStemmer

•  AlgorithmicstemmerusedinIRexperimentssincethe70s

•  Consistsofaseriesofrulesdesignedtothelongestpossiblesuffixateachstep

•  ProvablyeffecVve•  Producesstemsnotwords

•  Makesanumberoferrorsanddifficulttomodify

PorterStemmer

•  Examplestep(1of5)

Page 21: Text Processing - ir.cis.udel.edu

3/17/09

21

PorterStemmer

•  Porter2stemmeraddressessomeoftheseissues

•  Approachhasbeenusedwithotherlanguages

KrovetzStemmer

•  Hybridalgorithmic‐dicVonary– WordcheckedindicVonary

•  Ifpresent,eitherleialoneorreplacedwith“excepVon”•  Ifnotpresent,wordischeckedforsuffixesthatcouldberemoved

•  Aierremoval,dicVonaryischeckedagain

•  Produceswordsnotstems•  ComparableeffecVveness•  LowerfalseposiVverate,somewhathigherfalsenegaVve

Page 22: Text Processing - ir.cis.udel.edu

3/17/09

22

StemmerComparison

EndResultofStemming

•  Listofstemmedterms:–  tropicfishincludefishfoundtropicenvironaroundworldincludebothfreshwatsaltwaterspecifishkeepoienusetermtropicfishreferonlithoserequirfreshwatersaltwattropicfishrefermarinfish

–  (fromPorter2stemmer)

•  Nextstep:advancedprocessing,orindexing.

Page 23: Text Processing - ir.cis.udel.edu

3/17/09

23

Martin Hall, 49, head of public policy and external affairs at the London Stock Exchange, is to leave at the end of June.

… The departure of Hall, who had

been in the running to be head of corporate affairs at the BBC, appears to have been prompted by the decision of the new chief executive, Michael Lawrence, to split Hall’s job in two and take the public policy element under his own wing.

<person id=pe1>Martin Hall</person>, 49, <sense num=2>head</sense> of <ow1>public policy</ow1> and external affairs at the <corp id=co1>London Stock Exchange</corp>, is to <syn grp=1>leave</syn> at the end of June.

… The <syn grp=1>departure</syn> of

<person id=pe1>Hall</person>, <ref to=pe1>who</ref> had been in the running to be head of corporate affairs at the <corp id=co2>BBC</corp>, appears to have been prompted by the decision of the new chief executive, <person id=pe2>Michael Lawrence</person>, to split <person id=pe1>Hall’s</person> job in two and take the public policy element under <ref to=pe1>his</ref> own wing.

AdvancedTextProcessing

•  Part‐of‐speechtagging.•  SensedisambiguaVon.•  SynonymclassificaVon.•  NamedenVtytagging.•  PhraseidenVficaVon.•  ReferentresoluVon.•  SentencesegmentaVon.•  TranslaVon.•  SpeechrecogniVon.

TextProcessingErrors

•  Alltextprocessingiserrorful.– DesigndecisionsproducesegmentaVonerrors,stoppingerrors,stemmingerrors.

–  FalseposiVvesandfalsenegaVves.– Moreadvancedmethodsmoredifficultprocessingmoreerrors.

•  Doesthebenefitoutweighthecost?–  SegmentaVon&stemming:definitely.–  POStagging,NEtagging:dependsondomain.–  Synonymclasses:maybenot.

Page 24: Text Processing - ir.cis.udel.edu

3/17/09

24

EndResultofTextProcessing<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} ’’’Tropical fish’’’ include [[fish]] found in [[Tropics|topical]]

environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ’’tropical fish’’ to refer only those requiring fresh water, with saltwater tropical fish referred to as ’’[[list of marine aquarium fish species|marine fish]]’’.

•  Metadata:–  Title:Tropicalfish

•  Importantfields:–  Links:fishtropicfreshwatsalt

waterfishkeepmarinfish

•  Body:–  tropicfishincludefishfound

tropicenvironaroundworldincludebothfreshwatsaltwaterspecifishkeepoienusetermtropicfishreferonlithoserequirfreshwatersaltwattropicfishrefermarinfish

CourseProject

•  PhaseI,worksheet1.– Writeatextprocessingmodule.

– ParseWikipediapages,tokenize,stop,andstem.– AnswerquesVonsaboutWikipediadata:howbigisvocabulary,howmanywordoccurrencesarethere,etc.

•  DuenextWednesday.– PleasestartASAP!

Page 25: Text Processing - ir.cis.udel.edu

3/17/09

25

ExpectaVons

•  ReadWikipediapagesoffdisk.•  IdenVfypartsofthemthatdonotneedtobeindexed.

•  Converttherestintoalistofwords.•  Dropstopwords,stemremainingwordstoterms.

•  KeeptrackofthenumberofVmeseachtermappears,howmanydocumentsitappearsin.

PseudoJavaimport java.io.*; import java.util.*;

… HashMap<String, int> termCounts = new HashMap();

File doc = new File(filename); Scanner docScanner = new Scanner(doc); while (docScanner.hasNextLine()) {

List<String> terms = processLine(docScanner.nextLine()) for (int i=0; i < terms.size(); i++) { String currentTerm = terms.get(i); int termCount = termCounts.get(currentTerm);

termCounts.set(currentTerm, termCount+1); }

}

docScanner.close()

Page 26: Text Processing - ir.cis.udel.edu

3/17/09

26

public List processLine(String line) { List<String> terms = new List();

int i = 0;

Scanner lineScanner = new Scanner(line);

lineScanner.useDelimiter(“\\s*”); while (lineScanner.hasNext()) { String word = lineScanner.next();

/* check if word is appropriate for indexing or if it marks the start of a block to ignore */ if (word.indexOf(“{{“) >= 0)

/* ignore words until closing the block with a }}

… /* other conditions */

/* strip non-alphanumeric characters and lower-case */

word = word.replaceAll("[^a-zA-Z0-9]", ""); word = word.toLowerCase();

/* check if word is in the stop list */

if (!isStopWord(word)) { word = stemmer.stem(word); /* stem word */ terms.set(i, word);

i++; } } return(terms);

}