Lucene - jlu.myweb.cs.uwindsor.cajlu.myweb.cs.uwindsor.ca/538/538Lucene2018.pdf5 Indexing • Just the same as the index at the end of a book • Without an index, you have to search

1

Lucene

JianguoLu

SchoolofComputerScience

UniversityofWindsor

AComparisonofOpenSourceSearchEnginesfor1.69MPages

2

lucene• DevelopedbyDougCuHnginiIally

–  Java-based.Createdin1999,DonatedtoApachein2001

• Features– Nocrawler,Nodocumentparsing,No“PageRank”

• WebsitesPoweredbyLucene–  IBMOmnifindY!EdiIon,TechnoraI– Wikipedia,InternetArchive,LinkedIn,monster.com

•  AdddocumentstoanindexviaIndexWriter– AdocumentisacollecIonoffields– Flexibletextanalysis–tokenizers,filters

• SearchfordocumentsviaIndexSearcherHits=search(Query,Filter,Sort,topN)

•  Rankingbasedonb*idfsimilaritywithnormalizaIon3

4

Whatislucene

•  Luceneis– anAPI– anInformaIonRetrievalLibrary

•  LuceneisnotanapplicaIonreadyforuse– Notanwebserver– Notasearchengine

•  Itcanbeusedto–  Indexfiles– Searchtheindex

•  Itisopensource,wrifeninJava•  Twostages:indexandsearch

Picture From Lucene in Action

5

Indexing

•  Justthesameastheindexattheendofabook

•  Withoutanindex,youhavetosearchforakeywordbyscanningallthepagesofabook

•  Processofindexing–  Acquirecontent

–  Say,semanIcwebDBPedia

–  Builddocument–  Transformtotextfilefromotherformatssuchaspdf,msword–  Lucenedoesnotsupportthiskindoffilter–  Therearetoolstodothis

–  Analyzedocument–  Tokenizethedocument–  Stemming–  Stopwords–  Luceneprovidesastringofanalyzers–  Usercanalsocustomizetheanalyzer

–  Indexdocument

•  KeyclassesinLuceneIndexing–  Document,Analyzer,IndexWriter

6

CodesnippetsDirectorydir=FSDirectory.open(newFile(indexDir));writer=newIndexWriter(dir,newStandardAnalyzer(Version.LUCENE_30),true,IndexWriter.MaxFieldLength.UNLIMITED);……Documentdoc=newDocument();doc.add(newField("contents",newFileReader(f)));doc.add(newField("filename",f.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED));doc.add(newField("fullpath",f.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED));writer.addDocument(doc);

Createadocumentinstancefromafile

Addthedoctowriter

Whendocisadded,UseStandardAnalyzer

IndexiswrifenInthisdirectory

7

DocumentandField

doc.add(newField("fullpath",f.getCanonicalPath(),

Field.Store.YES,Field.Index.NOT_ANALYZED));

• ConstructaField:– Firsttwoparametersarefieldnameandvalue

– Thirdparameter:whethertostorethevalue–  IfNO,contentisdiscardedareritindexed.Storingthevalueisusefulifyouneedthevaluelater,likeyouwanttodisplayitinthesearchresult.

– Fourthparameter:whetherandhowthefieldshouldindexed.

doc.add(newField("contents",newFileReader(f)));

– Createatokenizedandindexedfieldthatisnotstored

8

Luceneanalyzers

• StandardAnalyzer – AsophisIcatedgeneral-purposeanalyzer.

• WhitespaceAnalyzer – Averysimpleanalyzerthatjustseparatestokensusingwhitespace.

• StopAnalyzer – RemovescommonEnglishwordsthatarenotusuallyusefulforindexing.

• SnowballAnalyzer – Astemmingthatworksonwordroots.

• AnalyzersforlanguagesotherthanEnglish

Examplesofanalyzers

%javalia.analysis.AnalyzerDemo"NoFluff,JustStuff"

Analyzing"NoFluff,JustStuff"

org.apache.lucene.analysis.WhitespaceAnalyzer:

[No][Fluff,][Just][Stuff]

org.apache.lucene.analysis.SimpleAnalyzer:

[no][fluff][just][stuff]

org.apache.lucene.analysis.StopAnalyzer:

[fluff][just][stuff]

org.apache.lucene.analysis.standard.StandardAnalyzer:

[fluff][just][stuff]

9

StandardAnalyser

• support– companyname

–  XY&ZcorporaIonà[XY&Z][corporaIon]

– Email–  [email protected]>[email protected]

– Ipaddress– Serialnumbers

• Supportchineseandjapanese

10

StopAnalyser

• Defaultstopwords:– "a","an","and","are","as","at","be","but","by”,"for","if","in","into","is","it","no","not","of","on”,"or","such","that","the","their","then","there","these”,"they","this","to","was","will","with”

– Notethatthereareotherstop-wordlists

• Youcanpassyourownsetofstopwords

11

Customizedanalyzers

• MostapplicaIonsdonotusebuilt-inanalyzers

• Customize– Stopwords– ApplicaIonspecifictokens(productpartnumber)

– Synonymexpansion

– Preservecaseforcertaintokens– Choosingstemmingalgorithm

• Solrconfigurethetokenizingusingsolrconfig.xml

12

N-gramfilter

privatestaIcclassNGramAnalyzerextendsAnalyzer{

publicTokenStreamtokenStream(StringfieldName,Readerreader){

returnnewNGramTokenFilter(newKeywordTokenizer(reader),2,4);

}

}

Lefuceà

1:[le]

2:[et]

3:[f]

4:[tu]

5:[uc]

6:[ce]

…

13

ExampleofSnowballAnalyzer•  stemming

• EnglishAnalyzeranalyzer=newSnowballAnalyzer(Version.LUCENE_30,"English");

AnalyzerUIls.assertAnalyzesTo(analyzer,"stemmingalgorithms”,newString[]{"stem","algorithm"});

• SpanishAnalyzeranalyzer=newSnowballAnalyzer(Version.LUCENE_30,"Spanish");

AnalyzerUIls.assertAnalyzesTo(analyzer,"algoritmos”,newString[]{"algoritm"});

14

Lucenetokenizersandfilters

15

Synonymfilter

16

Highlightqueryterms

17

Analyzers(solr)

18

LexCorp BFG-9000

LexCorp BFG-9000

BFG 9000 Lex Corp

LexCorp

bfg 9000 lex corp

lexcorp

WhitespaceTokenizer

WordDelimiterFilter catenateWords=1

LowercaseFilter

Lex corp bfg9000

Lex bfg9000

bfg 9000 Lex corp

bfg 9000 lex corp

WhitespaceTokenizer

LowercaseFilter

A Match!

corp

19

Search

• Threemodelsofsearch– Boolean– Vectorspace– ProbabilisIc

• LucenesupportsacombinaIonofbooleanandvectorspacemodel

• Stepstocarryoutasearch– Buildaquery– Issuethequerytotheindex– Renderthereturns

–  Rankpagesaccordingtorelevancetothequery

20

Searchcode

Directorydir=FSDirectory.open(newFile(indexDir));

IndexSearcheris=newIndexSearcher(dir);

QueryParserparser=newQueryParser(Version.LUCENE_30,"contents",

newStandardAnalyzer(Version.LUCENE_30));

Queryquery=parser.parse(q);

TopDocshits=is.search(query,10);

for(ScoreDocscoreDoc:hits.scoreDocs){

Documentdoc=is.doc(scoreDoc.doc);

System.out.println(doc.get("fullpath"));

}

Indicatetosearchwhichindex

Parsethequery

Searchthequery

Processthereturnsonebyone.Notethat‘fullpath’isafieldaddedwhileindexing

Resultranking

• Thedefaultisrelevancerankingbasedonb.idf• Cancustomizetheranking

– Sortbyindexorder– Sortbyoneormoreafributes

• Pagerankingisnotused

21

Whatifthedocumentisnotatextfile?

22

ExtracIngtextwithApacheTika

• Atoolkitdetectsandextractsmetadataandtextfromoverathousanddifferentfiletypes

• Therearemanydocumentformats– rb,pdf,ppt,outlook,flash,openoffice

– Html,xml

– Zip,tar,gzip,bzip2,..

• Therearemanydocumentfilters– Eachhasitsownapi

• Aframeworkthathostsplug-inparsersforeachdocumenttype– StandardapiforextracIngtextandmetadata

• Tikaitselfdoesnotparsedocuments23

ExtracIngfromvariousfileformats

• Microsor’sOLE2CompoundDocumentFormat(Excel,Word,PowerPoint,Visio,Outlook)

– ApachePOI

• MicrosorOffice2007OOXML– ApachePOI

•  AdobePortableDocumentFormat(PDF)– PDFBox

•  RichTextFormat(RTF)—currentlybodytextonly(nometadata)–  JavaSwingAPI(RTFEditorKit)

•  PlaintextcharactersetdetecIonICU4Jlibrary– HTMLCyberNekolibrary

•  XML–  Java’sjavax.xmlclasses

24

Compressedfiles

• ZIPArchives– Java’sbuilt-inzipclasses,ApacheCommonsCompress

• TARArchives– ApacheAnt,ApacheCommonsCompress

• ARArchives,CPIOArchives– ApacheCommonsCompress

• GZIPcompression– Java’sbuilt-insupport(GZIPInputStream),ApacheCommonsCompress

• BZIP2compression– ApacheAnt,ApacheCommonsCompress

25

mulImedia•  Imageformats(metadataonly)

–  Java’sjavax.imageioclasses

•  MP3audio(ID3v1tags)–  Implementeddirectly

•  Otheraudioformats(wav,aiff,au)J–  ava’sbuilt-insupport(javax.sound.*)

•  OpenDocument–  ParsesXMLdirectly

•  AdobeFlash–  ParsesmetadatafromFLVfilesdirectly

•  MIDIfiles(embeddedtext,egsonglyrics)–  Java’sbuilt-insupport(javax.audio.midi.*)

•  WAVEAudio(samplingmetadata)–  Java’sbuilt-insupport(javax.audio.sampled.*)

26

Programcode

• Javaclassfiles– ASMlibrary(JCR-1522)

• JavaJARfiles– Java’sbuilt-inzipclassesandASMlibrary,ApacheCommonsCompress

27

Example:Searchengineofsourcecode(e.g.,fromgithub)

28

Newrequirements

• TokenizaIon– Variablenames

–  deleIonPolicyvs.deleIonpolicyvs.deleIon_policy

– Symbols:normallydiscardedsymbolsnowimportant–  For(intI=0;i<10;i++)

– Stopwords–  Befertokeepallthewords

• Commontokens– (,– Groupintobigrams,toimproveperformance

–  if(

29

Querysyntax

• Searchrequirements– Searchforclassormethodnames

– SearchforStructurallysimilarcode

– SearchforquesIonsandanswers(e.g.fromstackoverflow.com)

– DealwithIncludesandimports

• Thequerysyntaxmaynolongerbekeywordssearchorbooleansearch

• Needtoparsethesourcecode– Differentprogramminglanguagehasdifferentgrammarandparser

– Parsergenerators

30

Substringsearch

• Wildcardsearchismoreimportant

• NamesinprogramsareorencombinaIonsofseveralwords

• Example:Searchforldap

• SourcecodecontainsgetLDAPconfig• Thequeryis:*ldap*• ImplementaIon

– newapproachoftokenizaIon.getLDAPcofigà[get][LDAP][config]

– Permutermindex

– Ngram–  getLDAPconfigàGetetltld…

31

Krugle:sourcecodesearch

• DescribedinthebookLuceneinAcIon• Indexedmillionsoflinesofcode

32

SIREN

• Searchingsemi-structureddocuments

• Casestudy2ofthelucenebook• SemanIcinformaIonretrieval

• 50millionstructureddata

• 1billionrdftriples

33

Casestudy3ofthelucenebook

• SearchforpeopleinLinkedIn• InteracIvequeryrefinement

• e.g.,searchforjavaprogrammer

34

Relatedtools

• ApacheSolr:isthesearchserverbuiltontopofLucene– Lucenefocusesontheindexing,notawebapplicaIon– Buildweb-basedsearchengines

• ApacheNutch:forlargescalecrawling– CanrunonmulIplemachines

– Supportregularexpressiontofilterwhatisneeded– support.robots.txt

35

OurstarIngpoint

• Indexacademicpapers

• Startwithasubsetofthedata(10Kpapers)• Ourcoursewebsitehas

– alinktothedata– simplecodeforindexingandsearching

36

publicclassIndexAllFilesInDirectory{

staIcintcounter=0;

publicstaIcvoidmain(String[]args)throwsExcepIon{

StringindexPath="/Users/jianguolu/data/citeseer2_index";

StringdocsPath="/Users/jianguolu/data/citeseer2";

System.out.println("Indexingtodirectory'"+indexPath+"'...");

Directorydir=FSDirectory.open(Paths.get(indexPath));

IndexWriterConfigiwc=newIndexWriterConfig(newStandardAnalyzer());

IndexWriterwriter=newIndexWriter(dir,iwc);

indexDocs(writer,Paths.get(docsPath));

writer.close();

}

37

staIcvoidindexDocs(finalIndexWriterwriter,Pathpath)throwsIOExcepIon{

Files.walkFileTree(path,newSimpleFileVisitor<Path>(){

publicFileVisitResultvisitFile(Pathfile,BasicFileAfributesafrs)throwsIOExcepIon{

indexDoc(writer,file);

returnFileVisitResult.CONTINUE;

}

});

}

38

staIcvoidindexDoc(IndexWriterwriter,Pathfile)throwsIOExcepIon{

InputStreamstream=Files.newInputStream(file);

BufferedReaderbr=newBufferedReader(newInputStreamReader(stream,StandardCharsets.UTF_8));

StringItle=br.readLine();

Documentdoc=newDocument();

doc.add(newStringField("path",file.toString(),Field.Store.YES));

doc.add(newTextField("contents",br));

doc.add(newStringField("Itle",Itle,Field.Store.YES));

writer.addDocument(doc);

counter++;

if(counter%1000==0)

System.out.println("indexing"+counter+"-thfile"+file.getFileName());

}

}

39

SearchdocumentspublicclassSearchIndexedDocs{

publicstaAcvoidmain(String[]args)throwsExcepAon{

Stringindex="/Users/jianguolu/data/citeseer2_index";

IndexReaderreader=DirectoryReader.open(FSDirectory.open(Paths.get(index)));

IndexSearchersearcher=newIndexSearcher(reader);

QueryParserparser=newQueryParser("contents",newStandardAnalyzer());

Queryquery=parser.parse("informaIonANDretrieval");

TopDocsresults=searcher.search(query,6);

System.out.println(results.totalHits+"totalmatchingdocuments");

for(inti=0;i<6;i++){

Documentdoc=searcher.doc(results.scoreDocs[i].doc);

System.out.println((i+1)+"."+doc.get("path")+"\n\t"+doc.get(";tle"));

}

reader.close();

}}

40

Documents

Lucene - jlu.myweb.cs.uwindsor.cajlu.myweb.cs.uwindsor.ca/538/538Lucene2018.pdf5 Indexing • Just the same as the index at the end of a book • Without an index, you have to search