Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
1
Lucene
JianguoLu
SchoolofComputerScience
UniversityofWindsor
AComparisonofOpenSourceSearchEnginesfor1.69MPages
2
lucene• DevelopedbyDougCuHnginiIally
– Java-based.Createdin1999,DonatedtoApachein2001
• Features– Nocrawler,Nodocumentparsing,No“PageRank”
• WebsitesPoweredbyLucene– IBMOmnifindY!EdiIon,TechnoraI– Wikipedia,InternetArchive,LinkedIn,monster.com
• AdddocumentstoanindexviaIndexWriter– AdocumentisacollecIonoffields– Flexibletextanalysis–tokenizers,filters
• SearchfordocumentsviaIndexSearcherHits=search(Query,Filter,Sort,topN)
• Rankingbasedonb*idfsimilaritywithnormalizaIon3
4
Whatislucene
• Luceneis– anAPI– anInformaIonRetrievalLibrary
• LuceneisnotanapplicaIonreadyforuse– Notanwebserver– Notasearchengine
• Itcanbeusedto– Indexfiles– Searchtheindex
• Itisopensource,wrifeninJava• Twostages:indexandsearch
Picture From Lucene in Action
5
Indexing
• Justthesameastheindexattheendofabook
• Withoutanindex,youhavetosearchforakeywordbyscanningallthepagesofabook
• Processofindexing– Acquirecontent
– Say,semanIcwebDBPedia
– Builddocument– Transformtotextfilefromotherformatssuchaspdf,msword– Lucenedoesnotsupportthiskindoffilter– Therearetoolstodothis
– Analyzedocument– Tokenizethedocument– Stemming– Stopwords– Luceneprovidesastringofanalyzers– Usercanalsocustomizetheanalyzer
– Indexdocument
• KeyclassesinLuceneIndexing– Document,Analyzer,IndexWriter
6
CodesnippetsDirectorydir=FSDirectory.open(newFile(indexDir));writer=newIndexWriter(dir,newStandardAnalyzer(Version.LUCENE_30),true,IndexWriter.MaxFieldLength.UNLIMITED);……Documentdoc=newDocument();doc.add(newField("contents",newFileReader(f)));doc.add(newField("filename",f.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED));doc.add(newField("fullpath",f.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED));writer.addDocument(doc);
Createadocumentinstancefromafile
Addthedoctowriter
Whendocisadded,UseStandardAnalyzer
IndexiswrifenInthisdirectory
7
DocumentandField
doc.add(newField("fullpath",f.getCanonicalPath(),
Field.Store.YES,Field.Index.NOT_ANALYZED));
• ConstructaField:– Firsttwoparametersarefieldnameandvalue
– Thirdparameter:whethertostorethevalue– IfNO,contentisdiscardedareritindexed.Storingthevalueisusefulifyouneedthevaluelater,likeyouwanttodisplayitinthesearchresult.
– Fourthparameter:whetherandhowthefieldshouldindexed.
doc.add(newField("contents",newFileReader(f)));
– Createatokenizedandindexedfieldthatisnotstored
8
Luceneanalyzers
• StandardAnalyzer – AsophisIcatedgeneral-purposeanalyzer.
• WhitespaceAnalyzer – Averysimpleanalyzerthatjustseparatestokensusingwhitespace.
• StopAnalyzer – RemovescommonEnglishwordsthatarenotusuallyusefulforindexing.
• SnowballAnalyzer – Astemmingthatworksonwordroots.
• AnalyzersforlanguagesotherthanEnglish
Examplesofanalyzers
%javalia.analysis.AnalyzerDemo"NoFluff,JustStuff"
Analyzing"NoFluff,JustStuff"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[No][Fluff,][Just][Stuff]
org.apache.lucene.analysis.SimpleAnalyzer:
[no][fluff][just][stuff]
org.apache.lucene.analysis.StopAnalyzer:
[fluff][just][stuff]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[fluff][just][stuff]
9
StandardAnalyser
• support– companyname
– XY&ZcorporaIonà[XY&Z][corporaIon]
– Email– [email protected]>[email protected]
– Ipaddress– Serialnumbers
• Supportchineseandjapanese
10
StopAnalyser
• Defaultstopwords:– "a","an","and","are","as","at","be","but","by”,"for","if","in","into","is","it","no","not","of","on”,"or","such","that","the","their","then","there","these”,"they","this","to","was","will","with”
– Notethatthereareotherstop-wordlists
• Youcanpassyourownsetofstopwords
11
Customizedanalyzers
• MostapplicaIonsdonotusebuilt-inanalyzers
• Customize– Stopwords– ApplicaIonspecifictokens(productpartnumber)
– Synonymexpansion
– Preservecaseforcertaintokens– Choosingstemmingalgorithm
• Solrconfigurethetokenizingusingsolrconfig.xml
12
N-gramfilter
privatestaIcclassNGramAnalyzerextendsAnalyzer{
publicTokenStreamtokenStream(StringfieldName,Readerreader){
returnnewNGramTokenFilter(newKeywordTokenizer(reader),2,4);
}
}
Lefuceà
1:[le]
2:[et]
3:[f]
4:[tu]
5:[uc]
6:[ce]
…
13
ExampleofSnowballAnalyzer• stemming
• EnglishAnalyzeranalyzer=newSnowballAnalyzer(Version.LUCENE_30,"English");
AnalyzerUIls.assertAnalyzesTo(analyzer,"stemmingalgorithms”,newString[]{"stem","algorithm"});
• SpanishAnalyzeranalyzer=newSnowballAnalyzer(Version.LUCENE_30,"Spanish");
AnalyzerUIls.assertAnalyzesTo(analyzer,"algoritmos”,newString[]{"algoritm"});
14
Lucenetokenizersandfilters
15
Synonymfilter
16
Highlightqueryterms
17
Analyzers(solr)
18
LexCorp BFG-9000
LexCorp BFG-9000
BFG 9000 Lex Corp
LexCorp
bfg 9000 lex corp
lexcorp
WhitespaceTokenizer
WordDelimiterFilter catenateWords=1
LowercaseFilter
Lex corp bfg9000
Lex bfg9000
bfg 9000 Lex corp
bfg 9000 lex corp
WhitespaceTokenizer
LowercaseFilter
A Match!
corp
19
Search
• Threemodelsofsearch– Boolean– Vectorspace– ProbabilisIc
• LucenesupportsacombinaIonofbooleanandvectorspacemodel
• Stepstocarryoutasearch– Buildaquery– Issuethequerytotheindex– Renderthereturns
– Rankpagesaccordingtorelevancetothequery
20
Searchcode
Directorydir=FSDirectory.open(newFile(indexDir));
IndexSearcheris=newIndexSearcher(dir);
QueryParserparser=newQueryParser(Version.LUCENE_30,"contents",
newStandardAnalyzer(Version.LUCENE_30));
Queryquery=parser.parse(q);
TopDocshits=is.search(query,10);
for(ScoreDocscoreDoc:hits.scoreDocs){
Documentdoc=is.doc(scoreDoc.doc);
System.out.println(doc.get("fullpath"));
}
Indicatetosearchwhichindex
Parsethequery
Searchthequery
Processthereturnsonebyone.Notethat‘fullpath’isafieldaddedwhileindexing
Resultranking
• Thedefaultisrelevancerankingbasedonb.idf• Cancustomizetheranking
– Sortbyindexorder– Sortbyoneormoreafributes
• Pagerankingisnotused
21
Whatifthedocumentisnotatextfile?
22
ExtracIngtextwithApacheTika
• Atoolkitdetectsandextractsmetadataandtextfromoverathousanddifferentfiletypes
• Therearemanydocumentformats– rb,pdf,ppt,outlook,flash,openoffice
– Html,xml
– Zip,tar,gzip,bzip2,..
• Therearemanydocumentfilters– Eachhasitsownapi
• Aframeworkthathostsplug-inparsersforeachdocumenttype– StandardapiforextracIngtextandmetadata
• Tikaitselfdoesnotparsedocuments23
ExtracIngfromvariousfileformats
• Microsor’sOLE2CompoundDocumentFormat(Excel,Word,PowerPoint,Visio,Outlook)
– ApachePOI
• MicrosorOffice2007OOXML– ApachePOI
• AdobePortableDocumentFormat(PDF)– PDFBox
• RichTextFormat(RTF)—currentlybodytextonly(nometadata)– JavaSwingAPI(RTFEditorKit)
• PlaintextcharactersetdetecIonICU4Jlibrary– HTMLCyberNekolibrary
• XML– Java’sjavax.xmlclasses
24
Compressedfiles
• ZIPArchives– Java’sbuilt-inzipclasses,ApacheCommonsCompress
• TARArchives– ApacheAnt,ApacheCommonsCompress
• ARArchives,CPIOArchives– ApacheCommonsCompress
• GZIPcompression– Java’sbuilt-insupport(GZIPInputStream),ApacheCommonsCompress
• BZIP2compression– ApacheAnt,ApacheCommonsCompress
25
mulImedia• Imageformats(metadataonly)
– Java’sjavax.imageioclasses
• MP3audio(ID3v1tags)– Implementeddirectly
• Otheraudioformats(wav,aiff,au)J– ava’sbuilt-insupport(javax.sound.*)
• OpenDocument– ParsesXMLdirectly
• AdobeFlash– ParsesmetadatafromFLVfilesdirectly
• MIDIfiles(embeddedtext,egsonglyrics)– Java’sbuilt-insupport(javax.audio.midi.*)
• WAVEAudio(samplingmetadata)– Java’sbuilt-insupport(javax.audio.sampled.*)
26
Programcode
• Javaclassfiles– ASMlibrary(JCR-1522)
• JavaJARfiles– Java’sbuilt-inzipclassesandASMlibrary,ApacheCommonsCompress
27
Example:Searchengineofsourcecode(e.g.,fromgithub)
28
Newrequirements
• TokenizaIon– Variablenames
– deleIonPolicyvs.deleIonpolicyvs.deleIon_policy
– Symbols:normallydiscardedsymbolsnowimportant– For(intI=0;i<10;i++)
– Stopwords– Befertokeepallthewords
• Commontokens– (,– Groupintobigrams,toimproveperformance
– if(
29
Querysyntax
• Searchrequirements– Searchforclassormethodnames
– SearchforStructurallysimilarcode
– SearchforquesIonsandanswers(e.g.fromstackoverflow.com)
– DealwithIncludesandimports
• Thequerysyntaxmaynolongerbekeywordssearchorbooleansearch
• Needtoparsethesourcecode– Differentprogramminglanguagehasdifferentgrammarandparser
– Parsergenerators
30
Substringsearch
• Wildcardsearchismoreimportant
• NamesinprogramsareorencombinaIonsofseveralwords
• Example:Searchforldap
• SourcecodecontainsgetLDAPconfig• Thequeryis:*ldap*• ImplementaIon
– newapproachoftokenizaIon.getLDAPcofigà[get][LDAP][config]
– Permutermindex
– Ngram– getLDAPconfigàGetetltld…
31
Krugle:sourcecodesearch
• DescribedinthebookLuceneinAcIon• Indexedmillionsoflinesofcode
32
SIREN
• Searchingsemi-structureddocuments
• Casestudy2ofthelucenebook• SemanIcinformaIonretrieval
• 50millionstructureddata
• 1billionrdftriples
33
Casestudy3ofthelucenebook
• SearchforpeopleinLinkedIn• InteracIvequeryrefinement
• e.g.,searchforjavaprogrammer
34
Relatedtools
• ApacheSolr:isthesearchserverbuiltontopofLucene– Lucenefocusesontheindexing,notawebapplicaIon– Buildweb-basedsearchengines
• ApacheNutch:forlargescalecrawling– CanrunonmulIplemachines
– Supportregularexpressiontofilterwhatisneeded– support.robots.txt
35
OurstarIngpoint
• Indexacademicpapers
• Startwithasubsetofthedata(10Kpapers)• Ourcoursewebsitehas
– alinktothedata– simplecodeforindexingandsearching
36
publicclassIndexAllFilesInDirectory{
staIcintcounter=0;
publicstaIcvoidmain(String[]args)throwsExcepIon{
StringindexPath="/Users/jianguolu/data/citeseer2_index";
StringdocsPath="/Users/jianguolu/data/citeseer2";
System.out.println("Indexingtodirectory'"+indexPath+"'...");
Directorydir=FSDirectory.open(Paths.get(indexPath));
IndexWriterConfigiwc=newIndexWriterConfig(newStandardAnalyzer());
IndexWriterwriter=newIndexWriter(dir,iwc);
indexDocs(writer,Paths.get(docsPath));
writer.close();
}
37
staIcvoidindexDocs(finalIndexWriterwriter,Pathpath)throwsIOExcepIon{
Files.walkFileTree(path,newSimpleFileVisitor<Path>(){
publicFileVisitResultvisitFile(Pathfile,BasicFileAfributesafrs)throwsIOExcepIon{
indexDoc(writer,file);
returnFileVisitResult.CONTINUE;
}
});
}
38
staIcvoidindexDoc(IndexWriterwriter,Pathfile)throwsIOExcepIon{
InputStreamstream=Files.newInputStream(file);
BufferedReaderbr=newBufferedReader(newInputStreamReader(stream,StandardCharsets.UTF_8));
StringItle=br.readLine();
Documentdoc=newDocument();
doc.add(newStringField("path",file.toString(),Field.Store.YES));
doc.add(newTextField("contents",br));
doc.add(newStringField("Itle",Itle,Field.Store.YES));
writer.addDocument(doc);
counter++;
if(counter%1000==0)
System.out.println("indexing"+counter+"-thfile"+file.getFileName());
}
}
39
SearchdocumentspublicclassSearchIndexedDocs{
publicstaAcvoidmain(String[]args)throwsExcepAon{
Stringindex="/Users/jianguolu/data/citeseer2_index";
IndexReaderreader=DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearchersearcher=newIndexSearcher(reader);
QueryParserparser=newQueryParser("contents",newStandardAnalyzer());
Queryquery=parser.parse("informaIonANDretrieval");
TopDocsresults=searcher.search(query,6);
System.out.println(results.totalHits+"totalmatchingdocuments");
for(inti=0;i<6;i++){
Documentdoc=searcher.doc(results.scoreDocs[i].doc);
System.out.println((i+1)+"."+doc.get("path")+"\n\t"+doc.get(";tle"));
}
reader.close();
}}
40