Lucene Tutorial borrowing from: Chris Manning and Pandu Nayak

Lucene Tutorial - Klinton Bicknell · Introduction to Information Retrieval Open source IR systems Widely used academic systems ... Things built on it: Solr, ElasticSearch A few others

Introduction to Information Retrieval

Open source IR systems
▪ Widely used academic systems
  ▪ Terrier (Java, U. Glasgow) http://terrier.org
  ▪ Indri/Galago/Lemur (C++ (& Java), U. Mass & CMU)
  ▪ Tail of others (Zettair, …)
▪ Widely used non-academic open source systems
  ▪ Lucene
    ▪ Things built on it: Solr, ElasticSearch
  ▪ A few others (Xapian, …)

Lucene
▪ Open source Java library for indexing and searching
  ▪ Lets you add search to your application
  ▪ Not a complete search system by itself
  ▪ Written by Doug Cutting
▪ Used by: Twitter, LinkedIn, Zappos, CiteSeer, Eclipse, …
  ▪ … and many more (see http://wiki.apache.org/lucene-java/PoweredBy)
▪ Ports/integrations to other languages
  ▪ C/C++, C#, Ruby, Perl, Python, PHP, …

Based on “Lucene in Action”
By Michael McCandless, Erik Hatcher, Otis Gospodnetic

Covers Lucene 3.0.1. It’s now up to 5.3.1

Resources
▪ Lucene: http://lucene.apache.org
▪ Lucene in Action: http://www.manning.com/hatcher3/
  ▪ Code samples available for download
▪ Ant: http://ant.apache.org/
  ▪ Java build system used by “Lucene in Action” code

Lucene in a search system

[Diagram. Indexing side: Raw Content → Acquire content → Build document → Analyze document → Index document → Index. Search side: Users → Search UI → Build query → Run query (against the Index) → Render results.]

Lucene demos
▪ Source files in lia2e/src/lia/meetlucene/
  ▪ Actual sources use Lucene 3.0.1
  ▪ Code in these slides upgraded to Lucene 5
▪ Command line Indexer
  ▪ lia.meetlucene.Indexer
▪ Command line Searcher
  ▪ lia.meetlucene.Searcher

Core indexing classes
▪ IndexWriter
  ▪ Central component that allows you to create a new index, open an existing one, and add, remove, or update documents in an index
  ▪ Built on an IndexWriterConfig and a Directory
▪ Directory
  ▪ Abstract class that represents the location of an index
▪ Analyzer
  ▪ Extracts tokens from a text stream

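Creating an IndexWriter

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    ...

    private IndexWriter writer;

    public Indexer(String dir) throws IOException {
      // Lucene 5 opens an FSDirectory from a java.nio.file.Path, not a File
      Directory indexDir = FSDirectory.open(Paths.get(dir));
      Analyzer analyzer = new StandardAnalyzer();
      IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
      cfg.setOpenMode(OpenMode.CREATE);  // replace any existing index
      writer = new IndexWriter(indexDir, cfg);
    }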

Core indexing classes (contd.)
▪ Document
  ▪ Represents a collection of named Fields. Text in these Fields is indexed.
▪ Field
  ▪ Note: Lucene Fields can represent both “fields” and “zones” as described in the textbook
    ▪ Or even other things like numbers.
  ▪ StringFields are indexed but not tokenized
  ▪ TextFields are indexed and tokenized

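A Document contains Fields

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    ...
    protected Document getDocument(File f) throws Exception {
      Document doc = new Document();
      // file body: tokenized and indexed, read from a Reader (not stored)
      doc.add(new TextField("contents", new FileReader(f)));
      // name and path: indexed as single tokens and stored for display
      doc.add(new StringField("filename", f.getName(), Field.Store.YES));
      doc.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
      return doc;
    }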

Index a Document with IndexWriter

    private IndexWriter writer;
    ...
    private void indexFile(File f) throws Exception {
      Document doc = getDocument(f);
      writer.addDocument(doc);
    }

Indexing a directory

    private IndexWriter writer;
    ...
    public int index(String dataDir, FileFilter filter) throws Exception {
      File[] files = new File(dataDir).listFiles();
      for (File f : files) {
        if (... && (filter == null || filter.accept(f))) {
          indexFile(f);
        }
      }
      return writer.numDocs();
    }

Closing the IndexWriter

    private IndexWriter writer;
    ...
    public void close() throws IOException {
      writer.close();
    }

The Index
▪ The Index is the kind of inverted index we know and love
▪ The default Lucene50 codec is:
  ▪ variable-byte and fixed-width encoding of delta values
  ▪ multi-level skip lists
  ▪ natural ordering of docIDs
  ▪ encodes both term frequencies and positional information
▪ APIs to customize the codec
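
The "variable-byte encoding of delta values" bullet can be illustrated with a small sketch that uses only the Java standard library. This is not Lucene's codec code: the class `VByteDeltaSketch` and its methods are hypothetical names, and the byte layout (7 payload bits per byte, low-order group first, high bit set when more bytes follow) is one common convention assumed here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class VByteDeltaSketch {
    // Encode sorted doc IDs as gaps; each gap becomes 1..5 bytes.
    public static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;          // delta from the previous doc ID
            previous = docId;
            while ((gap & ~0x7F) != 0) {         // more than 7 bits remain
                out.write((gap & 0x7F) | 0x80);  // low 7 bits, continuation bit set
                gap >>>= 7;
            }
            out.write(gap);                      // last byte, continuation bit clear
        }
        return out.toByteArray();
    }

    // Reverse the encoding: rebuild each gap, then prefix-sum back to doc IDs.
    public static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;                      // more bytes belong to this gap
            } else {
                previous += gap;
                docIds.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] postings = {3, 7, 200, 100000};
        byte[] encoded = encode(postings);
        System.out.println(encoded.length + " bytes for " + postings.length + " doc IDs");
        System.out.println(decode(encoded));
    }
}
```

Small gaps fit in one byte, so dense postings lists shrink the most; storing gaps instead of absolute doc IDs is what keeps the numbers small.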

Core searching classes
▪ IndexSearcher
  ▪ Central class that exposes several search methods on an index
  ▪ Accessed via an IndexReader
▪ Query
  ▪ Abstract query class. Concrete subclasses represent specific types of queries, e.g., matching terms in fields, boolean queries, phrase queries, …
▪ QueryParser
  ▪ Parses a textual representation of a query into a Query instance

IndexSearcher

[Diagram: a Query goes into the IndexSearcher, which returns TopDocs. The IndexSearcher sits on an IndexReader, which sits on a Directory.]

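Creating an IndexSearcher

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      // Lucene 5: FSDirectory.open takes a java.nio.file.Path
      IndexReader rdr =
          DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
      IndexSearcher is = new IndexSearcher(rdr);
      ...
    }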

Query and QueryParser

    // Lucene 5 package; in Lucene 3 this was org.apache.lucene.queryParser
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      QueryParser parser =
          new QueryParser("contents", new StandardAnalyzer());
      Query query = parser.parse(q);
      ...
    }

Core searching classes (contd.)
▪ TopDocs
  ▪ Contains references to the top documents returned by a search
▪ ScoreDoc
  ▪ Represents a single search result

search() returns TopDocs

    import org.apache.lucene.search.TopDocs;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      Query query = ...;
      ...
      TopDocs hits = is.search(query, 10);  // top 10 hits
    }

TopDocs contain ScoreDocs

    import org.apache.lucene.search.ScoreDoc;
    ...
    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      TopDocs hits = ...;
      ...
      for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = is.doc(scoreDoc.doc);  // fetch stored fields by doc ID
        System.out.println(doc.get("fullpath"));
      }
    }

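Closing IndexSearcher

    public static void search(String indexDir, String q)
        throws IOException, ParseException {
      ...
      IndexSearcher is = ...;
      ...
      // IndexSearcher has had no close() since Lucene 4;
      // close the IndexReader it was built on instead
      is.getIndexReader().close();
    }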

How Lucene models content
▪ A Document is the atomic unit of indexing and searching
  ▪ A Document contains Fields
▪ Fields have a name and a value
  ▪ You have to translate raw content into Fields
  ▪ Examples: Title, author, date, abstract, body, URL, keywords, ...
  ▪ Different documents can have different fields
  ▪ Search a field using name:term, e.g., title:lucene

Fields
▪ Fields may
  ▪ Be indexed or not
    ▪ Indexed fields may or may not be analyzed (i.e., tokenized with an Analyzer)
    ▪ Non-analyzed fields view the entire value as a single token (useful for URLs, paths, dates, social security numbers, ...)
  ▪ Be stored or not
    ▪ Useful for fields that you’d like to display to users

Field construction
Lots of different constructors

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;

    Field(String name, String value, FieldType type);

value can also be specified with a Reader, a TokenStream, or a byte[].
FieldType specifies field properties.
Can also directly use sub-classes like TextField, StringField, …

Using Field properties

Index          Store   Example usage
NOT_ANALYZED   YES     Identifiers, telephone/SSNs, URLs, dates, ...
ANALYZED       YES     Title, abstract
ANALYZED       NO      Body
NO             YES     Document type, DB keys (if not used for searching)
NOT_ANALYZED   NO      Hidden keywords

Analyzer
▪ Tokenizes the input text
▪ Common Analyzers
  ▪ WhitespaceAnalyzer: Splits tokens on whitespace
  ▪ SimpleAnalyzer: Splits tokens on non-letters, and then lowercases
  ▪ StopAnalyzer: Same as SimpleAnalyzer, but also removes stop words
  ▪ StandardAnalyzer: Most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Analysis example
▪ “The quick brown fox jumped over the lazy dog”
▪ WhitespaceAnalyzer
  ▪ [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ SimpleAnalyzer
  ▪ [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
▪ StopAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
▪ StandardAnalyzer
  ▪ [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
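
The first three behaviours above can be imitated with plain Java string handling, which may make the differences easier to see. This is a toy sketch, not Lucene code: `ToyAnalyzers` and its methods are hypothetical names, and the stop list is a short assumed one rather than Lucene's own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzers {
    // Assumed stop list for illustration only; not loaded from Lucene.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "in", "of", "or", "the", "to"));

    // Like WhitespaceAnalyzer: split on runs of whitespace, keep case.
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase each token.
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) tokens.add(part.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: simple() followed by stop-word removal.
    public static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

StandardAnalyzer is left out on purpose: its grammar-based tokenization does not reduce to a one-line split.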

Another analysis example
▪ “XY&Z Corporation – xyz@example.com”
▪ WhitespaceAnalyzer
  ▪ [XY&Z] [Corporation] [-] [xyz@example.com]
▪ SimpleAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StopAnalyzer
  ▪ [xy] [z] [corporation] [xyz] [example] [com]
▪ StandardAnalyzer
  ▪ [xy&z] [corporation] [xyz@example.com]

What’s inside an Analyzer?
▪ Analyzers need to return a TokenStream

    public TokenStream tokenStream(String fieldName, Reader reader)

[Diagram: TokenStream comes in two kinds, Tokenizer and TokenFilter. An analysis chain runs Reader → Tokenizer → TokenFilter → TokenFilter.]

Tokenizers and TokenFilters
▪ Tokenizer
  ▪ WhitespaceTokenizer
  ▪ KeywordTokenizer
  ▪ LetterTokenizer
  ▪ StandardTokenizer
  ▪ ...
▪ TokenFilter
  ▪ LowerCaseFilter
  ▪ StopFilter
  ▪ PorterStemFilter
  ▪ ASCIIFoldingFilter
  ▪ StandardFilter
  ▪ ...

Adding/deleting Documents to/from an IndexWriter

    void addDocument(Iterable<IndexableField> d);

IndexWriter’s Analyzer is used to analyze the document.
Important: Need to ensure that Analyzers used at indexing time are consistent with Analyzers used at searching time

    // deletes docs containing terms or matching
    // queries. The term version is useful for
    // deleting one document.
    void deleteDocuments(Term... terms);
    void deleteDocuments(Query... queries);

Index format
▪ Each Lucene index consists of one or more segments
  ▪ A segment is a standalone index for a subset of documents
  ▪ All segments are searched
  ▪ A segment is created whenever IndexWriter flushes adds/deletes
▪ Periodically, IndexWriter will merge a set of segments into a single segment
  ▪ Policy specified by a MergePolicy
▪ You can explicitly invoke forceMerge() to merge segments

Basic merge policy
▪ Segments are grouped into levels
▪ Segments within a level are roughly equal size (in log space)
▪ Once a level has enough segments, they are merged into a segment at the next level up
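
The level-based idea can be simulated in a few lines. The sketch below is a deliberately simplified model, not Lucene's MergePolicy implementation: `LevelMergeSketch`, `flush`, and the `mergeFactor` parameter are hypothetical names, and a segment's level is taken to be the number of merges that produced it rather than a bucket computed from its size.

```java
import java.util.ArrayList;
import java.util.List;

public class LevelMergeSketch {
    private final int mergeFactor;
    // levels.get(i) holds the sizes (in documents) of the segments at level i
    private final List<List<Integer>> levels = new ArrayList<>();

    public LevelMergeSketch(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    // A flush creates a new level-0 segment; merges may then cascade upward.
    public void flush(int docCount) {
        addSegment(0, docCount);
    }

    private void addSegment(int level, int size) {
        while (levels.size() <= level) {
            levels.add(new ArrayList<>());
        }
        List<Integer> segments = levels.get(level);
        segments.add(size);
        if (segments.size() == mergeFactor) {   // level is full: merge it
            int merged = 0;
            for (int s : segments) merged += s;
            segments.clear();
            addSegment(level + 1, merged);      // one bigger segment, next level up
        }
    }

    public int segmentCount() {
        int count = 0;
        for (List<Integer> segments : levels) count += segments.size();
        return count;
    }

    public List<List<Integer>> levels() {
        return levels;
    }

    public static void main(String[] args) {
        LevelMergeSketch index = new LevelMergeSketch(3);
        for (int i = 0; i < 14; i++) {
            index.flush(10);                    // 14 flushes of 10 documents each
        }
        System.out.println(index.levels());
        System.out.println(index.segmentCount() + " segments");
    }
}
```

With equal-sized flushes, the number of live segments grows only logarithmically in the number of flushes, which is the point of merging by level.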

Searching a changing index

    Directory dir = FSDirectory.open(...);
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

The reader above does not reflect changes to the index unless you reopen it. Reopening is more resource-efficient than opening a brand new reader.

    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader != null) {  // null means the index has not changed
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }

Near-real-time search

    IndexWriter writer = ...;
    DirectoryReader reader = DirectoryReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);

    // Now let us say there’s a change to the index using writer
    writer.addDocument(newDoc);

    DirectoryReader newReader =
        DirectoryReader.openIfChanged(reader, writer, true);
    if (newReader != null) {
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }

QueryParser
▪ Constructor
  ▪ QueryParser(String defaultField, Analyzer analyzer);
▪ Parsing methods
  ▪ Query parse(String query) throws ParseException;
  ▪ ... and many more

QueryParser syntax examples

Query expression                       Document matches if…
java                                   Contains the term java in the default field
java junit
java OR junit                          Contains the term java or junit or both in the default field (the default operator can be changed to AND)
+java +junit
java AND junit                         Contains both java and junit in the default field
title:ant                              Contains the term ant in the title field
title:extreme -subject:sports          Contains extreme in the title and not sports in subject
(agile OR extreme) AND java            Boolean expression matches
title:"junit in action"                Phrase matches in title
title:"junit action"~5                 Proximity matches (within 5) in title
java*                                  Wildcard matches
java~                                  Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]      Range matches

IndexSearcher
▪ Methods
  ▪ TopDocs search(Query q, int n);
  ▪ Document doc(int docID);

TopDocs and ScoreDoc
▪ TopDocs methods
  ▪ totalHits: Number of documents that matched the search
  ▪ scoreDocs: Array of ScoreDoc instances containing results
  ▪ getMaxScore(): Returns best score of all matches
▪ ScoreDoc methods
  ▪ doc: Document id
  ▪ score: Document score

Scoring
▪ Original scoring function uses basic tf-idf scoring with
  ▪ Programmable boost values for certain fields in documents
  ▪ Length normalization
  ▪ Boosts for documents containing more of the query terms
▪ IndexSearcher provides an explain() method that explains the scoring of a document
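
To make the three ingredients concrete, here is a toy tf-idf scorer in plain Java. It is a sketch under stated assumptions, not Lucene's scoring code: `TfIdfSketch` is a hypothetical class, and the specific choices (square-root term frequency, a logarithmic idf that is squared, 1/sqrt(length) normalization, and the fraction of query terms matched as the "more terms" boost) are assumptions picked to mirror the bullets above.

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    public static double score(List<String> query, List<String> doc,
                               List<List<String>> corpus, double fieldBoost) {
        // Length normalization: longer documents get a smaller factor.
        double lengthNorm = 1.0 / Math.sqrt(doc.size());
        double sum = 0.0;
        int matched = 0;
        for (String term : query) {
            long freq = doc.stream().filter(term::equals).count();
            if (freq == 0) continue;             // term absent from this document
            matched++;
            long docFreq = corpus.stream().filter(d -> d.contains(term)).count();
            double tf = Math.sqrt(freq);
            double idf = 1.0 + Math.log((double) corpus.size() / (docFreq + 1));
            sum += tf * idf * idf * fieldBoost * lengthNorm;
        }
        // Boost for documents containing more of the query terms.
        double coord = (double) matched / query.size();
        return coord * sum;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "search", "library"),
                Arrays.asList("search", "engine"),
                Arrays.asList("java", "build", "tool", "ant"));
        List<String> query = Arrays.asList("lucene", "search");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(query, doc, corpus, 1.0));
        }
    }
}
```

In real Lucene code, the explain() method mentioned above is the way to see which factors contributed to a given document's score.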

Lucene 5.0 Scoring
▪ As well as traditional tf.idf vector space model, Lucene 5.0 has:
  ▪ BM25
  ▪ dfr (divergence from randomness)
  ▪ ib (information (theory)-based similarity)

    indexSearcher.setSimilarity(new BM25Similarity());
    // the two-argument constructor takes floats: k1, b
    BM25Similarity custom = new BM25Similarity(1.2f, 0.75f);
    indexSearcher.setSimilarity(custom);
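
What k1 and b control is easier to see from the formula itself. The sketch below computes a textbook-style BM25 score for a single term using only the standard library. `Bm25Sketch` is a hypothetical class, the idf variant is an assumption, and index-side details such as how Lucene stores document lengths are not modelled.

```java
public class Bm25Sketch {
    // Assumed idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)); always positive.
    public static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one query term in one document.
    public static double termScore(double freq, double docLen, double avgDocLen,
                                   long docCount, long docFreq,
                                   double k1, double b) {
        // b in [0, 1] sets how strongly document length is penalized.
        double lengthFactor = 1.0 - b + b * docLen / avgDocLen;
        // k1 sets how quickly repeated occurrences stop adding to the score.
        return idf(docCount, docFreq) * freq * (k1 + 1.0)
                / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        for (int freq : new int[] {1, 2, 5, 50}) {
            System.out.println("freq=" + freq + " score="
                    + termScore(freq, 100, 100, 1000, 10, k1, b));
        }
    }
}
```

The score rises with term frequency but levels off below idf × (k1 + 1), and setting b to 0 switches length normalization off entirely.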