Introduction to Lucene
Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata
Open source search engines
§ Academic
  – Terrier (Java, University of Glasgow)
  – Indri, Lemur (C++, and Java too, UMass & CMU)
  – Zettair (University of Melbourne)
§ Apache project (non-academic)
  – Lucene – Apache license, legally easier for commercial use
§ Lucene
  – Java search engine library, with many features
  – Ports/integration to other languages available (C/C++, C#, Python, Ruby, …)
  – Other projects on top of Lucene: Solr and others
  – Used by: LinkedIn, Twitter, CiteSeer, …
Lucene: overview
[Overview diagram: the indexing and searching pipelines]
§ Indexing: the user builds a Lucene document from raw documents; Lucene analyzes it into tokens and adds it to the index and dictionary
§ Searching: the user builds a Lucene query from a text query; Lucene analyzes the query, searches the index, and returns search results, which the UI displays and the user reads
§ Here, user = programmer
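To make the "tokens → index and dictionary" step in the overview concrete, here is a toy inverted index in plain Java. This is only an illustrative sketch (the class name `ToyIndex` is invented, and this is not Lucene code); Lucene's real index additionally stores positions, frequencies, and per-field information.

```java
import java.util.*;

// Toy inverted index: maps each token to the set of document ids containing it.
// Illustrates only the dictionary/postings idea behind Lucene's index.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize naively on non-word characters, lowercase, and record the doc id
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of all documents containing the term
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```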
Lucene document building
§ A document is a collection of Fields
§ Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
§ A field can be a text, a number, a range
Building document

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
……

Document doc = new Document();
doc.add(new StringField("Last_Name", "Banerjee", …));
doc.add(new StringField("First_Name", "Arkadeep", …));
doc.add(new IntField("Age", 24, …));
doc.add(new TextField("Description", description, …));
Now, let’s understand the fields
Building document
§ Field.Store
  – NO: don't store the field value in the index
  – YES: store the field value in the index
§ Field.Index
  – ANALYZED: tokenize with an Analyzer
  – NOT_ANALYZED: do not tokenize
  – NO: do not index this field
  – Other options

doc.add(new StringField("Last_Name", "Banerjee", Field.Store.YES));
doc.add(new StringField("First_Name", "Arkadeep", Field.Store.YES));
doc.add(new IntField("Age", 24, Field.Store.YES));
doc.add(new TextField("Description", desc, Field.Store.NO));
Indexing
§ IndexWriter: primary class for indexing
  – Stores the index in a directory
  – RAMDirectory also possible (in memory)

StandardAnalyzer analyzer = new StandardAnalyzer();
FSDirectory dir = FSDirectory.open(new File("directory_path"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(dir, config);
w.addDocument(document);
w.close();
What would we like to search with?
Lucene Analyzer
§ Tokenizes the text in the Fields
§ Common Analyzers
  – WhitespaceAnalyzer: splits tokens on whitespace
  – SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  – StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, …
Analysis example
§ "The quick brown fox jumped over the lazy dog"
§ WhitespaceAnalyzer
  – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ SimpleAnalyzer
  – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ StopAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
§ StandardAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analysis example 2
§ "XY&Z Corporation – xyz@example.com"
§ WhitespaceAnalyzer
  – [XY&Z] [Corporation] [-] [xyz@example.com]
§ SimpleAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StopAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StandardAnalyzer
  – [xy&z] [corporation] [xyz@example.com]
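The SimpleAnalyzer and StopAnalyzer behavior shown above can be simulated in a few lines of plain Java. This is only a conceptual sketch (the class name `AnalyzerSketch` and the tiny stop-word list are invented), not the actual `org.apache.lucene.analysis` classes, which do considerably more — e.g. StandardAnalyzer keeps an e-mail address as a single token.

```java
import java.util.*;

// Conceptual simulation of SimpleAnalyzer and StopAnalyzer token streams
class AnalyzerSketch {
    // Illustrative stop list; Lucene's default English list is longer
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "is", "of", "to");

    // SimpleAnalyzer: split on non-letters, then lowercase
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^a-zA-Z]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    // StopAnalyzer: SimpleAnalyzer output minus stop words
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }
}
```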
Searching
§ IndexReader and IndexSearcher

int k = 10;
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(".//index")));
IndexSearcher searcher = new IndexSearcher(reader);

Query q = new QueryParser("Description", analyzer).parse(query);

TopDocs docs = searcher.search(q, k);
ScoreDoc[] hits = docs.scoreDocs;
Query and QueryParser

QueryParser parser = new QueryParser(Version.LUCENE_40, "Last_Name",
        new StandardAnalyzer());
Query query = parser.parse(q);

§ QueryParser
  – Need to parse the query in the same way the documents were indexed
  – Tell the query which field it should use (field based search)
  – Use the same analyzer
TopDocs and ScoreDoc

TopDocs docs = searcher.search(q, k);
ScoreDoc[] hits = docs.scoreDocs;

§ Search returns TopDocs
  – Reference to the top ranked documents returned by search
§ TopDocs has ScoreDoc(s)
  – Each ScoreDoc is a single document
Getting the results

System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " +
            d.get("First_Name") + " " + d.get("Last_Name"));
}

§ Get the required fields from the documents
Adding/deleting Documents

void addDocument(Document d);
void addDocument(Document d, Analyzer a);

Important: Need to ensure that Analyzers used at indexing time are consistent with Analyzers used at searching time

// deletes docs containing term or matching
// query. The term version is useful for
// deleting one document.
void deleteDocuments(Term term);
void deleteDocuments(Query query);
Index format
§ Each Lucene index consists of one or more segments
  – A segment is a standalone index for a subset of documents
  – All segments are searched
  – A segment is created whenever IndexWriter flushes adds/deletes
§ Periodically, IndexWriter will merge a set of segments into a single segment
  – Policy specified by a MergePolicy
  – Segments are grouped into levels
  – Segments within a group are roughly equal size (in log space)
  – Once a level has enough segments, they are merged into a segment at the next level up
§ Explicitly invoke optimize() to merge segments
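The level-based merging described above can be sketched with a small simulation. This is an invented illustration in the spirit of Lucene's LogMergePolicy, not its actual API: each flush creates a size-1 segment, and whenever `mergeFactor` segments of the same size accumulate, they merge into one segment `mergeFactor` times larger.

```java
import java.util.*;

// Simulation of logarithmic segment merging (illustrative, not Lucene's API)
class MergeSketch {
    // Simulate `flushes` flushes and return the resulting segment sizes, sorted
    static List<Integer> addFlushes(int flushes, int mergeFactor) {
        List<Integer> segments = new ArrayList<>(); // each entry is a segment size
        for (int i = 0; i < flushes; i++) {
            segments.add(1); // a flush creates a small segment
            boolean merged = true;
            while (merged) {
                merged = false;
                // count how many segments exist at each size level
                Map<Integer, Integer> counts = new HashMap<>();
                for (int s : segments) counts.merge(s, 1, Integer::sum);
                for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                    if (e.getValue() >= mergeFactor) {
                        int size = e.getKey();
                        // replace mergeFactor same-size segments with one bigger one
                        for (int r = 0; r < mergeFactor; r++)
                            segments.remove(Integer.valueOf(size));
                        segments.add(size * mergeFactor);
                        merged = true;
                        break;
                    }
                }
            }
        }
        Collections.sort(segments);
        return segments;
    }
}
```

With mergeFactor 10, one hundred flushes collapse into a single size-100 segment, while twelve flushes leave three segments (sizes 1, 1, 10) — segments within a level stay roughly equal in log space.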
Searching a changing index

Directory dir = FSDirectory.open(...);
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);

Above reader does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a new IndexReader.

IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
Near-real-time search

IndexWriter writer = ...;
IndexReader reader = writer.getReader();
IndexSearcher searcher = new IndexSearcher(reader);

// Now let us say there's a change to the index using writer
writer.addDocument(newDoc);

// reopen() and getReader() force writer to flush
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
Query Syntax

Query expression                   Document matches if…
java                               Contains the term java in the default field
java junit / java OR junit         Contains the term java or junit or both in the default field (the default operator can be changed to AND)
+java +junit / java AND junit      Contains both java and junit in the default field
title:ant                          Contains the term ant in the title field
title:extreme –subject:sports      Contains extreme in the title and not sports in subject
(agile OR extreme) AND java        Boolean expression matches
title:"junit in action"            Phrase matches in title
title:"junit action"~5             Proximity matches (within 5) in title
java*                              Wildcard matches
java~                              Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]  Range matches
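The required (+) and prohibited (–) clause semantics in the table can be illustrated with a toy matcher over in-memory token sets. This is an invented sketch of the boolean logic only (class name `QuerySketch` is hypothetical), not Lucene's actual QueryParser, which also handles fields, phrases, wildcards, and analysis.

```java
import java.util.*;

// Toy evaluation of +required / -prohibited / optional query clauses
class QuerySketch {
    // A doc matches if it has every +term, no -term, and (when there are
    // no +terms) at least one optional term (default operator OR)
    static boolean matches(Set<String> docTokens, String query) {
        List<String> required = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        List<String> optional = new ArrayList<>();
        for (String clause : query.split("\\s+")) {
            if (clause.startsWith("+")) required.add(clause.substring(1));
            else if (clause.startsWith("-")) prohibited.add(clause.substring(1));
            else optional.add(clause);
        }
        for (String t : required) if (!docTokens.contains(t)) return false;
        for (String t : prohibited) if (docTokens.contains(t)) return false;
        if (required.isEmpty() && !optional.isEmpty()) {
            for (String t : optional) if (docTokens.contains(t)) return true;
            return false;
        }
        return true;
    }
}
```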
Programmatically constructed queries
§ TermQuery, TermRangeQuery
§ NumericRangeQuery
§ PrefixQuery
§ BooleanQuery
§ PhraseQuery, WildcardQuery
Source and acknowledgements
§ Slides by Manning and Nayak:
  http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
§ The Lucene tutorial website: http://www.lucenetutorial.com
§ Apache Lucene: http://lucene.apache.org