Introduction to Lucene
Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata
Open source search engines
§ Academic
  – Terrier (Java, University of Glasgow)
  – Indri, Lemur (C++, and Java too, UMass & CMU)
  – Zettair (University of Melbourne)
§ Apache project (non-academic)
  – Lucene – Apache license, legally easier for commercial use
§ Lucene
  – Java search engine library, with many features
  – Ports/integration to other languages available (C/C++, C#, Python, Ruby, …)
  – Other projects on top of Lucene: Solr and others
  – Used by: LinkedIn, Twitter, CiteSeer, …
Lucene: overview
[Overview diagram: the indexing and searching pipelines]
§ Indexing: the user builds a Lucene document from raw documents; Lucene analyzes it into tokens and adds it to the index and dictionary
§ Searching: the user builds a Lucene query from a text query; Lucene analyzes the query, searches the index, and returns search results, which the UI displays and the user reads
§ Here, user = programmer
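To make the "tokens → index and dictionary" step in the overview concrete, here is a toy inverted index in plain Java. This is only an illustrative sketch (the class name `ToyIndex` is invented, and this is not Lucene code); Lucene's real index additionally stores positions, frequencies, and per-field information.

```java
import java.util.*;

// Toy inverted index: maps each token to the set of document ids containing it.
// Illustrates only the dictionary/postings idea behind Lucene's index.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize naively on non-word characters, lowercase, and record the doc id
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of all documents containing the term
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```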
Lucene document building
§ A document is a collection of Fields
§ Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
§ A field can be a text, a number, a range
Building document

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
……

Document doc = new Document();
doc.add(new StringField("Last_Name", "Banerjee", …));
doc.add(new StringField("First_Name", "Arkadeep", …));
doc.add(new IntField("Age", 24, …));
doc.add(new TextField("Description", description, …));
Now, let’s understand the fields
Building document
§ Field.Store
  – NO: don't store the field value in the index
  – YES: store the field value in the index
§ Field.Index
  – ANALYZED: tokenize with an Analyzer
  – NOT_ANALYZED: do not tokenize
  – NO: do not index this field
  – Other options

doc.add(new StringField("Last_Name", "Banerjee", Field.Store.YES));
doc.add(new StringField("First_Name", "Arkadeep", Field.Store.YES));
doc.add(new IntField("Age", 24, Field.Store.YES));
doc.add(new TextField("Description", desc, Field.Store.NO));
Indexing
§ IndexWriter: primary class for indexing
  – Stores the index in a directory
  – RAMDirectory also possible (in memory)

StandardAnalyzer analyzer = new StandardAnalyzer();
FSDirectory dir = FSDirectory.open(new File("directory_path"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(dir, config);
w.addDocument(document);
w.close();
What would we like to search with?
Lucene Analyzer
§ Tokenizes the text in the Fields
§ Common Analyzers
  – WhitespaceAnalyzer: splits tokens on whitespace
  – SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  – StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, …
Analysis example
§ "The quick brown fox jumped over the lazy dog"
§ WhitespaceAnalyzer
  – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ SimpleAnalyzer
  – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ StopAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
§ StandardAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analysis example 2
§ "XY&Z Corporation – xyz@example.com"
§ WhitespaceAnalyzer
  – [XY&Z] [Corporation] [-] [xyz@example.com]
§ SimpleAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StopAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StandardAnalyzer
  – [xy&z] [corporation] [xyz@example.com]
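The SimpleAnalyzer and StopAnalyzer behavior shown above can be simulated in a few lines of plain Java. This is only a conceptual sketch (the class name `AnalyzerSketch` and the tiny stop-word list are invented), not the actual `org.apache.lucene.analysis` classes, which do considerably more — e.g. StandardAnalyzer keeps an e-mail address as a single token.

```java
import java.util.*;

// Conceptual simulation of SimpleAnalyzer and StopAnalyzer token streams
class AnalyzerSketch {
    // Illustrative stop list; Lucene's default English list is longer
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "is", "of", "to");

    // SimpleAnalyzer: split on non-letters, then lowercase
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^a-zA-Z]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    // StopAnalyzer: SimpleAnalyzer output minus stop words
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }
}
```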
Searching
§ IndexReader and IndexSearcher

int k = 10;
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(".//index")));
IndexSearcher searcher = new IndexSearcher(reader);

Query q = new QueryParser("Description", analyzer).parse(query);

TopDocs docs = searcher.search(q, k);
ScoreDoc[] hits = docs.scoreDocs;
Query and QueryParser

QueryParser parser = new QueryParser(Version.LUCENE_40, "Last_Name",
        new StandardAnalyzer());
Query query = parser.parse(q);

§ QueryParser
  – Need to parse the query in the same way the documents were indexed
  – Tell the query which field it should use (field based search)
  – Use the same analyzer
TopDocs and ScoreDoc

TopDocs docs = searcher.search(q, k);
ScoreDoc[] hits = docs.scoreDocs;

§ Search returns TopDocs
  – Reference to the top ranked documents returned by search
§ TopDocs has ScoreDoc(s)
  – Each ScoreDoc is a single document
Getting the results

System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " +
            d.get("First_Name") + " " + d.get("Last_Name"));
}

§ Get the required fields from the documents
Adding/deleting Documents

void addDocument(Document d);
void addDocument(Document d, Analyzer a);

Important: Need to ensure that Analyzers used at indexing time are consistent with Analyzers used at searching time

// deletes docs containing term or matching
// query. The term version is useful for
// deleting one document.
void deleteDocuments(Term term);
void deleteDocuments(Query query);
Index format
§ Each Lucene index consists of one or more segments
  – A segment is a standalone index for a subset of documents
  – All segments are searched
  – A segment is created whenever IndexWriter flushes adds/deletes
§ Periodically, IndexWriter will merge a set of segments into a single segment
  – Policy specified by a MergePolicy
  – Segments are grouped into levels
  – Segments within a group are roughly equal size (in log space)
  – Once a level has enough segments, they are merged into a segment at the next level up
§ Explicitly invoke optimize() to merge segments
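The level-based merging described above can be sketched with a small simulation. This is an invented illustration in the spirit of Lucene's LogMergePolicy, not its actual API: each flush creates a size-1 segment, and whenever `mergeFactor` segments of the same size accumulate, they merge into one segment `mergeFactor` times larger.

```java
import java.util.*;

// Simulation of logarithmic segment merging (illustrative, not Lucene's API)
class MergeSketch {
    // Simulate `flushes` flushes and return the resulting segment sizes, sorted
    static List<Integer> addFlushes(int flushes, int mergeFactor) {
        List<Integer> segments = new ArrayList<>(); // each entry is a segment size
        for (int i = 0; i < flushes; i++) {
            segments.add(1); // a flush creates a small segment
            boolean merged = true;
            while (merged) {
                merged = false;
                // count how many segments exist at each size level
                Map<Integer, Integer> counts = new HashMap<>();
                for (int s : segments) counts.merge(s, 1, Integer::sum);
                for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                    if (e.getValue() >= mergeFactor) {
                        int size = e.getKey();
                        // replace mergeFactor same-size segments with one bigger one
                        for (int r = 0; r < mergeFactor; r++)
                            segments.remove(Integer.valueOf(size));
                        segments.add(size * mergeFactor);
                        merged = true;
                        break;
                    }
                }
            }
        }
        Collections.sort(segments);
        return segments;
    }
}
```

With mergeFactor 10, one hundred flushes collapse into a single size-100 segment, while twelve flushes leave three segments (sizes 1, 1, 10) — segments within a level stay roughly equal in log space.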
Searching a changing index

Directory dir = FSDirectory.open(...);
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);

Above reader does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a new IndexReader.

IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
Near-real-time search

IndexWriter writer = ...;
IndexReader reader = writer.getReader();
IndexSearcher searcher = new IndexSearcher(reader);

// Now let us say there's a change to the index using writer
writer.addDocument(newDoc);

// reopen() and getReader() force writer to flush
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
Query Syntax

Query expression                   Document matches if…
java                               Contains the term java in the default field
java junit / java OR junit         Contains the term java or junit or both in the default field (the default operator can be changed to AND)
+java +junit / java AND junit      Contains both java and junit in the default field
title:ant                          Contains the term ant in the title field
title:extreme –subject:sports      Contains extreme in the title and not sports in subject
(agile OR extreme) AND java        Boolean expression matches
title:"junit in action"            Phrase matches in title
title:"junit action"~5             Proximity matches (within 5) in title
java*                              Wildcard matches
java~                              Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]  Range matches
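The required (+) and prohibited (–) clause semantics in the table can be illustrated with a toy matcher over in-memory token sets. This is an invented sketch of the boolean logic only (class name `QuerySketch` is hypothetical), not Lucene's actual QueryParser, which also handles fields, phrases, wildcards, and analysis.

```java
import java.util.*;

// Toy evaluation of +required / -prohibited / optional query clauses
class QuerySketch {
    // A doc matches if it has every +term, no -term, and (when there are
    // no +terms) at least one optional term (default operator OR)
    static boolean matches(Set<String> docTokens, String query) {
        List<String> required = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        List<String> optional = new ArrayList<>();
        for (String clause : query.split("\\s+")) {
            if (clause.startsWith("+")) required.add(clause.substring(1));
            else if (clause.startsWith("-")) prohibited.add(clause.substring(1));
            else optional.add(clause);
        }
        for (String t : required) if (!docTokens.contains(t)) return false;
        for (String t : prohibited) if (docTokens.contains(t)) return false;
        if (required.isEmpty() && !optional.isEmpty()) {
            for (String t : optional) if (docTokens.contains(t)) return true;
            return false;
        }
        return true;
    }
}
```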
Programmatically constructed queries
§ TermQuery, TermRangeQuery
§ NumericRangeQuery
§ PrefixQuery
§ BooleanQuery
§ PhraseQuery, WildcardQuery
Source and acknowledgements
§ Slides by Manning and Nayak:
  http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
§ The Lucene tutorial website: http://www.lucenetutorial.com
§ Apache Lucene: http://lucene.apache.org