Introduction to Information Retrieval with Lucene, by Stylianos Gkorilas


Presentation at the Java Hellenic User Group (JHUG) about the open source search engine Lucene


Page 1: IR with lucene

Introduction to Information Retrieval with Lucene

By Stylianos Gkorilas

Page 2: IR with lucene

Introductions

Presenter: Architect/Development Team Leader @ Trasys Greece, working on Java EE projects for European Agencies

IR (Information Retrieval): the tracing and recovery of specific information from stored data. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Lucene: Open Source under the Apache Software License (http://lucene.apache.org)
Founder: Doug Cutting
0.01 release in March 2000 (SourceForge)
1.2 release in June 2002 (first Apache Jakarta release)
Became its own top-level Apache project in 2005
Current version is 3.1

Page 3: IR with lucene

More Lucene Intro…

Lucene is a high-performance, scalable IR library (not a ready-to-use application). A number of full-featured search applications are built on top of it (more later…).

Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)

Lucene 'Powered By' apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucene-java/PoweredBy

Page 4: IR with lucene

Components of a Search Application (1/4)

Acquire Content: gather and scope the content, e.g. from the web with a spider or crawler, from a CMS, a database, or a file system.

Projects helping:
Solr: handles RDBMS and XML feeds and rich documents through Tika integration
Nutch: web crawler, a sister project at Apache
Grub: open source web crawler

Page 5: IR with lucene

Components of a Search Application (2/4)

Build Document: define the document, the unit of the search engine. It has fields; denormalization is involved.

Projects helping (usually the same frameworks cover both this and the previous step):
Compass and its evolution, ElasticSearch
Hibernate Search
DBSight
Oracle/Lucene Integration

Page 6: IR with lucene

Components of a Search Application (3/4)

Analyze Document: handled by Analyzers, both built-in and contributed, which are built with tokenizers and token filters.

Index Document: through the Lucene API or your framework of choice.

Search User Interface / Render Results: application specific.

Page 7: IR with lucene

Components of a Search Application (4/4)

Query Builder: Lucene provides one; frameworks provide extensions, but so does the application itself, e.g. for advanced search.

Run Query: retrieve documents by running the query built. Three common theoretical models: the Boolean model, the vector space model, and the probabilistic model.

Administration: e.g. tuning options.

Analytics reporting.

Page 8: IR with lucene

How Lucene models content

Documents

Fields

Denormalization of content

Flexible Schema

Inverted Index
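To make the inverted index concrete, a tiny invented example: given d1 = "the quick fox" and d2 = "the lazy fox", the index maps each term to the documents containing it: the → {d1, d2}, quick → {d1}, lazy → {d2}, fox → {d1, d2}. A search is then a lookup by term instead of a scan over documents.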

Page 9: IR with lucene

Basic Lucene Classes

Indexing: IndexWriter, Directory, Analyzer, Document, Field

Searching: IndexSearcher, Query, TopDocs, Term, QueryParser
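A minimal search sketch tying these classes together (Lucene 3.x API; reuses the directory and the "post" field of the indexing example on the next slide):

IndexSearcher searcher = new IndexSearcher(directory, true);  // read-only searcher
Query query = new TermQuery(new Term("post", "meeting"));
TopDocs topDocs = searcher.search(query, 10);                 // top 10 hits
for (ScoreDoc sd : topDocs.scoreDocs) {
    Document hit = searcher.doc(sd.doc);                      // load stored fields
    System.out.println(hit.get("post") + " score=" + sd.score);
}
searcher.close();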

Page 10: IR with lucene

Basic Indexing

Adding documents:

RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc);  // add the document to the index
writer.close();

Deleting and updating documents: see the sketch below.

Field options: Store, Analyze, Norms, Term vectors, Boost.
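A minimal delete/update sketch (assumes a hypothetical unique "id" field):

writer.deleteDocuments(new Term("id", "42"));      // only marks matching docs as deleted
writer.updateDocument(new Term("id", "42"), doc);  // atomic delete-then-add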

Page 11: IR with lucene

Scoring – The formula
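The formula itself did not survive this transcript; as given in Lucene in Action for this era of Lucene, the score is the sum over the query terms of the product of the factors defined below:

score(q,d) = Σ_{t in q} [ tf(t in d) · idf(t) · boost(t.field in d) · lengthNorm(t.field in d) · coord(q,d) · queryNorm(q) ]

where: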

tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.

idf(t): Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf.

boost(t.field in d): Field & Document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.

lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.

coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents.

queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.

Page 12: IR with lucene

Querying – the API

Variety of Query class implementations: TermQuery, PhraseQuery, TermRangeQuery, NumericRangeQuery, PrefixQuery, BooleanQuery, WildcardQuery, FuzzyQuery, MatchAllDocsQuery, …
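For instance, clauses combine programmatically through BooleanQuery (a minimal sketch; the field and terms are hypothetical):

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("subject", "trachea")), BooleanClause.Occur.MUST);     // required clause
bq.add(new TermQuery(new Term("subject", "esophagus")), BooleanClause.Occur.SHOULD); // optional clause
TopDocs hits = searcher.search(bq, 10);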

Page 13: IR with lucene

Querying - Example

private void indexSingleFieldDocs(Field[] fields) throws Exception {
  IndexWriter writer = new IndexWriter(directory,
      new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
  for (int i = 0; i < fields.length; i++) {
    Document doc = new Document();
    doc.add(fields[i]);
    writer.addDocument(doc);
  }
  writer.optimize();
  writer.close();
}

public void wildcard() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
  IndexSearcher searcher = new IndexSearcher(directory, true);
  Query query = new WildcardQuery(new Term("contents", "?ild*"));
  // "?ild*" matches wild, mild and mildew (one char, then "ild", then any suffix), but not child
  TopDocs matches = searcher.search(query, 10);
}

Page 14: IR with lucene

Querying - QueryParser

Query query = new QueryParser(Version.LUCENE_31, "subject",  // 3.x parsers take a Version argument
    analyzer).parse("(clinical OR ethics) AND methodology");

Query syntax examples:
trachea AND esophagus
trachea esophagus (the default join condition is OR)
cough AND (trachea OR esophagus)
trachea NOT esophagus
full_title:trachea (field-scoped term)
"trachea disease" (phrase)
"trachea disease"~5 (proximity)
is_gender_male:y
[2010-01-01 TO 2010-07-01] (range)
esophaguz~ (fuzzy)
trachea^5 esophagus (term boost)

Page 15: IR with lucene

Analyzers - Internals

Analyzers operate at indexing and at querying time.

Inside an analyzer:
An analyzer operates on a TokenStream. A token has a text value and metadata: start and end character offsets, a token type, a position increment, and optionally application-specific bit flags and a byte[] payload.
TokenStream is abstract; Tokenizer and TokenFilter are the concrete ones. A Tokenizer reads chars and produces tokens; a TokenFilter ingests tokens and produces new ones. The composite pattern is implemented, so they form a chain of one another.

Page 16: IR with lucene

Analyzers – building blocks

Analyzers can be created by combining token streams (order is important), as the sketch after this list shows.

Building blocks provided in core: CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, LetterTokenizer, LowerCaseTokenizer, SinkTokenizer, StandardTokenizer, LowerCaseFilter, StopFilter, PorterStemFilter, TeeTokenFilter, ASCIIFoldingFilter, CachingTokenFilter, LengthFilter, StandardFilter
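A minimal custom-analyzer sketch (Lucene 3.x; uses the pre-Version constructor forms that 3.1 still ships, though deprecated):

public class JhugAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);  // chars -> tokens
    stream = new LowerCaseFilter(stream);                  // lowercase first...
    stream = new StopFilter(true, stream,                  // ...then drop stop words
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return stream;
  }
}

Swapping LowerCaseFilter and StopFilter would silently keep capitalized stop words like "The", which is why order matters.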

Page 17: IR with lucene

Analyzers - core

WhitespaceAnalyzer: splits tokens at whitespace.

SimpleAnalyzer: divides text at non-letter characters and lowercases.

StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words.

KeywordAnalyzer: treats the entire text as a single token.

StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words.

Page 18: IR with lucene

Analyzers – Example (1/2)

Analyzing "The JHUG meeting is on this Saturday"

WhitespaceAnalyzer:

[The] [JHUG] [meeting] [is] [on] [this] [Saturday]

SimpleAnalyzer:

[the] [jhug] [meeting] [is] [on] [this] [saturday]

StopAnalyzer:

[jhug] [meeting] [saturday]

StandardAnalyzer:

[jhug] [meeting] [saturday]

Page 19: IR with lucene

Analyzers – Example (2/2)

Analyzing "XY&Z Corporation - [email protected]"

WhitespaceAnalyzer:

[XY&Z] [Corporation] [-] [[email protected]]

SimpleAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer:

[xy&z] [corporation] [[email protected]]

Page 20: IR with lucene

Analyzers – Beyond the built in

Language-specific analyzers, under contrib/analyzers: language-specific stemming and stop-word removal.

Sounds-like analyzer, e.g. a MetaphoneReplacementAnalyzer that transforms terms to their phonetic roots.

SynonymAnalyzer
Nutch analysis: bigrams for stop words
Stemming analysis:

The PorterStemFilter stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."

SnowballAnalyzer: stemming for many European languages; a usage sketch follows.
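A minimal sketch (assumes the contrib snowball jar on the classpath):

Analyzer english = new SnowballAnalyzer(Version.LUCENE_31, "English");
// "running", "runs" and "run" all reduce to the stem "run"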

Page 21: IR with lucene

Filters

Narrow the search space

Overloaded search methods that accept Filter instances

Examples: TermRangeFilter

NumericRangeFilter

PrefixFilter

QueryWrapperFilter

SpanQueryFilter

ChainedFilter

Page 22: IR with lucene

Example: Filters for Security

Constraints known at indexing time: index the constraint as a field, then search by wrapping a TermQuery on the constraint field with a QueryWrapperFilter.

Factoring in information at search time: a custom filter. The Filter accesses an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions. It returns a DocIdSet to Lucene, whose bit positions match the document numbers: enabled bits mean the document at that position is available to be searched against the query; unset bits mean the documents won't be considered in the search. A sketch follows.
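A minimal sketch of such a custom filter (Lucene 3.x Filter API; the "acl" field and group value are hypothetical stand-ins for the external privilege store):

public class PrivilegeFilter extends Filter {
  private final String allowedGroup; // hypothetical permission term

  public PrivilegeFilter(String allowedGroup) {
    this.allowedGroup = allowedGroup;
  }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());  // all bits start unset
    TermDocs termDocs = reader.termDocs(new Term("acl", allowedGroup));
    try {
      while (termDocs.next()) {
        bits.set(termDocs.doc());  // enabled bit: this document may be searched
      }
    } finally {
      termDocs.close();
    }
    return bits;  // OpenBitSet implements DocIdSet
  }
}

Usage: searcher.search(query, new PrivilegeFilter("staff"), 10);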

Page 23: IR with lucene

Internals - Concurrency

Any number of IndexReaders may be open; IndexSearchers use underlying IndexReaders.

Only one IndexWriter at a time; locking is enforced with a write lock file.

IndexReaders may stay open while the index is being changed by an IndexWriter; a reader sees the changes only after the writer commits and the reader is reopened.

Both are thread-safe/friendly classes.

Page 24: IR with lucene

Internals - Indexing concepts

The index is made up of segment files.
Deleting documents does not actually delete; it only marks documents for deletion.
Index writes are buffered and flushed periodically.
Segments need to be merged: automatically by the IndexWriter, or through explicit calls to optimize.

There is the notion of a commit (as you would expect), which has 4 steps:
1. Flush buffered documents and deletions.
2. Sync files: force the OS to write to stable storage of the underlying I/O system.
3. Write and sync the segments_N file.
4. Remove old commits.

Page 25: IR with lucene

Internals - Transactions

Two-phase commit is supported: prepareCommit performs steps 1, 2, and most of 3.

Lucene implements the ACID transactional model:
Atomicity: all-or-nothing commit.
Consistency: e.g. an update always means both a delete and an add.
Isolation: IndexReaders cannot see what has not been committed.
Durability: the index is not corrupted and persists in storage.
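A minimal sketch of the two-phase protocol on IndexWriter (the database participant is hypothetical):

writer.prepareCommit();    // steps 1-2 and most of 3; still undoable
try {
    commitDatabaseSide();  // the other resource in the distributed transaction (hypothetical)
    writer.commit();       // finish step 3 and remove old commits
} catch (Exception e) {
    writer.rollback();     // discard all changes since the last commit (closes the writer)
}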

Page 26: IR with lucene

Architectures

Cluster nodes that share a remote file system index: slower than local; possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS).

Index in a database: much slower.

Separate write and read indexes (replication): relies on the IndexDeletionPolicy feature of Lucene; out of the box in Solr and ElasticSearch.

Autonomous search servers (e.g. Solr, ElasticSearch): loose coupling through JSON or XML.

Page 27: IR with lucene

Frameworks – Compass: Document definition via JPA mapping

<compass-core-mapping package="eu.emea.eudract.model.entity">
  <class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
    <id name="ctaIdentificationId">
      <meta-data>cta_id</meta-data>
    </id>
    <dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name</dynamic-meta-data>
    <property name="fullTitle">
      <meta-data>cta_full_title</meta-data>
    </property>
    <property name="sponsorProtocolVersionDate">
      <meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
    </property>
    <property name="isResubmission">
      <meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
    </property>
    <component name="eudractNumber" />
  </class>
  <class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
    <property name="eudractNumberId">
      <meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
      <meta-data>eudract_number</meta-data>
    </property>
    <property name="paediatricClinicalTrial">
      <meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial</meta-data>
    </property>
  </class>
  .....
</compass-core-mapping>

Page 28: IR with lucene

Frameworks – Solr: Document definition via DB mapping

<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
  <document name="products">
    <entity name="item" query="select * from item">
      <field column="ID" name="id" />
      <field column="NAME" name="name" />
      <field column="MANU" name="manu" />
      <field column="WEIGHT" name="weight" />
      <field column="PRICE" name="price" />
      <field column="POPULARITY" name="popularity" />
      <field column="INSTOCK" name="inStock" />
      <field column="INCLUDES" name="includes" />
      <entity name="feature" query="select description from feature where item_id='${item.ID}'">
        <field name="features" column="description" />
      </entity>
      <entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
        <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
          <field column="description" name="cat" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

Page 29: IR with lucene

Frameworks – Compass/Lucene: Configuration

<compass name="default">
  <setting name="compass.transaction.managerLookup">org.compass.core.transaction.manager.OC4J</setting>
  <setting name="compass.transaction.factory">org.compass.core.transaction.JTASyncTransactionFactory</setting>
  <setting name="compass.transaction.lockPollInterval">400</setting>
  <setting name="compass.transaction.lockTimeout">90</setting>
  <setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
  <!--<setting name="compass.engine.connection">jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
  <!--<setting name="compass.engine.store.jdbc.connection.provider.class">org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider</setting>-->
  <!--<setting name="compass.engine.ramBufferSize">512</setting>-->
  <!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
  <setting name="compass.converter.dashHandlingConverter.type">eu.emea.eudract.compasssearch.DashHandlingConverter</setting>
  <setting name="compass.converter.shortToYesNoNaConverter.type">eu.emea.eudract.compasssearch.ShortToYesNoNaConverter</setting>
  <setting name="compass.converter.shortToPerDayOrTotalConverter.type">eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter</setting>
  <setting name="compass.engine.store.jdbc.dialect">org.apache.lucene.store.jdbc.dialect.OracleDialect</setting>
  <setting name="compass.engine.analyzer.default.type">org.apache.lucene.analysis.standard.StandardAnalyzer</setting>
</compass>

Page 30: IR with lucene

Cool extra features - Spellchecking

You will need a dictionary of valid words; you could use the unique terms in your index.

Given the dictionary you could:
Use a sounds-like algorithm such as Soundex or Metaphone.
Or use n-grams. E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel; as 4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would match 5 grams, with 2 shared between the 3-grams and 4-grams. This would score high!

To present or not to present (the suggestion)? Maybe use the Levenshtein distance.

Other ideas: use the rest of the terms in the query to bias; maybe combine distance with the frequency of the term; check result numbers in the initial and the corrected searches.
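A minimal sketch with the contrib spellchecker (org.apache.lucene.search.spell), which applies the n-gram technique above; the "contents" field and the directories are assumptions:

SpellChecker spell = new SpellChecker(spellIndexDirectory);
spell.indexDictionary(new LuceneDictionary(indexReader, "contents")); // unique index terms as the dictionary
String[] suggestions = spell.suggestSimilar("squirel", 5);            // top 5 candidate corrections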

Page 31: IR with lucene

Even More features

Sorting: use a field for sorting instead of relevance, e.g. when you use MatchAllDocsQuery. Beware: it uses FieldCache, which resides in RAM! (See the sketch after this list.)

SpanQueries: distance between terms (span); a family of queries like SpanNearQuery, SpanOrQuery, and others.

Synonyms: injection during indexing or during searching? A MultiPhraseQuery is appropriate for search time; injection during indexing will allow faster searches. Leverage a synonyms knowledge base; a good strategy is to convert it into an index. The key thing to understand is that synonyms must be injected at the same position increments.

Spatial searches: the answer to the query "Greek restaurants near me". An efficient technique is to use grids: assign non-unique grid numbers to areas (e.g. in a Mercator space) and index documents with a field containing the grid numbers that match their longitude and latitude.

MoreLikeThis: one use of term vectors.

Function queries: e.g. add boosts for fields at search time.
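A minimal sorting sketch (the "date" field is hypothetical and must be indexed un-analyzed):

Sort sort = new Sort(new SortField("date", SortField.STRING, true));     // true = reverse (descending)
TopDocs docs = searcher.search(new MatchAllDocsQuery(), null, 10, sort); // null = no filter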

Page 32: IR with lucene

Some last things to bear in mind

It would be wise to back up your index. You can have hot backups, supported through Lucene's IndexDeletionPolicy mechanism (SnapshotDeletionPolicy); a sketch follows.
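A minimal hot-backup sketch (Lucene 3.0-era API; copying to backup storage is left abstract):

SnapshotDeletionPolicy snapshotter =
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
IndexWriter writer = new IndexWriter(directory, analyzer, snapshotter,
    IndexWriter.MaxFieldLength.UNLIMITED);
IndexCommit commit = snapshotter.snapshot();  // pin the current commit point
try {
    for (String fileName : commit.getFileNames()) {
        // copy fileName from the index directory to backup storage
    }
} finally {
    snapshotter.release();                    // let the pinned files be reclaimed
}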

Performance has some trade-offs: search latency, indexing throughput, near-real-time results, index replication, index optimization.

Resource consumption: disk space, file descriptors, memory.

'Luke' is a really handy tool. You can repair a corrupted index (corrupted segments are just lost… d'oh!).

Page 33: IR with lucene

Resources

Book: Lucene in Action

Solr: http://lucene.apache.org/solr/

Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model

IR Links: http://wiki.apache.org/lucene-java/InformationRetrieval

Page 34: IR with lucene

That’s it

Questions?