Apache Lucene

Searching

Agenda
Search Engine
Lucene Java
  Features
  Code Example
  Scalability
Solr
Nutch

About Speaker
Abhiram Gandhe
9+ years of experience on the Java/J2EE platform
Consultant eCommerce Architect with Delivery Cube
Pursuing PhD from VNIT Nagpur on Link Prediction on Graph Databases
M.Tech. (Comp. Sci. & Engg.) MNNIT Allahabad, B.E. (Comp. Tech.) YCCE Nagpur…

What is a Search Engine?
Answer: Software that
  builds an index on text
  answers queries using the index

“But we already have a database for that…”
A search engine offers
  Scalability
  Relevance ranking
  Integration of different data sources (email, web pages, files, databases, …)

Works on words, not substrings
  auto != automatic, automobile
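To make the "words, not substrings" point concrete, here is a minimal sketch in plain Java (no Lucene involved). The class and method names are made up for this illustration, and the tokenization rule (split on anything that is not a letter or digit) is an assumption, not what any particular Lucene analyzer does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class WordMatchSketch {
    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> words(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
    }

    // Word-level match: the query must equal a whole token.
    static boolean matchesWord(String text, String query) {
        return words(text).contains(query.toLowerCase(Locale.ROOT));
    }

    // Substring match, roughly what a database LIKE '%query%' does.
    static boolean matchesSubstring(String text, String query) {
        return text.toLowerCase(Locale.ROOT).contains(query.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String text = "Automatic automobile wash";
        System.out.println(matchesSubstring(text, "auto"));   // true
        System.out.println(matchesWord(text, "auto"));        // false
        System.out.println(matchesWord(text, "automobile"));  // true
    }
}
```

The substring search finds "auto" inside "Automatic", while the word-level search only matches complete tokens.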

Indexing Process:
  Convert document
  Extract text and metadata
  Normalize text
  Write (inverted) index
Example:
  Document 1: Apache Lucene at JUGNagpur
  Document 2: JUGNagpur conference
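The two example documents can be turned into a toy inverted index with a few lines of plain Java. This is only a sketch of the idea, not Lucene's actual data structure: the class name is hypothetical, normalization is reduced to lowercasing, and no stop words are removed.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Normalize (lowercase), split into terms, and record each term for this document.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Ids of all documents containing the term (empty if the term is unknown).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    Map<String, SortedSet<Integer>> postings() {
        return postings;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene at JUGNagpur");
        index.add(2, "JUGNagpur conference");
        index.postings().forEach((term, docs) -> System.out.println(term + " -> " + docs));
    }
}
```

Running it lists each term with the documents it occurs in; "jugnagpur" maps to both documents, every other term to one.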

What is Apache Lucene?

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java”

- from http://lucene.apache.org/

What is Apache Lucene?
Lucene is specifically an API, not an application.
The hard parts have been done; the easy programming has been left to you.
You can build a search application that is specifically suited to your needs.
You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Availability
Freely available (no cost)
Open source
  Apache License, version 2.0: http://www.apache.org/licenses/LICENSE-2.0
Download from: http://www.apache.org/dyn/closer.cgi/lucene/java/

Apache Lucene Overview
The Apache Lucene™ project develops open-source search software, including:

Lucene Core, our flagship sub-project, provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Solr™ is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

Open Relevance Project is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.

PyLucene is a Python port of the Core project.

Lucene Java Features
Powerful query syntax
Create queries from user input or programmatically
Ranked search
Flexible queries
  Phrases, wildcards, etc.
Field-specific queries
  e.g. title, artist, album
Fast indexing
Fast searching
Sorting by relevance or by other fields
Large and active community
Apache License 2.0

Lucene Query Example
JUGNagpur
JUGNagpur AND Lucene    (same as +JUGNagpur +Lucene)
JUGNagpur OR Lucene
JUGNagpur NOT PHP    (same as +JUGNagpur -PHP)
"Java Conference"
title:Lucene
J?GNagpur
JUG*
schmidt~    (matches schmidt, schmit, schmitt)
price:[100 TO 500]
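The fuzzy query schmidt~ matches terms that are "close" to schmidt. Lucene's fuzzy matching is based on edit distance; the sketch below uses plain Levenshtein distance in standalone Java to show why schmit and schmitt count as close. It does not call Lucene, the class name is hypothetical, and the exact distance variant and threshold Lucene applies depend on the version.

```java
class FuzzyMatchSketch {
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"schmidt", "schmit", "schmitt"}) {
            System.out.println(candidate + ": " + editDistance("schmidt", candidate));
        }
    }
}
```

Both schmit (one deletion) and schmitt (one substitution) are a single edit away from schmidt.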

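Index
For this demo, we're going to create an in-memory index from some strings.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // The analyzer decides how text is split into terms.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    // RAMDirectory keeps the whole index in memory.
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "Lucene in Action", "193398817");
    addDoc(w, "Lucene for Dummies", "55320055Z");
    addDoc(w, "Managing Gigabytes", "55063554A");
    addDoc(w, "The Art of Computer Science", "9900333X");
    // Closing the writer commits the documents to the index.
    w.close();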

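Index (continued)
addDoc() is what actually adds documents to the index.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));   // tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));   // indexed as a single term
        w.addDocument(doc);
    }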

Note the use of TextField for content we want tokenized, and StringField for id fields and the like, which we don't want tokenized.

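Query
We read the query from the first command-line argument (falling back to "lucene" if none is given), parse it and build a Lucene Query out of it.

    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    String querystr = args.length > 0 ? args[0] : "lucene";
    // "title" is the default field for terms that do not name a field; parse() can throw ParseException.
    Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);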

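Search
Using the Query, we create an IndexSearcher to search the index. A TopScoreDocCollector is then instantiated to collect the top 10 scoring hits.

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopScoreDocCollector;

    int hitsPerPage = 10;
    // DirectoryReader.open replaces IndexReader.open, which is deprecated in Lucene 4.0.
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;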

Display
Now that we have results from our search, we display them to the user.

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);   // load the stored fields of the hit
        System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
    }
    // Close the reader only once the documents are no longer needed.
    reader.close();

Lucene Internals

Everything is a Document
A document can represent anything textual:
  Word document
  DVD (the textual metadata only)
  Website member (name, ID, etc.)

A Lucene Document need not refer to an actual file on disk; it could also represent a row in a relational database.

Each developer is responsible for turning their own data sets into Lucene Documents. Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.

Indexes
The type of index used in Lucene and other full-text search engines is sometimes also called an “inverted index”.
Indexes track term frequencies
Every term maps back to a Document
This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.
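As a rough illustration of how term-to-document mappings answer a multi-term query, the following standalone Java sketch keeps a term frequency per document and intersects the posting lists of all query terms. It is a simplification under stated assumptions: the class and method names are invented here, text is only lowercased and split on non-word characters, and Lucene's real index format and scoring are far more elaborate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class PostingsSketch {
    // term -> (docId -> number of times the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // How often the term occurs in the given document (0 if it does not occur).
    int termFrequency(String term, int docId) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap())
                       .getOrDefault(docId, 0);
    }

    // Documents containing every query term: intersect the posting lists.
    Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings
                    .getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap())
                    .keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        index.add(1, "Lucene in Action");
        index.add(2, "Lucene for Dummies");
        index.add(3, "Managing Gigabytes");
        System.out.println(index.searchAll("lucene"));            // [1, 2]
        System.out.println(index.searchAll("lucene", "action"));  // [1]
        System.out.println(index.termFrequency("lucene", 1));     // 1
    }
}
```

No document is scanned at query time; only the posting lists of the query terms are read, which is what makes lookups fast.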

Basic Indexing
An index consists of one or more Lucene Documents.
1. Create a Document
   A Document consists of one or more Fields; each Field is a name-value pair.
   Example: a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item.
   Add one or more Fields to the Document.
2. Add the Document to an Index
   Indexing involves adding Documents to an IndexWriter.
3. The Indexer will analyze the Document
   We can provide specialized Analyzers such as StandardAnalyzer.
   Analyzers control how the text is broken into terms, which are then used to index the document.
   Analyzers can be used to remove stop words and perform stemming.

Lucene comes with a default Analyzer that works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts. Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own. Lucene even includes a number of “stemming” algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
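What an analyzer does can be sketched in a few lines of plain Java: tokenize, lowercase, drop stop words, and stem. Everything in this example is an assumption made for illustration: the class name, the tiny stop-word list, and especially the deliberately naive "stemmer" (strip a trailing "s"). It is not how StandardAnalyzer or Lucene's stemmers are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // A tiny, hypothetical stop-word list; real analyzers ship larger ones.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "for"));

    // Very naive "stemmer": strips a trailing "s" from words longer than three characters.
    static String stem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Tokenize on non-letter/digit characters, lowercase, drop stop words, then stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Computer Science")); // [art, computer, science]
        System.out.println(analyze("Managing Gigabytes"));          // [managing, gigabyte]
        System.out.println(analyze("Lucene for Dummies"));          // [lucene, dummie]
    }
}
```

The last line shows why naive rules are not enough: "Dummies" becomes "dummie" rather than "dummy", which is the kind of incorrect normalization that language-aware stemmers are meant to avoid.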

Basic Searching
Searching requires an index to have already been built.
Create a Query
  Usually via QueryParser, which parses user input, or programmatically with query classes such as MultiPhraseQuery
Open an Index
Search the Index
  E.g. via IndexSearcher
  Use the same Analyzer as before
Iterate through returned Documents
  Extract out needed results
  Extract out result scores (if needed)

It is important that Queries use the same (or a very similar) Analyzer as the one used when the index was created, because the Analyzer normalizes the input text. To find Documents, the query text must be normalized in the same way as the original data was when it was indexed.
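The effect of mismatched normalization can be reproduced without Lucene. In this hedged sketch (all names are hypothetical), the "analyzer" is reduced to lowercasing: terms are lowercased at index time, so a query term only hits if it goes through the same step.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SameAnalyzerSketch {
    // Stand-in for an analyzer: the only normalization here is lowercasing.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Build term -> document ids, normalizing every token at index time.
    static Map<String, Set<Integer>> buildIndex(String... docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].split("\\s+")) {
                index.computeIfAbsent(normalize(token), t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Exact lookup of an already prepared term.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex("Lucene in Action", "Managing Gigabytes");
        // Query term normalized the same way as the indexed text: the document is found.
        System.out.println(lookup(index, normalize("LUCENE"))); // [0]
        // Raw query term, normalization skipped: the lookup misses.
        System.out.println(lookup(index, "LUCENE"));            // []
    }
}
```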

Lucene Scalability

Scalability Limits
3 main scalability factors:
  Query Rate
  Index Size
  Update Rate

Query Rate Scalability
Lucene is already fast
  Built-in simple cache mechanism
Easy solution for heavy workloads (gives near-linear scaling):
  Add more query servers behind a load balancer
  Can grow as your traffic grows

Index Size Scalability
Can easily handle millions of Documents
  Lucene is very commonly deployed into systems with tens of millions of Documents.
  Although query performance can degrade as more Documents are added to the index, the growth factor is very low. The main limits related to index size that you are likely to run into are disk capacity and disk I/O.
If you need bigger:
  Built-in methods allow queries to span multiple remote Lucene indexes
  Can merge multiple remote indexes at query time
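Merging at query time boils down to collecting the best hits from each index and keeping the overall top N. The standalone Java sketch below illustrates only that merge step; the Hit type, the shard names and the scores are invented for the example, and it does not use Lucene's own multi-index facilities. It also assumes that scores from different indexes are directly comparable, which is not automatically true in practice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MergeHitsSketch {
    // A scored hit coming from one index (shard); all names are illustrative.
    static final class Hit {
        final String shard;
        final int docId;
        final float score;

        Hit(String shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }

        @Override
        public String toString() {
            return shard + "/" + docId + " (" + score + ")";
        }
    }

    // Combine the per-shard result lists and keep the n highest-scoring hits.
    @SafeVarargs
    static List<Hit> mergeTopN(int n, List<Hit>... perShard) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shardA = List.of(new Hit("A", 1, 0.9f), new Hit("A", 2, 0.4f));
        List<Hit> shardB = List.of(new Hit("B", 7, 0.7f), new Hit("B", 8, 0.1f));
        // Best three hits across both shards, highest score first.
        mergeTopN(3, shardA, shardB).forEach(System.out::println);
    }
}
```

For large result sets a bounded priority queue would avoid sorting every hit; the full sort is used here only to keep the sketch short.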

Update Rate Scalability
Lucene is thread-safe
  Can update and query at the same time
I/O is the limiting factor
