Building a Search Engine Using Lucene

Building a Search Engine Using Apache Lucene/Solr

Road Map
• Problem Definition
• A Basic Search Engine Pipeline
• Meet Lucene
• Lucene API Examples
• Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.)
• Applied Lucene (Real Examples)

Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: Searching for a needle in a haystack while more hay keeps being added to the stack!
- SQL Databases Cons ( > 500,000,000 records …)
- Scalability
- Decentralization

A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Building the holding structure
• Ranking: Sorting the data
• Searching: Reading that holding structure

Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting, Calculating Term Vectors, Token Filtration, Index Inversion, etc.

What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Application related: Indexing, Searching
• High performance, a decade of research
• Heavily supported, simply customized
• No dependencies

What Lucene Ain’t
• A complete search engine

• An application

• A crawler

• A document filter/recognizer

Lucene Roles
[Diagram: Rich Document → Gather → Parse → Make Doc → Index (indexing side); Search UI → Search App (e.g. webapp) → Search → Index (searching side)]

Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported

Lucene Query Types
• Single Term vs. Multi-Term: “+name:camel +type:animal”
• Wildcard Queries: “text:wonder*”
• Fuzzy Queries: “room~0.8”
• Range Queries: “date:[25/5/2000 TO *]”
• Grouped Queries: “text:(animal AND small)”
• Proximity Queries: "hamlet macbeth"~10
• Boosted Queries: “hamlet^5.0 AND macbeth”

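API Sample I (Indexing)

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    // Opens the index stored in indexDir; "true" creates it or overwrites an existing one.
    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    // Adds one Lucene document per file in dataDir that the filter accepts.
    public void index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) return; // dataDir is not a readable directory
        for (File f : files) {
            if (f.isDirectory() || (filter != null && !filter.accept(f))) continue;
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f)));
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}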

Indexing Pipeline (Simplified)

[Diagram: Document → Tokenizer → TokenFilter → DocumentWriter → (add) → Inverted Index]

Analysis Basic Types

"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - [email protected]"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [[email protected]]
SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:   [xy&z] [corporation] [[email protected]]
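
The differences between the first three analyzers can be approximated in plain Java. This is only a rough sketch and does not use the Lucene API: the class name AnalyzerSketch, its method names, and the short stop-word list are assumptions made for illustration, and Lucene's real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // Hypothetical, shortened stop-word list (an assumption, not Lucene's own list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Roughly what a whitespace analyzer does: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Roughly what a simple analyzer does: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Roughly what a stop analyzer does: the simple analysis minus stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

The choice of analyzer matters at both ends: the query text is normally analyzed the same way as the indexed text, otherwise terms such as "The" and "the" never match.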

The Inverted Index (In a nutshell)
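
As a rough illustration of the idea, an inverted index maps each term to the documents that contain it, so a lookup reads one entry instead of scanning every document. The sketch below is plain Java, not Lucene's on-disk format; the class name InvertedIndexSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text (lowercase, split on non-alphanumerics) and records docId under each term.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of all documents containing the term, in ascending order.
    public List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox");
        index.add(2, "The lazy dogs");
        index.add(3, "A quick dog");
        System.out.println(index.search("quick")); // prints [1, 3]
        System.out.println(index.search("the"));   // prints [1, 2]
    }
}
```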

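API Sample II (Searching)

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true); // true = open the index read-only

    // Lucene 2.9/3.x constructor: the Version argument comes first.
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    TopDocs hits = is.search(query, 10); // top 10 hits by score
    System.err.println("Found " + hits.totalHits + " document(s)");

    for (int i = 0; i < hits.scoreDocs.length; i++) {
        ScoreDoc scoreDoc = hits.scoreDocs[i];
        Document doc = is.doc(scoreDoc.doc); // load the stored fields of the hit
        System.out.println(doc.get("filename"));
    }

    is.close();
}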

Index Update
• Lucene doesn’t have an in-place update mechanism. So?

• Incremental Indexing (Index Merging)

• Delete + Add = Update

• Index Optimization
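
The "Delete + Add = Update" point can be sketched on a toy in-memory index. This is plain Java under stated assumptions, not the Lucene API: the class UpdateSketch and its methods are hypothetical. In Lucene itself, IndexWriter.updateDocument(Term, Document) follows the same pattern by deleting the documents that contain the term and then adding the new document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UpdateSketch {
    // term -> keys of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    public void add(String key, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(key);
            }
        }
    }

    // Removes every posting of the document, then drops terms left without documents.
    public void delete(String key) {
        postings.values().forEach(keys -> keys.remove(key));
        postings.values().removeIf(Set::isEmpty);
    }

    // There is no in-place edit: an update is a delete followed by an add.
    public void update(String key, String newText) {
        delete(key);
        add(key, newText);
    }

    public Set<String> search(String term) {
        return new TreeSet<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), new HashSet<>()));
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add("doc1", "red apple");
        index.update("doc1", "green pear");
        System.out.println(index.search("apple")); // prints []
        System.out.println(index.search("pear"));  // prints [doc1]
    }
}
```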

API Sample III (Deleting)

Via IndexReader
• void deleteDocument(int docNum): Deletes the document numbered docNum.
• int deleteDocuments(Term term): Deletes all documents that have a given term indexed.

Via IndexWriter
• void deleteAll(): Deletes all documents in the index.
• void deleteDocuments(Query query): Deletes the document(s) matching the provided query.
• void deleteDocuments(Query[] queries): Deletes the document(s) matching any of the provided queries.
• void deleteDocuments(Term term): Deletes the document(s) containing term.
• void deleteDocuments(Term[] terms): Deletes the document(s) containing any of the terms.

Some Statistics
• Dependent on Lucene.NET (a .NET port of Lucene)

Local Testing (index and search are on the same device)

Dataset Size    Indexing (min.)    Retrieval (ms)    Opt. (min.)
4.3 GB          ~32, 180 MB        ~50 -> 300        0.2
40 GB           ~360, 2.6 GB       ~100 -> 3000      3.2

Over Network Testing (file server for the index file, standalone searching workstations)

Dataset Size    Indexing (min.)    Retrieval (ms)    Opt. (min.)
4.3 GB          X, 180 MB          ~300 -> 700       X
40 GB           X, 2.6 GB          ~400 -> 4500      X

Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any servlet container (Apache Tomcat, Jetty)
• A REST service
• It has an administration interface
• Built-in configuration with Apache Tika (a repository of parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra

Solr Administration Interface

Solr Architecture (The Big Picture)

Note: It includes JSON, PHP, Python, …, not only XML.

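Communication with Solr (Sending Docs)
• Direct connection OR through APIs (SolrJ, SolrNET)

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// make a connection to the Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");

// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);

// add docs to Solr, then commit so they become searchable
server.add(docs);
server.commit();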

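Communication with Solr (Searching)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// match all documents, sorted by first name (ascending)
final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);

// "server" is the SolrServer connection used for sending docs
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
    final String firstName = (String) doc.getFieldValue("firstName");
    final String id = (String) doc.getFieldValue("id");
}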

Some Statistics

Note 1: We’re sending HTTP POST requests to the Solr server, which can take much longer than the pure Lucene.NET model.

Note 2: Consider a server with incoming requests from everywhere; OS-level queuing issues can cause some delay, depending on the queuing strategy.

Dataset Size    Indexing (min.)               Retrieval (ms)    Opt. (min.)
4.3 GB          ~39.5, 169 MB                 ~300 -> 3000      0.203
40 GB           ~400 (Not accurate), 40 GB    ~300 -> 10000     ~7 (Not accurate)

Lucene/Solr Users
• Instagram (geo-search API)
• Netflix (Generic search feature)
• SourceForge (Generic search feature)
• Eclipse (Documentation search)
• LinkedIn (Recently, Job Search)
• Krugle (Source Code Search)
• Wikipedia (Recently, Generic Content Search)


Thank You
