Building a Search Engine Using Apache Lucene/Solr


Page 1: Building a Search Engine Using Lucene

Building a Search Engine Using Apache Lucene/Solr

Page 2: Building a Search Engine Using Lucene

Road Map
• Problem Definition
• A Basic Search Engine Pipeline
• Meet Lucene
• Lucene API Examples
• Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.)
• Applied Lucene (Real Examples)

Page 3: Building a Search Engine Using Lucene

Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: searching for a needle in a haystack while more hay keeps being added to the stack!
- SQL database cons (> 500,000,000 records …)
- Scalability
- Decentralization

Page 4: Building a Search Engine Using Lucene

A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Building the holding structure
• Ranking: Sorting the data
• Searching: Reading that holding structure

Behind the Scenes: Analysis, Tokenization, Query Parsing, Boosting, Calculating Term Vectors, Token Filtration, Index Inversion, etc.

Page 5: Building a Search Engine Using Lucene

What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Application related: Indexing, Searching
• High performance, a decade of research
• Heavily supported, simply customized
• No dependencies

Page 6: Building a Search Engine Using Lucene

What Lucene Ain’t
• A complete search engine
• An application
• A crawler
• A document filter/recognizer

Page 7: Building a Search Engine Using Lucene

Lucene Roles
[Diagram. Indexing side: Rich Document -> Gather -> Parse -> Make Doc -> Index. Searching side: Search UI -> Search App (e.g. webapp) -> Search -> Index.]

Page 8: Building a Search Engine Using Lucene

Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported

Page 9: Building a Search Engine Using Lucene

Lucene Query Types
• Single-Term vs. Multi-Term: “+name:camel +type:animal”
• Wildcard Queries: “text:wonder*”
• Fuzzy Queries: “room~0.8”
• Range Queries: “date:[25/5/2000 TO *]”
• Grouped Queries: “text:(animal AND small)”
• Proximity Queries: “hamlet macbeth”~10
• Boosted Queries: “hamlet^5.0 AND macbeth”
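The fuzzy query “room~0.8” asks for terms whose similarity to "room" is at least 0.8. A minimal sketch of how such a score can be derived from Levenshtein edit distance is shown below. The class name `FuzzySketch` and the formula `1 - distance / shorterLength` are assumptions for illustration, modeled on the classic fuzzy-similarity idea; this is not Lucene's own code, and newer Lucene versions express fuzziness as a maximum number of edits instead.

```java
class FuzzySketch {
    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) costs j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" costs i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical similarity score: 1.0 for identical terms, lower as edits accumulate.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(similarity("room", "roam"));
        System.out.println(similarity("room", "room"));
    }
}
```

Under this assumed formula, "roam" is one edit away from "room" and scores 0.75, so it would fall just below a 0.8 threshold.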

Page 10: Building a Search Engine Using Lucene

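API Sample I (Indexing)

import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    // Create (or overwrite) the index stored in indexDir.
    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    // Index every readable file in dataDir that the filter accepts.
    public void index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) return; // dataDir is missing or not a directory
        for (File f : files) {
            if (f.isDirectory() || !f.canRead() || (filter != null && !filter.accept(f))) continue;
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f)));
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}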

Page 11: Building a Search Engine Using Lucene

Indexing Pipeline (Simplified)

[Diagram: Document -> Tokenizer -> TokenFilter -> DocumentWriter -> add -> Inverted Index]

Page 12: Building a Search Engine Using Lucene

Analysis Basic Types

"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
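The differences between the first three analyzers can be imitated in a few lines of plain Java. This is only a rough sketch of the behaviour listed above, not Lucene's implementation: the class `AnalyzerSketch`, its method names, and the tiny stop-word set are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Hypothetical, deliberately tiny stop-word list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Whitespace-style: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Simple-style: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Stop-style: the simple tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop("The quick brown fox jumped over the lazy dogs"));
    }
}
```

StandardAnalyzer is left out on purpose: its grammar-based tokenizer (which keeps tokens such as e-mail addresses intact) is too involved for a short sketch.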

Page 13: Building a Search Engine Using Lucene

The Inverted Index (In a nutshell)
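As a rough illustration of the idea, an inverted index maps each term to the set of documents that contain it, so a search reads a short posting list instead of scanning every document. The sketch below is a toy in-memory version in plain Java; the class `InvertedIndexSketch` and its methods are hypothetical and far simpler than Lucene's on-disk segment format.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenize on non-alphanumeric characters, lowercase, and record the doc id per term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Single-term lookup: the posting list for the term, or an empty set.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    // Multi-term AND query: intersect the posting lists of all terms.
    SortedSet<Integer> lookupAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs = lookup(term);
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox");
        index.add(2, "The lazy dogs");
        index.add(3, "A quick lazy fox");
        System.out.println(index.lookup("quick"));
        System.out.println(index.lookupAll("quick", "lazy"));
    }
}
```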

Page 14: Building a Search Engine Using Lucene

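API Sample II (Searching)

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true); // true = open read-only
    try {
        // Parse the user query against the "contents" field.
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(q);
        TopDocs hits = is.search(query, 10); // fetch the top 10 hits
        System.err.println("Found " + hits.totalHits + " document(s)");

        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("filename"));
        }
    } finally {
        is.close(); // release the index even if parsing or searching fails
    }
}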

Page 15: Building a Search Engine Using Lucene

Index Update
• Lucene has no in-place update mechanism. So?
• Incremental Indexing (Index Merging)
• Delete + Add = Update (IndexWriter.updateDocument(Term, Document) does both in one call)
• Index Optimization
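The "Delete + Add = Update" idea can be sketched with a toy append-only store: documents are never modified, an update marks the old copy as deleted and appends a new one, and an optimization pass later drops the deleted copies. The class `UpdateSketch` and its methods are hypothetical and only mirror the concept; they are not Lucene's API or storage format.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UpdateSketch {
    private final List<String[]> docs = new ArrayList<>(); // each entry is {id, text}; append-only
    private final Set<Integer> deleted = new HashSet<>();   // positions marked as deleted

    void add(String id, String text) {
        docs.add(new String[] {id, text});
    }

    // Mark every live document with this id as deleted; returns how many were marked.
    int delete(String id) {
        int count = 0;
        for (int n = 0; n < docs.size(); n++) {
            if (docs.get(n)[0].equals(id) && deleted.add(n)) {
                count++;
            }
        }
        return count;
    }

    // Update = delete the old version, then add the new one.
    void update(String id, String text) {
        delete(id);
        add(id, text);
    }

    // Text of the live document with this id, or null if there is none.
    String get(String id) {
        for (int n = 0; n < docs.size(); n++) {
            if (!deleted.contains(n) && docs.get(n)[0].equals(id)) {
                return docs.get(n)[1];
            }
        }
        return null;
    }

    int storedCount() {
        return docs.size();
    }

    int liveCount() {
        return docs.size() - deleted.size();
    }

    // "Optimization": physically drop the documents that were only marked as deleted.
    void optimize() {
        List<String[]> live = new ArrayList<>();
        for (int n = 0; n < docs.size(); n++) {
            if (!deleted.contains(n)) {
                live.add(docs.get(n));
            }
        }
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }
}
```

After `add("1", "old")` followed by `update("1", "new")`, two copies are stored but only one is live; `optimize()` reclaims the space held by the deleted copy.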

Page 16: Building a Search Engine Using Lucene

API Sample III (Deleting)

Via IndexReader
    void deleteDocument(int docNum)
        Deletes the document numbered docNum.
    int deleteDocuments(Term term)
        Deletes all documents that have a given term indexed.

Via IndexWriter
    void deleteAll()
        Deletes all documents in the index.
    void deleteDocuments(Query query)
        Deletes the document(s) matching the provided query.
    void deleteDocuments(Query[] queries)
        Deletes the document(s) matching any of the provided queries.
    void deleteDocuments(Term term)
        Deletes the document(s) containing term.
    void deleteDocuments(Term[] terms)
        Deletes the document(s) containing any of the terms.

Page 17: Building a Search Engine Using Lucene

Some Statistics
• Based on Lucene.NET (a .NET port of Lucene)

Local Testing (index and search on the same device)

Dataset Size    Indexing (min)    Retrieval (ms)    Opt. (min)
4.3 GB          ~32, 180 MB       ~50 -> 300        0.2
40 GB           ~360, 2.6 GB      ~100 -> 3000      3.2

Over Network Testing (file server for the index files, standalone searching workstations)

Dataset Size    Indexing (min)    Retrieval (ms)    Opt. (min)
4.3 GB          X, 180 MB         ~300 -> 700       X
40 GB           X, 2.6 GB         ~400 -> 4500      X

Page 18: Building a Search Engine Using Lucene

Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any servlet container (Apache Tomcat, Jetty)
• A REST service
• It has an administration interface
• Built-in integration with Apache Tika (a repository of parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra

Page 19: Building a Search Engine Using Lucene

Solr Administration Interface

Page 20: Building a Search Engine Using Lucene

Solr Architecture (The Big Picture)

Note: Response formats include JSON, PHP, Python, and others, not only XML.
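The response format is chosen per request. As a hedged sketch, the helper below builds a query URL for Solr's `select` handler using the standard `q`, `wt`, and `sort` parameters. The class `SolrUrlSketch` and its method are hypothetical, and the base URL and the `firstName` field are assumptions borrowed from the SolrJ samples in this deck; a real deployment usually also has a core or collection name in the path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class SolrUrlSketch {
    // Build <baseUrl>/select?q=...&wt=...[&sort=...] with URL-encoded values.
    static String selectUrl(String baseUrl, String q, String wt, String sort) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);   // the query, e.g. *:* for "match everything"
        params.put("wt", wt); // response writer, e.g. xml or json
        if (sort != null) {
            params.put("sort", sort);
        }

        StringBuilder url = new StringBuilder(baseUrl);
        if (!baseUrl.endsWith("/")) {
            url.append('/');
        }
        url.append("select");
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator).append(e.getKey()).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8080/solr/", "*:*", "json", "firstName asc"));
    }
}
```

Opening the resulting URL in a browser or with any HTTP client returns the matching documents in the requested format, which is a quick way to inspect an index without writing client code.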

Page 21: Building a Search Engine Using Lucene

Communication with Solr (Sending Docs)
• Direct connection OR through APIs (SolrJ, SolrNET)

// make a connection to the Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();

Page 22: Building a Search Engine Using Lucene

Communication with Solr (Searching)

// match all documents, sorted by first name
final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
    final String firstName = (String) doc.getFieldValue("firstName");
    // the type of "id" depends on the schema, so avoid a direct cast to String
    final String id = String.valueOf(doc.getFieldValue("id"));
}

Page 23: Building a Search Engine Using Lucene

Some Statistics

Note 1: We are sending HTTP POST requests to the Solr server, which can add considerable overhead compared with the pure Lucene.NET model.

Note 2: Consider a server with incoming requests from everywhere: OS-level request queuing can cause some delay, depending on the queuing strategy.

Dataset Size    Indexing (min)                Retrieval (ms)    Opt. (min)
4.3 GB          ~39.5, 169 MB                 ~300 -> 3000      0.203
40 GB           ~400 (not accurate), 40 GB    ~300 -> 10000     ~7 (not accurate)

Page 24: Building a Search Engine Using Lucene

Lucene/Solr Users
• Instagram (geo-search API)
• Netflix (generic search feature)
• SourceForge (generic search feature)
• Eclipse (documentation search)
• LinkedIn (recently, job search)
• Krugle (source code search)
• Wikipedia (recently, generic content search)


Page 26: Building a Search Engine Using Lucene

Thank You