Indexing with Solr search server and Hadoop framework


Indexing
• Indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
• The purpose of storing an index is to optimize speed and performance in finding documents.
• Without an index, the search engine would scan every document.
• The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
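The trade-off above can be illustrated with a toy inverted index. This is a minimal sketch, not how Solr stores its index: the sample documents and the function names `build_index`, `search_scan`, and `search_index` are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it.

    This is the extra storage and update-time cost paid up front.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_scan(docs, term):
    """Without an index: inspect every document on every query."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items()
            if term in text.lower().split()}


def search_index(index, term):
    """With an index: a single dictionary lookup per query term."""
    return index.get(term.lower(), set())


# Hypothetical sample documents.
docs = {
    1: "Hadoop stores large data sets",
    2: "Solr indexes documents for search",
    3: "Hadoop and Solr work together",
}
index = build_index(docs)
```

Both search functions return the same document ids; the indexed version simply avoids touching every document at query time.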

Why Hadoop + Solr?
• The data set outgrows the storage capacity of a single physical machine.
• Distributed filesystems are more complex than regular disk filesystems.
• One of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS.
• HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
• Hadoop doesn't require expensive, highly reliable hardware to run on.
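Why replication lets a filesystem survive node failure on cheap hardware can be shown with a toy model. This sketch assumes a simple round-robin placement, which is not the rack-aware policy HDFS actually uses; the node names and both function names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), so the chosen nodes are distinct.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have a replica on a live node."""
    return {block for block, holders in placement.items() if holders - failed}


# Hypothetical four-node cluster.
nodes = ["node1", "node2", "node3", "node4"]
replicated = place_blocks(num_blocks=8, nodes=nodes, replication=3)
unreplicated = place_blocks(num_blocks=8, nodes=nodes, replication=1)
```

With three replicas, losing two of the four nodes leaves every block readable; with a single copy, losing one node already loses the blocks it held.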

Continued…
• A program written in other frameworks may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines.
• This may involve rewriting the program several times.
• Hadoop is specifically designed to have a very flat scalability curve.
• In Hadoop, very little work, if any, is required for that same program to run on a much larger amount of hardware.
• The Hadoop platform manages the data and hardware resources and provides dependable performance growth proportionate to the number of machines available.
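The flat scalability curve comes from writing the job as map and reduce functions that do not know how many machines exist. The following is a single-process sketch of that idea applied to index building, not Hadoop's API: `map_doc`, `reduce_term`, `run_job`, and the `num_workers` parameter are illustrative names, and the "workers" are simulated by list partitions.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_term(term, doc_ids):
    """Reduce step: collapse all doc ids for a term into a sorted list."""
    return term, sorted(doc_ids)


def run_job(docs, num_workers):
    """Simulate the framework: partition input, map, group by key, reduce."""
    items = sorted(docs.items())
    # Each simulated worker receives every num_workers-th document.
    partitions = [items[i::num_workers] for i in range(num_workers)]
    grouped = defaultdict(list)
    for partition in partitions:
        for doc_id, text in partition:
            for term, emitted_id in map_doc(doc_id, text):
                grouped[term].append(emitted_id)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical sample documents.
docs = {
    1: "Hadoop stores large data sets",
    2: "Solr indexes documents for search",
    3: "Hadoop and Solr work together",
}
result_small = run_job(docs, num_workers=1)
result_large = run_job(docs, num_workers=3)
```

Only `num_workers` changes between the two runs; `map_doc` and `reduce_term` stay untouched, which is the property the slide describes.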

Continued…
• Highly fault-tolerant
• Suitable for applications with large data sets
• An HTTP browser can be used to browse the files of an HDFS instance.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
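Besides the browser view, HDFS also exposes files over HTTP through the WebHDFS REST interface, where a directory listing is requested with `op=LISTSTATUS`. The sketch below only builds such a URL; it assumes WebHDFS is enabled on the cluster, and the host name, port, path, and the helper name `webhdfs_url` are placeholders rather than values from this deck.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op="LISTSTATUS"):
    """Build a WebHDFS URL for an absolute HDFS path.

    host and port are assumptions; the NameNode HTTP port depends on the
    Hadoop version and cluster configuration.
    """
    query = urlencode({"op": op})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and directory.
url = webhdfs_url("namenode.example.com", 9870, "/user/data")
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, would return the directory listing as JSON on a cluster where the interface is available.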

Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards-Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
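Because Solr's interfaces are plain HTTP with XML or JSON responses, a search is just a URL. The sketch below builds a request against the standard `select` handler with `wt=json`; the core name `articles`, the field `title`, and the helper name `solr_select_url` are hypothetical, and 8983 is Solr's default port.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a Solr search URL that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical core and field names on a local Solr instance.
url = solr_select_url("http://localhost:8983", "articles", "title:hadoop")
```

Any HTTP client can fetch this URL, which is what makes Solr easy to use as the front end named later in the deck.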

SolrCloud
• New in Solr 4.0
• Easier scaling
• Centralized config
• Fault tolerant indexing and querying
• Using Apache ZooKeeper as registry

[Figure: SolrCloud diagram. Labels: master, three slaves, three Solr servers, ZooKeeper]

Technology and Platform

Technology: Hadoop, Solr
Front End: Solr
Back End: Hadoop framework, Solr search server

Thank you