keval-dalasaniya
Why combine Hadoop and Solr, two cutting-edge open source technologies.
Indexing with the Solr search server and the Hadoop framework.
Indexing
• Indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
• The purpose of storing an index is to optimize speed and performance in finding documents.
• Without an index, the search engine would scan every document.
• The additional storage required for the index, and the considerable increase in the time required for an update, are traded off for the time saved during information retrieval.
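The trade-off above can be illustrated with a minimal inverted index, sketched here in plain Python. This is not how Solr stores its index internally; the sample documents and the helper names (`tokenize`, `build_index`, `search`, `scan`) are assumptions made for illustration only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Indexed lookup: intersect the posting sets of all query terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def scan(docs, query):
    """Baseline without an index: re-read every document for each query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return {doc_id for doc_id, text in docs.items()
            if all(term in tokenize(text) for term in terms)}

# Hypothetical sample documents.
DOCS = {
    1: "Hadoop stores large data sets",
    2: "Solr provides full-text search",
    3: "Hadoop and Solr index large data",
}
INDEX = build_index(DOCS)
```

Building `INDEX` costs extra memory and must be repeated or updated when documents change, but each `search` call then touches only the posting sets for the query terms, whereas `scan` reads every document every time.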
Why Hadoop + Solr?
• The data set outgrows the storage capacity of a single physical machine.
• Distributed filesystems are more complex than regular disk filesystems.
• One of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS.
• HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
• Hadoop doesn’t require expensive, highly reliable hardware to run on.
Continue…
• A program written in other frameworks may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines.
• This may involve rewriting the program several times.
• Hadoop is specifically designed to have a very flat scalability curve.
• In Hadoop, very little work, if any, is required for that same program to run on a much larger amount of hardware.
• The Hadoop platform manages the data and hardware resources and provides dependable performance growth proportionate to the number of machines available.
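One way to see why the same program can run unchanged on more hardware is the map/reduce programming model that Hadoop uses. The sketch below is a local, single-process simulation in Python and does not use any Hadoop API; `map_doc`, `reduce_term`, `run_job`, and the sample documents are hypothetical names chosen for illustration. The point is that the map and reduce functions never mention the number of workers.

```python
import re
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield term, doc_id

def reduce_term(term, doc_ids):
    """Reduce step: merge all ids seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))

def run_job(docs, num_workers):
    """Simulate a job: split the input, map each split, group by key, reduce."""
    items = sorted(docs.items())
    # Each split stands in for the input handled by one machine.
    splits = [items[i::num_workers] for i in range(num_workers)]
    grouped = defaultdict(list)
    for split in splits:
        for doc_id, text in split:
            for term, value in map_doc(doc_id, text):
                grouped[term].append(value)  # shuffle: group values by key
    return dict(reduce_term(term, ids) for term, ids in grouped.items())

# Hypothetical sample documents.
SAMPLE = {
    1: "Hadoop stores large data sets",
    2: "Solr provides full-text search",
    3: "Hadoop and Solr index large data",
}
```

Calling `run_job(SAMPLE, 1)` and `run_job(SAMPLE, 3)` produces the same index; only the way the input is split changes. In a real cluster the framework, not the program, decides how the splits are placed on machines.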
Continue…
• Highly fault-tolerant
• Suitable for applications with large data sets
• A web (HTTP) browser can be used to browse the files of an HDFS instance.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
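Because Solr is driven over HTTP with JSON or XML payloads, a client needs nothing more than a URL and a request body. The sketch below only builds the requests and does not send them. It assumes a Solr instance at the default local address, a hypothetical collection named `docs`, and hypothetical fields `id` and `title`; it also assumes a Solr version whose `/update` handler accepts JSON by content type.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local Solr address
COLLECTION = "docs"                        # hypothetical collection name

def select_url(query, rows=10):
    """Build the URL for a search against the /select handler, asking for JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{COLLECTION}/select?{params}"

def update_request(documents):
    """Build a POST request that sends documents to /update as a JSON array."""
    body = json.dumps(documents).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/{COLLECTION}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical document with assumed field names.
REQUEST = update_request([{"id": "1", "title": "Indexing with Hadoop"}])
URL = select_url("title:hadoop", rows=5)
```

Passing `REQUEST` or `URL` to `urllib.request.urlopen` would perform the call against a running server; the response to the select request would be a JSON document listing the matching entries.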
SolrCloud
• New in Solr 4.0
• Easier scaling
• Centralized config
• Fault-tolerant indexing and querying
• Uses Apache ZooKeeper as a registry
[Figure: SolrCloud. Labels in the diagram: master, ZooKeeper, three Solr servers, three slaves.]
Technology and Platform
Technology: Hadoop, Solr
Front End: Solr
Back End: Hadoop framework, Solr search server
Thank you