DESCRIPTION
Silicon Valley Code Camp 2014: presented by Yann Yu, Systems Engineer, Lucidworks.
Yann Yu Systems Engineer @ Lucidworks
Who am I?
Lucidworks is search.
Technology, Retail, Financial Services, Industrial, Healthcare
Lucidworks is the commercial entity of the Lucene/Solr project.
8M+ total downloads
Solr is both established & growing
250,000+ monthly downloads
Largest community of developers.
2500+ open Solr jobs.
Solr: most widely used search solution on the planet.
Lucidworks: unmatched Solr expertise.
1/3 of the active committers
70% of the open source code is committed
Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
You use Solr every day.
Why would you integrate Hadoop and Solr? (And how would you do that?)
• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals and many other related projects for extensibility
• Open-source, Lucene based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scales
• Uses ZooKeeper for clustering
I have Hadoop, why do I need Solr?
• NoSQL front-end to Hadoop: Enable fast, ad-hoc, search across structured and unstructured big data
• Empower users of all technical ability to interact with, and derive value from, big data — all using a natural language search interface (no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
• Share machine-learning insights created on Hadoop to a broad audience through an interactive medium
Hadoop excels in storing and working with large amounts of data, but has difficulty with frequent, random access to it
I have Solr, why do I need Hadoop?
• Least expensive storage solution in market
• Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr
• Store Solr indexes and transaction logs within HDFS
• Augment Solr data by storing additional information for last-second retrieval in Hadoop
As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing index time and complexity
So what does this solve?
The enterprise storage situation today
• Large enterprises often have data distributed in many different stores, making it hard to know where to start looking
• Employees have to check with others to verify versions of documents
• Even with hosting, knowledge is still largely tribal
Enterprise data deployment
Lucidworks HDFS connector processes documents and sends them to SolrCloud
Enterprise documents are stored in HDFS
Users make ad-hoc, full-text queries across the full content of all documents in Solr
And retrieve source files directly from HDFS as necessary
Standard document storage and search
Sink documents into HDFS
• Documents can be migrated from other file storage systems via Flume or other scripts
• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering, etc.)
Index document contents into Solr
• The Lucidworks Hadoop connector parses content from files using many different tools
• Tika, GrokIngest, CSV mapping, Pig, etc.
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing
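The indexing steps above can be sketched in Python. This is a minimal illustration and not the Lucidworks connector itself: the field names (`id`, `title_s`, `content_txt`), the helper names, and the collection name in the usage note are all assumptions. The only Solr behavior relied on is that the `/update` handler accepts a JSON array of documents.

```python
import json


def build_solr_document(hdfs_path, title, text):
    """Map parsed file content onto Solr fields (field names are hypothetical)."""
    return {
        "id": hdfs_path,        # assumption: the HDFS path doubles as the unique key
        "title_s": title,
        "content_txt": text,
    }


def build_update_request(solr_base, collection, docs):
    """Return (url, body) for a JSON update; sending it is left to the caller."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return url, body
```

The returned body would be sent as an HTTP POST with `Content-Type: application/json`, for example with `urllib.request`. Keeping the HDFS path in the document is one way to let users go back to the source file when they need it.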
Enable users to search and access content
• Users are empowered with ad-hoc, full-text search in Solr
• Provides standard search tools such as autocomplete, more-like-this, spellchecking, faceting, etc.
• Users only access HDFS as needed
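As a rough idea of what such a search looks like on the wire, the sketch below builds a full-text query URL with optional field faceting. The host, collection, and facet field are hypothetical; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_search_url(solr_base, collection, text, facet_field=None, rows=10):
    """Build a /select URL for a full-text query, optionally faceted on one field."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"
```

A search front end would fetch this URL and render both the matching documents and the facet counts returned alongside them.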
The data warehouse
• Enterprises are storing data without a clear plan on how to access it
• The “data warehouse” is full of files, but with no way to pull documents, or to find what you’re looking for
• In some cases, the data is required for compliance and isn’t used otherwise
Log record search
Machine generated log records are sent to Flume.
Flume forwards raw log record to Hadoop for archiving.
Flume simultaneously parses out data in record into a Solr document, forwarding resulting document to Solr
Lucidworks SiLK exposes real-time statistics and analytics to end-users, as well as full-text search
High volume indexing of many small records
Flume archives data in HDFS
• Flume performs minimal work on log files and sends them directly into HDFS for archival
• Under optimal circumstances, the log files are sized to the block size of HDFS
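The idea of sizing archived files to the HDFS block size can be illustrated with a small batching function. This is a conceptual sketch only, not how Flume is configured: the function name is hypothetical, and the 128 MB default is an assumed block size that should be replaced with the cluster's actual setting.

```python
def roll_into_files(records, block_size=128 * 1024 * 1024):
    """Group log lines into batches whose encoded size stays within block_size.

    A single record larger than block_size still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for record in records:
        size = len(record.encode("utf-8")) + 1  # +1 for the newline separator
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be written as one file, so that a file occupies roughly one block instead of many small, partly filled ones.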
Flume submits records to Solr
• Flume processes records, extracting strings, ints, dates, times, and other information into Solr fields
• Once the Solr document is created, it is submitted to Solr for indexing
• This process happens in real-time, allowing for near real-time search
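The extraction step can be sketched as follows. The log line layout (timestamp, level, status, latency, message), the field names, and the function name are assumptions made for illustration; the slides do not specify a log format. Timestamps are assumed to be in UTC and are rendered in the ISO 8601 form that Solr date fields expect.

```python
import re
from datetime import datetime, timezone

# Hypothetical log layout: "2014-10-11 09:15:02 INFO 200 142 GET /index.html"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<status>\d{3}) "
    r"(?P<latency_ms>\d+) "
    r"(?P<message>.*)"
)


def log_line_to_solr_doc(line, doc_id):
    """Parse one log line into a dict of typed Solr fields, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "id": doc_id,
        "timestamp_dt": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),  # Solr date format
        "level_s": match["level"],
        "status_i": int(match["status"]),
        "latency_ms_i": int(match["latency_ms"]),
        "message_txt": match["message"],
    }
```

Returning `None` for lines that do not match lets the caller decide whether to drop them or index them as raw text.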
Real-time analytics dashboard
• Lucidworks SiLK allows users to create simple dashboards through a GUI
• The SiLK dashboard will issue queries to Solr, rendering the received data in tables, graphs, and other plots
• Users can also perform full-text search across the data, allowing for extremely fine granularity
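The slides do not show the queries SiLK issues, but a time-series panel of this kind could plausibly be backed by a Solr range-facet request like the one built below. The function name, host, collection, and field name are hypothetical; the `facet.range*` parameters are standard Solr faceting parameters.

```python
from urllib.parse import urlencode


def build_histogram_url(solr_base, collection, date_field, start, end, gap, query="*:*"):
    """Build a /select URL that returns only range-facet counts (rows=0)."""
    params = [
        ("q", query),
        ("rows", 0),                 # counts only, no documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.range", date_field),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
    ]
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"
```

For example, `start="NOW-1DAY"`, `end="NOW"`, and `gap="+1HOUR"` would ask for hourly counts over the last day, which a dashboard could plot as a bar chart.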
High traffic Solr deployments
• Some users of Solr, especially in the e-commerce case, are running high query volume sites with small document sets
• Master-slave works well enough, but doesn’t allow for NRT and similar features from SolrCloud
E-commerce search: lots of queries, not a lot of updates
Solr is pointed at an index on HDFS, and pulls it up to begin serving queries
Additional Solr machines can be spun up on demand, pulling the index directly from HDFS
Load balancer (or SolrJ) distributes query to active nodes
MapReduce Solr index generation
• Existing product tables or catalogs can be stored in HDFS or HBase, and can continue to be updated as necessary
• Hadoop can utilize the MapReduceIndexerTool to parallelize building of indexes
• As many indexes as necessary can be built in this way
Ad-hoc scaling without manual replication
• Independent Solr nodes (not SolrCloud) can be started up and use the stored index data on HDFS
• These can be spun up in an ad-hoc fashion, allowing for an elastically scalable cluster
• Updates to indexes are versatile: they can be pushed in via new collections or as updates to existing collections
Highly-available search
• New search nodes are simply added to the load balancer or smart-client
• Distributed queries allow for sharded data-sets
• Results from all nodes are guaranteed to be consistent with one another
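The "smart client" side of this setup can be sketched as a simple round-robin selector over the active Solr nodes. The class and method names are hypothetical, and a production client would also need health checks and retries, which are omitted here.

```python
class RoundRobinNodes:
    """Hand out Solr node URLs in rotation; nodes can be added at any time."""

    def __init__(self, node_urls):
        self._nodes = list(node_urls)
        self._index = 0

    def add_node(self, url):
        """Register a newly started Solr node so it begins receiving queries."""
        self._nodes.append(url)

    def next_node(self):
        """Return the next node URL in rotation."""
        if not self._nodes:
            raise RuntimeError("no Solr nodes registered")
        node = self._nodes[self._index % len(self._nodes)]
        self._index += 1
        return node
```

Because every node serves the same index from HDFS, adding capacity amounts to starting another node and calling `add_node` with its URL.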