19
1 Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem Patrick Hunt (@phunt) Bay Area Search Meetup, April 2014

Adding Search to the Hadoop Ecosystem

Embed Size (px)

Citation preview

Page 1: Adding Search to the Hadoop Ecosystem

1

Finding a needle in a stack of needles - adding Search to theHadoop Ecosystem

Patrick Hunt (@phunt)Bay Area Search Meetup, April 2014

Page 2: Adding Search to the Hadoop Ecosystem

Agenda

• Big Data and Search – setting the stage• Cloudera Search’s Architecture

• Apache Lucene/Solr• Apache Flume• Apache HBase• Apache MapReduce• Apache Sentry

• Near Real Time and Batch Use Cases• Conclusion and Q&A

Page 3: Adding Search to the Hadoop Ecosystem

Why Search?

An Integrated Frameworkon Apache Hadoop

One pool of data

One security framework

One set of system resources

One management interface

Page 4: Adding Search to the Hadoop Ecosystem

Search Simplifies Interaction

• User Goals• Explore• Navigate• Correlate

• Experts know MapReduce• Savvy people know SQL• Everyone knows Search!

Page 5: Adding Search to the Hadoop Ecosystem

What is Cloudera Search?

• Full-text, interactive search and faceted navigation• Batch, near real-time, and on-demand indexing• Apache Solr integrated with CDH

• Established, mature search with vibrant community• Separate runtime like MapReduce, Impala• Incorporated as part of the Apache Hadoop ecosystem

• Open Source• 100% Apache, 100% Solr• Standard Solr APIs

Page 6: Adding Search to the Hadoop Ecosystem

Challenges

• Scalable/Reliable Index Storage• Near Real Time (NRT) indexing• Scalable Batch Indexing• Usability

Page 7: Adding Search to the Hadoop Ecosystem

Apache Lucene/Solr

• Lucene - full text search library• Solr – search service on Lucene• SolrCloud – distributed search

• We are using version 4 (4.4 currently)

Page 8: Adding Search to the Hadoop Ecosystem

Integrate Solr/Lucene with HDFS

• Lucene Directory Abstraction• Implemented HDFSDirectory using HDFS client library• Read/Write index files directly to HDFS

• Solr DirectoryFactory Abstraction• HDFSDirectoryFactory plugs HDFSDirectory into Solr• Configuration – Solr and HDFS

Page 9: Adding Search to the Hadoop Ecosystem

Cloudera Upstream Contributions - Solr

• SOLR-3911 - Directory/DirectoryFactory now first class• Solr Replication now uses Directory abstraction• Solr Admin UI no longer assumes local directory access

• SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS• HDFSDirectoryFactory/HDFSDirectory implementation

• SOLR-4655 - The Overseer should assign node names by default.• SOLR-3706 - Ship setup to log with log4j• SOLR-4494 - Clean up and polish Collections API• SOLR-4718 -Improvements to configurability

• Configuration now entirely through ZooKeeper (optional)• Many more improvements/cleanup/hardening/…

Page 10: Adding Search to the Hadoop Ecosystem

Distributed Search on Hadoop

FlumeHue UI

Custom UI

Custom App

Solr

Solr

Solr

SolrCloudquery

query

query

index

Hadoop Cluster

MR

HDFS

index

HBaseindex

Page 11: Adding Search to the Hadoop Ecosystem

Near Real Time Indexing with Flume

Log File Solr and Flume• Data ingest at scale• Flexible extraction and

mapping• Indexing at data ingest

HDFS

Flume Agent

Indexer

OtherLog File

Flume Agent

Indexer

11

Page 12: Adding Search to the Hadoop Ecosystem

Apache Flume - MorphlineSolrSink

• Flume – reliable/scalable log collection• Created a Flume “Sink” for indexing events to Solr• Integrates Cloudera Morphlines (ETL framework)

Page 13: Adding Search to the Hadoop Ecosystem

Near Real Time indexing of HBase

HDFS

HBase

inte

racti

ve lo

ad

Indexer(s)

Trig

gers

on

upda

tes Solr server

Solr serverSolr serverSolr serverSolr server

Sear

ch

+ =planet-sized tabular dataimmediate access & updatesfast & flexible informationdiscovery

B I G D ATA D ATA M A N A G E M E N T

Page 14: Adding Search to the Hadoop Ecosystem

Lily HBase Indexer

• Collaboration between NGData & Cloudera• NGData are creators of the Lily data management platform

• Lily HBase Indexer• Service which acts as a HBase replication listener• Replication updates trigger indexing of updates (rows)• Integrates Cloudera Morphlines library for ETL of rows• AL2 licensed on github https://github.com/ngdata

Page 15: Adding Search to the Hadoop Ecosystem

Scalable Batch Indexing

Index shard

Files

Index shard

Indexer

Files

Solr server

Solr server

GOLIVE

15

HDFS/HBase

Solr and MapReduce• Flexible, scalable batch

indexing• Start serving new indices

with no downtime• On-demand indexing, cost-

efficient re-indexingIndexer

Page 16: Adding Search to the Hadoop Ecosystem

16

Scalable Batch Indexing

Mapper:Parse input into

indexable document

Mapper:Parse input into

indexable document

Mapper:Parse input into

indexable document

Index shard 1

Index shard 2

Arbitrary reducing steps of indexing and merging

End-Reducer (shard 1):Index document

End-Reducer (shard 2):Index document

Page 17: Adding Search to the Hadoop Ecosystem

MapReduce Indexer

• MapReduce Job with two parts1) Scan HDFS for files (or HBase for records) to be indexed2) Mapper/Reducer indexing step

• Mapper extracts content via Cloudera Morphlines• Reducer uses Lucene to index documents directly to HDFS

• “golive”• Cloudera created this to bridge the gap between NRT (low latency,

expensive) and Batch (high latency, cheap at scale) indexing• Results of MR indexing operation are immediately merged into a

live SolrCloud serving cluster• No downtime for users• No NRT expense• Linear scale out to the size of your MR cluster

Page 18: Adding Search to the Hadoop Ecosystem

Simple, Customizable Search Interface

Hue• Simple UI• Navigated, faceted drill

down• Customizable display• Full text search,

standard Solr API and query language