Cloudera Search Mike Drob

Cloudera search


Page 1: Cloudera search

Cloudera Search
Mike Drob

Page 2: Cloudera search

Who Am I?

Apache Accumulo PMC
Apache Curator PMC

Hobbyist contributor
- Various Apache Projects
- JUnit, JCommander, JLine2

Volunteer with FIRST LEGO League (FLL)

Search/Solr is my ${DayJob}

Page 3: Cloudera search

Agenda

We will cover:
- Overview of projects involved
- Architectural discussion of Solr on Hadoop

We will not cover:
- Performance, Tuning, or Optimizations
- Writing custom applications
- Tutorials (kind of)

Page 4: Cloudera search

Why Search?

Hadoop for Everyone!

Typical case:
- Ingest data to storage engine (HDFS, HBase, etc.)
- Process data (MR, Hive, Impala)

Experts know MapReduce
Savvy users know SQL

Everyone knows Search!

Page 5: Cloudera search

Use Case

Image Credit: Alex Moundalexis; Used With Permission

Page 6: Cloudera search

Use Case

FACETING

HIGHLIGHTING

CONTENT

Page 7: Cloudera search

Use Case

Does not contain “Hadoop” in title...

SCORING

Page 8: Cloudera search

Search on Hadoop History

•Katta – Distributed Lucene
•Blur – Lucene on Hadoop
•SolBase – Lucene + HBase @ Photobucket
•HBASE-3529 – Lucene on HBase
•SOLR-1301 – MR Indexer
•Ad-Hoc

Page 9: Cloudera search

Family Tree

...

Page 10: Cloudera search

Strengthen the Family Bonds

•No need to build something radically new - we have the pieces we need.

•Focus on integration points.

•Create high quality, first class integrations and contribute the work to the projects involved.

•Focus on integration and quality first - then performance and scale.

Page 11: Cloudera search

Very fast and feature rich ‘core’ search engine library.

Compact and powerful, Lucene is an extremely popular full-text search library.

Provides low level APIs for analyzing, indexing, and searching text, along with a myriad of related features.

Just the core - either you write the ‘glue’ or use a higher level search engine built with Lucene.
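
To make "analyzing, indexing, and searching" concrete, here is a toy inverted index in Python. It is only a sketch of the idea, not Lucene's API: the `analyze` function and `TinyIndex` class are hypothetical names, and the analyzer (lowercase, split on non-alphanumerics) is an assumed stand-in for a real Lucene analyzer.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase and split on non-alphanumerics (stand-in for a real analyzer)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class TinyIndex:
    """Hypothetical toy index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term (AND semantics)."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Hadoop for Everyone")
index.add(2, "Search on Hadoop")
index.add(3, "Everyone knows Search")
print(index.search("hadoop search"))
```

Everything beyond this (scoring, highlighting, faceting, on-disk segments) is the part Lucene actually provides.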

Page 12: Cloudera search

Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.

- Wikipedia

Page 13: Cloudera search

Architecture & Terms

[Diagram: physical and logical views of a Solr deployment]
Physical: Host, Node (JVM), Core (Index Dir)
Logical: Collection, Shard 1, Shard 2, Shard 3, Replicas
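
One way to keep the terms straight is to write them down as a data structure. The sketch below is an assumption-laden illustration, not Solr code: the class names and the example collection, core, and node names are all hypothetical. It models a logical collection as shards, each shard as replicas, and each replica as a physical core living on a node.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    core_name: str  # physical: one core, i.e. one index directory
    node: str       # physical: the Solr JVM that hosts the core


@dataclass
class Shard:
    name: str                                     # logical slice of the collection
    replicas: list = field(default_factory=list)  # physical copies of the slice


@dataclass
class Collection:
    name: str                                   # logical index seen by clients
    shards: list = field(default_factory=list)

    def nodes(self):
        """All nodes that host at least one core of this collection."""
        return sorted({r.node for s in self.shards for r in s.replicas})


# Hypothetical two-shard collection spread over three nodes.
collection = Collection("articles", [
    Shard("shard1", [Replica("articles_shard1_replica1", "host1:8983"),
                     Replica("articles_shard1_replica2", "host2:8983")]),
    Shard("shard2", [Replica("articles_shard2_replica1", "host2:8983"),
                     Replica("articles_shard2_replica2", "host3:8983")]),
])
print(collection.nodes())
```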

Page 14: Cloudera search

SolrCloud

Page 15: Cloudera search

Solr Integration

•Read and Write directly to HDFS

•First Class Custom Directory Support in Solr

•Support Solr Replication on HDFS

•Other improvements around usability and configuration

Page 16: Cloudera search

Putting the Index in HDFS

•Extend Lucene's Directory & DirectoryFactory to abstract HDFS implementation

•Solr relies on the FS cache to operate at full speed, while HDFS is not known for its random access speed.

•Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.

•The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the FS cache.
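
The block cache idea can be sketched in a few lines of Python. This is not Blur's or Solr's BlockDirectory; it is a minimal illustration that assumes an LRU eviction policy, a tiny hypothetical `BLOCK_SIZE`, and an in-memory `FILE` standing in for a slow HDFS file. The real cache lives off heap, which plain Python cannot show.

```python
from collections import OrderedDict

BLOCK_SIZE = 8  # hypothetical; tiny so the example is easy to follow


class BlockCache:
    """Serves random-access reads from cached fixed-size blocks (LRU eviction)."""

    def __init__(self, fetch_block, capacity):
        self.fetch_block = fetch_block  # slow backend read, e.g. from HDFS
        self.capacity = capacity        # maximum number of cached blocks
        self.blocks = OrderedDict()     # block number -> bytes, oldest first
        self.misses = 0

    def _block(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)  # mark as most recently used
            return self.blocks[block_no]
        self.misses += 1
        data = self.fetch_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            block_no, start = divmod(offset, BLOCK_SIZE)
            chunk = self._block(block_no)[start:start + (end - offset)]
            if not chunk:  # read past end of file
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)


FILE = bytes(range(32))  # stand-in for an index file stored in HDFS


def fetch_block(block_no):
    return FILE[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]


cache = BlockCache(fetch_block, capacity=2)
first = cache.read(6, 4)   # spans blocks 0 and 1: two backend reads
second = cache.read(0, 8)  # block 0 is hot: no backend read
```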

Page 17: Cloudera search

Putting TransactionLog in HDFS

•TransactionLog is a basic WAL

•HdfsUpdateLog added - extends UpdateLog

•Triggered by setting the UpdateLog dataDir to a path starting with hdfs:/

•Benefits from same extensive testing as used on UpdateLog
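
For readers new to write-ahead logs, the pattern is: record the update durably first, apply it second, and replay the log after a crash. The Python sketch below shows only that pattern. `TinyStore` and the JSON-lines log format are hypothetical, and it writes to a local temporary file rather than HDFS; it is not how UpdateLog or HdfsUpdateLog are implemented.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical in-memory document store protected by a write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.docs = {}
        self._replay()  # recover any updates logged before a restart

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def _apply(self, entry):
        if entry["op"] == "add":
            self.docs[entry["id"]] = entry["doc"]
        elif entry["op"] == "delete":
            self.docs.pop(entry["id"], None)

    def _log(self, entry):
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make the entry durable before applying it

    def add(self, doc_id, doc):
        entry = {"op": "add", "id": doc_id, "doc": doc}
        self._log(entry)
        self._apply(entry)

    def delete(self, doc_id):
        entry = {"op": "delete", "id": doc_id}
        self._log(entry)
        self._apply(entry)


log_path = os.path.join(tempfile.mkdtemp(), "tlog.jsonl")
store = TinyStore(log_path)
store.add("1", {"title": "Cloudera Search"})
store.add("2", {"title": "Solr on HDFS"})
store.delete("1")

recovered = TinyStore(log_path)  # simulates a restart: state comes from the log
```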

Page 18: Cloudera search

Running Solr on HDFS

•Cloudera Manager can do all of this for you.

•Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.

•Set LockType to ‘hdfs’

•Use an UpdateLog dataDir location that begins with ‘hdfs:/’

•e.g. java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lockType=solr.HdfsLockFactory -Dsolr.updatelog=hdfs://host:port/path -jar start.jar

Page 19: Cloudera search

Solr Replication on HDFS

•Take advantage of “distributed filesystem” and allow for something similar to HBase regions.

•If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up if it has the capacity.

[Diagram: three Solr nodes on top of HDFS]
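
The failover idea above can be sketched as a small reassignment function. This is purely illustrative and not Solr's behavior: the function name, the "least-loaded node with spare capacity" policy, and the node and shard names are all assumptions. The point is only that, with the index in HDFS, moving a shard means changing who serves it, not copying data.

```python
def reassign(assignments, live_nodes, capacity):
    """Move shards from dead nodes to live nodes that still have capacity.

    assignments: dict of shard name -> node name
    live_nodes:  set of node names that are still up
    capacity:    maximum number of shards a node may serve (assumed uniform)
    Returns (new assignments, list of shards nobody could take).
    """
    result = {s: n for s, n in assignments.items() if n in live_nodes}
    load = {n: 0 for n in live_nodes}
    for node in result.values():
        load[node] += 1

    unserved = []
    for shard in sorted(s for s in assignments if s not in result):
        candidates = [n for n in sorted(live_nodes) if load[n] < capacity]
        if not candidates:
            unserved.append(shard)
            continue
        target = min(candidates, key=lambda n: load[n])  # least-loaded node
        result[shard] = target
        load[target] += 1
    return result, unserved


# Hypothetical cluster: node-c goes down, its shard is picked up elsewhere.
before = {"shard1": "node-a", "shard2": "node-b", "shard3": "node-c"}
after, unserved = reassign(before, live_nodes={"node-a", "node-b"}, capacity=2)
```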

Page 20: Cloudera search

MR Index Building

•Scalable index creation via map-reduce

•Many initial ‘homegrown’ implementations sent documents from reducer to SolrCloud over HTTP

•To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr

•The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’
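
The "many reducers, then merge down" step can be pictured with sorted lists standing in for partial indexes. This sketch is an assumption, not the MR indexer's code: it supposes that the reducer count is a multiple of the shard count and that consecutive reducers were fed documents for the same shard, and it uses `heapq.merge` as a stand-in for a Lucene index merge.

```python
import heapq


def merge_down(partials, num_shards):
    """Merge per-reducer partial indexes down to num_shards final indexes.

    partials: list of sorted lists of doc ids, one per reducer. Reducers are
              assumed to be grouped by shard: the first len/num_shards
              reducers belong to shard 0, the next group to shard 1, etc.
    """
    if num_shards <= 0 or len(partials) % num_shards != 0:
        raise ValueError("reducer count must be a positive multiple of shards")
    per_shard = len(partials) // num_shards
    shards = []
    for s in range(num_shards):
        group = partials[s * per_shard:(s + 1) * per_shard]
        shards.append(list(heapq.merge(*group)))  # stand-in for an index merge
    return shards


# Four reducers, two shards: reducers 0-1 feed shard 0, reducers 2-3 feed shard 1.
partials = [["a", "d"], ["b"], ["c", "f"], ["e"]]
shards = merge_down(partials, num_shards=2)
```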

Page 21: Cloudera search

MR Index Building

[Diagram: three mappers each parse input into indexable documents; arbitrary reducing steps of indexing and merging follow; End-Reducer (shard 1) and End-Reducer (shard 2) index documents and produce Index shard 1 and Index shard 2]

Page 22: Cloudera search

SolrCloud Aware

•Can ‘inspect’ ZooKeeper to learn about Solr cluster.

•What URLs to GoLive to.

•The Schema to use when building indexes.

•Match hash -> shard assignments of a Solr cluster.
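
Matching hash -> shard assignments means the indexing job must route each document to the same shard the live cluster would. A rough sketch of range-based routing follows. It is an illustration under assumptions: `zlib.crc32` is a stand-in (Solr's own router uses a different hash function), the hash space is split evenly, and in a real job the ranges would be read from the cluster state in ZooKeeper rather than computed.

```python
import zlib

HASH_SPACE = 2 ** 32  # crc32 yields an unsigned 32-bit value in Python 3


def shard_ranges(num_shards):
    """Split the hash space into num_shards contiguous, inclusive ranges."""
    size = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * size
        hi = HASH_SPACE - 1 if i == num_shards - 1 else lo + size - 1
        ranges.append((lo, hi))
    return ranges


def shard_for(doc_id, ranges):
    """Return the index of the shard whose range contains the doc id's hash."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, not Solr's
    for i, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return i
    raise AssertionError("ranges must cover the whole hash space")


ranges = shard_ranges(3)
assignment = {doc_id: shard_for(doc_id, ranges) for doc_id in ["doc-1", "doc-2", "doc-3"]}
```

As long as the job and the cluster agree on the hash function and the ranges, an index built offline for shard N contains exactly the documents that shard N would have received online.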

Page 23: Cloudera search

GoLive

•After building your indexes with map-reduce, how do you deploy them to your Solr cluster?

•We want it to be easy - so we built the GoLive option.

•GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster.

•Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
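
Solr's CoreAdmin API has a MERGEINDEXES action that merges an index directory into a running core, which is the kind of request a go-live step needs to issue per shard. The sketch below only builds such a URL; whether GoLive issues exactly this request is an assumption here, and the host, core name, and HDFS path are hypothetical.

```python
from urllib.parse import urlencode


def merge_request_url(solr_base_url, core, index_dir):
    """Build a CoreAdmin MERGEINDEXES URL for one core and one index directory."""
    params = urlencode({
        "action": "MERGEINDEXES",
        "core": core,
        "indexDir": index_dir,
    })
    return f"{solr_base_url.rstrip('/')}/admin/cores?{params}"


# Hypothetical values: one request per shard built by the map-reduce job.
url = merge_request_url(
    "http://solr1:8983/solr/",
    core="collection1_shard1_replica1",
    index_dir="hdfs://namenode:8020/outdir/results/part-00000/data/index",
)
print(url)
```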

Page 24: Cloudera search

HBase Integration

•Collaboration between NGData & Cloudera
•NGData created the Lily data management platform
•Lily HBase Indexer
•Service which acts as an HBase replication listener
•HBase replication features, such as filtering, supported
•Replication updates trigger indexing of updates (rows)
•Integrates Morphlines library for ETL of rows
•AL2 licensed on GitHub: https://github.com/ngdata
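
Conceptually, the indexer turns each replicated row update into a Solr document. The sketch below shows that mapping step only. It is not the Lily HBase Indexer's API or a Morphlines configuration: the event shape, the `field_map` format, and all column and field names are hypothetical.

```python
def row_event_to_document(event, field_map):
    """Map one replicated row update to a flat document for indexing.

    event:     {"row": row key, "cells": {"family:qualifier": value}}
    field_map: {"family:qualifier": Solr field name}; unmapped cells are skipped
    """
    doc = {"id": event["row"]}
    for column, field_name in field_map.items():
        if column in event["cells"]:
            doc[field_name] = event["cells"][column]
    return doc


# Hypothetical mapping and update event.
field_map = {"info:title": "title", "info:body": "text"}
event = {
    "row": "article-42",
    "cells": {"info:title": "Cloudera Search", "meta:internal": "skip me"},
}
doc = row_event_to_document(event, field_map)
```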

Page 25: Cloudera search

HBase Integration

[Diagram: HDFS; HBase (interactive load); Indexer(s) (triggers on updates); five Solr servers]

Page 26: Cloudera search

Hue Integration

Hue

•Simple UI
•Navigated, faceted drill down
•Customizable display
•Full text search, standard Solr API and query language

Page 27: Cloudera search

Hue Integration

Page 28: Cloudera search

Sentry Integration (Security)

Collection-Level (Query, Update, Admin)

Document-Level (Filter on document metadata)

Also supports Kerberos (KRB) and SSL
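
Document-level security boils down to filtering results on metadata stored with each document. The sketch below illustrates the idea only and is not Sentry's implementation: the `allowed_roles` field name, the role names, and the in-memory filtering are assumptions (a real deployment would apply the restriction inside the search engine, not after the fact).

```python
def visible_documents(docs, user_roles, auth_field="allowed_roles"):
    """Keep only documents whose auth field shares a role with the user."""
    roles = set(user_roles)
    return [d for d in docs if roles & set(d.get(auth_field, []))]


# Hypothetical documents tagged with the roles allowed to see them.
docs = [
    {"id": "1", "title": "Public roadmap", "allowed_roles": ["staff", "partner"]},
    {"id": "2", "title": "Salary bands", "allowed_roles": ["hr"]},
    {"id": "3", "title": "Untagged draft"},
]
visible = visible_documents(docs, user_roles=["staff"])
```

Note the choice made for untagged documents: with no metadata they match no role and stay hidden, which is the safer default in this sketch.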

Page 30: Cloudera search

Mike Drob, Cloudera