
Webinar: Solr & Fusion for Big Data

Page 1: Webinar: Solr & Fusion for Big Data
Page 2: Webinar: Solr & Fusion for Big Data

Solr & Fusion for Big Data

• Where search fits in the big data landscape
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion

Page 3: Webinar: Solr & Fusion for Big Data

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Page 4: Webinar: Solr & Fusion for Big Data

Why search for big data?

• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
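The analytics features named on this slide correspond to Solr request parameters. The following is a minimal sketch, assuming a hypothetical log collection with fields `host`, `level`, and `response_time_ms` (none of these names come from the webinar). It only builds the parameter list and does not contact a server.

```python
from urllib.parse import urlencode


def facet_query_params(query="*:*"):
    """Build Solr request parameters for facets, pivot facets, and facets + stats.

    The field names (host, level, response_time_ms) are hypothetical examples.
    """
    return [
        ("q", query),
        ("rows", "0"),                 # return aggregations only, no documents
        ("facet", "true"),
        ("facet.field", "level"),      # basic facet: counts per log level
        ("stats", "true"),
        ("stats.field", "{!tag=rt}response_time_ms"),
        # pivot facet (host, then level) with the tagged stats attached
        ("facet.pivot", "{!stats=rt}host,level"),
    ]


if __name__ == "__main__":
    # Query string to append to a collection's /select handler.
    print(urlencode(facet_query_params()))
```

Setting `rows` to 0 keeps the response small when only the aggregations are wanted; a dashboard would typically raise it again when the user drills into the numbers.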

Page 5: Webinar: Solr & Fusion for Big Data

Common use case: log analysis

• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale

Page 6: Webinar: Solr & Fusion for Big Data

Time-based Partitioning Scheme

[Diagram: a Fusion Log Analytics Dashboard; the collection alias recent_logs; the daily collections logs_feb01 … logs_feb25, logs_feb26; and the shards h00 … h22, h23 inside each daily collection.]

• Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages
• Use a collection alias to make multiple collections look like a single collection; minimize exposure to the partitioning strategy in the client layer
• Add replicas to support higher query volume and fault tolerance
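The naming scheme on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's implementation: the helper names are hypothetical, the example date is arbitrary, and the use of Solr's implicit router (so that shards can be named h00 to h23) is an assumption the slide does not confirm. The functions only build Collections API parameters and do not call Solr.

```python
from datetime import datetime, timedelta, timezone

# Fixed month abbreviations, so names do not depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def daily_collection(ts):
    """Name of the daily collection for a timestamp, e.g. logs_feb26."""
    return f"logs_{MONTHS[ts.month - 1]}{ts.day:02d}"


def hourly_shard(ts):
    """Name of the shard covering the timestamp's hour, h00 through h23."""
    return f"h{ts.hour:02d}"


def create_collection_params(day, replication_factor=2):
    """Collections API CREATE parameters for one daily collection.

    Assumes the implicit router so that the 24 shards can be named explicitly.
    """
    return {
        "action": "CREATE",
        "name": daily_collection(day),
        "router.name": "implicit",
        "shards": ",".join(f"h{hour:02d}" for hour in range(24)),
        "replicationFactor": str(replication_factor),
    }


def recent_logs_alias_params(today, days=7):
    """CREATEALIAS parameters pointing recent_logs at the newest `days` collections."""
    names = [daily_collection(today - timedelta(days=d)) for d in range(days)]
    return {
        "action": "CREATEALIAS",
        "name": "recent_logs",
        "collections": ",".join(names),
    }


if __name__ == "__main__":
    ts = datetime(2016, 2, 26, 22, 15, tzinfo=timezone.utc)  # arbitrary example
    print(daily_collection(ts), hourly_shard(ts))
    print(recent_logs_alias_params(ts, days=2))
```

Because clients query only the `recent_logs` alias, re-pointing the alias each day is the one place where the partitioning scheme is visible. With the implicit router, each update would also need to name its target shard, for example through the `_route_` parameter.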

Page 7: Webinar: Solr & Fusion for Big Data

Solr on HDFS

• Maturing solution, still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)
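For orientation, pointing a SolrCloud node at HDFS is mostly a matter of startup properties. The sketch below assembles such a command line using the system properties described in Solr's documentation for running on HDFS. The NameNode address and path are hypothetical, the helper name is made up for illustration, and the exact options should be checked against the Solr version in use. Nothing is executed.

```python
def solr_hdfs_start_command(hdfs_home, solr_bin="bin/solr"):
    """Return the argument list for starting a SolrCloud node with its index on HDFS.

    hdfs_home is the HDFS directory under which Solr stores index data;
    the value used below is a hypothetical example.
    """
    return [
        solr_bin, "start", "-c",                         # -c starts SolrCloud mode
        "-Dsolr.directoryFactory=HdfsDirectoryFactory",  # store the index on HDFS
        "-Dsolr.lock.type=hdfs",                         # HDFS-based index locking
        f"-Dsolr.hdfs.home={hdfs_home}",
    ]


if __name__ == "__main__":
    print(" ".join(solr_hdfs_start_command("hdfs://namenode:8020/solr")))
```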

Page 8: Webinar: Solr & Fusion for Big Data

Solr on HDFS

[Diagram: two Solr replicas of shard1 (replica1 and replica2), each with its own block cache, issue reads and writes against HDFS DataNodes A, B, and C. Solr replication is shown between the replicas and HDFS block replication between the DataNodes.]

Page 9: Webinar: Solr & Fusion for Big Data

Auto Add Replica

[Diagram: Solr shard1 / replica1 and shard1 / replica2, each with a block cache, issue reads and writes against HDFS DataNodes A, B, and C, with Solr replication and HDFS block replication as on the previous slide. Additional labels: the overseer, ZooKeeper ("watches"), and a new Solr shard1 / replica3 with its own reads and writes.]
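In Solr releases of this era, auto add replicas is requested per collection through the `autoAddReplicas` parameter of the Collections API CREATE action. The sketch below only builds that request URL. The host, collection name, and helper name are hypothetical, and the parameter's availability should be checked against the Solr version in use.

```python
from urllib.parse import urlencode


def create_with_auto_add_replicas(name, num_shards, replication_factor,
                                  solr_url="http://localhost:8983/solr"):
    """Build a Collections API CREATE URL with autoAddReplicas enabled.

    solr_url is a hypothetical node address; no request is sent.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "autoAddReplicas": "true",   # let the overseer replace lost replicas
    }
    return f"{solr_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    print(create_with_auto_add_replicas("logs", num_shards=1, replication_factor=2))
```

The feature fits HDFS well because the replacement replica can open the index files that the failed node left on the shared file system.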

Page 10: Webinar: Solr & Fusion for Big Data

Indexing Strategies

• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors

Page 11: Webinar: Solr & Fusion for Big Data

Any Data. Any Source.

Page 12: Webinar: Solr & Fusion for Big Data

Fusion Indexing Pipelines in MapReduce

[Diagram: N map tasks (1 per block), or reducers if needed, read from HDFS, pass docs through a Fusion pipeline, and index into Solr via CloudSolrClient, which consults ZooKeeper.]

• Get collection metadata from ZooKeeper (e.g. shard leader URL)
• Send updates to shard leaders in parallel
• 30+ index stages: field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup
• Many common file formats supported: CSV, SequenceFile, grok, XML, WARC
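The "send updates to shard leaders in parallel" step can be illustrated without Solr at all. The sketch below groups documents by target leader and dispatches one batch per leader concurrently. It is a simplified stand-in, not CloudSolrClient: the crc32-based routing is an assumption chosen for brevity and is not Solr's actual document routing, the leader URLs are hypothetical, and `send` is a caller-supplied function (a fake one here).

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def pick_leader(doc_id, leaders):
    """Stand-in router: crc32 of the id modulo the leader count (NOT Solr's hashing)."""
    return leaders[zlib.crc32(doc_id.encode("utf-8")) % len(leaders)]


def group_by_leader(docs, leaders):
    """Split documents into one batch per shard leader URL."""
    batches = defaultdict(list)
    for doc in docs:
        batches[pick_leader(doc["id"], leaders)].append(doc)
    return dict(batches)


def send_in_parallel(batches, send):
    """Call send(leader_url, docs) for every batch concurrently.

    Returns a dict mapping each leader URL to the value send() returned.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        futures = {url: pool.submit(send, url, docs)
                   for url, docs in batches.items()}
        return {url: future.result() for url, future in futures.items()}


if __name__ == "__main__":
    # Hypothetical leader URLs, as they might be read from ZooKeeper.
    leaders = ["http://solr1:8983/solr/logs_shard1_replica1",
               "http://solr2:8983/solr/logs_shard2_replica1"]
    docs = [{"id": f"doc-{i}"} for i in range(10)]

    def fake_send(url, batch):
        # Replace with a real HTTP POST to the leader's update handler.
        return len(batch)

    print(send_in_parallel(group_by_leader(docs, leaders), fake_send))
```

Sending each batch straight to its shard leader avoids an extra forwarding hop inside the cluster, which matters when many map tasks index at once.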

Page 13: Webinar: Solr & Fusion for Big Data

Security

• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)

Page 14: Webinar: Solr & Fusion for Big Data

Cluster Sizing Worksheet

• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
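A worksheet like this can be kept as a small data structure so that the inputs are written down before testing starts. The sketch below is a hypothetical structure, not a sizing formula: it records a subset of the inputs listed above and derives only trivial arithmetic (raw data volume, documents per shard, core count, per-replica query load under even load balancing). All numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class SizingWorksheet:
    """Inputs for a cluster sizing discussion; every field is an estimate."""
    num_docs: int
    avg_doc_bytes: int
    num_shards: int
    replication_factor: int
    updates_per_sec: float
    peak_qps: float

    def raw_bytes(self):
        """Raw data volume before indexing; on-disk index size will differ."""
        return self.num_docs * self.avg_doc_bytes

    def docs_per_shard(self):
        """Average shard size in documents, assuming even distribution."""
        return self.num_docs / self.num_shards

    def total_cores(self):
        """Number of Solr cores (shards times replicas) the cluster must host."""
        return self.num_shards * self.replication_factor

    def qps_per_replica(self):
        """Queries per second hitting each replica if load is spread evenly."""
        return self.peak_qps / self.replication_factor


if __name__ == "__main__":
    sheet = SizingWorksheet(num_docs=1_000_000_000, avg_doc_bytes=500,
                            num_shards=24, replication_factor=2,
                            updates_per_sec=10_000, peak_qps=100)
    print(sheet.raw_bytes(), sheet.docs_per_shard(),
          sheet.total_cores(), sheet.qps_per_replica())
```

These figures only frame the test plan; as the slide says, the real answer comes from testing your data on your servers.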

Page 15: Webinar: Solr & Fusion for Big Data

Lambda Architecture

• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build near real-time index
• Aggregation collections for historical data

source: http://lambda-architecture.net/
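The way a query combines the two layers can be shown with plain Python lists standing in for the batch-built index and the near real-time index. This is a toy sketch under stated assumptions: the document fields (`id`, `ts`, `level`), the cutoff convention, and the function name are all hypothetical, and no Solr or Spark APIs are involved.

```python
def query_lambda(batch_index, speed_index, predicate, batch_cutoff):
    """Answer a query from both layers without double-counting.

    batch_index holds everything up to and including batch_cutoff;
    speed_index may overlap it, so only newer speed-layer docs are used.
    Results are returned newest first.
    """
    from_batch = [doc for doc in batch_index if predicate(doc)]
    from_speed = [doc for doc in speed_index
                  if doc["ts"] > batch_cutoff and predicate(doc)]
    return sorted(from_batch + from_speed,
                  key=lambda doc: doc["ts"], reverse=True)


if __name__ == "__main__":
    batch = [{"id": 1, "ts": 1, "level": "ERROR"},
             {"id": 2, "ts": 2, "level": "INFO"}]
    speed = [{"id": 2, "ts": 2, "level": "INFO"},    # already in the batch index
             {"id": 3, "ts": 3, "level": "ERROR"}]   # newer than the last batch run
    errors = query_lambda(batch, speed,
                          lambda doc: doc["level"] == "ERROR", batch_cutoff=2)
    print([doc["id"] for doc in errors])
```

In a Solr deployment the same effect might be achieved by querying a collection alias that spans the batch-built and near real-time collections, though the slide does not say how the layers are merged.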

Page 16: Webinar: Solr & Fusion for Big Data

Spark

[Diagram: the Spark stack. Engine: Spark Core (execution model, the shuffle, caching). Libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP). Cluster management: Hadoop YARN, Mesos, Standalone. Storage: HDFS. Shared memory: Tachyon. Languages: Scala, Java, Python, R.]

Page 17: Webinar: Solr & Fusion for Big Data

The most relevant results every single time.

Massive scale. Real-time. Secure.

Any data. Any source.

Page 18: Webinar: Solr & Fusion for Big Data

Lucidworks Is Search

Page 19: Webinar: Solr & Fusion for Big Data

Any questions?

• Try Fusion http://lucidworks.com/products/fusion/ download
• LinkedIn / Twitter / Solr JIRA: @thelabdude