C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

Preview:

DESCRIPTION

The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A realtime financial application will run for the audience, and then a detailed look at how the application was built. An overview of Datastax Enterprise Solr features will be given, and how the many enhancements in DSE make it unique in the marketplace.

Citation preview

©2012 DataStax 1

DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen

©2012 DataStax 2

•  Big Data Engineer at DataStax •  Co-author of ‘Programming Hive’

and ‘Introduction to Solr’ from O’Reilly

About the Presenter

©2012 DataStax 3

•  The company behind Cassandra •  Sells DataStax Enterprise

About DataStax

©2012 DataStax 4

DataStax Enterprise 3.x

©2012 DataStax 5

•  Single stack •  Cassandra •  Solr •  Hadoop •  Consulting •  Support

DataStax Enterprise

©2012 DataStax 6

•  ZDNet article: “The biggest cloud app of all: Netflix”

•  http://zd.net/ZHtrmW •  Built on Cassandra •  “According to Cockcroft, if something

goes wrong, Netflix can continue to run the entire service on two out of three zones”

Cassandra at Netflix

©2012 DataStax 7

•  Petabytes of growing data •  Hadoop is for batch work •  What are the solutions for realtime?

What is Big Data?

©2012 DataStax 8

•  Near realtime •  1000 millisecond latency

What is realtime?

©2012 DataStax 9

•  Cost of scaling to petabytes •  Physical limitations

Why not relational databases?

©2012 DataStax 10

•  Hadoop for batch •  Solr and Cassandra for realtime •  Gives most of relational capability at

1/10 the cost, scales linearly

Relational to Big Data

©2012 DataStax 11

•  Distributed database heavy lifting •  Simple dynamo model •  Executes replication tasks extremely

well

Why Cassandra?

©2012 DataStax 12

•  Cassandra is easier, code is readable •  Fewer moving parts •  Multi-datacenter replication •  Enables low level IO tuning

Cassandra vs. HBase

©2012 DataStax 13

•  HBase runs on HDFS •  HDFS is not designed for random

access IO •  Multiple hacks / products to perform

random access (MapR, HDFS Jiras)

Cassandra vs. HBase

©2012 DataStax 14

•  Cassandra is peer to peer, there is no single point of failure (SPOF)

•  The HDFS name node is a single point of failure

Cassandra vs. HBase

©2012 DataStax 15

•  Most of Facebook runs on MySQL •  Memcache front ends the reads

HBase at Facebook

©2012 DataStax 16

•  Hive with Hadoop •  A vague dialect of SQL •  Requires Java for UDFs •  Relational Joins

Batch Analytics

©2012 DataStax 17

•  Solr •  SQL features except relational joins

•  Use Hive for relational joins

•  CEP (Complex Event Processing) •  Storm

Realtime Analytics

©2012 DataStax 18

•  Storm, computes results on streaming data

CEP (Complex Event Processing)

©2012 DataStax 19

•  Java inverted indexing library •  Text analytics is raw computation

over linear sets of data •  High speed computation engine

Lucene

©2012 DataStax 20

•  Terms dictionary points to list document ids (integers)

•  Tokenizes text •  Complete variety of computation on

vectors of data

Inverted Indexes

©2012 DataStax 21

•  Search server built around Lucene •  Adds faceting, distributed search •  Missed the cloud environment

features of NoSQL systems for many years

Solr

©2012 DataStax 22

•  Solr Cloud is a Zookeeper based system

•  New and probably not production ready

•  Playing catch up

Solr Cloud

©2012 DataStax 23

•  High overlap with Solr •  More mature than Solr Cloud •  Less distributed features than

Cassandra

Elastic Search

©2012 DataStax 24

•  Columns, column families, keyspaces •  Peer to peer •  Eventual consistency •  Implements basic Google BigTable

model

Cassandra Concepts

©2012 DataStax 25

•  Both implement a log structured merge tree file architecture

Lucene and Cassandra

©2012 DataStax 26

•  Data is stored in Cassandra •  Data placement controlled by

Cassandra •  Solr is a secondary index (only)

DataStax Enterprise with Solr

©2012 DataStax 27

•  Separation of church and state, eg, data and index

DataStax Enterprise with Solr

©2012 DataStax 28

•  Indexing is a CPU intensive task •  Not IO bound because of multi-

threading •  When a thread is flushing, other

threads are indexing, CPU is saturated at all times

Indexing

©2012 DataStax 29

•  IO bound, index needs to fit in RAM, then CPU bound

•  Lucene enables multithreading queries

•  Solr does not multithread queries

Queries

©2012 DataStax 30

•  Eventual consistency, each node has it’s own Lucene index

•  Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)

DataStax Enterprise with Solr

©2012 DataStax 31

•  Query requests are round robin’d across nodes automatically

Distributed Search Architecture

©2012 DataStax 32

•  3.0.1 is the current release of DataStax Enterprise

DSE 3.0.1

©2012 DataStax 33

•  Ease of re-indexing •  Re-index the entire cluster or per-

node •  Re-indexing occurs when the Solr

schema changes

New Features in DSE 3.0.1

©2012 DataStax 34

•  Solr Cloud requires re-indexing from an external data source such as a relational database

New Features in DSE 3.0.1

©2012 DataStax 35

•  DSE re-indexes directly from Cassandra

•  No custom code is required for re-indexing

New Features in DSE 3.0.1

©2012 DataStax 36

•  View the heap memory usage of the field caches

•  Perform capacity planning

New Features in DSE 3.0.1

©2012 DataStax 37

•  Multithreaded re-indexing and repair •  Adding a new Solr node is fast

New Features in DSE 3.0.1

©2012 DataStax 38

•  Kerberos and SSL security •  Security audit logging

New Features in DSE 3.0.1

©2012 DataStax 39

•  Near realtime: per-segment filters, facets, multivalue facets

•  Solr 4.3

DataStax Enterprise 3.1

©2012 DataStax 40

•  vNodes •  Composite keys

DataStax Enterprise 3.1

©2012 DataStax 41

•  Multi datacenter live Solr schema updates and re-indexing

•  CQL -> Solr queries, makes porting SQL applications easy for SQL developers

Future

©2012 DataStax 42

Demo of Wikipedia

©2012 DataStax 43

•  Details about every trade •  Tick data generated real time and is

quantitatively query-able •  Too big to query on in real time? Not

anymore!

Real World Example: Tick Data

©2012 DataStax 44

•  Computing the moving stock price average in real time

•  Comparing multiple moving averages for different stock_symbols

•  Requires statistical analysis, group by companies, and faceting features

Tick Data - Moving Average

©2012 DataStax 45

•  Read latest ticks for a given company •  Query ticks for companies in specific

verticals during large events such as press releases

•  Compute deviation of stock data over 5 years for groups of companies

Tick Data Analytics - Ad Hoc Searches

©2012 DataStax 46

Real Time Stocks Demo

©2012 DataStax 47

General

©2012 DataStax 48

•  Like an SQL CREATE TABLE statement

•  Defines field types •  Defines fields

Schema

©2012 DataStax 49

•  XML based configuration options for Solr

Solr Config

©2012 DataStax 50

•  Commits new index segment to RAM •  Avoids ‘hard’ commit fsync

Soft Commit

©2012 DataStax 51

Auto Soft Commit <!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUpdateHandler2”>! <autoSoftCommit> !

<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->! </autoSoftCommit> !</updateHandler>!

©2012 DataStax 52

•  Loaded for sort and facet queries •  Uses heap space

Field Cache

©2012 DataStax 53

•  Java based API for interacting with a Solr server

•  DSE supports SolrJ/HTTP with no changes

SolrJ / HTTP

©2012 DataStax 54

•  Auto data type mapping •  Copy fields •  Dynamic fields

Insert data with CQL

©2012 DataStax 55

•  Exists however is mainly useful for debugging

•  Limited functionality, queries a single node

CQL with Solr Query

©2012 DataStax 56

CQL Insert Example INSERT INTO wikipedia (key, text) !VALUES ('1', 'when in rome')!

©2012 DataStax 57

How to convert applications

©2012 DataStax 58

•  Common to convert existing SQL applications to Big Data

•  Focus on the application functionality

SQL to Solr

©2012 DataStax 59

•  Cassandra makes all distributed operations easy

SQL to Solr

©2012 DataStax 60

•  SELECT * FROM wikipedia WHERE type = ‘pdf’

•  q=type:pdf

SELECT WHERE

©2012 DataStax 61

•  SELECT title,text FROM wikipedia •  q=*:* •  fl=title,text

SELECT columns

©2012 DataStax 62

•  SELECT COUNT(*) FROM wikipedia WHERE type = ‘pdf’

•  q=type:pdf •  Get the num found

SELECT COUNT

©2012 DataStax 63

•  SELECT * FROM stocks ORDER BY price ASC

•  q=*:* •  sort=price asc

SELECT ORDER BY

©2012 DataStax 64

•  SELECT AVG(price) FROM stocks •  q=*:* •  stats=true •  stats.field=price •  The average is called ‘mean’ in the

Solr results

SELECT AVG

©2012 DataStax 65

•  SELECT AVG(price) FROM stocks GROUP BY symbol

•  q=*:* •  stats=true •  stats.field=price •  stats.facet=symbol

SELECT AVG GROUP BY

©2012 DataStax 66

•  SELECT * FROM wikipedia WHERE text LIKE ‘rom%’

•  q=text:rom*

SELECT WHERE LIKE

©2012 DataStax 67

Recommended