Cassandra summit-2013


Real Time Big Data With Storm, Cassandra, and In-Memory Computing

DeWayne Filppi, @dfilppi

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

® Copyright 2013 Gigaspaces Ltd. All Rights Reserved


The Two Vs of Big Data

Velocity

Volume

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


The Flavors of Big Data Analytics

Counting

Correlating

Research


Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …


Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?


It’s All about Timing

“Real time” (< a few seconds)

Reasonably Quick (seconds - minutes)

Batch (hours/days)


It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)


This is what we’re here to discuss

VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA


In Memory Data Grid Review

• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data


Data Grid + Cassandra: A Complete Solution

• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
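The asynchronous flow in the list above can be illustrated with a minimal write-behind sketch. It is not the XAP or Cassandra API: `WriteBehindGrid` is a hypothetical class, and a plain dict stands in for Cassandra. It only shows the pattern of making a result readable from memory immediately, notifying listeners, and persisting off the caller's path.

```python
import queue
import threading


class WriteBehindGrid:
    """Hypothetical in-memory store that persists writes asynchronously."""

    def __init__(self, backing_store):
        self._memory = {}                # the "grid": readable immediately
        self._backing = backing_store    # assumed stand-in for Cassandra (a dict)
        self._pending = queue.Queue()    # writes waiting to be persisted
        self._listeners = []
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def add_listener(self, callback):
        self._listeners.append(callback)

    def write(self, key, value):
        self._memory[key] = value        # result is instantly available
        for notify in self._listeners:   # event listeners are notified
            notify(key, value)
        self._pending.put((key, value))  # persisted later, in the background

    def read(self, key):
        return self._memory.get(key)

    def flush(self):
        """Block until every queued write has reached the backing store."""
        self._pending.join()

    def _drain(self):
        # Background loop: move queued writes into the backing store.
        while True:
            key, value = self._pending.get()
            self._backing[key] = value
            self._pending.task_done()
```

A caller would typically `write` and `read` without waiting; `flush` exists here only so that a test or shutdown hook can wait for the backing store to catch up.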


Simplified Event Flow


Grid – Cassandra Interface

• Hector and CQL based interface
• In memory data must be mapped to column families
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened
  – Can be overridden by custom serializer
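As a rough illustration of what flattening nested fields means, the sketch below turns a nested object into a single level of column names and values. The dotted naming scheme and the `flatten` helper are assumptions made for the example; the slides do not say how XAP names flattened columns.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into one level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        # Assumed naming scheme: parent and child names joined by a separator.
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


tweet = {"id": 42, "user": {"name": "dfilppi", "location": {"country": "US"}}}
row = flatten(tweet)
# row == {"id": 42, "user.name": "dfilppi", "user.location.country": "US"}
```

A custom serializer, as mentioned above, would replace this default by writing the nested value into a single column instead.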


Virtues and Limitations

• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in memory operations
  – Fast enough for almost anybody
• Highly available/self healing
• Elastic

BUT

• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers

Storm Background

• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter

Storm Abstractions

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies


Streaming word count with Storm

• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
• Cassandra, because of its write performance, is commonly used
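The shape of a streaming word count can be sketched without Storm itself. The example below is plain Python, not the Storm builder API: generator functions stand in for a spout and two bolts, and all function names are hypothetical. It shows only how tuples flow from a source through a splitting step to a counting step.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: emits one tuple per sentence from any iterable source."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt stand-in: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, counts):
    """Bolt stand-in: keeps a running count and emits (word, count) updates."""
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and drain the stream."""
    counts = Counter()
    for _update in count_bolt(split_bolt(sentence_spout(sentences)), counts):
        pass  # a real topology would hand each update to an external store
    return counts


counts = run_topology(["storm and cassandra", "storm in memory"])
```

In a real deployment the counting step would run in parallel across workers and the updates would be persisted externally, which is where a store with fast writes fits in.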


Storm: Optimistic Processing

• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any kind of quasi-queue like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
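The replay requirement can be sketched as follows. This is a plain Python stand-in, not Storm code: the method names loosely mirror Storm's spout callbacks (`nextTuple`, `ack`, `fail`), while the class itself and its in-memory queue are assumptions made for illustration.

```python
from collections import deque


class ReplayableSpout:
    """Queue-backed spout stand-in that can re-emit tuples that failed."""

    def __init__(self, items):
        self._ready = deque(enumerate(items))  # (message id, tuple) pairs
        self._in_flight = {}                   # emitted but not yet acked

    def next_tuple(self):
        """Emit the next tuple, or None when nothing is ready."""
        if not self._ready:
            return None
        msg_id, item = self._ready.popleft()
        self._in_flight[msg_id] = item
        return msg_id, item

    def ack(self, msg_id):
        """Processing succeeded: forget the tuple."""
        self._in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed: queue the tuple so it is emitted again."""
        item = self._in_flight.pop(msg_id)
        self._ready.append((msg_id, item))


spout = ReplayableSpout(["a", "b"])
first = spout.next_tuple()     # (0, "a")
spout.fail(first[0])           # "a" goes back to the end of the queue
second = spout.next_tuple()    # (1, "b")
spout.ack(second[0])
replayed = spout.next_tuple()  # (0, "a") again
```

The optimistic part is that nothing extra happens on the success path beyond dropping the in-flight entry; the cost of replay is paid only when a tuple fails.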


Fast. Want to go faster?

• Eliminate non-memory components
• Replace the disk based queue with a reliable in-memory queue
• Replace disk based state persistence with in-memory persistence
• Asynchronously update disk based state (C*)


Sample Architecture


References

• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on GitHub: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.


Twitter Storm With Cassandra


Storm Overview


Storm Concepts

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies

Challenge – Word Count

Word:Count

Tweets

Count?

• Hottest topics
• URL mentions
• etc.


Streaming word count with Storm


Supercharging Storm

• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
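Batching writes to a slow store can be sketched in a few lines. This is an illustration of the general technique, not Storm's own batching mechanism: `BatchingWriter` and its `write_batch` callable are hypothetical names, and a list of batches stands in for the database.

```python
class BatchingWriter:
    """Buffers updates and hands them to a slow store one batch at a time."""

    def __init__(self, write_batch, batch_size):
        self._write_batch = write_batch  # assumed callable that takes a list
        self._batch_size = batch_size
        self._buffer = []

    def add(self, update):
        self._buffer.append(update)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered as a single batch, if anything."""
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()


batches = []
writer = BatchingWriter(batches.append, batch_size=3)
for n in range(7):
    writer.add(n)
writer.flush()
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Seven updates reach the store in three round trips instead of seven, which is the saving batching is after when the store is the slow part.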

Tweets, events, whatever…

XAP Real Time Analytics


Two Layer Approach

• Advantage: minimal “impedance mismatch” between layers
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability

[Diagram labels: Raw Event Stream (×3), In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE (×2)]


Simplified Architecture

• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)


Key Concepts

Keep Things In Memory

Facebook keeps 80% of its data in Memory (Stanford research)

RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Take Aways

A data grid can serve different needs for big data analytics:

• Supercharge a dedicated stream processing cluster like Storm
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS


References

• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
