Bigdata roundtable-storm

Andre Sprenger's presentation on the Twitter Storm framework at the first BigData Roundtable in Hamburg

Storm - pipes and filters on steroids

Andre Sprenger

BigData Roundtable

Hamburg, 30 Nov 2011

My background

• info@andresprenger.de

• Studied Computer Science and Economics

• Background: banking, ecommerce, online advertising

• Freelancer

• Java, Scala, Ruby, Rails

• Hadoop, Pig, Hive, Cassandra

“Next click” problem

Raymie Stata (CTO, Yahoo!):

“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”

“Next click” problem

[Diagram: a web server answers each HTTP request within a max latency of 80 ms; a real-time layer collects and processes the data alongside, so that by the next HTTP request the response already reflects it - a near-real-time rather than a true real-time response.]

Example problems

• Realtime statistics - counting, trends, moving average

• Read Twitter stream and output images that are trending in the last 10 minutes

• CTR calculation - read ad clicks/ad impressions and calculate new click through rate

• ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist

• Search advertising
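Several of these examples reduce to keeping counts over a sliding time window. A minimal sketch of that idea in plain Python (no Storm involved; the class name and window handling are my own illustration):

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Keep per-item counts for events seen in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, item), oldest first
        self.counts = Counter()

    def add(self, timestamp, item):
        self.events.append((timestamp, item))
        self.counts[item] += 1
        # Drop everything that has fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n):
        return self.counts.most_common(n)
```

A "trending images" bolt would feed image IDs from the Twitter stream into a structure like this and periodically emit `top(n)`.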

Pick your framework...

• S4 - Yahoo, “real time map reduce”, actor model

• Storm - Twitter

• MapReduce Online - Yahoo

• Cloud Map Reduce - Accenture

• HStreaming - Startup, based on Hadoop

• Brisk - DataStax, Cassandra

System requirements

• Fault tolerance - system keeps running when a node fails

• Horizontal scalability - should be easy, just add a node

• Low latency

• Reliable - does not lose data

• High availability - well, if it’s down for an hour it’s not realtime

Storm in a nutshell

• Written by BackType (acquired by Twitter)

• Open source, on GitHub

• Runs on JVM

• Clojure, Python, Zookeeper, ZeroMQ

• Currently used by Twitter for real time statistics

Programming model

• Tuple - name/value list

• Stream - unbounded sequence of Tuples

• Spout - source of Streams

• Bolt - consumer / producer of Streams

• Topology - network of Streams, Spouts and Bolts

Spout

[Diagram: Spouts, each emitting a stream of tuples.]

Bolt

Processes streams and generates new streams.

[Diagram: Bolts consuming tuple streams and emitting new tuple streams.]

• filtering

• transformation

• split / aggregate streams

• counting, statistics

• read from / write to database

Topology

Network of Streams, Spouts and Bolts

[Diagram: two Spouts feeding into a network of interconnected Bolts.]

Task

Parallel processor inside Spouts and Bolts.

Each Spout / Bolt has a fixed number of Tasks.

[Diagram: a Spout and a Bolt, each running several parallel Tasks; every Spout Task sends tuples to the Bolt Tasks.]

Stream grouping

Which Task does a Tuple go to?

• shuffle grouping - distribute randomly

• field grouping - partition by field value

• all grouping - send to all Tasks

• custom grouping - implement your own logic
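A field grouping can be pictured as hashing the grouping field and taking it modulo the number of target Tasks, so equal field values always reach the same Task. A toy illustration (a hypothetical helper, not Storm's actual routing code):

```python
def field_grouping(tuples, field, num_tasks):
    """Partition tuples across tasks by the hash of one field,
    so tuples with the same field value land on the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        tasks[hash(t[field]) % num_tasks].append(t)
    return tasks
```

This is what makes per-key aggregation correct: every tuple for a given word is counted by the same Task.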

Word count example

[Diagram: Spout → SentenceSplitterBolt → WordCountBolt]

Spout emits: (“a b c a b d”)

SentenceSplitterBolt emits: (“a”) (“b”) (“c”) (“a”) (“b”) (“d”)

WordCountBolt emits: (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)
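The dataflow above can be sketched with plain Python generators standing in for the Spout and Bolts (a conceptual stand-in, not Storm's actual API):

```python
from collections import Counter

def sentence_spout():
    # Spout: a source of tuples (unbounded in Storm; finite here for the demo)
    yield "a b c a b d"

def split_bolt(sentences):
    # SentenceSplitter: consumes sentence tuples, emits one tuple per word
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    # WordCount: aggregates the word stream into running counts
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

result = count_bolt(split_bolt(sentence_spout()))
# result: a=2, b=2, c=1, d=1
```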

Guaranteed processing

[Diagram: tuple tree - the Spout tuple (“a b c a b d”) fans out into the word tuples (“a”) (“b”) (“c”) (“a”) (“b”) (“d”), which roll up into (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1).]

The Topology has a timeout for processing the tuple tree
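Storm tracks the tuple tree with "acker" tasks: every tuple gets a random 64-bit id, and the acker XORs ids in as tuples are anchored and XORs them out as they are acked, so the running value returns to zero exactly when the whole tree is processed (and the timeout fires if it never does). A dependency-free sketch of that bookkeeping, modeled on the acker idea:

```python
import os

class AckerEntry:
    """XOR-based tracking of one tuple tree, modeled on Storm's acker."""

    def __init__(self):
        self.val = 0

    def anchor(self):
        # A new tuple joins the tree: XOR its random 64-bit id in.
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.val ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        # The tuple finished processing: XOR its id out again.
        self.val ^= tuple_id
        return self.val == 0   # True once every anchored tuple is acked
```

Whatever order the six word tuples above are acked in, the value only reaches zero after the last ack.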

Runtime view

Reliability

• Nimbus / Supervisor are SPOFs

• both are stateless, easy to restart without data loss

• Failure of master node (?)

• Running Topologies should not be affected!

• Failed Workers are restarted

• Guaranteed message processing

Administration

• Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit)

• Cluster nodes can be added at runtime

• But: existing Topologies are not rebalanced (there is a ticket)

• Administration web GUI

Community

• Source is on GitHub - https://github.com/nathanmarz/storm.git

• Wiki - https://github.com/nathanmarz/storm/wiki

• Nice documentation

• Google Group

• People are starting to build add-ons: JRuby integration, adapters for JMS, AMQP

Storm summary

• Nice programming model

• Easy to deploy new topologies

• Horizontal scalability

• Low latency

• Fault tolerance

• Easy to set up on EC2

Questions?
