Bigdata roundtable-storm

Andre Sprenger's presentation on the Twitter Storm framework at the first BigData Roundtable in Hamburg

Storm - pipes and filters on steroids

Andre Sprenger

BigData Roundtable

Hamburg, 30 Nov 2011

My background

• info@andresprenger.de

• Studied Computer Science and Economics

• Background: banking, ecommerce, online advertising

• Freelancer

• Java, Scala, Ruby, Rails

• Hadoop, Pig, Hive, Cassandra

“Next click” problem

Raymie Stata (CTO, Yahoo!):

“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”

“Next click” problem

[Diagram: a web server answers each HTTP request within a max latency of 80 ms; a real-time layer collects and processes the data alongside, so that by the next HTTP request the response already reflects it - a near-real-time rather than a true real-time response.]

Example problems

• Realtime statistics - counting, trends, moving average

• Read Twitter stream and output images that are trending in the last 10 minutes

• CTR calculation - read ad clicks/ad impressions and calculate new click through rate

• ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist

• Search advertising
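Several of these examples reduce to keeping counts over a sliding time window. A minimal sketch of that idea in plain Python (no Storm involved; the class name and window handling are my own illustration):

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Keep per-item counts for events seen in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, item), oldest first
        self.counts = Counter()

    def add(self, timestamp, item):
        self.events.append((timestamp, item))
        self.counts[item] += 1
        # Drop everything that has fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n):
        return self.counts.most_common(n)
```

A "trending images" bolt would feed image IDs from the Twitter stream into a structure like this and periodically emit `top(n)`.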

Pick your framework...

• S4 - Yahoo, “real time map reduce”, actor model

• Storm - Twitter

• MapReduce Online - Yahoo

• Cloud Map Reduce - Accenture

• HStreaming - Startup, based on Hadoop

• Brisk - DataStax, Cassandra

System requirements

• Fault tolerance - system keeps running when a node fails

• Horizontal scalability - should be easy, just add a node

• Low latency

• Reliable - does not lose data

• High availability - well, if it’s down for an hour it’s not realtime

Storm in a nutshell

• Written by BackType (acquired by Twitter)

• Open source, on GitHub

• Runs on JVM

• Clojure, Python, Zookeeper, ZeroMQ

• Currently used by Twitter for real time statistics

Programming model

• Tuple - name/value list

• Stream - unbounded sequence of Tuples

• Spout - source of Streams

• Bolt - consumer / producer of Streams

• Topology - network of Streams, Spouts and Bolts

Spout

[Diagram: Spouts, each emitting a stream of tuples.]

Bolt

Processes streams and generates new streams.

[Diagram: Bolts consuming tuple streams and emitting new tuple streams.]

• filtering

• transformation

• split / aggregate streams

• counting, statistics

• read from / write to database

Topology

Network of Streams, Spouts and Bolts

[Diagram: two Spouts feeding into a network of interconnected Bolts.]

Task

Parallel processor inside Spouts and Bolts.

Each Spout / Bolt has a fixed number of Tasks.

[Diagram: a Spout and a Bolt, each running several parallel Tasks; every Spout Task sends tuples to the Bolt Tasks.]

Stream grouping

Which Task does a Tuple go to?

• shuffle grouping - distribute randomly

• field grouping - partition by field value

• all grouping - send to all Tasks

• custom grouping - implement your own logic
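A field grouping can be pictured as hashing the grouping field and taking it modulo the number of target Tasks, so equal field values always reach the same Task. A toy illustration (a hypothetical helper, not Storm's actual routing code):

```python
def field_grouping(tuples, field, num_tasks):
    """Partition tuples across tasks by the hash of one field,
    so tuples with the same field value land on the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        tasks[hash(t[field]) % num_tasks].append(t)
    return tasks
```

This is what makes per-key aggregation correct: every tuple for a given word is counted by the same Task.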

Word count example

[Diagram: Spout → SentenceSplitterBolt → WordCountBolt]

Spout emits: (“a b c a b d”)

SentenceSplitterBolt emits: (“a”) (“b”) (“c”) (“a”) (“b”) (“d”)

WordCountBolt emits: (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)
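The dataflow above can be sketched with plain Python generators standing in for the Spout and Bolts (a conceptual stand-in, not Storm's actual API):

```python
from collections import Counter

def sentence_spout():
    # Spout: a source of tuples (unbounded in Storm; finite here for the demo)
    yield "a b c a b d"

def split_bolt(sentences):
    # SentenceSplitter: consumes sentence tuples, emits one tuple per word
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    # WordCount: aggregates the word stream into running counts
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

result = count_bolt(split_bolt(sentence_spout()))
# result: a=2, b=2, c=1, d=1
```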

Guaranteed processing

[Diagram: tuple tree - the Spout tuple (“a b c a b d”) fans out into the word tuples (“a”) (“b”) (“c”) (“a”) (“b”) (“d”), which roll up into (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1).]

The Topology has a timeout for processing the tuple tree
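Storm tracks the tuple tree with "acker" tasks: every tuple gets a random 64-bit id, and the acker XORs ids in as tuples are anchored and XORs them out as they are acked, so the running value returns to zero exactly when the whole tree is processed (and the timeout fires if it never does). A dependency-free sketch of that bookkeeping, modeled on the acker idea:

```python
import os

class AckerEntry:
    """XOR-based tracking of one tuple tree, modeled on Storm's acker."""

    def __init__(self):
        self.val = 0

    def anchor(self):
        # A new tuple joins the tree: XOR its random 64-bit id in.
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.val ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        # The tuple finished processing: XOR its id out again.
        self.val ^= tuple_id
        return self.val == 0   # True once every anchored tuple is acked
```

Whatever order the six word tuples above are acked in, the value only reaches zero after the last ack.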

Runtime view

Reliability

• Nimbus / Supervisor are SPOFs

• both are stateless, easy to restart without data loss

• Failure of master node (?)

• Running Topologies should not be affected!

• Failed Workers are restarted

• Guaranteed message processing

Administration

• Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit)

• Cluster nodes can be added at runtime

• But: existing Topologies are not rebalanced (there is a ticket)

• Administration web GUI

Community

• Source is on GitHub - https://github.com/nathanmarz/storm.git

• Wiki - https://github.com/nathanmarz/storm/wiki

• Nice documentation

• Google Group

• People are starting to build add-ons: JRuby integration, adapters for JMS, AMQP

Storm summary

• Nice programming model

• Easy to deploy new topologies

• Horizontal scalability

• Low latency

• Fault tolerance

• Easy to set up on EC2

Questions?
