Storm: Distributed and fault-tolerant realtime computation
Ferran Galí i Reniu
@ferrangali
19/06/2014
Ferran Galí i Reniu
● UPC - FIB
● Trovit
  ○ Hadoop
  ○ Lucene/Solr
  ○ Storm
Big Data
● Too much data
  ○ Store
  ○ Compute
  ○ Analyse
● Distributed systems
  ○ Provide horizontal scalability
Distributed Systems
● Hadoop
  ○ Huge files
  ○ Useful for batch
  ○ High latency
  ○ No real time
[Diagram: a file split across three HDFS nodes, with MapReduce running on each node]
Storm
“Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!”
http://storm.incubator.apache.org/
Storm
● Who’s using it?
Storm
● Tuple
  ○ Ordered list of elements
  ○ Any type
[Diagram: a tuple holding String, Integer, SerializedObject, ...]
Storm
● Stream
  ○ Unbounded sequence of tuples
[Diagram: a stream drawn as a sequence of tuples]
Storm
● Spout
  ○ Source of streams
  ○ From data sources: Queues, API...
[Diagram: a spout emitting a stream of tuples]
Storm
● Bolt
  ○ Consumes streams
  ○ Does some processing (transform, join, ...)
  ○ Emits streams
[Diagram: a bolt consuming streams of tuples and emitting a new stream]
Storm
● Topology
  ○ Graph of spouts & bolts
  ○ Runs forever
Architecture
[Diagram: Nimbus (master); three Zookeeper nodes (coordinator); two Supervisors (workers), each with four slots]
Architecture
[Diagram: a Supervisor with four slots, and a topology whose components have parallelism hints 4, 1, 2, 2, 3 and 4]
● Slot
  ○ Worker process
  ○ Single JVM
  ○ Tasks - Threads
Architecture
[Diagram: two Supervisors with four slots each, running the same topology]
● Worker processes = 8
● Parallelism hints = 4, 1, 2, 2, 3, 4
● Combined parallelism = 4 + 1 + 2 + 2 + 3 + 4 = 16
● Tasks per worker = 16 / 8 = 2
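The arithmetic on this slide can be checked with a short sketch. It is plain Python rather than Storm code; the function names and the round-robin placement are illustrative assumptions, and Storm's own scheduler may place tasks differently.

```python
def tasks_per_worker(parallelism_hints, worker_processes):
    """Return (combined parallelism, average tasks per worker process)."""
    combined = sum(parallelism_hints)
    return combined, combined / worker_processes


def spread_round_robin(parallelism_hints, worker_processes):
    """Hypothetical placement: deal tasks out to workers one at a time."""
    workers = [[] for _ in range(worker_processes)]
    task_no = 0
    for component, hint in enumerate(parallelism_hints):
        for _ in range(hint):
            workers[task_no % worker_processes].append(component)
            task_no += 1
    return workers


hints = [4, 1, 2, 2, 3, 4]  # parallelism hints from the slide
combined, per_worker = tasks_per_worker(hints, worker_processes=8)
print(combined, per_worker)  # 16 2.0
print([len(w) for w in spread_round_robin(hints, 8)])
```

With 16 tasks and 8 worker processes, the round-robin placement leaves every worker with two tasks.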
Example: Word Count
● File → FileSpout → lines → SplitterBolt → words → CounterBolt
● FileSpout: parallelism hint = 2
● SplitterBolt: parallelism hint = 3
● CounterBolt: parallelism hint = 2
Example: Word Count
● The file contains the Storm description quoted above
● FileSpout emits the text in chunks
  ○ “Storm is a distributed”
  ○ “realtime computation system. Storm provides a”
● FileSpout → SplitterBolt: shuffle grouping
● SplitterBolt emits one tuple per word
● SplitterBolt → CounterBolt with shuffle grouping
  ○ The same word can reach different CounterBolt tasks, so each occurrence of Storm and a is counted separately as x1
● SplitterBolt → CounterBolt with fields grouping
  ○ The same word always reaches the same CounterBolt task
  ○ Storm x2, a x2, is x1, distributed x1, realtime x1, computation x1, system x1, provides x1
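The word count walk-through can be imitated without a cluster. The sketch below is plain Python, not the Storm API: `file_spout`, `splitter_bolt` and `fields_grouping` are hypothetical stand-ins for the slide's components, and CRC32 is an assumed hash used only to keep the routing stable between runs.

```python
import re
import zlib
from collections import Counter

TEXT_CHUNKS = [
    "Storm is a distributed",
    "realtime computation system. Storm provides a",
]


def file_spout(chunks):
    """Stand-in for FileSpout: yields one chunk of text per tuple."""
    yield from chunks


def splitter_bolt(chunk):
    """Stand-in for SplitterBolt: returns the words of one chunk."""
    return re.findall(r"[A-Za-z]+", chunk)


def fields_grouping(word, n_tasks):
    """Pick a counter task from a stable hash of the word."""
    return zlib.crc32(word.encode("utf-8")) % n_tasks


def run_word_count(chunks, n_counter_tasks=2):
    """Route every word to one counter task and return the per-task counts."""
    counter_tasks = [Counter() for _ in range(n_counter_tasks)]
    for chunk in file_spout(chunks):
        for word in splitter_bolt(chunk):
            counter_tasks[fields_grouping(word, n_counter_tasks)][word] += 1
    return counter_tasks


tasks = run_word_count(TEXT_CHUNKS)
total = sum(tasks, Counter())
print(total["Storm"], total["a"], total["is"])  # 2 2 1
```

Because the task is chosen from the word itself, every word lives in exactly one counter task, which is what makes the per-task counts correct.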
Groupings
● Shuffle grouping
● Fields grouping
● All grouping
● Global grouping
● Direct grouping
● Local or shuffle grouping
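A grouping decides which consumer tasks receive each tuple. The sketch below shows that decision for four of the groupings listed above, in plain Python with a hypothetical `target_tasks` function and an assumed CRC32 hash. Direct grouping and local or shuffle grouping are left out because they depend on the producer's choice and on worker placement.

```python
import random
import zlib


def target_tasks(grouping, values, n_tasks, key_index=0, rng=random):
    """Return the indices of the consumer tasks that receive one tuple."""
    if grouping == "shuffle":
        return [rng.randrange(n_tasks)]      # one random task
    if grouping == "fields":
        key = str(values[key_index]).encode("utf-8")
        return [zlib.crc32(key) % n_tasks]   # same key, same task
    if grouping == "all":
        return list(range(n_tasks))          # replicated to every task
    if grouping == "global":
        return [0]                           # whole stream goes to one task
    raise ValueError(f"unsupported grouping: {grouping}")


print(target_tasks("all", ("signal",), 3))   # [0, 1, 2]
print(target_tasks("fields", ("Storm",), 2) == target_tasks("fields", ("Storm",), 2))  # True
```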
Fault-tolerance
[Diagram: Nimbus, three Zookeeper nodes and two Supervisors]
● Worker dies
  ○ Supervisor will restart it
● Worker dies too many times
  ○ Nimbus will reassign it to another node
● Node dies
  ○ Nimbus will reassign its tasks to another node
Fault-tolerance
● Nimbus is not a SPOF
  ○ If Nimbus dies, running workers keep processing; only reassignment stops until it is restarted
● Nimbus & Supervisors are fail-fast
Guaranteeing message processing
● Through API
  ○ ack
  ○ fail
● Manual tuple replay
  ○ e.g. the spout emits the message with a specific id again
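Manual replay can be sketched as a spout that remembers what it has emitted. This is a plain Python imitation, not Storm's spout interface: the class `ReplayingSpout` is hypothetical, and its `next_tuple`, `ack` and `fail` methods only mirror the roles of the API calls named above.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout that replays a message when a failure is reported."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}                        # emitted but not yet acked

    def next_tuple(self):
        """Emit the next message with its id, or None when idle."""
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Fully processed: forget the message."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed: queue the same id and payload again."""
        if msg_id in self.pending:
            self.queue.append((msg_id, self.pending.pop(msg_id)))


spout = ReplayingSpout(["Storm is a distributed"])
msg_id, line = spout.next_tuple()
spout.fail(msg_id)         # a downstream bolt reported a failure
print(spout.next_tuple())  # (0, 'Storm is a distributed')
```

Keeping the payload in `pending` until the ack arrives is the design choice that makes replay possible; a source that cannot re-read old messages would need the same kind of buffer.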
Guaranteeing message processing
● When is a message “fully processed”?
[Diagram: the tuple “Storm is a distributed” is split into the word tuples Storm, is, distributed and a; three of them are marked Ok and one is marked Fail]
● Solutions
  ○ Transactional Topologies
  ○ Trident framework
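One way to answer the “fully processed” question is to track the whole tuple tree. Storm's documentation describes acker tasks that keep a single XOR value per spout tuple; the sketch below imitates that idea in plain Python. The class and method names are hypothetical, and the tuple ids are small fixed numbers for readability where Storm uses random 64-bit ids.

```python
class TupleTreeTracker:
    """Keeps one XOR value per spout tuple; zero means the tree is acked."""

    def __init__(self):
        self.ack_val = {}

    def emit_root(self, root_id, tuple_id):
        """The spout emitted a new tuple, which starts a tree."""
        self.ack_val[root_id] = tuple_id

    def ack(self, root_id, acked_id, new_child_ids=()):
        """A bolt acked one tuple after emitting children anchored to it."""
        value = self.ack_val[root_id] ^ acked_id
        for child_id in new_child_ids:
            value ^= child_id
        self.ack_val[root_id] = value
        return value == 0  # True once every tuple in the tree is acked


tracker = TupleTreeTracker()
tracker.emit_root("line-1", 0x01)   # "Storm is a distributed"
words = [0x02, 0x04, 0x08, 0x10]    # Storm, is, distributed, a
print(tracker.ack("line-1", 0x01, new_child_ids=words))  # False
print([tracker.ack("line-1", w) for w in words])         # [False, False, False, True]
```

Every id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero only when nothing is outstanding. If one word is never acked, the value stays non-zero and the spout tuple cannot be reported as fully processed.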
Yet another example
[Diagram: TwitterSpout (tweets), SplitterBolt (words), CounterBolt, CommitBolt (signals) and a DB, connected with shuffle grouping, fields grouping and all grouping]
https://github.com/ferrangali/betabeers-storm
Batch + Real time
● Lambda architecture
[Diagram: new data flows into the batch layer and the speed layer, which both feed the serving layer]
● Batch layer
  ○ High latency
  ○ Reprocesses all data
● Speed layer
  ○ Low latency
  ○ Fast & incremental algorithms
  ○ Eventually overridden by batch layer
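How the two layers meet in the serving layer can be sketched in a few lines. This is a simplified Python illustration with a hypothetical `ServingLayer` class; it assumes each finished batch run covers everything the speed layer has seen so far, which a real system would have to track explicitly.

```python
from collections import Counter


class ServingLayer:
    """Answers queries by merging a batch view with a speed view."""

    def __init__(self):
        self.batch_view = Counter()  # rebuilt from all data, high latency
        self.speed_view = Counter()  # incremental, data since the last batch

    def on_new_data(self, key):
        """Speed layer: update the realtime view incrementally."""
        self.speed_view[key] += 1

    def on_batch_finished(self, recomputed_view):
        """Batch layer: the fresh batch view overrides the speed view."""
        self.batch_view = Counter(recomputed_view)
        self.speed_view.clear()

    def query(self, key):
        return self.batch_view[key] + self.speed_view[key]


serving = ServingLayer()
serving.on_new_data("storm")
serving.on_new_data("storm")
print(serving.query("storm"))  # 2, served from the speed view
serving.on_batch_finished({"storm": 2})
print(serving.query("storm"))  # 2, now served from the batch view
```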
Trovit
● 40 countries
● 5 verticals
● Hundreds of millions of ads
Trovit
● Batch layer:
  ○ MapReduce pipeline over HDFS
[Diagram: kafka and xml sources, HDFS, and the pipeline stages Filter → Enrich → Dedup → Index]
Trovit
● Speed layer
  ○ Storm topology
[Diagram: Feeds Spout (xml) and Kafka Spout (kafka) emit ads; Processor Bolt emits rich ads; Indexer Bolt receives them]
● Group by index
● Commit in batch every 5 minutes
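Grouping by index and committing every 5 minutes could look roughly like the sketch below. It is plain Python, not Trovit's code or the Storm API: `BatchingIndexer`, the `commit` callback and the index names `es` and `uk` are all hypothetical, and the clock is injectable so the example runs instantly.

```python
import time
from collections import defaultdict


class BatchingIndexer:
    """Buffers documents per index and commits them at a fixed interval."""

    def __init__(self, commit, interval_seconds=300, clock=time.monotonic):
        self.commit = commit              # callable(index_name, docs)
        self.interval = interval_seconds  # 300 s = 5 minutes
        self.clock = clock
        self.buffers = defaultdict(list)  # index name -> pending docs
        self.last_commit = clock()

    def execute(self, index_name, doc):
        """Handle one incoming (index, document) tuple."""
        self.buffers[index_name].append(doc)
        if self.clock() - self.last_commit >= self.interval:
            self.flush()

    def flush(self):
        """Commit every buffered batch, then start a new interval."""
        for index_name, docs in self.buffers.items():
            self.commit(index_name, docs)
        self.buffers.clear()
        self.last_commit = self.clock()


now = [0.0]  # fake clock, in seconds
committed = []
indexer = BatchingIndexer(
    lambda index, docs: committed.append((index, list(docs))),
    clock=lambda: now[0],
)
indexer.execute("es", "ad-1")
now[0] = 301.0  # just over five minutes later
indexer.execute("uk", "ad-2")
print(committed)  # [('es', ['ad-1']), ('uk', ['ad-2'])]
```

This version only checks the clock when a tuple arrives, so a quiet stream would delay the commit; a production bolt would more likely be driven by a timer.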
Trovit
[Diagram: the batch layer (HDFS; Filter → Enrich → Dedup → Index) and the speed layer (ads → rich ads) shown together, fed by kafka and xml, with HBase and Zookeeper]
Questions?
Ferran Galí i Reniu
@ferrangali
19/06/2014