© Hortonworks Inc. 2013
Apache Storm
What is Storm?
• Real-time stream processing framework
• Scalable
–Up to 1 million tuples per second per node
• Fault Tolerant
–Tasks reassigned on failure
• Guaranteed Processing
–At least once processing
–Exactly once processing with some more work
• Relatively language agnostic
–Primarily JVM based
–Thrift API for defining and submitting topologies
–JSON based protocol for defining components in other languages
Motivation
• Process large amounts of incoming data in real time
• Classic use case is processing streams of tweets
–Calculate trending users
–Calculate reach of a tweet
• Data cleansing and normalization
• Personalization and recommendation
• Log processing
Lambda Architecture
Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/
• Most useful when
– Batch & speed layers do essentially the same computation
– Sample use case: KPI dashboard
• Less useful when
– Batch & speed layers do different computations
– Sample use case: Real-time model scoring
Basic Concepts
Tuple: The most fundamental data structure; a named list of values that can be of any datatype
Stream: An unbounded sequence of tuples
Spouts: Generate streams
Bolts: Contain data processing, persistence and alerting logic; can also emit tuples for downstream bolts
Tuple Tree: The first tuple and all the tuples emitted by the bolts that processed it
Topology: A group of spouts and bolts wired together into a workflow
Architecture
Nimbus (management server):
• Similar to job tracker
• Distributes code around cluster
• Assigns tasks
• Handles failures
Supervisor (worker nodes):
• Similar to task tracker
• Runs bolts and spouts as ‘tasks’
ZooKeeper:
• Cluster coordination
• Nimbus HA
• Stores cluster metrics
• Consumption-related metadata for Trident topologies
Relationship Between Supervisors, Workers, Executors & Tasks
Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Each supervisor machine in Storm has specific predefined ports (slots), and each worker process is assigned to one of them.
Tuple Routing
Stream groupings provide various ways to control tuple routing to bolts. For each grouping type: what it does, then when to use it.
• Shuffle grouping: Sends tuples to the tasks of a bolt in a random, evenly distributed sequence
– Atomic operations, e.g. math operations
• Fields grouping: Sends tuples to a bolt task based on one or more fields in the tuple
– Segmentation of the incoming stream
– Counting tuples of a certain type
• All grouping: Sends a single copy of each tuple to all instances of a receiving bolt
– Sending a signal to all bolts, e.g. clear cache or refresh state
– Sending a ticker tuple to signal bolts to save state
• Custom grouping: Implement your own grouping so tuples are routed based on custom logic
– Maximum flexibility to change processing sequence or logic based on factors such as data types, load or seasonality
• Direct grouping: Source decides which bolt task will receive the tuple
– Depends on the use case
• Global grouping: Sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID)
– Global counts
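To make the routing rules concrete, here is a minimal plain-Java sketch of how a fields grouping and a global grouping could pick a target task. It is a toy model under stated assumptions, not Storm's implementation: the class and method names are hypothetical, and the hash-modulo rule is only one simple way to guarantee that equal field values land on the same task.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of grouping decisions; hypothetical names, not Storm's own code.
class GroupingSketch {

    // Fields grouping: equal grouping-field values always map to the same task index.
    static int fieldsGroupingTask(List<?> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Global grouping: every tuple goes to the task with the lowest ID.
    static int globalGroupingTask(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    public static void main(String[] args) {
        List<String> key = Arrays.asList("storm");
        // The same key is routed to the same task every time.
        System.out.println(fieldsGroupingTask(key, 4) == fieldsGroupingTask(key, 4)); // prints true
        System.out.println(globalGroupingTask(new int[] {7, 3, 5})); // prints 3
    }
}
```

The useful property is determinism per key: a counting bolt behind a fields grouping sees every tuple for a given key, so its local count is the full count for that key.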
Topology creation example
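TopologyBuilder builder = new TopologyBuilder();
// Spout: reads tweets from Kafka
builder.setSpout("spout", kafkaSpout);
// 2 executors; tuples are distributed randomly across them
builder.setBolt("normalizer", new HashTagNormalizer(), 2).shuffleGrouping("spout");
// 2 executors; the same hashtag always goes to the same task
builder.setBolt("enumerator", new HashTagEnumerator(), 2).fieldsGrouping("normalizer", new Fields("hashtag"));
// 1 executor; all counts converge on a single task
builder.setBolt("reporter", new ResultsReporter(), 1).globalGrouping("enumerator");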
Pipeline: Get Tweet → Find Hashtags → Count Hashtags → Report Findings
• Kafka Spout "reader": Gets tweets.
• Bolt "normalizer": Removes non-alphanumeric characters, extracts hashtag values and emits them.
• Bolt "enumerator": Keeps track of how many instances of each hashtag have occurred.
• Bolt "reporter": Regularly creates a report and uploads it to Amazon S3.
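The logic inside the "normalizer" and "enumerator" steps might look roughly like the plain-Java sketch below. It runs without Storm and only models the per-tuple work. The class name, the regular expression and the lowercasing step are assumptions made for illustration; the slides do not show the actual HashTagNormalizer or HashTagEnumerator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the normalizer and enumerator bolts.
class HashtagPipelineSketch {
    // Assumption: a hashtag is '#' followed by letters and digits only.
    private static final Pattern HASHTAG = Pattern.compile("#([A-Za-z0-9]+)");

    // "normalizer" step: extract hashtag values, ignoring non-alphanumeric characters.
    static List<String> normalize(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(tweet);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT)); // lowercasing is an assumption
        }
        return tags;
    }

    // "enumerator" step: count how many times each hashtag has occurred.
    static Map<String, Integer> count(List<String> tweets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tweet : tweets) {
            for (String tag : normalize(tweet)) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("Learning #Storm today!", "#storm + #Kafka = fun");
        System.out.println(count(tweets)); // prints {kafka=1, storm=2}
    }
}
```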
What happens on failure?
• Run everything with monitoring
–E.g. daemontools or monit
–Restarts Nimbus and Supervisors on failure
• Nimbus
–Stateless (state is kept in either ZooKeeper or on disk)
–Single point of failure, sort of
– Workers still function, but can’t be reassigned when a node fails
– Supervisors continue as normal
• Supervisor
–Stateless
• Entire Node
–Nimbus reassigns tasks on that machine after timeout
Guaranteed Processing
• Tuples from Spout are tagged with a message ID
• Each of these tuples can result in a tuple tree
• Once every tuple in the tuple tree is processed, the original tuple is considered to be processed.
• Requires two pieces from the user
–Explicitly anchoring an emitted tuple to the input tuple(s)
–Ack or fail every tuple.
• If a tuple isn’t processed quickly enough, a timeout value will cause a failure.
• Spouts like the Kafka spout can replay tuples on failure, either as explicitly indicated by bolts or from timeouts.
–At least once processing!
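The bookkeeping behind these bullets can be sketched in plain Java as a toy tracker that remembers which tuples of a tree are still unacked. This is an illustrative model only: the class and method names are hypothetical, and Storm's own acker uses a far more compact scheme than a set per message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of tuple-tree tracking; hypothetical names, not Storm's acker.
class TupleTreeTracker {
    // Spout message ID -> IDs of tuples in its tree that still await an ack.
    private final Map<String, Set<Long>> pending = new HashMap<>();

    // The spout emits a root tuple tagged with a message ID.
    void emitRoot(String messageId, long tupleId) {
        Set<Long> tree = new HashSet<>();
        tree.add(tupleId);
        pending.put(messageId, tree);
    }

    // A bolt emits a new tuple anchored to an input tuple of this tree.
    void anchor(String messageId, long childTupleId) {
        Set<Long> tree = pending.get(messageId);
        if (tree != null) {
            tree.add(childTupleId);
        }
    }

    // Acks one tuple; returns true once the whole tree has been processed.
    boolean ack(String messageId, long tupleId) {
        Set<Long> tree = pending.get(messageId);
        if (tree == null) {
            return false;
        }
        tree.remove(tupleId);
        if (tree.isEmpty()) {
            pending.remove(messageId);
            return true;
        }
        return false;
    }

    // Explicit failure or timeout: forget the tree so the spout can replay the message.
    boolean fail(String messageId) {
        return pending.remove(messageId) != null;
    }

    public static void main(String[] args) {
        TupleTreeTracker tracker = new TupleTreeTracker();
        tracker.emitRoot("msg-1", 1L);
        tracker.anchor("msg-1", 2L);                      // child emitted while processing tuple 1
        System.out.println(tracker.ack("msg-1", 1L));     // prints false
        System.out.println(tracker.ack("msg-1", 2L));     // prints true
    }
}
```

If a bolt emits without anchoring, the child never enters the tree, so its failure cannot trigger a replay. That is why both anchoring and acking are required from the user.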
What is Trident?
• Provides exactly once processing semantics in Storm
• Core concept is to process a group of tuples as a batch rather than one tuple at a time, as core Storm does.
• Higher level API for defining topologies.
• All Trident topologies under the covers are automatically converted into Spouts and Bolts.
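One way to see why batching helps with exactly-once semantics: if every batch carries an ID and state updates remember the last ID applied, a replayed batch can be detected and skipped. The plain-Java sketch below illustrates that idea only. It assumes batches are applied in order, and the class, field and method names are hypothetical rather than part of the Trident API.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of an idempotent batch update; hypothetical names, not Trident's API.
class BatchCounterSketch {
    private long count = 0;
    private long lastTxId = -1; // ID of the last batch applied (assumes in-order batches)

    // Applies a whole batch at once; a replay of the last batch is ignored.
    boolean applyBatch(long txId, List<String> batch) {
        if (txId == lastTxId) {
            return false; // already applied, so the replay changes nothing
        }
        count += batch.size();
        lastTxId = txId;
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        BatchCounterSketch counter = new BatchCounterSketch();
        counter.applyBatch(1, Arrays.asList("a", "b"));
        counter.applyBatch(1, Arrays.asList("a", "b")); // replayed batch
        System.out.println(counter.count()); // prints 2
    }
}
```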
Parallelism
• Three basic variables: # Slots, # Workers, # Tasks
–No general rule for choosing them; profile and adjust.
• Can set the number of executors (threads)
• Can set the number of tasks
–Tasks are NOT parallel within an executor
–More than one task per executor is useful for rebalancing while the topology is running
• Number of workers
–Increase when bottlenecked on CPU and each worker has many tuples to process
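The point about extra tasks and rebalancing can be illustrated with a small plain-Java model: the number of tasks stays fixed, and a rebalance only changes how those tasks are spread over executors. The even, contiguous split used here is an assumption for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of spreading a fixed set of tasks over executors; hypothetical names.
class ParallelismSketch {

    // Splits task IDs 0..numTasks-1 across executors as evenly as possible.
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        // Assumption: an executor without any task is pointless, so cap at numTasks.
        int executors = Math.min(numExecutors, numTasks);
        List<List<Integer>> assignment = new ArrayList<>();
        int base = numTasks / executors;
        int extra = numTasks % executors;
        int next = 0;
        for (int e = 0; e < executors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++); // tasks in one executor run serially, not in parallel
            }
            assignment.add(tasks);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assignTasks(4, 2)); // prints [[0, 1], [2, 3]]
        System.out.println(assignTasks(4, 4)); // prints [[0], [1], [2], [3]]
    }
}
```

With 4 tasks, going from 2 to 4 executors doubles the available threads. With only 2 tasks there would be nothing left to spread out.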
Patterns – Streaming Joins
• Combine two or more data streams
• Unlike a database join, a streaming join has infinite input and unclear semantics.
• Different types of joins for different use cases
• Partition input streams the same way, using fields grouping
builder.setBolt("join", new MyJoiner(), parallelism)
    .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
    .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
    .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
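Because all streams are partitioned on the same join fields, tuples that share a key arrive at the same join task, which can then match them locally. The plain-Java sketch below models one possible semantics: an inner join that emits a row once every stream has delivered a tuple for the key, with a later tuple on the same stream overwriting an earlier one. These semantics, the class name and the absence of any expiry for unmatched keys are assumptions; the slides do not show what MyJoiner does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory join for a single join task; hypothetical, not the MyJoiner bolt.
class JoinSketch {
    private final int numStreams;
    // Join key -> (stream ID -> value seen on that stream so far).
    private final Map<List<String>, Map<String, String>> pending = new HashMap<>();

    JoinSketch(int numStreams) {
        this.numStreams = numStreams;
    }

    // Returns the joined row once every stream has delivered the key, otherwise null.
    Map<String, String> receive(String streamId, List<String> joinKey, String value) {
        Map<String, String> row = pending.computeIfAbsent(joinKey, k -> new HashMap<>());
        row.put(streamId, value);
        if (row.size() == numStreams) {
            pending.remove(joinKey); // a real join would also expire stale, unmatched keys
            return row;
        }
        return null;
    }

    public static void main(String[] args) {
        JoinSketch join = new JoinSketch(2);
        List<String> key = Arrays.asList("user42", "2013");
        System.out.println(join.receive("1", key, "click") == null); // prints true
        System.out.println(join.receive("2", key, "purchase").size()); // prints 2
    }
}
```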
Patterns – Batching
• For efficiency
–E.g. Elasticsearch bulk API
• Hold on to tuples in an instance variable
• Process the held tuples as a batch
• Ack all the held tuples
• When emitting, consider multi-anchored tuple to ensure reliability.
–Anchor to batched tuples to ensure all batched tuples are replayed.
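The hold, process and ack sequence can be sketched in plain Java as follows. The two callbacks stand in for a bulk write (for example one bulk request) and for acking a tuple. The class name, the size-based flush trigger and the callbacks are assumptions for illustration, not Storm API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of a batching bolt; hypothetical names, no Storm dependency.
class BatchingSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> bulkWriter; // processes a whole batch in one call
    private final Consumer<T> acker;            // stands in for acking one tuple
    private final List<T> held = new ArrayList<>(); // tuples held until the next flush

    BatchingSketch(int batchSize, Consumer<List<T>> bulkWriter, Consumer<T> acker) {
        this.batchSize = batchSize;
        this.bulkWriter = bulkWriter;
        this.acker = acker;
    }

    // Called once per incoming tuple.
    void execute(T tuple) {
        held.add(tuple);
        if (held.size() >= batchSize) {
            flush();
        }
    }

    // Processes the held tuples together, then acks each of them.
    void flush() {
        if (held.isEmpty()) {
            return;
        }
        bulkWriter.accept(new ArrayList<>(held));
        // Ack only after the bulk write returned; if it throws, nothing is acked.
        for (T tuple : held) {
            acker.accept(tuple);
        }
        held.clear();
    }

    public static void main(String[] args) {
        BatchingSketch<String> bolt = new BatchingSketch<>(
                2,
                batch -> System.out.println("bulk write of " + batch.size() + " tuples"),
                tuple -> System.out.println("ack " + tuple));
        bolt.execute("t1");
        bolt.execute("t2"); // reaching the batch size triggers the flush
    }
}
```

Acking only after the bulk write succeeds means a failed write leaves every held tuple unacked, so the timeout replays all of them.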
Patterns – Streaming Top N
• Simplest way is to have a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items
–Doesn’t scale because whole stream goes through one task
• Alternative: Do many top N’s across partitions of stream
• Merge each partition top N to get global top N
• Use fields grouping to get partitioning
builder.setBolt("rank", new RankObjects(), parallelism)
.fieldsGrouping("objects", new Fields("value"));
builder.setBolt("merge", new MergeObjects())
.globalGrouping("rank");
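The rank-then-merge idea can be sketched in plain Java: each partition computes its own top N, and a single merge step combines those short lists. The class and method names are hypothetical and only model what the "rank" and "merge" bolts might compute. The sketch assumes each key lives in exactly one partition, which is what the fields grouping provides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy model of partitioned top N; hypothetical names, no Storm dependency.
class TopNSketch {

    // "rank" step: top N entries of one partition, highest count first (ties by key).
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    // "merge" step: combine the per-partition lists and take the global top N.
    static List<Map.Entry<String, Long>> merge(List<List<Map.Entry<String, Long>>> partials, int n) {
        Map<String, Long> combined = new HashMap<>();
        for (List<Map.Entry<String, Long>> partial : partials) {
            for (Map.Entry<String, Long> entry : partial) {
                // Each key belongs to one partition only, so keys never collide here.
                combined.put(entry.getKey(), entry.getValue());
            }
        }
        return topN(combined, n);
    }

    public static void main(String[] args) {
        Map<String, Long> partition1 = new HashMap<>();
        partition1.put("a", 5L);
        partition1.put("b", 1L);
        Map<String, Long> partition2 = new HashMap<>();
        partition2.put("c", 9L);
        System.out.println(merge(Arrays.asList(topN(partition1, 2), topN(partition2, 2)), 2));
        // prints [c=9, a=5]
    }
}
```

The merge task only ever sees N entries per partition, so the single global task no longer has to handle the whole stream.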