Apache Storm Basics

Apache Storm: Parallel Real Time Computation

What’s Storm?

• It’s a distributed real time computation system

• It’s free and open source

Storm Applications

• Real time analytics

• Online machine learning

• Distributed RPC

• Others

Storm Qualities

• Broad set of use cases

• Scalable

• Guaranteed no data loss

• Robust / Fault Tolerant

• Programming language agnostic

Storm Architecture

Streams

• A stream is an unbounded sequence of tuples.

• Streams are defined with a schema that names the fields in the stream’s tuples.
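To make the idea of a schema concrete, here is a minimal sketch in plain Python. It does not use the Storm API: the `SCHEMA` field names, `click_stream`, and `as_record` are hypothetical names chosen for illustration. A generator stands in for the unbounded stream, and the schema is just the ordered list of field names shared by every tuple.

```python
from itertools import count, islice

# Hypothetical schema: the field names shared by every tuple in this stream.
SCHEMA = ("user_id", "url")

def click_stream():
    """Yield an unbounded sequence of tuples whose values follow SCHEMA."""
    for n in count():
        yield (n % 3, f"/page/{n}")

def as_record(values):
    """Pair each value with its field name from the schema."""
    return dict(zip(SCHEMA, values))

# Take only the first few tuples; the generator itself never ends.
first = [as_record(t) for t in islice(click_stream(), 3)]
print(first)
```

Consumers can then refer to values by field name rather than by position, which is what a fields grouping relies on.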

Spouts

• A spout is a source of streams for a given topology.

• It reads data from an external source and emits it into the topology as tuples.

Bolts

• A bolt is the processing element in the topology.

• Bolts can perform stream transformations such as filtering, aggregation, functions, and joins.

Topologies

• A topology contains all the logic for the realtime application.

• A topology is a graph of spouts and bolts that are connected by stream groupings.
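The spout, bolt, and topology roles can be sketched in a single process with plain Python generators. This is only an illustration of the data flow, not the Storm API: `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and the spout reads from a short fixed list so the example terminates, whereas a real spout keeps emitting.

```python
from collections import Counter

def sentence_spout():
    """Spout stand-in: emits one tuple per sentence from a fixed source."""
    for sentence in ["the cow jumped", "the moon", ""]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt stand-in: a function-style bolt turning sentences into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt stand-in: an aggregation-style bolt counting occurrences per word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts

# The "topology" is the wiring: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
print(word_counts)
```

In a real cluster each stage would run as many parallel tasks, and the wiring between stages would also name a stream grouping.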

Tasks

• Each spout or bolt executes as many tasks across the cluster.

• Each task corresponds to one thread of execution.

• Stream groupings define how to send tuples from one set of tasks to another set of tasks.

Stream Groupings

• A stream grouping defines for a given bolt which streams it should receive as input.

• A stream grouping also defines how the stream’s tuples are partitioned among the bolt tasks.

Shuffle Grouping

• Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to get an equal number of tuples.
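One simple way to get a distribution that is both random and even is to shuffle the list of task ids and then hand tuples out in that order, reshuffling after every full pass. The sketch below assumes that approach for illustration; it is not taken from the slides and is not claimed to be how Storm implements it. `shuffle_grouping` is a hypothetical name.

```python
import random
from collections import Counter

def shuffle_grouping(num_tasks, seed=None):
    """Yield target task ids: each pass visits every task once, in random order."""
    rng = random.Random(seed)
    tasks = list(range(num_tasks))
    while True:
        rng.shuffle(tasks)
        yield from tasks

# Route 1000 tuples to a bolt with 4 tasks and count the load per task.
chooser = shuffle_grouping(num_tasks=4, seed=42)
load = Counter(next(chooser) for _ in range(1000))
print(load)
```

Because 1000 is a multiple of 4, every task ends up with exactly 250 tuples, while the order in which tasks are chosen still varies from pass to pass.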

Fields Grouping

• The stream is partitioned by the fields specified in the grouping.

• If the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task.
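A common way to achieve this is to hash the values of the grouping fields and take the result modulo the number of tasks. The sketch below assumes that scheme and uses CRC32 as a stable hash. The function name, the hash choice, and the sample data are illustrative assumptions, not Storm's actual implementation.

```python
import zlib

def fields_grouping(tup, schema, group_fields, num_tasks):
    """Pick a task index from a stable hash of the grouping fields' values."""
    key = "|".join(str(tup[schema.index(f)]) for f in group_fields)
    return zlib.crc32(key.encode("utf-8")) % num_tasks

schema = ("user-id", "action")
tuples = [("alice", "login"), ("bob", "click"), ("alice", "logout")]

# Both "alice" tuples map to the same task index, whatever that index is.
targets = [fields_grouping(t, schema, ["user-id"], num_tasks=4) for t in tuples]
print(targets)
```

The guarantee is only that equal keys share a task. Different keys may still collide on the same task, so the load is not necessarily balanced.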

Global Grouping

• The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
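As a small sketch of that rule (plain Python, with hypothetical task ids), the routing decision ignores the tuple entirely and always returns the lowest task id:

```python
def global_grouping(task_ids):
    """Return the single task that receives every tuple: the lowest id."""
    return min(task_ids)

bolt_task_ids = [7, 5, 9]  # hypothetical ids assigned to one bolt's tasks

# Every one of 100 routing decisions lands on the same task.
targets = {global_grouping(bolt_task_ids) for _ in range(100)}
print(targets)
```

This makes the target task a funnel for the whole stream, which suits final aggregation steps but removes parallelism at that point.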

Workers

• Topologies execute across one or more worker processes.

• Each worker process is a physical JVM and executes a subset of all the tasks for the topology.

• If the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks.
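The arithmetic is 300 / 50 = 6. The sketch below spreads task ids over workers round-robin to show the even split. The round-robin placement and the `assign_tasks` helper are assumptions for illustration, since the slides do not describe how the scheduler actually places tasks.

```python
def assign_tasks(num_tasks, num_workers):
    """Spread task ids across workers round-robin; returns worker -> task ids."""
    assignment = {w: [] for w in range(num_workers)}
    for task_id in range(num_tasks):
        assignment[task_id % num_workers].append(task_id)
    return assignment

assignment = assign_tasks(num_tasks=300, num_workers=50)

# Collect the distinct per-worker task counts; an even split gives one value.
per_worker = {len(tasks) for tasks in assignment.values()}
print(per_worker)
```

When the task count is not a multiple of the worker count, this scheme leaves some workers with one more task than others.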

A Basic Storm Topology

A (not so) Basic Storm Topology

Demo

Thanks!
