Upload
revathi-desai
View
39
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Excellent presentation about Storm and Kafka
Citation preview
Kafka and Storm event processing in realtime Guido Schmutz
2013 Trivadis
BASEL BERN LAUSANNE ZRICH DSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MNCHEN STUTTGART WIEN
WELCOME Kafka and Storm event processing in realtime Guido Schmutz
Jazoon 2013
23.10.2013
22.10.2013 Kafka and Storm event processing in realtime
2
2013 Trivadis
Guido Schmutz
Working for Trivadis for more than 16 years Oracle ACE Director for Fusion Middleware and SOA Co-Author of different books Consultant, Trainer Software Architect for Java, Oracle, SOA,
EDA, BigData und FastData Member of Trivadis Architecture Board Technology Manager @ Trivadis
More than 20 years of software development experience
Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Twitter: gschmutz
22.10.2013
3 Kafka and Storm event processing in realtime
2013 Trivadis
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields: Trivadis Services takes over the interacting operation of your IT systems.
Our company
O P E R A T I O N
22.10.2013 Kafka and Storm event processing in realtime
4
2013 Trivadis
With over 600 specialists and IT experts in your region
5
12 Trivadis branches and more than 600 employees 200 Service Level Agreements Over 4,000 training participants Research and development budget: CHF 5.0 / EUR 4 million Financially self-supporting and sustainably profitable Experience from more than 1,900 projects per year at over 800 customers
22.10.2013
Hamburg
Dsseldorf
Frankfurt
Freiburg Mnchen
Wien
Basel Zurich Bern
Lausanne
Stuttgart
Brugg
Kafka and Storm event processing in realtime 5
2013 Trivadis
Agenda
1. Introduction 2. What is Apache Kafka? 3. What is Twitter Storm? 4. Summary
22.10.2013 Kafka and Storm event processing in realtime
6
2013 Trivadis
Big Data Definition (Gartner et al)
22.10.2013 Kafka and Storm event processing in realtime
7
Velocity
Tera-, Peta-, Exa-, Zetta-, Yota- bytes and constantly growing
Traditional computing in RDBMS is not scalable enough.
We search for linear scalability
Only structured information is not enough 95% of produced data in
unstructured
Characteristics of Big Data: Its Volume, Velocity and Variety in combination
+ Veracity (IBM) - information uncertainty + Time to action ? Big Data + Event Processing = Fast Data
2013 Trivadis
22.10.2013 Kafka and Storm event processing in realtime
8
2013 Trivadis
Sample Show the counts of tweets related to Jazoon (using term jazoon)
22.10.2013 Kafka and Storm event processing in realtime
9
23.10.2013 15:45 not showing #jazoon
http://54.217.159.208:8484/jazoon-restapi/
2013 Trivadis
Velocity
22.10.2013 Kafka and Storm event processing in realtime
10
Velocity requirement examples: Recommendation Engine Predictive Analytics Marketing Campaign Analysis Customer Retention and Churn Analysis Social Graph Analysis Capital Markets Analysis Risk Management Rogue Trading Fraud Detection Retail Banking Network Monitoring Research and Development
2013 Trivadis
Agenda
1. Introduction 2. What is Apache Kafka? 3. What is Twitter Storm? 4. Summary
22.10.2013 Kafka and Storm event processing in realtime
11
2013 Trivadis
Apache Kafka
A distributed publish-subscribe messaging system Designed for processing of real time activity stream data (logs, metrics
collections, social media streams, )
Initially developed at LinkedIn, now part of Apache Does not follow JMS Standards and does not use JMS API Kafka maintains feeds of messages in topics
22.10.2013 Kafka and Storm event processing in realtime
12
2013 Trivadis
Sample (#1)
1) Use the Twitter Streaming API (Twitter Horsebird Client) to get all the tweets containing the term jazoon
22.10.2013 Kafka and Storm event processing in realtime
13
Kafka Topic
Twitter Stream
tweet Twitter to Kafka
tweet
Create a Filter Stream for tracking a list of terms
2013 Trivadis
Sample (#1)
22.10.2013 Kafka and Storm event processing in realtime
14
Kafka Topic
Twitter Stream
tweet Twitter to Kafka
tweet
Create a fully configured Horsebird Client instance
Start the client in multiple threads
2013 Trivadis
Sample (#1)
22.10.2013 Kafka and Storm event processing in realtime
15
Kafka Topic
Twitter Stream
tweet Twitter to Kafka
tweet
Create the Kafka producer
Send the message to the Kafka topic
Convert Twitter status to Avro format
2013 Trivadis
Agenda
1. Introduction 2. What is Apache Kafka? 3. What is Twitter Storm? 4. Summary
22.10.2013 Kafka and Storm event processing in realtime
16
2013 Trivadis
What is Twitter Storm ?
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
A highly distributed real-time computation system Provides general primitives to do real-time computation To simplify working with queues & workers scalable and fault-tolerant complementary to Hadoop
Originated at Backtype, acquired by Twitter in 2011
Open Sourced late 2011
Part of Apache Incubator since September 2013
22.10.2013 Kafka and Storm event processing in realtime
17
2013 Trivadis
Hadoop vs. Storm
22.10.2013 Kafka and Storm event processing in realtime
18
Storm
Real-time processing
Topologies run forever
Stateless nodes
Scalable
Guarantees no data loss
Open Source
=> Fast, reactive, real time procesing
Hadoop
Batch processing
Jobs runs to completion
Stateful nodes
Scalable
Guarantees no data loss
Open Source
=> big batch processing
=
!=
2013 Trivadis
Idea: Simplify dealing with queue & workers
22.10.2013 Kafka and Storm event processing in realtime
19
Scaling is painful queue partitioning & worker deploy
Operational overhead worker failures & queue backups
No guarantees on data processing
2013 Trivadis
What does Storm provide?
At least once message processing Fault-tolerant Horizontal scalability No intermediate queues Less operational overhead Just works Broad set of use cases
Stream processing Continous computation Distributed RPC
22.10.2013 Kafka and Storm event processing in realtime
20
2013 Trivadis
Main Concepts - Streams
Stream an unbounded sequence of tuples core abstraction in Storm Defined with a schema that names the
fields in the tuple Value must be serializable Every stream has an id
Topology graph where each node is a spout or bolt edges indicating which bolt subscribes
to which streams
22.10.2013 Kafka and Storm event processing in realtime
21
Source of stream A
Source of stream B
Subscribes: A Emits: C
Subscribes: A Emits: D
Subscribes: A & B Emits: -
Subscribes: C & D Emits: -
T T T T T T T T
2013 Trivadis
Worker executes a subset of a topology, may run one or more executors for one or
more components
Executor thread spawned by worker
Task performs the actual data processing
Main Concepts Worker, Executor, Task
22.10.2013 Kafka and Storm event processing in realtime
22
Worker Process
Task
Task
Task
Task
2013 Trivadis
Main Concepts - Spout
A spout is the component which acts as the source of streams in a topology
Generally spouts read tuples from an external source and emit them into the topology
Spouts can emit more than one stream
22.10.2013 Kafka and Storm event processing in realtime
23
2013 Trivadis
Bolt is doing the processing of the message
filtering, functions, aggregations, joins, talking to databases. Can do simple stream transformations Complex transformations often require multiple steps and therefore
multiple bolts
Bolts can emit one or more streams
Main Concepts - Bolt
22.10.2013 Kafka and Storm event processing in realtime
24
2013 Trivadis
Sample (#2)
2) Use a Spout to subscribe to the Kafka Topic to get the related tweets
22.10.2013 Kafka and Storm event processing in realtime
25
Kafka Topic
tweet tweet KafkaSpout
Twitter Stream
tweet
Twitter to Kafka
tweet
2013 Trivadis
Sample (#2)
Use existing implementation of the Kafka Spout from storm-kafka project
22.10.2013 Kafka and Storm event processing in realtime
26 Create Kafka spout to connect to the topic
Create a Kafka configuration
Kafka Topic
Tweets tweet KafkaSpout
2013 Trivadis
3) Use a bolt to retrieve the hashtags for each tweet
Sample (#3)
22.10.2013 Kafka and Storm event processing in realtime
27
Kafka Topic
tweet tweet HashtagSplitter
KafkaSpout
hashtag
Twitter Stream
tweet
Twitter to Kafka
tweet
2013 Trivadis
Sample (#3)
22.10.2013 Kafka and Storm event processing in realtime
28
Kafka Topic
Tweets tweet HashtagSplitter
KafkaSpout
hashtag
Execute is called for each tuple arriving on the input stream
Declares the output fields for the component and is called when bolt is created
2013 Trivadis
Sample (#4)
4) Use a bolt to count the occurrences of hashtags into a Redis NoSQL DB
22.10.2013 Kafka and Storm event processing in realtime
29
Kafka Topic
Tweets tweet Hashtag Splitter
KafkaSpout
hashtag Hashtag Counter
Tweets
Twitter to Kafka
Tweets
Twitter Stream
2013 Trivadis
Sample (#4)
22.10.2013 Kafka and Storm event processing in realtime
30
Kafka Topic
Tweets tweet HashtagSplitter
KafkaSpout
hashtag HashtagCounter
Prepare is called when bolt is created
Execute is called for each tuple arriving on the input stream
2013 Trivadis
Main Concepts - Topology
Each Spout or Bolt are running N instances in parallel
22.10.2013 Kafka and Storm event processing in realtime
31
Kafka Topic
Tweets
Hashtag Splitter
KafkaSpout
Hashtag Counter
Hashtag Splitter
Hashtag Counter
Shuffle Fields
Shuffle grouping is random grouping
Fields grouping is grouped by value, such that equal value results in equal task
All grouping replicates to all tasks
Global grouping makes all tuples go to one task
None grouping makes bolt run in the same thread as bold/spout it subscribes to
Direct grouping producer (task that emits) controls which consumer will receive
2013 Trivadis
Sample How does it work ?
22.10.2013 Kafka and Storm event processing in realtime
32
Kafka Topic
My presentation Kafka and Storm event processing in realtime tomorrow
12:00 at Jazoon Zurich #jazoon
#storm #kafka CU there!
HashtagSplitter
KafkaSpout
HashtagCounter
HashtagSplitter
HashtagCounter
2013 Trivadis
Sample How does it work ?
22.10.2013 Kafka and Storm event processing in realtime
33
Kafka Topic
My presentation Kafka and Storm event processing in realtime tomorrow
12:00 at Jazoon Zurich #jazoon
#storm #kafka CU there!
HashtagSplitter
KafkaSpout
HashtagCounter
HashtagSplitter
HashtagCounter
storm
kafka
jazoon
2013 Trivadis
Sample How does it work ?
22.10.2013 Kafka and Storm event processing in realtime
34
Kafka Topic
My presentation Kafka and Storm event processing in realtime tomorrow
12:00 at Jazoon Zurich #jazoon
#storm #kafka CU there!
HashtagSplitter
KafkaSpout
HashtagCounter
HashtagSplitter
HashtagCounter
storm
kafka
jazoon INCR jazoon
INCR storm
INCR kafka storm = 1
kafka = 1
jazoon = 1
2013 Trivadis
5) Setup the topology with the necessary groupings (distribution)
Sample (#5)
22.10.2013 Kafka and Storm event processing in realtime
35
Create Stream called tweet-stream
Run this bolt by 2 workers
Register Kafka spout and run by 1 worker
Subscribe to stream tweet-stream using shuffle grouping
Subscribe to hashtag-splitter using fields grouping on field hashtag
2013 Trivadis
Trident
High-Level abstraction on top of storm Makes it easier to build topologies Core data model is the stream
Processed as a series of batches Stream is partitioned among nodes in cluster
5 kinds of operations in Trident Operations that apply locally to each partition and cause no network
transfer Repartitioning operations that dont change the contents Aggregation operations that do network transfer Operations on grouped streams Merges and Joins
22.10.2013 Kafka and Storm event processing in realtime
36
2013 Trivadis
Trident
Supports Joins Aggregations Grouping Functions Filters
Similar to Hadoop and Pig/Cascading
22.10.2013 Kafka and Storm event processing in realtime
37
2013 Trivadis
Trident Concepts - Topology
22.10.2013 Kafka and Storm event processing in realtime
38
Kafka Topic
tweet tweet HashtagSplitter
KafkaSpout
hashtag HashtagNormalizer
Persistent Aggregate
hashtag
groupBy local
Bolt Bolt
2013 Trivadis
Trident Concepts - Function
takes in a set of input fields and emits zero or more tuples as output fields of the output tuple are appended to the original input tuple in the
stream If a function emits no tuples, the original input tuple is filtered out Otherwise the input tuple is duplicated for each output tuple
22.10.2013 Kafka and Storm event processing in realtime
39
2013 Trivadis
Agenda
1. Introduction 2. What is Apache Kafka? 3. What is Twitter Storm? 4. Summary
22.10.2013 Kafka and Storm event processing in realtime
40
2013 Trivadis
Summary
22.10.2013 Kafka and Storm event processing in realtime
41
Storm
Express real-time processing naturally Not a Hadoop/MapReduce competitor Supports other languages Hard to debug Object Serialization Didnt cover
Reliability through guaranteed message processing
Distributed RPC Storm cluster setup and deploy
Kafka
Distributed Scalable Pub/Sub system for Big Data
Producer => Broker => Consumer of message topics
Persists messages with ability to rewind
Consumer decides what he as consumed so far
http://54.217.159.208:8484/jazoon-restapi/
2013 Trivadis
Lambda ArchitectureBig Data and Fast Data combined
22.10.2013 Kafka and Storm event processing in realtime
42
Speed Layer
Precompute Views
query
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
Batch Layer
Precomputed information All data
Incremented information Process stream
Incoming Data
Batch recompute
Realtime increment
Serving Layer
batch view
batch view
real time view
real time view
Mer
ge
2013 Trivadis
Lambda ArchitectureBig Data and Fast Data combined
22.10.2013 Kafka and Storm event processing in realtime
43
one possible product/framework mapping
Speed Layer
Precompute Views
query
Batch Layer
Precomputed information All data
Incremented information Process stream
Incoming Data
Batch recompute
Realtime increment
Serving Layer
batch view
batch view
real time view
real time view
Mer
ge
2013 Trivadis
Resources
Sample Code: https://github.com/gschmutz/jazoon-storm-twitter-sample
Twitter Streaming API: https://dev.twitter.com/docs/streaming-apis
Twitter Horsebird Client (HBC): https://github.com/twitter/hbc
Apache Kafka: http://kafka.apache.org/
Storm Website: http://storm-project.net/
Storm Wiki: https://github.com/nathanmarz/storm/wiki
Storm Doc: https://github.com/nathanmarz/storm/wiki/Documentation
22.10.2013 Kafka and Storm event processing in realtime
44
2013 Trivadis
BASEL BERN LAUSANNE ZRICH DSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MNCHEN STUTTGART WIEN
Thank You! Trivadis AG Guido Schmutz [email protected]
22.10.2013 Kafka and Storm event processing in realtime
45
2013 Trivadis
Lambda ArchitectureBig Data and Fast Data combined
22.10.2013 Kafka and Storm event processing in realtime
46
Immutable data
Batch View
Query
Data Stream
Realtime View
Incoming Data
Serving Layer
Speed Layer
Batch Layer
A
B C D
E F
G