Storm distributed processing
BarCamp Saigon 2012
Duc Quoc
Hello! I’m Duc
• Senior Software Engineer
  – KMS Technology
• Open source advocate
  – www.ducquoc.vn
  – ducquoc.vn@gmail.com
  – @ducquoc_vn
Agenda
• Why Storm was created
• Basic concepts
• Some use cases
• Q&A
Storm?
• Twitter’s stream processing framework
Storm
• Originally from BackType, for analyzing tweets
  – More than 2000 watchers on GitHub
• “the realtime Hadoop”
  – continuous computation system (open source)
• distributed, reliable, fault-tolerant
  – suitable for big data processing
Big Data challenges
• Scalability
  – vertical, horizontal
• (High) availability
• Stability (fault-tolerance)
caching, replication, partitioning/sharding, load-balancing, …
Google!
• published papers on MapReduce, Google FileSystem (GFS), BigTable
Apache Hadoop
• MapReduce, HDFS, HBase
  – later on: Hive, Pig, Mahout, ZooKeeper, …
[Diagram: one JobTracker, three ZooKeeper nodes, five TaskTrackers]
Hadoop limits
• Batch processing with jobs -> not realtime
• Stateful nodes, SPOF (JobTracker/NameNode)
• Cumbersome API
[Timeline diagram, axis t up to “now”: fully processed data, latest full period, unprocessed data. Caption: “Hadoop job takes this long for this data”]
Cluster
• Nimbus: daemon on the master node
• Supervisor: daemon on each worker node
• Coordination via ZooKeeper
[Diagram: one Nimbus, three ZooKeeper nodes, five Supervisors, and the UI]
Tuple
• Ordered list of elements
  – (“user-1234”, “email:ducquoc.vn@gmail.com”)
Stream
• Unbounded sequence of tuples
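As a rough illustration in plain Python (not the Storm API), a tuple can be modeled as an ordered list of values and a stream as a generator that never ends. The function name `user_events` and the generated values are hypothetical.

```python
from itertools import count, islice


def user_events():
    """Hypothetical unbounded stream: yields one tuple per event, forever."""
    for n in count(1):
        # A tuple is an ordered list of elements, here (user id, contact).
        yield (f"user-{n}", f"email:user{n}@example.com")


# A stream has no end, so a consumer only ever looks at a finite slice of it.
first_three = list(islice(user_events(), 3))
print(first_three)
```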
Spout
• Source of a stream, emitting tuples
• Talks with queues, logs, API calls, event data
Bolt
• Processes tuples, may emit new streams
• Applies functions and transforms, accesses DBs & APIs
  – filter, aggregate, join, …
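The spout and bolt roles can be sketched with plain Python generators. This is only a conceptual model under simplifying assumptions, not Storm's API: the spout reads a fixed list instead of a queue, and all function names and sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Hypothetical spout: a fixed list stands in for a queue or log."""
    for line in ["storm is fast", "hadoop is batch", "storm is realtime"]:
        yield (line,)


def split_bolt(stream):
    """Transform: emits one (word,) tuple per word of each sentence."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream, stop_words=frozenset({"is"})):
    """Filter: drops tuples whose word is a stop word."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def count_bolt(stream):
    """Aggregate: keeps a running count per word and returns the totals."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


# Chain the components: spout -> split -> filter -> count.
word_counts = count_bolt(filter_bolt(split_bolt(sentence_spout())))
print(word_counts)
```

In a real cluster the input never ends, so an aggregating bolt would emit or store its running counts periodically instead of returning once.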
Topology
• A directed graph of spouts and bolts
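A small sketch of the directed-graph idea, again in plain Python, not the Storm API. The component names (`words`, `upper`, `length`, `exclaim`) and the `run` helper are hypothetical; the point is only that one component's output can feed several subscribers.

```python
from collections import deque

# Hypothetical bolts: each maps one input tuple to a list of output tuples.
bolts = {
    "upper": lambda t: [(t[0].upper(),)],
    "length": lambda t: [(t[0], len(t[0]))],
    "exclaim": lambda t: [(t[0] + "!",)],
}

# Edges of the directed graph: component -> components subscribed to its output.
edges = {
    "words": ["upper", "length"],  # the spout feeds two bolts
    "upper": ["exclaim"],
    "length": [],
    "exclaim": [],
}


def run(source, tuples):
    """Pushes tuples from `source` through the graph; returns outputs per component."""
    outputs = {name: [] for name in edges}
    queue = deque((source, t) for t in tuples)
    while queue:
        component, tup = queue.popleft()
        outputs[component].append(tup)
        for subscriber in edges[component]:
            for out in bolts[subscriber](tup):
                queue.append((subscriber, out))
    return outputs


result = run("words", [("storm",), ("bolt",)])
print(result["exclaim"])
print(result["length"])
```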
Task
• Thread which executes a Spout or Bolt
• Deploy a topology:
  $ storm jar myCode.jar com.example.MyTopology arg1 arg2
• Kill a topology:
  $ storm kill topologyName
Sample code
Source code of this sample: https://ducquoc.googlecode.com/svn/trunk/storm/
• Create stream called “word”, run 10 tasks
• Create stream called “first-…”, run 3 tasks
  – subscribes to stream “word”, using shuffle grouping
Sample code (2/3)
• RandomWordSpout
emits a random string from the array words every 100 milliseconds
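The behaviour described for RandomWordSpout can be approximated in plain Python as follows. This is not the sample's actual source: the word list, the function name `random_word_spout`, and its parameters are assumptions made for illustration.

```python
import random
import time

WORDS = ["storm", "spout", "bolt", "tuple", "stream"]  # hypothetical word list


def random_word_spout(words, interval_s=0.1, rng=random):
    """Yields a one-field tuple with a random word every `interval_s` seconds."""
    while True:
        time.sleep(interval_s)  # 100 milliseconds by default
        yield (rng.choice(words),)


# Take five tuples; a short interval keeps the demo quick.
spout = random_word_spout(WORDS, interval_s=0.001, rng=random.Random(42))
emitted = [next(spout) for _ in range(5)]
print(emitted)
```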
Sample code (3/3)
• InterrogativeBolt
appends a question mark to the first field of the tuple, then emits it
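The same transformation in a plain-Python sketch (not the sample's actual code; the function name `interrogative_bolt` is hypothetical). Fields after the first are passed through unchanged, which is an assumption about how extra fields would be handled.

```python
def interrogative_bolt(stream):
    """Appends "?" to the first field of each tuple and emits the result."""
    for tup in stream:
        first, *rest = tup
        yield (first + "?", *rest)


questions = list(interrogative_bolt([("storm",), ("hadoop", 2012)]))
print(questions)
```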
Stream grouping
• Decides which task of the bolt a tuple is sent to
• ShuffleGrouping: randomly
• FieldsGrouping: groups tuples by named fields
• Global grouping, All grouping, None grouping, Direct grouping
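One way to picture the groupings is as functions that map a tuple to a task index. The sketch below is plain Python with hypothetical function names; the CRC32 hash is an assumption chosen for being stable, and Storm's own task selection may differ in detail.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: any task may receive the tuple."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, num_tasks, field_index=0):
    """Fields grouping: equal field values always map to the same task."""
    key = str(tup[field_index]).encode("utf-8")
    return zlib.crc32(key) % num_tasks  # stable hash, an assumption of this sketch


def global_grouping(tup, num_tasks):
    """Global grouping: the whole stream goes to a single task."""
    return 0


def all_grouping(tup, num_tasks):
    """All grouping: every task receives a copy of the tuple."""
    return list(range(num_tasks))


tuples = [("storm",), ("bolt",), ("storm",)]
rng = random.Random(7)
fields_targets = [fields_grouping(t, 3) for t in tuples]
shuffle_targets = [shuffle_grouping(t, 3, rng) for t in tuples]
print(fields_targets, shuffle_targets, all_grouping(tuples[0], 3))
```

Fields grouping is what makes per-key aggregation possible: both (“storm”,) tuples land on the same task, so that task can keep the count for “storm” locally.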
Local/distributed mode
More abstractions
• Distributed RPC server
• Transactional/Batch
• Trident
• https://github.com/nathanmarz/storm/wiki
  – http://groups.google.com/group/storm-user
Popular use cases
• Continuous/realtime query with low latency
  – analyzing, monitoring, statistics, classifying, …
• Back-end processing for streaming data
  – automated scoring, log processing/auditing, …
• Distributed, high-volume data processing
  – ETL, realtime integration/synchronization, …
Storm integration
• Data to Storm
  – storm-jms, storm-kafka, storm-redis-pubsub, storm-scribe, storm-contrib-sqs, …
• Storm to databases
  – storm-cassandra, storm-hbase, storm-contrib-mongo, storm-state, storm-rdbms, …
• Polyglotism (language agnostic)
  – Clojure, Java, Python, Ruby, PHP, Perl, JRuby, …
Storm dependencies
• Java 5+, Clojure
• ZeroMQ 2.1.7-, JZMQ, Python 2.6+
• Thrift, ZooKeeper, Kryo, Jetty, …
  – slf4j, joda, snakeyaml, guava, …
Storm UI
In production
• https://github.com/nathanmarz/storm/wiki/Powered-By
Q&A
Thank you!
Bonus
• I wanna know how many queries I get
  – per second, minute, day, week
• Results should be available
  – within <2 seconds 99.8+% of the time
  – within 50 seconds almost always
• History should last >2 years
• Should work for 0.01 q/s up to 50,000 q/s
• Failure tolerant, yadda, yadda
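The counting part of these requirements could be sketched as fixed time buckets per window size. This is a single-process illustration in plain Python and says nothing about latency, retention, or fault tolerance; the class name `QueryCounter` and its methods are hypothetical.

```python
from collections import Counter


class QueryCounter:
    """Counts queries in fixed time buckets (hypothetical helper)."""

    # Window sizes in seconds.
    WINDOWS = {"second": 1, "minute": 60, "day": 86_400, "week": 604_800}

    def __init__(self):
        self.buckets = {name: Counter() for name in self.WINDOWS}

    def record(self, timestamp):
        """Registers one query that arrived at `timestamp` (epoch seconds)."""
        for name, size in self.WINDOWS.items():
            self.buckets[name][int(timestamp) // size] += 1

    def count(self, window, timestamp):
        """Returns the number of queries in the bucket containing `timestamp`."""
        return self.buckets[window][int(timestamp) // self.WINDOWS[window]]


counter = QueryCounter()
for ts in [120.0, 120.4, 121.9, 185.0]:
    counter.record(ts)
print(counter.count("second", 120), counter.count("minute", 125))
```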
[Timeline diagram, axis t up to “now”: “Hadoop works great back here” (older data), “Storm works here” (most recent data)]
Real-time and long-time together
Blended view
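A blended view can be read as: take batch results for the periods the batch job has fully processed, and realtime results for everything after that. The sketch below is one possible interpretation in plain Python; the function name `blended_count`, the period indexing, and the sample numbers are all hypothetical.

```python
def blended_count(batch_counts, realtime_counts, batch_horizon):
    """Sums batch results up to `batch_horizon` and realtime results after it.

    Both inputs map a period index (for example an hour number) to a count.
    """
    old = sum(c for period, c in batch_counts.items() if period <= batch_horizon)
    recent = sum(c for period, c in realtime_counts.items() if period > batch_horizon)
    return old + recent


batch = {0: 100, 1: 120, 2: 90}  # periods fully processed by the batch job
realtime = {2: 88, 3: 40, 4: 7}  # period 2 overlaps; the batch value is used there
total = blended_count(batch, realtime, batch_horizon=2)
print(total)
```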