storm stream processing @twitter Krishna Gade Twitter @krishnagade Sunday, June 16, 13

storm at twitter

DESCRIPTION

Talk given at facebook's analytics@webscale conference. Covers storm basics, system overview, architecture at twitter and current use-cases.

Page 1: storm at twitter

storm
stream processing @twitter

Krishna Gade
Twitter

@krishnagade

Sunday, June 16, 13

Page 2: storm at twitter

what is storm?

storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it happens.

Page 3: storm at twitter

storm v hadoop

storm & hadoop are complementary!

hadoop => big batch processing
storm => fast, reactive, real time processing

Page 4: storm at twitter

origins

• originated at backtype, acquired by twitter in 2011.

• to vastly simplify dealing with queues & workers.

Page 5: storm at twitter

queue-worker model

queues workers

a a a a a

Page 6: storm at twitter

typical workflow

queues queues

workers workers

datastore

Page 7: storm at twitter

problems

• scaling is painful - queue partitioning & worker deploy.

• operational overhead - worker failures & queue backups.

• no guarantees on data processing.

Page 8: storm at twitter

storm

Page 9: storm at twitter

what does storm provide?

• at least once message processing.

• horizontal scalability.

• no intermediate queues.

• less operational overhead.

• “just works”.
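
One way to picture at-least-once processing: the source remembers every message until it is acknowledged and replays anything that fails. The sketch below is plain Java and does not use storm's API; `ReplayingSource`, `emit`, `ack` and `fail` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of at-least-once delivery; not storm's implementation.
class ReplayingSource {
    private final Queue<String> waiting = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>(); // emitted, not yet acked
    private long nextId = 0;

    ReplayingSource(String... messages) {
        for (String m : messages) {
            waiting.add(m);
        }
    }

    // Emits the next waiting message and returns its id, or -1 if none is waiting.
    long emit() {
        String message = waiting.poll();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // Payload of an emitted message that has not been acked or failed yet.
    String payload(long id) {
        return pending.get(id);
    }

    // Processing succeeded: forget the message.
    void ack(long id) {
        pending.remove(id);
    }

    // Processing failed: queue the message again so it is delivered once more.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) {
            waiting.add(message);
        }
    }

    // True once every message has been emitted and acked.
    boolean drained() {
        return waiting.isEmpty() && pending.isEmpty();
    }

    public static void main(String[] args) {
        ReplayingSource source = new ReplayingSource("a", "b");
        long first = source.emit();
        source.fail(first); // "a" goes back on the queue
        long id;
        while ((id = source.emit()) != -1) {
            System.out.println("processed " + source.payload(id));
            source.ack(id);
        }
        System.out.println("drained: " + source.drained());
    }
}
```

Because a failed message is replayed, downstream code may see it more than once; "at least once" is not "exactly once".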

Page 10: storm at twitter

storm primitives

• streams

• spouts

• bolts

• topologies

Page 11: storm at twitter

streams

unbounded sequence of tuples

T T T T T T T T T T T T T T T

Page 12: storm at twitter

spouts

source of streams

A A A A A A A A A A A A

B B B B B B B B B B B B

Page 13: storm at twitter

typical spouts

• read from a kestrel/kafka queue. {tuples = events}

• read from a http server log. {tuples = http requests}

• read from twitter streaming api. {tuples = tweets}

Page 14: storm at twitter

bolts

process input stream - A
produce output stream - B

A A A A A A A A B B B B B B B B

Page 15: storm at twitter

bolts

• filtering tuples in a stream.

• aggregation of tuples.

• joining multiple streams.

• arbitrary functions on streams.

• communication with external caches/dbs.

Page 16: storm at twitter

topology

directed-acyclic-graph of spouts and bolts.

[diagram: spouts s1, s2 feeding bolts b1, b2, b3, b4, b5]
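
A topology can be modelled as a small directed graph. The sketch below is plain Java, not storm's API: `TopologyGraph` and its methods are hypothetical, and the edges in `main` are invented because the arrows on the slide are not recoverable. It orders components with Kahn's algorithm and rejects cycles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a topology as a directed acyclic graph.
class TopologyGraph {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addComponent(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
    }

    // Declares that tuples flow from one component to another.
    void connect(String from, String to) {
        addComponent(from);
        addComponent(to);
        edges.get(from).add(to);
    }

    // Kahn's algorithm: upstream components come first; throws on a cycle.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String name : edges.keySet()) {
            inDegree.put(name, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey()); // no inputs: these are the spouts
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String name = ready.poll();
            order.add(name);
            for (String target : edges.get(name)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("topology contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        TopologyGraph g = new TopologyGraph();
        g.connect("s1", "b1"); // edges are invented for illustration
        g.connect("s1", "b2");
        g.connect("s2", "b2");
        g.connect("b1", "b3");
        g.connect("b2", "b4");
        g.connect("b3", "b5");
        System.out.println(g.topologicalOrder());
    }
}
```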

Page 17: storm at twitter

storm cluster

[diagram: storm cluster]

master node: nimbus

coordination: ZK

slave nodes: two supervisors, each running workers w1 w2 w3 w4

arrow labels: topology submission, topology map, sync code

Page 18: storm at twitter

nimbus

• master node.

• manages the topologies.

• similar to the job tracker in hadoop.

$ storm jar myapp.jar com.twitter.MyTopology demo

Page 19: storm at twitter

supervisor

• runs on slave nodes.

• co-ordinates with zookeeper.

• manages workers.

Page 20: storm at twitter

worker

[diagram: a worker is a jvm process running several executors, each executing one or more tasks]

Page 21: storm at twitter

recap

• worker - process that executes a subset of a topology.

• executor - a thread spawned by a worker.

• task - performs the actual data processing.
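
To make the worker / executor / task relationship concrete, the sketch below spreads task ids over executor threads. It is plain Java with hypothetical names (`TaskLayout`, `tasksPerExecutor`); the even, contiguous split and the cap of one executor per task are assumptions made for illustration, not a description of storm's scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout helper: which task ids does each executor thread run?
class TaskLayout {
    // Splits task ids 0..tasks-1 into contiguous groups, one group per executor.
    static List<List<Integer>> tasksPerExecutor(int tasks, int executors) {
        if (tasks < 1 || executors < 1) {
            throw new IllegalArgumentException("tasks and executors must be positive");
        }
        // Assumption: an executor with no task would be idle, so cap the thread count.
        int threads = Math.min(tasks, executors);
        List<List<Integer>> layout = new ArrayList<>();
        int next = 0;
        for (int e = 0; e < threads; e++) {
            // The first (tasks % threads) executors take one extra task.
            int size = tasks / threads + (e < tasks % threads ? 1 : 0);
            List<Integer> group = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                group.add(next++);
            }
            layout.add(group);
        }
        return layout;
    }

    public static void main(String[] args) {
        System.out.println(tasksPerExecutor(4, 2)); // [[0, 1], [2, 3]]
        System.out.println(tasksPerExecutor(5, 3)); // [[0, 1], [2, 3], [4]]
    }
}
```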

Page 22: storm at twitter

stream grouping

• shuffle grouping - random distribution of tuples.

• fields grouping - sends tuples with the same field value to the same task.

• all grouping - replicates to all tasks.

• global grouping - sends the entire stream to one task.
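
Each grouping is, in effect, a rule for choosing the target task(s) of a tuple. The sketch below expresses the four rules in plain Java; the `Groupings` class is hypothetical, and hashing modulo the task count is an assumed way to realise a fields grouping, not a statement about storm's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical routing rules; each returns the index of the receiving task(s).
class Groupings {
    // shuffle: any task, chosen at random.
    static int shuffle(Random rand, int numTasks) {
        return rand.nextInt(numTasks);
    }

    // fields: equal field values always map to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // all: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    // global: the whole stream goes to a single task (here, the lowest index).
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 12;
        System.out.println("shuffle -> " + shuffle(new Random(), numTasks));
        System.out.println("fields(\"storm\") -> " + fields("storm", numTasks));
        System.out.println("all -> " + all(3));
        System.out.println("global -> " + global());
    }
}
```

`Math.floorMod` is used rather than `%` so that a negative hash code still yields a valid task index.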

Page 23: storm at twitter

streaming word-count

    TopologyBuilder builder = new TopologyBuilder();
    // third argument is the parallelism hint (initial number of executors)
    builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
    // tweets are spread randomly across the parse bolt's tasks
    builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
           .shuffleGrouping("tweet_spout")
           .setNumTasks(2);
    // the same word always reaches the same count task
    builder.setBolt("count_bolt", new WordCountBolt(), 12)
           .fieldsGrouping("parse_bolt", new Fields("word"));

    Config config = new Config();
    config.setNumWorkers(3);
    StormSubmitter.submitTopology("demo", config, builder.createTopology());

Page 24: storm at twitter

tweet spout

    class RandomTweetSpout extends BaseRichSpout {
        SpoutOutputCollector collector;
        Random rand;
        String[] tweets = new String[] {
            "@jkrums: There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
            "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
            ...
        };

        ....

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            // emit one randomly chosen tweet per call
            String tweet = tweets[rand.nextInt(tweets.length)];
            collector.emit(new Values(tweet));
        }
    }

Page 25: storm at twitter

parse bolt

    class ParseTweetBolt extends BaseBasicBolt {

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String tweet = tuple.getString(0);
            // emit one tuple per space-separated word
            for (String word : tweet.split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

Page 26: storm at twitter

word count bolt

    class WordCountBolt extends BaseBasicBolt {
        // running count per word, local to this task
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

Page 27: storm at twitter

word-count topology

RandomTweetSpout ParseTweetBolt WordCountBolt

shuffle grouping fields grouping
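
The data flow of this topology can be traced without a cluster. The following plain-Java simulation is a sketch only: `WordCountSimulation` is a hypothetical name, and routing by hash modulo the task count stands in for the fields grouping. It shows why that grouping matters: each word lands in exactly one counter's local map, so the per-task counts are complete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process simulation of the parse and count steps.
class WordCountSimulation {
    // Returns one word -> count map per simulated count task.
    static List<Map<String, Integer>> run(List<String> tweets, int countTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < countTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String tweet : tweets) {
            for (String word : tweet.split(" ")) {                       // parse step
                int target = Math.floorMod(word.hashCode(), countTasks); // grouping on "word"
                tasks.get(target).merge(word, 1, Integer::sum);          // count step
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("storm is fast", "storm is fun");
        List<Map<String, Integer>> tasks = run(tweets, 3);
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("count task " + i + ": " + tasks.get(i));
        }
    }
}
```

With a shuffle grouping in place of the hash routing, occurrences of one word would be scattered over several maps and no single task would hold its full count.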

Page 28: storm at twitter

how do we run storm @twitter?

Page 29: storm at twitter

storm on mesos

node node node node

mesos

storm (production)

storm (dev)

we run multiple instances of storm on the same cluster via mesos.

mesos provides efficient resource isolation and sharing across distributed frameworks such as storm.

Page 30: storm at twitter

topology isolation

the isolation scheduler solves the problem of multi-tenancy (resource contention between topologies) by providing full isolation between topologies.

Page 31: storm at twitter

topology isolation

• shared pool - multiple topologies can run on the same host.

• isolated pool - dedicated set of hosts to run a single topology.
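
The split between the two pools can be sketched as a simple host allocation: isolated topologies each reserve a number of dedicated hosts, and whatever remains forms the shared pool. The Java below is illustrative only; `IsolationPools`, the host names and the request map are hypothetical and do not reflect the isolation scheduler's actual code or configuration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical allocation of cluster hosts into isolated pools and a shared pool.
class IsolationPools {
    final Map<String, List<String>> isolated = new LinkedHashMap<>();
    final List<String> shared = new ArrayList<>();

    // requests maps a topology name to the number of dedicated hosts it needs.
    IsolationPools(List<String> hosts, Map<String, Integer> requests) {
        Deque<String> free = new ArrayDeque<>(hosts);
        for (Map.Entry<String, Integer> request : requests.entrySet()) {
            if (free.size() < request.getValue()) {
                throw new IllegalStateException("not enough hosts for " + request.getKey());
            }
            List<String> dedicated = new ArrayList<>();
            for (int i = 0; i < request.getValue(); i++) {
                dedicated.add(free.poll()); // this host now runs only one topology
            }
            isolated.put(request.getKey(), dedicated);
        }
        shared.addAll(free); // remaining hosts can run several topologies each
    }

    public static void main(String[] args) {
        Map<String, Integer> requests = new LinkedHashMap<>();
        requests.put("joe", 2);
        requests.put("jane", 1);
        IsolationPools pools = new IsolationPools(
                Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6"), requests);
        System.out.println("isolated: " + pools.isolated);
        System.out.println("shared:   " + pools.shared);
    }
}
```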

Page 32: storm at twitter

topology isolation

shared pool

storm cluster

Page 33: storm at twitter

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

Page 34: storm at twitter

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

Page 35: storm at twitter

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

Page 36: storm at twitter

topology isolation

X

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

host failure

Page 37: storm at twitter

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

repair host

add host

Page 38: storm at twitter

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

add to shared pool

Page 39: storm at twitter

numbers

• benchmarked at a million tuples processed per second per node.

• running 30 topologies in a 200 node cluster.

• processing 50 billion messages a day with an average complete latency under 50 ms.
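
For a sense of scale, the daily figure can be turned into an average rate. The calculation below assumes traffic is spread evenly over the day and over all 200 nodes, which real workloads are not, and it treats one message as one unit of work; `ThroughputEstimate` is a hypothetical helper.

```java
// Hypothetical back-of-the-envelope conversion of the slide's daily volume.
class ThroughputEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average messages per second for a given daily volume.
    static double averagePerSecond(long messagesPerDay) {
        return (double) messagesPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double clusterRate = averagePerSecond(50_000_000_000L);
        double perNodeRate = clusterRate / 200; // assumes an even spread over 200 nodes
        System.out.printf("cluster: %.0f msg/s, per node: %.0f msg/s%n",
                clusterRate, perNodeRate);
    }
}
```

That works out to roughly 579 thousand messages per second across the cluster, or about 2.9 thousand per node on average, well below the benchmarked per-node figure.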

Page 40: storm at twitter

storm use-cases @twitter

Page 41: storm at twitter

stream processing applications

tweets

favorites, retweets

impressions

twitter streams

storm

spout

bolt

bolt

$$$$

realtime dashboards

new features

Page 42: storm at twitter

current use-cases

• discovery of emerging topics/stories.

• online learning of tweet features for search result ranking.

• realtime analytics for ads.

• internal log processing.

Page 43: storm at twitter

tweet scoring pipeline

tweets

data streams

impressions

interactions

storm topology

graph store

metadata store

join: tweets, impressions

join: tweets, interactions

last 7 days of: tweet -> feature_val, feature_type, timestamp

persistent store: tweet -> feature_val, feature_type, timestamp

thrift service

cassandra

twemcache

input: tweet id
output: score

write tweet features
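
The "join: tweets, impressions" step can be pictured as a keyed lookup on tweet id. The sketch below is plain Java under stated assumptions: `TweetImpressionJoin`, `Feature` and the choice of a running impression count as the feature value are all hypothetical, and only the record shape (feature_val, feature_type, timestamp) comes from the slide.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical stream join keyed on tweet id; not the pipeline's actual code.
class TweetImpressionJoin {
    // Record shaped after the slide: tweet -> feature_val, feature_type, timestamp.
    static final class Feature {
        final long tweetId;
        final double featureVal;
        final String featureType;
        final long timestamp;

        Feature(long tweetId, double featureVal, String featureType, long timestamp) {
            this.tweetId = tweetId;
            this.featureVal = featureVal;
            this.featureType = featureType;
            this.timestamp = timestamp;
        }
    }

    private final Set<Long> knownTweets = new HashSet<>();
    private final Map<Long, Long> impressionCounts = new HashMap<>();

    // Tweet stream: remember the tweet so later impressions can be matched.
    void onTweet(long tweetId) {
        knownTweets.add(tweetId);
    }

    // Impression stream: emit a feature only when the tweet has been seen.
    Optional<Feature> onImpression(long tweetId, long timestamp) {
        if (!knownTweets.contains(tweetId)) {
            return Optional.empty();
        }
        long count = impressionCounts.merge(tweetId, 1L, Long::sum);
        return Optional.of(new Feature(tweetId, count, "impressions", timestamp));
    }

    public static void main(String[] args) {
        TweetImpressionJoin join = new TweetImpressionJoin();
        join.onTweet(1L);
        System.out.println(join.onImpression(2L, 10L).isPresent()); // unknown tweet
        System.out.println(join.onImpression(1L, 11L).get().featureVal);
    }
}
```

A real join would also need to expire old tweets; the slide's "last 7 days" suggests a bounded window, which this sketch omits.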

Page 44: storm at twitter

road ahead

• auto scaling.

• persistent bolts.

• better grouping schemes.

• replicated computation.

• higher-level abstractions.

Page 45: storm at twitter

companies using storm
