Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012


Slides from my tutorial at Denormalized London on 21 Sept 2012


Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and

Love Counting

Analytics

Live & historical aggregates... Trends... Drill downs and roll ups

Combining “big” and “real-time” is hard


What is Realtime Analytics?

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html

• Push processing into the ingest phase
• Make queries fast

(Diagram: Twitter → tweets → ? → counter updates)

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters

Preparing the data

Step 1: Get a feed of the tweets

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
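The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's actual code: the tokeniser, the HHMM bucket format (12:34 → 1234) and the lack of stemming (the slide's `rock` vs `rocks`) are all simplifying assumptions.

```python
import re
from datetime import datetime

def tokenise(text):
    # Lowercase, keep word-like tokens, strip '#' / '@' prefixes.
    # (Illustrative tokeniser; no stemming, so 'rocks' stays 'rocks'.)
    return [t.lstrip('#@') for t in re.findall(r'[#@]?\w+', text.lower())]

def counter_updates(timestamp, text):
    # 'Small' time bucket as HHMM, e.g. 12:34 -> 1234 as in the slide.
    bucket = timestamp.hour * 100 + timestamp.minute
    return [((bucket, token), 1) for token in tokenise(text)]

updates = counter_updates(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
# -> [((1234, 'man'), 1), ((1234, 'acunu'), 1), ((1234, 'rocks'), 1)]
```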

Querying

Step 1: Do a range query
Step 2: Result table
Step 3: Plot pretty graph

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

(Graph: mentions per month, May–Nov, y-axis 0–90)
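The query step can be sketched in memory: model the counter table as a list of ((timestamp, term), count) rows sorted by key, and scan the rows between the start and end keys. The sample rows are the slide's example data; in Cassandra this scan is the range query discussed next.

```python
# Slide's example result table, as ((timestamp, term), count) rows.
# ISO timestamps are an assumption so plain string comparison sorts correctly.
rows = [
    (('2011-05-01 00:01', 'acunu'), 3),
    (('2011-05-01 00:02', 'acunu'), 5),
]

def range_query(rows, term, start, finish):
    # Keep rows for `term` whose timestamp falls within [start, finish].
    return [(key, count) for key, count in rows
            if key[1] == term and start <= key[0] <= finish]

result = range_query(rows, 'acunu', '2011-05-01', '2011-05-31')
# -> both sample rows
```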

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so range queries over rows are not possible

• Could manually work out each row in range, do lots of point gets

• This would suck - each query would be hundreds of random IOs on disk

• Need to use wide rows: the range query becomes a column slice, so each query is ~1 IO - denormalisation

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

So instead of this...

We do this:

Key                 00:01   00:02   ...
[01/05/11, acunu]     3       5     ...
[02/05/11, acunu]    12       4     ...
...                  ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
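The big/small bucket split can be sketched as a key function. This is an illustrative sketch, not the tutorial's code: the dd/mm/yy row-key format and HH:MM column-key format are assumptions chosen to match the table above.

```python
from datetime import datetime

def wide_row_keys(timestamp, term):
    # 'Big' bucket (day) + term goes in the row key, so one day's counts
    # for a term live in a single wide row...
    row_key = (timestamp.strftime('%d/%m/%y'), term)
    # ...and the 'small' bucket (minute of day) is the column key,
    # so a day's worth of counters is read with one column slice.
    column_key = timestamp.strftime('%H:%M')
    return row_key, column_key

keys = wide_row_keys(datetime(2011, 5, 1, 0, 1), 'acunu')
# -> (('01/05/11', 'acunu'), '00:01')
```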

Now it’s your turn...

1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/Ruqlt

3. Cluster them up

4. Get the code - http://goo.gl/VxXKB

5. Implement the missing bits!

6. (Prizes for those who spot bugs!)

http://goo.gl/O9hkv

Get some Cassandra VMs

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
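A minimal in-memory sketch of the two functions to implement, with `cassandra` modelled as a dict of wide rows `{row_key: {column_key: count}}`. The real `core.py` would issue counter increments and a column slice against the cluster instead; the tweet format, ISO date keys, and whitespace tokenisation here are all assumptions.

```python
from collections import defaultdict

def insert_tweet(cassandra, tweet):
    # tweet: {'created_at': 'YYYY-MM-DD HH:MM', 'text': '...'}
    # (ISO dates assumed so plain string comparison orders correctly.)
    day, minute = tweet['created_at'].split()
    for token in tweet['text'].lower().split():
        cassandra[(day, token)][minute] += 1   # counter increment

def do_query(cassandra, term, start, finish):
    # Returns {day: total mentions of term} for days in [start, finish].
    return {day: sum(cols.values())
            for (day, t), cols in cassandra.items()
            if t == term and start <= day <= finish}

store = defaultdict(lambda: defaultdict(int))
insert_tweet(store, {'created_at': '2011-05-01 00:01', 'text': 'acunu rocks'})
insert_tweet(store, {'created_at': '2011-05-01 00:02', 'text': 'go acunu'})
result = do_query(store, 'acunu', '2011-05-01', '2011-05-31')
# -> {'2011-05-01': 2}
```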

Check your data

bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

Extensions

UI

• Pretty graphs
• Automatically update periodically?
• Search multiple terms

Painbird

• Mentions of multiple terms
• Sentiment analysis - http://www.nltk.org/
• Filtering by multiple fields (geo + keyword)

Extensions

Recommended