Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012


Slides from my tutorial at Denormalized London on 21 Sept 2012


Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and

Love Counting

Analytics

Live & historical aggregates... Trends... Drill downs and roll ups

Combining “big” and “real-time” is hard


What is Realtime Analytics?

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html

• Push processing into the ingest phase
• Make queries fast

(Diagram: Twitter → tweets → ? → counter updates)

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters

Preparing the data

Step 1: Get a feed of the tweets

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
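The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's actual code: the tokeniser, the HHMM bucket format (12:34 → 1234) and the lack of stemming (the slide's `rock` vs `rocks`) are all simplifying assumptions.

```python
import re
from datetime import datetime

def tokenise(text):
    # Lowercase, keep word-like tokens, strip '#' / '@' prefixes.
    # (Illustrative tokeniser; no stemming, so 'rocks' stays 'rocks'.)
    return [t.lstrip('#@') for t in re.findall(r'[#@]?\w+', text.lower())]

def counter_updates(timestamp, text):
    # 'Small' time bucket as HHMM, e.g. 12:34 -> 1234 as in the slide.
    bucket = timestamp.hour * 100 + timestamp.minute
    return [((bucket, token), 1) for token in tokenise(text)]

updates = counter_updates(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
# -> [((1234, 'man'), 1), ((1234, 'acunu'), 1), ((1234, 'rocks'), 1)]
```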

Querying

Step 1: Do a range query
Step 2: Result table
Step 3: Plot pretty graph

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

(Graph: mentions per month, May–Nov, y-axis 0–90)
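The query step can be sketched in memory: model the counter table as a list of ((timestamp, term), count) rows sorted by key, and scan the rows between the start and end keys. The sample rows are the slide's example data; in Cassandra this scan is the range query discussed next.

```python
# Slide's example result table, as ((timestamp, term), count) rows.
# ISO timestamps are an assumption so plain string comparison sorts correctly.
rows = [
    (('2011-05-01 00:01', 'acunu'), 3),
    (('2011-05-01 00:02', 'acunu'), 5),
]

def range_query(rows, term, start, finish):
    # Keep rows for `term` whose timestamp falls within [start, finish].
    return [(key, count) for key, count in rows
            if key[1] == term and start <= key[0] <= finish]

result = range_query(rows, 'acunu', '2011-05-01', '2011-05-31')
# -> both sample rows
```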

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so range queries over rows are not possible

• Could manually work out each row in range, do lots of point gets

• This would suck - each query would be hundreds of random IOs on disk

• Need to use wide rows: the range query becomes a column slice, so each query is ~1 IO - denormalisation

Key #Mentions

[01/05/11 00:01, acunu] 3

[01/05/11 00:02, acunu] 5

... ...

So instead of this...

We do this:

Key                 00:01   00:02   ...
[01/05/11, acunu]     3       5     ...
[02/05/11, acunu]    12       4     ...
...                  ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
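The big/small bucket split can be sketched as a key function. This is an illustrative sketch, not the tutorial's code: the dd/mm/yy row-key format and HH:MM column-key format are assumptions chosen to match the table above.

```python
from datetime import datetime

def wide_row_keys(timestamp, term):
    # 'Big' bucket (day) + term goes in the row key, so one day's counts
    # for a term live in a single wide row...
    row_key = (timestamp.strftime('%d/%m/%y'), term)
    # ...and the 'small' bucket (minute of day) is the column key,
    # so a day's worth of counters is read with one column slice.
    column_key = timestamp.strftime('%H:%M')
    return row_key, column_key

keys = wide_row_keys(datetime(2011, 5, 1, 0, 1), 'acunu')
# -> (('01/05/11', 'acunu'), '00:01')
```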

Now it’s your turn...

1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/Ruqlt

3. Cluster them up

4. Get the code - http://goo.gl/VxXKB

5. Implement the missing bits!

6. (Prizes for those who spot bugs!)

http://goo.gl/O9hkv

Get some Cassandra VMs

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
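A minimal in-memory sketch of the two functions to implement, with `cassandra` modelled as a dict of wide rows `{row_key: {column_key: count}}`. The real `core.py` would issue counter increments and a column slice against the cluster instead; the tweet format, ISO date keys, and whitespace tokenisation here are all assumptions.

```python
from collections import defaultdict

def insert_tweet(cassandra, tweet):
    # tweet: {'created_at': 'YYYY-MM-DD HH:MM', 'text': '...'}
    # (ISO dates assumed so plain string comparison orders correctly.)
    day, minute = tweet['created_at'].split()
    for token in tweet['text'].lower().split():
        cassandra[(day, token)][minute] += 1   # counter increment

def do_query(cassandra, term, start, finish):
    # Returns {day: total mentions of term} for days in [start, finish].
    return {day: sum(cols.values())
            for (day, t), cols in cassandra.items()
            if t == term and start <= day <= finish}

store = defaultdict(lambda: defaultdict(int))
insert_tweet(store, {'created_at': '2011-05-01 00:01', 'text': 'acunu rocks'})
insert_tweet(store, {'created_at': '2011-05-01 00:02', 'text': 'go acunu'})
result = do_query(store, 'acunu', '2011-05-01', '2011-05-31')
# -> {'2011-05-01': 2}
```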

Check your data

bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

Extensions

UI

• Pretty graphs
• Automatically update periodically?
• Search multiple terms

Painbird

• Mentions of multiple terms
• Sentiment analysis - http://www.nltk.org/
• Filtering by multiple fields (geo + keyword)

Extensions

Recommended