Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie


DESCRIPTION

The latest version of my talk, as given at the NoSQL Roadshow Amsterdam, 29th Nov 2012


Page 1: Realtime Analytics with Apache Cassandra

Realtime Analytics with Apache Cassandra

Tom Wilkie, Founder & CTO, Acunu Ltd

@tom_wilkie

Page 2: Realtime Analytics with Apache Cassandra

Analytics

101

• BigTable-style data model combined with Dynamo-style consistency
• Simple queries: put, get, range queries
• Multi-master architecture: no SPOF
• Tunable consistency, multi-DC aware
• Optimised for random writes & random range queries
• Atomic counters, wide rows, composite keys

Page 3: Realtime Analytics with Apache Cassandra

Analytics

Live & historical aggregates... Trends... Drill downs and roll ups

Combining “big” and “real-time” is hard

Page 4: Realtime Analytics with Apache Cassandra

Analytics

Solution / Con

Scalability $$$

Not realtime

Spartan query semantics => complex, DIY solutions

Page 5: Realtime Analytics with Apache Cassandra

Analytics

Example I

eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html
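A rough check of the batch numbers above, as a sketch only. The daily volume (about 140 million tweets) and the 140 bytes of text per tweet are assumptions made for this estimate; neither figure is stated on the slide.

```python
from datetime import date

TWEETS_PER_DAY = 140_000_000   # assumption: average daily tweet volume
BYTES_PER_TWEET = 140          # assumption: one byte per character, no metadata

def batch_scan_size(start: date, end: date) -> tuple[int, int]:
    """Return (tweet count, bytes) that a full scan over [start, end] would touch."""
    days = (end - start).days + 1          # inclusive of both end dates
    tweets = days * TWEETS_PER_DAY
    return tweets, tweets * BYTES_PER_TWEET

tweets, size = batch_scan_size(date(2011, 5, 1), date(2011, 11, 30))
print(f"{tweets / 1e9:.1f} billion tweets, {size / 1e12:.1f} TB")
```

Under those assumptions the scan covers 214 days, which lands close to the ~30 billion tweets and ~4.2 TB quoted on the slide.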

Page 6: Realtime Analytics with Apache Cassandra

Analytics

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters

Page 7: Realtime Analytics with Apache Cassandra

Analytics

Preparing the data

Step 1: Get a feed of the tweets

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
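The three steps can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: an in-memory `Counter` stands in for Cassandra counter columns, and `tokenise`, `minute_bucket` and `ingest` are hypothetical helper names. The slide stems "rocks" to "rock"; this sketch skips stemming.

```python
import re
from collections import Counter

# In-memory stand-in for counter columns: (minute bucket, token) -> count.
counters = Counter()

def tokenise(text):
    """Lower-case the tweet and split it into word tokens (no stemming)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def minute_bucket(timestamp):
    """Map 'HH:MM:SS' to the per-minute bucket 'HHMM', e.g. '12:34:04' -> '1234'."""
    hh, mm, _ss = timestamp.split(":")
    return hh + mm

def ingest(timestamp, text):
    """Increment one counter per distinct token in the tweet's minute bucket."""
    bucket = minute_bucket(timestamp)
    for token in set(tokenise(text)):      # a token counts once per tweet
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
print(sorted(counters.items()))
```

Counting each token once per tweet is a design choice made here so that a counter reads as "number of tweets mentioning the token".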

Page 8: Realtime Analytics with Apache Cassandra

Analytics

Querying

Step 1: Do a range query

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Step 2: Result table

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

Step 3: Plot pretty graph

[Graph: #mentions (axis 0 to 90) by month, May to Nov]

Page 9: Realtime Analytics with Apache Cassandra

Analytics

[Diagram: keys k1, k2, k3, k4]

Cassandra keys distributed based on hash of row key, ie randomly
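To see why hashing scatters neighbouring keys, here is a simplified sketch. The four node names are made up, MD5 is used only as a convenient hash, and the modulo placement ignores the token ring and replication that a real cluster uses.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical 4-node cluster

def node_for(row_key: str) -> str:
    """Choose a node from a hash of the row key (simplified placement)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# Consecutive minutes for the same token generally land on unrelated nodes,
# so a range over row keys cannot be served by one sequential read.
for minute in ("00:01", "00:02", "00:03", "00:04"):
    key = f"01/05/11 {minute}, acunu"
    print(key, "->", node_for(key))
```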

Page 10: Realtime Analytics with Apache Cassandra

Analytics

Instead of this...

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

We do this

Key                 00:01   00:02   ...
[01/05/11, acunu]   3       5       ...
[02/05/11, acunu]   12      4       ...
...                 ...     ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
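A sketch of this layout, assuming a day as the "big" bucket and a minute as the "small" one. A dict of dicts stands in for wide rows, and `increment` and `mentions_per_day` are hypothetical names. Each day in the query range becomes one row read.

```python
from collections import defaultdict
from datetime import date, timedelta

# rows[(day, token)][minute] -> count; one wide row per (day, token).
rows = defaultdict(lambda: defaultdict(int))

def increment(day: date, minute: str, token: str, by: int = 1) -> None:
    """Bump the minute column inside the (day, token) row."""
    rows[(day, token)][minute] += by

def mentions_per_day(token: str, start: date, end: date) -> dict:
    """Read one row per day in [start, end] and sum its minute columns."""
    result = {}
    day = start
    while day <= end:
        result[day] = sum(rows.get((day, token), {}).values())
        day += timedelta(days=1)
    return result

# Values taken from the table above.
increment(date(2011, 5, 1), "00:01", "acunu", 3)
increment(date(2011, 5, 1), "00:02", "acunu", 5)
increment(date(2011, 5, 2), "00:01", "acunu", 12)
increment(date(2011, 5, 2), "00:02", "acunu", 4)
print(mentions_per_day("acunu", date(2011, 5, 1), date(2011, 5, 2)))
```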

Page 11: Realtime Analytics with Apache Cassandra

Analytics

Towards a more general solution...

(Example II)

Page 12: Realtime Analytics with Apache Cassandra

Analytics

count

grouped by ...

day

count

distinct (session)

count

... geography

... browser

avg(duration)

Page 13: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3221 :00→22 :01→19 :02→104 ...

... ...

UK all→228 user01→1 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1904 ...

∅ all→87314 UK→238 US→354 ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}

Page 14: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}
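The before and after tables differ by exactly one increment per affected counter. A minimal sketch of that update, assuming row keys are filter values and columns are group-by values; `record` and `EVERYTHING` are hypothetical names, and only a handful of the slide's cells are seeded.

```python
from collections import defaultdict

EVERYTHING = "∅"   # row key for "no filter", as written on the slide

# table[row_key][column] -> count; stands in for counter rows.
table = defaultdict(lambda: defaultdict(int))

def record(event: dict) -> None:
    """Increment every pre-aggregated counter this event contributes to."""
    hour = event["time"][:2] + ":00"       # "22:02" -> "22:00"
    minute = ":" + event["time"][3:]       # "22:02" -> ":02"
    geo, user = event["geography"], event["cust_id"]
    for row, column in [
        (hour, "all"), (hour, minute),     # per hour: total and by minute
        (geo, "all"), (geo, user),         # per geography: total and by user
        (f"{geo}, {hour}", "all"),         # per geography and hour: total
        (EVERYTHING, "all"), (EVERYTHING, geo),   # global total and by geography
    ]:
        table[row][column] += 1

# Seed some "before" values from the slide, then apply the event.
table["22:00"].update({"all": 3221, ":02": 104})
table["UK"].update({"all": 228, "user01": 1})
table["UK, 22:00"]["all"] = 1904
table[EVERYTHING].update({"all": 87314, "UK": 238})

record({"cust_id": "user01", "session_id": 102, "geography": "UK",
        "browser": "IE", "time": "22:02"})
print({row: dict(cols) for row, cols in table.items()})
```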

Page 15: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3221 :00→22 :01→19 :02→104 ...

... ...

UK all→228 user01→1 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1904 ...

∅ all→87314 UK→238 US→354 ...


Page 16: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00 count(*)

Page 17: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00 count(*)

where time 22:00-23:00, group by minute

Page 18: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00 count(*)

where time 22:00-23:00, group by minute

where geography=UK group all by user,

Page 19: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00 count(*)

where time 22:00-23:00, group by minute

where geography=UK group all by user,

count all

Page 20: Realtime Analytics with Apache Cassandra

Analytics

21:00 all→1345 :00→45 :01→62 :02→87 ...

22:00 all→3222 :00→22 :01→19 :02→105 ...

... ...

UK all→229 user01→2 user14→12 user99→7 ...

US all→354 user01→4 user04→8 user56→17 ...

...

UK, 22:00 all→1905 ...

∅ all→87315 UK→239 US→354 ...

where time 21:00-22:00 count(*)

where time 22:00-23:00, group by minute

where geography=UK group all by user,

count all

group all by geo
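Each of these queries reduces to reading one row. A sketch over the cells visible on the slide (columns elided with "..." are left out, so the grouped results here are partial); `count` and `group_by` are hypothetical helper names.

```python
# Only the cells shown on the slide; elided columns are omitted.
table = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    "UK, 22:00": {"all": 1905},
    "∅": {"all": 87315, "UK": 239, "US": 354},
}

def count(row_key: str) -> int:
    """count(*) for a filter: read the row's 'all' column."""
    return table[row_key]["all"]

def group_by(row_key: str) -> dict:
    """Grouped counts for a filter: every column in the row except 'all'."""
    return {col: n for col, n in table[row_key].items() if col != "all"}

print(count("21:00"))      # where time 21:00-22:00 count(*)
print(group_by("22:00"))   # where time 22:00-23:00, group by minute
print(group_by("UK"))      # where geography=UK group all by user
print(count("∅"))          # count all
print(group_by("∅"))       # group all by geo
```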

Page 21: Realtime Analytics with Apache Cassandra

Analytics

What about more than just aggregates?

Page 22: Realtime Analytics with Apache Cassandra

Analytics

Approximate Analytics

Exact

Large Scale

Real-time

Page 23: Realtime Analytics with Apache Cassandra

Analytics

Count Distinct

Plan A: keep a list of all the things you’ve seen; count them at query time

Quick to update ... but at scale ...

• Takes lots of space
• Takes a long time to query
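Plan A in miniature, as a sketch. A Python set is used in place of a list so duplicates collapse on insert; `observe` and `count_distinct` are hypothetical names. Memory grows with the number of distinct items, which is the scaling problem noted above.

```python
seen = set()   # every distinct item ever observed is kept in memory

def observe(item: str) -> None:
    """Cheap update: remember the item."""
    seen.add(item)

def count_distinct() -> int:
    """Exact answer, at the cost of storing every distinct item."""
    return len(seen)

for session in ["s1", "s2", "s1", "s3", "s2"]:
    observe(session)
print(count_distinct())
```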

Page 24: Realtime Analytics with Apache Cassandra

Analytics

Approximate Distinct

item   hash             leading zeroes   max so far
x      00101001110...   2                2
y      11010100111...   0                2
z      00011101011...   3                3
...

max # leading zeroes seen so far

... to see a max of M takes about 2^M items
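A sketch of the single-register idea: hash each item, track the longest run of leading zeroes, and guess 2^M distinct items. SHA-1 truncated to 32 bits is an arbitrary choice for illustration, and `MaxZeroesEstimator` is a hypothetical name. The estimate is very noisy, since it can only ever be a power of two.

```python
import hashlib

HASH_BITS = 32

def hash32(item: str) -> int:
    """Hash an item to a 32-bit integer (first 4 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:4], "big")

def leading_zeroes(value: int, bits: int = HASH_BITS) -> int:
    """Number of zero bits before the first 1 in a fixed-width value."""
    return bits - value.bit_length()

class MaxZeroesEstimator:
    def __init__(self) -> None:
        self.max_zeroes = 0                # M: the only state that is kept

    def add(self, item: str) -> None:
        self.max_zeroes = max(self.max_zeroes, leading_zeroes(hash32(item)))

    def estimate(self) -> int:
        return 2 ** self.max_zeroes        # seeing M zeroes takes about 2^M items

est = MaxZeroesEstimator()
for i in range(10_000):
    est.add(f"user{i}")
print(est.max_zeroes, est.estimate())
```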

Page 25: Realtime Analytics with Apache Cassandra

Analytics

Approximate Distinct

to reduce var, average over m=2^k sub-streams

item   hash             index, zeroes   max so far
                                        0,0,0,0
x      00101001110...   0, 0            0,0,0,0
y      11010100111...   3, 1            0,0,0,1
z      00011101011...   0, 1            1,0,0,1
...

take the harmonic mean
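The sub-stream version can be sketched in the style of HyperLogLog. The slide does not name the algorithm, so treat the details as assumptions: the bias constant, the small-range correction, the 64-bit SHA-1 prefix, and the class name `ApproxDistinct`. Each register stores zeroes + 1 rather than the raw zero count shown on the slide.

```python
import hashlib
import math

class ApproxDistinct:
    """HyperLogLog-style estimator with m = 2^k registers (use k >= 7)."""

    def __init__(self, k: int = 10) -> None:
        self.k = k
        self.m = 1 << k
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.k)                   # first k bits pick the sub-stream
        rest = h & ((1 << (64 - self.k)) - 1)        # remaining bits
        rank = (64 - self.k) - rest.bit_length() + 1 # leading zeroes + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        # Harmonic mean of 2^register across sub-streams, scaled by alpha * m.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:            # small-range correction
            return self.m * math.log(self.m / empty)
        return raw

sketch = ApproxDistinct(k=10)
for i in range(50_000):
    sketch.add(f"session{i % 20_000}")               # 20,000 distinct values
print(round(sketch.estimate()))
```

With m = 1024 registers the state is about a thousand small integers, however many items are added, and the typical error is a few percent.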

Page 26: Realtime Analytics with Apache Cassandra

Analytics

Okay... now what?

Page 27: Realtime Analytics with Apache Cassandra

Analytics

• Aggregate incrementally, on the fly
• Store live + historical aggregates

[Diagram labels: events (click stream, sensor data, etc); Acunu Analytics; counter updates]

Page 28: Realtime Analytics with Apache Cassandra

Analytics

10x vs MySQL...

Page 29: Realtime Analytics with Apache Cassandra

Analytics

Dashboard UI

Page 30: Realtime Analytics with Apache Cassandra

Analytics

“Up and running in about 4 hours”

“We found out a competitor was scraping our data”

“We keep discovering use cases we hadn’t thought of”

http://vimeo.com/54026096

Page 31: Realtime Analytics with Apache Cassandra

Analytics

"We're still finding new and interesting use cases, which just aren't possible with our current datastores."

"Quick, efficient and easy to get started"