Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie



101

• BigTable-style data model combined with Dynamo-style consistency

• Simple queries - put, get, range queries

• Multi-master architecture: no SPOF

• Tunable consistency, multi-DC aware

• Optimised for random writes & random range queries

• Atomic counters, wide rows, composite keys

Live & historical aggregates... Trends... Drill downs and roll ups

Combining “big” and “real-time” is hard

Solution / Con

• Scalability $$$

• Not realtime

• Spartan query semantics => complex, DIY solutions

Example I

eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data

http://blog.twitter.com/2011/03/numbers.html
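The ~30 billion and ~4.2 TB figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes an average of roughly 140 million tweets per day and 140 bytes per tweet; both constants are assumptions made for illustration and are not stated on the slide.

```python
from datetime import date

TWEETS_PER_DAY = 140_000_000   # assumption: average daily volume in 2011
BYTES_PER_TWEET = 140          # assumption: text only, one byte per character

def batch_workload(start, end):
    """Return (tweets, bytes) a batch job would have to scan for [start, end]."""
    days = (end - start).days + 1          # inclusive of both end points
    tweets = days * TWEETS_PER_DAY
    return tweets, tweets * BYTES_PER_TWEET

tweets, size = batch_workload(date(2011, 5, 1), date(2011, 11, 30))
print(f"{tweets / 1e9:.1f} billion tweets, {size / 1e12:.1f} TB")
```

Under these assumptions the batch job has to scan the full seven months of tweets for every query, which is the cost the counter-based approach avoids.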

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters

Preparing the data

Step 1: Get a feed of the tweets

  12:32:15 I like #trafficlights
  12:33:43 Nobody expects...
  12:33:49 I ate a #bee; woe is...
  12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

  [1234, man] +1
  [1234, acunu] +1
  [1234, rock] +1
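The three steps can be sketched in a few lines. This is a minimal illustration, not the pipeline from the talk: the tokeniser, the crude plural stripping that turns "rocks" into "rock", and the in-memory `Counter` standing in for Cassandra counter columns are all assumptions.

```python
import re
from collections import Counter

counters = Counter()  # stand-in for Cassandra counters keyed by [bucket, token]

def tokenise(text):
    """Lowercase, keep alphanumeric runs, and crudely strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def minute_bucket(timestamp):
    """'12:34:04' -> '1234' (hour and minute, seconds dropped)."""
    hh, mm, _ss = timestamp.split(":")
    return hh + mm

def ingest(timestamp, text):
    """Step 3: one increment per token in the tweet's minute bucket."""
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
print(sorted(counters.items()))
```

Each tweet costs a handful of counter increments at write time, so no tweet text has to be rescanned at query time.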

Querying

Step 1: Do a range query

  start: [01/05/11, acunu]
  end: [30/05/11, acunu]

Step 2: Result table

  Key                       #Mentions
  [01/05/11 00:01, acunu]   3
  [01/05/11 00:02, acunu]   5
  ...                       ...

Step 3: Plot pretty graph

  [Graph: #mentions (y axis, 0 to 90) per month (x axis, May to Nov)]

[Diagram: keys k1, k2, k3, k4 scattered across the cluster]

Cassandra keys are distributed based on a hash of the row key, ie randomly
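To see why a range query over row keys is awkward under hash-based placement, the sketch below hashes a few adjacent time-bucket keys with MD5 (the hash used by Cassandra's RandomPartitioner) and assigns each to a node. The four node names and the modulo placement are simplifying assumptions; a real cluster assigns token ranges on a ring.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical 4-node cluster

def node_for(row_key):
    """Pick a node from the MD5 hash of the row key (modulo placement is a simplification)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

# Adjacent minute buckets are neighbours in time, but their placement depends
# only on their hashes, so they need not land on the same node.
for minute in ("00:01", "00:02", "00:03", "00:04"):
    key = f"01/05/11 {minute}, acunu"
    print(key, "->", node_for(key))
```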

Instead of this...

  Key                       #Mentions
  [01/05/11 00:01, acunu]   3
  [01/05/11 00:02, acunu]   5
  ...                       ...

We do this

  Key                 00:01   00:02   ...
  [01/05/11, acunu]   3       5       ...
  [02/05/11, acunu]   12      4       ...
  ...                 ...     ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
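The wide-row layout can be modelled as a dictionary of rows, each holding a dictionary of columns. The sketch below is an in-memory illustration only; `rows`, `increment`, `mentions_between` and `mentions_per_day` are hypothetical names, not a Cassandra client API.

```python
from collections import defaultdict

# rows[(day, token)][minute] -> count: one wide row per 'big' bucket and token,
# one column per 'small' bucket.
rows = defaultdict(lambda: defaultdict(int))

def increment(day, minute, token, by=1):
    rows[(day, token)][minute] += by

def mentions_between(day, token, start_minute, end_minute):
    """Slice the columns of a single row, like a column range query."""
    columns = rows.get((day, token), {})
    return {m: c for m, c in sorted(columns.items())
            if start_minute <= m <= end_minute}

def mentions_per_day(days, token):
    """One row read per day: sum the minute columns of each row."""
    return {day: sum(rows.get((day, token), {}).values()) for day in days}

increment("01/05/11", "00:01", "acunu", 3)
increment("01/05/11", "00:02", "acunu", 5)
increment("02/05/11", "00:01", "acunu", 12)
increment("02/05/11", "00:02", "acunu", 4)

print(mentions_per_day(["01/05/11", "02/05/11"], "acunu"))
```

Because every column for a given day and token lives in one row, a query touches one row per day instead of one randomly placed key per minute.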


Towards a more general solution...

(Example II)

Aggregates: count, count distinct (session), count ..., avg(duration)

grouped by ...: day, geography, browser

Counters before the event arrives:

  21:00       all→1345  :00→45  :01→62  :02→87  ...
  22:00       all→3221  :00→22  :01→19  :02→104  ...
  ...         ...
  UK          all→228  user01→1  user14→12  user99→7  ...
  US          all→354  user01→4  user04→8  user56→17  ...
  ...
  UK, 22:00   all→1904  ...
  ∅           all→87314  UK→238  US→354  ...

Incoming event:

  {
    cust_id: user01,
    session_id: 102,
    geography: UK,
    browser: IE,
    time: 22:02,
  }

Counters after the event (each matching counter is incremented by one):

  21:00       all→1345  :00→45  :01→62  :02→87  ...
  22:00       all→3222  :00→22  :01→19  :02→105  ...
  ...         ...
  UK          all→229  user01→2  user14→12  user99→7  ...
  US          all→354  user01→4  user04→8  user56→17  ...
  ...
  UK, 22:00   all→1905  ...
  ∅           all→87315  UK→239  US→354  ...
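The update for a single event can be sketched as a fixed list of (row, column) increments. The row and column naming below mirrors the table; the function name `apply_event` and the in-memory dictionaries are assumptions for illustration, and only the counters touched by this event are seeded.

```python
from collections import defaultdict

counters = defaultdict(lambda: defaultdict(int))  # counters[row][column]

def apply_event(event):
    """Increment every counter that this event contributes to."""
    hour, minute = event["time"].split(":")
    hour_bucket = f"{hour}:00"
    geo = event["geography"]
    touched = [
        (hour_bucket, "all"),              # events in this hour
        (hour_bucket, f":{minute}"),       # events in this minute of the hour
        (geo, "all"),                      # events in this geography
        (geo, event["cust_id"]),           # events per user in this geography
        (f"{geo}, {hour_bucket}", "all"),  # geography and hour combined
        ("∅", "all"),                      # grand total
        ("∅", geo),                        # total per geography
    ]
    for row, column in touched:
        counters[row][column] += 1
    return touched

# Seed the touched counters with the 'before' values from the table.
counters["22:00"].update({"all": 3221, ":02": 104})
counters["UK"].update({"all": 228, "user01": 1})
counters["UK, 22:00"]["all"] = 1904
counters["∅"].update({"all": 87314, "UK": 238})

apply_event({"cust_id": "user01", "session_id": 102, "geography": "UK",
             "browser": "IE", "time": "22:02"})
```

One event fans out into seven increments here; the set of rows is fixed by which groupings the queries need, not by the volume of data.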


Queries answered by reading the counters:

  where time 21:00-22:00, count(*)
  where time 22:00-23:00, group by minute
  where geography=UK, group all by user
  count all
  group all by geo
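Each of these queries maps to reading one row of counters. The sketch below hard-codes the counter values shown in the table after the event; the mapping of each query to a row, and the helper names `total` and `grouped`, are assumptions made for illustration.

```python
counters = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    "UK, 22:00": {"all": 1905},
    "∅": {"all": 87315, "UK": 239, "US": 354},
}

def total(row):
    """count(*) for a row: read its 'all' column."""
    return counters[row]["all"]

def grouped(row):
    """group by: read every column of the row except 'all'."""
    return {col: n for col, n in counters[row].items() if col != "all"}

print(total("21:00"))    # where time 21:00-22:00, count(*)
print(grouped("22:00"))  # where time 22:00-23:00, group by minute
print(grouped("UK"))     # where geography=UK, group all by user
print(total("∅"))        # count all
print(grouped("∅"))      # group all by geo
```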

What about more than just aggregates?

Approximate Analytics

[Diagram labels: Exact, Large Scale, Real-time]

Count Distinct

Plan A: keep a list of all the things you’ve seen; count them at query time

Quick to update

... but at scale ...

Takes lots of space

Takes a long time to query
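Plan A amounts to keeping one set per bucket. The sketch below is a minimal in-memory version; `record` and `count_distinct` are hypothetical names. It makes the cost visible: storage grows with the number of distinct items in every bucket.

```python
from collections import defaultdict

seen = defaultdict(set)  # seen[bucket] -> every distinct item observed so far

def record(bucket, item):
    """Quick to update: one set insertion per event."""
    seen[bucket].add(item)

def count_distinct(bucket):
    """Exact answer, but it needs the whole set for the bucket."""
    return len(seen.get(bucket, ()))

for session in ("s1", "s2", "s1", "s3", "s2"):
    record("22:00", session)

print(count_distinct("22:00"))
```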

Approximate Distinct

  item   hash             leading zeroes   max so far
  x      00101001110...   2                2
  y      11010100111...   0                2
  z      00011101011...   3                3
  ...

max # leading zeroes seen so far

... to see a max of M takes about 2^M items
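The single-counter idea can be sketched as follows: hash each item, track the maximum number of leading zeroes, and estimate the distinct count as 2^M. Using MD5 truncated to 32 bits is an assumption for illustration; any well-mixed hash would do. The estimate is always a power of two, which is why it is very coarse on its own.

```python
import hashlib

def hash32(item):
    """32-bit hash of an item (assumption: first 4 bytes of MD5)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeroes(value, width=32):
    """Number of zero bits before the first 1 bit in a width-bit value."""
    return width - value.bit_length()

def rough_distinct(items):
    """Estimate the distinct count as 2 ** (max leading zeroes seen so far)."""
    max_zeroes = 0
    for item in items:
        max_zeroes = max(max_zeroes, leading_zeroes(hash32(item)))
    return 2 ** max_zeroes

# The three 11-bit hash prefixes from the table above:
print(leading_zeroes(0b00101001110, 11),
      leading_zeroes(0b11010100111, 11),
      leading_zeroes(0b00011101011, 11))
```

Repeated items hash to the same value, so duplicates never change the maximum; only the single number M has to be stored.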

Approximate Distinct

to reduce variance, average over m = 2^k sub-streams

  item   hash             index, zeroes   max so far
  x      00101001110...   0, 0            0,0,0,0
  y      11010100111...   3, 1            0,0,0,1
  z      00011101011...   0, 1            1,0,0,1
  ...

take the harmonic mean
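Splitting the hash into an index and a remainder, keeping one maximum per sub-stream, and combining them with a harmonic mean is the shape of the HyperLogLog estimator. The sketch below follows the standard HyperLogLog formulas, which is an assumption about what the slide intends: registers store zeroes + 1, the constant `alpha` is the usual bias correction for m >= 128, and the small-range fallback is linear counting. The class name and the choice of MD5 are illustrative.

```python
import hashlib
import math

def index_and_zeroes(value, width, k):
    """Top k bits pick the sub-stream; leading zeroes are counted in the rest."""
    rest_width = width - k
    index = value >> rest_width
    rest = value & ((1 << rest_width) - 1)
    return index, rest_width - rest.bit_length()

class ApproxDistinct:
    def __init__(self, k=10):
        self.k = k
        self.m = 1 << k                  # m = 2^k sub-streams
        self.registers = [0] * self.m    # max (zeroes + 1) seen per sub-stream

    def add(self, item):
        digest = hashlib.md5(item.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")   # 64-bit hash
        index, zeroes = index_and_zeroes(value, 64, self.k)
        self.registers[index] = max(self.registers[index], zeroes + 1)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # valid for m >= 128
        # m times the harmonic mean of 2^register, scaled by alpha
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:           # small range: linear counting
            return self.m * math.log(self.m / empty)
        return raw

sketch = ApproxDistinct(k=10)
for i in range(10_000):
    sketch.add(f"session{i}")
    sketch.add(f"session{i}")   # duplicates do not change the registers
print(round(sketch.estimate()))
```

With k = 10 the sketch keeps 1024 small registers regardless of how many items it has seen, and the expected relative error is on the order of 1.04 / sqrt(m), about 3%.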


Okay... now what?

• Aggregate incrementally, on the fly

• Store live + historical aggregates

[Diagram: events (click stream, sensor data, etc) → Acunu Analytics → counter updates]


10x vs MySQL...


Dashboard UI


“Up and running in about 4 hours”

“We found out a competitor was scraping our data”

“We keep discovering use cases we hadn’t thought of”

http://vimeo.com/54026096

"We're still finding new and interesting use cases, which just aren't possible with our current datastores."

"Quick, efficient and easy to get started"