Cassandra’s sweet spot Dave Gardner @davegardnerisme

Cassandra's Sweet Spot - an introduction to Apache Cassandra


DESCRIPTION

Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling. Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880

Citation preview

Page 1: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Cassandra’s sweet spot

Dave Gardner @davegardnerisme

Page 2: Cassandra's Sweet Spot - an introduction to Apache Cassandra

jobs.hailocab.com

Looking for an expert backend Java dev – speak to me!

meetup.com/Cassandra-London

Next event 21st November

Page 3: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Building applications with Cassandra

• Key features

• Creating an application

• Data modeling

Page 4: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”

27th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Page 5: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”

28th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773

Page 6: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #1

To use a NoSQL solution effectively, we need to identify its sweet spot.

Page 7: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #1

To use a NoSQL solution effectively, we need to identify its sweet spot.

This means learning about each solution: how is it designed? What algorithms does it use?

http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html

Page 8: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Comparing Cassandra with X

“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”

Benjamin Black – NoSQL Tapes (at 30:15)

http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip

Page 9: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

1. Elastic

Read and write throughput increases linearly as new machines are added

http://cassandra.apache.org/

Page 10: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

2. Decentralised

Fault tolerant with no single point of failure; no “master” node

http://cassandra.apache.org/

Page 11: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The dynamo paper

• Consistent hashing
• Vector clocks
• Gossip protocol
• Hinted handoff
• Read repair

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Page 12: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The dynamo paper

[Diagram: a ring of six nodes (#1 to #6) with RF = 3; the Client talks to a Coordinator node]
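The ring diagram can be sketched in a few lines. This is a rough model only, assuming an MD5-based token (in the spirit of the Random Partitioner), six hypothetical nodes named #1 to #6 with evenly spaced tokens, and the simplest placement rule: walk clockwise from the key's token and take the next RF nodes. The names `token`, `replicas` and `RING` are illustrative, not Cassandra APIs.

```python
import bisect
import hashlib


def token(key):
    """Hash a row key to a position on the ring (MD5 here, as an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas(ring, key, rf=3):
    """Walk clockwise from the key's token and collect rf nodes."""
    tokens = sorted(ring)
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(tokens, token(key)) % len(tokens)
    count = min(rf, len(tokens))
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


# Six hypothetical nodes with evenly spaced tokens on a 2**128 ring.
RING = {i * (2**128 // 6): "#%d" % (i + 1) for i in range(6)}

if __name__ == "__main__":
    print(replicas(RING, "f97be9cc-5255-4578-8813-76701c0945bd"))
```

Any node can act as coordinator for a request; it forwards the operation to the replicas returned by a lookup of this kind.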

Page 13: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

3. Rich data model

Column based, range slices, column slices, secondary indexes, counters, expiring columns

http://cassandra.apache.org/

Page 14: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The big table paper

• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction

http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
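A toy model may help connect the storage terms above. It is a sketch under heavy simplification (single node, no timestamps, no disk), and `TinyStore` is a made-up name: writes go to an append-only log and an in-memory memtable, a full memtable is flushed as an immutable sorted table, and compaction merges tables so that newer values win.

```python
class TinyStore:
    """Toy write path: commit log -> memtable -> immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer, sorted when flushed
        self.sstables = []     # immutable sorted tuples, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once and freeze; the flushed table is never modified again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]
```

Because writes never modify existing files, they are sequential and cheap; the cost is paid later by reads (which may check several tables) and by compaction.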

Page 15: Cassandra's Sweet Spot - an introduction to Apache Cassandra

The big table paper

[Diagram: a Column Family row, made up of a Row Key followed by three Columns, each with a Name and a Value]

Page 16: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

4. You're in control

Tunable consistency, per operation

http://cassandra.apache.org/

Page 17: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels

How many replicas must respond to declare success?

Page 18: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels: write operations

Level         Description
ANY           One node, including hinted handoff
ONE           One node
QUORUM        N/2 + 1 replicas
LOCAL_QUORUM  N/2 + 1 replicas in local data centre
EACH_QUORUM   N/2 + 1 replicas in each data centre
ALL           All replicas

http://wiki.apache.org/cassandra/API#Write

Page 19: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Consistency levels: read operations

Level         Description
ONE           1st Response
QUORUM        N/2 + 1 replicas
LOCAL_QUORUM  N/2 + 1 replicas in local data centre
EACH_QUORUM   N/2 + 1 replicas in each data centre
ALL           All replicas

http://wiki.apache.org/cassandra/API#Read
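The arithmetic behind the two tables can be sketched as follows, taking N to be the replication factor and the division to be integer division. The helper names are illustrative. The second function states the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number read exceeds N.

```python
def quorum(rf):
    """QUORUM size for a replication factor: N/2 + 1 with integer division."""
    return rf // 2 + 1


def read_sees_latest_write(rf, write_replicas, read_replicas):
    """True when every read set must overlap every write set (W + R > N)."""
    return write_replicas + read_replicas > rf


if __name__ == "__main__":
    rf = 3
    print(quorum(rf))
    print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))
    print(read_sees_latest_write(rf, 1, 1))
```

With RF = 3, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), whereas ONE plus ONE does not, which is the trade-off "tunable consistency" lets you make per operation.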

Page 20: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Headline features

5. Performant

Well known for high write performance

http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra

Page 21: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Benchmark*

http://blog.cubrid.org/dev-platform/nosql-benchmarking/

* Add pinch of salt

Page 22: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: headline features

1. Elastic

2. Decentralised

3. Rich data model

4. You’re in control (tunable consistency)

5. Performant

Page 23: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Some ads

Our user knowledge

Choose which ad to show

Page 24: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets)

http://pixel.wehaveyourkidneys.com/add.php?add=foo

Page 25: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Record clicks and impressions of each ad; storing data per-ad and per-segment

http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1
http://pixel.wehaveyourkidneys.com/adClick.php?ad=1

Page 26: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Real-time ad performance analytics, broken down by segment (which segments are performing well?)

http://www.wehaveyourkidneys.com/adPerformance.php?ad=1

Page 27: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A simple ad-targeting application

Recommendations based on best-performing ads

(this is left as an exercise for the reader)

Page 28: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Additional requirements

• Large number of users

• High volume of impressions

• Highly available – downtime is money

Page 29: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A good fit for Cassandra?

Yes!

Big data, high availability and lots of writes are all good signs that Cassandra will fit well.

http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html

Page 30: Cassandra's Sweet Spot - an introduction to Apache Cassandra

A good fit for Cassandra?

That said, people use Cassandra for many other things too, for example:

Highly available HTTP request routing (tiny data!)

http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901

Page 31: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #2

Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.

Page 32: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Demo

Live demo before we start

Page 33: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling

Start from your queries, work backwards

http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906

Page 34: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}

Page 35: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}
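For the two writes above, the rule can be sketched like this: when two versions of the same column meet, the one with the higher timestamp is kept. The `resolve` helper is hypothetical, and tie-breaking on equal timestamps is deliberately left out of this sketch.

```python
def resolve(a, b):
    """Keep whichever version of a column has the higher timestamp."""
    return a if a["timestamp"] > b["timestamp"] else b


first = {"column": "foo", "value": "bar", "timestamp": 1000}
second = {"column": "foo", "value": "zing", "timestamp": 1001}

if __name__ == "__main__":
    print(resolve(first, second)["value"])
```

Here `zing` wins regardless of the order in which the two writes reach a replica, which is why clients need reasonably synchronised clocks.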

Page 36: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ column: zebra, value: foo, timestamp: 1000}

{ column: badger, value: foo, timestamp: 1001}

Page 37: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ badger: foo, zebra: foo}

with AsciiType column schema
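A small sketch of "ordered at time of writing": the row is kept sorted by column name on every insert, so reads never need to sort. Plain string comparison stands in for the AsciiType comparator here, and `insert_column` is an illustrative helper rather than a client API.

```python
import bisect


def insert_column(row, name, value):
    """Insert into a row kept as a list of (name, value) sorted by name."""
    names = [n for n, _ in row]
    i = bisect.bisect_left(names, name)
    if i < len(row) and row[i][0] == name:
        row[i] = (name, value)       # overwrite an existing column
    else:
        row.insert(i, (name, value))  # new column lands in sorted position


if __name__ == "__main__":
    row = []
    insert_column(row, "zebra", "foo")
    insert_column(row, "badger", "foo")
    print(row)
```

Writing `zebra` and then `badger` leaves the row as badger, zebra, matching the slide.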

Page 38: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

Add user to bucket X, with expiry time Y

Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Page 39: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Is user in segment X?
A: Single column fetch

Page 40: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

user Column Family:

[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1

Q: Which segments is user X in?
A: Column slice fetch

Page 41: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #3

With column slices, we get the columns back ordered, according to our schema

We cannot do the same for rows, however, unless we use the Order Preserving Partitioner

Page 42: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #4

Don’t use the Order Preserving Partitioner unless you absolutely have to

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Page 43: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

Add user to bucket X, with expiry time Y

Which buckets is user X in?

["user"][<uuid>][<bucketId>] = 1

[CF] [rowKey] [columnName] = value

Page 44: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Expiring columns

An expiring column will be automatically deleted after n seconds

http://cassandra.apache.org/
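The behaviour can be modelled in memory as below. This is an illustration of the semantics only, not of how Cassandra stores or purges expired data; `ExpiringRow` and its injectable `clock` argument are assumptions made so the example can be run without waiting.

```python
import time


class ExpiringRow:
    """Toy row whose columns can carry a time-to-live in seconds."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._columns = {}  # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._columns[name] = (value, expires_at)

    def columns(self):
        """Return live columns only, ordered by name."""
        now = self._clock()
        return {
            name: value
            for name, (value, expires_at) in sorted(self._columns.items())
            if expires_at is None or expires_at > now
        }
```

For the user segments model, inserting `foo` with a TTL of 3600 means the user silently drops out of segment foo an hour later, with no clean-up job required.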

Page 45: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

$pool = new ConnectionPool(
    'whyk',
    array('localhost')
);
$users = new ColumnFamily($pool, 'users');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,       // default TS
    $expires
);

Using phpcassa client: https://github.com/thobbs/phpcassa

Page 46: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: user segments

UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'

Using CQL http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language

http://www.datastax.com/docs/1.0/references/cql

Page 47: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #5

Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns

Page 48: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

Track overall ad performance; how many clicks/impressions per ad?

["ads"][<adId>][<stamp>]["click"] = #
["ads"][<adId>][<stamp>]["impression"] = #

[CF] [Row] [S.Col] [Col] = value

Using super columns

Page 49: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Top Tip #6

Friends don’t let friends use Super Columns.

http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/

Page 50: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

Try again using regular columns:

["ads"][<adId>][<stamp>-"click"] = #
["ads"][<adId>][<stamp>-"impression"] = #

[CF] [Row] [Col] = value

Page 51: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

ads Column Family:

[1][2011103015-click] = 1
[1][2011103015-impression] = 3434
[1][2011103016-click] = 12
[1][2011103016-impression] = 5411
[1][2011103017-click] = 2
[1][2011103017-impression] = 345

Q: Get performance of ad X between two date/times
A: Column slice against single row specifying a start stamp and end stamp + 1
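The "end stamp + 1" trick can be checked with a small model of the row above. This sketch keeps the columns sorted by name and treats the slice as half-open (start included, finish excluded), which is an assumption of the model; `column_slice` is an illustrative helper, not a client call.

```python
import bisect

# Row for ad 1, sorted by column name as a string comparator would store it.
ROW = sorted({
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}.items())


def column_slice(row, start, finish):
    """Return the columns with start <= name < finish from a sorted row."""
    names = [name for name, _ in row]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_left(names, finish)
    return row[lo:hi]


if __name__ == "__main__":
    # Hours 15 and 16 inclusive: the finish is the end stamp + 1.
    for name, count in column_slice(ROW, "2011103015", "2011103017"):
        print(name, count)
```

Because "2011103017" sorts before "2011103017-click", using the end stamp plus one hour as the finish picks up every column for the final hour and nothing after it.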

Page 52: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Think carefully about your data

This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread.

Other options:
http://rubyscale.com/2011/basic-time-series-with-cassandra/

Page 53: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Counters

• Distributed atomic counters

• Easy to use

• Not idempotent

http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
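"Not idempotent" matters when a client retries after a timeout. The sketch below uses plain dictionaries to stand in for a row; the two helpers are hypothetical and only contrast the semantics of an increment with those of an ordinary column write.

```python
def apply_increment(counters, column, delta=1):
    """Counter-style update: each application changes the result."""
    counters[column] = counters.get(column, 0) + delta


def apply_set(columns, column, value):
    """Ordinary column write: repeating it leaves the same state."""
    columns[column] = value


if __name__ == "__main__":
    counters, columns = {}, {}
    for _attempt in range(2):  # original request plus one retry
        apply_increment(counters, "2011103015-impression")
        apply_set(columns, "foo", 1)
    print(counters, columns)
```

If the first attempt actually succeeded but the client timed out and retried, the counter ends up over-counted, while the plain column is unaffected.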

Page 54: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

$stamp = date('YmdH');
$ads->add(
    $adId,                 // row key
    "$stamp-impression",   // column
    1                      // increment
);

We’ll store performance metrics in hour buckets for graphing.

Page 55: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: ad performance

UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'

Page 56: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

We can add another dimension to our stats so we can break them down by segment.

["ads"][<adId>] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Page 57: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

ads Column Family:

[1][2011103015-bar-click] = 1
[1][2011103015-bar-impression] = 3434
[1][2011103015-foo-click] = 12
[1][2011103015-foo-impression] = 5411
[1][2011103016-bar-click] = 2

Q: Get performance of ad X between two date/times, split by segment
A: Column slice against single row specifying a start stamp and end stamp + 1
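Once the slice comes back, splitting it by segment is a client-side job. A minimal sketch, assuming column names of the form stamp-segment-metric and segment names that contain no hyphen (both assumptions of this example); `totals_by_segment` is a made-up helper.

```python
from collections import defaultdict


def totals_by_segment(columns):
    """Sum clicks and impressions per segment from (name, count) pairs."""
    totals = defaultdict(lambda: {"click": 0, "impression": 0})
    for name, count in columns:
        _stamp, segment, metric = name.split("-")
        totals[segment][metric] += count
    return dict(totals)


SLICE = [
    ("2011103015-bar-click", 1),
    ("2011103015-bar-impression", 3434),
    ("2011103015-foo-click", 12),
    ("2011103015-foo-impression", 5411),
    ("2011103016-bar-click", 2),
]

if __name__ == "__main__":
    print(totals_by_segment(SLICE))
```

A real application would need a delimiter that cannot appear in segment names, or a composite column type, rather than relying on a bare hyphen.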

Page 58: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: performance/segment

$stamp = date('YmdH');
$ads->add(
    "$adId-segments",               // row key
    "$stamp-$segment-impression",   // column
    1                               // incr
);

We’ll store performance metrics in hour buckets for graphing.

Page 59: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Data modeling: segment stats

Track overall clicks/impressions per bucket; which buckets are most clicky?

["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = #

[CF] [Row] [Col] = value

Page 60: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Data modeling

• Think about the queries, work backwards

• Don’t overuse single rows; try to spread the load

• Don’t use super columns

• Ask on IRC! #cassandra

Page 61: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Common data modeling patterns

1. Using column names with no value

[cf][rowKey][columnName] = 1

Page 62: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Recap: Common data modeling patterns

2. Counters

[cf][rowKey][columnName]++

Page 63: Cassandra's Sweet Spot - an introduction to Apache Cassandra

And also…

3. Serialising a whole object

[cf][rowKey][columnName] = { foo: 3, bar: 11 }
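Pattern 3 amounts to packing the object into a single column value. A sketch using JSON as the serialisation format, which is an assumption here since the slide does not name one; `pack` and `unpack` are illustrative helpers.

```python
import json


def pack(obj):
    """Serialise an object into a single column value."""
    return json.dumps(obj, sort_keys=True)


def unpack(value):
    """Turn a stored column value back into an object."""
    return json.loads(value)


if __name__ == "__main__":
    stored = pack({"foo": 3, "bar": 11})
    print(stored)
    print(unpack(stored)["foo"])
```

The trade-off is that changing one field means reading, modifying and rewriting the whole value, which gives up the per-column mutation that Top Tip #5 recommends.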

Page 64: Cassandra's Sweet Spot - an introduction to Apache Cassandra

There’s more: Brisk

Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra

DataStax now offer this functionality in their “Enterprise” product

http://www.datastax.com/products/enterprise

Page 65: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Hive

CREATE EXTERNAL TABLE tempUsers
    (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;

Page 66: Cassandra's Sweet Spot - an introduction to Apache Cassandra

There’s more: Supercharged Cassandra

Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads

Includes instant snapshot of CFs

http://www.acunu.com/products/choosing-cassandra/

Page 67: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Cassandra is founded on sound design principles

Page 68: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful

Page 69: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

The clients are getting better; CQL is a step forward

Page 70: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Hadoop integration means we can analyse data directly from a Cassandra cluster

Page 71: Cassandra's Sweet Spot - an introduction to Apache Cassandra

In conclusion

Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes

Page 72: Cassandra's Sweet Spot - an introduction to Apache Cassandra

Thanks

Learn more about Cassandra
meetup.com/Cassandra-London

Check out the code
https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations