Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source...

Cassandra

Jonathan Ellis

Motivation

● Scaling reads to a relational database is hard

● Scaling writes to a relational database is virtually impossible
● … and when you do, it usually isn't relational anymore

The new face of data

● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware

CAP theorem

● Pick two of Consistency, Availability, Partition tolerance

Two famous papers

● Bigtable: A distributed storage system for structured data, 2006

● Dynamo: Amazon's highly available key-value store, 2007

Two approaches

● Bigtable: “How can we build a distributed db on top of GFS?”

● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

10,000 ft summary

● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No SPOF (single point of failure)

Dynamo architecture & Lookup

Architecture details

● O(1) node lookup
● Explicit replication
● Eventually consistent
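The node lookup above can be sketched as a token ring. This is a minimal illustration, not Cassandra's partitioner code: the `Ring` class, the `token` helper, the node names, and the use of MD5 are all assumptions made for the example. The point it tries to show is that when every node holds the whole ring in memory, finding the replicas for a key is a local computation with no extra network hops.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical in-memory view of the cluster that every node holds."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by token so a key's owner can be found by binary search.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the N nodes responsible for key, walking the ring clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("keyA")
print(owners)
```

The first node in the returned list owns the key's token range; the following ones hold the explicit replicas.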

Architecture layers

Messaging service

Gossip

Failure detection

Cluster state

Partitioner

Replication

Commit log

Memtable

SSTable

Indexes

Compaction

Tombstones

Hinted handoff

Read repair

Bootstrap

Monitoring

Admin tools

Writes

● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
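The node-local part of the write path above might look roughly like the sketch below. It is a toy model under stated assumptions: the `Node` class, its `memtable_limit` flush threshold, and the in-memory lists standing in for disk files are invented for illustration. Replication and waiting for W responses are not modelled here.

```python
class Node:
    """Toy single-node write path: commit log, memtable, immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # key -> {column name: (value, timestamp)}
        self.sstables = []     # flushed, sorted by key, never modified afterwards
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (rows)

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first so the write survives a crash.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Apply to the memtable; this touches memory only, never disk.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush to a new SSTable once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key, then start a fresh one.
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


node = Node(memtable_limit=2)
node.write("keyA", "column1", b"v1", 1)
node.write("keyC", "column1", b"v2", 2)  # second row triggers a flush
print(len(node.sstables), node.memtable)
```

Because a write only appends to the log and updates memory, it never has to read or seek on disk, which is the property the later "Cassandra write properties" slide lists.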

Memtable / SSTable

Commit log

SSTable format

● Key / data

SSTable Indexes

● Bloom filter
● Key
● Column

(Similar to Hadoop MapFile / TFile)
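To illustrate the Bloom filter index, here is a minimal sketch of how a per-SSTable filter could let a read skip files that cannot contain a key. The `BloomFilter` class, its sizes, and the SHA-1 based hashing are assumptions for this example, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ["keyA", "keyC"]:
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("keyA"))
```

A read would consult the filter first and only go to the key index and data file when `might_contain` returns True.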

Compaction

● Merge keys
● Combine columns
● Discard tombstones
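The three compaction steps can be sketched as a merge over several SSTables. This is a simplified model: SSTables are plain dicts, `TOMBSTONE` is a marker invented for the example, and tombstones are dropped immediately rather than after a configurable GC delay.

```python
TOMBSTONE = object()  # deletion marker used only in this sketch


def compact(sstables, discard_tombstones=True):
    """Merge SSTables given as dicts of key -> {column: (value, timestamp)}."""
    merged = {}
    for sstable in sstables:
        for key, columns in sstable.items():
            row = merged.setdefault(key, {})           # merge keys
            for name, (value, ts) in columns.items():
                # Combine columns: the highest timestamp wins.
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    if discard_tombstones:
        for key in list(merged):
            live = {n: c for n, c in merged[key].items() if c[0] is not TOMBSTONE}
            if live:
                merged[key] = live
            else:
                del merged[key]
    # Keep the output sorted by key, like the input files.
    return dict(sorted(merged.items()))


old = {"keyA": {"column1": (b"a", 1), "column2": (b"b", 1)}}
new = {"keyA": {"column1": (TOMBSTONE, 2)}, "keyC": {"column7": (b"c", 3)}}
result = compact([old, new])
print(result)
```

In the example, the newer tombstone for `column1` suppresses the older value, and both disappear from the compacted output.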

Remove

● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
● Read repair complicates things a little
● Eventual consistency complicates things more
● Solution: configurable delay before tombstone GC, after which tombstones are not repaired

Cassandra write properties

● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable

Read path

● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the background and perform read repair
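The read path with read repair can be sketched as follows. This is an illustration only: replicas are plain dicts, "the first R replicas" stands in for "the R fastest responses", and the background step runs synchronously. The function name `read_with_repair` is made up for the example.

```python
def read_with_repair(replicas, key, r):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    # Answer the client from R replicas, taking the newest version among them.
    responses = [rep.get(key) for rep in replicas[:r]]
    present = [resp for resp in responses if resp is not None]
    result = max(present, key=lambda vt: vt[1]) if present else None

    # "Background": look at all N replicas and push the newest version
    # to any replica that is missing it or holds an older one.
    known = [rep[key] for rep in replicas if key in rep]
    if known:
        newest = max(known, key=lambda vt: vt[1])
        for rep in replicas:
            if rep.get(key) != newest:
                rep[key] = newest
    return result


a = {"k": ("old", 1)}
b = {"k": ("new", 2)}
c = {}
answer = read_with_repair([a, b, c], "k", r=2)
print(answer, a, b, c)
```

After the call, all three replicas hold the version with the highest timestamp, even though only two were consulted for the client's answer.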

Cassandra read properties

● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows

Consistency in a BASE world

● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = N / 2 + 1
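The W + R > N rule can be checked with a few lines of arithmetic. The helper names `overlaps` and `quorum` are invented for this sketch, and N / 2 is taken as integer division.

```python
def overlaps(n: int, w: int, r: int) -> bool:
    """True when every set of R replicas must intersect every set of W replicas."""
    return w + r > n


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


n = 3
q = quorum(n)
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r} consistent={overlaps(n, w, r)}")
# N=3 W=1 R=3 consistent=True
# N=3 W=3 R=1 consistent=True
# N=3 W=2 R=2 consistent=True
# N=3 W=1 R=1 consistent=False
```

The last case shows the tradeoff: W=1, R=1 gives the lowest latency but a read may miss the most recent write.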

vs MySQL with 50GB of data

● MySQL
  ● ~300ms write
  ● ~350ms read
● Cassandra
  ● ~0.12ms write
  ● ~15ms read
● Achtung!

Data model

● Rows, ColumnFamilies, Columns

ColumnFamilies

keyA column1 column2 column3

keyC column1 column7 column11

Column

Byte[] Name

Byte[] Value

I64 timestamp

Super ColumnFamilies

keyF Super1 Super2

keyJ Super1 Super5

column column column column column column

Types of queries

● Single column
● Slice
  ● Set of names / range of names
  ● Simple slice -> columns
  ● Super slice -> supercolumns
● Key range

Range queries

● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning

Modification

● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for

Thrift

struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}

struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

Honestly, Thrift kinda sucks

Example: a multiuser blog

Two queries

- the most recent posts belonging to a given blog, in reverse chronological order

- a single post and its comments, in chronological order

First try

JBE blog     Cassandra is teh awesome     BASE FTW
             post comment comment         post comment comment

Evan blog    I like kittens               And Ruby
             post comment comment         post comment comment

<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

Second try

<ColumnFamily CompareWith="UUIDType" Name="Blog"/>

JBE blog     Cassandra is teh awesome     BASE FTW

Evan blog    I like kittens               And Ruby

Cassandra is teh awesome     comment comment

BASE FTW                     comment comment

I like kittens               comment comment

And Ruby                     comment comment

<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
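The two-ColumnFamily layout of the second try can be modelled with nested dicts to see how it serves both queries. This is only a sketch: integer ids stand in for the time-ordered UUID column names, and the helper functions are invented for the example rather than taken from any client library.

```python
from collections import defaultdict

blog = defaultdict(dict)      # "Blog" CF: row key = blog name, column name = post id
comment = defaultdict(dict)   # "Comment" CF: row key = post id, column name = comment id


def add_post(blog_name, post_id, body):
    blog[blog_name][post_id] = body


def add_comment(post_id, comment_id, body):
    comment[post_id][comment_id] = body


def recent_posts(blog_name, limit=10):
    """Query 1: most recent posts of a blog, newest first (a reversed slice)."""
    row = blog.get(blog_name, {})
    return [row[pid] for pid in sorted(row, reverse=True)[:limit]]


def post_with_comments(blog_name, post_id):
    """Query 2: one post plus its comments in chronological order."""
    body = blog[blog_name][post_id]
    row = comment.get(post_id, {})
    return body, [row[cid] for cid in sorted(row)]


add_post("JBE blog", 1, "Cassandra is teh awesome")
add_post("JBE blog", 2, "BASE FTW")
add_comment(2, 11, "comment at t=11")
add_comment(2, 10, "comment at t=10")

print(recent_posts("JBE blog"))
print(post_with_comments("JBE blog", 2))
```

Each query reads a single row from a single ColumnFamily, and the column ordering does the sorting.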

Roadmap

Cassandra 0.3

● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support

Cassandra 0.4

● Branched May 18
● Data file format change to support billions of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface

Cassandra 0.5

● Bootstrap
● Load balancing
  ● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
  ● This will require another data format change
● Multiget
● Callout support

Production: facebook, RocketFuel

Production RSN: Digg, Rackspace

No date yet: IBM Research, Twitter

Evaluating: 50+ in #cassandra on freenode

● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059

● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations

● #cassandra on irc.freenode.net

Cassandra
