Cassandra
Jonathan Ellis
Motivation
● Scaling reads to a relational database is hard
● Scaling writes to a relational database is virtually impossible
● … and when you do, it usually isn't relational anymore
The new face of data
● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware
CAP theorem
● Pick two of Consistency, Availability, Partition tolerance
Two famous papers
● Bigtable: A Distributed Storage System for Structured Data, 2006
● Dynamo: Amazon's Highly Available Key-value Store, 2007
Two approaches
● Bigtable: “How can we build a distributed db on top of GFS?”
● Dynamo: “How can we build a distributed hash table appropriate for the data center?”
10,000 ft summary
● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's
Cassandra highlights
● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No SPOF (single point of failure)
Dynamo architecture & Lookup
Architecture details
● O(1) node lookup
● Explicit replication
● Eventually consistent
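The lookup and replication bullets above can be sketched as a Dynamo-style hash ring. This is a minimal illustration, not Cassandra's code: the `Ring` class, the MD5-based `token` function, the node names, and the reading of "O(1)" as "no routing hops, because every node knows the whole ring" are all assumptions made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key or node name to a ring position (MD5 is an assumed choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Each node holds the full token list, so a lookup is purely local."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # One token per node, sorted by ring position.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas_for(self, key: str):
        """Return the first `replicas` nodes clockwise from the key's token."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_right(positions, token(key)) % len(self.tokens)
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.replicas_for("keyA")
print(owners)  # three distinct node names; which ones depends on the hash
```

Replication is "explicit" here in the sense that the replica set is computed directly from the ring rather than delegated to an underlying file system.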
Architecture layers
Messaging service
Gossip
Failure detection
Cluster state
Partitioner
Replication
Commit log
Memtable
SSTable
Indexes
Compaction
Tombstones
Hinted handoff
Read repair
Bootstrap
Monitoring
Admin tools
Writes
● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
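A rough sketch of the write path listed above, under simplifying assumptions: the commit log is a Python list rather than a file, the memtable flushes after a fixed number of rows, and the coordinator contacts replicas one after another instead of in parallel. `Node`, `coordinate_write`, and `memtable_limit` are hypothetical names.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for the append-only on-disk log
        self.memtable = {}     # in memory: key -> {column: (value, timestamp)}
        self.sstables = []     # immutable, key-sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # Append to the log first for durability, then update memory.
        # Nothing is read and nothing on disk is overwritten.
        self.commit_log.append((key, column, value, timestamp))
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledgement to the coordinator

    def flush(self):
        # Write the memtable out as a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


def coordinate_write(replicas, w, key, column, value, timestamp):
    """Send the write to every replica; succeed once at least W acknowledge."""
    acks = sum(1 for node in replicas if node.write(key, column, value, timestamp))
    return acks >= w


nodes = [Node() for _ in range(3)]
ok = coordinate_write(nodes, 2, "keyA", "column1", "v1", 1)
coordinate_write(nodes, 2, "keyC", "column1", "v2", 2)  # second row triggers a flush
print(ok, len(nodes[0].sstables))  # True 1
```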
Memtable / SSTable
Commit log
Disk
SSTable format
● Key / data
SSTable Indexes
● Bloom filter
● Key
● Column
(Similar to Hadoop MapFile / TFile)
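To illustrate the Bloom filter index, here is a small sketch of how a read could skip SSTables that definitely do not contain a key. The bit-array size, the number of hashes, and the use of SHA-1 are assumptions chosen for the example, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


bf = BloomFilter()
bf.add("keyA")
print(bf.might_contain("keyA"))  # True: added keys are never reported absent
```

A negative answer lets the reader avoid a disk seek for that SSTable; a positive answer still has to be confirmed through the key index.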
Compaction
● Merge keys
● Combine columns
● Discard tombstones
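The three compaction steps can be sketched as a merge over in-memory dictionaries. Assumptions: each SSTable is modelled as `key -> {column: (value, timestamp)}`, a value of `None` stands in for a tombstone, and tombstones are dropped immediately, without any delay before garbage collection.

```python
TOMBSTONE = None  # hypothetical deletion marker for this sketch


def compact(sstables):
    """Merge several SSTables into one, keeping the newest version of each column."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():           # merge keys
            row = merged.setdefault(key, {})
            for name, (value, ts) in columns.items():  # combine columns
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    result = {}
    for key in sorted(merged):
        # Discard tombstones; a row with no live columns disappears entirely.
        live = {n: c for n, c in merged[key].items() if c[0] is not TOMBSTONE}
        if live:
            result[key] = live
    return result


older = {"keyA": {"column1": ("v1", 1), "column2": ("v2", 1)},
         "keyC": {"column1": ("x", 1)}}
newer = {"keyA": {"column1": ("v1b", 2), "column2": (TOMBSTONE, 3)},
         "keyC": {"column1": (TOMBSTONE, 2)}}
print(compact([older, newer]))  # {'keyA': {'column1': ('v1b', 2)}}
```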
Remove
● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
● Read repair complicates things a little
● Eventual consistency complicates things more
● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
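The configurable delay can be expressed as a simple age check at compaction time. This is a sketch only; `purgeable` and `gc_grace` are hypothetical names, and timestamps are plain integers in the same unit as the delay.

```python
def purgeable(cell, now, gc_grace):
    """cell is (value, timestamp); a value of None marks a tombstone.

    A tombstone may be dropped only once it is older than the grace period,
    which gives lagging replicas time to learn about the delete.
    """
    value, timestamp = cell
    return value is None and now - timestamp >= gc_grace


tombstone = (None, 100)
print(purgeable(tombstone, now=150, gc_grace=60))  # False: still inside the delay
print(purgeable(tombstone, now=160, gc_grace=60))  # True: old enough to collect
```

Dropping the tombstone too early would let a replica that missed the delete reintroduce the old value through read repair.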
Cassandra write properties
● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable
Read path
● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the background and perform read repair
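A simplified sketch of the read path with read repair. Each replica is modelled as a dict of `key -> (value, timestamp)`, the first R replicas in the list play the role of the R fastest responders, and the "background" step runs inline; all of these are assumptions made to keep the example small.

```python
def read_with_repair(replicas, key, r):
    """Answer from the first R responses, then repair stale replicas."""
    responses = [(rep, rep.get(key)) for rep in replicas]

    # Foreground: newest version among the first R responses.
    first = [cell for _, cell in responses[:r] if cell is not None]
    answer = max(first, key=lambda c: c[1]) if first else None

    # Background (inline here): compare all N responses and push the
    # newest version to any replica that is missing it or is out of date.
    known = [cell for _, cell in responses if cell is not None]
    if known:
        newest = max(known, key=lambda c: c[1])
        for rep, cell in responses:
            if cell is None or cell[1] < newest[1]:
                rep[key] = newest
    return answer


a = {"keyA": ("old", 1)}
b = {"keyA": ("new", 2)}
c = {}
print(read_with_repair([a, b, c], "keyA", r=2))  # ('new', 2)
print(a["keyA"], c["keyA"])                      # ('new', 2) ('new', 2)
```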
Cassandra read properties
● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows
Consistency in a BASE world
● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = ⌊N / 2⌋ + 1
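The rule above is plain arithmetic: if W + R > N, every read set overlaps every write set in at least one replica. A short check of the three listed configurations for N = 3 (function names are illustrative):

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (integer division)."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """True when any R replicas must include at least one of the W written."""
    return w + r > n


n = 3
q = quorum(n)
print(q)                        # 2
print(is_consistent(1, n, n))   # True  (W=1, R=N)
print(is_consistent(n, 1, n))   # True  (W=N, R=1)
print(is_consistent(q, q, n))   # True  (W=Q, R=Q)
print(is_consistent(1, 1, n))   # False (fast, but reads may be stale)
```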
vs MySQL with 50GB of data
● MySQL
  ● ~300ms write
  ● ~350ms read
● Cassandra
  ● ~0.12ms write
  ● ~15ms read
● Achtung!
Data model
● Rows, ColumnFamilies, Columns
ColumnFamilies
keyA column1 column2 column3
keyC column1 column7 column11
Column
byte[] name
byte[] value
i64 timestamp
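The three fields map directly onto a small record type. The `reconcile` helper below illustrates the usual role of the timestamp, where the newer version of a column wins when two versions meet; it is a sketch, not the project's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client with each write


def reconcile(a: Column, b: Column) -> Column:
    """Return the newer of two versions of the same column."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return a if a.timestamp >= b.timestamp else b


old = Column(b"column1", b"v1", 1)
new = Column(b"column1", b"v2", 2)
print(reconcile(old, new).value)  # b'v2'
```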
Super ColumnFamilies
keyF Super1 Super2
keyJ Super1 Super5
column column column column column column
column column column column column column
Types of queries
● Single column
● Slice
  ● Set of names / range of names
  ● Simple slice -> columns
  ● Super slice -> supercolumns
● Key range
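The two kinds of simple slice can be sketched over a single row held as a dict of column name to value. Columns are assumed to be kept sorted by name, which is what makes a range of names cheap to answer; `slice_by_names`, `slice_by_range`, and the `count` limit are hypothetical names.

```python
import bisect


def slice_by_names(row, names):
    """Return the requested columns that exist, in name order."""
    return [(n, row[n]) for n in sorted(names) if n in row]


def slice_by_range(row, start, finish, count=100):
    """Return up to `count` columns with start <= name <= finish."""
    names = sorted(row)
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(n, row[n]) for n in names[lo:hi][:count]]


row = {"column1": "a", "column7": "b", "column11": "c"}
print(slice_by_names(row, ["column7", "column1", "missing"]))
# [('column1', 'a'), ('column7', 'b')]
print(slice_by_range(row, "column1", "column2"))
# [('column1', 'a'), ('column11', 'c')]
```

The second result shows that ordering is by string comparison in this sketch, so "column11" sorts before "column2".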
Range queries
● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning
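A small sketch of why order-preserving partitioning makes key ranges possible: with a hashed token, neighbouring keys land at unrelated ring positions, while using the key itself as the token keeps a key range contiguous. The token functions and the `key_range` helper are assumptions for illustration; treating an empty `stop_at` as "no upper bound" is a guess at the semantics, not documented behaviour.

```python
import bisect
import hashlib


def random_token(key: str) -> str:
    """Hashed token: spreads load evenly but destroys key order."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def order_preserving_token(key: str) -> str:
    """Order-preserving token: ring order equals key order."""
    return key


def key_range(sorted_keys, start_with="", stop_at="", max_results=100):
    """Scan a contiguous run of keys; only valid when keys are stored in order."""
    lo = bisect.bisect_left(sorted_keys, start_with)
    hi = bisect.bisect_right(sorted_keys, stop_at) if stop_at else len(sorted_keys)
    return sorted_keys[lo:hi][:max_results]


keys = ["banana", "date", "apple", "cherry"]
opp_order = sorted(keys, key=order_preserving_token)
hash_order = sorted(keys, key=random_token)
print(opp_order)                      # ['apple', 'banana', 'cherry', 'date']
print(hash_order)                     # some hash-dependent order
print(key_range(opp_order, "b", "d"))  # ['banana', 'cherry']
```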
Modification
● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for
Thrift
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}
Column get_column(table, key, column_path, block_for=1)
list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
void insert(table, key, column_path, value, timestamp, block_for=0)
void remove(tablename, key, column_path_or_parent, timestamp)
Honestly, Thrift kinda sucks
Example: a multiuser blog
Two queries
- the most recent posts belonging to a given blog, in reverse chronological order
- a single post and its comments, in chronological order
First try
Row "JBE blog": supercolumns "Cassandra is teh awesome", "BASE FTW"
Row "Evan blog": supercolumns "I like kittens", "And Ruby"
Each supercolumn holds: post, comment, comment
<ColumnFamily
Type="Super"
CompareWith="TimeString"
CompareSubcolumnsWith="UUID"
Name="Blog"/>
Second try
<ColumnFamily
CompareWith="UUIDType"
Name="Blog"/>
Blog rows:
JBE blog: Cassandra is teh awesome, BASE FTW
Evan blog: I like kittens, And Ruby
Comment rows:
Cassandra is teh awesome: comment, comment
BASE FTW: comment, comment
I like kittens: comment, comment
And Ruby: comment, comment
<ColumnFamily
CompareWith="UUIDType"
Name="Comment"/>
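One reading of the second try, sketched with nested dictionaries: the Blog ColumnFamily has one row per blog with time-ordered columns holding posts, and the Comment ColumnFamily has one row per post with time-ordered columns holding comments. Small integers stand in for the time-based UUIDs, and the comment texts and helper names are invented for illustration.

```python
# Blog: row key = blog name, column name = post id (time-ordered), value = post
blog = {
    "JBE blog": {101: "Cassandra is teh awesome", 102: "BASE FTW"},
    "Evan blog": {103: "I like kittens", 104: "And Ruby"},
}

# Comment: row key = post id, column name = comment id (time-ordered), value = comment
comment = {
    101: {201: "first comment", 202: "second comment"},
    102: {203: "first comment", 204: "second comment"},
}


def recent_posts(blog_name, count=10):
    """Query 1: most recent posts of a blog, newest first."""
    posts = sorted(blog[blog_name].items(), reverse=True)
    return [text for _, text in posts[:count]]


def post_with_comments(blog_name, post_id):
    """Query 2: a single post and its comments in chronological order."""
    post = blog[blog_name][post_id]
    comments = [text for _, text in sorted(comment.get(post_id, {}).items())]
    return post, comments


print(recent_posts("JBE blog"))
# ['BASE FTW', 'Cassandra is teh awesome']
print(post_with_comments("JBE blog", 101))
# ('Cassandra is teh awesome', ['first comment', 'second comment'])
```

Each query touches exactly one row, which fits the key-oriented access pattern the deck describes.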
Roadmap
Cassandra 0.3
● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support
Cassandra 0.4
● Branched May 18
● Data file format change to support billions of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface
Cassandra 0.5
● Bootstrap
● Load balancing
  ● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
  ● This will require another data format change
● Multiget
● Callout support
Users
Production: Facebook, RocketFuel
Production RSN: Digg, Rackspace
No date yet: IBM Research, Twitter
Evaluating: 50+ in #cassandra on freenode
More
● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059
● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations
● #cassandra on irc.freenode.net