The Cassandra Distributed Database

Eric Evans

eevans@rackspace.com

@jericevans

FOSDEM, February 7, 2010

A prophetess in Troy during the Trojan War. Her predictions were always true, but never believed.

A massively scalable, decentralized, structured data store (aka database).

Outline

1 Project History

2 Description

3 Case Studies

4 Roadmap

Project History

• 7 new committers added

• Dozens of contributors

• 100+ people on IRC

• Hundreds of closed issues (bugs, features, etc)

• 3 major releases, 2 point releases

• Graduation to TLP?

Description

Cassandra is...

• O(1) DHT

• Eventual consistency

• Tunable trade-offs, consistency vs. latency
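As a rough illustration of the DHT bullet: in a one-hop ("O(1)") DHT every node knows the whole ring, so any node can work out locally which nodes own a key, with no multi-hop routing. The sketch below assumes MD5 tokens, one token per node, and a hypothetical `Ring` class; none of these names come from the slides.

```python
import bisect
import hashlib


def token(data: bytes) -> int:
    """Map bytes onto the ring. MD5 is an assumption of this sketch."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


class Ring:
    """Hypothetical token ring: each node owns exactly one token."""

    def __init__(self, nodes):
        # Sort nodes by token so a lookup is a binary search.
        self.entries = sorted((token(n.encode()), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: bytes, n: int = 3):
        """Return up to n distinct nodes, walking clockwise from the key's token."""
        start = bisect.bisect_left(self.tokens, token(key))
        count = min(n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas(b"user42", n=3)
```

Because the lookup depends only on the key and the ring membership, every node computes the same owner list for the same key.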

But...

• Values are structured, indexed

• Columns / column families

• Slicing w/ predicates (queries)

Column families

Supercolumn families
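One way to picture the two shapes is as nested maps. The sketch below is only a mental model in plain Python dictionaries; the family names (`users`, `timelines`), row keys, and the `get_value` helper are made up for illustration.

```python
def column(value, timestamp):
    """A column carries a value and a timestamp; its name is the dict key."""
    return {"value": value, "timestamp": timestamp}


# Column family: row key -> {column name -> column}
users = {
    "alice": {
        "email": column("alice@example.com", 1),
        "city": column("Brussels", 1),
    },
}

# Supercolumn family: row key -> {super column name -> {column name -> column}}
timelines = {
    "alice": {
        "2010-02-07": {
            "09:00": column("arrived at FOSDEM", 2),
            "13:30": column("Cassandra talk", 3),
        },
    },
}


def get_value(family, row_key, *names):
    """Walk one level per name: one name for a column, two for a sub-column."""
    node = family[row_key]
    for name in names:
        node = node[name]
    return node["value"]
```

A supercolumn family simply adds one more level of nesting between the row key and the columns.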

Querying

• get(): retrieve by column name

• multiget(): by column name for a set of keys

• get_slice(): by column name, or a range of names

    • returning columns

    • returning super columns

• multiget_slice(): a subset of columns for a set of keys

• get_count(): number of columns or sub-columns

• get_range_slice(): subset of columns for a range of keys
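The two slice styles listed above (by name, or by a range of names) can be imitated on an in-memory row. This is a sketch, not the Thrift signatures: `slice_by_names` and `slice_by_range` are hypothetical helpers, and the inclusive bounds with an empty string meaning "unbounded" are assumptions of this example.

```python
def slice_by_names(row, names):
    """Return the requested columns that exist, in the order asked for."""
    return [(name, row[name]) for name in names if name in row]


def slice_by_range(row, start="", finish="", count=100):
    """Return up to `count` columns with start <= name <= finish, in sorted order.

    An empty start or finish leaves that side of the range open.
    """
    selected = []
    for name in sorted(row):
        if len(selected) >= count:
            break
        if start and name < start:
            continue
        if finish and name > finish:
            break
        selected.append((name, row[name]))
    return selected


row = {"a": 1, "b": 2, "c": 3, "d": 4}
middle = slice_by_range(row, start="b", finish="c")
first_two = slice_by_range(row, count=2)
```

Range slices only make sense because column names within a row are kept in sorted order.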

Column comparators

• TimeUUID

• LexicalUUID

• UTF8

• Long

• Bytes

• ...
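The comparator decides the order in which column names are stored, and therefore what a range slice returns. A small sketch of the idea, using Python's own byte ordering as a stand-in for a bytes comparator and 8-byte big-endian integers as a stand-in for a long comparator (both encodings are assumptions of this example):

```python
import struct

# Column names encoded as 8-byte big-endian signed integers.
numbers = (2, -1, 256, 10)
names = [struct.pack(">q", n) for n in numbers]


def decode(name: bytes) -> int:
    return struct.unpack(">q", name)[0]


# Raw byte order: compares the encoded bytes one at a time.
by_bytes = [decode(n) for n in sorted(names)]

# Numeric order: decode first, then compare as integers.
by_long = [decode(n) for n in sorted(names, key=decode)]
```

The same four names come back in a different order depending on the comparator, which is why the comparator is chosen per column family to match how the names will be queried.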

Updating

• insert(): add/update column (by key)

• batch_insert(): add/update multiple columns (by key)

• remove(): remove a column

• batch_mutate(): like batch_insert(), but can also delete (new for 0.6; deprecates batch_insert())

• Remove key range RSN
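A batch that mixes inserts and deletes can be modelled as a list of timestamped mutations applied to one row. The sketch below assumes newest-timestamp-wins and records deletes as tombstones; `apply_mutations` and `live_columns` are hypothetical helpers, not the Thrift calls named above.

```python
def apply_mutations(row, mutations):
    """Apply (op, name, value, timestamp) tuples; the newest timestamp wins."""
    for op, name, value, timestamp in mutations:
        current = row.get(name)
        if current is not None and current["timestamp"] >= timestamp:
            continue  # an equal or newer write is already present
        row[name] = {
            "value": value if op == "insert" else None,
            "timestamp": timestamp,
            "deleted": op == "delete",  # tombstone marker
        }
    return row


def live_columns(row):
    """Columns that have not been deleted."""
    return {name: col["value"] for name, col in row.items() if not col["deleted"]}


row = {}
apply_mutations(row, [("insert", "email", "a@example.com", 1),
                      ("insert", "city", "Ghent", 1)])
apply_mutations(row, [("insert", "city", "Brussels", 2),
                      ("delete", "email", None, 3)])
# A late-arriving older write does not resurrect the deleted column.
apply_mutations(row, [("insert", "email", "old@example.com", 2)])
```

Keeping the tombstone, rather than dropping the column outright, is what lets an older insert that arrives late be rejected.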

Consistency

CAP Theorem: choose any two of Consistency, Availability, or Partition tolerance.

• Zero

• One

• Quorum ((N / 2) + 1)

• All
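The levels above translate into a number of replicas that must respond, where N is the replication factor and the quorum formula uses integer division. The helper names below are made up for this sketch; the R + W > N overlap rule is the usual quorum argument and is not stated on the slide.

```python
def quorum(n: int) -> int:
    """(N / 2) + 1 with integer division."""
    return n // 2 + 1


def required_responses(level: str, n: int) -> int:
    """Replicas that must respond before the operation returns."""
    return {"ZERO": 0, "ONE": 1, "QUORUM": quorum(n), "ALL": n}[level]


def read_overlaps_write(r: int, w: int, n: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    return r + w > n


n = 3
r = required_responses("QUORUM", n)
w = required_responses("QUORUM", n)
```

With N = 3, quorum reads and quorum writes each touch 2 replicas, so they always overlap; reading and writing at ONE is faster but gives no such guarantee.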

Client API

• Thrift (12 different languages!)

• Ruby

    • http://github.com/fauna/cassandra/tree/master

    • http://github.com/NZKoz/cassandra_object/tree/master

• Python

    • http://github.com/digg/lazyboy/tree/master

    • http://github.com/driftx/Telephus/tree/master (Twisted)

• Scala

    • http://github.com/viktorklang/Cassidy/tree/master

    • http://github.com/nodeta/scalandra/tree/master

Performance vs MySQL w/ 50GB

• MySQL

    • 300ms write

    • 350ms read

• Cassandra

    • 0.12ms write

    • 15ms read

Writes

About writes...

• No reads

• No seeks

• Sequential disk access

• Atomic within a column family

• Fast

• Any node

• Always writeable (hinted hand-off)
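The "no reads, no seeks, sequential disk access" bullets describe a log-structured write path: append to a commit log, update an in-memory table, and later flush that table to disk as one sorted, immutable file. A toy sketch of that shape, with lists standing in for files; the `Store` class and its flush threshold are assumptions of this example.

```python
class Store:
    """Toy log-structured store: lists stand in for on-disk files."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only, written sequentially
        self.memtable = {}     # in-memory rows: key -> {column -> (value, ts)}
        self.sstables = []     # flushed, sorted, never modified afterwards
        self.flush_threshold = flush_threshold

    def write(self, key, column, value, timestamp):
        # 1. Sequential append for durability; nothing is read from disk first.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the in-memory table.
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        # 3. Flush once enough rows have accumulated.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of the rows in key order.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


store = Store(flush_threshold=2)
store.write("k2", "name", "second", 1)
log_length_before_flush = len(store.commit_log)
store.write("k1", "name", "first", 2)
```

Nothing in `write` looks up the existing on-disk value, which is where the write speed comes from; reconciling versions is deferred to reads and compaction.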

About reads...

• Any node

• Read repair

• Usual caching conventions apply
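Read repair can be sketched as: ask the replicas, keep the version with the newest timestamp, and push it back to any replica that returned something older or nothing. The dictionaries below stand in for replica nodes, and `read_with_repair` is a hypothetical helper written for this example.

```python
def read_with_repair(replicas, key, column):
    """Return the newest value and overwrite stale or missing copies.

    `replicas` maps node name -> {(key, column): (value, timestamp)}.
    """
    responses = {node: data.get((key, column)) for node, data in replicas.items()}
    newest = max(
        (r for r in responses.values() if r is not None),
        key=lambda r: r[1],
        default=None,
    )
    if newest is None:
        return None
    for node, response in responses.items():
        if response != newest:
            replicas[node][(key, column)] = newest  # repair the stale replica
    return newest[0]


replicas = {
    "n1": {("k", "c"): ("new", 2)},
    "n2": {("k", "c"): ("old", 1)},  # stale copy
    "n3": {},                        # missed the write entirely
}
value = read_with_repair(replicas, "k", "c")
```

After the read, all three replicas hold the newest version, so replicas converge as a side effect of normal read traffic.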

Case Studies

Case 1: Digg

Digg is a social news site that allows people to discover and share content from anywhere on the Internet by submitting stories and links, and voting and commenting on submitted stories and links.

Ranked 98th by Alexa.com.

Problem

• Terabytes of data; high transaction rate (read-dominated)

• Multiple clusters; heavily sharded

• Management nightmare (high effort, error prone)

• Unsatisfied availability requirements (geographic isolation)

Solution

• Currently in production on "Green Badges"

• Cassandra as primary data store RSN

• Datacenter and rack-aware replication

Case 2: Twitter

Twitter is a social networking and microblogging service that enables its users to send and read tweets, text-based posts of up to 140 characters.

Ranked 12th by Alexa.com.

Twitter

• Terabytes of data, ~1,000,000 ops/s

• Calls for heavy sharding, light replication

• Schema changes are very difficult (if possible at all)

• Manual sharding is very high effort

• Automated sharding and replication is Hard

Case 3: Facebook

Facebook is a social networking site where users can create a profile, add friends, and send them messages. Users can also join groups organized by location or other points of common interest.

Ranked #2 by Alexa.com.

Inbox Search

• 100 TB

• 160 nodes

• 1/2 billion writes per day (2-year-old number?)

Case 4: Mahalo

Mahalo.com is a web directory and knowledge exchange. It differentiates itself by tracking and building hand-crafted result sets for many of the popular search terms.

(it also means "thank you" in Hawaiian)

• Partial deployment; 16 million video records (and growing)

• Writes (and storage) rapidly exceeding single box limitations

• Manageability suffering (clustering is painful)

• Concerns over availability

Roadmap

• batch_mutate command

• authentication (basic)

• new consistency level, ANY

• fat client

• mmapped I/O reads (default on 64-bit JVM)

• improved write concurrency (HH)

• networking optimizations

• row caching

• improved management tools

• per-keyspace replication factor

• more efficient compactions (row sizes bigger than memory)

• easier (dynamic?) column family changes

• SSTable versioning

• SSTable compression

• support for column family truncation

• improved configuration handling

• remove key range command

• even more improved management tools

• vector clocks w/ server-side conflict resolution

THE END
