Cassandra Introduction & Features

This presentation briefly describes the key features of Apache Cassandra. It was given at the Apache Cassandra Meetup in Vienna in January 2014. The meetup page is at http://www.meetup.com/Vienna-Cassandra-Users/

Cassandra Introduction & Key Features

Meetup Vienna Cassandra Users

13th of January 2014

philipp.potisk@geroba.com

Definition

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web [The Definitive Guide, Eben Hewitt, 2010]

13/01/2014 Cassandra Introduction & Key Features by Philipp Potisk 2

History

• Bigtable, 2006

• Dynamo, 2007

• Open source, 2008

Key Features

• Distributed and decentralized

• Elastic scalability

• High availability and fault tolerance

• Tunable consistency

• Column-oriented key-value store

• CQL – an SQL-like query interface

• High performance

Distributed and Decentralized

• Distributed: Capable of running on multiple machines

• Decentralized: No single point of failure

No master-slave issues thanks to the peer-to-peer architecture (nodes communicate via the "gossip" protocol)

A single Cassandra cluster may run across geographically dispersed data centers

Read and write requests can be sent to any node

[Figure: ring of twelve nodes spread across Datacenter 1 and Datacenter 2]

Elastic Scalability

• Cassandra scales horizontally by adding more machines that hold all or some of the data

• Adding nodes increases throughput linearly

• Increasing and decreasing the node count happens seamlessly

Scales linearly to terabytes and petabytes of data
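
One way to picture why the node count can change seamlessly is a hash ring. The following is a minimal Python sketch, not Cassandra's actual partitioner: the `Ring` class, the node names and the use of MD5 as a token function are assumptions made for illustration. It shows that adding a fifth node only moves the keys the new node takes over, while all other keys stay where they are.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical hash ring with one token per node."""

    def __init__(self, nodes):
        self._entries = sorted((token(name), name) for name in nodes)
        self._positions = [position for position, _ in self._entries]

    def owner(self, row_key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self._positions, token(row_key)) % len(self._entries)
        return self._entries[index][1]


keys = [f"user-{i}" for i in range(10_000)]
before = Ring(["node1", "node2", "node3", "node4"])
after = Ring(["node1", "node2", "node3", "node4", "node5"])

# Keys whose owner changed when node5 joined the ring.
moved = [key for key in keys if before.owner(key) != after.owner(key)]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node5")
```

Every key that moves ends up on the new node; no data is shuffled between the existing nodes. Real clusters assign tokens through a configurable partitioner, so the single token per node here is a simplification.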

[Figure: ring of four nodes with performance throughput = N, ring of eight nodes with performance throughput = N x 2]

Scaling Benchmark By Netflix*

“Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable, ...”

*http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Test setup: 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively. Each client generated ~20,000 writes/s of 400 bytes each.
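
The setup implies a constant offered load per instance, which is what makes the linear-scaling claim checkable. A small Python sketch of that arithmetic follows. It uses only the numbers above and computes the offered write load, not the throughput Netflix measured; the names are illustrative.

```python
# Numbers taken from the benchmark setup quoted above.
INSTANCES = [48, 96, 144, 288]
CLIENTS = [10, 20, 30, 60]
WRITES_PER_CLIENT = 20_000  # approximate writes per second per client


def offered_load(clients: int) -> int:
    """Total writes per second generated by the given number of clients."""
    return clients * WRITES_PER_CLIENT


# One row per cluster size: (instances, total offered load, load per instance).
rows = [(n, offered_load(c), offered_load(c) / n) for n, c in zip(INSTANCES, CLIENTS)]

for instances, total, per_instance in rows:
    print(f"{instances:>3} instances: {total:>9,} writes/s offered, "
          f"{per_instance:,.0f} writes/s per instance")
```

Every row works out to roughly 4,167 writes/s per instance, so the offered load grows in step with the cluster size.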

High Availability and Fault Tolerance

• What does high availability require?

Multiple networked computers operating in a cluster

A facility for recognizing node failures

Failing over requests to another part of the system

• Cassandra is highly available

No single point of failure thanks to the peer-to-peer architecture
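
The three requirements above can be sketched in a few lines of Python. This is an illustrative model only: `Node`, `read_with_failover` and the `alive` flag are hypothetical names, and real failure detection in Cassandra is considerably more involved than a boolean.

```python
class Node:
    """A replica that may be up or down."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.data = {}

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


def read_with_failover(replicas, key):
    """Try each replica in turn and return (node name, value) from the first live one."""
    for node in replicas:
        try:
            return node.name, node.read(key)
        except ConnectionError:
            continue  # failure recognized, forward the request to the next replica
    raise RuntimeError("no replica available")


# Every node holds a copy of the row, so any of them can answer.
cluster = [Node("node1"), Node("node2"), Node("node3")]
for node in cluster:
    node.data["jsmith"] = "jsmith@example.com"

cluster[0].alive = False  # simulate a node failure
served_by, value = read_with_failover(cluster, "jsmith")
print(served_by, value)
```

Because every node can serve the request, losing one node changes who answers, not whether an answer arrives.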

Tunable Consistency

• Choose between strong and eventual consistency

• Adjustable separately for read and write operations

• Conflicts are resolved during reads, because the focus lies on write performance

The level of consistency depends on the use case
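
Resolving conflicts at read time can be sketched as "last write wins". The Python below is a simplified model under stated assumptions: each replica is a plain dict mapping a row key to a hypothetical `(timestamp, value)` pair, and `read_and_repair` is an illustrative name, not a Cassandra API.

```python
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]  # (write timestamp, value)


def read_and_repair(replicas: List[Dict[str, Versioned]], key: str) -> Optional[str]:
    """Return the newest value for key and overwrite stale copies on the way."""
    answers = [replica[key] for replica in replicas if key in replica]
    if not answers:
        return None
    newest = max(answers, key=lambda versioned: versioned[0])  # highest timestamp wins
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[1]


# Writes stay cheap: the second write (timestamp 2) reached only one replica.
replica_a = {"jsmith": (2, "new@example.com")}
replica_b = {"jsmith": (1, "old@example.com")}

print(read_and_repair([replica_a, replica_b], "jsmith"))
```

The write path never compares versions; the comparison is deferred to the read, which also brings the stale replica up to date.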

[Figure: TUNABLE trade-off between Available and Consistency]

When do we have strong consistency?

• Simple formula: (nodes_written + nodes_read) > replication_factor

• Ensures that a read always reflects the most recent write

• If not: weak consistency (eventually consistent)
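
The formula can be checked by brute force. The Python sketch below (function names are illustrative) enumerates every possible set of written and read replicas and confirms that the two sets always share at least one node exactly when nodes_written + nodes_read > replication_factor.

```python
from itertools import combinations


def is_strongly_consistent(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """The rule of thumb from the slide."""
    return nodes_written + nodes_read > replication_factor


def every_read_overlaps_write(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """True if every possible read set shares a replica with every possible write set."""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, nodes_written)
        for read in combinations(replicas, nodes_read)
    )


# The rule and the exhaustive check agree for all small configurations.
for rf in range(1, 6):
    for nw in range(1, rf + 1):
        for nr in range(1, rf + 1):
            assert is_strongly_consistent(nw, nr, rf) == every_read_overlaps_write(nw, nr, rf)

print(is_strongly_consistent(2, 2, 3))  # NW: 2, NR: 2, RF: 3
print(is_strongly_consistent(1, 1, 3))  # one node written, one node read
```

With two nodes written and two read out of three replicas, any read must touch at least one replica that holds the latest write. With one written and one read, the read may land on a replica that has not seen the write yet.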

[Figure: row "jsmith" on three replicas with write timestamps t1 and t2; NW: 2, NR: 2, RF: 3]

Column-oriented Key-Value Store

• Data is stored in sparse multidimensional hash tables

• A row can have multiple columns, and rows need not have the same number of columns

• Each row has a unique key, which also determines partitioning

• No relations!

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
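
Read as code, the nested map looks roughly like this. The Python sketch is an in-memory illustration only: the `ColumnFamily` class and the sample rows are assumptions, and sorting happens at read time here rather than in storage.

```python
from collections import defaultdict


class ColumnFamily:
    """Map<RowKey, SortedMap<ColumnKey, ColumnValue>> as nested dictionaries."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key: str, column_key: str, value: str) -> None:
        self._rows[row_key][column_key] = value

    def get_row(self, row_key: str):
        """Return the row's columns as (column key, value) pairs sorted by column key."""
        return sorted(self._rows.get(row_key, {}).items())


users = ColumnFamily()
users.insert("jsmith", "email", "jsmith@example.com")
users.insert("jsmith", "city", "Vienna")
users.insert("adoe", "email", "adoe@example.com")  # sparse: this row has no "city"

print(users.get_row("jsmith"))
print(users.get_row("adoe"))
```

The two rows hold different sets of columns, and no schema has to declare a column before a row can use it.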

Row Key1: ColumnKey1 = ColumnValue1, ColumnKey2 = ColumnValue2, ColumnKey3 = ColumnValue3, …

…

Columns: stored sorted by column key/value

Rows: stored sorted by row key*

* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
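
Why hashing helps can be seen with keys that share a common prefix. In this Python sketch the node count, the key format and both placement functions are illustrative assumptions, not Cassandra's partitioners. Placing keys by their first character sends every key to the same node, while placing them by a hash of the key spreads them out.

```python
import hashlib
from collections import Counter

NODES = 4
keys = [f"user-{i:05d}" for i in range(10_000)]


def by_first_char(row_key: str) -> int:
    """Order-preserving placement: keys with the same prefix end up together."""
    return ord(row_key[0]) % NODES


def by_hash(row_key: str) -> int:
    """Hashed placement: MD5 serves here only as a stable, well-mixed hash."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


# Count how many keys each node would receive under each placement.
ordered = Counter(by_first_char(key) for key in keys)
hashed = Counter(by_hash(key) for key in keys)

print("by first character:", dict(ordered))
print("by hash:           ", dict(hashed))
```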

CQL – An SQL-like query interface

• “CQL 3 is the default and primary interface into the Cassandra DBMS” *

• Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modelling

* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf

CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);

INSERT INTO songs (id, title, artist, album, tags)
VALUES ('a3e64f8f...', 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'});

SELECT *
FROM songs
WHERE id = 'a3e64f8f...';

“SQL-like” but NOT relational SQL

High Performance

• Optimized from the ground up for high throughput

• All disk writes are sequential, append-only operations

• No reading before writing

• Cassandra's threading model is optimized for multiprocessor and multicore machines
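
A toy model of such a write path is sketched below in Python. The class and its in-memory lists are assumptions for illustration. The terms commit log, memtable and SSTable do not appear on the slide but are the names Cassandra uses for these parts. Writes append and never look up existing data; reads do the extra work of checking the newest data first.

```python
class AppendOnlyStore:
    """Toy write path: append to a log, update memory, flush immutable segments."""

    def __init__(self):
        self.commit_log = []  # stands in for a sequential on-disk log
        self.memtable = {}    # recent writes held in memory
        self.sstables = []    # immutable flushed segments, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # append only, no lookup of old data
        self.memtable[key] = value

    def flush(self):
        """Write the memtable out as a new sorted, immutable segment."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.sstables):
            if key in segment:
                return segment[key]
        return None


store = AppendOnlyStore()
store.write("jsmith", "v1")
store.flush()
store.write("jsmith", "v2")  # the old segment is not touched
print(store.read("jsmith"))
```

The second write leaves the flushed segment untouched and simply records a newer value, which is why writes stay fast and reads have to pick the newest copy.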

Optimized for writing, but fast reads are possible as well

Benchmark from 2011 (Cassandra 0.7.4)*

* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/

Cassandra showed outstanding throughput in “INSERT-only” with 20,000 ops

• Insert: Enter 50 million 1K-sized records

• Read: Search key for a one hour period + optional update

• Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory

Benchmark from 2013 (Cassandra 1.1.6)*

* Benchmarking Top NoSQL Databases by End Point Corporation, http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf

Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB

When do we need these features?

• Large Deployments

• Lots of Writes, Statistics, and Analysis

• Geographical Distribution

• Evolving Applications

Who is using Cassandra?

eBay Data Infrastructure*

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Thousands of nodes
• The world's largest cluster with 2K+ nodes

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

Not replacing RDBMS but complementing!

* By Jay Patel, Cassandra Summit, June 2013, San Francisco

Cassandra Use Case at eBay

Application/Use Case

• Time-series data and real-time insights

• Fraud detection & prevention

• Quality Click Pricing for affiliates

• Order & Shipment Tracking

• …

• Server metrics collection

• Taste graph-based next-gen recommendation system

• Social Signals on eBay Product & Item pages

Why Cassandra?

• Multi-Datacenter (active-active)

• No SPOF

• Easy to scale

• Write performance

• Distributed Counters

[Figure: Cassandra/Hadoop Deployment]

Summary

• History

• Key features of Cassandra

• Distributed and Decentralized

• Elastic Scalability

• High Availability and Fault Tolerance

• Tunable Consistency

• Column-oriented key-value store

• CQL interface

• High Performance

• eBay Use Case

Community portal: http://planetcassandra.org

Documentation: http://www.datastax.com/docs

Apache project: http://cassandra.apache.org
