
Tokyo cassandra conference 2014


Page 1: Tokyo cassandra conference 2014

©2013 DataStax Confidential. Do not distribute without consent.

Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra

Cassandra 2.0 and 2.1


Page 2: Tokyo cassandra conference 2014

Five years of Cassandra

[Timeline, Jul-08 to Mar-14: open sourced Jul-08; releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2, and 2.0; DSE alongside]

I’ve been working on Cassandra for five years now. Facebook open sourced it in July of 2008, and I started working on it at Rackspace in December. A year and a half later, I started DataStax to commercialize it.

Page 3: Tokyo cassandra conference 2014

Core values
• Massive scalability
• High performance
• Reliability/Availability

[Chart: performance comparison of Cassandra, HBase, Redis, MySQL]

For the first four years we focused on these three core values.

Page 4: Tokyo cassandra conference 2014

New core value
• Massive scalability
• High performance
• Reliability/Availability
• Ease of use

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date int
);

CREATE INDEX ON users(state);

SELECT * FROM users WHERE state = 'Texas' AND birth_date > 1950;

2013 saw us focus on a fourth value, ease of use, starting with the introduction of CQL3 in January with Cassandra 1.2.

CQL (Cassandra Query Language) is a dialect of SQL optimized for Cassandra. All the statements on the right of this slide are valid in both CQL and SQL.

Page 5: Tokyo cassandra conference 2014

Native Drivers
• CQL native protocol: efficient, lightweight, asynchronous
• Java (GA): https://github.com/datastax/java-driver
• .NET (GA): https://github.com/datastax/csharp-driver
• Python (Beta): https://github.com/datastax/python-driver
• C++ (Beta): https://github.com/datastax/cpp-driver
• Coming soon: PHP, Ruby

We also introduced a native CQL protocol, cutting out the overhead and complexity of Thrift. DataStax has open sourced half a dozen native CQL drivers and is working on more.

Page 6: Tokyo cassandra conference 2014

DataStax DevCenter

We’ve also released DevCenter, an interactive tool for exploring and querying your Cassandra databases. DevCenter is the first tool of its kind for a NoSQL database.

Page 7: Tokyo cassandra conference 2014

Tracing

cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
   Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 |            540
       Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 |            779
    Message received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |             63
                   Applying mutation | 00:02:37,016 | 127.0.0.2 |            220
                Acquiring switchLock | 00:02:37,016 | 127.0.0.2 |            250
              Appending to commitlog | 00:02:37,016 | 127.0.0.2 |            277
                  Adding to memtable | 00:02:37,016 | 127.0.0.2 |            378
    Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            710
       Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 |            888
    Message received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 |           2334
 Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 |           2550

Perhaps the biggest problem people have after deploying Cassandra is understanding what goes on under the hood. We introduced query tracing to shed some light on this. One of the challenges is gathering information from all the nodes that participate in processing a query; here, the coordinator (in blue) receives the query from the client and forwards it to a replica (in green) which then responds back to the coordinator.
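In cqlsh, tracing is toggled per session; a minimal sketch using the same example table as above:

TRACING ON;
INSERT INTO bar (i, j) VALUES (6, 2);
-- cqlsh prints a trace table like the one above after each statement
TRACING OFF;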

Page 9: Tokyo cassandra conference 2014

Authentication

[cassandra.yaml]
authenticator: PasswordAuthenticator
# DSE offers KerberosAuthenticator

CREATE USER robin WITH PASSWORD 'manager' SUPERUSER;

ALTER USER cassandra WITH PASSWORD 'newpassword';

LIST USERS;

DROP USER cassandra;

We added authentication and authorization, following familiar patterns. Note that the default user and password is cassandra/cassandra, so good practice is to create a new superuser and drop or change the password on the old one.

Apache Cassandra ships with password authentication built in; DSE (DataStax Enterprise) adds Kerberos single-sign-on integration.

Page 10: Tokyo cassandra conference 2014

Authorization

[cassandra.yaml]
authorizer: CassandraAuthorizer

GRANT select ON audit TO jonathan;

GRANT modify ON users TO robin;

GRANT all ON ALL KEYSPACES TO lara;

select and modify privileges may be granted separately or together to users on a per-table or per-keyspace basis.
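The inverse operations follow the same pattern; a quick sketch (robin and the users table are from the slide above):

REVOKE modify ON users FROM robin;
LIST ALL PERMISSIONS OF robin;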

Page 11: Tokyo cassandra conference 2014

Cassandra 2.0

Everything I’ve talked about so far is “ancient history” from Cassandra 1.2, but I wanted to cover it again as a refresher. Now let’s talk about what we added for Cassandra 2.0, released in September.


Page 16: Tokyo cassandra conference 2014

Race condition

Client 1:
SELECT name FROM users WHERE username = 'pmcfadin';

(0 rows)

Client 2:
SELECT name FROM users WHERE username = 'pmcfadin';

(0 rows)

Client 1:
INSERT INTO users (username, name, email, password, created_date)
VALUES ('pmcfadin', 'Patrick McFadin', ['[email protected]'],
        'ba27e03fd9...', '2011-06-20 13:50:00');

Client 2 (this one wins):
INSERT INTO users (username, name, email, password, created_date)
VALUES ('pmcfadin', 'Patrick McFadin', ['[email protected]'],
        'ea24e13ad9...', '2011-06-20 13:50:01');

The first such feature is Lightweight Transactions. This is motivated by the fact that while Cassandra’s eventually consistent model can provide “strong consistency,” where readers always see the most recent writes, it cannot provide “linearizable consistency,” where some writes are guaranteed to happen sequentially with respect to others.

Consider the case of user account creation. If two users attempt to create the same name simultaneously, they will both see that it does not yet exist and proceed to attempt to create the account, resulting in corruption.

Page 19: Tokyo cassandra conference 2014

Lightweight transactions

INSERT INTO users (username, name, email, password, created_date)
VALUES ('pmcfadin', 'Patrick McFadin', ['[email protected]'],
        'ba27e03fd9...', '2011-06-20 13:50:00')
IF NOT EXISTS;

 [applied]
-----------
      True

INSERT INTO users (username, name, email, password, created_date)
VALUES ('pmcfadin', 'Patrick McFadin', ['[email protected]'],
        'ea24e13ad9...', '2011-06-20 13:50:01')
IF NOT EXISTS;

 [applied] | username | created_date   | name
-----------+----------+----------------+-----------------
     False | pmcfadin | 2011-06-20 ... | Patrick McFadin

Lightweight transactions roll the “check” and “modify” stages into a single atomic operation, so we can guarantee that only one user will create a given account. The other will get back the row that was created concurrently as an explanation.

UPDATE can similarly take an IF ... clause checking that no modifications have been made to a set of columns since they were read.
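A minimal sketch of that conditional UPDATE form, with hypothetical password hashes:

UPDATE users
SET password = 'e1b3c0ff42...'
WHERE username = 'pmcfadin'
IF password = 'ba27e03fd9...';

If the current value no longer matches, Cassandra returns [applied] = False along with the current column values, just like the INSERT case above.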

Page 20: Tokyo cassandra conference 2014

Paxos
• All operations are quorum-based
• Each replica sends information about unfinished operations to the leader during prepare
• See Lamport's "Paxos Made Simple"

Under the hood, lightweight transactions are implemented with the Paxos consensus protocol.

Page 21: Tokyo cassandra conference 2014

Details
• Paxos state is durable
• Immediate consistency with no leader election or failover
• ConsistencyLevel.SERIAL
• 4 round trips vs. 1 for normal updates
• http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

Paxos has these implications for our implementation.
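One practical detail: Paxos state is surfaced through the SERIAL consistency level. A sketch, assuming the cqlsh CONSISTENCY command accepts SERIAL for reads:

CONSISTENCY SERIAL;
-- this read participates in Paxos and observes in-flight lightweight transactions
SELECT * FROM users WHERE username = 'pmcfadin';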

Page 22: Tokyo cassandra conference 2014

Use with caution
• Great for 1% of your application
• Eventual consistency is your friend
• http://www.slideshare.net/planetcassandra/c-summit-2013-eventual-consistency-hopeful-consistency-by-christos-kalantzis

“4 round trips” is the big downside for Paxos. This makes lightweight transactions a big performance hit in single-datacenter deployments and wildly impractical for multi-datacenter clusters. They should only be used for targeted pieces of an application when the alternative is corruption, like our account creation example.

Page 23: Tokyo cassandra conference 2014

Cursors (before)

SELECT *
FROM timeline
WHERE (user_id = :last_key
       AND tweet_id > :last_tweet)
   OR token(user_id) > token(:last_key)
LIMIT 100;

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_author uuid,
  tweet_body text,
  PRIMARY KEY (user_id, tweet_id)
);

Cassandra 2.0 introduced cursors to the native protocol. This makes paging through large resultsets much simpler. Note how we need one clause per component of the primary key to fetch the next 100 rows here.

Page 24: Tokyo cassandra conference 2014

Cursors (after)

SELECT *
FROM timeline;

Now Cassandra handles the details of getting extra results as you iterate through a resultset. In fact, our cursors are a little bit smarter than in your favorite RDBMS (relational database management system) since they are failover-aware: if the coordinator in use fails, the cursor will pick up where it left off against a different node in the cluster.

Page 29: Tokyo cassandra conference 2014

Other CQL improvements
• SELECT DISTINCT pk
• CREATE TABLE IF NOT EXISTS table
• SELECT ... AS
  • SELECT event_id, dateOf(created_at) AS creation_date
• ALTER TABLE DROP column

We made some other miscellaneous improvements in CQL for 2.0 as well.
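As a sketch, here are runnable forms of those four improvements against the tables used elsewhere in this deck (the events table is hypothetical):

SELECT DISTINCT user_id FROM timeline;

CREATE TABLE IF NOT EXISTS users (
  id uuid PRIMARY KEY,
  name text
);

SELECT event_id, dateOf(created_at) AS creation_date FROM events;

ALTER TABLE users DROP birth_date;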

Page 30: Tokyo cassandra conference 2014

On-Heap/Off-Heap

[Diagram: a Java process with on-heap memory (managed by GC) and off-heap memory (not managed by GC)]

We’ve put a lot of effort into improving how Cassandra manages its memory. You’re looking at a limit of about 8GB for a JVM heap, even though modern servers have much more RAM available. So we’re optimizing heap use, pushing internal structures into off-heap memory where possible.

Page 36: Tokyo cassandra conference 2014

Read path (per sstable)

[Diagram: in memory: Bloom filter, partition key cache, partition summary, compression offsets; on disk: partition index, data file]

To understand what we’ve done, I need to explain how a read works in Cassandra.

Page 37: Tokyo cassandra conference 2014

Off heap in 2.0: partition key bloom filter (1-2GB per billion partitions)


These are the components that are allocated off-heap now. We use reference counting to deallocate them when the sstable (data file) they are associated with is obsoleted by compaction.

Page 38: Tokyo cassandra conference 2014

Off heap in 2.0: compression metadata (~1-3GB per TB compressed)


Page 39: Tokyo cassandra conference 2014

Off heap in 2.0: partition index summary (size depends on rows per partition)


Page 40: Tokyo cassandra conference 2014

Compaction
• Single-pass, always
• LCS performs STCS in L0

LCS = leveled compaction strategy
STCS = size-tiered compaction strategy
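The strategy is chosen per table in CQL; a minimal sketch using the timeline table from earlier (160MB is the default LCS sstable size):

ALTER TABLE timeline
WITH compaction = { 'class': 'LeveledCompactionStrategy',
                    'sstable_size_in_mb': 160 };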

Page 41: Tokyo cassandra conference 2014

Healthy leveled compaction

[Diagram: sstables distributed across levels L0-L5, each level holding 10x the data of the previous]

The goal of leveled compaction is to provide a read performance guarantee. We divide the sstables up into levels, where each level has 10x as much data as the previous (so the diagram here is not to scale!), and guarantee that any given row is only present in at most one sstable per level.

Newly flushed sstables start in level zero, which has not yet been processed into the tiered levels, so the at-most-one-sstable-per-level guarantee does not apply there. A read may therefore have to check every sstable in L0.

Page 42: Tokyo cassandra conference 2014

Sad leveled compaction

[Diagram: L0 backed up with many unleveled sstables]

The problem is that we can fairly easily flush new sstables to L0 faster than compaction can level them. That results in poor read performance, since we need to check so many sstables for each row. The extra read load in turn leaves even less I/O available for compaction, and L0 falls even further behind.

Page 43: Tokyo cassandra conference 2014

STCS in L0

[Diagram: size-tiered compaction applied to the L0 backlog]

So what we do in 2.0 is perform size-tiered compaction when L0 falls behind. This doesn’t magically make LCS faster, since we still need to process these sstables into the levels, but it does mean that we prevent read performance from going through the floor in the meantime.

Page 44: Tokyo cassandra conference 2014

A closer look at reads

[Diagram: a client sends a query to a coordinator, which forwards it to the least busy of three replicas (40%, 90%, 30% busy)]

Now let’s look at reads from the perspective of the whole cluster. A client sends a query to a coordinator, which forwards it to the least-busy replica, and returns the answer to the client.


Page 49: Tokyo cassandra conference 2014

A failure

[Diagram: the coordinator forwards the read to the 30%-busy replica, which fails (X); the client gets a timeout]

What happens if that replica fails before replying? In earlier versions of Cassandra, we’d return a timeout error.

Page 54: Tokyo cassandra conference 2014

Rapid read protection

[Diagram: the coordinator detects the slow replica (X), retries the read on another replica, and returns success to the client]

In Cassandra 2.0, the coordinator will detect slow responses and retry those queries to another replica to prevent timing out.
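This is configured per table through the speculative_retry option; a minimal sketch:

ALTER TABLE timeline WITH speculative_retry = '99percentile';
-- other settings: 'ALWAYS', a fixed threshold such as '10ms', or 'NONE'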

Page 61: Tokyo cassandra conference 2014

Rapid Read Protection

[Graph: read throughput over time on a four-node cluster, one node killed midway; curves for speculative_retry settings including NONE]

Here we have a graph of read performance over time in a small four-node cluster. One of the nodes is killed halfway through. You can see how rapid read protection results in a much lower impact on throughput. (There is still some drop, since we need to repeat 25% of the queries against other replicas all at once.)

Page 62: Tokyo cassandra conference 2014

Latency (mid-compaction)

Rapid Read Protection can also reduce latency variance. Look at the 99.9th percentile numbers here. With no rapid read protection, the slowest 0.1% of reads took almost 50ms. Retrying the slowest 10% of queries brings that down to 14.5ms; retrying only the slowest 1% gives 19.6ms. But note that issuing extra reads for all requests actually results in a higher 99th percentile! The throughput number shows us why: we run out of capacity in the cluster to absorb the extra requests.

Page 63: Tokyo cassandra conference 2014

Cassandra 2.1

Page 64: Tokyo cassandra conference 2014

User defined types

CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  addresses map<text, address>
);

SELECT id, name, addresses.city, addresses.phones FROM users;

 id       | name    | addresses.city | addresses.phones
----------+---------+----------------+--------------------------
 63bf691f | jbellis | Austin         | {'512-4567', '512-9999'}

We introduced collections in Cassandra 1.2, but they had a number of limitations. One is that collections could not contain other collections. User defined types in 2.1 allow that. Here we have an address type, that holds a set of phone numbers. We can then use that address type in a map in the users table.
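A sketch of writing one of these rows; uuid() and the address values are illustrative, and note that the released 2.1 syntax requires the collection to be declared map<text, frozen<address>>:

INSERT INTO users (id, name, addresses)
VALUES (uuid(), 'jbellis',
        { 'home': { street: '123 Main St',
                    city: 'Austin',
                    zip_code: 78750,
                    phones: {'512-4567', '512-9999'} } });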

Page 65: Tokyo cassandra conference 2014

Collection indexing

CREATE TABLE songs (
  id uuid PRIMARY KEY,
  artist text,
  album text,
  title text,
  data blob,
  tags set<text>
);

CREATE INDEX song_tags_idx ON songs(tags);

SELECT * FROM songs WHERE tags CONTAINS 'blues';

 id       | album         | artist            | tags                  | title
----------+---------------+-------------------+-----------------------+------------------
 5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

2.1 also brings index support to collections.
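Indexed collections stay queryable as they are updated with the usual collection operators; a sketch with a hypothetical song id:

UPDATE songs SET tags = tags + {'jazz'}
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

SELECT title FROM songs WHERE tags CONTAINS 'jazz';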

Page 66: Tokyo cassandra conference 2014

Inefficient bloom filters

[Diagram: sstable + sstable = ? (how many distinct partitions will the compacted sstable hold?)]

Page 70: Tokyo cassandra conference 2014

HyperLogLog applied

Page 71: Tokyo cassandra conference 2014

HLL and compaction

Page 74: Tokyo cassandra conference 2014

More-efficient repair

We’re making some big improvements to repair for 2.1. Repair is very network-efficient because we build a hash tree of the data to compare across different replicas. Then we only have to send actual rows across the network where the tree indicates an inconsistency.

Page 77: Tokyo cassandra conference 2014

More-efficient repair

The problem is that this tree is constructed at repair time, so when we add some new sstables and repair again, merkle tree (hash tree) construction has to start over. So repair ends up taking time proportional to the amount of data in the cluster, not because of network transfers but because of tree construction time.

Page 80: Tokyo cassandra conference 2014

More-efficient repair

So what we’re doing in 2.1 is allowing Cassandra to mark sstables as repaired and only build merkle trees from sstables that are new since the last repair. This means that as long as you run repair regularly, it will stay lightweight and performant even as your dataset grows.

Page 83: Tokyo cassandra conference 2014

Performance
• Memtable memory use cut by 85%
• Larger sstables, less compaction
• ~50% better write performance
• Full results after beta1

Page 84: Tokyo cassandra conference 2014

Questions?