At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
#Cassandra13
About me
#Cassandra13
The Spotify backend
#Cassandra13
The Spotify backend
• Around 4000 servers in 4 datacenters
• Volumes:
  - We have ~ 12 soccer fields of music
  - Streaming ~ 4 Wikipedias/second
  - ~ 24 000 000 active users
#Cassandra13
The Spotify backend
• Specialized software powering Spotify
  - ~ 70 services
  - Mostly Python, some Java
  - Small, simple services, each responsible for a single task
#Cassandra13
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
  - Poor cross-site replication support
  - Write master failure requires manual intervention
  - Sharding throws most relational advantages out the window
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra 2+ years ago
  - ~ 24 services use it by now
  - ~ 300 Cassandra nodes
  - ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra 2+ years ago
  - ~ 24 services use it by now
  - ~ 300 Cassandra nodes
  - ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
• So we screwed up
• A lot
#Cassandra13
How to misconfigure Cassandra
#Cassandra13
Read repair
• Repairs data from outages during regular read operations
• With RR, all reads request hash digests from all nodes
• The result is still returned as soon as enough nodes have replied
• If there is a mismatch, perform a repair
#Cassandra13
Read repair
• Useful factoid: Read repair is performed across all data centers
• So in a multi-DC setup, all reads will result in requests being sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair
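A minimal sketch of the fix, using the Python cassandra-driver to set the CQL table options (the host, keyspace and table names are placeholders):

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Prefer the DC-local variant so a digest mismatch no longer fans out
    # to every data center on the read path.
    session.execute("""
        ALTER TABLE playlist_head
        WITH read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.1
    """)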
#Cassandra13
Row cache
• Cassandra can be configured to cache entire data rows in RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
#Cassandra13
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use it unless you understand all of your use cases
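If the access pattern really does fit (small, read-mostly rows that are read in full), the cache is enabled per table; a hedged sketch with the Python cassandra-driver, assuming the post-2.1 caching syntax and made-up names:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Bound how many rows per partition get cached; 'ALL' is where the
    # whole-row promotion and invalidation behaviour above starts to hurt.
    # (row_cache_size_in_mb must also be > 0 in cassandra.yaml.)
    session.execute("""
        ALTER TABLE playlist_head
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)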
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
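A sketch of picking compression per table instead of blanket-enabling it (Python cassandra-driver; the option name shown is the pre-3.0 one, and the tables are made up):

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Turn compression off where read latency matters more than disk space...
    session.execute(
        "ALTER TABLE playlist_head WITH compression = {'sstable_compression': ''}")

    # ...and keep Snappy on bulky, less latency-sensitive data.
    session.execute(
        "ALTER TABLE playlist_changes "
        "WITH compression = {'sstable_compression': 'SnappyCompressor'}")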
#Cassandra13
How to misuse Cassandra
#Cassandra13
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long time, the row will spread over more SSTables
• And performance jumps off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend
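A trivial wrapper you could run periodically (a sketch; assumes nodetool is on the PATH, and the keyspace/table names are placeholders), since the interesting number is the SSTables-per-read distribution in the output:

    import subprocess

    def dump_cfhistograms(keyspace, table, host='localhost'):
        # The 'SSTables' column shows how many SSTables each read touched;
        # a long tail here explains reads falling off a cliff.
        out = subprocess.check_output(
            ['nodetool', '-h', host, 'cfhistograms', keyspace, table])
        print(out.decode())

    dump_cfhistograms('playlists', 'playlist_head')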
#Cassandra13
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores its first/last column
• Time series-like data is effectively partitioned
#Cassandra13
Few cross continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay
• Reduced packet count by 20 %
#Cassandra13
How not to upgrade Cassandra
#Cassandra13
How not to upgrade Cassandra
• Very few total cluster outages
  - Clusters have been up and running since the early 0.7 days, been rolling upgraded, expanded, gone through full hardware replacements, etc.
• Never lost any data!
  - No matter how spectacularly Cassandra fails, it has never written bad data
  - Immutable SSTables FTW
#Cassandra13
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed a rolling upgrade would work
  - It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it out a day later
• Takeaways:
  - ALWAYS try rolling upgrades in a testing environment
  - Don't believe what people on the Internet tell you
#Cassandra13
Upgrade from 0.8 to 1.0
• We tried upgrading in the test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
#Cassandra13
Upgrade from 0.8 to 1.0
• We tried upgrading in the test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data
#Cassandra13
Upgrade from 1.0 to 1.1
• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub ⇾ restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
#Cassandra13
How not to deal with large clusters
#Cassandra13
Coordinator
• Coordinator performs partitioning, passes on request to the right nodes
• Merges all responses
#Cassandra13
What happens if one node is slow?
#Cassandra13
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Net hiccup
• Major GC
• Reality
#Cassandra13
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
#Cassandra13
What happens if one node is slow?
• Solution: partitioner awareness in the client
• A slow node then affects at most its replicas (e.g. 3 nodes), not the whole cluster
• Available in Astyanax
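Astyanax gives you this on the JVM; the same idea with the Python cassandra-driver looks roughly like this (hosts, DC name and keyspace are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Route each request directly to a replica that owns the key. A slow node
    # then only hurts the keys it is a replica for, instead of filling up the
    # coordinator queue on every node in the cluster.
    cluster = Cluster(
        contact_points=['cassandra-host1', 'cassandra-host2'],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='dc1')))
    session = cluster.connect('playlists')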
#Cassandra13
How not to delete data
#Cassandra13
How not to delete data
How is data deleted?
• SSTables are immutable, we can't remove the data
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write
#Cassandra13
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• Still need grace time to handle node downtime
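That grace time is the per-table gc_grace_seconds option (default 10 days); a sketch of tuning it with the Python cassandra-driver, table and keyspace names made up:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Tombstones only become purgeable after gc_grace_seconds. Lowering it frees
    # space sooner, but a node that was down longer than this must be rebuilt,
    # or deleted data can be resurrected.
    session.execute("ALTER TABLE playlist_head WITH gc_grace_seconds = 432000")  # 5 days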
#Cassandra13
How not to delete data
• Tombstones can only be deleted once all the non-tombstone values they shadow have been deleted
• Tombstones can only be deleted if all SSTables holding values for that row take part in the same compaction
• If you're using SizeTiered compaction, 'old' rows will rarely get deleted
#Cassandra13
How not to delete data
• Tombstones are a problem even when using levelled compaction
• In theory, 90 % of all rows should live in a single SSTable
• In production, we've found that only 50 - 80 % of all reads hit only one SSTable
• In fact, frequently updated columns will exist in most levels, causing tombstones to stick around
#Cassandra13
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!
#Cassandra13
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
#Cassandra13
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
• Noooooo...
• (Overwritten data could theoretically bounce back)
#Cassandra13
TTL:ed data
• CASSANDRA-5228
• Drop entire SSTables when all columns are expired
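For reference, a sketch of how TTL:ed data is written in the first place (Python cassandra-driver; the keyspace, table and values are made up):

    import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('stats')

    # The cells silently expire 24h after the write; no explicit delete, but
    # (pre CASSANDRA-5228) they still turn into tombstones on expiry.
    session.execute(
        "INSERT INTO recently_played (user, played_at, track) "
        "VALUES (%s, %s, %s) USING TTL 86400",
        ('spotify:user:alice', datetime.datetime.utcnow(), 'spotify:track:abc123'))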
#Cassandra13
The Playlist service
Our most complex service
• ~ 1 billion playlists
• 40 000 reads per second
• 22 TB of compressed data
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
• Perfect test case for Cassandra!
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature... said no one ever
#Cassandra13
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is an optimization
• Cassandra is pretty neat for storing this kind of stuff
• Can use consistency level ONE safely
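A sketch of what a change append can look like with the Python cassandra-driver (schema and names are invented for illustration); since every revision is a new, immutable cell keyed by its sequence number, writes at consistency level ONE never conflict with each other:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['cassandra-host']).connect('playlists')

    # Append-only: each change gets its own column, nothing is overwritten.
    append = SimpleStatement(
        "INSERT INTO playlist_changes (playlist, seq, change) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(append, ('spotify:user:alice:playlist:1', 42, b'serialized-change'))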
#Cassandra13
Tombstone hell
• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90 % of all reads go to HEAD
• We mlock the HEAD SSTables into RAM
#Cassandra13
Tombstone hell
• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copy SSTables to a development machine for investigation
#Cassandra13
Tombstone hell
• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copy SSTables to a development machine for investigation
• The Cassandra tool sstable2json showed that the row contained 600 000 tombstones!
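Roughly what that offline investigation looked like (a sketch; sstable2json ships with Cassandra, the SSTable path is a placeholder, and the exact JSON format varies between versions):

    import subprocess

    # Dump the copied SSTable to JSON; tombstoned columns carry an explicit
    # deletion marker, so even a crude count makes the problem obvious.
    dump = subprocess.check_output(
        ['sstable2json', '/tmp/playlist_head/playlist_head-hf-1234-Data.db'])
    print('approx tombstone markers:', dump.count(b'"d"'))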
#Cassandra13
Tombstone hell
• WAT‽
• Data is in the column name
• Used to detect forks
#Cassandra13
Tombstone hell
• We expected tombstones would be deleted after 30 days
• Nope, all tombstones from 1.5 years back were still there
• Revelation: rows existing in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (CF size cut in half)
#Cassandra13
Zombie tombstones
• Ran major compactions manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again‽‽‽
#Cassandra13
Repair vs major compactions
A repair between the major compactions "resurrected" the tombstones :(
New solution (sketched below):
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
A (by now) well-known Cassandra anti-pattern: don't use Cassandra to store queues
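The shape of the resulting maintenance job, as a sketch (nodetool on the PATH; keyspace and table names are placeholders):

    import datetime
    import subprocess

    def nightly_maintenance(keyspace='playlists', table='playlist_head'):
        # Repairs Monday-Friday, major compaction on the weekend, so a repair
        # can no longer stream back tombstones that a major compaction purged.
        if datetime.date.today().weekday() < 5:    # Mon-Fri
            subprocess.check_call(['nodetool', 'repair', keyspace, table])
        else:                                      # Sat-Sun
            subprocess.check_call(['nodetool', 'compact', keyspace, table])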
#Cassandra13
Cassandra counters
• There are lots of places in the Spotify UI where we count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?
#Cassandra13
Cassandra counters
• Yep
• They've actually worked pretty well for us
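A sketch of the usage (Python cassandra-driver; keyspace, table and key are made up): counter columns live in their own table and are only ever incremented or decremented, never set directly.

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('social')

    # Counter updates are their own write path; counters can't be mixed with
    # regular columns in the same table.
    session.execute(
        "UPDATE playlist_followers SET followers = followers + 1 WHERE playlist = %s",
        ('spotify:user:alice:playlist:1',))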
#Cassandra13
Lessons
#Cassandra13
How not to fail
• Treat Cassandra as a utility belt
• Lots of one-off solutions:
  - Weekly major compactions
  - Delete all SSTables and recreate from scratch every day
  - Memlock frequently used SSTables in RAM
#Cassandra13
Lessons
• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• Making benchmarks close to useless
#Cassandra13
Lessons
• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend
#Cassandra13
Lessons
• There are still various esoteric problems with large scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us
June 17, 2013
#Cassandra13
spotify.com/jobs
Questions?