C* Summit 2013: How Not to Use Cassandra
Axel Liljencrantz, [email protected]
June 17, 2013 | #Cassandra13

DESCRIPTION

At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.


Page 1

June 17, 2013

Axel Liljencrantz [email protected]

How not to use Cassandra

Page 2

About me

Page 3

The Spotify backend

Page 4

The Spotify backend

• Around 4,000 servers in 4 datacenters
• Volumes:
  - We have ~12 soccer fields of music
  - Streaming ~4 Wikipedias/second
  - ~24,000,000 active users

Page 5

The Spotify backend

• Specialized software powering Spotify
  - ~70 services
  - Mostly Python, some Java
  - Small, simple services, each responsible for a single task

Page 6

Storage needs

• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
  - Poor cross-site replication support
  - Write master failure requires manual intervention
  - Sharding throws most relational advantages out the window

Page 7

Cassandra @ Spotify

• We started using Cassandra 2+ years ago
  - ~24 services use it by now
  - ~300 Cassandra nodes
  - ~50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra

Page 8

Cassandra @ Spotify

• So we screwed up
• A lot

Page 9

How to misconfigure Cassandra

Page 10

Read repair

• Repairs inconsistencies caused by outages during regular read operations
• With RR, all reads request hash digests from all nodes
• The result is still returned as soon as enough nodes have replied
• If there is a mismatch, perform a repair

Page 11

Read repair

•  Useful factoid: Read repair is performed across all data centers

•  So in a multi-DC setup, all reads will result in requests being sent to every data center

• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair (see the sketch below)
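
A minimal sketch of turning this knob from cassandra-cli, the deck's own tooling; the column family name is hypothetical, and the attribute names follow the 1.1-era schema, so treat them as an assumption to verify against your version:

    update column family playlist_head
      with read_repair_chance = 0.1
      and dclocal_read_repair_chance = 1.0;

Keeping the global chance low while the DC-local chance stays high confines most read-repair digest traffic to the local data center.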

Page 12

Row cache

•  Cassandra can be configured to cache entire data rows in RAM

• Intended as a memcached alternative
• Let's enable it. What's the worst that could happen, right?

Page 13

Row cache

NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use unless you understand all your use cases
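
If you experiment with it anyway, the per-CF caching attribute (Cassandra 1.1 syntax) is the relevant switch; a hedged cassandra-cli sketch with a hypothetical column family:

    update column family playlist_head with caching = 'keys_only';

'keys_only' keeps the key cache without pulling whole rows into RAM; 'rows_only' or 'all' opt into the full-row behavior described above.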

Page 14

Compression

• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?

Page 15

Compression

• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
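
Part of the trap is how easy enabling it looks; a cassandra-cli sketch (hypothetical column family, attribute syntax as documented for the 1.0/1.1 era):

    update column family tracks
      with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};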

Page 16

How to misuse Cassandra

Page 17

Performance worse over time

• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long time, the row will spread over more SSTables
• And performance jumps off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend (see below)
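
The command itself, with placeholder keyspace and column family names:

    nodetool -h localhost cfhistograms playlist playlist_head

The SSTables column of the output is the distribution of SSTables touched per read; a tail reaching into the dozens means reads are paying for fragmented rows.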

Page 18

Performance worse over time

• CASSANDRA-5514
• Every SSTable stores the first/last column of the SSTable
• Time series-like data is effectively partitioned

Page 19

Few cross-continent clusters

• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay
• Reduced packet count by 20%

Page 20

How not to upgrade Cassandra

Page 21

How not to upgrade Cassandra

• Very few total cluster outages
  - Clusters have been up and running since the early 0.7 days, through rolling upgrades, expansions, full hardware replacements, etc.
• Never lost any data!
  - No matter how spectacularly Cassandra fails, it has never written bad data
  - Immutable SSTables FTW

Page 22

Upgrade from 0.7 to 0.8

• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed a rolling upgrade would work
  - It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it out a day later
• Takeaways:
  - ALWAYS try rolling upgrades in a testing environment
  - Don't believe what people on the Internet tell you

Page 23

Upgrade from 0.8 to 1.0

• We tried upgrading in the test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone

Page 24

Upgrade from 0.8 to 1.0

• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data

Page 25

Upgrade from 1.0 to 1.1

• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub ⇾ restart made them reappear
• This was in December; we have not been able to reproduce it since
• PEBKAC?
• Takeaway: ?

Page 26

How not to deal with large clusters

Page 27

Coordinator

• The coordinator performs partitioning and passes the request on to the right nodes
• Merges all responses

Page 28

What happens if one node is slow?

Page 29

What happens if one node is slow?

Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Network hiccup
• Major GC
• Reality

Page 30

What happens if one node is slow?

• The coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?

Page 31

What happens if one node is slow?

• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests

Page 32

What happens if one node is slow?

• No single point of failure?

Page 33

What happens if one node is slow?

• Solution: partitioner awareness in the client
• At most the 3 replica nodes are affected, not the whole cluster
• Available in Astyanax

Page 34

How not to delete data

Page 35

How not to delete data

How is data deleted?
• SSTables are immutable, so we can't remove the data
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write

Page 36

How not to delete data

Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• Still need a grace time (gc_grace) to handle node downtime
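
Concretely: a delete is just a versioned tombstone write, and the grace period is a per-CF setting. In the cassandra-cli idiom used elsewhere in this deck (hypothetical column family and key):

    del playlist_head[utf8('spotify:user...')];
    update column family playlist_head with gc_grace = 864000;

gc_grace is in seconds (864000 is the 10-day default); a tombstone cannot be purged before it is at least that old, giving downed nodes time to repair before the delete is forgotten.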

Page 37

How not to delete data

• Tombstones can only be deleted once all non-tombstone values have been deleted
• Tombstones can only be deleted if all values for the specified row are being compacted together
• If you're using SizeTiered compaction, 'old' rows will rarely get deleted

Page 38

How not to delete data

• Tombstones are a problem even when using leveled compaction
• In theory, 90% of all rows should live in a single SSTable
• In production, we've found that only 50-80% of all reads hit only one SSTable
• In fact, frequently updated columns will exist in most levels, causing tombstones to stick around

Page 39

How not to delete data

• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!

Page 40

TTL'd data

• Cassandra supports TTL'd data
• Once TTL'd data expires, it should just be compacted away, right?
• We know we don't need the data anymore, so no need for a tombstone, so it should be fast, right?

Page 41

TTL'd data

• Noooooo...
• (Overwritten data could theoretically bounce back)

Page 42

TTL'd data

• CASSANDRA-5228
• Drop entire SSTables when all columns have expired
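
For reference, a TTL write from cassandra-cli (hypothetical column family, key, and value); expired columns still linger as expired markers until compaction drops them, which is what CASSANDRA-5228 short-circuits for fully expired SSTables:

    set sessions[utf8('user:123')][utf8('token')] = utf8('abc123') with ttl = 86400;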

Page 43

The Playlist service

Our most complex service:
• ~1 billion playlists
• 40,000 reads per second
• 22 TB of compressed data

Page 44

The Playlist service

Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtime, huge scalability problems

Page 45

The Playlist service

• Perfect test case for Cassandra!

Page 46

Playlist data model

• Every playlist is a revisioned object
• Think of it like a distributed version control system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature

Page 47

Playlist data model

• That's actually a really useful feature... said no one ever

Page 48

Playlist data model

• Sequence of changes
• The changes are the authoritative data
• Everything else is optimization
• Cassandra is pretty neat for storing this kind of stuff
• Can use consistency level ONE safely (sketch below)
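
A hedged sketch of what such a change-log column family might look like in cassandra-cli; the names are illustrative, not Spotify's actual schema. Because changes are immutable and time-ordered, a stale read at CL ONE just means an older revision, never a conflicting one:

    create column family playlist_changes
      with comparator = TimeUUIDType
      and default_validation_class = BytesType;
    consistencylevel as ONE;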

Page 49

Page 50

Tombstone hell

• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90% of all reads go to HEAD
• mlock

Page 51

Tombstone hell

• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
    get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copied SSTables to a development machine for investigation

Page 52

Tombstone hell

• The Cassandra tool sstable2json showed that the row contained 600,000 tombstones!
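
sstable2json ships with Cassandra and can dump a single row given its key in hex; the path below is a made-up example:

    sstable2json /var/lib/cassandra/data/playlist/playlist_head-hc-42-Data.db -k <hex-encoded row key>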

Page 53

Tombstone hell

• WAT‽
• Data is in the column name
• Used to detect forks

Page 54

Tombstone hell

• We expected tombstones to be deleted after 30 days
• Nope, all tombstones from the past 1.5 years were there
• Revelation: rows existing in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (CF size cut in half)

Page 55

Zombie tombstones

• Ran major compaction manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again‽‽‽

Page 56

Repair vs major compactions

A repair between the major compactions "resurrected" the tombstones :(

New solution:
• Repairs Monday-Friday
• Major compaction Saturday-Sunday (schedule sketch below)

A (by now) well-known Cassandra anti-pattern:

Don't use Cassandra to store queues
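
What the weekly split above might look like operationally; a hedged sketch using standard nodetool commands, with hypothetical host, keyspace, and column family names:

    # Mon-Fri: anti-entropy repair, staggered node by node
    nodetool -h cassandra-node1 repair playlist playlist_head
    # Sat-Sun: major compaction, so mid-week repairs cannot resurrect freshly purged tombstones
    nodetool -h cassandra-node1 compact playlist playlist_head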

Page 57

Cassandra counters

• There are lots of places in the Spotify UI where we count things
  - # of followers of a playlist
  - # of followers of an artist
  - # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?

Page 58

Cassandra counters

• Yep
• They've actually worked pretty well for us
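
For completeness, a counter column family and an increment in cassandra-cli; the names are hypothetical:

    create column family artist_stats with default_validation_class = CounterColumnType;
    incr artist_stats[utf8('artist:123')][utf8('followers')];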

Page 59

Lessons

Page 60

How not to fail

• Treat Cassandra as a utility belt
• Lots of one-off solutions:
  - Weekly major compactions
  - Delete all SSTables and recreate from scratch every day
  - Memlock frequently used SSTables in RAM

Page 61

Lessons

• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• Making benchmarks close to useless

Page 62

Lessons

• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend

Page 63

Lessons

• There are still various esoteric problems with large-scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us

Page 64

spotify.com/jobs

Questions?