At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
#Cassandra13
About me
#Cassandra13
The Spotify backend
#Cassandra13
The Spotify backend
• Around 4000 servers in 4 datacenters
• Volumes:
  - We have ~ 12 soccer fields of music
  - Streaming ~ 4 Wikipedias/second
  - ~ 24 000 000 active users
#Cassandra13
The Spotify backend
• Specialized software powering Spotify
  - ~ 70 services
  - Mostly Python, some Java
  - Small, simple services, each responsible for a single task
#Cassandra13
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
  - Poor cross-site replication support
  - Write master failure requires manual intervention
  - Sharding throws most relational advantages out the window
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra 2+ years ago
  - ~ 24 services use it by now
  - ~ 300 Cassandra nodes
  - ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
#Cassandra13
Cassandra @ Spotify
• We started using Cassandra 2+ years ago
  - ~ 24 services use it by now
  - ~ 300 Cassandra nodes
  - ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
• So we screwed up
• A lot
#Cassandra13
How to misconfigure Cassandra
#Cassandra13
Read repair
• Repairs data from outages during regular read operations
• With RR, all reads request hash digests from all nodes
• The result is still returned as soon as enough nodes have replied
• If there is a mismatch, perform a repair
#Cassandra13
Read repair
• Useful factoid: Read repair is performed across all data centers
• So in a multi-DC setup, all reads will result in requests being sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair
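A minimal sketch of the fix, using the Python cassandra-driver to set the CQL table options (the host, keyspace and table names are placeholders):

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Prefer the DC-local variant so a digest mismatch no longer fans out
    # to every data center on the read path.
    session.execute("""
        ALTER TABLE playlist_head
        WITH read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.1
    """)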
#Cassandra13
Row cache
• Cassandra can be configured to cache entire data rows in RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
#Cassandra13
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use it unless you understand all of your use cases
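If the access pattern really does fit (small, read-mostly rows that are read in full), the cache is enabled per table; a hedged sketch with the Python cassandra-driver, assuming the post-2.1 caching syntax and made-up names:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Bound how many rows per partition get cached; 'ALL' is where the
    # whole-row promotion and invalidation behaviour above starts to hurt.
    # (row_cache_size_in_mb must also be > 0 in cassandra.yaml.)
    session.execute("""
        ALTER TABLE playlist_head
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)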
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
#Cassandra13
Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
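A sketch of picking compression per table instead of blanket-enabling it (Python cassandra-driver; the option name shown is the pre-3.0 one, and the tables are made up):

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Turn compression off where read latency matters more than disk space...
    session.execute(
        "ALTER TABLE playlist_head WITH compression = {'sstable_compression': ''}")

    # ...and keep Snappy on bulky, less latency-sensitive data.
    session.execute(
        "ALTER TABLE playlist_changes "
        "WITH compression = {'sstable_compression': 'SnappyCompressor'}")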
#Cassandra13
How to misuse Cassandra
#Cassandra13
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long time, the row will spread over more SSTables
• And performance jumps off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend
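A trivial wrapper you could run periodically (a sketch; assumes nodetool is on the PATH, and the keyspace/table names are placeholders), since the interesting number is the SSTables-per-read distribution in the output:

    import subprocess

    def dump_cfhistograms(keyspace, table, host='localhost'):
        # The 'SSTables' column shows how many SSTables each read touched;
        # a long tail here explains reads falling off a cliff.
        out = subprocess.check_output(
            ['nodetool', '-h', host, 'cfhistograms', keyspace, table])
        print(out.decode())

    dump_cfhistograms('playlists', 'playlist_head')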
#Cassandra13
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores its first/last column
• Time series-like data is effectively partitioned
#Cassandra13
Few cross continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay
• Reduced packet count by 20 %
#Cassandra13
How not to upgrade Cassandra
#Cassandra13
How not to upgrade Cassandra
• Very few total cluster outages
  - Clusters have been up and running since the early 0.7 days, been rolling upgraded, expanded, gone through full hardware replacements, etc.
• Never lost any data!
  - No matter how spectacularly Cassandra fails, it has never written bad data
  - Immutable SSTables FTW
#Cassandra13
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed a rolling upgrade would work
  - It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it out a day later
• Takeaways:
  - ALWAYS try rolling upgrades in a testing environment
  - Don't believe what people on the Internet tell you
#Cassandra13
Upgrade from 0.8 to 1.0
• We tried upgrading in the test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
#Cassandra13
Upgrade from 0.8 to 1.0
• We tried upgrading in the test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data ⇾ fixed
• Takeaway: ALWAYS test upgrades using production data
#Cassandra13
Upgrade from 1.0 to 1.1
• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub ⇾ restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
#Cassandra13
How not to deal with large clusters
#Cassandra13
Coordinator
• Coordinator performs partitioning, passes on request to the right nodes
• Merges all responses
#Cassandra13
What happens if one node is slow?
#Cassandra13
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Net hiccup
• Major GC
• Reality
#Cassandra13
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests
#Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
#Cassandra13
What happens if one node is slow?
• Solution: partitioner awareness in the client
• A slow node then affects at most its replicas (e.g. 3 nodes), not the whole cluster
• Available in Astyanax
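Astyanax gives you this on the JVM; the same idea with the Python cassandra-driver looks roughly like this (hosts, DC name and keyspace are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Route each request directly to a replica that owns the key. A slow node
    # then only hurts the keys it is a replica for, instead of filling up the
    # coordinator queue on every node in the cluster.
    cluster = Cluster(
        contact_points=['cassandra-host1', 'cassandra-host2'],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='dc1')))
    session = cluster.connect('playlists')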
#Cassandra13
How not to delete data
#Cassandra13
How not to delete data
How is data deleted?
• SSTables are immutable, we can't remove the data
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write
#Cassandra13
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• Still need grace time to handle node downtime
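That grace time is the per-table gc_grace_seconds option (default 10 days); a sketch of tuning it with the Python cassandra-driver, table and keyspace names made up:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('playlists')

    # Tombstones only become purgeable after gc_grace_seconds. Lowering it frees
    # space sooner, but a node that was down longer than this must be rebuilt,
    # or deleted data can be resurrected.
    session.execute("ALTER TABLE playlist_head WITH gc_grace_seconds = 432000")  # 5 days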
#Cassandra13
How not to delete data
• Tombstones can only be deleted once all the non-tombstone values they shadow have been deleted
• Tombstones can only be deleted if all SSTables holding values for that row take part in the same compaction
• If you're using SizeTiered compaction, 'old' rows will rarely get deleted
#Cassandra13
How not to delete data
• Tombstones are a problem even when using levelled compaction
• In theory, 90 % of all rows should live in a single SSTable
• In production, we've found that only 50 - 80 % of all reads hit only one SSTable
• In fact, frequently updated columns will exist in most levels, causing tombstones to stick around
#Cassandra13
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!
#Cassandra13
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
#Cassandra13
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
• Noooooo...
• (Overwritten data could theoretically bounce back)
#Cassandra13
TTL:ed data
• CASSANDRA-5228
• Drop entire SSTables when all columns are expired
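For reference, a sketch of how TTL:ed data is written in the first place (Python cassandra-driver; the keyspace, table and values are made up):

    import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('stats')

    # The cells silently expire 24h after the write; no explicit delete, but
    # (pre CASSANDRA-5228) they still turn into tombstones on expiry.
    session.execute(
        "INSERT INTO recently_played (user, played_at, track) "
        "VALUES (%s, %s, %s) USING TTL 86400",
        ('spotify:user:alice', datetime.datetime.utcnow(), 'spotify:track:abc123'))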
#Cassandra13
The Playlist service
Our most complex service
• ~ 1 billion playlists
• 40 000 reads per second
• 22 TB of compressed data
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
#Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
• Perfect test case for Cassandra!
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature
#Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature... said no one ever
#Cassandra13
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is an optimization
• Cassandra is pretty neat for storing this kind of stuff
• Can use consistency level ONE safely
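A sketch of what a change append can look like with the Python cassandra-driver (schema and names are invented for illustration); since every revision is a new, immutable cell keyed by its sequence number, writes at consistency level ONE never conflict with each other:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['cassandra-host']).connect('playlists')

    # Append-only: each change gets its own column, nothing is overwritten.
    append = SimpleStatement(
        "INSERT INTO playlist_changes (playlist, seq, change) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(append, ('spotify:user:alice:playlist:1', 42, b'serialized-change'))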
#Cassandra13
Tombstone hell
• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90 % of all reads go to HEAD
• We mlock the HEAD SSTables into RAM
#Cassandra13
Tombstone hell
• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copy SSTables to a development machine for investigation
#Cassandra13
Tombstone hell
• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copy SSTables to a development machine for investigation
• The Cassandra tool sstable2json showed that the row contained 600 000 tombstones!
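Roughly what that offline investigation looked like (a sketch; sstable2json ships with Cassandra, the SSTable path is a placeholder, and the exact JSON format varies between versions):

    import subprocess

    # Dump the copied SSTable to JSON; tombstoned columns carry an explicit
    # deletion marker, so even a crude count makes the problem obvious.
    dump = subprocess.check_output(
        ['sstable2json', '/tmp/playlist_head/playlist_head-hf-1234-Data.db'])
    print('approx tombstone markers:', dump.count(b'"d"'))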
#Cassandra13
Tombstone hell
• WAT‽
• Data is in the column name
• Used to detect forks
#Cassandra13
Tombstone hell
• We expected tombstones would be deleted after 30 days
• Nope, all tombstones from 1.5 years back were still there
• Revelation: rows existing in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (CF size cut in half)
#Cassandra13
Zombie tombstones
• Ran major compactions manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again‽‽‽
#Cassandra13
Repair vs major compactions
A repair between the major compactions "resurrected" the tombstones :(
New solution (sketched below):
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
A (by now) well-known Cassandra anti-pattern: don't use Cassandra to store queues
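The shape of the resulting maintenance job, as a sketch (nodetool on the PATH; keyspace and table names are placeholders):

    import datetime
    import subprocess

    def nightly_maintenance(keyspace='playlists', table='playlist_head'):
        # Repairs Monday-Friday, major compaction on the weekend, so a repair
        # can no longer stream back tombstones that a major compaction purged.
        if datetime.date.today().weekday() < 5:    # Mon-Fri
            subprocess.check_call(['nodetool', 'repair', keyspace, table])
        else:                                      # Sat-Sun
            subprocess.check_call(['nodetool', 'compact', keyspace, table])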
#Cassandra13
Cassandra counters
• There are lots of places in the Spotify UI where we count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?
#Cassandra13
Cassandra counters
• Yep
• They've actually worked pretty well for us
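A sketch of the usage (Python cassandra-driver; keyspace, table and key are made up): counter columns live in their own table and are only ever incremented or decremented, never set directly.

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-host']).connect('social')

    # Counter updates are their own write path; counters can't be mixed with
    # regular columns in the same table.
    session.execute(
        "UPDATE playlist_followers SET followers = followers + 1 WHERE playlist = %s",
        ('spotify:user:alice:playlist:1',))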
#Cassandra13
Lessons
#Cassandra13
How not to fail
• Treat Cassandra as a utility belt
• Lots of one-off solutions:
  - Weekly major compactions
  - Delete all SSTables and recreate from scratch every day
  - Memlock frequently used SSTables in RAM
#Cassandra13
Lessons
• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• Making benchmarks close to useless
#Cassandra13
Lessons
• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend
#Cassandra13
Lessons
• There are still various esoteric problems with large scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us
June 17, 2013
#Cassandra13
spotify.com/jobs
Questions?