Cassandra at Instagram (August 2013)

CASSANDRA AT INSTAGRAM
Rick Branson, Infrastructure Engineer
@rbranson

SF Cassandra Meetup, August 29, 2013

Disqus HQ

September 2012: Redis fillin' up.

What sucks?

THE OBVIOUS: Memory is expensive.

LESS OBVIOUS: In-memory "degrades" poorly

• Flat namespace. What's in there?

• Heap fragmentation

• Single threaded

BGSAVE

• Boils down to centralized logging

• VERY high skew of writes to reads (1,000:1)

• Ever growing data set

• Durability highly valued

• Dumb to store it in RAM, basically...

The Data

• Cassandra 1.1

• 3 EC2 m1.xlarge (2-core, 15GB RAM)

• RAIDed ephemerals (1.6TB of SATA)

• RF=3

• 6GB Heap, 200MB NewSize

• HSHA

The Setup

It worked. Mostly.

The horrible/cool thing about Chef...

commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date: Thu Oct 18 20:05:16 2012 -0700

Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake

commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date: Tue Oct 30 17:12:00 2012 -0700

Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+

November 2012: Doubled to 6 nodes.

18,000 connections. Spread those more evenly.

commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800

Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.

commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date: Mon Dec 24 12:41:13 2012 -0800

Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap

1.2.1. It went well. Well... until...

commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800

Switch Cassandra from tokens to vnodes

commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800

We aren't ready for vnodes yet guys

TAKEAWAY: Let stupid/enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.

commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date: Sun Feb 10 20:36:34 2013 -0800

Doubled C* cluster

commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date: Thu Mar 14 16:23:18 2013 -0700

Subtract token from C*ua7 to replace the node

pycassa exceptions (last 6 months)

• 3.4TB

• vnode migration still pending

TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...

• Sharded master/slave Redis

• 32x68GB (m2.4xlarge)

• Space (memory) bound

• Resharding sucks

• Failover is manual, wakes us up at night

user_id: [ activity, activity, ...]


Thrift Serialized Activity

Bound the Size

user_id: [ activity1, activity2, ... activity100, activity101, ...]

LTRIM <user_id> 0 99

Undo

user_id: [ activity1, activity2, activity3, ...]

LREM <user_id> 0 <activity2>
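A minimal sketch of the two Redis list operations above, modeled on a plain Python list so it runs without a Redis server. The helper names are hypothetical, and the use of a head push (as with LPUSH) is an assumption; the slides only show LTRIM and LREM.

```python
def lpush(lst, value):
    """Insert at the head of the list (assumed write path, like Redis LPUSH)."""
    lst.insert(0, value)

def ltrim(lst, start, stop):
    """Keep only elements start..stop inclusive, like LTRIM with non-negative indexes."""
    lst[:] = lst[start:stop + 1]

def lrem_all(lst, value):
    """Remove every element equal to value, like LREM with a count of 0."""
    lst[:] = [v for v in lst if v != value]

inbox = []
for i in range(1, 151):
    lpush(inbox, "activity%d" % i)
    ltrim(inbox, 0, 99)          # bound the size to the 100 newest entries

lrem_all(inbox, "activity120")   # undo: remove the activity by value
```

Both operations are cheap here only because the whole list lives in memory, which is the cost the rest of the talk is trying to get away from.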

C* data model

row key:       user_id
column names:  TimeUUID1    TimeUUID2    ...  TimeUUID101
column values: <activity>   <activity>   ...  <activity>

Bound the Size


get(<user_id>)
delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])

The great destroyer of systems shows up. Tombstones abound.

row key:           user_id
column names:      TimeUUID1    TimeUUID2    ...  TimeUUID2
column values:     <activity>   <activity>   ...  [tombstone]
column timestamps: timestamp1   timestamp2   ...  timestamp2

Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp.

Column Delete

The tombstone timestamp is >= the live column timestamp, so the column will be hidden from queries and compacted away.
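The column-delete rule can be sketched with a toy in-memory model. This is not Cassandra's storage code; `Cell` and `visible` are hypothetical names, and a `None` value stands in for a tombstone.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    name: str
    timestamp: int
    value: Optional[bytes] = None  # None marks this cell as a tombstone

def visible(cells):
    """Pick one winner per column name: newest timestamp wins, tombstones win ties."""
    winners = {}
    for cell in cells:
        current = winners.get(cell.name)
        newer = current is not None and cell.timestamp > current.timestamp
        tie_tombstone = (current is not None
                         and cell.timestamp == current.timestamp
                         and cell.value is None)
        if current is None or newer or tie_tombstone:
            winners[cell.name] = cell
    # Tombstoned columns are hidden from the query result.
    return {name: c.value for name, c in winners.items() if c.value is not None}

row = [
    Cell("TimeUUID1", 1, b"activity1"),
    Cell("TimeUUID2", 2, b"activity2"),
    Cell("TimeUUID2", 2),  # tombstone with timestamp >= the live column's
]
live = visible(row)
```

Every trimmed column leaves one of these markers behind until compaction, which is why trimming by column name makes tombstones pile up.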

row key:           user_id
column names:      TimeUUID1    TimeUUID2    ...  TimeUUID101
column values:     <activity>   <activity>   ...  <activity>
column timestamps:                           ...  timestamp101

TimeUUID = timestamp

To avoid tombstones, exploit that the timestamp embedded in our TimeUUID (ordering) is the same as the column timestamp.

row key:           user_id
column names:      TimeUUID1    TimeUUID2    ...  TimeUUID101
column values:     <activity>   <activity>   ...  <activity>
column timestamps:                           ...  timestamp101

delete(<user_id>, timestamp=<timestamp101>)

Row Delete

Cassandra can also store row tombstones, which delete all data from a row at-or-before the timestamp provided.
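A sketch of the row-delete trim, again on a toy in-memory row rather than a real cluster. It assumes each column is written with a timestamp equal to the one embedded in its TimeUUID name, as the slides describe. `make_timeuuid` and `read_row` are hypothetical helpers, and the base timestamp is arbitrary.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_micros(u):
    """Microseconds since the Unix epoch embedded in a version 1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET) // 10

def make_timeuuid(micros, seq):
    """Build a deterministic version 1 UUID for this example."""
    t = micros * 10 + UUID_EPOCH_OFFSET
    return uuid.UUID(fields=(t & 0xFFFFFFFF, (t >> 32) & 0xFFFF,
                             ((t >> 48) & 0x0FFF) | 0x1000, 0x80, seq & 0xFF, 0))

def read_row(columns, row_tombstone_ts=None):
    """Return columns newest first, hiding any written at or before the row tombstone."""
    live = [(name, value) for name, (value, ts) in columns.items()
            if row_tombstone_ts is None or ts > row_tombstone_ts]
    return sorted(live, key=lambda nv: timeuuid_micros(nv[0]), reverse=True)

LIMIT = 100
columns = {}
for i in range(1, 106):  # 105 activities, oldest first
    name = make_timeuuid(1377000000000000 + i, i)
    # Column write timestamp == timestamp embedded in the column name.
    columns[name] = (b"activity%d" % i, timeuuid_micros(name))

newest_first = read_row(columns)
cutoff_ts = timeuuid_micros(newest_first[LIMIT][0])   # the 101st newest column
trimmed = read_row(columns, row_tombstone_ts=cutoff_ts)
```

One row tombstone at `cutoff_ts` hides the 101st column and everything older, instead of one column tombstone per trimmed entry.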

Optimizes Reads

SSTable max_ts=100
SSTable max_ts=200
SSTable max_ts=300
SSTable max_ts=400
SSTable max_ts=500
SSTable max_ts=600
SSTable max_ts=700
SSTable max_ts=800

Contains row tombstone with timestamp 350

Safely ignored using in-memory metadata
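The read optimization can be sketched as a filter over per-SSTable metadata. This is an illustration of the idea only; `SSTableMeta` and `sstables_to_read` are hypothetical names, and the rule assumed here is that an SSTable whose newest timestamp is at or before the row tombstone cannot contain live data for that row.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SSTableMeta:
    name: str
    max_ts: int  # newest timestamp in the SSTable, held in memory

def sstables_to_read(sstables, row_tombstone_ts):
    """Keep only SSTables that could hold data newer than the row tombstone."""
    return [s for s in sstables if s.max_ts > row_tombstone_ts]

sstables = [SSTableMeta("sstable-%d" % n, max_ts=n * 100) for n in range(1, 9)]
needed = sstables_to_read(sstables, row_tombstone_ts=350)
skipped = [s for s in sstables if s not in needed]
```

With the tombstone at 350, the three oldest SSTables never need to be touched on disk for this row.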

~10% of actions are undos.

Undo Support


get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])

Simple Race Condition: The state of the row may have changed between these two operations.

💩

Writer: insert B
  Replica [A, B]   OK
  Replica [A]      FAIL
  Replica [A, B]

Like

Diverging Replicas

Writer: read [A]
  Replica [A, B]
  Replica [A]
  Replica [A, B]

Undo Like

Diverging Replicas

Replica is missing B, so if a read is required to find B before deleting it, it's going to fail.

SuperColumn = Old/Busted
AntiColumn = New/Hotness

row key:       user_id
column names:  (0, <TimeUUID>)  (1, <TimeUUID>)  (1, <TimeUUID>)
column values: anti-column      activity         activity

"Anti-Column": Borrowing from the idea of Cassandra's by-name tombstones, an anti-column contains an MD5 hash of the activity data "value" it is marking as deleted.


Composite Column: First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.

Writer: insert B
  Replica [A, B]   OK
  Replica [A]      FAIL
  Replica [A, B]

Like

Diverging Replicas: Solved

Writer: insert C (OK)
  Replica [A, B, C]
  Replica [A, C]
  Replica [A, B, C]

Undo Like

Diverging Replicas: Solved

Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.
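A minimal sketch of the anti-column idea, with dicts standing in for replicas and small integers standing in for TimeUUIDs. All function names are hypothetical, and filtering at read time by comparing MD5 digests is an assumption about how the marker is applied.

```python
import hashlib

ANTI, ACTIVITY = 0, 1  # first component of the composite column name

def like(row, timeuuid, activity):
    row[(ACTIVITY, timeuuid)] = activity

def undo(row, timeuuid, activity):
    # No read required: write a marker holding the hash of the value to hide.
    row[(ANTI, timeuuid)] = hashlib.md5(activity).digest()

def read_activities(row):
    """Return live activities in column order, dropping any hidden by an anti-column."""
    deleted = {v for (kind, _), v in row.items() if kind == ANTI}
    return [v for (kind, _), v in sorted(row.items())
            if kind == ACTIVITY and hashlib.md5(v).digest() not in deleted]

replica_ok, replica_behind = {}, {}
like(replica_ok, 1, b"A")
like(replica_behind, 1, b"A")
like(replica_ok, 2, b"B")             # the insert of B failed on replica_behind

for replica in (replica_ok, replica_behind):
    undo(replica, 3, b"B")            # succeeds even where B is missing

before_repair = [read_activities(replica_ok), read_activities(replica_behind)]
like(replica_behind, 2, b"B")         # B arrives late, for example via repair
after_repair = read_activities(replica_behind)
```

Because the undo is an insert, it commutes with the like it cancels: the replica that was missing B ends up with the same visible result whichever write reaches it first.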

TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.

• Keep 30% "buffer" for trims.

• Undo without read. (thumbsup)

• Large lists suck for this. (thumbsdown)

• CASSANDRA-5527

Built in two days. Experience paid off.

Reusability is key to rapid rollout. Great documentation eases concerns.

• C* 1.2.3

• vnodes, LeveledCompactionStrategy

• 12 hi1.4xlarge (8-core, 60GB, 2T SSD)

• 3 AZs, RF=3, CL W=TWO R=ONE

• 8G heap, 800M NewSize

Initial Setup

1. Dial up Double Writes

2. Test with "Shadow" Reads

3. Dial up "Real" Reads

Rollout
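One way the three rollout steps could be wired up, as a sketch only. The slides give just the step names, so everything here is an assumption: the `InboxStore` class, the percentage dials, hashing the user ID to pick a cohort, and dicts standing in for the Redis and Cassandra backends.

```python
import zlib

def in_rollout(user_id, percent):
    """Deterministically place a user inside or outside a percentage dial."""
    return zlib.crc32(str(user_id).encode("utf-8")) % 100 < percent

class InboxStore:
    def __init__(self, old, new, double_write_pct=0, shadow_read_pct=0, real_read_pct=0):
        self.old, self.new = old, new
        self.double_write_pct = double_write_pct
        self.shadow_read_pct = shadow_read_pct
        self.real_read_pct = real_read_pct
        self.shadow_mismatches = 0

    def append(self, user_id, activity):
        # Step 1: the old store always gets the write; the new one is dialed up.
        self.old.setdefault(user_id, []).append(activity)
        if in_rollout(user_id, self.double_write_pct):
            self.new.setdefault(user_id, []).append(activity)

    def fetch(self, user_id):
        # Step 3: "real" reads are served from the new store.
        if in_rollout(user_id, self.real_read_pct):
            return list(self.new.get(user_id, []))
        result = list(self.old.get(user_id, []))
        # Step 2: "shadow" reads hit the new store but only compare results.
        if in_rollout(user_id, self.shadow_read_pct):
            if list(self.new.get(user_id, [])) != result:
                self.shadow_mismatches += 1
        return result

store = InboxStore(old={}, new={}, double_write_pct=100, shadow_read_pct=100)
store.append(42, "liked your photo")
served = store.fetch(42)
```

Shadow reads put production load on the new cluster and surface mismatches before any user sees a response from it.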

commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date: Mon May 6 14:58:54 2013 -0700

Bump C* inbox heap size 8G -> 10G, seeing heap pressure

Bootstrapping sucked because compacting 10,000 SSTables takes forever.

sstable_size_in_mb: 5 => 25

Monitor Consistency

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name                    Active   Pending      Completed
Commands                        n/a         0     1837765727
Responses                       n/a         1     1750784545

UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;

99.63% consistent
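The consistency figure appears to follow from the read repair counters in the `nodetool netstats` output. A short calculation under that assumption, treating every mismatch as one inconsistent sampled read:

```python
# Counters copied from the Read Repair Statistics above.
attempted = 3192520
mismatch_blocking = 0
mismatch_background = 11584

mismatches = mismatch_blocking + mismatch_background
consistent_pct = 100.0 * (1 - mismatches / attempted)
print("%.3f%% of sampled reads were consistent" % consistent_pct)
```

The result falls between 99.63% and 99.64%, which matches the slide if the figure was truncated rather than rounded.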

SSTable Size (again)

Saw lots of GC pressure related to buffer garbage. Eventually they landed on a new default in 1.2.9+ (160MB).

sstable_size_in_mb: 25 => 128

Fetch & Deserialize Time (measured from app)

Mean vs P90 (ms), trough-to-peak

Space used (live): 180114509324
Space used (total): 180444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020

20K 200-column slice reads/sec

30K 1-column mutations/sec

30% CPU utilization

48K clients

Peak Stats

Exciting Future Things

• Python Native Protocol Driver

• Read CPU Consumption Work

• Mass CQL Adoption

• Triggers

• CAS (for limited use cases)

Next 6 Months...

• Node repair visibility & monitoring

• Objects & Associations Storage API on C* + memcache

• Migrate more from Redis

• New major use case

• Cassandra 2.0?

We're hiring!
