Upload
datastax-academy
View
1.293
Download
1
Embed Size (px)
Citation preview
Automating Cassandra Repairs
Radovan [email protected]
github.com/spotify/cassandra-reaper#CassandraSummit
Running Cassandra
Requires many things
One of them is keeping data consistent
Otherwise it can get lost or reappear
Repair gone wild
Eats a lot of disk IOSaturates the network● because of streaming a lot of data around
Repair gone wild
Eats a lot of disk IOSaturates the networkFills up the disk● because of receiving all replicas, possibly
from all other data centers
Repair gone wild
Eats a lot of disk IOSaturates the networkFills up the diskCauses a ton of compactions● because of having to merge the received
data
Repair gone wild
Eats a lot of disk IOSaturates the networkFills up the diskCauses a ton of compactions
… one better be careful
Requires splitting the ring into smaller intervals
Smaller intervals mean less data
Less data means fewer repairs gone wild
Careful repair
Smaller intervals also mean more intervals
More intervals mean more actual repairs
Repairs need to be babysat :(
Careful repair
The Spotify way
Feature teams meant to do featuresNot waste time operating their C* clusters
Cron-ing nodetool repair is no good ● mostly due to no feedback loop
This all led to creation of the Reaper
The reaping
You:curl http://reaper/cluster --data ‘{“seedHost” : “my.cassandra.host.net”}’
The Reaper:● Figures out cluster info (e.g. name, partitioner)
The reaping
You:curl http://reaper/repair_run --data ‘{“clusterName”: “myCluster”}’
The Reaper:● Prepares repair intervals
The reaping
You:curl -X PUT http://reaper/repair_run/42 -d state=RUNNING
The Reaper:● Starts triggering repairs of repair intervals
Reaper’s features
Carefulness - doesn’t kill a node● checks for node load● backs off after repairing an interval
Reaper’s features
Carefulness - doesn’t kill a nodeResilience - retries when things break● because things break all the time
Reaper’s features
Carefulness - doesn’t kill a nodeResilience - retries when things breakParallelism - no idle nodes● multiple small intervals in parallel
Reaper’s features
Carefulness - doesn’t kill a nodeResilience - retries when things breakParallelism - no idle nodesScheduling - setup things only once● regular full-ring repairs
Reaper’s features
Carefulness - doesn’t kill a nodeResilience - retries when things breakParallelism - no idle nodesScheduling - setup things only oncePersistency - state saved somewhere● a bit of extra resilience
What we reaped
First repair done 2015-01-281,700 repairs since then, recently 90 per week176,000 (16%) segments failed at least once60 repair failures
Greatest benefitCassandra Reaper automates a very tedious maintenance operation of Cassandra clusters in a rather smart, efficient and careful manner while requiring minimal Cassandra expertise
github.com/spotify/cassandra-reaper
#CassandraSummit