Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastructure


Cassandra Compute Cloud An Elastic Cassandra Infrastructure

Gurashish Singh Brar

Member of Technical Staff @ BloomReach

Abstract: Dynamically scaling Cassandra to serve hundreds of map-reduce jobs that arrive at an unpredictable rate, while at the same time giving front-end applications real-time access to the data under strict TP95 latency guarantees, is a hard problem. We present a system for managing Cassandra clusters that provides the following functionality: 1) dynamic scaling of capacity to serve high-throughput map-reduce jobs; 2) real-time access for front-end applications to data generated by map-reduce jobs, with TP95 latency SLAs; 3) low cost, by leveraging Amazon Spot Instances and demand-based scaling. At the heart of this infrastructure lies a custom data replication service that makes it possible to stream data to new nodes as needed.

What is it about?

•  Dynamically scaling the infrastructure to support large EMR jobs

•  Throughput SLA to backend applications

•  TP95 latency SLA to frontend applications

•  Cassandra 2.0 using vnodes

Agenda

•  Application requirements

•  Major issues we encountered

•  Solutions to the issues

Application Requirements

•  Backend EMR jobs performing scans, lookups, and writes
   - Heterogeneous applications with varying degrees of throughput SLAs
   - Very high peak loads
   - Always available (no maintenance periods or planned downtimes)

•  Frontend applications performing lookups
   - Data from backend applications expected in real time
   - Low latencies

•  Developer support

How we started

[Diagram: Frontend Applications and EMR Jobs all running against a single Cassandra cluster with a Frontend DC and a Backend DC]

Frontend isolation using multiple DCs

[Diagram: Cassandra cluster with a separate Frontend DC and Backend DC]
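The deck does not show the keyspace definition, but the standard Cassandra mechanism behind this kind of per-workload isolation is NetworkTopologyStrategy with per-DC replica counts. A minimal sketch using the DataStax Python driver; the keyspace name, DC names, and replica counts are illustrative assumptions, not the configuration from the talk.

```python
# Illustrative only: keyspace name, DC names and replica counts are assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])   # any reachable node
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS prod_ks
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'frontend': 2,
        'backend': 3
    }
""")
# Frontend clients read from the 'frontend' DC replicas; EMR jobs hit 'backend'.
```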

Frontend Issue: Spillover Reads

[Diagram: backend reads spilling over from the Backend DC into the Frontend DC]

Frontend Issue: Latencies vs Replication Load

[Diagram: EMR Jobs write into the Backend DC; replication of those writes to the Frontend DC competes with Frontend Application reads]

Backend Issue: Fixed Resource

[Diagram: a couple of EMR jobs sharing the fixed-capacity Backend DC]

Backend Issue: Fixed Resource

[Diagram: many concurrent EMR jobs overwhelming the fixed-capacity Backend DC]

Backend Issue: Starvation

[Diagram: a large EMR job with a relaxed SLA starving a small EMR job with a tighter SLA on the Backend DC]

Summary of Issues

•  Frontend isolation is not perfect

•  Frontend latencies are impacted by backend write load

•  EMR jobs can overwhelm the Cassandra cluster

•  Large EMR jobs can starve smaller ones

Rate Limiter

[Diagram: EMR Jobs obtain permits from a Token Server (Redis) before issuing requests to the Backend DC; Frontend Applications continue to hit the Frontend DC directly]

Rate Limiter

•  QPS allocated per operation and per application

•  Operations can be scans, reads, writes, prepare, alter, create, etc.

•  Each mapper/reducer obtains permits for 1 minute (configurable)

•  The token bucket is periodically refreshed with the allocated capacity

•  Quotas are dynamically adjusted so applications can take advantage of each other's unused quota (we do want to maximize cluster usage)

Why Redis?

•  High load from all EMR nodes

•  Low latency

•  Supports a high number of concurrent connections

•  Supports atomic fetch-and-add (see the sketch below)
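A minimal sketch of the permit flow described above, assuming one counter per (application, operation) pair in Redis; key names, the refresh policy, and the quota numbers are illustrative, not BloomReach's actual implementation.

```python
# Hypothetical token-bucket layout: one Redis counter per (application, operation).
import redis

r = redis.StrictRedis(host="token-server", port=6379)

def refresh_quota(app, op, qps, period_s=60):
    """Refresher job: every period, reset the bucket to the allocated capacity."""
    r.set("quota:%s:%s" % (app, op), qps * period_s)

def acquire_permits(app, op, permits):
    """A mapper/reducer grabs ~1 minute worth of permits with an atomic decrement."""
    key = "quota:%s:%s" % (app, op)
    remaining = r.decr(key, permits)      # atomic fetch-and-add (DECRBY)
    if remaining < 0:
        r.incr(key, permits)              # over-drafted: give the permits back
        return False                      # caller backs off and retries
    return True

# Example: an EMR mapper asking for one minute of read capacity at 10 reads/sec
if not acquire_permits("indexing-job", "read", permits=600):
    pass  # sleep, then retry or fail the task
```

Dynamic quota adjustment then reduces to the refresher redistributing unused capacity between buckets.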

Cost of Rate Limiter

•  We converted EMR from an elastic resource to a fixed resource

•  To scale EMR we have to scale Cassandra

•  Adding capacity to Cassandra cluster is not trivial

•  Adding capacity under heavy load is harder

•  Auto-scaling up and back down under heavy load is even harder

Managing capacity - Requirements

•  Time to increase capacity should be in minutes

•  Programmatic management and not manual

•  Minimal load on the production cluster during the operation

C* Increasing Capacity

[Diagram: adding nodes to the existing C* cluster is expensive]

C* Increasing Capacity

[Diagram: the solution is to replicate to a new C* cluster instead of growing the existing one]

Custom Replication Service

[Diagram: the Custom Replication Service copies SSTable files from the Source Cluster to the Destination Cluster]

Custom Replication Service

•  The Replication Service (source node) takes a snapshot of the column family

•  SSTables in the snapshot are streamed evenly across the destination cluster

•  The Replication Service (destination node) splits a single source SSTable into N SSTables

•  Splits are computed using the SSTableReader & SSTableWriter classes; a single SSTable can be split in parallel by multiple threads

Custom Replication Service

•  Once split, the new SSTables are streamed to the correct destination nodes

•  A rolling restart is initiated on the destination cluster (we could have used nodetool refresh, but it was unreliable)

•  The cluster is ready for use

•  In parallel, compaction is triggered on the destination cluster to optimize reads (a simplified sketch of the flow follows)
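A rough orchestration sketch of the first two steps (snapshot, then even file distribution). Host names, paths, and the snapshot tag are made up, and nodetool flag spelling varies across Cassandra versions; the SSTable splitting itself happens on the destination nodes via Cassandra's SSTableReader/SSTableWriter and is not shown here.

```python
# Hypothetical orchestration of the snapshot + copy steps; not BloomReach's service.
import itertools
import subprocess

SNAPSHOT_TAG = "replicate-run-42"           # illustrative tag
KEYSPACE, TABLE = "prod_ks", "documents"    # illustrative schema names
SOURCE_NODES = ["src-1", "src-2", "src-3"]
DEST_NODES = ["dst-1", "dst-2", "dst-3", "dst-4"]

def snapshot_source():
    """Step 1: snapshot the column family on every source node (hard links, cheap)."""
    for host in SOURCE_NODES:
        subprocess.check_call(
            ["ssh", host, "nodetool", "snapshot", "-t", SNAPSHOT_TAG,
             "-cf", TABLE, KEYSPACE])

def distribute_sstables(sstable_paths):
    """Step 2: spread the snapshot's SSTable files evenly over destination nodes."""
    assignment = {}
    dest_cycle = itertools.cycle(DEST_NODES)
    for path in sstable_paths:
        assignment.setdefault(next(dest_cycle), []).append(path)
    for host, paths in assignment.items():
        for path in paths:
            # plain file copy keeps CPU/memory load on the source cluster low
            subprocess.check_call(["rsync", "-a", path, "%s:/staging/" % host])
    return assignment
```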

Cluster Provisioning

•  Estimate the required cluster size from the column family's disk size on the source cluster (see the sizing sketch below)

•  Provision machines on AWS (Cassandra is pre-installed on the AMI, so no setup is required)

•  Generate the yaml and topology files for the new cluster and create a backend datacenter (application agnostic)

•  Copy the schema from the source cluster to the destination cluster

•  Call the replication service on the source cluster to replicate the data
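The sizing step amounts to simple arithmetic; the per-node disk figure and utilization target below are assumptions for illustration, not the numbers used in production.

```python
# Back-of-the-envelope sizing for the on-demand cluster.
import math

def estimate_cluster_size(cf_bytes_on_source,
                          usable_disk_per_node=600 * 10**9,  # assumed ~600 GB per node
                          target_utilization=0.5):           # headroom for splits/compaction
    """cf_bytes_on_source: the column family's total on-disk size (all replicas)."""
    per_node = usable_disk_per_node * target_utilization
    return max(3, math.ceil(cf_bytes_on_source / per_node))

# e.g. a 6 TB column family -> 20 nodes under these assumptions
nodes = estimate_cluster_size(6 * 10**12)
```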

C* Compute Cloud

[Diagram: a Cluster Management service replicates the Source Cluster into multiple on-demand clusters, which serve the EMR Jobs]

C* Compute Cloud

•  Very high throughput when moving raw data from the source to the destination cluster (10x increase in network usage compared to normal)

•  Little CPU/Memory load on the source cluster

•  Leverage the size of the destination cluster to compute new SSTables for the new ring

•  Time to provision varies between 10 and 40 minutes

•  API driven, so it scales up and down automatically with demand

•  Application agnostic

C* Compute Cloud - Limitations

•  Snapshot model: take a snapshot of production and operate on it
   - This works really well for some use cases, good for most, but not all

•  Provisioning time is on the order of minutes
   - Works for EMR jobs, which themselves take a few minutes to provision, but does not work for dedicated backend applications

•  Writes still need to happen on the production (reserved) cluster

Where we are now

[Diagram: Frontend Applications hit the Frontend DC; EMR Jobs obtain permits from the Token Server (Redis) and run against on-demand clusters, which the Cluster Management service provisions by replicating from the Backend DC]

Exploiting the C* compute cloud

•  Key feature: Easy, automated and fast cluster provisioning with production data

•  Use Spot Instances instead of On-Demand

•  Failures in a few nodes are survivable due to C* redundancy

•  In case of too many failures, just rebuild on retry (it's fast and automatic)

Spot Instances

•  Service supports all instance types in AWS and all AZs

•  Pick the Spot Instance type & AZ that is cheapest while satisfying the constraints (see the sketch below)

•  Further reduces cost and improves reliability of the service

•  If the r3.2xlarge spot price spikes, on retry the service might pick c3.8xlarge instead

•  Auto-expire clusters to adjust automatically to cheaper instances
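A sketch of "pick the cheapest type and AZ that satisfies the constraints", written against today's boto3 for illustration (the original 2014 service predates it); the candidate types and their relative capacity scores are assumptions.

```python
# Hypothetical spot-market scan; capacity scores and candidate types are made up.
import math
import boto3

CANDIDATES = {"r3.2xlarge": 1.0, "c3.8xlarge": 2.0}   # type -> relative capacity

def cheapest_spot_choice(required_capacity, region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    history = ec2.describe_spot_price_history(
        InstanceTypes=list(CANDIDATES),
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=100)["SpotPriceHistory"]            # recent prices, all AZs
    best = None
    for offer in history:
        itype, az = offer["InstanceType"], offer["AvailabilityZone"]
        nodes = math.ceil(required_capacity / CANDIDATES[itype])
        hourly = float(offer["SpotPrice"]) * nodes
        if best is None or hourly < best[0]:
            best = (hourly, itype, az, nodes)
    return best   # (cluster $/hour, instance type, AZ, node count)
```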

Cost or Capacity (take your pick)

Capacity of the C* compute cloud on spot instances ~= (5 to 10) x that of a C* cluster using on-demand instances, for the same $ value

Issues Addressed

•  Backend Read Capacity can scale linearly with C* compute cloud

•  Frontend latencies are protected from write load through rate limiting

Remaining issues

•  Read load on the backend DC can spill over to the frontend DC, causing spikes

•  Write capacity is still defined by frontend latencies

Issue: Spillover Reads

[Diagram: backend reads spilling over from the Backend DC into the Frontend DC]

Spillover Reads Fix: Fail the Read

[Diagram: a read that would spill over from the Backend DC into the Frontend DC is failed instead]
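The deck does not show the mechanism behind "fail the read". One common way to get this behavior with the DataStax Python driver, shown here as a sketch under assumptions (DC name, contact points, keyspace, and table are all illustrative), is to pin clients to their local DC and use a LOCAL_* consistency level, so the read fails fast instead of being served by replicas in the other DC.

```python
# Illustrative only: DC name, contact points, keyspace and table are assumptions.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

cluster = Cluster(
    contact_points=["10.0.1.10"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="backend",
                                used_hosts_per_remote_dc=0)))  # never route remotely
session = cluster.connect("prod_ks")

# LOCAL_ONE stays inside the local DC; if the local replicas cannot answer,
# the read fails instead of spilling over into the frontend DC.
stmt = SimpleStatement("SELECT * FROM documents WHERE id = %s",
                       consistency_level=ConsistencyLevel.LOCAL_ONE)
row = session.execute(stmt, ("doc-123",)).one()
```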

Addressing the Write Capacity

•  The obvious: only push updates that are actually new, not data that is unchanged (sketched below)
   - Big improvement: 80-90% of the data did not change

•  Add more nodes: with the backend read load off production, it is a lot easier to expand capacity

•  But we are still operating at roughly a third to a fifth of the write capacity in order to keep read latencies low
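A minimal sketch of the "only push what changed" idea: compare a content digest before writing. In practice the comparison could equally be done inside the EMR job against the previous run's output; the table, columns, and digest choice here are assumptions.

```python
# Hypothetical schema: documents(id, body, digest); skip writes whose digest matches.
import hashlib

def content_digest(value):
    """value: bytes of the serialized document."""
    return hashlib.md5(value).hexdigest()

def push_if_changed(session, doc_id, new_body):
    """Write only when the stored digest differs from the new content's digest."""
    row = session.execute(
        "SELECT digest FROM documents WHERE id = %s", (doc_id,)).one()
    new_digest = content_digest(new_body)
    if row is not None and row.digest == new_digest:
        return False   # unchanged: 80-90% of updates fall out here
    session.execute(
        "INSERT INTO documents (id, body, digest) VALUES (%s, %s, %s)",
        (doc_id, new_body, new_digest))
    return True
```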

Addressing the Write Capacity

•  Experimental changes under evaluation

•  Prioritize reads over writes on the frontend
   - Pause the write stage during a read

•  Reduce replication load from the backend DC to the frontend DC
   - Column-level replication strategy
   - Most frontend applications operate on a subset view of the backend data

Key Takeaways

•  Scale Cassandra dynamically for backend load by creating snapshot clusters

•  Use rate limiter to protect the production cluster from spiky and unexpected backend traffic

•  Build better isolation between frontend DC and backend DC

•  Write throughput from backend to frontend is still a challenge

Questions?

Thank you
