
DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Replica selection in Apache Cassandra
REDUCING THE TAIL LATENCY FOR READS USING THE C3 ALGORITHM

SOFIE THORSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)


Replica selection in Apache Cassandra

Reducing the tail latency for reads using the C3 algorithm

Val av replikor i Apache Cassandra

SOFIE THORSEN

[email protected]
Master's Thesis at CSC
Supervisor: Per Austrin

Examiner: Johan Håstad
Employer: Spotify

14-06-2015


Abstract

Keeping response times low is crucial in order to provide a good user experience. Especially the tail latency proves to be a challenge to keep low as size, complexity and overall use of services scale up. In this thesis we look at reducing the tail latency for reads in the Apache Cassandra database system by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann.

Through extensive benchmarks with several stress tools, we find that C3 indeed decreases the tail latencies of Cassandra on generated load. However, when evaluating C3 on production load, the results do not show any particular improvement. We argue that this is mostly due to the variable size records in the data set and token awareness in the production client. We also present a client-side implementation of C3 in the DataStax Java driver in an attempt to remove the caveat of token aware clients.

The client-side implementation did give positive results, but as the benchmark results showed a lot of variance we deem the results to be too inconclusive to confirm that the implementation works as intended. We conclude that the server-side C3 algorithm will work effectively for systems with homogeneous row sizes where the clients are not token aware.


Sammanfattning

Val av replikor i Apache Cassandra

In order to provide a good user experience, it is of utmost importance to keep response times low. The tail latency in particular is a challenge to keep low, as today's applications grow in size, complexity and usage. In this report we examine the tail latency for reads in the Apache Cassandra database system and whether it can be improved, by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann.

Through extensive tests with several different stress tools, we find that C3 indeed improves Cassandra's tail latencies on generated load. However, using C3 on production load did not show any major improvement. We argue that this is primarily due to variable record sizes in the data set and to the production client being token aware. We also present a client implementation of C3 in the Java driver from DataStax, in an attempt to address the problem of token aware clients.

The client implementation of C3 gave positive results, but as the test results showed large variance, we consider them too uncertain to confirm that the implementation works as intended. We conclude that C3, implemented on the server side, works effectively on systems with homogeneous data sizes where clients are not token aware.


Contents

Acknowledgements

1 Introduction
1.1 Problem statement

2 Background
2.1 Terminology and definitions
2.1.1 Load balancing and replica selection
2.1.2 Percentiles and tail latency
2.1.3 CAP theorem
2.1.4 Eventual consistency
2.1.5 SQL
2.1.6 NoSQL
2.1.7 Accrual failure detection
2.1.8 Exponentially weighted moving averages (EWMA)
2.1.9 RAID
2.1.10 Apache Cassandra
2.2 Load balancing techniques in distributed systems
2.2.1 The power of d choices
2.2.2 Join-Shortest-Queue
2.2.3 Join-Idle-Queue
2.2.4 Speculative retries
2.2.5 Tied requests
2.3 The C3 algorithm
2.3.1 Replica ranking
2.3.2 Rate control
2.3.3 Notes on the C3 implementation

3 Method
3.1 Tools for testing
3.1.1 The cassandra-stress tool
3.1.2 The Yahoo Cloud Serving Benchmark
3.1.3 The Java driver stress tool
3.1.4 Darkloading
3.2 Test environment setup
3.2.1 Testing on generated load
3.2.2 Testing on production load

4 Implementation
4.1 Implementing C3 in Cassandra 2.0.11
4.2 Implementing C3 in the DataStax Java driver 2.1.5
4.2.1 Naive implementation
4.3 Benchmarking with YCSB
4.4 Benchmarking with cassandra-stress
4.5 Benchmarking with the java-driver stress tool
4.6 Darkloading

5 Results
5.1 Benchmarking with YCSB
5.2 Benchmarking with cassandra-stress
5.3 Benchmarking with the java-driver stress tool
5.3.1 Performance of the C3 client
5.4 Darkloading
5.4.1 Performance with token awareness
5.4.2 Performance with round robin

6 Discussion
6.1 Performance of server side C3
6.1.1 YCSB vs. cassandra-stress
6.1.2 Darkloading
6.2 Performance of client side C3
6.3 Conclusion

A Results from benchmarks
A.1 YCSB
A.2 cassandra-stress
A.3 java-driver stress
A.4 Darkloading
A.4.1 Token aware
A.4.2 Round robin

Bibliography


Acknowledgements

I want to thank Lalith Suresh and Marco Canini for continuously discussing thoughts and sharing ideas throughout this project.

I also want to thank Jimmy Mårdell for his support and expertise with the quirkiness and caveats that Cassandra presents, as well as for volunteering to be my supervisor in the first place.


Chapter 1

Introduction

For all service-oriented applications, fast response times are vital for a good user experience. To examine the exact impact of server delays, Amazon and Google conducted experiments where they added extra delays on every query before sending back results to users [21]. One of their findings was that an extra delay of only 500 milliseconds per query resulted in a 1.2% loss of users and revenue, with the effect persisting even after the delay was removed.

However, keeping response times low is not an easy task. As Google reported [12], especially the tail latency is challenging to keep low as size, complexity and overall use of services scale up. When serving a single user request, multiple servers can be involved. Bad latency on a few machines then quickly results in higher overall latencies, and the more machines, the worse the tail latency. To illustrate why, consider a client making a request to a single server. Suppose that the server has an acceptable response time in 99% of the requests, but the last 1% of the requests takes a second or more to serve. This scenario is not too bad, as it only means that one client gets a slightly slower response every now and then.

Consider instead a hundred servers like this, and that a request requires a response from all servers. This will greatly change the responsiveness of the system. From 1% of the requests being slow, suddenly 63%¹ of the requests will take more than a second to serve.

¹ Assuming independence between response times, the probability that at least one response takes more than a second is $1 - 0.99^{100} \approx 0.63$.

It is then apparent that the tail latency must be taken seriously in order to provide a good service.

Apache Cassandra is the database of choice at Spotify for end user facing features. Spotify runs more than 80 Cassandra clusters on over 650 servers, managing important data such as playlists, music collections, account information, user/artist followers and more. Since an end user request often involves reading from several databases, poor tail latencies will affect the user experience negatively for a large number of users.

In this thesis a replica selection algorithm for use with Cassandra was implemented and evaluated, with focus on reducing the tail latency for reads.

1.1 Problem statement

The data in Cassandra is replicated to several nodes in the cluster to provide high availability. The performance of the nodes in the cluster varies over time though, for instance due to internal data maintenance operations and Java garbage collections. When data is read, a replica selection algorithm in Cassandra determines which node in the cluster the request should be sent to. The built-in replica selection algorithm provides good median latency, but the tail latency is often an order of magnitude worse than the median, which leads to the following question:

• Can the tail latency for reads in Cassandra be reduced in practice by using a more sophisticated replica selection algorithm?


Chapter 2

Background

2.1 Terminology and definitions

In this section we discuss concepts and technology necessary to follow the thesis. The reader familiar with the concepts can skip this section.

2.1.1 Load balancing and replica selection

Load balancing is the process of distributing workload across multiple computing resources, such as servers. Replica selection is a form of load balancing, as it tries to balance requests across the set of nodes that own the requested data.

2.1.2 Percentiles and tail latency

A percentile is a statistical measure that indicates the value below which a given percentage of observations in a group of observations fall. For example, the 95th percentile is the smallest value which is greater than or equal to 95% of the observations.
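As a concrete illustration (not from the thesis), the Java sketch below computes a percentile from raw latency samples using the nearest-rank method just described; production monitoring systems typically use histograms or streaming estimators instead:

import java.util.Arrays;

public class PercentileExample {
    // Nearest-rank percentile: the smallest sample that is greater
    // than or equal to the given fraction p of all observations.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 15, 11, 13, 950, 14, 16, 12, 13, 15};
        // A single 950 ms outlier barely moves the mean but sets the tail.
        System.out.println("p95 = " + percentile(latenciesMs, 0.95) + " ms");
    }
}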

In the context of latencies, percentiles are important measures when analyzing data. For example, if only using the mean and median in analysis, outliers can remain hidden. In contrast, the maximum value gives a pessimistic view, since it can be distorted by a single data point.

Consider the graph in Figure 2.1, showing latencies over time. If only presenting the mean and median latency, crucial information is lost. After 5 hours, the 99th percentile shows a spike that is not noticeable in the mean or median. After 8.5 hours the 99th percentile shows that 1% of the users are experiencing more than 800 ms latencies, while the mean is only 75 ms. The higher percentiles, commonly the 95-99th, are often referred to as the tail latency.


Figure 2.1: Latencies over time (latency in ms over 14 hours), showing the mean, median and 99th percentile.

2.1.3 CAP theorem

The CAP theorem, also known as Brewer’s theorem [14], states that for a distributed computer system it is impossible to simultaneously provide all three of the following:

• Consistency - all nodes see the same data at the same time.

• Availability - every request receives a response about whether it succeeded or failed.

• Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.

2.1.4 Eventual consistency

Eventual consistency is a consistency model used in distributed systems to achieve high availability. The consistency model informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

2.1.5 SQL

Structured Query Language (SQL) is a special-purpose programming language designed for managing data held in a traditional relational database management system (RDBMS). The data model in a relational database uses tables with rows and columns, with rows containing information about one specific entity and columns being the separate data points. For example, a row could represent a specific car, in which the columns are “Model”, “Color” and so on. The tables can have relationships between each other and the data is queried using SQL.

2.1.6 NoSQL

NoSQL¹ databases are an alternative to the tabular relations used in relational databases. The motivation for this approach includes simplicity of design, horizontal scaling and availability. The data structures used by NoSQL databases (e.g. column, document, key-value or graph) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The suitability of a particular database, regardless of it being relational or NoSQL, depends on the problem it must solve.

There are many different distributed NoSQL databases and their functionality can differ a lot depending on which two properties from the CAP theorem they support.

2.1.7 Accrual failure detection

In distributed systems, a failure detector is an application or a subsystem that is responsible for detecting slow or failing nodes. This mechanism is important to detect situations where the system would perform better by excluding the culprit node or putting it on probation. To decide if a node is subject for exclusion or probation, a suspicion level is used. For example, traditional failure detectors use boolean information as the suspicion level: a node is simply suspected or not suspected.

Accrual failure detectors are a class of failure detectors where the information is a value on a continuous scale rather than a boolean value. The higher the value, the higher the confidence that the monitored node has failed. If an actual crash occurs, the output of the accrual failure detector will accumulate over time and tend towards infinity (hence the name). This model provides more flexibility as the application itself can decide an appropriate suspicion threshold. Note that a low threshold means quick detection in the event of a real crash, but also increases the likelihood of incorrect suspicion. On the other hand, a high threshold makes fewer mistakes but makes the failure detector slower to detect failing nodes.

Hayashibara et al. describe an implementation of such an accrual failure detector in [17], called the φ accrual failure detector. In the φ failure detector the arrival times of heartbeats² are used to approximate the probabilistic distribution of future heartbeat messages. With this information, a value φ is calculated with a scale that changes dynamically to match recent network conditions.

¹ Interpreted as “Not only SQL”, to emphasize that they may also support SQL-like languages.
² A periodic signal generated by hardware/software for activation or synchronization purposes.
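As a rough illustration of the φ detector just described, the Java sketch below computes a φ-style suspicion value under the simplifying assumption that heartbeat inter-arrival times are exponentially distributed with the observed mean; the detector in [17] instead estimates the distribution from a window of recent arrival times:

class PhiSketch {
    // Toy φ calculation (illustrative, not the implementation from [17]).
    // Under the exponential model, the probability that the next heartbeat
    // arrives even later than the silence observed so far shrinks as the
    // silence grows, so phi accrues without bound during a real crash.
    static double phi(double msSinceLastHeartbeat, double meanIntervalMs) {
        double pLater = Math.exp(-msSinceLastHeartbeat / meanIntervalMs);
        return -Math.log10(pLater);
    }
}

An application would then suspect the node once phi exceeds its chosen threshold, trading detection speed against the risk of false suspicion as discussed above.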


2.1.8 Exponentially weighted moving averages (EWMA)

A moving average (also known as rolling average or running average) is a technique used to analyze trends in a data set by creating a series of averages of different subsets of the full data set. Given a sequence of numbers and a fixed subset size, the first element of the moving average sequence is obtained by taking the average of the initial fixed subset of the number sequence. Then the subset is modified by excluding the first number of the series and including the next number following the original subset in the series. This creates a new averaged subset of numbers. More mathematically formulated:

Given a sequence $\{a_i\}_{i=1}^{N}$, an n-moving average is a new sequence $\{s_i\}_{i=1}^{N-n+1}$ defined from the $a_i$ sequence by taking the mean of subsequences of n terms:

$$s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j$$

The sequences $S_n$ giving n-moving averages then are:

$$S_2 = \tfrac{1}{2}\left(a_1 + a_2,\; a_2 + a_3,\; \ldots,\; a_{n-1} + a_n\right)$$
$$S_3 = \tfrac{1}{3}\left(a_1 + a_2 + a_3,\; a_2 + a_3 + a_4,\; \ldots,\; a_{n-2} + a_{n-1} + a_n\right)$$

and so on. An example of different moving averages can be seen in Figure 2.2.

Figure 2.2: The 2- (red), 3- (green), and 4- (blue) moving averages for 20 data points.


An exponentially weighted moving average (EWMA), instead of using the average of a fixed subset of data points, applies weighting factors to the data points. The weighting for each older data point decreases exponentially, never reaching zero. The EWMA for a series $Y$ can be calculated as:

$$S_1 = Y_1$$
$$S_t = \alpha \cdot Y_t + (1 - \alpha) \cdot S_{t-1} \quad \text{for } t > 1$$

where $\alpha$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A higher value of $\alpha$ discounts older observations faster. $Y_t$ is the value at a time period t, and $S_t$ is the value of the EWMA at a time period t.
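Expressed in code, the recurrence is one line per sample. A minimal Java sketch (class and field names are illustrative):

class Ewma {
    private final double alpha; // smoothing factor, 0 < alpha < 1
    private double value;
    private boolean seeded = false;

    Ewma(double alpha) { this.alpha = alpha; }

    // S_1 = Y_1; thereafter S_t = alpha * Y_t + (1 - alpha) * S_{t-1}
    double update(double sample) {
        value = seeded ? alpha * sample + (1 - alpha) * value : sample;
        seeded = true;
        return value;
    }
}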

2.1.9 RAID

RAID³ is a virtualization technology for data storage which combines multiple disk drives into one logical unit.

Data is distributed across the drives in different ways called RAID levels, depending on the specific level of redundancy and performance wanted. The different schemes are named by the word RAID followed by a number (e.g. RAID 0, RAID 1). Each scheme provides a different balance between the key goals: reliability, availability, performance and capacity.

RAID 10, or RAID 1+0, is a scheme where throughput and latency are prioritized and is therefore the preferable RAID level for I/O intense applications such as databases.

³ Originally redundant array of inexpensive disks, now commonly redundant array of independent disks.

2.1.10 Apache Cassandra

Apache Cassandra, born at Facebook [18] and built on ideas from Amazon’s Dynamo [13] and Google’s BigTable [3], is an open source NoSQL distributed database system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

DataStax

DataStax is a computer software company whose business model centers around selling an enterprise distribution of the Cassandra project, which includes extensions to Cassandra, analytics and search functionality. DataStax also employs more than ninety percent of the Cassandra committers.



Replication

To ensure fault tolerance and reliability, Cassandra stores copies of data, called replicas, on multiple nodes. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is one copy of each row on one node. A factor of two means two copies of each row, where each copy is on a different node [7].

When a client read or write request is issued, it can go to any node in the cluster, since all nodes in Cassandra are peers. When a client connects to a node, that node serves as the coordinator for that particular client operation. What the coordinator then does is to act as a proxy between the client application and the nodes that own the requested data. The coordinator is responsible for determining which node should get the request based on the cluster configuration and replica placement strategy.

Partitioners and tokens

A partitioner in Cassandra determines how the data is distributed across the nodes in a cluster, including replicas. In essence, the partitioner is a hash function for deriving a token, representing a row, from its partition key⁴ [9].

The basic idea is that each node in the Cassandra cluster is assigned a token that determines what data in the cluster it is responsible for [2]. The tokens assigned to the nodes need to be distributed throughout the entire possible range of tokens.

As a simple example, consider a cluster with four nodes and a possible token range of 0-80. Then you would want the tokens for the nodes to be 0, 20, 40 and 60, making each node responsible for an equal portion of the data.

⁴ The partition key is the first column declared in the PRIMARY KEY definition. Each row of data is uniquely identified by the partition key.
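To make the idea concrete, the toy Java sketch below mimics the four-node example above; the ceiling lookup with wrap-around stands in for Cassandra's actual partitioning logic, and all names are hypothetical:

import java.util.Map;
import java.util.TreeMap;

public class TokenRingExample {
    // In this toy ring, a row's token is served by the first node token
    // at or above it, wrapping around to the lowest token at the end of
    // the range.
    static String ownerOf(long rowToken, TreeMap<Long, String> ring) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(rowToken);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "node1");
        ring.put(20L, "node2");
        ring.put(40L, "node3");
        ring.put(60L, "node4");
        System.out.println(ownerOf(35L, ring)); // node3
        System.out.println(ownerOf(75L, ring)); // wraps around to node1
    }
}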

Data consistency

As Cassandra sacrifices consistency for availability and partition tolerance, making it an AP system in the CAP theorem sense, replicas may not always be synchronized. Cassandra extends the concept of eventual consistency by offering tunable consistency, meaning that the client application can decide how consistent the requested data must be.

In the context of read requests, the consistency level specifies how many replicas must respond to a read request before data is returned to the client application. Examples of consistency levels can be seen in Table 2.1.



Level     Description
ALL       Returns the data after all replicas have responded. The read operation fails if a replica does not respond.
QUORUM    Returns the data once a quorum, i.e. a majority, of replicas has responded.
ONE       Returns the data from the closest replica.
TWO       Returns the data from the two closest replicas.

Table 2.1: Examples of read consistency levels.

To minimize the amount of data sent over the network when doing reads with a consistency level above ONE, Cassandra makes use of “digest requests”. A digest request is just like a regular read request except that instead of the node actually sending the data, it only returns a digest, i.e. a hash of the data.

The intent is to discover whether two or more nodes agree on what the current data is, without actually sending the data over the network, and therefore save bandwidth. Cassandra sends one data request to one replica and digest requests to the remaining replicas. Note that the digest queried nodes will still do all the work of fetching the data; they will just not return it.

Replica selection

In order for the coordinator node to route requests efficiently it makes use of a snitch. A snitch informs Cassandra about the network topology and determines which data centers and racks nodes belong to. This information allows Cassandra to distribute replicas according to the replication strategy [11] by grouping machines into data centers and racks.

In addition, all snitches also use a dynamic snitch layer that provides an adaptive behaviour when performing reads [24]. It uses an accrual failure detection mechanism, based on the φ failure detector discussed in section 2.1.7, to calculate a per node threshold that takes into account network performance, workload and historical latency conditions. This information is used to detect failing or slow nodes, but also for calculating the best host in terms of latency, i.e. selecting the best replica.

However, calculating the best host is expensive. If too much CPU time is spent on calculations it would become counterproductive, as it would sacrifice overall read throughput. The dynamic snitch therefore adopts two separate operations. One is receiving the updates, which is cheap, and the other is calculating scores for each host, which is more expensive.

In the update part, latencies of the hosts are sampled and weighted with EWMAs. The calculation part in turn iterates through the recorded latencies of each host to find the worst latency as a measure for the scoring. After finding the worst latency it makes a second pass over the hosts and scores them against the maximum value. This calculation has been configured to only run once every 100 ms to reduce the cost. As hosts cannot inform the system of their recovery once put on probation, all computed scores are also reset once every ten minutes.

Client drivers and token awareness

To enable communication between client applications and a Cassandra cluster, multiple client drivers for Cassandra exist. Cassandra supports two communication protocols, the legacy Thrift interface [22] and the newer native binary protocol that enables use of the Cassandra Query Language (CQL) [6], resembling SQL. Different drivers can therefore use different protocols.

Popular drivers include Astyanax, which uses the Thrift interface, and the Java driver from DataStax, which only supports CQL. As these drivers can get the token information from the nodes during initialization, they can be configured to be token aware. This means that the client driver can make a qualified choice about which nodes to issue requests to, based on the data requested.

2.2 Load balancing techniques in distributed systems

There exist numerous ideas and techniques to improve load balancing in distributed systems. The problem is often to decide on a good trade-off between exchanging a lot of communication between servers and clients, and making guesses and approximations on the traffic. Intuitively, more information makes it easier to make good decisions, but information passing can be costly. This section briefly discusses previous work, ideas and algorithms for load balancing techniques in distributed systems, not necessarily with focus on improving the tail latency.

2.2.1 The power of d choices

Consider a system with n requests and n servers to serve them. If each request is dispatched independently and uniformly at random to a server, then the maximum load, or the largest number of requests at any server, is approximately $\frac{\log n}{\log \log n}$. Suppose instead that each request gets placed sequentially onto the least loaded (in terms of number of requests enqueued on a server) of $d \geq 2$ servers chosen independently and uniformly at random. It has then been shown that with high probability⁵ the maximum load is instead only $\frac{\log \log n}{\log d} + C$, where C is a constant factor [1] [20]. This means that getting two choices instead of just one leads to an exponential improvement in the maximum load.

⁵ High probability here means at least $1 - \frac{1}{n}$, where n is the number of requests.
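The improvement is easy to observe in a toy experiment (not from the thesis): the Java sketch below places n requests on n servers, either uniformly at random (d = 1) or on the least loaded of d random choices, and reports the maximum load.

import java.util.Random;

public class PowerOfDChoices {
    static int maxLoad(int n, int d, Random rnd) {
        int[] load = new int[n];
        for (int i = 0; i < n; i++) {
            int best = rnd.nextInt(n);
            for (int k = 1; k < d; k++) {        // examine d - 1 further choices
                int candidate = rnd.nextInt(n);
                if (load[candidate] < load[best]) best = candidate;
            }
            load[best]++;                        // place on the least loaded choice
        }
        int max = 0;
        for (int l : load) max = Math.max(max, l);
        return max;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        System.out.println("d = 1: max load " + maxLoad(1_000_000, 1, rnd));
        System.out.println("d = 2: max load " + maxLoad(1_000_000, 2, rnd));
    }
}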


This result demonstrates the power of two choices, which is a commonly used property in load balancing strategies. When referring to this idea, the common way to denote it is SQ(d), meaning shortest-queue-of-d-choices.

2.2.2 Join-Shortest-Queue

The Join-Shortest-Queue (JSQ) algorithm is a popular routing policy used in processor sharing server clusters. In JSQ, an incoming request gets dispatched to the server with the least number of currently active requests. Ties are broken by choosing randomly between the two servers. JSQ therefore tries to load balance across servers by reducing the chance of one server having multiple jobs while another server has none. This is a greedy policy, since the incoming request prefers sharing a server with as few jobs as possible.
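In sketch form (illustrative Java), JSQ dispatch amounts to a single scan over the servers:

class JsqSketch {
    // Pick the server with the fewest active requests; on a tie, flip a
    // coin so that no server is systematically preferred.
    static int joinShortestQueue(int[] activeRequests, java.util.Random rnd) {
        int best = 0;
        for (int s = 1; s < activeRequests.length; s++) {
            if (activeRequests[s] < activeRequests[best]
                    || (activeRequests[s] == activeRequests[best] && rnd.nextBoolean())) {
                best = s;
            }
        }
        return best;
    }
}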

Figure 2.3 illustrates the algorithm, with the clients at the top, A-C being servers with their respective queues and pending jobs.

Figure 2.3: The join-shortest-queue algorithm. Clients prefer the server with the shortest queue.

An interesting result shown by Gupta et al. [15] is that the performance of JSQ on a processor sharing system shows near insensitivity to differences in the job size distribution. This is different from similar routing policies like Least-Work-Left (send the job to the host with the least total work) or Round-Robin, which are highly sensitive to the job size distribution.

JSQ is not optimal⁶, but was still shown to have great performance in comparison to algorithms with much higher complexity. A potential drawback with JSQ, though, is that as the system grows, the amount of communication over the network between dispatchers and servers could get overwhelming, given that each of the distributed dispatchers will need to obtain the number of jobs at every server before every job assignment.

⁶ In the optimal solution, each incoming job is assigned so as to minimize the mean response time for all jobs currently in the system, assuming there are no future arrivals.

2.2.3 Join-Idle-Queue

The Join-Idle-Queue (JIQ) algorithm, described in [19], tries to decouple the detection of lightly loaded servers from the job assignment. The idea is to have idle processors inform the dispatchers as they become idle, without interfering with job arrivals. This removes the load balancing work from request processing.

JIQ consists of two parts, the primary and the secondary load balancing problem, which communicate via a data structure called an I-queue. An I-queue is a list of processors that have reported themselves as idle. When a processor becomes idle it joins an I-queue based on a load balancing algorithm. Two load balancing algorithms for this purpose were considered in [19]: Random and SQ(d). With JIQ-Random an idle processor joins an I-queue uniformly at random, and with JIQ-SQ(d) an idle processor chooses d random I-queues and joins the one with the shortest queue length. If a client does not have any servers in its I-queue it will in turn make a choice based on the SQ(d) algorithm. Figure 2.4 illustrates the algorithm, again with the clients at the top with their respective I-queues, A-F being servers with their respective queues and pending jobs. It is worth noting that JIQ-Random has the additional advantage of one-way communication, without requiring messages from the I-queues.

Lu et al. showed three interesting results:

• JIQ-Random outperforms traditional SQ(2) with respect to mean response time.

• JIQ-SQ(2) achieves close to the minimum possible mean response time.

• Both JIQ-Random and JIQ-SQ(2) are near-insensitive to the job size distribution with processor sharing in a finite system.

Figure 2.4: The join-idle-queue algorithm. Servers join an I-queue based on the power of d choices algorithm. If a client does not have any servers in its I-queue it will in turn make a choice based on the power of d choices algorithm.

2.2.4 Speculative retries

Speculative retries, also denoted “eager retries” and “hedged requests” [12], is the process of sending requests to several servers and using the one that responds first. The client initially sends one request to the server that is believed to perform the best, but falls back on sending a secondary request after a delay. The client cancels remaining outstanding requests once a result is received.



Implementing speculative retries adds some overhead, but can still give latency-reduction effects while increasing load only modestly. A way to achieve this is by waiting to send a second request until the first one has been outstanding for more than the 95th or 99th percentile expected latency for that type of request. This limits the additional load to only a couple of percent (~1-5%) while substantially reducing the tail latency, since the pending request might otherwise be, for example, a several second timeout.

Speculative retries were implemented in Cassandra 2.0.2, with the default of sending the next request at the 99th percentile [10].
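A minimal Java sketch of the idea, where a backup read is submitted only when the primary has been outstanding longer than the observed 99th percentile latency (the surrounding plumbing is an assumption, not Cassandra's actual read path):

import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedRead {
    // Fire a backup request if the primary is still outstanding after
    // p99Millis; use whichever response arrives first. A production
    // implementation would also cancel the losing request.
    static <T> T read(Callable<T> primary, Callable<T> backup,
                      long p99Millis, ExecutorService pool) throws Exception {
        CompletionService<T> results = new ExecutorCompletionService<>(pool);
        results.submit(primary);
        Future<T> first = results.poll(p99Millis, TimeUnit.MILLISECONDS);
        if (first != null) return first.get();   // primary answered in time
        results.submit(backup);                  // hedge with a second request
        return results.take().get();             // first of the two to finish
    }
}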

2.2.5 Tied requests

Dean and Barroso [12] stated that instead of letting the client choose according to the SQ(d) algorithm, you should let the request be sent to multiple servers simultaneously while making sure that the servers are allowed to communicate updates on the status of the request with each other. These requests, where servers use status updates, are called “tied requests”. As soon as one server starts processing a request, it sends a cancellation message to the other servers (“ties”), which keeps the client out of the loop for the cancel logic. The corresponding requests, if still enqueued on the other servers, can then be aborted immediately or be deprioritized.


2.3 The C3 algorithm

The C3 algorithm, described in [23], is a replica selection algorithm for use with Cassandra. Suresh et al. argue that replica selection is an overlooked process which should be a cause for concern. They argue that putting mechanisms such as speculative retries on top of bad replica selection may increase system utilization for little benefit.

C3 tries to solve the problem by using two concepts. Firstly, it uses additional feedback from the server nodes in order for the clients to rank them and prefer faster ones. Secondly, the clients implement a rate control mechanism to prevent nodes from being overwhelmed. A note worth making is that a client in the C3 design is actually the coordinator node in Cassandra, so the entire algorithm is implemented server side. The current implementation is in Cassandra version 2.0.0.

2.3.1 Replica ranking

In the C3 replica ranking, the clients rank the server nodes using a scoring function, just like the dynamic snitch, with the score working as a measure of the latency to expect from the node in question. Clients prefer lower scores, which correspond to faster nodes, for each request.

Instead of only using the latency, the C3 scoring function tries to minimize the product of the job queue size⁷ $q_s$ and the service time $1/\mu_s$ (the time to fetch the requested rows) across every server s.

Along with each response to a client, the servers send back additional information about their queue sizes and service times. The queue size is recorded after a request has been served and when the response is about to be returned. To make a better forecast, the values are smoothed with EWMAs, denoting the new values $\bar{q}_s$ and $\bar{\mu}_s$. In addition to these values, the response time $R_s$ (i.e. the difference between the latency for the entire request and the service time) is also recorded and smoothed.

To account for other clients in the system as well as ongoing requests, each client also maintains, for each server s, an instantaneous count of its outstanding requests $os_s$ (requests for which a response is yet to be received). It is assumed that each client knows how many other clients there are in the system (n). The clients then make an estimate of the queue size of each server as:

$$\hat{q}_s = os_s \cdot n + \bar{q}_s + 1 \qquad (2.1)$$

⁷ The job queue size refers to the number of pending requests at a server.


where the $os_s \cdot n$ term is referred to as the concurrency compensation.

The idea behind the concurrency compensation is that clients will account for the scenario of multiple clients concurrently issuing requests to the same server. Clients with a higher value of $os_s$ will therefore give a higher estimate of the queue size at s and rank it lower than a client with fewer requests to s. As a result, clients with a higher demand will be more likely to rank s lower than clients with a lighter demand.

Using this estimation, clients compute the queue-size-to-service-rate ratio ($\hat{q}_s/\bar{\mu}_s$) of each server and rank them accordingly. However, a function linear in $\hat{q}_s$ is not sufficient, as it would demand a rather large increase in queue size in order for a client to switch back to a slower server again, which could result in accumulation of jobs at the faster nodes. Instead, C3 penalizes longer queue lengths by raising the $\hat{q}_s$ term to a higher power b: $(\hat{q}_s)^b/\bar{\mu}_s$. For higher values of b, clients are less greedy about preferring a server with a lower service time, as the $(\hat{q}_s)^b$ term will dominate the scoring function more strongly. In C3, b is set to 3, yielding a cubic function.

This results in the final scoring function:

$$\Psi_s = R_s + (\hat{q}_s)^3/\bar{\mu}_s \qquad (2.2)$$

where $R_s$ and $\bar{\mu}_s$ are the EWMAs of the response time and service rate, and $\hat{q}_s$ is the queue size estimate described in equation 2.1.
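Putting equations 2.1 and 2.2 together, the per-server score a client computes can be sketched as below (illustrative Java; the actual implementation lives in Cassandra's snitch code and differs in detail):

class C3ScoreSketch {
    // Score a server from the smoothed feedback described above; lower is better.
    static double score(double responseTimeEwma,  // R_s
                        double queueSizeEwma,     // q̄_s, reported by the server
                        double serviceRateEwma,   // µ̄_s, reported by the server
                        int outstandingRequests,  // os_s, tracked by this client
                        int clientCount) {        // n, assumed to be known
        double qHat = outstandingRequests * clientCount + queueSizeEwma + 1; // eq. 2.1
        return responseTimeEwma + Math.pow(qHat, 3) / serviceRateEwma;       // eq. 2.2
    }
}

Clients then sort the replicas by ascending score and send the request to the lowest-scoring one.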

2.3.2 Rate control

To prevent exceeding server capacity, clients incorporate a rate limiting mechanism inspired by the congestion control in the CUBIC TCP implementation [16]. This mechanism is decentralized, as clients do not inform each other of their demands on a server.

Every client uses a rate limiter for each server which limits the number of requests sent within a configured time window of δ ms. The limit is referred to as the sending rate (srate). By letting the clients track the number of responses received from a server in the δ ms interval (the receive rate, rrate), the rate limiter adapts and adjusts srate to match the rrate of the server.

When a client receives a response from a server s, the client compares the current srate and rrate for s. If srate is found to be lower than rrate, the client increases its rate according to a cubic function:


$$\mathit{srate} \leftarrow \gamma \cdot \left(\Delta T - \sqrt[3]{\beta \cdot R_0}\right)^3 + R_0 \qquad (2.3)$$

where ΔT is the elapsed time since the last rate decrease, and $R_0$ is the rate at the time of the last rate decrease. If the rrate is lower than the srate, the client instead decreases its srate multiplicatively by β, in C3 set to 0.2. The γ value represents a scaling factor and is used to set the desired duration of the saddle region. Additionally, a cap for the step size is set by a parameter $s_{max}$. The scaling factor in C3 is set to 100 milliseconds and the cap size is set to 10.
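The adjustment logic can be sketched as follows (illustrative Java; the bookkeeping around ΔT and R₀, the interpretation of the decrease factor, and the use of s_max as a cap on a single increase are assumptions based on the description above, not the exact C3 code):

class RateControlSketch {
    static final double BETA = 0.2;   // multiplicative decrease factor
    double gamma;                     // scaling factor for the saddle region
    double sMax;                      // assumed cap on a single increase step
    double srate;                     // current sending rate per delta-ms window
    double r0;                        // rate at the time of the last decrease
    double lastDecreaseMs;            // timestamp of the last decrease

    // Run once per response received from the server in question.
    void onResponse(double rrate, double nowMs) {
        if (srate < rrate) {
            double dt = nowMs - lastDecreaseMs;                                  // delta T
            double target = gamma * Math.pow(dt - Math.cbrt(BETA * r0), 3) + r0; // eq. 2.3
            srate = Math.min(target, srate + sMax);                              // capped step
        } else {
            r0 = srate;               // remember the perceived saturation point
            srate *= 1 - BETA;        // back off multiplicatively
            lastDecreaseMs = nowMs;
        }
    }
}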

To get a better understanding of the properties of the rate controlling function, consider Figure 2.5. The proposed benefit of using this function is mostly the configurable saddle region. While the sending rate is significantly lower than the saturation rate, the client will increase the rate aggressively (low rate region). When the sending rate is close to the perceived saturation point of the server, that is, $R_0$, the client stabilizes its sending rate and increases it conservatively (saddle region). Lastly, when the client has spent enough time in the stable region, it will again increase its rate aggressively, probing for more capacity (optimistic probing region).

Figure 2.5: Growth curve for the rate control function (rate in requests per δ ms against ΔT in ms), showing the low rate region, the saddle region around $R_0$, and the optimistic probing region.

2.3.3 Notes on the C3 implementation

Some notes are worth making regarding the C3 algorithm. Firstly, C3 will always route requests solely based on the replica scoring. This means that if the coordinator already has the requested data locally, it might route the request to a remote node, if that node has a better score than the coordinator itself.


Secondly, although all replicas get sorted, C3 will stop processing as soon as it has found the best replica that is not rate limited, and put it first in queue for request processing. This means that when using consistency level QUORUM, i.e. sending multiple requests, only the data request will be rate limited, leaving the digest requests unaffected by the rate limiting part of C3.


Chapter 3

Method

Evaluating performance is not a trivial task. While the focus in this thesis was on improving the tail latency, it was important not to achieve this by sacrificing the average case performance, i.e. the average latency of a request.

A good starting point was to implement C3 in Cassandra 2.0.11 (the version that Spotify uses), to try to verify whether the performance gains seen by the C3 authors in version 2.0.0 could also be seen in the newer version, despite the version gap.

3.1 Tools for testing

This section describes the tools used while implementing the algorithm and evaluating Cassandra performance. In the process of benchmarking, guidelines and advice from DataStax [8] were adhered to.

3.1.1 The cassandra-stress tool

The cassandra-stress tool is a stress testing utility for Cassandra clusters, written in Java, which is included in the Cassandra installation [5]. It has three modes of operation: inserting data, reading data and indexed range slicing. For the purpose of this thesis, the read mode is what was used for analysis.

During a run, the cassandra-stress tool reports information at a configurable interval. Example output can be seen below:

total, interval_op_rate, interval_key_rate, latency, 95th, 99.9th, elapsed_time
...
10578, 1057, 1057, 15.4, 36.4, 571.6, 10
26782, 1620, 1620, 10.5, 32.9, 475.8, 21
47495, 2071, 2071, 4.0, 29.4, 380.6, 32
71864, 2436, 2436, 2.5, 27.1, 378.1, 43
...


Here, each line reports data for the interval between the last elapsed time and the current elapsed time (the default is 10 seconds). The columns of particular interest are latency, 95th and 99.9th. The latency column describes the average latency in milliseconds for each operation during that interval. The 95th and 99.9th columns describe the percentiles, i.e. 95% and 99.9% of the time the latency was less than the number displayed.

The cassandra-stress tool is highly configurable; for example it is possible to specify the number of threads, read and write consistency and the size of the records.

3.1.2 The Yahoo Cloud Serving Benchmark

The Yahoo Cloud Serving Benchmark (YCSB) is a framework for benchmarking various cloud serving systems [4]. The YCSB client is a workload generator, and the core workloads included in the installation are a set of workload scenarios to be executed by the generator.

Just like the cassandra-stress tool, the YCSB client is highly configurable. For example it is possible to specify the number of threads, read and write consistency, size of the records and format of the output. Below is example output where the format is a time series:

...
[READ], 40, 27509.0
[READ], 50, 31255.0
[READ], 60, 12345.5
[READ], 70, 15203.66
[READ], 80, 20668.25
...

Here, each line reports the average read latency (in microseconds) at an interval of ten milliseconds.

3.1.3 The Java driver stress tool

The Java driver stress tool is a simple example application that uses the DataStax Java driver to stress test Cassandra, which as a result also stress tests the Java driver.

The example tool is by no means a complete stress application and supports only a very limited number of stress scenarios.

3.1.4 Darkloading

To test new versions of Cassandra, Spotify makes use of Darkloading. Darkloading is the process of duplicating the traffic of a certain system and replaying it on another system, to compare the performance.


This is done by snooping on the traffic to the original system and then making a duplicate request to another system.

3.2 Test environment setup

In the process of evaluating the performance of different Cassandra versions, the task was divided into two parts. The first was evaluating performance by using stress tools such as cassandra-stress, YCSB and the Java driver stress tool, which generate the workload and traffic themselves. The other part was evaluating performance on production workload and traffic, which was obtained with the Darkloading strategy.

Testing on dedicated hardware is preferable as it removes the uncertainty of skewed results due to resource sharing. Therefore, dedicated hardware was used in both cases. For the Cassandra cluster, machines suited for databases were provisioned, with 16 cores, 32 GB of RAM and spinning disks in a RAID 10 configuration. For the machines which send the traffic, dedicated service machines with 32 cores and 64 GB of RAM were used instead.

When the different benchmarks were conducted it was deemed interesting to test both consistency levels ONE and QUORUM.

Testing with speculative retries both enabled and disabled was tried, but as this did not yield any interesting results¹ it was omitted as a testing parameter.

¹ A slight improvement could be seen in the higher percentiles, but as this improved performance equally across different versions, it was deemed irrelevant.

3.2.1 Testing on generated load

When testing on generated workload there were two things in particular that were desirable to achieve. The first was that enough data was inserted to ensure that the entire dataset did not fit in memory. The other part was running the test long enough, since a cluster has very bad performance at the start of a run (due to the Java Virtual Machine warming up). Due to this, the first 15% of all recorded values were discarded, to only record values when the cluster performance had stabilized. The 15% breakpoint was not thoroughly analyzed, but was simply deemed appropriate when looking at the raw output from test runs.

3.2.2 Testing on production load

To try to make the comparison between different Cassandra versions as fair as possible, the same production traffic was used in each test run. The data was sampled from the real service and saved to file, making it possible to replay the same data multiple times.



As the traffic was replayed at a fixed rate (in production the rate varies over the day), it only made sense to compare test runs against each other and not against the real production cluster performance.


Chapter 4

Implementation

4.1 Implementing C3 in Cassandra 2.0.11

As Spotify uses Cassandra 2.0.11 (and above) for new applications, their development environment is also suited for those versions. Due to the fact that Cassandra 2.0.0 and Cassandra 2.0.11 are incompatible, C3 was instead implemented directly in Cassandra 2.0.11, making the comparisons and cluster setup easier in the Spotify environment.

The implementation did not need much additional reworking of the newer code¹, making the process simple.

¹ To make C3 work in 2.0.11, moving some method calls was sufficient.

4.2 Implementing C3 in the DataStax Java driver 2.1.5

As previously mentioned, the entire C3 algorithm is implemented server side. However, a client implementation may be preferable, as many newer Cassandra client drivers are token aware, meaning that the coordinator node will be able to serve the requested data directly. By implementing C3 in the client, we can send the request to the best replica in the first step, removing the need of going through the coordinator node just to rank the replicas.

With that in mind, the C3 algorithm was implemented in the DataStax Java driver. The Java driver was chosen since it is actively maintained, uses the newer communication protocol and also has good support for implementing new load balancing policies. There were some impediments along the way though. Firstly, the queue size and service time as recorded by the server could not be used, as this is an extension in the C3 server code. This means that the replica scoring only used metrics as seen by the clients, which might have had a significant impact on the performance.



Secondly, as the driver code is substantially different from the server code, the parameters set in the C3 server code might not have been suitable values for the client.

4.2.1 Naive implementation

To decide which hosts to send a request to, the driver makes use of a load balancing policy. For each request, the load balancing policy is responsible for returning an iterator containing the hosts to query.

This served as a suitable place to implement the replica scoring part of C3. Therefore a new policy called HostScoringPolicy was implemented, responsible for the logic of ranking hosts.

As mentioned earlier, the scoring function was simplified, as the metrics from the servers used in the original C3 version were not available. The metrics used in the client-side ranking are the latency for the entire request ($L_s$), the queue size ($q_s$), and the outstanding requests to a host ($os_s$), all as seen by the client. Just like the server implementation, EWMAs were used to smooth the values.

The client version of $\hat{q}_s$ is therefore defined just as before:

$$\hat{q}_s = os_s \cdot n + \bar{q}_s + 1 \qquad (4.1)$$

But with the difference that the queue size here is recorded from the client perspective and not by the server itself as in the original C3 implementation.

This results in the final client scoring function:

$$\Psi_s = L_s + (\hat{q}_s)^3 \cdot L_s \qquad (4.2)$$

Here we can notice the big difference that we do not have the service time metric, leaving us with the entire latency of the request as the only measure.
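In sketch form (illustrative Java, not the actual HostScoringPolicy code), the client-side score then becomes:

class ClientScoreSketch {
    // With no server-reported service time available, the smoothed
    // request latency L_s stands in for both R_s and 1/µ̄_s above.
    static double score(double latencyEwma,      // L_s, measured by the client
                        double queueSizeEwma,    // q̄_s, as seen by the client
                        int outstandingRequests, // os_s
                        int clientCount) {       // n
        double qHat = outstandingRequests * clientCount + queueSizeEwma + 1; // eq. 4.1
        return latencyEwma + Math.pow(qHat, 3) * latencyEwma;                // eq. 4.2
    }
}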

The rate limiting part of C3 was, however, easily plugged in, as the functionality is self contained and does not rely on external metrics.

4.3 Benchmarking with YCSB

To confirm that C3 performs as suggested, as well as to verify that the 2.0.11 implementation worked as intended, it was desirable to reproduce the results presented by Suresh et al. in [23]. To achieve this, the YCSB framework was used, just like in the original paper.

The test scenario with a read-heavy workload (95% reads, 5% writes) was chosen to be reproduced. In the original experiment 15 Cassandra nodes were used, with a replication factor of 3. 500 million records of 1 KB each were inserted across the nodes, yielding approximately 100 GB of data per machine. Since the test setup only had 8 Cassandra nodes, the record count was modified to be similar to the load in the original experiment. Therefore 250 million records of 1 KB each were inserted, yielding close to 100 GB of data per machine.

Just like in the original test scenario, three YCSB instances were used (running on separate machines), each with 40 threads, yielding a total of 120 load generators.

Then, for each Cassandra version and consistency level, two million rows were read five times, just like in the original test. The duration of a read run was about 30–60 minutes depending on consistency level.

4.4 Benchmarking with cassandra-stress

As the cassandra-stress tool already comes packaged with the Cassandra installation, C3 was also tested with this tool to gain further confidence in its performance.

The deployment again consisted of the 8 Cassandra nodes and one separate service machine running the cassandra-stress tool with the default of 50 threads. 250 million records of 1 KB each were inserted across the cluster.

Due to a design choice in the cassandra-stress tool2, 100 million rows were read. The duration of a read run was about 5–7 hours depending on consistency level.

4.5 Benchmarking with the java-driver stress tool

As creating a custom stress tool for the purpose of client evaluation is outside the scope of this thesis, the stress application that comes with the Java driver was used to evaluate the client implementation of C3.

After some small modifications to the source code of the stress application, it was possible to test the different load balancing policies with different consistency levels.

2 For example, inserting 100,000 rows will write rows with key values 000000–099999, meaning that if you try to read rows of a different magnitude, the keys will not match and the read will fail.


The deployment again consisted of the 8 Cassandra nodes and 6 service machines, each running 100 threads. 250 million records of 1 KB each were inserted across the cluster.

For each Cassandra version and consistency level, 100 million rows were read. The duration of a read run was about 5–7 hours depending on consistency level.

4.6 Darkloading

In order to benchmark the performance of C3 under production load, a cluster had to be duplicated. A suitable cluster was chosen on the recommendation of Jimmy Mårdell at Spotify. The chosen cluster consists of 8 Cassandra nodes with approximately 130 GB of data per node and 6 service machines sending traffic to the cluster. The read/write ratio of the incoming requests to the service is approximately 97% reads and 3% writes.

To send traffic to the test cluster, two versions of the service client were used. The first version was token aware and used consistency level QUORUM, just like the original service. In the other version the token awareness was replaced by plain round robin, and the consistency level was set to ONE, to match the settings that the original C3 was developed with. Because the service uses the Astyanax client and not the Java driver, it was unfortunately not possible to darkload the C3 client. Although Astyanax offers a beta version that uses the Java driver under the hood, it only does so for older versions of the Java driver.

For each setup, the sampled traffic was replayed at a configured rate, which resulted in a disk I/O utilization of around 50–60%, ensuring that the cluster received as much traffic as possible without choking the disks. Note, however, that even though the same traffic was replayed, writes altered the data in the cluster, potentially affecting some reads; given the low amount of writes, this was deemed negligible.

In the Darkloading setup an extension of C3 was also tried, where the coordinator node would always serve the data locally if possible (while using round robin in the client), but as this showed no difference in performance, that particular test was omitted.


Chapter 5

Results

Here we present the results from our different benchmarks. The standard deviation for each measure is marked in all charts. In some charts where the difference was small, we have omitted the average latencies, as the focus lies on improving the tail latency. All exact numbers, including averages, are available in Appendix A.

5.1 Benchmarking with YCSB

Here we present the results from the YCSB runs. The results are the averages of the combined values output by the three YCSB instances. In Figure 5.1 we have consistency level ONE to the left and QUORUM to the right.

[Figure: grouped bar charts of the mean, 95th, 99th and 99.9th percentile read latencies (ms) for Cassandra 2.0.11 vs. C3, with consistency level ONE on the left and QUORUM on the right.]

Figure 5.1: Benchmark of C3 with YCSB.


5.2 Benchmarking with cassandra-stress

Here we present the results from the cassandra-stress runs. The results are the averages from the single cassandra-stress instance. In Figure 5.2 we have the 95th and 99.9th percentile latencies, with consistency level ONE to the left and QUORUM to the right.

[Figure: bar charts of the 95th and 99.9th percentile read latencies (ms) for Cassandra 2.0.11 vs. C3, with consistency level ONE on the left and QUORUM on the right.]

Figure 5.2: Benchmark of C3 with cassandra-stress.


5.3 Benchmarking with the java-driver stress tool

Here we present the results from the java-driver stress runs. The baseline we compare against is the java-driver 2.1.5 with its default, token aware LoadBalancingPolicy.

5.3.1 Performance of the C3 client

In Figure 5.3 we have the mean, 95th and 99th percentile latencies, with consistency level ONE to the left and QUORUM to the right. For both the 2.1.5 and the C3 client, Cassandra 2.0.11 was running server side.

[Figure: bar charts of the mean, 95th and 99th percentile read latencies (ms) for java-driver 2.1.5 vs. the C3 client, with consistency level ONE on the left and QUORUM on the right.]

Figure 5.3: Benchmark of client C3 with server 2.0.11.


5.4 Darkloading

Here we present the results from the Darkloading runs. First we present the performance with token awareness in the client, followed by the performance with plain round robin.

5.4.1 Performance with token awareness

In Figure 5.4 we have the 95th, 98th, 99th and 99.9th percentile latencies.

[Figure: bar chart of the 95th, 98th, 99th and 99.9th percentile read latencies (ms) for 2.0.11 vs. C3 at consistency level QUORUM.]

Figure 5.4: Darkloading with token awareness.

5.4.2 Performance with round robin

In Figure 5.5 we have the 95th, 98th, 99th and 99.9th percentile latencies.


[Figure: bar chart of the 95th, 98th, 99th and 99.9th percentile read latencies (ms) for 2.0.11 vs. C3 at consistency level ONE.]

Figure 5.5: Darkloading with round robin.


Chapter 6

Discussion

6.1 Performance of server side C3

6.1.1 YCSB vs. cassandra-stress

The YCSB stress runs confirm the results from the original experiment: C3 is superior to the original dynamic snitch. Furthermore, we found that regardless of using consistency level ONE or QUORUM (in the original experiment only consistency level ONE was evaluated), C3 reduced both latency and variance across all percentiles.

In our cassandra-stress runs, results were again positive, but not at all with the same confidence as in the YCSB runs. Although it would have been reassuring to get more similar results between tools, we want to emphasize the differences between the setups. The cassandra-stress runs were read only, whereas the YCSB runs were read heavy. We had a different number of instances running, as well as a different thread count. We also do not have any control over the read patterns in cassandra-stress, which could also contribute to the differing results.

Additional YCSB runs with a setup similar to the cassandra-stress one would be desirable to see whether the difference between the results decreases, but due to time constraints we leave this to future work.

6.1.2 Darkloading

When evaluating C3 on production load, results were somewhat different from the stress tool results. In the case of the token aware client, we actually saw a small overhead (around 100 µs) in the average case. Not until the 98th percentile did we see an actual improvement, and it was only by a couple of ms, which is not strong enough to suggest an actual performance gain.


We believe that one reason for not seeing much improvement in this case is the fact that the client is token aware. The client will therefore already send the request to a node that has the data, meaning that C3 in some cases will not be able to improve the routing. Darkloading C3 with round robin in the client (and consistency level ONE) actually did improve the results, supporting this claim. Although the small 100 µs overhead in the average case remained, we could now see an improvement already at the 95th percentile, with the 99.9th percentile improving by about 20%. Even though we did see this improvement, there is still a big gap in performance gain compared to the stress tool results.

This could have several reasons. Firstly, when generating workload, all the records were of equal size (1 KB), meaning that all read requests are equally large. In the case of production load, some rows might contain more data than others due to the nature of the darkloaded service. This means that some reads will have higher latencies, not due to slow servers but due to how the data is structured. The result of this would be that C3 might rank fast servers as slow ones just because they happen to get heavier reads.

Another point worth making is that problems such as garbage collection pauses, where C3 really could improve the performance, commonly do not occur until the cluster has been running for a couple of weeks, which makes them a hard scenario to simulate within the scope of this thesis.

6.2 Performance of client side C3

Although it lacks the exact metrics of the server-side C3, the C3 client implementation did lower the tail latency. However, the benchmark showed a lot of variance, making the results inconclusive. Since the variance was present in both the default java-driver version and the C3 implementation, we deem this to be a fault in the benchmark setup and not in the implementation.

We suggest that repeated benchmarks, and perhaps tweaking the parameters, could give a more conclusive result. However, we are under the impression that C3 in the client could work well, and perhaps be a substitute for token aware clients.

6.3 Conclusion

Given the right conditions, the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We would recommend the current implementation for systems where row sizes are homogeneous, as variable size records are not taken into account in the scoring function.

However, we see no problem with extending the algorithm to take variable size rows into account. Given that one can obtain the size of the data requested, it should be possible to make a weighted scoring function, but this is outside the scope of this thesis.
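To make the suggestion concrete, one simple weighting would be to normalize each latency sample by the size of the response before it enters the EWMA, so that a heavy read does not unfairly penalize a fast server. The sketch below is purely illustrative; the bytesReturned parameter and the ms-per-KB normalization are assumptions, not something implemented or evaluated in this thesis.

/** Hypothetical size-weighted variant of the latency feedback (not implemented in this thesis). */
public class WeightedHostScore extends HostScore {

    public WeightedHostScore(int clientCount) { super(clientCount); }

    /** Normalize latency to milliseconds per KB before smoothing. */
    public void onResponse(double latencyMs, double observedQueueSize, long bytesReturned) {
        double kilobytes = Math.max(1.0, bytesReturned / 1024.0);
        super.onResponse(latencyMs / kilobytes, observedQueueSize);
    }
}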

We would also argue that C3 will be most effective if the client is not token aware. A client implementation of C3 could resolve this, but the results found in this thesis were too inconclusive to support this claim, and further testing is needed.


Appendix A

Results from benchmarks

A.1 YCSB

System                    2.0.11           C3
Average latency (ms)      11.59, σ=1.74    7.81, σ=0.97
95th percentile (ms)      21.22, σ=3.84    11.92, σ=1.81
99th percentile (ms)      30.28, σ=4.65    15.36, σ=2.10
99.9th percentile (ms)    54.85, σ=5.25    24.64, σ=2.32

Table A.1: YCSB read latencies with consistency level ONE.

System                    2.0.11           C3
Average latency (ms)      16.11, σ=2.38    12.19, σ=1.68
95th percentile (ms)      28.46, σ=5.03    20.03, σ=3.34
99th percentile (ms)      40.69, σ=6.08    27.01, σ=3.94
99.9th percentile (ms)    80.45, σ=6.97    44.99, σ=4.37

Table A.2: YCSB read latencies with consistency level QUORUM.


A.2 cassandra-stress

System                    2.0.11            C3
Average latency (ms)      3.93, σ=0.71      3.87, σ=0.71
95th percentile (ms)      27.12, σ=1.83     23.05, σ=1.53
99.9th percentile (ms)    113.57, σ=37.86   82.66, σ=23.89

Table A.3: cassandra-stress read latencies with consistency level ONE.

System                    2.0.11            C3
Average latency (ms)      8.44, σ=0.29      8.42, σ=0.32
95th percentile (ms)      34.54, σ=1.94     31.57, σ=2.30
99.9th percentile (ms)    131.58, σ=35.85   105.59, σ=32.18

Table A.4: cassandra-stress read latencies with consistency level QUORUM.

A.3 java-driver stress

System                    java-driver 2.1.5   client C3
Average latency (ms)      8.75, σ=1.11        10.40, σ=0.34
95th percentile (ms)      75.05, σ=28.99      33.04, σ=26.17
99th percentile (ms)      149.44, σ=44.38     67.76, σ=55.80

Table A.5: java-driver stress read latencies with consistency level ONE.

System                    java-driver 2.1.5   client C3
Average latency (ms)      14.95, σ=1.56       15.18, σ=1.18
95th percentile (ms)      104.42, σ=27.80     100.43, σ=37.88
99th percentile (ms)      169.03, σ=38.65     158.42, σ=42.77

Table A.6: java-driver stress read latencies with consistency level QUORUM.


A.4 Darkloading

A.4.1 Token aware

System                    2.0.11           C3
50th percentile (ms)      0.90, σ=0.01     1.01, σ=0.01
75th percentile (ms)      1.12, σ=0.04     1.27, σ=0.05
95th percentile (ms)      14.59, σ=3.93    15.41, σ=3.82
98th percentile (ms)      27.31, σ=5.48    26.91, σ=4.71
99th percentile (ms)      36.97, σ=6.26    35.11, σ=4.94
99.9th percentile (ms)    70.61, σ=10.11   63.65, σ=7.47

Table A.7: Darkloading read latencies with consistency level QUORUM.

A.4.2 Round robin

System                    2.0.11          C3
50th percentile (ms)      0.88, σ=0.01    0.98, σ=0.01
75th percentile (ms)      1.11, σ=0.03    1.20, σ=0.03
95th percentile (ms)      12.27, σ=3.03   10.69, σ=2.65
98th percentile (ms)      23.25, σ=4.04   20.16, σ=3.34
99th percentile (ms)      31.85, σ=4.69   26.80, σ=3.25
99.9th percentile (ms)    61.96, σ=8.68   48.60, σ=3.95

Table A.8: Darkloading read latencies with consistency level ONE.


Bibliography

[1] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM Journal on Computing, 29(1):180–200, 1999.

[2] Nick Bailey. Balancing your Cassandra cluster. http://www.datastax.com/dev/blog/balancing-your-cassandra-cluster. Accessed: 2015-06-12.

[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.

[4] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154. ACM, 2010.

[5] DataStax. The cassandra-stress tool. http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCStress_t.html. Accessed: 2015-03-16.

[6] DataStax. Coming in Cassandra 1.2: binary CQL protocol. http://www.datastax.com/dev/blog/binary-protocol. Accessed: 2015-05-24.

[7] DataStax. Data replication. http://www.datastax.com/documentation/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html. Accessed: 2015-01-27.

[8] DataStax. How not to benchmark Cassandra. http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra. Accessed: 2015-03-16.

[9] DataStax. Partitioners. http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architecturePartitionerAbout_c.html. Accessed: 2015-06-12.


[10] DataStax. Rapid read protection in Cassandra 2.0.2. http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2. Accessed: 2015-01-28.

[11] DataStax. Snitches. http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchesAbout_c.html. Accessed: 2015-01-23.

[12] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

[13] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.

[14] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

[15] Varun Gupta, Mor Harchol-Balter, Karl Sigman, and Ward Whitt. Analysis of join-the-shortest-queue routing for web server farms. Performance Evaluation, 64(9):1062–1081, 2007.

[16] Sangtae Ha, Injong Rhee, and Lisong Xu. CUBIC: a new TCP-friendly high-speed TCP variant. ACM SIGOPS Operating Systems Review, 42(5):64–74, 2008.

[17] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. The Φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, pages 66–78. IEEE, 2004.

[18] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[19] Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R. Larus, and Albert Greenberg. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.

[20] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.

[21] Eric Schurman and Jake Brutlag. The user and business impact of server delays, additional bytes, and HTTP chunking in web search. In Velocity Web Performance and Operations Conference, 2009.


[22] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable cross-language services implementation. Facebook White Paper, 5(8), 2007.

[23] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, 2015.

[24] Brandon Williams. Dynamic snitching in Cassandra: past, present, and future. http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future. Accessed: 2015-03-01.
