Causality for the Masses: Offering Fresh Data, Low Latency, and High Throughput
Luís Rodrigues, OPODIS 2017 (opodis2017.campus.ciencias.ulisboa.pt/presentations/...)


  • Causality for the Masses: Offering Fresh Data, Low Latency, and High Throughput

    Manuel Bravo, Luís Rodrigues, Chathuri Gunawardhana, Peter van Roy

  • Causal consistency

    Strongest consistency criterion that can be offered without compromising availability

    3

  • Causal consistency

    Key ingredient of several consistency criteria:

    Parallel Snapshot Isolation [SOSP’11]

    RedBlue Consistency [OSDI’12]

    Session guarantees [SOSP’97]

    Explicit Consistency [EuroSys’15]

    4

  • Causal consistency

    Operations may arrive in the “wrong” order

    5

  • 6

    Alice posts “Dan is in the hospital!” and, later, “Dan is ok!”

    Bob sees “Dan is in the hospital!” followed by “Dan is ok!”, and replies “That’s great!”

    Dan sees “Dan is in the hospital!” immediately followed by Bob’s “That’s great!”, because “Dan is ok!” has not yet arrived at his replica

  • Causal consistency

    Data centers should delay the visibility of inconsistent operations

    7

  • 8

    With causal consistency, Bob’s “That’s great!” is delayed at Dan’s replica until “Dan is ok!” has been delivered, so Dan sees “Dan is in the hospital!”, then “Dan is ok!”, and only then “That’s great!”

    This presentation is about keeping Dan happy

    Requires maintaining and exchanging metadata!

    Lots of metadata!

  • Causal consistency

    Problem: if causal dependencies are not accurately tracked, the metadata may generate false positives

    False dependencies!

    9

  • 10

    Alice

    Bob

    Dan

    (animated diagram)

  • 11

    Metadata

    more metadata … less metadata

    precise … false positives

    expensive … cheap

    Matrix/vector clocks … Lamport’s clocks

    Matrix/vector clocks: one vector per item, with one entry in each vector per DC

    Lamport’s clocks: one scalar (sketch below)
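To make the two ends of this spectrum concrete, here is a small sketch (illustrative only, not code from the talk): a per-datacenter vector clock can distinguish concurrency from causal dependency, while a single Lamport scalar cannot and therefore creates false dependencies.

```python
# Illustrative sketch: vector clocks vs. a single Lamport scalar.

def vc_before_or_equal(v1, v2):
    """True if the event tagged v1 causally precedes (or equals) the one tagged v2."""
    return all(v1[dc] <= v2[dc] for dc in v1)

# Two updates issued concurrently at different datacenters (3 DCs: 0, 1, 2).
u1 = {"vc": {0: 1, 1: 0, 2: 0}, "lamport": 1}   # written at DC0
u2 = {"vc": {0: 0, 1: 2, 2: 0}, "lamport": 2}   # written at DC1

# Vector clocks: neither dominates the other, so the updates are concurrent
# and a datacenter may make them visible independently.
concurrent = (not vc_before_or_equal(u1["vc"], u2["vc"])
              and not vc_before_or_equal(u2["vc"], u1["vc"]))
print("vector clocks: concurrent =", concurrent)            # True

# A single scalar: 1 < 2 is indistinguishable from a real dependency, so u2
# must wait for u1 -- a false dependency (cheap metadata, less precision).
print("scalar clock suggests u1 -> u2:", u1["lamport"] < u2["lamport"])  # True
```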

  • Problems of the previous state-of-the-art: throughput vs. data staleness tradeoff

    (plots: throughput penalty (%) and data staleness overhead (%) vs. number of datacenters, for GentleRain and Cure)

    GentleRain [SoCC’14]: optimizes throughput; compresses metadata into a scalar

    Cure [ICDCS’16]: optimizes data freshness; relies on a vector clock with an entry per data center

    Metadata size affects throughput

    False dependencies damage data freshness

    12

  • Problems of the previous state-of-the-art: throughput vs. data staleness tradeoff

    Visibility latencies have a direct impact on client response latency

    Partial replication aggravates the problem

    13

  • Our take on the problem

    Saturn + Eunomia

    Saturn: god in ancient Roman religion who became the god of time; orders events across datacenters

    Eunomia: Greek goddess of law and legislation; orders events in each datacenter

    If this couple cannot fix the problem, nobody can…

    14-18

  • 19

  • Distributed metadata service

    Pluggable to existing geo-distributed data services

    Handles the dissemination of operations among data centers

    Ensures that clients always observe a causally consistent state, with a negligible performance overhead when compared to an eventually consistent system

    20

  • Key features

    Requires a constant and small amount of metadata, regardless of the system’s scale (servers, partitions, and locations), to avoid impairing throughput (sketch below)

    Mitigates the impact of false dependencies by relying on a tree-based dissemination, to enhance data freshness

    Implements genuine partial replication (data centers only manage data and metadata of the items replicated locally), to take full advantage of partial replication

    21
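As a rough sketch of the first feature, the hypothetical label below carries only a scalar timestamp, the origin datacenter, and the identifier of the updated item; its size does not grow with the number of servers, partitions, or locations. The field names are illustrative, not Saturn's exact label format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    """Constant-size metadata attached to each update (illustrative sketch):
    unlike a vector or matrix clock, its size is independent of the number
    of servers, partitions, and datacenters."""
    timestamp: int   # scalar clock value used to order the update
    origin_dc: int   # datacenter where the update was issued
    item_id: str     # key of the updated item (relevant for partial replication)

print(Label(timestamp=2, origin_dc=1, item_id="a"))
```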

  • Decoupling data and metadata

    (diagram: data centers 1-4; data transfer vs. metadata transfer paths)

    Data centers only make remote updates visible when they have received both the metadata and its corresponding data (sketch below)

    22
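The rule above can be sketched as follows (names are hypothetical, and for brevity the sketch ignores the additional requirement of applying labels in the order served by the metadata service): a datacenter buffers whichever part arrives first and applies the remote update only once it holds both the label and the payload.

```python
class RemoteUpdateBuffer:
    """Sketch: a remote update becomes visible only after BOTH its label
    (metadata channel) and its payload (bulk data channel) have arrived."""

    def __init__(self, store):
        self.store = store           # local key-value store (a plain dict here)
        self.pending_data = {}       # update_id -> (key, value) waiting for a label
        self.pending_labels = set()  # update_ids whose label arrived before the data

    def on_data(self, update_id, key, value):
        self.pending_data[update_id] = (key, value)
        self._maybe_apply(update_id)

    def on_label(self, update_id):
        self.pending_labels.add(update_id)
        self._maybe_apply(update_id)

    def _maybe_apply(self, update_id):
        if update_id in self.pending_labels and update_id in self.pending_data:
            key, value = self.pending_data.pop(update_id)
            self.pending_labels.discard(update_id)
            self.store[key] = value  # the update becomes visible locally

store = {}
buf = RemoteUpdateBuffer(store)
buf.on_data("u1", "a", 1)   # payload arrives first: not visible yet
assert "a" not in store
buf.on_label("u1")          # label arrives: now visible
assert store["a"] == 1
```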

  • Example: write request

    (diagram: Client 1 and data centers 1, 2, 3, …, N)

    Client 1 issues put(a1) at its local data center; the data (a1) is propagated to the other data centers through the data channel, while the corresponding label is propagated separately through the labels channel (sketch below)
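A hedged sketch of this write path, with hypothetical names for the two channels: the origin datacenter applies the put locally, builds a label, and then ships the payload and the label separately.

```python
class DataChannel:                      # stand-in for the bulk data transfer service
    def send(self, update_id, key, value):
        print("data  ->", update_id, key, value)

class MetadataService:                  # stand-in for the label/metadata service
    def submit(self, update_id, timestamp, origin_dc, key):
        print("label ->", update_id, timestamp, origin_dc, key)

def handle_put(store, clock, data_channel, metadata_service, origin_dc, key, value):
    """Sketch of put(a1) at the origin datacenter: apply locally, then ship
    the data and the label over their separate channels."""
    clock["t"] += 1
    update_id = f"{origin_dc}-{clock['t']}"
    store[key] = value                                   # visible locally right away
    data_channel.send(update_id, key, value)             # payload to the other DCs
    metadata_service.submit(update_id, clock["t"], origin_dc, key)  # label
    return update_id

handle_put({}, {"t": 0}, DataChannel(), MetadataService(),
           origin_dc=1, key="a", value="a1")
```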

  • Metadata propagation

    (diagram: data centers 1-4; labels a (2), b (4), c (6))

    a is concurrent with both b and c; c causally depends on b

    Which serializations may be served to a data center? a b c? b a c? b c a?

    24

  • Metadata propagation

    (diagram: data centers 1-4; 1ms between dc1-dc2 and between dc3-dc4, 10ms across; labels a (2), b (4), c (6))

    a is concurrent with both b and c; c causally depends on b

    Data arrival times: at dc1/dc2, a -> 3, b -> 14, c -> 16; at dc3/dc4, b -> 5, c -> 7, a -> 12

    At dc3/dc4, the served serialization determines visibility: with a b c, everything is delayed until 12; with b a c, c is delayed until 12; with b c a, visibility matches data arrival (b -> 5, c -> 7, a -> 12) and nothing is delayed (sketch below)

    25
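To make the numbers above concrete, here is a small sketch that computes when each update can become visible at dc3/dc4 under a given serialization: an update cannot become visible before its own payload arrives, nor before every update served earlier in the serialization is visible.

```python
def visibility_times(serialization, data_arrival):
    """Visibility time of each update at a datacenter, given the order in which
    labels are served and the arrival time of each payload."""
    visible, latest = {}, 0
    for u in serialization:
        latest = max(latest, data_arrival[u])  # wait for its own data ...
        visible[u] = latest                    # ... and for everything served before it
    return visible

arrival = {"b": 5, "c": 7, "a": 12}            # data arrival times at dc3/dc4
for order in (["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]):
    print(order, visibility_times(order, arrival))
# ['a', 'b', 'c'] -> everything delayed until 12
# ['b', 'a', 'c'] -> c delayed until 12
# ['b', 'c', 'a'] -> b at 5, c at 7, a at 12: no artificial delay
```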

  • Metadata propagation

    Causal consistency is a partial order

    Saturn exploits this fact by serving possibly different linear extensions of the causal order to each data center

    Each served serialization aims at optimising data freshness, and thus reducing the impact of false dependencies

    (e.g., serving a, b, c to dc1/dc2 and b, c, a to dc3/dc4)

    26

  • Metadata propagation: architecture

    Saturn leverages a set of cooperating servers, namely serializers, forming a tree

    Causal consistency is trivially enforced by the tree, assuming FIFO links among serializers and that serializers propagate metadata preserving the observed order (sketch below)

    Serializers are geographically distributed to optimize data freshness

    27
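A minimal sketch of this architecture, under the stated assumptions (FIFO links, forwarding in observed order); class names are illustrative.

```python
class Serializer:
    """Sketch of a serializer node: forwards every label to its other tree
    neighbours in the exact order it observed them (FIFO per link)."""
    def __init__(self, name):
        self.name, self.links = name, []
    def receive(self, label, sender=None):
        for peer in self.links:
            if peer is not sender:          # do not echo back to the sender
                peer.receive(label, sender=self)

class DatacenterFrontend:
    """Leaf of the tree: records the serialization it is served."""
    def __init__(self, name):
        self.name, self.links, self.delivered = name, [], []
    def receive(self, label, sender=None):
        self.delivered.append(label)

def link(a, b):
    a.links.append(b)
    b.links.append(a)

# Tiny tree: dc1 -- S5 -- S6 -- dc4
s5, s6 = Serializer("S5"), Serializer("S6")
dc1, dc4 = DatacenterFrontend("dc1"), DatacenterFrontend("dc4")
link(dc1, s5); link(s5, s6); link(s6, dc4)

s5.receive("a", sender=dc1)   # label issued at dc1 enters the tree at S5
s6.receive("b", sender=dc4)   # label issued at dc4 enters the tree at S6
print(dc1.delivered, dc4.delivered)   # ['b'] ['a', 'b']
```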

  • Metadata dissemination graph

    (diagram: data centers 1-4 attached to serializers S1-S4; internal serializers S5 and S6 complete the Saturn tree)

  • Optimal dissemination graph

    Weighted Minimal Mismatch

    (excerpt from the SATURN paper shown on the slide:)

    5.4 Configuring SATURN’s Metadata Service. The quality of the serialization served by SATURN to each datacenter depends on how the service is configured. A SATURN configuration defines: (i) the number of serializers to use and where to place them; (ii) how these serializers are connected (among each other and with datacenters); and (iii) what delays (if any) a serializer should artificially add when propagating labels (in order to match the optimal visibility time). Let $\delta_{ij}$ denote the artificial delay added by serializer i when propagating metadata to serializer j.

    In practice, when deploying SATURN, one does not have complete freedom to select the geo-location of serializers. Instead, the list of potential locations for serializers is limited by the availability of suitable points-of-presence that results from business constraints. Therefore, the task of setting up a serializer network is based on:

    • The set V of datacenters that need to be connected (we denote by N = |V| the total number of datacenters).

    • The latencies of the bulk data transfer service among these datacenters; $lat_{ij}$ denotes the latency between datacenters i and j.

    • The set W of potential locations for placing serializers (M = |W|). Since each datacenter is a natural potential serializer location, M ≥ N. Let $d_{ij}$ denote the latency between two serializer locations i and j.

    However, given a limited set of potential locations to place serializers, it is unlikely (impossible in most cases) to match the optimal label propagation latency for every pair of datacenters. Therefore, the best one can aim for when setting up SATURN is to minimize the mismatch between the achievable label propagation latency and the optimal label propagation latency. More precisely, consider that the path for a given topology between two datacenters i and j, denoted $P^M_{i,j}$, is composed of a set of serializers $P^M_{i,j} = \{S_k, \ldots, S_o\}$, where $S_k$ connects to datacenter i and $S_o$ connects to datacenter j. The latency of this path, $\Lambda_M(i,j)$, is defined by the latencies (d) between adjacent nodes in the path, plus any artificial delays that may be added at each step, i.e.:

    $\Lambda_M(i,j) = \sum_{S_k \in P^M_{i,j} \setminus \{S_o\}} \left( d_{k,k+1} + \delta_{k,k+1} \right)$

    and the mismatch between the resulting latency and the optimal label propagation latency is given by:

    $mismatch_{i,j} = |\Lambda_M(i,j) - \Delta(i,j)|$

    Finally, one can observe that, in general, the distribution of client requests among items and datacenters may not be uniform, i.e., some items and some datacenters may be more accessed than others. As a result, a mismatch that affects the data visibility of a highly accessed item may have a more negative effect on the user experience than a mismatch on a seldom accessed item. Therefore, in the scenario where it is possible to collect statistics regarding which items and datacenters are more used, it is possible to assign a weight $c_{i,j}$ to each metadata path $P^M_{i,j}$ that reflects the relative importance of that path for the business goals of the application. Using these weights, we can now define precisely an optimization criterion that should be followed when setting up the serializer topology:

    DEFINITION 2 (Weighted Minimal Mismatch). The configuration that best approximates the optimal visibility time for data updates, considering the relative relevance of each type of update, is the one that minimizes the weighted global mismatch, defined as:

    $\min \sum_{\forall i,j \in V} c_{i,j} \cdot mismatch_{i,j}$

    5.5 Configuration Generator. The problem of finding a configuration that minimizes the Weighted Minimal Mismatch criterion, among all possible configurations that satisfy the constraints of the problem, is NP-hard (a reduction from the Steiner tree problem [35] can be used to prove this). Therefore, we have designed a heuristic that approximates the optimal solution using a constraint solver as a building block. We have modeled the minimization problem captured by Definition 2 as a constraint problem that, for a given tree, finds the optimal location of serializers (for a given set of possible location candidates) and the optimal (if any) propagation delays.

    The proposed algorithm, depicted in Alg. 3, works as follows. Iteratively, starting with a full binary tree with only two leaves (Alg. 3, line 3), it generates all possible isomorphism classes of full binary trees with N labeled leaves (i.e., datacenters). The algorithm adds one labeled leaf (datacenter) at each iteration until the number of leaves is equal to the total number of datacenters. For a given full binary tree T with f leaves, there exist 2f - 1 isomorphism classes of full binary trees with f + 1 leaves. One can obtain a new isomorphism class either by inserting a new internal node within an edge of T from which the new leaf hangs (Alg. 3, line 14), or by creating a new root from which the new leaf and T hang (Alg. 3, line 10). We could iterate until generating all possible trees of N leaves. Nevertheless, in order to avoid a combinatorial explosion (for nine datacenters there would already be 2,027,025 possible trees), the algorithm selects at each iteration the most promising trees and discards the rest. In order to rank the trees at each iteration, we use the constraint solver. Therefore, given a totally ordered list of ranked trees, if the difference between the rankings of two consecutive trees T1 and T2 is greater than a given threshold, T2 and all following trees are discarded (Alg. 3, line 18). At the last iteration, among all trees with N leaves, we pick the one that produces the smallest global mismatch from the optimal visibility times by relying on the constraint solver.

    Note that Algorithm 3 always returns a binary tree. Nevertheless, SATURN does not require the tree to be binary. One can easily fuse two serializers into one if both are directly connected, placed in the same location, and the artificial propagation delays between them are zero. Any of these…

    The goal is to build the tree such that the metadata-path latencies (through the tree) match the data-path latencies: the mismatch is the absolute difference between label-path and data-path latencies, and the configuration minimizes the mismatch of the busiest paths (sketch below)

    30
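The Weighted Minimal Mismatch objective above can be sketched in a few lines: for a candidate configuration, add up the link latencies and artificial delays along the serializer path between each pair of datacenters, compare the result with the optimal label propagation latency for that pair, and weight the difference by the relative importance of the pair. All numbers below are illustrative.

```python
def path_latency(path, d, delta):
    """Lambda_M(i, j): sum of link latencies plus artificial delays along the path."""
    return sum(d[(a, b)] + delta.get((a, b), 0.0) for a, b in zip(path, path[1:]))

def weighted_mismatch(pairs, paths, d, delta, optimal, weight):
    """Objective of Definition 2: sum over pairs of c_ij * |Lambda_M(i,j) - Delta(i,j)|."""
    return sum(weight[p] * abs(path_latency(paths[p], d, delta) - optimal[p])
               for p in pairs)

# Illustrative single pair: dc1 -> S5 -> S6 -> dc4
paths   = {(1, 4): ["dc1", "S5", "S6", "dc4"]}
d       = {("dc1", "S5"): 1.0, ("S5", "S6"): 10.0, ("S6", "dc4"): 1.0}
delta   = {}                   # no artificial delays in this configuration
optimal = {(1, 4): 10.0}       # optimal label propagation latency (ms)
weight  = {(1, 4): 1.0}        # relative importance (c_ij) of this path
print(weighted_mismatch([(1, 4)], paths, d, delta, optimal, weight))  # 2.0
```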

  • Metadata propagation: building the tree

    Finding the optimal tree is modelled as a constraint optimization problem (sketch below)

    Input:

    Data-path average latencies

    Candidate locations for serializers (and latencies among them)

    Access patterns: to minimize the impact of mismatches

    31
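As a toy sketch of the configuration generator described in the excerpt on the previous slide: full binary trees are grown one labeled leaf (datacenter) at a time, each tree with f leaves yielding 2f - 1 extensions, and only the most promising trees are kept at each step. The ranking function here is a placeholder for the constraint solver used in the real heuristic.

```python
# A full binary tree is either a leaf (a datacenter name) or a pair of subtrees.

def extensions(tree, new_leaf):
    """All 2f - 1 ways of adding one labeled leaf to a full binary tree with f
    leaves: hang tree and new_leaf under a new internal node (the edge above
    `tree`), or recurse into one of the two subtrees."""
    yield (tree, new_leaf)
    if not isinstance(tree, str):
        left, right = tree
        for l in extensions(left, new_leaf):
            yield (l, right)
        for r in extensions(right, new_leaf):
            yield (left, r)

def grow(datacenters, keep=4, rank=lambda t: 0):
    """Toy heuristic: add one datacenter per iteration, rank the candidate
    trees (placeholder rank; the real system asks a constraint solver), and
    keep only the `keep` most promising ones."""
    trees = [datacenters[0]]
    for dc in datacenters[1:]:
        candidates = [ext for t in trees for ext in extensions(t, dc)]
        trees = sorted(candidates, key=rank)[:keep]
    return trees

for t in grow(["dc1", "dc2", "dc3", "dc4"]):
    print(t)
```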

  • Example

    Four datacenters (1, 2, 3, 4): 1 ms between dc1 and dc2, 1 ms between dc3 and dc4, and 10 ms between the two groups

    Serializers S1, S2, S3, S4 are attached to the datacenters; internal serializers S5 and S6 connect them, with S5 placed at dc1 (or dc2) and S6 at dc4 (or dc3)

    Three updates: a is concurrent to both b and c; c causally depends on b

    Figure (animation): the timeline advances (Time: 0, 2, 3, 4, 5, 6, 7, ..., 12, ..., 14, ..., 16) as the labels travel through the serializer tree; the annotations b -> 5, c -> 7, a -> 12 mark when each label is delivered

    32-35
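    Continuing the example, the snippet below computes metadata-path latencies through that serializer tree. The per-link latencies (S5 co-located with dc1, S6 with dc4, 10 ms between S5 and S6) are assumptions chosen to match the figure, not values taken from the paper.

```python
# Serializer tree of the example: S1..S4 attach dc1..dc4; S5 and S6 are internal.
# parent[node] = (parent_node, link latency in ms); S6 is the root.
parent = {"S1": ("S5", 0), "S2": ("S5", 1),
          "S3": ("S6", 1), "S4": ("S6", 0),
          "S5": ("S6", 10)}

def metadata_latency(i, j):
    """Sum of the link latencies on the tree path between serializers i and j."""
    up, node, acc = {i: 0}, i, 0
    while node in parent:
        node, hop = parent[node]
        acc += hop
        up[node] = acc
    node, acc = j, 0
    while node not in up:
        node, hop = parent[node]
        acc += hop
    return acc + up[node]

print(metadata_latency("S1", "S2"))   # 1  -- dc1 to dc2, matching the 1 ms data path
print(metadata_latency("S2", "S3"))   # 12 -- dc2 to dc3, dominated by the 10 ms link
```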


  • …put(a1)

    Figure: Client 1 issues put(a1); data and labels flow to data centers 1, 2, 3, ..., N

    Saturn requires each datacenter to output a SERIAL sequence of updates

  • 38

    each datacenter has many machines

    Eunomia

    how to serialize updates performed on different machines without losing performance?

    usually one relies on sequencers that limit concurrency

  • conceived to replace sequencers used to serialize updates in replicated storage systems

    39

    totally orders local updates, consistently with causality, before shipping them to Saturn and the other DCs (a sketch of this ordering follows the animation below)

    the ordering is done in the background, out of the client's critical path

    Eunomia

  • 40

    Figure: machines A, B, and C inside a datacenter, coordinated by a Sequencer

  • 41

    Figure: machines A, B, and C inside a datacenter, coordinated by Eunomia

  • 42-50

    Figure (animation): put(a), put(b), put(c), put(d), put(e), and put(f) arrive at machines A, B, and C, which tag them with local timestamps (1, 1, 2, 1, 2, 3); Eunomia gathers the timestamped updates in the background and outputs them as a single sequence ordered by timestamp
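    The animation above can be mimicked with a small sketch of an Eunomia-style background serializer: machines timestamp their updates locally and periodically report their clocks, and the serializer releases updates in timestamp order once no smaller timestamp can still arrive. This is a simplified illustration (the mapping of puts to machines and the API names are assumptions), not the prototype's actual protocol.

```python
import heapq

class BackgroundSerializer:
    """Sketch of an Eunomia-style ordering service: machines timestamp updates
    with local, monotonically increasing clocks; the serializer emits updates
    in timestamp order once every machine's reported clock has passed them."""

    def __init__(self, machines):
        self.clock = {m: 0 for m in machines}   # latest timestamp heard from each machine
        self.pending = []                       # min-heap of (timestamp, machine, update)

    def on_update(self, machine, timestamp, update):
        heapq.heappush(self.pending, (timestamp, machine, update))
        self.clock[machine] = max(self.clock[machine], timestamp)

    def on_heartbeat(self, machine, timestamp):
        # idle machines still advance the serializer's view of their clock
        self.clock[machine] = max(self.clock[machine], timestamp)

    def stable_updates(self):
        """Pop every update whose timestamp is <= the minimum reported clock:
        no machine can still produce a smaller timestamp, so these updates can
        be shipped, already serialized, to Saturn and the other datacenters."""
        horizon = min(self.clock.values())
        out = []
        while self.pending and self.pending[0][0] <= horizon:
            out.append(heapq.heappop(self.pending))
        return out

# Timestamps loosely follow the animation; which machine runs which put is assumed.
e = BackgroundSerializer(["A", "B", "C"])
e.on_update("A", 1, "put(a)"); e.on_update("C", 1, "put(b)")
e.on_update("A", 2, "put(c)"); e.on_update("B", 1, "put(d)")
e.on_update("C", 2, "put(e)"); e.on_update("B", 3, "put(f)")
print(e.stable_updates())   # everything up to timestamp 2, in timestamp order;
                            # put(f) waits until A and C report clocks >= 3
```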

  • 51


    Evaluation

  • Evaluation: goals

    • How important is the configuration of Saturn?

    • Can Saturn+Eunomia optimize both throughput and data freshness simultaneously?

    • How does Eunomia compare to sequencers?

    • Why does Saturn's genuine partial replication matter?

    52-53

  • Saturn vs state-of-the-art

    Partitioning among 7 data centers 
Replication factor min. 2 max. [2,5] 
[J. M. Pujol et al. The little engine(s) that could: Scaling online social networks. SIGCOMM ’10]

    Workload based on real social network workloads 
[F. Benevenuto et al. Characterizing user behavior in online social networks. IMC '09]

    Public Facebook dataset 
[B. Viswanath et al. On the evolution of user interaction in Facebook. WOSN ’09]

    Figure: the seven EC2 regions used for the deployment (including N. Virginia, Oregon, Ireland, Frankfurt, and Tokyo)

    54

  • Saturn vs state-of-the-art
    Throughput

    Figure: throughput (ops/sec) of Eventual, Saturn, GentleRain, and Cure as a function of the maximum replication degree (5, 4, 3, 2)

    Saturn exhibits a throughput very close to eventual consistency

    Cure significantly penalizes throughput due to the metadata size

    55

  • Saturn vs state-of-the-art
    Data freshness: average

    Figure: average remote visibility latency (ms) for Eventual, Saturn, GentleRain, and Cure

    Saturn remains very close to eventual, and still better than Cure, which uses vector clocks

    56

  • Evaluation: goals

    • How important is the configuration of Saturn?

    • Can Saturn+Eunomia optimize both throughput and data freshness simultaneously?

    • How does Eunomia compare to sequencers?

    • Why does Saturn's genuine partial replication matter?

    57

  • Saturn
Genuine partial replication matters

    58

    N. Virginia Ireland Sydney

    • Sydney receives updates from Ireland
    • Ireland updates depend on N. Virginia updates
    • Sydney does not replicate all N. Virginia data

  • Saturn
    Genuine partial replication matters

    59

    Figure: CDF of remote update visibility (milliseconds) for Saturn and Vector; latency of updates from Ireland when applied at Sydney

    Updates from Ireland are stalled until metadata is received from N. Virginia

  • Evaluation: goals

    • How important is the configuration of Saturn?

    • Can Saturn+Eunomia optimize both throughput and data freshness simultaneously?

    • How does Eunomia compare to sequencers?

    • Why does Saturn's genuine partial replication matter?

    60

  • Saturn configuration matters


         NC      O       I       F       T        S
    NV   37 ms   49 ms   41 ms   45 ms   73 ms    115 ms
    NC   -       10 ms   74 ms   84 ms   52 ms    79 ms
    O    -       -       69 ms   79 ms   45 ms    81 ms
    I    -       -       -       10 ms   107 ms   154 ms
    F    -       -       -       -       118 ms   161 ms
    T    -       -       -       -       -        52 ms

    Table 1: Average latencies (half RTT) among Amazon EC2 regions: N. Virginia (NV), N. California (NC), Oregon (O), Ireland (I), Frankfurt (F), Tokyo (T), and Sydney (S)

    In order to compare SATURN with other solutions from the state-of-the-art (data services), we attached SATURN to an eventually consistent geo-replicated storage system we have built. Throughout the evaluation, we use this eventually consistent data service as the baseline, as it adds no overheads due to consistency management, to better understand the overheads introduced by SATURN. Note that this baseline represents a throughput upper-bound and a latency lower-bound. Thus, when we refer to the optimal visibility latency throughout the experiments, we are referring to the latencies provided by the eventually consistent system.

    Implementation. Our SATURN prototype implements all functionality described in the paper. It has been built using the Erlang/OTP programming language. In our prototype, gears rely on physical clocks to generate monotonically increasing timestamps. To balance the load among frontends at each datacenter, we use Riak Core [14], an open source distribution platform. The optimization problem (Definition 2) used to configure SATURN is modeled using OscaR [44], a Scala toolkit for solving Operations Research problems.

    Setup. We use Amazon EC2 m4.large instances running Ubuntu 12.04 in our experiments. Each instance has two virtual CPU cores and 8 GB of memory. We use seven different regions in our experiments. Table 1 lists the average latencies we measured among regions. Our experiments simulate one datacenter per region. Clients are co-located with their preferred datacenter in separate machines. Each client machine runs its own instance of a custom version of Basho Bench [13], a load-generator and benchmarking tool. Each client eagerly sends requests to its preferred datacenter with zero thinking time. We deploy as many clients as necessary in order to reach the system's maximum capacity, without overloading it. Each experiment runs for more than 5 minutes. In our results, the first and the last minute of each experiment are ignored to avoid experimental artifacts. We measure the visibility latencies of remote update operations by storing the physical time at the origin datacenter when the update is applied locally, and subtracting it from the physical time at the destination datacenter when the update becomes visible. To reduce the errors due to clock skew, physical clocks are synchronized using the NTP protocol [43] before each experiment, making the remaining clock skew negligible in comparison to inter-datacenter travel time.
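    The visibility-latency measurement described above boils down to a timestamp that travels with each update; a minimal sketch, with hypothetical hook names rather than the prototype's API, could look like this:

```python
import time

def on_applied_at_origin(update):
    """Origin datacenter: record the local wall-clock time when the update is
    applied locally; the timestamp is carried along with the replicated update."""
    update["applied_at"] = time.time()

def on_visible_at_destination(update):
    """Destination datacenter: clocks are NTP-synchronized before each run, so
    the difference approximates the remote visibility latency, in seconds."""
    return time.time() - update["applied_at"]
```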

    Figure 4: CDF of remote update visibility (milliseconds) for the P-conf, M-conf, and S-conf configurations. Left: Ireland to Frankfurt (10ms); Right: Tokyo to Sydney (52ms)

    7.1 SATURN Configuration Matters

    We compare different configurations of SATURN to better understand their impact on the system performance. We compare three alternative configurations: (i) a single-serializer configuration (S), (ii) a multi-serializer configuration (M), and (iii) a peer-to-peer version of SATURN that relies on the conservative label's timestamp order to apply remote operations (P). We focus on the visibility latencies provided by the different implementations.

    For the S-configuration, we placed the serializer in Ireland. For the M-configuration, we build the configuration tree by relying on Algorithm 5.4. We run an experiment with a read-dominant workload (90% reads). Figure 4 shows the cumulative distribution of the latency before updates originating in Ireland become visible in Frankfurt (left plot) and before updates originating in Tokyo become visible in Sydney (right plot). Results show that both the S and M configurations provide comparable results for updates being replicated in Frankfurt. This is because we placed the serializer of the S-configuration in Ireland, and therefore, the propagation of labels is done efficiently among these two regions. Unsurprisingly, when measuring visibility latencies before updates originating in Tokyo become visible in Sydney, the S-configuration performs poorly because labels have to travel from Tokyo to Ireland and then from Ireland to Sydney. Plus, results show that the P-configuration, which relies on the label's timestamp order, is not able to provide low visibility latencies in these settings. This is expected as, when tracking causality with a single scalar, latencies tend to match the longest network travel time (161ms in this case) due to false dependencies. In turn, the M-configuration is able to provide significantly lower visibility latencies to all locations (deviating only 8.2ms from the optimal on average).

    7.2 Impact of Latency Variability on SATURN

    The goal of this section is to better understand how changes in the link latency affect SATURN's performance. We have just seen that the correct configuration of the serializers' tree has an impact on performance. Therefore, if changes in the link latencies are large enough to make the current configuration no longer suitable, and these changes are permanent, a reconfiguration of SATURN should be triggered. In practice, transient changes in the link latencies are unlikely to justify a reconfiguration; therefore, we expect their effect on performance to be small.


  • Saturn configuration matters


    Figure 5: Dynamic workload throughput experiments. Throughput (ops/sec) of Eventual, Saturn, GentleRain, and Cure while varying (a) value size in bytes (8 to 2048), (b) read:write ratio (50:50 to 99:1), (c) correlation distribution (exponential, proportional, uniform, full), and (d) percentage of remote reads (0% to 40%).

    Figure 6: Impact of latency variability on remote update visibility in SATURN. Extra visibility latency (ms) as a function of the injected delay (0 to 125 ms) for configurations T1 and T2.


    To validate this assumption, we set up a simple experiment with three datacenters, each located in a different EC2 region: N. California, Oregon, and Ireland. For the experiment, we artificially inject extra latency between the N. California and Oregon datacenters (the average measured latency is 10ms). From our experience, we expect the latency among EC2 regions to deviate from its average only slightly and transiently. Nevertheless, to fully understand the consequences of latency variability, we also experimented with unrealistically large deviations (up to 125ms).

    Figure 6 shows the extra remote visibility latency that two different configurations of SATURN add on average when compared to an eventually consistent storage system which makes no attempt to enforce causality. Both configurations, T1 and T2, use a single serializer: configuration T1 places the serializer in Oregon, representing the optimal configuration under normal conditions, while configuration T2 places the serializer in Ireland.

    As expected, under normal conditions, T1 performs significantly better than T2, confirming the importance of choosing the right configuration. As we add extra latency, T1 degrades its performance, but only slightly. One can observe that, in fact, slight deviations in the average latency have no significant impact on SATURN: even with an extra delay of 25ms (more than twice the average delay), T1 only adds 14ms of extra visibility latency on average. Interestingly, it is only with more than 55ms of injected latency that T2 becomes the optimal configuration, exhibiting lower remote visibility latency than T1. Observing a long and sustained increase of 55ms of delay on a link that averages 10ms is highly unlikely. Indeed, this scenario has the same effect as migrating the datacenter from N. California to São Paulo. Plus, if such a large deviation becomes the norm, system operators can always rely on SATURN's reconfiguration mechanism to change SATURN's configuration.

    7.3 SATURN vs. the State-of-the-art

    We compare the performance of SATURN against eventual consistency and against the most performant causally consistent storage systems in the state-of-the-art.

    7.3.1 GentleRain and Cure

    We consider GentleRain [26] and Cure [3] the current state-of-the-art. We have also experimented with solutions based on explicit dependency checking such as COPS [39] and Eiger [40]. Nevertheless, we concluded that approaches based on explicit dependency checking are not practical under partial geo-replication. Their practicability depends on the capability of pruning the client's list of dependencies after update operations due to the transitivity rule of causality [39]. Under partial geo-replication, this is not possible, causing the client's list of dependencies to potentially grow up to the entire database.

    At their core, both GentleRain and Cure implement causal consistency very similarly: they rely on a background stabilization mechanism that requires all partitions in the system to periodically exchange metadata. This equips each partition with sufficient information to locally decide when remote updates can be safely made visible to local clients, with no violation of causality. In our experiments, GentleRain and Cure's stabilization mechanisms run every 5ms, following the authors' specifications. The interested reader can find more details in the original papers [3, 26]. We recall that SATURN does not require such a mechanism, as the order in which labels are delivered to each datacenter already determines the order in which remote updates have to be applied.
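    For intuition, the scalar stabilization rule used by GentleRain-style systems can be sketched as follows; the vector variant used by Cure keeps one entry per datacenter instead. Names and data structures are illustrative, not taken from either system's code.

```python
def global_stable_time(reported_clocks):
    """Each local partition periodically reports the latest timestamp it has
    seen; the Global Stable Time (GST) is the minimum over all partitions."""
    return min(reported_clocks.values())

def newly_visible(pending_remote_updates, reported_clocks):
    """A remote update can be made visible once its timestamp is <= GST:
    everything it may causally depend on has already been received locally."""
    gst = global_stable_time(reported_clocks)
    return [u for u in pending_remote_updates if u["ts"] <= gst]

# Cure replaces the scalar with a vector (one entry per datacenter) and the
# check becomes elementwise: all(u["vts"][dc] <= gsv[dc] for dc in gsv),
# trading larger metadata for more precise causality tracking.
```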

    The main difference between GentleRain and Cure resides in the way causal consistency is tracked. While GentleRain summarizes causal dependencies in a single scalar, Cure uses a vector clock with an entry per datacenter. This enables Cure to track causality more precisely, lowering remote visibility latency, but the metadata management increases the computation and storage overhead, harming throughput. Concretely, by relying on a vector, Cure's remote update visibility latency lower-bound is determined by the latency between the originator of the update and the remote datacenter. Differently, in GentleRain, the lower-bound is


  • Evaluation: goals

    • How important is the configuration of Saturn?

    • Can Saturn+Eunomia optimize both throughput and data freshness simultaneously?

    • How does Eunomia compare to sequencers?

    • Why does Saturn's genuine partial replication matter?

    63

  • Evaluation

    Figure: maximum throughput (Kops/sec) achievable by Eunomia (Eunomia 15, 30, 45, 60, 75) vs. a classical sequencer

    Eunomia achieves up to 7.7x the throughput of the sequencer

    64

  • Other features…

    • On-line reconfiguration protocols
    • Efficient migration of clients among data centers
    • Support for fault-tolerance
    • Impact of stragglers
    • More experiments

    65

  • Check the papers for more details.

    66

    EuroSys 2017    ATC 2017

  • Take-away messages

    • It is possible to preserve concurrency inside a DC, and to capture causality with a single scalar, by using an ordering service that is out of the client's critical path

    • These scalar clocks can be propagated using tree-based dissemination, which can effectively mitigate the impact of false dependencies

    • Combined, they allow the optimisation of both throughput and data freshness

    67
