Causality for the Masses: Offering Fresh Data, Low Latency, and High Throughput
Manuel Bravo, Luís Rodrigues, Chathuri Gunawardhana, Peter van Roy
Causal consistency
The strongest consistency model that does not compromise availability.
Causal consistency
Key ingredient of several consistency criteria:
Parallel Snapshot Isolation [SOSP'11]
RedBlue Consistency [OSDI'12]
Session guarantees [SOSP'97]
Explicit Consistency [EuroSys'15]
Causal consistency
Operations may arrive in the "wrong" order.
Example (without causal consistency):
Alice posts "Dan is in the hospital!" and later "Dan is ok!"
Bob sees both posts, in order, and replies "That's great!"
At Dan's data center, "That's great!" arrives before "Dan is ok!", so Dan reads "Dan is in the hospital!" immediately followed by "That's great!"
Causal consistency
Data centers should delay the visibility of causally inconsistent operations.
Example (with causal consistency):
Dan's data center delays the visibility of "That's great!" until "Dan is ok!" has arrived, so Dan reads the messages in a consistent order: "Dan is in the hospital!", "Dan is ok!", "That's great!"
This presentation is about keeping Dan happy.
Requires maintaining and exchanging metadata. Lots of metadata!
Causal consistency
Problem: if causal dependencies are not accurately tracked, the metadata may generate false positives: false dependencies!
[Animation: Alice, Bob, and Dan issue unrelated updates that coarse-grained metadata falsely orders.]
Metadata
A spectrum: more metadata <-> less metadata.
Matrix/vector clocks: one vector per item, one entry in each vector per DC. Precise, but expensive.
Lamport clocks: one scalar. Cheap, but false positives.
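To make the tradeoff concrete, here is a minimal Python sketch (ours, not from the talk) contrasting the two ends of the spectrum: vector clocks detect concurrency exactly, while a single Lamport scalar orders everything and so manufactures a false dependency.

```python
# Two data centers issue concurrent writes x and y.

def vc_leq(a, b):
    """Vector-clock happens-before: a <= b component-wise."""
    return all(a[dc] <= b[dc] for dc in a)

# One vector entry per data center (the "precise but expensive" end).
x = {"dc1": 1, "dc2": 0}   # write at dc1
y = {"dc1": 0, "dc2": 1}   # concurrent write at dc2
print(not vc_leq(x, y) and not vc_leq(y, x))  # True: detected as concurrent

# One scalar per update (the "cheap" end): Lamport clocks totally order
# everything, so y looks as if it depended on x (a false positive).
lx, ly = 1, 2
print(lx < ly)             # True: false dependency
```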
Problems of the previous state-of-the-art
Throughput vs. data staleness tradeoff
[Plots: throughput penalty (%) and data staleness overhead (%) vs. number of datacenters (3-7), for GentleRain and Cure.]
GentleRain [SoCC'14]: optimizes throughput; compresses metadata into a scalar.
Cure [ICDCS'16]: optimizes data freshness; relies on a vector clock with an entry per data center.
Metadata size affects throughput.
False dependencies damage data freshness.
Visibility latencies have a direct impact on client response latency.
Partial replication aggravates the problem.
Our take on the problem: Saturn + Eunomia
Saturn: god in ancient Roman religion who became the god of time. Orders events across datacenters.
Eunomia: Greek goddess of law and legislation. Orders events in each datacenter.
If this couple cannot fix the problem, nobody can…
Saturn
A distributed metadata service, pluggable to existing geo-distributed data services, that handles the dissemination of operations among data centers.
Ensures that clients always observe a causally consistent state, with a negligible performance overhead when compared to an eventually consistent system.
Key features
Requires a constant and small amount of metadata, regardless of the system's scale (servers, partitions, and locations), to avoid impairing throughput.
Mitigates the impact of false dependencies by relying on tree-based dissemination, to enhance data freshness.
Implements genuine partial replication (data centers only manage data and metadata of the items replicated locally), to take full advantage of partial replication.
Decoupling data and metadata
Updates flow between data centers on two channels: a bulk data transfer channel and a metadata (label) transfer channel.
Data centers only make remote updates visible when they have received both the metadata and its corresponding data.
Example: write request
Client 1 issues put(a1) at data center 1 (of data centers 1..N). The new version a1 is propagated to the other data centers through the data channel, while its label is propagated through Saturn's metadata service.
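A minimal sketch of the visibility rule just described, with hypothetical names (DataCenter, on_data, on_label): payloads and labels arrive on separate channels, and a remote update is applied only once the prefix of the label sequence has all of its payloads.

```python
class DataCenter:
    def __init__(self):
        self.payloads = {}   # update id -> (key, value), from the data channel
        self.pending = []    # labels delivered by Saturn, in serialization order
        self.visible = {}    # locally visible key-value state

    def on_data(self, uid, key, value):
        self.payloads[uid] = (key, value)
        self._apply()

    def on_label(self, uid):
        self.pending.append(uid)
        self._apply()

    def _apply(self):
        # Apply the prefix of the label sequence whose payloads have arrived.
        while self.pending and self.pending[0] in self.payloads:
            uid = self.pending.pop(0)
            key, value = self.payloads.pop(uid)
            self.visible[key] = value

dc = DataCenter()
dc.on_label("a1")          # label first: the update stays invisible
dc.on_data("a1", "a", 1)   # payload arrives: now applied
print(dc.visible)          # {'a': 1}
```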
Metadata propagation
Updates a (label 2), b (label 4), and c (label 6): a is concurrent with both b and c; c causally depends on b.
In which order should the labels be delivered to each data center? Any serialization that respects causality is valid: a b c? b a c? b c a?
Metadata propagation
Assume a is issued at data center 1 (time 2) and b and c at data center 3 (times 4 and 6); links within each pair of data centers (1-2 and 3-4) take 1ms, cross links take 10ms. Payloads then arrive at data center 2 at times a -> 3, b -> 14, c -> 16, and at data center 4 at times b -> 5, c -> 7, a -> 12.
The serialization served to a data center determines how long updates are delayed. At data center 4, a serialization that places a first stalls b and c until a's payload arrives (everything becomes visible at time 12, delayed), whereas the serialization b, c, a matches the arrival order and makes each update visible as soon as its payload arrives (b -> 5, c -> 7, a -> 12).
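A small sketch (the helper name `visibility` is ours) reproducing the slide's numbers: the serialization gates visibility, so placing a first stalls everything until time 12, while b, c, a is optimal.

```python
def visibility(serialization, arrival):
    """Visibility time of each update: its payload must have arrived and
    every update serialized before it must already be visible."""
    visible, t = {}, 0
    for u in serialization:
        t = max(t, arrival[u])   # stalled by anything serialized earlier
        visible[u] = t
    return visible

arrival = {"b": 5, "c": 7, "a": 12}          # payload arrival at data center 4

# b must precede c (c causally depends on b); a is concurrent with both.
print(visibility(["a", "b", "c"], arrival))  # {'a': 12, 'b': 12, 'c': 12}: delayed
print(visibility(["b", "c", "a"], arrival))  # {'b': 5, 'c': 7, 'a': 12}: optimal
```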
Metadata propagation
Causal consistency is a partial order. Saturn exploits this fact by serving a possibly different linear extension of the causal order to each data center (e.g., b, c, a to one data center and a, b, c to another).
Each served serialization aims at optimizing data freshness, thus reducing the impact of false dependencies.
Metadata propagation: architecture
Saturn leverages a set of cooperating servers, namely serializers, forming a tree.
Causal consistency is trivially enforced by the tree, assuming FIFO links among serializers: serializers propagate metadata preserving the observed order.
Serializers are geographically distributed to optimize data freshness.
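A toy sketch, under the stated FIFO-tree assumption, of how labels flow: each serializer forwards every label, in the order observed, on all links except the one it arrived on (class names are illustrative, not Saturn's API).

```python
class Serializer:
    def __init__(self, name):
        self.name, self.peers = name, []

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def deliver(self, label, origin=None):
        # Forward in observed order on every link except the incoming one;
        # the topology is a tree, so each label traverses each link once.
        for p in self.peers:
            if p is not origin:
                p.deliver(label, self)

class DataCenterSink:
    def __init__(self, name):
        self.name, self.peers, self.log = name, [], []

    def deliver(self, label, origin=None):
        self.log.append(label)   # the serialization served to this DC

s5, s6 = Serializer("S5"), Serializer("S6")
dcs = {i: DataCenterSink(f"dc{i}") for i in (1, 2, 3, 4)}
s5.connect(s6)
for i in (1, 2): s5.connect(dcs[i])
for i in (3, 4): s6.connect(dcs[i])

s5.deliver("a", origin=dcs[1])   # label issued at dc1 enters the tree at S5
print(dcs[4].log)                # ['a']: forwarded S5 -> S6 -> dc4 in order
```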
Metadata dissemination graph
Each data center (1-4) attaches to a leaf serializer (S1-S4); internal serializers (S5, S6) complete the tree.
Optimal dissemination graph: Weighted Minimal Mismatch
The goal is to build the tree such that metadata-path latencies (through the tree) match the data-path latencies. The mismatch of a pair of data centers is the absolute difference between its label-path and data-path latencies; the configuration should minimize the mismatch of the busiest paths.

From the paper (Sections 5.4 and 5.5):

5.4 Configuring Saturn's Metadata Service
The quality of the serialization served by Saturn to each datacenter depends on how the service is configured. A Saturn configuration defines: (i) the number of serializers to use and where to place them; (ii) how these serializers are connected (among each other and with datacenters); and (iii) what delays (if any) a serializer should artificially add when propagating labels (in order to match the optimal visibility time). Let $\delta_{ij}$ denote the artificial delay added by serializer $i$ when propagating metadata to serializer $j$.

In practice, when deploying Saturn, one does not have complete freedom to select the geo-location of serializers. Instead, the list of potential locations for serializers is limited by the availability of suitable points-of-presence that results from business constraints. Therefore, the task of setting up a serializer network is based on:
- The set $V$ of datacenters that need to be connected (we denote by $N = |V|$ the total number of datacenters).
- The latencies of the bulk data transfer service among these datacenters; $lat_{ij}$ denotes the latency between datacenters $i$ and $j$.
- The set $W$ of potential locations for placing serializers ($M = |W|$). Since each datacenter is a natural potential serializer location, $M \ge N$. Let $d_{ij}$ denote the latency between two serializer locations $i$ and $j$.

However, given a limited set of potential locations to place serializers, it is unlikely (impossible in most cases) to match the optimal label propagation latency for every pair of datacenters. Therefore, the best one can aim for when setting up Saturn is to minimize the mismatch between the achievable label propagation latency and the optimal label propagation latency. More precisely, consider that the path for a given topology between two datacenters $i$ and $j$, denoted $P^M_{i,j} = \{S_k, \ldots, S_o\}$, is composed of a set of serializers, where $S_k$ connects to datacenter $i$ and $S_o$ connects to datacenter $j$. The latency of this path, $\Delta_M(i,j)$, is defined by the latencies ($d$) between adjacent nodes in the path, plus any artificial delays added at each step, i.e.:

$$\Delta_M(i,j) = \sum_{S_k \in P^M_{i,j} \setminus \{S_o\}} \left( d_{k,k+1} + \delta_{k,k+1} \right)$$

and the mismatch between the resulting latency and the optimal label propagation latency is given by:

$$\mathit{mismatch}_{i,j} = \left| \Delta_M(i,j) - \Delta(i,j) \right|$$

Finally, the distribution of client requests among items and datacenters may not be uniform: some items and some datacenters may be accessed more than others. A mismatch that affects the visibility of a highly accessed item may hurt the user experience more than a mismatch on a seldom accessed item. Therefore, when it is possible to collect statistics regarding which items and datacenters are most used, one can assign a weight $c_{i,j}$ to each metadata path $P^M_{i,j}$ that reflects the relative importance of that path for the business goals of the application. Using these weights, we can define precisely the optimization criterion to follow when setting up the serializer topology:

Definition 2 (Weighted Minimal Mismatch). The configuration that best approximates the optimal visibility time for data updates, considering the relative relevance of each type of update, is the one that minimizes the weighted global mismatch, defined as:

$$\min \sum_{\forall i,j \in V} c_{i,j} \cdot \mathit{mismatch}_{i,j}$$

5.5 Configuration Generator
The problem of finding a configuration that minimizes the Weighted Minimal Mismatch criterion, among all possible configurations that satisfy the constraints of the problem, is NP-hard (a reduction from the Steiner tree problem [35] can be used to prove this). Therefore, we have designed a heuristic that approximates the optimal solution using a constraint solver as a building block. We model the minimization problem captured by Definition 2 as a constraint problem that, for a given tree, finds the optimal location of serializers (from a given set of location candidates) and the optimal (if any) propagation delays.

The proposed algorithm (Alg. 3) works as follows. Starting with a full binary tree with only two leaves, it iteratively generates isomorphism classes of full binary trees with labeled leaves (i.e., datacenters), adding one labeled leaf per iteration until the number of leaves equals the total number of datacenters. For a given full binary tree $T$ with $f$ leaves, there exist $2f - 1$ isomorphism classes of full binary trees with $f + 1$ leaves: one obtains a new class either by inserting a new internal node within an edge of $T$ from which the new leaf hangs, or by creating a new root from which the new leaf and $T$ hang. One could iterate until generating all possible trees with $N$ leaves; nevertheless, to avoid a combinatorial explosion (for nine datacenters there would already be 2,027,025 possible trees), the algorithm selects at each iteration the most promising trees and discards the rest. Trees are ranked at each iteration using the constraint solver: given a totally ordered list of ranked trees, if the difference between the rankings of two consecutive trees $T_1$ and $T_2$ is greater than a given threshold, $T_2$ and all following trees are discarded. At the last iteration, among all trees with $N$ leaves, the algorithm picks the one that produces the smallest global mismatch from the optimal visibility times, again relying on the constraint solver.

Note that Algorithm 3 always returns a binary tree. Nevertheless, Saturn does not require the tree to be binary: two serializers can be fused into one if they are directly connected, placed in the same location, and the artificial propagation delays between them are zero.
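A minimal sketch of the tree-enumeration step described in Section 5.5 above (the nested-tuple representation is ours): extending a full binary tree with one more labeled leaf, by hanging it off every edge via a fresh internal node or off a new root, yields exactly the $2f - 1$ successor classes.

```python
def extend(tree, leaf):
    """Yield every full binary tree obtained by adding `leaf` to `tree`.
    Trees are nested 2-tuples; leaves are datacenter names (strings)."""
    yield (tree, leaf)              # new root: the old tree and the leaf hang
    if isinstance(tree, tuple):     # hang the leaf inside either subtree
        left, right = tree
        for t in extend(left, leaf):
            yield (t, right)
        for t in extend(right, leaf):
            yield (left, t)

trees = list(extend(("dc1", "dc2"), "dc3"))
print(len(trees))   # 3 = 2*2 - 1 successors of a 2-leaf tree
print(trees)        # [(('dc1','dc2'),'dc3'), (('dc1','dc3'),'dc2'), ('dc1',('dc2','dc3'))]
```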
Metadata propagation: building the tree
Finding the optimal tree is modelled as a constraint optimization problem.
Input:
- Data-path average latencies
- Candidate locations for serializers (and latencies among them)
- Access patterns: to minimize the impact of mismatches
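A sketch of evaluating the Weighted Minimal Mismatch objective for one candidate configuration; the inputs mirror the list above, and the latencies are toy values in the spirit of the running example (1 ms local links, 10 ms cross links), not measured numbers.

```python
def weighted_mismatch(pairs, label_latency, data_latency, weight):
    """pairs: iterable of (i, j) datacenter pairs.
    label_latency[i, j]: latency through the serializer tree (incl. delays).
    data_latency[i, j]: bulk data transfer latency (the optimum to match).
    weight[i, j]: relative importance of the pair (from access statistics)."""
    return sum(weight[p] * abs(label_latency[p] - data_latency[p])
               for p in pairs)

pairs = [(1, 3), (1, 4), (3, 4)]
data = {(1, 3): 10, (1, 4): 10, (3, 4): 1}     # data-path latencies (ms)
labels = {(1, 3): 10, (1, 4): 11, (3, 4): 2}   # label paths through the tree
w = {p: 1.0 for p in pairs}                    # uniform access pattern
print(weighted_mismatch(pairs, labels, data, w))  # 2.0
```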
Example
Serializers are placed to match the data paths: S1-S4 at the data centers, S5 @ dc1 (or dc2), S6 @ dc4 (or dc3), with 1ms links inside each pair and 10ms links across.
Timeline: a is issued at dc1 at time 2; b at dc3 at time 4; c at dc3 at time 6 (c causally depends on b; a is concurrent with both).
Labels follow the tree with the same latencies as the data: a's label reaches dc2 at time 3 and dc4 at time 12; b's label reaches dc4 at time 5; c's label reaches dc4 at time 7.
Each data center therefore receives a causal serialization that matches the arrival of the payloads (at dc4: b -> 5, c -> 7, a -> 12), and no update is artificially delayed.
Saturn requires each datacenter to output a SERIAL sequence of updates, but each datacenter has many machines.
Eunomia
How to serialize updates performed on different machines without losing performance? Usually one relies on sequencers, which limit concurrency.
Eunomia is conceived to replace the sequencers used to serialize updates in replicated storage systems.
Eunomia
Totally orders local updates, consistently with causality, before shipping them to Saturn and the other data centers.
The ordering is done in the background, out of the client's critical path.
Sequencer vs. Eunomia (partitions A, B, C)
A sequencer serializes puts as they happen, limiting concurrency. With Eunomia, each partition instead timestamps its updates locally (in the animation: put(a) -> 1, put(b) -> 1, put(c) -> 2, put(d) -> 1, put(e) -> 2, put(f) -> 3), and Eunomia merges the per-partition streams into a single causally consistent total order in the background.
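A sketch of an Eunomia-style stabilizer under assumed interfaces (on_update, on_heartbeat): partitions timestamp updates locally, and the service emits them in timestamp order once every partition's clock has moved past that timestamp, so no earlier update can still arrive. Clients never wait on this step.

```python
import heapq

class Eunomia:
    def __init__(self, partitions):
        self.clock = {p: 0 for p in partitions}  # latest ts seen per partition
        self.buffer = []                         # min-heap of (ts, partition, op)
        self.output = []                         # total order shipped to Saturn

    def on_update(self, partition, ts, op):
        self.clock[partition] = ts
        heapq.heappush(self.buffer, (ts, partition, op))
        self._stabilize()

    def on_heartbeat(self, partition, ts):
        # Idle partitions advance their clock so they never block delivery.
        self.clock[partition] = ts
        self._stabilize()

    def _stabilize(self):
        # An update is stable once every partition's clock has reached its ts
        # (per-partition timestamps are monotonically increasing).
        stable = min(self.clock.values())
        while self.buffer and self.buffer[0][0] <= stable:
            self.output.append(heapq.heappop(self.buffer))

e = Eunomia(["A", "B", "C"])
e.on_update("A", 1, "put(a)")
e.on_update("C", 1, "put(b)")          # nothing output yet: B's clock is 0
e.on_update("B", 2, "put(c)")
print([op for _, _, op in e.output])   # ['put(a)', 'put(b)']
```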
Evaluation: goals
- How does Eunomia compare to sequencers?
- Why does Saturn's genuine partial replication matter?
- How important is the configuration of Saturn?
- Can Saturn + Eunomia optimize both throughput and data freshness simultaneously?
Saturn vs. state-of-the-art
Partitioning among 7 data centers; replication factor min. 2, max. [2,5] [J. M. Pujol et al. The little engine(s) that could: Scaling online social networks. SIGCOMM '10]
Workload based on real social network workloads [F. Benevenuto et al. Characterizing user behavior in online social networks. IMC '09]
Public Facebook dataset [B. Viswanath et al. On the evolution of user interaction in Facebook. WOSN '09]
[Map: the seven Amazon EC2 regions used (see Table 1).]
Saturn vs state-of-the-art: Throughput
[Figure: throughput (ops/sec, 0–120,000) for max replication degrees 5, 4, 3, and 2, comparing Eventual, Saturn, GentleRain, and Cure.]
• Saturn exhibits a throughput very close to eventual consistency.
• Cure significantly penalizes throughput due to the metadata size.
55
Saturn vs state-of-the-art: Data freshness (average)
[Figure: average remote visibility latency (ms, 0–160) for Eventual, Saturn, GentleRain, and Cure.]
• Saturn is still better than Cure, which uses vector clocks, and very close to eventual consistency.
56
Evaluation: goals
• Next: why does Saturn's genuine partial replication matter?
57
Saturn: genuine partial replication matters
[Figure: three datacenters, N. Virginia, Ireland, and Sydney.]
• Sydney receives updates from Ireland.
• Ireland updates depend on N. Virginia updates.
• Sydney does not replicate all N. Virginia data.
58
Saturn: genuine partial replication matters
[Figure: CDF of remote update visibility (milliseconds, 0–280) for Saturn vs. a vector-based protocol; latency of updates from Ireland when applied at Sydney.]
• With the vector-based protocol, updates from Ireland are stalled until metadata is received from N. Virginia.
59
Evaluation: goals
• Next: how important is the configuration of Saturn?
60
Saturn configuration matters
        NC      O       I       F       T       S
NV      37 ms   49 ms   41 ms   45 ms   73 ms   115 ms
NC      -       10 ms   74 ms   84 ms   52 ms   79 ms
O       -       -       69 ms   79 ms   45 ms   81 ms
I       -       -       -       10 ms   107 ms  154 ms
F       -       -       -       -       118 ms  161 ms
T       -       -       -       -       -       52 ms
Table 1: Average latencies (half RTT) among Amazon EC2 regions: N. Virginia (NV), N. California (NC), Oregon (O), Ireland (I), Frankfurt (F), Tokyo (T), and Sydney (S)
In order to compare SATURN with other solutions from the state-of-the-art (data services), we attached SATURN to an eventually consistent geo-replicated storage system we have built. Throughout the evaluation, we use this eventually consistent data service as the baseline, as it adds no overheads due to consistency management, to better understand the overheads introduced by SATURN. Note that this baseline represents a throughput upper-bound and a latency lower-bound. Thus, when we refer to the optimal visibility latency throughout the experiments, we are referring to the latencies provided by the eventually consistent system.
Implementation. Our SATURN prototype implements all functionality described in the paper. It has been built using the Erlang/OTP programming language. In our prototype, gears rely on physical clocks to generate monotonically increasing timestamps. To balance the load among frontends at each datacenter, we use Riak Core [14], an open source distribution platform. The optimization problem (Definition 2) used to configure SATURN is modeled using OscaR [44], a Scala toolkit for solving Operations Research problems.
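The paragraph above only states that gears use physical clocks to generate monotonically increasing timestamps. A common way to obtain that property from a raw physical clock, shown here as a hedged sketch rather than Saturn's actual code, is to never reissue a value at or below the previously returned one:

```python
import time

class MonotonicTimestamps:
    """Sketch: turn a physical clock into strictly increasing timestamps,
    so label order stays well defined even if the clock is coarse-grained
    or is stepped backwards by an NTP adjustment."""
    def __init__(self):
        self.last = 0

    def next(self):
        now = time.time_ns() // 1_000  # physical clock, in microseconds
        self.last = max(self.last + 1, now)
        return self.last
```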
Setup. We use Amazon EC2 m4.large instances running Ubuntu 12.04 in our experiments. Each instance has two virtual CPU cores, and 8 GB of memory. We use seven different regions in our experiments. Table 1 lists the average latencies we measured among regions. Our experiments simulate one datacenter per region. Clients are co-located with their preferred datacenter in separate machines. Each client machine runs its own instance of a custom version of Basho Bench [13], a load-generator and benchmarking tool. Each client eagerly sends requests to its preferred datacenter with zero thinking time. We deploy as many clients as necessary in order to reach the system's maximum capacity, without overloading it. Each experiment runs for more than 5 minutes. In our results, the first and the last minute of each experiment are ignored to avoid experimental artifacts. We measure the visibility latencies of remote update operations by storing the physical time at the origin datacenter when the update is applied locally, and subtracting it from the physical time at the destination datacenter when the update becomes visible. To reduce the errors due to clock skew, physical clocks are synchronized using the NTP protocol [43] before each experiment, making the remaining clock skew negligible in comparison to inter-datacenter travel time.
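The measurement described above boils down to a subtraction across datacenters, which is only meaningful because the clocks are NTP-synchronized beforehand; a sketch with hypothetical field names:

```python
def remote_visibility_latency_ms(applied_at_origin_ns: int,
                                 visible_at_destination_ns: int) -> float:
    """Visibility latency of one remote update: physical time at the
    destination when the update became visible, minus physical time at
    the origin when it was applied locally. Residual clock skew after
    NTP sync is assumed negligible vs. inter-datacenter travel time."""
    return (visible_at_destination_ns - applied_at_origin_ns) / 1e6
```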
[Figure 4: CDFs of remote update visibility (milliseconds) under the P-conf, M-conf, and S-conf configurations. Left: Ireland to Frankfurt (10ms); Right: Tokyo to Sydney (52ms).]
7.1 SATURN Configuration Matters
We compare different configurations of SATURN to better understand their impact on the system performance. We compare three alternative configurations: (i) a single-serializer configuration (S), (ii) a multi-serializer configuration (M), and (iii) a peer-to-peer version of SATURN that relies on the conservative label's timestamp order to apply remote operations (P). We focus on the visibility latencies provided by the different implementations.
For the S-configuration, we placed the serializer in Ireland. For the M-configuration, we build the configuration tree by relying on Algorithm 5.4. We run an experiment with a read-dominant workload (90% reads). Figure 4 shows the cumulative distribution of the latency before updates originating in Ireland become visible in Frankfurt (left plot) and before updates originating in Tokyo become visible in Sydney (right plot). Results show that both the S and M configurations provide comparable results for updates being replicated in Frankfurt. This is because we placed the serializer of the S-configuration in Ireland, and therefore the propagation of labels is done efficiently among these two regions. Unsurprisingly, when measuring visibility latencies before updates originating in Tokyo become visible in Sydney, the S-configuration performs poorly because labels have to travel from Tokyo to Ireland and then from Ireland to Sydney. Moreover, results show that the P-configuration, which relies on the label's timestamp order, is not able to provide low visibility latencies in these settings. This is expected as, when tracking causality with a single scalar, latencies tend to match the longest network travel time (161ms in this case) due to false dependencies. In turn, the M-configuration is able to provide significantly lower visibility latencies to all locations (deviating only 8.2ms from the optimal on average).
7.2 Impact of Latency Variability on SATURN
The goal of this section is to better understand how changes in the link latency affect SATURN's performance. We have just seen that the correct configuration of the serializers' tree has an impact on performance. Therefore, if changes in the link latencies are large enough to make the current configuration no longer suitable, and these changes are permanent, a reconfiguration of SATURN should be triggered. In practice, transient changes in the link latencies are unlikely to justify a reconfiguration; therefore we expect their effect on performance to be small.
Saturn configuration matters
[Slide: remote update visibility CDFs for Ireland to Frankfurt and Tokyo to Sydney, as in Figure 4.]
[Figure 5: dynamic workload throughput experiments. Throughput (ops/sec, 0–120,000) for Eventual, Saturn, GentleRain, and Cure while varying (a) value size in bytes (8–2048), (b) read:write ratio (50:50 to 99:1), (c) correlation distribution (exponential, proportional, uniform, full), and (d) percentage of remote reads (0%–40%).]
[Figure 6: impact of latency variability on remote update visibility in SATURN. Extra visibility latency (ms, 0–60) as a function of the injected delay (ms, 0–125) for configurations T1 and T2.]
To validate this assumption, we set up a simple experiment with three datacenters, each located in a different EC2 region: N. California, Oregon, and Ireland. For the experiment, we artificially inject extra latency between the N. California and Oregon datacenters (the average measured latency is 10ms). From our experience, we expect the latency among EC2 regions to deviate from its average only slightly and transiently. Nevertheless, to fully understand the consequences of latency variability, we also experimented with unrealistically large deviations (up to 125ms).
Figure 6 shows the extra remote visibility latency that two different configurations of SATURN add on average when compared to an eventually consistent storage system which makes no attempt to enforce causality. Both configurations, T1 and T2, use a single serializer: configuration T1 places the serializer in Oregon, representing the optimal configuration under normal conditions, and configuration T2, instead, places the serializer in Ireland.
As expected, under normal conditions, T1 performs significantly better than T2, confirming the importance of choosing the right configuration. As we add extra latency, T1 degrades its performance, but only slightly. One can observe that, in fact, slight deviations in the average latency have no significant impact on SATURN: even with an extra delay of 25ms (more than twice the average delay), T1 only adds 14ms of extra visibility latency on average. Interestingly, it is only with more than 55ms of injected latency that T2 becomes the optimal configuration, exhibiting lower remote visibility latency than T1. Observing a long and sustained increase of 55ms of delay on a link that averages 10ms is highly unlikely. Indeed, this scenario has the same effect as migrating the datacenter from N. California to São Paulo. Moreover, if such a large deviation becomes the norm, system operators can always rely on SATURN's reconfiguration mechanism to change SATURN's configuration.
7.3 SATURN vs. the State-of-the-art
We compare the performance of SATURN against eventual consistency and against the most performant causally consistent storage systems in the state-of-the-art.
7.3.1 GentleRain and Cure
We consider GentleRain [26] and Cure [3] the current state-of-the-art. We have also experimented with solutions based on explicit dependency checking, such as COPS [39] and Eiger [40]. Nevertheless, we concluded that approaches based on explicit dependency checking are not practical under partial geo-replication. Their practicality depends on the capability of pruning the client's list of dependencies after update operations, due to the transitivity rule of causality [39]. Under partial geo-replication, this is not possible, causing the client's list of dependencies to potentially grow up to the size of the entire database.
At its core, both GentleRain and Cure implement causal consistency very similarly: they rely on a background stabilization mechanism that requires all partitions in the system to periodically exchange metadata. This equips each partition with sufficient information to locally decide when remote updates can be made visible to local clients safely, with no violation of causality. In our experiments, GentleRain's and Cure's stabilization mechanisms run every 5ms, following the authors' specifications. The interested reader can find more details in the original papers [3, 26]. We recall that SATURN does not require such a mechanism, as the order in which labels are delivered to each datacenter already determines the order in which remote updates have to be applied.
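To contrast with Saturn's label ordering, here is a minimal sketch of the kind of background stabilization the paragraph describes, in the GentleRain (single scalar) flavour; Cure's is analogous with one vector entry per datacenter. Class and method names are ours, not the systems' APIs.

```python
class GstStabilizer:
    """Sketch of a GentleRain-style stabilization mechanism. Each partition
    periodically reports the timestamp up to which it has received all
    updates; the Global Stable Time (GST) is the minimum of these reports,
    and a remote update can be made visible once its timestamp is <= GST,
    since every causal dependency must then have been received too."""
    def __init__(self, partitions):
        self.seen_up_to = {p: 0 for p in partitions}

    def report(self, partition, ts):
        # Invoked by the periodic metadata exchange (every 5ms above).
        self.seen_up_to[partition] = max(self.seen_up_to[partition], ts)

    def gst(self):
        return min(self.seen_up_to.values())

    def is_visible(self, remote_update_ts):
        return remote_update_ts <= self.gst()
```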
The main difference between GentleRain and Cure resides in the way causality is tracked. While GentleRain summarizes causal dependencies in a single scalar, Cure uses a vector clock with an entry per datacenter. This enables Cure to track causality more precisely, lowering remote visibility latency, but the metadata management increases the computation and storage overhead, harming throughput. Concretely, by relying on a vector, Cure's lower-bound on remote update visibility latency is determined by the latency between the originator of the update and the remote datacenter. Differently, in GentleRain, the lower-bound is determined by the longest network travel time into the remote datacenter, since a single scalar creates false dependencies on updates from every datacenter.
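These two lower bounds can be read directly off Table 1. A small sketch, using the Table 1 latencies and hypothetical helper names, makes the gap concrete: with Cure's vector, the bound for an update is the origin-to-destination latency, while with GentleRain's single scalar, the destination must first hear from its farthest peer.

```python
# Half-RTT latencies (ms) among EC2 regions, from Table 1.
LAT = {
    ("NV", "NC"): 37, ("NV", "O"): 49, ("NV", "I"): 41, ("NV", "F"): 45,
    ("NV", "T"): 73, ("NV", "S"): 115,
    ("NC", "O"): 10, ("NC", "I"): 74, ("NC", "F"): 84, ("NC", "T"): 52,
    ("NC", "S"): 79,
    ("O", "I"): 69, ("O", "F"): 79, ("O", "T"): 45, ("O", "S"): 81,
    ("I", "F"): 10, ("I", "T"): 107, ("I", "S"): 154,
    ("F", "T"): 118, ("F", "S"): 161,
    ("T", "S"): 52,
}
REGIONS = ["NV", "NC", "O", "I", "F", "T", "S"]

def latency(a, b):
    return 0 if a == b else LAT.get((a, b), LAT.get((b, a)))

def cure_lower_bound(origin, dest):
    # Vector clock: visibility waits only on the origin's vector entry.
    return latency(origin, dest)

def gentlerain_lower_bound(dest):
    # Single scalar (GST): visibility waits on the farthest datacenter.
    return max(latency(r, dest) for r in REGIONS)

print(cure_lower_bound("I", "F"))    # 10 ms (Ireland -> Frankfurt)
print(gentlerain_lower_bound("F"))   # 161 ms (Frankfurt must hear from Sydney)
```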
Evaluation: goals
• Next: how does Eunomia compare to sequencers?
63
Evaluation
[Figure: maximum throughput (Kops/sec, 0–400) achievable by Eunomia vs. a classical sequencer, for configurations Eunomia 15, 30, 45, 60, and 75.]
• Eunomia sustains up to 7.7x the throughput of a classical sequencer.
64
Other features…
• On-line reconfiguration protocols
• Efficient migration of clients among data centers
65
Check the papers for more details:
• Impact of stragglers
• Support for fault-tolerance
• More experiments
EuroSys 2017, ATC 2017
66
Take-away messages
• It is possible to preserve concurrency inside a DC, and to capture causality with a single scalar, using an ordering service that is out of the client's critical path
• These scalar clocks can be propagated using tree-based dissemination, which can effectively mitigate the impact of false dependencies
• Combined, they allow the optimisation of both throughput and data freshness
67