Reliable Distributed Computing: The Price of Mastering Churn in Distributed Systems

Roberto Baldoni
Università di Roma “La Sapienza”

Retirement Seminar for Professor Santosh Shrivastava
8th of September 2011, Newcastle, UK

The Price of Mastering Churn in Distributed Systems

Roberto Baldoni, “The price of mastering churn in a distributed system”


Santosh reminds me … a set of acronyms

MIDAS (2001)

EUCOSM (2003) LUCID (2004) MAGNET (2005) VIRTUE (2007) SEGOVIA (2009)



Large and promising IP rejected --too many Chinese!

FET IP - Very strong consortium - rejected. Reason: «very nice project, however it wants to provide a real software platform for pooling together on-demand resources in a multi-tenant environment resistant to byzantine attack…. in FET program we do not fund engineering work»

Just below the bar!

Outline
• Dynamic Distributed Systems
• System Model with Churn
• Regular Registers
• Other interesting Abstractions
• Conclusion

Advent of Complex Distributed Applications

• Peer-to-peer
• Sensor Networks
• Mobile networks
• Cloud computing federations
• Internet supercomputing
• Smart environments

Managed vs. Unmanaged distributed applications (i)

Managed Distributed Application
• Existence of a manager that can control the entities comprising or running the application
• The manager guarantees a suitable environment for a duration of time sufficient for a distributed system to behave correctly wrt its system model assumptions, e.g.,
  • Providing needed/sufficient/appropriate entities to enable correct behavior of the application (global application view)
  • Providing operational guarantees of QoS and the necessary degree of synchrony in the underlying distributed platform

Managed Distributed Applications: Consequences

Main characteristics: a predefined setting, i.e.,
• The application knows, directly or indirectly, the set of processes that will participate in the computation
• The application knows if it can exploit synchrony assumptions
• The system can be carefully and "centrally" configured through an appropriate tuning phase in order to get the best performance
• The application cycle is: design, deployment optimization, configuration, final deployment, operation

Managed Distributed Applications run on top of a Distributed System that is piecewise static wrt time

[Figure: time axis divided into static periods with N, N-1, N-2 and N+3 entities]

Managed vs. Unmanaged distributed applications (ii)

Unmanaged Distributed Applications
• No assumption of a manager or access to equivalent management facilities
• Each process autonomously decides to locally run a component of a distributed application when (a) joining and (b) leaving the system
• The system and/or its components do not start with a known and pre-defined setting
• “Nice” manageable system model assumptions either cannot be guaranteed or do not last for long

Unmanaged distributed applications: Consequences

• Autonomic/autonomous behavior of entities
• Self-defined, self-instantiating (& self*?) and perpetually evolving distributed system
• It is impossible to know the set of processes participating in the computation because it changes dynamically and can potentially grow without bounds
  • E.g., the system could cease to exist when no process is active, and at other times the system may be made of thousands of active processes

. . . Dynamic Distributed System

Spectrum of Possible System Models

[Figure: the world as a spectrum from orderly to chaotic, with example systems placed along it: Air traffic control, Mobile ad-hoc systems, Cloud computing, Peer-to-peer]

Uncertainty in Dynamic Distributed Systems

Static Distributed Systems:
• Lack of temporal knowledge
• Failures
• Unknown communication delays

Dynamic Distributed Systems:
• Same issues as in static distributed systems, plus
• Non-monotonic and unknown size of the system
• Potentially changing properties of the “universe”
• Unclear notions of efficiency, effectiveness, scalability

• Solid theoretical foundations

• Precise problem specifications

• Rigorously correct solutions

System Model with Churn



• The distributed system is dynamic
  • In each run, infinitely many processes can arrive and depart from the system but at any point in time the number of processes is finite (Infinite Arrival Model)
• Processes participate in a distributed computation running on top of the distributed system
  • Processes of the distributed system decide at their will to join and leave the distributed computation (i.e. the computation is affected by continuous churn)
  • No process is guaranteed to participate forever in the distributed computation
• Each process has a unique identifier
• Processes can crash and this can be seen as a leave of the process
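A minimal Python sketch of this model, assuming a constant-size computation in which a fraction c of the n members is replaced at every time unit. The function name `simulate_churn` and the oldest-first replacement policy are illustrative assumptions, not part of the talk.

```python
from collections import deque


def simulate_churn(n: int, c: float, steps: int):
    """Replace round(c * n) members per time unit; return (members, arrivals).

    Assumption: the oldest members leave first. Identifiers are never reused,
    so the number of arrivals keeps growing while membership stays at n.
    """
    members = deque(range(n))        # unique identifiers of current members
    next_id = n                      # next fresh identifier
    per_step = round(c * n)          # processes refreshed in each time unit
    for _ in range(steps):
        for _ in range(per_step):
            members.popleft()        # a process leaves (or crashes)
            members.append(next_id)  # a new process joins
            next_id += 1
    return set(members), next_id


members, arrivals = simulate_churn(n=10, c=0.2, steps=100)
print(len(members), arrivals)
```

With these parameters the membership stays at 10 processes at every instant, yet none of the initial processes is still present at the end of the run: no process participates forever.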



System Model with Churn

Abstractions
• Shared Memory: Registers, Sets
• One-shot problem: Interval valid queries
• Agreement Problem: Leader Election

[Figure: churn affecting the Distributed System and the Distributed Computation; protocol stack made of Connectivity Protocol, Communication Protocols and Abstraction]

For simplicity we assume N processes are in the distributed computation at any given time



Object Abstraction: The Regular Register

A register is a shared variable accessed by processes through read and write operations



Regular Register Architecture at node i



[Figure: layers at node i: Connectivity Layer, Point-to-Point Link and Broadcast, and the Regular Register (REG) exposing read(), write(v) and join()]

Point-to-Point Link: If pi invokes the send(m) operation to pj at time t then pj will receive m by time t+δ if it has not left the system by that time

Broadcast: If pi invokes the broadcast(m) operation at time t and does not leave the system by time t+δ then all the processes that are in the system at time t and do not leave the system by time t+δ will deliver m by time t+δ

Regular Register:
(liveness) If a process invokes a read or a write operation and does not leave the system, it eventually returns from that operation
(safety) A read operation returns the last value written or a value written by a concurrent write
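The safety property of the regular register can be made concrete with a small checker. The sketch below assumes a single writer (so writes are sequential) and represents each operation by its invocation and response times; the names `Write` and `allowed_read_values` are hypothetical.

```python
from typing import List, Set, Tuple

Write = Tuple[float, float, int]  # (invocation time, response time, value)


def allowed_read_values(writes: List[Write], read_start: float,
                        read_end: float, initial: int) -> Set[int]:
    """Values a regular register may return for a read over [read_start, read_end].

    Assumption: a single writer, so writes are sequential and sorted by time.
    """
    allowed = set()
    last_completed = initial
    for w_start, w_end, value in writes:
        if w_end < read_start:
            last_completed = value  # write finished before the read began
        elif w_start <= read_end:
            allowed.add(value)      # write overlaps the read: concurrent
    allowed.add(last_completed)
    return allowed


writes = [(0, 1, 10), (2, 3, 20), (5, 6, 30)]
print(allowed_read_values(writes, 4, 4.5, initial=0))
print(allowed_read_values(writes, 2.5, 5.5, initial=0))
```

A read with no concurrent write has exactly one legal answer, while a read that overlaps several writes may return any of their values or the last value written before it started.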

Regular Register: write()



The writer process pw wants to write the value v

pw sends a broadcast message (WRITE, v, sn)

… in the meantime processes join and leave the computation

OBS. Only processes that belong to the computation when pw starts the write and that remain in the computation for the whole duration of the write will hold the updated copy of the register

Active processes keep the state of the computation

[Figure: the distributed system, with the writer pw among the processes of the register computation]

A subset of processes participates in the register computation

Processes in the distributed computation vs Active Processes



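[Figure: number of processes (#processes) versus time t, showing the constant size N (joining processes = leaving processes), the active processes A(t) under churn, and the correctness bound]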





Movement of the bound is impacted by the system model. The weaker the system model is the more «static» the system becomes. This brings several impossibility results in presence of churn.


[Figure: #processes versus time t with N, A(t) under churn, and the correctness bound, annotated “Liveness and Safety issues”]

An Algorithm in Synchronous System

Assumption
• There is a bound δ such that any message sent (broadcast) at time τ is received (delivered) by time τ + δ by the processes that are in the system during the interval [τ, τ + δ].
• A process remains in the system at least 3δ

Algorithm
• Read: local
• Write: global
• Join: global
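The local/global split can be illustrated with a toy model. This is a sketch under strong assumptions, not the algorithm from the talk: message delays and waiting phases are collapsed into instantaneous delivery, and the class and method names are hypothetical. Only the INQUIRY/REPLY/WRITE message roles are taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Process:
    pid: int
    value: Optional[int] = None  # None: no copy of the register yet
    sn: int = -1                 # sequence number of the stored value
    active: bool = False         # True once join() has completed


class Computation:
    """Register computation with timing collapsed to instant delivery."""

    def __init__(self, n: int, initial: int = 0):
        self.members: Dict[int, Process] = {
            i: Process(i, initial, 0, True) for i in range(n)}
        self.next_pid = n

    def write(self, writer_pid: int, value: int) -> None:
        # Global: broadcast WRITE(value, sn) to every process in the computation.
        sn = self.members[writer_pid].sn + 1
        for p in self.members.values():
            if sn > p.sn:
                p.value, p.sn = value, sn

    def read(self, pid: int) -> Optional[int]:
        # Local: return the local copy, no message is exchanged.
        return self.members[pid].value

    def join(self) -> int:
        # Global: INQUIRY to the computation, REPLY from active processes only.
        replies = [(q.sn, q.value) for q in self.members.values() if q.active]
        if not replies:
            raise RuntimeError("no active process left: register value lost")
        p = Process(self.next_pid)
        self.next_pid += 1
        p.sn, p.value = max(replies)  # adopt the reply with the highest sn
        p.active = True
        self.members[p.pid] = p
        return p.pid

    def leave(self, pid: int) -> None:
        del self.members[pid]


comp = Computation(n=3)
comp.write(writer_pid=0, value=1)
newcomer = comp.join()
for old in (0, 1, 2):
    comp.leave(old)              # every initial process departs
late = comp.join()               # served by the newcomer, which is active
print(comp.read(late))
```

The value survives the departure of all initial processes because an active process is still around to answer the inquiry; once no active process is left, `join()` has nobody to learn the value from.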


Synchronous System Safety: case registeri ≠ ⊥

[Figure: time diagram of processes pi, pj, ph, pk: a join() by pi concurrent with write(1), whose WRITE(1, 1) message updates the copies from 0 to 1; legend: Join, Write, Reply]

pi has received a WRITE(< val,sn >) message during the first waiting phase and accordingly updated registeri

• the write operation lasts δ time
• the join operation lasts at least δ time
• a write message takes at most δ time to be delivered

Then the join and the write are concurrent and the join terminates with the last value written

Synchronous System Safety: case registeri = ⊥

[Figure: time diagram of processes pi, pj, ph, pk holding value 0: pi runs join(), broadcasts INQUIRY(i) and receives REPLY(h, 0, 0); legend: Join, Write, Reply]

If no write is concurrent with the join operation, and c < 1/(3δ), then there always exists an active process that replies with the last written value

Synchronous System Safety: case registeri = ⊥

[Figure: time diagram of processes pi, pj, ph, pk: pi runs join() concurrently with write(1); pi sends INQUIRY(i) and receives both REPLY(h, 0, 0) and WRITE(1, 1)]

pi can receive both WRITE(< val,sn >) messages and REPLY(< j, val, sn >) messages. According to the values received at time τ + 2δ, pi will update registeri to the value written by a concurrent write, or to the value written before the concurrent writes.

If pi receives the write before the reply, pi does not overwrite the value, and then any following read will return the last value written.

Synchronous System

Termination. If a process invokes the join() operation and does not leave the system for at least 3δ time units, or invokes the read() operation, or invokes the write() operation and does not leave the system for at least δ time units, it terminates the invoked operation.

Safety. Consider any interval of the computation in which the churn c × n is less than n/(3δ) (i.e., c < 1/(3δ)). A read() operation returns the last value written before the read invocation, or a value written by a write operation concurrent with it.
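One way to read the bound c < 1/(3δ): if at most c × n processes are replaced per time unit, then at most 3δ × c × n of the n processes can leave during a window of length 3δ (the minimum time a process stays in the system), so some process survives the whole window only when 3δc < 1. The sketch below works this arithmetic out with exact fractions; the function names are hypothetical.

```python
from fractions import Fraction


def max_churn_rate(delta) -> Fraction:
    """Strict upper limit on c from the safety condition c < 1/(3*delta)."""
    return 1 / (3 * Fraction(delta))


def guaranteed_survivors(n: int, c, delta) -> Fraction:
    """Lower bound on processes present throughout a window of length 3*delta.

    At most c*n processes are replaced per time unit, so at most
    3*delta*c*n of the n processes can leave during the window.
    """
    return n - 3 * Fraction(delta) * Fraction(c) * n


print(max_churn_rate(2))
print(guaranteed_survivors(100, Fraction(1, 10), 2))
print(guaranteed_survivors(100, max_churn_rate(2), 2))
```

At the limit c = 1/(3δ) the guaranteed number of survivors drops to zero, which is why the inequality has to be strict.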


Horizontal Quorums for Register Persistence


[Figure: membership of the computation (process identifiers 1 to 9) over successive windows of length 3δ; processes leave and new ones join, and the legend distinguishes active processes from non-active (joining) processes]

The register persistence is preserved iff the churn is below a given bound that depends on the protocol implementation

Eventually Synchronous System

Assumption
• There exists a time t after which there is a bound δ such that any message sent (broadcast) at time τ ≥ t is received (delivered) by time τ + δ by the processes that are in the system during the interval [τ, τ + δ].
• There exists a time t after which c < 1/(3δ).
• A process remains in the system at least 3δ

Algorithm
• Read: global
• Write: global
• Join: global

Vertical Quorums for Register Validity in Asynchronous Periods

Validity of the read: During asynchrony periods, to be sure to read the last written value you need to read/write registers from a majority of processes in the system (you no longer have the guarantee that messages are delivered within a known bound)

Termination. Assume that |A(t)| > n/2 (i.e., a majority of processes is active at any time). If a process invokes join(), read() or write(), and does not leave the system, it terminates its operation.

Safety. Assume that |A(t)| > n/2. A read operation returns the last value written before the read invocation, or a value written by a write operation concurrent with it.
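The reason a majority suffices is that any two majorities intersect, so a read quorum always contains at least one process that saw the latest completed write. The sketch below checks this exhaustively for a fixed membership of n = 5; it ignores churn and timing, and the names `majority` and `quorum_read` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n."""
    return n // 2 + 1


def quorum_read(copies: Dict[int, Tuple[int, str]], quorum: Iterable[int]) -> str:
    """Return the value with the highest sequence number among the quorum."""
    sn, value = max(copies[pid] for pid in quorum)
    return value


n = 5
copies = {pid: (0, "old") for pid in range(n)}
for pid in (0, 1, 2):  # a write acknowledged by a majority only
    copies[pid] = (1, "new")

# Try every majority-sized read quorum: each one intersects {0, 1, 2}.
results = {quorum_read(copies, q) for q in combinations(range(n), majority(n))}
print(results)
```

Every possible read quorum returns the newly written value, even though two of the five copies are stale.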

Asynchronous System
• There is no bound on message transfer delays

Theorem: It is not possible to implement a regular register in a fully asynchronous dynamic system.

The result is similar to the one of [Attiya, Bar-Noy, Dolev JACM95] when considering a static system with any number of process failures

Regular Register with Byzantine Failures



• Composed of an arbitrarily large set of clients c1... cm
• Dynamic: servers may join and leave (infinite arrival model)
• Join_System() operation: connects new processes to the system
• Leave_System() operation: passive leave

[Figure: layered architecture: Connection Layer (e.g. Overlay Management Protocol), (Authenticated) Communication Layer (Best-effort Semantics), Distributed Computation (i.e. Regular Register)]

Computation Model

• Clients are correct
• No information about register state
• Clients trigger read() and write() operations

Computation Model

• Initially n servers are part of the register computation
• Up to f byzantine failures (f < n/3)
• Servers maintain locally a copy of the register value
• Alternating periods of churn and stability
• No stable processes
• In churn periods the server set is refreshed by cn servers in each time unit (c ∈ [0, 1])

[Figure: clients issue Write(v) and Read() to servers holding copies v and vx; new servers enter through Join_Server()]

Requirements

Write Persistency: Servers maintain the last value written by a write operation despite server departures

Byzantine Resiliency: There are always at least f+1 servers maintaining the same value

Read Validity: any read() operation returns the last value written by a completed write() or a value concurrently written
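The f+1 requirement matters because, with at most f byzantine servers, f+1 identical replies must include at least one from a correct server. The sketch below shows one plausible way a client could exploit this; the selection rule and the names `Reply` and `select_value` are assumptions for illustration, not the protocol presented in the talk.

```python
from collections import Counter
from typing import List, Optional, Tuple

Reply = Tuple[int, str]  # (sequence number, value) reported by one server


def select_value(replies: List[Reply], f: int) -> Optional[str]:
    """Pick the newest value vouched for by at least f + 1 servers.

    Assumed rule: a (sn, value) pair is trusted only if f + 1 servers
    report it identically. Returns None if no pair qualifies.
    """
    counts = Counter(replies)
    trusted = [reply for reply, k in counts.items() if k >= f + 1]
    if not trusted:
        return None
    return max(trusted)[1]  # highest sequence number among trusted pairs


replies = [(2, "v"), (2, "v"), (9, "x"), (1, "u"), (1, "u")]
print(select_value(replies, f=1))
```

The single reply carrying the very high sequence number is ignored because no second server confirms it.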

Issues in read() operations

[Figure: contents of the servers' copies (v, x, y) along a time axis marked t1, t2, ti, tk]

Validity Bound

Consider a generic protocol P = {AJS, AR, AW} implementing a regular register such that
1) every operation eventually terminates and
2) there exists a period of churn longer than the longest operation issued on the register

Theorem: Let AJS, AR and AW be the algorithms implementing respectively the join_Server(), read() and write() operations. Let tj, tr and tw be the maximum time intervals needed by these algorithms to terminate the operation. If

c ≥ min { (n-3f)/(n tr), (n-3f)/(n (tj + tw)) }

then it is not possible to ensure both write persistency and read validity
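To get a feel for the quantity the theorem compares c against, the following sketch evaluates min{(n-3f)/(n tr), (n-3f)/(n (tj + tw))} with exact fractions. The function name and the sample parameters are illustrative assumptions.

```python
from fractions import Fraction


def churn_threshold(n: int, f: int, t_r: int, t_j: int, t_w: int) -> Fraction:
    """min{(n-3f)/(n*t_r), (n-3f)/(n*(t_j+t_w))} from the theorem."""
    read_term = Fraction(n - 3 * f, n * t_r)
    join_write_term = Fraction(n - 3 * f, n * (t_j + t_w))
    return min(read_term, join_write_term)


print(churn_threshold(n=10, f=1, t_r=2, t_j=3, t_w=1))
```

For n = 10, f = 1, tr = 2, tj = 3 and tw = 1 the two terms are 7/20 and 7/40, so the threshold is 7/40: slower join and write operations, or more byzantine servers, lower the churn rate a protocol can tolerate.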

Validity Bound in a synchronous system
• TimelyBroadcastDelivery (TBDel): There exists a known and finite bound δ such that every message broadcast at some time t is delivered by time t + δ.
• TimelyChannelDelivery (TCDel): There exists a known and finite bound δ’ < δ such that every message sent at some time t is delivered by time t + δ’.

Pictorial Related Work and summary of results for Regular Register

[Figure: three axes: System Model (asynchronous, eventually synchronous, synchronous), Churn Model (static, quiescent, continuous) and Failure Model (crash, byzantine), locating Aguilera et al. PODC 2010, Baldoni et al. ICDCS 2009 and Baldoni et al. PODC 2011]

[Table: results per system model and failure model under No Churn, Quiescent Churn and Continuous Churn. Entries: BFT papers; Synch, crash: Baldoni et al. ICDCS 2009; Synch, byzantine: Baldoni et al. PODC 2011 (ba); Event. Synch, crash: Baldoni et al. ICDCS 2009; Event. Synch, byzantine: Open Problem; Asynch, crash: Aguilera et al. 2009, Impossible; Asynch, byzantine: Open Problem]

Other Abstractions we faced

Set object (Europar 2010, EWDC2011)
• More complex semantics than that of registers
• The set contains all its history

Main result: It is not possible to implement a set object in an eventually synchronous distributed system prone to continuous churn if:
a) Processes have only finite memory space for local computation
b) Accesses to the set are continuous
c) There are no stable processes participating in the set computation

k-bounded set in an eventually synchronous distributed system

Other Abstractions we faced

Leader Election (EDCC2010)
• There is a bounded set of (good) processes that get into the computation and remain forever (no one knows who they are)
• Churn is continuous
• Communication is synchronous with finite losses and unknown maximum transfer delay

Risk: electing an infinite sequence of processes that leave the system (bad processes)

Main result: «under these assumptions we can implement leader election»

Done in 2 steps

The HB* Oracle
• Provides a list of processes deemed to be up (alive list). The list aims to:
  • Put good processes on the top of the list
  • Stabilize the position of a good process in the list

Protocol
• Takes the list provided by the HB* protocol and outputs the leader

[Figure: HB* exchanges messages through unicast send/receive and multicast/receive and hands the alive list to the protocol, which outputs the leader]
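The two-step structure can be sketched as follows. The ordering rule used here (processes heard from recently, longest-lived first) is only an assumed stand-in for HB*, whose actual ranking is not given on the slides; the function names and the heartbeat bookkeeping are hypothetical.

```python
from typing import Dict, List, Optional


def alive_list(first_seen: Dict[int, int], last_seen: Dict[int, int],
               now: int, timeout: int) -> List[int]:
    """Step 1 (assumed rule): processes heard from within `timeout`,
    ordered so that the longest-lived ones sit at the top of the list."""
    alive = [p for p, t in last_seen.items() if now - t <= timeout]
    return sorted(alive, key=lambda p: (first_seen[p], p))


def leader(alive: List[int]) -> Optional[int]:
    """Step 2: output the process at the top of the alive list."""
    return alive[0] if alive else None


first_seen = {1: 0, 2: 5, 3: 8}
last_seen = {1: 3, 2: 19, 3: 20}
current = alive_list(first_seen, last_seen, now=20, timeout=2)
print(current, leader(current))
```

Process 1 arrived first but has been silent for too long, so it drops out of the list and the oldest process still sending heartbeats becomes the leader.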

Conclusion
• Dynamic Distributed Systems are everywhere
  • Most of today's systems are unmanaged to some extent
  • Some of the functionality has to be autonomic and not rely on a manager
• Dynamic Distributed Systems are unquestionably more complex than static ones; this leads to more complex solutions to solve the same problem
• Scalability and dynamicity are not synonymous
• Understanding how to implement abstractions in an efficient way, well-suited to dynamic distributed systems, is still an open and fascinating problem

One slide to remember

[Figure: #processes versus time t with N, A(t) under churn, and the correctness bound, annotated “Liveness and Safety issues”]

Movement of the bound is impacted by the system model. The weaker the system model is the more «static» the system becomes. This brings several impossibility results in presence of churn.