Distributed Systems 2006 Virtual Synchrony* *With material adapted from Ken Birman

Plan

We skip Section 18.2

Tracking group membership: We’ll base it on 2PC and 3PC

Fault-tolerant multicast: We’ll use membership

Ordered multicast: We’ll base it on fault-tolerant multicast

Tools for solving practical replication and availability problems: we’ll base them on ordered multicast

Robust Web Services: We’ll build them with these tools

2PC and 3PC: Our first “tools” (lowest layer)


Using group communication as a building block

Communication to the group

Using communication inside the group


How to communicate to a group?

We have primarily looked at communication within groups

Often process groups provide a service that processes outside the group need to use
– How do we multicast to the group from outside the group?

A number of solutions
– Join the group...
– Use a well-known member of the group
– Cache a group view


Using a well-known group member

The non-member process issues an RPC to a member process
– Assigns an ID; reissues the request on failure
– The member process multicasts on behalf of the non-member

What about ordering?
– Can requests submitted by different non-members be considered concurrent?
– Complex in the event of failure
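The relay scheme above can be sketched as follows. This is a minimal, single-process illustration, not a real RPC layer: the names `GroupMember`, `Client`, `relay` and the `drop_first_ack` switch are hypothetical, and the list `delivered` merely stands in for the group multicast. It shows why the request ID matters: a reissued request is relayed only once.

```python
import itertools


class GroupMember:
    """Hypothetical well-known member that multicasts for non-members."""

    def __init__(self):
        self.seen = set()      # request IDs already relayed
        self.delivered = []    # stand-in for the multicast to the group

    def relay(self, request_id, payload):
        # A reissued request carries the same ID, so it is relayed once.
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.delivered.append(payload)
        return "ack"


class Client:
    """Hypothetical non-member that tags each request with a unique ID."""

    def __init__(self, name):
        self.name = name
        self.counter = itertools.count()

    def send(self, member, payload, drop_first_ack=False):
        request_id = (self.name, next(self.counter))
        reply = member.relay(request_id, payload)
        if drop_first_ack:
            # Simulate a lost reply: reissue the request with the same ID.
            reply = member.relay(request_id, payload)
        return reply


member = GroupMember()
client = Client("c1")
client.send(member, "update-x", drop_first_ack=True)
print(member.delivered)
```

In this sketch the set of seen IDs lives in one member only. If that member fails, another member would need the same information to filter duplicates, which is part of why the slide calls the failure case complex.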


Caching a group view

Non-member gets a view of the group, possibly outdated
– Client communicates directly with processes in cached view

View obtained, e.g., through
– RPC to group member
– By joining (larger) group to get view

Three cases of multicast arrival
– 1) entirely in correct view: correct
– 2) entirely late: ignore/report, client retries the multicast
– 3) partially correct, partially late: flush at view installation will deliver message correctly

Often cheaper than RPC-based solution
– Still complex with ordering
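The three arrival cases can be written down as a small classification. This is only a sketch under assumptions: the function name `classify_arrival`, the use of integer view IDs, and the returned labels are all hypothetical, chosen to mirror the cases listed above.

```python
def classify_arrival(cached_view_id, receiver_view_ids):
    """Classify a client multicast sent under cached_view_id.

    receiver_view_ids holds, for each receiver, the ID of the view it
    had installed when its copy of the multicast arrived.
    """
    in_cached_view = [v == cached_view_id for v in receiver_view_ids]
    if all(in_cached_view):
        return "deliver"   # case 1: entirely in the correct view
    if not any(in_cached_view):
        return "reject"    # case 2: entirely late, ignore/report and retry
    return "flush"         # case 3: mixed, the view-change flush completes delivery


print(classify_arrival(7, [7, 7, 7]))
print(classify_arrival(7, [8, 8, 8]))
print(classify_arrival(7, [7, 8, 7]))
```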


Using communication within groups

What is a “natural” abstraction for working with group communication?

Reconsider the transactional model: “virtual serial execution”
– Transactions appear to run in isolation (cf. ACID)
– Database free to optimize by interleaving execution of transactions

“Virtual Synchrony”
– All group members see the same events in the same order
– Group communication system free to optimize

Insight
– Enables us to run the same algorithm on all processes, which will then stay in the same state...


A synchronous execution

With true synchrony executions run in genuine lock-step – “state machine execution” possible

[Figure: timelines of processes p, q, r, s, t, u]


Virtual Synchrony at a glance

With virtual synchrony executions only look “lock step” to the application

[Figure: timelines of processes p, q, r, s, t, u]


Virtual Synchrony at a glance

We use the weakest (hence fastest) form of ordered communication possible

[Figure: timelines of processes p, q, r, s, t, u]


Elements of Virtual Synchrony

Support for process groups
– Processes may join and leave dynamically
– Excluded if (thought to have) failed

Various reliable, ordered multicast protocols
– Strive to replace synchronous, totally-ordered, dynamically uniform protocols

View-synchronous delivery
– Two processes that are members of the same view receive the same set of multicasts during that view

Identical process group views and rankings
– Identical sequences of group membership lists

Gap-freedom guarantees
– If, after a failure, m1 has been delivered to a destination, then any message m0 that should be delivered prior to m1 is also delivered

State transfer for joining processes
– A new process may obtain group state from an existing member
– (Will develop this shortly)


Why “Virtual” Synchrony?

The user sees what looks like a synchronous execution
– Simplifies the developer’s task

But the actual execution is rather concurrent and asynchronous
– Maximizes performance
– Reduces risk that lock-step execution will trigger correlated failures


Correlated failures

Why do we claim that virtual synchrony makes these less likely?

Recall that many programs are buggy
– Often these are Heisenbugs (order-sensitive)

With lock-step execution each group member sees group events in identical order
– So all die in unison

With virtual synchrony orders differ
– So an order-sensitive bug might only kill one group member!


Why Virtual Synchrony again?

So what can we build using Virtual Synchrony?

Distributed Algorithms
– Election
– Consensus
– Consistent snapshot
  • Saw this last time
– Replicated data and synchronization
– State transfer
– Load-balancing
– Fault tolerance
  • Primary-backup
  • Coordinator-cohort


Election

We need to choose (in a distributed way) some process to take a special role in an algorithm?

Processes that might participate join an appropriate group

Now the group view gives a simple leader election rule
– Everyone sees the same members, in the same order, ranked by when they joined
– Leader can be, e.g., the “oldest” process

We have already used this in a number of places
– E.g., abcast
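Because every member sees the same ranked view, the election rule needs no extra messages. The sketch below assumes a hypothetical `View` record whose `members` tuple is ordered by join time, oldest first; `leader` and `remove_failed` are illustrative names, not part of any real toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Hypothetical group view: members are ranked oldest first."""
    view_id: int
    members: tuple


def leader(view):
    # Every member holds the same view, so all pick the same leader.
    return view.members[0]


def remove_failed(view, failed):
    # The membership service would install this new view at all survivors.
    survivors = tuple(m for m in view.members if m != failed)
    return View(view.view_id + 1, survivors)


v1 = View(1, ("p", "q", "r"))
v2 = remove_failed(v1, "p")
print(leader(v1), leader(v2))
```

When the oldest process fails, the next view simply promotes the second oldest, and all survivors agree on that without running a separate election protocol.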


Consensus

A set of processes need to agree on some decision/value?

A group can easily solve consensus
– Leader multicasts: “What’s your input?”
– All reply: “Mine is 0”, “Mine is 1”
– Leader picks the most common value and multicasts that: the “decision value”
– If the leader fails, the new leader just restarts the algorithm
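The leader-driven rounds above can be sketched as a single function. Assumptions are flagged here and in the comments: `decide` and its parameters are hypothetical, message exchange is replaced by a dictionary lookup, a crash is modelled as a member missing from the new view, and ties are broken by taking the smallest value (the slide does not specify a tie-break).

```python
from collections import Counter


def decide(members, inputs, crashed=frozenset()):
    """Return (leader, decision) for one round of the leader-based scheme.

    members: group members ranked oldest first (the current view).
    inputs:  maps each member to its proposed value.
    crashed: members excluded from the view (assumed failure model).
    """
    alive = [m for m in members if m not in crashed]
    if not alive:
        raise ValueError("no surviving members")
    leader = alive[0]                      # oldest surviving member leads
    replies = [inputs[m] for m in alive]   # "what's your input?"
    counts = Counter(replies)
    # Most common value; ties broken by smallest value (an assumption).
    value = min(counts, key=lambda v: (-counts[v], v))
    return leader, value


inputs = {"p": 0, "q": 1, "r": 1}
print(decide(["p", "q", "r"], inputs))
# If p fails, q becomes leader and restarts the round.
print(decide(["p", "q", "r"], inputs, crashed={"p"}))
```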

Famous theoretical result (“FLP”):
– In the asynchronous model, deterministic distributed consensus is impossible if even one process may fail

Does this apply here?


Replicated Data and Synchronization

We want to replicate data and synchronize on updates among a set of processes?

Want to support READ, UPDATE, LOCK operations on replicated data

First look at synchronous, non-uniform case


Synchronous, Replicated Data

Use groups...
– Joining processes
  • Transfer data from existing processes (more later)
– Maintain replica of data at each process

READ
– Read local copy

UPDATE
– Asynchronously multicast update using abcast, i.e., no waiting

LOCK
– Nonexclusive read locks can be implemented locally
– Acquire exclusive lock by abcasting to group members for them to grant the lock
  • Recipient just grants requests for lock in order received, provided no non-exclusive read is pending locally
  • Sender waits until all recipients have acquired the lock

Fault-tolerant, consistent
– Replicas start in identical state
– Subsequent update and lock operations behave identically at all copies: same events, same order
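The READ and UPDATE paths can be illustrated with a toy model. This is a sketch under strong assumptions: `Group.abcast` is a hypothetical stand-in that fixes one delivery order by looping over replicas in a single process, which is not how a real abcast protocol works, and state transfer is reduced to copying a dictionary from an existing replica.

```python
class Replica:
    """One group member holding a full copy of the data."""

    def __init__(self, state=None):
        self.data = dict(state or {})

    def read(self, key):
        # READ: purely local, no communication.
        return self.data.get(key)

    def deliver(self, update):
        key, value = update
        self.data[key] = value


class Group:
    """Toy stand-in for a process group with totally ordered multicast."""

    def __init__(self):
        self.replicas = []

    def join(self):
        # State transfer: copy the data of an existing member, if any.
        state = self.replicas[0].data if self.replicas else {}
        replica = Replica(state)
        self.replicas.append(replica)
        return replica

    def abcast(self, update):
        # Every replica delivers updates in the same order.
        for replica in self.replicas:
            replica.deliver(update)


group = Group()
a, b = group.join(), group.join()
group.abcast(("x", 1))
group.abcast(("x", 2))
c = group.join()          # late joiner receives the current state
print(a.read("x"), b.read("x"), c.read("x"))
```

Since all replicas start from the same state and apply the same updates in the same order, their copies stay identical, which is the consistency argument made on the slide.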


Can we do better?

How many abcasts can be replaced with asynchronous cbcasts?

All, if we modify the protocol a bit...

All UPDATEs guarded by appropriate locks
– Want mutual exclusion on data that concurrent UPDATEs could change

LOCKs established by passing write rights


cbcast and locking

For possibly conflicting updates, there will be only one writer at a time

– The initial writer is the first process to join the group

To make a LOCK request, a process sends a cbcast and waits

– Receiving processes queue requests

When writer is done, it will send a cbcast that grants the lock

– It takes the oldest request on its LOCK queue and grants the lock to that process

– Other processes dequeue the request and grant (when local read locks are done)

Does this work?
– Do group members perform the same sequence of updates?


cbcast and locking

Sequence of conflicting updates identical at all replicas
– There is only one writer at a time for a given lock
– Updates and lock grants follow the same causal path
– (The causal order is a total order along a causal path)

Non-conflicting updates may be applied in any order
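The argument can be checked on a small example. The sketch below models what one replica does as cbcasts are delivered; `LockState` and its method names are hypothetical, and the cbcast layer itself is not implemented. The example assumes only what causal order gives us: concurrent LOCK requests may arrive in different orders at different replicas, while each writer's updates and its grant arrive in the order sent.

```python
from collections import deque


class LockState:
    """One replica's view of a single lock, driven by delivered cbcasts."""

    def __init__(self, first_member):
        self.writer = first_member   # initial writer: first to join
        self.queue = deque()         # pending LOCK requests
        self.applied = []            # updates applied at this replica

    def on_request(self, requester):
        self.queue.append(requester)

    def on_update(self, sender, update):
        if sender != self.writer:
            raise ValueError("update from a process that is not the writer")
        self.applied.append(update)

    def on_grant(self, grantee):
        # The grant names the new writer; drop its request from the queue.
        self.queue.remove(grantee)
        self.writer = grantee


r1, r2 = LockState("p"), LockState("p")
# Concurrent requests from q and r arrive in different orders.
r1.on_request("q"); r1.on_request("r")
r2.on_request("r"); r2.on_request("q")
for replica in (r1, r2):
    replica.on_update("p", "u1")   # p updates while holding the lock
    replica.on_grant("q")          # p passes write rights to q
    replica.on_update("q", "u2")   # q's update causally follows the grant
print(r1.applied == r2.applied, r1.writer, r2.writer)
```

Even though the two request queues were built in different orders, both replicas apply the same sequence of conflicting updates, because each update is causally ordered after the grant that enabled it.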


Performance

Mostly asynchronous
– UPDATEs may be applied locally as soon as the UPDATE cbcast is sent
  • Causally prior updates will already have been delivered to sender
– Waiting for LOCKs necessary unless we already have the lock...

“85,000 updates per second”
– Cf. quorum read and update
  • Forces into lockstep for reads and updates
  • “100 reads and updates (combined) per second”
– Not dynamically uniform, though
  • E.g., safe cbcast or cbcast followed by flush


Summary

We introduced “Virtual Synchrony” as a powerful distributed programming abstraction
– Processes appear to run in lock-step; middleware may optimize
– Based on tools developed so far: membership, group communication

Gave examples of what can be built on top of Virtual Synchrony
– More next time...
