
Angelo Corsaro, PhD, Chief Technology Officer

[email protected]

Classical Distributed Algorithms with DDS


The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable and high-performance systems.

However, in building distributed systems with DDS one is often faced with two kinds of problems:

- How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.

- How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.

In this presentation we will look at how DDS can be used to implement some of the classical distributed algorithms that solve these problems.

Context


A Topic defines a domain-wide class of information

A Topic is defined by means of a (name, type, qos) tuple, where

• name: identifies the topic within the domain

• type: is the programming language type associated with the topic. Types are extensible and evolvable

• qos: is a collection of policies that express the non-functional properties of this topic, e.g. reliability, persistence, etc.

Topic

[Diagram: a Topic is the association of a Name, a Type and a QoS]

struct TemperatureSensor {
    @key long sid;
    float temp;
    float hum;
};


For data to flow from a DataWriter (DW) to one or more DataReaders (DR), a few conditions have to hold:

- The DR's and DW's DomainParticipants have to be in the same domain

- The partition expressions of the DR's Subscriber and the DW's Publisher have to match (in terms of regular expression matching)

- The QoS policies offered by the DW have to match or exceed those requested by the DR (see the matching sketch after the diagram below)

Quality of Service

[Diagram: DDS entity model. A DomainParticipant joins a Domain (Domain Id); a Publisher's DataWriter writes a Topic which a Subscriber's DataReader reads. The writer side offers QoS and the reader side requests QoS; the RxO (Requested vs. Offered) policies include DURABILITY, OWNERSHIP, DEADLINE, LATENCY BUDGET, LIVELINESS, RELIABILITY, DESTINATION ORDER and PARTITION.]
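To illustrate the RxO rule in isolation, here is a minimal Scala sketch: the policy value lists and the satisfies helper are simplifications for illustration only, not part of the DDS API.

// Minimal sketch of RxO (Requested vs. Offered) matching for two sample
// policies. The enumerations and their ordering are simplified assumptions.
object RxO {
  // Stronger policy values appear later in the list
  val reliability = List("BEST_EFFORT", "RELIABLE")
  val durability  = List("VOLATILE", "TRANSIENT_LOCAL", "TRANSIENT", "PERSISTENT")

  // The offered value satisfies the request if it is at least as strong
  def satisfies(scale: List[String], offered: String, requested: String): Boolean =
    scale.indexOf(offered) >= scale.indexOf(requested)

  def main(args: Array[String]): Unit = {
    // A DW offering (RELIABLE, TRANSIENT) matches a DR requesting (RELIABLE, VOLATILE)
    val matched = satisfies(reliability, offered = "RELIABLE", requested = "RELIABLE") &&
                  satisfies(durability, offered = "TRANSIENT", requested = "VOLATILE")
    println("DW/DR QoS match: " + matched)
  }
}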


We can think of a DDS Topic as defining a group

The members of this group are matching DataReaders and DataWriters

DDS' dynamic discovery manages this group membership; however, it provides only a low-level interface to group management and eventual consistency of views

In addition, the group view provided by DDS exposes matched readers on the writer side and matched writers on the reader side

This is not sufficient for certain distributed algorithms.

Membership

[Diagram: DataWriter Group View (a DW's view of the DRs matched on its Topic) and DataReader Group View (a DR's view of the DWs matched on its Topic)]


A Group Management abstraction should provide the ability to join/leave a group, to obtain the current view, and to detect failures of group members

Ideally group management should also provide the ability to elect leaders

A Group Member should represent a process

Group Management

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)

  // Reactions handling Group Events
  val reactions: Reactions
}

case class MemberJoin(val mid: Int)
case class MemberLeave(val mid: Int)
case class MemberFailure(mid: Int)
case class EpochChange(epoch: Long)
case class NewLeader(mid: Option[Int])


Events provide notification of group membership changes

These events are handled by registering partial functions with the Group reactions

Example [1/2]

object GroupMember {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)

    group.join(mid)

    val printGroupView = () => {
      print("Group[" + gid + "] = { ")
      group.view foreach (m => print(m + " "))
      println("}")
    }

    group listen {
      case MemberFailure(mid) => {
        println("Member " + mid + " Failed.")
        printGroupView()
      }
      case MemberJoin(mid) => {
        println("Member " + mid + " Joined")
        printGroupView()
      }
      case MemberLeave(mid) => {
        println("Member " + mid + " Left")
        printGroupView()
      }
    }
  }
}


An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change

A group epoch change takes place each time there is a change in the group view

The leader is eventually elected only if a majority of the processes currently in the view agree

Otherwise the group leader is set to "None"

Example [2/2]

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: EventualLeaderElection <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)

    group.join(mid)

    group listen {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = " + l)
    }
  }
}


A relatively simple Distributed Mutex Algorithm was proposed by Leslie Lamport as an example application of Lamport's Logical Clocks

The basic protocol (with the Agrawala optimization) works as follows (sketched; a code sketch of the decision rule follows the list):

- When a process needs to enter a critical section, it sends a MUTEX request tagged with its current logical clock

- The process obtains the Mutex only when it has received ACKs from all the other processes in the group

- When a process receives a Mutex request, it sends an ACK only if it does not have an outstanding Mutex request timestamped with a smaller logical clock

Lamport’s Distributed Mutex
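Before looking at the DDS-based implementation, here is a minimal, single-process Scala sketch of just the decision rule described above; the names (Clock, LamportMutexRule) are illustrative and are not part of the LCMutex code shown next.

// Requests are totally ordered by (logical clock, process id); a peer's
// request is ACKed immediately unless we have an outstanding request with
// a smaller timestamp, in which case the ACK is deferred.
case class Clock(ts: Long, pid: Int)

object Clock {
  implicit val ordering: Ordering[Clock] = Ordering.by(c => (c.ts, c.pid))
}

class LamportMutexRule(val pid: Int) {
  private var myRequest: Option[Clock] = None

  // Called when this process wants to enter the critical section
  def requestMutex(now: Long): Clock = {
    val req = Clock(now, pid)
    myRequest = Some(req)
    req
  }

  // true  => ACK the peer's request immediately
  // false => defer the ACK until our own request has been served
  def shouldAck(peerRequest: Clock): Boolean =
    myRequest.forall(mine => Clock.ordering.lt(peerRequest, mine))

  // Called when this process leaves the critical section
  def releaseMutex(): Unit = { myRequest = None }
}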


The LCMutex is one of the possible Mutex protocols, implementing the Agrawala variation of the classical Lamport Algorithm

LCMutex

class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {

  private var group = Group(gid)
  private var ts = LogicalClock(0, mid)
  private var receivedAcks = new AtomicLong(0)

  private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
  private var myRequest = LogicalClock.Infinite

  private val reqDW = DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
  private val reqDR = DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)

  private val ackDW = DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
  private val ackDR = DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)

  private val ackSemaphore = new Semaphore(0)


LCMutex.onACK

ackDR listen {
  case DataAvailable(dr) => {
    // Count only the ACKs addressed to us
    val acks = ((ackDR take) filter (_.amid == mid))
    val k = acks.length

    if (k > 0) {
      // Set the local clock to max(tsi, tsj) + 1
      synchronized {
        val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
        ts = LogicalClock(maxTs, ts.id)
      }
      val ra = receivedAcks.addAndGet(k)
      val groupSize = group.size
      // If we have received sufficiently many ACKs we can enter our Mutex!
      if (ra == groupSize - 1) {
        receivedAcks.set(0)
        ackSemaphore.release()
      }
    }
  }
}


One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the Eventual Queue Invariants

The advantage of this approach is that the latency of dequeues is minimized and the throughput of enqueues is maximized (we'll see later that this is really a property of the eventual queue)

The disadvantage, for some use cases, is that each consumer needs to store the whole queue locally; thus this solution is mainly applicable to symmetric environments running on LANs

Eventual Queue on DDS


All enqueued elements will eventually be dequeued

If the queue is empty, a dequeue returns nothing

If the queue is non-empty, a dequeue might return something

Elements might be dequeued in a different order than they are enqueued

- This essentially means that we can have a different local order for the queue elements on each consumer, which in turn means that we can distribute enqueued elements by simple DDS writes!

- The implication of this is that the enqueue operation is going to be as efficient as a DDS write

- Finally, to ensure eventual consistency in the presence of writer faults we'll take advantage of OpenSplice's FT-Reliability!

Eventual Queue Invariants & DDS


A possible Dequeue protocol can be derived from the Lamport/Agrawala Distributed Mutual Exclusion Algorithm

The general idea is similar, as we want to order dequeues rather than accesses to a critical section; however, there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants

Key Issues to Address

- DDS provides eventual consistency; thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)

- Once a process has gained the right to dequeue, it has to be sure that it can pick an element that nobody else has picked just before. Then it has to ensure that, before it allows anybody else to pick a value, its choice has been popped from all other local queues (an outline of these steps follows below)

Dequeue Protocol: General Idea
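Putting those points together, a hypothetical outline of the dequeue side is sketched below; the class name and the step details are assumptions, only the overall structure is implied by the protocol description above and by the command types introduced later.

// Hypothetical outline of a consumer's dequeue, derived from the
// Lamport/Agrawala mutex: DEQUEUE plays the role of the mutex request,
// and POP makes the chosen element disappear from every local queue.
class EventualQueueConsumer[T] {
  def sdequeue(): Option[T] = {
    // 1. Broadcast a DEQUEUE command stamped with the local logical clock
    // 2. Wait for ACKs from all other consumers; requests are served in
    //    logical-clock order, so at most one consumer dequeues at a time
    // 3. If the local queue is empty, return None; otherwise pick the element
    //    that, by the protocol's ordering rule, every replica will agree on
    // 4. Broadcast a POP for the chosen element so that all consumers remove
    //    it from their local queues before anyone else is allowed to dequeue
    // 5. ACK any DEQUEUE requests that were deferred while we held the right
    None // placeholder for the steps above
  }
}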


To implement the Eventual Queue over DDS we use three different Topic Types

TQueueCommand represents all the commands used by the protocol (more on this later)

TQueueElement represents a writer-timestamped queue element

Topic Types

struct TLogicalClock {
    long long ts;
    long mid;
};

enum TCommandKind {
    DEQUEUE,
    ACK,
    POP
};

struct TQueueCommand {
    TCommandKind kind;
    long mid;
    TLogicalClock ts;
};
#pragma keylist TQueueCommand

typedef sequence<octet> TData;

struct TQueueElement {
    TLogicalClock ts;
    TData data;
};
#pragma keylist TQueueElement
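Given these types, an enqueue can indeed be a single DDS write of a TQueueElement. A minimal sketch follows; the elemDW writer, the use of Array[Byte] for TData, and the omission of the generic serialization done by Enqueue[T] are assumptions.

// Hypothetical sketch of the enqueue side: stamp the element with the local
// logical clock and publish it with a plain DDS write; each consumer's
// DataReader appends it to its local queue.
def enqueue(data: Array[Byte]): Unit = {
  ts = LogicalClock(ts.ts + 1, mid)                            // advance the writer's clock
  elemDW.write(TQueueElement(TLogicalClock(ts.ts, mid), data)) // assumed write API
}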


Example: Producer

object MessageProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageProducer <mid> <gid> <n> <samples>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val n = args(2).toInt
    val samples = args(3).toInt

    val group = Group(gid)
    group listen {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Enqueue[String]("CounterQueue", mid, gid)

    for (i <- 1 to samples) {
      val msg = "MSG[" + mid + ", " + i + "]"
      println(msg)
      queue.enqueue(msg)
      // Pace the write so that you can see what's going on
      Thread.sleep(300)
    }
  }
}


Example: Consumer

object MessageConsumer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageConsumer <mid> <gid> <readers-num> <n>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val rn = args(2).toInt
    val n = args(3).toInt

    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Queue[String]("CounterQueue", mid, gid, rn)

    val baseSleep = 1000
    while (true) {
      queue.sdequeue() match {
        case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
        case _       => println(Console.MAGENTA_B + "None" + Console.RESET)
      }
      val sleepTime = baseSleep + (math.random * baseSleep).toInt
      Thread.sleep(sleepTime)
    }
  }
}


Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his paper "The Part-Time Parliament"

The Paxos protocol works under asynchrony; to be precise, it is safe under asynchrony and makes progress under partial synchrony (guaranteeing both under asynchrony is impossible due to the FLP result). It admits a crash/recovery failure model

Paxos requires some form of stable storage

The theoretical specification of the protocol is very simple and elegant

Practical implementations of the protocol have to fill in many hairy details...

Paxos in Brief


The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):

- Proposers

- Acceptors

- Learners

To make progress, the protocol requires that a proposer acts as the leader in issuing proposals to acceptors on behalf of clients

The protocol is safe even if there are multiple leaders; in that case, however, progress might be sacrificed

- This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer

Paxos in Brief
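To make the three roles a little more concrete, here is a minimal sketch of the messages typically exchanged in the two phases of Paxos; the type names and fields are illustrative and are not taken from the presentation.

// Illustrative Paxos message types: a proposer runs Phase 1 (Prepare/Promise)
// to learn about previously accepted values, then Phase 2 (Accept/Accepted)
// to get a value chosen by a majority of acceptors.
case class Ballot(round: Long, proposerId: Int)

// Phase 1: the proposer asks acceptors to promise not to accept lower ballots;
// each acceptor replies with the highest-ballot value it has already accepted, if any
case class Prepare(ballot: Ballot)
case class Promise(ballot: Ballot, accepted: Option[(Ballot, Array[Byte])])

// Phase 2: the proposer asks acceptors to accept a value for its ballot
case class Accept(ballot: Ballot, value: Array[Byte])
case class Accepted(ballot: Ballot, value: Array[Byte])

// Learners observe Accepted messages; a value is chosen once a majority of
// acceptors have accepted it for the same ballot.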