
Classical Distributed Algorithms with DDS


DESCRIPTION

The OMG DDS standard has been witnessing very strong adoption as the distribution middleware of choice for a large class of mission- and business-critical systems, such as Air Traffic Control, Automated Trading, SCADA, Smart Energy, etc. The main reason for choosing DDS lies in its efficiency, scalability, high availability and configurability -- through its 20+ QoS policies. Yet, all of these nice properties come at the cost of a relaxed consistency model: no strong guarantees over global invariants. As a result, many architects have to devise, by themselves -- assuming the DDS primitives as a foundation -- the correct algorithms for classical problems such as fault detection, leader election, consensus, distributed mutual exclusion, atomic multicast, distributed queues, etc. In this presentation we will explore DDS-based distributed algorithms for many classical, yet fundamental, problems in distributed systems. For simplicity, we'll start with algorithms that ignore the presence of failures. Then we will (1) demonstrate how these algorithms can be extended to deal with failures, and (2) introduce Paxos as one of the fundamental algorithms for consensus and atomic broadcast. Finally, we'll show how these classical algorithms can be used to implement useful extensions of the DDS semantics, such as multi-writer / multi-reader distributed queues.


Page 1: Classical Distributed Algorithms with DDS


Angelo CORSARO, Ph.D.
Chief Technology Officer
OMG DDS SIG Co-Chair
PrismTech

[email protected]

Classical Distributed Algorithms with DDS [Developing Higher Level Abstractions on DDS]

Page 2: Classical Distributed Algorithms with DDS


Context

☐ The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable, and high-performance systems

☐ However, in building distributed systems with DDS one is often faced with two kinds of problems:
  ☐ How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.
  ☐ How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.

☐ In this presentation we will look at how DDS can be used to implement some of the classical distributed algorithms that solve these problems

Page 3: Classical Distributed Algorithms with DDS


DDS Abstractions and Properties

Page 4: Classical Distributed Algorithms with DDS


Data Distribution Service

DDS provides a Topic-Based Publish/Subscribe abstraction based on:

☐ Topics: data distribution subjects
☐ DataWriters: data producers
☐ DataReaders: data consumers

[Figure: the DDS Global Data Space, with DataWriters and DataReaders matched through Topics A..D]

Page 5: Classical Distributed Algorithms with DDS


Data Distribution Service

☐ DataWriters and DataReaders are automatically and dynamically matched by the DDS Dynamic Discovery

☐ A rich set of QoS policies allows one to control the existential, temporal, and spatial properties of data

[Figure: the same Global Data Space picture as on the previous slide]

Page 6: Classical Distributed Algorithms with DDS


DDS Topics

☐ A Topic defines a class of streams
☐ A Topic has associated a unique name, a user-defined extensible type, and a set of QoS policies
☐ QoS Policies capture the Topic's non-functional invariants
☐ Topics can be discovered or locally defined

[Figure: a Topic as a (Name, Type, QoS) tuple, e.g. Name: "Circle", "Square", "Triangle", ...; Type: ShapeType; QoS: DURABILITY, DEADLINE, PRIORITY, ...]

struct ShapeType {
  @Key string color;
  long x;
  long y;
  long shapesize;
};
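As a concrete sketch, a DataWriter for such a topic can be created and used roughly as follows. This reuses the DataWriter constructor shape that appears on the LCMutex slides later in this deck; the publisher, topic, and QoS values, as well as the generated ShapeType constructor, are illustrative assumptions rather than literal Escalier API.

val shapeDW = DataWriter[ShapeType](publisher, circleTopic, dwQos)
// color is the @Key field, so this write updates the "red" instance
shapeDW ! new ShapeType("red", 10, 20, 30)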

Page 7: Classical Distributed Algorithms with DDS


Topic Instances

☐ Each unique key value identifies a unique stream of data
☐ DDS not only demultiplexes "streams" but also provides lifecycle information
☐ A DDS DataWriter can write multiple instances

[Figure: one Topic fanning out into three instances, keyed by color = "Green", "Red", "Blue"]

struct ShapeType {
  @Key string color;
  long x;
  long y;
  long shapesize;
};

Page 8: Classical Distributed Algorithms with DDS


Anatomy of a DDS Application

[Figure: containment hierarchy of a DDS application]
☐ Domain (e.g. Domain 123)
☐ Domain Participant
☐ Partition (e.g. "Telemetry", "Shapes", ...)
☐ Topic
☐ Publisher / DataWriter
☐ Subscriber / DataReader
☐ Topic Instances/Samples

Page 9: Classical Distributed Algorithms with DDS


Channel Properties

☐ We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel

☐ The properties of this channel are controlled by means of QoS Policies

☐ At the two extremes, this logical communication channel can be:
  ☐ Best-Effort/Reliable Last n-values Channel
  ☐ Best-Effort/Reliable FIFO Channel

[Figure: a DataWriter connected through a Topic to three DataReaders]

Page 10: Classical Distributed Algorithms with DDS


Last n-values Channel

☐ The last n-values channel is useful when modeling distributed state

☐ When n=1, the last-value channel provides a way of modeling an eventually consistent distributed state

☐ This abstraction is very useful if what matters is the current value of a given topic instance

☐ The QoS Policies that give a Last n-values Channel are:
  ☐ RELIABILITY = BEST_EFFORT | RELIABLE
  ☐ HISTORY = KEEP_LAST(n)
  ☐ DURABILITY = TRANSIENT | PERSISTENT [in most cases]
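A minimal sketch of consuming such eventually consistent state, in the reader/reactions style used on the LCMutex slides later in this deck (the subscriber, topic, and QoS names are assumptions):

val stateDR = DataReader[ShapeType](subscriber, circleTopic, lastValueQos)
stateDR.reactions += {
  case DataAvailable(_) =>
    // With HISTORY = KEEP_LAST(1), take returns at most the most recent
    // value per instance (i.e. per color): the current distributed state
    (stateDR take) foreach (s => println(s.color + " -> (" + s.x + ", " + s.y + ")"))
}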

Page 11: Classical Distributed Algorithms with DDS


FIFO Channel

☐ The FIFO Channel is useful when we care about every single sample that was produced for a given topic -- as opposed to the "last value"

☐ This abstraction is very useful when writing distributed algorithms over DDS

☐ Depending on QoS Policies, DDS provides:
  ☐ Best-Effort/Reliable FIFO Channel
  ☐ FT-Reliable FIFO Channel (using an OpenSplice-specific extension)

☐ The QoS Policies that give a FIFO Channel are:
  ☐ RELIABILITY = BEST_EFFORT | RELIABLE
  ☐ HISTORY = KEEP_ALL
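A sketch of the writer side of such a channel, in the same style (the publisher, topic, and QoS names are assumptions):

// RELIABILITY = RELIABLE and HISTORY = KEEP_ALL: samples are never
// overwritten, so each matched reader sees every sample, in writer order
val eventDW = DataWriter[TLogicalClock](publisher, eventTopic, fifoQos)
eventDW ! LogicalClock(1, mid)
eventDW ! LogicalClock(2, mid) // delivered after the first sample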

Page 12: Classical Distributed Algorithms with DDS


Membership

☐ We can think of a DDS Topic as defining a group

☐ The members of this group are the matching DataReaders and DataWriters

☐ DDS' dynamic discovery manages this group membership; however, it provides only a low-level interface to group management and eventual consistency of views

☐ In addition, the group view provided by DDS makes available matched readers on the writer side and matched writers on the reader side

☐ This is not sufficient for certain distributed algorithms.

[Figure: the DataWriter group view (matched DataReaders) and the DataReader group view (matched DataWriters)]

Page 13: Classical Distributed Algorithms with DDS


Fault-Detection

☐ DDS provides a built-in mechanism for detecting DataWriter faults through the LivelinessChangedStatus

☐ A writer is considered to have lost its liveliness if it has failed to assert it within its lease period
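A sketch of how an application might observe this; the LivelinessChanged event name is an assumption patterned on the DataAvailable reactions used later in this deck:

memberDR.reactions += {
  case LivelinessChanged(dr) =>
    // A matched writer missed its lease period:
    // one of the group members may have crashed
    println("liveliness changed: a matched DataWriter may have failed")
}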

Page 14: Classical Distributed Algorithms with DDS


System Model

Page 15: Classical Distributed Algorithms with DDS


System Model

☐ Partially Synchronous
  ☐ After a Global Stabilization Time (GST), communication latencies are bounded, yet the bound is unknown

☐ Non-Byzantine Fail/Recovery
  ☐ Processes can fail and restart but don't perform malicious actions

Page 16: Classical Distributed Algorithms with DDS


Programming Environment

☐ The algorithms that will be shown next are implemented on OpenSplice using the Escalier Scala API

☐ All algorithms are available as part of the open source project dada

☐ Scala: fastest growing JVM language. Open Source. www.scala-lang.org
☐ OpenSplice DDS: #1 OMG DDS implementation. Open Source. www.opensplice.org
☐ Escalier: Scala API for OpenSplice DDS. Open Source. github.com/kydos/escalier
☐ dada: DDS-based Advanced Distributed Algorithms Toolkit. Open Source. github.com/kydos/dada

Page 17: Classical Distributed Algorithms with DDS


Higher Level Abstractions

Page 18: Classical Distributed Algorithms with DDS


Group Management

☐ A Group Management abstraction should provide the ability to join/leave a group, provide the current view, and detect failures of group members

☐ Ideally, group management should also provide the ability to elect leaders

☐ A Group Member should represent a process

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)

  // Reactions handling Group Events
  val reactions: Reactions
}

case class MemberJoin(val mid: Int)
case class MemberLeave(val mid: Int)
case class MemberFailure(mid: Int)
case class EpochChange(epoch: Long)
case class NewLeader(mid: Option[Int])

Page 19: Classical Distributed Algorithms with DDS


Topic Types

☐ To implement the Group abstraction with support for Leader Election it is sufficient to rely on the following topic types:

enum TMemberStatus {
  JOINED, LEFT, FAILED, SUSPECTED
};

struct TMemberInfo {
  long mid; // member-id
  TMemberStatus status;
};
#pragma keylist TMemberInfo mid

struct TEventualLeaderVote {
  long long epoch;
  long mid;
  long lid; // voted leader-id
};
#pragma keylist TEventualLeaderVote mid

Page 20: Classical Distributed Algorithms with DDS


Topics

Group Management
☐ The TMemberInfo topic is used to advertise presence and manage the members' state transitions

Leader Election
☐ The TEventualLeaderVote topic is used to cast votes for leader election

This leads us to (see the sketch after this list):
☐ Topic(name = MemberInfo, type = TMemberInfo,
  QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
☐ Topic(name = EventualLeaderVote, type = TEventualLeaderVote,
  QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
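With these topics in place, joining a group reduces to a single write on the MemberInfo topic. A sketch in the style of the LCMutex slides later in this deck; groupPublisher, memberInfoTopic, and dwQos are assumed helper names:

val miDW = DataWriter[TMemberInfo](groupPublisher(gid), memberInfoTopic, dwQos)
// Advertise our presence; TransientLocal durability means late joiners
// automatically receive the latest status of every member
miDW ! new TMemberInfo(mid, TMemberStatus.JOINED)
// Leaving is just another write, with status = LEFT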

Page 21: Classical Distributed Algorithms with DDS


Observation

☐ Notice that we are using two Last-Value Channels for implementing both the (eventual) group management and the (eventual) leader election

☐ This makes it possible to:
  ☐ Let DDS provide our latest known state automatically, thanks to the TransientLocal Durability
  ☐ Avoid periodically asserting our liveliness, as DDS will do that for our DataWriter

Page 22: Classical Distributed Algorithms with DDS


Leader Election

[Figure: a timeline in which members M0, M1, M2 join and one member later crashes; each view change starts a new epoch (0 to 3), and in each epoch the leader goes from None to the elected member (M1, M1, M0, M0)]

☐ At the beginning of each epoch the leader is None
☐ At each new epoch a leader election algorithm is run

Page 23: Classical Distributed Algorithms with DDS


Distinguishing Groups

☐ To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group-related traffic will take place

[Figure: a DDS Domain containing partitions "1", "2", and "3"; the partition named "2" carries the traffic of the group with gid=2]

Page 24: Classical Distributed Algorithms with DDS


Example

☐ Events provide notification of group membership changes

☐ These events are handled by registering partial functions with the Group reactions

object GroupMember {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    val printGroupView = () => {
      print("Group[" + gid + "] = { ")
      group.view foreach (m => print(m + " "))
      println("}")
    }

    group.reactions += {
      case MemberFailure(mid) => {
        println("Member " + mid + " Failed.")
        printGroupView()
      }
      case MemberJoin(mid) => {
        println("Member " + mid + " Joined")
        printGroupView()
      }
      case MemberLeave(mid) => {
        println("Member " + mid + " Left")
        printGroupView()
      }
    }
  }
}

[1/2]

Page 25: Classical Distributed Algorithms with DDS


Example

☐ An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change

☐ A Group Epoch change takes place each time there is a change in the group view

☐ The leader is eventually elected only if a majority of the processes currently in the view agree

☐ Otherwise the group leader is set to None

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: EventualLeaderElection <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    group.reactions += {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = " + l)
    }
  }
}

[2/2]

Page 26: Classical Distributed Algorithms with DDS


Distributed Mutex

Page 27: Classical Distributed Algorithms with DDS


Lamport's Distributed Mutex

☐ A relatively simple Distributed Mutex Algorithm was proposed by Leslie Lamport as an example application of Lamport's Logical Clocks

☐ The basic protocol (with the Agrawala optimization) works as follows (sketched):
  ☐ When a process needs to enter a critical section, it sends a MUTEX request tagged with its current logical clock
  ☐ The process obtains the Mutex only when it has received ACKs from all the other processes in the group
  ☐ When a process receives a Mutex request, it sends an ACK only if it does not have an outstanding Mutex request timestamped with a smaller logical clock
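The dada library defines its own logical clock type; here is a minimal sketch consistent with how it is used on the LCMutex slides that follow (the real type also provides LogicalClock.Infinite). The total order on (ts, mid) pairs is what makes any two requests comparable, with the member-id acting as tie-breaker:

case class LogicalClock(ts: Long, mid: Int) extends Ordered[LogicalClock] {
  // Lamport timestamp order; the member-id breaks ties, so no two
  // requests from different members are ever "equal"
  def compare(that: LogicalClock): Int =
    if (ts != that.ts) ts compare that.ts else mid compare that.mid

  def inc() = LogicalClock(ts + 1, mid)
}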

Page 28: Classical Distributed Algorithms with DDS


Mutex Abstraction

☐ A base class defines the Mutex Protocol

☐ The Mutex companion uses dependency injection to decide which concrete mutex implementation to use (see the sketch below)

abstract class Mutex {
  def acquire()
  def release()
}
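A possible shape for that companion object; this is a sketch, with the system-property name and default being assumptions, and LCMutex being the implementation shown later in this deck:

object Mutex {
  // Dependency injection via a JVM property, e.g. -Ddada.mutex=lc
  def apply(mid: Int, gid: Int, n: Int)(implicit logger: Logger): Mutex =
    System.getProperty("dada.mutex", "lc") match {
      case "lc"  => new LCMutex(mid, gid, n)
      case other => throw new IllegalArgumentException("unknown mutex: " + other)
    }
}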

Page 29: Classical Distributed Algorithms with DDS


Foundation Abstractions

☐ The mutual exclusion algorithm essentially requires:
  ☐ FIFO communication channels between group members
  ☐ Logical Clocks
  ☐ MutexRequest and MutexAck Messages

These needs now have to be translated in terms of topic types, topics, readers/writers, and QoS settings

Page 30: Classical Distributed Algorithms with DDS


Topic Types

☐ For implementing the Mutual Exclusion Algorithm it is sufficient to define the following topic types:

struct TLogicalClock {
  long ts;
  long mid;
};
#pragma keylist TLogicalClock mid

struct TAck {
  long amid; // acknowledged member-id
  TLogicalClock ts;
};
#pragma keylist TAck ts.mid

Page 31: Classical Distributed Algorithms with DDS


Topics

We essentially need two topics:
☐ One topic for representing the Mutex Requests, and
☐ Another topic for representing the Acks

This leads us to:
☐ Topic(name = MutexRequest, type = TLogicalClock,
  QoS = {Reliability.Reliable, History.KeepAll})
☐ Topic(name = MutexAck, type = TAck,
  QoS = {Reliability.Reliable, History.KeepAll})

Page 32: Classical Distributed Algorithms with DDS


Show me the Code!

☐ All the algorithms presented were implemented using DDS and Scala

☐ Specifically, we've used the OpenSplice Escalier language mapping for Scala

☐ The resulting library has been baptized “dada” (DDS Advanced Distributed Algorithms) and is available under LGPL-v3

Page 33: Classical Distributed Algorithms with DDS


LCMutex

☐ The LCMutex is one of the possible Mutex protocols, implementing the Agrawala variation of Lamport's classical algorithm

class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {

  private var group = Group(gid)
  private var ts = LogicalClock(0, mid)
  private var receivedAcks = new AtomicLong(0)

  private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
  private var myRequest = LogicalClock.Infinite

  private val reqDW =
    DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
  private val reqDR =
    DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)
  private val ackDW =
    DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
  private val ackDR =
    DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)

  private val ackSemaphore = new Semaphore(0)

Page 34: Classical Distributed Algorithms with DDS


LCMutex.acquire

def acquire() {
  ts = ts.inc()
  myRequest = ts
  reqDW ! myRequest
  ackSemaphore.acquire()
}

Notice that as the LCMutex is single-threaded we can't issue concurrent acquires.
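Typical use of the abstraction, assuming a Mutex factory along the lines sketched earlier:

val mutex = Mutex(mid, gid, n)
mutex.acquire() // blocks until all the other group members have ACKed
try {
  // critical section: at most one group member runs this at a time
} finally {
  mutex.release() // ACKs the pending requests of the other members
}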

Page 35: Classical Distributed Algorithms with DDS


LCMutex.release

Notice that as the LCMutex is single-threaded we can't issue a new request before we release.

def release() {
  myRequest = LogicalClock.Infinite
  (pendingRequests dequeueAll) foreach { req =>
    ts = ts inc()
    ackDW ! new TAck(req.id, ts)
  }
}

Page 36: Classical Distributed Algorithms with DDS


LCMutex.onACK

ackDR.reactions += {
  case DataAvailable(dr) => {
    // Count only the ACKs addressed to us
    val acks = (ackDR take) filter (_.amid == mid)
    val k = acks.length

    if (k > 0) {
      // Set the local clock to max(tsi, tsj) + 1
      synchronized {
        val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
        ts = LogicalClock(maxTs, ts.id)
      }
      val ra = receivedAcks.addAndGet(k)
      val groupSize = group.size
      // If we have received sufficiently many ACKs we can enter our Mutex!
      if (ra == groupSize - 1) {
        receivedAcks.set(0)
        ackSemaphore.release()
      }
    }
  }
}

Page 37: Classical Distributed Algorithms with DDS


LCMutex.onReq

reqDR.reactions += {
  case DataAvailable(dr) => {
    val requests = (reqDR take) filterNot (_.mid == mid)

    if (!requests.isEmpty) {
      synchronized {
        val maxTs = math.max((requests map (_.ts)).max, ts.ts) + 1
        ts = LogicalClock(maxTs, ts.id)
      }
      requests foreach { r =>
        if (r < myRequest) {
          // r precedes our own request: ACK it right away
          ts = ts inc()
          ackDW ! new TAck(r.mid, ts)
        } else {
          // otherwise defer the ACK until we release the Mutex
          (pendingRequests find (_ == r)) getOrElse {
            pendingRequests.enqueue(r)
            r
          }
        }
      }
    }
  }
}

Page 38: Classical Distributed Algorithms with DDS


Distributed Queue

Page 39: Classical Distributed Algorithms with DDS


Distributed Queue Abstraction

☐ A distributed queue conceptually provides the ability to enqueue and dequeue elements

☐ Depending on the invariants that are guaranteed, the distributed queue implementation can be more or less efficient

☐ In what follows we'll focus on a relaxed form of distributed queue, called the Eventual Queue, which, while providing a relaxed yet very useful semantics, is amenable to high performance implementations

Page 40: Classical Distributed Algorithms with DDS


Eventual Queue Specification

☐ Invariants
  ☐ All enqueued elements will eventually be dequeued
  ☐ Each element is dequeued once
  ☐ If the queue is empty a dequeue returns nothing
  ☐ If the queue is non-empty a dequeue might return something
  ☐ Elements might be dequeued in a different order than they were enqueued

[Figure: several DataWriters (producers) and DataReaders (consumers) attached to a Distributed Eventual Queue]


Page 45: Classical Distributed Algorithms with DDS


Eventual Queue Abstraction

☐ A Queue can be seen as the composition of two simpler data structures, a Dequeue and an Enqueue

☐ The Enqueue simply allows one to add elements

☐ The Dequeue simply allows one to get elements

trait Enqueue[T] {
  def enqueue(t: T)
}

trait Dequeue[T] {
  def dequeue(): Option[T]
  def sdequeue(): Option[T]
  def length: Int
  def isEmpty: Boolean = length == 0
}

trait Queue[T] extends Enqueue[T] with Dequeue[T]
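Example use of these traits; the factory signature follows the one on the consumer slide later in this deck:

val queue = Queue[String]("CounterQueue", mid, gid, rn)
queue.enqueue("hello")
queue.dequeue() match {
  case Some(v) => println("dequeued " + v)
  case None    => println("queue observed empty") // allowed by the invariants
}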

Page 46: Classical Distributed Algorithms with DDS


Eventual Queue on DDS

☐ One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the Eventual Queue Invariants

☐ The advantage of this approach is that the latency of the dequeue is minimized and the throughput of enqueues is maximized (we'll see later that the latter is really a property of the eventual queue)

☐ The disadvantage, for some use cases, is that the consumers need to store the whole queue locally; as a result, this solution is mostly applicable to symmetric environments running on LANs

Page 47: Classical Distributed Algorithms with DDS


Eventual Queue Invariants & DDS

☐ All enqueued elements will eventually be dequeued
☐ Each element is dequeued once
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
  ☐ These invariants require that we implement a distributed protocol to ensure that values are eventually picked up, and picked up only once!

☐ Elements might be dequeued in a different order than they were enqueued

Page 48: Classical Distributed Algorithms with DDS


Eventual Queue Invariants & DDS

☐ All enqueued elements will eventually be dequeued
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
☐ Elements might be dequeued in a different order than they were enqueued
  ☐ This essentially means that we can have a different local order for the queue elements on each consumer, which in turn means that we can distribute enqueued elements by simple DDS writes!

☐ The implication of this is that the enqueue operation is going to be as efficient as a DDS write (see the sketch below)

☐ Finally, to ensure eventual consistency in the presence of writer faults we'll take advantage of OpenSplice FT-Reliability!
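Concretely, enqueue can be sketched as a single write of a TQueueElement sample; here elemDW is an assumed DataWriter[TQueueElement] on the QueueElement topic defined on the following slides:

def enqueue(data: TData) {
  ts = ts.inc() // writer-local logical timestamp
  elemDW ! new TQueueElement(ts, data) // one DDS write, no coordination
}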

Page 49: Classical Distributed Algorithms with DDS


Dequeue Protocol: General Idea

☐ A possible Dequeue protocol can be derived from the Lamport/Agrawala Distributed Mutual Exclusion Algorithm

☐ The general idea is similar, as we want to order dequeues as opposed to accesses to some critical section; however, there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants

☐ Key issues to be dealt with:
  ☐ DDS provides eventual consistency, thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)
  ☐ Once a process has gained the right to dequeue, it has to be sure that it can pick an element that nobody else has picked just before. Then it has to ensure that, before it allows anybody else to pick a value, its choice is popped from all other local queues

Page 50: Classical Distributed Algorithms with DDS


Topic Types

☐ To implement the Eventual Queue over DDS we use three different Topic Types

☐ The TQueueCommand represents all the commands used by the protocol (more on this later)

☐ TQueueElement represents a writer-timestamped queue element

struct TLogicalClock {
  long long ts;
  long mid;
};

enum TCommandKind {
  DEQUEUE, ACK, POP
};

struct TQueueCommand {
  TCommandKind kind;
  long mid;
  TLogicalClock ts;
};
#pragma keylist TQueueCommand

typedef sequence<octet> TData;

struct TQueueElement {
  TLogicalClock ts;
  TData data;
};
#pragma keylist TQueueElement

Page 51: Classical Distributed Algorithms with DDS


Topics

To implement the Eventual Queue we need only two topics:
☐ One topic for representing the queue elements
☐ Another topic for representing all the protocol messages. Notice that the choice of using a single topic for all the protocol messages was made deliberately, to ensure FIFO ordering between the protocol messages

Page 52: Classical Distributed Algorithms with DDS


Topics

This leads us to:

☐ Topic(name = QueueElement, type = TQueueElement, QoS = {Reliability.Reliable, History.KeepAll})

☐ Topic(name = QueueCommand, type = TQueueCommand, QoS = {Reliability.Reliable, History.KeepAll})

Page 53: Classical Distributed Algorithms with DDS


Dequeue Protocol: A Sample Run

[Figure: a sample run with two applications, app 1 and app 2, each holding a local copy of the enqueued elements a and b (timestamped ts and ts'); the applications exchange req, ack, and pop messages tagged with logical clocks, app 1 dequeues a and app 2 dequeues b, and each pop removes the chosen element from every local queue]

Page 54: Classical Distributed Algorithms with DDS


Example: Producer

object MessageProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageProducer <mid> <gid> <n> <samples>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val n = args(2).toInt
    val samples = args(3).toInt

    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Enqueue[String]("CounterQueue", mid, gid)

    for (i <- 1 to samples) {
      val msg = "MSG[" + mid + ", " + i + "]"
      println(msg)
      queue.enqueue(msg)
      // Pace the writes so that you can see what's going on
      Thread.sleep(300)
    }
  }
}

Page 55: Classical Distributed Algorithms with DDS


Example: Consumer

object MessageConsumer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageConsumer <mid> <gid> <readers-num> <n>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val rn = args(2).toInt
    val n = args(3).toInt

    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Queue[String]("CounterQueue", mid, gid, rn)

    val baseSleep = 1000
    while (true) {
      queue.sdequeue() match {
        case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
        case _       => println(Console.MAGENTA_B + "None" + Console.RESET)
      }
      val sleepTime = baseSleep + (math.random * baseSleep).toInt
      Thread.sleep(sleepTime)
    }
  }
}

Page 56: Classical Distributed Algorithms with DDS


Dealing with Faults

Page 57: Classical Distributed Algorithms with DDS


Fault-Detectors

☐ The algorithms presented so far can be easily extended to deal with failures by taking advantage of the group abstraction presented earlier

☐ The main issue to consider carefully is that if a timing assumption is violated, leading to falsely suspecting the crash of a process, the safety of some of those algorithms might be violated!

Page 58: Classical Distributed Algorithms with DDS


Paxos

Page 59: Classical Distributed Algorithms with DDS


Paxos in Brief

☐ Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his "The Part-Time Parliament"

☐ The Paxos protocol works under asynchrony -- to be precise, it is safe under asynchrony and makes progress under partial synchrony (both together are not possible under asynchrony, due to FLP) -- and admits a crash/recovery failure mode

☐ Paxos requires some form of stable storage

☐ The theoretical specification of the protocol is very simple and elegant

☐ Practical implementations of the protocol have to fill in many hairy details...

Page 60: Classical Distributed Algorithms with DDS


Paxos in Brief

☐ The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):
  ☐ Proposers
  ☐ Acceptors
  ☐ Learners

☐ To make progress the protocol requires that a proposer act as the leader in issuing proposals to acceptors on behalf of clients

☐ The protocol is safe even if there are multiple leaders; in that case progress might be sacrificed
  ☐ This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer

Page 61: Classical Distributed Algorithms with DDS


Paxos Synod Protocol

[Pseudocode from "Ring Paxos: A High-Throughput Atomic Broadcast Protocol", DSN 2010. Notice that the pseudocode is not correct, as it suffers from progress issues in several cases; however, it illustrates the key idea of the Paxos Synod protocol]
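The referenced pseudocode is an image in the original deck; as a rough textual substitute, here is a sketch of the acceptor side of the Synod protocol in Scala (field names follow the message parameters on the next slides; this is an illustration, not the dada implementation):

class Acceptor {
  private var rnd: Long = 0                    // highest round promised
  private var vRnd: Long = 0                   // round of the last accepted value
  private var vVal: Option[Array[Byte]] = None // last accepted value

  // Phase 1B: promise not to take part in rounds smaller than c-rnd,
  // reporting what (if anything) was accepted so far
  def onPhase1A(cRnd: Long): Option[(Long, Long, Option[Array[Byte]])] =
    if (cRnd > rnd) { rnd = cRnd; Some((rnd, vRnd, vVal)) }
    else None

  // Phase 2B: accept c-val unless we have promised a higher round
  def onPhase2A(cRnd: Long, cVal: Array[Byte]): Option[(Long, Array[Byte])] =
    if (cRnd >= rnd) { rnd = cRnd; vRnd = cRnd; vVal = Some(cVal); Some((cRnd, cVal)) }
    else None
}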

Page 62: Classical Distributed Algorithms with DDS


Paxos in Action

[Figure: clients C1..Cn send requests to proposers P1..Pk, one of which acts as the Leader; proposers talk to acceptors A1..Am, and the outcome reaches learners L1..Lh]

Page 63: Classical Distributed Algorithms with DDS


Paxos in Action -- Phase 1A

[Figure: the leader proposer sends phase1A(c-rnd) to the acceptors A1..Am]

Page 64: Classical Distributed Algorithms with DDS


Paxos in Action -- Phase 1B

[Figure: the acceptors reply to the leader with phase1B(rnd, v-rnd, v-val)]

Page 65: Classical Distributed Algorithms with DDS


Paxos in Action -- Phase 2A

[Figure: the leader sends phase2A(c-rnd, c-val) to the acceptors]

Page 66: Classical Distributed Algorithms with DDS


Paxos in Action -- Phase 2B

[Figure: the acceptors send phase2B(v-rnd, v-val)]

Page 67: Classical Distributed Algorithms with DDS


Paxos in Action -- Phase 2B

[Figure: Decision(v-val) is delivered to the learners L1..Lh]

Page 68: Classical Distributed Algorithms with DDS


Eventual Queue with Paxos

☐ The Eventual Queue we specified in the previous section can be implemented using an adaptation of the Paxos protocol

☐ In this case, consumers don't cache the queue locally but leverage a mid-tier running the Paxos protocol to serve dequeues

[Figure: clients C1..Cn issue dequeues to a mid-tier of proposers P1..Pm (also acting as learners) and acceptors, which together implement the Eventual Queue]

Page 69: Classical Distributed Algorithms with DDS


Summing Up

Page 70: Classical Distributed Algorithms with DDS


Concluding Remarks

☐ OpenSplice DDS provides a good foundation to effectively and efficiently express some of the most important distributed algorithms
  ☐ e.g. DataWriter fault-detection and OpenSplice FT-Reliable Multicast

☐ dada provides access to reference implementations of many of the most important distributed algorithms
  ☐ It is implemented in Scala, but that means you can use these libraries from Java too!

Page 71: Classical Distributed Algorithms with DDS


References

☐ Scala: fastest growing JVM language. Open Source. www.scala-lang.org
☐ OpenSplice DDS: #1 OMG DDS implementation. Open Source. www.opensplice.org
☐ Escalier: Scala API for OpenSplice DDS. Open Source. github.com/kydos/escalier
☐ simd-cxx: simple C++ API for DDS. Open Source. github.com/kydos/simd-cxx
☐ simd-java: DDS-PSM-Java for OpenSplice DDS. Open Source. github.com/kydos/simd-java
☐ dada: DDS-based Advanced Distributed Algorithms Toolkit. Open Source. github.com/kydos/dada
