
Angelo Corsaro, PhD, Chief Technology Officer

[email protected]

Classical Distributed Algorithms with DDS


The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable and high-performance systems.

However, in building distributed systems with DDS one is often faced with two kinds of problems:

- How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.

- How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.

In this presentation we will look at how DDS can be used to implement some of the classical distributed algorithms that solve these problems.

Context


A Topic defines a domain-wide class of information

A Topic is defined by means of a (name, type, qos) tuple, where

• name: identifies the topic within the domain

• type: is the programming language type associated with the topic. Types are extensible and evolvable

• qos: is a collection of policies that express the non-functional properties of this topic, e.g. reliability, persistence, etc.

Topic

[Diagram: a Topic is the association of a Name, a Type and a QoS]

struct TemperatureSensor {
    @key long sid;
    float temp;
    float hum;
};


For data to flow from a DataWriter (DW) to one or more DataReaders (DR), a few conditions have to hold:

- The DR's and DW's DomainParticipants have to be in the same domain

- The partition expressions of the DR's Subscriber and the DW's Publisher have to match (in terms of regular expression matching)

- The QoS policies offered by the DW have to match or exceed those requested by the DR (see the matching sketch after the diagram below)

Quality of Service

[Diagram: DDS entity model. A DomainParticipant joins a Domain (Domain Id); a Publisher's DataWriter writes a Topic which a Subscriber's DataReader reads. The writer side offers QoS and the reader side requests QoS; the RxO (Requested vs. Offered) policies include DURABILITY, OWNERSHIP, DEADLINE, LATENCY BUDGET, LIVELINESS, RELIABILITY, DESTINATION ORDER and PARTITION.]
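To illustrate the RxO rule in isolation, here is a minimal Scala sketch: the policy value lists and the satisfies helper are simplifications for illustration only, not part of the DDS API.

// Minimal sketch of RxO (Requested vs. Offered) matching for two sample
// policies. The enumerations and their ordering are simplified assumptions.
object RxO {
  // Stronger policy values appear later in the list
  val reliability = List("BEST_EFFORT", "RELIABLE")
  val durability  = List("VOLATILE", "TRANSIENT_LOCAL", "TRANSIENT", "PERSISTENT")

  // The offered value satisfies the request if it is at least as strong
  def satisfies(scale: List[String], offered: String, requested: String): Boolean =
    scale.indexOf(offered) >= scale.indexOf(requested)

  def main(args: Array[String]): Unit = {
    // A DW offering (RELIABLE, TRANSIENT) matches a DR requesting (RELIABLE, VOLATILE)
    val matched = satisfies(reliability, offered = "RELIABLE", requested = "RELIABLE") &&
                  satisfies(durability, offered = "TRANSIENT", requested = "VOLATILE")
    println("DW/DR QoS match: " + matched)
  }
}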


We can think of a DDS Topic as defining a group

The members of this group are matching DataReaders and DataWriters

DDS' dynamic discovery manages this group membership; however, it provides only a low-level interface to group management and eventual consistency of views

In addition, the group view provided by DDS exposes matched readers on the writer side and matched writers on the reader side

This is not sufficient for certain distributed algorithms.

Membership

[Diagram: DataWriter Group View (a DW's view of the DRs matched on its Topic) and DataReader Group View (a DR's view of the DWs matched on its Topic)]


A Group Management abstraction should provide the ability to join/leave a group, to obtain the current view, and to detect failures of group members

Ideally group management should also provide the ability to elect leaders

A Group Member should represent a process

Group Management

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)

  // Reactions handling Group Events
  val reactions: Reactions
}

case class MemberJoin(val mid: Int)
case class MemberLeave(val mid: Int)
case class MemberFailure(mid: Int)
case class EpochChange(epoch: Long)
case class NewLeader(mid: Option[Int])


Events provide notification of group membership changes

These events are handled by registering partial functions with the Group reactions

Example [1/2]

object GroupMember {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)

    group.join(mid)

    val printGroupView = () => {
      print("Group[" + gid + "] = { ")
      group.view foreach (m => print(m + " "))
      println("}")
    }

    group listen {
      case MemberFailure(mid) => {
        println("Member " + mid + " Failed.")
        printGroupView()
      }
      case MemberJoin(mid) => {
        println("Member " + mid + " Joined")
        printGroupView()
      }
      case MemberLeave(mid) => {
        println("Member " + mid + " Left")
        printGroupView()
      }
    }
  }
}


An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change

A group epoch change takes place each time there is a change in the group view

The leader is eventually elected only if a majority of the processes currently in the view agree

Otherwise the group leader is set to "None"

Example [2/2]

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: EventualLeaderElection <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)

    group.join(mid)

    group listen {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = " + l)
    }
  }
}


A relatively simple Distributed Mutex Algorithm was proposed by Leslie Lamport as an example application of Lamport's Logical Clocks

The basic protocol (with the Agrawala optimization) works as follows (sketched; a code sketch of the decision rule follows the list):

- When a process needs to enter a critical section, it sends a MUTEX request tagged with its current logical clock

- The process obtains the Mutex only when it has received ACKs from all the other processes in the group

- When a process receives a Mutex request, it sends an ACK only if it does not have an outstanding Mutex request timestamped with a smaller logical clock

Lamport’s Distributed Mutex
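Before looking at the DDS-based implementation, here is a minimal, single-process Scala sketch of just the decision rule described above; the names (Clock, LamportMutexRule) are illustrative and are not part of the LCMutex code shown next.

// Requests are totally ordered by (logical clock, process id); a peer's
// request is ACKed immediately unless we have an outstanding request with
// a smaller timestamp, in which case the ACK is deferred.
case class Clock(ts: Long, pid: Int)

object Clock {
  implicit val ordering: Ordering[Clock] = Ordering.by(c => (c.ts, c.pid))
}

class LamportMutexRule(val pid: Int) {
  private var myRequest: Option[Clock] = None

  // Called when this process wants to enter the critical section
  def requestMutex(now: Long): Clock = {
    val req = Clock(now, pid)
    myRequest = Some(req)
    req
  }

  // true  => ACK the peer's request immediately
  // false => defer the ACK until our own request has been served
  def shouldAck(peerRequest: Clock): Boolean =
    myRequest.forall(mine => Clock.ordering.lt(peerRequest, mine))

  // Called when this process leaves the critical section
  def releaseMutex(): Unit = { myRequest = None }
}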


The LCMutex is one of the possible Mutex protocols, implementing the Agrawala variation of the classical Lamport Algorithm

LCMutex

class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {

  private var group = Group(gid)
  private var ts = LogicalClock(0, mid)
  private var receivedAcks = new AtomicLong(0)

  private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
  private var myRequest = LogicalClock.Infinite

  private val reqDW = DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
  private val reqDR = DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)

  private val ackDW = DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
  private val ackDR = DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)

  private val ackSemaphore = new Semaphore(0)


LCMutex.onACK

ackDR listen {
  case DataAvailable(dr) => {
    // Count only the ACKs addressed to us
    val acks = ((ackDR take) filter (_.amid == mid))
    val k = acks.length

    if (k > 0) {
      // Set the local clock to max(tsi, tsj) + 1
      synchronized {
        val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
        ts = LogicalClock(maxTs, ts.id)
      }
      val ra = receivedAcks.addAndGet(k)
      val groupSize = group.size
      // If we have received sufficiently many ACKs we can enter our Mutex!
      if (ra == groupSize - 1) {
        receivedAcks.set(0)
        ackSemaphore.release()
      }
    }
  }
}


One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the Eventual Queue Invariants

The advantage of this approach is that the latency of dequeues is minimized and the throughput of enqueues is maximized (we'll see later that this is really a property of the eventual queue)

The disadvantage, for some use cases, is that each consumer needs to store the whole queue locally; thus this solution is mainly applicable to symmetric environments running on LANs

Eventual Queue on DDS


All enqueued elements will eventually be dequeued

If the queue is empty, a dequeue returns nothing

If the queue is non-empty, a dequeue might return something

Elements might be dequeued in a different order than they are enqueued

- This essentially means that we can have a different local order for the queue elements on each consumer, which in turn means that we can distribute enqueued elements by simple DDS writes!

- The implication of this is that the enqueue operation is going to be as efficient as a DDS write

- Finally, to ensure eventual consistency in the presence of writer faults we'll take advantage of OpenSplice's FT-Reliability!

Eventual Queue Invariants & DDS


A possible Dequeue protocol can be derived from the Lamport/Agrawala Distributed Mutual Exclusion Algorithm

The general idea is similar, as we want to order dequeues rather than accesses to a critical section; however, there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants

Key Issues to Address

- DDS provides eventual consistency; thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)

- Once a process has gained the right to dequeue, it has to be sure that it can pick an element that nobody else has picked just before. Then it has to ensure that, before it allows anybody else to pick a value, its choice has been popped from all other local queues (an outline of these steps follows below)

Dequeue Protocol: General Idea
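Putting those points together, a hypothetical outline of the dequeue side is sketched below; the class name and the step details are assumptions, only the overall structure is implied by the protocol description above and by the command types introduced later.

// Hypothetical outline of a consumer's dequeue, derived from the
// Lamport/Agrawala mutex: DEQUEUE plays the role of the mutex request,
// and POP makes the chosen element disappear from every local queue.
class EventualQueueConsumer[T] {
  def sdequeue(): Option[T] = {
    // 1. Broadcast a DEQUEUE command stamped with the local logical clock
    // 2. Wait for ACKs from all other consumers; requests are served in
    //    logical-clock order, so at most one consumer dequeues at a time
    // 3. If the local queue is empty, return None; otherwise pick the element
    //    that, by the protocol's ordering rule, every replica will agree on
    // 4. Broadcast a POP for the chosen element so that all consumers remove
    //    it from their local queues before anyone else is allowed to dequeue
    // 5. ACK any DEQUEUE requests that were deferred while we held the right
    None // placeholder for the steps above
  }
}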


To implement the Eventual Queue over DDS we use three different Topic Types

TQueueCommand represents all the commands used by the protocol (more on this later)

TQueueElement represents a writer-timestamped queue element

Topic Types

struct TLogicalClock {
    long long ts;
    long mid;
};

enum TCommandKind {
    DEQUEUE,
    ACK,
    POP
};

struct TQueueCommand {
    TCommandKind kind;
    long mid;
    TLogicalClock ts;
};
#pragma keylist TQueueCommand

typedef sequence<octet> TData;

struct TQueueElement {
    TLogicalClock ts;
    TData data;
};
#pragma keylist TQueueElement
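Given these types, an enqueue can indeed be a single DDS write of a TQueueElement. A minimal sketch follows; the elemDW writer, the use of Array[Byte] for TData, and the omission of the generic serialization done by Enqueue[T] are assumptions.

// Hypothetical sketch of the enqueue side: stamp the element with the local
// logical clock and publish it with a plain DDS write; each consumer's
// DataReader appends it to its local queue.
def enqueue(data: Array[Byte]): Unit = {
  ts = LogicalClock(ts.ts + 1, mid)                            // advance the writer's clock
  elemDW.write(TQueueElement(TLogicalClock(ts.ts, mid), data)) // assumed write API
}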


Example: Producer

object MessageProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageProducer <mid> <gid> <n> <samples>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val n = args(2).toInt
    val samples = args(3).toInt

    val group = Group(gid)
    group listen {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Enqueue[String]("CounterQueue", mid, gid)

    for (i <- 1 to samples) {
      val msg = "MSG[" + mid + ", " + i + "]"
      println(msg)
      queue.enqueue(msg)
      // Pace the write so that you can see what's going on
      Thread.sleep(300)
    }
  }
}


Example: Consumer

object MessageConsumer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageConsumer <mid> <gid> <readers-num> <n>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val rn = args(2).toInt
    val n = args(3).toInt

    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M[" + mid + "]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Queue[String]("CounterQueue", mid, gid, rn)

    val baseSleep = 1000
    while (true) {
      queue.sdequeue() match {
        case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
        case _       => println(Console.MAGENTA_B + "None" + Console.RESET)
      }
      val sleepTime = baseSleep + (math.random * baseSleep).toInt
      Thread.sleep(sleepTime)
    }
  }
}


Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his paper "The Part-Time Parliament"

The Paxos protocol works under asynchrony; to be precise, it is safe under asynchrony and makes progress under partial synchrony (guaranteeing both under asynchrony is impossible due to the FLP result). It admits a crash/recovery failure model

Paxos requires some form of stable storage

The theoretical specification of the protocol is very simple and elegant

Practical implementations of the protocol have to fill in many hairy details...

Paxos in Brief


The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):

- Proposers

- Acceptors

- Learners

To make progress, the protocol requires that a proposer acts as the leader in issuing proposals to acceptors on behalf of clients

The protocol is safe even if there are multiple leaders; in that case, however, progress might be sacrificed

- This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer

Paxos in Brief
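To make the three roles a little more concrete, here is a minimal sketch of the messages typically exchanged in the two phases of Paxos; the type names and fields are illustrative and are not taken from the presentation.

// Illustrative Paxos message types: a proposer runs Phase 1 (Prepare/Promise)
// to learn about previously accepted values, then Phase 2 (Accept/Accepted)
// to get a value chosen by a majority of acceptors.
case class Ballot(round: Long, proposerId: Int)

// Phase 1: the proposer asks acceptors to promise not to accept lower ballots;
// each acceptor replies with the highest-ballot value it has already accepted, if any
case class Prepare(ballot: Ballot)
case class Promise(ballot: Ballot, accepted: Option[(Ballot, Array[Byte])])

// Phase 2: the proposer asks acceptors to accept a value for its ballot
case class Accept(ballot: Ballot, value: Array[Byte])
case class Accepted(ballot: Ballot, value: Array[Byte])

// Learners observe Accepted messages; a value is chosen once a majority of
// acceptors have accepted it for the same ballot.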