Reliable and Highly Available Distributed Publish/Subscribe Systems
Reza Sherafat, Hans-Arno Jacobsen
University of Toronto, September 2009
Symposium on Reliable and Distributed Systems
SRDS'09 2
Distributed Publish/Subscribe Systems
Many-to-many communication
High-level operations: “subscribe” and “publish”
Decoupling between sources and sinks
Flexible content-based messaging
[Figure: publishers (P) and subscribers (S) connected through the pub/sub overlay]
Agenda
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocol
Experimental results
Store-and-Forward
A copy is first preserved on disk and then forwarded
Intermediate hops send an ACK to the previous hop after preserving
ACKed copies can be dismissed from disk
Upon failures, unacknowledged copies survive and are re-transmitted after recovery
◦ This ensures reliable delivery but may cause delays while the machine is down
[Figure: publication forwarded hop-by-hop from source to destination, with ACKs flowing back to the previous hop]
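The store-and-forward steps above can be sketched as follows (a minimal single-broker sketch; the class and method names are illustrative assumptions, not the paper's implementation, and a map stands in for stable storage):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of store-and-forward: a publication is preserved before being
// forwarded, and discarded only after the next hop acknowledges it.
public class StoreAndForward {
    // Stands in for the on-disk store of unacknowledged publications.
    private final Map<Long, String> unacked = new LinkedHashMap<>();

    // Preserve the publication first, then forward it downstream.
    public void forward(long seq, String payload) {
        unacked.put(seq, payload);
        sendDownstream(seq, payload);
    }

    // An ACK from the next hop lets us dismiss the stored copy.
    public void onAck(long seq) {
        unacked.remove(seq);
    }

    // After recovery, unacknowledged copies survive and are re-transmitted.
    public void recover() {
        unacked.forEach(this::sendDownstream);
    }

    public boolean isStored(long seq) {
        return unacked.containsKey(seq);
    }

    private void sendDownstream(long seq, String payload) {
        // Network send omitted in this sketch.
    }
}
```

The ordering matters: preserving before sending is what guarantees an unacknowledged copy survives a crash of this broker.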
Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
Use a mesh network to concurrently forward messages on disjoint paths
Upon failures, the message is delivered using alternative routes
Pros: minimal impact on delivery delay
Cons: imposes additional traffic & possibility of duplicate delivery
[Figure: publication forwarded concurrently over disjoint paths from source to destination]
Replica-based Approach [Bhola et al., DSN 2002]
Replicas are grouped into virtual nodes
Replicas have identical routing information
We compare against this approach in the evaluation section
[Figure: a virtual node consisting of broker replicas]
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocol
Experimental results
δ-Fault-Tolerance
In distributed messaging systems:
◦ Failed brokers may be down for a long time
◦ Concurrent failures are common
◦ Reliable message delivery is essential
Configuration parameter δ
A δ-fault-tolerant P/S system ensures reliable delivery when there are up to δ concurrent crash failures
Reliability:
◦ Exactly-once delivery of publications to matching subscribers
◦ Per-source FIFO ordered message delivery
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocols
Experimental results
Architecture
Brokers are organized in a tree-based overlay network
In our approach, δ-fault-tolerance is closely related to how much brokers know about the broker tree
(δ+1)-neighborhood: brokers within distance δ+1
This information is stored in a data structure called the topology map
◦ Topology maps are updated as brokers enter/leave the network
[Figure: nested 1-, 2-, and 3-neighborhoods around a broker in the tree]
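A broker's (δ+1)-neighborhood can be computed as a bounded breadth-first search over the broker tree; the sketch below is ours (class and method names are illustrative, not taken from the paper's implementation):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: compute a broker's (delta+1)-neighborhood by bounded BFS over
// an adjacency list of the broker tree.
public class TopologyMap {
    // Brokers within distance maxDist of 'root' (root itself excluded).
    public static Set<String> neighborhood(
            Map<String, List<String>> tree, String root, int maxDist) {
        Set<String> seen = new HashSet<>();
        seen.add(root);
        List<String> frontier = List.of(root);
        Set<String> result = new HashSet<>();
        for (int d = 0; d < maxDist; d++) {
            List<String> next = new ArrayList<>();
            for (String b : frontier) {
                for (String nb : tree.getOrDefault(b, List.of())) {
                    if (seen.add(nb)) {   // not visited yet
                        result.add(nb);
                        next.add(nb);
                    }
                }
            }
            frontier = next;
        }
        return result;
    }
}
```

Calling this with maxDist = δ+1 yields exactly the set of brokers a broker must track in its topology map.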
Join Algorithm
1. Joining broker connects to a joinpoint
2. joinRequest message is sent to the joinpoint
3. Joinpoint replies with a subset of its topology map
4. joinRequest is propagated in the network
5. Receiving brokers update their topology maps
6. Confirmation messages propagated from edge brokers are sent back
7. Joining broker receives the confirmation: join is complete
[Figure: joining broker connects at the joinpoint; the δ- and (δ+1)-neighborhoods are shown]
Subscription Routing Information
Subscription routing protocol is used to construct forwarding paths
Subscription messages encapsulate:
◦ pred: Conjunct predicates specifying the client’s interests
◦ from: BrokerID pointing back to the broker δ+1 hops closer to the subscriber
Subscriptions are sent hop-by-hop throughout the network
◦ Brokers update from as the message is forwarded
◦ Brokers handle confirmation messages similarly to join
◦ Confirmed subscriptions are inserted into the subscription routing table
[Figure (δ=2): subscription forwarded along brokers A–E; at each broker, s.from points δ+1 hops back toward the subscriber]
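The subscription state above can be sketched as follows, assuming (as an illustration, not the paper's actual encoding) that the message keeps a sliding window of recent hops so that from can always be read off its oldest entry; all names are ours:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: a subscription carries its predicate and a 'from' pointer to the
// broker delta+1 hops closer to the subscriber. We keep the current broker
// plus the last delta+1 previous hops, so 'from' is the window's oldest entry.
public class Subscription {
    final String pred;   // conjunct predicate, e.g. "class = stock"
    final int delta;
    private final Deque<String> lastHops = new ArrayDeque<>();

    Subscription(String pred, int delta, String subscriberBroker) {
        this.pred = pred;
        this.delta = delta;
        lastHops.add(subscriberBroker);
    }

    // Called by each broker as it forwards the subscription.
    void recordHop(String brokerId) {
        lastHops.add(brokerId);
        if (lastHops.size() > delta + 2) {
            lastHops.removeFirst();   // keep current hop + delta+1 previous hops
        }
    }

    // Broker delta+1 hops back toward the subscriber (or the subscriber's
    // own broker while the path is still shorter than that).
    String from() {
        return lastHops.peekFirst();
    }
}
```

With δ=2 and path A→B→C→D→E, the subscription at E has from = B, i.e. three (δ+1) hops back, matching the figure.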
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication forwarding protocols
Experimental results
Publication Forwarding Algorithm (No Failure Case)
1. Received publications are placed in a FIFO message queue and kept until processing is complete
2. Using subscription info: subscriptions matching the publication are identified
3. Matching subscriptions’ from fields are inserted into the recipientSet
4. Using the topology map: the publication is sent to the closest available brokers towards matching subscribers (outgoingSet)
5. Receiving downstream brokers similarly forward the publication until delivered to subscribers
6. Confirmations from all downstream brokers are received
7. Clean-up: once all confirmations arrive, the publication is discarded from the queue
[Figure: broker A holds the publication in its message queue and forwards it downstream within its (δ+1)-neighborhood]
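Steps 2–4 above can be sketched as follows (a minimal sketch: pathTo stands in for a topology-map lookup of the path toward each recipient, and all names are illustrative, not the paper's implementation):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: collect the 'from' brokers of matching subscriptions into a
// recipientSet, then use the topology map to pick, for each recipient,
// the closest available broker on the path toward it.
public class PubForwarding {
    public static Set<String> outgoingSet(
            Set<String> recipientSet,
            Map<String, List<String>> pathTo, // path from this broker to each recipient, nearest hop first
            Set<String> available) {          // currently reachable brokers
        Set<String> out = new HashSet<>();
        for (String r : recipientSet) {
            for (String hop : pathTo.getOrDefault(r, List.of())) {
                if (available.contains(hop)) { // closest available broker toward r
                    out.add(hop);
                    break;
                }
            }
        }
        return out;
    }
}
```

Because the next hop is chosen as the closest *available* broker, the same routine also covers the failure case below: when a hop on the path is down, the publication is sent to the next broker toward the recipient, bypassing the failed one.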
Publication Forwarding Algorithm (Failure Case)
Brokers use heartbeats to monitor the availability of their connected peers
Once failures are detected, the broker reconnects the topology by creating new links to downstream neighbors of the failed brokers
Unconfirmed publications are re-transmitted from the message queue
Subsequent publications are forwarded via the new links instead
◦ Bypass failed brokers
Multiple concurrent failures (up to δ) are handled similarly
◦ In the worst case, δ brokers have failed in a row
[Figure: broker A bypasses failed downstream brokers via new links; unconfirmed publications remain in its queue]
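The heartbeat-based failure detection above might look roughly like this (an illustrative sketch; the timeout value and all names are assumptions, not the paper's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a broker records the last heartbeat time of each connected peer
// and suspects peers whose last heartbeat is older than the timeout.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String broker, long nowMillis) {
        lastSeen.put(broker, nowMillis);
    }

    // Peers whose heartbeat has not been seen within the timeout; the caller
    // would then create new links to the failed peer's downstream neighbors
    // and re-transmit unconfirmed publications from the message queue.
    public List<String> suspected(long nowMillis) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

The detection timeout is the main knob here; it directly accounts for the short-lived delay spike observed after failures in the experiments.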
Eliminating the Need for Confirmation Messages
For each publication message sent over a link, a confirmation message is sent back
◦ Increased network traffic
We use an aggregated acknowledgement mechanism called Depth Acknowledgements (DACK)
◦ This eliminates the need for per-publication confirmation messages
Discarding Publications Using DACK Messages
B and C keep track of the highest sequence number they received and discarded (prefix-based) from A and periodically report it upstream using DACK messages
Brokers append their own information to DACKs and also relay portions of their neighbors’ DACK messages
For each publication, A evaluates safety conditions for all brokers in the publication’s recipientSet
Safety conditions:
◦ All intermediate brokers report an arrived seq# higher than the publication’s seq#, OR
◦ Any intermediate broker has reported a discarded prefix seq# higher than the publication’s seq# (necessary when there are failures)
[Figure: DACK messages flow upstream from C through B to A, each carrying arrived:{seq(A), …} and discarded:{seq'(A), …} entries that are aggregated and relayed at every hop]
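The two safety conditions can be sketched as a single check (illustrative names, not the paper's code; the >= comparisons reflect that both reports are prefix-based under per-source FIFO delivery, so a report of pubSeq or later implies pubSeq itself was covered):

```java
import java.util.Map;

// Sketch of the DACK safety check: a broker may discard a stored
// publication with sequence number pubSeq once either condition holds
// for the intermediate brokers toward a recipient.
public class DackSafety {
    public static boolean canDiscard(long pubSeq,
                                     Map<String, Long> arrived,     // per-broker highest arrived seq#
                                     Map<String, Long> discarded) { // per-broker discarded prefix seq#
        // Second condition: some intermediate broker already discarded a
        // prefix covering pubSeq (needed when there are failures).
        for (long d : discarded.values()) {
            if (d >= pubSeq) {
                return true;
            }
        }
        // First condition: every intermediate broker reports an arrival
        // covering pubSeq.
        for (long a : arrived.values()) {
            if (a < pubSeq) {
                return false;
            }
        }
        return !arrived.isEmpty();
    }
}
```

A would run this check per publication against the reports of all brokers in its recipientSet, purging the queue entry as soon as it returns true.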
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication forwarding protocols
Experimental results
Experimental Setup
Algorithms implemented in Java
We run the system on a cluster computer:
◦ 21 nodes, each with 4 cores
◦ Gigabit Ethernet
Topology setup (δ=3):
◦ Consists of 83 brokers
◦ #subscriptions: 2600
◦ #publishers: 26, at varied publication rates
We inject failures at R1, R2, R3 and perform measurements
[Figure: broker topology with failure-injection points R1, R2, R3]
Publication Delivery Delay
Impact of failures on publication delivery delay:
◦ Use a stream of publications (10 msg/s)
◦ Measure delivery delay between publishing and subscribing endpoints
3 separate runs with different numbers of simultaneous failures
After a short-lived jump, the delivery delay quickly returns to normal
◦ The difference corresponds to the failure detection timeout
[Figure: delivery delay over time for the 1-failure, 2-failure, and 3-failure runs]
Change in Load After Failures
Non-faulty brokers’ load after failures:
◦ Input msg traffic: no change!
◦ Output msg traffic: increase
◦ CPU utilization: increase
Output rate/CPU utilization is affected by nearby failures
[Figure: input msg rate, output msg rate, and CPU load on R1 and R2 when R3 fails. Spikes at R2 after brokers reconnect, with smaller spikes on R1; R2’s output traffic and CPU load stabilize at slightly higher levels, while input traffic on R1 and R2 stabilizes at exactly the same rate and R1 sees no change]
Comparison with Replica-based Approach
Topology network:
◦ Our approach: δ=2
◦ Replica-based: 2 replicas
◦ Considered the situation after 2 failures (R2 and R3 fail)
Compared load on R1 after failures occur
In our approach, CPU load on R1 is about 30% lower
[Figure: CPU load on R1, our approach vs. the replica-based approach (virtual node of R2 and R3); ~30% difference]
Conclusions
Our system delivers reliable P/S service in the face of up to δ concurrent broker failures
We also proposed optimizations:
◦ Aggregated acknowledgement messages (DACKs)
◦ Reduced network traffic
Ongoing and future work:
◦ Explore multi-path forwarding
◦ http://research.msrg.utoronto.ca/Padres/WebHome
Thanks!
Questions?
Backup slides …
Sample DACK Propagation and Publication Purging (δ=3)
[Animation: step-by-step propagation; one sequence illustrates the first safety condition, the other the second]
LEGEND: Node holds pub in MQ · Node discards pub · Node receives pub · Direction of pub forwarding
Publication Propagation and Purging
Using DACK info (δ=3)
[Animation: step-by-step propagation and purging]
Publication Propagation and Purging
Using DACK info with failures (δ=3)
[Animation: step-by-step propagation and purging under failures]