Reliable and Highly Available Distributed Publish/Subscribe Systems
Reza Sherafat, Hans-Arno Jacobsen
University of Toronto, September 2009
Symposium on Reliable and Distributed Systems
SRDS'09 2
Distributed Publish/Subscribe Systems
Many-to-many communication
High-level operations: “subscribe” and “publish”
Decoupling between sources and sinks
Flexible content-based messaging
[Figure: publishers (P) and subscribers (S) connected through the pub/sub overlay]
Agenda
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocol
Experimental results
Store-and-Forward
A copy is first preserved on disk and then forwarded
Intermediate hops send an ACK to the previous hop after preserving
ACKed copies can be dismissed from disk
Upon failures, unacknowledged copies survive and are re-transmitted after recovery
◦ This ensures reliable delivery but may cause delays while the machine is down
[Figure: publication forwarded hop-by-hop from source to destination, with ACKs flowing back to the previous hop]
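The store-and-forward steps above can be sketched as follows (a minimal single-broker sketch; the class and method names are illustrative assumptions, not the paper's implementation, and a map stands in for stable storage):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of store-and-forward: a publication is preserved before being
// forwarded, and discarded only after the next hop acknowledges it.
public class StoreAndForward {
    // Stands in for the on-disk store of unacknowledged publications.
    private final Map<Long, String> unacked = new LinkedHashMap<>();

    // Preserve the publication first, then forward it downstream.
    public void forward(long seq, String payload) {
        unacked.put(seq, payload);
        sendDownstream(seq, payload);
    }

    // An ACK from the next hop lets us dismiss the stored copy.
    public void onAck(long seq) {
        unacked.remove(seq);
    }

    // After recovery, unacknowledged copies survive and are re-transmitted.
    public void recover() {
        unacked.forEach(this::sendDownstream);
    }

    public boolean isStored(long seq) {
        return unacked.containsKey(seq);
    }

    private void sendDownstream(long seq, String payload) {
        // Network send omitted in this sketch.
    }
}
```

The ordering matters: preserving before sending is what guarantees an unacknowledged copy survives a crash of this broker.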
Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
Use a mesh network to concurrently forward messages on disjoint paths
Upon failures, the message is delivered using alternative routes
Pros: minimal impact on delivery delay
Cons: imposes additional traffic & possibility of duplicate delivery
[Figure: publication forwarded concurrently over disjoint paths from source to destination]
Replica-based Approach [Bhola et al., DSN 2002]
Replicas are grouped into virtual nodes
Replicas have identical routing information
We compare against this approach in the evaluation section
[Figure: a virtual node consisting of broker replicas]
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocol
Experimental results
δ-Fault-Tolerance
In distributed messaging systems:
◦ Failed brokers may be down for a long time
◦ Concurrent failures are common
◦ Reliable message delivery is essential
Configuration parameter δ
A δ-fault-tolerant P/S system ensures reliable delivery when there are up to δ concurrent crash failures
Reliability:
◦ Exactly-once delivery of publications to matching subscribers
◦ Per-source FIFO ordered message delivery
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication delivery protocols
Experimental results
Architecture
Brokers are organized in a tree-based overlay network
In our approach, δ-fault-tolerance is closely related to how much brokers know about the broker tree
(δ+1)-neighborhood: brokers within distance δ+1
This information is stored in a data structure called the topology map
◦ Topology maps are updated as brokers enter/leave the network
[Figure: nested 1-, 2-, and 3-neighborhoods around a broker in the tree]
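A broker's (δ+1)-neighborhood can be computed as a bounded breadth-first search over the broker tree; the sketch below is ours (class and method names are illustrative, not taken from the paper's implementation):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: compute a broker's (delta+1)-neighborhood by bounded BFS over
// an adjacency list of the broker tree.
public class TopologyMap {
    // Brokers within distance maxDist of 'root' (root itself excluded).
    public static Set<String> neighborhood(
            Map<String, List<String>> tree, String root, int maxDist) {
        Set<String> seen = new HashSet<>();
        seen.add(root);
        List<String> frontier = List.of(root);
        Set<String> result = new HashSet<>();
        for (int d = 0; d < maxDist; d++) {
            List<String> next = new ArrayList<>();
            for (String b : frontier) {
                for (String nb : tree.getOrDefault(b, List.of())) {
                    if (seen.add(nb)) {   // not visited yet
                        result.add(nb);
                        next.add(nb);
                    }
                }
            }
            frontier = next;
        }
        return result;
    }
}
```

Calling this with maxDist = δ+1 yields exactly the set of brokers a broker must track in its topology map.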
Join Algorithm
1. Joining broker connects to a joinpoint
2. joinRequest message is sent to the joinpoint
3. Joinpoint replies with a subset of its topology map
4. joinRequest is propagated in the network
5. Receiving brokers update their topology maps
6. Confirmation messages propagated from edge brokers are sent back
7. Joining broker receives the confirmation: join is complete
[Figure: joining broker connects at the joinpoint; the δ- and (δ+1)-neighborhoods are shown]
Subscription Routing Information
Subscription routing protocol is used to construct forwarding paths
Subscription messages encapsulate:
◦ pred: Conjunct predicates specifying the client’s interests
◦ from: BrokerID pointing back to the broker δ+1 hops closer to the subscriber
Subscriptions are sent hop-by-hop throughout the network
◦ Brokers update from as the message is forwarded
◦ Brokers handle confirmation messages similarly to join
◦ Confirmed subscriptions are inserted into the subscription routing table
[Figure (δ=2): subscription forwarded along brokers A–E; at each broker, s.from points δ+1 hops back toward the subscriber]
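The subscription state above can be sketched as follows, assuming (as an illustration, not the paper's actual encoding) that the message keeps a sliding window of recent hops so that from can always be read off its oldest entry; all names are ours:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: a subscription carries its predicate and a 'from' pointer to the
// broker delta+1 hops closer to the subscriber. We keep the current broker
// plus the last delta+1 previous hops, so 'from' is the window's oldest entry.
public class Subscription {
    final String pred;   // conjunct predicate, e.g. "class = stock"
    final int delta;
    private final Deque<String> lastHops = new ArrayDeque<>();

    Subscription(String pred, int delta, String subscriberBroker) {
        this.pred = pred;
        this.delta = delta;
        lastHops.add(subscriberBroker);
    }

    // Called by each broker as it forwards the subscription.
    void recordHop(String brokerId) {
        lastHops.add(brokerId);
        if (lastHops.size() > delta + 2) {
            lastHops.removeFirst();   // keep current hop + delta+1 previous hops
        }
    }

    // Broker delta+1 hops back toward the subscriber (or the subscriber's
    // own broker while the path is still shorter than that).
    String from() {
        return lastHops.peekFirst();
    }
}
```

With δ=2 and path A→B→C→D→E, the subscription at E has from = B, i.e. three (δ+1) hops back, matching the figure.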
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication forwarding protocols
Experimental results
Publication Forwarding Algorithm (No Failure Case)
1. Received publications are placed in a FIFO message queue and kept until processing is complete
2. Using subscription info: subscriptions matching the publication are identified
3. Matching subscriptions’ from fields are inserted into the recipientSet
4. Using the topology map: the publication is sent to the closest available brokers towards matching subscribers (outgoingSet)
5. Receiving downstream brokers similarly forward the publication until delivered to subscribers
6. Confirmations from all downstream brokers are received
7. Clean-up: once all confirmations arrive, the publication is discarded from the queue
[Figure: broker A holds the publication in its message queue and forwards it downstream within its (δ+1)-neighborhood]
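Steps 2–4 above can be sketched as follows (a minimal sketch: pathTo stands in for a topology-map lookup of the path toward each recipient, and all names are illustrative, not the paper's implementation):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: collect the 'from' brokers of matching subscriptions into a
// recipientSet, then use the topology map to pick, for each recipient,
// the closest available broker on the path toward it.
public class PubForwarding {
    public static Set<String> outgoingSet(
            Set<String> recipientSet,
            Map<String, List<String>> pathTo, // path from this broker to each recipient, nearest hop first
            Set<String> available) {          // currently reachable brokers
        Set<String> out = new HashSet<>();
        for (String r : recipientSet) {
            for (String hop : pathTo.getOrDefault(r, List.of())) {
                if (available.contains(hop)) { // closest available broker toward r
                    out.add(hop);
                    break;
                }
            }
        }
        return out;
    }
}
```

Because the next hop is chosen as the closest *available* broker, the same routine also covers the failure case below: when a hop on the path is down, the publication is sent to the next broker toward the recipient, bypassing the failed one.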
Publication Forwarding Algorithm (Failure Case)
Brokers use heartbeats to monitor the availability of their connected peers
Once failures are detected, the broker reconnects the topology by creating new links to downstream neighbors of the failed brokers
Unconfirmed publications are re-transmitted from the message queue
Subsequent publications are forwarded via the new links instead
◦ Bypass failed brokers
Multiple concurrent failures (up to δ) are handled similarly
◦ In the worst case, δ brokers have failed in a row
[Figure: broker A bypasses failed downstream brokers via new links; unconfirmed publications remain in its queue]
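The heartbeat-based failure detection above might look roughly like this (an illustrative sketch; the timeout value and all names are assumptions, not the paper's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a broker records the last heartbeat time of each connected peer
// and suspects peers whose last heartbeat is older than the timeout.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String broker, long nowMillis) {
        lastSeen.put(broker, nowMillis);
    }

    // Peers whose heartbeat has not been seen within the timeout; the caller
    // would then create new links to the failed peer's downstream neighbors
    // and re-transmit unconfirmed publications from the message queue.
    public List<String> suspected(long nowMillis) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

The detection timeout is the main knob here; it directly accounts for the short-lived delay spike observed after failures in the experiments.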
Eliminating the Need for Confirmation Messages
For each publication message sent over a link, a confirmation message is sent back
◦ Increased network traffic
We use an aggregated acknowledgement mechanism called Depth Acknowledgements (DACK)
◦ This eliminates the need for per-publication confirmation messages
Discarding Publications Using DACK Messages
B and C keep track of the highest sequence number they received and discarded (prefix-based) from A and periodically report it upstream using DACK messages
Brokers append their own information to DACKs and also relay portions of their neighbors’ DACK messages
For each publication, A evaluates safety conditions for all brokers in the publication’s recipientSet
Safety conditions:
◦ All intermediate brokers report an arrived seq# higher than the publication’s seq#, OR
◦ Any intermediate broker has reported a discarded prefix seq# higher than the publication’s seq# (necessary when there are failures)
[Figure: DACK messages flow upstream from C through B to A, each carrying arrived:{seq(A), …} and discarded:{seq'(A), …} entries that are aggregated and relayed at every hop]
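The two safety conditions can be sketched as a single check (illustrative names, not the paper's code; the >= comparisons reflect that both reports are prefix-based under per-source FIFO delivery, so a report of pubSeq or later implies pubSeq itself was covered):

```java
import java.util.Map;

// Sketch of the DACK safety check: a broker may discard a stored
// publication with sequence number pubSeq once either condition holds
// for the intermediate brokers toward a recipient.
public class DackSafety {
    public static boolean canDiscard(long pubSeq,
                                     Map<String, Long> arrived,     // per-broker highest arrived seq#
                                     Map<String, Long> discarded) { // per-broker discarded prefix seq#
        // Second condition: some intermediate broker already discarded a
        // prefix covering pubSeq (needed when there are failures).
        for (long d : discarded.values()) {
            if (d >= pubSeq) {
                return true;
            }
        }
        // First condition: every intermediate broker reports an arrival
        // covering pubSeq.
        for (long a : arrived.values()) {
            if (a < pubSeq) {
                return false;
            }
        }
        return !arrived.isEmpty();
    }
}
```

A would run this check per publication against the reports of all brokers in its recipientSet, purging the queue entry as soon as it returns true.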
Next
Existing approaches
δ-Fault-Tolerance
Architecture
Reliable publication forwarding protocols
Experimental results
Experimental Setup
Algorithms implemented in Java
We run the system on a cluster computer:
◦ 21 nodes, each with 4 cores
◦ Gigabit Ethernet
Topology setup (δ=3):
◦ Consists of 83 brokers
◦ #subscriptions: 2600
◦ #publishers: 26, at varied publication rates
We inject failures at R1, R2, R3 and perform measurements
[Figure: broker topology with failure-injection points R1, R2, R3]
Publication Delivery Delay
Impact of failures on publication delivery delay:
◦ Use a stream of publications (10 msg/s)
◦ Measure delivery delay between publishing and subscribing endpoints
3 separate runs with different numbers of simultaneous failures
After a short-lived jump, the delivery delay quickly returns to normal
◦ The difference corresponds to the failure detection timeout
[Figure: delivery delay over time for the 1-failure, 2-failure, and 3-failure runs]
Change in Load After Failures
Non-faulty brokers’ load after failures:
◦ Input msg traffic: no change!
◦ Output msg traffic: increase
◦ CPU utilization: increase
Output rate/CPU utilization is affected by nearby failures
[Figure: input msg rate, output msg rate, and CPU load on R1 and R2 when R3 fails. Spikes at R2 after brokers reconnect, with smaller spikes on R1; R2’s output traffic and CPU load stabilize at slightly higher levels, while input traffic on R1 and R2 stabilizes at exactly the same rate and R1 sees no change]
Comparison with Replica-based Approach
Topology network:
◦ Our approach: δ=2
◦ Replica-based: 2 replicas
◦ Considered the situation after 2 failures (R2 and R3 fail)
Compared load on R1 after failures occur
In our approach, CPU load on R1 is about 30% lower
[Figure: CPU load on R1, our approach vs. the replica-based approach (virtual node of R2 and R3); ~30% difference]
Conclusions
Our system delivers reliable P/S service in the face of up to δ concurrent broker failures
We also proposed optimizations:
◦ Aggregated acknowledgement messages (DACKs)
◦ Reduced network traffic
Ongoing and future work:
◦ Explore multi-path forwarding
◦ http://research.msrg.utoronto.ca/Padres/WebHome
Thanks!
Questions?
Backup slides …
Sample DACK Propagation and Publication Purging (δ=3)
[Animation: step-by-step propagation; one sequence illustrates the first safety condition, the other the second]
LEGEND: Node holds pub in MQ · Node discards pub · Node receives pub · Direction of pub forwarding
Publication Propagation and Purging
Using DACK info (δ=3)
[Animation: step-by-step propagation and purging]
Publication Propagation and Purging
Using DACK info with failures (δ=3)
[Animation: step-by-step propagation and purging under failures]