
High Performance Wide-area Overlay using Deadlock-free Routing

Ken Hironaka, Hideo Saito, Kenjiro Taura
The University of Tokyo
June 12th, 2009

Parallel and distributed computing in WANs

• Grid environments have become popular platforms
  – Grid5000, DAS-3, InTrigger
  – Helped by greater WAN bandwidth

• Communication is getting increasingly important
  – CPU-intensive applications
    • Becoming more communication intensive
    • e.g.: model checking
  – Data-intensive applications

• The design/implementation of communication libraries is crucial

[Figure: multiple clusters interconnected over a WAN]

Application overlays for communication libraries

• WAN connectivity
  – NAT, firewall
  – SmartSockets [Maassen et al. '07]

• Scalability
  – As few connections as possible
  – Host main-memory constraints
  – Stateful firewall session constraints

• High performance
  – Avoid network contention at bottlenecks

[Figure: NAT/firewalls block direct connections; too many connections cause contention at bottlenecks]

WAN overlay requirements for parallel and distributed computing

• Low transfer/routing overhead
  – Overlay performance does matter
  – Not only latency, but also bandwidth
• "Safe" overlays
  – No memory overflow
  – No communication deadlocks

Our contribution

• Overlay for effective parallel and distributed computing on WANs
  – Low transfer/routing overhead
  – No memory overflow/deadlocks
  – Efficient deadlock-free routing for heterogeneous networks

• Experiments on a large-scale WAN environment
  – 4 to 7 clusters: up to 290 nodes
  – Higher performance for collective communication

Problem Setting

• Introduction
• Problem Setting
• Related Work
• Proposal
• Evaluation
• Conclusion

Description of our overlay setting
• Multi-cluster environment (LAN + WAN)
• Latency and bandwidth are heterogeneous
  – 100 [us] ~ 100 [ms]
  – 10 [Mbps] ~ 10 [Gbps]
• Heavy stress on forwarding nodes that buffer packets
• A naïve implementation has two possible outcomes:
  – Memory overflow in intermediate nodes
  – Communication deadlock among nodes

[Figure: a narrow 10 Mbps link between 10 Gbps links]


No. 1: Memory overflow

• Occurs when a node buffers packets without regard to its remaining buffer space
• Thus, all nodes must use fixed-size buffers

[Figure: src sends to dst across a WAN link; intermediate nodes buffer and forward packets over 1 Gbps links]

No. 2: Deadlocks with naïve flow control

• Simple flow-control solution:
  – Stop receiving once your buffer is full
  – Feeds back to the sender to tune the send rate
  – Possibility of a deadlock

[Figure: src → dst; a full packet buffer at an intermediate node triggers flow control back toward the sender]

A Deadlock Example

• When multiple transfers coexist:
  – Transfers become dependent on each other to make progress

• 4-transfer example:
  – Link A → Link B
  – Link B → Link C
  – Link C → Link D
  – Link D → Link A

• Transfers wait on each other ⇒ deadlock

[Figure: four links A, B, C, D arranged in a cycle]

The source of the deadlock

• Cycle in the link dependency graph (see the cycle-check sketch below)
• Deadlock-free routing is necessary
  – Restrict routing paths so that deadlocks cannot occur
• They do happen!
  – 40-node LAN, 1 M buffers
  – Graph with 0.1 connection density
  – 5 MB all-to-all operation

[Figure: cyclic dependency among links A, B, C, D]
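As an illustration of this condition (not from the slides; the path and link representations below are assumed), here is a minimal Python sketch that builds the link dependency graph induced by a set of routing paths and tests it for a cycle, reproduced on the 4-transfer example above:

```python
# Minimal sketch (not from the slides): build the link dependency graph
# induced by a set of routing paths and test it for cycles.
# A path is a list of nodes; a "link" is a directed overlay edge (u, v).

def has_dependency_cycle(paths):
    deps = {}  # link -> set of links it may wait on
    for path in paths:
        links = list(zip(path, path[1:]))
        for a, b in zip(links, links[1:]):
            deps.setdefault(a, set()).add(b)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(link):
        color[link] = GRAY
        for nxt in deps.get(link, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge => cycle => potential deadlock
                return True
            if c == WHITE and visit(nxt):
                return True
        color[link] = BLACK
        return False

    return any(color.get(l, WHITE) == WHITE and visit(l) for l in list(deps))

# The 4-transfer example from the slide: A->B, B->C, C->D, D->A
# (nodes 0..3 on a ring, each transfer crossing two consecutive links).
ring = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(has_dependency_cycle(ring))   # True: the dependency graph is cyclic
```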


Additionally…

• Existing deadlock-free routing algorithms do not account for the underlying network

• In WANs, we must use underlying network information for efficient routing


Related Work

• Introduction
• Problem Setting
• Related Work
• Proposal
• Evaluation
• Conclusion

Existing WAN overlays
• Do not consider problems like buffer overflow and deadlocks

• RON (Resilient Overlay Network) [Andersen et al. '01]
  – UDP overlay network among all nodes
  – UDP interface to the user
    • Communication reliability and flow control are left to the user

• DiskRouter [Kola et al. '03]
  – File transfer overlay
  – When buffer usage reaches a threshold, stop receiving
    • Possibility of deadlocks

Flow Control for Overlays
• UDP overlay + end-to-end flow control [Kar et al. '01]
  – ACK on every packet sent
  – ACKs piggyback link congestion information
• Spines [Amir et al. '02]
  – Link congestion information is shared periodically
  – The sender tunes its rate based on the congestion of its path

• Pros:
  – Eliminates the burden on forwarding nodes
• Cons:
  – Isn't this re-implementing TCP?
  – A lot of parameter tuning
  – Hard to attain the maximum bandwidth of a path (~30% utilization)
• Our work: use TCP + flow control at forwarding nodes + deadlock-free routing

[Figure: src → dst over UDP hops with end-to-end feedback]

Deadlock-free Routing
• Restrict routing paths to prevent deadlocks
  – Not suitable for WANs
• Algorithms for parallel computer interconnects:
  – Assume "regular" topologies
  – [Antonio et al. '94]
  – [Dally et al. '87, '93]
• Algorithms for general graphs:
  – Do not account for the underlying network
  – Constructed paths are suboptimal
  – Up/Down Routing [Schroeder et al. '91]
  – Ordered-link Routing [Chiu et al. '02]
  – L-Turn Routing [Koibuchi et al. '01]

Proposal

• Introduction
• Problem Setting
• Related Work
• Proposal
• Evaluation
• Conclusion

Proposal Overview

• Basic proposal:
  1. Construct a TCP overlay
  2. Apply deadlock-free routing constraints
  3. Calculate routing paths
• Optimizations using network information

Basic Overlay Overview
• Only requires a connected overlay network of TCP connections
  – e.g.: random-graph overlay construction
• Send in packets:
  – Predefined packet sizes
• End-to-end reliable communication:
  – FIFO transfer
  – Do NOT drop packets

[Figure: src → dst over TCP connections; a full packet buffer at an intermediate node]

Forwarding Procedure (1/2)

• Per TCP connection, define:
  – A fixed-size send buffer
  – A 1-packet receive buffer
• Transfer procedure (see the sketch below):
  – Receive a packet into the receive buffer
  – Move it to the send buffer of the connection it will be forwarded on
  – If that send buffer is full, stop receiving on the incoming connection

[Figure: per-connection send and receive buffers; forwarding stalls when the send buffer is full]
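A minimal sketch of the forwarding rule above (not the authors' implementation; the buffer size, packet representation, and tcp_send callback are assumptions made purely for illustration):

```python
# Minimal sketch of the per-hop forwarding rule: a packet is read from an
# incoming connection only if the send buffer of its outgoing connection has
# room; otherwise reception stops and TCP back-pressure propagates toward the
# sender.
from collections import deque

SEND_BUF_PACKETS = 64            # fixed send-buffer size, in packets (assumed)

class Forwarder:
    def __init__(self, routing_table):
        self.routing_table = routing_table   # dst -> outgoing connection id
        self.send_buf = {}                   # connection id -> deque of packets

    def can_receive(self, dst):
        """Receive from a connection only if the next hop's send buffer has room."""
        out = self.routing_table[dst]
        return len(self.send_buf.setdefault(out, deque())) < SEND_BUF_PACKETS

    def forward(self, packet):
        """Move one packet from the (1-packet) receive buffer to its send buffer."""
        out = self.routing_table[packet["dst"]]
        self.send_buf.setdefault(out, deque()).append(packet)

    def drain(self, conn, tcp_send):
        """Write buffered packets to the TCP connection as it accepts more data."""
        buf = self.send_buf.get(conn, deque())
        while buf and tcp_send(conn, buf[0]):   # tcp_send returns False when full
            buf.popleft()
```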


Forwarding Procedure (2/2)

• When multiple transfers contend for a single link
  – They make progress in round-robin fashion

• Transfers can block waiting for buffer space
  – With deadlock-free routing ⇒ no deadlocks

Deadlock-free Routing: Up/Down Routing [Schroeder et al. '91]


• BFS from the root node
  – Assign IDs in ascending order
• Determine each link's arrow
  – The arrow points to the younger (smaller) ID
• Define link traversal
  – UP: in the arrow direction
  – DOWN: against the arrow direction
• Routing path restriction (see the sketch below):
  – Cannot go UP after going DOWN

 

[Figure: BFS tree with node IDs 0–6; each link's arrow points from the larger to the smaller ID, and DOWN → UP turns are prohibited]

• Determined independently of the underlying network
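A minimal sketch of the mechanism described above, assuming an adjacency-list graph: assign BFS IDs from the root, treat a hop toward the smaller ID as UP, and reject any path that goes UP after having gone DOWN.

```python
# Minimal sketch of Up/Down routing's path restriction: nodes get BFS IDs from
# a root, every link's UP direction points toward the smaller ID, and a legal
# path never takes an UP hop after a DOWN hop.
from collections import deque

def bfs_ids(adj, root=0):
    """Assign ascending IDs in BFS order; adj maps node -> iterable of neighbors."""
    ids, order, seen = {}, deque([root]), {root}
    while order:
        u = order.popleft()
        ids[u] = len(ids)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
    return ids

def is_legal_updown_path(path, ids):
    """True iff the path never goes UP (toward a smaller ID) after going DOWN."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        up_hop = ids[v] < ids[u]          # UP = toward the younger (smaller) ID
        if up_hop and gone_down:
            return False                  # DOWN -> UP turn: prohibited
        if not up_hop:
            gone_down = True
    return True
```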


Routing Table Calculation

• Modification of Dijkstra's shortest-path algorithm (a sketch follows)
  – Routing-table calculation: O(N log N) for N nodes
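The slide does not spell out the modification; one plausible sketch is a Dijkstra run whose search state is (node, gone_down), so that only Up/Down-legal paths are ever explored. All names below are illustrative.

```python
# Plausible sketch (details not given on the slide): Dijkstra's algorithm whose
# state is (node, gone_down), so only paths obeying the Up/Down restriction are
# considered. 'weight' is the routing metric (e.g., latency, or 1/bandwidth as
# on the later "Routing Metric" slide).
import heapq

def updown_shortest_paths(adj, weight, ids, src):
    """adj: node -> neighbors; weight(u, v): edge cost; ids: Up/Down IDs."""
    INF = float("inf")
    dist = {(src, False): 0.0}
    pq = [(0.0, src, False)]                      # (cost, node, gone_down)
    best = {src: 0.0}

    while pq:
        d, u, gone_down = heapq.heappop(pq)
        if d > dist.get((u, gone_down), INF):
            continue                              # stale queue entry
        for v in adj[u]:
            up_hop = ids[v] < ids[u]
            if up_hop and gone_down:
                continue                          # DOWN -> UP is prohibited
            state = (v, gone_down or not up_hop)
            nd = d + weight(u, v)
            if nd < dist.get(state, INF):
                dist[state] = nd
                best[v] = min(best.get(v, INF), nd)
                heapq.heappush(pq, (nd, v, state[1]))
    return best                                   # cheapest legal path cost per node
```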


Proposal Overview

• Basic proposal, annotated with the optimizations:
  1. Construct a TCP overlay → locality-aware construction
  2. Apply deadlock-free routing constraints → locality-aware constraints
  3. Calculate routing paths → throughput-aware path calculation
• Network information used:
  – Inter-node latency matrix
  – Connection bandwidth information

Locality-aware overlay construction [Saito et al. '07]
• "Routes to far nodes can afford to make detours"
• Connection choice (see the sketch below):
  – Low probability with far-away nodes
  – High probability with nearby nodes
• Reduces connections without performance impact
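A minimal sketch of the idea. The actual connection probability comes from [Saito et al. '07] and is not given on the slide; the exponential decay with latency below is purely an assumed placeholder.

```python
# Minimal sketch of locality-aware overlay construction. The exact connection
# probability is from [Saito et al. '07] and is NOT given on the slide; a
# probability that decays with measured latency is assumed for illustration.
import math, random

def choose_neighbors(node, latency_ms, degree, scale_ms=5.0):
    """Pick 'degree' peers for 'node'; nearby peers are much more likely.

    latency_ms: dict peer -> round-trip latency from 'node' in milliseconds.
    scale_ms:   assumed decay constant (illustrative, not from the paper).
    """
    peers = list(latency_ms)
    weights = [math.exp(-latency_ms[p] / scale_ms) for p in peers]
    chosen = set()
    while len(chosen) < min(degree, len(peers)):
        chosen.add(random.choices(peers, weights=weights, k=1)[0])
    return chosen

# Usage: neighbors = choose_neighbors("host-a", {"host-b": 0.2, "host-c": 45.0}, degree=1)
```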


Up/Down Routing Optimizations
• BFS ID assignment is problematic in multi-cluster settings
• Many nodes are reachable only with UP → DOWN paths
• Nodes with small IDs within a cluster
  – Their UP-direction traversal includes a high-latency WAN connection
  – They end up using WAN links to reach intra-cluster nodes

[Figure: BFS IDs assigned across clusters; UP traversals from small-ID nodes cross the WAN]

Proposed Up/Down Routing

• DFS ID assignment (a sketch follows the figure below)
  – Traversal priority goes to the low-latency child
• Rationale
  – Reduces UP → DOWN paths
  – Intra-cluster nodes become reachable using only UP or only DOWN traversals
  – Reduces unnecessary WAN hops

[Figure: DFS IDs assigned cluster by cluster; intra-cluster nodes are reachable with only UP or only DOWN hops]
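A minimal sketch of the proposed ID assignment, assuming an adjacency-list graph and a per-link latency function: a DFS that always descends into the lowest-latency unvisited neighbor first, so each cluster receives a contiguous block of IDs.

```python
# Minimal sketch of the proposed ID assignment: a DFS from the root that
# always descends into the lowest-latency unvisited neighbor first, so each
# cluster's nodes receive a contiguous block of IDs.
def dfs_ids(adj, latency, root=0):
    """adj: node -> neighbors; latency(u, v): link latency; returns node -> ID."""
    ids, stack = {}, [root]
    while stack:
        u = stack.pop()
        if u in ids:
            continue
        ids[u] = len(ids)
        # Visit low-latency (intra-cluster) children first: push them last so
        # they are popped first.
        children = sorted((v for v in adj[u] if v not in ids),
                          key=lambda v: latency(u, v), reverse=True)
        stack.extend(children)
    return ids

# The Up/Down link orientation and the "no UP after DOWN" rule are then applied
# to these DFS IDs exactly as in the earlier Up/Down routing sketch.
```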

Deadlock-free restriction comparison

• Fewer restrictions are placed on intra-cluster links

[Figure: restricted links/turns under BFS Up/Down vs. locality-aware DFS Up/Down]

Routing Metric

• Weight paths by the throughput of the entire path
  – Path cost = sum of the inverses of the bandwidths of the links used (see below)

[Figure: path from src to dst over links with bandwidths B1, B2, B3]
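Restating the bullet above as a worked expression for the path in the figure (B in bits per second):

    cost(src → dst) = 1/B1 + 1/B2 + 1/B3

A single narrow link (small B) dominates this sum, so paths that cross wide-area bottlenecks are penalized heavily during path calculation.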


Evaluation

• Introduction
• Problem Setting
• Related Work
• Proposal
• Evaluation
• Conclusion

Deadlock-free Routing Overhead
• Compare deadlock-free vs. deadlock-unaware routing
  – Ordered-link
  – Up/Down
  – Proposed Up/Down
• Compared hops/bandwidth for all calculated paths
• Simulation
  – L2 topology information of the InTrigger Grid platform
    • 13 clusters (515 nodes)
• Procedure:
  1. Vary connection density
  2. Compute routing tables
  3. Evaluate the result using the topology information

Num. of hops for all paths

• Very small difference in average hop count
• Proposed Up/Down has a comparable maximum hop count

[Graphs: average and maximum hop count vs. connection density; only a small difference from deadlock-unaware routing]

Minimum Path bandwidth Ratio

• Other deadlock-free algorithms take unnecessary WAN hops
• The proposed optimization avoids such WAN hops

[Graph: minimum path-bandwidth ratio vs. connection density; the proposed routing maintains high throughput even for sparse overlays]

Deadlock-free routing effect on path restriction and latency

• Comparison against direct sockets
• 7 real clusters on InTrigger (170 nodes)
  – Connection density: 9%
• Routing metric: latency over the path

Direct vs. Overlay Latency

• Up/Down uses WAN links even for LAN communication

[Graph: direct vs. overlay latency; proposed Up/Down is comparable to direct sockets, while plain Up/Down routes some LAN pairs across WAN links]

Overlay Throughput Performance

• A wide range of environments
  – 1 Gigabit Ethernet LAN (940 [Mbps])
  – Myrinet 10G LAN (7 [Gbps])
• With a varying number of intermediate nodes

Direct vs. overlay throughput

• Able to attain close to direct socket throughput

[Graphs: throughput vs. number of intermediate nodes on the GbE cluster (940 Mbps) and the Myrinet cluster (7 Gbps)]

Collective Communication
• Our overlay outperforms direct sockets even with deadlock-free constraints

• Evaluation
  – Gather, All-to-All
  – Varying message size and connection density
• Environment
  – LAN: 1-switch (36 nodes), hierarchical (177 nodes)
  – WAN: 4 clusters (291 nodes)

Gather time

• Collisions at the switch port cause packet loss
  – Each loss costs a TCP retransmission timeout of 200 [ms]
• A sparse overlay mitigates collisions

[Graph: gather time on the 1-switch cluster (36 nodes); the gap for direct sockets is due to the 200 ms TCP RTO after collisions at the switch port]

Gather with multiple clusters: 4 clusters (291 nodes)

[Graph: gather time across the 4 clusters (291 nodes); shows the effect of mitigating collisions]

All-to-All

• Large-scale environments have bottlenecks
  – Hierarchical cluster: 177 nodes
  – 4 clusters connected over a WAN: 291 nodes

[Figure: topologies with 1 Gbps and 4 Gbps bottleneck links: the hierarchical cluster and the 4 clusters connected over the WAN]

All-to-All performance

[Graphs: all-to-all time for 1 cluster (177 nodes) and for 4 clusters (291 nodes)]

• Sparse overlays perform better due to packet-loss avoidance
• For the hierarchical cluster, packet loss occurs at the switches
• For the multi-cluster setting, the WAN becomes the source of packet loss

Conclusion

• Introduction
• Problem Setting
• Related Work
• Proposal
• Evaluation
• Conclusion

Conclusion
• Overlay for effective parallel and distributed computing on WANs
  – Low transfer overhead
  – No memory overflow/deadlocks
  – Uses network information to mitigate routing overhead

• Evaluation in simulation and on real LANs/WANs
  – Low overhead relative to deadlock-unaware routing
  – Throughput/latency comparable to direct sockets
  – Outperforms direct sockets for collective communication
• Future work
  – Allow dynamic changes in overlay topology and routing

Questions?

• Ken Hironaka: kenny@logos.ic.i.u-tokyo.ac.jp
• Taura Research Lab: www.logos.ic.i.u-tokyo.ac.jp
