Department of Computer Science, Jinan University, Guangzhou, P.R. China

Lijun Lyu, Junjie Xie, Yuhui Deng, Yongtao Zhou

ICA3PP 2014: The 14th International Conference on Algorithms & Architectures for Parallel Processing. August 24-27, Dalian, China.

• Motivation

• Challenges

• Related work

• Our idea

• System architecture

• Evaluation

• Conclusion


• The Explosive Growth of Data

IDC: 1,800 EB of data in 2011, with a 40-60% annual increase

Larger data centers: Google runs 19 data centers with more than 1 million servers

Higher traffic: Cisco forecasts that annual traffic in global data centers will nearly triple over the next 5 years and reach 7.7 ZB by the end of 2017

[Figure: Google data center]

• Data Center Network

Node increment: scalability?

Failures are common: fault tolerance?

Google MapReduce in a 4,000-node cluster: 5 nodes fail during a job, and 1 disk fails every 6 hours

Bandwidth-hungry services: network capacity?

Infrastructure services: MapReduce, GFS, …

Network applications: cloud disk, video, …

• Tree-based Structure

Traditional tree: bandwidth bottleneck, single points of failure, expensive

Modified tree (Fat-tree): high capacity, limited scalability

[Figures: traditional tree-based structure; Fat-tree]

• Other novel, hybrid network structures

Physical topology: level-based but not tree-based; recursively defined

Routing mechanism: no routers and no traditional Internet routing mechanism; routing intelligence is placed on servers; routing takes advantage of structural properties

Typical structures: DCell, FiConn, BCube, Totoro, …

• DCell


• Totoro

• FiConn

• BCube

• Physical structures

• Routing mechanisms


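Comparison of routing mechanisms in DCell, Totoro, FiConn, and BCube (cell values listed left to right; some values span more than one structure):

Core idea: Divide-and-Conquer | Correct different address digits
Calculation: Hop by hop | Full path
Link state: Broadcast domain | Path probing
Path selection: Dijkstra + Rerouting | Greedy | Available one
Traffic-aware: No mention | Yes | No mention
Shortest distance: No | Yes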

• What we achieve: Athena Routing Mechanism (ARM)

Routing algorithm: based on dynamic programming; finds the shortest path with lower complexity than classic algorithms; supports multi-path

Path probing mechanism: bypasses failed nodes and links; traffic-aware

Properties: more resilient, shorter latency, higher capacity, lower complexity

• Athena Routing Mechanism

Implemented on the Totoro structure

Compared with the original Totoro Fault-tolerant Routing algorithm (TFR) and a Shortest Path Algorithm (SPA, based on Floyd-Warshall)

Applicable to DCell, FiConn, BCube, …, which share a similar topology (level-based, recursively defined) and also put routing intelligence on servers

• Totoro

Two-port servers, low-end switches, level-based, recursively defined

[Figure: Totoro structure of one level; each server has a two-port NIC]

• Building Totoro

Connect N servers to an N-port switch (the intra-switch). Here, N = 4.

Basic partition: Totoro0

[Figure: a Totoro0 structure]

• Building Totoro

Available ports in a Totoro0: c. Here, c = 4.

Connect n Totoro0s to n-port switches (the inter-switches), using c/2 of the available ports.

A Totoro1 structure consists of n Totoro0s.

• Building Totoro

Connect n Totoro_{i-1}s to n-port switches to build a Totoro_i. The structure is recursively defined.

Only half of the available ports are used at each level ⇒ open and scalable

The number of paths among Totoro_i s is n/2 times the number of paths among Totoro_{i-1}s ⇒ multi-redundant links ⇒ high network capacity
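The construction rules can be turned into a small size calculation. The sketch below is one reading of the slides, not the authors' code: it assumes each level-i switch takes one free port from each of the n sub-partitions, so that using half of the c free ports of a sub-partition yields c/2 switches at that level. The function name `totoro_counts` and its return layout are hypothetical.

```python
def totoro_counts(N, n, K):
    """Per-level sizes of one Totoro_K (sketch; assumes free port counts stay even).

    Each row is (level, servers, switches at this level, free ports left).
    """
    servers = N        # a Totoro_0 holds N servers on one N-port intra-switch
    free_ports = N     # c: every two-port server still has one unused port
    rows = [(0, servers, 1, free_ports)]
    for level in range(1, K + 1):
        # Half of the free ports of each sub-partition go to this level,
        # one port per n-port inter-switch (assumed wiring rule).
        switches = free_ports // 2
        servers *= n
        free_ports = n * (free_ports - switches)
        rows.append((level, servers, switches, free_ports))
    return rows


for row in totoro_counts(N=4, n=4, K=2):
    print(row)
```

With N = 4, n = 4, K = 2 this gives 64 servers, which matches the three-digit addresses 0, 0, 0 through 3, 3, 3 in the Totoro2 figure. The switch count grows by a factor of n/2 from one level to the next, in line with the n/2 path ratio stated above.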

[Figure: Totoro2 structure with N = 4, n = 4, K = 2. Servers carry three-digit addresses from 0, 0, 0 to 3, 3, 3; level-1 switches are labeled 1-0, 0 through 1-3, 1; level-2 switches are labeled 2-0 through 2-3. The legend distinguishes Level-1 links from Level-2 links.]

• Athena Routing Algorithm (ARA)

Based on dynamic programming (DP), which applies to problems that exhibit overlapping subproblems and optimal substructure

Paths are calculated recursively

Steps of ARA:
1. Suppose src and dst belong to two different partitions.
2. Get all paths connecting these two partitions.
3. For each path, recursively calculate the sub-paths inside the partitions.
4. Store all paths.
5. Sort all paths by length.
6. Remove the extra paths.

This function is based on the corresponding structural properties.

Cartesian product
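The six steps can be sketched as a memoized recursion. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-partition topology in `LINKS`, the tuple address format, and the helper names `split_level`, `links_between`, and `ara` are all hypothetical, and keeping only the k shortest sub-paths at every level is a simplification.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy topology: a server address is (partition, index).
# Servers sharing a partition reach each other through their intra-switch.
# LINKS lists server pairs joined by a higher-level switch (assumed wiring).
LINKS = [((0, 0), (1, 0)), ((0, 1), (1, 1))]


def split_level(a, b):
    """Index of the most significant address digit where a and b differ."""
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)


def links_between(pa, pb):
    """Directed (u, v) pairs with u in partition pa and v in partition pb."""
    d = len(pa)
    out = []
    for u, v in LINKS:
        if u[:d] == pa and v[:d] == pb:
            out.append((u, v))
        elif v[:d] == pa and u[:d] == pb:
            out.append((v, u))
    return out


@lru_cache(maxsize=None)  # memoization handles the overlapping subproblems
def ara(src, dst, k=1):
    """Return up to k shortest src-to-dst paths as tuples of addresses."""
    if src == dst:
        return ((src,),)
    d = split_level(src, dst)
    if d == len(src) - 1:  # same basic partition: one hop via the intra-switch
        return ((src, dst),)
    candidates = []
    for u, v in links_between(src[:d + 1], dst[:d + 1]):  # step 2
        # Step 3: solve both sides recursively, join via a Cartesian product.
        for head, tail in product(ara(src, u, k), ara(v, dst, k)):
            candidates.append(head + tail)  # step 4
    candidates.sort(key=len)  # step 5
    return tuple(candidates[:k])  # step 6


print(ara((0, 0), (1, 0), 1))
print(ara((0, 0), (1, 1), 2))
```

Because every inter-partition link ends inside the partitions of src and dst, each recursive call works on a strictly smaller partition, so the recursion terminates at the basic partition.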


• Case study of ARA: work out the path from src to dst

Step 1: src and dst belong to two different sub-partitions.

Step 2: there exist two paths between these two sub-partitions.

Step 3: for Path 1, recursively work out the sub-paths in these sub-partitions, and join them into a full path.

Step 4: similarly, work out the full path for Path 2.

Step 5: add all paths to the result set.

Step 6: sort the paths by length.

Step 7: remove the extra paths (here, we suppose the size of the set to return is 1, i.e., only the shortest path is returned).

• Path Probing Mechanism

The source host sends probing request packets.

The destination host sends probing reply packets.

Intermediate servers record the link capacities in the probing packets and forward them.

• Path Probing Mechanism

Detects failed paths, so no extra rerouting technique is required

Detects link capacity, which supports load balancing, …
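A probe along one candidate path can be simulated in a few lines. The sketch below is illustrative only: the `capacity` dictionary, the node names, and the rule of preferring the path with the largest bottleneck capacity are assumptions, since the slides do not state how the recorded capacities are used to select a path.

```python
def probe(path, capacity):
    """Simulate one probe along `path`.

    `capacity` maps an undirected link (a frozenset of two nodes) to its
    available bandwidth; a missing or zero entry models a failed link.
    Returns the bottleneck capacity, or None if the probe cannot get through.
    """
    recorded = []
    for u, v in zip(path, path[1:]):
        c = capacity.get(frozenset((u, v)), 0)
        if c <= 0:
            return None         # the request never reaches the destination
        recorded.append(c)      # an intermediate server records the capacity
    return min(recorded) if recorded else None


def choose_path(candidates, capacity):
    """Pick the live candidate with the largest bottleneck (assumed policy)."""
    live = [(probe(p, capacity), p) for p in candidates]
    live = [(c, p) for c, p in live if c is not None]
    if not live:
        return None
    return max(live, key=lambda cp: cp[0])[1]


capacity = {
    frozenset(("A", "B")): 10,
    frozenset(("B", "D")): 0,   # failed link
    frozenset(("A", "C")): 4,
    frozenset(("C", "D")): 6,
}
print(choose_path([("A", "B", "D"), ("A", "C", "D")], capacity))
```

A path that crosses a failed link simply produces no reply, so the source falls back to another precomputed candidate without a separate rerouting step.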


• Protocol Implementation

ARM packet format

[Figure: formats of the path-probing packet and the data packet]

• Protocol Implementation

Protocol: 2.5-layer protocol

How does an intermediate server determine the next hop? Two adjacent servers in a path differ in only one “bit” of their addresses. Hence, only the differing “bit”s are stored in the vector.
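The compact path vector can be illustrated as follows. This is a sketch under assumptions, not the actual ARM packet layout: addresses are modeled as tuples of digits, each vector entry is a hypothetical (digit index, new value) pair, and the function names `encode` and `next_hop` are invented for the example.

```python
def encode(path):
    """Return (source address, [(digit index, new value), ...])."""
    vector = []
    for cur, nxt in zip(path, path[1:]):
        diff = [i for i, (a, b) in enumerate(zip(cur, nxt)) if a != b]
        if len(diff) != 1:
            raise ValueError("adjacent servers must differ in exactly one digit")
        vector.append((diff[0], nxt[diff[0]]))
    return path[0], vector


def next_hop(current, entry):
    """Apply one vector entry to the current server's address."""
    index, value = entry
    return current[:index] + (value,) + current[index + 1:]


path = [(0, 0, 0), (0, 0, 2), (1, 0, 2), (1, 3, 2)]
src, vector = encode(path)
print(vector)

# An intermediate server only needs its own address and the next entry.
hop = src
for entry in vector:
    hop = next_hop(hop, entry)
    print(hop)
```

Storing one small entry per hop instead of a full address keeps the header short, which is the point of the observation on the slide.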

• Evaluating Path Failure & Average Path Lengths

ARM vs. TFR vs. SPA

TFR: the original Totoro Fault-tolerant Routing algorithm

SPA: Shortest Path Algorithm (Floyd-Warshall), used as the performance bound

• Evaluating Resource Usage


• Evaluating Path Failure & Average Path Lengths

Experimental parameters:

Types of failures: link, node, switch, and rack failures

Platform: Totoro2 (4096 servers)

Failure ratios: 2% to 20%

Communication mode: all-to-all

Number of simulation runs: 20

• Evaluating Path Failure

Path failure ratio vs. server/rack failure ratio

The performance of ARM and TFR is almost identical to that of SPA!

• Evaluating Path Failure

Path failure ratio vs. switch failure ratio

The performance of ARM is almost identical to that of SPA!

But that of TFR is not.

• Evaluating Path Failure

Path failure ratio vs. link failure ratio

When the link failure ratio is high:

ARM achieves slightly better capacity than TFR. A performance gap between ARM and SPA still exists!

SPA traverses all feasible links in the whole structure until it finds a valid path!

This is a tradeoff that ARM makes to reduce algorithmic complexity and save computation resources.

• Evaluating Average Path Lengths

ARM:
1. Better than TFR.
2. Almost identical to SPA.
3. Shorter than SPA. This is because the path failure ratio of ARM is slightly higher than that of SPA: ARM finds no path for a few more pairs, these pairs add nothing to the total, and so the total path length of ARM is shorter.
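The third point is a counting effect, which a toy calculation makes concrete. The hop counts below are made up for illustration and are not results from the evaluation; the sketch also assumes the average is taken over successfully routed pairs only.

```python
# Hypothetical hop counts for four server pairs; None means no path was found.
spa_paths = [4, 4, 6, 10]    # SPA finds a long detour for the last pair
arm_paths = [4, 4, 6, None]  # ARM fails on the last pair


def average_length(paths):
    """Average hop count over the pairs that were actually routed."""
    found = [p for p in paths if p is not None]
    return sum(found) / len(found)


def failure_ratio(paths):
    """Fraction of pairs for which no path was found."""
    return sum(p is None for p in paths) / len(paths)


print(failure_ratio(spa_paths), average_length(spa_paths))
print(failure_ratio(arm_paths), average_length(arm_paths))
```

In this made-up case the pair that ARM drops is exactly the one that needs the longest detour, so ARM shows a higher failure ratio and a shorter average at the same time.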


• Evaluating Resource Usage

Experimental parameters:

Testbed: Lenovo T350, quad-core, 8 GB memory

Platform: Totoro2 (4096 servers)

Size of each result: 10 paths

Communication mode: one-to-all in 4 Totoro1s

• Evaluating Resource Usage

[Figure: CPU usage over time, annotated with +10 nodes/s, 28%, 18 s, and 0%]

CPU:
1. The load increases by 10 nodes per second.
2. Usage peaks at 28% at 18 s.
3. Usage benefits from the cache.

Memory: each host needs at most 164 KB.

• More resilient

• Shorter latency

• Higher capacity

• Lower complexity

• In future work, we will focus on implementing ARM in DCell, FiConn, and other structures.
