1 Near-Optimal Oblivious Routing for 3D-Mesh Networks ICCD 2008 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering Department University of California, San Diego


1

Near-Optimal Oblivious Routing for 3D-Mesh Networks

ICCD 2008

Rohit Sunkam Ramanujam, Bill Lin

Electrical and Computer Engineering Department, University of California, San Diego

2

Motivation: Networks-on-Chip

• Chip-multiprocessors (CMPs) increasingly popular

• 2D-mesh networks often used as on-chip fabric

[Figure: chip images captioned "Tilera Tile64" and "Intel 80-core". Labels: I/O Area, single tile, 1.5mm, 2.0mm, 21.72mm, 12.64mm.]

Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 Million (full-chip), 1.2 Million (tile)
Die Area: 275mm² (full-chip), 3mm² (tile)
C4 bumps #: 8390

3

Motivation: 3D Integrated Circuits

• 3D Benefits
  – Reduced wire delays
  – Enormous bandwidth
  – Heterogeneous system integration

• Natural progression
  – 3D-mesh for 3D CMPs

[Figure: 2D to 3D]

4

Routing Algorithm Objectives

• Maximize throughput
  – How much load the network can handle

• Minimize hop count
  – Minimize routing delay between source and destination

5

Challenges

• For the 2D case, a routing algorithm with near-optimal throughput and minimal hop count, called O1TURN, is known [Seo’05].

• Surprisingly, the optimality of O1TURN does not extend to the 3D case: its actual throughput performance degrades severely.

• The only known routing algorithm with optimal throughput is Valiant (VAL) load-balancing, but VAL performs poorly on hop count (latency), which is twice that of minimal routing.

6

Main Contribution

• Developed a new oblivious routing algorithm called “Randomized Partially Minimal” (RPM) routing.

• RPM provably guarantees near-optimal worst-case throughput in the 3D case.
  – Optimal for even radix k (e.g. 8 x 8 x 8 mesh).
  – Within a factor of 1/k² of optimal for odd radix (e.g. 7 x 7 x 7 mesh).

• Good latency performance.
  – Only a factor of 1.33 of minimal routing (much better than the 2x cost of VAL, the only known routing algorithm with optimal throughput).
  – In practice, 3D-meshes are asymmetric because the number of device layers is less than the number of tiles per edge.
  – e.g., for a 16 x 16 x 4 mesh (4 layers), RPM’s hop count is just a factor of 1.1 of minimal routing.

7

Outline

• Motivation for our work

• Existing 2D routing algorithms don’t extend well into 3D (current section)

• RPM routing algorithm

• Simulation results

• Extensions and future work

8

Existing Routing Algorithms

The 2D case

• Dimension-Ordered Routing (DOR)
  – Route minimal XY

• Valiant load-balancing (VAL)
  – Route source → randomly chosen intermediate node → destination
  – Route minimal XY in both phases

• ROMM
  – Same as VAL, but intermediate node restricted to minimal direction

• Orthogonal 1-TURN (O1TURN)
  – Route minimal XY and YX with equal probability

Extending to the 3D case …

• Dimension-Ordered Routing (DOR)
  – Route minimal XYZ

• Valiant load-balancing (VAL)
  – Route source → randomly chosen intermediate node → destination
  – Route minimal XYZ in both phases

• ROMM
  – Same as VAL, but intermediate node restricted to minimal direction

• Orthogonal 1-TURN (O1TURN)
  – Route along one of 6 minimal orthogonal paths (XYZ, XZY, YXZ, YZX, ZXY, ZYX) with equal probability
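The four 3D schemes listed above can be sketched as path generators. This is an illustrative sketch, not the authors' implementation: the function names, the coordinate-tuple representation, and the reading of ROMM's "minimal direction" as "intermediate node inside the source-destination bounding box" are all assumptions.

```python
import random
from itertools import permutations

def dor_path(src, dst, order=(0, 1, 2)):
    """Minimal dimension-ordered path: fully correct one dimension
    at a time, in the given order (0 = X, 1 = Y, 2 = Z)."""
    cur = list(src)
    path = [tuple(cur)]
    for dim in order:
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

def val_path(src, dst, k, rng=random):
    """VAL: XYZ to a uniformly random intermediate node, then XYZ to dst."""
    mid = tuple(rng.randrange(k) for _ in range(3))
    return dor_path(src, mid) + dor_path(mid, dst)[1:]

def romm_path(src, dst, rng=random):
    """ROMM (assumed two-phase variant): like VAL, but the intermediate
    node lies inside the minimal bounding box of src and dst."""
    mid = tuple(rng.randint(min(s, d), max(s, d)) for s, d in zip(src, dst))
    return dor_path(src, mid) + dor_path(mid, dst)[1:]

def o1turn_3d_path(src, dst, rng=random):
    """O1TURN in 3D: one of the 6 dimension orders, chosen uniformly."""
    order = rng.choice(list(permutations(range(3))))
    return dor_path(src, dst, order)
```

DOR, ROMM and O1TURN always return a path of minimal length, while a VAL path can be much longer because its intermediate node is unconstrained.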

9

Worst-Case Throughput

• Best theoretical normalized worst-case throughput known to be 50% (well-known result).

• Worst-case throughput analysis can be reduced to a maximal weighted matching problem [Towles’02].

• VAL achieves this optimal throughput, but has poor latency.

• As shown next, DOR, ROMM, and O1TURN are all far from optimal in 3D.

10

Poor Worst-Case Throughput

[Figure: worst-case throughput plot. Annotations: "Only 6-15%" and "VAL/Optimal".]

11

How do 2D mesh algorithms fare in 3D?

• Worst-case throughput of DOR, ROMM, O1TURN far from optimal

• Average hop count of VAL far from minimal

• Need a routing algorithm that can trade latency for worst-case throughput

8 x 8 x 8 Network                       VAL    DOR    ROMM   O1TURN
Hop Count (normalized to minimal)       2      1      1      1
Normalized Worst-Case Throughput        0.5    0.063  0.132  0.15
Normalized Average-Case Throughput      0.5    0.316  0.454  0.513

12

Why does O1TURN perform poorly in 3D?

• O1TURN: worst-case throughput is optimal for 2D but more than 3 times worse than optimal for 3D

• The difference
  – 2D traffic matrix is “admissible” for 2D mesh
  – In 3D, projected traffic on each 2D plane is no longer admissible!

• Can we transform the 3D routing problem into routing admissible traffic on each 2D plane?
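The admissibility point can be illustrated numerically. The sketch below is only one possible reading of the slide: it assumes "admissible" means that no node sends or receives more than one unit of traffic, it uses a hypothetical permutation that swaps x and z, and it compares XY routing done in the source layer against spreading every flow uniformly over all layers. None of these names or choices come from the slides.

```python
from collections import defaultdict

def layer_loads(k, traffic, spread):
    """Return (max sent, max received) over all (layer, XY position) pairs.

    traffic maps a source (x, y, z) to its destination; every source
    injects one unit. With spread=False the XY part of each flow stays
    in the source layer; with spread=True each flow is split evenly
    across all k layers.
    """
    sent = defaultdict(float)       # (layer, (x, y)) -> injected load
    received = defaultdict(float)   # (layer, (x', y')) -> delivered load
    for x in range(k):
        for y in range(k):
            for z in range(k):
                dx, dy, _ = traffic((x, y, z))
                if spread:
                    shares = [(layer, 1.0 / k) for layer in range(k)]
                else:
                    shares = [(z, 1.0)]
                for layer, weight in shares:
                    sent[(layer, (x, y))] += weight
                    received[(layer, (dx, dy))] += weight
    return max(sent.values()), max(received.values())

def swap_xz(node):
    """Hypothetical permutation traffic: (x, y, z) -> (z, y, x)."""
    x, y, z = node
    return (z, y, x)

print(layer_loads(4, swap_xz, spread=False))  # (1.0, 4.0)
print(layer_loads(4, swap_xz, spread=True))   # (1.0, 1.0)
```

Under these assumptions, the per-layer projection without spreading has an XY position that must receive 4 units (k times the admissible load), whereas uniform spreading keeps every send and receive total at 1.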

13

Outline

• Motivation for our work

• Existing 2D algorithms don’t extend well into 3D

• RPM routing algorithm (current section)

• Simulation results

• Extensions and future work

14

Randomized Partially-Minimal Routing (RPM)

[Figure: an RPM route from Source to Destination in a 3D mesh with X, Y and Z axes. Phase-1 Z: source to a random intermediate layer. XY or YX routing on the intermediate layer. Phase-2 Z: intermediate layer to destination.]

15

Main Idea

• Load-balance uniformly across the vertical layers

• Min XY/YX used on each layer

• Main Result: RPM has near-optimal worst-case throughput

– Achieves optimal worst-case throughput when network radix k is even

– Within a factor of 1/k² of optimal when k is odd.
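A minimal sketch of how an RPM route could be generated, following the three steps described in the slides (Z to a random layer, minimal XY or YX on that layer, Z to the destination). The function names, the tuple representation, and the assumption that the intermediate layer is drawn uniformly from all layers are illustrative choices, not taken from the authors' code.

```python
import random

def _walk(path, cur, dim, target):
    """Move along one dimension until cur[dim] == target, recording hops."""
    step = 1 if target > cur[dim] else -1
    while cur[dim] != target:
        cur[dim] += step
        path.append(tuple(cur))

def rpm_path(src, dst, layers, rng=random):
    """Return one RPM route from src to dst as a list of (x, y, z) nodes."""
    mid_z = rng.randrange(layers)          # assumed: uniform over all layers
    cur = list(src)
    path = [tuple(cur)]
    _walk(path, cur, 2, mid_z)             # phase-1 Z: source to intermediate layer
    for dim in rng.choice([(0, 1), (1, 0)]):
        _walk(path, cur, dim, dst[dim])    # minimal XY or YX on that layer
    _walk(path, cur, 2, dst[2])            # phase-2 Z: intermediate layer to destination
    return path
```

Only the Z portion of the route can be non-minimal, which is why the scheme is called partially minimal.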

16

RPM achieves Near-Optimal Worst Case Throughput (optimal for even radix)

[Figure: worst-case throughput plot with curves labeled "VAL/Optimal" and "RPM".]

17

Average-Case Throughput

• RPM outperforms VAL, DOR, ROMM and O1TURN in average-throughput on randomly generated traffic.

[Figure: bar chart of average-case throughput (axis from 0 to 0.8) for VAL, DOR, ROMM, O1TURN and RPM on 4x4x4, 8x8x8 and 16x16x4 meshes.]

18

Average Hop Count

• Normalized hop count of RPM
  – Symmetric meshes: 1.33 times minimal, compared to 2x for VAL
  – Asymmetric 16x16x4 mesh: 1.1 times minimal
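The 1.33 and 1.1 figures can be reproduced with a short calculation. The sketch below assumes uniformly random source and destination nodes (including pairs where they coincide) and an intermediate layer drawn uniformly from all layers; these assumptions and the function names are not stated in the slides.

```python
from itertools import product

def mean_abs_diff(k):
    """Average |a - b| for a, b drawn independently and uniformly from 0..k-1."""
    return sum(abs(a - b) for a, b in product(range(k), repeat=2)) / k**2

def minimal_hops(kx, ky, kz):
    """Average hop count of minimal routing in a kx x ky x kz mesh."""
    return mean_abs_diff(kx) + mean_abs_diff(ky) + mean_abs_diff(kz)

def rpm_hops(kx, ky, kz):
    """Average RPM hop count: minimal in X and Y, two random legs in Z."""
    return mean_abs_diff(kx) + mean_abs_diff(ky) + 2 * mean_abs_diff(kz)

for dims in [(8, 8, 8), (16, 16, 4)]:
    ratio = rpm_hops(*dims) / minimal_hops(*dims)
    print(dims, round(ratio, 3))
```

For a symmetric mesh the ratio is exactly 4/3, because the Z leg is one third of the minimal distance and RPM doubles it. For 16x16x4 the Z leg is a small share of the total, so the ratio drops to about 1.105.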

19

Outline

• Motivation for our work

• Existing 2D routing algorithms don’t extend well into 3D

• RPM routing algorithm

• Simulation results (current section)

• Extensions and future work

20

Flit-Level Simulation

• Ideal throughput evaluation assumes
  – Ideal single-cycle router
  – Infinite buffers
  – No contention in switches, no flow control

• Flit-level simulation
  – PopNet network simulator
  – 4-stage router pipeline: route computation, VC allocation, switch arbitration, link traversal
  – Credit-based flow control
  – 8 virtual channels, each 5 flits deep
  – Multi-flit packets injected into the network (5 flits/packet)

21

Flit-Level Simulation (cont’d)

• Network configurations simulated
  – 4 x 4 x 4 Mesh
  – 8 x 8 x 8 Mesh
  – 16 x 16 x 4 Mesh

• Routing algorithms compared: DOR, VAL, ROMM, O1TURN, DUATO, RPM
  – DUATO is a minimal adaptive routing algorithm implemented for comparison

• Four different traffic traces used
  – Transpose traffic: (x,y,z) → (y,z,x)
  – Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
  – Uniform traffic
  – Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
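The traffic patterns above translate directly into destination functions. This sketch assumes a symmetric k x k x k mesh with coordinates in 0..k-1; the slides do not say how the formulas are adapted to the 16 x 16 x 4 mesh, and "uniform" is assumed to mean a destination drawn uniformly at random. The function names are illustrative.

```python
import random

def transpose(node):
    """Transpose traffic: (x, y, z) -> (y, z, x)."""
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_worst_case(node, k):
    """DOR-WC traffic: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def uniform(k, rng=random):
    """Uniform traffic (assumed): destination chosen uniformly at random."""
    return tuple(rng.randrange(k) for _ in range(3))
```

The three deterministic patterns are permutations of the node set, so every node sends to and receives from exactly one other node.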

22

Uniform Traffic

[Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]

23

Transpose Traffic

[Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]

24

Complement Traffic

[Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]

25

DOR-WC Traffic

[Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]

26

To sum it up …

• 3D IC technology is emerging.

• Stacking cores in 3 dimensions offers several advantages over 2D placement of cores.

• 2D minimal mesh routing algorithms have poor worst-case throughput in 3D, and VAL has a high latency penalty.

• RPM trades off latency (partially-minimal) for better worst case performance (near-optimal).

27

Thank You

Questions?