A Cost Effective Centralized Adaptive Routing for Networks on Chip


Ran Manevich, Israel Cidon, Avinoam Kolodny, Isask’har (Zigi) Walter and Shmuel Wimer. Technion – Israel Institute of Technology. QNoC Research Group.


Adaptive Routing in NoCs – Local vs. Global Information

2D Mesh NoC

[Figure: a packet routed from the upper-left corner (source) to the bottom-right corner (destination) using local congestion information, and the same packet routed using global information. Legend: low, medium and high congestion. Callout: "I CAN MAKE IT!!!"]

Global traffic information is essential to make the right decision!

Route Selection - ATDOR

ATDOR - Adaptive Toggle Dimension Ordered Routing

Keep it simple! Centralized selection: XY or YX.

Routing tables in sources. One bit per destination.

The option with the less congested bottleneck link is preferred.
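The selection between XY and YX can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the coordinate-based node naming, the `dor_path`, `bottleneck` and `select_route` helpers, the load dictionary, and the tie-break in favor of XY are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]    # (x, y) router coordinates in the 2D mesh (assumed naming)
Link = Tuple[Node, Node]  # directed link between two adjacent routers


def dor_path(src: Node, dst: Node, order: str) -> List[Link]:
    """Return the links of a dimension-ordered route; order is 'XY' or 'YX'."""
    links: List[Link] = []
    x, y = src
    for dim in order:  # travel fully along the first dimension, then the second
        target = dst[0] if dim == "X" else dst[1]
        while (x if dim == "X" else y) != target:
            cur = x if dim == "X" else y
            step = 1 if target > cur else -1
            nxt = (x + step, y) if dim == "X" else (x, y + step)
            links.append(((x, y), nxt))
            x, y = nxt
    return links


def bottleneck(path: List[Link], load: Dict[Link, float]) -> float:
    """Load of the most congested link on the path (unlisted links count as 0)."""
    return max((load.get(link, 0.0) for link in path), default=0.0)


def select_route(src: Node, dst: Node, load: Dict[Link, float]) -> str:
    """Prefer the route whose bottleneck link is less loaded (ties go to XY, an assumption)."""
    xy = bottleneck(dor_path(src, dst, "XY"), load)
    yx = bottleneck(dor_path(src, dst, "YX"), load)
    return "XY" if xy <= yx else "YX"


# Hypothetical load map: one hot link on the XY route, a cooler one on the YX route.
example_load = {((1, 0), (2, 0)): 0.9, ((0, 1), (1, 1)): 0.4}
chosen = select_route((0, 0), (2, 1), example_load)  # "YX"
```

Because the choice is binary, the result fits the one-bit-per-destination routing table kept in each source.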

ATDOR Illustration 1

Five identical flows, 100 MB/s each.

Links modeled as M/M/1 queues. Delay of a single link:

$$D_{LINK} = \frac{\text{Traffic}}{\text{Capacity} - \text{Traffic}}$$

Link capacity is 210 MB/s.

Initial routing - XY
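To get a feel for the numbers, the link delay model can be evaluated for links carrying one or two of the 100 MB/s flows. The sketch below assumes the slide's delay expression reads Traffic / (Capacity − Traffic); the function name `link_delay` is hypothetical, and the result is a relative delay factor rather than a value in nanoseconds.

```python
def link_delay(traffic: float, capacity: float) -> float:
    """Relative M/M/1-style link delay, assuming D = Traffic / (Capacity - Traffic)."""
    if not 0 <= traffic < capacity:
        raise ValueError("traffic must satisfy 0 <= traffic < capacity")
    return traffic / (capacity - traffic)


CAPACITY = 210.0  # MB/s, from the illustration
FLOW = 100.0      # MB/s per flow, from the illustration

one_flow = link_delay(1 * FLOW, CAPACITY)   # 100 / 110, roughly 0.91
two_flows = link_delay(2 * FLOW, CAPACITY)  # 200 / 10 = 20.0
```

Under this assumption a link shared by two flows is about 22 times slower than a link carrying one, and a third flow would exceed the link capacity, which is why spreading the flows over XY and YX routes matters so much here.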

Centralized Routing – How?

Option 1 – Continuous calculation of optimal routing for the active sessions:

Achievable load balancing

Speed and computation complexity

System complexity

Centralized Routing – How?

Option 2 – Iterative serial selection based on traffic load measurements between XY and YX for all source-destination pairs:

Achievable load balancing

Speed and computation complexity

System complexity

ATDOR illustration 1

Step 1: re-routed flow 1->15.

Step 2: re-routed flow 2->8. Average delay: 37 ns.

Step 3: re-routed flow 2->15. Average delay: 22 ns.

What did we just see?

For each flow we:

1. Calculated the better route.
2. Updated the routing table of the source.
3. Waited for the update to take effect and measured global traffic load.

Performing steps 1-3 for each flow is slow and not scalable.

Steps 2 and 3 are unified for all destinations of a single source:

Achievable load balancing

Speed and computation complexity

Scalability

Back to illustration 1

Step 1: re-routed flow 1->15.

Step 2: re-routed flows 2->8 and 2->15. Average delay: 22 ns.

Step 3: re-routed flow 4->15. Average delay: 22 ns.

Step 4: re-routed flow 1->15. Average delay: 22 ns.

Step 5: re-routed flows 2->8 and 2->15.

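Problem #1

Changing routing may increase congestion and cause fluctuations.

Solution: Change routing only if the alternative is better by the margin α, 0 < α < 1:

If Current Route = XY:

$$\text{NextRoute} = \begin{cases} YX & \text{if } \max[\text{Load}_{YX}] \le \alpha \cdot \max[\text{Load}_{XY}] \\ XY & \text{if } \max[\text{Load}_{YX}] > \alpha \cdot \max[\text{Load}_{XY}] \end{cases}$$

Else if Current Route = YX:

$$\text{NextRoute} = \begin{cases} XY & \text{if } \max[\text{Load}_{XY}] \le \alpha \cdot \max[\text{Load}_{YX}] \\ YX & \text{if } \max[\text{Load}_{XY}] > \alpha \cdot \max[\text{Load}_{YX}] \end{cases}$$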
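The margin rule can be written as a small function. This is a sketch under assumptions: the function name `next_route` and its argument names are hypothetical, and the "switch" comparison is taken to be "less than or equal", as the complement of the "greater than" branch that survives on the slide.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float, alpha: float) -> str:
    """Toggle between XY and YX only if the alternative's bottleneck is better by margin alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if current == "XY":
        # Switch to YX only when its bottleneck load is at most alpha times the XY bottleneck.
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        # Symmetric test for switching back to XY.
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current must be 'XY' or 'YX'")


# With alpha = 3/4, a slightly better alternative is not enough to toggle the route.
stay = next_route("XY", max_load_xy=0.8, max_load_yx=0.7, alpha=0.75)    # "XY"
switch = next_route("XY", max_load_xy=0.8, max_load_yx=0.5, alpha=0.75)  # "YX"
```

A smaller α demands a larger improvement before a flow is re-routed, which damps the fluctuations at the cost of leaving some mildly unbalanced routes in place.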

ATDOR illustration 2

Step 1: re-routed flows 1->14, 1->15 and 1->16.

Step 2: re-routed flows 1->14, 1->15 and 1->16.

Step 3: re-routed flows 1->14, 1->15 and 1->16.

Problem #2

Coupling among flows sharing the same source.

Solution: Re-routing counters

$C_{I,J}$ counts the routing changes of the flow from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, the routing of $F_{I,J}$ is locked. A possible definition of the limits $L_{I,J}$:

$$L_{I,J} = (I + J) \bmod 3$$
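The counter mechanism can be sketched as follows. The class and method names are hypothetical, and the limit is assumed to be (I + J) mod 3; the addition is inferred from the "changes left" values in the illustrations (for example 2 for flow 1->16 and 1 for flow 1->15).

```python
from typing import Dict, Tuple


class RerouteCounters:
    """Per-flow re-routing counters C[I,J] with limits L[I,J] = (I + J) mod 3 (assumed form)."""

    def __init__(self) -> None:
        self.changes: Dict[Tuple[int, int], int] = {}  # C[I,J], missing entries mean 0

    @staticmethod
    def limit(src: int, dst: int) -> int:
        return (src + dst) % 3

    def changes_left(self, src: int, dst: int) -> int:
        return self.limit(src, dst) - self.changes.get((src, dst), 0)

    def try_reroute(self, src: int, dst: int) -> bool:
        """Record a routing change if the flow is not locked; return whether it was allowed."""
        if self.changes_left(src, dst) <= 0:
            return False  # counter reached its limit: the flow's routing is locked
        self.changes[(src, dst)] = self.changes.get((src, dst), 0) + 1
        return True


counters = RerouteCounters()
initial = {flow: counters.changes_left(*flow) for flow in [(1, 16), (1, 15), (1, 14)]}
# initial == {(1, 16): 2, (1, 15): 1, (1, 14): 0}
```

Because flows from the same source get different limits, they stop toggling at different times, which breaks the coupling between them.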

Back to illustration 2

Limits: $L_{I,J} = (I + J) \bmod 3$

Routing changes left: 2 for flow 1->16, 1 for flow 1->15, 0 for flow 1->14.

Routing changes left: 1 for flow 1->16, 0 for flow 1->15, 0 for flow 1->14. Average delay: 73 ns.

Routing changes left: 0 for flow 1->16, 0 for flow 1->15, 0 for flow 1->14. Average delay: 22 ns.

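Bring it all together

Limits: $L_{I,J} = (I + J) \bmod 3$

Routing changes left: 1 for flow 1->15, 1 for flow 2->8, 2 for flow 2->15, 1 for flow 4->15.

Routing changes left: 0 for flow 1->15, 1 for flow 2->8, 2 for flow 2->15, 1 for flow 4->15.

Routing changes left: 0 for flow 1->15, 0 for flow 2->8, 1 for flow 2->15, 1 for flow 4->15. Average delay: 22 ns.

Routing changes left: 0 for flow 1->15, 0 for flow 2->8, 1 for flow 2->15, 0 for flow 4->15. Average delay: 22 ns.

Routing changes left: 0 for flow 1->15, 0 for flow 2->8, 0 for flow 2->15, 0 for flow 4->15. Average delay: 14 ns.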

Centralized Adaptive Routing for NoCs - Architecture

Aggregation of traffic load measurements into Traffic Load Maps.

Routing control.

Local traffic load measurements inside the routers.

Load Measurements Aggregation

An illustration of aggregation of load values in a 4X4 2D mesh.

A congestion value is written to each traffic load map every clock cycle.

ATDOR – Route Selection Circuit

Combinatorial pipelined implementation.

Result every ATDOR clock cycle.

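Maximally loaded links of the two alternatives are compared. Next route, with 0 < α < 1:

If Current Route = XY:

$$\text{NextRoute} = \begin{cases} YX & \text{if } \max[\text{Load}_{YX}] \le \alpha \cdot \max[\text{Load}_{XY}] \\ XY & \text{if } \max[\text{Load}_{YX}] > \alpha \cdot \max[\text{Load}_{XY}] \end{cases}$$

Else if Current Route = YX:

$$\text{NextRoute} = \begin{cases} XY & \text{if } \max[\text{Load}_{XY}] \le \alpha \cdot \max[\text{Load}_{YX}] \\ YX & \text{if } \max[\text{Load}_{XY}] > \alpha \cdot \max[\text{Load}_{YX}] \end{cases}$$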

Hardware Requirements

The whole mechanism was implemented on an xc5vlx50t Virtex-5 FPGA.

Estimated area for the 45 nm technology node.

Per-router hardware overheads in % for a NoC with typical-size (50 KGates) virtual channel routers.

Average Packet Delay – Uniform Traffic

Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Uniform traffic pattern.

Average Packet Delay – Transpose Traffic

Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Transpose traffic pattern.

Average Packet Delay – Hotspot Traffic

Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. 4 Hotspots traffic pattern.

Control Iteration Duration

Number of re-routed flows vs. time. 8X8 2D Mesh, ATDOR clock of 100 MHz.

Panels: α = 15/16 and α = 3/4.

CMP DNUCA - Architecture

8X8 CMP DNUCA (Dynamic Non Uniform Cache Array) with 8 CPUs and 56 cache banks:

CMP DNUCA – Saturation Throughput

Saturation throughput - Splash 2 and Parsec benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:

Conclusions

Centralized adaptive routing is feasible for NoCs.

ATDOR: Centralized selection between XY and YX for each source-destination pair.

Hardware overhead: <4% of an 8X8 typical NoC.

Average saturation throughput improvement:

Synthetic patterns: 12.1% vs. RCA, 19.3% vs. O1TURN.

Splash 2 and Parsec benchmarks: 12.8% vs. RCA, 22.8% vs. O1TURN.
