Prediction Router: Yet another low-latency on-chip router architecture

Page 1: Prediction Router - research.nii.ac.jp

Prediction Router: Yet another low-latency on-chip router architecture

Hiroki Matsutani (Keio Univ., Japan)
Michihiro Koibuchi (NII, Japan)
Hideharu Amano (Keio Univ., Japan)
Tsutomu Yoshinaga (UEC, Japan)

Page 2: Prediction Router - research.nii.ac.jp

Why is a low-latency router needed?

• Tile architecture
  – Many cores (e.g., processors & caches)
  – On-chip interconnection network (packet-switched network) [Dally, DAC’01]

[Figure: 16-core tile architecture; each tile consists of a core and a router]

The on-chip router affects the performance and cost of the chip

Page 3: Prediction Router - research.nii.ac.jp

System              Topology           Routing                  Switching             Flow ctrl
MIT RAW             2D mesh (32bit)    XY DOR                   WH, no VC             Credit
UPMC SPIN           Fat Tree (32bit)   Up*/down*                WH, no VC             Credit
QuickSilver ACM     H-Tree (32bit)     Up*/down*                1-flit, no VC         Credit
UMass Amherst aSOC  2D mesh            Shortest-path            Pipelined CS, no VC   Timeslot
Sun T1              Crossbar (128bit)  -                        -                     Handshake
Cell BE EIB         Ring (128bit)      Shortest-path            Pipelined CS, no VC   Credit
TRIPS (operand)     2D mesh (109bit)   YX DOR                   1-flit, no VC         On/off
TRIPS (on-chip)     2D mesh (128bit)   YX DOR                   WH, 4 VCs             Credit
Intel SCC           2D torus (32bit)   XY,YX DOR, odd-even TM   WH, no VC             Stall/go
TILE64 iMesh        2D mesh (32bit)    XY DOR                   WH, no VC             Credit
Intel 80-core NoC   2-D mesh (32bit)   Source routing           WH, 2 lanes           On/off

Why is a low-latency router needed?

Number of cores increases (e.g., 64 cores or more?)

Number of hops increases

Communication latency becomes a crucial problem

Low-latency router architectures have been extensively studied

Page 4: Prediction Router - research.nii.ac.jp

Outline: Prediction router for low-latency NoC

• Existing low-latency routers
  – Speculative router
  – Look-ahead router
  – Bypassing router

• Prediction router
  – Architecture and the prediction algorithms

• Hit rate analysis

• Evaluations
  – Hit rate, gate count, and energy consumption
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network

Page 5: Prediction Router - research.nii.ac.jp

Wormhole router: Hardware structure

[Figure: five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, connected through a 5x5 crossbar to five output ports (X+, X-, Y+, Y-, CORE); an arbiter drives the GRANT signals]

Routing, arbitration, and switch traversal are performed in a pipelined manner:

1) selecting an output channel

2) arbitration for the selected output channel

3) sending the packet to the output channel

Page 6: Prediction Router - research.nii.ac.jp

Pipeline structure: 3-cycle router

• At least 3 cycles to traverse a router
  – RC (Routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)

• A packet transfer from router (a) to router (c)

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles. The HEAD flit goes through RC, VSA, and ST at each of routers A, B, and C; DATA 1 to DATA 3 follow with SA and ST stages]

At least 12 cycles to transfer a packet from router (a) to router (c)

Speculative router: VA/SA in parallel [Peh,HPCA’01]

VA & SA are speculatively performed in parallel

To perform RC and VSA in parallel, look-ahead routing is used
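The 12-cycle figure follows from simple pipeline arithmetic: the head flit passes through every stage of every router, and the remaining flits drain one cycle apart behind it. A minimal sketch of that count, assuming zero load and no separate link delay (as in the timing chart); the function name and parameters are illustrative and not from the talk:

```python
def zero_load_cycles(routers: int, stages_per_router: int, flits: int) -> int:
    """Cycles until the last flit leaves the last router (no contention)."""
    head = routers * stages_per_router  # head flit walks every stage of every router
    tail = flits - 1                    # body flits follow one cycle apart
    return head + tail

# 4-flit packet (HEAD + DATA 1..3) crossing routers A, B, and C
print(zero_load_cycles(routers=3, stages_per_router=3, flits=4))  # 12
print(zero_load_cycles(routers=3, stages_per_router=2, flits=4))  # 9
```

The second call shows how the same count shrinks when a router needs only two stages per hop.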

Page 7: Prediction Router - research.nii.ac.jp

Look-ahead router: RC/VA in parallel

• At least 3 cycles to traverse a router
  – NRC (Next routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles. The HEAD flit goes through NRC, VSA, and ST at router A; at routers B and C, NRC is drawn alongside VSA and ST; DATA 1 to DATA 3 follow with SA and ST stages]

NRC: routing computation for the next hop; the output port of router (i+1) is selected by router i

VSA can be performed without waiting for NRC
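As an illustration of "the output port of router (i+1) is selected by router i", the following sketch computes the next-hop decision under XY dimension-order routing on a 2-D mesh. The coordinate system, the `STEP` table, and the function names are assumptions made for this example; only the port names come from the slides.

```python
# Port names follow the slides; coordinates and helper names are assumptions.
STEP = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}

def xy_route(cur, dst):
    """Dimension-order (XY) routing: finish X first, then Y, then eject."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "X+" if dx > cx else "X-"
    if cy != dy:
        return "Y+" if dy > cy else "Y-"
    return "CORE"

def next_route(cur, out_port, dst):
    """NRC at router i: compute the output port that router i+1 will use."""
    sx, sy = STEP[out_port]
    neighbor = (cur[0] + sx, cur[1] + sy)
    return xy_route(neighbor, dst)

# A packet at (0, 0) heading to (2, 1) leaves on X+; router (0, 0) also
# works out that the next router, (1, 0), will forward it on X+.
out = xy_route((0, 0), (2, 1))
print(out, next_route((0, 0), out, (2, 1)))  # X+ X+
```

Because the next router receives its output port already computed, it can start VSA immediately, which is what removes the dependency between the routing and allocation stages.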

Page 8: Prediction Router - research.nii.ac.jp

Look-ahead router: RC/VA in parallel

• At least 2 cycles to traverse a router
  – NRC + VSA (Next routing computation / arbitrations)
  – ST (Switch traversal)

[Figure: pipeline timing chart, elapsed time 1 to 9 cycles. At each of routers A, B, and C the HEAD flit performs NRC and VSA in the same cycle, followed by ST; DATA 1 to DATA 3 follow]

There is no dependency between NRC & VSA, so NRC & VSA are performed in parallel

Typical example of a 2-cycle router [Dally’s book, 2004]

At least 9 cycles to transfer a packet from router (a) to router (c)

Packing NRC, VSA, and ST into a single stage would harm the operating frequency

Page 9: Prediction Router - research.nii.ac.jp

Bypassing router: skip some stages

• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]

[Figure: path from SRC to DST over virtual bypassing paths; routers that are not bypassed take 3 cycles each, bypassed routers take 1 cycle each]

Page 10: Prediction Router - research.nii.ac.jp

Bypassing router: skip some stages

• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]

• Pipeline bypassing utilizing the regularity of DOR
  – E.g., Mad postman [Izu, PDP’94]

• Pipeline stages on frequently used paths are skipped
  – E.g., Dynamic fast path [Park, HOTI’07]

• Pipeline stages on user-specific paths are skipped
  – E.g., Preferred path [Michelogiannakis, NOCS’07]
  – E.g., DBP [Koibuchi, NOCS’08]

[Figure: path from SRC to DST over virtual bypassing paths; routers that are not bypassed take 3 cycles each, bypassed routers take 1 cycle each]

We propose a low-latency router based on multiple predictors

Page 12: Prediction Router - research.nii.ac.jp

Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06] [Yoshinaga,IWIA’07]

• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitrate for the predicted port (SA pre-execution)

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles, showing the original 3-cycle pipeline (RC, VSA, ST) for the HEAD flit at routers A, B, and C, with DATA 1 to DATA 3 following]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer

E.g., we can expect a 1.6-cycle transfer per router if 70% of predictions hit
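The 1.6-cycle estimate is the weighted average of the two per-router outcomes: 1 cycle on a hit and 3 cycles on a miss. A small sketch of that expectation (the function name and default arguments are illustrative; the 1-cycle and 3-cycle costs are the ones stated in the slides):

```python
def expected_cycles_per_hop(hit_rate: float, hit_cycles: int = 1, miss_cycles: int = 3) -> float:
    """Average router traversal time for a given prediction hit rate."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(round(expected_cycles_per_hop(0.7), 2))  # 1.6
print(round(expected_cycles_per_hop(1.0), 2))  # 1.0
```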

Page 13: Prediction Router - research.nii.ac.jp

Prediction router for 1-cycle transfer

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles. The prediction misses (MISS) at router A, so the HEAD flit goes through RC, VSA, and ST there; routers B and C are still shown with the original 3-cycle pipeline]

Page 14: Prediction Router - research.nii.ac.jp

Prediction router for 1-cycle transfer

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles. MISS at router A (RC, VSA, ST); HIT at router B (ST only); router C is still shown with the original 3-cycle pipeline]

Page 15: Prediction Router - research.nii.ac.jp

Prediction router for 1-cycle transfer

[Figure: pipeline timing chart, elapsed time 1 to 12 cycles. MISS at router A (RC, VSA, ST); HIT at routers B and C (ST only at each)]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer

E.g., we can expect a 1.6-cycle transfer per router if 70% of predictions hit

Page 16: Prediction Router - research.nii.ac.jp

Prediction router: Prediction algorithms

• An efficient predictor is key

• A single predictor isn’t enough for applications with different traffic patterns [Yoshinaga,IWIA’06] [Yoshinaga,IWIA’07]

• Prediction router
  – Multiple predictors (A, B, C) for each input channel
  – Select one of them in response to a given network environment

1. Random
2. Static Straight (SS): An output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: User can specify which output channel is accelerated
4. Latest Port (LP): Previously used output channel is selected
5. Finite Context Method (FCM): The most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, …) [Burtscher, TC’02]
6. Sampled Pattern Match (SPM): Pattern matching using a record table [Jacquet, TIT’02]
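To make three of these algorithms concrete, here is a software sketch of SS, LP, and FCM as small classes with a common `predict()`/`update()` interface. This is an illustration only, not the hardware design from the talk: the class names, the interface, the default port, and the convention that an input port is named after the neighbor it faces are all assumptions.

```python
from collections import Counter, defaultdict, deque

# Assumption: an input port is named after the neighbor it faces, so a flit
# arriving on X- keeps going straight by leaving on X+.
STRAIGHT = {"X-": "X+", "X+": "X-", "Y-": "Y+", "Y+": "Y-"}

class StaticStraight:
    """SS: always predict the output on the same dimension."""
    def __init__(self, in_port):
        self.out = STRAIGHT[in_port]
    def predict(self):
        return self.out
    def update(self, used_port):
        pass  # static: history is ignored

class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self, initial="X+"):
        self.last = initial
    def predict(self):
        return self.last
    def update(self, used_port):
        self.last = used_port

class FiniteContext:
    """Order-n FCM: predict the port seen most often after the last n ports."""
    def __init__(self, n=1, default="X+"):
        self.context = deque(maxlen=n)
        self.counts = defaultdict(Counter)
        self.default = default
    def predict(self):
        seen = self.counts.get(tuple(self.context))
        return seen.most_common(1)[0][0] if seen else self.default
    def update(self, used_port):
        self.counts[tuple(self.context)][used_port] += 1
        self.context.append(used_port)

def hit_rate(predictor, trace):
    """Fraction of packets whose output port was predicted correctly."""
    hits = 0
    for port in trace:
        hits += predictor.predict() == port
        predictor.update(port)
    return hits / len(trace)

# Alternating traffic defeats LP but is learned quickly by an order-1 FCM.
trace = ["X+", "Y+", "X+", "Y+", "X+", "Y+"]
print(hit_rate(LatestPort(), trace), hit_rate(FiniteContext(n=1), trace))
```

The FCM sketch keeps one counter per (context, port) pair, which hints at why a hardware FCM predictor needs extra counter storage compared with SS or LP.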

Page 17: Prediction Router - research.nii.ac.jp

Basic operation @ Correct prediction

[Figure: prediction router with input FIFOs (X+, X-, Y+, Y-, CORE), predictors (A, B, C) on an input channel, an arbiter, and a 5x5 crossbar to the output ports (X+, X-, Y+, Y-, CORE); the crossbar port to X+ is reserved]

Idle state: Output port X+ is selected and reserved (the crossbar is reserved)

1st cycle: Incoming flit is transferred to X+ without RC and VSA

1st cycle: RC is performed; the prediction is correct!

2nd cycle: Next flit is transferred to X+ without RC and VSA

1-cycle transfer using the reserved crossbar port when the prediction hits

Page 18: Prediction Router - research.nii.ac.jp

Basic operation @ Miss prediction

[Figure: prediction router with input FIFOs (X+, X-, Y+, Y-, CORE), predictors (A, B, C) on an input channel, an arbiter, and a 5x5 crossbar to the output ports (X+, X-, Y+, Y-, CORE); a dead flit is sent to X+ and a KILL signal follows it]

Idle state: Output port X+ is selected and reserved

1st cycle: Incoming flit is transferred to X+ without RC and VSA

1st cycle: RC is performed; the prediction is wrong! (X- is correct)

Kill signal to X+ is asserted

2nd/3rd cycle: Dead flit is removed; retransmission to the correct port

Even with a misprediction, a flit is transferred in 3 cycles, as in the original router

More energy for retransmission
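The hit and miss cases can be summarized as one decision per head flit. The sketch below is a behavioral model only, with a hypothetical function name and event strings written for this example; it mirrors the cycle counts stated in the slides (1 cycle on a hit, 3 cycles on a miss).

```python
def forward_head_flit(predicted_port: str, routed_port: str):
    """Return (cycles, events) for one head flit at one router."""
    events = [f"cycle 1: flit sent to reserved port {predicted_port}; RC runs in parallel"]
    if routed_port == predicted_port:
        # Hit: the speculative transfer was the real one.
        return 1, events + ["cycle 1: RC confirms the prediction"]
    # Miss: the speculative copy is a dead flit and must be cancelled.
    events.append(f"cycle 1: RC says {routed_port}; kill signal asserted on {predicted_port}")
    events.append(f"cycles 2-3: dead flit removed; flit resent to {routed_port}")
    return 3, events

print(forward_head_flit("X+", "X+")[0])  # 1
print(forward_head_flit("X+", "X-")[0])  # 3
```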

Page 20: Prediction Router - research.nii.ac.jp

Prediction hit rate analysis

• Formulas to calculate the prediction hit rates on
  – 2-D torus (Random, LP, SS, FCM, and SPM)
  – 2-D mesh (Random, LP, SS, FCM, and SPM)
  – Fat tree (Random and LRU)
  – To forecast which prediction algorithm is suited for a given network environment without simulations

• Accuracy of the analytical model is confirmed through simulations

Derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)

Page 22: Prediction Router - research.nii.ac.jp

Evaluation items

• Hit rate / Comm. latency (how many cycles?)
• Area (gate count)
• Energy cons. [pJ / bit]

[Figure: evaluation flow. Flit-level network simulation for hit rate and latency; Design Compiler (synthesis) with the Fujitsu 65nm library, Astro (place & route), NC-Verilog (simulation), and Power Compiler, exchanging SAIF and SDF files]

Table 1: Router & network parameters

Packet length         4-flit (1-flit: 64 bit)
Switching technique   wormhole
Channel buffer size   4-flit / VC
Number of VCs         1 or 2 VCs
Cycle / hop (miss)    3 stage
Cycle / hop (hit)     1 stage
*Topology and traffic are mentioned later

Table 2: Process library

CMOS process          65nm
Core voltage          1.20V
Temperature           25C

Table 3: CAD tools used

Design compiler       2006.06
Astro                 2007.03

Page 23: Prediction Router - research.nii.ac.jp

3 case studies of prediction router

• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network

[Figure: evaluation flow (flit-level network simulation; Design Compiler, Astro, NC-Verilog, Power Compiler)]

2-D mesh network

• The most popular network topology
  – MIT’s RAW [Taylor,ISCA’04]
  – Intel’s 80-core [Vangal,ISSCC’07]

• Dimension-order routing (XY routing)

Here, we show the results of case studies 1 and 2 together

Page 24: Prediction Router - research.nii.ac.jp

Case study 1: Zero-load comm. latency

Uniform random traffic on 4x4 to 16x16 meshes

[Figure: communication latency [cycles] vs. network size (k-ary 2-mesh) for the original router, pred router (SS), and pred router (100% hit); simulation results (the analytical model also shows the same result)]

(*) 1-cycle transfer for correct prediction, 3-cycle for wrong prediction

35.8% reduced for 8x8 cores

48.2% reduced for 16x16 cores

More latency is reduced (48% for k=16) as network size increases

Page 25: Prediction Router - research.nii.ac.jp

Case study 2: Hit rate @ 8x8 mesh

• SS: go straight (efficient for long straight comm.)
• LP: the last one
• FCM: frequently used pattern

[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]

Page 26: Prediction Router - research.nii.ac.jp

Case study 2: Hit rate @ 8x8 mesh

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern

[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]

Page 27: Prediction Router - research.nii.ac.jp

Case study 2: Hit rate @ 8x8 mesh

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern (all-rounder!)

[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]

• Existing bypassing routers use
  – Only a static or a single bypassing policy

However, the effective bypassing policy depends on traffic patterns…

• Prediction router supports
  – Multiple predictors which can be switched in a cycle
  – To accelerate a wider range of applications

Page 28: Prediction Router - research.nii.ac.jp

Case study 2: Area & Energy

• Area (gate count)
  – Original router
  – Pred router (SS + LP)
  – Pred router (SS+LP+FCM)

• Energy consumption

[Figure: router area [kilo gates]; Verilog-HDL designs synthesized with 65nm library]

6.4 - 15.9% increased, depending on type and number of predictors

Light-weight (small overhead)

FCM is an all-rounder, but requires counters

Page 29: Prediction Router - research.nii.ac.jp

Case study 2: Area & Energy

• Area (gate count)
  – Original router
  – Pred router (SS + LP)
  – Pred router (SS+LP+FCM)

[Figure: router area [kilo gates]]

6.4 - 15.9% increased, depending on type and number of predictors

• Energy consumption
  – Original router
  – Pred router (70% hit)
  – Pred router (100% hit)

[Figure: flit switching energy [pJ / bit]]

Misprediction consumes power; 9.5% increased if hit rate is 70%

This estimation is pessimistic.

1. More energy is consumed in links, so the effect of the router energy overhead is reduced

2. The application will finish early, so more energy is saved

Latency 35.8%-48.2% saved w/ reasonable area/energy overheads

Page 30: Prediction Router - research.nii.ac.jp

3 case studies of prediction router

• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network

[Figure: evaluation flow (flit-level network simulation; Design Compiler, Astro, NC-Verilog, Power Compiler)]

Page 31: Prediction Router - research.nii.ac.jp

Case study 3: Fat tree network

[Figure: fat tree network with upward (Up) and downward (Down) transfers]

1. LRU algorithm: LRU output port is selected for upward transfer

2. LRU + LP algorithm: Plus, LP for downward transfer
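A least-recently-used choice among the upward ports can be sketched with a simple ordered list. The class name, the port labels `up0`/`up1`, and the list-based bookkeeping are assumptions for illustration; the slides only state that the LRU output port is selected for upward transfer.

```python
class LRUPortPredictor:
    """Predict the upward output port that was used least recently."""
    def __init__(self, up_ports):
        self.order = list(up_ports)  # least recently used first

    def predict(self):
        return self.order[0]

    def update(self, used_port):
        # Move the port that was just used to the most-recent end.
        if used_port in self.order:
            self.order.remove(used_port)
            self.order.append(used_port)

p = LRUPortPredictor(["up0", "up1"])
p.update("up0")
print(p.predict())  # up1
```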

Page 32: Prediction Router - research.nii.ac.jp

Case study 3: Fat tree network

[Figure: fat tree network with upward (Up) and downward (Down) transfers]

1. LRU algorithm: LRU output port is selected for upward transfer

2. LRU + LP algorithm: Plus, LP for downward transfer

• Comm. latency @ uniform
  – Original router
  – Pred router (LRU)
  – Pred router (LRU + LP)

[Figure: communication latency [cycles] vs. network size (# of cores)]

Latency 30.7% reduced @ 256-core; small area overhead (7.8%)

Page 33: Prediction Router - research.nii.ac.jp

Summary of the prediction router

• Prediction router for low-latency NoCs
  – Multiple predictors, which can be switched in a cycle
  – Architecture and six prediction algorithms
  – Analytical model of prediction hit rates

• Evaluations of prediction router
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network

• Results (from three case studies)
  1. Prediction router can be applied to various NoCs
  2. Communication latency reduced with small overheads
  3. Prediction router with multiple predictors can accelerate a wider range of applications

From case studies 1 & 2:
  Area overhead: 6.4% (SS+LP)
  Energy overhead: 9.5% (worst)
  Latency reduction: up to 48%

Page 34: Prediction Router - research.nii.ac.jp

Thank you for your attention

It would be very helpful if you would speak slowly. Thank you in advance.

Page 35: Prediction Router - research.nii.ac.jp

Prediction router: New modifications

[Figure: prediction router with input FIFOs (X+, X-, Y+, Y-, CORE), predictors (A, B, C), an arbiter, a 5x5 crossbar to the output ports (X+, X-, Y+, Y-, CORE), and KILL signals]

• Predictors for each input channel

• Kill mechanism to remove dead flits

• Two-level arbiter
  – “Reservation” has higher priority
  – “Tentative reservation” by the pre-execution of VSA

Currently, the critical path is related to the arbiter

Page 36: Prediction Router - research.nii.ac.jp

Prediction router: Predictor selection

• Static scheme
  – A predictor (A, B, or C) is selected by the user per application
  – Configuration table: Application 1: Predictor B; Application 2: Predictor A; Application 3: Predictor C; …
  – Simple, but pre-analysis is needed

• Dynamic scheme
  – A predictor is adaptively selected
  – Count up if each predictor hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
  – A predictor is selected every n cycles (e.g., n = 10,000)
  – Flexible, but more energy
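The dynamic scheme can be sketched as a set of hit counters that are compared every n cycles. The class name, method names, and the choice to reset the counters after each selection are assumptions of this example; the slides specify only the counting and the periodic selection.

```python
class DynamicSelector:
    """Pick the predictor with the most hits at the end of each interval."""
    def __init__(self, names, interval=10_000):
        self.hits = {name: 0 for name in names}
        self.interval = interval
        self.cycles = 0
        self.active = names[0]  # arbitrary starting predictor

    def record_hit(self, name):
        self.hits[name] += 1

    def tick(self):
        """Advance one cycle; re-select at the end of each interval."""
        self.cycles += 1
        if self.cycles % self.interval == 0:
            self.active = max(self.hits, key=self.hits.get)
            # Assumption: counters restart so each interval is judged on its own.
            self.hits = {name: 0 for name in self.hits}
        return self.active

# Counter values from the slide: A = 100, B = 80, C = 120, so C is chosen.
sel = DynamicSelector(["A", "B", "C"], interval=4)
sel.hits.update({"A": 100, "B": 80, "C": 120})
for _ in range(4):
    sel.tick()
print(sel.active)  # C
```

Keeping every predictor running so that its hits can be counted is consistent with the slide's note that the dynamic scheme is flexible but costs more energy.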

Page 37: Prediction Router - research.nii.ac.jp

Case study 1: Router critical path

• RC: Routing comp.

• VSA: Arbitration

• ST: Switch traversal

[Figure: stage delay [FO4s] of each pipeline stage for the original router and the pred router (SS); ST can occur in these stages of the prediction router]

Critical path delay increased by 6.2% compared with the original router

Page 38: Prediction Router - research.nii.ac.jp

Case study 2: Hit rate @ 8x8 mesh

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern (all-rounder!)
• Custom: user-specific path (efficient for simple comm.)

[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]

Page 39: Prediction Router - research.nii.ac.jp

Case study 4: Spidergon network

• Spidergon topology [Coppola,ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across-first routing

• Hit rate @ Uniform

Page 40: Prediction Router - research.nii.ac.jp

Case study 4: Spidergon network

• Spidergon topology [Coppola,ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across-first routing

• Hit rate @ Uniform
  – SS: Go straight
  – LP: Last used one
  – FCM: Frequently used one

[Figure: prediction hit rate [%] vs. network size (# of cores)]

Hit rates of SS & FCM are almost the same

High hit rate is achieved (80% for 64 cores; 94% for 256 cores)

Page 41: Prediction Router - research.nii.ac.jp

4 case studies of prediction router

• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
• Case study 4: Spidergon network

[Figure: evaluation flow (flit-level network simulation; Design Compiler, Astro, NC-Verilog, Power Compiler)]