Jae W. Lee (MIT) Man Cheuk Ng (MIT) Krste Asanovic (UC Berkeley)

Jae W. Lee (MIT)Man Cheuk Ng (MIT)

Krste Asanovic (UC Berkeley)

June 23th 2008

GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED

QUALITY-OF-SERVICE IN ON-CHIP NETWORKS

ISCA-35, Beijing, China

Jae W. Lee (2 / 33)

Resource sharing increases performance variation

P P P P

P P P P

P P P P

P P P P

multi-hop on-chip network

L2$bank

L2$bank

L2$bank

L2$bank

memcont.memcont

Resource sharing(+) reduces hardware cost(-) increases performance

variation

This performance variation becomes larger and larger as the number of sharers (cores) increases.

Jae W. Lee (3 / 33)

Desired quality-of-service from shared resources

Performance isolation (fairness)P P P P

P P P P

P P P P

P P P P


L2$bank

L2$bank

L2$bank

L2$bank

memcont.memcont

acceptedthroughput [MB/s]

01 F23 E45 67 89AB DC

P P P P

P P P P

P P P P

P P P P

processorID

(hotspot)

minimum guaranteed BW

Jae W. Lee (4 / 33)

P P P P

P P P P

P P P P

P P P P


L2$bank

L2$bank

L2$bank

L2$bank

memcont.memcont

processorID


01 F23 E45 67 89AB DC


01 F23 E45 67 89AB DC

differentiated allocation

processorID

(hotspot)

Desired quality-of-service from shared resources

minimum guaranteed BW

Performance isolation (fairness)

Differentiated services (flexibility)

Jae W. Lee (5 / 33)

Resources w/ centralized arbitration are well investigated

R R R

R R R

on-chiprouters

[MICRO ’06][PACT ’07]

[USENIX sec. ’07][IBM ’07]

[MICRO ’07][ISCA ’08]

...

[HPCA ‘02][ICS ‘04]

[ISCA ‘07]…

P+L1$

P+L1$

P+L1$

P+L1$

P+L1$

P+L1$

R R R

memctrl

L2$bank

Resources with centralized arbitration SDRAM controllers L2 cache banks

They have a single entry point for all requests.→ QoS is relatively easier and well investigated.

Jae W. Lee (6 / 33)

QoS from on-chip networks is a challenge

R R R

R R R

on-chiprouters

P+L1$

P+L1$

P+L1$

P+L1$

P+L1$

P+L1$

R R R

memctrl

L2$bank

Resources with distributed arbitration multi-hop on-chip

networks

They have distributed arbitration points.→ QoS is more difficult.

Off-chip solutions cannot be directly applied because of resource constraints.

Jae W. Lee (7 / 33)

We guarantee QoS for flows

Flow: a sequence of packets between a unique pair of end nodes (src and dest) physical links shared by

flows multiple stages of

arbitration for each packet

We provide guaranteed QoS to each flow with: minimum bandwidth

guarantees bounded maximum delay

R R R R

R R R R

R R R R

R R R R

physical linkshared by 3 flows

hotspotresource

Jae W. Lee (8 / 33)

Locally fair ⇒ globally fair

With locally fair round-robin (RR) arbitration: Throughput (Flow A) = (0.5) C Throughput (Flow B) = (0.5)2 C Throughput (Flow C) = Throughput (Flow D) =

(0.5)3 C → Throughput of a flow decreases exponentially as

its distance to the destination (hotspot) increases.

SRC ASRC BSRC C

SRC D DEST

arbitrationpoint 1

arbitrationpoint 2

arbitrationpoint 3

channel rate = C [Gb/s]

Jae W. Lee (9 / 33)

Motivational simulation

In 8x8 mesh network with RR arbitration (hotspot at (8, 8))

node index (Y)node index (X)

accepted throughput[flits/cycle/node]

w/ dimension-ordered routing w/ minimal-adaptive routing

node index (Y)node index (X)


locally-fair round-robin scheduling → globally unfair bandwidth usage

1

8

hotspot

8x8 2D mesh

2345678

76

54

32

1

Jae W. Lee (10 / 33)

Desired bandwidth allocation: an example

Taken from simulation results with GSF:

node index (X)node index (Y)


node index (X)node index (Y)


Fair allocation Differentiated allocation

Jae W. Lee (11 / 33)

Globally Synchronized Frames (GSF)

provide guaranteed QoS with minimum bandwidth guarantees and maximum delay to each flow in multi-hop on-chip networks: with high network utilization comparable to

best-effort virtual-channel router with minimal area/energy overhead by avoiding

per-flow queues/structures in on-chip routers→ scalable to # of concurrent flows

Jae W. Lee (12 / 33)

Outline of this talk

Motivation Globally-Synchronized Frames: a step-by-

step development of mechanism Implementation of GSF router Evaluation Related work Conclusion

Jae W. Lee (13 / 33)

GSF takes a frame-based approach

R R R R

R R R R

R R R R

R R R R

shared physical link

time

frame #

2

Frame is a coarse quantization of time. The network can transport a finite number of flits during this

interval. We constrain each flow source to inject a certain number of

flits per frame. shorter frames → coarser BW control but lower maximum

delay typically 1-100s Kflits / frame (over all flows) in 8x8 mesh

network

01

34

Jae W. Lee (14 / 33)

Admission control of flows

R R R R

R R R R

R R R R

R R R R

shared physical link

time

frame #

2

Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval

01

34

Jae W. Lee (15 / 33)

Single frame does not service bursty traffic well

Both traffic sources have the same long-term rate: 2 flits / frame.

Allocating 2 flits / frame penalizes the bursty source.

time

frame #

01

23

45

regulated srcbursty src

Jae W. Lee (16 / 33)

Overlapping multiple frames to help bursty traffic

Overlapping multiple frames to multiply injection slots Sources can inject flits into future frames (w/ separate per-frame

buffers) Older frames have higher priorities for contended channels.

Drain time of head frame does not change. Future frames can use unclaimed BW by older frames.

Maximum network delay < 3 * (frame interval) Best-effort traffic: always lowest priority (throughput ↑)

time

frame #

0

7

1 12 2 2

3 3 34 4 4

5 5 56 6

2 future frames

head frame

Jae W. Lee (17 / 33)

Reclamation of frame buffers

Per-frame buffers (at each node) = virtual channels At every frame window shift, frame buffers (or VCs)

associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.

time

frame #

0

7

1 12 2 2

3 3 34 4 4

5 5 56 6Frame

windowshift

VC2

VC1

VC0epoch

0epoch

1epoch

2epoch

3epoch

4epoch

5

Jae W. Lee (18 / 33)

Early reclamation improves network throughput

time

frame #

Observation: Head frame usually drains much earlier than frame interval → low buffer utilization

Terminate head frame early if empty Use a global barrier network to confirm no pending packet in router or

source queue belongs to head frame. Empty buffers are reclaimed much faster and overall throughput increases.

(by >30% for hotspot traffic pattern)

0

7

1 12 2 2

3 3 34 4 4

5 5 56 6Frame

windowshift

epoch0

epoch1

epoch2

epoch3

epoch4

epoch5

0

7

1 12 2 2

3 3 34 4 4

5 5 56 6Frame

windowshift

e0 e1 e2 e3 e4 e5

76

7

e6 e70

7

12 2

3 34 45 5 5

6 6

Jae W. Lee (19 / 33)

GSF in action: two-router network example (3 VCs)

A

AVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)

CBVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)

Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ

active frame window:

GSF in action

Flow A Flow B Flow C Flow D

BC

BA D

Jae W. Lee (20 / 33)

GSF in action

GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D

A BC

BA D

A CBVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)

VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)

Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ


Jae W. Lee (21 / 33)

GSF in action


A BC

D

BA

BA



Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ


Jae W. Lee (22 / 33)

GSF in action


A BC

DA B

AVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)


Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ


Jae W. Lee (23 / 33)

GSF in action


DA B

A BC

empty!



Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ


Jae W. Lee (24 / 33)

GSF in action


DA B

CA

Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ

framewindow

shift

DCBA



Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Frame 5ƒƒƒ


Jae W. Lee (25 / 33)

Carpool lane sharing

AB

AVC 0(Fk)

VC 1&2(Fk,

F(k+1),F(k+2))

carpool lane:reserved forhead frameonlyA

AVC 0(Fk)VC 1

(F(k+1))VC2

(F(k+2))

C B

C

Buffers are expensive in on-chip environment. Cannot transport a flit even if there is an empty slot in other frame

buffers. Carpool lane sharing: relaxing frame-VC mapping to improve

buffer utilization Reserve one frame buffer (VC0) for head frame only

→ does not increase the drain time of head frame The other buffers are now colorless and can be used by any frame.

Head-of-line (HoL) blocking prevented by not allowing two packets to occupy a VC simultaneously (OK for shallow buffers).

a flit in Frame (k+1)

Jae W. Lee (26 / 33)

Baseline virtual channel (VC) router

Best-effort router Three-stage pipeline

with look-ahead routing:VA/NRC-SA-ST

Credit-based flow control VC, SW allocators: iSlip

uses round-robin arbiters(locally fair)

updates the priority of each arbiter only when that arbiter generates a winning grant

VC0

VC1

VC2

VC3

flit buffers crossbar(5x5)

VC0

VC1

VC2

VC3

ƒƒƒ

VC allocation

SW allocation

next routecomputation

to P

to N

to E

to S

to W

ƒ ƒ ƒ ƒ ƒ

Baseline router for 2D mesh networks

Jae W. Lee (27 / 33)

GSF router

VC0: carpool lane reserved for head frame

only New registers

head_frame (HF) (per node) frame_num (per VC)

NRC: priority precalculation(frame_num-HF) (mod W)

(0 is the highest priority.) VC and SW allocation:

priority enforcement Global barrier network for

frame window shifting

VC0

VC1

VC2

VC3

flit buffersframe_num

2

x

1

0

≠

≠

≠

≠

crossbar(5x5)

VC0

VC1

VC2

VC3

2

0

3

2

≠

≠

≠

≠

ƒƒƒƒƒ

VC allocation

SW allocation


head_frame=2

increment_head_frameglobal barrier network

to P

to N

to E

to S

to W

ƒƒ

hea

d_f

ram

e_em

pty

_at_

no

de_

X

ƒ ƒ ƒ ƒ ƒ


VC allocation

SW allocation

Jae W. Lee (28 / 33)

Simulation setup

Network simulator: Booksim 0.5 M cycles with 50K-cycle warming up

Network configuration 8x8 2D mesh, dimension-ordered routing, 1 flit/cycle link

capacity Four traffic patterns

one QoS traffic pattern: hotspot three best-effort traffic patterns: uniform random,

transpose, nearest neighbor packet size is either 1 or 9 flits (with 50-50 chance)

Baseline VC router 3-stage pipeline (VA/NRC-SA-ST), 2-cycle credit pipeline

delay 6 VCs/physical link, buffer depth is 5 flits/VC

GSF parameters frame window size = 6 [frames], frame size = 1,000 [flits] global barrier latency = 16 [cycles] (conservative)

Jae W. Lee (29 / 33)

Flexible guaranteed QoS provided

All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot. Ri: # of flit injection slots for Flow i eMAX: maximum epoch interval.

Example: 8x8 mesh network

node index (X)

node index (Y)


node index (X)

node index (Y)


0.02

0.04

0.06

0

(b) differentiated allocation(a) fair allocation

Jae W. Lee (30 / 33)

Flexible guaranteed QoS provided

(c) differentiated allocation


node index (Y) node index (X)

All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot. Ri: # of flit injection slots for Flow I eMAX: maximum epoch interval.

Example: 16x16 torus network with 4 hotspot nodes

Jae W. Lee (31 / 33)

Small throughput degradation for best-effort traffic

average delay [cycles]

uniform random

offered load [flits/cycle/node]

050

100150200250300350400

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

iSlip GSF/1 GSF/8 GSF/16

050

100150200250300350400

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4


050

100150200250300350400

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9


transposeaverage delay [cycles]


average delay [cycles]


nearest neighbor

Network behavior with non-QoS traffic no latency increase in

uncongested region at most 12 % degradation of

network saturation throughput → can be reduced with larger frame (at the cost of delay bound increase)

12 %

7 %

Jae W. Lee (32 / 33)

Related work

QoS support in IP or multiprocessor networks Fair Queueing [SIGCOMM ’89],

Virtual Clock [SIGCOMM ’90] Multi-rate channel switching [IEEE Comm ’86] Source throttling [HPCA ’01] Age-based arbitration [IEEE TPADS ’92, SC ’07] Rotating Combined Queueing (RCQ) [ISCA ’96]→ expensive, inflexible, and/or without guaranteed QoS

QoS on-chip networks AEthereal (strict TDM; exp. channel setup) [IEEE Design &

Test ’05] SonicsMX (per-thread queues at each node) [DATE ’05] MANGO clockless NoC (partitioning GS and BE VCs) [DATE

’05] Nostrum (routes fixed at design time) [DATE ’04]

Jae W. Lee (33 / 33)

Conclusion

The GSF network is guaranteed QoS-capable

with minimum bandwidth guarantees and maximum delay flexible

fair and differentiated bandwidth allocation no explicit channel setup required along the path

robust <5 % throughput degradation on average (12 % in the

worst) for four traffic patterns in 8x8 mesh network fairness vs overall throughput tradeoff with frame size

simple no per-flow queues/structures in on-chip routers

→ scalable relatively small modifications to a conventional VC

router

Documents

Jae W. Lee (MIT) Man Cheuk Ng (MIT) Krste Asanovic (UC Berkeley)