Upload
ghita
View
32
Download
0
Embed Size (px)
DESCRIPTION
GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS. Jae W. Lee (MIT) Man Cheuk Ng (MIT) Krste Asanovic (UC Berkeley). ISCA-35, Beijing, China. June 23 th 2008. Resource sharing increases performance variation. Resource sharing ( + ) reduces hardware cost - PowerPoint PPT Presentation
Citation preview
Jae W. Lee (MIT)Man Cheuk Ng (MIT)
Krste Asanovic (UC Berkeley)
June 23th 2008
GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED
QUALITY-OF-SERVICE IN ON-CHIP NETWORKS
ISCA-35, Beijing, China
Jae W. Lee (2 / 33)
Resource sharing increases performance variation
P P P P
P P P P
P P P P
P P P P
multi-hop on-chip network
L2$bank
L2$bank
L2$bank
L2$bank
memcont.memcont
Resource sharing(+) reduces hardware cost(-) increases performance
variation
This performance variation becomes larger and larger as the number of sharers (cores) increases.
Jae W. Lee (3 / 33)
Desired quality-of-service from shared resources
Performance isolation (fairness)P P P P
P P P P
P P P P
P P P P
multi-hop on-chip network
L2$bank
L2$bank
L2$bank
L2$bank
memcont.memcont
acceptedthroughput [MB/s]
01 F23 E45 67 89AB DC
P P P P
P P P P
P P P P
P P P P
processorID
(hotspot)
minimum guaranteed BW
Jae W. Lee (4 / 33)
P P P P
P P P P
P P P P
P P P P
multi-hop on-chip network
L2$bank
L2$bank
L2$bank
L2$bank
memcont.memcont
processorID
acceptedthroughput [MB/s]
01 F23 E45 67 89AB DC
acceptedthroughput [MB/s]
01 F23 E45 67 89AB DC
differentiated allocation
processorID
(hotspot)
Desired quality-of-service from shared resources
minimum guaranteed BW
Performance isolation (fairness)
Differentiated services (flexibility)
Jae W. Lee (5 / 33)
Resources w/ centralized arbitration are well investigated
R R R
R R R
on-chiprouters
[MICRO ’06][PACT ’07]
[USENIX sec. ’07][IBM ’07]
[MICRO ’07][ISCA ’08]
...
[HPCA ‘02][ICS ‘04]
[ISCA ‘07]…
P+L1$
P+L1$
P+L1$
P+L1$
P+L1$
P+L1$
R R R
memctrl
L2$bank
Resources with centralized arbitration SDRAM controllers L2 cache banks
They have a single entry point for all requests.→ QoS is relatively easier and well investigated.
Jae W. Lee (6 / 33)
QoS from on-chip networks is a challenge
R R R
R R R
on-chiprouters
P+L1$
P+L1$
P+L1$
P+L1$
P+L1$
P+L1$
R R R
memctrl
L2$bank
Resources with distributed arbitration multi-hop on-chip
networks
They have distributed arbitration points.→ QoS is more difficult.
Off-chip solutions cannot be directly applied because of resource constraints.
Jae W. Lee (7 / 33)
We guarantee QoS for flows
Flow: a sequence of packets between a unique pair of end nodes (src and dest) physical links shared by
flows multiple stages of
arbitration for each packet
We provide guaranteed QoS to each flow with: minimum bandwidth
guarantees bounded maximum delay
R R R R
R R R R
R R R R
R R R R
physical linkshared by 3 flows
hotspotresource
Jae W. Lee (8 / 33)
Locally fair ⇒ globally fair
With locally fair round-robin (RR) arbitration: Throughput (Flow A) = (0.5) C Throughput (Flow B) = (0.5)2 C Throughput (Flow C) = Throughput (Flow D) =
(0.5)3 C → Throughput of a flow decreases exponentially as
its distance to the destination (hotspot) increases.
SRC ASRC BSRC C
SRC D DEST
arbitrationpoint 1
arbitrationpoint 2
arbitrationpoint 3
channel rate = C [Gb/s]
Jae W. Lee (9 / 33)
Motivational simulation
In 8x8 mesh network with RR arbitration (hotspot at (8, 8))
node index (Y)node index (X)
accepted throughput[flits/cycle/node]
w/ dimension-ordered routing w/ minimal-adaptive routing
node index (Y)node index (X)
accepted throughput[flits/cycle/node]
locally-fair round-robin scheduling → globally unfair bandwidth usage
1
8
hotspot
8x8 2D mesh
2345678
76
54
32
1
Jae W. Lee (10 / 33)
Desired bandwidth allocation: an example
Taken from simulation results with GSF:
node index (X)node index (Y)
accepted throughput[flits/cycle/node]
node index (X)node index (Y)
accepted throughput[flits/cycle/node]
Fair allocation Differentiated allocation
Jae W. Lee (11 / 33)
Globally Synchronized Frames (GSF)
provide guaranteed QoS with minimum bandwidth guarantees and maximum delay to each flow in multi-hop on-chip networks: with high network utilization comparable to
best-effort virtual-channel router with minimal area/energy overhead by avoiding
per-flow queues/structures in on-chip routers→ scalable to # of concurrent flows
Jae W. Lee (12 / 33)
Outline of this talk
Motivation Globally-Synchronized Frames: a step-by-
step development of mechanism Implementation of GSF router Evaluation Related work Conclusion
Jae W. Lee (13 / 33)
GSF takes a frame-based approach
R R R R
R R R R
R R R R
R R R R
shared physical link
time
frame #
2
Frame is a coarse quantization of time. The network can transport a finite number of flits during this
interval. We constrain each flow source to inject a certain number of
flits per frame. shorter frames → coarser BW control but lower maximum
delay typically 1-100s Kflits / frame (over all flows) in 8x8 mesh
network
01
34
Jae W. Lee (14 / 33)
Admission control of flows
R R R R
R R R R
R R R R
R R R R
shared physical link
time
frame #
2
Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval
01
34
Jae W. Lee (15 / 33)
Single frame does not service bursty traffic well
Both traffic sources have the same long-term rate: 2 flits / frame.
Allocating 2 flits / frame penalizes the bursty source.
time
frame #
01
23
45
regulated srcbursty src
Jae W. Lee (16 / 33)
Overlapping multiple frames to help bursty traffic
Overlapping multiple frames to multiply injection slots Sources can inject flits into future frames (w/ separate per-frame
buffers) Older frames have higher priorities for contended channels.
Drain time of head frame does not change. Future frames can use unclaimed BW by older frames.
Maximum network delay < 3 * (frame interval) Best-effort traffic: always lowest priority (throughput ↑)
time
frame #
0
7
1 12 2 2
3 3 34 4 4
5 5 56 6
2 future frames
head frame
Jae W. Lee (17 / 33)
Reclamation of frame buffers
Per-frame buffers (at each node) = virtual channels At every frame window shift, frame buffers (or VCs)
associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.
time
frame #
0
7
1 12 2 2
3 3 34 4 4
5 5 56 6Frame
windowshift
VC2
VC1
VC0epoch
0epoch
1epoch
2epoch
3epoch
4epoch
5
Jae W. Lee (18 / 33)
Early reclamation improves network throughput
time
frame #
Observation: Head frame usually drains much earlier than frame interval → low buffer utilization
Terminate head frame early if empty Use a global barrier network to confirm no pending packet in router or
source queue belongs to head frame. Empty buffers are reclaimed much faster and overall throughput increases.
(by >30% for hotspot traffic pattern)
0
7
1 12 2 2
3 3 34 4 4
5 5 56 6Frame
windowshift
epoch0
epoch1
epoch2
epoch3
epoch4
epoch5
0
7
1 12 2 2
3 3 34 4 4
5 5 56 6Frame
windowshift
e0 e1 e2 e3 e4 e5
76
7
e6 e70
7
12 2
3 34 45 5 5
6 6
Jae W. Lee (19 / 33)
GSF in action: two-router network example (3 VCs)
A
AVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
CBVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
GSF in action
Flow A Flow B Flow C Flow D
BC
BA D
Jae W. Lee (20 / 33)
GSF in action
GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D
A BC
BA D
A CBVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
Jae W. Lee (21 / 33)
GSF in action
GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D
A BC
D
BA
BA
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
Jae W. Lee (22 / 33)
GSF in action
GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D
A BC
DA B
AVC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
Jae W. Lee (23 / 33)
GSF in action
GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D
DA B
A BC
empty!
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
VC 0(Fr0)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
Jae W. Lee (24 / 33)
GSF in action
GSF in action: two-router network example (3 VCs)Flow A Flow B Flow C Flow D
DA B
CA
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
framewindow
shift
DCBA
VC 0(Fr3)VC 1(Fr1)VC 2(Fr2)
VC 0(Fr3)VC 1(Fr1)VC 2(Fr2)
Frame 0
Frame 1
Frame 2
Frame 3
Frame 4
Frame 5ƒƒƒ
active frame window:
Jae W. Lee (25 / 33)
Carpool lane sharing
AB
AVC 0(Fk)
VC 1&2(Fk,
F(k+1),F(k+2))
carpool lane:reserved forhead frameonlyA
AVC 0(Fk)VC 1
(F(k+1))VC2
(F(k+2))
C B
C
Buffers are expensive in on-chip environment. Cannot transport a flit even if there is an empty slot in other frame
buffers. Carpool lane sharing: relaxing frame-VC mapping to improve
buffer utilization Reserve one frame buffer (VC0) for head frame only
→ does not increase the drain time of head frame The other buffers are now colorless and can be used by any frame.
Head-of-line (HoL) blocking prevented by not allowing two packets to occupy a VC simultaneously (OK for shallow buffers).
a flit in Frame (k+1)
Jae W. Lee (26 / 33)
Baseline virtual channel (VC) router
Best-effort router Three-stage pipeline
with look-ahead routing:VA/NRC-SA-ST
Credit-based flow control VC, SW allocators: iSlip
uses round-robin arbiters(locally fair)
updates the priority of each arbiter only when that arbiter generates a winning grant
VC0
VC1
VC2
VC3
flit buffers crossbar(5x5)
VC0
VC1
VC2
VC3
ƒƒƒ
VC allocation
SW allocation
next routecomputation
to P
to N
to E
to S
to W
ƒ ƒ ƒ ƒ ƒ
Baseline router for 2D mesh networks
Jae W. Lee (27 / 33)
GSF router
VC0: carpool lane reserved for head frame
only New registers
head_frame (HF) (per node) frame_num (per VC)
NRC: priority precalculation(frame_num-HF) (mod W)
(0 is the highest priority.) VC and SW allocation:
priority enforcement Global barrier network for
frame window shifting
VC0
VC1
VC2
VC3
flit buffersframe_num
2
x
1
0
≠
≠
≠
≠
crossbar(5x5)
VC0
VC1
VC2
VC3
2
0
3
2
≠
≠
≠
≠
ƒƒƒƒƒ
VC allocation
SW allocation
next routecomputation
head_frame=2
increment_head_frameglobal barrier network
to P
to N
to E
to S
to W
ƒƒ
hea
d_f
ram
e_em
pty
_at_
no
de_
X
ƒ ƒ ƒ ƒ ƒ
next routecomputation
VC allocation
SW allocation
Jae W. Lee (28 / 33)
Simulation setup
Network simulator: Booksim 0.5 M cycles with 50K-cycle warming up
Network configuration 8x8 2D mesh, dimension-ordered routing, 1 flit/cycle link
capacity Four traffic patterns
one QoS traffic pattern: hotspot three best-effort traffic patterns: uniform random,
transpose, nearest neighbor packet size is either 1 or 9 flits (with 50-50 chance)
Baseline VC router 3-stage pipeline (VA/NRC-SA-ST), 2-cycle credit pipeline
delay 6 VCs/physical link, buffer depth is 5 flits/VC
GSF parameters frame window size = 6 [frames], frame size = 1,000 [flits] global barrier latency = 16 [cycles] (conservative)
Jae W. Lee (29 / 33)
Flexible guaranteed QoS provided
All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot. Ri: # of flit injection slots for Flow i eMAX: maximum epoch interval.
Example: 8x8 mesh network
node index (X)
node index (Y)
accepted throughput[flits/cycle/node]
node index (X)
node index (Y)
accepted throughput[flits/cycle/node]
0.02
0.04
0.06
0
(b) differentiated allocation(a) fair allocation
Jae W. Lee (30 / 33)
Flexible guaranteed QoS provided
(c) differentiated allocation
accepted throughput[flits/cycle/node]
node index (Y) node index (X)
All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot. Ri: # of flit injection slots for Flow I eMAX: maximum epoch interval.
Example: 16x16 torus network with 4 hotspot nodes
Jae W. Lee (31 / 33)
Small throughput degradation for best-effort traffic
average delay [cycles]
uniform random
offered load [flits/cycle/node]
050
100150200250300350400
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
iSlip GSF/1 GSF/8 GSF/16
050
100150200250300350400
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
iSlip GSF/1 GSF/8 GSF/16
050
100150200250300350400
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
iSlip GSF/1 GSF/8 GSF/16
transposeaverage delay [cycles]
offered load [flits/cycle/node]
average delay [cycles]
offered load [flits/cycle/node]
nearest neighbor
Network behavior with non-QoS traffic no latency increase in
uncongested region at most 12 % degradation of
network saturation throughput → can be reduced with larger frame (at the cost of delay bound increase)
12 %
7 %
Jae W. Lee (32 / 33)
Related work
QoS support in IP or multiprocessor networks Fair Queueing [SIGCOMM ’89],
Virtual Clock [SIGCOMM ’90] Multi-rate channel switching [IEEE Comm ’86] Source throttling [HPCA ’01] Age-based arbitration [IEEE TPADS ’92, SC ’07] Rotating Combined Queueing (RCQ) [ISCA ’96]→ expensive, inflexible, and/or without guaranteed QoS
QoS on-chip networks AEthereal (strict TDM; exp. channel setup) [IEEE Design &
Test ’05] SonicsMX (per-thread queues at each node) [DATE ’05] MANGO clockless NoC (partitioning GS and BE VCs) [DATE
’05] Nostrum (routes fixed at design time) [DATE ’04]
Jae W. Lee (33 / 33)
Conclusion
The GSF network is guaranteed QoS-capable
with minimum bandwidth guarantees and maximum delay flexible
fair and differentiated bandwidth allocation no explicit channel setup required along the path
robust <5 % throughput degradation on average (12 % in the
worst) for four traffic patterns in 8x8 mesh network fairness vs overall throughput tradeoff with frame size
simple no per-flow queues/structures in on-chip routers
→ scalable relatively small modifications to a conventional VC
router