Avinoam Kolodny
Threads, Caches and Networks in Chip-MultiProcessor Systems
Electrical Engineering Department
Technion – Israel Institute of Technology
ETNA: 1st International Workshop on Emerging Topics in NoC-aware Computer Architecture, ISCA 2013
Technion’s NoC & Architecture Collaborators:
– Israel Cidon
– Yoav Etzion
– Eby Friedman
– Ran Ginosar
– Idit Keidar
– Isaac Keslassy
– Avinoam Kolodny
– Avi Mendelson
– Uri Weiser
– …. And some very good students!
Chips are similar to Cities
System complexity is shown in the interconnect.
If a chip is like a city, NoC is similar to a subway system.
Changing view of VLSI systems
“Old” view:
• Communication is fast and free
• Execution time and energy are dominated by ALU operations
“New” view:
• Communication dominates delay, power and cost
• Computing operations are fast and cheap

The truth is actually somewhere in the middle…
… that’s why NoC+CMP architecture is challenging!
On-Chip Interconnect is a Bottleneck: The challenge of wire design
Interconnect Delay is dominant (source: Bohr, IEDM 1995)
Interconnect Power is dominant: Interconnect 51%, Gate 34%, Diffusion 15%*
* N. Magen, A. Kolodny, U. Weiser and N. Shamir, SLIP 2004. (Data for Intel “Banias” Centrino processor)
Network on-Chip (NoC)
[Figure: computing modules connected by network routers and network links]
Network instead of dedicated wires and buses:
Inherently parallel
Efficient sharing of wires
Scalable, cost-effective bandwidth
Critical Problems Addressed by NoC
1) Global wire design (delay, power, noise, bandwidth, scalability, reliability issues)
2) System integration productivity problem (key to modular design)
3) Building multi-core systems (key to power-efficient computing)
Processor Evolution
Single Core: CPU + Cache
Dual Core: CPU1 + Cache, CPU2 + Cache
Quad Core: CPU1–CPU4, each with its own Cache
[Pollack]
Asymmetric (= Heterogeneous) Multi-Core
• Small cores, each of area a
• One large core of area βa, used for serial code
• Parallel phases execute on all cores (a speedup sketch follows below)
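One way to make this tradeoff concrete is a Hill-Marty-style speedup model in which the large core runs the serial fraction at a performance that follows Pollack's rule (performance ≈ √area). The sketch below is illustrative; the function and the chosen numbers are mine, not from the talk:

```python
import math

def asymmetric_speedup(f_parallel, n, beta):
    """Speedup of an asymmetric CMP with one large core of area beta and
    (n - beta) unit-area small cores, relative to a single small core.
    Serial code runs on the large core at sqrt(beta) performance
    (Pollack's rule); parallel phases run on all cores."""
    perf_large = math.sqrt(beta)
    n_small = n - beta
    serial_time = (1.0 - f_parallel) / perf_large
    parallel_time = f_parallel / (perf_large + n_small)
    return 1.0 / (serial_time + parallel_time)

# Example: 90% parallel work, 64 area units, one large core of area 16
print(asymmetric_speedup(0.9, n=64, beta=16))  # ~23.6x
```

Sweeping beta in this model shows the usual tension: a bigger serial core speeds up serial phases but steals area from the parallel ones.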
What do we know about future systems?
[Figure: large mesh of routers (R) connecting standard modules (DSP, HW accelerators, cache banks, etc.)]
High certainty: a large number of modules, NoC interconnect, power-aware, highly parallel.
Totally unknown: the applications.
Accessing On-Chip Cache Banks through a NoC
• Shared last level cache (LLC)
  – Single copy ⇒ no inter-cache coherence
• Banked, DNUCA
  – Interconnected using Network-on-Chip (NoC)
[Figure: CPU0–CPU7 around a distributed L2 made of banks Bank0–Bank7]
[Beckmann et al., MICRO’06] [Beckmann and Wood, MICRO’04]
Exploring Cache-In-the-Middle (CIM) CMP
Shared data migrates to the center of the distributed cache, far away from its clients
⇒ Longer access times
Remoteness of Shared Data
Observations on Memory Accesses
For many multithreaded applications (Splash-2, SpecOMP, Apache, SPECjbb, STM, …):
1. Access to shared lines is substantial
2. Shared lines are shared by many cores
3. A small number of lines accounts for a large fraction of the total accesses
A small number of lines, shared by many processors, is accessed numerous times
⇒ Shared hot lines effect
Shared Data Hinders Cache Performance
What can be done better?
Bring shared data closer to all processors
Preserve vicinity of private data
This Has Been Addressed Before
[Figures: aerial view of Nahalal cooperative village; overview of Nahalal cache organization]
Nahalal Layout
• Partitioning of cache lines by “shared” vs. “private”
• Keep shared lines in the center
– Small & fast structure, close to all processors
• Use outer banks for private data
– Preserves vicinity of private data
* Guz et al., SPAA 2008; IEEE CA Letters, 2007.
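The geometric intuition can be checked with a few lines of arithmetic: compare the average Manhattan distance from the perimeter CPUs to a bank in the shared center versus a bank that has drifted to one side of the cache array. The coordinates and placements below are illustrative only, not the actual Nahalal floorplan:

```python
# Eight CPUs on the perimeter of a 4x4 floorplan (illustrative coordinates)
cpus = [(0, 0), (0, 1), (0, 3), (1, 3), (3, 3), (3, 2), (3, 0), (2, 0)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def avg_distance(bank):
    """Average hop distance from every CPU to the given bank position."""
    return sum(manhattan(c, bank) for c in cpus) / len(cpus)

central_shared_bank = (1.5, 1.5)   # Nahalal-style: shared lines in the middle
far_shared_bank = (3, 3)           # shared line that drifted toward one corner

print(f"central bank: {avg_distance(central_shared_bank):.2f} hops on average")
print(f"corner bank:  {avg_distance(far_shared_bank):.2f} hops on average")
```

The central placement is close to every CPU at once, while any off-center placement is cheap for some CPUs and expensive for the rest, which is exactly the shared-hot-lines problem.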
Average Distance – Shared vs. Private
[Chart: average distance (number of hops) for private and shared lines in CIM vs. Nahalal, over equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, HashTable]
Nahalal shortens the distance to shared data
Distance to private data remains roughly the same
Cache Performance
[Chart: average cache hit time (clock cycles), CIM vs. NAHALAL, over equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, HashTable; per-benchmark improvements: 3.9%, 8.57%, 40.53%, 41.1%, 29.06%, 29.35%, 39.4%, 29.1%, 24.2%]
26.8% improvement in average cache hit time (41.1% in apache)
Latency is an Issue in NoCs
Latency Model

Latency = Routing Delay + Wire Delay

• Router delay (Peh–Dally pipelined-router model): Router Delay = $n_{cyc} \cdot t_{clk}$, where the number of router pipeline cycles $n_{cyc}$ grows with $\log_2$ of the number of ports p, virtual channels v and buffer size B.
• Unrepeated wire delay (distributed RC): $T_{unrep} = \tfrac{1}{2} R_{int} C_{int} l^2$
• Repeated wire delay, with repeaters of size S dividing a line of length l into n segments (Bakoglu):
$T_{rep} = n \left[ 0.7\,\frac{R_0}{S}\left(S C_0 + \frac{C_{int}\, l}{n}\right) + \frac{R_{int}\, l}{n}\left(0.4\,\frac{C_{int}\, l}{n} + 0.7\, S C_0\right) \right]$
• Technology-independent model: latency is measured in units of τ, the switching delay of an inverter.

* L.-S. Peh and W.J. Dally, "A delay model and speculative architecture for pipelined routers", HPCA 2001.
* H.B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, 1990.
Ultimate Link Length
• Increasing wire speed: by widening, spacing and repeater insertion.
• Wire length reaches an ultimate limit - regardless of cost function.
• Maximal link-length decreases as technology advances.
[Charts: cost function per mm vs. length (mm) at 1 GHz and 2 GHz, for technology nodes from 29 nm (year 2009) down to 8 nm (year 2023); maximal link length (mm) per technology node at 1 GHz and 2 GHz]
* R. Manevich, L. Polishuk, I. Cidon, A. Kolodny, "Design Tradeoffs of Long Links in Hierarchical Tiled Networks-on-Chip", DSD EUROMICRO 2013.
Link Delay vs. Router Cycle
• For future technologies, link delay becomes worse
• When link delay is higher than router’s clock, link pipelining is needed
[Chart: link delay (τ) vs. technology node (29 nm down to 7 nm) for 16x16, 32x32 and 64x64 meshes, compared against the router cycle time (τ, vc = 2)]
Latency is a Basic Disadvantage in NoCs (for global packets which cross many routers)
In large systems, with traffic modeled by Rent's rule, global packets (a minority):
• Consume most of the network's BW
• Significantly increase the average latency at light load
* R. Manevich, I. Cidon, and A. Kolodny, "Handling Global Traffic in Future CMP NoCs", SLIP 2012
Latency grows even worse when the NoC is loaded.
[Chart: typical latency vs. load; latency is flat at light load, then rises steeply]
A loaded network quickly reaches a saturation point!
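The shape of that curve is the classic queueing blow-up, and an M/M/1-style approximation reproduces it. This is a generic illustration of the effect, not a model from the talk:

```python
def noc_latency(load, zero_load_latency, saturation_load):
    """Qualitative latency vs. load: delay diverges as offered load
    approaches the saturation throughput (M/M/1-like form)."""
    if load >= saturation_load:
        return float("inf")  # network saturated: unbounded queueing delay
    return zero_load_latency / (1.0 - load / saturation_load)

for load in (0.1, 0.3, 0.5, 0.7, 0.85):
    print(f"load {load:.2f}: latency {noc_latency(load, 20.0, 0.9):.1f} cycles")
```

Doubling the load from light levels barely matters, but near saturation every extra flit of traffic multiplies the delay.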
Can NoC latency be reduced?
(“Ideas for CMP-aware NoC”)
Reducing hop-count. Example: PyraMesh Topology
• Hierarchical 2D mesh.
• Global packets are routed through higher hierarchy levels (a toy hop-count comparison follows below).
[Figure: route from Source to Dest. takes 8 hops through the hierarchy instead of 14]
Overall hop-count is reduced.
Average latency is reduced.
Average BW per router is reduced.
* R. Manevich, I. Cidon, and A. Kolodny, "Handling Global Traffic in Future CMP NoCs", SLIP 2012
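The hop savings can be seen with a toy two-level route: climb into a coarser mesh, traverse it, and descend near the destination. The level structure below is a simplification for illustration; real PyraMesh routing is more involved:

```python
def flat_hops(src, dst):
    """XY hops between two tiles in a flat 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def hierarchical_hops(src, dst, scale=4):
    """Toy 2-level route: detour to the nearest 'up' portal, traverse a
    mesh that is `scale`x coarser, come back down. Never worse than flat."""
    portal_detour = 2 * (scale // 2)   # illustrative walk to/from portals
    climb = 2                          # one hop up, one hop down
    coarse = flat_hops(src, dst) // scale
    return min(flat_hops(src, dst), climb + coarse + portal_detour)

src, dst = (1, 1), (8, 8)
print(flat_hops(src, dst), hierarchical_hops(src, dst))  # e.g. 14 vs. 9
```

Local packets ignore the hierarchy entirely, so only the long-distance minority pays the portal detour, and it is repaid many times over in saved hops.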
Average Latency: Comparison of NoC topologies for a wide range of Rentian traffic loads
[Charts: average latency (clock cycles) vs. Rent coefficient r, for NoC sizes 64, 256, 1024 and 4096; topologies: HNoC, Simple Mesh, PyraMesh, EVC, Boundary]
                                   Average    Maximal
Latency speedup vs. Simple Mesh    1.55X      2.05X
Latency speedup vs. 2nd best       1.21X      1.64X
* R. Manevich, L. Polishuk, I. Cidon, A. Kolodny, To be published, 2013.
Some Sad Observations
• There is no "perfect" topology: you need to know your traffic model to choose a network topology
• Choosing the most suitable topology for your type of traffic can reduce the latency by no more than ~2X (at light loads)
• What can be done to prevent additional, congestion-related delays?
Ideas for improving latency of Cache traffic in NoC
Issues in NUCA-based CMP
• Each cache access ⇒ multiple NoC transactions
• NoC performance ⇒ CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)
⇒ Need specialized NoC services for CMP!
Observations on Cache Access
• Delay = Queueing + NoC transactions
• NoC transactions consist of:
  • Short ctrl. packets
  • Long data packets
Idea: Differentiate between Control and Data
Solution: Preemptive Priority NoC, giving priority to short control packets
[Figure: coherence transaction over the NoC; P0 holds a modified line: 3. READ EXCL. REQ to the L2 directory, 4. INVALID. REQ to sharers, 5. INVALID. ACK, 6. READ EXCL. RESP (data transfer)]
Preemptive Priority NoC: QNoC
[Figure: QNoC router with multiple service levels (SL 0–SL 3) on each physical link; per-SL buffers at the input and output ports, a crossbar, and a credit-based scheduler]
Service Levels:
• Dedicated wormhole buffer
• Preemptive priority scheduling (a scheduling sketch follows below)
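The scheduling rule is small enough to simulate in a few lines: whenever flits of different service levels compete for a link, the higher service level wins the next flit slot, so a short control packet overtakes a long data packet already in flight. A minimal sketch with invented packet fields, not the actual QNoC implementation:

```python
import heapq

def schedule_link(packets):
    """packets: list of (arrival_cycle, priority, flits, name).
    Lower priority value = higher service level (preempts lower levels).
    Returns (name, finish_cycle) in completion order."""
    pending, done, cycle = sorted(packets), [], 0
    ready = []        # min-heap keyed by priority
    progress = {}     # flits already sent, per packet name
    while pending or ready:
        while pending and pending[0][0] <= cycle:
            arr, pr, fl, nm = pending.pop(0)
            heapq.heappush(ready, (pr, arr, fl, nm))
        if not ready:                 # link idle: jump to next arrival
            cycle = pending[0][0]
            continue
        pr, arr, fl, nm = heapq.heappop(ready)  # highest service level wins
        sent = progress.get(nm, 0) + 1          # transmit one flit this cycle
        cycle += 1
        if sent == fl:
            done.append((nm, cycle))
        else:
            progress[nm] = sent
            heapq.heappush(ready, (pr, arr, fl, nm))  # may be preempted next
    return done

# A long data packet starts first; a control packet arriving later still
# finishes first because it preempts the data packet flit-by-flit.
print(schedule_link([(0, 1, 10, "data"), (2, 0, 1, "ctrl")]))
# -> [('ctrl', 3), ('data', 11)]
```

Because control packets are a few flits long, preempting them into the link costs the data stream almost nothing while cutting the control latency dramatically.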
Priority NoC: Several Benchmarks
[Chart: L2 access delay reduction by priority-based NoC, in % (Read / Read Exclusive): apache 22.6 / 31.8, zeus 19.6 / 28.4, fft 13.5 / 25.3, ocean 18.3 / 32.9, radix 22.3 / 28.0]
[Chart: total program speedup by priority-based NoC: apache 9.4%, zeus 8.7%, fft 9.0%, ocean 8.6%, radix 5.0%]
* E. Bolotin, Z. Guz, I. Cidon, R. Ginosar and A. Kolodny, "The Power of Priority: NoC based Distributed Cache Coherency", NOCS 2007, Princeton, NJ, May 2007.
Should we regard NoCs as truly distributed systems?
• …… chips are so small…..
• Idea: Use centralized mechanisms in NoCs!
Centralized mechanism example 1: Bus-Enhanced NoC (BENoC)
• Motivation
– NoCs have high bandwidth, but latency suffers
– Group communication is expensive
Idea of Bus-Enhanced NoC
Approach
Embed a bus to achieve synergy
Optimize: bus for latency, NoC for bandwidth
Use bus for meta-data only
[Figure: mesh of routers with a chip-wide bus overlaid]
Bus-Enhanced NoC (BENoC)
• Bus re-introduced as a NoC "add-on"
Use NoC for data: optimized for high bandwidth
Use bus for short meta-data: low bandwidth, low latency; broadcast, multicast
[Figure: modules connected by both the router mesh and the added bus]
*R. Manevich, I. Walter, I. Cidon and A. Kolodny, "Best of Both Worlds: A Bus-Enhanced NoC (BENoC)",
NOCS 2009, San Diego, CA, May 2009
BENoC Services
• Fast unicast and multicast signaling
  – CMP cache example
• Anycast
  – Find resources that fulfill certain conditions
  – E.g., "Looking for an idling DSP"; or "Where are the 5 closest multipliers?"
• Convergecast
  – Efficient collection of feedback back to the initiator
• Barrier synchronization, … (a message-steering sketch follows below)
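The division of labor can be pictured as a tiny dispatcher that steers each message by size and fan-out: short or multicast meta-data rides the bus, bulky unicast data rides the NoC. A sketch under those assumptions; the threshold and message fields are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Message:
    payload_bytes: int
    destinations: int   # 1 = unicast; >1 = multicast/broadcast

BUS_MAX_BYTES = 8       # illustrative: the bus carries only short meta-data

def pick_fabric(msg: Message) -> str:
    """Steer a message to the latency-optimized bus or the
    bandwidth-optimized NoC (BENoC-style split, simplified)."""
    if msg.destinations > 1:
        return "bus"    # broadcast/multicast is naturally cheap on a bus
    if msg.payload_bytes <= BUS_MAX_BYTES:
        return "bus"    # short control/meta-data: latency matters most
    return "noc"        # long data transfer: bandwidth matters most

print(pick_fabric(Message(4, 1)))    # bus: short control packet
print(pick_fabric(Message(64, 1)))   # noc: cache-line data transfer
print(pick_fabric(Message(4, 16)))   # bus: invalidation broadcast
```

Keeping the bulky traffic off the bus is what lets a slow, cheap bus deliver low latency for the messages that actually need it.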
Bus-enhanced NoC for DNUCA
• Split a large cache into independent smaller banks: non-uniform cache access time (NUCA)
• Cache lines are moved to shorten access time: Dynamic NUCA
• Before fetching a line into its L1$, a CPU needs to find the L2 bank storing the line
[Figure: CPUs with private L1$ surrounding a grid of L2$ banks]
Simulation of DNUCA with Bus-Enhanced NoC
[Charts: performance improvement of BENoC compared to a NoC-based CMP: (a) average read transaction latency; (b) application speedup]
Centralized mechanism example 2: Centralized Adaptive Routing
ATDOR: Adaptive Toggle Dimension Ordered Routing
Keep it simple! Centralized selection between the two dimension-ordered routes (XY and YX): the option with the less congested bottleneck link is preferred (a selection sketch follows below).
[Figure: congestion data is collected within the routers and aggregated; routing control is sent back to the sources]
* R. Manevich, I. Walter, I. Cidon and A. Kolodny, "Centralized Adaptive Routing for NoCs," IEEE Computer Architecture Letters, vol. 9, no. 2, pp. 57-60, 2010.
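The selection rule itself is simple: for each source-destination pair, compare the most congested link along the XY route with the most congested link along the YX route, and toggle to the route with the lighter bottleneck. A minimal sketch assuming a centrally aggregated congestion map; the names and data layout are illustrative:

```python
def route_links(src, dst, xy_first):
    """Links (as node pairs) along a dimension-ordered route in a 2D mesh."""
    links, (x, y) = [], src
    dims = ((0, dst[0]), (1, dst[1])) if xy_first else ((1, dst[1]), (0, dst[0]))
    for dim, target in dims:
        pos = [x, y]
        while pos[dim] != target:
            nxt = list(pos)
            nxt[dim] += 1 if target > pos[dim] else -1
            links.append((tuple(pos), tuple(nxt)))
            pos = nxt
        x, y = pos
    return links

def atdor_select(src, dst, congestion):
    """Centralized toggle: choose XY or YX by the lighter bottleneck link."""
    bottleneck = lambda links: max(congestion.get(l, 0.0) for l in links)
    xy, yx = route_links(src, dst, True), route_links(src, dst, False)
    return "XY" if bottleneck(xy) <= bottleneck(yx) else "YX"

# Example: a hot link on the XY path pushes this pair onto the YX route.
hot = {((1, 0), (2, 0)): 0.9}
print(atdor_select((0, 0), (2, 2), hot))  # -> "YX"
```

Because both candidate routes are dimension-ordered, the network stays deadlock-free while the central unit does the adaptivity.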
Centralized mechanism example 3: GANA: Global Arbitration NoC Architecture
• An overlay of a Data NoC and a Control network
[Figure: NoC-like data fabric of wires and simple routers, plus a Global Arbitration Unit (GAU) reached through request and grant lines]
• A new NoC architecture
• Power is 76% of a baseline NoC @ 0.25 load and 62% @ 0.75 load
• Area is 16% of a baseline NoC
• Single-cycle latency per hop
• No Head-of-Line blocking
• No parking-lot effect: fairness imposed
* E. Zahavi, I. Cidon and A. Kolodny, "GANA: A Novel Low Cost Conflict Free NoC Architecture," ACM Transactions on Embedded Computing Systems, 2013.
Need for Heterogeneous NoCs
CMP Bandwidth Requirements
• Different links in a NoC-based CMP need different throughput capacities!
  – Typically, links at the center carry more traffic.

NoC-Based CMP Example: Non-uniform traffic
• 3 different types of links (a capacity-sizing sketch follows below):
1. DRAM to L2$: 22 GBps and 2 VCs (handles a miss read)
2. L2$ to DRAM: 12 GBps and 2 VCs (block replacements during miss handling in the L2$)
3. Cores <-> L2$: 3 GBps and 1 VC
Legend: C = Core + L1 cache; $ = L2 cache; D = DRAM controller.
* In the figure, link thickness corresponds to capacity.
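Turning such per-link bandwidth requirements into a heterogeneous design can be as simple as sizing each link's width from its required throughput at a given clock. A sketch with an illustrative clock frequency and flit-width granularity, not parameters from the talk:

```python
import math

CLOCK_HZ = 1e9          # illustrative NoC clock: 1 GHz
FLIT_BITS_STEP = 16     # link widths quantized to multiples of 16 bits

def link_width_bits(required_gbps):
    """Smallest quantized link width that sustains the required
    throughput at CLOCK_HZ, assuming one flit per cycle."""
    bits_per_cycle = required_gbps * 8e9 / CLOCK_HZ
    return math.ceil(bits_per_cycle / FLIT_BITS_STEP) * FLIT_BITS_STEP

# Per-link requirements from the example above: (GBps, virtual channels)
links = {"DRAM->L2$": (22, 2), "L2$->DRAM": (12, 2), "Core<->L2$": (3, 1)}
for name, (gbps, vcs) in links.items():
    print(f"{name}: {link_width_bits(gbps)} bits wide, {vcs} VCs")
```

Sizing each link individually, instead of provisioning every link for the worst case, is what saves the area and power of a heterogeneous NoC.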
Heterogeneous NoC Router Architecture for CMPs
* I. Ben-Itzhak, I. Cidon and A. Kolodny, to be published, 2013.
What can be done in processors and software?
("Ideas for NoC-aware computing")
A Unified Machine Model
• Use both cache and many threads to shield memory access
  – Derive simple equations for performance, power, BW, … (a model sketch follows after the machine-type slides below)
[Figure: unified machine: many small cores (C) holding threads' architectural states, behind a shared cache connected to external memory]
* Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, vol. 8, no. 1, Jan. 2009.
A Useful Plot for Multi-Threaded Systems
[Plot axes: Performance vs. Number of Threads]
Cache Machines
• Many cores (each may have its private L1) behind a shared cache
[Figure: cores behind a shared cache connected to memory. Plot: performance vs. # threads, with a Cache Non-Effective point (more threads ► lower hit-rate)]
Multi-Thread Machines
• Memory latency shielded by multiple thread execution
[Figure: many cores holding threads' architectural states, connected to memory. Plot: performance vs. # threads climbs toward max performance, with execution, memory-access and bandwidth limitations marked]
Unified Machine Performance
• 3 regions: Cache efficiency region, The Valley, MT efficiency region
[Plot: performance vs. # threads showing the Cache region, the Valley, and the MT region]
* Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, vol. 8, no. 1, Jan. 2009.
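The three regions can be reproduced with a few lines of arithmetic: performance grows with threads while the cache still works, dips when the hit rate collapses before multithreading can hide memory latency, and recovers until execution slots or off-chip bandwidth cap it. A minimal sketch in the spirit of the unified model; all constants are illustrative placeholders, not the paper's parameters:

```python
# Illustrative parameters (placeholders, not the numbers from the paper)
CACHE_LINES = 32768      # shared cache capacity (lines)
T_MEM = 200              # off-chip memory latency (cycles)
CPI_EXE = 1.0            # cycles per instruction when not stalled
N_CORES = 64             # hardware execution slots
MAX_MISS_PER_CYC = 2.0   # off-chip bandwidth, in misses per cycle

def hit_rate(n):
    """Hit rate falls as more threads share the cache (power-law sketch)."""
    return min(1.0, 0.1 * (CACHE_LINES / n) ** 0.3)

def performance(n):
    """Machine IPC with n threads: multithreading overlaps memory stalls,
    but execution slots and off-chip bandwidth each impose a ceiling."""
    miss = 1.0 - hit_rate(n)
    latency_bound = n / (CPI_EXE + miss * T_MEM)   # stalls hidden by threads
    compute_bound = N_CORES / CPI_EXE              # execution-limited ceiling
    bw_bound = MAX_MISS_PER_CYC / max(miss, 1e-9)  # bandwidth-limited ceiling
    return min(latency_bound, compute_bound, bw_bound)

for n in (1, 4, 16, 64, 256, 1024, 4096):
    print(f"{n:5d} threads -> {performance(n):6.2f} IPC")
# The dip around tens of threads is the Valley: too many threads for the
# cache, too few to hide the memory latency.
```

With these numbers the printout dips around 64 threads and recovers in the hundreds, which is exactly the "stay away from the valley" shape.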
Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads N:
• A "strong" function of the number of threads: f(N^q) with q > 1
• A "weak" function of the number of threads: f(N^q) with q ≤ 1
• Miss rate is not affected by the number of threads
[Plots: performance vs. # threads for each class]
Investigating Workload Parallelism
Example: Canneal, simulation results from PARSEC workloads
[Chart: performance (GOPS) and cache hit rate (%) vs. number of threads, simulation vs. analytical model]
Not enough parallelism available!
* Z. Guz, O. Itzhak, I. Keidar, A. Kolodny, A. Mendelson and U.C. Weiser, "Threads vs. Caches: Modeling the Behavior of Parallel Workloads", ICCD 2010.
Inherent Program Scalability Study
• Capture the parallelism limitation of the algorithm
• Use an architecture model with no parallelism limiters
  – No shared resources (e.g. cache, bandwidth)
  – Perfect memory system: 1-cycle latency
• Focus on inter-thread synchronization
  – Using a special simulator
* O. Itzhak, I. Keidar, A. Kolodny and U. Weiser, to be published, 2013.
Perfect parallelism scalability: blackscholes
Good parallelism scalability: fluidanimate
Poor parallelism scalability: raytrace
What can be done when NoC latencies become dominant?
• More parallelism? Efficient thread switching?
• More locality?
• Special attention to shared data?
• Special attention to meta-data?
Memory Intensive Machines
• Reducing BW (i.e. power) can be achieved by climbing up a constant-throughput curve
• Increase on-die memory (e.g. innovative cache, new ideas….?)
[Chart: constant-throughput curves TP1–TP4 in the throughput/BW plane]
Memristor Opportunities
• 3D memory, above CMOS logic
• Nonvolatile
• High density
• "For free"
[Figure: sea of nonvolatile memory above the logic]
Deep Pipeline with Memristor-based Thread Reservoir
• Use memristors to reduce thread switch penalty
• At switch time:
– Instead of flush, store the thread state in memristors
– Load pipeline stages for a different thread from memristors (a toy cost model follows below)
* S. Kvatinsky et al., IEEE Computer Architecture Letters, 2013.
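The payoff can be framed as a simple switch-penalty model: if pipeline state is saved to and restored from adjacent memristors in a couple of cycles instead of a full flush-and-refill, frequent thread switching becomes cheap. A toy comparison with invented cost numbers, not measurements from the paper:

```python
def effective_cpi(base_cpi, run_cycles, switch_penalty):
    """Average CPI when a thread runs `run_cycles` and then pays a switch."""
    return base_cpi * (run_cycles + switch_penalty) / run_cycles

FLUSH_PENALTY = 20      # illustrative: refill a deep pipeline after a flush
MEMRISTOR_PENALTY = 2   # illustrative: swap state with the thread reservoir

for run in (50, 200, 1000):
    flush = effective_cpi(1.0, run, FLUSH_PENALTY)
    memr = effective_cpi(1.0, run, MEMRISTOR_PENALTY)
    print(f"run={run:5d} cycles: flush CPI {flush:.2f} vs reservoir CPI {memr:.2f}")
```

The shorter the run between switches, the larger the advantage, which is what makes latency-hiding through frequent thread switching practical.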
Summary
• Distances and associated latencies lead to interesting tradeoffs in NoC-based system architecture!