Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Implications of Memory-Centric Computing Architectures for Future NoCs
Sudhakar Yalamanchili
Computer Architecture and Systems Laboratory Center for Experimental Research in Computer Systems
School of Electrical and Computer Engineering Georgia Institute of Technology
Sponsors:
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Role of NoCs
www.themobileindian.com
www.theregister.co.uk
Nvidia Tegra4 IBM Power8
Qualcomm Snapdragon
The System Defines the NoC Requirements
2
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Overview
n Impact of Technology and Applications
n Transition to Memory Centric Compute: Inside the Package
n Transition to Memory Centric Compute: Inside the Stack
n Concluding Remarks
3
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
How are Technology and Applications Reshaping Systems?
4
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Moore’s Law and the End of Dennard Scaling
From betanews.com
• Performance scaled with number of transistors*
Goal: Sustain Performance Scaling
*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
5
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Power and Performance
Perf opss
!
"#
$
%&= Power W( )×Efficiency ops
joule!
"#
$
%&
Operator_cost + Data_movement_cost + Storage_cost
Specialization ! heterogeneity, asymmetry, technology diversity
*
Power Supply (regulation) + Power Consumption + Cooling
W. J. Dally, Keynote IITC 2012
6
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Energy Cost of Data Management
Perf opss
!
"#
$
%&= Power W( )×Efficiency ops
joule!
"#
$
%&
Three operands x 64 bits/operand
DataMovementEnergy = #bits× dist −mm× energy− bit −mm
*S. Borkar and A. Chien, “The Future of Microprocessors, CACM, May 2011
Operator_cost + Data_movement_cost + Storage_cost
W. J. Dally, Keynote IITC 2012
• Refresh • Access
7
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Interconnect Energy Taper: Electrical
n Relative costs of compute and memory accesses n Time and energy costs have shifted to data movement
Courtesy Greg Astfalk, HP
Core Core Core Core
L1$ L1$ L1$ L1$
Last Level Cache
DRAM
1’s ns
ms
Data Access Latency
10’s ns
100’s ns
Data Access Energy
8
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Shift in the Balance Point
38
123
207 292
0 1 2 3 4 5 6 7 8
200 400 600 800 1000
Band
width (G
B/s)
Normalized
Perform
ance
Core Frequency (MHz)
38
151
264
0 2
4
6
8
10
12
4 8 12 16 20 24 28 32 36 40 44 Band
width (G
B/s)
Normalized
Perform
ance
# AcBve CUs
Balance plane for performance and energy
Relative energy costs of compute
and memory access
Relative ops/byte demand of application
Hardware Balance
I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.
9
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Pin Bandwidth Challenges1
n Number of transistors/die continues to grow n Number of pins growing at a slower rate than #transistors n Number of supply pins are crowding out data pins
n Reducing supply current/pin limits growth of #transistor/die
1P. Stanley-Marbell, V. C. Cabezas, and R. P. Luijten, “ Pinned to the Wall – Impact of Packaging and Applications on the Memory and Power Walls,” IEEE/ACM international symposium on Low-Power Electronics and Design (ISPLED), 2011
Data pin bandwidth is not growing as fast as number of transistors on chip
CPU Die
Package Substrate
PCB To DRAM
Die
CPU Die DRAM Die
Si Interposer
Package Substrate
PCB
10
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Re-Emergence of Processing In (Near) Memory
3D Interposer System
BGA
Power/Ground Planes
TPVs
Memory Stack
Logic Die
PWB PCB
Logic
MemoryMemoryMemoryMemory
TSV
Test pads
Glass interposer TPV RDL
Thermal adhesiveThermal Vias
Encapsulate
GPU
Kogge – Execube (4K DRAM + 100K gate parallel processor (www3.nd.edu)
Draper et.al, DIVA chip (isi.edu)
Courtesy Gokul Kumar
1990’s
Today
11
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
The Data Tsunami
www.hq.unu.edu
12
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Shift in Re-Use Patterns: Locality
Adaptive Mesh Refinement (AMR)
Temporal locality S
patia
l Loc
ality
Machines designed today for these
applications
Need machines for these applications
0
1
2
3 4
5
Graph Analytics Machine Learning
Relational Computing
13
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Shift to Finer Arithmetic Density
……
LargeQty(p) <-
Qty(q), q > 1000.
……
Relational Computations Over Massive Unstructured Data Sets
Neu
rom
orph
ic
Que
ry P
roce
sing
Candidate Application
Domains
14
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Where do the $$ and Energy Go?
GPUPwr MemPwr RestOfCardPwr
• Increasing percentage of costs
• Increasing percentage of power
• Increasing percentage of performance (latency-BW)
• Increasing memory intensive applications
I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.
15
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
The (Re)-Emergence of Near Data Processing
Silicon Interposer
Multicore Chip
Logic Tier
Memory Tiers
Where are the Networks?
New BW Hierarchy and energy taper
• Hybrid Memory Cube (HMC) • High Bandwidth Memory (HBM)
• Wide I/O
Compute Package Capacity Tier Memory 16
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Transition to Memory Centric Compute: Inside the Package
PCB
Logic
MemoryMemoryMemoryMemory
TSV
Test pads
Glass interposer TPV RDL
Thermal adhesiveThermal Vias
Encapsulate
GPU
17
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Impact of Interposer: Processor-Memory Hierarchy
V. Garg, D. Stogner, C. Ulmer, D. E. Schimmel, C. Dislis, S. Yalamanchili, and D. S. Wills, “Early Analysis of Cost/Performance Trade-Offs in MCM Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology–Part B, vol. 20, no. 3, August 1997.
n Attraction n Smaller die (better yield), process customization, and larger L2 n Better thermal behaviors
n Issues n Increased die and substrate testing costs (increase in I/Os) n Cost
18
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Impact of Interposer: SIMD Image Processor
V. Garg, D. Stogner, C. Ulmer, D. E. Schimmel, C. Dislis, S. Yalamanchili, and D. S. Wills, “Early Analysis of Cost/Performance Trade-Offs in MCM Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology–Part B, vol. 20, no. 3, August 1997.
n Trading cost vs. chip size vs number of I/Os
n What is the most cost effective partitioning?
19
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
The Opportunity: Network in Package
• Smaller micro bumps à increased #connections • Shorter faster wires in the interposer • Higher signal integrity à Lower power • Interposer cost? • Example: Network on Interposer: Jerger, Kannan, Li, and
Loh (MICRO 2014) • Example: memory networks: G. Kim et. al. PACT 2013
PCB
Logic
MemoryMemoryMemoryMemory
TSV
Test pads
Glass interposer TPV RDL
Thermal adhesiveThermal Vias
Encapsulate
GPU
20
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Transition to Memory-Centric Compute: Inside The Stack
21
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
3D Multicore Architecture
http://www.extremetech.com/computing/197720-beyond-ddr4-understand-the-differences-between-wide-io-hbm-and-hybrid-memory-cube
PCB
Logic
MemoryMemoryMemoryMemory
TSV
Test pads
Glass interposer TPV RDL
Thermal adhesiveThermal Vias
Encapsulate
GPU
22
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Source of Memory Bandwidth
• Mismatch between bus bandwidth and DRAM access latency
• Over the past two decades density has increased by 1000X and latency reduced by 56% [source:hynix]
• Solution: Have more, narrower channels and exploit parallelism
23
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Parallelism in the Memory System
n Move towards more narrower channels n Increases data cycles improving efficiency and utilization
4 channels with 256 bit bus vs. 16 channels with 64 bit bus
32, x86 cores, Hybrid Memory Cube (HMC) Model
You can buy bandwidth but you cannot bribe God! -- Unknown
24
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
2D Bandwidth Still Matters!
25
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Network Impact of Memory Parallelism
Impact of reduced queuing delays
• Tiled 3D memory with 16 channels, similar to HMC
• Distributed directory based coherence with shared L2 banks
• DRAM latency vs. MC queuing
More channels à less load per channel à reduced queueing
2D torus network on compute Tier Network component
26
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Impact of DRAM Latency vs. Network Latency
• Impact on all requests is low.
• Coherence requires a request to take multiple hops before being satisfied
• It’s the network!
Normalized increase in CAS latency
32 cores, 16 memory channels
2D Network dominates
L1 L2 MC
1 2
3 4
Coherence Traffic
Miss Traffic
L1
2
27
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
The Opportunity
3D BW
2DBW Locality
28
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Refactoring the Memory Hierarchy
29
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
How is the Network Used?
cannealdedup
ferretfluidanimate
s treamclus tervips
050
100150200250300350400450
HPKI with varying L1 s ize 2481632
HPKI
0 5 10 15 20 25 30 3502468
10121416
L1 S ize vs G lobal IPCcannealdedupferretfluidanimates treamclus tervips
L 1 S ize P er Node (in KB )
Globa
l IPC
Hop Count Per Kilo Instructions (HPKI) Core Core Core Core
L1$ L1$ L1$ L1$
DRAM
DRAM
DRAM
DRAM
MC
MC
MC
MC
L2$ L2$ L2$ L2$
L1 L2 MC
1 2
3 4
Coherence Traffic
Miss Traffic
L1
2
30
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Optimization: Memory Side Caching
n Refactor memory hierarchy to reduce hop count
n Modify address space mappings to retain/improve locality
n Maximize L2-DRAM BW n Remove serialization latency
31
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Importance of Address Space Management: Locality
Core
Core
Core
Core
L1$
L1$
L1$
L1$
DRAM
DRAM
DRAM
DRAM
MC
MC
MC
MC
L2$
L2$
L2$
L2$
Global Address Space
Cache Address Mapping (CAM)
Global Address Mapping (GAM)
Local Address Mapping (LAM)
• Different mapping functions determine parallelism in memory and network traffic • More parallelism à better 3D bandwidth utilization but more load on the network
32
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Emphasis on Co-Design
Core Core Core Core
L1$ L1$ L1$ L1$
DRAM DRAM MC
MC
Global Address Space
L2$ L2$
DRAM MC L2$ DRAM M
C L2$
n Refactor the memory hierarchy n Co-design address space assignment across levels of the memory hierarchy
n Diversity of interconnect technologies 33
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Cymric: A Near Data Processing Architecture
……
Stream Multiprocessor
Custom Lightweight Parametric GPU C++ Design µArchitectureFlow
Processor Design
Research Topics
Memory Hierarchy
HARP Compiler
PNM Runtime System
PNM Device Driver
Network + Cache?
PNM cores
Memory
Memory
Memory
Memory
3D
Logic Dies
Memory
Memory Memory Memory
PNM cores
2.5D
• High Radix, Small Buffer Networks • GHC, Slimfly, FB
H. Kim, S. Mukhopadhyay, and S. Yalamanchili
34
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Short Stack: Physical Structure
Active Layer BEOL Silicon TSV
25um
25um 10um
25um 10um
100um
LLC & Memory controllers L2 (per core): 2MB, 4096 sets, 128B, 35 cycles;
Die Dimension: 8.4 mm X 8.4 mm
16 homogeneous OOO x86 cores 3GHz, 1.0V, max temp 100◦C DL1: 128KB, 4096 sets, 64B L1: 32KB, 128 sets, 64B, 1 cycles; Short Stack
Memory Controller and
LLC Bank
35
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Why is Temperature A NoC Problem?
36
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Thermal Challenges and Microfluidic Cooling
37
• Fluid flow through the microchannels carry heat out to an external heat exchanger (e.g., heat sink)
Courtesy Professor Muhannad Bakir (GT/ECE)
This research supported by the DARPA ICECOOL Program
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
3D FPGA: 3D Bandwidth – Performance Tradeoff
Physical Architecture of Micro-Fluidically Cooled 3D FPGA
Parameters Values
Chip Size (1.65x1.65)cm2
Number of CLBs 980K Number of TSVs per 3D Switch Box
4
Horizontal Routing Channel BW
30
Micro-Channel Dimension () (50x100) µm2
Fluid Velocity 1 m s-1
(50um pitch)
n Tradeoff between cooling capacity and 3D bandwidth
n Co-Design Exploration n Routing quality plays a key
role
Courtesy: A. Srivastava (UMD)
2D vs. 3D congestion
38
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Concluding Remarks
n We are going to see a reshaping of the boundaries between compute and memory
n System-level Co-design for memory-centric compute
n Exploration of network technologies: wireless, capacitive coupling, optics, etc.
n Expand the scope of traditional NoCs
39
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 40
Applications &
Architecture
Power
NoC
Technology & Cooling
Thank YouQuestions?
Scaling Performance