Implications of Memory-Centric Computing Architectures for …casl.gatech.edu/wp-content/uploads/2016/01/nocs_keynote.pdf · SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Implications of Memory-Centric Computing Architectures for Future NoCs

Sudhakar Yalamanchili

Computer Architecture and Systems Laboratory Center for Experimental Research in Computer Systems

School of Electrical and Computer Engineering Georgia Institute of Technology

Sponsors:


Role of NoCs

www.themobileindian.com

www.theregister.co.uk

Nvidia Tegra4 IBM Power8

Qualcomm Snapdragon

The System Defines the NoC Requirements

2


Overview

n Impact of Technology and Applications

n Transition to Memory Centric Compute: Inside the Package

n Transition to Memory Centric Compute: Inside the Stack

n Concluding Remarks

3


How are Technology and Applications Reshaping Systems?

4


Moore’s Law and the End of Dennard Scaling

From betanews.com

•  Performance scaled with number of transistors*

Goal: Sustain Performance Scaling

*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.

5


Power and Performance

Perf opss

!

"#

$

%&= Power W( )×Efficiency ops

joule!

"#

$

%&

Operator_cost + Data_movement_cost + Storage_cost

Specialization ! heterogeneity, asymmetry, technology diversity

*

Power Supply (regulation) + Power Consumption + Cooling

W. J. Dally, Keynote IITC 2012

6


Energy Cost of Data Management

Perf opss

!

"#

$

%&= Power W( )×Efficiency ops

joule!

"#

$

%&

Three operands x 64 bits/operand

DataMovementEnergy = #bits× dist −mm× energy− bit −mm

*S. Borkar and A. Chien, “The Future of Microprocessors, CACM, May 2011

Operator_cost + Data_movement_cost + Storage_cost

W. J. Dally, Keynote IITC 2012

•  Refresh •  Access

7


Interconnect Energy Taper: Electrical

n Relative costs of compute and memory accesses n Time and energy costs have shifted to data movement

Courtesy Greg Astfalk, HP

Core Core Core Core

L1$ L1$ L1$ L1$

Last Level Cache

DRAM

1’s ns

ms

Data Access Latency

10’s ns

100’s ns

Data Access Energy

8


Shift in the Balance Point

38

123

207 292

0 1 2 3 4 5 6 7 8

200 400 600 800 1000

Band

width (G

B/s)

Normalized

Perform

ance

Core Frequency (MHz)

38

151

264

0 2

4

6

8

10

12

4 8 12 16 20 24 28 32 36 40 44 Band

width (G

B/s)

Normalized

Perform

ance

# AcBve CUs

Balance plane for performance and energy

Relative energy costs of compute

and memory access

Relative ops/byte demand of application

Hardware Balance

I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.

9


Pin Bandwidth Challenges1

n Number of transistors/die continues to grow n Number of pins growing at a slower rate than #transistors n Number of supply pins are crowding out data pins

n Reducing supply current/pin limits growth of #transistor/die

1P. Stanley-Marbell, V. C. Cabezas, and R. P. Luijten, “ Pinned to the Wall – Impact of Packaging and Applications on the Memory and Power Walls,” IEEE/ACM international symposium on Low-Power Electronics and Design (ISPLED), 2011

Data pin bandwidth is not growing as fast as number of transistors on chip

CPU Die

Package Substrate

PCB To DRAM

Die

CPU Die DRAM Die

Si Interposer

Package Substrate

PCB

10


Re-Emergence of Processing In (Near) Memory

3D Interposer System

BGA

Power/Ground Planes

TPVs

Memory Stack

Logic Die

PWB PCB

Logic

MemoryMemoryMemoryMemory

TSV

Test pads

Glass interposer TPV RDL

Thermal adhesiveThermal Vias

Encapsulate

GPU

Kogge – Execube (4K DRAM + 100K gate parallel processor (www3.nd.edu)

Draper et.al, DIVA chip (isi.edu)

Courtesy Gokul Kumar

1990’s

Today

11


The Data Tsunami

www.hq.unu.edu

12


Shift in Re-Use Patterns: Locality

Adaptive Mesh Refinement (AMR)

Temporal locality S

patia

l Loc

ality

Machines designed today for these

applications

Need machines for these applications

0

1

2

3 4

5

Graph Analytics Machine Learning

Relational Computing

13


Shift to Finer Arithmetic Density

……

LargeQty(p) <-

Qty(q), q > 1000.

……

Relational Computations Over Massive Unstructured Data Sets

Neu

rom

orph

ic

Que

ry P

roce

sing

Candidate Application

Domains

14


Where do the $$ and Energy Go?

GPUPwr MemPwr RestOfCardPwr

•  Increasing percentage of costs

•  Increasing percentage of power

•  Increasing percentage of performance (latency-BW)

•  Increasing memory intensive applications

I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High Performance GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.

15


The (Re)-Emergence of Near Data Processing

Silicon Interposer

Multicore Chip

Logic Tier

Memory Tiers

Where are the Networks?

New BW Hierarchy and energy taper

• Hybrid Memory Cube (HMC) • High Bandwidth Memory (HBM)

• Wide I/O

Compute Package Capacity Tier Memory 16


Transition to Memory Centric Compute: Inside the Package

PCB

Logic


TSV

Test pads



Encapsulate

GPU

17


Impact of Interposer: Processor-Memory Hierarchy

V. Garg, D. Stogner, C. Ulmer, D. E. Schimmel, C. Dislis, S. Yalamanchili, and D. S. Wills, “Early Analysis of Cost/Performance Trade-Offs in MCM Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology–Part B, vol. 20, no. 3, August 1997.

n Attraction n Smaller die (better yield), process customization, and larger L2 n Better thermal behaviors

n Issues n Increased die and substrate testing costs (increase in I/Os) n Cost

18


Impact of Interposer: SIMD Image Processor

V. Garg, D. Stogner, C. Ulmer, D. E. Schimmel, C. Dislis, S. Yalamanchili, and D. S. Wills, “Early Analysis of Cost/Performance Trade-Offs in MCM Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology–Part B, vol. 20, no. 3, August 1997.

n Trading cost vs. chip size vs number of I/Os

n What is the most cost effective partitioning?

19


The Opportunity: Network in Package

•  Smaller micro bumps à increased #connections •  Shorter faster wires in the interposer •  Higher signal integrity à Lower power •  Interposer cost? •  Example: Network on Interposer: Jerger, Kannan, Li, and

Loh (MICRO 2014) •  Example: memory networks: G. Kim et. al. PACT 2013

PCB

Logic


TSV

Test pads



Encapsulate

GPU

20


Transition to Memory-Centric Compute: Inside The Stack

21


3D Multicore Architecture

http://www.extremetech.com/computing/197720-beyond-ddr4-understand-the-differences-between-wide-io-hbm-and-hybrid-memory-cube

PCB

Logic


TSV

Test pads



Encapsulate

GPU

22


Source of Memory Bandwidth

•  Mismatch between bus bandwidth and DRAM access latency

•  Over the past two decades density has increased by 1000X and latency reduced by 56% [source:hynix]

•  Solution: Have more, narrower channels and exploit parallelism

23


Parallelism in the Memory System

n Move towards more narrower channels n Increases data cycles improving efficiency and utilization

4 channels with 256 bit bus vs. 16 channels with 64 bit bus

32, x86 cores, Hybrid Memory Cube (HMC) Model

You can buy bandwidth but you cannot bribe God! -- Unknown

24


2D Bandwidth Still Matters!

25


Network Impact of Memory Parallelism

Impact of reduced queuing delays

•  Tiled 3D memory with 16 channels, similar to HMC

• Distributed directory based coherence with shared L2 banks

• DRAM latency vs. MC queuing

More channels à less load per channel à reduced queueing

2D torus network on compute Tier Network component

26


Impact of DRAM Latency vs. Network Latency

•  Impact on all requests is low.

• Coherence requires a request to take multiple hops before being satisfied

•  It’s the network!

Normalized increase in CAS latency

32 cores, 16 memory channels

2D Network dominates

L1 L2 MC

1 2

3 4

Coherence Traffic

Miss Traffic

L1

2

27


The Opportunity

3D BW

2DBW Locality

28


Refactoring the Memory Hierarchy

29


How is the Network Used?

cannealdedup

ferretfluidanimate

s treamclus tervips

050

100150200250300350400450

HPKI with varying L1 s ize 2481632

HPKI

0 5 10 15 20 25 30 3502468

10121416

L1 S ize vs G lobal IPCcannealdedupferretfluidanimates treamclus tervips

L 1 S ize P er Node (in KB )

Globa

l IPC

Hop Count Per Kilo Instructions (HPKI) Core Core Core Core

L1$ L1$ L1$ L1$

DRAM

DRAM

DRAM

DRAM

MC

MC

MC

MC

L2$ L2$ L2$ L2$

L1 L2 MC

1 2

3 4

Coherence Traffic

Miss Traffic

L1

2

30


Optimization: Memory Side Caching

n Refactor memory hierarchy to reduce hop count

n Modify address space mappings to retain/improve locality

n Maximize L2-DRAM BW n Remove serialization latency

31


Importance of Address Space Management: Locality

Core

Core

Core

Core

L1$

L1$

L1$

L1$

DRAM

DRAM

DRAM

DRAM

MC

MC

MC

MC

L2$

L2$

L2$

L2$

Global Address Space

Cache Address Mapping (CAM)

Global Address Mapping (GAM)

Local Address Mapping (LAM)

•  Different mapping functions determine parallelism in memory and network traffic •  More parallelism à better 3D bandwidth utilization but more load on the network

32


Emphasis on Co-Design

Core Core Core Core

L1$ L1$ L1$ L1$

DRAM DRAM MC

MC

Global Address Space

L2$ L2$

DRAM MC L2$ DRAM M

C L2$

n Refactor the memory hierarchy n Co-design address space assignment across levels of the memory hierarchy

n Diversity of interconnect technologies 33


Cymric: A Near Data Processing Architecture

……

Stream Multiprocessor

Custom Lightweight Parametric GPU C++ Design µArchitectureFlow

Processor Design

Research Topics

Memory Hierarchy

HARP Compiler

PNM Runtime System

PNM Device Driver

Network + Cache?

PNM cores

Memory

Memory

Memory

Memory

3D

Logic Dies

Memory

Memory Memory Memory

PNM cores

2.5D

•  High Radix, Small Buffer Networks •  GHC, Slimfly, FB

H. Kim, S. Mukhopadhyay, and S. Yalamanchili

34


Short Stack: Physical Structure

Active Layer BEOL Silicon TSV

25um

25um 10um

25um 10um

100um

LLC & Memory controllers L2 (per core): 2MB, 4096 sets, 128B, 35 cycles;

Die Dimension: 8.4 mm X 8.4 mm

16 homogeneous OOO x86 cores 3GHz, 1.0V, max temp 100◦C DL1: 128KB, 4096 sets, 64B L1: 32KB, 128 sets, 64B, 1 cycles; Short Stack

Memory Controller and

LLC Bank

35


Why is Temperature A NoC Problem?

36


Thermal Challenges and Microfluidic Cooling

37

•  Fluid flow through the microchannels carry heat out to an external heat exchanger (e.g., heat sink)

Courtesy Professor Muhannad Bakir (GT/ECE)

This research supported by the DARPA ICECOOL Program


3D FPGA: 3D Bandwidth – Performance Tradeoff

Physical Architecture of Micro-Fluidically Cooled 3D FPGA

Parameters Values

Chip Size (1.65x1.65)cm2

Number of CLBs 980K Number of TSVs per 3D Switch Box

4

Horizontal Routing Channel BW

30

Micro-Channel Dimension () (50x100) µm2

Fluid Velocity 1 m s-1

(50um pitch)

n Tradeoff between cooling capacity and 3D bandwidth

n Co-Design Exploration n Routing quality plays a key

role

Courtesy: A. Srivastava (UMD)

2D vs. 3D congestion

38


Concluding Remarks

n We are going to see a reshaping of the boundaries between compute and memory

n System-level Co-design for memory-centric compute

n Exploration of network technologies: wireless, capacitive coupling, optics, etc.

n Expand the scope of traditional NoCs

39

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 40

Applications &

Architecture

Power

NoC

Technology & Cooling

Thank YouQuestions?

Scaling Performance

Documents

Implications of Memory-Centric Computing Architectures for …casl.gatech.edu/wp-content/uploads/2016/01/nocs_keynote.pdf · SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA