Next Generation Stacked Memory Systems
Alok Gupta
NVIDIA, Santa Clara, CA
OUTLINE
1. Motivation
2. Memory Bandwidth Trends
3. Stacked Memory System
4. Conclusion
5. Q&A
MOTIVATION
To keep up with increasing logic horsepower, memory bandwidth must scale every generation, or performance becomes IO-limited
Absolute power is not just a mobile problem
For example, GPUs are already maxed out at 225-300W power budgets
IO bandwidth improvements must be achieved within the same power budget as the previous generation, which means memory system power must stay flat
Process scaling brings limited improvement in memory IO power – logic at least benefits from Moore's law
New technologies and ideas are needed to keep memory bandwidth growing within a similar power envelope
GPU MEMORY SYSTEM BANDWIDTH
MEMORY BANDWIDTH MAPPED TO DRAM TECHNOLOGY
[Chart: memory bandwidth mapped to DRAM technology generations, culminating in stacked memory]
PACKAGING TRENDS
| | Conventional MCM (2D) | Organic Interposer (2.1D) | Silicon Interposer (2.5D) | Die Stacking (3D) |
|---|---|---|---|---|
| Complexity | Well understood | Well-established process; no TSVs needed | TSVs limited to silicon interposer | TSVs needed across all chips in the stack |
| Cost | Low | Medium (TBD) | Higher | Higher |
| Form factor (size, reference) | Smaller PCB, larger package | Smaller PCB, larger package | Smaller PCB, similar package | Smaller PCB, smaller package |

[Diagrams: GPU-package cross-sections for each option, including the silicon interposer variant]
HIGH BANDWIDTH MEMORY (HBM) DRAM
• A single package containing multiple memory die stacked together, using through-silicon vias (TSV). The memory within HBM is organized into channels wherein each channel is functionally and operationally independent
• HBM DRAM uses a wide-interface architecture to achieve high-speed, low-power operation and is best suited for 2.5D Silicon Interposer based system designs
[Figure callout: base logic layer implemented in the DRAM process]
HIGH BANDWIDTH MEMORY (HBM) DRAM
HBM DRAM array
2-8Gb DRAM die w/ ECC
4/8-high stack – 1GB to 8GB per stack
Up to 256GB/s DRAM internal bandwidth
Base Layer w/ HBM IO + DRAM Test and Repair logic – in DRAM process
DRAM Interface
Signaling – 1.2V LVCMOS
Data rate – 800MHz-1000MHz DDR, wide 1024-bit interface
MOBILE WIDE-IO2 MEMORY
Density: 8Gb
4/8 independent 64-bit channels, 256/512-bit interface
no cross channel restrictions
Interface Speed: 400-533MHz DDR
Bandwidth: 25.6-68.2GBps
1 through 4 high stacks
Single-die (mono) stack is micro-bumped without TSVs; multi-high stacks require TSVs
Power efficient – leverages LP process, CMOS signaling
Designed for 3D stacking but can be made to work for 2.1/2.5D solutions
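The peak-bandwidth figures quoted for HBM and Wide-IO2 both fall out of bus width × per-pin data rate. A quick sanity-check sketch (the 533 MHz figure is back-calculated from the 68.2 GBps ceiling, an assumption on my part):

```python
def peak_bandwidth_gbs(bus_bits: int, clock_mhz: float, ddr: bool = True) -> float:
    """Peak bandwidth in GB/s = bus width * transfers per second / 8 bits per byte."""
    transfers_per_sec = clock_mhz * 1e6 * (2 if ddr else 1)  # DDR moves data on both edges
    return bus_bits * transfers_per_sec / 8 / 1e9

# HBM: 1024-bit interface at 1000 MHz DDR
print(peak_bandwidth_gbs(1024, 1000))  # 256.0 -> "up to 256 GB/s"

# Wide-IO2: 256-bit @ 400 MHz DDR (low end), 512-bit @ ~533 MHz DDR (high end)
print(peak_bandwidth_gbs(256, 400))    # 25.6
print(peak_bandwidth_gbs(512, 533))    # ~68.2
```

Note that a 566 MHz DDR clock with a 512-bit interface would give 72.4 GB/s, not the 68.2 GBps stated, which is why 533 MHz is used above.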
2.1D MEMORY SYSTEMS WITH WIDE MEMORY
| | Organic Interposer | Fan-Out WLP | Fan-Out WLP PoP | Interposer PoP |
|---|---|---|---|---|
| PKG height (mm) | 0.84 | 0.53 | 0.90 | 0.90 |
| PKG technology maturity | Medium | Medium | Low | Medium |
| Thermal | Good | Good | Poor | Poor |
| PKG reliability | Unknown | Unknown | Unknown | Unknown |
| Cost | Unknown | Unknown | Unknown | Unknown |

[Diagrams: package construction cross-sections for each option]
2.1D MEMORY SYSTEM CHALLENGES
Number of IOs
Signal density and routing limit the number of IOs
Interface Speed
The channel is not as benign as in other stacked solutions – a performance/power trade-off
Bandwidth and Capacity scaling
Package Reliability
Solution cost is still a work in progress
2.5D MEMORY SYSTEM WITH HBM DRAM
Passive silicon interposer
[Figure: cross-section view – GPU/CPU and HBM stacks side by side on a silicon interposer over the package substrate; top view – GPU/CPU flanked by four HBM stacks]
2.5D HIGH DENSITY GPU-MEMORY INTERCONNECT
Silicon interposer enables fine pitch geometries
>50x finer geometry
Performance depends on signal integrity requirements
Loss (trace width)
Crosstalk (trace spacing)
GPU - HBM signal routing on Silicon Interposer
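The ">50x finer geometry" claim can be made concrete with routing pitches; the pitch values below are illustrative assumptions on my part, not figures from the slides:

```python
# Illustrative (assumed) routing pitches, in micrometers
pcb_pitch_um = 100.0       # typical organic package / PCB trace pitch
interposer_pitch_um = 2.0  # typical silicon interposer trace pitch

def wires_per_mm(pitch_um: float) -> int:
    """Signals routable per mm of die edge, per routing layer, at a given pitch."""
    return int(1000 / pitch_um)

print(wires_per_mm(pcb_pitch_um))         # 10
print(wires_per_mm(interposer_pitch_um))  # 500 -> 50x more wires per mm
```

At these assumed pitches, a 1024-bit HBM interface that would be impractical to escape on an organic substrate fits in a few millimeters of die edge on the interposer.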
SILICON INTERPOSER LOSS & CROSSTALK
Insertion Loss
Resistance in channel
Slew Rate degradation due to channel loss
Very simple channel transfer function (almost RC) compared to off-chip signaling
Crosstalk dominated by adjacent aggressors
Line space and thickness
Eye is nice and open
[Figure: silicon interposer channel characteristics – resistance creates DC channel loss; slew-rate degradation due to loss; coupled crosstalk dominated by adjacent signals; sharp roll-off without resonances]
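The "almost RC" channel behavior can be sketched as a first-order low-pass whose magnitude rolls off smoothly with no resonances. The R and C values below are illustrative assumptions, not measured interposer data:

```python
import math

def rc_channel_gain_db(freq_hz: float, r_ohm: float, c_farad: float) -> float:
    """First-order RC low-pass magnitude in dB: |H| = 1/sqrt(1 + (wRC)^2)."""
    w = 2 * math.pi * freq_hz
    mag = 1 / math.sqrt(1 + (w * r_ohm * c_farad) ** 2)
    return 20 * math.log10(mag)

# Illustrative values: 50 ohm of trace resistance driving a 1 pF load
for f in (1e8, 1e9, 1e10):
    print(f"{f:.0e} Hz: {rc_channel_gain_db(f, 50, 1e-12):.2f} dB")
```

The response is flat at low frequency and falls off monotonically past the RC corner, with none of the reflections or resonant notches typical of off-chip channels.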
2.5D MEMORY SYSTEM CHALLENGES
Stacked memory solutions are still at an early stage – no very-high-volume products yet
Eco-system challenges
Multi-sourcing
Active collaboration required between foundry, memory vendor, and OSAT to deliver a successful product
Assembly, Test/Repair, and failure analysis
Solution cost trend is a big unknown
3D MEMORY SYSTEM WITH WIDE-IO2
WIDE-IO2 SOC CO-LAYOUT
Routing blockages
Power delivery
Keep-out regions
Thermal hot-spots
3D MEMORY SYSTEM CHALLENGES
Memory and SoC Co-layout
Bandwidth and Capacity scaling
Thermals
Power delivery
Cost
SUMMARY
Stacked memory is a promising solution to the ever-growing demand for bandwidth
Xilinx is shipping large FPGAs using silicon interposer – solves a unique problem
High-volume manufacturing for mainstream GPU/CPU devices is still a work in progress
Business challenges
Multiple companies need to work together (foundry + memory vendor + OSAT)
Assembly, Failure Analysis, Test and Repair
DRAM Cost per bit
Stacking of heterogeneous devices not well understood – co-design, thermal and mechanical challenges
QUESTIONS?