www.cs.utah.edu/~udipi Designing Efficient Memory for Future Computing Systems Aniruddha N. Udipi University of Utah www.cs.utah.edu/~u Ph.D. Dissertation Defense, March 7, 2012 Advisor: Rajeev Balasubramonian


Designing Efficient Memory for Future Computing Systems

Aniruddha N. Udipi University of Utah


Ph.D. Dissertation Defense, March 7, 2012 Advisor: Rajeev Balasubramonian


My other computer is..


Scaling server farms

• Facebook: 30,000 servers, 80 Billion images stored, serves 600,000 photos a second, logs 25 TB of data per day… the statistics can go on..

• The primary challenge to scaling: efficient supply of data to thousands of cores

• It’s all about the memory!


Performance Trends

• Demand-side
  – Multi-socket, multi-core, multi-thread
  – Large datasets: big data analytics, scientific computation models
  – RAMCloud-like designs
  – 1 TB/s per node by 2017
• Supply-side
  – Pin count, per-pin BW, capacity
  – Severely power limited


Source: ZDNet

Source: Tom’s Hardware


Energy Trends

• Datacenters consume ~2% of all power generated in the US
  – Operation + cooling
• 100 Billion kWh, $7.4 Billion
• 25-40% of total power in large systems consumed in memory
• As processors get simpler, this fraction is likely to increase


Cost-per-bit

• Traditionally the holy grail of DRAM design
• Operational expenditure over 3 years == Capital expenditure in datacenter servers
  – Cost-per-bit less important than before


$3.00 13W

$0.30 60W


Complexity Trends

• The job of the memory controller is hard
  – 18+ timing parameters for DRAM!
  – Maintenance operations: refresh, scrub, power down, etc.
• Several DIMM and controller variants
  – Hard to provide interoperability
  – Need processor-side support for new memory features
• Now throw in heterogeneity
  – Memristors, PCM, STT-RAM, etc.


Reliability Trends

• Shrinking feature sizes not helping
• Nor is the scale
  – 64 × 10^15 DRAM cells in a typical datacenter
• DRAM errors the #1 reason for servers at Google to enter repair
• Datacenters are the backbone of web-connected infrastructure
  – Reliability is essential
• Server downtime has huge economic impact
  – Breached SLAs, for example


Thesis statement

• Main memory systems are at an inflection point
  – Convergence of several trends
• Major overhaul required to achieve a system that is
  – Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
• Combination of two things
  – Prudent application of novel technologies
  – Fundamental rethinking of conventional design decisions


Designing Future Memory Systems

[Figure: CPU, MC, and DIMMs, annotated with the four numbered areas below]

1 Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
2 Memory Interconnect – Prudent use of Silicon Photonics, without modifying DRAM dies [ISCA ’11]
3 Memory protocol – Streamlined Slot-based Interface with semi-autonomous memory [ISCA ’11]
4 Memory Reliability – Efficient RAID-based high-availability Chipkill memory [ISCA ’12]


PART 1 – Memory Chip Organization


Key bottleneck

[Figure: RAS and CAS accessing a cache line spread across several DRAM chips and their row buffers. One bank shown in each chip.]


Why this is a problem


SSA Architecture

[Figure: SSA architecture. Labels: MEMORY CONTROLLER, ADDR/CMD BUS, DATA BUS (eight 8-bit links), DIMM, ONE DRAM CHIP, Bank, Subarray, Bitlines, Row buffer, Global Interconnect to I/O, 64 Bytes]


SSA Operation

[Figure: SSA operation. Labels: Address, Cache Line, DRAM Chip, Subarray, Sleep Mode (or other parallel accesses)]


SSA Impact

• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint, more and longer spells of inactivity, better power down
• Latency impact
  – Limited pins per cache line: serialization latency
  – Higher bank-level parallelism: shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
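The serialization latency mentioned above can be illustrated with a little arithmetic. This is only a sketch: it assumes a 64-byte cache line, a conventional 64-bit data bus (eight 8-bit links, as the SSA architecture figure suggests), and that under SSA a line is served over a single chip's 8 pins. The function name `transfer_beats` is hypothetical.

```python
def transfer_beats(line_bytes: int, data_pins: int) -> int:
    """Number of bus beats needed to move one cache line over `data_pins` pins."""
    bits = line_bytes * 8
    return -(-bits // data_pins)  # ceiling division


# Assumption: the conventional design stripes the line across all 64 data pins.
conventional_beats = transfer_beats(64, 64)
# Assumption: with SSA the whole line comes out of one chip's 8 pins.
ssa_beats = transfer_beats(64, 8)

print(conventional_beats, ssa_beats)
```

Under these assumptions the SSA transfer takes eight times as many beats per line, which is the cost that the higher bank-level parallelism is meant to offset.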


Key Contributions

• Up to 6X reduction in DRAM chip dynamic energy

• Up to 5X reduction in DRAM chip static energy

• Up to 50% improvements in performance in applications limited by bank contention

• All for ~5% increase in area


PART 2 – Memory Interconnect


Key Bottleneck

• Fundamental nature of electrical pins
  – Limited pin count, per-pin bandwidth, memory capacity, etc.
• Diverging growth rates of core count and pin count
• Limited by physics, not engineering!


Silicon Photonic Interconnects

• We need something that can break the edge-bandwidth bottleneck

• Ring modulator based photonics
  – Off-chip light source
  – Indirect modulation using resonant rings
  – Relatively cheap coupling on- and off-chip
• DWDM for high bandwidth density
  – As many as 67 wavelengths possible
  – Limited by Free Spectral Range, and coupling losses between rings

Source: Xu et al. Optics Express 16(6), 2008

DWDM

64 λ × 10 Gbps/ λ = 80 GB/s per waveguide
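The per-waveguide figure above is a unit conversion from gigabits to gigabytes, and a short sketch makes it easy to check or vary. The function name is hypothetical, and the 64 wavelengths at 10 Gbps each are the values quoted on the slide.

```python
def waveguide_bandwidth_gbytes(wavelengths: int, gbps_per_wavelength: float) -> float:
    """Aggregate DWDM bandwidth of one waveguide in GB/s (8 bits per byte)."""
    return wavelengths * gbps_per_wavelength / 8.0


# Values from the slide: 64 wavelengths, 10 Gbps per wavelength.
print(waveguide_bandwidth_gbytes(64, 10.0))  # 80.0
```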


The Questions We’re Trying to Answer


Should we replace all interconnects with photonics? On-chip too?

Should we be designing photonic DRAM dies? Stacks? Channels?

How do we make photonics less invasive to memory die design?

What should the role of 3D be in an optically connected memory?

What should the role of electrical signaling be?


Design Considerations – I

• Photonic interconnects
  – Large static power dissipation: ring tuning
    – Rings are designed to resonate at a specific frequency
    – Processing defects and temperature change this
    – Need to heat the rings to correct for this
  – Much lower dynamic energy consumption, relatively independent of distance
• Electrical interconnects
  – Relatively small static power dissipation
  – Large dynamic energy consumption
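The static-versus-dynamic trade-off can be sketched with a toy energy model in which static power is amortized over the bits actually delivered. Everything here is an assumption for illustration: the function name, the model itself, and the numbers in the example are hypothetical and are not measurements from this work.

```python
def energy_per_bit_pj(static_power_mw: float, dynamic_pj_per_bit: float,
                      peak_gbps: float, utilization: float) -> float:
    """Toy model: energy per delivered bit in pJ.

    Static power is paid whether or not the link is used, so its per-bit
    cost grows as utilization drops; dynamic energy is paid per bit.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    delivered_bits_per_s = peak_gbps * 1e9 * utilization
    static_pj_per_bit = (static_power_mw * 1e-3) / delivered_bits_per_s * 1e12
    return static_pj_per_bit + dynamic_pj_per_bit


# Hypothetical link: 100 mW static, 0.1 pJ/bit dynamic, 100 Gbps peak.
busy = energy_per_bit_pj(100.0, 0.1, 100.0, utilization=1.0)
idle = energy_per_bit_pj(100.0, 0.1, 100.0, utilization=0.1)
print(busy, idle)
```

In this model a link dominated by static power is only efficient when it is kept busy, whereas a link dominated by dynamic energy costs roughly the same per bit at any utilization.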


Design Considerations – II

• Should not over-provision photonic bandwidth, use only where necessary
• Use photonics where they’re really useful
  – To break the off-chip pin barrier
• Exploit 3D-Stacking and TSVs
  – High bandwidth, low static power, decouples memory dies
• Exploit low-swing wires
  – Cheap on-chip communication


Proposed Design

[Figure: proposed design. Labels: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic Interface die]

ADVANTAGE 1: Increased activity factor, more efficient use of photonics
ADVANTAGE 2: Rings are co-located; easier to isolate or tune thermally
ADVANTAGE 3: Not disruptive to the design of commodity memory dies


Key Contributions

• 23% reduced energy consumption
• 4X capacity per channel
• Potential for performance improvements due to increased bank count
• Less disruptive to memory die design

[Figure: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic Interface die]

Makes the job of the memory controller difficult!


PART 3 – Memory Access Protocol


Key Bottleneck

• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
• Memory controller micro-manages every operation of the memory system
  – Processor-side support required for every memory innovation
  – Several signals between processor and memory
    – Heavy pressure on address/command bus
    – Worse with several independent banks, large amounts of state


Proposed Solution

• Release MC’s tight control, make memory stack more autonomous
• Move mundane tasks to the interface die
  – Maintenance operations (refresh, scrub, etc.)
  – Routine operations (DRAM precharge, NVM wear leveling)
  – Timing control (18+ constraints for DRAM alone)
  – Coding and any other special requirements
• Processor-side controller only schedules requests and controls data bus


Memory Access Operation

[Figure: timeline of one memory request. Labels: Arrival, Issue, Start looking, First free slot (S1), Backup slot (S2), ML, > ML, Time; reserved slots marked X]

Slot – Cache line data bus occupancy
X – Reserved Slot
ML – Memory Latency = Addr. latency + Bank access + Data bus latency
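The slot-based access can be sketched as a small reservation routine on the processor-side controller. This is a minimal sketch, not the scheme's actual implementation: time is counted in whole data-bus slots, `reserved` is a hypothetical set of already-taken slot indices, and placing the backup slot at least one further ML after the first slot is an assumption read off the figure's "> ML" label.

```python
def first_free_slot(reserved: set, start: int) -> int:
    """Return the earliest slot index >= start that is not yet reserved."""
    slot = start
    while slot in reserved:
        slot += 1
    return slot


def reserve_slots(reserved: set, issue_slot: int, ml_slots: int):
    """Reserve a data-bus slot and a backup slot for one read request.

    The controller starts looking ML after issue and takes the first free
    slot. Assumption: the backup slot is searched for at least another ML
    after the first slot, for the case where memory misses the first one.
    """
    first = first_free_slot(reserved, issue_slot + ml_slots)
    reserved.add(first)
    backup = first_free_slot(reserved, first + ml_slots)
    reserved.add(backup)
    return first, backup


# Hypothetical example: slots 5 and 6 are already taken, ML is 5 slots.
taken = {5, 6}
print(reserve_slots(taken, issue_slot=0, ml_slots=5))  # (7, 12)
```

The point of the design is that the controller tracks only data-bus occupancy, while per-bank timing stays on the memory side.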


Performance Impact – Synthetic Traffic

• < 9% latency impact, even at maximum load
• Virtually no impact on achieved bandwidth


Performance Impact – PARSEC/STREAM

• Apps have very low BW requirements
• Scaled down system, similar trends


Key Contributions

• Plug and play
  – Everything is interchangeable and interoperable
  – Only interface-die support required (communicate ML)
• Better support for heterogeneous systems
  – Easier DRAM-NVM data movement on the same channel
• More innovation in the memory system
  – Without processor-side support constraints
• Fewer commands between processor and memory
  – Energy, performance advantages


PART 4 – Memory Reliability


Key Bottleneck

• Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – DRAM industry standards define minimum access granularity from each chip
  – Massive overfetch of data at multiple levels
    – Wastes energy
    – Wastes bandwidth
    – Occupies ranks/banks for longer, hurting performance
• x4 device width restriction
  – Fewer ranks for given DIMM real estate
  – x8/x16/x32 more power efficient per capacity
• Reliability level: 1 failed chip out of 36
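The access-granularity problem can be made concrete with some arithmetic. This sketch assumes a burst length of 8 (as in DDR3), which the slides do not state, and the function name `access_bytes` is hypothetical. The 36 x4 chips come from the slide; the 9-chip x8 rank is the configuration discussed for LOT-ECC.

```python
def access_bytes(chips: int, device_width_bits: int, burst_length: int = 8) -> int:
    """Bytes moved by one access: every chip drives its full width on each beat."""
    return chips * device_width_bits * burst_length // 8


# 36 x4 chips: 144 bits per beat (data plus ECC). Burst length 8 is an assumption.
wide = access_bytes(36, 4)
# A single rank of 9 x8 chips: 72 bits per beat.
narrow = access_bytes(9, 8)

print(wide, narrow)
```

Under these assumptions the 36-chip access moves twice as many bytes as the 9-chip rank, more than a single 64-byte cache line needs.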


A new approach: LOT-ECC

• Operate on a single rank of memory: 9 chips
  – and support failure of 1 chip per rank (9 chips)
• Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksums)
  – Tier 2: Global Error Correction (parity)
  – T3 & T4 to handle specific failure cases
• Error correction data stored in data memory
• Data mapping handled by memory controller with firmware support
  – Transparent to OS, caches, etc.
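The first two tiers can be illustrated with a toy model: a per-chip checksum locates the failed chip, and an XOR parity across chips rebuilds its contents. This is a simplified sketch under loud assumptions, not the LOT-ECC layout: the line is split into eight equal byte segments, the checksum is a plain byte sum modulo 256 (which can miss some error patterns), and all function names are hypothetical. Tiers 3 and 4 are not modeled.

```python
from functools import reduce


def checksum(segment: bytes) -> int:
    """Placeholder local error detection code: byte sum modulo 256."""
    return sum(segment) & 0xFF


def xor_segments(segments) -> bytes:
    """Bytewise XOR of equally sized segments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*segments))


def encode(line: bytes, n_chips: int = 8):
    """Split a cache line across chips; return segments, LED list, and GEC parity."""
    size = len(line) // n_chips
    segments = [line[i * size:(i + 1) * size] for i in range(n_chips)]
    led = [checksum(s) for s in segments]  # Tier 1: one checksum per chip
    gec = xor_segments(segments)           # Tier 2: parity across chips
    return segments, led, gec


def recover(segments, led, gec) -> bytes:
    """Detect a single failed chip via its checksum and rebuild it from parity."""
    bad = [i for i, s in enumerate(segments) if checksum(s) != led[i]]
    if not bad:
        return b"".join(segments)
    if len(bad) > 1:
        raise ValueError("more than one chip failed; parity cannot recover")
    survivors = [s for i, s in enumerate(segments) if i != bad[0]]
    repaired = list(segments)
    repaired[bad[0]] = xor_segments(survivors + [gec])
    return b"".join(repaired)


line = bytes(range(64))
segments, led, gec = encode(line)
segments[3] = bytes(8)  # simulate chip 3 returning all zeros
print(recover(segments, led, gec) == line)  # True
```

Detection stays local to each chip, so the parity is only needed when a checksum actually fails.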


LOT-ECC Design


The Devil is in the Details

• We’re borrowing one bit from [data + LED] to use in the GEC
  – Put them all in the same DRAM row
• When a cache line is written,
  – Write data, LED, GEC: all “self-contained”
  – No read-before-write
  – Guaranteed row-buffer hit

[Figure: GEC layout across Chip 0, Chip 1, ..., Chip 7, Chip 8. Labels: 7b, 1b, 1b; PA0-6, PA7-13, ..., PA49-55, PA56, PPA; T4 on each chip. Surplus bit borrowed from data + LED.]


Key Benefits

• Energy Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes

• Performance Gains: More rank-level parallelism, reduced access granularity

• Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently

• Flexibility: Works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs

• Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS


Power Results


-55%


Performance Results

Latency Reduction:
  – LOT-ECC x8: 43%
  – + GEC Coalescing: 47%
  – Oracular: 57%


Exploiting features in SSA

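[Figure: DIMM of DRAM devices. Each device stores cache lines with their local checksums, and global parity is spread across devices:]

L0 C | L1 C | L2 C | L3 C | L4 C | L5 C | L6 C | L7 C | P0 C
L9 C | L10 C | L11 C | L12 C | L13 C | L14 C | L15 C | P1 C | L8 C
...
P7 C | L56 C | L57 C | L58 C | L59 C | L60 C | L61 C | L62 C | L63 C

L – Cache Line, C – Local Checksum, P – Global Parity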


Putting it all together


Summary

• Tremendous pressure on the memory system
  – Bandwidth, energy, complexity, reliability
• Prudently apply novel technologies
  – Silicon photonics
  – Low-swing wires
  – 3D-stacking
• Rethink some fundamental design choices
  – Micromanagement by the memory controller
  – Overfetch in the face of diminishing locality
  – Conventional ECC codes


Impact

• Significant static/dynamic energy reduction
  – Memory core, channel, controller, reliability
• Significant performance improvement
  – Bank parallelism, channel bandwidth, reliability
• Significant complexity reduction
  – Memory controller
• Improved reliability


Synergies

• SSA and Photonics
• Photonics and Autonomous memory
• SSA and Reliability
• SSA, Photonics, and LOT-ECC provide additive energy benefits
  – Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
• SSA, Photonics, and LOT-ECC also provide additive performance benefits
  – Each targets one of three major performance bottlenecks: bank contention, off-chip BW, reliability


Research Contributions

• Memory reliability [ISCA 2012]
• Memory access protocol [ISCA 2011]
• Memory channel architecture [ISCA 2011]
• Memory chip microarchitecture [ISCA 2010]

• On-chip networks [HPCA 2010]
• Non-uniform power caches [HiPC 2009]
• 3D stacked cache design [HPCA 2009]


Future Work

• Future project ideas include
  – Memory architectures for graphics/throughput-oriented applications
  – Memory optimizations for handheld devices
    – Tightly integrated software support
    – Managing heterogeneity, reconfigurability
    – Novel memory hierarchies
  – Memory autonomy and virtualization
  – Refresh management in DRAM


Acknowledgements

• Rajeev
• Naveen
• Committee: Al, Norm, Erik, Ken
• Awesome lab-mates
• Karen, Ann, Emily… front office
• Parents & family
• Friends
