33
Rethinking DRAM Design and Organization for Energy- Constrained Multi-Cores Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi* University of Utah and *HP Labs

Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

  • Upload
    noam

  • View
    28

  • Download
    0

Embed Size (px)

DESCRIPTION

Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi * University of Utah and *HP Labs. Why a complete DRAM redesign?. JEDEC SDRAM Standard. - PowerPoint PPT Presentation

Citation preview

Page 1: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Aniruddha N. Udipi,Naveen Muralimanohar*,Niladrish Chatterjee,Rajeev Balasubramonian, Al Davis, Norm Jouppi*

University of Utah and *HP Labs

Page 2: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Why a complete DRAM redesign?

2

Courtesy: http://www.iiasa.ac.at

High Density

Low Cost-per-bit

Energy efficient

June 1994

Time for DRAM’s own “right-hand turn” Rethink design for modern constraints

JEDEC SDRAM Standard

Cost-per-bit over time

Page 3: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

3

Memory Trends

• Energy– Large scale systems attribute 25-40% of total

power to the memory subsystem– Capital acquisition costs = operating costs over 3 years– Energy is a first-order design constraint

• Access patterns– Increasing socket, core, and thread counts– Final memory request stream extremely random– Cannot design for locality

Page 4: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Memory Trends

4

Page 5: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Memory Trends

5

Page 6: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

6

Memory Trends

• Energy– Large scale systems attribute 25-40% of total

power to the memory subsystem– Capital acquisition costs = operating costs over 3 years– Energy is a first-order design constraint

• Access patterns– Increasing socket, core, and thread counts– Final memory request stream extremely random– Cannot design for locality– What is exact overfetch degree?

• DRAM Reliability– Critical apps require chipkill-level reliability– Building fault-tolerance out of unreliable components is expensive– Schroeder et al., SIGMETRICS 2009

Page 7: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

7

Related Work

• Overfetch– Ahn et al. (SC ’09), Ware et al. (ICCD ’06), Sudan et al.

(ASPLOS ’10)

• DRAM Low-power modes– Hur et al. (HPCA ’08), Fan et al. (ISLPED ’01), Pandey

et al. (HPCA ’06) • DRAM Redesign

– Loh (ISCA ’08), Beamer et al. (ISCA ’10)• Chipkill mechanisms

– Yoon and Erez (ASPLOS ’10)

Page 8: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Executive Summary

• Rethink DRAM design for modern constraints– Low-locality, reduced energy consumption, optimize TCO

• Selective Bitline Activation (SBA)– Minimal design changes– Considerable dynamic energy reductions for small latency and

area penalties

• Single Subarray Access (SSA)– Significant changes to memory interface– Large dynamic and static energy savings

• Chipkill-level reliability– Reduced energy and storage overheads for reliability

8

Page 9: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

9

Outline

• DRAM systems overview• Selective Bitline Activation (SBA)• Single Subarray Access (SSA)• Chipkill-level reliability• Conclusion

Page 10: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Basic Organization

10

Memory bus or channel

Rank

DRAMchip ordeviceBank

Array1/8th of therow buffer

One word ofdata output

DIMM

On-chip Memory

Controller

Page 11: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Basic DRAM Operation

11

RAS

CAS

Cache Line

DRAM Chip DRAM Chip DRAM Chip DRAM Chip

Row Buffer

One bank shown in each chip

Page 12: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

12

Outline

• DRAM systems overview• Selective Bitline Activation (SBA)• Single Subarray Access (SSA)• Chipkill-level reliability• Conclusion

Page 13: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Selective Bitline Activation

• Activate only those bitlines corresponding to the requested cache line – reduce dynamic energy

– Some area overhead depending on access granularity – we pick 16 cache lines for 12.5% area overhead

• Requires no changes to the interface and minimal control changes

13

Page 14: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

14

Outline

• DRAM systems overview• Selective Bitline Activation (SBA)• Single Subarray Access (SSA)• Chipkill-level reliability• Conclusion

Page 15: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Key Idea

• Incandescent light bulb• Low purchase cost• High operating cost• Commodity

• Energy-efficient light bulb• Higher purchase cost• Much lower operating cost• Value-addition

15

It’s worth a small increase in capital costs to gain large reductions in operating costs

$3.00 13W

$0.30 60W

And not 10X, just 15-20%!

Page 16: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Wishlist of features

• Eliminate overfetch– Disregard locality

• Increase opportunities for power-down

• Increase parallelism

• Enable efficient reliability mechanisms

16

Page 17: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

SSA Architecture

17

MEMORY CONTROLLER

8 8

ADDR/CMD BUS

64 Bytes

Bank

Subarray

Bitlines

Row buffer

Global Interconnect to I/O

ONE DRAM CHIP

DIMM

8 8 8 8 8 88DATA BUS

Page 18: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

SSA Basics

• Entire DRAM chip divided into small subarrays

• Width of each subarray is exactly one cache line

• Fetch entire cache line from a single subarray in a single DRAM chip – SSA

• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low

• Close page policy and “posted-RAS” similar to SBA

• Data bus to processor essentially split into 8 narrow buses

18

Page 19: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

SSA Operation

19

Address

Cache Line

DRAM ChipSubarray

DRAM ChipSubarray

DRAM ChipSubarray

DRAM ChipSubarraySubarray Subarray Subarray Subarray

Sleep Mode(or other parallelaccesses)

Subarray Subarray Subarray SubarraySubarray Subarray Subarray Subarray

Page 20: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

SSA Impact

• Energy reduction– Dynamic – fewer bitlines activated– Static – smaller activation footprint – more and longer spells

of inactivity – better power down

• Latency impact– Limited pins per cache line – serialization latency– Higher bank-level parallelism – shorter queuing delays

• Area increase– More peripheral circuitry and I/O at finer granularities – area

overhead (< 5%)

20

Page 21: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Methodology

• Simics based simulator– ‘ooo-micro-arch’ and ‘trans-staller’

• FCFS/FR-FCFS scheduling policies• Address mapping and DRAM models from DRAMSim• DRAM data from Micron datasheets• Area/Energy numbers from heavily modified CACTI 6.5• PARSEC/NAS/STREAM benchmarks• 8 single-threaded OOO cores, 32 KB L1, 2 MB L2• 2GHz processor, 400MHz DRAM

21

Page 22: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Dynamic Energy Reduction

22

Moving to close page policy – 73% energy increase on average Compared to open page, 3X reduction with SBA, 6.4X with SSA

Page 23: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Contributors to energy consumption

23

64 cache lines in baseline 16 cache lines in SBA 1 cache line in SSA

Page 24: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Static Energy – Power down modes

• Current DRAM chips already support several low-power modes

• Consider the low-overhead power down mode: 5.5X lower energy, 3 cycle wakeup time

• For a constant 5% latency increase– 17% low-power operation in the baseline– 80% low-power operation in SSA

24

Page 25: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Latency Characteristics

25

• Impact of Open/Close page policy – 17% decrease (10/12) or 28% increase (2/12)

• Posted-RAS adds about 10%• Serialization/Queuing delay balance in SSA - 30% decrease (6/12) or

40% increase (6/12)

Page 26: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Contributors to Latency

26

Page 27: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

27

Outline

• DRAM systems overview• Selective Bitline Activation (SBA)• Single Subarray Access (SSA)• Chipkill-level reliability• Conclusion

Page 28: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

DRAM Reliability

• Many server applications require chipkill-level reliability – failure of an entire DRAM chip

• One example of existing systems– 64-bit word requires 8-bit ECC – Each of these 72 bits must be read out of a different

chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!

– Reading 72 chips - significant overfetch!• Chipkill even more of a concern for SSA since entire cache line comes from a single chip

28

Page 29: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Proposed Solution

Approach similar to RAID-5

29

DIMM

L0 C L1 C L2 C L3 C L4 C L5 C L6 C L7 C P0 C

L9 C L10 C L11C L12 C L13 C L14 C L15 C P1 C L8 C..

C L56 C L57 C L58 C L59 C L60 C L61 C L62 C L63 C

.

...

.

...

.

...

.

...

P7

DRAM DEVICE

L – Cache Line C – Local Checksum P – Global Parity

Page 30: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Chipkill design

• Two-tier error protection

• Tier - 1 protection – self-contained error detection– 8-bit checksum/cache line – 1.625% storage overhead– Every cache line read is now slightly longer

• Tear -2 protection – global error correction– RAID-like striped parity across 8+1 chips– 12.5% storage overhead

• Error-free access (common case)– 1 chip reads– 2 chip writes – leads to some bank contention – 12% IPC degradation

• Erroneous access– 9 chip operation

30

Page 31: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

31

Outline

• DRAM systems overview• Selective Bitline Activation (SBA)• Single Subarray Access (SSA)• Chipkill-level reliability• Conclusion

Page 32: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Key Contributions

• Redesign of DRAM microarchitecture

• Substantial chip access energy savings (up to 6X)

• Overall, performance is a wash

• Minor area impact (12% with SBA, 4.5% with SSA)

• Two-tier chipkill-level reliability with minimal energy and storage overheads

32

Page 33: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Now is the time for new architectures..

• Take into account modern constraints

• Energy far more critical today than before

• Cost-per-bit perhaps less important – optimize TCO– Operating costs over 3 years = capital acquisition costs

• Memory reliability is important for many server applications.

Memory system’s “right-hand-turn” is long overdue

33