151
Scaling the Memory System in the Many-Core Era Onur Mutlu [email protected] July 26, 2013 BSC/UPC

Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Scaling the Memory System in the Many-Core Era

Onur Mutlu [email protected] July 26, 2013

BSC/UPC

Page 2: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Course Materials and Beyond

n  These slides are a shortened version of the Scalable Memory Systems course at ACACES 2013

n  Website for Course Slides, Papers, and Videos q  http://users.ece.cmu.edu/~omutlu/acaces2013-memory.html q  http://users.ece.cmu.edu/~omutlu q  Includes extended lecture notes and readings

n  Overview Reading q  Onur Mutlu,

"Memory Scaling: A Systems Architecture Perspective" Proceedings of the 5th International Memory Workshop (IMW), Monterey, CA, May 2013. Slides (pptx) (pdf)

2

Page 3: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

The Main Memory System

n  Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor

n  Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

3

Processor and caches

Main Memory Storage (SSD/HDD)

Page 4: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Memory System: A Shared Resource View

4

Storage

Page 5: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

State of the Main Memory System n  Recent technology, architecture, and application trends

q  lead to new requirements q  exacerbate old requirements

n  DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements

n  Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging

n  We need to rethink the main memory system q  to fix DRAM issues and enable emerging technologies q  to satisfy all requirements

5

Page 6: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

6

Page 7: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Major Trends Affecting Main Memory (I) n  Need for main memory capacity, bandwidth, QoS increasing

n  Main memory energy/power is a key system design concern

n  DRAM technology scaling is ending

7

Page 8: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Major Trends Affecting Main Memory (II) n  Need for main memory capacity, bandwidth, QoS increasing

q  Multi-core: increasing number of cores/agents q  Data-intensive applications: increasing demand/hunger for data q  Consolidation: cloud computing, GPUs, mobile, heterogeneity

n  Main memory energy/power is a key system design concern

n  DRAM technology scaling is ending

8

Page 9: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Example: The Memory Capacity Gap

n  Memory capacity per core expected to drop by 30% every two years n  Trends worse for memory bandwidth per core!

9

Core count doubling ~ every 2 years DRAM DIMM capacity doubling ~ every 3 years

Page 10: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Major Trends Affecting Main Memory (III) n  Need for main memory capacity, bandwidth, QoS increasing

n  Main memory energy/power is a key system design concern

q  ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]

q  DRAM consumes power even when not used (periodic refresh)

n  DRAM technology scaling is ending

10

Page 11: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Major Trends Affecting Main Memory (IV) n  Need for main memory capacity, bandwidth, QoS increasing

n  Main memory energy/power is a key system design concern

n  DRAM technology scaling is ending

q  ITRS projects DRAM will not scale easily below X nm q  Scaling has provided many benefits:

n  higher capacity (density), lower cost, lower energy

11

Page 12: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

12

Page 13: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

The DRAM Scaling Problem n  DRAM stores charge in a capacitor (charge-based memory)

q  Capacitor must be large enough for reliable sensing q  Access transistor should be large enough for low leakage and high

retention time q  Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

n  DRAM capacity, cost, and energy/power hard to scale

13

Page 14: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Solutions to the DRAM Scaling Problem

n  Two potential solutions q  Tolerate DRAM (by taking a fresh look at it) q  Enable emerging memory technologies to eliminate/minimize

DRAM

n  Do both q  Hybrid memory systems

14

Page 15: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Solution 1: Tolerate DRAM n  Overcome DRAM shortcomings with

q  System-DRAM co-design q  Novel DRAM architectures, interface, functions q  Better waste management (efficient utilization)

n  Key issues to tackle q  Reduce refresh energy q  Improve bandwidth and latency q  Reduce waste q  Enable reliability at low cost

n  Liu, Jaiyen, Veras, Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. n  Kim, Seshadri, Lee+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012. n  Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013. n  Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices” ISCA’13. n  Seshadri+, “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” 2013.

15

Page 16: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Solution 2: Emerging Memory Technologies n  Some emerging resistive memory technologies seem more

scalable than DRAM (and they are non-volatile) n  Example: Phase Change Memory

q  Expected to scale to 9nm (2022 [ITRS]) q  Expected to be denser than DRAM: can store multiple bits/cell

n  But, emerging technologies have shortcomings as well q  Can they be enabled to replace/augment/surpass DRAM?

n  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009, CACM 2010, Top Picks 2010.

n  Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters 2012.

n  Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. n  Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.

16

Page 17: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Hybrid Memory Systems

Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012. Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.

CPU DRAMCtrl

Fast, durable Small,

leaky, volatile, high-cost

Large, non-volatile, low-cost Slow, wears out, high active energy

PCM Ctrl DRAM Phase Change Memory (or Tech. X)

Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Page 18: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

n  Problem: Memory interference is uncontrolled à uncontrollable, unpredictable, vulnerable system

n  Goal: We need to control it à Design a QoS-aware system

n  Solution: Hardware/software cooperative memory QoS q  Hardware designed to provide a configurable fairness substrate

n  Application-aware memory scheduling, partitioning, throttling

q  Software designed to configure the resources to satisfy different QoS goals

n  E.g., fair, programmable memory controllers and on-chip networks provide QoS and predictable performance [2007-2012, Top Picks’09,’11a,’11b,’12]

An Orthogonal Issue: Memory Interference

Page 19: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

19

Page 20: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating DRAM: Example Techniques

n  Retention-Aware DRAM Refresh: Reducing Refresh Impact n  Tiered-Latency DRAM: Reducing DRAM Latency

n  RowClone: In-Memory Page Copy and Initialization

n  Subarray-Level Parallelism: Reducing Bank Conflict Impact

20

Page 21: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

DRAM Refresh n  DRAM capacitor charge leaks over time

n  The memory controller needs to refresh each row periodically to restore charge q  Read and close each row every N ms q  Typical N = 64 ms

n  Downsides of refresh -- Energy consumption: Each refresh consumes energy

-- Performance degradation: DRAM rank/bank unavailable while refreshed

-- QoS/predictability impact: (Long) pause times during refresh -- Refresh rate limits DRAM capacity scaling 21

Page 22: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Refresh Overhead: Performance

22

8%  

46%  

Page 23: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Refresh Overhead: Energy

23

15%  

47%  

Page 24: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Retention Time Profile of DRAM

24

Page 25: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Retention Time Profile of DRAM n  Observation: Only very few rows need to be refreshed at the

worst-case rate

25

Page 26: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR: Eliminating Unnecessary Refreshes n  Observation: Most DRAM rows can be refreshed much less often

without losing data [Kim+, EDL’09]

n  Key idea: Refresh rows containing weak cells more frequently, other rows less frequently

1. Profiling: Profile retention time of all rows 2. Binning: Store rows into bins by retention time in memory controller

Efficient storage with Bloom Filters (only 1.25KB for 32GB memory) 3. Refreshing: Memory controller refreshes rows in different bins at different rates

n  Results: 8-core, 32GB, SPEC, TPC-C, TPC-H q  74.6% refresh reduction @ 1.25KB storage q  ~16%/20% DRAM dynamic/idle power reduction q  ~9% performance improvement q  Energy benefits increase with DRAM capacity

26 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Page 27: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

1. Profiling

27

Page 28: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

2. Binning

n  How to efficiently and scalably store rows into retention time bins?

n  Use Hardware Bloom Filters [Bloom, CACM 1970]

28

Page 29: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Bloom Filter Operation Example

29

Page 30: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Bloom Filter Operation Example

30

Page 31: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Bloom Filter Operation Example

31

Page 32: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Bloom Filter Operation Example

32

Page 33: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Benefits of Bloom Filters as Bins n  False positives: a row may be declared present in the

Bloom filter even if it was never inserted q  Not a problem: Refresh some rows more frequently than

needed

n  No false negatives: rows are never refreshed less frequently than needed (no correctness problems)

n  Scalable: a Bloom filter never overflows (unlike a fixed-size table)

n  Efficient: No need to store info on a per-row basis; simple hardware à 1.25 KB for 2 filters for 32 GB DRAM system

33

Page 34: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

3. Refreshing (RAIDR Refresh Controller)

34

Page 35: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

3. Refreshing (RAIDR Refresh Controller)

35

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Page 36: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Going Forward

n  How to find out and expose weak memory cells/rows q  Retention time profiling q  Early analysis of modern DRAM chips:

n  Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms”, ISCA 2013.

n  Tolerating cell-to-cell interference at the system level q  Flash and DRAM

36

Page 37: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on RAIDR and Refresh

n  Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu, "RAIDR: Retention-Aware Intelligent DRAM Refresh" Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pdf)

n  Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu, "An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms" Proceedings of the 40th International Symposium on Computer Architecture (ISCA), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)

37

Page 38: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating DRAM: Example Techniques

n  Retention-Aware DRAM Refresh: Reducing Refresh Impact n  Tiered-Latency DRAM: Reducing DRAM Latency

n  RowClone: In-Memory Page Copy and Initialization

n  Subarray-Level Parallelism: Reducing Bank Conflict Impact

38

Page 39: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

39  

DRAM  Latency-­‐Capacity  Trend  

0  

20  

40  

60  

80  

100  

0.0  

0.5  

1.0  

1.5  

2.0  

2.5  

2000   2003   2006   2008   2011  

Latency  (ns)  

Capa

city  (G

b)  

Year  

Capacity   Latency  (tRC)  

16X  

-­‐20%  

DRAM  latency  con.nues  to  be  a  cri.cal  bo4leneck  

Page 40: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

40  

     What  Causes  the  Long  Latency?  DRAM  Chip  

channel  

I/O  

channel  

I/O  

cell  array  cell  array  

banks  subarray  

subarray  

row  decod

er  

sense  amplifier  

capacitor  

access  transistor  

wordline  

bitline

 

cell  

Page 41: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

41  

DRAM  Chip  

channel  

I/O  

channel  

I/O  

cell  array  cell  array  

banks  subarray  

subarray        What  Causes  the  Long  Latency?  

DRAM  Latency  =  Subarray  Latency  +  I/O  Latency  DRAM  Latency  =  Subarray  Latency  +  I/O  Latency  

Dominant  

Suba

rray  

I/O  

row  add

r.  

row  decoder  

sense  am

plifier  

mux  column  addr.  

Page 42: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

42  

     Why  is  the  Subarray  So  Slow?  Subarray  

row  decod

er  

sense  amplifier  

capacitor  

access  transistor  

wordline  

bitline

 

Cell  

large  sense  amplifier  

bitline

:  512  cells  cell  

•  Long  bitline  – AmorQzes  sense  amplifier  cost  à  Small  area  – Large  bitline  capacitance  à  High  latency  &  power  

sense  am

plifier  

row  decod

er  

Page 43: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

43  

     Trade-­‐Off:  Area  (Die  Size)  vs.  Latency  

Faster  

Smaller  

Short  Bitline    

Long  Bitline    

Trade-­‐Off:  Area  vs.  Latency  

Page 44: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

44  

     Trade-­‐Off:  Area  (Die  Size)  vs.  Latency  

0  

1  

2  

3  

4  

0   10   20   30   40   50   60   70  

Normalized

 DRA

M  Area  

Latency  (ns)  

64  

32  

128  256   512  cells/bitline  

Commodity  DRAM  

Long  Bitline  

Cheape

r  

Faster  

Fancy  DRAM  Short  Bitline  

Page 45: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

45  

Short  Bitline  

Low  Latency    

     ApproximaQng  the  Best  of  Both  Worlds  Long  Bitline  

Small  Area    

Long  Bitline  

Low  Latency    

Short  Bitline  Our  Proposal  Small  Area    

Short  Bitline  è  Fast  Need  

IsolaJon  Add  IsolaJon  Transistors  

High  Latency  

Large  Area    

Page 46: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

46  

     ApproximaQng  the  Best  of  Both  Worlds  

Low  Latency    

Our  Proposal  Small  Area    

Long  Bitline  Small  Area    

Long  Bitline  

High  Latency  

Short  Bitline  

Low  Latency    

Short  Bitline  Large  Area    

Tiered-­‐Latency  DRAM  

Low  Latency  

Small  area  using  long  bitline  

Page 47: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

47  

     Tiered-­‐Latency  DRAM  

Near  Segment  

Far  Segment  

IsolaJon  Transistor  

•  Divide  a  bitline  into  two  segments  with  an  isolaQon  transistor  

Sense  Amplifier  

Page 48: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

48  

Far  Segment  Far  Segment  

     Near  Segment  Access  

Near  Segment  

IsolaJon  Transistor  

•  Turn  off  the  isolaQon  transistor  

IsolaJon  Transistor  (off)  

Sense  Amplifier  

Reduced  bitline  capacitance            è  Low  latency  &  low  power  

Reduced  bitline  length  

Page 49: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

49  

Near  Segment  Near  Segment  

     Far  Segment  Access  •  Turn  on  the  isolaQon  transistor  

Far  Segment  

IsolaJon  Transistor  IsolaJon  Transistor  (on)  

Sense  Amplifier  

Large  bitline  capacitance  AddiQonal  resistance  of  isolaQon  transistor  

Long  bitline  length  

         è  High  latency  &  high  power  

Page 50: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

50  

0%  

50%  

100%  

150%  

0%  

50%  

100%  

150%  

 Commodity  DRAM  vs.  TL-­‐DRAM    Latency  

Power  

–56%  

+23%  

–51%  

+49%  •  DRAM  Latency  (tRC)  •  DRAM  Power  

•  DRAM  Area  Overhead  ~3%:  mainly  due  to  the  isolaIon  transistors  

TL-­‐DRAM  Commodity  

DRAM  Near              Far   Commodity  

DRAM  Near              Far  TL-­‐DRAM  

 (52.5ns)  

Page 51: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

51  

     Trade-­‐Off:  Area  (Die-­‐Area)  vs.  Latency  

0  

1  

2  

3  

4  

0   10   20   30   40   50   60   70  

Normalized

 DRA

M  Area  

Latency  (ns)  

64  

32  

128  256        512  cells/bitline    

       

Cheape

r  

Faster  

Near  Segment   Far  Segment  

Page 52: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

52  

     Leveraging  Tiered-­‐Latency  DRAM  •  TL-­‐DRAM  is  a  substrate  that  can  be  leveraged  by  the  hardware  and/or  soOware  

•  Many  potenIal  uses  1. Use  near  segment  as  hardware-­‐managed  inclusive  cache  to  far  segment  

2. Use  near  segment  as  hardware-­‐managed  exclusive  cache  to  far  segment  

3. Profile-­‐based  page  mapping  by  operaIng  system  4. Simply  replace  DRAM  with  TL-­‐DRAM    

Page 53: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

53  

0%  

20%  

40%  

60%  

80%  

100%  

120%  

1  (1-­‐ch)   2  (2-­‐ch)   4  (4-­‐ch)  0%  

20%  

40%  

60%  

80%  

100%  

120%  

1  (1-­‐ch)   2  (2-­‐ch)   4  (4-­‐ch)  

     Performance  &  Power  ConsumpQon      11.5%  

 

Normalized

 Perform

ance  

Core-­‐Count  (Channel)  Normalized

 Pow

er  Core-­‐Count  (Channel)  

10.7%    

12.4%     –23%  

 –24%    

–26%    

Using  near  segment  as  a  cache  improves  performance  and  reduces  power  consumpJon  

Page 54: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on TL-DRAM

n  Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

54

Page 55: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating DRAM: Example Techniques

n  Retention-Aware DRAM Refresh: Reducing Refresh Impact n  Tiered-Latency DRAM: Reducing DRAM Latency

n  RowClone: In-Memory Page Copy and Initialization

n  Subarray-Level Parallelism: Reducing Bank Conflict Impact

55

Page 56: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Today’s  Memory:  Bulk  Data  Copy  

Memory      

MC L3 L2 L1 CPU

1)  High  latency  

2)  High  bandwidth  uIlizaIon  

3)  Cache  polluIon  

4)  Unwanted  data  movement  

56  

Page 57: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Future:  RowClone  (In-­‐Memory  Copy)  

Memory      

MC L3 L2 L1 CPU

1)  Low  latency  

2)  Low  bandwidth  uIlizaIon  

3)  No  cache  polluIon  

4)  No  unwanted  data  movement  

57  

Page 58: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

DRAM operation (load one byte)

Row Buffer (4 Kbits)

Memory Bus

Data pins (8 bits)

DRAM array

4 Kbits

1. Activate row

2. Transfer row

3. Transfer byte onto bus

Page 59: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone: in-DRAM Row Copy (and Initialization)

Row Buffer (4 Kbits)

Memory Bus

Data pins (8 bits)

DRAM array

4 Kbits

1. Activate row A

2. Transfer row

3. Activate row B

4. Transfer row

Page 60: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone:  Key  Idea  •  DRAM  banks  contain  

1.  MuIple  rows  of  DRAM  cells  –  row  =  8KB  2.  A  row  buffer  shared  by  the  DRAM  rows  

•  Large  scale  copy  1.  Copy  data  from  source  row  to  row  buffer  2.  Copy  data  from  row  buffer  to  desInaIon  row    Can  be  accomplished  by  two  consecuQve  ACTIVATEs  (if  source  and  desQnaQon  rows  are  in  the  same  subarray)  

60  

Page 61: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone:  Intra-­‐subarray  Copy  

0 1 0 0 1 1 0 0 0 1 1 0

1 1 0 1 0 1 1 1 0 0 1 1

AcIvate  (src)   DeacIvate    (our  proposal)   AcIvate  (dst)  

0 1 0 0 1 1 0 0 0 1 1 0

? ? ? ? ? ? ? ? ? ? ? ? 0 1 0 0 1 1 0 0 0 1 1 0

Sense  Amplifiers  (row  buffer)  

src  

dst  

61  

Page 62: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone:  Inter-­‐bank  Copy  

I/O  Bus  Transfer  

(our  proposal)  

src  

dst  

Read   Write  

62  

Page 63: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone:  Inter-­‐subarray  Copy  

I/O  Bus  1.  Transfer  (src  to  temp)  

src  

dst  

temp  

2.  Transfer  (temp  to  dst)  63  

Page 64: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Fast  Row  IniIalizaIon  

0 0 0 0 0 0 0 0 0 0 0 0

Fix  a  row  at  Zero  (0.5%  loss  in  capacity)  

64  

Page 65: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RowClone:  Latency  and  Energy  Savings  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

Latency   Energy  

Normalized

 Savings  

Baseline   Intra-­‐Subarray  Inter-­‐Bank   Inter-­‐Subarray  

11.5x   74x  

65  Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” CMU Tech Report 2013.

Page 66: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Goal: Ultra-efficient heterogeneous architectures

CPU core

CPU core

CPU core

CPU core

mini-CPU core

video core

GPU (throughput)

core

GPU (throughput)

core

GPU (throughput)

core

GPU (throughput)

core

LLC

Memory Controller Specialized

compute-capability in memory

Memory imaging core

Memory Bus

Slide credit: Prof. Kayvon Fatahalian, CMU

Page 67: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Enabling Ultra-efficient (Visual) Search

▪  What is the right partitioning of computation capability? ▪  What is the right low-cost memory substrate? ▪  What memory technologies are the best enablers? ▪  How do we rethink/ease (visual) search algorithms/applications?

Cache

Processor Core

Memory Bus

Main Memory

Database (of images)

Query vector

Results

Picture credit: Prof. Kayvon Fatahalian, CMU

Page 68: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on RowClone

n  Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data" CMU Computer Science Technical Report, CMU-CS-13-108, Carnegie Mellon University, April 2013.

68

Page 69: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating DRAM: Example Techniques

n  Retention-Aware DRAM Refresh: Reducing Refresh Impact n  Tiered-Latency DRAM: Reducing DRAM Latency

n  RowClone: In-Memory Page Copy and Initialization

n  Subarray-Level Parallelism: Reducing Bank Conflict Impact

69

Page 70: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Reducing DRAM Bank Conflicts n  Problem: Bank conflicts are costly for performance and energy

q  serialized requests, wasted energy (thrashing of row buffer, busy wait)

n  Goal: Reduce bank conflicts without adding more banks (low cost) n  Key idea: Exploit the internal subarray structure of a DRAM bank to

parallelize bank conflicts to different subarrays q  Slightly modify DRAM bank to reduce subarray-level hardware sharing

n  Results on Server, Stream/Random, SPEC q  19% reduction in dynamic DRAM energy q  13% improvement in row hit rate q  17% performance improvement q  0.15% DRAM area overhead

70 Kim, Seshadri+ “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012. 0.0

0.2

0.4

0.6

0.8

1.0

1.2

Nor

mal

ized

D

ynam

ic E

nerg

y

Baseline MASA

-19%

0%

20%

40%

60%

80%

100%

Row

-Buf

fer

Hit

-Rat

e

Baseline MASA

+13

%

Page 71: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on SALP

n  Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu, "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM" Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pptx)

71

Page 72: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating DRAM: Summary n  Major problems with DRAM scaling and design: high refresh

rate, high latency, low parallelism, bulk data movement

n  Four new DRAM designs q  RAIDR: Reduces refresh impact q  TL-DRAM: Reduces DRAM latency at low cost q  SALP: Improves DRAM parallelism q  RowClone: Reduces energy and performance impact of bulk data copy

n  All four designs q  Improve both performance and energy consumption q  Are low cost (low DRAM area overhead) q  Enable new degrees of freedom to software & controllers

n  Rethinking DRAM interface and design essential for scaling q  Co-design DRAM with the rest of the system

72

Page 73: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Further Reading: Data Retention and Power

n  Characterization of Commodity DRAM Chips q  Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu,

"An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms" Proceedings of the 40th International Symposium on Computer Architecture (ISCA), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)

n  Voltage and Frequency Scaling in DRAM q  Howard David, Chris Fallin, Eugene Gorbatov, Ulf R. Hanebutte, and Onur

Mutlu, "Memory Power Management via Dynamic Voltage/Frequency Scaling" Proceedings of the 8th International Conference on Autonomic Computing (ICAC), Karlsruhe, Germany, June 2011. Slides (pptx) (pdf)

73

Page 74: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

74

Page 75: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Solution 2: Emerging Memory Technologies n  Some emerging resistive memory technologies seem more

scalable than DRAM (and they are non-volatile)

n  Example: Phase Change Memory q  Data stored by changing phase of material q  Data read by detecting material’s resistance q  Expected to scale to 9nm (2022 [ITRS]) q  Prototyped at 20nm (Raoux+, IBM JRD 2008) q  Expected to be denser than DRAM: can store multiple bits/cell

n  But, emerging technologies have (many) shortcomings q  Can they be enabled to replace/augment/surpass DRAM?

75

Page 76: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Phase Change Memory: Pros and Cons n  Pros over DRAM

q  Better technology scaling (capacity and cost) q  Non volatility q  Low idle power (no refresh)

n  Cons q  Higher latencies: ~4-15x DRAM (especially write) q  Higher active energy: ~2-50x DRAM (especially write) q  Lower endurance (a cell dies after ~108 writes)

n  Challenges in enabling PCM as DRAM replacement/helper: q  Mitigate PCM shortcomings q  Find the right way to place PCM in the system

76

Page 77: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

PCM-based Main Memory (I) n  How should PCM-based (main) memory be organized?

n  Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09]: q  How to partition/migrate data between PCM and DRAM

77

Page 78: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

PCM-based Main Memory (II) n  How should PCM-based (main) memory be organized?

n  Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:

q  How to redesign entire hierarchy (and cores) to overcome PCM shortcomings

78

Page 79: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

An Initial Study: Replace DRAM with PCM n  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change

Memory as a Scalable DRAM Alternative,” ISCA 2009. q  Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC) q  Derived “average” PCM parameters for F=90nm

79

Page 80: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Results: Naïve Replacement of DRAM with PCM n  Replace DRAM with PCM in a 4-core, 4MB L2 system n  PCM organized the same as DRAM: row buffers, banks, peripherals n  1.6x delay, 2.2x energy, 500-hour average lifetime

n  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a

Scalable DRAM Alternative,” ISCA 2009. 80

Page 81: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Architecting PCM to Mitigate Shortcomings n  Idea 1: Use multiple narrow row buffers in each PCM chip

à Reduces array reads/writes à better endurance, latency, energy

n  Idea 2: Write into array at cache block or word granularity

à Reduces unnecessary wear

81

DRAM PCM

Page 82: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Results: Architected PCM as Main Memory n  1.2x delay, 1.0x energy, 5.6-year average lifetime n  Scaling improves energy, endurance, density

n  Caveat 1: Worst-case lifetime is much shorter (no guarantees) n  Caveat 2: Intensive applications see large performance and energy hits n  Caveat 3: Optimistic PCM parameters?

82

Page 83: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Hybrid Memory Systems

Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012. Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.

CPU DRAMCtrl

Fast, durable Small,

leaky, volatile, high-cost

Large, non-volatile, low-cost Slow, wears out, high active energy

PCM Ctrl DRAM Phase Change Memory (or Tech. X)

Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Page 84: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

One Option: DRAM as a Cache for PCM n  PCM is main memory; DRAM caches memory rows/blocks

q  Benefits: Reduced latency on DRAM cache hit; write filtering

n  Memory controller hardware manages the DRAM cache q  Benefit: Eliminates system software overhead

n  Three issues: q  What data should be placed in DRAM versus kept in PCM? q  What is the granularity of data movement? q  How to design a low-cost hardware-managed DRAM cache?

n  Two solutions: q  Locality-aware data placement [Yoon+ , ICCD 2012]

q  Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]

84

Page 85: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

DRAM vs. PCM: An Observation n  Row buffers are the same in DRAM and PCM n  Row buffer hit latency same in DRAM and PCM n  Row buffer miss latency small in DRAM, large in PCM

n  Accessing the row buffer in PCM is fast n  What incurs high latency is the PCM array access à avoid this

85

CPU DRAMCtrl

PCM Ctrl

Bank Bank Bank Bank

Row  buffer  DRAM Cache PCM Main Memory

N ns row hit Fast row miss

N ns row hit Slow row miss

Page 86: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Row-Locality-Aware Data Placement n  Idea: Cache in DRAM only those rows that

q  Frequently cause row buffer conflicts à because row-conflict latency is smaller in DRAM

q  Are reused many times à to reduce cache pollution and bandwidth waste

n  Simplified rule of thumb: q  Streaming accesses: Better to place in PCM q  Other accesses (with some reuse): Better to place in DRAM

n  Yoon et al., “Row Buffer Locality-Aware Data Placement in Hybrid Memories,” ICCD 2012.

86

Page 87: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Row-Locality-Aware Data Placement: Results

87

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Server Cloud Avg

Nor

mal

ized

Wei

ghte

d Sp

eedu

p

Workload

FREQ FREQ-Dyn RBLA RBLA-Dyn

10% 14% 17%

Memory  energy-­‐efficiency  and  fairness  also  improve  correspondingly  

Page 88: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

0 0.2 0.4 0.6 0.8

1 1.2 1.4 1.6 1.8

2

Weighted Speedup Max. Slowdown Perf. per Watt Normalized Metric

16GB PCM RBLA-Dyn 16GB DRAM

0 0.2 0.4 0.6 0.8

1 1.2 1.4 1.6 1.8

2

Nor

mal

ized

Wei

ghte

d Sp

eedu

p

0

0.2

0.4

0.6

0.8

1

1.2

Nor

mal

ized

Max

. Slo

wdo

wn

Hybrid vs. All-PCM/DRAM

88  

31%  beher  performance  than  all  PCM,    within  29%  of  all  DRAM  performance  

31%  

29%  

Page 89: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

The Problem with Large DRAM Caches n  A large DRAM cache requires a large metadata (tag +

block-based information) store n  How do we design an efficient DRAM cache?

89

DRAM   PCM  

CPU

(small, fast cache) (high capacity)

Mem  Ctlr  

Mem  Ctlr  

LOAD  X  

Access X

Metadata:  X  à  DRAM  

X  

Page 90: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Idea 1: Store Tags in Main Memory n  Store tags in the same row as data in DRAM

q  Data and metadata can be accessed together

n  Benefit: No on-chip tag storage overhead n  Downsides:

q  Cache hit determined only after a DRAM access q  Cache hit requires two DRAM accesses

90

Cache  block  2  Cache  block  0   Cache  block  1  DRAM row

Tag0   Tag1   Tag2  

Page 91: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Idea 2: Cache Tags in On-Chip SRAM n  Recall Idea 1: Store all metadata in DRAM

q  To reduce metadata storage overhead

n  Idea 2: Cache in on-chip SRAM frequently-accessed metadata q  Cache only a small amount to keep SRAM size small

91

Page 92: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Idea 3: Dynamic Data Transfer Granularity n  Some applications benefit from caching more data

q  They have good spatial locality

n  Others do not q  Large granularity wastes bandwidth and reduces cache

utilization

n  Idea 3: Simple dynamic caching granularity policy q  Cost-benefit analysis to determine best DRAM cache block size q  Group main memory into sets of rows q  Some row sets follow a fixed caching granularity q  The rest of main memory follows the best granularity

n  Cost–benefit analysis: access latency versus number of cachings n  Performed every quantum

92

Page 93: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

0  

0.1  

0.2  

0.3  

0.4  

0.5  

0.6  

0.7  

0.8  

0.9  

1  

SRAM   Region   TIM   TIMBER   TIMBER-­‐Dyn  

Normalized

 Weighted  Speedu

p  

93  

TIMBER  Performance  

-­‐6%  

Meza,  Chang,  Yoon,  Mutlu,  Ranganathan,  “Enabling  Efficient  and  Scalable  Hybrid  Memories,”  IEEE  Comp.  Arch.  Lelers,  2012.  

Page 94: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

SRAM   Region   TIM   TIMBER   TIMBER-­‐Dyn  

Normalized

 Perform

ance  per  W

ah  

(for  M

emory  System

)  

94  

TIMBER  Energy  Efficiency  18%  

Meza,  Chang,  Yoon,  Mutlu,  Ranganathan,  “Enabling  Efficient  and  Scalable  Hybrid  Memories,”  IEEE  Comp.  Arch.  Lelers,  2012.  

Page 95: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on Hybrid Memories n  Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy

Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management" IEEE Computer Architecture Letters (CAL), February 2012.

n  HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories" Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf)

n  Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)

95

Page 96: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Further: Overview Papers on Two Topics n  Merging of Memory and Storage

q  Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)

n  Flash Memory Scaling

q  Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory" Intel Technology Journal (ITJ) Special Issue on Memory Resiliency, Vol. 17, No. 1, May 2013.

96

Page 97: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

97

Page 98: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Trend: Many Cores on Chip n  Simpler and lower power than a single large core n  Large scale parallelism on chip

98

IBM  Cell  BE  8+1  cores  

Intel  Core  i7  8  cores  

Tilera  TILE  Gx  100  cores,  networked  

IBM  POWER7  8  cores  

Intel  SCC  48  cores,  networked  

Nvidia  Fermi  448  “cores”  

AMD  Barcelona  4  cores  

Sun  Niagara  II  8  cores  

Page 99: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Many Cores on Chip

n  What we want: q  N times the system performance with N times the cores

n  What do we get today?

99

Page 100: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

(Un)expected Slowdowns

Memory Performance Hog Low priority

High priority

(Core 0) (Core 1)

Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security 2007.

Attacker (Core 1)

Movie player (Core 2)

100

Page 101: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Why? Uncontrolled Memory Interference

CORE 1 CORE 2

L2 CACHE

L2 CACHE

DRAM MEMORY CONTROLLER

DRAM Bank 0

DRAM Bank 1

DRAM Bank 2

Shared DRAM Memory System

Multi-Core Chip

unfairness INTERCONNECT

attacker movie player

DRAM Bank 3

101

Page 102: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

// initialize large arrays A, B for (j=0; j<N; j++) { index = rand(); A[index] = B[index]; … }

102

A Memory Performance Hog

STREAM

-  Sequential memory access -  Very high row buffer locality (96% hit rate) -  Memory intensive

RANDOM

-  Random memory access -  Very low row buffer locality (3% hit rate) -  Similarly memory intensive

// initialize large arrays A, B for (j=0; j<N; j++) { index = j*linesize; A[index] = B[index]; … }

streaming random

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 103: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

103

What Does the Memory Hog Do?

Row Buffer

Row

dec

oder

Column mux

Data

Row 0

T0: Row 0

Row 0

T1: Row 16 T0: Row 0 T1: Row 111

T0: Row 0 T0: Row 0 T1: Row 5

T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0

Memory Request Buffer

T0: STREAM T1: RANDOM

Row size: 8KB, cache block size: 64B 128 (8KB/64B) requests of T0 serviced before T1

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 104: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Effect of the Memory Performance Hog

0

0.5

1

1.5

2

2.5

3

STREAM RANDOM

104

1.18X slowdown

2.82X slowdown

Results on Intel Pentium D running Windows XP (Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Slow

dow

n

0

0.5

1

1.5

2

2.5

3

STREAM gcc 0

0.5

1

1.5

2

2.5

3

STREAM Virtual PC

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 105: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Greater Problem with More Cores

n  Vulnerable to denial of service (DoS) [Usenix Security’07]

n  Unable to enforce priorities or SLAs [MICRO’07,’10,’11, ISCA’08’11’12, ASPLOS’10]

n  Low system performance [IEEE Micro Top Picks ’09,’11a,’11b,’12]

Uncontrollable, unpredictable system

105

Page 106: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Distributed DoS in Networked Multi-Core Systems

106

Attackers (Cores 1-8)

Stock option pricing application (Cores 9-64)

Cores connected via packet-switched routers on chip

~5000X slowdown

Grot, Hestness, Keckler, Mutlu, “Preemptive virtual clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,“ MICRO 2009.

Page 107: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

n  Problem: Memory interference is uncontrolled à uncontrollable, unpredictable, vulnerable system

n  Goal: We need to control it à Design a QoS-aware system

n  Solution: Hardware/software cooperative memory QoS q  Hardware designed to provide a configurable fairness substrate

n  Application-aware memory scheduling, partitioning, throttling

q  Software designed to configure the resources to satisfy different QoS goals

q  E.g., fair, programmable memory controllers and on-chip networks provide QoS and predictable performance

[2007-2012, Top Picks’09,’11a,’11b,’12]

Solution: QoS-Aware, Predictable Memory

Page 108: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]

[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12][Subramanian+, HPCA’13]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,

ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]

q  QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]

q  QoS-aware thread scheduling to cores [Das+ HPCA’13]

108

Page 109: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

n  Memory Channel Partitioning q  Idea: System software maps badly-interfering applications’ pages

to different channels [Muralidhara+, MICRO’11]

n  Separate data of low/high intensity and low/high row-locality applications n  Especially effective in reducing interference of threads with “medium” and

“heavy” memory intensity q  11% higher performance over existing systems (200 workloads)

A Mechanism to Reduce Memory Interference

109

Core 0 App A

Core 1 App B

Channel 0

Bank 1

Channel 1

Bank 0

Bank 1

Bank 0

Conventional Page Mapping

Time Units

1 2 3 4 5

Channel Partitioning

Core 0 App A

Core 1 App B

Channel 0

Bank 1

Bank 0

Bank 1

Bank 0

Time Units

1 2 3 4 5

Channel 1

Page 110: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

More on Memory Channel Partitioning n  Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut

Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

110

Page 111: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]

[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12][Subramanian+, HPCA’13]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,

ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]

q  QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]

q  QoS-aware thread scheduling to cores [Das+ HPCA’13]

111

Page 112: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

QoS-Aware Memory Scheduling

n  How to schedule requests to provide q  High system performance q  High fairness to applications q  Configurability to system software

n  Memory controller needs to be aware of threads

112

Memory  Controller  

Core   Core  

Core   Core  Memory  

Resolves memory contention by scheduling requests

Page 113: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

QoS-Aware Memory Scheduling: Evolution n  Stall-time fair memory scheduling [Mutlu+ MICRO’07]

q  Idea: Estimate and balance thread slowdowns

q  Takeaway: Proportional thread progress improves performance, especially when threads are “heavy” (memory intensive)

n  Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top Picks’09]

q  Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation

q  Takeaway: Preserving within-thread bank-parallelism improves performance; request batching improves fairness

n  ATLAS memory scheduler [Kim+ HPCA’10]

q  Idea: Prioritize threads that have attained the least service from the memory scheduler

q  Takeaway: Prioritizing “light” threads improves performance 113

Page 114: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Take  turns  accessing  memory  

Throughput vs. Fairness

114  

Fairness  biased  approach  

thread  C  

thread  B  

thread  A  

less  memory    intensive  

higher  priority  

PrioriIze  less  memory-­‐intensive  threads  

Throughput  biased  approach  

Good  for  throughput  

starvaJon  è  unfairness  

thread  C   thread  B  thread  A  

Does  not  starve  

not  prioriJzed  è    reduced  throughput  

Single  policy  for  all  threads  is  insufficient  

Page 115: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Achieving the Best of Both Worlds

115  

thread  

thread  

higher  priority  

thread  

thread  

thread    

thread  

thread  

thread  

PrioriQze  memory-­‐non-­‐intensive  threads  

For  Throughput  

Unfairness  caused  by  memory-­‐intensive  being  prioriQzed  over  each  other    

•   Shuffle  thread  ranking  

Memory-­‐intensive  threads  have    different  vulnerability  to  interference  

•   Shuffle  asymmetrically  

For  Fairness  

thread  

thread  

thread  

thread  

Page 116: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Thread Cluster Memory Scheduling [Kim+ MICRO’10]

1.   Group  threads  into  two  clusters  2.   PrioriQze  non-­‐intensive  cluster  3.   Different  policies  for  each  cluster  

116  

thread  

Threads  in  the  system  

thread  

thread  

thread  

thread  

thread  

thread  

Non-­‐intensive    cluster  

Intensive  cluster  

thread  

thread  

thread  

Memory-­‐non-­‐intensive    

Memory-­‐intensive    

PrioriJzed  

higher  priority  

higher  priority  

Throughput  

Fairness  

Page 117: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

TCM: Throughput and Fairness

FRFCFS  

STFM  

PAR-­‐BS  

ATLAS  

TCM  

4  

6  

8  

10  

12  

14  

16  

7.5   8   8.5   9   9.5   10  

Maxim

um  Slowdo

wn  

Weighted  Speedup  

117  

Beler  system  throughput  

Beler  fa

irness  

24  cores,  4  memory  controllers,  96  workloads    

TCM,  a  heterogeneous  scheduling  policy,  provides  best  fairness  and  system  throughput  

Page 118: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

TCM: Fairness-Throughput Tradeoff

118  

2  

4  

6  

8  

10  

12  

12   12.5   13   13.5   14   14.5   15   15.5   16  

Maxim

um  Slowdo

wn  

Weighted  Speedup  

When  configuraQon  parameter  is  varied…  

Adjus.ng    ClusterThreshold  

TCM  allows  robust  fairness-­‐throughput  tradeoff    

STFM  PAR-­‐BS  

ATLAS  

TCM  

Beler  system  throughput  

Beler  fa

irness   FRFCFS  

Page 119: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Memory QoS in a Parallel Application

n  Threads in a multithreaded application are inter-dependent n  Some threads can be on the critical path of execution due

to synchronization; some threads are not n  How do we schedule requests of inter-dependent threads to

maximize multithreaded application performance?

n  Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11]

n  Hardware/software cooperative limiter thread estimation: n  Thread executing the most contended critical section n  Thread that is falling behind the most in a parallel for loop

119

Page 120: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Summary: Memory QoS Approaches and Techniques

n  Approaches: Smart vs. dumb resources q  Smart resources: QoS-aware memory scheduling q  Dumb resources: Source throttling; channel partitioning q  Both approaches are effective in reducing interference q  No single best approach for all workloads

n  Techniques: Request/thread scheduling, source throttling, memory partitioning q  All approaches are effective in reducing interference q  Can be applied at different levels: hardware vs. software q  No single best technique for all workloads

n  Combined approaches and techniques are the most powerful q  Integrated Memory Channel Partitioning and Scheduling [MICRO’11]

120 MCP Micro 2011 Talk

Page 121: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Summary: Memory Interference and QoS

n  QoS-unaware memory à uncontrollable and unpredictable system n  Providing QoS awareness improves performance,

predictability, fairness, and utilization of the memory system

n  Designed new techniques to: q  Minimize memory interference q  Provide predictable performance

n  Many new research ideas needed for integrated techniques and closing the interaction with software

121

Page 122: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Readings on Memory QoS (I) n  Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX

Security 2007. n  Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling,”

MICRO 2007. n  Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA

2008, IEEE Micro 2009. n  Kim et al., “ATLAS: A Scalable and High-Performance Scheduling

Algorithm for Multiple Memory Controllers,” HPCA 2010. n  Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010, IEEE

Micro 2011. n  Muralidhara et al., “Memory Channel Partitioning,” MICRO 2011. n  Ausavarungnirun et al., “Staged Memory Scheduling,” ISCA 2012. n  Subramanian et al., “MISE: Providing Performance Predictability and

Improving Fairness in Shared Main Memory Systems,” HPCA 2013. n  Das et al., “Application-to-Core Mapping Policies to Reduce Memory

System Interference in Multi-Core Systems,” HPCA 2013. 122

Page 123: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Readings on Memory QoS (II) n  Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS 2010, ACM

TOCS 2012. n  Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008, IEEE TC

2011. n  Ebrahimi et al., “Parallel Application Memory Scheduling,” MICRO 2011. n  Ebrahimi et al., “Prefetch-Aware Shared Resource Management for

Multi-Core Systems,” ISCA 2011.

123

Page 124: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  The Memory Interference/QoS Problem and Solutions q  QoS-Aware Memory Systems

n  How Can We Do Better? n  Summary

124

Page 125: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Principles (So Far)

n  Better cooperation between devices/circuits, hardware and the system q  Expose more information about devices and controllers to

upper layers

n  Better-than-worst-case design q  Do not optimize for worst case q  Worst case should not determine the common case

n  Heterogeneity in parameters/design q  Enables a more efficient design (No one size fits all)

125

Page 126: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Other Opportunities with Emerging Technologies

n  Merging of memory and storage q  e.g., a single interface to manage all data [Meza+, WEED’13]

n  New applications q  e.g., ultra-fast checkpoint and restore

n  More robust system design q  e.g., reducing data loss

n  Memory as an accelerator (or computing element) q  e.g., enabling efficient search and filtering

126

Page 127: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Agenda

n  Major Trends Affecting Main Memory n  The DRAM Scaling Problem and Solution Directions

q  Tolerating DRAM: New DRAM Architectures q  Enabling Emerging Technologies: Hybrid Memory Systems

n  How Can We Do Better? n  Summary

127

Page 128: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Summary: Scalable Memory Systems n  Main memory scaling problems are a critical bottleneck for

system performance, efficiency, and usability

n  Solution 1: Tolerate DRAM with novel architectures q  RAIDR: Retention-aware refresh q  TL-DRAM: Tiered-Latency DRAM q  RowClone: Fast Page Copy and Initialization q  SALP: Subarray-Level Parallelism

n  Solution 2: Enable emerging memory technologies q  Replace DRAM with NVM by architecting NVM chips well q  Hybrid memory systems with automatic data management

n  QoS-aware Memory Systems q  Required to reduce and control memory interference

n  Software/hardware/device cooperation essential for effective scaling of main memory

128

Page 129: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Thank you.

129

Page 130: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Scaling the Memory System in the Many-Core Era

Onur Mutlu [email protected] July 26, 2013

BSC/UPC

Page 131: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Backup Slides

131

Page 132: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Backup Slides Agenda

n  RAIDR: Retention-Aware DRAM Refresh n  Building Large DRAM Caches for Hybrid Memories n  Memory QoS and Predictable Performance n  Subarray-Level Parallelism (SALP) in DRAM

132

Page 133: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR: Reducing DRAM Refresh Impact

Page 134: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

Tolerating Temperature Changes

134

Page 135: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR: Baseline Design

135

Refresh control is in DRAM in today’s auto-refresh systems RAIDR can be implemented in either the controller or DRAM

Page 136: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR in Memory Controller: Option 1

136

Overhead of RAIDR in DRAM controller: 1.25 KB Bloom Filters, 3 counters, additional commands issued for per-row refresh (all accounted for in evaluations)

Page 137: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR in DRAM Chip: Option 2

137

Overhead of RAIDR in DRAM chip: Per-chip overhead: 20B Bloom Filters, 1 counter (4 Gbit chip)

Total overhead: 1.25KB Bloom Filters, 64 counters (32 GB DRAM)

Page 138: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR Results n  Baseline:

q  32 GB DDR3 DRAM system (8 cores, 512KB cache/core) q  64ms refresh interval for all rows

n  RAIDR: q  64–128ms retention range: 256 B Bloom filter, 10 hash functions q  128–256ms retention range: 1 KB Bloom filter, 6 hash functions q  Default refresh interval: 256 ms

n  Results on SPEC CPU2006, TPC-C, TPC-H benchmarks q  74.6% refresh reduction q  ~16%/20% DRAM dynamic/idle power reduction q  ~9% performance improvement

138

Page 139: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR Refresh Reduction

139

32 GB DDR3 DRAM system

Page 140: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR: Performance

140

RAIDR performance benefits increase with workload’s memory intensity

Page 141: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

RAIDR: DRAM Energy Efficiency

141

RAIDR energy benefits increase with memory idleness

Page 142: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

DRAM Device Capacity Scaling: Performance

142

RAIDR performance benefits increase with DRAM chip capacity

Page 143: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

DRAM Device Capacity Scaling: Energy

143

RAIDR energy benefits increase with DRAM chip capacity RAIDR slides

Page 144: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Reducing DRAM Bank Conflict Impact

144

Kim, Seshadri, Lee, Liu, Mutlu A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM ISCA 2012.

Page 145: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Problem, Goal, Observations n  Problem: Bank conflicts are costly for performance and energy

q  serialized requests, wasted energy (thrashing of row buffer, busy wait) n  Goal: Reduce bank conflicts without adding more banks (low cost) n  Observation 1: A DRAM bank is divided into subarrays and each

subarray has its own local row buffer

145

Page 146: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Key Ideas

n  Observation 2: Subarrays are mostly independent q  Except when sharing global structures to reduce cost

146

Key Idea of SALP: Minimally reduce sharing of global structures

Reduce the sharing of … Global decoder à Enables almost parallel access to subarrays Global row buffer à Utilizes multiple local row buffers

Page 147: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Reduce Sharing of Global Decoder

147

Local row-buffer

Local row-buffer Global row-buffer

···

Glob

al  Decod

er  

Latch  

Latch  

Latch  

Instead of a global latch, have per-subarray latches

Page 148: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Reduce Sharing of Global Row-Buffer

148

Wir

e

Global bitlines

Global row-buffer

Local row-buffer

Local row-buffer

Switch

Switch

READ READ

DD

DD

Selectively connect local row-buffers to global row-buffer using a Designated single-bit latch

Page 149: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Baseline Bank Organization

149

Local row-buffer

Local row-buffer

Global row-buffer

Glob

al  Decod

er  

Global bitlines

Latch  

Page 150: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Proposed Bank Organization

150

Local row-buffer

Local row-buffer

Global row-buffer

Glob

al  Decod

er  

Latch  

Latch  

D  

D  

Global bitlines

Overhead of SALP in DRAM chip: 0.15% 1. Global latch à per-subarray local latches 2. Designated bit latches and wire to selectively enable a subarray

Page 151: Scaling the Memory System in the Many-Core Erausers.ece.cmu.edu/~omutlu/pub/onur-BSC-lecture2... · Ctrl Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost

SALP: Results n  Wide variety of systems with different #channels, banks,

ranks, subarrays n  Server, streaming, random-access, SPEC workloads

n  Dynamic DRAM energy reduction: 19% q  DRAM row hit rate improvement: 13%

n  System performance improvement: 17% q  Within 3% of ideal (all independent banks)

n  DRAM die area overhead: 0.15% q  vs. 36% overhead of independent banks

151