64
Memory Systems in the Many-Core Era: Some Challenges and Solution Directions Onur Mutlu http://www.ece.cmu.edu/~omutlu June 5, 2011 ISMM/MSPC

Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions

Onur Mutlu http://www.ece.cmu.edu/~omutlu

June 5, 2011 ISMM/MSPC

Page 2: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Modern Memory System: A Shared Resource

2

Page 3: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

The Memory System n  The memory system is a fundamental performance and

power bottleneck in almost all computing systems: server, mobile, embedded, desktop, sensor

n  The memory system must scale (in size, performance, efficiency, cost) to maintain performance and technology scaling

n  Recent technology, architecture, and application trends lead to new requirements from the memory system: q  Scalability (technology and algorithm) q  Fairness and QoS-awareness q  Energy/power efficiency

3

Page 4: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Summary

4

Page 5: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Technology Trends n  DRAM does not scale well beyond N nm [ITRS 2009, 2010]

q  Memory scaling benefits: density, capacity, cost

n  Energy/power already key design limiters

q  Memory hierarchy responsible for a large fraction of power n  IBM servers: ~50% energy spent in off-chip memory hierarchy

[Lefurgy+, IEEE Computer 2003] n  DRAM consumes power when idle and needs periodic refresh

n  More transistors (cores) on chip n  Pin bandwidth not increasing as fast as number of transistors

q  Memory is the major shared resource among cores q  More pressure on the memory hierarchy

5

Page 6: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Application Trends n  Many different threads/applications/virtual-machines (will)

concurrently share the memory system

q  Cloud computing/servers: Many workloads consolidated on-chip to improve efficiency

q  GP-GPU, CPU+GPU, accelerators: Many threads from multiple applications

q  Mobile: Interactive + non-interactive consolidation

n  Different applications with different requirements (SLAs) q  Some applications/threads require performance guarantees q  Modern hierarchies do not distinguish between applications

n  Applications are increasingly data intensive q  More demand for memory capacity and bandwidth

6

Page 7: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Architecture/System Trends n  Sharing of memory hierarchy

n  More cores and components q  More pressure on the memory hierarchy

n  Asymmetric cores: Performance asymmetry, CPU+GPUs, accelerators, … q  Motivated by energy efficiency and Amdahl’s Law

n  Different cores have different performance requirements q  Memory hierarchies do not distinguish between cores

n  Different goals for different systems/users q  System throughput, fairness, per-application performance q  Modern hierarchies are not flexible/configurable

7

Page 8: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Summary: Major Trends Affecting Memory

n  Need for main memory capacity and bandwidth increasing

n  New need for handling inter-application interference; providing fairness, QoS

n  Need for memory system flexibility increasing n  Main memory energy/power is a key system design concern n  DRAM is not scaling well

8

Page 9: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Summary

9

Page 10: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Requirements from an Ideal Memory System

n  Traditional q  High system performance q  Enough capacity q  Low cost

n  New q  Technology scalability q  QoS support and configurability q  Energy (and power, bandwidth) efficiency

10

Page 11: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

n  Traditional q  High system performance: Need to reduce inter-thread interference

q  Enough capacity: Emerging tech. and waste management can help

q  Low cost: Other memory technologies can help

n  New q  Technology scalability

n  Emerging memory technologies (e.g., PCM) can help

q  QoS support and configurability n  Need HW mechanisms to control interference and build QoS policies

q  Energy (and power, bandwidth) efficiency n  One-size-fits-all design wastes energy; emerging tech. can help?

11

Requirements from an Ideal Memory System

Page 12: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Summary

12

Page 13: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

The DRAM Scaling Problem n  DRAM stores charge in a capacitor (charge-based memory)

q  Capacitor must be large enough for reliable sensing q  Access transistor should be large enough for low leakage and high

retention time q  Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

n  DRAM capacity, cost, and energy/power hard to scale

13

Page 14: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Concerns with DRAM as Main Memory

n  Need for main memory capacity and bandwidth increasing q  DRAM capacity hard to scale

n  Main memory energy/power is a key system design concern

q  DRAM consumes high power due to leakage and refresh

n  DRAM technology scaling is becoming difficult

q  DRAM capacity and cost may not continue to scale

14

Page 15: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Possible Solution 1: Tolerate DRAM

n  Overcome DRAM shortcomings with q  System-level solutions q  Changes to DRAM microarchitecture, interface, and functions

15

Page 16: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Possible Solution 2: Emerging Technologies n  Some emerging resistive memory technologies are more

scalable than DRAM (and they are non-volatile)

n  Example: Phase Change Memory q  Data stored by changing phase of special material q  Data read by detecting material’s resistance q  Expected to scale to 9nm (2022 [ITRS]) q  Prototyped at 20nm (Raoux+, IBM JRD 2008) q  Expected to be denser than DRAM: can store multiple bits/cell

n  But, emerging technologies have shortcomings as well q  Can they be enabled to replace/augment/surpass DRAM?

16

Page 17: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Phase Change Memory: Pros and Cons n  Pros over DRAM

q  Better technology scaling (capacity and cost) q  Non volatility q  Low idle power (no refresh)

n  Cons q  Higher latencies: ~4-15x DRAM (especially write) q  Higher active energy: ~2-50x DRAM (especially write) q  Lower endurance (a cell dies after ~108 writes)

n  Challenges in enabling PCM as DRAM replacement/helper: q  Mitigate PCM shortcomings q  Find the right way to place PCM in the system q  Ensure secure and fault-tolerant PCM operation

17

Page 18: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

PCM-based Main Memory (I) n  How should PCM-based (main) memory be organized?

n  Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09]: q  How to partition/migrate data between PCM and DRAM

n  Energy, performance, endurance

q  Is DRAM a cache for PCM or part of main memory? q  How to design the hardware and software

n  Exploit advantages, minimize disadvantages of each technology 18

Page 19: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

PCM-based Main Memory (II) n  How should PCM-based (main) memory be organized?

n  Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:

q  How to redesign entire hierarchy (and cores) to overcome PCM shortcomings n  Energy, performance, endurance

19

Page 20: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

PCM-Based Memory Systems: Research Challenges

n  Partitioning q  Should DRAM be a cache or main memory, or configurable? q  What fraction? How many controllers?

n  Data allocation/movement (energy, performance, lifetime) q  Who manages allocation/movement? q  What are good control algorithms?

n  Latency-critical, heavily modified à DRAM, otherwise PCM? n  Preventing denial/degradation of service

n  Design of cache hierarchy, memory controllers, OS q  Mitigate PCM shortcomings, exploit PCM advantages

n  Design of PCM/DRAM chips and modules q  Rethink the design of PCM/DRAM with new requirements

20

Page 21: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

An Initial Study: Replace DRAM with PCM n  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change

Memory as a Scalable DRAM Alternative,” ISCA 2009. q  Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC) q  Derived “average” PCM parameters for F=90nm

21

Page 22: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Results: Naïve Replacement of DRAM with PCM n  Replace DRAM with PCM in a 4-core, 4MB L2 system n  PCM organized the same as DRAM: row buffers, banks, peripherals n  1.6x delay, 2.2x energy, 500-hour average lifetime

n  Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a

Scalable DRAM Alternative,” ISCA 2009. 22

Page 23: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Architecting PCM to Mitigate Shortcomings n  Idea 1: Use narrow row buffers in each PCM chip

à Reduces write energy, peripheral circuitry

n  Idea 2: Use multiple row buffers in each PCM chip à Reduces array reads/writes à better endurance, latency, energy

n  Idea 3: Write into array at cache block or word granularity

à Reduces unnecessary wear

23

DRAM PCM

Page 24: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Results: Architected PCM as Main Memory n  1.2x delay, 1.0x energy, 5.6-year average lifetime n  Scaling improves energy, endurance, density

n  Caveat 1: Worst-case lifetime is much shorter (no guarantees) n  Caveat 2: Intensive applications see large performance and energy hits n  Caveat 3: Optimistic PCM parameters?

24

Page 25: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

PCM as Main Memory: Research Challenges n  Many research opportunities from

technology layer to algorithms layer

n  Enabling PCM/NVM q  How to maximize performance? q  How to maximize lifetime? q  How to prevent denial of service?

n  Exploiting PCM/NVM q  How to exploit non-volatility? q  How to minimize energy consumption? q  How to minimize cost? q  How to exploit NVM on chip?

25

Microarchitecture

ISA

Programs

Algorithms Problems

Logic

Devices

Runtime System (VM, OS, MM)

User

Page 26: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Summary

26

Page 27: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Memory System is the Major Shared Resource

27

threads’ requests interfere

Page 28: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Inter-Thread/Application Interference

n  Problem: Threads share the memory system, but memory system does not distinguish between threads’ requests

n  Existing memory systems q  Free-for-all, shared based on demand q  Control algorithms thread-unaware and thread-unfair q  Aggressive threads can deny service to others q  Do not try to reduce or control inter-thread interference

28

Page 29: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

29

Uncontrolled Interference: An Example

CORE 1 CORE 2

L2 CACHE

L2 CACHE

DRAM MEMORY CONTROLLER

DRAM Bank 0

DRAM Bank 1

DRAM Bank 2

Shared DRAM Memory System

Multi-Core Chip

unfairness INTERCONNECT

stream random

DRAM Bank 3

Page 30: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

// initialize large arrays A, B for (j=0; j<N; j++) { index = rand(); A[index] = B[index]; … }

30

A Memory Performance Hog

STREAM

-  Sequential memory access -  Very high row buffer locality (96% hit rate) -  Memory intensive

RANDOM

-  Random memory access -  Very low row buffer locality (3% hit rate) -  Similarly memory intensive

// initialize large arrays A, B for (j=0; j<N; j++) { index = j*linesize; A[index] = B[index]; … }

streaming random

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 31: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

31

What Does the Memory Hog Do?

Row Buffer

Row

dec

oder

Column mux

Data

Row 0

T0: Row 0

Row 0

T1: Row 16 T0: Row 0 T1: Row 111

T0: Row 0 T0: Row 0 T1: Row 5

T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0

Memory Request Buffer

T0: STREAM T1: RANDOM

Row size: 8KB, cache block size: 64B 128 (8KB/64B) requests of T0 serviced before T1

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 32: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Effect of the Memory Performance Hog

0

0.5

1

1.5

2

2.5

3

STREAM RANDOM

32

1.18X slowdown

2.82X slowdown

Results on Intel Pentium D running Windows XP (Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Slow

dow

n

0

0.5

1

1.5

2

2.5

3

STREAM gcc 0

0.5

1

1.5

2

2.5

3

STREAM Virtual PC

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 33: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Problems due to Uncontrolled Interference

33

n  Unfair slowdown of different threads [MICRO’07, ISCA’08, ASPLOS’10]

n  Low system performance [MICRO’07, ISCA’08, HPCA’10, MICRO’10]

n  Vulnerability to denial of service [USENIX Security’07]

n  Priority inversion: unable to enforce priorities/SLAs [MICRO’07]

n  Poor performance predictability (no performance isolation)

Cores make very slow progress

Memory performance hog Low priority

High priority Sl

owdo

wn

Main memory is the only shared resource

Page 34: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Problems due to Uncontrolled Interference

34

n  Unfair slowdown of different threads [MICRO’07, ISCA’08, ASPLOS’10]

n  Low system performance [MICRO’07, ISCA’08, HPCA’10, MICRO’10]

n  Vulnerability to denial of service [USENIX Security’07]

n  Priority inversion: unable to enforce priorities/SLAs [MICRO’07]

n  Poor performance predictability (no performance isolation)

Page 35: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

How Do We Solve The Problem?

n  Inter-thread interference is uncontrolled in all memory resources q  Memory controller q  Interconnect q  Caches

n  We need to control it q  i.e., design an interference-aware (QoS-aware) memory system

35

Page 36: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

QoS-Aware Memory Systems: Challenges

n  How do we reduce inter-thread interference? q  Improve system performance and core utilization q  Reduce request serialization and core starvation

n  How do we control inter-thread interference? q  Provide mechanisms to enable system software to enforce

QoS policies q  While providing high system performance

n  How do we make the memory system configurable/flexible? q  Enable flexible mechanisms that can achieve many goals

n  Provide fairness or throughput when needed n  Satisfy performance guarantees when needed

36

Page 37: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix

Security’07] [Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+

ASPLOS’10, ISCA’11] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10]

q  QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR’11] q  QoS-aware thread scheduling to cores

37

Page 38: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Smart Resources: Thread Cluster Memory Scheduling n  Dumb Resources: Fairness via Source Throttling

n  Summary

38

Page 39: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

QoS-Aware Memory Scheduling

n  How to schedule requests to provide q  High system performance q  High fairness to applications q  Configurability to system software

n  Memory controller needs to be aware of threads

39

Memory  Controller  

Core   Core  

Core   Core  Memory  

Resolves memory contention by scheduling requests

Page 40: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

QoS-Aware Memory Scheduling: Evolution n  Stall-time fair memory scheduling [Mutlu+ MICRO’07]

q  Idea: Estimate and balance thread slowdowns

q  Takeaway: Proportional thread progress improves performance, especially when threads are “heavy” (memory intensive)

n  Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top Picks’09]

q  Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation

n  ATLAS memory scheduler [Kim+ HPCA’10]

40

Page 41: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Within-Thread Bank Parallelism  

41  

Bank  0  

Bank  1  

req  

req  req  

req  

memory  service  +meline  

thread  A    

thread  B    

thread  execu+on  +meline  

WAIT  

WAIT  

thread  B    

thread  A    Bank  0  

Bank  1  

req  

req  req  

req  

memory  service  +meline  

thread  execu+on  +meline  

WAIT  

WAIT  

rank  

thread  B    

thread  A    

thread  A    

thread  B    

SAVED  CYCLES  

Key  Idea:  

Page 42: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

QoS-Aware Memory Scheduling: Evolution n  Stall-time fair memory scheduling [Mutlu+ MICRO’07]

q  Idea: Estimate and balance thread slowdowns

q  Takeaway: Proportional thread progress improves performance, especially when threads are “heavy” (memory intensive)

n  Parallelism-aware batch scheduling [Mutlu+ ISCA’08, Top Picks’09]

q  Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation

q  Takeaway: Preserving within-thread bank-parallelism improves performance; request batching improves fairness

n  ATLAS memory scheduler [Kim+ HPCA’10]

q  Idea: Prioritize threads that have attained the least service from the memory scheduler

q  Takeaway: Prioritizing “light” threads improves performance

42

Page 43: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

1  

3  

5  

7  

9  

11  

13  

15  

17  

7   7.5   8   8.5   9   9.5   10  

Maxim

um  Slowdo

wn  

Weighted  Speedup  

FCFS  FRFCFS  STFM  PAR-­‐BS  ATLAS  

No  previous  memory  scheduling  algorithm  provides  both  the  best  fairness  and  system  throughput  

Previous Scheduling Algorithms are Biased

43  

System  throughput  bias  

Fairness  bias  

BeIer  system  throughput  

BeIe

r  fairness  

24  cores,  4  memory  controllers,  96  workloads    

Page 44: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Take  turns  accessing  memory  

Throughput vs. Fairness

44  

Fairness  biased  approach  

thread  C  

thread  B  

thread  A  

less  memory    intensive  

higher  priority  

PrioriMze  less  memory-­‐intensive  threads  

Throughput  biased  approach  

Good  for  throughput  

starva3on  è  unfairness  

thread  C   thread  B  thread  A  

Does  not  starve  

not  priori3zed  è    reduced  throughput  

Single  policy  for  all  threads  is  insufficient  

Page 45: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Achieving the Best of Both Worlds

45  

thread  

thread  

higher  priority  

thread  

thread  

thread    

thread  

thread  

thread  

PrioriDze  memory-­‐non-­‐intensive  threads  

For  Throughput  

Unfairness  caused  by  memory-­‐intensive  being  prioriDzed  over  each  other    

•   Shuffle  thread  ranking  

Memory-­‐intensive  threads  have    different  vulnerability  to  interference  

•   Shuffle  asymmetrically  

For  Fairness  

thread  

thread  

thread  

thread  

Page 46: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Thread Cluster Memory Scheduling [Kim+ MICRO’10]

1.   Group  threads  into  two  clusters  2.   PrioriDze  non-­‐intensive  cluster  3.   Different  policies  for  each  cluster  

46  

thread  

Threads  in  the  system  

thread  

thread  

thread  

thread  

thread  

thread  

Non-­‐intensive    cluster  

Intensive  cluster  

thread  

thread  

thread  

Memory-­‐non-­‐intensive    

Memory-­‐intensive    

Priori3zed  

higher  priority  

higher  priority  

Throughput  

Fairness  

Page 47: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

FRFCFS  

STFM  

PAR-­‐BS  

ATLAS  

TCM  

4  

6  

8  

10  

12  

14  

16  

7.5   8   8.5   9   9.5   10  

Maxim

um  Slowdo

wn  

Weighted  Speedup  

TCM: Throughput and Fairness

47  

BeIer  system  throughput  

BeIe

r  fairness  

24  cores,  4  memory  controllers,  96  workloads    

TCM,  a  heterogeneous  scheduling  policy,  provides  best  fairness  and  system  throughput  

Page 48: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

TCM: Fairness-Throughput Tradeoff

48  

2  

4  

6  

8  

10  

12  

12   12.5   13   13.5   14   14.5   15   15.5   16  

Maxim

um  Slowdo

wn  

Weighted  Speedup  

When  configuraDon  parameter  is  varied…  

Adjus+ng    ClusterThreshold  

TCM  allows  robust  fairness-­‐throughput  tradeoff    

STFM  PAR-­‐BS  

ATLAS  

TCM  

BeIer  system  throughput  

BeIe

r  fairness   FRFCFS  

Page 49: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Smart Resources: Thread Cluster Memory Scheduling n  Dumb Resources: Fairness via Source Throttling

n  Summary

49

Page 50: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix

Security’07] [Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+

ASPLOS’10, ISCA’11] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10]

q  QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR’11] q  QoS-aware thread scheduling to cores

50

Page 51: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Many Shared Resources

Core 0 Core 1 Core 2 Core N

Shared Cache

Memory Controller

DRAM Bank 0

DRAM Bank 1

DRAM Bank 2

... DRAM Bank K

...

Shared Memory Resources

Chip Boundary On-chip Off-chip

51

Page 52: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

The Problem with “Smart Resources”

n  Independent interference control mechanisms in caches, interconnect, and memory can contradict each other

n  Explicitly coordinating mechanisms for different resources requires complex implementation

n  How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

52

Page 53: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

An Alternative Approach: Source Throttling

n  Manage inter-thread interference at the cores, not at the shared resources

n  Dynamically estimate unfairness in the memory system n  Feed back this information into a controller n  Throttle cores’ memory access rates accordingly

q  Whom to throttle and by how much depends on performance target (throughput, fairness, per-thread QoS, etc)

q  E.g., if unfairness > system-software-specified target then throttle down core causing unfairness & throttle up core that was unfairly treated

n  Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10.

53

Page 54: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

54

Runtime Unfairness Evaluation

Dynamic Request Throttling

1- Estimating system unfairness 2- Find app. with the highest slowdown (App-slowest) 3- Find app. causing most interference for App-slowest (App-interfering)

if (Unfairness Estimate >Target) { 1-Throttle down App-interfering (limit injection rate and parallelism) 2-Throttle up App-slowest }

FST Unfairness Estimate

App-slowest App-interfering

| ⎨

⎩ Slowdown

Estimation

Time Interval 1 Interval 2 Interval 3

Runtime Unfairness Evaluation

Dynamic Request Throttling

Fairness via Source Throttling (FST) [ASPLOS’10]

Page 55: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

System Software Support n  Different fairness objectives can be configured by

system software q  Keep maximum slowdown in check

n  Estimated Max Slowdown < Target Max Slowdown

q  Keep slowdown of particular applications in check to achieve a particular performance target n  Estimated Slowdown(i) < Target Slowdown(i)

n  Support for thread priorities q  Weighted Slowdown(i) =

Estimated Slowdown(i) x Weight(i)

55

Page 56: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Source Throttling Results: Takeaways

n  Source throttling alone provides better performance than a combination of “smart” memory scheduling and fair caching q  Decisions made at the memory scheduler and the cache

sometimes contradict each other

n  Neither source throttling alone nor “smart resources” alone provides the best performance

n  Combined approaches are even more powerful q  Source throttling and resource-based interference control

56

Page 57: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix

Security’07] [Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+

ASPLOS’10, ISCA’11] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10]

q  QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR’11] q  QoS-aware thread scheduling to cores

57

Page 58: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Another Way of Reducing Interference n  Memory Channel Partitioning

q  Idea: Map badly-interfering applications’ pages to different channels [Muralidhara+ CMU TR’11]

n  Separate data of low/high intensity and low/high row-locality applications n  Especially effective in reducing interference of threads with “medium” and

“heavy” memory intensity

58

Page 59: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Summary: Memory QoS Approaches and Techniques

n  Approaches: Smart vs. dumb resources q  Smart resources: QoS-aware memory scheduling q  Dumb resources: Source throttling; channel partitioning q  Both approaches are effective in reducing interference q  No single best approach for all workloads

n  Techniques: Request scheduling, source throttling, memory partitioning q  All approaches are effective in reducing interference q  Can be applied at different levels: hardware vs. software q  No single best technique for all workloads

n  Combined approaches and techniques are the most powerful q  Integrated Memory Channel Partitioning and Scheduling

59

Page 60: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Two Related Talks at ISCA

n  How to design QoS-aware memory systems (memory scheduling and source throttling) in the presence of prefetching q  Ebrahimi et al., “Prefetch-Aware Shared Resource Management for

Multi-Core Systems,” ISCA’11. q  Monday afternoon (Session 3B)

n  How to design scalable QoS mechanisms in on-chip interconnects q  Idea: Isolate shared resources in a region, provide QoS support only

within the region, ensure interference-free access to the region q  Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture

for Scalability and Service Guarantees,” ISCA’11. q  Wednesday morning (Session 8B)

60

Page 61: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Agenda

n  Technology, Application, Architecture Trends n  Requirements from the Memory Hierarchy n  Research Challenges and Solution Directions

q  Main Memory Scalability q  QoS support: Inter-thread/application interference

n  Smart Resources: Thread Cluster Memory Scheduling n  Dumb Resources: Fairness via Source Throttling

n  Conclusions

61

Page 62: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Conclusions

n  Technology, application, architecture trends dictate new needs from memory system

n  A fresh look at (re-designing) the memory hierarchy q  Scalability: Enabling new memory technologies q  QoS, fairness & performance: Reducing and controlling inter-

application interference: QoS-aware memory system design q  Efficiency: Customizability, minimal waste, new technologies

n  Many exciting research topics in fundamental areas across the system stack q  Hardware/software/device cooperation essential

62

Page 63: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Thank you.

63

Page 64: Memory Systems in the Many-Core Era : Some Challenges and ... · The Memory System ! The memory system is a fundamental performance and power bottleneck in almost all computing systems:

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions

Onur Mutlu http://www.ece.cmu.edu/~omutlu

June 5, 2011 ISMM/MSPC