Computer Architecture
Lecture 12: Memory Interference and Quality of Service
Prof. Onur Mutlu
ETH Zürich
Fall 2017
1 November 2017


Summary of Last Week's Lectures

- Control Dependence Handling
  - Problem
  - Six solutions
- Branch Prediction
- Trace Caches
- Other Methods of Control Dependence Handling
  - Fine-Grained Multithreading
  - Predicated Execution
  - Multi-path Execution

Agenda for Today

- Shared vs. private resources in multi-core systems
- Memory interference and the QoS problem
- Memory scheduling
- Other approaches to mitigate and control memory interference

Quick Summary Papers

- "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems"
- "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost"
- "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems"
- "Parallel Application Memory Scheduling"
- "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning"

Shared Resource Design for Multi-Core Systems

Memory System: A Shared Resource View

[Figure: the memory system viewed as a resource shared by all cores, from caches down to storage]

Resource Sharing Concept

- Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it
  - Example resources: functional units, pipeline, caches, buses, memory
- Why?

+ Resource sharing improves utilization/efficiency → throughput
  - When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
+ Reduces communication latency
  - For example, shared data kept in the same cache in SMT processors
+ Compatible with the shared memory model

Resource Sharing Disadvantages

- Resource sharing results in contention for resources
  - When the resource is not idle, another thread cannot use it
  - If space is occupied by one thread, another thread needs to re-occupy it

- Sometimes reduces each or some thread's performance
- Thread performance can be worse than when it is run alone
- Eliminates performance isolation → inconsistent performance across runs
  - Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS
  - Causes unfairness, starvation

Need to efficiently and fairly utilize shared resources

Example: Problem with Shared Caches

[Figure: two processor cores, each with a private L1 cache, share an L2 cache; thread t1 on Core 1 and thread t2 on Core 2 compete for the shared L2 space]

t2's throughput is significantly reduced due to unfair cache sharing.

Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.

Need for QoS and Shared Resource Mgmt.

- Why is unpredictable performance (or lack of QoS) bad?

- Makes programmer's life difficult
  - An optimized program can get low performance (and performance varies widely depending on co-runners)

- Causes discomfort to user
  - An important program can starve
  - Examples from shared software resources

- Makes system management difficult
  - How do we enforce a Service Level Agreement when hardware resource sharing is uncontrollable?

Resource Sharing vs. Partitioning

- Sharing improves throughput
  - Better utilization of space

- Partitioning provides performance isolation (predictable performance)
  - Dedicated space

- Can we get the benefits of both?

- Idea: Design shared resources such that they are efficiently utilized, controllable and partitionable
  - No wasted resource + QoS mechanisms for threads

Shared Hardware Resources

- Memory subsystem (in both multi-threaded and multi-core systems)
  - Non-private caches
  - Interconnects
  - Memory controllers, buses, banks

- I/O subsystem (in both multi-threaded and multi-core systems)
  - I/O, DMA controllers
  - Ethernet controllers

- Processor (in multi-threaded systems)
  - Pipeline resources
  - L1 caches

Memory System is the Major Shared Resource

[Figure: all threads' requests go through the shared memory system and interfere with each other]

Much More of a Shared Resource in Future

Inter-Thread/Application Interference

- Problem: Threads share the memory system, but the memory system does not distinguish between threads' requests

- Existing memory systems
  - Free-for-all, shared based on demand
  - Control algorithms thread-unaware and thread-unfair
  - Aggressive threads can deny service to others
  - Do not try to reduce or control inter-thread interference

Unfair Slowdowns due to Interference

[Figure: matlab (Core 1) and gcc (Core 2) run together and share the memory system; they slow each other down unequally]

Moscibroda and Mutlu, "Memory performance attacks: Denial of memory service in multi-core systems," USENIX Security 2007.

Uncontrolled Interference: An Example

[Figure: a multi-core chip with CORE 1 (running stream) and CORE 2 (running random), each with an L2 cache, connected via an interconnect to the DRAM memory controller and the shared DRAM memory system (Banks 0-3); uncontrolled sharing leads to unfairness]

A Memory Performance Hog

STREAM (streaming)
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

    // initialize large arrays A, B
    for (j=0; j<N; j++) {
        index = j*linesize;   // sequential index: consecutive cache blocks, same row
        A[index] = B[index];
        …
    }

RANDOM (random)
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

    // initialize large arrays A, B
    for (j=0; j<N; j++) {
        index = rand();       // random index: rarely hits the currently open row
        A[index] = B[index];
        …
    }

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

What Does the Memory Hog Do?

[Figure: a DRAM bank with row decoder, row buffer, and column mux; the memory request buffer holds many T0 (STREAM) requests to Row 0 interleaved with a few T1 (RANDOM) requests to Rows 5, 16, and 111; the row-hit-first policy keeps servicing T0's requests to the open Row 0]

Row size: 8 KB, cache block size: 64 B → 128 (8 KB / 64 B) requests of T0 are serviced before T1.

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

DRAM Controllers

- A row-conflict memory access takes significantly longer than a row-hit access

- Current controllers take advantage of the row buffer

- Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*
  (1) Row-hit first: Service row-hit memory accesses first
  (2) Oldest-first: Then service older accesses first

- This scheduling policy aims to maximize DRAM throughput
- But, it is unfair when multiple threads share the DRAM system

* Rixner et al., "Memory Access Scheduling," ISCA 2000.
* Zuravleff and Robinson, "Controller for a synchronous DRAM …," US Patent 5,630,096, May 1997.
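To make the FR-FCFS rule concrete, here is a minimal sketch of its request-selection logic. The request and bank structures, field names, and the open-row bookkeeping are illustrative assumptions, not the design from the cited papers.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        int bank;               /* bank this request targets           */
        int row;                /* row this request targets            */
        unsigned long arrival;  /* arrival time; smaller = older       */
        bool valid;
    } request_t;

    /* FR-FCFS: among valid requests, prefer row hits; break ties by age. */
    static int frfcfs_pick(const request_t *buf, size_t n, const int *open_row)
    {
        int best = -1;
        bool best_hit = false;
        for (size_t i = 0; i < n; i++) {
            if (!buf[i].valid)
                continue;
            bool hit = (buf[i].row == open_row[buf[i].bank]);        /* row-hit?          */
            if (best < 0 ||
                (hit && !best_hit) ||                                /* (1) row-hit first */
                (hit == best_hit && buf[i].arrival < buf[best].arrival)) { /* (2) oldest  */
                best = (int)i;
                best_hit = hit;
            }
        }
        return best;   /* index of the request to schedule, or -1 if none */
    }

With only these two rules, a thread like STREAM that keeps hitting the open row can monopolize a bank, which is exactly the unfairness shown above.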

Effect of the Memory Performance Hog

[Bar chart: slowdown of each program when STREAM and RANDOM run together: STREAM has a 1.18X slowdown, RANDOM a 2.82X slowdown; companion charts show STREAM co-running with gcc and with Virtual PC]

Results on Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux).

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

Greater Problem with More Cores

- Vulnerable to denial of service (DoS)
- Unable to enforce priorities or SLAs
- Low system performance

Uncontrollable, unpredictable system

Distributed DoS in Networked Multi-Core Systems

[Figure: a 64-core chip with cores connected via packet-switched routers on chip; attackers run on Cores 1-8 and a stock option pricing application on Cores 9-64; the application sees a ~5000X latency increase]

Grot, Hestness, Keckler, and Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009.

More on Memory Performance Attacks

- Thomas Moscibroda and Onur Mutlu,
  "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems"
  Proceedings of the 16th USENIX Security Symposium (USENIX SECURITY), pages 257-274, Boston, MA, August 2007. Slides (ppt)

More on Interconnect Based Starvation

- Boris Grot, Stephen W. Keckler, and Onur Mutlu,
  "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip"
  Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 268-279, New York, NY, December 2009. Slides (pdf)

How Do We Solve The Problem?

- Inter-thread interference is uncontrolled in all memory resources
  - Memory controller
  - Interconnect
  - Caches

- We need to control it
  - i.e., design an interference-aware (QoS-aware) memory system

QoS-Aware Memory Systems: Challenges

- How do we reduce inter-thread interference?
  - Improve system performance and core utilization
  - Reduce request serialization and core starvation

- How do we control inter-thread interference?
  - Provide mechanisms to enable system software to enforce QoS policies
  - While providing high system performance

- How do we make the memory system configurable/flexible?
  - Enable flexible mechanisms that can achieve many goals
    - Provide fairness or throughput when needed
    - Satisfy performance guarantees when needed

Designing QoS-Aware Memory Systems: Approaches

- Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers
  - QoS-aware interconnects
  - QoS-aware caches

- Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to the memory system
  - QoS-aware data mapping to memory controllers
  - QoS-aware thread scheduling to cores

Fundamental Interference Control Techniques

- Goal: to reduce/control inter-thread memory interference

1. Prioritization or request scheduling
2. Data mapping to banks/channels/ranks
3. Core/source throttling
4. Application/thread scheduling

QoS-Aware Memory Scheduling

- How to schedule requests to provide
  - High system performance
  - High fairness to applications
  - Configurability to system software

- Memory controller needs to be aware of threads

[Figure: four cores share a memory controller in front of memory; the controller resolves memory contention by scheduling requests]

QoS-Aware Memory Scheduling: Evolution

- Stall-time fair memory scheduling [Mutlu+ MICRO'07]
  - Idea: Estimate and balance thread slowdowns
  - Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)

- Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
  - Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
  - Takeaway: Preserving within-thread bank parallelism improves performance; request batching improves fairness

- ATLAS memory scheduler [Kim+ HPCA'10]
  - Idea: Prioritize threads that have attained the least service from the memory scheduler
  - Takeaway: Prioritizing "light" threads improves performance

QoS-Aware Memory Scheduling: Evolution

- Thread cluster memory scheduling [Kim+ MICRO'10]
  - Idea: Cluster threads into two groups (latency- vs. bandwidth-sensitive); prioritize the latency-sensitive ones; employ a fairness policy in the bandwidth-sensitive group
  - Takeaway: A heterogeneous scheduling policy that adapts to thread behavior maximizes both performance and fairness

- Integrated Memory Channel Partitioning and Scheduling [Muralidhara+ MICRO'11]
  - Idea: Only prioritize very latency-sensitive threads in the scheduler; mitigate all other applications' interference via channel partitioning
  - Takeaway: Intelligently combining application-aware channel partitioning and memory scheduling provides better performance than either alone

QoS-Aware Memory Scheduling: Evolution

- Parallel application memory scheduling [Ebrahimi+ MICRO'11]
  - Idea: Identify and prioritize limiter threads of a multithreaded application in the memory scheduler; provide fast and fair progress to non-limiter threads
  - Takeaway: Carefully prioritizing between limiter and non-limiter threads of a parallel application improves performance

- Staged memory scheduling [Ausavarungnirun+ ISCA'12]
  - Idea: Divide the functional tasks of an application-aware memory scheduler into multiple distinct stages, where each stage is significantly simpler than a monolithic scheduler
  - Takeaway: Staging enables the design of a scalable and relatively simpler application-aware memory scheduler that works on very large request buffers

QoS-Aware Memory Scheduling: Evolution

- MISE: Memory Slowdown Model [Subramanian+ HPCA'13]
  - Idea: Estimate the performance of a thread by estimating its change in memory request service rate when run alone vs. shared → use this simple model to estimate slowdown and design a scheduling policy that provides predictable performance or fairness
  - Takeaway: Request service rate of a thread is a good proxy for its performance; alone request service rate can be estimated by giving the thread high priority in memory scheduling for a while

- ASM: Application Slowdown Model [Subramanian+ MICRO'15]
  - Idea: Extend MISE to take into account cache+memory interference
  - Takeaway: Cache access rate of an application can be estimated accurately and is a good proxy for application performance
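As a rough illustration of the MISE idea above, the sketch below turns two measured request service rates into a slowdown estimate. The sampling mechanism, variable names, and example values are assumptions for illustration, not the exact MISE implementation.

    #include <stdio.h>

    /* MISE-style estimate: slowdown ~= alone service rate / shared service rate.
     * The alone rate is sampled by periodically giving the thread highest priority. */
    static double estimate_slowdown(double alone_rate, double shared_rate)
    {
        return alone_rate / shared_rate;
    }

    int main(void)
    {
        /* Hypothetical measurements: requests serviced per kilo-cycle. */
        double alone_rate  = 4.0;  /* measured while the thread had highest priority */
        double shared_rate = 2.5;  /* measured during normal shared operation        */
        printf("estimated slowdown = %.2f\n", estimate_slowdown(alone_rate, shared_rate));
        return 0;
    }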

QoS-Aware Memory Scheduling: Evolution

- BLISS: Blacklisting Memory Scheduler [Subramanian+ ICCD'14]
  - Idea: Deprioritize (i.e., blacklist) a thread that has had a large number of consecutive requests serviced
  - Takeaway: Blacklisting greatly reduces interference and enables the scheduler to be simple without requiring full thread ranking

- DASH: Deadline-Aware Memory Scheduler [Usui+ TACO'16]
  - Idea: Balance prioritization between CPUs, GPUs, and Hardware Accelerators (HWAs) by keeping HWA progress in check vs. deadlines such that HWAs do not hog performance, and by appropriately distinguishing between latency-sensitive and bandwidth-sensitive CPU workloads
  - Takeaway: Proper control of HWA progress and application-aware CPU prioritization lead to better system performance while meeting HWA deadlines

QoS-Aware Memory Scheduling: Evolution

- Prefetch-aware shared resource management [Ebrahimi+ ISCA'11] [Ebrahimi+ MICRO'09] [Ebrahimi+ HPCA'09] [Lee+ MICRO'08]
  - Idea: Prioritize prefetches depending on how they affect system performance; even accurate prefetches can degrade performance of the system
  - Takeaway: Carefully controlling and prioritizing prefetch requests improves performance and fairness

- DRAM-aware last-level cache policies and write scheduling [Lee+ HPS Tech Report'10] [Lee+ HPS Tech Report'10]
  - Idea: Design cache eviction and replacement policies such that they proactively exploit the state of the memory controller and DRAM (e.g., proactively evict data from the cache that hit in open rows)
  - Takeaway: Coordination of last-level cache and DRAM policies improves performance and fairness; writes should not be ignored

QoS-Aware Memory Scheduling: Evolution

- FIRM: Memory Scheduling for NVM [Zhao+ MICRO'14]
  - Idea: Carefully handle write-read prioritization with coarse-grained batching and application-aware scheduling
  - Takeaway: Carefully controlling and prioritizing write requests improves performance and fairness; write requests are especially critical in NVMs

Stall-Time Fair Memory Scheduling

Onur Mutlu and Thomas Moscibroda,
"Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. Slides (ppt)

STFM MICRO 2007 Talk

The Problem: Unfairness

- Vulnerable to denial of service (DoS)
- Unable to enforce priorities or SLAs
- Low system performance

Uncontrollable, unpredictable system

How Do We Solve the Problem?

- Stall-time fair memory scheduling [Mutlu+ MICRO'07]

- Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
  - Also improves overall system performance by ensuring cores make "proportional" progress

- Idea: Memory controller estimates each thread's slowdown due to interference and schedules requests in a way that balances the slowdowns

- Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.

Stall-Time Fairness in Shared DRAM Systems

- A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system

- DRAM-related stall-time: the time a thread spends waiting for DRAM memory
  - ST_shared: DRAM-related stall-time when the thread runs with other threads
  - ST_alone: DRAM-related stall-time when the thread runs alone
  - Memory-slowdown = ST_shared / ST_alone
    - Relative increase in stall-time

- The Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance
  - Considers inherent DRAM performance of each thread
  - Aims to allow proportional progress of threads

STFM Scheduling Algorithm [MICRO'07]

- For each thread, the DRAM controller
  - Tracks ST_shared
  - Estimates ST_alone

- Each cycle, the DRAM controller
  - Computes Slowdown = ST_shared / ST_alone for threads with legal requests
  - Computes unfairness = MAX Slowdown / MIN Slowdown

- If unfairness < α
  - Use DRAM-throughput-oriented scheduling policy

- If unfairness ≥ α
  - Use fairness-oriented scheduling policy
    - (1) requests from the thread with MAX Slowdown first
    - (2) row-hit first, (3) oldest-first
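A minimal sketch of the per-cycle STFM decision described above. The data layout, the fixed thread count, and how ST_alone is obtained are assumptions made for illustration; the real controller estimates ST_alone in hardware.

    #include <stdbool.h>

    #define NTHREADS 4

    typedef struct {
        double st_shared;     /* measured DRAM-related stall time (shared run)  */
        double st_alone;      /* estimated DRAM-related stall time (alone run)  */
        bool   has_request;   /* thread has at least one legal request pending  */
    } thread_state_t;

    /* Returns true if the fairness-oriented policy should be used this cycle,
     * and writes the index of the most-slowed-down thread to *victim.         */
    static bool stfm_use_fairness_policy(const thread_state_t t[NTHREADS],
                                         double alpha, int *victim)
    {
        double max_slow = -1.0, min_slow = -1.0;
        int max_idx = -1;

        for (int i = 0; i < NTHREADS; i++) {
            if (!t[i].has_request)
                continue;                                  /* only threads with legal requests */
            double slow = t[i].st_shared / t[i].st_alone;  /* Slowdown = ST_shared / ST_alone  */
            if (max_idx < 0 || slow > max_slow) { max_slow = slow; max_idx = i; }
            if (min_slow < 0.0 || slow < min_slow) min_slow = slow;
        }
        if (max_idx < 0)
            return false;                                  /* nothing to schedule              */

        *victim = max_idx;                                 /* thread with MAX Slowdown         */
        return (max_slow / min_slow) >= alpha;             /* else keep the throughput policy  */
    }

When the function returns true, the controller prioritizes the victim thread's requests first, then applies row-hit-first and oldest-first among them, as listed in the slide.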

How Does STFM Prevent Unfairness?

[Figure: the same request buffer as before (many T0 requests to Row 0, a few T1 requests to Rows 5, 16, and 111); STFM tracks T0's and T1's slowdowns each step (values evolving between about 1.00 and 1.14) and, once unfairness exceeds α = 1.05, switches to servicing T1's requests so that the two slowdowns converge]

STFM Pros and Cons

- Upsides:
  - First algorithm for fair multi-core memory scheduling
  - Provides a mechanism to estimate memory slowdown of a thread
  - Good at providing fairness
  - Being fair can improve performance

- Downsides:
  - Does not handle all types of interference
  - (Somewhat) complex to implement
  - Slowdown estimations can be incorrect

More on STFM

- Onur Mutlu and Thomas Moscibroda,
  "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
  Proceedings of the 40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. [Summary] [Slides (ppt)]

Parallelism-Aware Batch Scheduling

Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems"
35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. Slides (ppt)

PAR-BS ISCA 2008 Talk

Another Problem due to Memory Interference

- Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
  - Memory-Level Parallelism (MLP)
  - Out-of-order execution, non-blocking caches, runahead execution

- Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks

- Multiple threads share the DRAM controller
- DRAM controllers are not aware of a thread's MLP
  - Can service each thread's outstanding requests serially, not in parallel

Bank Parallelism of a Thread

[Figure, single thread: Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1); the bank access latencies of the two requests overlap, so the thread stalls for roughly ONE bank access latency before it can compute again]

Bank Parallelism Interference in DRAM

[Figure: Thread A issues requests to Bank 0, Row 1 and Bank 1, Row 1; Thread B issues requests to Bank 0, Row 99 and Bank 1, Row 99; with the baseline scheduler the two threads' accesses are interleaved so that each thread's own bank accesses are serialized]

Bank access latencies of each thread are serialized; each thread stalls for roughly TWO bank access latencies.

Parallelism-Aware Scheduler

[Figure: the same two threads and four requests as before; the baseline scheduler serializes each thread's bank accesses, while a parallelism-aware scheduler services each thread's two requests back to back in parallel banks, saving cycles]

With the parallelism-aware scheduler, the average stall-time drops to ~1.5 bank access latencies.

Parallelism-Aware Batch Scheduling (PAR-BS)

- Principle 1: Parallelism-awareness
  - Schedule requests from a thread (to different banks) back to back
  - Preserves each thread's bank parallelism
  - But, this can cause starvation...

- Principle 2: Request Batching
  - Group a fixed number of oldest requests from each thread into a "batch"
  - Service the batch before all other requests
  - Form a new batch when the current one is done
  - Eliminates starvation, provides fairness
  - Allows parallelism-awareness within a batch

[Figure: per-bank request queues of threads T0-T3 at Bank 0 and Bank 1; the oldest requests of each thread form the current batch]

Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.

PAR-BS Components

- Request batching

- Within-batch scheduling
  - Parallelism aware

Request Batching

- Each memory request has a bit (marked) associated with it

- Batch formation:
  - Mark up to Marking-Cap oldest requests per bank for each thread
  - Marked requests constitute the batch
  - Form a new batch when no marked requests are left

- Marked requests are prioritized over unmarked ones
  - No reordering of requests across batches: no starvation, high fairness

- How to prioritize requests within a batch?

Within-Batch Scheduling

- Can use any existing DRAM scheduling policy
  - FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality

- But, we also want to preserve intra-thread bank parallelism
  - Service each thread's requests back to back

- HOW? The scheduler computes a ranking of threads when the batch is formed
  - Higher-ranked threads are prioritized over lower-ranked ones
  - Improves the likelihood that requests from a thread are serviced in parallel by different banks
    - Different threads are prioritized in the same order across ALL banks

Thread Ranking

Key idea:

[Figure: threads A and B each have one request to Bank 0 and one to Bank 1; without ranking, the two banks service the threads in different orders and both threads wait for both requests; with a consistent rank across banks, one thread's two requests are serviced in parallel first, then the other's, saving cycles on the thread execution timeline]

How to Rank Threads within a Batch

- The ranking scheme affects system throughput and fairness

- Maximize system throughput
  - Minimize average stall-time of threads within the batch

- Minimize unfairness (equalize the slowdown of threads)
  - Service threads with inherently low stall-time early in the batch
  - Insight: delaying memory non-intensive threads results in high slowdown

- Shortest stall-time first (shortest job first) ranking
  - Provides optimal system throughput [Smith, 1956]*
  - The controller estimates each thread's stall-time within the batch
  - Ranks threads with shorter stall-time higher

* W.E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.

Shortest Stall-Time First Ranking

- Maximum number of marked requests to any bank (max-bank-load)
  - Rank the thread with lower max-bank-load higher (~ low stall-time)

- Total number of marked requests (total-load)
  - Breaks ties: rank the thread with lower total-load higher

[Figure: marked requests of threads T0-T3 queued at Banks 0-3; T3 has the most requests piled on a single bank]

Thread | max-bank-load | total-load
T0     | 1             | 3
T1     | 2             | 4
T2     | 2             | 6
T3     | 5             | 9

Ranking: T0 > T1 > T2 > T3
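A small sketch of the shortest-stall-time-first ranking rule, using the load values from the table above; the data structures and the simple comparison sort are illustrative assumptions rather than the hardware implementation.

    #include <stdio.h>

    #define NTHREADS 4

    typedef struct {
        const char *name;
        int max_bank_load;   /* max number of marked requests at any one bank */
        int total_load;      /* total number of marked requests               */
    } thread_load_t;

    /* Thread a ranks higher (negative return) if its max-bank-load is lower;
     * total-load breaks ties, as described above.                            */
    static int rank_cmp(const thread_load_t *a, const thread_load_t *b)
    {
        if (a->max_bank_load != b->max_bank_load)
            return a->max_bank_load - b->max_bank_load;
        return a->total_load - b->total_load;
    }

    int main(void)
    {
        /* Values from the example table. */
        thread_load_t t[NTHREADS] = {
            {"T0", 1, 3}, {"T1", 2, 4}, {"T2", 2, 6}, {"T3", 5, 9},
        };

        /* Simple insertion sort by rank (highest rank first). */
        for (int i = 1; i < NTHREADS; i++)
            for (int j = i; j > 0 && rank_cmp(&t[j], &t[j - 1]) < 0; j--) {
                thread_load_t tmp = t[j]; t[j] = t[j - 1]; t[j - 1] = tmp;
            }

        printf("Ranking:");
        for (int i = 0; i < NTHREADS; i++)
            printf(" %s%s", t[i].name, i + 1 < NTHREADS ? " >" : "\n");
        return 0;
    }

Running this reproduces the ranking T0 > T1 > T2 > T3 shown above.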

Example Within-Batch Scheduling Order

[Figure: the same marked requests of T0-T3 scheduled on Banks 0-3 in baseline (arrival) order vs. PAR-BS rank order, with ranking T0 > T1 > T2 > T3]

Stall times (in bank access latencies):
- Baseline (arrival order): T0 = 4, T1 = 4, T2 = 5, T3 = 7; average = 5
- PAR-BS order:             T0 = 1, T1 = 2, T2 = 4, T3 = 7; average = 3.5

Putting It Together: PAR-BS Scheduling Policy

- PAR-BS Scheduling Policy
  (1) Marked requests first (batching)
  (2) Row-hit requests first
  (3) Higher-rank thread first (shortest stall-time first)
  (4) Oldest first
  (Rules (2)-(4) form the parallelism-aware within-batch scheduling)

- Three properties:
  - Exploits row-buffer locality and intra-thread bank parallelism
  - Work-conserving
    - Services unmarked requests to banks without marked requests
  - Marking-Cap is important
    - Too small a cap: destroys row-buffer locality
    - Too large a cap: penalizes memory non-intensive threads

- Many more trade-offs analyzed in the paper
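A compact sketch of the four-rule priority order above written as a request comparator; the request fields and the surrounding controller loop are assumptions for illustration.

    #include <stdbool.h>

    typedef struct {
        bool marked;           /* request belongs to the current batch         */
        bool row_hit;          /* targets the currently open row in its bank   */
        int  thread_rank;      /* 0 = highest-ranked thread in this batch      */
        unsigned long arrival; /* smaller = older                              */
    } parbs_req_t;

    /* Returns true if request a should be scheduled before request b. */
    static bool parbs_higher_priority(const parbs_req_t *a, const parbs_req_t *b)
    {
        if (a->marked != b->marked)
            return a->marked;                          /* (1) marked requests first  */
        if (a->row_hit != b->row_hit)
            return a->row_hit;                         /* (2) row-hit requests first */
        if (a->thread_rank != b->thread_rank)
            return a->thread_rank < b->thread_rank;    /* (3) higher-rank thread     */
        return a->arrival < b->arrival;                /* (4) oldest first           */
    }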

Hardware Cost

- <1.5 KB storage cost for
  - an 8-core system with a 128-entry memory request buffer

- No complex operations (e.g., divisions)

- Not on the critical path
  - The scheduler makes a decision only every DRAM cycle

Unfairness on 4-, 8-, and 16-Core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]

[Bar chart: unfairness (lower is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS has the lowest unfairness, with labeled reductions of 1.11X, 1.08X, and 1.11X on the three systems]

System Performance (Hmean speedup)

[Bar chart: normalized harmonic-mean speedup of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS's labeled gains are 8.3%, 6.1%, and 5.1% on the three systems]

PAR-BS Pros and Cons

- Upsides:
  - First scheduler to address bank parallelism destruction across multiple threads
  - Simple mechanism (vs. STFM)
  - Batching provides fairness
  - Ranking enables parallelism awareness

- Downsides:
  - Does not always prioritize the latency-sensitive applications

More on PAR-BS

- Onur Mutlu and Thomas Moscibroda,
  "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems"
  Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. [Summary] [Slides (ppt)]

ATLAS Memory Scheduler

Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter,
"ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers"
16th International Symposium on High-Performance Computer Architecture (HPCA), Bangalore, India, January 2010. Slides (pptx)

ATLAS HPCA 2010 Talk

ATLAS: Summary

- Goal: To maximize system performance

- Main idea: Prioritize the thread that has attained the least service from the memory controllers (Adaptive per-Thread Least Attained Service Scheduling)
  - Rank threads based on attained service in the past time interval(s)
  - Enforce the thread ranking in the memory scheduler during the current interval

- Why it works: Prioritizes "light" (memory non-intensive) threads that are more likely to keep their cores busy
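A minimal sketch of least-attained-service ranking at the end of an interval; the attained-service accounting (here, units of memory service received in the past interval) and the data structures are simplified assumptions.

    #include <stdio.h>

    #define NTHREADS 4

    typedef struct {
        int           id;
        unsigned long attained_service;  /* memory service received in past interval(s) */
        int           rank;              /* 0 = highest priority for the next interval  */
    } atlas_thread_t;

    /* Rank threads so that the thread with the LEAST attained service gets rank 0. */
    static void atlas_rank(atlas_thread_t t[NTHREADS])
    {
        for (int i = 0; i < NTHREADS; i++) {
            int r = 0;
            for (int j = 0; j < NTHREADS; j++)
                if (t[j].attained_service < t[i].attained_service ||
                    (t[j].attained_service == t[i].attained_service && j < i))
                    r++;                 /* count threads that rank ahead of thread i */
            t[i].rank = r;
        }
    }

    int main(void)
    {
        atlas_thread_t t[NTHREADS] = {           /* hypothetical values */
            {0, 9000, 0}, {1, 1200, 0}, {2, 400, 0}, {3, 5000, 0},
        };
        atlas_rank(t);
        for (int i = 0; i < NTHREADS; i++)
            printf("thread %d: attained=%lu rank=%d\n",
                   t[i].id, t[i].attained_service, t[i].rank);
        return 0;
    }

During the next interval, requests from lower-rank (less-serviced) threads are prioritized; as the slides note, this favors light threads and can significantly delay the most intensive ones (the fairness downside listed later).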

System Throughput: 24-Core System

System throughput = Σ Speedup

[Line chart: system throughput vs. number of memory controllers (1, 2, 4, 8, 16) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS; ATLAS's labeled gains shrink from 17.0% at 1 controller to 9.8%, 8.4%, 5.9%, and 3.5% as controllers are added]

ATLAS consistently provides higher system throughput than all previous scheduling algorithms.

System Throughput: 4-MC System

[Line chart: system throughput of PAR-BS vs. ATLAS as the number of cores grows (4, 8, 16, 24, 32); ATLAS's gain over PAR-BS grows from 1.1% at 4 cores through 3.5%, 4.0%, and 8.4% to 10.8% at 32 cores]

As the number of cores increases, ATLAS's performance benefit increases.

ATLAS Pros and Cons

- Upsides:
  - Good at improving overall throughput (compute-intensive threads are prioritized)
  - Low complexity
  - Coordination among controllers happens infrequently

- Downsides:
  - Lowest- and medium-ranked threads get delayed significantly → high unfairness

More on the ATLAS Memory Scheduler

- Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter,
  "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers"
  Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA), Bangalore, India, January 2010. Slides (pptx)

TCM: Thread Cluster Memory Scheduling

Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,
"Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior"
43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

TCM MICRO 2010 Talk

Previous Scheduling Algorithms are Biased

No previous memory scheduling algorithm provides both the best fairness and the best system throughput.

[Scatter plot: maximum slowdown (lower = better fairness) vs. weighted speedup (higher = better throughput) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS on 24 cores, 4 memory controllers, 96 workloads; some algorithms are biased toward system throughput, others toward fairness]

Throughput vs. Fairness

- Throughput-biased approach: prioritize less memory-intensive threads
  - Good for throughput
  - But the most memory-intensive thread can starve → unfairness

- Fairness-biased approach: take turns accessing memory
  - Does not starve any thread
  - But less intensive threads are not prioritized → reduced throughput

A single policy for all threads is insufficient.

Achieving the Best of Both Worlds

- For throughput: prioritize memory-non-intensive threads (give them higher priority)

- For fairness:
  - Unfairness is caused by memory-intensive threads being prioritized over each other
    → Shuffle the thread ranking
  - Memory-intensive threads have different vulnerability to interference
    → Shuffle asymmetrically

Thread Cluster Memory Scheduling [Kim+ MICRO'10]

1. Group threads into two clusters
2. Prioritize the non-intensive cluster
3. Apply different policies to each cluster

[Figure: the threads in the system are split into a memory-non-intensive cluster (prioritized, managed for throughput) and a memory-intensive cluster (managed for fairness)]

TCM Outline

1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)

Clustering Threads

Step 1: Sort threads by MPKI (misses per kilo-instruction)

Step 2: Memory bandwidth usage αT divides the clusters
- T = total memory bandwidth usage
- α < 10% (ClusterThreshold)

[Figure: threads sorted by MPKI; the ClusterThreshold αT splits them into a non-intensive cluster (lower MPKI) and an intensive cluster (higher MPKI)]
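The sketch below is one simplified reading of the clustering step: threads are sorted by MPKI, and the lowest-MPKI threads are placed in the non-intensive cluster while their summed bandwidth usage stays below αT. The exact ClusterThreshold rule in the TCM paper may differ, and the example values are hypothetical.

    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 6

    typedef struct {
        int    id;
        double mpki;        /* misses per kilo-instruction                  */
        double bandwidth;   /* memory bandwidth used in the last quantum    */
        bool   intensive;   /* cluster assignment produced below            */
    } tcm_thread_t;

    /* Sort by MPKI, then fill the non-intensive cluster with the lowest-MPKI
     * threads while their summed bandwidth stays below alpha * total.       */
    static void tcm_cluster(tcm_thread_t t[NTHREADS], double alpha)
    {
        for (int i = 1; i < NTHREADS; i++)            /* insertion sort by MPKI */
            for (int j = i; j > 0 && t[j].mpki < t[j - 1].mpki; j--) {
                tcm_thread_t tmp = t[j]; t[j] = t[j - 1]; t[j - 1] = tmp;
            }

        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++) total += t[i].bandwidth;

        double used = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            used += t[i].bandwidth;
            t[i].intensive = (used > alpha * total);  /* the rest are intensive */
        }
    }

    int main(void)
    {
        tcm_thread_t t[NTHREADS] = {                  /* hypothetical threads */
            {0, 0.5, 0.2, false}, {1, 35.0, 9.0, false}, {2, 2.0, 0.8, false},
            {3, 50.0, 12.0, false}, {4, 1.0, 0.4, false}, {5, 20.0, 6.0, false},
        };
        tcm_cluster(t, 0.10);                         /* alpha = 10% ClusterThreshold */
        for (int i = 0; i < NTHREADS; i++)
            printf("thread %d: MPKI=%.1f cluster=%s\n", t[i].id, t[i].mpki,
                   t[i].intensive ? "intensive" : "non-intensive");
        return 0;
    }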

Page 82: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

TCM Outline

82

1.Clustering

2.BetweenClusters

Prioritization Between Clusters

Prioritize the non-intensive cluster (higher priority)

- Increases system throughput
  - Non-intensive threads have greater potential for making progress

- Does not degrade fairness
  - Non-intensive threads are "light"
  - They rarely interfere with intensive threads


Non-Intensive Cluster

Prioritize threads according to MPKI (lowest MPKI = highest priority)

- Increases system throughput
  - The least intensive thread has the greatest potential for making progress in the processor


Intensive Cluster

Periodically shuffle the priority of threads

- Increases fairness
- Is treating all threads equally good enough?
- BUT: Equal turns ≠ same slowdown

Case Study: A Tale of Two Threads

Case study: two intensive threads contending
1. random-access
2. streaming

[Bar charts: when random-access is prioritized, it is barely slowed down (1x) while streaming slows down ~7x; when streaming is prioritized, it is barely slowed down (1x) while random-access slows down ~11x]

Which is slowed down more easily? The random-access thread is more easily slowed down.

Why are Threads Different?

[Figure: random-access sends its requests in parallel to different banks, while streaming sends all its requests to the same activated row; when they contend, one of random-access's parallel requests gets stuck behind streaming's activated row]

- random-access: all requests in parallel → high bank-level parallelism → vulnerable to interference
- streaming: all requests to the same row → high row-buffer locality


Niceness

How to quantify the difference between threads?

- Bank-level parallelism → vulnerability to interference → increases niceness (+)
- Row-buffer locality → causes interference → decreases niceness (-)

High niceness: easily hurt by others. Low niceness: hurts others.
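One simple way to turn the two metrics into a single niceness score, consistent with the signs above; the exact formula used by TCM may differ, and the example values are hypothetical.

    #include <stdio.h>

    /* Bank-level parallelism (BLP) raises niceness, row-buffer locality (RBL)
     * lowers it; higher = more vulnerable, lower = more interference-causing. */
    static double niceness(double blp, double rbl)
    {
        return blp - rbl;
    }

    int main(void)
    {
        /* hypothetical per-thread metrics, both normalized to [0, 1] */
        printf("random-access-like thread: %+.2f\n", niceness(0.9, 0.10));
        printf("streaming-like thread:     %+.2f\n", niceness(0.1, 0.95));
        return 0;
    }

As the conclusion slide notes, shuffling within the intensive cluster should favor the nicer threads, since they are the ones most easily slowed down.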

TCM: Quantum-Based Operation

- During a quantum (~1M cycles): monitor thread behavior
  1. Memory intensity
  2. Bank-level parallelism
  3. Row-buffer locality

- At the beginning of a quantum:
  - Perform clustering
  - Compute the niceness of intensive threads

- Within the intensive cluster, priorities are shuffled every shuffle interval (~1K cycles)

TCM: Scheduling Algorithm

1. Highest rank first: requests from higher-ranked threads are prioritized
   - Non-intensive cluster > intensive cluster
   - Non-intensive cluster: lower intensity → higher rank
   - Intensive cluster: rank shuffling

2. Row-hit: row-buffer hit requests are prioritized

3. Oldest: older requests are prioritized
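To make the three-rule order concrete, here is a sketch of it as a request comparator; the fields (cluster flag, per-cluster rank, row-hit, age) and their encodings are assumptions for illustration.

    #include <stdbool.h>

    typedef struct {
        bool intensive_cluster;  /* issuing thread is in the intensive cluster   */
        int  rank;               /* rank within its cluster; 0 = highest         */
        bool row_hit;            /* request targets the currently open row       */
        unsigned long arrival;   /* smaller = older                              */
    } tcm_req_t;

    /* Returns true if request a should be serviced before request b. */
    static bool tcm_higher_priority(const tcm_req_t *a, const tcm_req_t *b)
    {
        /* 1. Highest rank first: non-intensive cluster beats intensive cluster, */
        /*    then the within-cluster rank decides.                              */
        if (a->intensive_cluster != b->intensive_cluster)
            return !a->intensive_cluster;
        if (a->rank != b->rank)
            return a->rank < b->rank;
        /* 2. Row-hit requests next. */
        if (a->row_hit != b->row_hit)
            return a->row_hit;
        /* 3. Oldest first. */
        return a->arrival < b->arrival;
    }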

TCM: Implementation Cost

Required storage at the memory controller (24 cores):

Thread memory behavior | Storage
MPKI                   | ~0.2 kb
Bank-level parallelism | ~0.6 kb
Row-buffer locality    | ~2.9 kb
Total                  | < 4 kbits

- No computation is on the critical path

Previous Work

- FR-FCFS [Rixner et al., ISCA'00]: prioritizes row-buffer hits
  - Thread-oblivious → low throughput & low fairness

- STFM [Mutlu et al., MICRO'07]: equalizes thread slowdowns
  - Non-intensive threads not prioritized → low throughput

- PAR-BS [Mutlu et al., ISCA'08]: prioritizes the oldest batch of requests while preserving bank-level parallelism
  - Non-intensive threads not always prioritized → low throughput

- ATLAS [Kim et al., HPCA'10]: prioritizes threads with less memory service
  - Most intensive thread starves → low fairness

TCM: Throughput and Fairness

[Scatter plot: maximum slowdown vs. weighted speedup for FR-FCFS, STFM, PAR-BS, ATLAS, and TCM on 24 cores, 4 memory controllers, 96 workloads; TCM sits in the best corner (highest weighted speedup, lowest maximum slowdown)]

TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.

TCM: Fairness-Throughput Tradeoff

When the configuration parameter is varied...

[Plot: maximum slowdown vs. weighted speedup as each scheduler's configuration parameter is varied; adjusting TCM's ClusterThreshold traces a curve that dominates FR-FCFS, STFM, PAR-BS, and ATLAS on both axes]

TCM allows a robust fairness-throughput tradeoff.

Operating System Support

- ClusterThreshold is a tunable knob
  - The OS can trade off between fairness and throughput

- Enforcing thread weights
  - The OS assigns weights to threads
  - TCM enforces thread weights within each cluster

Conclusion

- No previous memory scheduling algorithm provides both high system throughput and fairness
  - Problem: they use a single policy for all threads

- TCM groups threads into two clusters
  1. Prioritize the non-intensive cluster → throughput
  2. Shuffle priorities in the intensive cluster → fairness
  3. Shuffling should favor nice threads → fairness

- TCM provides the best system throughput and fairness

TCM Pros and Cons

- Upsides:
  - Provides both high fairness and high performance
  - Caters to the needs of different types of threads (latency- vs. bandwidth-sensitive)
  - (Relatively) simple

- Downsides:
  - Scalability to large buffer sizes?
  - Robustness of the clustering and shuffling algorithms?
  - Ranking is still too complex?

More on TCM

- Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,
  "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior"
  Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

The Blacklisting Memory Scheduler

Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu,
"The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost"
Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD), Seoul, South Korea, October 2014. [Slides (pptx) (pdf)]

Tackling Inter-Application Interference: Application-Aware Memory Scheduling

Application-aware memory schedulers monitor applications, rank them, and enforce the ranks (highest-ranked application ID first) to significantly improve performance and fairness.

[Figure: a request buffer whose entries are tagged with application IDs; the scheduler picks requests according to the application ranking]

However, full ranking increases critical path latency and area significantly.

Performance vs. Fairness vs. Simplicity

[Chart comparing schedulers on performance, fairness, and simplicity: FR-FCFS is application-unaware (very simple, but low performance and fairness); PAR-BS, ATLAS, and TCM are application-aware via ranking (complex); our solution (no ranking) aims for all three, close to the ideal]

Is it essential to give up simplicity to optimize for performance and/or fairness? Our solution achieves all three goals.

Page 105: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

105

Benefit 1: Low complexity compared to ranking

[Figure: instead of a monitored full ranking, applications are split into two groups, Vulnerable and Interference-Causing, with the Vulnerable group prioritized over the Interference-Causing group]

Benefit 2: Lower slowdowns than ranking

Page 106: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

106

[Figure: the same two-group (Vulnerable vs. Interference-Causing) illustration as the previous slide]

How to classify applications into groups?

Page 107: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation 2

Observation 2: Serving a large number of consecutive requests from an application causes interference

Basic Idea:
• Group applications with a large number of consecutive requests as interference-causing → Blacklisting
• Deprioritize blacklisted applications
• Clear blacklist periodically (1000s of cycles)
(a sketch of this blacklisting logic follows this slide)

Benefits:
• Lower complexity
• Finer grained grouping decisions → Lower unfairness

107
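The blacklisting decision itself needs only a small amount of per-controller state. The sketch below (C++) tracks how many consecutive requests were served from the same application and blacklists it once a threshold is crossed; the threshold of 4 and the clearing interval are assumptions for illustration rather than the exact BLISS parameters.

#include <algorithm>
#include <vector>

struct BlacklistState {
    int last_app = -1;              // application that was served most recently
    int streak = 0;                 // consecutive requests served from last_app
    std::vector<bool> blacklisted;
    explicit BlacklistState(int num_apps) : blacklisted(num_apps, false) {}
};

// Call whenever a request from application 'app' is served.
void on_request_served(BlacklistState& s, int app, int blacklist_thresh /* e.g., 4 */) {
    if (app == s.last_app) {
        if (++s.streak >= blacklist_thresh)
            s.blacklisted[app] = true;   // this app is causing interference
    } else {
        s.last_app = app;
        s.streak = 1;
    }
}

// Call every clearing interval (thousands of cycles).
void clear_blacklist(BlacklistState& s) {
    std::fill(s.blacklisted.begin(), s.blacklisted.end(), false);
    s.streak = 0;
    s.last_app = -1;
}

// Scheduling order (assumed): non-blacklisted requests over blacklisted ones,
// with ties broken by row-buffer hit and then age, as in the baseline FR-FCFS.

Because there is no per-application rank, the priority comparison is a single bit per request, which is where the complexity savings come from.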

Page 108: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Performance vs. Fairness vs. Simplicity

108

[Figure: Performance, Fairness, and Simplicity of FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting compared to an Ideal point. Blacklisting has the highest performance, is close to the simplest, and is close to the fairest]

Blacklisting is the closest scheduler to ideal

Page 109: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Performance and Fairness

109

[Figure: unfairness (lower is better) vs. performance (higher is better) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; annotated improvements of 5% and 21% for Blacklisting]

1. Blacklisting achieves the highest performance
2. Blacklisting balances performance and fairness

Page 110: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Complexity

110

[Figure: scheduler area (sq. um) vs. critical path latency (ns) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; annotated reductions of 43% and 70% for Blacklisting]

Blacklisting reduces complexity significantly

Page 111: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on BLISS (I)n Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha

Rastogi, and Onur Mutlu,"The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost"Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD), Seoul, South Korea, October 2014. [Slides (pptx) (pdf)]

111

Page 112: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on BLISS: Longer Versionn Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi,

and Onur Mutlu,"BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling"IEEE Transactions on Parallel and Distributed Systems (TPDS), to appear in 2016. arXiv.org version, April 2015.An earlier version as SAFARI Technical Report, TR-SAFARI-2015-004, Carnegie Mellon University, March 2015.[Source Code]

112

Page 113: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu,"Staged Memory Scheduling: Achieving High Performance

and Scalability in Heterogeneous Systems”39th International Symposium on Computer Architecture (ISCA),

Portland, OR, June 2012.

SMS ISCA 2012 Talk

Page 114: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

SMS: Executive Summary
n Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers

n Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes

n Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM
(a staged-scheduling sketch follows this slide)

n Compared to state-of-the-art memory schedulers:
q SMS is significantly simpler and more scalable
q SMS provides higher performance and fairness

114
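The sketch below (C++) illustrates the Stage 2 decision in isolation: pick the next ready batch either shortest-job-first (favoring latency-sensitive cores with few requests) or round-robin (giving bandwidth-heavy sources such as the GPU their turn), then hand its requests to per-bank FIFOs (Stage 3). The types, the p_sjf probability, and the batch representation are assumptions used only for illustration, not the exact SMS design.

#include <cstdlib>
#include <deque>
#include <vector>

struct Request { int source; int bank; long row; };
struct Batch   { int source; std::vector<Request> reqs; };  // same-source, same-row requests (Stage 1 output)

// Stage 2: choose which ready batch to dispatch next.
int pick_batch(const std::deque<Batch>& ready, int& rr_ptr, double p_sjf) {
    if (ready.empty()) return -1;
    if (static_cast<double>(std::rand()) / RAND_MAX < p_sjf) {
        int best = 0;                                  // SJF: smallest batch first
        for (int i = 1; i < static_cast<int>(ready.size()); ++i)
            if (ready[i].reqs.size() < ready[best].reqs.size()) best = i;
        return best;
    }
    rr_ptr = (rr_ptr + 1) % static_cast<int>(ready.size());
    return rr_ptr;                                     // RR: rotate across sources
}

// Hand the chosen batch to Stage 3 (simple FIFO per bank).
void dispatch(std::deque<Batch>& ready, std::vector<std::deque<Request>>& bank_fifos,
              int& rr_ptr, double p_sjf) {
    int idx = pick_batch(ready, rr_ptr, p_sjf);
    if (idx < 0) return;
    for (const Request& r : ready[idx].reqs)
        bank_fifos[r.bank].push_back(r);   // issued in FIFO order per bank
    ready.erase(ready.begin() + idx);
}

Keeping each stage this simple (FIFO buffers, one decision per stage) is what allows the overall design to scale to large request buffers.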

Page 115: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

SMS: Staged Memory Scheduling

115

[Figure: requests from Core 1-4 and the GPU, which a monolithic memory scheduler would buffer in one large queue, are instead handled in three stages: Stage 1 (Batch Formation), Stage 2 (Batch Scheduler), and Stage 3 (DRAM Command Scheduler with per-bank queues for Banks 1-4) before being issued to DRAM]

Page 116: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

SMS: Staged Memory Scheduling

116

[Figure: the same three-stage organization: Stage 1 (Batch Formation) per core and GPU source, Stage 2 (Batch Scheduler), and Stage 3 (DRAM Command Scheduler over Banks 1-4), issuing to DRAM]

Page 117: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Putting Everything Together

117

[Figure: complete SMS organization for Core 1-4 and the GPU: Stage 1 (Batch Formation), Stage 2 (Batch Scheduler, whose current scheduling policy alternates between SJF and RR), and Stage 3 (DRAM Command Scheduler over Banks 1-4)]

Page 118: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Complexity
n Compared to a row hit first scheduler, SMS consumes*
q 66% less area
q 46% less static power

n Reduction comes from:
q Monolithic scheduler → stages of simpler schedulers
q Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
q Each stage has simpler buffers (FIFO instead of out-of-order)
q Each stage has a portion of the total buffer size (buffering is distributed across stages)

118
* Based on a Verilog model using 180nm library

Page 119: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Performance at Different GPU Weights

119

[Figure: system performance vs. GPU weight (0.001 to 1000, log scale); the Best Previous Scheduler curve is the upper envelope of ATLAS, TCM, and FR-FCFS at each weight]

Page 120: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

n At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight

Performance at Different GPU Weights

120

[Figure: system performance vs. GPU weight (0.001 to 1000, log scale) for SMS and the best previous scheduler at each weight]

Page 121: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on SMSn Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian,

Gabriel Loh, and Onur Mutlu,"Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems"Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pptx)

121

Page 122: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

DASH Memory Scheduler[TACO 2016]

122

Page 123: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Current SoC Architectures

n Heterogeneous agents: CPUs and HWAs q HWA : Hardware Accelerator

n Main memory is shared by CPUs and HWAs à Interference

123

[Figure: four CPUs sharing a cache, plus three HWAs, all connected to a shared DRAM controller and DRAM]

How to schedule memory requests from CPUs and HWAs to mitigate interference?

Page 124: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

DASH Scheduler: Executive Summary
n Problem: Hardware accelerators (HWAs) and CPUs share the same memory subsystem and interfere with each other in main memory
n Goal: Design a memory scheduler that improves CPU performance while meeting HWAs' deadlines
n Challenge: Different HWAs have different memory access characteristics and different deadlines, which current schedulers do not smoothly handle
q Memory-intensive and long-deadline HWAs significantly degrade CPU performance when they become high priority (due to slow progress)
q Short-deadline HWAs sometimes miss their deadlines despite high priority

n Solution: DASH Memory Scheduler
q Prioritize HWAs over CPUs anytime the HWA is not making good progress
q Application-aware scheduling for CPUs and HWAs

n Key Results:
1) Improves CPU performance for a wide variety of workloads by 9.5%
2) Achieves a 100% deadline-met ratio for HWAs

n DASH source code freely available on our GitHub

124

Page 125: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Goal of Our Scheduler (DASH)

• Goal: Design a memory scheduler that
– Meets GPU/accelerators' frame rates/deadlines and
– Achieves high CPU performance

• Basic Idea:
– Different CPU applications and hardware accelerators have different memory requirements
– Track progress of different agents and prioritize accordingly

125

Page 126: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation: Distribute Priority for Accelerators

• GPU/accelerators need priority to meet deadlines
• Worst case prioritization not always the best
• Prioritize when they are not on track to meet a deadline
(a progress-check sketch follows this slide)

126

Distributing priority over time mitigates impact of accelerators on CPU cores' requests
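A rough way to express "prioritize when they are not on track" is to compare an accelerator's completed work against the elapsed fraction of its deadline period, as in the C++ sketch below; the field names and the slack factor are assumptions, not the exact DASH progress formula.

struct HwaStatus {
    long period_start_cycle;   // start of the current deadline period
    long period_length;        // cycles until the deadline
    long mem_requests_total;   // worst-case requests needed this period
    long mem_requests_done;    // requests completed so far
};

// Returns true if the HWA should be boosted above CPU cores right now.
bool needs_priority(const HwaStatus& h, long now, double slack = 0.9) {
    double time_frac = static_cast<double>(now - h.period_start_cycle) / h.period_length;
    double work_frac = static_cast<double>(h.mem_requests_done) / h.mem_requests_total;
    // Not making good progress: completed work lags the elapsed time, scaled by a
    // slack factor so the priority boost is spread over the whole period rather
    // than arriving only near the deadline.
    return work_frac < slack * time_frac;
}

Because the check is re-evaluated continuously, the accelerator's high-priority episodes are spread over time, which is what limits the damage to CPU requests.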

Page 127: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation: Not All Accelerators are Equal

• Long-deadline accelerators are more likely to meet their deadlines
• Short-deadline accelerators are more likely to miss their deadlines

127

Schedule short-deadline accelerators based on worst-case memory access time

Page 128: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Observation: Not All CPU Cores are Equal

• Memory-intensive cores are much less vulnerable to interference
• Memory non-intensive cores are much more vulnerable to interference

128

Prioritize accelerators over memory-intensive cores to ensure accelerators do not become urgent

Page 129: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

DASH Summary: Key Ideas and Results

• Distribute priority for HWAs
• Prioritize HWAs over memory-intensive CPU cores even when not urgent
• Prioritize short-deadline-period HWAs based on worst-case estimates

129

Improves CPU performance by 7-21%
Meets (almost) 100% of deadlines for HWAs

Page 130: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

DASH: Scheduling Policy
n DASH scheduling policy (a priority-assignment sketch follows this slide)
1. Short-deadline-period HWAs with high priority
2. Long-deadline-period HWAs with high priority
3. Memory non-intensive CPU applications
4. Long-deadline-period HWAs with low priority
5. Memory-intensive CPU applications
6. Short-deadline-period HWAs with low priority

130

Switch probabilistically
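A compact way to read the six levels above is as a priority function over agent type and current state. The C++ sketch below is one possible encoding; the enum names and the intensity classification are assumptions for illustration.

enum class AgentKind { ShortDeadlineHWA, LongDeadlineHWA, CpuCore };

struct Agent {
    AgentKind kind;
    bool high_priority;      // HWA: currently boosted because it is not making good progress
    bool memory_intensive;   // CPU core: classified by MPKI over the last interval (assumed)
};

// Smaller value = served earlier; mirrors the six levels listed on the slide.
int dash_priority(const Agent& a) {
    switch (a.kind) {
        case AgentKind::ShortDeadlineHWA: return a.high_priority ? 1 : 6;
        case AgentKind::LongDeadlineHWA:  return a.high_priority ? 2 : 4;
        case AgentKind::CpuCore:          return a.memory_intensive ? 5 : 3;
    }
    return 6;
}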

Page 131: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on DASHn Hiroyuki Usui, Lavanya Subramanian, Kevin Kai-Wei Chang, and

Onur Mutlu,"DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators"ACM Transactions on Architecture and Code Optimization (TACO), Vol. 12, January 2016. Presented at the 11th HiPEAC Conference, Prague, Czech Republic, January 2016. [Slides (pptx) (pdf)] [Source Code]

131

Page 132: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

Page 133: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Computer ArchitectureLecture 12: Memory Interference

and Quality of Service

Prof. Onur MutluETH ZürichFall 2017

1 November 2017

Page 134: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Predictable Performance: Strong Memory Service Guarantees

134

Page 135: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Goal: Predictable Performance in Complex Systems

n Heterogeneous agents: CPUs, GPUs, and HWAs n Main memory interference between CPUs, GPUs, HWAs

135

[Figure: CPUs sharing a cache, a GPU, and HWAs, all connected to DRAM and hybrid memory controllers and to DRAM and hybrid memories]

How to allocate resources to heterogeneous agents to mitigate interference and provide predictable performance?

Page 136: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Strong Memory Service Guaranteesn Goal: Satisfy performance/SLA requirements in the

presence of shared main memory, heterogeneous agents, and hybrid memory/storage

n Approach: q Develop techniques/models to accurately estimate the

performance loss of an application/agent in the presence of resource sharing

q Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications

q All the while providing high system performance

n Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013.

n Subramanian et al., “The Application Slowdown Model,” MICRO 2015.136

Page 137: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Strong Memory Service Guaranteesn We will defer this for later…

137

Page 138: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on MISEn Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen,

and Onur Mutlu,"MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems"Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

138

Page 139: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Application Slowdown Modeln Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and

Onur Mutlu,"The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory"Proceedings of the 48th International Symposium on Microarchitecture(MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]

139

Page 140: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Handling Memory Interference In Multithreaded Applications

Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling"

Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

Page 141: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Multithreaded (Parallel) Applicationsn Threads in a multi-threaded application can be inter-

dependentq As opposed to threads from different applications

n Such threads can synchronize with each otherq Locks, barriers, pipeline stages, condition variables,

semaphores, …

n Some threads can be on the critical path of execution due to synchronization; some threads are not

n Even within a thread, some “code segments” may be on the critical path of execution; some are not

141

Page 142: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Critical Sections

n Enforce mutually exclusive access to shared data
n Only one thread can be executing it at a time
n Contended critical sections make threads wait → threads causing serialization can be on the critical path

142

Each thread:
loop {
    Compute
    lock(A)
    Update shared data
    unlock(A)
}

[Figure: thread timelines with code segments labeled N (non-critical section) and C (critical section)]

Page 143: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Barriers

n Synchronization pointn Threads have to wait until all threads reach the barriern Last thread arriving at the barrier is on the critical path

143

Each thread:
loop1 {
    Compute
}
barrier
loop2 {
    Compute
}

Page 144: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Stages of Pipelined Programsn Loop iterations are statically divided into code segments called stagesn Threads execute stages on different coresn Thread executing the slowest stage is on the critical path

144

loop {
    Compute1   // stage A
    Compute2   // stage B
    Compute3   // stage C
}

[Figure: stages A, B, and C of each iteration are executed by threads on different cores]

Page 145: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Handling Interference in Parallel Applications

n Threads in a multithreaded application are inter-dependentn Some threads can be on the critical path of execution due

to synchronization; some threads are notn How do we schedule requests of inter-dependent threads

to maximize multithreaded application performance?

n Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11] (a limiter-estimation sketch follows this slide)

n Hardware/software cooperative limiter thread estimation:
n Thread executing the most contended critical section
n Thread executing the slowest pipeline stage
n Thread that is falling behind the most in reaching a barrier

145PAMS Micro 2011 Talk
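The three estimation hints above can be combined into a small limiter-set computation, sketched below in C++; the counters assumed here (lock contention, per-stage throughput, progress toward the barrier) stand in for whatever the hardware/software interface actually exposes and are not PAMS's exact mechanism.

#include <vector>

struct ThreadInfo {
    int id;
    int lock_contention;      // times other threads waited on this thread's lock
    double stage_throughput;  // iterations/s of the pipeline stage it executes
    long work_done;           // progress toward the next barrier (e.g., iterations)
    bool limiter = false;
};

void estimate_limiters(std::vector<ThreadInfo>& ts) {
    if (ts.empty()) return;
    int contended = 0, slowest = 0, behind = 0;
    for (int i = 1; i < static_cast<int>(ts.size()); ++i) {
        if (ts[i].lock_contention > ts[contended].lock_contention) contended = i;
        if (ts[i].stage_throughput < ts[slowest].stage_throughput) slowest = i;
        if (ts[i].work_done < ts[behind].work_done) behind = i;
    }
    for (auto& t : ts) t.limiter = false;
    ts[contended].limiter = true;   // executing the most contended critical section
    ts[slowest].limiter  = true;    // executing the slowest pipeline stage
    ts[behind].limiter   = true;    // falling behind the most in reaching a barrier
    // The memory scheduler then prioritizes requests from threads with
    // limiter == true and shuffles priorities among the remaining threads.
}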

Page 146: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Prioritizing Requests from Limiter Threads

146

[Figure: execution timelines of Threads A-D with non-critical sections, critical sections 1 and 2, waiting for sync or lock, and a barrier. Identifying the most contended critical section (1) and the limiter thread over time (D, B, C, A) and prioritizing its requests shortens the critical path and saves cycles]

Page 147: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on PAMSn Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo

Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling"Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

147

Page 148: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Other Ways of Handling Memory Interference

Page 149: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Fundamental Interference Control Techniquesn Goal: to reduce/control inter-thread memory interference

1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread scheduling

149

Page 150: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Designing QoS-Aware Memory Systems: Approaches

n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanismq QoS-aware memory controllers q QoS-aware interconnectsq QoS-aware caches

n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mappingq Source throttling to control access to memory system q QoS-aware data mapping to memory controllers q QoS-aware thread scheduling to cores

150

Page 151: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Memory Channel Partitioning

Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via

Application-Aware Memory Channel Partitioning”44th International Symposium on Microarchitecture (MICRO),

Porto Alegre, Brazil, December 2011. Slides (pptx)

MCP Micro 2011 Talk

Page 152: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Observation: Modern Systems Have Multiple Channels

A new degree of freedom
Mapping data across multiple channels

152

[Figure: two cores (running the Red App and the Blue App) connect to two memory controllers, each driving its own channel (Channel 0, Channel 1) and memory]

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 153: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Data Mapping in Current Systems

153

[Figure: with the data mapping used in current systems, pages of both the Red App and the Blue App are spread across both channels]

Causes interference between applications’ requests

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 154: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Partitioning Channels Between Applications

154

[Figure: with channel partitioning, the Red App's pages map to one channel and the Blue App's pages map to the other]

Eliminates interference between applications’ requests

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 155: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Overview: Memory Channel Partitioning (MCP)

n Goalq Eliminate harmful interference between applications

n Basic Ideaq Map the data of badly-interfering applications to different

channels

n Key Principlesq Separate low and high memory-intensity applicationsq Separate low and high row-buffer locality applications

155Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 156: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Insight 1: Separate by Memory Intensity

High memory-intensity applications interfere with low memory-intensity applications in shared memory channels

156

Map data of low and high memory-intensity applications to different channels

[Figure: conventional page mapping vs. channel partitioning over two channels (Bank 0 and Bank 1 each) across five time units. With partitioning, the low memory-intensity application's requests no longer queue behind the high-intensity application's requests, saving cycles]

Page 157: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Key Insight 2: Separate by Row-Buffer Locality

157

High row-buffer locality applications interfere with low row-buffer locality applications in shared memory channels

[Figure: request buffer state (requests R0-R4) and resulting service order under conventional page mapping vs. channel partitioning. Separating low and high row-buffer locality requests across Channels 0 and 1 saves cycles for the low-locality application]

Map data of low and high row-buffer locality applications to different channels

Page 158: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Memory Channel Partitioning (MCP) Mechanism

1. Profile applications
2. Classify applications into groups
3. Partition channels between application groups
4. Assign a preferred channel to each application
5. Allocate application pages to preferred channel
(a classification sketch follows this slide)

158

Hardware

System Software

Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.
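Steps 1, 2, and 4 can be pictured as a small classification-and-assignment routine, sketched below in C++; the MPKI and row-buffer-hit thresholds and the even split of channels among groups are simplifying assumptions rather than the paper's exact policy.

#include <vector>

struct AppProfile {
    int id;
    double mpki;           // memory intensity from hardware profiling counters
    double row_hit_rate;   // row-buffer hit rate from hardware profiling counters
    int preferred_channel = -1;
};

void mcp_assign_channels(std::vector<AppProfile>& apps, int num_channels,
                         double mpki_thresh = 1.0, double rbh_thresh = 0.5) {
    // Group 0: low intensity; Group 1: high intensity, low row-buffer locality;
    // Group 2: high intensity, high row-buffer locality.
    auto group_of = [&](const AppProfile& a) {
        if (a.mpki < mpki_thresh) return 0;
        return (a.row_hit_rate < rbh_thresh) ? 1 : 2;
    };
    // Very simple partition: give each group a contiguous slice of channels and
    // spread each group's applications across its slice round-robin.
    int per_group = num_channels / 3 > 0 ? num_channels / 3 : 1;
    std::vector<int> next(3, 0);
    for (auto& a : apps) {
        int g = group_of(a);
        int base = (g * per_group) % num_channels;
        a.preferred_channel = (base + next[g]++ % per_group) % num_channels;
    }
    // The system software then steers new page allocations of each application
    // to its preferred channel (step 5), falling back to other channels when full.
}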

Page 159: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Interval Based Operation

159

[Figure: interval-based timeline (Current Interval, Next Interval): 1. Profile applications during the current interval; at the interval boundary, 2. Classify applications into groups, 3. Partition channels between groups, and 4. Assign preferred channel to applications; 5. Enforce channel preferences during the next interval]

Page 160: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Observations

n Applications with very low memory-intensity rarely access memory
→ Dedicating channels to them results in precious memory bandwidth waste

n They have the most potential to keep their cores busy
→ We would really like to prioritize them

n They interfere minimally with other applications
→ Prioritizing them does not hurt others

160

Page 161: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Integrated Memory Partitioning and Scheduling (IMPS)

n Always prioritize very low memory-intensity applications in the memory scheduler

n Use memory channel partitioning to mitigate interference between other applications

161Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 162: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Hardware Costn Memory Channel Partitioning (MCP)

q Only profiling counters in hardwareq No modifications to memory scheduling logicq 1.5 KB storage cost for a 24-core, 4-channel system

n Integrated Memory Partitioning and Scheduling (IMPS)q A single bit per requestq Scheduler prioritizes based on this single bit

162Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

Page 163: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Performance of Channel Partitioning

163

[Figure: normalized system performance of FRFCFS, ATLAS, TCM, MCP, and IMPS, averaged over 240 workloads; annotated improvements of 1%, 5%, 7%, and 11%]

Better system performance than the best previous scheduler at lower hardware cost

Averaged over 240 workloads

Page 164: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Combining Multiple Interference Control Techniques

n Combined interference control techniques can mitigate interference much more than a single technique alone can do

n The key challenge is:q Deciding what technique to apply whenq Partitioning work appropriately between software and

hardware

164

Page 165: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Memory Channel Partitioningn Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu,

Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning"Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

165

Page 166: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Fundamental Interference Control Techniquesn Goal: to reduce/control inter-thread memory interference

1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread scheduling

166

Page 167: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Fairness via Source Throttling

Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,"Fairness via Source Throttling: A Configurable and High-Performance

Fairness Substrate for Multi-Core Memory Systems"15th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS),

pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)

FST ASPLOS 2010 Talk

Page 168: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Many Shared Resources

[Figure: Core 0 through Core N share an on-chip cache and memory controller; off the chip, they share DRAM Banks 0 through K. The shared memory resources span the chip boundary (on-chip and off-chip)]

168

Page 169: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

The Problem with “Smart Resources”

n Independent interference control mechanisms in caches, interconnect, and memory can contradict each other

n Explicitly coordinating mechanisms for different resources requires complex implementation

n How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

169

Page 170: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Source Throttling: A Fairness Substraten Key idea: Manage inter-thread interference at the cores

(sources), not at the shared resources

n Dynamically estimate unfairness in the memory system n Feed back this information into a controllern Throttle cores’ memory access rates accordingly

q Whom to throttle and by how much depends on performance target (throughput, fairness, per-thread QoS, etc)

q E.g., if unfairness > system-software-specified target then
throttle down core causing unfairness &
throttle up core that was unfairly treated
(a sketch of this interval-based loop follows this slide)

n Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10, TOCS’12.

170
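Here is a minimal C++ sketch of the per-interval control loop that the next slide's figure describes; the slowdown and interference inputs are assumed to come from hardware estimation, and the single throttle_level knob stands in for the injection-rate and MSHR-quota controls.

#include <vector>

struct AppState {
    double slowdown_est;                    // estimated T_shared / T_alone
    double interference_caused_to_slowest;  // interference tracked against App-slowest
    int throttle_level;                     // 0 = full injection rate, higher = more throttled
};

void fst_interval(std::vector<AppState>& apps, double target_unfairness) {
    if (apps.size() < 2) return;
    // 1- Estimate system unfairness = max slowdown / min slowdown.
    int slowest = 0;
    double min_sd = apps[0].slowdown_est;
    for (int i = 1; i < (int)apps.size(); ++i) {
        if (apps[i].slowdown_est > apps[slowest].slowdown_est) slowest = i;
        if (apps[i].slowdown_est < min_sd) min_sd = apps[i].slowdown_est;
    }
    double unfairness = apps[slowest].slowdown_est / min_sd;
    if (unfairness <= target_unfairness) return;
    // 2- Find the app causing the most interference for App-slowest.
    int interfering = (slowest == 0) ? 1 : 0;
    for (int i = 0; i < (int)apps.size(); ++i)
        if (i != slowest &&
            apps[i].interference_caused_to_slowest > apps[interfering].interference_caused_to_slowest)
            interfering = i;
    // 3- Throttle down App-interfering (limit injection rate and parallelism),
    //    throttle up App-slowest.
    apps[interfering].throttle_level++;
    if (apps[slowest].throttle_level > 0) apps[slowest].throttle_level--;
}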

Page 171: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Fairness via Source Throttling (FST) [ASPLOS’10]

171

[Figure: FST operates in intervals (Interval 1, 2, 3, ...). Slowdown estimation feeds a Runtime Unfairness Evaluation stage whose outputs (Unfairness Estimate, App-slowest, App-interfering) drive Dynamic Request Throttling in the next interval]

Runtime Unfairness Evaluation:
1- Estimating system unfairness
2- Find app. with the highest slowdown (App-slowest)
3- Find app. causing most interference for App-slowest (App-interfering)

Dynamic Request Throttling:
if (Unfairness Estimate > Target) {
    1- Throttle down App-interfering (limit injection rate and parallelism)
    2- Throttle up App-slowest
}

Page 172: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

System Software Support

n Different fairness objectives can be configured by system software
q Keep maximum slowdown in check
n Estimated Max Slowdown < Target Max Slowdown
q Keep slowdown of particular applications in check to achieve a particular performance target
n Estimated Slowdown(i) < Target Slowdown(i)

n Support for thread priorities
q Weighted Slowdown(i) = Estimated Slowdown(i) x Weight(i)

172

Page 173: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Source Throttling Results: Takeaways

n Source throttling alone provides better performance than a combination of “smart” memory scheduling and fair cachingq Decisions made at the memory scheduler and the cache

sometimes contradict each other

n Neither source throttling alone nor “smart resources” alone provides the best performance

n Combined approaches are even more powerful q Source throttling and resource-based interference control

173

Page 174: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Core (Source) Throttlingn Idea: Estimate the slowdown due to (DRAM) interference

and throttle down threads that slow down othersq Ebrahimi et al., “Fairness via Source Throttling: A Configurable

and High-Performance Fairness Substrate for Multi-Core Memory Systems,” ASPLOS 2010.

n Advantages+ Core/request throttling is easy to implement: no need to change the

memory scheduling algorithm+ Can be a general way of handling shared resource contention+ Can reduce overall load/contention in the memory system

n Disadvantages- Requires interference/slowdown estimations à difficult to estimate- Thresholds can become difficult to optimize à throughput loss

174

Page 175: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Source Throttling (I)n Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,

"Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems"Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)

175

Page 176: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Source Throttling (II)n Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu,

"HAT: Heterogeneous Adaptive Throttling for On-Chip Networks"Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, October 2012. Slides (pptx) (pdf)

176

Page 177: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Source Throttling (III)n George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu,

and Srinivasan Seshan,"On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects"Proceedings of the 2012 ACM SIGCOMM Conference(SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx)

177

Page 178: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Fundamental Interference Control Techniquesn Goal: to reduce/control interference

1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread schedulingIdea: Pick threads that do not badly interfere with each

other to be scheduled together on cores sharing the memory system

178

Page 179: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Application-to-Core Mapping to Reduce Interference

n Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi,
"Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems"
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

n Key ideas (a mapping sketch follows this slide):
q Cluster threads to memory controllers (to reduce across-chip interference)
q Isolate interference-sensitive (low-intensity) applications in a separate cluster (to reduce interference from high-intensity applications)
q Place applications that benefit from memory bandwidth closer to the controller

179
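One way to picture the clustering and isolation policies above is the small mapping routine below (C++); the half-and-half split and the cluster numbering are assumptions, and the real policies also balance load and map bandwidth-hungry applications radially toward the controllers.

#include <algorithm>
#include <vector>

struct App { int id; double mpki; int cluster = -1; };

void a2c_map(std::vector<App>& apps, int num_clusters) {
    // Sort by memory intensity so the interference-sensitive (low-MPKI) half
    // can be isolated in its own cluster of cores/controllers.
    std::sort(apps.begin(), apps.end(),
              [](const App& a, const App& b) { return a.mpki < b.mpki; });
    int n = static_cast<int>(apps.size());
    int sensitive = n / 2;   // assumed split between sensitive and intensive apps
    for (int i = 0; i < n; ++i) {
        if (i < sensitive || num_clusters == 1)
            apps[i].cluster = 0;                               // isolated, sensitive cluster
        else
            apps[i].cluster = 1 + (i % (num_clusters - 1));    // balance intensive apps
    }
    // Within each cluster, applications that benefit most from bandwidth would
    // additionally be placed on cores closest to the memory controller (radial mapping).
}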

Page 180: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Application-to-Core Mapping

180

[Figure: the A2C mapping policies and their goals: Clustering (improve locality, reduce interference), Balancing (improve bandwidth utilization), Isolation (reduce interference), and Radial Mapping (improve bandwidth utilization)]

Page 181: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

System Performance

[Figure: normalized weighted speedup of BASE, BASE+CLS, and A2C for the MPKI500, MPKI1000, MPKI1500, and MPKI2000 workload categories and on average]

181

System performance improves by 17%

Page 182: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Network Power

182

[Figure: normalized NoC power of BASE, BASE+CLS, and A2C for the MPKI500, MPKI1000, MPKI1500, and MPKI2000 workload categories and on average]

Average network power consumption reduces by 52%

Page 183: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on App-to-Core Mappingn Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh

Kumar, and Mani Azimi,"Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems"Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

183

Page 184: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Interference-Aware Thread Schedulingn An example from scheduling in clusters (data centers)n Clusters can be running virtual machines

184

Page 185: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Virtualized Cluster

185

[Figure: two hosts, each with Core 0 and Core 1, a shared LLC, and DRAM; each host runs two VMs with one App per VM]

How to dynamically schedule VMs onto hosts?

Distributed Resource Management (DRM) policies

Page 186: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Conventional DRM Policies

186

[Figure: VMs are placed onto hosts according to each VM's CPU and memory capacity demands]

Based on operating-system-level metrics, e.g., CPU utilization, memory capacity demand

Page 187: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Microarchitecture-level Interference

187

[Figure: two VMs on one host share Core 0/Core 1, the LLC, and DRAM]

• VMs within a host compete for:
– Shared cache capacity
– Shared memory bandwidth

Can operating-system-level metrics capture the microarchitecture-level resource interference?

Page 188: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Microarchitecture Unawareness

188

[Figure: two hosts running VMs; the VM running STREAM and the VM running gromacs look nearly identical by OS-level metrics but differ sharply at the microarchitecture level]

Operating-system-level metrics:
App      | CPU Utilization | Memory Capacity
STREAM   | 92%             | 369 MB
gromacs  | 93%             | 348 MB

Microarchitecture-level metrics:
App      | LLC Hit Ratio | Memory Bandwidth
STREAM   | 2%            | 2267 MB/s
gromacs  | 98%           | 1 MB/s

Page 189: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Impact on Performance

189

[Figure: IPC (harmonic mean) under conventional DRM vs. DRM with microarchitecture awareness; with awareness, one VM is swapped between the hosts to separate the memory-bandwidth-heavy workloads]

Page 190: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Impact on Performance

190

[Figure: microarchitecture-aware placement improves IPC (harmonic mean) by 49% over conventional DRM in this example]

We need microarchitecture-level interference awareness in DRM!

Page 191: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

A-DRM: Architecture-aware DRM

• Goal: Take into account microarchitecture-level shared resource interference
– Shared cache capacity
– Shared memory bandwidth

• Key Idea:
– Monitor and detect microarchitecture-level shared resource interference
– Balance microarchitecture-level resource usage across cluster to minimize memory interference while maximizing system performance
(a detect-and-balance sketch follows this slide)

191
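The monitor-and-balance loop above can be sketched as a periodic decision over per-host architectural metrics, as in the C++ sketch below; the metric choice, the threshold, and which VM to migrate are illustrative assumptions rather than A-DRM's exact policy.

#include <utility>
#include <vector>

struct HostMetrics {
    int id;
    double mem_bandwidth_mbps;   // aggregate memory bandwidth used by the host's VMs
    double llc_miss_rate;        // could also feed the interference detector
};

// Returns {source_host, dest_host} for a migration, or {-1, -1} if the cluster is balanced.
std::pair<int, int> adrm_decide(const std::vector<HostMetrics>& hosts, double imbalance_thresh) {
    if (hosts.size() < 2) return {-1, -1};
    int hot = 0, cold = 0;
    for (int i = 1; i < (int)hosts.size(); ++i) {
        if (hosts[i].mem_bandwidth_mbps > hosts[hot].mem_bandwidth_mbps) hot = i;
        if (hosts[i].mem_bandwidth_mbps < hosts[cold].mem_bandwidth_mbps) cold = i;
    }
    double gap = hosts[hot].mem_bandwidth_mbps - hosts[cold].mem_bandwidth_mbps;
    if (gap < imbalance_thresh) return {-1, -1};
    // The migration engine would then move the most bandwidth-hungry VM
    // from hosts[hot] to hosts[cold].
    return {hosts[hot].id, hosts[cold].id};
}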

Page 192: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

A-DRM: Architecture-aware DRM

192

[Figure: each host runs VMs on an OS+hypervisor with a CPU/memory capacity profiler and an architectural resource profiler; a global Architecture-aware Resource Manager (the A-DRM controller) combines a Profiling Engine, an Architecture-aware Interference Detector, an Architecture-aware Distributed Resource Management policy, and a Migration Engine]

Page 193: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

More on Architecture-Aware DRMn Hui Wang, Canturk Isci, Lavanya Subramanian, Jongmoo Choi,

Depei Qian, and Onur Mutlu,"A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters"Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), Istanbul, Turkey, March 2015. [Slides (pptx) (pdf)]

193

Page 194: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Interference-Aware Thread Schedulingn Advantages

+ Can eliminate/minimize interference by scheduling “symbiotic applications” together (as opposed to just managing the interference)+ Less intrusive to hardware (less need to modify the hardware resources)

n Disadvantages and Limitations-- High overhead to migrate threads between cores and machines-- Does not work (well) if all threads are similar and they interfere

194

Page 195: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Summary: Fundamental Interference Control Techniques

n Goal: to reduce/control interference

1. Prioritization or request scheduling

2. Data mapping to banks/channels/ranks

3. Core/source throttling

4. Application/thread scheduling

Best is to combine all. How would you do that?195

Page 196: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Summary: Memory QoS Approaches and Techniques

n Approaches: Smart vs. dumb resourcesq Smart resources: QoS-aware memory schedulingq Dumb resources: Source throttling; channel partitioningq Both approaches are effective in reducing interferenceq No single best approach for all workloads

n Techniques: Request/thread scheduling, source throttling, memory partitioningq All approaches are effective in reducing interferenceq Can be applied at different levels: hardware vs. softwareq No single best technique for all workloads

n Combined approaches and techniques are the most powerfulq Integrated Memory Channel Partitioning and Scheduling [MICRO’11]

196MCP Micro 2011 Talk

Page 197: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

Summary: Memory Interference and QoS
n QoS-unaware memory → uncontrollable and unpredictable system

n Providing QoS awareness improves performance, predictability, fairness, and utilization of the memory system

n Discussed many new techniques to:q Minimize memory interferenceq Provide predictable performance

n Many new research ideas needed for integrated techniques and closing the interaction with software

197

Page 198: Computer Architecture - ETH Z · configurable interference control/reduction mechanism q QoS-aware memory controllers q QoS-aware interconnects q QoS-aware caches n Dumb resources:

What Did We Not Cover?

n Prefetch-aware shared resource managementn DRAM-controller co-designn Cache interference managementn Interconnect interference managementn Write-read schedulingn …

198