
Parallel Application Memory Scheduling


Page 1: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduling

Eiman Ebrahimi*

Rustam Miftakhutdinov*, Chris Fallin‡

Chang Joo Lee*+, Jose Joao*

Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin

‡ Computer Architecture Laboratory, Carnegie Mellon University

+ Intel Corporation, Austin

Page 2: Parallel Application  Memory Scheduling


Background

[Figure: a chip multiprocessor with cores 0 through N sharing a cache, a memory controller, and DRAM banks 0 through K. The shared cache and memory controller are on-chip; the DRAM banks sit beyond the chip boundary, off-chip. Together these form the shared memory resources.]

Page 3: Parallel Application  Memory Scheduling

Background

  - Memory requests from different cores interfere in shared memory resources
  - Prior work manages this interference for multi-programmed workloads, targeting system performance and fairness
  - What about a single multi-threaded application?


Page 4: Parallel Application  Memory Scheduling

Memory System Interference in a Single Multi-Threaded Application

  - Inter-dependent threads from the same application slow each other down
  - Most importantly, the critical path of execution can be significantly slowed down
  - The problem and goal are very different from interference between independent applications:
      - Interdependence between threads
      - Goal: reduce execution time of a single application
      - No notion of fairness among the threads of the same application

Page 5: Parallel Application  Memory Scheduling

Potential in a Single Multi-Threaded Application

[Figure: execution time of hist, mg, cg, is, bt, ft, and their gmean with all main-memory-related interference ideally eliminated, normalized to a system using FR-FCFS memory scheduling.]

If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average.

Page 6: Parallel Application  Memory Scheduling

Outline

  - Problem Statement
  - Parallel Application Memory Scheduling
  - Evaluation
  - Conclusion

Page 7: Parallel Application  Memory Scheduling

Outline

  - Problem Statement
  - Parallel Application Memory Scheduling
  - Evaluation
  - Conclusion

Page 8: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduler

  - Identify the set of threads likely to be on the critical path as limiter threads; prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
  - Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference; prioritize requests from threads falling behind others in a parallel for-loop
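Taken together, these rules form a fixed priority order. The sketch below illustrates them as a pairwise comparison between the threads behind two requests; all field and function names (thread_info_t, pams_higher_priority, shuffled_prio) are illustrative, not from the paper, and the parallel for-loop progress rule is omitted for brevity.

    /* Illustrative sketch of the PAMS prioritization order; not the
     * paper's implementation. Loop-progress prioritization is omitted. */
    typedef struct {
        int    is_limiter;     /* set by the runtime system                 */
        double mpki;           /* last-interval misses per kilo-instruction */
        int    shuffled_prio;  /* 1 = highest among non-limiter threads     */
    } thread_info_t;

    /* Nonzero if a request from thread a should be served before one from b. */
    int pams_higher_priority(const thread_info_t *a, const thread_info_t *b) {
        if (a->is_limiter != b->is_limiter)
            return a->is_limiter;               /* 1. limiters first          */
        if (a->is_limiter)
            return a->mpki < b->mpki;           /* 2. latency-sensitive first */
        return a->shuffled_prio < b->shuffled_prio; /* 3. shuffled priority   */
    }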

Page 9: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduler

  - Identify the set of threads likely to be on the critical path as limiter threads; prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
  - Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference; prioritize requests from threads falling behind others in a parallel for-loop

Page 10: Parallel Application  Memory Scheduling

Runtime System Limiter Identification

  - Contended critical sections are often on the critical path of execution
  - Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
      - Track the total amount of time all threads wait on each lock in a given interval
      - Identify the lock with the largest waiting time as the most contended
      - The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller
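A minimal sketch of this bookkeeping, assuming the runtime is invoked whenever a thread finishes waiting on a lock; pams_record_wait, current_holder_of, and expose_limiter_to_memory_controller are hypothetical hooks, not the paper's interface.

    #include <stdint.h>

    #define MAX_LOCKS 64

    /* Hypothetical hooks into the rest of the runtime/hardware. */
    extern int  current_holder_of(void *lock_addr);
    extern void expose_limiter_to_memory_controller(int thread_id);

    typedef struct {
        void    *lock_addr;    /* address identifying the lock                */
        uint64_t wait_cycles;  /* cycles threads spent waiting this interval  */
    } lock_stats_t;

    static lock_stats_t stats[MAX_LOCKS];
    static int          num_locks;

    /* Called by the runtime whenever a thread finishes waiting on a lock. */
    void pams_record_wait(void *lock_addr, uint64_t cycles_waited) {
        for (int i = 0; i < num_locks; i++) {
            if (stats[i].lock_addr == lock_addr) {
                stats[i].wait_cycles += cycles_waited;
                return;
            }
        }
        if (num_locks < MAX_LOCKS) {
            stats[num_locks].lock_addr   = lock_addr;
            stats[num_locks].wait_cycles = cycles_waited;
            num_locks++;
        }
    }

    /* At the end of each interval, the holder of the lock with the largest
     * accumulated waiting time is marked as the limiter thread. */
    void pams_end_interval(void) {
        int      most_contended = -1;
        uint64_t max_wait       = 0;
        for (int i = 0; i < num_locks; i++) {
            if (stats[i].wait_cycles > max_wait) {
                max_wait       = stats[i].wait_cycles;
                most_contended = i;
            }
            stats[i].wait_cycles = 0;   /* reset for the next interval */
        }
        if (most_contended >= 0)
            expose_limiter_to_memory_controller(
                current_holder_of(stats[most_contended].lock_addr));
    }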

Page 11: Parallel Application  Memory Scheduling

Prioritizing Requests from Limiter Threads

[Figure: limiter thread identification. Execution timelines of threads A-D between barriers, with critical sections 1 and 2, non-critical sections, and waiting-for-sync-or-lock periods marked. As the most contended critical section changes over time, the thread identified as the limiter changes (D, then B, then C, then A); prioritizing the limiter's requests shortens the critical path and saves cycles.]

Page 12: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduler

  - Identify the set of threads likely to be on the critical path as limiter threads; prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
  - Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference; prioritize requests from threads falling behind others in a parallel for-loop

Page 13: Parallel Application  Memory Scheduling

Time-based classification of threads as latency- vs. BW-sensitive

[Figure: execution timelines of threads A-D between barriers, divided into time intervals 1 and 2, with critical sections, non-critical sections, and waiting-for-sync periods marked. Threads are classified within each fixed time interval.]

Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10]
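As a rough illustration of per-interval classification, the sketch below labels a thread from its MPKI over the last interval. The fixed cutoff is an assumption made here for illustration; TCM's actual scheme clusters threads by memory intensity rather than applying a single threshold.

    #include <stdint.h>

    enum thread_class { LATENCY_SENSITIVE, BANDWIDTH_SENSITIVE };

    /* Illustrative cutoff, not a value from TCM or PAMS. */
    #define MPKI_THRESHOLD 5.0

    /* Classify a thread from its last-interval miss and instruction counts. */
    enum thread_class classify(uint64_t misses, uint64_t instructions) {
        if (instructions == 0)
            return LATENCY_SENSITIVE;   /* idle thread: treat as low-intensity */
        double mpki = 1000.0 * (double)misses / (double)instructions;
        return (mpki < MPKI_THRESHOLD) ? LATENCY_SENSITIVE
                                       : BANDWIDTH_SENSITIVE;
    }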

Page 14: Parallel Application  Memory Scheduling

Terminology

  - A code-segment is defined as:
      - A program region between two consecutive synchronization operations
      - Identified with a 2-tuple: <beginning IP, lock address>
  - Important for classifying threads as latency- vs. bandwidth-sensitive
      - Time-based vs. code-segment-based classification
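The 2-tuple maps directly onto a small structure; a minimal sketch (names are illustrative, not from the paper):

    #include <stdint.h>

    /* A code-segment: the region between two consecutive synchronization
     * operations, identified by where it begins and which lock is involved. */
    typedef struct {
        uintptr_t begin_ip;    /* IP of the instruction following the sync op */
        uintptr_t lock_addr;   /* address of the associated lock              */
    } code_segment_t;

    /* Dynamic instances with the same tuple are the same code-segment, so
     * behavior observed for one instance can classify later instances. */
    int same_segment(code_segment_t a, code_segment_t b) {
        return a.begin_ip == b.begin_ip && a.lock_addr == b.lock_addr;
    }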

Page 15: Parallel Application  Memory Scheduling

Code-segment-based classification of threads as latency- vs. BW-sensitive

[Figure: the same execution timelines of threads A-D between barriers, now divided at code-segment changes (code-segments 1 and 2) rather than at fixed time-interval boundaries, with critical sections, non-critical sections, and waiting-for-sync periods marked.]

Page 16: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduler

  - Identify the set of threads likely to be on the critical path as limiter threads; prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
  - Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference; prioritize requests from threads falling behind others in a parallel for-loop

Page 17: Parallel Application  Memory Scheduling

Shuffling Priorities of Non-Limiter Threads

  - Goal: reduce inter-thread interference among a set of threads with the same importance in terms of our estimation of the critical path
      - Prevent any of these threads from becoming new bottlenecks
  - Basic idea: give each thread a chance to be high priority in the memory system and exploit intra-thread bank parallelism and row-buffer locality
  - Every interval, assign a set of random priorities to the threads and shuffle priorities at the end of the interval
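A minimal sketch of the interval-based shuffling, assuming four non-limiter threads and a standard Fisher-Yates shuffle; function names are illustrative.

    #include <stdlib.h>

    #define NUM_THREADS 4

    static int priority[NUM_THREADS];   /* 1 = highest priority */

    /* Assign a random permutation of priorities 1..N (Fisher-Yates). */
    void shuffle_priorities(void) {
        for (int i = 0; i < NUM_THREADS; i++)
            priority[i] = i + 1;
        for (int i = NUM_THREADS - 1; i > 0; i--) {
            int j   = rand() % (i + 1);
            int tmp = priority[i];
            priority[i] = priority[j];
            priority[j] = tmp;
        }
    }

    /* Called at the end of each shuffling interval: every non-limiter
     * thread periodically gets a turn at high priority. */
    void on_interval_end(void) {
        shuffle_priorities();
    }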

Page 18: Parallel Application  Memory Scheduling

Shuffling Priorities of Non-Limiter Threads

[Figure: execution timelines of threads A-D between barriers under a baseline with no shuffling and under two shuffling policies, shown once for threads with similar memory behavior and once for threads with different memory behavior. Relative to the baseline, each policy saves cycles in one case and can lose cycles in the other.]

PAMS dynamically monitors memory intensities and chooses the appropriate shuffling policy for non-limiter threads at runtime.
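How that choice is made is not spelled out on the slide; the sketch below is one hedged guess, assuming the spread of per-thread MPKI stands in for "similar vs. different memory behavior", and that the threshold and the mapping to Policy 1 vs. Policy 2 are placeholders.

    /* Illustrative threshold for "similar" memory intensity; hypothetical. */
    #define SIMILARITY_THRESHOLD 1.0

    enum shuffle_policy { SHUFFLE_POLICY_1, SHUFFLE_POLICY_2 };

    /* Choose a shuffling policy from the spread of per-thread MPKI values;
     * which concrete policy fits which case is an assumption here. */
    enum shuffle_policy choose_shuffle_policy(const double *mpki, int n) {
        double min = mpki[0], max = mpki[0];
        for (int i = 1; i < n; i++) {
            if (mpki[i] < min) min = mpki[i];
            if (mpki[i] > max) max = mpki[i];
        }
        return (max - min < SIMILARITY_THRESHOLD) ? SHUFFLE_POLICY_1
                                                  : SHUFFLE_POLICY_2;
    }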

Page 19: Parallel Application  Memory Scheduling

Outline

  - Problem Statement
  - Parallel Application Memory Scheduling
  - Evaluation
  - Conclusion

Page 20: Parallel Application  Memory Scheduling

Evaluation Methodology

  - x86 cycle-accurate simulator
  - Baseline processor configuration:
      - Per-core: 4-wide issue, out-of-order, 64-entry ROB
      - Shared (16-core system): 128 MSHRs; 4 MB, 16-way L2 cache
      - Main memory: DDR3 1333 MHz; latency of 15 ns per command (tRP, tRCD, CL); 8B-wide core-to-memory bus

Page 21: Parallel Application  Memory Scheduling

PAMS Evaluation

[Figure: execution time of hist, mg, cg, is, bt, ft, and gmean under the thread cluster memory scheduler [Kim+, MICRO'10], a thread criticality prediction (TCP)-based memory scheduler, and the parallel application memory scheduler, all normalized to FR-FCFS. On average, PAMS outperforms TCM by 13% and the TCP-based scheduler by 7%.]

Thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09]

Page 22: Parallel Application  Memory Scheduling

Sensitivity to System Parameters

  L2 Cache Size:          4 MB       8 MB       16 MB
  Δ vs. FR-FCFS:         -10.5%     -15.9%     -16.7%

  Memory Channels:        1 Channel  2 Channels  4 Channels
  Δ vs. FR-FCFS:         -10.4%     -11.6%      -16.7%

Page 23: Parallel Application  Memory Scheduling

Conclusion

  - Inter-thread main memory interference within a multi-threaded application increases execution time
  - Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
      - Identifying the set of threads likely to be on the critical path and prioritizing requests from them
      - Periodically shuffling the priorities of non-likely-critical threads to reduce inter-thread interference among them
  - PAMS significantly outperforms:
      - The best previous memory scheduler designed for multi-programmed workloads
      - A memory scheduler that uses a state-of-the-art thread criticality predictor (TCP)

Page 24: Parallel Application  Memory Scheduling

Parallel Application Memory Scheduling

Eiman Ebrahimi*

Rustam Miftakhutdinov*, Chris Fallin‡

Chang Joo Lee*+, Jose Joao*

Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin

‡ Computer Architecture Laboratory, Carnegie Mellon University

+ Intel Corporation, Austin