1
Adaptive History-Based Memory Schedulers
Ibrahim Hur and Calvin Lin
IBM Austin
The University of Texas at Austin
2
Memory Bottleneck
Memory system performance is not increasing as fast as CPU performance
Latency: Use caches, prefetching, …
Bandwidth: Use parallelism inside memory system
[Figure: performance vs. time on a log scale, 1980–2000; CPU performance grows far faster than memory performance.]
3
How to Increase Memory Command Parallelism?
Original order: Read Bank 0, Read Bank 0, Read Bank 1  (bank conflict)
Better order: Read Bank 0, Read Bank 1, Read Bank 0
[Diagram: DRAM with Banks 0–3]
As with instruction scheduling, we can reorder commands for higher bandwidth
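The reordering shown above can be sketched as a greedy pass over the command queue: issue the oldest command whose target bank is free. This is an illustrative sketch, not the Power5 arbiter; the timing model (one issue slot per cycle, a fixed bank busy time) is a made-up simplification.

```python
def reorder_for_banks(commands, num_banks=4, bank_busy_cycles=3):
    """Greedy conflict-aware reordering (illustrative only).

    `commands` is a list of bank indices in arrival order.
    Returns (issue_order, total_cycles).
    """
    pending = list(commands)
    free_at = [0] * num_banks           # cycle at which each bank frees up
    issued, cycle = [], 0
    while pending:
        for i, bank in enumerate(pending):
            if free_at[bank] <= cycle:  # oldest command whose bank is idle
                issued.append(pending.pop(i))
                free_at[bank] = cycle + bank_busy_cycles
                break
        cycle += 1
    return issued, cycle

# In arrival order, Read B0, Read B0, Read B1 stalls on the bank
# conflict; the greedy pass issues B0, B1, B0 instead.
order, _ = reorder_for_banks([0, 0, 1])
```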
4
Inside the Memory System
[Diagram: caches feed the Read Queue and Write Queue (not FIFO) in the Memory Controller; the arbiter moves commands into the Memory Queue (FIFO), which feeds DRAM.]
The arbiter schedules memory operations
5
Our Work
Study memory command scheduling in the context of the IBM Power5
Present new memory arbiters
20% increased bandwidth
Very little cost: 0.04% increase in chip area
6
Outline
The Problem
  Characteristics of DRAM
  Previous Scheduling Methods
Our Approach
  History-based schedulers
  Adaptive history-based schedulers
Results
Conclusions
7
Understanding the Problem:Characteristics of DRAM
Multi-dimensional structure
  Banks, rows, and columns
  IBM Power5: ranks and ports as well
Access time is not uniform
  Bank-to-bank conflicts
  Read after Write to the same rank conflict
  Write after Read to different port conflict
  ...
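The non-uniform access costs listed above can be captured as a pairwise cost function over consecutive commands. A toy model follows; the cycle penalties are invented for illustration (real values come from the DRAM datasheet and controller design), and the tuple layout is a hypothetical encoding.

```python
# Toy contention-cost model for consecutive DRAM commands.
# Commands are (op, rank, bank, port) tuples; op is "R" or "W".
# Penalty values below are made up, not Power5 or DDR2 figures.
def contention_cost(prev, cur):
    """Extra delay (cycles) for issuing `cur` right after `prev`."""
    cost = 0
    if prev[1] == cur[1] and prev[2] == cur[2]:
        cost += 4   # same rank and bank: bank-to-bank conflict
    if prev[0] == "W" and cur[0] == "R" and prev[1] == cur[1]:
        cost += 3   # Read after Write to the same rank
    if prev[0] == "R" and cur[0] == "W" and prev[3] != cur[3]:
        cost += 2   # Write after Read on a different port
    return cost
```

A scheduler can then prefer, among the queued commands, the one with the smallest cost relative to the command just issued.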
8
Previous Scheduling Approaches: FIFO Scheduling
[Diagram: caches → Read Queue / Write Queue → arbiter → Memory Queue (FIFO) → DRAM]
9
Memoryless Scheduling
[Diagram: same structure as FIFO scheduling; a conflicting command incurs a long delay.]
Adapted from Rixner et al., ISCA 2000
10
What we really want
Keep the pipeline full; don’t hold commands in the reorder queues until conflicts are totally resolved
Forward them to memory queue in an order to minimize future conflicts
[Diagram: the arbiter draws commands from the Read/Write queues and fills the memory queue; between candidate orderings, the one with the lower conflict cost is better.]
To do this we need to know the history of the commands
11
Another Goal: Match Application’s Memory Command Behavior
Arbiter should select commands from queues roughly in the ratio in which the application generates them
Otherwise, the read or write queue may become congested
Command history is useful here too
12
Our Approach: History-Based Memory Schedulers
Benefits:
  Minimize contention costs; consider multiple constraints
  Match the application's memory access behavior (2 Reads per Write? 1 Read per Write? ...)
The result: a less congested memory system, i.e. more bandwidth
13
How does it work?
Use a Finite State Machine (FSM)
Each state in the FSM represents one possible history
Transitions out of a state are prioritized
At any state, scheduler selects the available command with the highest priority
FSM is generated at design time
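The selection loop above can be sketched in a few lines. This is a toy version with a hypothetical one-command history and invented priority tables; the real FSM uses a longer history and priorities derived at design time from the DRAM constraints.

```python
# Toy history-based arbiter: the state is the last command type sent,
# and each state stores a fixed priority order over command types.
# The priority tables below are invented for illustration.
PRIORITIES = {
    "R": ["W", "R"],   # after a Read, prefer a Write next
    "W": ["R", "W"],   # after a Write, prefer a Read next
}

def select_command(state, available):
    """Return the highest-priority available command type, or None."""
    for cmd in PRIORITIES[state]:
        if available.get(cmd, 0) > 0:
            return cmd
    return None

def schedule(initial_state, available):
    """Drain `available` ({type: count}) using the priority tables."""
    state, order = initial_state, []
    while True:
        cmd = select_command(state, available)
        if cmd is None:
            break
        available[cmd] -= 1
        order.append(cmd)
        state = cmd                      # next state = last command sent
    return order

# With 2 Reads and 2 Writes queued after a Read, the toy tables
# interleave the two types rather than bursting one of them.
trace = schedule("R", {"R": 2, "W": 2})
```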
14
An Example
First Preference
Second Preference
Third Preference
Fourth Preference
most appropriate command to memory
available commands in reorder queues
next state
current state
15
How to determine priorities?
Two criteria:
  A: minimize contention costs
  B: satisfy the program's Read/Write command mix
First method: use A, break ties with B
Second method: use B, break ties with A
Which method to use? Combine the two methods probabilistically (details in the paper)
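One way the probabilistic combination might look: rank the candidates under each criterion and, per decision, use criterion A with some probability. The ranking functions and the mixing probability below are illustrative stand-ins; the paper derives the actual scheme.

```python
import random

# Candidates are (contention_cost, type) pairs; lower rank wins.
def rank_a(cmd, wanted="R"):
    """Criterion A: cost first, break ties by preferring `wanted`."""
    cost, kind = cmd
    return (cost, 0 if kind == wanted else 1)

def rank_b(cmd, wanted="R"):
    """Criterion B: prefer `wanted`, break ties by cost."""
    cost, kind = cmd
    return (0 if kind == wanted else 1, cost)

def pick(candidates, p_a, rng=random):
    """Use criterion A with probability p_a, otherwise criterion B."""
    rank = rank_a if rng.random() < p_a else rank_b
    return min(candidates, key=rank)

# A cheap Write vs. a costlier Read: criterion A picks the Write,
# criterion B picks the Read.
cmds = [(1, "W"), (3, "R")]
```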
16
Limitation of the History-Based Approach
Designed for one particular mix of Reads/Writes
Solution: adaptive history-based schedulers
  Create multiple state machines: one for each Read/Write mix
  Periodically select the most appropriate state machine
17
Adaptive History-Based Schedulers
[Diagram: three arbiters tuned for 2R:1W, 1R:1W, and 1R:2W command mixes; arbiter selection logic uses read, write, and cycle counters to select one.]
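The periodic selection can be sketched as choosing the arbiter whose design ratio best matches the counters from the last epoch. The three ratios match the slide; the nearest-ratio distance metric is an assumption about how "most appropriate" is decided.

```python
def select_arbiter(read_count, write_count):
    """Pick the arbiter whose design Read:Write ratio is closest to
    the observed one. Counters are assumed to be sampled every N
    cycles (tracked by the cycle counter) and then reset."""
    observed = read_count / max(write_count, 1)
    targets = {"2R:1W": 2.0, "1R:1W": 1.0, "1R:2W": 0.5}
    return min(targets, key=lambda a: abs(targets[a] - observed))

# e.g. 190 reads vs. 100 writes in the last epoch -> closest to 2R:1W
```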
18
Evaluation
Used a cycle-accurate simulator for the IBM Power5 (1.6 GHz, DDR2-266, 4 ranks, 4 banks, 2 ports)
Evaluated our approach against previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks
19
The IBM Power5
[Die photo: the Memory Controller occupies 1.6% of chip area]
2 cores on a chip
SMT capability
Large on-chip L2 cache
Hardware prefetching
276 million transistors
20
Results 1: Stream Benchmarks
[Chart: normalized execution time (%) for copy, scale, sum, and triad under FIFO, Memoryless, and Adaptive History-Based scheduling.]
21
Results 2: NAS Benchmarks
[Chart: normalized execution time (%) for the NAS benchmarks bt, cg, ep, ft, is, lu, mg, sp, and their mean under FIFO, Memoryless, and Adaptive History-Based scheduling (1 core active).]
22
Results 3: Microbenchmarks
[Chart: normalized execution time (%) for microbenchmarks with Read:Write mixes 4r0w, 2r0w, 1r0w, 8r1w, 4r1w, 3r1w, 2r1w, 3r2w, 1r1w, 1r2w, 1r4w, 0r1w, 0r2w, and 0r4w under FIFO, Memoryless, and Adaptive History-Based scheduling.]
23
[Diagram: caches → Read Queue / Write Queue → arbiter → Memory Queue (FIFO) → DRAM; 12 concurrent commands in the memory system.]
24
DRAM Utilization
[Histograms: number of occurrences vs. number of active commands in DRAM (1–12), for our approach and for the Memoryless approach.]
25
Why does it work?
[Diagram: Memory Controller with caches, Read/Write queues, arbiter, Memory Queue, and DRAM; annotations: low occupancy in the reorder queues, full reorder queues, full memory queue, busy memory system.]
detailed analysis in the paper
26
Other Results
We obtain >95% of the performance of a perfect DRAM configuration (no conflicts)
Results with higher frequency and with no data prefetching are in the paper
A history size of 2 works well
27
Conclusions
Introduced adaptive history-based schedulers
Evaluated on a highly tuned system, IBM Power5
Performance improvement
  Over FIFO: Stream 63%, NAS 11%
  Over Memoryless: Stream 19%, NAS 5%
Little cost: 0.04% chip area increase
28
Conclusions (cont.)
Similar arbiters can be used elsewhere as well, e.g. in cache controllers
Can optimize for other criteria, e.g. power or power + performance
29
Thank you