1
Adaptive History-Based Memory Schedulers
Ibrahim Hur and Calvin Lin
IBM Austin
The University of Texas at Austin
2
Memory Bottleneck
Memory system performance is not increasing as fast as CPU performance
Latency: Use caches, prefetching, …
Bandwidth: Use parallelism inside memory system
[Figure: performance vs. time on a log scale, 1980–2000; CPU performance grows far faster than memory performance.]
3
How to Increase Memory Command Parallelism?
Original order: Read Bank 0, Read Bank 0, Read Bank 1  (bank conflict)
Better order: Read Bank 0, Read Bank 1, Read Bank 0
[Diagram: DRAM with Banks 0–3]
As with instruction scheduling, we can reorder commands for higher bandwidth
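The reordering shown above can be sketched as a greedy pass over the command queue: issue the oldest command whose target bank is free. This is an illustrative sketch, not the Power5 arbiter; the timing model (one issue slot per cycle, a fixed bank busy time) is a made-up simplification.

```python
def reorder_for_banks(commands, num_banks=4, bank_busy_cycles=3):
    """Greedy conflict-aware reordering (illustrative only).

    `commands` is a list of bank indices in arrival order.
    Returns (issue_order, total_cycles).
    """
    pending = list(commands)
    free_at = [0] * num_banks           # cycle at which each bank frees up
    issued, cycle = [], 0
    while pending:
        for i, bank in enumerate(pending):
            if free_at[bank] <= cycle:  # oldest command whose bank is idle
                issued.append(pending.pop(i))
                free_at[bank] = cycle + bank_busy_cycles
                break
        cycle += 1
    return issued, cycle

# In arrival order, Read B0, Read B0, Read B1 stalls on the bank
# conflict; the greedy pass issues B0, B1, B0 instead.
order, _ = reorder_for_banks([0, 0, 1])
```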
4
Inside the Memory System
[Diagram: caches feed the Read Queue and Write Queue (not FIFO) in the Memory Controller; the arbiter moves commands into the Memory Queue (FIFO), which feeds DRAM.]
The arbiter schedules memory operations
5
Our Work
Study memory command scheduling in the context of the IBM Power5
Present new memory arbiters
20% increased bandwidth
Very little cost: 0.04% increase in chip area
6
Outline
The Problem
  Characteristics of DRAM
  Previous Scheduling Methods
Our Approach
  History-based schedulers
  Adaptive history-based schedulers
Results
Conclusions
7
Understanding the Problem:Characteristics of DRAM
Multi-dimensional structure
  Banks, rows, and columns
  IBM Power5: ranks and ports as well
Access time is not uniform
  Bank-to-bank conflicts
  Read after Write to the same rank conflict
  Write after Read to different port conflict
  ...
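The non-uniform access costs listed above can be captured as a pairwise cost function over consecutive commands. A toy model follows; the cycle penalties are invented for illustration (real values come from the DRAM datasheet and controller design), and the tuple layout is a hypothetical encoding.

```python
# Toy contention-cost model for consecutive DRAM commands.
# Commands are (op, rank, bank, port) tuples; op is "R" or "W".
# Penalty values below are made up, not Power5 or DDR2 figures.
def contention_cost(prev, cur):
    """Extra delay (cycles) for issuing `cur` right after `prev`."""
    cost = 0
    if prev[1] == cur[1] and prev[2] == cur[2]:
        cost += 4   # same rank and bank: bank-to-bank conflict
    if prev[0] == "W" and cur[0] == "R" and prev[1] == cur[1]:
        cost += 3   # Read after Write to the same rank
    if prev[0] == "R" and cur[0] == "W" and prev[3] != cur[3]:
        cost += 2   # Write after Read on a different port
    return cost
```

A scheduler can then prefer, among the queued commands, the one with the smallest cost relative to the command just issued.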
8
Previous Scheduling Approaches: FIFO Scheduling
[Diagram: caches → Read Queue / Write Queue → arbiter → Memory Queue (FIFO) → DRAM]
9
Memoryless Scheduling
[Diagram: same structure as FIFO scheduling; a conflicting command incurs a long delay.]
Adapted from Rixner et al., ISCA 2000
10
What we really want
Keep the pipeline full; don’t hold commands in the reorder queues until conflicts are totally resolved
Forward them to memory queue in an order to minimize future conflicts
[Diagram: the arbiter draws commands from the Read/Write queues and fills the memory queue; between candidate orderings, the one with the lower conflict cost is better.]
To do this we need to know the history of the commands
11
Another Goal: Match Application’s Memory Command Behavior
Arbiter should select commands from queues roughly in the ratio in which the application generates them
Otherwise, the read or write queue may become congested
Command history is useful here too
12
Our Approach: History-Based Memory Schedulers
Benefits:
  Minimize contention costs; consider multiple constraints
  Match the application's memory access behavior (2 Reads per Write? 1 Read per Write? ...)
The result: a less congested memory system, i.e. more bandwidth
13
How does it work?
Use a Finite State Machine (FSM)
Each state in the FSM represents one possible history
Transitions out of a state are prioritized
At any state, scheduler selects the available command with the highest priority
FSM is generated at design time
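The selection loop above can be sketched in a few lines. This is a toy version with a hypothetical one-command history and invented priority tables; the real FSM uses a longer history and priorities derived at design time from the DRAM constraints.

```python
# Toy history-based arbiter: the state is the last command type sent,
# and each state stores a fixed priority order over command types.
# The priority tables below are invented for illustration.
PRIORITIES = {
    "R": ["W", "R"],   # after a Read, prefer a Write next
    "W": ["R", "W"],   # after a Write, prefer a Read next
}

def select_command(state, available):
    """Return the highest-priority available command type, or None."""
    for cmd in PRIORITIES[state]:
        if available.get(cmd, 0) > 0:
            return cmd
    return None

def schedule(initial_state, available):
    """Drain `available` ({type: count}) using the priority tables."""
    state, order = initial_state, []
    while True:
        cmd = select_command(state, available)
        if cmd is None:
            break
        available[cmd] -= 1
        order.append(cmd)
        state = cmd                      # next state = last command sent
    return order

# With 2 Reads and 2 Writes queued after a Read, the toy tables
# interleave the two types rather than bursting one of them.
trace = schedule("R", {"R": 2, "W": 2})
```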
14
An Example
First Preference
Second Preference
Third Preference
Fourth Preference
most appropriate command to memory
available commands in reorder queues
next state
current state
15
How to determine priorities?
Two criteria:
  A: minimize contention costs
  B: satisfy the program's Read/Write command mix
First method: use A, break ties with B
Second method: use B, break ties with A
Which method to use? Combine the two methods probabilistically (details in the paper)
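One way the probabilistic combination might look: rank the candidates under each criterion and, per decision, use criterion A with some probability. The ranking functions and the mixing probability below are illustrative stand-ins; the paper derives the actual scheme.

```python
import random

# Candidates are (contention_cost, type) pairs; lower rank wins.
def rank_a(cmd, wanted="R"):
    """Criterion A: cost first, break ties by preferring `wanted`."""
    cost, kind = cmd
    return (cost, 0 if kind == wanted else 1)

def rank_b(cmd, wanted="R"):
    """Criterion B: prefer `wanted`, break ties by cost."""
    cost, kind = cmd
    return (0 if kind == wanted else 1, cost)

def pick(candidates, p_a, rng=random):
    """Use criterion A with probability p_a, otherwise criterion B."""
    rank = rank_a if rng.random() < p_a else rank_b
    return min(candidates, key=rank)

# A cheap Write vs. a costlier Read: criterion A picks the Write,
# criterion B picks the Read.
cmds = [(1, "W"), (3, "R")]
```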
16
Limitation of the History-Based Approach
Designed for one particular mix of Reads/Writes
Solution: adaptive history-based schedulers
  Create multiple state machines: one for each Read/Write mix
  Periodically select the most appropriate state machine
17
Adaptive History-Based Schedulers
[Diagram: three arbiters tuned for 2R:1W, 1R:1W, and 1R:2W command mixes; arbiter selection logic uses read, write, and cycle counters to select one.]
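The periodic selection can be sketched as choosing the arbiter whose design ratio best matches the counters from the last epoch. The three ratios match the slide; the nearest-ratio distance metric is an assumption about how "most appropriate" is decided.

```python
def select_arbiter(read_count, write_count):
    """Pick the arbiter whose design Read:Write ratio is closest to
    the observed one. Counters are assumed to be sampled every N
    cycles (tracked by the cycle counter) and then reset."""
    observed = read_count / max(write_count, 1)
    targets = {"2R:1W": 2.0, "1R:1W": 1.0, "1R:2W": 0.5}
    return min(targets, key=lambda a: abs(targets[a] - observed))

# e.g. 190 reads vs. 100 writes in the last epoch -> closest to 2R:1W
```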
18
Evaluation
Used a cycle-accurate simulator for the IBM Power5 (1.6 GHz, DDR2-266, 4 ranks, 4 banks, 2 ports)
Evaluated our approach against previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks
19
The IBM Power5
[Die photo: the Memory Controller occupies 1.6% of chip area]
2 cores on a chip
SMT capability
Large on-chip L2 cache
Hardware prefetching
276 million transistors
20
Results 1: Stream Benchmarks
[Chart: normalized execution time (%) for copy, scale, sum, and triad under FIFO, Memoryless, and Adaptive History-Based scheduling.]
21
Results 2: NAS Benchmarks
[Chart: normalized execution time (%) for the NAS benchmarks bt, cg, ep, ft, is, lu, mg, sp, and their mean under FIFO, Memoryless, and Adaptive History-Based scheduling (1 core active).]
22
Results 3: Microbenchmarks
[Chart: normalized execution time (%) for microbenchmarks with Read:Write mixes 4r0w, 2r0w, 1r0w, 8r1w, 4r1w, 3r1w, 2r1w, 3r2w, 1r1w, 1r2w, 1r4w, 0r1w, 0r2w, and 0r4w under FIFO, Memoryless, and Adaptive History-Based scheduling.]
23
[Diagram: caches → Read Queue / Write Queue → arbiter → Memory Queue (FIFO) → DRAM; 12 concurrent commands in the memory system.]
24
DRAM Utilization
[Histograms: number of occurrences vs. number of active commands in DRAM (1–12), for our approach and for the Memoryless approach.]
25
Why does it work?
[Diagram: Memory Controller with caches, Read/Write queues, arbiter, Memory Queue, and DRAM; annotations: low occupancy in the reorder queues, full reorder queues, full memory queue, busy memory system.]
detailed analysis in the paper
26
Other Results
We obtain >95% of the performance of a perfect DRAM configuration (no conflicts)
Results with higher frequency and with no data prefetching are in the paper
A history size of 2 works well
27
Conclusions
Introduced adaptive history-based schedulers
Evaluated on a highly tuned system, IBM Power5
Performance improvement
  Over FIFO: Stream 63%, NAS 11%
  Over Memoryless: Stream 19%, NAS 5%
Little cost: 0.04% chip area increase
28
Conclusions (cont.)
Similar arbiters can be used elsewhere as well, e.g. in cache controllers
Can optimize for other criteria, e.g. power or power + performance
29
Thank you