A Programmable Memory Hierarchy for Prefetching Linked Data Structures
Chia-Lin Yang
Department of Computer Science and Information Engineering
National Taiwan University
Alvin R. Lebeck
Department of Computer Science
Duke University
2
Memory Wall
• Processor-memory gap grows over time
• Prefetching
  – What? Future address prediction
  – When? Prefetch schedule
[Figure: Processor-Memory Gap, 1980-2000 (log scale): CPU performance improves ~60%/yr, DRAM performance ~10%/yr]
3
• Linked data structures
– No regularity in the address stream
  • Adjacent elements are not necessarily contiguous in memory
– Pointer-chasing problem
Prefetch Linked Data Structures (LDS)
p = head;
while (p) {
    work(p->data);
    p = p->next;
}

while (p) {
    prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}

[Figure: linked list showing the node currently visited (p) and the node three links ahead that we would like to prefetch]
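The pull-style loop above can be made concrete as a compilable C sketch. It uses the GCC/Clang `__builtin_prefetch` intrinsic in place of the slide's `prefetch`; the `node`, `traverse`, and `work` names are illustrative, not from the paper. Note that merely forming the prefetch address chases three pointers serially, which is the pointer-chasing problem itself:

```c
#include <stddef.h>

typedef struct node {
    int data;
    struct node *next;
} node;

static long sum;
static void work(int d) { sum += d; }   /* stand-in for per-node computation */

/* Pull-model software prefetch, three nodes ahead. The guard loads
   (q->next, ...) themselves chase pointers one at a time -- exactly the
   serial dependence the push model is designed to break. */
void traverse(node *p)
{
    while (p) {
        node *q = p->next;
        if (q && (q = q->next) != NULL && (q = q->next) != NULL)
            __builtin_prefetch(q);      /* GCC/Clang intrinsic */
        work(p->data);
        p = p->next;
    }
}
```

The NULL guards are required because the slide's `p->next->next->next` would dereference past the end of a short list.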
4
The Push Architecture
• An LDS prefetching framework built on a novel data movement model: Push (Yang'2000)
[Figure: traditional pull model, in which each level requests (req) data from the level below, vs. the new push model, in which lower levels push data up through L2 and L1]
5
Outline
• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion
6
Block Diagram of the Push Architecture
[Figure: block diagram: a prefetch engine attached to each level (L1, L2, main memory) issues prefetch requests; prefetched data is placed in a prefetch buffer beside L1; the levels are connected by the L2 bus and the memory bus]
7
How to Predict Future Addresses?
• LDS traversal kernels
• Load instructions in LDS traversal kernels are a compact representation of LDS accesses [Roth’98]
• PFEs execute LDS traversal kernels independent of the CPU
• The amount of computation between node accesses affects how far the PFE could run ahead of the CPU
while (list != NULL) {
    p = list->x;
    process(p->data);
    list = list->next;    /* recurrent load */
}
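As a sketch of the idea (the names are illustrative, not the paper's), the address-generating slice of a loop like the one above can be separated from its computation; the resulting traversal kernel touches exactly the nodes the application will visit and is what a PFE would execute on its own:

```c
#include <stddef.h>

typedef struct lnode {
    struct lnode *next;
    int data;
} lnode;

/* Application loop: computation interleaved with the recurrent load. */
long app_loop(lnode *list)
{
    long total = 0;
    while (list != NULL) {
        total += list->data;    /* process(p->data) */
        list = list->next;      /* recurrent load */
    }
    return total;
}

/* Extracted traversal kernel: only the address-generating slice of the
   loop above. A PFE runs this independently of the CPU, touching each
   node so it can be pushed up the hierarchy. It returns the node count
   purely so the sketch is testable. */
long kernel(lnode *list)
{
    long visited = 0;
    while (list != NULL) {
        visited++;
        list = list->next;      /* same recurrent load */
    }
    return visited;
}
```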
8
The Pointer-Chasing Problem: how does the push model help?
• Push model: pipelined process
[Figure: timing diagram across L1, L2, and main memory: in the push model, generating the address of the next node (a1, a2) overlaps with transferring the previous node's data up the hierarchy (x1, x2), so successive node accesses are pipelined instead of each paying a full round-trip latency]
9
Push Architecture Design Issues
[Figure: push hierarchy: CPU, L1, L2, and main memory, with a PFE and a cache/memory controller at each level]
1. PFE architecture design
2. Interaction scheme
3. Synchronization between the CPU and PFE execution
4. Redundant prefetch
5. Demands on the cache/memory controller
10
ISSUE #1: PFE Architecture
• Programmable PFE
  – General-purpose processor core
  – 5-stage pipeline, in-order processor
  – Integer ALU units for address calculation & control flow
  – TLB for address translation
  – Root register to store the root address of the LDS being traversed
11
Issue #2: Interaction among PFEs
Tree(root);

Tree(node) {
    if (node) {
        Tree(node->left);
        Tree(node->right);
    }
}

[Figure: the CPU stores the root address x into the memory-mapped root register of each PFE (store [x]); the L1 PFE issues x to start the traversal, and the PFEs stop and resume one another as the traversal crosses levels of the hierarchy]
12
Issue #3: Synchronization between CPU and PFEs
• When do we need to synchronize the CPU and PFE execution?
  – Early prefetches
    • the PFEs are running too far ahead of the CPU
  – Useless prefetches
    • the PFEs are traversing down the wrong path
    • the PFEs are running behind the CPU
• Throttle mechanism
[Figure: prefetch buffer with a free bit per cache block; the PFE produces entries and the CPU consumes them]
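A minimal software sketch of the free-bit throttle, using an illustrative 4-entry buffer (the evaluated prefetch buffer has 32 entries) and hypothetical function names:

```c
#define PB_ENTRIES 4    /* illustrative size; the evaluated buffer has 32 entries */

/* One free bit per prefetch-buffer entry: 1 = free, 0 = holds data the
   CPU has not consumed yet. */
static int free_bit[PB_ENTRIES] = {1, 1, 1, 1};

/* PFE side: claim a free entry for a pushed block. Returns the entry
   index, or -1 when the buffer is full -- the signal that the PFE has
   run too far ahead of the CPU and must suspend (throttle). */
int pfe_produce(void)
{
    for (int i = 0; i < PB_ENTRIES; i++) {
        if (free_bit[i]) {
            free_bit[i] = 0;
            return i;
        }
    }
    return -1;
}

/* CPU side: consuming an entry sets its free bit, letting a suspended
   PFE resume execution. */
void cpu_consume(int i)
{
    free_bit[i] = 1;
}
```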
13
Variations of the Push Architecture
• 2_PFE should perform comparably to 3_PFE
[Figure: 3_PFE, 2_PFE, and 1_PFE configurations: PFEs are removed from the upper levels one at a time; the remaining engines push data up the hierarchy, with a pull filling in where a level has no PFE of its own]
• 1_PFE performs well if most of the LDS exists only in main memory
14
Outline
• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion
15
Experimental Setup
• SimpleScalar: out-of-order processor
• Benchmarks: Olden benchmark suite & rayshade
• Baseline processor:
  – 4-way issue, 64 RUU, 16 LSQ
  – lockup-free caches with 8 outstanding misses
  – 32KB, 32B line, 2-way L1 & 512KB, 64B line, 4-way L2
  – 84-cycle round-trip memory latency & 48-cycle DRAM access time
• Prefetch model
  – Push model: 3 levels of PFEs, 32-entry fully-associative prefetch buffer
  – Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer
16
Performance Comparison: Push vs. Pull
• health, mst, perimeter, and treeadd
  – Push: 4% to 25% speedup; Pull: 0% to 4% speedup
• em3d, rayshade
  – Push: 31% to 57% speedup; Pull: 25% to 39% speedup
• bh
  – Push: 33% speedup; Pull: 33% speedup
• Dynamically changing structures: bisort and tsp
[Figure: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, bisort, treeadd, tsp, and voronoi]
17
Variations of the Push Architecture
[Figure: normalized execution time for Base, 3_PFE, 2_PFE, and 1_PFE]
• 2_PFE performs comparably to 3_PFE
• 1_PFE performs comparably to 3_PFE except for em3d
18
Related Work
• Prefetching for irregular applications:
  – Correlation-based prefetch (Joseph'97, Alexander'96)
  – Compiler-based prefetch (Luk'96)
  – Dependence-based prefetch (Roth'98)
  – Jump-pointer prefetch (Roth'99)
• Decoupled architectures:
  – Decoupled access/execute (Smith'82)
  – Pre-execution (Annavaram'2001, Collins'2001, Roth'2001, Zilles'2001, Luk'2001)
• Processor-in-memory:
  – Berkeley IRAM group (Patterson'97)
  – Active Pages (Oskin'98)
  – FlexRAM (Kang'99)
  – Impulse (Carter'99)
  – Memory-side prefetching (Hughes'2000)
19
Conclusion
• Built a general architectural solution for the push model
• The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
  – applications with tight traversal loops
    • Push: 4% to 25% speedup; Pull: 0% to 4%
  – applications with longer computation between node accesses
    • Push: 31% to 57% speedup; Pull: 25% to 39%
• 2_PFE performs comparably to 3_PFE
20
Traversal Kernel
void *HashLookup(int key, hash hash)
{
    j = (hash->mapfunc)(key);
    for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next)
        ;
    if (ent)
        return ent->entry;
    return NULL;
}

void kernel(HashEntry ent, int key)
{
    for (ent; ent && ent->key != key; ent = ent->next)
        ;
}

The CPU passes three values to the PFE over a memory-mapped interface:
1. the traversal kernel identifier
2. hash->array[j]
3. key
21
Block Diagram of Specialized PFE
[Figure: specialized PFE: recurrent load table, non-recurrent load table, root register, kernel-ID register, instruction buffer, traversal-info table, ready queue (pc, base, offset), address adders, result buffer (pc), a TLB, and an interface to the cache/memory controller]
22
Block Diagram of Programmable PFE
[Figure: programmable PFE: a processor core with instruction cache, stack, and register file (including the root register); memory-mapped structures (kernel-ID register, instruction buffer, kernel index table, result buffer); and a TLB plus the cache/memory controller interface; local and global accesses are distinguished]
23
Issue #4: Redundant Prefetches
• Redundant prefetches: prefetches for data that already resides in the upper levels of the hierarchy
• Example: tree traversals
[Figure: binary tree (nodes 1-7) spread across L1, L2, and main memory; upper-level nodes are prefetched again as each subtree is traversed]
24
Issue #4: Redundant Prefetches
• Performance impact
  – Wastes bus bandwidth
  – Memory accesses are satisfied more slowly at the lower levels of the memory hierarchy
• Add a small data cache in the L2/memory PFEs
[Figure: the PFE processor sends each request to its small data cache first; only misses go on to the cache/memory controller, which returns the result]
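A sketch of how such a small PFE data cache can filter redundant prefetches, modeled here as a direct-mapped tag store; the size and names are illustrative, not the paper's parameters:

```c
#define FILTER_SETS 8   /* illustrative size */

/* Tiny direct-mapped tag store modeling the small data cache added to
   the L2/memory PFEs. A block address that hits here was pushed
   recently, so prefetching it again would be redundant. */
static unsigned long tag[FILTER_SETS];
static int valid[FILTER_SETS];

/* Returns 1 if the prefetch for block_addr should be issued,
   0 if it is filtered out as redundant. */
int should_issue(unsigned long block_addr)
{
    unsigned long set = block_addr % FILTER_SETS;
    if (valid[set] && tag[set] == block_addr)
        return 0;               /* hit: redundant prefetch, drop it */
    valid[set] = 1;             /* miss: record the block and issue */
    tag[set] = block_addr;
    return 1;
}
```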
25
Issue #5: Modifications to the Cache/Memory Controller
[Figure: the L1, L2, and memory controllers each add a request buffer alongside the MSHRs, so that demand requests and PFE prefetch requests can merge at each level over the L2 and memory buses]
26
How to Avoid Early Prefetches?
[Figure: binary tree (nodes 1-15) traversed at times t1, t2, and t3: the PFE runs several nodes ahead of the CPU, so without throttling, pushed nodes can arrive in the prefetch buffer and be evicted before the CPU reaches them]
27
How to Avoid Early Prefetches?
[Figure: throttling with free bits: at t1 the prefetch buffer has no free entries (all free bits 0), so the PFE suspends execution; by t3 the CPU has consumed an entry, its free bit is set to 1, and the PFE continues execution]
28
How to Avoid Useless Prefetches?
[Figure: avoiding useless prefetches: the memory PFE suspends after pushing a run of nodes (1-5); the CPU's L1 hits and L1/L2 misses on those nodes reveal whether the pushed data is being used, and a demand miss triggers the PFE to continue execution (node 6)]
29
How to Avoid Useless Prefetches?
[Figure: continued: at t1 the memory PFE remains suspended; at t2, demand misses on the pushed nodes set their free bits and trigger the PFE to continue execution with node 7]
30
Performance Prediction of the Push Architecture for Future Processors
[Figure: two charts of normalized execution time for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp as the processor scales from 0.8GHz to 1.2GHz, 1.6GHz, and 2.0GHz; the second chart splits each bar into memory latency and computation time]
31
Prefetch Coverage
[Figure: prefetch coverage: % of cache misses totally hidden vs. partially hidden, per benchmark (health, em3d, mst, rayshade, perimeter, bh, treeadd, tsp)]
32
Prefetch Distribution
[Figure: distribution of prefetches across the L1, L2, and memory levels]
33
Cumulative Distance between Recurrent Loads
[Figure: cumulative distribution of the distance between recurrent loads: <8, <16, <32, <64, <128, >128]
34
Bandwidth Requirement
[Figure: normalized execution time with limited vs. non-limited bandwidth]
35
Effect of the PFE Data Cache & Throttle Mechanism
[Figure: normalized execution time for base, push_base, push_buffer, push_throttle, and push_buffer_throttle, per benchmark]
• The throttle mechanism has an impact on bh.
• The PFE data cache has an impact on em3d, perimeter, and treeadd.
36
Effect of the PFE Data Cache
[Figure: redundant prefetch distribution (L2 vs. memory) and the fraction of redundant prefetches captured in the PFE data cache, for em3d, perimeter, bh, and treeadd]
• em3d, perimeter, bh, and treeadd: 30% to 50% of prefetches are redundant
• 70% to 100% of redundant prefetches are captured in the PFE data cache
37
PFE Architecture: Effect of Wider-Issue PFEs
• Increasing issue width further improves performance, particularly for em3d and treeadd
[Figure: normalized execution time for base, single-issue, 2-issue, and 4-issue PFEs]
38
TLB Miss Effect
• Hardware TLB miss handler, 30 cycle TLB miss penalty
[Figure: normalized execution time per benchmark with a perfect TLB (perf) and 32-, 64-, 128-, and 256-entry PFE TLBs]
39
PFE Architecture: Specialized vs. Programmable PFE
• A programmable PFE can achieve performance comparable to a specialized PFE
[Figure: normalized execution time of specialized vs. programmable PFEs for health, mst, and rayshade]
40
Breadth-First Tree Traversal
[Figure: breadth-first traversal of a binary tree (nodes 1-15) using a work list with head and tail pointers]

Traversal kernel:

list = head;
while (list) {
    node = list->ptr;
    left = node->left;
    right = node->right;
    list = list->next;
}
42
Restore PFE State
00400950  addiu $sp[29],$sp[29],-56
00400958  sw    $ra[31],48($sp[29])
00400960  sw    $s8[30],44($sp[29])
00400968  sw    $s0[16],40($sp[29])
00400970  addu  $s8[30],$zero[0],$sp[29]
00400978  addu  $s0[16],$zero[0],$a0[4]
00400980  beq   $s0[16],$zero[0],004009a8
(x) 00400988  lw  $a0[4],4($s0[16])     ; miss
00400990  jal   00400950 <K_TreeAdd>
(y) 00400998  lw  $a0[4],8($s0[16])
004009a0  jal   00400950 <K_TreeAdd>
004009a8  addu  $sp[29],$zero[0],$s8[30]
004009b0  lw    $ra[31],48($sp[29])
004009b8  lw    $s8[30],44($sp[29])
004009c0  lw    $s0[16],40($sp[29])

[Figure: subtree of nodes 1-7 with x and y the children of the current node; x is issued at 400988, and the miss at x lets 400990 and 400950-400978 execute, saving registers to the stack; y is issued at 400998; the PFE's register file and PC are later restored from the stack]
43
• Correct resume PC
  – Statically construct the resume-PC table
Restore PFE State
Recurrent Load PC    Resume PC
400988               400998
400998               4009a8
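The lookup the PFE performs against this statically constructed table can be sketched as follows; the addresses are the two pairs from the slide, and the type and function names are illustrative:

```c
/* Statically constructed resume-PC table for the K_TreeAdd kernel,
   using the two pairs from the slide. */
typedef struct {
    unsigned int load_pc;    /* PC of the recurrent load */
    unsigned int resume_pc;  /* where the PFE resumes after the miss */
} resume_entry;

static const resume_entry resume_table[] = {
    {0x400988, 0x400998},
    {0x400998, 0x4009a8},
};

/* Look up the resume PC for a missing recurrent load; returns 0 when
   the PC is not a recurrent load in the table. */
unsigned int lookup_resume_pc(unsigned int load_pc)
{
    unsigned n = sizeof resume_table / sizeof resume_table[0];
    for (unsigned i = 0; i < n; i++)
        if (resume_table[i].load_pc == load_pc)
            return resume_table[i].resume_pc;
    return 0;
}
```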