Ioana Burcea*
Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
Predictor Virtualization
*University of Toronto
Canada
§Carnegie Mellon University
#École Polytechnique Fédérale de Lausanne
ASPLOS XIII
March 4, 2008
Ioana Burcea · Predictor Virtualization · University of Toronto
Why Predictors? History Repeats Itself
[Diagram: predictors surround the CPU: branch prediction, prefetching, value prediction, pointer caching, cache replacement]
Application footprints grow → predictors need to scale to remain effective
Extra Resources: CMPs With Large On-Chip Caches
[Diagram: 4-core CMP, each core with its own I$/D$, sharing an L2 cache of 10s – 100s of MB, backed by main memory]
Predictor Virtualization
[Diagram: the same CMP; predictor tables live in the physical memory address space and are cached by the shared L2]
Predictor Virtualization (PV)
Emulate large predictor tables
Reduce the resources dedicated to predictor tables
Research Contributions
PV: predictor metadata stored in the conventional cache hierarchy
Benefits: emulate larger tables → increased accuracy; fewer dedicated resources
Why now? Large caches, CMPs, and the need for larger predictors
Will this work? Metadata locality is intrinsically exploited by caches
First step – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Talk Road Map
PV architecture
PV in action: virtualized "Spatial Memory Streaming" [ISCA 06]*
Conclusions
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
PV Architecture
[Diagram: CPU with I$/D$ above an L2 cache and main memory; a predictor table serves request/prediction pairs through an optimization engine. Under PV, the predictor table becomes a PVTable in the physical memory address space, located via a PVStart register; a small PVCache, managed by a PVProxy, serves indexed lookups through the L2 cache]
PV: Variable Prediction Latency
[Diagram: a metadata lookup hits the PVCache in the common case, falls back to the L2 cache infrequently, and goes to main memory only rarely]
Metadata Locality
Entry reuse:
Temporal – one entry is used for multiple predictions
Spatial – can be engineered: one miss is overcome by several subsequent hits
Metadata access patterns are predictable → predictor metadata prefetching
Spatial Memory Streaming [ISCA 06]*
[Diagram: memory regions and their spatial access patterns, e.g. 1100000001101…, 1100001010001…]
Spatial patterns are stored in a pattern history table (PHT)
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
Virtualizing "Spatial Memory Streaming" (SMS)
[Diagram: the data access stream feeds a detector that records spatial patterns; the predictor issues prefetches on trigger accesses. The ~60KB of pattern storage is virtualized, leaving ~1KB of dedicated state]
Virtualizing SMS
Virtual table: 1K sets × 11 ways; PVCache: 8 sets × 11 ways
Each table set maps to one 64-byte cache block: 11 entries of (11-bit tag, 32-bit pattern), with 39 bits unused
Current Implementation
Non-intrusive:
Virtual table stored in reserved physical address space
One table per core
Caches oblivious to metadata
Options:
Predictor tables stored in virtual memory
Single, shared table per application
Caches aware of metadata
Simulation Infrastructure
SimFlex
Full-system simulator based on Simics
Base processor configuration
4-core CMP
8-wide OoO
256-entry ROB
L1D/L1I 64KB 4-way set-associative
UL2 8MB 16-way set-associative
Commercial workloads
TPC-C: DB2 and Oracle
TPC-H: Query 1, Query 2, Query 16, Query 17
SpecWeb: Apache and Zeus
Original Prefetcher – Accuracy vs. Predictor Size
[Chart: L1 read misses (covered, uncovered, overpredictions; lower is better) for Apache, Oracle, and TPC-H Query 17, for PHT configurations from infinite down to 8 sets × 11 ways]
Small Tables Diminish Prefetching Accuracy
Virtualized Prefetcher – Performance
[Chart: speedup (0%–70%; higher is better) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, and Qry 17; bars for the original prefetcher with 1K, 16, and 8 sets, and the virtualized prefetcher with 8 sets]
Hardware cost: original prefetcher ~60KB vs. virtualized prefetcher < 1KB
Impact on L2 Memory Requests
[Chart: increase in L2 memory requests (0%–40%; lower is better) for Apache, Oracle, and Qry 17 with PV at 8 sets]
Dark side: increased L2 memory requests
Impact of Virtualization on Off-Chip Bandwidth
[Chart: off-chip bandwidth increase (0%–5%; lower is better) for Apache, Qry 17, and Oracle, split into application vs. PV L2 misses and write-backs, annotated with their direct and indirect impact on performance]
Minimal Impact on Off-Chip Bandwidth
Conclusions
Predictor Virtualization: metadata stored in the conventional cache hierarchy
Benefits: emulate larger tables → increased accuracy; fewer dedicated resources
First step – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Opportunities: metadata sharing and persistence; application-directed prediction; predictor adaptation
PV – Motivating Trends
Larger predictor tables bring increased performance, but with diminishing returns: dedicating resources to predictors is hard to justify
Chip multiprocessors: the space dedicated to predictors scales with the number of processors
Memory hierarchies offer the opportunity: increased capacity
→ Use conventional memory hierarchies to store predictor metadata
Virtualizing the Predictor Table
[Diagram: the PC and trigger-access address form a tag and index into the pattern history table; the matching entry's spatial pattern (e.g. 1 0 1 0 1 1 1 0) drives prefetches]
Virtualize: the PHT is stored in the physical address space, and multiple PHT entries are packed into one memory block, so one memory request brings in an entire table set
Packing Entries in One Cache Block
Index: PC + offset within the spatial group; PC → 16 bits; 32 blocks in a spatial group → 5-bit offset → 21-bit index; 32-bit spatial pattern per entry
Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
Cache block: 64 bytes → 11 entries per cache block → pattern table is 1K sets, 11-way set-associative
[Diagram: block layout – (11-bit tag, 32-bit pattern) entries packed back-to-back at bit offsets 0, 11, 43, 54, …, with the final 39 bits unused]
Memory Address Calculation
[Diagram: the 16-bit PC and 5-bit spatial offset form a 21-bit index; 10 of its bits select the table set and the remaining 11 bits form the tag. The set index, followed by six zero bits of block offset (000000), is added to the PV start address to produce the memory address]
Increase in Off-Chip Bandwidth – Different L2 Sizes
[Chart: off-chip bandwidth increase (0%–45%), split into L2 misses and write-backs, for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, and Qry 17, each with 2MB, 4MB, and 8MB L2 caches]
Increased L2 Latency
[Chart: speedup (0%–60%) comparing SMS with a dedicated 1K-set table (SMS - 1K) against SMS virtualized with an 8-set PVCache (SMS - PV8)]
Conclusions
PV: metadata stored in the conventional cache hierarchy
Benefits: fewer dedicated resources; emulate larger tables → increased accuracy
Example – a virtualized data prefetcher: performance within 1% on average; space reduced from 60KB to under 1KB
Why now? Large caches, CMPs, and the need for larger predictors
Will this work? Metadata locality is intrinsically exploited by caches; metadata access patterns are predictable
Opportunities: metadata sharing and persistence; application-directed prediction; predictor adaptation