Computer Architecture Lab at Carnegie Mellon (CALCM)
University of Toronto
AENAO: Power Aware Memory Coherence & Hierarchies for Servers
http://eecg.toronto.edu/~aenao
Accurate and Complexity-Effective Spatial Pattern Prediction
Chi Chen, Se-Hyun Yang, Babak Falsafi, Andreas Moshovos
2CALCM
Motivation – Variation in Spatial Locality
Caches exploit spatial locality via block size
• Prefetch nearby data
• Improve performance
“One size fits all” solution
• Large enough for prefetching
• Small enough to avoid memory link saturation
Opportunity
• Variation within and across applications
If the “best block size” were known:
1. Prefetch even further: higher performance
2. “Turn off” unused data in cache: lower leakage power
This Work
Dynamic Spatial Pattern Prediction
Leakage power reduction
• Sub-blocks of a block as a group
• Place “unused” block parts in a low-leakage state
Prefetching
• Consecutive memory blocks as a group
• Selectively prefetch blocks upon first access in group
Key contribution: PC + offset within group
• Quick learning
• Compact representation
• High coverage
How Well it Works
Spatial Pattern Predictor (SPP)
• 256-entry tag-less direct-mapped
• ~95% coverage
L1 data leakage energy reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Prefetching w/ 1024-byte group
• Up to 2x speedup and 56% average
• Conventional cache: 14% slowdown
Outline
Conventional Cache: Optimization Opportunities
Variation in Spatial Locality
Prediction Framework
Prior Work
Results
Optimization Opportunity #1
L1D with 64-byte cache lines
typedef struct person {
    char name[20];
    …
    int age;
    int isAdult;
    struct person *next;
} person; // total 64 bytes
// do something …
while (people) {
    if (people->age >= 21)
        people->isAdult = TRUE;
    people = people->next;
}
[Figure: conventional cache; each list node misses and brings in a full line, of which only age, isAdult and next are touched and the rest stays untouched]
Resident untouched data: wasteful leakage
Optimization Opportunity #2
L1D with 64-byte cache lines
struct person {
    char name[20];
    …
    int age;
    int isAdult;
} people[LARGE];
// do something …
for (i = 0; i < LARGE; i++) {
    if (people[i].age >= 21)
        people[i].isAdult = TRUE;
}
[Figure: conventional cache; the same fields (age, isAdult) are touched in every line of Group #1 and Group #2]
Detect access patterns at group level
• Selectively prefetch same block members
• Improve performance w/o saturating memory
Variation in Spatial Locality
[Figure: stacked bars for facerec, gcc, mcf and vortex; y-axis: all cache lines touched (0% to 100%); legend: 1/8 to 8/8 of the line used]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines
Average line usage (facerec, gcc, mcf, vortex): 40%, 89%, 26%, 48%
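The usage metric on this slide can be sketched as follows. This is a minimal illustration, assuming each 64B line is tracked as eight 8-byte sub-blocks with one "touched" bit each (a granularity chosen to match the 1/8 to 8/8 legend); the function names are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SUBBLOCKS_PER_LINE 8  /* assumption: 64B line split into 8B sub-blocks */

/* Count how many of the eight sub-blocks were touched before eviction. */
static unsigned touched_subblocks(uint8_t touched_mask)
{
    unsigned count = 0;
    while (touched_mask) {
        touched_mask &= (uint8_t)(touched_mask - 1);  /* clear lowest set bit */
        count++;
    }
    return count;
}

/* Average fraction of each evicted line that was used (0.0 to 1.0). */
static double average_line_usage(const uint8_t *evicted_masks, size_t n)
{
    assert(n > 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += touched_subblocks(evicted_masks[i]);
    return (double)total / (double)(n * SUBBLOCKS_PER_LINE);
}
```

Feeding the per-line result of touched_subblocks into eight counters would give the 1/8 to 8/8 breakdown shown in the chart.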
Prediction Framework
Minimum Fetch Unit (MFU):
• replacement unit of cache
• e.g., cache line or sub-block
Spatial Group:
• group of adjacent MFUs
• indexed by logical tag
Spatial Pattern:
• reference pattern of a spatial group (a bit vector such as 1 0 . . . 1)
Spatial Group Generation:
• starts with a new logical tag
[Figure: timeline of accesses labeled by logical tag (Tag0, Tag0, Tag1, Tag1, Tag1, . . .)]
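The terms above can be made concrete with a small sketch. It assumes the prefetcher configuration used later in the deck (64B cache lines as MFUs, 1024B spatial groups) and tracks a single generation at a time, which is a simplification; all names are illustrative rather than taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define MFU_BYTES      64u    /* assumption: MFU is a 64B cache line */
#define GROUP_BYTES    1024u  /* assumption: 1024B spatial group */
#define MFUS_PER_GROUP (GROUP_BYTES / MFU_BYTES)  /* 16 MFUs per group */

/* Logical tag: identifies the spatial group an address falls in. */
static uint64_t logical_tag(uint64_t addr) { return addr / GROUP_BYTES; }

/* Offset of the MFU within its spatial group (0 .. MFUS_PER_GROUP - 1). */
static unsigned mfu_offset(uint64_t addr)
{
    return (unsigned)((addr % GROUP_BYTES) / MFU_BYTES);
}

typedef struct {
    uint64_t tag;      /* logical tag of the current generation */
    uint16_t pattern;  /* spatial pattern: one bit per MFU */
    int valid;
} generation_t;

/* Record one access; returns 1 when a new logical tag starts a generation. */
static int record_access(generation_t *g, uint64_t addr)
{
    int started = 0;
    assert(g != 0);
    if (!g->valid || g->tag != logical_tag(addr)) {
        g->tag = logical_tag(addr);
        g->pattern = 0;
        g->valid = 1;
        started = 1;
    }
    g->pattern |= (uint16_t)(1u << mfu_offset(addr));
    return started;
}
```

For example, accesses to 0x1000, 0x1040 and 0x13C0 all share logical tag 4 and set bits 0, 1 and 15 of the pattern; an access to 0x1400 has tag 5 and starts a new generation.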
Spatial Pattern Predictor
[Figure: Data Cache, Current Pattern Table (CPT) and Pattern History Table (PHT); labeled fields: Spatial Pattern Register, PHT Entry Pointer, Prediction Index, Spatial Pattern History; a comparison (=?) on the prediction index yields the spatial pattern prediction]
Current Pattern Table records patterns
Pattern History Table stores captured patterns
Prediction index: PC + SPG offset, 32 bits
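A rough sketch of the history side of this design follows, modeled on the 256-entry tag-less direct-mapped configuration reported in the results. The hash in pht_index is an assumption made for illustration (the slides say only that the index combines the PC and the SPG offset), as are the 16-bit pattern width and all names.

```c
#include <assert.h>
#include <stdint.h>

#define PHT_ENTRIES    256u  /* from the slides: 256-entry, tag-less, direct-mapped */
#define MFUS_PER_GROUP 16u   /* assumption: 16 MFUs per spatial group */

typedef struct {
    uint16_t pattern[PHT_ENTRIES];  /* last pattern stored per index; 0 = untrained */
} pht_t;

/* Hypothetical hash of PC and SPG offset into a table index. */
static unsigned pht_index(uint64_t pc, unsigned spg_offset)
{
    assert(spg_offset < MFUS_PER_GROUP);
    return (unsigned)(((pc >> 2) ^ ((uint64_t)spg_offset << 4)) % PHT_ENTRIES);
}

/* When a generation ends, store the pattern it produced. */
static void pht_train(pht_t *t, uint64_t pc, unsigned spg_offset, uint16_t observed)
{
    t->pattern[pht_index(pc, spg_offset)] = observed;
}

/* When a generation starts, look up the pattern to predict. */
static uint16_t pht_predict(const pht_t *t, uint64_t pc, unsigned spg_offset)
{
    return t->pattern[pht_index(pc, spg_offset)];
}
```

Because the table is tag-less, two different (PC, offset) pairs that hash to the same entry share one pattern. That keeps each entry small at the cost of occasional aliasing.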
Prior Work
• Static profiling, V. Vleet, et al. ICCD 1999
• Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
• Fetching adjacent cache lines, Temam & Jegou. ICS 1994
• Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
• Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
• Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998
Key difference is the prediction handle: PC + group offset
1. Compact representation
2. Quick learning
3. High coverage
Results Overview
Predictor Performance Statistics
Leakage Power Reduction
Performance Improvement w/ Prefetching
Methodology
SimpleScalar simulator
• 64KB 2-way L1D/L1I cache, 2-cycle latency
• 2MB 8-way L2 cache, 12-cycle latency
SPEC CPU2000
• Alpha binaries + reference inputs
Predictor performance evaluation
• Simulated to completion
Performance impact evaluation
• Skipped 10B and simulated next 500M instructions
Energy reduction evaluation
• SPICE w/ 70nm CMOS technology & 1V supply voltage
Practical Predictor: Performance
256-entry tag-less direct-mapped: average prediction accuracy of 96%
[Figure: stacked bars for facerec, gcc, mcf and vortex; y-axis: % of perfect predictions (0% to 160%); bars with 256 entries, A: 16-way, B: DM, C: FA; segments: correct prediction, under-prediction, over-prediction, training]
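One plausible reading of the chart categories is sketched below. The definitions are assumptions inferred from the category names, not stated on the slide: a unit is "correct" if predicted and used, "under-predicted" if used but not predicted, and "over-predicted" if predicted but never used. The training category is not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned correct;  /* units predicted and actually used */
    unsigned under;    /* units used but not predicted */
    unsigned over;     /* units predicted but never used */
} outcome_t;

/* Count set bits in a 16-bit pattern. */
static unsigned popcount16(uint16_t v)
{
    unsigned n = 0;
    for (; v; v &= (uint16_t)(v - 1))
        n++;
    return n;
}

/* Compare a predicted spatial pattern with the pattern actually observed. */
static outcome_t classify(uint16_t predicted, uint16_t actual)
{
    outcome_t o;
    o.correct = popcount16((uint16_t)(predicted & actual));
    o.under   = popcount16((uint16_t)(~predicted & actual));
    o.over    = popcount16((uint16_t)(predicted & ~actual));
    /* Every used unit is either predicted (correct) or missed (under). */
    assert(o.correct + o.under == popcount16(actual));
    return o;
}
```

Under this reading, correct and under-predicted units together add up to what a perfect predictor would fetch, so over-predictions would be what pushes a bar past 100%.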
Predictor Applications
Leakage energy reduction
• Sub-blocks as minimum fetch units
• Cache lines as spatial groups
• A cache miss starts a spatial group generation
• Assuming Gated-Ground by Agarwal, Li, & Roy
Spatial group prefetcher
• Cache lines as minimum fetch units
• Adjacent cache lines grouped into spatial groups
• A new logical tag starts a spatial group generation
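For the prefetcher case, turning a predicted pattern into prefetch addresses might look like the sketch below. It assumes 64B lines and 1024B groups, and that the line triggering the generation is fetched on demand and therefore skipped; the function name and interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES      64u    /* assumption: MFU is a 64B cache line */
#define GROUP_BYTES     1024u  /* assumption: 1024B spatial group */
#define LINES_PER_GROUP (GROUP_BYTES / LINE_BYTES)

/*
 * On the first access to a group, list the lines that the predicted pattern
 * marks as used, skipping the line that triggered the access.
 * Returns the number of addresses written to out[].
 */
static size_t prefetch_candidates(uint64_t miss_addr, uint16_t predicted,
                                  uint64_t out[], size_t out_cap)
{
    assert(out_cap >= LINES_PER_GROUP);
    uint64_t group_base = miss_addr - (miss_addr % GROUP_BYTES);
    unsigned trigger = (unsigned)((miss_addr % GROUP_BYTES) / LINE_BYTES);
    size_t n = 0;
    for (unsigned i = 0; i < LINES_PER_GROUP; i++) {
        if (i != trigger && (predicted & (1u << i)))
            out[n++] = group_base + (uint64_t)i * LINE_BYTES;
    }
    return n;
}
```

The leakage application would use the same pattern in the opposite direction: sub-blocks whose bit is clear are the candidates for the low-leakage state.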
Leakage Energy Reduction
[Figure: bars for facerec, gcc, mcf, vortex and AVG; axes: relative leakage power (0% to 100%) and execution time increase (0% to 5%); annotations: 60%, <1%, ~2%]
Up to 73% leakage energy reduction
~40% average leakage energy reduction
< 1% average performance degradation
Performance Improvement
[Figure: bars for facerec, gcc, mcf, vortex and AVG; y-axis: -50% to 150%; series: SPG 1024, SPG 512, CONV. 1024, CONV. 512]
Up to 2x speedup with 1024B spatial groups
~60% average speedup with 1024B spatial groups
Summary
Spatial Pattern Predictor (SPP)
• Key contribution: PC + group offset
Small and effective, high coverage
• 256-entry tag-less direct-mapped
• ~95% coverage
L1 data leakage energy reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Prefetching w/ 1024-byte group
• Up to 2x speedup and 56% average
• Conventional cache: 14% slowdown
Prediction Index
Infinite tables
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements
[Figure: stacked bars for facerec, gcc, mcf and vortex; y-axis: 0% to 160%; bars A: PC, B: PC+SPG ID, C: PC+SPG offset, D: PC+ADDR; segments: correct prediction, under-prediction, over-prediction, training]
Contributions
Spatial Pattern Predictor (SPP)
• 256-entry tag-less direct-mapped
• ~95% coverage
Leakage energy reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Processor performance improvement
• Up to 2x speedup
Variations in Spatial Locality
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percentage of all cache line usages (0% to 100%); legend: <=13%, 14-25%, 26-38%, 39-50%, 51-63%, 64-75%, 76-88%, 89-100%]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines
Prediction Index
PC + SPG offset yields high prediction accuracy
PC + SPG offset has low prediction memory requirements
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percent of perfect predictions (0% to 160%); bars A: PC-only, B: PC+SPG ID, C: PC+SPG offset, D: PC+ADDR; segments: correct prediction, underprediction, overprediction, training]
Predictor Memory Organization
256-entry tag-less direct-mapped yields average prediction accuracy of 96%
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percent of perfect predictions (0% to 160%); bars A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA; segments: correct prediction, underprediction, overprediction, training]
Spatial Group Size (1/2)
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percentage of perfect predictions (0% to 160%); bars A: 16B spatial group, B: 32B spatial group, C: 64B spatial group, D: 128B spatial group, E: 256B spatial group, all with 8B fetch unit; segments: correct prediction, underprediction, overprediction, training]
Spatial Group Size (2/2)
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percentage of perfect predictions (0% to 160%); bars A: 32B spatial group, 8B fetch unit, B: 64B spatial group, 8B fetch unit, C: 128B spatial group, 8B fetch unit, D: 128B spatial group, 64B fetch unit, E: 256B spatial group, 64B fetch unit, F: 512B spatial group, 64B fetch unit; segments: correct prediction, underprediction, overprediction, training]
Predictor Memory Organization
[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid and vortex; y-axis: percentage of perfect predictions (0% to 160%); bars A: 8-entry, B: 16-entry, C: 32-entry, D: 64-entry, E: 128-entry, F: 256-entry, G: INF; segments: correct prediction, underprediction, overprediction, training]
Leakage Energy Reduction
Up to 73% leakage energy reduction
~40% average leakage energy reduction
< 1% average performance degradation
[Figure: bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex and AVG; series: execution time increase (axis up to 5%) and fraction of baseline leakage dissipation (0% to 100%)]
Performance Improvement
Values in %; negative values are slowdowns
Benchmark   CONV. 512B   CONV. 1024B   SPG 512B   SPG 1024B
ammp               -41           -63         10         -25
art                 32            96        121         305
bzip               -43           -49          6           8
equake             -34           -41         59          99
facerec            -13            -3         58         103
fma3d               -9            -9          0           0
gap                 20            31         31          47
gcc                 -2            -2          1           1
lucas              -23           -67         34          51
mcf                -27           -32         38          67
mgrid                6            12         36          53
vortex             -27           -43          1           1
AVG                -13           -14         33          59
Up to 2x speedup with 1024B spatial groups
~60% average speedup with 1024B spatial groups
Predictor Memory Organization
256-entry tag-less direct-mapped: average prediction accuracy of 96%
[Figure: stacked bars for facerec, gcc, mcf and vortex; y-axis: 0% to 160%; bars A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA; segments: correct prediction, under-prediction, over-prediction, training]