[Figure: control-flow graph with basic blocks A to H. Labels: diverge branch (A, hard to predict), CFM point (H), predicates p1 and !p1, nop, "Insert select-µops (φ-nodes in SSA)". Legend: frequently executed path; not frequently executed path; hard-to-predict path; diverge branch; executed block; CFM point.]
Diverge-Merge Processor (DMP) [MICRO 2006, TOP PICKS 2007]
Profile-Assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors
Hyesoon Kim José A. Joao
Onur Mutlu* Yale N. Patt
HPS Research Group University of Texas at Austin
*
Control-Flow Graphs
[Figure: five control-flow graph shapes, each rooted at branch A, annotated with exact and approximate CFM points.]
• simple hammock
• nested hammock
• frequently-hammock
• loop
• non-merging
66% of mispredicted branches can be dynamically predicated by DMP.
Diverge-Merge Processor (DMP)
• DMP can dynamically predicate complex branches (in addition to simple hammocks).
• The compiler identifies:
  - Diverge branches
  - Control-flow merge (CFM) points
• The microarchitecture decides when and what to predicate dynamically.
Why hardware and compiler?
• Compiler-centric solution (static predication): predicated ISA, not adaptive, applicable to limited CFGs.
• Microarchitecture-only solution: complex, expensive, limited in scope.
• Compiler-microarchitecture interaction: each one does what it is good at.
Compiler Support
• Analysis: identify diverge branch candidates and CFM points
• Select diverge branches and CFM points
• Code generation: mark the selected diverge branches and CFM points (ISA extensions)
Simple/Nested Hammocks: Alg-exact
CFM point = IPOSDOM(A), the immediate post-dominator of diverge branch A
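As an illustration of the rule above, here is a minimal sketch of finding IPOSDOM(A) on a small control-flow graph. It is not the authors' implementation: the function names, the dictionary-based CFG encoding, and the two example graphs are assumptions made for this sketch, and it assumes a single exit block that every block can reach.

```python
def post_dominators(succ, exit_node):
    """Iteratively compute the set of post-dominators of every block.

    succ maps each block to its list of successors (hypothetical encoding).
    Assumes every block other than exit_node has at least one successor.
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A block is post-dominated by itself plus whatever
            # post-dominates all of its successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def iposdom(succ, exit_node, block):
    """Return the immediate post-dominator of block, or None for the exit."""
    pdom = post_dominators(succ, exit_node)
    strict = pdom[block] - {block}
    for candidate in strict:
        # The closest strict post-dominator is itself post-dominated
        # by every other strict post-dominator of the block.
        if strict <= pdom[candidate]:
            return candidate
    return None


# Hypothetical simple hammock: A branches to B or C, both rejoin at D.
simple = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["EXIT"], "EXIT": []}

# Hypothetical nested hammock: the B side contains an inner hammock.
nested = {
    "A": ["B", "C"], "B": ["D", "E"], "D": ["F"], "E": ["F"],
    "F": ["G"], "C": ["G"], "G": ["EXIT"], "EXIT": [],
}

cfm_simple = iposdom(simple, "EXIT", "A")
cfm_nested_outer = iposdom(nested, "EXIT", "A")
cfm_nested_inner = iposdom(nested, "EXIT", "B")
```

In these example graphs the exact CFM point of the simple hammock is D; in the nested hammock the outer branch A merges at G and the inner branch B merges at F.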
Frequently-Hammocks: Alg-freq
[Figure: control-flow graph rooted at diverge branch A, with taken (T) and not-taken (NT) successors and edges annotated with profiled probabilities, merging at block H.]
• Use edge-profiling frequencies
  pT(X): conditional probability of reaching basic block X if branch A is taken
  pNT(X): conditional probability of reaching basic block X if branch A is not taken
• Stop at IPOSDOM or MAX_INSTR
• Compute pMerging(X) = pT(X) * pNT(X)
Example: pT(H) = 1 and pNT(H) = 0.5, so pMerging(H) = 1 * 0.5 = 0.5. H is a CFM point candidate!
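The computation on this slide can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the example graph and its edge probabilities are hypothetical (chosen so that pT(H) = 1 and pNT(H) = 0.5 as in the example), the region is assumed to be acyclic, edge probabilities are treated as independent, and the MAX_INSTR cutoff is omitted, so propagation stops only at the block passed as `stop`.

```python
from collections import defaultdict


def topo_order(edges, start):
    """Reverse post-order of the blocks reachable from start (acyclic region)."""
    seen, post = set(), []

    def visit(block):
        if block in seen:
            return
        seen.add(block)
        for successor in edges.get(block, {}):
            visit(successor)
        post.append(block)

    visit(start)
    return post[::-1]


def reach_probabilities(edges, start, stop):
    """Probability of reaching each block from start, not propagating past stop."""
    prob = defaultdict(float)
    prob[start] = 1.0
    for block in topo_order(edges, start):
        if block == stop:
            continue  # stop at IPOSDOM
        for successor, p in edges.get(block, {}).items():
            prob[successor] += prob[block] * p
    return dict(prob)


def merging_probabilities(edges, taken, not_taken, stop):
    """pMerging(X) = pT(X) * pNT(X) for blocks reachable on both sides."""
    p_t = reach_probabilities(edges, taken, stop)
    p_nt = reach_probabilities(edges, not_taken, stop)
    return {x: p_t[x] * p_nt[x] for x in p_t if x in p_nt}


# Hypothetical edge profile: block -> {successor: probability}.
# Branch A goes to B when taken and to C when not taken.
edges = {
    "B": {"H": 1.0},
    "C": {"H": 0.5, "D": 0.5},
    "D": {"EXIT": 1.0},
    "H": {"EXIT": 1.0},
}

p_merging = merging_probabilities(edges, taken="B", not_taken="C", stop="EXIT")
```

Here `p_merging["H"]` is 0.5, matching the slide's example. EXIT also appears with probability 1 because it is the stopping block, which plays the role of IPOSDOM in this sketch.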
Compiler Support
Analysis: identify diverge branch candidates and CFM points
• Simple/nested hammocks
• Frequently-hammocks
• Short hammocks
• Return CFM points
• Diverge loop branches
• Chains of CFM points
Compiler Support
Select diverge branches and CFM points
• Heuristics
• Cost-benefit model
Heuristics-Based Selection
Do not select:
• CFM points too far from the diverge branch
• Hammocks with too many branches on each path
• Approximate CFM points with a low probability of merging
Motivation:
• Minimize overhead: wrong path
• Maximize benefit: control-independence
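The three exclusion rules above could be expressed as a simple filter. This is only a sketch: the slides do not give threshold values, so `MAX_CFM_DISTANCE`, `MAX_BRANCHES_PER_PATH`, and `MIN_MERGE_PROB`, along with the `Candidate` record and the sample candidates, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the slides name the criteria but not the values.
MAX_CFM_DISTANCE = 50        # instructions between diverge branch and CFM point
MAX_BRANCHES_PER_PATH = 5    # conditional branches on either path
MIN_MERGE_PROB = 0.3         # minimum pMerging for approximate CFM points


@dataclass
class Candidate:
    branch: str
    cfm: str
    distance: int            # instructions from the branch to the CFM point
    branches_per_path: int   # worst case over the two paths
    merge_prob: float        # 1.0 for exact CFM points


def passes_heuristics(c):
    """Return True if the candidate survives all three exclusion rules."""
    if c.distance > MAX_CFM_DISTANCE:
        return False  # CFM point too far from the diverge branch
    if c.branches_per_path > MAX_BRANCHES_PER_PATH:
        return False  # too many branches on a path
    if c.merge_prob < MIN_MERGE_PROB:
        return False  # approximate CFM point unlikely to merge
    return True


# Hypothetical candidates for illustration.
candidates = [
    Candidate("A1", "H1", distance=20, branches_per_path=2, merge_prob=1.0),
    Candidate("A2", "H2", distance=120, branches_per_path=2, merge_prob=1.0),
    Candidate("A3", "H3", distance=30, branches_per_path=1, merge_prob=0.1),
]
selected = [c.branch for c in candidates if passes_heuristics(c)]
```

With these made-up numbers only the first candidate is kept: the second is rejected for distance and the third for its low merge probability.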
Cost-Benefit Model-Based Selection

Confidence estimation | Cost     | Benefit  | Compared to baseline
High                  | -        | -        | Same
Low, inaccurate       | Overhead | None     | Lose
Low, accurate         | Overhead | No flush | Possibly WIN

cost = (1 - A) × overhead + A × (overhead - misprediction_penalty)
Select if cost < 0
A = accuracy of the confidence estimator = mispredicted / low_conf (the fraction of low-confidence branches that are actually mispredicted)
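A minimal sketch of the selection rule follows. The formula is taken from the slide; the function names and the sample numbers (an overhead of 8 cycles and a 25-cycle penalty) are assumptions for illustration, not measurements from the work.

```python
def predication_cost(accuracy, overhead, misprediction_penalty):
    """Expected cost, in cycles, of dynamically predicating a low-confidence branch.

    accuracy: fraction of low-confidence branches that are really mispredicted.
    overhead: cycles spent on predication (hypothetical input).
    misprediction_penalty: cycles saved when a flush is avoided.
    """
    return ((1 - accuracy) * overhead
            + accuracy * (overhead - misprediction_penalty))


def should_select(accuracy, overhead, misprediction_penalty):
    """Select the diverge branch only if the expected cost is negative."""
    return predication_cost(accuracy, overhead, misprediction_penalty) < 0


# Illustrative numbers only.
cost_good = predication_cost(accuracy=0.4, overhead=8, misprediction_penalty=25)
cost_bad = predication_cost(accuracy=0.2, overhead=8, misprediction_penalty=25)
select_good = should_select(0.4, 8, 25)
select_bad = should_select(0.2, 8, 25)
```

Expanding the formula gives cost = overhead - A × misprediction_penalty, so a branch is selected exactly when the overhead is smaller than the expected saving A × misprediction_penalty. With the sample numbers the first case costs about -2 cycles (selected) and the second about +3 cycles (not selected).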
Methodology
• All compiler algorithms implemented on a binary analysis and annotation toolset.
• Cycle-accurate execution-driven simulation of a DMP processor:
  - Alpha ISA
  - Processor configuration:
    16KB perceptron predictor
    Minimum 25-cycle branch misprediction penalty
    8-wide, 512-entry instruction window
    2KB 12-bit history enhanced JRS confidence estimator
    32 predicate registers, 3 CFM registers
• Benchmarks: 12 SPEC CPU 2000 INT, 5 SPEC 95 INT
Heuristic-Based Selection
[Figure: bar chart of IPC delta (%), y-axis from 0 to 70, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean. Series: exact; exact+freq; exact+freq+short; exact+freq+short+ret; exact+freq+short+ret+loop. Labeled value: 20.4%.]
Cost-Benefit-Based Selection
[Figure: bar chart of IPC delta (%), y-axis from 0 to 70, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean. Series: cost-edge; cost-edge+short; cost-edge+short+ret; cost-edge+short+ret+loop. Labeled value: 20.2%.]
The cost-benefit model is simpler and effective
Input Set Effects
[Figure: bar chart of IPC delta (%), y-axis from 0 to 70, per benchmark: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean. Series: Heuristics-same; Heuristics-diff; Cost-same; Cost-diff. Labeled value: 19.8%.]
Conclusion
• Compiler-microarchitecture interaction is good!
• DMP exploits frequently-hammocks.
• We developed new algorithms that select beneficial diverge branches and CFM points.
• We proposed a new cost-benefit model for dynamic predication.
• DMP and our algorithms improve performance by 20%.
Thank You!
Questions?