Pre-Execution via Speculative Data-Driven Multithreading
Amir Roth, University of Wisconsin, Madison
August 10, 2001
Amir Roth Pre-Execution via Speculative Data-Driven Multithreading
Explanation of Title Slide
• pre-execution: a new way of extracting additional instruction-level parallelism (ILP), and hence performance, from ordinary sequential programs
• speculative data-driven multithreading (DDMT): an implementation of pre-execution
thesis: pre-execution and DDMT are good
Summary of Contributions
• pre-execution (concept)
– idea: execute to get unpredictable but performance-critical values
– technology: proactive out-of-order sequencing, decoupling
• DDMT (proposed implementation)
– idea: extend a superscalar design, siphon pre-execution bandwidth
– technology: register integration
• algorithm for selecting what to pre-execute (framework)
– idea: automatically select from computations executed by the program
– technology: pre-execution benefit-cost function
• performance evaluation (of framework and implementation)
Why Am I Doing This?
• still need higher performance– new app’s, better performance on existing app’s
• still need higher sequential-program performance– many out there, parallel/MT programs composed of sequential code
• need parallelism to complement frequency– frequency getting harder, performance returns diminishing
• need ILP to complement [program,thread,bit]LP
Outline
• motivation and introduction to pre-execution– problem we are solving and potential gain of solving it (3 slides)– pre-execution basics (5 slides)
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
ILP: Incumbent Model
• von Neumann: retire instructions in (program) order
• performance: execute useful instructions each cycle
• the superscalar way
– examine a "sliding window"; dataflow execution within the window
– in-order retirement implements sequential semantics
– out-of-order execution increases the useful execution rate
– in-order fetch establishes data dependences
• performance loss: not enough ready-to-execute useful instructions in window
[diagram: program order runs from fetch through a sliding window to retire]
Value Latency and PDIs
• Problem: value latency– need correct value faster than execution can supply– two important kinds: branch outcomes, load addresses
• branch prediction/address prediction (prefetching)– faster than execution, correct ~95% of the time– last 5% are performance degrading instances (PDIs)
• effects of PDIs
– branch mis-predictions: stall fetch of useful instructions
– cache misses: execution latency stalls retirement, backpressure stalls fetch
[diagram: fetch and execute streams, with LD and BR instances marked in each]
Why Worry About 5%?
• performance gain by "fixing" PDIs
– 8-wide processor, 64KB L1, 1MB L2, 80-cycle memory latency
– branches: not perfect prediction, but perfect resolution (fixup at rename); not as good, but matches our implementation
[chart: Perfect Memory Latency and Branch Resolution; IPC (0 to 8) for BASE vs. PERFECT across em3d, mst, bzip2, crafty, eon.c, eon.k, eon.r, gap, gcc, gzip, mcf, parser, perl.d, perl.s, twolf, vortex, vpr.p, vpr.r]
Pre-Execution
• dilemma– need PDI values faster than execution– can only accurately get PDI values using execution
• pre-execution: execution that is faster than execution– part I: (pre) execute PDIs faster than original program– part II: communicate pre-executed PDI values to original program
– part II: cache for loads, ?? for branches (implementation dependent)– part I: the crux– important: pre-execute PDI computations, otherwise just guessing
Q: How to Execute Faster than Execution?
• A part I: proactive, out-of-order sequencing (fetch)– execute (and fetch) fewer instructions (fetch >> battle/2)– out-of-order: PDI computation only (not full program)– proactive: “know” PDI is coming and its computation– hoist computation (and its latency) arbitrary distances
• A part II: decoupling– pre-execute PDI computation in separate “thread”– “move” stalls to pre-execution thread (inter-thread overlapping)
proactive OoO sequencing + decoupling = pre-execution
Pre-Execution: Example
• data-driven thread (DDT) pre-executes the computation
– OoO sequencing: PDI computation fetched quickly
– decoupling: memory latency overlapped with master thread instructions
– result communication: helps with branches
[diagram: the master thread forks the DDT; the DDT fetches and executes the LD/BR computation early, absorbs the load latency, and sends the branch result back, yielding a speedup over the original program]
How does Processor “Know” PDI is Coming?
• DDT associated with a trigger
– master thread (MT) sees trigger, forks DDT
– assume: MT will execute the PDI via a computation == DDT
– choose DDT/trigger s.t. the assumption is statistically true
• two other possibilities
– (a) master may not execute a computation matching the DDT
– (b) load may hit, branch may be correctly predicted
– useless (actually harmful) pre-execution
– take these probabilities into account
[diagram: MT and DDT side by side, both containing the LD/BR computation]
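The trigger mechanism above can be sketched as a simple lookup at sequencing time; a minimal sketch, assuming dict-based tables (`triggers` and `ddt_cache` are illustrative names, not the actual hardware structures):

```python
def maybe_fork(pc, triggers, ddt_cache):
    """When the master thread sequences a trigger PC, return the
    associated DDT's instructions for injection; otherwise None."""
    ddt_id = triggers.get(pc)
    return ddt_cache[ddt_id] if ddt_id is not None else None

triggers = {0x10: 0}                                  # trigger PC -> DDT id
ddt_cache = {0: ["x18: R1=R1+4", "x2c: R2=ld[R1]"]}   # static DDT bodies
```

Choosing the trigger PC well is what makes the "MT will execute a matching computation" assumption statistically true.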
Role of Pre-Execution
• need for in-order sequencing
– OoO sequence contains all dependences? can't tell, so pre-executed results must be speculative
– can't "interleave" two OoO sequences together
– must in-order sequence the full program once (at least)
– OoO sequencing (pre-execution) must be redundant
• role of pre-execution
– speculative: no correctness obligations
– redundant: (relatively) high cost
– pre-execute anything, just so performance improves
Outline
• intro to pre-execution• pre-execution DDMT style
– DDMT basics (1 slide)
– intro to register integration (2 slides)
– implicit data-driven sequencing (2 slides)
• automated computation selection • DDMT microarchitecture • experimental evaluation
Pre-Execution DDMT Style
• extend a dynamically scheduled superscalar processor
– DDT$ holds static DDTs
– replicate register maps; CMIS manages and schedules DDT "injection"
– DDT instructions look "normal" but are not put into the ROB or retired: they are put into the RS, read/write pregs, and execute on FUs
– IT implements register integration (result sharing via pregs)
– centralized organization: no dedicated PE bandwidth, steal as needed; bonus: bandwidth is available to steal when PE is needed most (overall ILP is low)
[diagram: superscalar pipeline (I$, RS, ROB, pregs, D$) extended with the DDT$, CMIS, and IT structures]
Register Integration
• master thread directly reuses pre-executed results
– saves execution bandwidth, compresses the dataflow graph
– instant branch resolution: an integrated mis-predicted branch is resolved at register renaming time (at fetch would be better, but harder)
• register integration
– like instruction reuse [Sodani], using pregs instead of lregs/values
– reuse test: match PCs (op) & input pregs
– reuse: map the output to the IT entry's output preg
– map-table manipulations only; pregs are not read or written
– pregs naturally track dependence chains (e.g., DDTs)
[diagram: the IT is indexed by (PC, input pregs) and yields an output preg; a reuse buffer (RB), by contrast, is indexed by (PC, input values) and yields a value]
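The reuse test can be sketched as a lookup keyed by PC and input pregs; a minimal sketch of the idea, with the IT modeled as a plain dict rather than the actual set-associative hardware:

```python
def try_integrate(it, pc, input_pregs):
    """Integration test at rename: on a hit, return the output preg a
    DDT already allocated (no execution needed); on a miss, None."""
    return it.get((pc, tuple(input_pregs)))

# IT entries created by the DDT in the running example
it = {(0x18, (7,)): 6,    # x18: R1=R1+4  renamed p6=p7+4
      (0x2c, (6,)): 9}    # x2c: R2=ld[R1] renamed p9=ld[p6]

p6 = try_integrate(it, 0x18, [7])   # master thread integrates x18
p9 = try_integrate(it, 0x2c, [p6])  # hit at x18 sets up the x2c hit
```

Because pregs name dependence chains, one successful integration feeds the key for the next, which is how whole DDT results chain into the master thread.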
Pre-Execution Reuse via Integration
– DDT allocates pregs, creates IT entries– master thread reads IT, integrates pregs allocated by DDT– recursive process: 0x18 integration (p6) sets up 0x2c integration
[example: the DDT renames x18 (R1=R1+4 as p6=p7+4) and x2c (R2=ld[R1] as p9=ld[p6]), allocating pregs and creating IT entries; the master thread, forked with R1 mapped to p7, then integrates p6 at x18, which sets up the x2c integration of p9]
• need two things to make this work– initial R1 mapping (p7) must match (copy map table on fork)– data-dependences must match (instructions and PCs!!)
Sequencing DDTs (OoO sequencing)
• Q: how to implement out-of-order sequencing?– how to sequence instructions with non-contiguous PCs?
• implicit data-driven sequencing– list DDT instructions, inject list as is– processor doesn’t “interpret” DDT branches– branches in DDTs only for subsequent integration
• DDTs are static and finite– good: natural overhead control, no “runaway” DDTs – bad: can’t pre-execute loops (important for latency tolerance)
static DDT:
Trig:
x18: R1=R1+4
x2c: R2=ld[R1]
x30: bz R2, x44
Faking Control-Flow in DDTs
• “trick” processor into pre-executing any control flow– processor doesn’t interpret DDT branches
• faking conditional control– implicit conditional: pre-execute along common path– greedy conditional: pre-execute along both paths
• faking loop control– important for latency tolerance!– unrolling: unroll a loop within a DDT– unoverlapped unrolling: powerful, difficult in DDMT – induction unrolling: unroll induction only
static DDT (induction unrolled):
Trig:
x18: R1=R1+4
x18: R1=R1+4
x18: R1=R1+4
x2c: R2=ld[R1]
x30: bz R2, x44
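Induction unrolling can be sketched as replicating only the induction update ahead of the DDT body, so one DDT instance computes an address several iterations ahead; a toy sketch with an illustrative `(pc, text)` instruction representation:

```python
def induction_unroll(ddt, induction_pc, degree):
    """Prepend degree-1 extra copies of the loop induction instruction
    so the DDT's load targets an address `degree` iterations ahead."""
    induction = next(i for i in ddt if i[0] == induction_pc)
    return [induction] * (degree - 1) + ddt

ddt = [(0x18, "R1=R1+4"), (0x2c, "R2=ld[R1]"), (0x30, "bz R2, x44")]
unrolled = induction_unroll(ddt, 0x18, 3)   # matches the slide's DDT
```

Note that only the cheap induction update is replicated, not the whole loop body, which keeps overhead low while still reaching far enough ahead to tolerate latency.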
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection
– identifying problem instructions (2 slides)– optimizing DDTs for latency tolerance and overhead (cool, 2 slides)– merging DDTs to reduce overhead (ultra boring, 0 slides)
• DDMT microarchitecture • experimental evaluation• summary
Automated DDT Selection Algorithm
• goal: DDTs that hide most PDI latency with least overhead • automated? integration requirement bounds search space
– examine program traces– slice backwards from PDIs to enumerate DDTs– choose DDTs that maximize some benefit-cost function
• 3 steps– identify static problem instructions (PIs)– find DDTs for each PI– merge partially overlapping DDTs to reduce overhead
• implementation– H/W? S/W? VM? (we model S/W, but leave other possibilities open)
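The slice-backwards step can be sketched as a backward walk over a dynamic trace; a minimal sketch under an assumed `(pc, dests, srcs)` instruction tuple format, with trigger choice and benefit-cost scoring omitted:

```python
def backward_slice(trace, pi_index, max_len=16):
    """Backward data slice from a problem instruction over a dynamic
    trace (oldest first); returns a candidate DDT in program order."""
    live = set(trace[pi_index][2])        # registers the PI reads
    ddt = [trace[pi_index]]
    for inst in reversed(trace[:pi_index]):
        pc, dests, srcs = inst
        if live & set(dests):             # inst produces a needed value
            live = (live - set(dests)) | set(srcs)
            ddt.append(inst)
            if len(ddt) == max_len:       # bound DDT length
                break
    return list(reversed(ddt))

# toy trace: an unrelated instruction sits between the address
# increment (x18) and the problem load (x2c)
trace = [(0x10, ('R3',), ('R5',)),
         (0x18, ('R1',), ('R1',)),
         (0x2c, ('R2',), ('R1',))]
ddt = backward_slice(trace, 2)
```

Unrelated instructions fall out of the slice, which is what lets the DDT sequence far fewer instructions than the master thread.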
Problem Instructions (PIs)
• impractical to pre-execute all PDIs– divide PDIs by static instruction– choose static problem instructions (PIs) with good “pre-executability”– only find-DDTs-for / pre-execute-PDIs-of PIs
• good pre-executability criteria– problem ratio: high ratio of PDIs (high miss/misprediction rate)– problem contribution: high PDI representation– problem latency: high latency per PDI
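These three criteria can be sketched as a simple filter; the counter field names are illustrative, and the thresholds echo the PI definition given on the next slide (1-in-500 contribution, 10% problem ratio):

```python
def is_problem_instruction(s, total_pdis, min_latency):
    """s: per-static-instruction counters {execs, pdis, pdi_latency}."""
    ratio = s['pdis'] / s['execs']               # miss/mispredict rate
    contribution = s['pdis'] / total_pdis        # share of all PDIs
    avg_latency = s['pdi_latency'] / s['pdis']   # cycles per PDI
    return (ratio >= 0.10 and contribution >= 1 / 500
            and avg_latency >= min_latency)

# hypothetical static load: 1000 executions, 150 misses, 3000 miss cycles
load_stats = {'execs': 1000, 'pdis': 150, 'pdi_latency': 3000}
```

All three tests must pass: a high-ratio instruction that rarely executes, or a frequent one whose misses are cheap, is not worth a DDT.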
Potential of Pre-Executing PIs
• PI definition– contribution: 1 in 500 PDIs– ratio: 1 in 10 PDIs (10% miss/misprediction rate)– latency: 10 cycles for loads, 5 for branches
[chart: Performance Potential of Perfecting Problem Instructions; IPC for BASE, PERFECT PI, and PERFECT across the benchmark suite]
Selecting DDTs for a Single PI
• simple case– PDIs computed by one slice– choose sub-slice that should be DDT
• in general– multiple, partially overlapping slices– choose set of non-overlapping DDTs
• approach– mantra: maximize latency tolerance, minimize overhead– aggregate (over all pre-executions)– compute benefit-cost (LT-OH) for each potential DDT– choose DDT with maximum benefit-cost
[example: the candidate DDT (x18: R1=R1+4, unrolled; x2c: R2=ld[R1]; x30: bz R2, x44) with several possible trigger placements]
Pre-Execution Benefit-Cost Function
• aggregate advantage: ADV_AGG = LT_AGG - OH_AGG
• LT_AGG = (# PDIs covered) * LT_PDI
– don't count pre-executions for cache hits (no latency to tolerate)
– LT_PDI = EXECT_MT - EXECT_DDT; sequencing-constrained dataflow height (SCDH) approximates EXECT
– LT_PDI <= problem latency (can't tolerate more latency than there is)
• OH_AGG = (# pre-executions) * OH_PE
– all pre-executions count, even those covering cache hits or with no corresponding loads
– OH_PE = rename bandwidth consumed, in cycles (the most direct overhead measure)
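The function can be written down directly; a minimal sketch with illustrative names (the EXECT values would come from the SCDH approximation):

```python
def aggregate_advantage(pdis_covered, exect_mt, exect_ddt,
                        problem_latency, pre_executions, oh_pe):
    """ADV_AGG = LT_AGG - OH_AGG for one candidate DDT."""
    # per-PDI latency tolerated, capped at the latency there is to hide
    lt_pdi = min(exect_mt - exect_ddt, problem_latency)
    lt_agg = pdis_covered * lt_pdi       # benefit: latency tolerated
    oh_agg = pre_executions * oh_pe      # cost: rename bandwidth burned
    return lt_agg - oh_agg
```

Note the asymmetry the slide calls out: only covered PDIs contribute benefit, but every pre-execution (including useless ones) contributes cost.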
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection• DDMT microarchitecture
– implementing register integration (1 slide)– other implementation notes (1 slide)
• experimental evaluation• summary
Implementing Register Integration
• more physical registers– to keep pre-executed DDT results alive longer, pipestage++?
• integration circuit
– looks like the conventional register-renaming dependence cross-check
– the IT may be M-way associative (M candidates per instruction)
– circuit complexity: NM^2 (M > 4 not advised), pipestage++?
• integrating loads– DDTs may miss conflicting stores from master thread– not detected by integration (preg-based): coherence problem– solution: re-execute integrated loads & squash (alt: snoop)– learn which load integrations cause squashes, don’t integrate them
Other Implementation Notes
• forking– only master thread can fork, no “chaining”
• injection scheduling– Q: how fast should DDTs be “injected”?– A: at dataflow speed, but no faster (approximate with DDT-1)
• stores and DDT memory communication– DDTs can contain stores, can’t write DDT stores into D$– small queue (DDSQ): write DDT stores, direct “right” DDT loads
• exceptions– buffer until instruction integrated (potentially abort rest of DDT)
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection• DDMT microarchitecture• experimental evaluation
– numbers (3 slides)– more numbers (6 slides)– explanations of numbers (2 slides)
Experimental Framework
• SPECint2K, 2 Olden microbenchmarks, Alpha EV6, –O3 –fast– training runs, 10% sampling
• SimpleScalar-based simulation environment– 8-wide, superscalar, out-of-order– 128 ROB, 64 LDQ, 32 STQ, 80 RS– Pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 3 L1 hit– 32KB IL1/64KB DL1 (2-way), 1MB L2$ (4-way), mem b/w: 8 b/cyc.– 1024 pregs, 1024-entry, 4-way IT (baseline does squash reuse)
• methodology– DDT selection/DDMT: same input sample (for now)
DDMT Performance
• microbenchmarks: +50%, SPEC2K: +1%-10%– avg. load latency reduced up to 20%– avg. branch resolution latency reduced up to 40%– 15%-40% of pre-executed instructions integrated– inability to use unoverlapped full unrolling hurts bzip2, gzip, mcf, vpr
[chart: DDMT Performance; IPC for BASE, DDMT, and PERFECT PI across the benchmark suite]
Importance of Integration
• no branch resolution: don’t integrate pre-executed branches– -3% for branch pre-execution benchmarks: crafty, eon, vpr.p, twolf
• no pre-execution reuse: don’t integrate any DDT result– hurts branch pre-execution & high IPC programs, DDMT < base– scheduling/RS contention important in higher IPC cases (intuitive)
[chart: Effect of Register Integration; IPC for BASE, DDMT, NO BRANCH RESOLUTION, and NO PRE-EXECUTION REUSE across the benchmark suite]
Another Way of Measuring Overhead
• remove overhead from DDMT– renaming, scheduling re-execution B/W free, limit master RS only– +2-3% performance– current IT formulation suppresses DDMT overhead (more later)
[chart: Overhead-less DDMT; IPC for BASE, DDMT, OVERHEAD-LESS DDMT, and PERFECT PI across the benchmark suite]
Sensitivity: DDT Selection Implementation
• model via relationship of DDT selection input to DDMT input– limit: same input, H/W implementation? (default)– offline: different input, S/W implementation– online: different sample within same input, VM implementation
DDT selection is insensitive to its input: DDTs = f(program structure)
[chart: Stability of DDMT across DDT Selection Inputs; IPC for BASE, LIMIT, OFFLINE, and ONLINE across the benchmark suite]
Sensitivity: Integration Associativity
• high associativity increases successful integration rate– baseline (squash reuse only) not very sensitive to associativity– integration reduces RS/scheduler contention– low associativity interferes with unrolling (in current IT formulation)
[chart: Impact of IT Associativity; IPC for BASE, DIRECT MAPPED, 2-WAY, 4-WAY (default), 8-WAY, and FULLY ASSOCIATIVE across the benchmark suite]
Sensitivity: Unrolling Degree
• max DDT length 64, fully associative IT (to avoid interference)– increased unrolling important for tolerating memory latencies– but increases OH– modulo unrolling, increased allowed DDT size/scope doesn’t help
DDTs no longer than needed to tolerate required latency
[chart: Impact of DDT Unrolling Degree; IPC for BASE, UNROLL2, UNROLL4 (default), and UNROLL8 across the benchmark suite]
Sensitivity: Memory Latency
• double memory latency: 140 cycles– more latency per PDI– increase maximum unrolling degree to 8 (also 8-way associative IT)– higher relative speedups
[chart: Sensitivity to Memory Latency; IPC for BASE and DDMT at 70- and 140-cycle memory latencies across the benchmark suite]
Sensitivity: Cache Size
• cut cache sizes by 4: DL1=16KB, L2=256KB
– more PDIs
– roughly the same absolute speedups, higher relative ones
– increased contention
[chart: Sensitivity to Cache Size; IPC for BASE and DDMT with DL1=64KB/L2=1MB vs. DL1=16KB/L2=256KB across the benchmark suite]
IT Formulation?
• old: IT doubles as ledger for pregs allocated by DDTs– if evicted from IT, preg is freed, downstream DDT destroyed– keeps overhead (RS contention) down– only need to re-execute integrated loads– restricts effective unrolling degree to IT associativity– requires incremental invalidations on IT (associative matches)
• new: decouple pre-execution from presence in IT – re-execute all integrated instructions– no incremental invalidations– downstream DDT not destroyed on IT eviction
Preliminary New-Formulation Results
• 3-5% better for some, 1-2% worse for others– RS contention increases greatly (too many not-ready DDT instr’s)– DDT-1 injection too aggressive, slower policy ties up DDT contexts– different character than older formulation, needs work
[chart: New IT Formulation; IPC for BASE, DDMT-OLD, and DDMT-NEW across the benchmark suite]
Evaluation Summary
• performance– does well on microbenchmarks (like it’s supposed to)– modest to moderate gains on SPEC2K + aggressive baseline– relatively better the more there is to do (PDIs or latency per PDI)
• limitations (future work?)
– a pre-execution/branch predictor interface would be nice (others and I have looked at this)
– difficulties with unoverlapped unrolling: an external unrolling mechanism, static selection framework
– need more RS entries: clustered RS queues should work well, since DDTs are dependence chains
The End
• pre-execution: more ILP from sequential programs– attacks performance problems directly– key technologies: proactive out-of-order sequencing + decoupling
• DDMT: a superscalar-friendly implementation– no dedicated pre-execution bandwidth– register integration for pre-execution reuse
• automated DDT selection– stable, has the right knobs
Outline
• motivation and introduction to pre-execution
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
Tuning DDT Selection
• can’t change DDT structure, but can control length
– longer DDTs tolerate more latency, but incur more overhead and cover fewer PDIs
• from below– minimal latency tolerance
• from above– maximum length, slicing window size, unrolling degree
• upshot: maximum ADVAGG DDT has characteristic length– loosening controls doesn’t make much difference
Related Work: Architectures
• dataflow architectures [Dennis75], Manchester [Gurd+85], TTDA [Arvind+90], ETS [Culler+90]
– decoupling, data-driven fetch to the limit, but no sequential interface– pre-execution: sequential interface with speculative dataflow helper
[UW-CSTR-#1411]
• decoupled access/execute architecture [Smith82]
– decoupling, single execution, proactive ooo?– pre-execution: speculative decoupled miss/execute micro-architecture
[MEDEA’00]
Related Work: Microarchitectures
• decoupled runahead slipstream [Rotenberg+00]
– decoupled, not proactive out-of-order
• speculative thread-level parallelism Multiscalar [Franklin+93], SPSM [Dubey+95], DMT [Akkary+98]
– decoupled, not proactive out-of-order
Pre-Execution Genealogy
[diagram: family tree of pre-execution proposals: Assisted Execution, Dependence-Based Prefetching, SSMT, Dependence-Based Target Pre-Computation, Branch Flow Microarchitecture, DDMT, Speculative Dataflow, Speculative Slices, Speculative Pre-Computation, Slice Processors]
Performance: Separating the Effects
• full DDMT
• no integration
– loads: prefetching still OK; mcf unrolling needs integration
– branches: effects lost; em3d DDT prefetches too
• no decoupling
– DDTs as scheduler hints ("problem-aware OoO")
– no speedup
[charts: speedup (%) for loads (mcf, vpr, mst) and for branches (eon, gzip, em3d) under full DDMT, no integration (-integ), and no decoupling (-decoup)]
Pre-Execution vs. Inlined Helper Code
• must hoist computation – must copy, straight hoist will create WAR dependences– hoist past/out-of/into procedures?– schedule past/out-of/into procedures?– non-binding prefetches or inline stalls– branch pre-execution?
• pre-execution vs. Itanium®– speculative loads ease hoisting past procedures– but that’s it
Loads in DDTs
• missed memory dependences (store invalidations)?– problem is with integration– keep address/value pairs in integration table (shadow MOB)– snoop
• multiprocessors?– pre-execution is sequentially consistent (SC)– just radical out-of-order– actual execution, snoopable state for all loads– many cases occur in uniprocessor as interactions with main thread
Contributions: Prelim vs. Thesis
• Prelim
– implementation: Speculative Dataflow (TTDA), DDMT (SMT base)
– applications: prefetch data, pre-compute branches, "smooth" ILP
– automated pre-execution computation selection (enough to get by)
• Dissertation– implementation: DDMT (superscalar based)– applications: prefetch data, pre-compute branches– automated computation selection
Superscalar: Obstacles to ILP
• backpressure, backpressure…– out-of-order retirement? can’t do– larger window (ROB)? hard (engineering) – bigger useful window? harder (P[no mis-pred] shrinks)– out-of-order fetch? getting to that
Calculating Execution Times
• why do DDTs execute faster than MT?– sequence fewer instructions!– SCDH (Sequencing Constrained Dataflow Height): accounts for this
[tables: per-instruction DST, SC, and SCDH values for the x18/x2c/x30 computation as sequenced within the master thread vs. as a separate DDT; the DDT's SCDH is far lower because it sequences only the slice]
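A sketch of how an SCDH-style estimate might be computed, assuming instruction tuples `(dests, srcs, latency)` and a sequencing width; this illustrates the idea (start time bounded by both fetch order and input readiness), not the dissertation's exact formulation:

```python
def scdh(insts, width=8):
    """Sequencing-constrained dataflow height: an instruction starts no
    earlier than its sequencing cycle (in-order, `width` per cycle) and
    no earlier than the cycle its inputs become ready."""
    ready = {}                                # reg -> cycle value is ready
    height = 0
    for i, (dests, srcs, lat) in enumerate(insts):
        seq = i // width                      # sequencing constraint
        start = max([seq] + [ready.get(r, 0) for r in srcs])
        done = start + lat
        for d in dests:
            ready[d] = done
        height = max(height, done)
    return height

# the example DDT: two address increments feeding a 10-cycle load
ddt = [(('R1',), ('R1',), 1),
       (('R1',), ('R1',), 1),
       (('R2',), ('R1',), 10)]
```

Because the DDT sequences only the slice, its sequencing term is tiny and its height is essentially the dependence chain, which is exactly why EXECT_DDT comes out so much smaller than EXECT_MT.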