Pre-Execution via Speculative Data-Driven Multithreading
Amir Roth, University of Wisconsin, Madison
August 10, 2001
Amir Roth Pre-Execution via Speculative Data-Driven Multithreading
Explanation of Title Slide
• pre-execution: a new way of extracting additional instruction-level parallelism (ILP), and hence performance, from ordinary sequential programs
• speculative data-driven multithreading (DDMT): an implementation of pre-execution
thesis: pre-execution and DDMT are good
Summary of Contributions
• pre-execution (concept)
– idea: execute to get unpredictable but performance-critical values
– technology: proactive out-of-order sequencing, decoupling
• DDMT (proposed implementation)
– idea: extend a superscalar design, siphon pre-execution bandwidth
– technology: register integration
• algorithm for selecting what to pre-execute (framework)
– idea: automatically select from computations executed by the program
– technology: pre-execution benefit-cost function
• performance evaluation (of framework and implementation)
Why Am I Doing This?
• still need higher performance– new app’s, better performance on existing app’s
• still need higher sequential-program performance– many out there, parallel/MT programs composed of sequential code
• need parallelism to complement frequency– frequency getting harder, performance returns diminishing
• need ILP to complement [program,thread,bit]LP
Outline
• motivation and introduction to pre-execution– problem we are solving and potential gain of solving it (3 slides)– pre-execution basics (5 slides)
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
ILP: Incumbent Model
• von Neumann: retire instructions in (program) order
• performance: execute useful instructions each cycle
• the superscalar way
– examine a "sliding window"; dataflow execution within the window
– in-order retirement implements sequential semantics
– out-of-order execution increases the useful execution rate
– in-order fetch establishes data dependences
• performance loss: not enough ready-to-execute useful instructions in window
[diagram: program order runs from fetch through a sliding window to retire]
Value Latency and PDIs
• Problem: value latency– need correct value faster than execution can supply– two important kinds: branch outcomes, load addresses
• branch prediction/address prediction (prefetching)– faster than execution, correct ~95% of the time– last 5% are performance degrading instances (PDIs)
• effects of PDIs
– branch mis-predictions: stall fetch of useful instructions
– cache misses: execution latency stalls retirement, backpressure stalls fetch
[diagram: fetch and execute streams, with LD and BR instances marked in each]
Why Worry About 5%?
• performance gain by "fixing" PDIs
– 8-wide processor, 64KB L1, 1MB L2, 80-cycle memory latency
– branches: not perfect prediction, but perfect resolution (fixup at rename); not as good, but matches our implementation
[chart: Perfect Memory Latency and Branch Resolution; IPC (0 to 8) for BASE vs. PERFECT across em3d, mst, bzip2, crafty, eon.c, eon.k, eon.r, gap, gcc, gzip, mcf, parser, perl.d, perl.s, twolf, vortex, vpr.p, vpr.r]
Pre-Execution
• dilemma– need PDI values faster than execution– can only accurately get PDI values using execution
• pre-execution: execution that is faster than execution– part I: (pre) execute PDIs faster than original program– part II: communicate pre-executed PDI values to original program
– part II: cache for loads, ?? for branches (implementation dependent)– part I: the crux– important: pre-execute PDI computations, otherwise just guessing
Q: How to Execute Faster than Execution?
• A part I: proactive, out-of-order sequencing (fetch)– execute (and fetch) fewer instructions (fetch >> battle/2)– out-of-order: PDI computation only (not full program)– proactive: “know” PDI is coming and its computation– hoist computation (and its latency) arbitrary distances
• A part II: decoupling– pre-execute PDI computation in separate “thread”– “move” stalls to pre-execution thread (inter-thread overlapping)
proactive OoO sequencing + decoupling = pre-execution
Pre-Execution: Example
• data-driven thread (DDT) pre-executes the computation
– OoO sequencing: PDI computation fetched quickly
– decoupling: memory latency overlapped with master thread instructions
– result communication: helps with branches
[diagram: the master thread forks the DDT; the DDT fetches and executes the LD/BR computation early, absorbs the load latency, and sends the branch result back, yielding a speedup over the original program]
How does Processor “Know” PDI is Coming?
• DDT associated with a trigger
– master thread (MT) sees trigger, forks DDT
– assume: MT will execute the PDI via a computation == DDT
– choose DDT/trigger s.t. the assumption is statistically true
• two other possibilities
– (a) master may not execute a computation matching the DDT
– (b) load may hit, branch may be correctly predicted
– useless (actually harmful) pre-execution
– take these probabilities into account
[diagram: MT and DDT side by side, both containing the LD/BR computation]
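The trigger mechanism above can be sketched as a simple lookup at sequencing time; a minimal sketch, assuming dict-based tables (`triggers` and `ddt_cache` are illustrative names, not the actual hardware structures):

```python
def maybe_fork(pc, triggers, ddt_cache):
    """When the master thread sequences a trigger PC, return the
    associated DDT's instructions for injection; otherwise None."""
    ddt_id = triggers.get(pc)
    return ddt_cache[ddt_id] if ddt_id is not None else None

triggers = {0x10: 0}                                  # trigger PC -> DDT id
ddt_cache = {0: ["x18: R1=R1+4", "x2c: R2=ld[R1]"]}   # static DDT bodies
```

Choosing the trigger PC well is what makes the "MT will execute a matching computation" assumption statistically true.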
Role of Pre-Execution
• need for in-order sequencing
– OoO sequence contains all dependences? can't tell, so pre-executed results must be speculative
– can't "interleave" two OoO sequences together
– must in-order sequence the full program once (at least)
– OoO sequencing (pre-execution) must be redundant
• role of pre-execution
– speculative: no correctness obligations
– redundant: (relatively) high cost
– pre-execute anything, just so performance improves
Outline
• intro to pre-execution• pre-execution DDMT style
– DDMT basics (1 slide)
– intro to register integration (2 slides)
– implicit data-driven sequencing (2 slides)
• automated computation selection • DDMT microarchitecture • experimental evaluation
Pre-Execution DDMT Style
• extend a dynamically scheduled superscalar processor
– DDT$ holds static DDTs
– replicate register maps; CMIS manages and schedules DDT "injection"
– DDT instructions look "normal" but are not put into the ROB or retired: they are put into the RS, read/write pregs, and execute on FUs
– IT implements register integration (result sharing via pregs)
– centralized organization: no dedicated PE bandwidth, steal as needed; bonus: bandwidth is available to steal when PE is needed most (overall ILP is low)
[diagram: superscalar pipeline (I$, RS, ROB, pregs, D$) extended with the DDT$, CMIS, and IT structures]
Register Integration
• master thread directly reuses pre-executed results
– saves execution bandwidth, compresses the dataflow graph
– instant branch resolution: an integrated mis-predicted branch is resolved at register renaming time (at fetch would be better, but harder)
• register integration
– like instruction reuse [Sodani], using pregs instead of lregs/values
– reuse test: match PCs (op) & input pregs
– reuse: map the output to the IT entry's output preg
– map-table manipulations only; pregs are not read or written
– pregs naturally track dependence chains (e.g., DDTs)
[diagram: the IT is indexed by (PC, input pregs) and yields an output preg; a reuse buffer (RB), by contrast, is indexed by (PC, input values) and yields a value]
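The reuse test can be sketched as a lookup keyed by PC and input pregs; a minimal sketch of the idea, with the IT modeled as a plain dict rather than the actual set-associative hardware:

```python
def try_integrate(it, pc, input_pregs):
    """Integration test at rename: on a hit, return the output preg a
    DDT already allocated (no execution needed); on a miss, None."""
    return it.get((pc, tuple(input_pregs)))

# IT entries created by the DDT in the running example
it = {(0x18, (7,)): 6,    # x18: R1=R1+4  renamed p6=p7+4
      (0x2c, (6,)): 9}    # x2c: R2=ld[R1] renamed p9=ld[p6]

p6 = try_integrate(it, 0x18, [7])   # master thread integrates x18
p9 = try_integrate(it, 0x2c, [p6])  # hit at x18 sets up the x2c hit
```

Because pregs name dependence chains, one successful integration feeds the key for the next, which is how whole DDT results chain into the master thread.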
Pre-Execution Reuse via Integration
– DDT allocates pregs, creates IT entries– master thread reads IT, integrates pregs allocated by DDT– recursive process: 0x18 integration (p6) sets up 0x2c integration
[example: the DDT renames x18 (R1=R1+4 as p6=p7+4) and x2c (R2=ld[R1] as p9=ld[p6]), allocating pregs and creating IT entries; the master thread, forked with R1 mapped to p7, then integrates p6 at x18, which sets up the x2c integration of p9]
• need two things to make this work– initial R1 mapping (p7) must match (copy map table on fork)– data-dependences must match (instructions and PCs!!)
Sequencing DDTs (OoO sequencing)
• Q: how to implement out-of-order sequencing?– how to sequence instructions with non-contiguous PCs?
• implicit data-driven sequencing– list DDT instructions, inject list as is– processor doesn’t “interpret” DDT branches– branches in DDTs only for subsequent integration
• DDTs are static and finite– good: natural overhead control, no “runaway” DDTs – bad: can’t pre-execute loops (important for latency tolerance)
static DDT:
Trig:
x18: R1=R1+4
x2c: R2=ld[R1]
x30: bz R2, x44
Faking Control-Flow in DDTs
• “trick” processor into pre-executing any control flow– processor doesn’t interpret DDT branches
• faking conditional control– implicit conditional: pre-execute along common path– greedy conditional: pre-execute along both paths
• faking loop control– important for latency tolerance!– unrolling: unroll a loop within a DDT– unoverlapped unrolling: powerful, difficult in DDMT – induction unrolling: unroll induction only
static DDT (induction unrolled):
Trig:
x18: R1=R1+4
x18: R1=R1+4
x18: R1=R1+4
x2c: R2=ld[R1]
x30: bz R2, x44
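Induction unrolling can be sketched as replicating only the induction update ahead of the DDT body, so one DDT instance computes an address several iterations ahead; a toy sketch with an illustrative `(pc, text)` instruction representation:

```python
def induction_unroll(ddt, induction_pc, degree):
    """Prepend degree-1 extra copies of the loop induction instruction
    so the DDT's load targets an address `degree` iterations ahead."""
    induction = next(i for i in ddt if i[0] == induction_pc)
    return [induction] * (degree - 1) + ddt

ddt = [(0x18, "R1=R1+4"), (0x2c, "R2=ld[R1]"), (0x30, "bz R2, x44")]
unrolled = induction_unroll(ddt, 0x18, 3)   # matches the slide's DDT
```

Note that only the cheap induction update is replicated, not the whole loop body, which keeps overhead low while still reaching far enough ahead to tolerate latency.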
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection
– identifying problem instructions (2 slides)– optimizing DDTs for latency tolerance and overhead (cool, 2 slides)– merging DDTs to reduce overhead (ultra boring, 0 slides)
• DDMT microarchitecture • experimental evaluation• summary
Automated DDT Selection Algorithm
• goal: DDTs that hide most PDI latency with least overhead • automated? integration requirement bounds search space
– examine program traces– slice backwards from PDIs to enumerate DDTs– choose DDTs that maximize some benefit-cost function
• 3 steps– identify static problem instructions (PIs)– find DDTs for each PI– merge partially overlapping DDTs to reduce overhead
• implementation– H/W? S/W? VM? (we model S/W, but leave other possibilities open)
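The slice-backwards step can be sketched as a backward walk over a dynamic trace; a minimal sketch under an assumed `(pc, dests, srcs)` instruction tuple format, with trigger choice and benefit-cost scoring omitted:

```python
def backward_slice(trace, pi_index, max_len=16):
    """Backward data slice from a problem instruction over a dynamic
    trace (oldest first); returns a candidate DDT in program order."""
    live = set(trace[pi_index][2])        # registers the PI reads
    ddt = [trace[pi_index]]
    for inst in reversed(trace[:pi_index]):
        pc, dests, srcs = inst
        if live & set(dests):             # inst produces a needed value
            live = (live - set(dests)) | set(srcs)
            ddt.append(inst)
            if len(ddt) == max_len:       # bound DDT length
                break
    return list(reversed(ddt))

# toy trace: an unrelated instruction sits between the address
# increment (x18) and the problem load (x2c)
trace = [(0x10, ('R3',), ('R5',)),
         (0x18, ('R1',), ('R1',)),
         (0x2c, ('R2',), ('R1',))]
ddt = backward_slice(trace, 2)
```

Unrelated instructions fall out of the slice, which is what lets the DDT sequence far fewer instructions than the master thread.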
Problem Instructions (PIs)
• impractical to pre-execute all PDIs– divide PDIs by static instruction– choose static problem instructions (PIs) with good “pre-executability”– only find-DDTs-for / pre-execute-PDIs-of PIs
• good pre-executability criteria– problem ratio: high ratio of PDIs (high miss/misprediction rate)– problem contribution: high PDI representation– problem latency: high latency per PDI
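These three criteria can be sketched as a simple filter; the counter field names are illustrative, and the thresholds echo the PI definition given on the next slide (1-in-500 contribution, 10% problem ratio):

```python
def is_problem_instruction(s, total_pdis, min_latency):
    """s: per-static-instruction counters {execs, pdis, pdi_latency}."""
    ratio = s['pdis'] / s['execs']               # miss/mispredict rate
    contribution = s['pdis'] / total_pdis        # share of all PDIs
    avg_latency = s['pdi_latency'] / s['pdis']   # cycles per PDI
    return (ratio >= 0.10 and contribution >= 1 / 500
            and avg_latency >= min_latency)

# hypothetical static load: 1000 executions, 150 misses, 3000 miss cycles
load_stats = {'execs': 1000, 'pdis': 150, 'pdi_latency': 3000}
```

All three tests must pass: a high-ratio instruction that rarely executes, or a frequent one whose misses are cheap, is not worth a DDT.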
Potential of Pre-Executing PIs
• PI definition– contribution: 1 in 500 PDIs– ratio: 1 in 10 PDIs (10% miss/misprediction rate)– latency: 10 cycles for loads, 5 for branches
[chart: Performance Potential of Perfecting Problem Instructions; IPC for BASE, PERFECT PI, and PERFECT across the benchmark suite]
Selecting DDTs for a Single PI
• simple case– PDIs computed by one slice– choose sub-slice that should be DDT
• in general– multiple, partially overlapping slices– choose set of non-overlapping DDTs
• approach– mantra: maximize latency tolerance, minimize overhead– aggregate (over all pre-executions)– compute benefit-cost (LT-OH) for each potential DDT– choose DDT with maximum benefit-cost
[example: the candidate DDT (x18: R1=R1+4, unrolled; x2c: R2=ld[R1]; x30: bz R2, x44) with several possible trigger placements]
Pre-Execution Benefit-Cost Function
• aggregate advantage: ADV_AGG = LT_AGG - OH_AGG
• LT_AGG = (# PDIs covered) * LT_PDI
– don't count pre-executions for cache hits (no latency to tolerate)
– LT_PDI = EXECT_MT - EXECT_DDT; sequencing-constrained dataflow height (SCDH) approximates EXECT
– LT_PDI <= problem latency (can't tolerate more latency than there is)
• OH_AGG = (# pre-executions) * OH_PE
– all pre-executions count, even those covering cache hits or with no corresponding loads
– OH_PE = rename bandwidth consumed, in cycles (the most direct overhead measure)
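The function can be written down directly; a minimal sketch with illustrative names (the EXECT values would come from the SCDH approximation):

```python
def aggregate_advantage(pdis_covered, exect_mt, exect_ddt,
                        problem_latency, pre_executions, oh_pe):
    """ADV_AGG = LT_AGG - OH_AGG for one candidate DDT."""
    # per-PDI latency tolerated, capped at the latency there is to hide
    lt_pdi = min(exect_mt - exect_ddt, problem_latency)
    lt_agg = pdis_covered * lt_pdi       # benefit: latency tolerated
    oh_agg = pre_executions * oh_pe      # cost: rename bandwidth burned
    return lt_agg - oh_agg
```

Note the asymmetry the slide calls out: only covered PDIs contribute benefit, but every pre-execution (including useless ones) contributes cost.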
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection• DDMT microarchitecture
– implementing register integration (1 slide)– other implementation notes (1 slide)
• experimental evaluation• summary
Implementing Register Integration
• more physical registers– to keep pre-executed DDT results alive longer, pipestage++?
• integration circuit
– looks like the conventional register-renaming dependence cross-check
– the IT may be M-way associative (M candidates per instruction)
– circuit complexity: NM^2 (M > 4 not advised), pipestage++?
• integrating loads– DDTs may miss conflicting stores from master thread– not detected by integration (preg-based): coherence problem– solution: re-execute integrated loads & squash (alt: snoop)– learn which load integrations cause squashes, don’t integrate them
Other Implementation Notes
• forking– only master thread can fork, no “chaining”
• injection scheduling– Q: how fast should DDTs be “injected”?– A: at dataflow speed, but no faster (approximate with DDT-1)
• stores and DDT memory communication– DDTs can contain stores, can’t write DDT stores into D$– small queue (DDSQ): write DDT stores, direct “right” DDT loads
• exceptions– buffer until instruction integrated (potentially abort rest of DDT)
Outline
• intro to pre-execution• pre-execution DDMT style• automated DDT selection• DDMT microarchitecture• experimental evaluation
– numbers (3 slides)– more numbers (6 slides)– explanations of numbers (2 slides)
Experimental Framework
• SPECint2K, 2 Olden microbenchmarks, Alpha EV6, –O3 –fast– training runs, 10% sampling
• SimpleScalar-based simulation environment– 8-wide, superscalar, out-of-order– 128 ROB, 64 LDQ, 32 STQ, 80 RS– Pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 3 L1 hit– 32KB IL1/64KB DL1 (2-way), 1MB L2$ (4-way), mem b/w: 8 b/cyc.– 1024 pregs, 1024-entry, 4-way IT (baseline does squash reuse)
• methodology– DDT selection/DDMT: same input sample (for now)
DDMT Performance
• microbenchmarks: +50%, SPEC2K: +1%-10%– avg. load latency reduced up to 20%– avg. branch resolution latency reduced up to 40%– 15%-40% of pre-executed instructions integrated– inability to use unoverlapped full unrolling hurts bzip2, gzip, mcf, vpr
[chart: DDMT Performance; IPC for BASE, DDMT, and PERFECT PI across the benchmark suite]
Importance of Integration
• no branch resolution: don’t integrate pre-executed branches– -3% for branch pre-execution benchmarks: crafty, eon, vpr.p, twolf
• no pre-execution reuse: don’t integrate any DDT result– hurts branch pre-execution & high IPC programs, DDMT < base– scheduling/RS contention important in higher IPC cases (intuitive)
[chart: Effect of Register Integration; IPC for BASE, DDMT, NO BRANCH RESOLUTION, and NO PRE-EXECUTION REUSE across the benchmark suite]
Another Way of Measuring Overhead
• remove overhead from DDMT– renaming, scheduling re-execution B/W free, limit master RS only– +2-3% performance– current IT formulation suppresses DDMT overhead (more later)
[chart: Overhead-less DDMT; IPC for BASE, DDMT, OVERHEAD-LESS DDMT, and PERFECT PI across the benchmark suite]
Sensitivity: DDT Selection Implementation
• model via relationship of DDT selection input to DDMT input– limit: same input, H/W implementation? (default)– offline: different input, S/W implementation– online: different sample within same input, VM implementation
DDT selection is insensitive to its input: DDTs = f(program structure)
[chart: Stability of DDMT across DDT Selection Inputs; IPC for BASE, LIMIT, OFFLINE, and ONLINE across the benchmark suite]
Sensitivity: Integration Associativity
• high associativity increases successful integration rate– baseline (squash reuse only) not very sensitive to associativity– integration reduces RS/scheduler contention– low associativity interferes with unrolling (in current IT formulation)
[chart: Impact of IT Associativity; IPC for BASE, DIRECT MAPPED, 2-WAY, 4-WAY (default), 8-WAY, and FULLY ASSOCIATIVE across the benchmark suite]
Sensitivity: Unrolling Degree
• max DDT length 64, fully associative IT (to avoid interference)– increased unrolling important for tolerating memory latencies– but increases OH– modulo unrolling, increased allowed DDT size/scope doesn’t help
DDTs no longer than needed to tolerate required latency
[chart: Impact of DDT Unrolling Degree; IPC for BASE, UNROLL2, UNROLL4 (default), and UNROLL8 across the benchmark suite]
Sensitivity: Memory Latency
• double memory latency: 140 cycles– more latency per PDI– increase maximum unrolling degree to 8 (also 8-way associative IT)– higher relative speedups
[chart: Sensitivity to Memory Latency; IPC for BASE and DDMT at 70- and 140-cycle memory latencies across the benchmark suite]
Sensitivity: Cache Size
• cut cache sizes by 4: DL1=16KB, L2=256KB
– more PDIs
– roughly the same absolute speedups, higher relative ones
– increased contention
[chart: Sensitivity to Cache Size; IPC for BASE and DDMT with DL1=64KB/L2=1MB vs. DL1=16KB/L2=256KB across the benchmark suite]
IT Formulation?
• old: IT doubles as ledger for pregs allocated by DDTs– if evicted from IT, preg is freed, downstream DDT destroyed– keeps overhead (RS contention) down– only need to re-execute integrated loads– restricts effective unrolling degree to IT associativity– requires incremental invalidations on IT (associative matches)
• new: decouple pre-execution from presence in IT – re-execute all integrated instructions– no incremental invalidations– downstream DDT not destroyed on IT eviction
Preliminary New-Formulation Results
• 3-5% better for some, 1-2% worse for others– RS contention increases greatly (too many not-ready DDT instr’s)– DDT-1 injection too aggressive, slower policy ties up DDT contexts– different character than older formulation, needs work
[chart: New IT Formulation; IPC for BASE, DDMT-OLD, and DDMT-NEW across the benchmark suite]
Evaluation Summary
• performance– does well on microbenchmarks (like it’s supposed to)– modest to moderate gains on SPEC2K + aggressive baseline– relatively better the more there is to do (PDIs or latency per PDI)
• limitations (future work?)
– a pre-execution/branch predictor interface would be nice (others and I have looked at this)
– difficulties with unoverlapped unrolling: an external unrolling mechanism, static selection framework
– need more RS entries: clustered RS queues should work well, since DDTs are dependence chains
The End
• pre-execution: more ILP from sequential programs– attacks performance problems directly– key technologies: proactive out-of-order sequencing + decoupling
• DDMT: a superscalar-friendly implementation– no dedicated pre-execution bandwidth– register integration for pre-execution reuse
• automated DDT selection– stable, has the right knobs
Outline
• motivation and introduction to pre-execution
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
Tuning DDT Selection
• can’t change DDT structure, but can control length
– longer DDTs tolerate more latency, but incur more overhead and cover fewer PDIs
• from below– minimal latency tolerance
• from above– maximum length, slicing window size, unrolling degree
• upshot: maximum ADVAGG DDT has characteristic length– loosening controls doesn’t make much difference
Related Work: Architectures
• dataflow architectures [Dennis75], Manchester [Gurd+85], TTDA [Arvind+90], ETS [Culler+90]
– decoupling, data-driven fetch to the limit, but no sequential interface– pre-execution: sequential interface with speculative dataflow helper
[UW-CSTR-#1411]
• decoupled access/execute architecture [Smith82]
– decoupling, single execution, proactive ooo?– pre-execution: speculative decoupled miss/execute micro-architecture
[MEDEA’00]
Related Work: Microarchitectures
• decoupled runahead slipstream [Rotenberg+00]
– decoupled, not proactive out-of-order
• speculative thread-level parallelism Multiscalar [Franklin+93], SPSM [Dubey+95], DMT [Akkary+98]
– decoupled, not proactive out-of-order
Pre-Execution Genealogy
[diagram: family tree of pre-execution proposals: Assisted Execution, Dependence-Based Prefetching, SSMT, Dependence-Based Target Pre-Computation, Branch Flow Microarchitecture, DDMT, Speculative Dataflow, Speculative Slices, Speculative Pre-Computation, Slice Processors]
Performance: Separating the Effects
• full DDMT
• no integration
– loads: prefetching still OK; mcf unrolling needs integration
– branches: effects lost; em3d DDT prefetches too
• no decoupling
– DDTs as scheduler hints ("problem-aware OoO")
– no speedup
[charts: speedup (%) for loads (mcf, vpr, mst) and for branches (eon, gzip, em3d) under full DDMT, no integration (-integ), and no decoupling (-decoup)]
Pre-Execution vs. Inlined Helper Code
• must hoist computation – must copy, straight hoist will create WAR dependences– hoist past/out-of/into procedures?– schedule past/out-of/into procedures?– non-binding prefetches or inline stalls– branch pre-execution?
• pre-execution vs. Itanium®– speculative loads ease hoisting past procedures– but that’s it
Loads in DDTs
• missed memory dependences (store invalidations)?– problem is with integration– keep address/value pairs in integration table (shadow MOB)– snoop
• multiprocessors?– pre-execution is sequentially consistent (SC)– just radical out-of-order– actual execution, snoopable state for all loads– many cases occur in uniprocessor as interactions with main thread
Contributions: Prelim vs. Thesis
• Prelim
– implementation: Speculative Dataflow (TTDA), DDMT (SMT base)
– applications: prefetch data, pre-compute branches, "smooth" ILP
– automated pre-execution computation selection (enough to get by)
• Dissertation– implementation: DDMT (superscalar based)– applications: prefetch data, pre-compute branches– automated computation selection
Superscalar: Obstacles to ILP
• backpressure, backpressure…– out-of-order retirement? can’t do– larger window (ROB)? hard (engineering) – bigger useful window? harder (P[no mis-pred] shrinks)– out-of-order fetch? getting to that
Calculating Execution Times
• why do DDTs execute faster than MT?– sequence fewer instructions!– SCDH (Sequencing Constrained Dataflow Height): accounts for this
[tables: per-instruction DST, SC, and SCDH values for the x18/x2c/x30 computation as sequenced within the master thread vs. as a separate DDT; the DDT's SCDH is far lower because it sequences only the slice]
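A sketch of how an SCDH-style estimate might be computed, assuming instruction tuples `(dests, srcs, latency)` and a sequencing width; this illustrates the idea (start time bounded by both fetch order and input readiness), not the dissertation's exact formulation:

```python
def scdh(insts, width=8):
    """Sequencing-constrained dataflow height: an instruction starts no
    earlier than its sequencing cycle (in-order, `width` per cycle) and
    no earlier than the cycle its inputs become ready."""
    ready = {}                                # reg -> cycle value is ready
    height = 0
    for i, (dests, srcs, lat) in enumerate(insts):
        seq = i // width                      # sequencing constraint
        start = max([seq] + [ready.get(r, 0) for r in srcs])
        done = start + lat
        for d in dests:
            ready[d] = done
        height = max(height, done)
    return height

# the example DDT: two address increments feeding a 10-cycle load
ddt = [(('R1',), ('R1',), 1),
       (('R1',), ('R1',), 1),
       (('R2',), ('R1',), 10)]
```

Because the DDT sequences only the slice, its sequencing term is tiny and its height is essentially the dependence chain, which is exactly why EXECT_DDT comes out so much smaller than EXECT_MT.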