
Non-traditional Parallelism


Page 1: Non-traditional Parallelism

Dean Tullsen ACACES 2008

Parallelism – Use multiple contexts to achieve better performance than possible on a single context.

Traditional Parallelism – We use extra threads/processors to offload computation. Threads divide up the execution stream.

Non-traditional parallelism – Extra threads are used to speed up computation without necessarily off-loading any of the original computation.

Primary advantage – nearly any code, no matter how inherently serial, can benefit from parallelization. Another advantage – threads can be added or subtracted without significant disruption.

Page 2: Non-traditional Parallelism

[Diagram: four thread contexts (Thread 1-4) dividing up the execution stream.]

Page 3: Non-traditional Parallelism

[Diagram: the same four thread contexts, now with extra threads supporting a single main thread.]

Examples: speculative precomputation, dynamic speculative precomputation, many others. Most commonly – prefetching; possibly branch pre-calculation.

Page 4: Non-traditional Parallelism

Chappell, Stark, Kim, Reinhardt, and Patt, "Simultaneous Subordinate Micro-threading," 1999 – Use microcoded threads to manipulate the microarchitecture to improve the performance of the main thread.

Zilles 2001, Collins 2001, Luk 2001 – Use a regular SMT thread, with code distilled from the main thread, to support the main thread.

Page 5: Non-traditional Parallelism

Speculative Precomputation [Collins, et al. 2001 – Intel/UCSD]
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching

Page 6: Non-traditional Parallelism

[Chart: speedup (y-axis 0-8) on art, equake, gzip, mcf, health, and mst. Series: Perfect Memory; Perfect Delinquent Loads (10). Off-scale bars labeled 32.6 and 27.9.]

Page 7: Non-traditional Parallelism


In SP, a p-slice is a thread derived from a trace of execution between a trigger instruction and the delinquent load.

All instructions upon which the load’s address is not dependent are removed (often 90-95%).

Live-in register values (typically 2-6) must be explicitly copied from main thread to helper thread.
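To make the distillation concrete, here is a minimal C sketch under assumed conditions: a hypothetical linked-list traversal whose node load is delinquent. The node type, function names, and the runahead bound of 16 are illustrative, not taken from the paper.

  #include <stddef.h>

  typedef struct node { struct node *next; int key; } node;

  /* Main-thread loop: the load of n->key misses frequently. */
  int sum_keys(node *head) {
      int sum = 0;
      for (node *n = head; n != NULL; n = n->next)
          sum += n->key;               /* delinquent load */
      return sum;
  }

  /* P-slice: only the address computation (the pointer chase) survives;
     everything the load's address does not depend on (the summation) is
     removed. The single live-in, n, is copied from the main thread when
     the helper is spawned at the trigger instruction. */
  void p_slice(node *n) {              /* n: live-in register value */
      for (int i = 0; i < 16 && n != NULL; i++) {   /* bounded runahead */
          __builtin_prefetch(n, 0, 3); /* warm the cache for the main thread */
          n = n->next;
      }
  }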

Page 8: Non-traditional Parallelism

[Timeline diagram: the main thread reaches the trigger instruction and spawns a helper thread; the helper runs the p-slice and issues the prefetch, overlapping the memory latency of the delinquent load.]

Page 9: Non-traditional Parallelism

Because SP uses actual program code, it can precompute addresses that fit no predictable pattern.

Because SP runs in a separate thread, it can interfere with the main thread much less than software prefetching. When it isn’t working, it can be killed.

Because it is decoupled from the main thread, the prefetcher is not constrained by the control flow of the main thread.

All the applications in this study already had very aggressive software prefetching applied, when possible.

Page 10: Non-traditional Parallelism

On-chip memory for transfer of live-in values.

Chaining triggers – for delinquent loads in loops, a speculative thread can trigger the next p-slice (think of this as a looping prefetcher which targets a load within a loop). This minimizes live-in copy overhead and enables SP threads to get arbitrarily far ahead, but necessitates a mechanism to stop the chaining prefetcher. A sketch follows.
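A rough sketch of the chaining idea on the same hypothetical list traversal; spawn() stands in for the hardware thread-spawn primitive and MAX_CHAIN for the stop mechanism, both of which are assumptions here.

  struct node { struct node *next; int key; };

  #define MAX_CHAIN 16   /* the stop mechanism: bound how far ahead chaining runs */

  /* Hypothetical hardware thread-spawn primitive, not a real API. */
  extern void spawn(void (*slice)(struct node *, int), struct node *arg, int depth);

  /* Each instance prefetches one iteration's load, then fires the chaining
     trigger for the next iteration. Live-ins are copied only once, when the
     basic trigger in the main thread spawns the first instance. */
  void chaining_slice(struct node *n, int depth) {
      if (n == NULL || depth >= MAX_CHAIN)
          return;                                  /* chaining prefetcher stops here */
      __builtin_prefetch(n, 0, 3);                 /* prefetch this iteration's node */
      spawn(chaining_slice, n->next, depth + 1);   /* the chaining trigger */
  }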

Page 11: Non-traditional Parallelism

Chaining triggers are executed without impacting the main thread, and target delinquent loads arbitrarily far ahead of the non-speculative thread. Speculative threads make progress independent of the main thread. Basic triggers initiate precomputation; chaining triggers sustain it.

Page 12: Non-traditional Parallelism

[Chart: speedup over baseline on art, equake, gzip, mcf, health, mst, and average, with 2, 4, and 8 thread contexts.]

Page 13: Non-traditional Parallelism

Speculative precomputation uses otherwise idle hardware thread contexts to pre-compute future memory accesses, targeting the worst-behaving static loads in a program. Chaining triggers enable speculative threads to spawn additional speculative threads. The result is tremendous performance gains, even with conservative hardware assumptions.

Page 14: Non-traditional Parallelism

Speculative Precomputation
Dynamic Speculative Precomputation [Collins, et al. – UCSD/Intel]
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching

Page 15: Non-traditional Parallelism

SP, like similar techniques proposed at about the same time, requires profile support and heavy user or compiler interaction.

It is thus susceptible to profile mismatch, requires recompilation for each machine architecture, and if it requires user interaction…

(or, a bit more accurately, we just wanted to see if we could do it all in hardware)

Page 16: Non-traditional Parallelism

Dynamic SP relies on the hardware to:
  identify delinquent loads
  create speculative threads
  optimize the threads when they aren't working quite well enough
  eliminate the threads when they aren't working at all
  destroy threads when they are no longer useful…

Page 17: Non-traditional Parallelism


Like hardware prefetching, works without software support or recompilation, regardless of the machine architecture.

Like SP, works with minimal interference on main thread.

Like SP, works on highly irregular memory access patterns.

Page 18: Non-traditional Parallelism

Identify delinquent loads – Delinquent Load Identification Table (DLIT)
Construct p-slices and apply optimizations – Retired Instruction Buffer (RIB)
Spawn and manage p-slices – Slice Information Table (SIT)

All three are implemented as back-end instruction analyzers.
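As a mental model, the three tables might look like the following C structs; the field layouts are illustrative assumptions, not the paper's actual formats.

  #include <stdint.h>

  struct dlit_entry {            /* Delinquent Load Identification Table */
      uint64_t load_pc;          /* static load being tracked */
      uint32_t executions;
      uint32_t cache_misses;     /* a high miss rate marks the load delinquent */
  };

  struct rib_entry {             /* Retired Instruction Buffer (a FIFO) */
      uint64_t pc;
      int8_t   dest_reg;         /* -1 if the instruction writes no register */
      int8_t   src_reg[2];
  };

  struct sit_entry {             /* Slice Information Table */
      uint64_t trigger_pc;       /* spawn the p-slice when this PC retires */
      uint16_t slice_length;
      uint8_t  live_in_regs[6];  /* typically only a few live-in registers */
  };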

Page 19: Non-traditional Parallelism

[Diagram: baseline SMT pipeline – PCs, ICache, register renaming, centralized instruction queue, per-thread re-order buffers, monolithic register file, execution units, data cache.]

Page 20: Non-traditional Parallelism

[Diagram: the same SMT pipeline augmented with the Delinquent Load Identification Table (DLIT), Retired Instruction Buffer (RIB), and Slice Information Table (SIT) at the back end.]

Page 21: Non-traditional Parallelism

Once a delinquent load is identified, the RIB buffers instructions until the delinquent load appears as the newest instruction in the buffer.

Dependence analysis easily identifies the load's antecedents, a trigger instruction, and the live-ins needed by the slice. It is similar to register live-range analysis, but much easier.

Page 22: Non-traditional Parallelism

The RIB constructs p-slices to prefetch delinquent loads. It buffers information on an in-order run of committed instructions (comparable to a trace cache fill unit). It is a FIFO structure, and is normally idle.

Page 23: Non-traditional Parallelism

Analyze the instructions between two instances of the delinquent load, from most recent to oldest, maintaining a partial p-slice and a register live-in set: add to the p-slice each instruction that produces a register in the live-in set, then update the live-in set with that instruction's sources. When the analysis terminates, the p-slice has been constructed and the live-in registers identified. A sketch follows.
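A minimal C sketch of this backward pass, assuming a simplified one-destination, two-source instruction record and a 64-register bitmask for the live-in set:

  #include <stdint.h>

  typedef struct { int dest; int src[2]; } Inst;   /* register numbers; -1 = unused */

  /* rib[oldest] and rib[newest] are the two instances of the delinquent load.
     Returns the slice length; the slice is collected newest-first, so it is
     emitted in reverse (oldest-first) order for execution. */
  int build_pslice(const Inst *rib, int oldest, int newest,
                   Inst *slice, uint64_t *live_in /* register bitmask */) {
      int n = 0;
      *live_in = 0;
      slice[n++] = rib[newest];                    /* seed with the load itself */
      for (int s = 0; s < 2; s++)
          if (rib[newest].src[s] >= 0)
              *live_in |= 1ull << rib[newest].src[s];
      for (int i = newest - 1; i > oldest; i--) {  /* most recent to oldest */
          const Inst *in = &rib[i];
          if (in->dest >= 0 && (*live_in & (1ull << in->dest))) {
              slice[n++] = *in;                    /* producer of a live-in: include */
              *live_in &= ~(1ull << in->dest);     /* now produced inside the slice */
              for (int s = 0; s < 2; s++)
                  if (in->src[s] >= 0)
                      *live_in |= 1ull << in->src[s];
          }
      }
      return n;
  }

Running this over the example on the next slides yields the p-slice load r1 = [r2]; add r1 = r4+r1; load r5 = [r1], with live-ins r2 and r4.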

Page 24: Non-traditional Parallelism

struct DATATYPE { int val[10]; };

DATATYPE *data[100];

for (j = 0; j < 10; j++) {
  for (i = 0; i < 100; i++) {
    data[i]->val[j]++;
  }
}

loop:
I1  load  r1 = [r2]
I2  add   r3 = r3+1
I3  add   r6 = r3-100
I4  add   r2 = r2+8
I5  add   r1 = r4+r1
I6  load  r5 = [r1]
I7  add   r5 = r5+1
I8  store [r1] = r5
I9  blt   r6, loop

Page 25: Non-traditional Parallelism

[Slides 25-32 animate the backward RIB analysis on the example above, stepping from the most recent instruction to the oldest; the final state is shown here.]

Instruction          Included   Live-in Set
load  r5 = [r1]      yes        r1
add   r1 = r4+r1     yes        r1, r4
add   r2 = r2+8                 r1, r4
add   r6 = r3-100               r1, r4
add   r3 = r3+1                 r1, r4
load  r1 = [r2]      yes        r2, r4
blt   r6, loop                  r2, r4
store [r1] = r5                 r2, r4
add   r5 = r5+1                 r2, r4
load  r5 = [r1]      (older instance – terminates the analysis)

Page 33: Non-traditional Parallelism

P-slice:  load r1 = [r2];  add r1 = r4+r1;  load r5 = [r1]
Live-in set: r2, r4

The delinquent load itself serves as the trigger.

Page 34: Non-traditional Parallelism

If two occurrences of the load are in the buffer (the common case), we've identified a loop that can be exploited for better slices.

Additional analysis passes enable further optimizations; retaining the live-in set from the previous pass increases construction latency but keeps the RIB simple.

Optimizations:
  Advanced trigger placement (if dependences allow, move the trigger earlier in the loop)
  Induction unrolling (prefetch multiple iterations ahead – sketched below)
  Chaining (looping) slices – prefetch many loads with a single thread

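For instance, a hedged sketch of induction unrolling applied to the data[i]->val[j]++ example from slide 24; the C rendering and the unroll amount of 4 are illustrative.

  struct DATATYPE { int val[10]; };

  /* The slice advances the induction variable (r2, i.e. &data[i]) four
     iterations beyond the main thread before chasing the pointer, so the
     prefetch lands well ahead of the consuming load. In a real helper
     context this dereference is speculative and non-faulting. */
  void unrolled_slice(struct DATATYPE **slot,   /* live-in: current &data[i] */
                      int j) {                  /* live-in: current column j */
      slot += 4;                                /* induction unrolling: 4 ahead */
      struct DATATYPE *obj = *slot;             /* load r1 = [r2] */
      __builtin_prefetch(&obj->val[j], 1, 3);   /* target of load r5 = [r1] */
  }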

Page 35: Non-traditional Parallelism

[Chart: speedup over no dynamic SP on mcf, vpr, art, equake, mgrid, swim, em3d, mst, perimeter, treeadd, and average. Series: Basic Dynamic SP, Improved Trigger, Induction Unrolling, Chaining; off-scale bars labeled 1.72, 1.97, and 1.93.]

Page 36: Non-traditional Parallelism

Dynamic Speculative Precomputation aggressively targets delinquent loads:
  a thread-based prefetching scheme
  uses back-end (off the critical path) instruction analyzers
  p-slices are constructed with no external software support
  multi-pass RIB analysis enables aggressive p-slice optimizations

Page 37: Non-traditional Parallelism

Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching

Page 38: Non-traditional Parallelism


With Weifeng Zhang and Brad Calder

Page 39: Non-traditional Parallelism


Use “helper threads” to recompile/optimize the main thread.

Optimization is triggered by interesting events that are identified in hardware (event-driven).

Page 40: Non-traditional Parallelism

[Diagram: four thread contexts (Thread 1-4).]

Execution and Compilation take place in parallel!

Page 41: Non-traditional Parallelism

A new model of optimization: computation and optimization occur in parallel, and optimizations are triggered by the program's runtime behavior.

Advantages:
  low-overhead profiling of runtime behavior
  low-overhead optimization by exploiting an additional hardware context
  quick response to the program's changing behavior
  aggressive optimizations

Page 42: Non-traditional Parallelism

[Diagram: the main thread runs the original code, then base optimized code; each event triggers a helper thread that produces re-optimized code.]

Only one copy of the optimized code is maintained. Recurrent optimization is applied to already-optimized code when the behavior changes, gradually enabling aggressive optimizations.

Page 43: Non-traditional Parallelism

Hardware event-driven: hardware monitors the program's behavior with no software overhead; optimization threads are triggered in response to particular events; optimization events are handled ASAP to adapt quickly to the program's changing behavior.

Hardware multithreaded: concurrent, low-overhead helper threads; gradual re-optimization upon new events.

[Diagram: events flow from the main thread to Trident's optimization threads.]

Page 44: Non-traditional Parallelism

[Diagram: software components (user application, OS loader, runtime support, helper threads) and hardware components (event manager, event queue, code cache), connected by helper registration, thread triggers, helper priorities, and hardware events.]

Flow: register a given thread to be monitored and create helper thread contexts; the hardware monitors the main thread and generates events into the queue; a helper thread is triggered to perform the optimization; the helper updates the code cache and patches the main thread. A sketch of this loop follows.
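In spirit, a helper context runs an event loop like this sketch; the event type, queue call, and optimizer entry points are hypothetical stand-ins for Trident's hardware interfaces, not a real API.

  enum event_kind { HOT_BRANCH, TRACE_INVALIDATION, HOT_VALUE, DELINQUENT_LOAD };

  struct event { enum event_kind kind; unsigned long pc; };

  extern int  event_dequeue(struct event *e);      /* hardware event queue */
  extern void form_trace(unsigned long pc);        /* fill the code cache */
  extern void specialize_value(unsigned long pc);
  extern void insert_prefetch(unsigned long pc);
  extern void invalidate_trace(unsigned long pc);  /* and unpatch the main thread */

  void helper_loop(void) {
      struct event e;
      while (event_dequeue(&e)) {       /* optimization runs in parallel with
                                           the main thread's execution */
          switch (e.kind) {
          case HOT_BRANCH:         form_trace(e.pc);        break;
          case HOT_VALUE:          specialize_value(e.pc);  break;
          case DELINQUENT_LOAD:    insert_prefetch(e.pc);   break;
          case TRACE_INVALIDATION: invalidate_trace(e.pc);  break;
          }
      }
  }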

Page 45: Non-traditional Parallelism

Events – occurrences of a particular type of runtime behavior:

Generic events: hot branch events, trace invalidation
Optimization-specific events: hot value events, delinquent load events
Other events?

Page 46: Non-traditional Parallelism

The Trident framework is built around a fairly traditional dynamic optimization system => hot trace formation, code cache.

Trident captures hot traces in hardware (details omitted).

However, even with its basic optimizations, Trident has key advantages over previous systems: hardware hot-branch events identify hot traces; zero-overhead monitoring; low-overhead optimization in another thread; no context switches between these functions.

Page 47: Non-traditional Parallelism

Definitions:
  Hot trace – a number of basic blocks frequently running together
  Trace formation – streamlining these blocks for better execution locality
  Code cache – a memory buffer that stores hot traces

[Diagram: control-flow graph of basic blocks A-K with call/return edges and branch biases, from which a hot trace is formed.]

Page 48: Non-traditional Parallelism

Streamlining the instruction sequence: redundant branch/load removal, constant propagation, instruction re-association, code elimination.

Architecture-aware optimizations: reduction of RAS (return address stack) mispredictions (by orders of magnitude); I-cache-conscious placement of traces within the code cache; trace invalidation.

Page 49: Non-traditional Parallelism

Value specialization – make a special version of the code corresponding to likely live-in values.

Advantages over hardware value prediction:
  value predictions are made in the background and less frequently
  no limits on how many predictions can be made
  allows more sophisticated prediction techniques
  propagates predicted values along the trace
  triggers other optimizations such as strength reduction

Page 50: Non-traditional Parallelism

Value specialization – make a special version of the code corresponding to likely live-in values.

Advantages over software value specialization:
  can adapt to semi-invariant runtime values (e.g., values that change, but slowly)
  adapts to actual dynamic runtime values
  detects optimizations that are no longer working

Page 51: Non-traditional Parallelism

Value specialization handles semi-invariant "constants" and strided values (details omitted), with dynamic verification.

Original trace:
  LDQ 0(R2) -> R1
  ADD R6, R4 -> R3
  MUL R1, R3 -> R2
  …

Specialized trace (R1 predicted to be 0):
  LDQ 0(R2) -> R3        ; perform the original load into a scratch register
  MOV 0 -> R1            ; move the predicted value into the load destination
  BNE R1, R3, recovery   ; check the predicted value; branch to recovery if not equal
  ADD R6, R4 -> R3
  MOV 0 -> R2            ; constant propagation and strength reduction: the MUL no longer depends on the load!
  …

Compensation block (recovery): copy the scratch register into the load destination, then jump to the next instruction after the load in the original binary.

Page 52: Non-traditional Parallelism

Evaluating the helper threads' impact on the main thread: exercising the full optimization flow without using the optimized traces costs ~0.6% of the main thread's IPC.

Concurrent execution of the main thread and helpers: the helpers consume ≤ 2% of total execution time (running concurrently with the main thread).

Page 53: Non-traditional Parallelism

[Chart: percent speedups (-5% to 35%) on bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, and average. Series: H/W value prediction, Trace formation, Value specialization; off-scale bars labeled 169%, 236%, and 238%.]

Page 54: Non-traditional Parallelism

Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching

Page 55: Non-traditional Parallelism

Limitations of existing prefetching techniques:

Compiler-based static prefetching:
▪ address / aliasing resolution
▪ timeliness
▪ hard to identify delinquent loads
▪ variation due to data input or architecture

Hardware prefetching:
▪ cannot follow complicated load behaviors

Page 56: Non-traditional Parallelism

Goal: provide an efficient way to perform flexible software prefetching, and find prefetching opportunities in legacy code.

Effective prefetching must be accurate:
▪ target the loads which actually miss in the cache
▪ prefetch far enough ahead to cover the miss latency
▪ keep the overhead of computing prefetch addresses low

Page 57: Non-traditional Parallelism

It is intrinsically difficult to get the prefetch distance right. Trident enables adaptive discovery of the optimal prefetch distances; conventional systems often make the decision once because of high overhead.

[Diagram: Load 1, Load 2, and Load 3 laid out along the execution time of the original execution trace.]

Page 58: Non-traditional Parallelism

[Diagram: hot branch event and delinquent load event.]

Page 59: Non-traditional Parallelism

Determine how far ahead to prefetch a delinquent load:

  Prefetch Distance = (average load miss latency) / (average cycles per iteration)

The inserted instruction is then: prefetch (offset + stride*distance)(base)

Most prior prefetching systems keep the prefetch distance fixed after the initial estimate. Trident reuses the first two steps, except that the low overhead of monitoring + optimization allows us to adapt this distance as well as the stride.
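As a worked example under assumed numbers: an average miss latency of 240 cycles and 30 cycles per iteration give a distance of 240/30 = 8. A small sketch of the computation:

  /* The distance is chosen so the prefetched line arrives just before the
     load needs it; e.g. prefetch_distance(240, 30) == 8, yielding
     prefetch (offset + stride*8)(base). */
  int prefetch_distance(int avg_miss_latency_cycles, int avg_cycles_per_iter) {
      int d;
      if (avg_cycles_per_iter <= 0)
          return 1;
      d = avg_miss_latency_cycles / avg_cycles_per_iter;
      return d > 0 ? d : 1;    /* a distance of 0 would prefetch too late */
  }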

Page 60: Non-traditional Parallelism


1. Object prefetching – identifies loads within the same object, and clusters them to minimize prefetch overhead.

2. Adaptive determination of prefetch distance

Page 61: Non-traditional Parallelism

Heavy interaction between neighboring loads (especially other loads we are also prefetching) makes static, or even dynamic, determination of the correct prefetch distance difficult.

Because of the low cost of optimization, Trident uses trial-and-error to discover the right distances.

Page 62: Non-traditional Parallelism

All stride-based prefetch instructions are inserted with an initial distance of 1. These loads are continuously monitored in the DLT, and the optimizer increases or decreases the distance until the load is no longer delinquent or the load has matured. Stabilization is achieved quickly.

[State diagram: a delinquent load event signals that the prefetch is not hiding enough latency while the load is still delinquent, prompting another adjustment.]
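A hedged sketch of that trial-and-error repair, with query_dlt() standing in for the DLT's feedback (a hypothetical API):

  enum dlt_status { STILL_DELINQUENT, MATURED };
  extern enum dlt_status query_dlt(unsigned long load_pc);   /* hypothetical DLT query */

  struct prefetch_site { unsigned long load_pc; int distance; };

  /* Invoked on each delinquent-load event for this site. A fuller version
     would also decrease the distance when prefetches arrive too early. */
  void repair(struct prefetch_site *p, int max_distance) {
      if (query_dlt(p->load_pc) == STILL_DELINQUENT) {
          if (p->distance < max_distance)
              p->distance++;            /* prefetch not hiding enough latency */
      }
      /* MATURED: the load no longer misses; the distance has stabilized */
  }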

Page 63: Non-traditional Parallelism

In many cases, pointer-chasing loads actually have strided patterns, and these patterns can be identified by Trident's hardware monitors.

This gives Trident two advantages over software prefetchers: low-overhead address computation, and the ability to prefetch multiple iterations ahead.
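A small sketch of the payoff: once the monitors report a stable stride between successive node addresses, prefetch addresses become pure arithmetic (the stride-detection helper below is illustrative, and assumes at least two recorded addresses):

  #include <stdint.h>

  /* Returns the common stride of the recent chase addresses, or 0 if none. */
  long stable_stride(const uintptr_t *addr, int n) {
      long s = (long)(addr[1] - addr[0]);
      for (int i = 2; i < n; i++)
          if ((long)(addr[i] - addr[i - 1]) != s)
              return 0;
      return s;
  }

  /* With a stable stride, no loads are needed to compute the prefetch
     address, and we can reach several iterations ahead. */
  void prefetch_ahead(uintptr_t cur, long stride, int distance) {
      __builtin_prefetch((const void *)(cur + (uintptr_t)(stride * (long)distance)), 0, 3);
  }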

Page 64: Non-traditional Parallelism

Baseline: hardware stride-based prefetching with stream buffers. Self-repairing prefetching achieves a 23% speedup, 12% better than software prefetching without repairing.

[Chart: percent speedups (0-80%) on applu, art, dot, equake, facerec, fma3d, galgel, gap, mcf, mgrid, parser, swim, vis, wupwise, and average. Series: S/W prefetching – basic; S/W prefetching – whole object; S/W prefetching with self-repairing.]

Page 65: Non-traditional Parallelism

Speculative Precomputation
Dynamic Speculative Precomputation
Event-Driven Simultaneous Optimization
  Value Specialization
  Inline Prefetching
  Thread Prefetching (speculative precomputation)

Page 66: Non-traditional Parallelism

Thread prefetching can potentially be more effective than inline prefetching.

However, it is more complex, with more things to get right or wrong: trigger points, termination points, and synchronization between the helper and the main thread. These vary not just with load latencies, but also with control flow, etc.

Again, Trident’s ability to continuously adapt is key.

Page 67: Non-traditional Parallelism

It is critical in any thread-based prefetching scheme that the prefetch thread stay ahead of the main thread.

Trident optimizations:
  jump-start the p-thread multiple iterations ahead
  use dynamically detected strides to replace complex recurrences
  same-object prefetching
  p-thread placement optimizations for I-cache performance
  low-overhead software synchronization
  quick repair of off-track prefetching

Page 68: Non-traditional Parallelism

[Chart: percent speedups (0-40%) on applu, art, dot, equake, facerec, fma3d, galgel, gap, mcf, mgrid, parser, swim, vis, wupwise, and average. Series: basic p-slice prefetching (runahead=10); jump start (J=5) w/ runahead (R=5); specialization with speculative stride <J=5,R=5>; off-scale bars labeled 109%, 129%, and 133%.]

Trident's acceleration techniques achieve 7% better performance than existing precomputation techniques.

Page 69: Non-traditional Parallelism

[Chart: percent speedups (0-60%). Series: inlined prefetching; precomputation + inlined (non-looped); off-scale bars labeled 142% and 79%.]

Adaptive inlined prefetching plus precomputation achieves 10% better performance than previous aggressive inlined prefetching.

Page 70: Non-traditional Parallelism

Event-driven multithreaded optimization:

Hardware event-driven optimization means low-overhead profiling.
▪ monitoring of the code need never stop

Allowing compilation to take place in parallel with execution provides low-overhead optimization.
▪ allows more aggressive optimizations
▪ allows gradual improvement via recurrent optimization
▪ allows self-adaptive (e.g., search-based) optimization

What else can we do with this technology?

Page 71: Non-traditional Parallelism

Non-traditional parallelism:
  works on serial code (and parallel code)
  provides parallel speedup by allowing the main thread to run faster
  is not limited by traditional theoretical limits to parallel speedup
  adapts easily to changes in available parallelism

Other types of non-traditional parallelism??