SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood


2 Hazelwood – CGO 2007

Dynamic Binary Instrumentation

Inserts user-defined instructions into executing binaries

• Easily
• Efficiently
• Transparently

Why?
• Detect inefficiencies
• Detect bugs
• Security checks
• Add features

Examples
• Valgrind, DynamoRIO, Strata, HDTrans, Pin

[Diagram: EXE, Transform, Code Cache, Execute, Profile]


Intel Pin

• A dynamic binary instrumentation system

• Easy-to-use instrumentation interface

• Supports multiple platforms
  – Four ISAs: IA32, Intel64, IPF, ARM
  – Four OSes: Linux, Windows, FreeBSD, MacOS

• Robust and stable (Pin can run itself!)
  – 12+ active developers
  – Nightly testing of 25000 binaries on 15 platforms
  – Large user base in academia and industry
  – Active mailing list (Pinheads)

• 11,500 downloads


Our Goal: Improve Performance

The latest Pin overhead numbers …

[Chart: Pin execution time relative to native (y-axis 100% to 200%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, hmmer]


Adding Instrumentation

[Chart: execution time relative to native (y-axis 100% to 800%) for Pin and Pin+icount on perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, hmmer]


Sources of Overhead

Internal

• Compiling code & exit stubs (region detection, region formation, code generation)

• Managing code (eviction, linking)

• Managing directories and performing lookups

• Maintaining consistency (SMC, DLLs)

External

• User-inserted instrumentation


“Normal Pin” Execution Flow

Instrumentation is interleaved with application

[Diagram: timelines (x-axis: time) for the uninstrumented application and the “Pinned” application; legend: Instrumented Application, Pin Overhead, Instrumentation Overhead]


“SuperPin” Execution Flow

SuperPin creates instrumented slices

[Diagram: timelines for the uninstrumented application and the SuperPinned application, with the instrumented slices running alongside]


Issues and Design Decisions

Creating slices
• How/when to start a slice
• How/when to end a slice

System calls

Merging results


Execution Timeline

[Diagram: x-axis is time. CPU1 runs the original application, which forks slices S1 through S6. CPU2 through CPU4 run the instrumented application slices S1+ through S6+. A slice records its signature and sleeps, later resumes, and runs until it detects the next signature or, for the last slice, exit.]


Starting Slices

How?

• Master application process – ptrace

• Controlling process

• Child slices – fork
  – Reserve memory for transparency
  – Each slice has its own code cache (for now)

When?

• Timeouts
  – Uses a special timer process
  – Tunable parameter

• System calls
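The fork-based slice creation described above can be sketched without Pin. This is a minimal POSIX illustration under stated assumptions, not SuperPin's implementation: `step`, `SliceReport`, and `run_slices` are hypothetical names, a fixed step count stands in for the timer-driven timeout, a pipe stands in for whatever channel reports results, and ptrace control is left out entirely.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one application "instruction".
static void step(uint64_t &state) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
}

struct SliceReport {
    uint64_t count;      // steps counted by the slice (its "instrumentation")
    uint64_t end_state;  // application state when the slice finished
};

// The master forks one child per timeslice, then keeps running uninstrumented.
// Each child replays its own region from the forked snapshot while counting.
// `master_states`, if non-null, receives the master's state after each region.
std::vector<SliceReport> run_slices(int slices, int steps_per_slice,
                                    std::vector<uint64_t> *master_states = nullptr) {
    uint64_t state = 1;
    std::vector<int> read_fds;
    std::vector<pid_t> pids;
    for (int s = 0; s < slices; ++s) {
        int fds[2];
        if (pipe(fds) != 0) break;
        pid_t pid = fork();
        if (pid < 0) { close(fds[0]); close(fds[1]); break; }
        if (pid == 0) {  // child slice: works on a copy-on-write snapshot
            close(fds[0]);
            SliceReport r{0, state};
            for (int i = 0; i < steps_per_slice; ++i) { step(r.end_state); ++r.count; }
            ssize_t n = write(fds[1], &r, sizeof r);
            _exit(n == static_cast<ssize_t>(sizeof r) ? 0 : 1);
        }
        close(fds[1]);
        read_fds.push_back(fds[0]);
        pids.push_back(pid);
        for (int i = 0; i < steps_per_slice; ++i) step(state);  // master runs natively
        if (master_states) master_states->push_back(state);
    }
    std::vector<SliceReport> reports;
    for (std::size_t s = 0; s < pids.size(); ++s) {  // collect in slice order
        SliceReport r{0, 0};
        if (read(read_fds[s], &r, sizeof r) == static_cast<ssize_t>(sizeof r))
            reports.push_back(r);
        close(read_fds[s]);
        waitpid(pids[s], nullptr, 0);
    }
    return reports;
}
```

Calling `run_slices(3, 1000)` forks three children. Because each child starts from the state the master had at the fork, its `end_state` should equal the master's state at the start of the next region, which is the property the signature mechanism later relies on.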


Handling System Calls

Problem: Don’t want to duplicate system calls in the main application/slices

Solutions:

• brk or anonymous mmap – duplication OK

• frequent calls – record and playback

• default – trigger new timeslice

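[Diagram: three instruction sequences (Instr A, Instr B, system call, Instr C, Instr D), one per strategy. Duplicate: the MMAP call is executed again inside the slice. Record/Playback: the system call is played back inside the slice. End Slice: the system call ends slice S1 and slice S2 begins.]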
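The three system-call strategies above can be written as a small decision function plus a replay log. This is a sketch only: `SyscallPolicy`, `classify`, and `SyscallLog` are hypothetical names, the caller decides what counts as a "frequent" call, and a real record/playback scheme would also have to reproduce memory side effects, which is omitted here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class SyscallPolicy { Duplicate, RecordPlayback, EndSlice };

// Mirrors the three rules on the slide. `frequent` is supplied by the caller;
// which calls qualify as frequent is an assumption of this sketch.
SyscallPolicy classify(const std::string &name, bool anonymous_mmap, bool frequent) {
    if (name == "brk" || (name == "mmap" && anonymous_mmap))
        return SyscallPolicy::Duplicate;       // safe to execute again in the slice
    if (frequent)
        return SyscallPolicy::RecordPlayback;  // replay the recorded result
    return SyscallPolicy::EndSlice;            // default: start a new timeslice
}

// Record/playback: the master appends results as it runs; the slice consumes
// them in the same order instead of re-issuing the calls.
struct SyscallLog {
    std::vector<long> results;
    std::size_t next = 0;

    void record(long result) { results.push_back(result); }

    // Returns false when the slice has used up every recorded result.
    bool playback(long &result) {
        if (next >= results.size()) return false;
        result = results[next++];
        return true;
    }
};
```

A file-backed `mmap` falls through to the default here because only `brk` and anonymous `mmap` are listed as safe to duplicate.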


Ending Slices

Each slice is responsible for detecting its own end-of-slice condition

[Diagram: slice S4+ records sig4 and sleeps, then resumes and runs until it detects sig5]

Challenges:
• Need to efficiently capture a point in time (signature)
• Need to efficiently detect when we’ve reached that point


Signature Detection

End-of-slice conditions:

1. System calls – easy to detect

2. Timeouts at arbitrary points – harder to detect

Signature match:

• Instruction pointer

• Architectural state

• Top of stack


Implementing Signature Detection

Uses Pin’s lightweight conditional analysis
• INS_InsertIfCall – lightweight inlined check
• INS_InsertThenCall – heavyweight (conditional) analysis routine

Instrument the end-of-slice instruction pointer

1. Lightweight check – two registers

2. Heavyweight check – full architectural state

3. Heavyweight check – top 100 words on the stack

• Lightweight triggers heavyweight: ~2%

• Stack check fails: ~0%
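The tiered check above can be illustrated outside Pin with plain data structures. This is a sketch under assumptions: `MachineState`, `Signature`, and the function names are hypothetical, the register count is arbitrary, and the choice of which two registers the lightweight check compares is a guess. In a real Pintool the instruction-pointer comparison would be implicit, since only the end-of-slice instruction is instrumented.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumRegs = 16;      // assumed register-file size
constexpr std::size_t kStackWords = 100;  // "top 100 words" from the slide

struct MachineState {
    uint64_t ip = 0;
    std::array<uint64_t, kNumRegs> regs{};
    std::vector<uint64_t> stack;  // stack[0] is the top of the stack
};

struct Signature {
    uint64_t ip = 0;
    std::array<uint64_t, kNumRegs> regs{};
    std::vector<uint64_t> stack_top;  // at most kStackWords entries
};

// Record a point in time: instruction pointer, registers, top of stack.
Signature capture(const MachineState &m) {
    Signature s;
    s.ip = m.ip;
    s.regs = m.regs;
    const std::size_t n = std::min(kStackWords, m.stack.size());
    s.stack_top.assign(m.stack.begin(),
                       m.stack.begin() + static_cast<std::ptrdiff_t>(n));
    return s;
}

// Step 1: cheap filter on two registers (which two is an assumption).
bool lightweight_match(const Signature &s, const MachineState &m) {
    return s.regs[0] == m.regs[0] && s.regs[1] == m.regs[1];
}

// Steps 2 and 3: full register state, then the recorded stack words.
bool heavyweight_match(const Signature &s, const MachineState &m) {
    if (s.regs != m.regs) return false;
    if (m.stack.size() < s.stack_top.size()) return false;
    return std::equal(s.stack_top.begin(), s.stack_top.end(), m.stack.begin());
}

// The heavyweight check only runs when the lightweight one passes.
bool at_end_of_slice(const Signature &s, const MachineState &m) {
    return m.ip == s.ip && lightweight_match(s, m) && heavyweight_match(s, m);
}
```

The short-circuit in `at_end_of_slice` is the point of the design: most visits to the end-of-slice instruction are rejected by the two-register comparison, so the expensive comparison runs rarely.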


Performance Results

Icount1 – Instruments every instruction with count++

% pin -t icount1 -- ./hello
Hello CGO
Count: 496043

Icount2 – Instruments every basic block with count += bblength

% pin -t icount2 -- ./hello
Hello CGO
Count: 496043
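Both tools report the same count because adding a block's length once is equivalent to incrementing once per instruction in that block. The sketch below models an execution as a list of executed basic-block lengths; `CountResult` and both function names are hypothetical, and no Pin API is involved.

```cpp
#include <cstdint>
#include <vector>

struct CountResult {
    uint64_t count;           // instructions counted
    uint64_t analysis_calls;  // how many times the analysis code ran
};

// icount1 style: one analysis call per executed instruction.
CountResult count_per_instruction(const std::vector<uint32_t> &executed_block_lengths) {
    CountResult r{0, 0};
    for (uint32_t len : executed_block_lengths) {
        for (uint32_t i = 0; i < len; ++i) {
            ++r.count;
            ++r.analysis_calls;
        }
    }
    return r;
}

// icount2 style: one analysis call per executed basic block.
CountResult count_per_block(const std::vector<uint32_t> &executed_block_lengths) {
    CountResult r{0, 0};
    for (uint32_t len : executed_block_lengths) {
        r.count += len;
        ++r.analysis_calls;
    }
    return r;
}
```

For blocks of lengths 3, 5, 1, and 7, both functions count 16 instructions, but the per-block version runs its analysis code 4 times instead of 16.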


Performance – icount1

% pin -t icount1 -- <benchmark>

[Chart: Pin vs. SuperPin (y-axis 0% to 3000%) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, perlbm, sixtrack, swim, twolf, vortex, vpr, wupwis, and AVG]


Performance – icount2

% pin -t icount2 -- <benchmark>

[Chart: Pin vs. SuperPin (y-axis 0% to 1000%) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, perlbm, sixtrack, swim, twolf, vortex, vpr, wupwis, and AVG]


Performance Scalability

[Chart: runtime in seconds (y-axis 0 to 600) versus maximum number of running slices (1, 2, 4, 8, 12, 16)]

Running on an 8-processor HT-enabled machine (16 virtual processors)


Overhead Categorization

Where is SuperPin spending its time?
• Executing the application
• Fork overheads
• Sleeping (waiting to start a slice)
• Pipeline delays

[Diagram: execution timeline for slices S1 through S4 and S1+ through S4+, showing the fork, record signature and sleep, resume, and detect signature phases]


Overhead Categorization

[Chart: runtime in seconds (y-axis 0 to 200) for timeslices of 0.5 s, 1 s, 2 s, and 4 s, broken down into native, fork & others, sleep, and pipeline]


The SuperPin API

You may write SuperPin-aware Pintools:
– SP_Init(fun)
– SP_AddSliceBeginFunction(fun,val)
– SP_AddSliceEndFunction(fun,val)
– SP_EndSlice()
– SP_CreateSharedArea(local,size,merge)

You may also control (via switches):
– Spmsec {value}: milliseconds per timeslice
– Spmp {value}: maximum slice count
– Spsysrecs {value}: maximum syscalls per slice


Pin Instrumentation API – icount2

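#include "pin.H"

static UINT64 icount = 0;  // running instruction count

// Analysis routine: add the number of instructions in the basic block.
VOID DoCount(INT32 c) { icount += c; }

// Instrumentation routine: insert one DoCount call at the head of each basic block.
VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        INS_InsertCall(BBL_InsHead(bbl), IPOINT_BEFORE, (AFUNPTR)DoCount,
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
    }
}

int main(INT32 argc, CHAR **argv) {
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram();  // does not return
    return 0;
}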


SuperPin Version of Icount2

VOID DoCount(INT32 c) { /* same as before */ }

// Registered via SP_Init: reset the slice-local count.
VOID ToolReset(INT32 c) { icount = 0; }

// Registered as a slice-end function: fold the local count into the shared total.
VOID Merge(INT32 sliceNum) { *sharedData += icount; }

VOID Trace(TRACE trace, VOID *v) { /* same as before */ }

int main(INT32 argc, CHAR **argv) {
    PIN_Init(argc, argv);
    SP_Init(ToolReset);
    sharedData = (UINT64*) SP_CreateSharedArea(&icount, sizeof(icount), 0);
    SP_AddSliceEndFunction(Merge, 0);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram();
    return 0;
}


SuperPin Limitations

Not all instrumentation tasks are a good fit

Great fit – independent tasks
• Instruction profiling (counts, distributions)
• Trace generation

Requires massaging – dependent tasks
• Branch prediction
• Data cache simulation
– Assume a starting state, resolve later

Stick with regular Pin – path modification
• Adaptive execution
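The slides do not say how "assume a starting state, resolve later" is carried out for dependent tasks. One simple variant, assumed here purely for illustration, works when the carried state is tiny: each slice simulates every possible starting state, and the results are chained once the true starting state is known. The sketch uses a single 2-bit saturating branch counter; `SliceSummary`, `simulate_slice`, and `resolve` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Outcome of one slice for each of the four possible starting counter values.
struct SliceSummary {
    std::array<uint8_t, 4> end_state;     // counter value when the slice ends
    std::array<uint32_t, 4> mispredicts;  // mispredictions within the slice
};

// Runs independently per slice: no knowledge of earlier slices is needed.
SliceSummary simulate_slice(const std::vector<bool> &outcomes) {
    SliceSummary s{};
    for (uint8_t start = 0; start < 4; ++start) {
        uint8_t state = start;
        uint32_t miss = 0;
        for (bool taken : outcomes) {
            const bool predict_taken = state >= 2;
            if (predict_taken != taken) ++miss;
            if (taken && state < 3) ++state;    // saturate at 3
            if (!taken && state > 0) --state;   // saturate at 0
        }
        s.end_state[start] = state;
        s.mispredicts[start] = miss;
    }
    return s;
}

// "Resolve later": walk the slices in program order, picking the row that
// matches the state actually left behind by the previous slice.
uint32_t resolve(const std::vector<SliceSummary> &slices, uint8_t initial_state) {
    uint8_t state = initial_state;
    uint32_t total = 0;
    for (const SliceSummary &s : slices) {
        total += s.mispredicts[state];
        state = s.end_state[state];
    }
    return total;
}
```

Enumerating every start state is only practical for very small state spaces. A realistic predictor or cache has far too many states, so a tool would more likely guess one starting state and correct the affected results afterwards.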


Future (Rainy Day) Extensions

• Adaptive parallelism detection
  – Hardware feedback: adapts to available processors
  – OS feedback: adapts to present load

• Adaptive slice timeouts

• Slice-shared code caches

• Multithreaded application support


SuperPin Summary

Allows users to leverage available parallelism for certain instrumentation tasks

• Hides the gory details

• Enables significant speedup (for the right tasks … on the right machines)

• Exposed as Pin API extensions

Download it today!

http://rogue.colorado.edu/pin


Thank You


Merging Results

Each slice is a separate process, so slice-local results must be merged into a collective total

Merge routines:

• Addition

• Concatenation

• User-defined

Executed in program order

Use shared memory