DeadSpy: A Tool to Pinpoint Program Inefficiencies
Milind Chabbi and John Mellor-Crummey
Department of Computer Science, Rice University
Funded by DARPA through the Air Force Research Lab (AFRL) and the TI Fellowship



Impact of Code Efficiency

[Figure: PFLOTRAN code to model the distribution of uranium in soil. Figure credit: ornl.gov]

[Figure credit: Apple.com]

Programs need to be efficient at all scales.

Causes of Inefficiencies

Program design
• Algorithms
• Data structures

Programming practice
• Lack of design for performance

Compiler optimizations
• Sometimes optimizations can do more harm than good
• Code must be tailored to enable some optimizations

One tool to pinpoint inefficiencies: DeadSpy

Classical Performance Analysis

Focus on code regions with “high resource utilization”

Look for “hot spots” in the program
• Time
• Cache misses
• Energy

Improve code in these “hot spots”

Hot spot analysis is indispensable, but …
• Can’t identify whether resources were “well spent”
• Hot spots may be symptoms of problems

The onus is on the programmer to investigate causes

DeadSpy Philosophy

Shift focus from “resource usage” to “resource wastage”

Identify causes of wasteful resource consumption
• E.g., wasted memory accesses

Dead write: two writes happen to the same memory location without an intervening read

Dead write:
    int x = 0;    // dead write
    x = 20;       // killing write

Not a dead write:
    int x = 0;
    Print(x);
    x = 20;

DeadSpy: Motivating Example

Chombo: framework for Adaptive Mesh Refinement (AMR) for solving partial differential equations

3-level nested loop; spends 20% of the time:

C KERNEL:RIEMANNF 8 0
      do k = iboxlo2,iboxhi2
        do j = iboxlo1,iboxhi1
          do i = iboxlo0,iboxhi0
            …
            Wgdnv(i,j,k,0) = ro + frac*(rstar - ro)
            Wgdnv(i,j,k,inorm) = uno + frac*(ustar - uno)
            Wgdnv(i,j,k,4) = po + frac*(pstar - po)
            if (spout.le.(0.0d0)) then
              Wgdnv(i,j,k,0) = ro
              Wgdnv(i,j,k,inorm) = uno
              Wgdnv(i,j,k,4) = po
            endif
            if (spin.gt.(0.0d0)) then
              Wgdnv(i,j,k,0) = rstar
              Wgdnv(i,j,k,inorm) = ustar
              Wgdnv(i,j,k,4) = pstar
            endif

Compilers can’t eliminate all dead writes because of:
• Aliasing / ambiguity
• Aggregate variables
• Function boundaries
• Late binding
• Partial deadness

DeadSpy: Motivating Example

Code lacks design for performance: in the 3-level nested loop of RIEMANNF (spends 20% of the time), the unconditional writes to Wgdnv are overwritten whenever either condition holds.

Correct code:
• Else-if nesting

            if (spin.gt.(0.0d0)) then
              Wgdnv(i,j,k,0) = rstar
              Wgdnv(i,j,k,inorm) = ustar
              Wgdnv(i,j,k,4) = pstar
            else if (spout.le.(0.0d0)) then
              Wgdnv(i,j,k,0) = ro
              Wgdnv(i,j,k,inorm) = uno
              Wgdnv(i,j,k,4) = po
            else
              Wgdnv(i,j,k,0) = ro + frac*(rstar - ro)
              Wgdnv(i,j,k,inorm) = uno + frac*(ustar - uno)
              Wgdnv(i,j,k,4) = po + frac*(pstar - po)
            endif
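The else-if restructuring can be sketched in C to check that it stores the same final values while issuing fewer writes. This is an illustrative translation, not the Chombo source: the `Cell` and `Inputs` types and both function names are hypothetical, and each function returns how many groups of stores it issued to the cell.

```c
/* Hypothetical stand-in for one (i,j,k) cell of Wgdnv. */
typedef struct { double rho, un, p; } Cell;

/* Hypothetical bundle of the scalars used by the kernel. */
typedef struct {
    double ro, uno, po, rstar, ustar, pstar, frac, spout, spin;
} Inputs;

/* Original shape: one unconditional group of stores, then up to two
   overriding groups. Returns the number of store groups issued. */
int update_original(Cell *w, const Inputs *in)
{
    int stores = 1;
    w->rho = in->ro  + in->frac * (in->rstar - in->ro);
    w->un  = in->uno + in->frac * (in->ustar - in->uno);
    w->p   = in->po  + in->frac * (in->pstar - in->po);
    if (in->spout <= 0.0) {          /* overwrites the stores above */
        w->rho = in->ro; w->un = in->uno; w->p = in->po;
        stores++;
    }
    if (in->spin > 0.0) {            /* overwrites whatever came before */
        w->rho = in->rstar; w->un = in->ustar; w->p = in->pstar;
        stores++;
    }
    return stores;
}

/* Else-if nesting: the last condition of the original is tested first,
   so exactly one group of stores is issued per cell. */
int update_else_if(Cell *w, const Inputs *in)
{
    if (in->spin > 0.0) {
        w->rho = in->rstar; w->un = in->ustar; w->p = in->pstar;
    } else if (in->spout <= 0.0) {
        w->rho = in->ro; w->un = in->uno; w->p = in->po;
    } else {
        w->rho = in->ro  + in->frac * (in->rstar - in->ro);
        w->un  = in->uno + in->frac * (in->ustar - in->uno);
        w->p   = in->po  + in->frac * (in->pstar - in->po);
    }
    return 1;
}
```

When both conditions hold, the original issues three groups of stores and the first two are dead; the restructured version issues one.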

DeadSpy: A Tool to Pinpoint Dead Writes

Monitor every load and store operation in a program
Maintain state information for each memory byte
Detect every dead write in an execution with an automaton

[Figure: per-byte automaton with states V, R, and W. Edges are labeled Read and Write; the Write edge taken from state W is labeled “Write / Report dead”.]
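A minimal sketch of such a per-byte automaton in C follows. The transition table is a reading of the slide's labels (V, R, W, Read, Write, “Report dead”) combined with the definition of a dead write; the type and function names are hypothetical and not taken from DeadSpy.

```c
#include <stddef.h>

/* States for one memory byte, named after the slide's V, R, and W. */
typedef enum { STATE_V = 0, STATE_R, STATE_W } ByteState;
typedef enum { ACCESS_READ, ACCESS_WRITE } Access;

/* Advance the automaton by one access.
   Returns 1 when the access is a write that kills an unread write
   (the W -> W transition), 0 otherwise. */
int automaton_step(ByteState *s, Access a)
{
    if (a == ACCESS_READ) {
        *s = STATE_R;                /* any read moves the byte to R */
        return 0;
    }
    int dead = (*s == STATE_W);      /* previous write was never read */
    *s = STATE_W;
    return dead;
}

/* Count dead writes in an access trace for a single byte. */
size_t count_dead_writes(const Access *trace, size_t n)
{
    ByteState s = STATE_V;
    size_t dead = 0;
    for (size_t i = 0; i < n; i++)
        dead += (size_t)automaton_step(&s, trace[i]);
    return dead;
}
```

For the two examples on the philosophy slide, the trace write, write yields one dead write, and write, read, write yields none.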

DeadSpy: Implementation

Intel Pin instruments code to monitor each load and store
• Pin: a dynamic binary rewriting tool

Shadow memory maintains state
• 1-1 mapping from a program address to the corresponding state information

Instrumentation code identifies dead writes
• Check and update the state before performing any memory operation
• Record necessary information to report W → W state transitions
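One way to realize a 1-1 address-to-state mapping is a table of lazily allocated shadow pages. The sketch below is an assumption for illustration only: the slides do not describe DeadSpy's actual shadow-memory layout, and the page size, bucket count, and the name `shadow_byte` are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define NBUCKETS  1024

/* One shadow page holds one state byte per program byte of a 4 KB page. */
typedef struct ShadowPage {
    uintptr_t page_no;
    struct ShadowPage *next;         /* chain for hash collisions */
    unsigned char state[PAGE_SIZE];
} ShadowPage;

static ShadowPage *buckets[NBUCKETS];

/* Return a pointer to the shadow state byte for program address addr,
   allocating a zero-filled shadow page on first touch.
   Returns NULL if allocation fails. */
unsigned char *shadow_byte(uintptr_t addr)
{
    uintptr_t page_no = addr >> PAGE_BITS;
    size_t b = (size_t)(page_no % NBUCKETS);
    ShadowPage *p;

    for (p = buckets[b]; p != NULL; p = p->next)
        if (p->page_no == page_no)
            break;

    if (p == NULL) {
        p = calloc(1, sizeof *p);    /* state 0 = untouched byte */
        if (p == NULL)
            return NULL;
        p->page_no = page_no;
        p->next = buckets[b];
        buckets[b] = p;
    }
    return &p->state[addr & (PAGE_SIZE - 1)];
}
```

Instrumentation inserted before a load or store would look up `shadow_byte(address)` and update the state it points to. Lazy allocation keeps the shadow footprint proportional to the memory the program actually touches.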

DeadSpy: Measurement and Attribution

Instrumentation ensures precise measurement
• No false positives and no false negatives

DeadSpy provides useful (source-level) feedback with calling context for dead and killing writes
• Build Calling Context Tree (CCT) on-the-fly by instrumenting function entry and exit

    int x = 0;    // dead write
    x = 20;       // killing write
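Building a CCT from entry and exit events can be sketched as below. This is a simplified illustration, not DeadSpy's data structure: `CCTNode`, `cct_enter`, and `cct_exit` are hypothetical names, and a function is identified by an opaque pointer.

```c
#include <stdlib.h>

/* One node per distinct calling context (path from the root). */
typedef struct CCTNode {
    const void *func;            /* hypothetical function identifier */
    struct CCTNode *parent;
    struct CCTNode *child;       /* first child */
    struct CCTNode *sibling;     /* next child of the same parent */
} CCTNode;

/* Function entry: reuse the child of cur that represents func, or create
   it. Returns the new current node, or NULL if allocation fails. */
CCTNode *cct_enter(CCTNode *cur, const void *func)
{
    for (CCTNode *c = cur->child; c != NULL; c = c->sibling)
        if (c->func == func)
            return c;

    CCTNode *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->func = func;
    n->parent = cur;
    n->sibling = cur->child;     /* push onto the parent's child list */
    cur->child = n;
    return n;
}

/* Function exit: the current context becomes the caller's context. */
CCTNode *cct_exit(CCTNode *cur)
{
    return cur->parent != NULL ? cur->parent : cur;
}
```

Because repeated calls along the same path reuse one node, a pointer to the current node is a compact context identifier that can be stored alongside each write.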

DeadSpy in Action

[Figure, built up over five slides: the CCT grows as Main() calls A(), B(), and C(). Each write (W) to memory M is tracked in shadow memory M’. When a dead write is detected, the entry of the dead table (columns: Dead context, Killing context, F) for that pair of contexts is incremented (+1).]

• Implementation challenges due to limitations of binary analysis
• Cost of instrumentation

Formalizing Dead Writes in an Execution

[Figure: dead table with columns Dead, Killing, F; example row with dead context i, killing context j, and F = 25080. The value 2e10 also appears in the figure.]

$\sum_i \sum_j F(C_{ij})$

where $F(C_{ij})$ is the frequency recorded in the dead table for dead context $i$ and killing context $j$.

Higher ⇒ poor performance
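The double sum over the dead table can be sketched as follows. The table layout, capacity, and function names are hypothetical. Normalizing by the total number of writes to obtain a percentage is an assumption made here; the slide shows only the sum.

```c
#include <stddef.h>

#define MAX_PAIRS 64   /* fixed capacity, for this sketch only */

/* One row of the dead table: (dead context i, killing context j, F). */
typedef struct {
    int dead_ctx;
    int killing_ctx;
    unsigned long long freq;
} DeadEntry;

typedef struct {
    DeadEntry e[MAX_PAIRS];
    size_t n;
} DeadTable;

/* Add count to F(C_ij). Returns 0 on success, -1 if the table is full. */
int dead_table_add(DeadTable *t, int i, int j, unsigned long long count)
{
    for (size_t k = 0; k < t->n; k++) {
        if (t->e[k].dead_ctx == i && t->e[k].killing_ctx == j) {
            t->e[k].freq += count;
            return 0;
        }
    }
    if (t->n == MAX_PAIRS)
        return -1;
    t->e[t->n++] = (DeadEntry){ i, j, count };
    return 0;
}

/* Sum of F(C_ij) over all recorded context pairs. */
unsigned long long dead_table_total(const DeadTable *t)
{
    unsigned long long sum = 0;
    for (size_t k = 0; k < t->n; k++)
        sum += t->e[k].freq;
    return sum;
}

/* ASSUMPTION: deadness as a percentage of all writes in the execution. */
double deadness_percent(const DeadTable *t, unsigned long long total_writes)
{
    if (total_writes == 0)
        return 0.0;
    return 100.0 * (double)dead_table_total(t) / (double)total_writes;
}
```

Sorting the rows by F would surface the context pairs that contribute most to the total, which is where tuning effort is likely to pay off.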

Deadness in SPEC CPU2006 Integer Reference Benchmarks

Compiled with gcc 4.1.2 with -O2

Benchmark     % Deadness
astar          3.3
bzip2          9.8
gcc           60.5
gobmk         19.2
h264ref       27.6
hmmer         14.0
perlbench     19.1
libquantum     6.0
mcf           39.3
omnetpp       19.7
sjeng         16.3
xalan          6.6
Average       20.1

[An additional chart label reads 76.2.]

The number of dead writes is surprisingly high!

Across compilers and optimization levels

Case Study: GCC
Use of Inappropriate Data Structure

static void loop_regs_scan (const struct loop * loop, int extra_size) {
  last_set = (rtx *) xcalloc (regs->num, sizeof (rtx));    /* Dead */
  /* Scan the loop, recording register usage. */
  for (Instruction insn in loop) {
    if (INSN_P (insn)) {
      if (PATTERN (insn) sets a register)
        count_one_set (regs, insn, PATTERN (insn), last_set);
    }
    if (Label(insn)) // new BBL
      memset (last_set, 0, regs->num * sizeof (rtx));      /* Killing, Dead */
  }
}

• xcalloc allocates a 16937-element (132KB) array
• memset reinitializes all 16937 elements (132KB) at each new basic block
• Basic blocks are short
• The median basic block uses 2 unique elements
• Dense array is a poor data structure choice
• Performance implications on a JIT compiler

> 28% running time improvement
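One possible alternative to zeroing the whole array is to remember which slots were written and clear only those. The sketch below illustrates that idea; it is not necessarily the change made to gcc, and `SparseSet` and its functions are hypothetical names. It assumes a positive capacity and non-NULL stored values.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void  **slot;      /* dense array, playing the role of last_set */
    size_t *touched;   /* indices written since the last reset */
    size_t  ntouched;
    size_t  n;
} SparseSet;

/* Allocate n zeroed slots (n > 0). Returns 0 on success, -1 on failure. */
int sparse_init(SparseSet *s, size_t n)
{
    s->slot = calloc(n, sizeof *s->slot);
    s->touched = malloc(n * sizeof *s->touched);
    s->ntouched = 0;
    s->n = n;
    if (s->slot == NULL || s->touched == NULL) {
        free(s->slot);
        free(s->touched);
        return -1;
    }
    return 0;
}

/* Store a non-NULL value; NULL values are ignored in this sketch so that
   each index is recorded at most once between resets. */
void sparse_set(SparseSet *s, size_t i, void *v)
{
    if (v == NULL)
        return;
    if (s->slot[i] == NULL)
        s->touched[s->ntouched++] = i;
    s->slot[i] = v;
}

/* Clear only the slots written since the last reset.
   Returns how many slots were cleared. */
size_t sparse_reset(SparseSet *s)
{
    size_t cleared = s->ntouched;
    for (size_t k = 0; k < s->ntouched; k++)
        s->slot[s->touched[k]] = NULL;
    s->ntouched = 0;
    return cleared;
}

void sparse_free(SparseSet *s)
{
    free(s->slot);
    free(s->touched);
}
```

With a median of 2 unique elements per basic block, a reset would clear about 2 slots instead of writing all 16937.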

Case Study: Bzip2
Overly Aggressive Compiler Optimization

Original code

Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block, …) {
  Int32 k; UChar c1, c2; UInt16 s1, s2;
  /* 1 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 2 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 3 */
  c1 = block[i1]; c2 = block[i2];
  …
  /* 12 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;

  REST OF THE FUNCTION

Optimized code (gcc 4.1.2)

Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block …) {
  Int32 k; UChar c1, c2; UInt16 s1, s2;
  tmp1 = i1 + 1; tmp2 = i1 + 2; … tmp12 = i1 + 12;
  /* 1 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 2 */
  c1 = block[tmp1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 3 */
  c1 = block[tmp2]; c2 = block[i2];
  …
  /* 12 */
  c1 = block[tmp11]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;

[Annotations on the slide: Dead and Killing labels on the optimized code, and Early return labels on the return statements.]

> 14% running time improvement

Case Study: HMMER
Lack of Design for Performance

HMMER: code for sequence alignment

Dead writes in a 2-level nested loop

Unoptimized:

for (i = 1; i <= L; i++) {
  for (k = 1; k <= M; k++) {
    ...
    ic[k] = mpp[k] + tpmi[k];                // W
    if ((sc = ip[k] + tpii[k]) > ic[k])      // R
      ic[k] = sc;                            // W

Optimized:

for (i = 1; i <= L; i++) {
  for (k = 1; k <= M; k++) {
    ...
    R1 = mpp[k] + tpmi[k];
    ic[k] = R1;                              // W (dead)
    if ((sc = ip[k] + tpii[k]) > R1)
      ic[k] = sc;                            // W (killing)

Deadness in unoptimized: 0.3%
Deadness in optimized: 30%
54 times more dead writes (absolute number) in O2

Fix:
• Move the first store into an else branch: else ic[k] = R1;
• The arrays never alias. Declare them as “restrict” pointers.
• The loop can then be vectorized.

> 16% running time improvement.
> 40% with vectorization.
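The effect of the fix can be sketched in C99 as below. The function names, the `int` element type, and the 1-based indexing over arrays of length M + 1 are assumptions for illustration; this is not the HMMER source.

```c
#include <stddef.h>

/* Inner loop as written in the source: without aliasing information the
   compiler must keep the first store to ic[k], which the second store
   then kills whenever the condition holds. */
void inner_original(int *ic, const int *mpp, const int *tpmi,
                    const int *ip, const int *tpii, size_t M)
{
    for (size_t k = 1; k <= M; k++) {
        int sc;
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k])
            ic[k] = sc;
    }
}

/* Restructured loop: restrict promises that the arrays do not overlap,
   and the value is selected in registers so each ic[k] is stored once. */
void inner_restrict(int *restrict ic, const int *restrict mpp,
                    const int *restrict tpmi, const int *restrict ip,
                    const int *restrict tpii, size_t M)
{
    for (size_t k = 1; k <= M; k++) {
        int r1 = mpp[k] + tpmi[k];
        int sc = ip[k] + tpii[k];
        ic[k] = (sc > r1) ? sc : r1;   /* single store per element */
    }
}
```

The single-store form is a branch-free maximum, a pattern that vectorizing compilers generally handle well. Calling `inner_restrict` with overlapping arrays would be undefined behavior, so the no-alias claim must actually hold at every call site.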

Speedup Based on DeadSpy’s Advice

Benchmark   % Speedup   Cause of inefficiency
gcc         14.3        Poor data structure choice
hmmer       15.7        Lack of design for performance
bzip2        7.2        Overly aggressive compiler optimization
chombo       6.6        Lack of design for performance

> 40% on Intel compiler

AMD Opteron [email protected], 128GB of 1333MHz DDR3, CentOS 5.

Conclusions

Dead writes are a common symptom of inefficiency
• DeadSpy pinpoints inefficiencies
• Improvements in tuned codes

Lessons for compiler writers
• Pay attention to code generation algorithms
  ‐ Understand when optimizations do more harm than good (bzip2)
  ‐ Eliminate dead writes when possible
• Choose compiler data structures prudently (gcc)

Focus on wasteful resource consumption
• New avenue for performance tuning

DeadSpy: http://www.cs.rice.edu/~mc29/deadspy.tgz (Linux x86_64 only)
HPCToolkit: http://hpctoolkit.org

Backup: DeadSpy Overhead

Slowdown (times)

Benchmark     CCT only   CCT + IP
astar         12         23
bzip2         17         33
gcc           39         59
h264ref       44         72
hmmer         22         36
libquantum     7         26
mcf            7          9
omnetpp       17         31
perlbench     36         66
sjeng         20         45
xalan         24         56
Average       22         41

Deadness in Multi-Threaded Codes

[Figure: deadness in % (axis labeled “log base 2 scale”, ticks from 1.00E-03 to 1.00E+02) for the OpenMP NAS Parallel Benchmarks bt.S, cg.S, dc.S, ep.S, ft.S, is.S, lu.S, mg.S, sp.S, ua.S, and their average. Series: Inter-Thread, Intra-Thread, Total.]


Deadness Across Compilers and Optimization Levels