9/22/2002, NC State University. Detecting Performance Bottlenecks Using Binary Rewriting. Jaydeep Marathe and Frank Mueller, North Carolina State University.

Detecting Performance Bottlenecks Using Binary Rewriting

Jaydeep Marathe and Frank Mueller

North Carolina State University

Department of Computer Science


Why are Memory Performance Bottlenecks a Problem?

[Figure: Processor (CPU), L1 Cache, L2 Cache, Main Memory (DRAM)]

• Processor speeds are growing much faster than memory access speeds.

• Application memory performance has an increasingly significant impact on overall performance.


Locality Of Reference

Increase locality, decrease misses!

• Temporal Locality :: the same element of a cache block is accessed repeatedly before the block is evicted.

[Figure: Temporal Locality. Processor, Cache and Main Memory; repeated accesses to x, labelled Miss! and Hit!]

• Spatial Locality :: adjacent elements of a cache block are accessed before the block is evicted.

[Figure: Spatial Locality. Processor, Cache and Main Memory; a cache block holding x, y, z; accesses labelled Miss! and Hit!]
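
A minimal C sketch of spatial locality, under stated assumptions: the array size `DIM` and both function names are illustrative and do not come from the slides. C stores two-dimensional arrays row by row, so the row-order walk touches adjacent elements of a cache block one after another, while the column-order walk jumps a full row between consecutive accesses.

```c
#include <stddef.h>

#define DIM 64  /* illustrative size, not from the slides */

/* Row-order walk: consecutive accesses are sizeof(double) bytes apart,
 * so several of them fall into the same cache block. */
double sum_row_order(double m[DIM][DIM])
{
    double sum = 0.0;
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            sum += m[i][j];
    return sum;
}

/* Column-order walk: consecutive accesses are DIM * sizeof(double)
 * bytes apart, so each access is likely to touch a different block. */
double sum_col_order(double m[DIM][DIM])
{
    double sum = 0.0;
    for (size_t j = 0; j < DIM; j++)
        for (size_t i = 0; i < DIM; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions compute the same sum; only the order of the memory accesses differs, which is the kind of difference the metrics on the later slides are meant to expose.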


How To Gauge Memory Performance?

One Way ..

• Use hardware event counters.
• Sample counter values at regular intervals.

[Figure: an observer process periodically samples the processor's hardware counters while the application executes, and produces usage statistics]

Drawbacks:
• Tradeoff between accuracy and sampling overhead.
• Fairly coarse statistics: overall hits, misses, etc.

Another Way ..

• Use an instrumenting compiler.
• Insert code to log memory accesses.
• Use the complete trace for analysis.

[Figure: Source Code, Instrumenting Compiler, Instrumented binary (executes), Complete Access Trace, Post-processor, Usage Statistics]

Drawbacks:
• High execution overhead due to logging.
• The complete trace is huge: hundreds of MBs in size.

Need accurate metrics with minimum time and space overheads!


Detecting Bottlenecks Using Binary Rewriting

[Figure: the Controller Process instruments the Target Binary; the memory trace passes through online compression into a compressed trace file, which drives the Cache Simulator to produce detailed cache statistics]

• Binary rewriting to instrument the application binary.

• Insert online compression routines to compress the generated trace.

• Drive an incremental cache simulator with the compressed trace.

• The simulator generates detailed cache metrics for user feedback.


Advantages ..

[Figure: the Controller Process instruments the Target Binary; the memory trace passes through online compression into a compressed trace file, which drives the Cache Simulator to produce detailed cache statistics]

• Selective instrumentation of parts of the target binary.

• Partial data traces instead of complete traces.

• Online compression reduces trace storage requirements.

• Statistics correlated to source data structures.


Instrumenting Target Binary

• Extended a portable binary manipulation framework (DynInst: U. Maryland).

• Parse the Control Flow Graph (CFG): locate routine scopes and loop scopes.

• Instrument memory access (load / store) and scope change instructions.

• Instrumentation calls handler functions in a shared library.

[Figure: the mutator (controller) uses DynInst to obtain the CFG from the machine code of the target binary and to insert instrumentation]

Source:

    Target_func() {
        for (I = 0; I < N; I++) {
            A[I] = B[I] * C[I];
        } /* end loop */
    } /* end function */

Machine code:

    _Target_func:
    LoopStart:
        ……
        LOAD  B[I], R1
        LOAD  C[I], R2
        MULT  R1, R2, R3
        STORE A[I]
        ……
        LOOP  LoopStart
    end_routine

Handler functions in the shared library:

    ENTER_SCOPE_Handler()
    LOAD_Handler()
    STORE_Handler()
    EXIT_SCOPE_Handler()
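
To make the handler idea concrete, here is a minimal C sketch of what such a shared library could look like. Only the four handler names come from the slide; their parameters, the `trace_event` record, and the fixed-size `trace_buf` are assumptions made for illustration. The sketch does not use the DynInst API, and it appends events to a buffer where the real tool would pass them to the online compressor.

```c
#include <stddef.h>
#include <stdint.h>

/* Event kinds recorded by the handlers (illustrative names). */
enum event_type { EV_ENTER_SCOPE, EV_EXIT_SCOPE, EV_LOAD, EV_STORE };

struct trace_event {
    enum event_type type;
    uintptr_t value;     /* accessed address, or a scope id for scope events */
    unsigned src_index;  /* hypothetical id of the instrumented access point */
};

#define TRACE_CAPACITY 1024  /* assumption: small in-memory buffer */

struct trace_event trace_buf[TRACE_CAPACITY];
size_t trace_len = 0;

/* Append one event; events beyond the buffer capacity are dropped. */
static void log_event(enum event_type type, uintptr_t value, unsigned src_index)
{
    if (trace_len < TRACE_CAPACITY) {
        trace_buf[trace_len].type = type;
        trace_buf[trace_len].value = value;
        trace_buf[trace_len].src_index = src_index;
        trace_len++;
    }
}

/* Called by instrumentation at routine or loop entry. */
void ENTER_SCOPE_Handler(unsigned scope_id)
{
    log_event(EV_ENTER_SCOPE, scope_id, 0);
}

/* Called by instrumentation at routine or loop exit. */
void EXIT_SCOPE_Handler(unsigned scope_id)
{
    log_event(EV_EXIT_SCOPE, scope_id, 0);
}

/* Called by instrumentation before a load instruction. */
void LOAD_Handler(const void *addr, unsigned src_index)
{
    log_event(EV_LOAD, (uintptr_t)addr, src_index);
}

/* Called by instrumentation before a store instruction. */
void STORE_Handler(const void *addr, unsigned src_index)
{
    log_event(EV_STORE, (uintptr_t)addr, src_index);
}
```

For the loop on this slide, one iteration would then log two `EV_LOAD` events (for `B[I]` and `C[I]`) followed by one `EV_STORE` event (for `A[I]`), bracketed by scope events for the loop.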


Compressing Generated Trace

• The generated trace potentially contains millions of accesses!

• Solution :: detect regular patterns in the trace for effective compression.

• Regular Section Descriptor (RSD) :: primary representation.

• RSD :: <
      start_addr  :: starting address of the pattern,
      length      :: length of the pattern,
      addr_stride :: stride between successive addresses in the pattern,
      start_seq   :: starting position of the pattern in the overall trace,
      seq_stride  :: interleave distance in the overall trace between successive addresses from this pattern,
      event_type  :: Enter/Exit Scope or Load/Store access,
      src_index   :: index into the {source_line :: source_file} table
  >
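
The RSD tuple maps naturally onto a C struct. The sketch below is an assumption-laden illustration, not the tool's actual data structure: the field names follow the slide, while the field types, the `rsd_event` enum, and the helpers `rsd_addr` and `rsd_seq` are hypothetical. The helpers show how the i-th event of a pattern is recovered from the descriptor alone.

```c
#include <stdint.h>

/* Event kinds an RSD can describe (illustrative enum). */
enum rsd_event { RSD_ENTER_SCOPE, RSD_EXIT_SCOPE, RSD_LOAD, RSD_STORE };

struct rsd {
    uint64_t start_addr;       /* starting address of the pattern */
    uint64_t length;           /* number of events in the pattern */
    int64_t  addr_stride;      /* address difference between successive events */
    uint64_t start_seq;        /* position of the first event in the overall trace */
    uint64_t seq_stride;       /* trace distance between successive events */
    enum rsd_event event_type; /* scope change or load/store access */
    uint32_t src_index;        /* index into the source line/file table */
};

/* Address of the i-th event of the pattern, for 0 <= i < length. */
uint64_t rsd_addr(const struct rsd *r, uint64_t i)
{
    return r->start_addr + (uint64_t)((int64_t)i * r->addr_stride);
}

/* Position of the i-th event in the overall trace, for 0 <= i < length. */
uint64_t rsd_seq(const struct rsd *r, uint64_t i)
{
    return r->start_seq + i * r->seq_stride;
}
```

A descriptor of fixed size thus stands in for `length` trace entries, which is where the compression comes from.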


An RSD Example

Consider the trace produced by the following sample loop:

    for (I = 0; I <= N; I++) {
        A[I] = A[I] + B[I][I];
    }

Two loads :: A[I] and B[I][I]
One store :: A[I]

Address trace generated:

    A[0] B[0][0] A[0]   A[1] B[1][1] A[1]   ...   A[N] B[N][N] A[N]

RSD-1 :: over the loads of A[I]

    RSD-1 :: < start_addr = &A[0], length = N+1, addr_stride = 8, start_seq = 0, seq_stride = 3, event_type = LOAD, src_index = 1 >
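
How such a descriptor could be found in a trace can be sketched in C as follows. This is a simplified illustration, not the tool's compression algorithm: `match_rsd` and `struct rsd_match` are hypothetical names, and the caller is assumed to supply the starting position and interleave distance to test. The function follows every `seq_stride`-th trace entry and reports how long the address stride stays constant.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of a match: the address part of an RSD (hypothetical type). */
struct rsd_match {
    uint64_t start_addr;   /* address at the starting position */
    int64_t  addr_stride;  /* constant address difference, 0 if length < 2 */
    size_t   length;       /* number of trace entries covered */
};

/* Follow every seq_stride-th entry of trace[] starting at start_seq and
 * return the longest constant-stride run beginning at that position. */
struct rsd_match match_rsd(const uint64_t *trace, size_t n,
                           size_t start_seq, size_t seq_stride)
{
    struct rsd_match m = {0, 0, 0};
    if (start_seq >= n || seq_stride == 0)
        return m;

    m.start_addr = trace[start_seq];
    m.length = 1;

    size_t next = start_seq + seq_stride;
    if (next >= n)
        return m;

    /* The first two entries fix the stride; later entries must agree. */
    m.addr_stride = (int64_t)(trace[next] - trace[start_seq]);
    m.length = 2;

    for (size_t prev = next, cur = next + seq_stride; cur < n;
         prev = cur, cur += seq_stride) {
        if ((int64_t)(trace[cur] - trace[prev]) != m.addr_stride)
            break;
        m.length++;
    }
    return m;
}
```

Assuming each iteration of the sample loop logs the load of `A[I]`, the load of `B[I][I]`, and then the store to `A[I]`, calling `match_rsd(trace, n, 0, 3)` recovers the address part of RSD-1, and `match_rsd(trace, n, 1, 3)` recovers the pattern for the loads of `B[I][I]`.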


Power Regular Section Descriptors (PRSDs)

• RSDs alone are not powerful enough to compress the address stream efficiently.

• Solution :: nest RSDs to create a Power Regular Section Descriptor (PRSD).

• PRSD :: <
      base_addr       :: first address generated by the PRSD,
      base_addr_shift :: stride of base_addr between PRSD iterations,
      base_seq        :: starting position of this pattern in the trace,
      base_seq_shift  :: interleave distance between PRSD iterations,
      length          :: PRSD length,
      child PRSD/RSD  :: nested PRSD/RSD
  >

    for (…; …; …) {              /* PRSD-3, child = PRSD-2 */
        for (…; …; …) {          /* PRSD-2, child = RSD-1 */
            for (…; …; …) {      /* RSD-1 */
                A[ ][ ][ ]
            }
        }
    }

• With PRSDs, loop nests of arbitrary depth can be represented efficiently.
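
The nesting can be illustrated with a small C sketch that expands a PRSD back into addresses. It is a simplification under stated assumptions: the sequence fields (`base_seq`, `base_seq_shift`) are left out, the child RSD takes its start address from its parent, and the type layouts and the function `expand_prsd` are hypothetical rather than taken from the tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Innermost pattern; its start address is supplied by the enclosing PRSD. */
struct rsd {
    uint64_t length;
    int64_t  addr_stride;
};

/* A PRSD repeats its child `length` times, shifting the base address
 * by base_addr_shift on every repetition. */
struct prsd {
    uint64_t base_addr;
    int64_t  base_addr_shift;
    uint64_t length;
    const struct prsd *child_prsd; /* nested PRSD, or NULL */
    const struct rsd  *child_rsd;  /* innermost RSD, used when child_prsd is NULL */
};

/* Write the addresses of one RSD instance starting at `base`. */
static size_t expand_rsd(const struct rsd *r, uint64_t base,
                         uint64_t *out, size_t cap, size_t n)
{
    for (uint64_t i = 0; i < r->length; i++) {
        if (n < cap)
            out[n] = base + (uint64_t)((int64_t)i * r->addr_stride);
        n++;
    }
    return n;
}

/* Recursively expand a PRSD; returns the number of addresses generated.
 * Call with base = p->base_addr and n = 0 for the outermost descriptor. */
size_t expand_prsd(const struct prsd *p, uint64_t base,
                   uint64_t *out, size_t cap, size_t n)
{
    for (uint64_t k = 0; k < p->length; k++) {
        uint64_t b = base + (uint64_t)((int64_t)k * p->base_addr_shift);
        n = p->child_prsd ? expand_prsd(p->child_prsd, b, out, cap, n)
                          : expand_rsd(p->child_rsd, b, out, cap, n);
    }
    return n;
}
```

A walk over a 3 x 4 array of doubles, for example, needs one RSD (length 4, stride 8) and one PRSD (length 3, shift 32), regardless of how many addresses the walk generates; each additional loop level adds one more PRSD.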


Incremental Cache Simulation

[Figure: the Cache Simulator reads the Compressed Trace (address trace), the Scopes File (scope structure of the target) and the Variables File (base addresses of variables in the target), and writes the Report File (detailed cache statistics)]

• Incremental cache simulation (modified MHSim [Rice U.]).

• Correlate:
      trace addresses <---> variable names
      access points (LD / ST instructions) <---> line numbers in source

• Metrics per access point, also aggregated by the scope structure of the target.
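
As a rough illustration of what a trace-driven simulator computes, here is a minimal direct-mapped cache model in C. It is not MHSim and it is not incremental. The cache geometry, the type and function names, and the hit classification rule are all assumptions: a hit counts as temporal when the same word of the block was touched since the block was loaded, and as spatial otherwise.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32u   /* bytes per cache block (illustrative) */
#define NUM_SETS   256u  /* direct-mapped: one block per set (illustrative) */
#define WORD_SIZE  8u    /* granularity at which block usage is tracked */

struct cache_line {
    int      valid;
    uint64_t tag;   /* block number currently held by this set */
    uint8_t  used;  /* one bit per word touched since the block was loaded */
};

struct cache_stats {
    uint64_t misses, temporal_hits, spatial_hits;
    uint64_t evictions, used_words_at_evict;
};

struct cache {
    struct cache_line lines[NUM_SETS];
    struct cache_stats stats;
};

void cache_init(struct cache *c) { memset(c, 0, sizeof *c); }

static unsigned count_bits(uint8_t v)
{
    unsigned n = 0;
    while (v) { n += v & 1u; v >>= 1; }
    return n;
}

/* Simulate one access and update the statistics. */
void cache_access(struct cache *c, uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;
    struct cache_line *line = &c->lines[block % NUM_SETS];
    uint8_t word_bit = (uint8_t)(1u << ((addr % BLOCK_SIZE) / WORD_SIZE));

    if (line->valid && line->tag == block) {
        if (line->used & word_bit)
            c->stats.temporal_hits++;  /* same word touched again */
        else
            c->stats.spatial_hits++;   /* new word of a resident block */
        line->used |= word_bit;
        return;
    }

    c->stats.misses++;
    if (line->valid) {                 /* conflict: evict the resident block */
        c->stats.evictions++;
        c->stats.used_words_at_evict += count_bits(line->used);
    }
    line->valid = 1;
    line->tag = block;
    line->used = word_bit;
}

/* Average fraction of a block used before eviction; 0 if nothing was evicted. */
double spatial_use(const struct cache_stats *s)
{
    if (s->evictions == 0)
        return 0.0;
    return (double)(s->used_words_at_evict * WORD_SIZE)
         / (double)(BLOCK_SIZE * s->evictions);
}
```

Keeping one `cache_stats` record per access point, rather than one per cache as here, would give per-reference figures of the kind the report contains.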


Report File

Cache Metrics Per Access Point

    Metric              Definition                                 What it tells us
    Miss ratio          Total Misses / Total Accesses              Coarse indicator of performance
    Temporal ratio      Temporal Hits / Total Hits                 Relative degree of temporal locality
    Spatial ratio       Spatial Hits / Total Hits                  Relative degree of spatial locality
    Spatial use         Used Bytes / (Block Size * # evictions)    Fraction of cache block used before eviction (access efficiency)
    Evictor references  List of evictors                           Conflicting variables (useful!)
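
The three ratio metrics are simple quotients of counters. A minimal C sketch, assuming a hypothetical `ref_counters` record per access point (the struct and function names are not from the slides):

```c
#include <stdint.h>

/* Raw counters for one access point (hypothetical record). */
struct ref_counters {
    uint64_t temporal_hits;
    uint64_t spatial_hits;
    uint64_t misses;
};

/* Miss ratio = total misses / total accesses. */
double miss_ratio(const struct ref_counters *c)
{
    uint64_t total = c->temporal_hits + c->spatial_hits + c->misses;
    return total ? (double)c->misses / (double)total : 0.0;
}

/* Temporal ratio = temporal hits / total hits. */
double temporal_ratio(const struct ref_counters *c)
{
    uint64_t hits = c->temporal_hits + c->spatial_hits;
    return hits ? (double)c->temporal_hits / (double)hits : 0.0;
}

/* Spatial ratio = spatial hits / total hits. */
double spatial_ratio(const struct ref_counters *c)
{
    uint64_t hits = c->temporal_hits + c->spatial_hits;
    return hits ? (double)c->spatial_hits / (double)hits : 0.0;
}
```

The temporal and spatial ratios always sum to 1 when there is at least one hit; a reference with no hits at all has neither ratio defined, which the sketch reports as 0.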


Test Kernel: Matrix Multiplication

C source code:

    for (i = 0; i < MAT_DIM; i++)
        for (j = 0; j < MAT_DIM; j++)
            for (k = 0; k < MAT_DIM; k++)
                x[i][j] = y[i][k] * z[k][j] + x[i][j];

MAT_DIM = 800, total samples registered = 1000000

Overall Performance:

    reads          = 750000     writes        = 250000
    hits           = 738811     misses        = 261189     miss ratio  = 0.261
    temporal hits  = 703930     spatial hits  = 34881
    temporal ratio = 0.95       spatial ratio = 0.04721    spatial use = 0.169

Per Reference Information:

    Line  Name       Hits      Miss Ratio  Temporal Ratio  Spatial Use  Evictors
    66    z_Read_1   0.00e+00  1.0         no hits         0.171        Z,Y,X
    66    y_Read_0   2.39e+05  0.044       0.854           0.129        Z
    66    x_Read_2   2.50e+05  0.0006      1.00            0.5          Z
    66    x_Write_3  2.50e+05  0.0         1.00            no evicts

• High miss ratio: more than 25% of accesses were misses.
• Low spatial use: references are evicted before the cache block is fully referenced.
• z_Read_1 dominates the misses (100% miss ratio).
• Cause: iteration space layout.
• z_Read_1 is the sole evictor (it evicts itself: 95% of the evictor table).
• Evictions cause low spatial use for the x, y and z loads.

• Locality for z: interchange the j and k loops.
• Temporal reuse for y and x: blocking (tiling).


Optimized Matrix Multiply

    for (jj = 0; jj < MAT_DIM; jj += ts)
        for (kk = 0; kk < MAT_DIM; kk += ts)
            for (i = 0; i < MAT_DIM; i++)
                for (k = kk; k < min(kk + ts, MAT_DIM); k++)
                    for (j = jj; j < min(jj + ts, MAT_DIM); j++)
                        x[i][j] = y[i][k] * z[k][j] + x[i][j];

tile size ts = 16

Overall performance (New / Old):

    hits           = 982128 / 738811
    misses         = 17872 / 261189
    miss ratio     = 0.017 / 0.261
    temporal hits  = 947173 / 703930
    spatial hits   = 34955 / 34881
    temporal ratio = 0.96441 / 0.95
    spatial ratio  = 0.03559 / 0.04721
    spatial use    = 0.7039 / 0.169

Per Reference Information (New / Old):

    Name       Hits                 Misses               Miss Ratio      Temporal Ratio   Spatial Use
    z_Read_1   2.41e+05 / 0         8.79e+03 / 2.50e+05  0.035 / 1.0     0.972 / no hits  0.673 / 0.171
    y_Read_0   2.41e+05 / 2.39e+05  8.79e+03 / 1.10e+04  0.035 / 0.044   0.896 / 0.854    0.732 / 0.129
    x_Read_2   2.50e+05 / 2.50e+05  2.88e+02 / 1.57e+02  0.001 / 0.0006  0.99 / 1.00      0.861 / 0.5
    x_Write_3  2.50e+05 / 2.50e+05  0 / 0.00e+00         0.0 / 0.0       0.89 / 1.00      no evicts / no evicts
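
Tiling reorders the iterations but must not change the result, which can be checked with a small harness. The following C sketch assumes a reduced `MAT_DIM` of 40 (the slides use 800), takes the arrays and the tile size as parameters, and uses a hypothetical helper `min_size` in place of the `min` on the slide. The tiled version reads `y[i][k]`, as the untiled kernel does.

```c
#include <stddef.h>

#define MAT_DIM 40  /* small illustrative size; the slides use 800 */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Untiled kernel: x += y * z with the i, j, k loop order. */
void matmul_naive(double x[MAT_DIM][MAT_DIM],
                  double y[MAT_DIM][MAT_DIM],
                  double z[MAT_DIM][MAT_DIM])
{
    for (size_t i = 0; i < MAT_DIM; i++)
        for (size_t j = 0; j < MAT_DIM; j++)
            for (size_t k = 0; k < MAT_DIM; k++)
                x[i][j] = y[i][k] * z[k][j] + x[i][j];
}

/* Tiled kernel: j and k are blocked with tile size ts (ts must be > 0),
 * and j is the innermost loop so that z and x are walked along rows. */
void matmul_tiled(double x[MAT_DIM][MAT_DIM],
                  double y[MAT_DIM][MAT_DIM],
                  double z[MAT_DIM][MAT_DIM],
                  size_t ts)
{
    for (size_t jj = 0; jj < MAT_DIM; jj += ts)
        for (size_t kk = 0; kk < MAT_DIM; kk += ts)
            for (size_t i = 0; i < MAT_DIM; i++)
                for (size_t k = kk; k < min_size(kk + ts, MAT_DIM); k++)
                    for (size_t j = jj; j < min_size(jj + ts, MAT_DIM); j++)
                        x[i][j] = y[i][k] * z[k][j] + x[i][j];
}
```

Running both functions on the same inputs, starting from a zeroed `x`, and comparing the outputs element by element confirms that the transformation is safe for this kernel; `MAT_DIM` is deliberately not a multiple of 16 so that the partial tiles at the edges are exercised.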


Another example: ADI Integration

    16 for (k = 1; k < N; k++) {
    17     for (i = 2; i < N; i++)
    18         x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];

    22     for (i = 2; i < N; i++)
    23         b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
       }

N = 800, accesses logged = 1000000

Overall Performance:

    reads          = 800000     writes        = 200000
    hits           = 499499     misses        = 500501     miss ratio  = 0.5
    temporal hits  = 351731     spatial hits  = 147768
    temporal ratio = 0.704      spatial ratio = 0.29583    spatial use = 0.2018

Per Reference Metrics:

    Line  Name       Source_Ref  Hits      Misses    Miss Ratio  Temporal Ratio  Spatial Use
    18    x_Read_3   x[i][k]     0         1.00e+05  1.00        no hits         0.13
    18    a_Read_1   a[i][k]     0         1.00e+05  1.00        no hits         0.25
    18    b_Read_2   b[i-1][k]   0         1.00e+05  1.00        no hits         0.13
    23    b_Read_8   b[i][k]     0         9.98e+04  1.00        no hits         0.24
    23    a_Read_5   a[i][k]     0         9.98e+04  1.00        no hits         0.24
    18    x_Read_0   x[i-1][k]   1.00e+05  1.26e+02  0           1.0             0.25
    23    b_Read_7   b[i-1][k]   9.96e+04  1.25e+02  0           1.0             0.25
    18    x_Write_4  x[i][k]     1.00e+05  0.00e+00  0           0.50            no evicts
    23    b_Write_9  b[i][k]     9.98e+04  0.00e+00  0           0.27            no evicts
    23    a_Read_6   a[i][k]     9.98e+04  0.00e+00  0           0.74            no evicts

• High overall miss rate: poor locality.
• Low overall spatial use.
• The first 5 references have 0 hits.
• Pattern: references iterate over rows.
• Low spatial use values: evictions.

• Increase locality for the top 5 references.
• Increase spatial locality: interchange loops.


Optimized :: ADI Integration

    for (i = 2; i < N; i++) {
        for (k = 1; k < N; k++)
            x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
        for (k = 1; k < N; k++)
            b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
    }

N = 800, accesses logged = 1000000

Overall performance (New / Old):

    hits           = 874600 / 499499
    misses         = 125400 / 500501
    miss ratio     = 0.125 / 0.5
    temporal hits  = 454867 / 351731
    spatial hits   = 419733 / 147768
    temporal ratio = 0.52009 / 0.704
    spatial ratio  = 0.4799 / 0.29583
    spatial use    = 0.9628 / 0.2018

Per Reference Information (New / Old):

    Name      Hits          Misses               Miss Ratio   Temporal Ratio   Spatial Use
    x_Read_3  7.51e+04 / 0  2.51e+04 / 1.00e+05  0.25 / 1.0   0.0 / no hits    0.98 / 0.13
    a_Read_1  7.51e+04 / 0  2.51e+04 / 1.00e+05  0.25 / 1.0   0.35 / no hits   0.91 / 0.25
    b_Read_2  8.53e+04 / 0  1.45e+04 / 1.00e+05  0.145 / 1.0  0.495 / no hits  0.934 / 0.13
    b_Read_8  7.48e+04 / 0  2.50e+04 / 9.98e+04  0.25 / 1.0   0.0 / no hits    0.948 / 0.24
    a_Read_5  8.20e+04 / 0  1.78e+04 / 9.98e+04  0.17 / 1.0   0.30 / no hits   0.99 / 0.24

• Significantly more hits.
• Fewer evictions, higher spatial use.


Summing Up ..

Process

• Use binary rewriting to instrument the target executable.
• Compress the generated trace online.
• Use the compressed trace for cache simulation.

Highlights

• Compiler-independent support.
• Useful for mixed-language applications.
• Partial data traces: targeted instrumentation.
• Efficient online trace compression.
• Enhanced user feedback :: source-correlated statistics.


Thank You!


Future Work

Automatic Optimization

• Identify natural loops in the CFG.

• Attempt to identify data dependencies from the binary.

• Reconfigure the binary with optimizations, without violating data dependencies.

• Optimizations could include prefetching, tiling, loop fusion, loop interchange, etc.

[Figure: the Controller attaches to the executing binary, extracts the CFG from its text section, and injects optimizations]


Related Work

• SIGMA [Supercomputing'02]: Simulator Infrastructure to Guide Memory Analysis.
  Captures the full address trace; no evictor information; weaker compression algorithm.

• MTOOL [TPDS'93], CPROF [Computer'94]: correlation to source line numbers only.

• PAPI, HPM: APIs to access hardware performance counters.


The Compression Algorithm

• Targeted at regular array accesses in tightly nested loops.

  Works well (constant-size compressed trace!):

    for (…; …; …) {
        for (…; …; …) {
            for (…; …; …) {
                A[ ][ ][ ]
            }
        }
    }

  Won't work well:

    for (…; …; …) {
        A[B[I]] = 2.0;
    }

• The algorithm has growth rate O(n x w), where n = # accesses and w = 'pool size' (maximum # accesses residing in memory for pattern matching).

• The tool structure is modular :: it is possible to use some other algorithm more suited to the application domain.


Challenges

• Reverse-mapping of accesses to variable expressions in the source:

    Currently limited to local and global variables only.

    Dynamically allocated objects are difficult to support, since the program counter might have passed the object allocation stage (malloc) by the time we attach to the application.

    Reverse-engineering an access point into its source expression is difficult (e.g. A[I+j*2][Q+1][P+R] = 2.0 maps to many machine instructions).

• Symbol table information must be present for effective user feedback (variable names, line numbers, etc.).


Memory Performance Metrics

[Figure: Processor, L1 Cache, L2 Cache, Main Memory, with relative access cycles ~2, ~5, ~30; an access to x is a hit, an access to y is a miss]

• A hit occurs when a layer contains the accessed element.

• A miss occurs when the requested element is absent from a layer.

• Misses are bad :: they force the processor to stall until the data is fetched from the next layer.

• Fewer misses, faster performance.
