Presented by: SHASHI MYSORE shashimc@cs.ucsb.edu

Profiling over Adaptive Ranges

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri

Department of Computer Science, University of California, Santa Barbara

4th Annual ACM/IEEE International Symposium on Code Generation and Optimization (CGO), 27th March 2006, Manhattan, NY

Shashi Mysore 2

Motivation

Program Profiling: Understand system-workload interactions - gather data, quantify, analyze, and optimize

• At the core: We need to count events

• Basic blocks, load value distribution, load instructions, load addresses, zero-value loads, narrow-width operands, etc.

• Challenge:

– Huge complex programs

– Limited storage - tiny streaming profilers

– Runtime analysis - feasible hardware solutions

Let us consider an example of code profiling …

An example: Code Profiling

[Figure: program code shown as x86 assembly listings, divided into basic blocks BB1 to BBn]

Each basic block executes some number of times

Use counters …

Where are the hot regions? Some are hot, some are not

How hot are they?

… And can we discover this knowledge at run time?

Naïve Approach: Unlimited Counters

[Figure: code as basic blocks BB1 to BBn; the legend marks each counter and the coverage of that counter]

N basic blocks – N counters

Each counter covers one basic block

We get:

• High Coverage

• High Precision

Problem:

• Many programs have 800,000 basic blocks or more!

• … but not all of them are important enough to be quantified

So let’s limit the number of counters …

Naïve Approach: Limited Counters

[Figure: code as basic blocks BB1 to BBn, with counters C1 to CK placed on K of the blocks]

N basic blocks – K counters

Pick K basic blocks and let the K counters cover them

We get:

• High Precision - For the hot spots

• Low Coverage - At the right spots

But what if we did …

[Figure: the same code BB1 to BBn, with counters C1 to CK covering K basic blocks in unimportant regions]

We get:

• Low Coverage – and at unimportant regions

• High Precision – but it is not as useful

Problem:

• We have zero information about hot regions

• How do we know which region of the code to cover with the K counters?

Distribute the basic blocks among the counters …

Naïve Approach: Uniform Ranges

[Figure: code as basic blocks BB1 to BBn, divided into ranges covered by counters C1 to CK]

K counters to cover the entire program

Each counter counts a range of basic blocks

We get:

• High Coverage with K counters

• Low Precision

Problem:

• One counter associated with a huge set of basic blocks

• Only average behavior – low precision

• Precision important – especially for hot regions
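The trade-off between the naïve approaches can be made concrete with a small sketch. The trace, the block count, and the function names below are hypothetical and chosen only for illustration. One counter per basic block gives an exact count, while K uniform-range counters can only report an average over each range.

```python
from collections import Counter

def exact_profile(trace):
    """One counter per basic block (the 'unlimited counters' approach)."""
    return Counter(trace)

def uniform_range_profile(trace, num_blocks, num_counters):
    """K counters, each covering an equal-width range of basic-block IDs."""
    width = -(-num_blocks // num_counters)  # ceiling division
    counters = [0] * num_counters
    for block_id in trace:
        counters[block_id // width] += 1
    return counters, width

# Hypothetical trace: block 5 is hot, every block also executes once.
NUM_BLOCKS = 16
trace = list(range(NUM_BLOCKS)) + [5] * 100

exact = exact_profile(trace)
ranges, width = uniform_range_profile(trace, NUM_BLOCKS, num_counters=4)

print("exact count for block 5:", exact[5])
print("uniform range counters:", ranges)
# The range counter that holds block 5 can only give a per-block average.
print("estimated count for block 5:", ranges[5 // width] / width)
```

With four counters the hot block is blurred together with its three neighbours, which is the low-precision problem listed above.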

Related Work

• Profile gathering and analysis schemes

– [Anderson et al., ’97], [Arnold et al., ’01], [Heil and Smith, ’00], [Sastry et al., ’01], [Ball and Larus, ’96], [Calder et al., ’97], [Hirzel and Chilimbi, ’01]

• Hardware assisted profiling and optimizations

– [Brooks et al., ’99], [Conte et al., ’94, ’96], [Dean et al., ’97], [Narayanasamy et al., ’03], [Zhou et al., ’04], [Zilles and Sohi, ’01], [Nagpurkar et al., ’05], [Mousa et al., ’05]

• High coverage

• High precision

• Limited number of counters

• Covers any stream of profile data

• Low precision information on cold regions

• Divide profile data hierarchically

Ideally: Best Ranges

[Figure: code as basic blocks BB1 to BBn]

We want:

• High precision for hot regions

• Lower precision information about colder regions

• High coverage by optimal use of a few counters

Challenge:

Discovering what to count and with how much precision

Challenges

Ideal Profiler: Selects the best possible ranges; decides the precision

Real Problem: Identifying the best possible ranges to count before we have even started counting

What comes first: start counting, or identify ranges?

Range Adaptive Profiler solves exactly this problem by dynamically identifying ranges as we count

Our Approach: Adaptive Profiling

[Figure: code as basic blocks BB1 to BBn, with counters that split over the hot regions]

Initially: One counter - covers entire range

When the counter is “hot” – split it

Next: Whenever a node is “hot” – split it

Dynamically adapt counter coverage to suit the execution frequency

As the program executes: Allocate counters towards regions that are hotter

At the end of the profiling phase, we get:

• High precision for hot regions

• Lower precision information about colder regions

• High coverage by optimally using a few counters

Range Adaptive Profiling

Advantages:

• A streaming (one-pass) technique to hierarchically classify events

• Fixed number of counters – O(log(R) * 1/E)

• Precision adaptive to hot regions

• Guaranteed error bounds

Any stream of profile data that can be divided hierarchically:

• Code profiling

• Values profiling

• Load address profiling

• Zero-value load profiling

• Narrow-width operand profiling

Outline

• Program Profiling

– An example: Code profiles

– Related work

• Range Adaptive Profiling

– Advantages and Applications

– Splits

– Merges

• Making it efficient

– Batching merges

– Branching Factor

• RAP implementation

– Results – Quantify error and memory

– Hardware and Software

• Conclusions

Adaptive Profiling - Splits

[Figure: code as basic blocks BB1 to BBn; counters over hot ranges are split repeatedly]

… when to split?

• SplitThreshold = E·N / log(R)

• Adaptive to stream size

• Relative importance crucial

• Bounds error in counting
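The splitting rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes E is the error bound, N the number of events seen so far, and R the size of the profiled range; it assumes a parent keeps its count when it splits and its children start from zero; and the names `Node` and `RapSketch` are hypothetical.

```python
import math

class Node:
    """One counter covering the half-open range [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.count = 0
        self.children = []

class RapSketch:
    def __init__(self, range_size, epsilon, branching=2):
        self.root = Node(0, range_size)   # initially one counter covers everything
        self.log_r = math.log2(range_size)
        self.epsilon = epsilon
        self.branching = branching
        self.n = 0                        # events seen so far

    def split_threshold(self):
        # SplitThreshold = E * N / log(R), evaluated with the current N
        return self.epsilon * self.n / self.log_r

    def add(self, x):
        self.n += 1
        node = self.root
        while node.children:              # descend to the smallest covering range
            node = next(c for c in node.children if c.lo <= x < c.hi)
        node.count += 1
        if node.count > self.split_threshold() and node.hi - node.lo > 1:
            self._split(node)

    def _split(self, node):
        # Assumption: the parent keeps its count; new children start from zero.
        step = max(1, (node.hi - node.lo) // self.branching)
        bounds = list(range(node.lo, node.hi, step))[:self.branching] + [node.hi]
        node.children = [Node(a, b) for a, b in zip(bounds, bounds[1:])]

    def nodes(self):
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children)

# Hypothetical stream: block 5 is hot, then every block executes once.
profiler = RapSketch(range_size=16, epsilon=0.05)
for block_id in [5] * 200 + list(range(16)):
    profiler.add(block_id)

for node in sorted(profiler.nodes(), key=lambda n: (n.lo, -n.hi)):
    kind = "leaf" if not node.children else "split"
    print(f"[{node.lo:2d}, {node.hi:2d})  count={node.count:3d}  {kind}")
```

Because the threshold grows with N, the same count that triggers a split early in the stream no longer does so later, which is one way to read "adaptive to stream size".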

Adaptive Profiling - Merges

[Figure: code as basic blocks BB1 to BBn]

Adaptive Precision Profiling accomplished!

Beware - temporary hot regions!

For example – program initialization phase…

Adaptive Profiling - Splits

[Figure: code as basic blocks BB1 to BBn; the region that was hot during initialization is later marked “No longer hot”]

Adapting to initialization phase …

Note: Something that is hot now – may become cold later

End of initialization phase – precisely captured temporary hot regions

Program continues to execute – overall hot region shifts …

Range adaptive profiler adapts to the new hot region

Adaptive Profiling - Merges

[Figure: code as basic blocks BB1 to BBn; non-hot counters are merged into their parents step by step]

Problem:

Changing hot region - undo unnecessary adaptation

Uses a lot more counters than needed!

Merge ‘non-hot counters’ …

RAP solution:

• Recursive Merges

• Collapse counters to parents

We never throw away profile information, we only merge
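A recursive merge can be sketched as a post-order pass over the counter tree. The slides do not state the exact merge criterion, so the rule used here is an assumption: a parent whose children are all leaves is collapsed when the whole subtree weighs less than a threshold. The `Node` class, the helper names, and the example tree are hypothetical.

```python
class Node:
    """One counter covering the half-open range [lo, hi)."""
    def __init__(self, lo, hi, count=0, children=None):
        self.lo, self.hi, self.count = lo, hi, count
        self.children = children or []

def subtree_total(node):
    return node.count + sum(subtree_total(c) for c in node.children)

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

def merge_cold(node, threshold):
    """Collapse cold subtrees into their parents, bottom-up."""
    for child in node.children:
        merge_cold(child, threshold)
    if node.children and all(not c.children for c in node.children):
        weight = node.count + sum(c.count for c in node.children)
        if weight < threshold:            # assumed criterion, see lead-in
            node.count = weight           # counts are added, never discarded
            node.children = []

# Hypothetical tree left behind after a hot region moved to block 5.
tree = Node(0, 16, 1, [
    Node(0, 8, 1, [
        Node(0, 4, 3, [Node(0, 2, 2), Node(2, 4, 1)]),
        Node(4, 8, 1, [
            Node(4, 6, 1, [Node(4, 5, 1), Node(5, 6, 500)]),
            Node(6, 8, 2),
        ]),
    ]),
    Node(8, 16, 3, [Node(8, 12, 1), Node(12, 16, 4)]),
])

total_before, nodes_before = subtree_total(tree), count_nodes(tree)
merge_cold(tree, threshold=10)
total_after, nodes_after = subtree_total(tree), count_nodes(tree)

print("counters:", nodes_before, "->", nodes_after)
print("total events:", total_before, "->", total_after)
```

The total event count is unchanged by the pass, matching the point that profile information is merged rather than thrown away.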

Range Adaptive Profiling

Advantages

• Precision dynamically adaptive to hot regions

• Guaranteed error bounds

• Optimal usage of a few counters

• Plus -

• Independent of the stream size

• Independent of the stream order

Batched Merges

[Figure: number of counters (0 to 500) versus number of points in billions (0 to 100) for gcc, with each merge cycle marked]

When do we initiate a merge cycle?

• Periodic merging

• Exponentially increasing periods

RAP tree does not grow faster than logarithmic rate
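One way to picture exponentially increasing merge periods is a schedule in which the gap between merge cycles doubles each time. The sketch below is an illustration only: the starting period, the growth factor of 2, and the function name `merge_points` are assumptions, not values taken from the talk.

```python
def merge_points(first_period, total_events, growth=2):
    """Yield the event counts at which a merge cycle would start.

    The gap between consecutive cycles is multiplied by `growth` each time.
    """
    period = first_period
    next_merge = first_period
    while next_merge <= total_events:
        yield next_merge
        period *= growth
        next_merge += period

TOTAL_EVENTS = 100_000
schedule = list(merge_points(first_period=1_000, total_events=TOTAL_EVENTS))
periodic_cycles = TOTAL_EVENTS // 1_000   # fixed period of 1,000 events

print("merge cycles start at:", schedule)
print("cycles with a fixed period:", periodic_cycles)
print("cycles with doubling periods:", len(schedule))
```

Under these assumed numbers a fixed period triggers 100 merge cycles, while the doubling schedule triggers 6, so merge work is batched into far fewer passes as the stream grows.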

Branching Factor

• We show that optimal branching factor is four

[Figure: branching factors 2, 4, and 8 compared; labels: “Less memory”, “More memory”, “Faster Convergence”]
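The tension behind the branching factor can be illustrated with simple arithmetic. This sketch does not reproduce the analysis that leads to four as the optimum. It only shows, for an assumed 32-bit range, that a larger branching factor adds more counters per split but needs fewer levels to isolate a single value, which is presumably what "faster convergence" refers to. The function name `tree_depth` is hypothetical.

```python
import math

def tree_depth(range_bits, branching):
    """Levels needed to narrow a range of 2**range_bits values down to one value."""
    bits_per_level = int(math.log2(branching))   # branching is a power of two
    return -(-range_bits // bits_per_level)      # ceiling division

RANGE_BITS = 32  # assumed: a 32-bit address or value range

for branching in (2, 4, 8):
    depth = tree_depth(RANGE_BITS, branching)
    print(f"branching factor {branching}: {depth} levels, "
          f"{branching} new counters per split")
```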

Results

• Range adaptive profiler

– Online technique

– Does not have ideal knowledge – counts everything

– Error introduced by not splitting early enough

[Figure: percent error (axis 0 to 3) for SPEC benchmarks gcc, gzip, mcf, parser, vortex, vpr, and the average]

Average percent error less than 2%

Results

High accuracy – but at what cost?

[Figure: number of counters (axis 0 to 300) for SPEC benchmarks gcc, bzip2, mcf, vpr, gzip, parser, vortex, and the average]

• On average, 150 counters provide 99% accurate information on code profiles

We show more results in the paper

Software implementation

• Simple set of APIs

– Offline and online profiling

– rap_init

– rap_add_points – builds the RAP tree, takes care of splits and merges too

– rap_finalize

• Webpage

– www.cs.ucsb.edu/~arch/rap

Extremely high throughput profile data analysis …

Hardware Profiling Engine

[Figure: five-stage pipeline, with the stages separated by pipeline latches]

Stage 0: Event Buffer – event name (IP address, memory, PC, value, etc…) and quantity

Stage 1: Range Matching – TCAM cells storing range prefixes: 00*, 0001*, 0000*, 0010*, 000000*, 0011*, 000010*, 000000*, 000011*, 0000001*, . . .

Stage 2: Longest Match – N x 1 Arbiter

Stage 3: Counter Maintenance – counter indices; the counter is incremented (+) and the split threshold subtracted (-) to test whether the result is positive

Stage 4: Split Handling – control block; update pointers on split
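The range matching and longest match stages can be mimicked in software to show what they compute. This is a sequential sketch of the idea, not a model of the hardware: a TCAM compares all entries in parallel, whereas the loop below checks them one by one. The prefixes are taken from the figure (with the repeated entry listed once); the 8-bit event keys and the function names are hypothetical.

```python
def matches(prefix, key_bits):
    """A ternary entry such as '0001*' matches any key that starts with '0001'."""
    return key_bits.startswith(prefix.rstrip("*"))

def longest_match(entries, key_bits):
    """Return the index of the longest matching prefix, or None if nothing matches."""
    best = None
    for index, prefix in enumerate(entries):
        if matches(prefix, key_bits):
            if best is None or len(prefix) > len(entries[best]):
                best = index
    return best

# Range prefixes as shown in the figure, one counter per entry.
entries = ["00*", "0001*", "0000*", "0010*", "000000*",
           "0011*", "000010*", "000011*", "0000001*"]
counters = [0] * len(entries)

# Hypothetical 8-bit event keys.
for key in ["00000011", "00010111", "00101010", "00000011"]:
    hit = longest_match(entries, key)
    if hit is not None:
        counters[hit] += 1      # only the most specific range is updated

for prefix, count in zip(entries, counters):
    print(f"{prefix:10s} {count}")
```

Picking the longest prefix means an event is charged to the narrowest range that currently covers it, which corresponds to the smallest counter in the tree.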

Conclusion

• Range Adaptive Profiling

– Summarizes high bandwidth profile data

– Fully streaming scheme

– Bounded memory and error

– General purpose – high applicability

– Multi-dimensional Profiling

Future Work

• Multi-Dimensional Profiling

– Edges

– Code-mem

– Code-val

– Val-mem

Thank You

Profiling over Adaptive Ranges - http://www.cs.ucsb.edu/~arch/rap

http://www.cs.ucsb.edu/~shashimc
