

Presented by: Shashi Mysore

Profiling over Adaptive Ranges

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri

Department of Computer Science, University of California, Santa Barbara

4th Annual ACM/IEEE International Symposium on Code Generation and Optimization (CGO), 27th March 2006, Manhattan, NY

Shashi Mysore 2

Motivation

Program profiling: understand system-workload interactions (gather data, quantify, analyze, and optimize).

Let us consider an example of code profiling …

• At the core: we need to count events
• Basic blocks, load value distribution, load instructions, load addresses, zero-value loads, narrow-width operands, etc.
• Challenge:
– Huge complex programs
– Limited storage - tiny streaming profilers
– Runtime analysis - feasible hardware solutions


Shashi Mysore 3

An example: Code Profiling

[Figure: a column of x86 code (push %ebp; mov %esp,%ebp; sub $0x38,%esp; and $0xfffffff0,%esp; mov $0x0,%eax; sub %eax,%esp; sub $0x8,%esp; push $0x28; push $0x8048468; call 80482b0; add $0x10,%esp), repeated and divided into basic blocks BB1 … BBn]

Each basic block executes some number of times.

Use counters …

Where are the hot regions? Some are hot, some are not.

How hot are they?

… And can we discover this knowledge at run time?


Shashi Mysore 5

Naïve Approach: Unlimited Counters

N basic blocks, N counters: each counter covers one basic block.

We get:
• High coverage
• High precision

Problem:
• Many programs have 800,000 basic blocks or more!
• But not all of them are important enough to be quantified.

So let’s limit the number of counters …

[Figure: code column BB1 … BBn; legend: counter, coverage of this counter]

Shashi Mysore 6

Naïve Approach: Limited Counters

N basic blocks, K counters: pick K basic blocks and let the K counters cover them.

We get:
• High precision for the hot spots
• Low coverage, but at the right spots

But what if we did …

[Figure: code column BB1 … BBn with counters C1 … CK]


Shashi Mysore 8

Naïve Approach: Limited Counters

N basic blocks, K counters: pick K basic blocks and let the K counters cover them.

We get:
• Low coverage, and at unimportant regions
• High precision, but it is not as useful

Problem:
• We have zero information about hot regions
• How do we know which region of the code to cover with the K counters?

Distribute the basic blocks among the counters …

[Figure: code column BB1 … BBn with counters C1 … CK]


Shashi Mysore 9

Naïve Approach: Uniform Ranges

K counters cover the entire program; each counter counts a range of basic blocks.

We get:
• High coverage with K counters
• Low precision

Problem:
• One counter is associated with a huge set of basic blocks
• Only average behavior is captured, so precision is low
• Precision is important, especially for hot regions

[Figure: code column BB1 … BBn divided evenly among counters C1 … CK]
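The uniform-range scheme can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name, the tiny 16-block program, and the event stream are all made up for the example. It shows how a single hot block is hidden inside the one counter that covers its range.

```python
def uniform_range_profile(events, n_blocks, k):
    """Count events with k counters, each covering an equal-width range of block IDs."""
    width = -(-n_blocks // k)  # ceiling division: blocks per counter
    counters = [0] * k
    for block_id in events:
        counters[block_id // width] += 1
    return counters, width


# Hypothetical stream: block 5 is very hot, every block also runs once.
events = [5] * 1000 + list(range(16))
counters, width = uniform_range_profile(events, n_blocks=16, k=4)
print(width, counters)
```

The counter covering blocks 4 to 7 is clearly the largest, but nothing in the result says which of those four blocks is responsible; that is the precision loss the slide describes.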

Shashi Mysore 10

Related Work

• Profile gathering and analysis schemes
– [Anderson et al., ’97], [Arnold et al., ’01], [Heil and Smith, ’00], [Sastry et al., ’01], [Ball and Larus, ’96], [Calder et al., ’97], [Hirzel and Chilimbi, ’01]
• Hardware-assisted profiling and optimizations
– [Brooks et al., ’99], [Conte et al., ’94, ’96], [Dean et al., ’97], [Narayanasamy et al., ’03], [Zhou et al., ’04], [Zilles and Sohi, ’01], [Nagpurkar et al., ’05], [Mousa et al., ’05]

• High coverage
• High precision
• Limited number of counters
• Covers any stream of profile data
• Low precision information on cold regions
• Divide profile data hierarchically


Shashi Mysore 11

Ideally: Best Ranges

We want:
• High precision for hot regions
• Lower precision information about colder regions
• High coverage by optimal use of a few counters

Challenge: discovering what to count, and with how much precision.

[Figure: code column BB1 … BBn with narrow ranges over hot regions and wide ranges over cold regions]

Shashi Mysore 12

Challenges

Ideal profiler: selects the best possible ranges and decides the precision.

Real problem: identifying the best possible ranges to count before we have even started counting.

What comes first: start counting, or identify ranges?

The Range Adaptive Profiler solves exactly this problem by dynamically identifying ranges as we count.


Shashi Mysore 15

Our Approach: Adaptive Profiling

Initially: one counter covers the entire range.

When the counter is “hot”, split it.

Next: whenever a node is “hot”, split it.

As the program executes: allocate counters towards regions that are hotter, dynamically adapting counter coverage to suit the execution frequency.

At the end of the profiling phase, we get:
• High precision for hot regions
• Lower precision information about colder regions
• High coverage by optimally using a few counters

[Figure: code column BB1 … BBn; counter ranges are split repeatedly over the hot regions]


Shashi Mysore 21

Range Adaptive Profiling

Advantages:

• A streaming (one-pass) technique to hierarchically classify events

• Fixed number of counters – O(log(R) * 1/E)

• Precision adaptive to hot regions

• Guaranteed error bounds

Any stream of profile data that can be divided hierarchically:

• Code profiling

• Values profiling

• Load address profiling

• Zero-value load profiling

• Narrow-width operand profiling

Shashi Mysore 22

Outline

• Program Profiling
– An example: Code profiles
– Related work
• Range Adaptive Profiling
– Advantages and Applications
– Splits
– Merges
• Making it efficient
– Batching merges
– Branching Factor
• RAP implementation
– Results – Quantify error and memory
– Hardware and Software
• Conclusions


Shashi Mysore 23

Adaptive Profiling - Splits

… when to split?

• SplitThreshold = E · N / log(R)
• Adaptive to stream size
• Relative importance crucial
• Bounds error in counting

[Figure: code column BB1 … BBn; hot ranges are split, and the resulting hot sub-ranges are split again]
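The split rule can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not the authors' implementation: it assumes N is the number of events seen so far, R is the size of the value range, E is the error parameter, the logarithm is base 2, and the branching factor is 2. The class and method names are invented for the example.

```python
import math


class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi  # half-open range [lo, hi)
        self.count = 0
        self.children = []


class RangeProfiler:
    def __init__(self, range_size, eps):
        self.root = Node(0, range_size)
        self.log_r = math.log2(range_size)  # assumed base-2 logarithm
        self.eps = eps
        self.n = 0  # events seen so far

    def split_threshold(self):
        # SplitThreshold = E * N / log(R), as on the slide
        return self.eps * self.n / self.log_r

    def add(self, x):
        self.n += 1
        node = self.root
        while node.children:  # descend to the smallest range covering x
            left, right = node.children
            node = left if x < left.hi else right
        node.count += 1
        # A counter that gets "hot" is split; it keeps its own count.
        if node.count > self.split_threshold() and node.hi - node.lo > 1:
            mid = (node.lo + node.hi) // 2
            node.children = [Node(node.lo, mid), Node(mid, node.hi)]

    def leaves(self):
        stack, out = [self.root], []
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(node.children)
            else:
                out.append((node.lo, node.hi))
        return sorted(out)


profiler = RangeProfiler(range_size=16, eps=0.1)
for _ in range(100):
    profiler.add(5)  # hypothetical stream: value 5 is the only hot spot
print(profiler.leaves())
```

Because the threshold scales with N, the very first events split eagerly; in this sketch nothing undoes that, which is the gap the merge step in the talk is meant to close.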


Shashi Mysore 27

Adaptive Profiling - Merges

Adaptive precision profiling accomplished!

Beware of temporary hot regions! For example, the program initialization phase …

[Figure: code column BB1 … BBn]

Shashi Mysore 28

Adaptive Profiling - Splits

Adapting to the initialization phase …

Note: something that is hot now may become cold later.

End of the initialization phase: the temporary hot regions have been precisely captured.

The program continues to execute, and the overall hot region shifts …

The range adaptive profiler adapts to the new hot region; the old region is no longer hot.

[Figure: code column BB1 … BBn; the finely split region moves from the initialization code to the new hot region]

Shashi Mysore 34

Adaptive Profiling - Merges

Problem: the hot region changes, and the profiler now uses a lot more counters than needed!

Merge ‘non-hot counters’ … to undo unnecessary adaptation.

[Figure: code column BB1 … BBn]


Shashi Mysore 35

Adaptive Profiling - Merges

RAP solution:
• Recursive merges
• Collapse counters to parents

We never throw away profile information; we only merge.

[Figure: code column BB1 … BBn; cold sub-ranges are merged step by step back into their parent ranges]
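A recursive merge that collapses cold counters into their parents might look like the following Python sketch. The merge criterion used here (fold a group of leaf children into the parent when their combined count is at or below a threshold) is an assumption for illustration, as are the `Node` class, the example tree, and the threshold value of 50.

```python
class Node:
    def __init__(self, lo, hi, count=0, children=None):
        self.lo, self.hi, self.count = lo, hi, count
        self.children = children or []


def merge(node, threshold):
    """Collapse cold subtrees bottom-up; return the subtree's total count."""
    total = node.count
    for child in node.children:
        total += merge(child, threshold)
    all_leaves = node.children and all(not c.children for c in node.children)
    if all_leaves and sum(c.count for c in node.children) <= threshold:
        node.count = total  # counts move to the parent, nothing is discarded
        node.children = []
    return total


# Hypothetical tree: the left half is cold, the right half is hot.
cold = Node(0, 8, 3, [Node(0, 4, 1), Node(4, 8, 2)])
hot = Node(8, 16, 500, [Node(8, 12, 400), Node(12, 16, 300)])
root = Node(0, 16, 10, [cold, hot])

total_after = merge(root, threshold=50)
print(cold.count, len(cold.children), len(hot.children), total_after)
```

The total count over the tree is the same before and after the pass, which mirrors the slide's point that merging only coarsens information and never throws it away.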


Shashi Mysore 41

Range Adaptive Profiling

Advantages

• Precision dynamically adaptive to hot regions

• Guaranteed error bounds

• Optimal usage of a few counters

• Plus -

• Independent of the stream size

• Independent of the stream order


Shashi Mysore 43

Batched Merges

When do we initiate a merge cycle?
• Periodic merging
• Exponentially increasing periods

The RAP tree does not grow faster than a logarithmic rate.

[Figure: number of counters (0 to 500) versus number of points in billions (0 to 100) for gcc, with six merge cycles marked]
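One way to read "exponentially increasing periods" is that the gap between consecutive merge cycles grows by a constant factor. The Python sketch below illustrates that reading only; the function name, the first period of 1000 events, and the growth factor of 2 are assumptions, not values from the talk.

```python
def merge_points(first_period, growth, total_events):
    """Event counts at which a merge cycle starts, with geometrically growing gaps."""
    points = []
    at = period = first_period
    while at <= total_events:
        points.append(at)
        period *= growth  # each gap is `growth` times the previous one
        at += period
    return points


schedule = merge_points(first_period=1000, growth=2, total_events=100_000)
print(schedule)
```

Under this schedule the number of merge cycles grows only logarithmically with the stream length, so the cost of merging is amortized over ever longer stretches of counting.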

Shashi Mysore 44

Branching Factor

• We show that the optimal branching factor is four

[Figure: branching factors 2, 4, and 8 compared on memory use (less to more) and on speed of convergence]


Shashi Mysore 46

Results

• Range adaptive profiler
– Online technique
– Does not have ideal knowledge – counts everything
– Error introduced by not splitting early enough

Average percent error is less than 2%.

[Figure: percent error (0 to 3) for the SPEC benchmarks gcc, gzip, mcf, parser, vortex, vpr, and their average]


Shashi Mysore 47

Results

High accuracy – but at what cost?

• On average, 150 counters provide 99% accurate information on code profiles

[Figure: number of counters (0 to 300) for the SPEC benchmarks gcc, bzip2, mcf, vpr, gzip, parser, vortex, and their average]

We show more results in the paper.

Shashi Mysore 48

Software implementation

• Simple set of APIs

– Offline and online profiling

– rap_init

– rap_add_points – builds the RAP tree and takes care of splits and merges too

– rap_finalize

• Webpage

– www.cs.ucsb.edu/~arch/rap

Extremely high throughput profile data analysis …


Shashi Mysore 49

Hardware Profiling Engine

[Figure: pipelined profiling engine, with pipeline latches between stages]
• Stage 0: Event Buffer. Holds the event name (IP address, memory, PC, value, etc.) and a quantity.
• Stage 1: Range Matching. TCAM cells hold range prefixes such as 00*, 0001*, 0000*, 0010*, 000000*, 0011*, 000010*, 000011*, 0000001*, …
• Stage 2: Longest Match.
• Stage 3: Counter Maintenance. Counter indices select a counter, which is incremented (+) and compared against the split threshold (subtract, then check: positive?).
• Stage 4: Split Handling. Update pointers on split.
• Other labeled components: control block, N x 1 arbiter.
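The range-matching and longest-match stages can be mimicked in software. The Python sketch below is a behavioral illustration only, not a model of the actual TCAM hardware: the function names are invented, the prefixes are a subset of those legible on the slide, and the 8-bit event keys are made up.

```python
def matches(pattern, key_bits):
    """True if a ternary prefix such as '0001*' matches the key's leading bits."""
    return key_bits.startswith(pattern.rstrip("*"))


def longest_match(patterns, key_bits):
    """Index of the longest matching prefix, or None if nothing matches."""
    best, best_len = None, -1
    for i, pattern in enumerate(patterns):
        plen = len(pattern.rstrip("*"))
        if matches(pattern, key_bits) and plen > best_len:
            best, best_len = i, plen
    return best


patterns = ["00*", "0001*", "0000*", "0010*", "000000*",
            "0011*", "000010*", "000011*", "0000001*"]
counters = [0] * len(patterns)

# Hypothetical 8-bit event keys.
for key in ["00000010", "00011111", "00101010"]:
    idx = longest_match(patterns, key)
    if idx is not None:
        counters[idx] += 1  # only the most specific range is incremented
print(counters)
```

In a TCAM all entries are compared in parallel and the longest match is chosen by priority logic; the loop here only reproduces the result, not the timing.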

Shashi Mysore 50

Conclusion

• Range Adaptive Profiling

– Summarizes high bandwidth profile data

– Fully streaming scheme

– Bounded memory and error

– General purpose – high applicability

– Multi-dimensional Profiling


Shashi Mysore 51

Future Work

• Multi-dimensional profiling
– Edges
– Code-mem
– Code-val
– Val-mem

Shashi Mysore 52

Thank You

Profiling over Adaptive Ranges -

http://www.cs.ucsb.edu/~arch/rap

http://www.cs.ucsb.edu/~shashimc