
1

Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications

Hyoun Kyu Cho1, Tipp Moseley2, Richard Hank2,

Derek Bruening2, Scott Mahlke1

1University of Michigan 2Google

2

Datacenter Applications

• In 2010, US datacenters consumed 70 to 90 billion kWh*

• Datacenter application performance is critical

• Profiling can help

http://googleblog.blogspot.com

*[Koomey`11]

3

Challenges for Datacenters

• Need to run on live traffic

• Difficult to isolate

• Overheads
• Value profiling: 3.8x slowdown [1]
• Path profiling: 31%, edge profiling: 16% [2]

• Binary management
• Many programs, multiple versions

Traditional Profiling

[Diagram: Source Code → Instrumentation Build → Instrumented Binary; Instrumented Binary + Input Data → Training Run → Profile Data]

[1] [Calder`99] [2] [Ball`96]

4

Google-Wide Profiling [Ren et al.`10]

Continuous profiling infrastructure for datacenters

Negligible overhead
• Sampling based
• Aggregated profiling overhead less than 0.01%

Limitations
• Relies heavily on Performance Monitoring Units
• Limited flexibility and portability

5

Goals

Unified profiling infrastructure for datacenters
• Flexible types of profile data
• Portable across heterogeneous datacenters

While maintaining
• Low overhead
• No added burden on binary management

Sampling + Dynamic Binary Instrumentation

6

Instrumentation Sampling

[Diagram in three animation steps]
• Step 1: application, operating system, hardware, with the system call gateway between application and operating system
• Step 2: DynamoRIO [Bruening`04] added under the application: dispatch, instrumentation engine, client, and code cache, with a context switch between DynamoRIO and the code cache
• Step 3: a shepherding thread added, issuing "start profiling" and "stop profiling" to move execution into and out of the code cache
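The start/stop behaviour of the shepherding thread can be illustrated with a small logical-time simulation. This is only a sketch under stated assumptions: `SampledApp`, `profile_ms`, and `period_ms` are hypothetical names, one loop iteration stands in for one millisecond, and the shepherding thread is reduced to a function that flips a flag. It does not use DynamoRIO.

```python
from dataclasses import dataclass, field


@dataclass
class SampledApp:
    """Toy application that alternates between native and instrumented mode."""
    profile_ms: int                 # assumed length of each profiling burst
    period_ms: int                  # assumed time between burst starts
    profiling: bool = False
    samples: list = field(default_factory=list)

    def shepherd_tick(self, now_ms):
        # Stand-in for the shepherding thread: "start profiling" at the
        # beginning of each period, "stop profiling" after profile_ms.
        self.profiling = (now_ms % self.period_ms) < self.profile_ms

    def run(self, total_ms):
        for now in range(total_ms):
            self.shepherd_tick(now)
            if self.profiling:
                # Instrumented execution: the profiling client records data.
                self.samples.append(now)
            # Otherwise the application runs natively and records nothing.
        return self.samples
```

With a 2 ms burst every 10 ms, only ticks at the start of each period are instrumented; the rest of the run pays no instrumentation cost in this model.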

7

Problems with Basic Implementation

• Unbounded profiling periods due to fragment linking
• Latency degradation due to initial instrumentation
• Multi-threaded programs

[Diagram label: code cache]

8

Temporal Unlinking/Relinking of Fragments

[Diagram: fragments BB1 and BB2, the BB2->BB1 link, dispatch, and the context switch back to dispatch]
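Why linked fragments can make a profiling period unbounded, and why removing the links bounds it, can be sketched with a toy model. The assumptions here are mine, not the slides': `CodeCache` and `steps_until_stop` are hypothetical, a "stop profiling" request is only noticed when control returns to the dispatcher, and a directly linked fragment jumps to its successor without visiting the dispatcher.

```python
class CodeCache:
    """Toy code cache: maps each fragment to a directly linked successor."""

    def __init__(self):
        self.links = {}

    def link(self, src, dst):
        self.links[src] = dst

    def unlink_all(self):
        """Remove every direct link and return them so they can be restored."""
        saved, self.links = self.links, {}
        return saved

    def relink(self, saved):
        self.links.update(saved)


def steps_until_stop(cache, successor, start, stop_requested_at, max_steps=1000):
    """Return the step at which the dispatcher sees the stop request, or None."""
    frag = start
    for step in range(max_steps):
        if frag in cache.links:
            # Linked exit: jump straight to the next fragment, no dispatcher.
            frag = cache.links[frag]
            continue
        # Unlinked exit: context switch to the dispatcher, which can stop.
        if step >= stop_requested_at:
            return step
        frag = successor[frag]
    return None
```

In this model a linked BB1/BB2 loop never reaches the dispatcher, while the same loop with links removed stops at the first fragment exit after the request; `relink` restores the saved links afterwards.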

9

S/W Code Cache Pre-population

• Initial instrumentation phases still suffer latency degradation

[Diagram: hardware, operating system, application, shepherding thread, dispatch, instrumentation engine, client, code cache]

10

Multithreaded Program Support

• Sampling makes it possible to miss thread operations
• Forces Instant Profiling's signal handler for every thread
• Enumerates all threads and sends a profiling start signal to each thread
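The "enumerate all threads and notify each one" step can be sketched in Python. This is an analogy rather than the slides' mechanism: `ProfiledThread` and `start_profiling_all` are hypothetical names, and a per-thread `threading.Event` stands in for a per-thread signal handler, since the point is only that every live thread must be reached, including ones the profiler did not see being created.

```python
import threading


class ProfiledThread(threading.Thread):
    """Worker with a per-thread flag standing in for a signal handler."""

    def __init__(self, release):
        super().__init__(daemon=True)
        self.profiling = threading.Event()   # set when "start profiling" arrives
        self.release = release               # lets the worker exit

    def run(self):
        # Placeholder workload: block until the caller releases the worker.
        self.release.wait()


def start_profiling_all():
    """Enumerate live threads and deliver a start notification to each worker."""
    notified = []
    for thread in threading.enumerate():
        if isinstance(thread, ProfiledThread):
            thread.profiling.set()
            notified.append(thread.name)
    return notified
```

Because the loop walks the live thread list at the moment profiling starts, it does not depend on having observed earlier thread-creation events.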

11

Experimental Setup

• 6-core Intel Xeon 2.67GHz w/ 12MB L3
• 12GB main memory
• Linux kernel 2.6.32
• gcc 4.4.3 w/ -O3
• SPEC INT2006, BigTable, Web search
• Edge profiling client

12

Naïve Edge Profiling

[Bar chart. y-axis: Slowdown (0 to 50). x-axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean]
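For readers unfamiliar with edge profiling, the idea is to count how often control transfers from one basic block to another. The sketch below works on a recorded block trace; `edge_profile` and the block names are hypothetical, and a real client would update counters from instrumentation inserted at block exits rather than from a stored trace.

```python
from collections import Counter


def edge_profile(block_trace):
    """Count each (source block, destination block) transition in a trace."""
    # Pair every block with the block executed immediately after it.
    return Counter(zip(block_trace, block_trace[1:]))


# Example: a short loop between A and B that finally exits to C.
profile = edge_profile(["A", "B", "A", "B", "C"])
```

Updating a counter on every block transition is what makes the unsampled version of this client expensive.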

13

Profiling Overhead

[Bar chart. y-axis: Normalized Execution Time (0.90 to 1.30). x-axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean. Series: 2ms/4s, 1ms/1s, 2ms/1s, 4ms/1s, 2ms/250ms]
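A back-of-the-envelope model helps relate the sampling configurations to overhead. It rests on two assumptions that the slides do not state: that a label such as "2ms/4s" means 2 ms of instrumented execution in every 4 s period, and that overhead scales linearly with the instrumented fraction of time. The function names and the 10x slowdown in the usage note are illustrative, not measured values.

```python
# Assumed reading of the series labels: (profiling burst in ms, period in ms).
CONFIGS = {
    "2ms/4s": (2, 4000),
    "1ms/1s": (1, 1000),
    "2ms/1s": (2, 1000),
    "4ms/1s": (4, 1000),
    "2ms/250ms": (2, 250),
}


def duty_cycle(profile_ms, period_ms):
    """Fraction of wall-clock time spent in instrumented execution."""
    return profile_ms / period_ms


def modeled_normalized_time(profile_ms, period_ms, instrumented_slowdown):
    """Linear model: native time plus the extra cost of instrumented bursts.

    Ignores mode-switch and code-cache warm-up costs, so it is a lower bound
    on what a real system would show.
    """
    d = duty_cycle(profile_ms, period_ms)
    return 1.0 + d * (instrumented_slowdown - 1.0)
```

Under this model, a hypothetical 10x instrumented slowdown at "2ms/250ms" gives 1 + 0.008 × 9 = 1.072. Measured overhead can be higher because switching modes and building instrumented code also cost time.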

14

S/W Code Cache Pre-population

[Line chart. y-axis: Cumulative Number of Samples (0 to 3,500,000). x-axis: Sampling Phases (0 to 9). Series: w/ pre-population, w/o pre-population]

15

Profiling Accuracy

[Bar chart. y-axis: Profiling Accuracy (0 to 100). x-axis: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean. Series: 2ms/4s, 1ms/1s, 2ms/1s, 4ms/1s, 2ms/250ms]
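The slides do not say how accuracy is computed. One common way to compare a sampled edge profile against a full one is an overlap percentage: normalize both profiles to fractions of total count and sum the per-edge minimum. The sketch below shows that metric purely as an illustration; `overlap_accuracy` is a hypothetical name and this is not necessarily the metric behind the chart.

```python
def overlap_accuracy(full_profile, sampled_profile):
    """Overlap percentage between two edge-count profiles (dict: edge -> count)."""
    full_total = sum(full_profile.values())
    sampled_total = sum(sampled_profile.values())
    if full_total == 0 or sampled_total == 0:
        return 0.0
    overlap = 0.0
    for edge, count in full_profile.items():
        full_share = count / full_total
        sampled_share = sampled_profile.get(edge, 0) / sampled_total
        # An edge contributes only the share that both profiles agree on.
        overlap += min(full_share, sampled_share)
    return 100.0 * overlap
```

A sampled profile with the same relative edge frequencies as the full profile scores 100 regardless of how few samples it contains, while a profile that misses a hot edge loses that edge's share.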

16

Asymptotic Accuracy

[Line chart. y-axis: Cumulative Accuracy (0 to 100). x-axis: Sampling Phases (0 to 140). Series: bigtable, web search]
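The intuition behind accuracy improving as phases accumulate can be shown with a simple proxy: merge the edges observed in each sampling phase and track what fraction of the known edges has been seen so far. This is a sketch with hypothetical names (`cumulative_coverage`, the edge tuples); coverage is a simpler quantity than the accuracy plotted on the slide and is used here only because it is easy to reason about.

```python
def cumulative_coverage(phases, known_edges):
    """Percent of known edges observed after each sampling phase.

    phases: list of iterables of edges seen in each phase.
    known_edges: set of all edges in the reference (full) profile.
    """
    seen = set()
    coverage = []
    for phase in phases:
        # Merge this phase's observations into the running profile.
        seen.update(edge for edge in phase if edge in known_edges)
        coverage.append(100.0 * len(seen) / len(known_edges))
    return coverage
```

The resulting series never decreases, since each phase can only add edges to the merged profile.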

17

Conclusion

Low-overhead, portable, flexible profiling is needed

Instant Profiling
• Combines sampling and DBI
• Pre-populates S/W code cache
• Tunable tradeoff between overhead and information
• Provides eventual profiling accuracy

Less than 5% overhead and more than 80% accuracy for the naïve edge profiling client

18

Thank you!