Shadow Profiling: Hiding Instrumentation Costs with Parallelism

Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado)
Ramesh Peri (Intel Corporation)
Motivation

An ideal profiler will…
1. Collect arbitrarily detailed and abundant information
2. Incur negligible overhead

A real profiler built with Pin satisfies condition 1, but the cost is high:
- 3X for basic-block (BBL) counting
- 25X for loop profiling
- 50X or higher for memory profiling

A real profiler using PMU sampling or code patching satisfies condition 2, but the detail is very coarse.
Motivation

[Diagram: existing profilers plotted by overhead vs. detail]
- Low overhead, coarse detail: VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes, …
- High detail, high overhead: Pintools, Valgrind, ATOM, …
- Low overhead and high detail: "Bursty Tracing" (sampled instrumentation), novel hardware, Shadow Profiling
Goal

To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead, enabling developers to focus on other things.
The Big Idea

The idea stems from fault-tolerance work on deterministic replication: periodically fork(), and profile the "shadow" processes while the original runs natively.
Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
0    | Orig. Slice 0 | Slice 0 |         |
1    | Orig. Slice 1 | Slice 0 | Slice 1 |
2    | Orig. Slice 2 | Slice 0 | Slice 1 | Slice 2
3    | Orig. Slice 3 | Slice 3 | Slice 1 | Slice 2
4    | Orig. Slice 4 | Slice 3 | Slice 4 | Slice 2
5    |               | Slice 3 | Slice 4 |
6    |               |         | Slice 4 |

[Diagram: the original execution runs on CPU 0; each fork() launches a shadow slice (Slice 0 through Slice 5) on another CPU]
* Assuming instrumentation overhead of 3X
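The fork()-based slicing above can be sketched in plain C. This is a minimal illustration with hypothetical helper names (`spawn_shadow`, `profile_slice`); the real system switches the child into Pin's JIT mode instead of the placeholder loop used here, and the monitor does not block on the child.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for running one instrumented slice; a real shadow
 * profiler would switch Pin into JIT mode and profile n instructions. */
static int profile_slice(long n_instructions) {
    long executed = 0;
    for (long i = 0; i < n_instructions; i++)
        executed++;                        /* stand-in "instrumented" work */
    return executed == n_instructions ? 0 : 1;
}

/* Fork a shadow for the current slice: the child profiles and exits
 * without side-effecting the system; the parent keeps running. */
int spawn_shadow(long n_instructions) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(profile_slice(n_instructions));   /* shadow process */
    int status;
    waitpid(pid, &status, 0);   /* demo only: a real monitor would not block */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```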
Challenges

- Threads
- Shared memory
- Asynchronous interrupts
- System calls
- JIT overhead
- Overhead vs. number of CPUs: the maximum speedup is the number of CPUs, so if profiler overhead is 50X, at least 51 CPUs are needed to keep up in real time (probably many more)

There are too many complications to ensure deterministic replication.
Goal (Revised)

To create a profiler capable of sampling detailed traces (bursts) with negligible overhead: trade abundance for low overhead.
Like SimPoints or SMARTS (but not as smart :)
The Big Idea (Revised)

Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
0    | Orig. Slice 0 | Slice 0 |         | Spyware
1    | Orig. Slice 1 | Slice 0 |         | Spyware
2    | Orig. Slice 2 | Slice 0 | Slice 1 | Spyware
3    | Orig. Slice 3 |         | Slice 1 | Spyware
4    | Orig. Slice 4 |         | Slice 1 | Spyware

Do not strive for a full, deterministic replica; instead, profile many short, mostly deterministic bursts:
- Profile a fixed number of instructions
- "Fake it" for system calls
- The shadow must not be allowed to side-effect the system

[Diagram: original execution on CPU 0; a fork() launches each of Slice 0 and Slice 1]
Design Overview

[Diagram: the Monitor attaches to the original application; each fork() creates a shadow application running under profiling instrumentation]
Design Overview

- The monitor uses Pin Probes (code patching); the application runs natively
- The monitor receives a periodic timer signal and decides when to fork()
- After fork(), the child uses the PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode
- The shadow process profiles as usual, except for the handling of special cases
- The monitor logs special read() system calls and pipes the results to shadow processes
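The last point, logging read() results for the shadows, can be sketched as a tee: the bytes the original application consumes are copied into a log pipe so a shadow can replay the identical read(). This is a simplified illustration with hypothetical names (`logged_read`, `logged_read_demo`), not the monitor's actual implementation.

```c
#include <string.h>
#include <unistd.h>

/* Perform a read() on behalf of the original application and tee the
 * bytes into a log pipe so a shadow can replay the same result. */
ssize_t logged_read(int fd, void *buf, size_t len, int log_fd) {
    ssize_t n = read(fd, buf, len);
    if (n > 0)
        write(log_fd, buf, (size_t)n);   /* copy for the shadow to consume */
    return n;
}

/* Demo: the "original" reads from a pipe, and the same bytes become
 * available on the log pipe, exactly as a shadow would see them. */
int logged_read_demo(void) {
    int data[2], log[2];
    if (pipe(data) || pipe(log)) return 0;
    write(data[1], "abc", 3);
    char buf[4] = {0}, replay[4] = {0};
    logged_read(data[0], buf, 3, log[1]);
    read(log[0], replay, 3);             /* shadow replays the read() */
    return memcmp(buf, replay, 3) == 0 && memcmp(buf, "abc", 3) == 0;
}
```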
System Calls

For SPEC CPU2000, system calls occur around 35 times per second; forking after each one puts heavy pressure on copy-on-write pages and the Pin JIT engine. 95% of dynamic system calls can be safely handled:

- Some can be allowed to execute (49%): getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …
- Some can be replaced, with success assumed (39%): write, ftruncate, writev, unlink, rename, …
- Some are handled specially, but execution may continue (1.8%): mmap2, open(creat), mmap, mprotect, mremap, fcntl
- read() is special (5.4%): for reads from pipes/sockets, the data must be logged from the original application; for reads from files, the file must be closed and reopened after the fork(), because the OS file offset is shared with the original rather than duplicated
- ioctl() is special (4.8%): it is frequent in perlbmk, but its behavior is device-dependent, so the safest action is to simply terminate the segment and re-fork()
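The read()-from-file fix can be sketched as follows. After fork(), parent and child share one open file description, so the parent's subsequent reads would move the shadow's offset too; the shadow therefore reopens the file to get a private offset, restored to where the original left off. The helper names (`privatize_fd`, `privatize_demo`) and the temp-file path are illustrative, not from the paper.

```c
#include <fcntl.h>
#include <unistd.h>

/* Give the shadow its own open file description at the same offset. */
int privatize_fd(int fd, const char *path) {
    off_t off = lseek(fd, 0, SEEK_CUR);   /* remember current position */
    if (off < 0) return -1;
    close(fd);                            /* drop the shared description */
    int newfd = open(path, O_RDONLY);     /* fresh, private description */
    if (newfd < 0) return -1;
    if (lseek(newfd, off, SEEK_SET) != off) { close(newfd); return -1; }
    return newfd;
}

/* Demo: read 3 bytes, privatize, and report the preserved offset. */
long privatize_demo(void) {
    const char *path = "/tmp/shadow_fd_demo.txt";
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) return -1;
    write(fd, "hello", 5);
    lseek(fd, 0, SEEK_SET);
    char buf[3];
    read(fd, buf, sizeof buf);            /* offset is now 3 */
    int nf = privatize_fd(fd, path);
    if (nf < 0) return -1;
    long off = (long)lseek(nf, 0, SEEK_CUR);
    close(nf);
    unlink(path);
    return off;
}
```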
Other Issues

- Shared memory: disallow writes to shared memory
- Asynchronous interrupts (userspace signals): since replication is only mostly deterministic, these are no longer an issue; when the main program receives a signal, pass it along to live children
- JIT overhead: after each fork(), it is like Pinning a new program, and warmup is too slow; use persistent code caching [CGO'07]
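The signal rule, "pass it along to live children", amounts to relaying the signal from the monitored process's handler to every live shadow with kill(). A minimal sketch, with illustrative names (`relay`, `relay_demo`) and a fixed-size child table:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_SHADOWS 8
static pid_t shadows[MAX_SHADOWS];
static int n_shadows;

/* Relay a signal received by the monitored process to every live
 * shadow child. */
static void relay(int sig) {
    for (int i = 0; i < n_shadows; i++)
        if (shadows[i] > 0)
            kill(shadows[i], sig);
}

/* Demo: fork a "shadow" that idles, relay SIGTERM to it, and check
 * that the shadow was terminated by that signal. */
int relay_demo(void) {
    pid_t pid = fork();
    if (pid == 0) {
        pause();                       /* shadow idles until signaled */
        _exit(0);
    }
    shadows[n_shadows++] = pid;
    relay(SIGTERM);                    /* default action terminates it */
    int status;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;
}
```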
Multithreaded Programs

Issue: fork() does not duplicate all threads, only the thread that called fork().

Solution:
1. Barrier all threads in the program and store their CPU state
2. Fork the process and clone new threads for those that were destroyed (the address space is identical; only register state was really 'lost')
3. In each new thread, restore the previous CPU state (via modified clone() handling in the Pin VM)
4. Continue execution, virtualizing thread IDs for relevant system calls
Tuning Overhead

- Load: the number of active shadow processes; tested 0.125, 0.25, 0.5, 1.0, 2.0
- Sample size: the number of instructions to profile; longer samples give less overhead and more data, while shorter samples give more evenly dispersed data; tested 1M, 10M, 100M
Experiments

- Value profiling: typical overhead ~100X; accuracy measured by difference in invariance
- Path profiling: typical overhead 50%-10X; accuracy measured by the percentage of hot paths detected (2% threshold)

All experiments use the SPEC2000 INT benchmarks with the "ref" data set; the arithmetic mean of 3 runs is presented.
Results - Value Profiling Overhead

Overhead versus native execution; several configurations are under 1%. Path profiling exhibits similar trends.

[Chart: value profiling overhead (0%-9%) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Results - Value Profiling Accuracy

All configurations are within 7% of a perfect profile (lower is better).

[Chart: difference in invariance (0%-8%) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Results - Path Profiling Accuracy

Most configurations are over 90% accurate (higher is better). Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are "hot".

[Chart: path profiling accuracy (78%-96%) vs. load (0.06-2) for sample sizes 1M, 10M, and 100M]
Results - Page Fault Increase

Proportional increase in page faults (shadow/native).

[Chart: increase in page faults (0-200X) vs. load (0.06-2) for sample sizes 1M, 10M, and 100M]
Results - Page Fault Rate

Difference in page faults per second experienced by the native application.

[Chart: paging-rate increase (0-12,000 faults/s) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Future Work

- Improve stability for multithreaded programs
- Investigate the effects of different persistent code cache policies
- Compare sampling policies: random (current), phase/event-based, static analysis; study convergence
- Apply the technique to profile-guided optimizations and simulation techniques
Conclusion

- Shadow Profiling allows the collection of bursts of detailed traces; accuracy is over 90%
- It incurs negligible overhead, often less than 1%
- With increasing numbers of cores, it allows developers' focus to shift from profiling to applying optimizations