Shadow Profiling: Hiding Instrumentation Costs with Parallelism

Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado)
Ramesh Peri (Intel Corporation)
Motivation

An ideal profiler will…
1. Collect arbitrarily detailed and abundant information
2. Incur negligible overhead

A real profiler built with Pin satisfies condition 1, but the cost is high:
- 3X for basic-block (BBL) counting
- 25X for loop profiling
- 50X or higher for memory profiling

A real profiler using PMU sampling or code patching satisfies condition 2, but the detail is very coarse.
Motivation

[Diagram: existing profilers plotted by overhead vs. detail]
- Low overhead, coarse detail: VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes, …
- High detail, high overhead: Pintools, Valgrind, ATOM, …
- Low overhead and high detail: "Bursty Tracing" (sampled instrumentation), novel hardware, Shadow Profiling
Goal

To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead, enabling developers to focus on other things.
The Big Idea

The idea stems from fault-tolerance work on deterministic replication: periodically fork(), and profile the "shadow" processes while the original runs natively.
Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
0    | Orig. Slice 0 | Slice 0 |         |
1    | Orig. Slice 1 | Slice 0 | Slice 1 |
2    | Orig. Slice 2 | Slice 0 | Slice 1 | Slice 2
3    | Orig. Slice 3 | Slice 3 | Slice 1 | Slice 2
4    | Orig. Slice 4 | Slice 3 | Slice 4 | Slice 2
5    |               | Slice 3 | Slice 4 |
6    |               |         | Slice 4 |

[Diagram: the original execution runs on CPU 0; each fork() launches a shadow slice (Slice 0 through Slice 5) on another CPU]
* Assuming instrumentation overhead of 3X
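The fork()-based slicing above can be sketched in plain C. This is a minimal illustration with hypothetical helper names (`spawn_shadow`, `profile_slice`); the real system switches the child into Pin's JIT mode instead of the placeholder loop used here, and the monitor does not block on the child.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for running one instrumented slice; a real shadow
 * profiler would switch Pin into JIT mode and profile n instructions. */
static int profile_slice(long n_instructions) {
    long executed = 0;
    for (long i = 0; i < n_instructions; i++)
        executed++;                        /* stand-in "instrumented" work */
    return executed == n_instructions ? 0 : 1;
}

/* Fork a shadow for the current slice: the child profiles and exits
 * without side-effecting the system; the parent keeps running. */
int spawn_shadow(long n_instructions) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(profile_slice(n_instructions));   /* shadow process */
    int status;
    waitpid(pid, &status, 0);   /* demo only: a real monitor would not block */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```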
Challenges

- Threads
- Shared memory
- Asynchronous interrupts
- System calls
- JIT overhead
- Overhead vs. number of CPUs: the maximum speedup is the number of CPUs, so if profiler overhead is 50X, at least 51 CPUs are needed to keep up in real time (probably many more)

There are too many complications to ensure deterministic replication.
Goal (Revised)

To create a profiler capable of sampling detailed traces (bursts) with negligible overhead: trade abundance for low overhead.
Like SimPoints or SMARTS (but not as smart :)
The Big Idea (Revised)

Time | CPU 0         | CPU 1   | CPU 2   | CPU 3
0    | Orig. Slice 0 | Slice 0 |         | Spyware
1    | Orig. Slice 1 | Slice 0 |         | Spyware
2    | Orig. Slice 2 | Slice 0 | Slice 1 | Spyware
3    | Orig. Slice 3 |         | Slice 1 | Spyware
4    | Orig. Slice 4 |         | Slice 1 | Spyware

Do not strive for a full, deterministic replica; instead, profile many short, mostly deterministic bursts:
- Profile a fixed number of instructions
- "Fake it" for system calls
- The shadow must not be allowed to side-effect the system

[Diagram: original execution on CPU 0; a fork() launches each of Slice 0 and Slice 1]
Design Overview

[Diagram: the Monitor attaches to the original application; each fork() creates a shadow application running under profiling instrumentation]
Design Overview

- The monitor uses Pin Probes (code patching); the application runs natively
- The monitor receives a periodic timer signal and decides when to fork()
- After fork(), the child uses the PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode
- The shadow process profiles as usual, except for the handling of special cases
- The monitor logs special read() system calls and pipes the results to shadow processes
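The last point, logging read() results for the shadows, can be sketched as a tee: the bytes the original application consumes are copied into a log pipe so a shadow can replay the identical read(). This is a simplified illustration with hypothetical names (`logged_read`, `logged_read_demo`), not the monitor's actual implementation.

```c
#include <string.h>
#include <unistd.h>

/* Perform a read() on behalf of the original application and tee the
 * bytes into a log pipe so a shadow can replay the same result. */
ssize_t logged_read(int fd, void *buf, size_t len, int log_fd) {
    ssize_t n = read(fd, buf, len);
    if (n > 0)
        write(log_fd, buf, (size_t)n);   /* copy for the shadow to consume */
    return n;
}

/* Demo: the "original" reads from a pipe, and the same bytes become
 * available on the log pipe, exactly as a shadow would see them. */
int logged_read_demo(void) {
    int data[2], log[2];
    if (pipe(data) || pipe(log)) return 0;
    write(data[1], "abc", 3);
    char buf[4] = {0}, replay[4] = {0};
    logged_read(data[0], buf, 3, log[1]);
    read(log[0], replay, 3);             /* shadow replays the read() */
    return memcmp(buf, replay, 3) == 0 && memcmp(buf, "abc", 3) == 0;
}
```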
System Calls

For SPEC CPU2000, system calls occur around 35 times per second; forking after each one puts heavy pressure on copy-on-write pages and the Pin JIT engine. 95% of dynamic system calls can be safely handled:

- Some can be allowed to execute (49%): getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …
- Some can be replaced, with success assumed (39%): write, ftruncate, writev, unlink, rename, …
- Some are handled specially, but execution may continue (1.8%): mmap2, open(creat), mmap, mprotect, mremap, fcntl
- read() is special (5.4%): for reads from pipes/sockets, the data must be logged from the original application; for reads from files, the file must be closed and reopened after the fork(), because the OS file offset is shared with the original rather than duplicated
- ioctl() is special (4.8%): it is frequent in perlbmk, but its behavior is device-dependent, so the safest action is to simply terminate the segment and re-fork()
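The read()-from-file fix can be sketched as follows. After fork(), parent and child share one open file description, so the parent's subsequent reads would move the shadow's offset too; the shadow therefore reopens the file to get a private offset, restored to where the original left off. The helper names (`privatize_fd`, `privatize_demo`) and the temp-file path are illustrative, not from the paper.

```c
#include <fcntl.h>
#include <unistd.h>

/* Give the shadow its own open file description at the same offset. */
int privatize_fd(int fd, const char *path) {
    off_t off = lseek(fd, 0, SEEK_CUR);   /* remember current position */
    if (off < 0) return -1;
    close(fd);                            /* drop the shared description */
    int newfd = open(path, O_RDONLY);     /* fresh, private description */
    if (newfd < 0) return -1;
    if (lseek(newfd, off, SEEK_SET) != off) { close(newfd); return -1; }
    return newfd;
}

/* Demo: read 3 bytes, privatize, and report the preserved offset. */
long privatize_demo(void) {
    const char *path = "/tmp/shadow_fd_demo.txt";
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) return -1;
    write(fd, "hello", 5);
    lseek(fd, 0, SEEK_SET);
    char buf[3];
    read(fd, buf, sizeof buf);            /* offset is now 3 */
    int nf = privatize_fd(fd, path);
    if (nf < 0) return -1;
    long off = (long)lseek(nf, 0, SEEK_CUR);
    close(nf);
    unlink(path);
    return off;
}
```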
Other Issues

- Shared memory: disallow writes to shared memory
- Asynchronous interrupts (userspace signals): since replication is only mostly deterministic, these are no longer an issue; when the main program receives a signal, pass it along to live children
- JIT overhead: after each fork(), it is like Pinning a new program, and warmup is too slow; use persistent code caching [CGO'07]
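The signal rule, "pass it along to live children", amounts to relaying the signal from the monitored process's handler to every live shadow with kill(). A minimal sketch, with illustrative names (`relay`, `relay_demo`) and a fixed-size child table:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_SHADOWS 8
static pid_t shadows[MAX_SHADOWS];
static int n_shadows;

/* Relay a signal received by the monitored process to every live
 * shadow child. */
static void relay(int sig) {
    for (int i = 0; i < n_shadows; i++)
        if (shadows[i] > 0)
            kill(shadows[i], sig);
}

/* Demo: fork a "shadow" that idles, relay SIGTERM to it, and check
 * that the shadow was terminated by that signal. */
int relay_demo(void) {
    pid_t pid = fork();
    if (pid == 0) {
        pause();                       /* shadow idles until signaled */
        _exit(0);
    }
    shadows[n_shadows++] = pid;
    relay(SIGTERM);                    /* default action terminates it */
    int status;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;
}
```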
Multithreaded Programs

Issue: fork() does not duplicate all threads, only the thread that called fork().

Solution:
1. Barrier all threads in the program and store their CPU state
2. Fork the process and clone new threads for those that were destroyed (the address space is identical; only register state was really 'lost')
3. In each new thread, restore the previous CPU state (via modified clone() handling in the Pin VM)
4. Continue execution, virtualizing thread IDs for relevant system calls
Tuning Overhead

- Load: the number of active shadow processes; tested 0.125, 0.25, 0.5, 1.0, 2.0
- Sample size: the number of instructions to profile; longer samples give less overhead and more data, while shorter samples give more evenly dispersed data; tested 1M, 10M, 100M
Experiments

- Value profiling: typical overhead ~100X; accuracy measured by difference in invariance
- Path profiling: typical overhead 50%-10X; accuracy measured by the percentage of hot paths detected (2% threshold)

All experiments use the SPEC2000 INT benchmarks with the "ref" data set; the arithmetic mean of 3 runs is presented.
Results - Value Profiling Overhead

Overhead versus native execution; several configurations are under 1%. Path profiling exhibits similar trends.

[Chart: value profiling overhead (0%-9%) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Results - Value Profiling Accuracy

All configurations are within 7% of a perfect profile (lower is better).

[Chart: difference in invariance (0%-8%) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Results - Path Profiling Accuracy

Most configurations are over 90% accurate (higher is better). Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are "hot".

[Chart: path profiling accuracy (78%-96%) vs. load (0.06-2) for sample sizes 1M, 10M, and 100M]
Results - Page Fault Increase

Proportional increase in page faults (shadow/native).

[Chart: increase in page faults (0-200X) vs. load (0.06-2) for sample sizes 1M, 10M, and 100M]
Results - Page Fault Rate

Difference in page faults per second experienced by the native application.

[Chart: paging-rate increase (0-12,000 faults/s) vs. load (0.063-2) for sample sizes 1M, 10M, and 100M]
Future Work

- Improve stability for multithreaded programs
- Investigate the effects of different persistent code cache policies
- Compare sampling policies: random (current), phase/event-based, static analysis; study convergence
- Apply the technique to profile-guided optimizations and simulation techniques
Conclusion

- Shadow Profiling allows the collection of bursts of detailed traces; accuracy is over 90%
- It incurs negligible overhead, often less than 1%
- With increasing numbers of cores, it allows developers' focus to shift from profiling to applying optimizations