Online Performance Auditing: Using Hot Optimizations Without Getting Burned
Jeremy Lau (UCSD, IBM), Matthew Arnold (IBM), Michael Hind (IBM), Brad Calder (UCSD)
Problem
Trend: Increasing complexity of computer systems
– Hardware: more speculation and parallelism
– Software: more abstraction layers and virtualization
Increasing complexity makes it more difficult to reason about performance
– Will optimization X improve performance?
Increasing Complexity
Increasing distance between application and raw performance
– Stack below vs. classic Application-OS-Hardware stack
Hard to predict how all layers will react to application-level optimization
[Stack diagram: Application / Application Server / Java VM / OS / Hypervisor / Hardware]
Heuristics
When should I use optimization X?
Common solution: Use heuristics
Example: Apply optimization X if code size < N (see the sketch below)
– “We believe X will improve performance when code size < N”
Determine N by running benchmarks and tuning to maximize average performance
But heuristics will miss opportunities to improve performance
– Because they are tuned for the average case
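As a concrete illustration, here is a minimal sketch of what such a size-based heuristic might look like inside a JIT inliner; the class name, method name, and threshold value are hypothetical, not taken from J9 or any other VM.

```java
// Hypothetical size-based inlining heuristic: inline only when the
// callee's bytecode size is below a threshold N tuned on benchmarks.
final class InlineHeuristic {
    // Illustrative value only; real VMs tune N for best *average*
    // performance across a benchmark suite, so it can still be the
    // wrong choice for any individual method.
    static final int MAX_CALLEE_SIZE = 35;

    static boolean shouldInline(int calleeBytecodeSize) {
        return calleeBytecodeSize < MAX_CALLEE_SIZE;
    }
}
```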
Experiment
Aggressive inlining: 4x inlining thresholds
– Allows much larger methods to be inlined
Apply aggressive inlining to one hot method at a time
Calculate per-method speedups vs. default inlining policy
– Use cycle counter to measure performance
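The slides do not spell out how the per-method speedup is computed; presumably it is the usual ratio of measured cycle counts under the two policies, so that values above 1 mean aggressive inlining helped that method:

\[
\text{speedup}_m \;=\; \frac{\text{cycles}_m^{\text{default}}}{\text{cycles}_m^{\text{aggressive}}}
\]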
Experiment Results
Aggressive inlining vs. default inlining
[Chart: per-method speedups]
Using J9, IBM’s high-performance Java VM
Experiment Analysis
Aggressive inlining: mixed results
More slowdowns than speedups
But there are significant speedups!
Wishful Thinking
Dream: A world without slowdowns
Default inlining heuristics miss these opportunities to improve performance
Goal: Be aggressive only when it produces a speedup
Approach
Determine if an optimization improves or degrades performance as the program executes
– For general-purpose applications
– Using VM support (dynamic compilation)
Plan:
– Compile two versions of the code: with and without the optimization
– Measure performance of both versions
– Use the best-performing version
Benefits
Defense: Avoid slowdowns due to poor optimization decisions
– Sometimes O3 is slower than O2. Detect and correct
Offense: Find speedups by searching the optimization space
– Try high-risk optimizations without fear of long-term slowdowns
Challenge
Which implementation is fastest?
– Decide online, without stopping and restarting the program
Can’t just invoke each version once and compare times
– Changing inputs, global state, etc.
Example: Sorting routine. Size of input determines run time
– SortVersionA(10 entries) vs. SortVersionB(1,000,000 entries)
– Invocation timings don’t reflect performance of A and B
○ Unless we know that input size correlates with runtime
○ But that requires high-level understanding of program behavior
Solution: Collect multiple timing samples for each version
– Use statistics to determine how many samples to collect
Timing Infrastructure
[Diagram: an invocation of Sort() randomly chooses Version A or Version B; the timer starts at invocation, stops at method exit, and the timing is recorded for the chosen version]
Can generalize:
– Doesn’t have to be method granularity
– Can use more than two versions
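A minimal sketch of this per-invocation timing harness, assuming method granularity and exactly two versions; the class and field names are illustrative, and System.nanoTime() stands in for the VM's cycle counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the timing infrastructure on this slide: each invocation
// randomly dispatches to one compiled version, and the elapsed time is
// recorded against that version. Illustrative only, not the J9 code.
final class VersionedMethod {
    private final Runnable versionA;                 // e.g. default inlining
    private final Runnable versionB;                 // e.g. aggressive inlining
    private final List<Long> timingsA = new ArrayList<>();
    private final List<Long> timingsB = new ArrayList<>();
    private final Random rng = new Random();

    VersionedMethod(Runnable versionA, Runnable versionB) {
        this.versionA = versionA;
        this.versionB = versionB;
    }

    void invoke() {
        boolean useA = rng.nextBoolean();            // randomly choose A or B
        long start = System.nanoTime();              // start timer at entry
        (useA ? versionA : versionB).run();
        long elapsed = System.nanoTime() - start;    // stop timer at exit
        (useA ? timingsA : timingsB).add(elapsed);   // record the timing
    }
}
```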
Statistical Analysis
Is A faster than B?
– How confident are we?
– Use standard statistical hypothesis testing (t-test)
If low confidence, collect more timing data
[Diagram: Version A timings and Version B timings feed into the statistical timing analysis]
INPUT: Two sets of method timings
OUTPUT: A is faster (or slower) than B by X% with Y% confidence
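A rough sketch of the analysis step as a two-sample (Welch-style) t-test over the recorded timings. For brevity it compares against a fixed critical value (valid for large samples) instead of reporting an exact Y% confidence, and the class and method names are illustrative.

```java
// Sketch of the statistical timing analysis: given two sets of timings,
// decide whether version A is faster than version B and by how much.
final class TimingAnalysis {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double variance(double[] xs, double mean) {
        double sum = 0;
        for (double x : xs) sum += (x - mean) * (x - mean);
        return sum / (xs.length - 1);                // sample variance
    }

    /** True if A looks faster than B with roughly 95% (one-sided) confidence. */
    static boolean aIsFaster(double[] timesA, double[] timesB) {
        double meanA = mean(timesA), meanB = mean(timesB);
        double stdErr = Math.sqrt(variance(timesA, meanA) / timesA.length
                                + variance(timesB, meanB) / timesB.length);
        double t = (meanB - meanA) / stdErr;         // positive when A is faster
        return t > 1.645;                            // ~95% one-sided, large n
    }

    /** Estimated speedup of A over B, e.g. 0.05 means "A is 5% faster". */
    static double relativeSpeedup(double[] timesA, double[] timesB) {
        return (mean(timesB) - mean(timesA)) / mean(timesB);
    }
}
```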
Time to Converge
How long will it take to reach a confident conclusion?
– Any speedup can be detected with enough timing data
Time to converge depends on:
– Variance in timing data
○ Easy to detect speedup if method always does the same amount of work
– Speedup due to optimization
○ Easy to detect big speedups
Fastest convergence for low-variance methods with high speedup
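A standard power-analysis approximation (not from the slides) makes this dependence explicit: to detect a mean timing difference \(\delta\) when timings have standard deviation \(\sigma\), at significance level \(\alpha\) and power \(1-\beta\), each version needs roughly

\[
n \;\approx\; \frac{2\,(z_{1-\alpha} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}}
\]

samples, where \(z\) denotes standard normal quantiles; the required sample count grows with the variance and shrinks quadratically as the speedup grows.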
Fixed Number of Samples
Why not just collect 100 samples?
Experiment: Try to detect an X% speedup with 100 samples
How often do the samples indicate a slowdown?
Each slowdown detected is a false positive
– Samples do not accurately represent the population
Fixed Number of Samples
Number of samples needed depends on speedup
– More speedup → Fewer samples
Fixed sampling is inefficient
– Suppose we want to maintain a 5% false positive rate
– Could always collect 10k samples, but that wastes time
Statistical approach collects only as many samples as needed to reach a confident conclusion
Prototype Implementation
Prototype online performance auditing system implemented in IBM’s J9 Java VM
Currently audits a single optimization
Experiments with aggressive inlining
– Infrastructure is not tied to aggressive inlining; can evaluate any single optimization
When a method reaches the highest optimization level:
– Compile two versions of the method (with and without aggressive inlining), collect timing data, run statistical analysis
If aggressive inlining generates a quickly detectable speedup, use it; else fall back to default inlining (see the sketch below)
– Timeout can occur when a confident conclusion is not reached in 5 seconds
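A sketch of the resulting decision logic, reusing the illustrative VersionedMethod and TimingAnalysis classes from earlier; only the 5-second timeout and the fall-back-to-default policy come from the slide, everything else is assumed.

```java
// Sketch of the auditing decision for one method. Would be evaluated
// periodically while timing samples accumulate for the two versions.
final class Auditor {
    static final long TIMEOUT_NANOS = 5_000_000_000L;   // 5-second timeout

    enum Decision { KEEP_SAMPLING, USE_AGGRESSIVE, USE_DEFAULT }

    static Decision check(double[] aggressiveTimes, double[] defaultTimes,
                          long elapsedNanos) {
        if (TimingAnalysis.aIsFaster(aggressiveTimes, defaultTimes)) {
            return Decision.USE_AGGRESSIVE;   // confident speedup: keep it
        }
        if (TimingAnalysis.aIsFaster(defaultTimes, aggressiveTimes)
                || elapsedNanos > TIMEOUT_NANOS) {
            return Decision.USE_DEFAULT;      // confident slowdown or timeout
        }
        return Decision.KEEP_SAMPLING;        // not confident yet
    }
}
```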
Timeouts
Good news: Few incorrect decisions
Timeouts: Only one timing sample is collected per method invocation
– Most methods are not invoked frequently enough to converge before the timeout
Future work: Reduce timeouts by reducing convergence time
– Collect multiple timings per invocation: use loop iteration times instead of invocation times (sketched below)
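One way the per-iteration idea might look; this is a sketch only, and the method, its loop body, and the sampling scheme are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: record one timing sample per loop iteration rather than one per
// invocation, so a rarely-called method with a hot loop can still gather
// enough samples to converge before the timeout.
final class IterationTiming {
    static long processAll(int[] items, List<Long> samples) {
        long total = 0;
        for (int item : items) {
            long start = System.nanoTime();
            total += expensiveWork(item);                 // body being audited
            samples.add(System.nanoTime() - start);       // one sample per iteration
        }
        return total;
    }

    static long expensiveWork(int item) {
        return (long) item * item;                        // placeholder work
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<>();
        processAll(new int[] {1, 2, 3, 4}, samples);
        System.out.println("collected " + samples.size() + " samples");
    }
}
```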
Future Work
Audit multiple optimizations and settings
– Search the optimization space online, as the program executes
Exponential search space is both a challenge and an opportunity
Apply prior work in offline optimization space search
Use the Performance Auditor to tune the optimization strategy for each method
Summary
Not easy to predict performance
– Should I apply optimization X?
Online Performance Auditing
– Measure code performance as the program executes
Detect slowdowns
– Due to poor optimization decisions
Find speedups
– Use high-risk optimizations without long-term slowdowns
Enable online optimization space search