Online Performance Auditing: Using Hot Optimizations Without Getting Burned
Jeremy Lau (UCSD, IBM), Matthew Arnold (IBM), Michael Hind (IBM), Brad Calder (UCSD)
Problem
Trend: Increasing complexity of computer systems
– Hardware: more speculation and parallelism
– Software: more abstraction layers and virtualization
Increasing complexity makes it more difficult to reason about performance
– Will optimization X improve performance?
Increasing Complexity
Increasing distance between application and raw performance
– Stack below vs. classic Application-OS-Hardware stack
Hard to predict how all layers will react to application-level optimization
[Stack diagram: Application / Application Server / Java VM / OS / Hypervisor / Hardware]
Heuristics
When should I use optimization X?
Common solution: Use heuristics
Example: Apply optimization X if code size < N (see the sketch below)
– “We believe X will improve performance when code size < N”
Determine N by running benchmarks and tuning to maximize average performance
But heuristics will miss opportunities to improve performance
– Because they are tuned for the average case
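As a concrete illustration, here is a minimal sketch of what such a size-based heuristic might look like inside a JIT inliner; the class name, method name, and threshold value are hypothetical, not taken from J9 or any other VM.

```java
// Hypothetical size-based inlining heuristic: inline only when the
// callee's bytecode size is below a threshold N tuned on benchmarks.
final class InlineHeuristic {
    // Illustrative value only; real VMs tune N for best *average*
    // performance across a benchmark suite, so it can still be the
    // wrong choice for any individual method.
    static final int MAX_CALLEE_SIZE = 35;

    static boolean shouldInline(int calleeBytecodeSize) {
        return calleeBytecodeSize < MAX_CALLEE_SIZE;
    }
}
```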
Experiment
Aggressive inlining: 4x inlining thresholds
– Allows much larger methods to be inlined
Apply aggressive inlining to one hot method at a time
Calculate per-method speedups vs. default inlining policy
– Use cycle counter to measure performance
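The slides do not spell out how the per-method speedup is computed; presumably it is the usual ratio of measured cycle counts under the two policies, so that values above 1 mean aggressive inlining helped that method:

\[
\text{speedup}_m \;=\; \frac{\text{cycles}_m^{\text{default}}}{\text{cycles}_m^{\text{aggressive}}}
\]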
Experiment Results
Aggressive inlining vs. default inlining
[Chart: per-method speedups]
Using J9, IBM’s high-performance Java VM
Experiment Analysis
Aggressive inlining: mixed results
More slowdowns than speedups
But there are significant speedups!
Wishful Thinking
Dream: A world without slowdowns
Default inlining heuristics miss these opportunities to improve performance
Goal: Be aggressive only when it produces a speedup
Approach
Determine if an optimization improves or degrades performance as the program executes
– For general-purpose applications
– Using VM support (dynamic compilation)
Plan:
– Compile two versions of the code: with and without the optimization
– Measure performance of both versions
– Use the best-performing version
Benefits
Defense: Avoid slowdowns due to poor optimization decisions
– Sometimes O3 is slower than O2. Detect and correct
Offense: Find speedups by searching the optimization space
– Try high-risk optimizations without fear of long-term slowdowns
Challenge
Which implementation is fastest?
– Decide online, without stopping and restarting the program
Can’t just invoke each version once and compare times
– Changing inputs, global state, etc.
Example: Sorting routine. Size of input determines run time
– SortVersionA(10 entries) vs. SortVersionB(1,000,000 entries)
– Invocation timings don’t reflect performance of A and B
○ Unless we know that input size correlates with runtime
○ But that requires high-level understanding of program behavior
Solution: Collect multiple timing samples for each version
– Use statistics to determine how many samples to collect
Timing Infrastructure
[Diagram: an invocation of Sort() randomly chooses Version A or Version B; the timer starts at invocation, stops at method exit, and the timing is recorded for the chosen version]
Can generalize:
– Doesn’t have to be method granularity
– Can use more than two versions
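A minimal sketch of this per-invocation timing harness, assuming method granularity and exactly two versions; the class and field names are illustrative, and System.nanoTime() stands in for the VM's cycle counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the timing infrastructure on this slide: each invocation
// randomly dispatches to one compiled version, and the elapsed time is
// recorded against that version. Illustrative only, not the J9 code.
final class VersionedMethod {
    private final Runnable versionA;                 // e.g. default inlining
    private final Runnable versionB;                 // e.g. aggressive inlining
    private final List<Long> timingsA = new ArrayList<>();
    private final List<Long> timingsB = new ArrayList<>();
    private final Random rng = new Random();

    VersionedMethod(Runnable versionA, Runnable versionB) {
        this.versionA = versionA;
        this.versionB = versionB;
    }

    void invoke() {
        boolean useA = rng.nextBoolean();            // randomly choose A or B
        long start = System.nanoTime();              // start timer at entry
        (useA ? versionA : versionB).run();
        long elapsed = System.nanoTime() - start;    // stop timer at exit
        (useA ? timingsA : timingsB).add(elapsed);   // record the timing
    }
}
```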
Statistical Analysis
Is A faster than B?
– How confident are we?
– Use standard statistical hypothesis testing (t-test)
If low confidence, collect more timing data
[Diagram: Version A timings and Version B timings feed into the statistical timing analysis]
INPUT: Two sets of method timings
OUTPUT: A is faster (or slower) than B by X% with Y% confidence
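A rough sketch of the analysis step as a two-sample (Welch-style) t-test over the recorded timings. For brevity it compares against a fixed critical value (valid for large samples) instead of reporting an exact Y% confidence, and the class and method names are illustrative.

```java
// Sketch of the statistical timing analysis: given two sets of timings,
// decide whether version A is faster than version B and by how much.
final class TimingAnalysis {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double variance(double[] xs, double mean) {
        double sum = 0;
        for (double x : xs) sum += (x - mean) * (x - mean);
        return sum / (xs.length - 1);                // sample variance
    }

    /** True if A looks faster than B with roughly 95% (one-sided) confidence. */
    static boolean aIsFaster(double[] timesA, double[] timesB) {
        double meanA = mean(timesA), meanB = mean(timesB);
        double stdErr = Math.sqrt(variance(timesA, meanA) / timesA.length
                                + variance(timesB, meanB) / timesB.length);
        double t = (meanB - meanA) / stdErr;         // positive when A is faster
        return t > 1.645;                            // ~95% one-sided, large n
    }

    /** Estimated speedup of A over B, e.g. 0.05 means "A is 5% faster". */
    static double relativeSpeedup(double[] timesA, double[] timesB) {
        return (mean(timesB) - mean(timesA)) / mean(timesB);
    }
}
```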
Time to Converge
How long will it take to reach a confident conclusion?
– Any speedup can be detected with enough timing data
Time to converge depends on:
– Variance in timing data
○ Easy to detect speedup if method always does the same amount of work
– Speedup due to optimization
○ Easy to detect big speedups
Fastest convergence for low-variance methods with high speedup
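A standard power-analysis approximation (not from the slides) makes this dependence explicit: to detect a mean timing difference \(\delta\) when timings have standard deviation \(\sigma\), at significance level \(\alpha\) and power \(1-\beta\), each version needs roughly

\[
n \;\approx\; \frac{2\,(z_{1-\alpha} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}}
\]

samples, where \(z\) denotes standard normal quantiles; the required sample count grows with the variance and shrinks quadratically as the speedup grows.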
Fixed Number of Samples
Why not just collect 100 samples?
Experiment: Try to detect an X% speedup with 100 samples
How often do the samples indicate a slowdown?
Each slowdown detected is a false positive
– Samples do not accurately represent the population
Fixed Number of Samples
Number of samples needed depends on speedup
– More speedup → Fewer samples
Fixed sampling is inefficient
– Suppose we want to maintain a 5% false positive rate
– Could always collect 10k samples, but that wastes time
Statistical approach collects only as many samples as needed to reach a confident conclusion
Prototype Implementation
Prototype online performance auditing system implemented in IBM’s J9 Java VM
Currently audits a single optimization
Experiments with aggressive inlining
– Infrastructure is not tied to aggressive inlining; can evaluate any single optimization
When a method reaches the highest optimization level:
– Compile two versions of the method (with and without aggressive inlining), collect timing data, run statistical analysis
If aggressive inlining generates a quickly detectable speedup, use it; else fall back to default inlining (see the sketch below)
– Timeout can occur when a confident conclusion is not reached in 5 seconds
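A sketch of the resulting decision logic, reusing the illustrative VersionedMethod and TimingAnalysis classes from earlier; only the 5-second timeout and the fall-back-to-default policy come from the slide, everything else is assumed.

```java
// Sketch of the auditing decision for one method. Would be evaluated
// periodically while timing samples accumulate for the two versions.
final class Auditor {
    static final long TIMEOUT_NANOS = 5_000_000_000L;   // 5-second timeout

    enum Decision { KEEP_SAMPLING, USE_AGGRESSIVE, USE_DEFAULT }

    static Decision check(double[] aggressiveTimes, double[] defaultTimes,
                          long elapsedNanos) {
        if (TimingAnalysis.aIsFaster(aggressiveTimes, defaultTimes)) {
            return Decision.USE_AGGRESSIVE;   // confident speedup: keep it
        }
        if (TimingAnalysis.aIsFaster(defaultTimes, aggressiveTimes)
                || elapsedNanos > TIMEOUT_NANOS) {
            return Decision.USE_DEFAULT;      // confident slowdown or timeout
        }
        return Decision.KEEP_SAMPLING;        // not confident yet
    }
}
```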
Timeouts
Good news: Few incorrect decisions
Timeouts: Only one timing sample is collected per method invocation
– Most methods are not invoked frequently enough to converge before the timeout
Future work: Reduce timeouts by reducing convergence time
– Collect multiple timings per invocation: use loop iteration times instead of invocation times (sketched below)
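One way the per-iteration idea might look; this is a sketch only, and the method, its loop body, and the sampling scheme are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: record one timing sample per loop iteration rather than one per
// invocation, so a rarely-called method with a hot loop can still gather
// enough samples to converge before the timeout.
final class IterationTiming {
    static long processAll(int[] items, List<Long> samples) {
        long total = 0;
        for (int item : items) {
            long start = System.nanoTime();
            total += expensiveWork(item);                 // body being audited
            samples.add(System.nanoTime() - start);       // one sample per iteration
        }
        return total;
    }

    static long expensiveWork(int item) {
        return (long) item * item;                        // placeholder work
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<>();
        processAll(new int[] {1, 2, 3, 4}, samples);
        System.out.println("collected " + samples.size() + " samples");
    }
}
```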
Future Work
Audit multiple optimizations and settings
– Search the optimization space online, as the program executes
Exponential search space is both a challenge and an opportunity
Apply prior work in offline optimization space search
Use the Performance Auditor to tune the optimization strategy for each method
Summary
Not easy to predict performance
– Should I apply optimization X?
Online Performance Auditing
– Measure code performance as the program executes
Detect slowdowns
– Due to poor optimization decisions
Find speedups
– Use high-risk optimizations without long-term slowdowns
Enable online optimization space search