
Page 1: Adaptive Optimization with On-Stack Replacement

Adaptive Optimization with On-Stack Replacement

Stephen J. Fink IBM T.J. Watson Research Center

Feng Qian (presenter), Sable Research Group, McGill University

http://www.sable.mcgill.ca

Page 2: Adaptive Optimization with On-Stack Replacement

Motivation

Modern VMs use adaptive recompilation strategies

The VM replaces the entry in the dispatch table with newly compiled code

Switching to the new code can happen only at the next invocation

On-stack replacement (OSR) allows the transformation to happen in the middle of method execution

Page 3: Adaptive Optimization with On-Stack Replacement

What is On-stack Replacement?

Transfer execution from compiled code m1 to compiled code m2 even while m1 runs on some thread’s stack

[Diagram: a thread's stack, PC, and frame running m1 are replaced by a stack, PC, and frame running m2]

Page 4: Adaptive Optimization with On-Stack Replacement

Why On-Stack Replacement (OSR)?

Debugging optimized code via dynamic de-optimization [SELF-93]

Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001]

Promotion of long-run activations [SELF-93]

Safe invalidation for speculative optimization [HotSpot, SELF-91]

Page 5: Adaptive Optimization with On-Stack Replacement

Related Work

Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94]

HotSpot server compiler [JVM’01]

Partial method compilation [OOPSLA’01]

Page 6: Adaptive Optimization with On-Stack Replacement

OSR Challenges

Engineering Complexity
  How to minimize disruption to the VM code base?
  How to constrain optimizations?

Policies for applying OSR
  How to make rational decisions for applying OSR?

Effectiveness
  How does OSR improve/constrain dataflow optimizations?
  How effective are online OSR-based optimizations?

Page 7: Adaptive Optimization with On-Stack Replacement

Outline

Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion

Page 8: Adaptive Optimization with On-Stack Replacement

OSR Mechanism Overview

1. Extract compiler-independent state from the suspended activation of m1

2. Generate specialized code m2 for the suspended activation

3. Compile and transfer execution to the new code m2

[Diagram: (1) the stack, PC, and frame of the suspended m1 activation are extracted into compiler-independent state; (2) specialized code m2 is generated from that state; (3) the thread resumes on a new stack, PC, and frame running m2]
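Taken together, the three steps form a small driver. A minimal sketch in Java, assuming hypothetical helpers (extractScopeDescriptors, generateSpecializedBytecode, Activation, and the rest are illustrative names, not Jikes RVM APIs):

  // Hedged sketch of the OSR mechanism; every name here is illustrative.
  void onStackReplace(Thread thread, Activation m1Frame) {
      // Step 1: extract compiler-independent state from the suspended
      // activation (one descriptor per inlined scope).
      List<JvmScopeDescriptor> scopes = extractScopeDescriptors(m1Frame);

      // Step 2: generate specialized bytecode whose prologue rebuilds the
      // suspended state, then compile it like any other method.
      CompiledMethod m2 = compile(generateSpecializedBytecode(scopes));

      // Step 3: pop m1's frame and reschedule the thread into m2; m2's
      // prologue recreates the locals, operand stack, and program counter.
      m1Frame.unwind();
      thread.resumeAt(m2.entryPoint());
  }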

Page 9: Adaptive Optimization with On-Stack Replacement

JVM Scope Descriptor

Compiler-independent state of a running activation, based on the Java Virtual Machine architecture. Five components:

1) Thread running the activation
2) Reference to the activation's stack frame
3) Program counter (as a bytecode index)
4) Value of each local variable
5) Value of each stack location
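As a data structure, the descriptor is just a record of these five components. A minimal sketch, assuming a hypothetical Value type and illustrative field names (this is not the actual Jikes RVM class):

  // Illustrative sketch of a JVM scope descriptor; field names and the
  // Value type are assumptions, not Jikes RVM's implementation.
  class JvmScopeDescriptor {
      Thread thread;        // 1) thread running the activation
      long framePointer;    // 2) reference to the activation's stack frame
      int bytecodeIndex;    // 3) program counter, as a bytecode index
      Value[] locals;       // 4) value of each local variable
      Value[] stackSlots;   // 5) value of each expression-stack location
  }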

Page 10: Adaptive Optimization with On-Stack Replacement

JVM Scope Descriptor Example

Source code:

  class C {
    static int sum(int c) {
      int y = 0;
      for (int i = 0; i < c; i++) {
        y += i;
      }
      return y;
    }
  }

Bytecode:

   0 iconst_0
   1 istore_1
   2 iconst_0
   3 istore_2
   4 goto 14
   7 iload_1
   8 iload_2
   9 iadd
  10 istore_1
  11 iinc 2 1
  14 iload_2
  15 iload_0
  16 if_icmplt 7
  19 iload_1
  20 ireturn

JVM Scope Descriptor (suspended after 50 loop iterations, i = 50):

  Running thread:    MainThread
  Frame pointer:     0xSomeAddress
  Program counter:   16
  Local variables:   L0(c) = 100; L1(y) = 1225; L2(i) = 50
  Stack expressions: S0 = 50; S1 = 100

Page 11: Adaptive Optimization with On-Stack Replacement

Extracting JVM Scope Descriptor

Trivial from the interpreter.

Optimizing compiler:
• Insert OSR Point (safe-point) instructions in the initial IR
• An OSR Point uses the stack and local state needed to recover the scope descriptor
• An OSR Point is treated as a call; it transfers control to the exit block
• Aggregate OSR Points into an OSR map when generating machine instructions

[Diagram: step 1, extracting compiler-independent state from the suspended m1 frame]

Page 12: Adaptive Optimization with On-Stack Replacement

Specialized Code Generation

Prepend a specialized prologue to the original bytecode.

The prologue will:
• Save JVM Scope Descriptor values into local variables
• Push JVM Scope Descriptor values onto the stack
• Jump to the desired program counter

[Diagram: step 2, generating specialized code m2 from the compiler-independent state]

Page 13: Adaptive Optimization with On-Stack Replacement

Transition Example

JVM Scope Descriptor:

  Running thread:    MainThread
  Frame pointer:     0xSomeAddress
  Program counter:   16
  Local variables:   L0(c) = 100; L1(y) = 1225; L2(i) = 50
  Stack expressions: S0 = 50; S1 = 100

Specialized bytecode (prologue prepended to the original):

     ldc 100
     istore_0
     ldc 1225
     istore_1
     ldc 50
     istore_2
     ldc 50
     ldc 100
     goto 16
   0 iconst_0
     ...
  16 if_icmplt 7
     ...
  20 ireturn

Original bytecode:

   0 iconst_0
   1 istore_1
   2 iconst_0
   3 istore_2
   4 goto 14
   7 iload_1
   8 iload_2
   9 iadd
  10 istore_1
  11 iinc 2 1
  14 iload_2
  15 iload_0
  16 if_icmplt 7
  19 iload_1
  20 ireturn

Page 14: Adaptive Optimization with On-Stack Replacement

Transfer Execution to the New Code

• Compile m2 as a normal method
• The system unwinds the stack frame of m1
• Reschedule the thread to execute m2
• By construction, executing specialized m2 sets up the target stack frame and continues execution

[Diagram: step 3, the thread resumes on a new stack, PC, and frame running m2]

Page 15: Adaptive Optimization with On-Stack Replacement

Recovering from Inlining

Suppose the optimizer inlines A -> B -> C. The single frame for A (with B and C inlined) is extracted into three JVM Scope Descriptors, one per inlined scope: A, B, and C. Specialized methods A', B', and C' are generated from them, and the one physical frame for A is replaced by separate frames for A', B', and C'.

[Diagram: steps 1-3 applied to an inlined activation: one stack frame for A yields JVM Scope Descriptors A, B, and C, which become frames for A', B', and C' on the new stack]

Page 16: Adaptive Optimization with On-Stack Replacement

Inlining Example

Original source (suspend at B: in A -> B):

  void foo() {
    bar();
  A: ...
  }

  void bar() {
    ...
  B: ...
  }

Wipe the stack up to caller C and call foo_prime:

  foo_prime() {
    <specialized foo prologue>
    call bar_prime();
    goto A;
    ...
    bar();
  A: ...
  }

  bar_prime() {
    <specialized bar prologue>
    goto B;
    ...
  B: ...
  }

[Diagram: the old frame for A above caller C is replaced by frames for foo' and bar']

Page 17: Adaptive Optimization with On-Stack Replacement

Implementation Details

The target compiler is unmodified, except for:
• New pseudo-bytecodes:
  - Load literals (to avoid inserting new constants into the constant pool)
  - Load an address/bytecode index (JSR return address on the stack)
• Fixing bytecode indices for GC maps, exception tables, and line number tables

Page 18: Adaptive Optimization with On-Stack Replacement

Pros and Cons

Advantages:
• Mostly compiler-independent
• Avoids multiple entry points in compiled code
• The target compiler can exploit run-time constants

Disadvantage:
• Must compile the target method twice (once for the transition, once for the next invocation)

Page 19: Adaptive Optimization with On-Stack Replacement

Outline

Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion

Page 20: Adaptive Optimization with On-Stack Replacement

Two OSR Applications

Promotion (see the paper for details): recompile a long-running activation

Deferred compilation: don't compile uncommon paths; saves compile time

Example (speculative guard on foo being currently final):

  Original:                        With deferred compilation:
    if (foo is currently final)      if (foo is currently final)
      x = 1;                           x = 1;
    else                             else
      x = foo();                       trap/OSR;
    return x;                        return x;

Page 21: Adaptive Optimization with On-Stack Replacement

Deferred Compilation

What's "infrequent"? static heuristics profile data

Adaptive recompilation decision is modified to consider OSR factors

Presenter note (Feng Qian): Class initialization is called by the class loader; when do we need OSR for it?

Page 22: Adaptive Optimization with On-Stack Replacement

Outline

Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion

Page 23: Adaptive Optimization with On-Stack Replacement

Online Experiments

• Eager (default): no deferred compilation
• OSR/static: deferred compilation for CHA-based inlining only
• OSR/edge counts: deferred compilation with online profile data and CHA-based inlining

Page 24: Adaptive Optimization with On-Stack Replacement

Adaptive System Performance

First Run

[Bar chart: performance relative to Eager (higher is better) for compress, jess, db, javac, mpegaudio, mtrt, jack, and geometric mean; configurations OSR/edge counts and OSR/static; y-axis from 0.8 to 1.2]

Page 25: Adaptive Optimization with On-Stack Replacement

Adaptive System Performance

Best Run of 10

[Bar chart: performance relative to Eager (higher is better) for compress, jess, db, javac, mpegaudio, mtrt, jack, and geometric mean; configurations OSR/edge counts and OSR/static; y-axis from 0.8 to 1.2]

Page 26: Adaptive Optimization with On-Stack Replacement

OSR Activities (SPECjvm98, size 100, First Run)

              Promotions   Invalidations
  compress         3              6
  jess             0              0
  db               0              1
  javac            0             10
  mpegaudio        0              1
  mtrt             0              5
  jack             0              1
  total            3             24

Page 27: Adaptive Optimization with On-Stack Replacement

Outline

Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion

Page 28: Adaptive Optimization with On-Stack Replacement

Summary

• A new on-stack replacement mechanism
• Online profile-directed deferred compilation
• Evaluation of OSR applications in Jikes RVM

Page 29: Adaptive Optimization with On-Stack Replacement

Conclusion

Should a VM implement OSR?
  + Can be done with minimal intrusion to the code base
  - Modest gains from deferred compilation
  - No benefit for class-hierarchy-based inlining
  + Debugging with dynamic de-optimization is valuable
  TODO: more advanced speculative optimizations

The implementation is publicly available in Jikes RVM under the CPL, for Linux/x86, Linux/PPC, and AIX/PPC:

http://www-124.ibm.com/developerworks/oss/jikesrvm/

Page 30: Adaptive Optimization with On-Stack Replacement

Backup Slides

Page 31: Adaptive Optimization with On-Stack Replacement

Compile Rate (Offline Profile) [chart]

Page 32: Adaptive Optimization with On-Stack Replacement

Compile Rate (Offline Profile) [chart]

Page 33: Adaptive Optimization with On-Stack Replacement

Machine Code Size (Offline Profile) [chart]

Page 34: Adaptive Optimization with On-Stack Replacement

Machine Code Size (Offline Profile) [chart]

Page 35: Adaptive Optimization with On-Stack Replacement

Code Quality (Offline Profile) [chart]

Page 36: Adaptive Optimization with On-Stack Replacement

Code Quality (Offline Profile) [chart; higher is better]

Page 37: Adaptive Optimization with On-Stack Replacement

Jikes RVM Analytic Recompilation Model

Define:
  cur, the current optimization level for method m
  Tj, the expected future execution time at level j
  Cj, the compilation cost at optimization level j

Choose j > cur that minimizes Tj + Cj.
If Tj + Cj < Tcur, recompile at level j.

Assumptions:
  • The method will execute for twice its current duration
  • Compilation cost and speedup are based on offline averages
  • Sample data determines how long a method has executed
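A minimal sketch of this cost-benefit test in Java, with the per-level estimates passed in as arrays (the names and calling convention are illustrative, not Jikes RVM's controller code):

  // Hedged sketch of the analytic recompilation model. T[j] is the
  // expected future execution time at level j; C[j] is the compilation
  // cost at level j. Returns the chosen level, or cur if recompiling
  // does not pay off.
  int chooseRecompilationLevel(int cur, double[] T, double[] C) {
      int best = cur;
      double bestCost = T[cur];           // cost of doing nothing: Tcur
      for (int j = cur + 1; j < T.length; j++) {
          double cost = T[j] + C[j];      // future time plus compile cost
          if (cost < bestCost) {
              bestCost = cost;
              best = j;
          }
      }
      return best;                        // recompile only if best > cur
  }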

Page 38: Adaptive Optimization with On-Stack Replacement

Jikes RVM OSR Promotion Model

Given: an outdated activation A of method m.

Define:
  L, the last optimization level for any compiled version of m
  cur, the current optimization level of activation A
  Tcur, the expected future execution time of A at level cur
  CL, the compilation cost for method m at optimization level L
  TL, the expected future execution time of A at level L

If TL + CL < Tcur, specialize A at level L.

Assumption: the outdated activation will execute for twice its current duration.
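The promotion decision is the same cost-benefit comparison applied to a single activation. A one-line sketch under the same assumptions (names illustrative):

  // Hedged sketch: promote outdated activation A from level cur to the
  // last-compiled level L when the expected payoff beats the cost.
  boolean shouldPromote(double Tcur, double TL, double CL) {
      return TL + CL < Tcur;
  }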

Page 39: Adaptive Optimization with On-Stack Replacement

Jikes RVM Recompilation Model, with Profile-Driven Deferred Compilation

Define:
  cur, the current optimization level for method m
  Tj, the expected future execution time at level j
  Cj, the compilation cost at optimization level j
  P, the percentage of code in m that profile data indicates was reached

Choose j > cur that minimizes Tj + P*Cj.
If Tj + P*Cj < Tcur, recompile at level j.

Assumptions:
  • The method will execute for twice its current duration
  • Compilation cost and speedup are based on offline averages
  • Sample data determines how long a method has executed
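Relative to the base model, only the compile-cost term changes: deferred compilation never compiles unreached code, so the estimated cost is scaled by P. A sketch (illustrative names, mirroring chooseRecompilationLevel above):

  // Hedged sketch: same search as before, but the compile cost is
  // discounted by the fraction P of m's code that was actually reached.
  int chooseLevelWithDeferral(int cur, double[] T, double[] C, double P) {
      int best = cur;
      double bestCost = T[cur];
      for (int j = cur + 1; j < T.length; j++) {
          double cost = T[j] + P * C[j];  // profile-scaled compile cost
          if (cost < bestCost) {
              bestCost = cost;
              best = j;
          }
      }
      return best;
  }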

Page 40: Adaptive Optimization with On-Stack Replacement

Offline Profile experiments

Collect "perfect" profile data offline Mark any block never reached as "uncommon" Defer compilation of "uncommon" blocks Four configurations

Ideal: deferred compilation trap keeps no state liveIdeal-OSR: deferred compilation trap is valid OSR pointStatic-OSR: no profile data; defer compilation for CHA-based

inlining; trap is valid OSR pointEager: (default) no deferred compilation

Page 41: Adaptive Optimization with On-Stack Replacement

Compile Rate (Offline Profile) [chart]

Page 42: Adaptive Optimization with On-Stack Replacement

Machine Code Size (Offline Profile) [chart]

Page 43: Adaptive Optimization with On-Stack Replacement

Code Quality (Offline Profile) [chart]

Page 44: Adaptive Optimization with On-Stack Replacement

OSR Challenges

Engineering Complexity
  How to minimize disruption to the VM code base?
  How to constrain optimizations?

Policies for applying OSR
  How to make rational decisions for applying OSR?

Effectiveness
  How does OSR improve/constrain dataflow optimizations?
  How effective are online OSR-based optimizations?

Page 45: Adaptive Optimization with On-Stack Replacement

Recompilation Activities (First Run)

                   With OSR                Without OSR
              O0   O1   O2  total     O0   O1   O2  total
  compress    17    7    2     26     13    9    6     28
  jess        49   20    1     70     39   17    4     60
  db           8    4    2     14      8    4    5     17
  javac      171   19    2    192    168   16    3    187
  mpegaudio   68   32    7    107     66   29    6    101
  mtrt        57   14    3     74     61   11    3     75
  jack        59   25    8     92     54   26    5     85
  total      429  121   25    575    409  112   32    553

Page 46: Adaptive Optimization with On-Stack Replacement

Summary of Study (1)

Engineering Complexity
  How to minimize disruption to the VM code base?
  → Compiler-independent specialized source code manages the transition transparently
  How to constrain optimizations?
  → Model OSR Points like calls in standard transformations

Policies for applying OSR
  How to make rational decisions for applying OSR?
  → Simple modifications to the cost-benefit analytic model

Page 47: Adaptive Optimization with On-Stack Replacement

Summary of Study (2)

Effectiveness (for an implementation of online profile-directed deferred compilation)

How does OSR improve/constrain dataflow optimizations?
  • Small ideal benefit from dataflow merges (0.5 - 2.2%)
  • Negligible benefit when constraining optimization for potential invalidation
  • Negligible benefit for just CHA-based inlining: patch points + splitting + pre-existence are good enough

How effective are online OSR-based optimizations?
  • Average performance improvement of 2.6% on first run, SPECjvm98 s=100
  • Individual benchmarks range from +8% to -4%
  • Negligible impact on steady-state performance (best of 10 iterations)
  • Adaptive recompilation model is relatively insensitive; compiles 4% more methods

Page 48: Adaptive Optimization with On-Stack Replacement

Experimental Details

• SPECjvm98, size 100
• Jikes RVM 2.1.1
  - FastAdaptiveSemispace configuration
  - One virtual processor
  - 500 MB heap
  - Separate VM instance for each benchmark
• IBM RS/6000 Model F80
  - Six 500 MHz PowerPC 630s
  - AIX 4.3.3
  - 4 GB memory

Page 49: Adaptive Optimization with On-Stack Replacement

Specialized Code Generation

Generate specialized m2 that sets up the new stack frame and continues execution, preserving semantics.

Express the transition to the new stack frame in source code (bytecode).

[Diagram: step 2, generating specialized code m2 from the compiler-independent state]

Page 50: Adaptive Optimization with On-Stack Replacement

Deferred Compilation

Don't compile "infrequent" blocks

x = 1; trap/OSR;

return x;

if (foo is currently final)

x = 1; x = foo();

return x;

if (foo is currently final)

Page 51: Adaptive Optimization with On-Stack Replacement

Experimental Results

Online profile-directed deferred compilation. Evaluation questions:
• How much do OSR points improve optimization by eliminating merges?
• How much do OSR points constrain optimization?
• How effective is online profile-directed deferred compilation?

Page 52: Adaptive Optimization with On-Stack Replacement

Adaptive System Performance [chart]

Page 53: Adaptive Optimization with On-Stack Replacement

Adaptive System Performance [chart]

Page 54: Adaptive Optimization with On-Stack Replacement

Online Experiments

• Before optimizing, collect intraprocedural edge counters
• Defer compilation of blocks that profile data says were not reached
• If a deferred block is reached:
  - Trigger OSR and deoptimize
  - Invalidate the compiled code
• Modify the analytic recompilation model:
  - Promotion from baseline to optimized code
  - Compile-time cost estimate modified according to profile data
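As a sketch of the block-deferral decision those edge counters drive (the counter representation and the zero-count threshold are assumptions, not the actual Jikes RVM code):

  import java.util.BitSet;
  import java.util.Map;

  // Hedged sketch: mark basic blocks "uncommon" when baseline edge
  // counters show they were never reached; the optimizer then emits an
  // OSR trap for each uncommon block instead of compiling it.
  class DeferralPolicy {
      BitSet uncommonBlocks(Map<Integer, Long> edgeCounts, int numBlocks) {
          BitSet uncommon = new BitSet(numBlocks);
          for (int b = 0; b < numBlocks; b++) {
              long count = edgeCounts.getOrDefault(b, 0L);
              if (count == 0) {
                  uncommon.set(b);  // never reached: defer compilation
              }
          }
          return uncommon;
      }
  }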