
Cross-Module Optimization

Thomas Lindgren, [email protected]

Overview

• OM - optimization manager
  – Erlang-to-Erlang optimizer (mostly)
  – ~20k lines of Erlang
  – intended to accelerate large applications
• The rest of this talk
  – What does OM do?
  – How well does it work?

OM overview

[Pipeline diagram: source code is instrumented with profiling code and run as a training executable; the collected counters are aggregated and attached to annotation trees; per-module optimizations (higher-order elimination, apply open-coding, outlining, module splitting) are applied, followed by aggregation with other modules, inlining, and simplification, producing the production executable.]

Profiling and annotation

• Instrument code with profiling counters (see the sketch below)
  – standard counters (per function clause, per call site, …)
  – which modules call each other, how often
  – which function is used at apply

• Annotations saved as syntax trees + counters

• Post-training: read counters, decorate annotation trees, optimize the result
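
The talk does not show the instrumentation itself. As a rough sketch, assuming a hypothetical om_profile helper backed by an ETS table (the names and the counter mechanism are assumptions, not from the talk), an instrumented clause might bump a per-clause counter like this:

    %% Hypothetical profiling helper; storage and key format are assumptions.
    -module(om_profile).
    -export([init/0, bump/1, read/1]).

    init() ->
        ets:new(om_counters, [named_table, public, set]).

    bump(Key) ->
        %% one counter per function clause / call site, keyed by e.g.
        %% {Module, Function, Arity, ClauseNo}
        ets:update_counter(om_counters, Key, 1, {Key, 0}).

    read(Key) ->
        case ets:lookup(om_counters, Key) of
            [{Key, N}] -> N;
            []         -> 0
        end.

    %% An instrumented clause (illustrative):
    %% area({square, S}) -> om_profile:bump({shapes, area, 1, 1}), S * S.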

Per-module optimizations

• Higher-order elimination: replace lists:map, lists:foldl, and others with specialized functions where suitable

• Apply open-coding: replace apply with explicit (open-ended) switch

• Outlining: cold (seldom-executed) clauses are moved out-of-line

• Module splitting: cold code moved into new module

Higher-order elimination

Call (before):

    lists:map(fun(X) -> X + Y end, Xs)

Call (after):

    lists_map_0(Xs, Y)

Specialized function:

    lists_map_0([X | A], Y) -> [X + Y | lists_map_0(A, Y)];
    lists_map_0([], _Y)     -> [].

(The equivalent is done for most functions in lists.)
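
For example, a call such as lists:foldl(fun(X, Acc) -> Acc + X end, 0, Xs) could be specialized the same way; an illustrative sketch, not taken from the talk:

    %% Hypothetical specialization of the foldl call above.
    lists_foldl_0([X | A], Acc) -> lists_foldl_0(A, Acc + X);
    lists_foldl_0([], Acc)      -> Acc.

    %% The call site becomes: lists_foldl_0(Xs, 0)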


Apply open-coding

• apply(M, F, [A1, …, An])
• Profiling reveals that certain {Mod, Func, Arity} tuples are most common
• Switch on likely functions
• Enables inlining of the explicit call (e.g., m1:f1(A1,A2))

    case {M, F, length(As)} of
        {m1, f1, 2} ->
            [A1, A2] = As,
            m1:f1(A1, A2);
        _ ->
            apply(M, F, As)
    end

(This is the most general case; further optimization is possible when the arity is known, when the call is local, …; see the sketch below.)
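
As one illustration of the known-arity case, assuming the call site is apply(M, F, [A1, A2]) with the argument list built in place (an illustrative sketch):

    %% Arity is known statically, so the length/1 test and the list
    %% deconstruction disappear.
    case {M, F} of
        {m1, f1} -> m1:f1(A1, A2);
        _        -> apply(M, F, [A1, A2])
    end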


Outlining

• Move cold function clauses, switch clauses, ... out-of-line

• Reduces function size => more inlining possible
  – outlining + inlining = (structured) partial inlining
• Sometimes improves pattern-matching code

Before:

    case read_file(FD, Len) of
        {error, closed}          -> …;
        {error, prot}            -> …;
        {ok, {incomplete, Data}} -> …;
        {ok, {complete, Data}}   -> …;
        X -> ...
    end

After:

    case read_file(FD, Len) of
        {ok, {complete, Data}} -> …;
        Else -> 'OUTLINED'(Else)
    end
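
A sketch of the cold function introduced by the rewrite above; the clause bodies are elided in the slide and stay elided here:

    %% Cold clauses moved out-of-line into 'OUTLINED'/1 (bodies elided).
    'OUTLINED'({error, closed})          -> …;
    'OUTLINED'({error, prot})            -> …;
    'OUTLINED'({ok, {incomplete, Data}}) -> …;
    'OUTLINED'(X)                        -> ... .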


Module splitting

• Hot code retained in original module
• Cold functions moved into "cold module"
  – currently: duplicate entire original module
• Calls to cold functions re-routed to cold module (sketched below)
  – outlined function clauses often end up in the cold module
• Benefit: reduces hot module size => more aggregation
  – drawback: total code size increases (unimportant?)
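
A minimal sketch of the re-routing, with hypothetical modules m (hot) and m_cold and functions f/1 (hot) and g/1 (cold); all names are illustrative, not from the talk:

    %% Hot module: f/1 stays; its call to the cold g/1 is re-routed.
    -module(m).
    -export([f/1]).

    f(X) when is_integer(X) -> X + 1;        % hot clause
    f(X)                    -> m_cold:g(X).  % re-routed call to cold code

    %% Cold module (currently a duplicate of the whole original module;
    %% only the cold entry point is shown).
    -module(m_cold).
    -export([g/1]).

    g(X) -> {cold, X}.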

Aggregation

• Optimization across module boundaries
  – but in Erlang, any module can be replaced at any time ("hot code loading")
• Merge optimized hot modules into aggregates
  – optimize each aggregate aggressively
  – but in Erlang you can replace any module at runtime
  – how to do it?

Hot code loading

• Remote calls m:f(X) logically do the following:
  – lookup module named m
  – lookup function named f/1 in the found module
  – call the found function
• A new version of m can be loaded at any time
  – but occurs seldom in practice (every month? week?)
  – (an aside: OTP further structures code replacement)
• we do not take advantage of this

Hot code loading (2)

• Inlining of remote calls is not possible
  – what if the inlined module subsequently changes?
  – worse, remote calls are very common
• Merging two modules into one is problematic
  – making remote calls into local calls changes behaviour
  – safe approach: speculate that the code has not changed

Hot code loading (3)

• Remote call is rewritten into test + common-case local call + backup remote call

• latest(m) can be implemented in the linker
  – initially, always true
  – when a new m is loaded, becomes always false

Before:

    m:f(X1, X2)

After:

    case latest(m) of
        true  -> local_f(X1, X2);
        false -> m:f(X1, X2)
    end
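
The talk implements latest(m) in the linker (and the measurements later simulate it with a cheap test). Purely as an illustration of the interface, a simulation at the Erlang level might use an ETS flag table; names and mechanism are assumptions:

    %% Illustrative simulation only; the real test lives in the linker.
    init() ->
        ets:new(om_latest, [named_table, public, set]),
        ets:insert(om_latest, {m, true}).      % speculation valid for m

    latest(Mod) ->
        case ets:lookup(om_latest, Mod) of
            [{Mod, true}] -> true;             % still the trained version
            _             -> false             % new version loaded / unknown
        end.

    %% A code-loading hook would flip the flag:
    %% ets:insert(om_latest, {Mod, false}).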

Aggregation

• Merge modules that call each other often
  – use the module-module call profile
  – remote calls are rewritten to use latest(m)
  – aggregation limited by size
• Widely-shared modules (e.g., lists) are engulfed
  – copy the engulfed module into the calling module
  – necessary to enable high-quality aggregation without huge aggregates

Post-aggregation optimization

• Profile-guided inlining
  – consider call sites in order of importance (# of calls)
  – total amount of inlining limited by code-size increase
  – avoids pitfalls of static inlining: working on the wrong code, too conservative for important sites
• Simplification of resulting code
  – dead-function removal (occurs due to engulfing, inlining)
  – case-of-case, beta reduction, ... (illustrated below)
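
As an illustration of the case-of-case simplification (a generic sketch, not taken from the talk):

    %% Before: shape typically produced by inlining, where an inner case
    %% feeds an outer case.
    classify(Xs) ->
        case (case Xs of
                  [] -> empty;
                  _  -> nonempty
              end) of
            empty    -> 0;
            nonempty -> length(Xs)
        end.

    %% After case-of-case: the outer branches move into the inner case
    %% (rewritten version of the same function, shown separately).
    classify(Xs) ->
        case Xs of
            [] -> 0;
            _  -> length(Xs)
        end.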

Results

• Benchmarks: important subsystems of OTP, in daily use
  – decode1: protocol-processing "inner loop"
  – beam: beam compiler on lists.erl
  – gen_tcp: small messages over a local socket
  – ldapv2: encoding and decoding LDAPv2 ASN.1 PDUs
  – mnesia: realtime database running a simple pseudo-HLR

• Benchmark suite freely available from author

Results (2)

• Each benchmark compiled with OM
  – same input used for training and production
  – latest(m) simulated with a cheap test
• Each benchmark run 30-40 times for baseline and optimized versions
  – outliers removed for gen_tcp and mnesia to get more focussed speedup values

Results (3)

              speedup   notes
  decode1     1.12      due to outlining
  beam        3.96      h-o elim 2.94x
  gen_tcp     2.54      (2.15 w/ outliers)
  ldapv2      1.01
  mnesia      1.17      (1.28 w/ outliers)

Conclusions

• Optimization across modules beneficial
• Profile-driven optimization practical and beneficial
• Future work:
  – try real applications (100s-1000s of modules)
  – more optimizations
  – tune optimizations
  – automate reprofiling/recompilation