PEAK – A FAST AND EFFECTIVE PERFORMANCE TUNING SYSTEM VIA
COMPILER OPTIMIZATION ORCHESTRATION
A Thesis
Submitted to the Faculty
of
Purdue University
by
Zhelong Pan
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
May 2006
Purdue University
West Lafayette, Indiana
To my wife Xiaojuan.
ACKNOWLEDGMENTS
I would like to first thank my advisor, Rudi Eigenmann, for his support and advice
during my many days here at Purdue. It has been a true pleasure to work with him. I
would also like to thank the other professors at Purdue who have taught and advised
me, especially my Ph.D. committee members: Sam Midkiff, T.N. Vijaykumar, and
Zhiyuan Li.
Our research group at Purdue has had its share of good people. My thanks
go to Brian Armstrong, Seung-Jai Min, Hansang Bae, Troy Johnson, Xiaojuan Ren,
Sang-Ik Lee, Ayon Basumallik, and Yili Zheng. I would especially like to thank Brian
Armstrong for answering many of my questions on the Polaris compiler, LaTeX tools,
and even English.
I could never have done this without my family’s support and encouragement.
To my parents and my brother go my deepest love and gratitude. Of course, it
is my wife, Xiaojuan, that sacrificed the most for me. You are what makes it all
worthwhile.
And finally, I must thank all my friends and the Purdue faculty and staff, who made
my study and research at Purdue a joyful and meaningful journey.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and Introduction . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 FAST AND EFFECTIVE OPTIMIZATION ORCHESTRATION ALGORITHMS . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Orchestration Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Problem description . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Orchestration algorithms . . . . . . . . . . . . . . . . . . . . . 13
2.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Experimental environment . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Upper Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 The General Combined Elimination Algorithm . . . . . . . . . . . . . 33
2.5.1 Experimental results on SUN Forte compilers . . . . . . . . . 34
3 FAST AND ACCURATE RATING METHODS . . . . . . . . . . . . . . . 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Rating Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Context Based Rating (CBR) . . . . . . . . . . . . . . . . . . 40
3.2.2 Model Based Rating (MBR) . . . . . . . . . . . . . . . . . . . 43
3.2.3 Re-execution Based Rating (RBR) . . . . . . . . . . . . . . . 45
3.3 The Use of Rating Methods in PEAK . . . . . . . . . . . . . . . . . . 48
3.4 Evaluation on Rating Accuracy . . . . . . . . . . . . . . . . . . . . . 50
4 TUNING SECTION SELECTION . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Profile Data for Selecting Tuning Sections . . . . . . . . . . . . . . . 55
4.3 A Formal Description of the Tuning Section Selection Problem . . . . 56
4.4 The Tuning Section Selection Algorithm . . . . . . . . . . . . . . . . 58
4.4.1 Dealing with recursive functions . . . . . . . . . . . . . . . . . 59
4.4.2 Maximizing tuning section coverage under Nlb . . . . . . . . . 61
4.4.3 The final tuning section selection algorithm . . . . . . . . . . 66
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5.1 SPEC CPU2000 FP benchmarks . . . . . . . . . . . . . . . . 69
4.5.2 SPEC CPU2000 INT benchmarks . . . . . . . . . . . . . . . . 70
5 THE PEAK SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Design of PEAK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 The steps of automated performance tuning . . . . . . . . . . 74
5.2.2 Dynamic code generation and loading . . . . . . . . . . . . . . 77
5.3 An Example of Using PEAK . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.1 Tuning time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.2 Tuned program performance . . . . . . . . . . . . . . . . . . . 86
5.4.3 Integer benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 87
6 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 91
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Performance analysis on compiler optimizations . . . . . . . . 92
6.2.2 Other tuning problems . . . . . . . . . . . . . . . . . . . . . . 93
6.2.3 Adaptive performance tuning . . . . . . . . . . . . . . . . . . 94
6.2.4 Program debugging . . . . . . . . . . . . . . . . . . . . . . . . 94
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
APPENDIX: PERFORMANCE OF GCC OPTIMIZATIONS . . . . . . . . . 102
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
LIST OF TABLES
Table Page
2.1 Orchestration algorithm complexity (n is the number of optimization options.) . . . 20
2.2 Optimization options in GCC 3.3.3 . . . . . . . . . . . . . . . . . . . . . 21
2.3 Mean performance on SPARC II. CE achieves both fast tuning speed and high program performance on SPARC II as well. . . . 28
2.4 Upper bound analysis under four different machine and benchmark settings . . . 32
2.5 Optimization flags orchestrated by GCE . . . . . . . . . . . . . . . . . . 35
3.1 Rating accuracy for selected tuning sections . . . . . . . . . . . . . . . . 51
4.1 Tuning section selection for mgrid. The best Nlb is 400. The optimal coverage and Nmin are 0.957 and 2000. . . . 67
4.2 Selected tuning sections in SPEC CPU2000 FP benchmarks. (Three manually partitioned benchmarks are annotated with ‘*’. The last row, wupwise+, uses a smaller Tlb = 1μsec.) . . . 69
4.3 Selected tuning sections in SPEC CPU2000 INT benchmarks. (The benchmarks annotated with ‘+’ use smaller Tlb’s.) . . . 72
A.1 Average speedups of the optimization levels, relative to O0. In each entry, the first number is the arithmetic mean, and the second one is the geometric mean. The averages without art are put in parentheses for the floating point benchmarks on the Pentium IV machine. . . . 105
LIST OF FIGURES
Figure Page
2.1 Normalized tuning time of five optimization orchestration algorithms for SPEC CPU2000 benchmarks on Pentium IV. Lower is better. CE has the shortest tuning time in all except a few cases. In all those cases, the extended tuning time leads to higher performance. . . . 23
2.2 Program performance achieved by five optimization orchestration algorithms relative to the highest optimization level “O3” for SPEC CPU2000 benchmarks on Pentium IV. Higher is better. In all cases, CE performs the best or within 1% of the best. . . . 24
2.3 Overall comparison of the orchestration algorithms. CE achieves both fast tuning speed and high program performance. . . . 29
2.4 Total negative effects of all the GCC 3.3.3 O3 optimization options . . . 30
2.5 Upper bound analysis on Pentium IV. ES 6: exhaustive search with 6 optimizations; CE 6: combined elimination with 6 optimizations; CE 38: combined elimination with 38 optimizations; BE 6: batch elimination with 6 optimizations. CE 6 achieves nearly the same performance as ES 6, in all cases. CE 38 performs better. (Exhaustive search with 38 optimizations would be infeasible.) BE 6 is much worse than CE 6. CE 6 is about 4 times faster than ES 6. . . . 31
2.6 Program performance achieved by the GCE algorithm vs. the performance of the manually tuned results (peak setting). Higher is better. In all cases, GCE achieves equal or better performance. On average, GCE nearly doubles the performance. . . . 36
3.1 Pseudo code of context variable analysis . . . . . . . . . . . . . . . . . . 42
3.2 A simple example of MBR . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Basic Re-execution-based rating method (RBR) . . . . . . . . . . . . . . 46
3.4 Improved Re-execution-based rating method . . . . . . . . . . . . . . . . 48
4.1 An example of tuning section selection. The graph is a call graph with node a as the main function. The weights on an edge are the number of invocations and the execution time in parentheses. The optimal edge cut is (Θ = {a, c}, Ω = {b, d, e, f}), shown by the dashed curve. Edges (a, b) and (c, f) are chosen as the S set. Edge (c, e) in the cut (Θ, Ω) is not included in S, because its average execution time, 1/20000, is less than Tlb = 1e−4. There are two tuning sections, led by node b and node f: T = {b, f}. The numbers of invocations to b and f are 1000 and 200, respectively, so Nmin = 200. The coverage of this optimal tuning section selection is (80+18)/100 = 0.98, where the total execution time, Ttotal, is 100. . . . 57
4.2 The pseudo code for call graph simplification. The algorithm generates a call graph from profile data, and detects and discards recursive calls. Hence, the call graph is simplified to a directed acyclic graph. . . . 60
4.3 An example of call graph simplification. The graph is a call graph with node a as the main function. c is a self-recursive function. b and e recursively call each other. The weights on an edge are the number of invocations and the execution time in parentheses. After simplification, the loop at node c is discarded. The strongly connected component {b, e} is merged into one node be. The entry node b for this strongly connected component is kept. A new edge (b, be) is added. Edges (b, f) and (e, f) are merged to (be, f). The profile data on edges (b, be) and (be, f) are updated. . . . 61
4.4 Tuning section selection algorithm to maximize program coverage under the lower bound on the number of TS invocations, Nlb. This algorithm traverses the simplified call graph from the top down to find the code sections whose numbers of invocations are greater than Nlb. In addition, the algorithm finds the functions that may be manually partitioned to improve tuning section coverage. . . . 63
4.5 Update of the profile data after vi is chosen as the entry function to a tuning section. The updated profile reflects the execution times and invocation numbers after excluding the chosen tuning section. . . . 65
4.6 The final tuning section selection algorithm. This algorithm achieves both a large Nmin and a high coverage. It iteratively uses the method shown in Figure 4.4 to maximize the tuning section coverage under a series of thresholds Nlb, until the optimal Nlb is found. . . . 68
5.1 Block diagram of the PEAK performance tuning system . . . . . . . . . 75
5.2 An example of the tuning section calc1 in swim . . . . . . . . . . . . . . 79
5.3 The tuning section calc1 instrumented by the PEAK compiler . . . . . . 80
5.4 The initialization function instrumented by the PEAK compiler . . . . . 81
5.5 The exit function instrumented by the PEAK compiler . . . . . . . . . . 82
5.6 Normalized tuning time of the whole-program tuning and the PEAK system for SPEC CPU2000 FP benchmarks on Pentium IV. Lower is better. On average, PEAK gains a speedup of 20.3. . . . 84
5.7 Tuning time percentage of the six stages for SPEC CPU2000 FP benchmarks on Pentium IV. (TSS: tuning section selection, RMA: rating method analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning, FVG: final version generation.) The most time-consuming steps are PT, TSS and RMA. . . . 85
5.8 Program performance improvement relative to the baseline under O3 for SPEC CPU2000 FP benchmarks on Pentium IV. Higher is better. All the benchmarks use the train dataset as the input to the tuning process. Whole Train (PEAK Train) is the performance achieved by the whole-program tuning (the PEAK system) under the train dataset. Whole Ref and PEAK Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning. PEAK achieves equal or better program performance than the whole-program tuning. . . . 87
5.9 PEAK tuning time for INT benchmarks on Pentium IV . . . . . . . . . . 89
5.10 Program performance improvement relative to the baseline under O3 for SPEC CPU2000 INT benchmarks on Pentium IV. Higher is better. All the benchmarks use the train dataset as the input to the tuning process. Whole Train (PEAK Train) is the performance achieved by the whole-program tuning (the PEAK system) under the train dataset. Whole Ref and PEAK Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning. . . . 90
A.1 Execution time of SPEC CPU 2000 benchmarks under different optimization levels compiled by GCC. (Four floating point benchmarks written in f90 are not included, since GCC does not compile them.) Each benchmark has four bars for O0 to O3. (a) and (c) show the integer benchmarks; (b) and (d) show the floating point benchmarks. (a) and (b) are the results on a Pentium IV machine; (c) and (d) are the results on a SPARC II machine. . . . 104
A.2 Relative improvement percentage of all O3 optimizations. . . . . . . . . . 107
A.3 Relative improvement percentage of all O3 optimizations. . . . . . . . . . 108
A.4 Relative improvement percentage of all O3 optimizations: sixtrack on a Pentium IV machine. . . . 109
A.5 Relative improvement percentage of strict aliasing. . . . . . . . . . . . . 110
A.6 Relative improvement percentage of global common subexpression elimination. . . . 112
A.7 Relative improvement percentage of if-conversion. . . . . . . . . . . . . . 113
ABSTRACT
Pan, Zhelong. Ph.D., Purdue University, May, 2006. PEAK – A Fast and Effective Performance Tuning System via Compiler Optimization Orchestration. Major Professor: Rudolf Eigenmann.
Compile-time optimizations generally improve program performance. Nevertheless,
degradations caused by individual compiler optimization techniques are to be
expected. Feedback-directed optimization orchestration solutions generate optimized
code versions under a series of optimization combinations, evaluate their performance
and search for the best version. One challenge to such systems is to tune program
performance quickly in an exponential search space. Another challenge is to achieve
high program performance, considering that optimizations interact.
The PEAK system in this thesis is an automated performance tuning system,
which searches for the best compiler optimization combinations for important code
sections in a program. It achieves fast tuning speed and high program performance.
The following contributions are made in this work: (1) An algorithm called Combined
Elimination (CE) is developed to explore the optimization space quickly and
effectively. (2) Three fast and accurate rating methods are designed to evaluate the
performance of an optimized code section based on a partial execution of the program.
(3) An algorithm is developed to identify important code sections as candidates for
performance tuning, trading off tuning speed and tuned program performance.
CE improves performance by 6.01% over GCC O3 for SPEC CPU2000, while
reducing tuning time to 57% of the closest alternative algorithm. Using SUN Forte
compilers, CE improves performance by 10.8%, compared to 5.6% improved by
manual tuning. Applying the rating methods, PEAK reduces tuning time further from
2.19 hours to 5.85 minutes, while achieving equal or better program performance.
1. INTRODUCTION
1.1 Motivation and Introduction
Although compiler optimizations generally yield significant performance
improvements in many programs on modern architectures, the potential for performance
degradation in certain program patterns is known to compiler writers and many
programmers. The state of the art is to let programmers deal with this problem through
compiler options. For example, a programmer can switch off an optimization after
finding that it causes performance degradation. The presence of these options
reflects the inability of today’s compilers to make the best optimization decision at
compile time. In this thesis, we refer to this process of finding the best optimization
combination for a target program as optimization orchestration. The large number
of compiler optimizations, the complicated interactions between optimizations, the
sophistication of computer architectures, and the complexity of the program itself
make this optimization orchestration problem difficult to solve.
This thesis develops the PEAK (Program Evolution by Adaptive Compilation)
system to automate performance tuning. It aims at orchestrating compiler
optimizations for a given scientific program, in a fast and effective way. PEAK adopts a
feedback-directed approach to performance tuning. It generates a series of
experimental versions, compiled under different optimization combinations, for every important
code segment. We call these code segments tuning sections. The performance of each
experimental version is rated based on a partial execution of the program, i.e., a few
invocations of the tuning section under a training input. Iteratively, our
orchestration algorithm chooses the next experimental optimization combinations, based on
these performance ratings, until convergence criteria are satisfied. In the end, the
final tuned program will take the best version found for each tuning section.
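The tuning loop just described can be sketched as follows. This is an illustrative sketch, not PEAK's actual code: `propose`, `compile_version`, and `rate` are hypothetical stand-ins for the orchestration algorithm, the compiler invocation, and the partial-execution rating step, respectively.

```python
def tune_section(propose, compile_version, rate):
    """Feedback-directed tuning of one tuning section (illustrative sketch).

    propose(ratings) returns the next optimization combination to try,
    or None once its convergence criteria are satisfied.
    compile_version(combo) builds an experimental version of the section.
    rate(version) scores that version from a few invocations under the
    training input (a lower rating means faster code).
    """
    ratings = {}
    while True:
        combo = propose(ratings)     # next experiment, chosen from feedback
        if combo is None:            # convergence: no more experiments
            break
        version = compile_version(combo)
        ratings[combo] = rate(version)
    # The final tuned program takes the best version found.
    return min(ratings, key=ratings.get)
```

For instance, with a proposer that simply enumerates the optimization levels O1, O2, and O3, `tune_section` returns whichever level rated fastest.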
To achieve the two goals of high program performance and fast tuning speed, this
thesis develops fast and effective orchestration algorithms to explore the optimization
space [1] and fast and accurate rating methods based on a partial execution of the
program to speed up performance evaluation [2]. Compiler tools are implemented to
analyze and to instrument the target program automatically. Specifically, this thesis
makes the following contributions:
1. An optimization orchestration algorithm is developed to explore the
optimization space quickly and effectively. This algorithm achieves equal or better
performance than the other comparable algorithms, but with less tuning time (57%
of the closest alternative).
2. Three accurate and fast performance rating methods based on a partial
execution of the program are designed to improve rating accuracy and to reduce
tuning time. These rating methods reduce tuning time from several hours to
several minutes, while achieving equal or higher program performance.
3. An algorithm to select important code segments as candidates for performance
tuning is presented. The selected tuning sections cover most (typically more
than 90%) of the total execution time. Each of them is invoked many times
(typically several hundred) in one run of the program. A high execution time
coverage leads to high tuned program performance; a large number of
tuning-section invocations leads to fast tuning, because it means that many optimized
versions can be evaluated in one run of the program.
4. An automatic performance tuning system via optimization orchestration is
implemented. The PEAK compiler analyzes and instruments the source program
before the tuning phase. The PEAK runtime system explores the optimization
space and evaluates the performance of optimized versions automatically.
5. Optimization orchestration performance is measured and analyzed
comprehensively for GCC and the SUN Forte compilers, with a focus on GCC. The
experiments are done for SPEC CPU2000 benchmarks on a Pentium IV machine
and a SPARC II machine.
1.2 Related Work
One attempt to alleviate the problem of optimization orchestration is to
improve compiler optimizations so as to reduce their potential performance
degradation. However, this approach has two problems: (1) Compilers can hardly consider
the interaction between all the optimizations, given the large number of
optimizations and the complexity of their interactions; (2) The compile-time performance
models used in the compiler are limited by the unavailability of program input data
and insufficient knowledge of the target architecture. Many projects try to improve
optimization performance in these two aspects, using either compile-time or runtime
techniques.
Some try to combine a few optimizations into one big pass, which considers the
interactions between the optimizations. For example, Wolf, Maydan and Chen [3]
develop an algorithm that applies fission, fusion, tiling, permutation and outer loop
unrolling to optimize loop nests. Similarly, Click and Cooper [4] show that combining
constant propagation, global value numbering, and dead code elimination leads to
more optimization opportunities. Nevertheless, it would be very difficult, or even
impossible, to combine all optimizations into one pass that removes all possible
performance degradation, because of the complexity of the compilation task and the
complicated interactions between optimizations.
Some postpone optimizations until runtime, when accurate knowledge about the
target architecture and the program input can be supplied to the compiler.
1. Similar to JVM JIT compilers [5–8], several projects aim at achieving both
portability and performance. DCG [9] proposes a retargetable dynamic code
generation system; VCODE [10] provides a machine-independent interface for
native machine code generation; DAISY [11] generates VLIW code on-the-fly
to emulate the existing architectures on a VLIW architecture.
2. Several projects generate optimized code using runtime input via Runtime
Specialization [12]. RCG and Fabius [13, 14] automatically translate ML
programs into code that generates native binary code at runtime. Calpa [15]
and DyC [16] form a staged compiler [17]: Calpa annotates the program at
compile-time; DyC creates a runtime compiler from the annotated program;
this runtime compiler in turn generates the executable using runtime values.
3. Some try to re-optimize binaries based on runtime information. Dynamo [18,
19] (for HP workstations) and DynamoRIO [20] (for IA-32 machines) find and
optimize hot traces for statically generated native binaries at runtime.
Similarly, in [21–23], techniques are developed to detect hot spots and to generate
new traces for runtime optimization via hardware support. ADAPT [24, 25]
compiles code intervals under different optimization configurations at runtime,
and chooses the best version for each interval. Continuous Program
Optimization [26] continually adjusts the storage layouts of dynamic data structures to
enhance the data locality and re-schedules the instructions based on runtime
profiling.
4. Besides the above runtime compilation systems, some develop specific runtime
optimization techniques. For example, a runtime data and iteration
reordering technique is applied to improve data locality in [27]. LRPD test [28, 29]
speculatively executes candidate loops in a parallel form, and re-executes the
loops serially when some data dependence is detected at runtime. In [30], Rus,
Rauchwerger and Hoeflinger analyze the memory references at both
compile-time and runtime to help parallelization. Runtime path profiling is proposed
to help compiler optimizations in [31]. Dynamic Feedback [32] produces
several versions under different synchronization optimization policies and
automatically chooses the best version by periodically sampling the performance of
each version.
The above techniques push the optimizations to runtime, so they need to reduce
or amortize the additional overhead introduced to program execution. Some systems
work on a slow baseline; therefore, there is an opportunity to amortize the overhead
by performance improvement from the optimizations. For example, Fabius [13, 14]
improves ML code, JIT [5–8] improves byte code, and Dynamo [18, 20] improves
un-optimized library code. Some [24, 25] off-load the compilation job to another
processor. Some [12–17] generate a small code generator before the production run,
so as to reduce the compilation overhead at runtime. Some [21–23] use hardware to
reduce the profiling overhead.
The goal of this thesis is to find the best compiler optimization combination
for scientific programs. These programs are mostly written in C or Fortran, and
the baseline, compiled under the default optimization setting, is already fast.
Meanwhile, a large number of optimizations (38 GCC optimizations in our experiments)
are involved. In this case, the performance improvement from optimization
orchestration can hardly amortize the tremendous compilation overhead, so solving
this orchestration problem at runtime can hardly yield a net performance
improvement. Instead, similar to profile-based optimizations, this thesis tunes
program performance under a training input, and the optimized final version is
then used at runtime. The tuning process can also be applied in between the
production runs.
This thesis adopts the commonly used feedback-directed optimization approach.
In this approach, many different binary code versions generated under different
experimental optimization combinations are evaluated. The performance of these
versions is compared using either measured execution times or profile-based estimates.
Iteratively, the orchestration algorithms use this information to decide the next
experimental optimization combinations, until convergence criteria are reached. In the
end, the optimization orchestration algorithm gives the final optimal version for the
entire program or important code sections in the program.
Many performance tuning systems use this feedback-directed approach. For
example, ATLAS [33] generates numerous variants of matrix multiplication to search
for the best one for a specific target machine. Similarly, Iterative Compilation [34]
searches through the transformation space to find the best block sizes and unrolling
factors. Meta optimization [35] uses machine-learning techniques to adjust several
compiler heuristics automatically.
The above three projects [33–35] have focused on a relatively small number of
optimization techniques, while this thesis tunes all optimizations that are controlled
by compiler options. All 38 GCC O3 optimization options are tuned in our exper-
iments. In other words, given a compiler and a program, this thesis tries to make
the best use of the compiler to generate the binary code with the best performance
for the program.
Several projects target the same optimization orchestration problem as this
thesis. The Optimization-Space Exploration (OSE) compiler [36] defines sets of
optimization configurations and an exploration space, which is traversed to find the
best configuration for the program using compile-time performance estimates as
feedback. Statistical Selection (SS) in [37] uses orthogonal arrays [38] to compute
the performance effects of the optimizations based on a statistical analysis of profile
information, which, in turn, is used to find the best optimization combination.
Compiler Optimization Selection [39] applies fractional factorial design to optimize the
selection of compiler options. Option Recommendation [40] chooses the PA-RISC
compiler options intelligently for an application, using heuristics based on
information from the user, the compiler, and the profiler. (Different from finding the best
optimization combination, Adaptive Optimizing Compiler [41] uses a biased random
search to discover the best order of optimizations.¹)
Still, there are two unsolved major issues regarding optimization orchestration.
1. How do we search through the optimization space quickly and effectively, given
that the search space is huge due to the interactions between compiler
optimizations? Complex algorithms, such as the exhaustive search in ATLAS [33]
and the automatic theorem prover in Denali [42], would be prohibitively slow
to solve this problem for a real scientific application. Also, this thesis looks for
a general solution, not heuristics for a special environment as in [40]. (We will
use OSE [36] and SS [37] as reference points.)
¹ Usually, a compiler does not have the option to specify the order of the optimizations. So, this thesis does not compare to [41], although the techniques developed in this thesis can be extended to search for the best order of optimizations.
2. How do we evaluate the optimized versions quickly and accurately? The most
accurate method is to use the real execution time to evaluate the performance
of the experimental versions. However, given the large number of experimental
versions and the execution time of a real application, this method leads to
excessive tuning times. The performance model used in [36] is fast; however,
it achieves significantly less program performance than the former method.
The goal of our PEAK system is to tune the performance of the important code
sections in a scientific program, in a fast and effective way, via orchestrating the
optimizations controlled by compiler options. PEAK is an automated system,
targeting the above two issues. First, a fast and effective feedback-directed algorithm
is designed to search through the optimization space, considering the interaction
between the optimizations. Second, PEAK uses fast and accurate performance
evaluation methods based on a partial execution of the program.
1.3 Thesis Organization
Chapter 2 presents a fast and effective search algorithm named Combined
Elimination (CE), which considers the interaction between optimizations. In the
experiments with 38 GCC optimizations on Pentium IV and SPARC II, this algorithm
takes the least tuning time, while achieving the same program performance as other,
comparable algorithms. Through orchestrating a small set of optimizations causing
the most degradation, we show that the performance achieved by CE is close to the
upper bound obtained by an exhaustive search algorithm. The gap is less than 0.2%
on average. Experiments on the SUN Forte compilers show that CE achieves
performance significantly better than the manually tuned peak performance presented
in the SPEC CPU2000 result reports.
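Chapter 2 defines CE precisely; the elimination idea behind it can be sketched as follows, assuming a `measure(enabled)` timing harness (a hypothetical stand-in for compiling the program under the given option set and timing it). This is a simplified sketch, not the exact algorithm of Chapter 2:

```python
def combined_elimination(options, measure):
    """Iteratively switch off options whose removal speeds up the program.

    measure(enabled) returns the runtime under the given set of enabled
    options (lower is better).  After each elimination, the remaining
    candidates are re-checked against the new baseline, so interactions
    between options are partially accounted for.
    """
    enabled = set(options)
    base = measure(enabled)
    while True:
        # RIP: relative improvement percentage of switching one option off.
        rips = {opt: (measure(enabled - {opt}) - base) / base * 100.0
                for opt in sorted(enabled)}
        harmful = sorted((r, o) for o, r in rips.items() if r < 0)
        if not harmful:                    # no enabled option hurts any more
            return enabled, base
        _, worst = harmful[0]              # drop the most harmful option first
        enabled.discard(worst)
        base = measure(enabled)
        for _, opt in harmful[1:]:         # re-check the rest vs. new baseline
            t = measure(enabled - {opt})
            if t < base:
                enabled.discard(opt)
                base = t
```

With a synthetic `measure` in which option a saves 10 time units while b and c cost 5 and 3, the sketch ends with only a enabled.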
Chapter 3 proposes fast and accurate rating methods to evaluate the performance
of optimized versions. These rating methods operate on important code sections,
called tuning sections, of a program. The rating for one optimized version of a
tuning section is generated based on the execution times of several invocations to
the version. In one run of the program, there are many invocations to each tuning section, so multiple versions are evaluated in each run. In this way, this approach reduces the tuning time significantly. Meanwhile, the rating methods achieve a fair comparison by identifying invocations that have the same workload, by finding mathematical relationships between different workloads, or by forcing re-execution of a tuning section under the same input.
Chapter 4 develops an algorithm for selecting the important code sections in a
program as tuning sections. This algorithm maximizes the number of invocations to
the tuning sections and their execution time coverage. In this way, the tuning section
selection algorithm aims at both tuning speed and tuned program performance.
Chapter 5 shows the design of our automated performance tuning system –
PEAK. This chapter discusses two primary components, the PEAK compiler and
the PEAK runtime system, as well as special implementation problems related to
runtime code generation and loading. The experimental results on SPEC CPU2000
are presented. On average, compared to whole-program tuning, PEAK reduces the tuning time from 2.19 hours to 5.85 minutes and improves the performance gain from 11.7% to 12.1% for FP benchmarks, by orchestrating optimizations for each tuning section.
Chapter 6 concludes this thesis. Future work is discussed as well.
The appendix discusses the performance behavior of all the GCC O3 optimization
options on the SPEC CPU2000 benchmarks, using a Pentium IV machine and a
SPARC II machine. The reasons for performance degradation are analyzed for several
important optimizations. One important finding is that optimizations may exhibit
unexpected performance behavior – even generally-beneficial techniques may degrade
performance. Degradations are often complex side-effects of the interaction with
other optimizations.
2. FAST AND EFFECTIVE OPTIMIZATION
ORCHESTRATION ALGORITHMS
2.1 Introduction
Compiler optimizations for modern architectures have reached a high level of so-
phistication. Although they yield significant improvements in many programs, the
potential for performance degradation in certain program patterns is known to com-
piler writers and many programmers. Today’s compilers have evolved to the point
where they present to programmers a large number of optimization options. For
example, GCC compilers include 38 options, roughly grouped into three optimiza-
tion levels, O1 through O3. On the other hand, compiler optimizations interact in
unpredictable manners, as many have observed [34,36,37,39,43]. How do we search
for the best optimization combination for a given program in order to achieve the
best performance? This chapter aims at developing a fast and effective algorithm to
do so. We call this process optimization orchestration. In this chapter, we apply
the algorithm to the entire program. From the next chapter on, we will apply it to
each important code segment of the program.
Several automatic performance tuning systems have taken a dynamic, feedback-
directed approach to orchestrate compiler optimizations. In this approach, many
different binary code versions generated under different experimental optimization
combinations are evaluated. The performance of these versions is compared
using either measured execution times or profile-based estimates. Iteratively, the
orchestration algorithms use this information to decide the next experimental opti-
mization combinations, until convergence criteria are reached.
The new algorithms presented in this thesis follow the above model. We first
develop two simple algorithms: (a) Batch Elimination (BE) identifies the harmful
optimizations and removes them in a batch. (b) Iterative Elimination (IE) succes-
sively removes harmful optimizations, measured through a series of program execu-
tions. Based on the above two algorithms, we design our final algorithm, Combined
Elimination (CE). We compare our algorithms with two algorithms proposed in the
literature: (i) The “compiler construction-time pruning” algorithm in Optimization-
Space Exploration (OSE) [36] iteratively constructs new optimization combinations
using “unions” of the ones in the previous iteration. (ii) Statistical Selection (SS)
in [37] uses orthogonal arrays [38] to compute the main effect of the optimizations
based on a statistical analysis of profile information, which in turn is used to find
the best optimization combination.
In addition to the above algorithms that we compare our work with, several other approaches have been proposed. Typically, they need hundreds of compilations and experimental runs, or more, when tuning a large number of optimizations (38 optimizations in our experiments). The goal of our algorithm is to reduce this number to several tens, while achieving comparable or even better program performance.
For the large number of benchmarks and optimizations experimented with in this thesis,
we can only apply the algorithms in [36] and [37], which are closest to our new
algorithm, CE, in terms of tuning time. To further verify that our CE algorithm
achieves program performance comparable to other existing algorithms, we use a
small set of optimizations and show that CE closely approaches the upper bound
represented by exhaustive search. The other existing algorithms are as follows.
In [39], a fractional factorial design is developed based on aliasing or confound-
ing [44]; it illustrates a half-fraction design with 2^(n−1) experiments. In [40], heuristics
are designed to select PA-RISC compiler options based on information from the user,
the compiler, and the profiler. While the use of a priori knowledge of the interaction
between optimization techniques may reduce the complexity of the search for the
best, it has been found by others [43] that the number of techniques that potentially
interact is still large. ATLAS [33] starts with a parameterized, hand-coded set of
matrix multiplication variants and evaluates them on the target machine to deter-
mine the optimum settings for that context. Similarly, Iterative Compilation [34]
searches through the transformation space to find the best block sizes and unrolling
factors. In more recent research [45], five different algorithms (genetic algorithm, simulated annealing, grid search, window search and random search) are used to find the best blocking and unrolling parameters. Based on the random search in [45], later work aims to find a general compiler optimization setting using GCC [46]. Meta op-
timization [35] uses machine-learning techniques to adjust the compiler heuristics
automatically.
Different from our goal of finding the best optimization combination is finding
the best order of optimization phases. In [41], a biased random search and a genetic algorithm are used to discover the best order of optimizations. Others have added hill climbing and greedy constructive algorithms [47]. Furthermore, the genetic algorithm has been improved to reduce its search time [48].
In this chapter, we make the following contributions:
• We present a new performance tuning algorithm, Combined Elimination (CE),
which aims at picking the best set of compiler optimizations for a program.
We show that this algorithm takes the shortest tuning time, while achieving
comparable or better performance than other algorithms. Using a small set
of (6) important optimizations, we also verify that CE closely approaches the
performance upper bound.
• We evaluate our and other algorithms on a large set of realistic programs. We
use all 23 SPEC CPU2000 benchmarks that are amenable to the GCC compiler
infrastructure (omitting 5 benchmarks, written in F90 and C++). By contrast,
many previous papers have used small kernel benchmarks. Among the papers
that used a large set of SPEC benchmarks are [2, 36,43].
• Our experiments use all (38) GCC O3 options, where the speed of the tuning
algorithm becomes of decisive importance. Except [36] and [46] that also use
a large number of optimizations, previous papers have generally evaluated a
small set of optimizations.
• Besides the GCC compiler, we apply our CE algorithm to the SUN Forte
compiler set as well, whose optimization options may have more than two values
instead of just “on” or “off”. CE achieves significantly better performance than
the peak setting in SPEC results [49], which we view as the manually tuned
results.
We apply the algorithms to tune the performance of SPEC CPU2000 benchmarks
on both a Pentium IV machine and a SPARC II machine. Using the full set of GCC
O3 optimizations, the average normalized tuning time, which will be defined formally
in Section 2.3.2, is 75.3 for our CE algorithm; 131.2 for the OSE algorithm (Algorithm 5 in Section 2.2); 313.9 for the SS algorithm (Algorithm 6). Hence, CE reduces
tuning time to 57% of the closest alternative. CE improves performance by 6.01%,
over O3, the highest optimization level; OSE by 5.68%; SS by 5.46%. (Compared to
unoptimized programs, performance improvement achieved by CE would amount to
56.4%, on average.)
In order to compare CE with the manually tuned performance reported in SPEC
results, we implement the algorithm using the Forte compiler set. The experiments
are conducted for SPEC CPU2000 benchmarks on a SPARC II machine. On average,
for floating point benchmarks, CE achieves 10.8% improvement relative to the base
setting, compared to 5.6% by the SPEC peak settings; for integer benchmarks, CE
achieves 8.1% compared to 4.1% by the SPEC peak settings.
The remainder of this chapter is organized as follows. 1 In Section 2.2, we
describe the orchestration algorithms that we use in our comparison. In Section 2.3,
we compare tuning time and tuned program performance of these algorithms under
38 optimizations. In Section 2.4, we compare the performance of CE with the upper
bound obtained using exhaustive search under a smaller set of optimizations. In
1The main work of this chapter has been published in [1].
Section 2.5, we extend the CE algorithm to handle non-on-off options, evaluate it on the Forte compiler, and compare it with the manually tuned results.
2.2 Orchestration Algorithms
2.2.1 Problem description
We define the goal of optimization orchestration as follows:
Given a set of compiler optimization options {F1, F2, ..., Fn}, find the combination
that minimizes the program execution time. Do this efficiently, without the use of
a priori knowledge of the optimizations and their interactions. (Here, n is the number
of optimizations.)
In this section, we give an overview of several algorithms that pursue this goal.
We first present the exhaustive search algorithm, ES. Then, we develop two of our
algorithms, BE and IE, on which our final CE method builds, followed by CE itself.
Next, we present two existing algorithms, OSE and SS, with which our algorithm
compares. Each algorithm makes a number of full program runs, using the resulting
run times as performance feedback for deciding on the next run. We keep the algo-
rithms general and independent of specific compilers and optimization techniques.
The algorithms tune the options available in the given compiler via command line
flags. Here, we focus on on-off options, similar to several of the papers [2,37,39,43].
In Section 2.5, we will extend our CE algorithm to handle non-on-off options.
2.2.2 Orchestration algorithms
Algorithm 1: Exhaustive Search (ES)
Due to the interaction of compiler optimizations, the exhaustive search approach,
which is called the factorial design in [37, 39], would try every optimization combi-
nation to find the best. This approach provides an upper bound of an application’s
performance after optimization orchestration. However, its complexity is O(2^n), which is prohibitive if a large number of optimizations are involved. For the 38 optimizations in our experiments, it would take up to 2^38 program runs – a million years
for a program that runs in two minutes. We will not evaluate this algorithm under
the full set of options. However, Section 2.4 will use a feasible set of (6) options
to compare our algorithm with this upper bound. Using pseudo code, ES can be
described as follows.
1. Get all 2^n combinations of the n options, {F1, F2, ..., Fn}.
2. Measure the application execution time of the optimized version compiled under every possible combination.
3. The best version is the one with the least execution time.
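The steps above can be sketched in Python as follows; `measure` is a hypothetical hook (not part of the thesis) standing in for compiling the program under a given tuple of 0/1 option values and timing the result, and the toy timing model is made up for illustration.

```python
from itertools import product

def exhaustive_search(n, measure):
    """Try every on/off combination of the n options; return the best.

    measure(config) is a hypothetical hook: it compiles the program
    under the given tuple of 0/1 option values and returns its run time.
    """
    best_config, best_time = None, float("inf")
    for config in product((0, 1), repeat=n):   # all 2^n combinations
        t = measure(config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

# Toy model: F1 helps by 10 units when on, F2 hurts by 5, F3 is neutral.
def toy_measure(config):
    return 100 - 10 * config[0] + 5 * config[1]

print(exhaustive_search(3, toy_measure))  # -> ((1, 0, 0), 90)
```

The `product` loop makes the O(2^n) cost visible: for n = 38 it would enumerate 2^38 configurations, which is why ES is only feasible for the small option set of Section 2.4.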
Algorithm 2: Batch Elimination (BE)
The idea of Batch Elimination (BE) is to identify the optimizations with nega-
tive effects and turn them off all at once. BE achieves good program performance,
when the optimizations do not interact with each other. It is the fastest among the
feedback-directed algorithms.
The negative effect of one optimization, Fi, can be represented by its Relative
Improvement Percentage (RIP), RIP(Fi), which is the relative difference of the ex-
ecution times of the two versions with and without Fi, T(Fi = 1) and T(Fi = 0). Fi = 1 means Fi is on; Fi = 0 means it is off.
RIP(Fi) = ( T(Fi = 0) − T(Fi = 1) ) / T(Fi = 1) × 100%    (2.1)
The baseline of this approach switches on all optimizations. T(Fi = 1) is the execu-
tion time of the baseline TB as shown in Equation 2.2. The performance improvement
by switching off Fi from the baseline B relative to the baseline performance can be
computed with Equation 2.3.
TB = T(Fi = 1) = T(F1 = 1, F2 = 1, ..., Fn = 1)    (2.2)
RIPB(Fi = 0) = ( T(Fi = 0) − TB ) / TB × 100%    (2.3)
If RIPB(Fi = 0) < 0, the optimization of Fi has a negative effect. The BE algorithm
eliminates the optimizations with negative RIPs in a batch to generate the final,
tuned version. This algorithm has a complexity of O(n).
1. Compile the application under the baseline B = {F1 = 1, F2 = 1, ..., Fn = 1}. Execute the generated code version to get the baseline execution time TB.
2. For each optimization Fi, switch it off from B and compile the application.
Execute the generated version to get T(Fi = 0), and compute the RIPB(Fi = 0)
according to Equation 2.3.
3. Disable all optimizations with negative RIPs to generate the final, tuned ver-
sion.
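A minimal Python sketch of BE, assuming a hypothetical `measure(config)` hook that compiles the program under the given list of 0/1 option values and returns its execution time (the toy timing model below is invented for illustration):

```python
def batch_elimination(n, measure):
    """Batch Elimination (BE): compute each option's RIP relative to
    the all-on baseline (Equation 2.3) and switch off, in one batch,
    every option whose RIP is negative.

    measure(config) is a hypothetical hook that compiles the program
    under the given list of 0/1 option values and returns its run time.
    """
    baseline = [1] * n
    t_base = measure(baseline)
    final = list(baseline)
    for i in range(n):
        trial = list(baseline)
        trial[i] = 0
        rip = (measure(trial) - t_base) / t_base * 100  # RIP_B(Fi = 0)
        if rip < 0:                 # switching Fi off made it faster
            final[i] = 0
    return final

# Toy model: F1 hurts by 5 units when on, F2 helps by 3 when on.
def toy_measure(config):
    return 100 + 5 * config[0] - 3 * config[1]

print(batch_elimination(2, toy_measure))  # -> [0, 1]
```

Note that each option is judged against the same all-on baseline, which is exactly why BE is O(n) and also why it cannot account for interactions between options.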
Algorithm 3: Iterative Elimination (IE)
We design Iterative Elimination (IE) to take the interaction of optimizations into
consideration. Unlike BE, which turns off all the optimizations with negative effects
at once, IE iteratively turns off one optimization with the most negative effect at a
time.
IE starts with the baseline that switches on all the optimizations. After com-
puting the RIPs of the optimizations according to Equation 2.3, IE switches off
the one optimization with the most negative effect from the baseline. This process
repeats with all remaining optimizations, until none of them causes performance
degradation. The complexity of IE is O(n^2).
1. Let B be the option combination for measuring the baseline execution time,
TB. Let S be the set of optimizations forming the optimization search space.
Initialize S = {F1, F2, ..., Fn} and B = {F1 = 1, F2 = 1, ..., Fn = 1}.
2. Compile and execute the application under the baseline setting to get the
baseline execution time TB.
3. For each optimization Fi ∈ S, switch Fi off from B and compile the application, execute the generated code version to get T(Fi = 0), and compute the RIP of
Fi relative to the baseline B, RIPB(Fi = 0), according to Equation 2.3.
4. Find the optimization Fx with the most negative RIP. Remove Fx from S,
and set Fx to 0 in B.
5. Repeat Steps 2, 3 and 4 until all options in S have non-negative RIPs. B
represents the final option combination.
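These steps might be sketched as follows; as before, `measure(config)` is a hypothetical compile-and-time hook, and the toy model (with an invented interaction term) is only illustrative:

```python
def iterative_elimination(n, measure):
    """Iterative Elimination (IE): repeatedly switch off the single
    remaining option with the most negative RIP, re-measuring the
    baseline each round, until no option degrades performance.

    measure(config) is a hypothetical compile-and-time hook.
    """
    B = [1] * n                 # current baseline combination
    S = set(range(n))           # options still under consideration
    while S:
        t_base = measure(B)
        rips = {}
        for i in S:
            trial = list(B)
            trial[i] = 0
            rips[i] = (measure(trial) - t_base) / t_base
        worst = min(rips, key=rips.get)
        if rips[worst] >= 0:    # nothing left hurts performance
            break
        B[worst] = 0
        S.remove(worst)
    return B

# Toy model with interaction: F1 and F2 each hurt, less so together.
def toy_measure(config):
    return 100 + 6 * config[0] + 2 * config[1] - 3 * config[0] * config[1]

print(iterative_elimination(2, toy_measure))  # -> [0, 0]
```

Because the baseline B is re-measured after every elimination, each remaining option is judged in the context of the options already switched off, which is how IE captures interactions that BE misses.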
Algorithm 4: Combined Elimination (CE)
CE, our final algorithm, combines the ideas of the two algorithms just described.
It has a similar iterative structure as IE; however, in each iteration, CE applies the
idea of BE: after identifying the optimizations with negative effects, in each iteration,
CE tries to eliminate these optimizations one by one in a greedy fashion.
We will see, in Section 2.3, that IE achieves better program performance than
BE, since it considers the interaction of optimizations. Nevertheless, when the in-
teractions have only small effects, BE may perform close to IE and more quickly
provide the solution. CE takes the advantages of both BE and IE. When the opti-
mizations interact weakly, CE eliminates the optimizations with negative effects in
one iteration, just like BE. Otherwise, CE eliminates them iteratively, like IE. As a
result, CE achieves both good program performance and fast tuning speed. CE has
a complexity of O(n^2).
1. Let B be the baseline option combination. Let S be the set of optimiza-
tions forming the optimization search space. Initialize these two sets: S =
{F1, F2, ..., Fn} and B = {F1 = 1, F2 = 1, ..., Fn = 1}.
2. Compile and execute the application under the baseline setting to get the
baseline execution time TB. Measure the RIPB(Fi = 0) of each optimization
option Fi in S relative to the baseline B.
3. Let X = {X1, X2, ..., Xl} be the set of optimization options with negative
RIPs. X is sorted in increasing order, that is, the first element, X1, has the most negative RIP. Remove X1 from S and set X1 to 0 in B. (B is changed
in this step.) For i from 2 to l,
∗ Measure the RIP of Xi relative to the baseline B.
∗ If the RIP of Xi is negative, remove Xi from S and set Xi to 0 in B.
4. Repeat Steps 2 and 3 until all options in S have non-negative RIPs. B repre-
sents the final solution.
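The CE steps can be sketched as follows; `measure(config)` is again a hypothetical compile-and-time hook, and the toy timing model is made up to show the greedy re-checking of X2..Xl against the updated baseline:

```python
def combined_elimination(n, measure):
    """Combined Elimination (CE): measure RIPs like IE, then greedily
    try to switch off every negative option in the same iteration,
    re-checking each candidate against the updated baseline B.

    measure(config) is a hypothetical compile-and-time hook.
    """
    B = [1] * n                    # baseline combination
    S = set(range(n))              # options still under consideration
    while True:
        t_base = measure(B)
        rips = {}
        for i in S:
            trial = list(B)
            trial[i] = 0
            rips[i] = (measure(trial) - t_base) / t_base
        negatives = sorted((i for i in S if rips[i] < 0), key=rips.get)
        if not negatives:          # no option degrades performance: done
            return B
        B[negatives[0]] = 0        # always drop X1, the most negative
        S.remove(negatives[0])
        for i in negatives[1:]:    # re-check X2..Xl one by one
            t_base = measure(B)
            trial = list(B)
            trial[i] = 0
            if (measure(trial) - t_base) / t_base < 0:
                B[i] = 0
                S.remove(i)

# Toy model: F1 hurts by 5, F2 helps by 2, F3 hurts by 3 when on.
def toy_measure(config):
    return 100 + 5 * config[0] - 2 * config[1] + 3 * config[2]

print(combined_elimination(3, toy_measure))  # -> [0, 1, 0]
```

When the options interact weakly, the inner loop eliminates all negative options in one pass, matching BE's speed; when they interact strongly, the re-measurement of `t_base` inside the loop makes CE behave like IE.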
Algorithm 5: Optimization Space Exploration (OSE)
In [36], the following method is used to orchestrate optimizations. First, a “com-
piler construction-time pruning” algorithm selects a small set of optimization combi-
nations that perform well on a given set of code segments. Then, these combinations
are used to construct a search tree, which is traversed to find good combinations for
code segments in a target program. To fairly compare this method with other or-
chestration algorithms, we slightly modify the “compiler construction-time pruning”
algorithm, which is then referred to as the OSE algorithm. (In [36], the pruning al-
gorithm aims at finding a set of good optimization combinations; while the modified
OSE algorithm in this thesis finds the best of this set. The modified algorithm is
applied to the whole application instead of code segments.)
The basic idea of the pruning algorithm is to iteratively find better optimization
combinations by merging the beneficial ones. In each iteration, a new test set Ω
is constructed by merging the optimization combinations in the old test set using
“union” operations. Next, after evaluating the optimization combinations in Ω, the
size of Ω is reduced to m by dropping the slowest combinations. The process repeats
until the performance increase in the Ω set of two consecutive iterations becomes
negligible. The complexity of OSE is O(m^2 × n). We use the same m = 12 as in [36]. Roughly, m can be viewed as O(n); hence, the complexity of OSE is approximately O(n^3). The specific steps are as follows:
1. Construct a set, Ω, which consists of the default optimization combination,
and n combinations, each of which assigns a non-default value to a single
optimization. (In our experiments, the default optimization combination, O3,
turns on all optimizations. The non-default value for each optimization is off.)
2. Measure the application execution time for each optimization combination in
Ω. Keep the m fastest combinations in Ω, and drop the rest.
3. Construct a new Ω set, each element in which is a union of two optimization
combinations in the old Ω set. (The “union” operation takes non-default values
of the options in both combinations.)
4. Repeat Steps 2 and 3 until no new combinations can be generated or the improvement of the fastest version in Ω becomes negligible. We use the fastest version in the final Ω as the final version.
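A sketch of the modified OSE loop: each combination is represented as the frozenset of options switched off, `measure(config)` is a hypothetical compile-and-time hook, and the "negligible improvement" test is approximated with a small tolerance `tol` (an assumption of this sketch, not a value from [36]):

```python
def ose(n, measure, m=12, tol=0.002):
    """Modified OSE sketch: start from the default (all options on)
    plus n single-flip combinations, keep the m fastest, union pairs,
    and repeat until the best time stops improving.

    A combination is the frozenset of options switched OFF;
    measure(config) is a hypothetical compile-and-time hook.
    """
    def time_of(off_set):
        return measure([0 if i in off_set else 1 for i in range(n)])

    omega = [frozenset()] + [frozenset([i]) for i in range(n)]
    best = float("inf")
    while True:
        omega = sorted(set(omega), key=time_of)[:m]    # keep m fastest
        new_best = time_of(omega[0])
        if new_best >= best * (1 - tol):               # negligible gain
            return omega[0]
        best = new_best
        omega = [a | b for a in omega for b in omega]  # all pair unions

# Toy model: every option hurts a little when on; best turns all off.
def toy_measure(config):
    return 100 + 5 * config[0] + 3 * config[1] + 1 * config[2]

print(ose(3, toy_measure))  # -> frozenset({0, 1, 2})
```

A production version would cache `time_of` results, since the same combination is re-evaluated across iterations; the sketch omits this for clarity.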
Algorithm 6: Statistical Selection (SS)
SS was developed in [37]. It uses a statistical method to identify the performance
effect of the optimization options. The options with positive effects are turned
on, while the ones with negative effects are turned off in the final version, in an
iterative fashion. This statistical method takes the interactions of optimizations into
consideration. (All the other algorithms except BE consider the interactions.)
The statistical method is based on orthogonal arrays (OA), which have been
proposed as an efficient design of experiments [38, 44]. Formally, an OA is an m × k
matrix of zeros and ones. Each column of the array corresponds to one compiler
option. Each row of the array corresponds to one optimization combination. SS
uses the OA with strength 2, that is, two arbitrary columns of the OA contain the
patterns 00, 01, 10, 11 equally often. Our experiments use the OA with 38 options
and 40 rows, which is constructed based on a Hadamard matrix taken from [50].
By a series of program runs, this SS approach identifies the options that have
the largest effect on code performance. Then, it switches on/off those options with
a large positive/negative effect. After iteratively applying the above procedure to the options that have not been set, SS arrives at a final combination of the options. SS
has a complexity of O(n^2). The pseudo code is as follows.
1. Compile the application with each row from orthogonal array A as the compiler
optimization combination and execute the optimized version.
2. Compute the relative effect, RE(Fi), of each option using Equations 2.4 and 2.5, where E(Fi) is the main effect of Fi, s is one row of A, and T(s) is the execution time of the version under s.

E(Fi) = ( Σ_{s∈A: si=1} T(s) − Σ_{s∈A: si=0} T(s) )^2 / m    (2.4)

RE(Fi) = E(Fi) / Σ_{j=1..k} E(Fj) × 100%    (2.5)
3. If the relative effect of an option is larger than a threshold of 10%,
∗ if the option has a positive improvement, I(Fi) > 0, according to Equa-
tion 2.6, switch the option on.
∗ else if it has a negative improvement, switch the option off.
I(Fi) = ( Σ_{s∈A: si=0} T(s) − Σ_{s∈A: si=1} T(s) ) / Σ_{s∈A: si=0} T(s)    (2.6)
4. Construct a new orthogonal array A by dropping the columns corresponding
to the options selected in the previous step.
5. Repeat all above steps until all of the options are set.
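The main-effect computation of Equations 2.4 and 2.5 can be illustrated as follows. The 4-row, 3-column orthogonal array below is a small strength-2 example for exposition only (the experiments in this thesis use a 40-row array for 38 options), and the timings are made up:

```python
def main_effects(oa, times):
    """Main effect E(Fi) and relative effect RE(Fi), per Equations 2.4
    and 2.5: oa is an m x k 0/1 orthogonal array and times[r] is the
    measured execution time of the version compiled under row r."""
    m, k = len(oa), len(oa[0])
    E = []
    for i in range(k):
        on = sum(t for row, t in zip(oa, times) if row[i] == 1)
        off = sum(t for row, t in zip(oa, times) if row[i] == 0)
        E.append((on - off) ** 2 / m)
    total = sum(E)
    return [e / total * 100 for e in E]  # RE(Fi) in percent

# A strength-2 orthogonal array for 3 options: every pair of columns
# contains the patterns 00, 01, 10, 11 equally often.
oa = [[0, 0, 0],
      [0, 1, 1],
      [1, 0, 1],
      [1, 1, 0]]
times = [100, 98, 110, 108]
print(main_effects(oa, times))  # F1 dominates the relative effect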
Summary of the orchestration algorithms
The goal of optimization orchestration is to find the optimal point in a high-
dimension space S = F1 × F2 × ... × Fn. BE probes each dimension to find and
adopt the ones that benefit performance. SS works in a similar way, but via a
statistical and iterative approach. OSE probes multiple directions, each of which
may involve multiple dimensions, and searches along the direction combinations that
may benefit performance. IE probes each dimension and fixes, one at a time, the dimension that yields the largest performance gain. CE probes each dimension and greedily
fixes the dimensions that benefit performance at each iteration.
Table 2.1 summarizes the complexities of all six algorithms compared in this
thesis.
Table 2.1
Orchestration algorithm complexity (n is the number of optimization options)

ES      BE    IE      OSE     SS      CE
O(2^n)  O(n)  O(n^2)  O(n^3)  O(n^2)  O(n^2)
2.3 Experimental Results
2.3.1 Experimental environment
We evaluate our algorithm using the optimization options of the GCC 3.3.3 com-
piler on two different computer architectures: Pentium IV and SPARC II. Our reasons for choosing GCC are that this compiler is widely used, has many easily accessible compiler optimizations, and is portable across many different computer architectures.
In this section, we use all 38 optimization options implied by “O3”, the highest
optimization level. These options are listed in Table 2.2 and are described in the
GCC manual [51].
We take our measurements using all SPEC CPU2000 benchmarks written in
F77 and C, which are amenable to GCC. To differentiate the effect of compiler
optimizations on integer (INT) and floating-point (FP) programs, we display the
results of these two benchmark categories separately. Our overall tuning process is
similar to profile-based optimizations. A train dataset is used to tune the program.
A different input, the SPEC ref dataset, is usually used to measure performance.
To separate the performance effects attributed to the tuning algorithms from those
caused by the input sets, we measure program performance under both the train
and ref datasets. For our detailed comparison of the tuning algorithms, we will start
with the train set. In Section 2.3.3, we will show that, overall, the tuned benchmark
suite achieves similar performance improvement under the train and ref datasets.
To ensure accurate measurements and eliminate perturbation by the operating
system, we re-execute each code version multiple times under a single-user environ-
Table 2.2
Optimization options in GCC 3.3.3
F1 rename-registers F2 inline-functions
F3 align-labels F4 align-loops
F5 align-jumps F6 align-functions
F7 strict-aliasing F8 reorder-functions
F9 reorder-blocks F10 peephole2
F11 caller-saves F12 sched-spec
F13 sched-interblock F14 schedule-insns2
F15 schedule-insns F16 regmove
F17 expensive-optimizations F18 delete-null-pointer-checks
F19 gcse-sm F20 gcse-lm
F21 gcse F22 rerun-loop-opt
F23 rerun-cse-after-loop F24 cse-skip-blocks
F25 cse-follow-jumps F26 strength-reduce
F27 optimize-sibling-calls F28 force-mem
F29 cprop-registers F30 guess-branch-probability
F31 delayed-branch F32 if-conversion2
F33 if-conversion F34 crossjumping
F35 loop-optimize F36 thread-jumps
F37 merge-constants F38 defer-pop
ment, until the three least execution times are within a range of [−1%,1%]. In most
of our experiments, each version is executed exactly three times. Hence, the impact
on tuning time is negligible.
In our experiments, the same code version may be generated under different opti-
mization combinations. 2 This observation allows us to reduce tuning time. We keep
a repository of code versions generated under different optimization combinations.
2 Comparing the binaries generated under two different optimization combinations via the UNIX utility diff can show whether these two binaries are identical.
The repository allows us to memorize and reuse their performance results. Different
orchestration algorithms use their own repositories and get affected in similar ways,
so that our comparison remains fair.
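The repository idea can be sketched as follows. Hashing the binary content stands in for the diff comparison mentioned above, and `run_and_time` is a hypothetical hook (not from the thesis) that executes a binary and returns its run time:

```python
import hashlib

class VersionRepository:
    """Memoize performance results by binary content: optimization
    combinations that produce identical binaries share one measurement.

    Hashing the binary plays the role of a diff comparison;
    run_and_time is a hypothetical hook that executes a binary and
    returns its run time.
    """
    def __init__(self, run_and_time):
        self.run_and_time = run_and_time
        self.cache = {}

    def time_of(self, binary):
        key = hashlib.sha256(binary).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.run_and_time(binary)
        return self.cache[key]

# Two combinations that happen to yield the same binary: one real run.
runs = []
def fake_run(binary):
    runs.append(binary)
    return 42.0

repo = VersionRepository(fake_run)
repo.time_of(b"identical-binary")
repo.time_of(b"identical-binary")  # served from the repository
print(len(runs))  # -> 1
```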
2.3.2 Metrics
Two important metrics characterize the behavior of orchestration algorithms:
1. The program performance of the best optimized version found by the orches-
tration algorithm. We define it as the performance improvement percentage
of the best version relative to the base version under the highest optimization
level O3.
2. The total tuning time spent in the orchestration process. Because the execution
times of different benchmarks are not the same, we normalize the tuning time
(TT ) by the time of evaluating the base version, i.e., one compilation time
(CTB) plus three execution times (ETB) of the base version.
NTT = TT/(CTB + 3 × ETB) (2.7)
This normalized tuning time (NTT ) roughly represents the number of experi-
mented versions. (The number may be larger or smaller than the actual number
of tested optimization combinations due to three effects: a) Some optimiza-
tions may not have any effect on the program, allowing the version repository
to reduce the number of experiments. b) Perturbation filtering mechanism
in Section 2.3.1 may increase the number of runs of some versions. c) The
experimental versions may be faster or slower than the base version.)
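As a concrete (made-up) example of Equation 2.7:

```python
def normalized_tuning_time(tt, ct_base, et_base):
    """Equation 2.7: total tuning time TT divided by the cost of
    evaluating the base version once, i.e., one compilation time
    plus three execution times."""
    return tt / (ct_base + 3 * et_base)

# E.g., 120 minutes of tuning with a 1-minute compile and 2-minute
# runs corresponds to roughly 17 experimental versions.
print(normalized_tuning_time(120.0, 1.0, 2.0))
```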
A good optimization orchestration method is meant to achieve both high program
performance and short normalized tuning time. We will show that our CE algorithm
has the shortest tuning time, while achieving comparable or better performance than
other algorithms.
[Bar charts omitted. (a) Normalized tuning time for SPEC CPU2000 FP benchmarks; (b) normalized tuning time for SPEC CPU2000 INT benchmarks. Each chart compares BE (Batch Elimination), IE (Iterative Elimination), OSE (Optimization Space Exploration), SS (Statistical Selection) and CE (Combined Elimination). CE is the algorithm proposed in this paper; BE and IE are steps towards CE; OSE and SS are alternatives proposed in related work.]

Fig. 2.1. Normalized tuning time of five optimization orchestration algorithms for SPEC CPU2000 benchmarks on Pentium IV. Lower is better. CE has the shortest tuning time in all except a few cases. In all those cases, the extended tuning time leads to higher performance.
[Bar charts omitted. (a) Program performance for SPEC CPU2000 FP benchmarks; (b) program performance for SPEC CPU2000 INT benchmarks, each reported as the performance improvement percentage relative to O3 for BE, IE, OSE, SS and CE.]

Fig. 2.2. Program performance achieved by five optimization orchestration algorithms relative to the highest optimization level "O3" for SPEC CPU2000 benchmarks on Pentium IV. Higher is better. In all cases, CE performs the best or within 1% of the best.
2.3.3 Results
In this section, we compare our final optimization orchestration algorithm CE
with the four algorithms BE, IE, OSE and SS. Recall that BE and IE are steps towards CE; OSE and SS are algorithms proposed in related work. Figure 2.1 and Figure 2.2 show the results of these five orchestration algorithms on the Pentium
IV machine for the SPEC CPU2000 FP and INT benchmarks in terms of the two
metrics. They provide evidence for our claim that CE has the fastest tuning speed
while achieving program performance comparable to the best alternatives. We will
discuss the basic BE method first, then the other four algorithms. Tuning time will
be analyzed first, then program performance.
Tuning time
For the applications used in our experiments, the slowest of the measured algo-
rithms takes up to several days to orchestrate the large number of optimizations.
Figure 2.1(a) and Figure 2.1(b) show that our new algorithm, CE, is the fastest
among the four orchestration algorithms that consider interactions. The absolute
tuning time for CE is 2.19 hours, on average, for FP benchmarks and 3.66 hours for INT benchmarks on the 2.8 GHz Pentium IV machine. On the 400 MHz SPARC II machine, it is 9.92 hours for FP benchmarks and 12.31 hours for INT benchmarks. To factor out these machine differences, we compare the algorithms by normalized tuning time, shown in Figure 2.1(a) and Figure 2.1(b).
Although BE achieves the least program performance, its tuning speed is the
fastest, which is consistent with its complexity of O(n). BE can be viewed as a
lower bound on the tuning time for a feedback-directed orchestration algorithm
that does not have a priori knowledge of the optimizations. For such an algorithm,
each optimization must be tried at least once to find its performance effect.
OSE is of higher complexity and thus slower than IE and CE. However, SS turns
out to be the slowest method, even though its complexity is O(n²), less than OSE's
O(n³). The reason for the long tuning time of SS is the higher number of iterations
it takes to converge.
Among the four algorithms (excluding BE), CE has the fastest average tuning
speed. For ammp, wupwise, bzip2, gap, gcc, perlbmk and vortex, CE is not the
fastest. However, the faster algorithms achieve their speed at significant expense of
program performance.
Program performance
In both Figure 2.2(a) and Figure 2.2(b), BE almost always achieves the least
program performance among the five algorithms. As described in Section 2.2, BE
ignores the interaction of the optimizations. Therefore, it does not achieve good
performance when the interaction has a significant negative performance effect. In
the cases of sixtrack and parser, BE even significantly degrades the performance.
In Figure 2.2(a), for art, all the algorithms improve performance by about 60%
on Pentium. This is mainly due to eliminating the option of “strict-aliasing”, which
does alias analysis, removes false data dependences, and increases register pressure.
This option results in lots of spill code for art, causing substantial performance degra-
dation. However, on the SPARC machine, the orchestration algorithms do not have
the above behavior for art. "Strict-aliasing" does not cause performance degradation,
as the SPARC machine has more registers than the Pentium machine. In [43],
we have analyzed in detail the reasons for the negative performance effects of several
optimizations.
The average performance improvement of all other orchestration algorithms,
which consider the interactions of optimizations, is about twice as high as BE's.
Moreover, Figure 2.2(a) shows that these four algorithms perform essentially the
same for the FP benchmarks. On one hand, the regularity of FP programs con-
tributes to this result. On the other hand, the optimizations in GCC limit per-
formance tuning on FP benchmarks, because GCC options do not include advanced
dependence-based transformations, such as loop tiling. We expect that such transfor-
27
mations would be amenable to our tuning method and yield tangible improvement.
In Figure 2.2(b), performance similarity still holds in most of the INT benchmarks,
with a few exceptions. For gap, twolf and vortex, IE does not achieve as good a
performance as CE, though the performance gap is small. SS does not produce consistent
performance: for bzip2, SS does not achieve any performance improvement, and for bzip2, gzip,
and vortex, SS's performance is significantly inferior to CE's. CE and OSE always
achieve good program performance improvement.
The fact that none of the algorithms consistently outperforms the others reflects
the exponential complexity of the optimization orchestration problem. All five algorithms
use heuristics, which lead to sub-optimal results. Among these algorithms,
CE achieves consistent performance. Although CE does not achieve the best performance
for crafty, parser, twolf, and vpr, the gap is less than 1%.
The small performance differences between the measured algorithms indicate that
all methods properly deal with the primary interactions between optimization tech-
niques. However, there are differences in the ways the algorithms deal with secondary
interactions. These properties are consistent with those of a general optimization
problem, in which the main effects tend to be larger than two-factor interactions,
which in turn tend to be larger than three-factor interactions, and so on [44].
In Figure 2.2, we measured program performance under the train dataset. It
is important to evaluate how the algorithm performs under a different input. To this
end, we measured execution times of each benchmark using the ref dataset as input,
for both the O3 version and the optimal version found by CE. (Still, the train dataset
is the input for the tuning process.) On average, CE improves FP benchmarks by
11.7% (compared to 11.9% under train) relative to O3; INT benchmarks by 3.9%
(4.4% under train). This shows that CE works well when the input is different
from the tuning input. On the other hand, we do find a few benchmarks that
do not achieve the same performance under the ref dataset as under train. The
largest differences are 1.95% for gzip and 2.18% for vortex. If the training input
of the orchestration algorithm differs significantly from actual workloads, our offline
Table 2.3
Mean performance on SPARC II. CE achieves both fast tuning speed and high program performance on SPARC II as well.

Benchmark   Algorithm   Improvement over "O3"   Normalized Tuning Time
FP          BE          -4.1 %                   30.8
FP          IE           4.1 %                  105.4
FP          OSE          4.0 %                  142.0
FP          SS           3.7 %                  384.9
FP          CE           4.1 %                   63.4
INT         BE          -0.8 %                   36.2
INT         IE           3.6 %                   98.7
INT         OSE          3.4 %                  130.0
INT         SS           3.1 %                  317.0
INT         CE           3.9 %                   88.4
(profile-based) tuning approach may not reach the full tuning potential. In that case,
an online approach [25] could tune the program using the actual input.
Overall comparison of algorithms
CE achieves both fast tuning speed and high program performance. It does so
by combining the advantages of IE and BE: Like IE, it considers the interaction of
optimizations, leading to high program performance; like BE, it keeps tuning time
short when the interaction does not have a significant performance effect.
Similar observations hold on the SPARC II machine. Table 2.3 lists the mean
performance of each algorithm across the integer and floating point benchmarks,
respectively.
Figure 2.3 provides an overall comparison of the algorithms. The X-axis is average
program performance achieved by the algorithm; the Y-axis is average normalized
tuning time. The averages are taken across all benchmarks and machines. (The
corresponding figure for each individual benchmark and machine setting would be similar.) A good
algorithm achieves high program performance and short tuning time, represented by the
[Figure 2.3: scatter plot of the five algorithms. X-axis: average program performance improvement percentage relative to "-O3" (%), from 0 to 7. Y-axis: average normalized tuning time, from 0 to 350. Points: BE, IE, OSE, SS, CE.]

Fig. 2.3. Overall comparison of the orchestration algorithms. CE achieves both fast tuning speed and high program performance.
bottom-right corner of Figure 2.3. The figure shows that CE is the best algorithm.
The runner-up is IE, which we developed as a step towards CE.
2.4 Upper Bound Analysis
We have shown that CE achieves good performance improvement. This section
attempts to answer the question of how much better than CE an algorithm could
perform. To this end, we look for a performance upper bound, which we find by
an exhaustive search (ES) through all optimization combinations. As it would be
impossible to do exhaustive search with 38 optimizations, we pick a small set of six
optimizations. This section will show that the performance improvement by CE is
close to this upper bound.
The six optimizations that have the largest performance effects are picked to con-
duct upper bound analysis. The performance effect of an optimization is the total
negative relative performance improvement of this optimization on all the bench-
marks. Figure 2.4 shows the effects of all 38 optimizations in a sorted fashion for
[Figure 2.4: two sorted bar charts showing, for each of the 38 optimization options F1-F38, the sum of negative RIPs (%) over all benchmarks. (a) On the Pentium IV machine, ranging from 0 down to about -70%. (b) On the SPARC II machine, ranging from 0 down to about -25%.]

Fig. 2.4. Total negative effects of all the GCC 3.3.3 O3 optimization options
each architecture. Comparing Figure 2.4(a) and Figure 2.4(b), we see that the
effect of an optimization is different on different architectures. So, these six opti-
mizations are picked separately on the SPARC II and Pentium IV machines. They
are strict-aliasing, schedule-insns2, regmove, gcse, rerun-loop-opt and force-mem for
Pentium IV, and rename-registers, reorder-blocks, sched-interblock, schedule-insns,
gcse and if-conversion for SPARC II.
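The selection rule just described can be written as a short ranking routine. The sketch below assumes a hypothetical `rips` table mapping each option to its per-benchmark RIPs (in percent); it sums only the negative entries and returns the k most harmful options.

```python
# Rank options by the sum of their negative RIPs across all benchmarks
# (the metric plotted in Figure 2.4) and keep the k largest effects.
def pick_top_options(rips, k=6):
    effect = {opt: sum(min(r, 0.0) for r in per_bench.values())
              for opt, per_bench in rips.items()}
    return sorted(effect, key=effect.get)[:k]    # most negative sum first
```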
2.4.1 Results
In Figure 2.5, ES represents the performance upper bound. Comparing the first
two columns in Figure 2.5(a) and Figure 2.5(b), we find that, under the 6 optimiza-
tions, CE performs close to ES. In about half of the benchmarks, they both find the
[Figure 2.5: four bar charts comparing ES_6, CE_6, CE_38, and BE_6 for each benchmark. (a) Program performance for SPEC CPU2000 FP benchmarks. (b) Program performance for SPEC CPU2000 INT benchmarks. (c) Normalized tuning time for SPEC CPU2000 FP benchmarks. (d) Normalized tuning time for SPEC CPU2000 INT benchmarks.]

Fig. 2.5. Upper bound analysis on Pentium IV. ES_6: exhaustive search with 6 optimizations; CE_6: combined elimination with 6 optimizations; CE_38: combined elimination with 38 optimizations; BE_6: batch elimination with 6 optimizations. CE_6 achieves nearly the same performance as ES_6, in all cases. CE_38 performs better. (Exhaustive search with 38 optimizations would be infeasible.) BE_6 is much worse than CE_6. CE_6 is about 4 times faster than ES_6.
Table 2.4
Upper bound analysis under four different machine and benchmark settings

Machine      Benchmark   RIP by ES over "O3"   RIP by CE over "O3"
Pentium IV   FP          10.6 %                10.4 %
Pentium IV   INT          2.7 %                 2.6 %
SPARC II     FP           2.9 %                 2.7 %
SPARC II     INT          3.2 %                 3.0 %
same best version. Another important fact shown in Figure 2.5(c) and Figure 2.5(d)
is that CE is more than 4 times as fast as ES, even for this small set of optimizations.
For comparison, the figures also show the tuning speed of CE for 38 options. ES for
38 options would take millions of years.
Figure 2.5 provides evidence that the heuristic-based search algorithms can achieve
performance close to the upper bound. This confirms our analysis in Section 2.3.3.
The heuristics find the primary and secondary performance effects, which are the
individual performance of an optimization and the main interaction with other optimizations,
respectively. These arguments also hold for the SPARC II machine. Table 2.4
shows average program performance achieved under different machine and bench-
mark settings.
Comparing CE_6 with CE_38, the performance gap for FP benchmarks is negligible,
but not for INT benchmarks. This result is consistent with the finding of [43]
that INT programs are sensitive to a larger number of interactions between optimiza-
tion techniques than FP programs. These results suggest that a priori knowledge
of a small set of potentially interacting optimizations may help tuning numerical
programs. Exhaustive search within this small set can be feasible. However, this is
not the case for non-numerical applications.
In order to verify that the interaction between these six optimizations has a
significant performance effect, we apply BE as well. The result is shown as the last
column, BE_6, in Figure 2.5. From this figure, the performance of BE_6 is much
worse than that of CE_6, for example, in ammp, apsi, sixtrack, crafty, parser, and vpr.
2.5 The General Combined Elimination Algorithm
Our CE algorithm can be easily extended to handle "non-on-off" options, although
the previous experiments were done with "on-off" options. (All the GCC
O3 optimization options are of this "on-off" type.) The "non-on-off" options are
the ones with more than two values. One example is the "-unroll" option in the SUN
Forte compilers [52], which takes an argument indicating the degree of loop unrolling.
Therefore, a general optimization orchestration problem can be described as follows.
Given a set of optimization options {F1, F2, ..., Fn}, where Fi has Ki possible
values {Vi,j, j=1..Ki} (i = 1..n), find the combination that minimizes the program
execution time. Here, n is the number of options. (Moreover, one Fi can actually
contain multiple optimizations that have a high possibility of interaction; the possible
values of this Fi are all possible combinations of the values of the two optimizations.)
We name our algorithm for handling "non-on-off" options the General Combined
Elimination (GCE) algorithm. GCE has an iterative structure similar to CE, with a
few extensions. The initial baseline of GCE is the default optimization setting used
in the compiler. In each iteration of GCE, all non-default values of the remaining op-
tions in the search space S are evaluated. For each of these options, GCE records the
value causing the most negative RIP, which is computed according to Equation 2.3.
(A negative RIP means that the corresponding value of the option improves pro-
gram performance.) GCE tries to apply these recorded values of the options with
these negative RIPs one by one in a greedy fashion just like CE. When none of
the remaining options in S has a value improving the performance, GCE gives the
final optimization setting for the program. GCE has a complexity of O((Σ_{i=1..n} K_i) × n).
In most cases, Σ_{i=1..n} K_i = O(n), so roughly the complexity is still O(n²). (The
techniques developed in [45] could also be included in GCE to handle options with
a large number of possible values, for example, blocking factors.) The pseudo code
of GCE is as follows.
1. Let B be the baseline option setting. Let S be the set of optimizations forming
the optimization search space. Initialize B = { F1 = f1, F2 = f2, ..., Fn = fn | fi is the default value of option Fi } and S = {F1, F2, ..., Fn}.

2. Measure the RIPs of all the non-default values of the options in S relative to
the baseline B, that is, RIP_B(Fi = Vi,j) s.t. Fi ∈ S and Vi,j ≠ fi. The definition
of RIP is the same as in Equation 2.3.

3. Let X = {X1 = x1, X2 = x2, ..., Xl = xl} be the set of options with negative
RIPs, where xi has the most negative RIP among the possible values of option
Xi. X is sorted in increasing order, that is, the first element in X, X1, has
the most negative RIP. Remove X1 from S and set X1 in B to be x1. (B is
changed in this step.) For i from 2 to l,

∗ Measure RIP_B(Xi = xi).

∗ If it is negative, remove Xi from S and set Xi in B to be xi.

4. Repeat Step 2 and Step 3 until all options in S have non-negative RIPs. B
represents the final option setting.
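The four steps above can be condensed into the following Python sketch. It illustrates the control flow only, not the PEAK implementation: `measure` is a hypothetical function that compiles and times the program under a full option setting, and `values[f][0]` is taken to be the default value of option f.

```python
# Sketch of General Combined Elimination (GCE). Assumes measure(setting)
# returns the program's execution time under a complete option setting.
def gce(values, measure):
    def rip(baseline, f, v):                     # RIP relative to B (Equation 2.3)
        t_base = measure(baseline)
        trial = dict(baseline); trial[f] = v
        return (measure(trial) - t_base) / t_base * 100.0

    baseline = {f: vals[0] for f, vals in values.items()}   # Step 1: defaults
    search = set(values)
    while search:
        # Step 2: RIPs of all non-default values of the remaining options;
        # record, per option, the value with the most negative RIP.
        best = {}
        for f in search:
            cand = min(((rip(baseline, f, v), v)
                        for v in values[f] if v != baseline[f]), default=None)
            if cand is not None and cand[0] < 0:
                best[f] = cand
        if not best:                             # Step 4: converged
            break
        # Step 3: apply greedily, most negative RIP first, re-measuring the rest.
        order = sorted(best, key=lambda f: best[f][0])
        lead = order[0]
        baseline[lead] = best[lead][1]; search.remove(lead)
        for f in order[1:]:
            if rip(baseline, f, best[f][1]) < 0:
                baseline[f] = best[f][1]; search.remove(f)
    return baseline
```

With two values per option this degenerates to the original on/off CE search; the re-measurement inside Step 3 is what lets GCE catch a value that stops being profitable once an interacting option has changed.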
2.5.1 Experimental results on SUN Forte compilers
We conduct an experiment to evaluate the GCE algorithm using SUN Forte
compilers. Another goal of this experiment is to compare the performance achieved
by our GCE algorithm with the one by manual tuning. The manual tuning result is
based on the SPEC CPU2000 performance results [49]. In such a result, there is a
base option setting and multiple peak option settings. The base setting is common
to all benchmarks. We use this as the baseline performance. The peak settings
may be different for different benchmarks. For each benchmark, the peak setting
stands for a tuned optimization setting, which achieves better performance than the
base setting. So, we will compare GCE with the peak setting, which represents the
manually tuned result.
The experiments are conducted on a Sun Enterprise 450 SPARC II machine.
We evaluate the GCE algorithm using the optimization flags of the Forte Developer
Table 2.5
Optimization flags orchestrated by GCE

Flag Name                     Experimented Values                  Meaning
-xarch                        v8, v8plus, generic                  target architecture instruction set
-xO                           3, 4, 5                              optimization level
-xalias_level                 std, strong, basic                   alias level
-stackvar                     on/off                               using the stack to hold local variables
-d                            y, n                                 allowing dynamic libraries
-xrestrict                    %all, %none                          pointer-valued parameters as restricted pointers
-xdepend                      on/off                               data dependence test and loop restructuring
-xsafe=mem                    on/off                               assuming no memory protection violations
-Qoption iropt -crit          on/off                               optimization of critical control paths
-Qoption iropt -Abopt         on/off                               aggressive optimizations of all branches
-Qoption iropt -whole         on/off                               whole program optimizations
-Qoption iropt -Adata_access  on/off                               analysis of data access patterns
-Qoption iropt -Mt            500, 1000, 2000, 6000, default       max size of a routine body eligible for inlining
-Qoption iropt -Mr            6000, 12000, 24000, 40000, default   max code increase due to inlining per routine
-Qoption iropt -Mm            6000, 12000, 24000, default          max code increase due to inlining per module
-Qoption iropt -Ma            200, 400, 800, default               max level of recursive inlining
6 compilers. The baseline flag setting is “-fast -xcrossfile -xprofile”. GCE tunes
the optimization flags that are used in the SPEC peak settings. Table 2.5 lists these
flags. The Forte compilers have some flags that can be passed directly to the compiler
components. These flags are passed by the “-W” flag for the C compiler and the
“-Qoption” flag for the Fortran compiler or the C++ compiler. In this table, we list
the “-Qoption” only (“-W” flags are similar). 3
Figure 2.6 shows the performance results. In summary, GCE achieves equal or
better performance for each benchmark. On average, for floating point benchmarks,
GCE achieves 10.8% improvement relative to the base setting, compared to 5.6% by
the peak settings. For integer benchmarks, GCE achieves 8.1% compared to 4.1%
by the peak settings.
3If the inlining flags are not specified, the compiler uses their default values, which are not listed in the manual.
[Figure 2.6: two bar charts showing, for each benchmark, "SPEC peak performance / base performance" and "GCE performance / base performance" as performance improvement percentages (%). (a) SPEC CPU2000 FP benchmarks. (b) SPEC CPU2000 INT benchmarks.]

Fig. 2.6. Program performance achieved by the GCE algorithm vs. the performance of the manually tuned results (peak setting). Higher is better. In all cases, GCE achieves equal or better performance. On average, GCE nearly doubles the performance.
3. FAST AND ACCURATE RATING METHODS
3.1 Introduction
Chapter 2 presented a fast and effective algorithm, Combined Elimination, to
search for the best optimization combination for a program. Although the tuning
process takes the least time among the alternatives (57% of the closest one), the
tuning time is still in the order of several hours. The reason is that Chapter 2
evaluates an optimized version based on the total execution time spent in one run
of the program. This method is accurate but slow. This chapter aims at developing
fast and accurate methods to evaluate the performance of an optimized version. We
will tune the program at a finer granularity than the whole-program level. We will
use a partial execution of the program to evaluate one optimized version so as to
reduce the tuning time. We call these performance evaluation methods rating
methods. The rating methods will be accurate enough to guarantee tuned
program performance.
From this chapter on, we will tune the important code segments in the program
separately. These code segments are called Tuning Sections (TS) in this thesis.
Roughly, a tuning section is a procedure including all its callees. (We will show
how to select the tuning sections in Chapter 4.) To tune a program at this tuning-
section level, we still apply the CE algorithm developed in Chapter 2 to search for
the best optimization combination for each individual TS. Noticing that each TS is
invoked many (usually hundreds or thousands of) times, we develop rating methods
to evaluate the performance of one optimized version based on a small number of
invocations of the TS. (The number of invocations used to evaluate the performance
of an optimized version is called a window.) In this way, a partial execution of the
program can be used to rate the performance of an optimized version, which leads
to a tremendous speedup of the tuning process.
Besides speed, accuracy and flexibility are two other important issues for rat-
ing methods. A fast rating method means a short tuning time. However, if the
rating method is inaccurate, it may lead to limited performance improvement or
even degradation. If the method is not flexible, it may apply to a limited set of
applications, optimization techniques, compilers, or architectures only. Many of the
proposed methods are slow due to executing the whole program to rate one
version [35, 41, 43], are not accurate enough for optimization orchestration [53], or are only
applicable to specific code [33, 34]. Another example of a fast rating method is the
performance model in [36], which estimates the performance of an optimized version
based on the profile information about data cache misses, instruction cache misses
and branch mis-prediction. This approach is fast but inaccurate. [36] shows that its
performance improvement is only about half of that achieved by using the accurate
execution times.
Rating based on a number of invocations of the tuning sections leads to fast
tuning speed, but the workload may change from one invocation to another. Directly
averaging the execution times of a number of invocations is not an accurate method.
This chapter presents the rating methods that fairly compare the invocation times
of the optimized versions under different invocations. (The computed ratings will
be used as the feedback to the optimization orchestration algorithm developed in
Chapter 2.) These methods can be applied to general, regular and irregular, code
sections.
Similar to our idea of running part of the program to speed up performance
evaluation, SimPoint [54–56] and Simulation Sampling [57–59] simulate important
intervals or sampling points to speed up the simulation process. However, these
techniques can hardly be applied to our system, mainly due to the following reasons.
(1) Our system requires that the machine state should be “warmed-up” so that the
program could execute directly from the middle of the program without impacting
performance evaluation accuracy. However, it is very difficult to warm up a real ma-
chine to an accurate state, especially for the caches. (Code Isolator [53] tries to do so,
nevertheless, it is not accurate enough for our optimization orchestration.) SimPoint
and Simulation Sampling can warm up the machine state via check-pointing or fast-
forwarding, because the simulator has full control of the simulated machine. (2) Our
system evaluates the performance of many optimized versions, while SimPoint and
Simulation Sampling evaluate the performance of multiple simulations using one bi-
nary version of a program. If their techniques were applied to our system, it would
cause tremendous overhead to find the sampling points for each version and would
still be difficult to compare the performance of different versions, because different
optimized versions have different numbers of instructions and basic blocks. (3) Our
compiler system works at the source program level, while SimPoint and Simulation
Sampling work at the binary level.
The key ideas of our rating methods are as follows. Context-Based Rating (CBR)
identifies and compares invocations of a tuning section that have the same work-
load, in the course of the program run. Model-Based Rating (MBR) formulates the
relationship between different workloads, which it factors into the comparison. Re-
execution-Based Rating (RBR) directly re-executes a tuning section under the same
input for fair comparison. This chapter also presents automated compiler techniques
to analyze the source code for choosing the most appropriate rating method for each
tuning section. 1
The remainder of this chapter is organized as follows. Section 3.2 presents three
rating methods – CBR, MBR and RBR – along with the relevant compiler techniques.
Section 3.3 shows the use of these methods in the PEAK system, including static
program analysis and a dynamic solution to determining the window size. Section 3.4
evaluates the applicability and accuracy of the rating methods on a number of code
sections. (The complete evaluation will be done in Chapter 5, after the tuning
section selection algorithm is presented in Chapter 4.)
1The main work of this chapter has been published in [2].
3.2 Rating Methods
The rating methods are applied in an offline performance tuning scenario as
follows. Before tuning, the program is partitioned by our compiler 2 into a number
of code sections, called tuning sections (TS). The tuning system runs the program one
or several times under a training input, while dynamically generating and swapping
in/out new optimized versions for each TS. The performance of these versions is
compared using the proposed rating methods. The winning version will be used in
the final tuned program. (We will discuss the complete PEAK system in Chapter 5.)
The key issue in rating these versions is to achieve fair comparison. Our rating
methods achieve this goal by either identifying TS invocations that use the same
workload (CBR), finding mathematical relationships between different workloads
(MBR), or forcing re-execution of a TS under the same input (RBR).
3.2.1 Context Based Rating (CBR)
Context-based rating identifies the invocations of a tuning section under the same
workload in the course of program execution. The PEAK compiler finds the set of
context variables, which are the program variables that influence the execution time
of the tuning section. Examples are the variables that determine the conditions of
control regions, such as if or loop constructs. (These variables can be function
parameters, global variables, or static variables.) Thus, the context variables determine
the workload of a tuning section. We define the context of one TS invocation as the
set of values of all context variables. Therefore, each context represents one unique
workload.
2Our compiler, called the PEAK compiler, is a source-to-source compiler, which analyzes and instruments the program before tuning. We developed the PEAK compiler based on the Polaris [60] compiler for programs written in Fortran and the SUIF2 [61] compiler for C. The backend compiler is the one that generates the optimized executables. We focus on the GCC compiler as the backend in this thesis. During tuning, the backend compiler is invoked with different option settings to control its optimizations.
CBR rates one optimized version under a certain context by using the average
execution time of several invocations. (Typically, this number is in the tens.) The
best versions for different contexts may be different, in which case CBR could report
the context-specific winners. PEAK makes use of only the best version under the
most important context, which covers most (e.g., more than 80%) of the execution
time spent in the TS. This major context is determined by one profile run of the
program. (If a tuning section has no major context or the number of invocations of
the major context is too small, the MBR method described next is preferred.)
In summary, the rating of a version v, R(v), is computed according to Equation 3.1,
where x is the most time-consuming context of version v, T(i, x) is the
execution time of the ith invocation under context x, and w is the window size, the
number of invocations used to rate one version. Var(v) records the variance of the
measurement.

R(v) = Σ_{i=1..w} T(i, x) / w    (3.1)

Var(v) = Σ_{i=1..w} (T(i, x) − R(v))² / w    (3.2)
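Equations 3.1 and 3.2 are simply the mean and variance of the measured invocation times within the window, e.g.:

```python
# Eq. 3.1 and Eq. 3.2: rating and variance over a window of w invocation
# times, all taken under the same (major) context.
def cbr_rating(times):
    w = len(times)
    r = sum(times) / w                            # R(v), Eq. 3.1
    var = sum((t - r) ** 2 for t in times) / w    # Var(v), Eq. 3.2
    return r, var
```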
Figure 3.1 shows the compiler analysis to find the context variable set so as to
determine the applicability of CBR. The algorithm traverses each control statement
and recursively finds the related variables; that is, it finds all of the input variables
that may influence the values used in control statements. All of these variables are
considered to be context variables. If there exist one or more non-scalar context
variables, there is usually no major context. So, in this case, CBR is not applicable.
Similarly, if the context variable is floating point, CBR is not applicable. To reduce
the context match overhead during tuning, we eliminate the runtime constants from
the context variable list. The runtime constant variables always have the same value
during all the invocations. This is done using the same profile run that determines
the major context.
In Figure 3.1, we illustrate the algorithm using use-def chains to track the data
flow. Static Single Assignment (SSA) can also be used to do so. In our Fortran
//ContextSet: the set of context variables.
VariableSet ContextSet;
//Return value: applicability of CBR on TS
Boolean GetContextSet(TuningSection TS)
{
ContextSet = {};
Set the state of each statement as "undone";
For each control statement s in TS {
For each variable v used in s {
if( GetStmtContextSet(v, s) == false )
return false;
}
}
Remove the constant variables from ContextSet;
return true;
}
Boolean GetStmtContextSet(Variable v, Statement s)
{
StatementSet SSet = Find_UD_Chain(v, s);
Set s as "done";
For each statement m in SSet {
if( m is the entry statement ) {
//v is in Input(TS).
if( v is scalar && v is not floating point )
put v into ContextSet;
else
return false;
}
if( m is "done" ) {//avoid loop.
continue;
}
For each variable r used in m {
if( GetStmtContextSet(r, m) == false )
return false;
}
}
return true;
}
Fig. 3.1. Pseudo code of context variable analysis
implementation, we use Gated Single Assignment (GSA) [62]. The algorithm is very
similar to Figure 3.1.
3.2.2 Model Based Rating (MBR)
Model-based rating formulates mathematical relationships between different con-
texts of a tuning section and adjusts the measured execution time accordingly. In
this way, different contexts become comparable.
The execution time of a tuning section consists of the execution time spent in all
of its basic blocks:
T_TS = Σ_b (T_b × C_b)    (3.3)

T_TS is the execution time in one invocation of the whole tuning section; T_b is the
execution time in one entry to the basic block b; and C_b is the number of entries to
the basic block b in the TS invocation.
If the numbers of entries of two basic blocks, C_b1 and C_b2, are linearly dependent
on each other in every TS invocation through the whole run of the program (that
is, C_b1 = α × C_b2 + β, where α and β are constants), our compiler merges the items
corresponding to these basic blocks into one component. Hence, MBR uses the
following execution time estimation model.
T_TS = Σ_{i=1..n} (T_i × C_i)    (3.4)

T_TS consists of several components, each of which has a component count C_i and
a component time T_i. We assume that there is always a constant component T_n,
with C_n = 1 for all TS invocations. Furthermore, MBR makes a number of simplifications:
(1) If two branches in a conditional statement have the same workload,
the components representing the branches are merged. (2) If the workload in condi-
tional statements is small, they are treated as normal statements. For example, an
if-statement with a simple increment statement is not treated as a basic block, but
as an increment statement. (3) Components that exhibit constant behavior are put
into the constant component.
The PEAK compiler finds the expression determining the number of entries to
each basic block b, C_b. If the TS contains an irregular code structure, for example a
while loop, MBR is not applicable. (In this case, the next rating method, RBR, will
be applied.) After a profile run, PEAK determines the relationships among the Cb's
and merges them into independent components.
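The merging check, whether two entry counts are related affinely across invocations, can be sketched numerically as follows. This is illustrative code: the function name and the numeric approach are assumptions, since PEAK derives the Cb expressions symbolically rather than from sampled counts.

```python
def affine_relation(c1, c2, tol=1e-9):
    """Test whether c1[j] = alpha*c2[j] + beta holds for constants alpha, beta
    across all profiled invocations; returns (alpha, beta) or None. A numeric
    sketch: PEAK derives the Cb expressions symbolically instead."""
    pairs = list(zip(c2, c1))
    x0, y0 = pairs[0]
    other = next((p for p in pairs if p[0] != x0), None)
    if other is None:                 # c2 is constant: c1 must be constant too
        return (0.0, y0) if all(abs(y - y0) <= tol for _, y in pairs) else None
    # Fit alpha and beta from two distinct points, then verify all invocations.
    alpha = (other[1] - y0) / (other[0] - x0)
    beta = y0 - alpha * x0
    if all(abs(y - (alpha * x + beta)) <= tol for x, y in pairs):
        return (alpha, beta)
    return None
```

When the relation holds, the two basic blocks' contributions can be folded into a single component with one count.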
During tuning, PEAK collects the execution times of a number of invocations
to the optimized version until the rating error is small, which we will discuss in
Section 3.3. It gathers the TS-invocation-time vector, Y , and the component-count
matrix, C, in which Y (j) is the TTS in the jth invocation and C(i, j) is the ith
component count Ci in the jth invocation. Solving the following linear regression
problem yields the component-time vector T .
Y = T × C (3.5)
Here, T = (T1, T2, ..., Tn) represents the component-time vector of one particular
version. The version with smaller Ti’s performs better. Hence, MBR may compare
different versions using the rating, R(v), computed based on their T vectors according
to the following equation.
R(v) = Σ(i=1..n) (Ti × Cavgi)    (3.6)

Var(v) = Σ(j=1..w) (Yj − Σ(i=1..n) (Ti × Ci,j))² / w    (3.7)
Cavgi is the average count of component i during one whole run of the program.
These data are obtained from the profile run. Here, w is the number of invocations
used to compute the rating. The variance of the rating, Var(v), is the residual error
of this linear regression.
Figure 3.2 (a) shows an example code with two components. The first component
is the loop body with a variable number, N, of entries during one invocation of the
tuning section. The second component is the tail code with one entry per invocation.
Figure 3.2 (b) shows the Y and C gathered by the performance rating system during
tuning. Each column of Y and C corresponds to the data in one invocation of the
DO I = 1, N
   ...loop body...
ENDDO
...tail code...
(a) A tuning section with two components
Y = [ 11015  5508  6626  6044  8793 ]

C = [ 100  50  60  55  80 ]
    [   1   1   1   1   1 ]

(b) TS-invocation-time vector Y and component-count matrix C collected during tuning

T = [ 110.05  3.75 ]

(c) Component-time vector T by linear regression
Fig. 3.2. A simple example of MBR
tuning section. Linear regression generates the component-time vector T , shown in
Figure 3.2 (c). Given Cavg1 = 75, the rating of this version is 110.05 × 75 + 3.75 =
8257.5.
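The fit in Figure 3.2 can be reproduced with ordinary least squares. The sketch below (plain Python; function and variable names are illustrative) fits the two-component model TTS ≈ T1·C1 + T2 to the figure's data and recovers the T vector and rating reported above.

```python
def fit_components(Y, counts):
    """Least-squares fit of Y[j] ~ t1*counts[j] + t2: t1 is the per-entry
    time of the varying component, t2 the constant component (Cn = 1)."""
    w = len(Y)
    mc = sum(counts) / w
    my = sum(Y) / w
    sxy = sum((c - mc) * (y - my) for c, y in zip(counts, Y))
    sxx = sum((c - mc) ** 2 for c in counts)
    t1 = sxy / sxx             # slope: time per loop-body entry
    t2 = my - t1 * mc          # intercept: constant (tail) component
    return t1, t2

# Data from Figure 3.2 (b): invocation times and loop-body entry counts
Y = [11015, 5508, 6626, 6044, 8793]
counts = [100, 50, 60, 55, 80]
t1, t2 = fit_components(Y, counts)     # -> roughly (110.05, 3.75)
rating = t1 * 75 + t2                  # Cavg1 = 75, as in the text
```

For more than two components, the same idea generalizes to solving Y = T × C in the least-squares sense.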
If there are many components in the execution time model, a large number of
invocations must be measured in order to perform an accurate linear regression.
MBR would lead to a long tuning time in this case and so is not applied. Instead,
RBR, described next, will be applied.
3.2.3 Re-execution Based Rating (RBR)
Re-execution-based rating forces a roll-back and re-execution of a tuning section
under the same input. It is applicable to all our tuning sections; however, it also
generally has the largest overhead. We first present a basic re-execution method,
followed by a method that reduces inaccuracies caused by cache effects.
Step 1. Save Input(TS)
Step 2. Time Version 1 (the current best version)
Step 3. Restore Input(TS)
Step 4. Time Version 2 (the experimental version)
Step 5. Return the two execution times
Fig. 3.3. Basic Re-execution-based rating method (RBR)
Basic RBR method
Figure 3.3 shows the basic idea of RBR. Before each invocation, the input data
to the TS is saved, then Version 1 is timed, the input is restored, and Version 2
is executed. These two execution times can be compared directly to decide which
version is better, since both versions are executed with the same input and, hence,
the same workload.
RBR directly generates a relative performance rating based on the execution times
of the two versions, which are executed during one TS invocation. Suppose that the
execution times of these two versions are Tv1 and Tv2. Then, the performance rating
of Version 2 relative to Version 1 is Rv2/v1.
Rv2/v1 = Tv1/Tv2 (3.8)
If Rv2/v1 is larger than 1, Version 2 performs better than Version 1. Otherwise,
Version 2 performs worse. For multiple versions, we compare their performance
relative to the same base version. For example, if Rv2/v1 is less than Rv3/v1, Version 3
performs better than Version 2. In our tuning system, we use the average of Rvx/vb’s
across a number of TS invocations as the rating of Version vx relative to the base
version vb. The rating of v is computed based on Equation 3.9, where w is the
number of invocations. Similar to CBR and MBR, we compute the rating variance
Var(v).
R(v) = Σ(i=1..w) Rv/vb(i) / w    (3.9)

Var(v) = Σ(i=1..w) (Rv/vb(i) − R(v))² / w    (3.10)
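Equations 3.8 through 3.10 amount to a few lines of arithmetic over the measured invocation times; a sketch with illustrative names:

```python
def rbr_rating(times_v, times_vb):
    """Rating of version v relative to base vb over a window of w invocations.
    Per Eq. 3.8 the per-invocation rating is R_v/vb(i) = T_vb(i) / T_v(i);
    Eqs. 3.9 and 3.10 take their mean and variance. Names are illustrative."""
    ratios = [tb / tv for tv, tb in zip(times_v, times_vb)]
    w = len(ratios)
    r = sum(ratios) / w                          # Eq. 3.9: the rating R(v)
    var = sum((x - r) ** 2 for x in ratios) / w  # Eq. 3.10: its variance
    return r, var
```

A ratio above 1 at some invocation means version v ran faster than the base at that invocation.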
The input set, Input(TS), is obtained through liveness analysis. Input(TS) is
equal to LiveIn(b1), the live-in set of the entry block in TS. (An eligible TS should
not call library functions with side effects, such as malloc, free, and I/O operations.
Right now, we exclude these function calls from the tuning section. As future work,
these functions could be re-written, so that they can be rolled back.)
Improved RBR method
Even under the same input, two invocations of a TS may result in different
execution times. The first invocation preconditions the cache, affecting the execution
time of the second invocation. To address this problem, the improved RBR method
(1) inserts a preconditional version before Version 1 to bring the used data into cache,
and (2) swaps Version 1 and Version 2 at each invocation, so that their order does
not bias the result.
In addition, the improved RBR method saves and restores only the input variables
that are modified in the invocation, the set of Modified Input(TS). (Def(TS) is
the def set of the TS.)
Modified Input(TS) = Input(TS) ∩ Def(TS) (3.11)
Compile time analysis may not be able to determine the exact Modified Input(TS)
set. Before write references to irregular arrays and pointers that are in this set,
inspector code is inserted into the preconditional version to record both the ad-
dresses and the values. The recorded data will be used to restore the input before
re-execution. (This technique is mostly used for C programs instead of Fortran
programs.)
Figure 3.4 shows the improved RBR. This method incurs three types of overhead:
(1) save and restore of the Modified Input(TS); (2) execution of the preconditional
version; and (3) execution of the second code version. The overhead of the save,
restore and precondition code can be reduced through a number of compiler opti-
mizations. For example, the save and restore overhead can be reduced by accurately
RBR(TuningSection TS):
1. Swap Version 1 and Version 2
2. Save the Modified Input(TS)
3. Run the preconditional version
4. Restore the Modified Input(TS)
5. Time Version 1
6. Restore the Modified Input(TS)
7. Time Version 2
8. Return the two execution times
Fig. 3.4. Improved Re-execution-based rating method
analyzing the Modified Input(TS) set. This can be achieved using symbolic range
analysis [63] for regular data accesses. Other optimizations include the combination
of a number of experimental runs into a batch, and the elimination of instructions
from the preconditional version that do not affect cache.
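The sequence of Figure 3.4 can be sketched as a small driver. Everything here is an illustrative stand-in: PEAK generates this instrumentation inside the compiled program rather than invoking Python callables.

```python
import time

def improved_rbr(ts_versions, precond, save_input, restore_input):
    """One invocation of the improved RBR method (Figure 3.4), as a sketch.
    ts_versions holds [current_best, experimental]; save_input/restore_input
    handle the Modified_Input(TS) set; precond warms the cache. All of these
    callables are illustrative stand-ins for PEAK's generated code."""
    ts_versions.reverse()                  # step 1: swap the two versions
    saved = save_input()                   # step 2: save Modified_Input(TS)
    precond()                              # step 3: run the preconditional version
    times = []
    for version in ts_versions:            # steps 4-7: restore input, time version
        restore_input(saved)
        t0 = time.perf_counter()
        version()
        times.append(time.perf_counter() - t0)
    return times                           # step 8: the two execution times
```

Swapping the versions on every invocation averages out any residual ordering bias that the preconditional run does not remove.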
3.3 The Use of Rating Methods in PEAK
We have presented three rating methods: CBR, MBR and RBR. Context-based
rating (CBR) has the least overhead but is not applicable to code without a major
context. Model-based rating (MBR) works for code without a major context, but
is not applicable to irregular programs. Re-execution-based rating (RBR) can be
applied to almost all programs; however, the overhead is the highest among the
three. Generally, the applicability of these three rating approaches increases in the
order of CBR, MBR and RBR; so does the overhead.
Before tuning, our PEAK compiler divides the target program into several tun-
ing sections. The original source program is analyzed according to the techniques
presented in the last section. After one profile run, the PEAK compiler finds the major
context, if it exists, for CBR, and the execution time model for MBR. From the
static analysis and profile information, the PEAK compiler decides which applicable
rating method should be used for each tuning section, in the priority order of CBR,
MBR and RBR. Then, the PEAK compiler inserts three kinds of instrumentation
code into the source program to construct a tuning driver: (1) code to activate per-
formance tuning; (2) code to measure the execution times and to trigger the rating
methods; (3) code to facilitate the rating methods, for example, context match code
for CBR and the preconditional version for RBR.
During tuning, the tuning driver generates the rating, R(v), and the rating vari-
ance, Var(v), across a number of TS invocations, which is called a window. The
tuning driver compares the R(v) of different versions to determine which version is
best. In summary, R(v) and Var(v) are computed as follows.
• CBR: Suppose that T (i, x) is the execution time of the ith invocation under
context x. R(v) and Var(v) under context x are the mean and the variance of
T (i, x), i = 1...w, where w is the window size. They are computed according
to Equations 3.1 and 3.2.
• MBR: R(v) is the execution time estimated from the execution time model, and
Var(v) is the residual error of the linear regression. They are computed ac-
cording to Equations 3.6 and 3.7.
• RBR: Suppose that Rv/vb(i) is the relative performance of version v over base
version vb at the ith invocation. R(v) and Var(v) are the mean and the
variance of Rv/vb(i), i = 1...w, according to Equations 3.9 and 3.10.
To improve rating accuracy, the tuning system applies two optimizations. (1) The
tuning system identifies and eliminates measurement outliers, which are far away
from the average. Such data may result from system perturbations, such as inter-
rupts. (2) The tuning system uses a dynamic approach to determining the window
size. It continually executes and rates a version until the rating variance Var(v)
falls below a threshold. This optimization is applied based on the observation that
V ar(v) decreases with increasing size of the window.
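The two accuracy optimizations, outlier elimination and dynamic window sizing, can be sketched together as follows. The thresholds, bounds, and the k-sigma outlier test here are illustrative assumptions, not PEAK's exact policy.

```python
def rate_version(measure_once, var_threshold=1e-4, min_w=10, max_w=160, k=3.0):
    """Dynamic-window rating sketch: keep collecting rating samples until
    the variance drops below var_threshold, discarding samples that lie
    far from the mean (system perturbations such as interrupts)."""
    samples = []
    for _ in range(2 * max_w):             # bound the total number of measurements
        samples.append(measure_once())
        if len(samples) < min_w:
            continue
        w = len(samples)
        mean = sum(samples) / w
        var = sum((x - mean) ** 2 for x in samples) / w
        # optimization (1): drop outliers beyond k standard deviations
        kept = [x for x in samples if (x - mean) ** 2 <= (k * k) * var]
        if len(kept) < len(samples):
            samples = kept                 # outliers removed; keep measuring
            continue
        # optimization (2): stop once the rating variance is small enough
        if var <= var_threshold or w >= max_w:
            break
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var
```

With well-behaved measurements the loop stops at the minimum window; noisy tuning sections automatically get a larger window.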
3.4 Evaluation on Rating Accuracy
This section evaluates the applicability and accuracy of the three rating methods.
Rating accuracy is represented by the mean and standard deviation of the ratings.
Our experimental system uniformly samples the ratings throughout the execution
under a training input. In this way, it gathers a vector of ratings, [R1, R2, ..., Rn],
where Ri is the R(v) computed at sampling time i, as described in Section 3.3. (Each
rating Ri is based on w invocations of the TS. The experimental version is optimized
under the default GCC -O3 setting, the same as the base version.) So, we can assume
that the ideal rating for CBR and MBR is the average of the Ri, denoted R. The
ideal rating for RBR is 1, since the experimental version is the same as the base version.
R = Σ(i=1..n) Ri / n    (3.12)
We compute the rating error, Xi, at sampling time i.
Xi = Ri/R − 1  for CBR and MBR;   Xi = Ri − 1  for RBR    (3.13)
Table 3.1 shows the statistical characteristics of the rating errors: the Mean, μ, and
the Standard Deviation, σ, which are the measures of rating accuracy.
μ = Σ(i=1..n) Xi / n    (3.14)

σ = √( Σ(i=1..n) (Xi − μ)² / (n − 1) )    (3.15)
High rating accuracy requires that the Mean, μ, be close to zero and that the Standard
Deviation, σ, be small. We also show how these two metrics change with the window
size in Table 3.1.
Table 3.1 shows the most important tuning sections for the selected benchmarks.
The upper half lists the floating point benchmarks; the lower half lists the integer
benchmarks. Integer code exhibits a large number of conditional statements. Be-
cause of this irregularity, the PEAK compiler applies the re-execution-based method
(RBR) to all the integer benchmarks. The floating point benchmarks are more reg-
Table 3.1
Rating accuracy for selected tuning sections

The columns from left to right show the benchmark name, the tuning section name,
the applicable rating approaches, the number of invocations of the tuning section
during one run of the benchmark, and the rating accuracy under different window
sizes. The numbers in the accuracy columns are multiplied by 100 for readability.
(For CBR, multiple rows are used for each tuning section, if there are multiple
contexts.)

Benchmark  Tuning            Rating    #invo-   Rating Accuracy: Mean (Standard Deviation) × 100
Name       Section           Approach  cations  w=10         w=20         w=40         w=80         w=160
applu      blts              CBR       250      0(0.71)      0(0.65)      0(0.57)      0(0.49)      0(0.18)
apsi       radb4(Context1)   CBR       1.37M    0(2.2)       0(2.6)       0(3.0)       0(2.7)       0(1.4)
           radb4(Context2)   CBR                0(0.7)       0(0.7)       0(0.7)       0(0.7)       0(0.5)
           radb4(Context3)   CBR                0(0.5)       0(0.4)       0(0.3)       0(0.3)       0(0.2)
art        match             RBR       250      -0.06(0.28)  -0.07(0.17)  -0.08(0.11)  -0.1(0.07)   -0.09(0.04)
mgrid      resid             MBR       2410     0(1.0)       0(0.82)      0(0.76)      0(0.63)      0(0.48)
equake     smvp              CBR       2709     0(2.7)       0(2.5)       0(2.4)       0(2.1)       0(1.6)
mesa       sample_1d_linear  RBR       193M     -0.05(1.3)   0.07(1.0)    0.03(0.78)   0.07(0.57)   0.02(0.36)
swim       calc3             CBR       198      0(0.33)      0(0.29)      0(0.19)      0(0.06)      0(0.01)
wupwise    zgemm(Context1)   CBR       22.5M    0(1.3)       0(1.1)       0(1.1)       0(0.94)      0(0.86)
           zgemm(Context2)   CBR                0(1.5)       0(1.6)       0(1.6)       0(1.7)       0(1.5)
bzip2      fullGtU           RBR       24.2M    0.95(2.6)    0.5(1.9)     0.27(1.3)    0.09(1.0)    0.07(0.7)
crafty     Attacked          RBR       12.3M    -0.91(2.3)   -0.43(1.7)   -0.25(1.5)   -0.33(1.2)   -0.16(0.8)
gzip       longest_match     RBR       82.6M    -1.0(2.7)    -0.14(1.2)   -0.08(1.1)   -0.1(0.9)    -0.05(0.7)
mcf        primal_bea_mpp    RBR       105K     -0.23(0.92)  -0.18(0.71)  -0.16(0.48)  -0.09(0.36)  -0.11(0.31)
twolf      new_dbox_a        RBR       3.19M    -0.56(1.9)   -0.45(1.3)   -0.36(1.0)   -0.23(0.58)  -0.13(0.37)
vortex     ChkGetChunk       RBR       80.4M    -0.12(3.0)   0.26(1.6)    0.18(1.2)    -0.16(0.97)  -0.11(0.76)
ular. The context-based rating (CBR) and the model-based rating (MBR) methods
are applicable to them.
The last five columns in Table 3.1 show the Mean and the Standard Deviation under
different window sizes. Generally, both metrics decrease with increasing window size.
RBR achieves a very small mean (< 0.002) and a small standard deviation (< 0.016)
with a reasonable window size for all cases. Equake has a relatively high variation,
which we attribute to its irregular memory access behavior, resulting from sparse
matrix operations. We conclude that our rating methods are accurate. Small tuning
sections exhibit more measurement variation but also tend to have higher numbers
of invocations. In these cases, accuracy is achieved through larger window sizes.
The fourth column in Table 3.1 shows the number of invocations to the tuning
section under the training dataset. For some benchmarks, the number of invocations
exceeds one million, while for others it is several hundred. In all benchmarks, the
system may rate multiple versions during one run of the program. The total number
of invocations needed in one tuning is roughly window size × number of versions.
Some benchmarks fit in one run; others require multiple runs. So, PEAK can reduce
the tuning time by a significant amount, which we will show in the following chapters.
4. TUNING SECTION SELECTION
4.1 Introduction
In previous chapters, we have shown that optimization orchestration improves
program performance, and that rating methods based on a partial execution of the
tuning sections can speed up the tuning process. This chapter deals with the problem
of how to select the important code sections in a program as the tuning sections, in
order to achieve fast tuning speed and high tuned program performance.
Basically, tuning sections need to meet the following requirements to achieve the
goal of improving tuning time and program performance.
1. A tuning section should be invoked a large number of times, for example, more
than 100 times, in one run of the program. A large number of invocations
usually means fast tuning. For tuning section TSi, let the average number of
invocations used to rate one optimized version be N1(TSi), the total number of
invocations to the tuning section be Nt(TSi), the number of optimized versions
rated in one run of the program be Nv(TSi).
Nv(TSi) = Nt(TSi)/N1(TSi) (4.1)
So, when the number of invocations Nt(TSi) is large, we may rate a large
number, Nv(TSi), of versions in each run of the program. 1 This means fast
tuning. If there are multiple tuning sections, the tuning time is bound by the
slowest one. Denote the smallest number of invocations to the tuning sections
as Nmin.
Nmin = min_i (Nt(TSi))    (4.2)
So, the first requirement for tuning section selection is to have a large Nmin.
1 Although N1(TSi) may not be the same for different tuning sections, generally, a large Nt(TSi) still means a large Nv(TSi).
2. Tuning sections should cover as large a part of the program as possible. The
coverage of the tuning sections is computed based on the execution times.
Denote the time spent in tuning section TSi as Tt(TSi), which includes the
time spent in all the functions/subroutines invoked within this tuning section,
and the total execution time of the program as Ttotal.
Coverage = (Σi Tt(TSi) / Ttotal) × 100%    (4.3)
A large coverage means that a big part of the program is tuned via optimization
orchestration, so we can achieve good program performance.
3. A tuning section should be large enough that the average execution time spent
in one invocation is sufficiently long, for example, greater than 100μsec. The average
execution time, Tavg(TSi), is computed based on the number of invocations,
Nt(TSi), and the execution time spent in TSi, Tt(TSi).
Tavg(TSi) = Tt(TSi)/Nt(TSi) (4.4)
We do not choose tiny tuning sections, because they usually cause a low timing
accuracy, even though we use a high-resolution timer. The low timing accuracy
results in a low rating accuracy and a large number of invocations per version,
N1(TSi).
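All three requirements are stated in terms of a few profile-derived quantities. Given per-tuning-section invocation counts and times, Equations 4.2 through 4.4 reduce to the following sketch (the profile's dictionary shape is an assumption for illustration):

```python
def ts_metrics(profile, t_total):
    """Metrics behind the three requirements. `profile` maps a candidate
    tuning section's entry function to (Nt, Tt): its invocation count and
    the time spent in it (illustrative shape, not PEAK's actual format).
    Returns per-TS (Nt, Tavg), plus Nmin (Eq. 4.2) and Coverage (Eq. 4.3)."""
    per_ts = {name: (nt, tt / nt) for name, (nt, tt) in profile.items()}  # Eq. 4.4
    n_min = min(nt for nt, _ in per_ts.values())                          # Eq. 4.2
    coverage = sum(tt for _, tt in profile.values()) / t_total            # Eq. 4.3
    return per_ts, n_min, coverage
```

The selection algorithm then wants a large n_min, a coverage near 1, and every per-TS average time above the timing-accuracy bound.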
Applying optimization orchestration to tuning sections separately may achieve
higher program performance than tuning the program as a whole, because different
tuning sections may favor different optimizations. It would be desirable if the code
sections that favor different optimizations could be separated into different tuning
sections. However, this is not practical, as we do not know what optimizations are
beneficial to a code section before tuning. Remember, our approach to performance
tuning is to search for the best optimization combinations for tuning sections.
This chapter presents a tuning section selection algorithm, which meets all the
three aforementioned requirements. Section 4.2 shows the call graph annotated with
execution time profiles used for tuning section selection. Based on the call graph,
the problem of tuning section selection is formally defined in Section 4.3. Section 4.4
presents our tuning section selection algorithm. In this algorithm, nodes for recursive
functions are merged to construct a simplified call graph, which is a directed acyclic
graph. A simple algorithm is designed to select the tuning sections by maximizing the
program coverage under a given constraint for Nmin, working on the simplified call
graph. The final algorithm iteratively calls the previous simple algorithm to trade
off the coverage and Nmin. In Section 4.5, the results of tuning section selection are
discussed.
4.2 Profile Data for Selecting Tuning Sections
From the previous section, a tuning section is selected based on its number of
invocations and its execution time. These data are collected from a profile pass.
In our implementation, we use the call graph profile generated by gprof [64]. This
call graph profile shows how much time is spent in each function and its children,
how many times each function is called and how many times the function calls its
children, during the profile run.
Our tuning section selection algorithm reads the output of gprof and generates a
call graph G = (V,E). 2 This call graph is a directed graph. It has one source (root)
node, δ, whose in-degree is 0, and a set, Γ, of sink (leaf) nodes whose out-degrees are
0. Each node v ∈ V identifies a function. δ identifies the function main. The nodes
in Γ identify the functions that do not call any other function. Each edge e ∈ E
identifies a function call. The associated profile information is as follows.
v = {fn} (4.5)
e = {s, t, n, tm} (4.6)
2 Here, call graph G contains the dynamic calls made during the profile run, not the static calls appearing in the program code. So, if a function call that appears in the code was not executed during the profile run, G does not include this call. However, a static call graph generated by a compiler should include this call. For the purpose of tuning section selection, the dynamic call graph is good enough. After tuning sections are selected, our compiler tools analyze and transform the program; during this process, a static call graph is used. Our tuning section selection algorithm can be applied to the static call graph as well, if profile information is assigned to its nodes and edges.
fn(v) is the function name of the node. s(e) identifies the caller node; t(e) identifies
the callee node; n(e) is the number of invocations to t(e) made by s(e); tm(e) is the
time spent in t(e) and its callees, when t(e) is called from s(e). The next section
will give an example of this call graph in Figure 4.1 and a formal description of the
tuning section selection problem.
Ideally, we could cut the program at any point, for example, at the beginning
of a basic block, to create a tuning section. In practice, we choose tuning sections at
the procedure level, since we use the call graph profile generated by gprof. If we had
accurate basic block profiles, we could select the tuning sections at the basic-block
level. The tuning section selection algorithm would be similar to the one presented
in the next sections. The difference would be that a larger graph would be used,
with basic blocks instead of functions/subroutines as the nodes.
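A minimal container for this annotated call graph might look as follows. The class and method names are illustrative, not PEAK's; only the edge annotations (s, t, n, tm) follow the definitions above.

```python
from collections import defaultdict

class CallGraph:
    """Directed call graph with gprof-style annotations: each edge carries
    n(e), the calls from s(e) to t(e), and tm(e), the time in t(e) and its
    callees on those calls. A sketch of the structure, not PEAK's code."""
    def __init__(self):
        self.edges = []                     # (s, t, n, tm) tuples
        self.succ = defaultdict(list)       # edges leaving each node
        self.pred = defaultdict(list)       # edges entering each node

    def add_call(self, s, t, n, tm):
        e = (s, t, n, tm)
        self.edges.append(e)
        self.succ[s].append(e)
        self.pred[t].append(e)

    def root(self):
        """The unique source node with in-degree 0 (the function main)."""
        nodes = set(self.succ) | set(self.pred)
        return next(v for v in nodes if not self.pred[v])

    def leaves(self):
        """The sink nodes with out-degree 0 (functions calling nothing)."""
        nodes = set(self.succ) | set(self.pred)
        return {v for v in nodes if not self.succ[v]}
```

The selection algorithms in the next sections only need this adjacency information plus the n and tm annotations.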
4.3 A Formal Description of the Tuning Section Selection Problem
Tuning section selection aims at partitioning the program into several compo-
nents, which meet the requirements in Section 4.1. Basically, a tuning section starts
with an entry function, including all the subroutines called directly or indirectly by
the entry function. If one subroutine is called by two different tuning sections, it is
replicated into these two tuning sections. In the representation of the call graph in
Section 4.2, tuning section selection tries to find a set of single entry-node regions
(subgraphs) in G; each region is a tuning section. The entry function of a region
identifies that region. One region can overlap with other regions, although it would
be better if no regions overlap.
The problem of tuning section selection can be described, in a formal way, as an
optimal edge cut problem. Given call graph G = (V, E), find an edge cut (Θ, Ω)
so as to maximize the invocation numbers and the coverage of the tuning sections.
Here, Θ and Ω are a partition of the node set V , such that Θ contains the source
node δ, and Ω contains the set of sink nodes in Γ. This edge cut (Θ, Ω) is a set
of edges, each of which leaves Θ and enters Ω. This edge cut determines the set
[Figure 4.1: a call graph with nodes a, b, c, d, e, f; each edge is labeled with its
number of invocations and, in parentheses, its execution time.]

Fig. 4.1. An example of tuning section selection. The graph is a
call graph with node a as the main function. The weights on an
edge are the number of invocations and the execution time in the
parentheses. The optimal edge cut is (Θ = {a, c}, Ω = {b, d, e, f}),
shown by the dashed curve. Edges (a, b) and (c, f) are chosen as the
S set. Edge (c, e) in the cut (Θ, Ω) is not included in S, because
its average execution time, 1/20000, is less than Tlb = 1e−4. There
are two tuning sections, led by node b and node f : T = {b, f}. The
numbers of invocations to b and f are 1000 and 200 respectively, so
Nmin = 200. The coverage of this optimal tuning section selection is
(80+18)/100 = 0.98, where the total execution time, Ttotal, is 100.
of tuning sections in two steps. (1) Find all the edges in this cut whose average
execution times are greater than Tlb, the lower bound on the average execution time.
i.e., for each edge e ∈ (Θ, Ω), put e in set S, if tm(e)/n(e) ≥ Tlb. This step is done
to meet Requirement 3 in Section 4.1. (2) The edges in set S point to the selected
tuning sections. i.e., make the entry-node set T = {v|v = t(ei), ei ∈ S}. Each node
v in set T identifies a tuning section. (Tuning sections are the subgraphs led by the
entry function v.) Figure 4.1 gives an example.
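The two steps that turn an edge cut into the sets S and T can be written out directly. The sketch below replays the Figure 4.1 example using the three cut edges discussed in the caption; the tuple encoding and names are illustrative.

```python
def select_from_cut(cut_edges, t_lb):
    """Steps (1) and (2): keep cut edges with tm(e)/n(e) >= t_lb (set S),
    then take their callee nodes as tuning-section entries (set T).
    Edges are (s, t, n, tm) tuples; a sketch, not PEAK's implementation."""
    S = [e for e in cut_edges if e[3] / e[2] >= t_lb]
    T = {e[1] for e in S}
    return S, T

# The cut of Figure 4.1: edges leaving Theta = {a, c}, with Tlb = 1e-4
cut = [('a', 'b', 1000, 80), ('c', 'f', 200, 18), ('c', 'e', 20000, 1)]
S, T = select_from_cut(cut, 1e-4)
n_min = min(sum(n for _, t, n, _ in S if t == v) for v in T)   # Eq. 4.8
coverage = sum(tm for *_, tm in S) / 100.0                     # Ttotal = 100
```

Edge (c, e) is filtered out because its average execution time, 1/20000, falls below the bound, matching the figure.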
The tuning section selection algorithm maximizes the invocation numbers and the
coverage of the tuning sections, which are computed from the aforementioned S
and T as follows.
1. The number of invocations to the tuning section v is denoted as Nt(v).
Nt(v) = Σ(e∈S, t(e)=v) n(e)    (4.7)
One goal of the tuning section selection algorithm is to maximize the smallest
Nt(v), denoted as Nmin. (Requirement 1 in Section 4.1.)
Nmin = min_(v∈T) (Nt(v))    (4.8)
2. The other goal of the tuning section selection algorithm is to maximize the
execution coverage. (Requirement 2 in Section 4.1.)
Coverage = Σ(e∈S) tm(e) / Ttotal    (4.9)
Ttotal is the total execution time of the program.
The tuning section selection problem does not always have a reasonable solution.
For example, suppose that a program has only one function, main(), which contains
a loop consuming most of the execution time. If main is chosen as the tuning
section, Nmin is 1 and coverage is 100%. Otherwise, coverage is 0%. The first
solution degrades to the whole-program tuning. The second solution does not find
any tuning section. Neither of them is acceptable. In fact, we should use the loop
body in main as a tuning section. Using a call graph profile, the selection algorithm
cannot identify the loops within a function. Some manual work is needed to find
the loop and extract the loop body into a separate function. We call this process
manual code partitioning. 3 After manual code partitioning, the loop body appears
in the call graph profile, which then is chosen as a tuning section by the selection
algorithm. Finding the functions that are not selected as tuning sections but worth
manual partitioning is another job of the algorithm.
4.4 The Tuning Section Selection Algorithm
From the previous section, the tuning section selection problem can be viewed as
a constrained max cut problem. (The original max cut problem is NP-complete [65].)
3 We could automate this code partitioning process if the profile were provided at the basic-block level.
In this section, we will develop a greedy algorithm to select the tuning sections so as
to get maximal Nmin and coverage. This algorithm solves the problem in two steps.
(1) We design an algorithm which aims to maximize coverage under the constraint
that the number of invocations to each selected tuning section is larger than a lower
bound Nlb. (2) The final algorithm raises the Nlb gradually to trade off the coverage.
It aims to achieve a large Nmin by tolerating a small decrease of the coverage. These
two steps will be presented in Sections 4.4.2 and 4.4.3, respectively. We discuss
the handling of recursive functions in Section 4.4.1.
4.4.1 Dealing with recursive functions
Some programs contain recursive functions. To call a recursive function, the
program makes an initial call to the function. Then the function will be called by
itself, in the case of self-recursion, or by its callees, in the case of mutual-recursion.
Both self-recursive calls and mutually-recursive calls are referred to as recursive calls,
which are different from the initial call. Our PEAK system treats initial calls to a
recursive function as normal function calls; while recursive calls can be viewed as loop
iterations, which are ignored by tuning section selection. In other words, the tuning
section selection algorithm does not choose the call graph edges that correspond to
recursive calls, but only the edges corresponding to initial calls.
In a call graph, the functions (nodes) that recursively call themselves or each
other form cycles (including loops). To exclude recursive calls from tuning section
selection, our algorithm identifies the cycles and ignores the edges that appear in the
cycles. We do this through a call graph simplification process. This process merges
the nodes involved in a common cycle into one node, removes the edges used inside
a cycle, and adjusts the edges entering or leaving the merged nodes. This process
uses the strongly connected components to find the nodes and edges that appear in
a cycle. It adjusts the profile data as well. The pseudo code of this simplification
process is shown in Figure 4.2. An example is shown in Figure 4.3.
Subroutine G = GenerateSimplifiedCallGraph(profile)
1. Construct call graph G = (V, E) according to the profile data as described in
   Section 4.2. For each node v ∈ V , v = {fn}: fn(v) is the function name. For
   each edge e ∈ E, e = {s, t, n, tm}: s(e) is the caller node; t(e) is the callee
   node; n(e) is the number of invocations to t(e) made by s(e); tm(e) is the time
   spent in t(e) and its callees, when t(e) is called from s(e).
2. Remove loops in G. (A loop identifies a self-recursive call.)
3. For each strongly connected component SCC that contains more than one
   node, do the following to remove the cycles by merging the nodes in SCC.
   (Such strongly connected components identify mutually-recursive calls.)
   (a) Construct a new node u for SCC. The name of u, fn(u), is the concate-
       nation of the names of all the nodes in SCC.
   (b) Remove the inner edges, i.e., the edges starting from and ending at SCC.
   (c) Keep all the edges leaving SCC and set their starting node to be the new
       node u. Merge the edges that start from u and end at the same node,
       and sum up the profile data for the merged edges.
   (d) For each node v in SCC, remove v if there is no edge entering v.
       Otherwise, add a new edge e, which leaves v and enters the new node u.
       The profile information for this e is the sum of the profile data for all the
       edges entering v:

       n(e) = Σ(x∈E, t(x)=v) n(x)    (4.10)

       tm(e) = Σ(x∈E, t(x)=v) tm(x)    (4.11)

Fig. 4.2. The pseudo code for call graph simplification. The algorithm
generates a call graph from profile data, and detects and discards
recursive calls. Hence, the call graph is simplified to a directed acyclic
graph.
[Figure 4.3: (a) the call graph before simplification; (b) the call graph after
simplification. Each edge is labeled with its number of invocations and, in
parentheses, its execution time.]

Fig. 4.3. An example of call graph simplification. The graph is a call
graph with node a as the main function. c is a self-recursive function.
b and e recursively call each other. The weights on an edge are the
number of invocations and the execution time in the parentheses.
After simplification, the loop at node c is discarded. The strongly
connected component {b, e} is merged into one node be. The entry
node b for this strongly connected component is kept. A new edge
(b, be) is added. Edges (b, f) and (e, f) are merged to (be, f). The
profile data on edges (b, be) and (be, f) are updated.
The simplification removes the self-recursive calls and makes a new node for
the functions that recursively call each other. The resulting graph is a Directed Acyclic
Graph (DAG). The call graph simplification algorithm maintains the profile infor-
mation for the new nodes and edges. So, the graph has the same profile information
as described in Section 4.2.
4.4.2 Maximizing tuning section coverage under Nlb
This section describes a tuning section selection algorithm, which aims to achieve
as large a coverage as possible, under the constraint that the number of invocations to
each selected tuning section is larger than a lower bound Nlb. This algorithm selects
the tuning sections and puts their entry functions into set T . It finds the functions
that are worth manual code partitioning and puts them into set M . Besides Nlb, this
algorithm uses two other parameters: (1) Tlb, the lower bound on average execution
times; (2) Plb, the lower bound on the execution percentage for a code section worth
manual partitioning.
Tlb is determined by the timing accuracy of the PEAK system. We use 100μsec
in our experiments. Plb is used to determine whether a code section is worth tuning.
We use 0.02. This means that the code section is worth tuning if its execution time
is greater than 2% of the total execution time. Nlb will be adjusted to trade off the
tuning section coverage in the final tuning section selection algorithm described in
Section 4.4.3. The optimal Nlb picked by the final algorithm usually ranges from
tens to thousands.
To maximize the coverage, the algorithm traverses the call graph top-down, in topological order, to select the code sections that meet the requirements. When a tuning section is selected, the
profile data are updated to reflect the execution times and invocation numbers after
excluding this selected tuning section. The execution time due to the selected tuning
section is deducted from the execution time of its ancestors as well. (Remember that
the execution time on node v includes the time spent in v itself and the descendants
of v.) After the selection process finishes, the remaining execution time on each
node v is used to judge whether it is worth manual partitioning. (It is worth manual
partitioning, if its execution time is greater than Plb of the total execution time.)
Figure 4.4 shows the pseudo code for the algorithm to maximize the tuning
section coverage. This algorithm constructs an acyclic call graph, annotated with
profile information after removing the recursive calls, according to the algorithm
described in Section 4.4.1. It ignores the edges whose average execution time is less
than threshold Tlb when computing the execution profile for a node. It goes through
the call graph in a topological order to find the nodes whose numbers of invocations
are greater than threshold Nlb. These nodes are selected as entry functions to the
tuning sections. The profile information of the relevant edges is adjusted to reflect
Subroutine [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb)

There are three thresholds used in this algorithm: Nlb, the lower bound on numbers of invocations; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth manual partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth manual code partitioning are put into M.

1. Construct the simplified call graph G = (V, E), according to Figure 4.2, i.e., G = GenerateSimplifiedCallGraph(profile).

2. Clear the selection flag for each edge e ∈ E: f(e) = 0. Clear the execution time due to the selected tuning sections for each node v: TX(v) = 0. Empty the sets T and M.

3. Mark the edges whose average execution times are less than Tlb, i.e., for each edge e ∈ E, set f(e) = −1 if tm(e)/n(e) < Tlb. (These edges will not be counted when summing the profile data of the edges for one node.)

4. Sort all the nodes into a topological order v1, v2, ..., i.e., if there is an edge (u, v), node u appears before node v. The nodes will be traversed in this order.

5. For node vi (i = 1, 2, ...), compute the total number of invocations to vi:

   n(vi) = Σ_{e∈E, t(e)=vi, f(e)=0} n(e)    (4.12)

   If n(vi) is greater than Nlb, put vi into T and set f(e) = 1 for each edge e with t(e) = vi; then update the profile information of G by calling UpdateProfile(G, vi, TX).

6. Put node v into M if v is not selected but consumes a large amount of execution time, i.e., Σ_{t(e)=v, f(e)≠1} tm(e) − TX(v) > Plb × Ttotal. (These nodes may be manually partitioned to improve tuning section coverage.)

Fig. 4.4. Tuning section selection algorithm to maximize program coverage under the lower bound on numbers of TS invocations, Nlb. This algorithm traverses the simplified call graph top-down to find the code sections whose numbers of invocations are greater than Nlb. In addition, the algorithm finds the functions that may be manually partitioned to improve tuning section coverage.
the number of invocations and execution time spent in the rest of the program,
when a node is selected. In the end, the algorithm finds the functions worth manual
partitioning, using the residual execution time after excluding the selected tuning
sections.
When a tuning section is selected, the profile data need to be updated in order
to exclude the execution information (time and number of invocations) due to the
selected tuning section. This requires a context-sensitive inclusive profile, which
lists the direct and indirect callers of a function, when the execution information
of this function is given. This profile should show the execution time and number
of invocations of each function call; it should also split this information for each
call path. Unfortunately, gprof does not provide such information. Instead, we use
an algorithm to do an estimation, which is described in Figure 4.5. It handles the
descendants and the ancestors of the selected node v separately.
For each descendant u, the algorithm estimates how often u is directly or indi-
rectly called from the selected node v, out of the total invocations to u. We denote
this execution frequency of u as q(u). For the selected node v, q(v) = 1.0. For each
descendant u, q(u) is computed based on the execution frequency of u’s parents and
the numbers of invocation to u directly from the parents. Equation 4.17 does the
estimation, where q(u) is the execution frequency of u.
q(u) = ( Σ_{e∈E, t(e)=u} q(s(e)) × n(e) ) / ( Σ_{e∈E, t(e)=u} n(e) )    (4.17)
Knowing the execution frequency of all the descendants, the profile information for
all the edges reachable from v can be estimated according to Equations 4.18 and 4.19.
n(e) = n(e) × (1 − q(s(e))) (4.18)
tm(e) = tm(e) × (1 − q(s(e))) (4.19)
The algorithm traverses the node list forwards from node v. (The nodes are sorted
in a topological order.)
For each ancestor u of the selected node v, the algorithm estimates how much
execution time is spent in u due to v, which is denoted as p(u). The time spent in
Subroutine UpdateProfile(G = (V, E), vi, TX)

This algorithm removes the execution time due to vi for each edge reachable from vi, and adds the time due to vi into TX for each ancestor of vi. (TX(v) records the time spent in v due to all the selected tuning sections. Manual code partitioning will use TX to estimate the remaining execution time after tuning section selection.)

Update the profile information of the edges that are reachable from node vi:

1. Set the execution frequency of all nodes except vi to 0, q(v) = 0, while q(vi) = 1.0, which means vi is executed 100% within the chosen tuning section.

2. For node vj (j = i+1, i+2, ...), compute the execution frequency of vj from the execution frequency of its parents:

   q(vj) = ( Σ_{e∈E, t(e)=vj} q(s(e)) × n(e) ) / ( Σ_{e∈E, t(e)=vj} n(e) )    (4.13)

3. For node vj (j = i+1, i+2, ...), adjust the profile information for the edges entering vj, i.e., the edges with t(e) = vj:

   n(e) = n(e) × (1 − q(s(e)))    (4.14)
   tm(e) = tm(e) × (1 − q(s(e)))    (4.15)

For the nodes that reach node vi, record how much execution time is spent due to vi. This time will be excluded when choosing manual code partitioning candidates:

1. For each node v, set the time spent in v due to vi, p(v), to 0.

2. Set the time spent in vi: p(vi) = Σ_{e∈E, t(e)=vi} tm(e).

3. For node vj (j = i, i−1, ..., 1), update the time spent in its parents due to the time spent in vi. For each edge e entering vj, i.e., t(e) = vj, adjust p(s(e)) as follows:

   p(s(e)) = p(s(e)) + ( tm(e) / Σ_{a∈E, t(a)=vj} tm(a) ) × p(vj)    (4.16)

4. For node vj (j = i−1, i−2, ..., 1), update the time spent in vj due to all the selected tuning sections: TX(vj) = TX(vj) + p(vj).

Fig. 4.5. Update of the profile data after vi is chosen as the entry function to a tuning section. The updated profile reflects the execution times and invocation numbers after excluding the chosen tuning section.
the selected node v, p(v), is equal to Σ_{e∈E, t(e)=v} tm(e). We assume that the time spent in u due to v is distributed to u's parent x in proportion to the time spent in u when u is called from x. That is, p(u) is distributed to p(x) in proportion to tm(e), where e leaves x and enters u. So, p(u) is distributed according to the following equation:

p(s(e)) = p(s(e)) + ( tm(e) / Σ_{a∈E, t(a)=u} tm(a) ) × p(u)    (4.20)
The algorithm traverses the node list backwards from node v.
Applying the coverage-maximizing algorithm to, for example, the call graph shown in Figure 4.1 with Nlb = 100 yields the optimal tuning section selection.
4.4.3 The final tuning section selection algorithm
The previous section describes an algorithm to maximize the tuning section cov-
erage with the constraint that the number of invocations to each selected tuning
section should be larger than Nlb. The algorithm described in this section aims to
achieve a large Nmin by tolerating a small decrease of the coverage. It does this via
raising the Nlb gradually to trade off the coverage.
Two new parameters are introduced to the final algorithm.
1. Clb, the lower bound on the tuning section coverage. If the coverage of the
selected tuning sections is smaller than Clb, manual partitioning is necessary.
We set it as 80% of the total execution time.
2. Rub, the upper bound on the coverage drop rate. The coverage drop rate is computed from two tuning section selection solutions as follows, where Nmin2 is larger than Nmin1:

   R = (coverage1 − coverage2) / (Nmin2 − Nmin1)    (4.21)
If, on average, increasing Nmin by 1 drops the coverage by more than Rub, the algorithm has found the trade-off point. In our experiments, we tolerate a 1% decrease of the coverage if Nmin can be improved by 100, so we set Rub = 0.01/100 = 1e−4.

Table 4.1
Tuning section selection for mgrid. The best Nlb is 400. The optimal coverage and Nmin are 0.957 and 2000.

  iteration   Nlb    coverage   Nmin
  1             10   0.998       400
  2            400   0.957      2000
  3           2000   0.808      2400
Figure 4.6 shows the pseudo code of this tuning section selection algorithm. This
algorithm iteratively uses the method shown in Figure 4.4 to maximize the tuning
section coverage under a series of thresholds Nlb’s. The new Nlb in the next iteration,
Nlb2, is equal to the Nmin obtained from the previous iteration. We notice that this
Nmin is greater than the old Nlb in the previous iteration, Nlb1, and that any threshold
value in [Nlb1, Nmin) gives the same solution to maximize the coverage. Using Nmin
from the previous iteration as the new threshold value for Nlb makes the trade-off
process fast. This process finishes when the coverage drops below Clb or the coverage
drop rate is greater than Rub.
For example, Table 4.1 shows the result of each iteration when this algorithm is applied to the benchmark mgrid. The second iteration obtains the optimal result, with
coverage = 0.957 and Nmin = 2000. (The initial Nlb is 10, which is a reasonable
boundary for our rating methods to achieve faster tuning than the whole-program
tuning.)
4.5 Results
We apply the tuning section selection algorithm to SPEC CPU2000 benchmarks.
The default threshold values are used, i.e., Rub = 1e−4, Clb = 80%, Tlb = 100μsec,
Plb = 2%. Focusing on the scientific programs, we analyze the FP benchmarks
Subroutine [T, M] = TSSelection(profile, Rub, Clb, Tlb, Plb)

There are four thresholds used in this algorithm: Rub, the upper bound on the coverage drop rate; Clb, the lower bound on the tuning section coverage; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth manual partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth manual code partitioning are put into M.

1. Initialization: Tbest = ∅, Mbest = ∅, coveragebest = 0, Nbest = 0, Nlb = 10.

2. Use the algorithm in Figure 4.4 to maximize the coverage: [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb). Compute coverage and Nmin of the selected tuning sections T.

3. If coverage is less than Clb, do the following.
   (a) If Tbest is ∅, manual partitioning is necessary. Print T and M. Stop.
   (b) Else, Tbest and Mbest are the solution. Print Tbest and Mbest. Stop.

4. Trade off coverage and Nmin as follows.
   (a) If Tbest is ∅, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin.
   (b) Else, compute the coverage drop rate R:

       R = (coveragebest − coverage) / (Nmin − Nbest)    (4.22)

       If the drop rate R is less than Rub, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin. Otherwise, the trade-off point is found. Print Tbest and Mbest. Stop.

5. Set Nlb = Nmin and go to Step 2.
Fig. 4.6. The final tuning section selection algorithm. This algorithmachieves both a large Nmin and a high coverage. It iteratively uses themethod shown in Figure 4.4 to maximize the tuning section coverageunder a series of thresholds Nlb’s, until the optimal Nlb is found.
Table 4.2
Selected tuning sections in SPEC CPU2000 FP benchmarks. (Three manually partitioned benchmarks are annotated with '*'. The last row, wupwise+, uses a smaller Tlb = 1μsec.)

  Benchmark   coverage   Nmin       # of TS   TS names
  ammp        88.6       127        3   torsion, u_f_nonbon, angle
  applu       97.9       250        5   jacld, buts, blts, rhs, jacu
  apsi        87.8       720        9   dctdx, dvdtz, dudtz, dkzmh, dtdtz, dcdtz, leapfr, wcont, hyd
  art         99.9       250        2   match, train_match
  equake      54.6       2709       1   smvp
  equake*     99.0       2709       1   iter_body
  mesa        96.9       4000       1   general_textured_triangle
  mgrid       95.7       2000       4   interp, rprj3, resid, psinv
  sixtrack    10.4       208        2   phasad, clorb
  sixtrack*   97.9       1693       2   thin6d_sub, umlauf
  swim        83.9       198        3   calc3, calc2, calc1
  swim*       99.2       198        4   calc3, calc2, calc1, loop3500
  wupwise     91.7       22         1   matmul
  wupwise+    83.0       22528000   2   su3mul, gammul
in Section 4.5.1 in detail. The results on the INT benchmarks are presented in
Section 4.5.2. The data are listed in Table 4.2 and Table 4.3, respectively.
4.5.1 SPEC CPU2000 FP benchmarks
The second column in Table 4.2 shows the program execution time coverage of the
selected tuning sections. All the benchmarks cover most of the program execution
after code partitioning if needed. (For most benchmarks, the coverage is above 90%.)
The third column in the table shows Nmin, the minimum number of invocations
to the tuning sections. For all the benchmarks except wupwise, Nmin ranges from
hundreds to thousands. For wupwise, Nmin is 22. The reason turns out to be the small functions used in wupwise: these functions are invoked millions of times, while their average execution time is very small, on the order of microseconds. Given a small
threshold of Tlb = 1μsec, we redo the tuning section selection and get a huge Nmin
of 22528000. (Note that our timing implementation is accurate enough to handle microsecond-scale measurements.)
As the second and third columns show, our algorithm achieves the goal of maximizing both the program coverage and the minimum number of invocations to the
tuning sections. We will show the final tuned program performance in the next
chapter.
Table 4.2 shows that three benchmarks (equake, sixtrack and swim) need code partitioning. The general rule is to extract the body of the important loop in the candidate function into a separate function. The candidate functions are identified by our tuning section selection algorithm as well; essentially, these functions account for a large share of the execution time but receive only a few invocations. After partitioning, the new function covers a large part of the program execution time and is invoked a large number of times. The code partitioning is done as follows for the three benchmarks.
1. In equake, the function main contains a large loop. Since main is invoked only once, our tuning section selection algorithm does not pick it as a tuning section. So, we extract the loop body into a new function called iter_body, which is then invoked many times and selected as the tuning section. (The iter_body function calls the function smvp, which was selected before manual code partitioning.)

2. In sixtrack, the function thin6d contains a large loop. Since thin6d is invoked only once, it is not picked as a tuning section. As in equake, we extract the loop body into a separate function, thin6d_sub, which is then picked as a tuning section.

3. In swim, the main program contains an important loop, which is extracted into a new tuning section, loop3500.
4.5.2 SPEC CPU2000 INT benchmarks
As Table 4.3 shows, our tuning section selection algorithm achieves the goal of maximizing the program coverage and the minimum number of invocations to the tuning
sections for INT benchmarks as well.
In general, the profiles of SPEC CPU2000 INT benchmarks are different from
those of SPEC CPU2000 FP benchmarks: (1) The INT benchmarks have flatter
profiles with more functions in the call graphs. (2) The call graphs of the INT
benchmarks are deeper. (3) The functions in the INT benchmarks that are invoked
many times usually have small average execution times.
Due to (3), the default threshold of the average execution time, Tlb = 100μsec,
prohibits the selection of some important functions in gzip, mcf, perlbmk, vortex,
and vpr. After using a smaller Tlb ranging from 0.01μsec to 1μsec, their program
coverage is improved. (For these benchmarks, a high resolution timer is needed to
generate accurate performance ratings for the selected tuning sections.)
In bzip2, gap and perlbmk, some important functions are only invoked tens of
times, which brings down Nmin, the minimum number of invocations to the tuning
sections. For other benchmarks, Nmin is hundreds to thousands.
Table 4.3
Selected tuning sections in SPEC CPU2000 INT benchmarks. (The benchmarks annotated with '+' use smaller Tlb's.)

  Benchmark   coverage   Nmin    #TS   TS names
  bzip2       99.9       22      6   doReversibleTransformation, generateMTFValues, sendMTFValues, loadAndRLEsource, undoReversibleTransformation_fast, getAndMoveToFrontDecode
  crafty      100.0      1272    1   Search
  gap         96.6       24      3   EvFunccall, EvVarAss, EvElmList
  gcc         90.7       109     1   rest_of_compilation
  gzip        41.7       3668    3   fill_window, flush_block, inflate_dynamic
  gzip+       84.1       4021    5   ct_tally, fill_window, flush_block, inflate_dynamic, longest_match
  mcf         46.6       5235    1   refresh_potential
  mcf+        74.7       5235    2   refresh_potential, primal_bea_mpp
  parser      91.6       309     3   prepare_to_parse, parse, expression_prune
  perlbmk     76.2       11      5   incpush, Perl_newXS, Perl_av_push, Perl_gv_fetchpv, Perl_sv_free
  perlbmk+    92.6       11      6   incpush, Perl_newXS, Perl_av_push, Perl_gv_fetchpv, Perl_sv_free, Perl_runops_standard
  twolf       98.0       120     1   uloop
  vortex      78.0       6209    2   BMT_Validate, SaFindIn
  vortex+     95.5       4000    7   BMT_CommitPartDrawObj, BMT_Validate, PersonObjs_FindIn, BMT_DeletePartDrawObj, Object_Delete, OaDeleteFields, SetAddInto
  vpr         52.7       10746   1   route_net
  vpr+        98.3       10746   2   route_net, try_swap
5. THE PEAK SYSTEM
5.1 Introduction
It is well understood that optimization techniques do not always improve program performance significantly. They may have only negligible effects, or may even degrade performance in unexpected cases. The interaction between optimizations makes it difficult for a programmer to find the best optimization combination for a given program. Our PEAK system automates this process
using a feedback-directed approach. Chapter 2 presents a fast and effective orches-
tration algorithm, which iteratively generates optimized code versions and evaluates
their performance based on the execution time under a training input until the best
version is found. Noticing that using a partial execution of the program (i.e., a few
invocations to a code section) may achieve accurate performance evaluation in a
faster way, we developed three rating methods applied to important code sections,
called tuning sections, in Chapter 3. These rating methods evaluate the performance
of an optimized version of a tuning section based on a number of invocations to the
tuning section. Chapter 4 designs an algorithm to select the tuning sections out of
a program to maximize both the program execution time coverage and the number
of invocations, aiming at high tuned program performance and fast tuning speed.
This chapter puts everything together to construct an automated performance
tuning system. Section 5.2 shows the design of the PEAK system, which applies the
described techniques into two primary parts: the PEAK compiler and the PEAK
runtime system. Special implementation problems related to runtime code gener-
ation and loading are also discussed. Section 5.3 shows an example of using the
PEAK system. Section 5.4 analyzes the experimental results of PEAK focusing on
the SPEC CPU2000 FP benchmarks, using two metrics: tuning time and tuned
program performance. On average, compared to the whole-program tuning presented in Chapter 2, PEAK reduces the tuning time from 2.19 hours to 5.85 minutes and improves the performance gain from 11.7% to 12.1%.
5.2 Design of PEAK
5.2.1 The steps of automated performance tuning
The PEAK system has two major parts: the PEAK compiler and the PEAK
runtime system. The PEAK compiler is used before tuning, while the PEAK runtime
system is used during tuning.
Figure 5.1 shows a block diagram of the PEAK system, which lists all the compo-
nents in PEAK and all the performance tuning steps. Steps 1 to 4 are taken before
tuning to construct a tuning driver for a given program. In these steps, the PEAK
compiler analyzes and instruments the source code. During performance tuning at
Step 5, the tuning driver continually runs the program under a training input until
the best version is found for each tuning section. In this step, the PEAK runtime
system is involved in dynamically generating and loading optimized versions, rating
these versions, and feeding new optimization combinations to the tuning driver. Af-
ter tuning, in Step 6, each tuning section is compiled under its best optimization
combination and linked to the main program to generate the final tuned version. In
detail, PEAK takes the following steps.
1. The tuning section selector chooses the important code sections as the tuning
sections, using the call graph profile generated by gprof. It applies the algo-
rithm described in Chapter 4. The output of this step is a list of function
names, each of which identifies a tuning section.
2. The rating method consultant analyzes the source program to find the applica-
ble rating methods for each tuning section. The compiler techniques developed
in Chapter 3 are implemented here. This tool annotates the program with the
information about the context variables for CBR and the performance model
for MBR.
Fig. 5.1. Block diagram of the PEAK performance tuning system
3. The PEAK instrumentation tool applies the appropriate rating method to each
tuning section, after a profile run to find the major context for CBR and the
model parameters for MBR according to Chapter 3. It adds the initialization
and finalization functions to activate the PEAK runtime system and the func-
tions to load and save the tuning state of previous runs, since the performance
tuning driver may run the program multiple times in Step 5. The instru-
mentation tool retrieves each tuning section into a separate file, which will
be compiled at Step 5 under different optimization combinations to generate
optimized versions.
4. The instrumented code is compiled and linked with the PEAK runtime sys-
tem, which is provided in a library format, to construct the performance tuning
driver. The PEAK runtime system implements the three rating methods devel-
oped in Chapter 3 and the CE optimization orchestration algorithm developed
in Chapter 2. Special functions for dynamically loading the binary code during
tuning are also included in the PEAK runtime system. (The generation of the
tuning driver and the optimized tuning section versions is done by the backend
compiler, in this thesis, the GCC compiler.)
5. The performance tuning driver iteratively runs the program under a training
input until optimization orchestration finishes for all the tuning sections. At
each invocation to a tuning section, the driver takes over the control. It runs
and times the current experimental version and decides whether more invoca-
tions are needed to rate the performance of this version. After the rating of
this version is done (i.e., when the rating variance is small enough), the driver
generates new experimental versions according to the orchestration algorithm.
(The tuning sections are tuned independently.) The tuning process ends when
the best version is found for each tuning section.
6. After the tuning process finds the best optimized version for each tuning sec-
tion, these best versions are linked to the main program to generate the final
version. Here, the main program is the original source program with the tuning
sections removed. The final version is the one to be delivered to the end users.
This completes the tuning process.
5.2.2 Dynamic code generation and loading
PEAK tunes program performance via executing the program under a training
input. It generates and loads the binary code at runtime. (This distinguishing
feature is inherited from the ADAPT [24, 25] infrastructure.) To do so, PEAK uses
the dynamic linking facility functions: dlopen, dlsym, dlclose and dlerror. Basically,
these functions enable PEAK to load binary code into memory and to resolve the
address of the experimental version contained in that binary. (The binary code is
generated by the backend compiler, GCC, under the given optimization options.)
To generate an optimized version for a tuning section separately, excluding other
unrelated code, PEAK extracts each tuning section into a separate source file. This
source file includes the entry function to the tuning section and all its direct and
indirect callees. So, this source file can be compiled and optimized separately. Since
callees are included in the tuning section, inlining can be performed during code
generation. If a subroutine/function is called in two tuning sections, this subrou-
tine/function is replicated and renamed in the corresponding source files to avoid
name conflicts at link time. Such replication does not lead to code explosion, because
the number of tuning sections is fixed and usually small, around three. Our PEAK
compiler does the above job automatically using source-to-source compilation.
To load an optimized version successfully and correctly, PEAK needs to pay
attention to the global variables used in the tuning sections, especially the static
variables in C and the common blocks in Fortran. Multiple versions of the same
tuning section are invoked during one run of the program. Each global variable
used in these versions should resolve to the same address when the versions are
loaded. To solve this issue, a number of details are important:
1. When the tuning driver is compiled, all the global symbols are added to the
dynamic symbol table. The dynamic symbol table is the set of symbols which
are visible from dynamic objects at run time. When loading an optimized
version of a tuning section, the loader can use this table to locate the global
variables used in the optimized version. To put the global symbols into the
dynamic symbol table, the option of export-dynamic is passed to the linker
during the tuning driver generation.
2. The static variables in C programs are promoted to the global scope and are
given globally unique names. This is because an optimized version makes
a local copy of static variables during dynamic code loading. If the static
variables are not promoted, different code versions of the same tuning section
use different copies of the static variables, which leads to incorrect execution.
Our PEAK compiler does this promotion for both file-scope static variables
and function-scope static variables. (For Fortran programs, the local variables
are put into a common block, if a save statement is specified in the subroutine.)
3. For Fortran programs, our binary editing tool finds the common blocks in
the symbol table of an optimized binary code and makes the common blocks
linkable to the actual definition in the main program. This is because the
dynamic loader makes different copies of the same common block for different
code versions, just like static variables in C. Different from C, we need to work
at the binary level to solve this problem. After modifying the corresponding
attribute of these symbols, the binary editing tool makes the common block
linkable to the actual definition during dynamic loading. 1 This tool uses the
libelf functions.
5.3 An Example of Using PEAK
This section uses the benchmark swim to illustrate the six tuning steps of PEAK.
1. The tuning section selector uses a gprof output and selects four tuning sections:
calc1, calc2, calc3 and loop3500. Figure 5.2 shows the source code of the tuning
section calc1 as an example.
1 Our binary editing tool changes the symbol attribute "st_shndx" from "SHN_COMMON" to "SHN_UNDEF" for the common blocks. "SHN_COMMON" marks a common block, for which storage will be allocated during linking. "SHN_UNDEF" marks an undefined symbol, whose references will be linked to the actual definition during linking.
SUBROUTINE calc1
...
DO j = 1, n, 1
DO i = 1, m, 1
...
ENDDO
ENDDO
DO j = 1, n, 1
...
ENDDO
DO i = 1, m, 1
...
ENDDO
...
RETURN
END
Fig. 5.2. An example of the tuning section calc1 in swim
2. The rating method consultant analyzes the source code to find the applicable
rating methods. For calc1, all three rating methods are applicable. The context
variables are n and m. After one profile run, PEAK finds that n and m are
runtime constants. So, there is only one context for calc1 and context-based-
rating will be applied.
3. The PEAK instrumentation tool adds PEAK runtime library calls to the source
program. Each tuning section is retrieved into a separate file, which will be
compiled under different optimization combinations to generate experimental
versions during performance tuning. The instrumentation code is added to
each tuning section and the entry and exits of the program.
• Each entry to a tuning section is redirected to the corresponding instru-
mented code. Figure 5.3 shows the instrumented calc1, which calls two
PEAK runtime functions, DCGetVersion and DCMonRecordCBR.
(a) DCGetVersion implements the optimization orchestration algorithm
and dynamic code generation and loading. It returns the current
extern void calc1_old_();
void calc1_()
{
hrtime_t t0, t1;
typedef void(*FunType)();
FunType fun = NULL;
hrtime_t tm;
//Experiment the current optimized version
if( (fun=(FunType)DCGetVersion(0)) != NULL ) {
t0 = gethrtime();
fun(); //invoke the experimental version
t1 = gethrtime();
tm = t1 - t0; //time the current invocation
DCMonRecordCBR(0, tm); //rating generation
return;
}
//Optimization orchestration is done
calc1_old_();
}
Fig. 5.3. The tuning section calc1 instrumented by the PEAK compiler
experimental version. If the previous version has already been rated,
this call will generate and load a new optimized version.
(b) DCMonRecordCBR passes the invocation time obtained from a high
resolution timer to the context-based-rating method. The rating
method rates the current version based on a few invocations. The
orchestration algorithm uses the ratings of previous versions to guide
the generation of new optimized versions.
(c) The argument, 0, of these two functions identifies the tuning section
of calc1.
In the code, calc1_old_ is the original version of the tuning section.
• At the entry of the entire program, the function init_dc is called to set
up the tuning state. Figure 5.4 lists the instrumented code.
(a) DCInit(4) allocates the memory and initializes the tuning state for
four tuning sections.
void init_dc()
{
DCInit(4);
DCOrchGCCBaseO3();
DCOrchSetSuffix(".f");
DCSetTSProperty(0, DC_TS_NAME, "calc1");
DCSetTSProperty(0, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(0, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(1, DC_TS_NAME, "calc2");
DCSetTSProperty(1, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(1, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(2, DC_TS_NAME, "calc3");
DCSetTSProperty(2, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(2, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(3, DC_TS_NAME, "loop3500");
DCSetTSProperty(3, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(3, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCLoad("DCdump.dat");
}
Fig. 5.4. The initialization function instrumented by the PEAK compiler
(b) DCOrchGCCBaseO3() sets up the orchestrated GCC O3 optimiza-
tions, using “O3” as the baseline.
(c) DCOrchSetSuffix(“.f”) specifies that the tuned program is in Fortran.
(d) The series of calls to DCSetTSProperty() specifies, for each tuning
section, its name, optimization orchestration algorithm and rating
method.
(e) DCLoad(“DCdump.dat”) loads the tuning state from the previous
run, because multiple runs of the program may be involved during
performance tuning.
• At the exits of the program, the function exit_dc is called to save the
tuning state. Figure 5.5 lists the instrumented code. DCDump() and
DCFinalize() are called to save the tuning state and to free the allocated
memory.

void exit_dc()
{
    DCDump("DCdump.dat");
    DCFinalize();
}

Fig. 5.5. The exit function instrumented by the PEAK compiler
4. The instrumented program is compiled and linked with the PEAK runtime
library to generate the performance tuning driver.
5. During performance tuning, the tuning driver repeatedly runs under a training
input until optimization orchestration is done for all the tuning sections. Its
output is the best optimization combination for each tuning section.
An example output for calc1 on a Pentium IV machine is
“g77 -c -O3 -fno-strength-reduce -fno-rename-registers -fno-align-loops calc1.f”.
This means that three optimizations, strength-reduce, rename-registers and
align-loops, should be turned off from the baseline O3.
6. In the final version generation stage, each tuning section adopts the correspond-
ing best optimization combination obtained from the previous step. These
optimized tuning sections are linked to the main program, which is compiled
under the default optimization combination, to generate the final version.
To summarize: in Step 1, the tuning section selection algorithm chooses tuning
sections based on an execution-time profile; in Steps 2 and 3, the PEAK compiler
analyzes and instruments the source code to prepare for performance tuning; in
Step 4, the instrumented code is compiled and linked with the PEAK runtime
system to generate a tuning driver; in Step 5, the tuning driver uses the PEAK
runtime system for dynamic code generation and loading, performance rating and
optimization orchestration; in Step 6, the final version is generated.
5.4 Experimental Results
We orchestrate the 38 GCC O3 optimizations for SPEC CPU2000 benchmarks,
on the same machines used in Chapter 2, a Pentium IV machine and a SPARC II
machine. For each benchmark, PEAK tunes the performance of the selected tuning
sections listed in Chapter 4. The goal of this experiment is to compare PEAK with
the whole-program tuning shown in Chapter 2. Similarly, we use two metrics: the
tuning time and the tuned program performance. Still, the tuning time is normalized
by the time to evaluate the performance of the base version according to Equation 2.7.
Since PEAK applies optimization orchestration to tuning sections separately, we
expect PEAK to outperform the whole-program tuning in two respects.
1. PEAK takes much less tuning time than the whole-program tuning. This is
because PEAK evaluates the performance of an optimized version based on a
partial execution of the program, while the latter uses the complete run.
2. PEAK achieves equal or better program performance than the whole-program
tuning. Since our tuning section selection algorithm covers most of the pro-
gram, our PEAK system should achieve performance at least equal to that of
the whole-program tuning. In cases where the tuning sections favor different
optimizations, PEAK can achieve better performance than the whole-program
tuning.
Focusing on scientific programs, we present the detailed experimental results on
SPEC CPU2000 FP benchmarks in terms of these two metrics separately in Sec-
tion 5.4.1 and Section 5.4.2. Section 5.4.3 presents the results on INT benchmarks.
5.4.1 Tuning time
Figure 5.6 shows the normalized tuning time of the whole-program tuning and
the PEAK system. (For PEAK, the tuning time includes the time spent in all
six tuning steps.) On average, the normalized tuning time is reduced from 68.3 to
3.36. So, PEAK gains a speedup of 20.3.

[Figure 5.6: bar chart comparing the normalized tuning time of Whole and PEAK
for each FP benchmark (ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack,
swim, wupwise) and the geometric mean.]

Fig. 5.6. Normalized tuning time of the whole-program tuning and the PEAK
system for SPEC CPU2000 FP benchmarks on Pentium IV. Lower is better. On
average, PEAK gains a speedup of 20.3.

The benchmarks that gain a high speedup usually have a large number of invocations
of the tuning sections, as shown in Table 4.2. This agrees with our assumption
during the tuning section selection in Chapter 4. On average, the absolute tuning
time is reduced from 2.19 hours to 5.85 minutes. Applying the rating methods of
Chapter 3 significantly reduces the tuning time.
Figure 5.7 shows the tuning time percentages of the six tuning steps. Most of
the time is spent in Step 5, the performance tuning (PT) stage. The second largest
portion of the tuning time is spent in Step 1, the tuning section selection (TSS)
stage, because Step 1 performs a profile run, which in some cases (e.g., wupwise)
takes more time than a normal run of the program. The third largest portion of the
tuning time is spent in Step 2, the rating method analysis (RMA) stage. Some of
this time is spent in data flow analysis; some is spent in the profile run that obtains
the context parameters for CBR and the execution-model parameters for MBR. For
programs with a large amount of source code (e.g., ammp and mesa), compiling the
source code into the internal representation of our source-to-source compiler also
accounts for a significant portion of the tuning time. (This compilation time comes
from the compiler infrastructure we use to implement our PEAK compiler.2)

[Figure 5.7: stacked bar chart of the percentage of the total tuning time spent in
each of the six stages for each FP benchmark.]

Fig. 5.7. Tuning time percentage of the six stages for SPEC CPU2000 FP
benchmarks on Pentium IV. (TSS: tuning section selection, RMA: rating method
analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning,
FVG: final version generation.) The most time-consuming steps are PT, TSS and
RMA.
The results on SPARC II are similar. The normalized tuning time is reduced from
63.42 to 4.88, with a speedup of 13.0. The absolute tuning time is reduced from 9.83
hours to 43.7 minutes. (The SPARC II machine is slower than the Pentium IV
machine.)
2 Our PEAK compiler is developed based on the Polaris [60] compiler for Fortran programs and the SUIF2 [61] compiler for C programs.
5.4.2 Tuned program performance
Figure 5.8 shows the program performance achieved by the whole-program tuning
and the PEAK system on Pentium IV. We use the train dataset as the input to the
tuning process.
The first two bars show the performance of the final tuned version under the
same train dataset for the whole-program tuning and the PEAK system. For all
the benchmarks, PEAK achieves equal or better performance; it outperforms the
whole-program tuning by 1.5% on applu. Some benchmarks, such as equake and
mesa, have only one tuning section. Others, such as art, swim and mgrid, have
similar code structure in the selected tuning sections, so the tuning sections favor
similar optimizations. These two observations explain why PEAK does not
significantly outperform the whole-program tuning in terms of tuned program
performance.
A fair performance evaluation should use an input different from the training
input. To this end, we use the ref dataset to evaluate the performance of the tuned
version. (The train dataset remains the input to the tuning process.) The results
are shown by the last two bars. Using a different input still yields performance
similar to that of the first two bars. So, our tuning scenario does find a combination
of compiler optimizations that performs much better than the default optimization
configuration.
On average, PEAK improves performance by 12.0% and 12.1% with respect to
the train and ref datasets, while the whole-program tuning improves it by 11.9%
and 11.7%. The results on SPARC II are similar: PEAK improves performance by
4.1% and 3.7% with respect to train and ref, as does the whole-program tuning.
So, PEAK achieves equal or better program performance than the whole-program
tuning.
[Figure 5.8: bar chart of the relative performance improvement percentage for each
FP benchmark and the geometric mean; series Whole_Train, PEAK_Train,
Whole_Ref and PEAK_Ref.]

Fig. 5.8. Program performance improvement relative to the baseline under O3 for
SPEC CPU2000 FP benchmarks on Pentium IV. Higher is better. All the
benchmarks use the train dataset as the input to the tuning process. Whole_Train
(PEAK_Train) is the performance achieved by the whole-program tuning (the
PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref
dataset to evaluate the tuned program performance, but still the train dataset for
tuning. PEAK achieves equal or better program performance than the
whole-program tuning.
5.4.3 Integer benchmarks
Integer benchmarks generally have irregular code structures with many condi-
tional statements. Moreover, the use of pointers complicates rating-method analy-
sis. As a result, only re-execution-based rating is applicable. On the other hand,
the selected tuning sections in some benchmarks use library functions with side
effects, for example, the memory allocation functions. PEAK fails to tune these
benchmarks, because it cannot roll back the execution of the tuning section.3
To illustrate the performance of PEAK on INT benchmarks, we experiment with
five benchmarks, bzip2, crafty, gzip, mcf and parser, whose tuning sections do not
call such side-effecting functions. (One future research topic is to include the source
code of these library functions in the main program, so that the PEAK compiler
has a way to roll back the execution.)
Figure 5.9(a) shows the normalized tuning time. PEAK speeds up the tuning
process for INT benchmarks as well. On average, the normalized tuning time is
reduced from 73.48 to 13.37, and the absolute tuning time is reduced from 71.64
minutes to 14.67 minutes. bzip2 does not gain a high speedup due to its small
Nmin of 22. Besides, there are three major reasons why the speedup for INT
benchmarks is not as high as that for FP benchmarks: (1) Due to the irregularity
of the code, INT benchmarks can only use re-execution-based rating, which
introduces more overhead than the other two rating methods used in FP
benchmarks, mainly because of the execution of the preconditional version and the
base version. (2) The PEAK compiler spends more time analyzing INT
benchmarks, which generally have more code, especially crafty and parser. (3)
Unlike FP benchmarks, the tuning sections in INT benchmarks generally do not
have a similar code structure; different tuning sections may favor different
optimizations, which leads to more experimental versions, especially in parser.
Figure 5.9(b) shows the tuning time percentages of the six tuning steps. The
most time-consuming components are performance tuning, rating method analysis,
and tuning section selection. Due to the larger source code size, rating method
analysis and code instrumentation spend more time on INT benchmarks than on FP
benchmarks.
3 The INT benchmarks have more code than the FP benchmarks. Our PEAK compiler for C, which is based on SUIF2, has difficulty processing some of these benchmarks, failing to pass the C-to-SUIF conversion.
[Figure 5.9(a): bar chart comparing the normalized tuning time of Whole and
PEAK for bzip2, crafty, gzip, mcf, parser and the geometric mean.]

(a) Normalized tuning time of the whole-program tuning and the PEAK system for
SPEC CPU2000 INT benchmarks on Pentium IV. Lower is better. On average,
PEAK gains a speedup of 5.5.
[Figure 5.9(b): stacked bar chart of the percentage of the total tuning time spent in
each of the six stages for each INT benchmark.]

(b) Tuning time percentage of the six stages for SPEC CPU2000 INT benchmarks
on Pentium IV. (TSS: tuning section selection, RMA: rating method analysis, CI:
code instrumentation, DG: driver generation, PT: performance tuning, FVG: final
version generation.) The most time-consuming steps are PT, RMA and TSS.
Fig. 5.9. PEAK tuning time for INT benchmarks on Pentium IV
[Figure 5.10: bar chart of the relative performance improvement percentage for
each INT benchmark and the geometric mean; series Whole_Train, PEAK_Train,
Whole_Ref and PEAK_Ref.]

Fig. 5.10. Program performance improvement relative to the baseline under O3 for
SPEC CPU2000 INT benchmarks on Pentium IV. Higher is better. All the
benchmarks use the train dataset as the input to the tuning process. Whole_Train
(PEAK_Train) is the performance achieved by the whole-program tuning (the
PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref
dataset to evaluate the tuned program performance, but still the train dataset for
tuning.
Figure 5.10 shows the tuned program performance. On average, PEAK improves
performance by 4.6% and 4.2% with respect to the train and ref datasets, while
the whole-program tuning improves it by 4.4% and 4.2%, respectively. PEAK
achieves lower performance than the whole-program tuning on gzip, because of
gzip's low program coverage of 84%. For parser, PEAK achieves much better
performance than the whole-program tuning.
6. CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
The techniques developed in this thesis have led to the creation of the automated
performance tuning system called PEAK. PEAK searches for the best compiler opti-
mization combinations for important tuning sections in a program, using a fast and
effective optimization orchestration algorithm – Combined Elimination. Three fast
and accurate rating methods – CBR, MBR and RBR – are developed to evaluate the
performance of an optimized version based on a partial execution of the program.
The PEAK compiler selects the important code sections for performance tuning, an-
alyzes the source program for applicable rating methods, and instruments the source
code to construct a tuning driver. The tuning driver adopts a feedback-directed tun-
ing approach. It continually runs the program, loads experimental versions generated
under different optimization combinations, rates these versions based on the execu-
tion times, and explores the optimization space according to the generated ratings,
until the best optimization combination is found for each tuning section. In addi-
tion to the above functionalities, the PEAK runtime system provides the facilities
to dynamically load executables at runtime.
PEAK achieves fast tuning speed and high tuned program performance. When
our Combined Elimination (CE) algorithm is applied to the whole program, the
program performance is improved by 12% over GCC O3 for SPEC CPU2000 FP
benchmarks and 4% for INT benchmarks; the tuning time is reduced to 57% of the
closest alternative algorithm, on average. Using the Sun Forte compilers, CE improves
performance by 10.8% for FP benchmarks, compared to 5.6% achieved by manual
tuning, and by 8.1% versus 4.1% for INT benchmarks. After applying the rating methods to
the tuning sections, PEAK reduces tuning time from 2 hours to 4.9 minutes for FP
benchmarks, and from 1.2 hours to 13 minutes for selected INT benchmarks, while
achieving equal or better program performance.
PEAK can be applied to different computer architectures and backend compilers.
The PEAK compiler works at the source program level, and the PEAK runtime
system is provided in the form of a library, so PEAK can be easily ported to new
computer systems. Furthermore, PEAK invokes the backend compiler through a
command line, and the tuned optimizations are controlled via compiler options. So,
new optimization techniques can easily be plugged into PEAK, provided the
corresponding options are available.
6.2 Future Work
We have developed the PEAK system as a prototype for automatic performance
tuning. One could improve this system by working at the basic-block level: using
basic-block profiles, one could automate the manual code-partitioning process. One
could also experiment with more compiler optimizations. If the compiler source code
is available, one could even experiment with the internal optimization parameters
that are not usually exposed to the end user. Other directions for future work follow.
6.2.1 Performance analysis on compiler optimizations
This thesis shows that optimization orchestration can improve program perfor-
mance significantly. The main reasons for the negative effects of compiler optimiza-
tions are that the optimizations lack accurate information about the program and
that the interactions between optimizations are hard to predict. Sometimes, a
simplistic implementation of an optimization also causes performance degradation.
As an effort to analyze the performance effects of the compiler optimizations,
the appendix discusses the performance behavior of all the GCC O3 optimization
options on the SPEC CPU2000 benchmarks, using a Pentium IV machine and a
SPARC II machine. The reasons for performance degradation are analyzed for several
important optimizations. One important finding is that optimizations may exhibit
unexpected performance behavior – even generally-beneficial techniques may degrade
performance. Degradations are often complex side-effects of the interaction with
other optimizations. 1
More work is necessary on the performance analysis of compiler optimizations,
especially the interactions between optimizations. Such analysis can be used
to improve the compiler optimizations themselves. (A large number of
optimizations exist; this thesis analyzes only a few of them, shown in the appendix.)
6.2.2 Other tuning problems
The primary goal of PEAK is to find the best compiler optimization combination
for each tuning section, so that the tuned program achieves better performance
than the program compiled under the default optimization configuration. Moreover,
PEAK can be extended to solve other tuning problems.
PEAK can tune the performance of a library by finding the best optimized version
for each library function. Each library function works as a tuning section. The
task is then to create a set of driver routines that call the library functions with
representative calling parameters.
PEAK can tune the performance of backend compilers, so as to find the best
default compiler optimization configuration. In this case, PEAK rates the compiler
optimizations based on performance summaries on a set of benchmarks.
PEAK can tune parameters other than compiler optimizations. For example, it
can select the best among several candidate algorithms for solving the target problem;
here, each experimental version implements one algorithm. Another example is
using PEAK to optimize parallel programs.
PEAK can tune program performance in between production runs, instead of
only before production runs. If the compiler optimizations are sensitive to the
execution environment (e.g., the computer system configuration and the program
input), the tuning process can be repeated when this environment changes.

1 [43] provides a detailed report on the performance of GCC optimizations.
6.2.3 Adaptive performance tuning
PEAK tunes program performance using a profile-based mechanism: it uses a
training input rather than the actual input. Using the actual input would require
PEAK to tune program performance adaptively at runtime. There are two reasons
for our focus on the profile-based mechanism: (1) The compilation and tuning
overhead of adaptive performance tuning may be too high to be amortized by the
performance gain, especially when the search space is huge. (2) In many cases,
program performance is not very sensitive to the program input, so the profile-based
approach is good enough for performance tuning.
If the performance effects of some optimizations vary during the execution of the
program, adaptive performance tuning may achieve even better performance than
the profile-based approach. PEAK could be used to do this job, since it supports
runtime analysis: (1) the rating methods do not affect the correctness of the
program; (2) the rating methods cause little performance overhead; (3) the PEAK
runtime can generate and load optimized code at runtime.
Still, two important questions remain with regard to adaptive performance
tuning: (1) Which optimizations have such dynamic behavior or need runtime
information? (2) How can we minimize the tuning overhead so that it can be
amortized by the performance gained from adaptive tuning?
6.2.4 Program debugging
Although PEAK is developed for performance tuning, its techniques of code
instrumentation, runtime compilation, runtime code loading and performance rating
can be used to help program debugging.
Traditional compilers and debuggers are used separately. A compiler generates
binary code and a symbol table; then a debugger loads the binary and the symbol
table during the process of debugging. The debugging commands will use the symbol
table to locate code and data.
PEAK can be extended to invoke a compiler in the course of program debugging.
This mechanism of compilation while debugging can shorten the debugging time and
reduce manual work. For example, a programmer can apply this mechanism to the
following scenarios without restarting the debugged program:
1. The programmer compiles the program after fixing a small error and plugs the
generated binary into the running program. Our PEAK runtime can facilitate
this process of runtime code loading.
2. The programmer adds some instrumentation code during debugging. Our
PEAK compiler can do this instrumentation automatically for the purpose
of performance debugging. Our PEAK runtime can load the instrumented
code during debugging.
3. The programmer can apply different optimizations to a code section to compare
their performance. This is very similar to the original goal of performance
tuning in PEAK.
4. The programmer can use a compiler to do data flow analysis while debugging
the code. For example, after finding the input and output of a code section,
he or she can roll back program execution by restoring the input, or check the
correctness of an optimized version by comparing its output to that of the
un-optimized version. The re-execution-based rating technique can help with
this.
This kind of debugging system can enable a programmer to modify, analyze,
compile and load code while debugging a program.
LIST OF REFERENCES
[1] Z. Pan and R. Eigenmann, “Fast and effective orchestration of compiler opti-mizations for automatic performance tuning,” in The 4th Annual InternationalSymposium on Code Generation and Optimization (CGO), p. (12 pages), March2006.
[2] Z. Pan and R. Eigenmann, “Rating compiler optimizations for automatic per-formance tuning,” in SC2004: High Performance Computing, Networking andStorage Conference, p. (10 pages), November 2004.
[3] M. E. Wolf, D. E. Maydan, and D.-K. Chen, “Combining loop transforma-tions considering caches and scheduling,” in Proceedings of the 29th annualACM/IEEE international symposium on Microarchitecture, pp. 274–286, 1996.
[4] C. Click and K. D. Cooper, “Combining analyses, combining optimiza-tions,” ACM Transactions on Programming Languages and Systems (TOPLAS),vol. 17, no. 2, pp. 181–196, 1995.
[5] A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. M. Stich-noth, “Fast, effective code generation in a just-in-time java compiler,” in Pro-ceedings of the ACM SIGPLAN 1998 conference on Programming language de-sign and implementation, pp. 280–290, ACM Press, 1998.
[6] M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney, “Adaptive opti-mization in the Jalapeno JVM,” in Proceedings of the 15th ACM SIGPLANconference on Object-oriented programming, systems, languages, and applica-tions, pp. 47–65, ACM Press, 2000.
[7] M. Arnold, M. Hind, and B. G. Ryder, “Online feedback-directed optimiza-tion of java,” in Proceedings of the 17th ACM conference on Object-orientedprogramming, systems, languages, and applications, pp. 111–129, ACM Press,2002.
[8] M. Cierniak, G.-Y. Lueh, and J. M. Stichnoth, “Practicing JUDO: Java underdynamic optimizations,” in Proceedings of the ACM SIGPLAN 2000 conferenceon Programming language design and implementation, pp. 13–26, ACM Press,2000.
[9] D. R. Engler and T. A. Proebsting, “DCG: an efficient, retargetable dynamiccode generation system,” in Proceedings of the sixth international conferenceon Architectural support for programming languages and operating systems,pp. 263–272, ACM Press, 1994.
[10] D. R. Engler, “VCODE: a retargetable, extensible, very fast dynamic code gen-eration system,” SIGPLAN Not., vol. 31, no. 5, pp. 160–170, 1996.
97
[11] K. Ebcioglu and E. R. Altman, “DAISY: Dynamic compilation for 100architec-tural compatibility,” in ISCA, pp. 26–37, 1997.
[12] C. Consel and F. Noel, “A general approach for run-time specialization and itsapplication to c,” in Proceedings of the 23rd ACM SIGPLAN-SIGACT sympo-sium on Principles of programming languages, pp. 145–156, ACM Press, 1996.
[13] P. Lee and M. Leone, “Optimizing ML with run-time code generation,” inSIGPLAN Conference on Programming Language Design and Implementation,pp. 137–148, 1996.
[14] M. Leone and P. Lee, “Dynamic specialization in the fabius system,” ACMComput. Surv., vol. 30, no. 3es, p. 23, 1998.
[15] M. Mock, C. Chambers, and S. J. Eggers, “Calpa: a tool for automating selec-tive dynamic compilation,” in International Symposium on Microarchitecture,pp. 291–302, 2000.
[16] B. Grant, M. Philipose, M. Mock, C. Chambers, and S. J. Eggers, “An eval-uation of staged run-time optimizations in dyc,” in Proceedings of the ACMSIGPLAN 1999 conference on Programming language design and implementa-tion, pp. 293–304, ACM Press, 1999.
[17] J. Auslander, M. Philipose, C. Chambers, S. J. Eggers, and B. N. Bershad,“Fast, effective dynamic compilation,” in SIGPLAN Conference on Program-ming Language Design and Implementation, pp. 149–159, 1996.
[18] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: a transparent dynamicoptimization system,” in Proceedings of the ACM SIGPLAN 2000 conferenceon Programming language design and implementation, pp. 1–12, ACM Press,2000.
[19] E. Duesterwald and V. Bala, “Software profiling for hot path prediction: lessis more,” in Proceedings of the ninth international conference on Architecturalsupport for programming languages and operating systems, pp. 202–211, ACMPress, 2000.
[20] D. Bruening, T. Garnett, and S. Amarasinghe, “An infrastructure for adaptivedynamic optimization,” 2003.
[21] M. C. Merten, A. R. Trick, C. N. George, J. C. Gyllenhaal, and W. W. Hwu, “Ahardware-driven profiling scheme for identifying program hot spots to supportruntime optimization,” in Proceedings of the 26th annual international sympo-sium on Computer architecture, pp. 136–147, IEEE Computer Society, 1999.
[22] M. C. Merten, A. R. Trick, R. D. Barnes, E. M. Nystrom, C. N. George, J. C.Gyllenhaal, and W. mei W. Hwu, “An architectural framework for runtimeoptimization,” IEEE Transactions on Computers, vol. 50, no. 6, pp. 567–589,2001.
[23] E. M. Nystrom, R. D. Barnes, M. C. Merten, and W. mei W. Hwu, “Codereordering and speculation support for dynamic optimization systems,” in Pro-ceedings of the International Conference on Parallel Architectures and Compi-lation Techniques, September 2001.
98
[24] M. Voss and R. Eigenmann, “ADAPT: Automated de-coupled adaptive programtransformation,” in International Conference on Parallel Processing, pp. 163–,2000.
[25] M. J. Voss and R. Eigemann, “High-level adaptive program optimization withADAPT,” in Proceedings of the eighth ACM SIGPLAN symposium on Principlesand practices of parallel programming, pp. 93–102, ACM Press, 2001.
[26] T. Kistler and M. Franz, “Continuous program optimization: A case study,”ACM Trans. Program. Lang. Syst., vol. 25, no. 4, pp. 500–548, 2003.
[27] M. M. Strout, L. Carter, and J. Ferrante, “Compile-time composition of run-time data and iteration reorderings,” in Proceedings of the 2003 ACM SIGPLANConference on Programming Language Design and Implementation (PLDI),June 2003.
[28] L. Rauchwerger and D. A. Padua, “The LRPD test: Speculative run-time par-allelization of loops with privatization and reduction parallelization,” IEEETransactions on Parallel and Distributed Systems, vol. 10, no. 2, pp. 160–??,1999.
[29] F. Dang, H. Yu, and L. Rauchwerger, “The R-LRPD test: Speculative par-allelization of partially parallel loops,” in the 16th International Parallel andDistributed Processing Symposium (IPDPS ’02), 2002.
[30] S. Rus, L. Rauchwerger, and J. Hoeflinger, “Hybrid analysis: static & dynamicmemory reference analysis,” in Proceedings of the 16th international conferenceon Supercomputing, pp. 274–284, ACM Press, 2002.
[31] S. Nandy, X. Gao, and J. Ferrante, “TFP: Time-sensitive, flow-specific profilingat runtime,” in Workshop on Languages and Compiling for Parallel Computing(LCPC), October 2003.
[32] P. C. Diniz and M. C. Rinard, “Dynamic feedback: An effective techniquefor adaptive computing,” in SIGPLAN Conference on Programming LanguageDesign and Implementation, pp. 71–84, 1997.
[33] R. C. Whaley and J. Dongarra, “Automatically tuned linear algebra software,”in SuperComputing 1998: High Performance Networking and Computing, 1998.
[34] T. Kisuki, P. M. W. Knijnenburg, M. F. P. O’Boyle, F. Bodin, and H. A. G.Wijshoff, “A feasibility study in iterative compilation,” in International Sym-posium on High Performance Computing (ISHPC’99), pp. 121–132, 1999.
[35] M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O’Reilly, “Meta opti-mization: improving compiler heuristics with machine learning,” in Proceedingsof the ACM SIGPLAN 2003 conference on Programming language design andimplementation, pp. 77–90, ACM Press, 2003.
[36] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, “Compileroptimization-space exploration,” in Proceedings of the international symposiumon Code generation and optimization, pp. 204–215, 2003.
99
[37] R. P. J. Pinkers, P. M. W. Knijnenburg, M. Haneda, and H. A. G. Wijshoff, “Statistical selection of compiler options,” in The IEEE Computer Society’s 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS’04), (Volendam, The Netherlands), pp. 494–501, October 2004.
[38] A. Hedayat, N. Sloane, and J. Stufken, Orthogonal Arrays: Theory and Applications. Springer, 1999.
[39] K. Chow and Y. Wu, “Feedback-directed selection and characterization of compiler optimizations,” in Second Workshop on Feedback Directed Optimizations, (Israel), November 1999.
[40] E. D. Granston and A. Holler, “Automatic recommendation of compiler options,” in 4th Workshop on Feedback-Directed and Dynamic Optimization (FDDO-4), December 2001.
[41] K. D. Cooper, D. Subramanian, and L. Torczon, “Adaptive optimizing compilers for the 21st century,” The Journal of Supercomputing, vol. 23, no. 1, pp. 7–22, 2002.
[42] R. Joshi, G. Nelson, and K. Randall, “Denali: A goal-directed superoptimizer,” in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pp. 304–314, ACM Press, 2002.
[43] Z. Pan and R. Eigenmann, “Compiler optimization orchestration for peak performance,” Tech. Rep. TR-ECE-04-01, School of Electrical and Computer Engineering, Purdue University, 2004.
[44] G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley and Sons, 1978.
[45] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O’Boyle, “Combined selection of tile sizes and unroll factors using iterative compilation,” in IEEE PACT, pp. 237–248, 2000.
[46] M. Haneda, P. Knijnenburg, and H. Wijshoff, “Generating new general compiler optimization settings,” in Proceedings of the 19th ACM International Conference on Supercomputing, pp. 161–168, June 2005.
[47] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, “Finding effective compilation sequences,” in LCTES ’04: Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, (New York, NY, USA), pp. 231–239, ACM Press, 2004.
[48] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones, “Fast searches for effective optimization phase sequences,” in PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, (New York, NY, USA), pp. 171–182, ACM Press, 2004.
[49] SPEC, SPEC CPU2000 Results. http://www.spec.org/cpu2000/results, 2000.
[50] N. J. A. Sloane, A Library of Orthogonal Arrays. http://www.research.att.com/~njas/oadir/.
[51] GNU, GCC Online Documentation. http://gcc.gnu.org/onlinedocs/, 2005.
[52] Sun, Forte C 6 / Sun WorkShop 6 Compilers C User’s Guide. http://docs.sun.com/app/docs/doc/806-3567, 2000.
[53] Y.-J. Lee and M. Hall, “A code isolator: Isolating code fragments from large programs,” in LCPC, pp. 164–178, 2004.
[54] G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster and more flexible program analysis,” in Workshop on Modeling, Benchmarking and Simulation, June 2005.
[55] E. Perelman, G. Hamerly, M. Biesbrouck, T. Sherwood, and B. Calder, “Using SimPoint for accurate and efficient simulation,” in ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 2003.
[56] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[57] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling,” SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 84–97, 2003.
[58] T. F. Wenisch, R. E. Wunderlich, B. Falsafi, and J. C. Hoe, “TurboSMARTS: Accurate microarchitecture simulation sampling in minutes,” in SIGMETRICS ’05: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, (New York, NY, USA), pp. 408–409, ACM Press, 2005.
[59] T. F. Wenisch and R. E. Wunderlich, “SimFlex: Fast, accurate and flexible simulation of computer systems,” in International Symposium on Microarchitecture (MICRO-38), November 2005.
[60] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu, “Parallel programming with Polaris,” IEEE Computer, vol. 29, pp. 78–82, December 1996.
[61] SUIF2, The SUIF 2 Compiler System. http://suif.stanford.edu/suif/suif2/, 2005.
[62] P. Tu and D. A. Padua, “Gated SSA-based demand-driven symbolic analysis for parallelizing compilers,” in International Conference on Supercomputing, pp. 414–423, 1995.
[63] W. Blume and R. Eigenmann, “Symbolic range propagation,” in the 9th International Parallel Processing Symposium, pp. 357–363, 1995.
[64] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: A call graph execution profiler,” in SIGPLAN Symposium on Compiler Construction, pp. 120–126, 1982.
[65] R. Karp, “Reducibility among combinatorial problems,” in a Symposium on the Complexity of Computer Computations, (New York), pp. 85–103, Plenum Press, 1972.
[66] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, “Effective compiler support for predicated execution using the hyperblock,” in Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 45–54, IEEE Computer Society Press, 1992.
[67] S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W.-M. W. Hwu, “A comparison of full and partial predicated execution support for ILP processors,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 138–150, ACM Press, 1995.
[68] D. I. August, W.-M. W. Hwu, and S. A. Mahlke, “A framework for balancing control flow and predication,” in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 92–103, IEEE Computer Society, 1997.
[69] B. Pottenger and R. Eigenmann, “Idiom recognition in the Polaris parallelizing compiler,” in Proceedings of the 9th International Conference on Supercomputing, pp. 444–448, ACM Press, 1995.
[70] P. Briggs, K. D. Cooper, and L. Torczon, “Improvements to graph coloring register allocation,” ACM Transactions on Programming Languages and Systems, vol. 16, pp. 428–455, May 1994.
[71] P. Bergner, P. Dahl, D. Engebretsen, and M. T. O’Keefe, “Spill code minimization via interference region spilling,” in SIGPLAN Conference on Programming Language Design and Implementation, pp. 287–295, 1997.
[72] J. Park and M. Schlansker, “On predicated execution,” Tech. Rep. HPL-91-58, Hewlett-Packard Software Systems Laboratory, May 1991.
[73] K. M. Hazelwood and T. M. Conte, “A lightweight algorithm for dynamic if-conversion during dynamic optimization,” in 2000 International Conference on Parallel Architectures and Compilation Techniques, pp. 71–80, 2000.
APPENDIX: PERFORMANCE OF GCC OPTIMIZATIONS
1 Introduction
Although compiler optimizations yield significant improvements in many pro-
grams, the potential for performance degradation in certain program patterns is
known to compiler researchers and many users. Potential degradations are well
understood for some techniques, while they are unexpected in other cases. For ex-
ample, the difficulty of employing predicated execution [66–68] or parallel recurrence
substitutions [69] is evident. On the other hand, performance degradation as a re-
sult of alias analysis is generally unexpected. (We will discuss this case in detail in
Section 5.)
In order to quantitatively understand the performance effects of a large number
of compiler techniques, we measure the performance of SPEC CPU2000 benchmarks
under different compiler configurations. We obtain these results on two different
computer architectures, focusing on the GNU Compiler Collection (GCC). The ex-
periments are conducted to answer the following questions: (1) Is the default opti-
mization combination suggested by the compiler good enough? (2) What optimiza-
tions may not always help performance? (3) Is performance degradation specific to a
particular architecture? (4) Do these optimizations have different effects on integer
benchmarks and on floating-point benchmarks?
Section 3 shows the performance of SPEC benchmarks under different GCC op-
timization levels, which are the optimization combinations suggested by GCC. Sec-
tion 4 shows the performance of individual optimizations. Section 5 analyzes the
reasons for the major performance degradation as a result of individual optimiza-
tions. Section 6 summarizes the answers to the above questions.
2 Experimental Setup
We measure the performance of GCC 3.3 optimizations on two different com-
puter architectures: a Pentium IV machine and a SPARC II machine. We focus on
GCC, because it is portable across many different computer architectures, and its
open-source nature helps us to understand the performance behavior of its optimiza-
tions. To verify that our results hold beyond the GCC compiler, we conduct similar
experiments with the Forte compilers from Sun Microsystems [52]. Our conclusions
are valid for these compilers as well, although they generally outperform GCC on
the SPARC machine.
We take the measurements using SPEC CPU2000 benchmarks. To differentiate
the effect of compiler optimizations on integer (INT) and floating-point (FP) pro-
grams, we display the results of these two benchmark categories separately. Among
all the FP benchmarks, facerec, fma3d, galgel, and lucas are written in f90. Because
GCC cannot currently handle f90, we do not measure them.
To ensure reliable measurements, we run the experiments multiple times. The
average execution time represents the performance, while the minimum and the
maximum are shown through “error bars” to indicate the degree of fluctuation. This
fluctuation is relevant where the performance gains and losses of an optimization
technique are small.
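The timing methodology above can be sketched as follows. This is a minimal illustration, not the actual harness used in the experiments; `./benchmark` stands in for a hypothetical compiled SPEC binary.

```python
import subprocess
import time

def measure(cmd, runs=3):
    """Run a benchmark command several times and return (avg, min, max).

    The average represents the performance; the minimum and maximum
    provide the "error bars" that indicate run-to-run fluctuation.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # execute the compiled benchmark
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples), min(samples), max(samples)

# Example (hypothetical binary):
# avg, lo, hi = measure(["./benchmark"])
```

Reporting min and max alongside the average makes it visible when an optimization's apparent gain or loss is smaller than the measurement noise.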
3 Performance of Optimization Levels O1 through O3
GCC provides three optimization levels, O1 through O3 [51], each applying a
larger number of optimization techniques. O0 does not apply any substantial code
optimizations. From Figure A.1, we make the following observations.
1. There is consistent, significant performance improvement from O0 to O1. However, O2 and O3 do not always lead to additional gains; in some cases, performance even degrades. (In Section 5 we will analyze the significant degradation of art.) For different applications, any one of the three levels O1 through O3 may be the best.

[Figure A.1: Execution time of SPEC CPU 2000 benchmarks under different optimization levels compiled by GCC. Each benchmark has four bars, for O0 to O3. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine. The four floating point benchmarks written in f90 are not included, since GCC does not compile them.]
2. As expected, Table A.1 shows that O2 is better on average than O1² and, for
the integer benchmarks, O3 is better than O2. However, for the floating point
benchmarks O2 is better than or close to O3. Most of the performance is gained
from the optimizations in level O1. The performance increase from O1 to O2
is bigger than that from O2 to O3.
²Except the anomalous art, to be discussed in Section 5.
Table A.1: Average speedups of the optimization levels, relative to O0. In each entry, the first number is the arithmetic mean and the second is the geometric mean. For the floating point benchmarks on the Pentium IV machine, the averages without art are given in parentheses.

         INT, Pentium IV   FP, Pentium IV            INT, SPARC II   FP, SPARC II
    O1   1.49/1.47         1.74 (1.77)/1.65 (1.67)   2.32/2.28       3.17/2.88
    O2   1.53/1.50         1.81 (1.95)/1.60 (1.81)   2.50/2.43       4.40/3.78
    O3   1.55/1.51         1.80 (1.94)/1.60 (1.80)   2.58/2.52       4.38/3.79
3. Floating point benchmarks benefit more from compiler optimizations than in-
teger benchmarks. Possible reasons are that floating point benchmarks tend
to have fewer control statements than integer benchmarks and are written in a
more regular way. Six of them are written in Fortran 77.
4. Optimizations achieve higher speedups on the SPARC II machine than on
the Pentium IV machine. Possible reasons are the regularity of RISC versus
CISC instruction sets and the fact that SPARC II has more registers than
Pentium IV. The latter gives the compiler more freedom to allocate registers,
resulting in less register spilling on the SPARC II machine.
5. Eon, the only C++ benchmark, benefits more from optimization than all other
integer benchmarks. On the Pentium IV machine, the highest speedup of eon
is 2.65, while the highest one among other integer benchmarks is 1.73. On the
SPARC II machine, the highest speedup of eon is 3.70, while the highest one
among other integer benchmarks is 3.47.
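The two averages reported in Table A.1 (arithmetic and geometric mean of per-benchmark speedups relative to O0) can be computed as in this sketch; the speedup values below are illustrative, not the measured data.

```python
import math

def arithmetic_mean(speedups):
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Illustrative speedups of three benchmarks relative to O0 (made-up values):
speedups = [1.2, 1.5, 2.0]
print(round(arithmetic_mean(speedups), 3))  # 1.567
print(round(geometric_mean(speedups), 3))   # 1.533
```

The geometric mean is the conventional choice for averaging ratios such as speedups, since a single large outlier (like art here) dominates it less than the arithmetic mean.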
4 Performance of Individual Optimizations
This section discusses different performance behaviors of individual optimization
techniques used in GCC. We measure the execution time with all optimizations on as
the baseline performance. Then, for each optimization x, we measure the execution
time with all optimizations on except x. The performance of optimization x is
represented by its Relative Improvement Percentage (RIP), defined as follows:
    RIP = (execution time without x / execution time of baseline − 1) × 100    (A.1)

RIP represents the percent increase of the program execution time when disabling a given optimization technique. A larger RIP value indicates a bigger positive impact of the technique.
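Equation A.1, together with the one-at-a-time measurement it supports, can be sketched as follows. The flag names are real GCC options, but `time_with_extra_flags` is a hypothetical hook that would rebuild and time the benchmark; here it is fed made-up timings.

```python
def rip(time_without_x, time_baseline):
    """Relative Improvement Percentage of optimization x (Equation A.1):
    the percent increase in execution time when x is disabled."""
    return (time_without_x / time_baseline - 1.0) * 100.0

def rank_optimizations(time_with_extra_flags, flags):
    """The baseline turns all O3 optimizations on; each trial disables
    exactly one optimization via GCC's -fno-<flag> spelling."""
    baseline = time_with_extra_flags([])
    rips = {f: rip(time_with_extra_flags(["-fno-" + f]), baseline)
            for f in flags}
    # Largest RIP (most beneficial) first; degrading options (RIP < 0) last.
    return sorted(rips.items(), key=lambda kv: kv[1], reverse=True)

# Made-up timings in seconds, keyed by the extra flags passed to the compiler:
fake_times = {(): 100.0, ("-fno-gcse",): 104.0, ("-fno-strict-aliasing",): 97.0}
ranked = rank_optimizations(lambda extra: fake_times[tuple(extra)],
                            ["gcse", "strict-aliasing"])
for flag, value in ranked:
    print(flag, round(value, 1))  # beneficial gcse first, degrading flag last
```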
Due to the large amount of data, we do not show all the results, but only representative ones. We make a number of observations and discuss opportunities and needs for better orchestration of the techniques.
1. Ideally, one expects that most optimization techniques yield performance im-
provements with no degradation. This is the case for apsi on the SPARC II
machine, shown in Figure A.2 (a). This situation indicates little or no need and
opportunity for optimization orchestration.
2. In some benchmarks, only a few optimizations make a significant performance
difference, while others have very small effects. vortex on Pentium IV (Figure A.2 (b)) is such an example. Here also, little opportunity exists for performance gain through optimization orchestration.
3. It is possible that many optimizations cause performance degradation, as in twolf on Pentium IV (Figure A.2 (c)), or that individual degradations are large, as in sixtrack on Pentium IV (Figure A.4). In these cases, optimization orchestration may help significantly.
4. In some programs, the relative improvement percentages of individual optimizations are between −1.5 and 1.5. For example, in Figures A.3 (a) and A.3 (b) the improvements are of the same order of magnitude as their
variance. While optimization orchestration may combine small individual gains
to a substantial improvement, the need for accurate performance measurement
becomes evident. Effects such as OS activities need to be considered carefully.
5. The performance improvement or degradation may depend on the computer
architectures. According to the results of twolf on SPARC II, shown in Figure A.3 (c), and the results on Pentium IV in Figure A.2 (c), the optimizations causing degradations are completely different on these two platforms. This clearly shows that optimization orchestration needs to consider the application as well as the execution environment.

[Figure A.2: Relative improvement percentage of all O3 optimizations. Each panel plots the RIP of each individual O3 option, from rename-registers through defer-pop. (a) apsi on a SPARC II machine; (b) vortex on a Pentium IV machine; (c) twolf on a Pentium IV machine.]

[Figure A.3: Relative improvement percentage of all O3 optimizations. (a) bzip2 on a Pentium IV machine; (b) vpr on a Pentium IV machine; (c) twolf on a SPARC II machine.]

[Figure A.4: Relative improvement percentage of all O3 optimizations: sixtrack on a Pentium IV machine.]
In some experiments, few or none of the tested optimizations cause significant
speedups (e.g. Figure A.3 (a)); however, there is significant improvement from O0
to O1, as shown in Figure A.1. This is because some basic optimizations are not
controllable by compiler options. These optimizations include the expansion of built-
in functions, and basic local and global register allocation.
5 Optimizations with Major Negative Effects
This section shows the performance of three optimizations that cause major degradations in some of the benchmarks: strict aliasing, global common subexpression elimination, and if-conversion. We briefly discuss the reasons for these degradations.
[Figure A.5: Relative improvement percentage of strict aliasing. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
5.1 Strict aliasing
Strict aliasing is a simple alias analysis technique. When turned on, objects of
different types are always assumed to reside at different addresses.3 If strict aliasing
is turned off, GCC assumes the existence of aliases very conservatively [51].
Generally, one expects a consistently positive effect of strict aliasing, as it avoids
conservative assumptions. Figure A.5 confirms this view for most cases. However,
the technique also leads to significant degradation in art, shown in Figure A.5 (b).
Its RIP is −64.5.
The degradation in art on Pentium IV is due to the interaction between strict
aliasing and register allocation. GCC implements a graph coloring register allocator [70, 71].

³Strict aliasing in combination with type casting may lead to incorrect programs. We have not observed any such problems in the SPEC CPU2000 benchmarks.

With strict aliasing, the live ranges of the variables used in art become
longer, leading to higher register pressure and spilling. With more conservative
aliasing, the same variables incur memory transfers at the end of their (shorter)
live ranges as well. However, in the given compiler implementation, the spill code,
generated with strict aliasing on, includes substantially more memory accesses than
these transfers, generated without strict aliasing. Thus, strict aliasing causes performance degradation in art. (Unfortunately, this degradation is nearly impossible to predict at compile time, because register allocation is an NP-complete problem.)
As Figure A.5 (d) shows, on SPARC II strict aliasing does not degrade art, but
improves the performance by 10.7%. We attribute this improvement to less spilling
due to the larger number of registers on SPARC II than on Pentium IV.
5.2 Global common subexpression elimination
Global common subexpression elimination (GCSE) employs partial redundancy
elimination (PRE), global constant propagation, and copy propagation [51]. GCSE
removes redundant computation and, therefore, generally improves performance. In
rare cases it increases register pressure by keeping the expression values longer. PRE
may also create additional move instructions, as it attempts to place the results of
the same expression computed in different basic blocks into the same register.
We have also found that GCSE can degrade the performance, as it interacts with
other optimizations. In applu (Figure A.6 (b)), we observed a significant performance degradation in the subroutine JACLD. Detailed analysis showed that this
problem happens when GCSE is used together with the flag force-mem. This flag
forces memory operands to be copied into registers before arithmetic operations,
which generally improves code by making all memory references potential common
subexpressions. However, in applu, this pass evidently interferes with the GCSE
algorithm. Comparing the assembly code with and without force-mem, we found
the former recognized fewer common subexpressions.
[Figure A.6: Relative improvement percentage of global common subexpression elimination. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
5.3 If-conversion
If-conversion attempts to transform conditional jumps into branch-less equiva-
lents. It makes use of conditional moves, min, max, set flags and abs instructions,
and applies laws of standard arithmetic [51]. If the computer architecture supports
predication, if-conversion may be used to enable predicated instructions [72]. By
removing conditional jumps, if-conversion not only reduces the number of branches,
but also enlarges basic blocks, thus helping instruction scheduling. The potential overhead of such
transformations and the opportunities for dynamic optimization are well-known [73].
In our measurements, we found many cases where if-conversion degrades the perfor-
mance.
[Figure A.7: Relative improvement percentage of if-conversion. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
In vortex, there is a frequently called function named ChkGetChunk, which is
called billions of times in the course of the program. After if-conversion, the number
of basic blocks is reduced, but the number of instructions in the converted if-then-else
construct is still the same. However, the number of registers used in this function
is increased, because the coalesced basic block uses more physical registers to avoid
spilling. Thus, this function has to save and restore more registers. The additional
register saving causes substantial overhead, as the function is called many times.
6 Summary on GCC Optimization Performance
We make the following observations from the above discussion:
• The default optimization combinations do not guarantee the best performance.
There is still room for performance tuning.
• Optimizations may exhibit unexpected performance behavior. Even generally-
beneficial techniques may degrade performance. Degradations are often com-
plex side-effects of the interaction with other optimizations. They are near-
impossible to predict analytically.
• On different architectures, the optimizations may behave differently. This
means that compiler optimizations orchestrated on one architecture may not
suit another.
• A larger number of optimization techniques cause performance degradations in
integer benchmarks. Integer benchmarks often contain irregular code with many
control statements, which tends to reduce the effectiveness of optimizations. On
the other hand, larger degradations (of fewer techniques) occur in floating point
benchmarks. This is consistent with the generally larger effect of optimization
techniques in these programs.
VITA
Zhelong Pan was born in 1976 at Suzhou, China. He received his B.E. degree
in Electrical Engineering from Tsinghua University in July, 1998. He was awarded
“Tsinghua Honorable Excellent Graduate” the same year. Three years later, in July,
2001, he received his M.E. degree in Electrical Engineering from Tsinghua University, where he was awarded the “Excellent Master Thesis” honor. He then moved to West Lafayette, Indiana, to pursue his Ph.D. degree in the School of Electrical and Computer Engineering at Purdue University. He successfully defended his Ph.D. research in April,
2006 and received his Ph.D. degree in May, 2006.
During the period of his master’s program, Zhelong Pan participated in develop-
ing software for power system analysis and simulation, which had been applied to
more than ten power companies in China at that time. In his master’s thesis, he
developed a distributed genetic algorithm to solve the reactive power optimization
problem. As a Ph.D. student at Purdue, he worked on optimizing compilers and
parallel computing. He developed a tiling algorithm for locality enhancement, which
works in concert with the existing parallelization techniques in the Polaris compiler.
He implemented a user-level socket virtualization technique, which enables MPI pro-
gram execution on virtual machines in iShare, an open Internet sharing system. In
his Ph.D. dissertation, he developed fast and effective algorithms and compiler tools
for automatic performance tuning via selecting the best compiler optimization com-
bination.