PEAK – A FAST AND EFFECTIVE PERFORMANCE TUNING SYSTEM VIA
COMPILER OPTIMIZATION ORCHESTRATION
A Thesis
Submitted to the Faculty
of
Purdue University
by
Zhelong Pan
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
May 2006
Purdue University
West Lafayette, Indiana
To my wife Xiaojuan.
ACKNOWLEDGMENTS
I would like to first thank my advisor, Rudi Eigenmann, for his support and advice
during my many days here at Purdue. It has been a true pleasure to work with him. I
would also like to thank the other professors at Purdue who have taught and advised
me, especially my Ph.D. committee members: Sam Midkiff, T.N. Vijaykumar, and
Zhiyuan Li.
Our research group at Purdue has had its share of good people. My thanks
go to Brian Armstrong, Seung-Jai Min, Hansang Bae, Troy Johnson, Xiaojuan Ren,
Sang-Ik Lee, Ayon Basumallik, and Yili Zheng. I would especially like to thank Brian
Armstrong for answering many of my questions on the Polaris compiler, LaTeX tools,
and even English.
I could never have done this without my family’s support and encouragement.
To my parents and my brother go my deepest love and gratitude. Of course, it
is my wife, Xiaojuan, that sacrificed the most for me. You are what makes it all
worthwhile.
And finally, I must thank all my friends and the Purdue faculty and staff, who made
my study and research at Purdue a joyful and meaningful journey.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and Introduction . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 FAST AND EFFECTIVE OPTIMIZATION ORCHESTRATION ALGORITHMS . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Orchestration Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Problem description . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Orchestration algorithms . . . . . . . . . . . . . . . . . . . . . 13
2.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Experimental environment . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Upper Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 The General Combined Elimination Algorithm . . . . . . . . . . . . . 33
2.5.1 Experimental results on SUN Forte compilers . . . . . . . . . 34
3 FAST AND ACCURATE RATING METHODS . . . . . . . . . . . . . . . 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Rating Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Context Based Rating (CBR) . . . . . . . . . . . . . . . . . . 40
3.2.2 Model Based Rating (MBR) . . . . . . . . . . . . . . . . . . . 43
3.2.3 Re-execution Based Rating (RBR) . . . . . . . . . . . . . . . 45
3.3 The Use of Rating Methods in PEAK . . . . . . . . . . . . . . . . . . 48
3.4 Evaluation on Rating Accuracy . . . . . . . . . . . . . . . . . . . . . 50
4 TUNING SECTION SELECTION . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Profile Data for Selecting Tuning Sections . . . . . . . . . . . . . . . 55
4.3 A Formal Description of the Tuning Section Selection Problem . . . . 56
4.4 The Tuning Section Selection Algorithm . . . . . . . . . . . . . . . . 58
4.4.1 Dealing with recursive functions . . . . . . . . . . . . . . . . . 59
4.4.2 Maximizing tuning section coverage under Nlb . . . . . . . . . 61
4.4.3 The final tuning section selection algorithm . . . . . . . . . . 66
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5.1 SPEC CPU2000 FP benchmarks . . . . . . . . . . . . . . . . 69
4.5.2 SPEC CPU2000 INT benchmarks . . . . . . . . . . . . . . . . 70
5 THE PEAK SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Design of PEAK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 The steps of automated performance tuning . . . . . . . . . . 74
5.2.2 Dynamic code generation and loading . . . . . . . . . . . . . . 77
5.3 An Example of Using PEAK . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.1 Tuning time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.2 Tuned program performance . . . . . . . . . . . . . . . . . . . 86
5.4.3 Integer benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 87
6 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 91
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Performance analysis on compiler optimizations . . . . . . . . 92
6.2.2 Other tuning problems . . . . . . . . . . . . . . . . . . . . . . 93
6.2.3 Adaptive performance tuning . . . . . . . . . . . . . . . . . . 94
6.2.4 Program debugging . . . . . . . . . . . . . . . . . . . . . . . . 94
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
APPENDIX: PERFORMANCE OF GCC OPTIMIZATIONS . . . . . . . . . 102
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
LIST OF TABLES
Table Page
2.1 Orchestration algorithm complexity (n is the number of optimization options.) . . . 20
2.2 Optimization options in GCC 3.3.3 . . . . . . . . . . . . . . . . . . . . . 21
2.3 Mean performance on SPARC II. CE achieves both fast tuning speed and high program performance on SPARC II as well. . . . 28
2.4 Upper bound analysis under four different machine and benchmark settings . . . 32
2.5 Optimization flags orchestrated by GCE . . . . . . . . . . . . . . . . . . 35
3.1 Rating accuracy for selected tuning sections . . . . . . . . . . . . . . . . 51
4.1 Tuning section selection for mgrid. The best Nlb is 400. The optimal coverage and Nmin are 0.957 and 2000. . . . 67
4.2 Selected tuning sections in SPEC CPU2000 FP benchmarks. (Three manually partitioned benchmarks are annotated with ‘*’. The last row, wupwise+, uses a smaller Tlb = 1μsec.) . . . 69
4.3 Selected tuning sections in SPEC CPU2000 INT benchmarks. (The benchmarks annotated with ‘+’ use smaller Tlb’s.) . . . 72
A.1 Average speedups of the optimization levels, relative to O0. In each entry, the first number is the arithmetic mean, and the second one is the geometric mean. The averages without art are put in parentheses for the floating point benchmarks on the Pentium IV machine. . . . 105
LIST OF FIGURES
Figure Page
2.1 Normalized tuning time of five optimization orchestration algorithms for SPEC CPU2000 benchmarks on Pentium IV. Lower is better. CE has the shortest tuning time in all except a few cases. In all those cases, the extended tuning time leads to higher performance. . . . 23
2.2 Program performance achieved by five optimization orchestration algorithms relative to the highest optimization level “O3” for SPEC CPU2000 benchmarks on Pentium IV. Higher is better. In all cases, CE performs the best or within 1% of the best. . . . 24
2.3 Overall comparison of the orchestration algorithms. CE achieves both fast tuning speed and high program performance. . . . 29
2.4 Total negative effects of all the GCC 3.3.3 O3 optimization options . . . 30
2.5 Upper bound analysis on Pentium IV. ES 6: exhaustive search with 6 optimizations; CE 6: combined elimination with 6 optimizations; CE 38: combined elimination with 38 optimizations; BE 6: batch elimination with 6 optimizations. CE 6 achieves nearly the same performance as ES 6, in all cases. CE 38 performs better. (Exhaustive search with 38 optimizations would be infeasible.) BE 6 is much worse than CE 6. CE 6 is about 4 times faster than ES 6. . . . 31
2.6 Program performance achieved by the GCE algorithm vs. the performance of the manually tuned results (peak setting). Higher is better. In all cases, GCE achieves equal or better performance. On average, GCE nearly doubles the performance. . . . 36
3.1 Pseudo code of context variable analysis . . . . . . . . . . . . . . . . . . 42
3.2 A simple example of MBR . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Basic Re-execution-based rating method (RBR) . . . . . . . . . . . . . . 46
3.4 Improved Re-execution-based rating method . . . . . . . . . . . . . . . . 48
4.1 An example of tuning section selection. The graph is a call graph with node a as the main function. The weights on an edge are the number of invocations and the execution time in parentheses. The optimal edge cut is (Θ = {a, c}, Ω = {b, d, e, f}), shown by the dashed curve. Edges (a, b) and (c, f) are chosen as the S set. Edge (c, e) in the cut (Θ, Ω) is not included in S, because its average execution time, 1/20000, is less than Tlb = 1e−4. There are two tuning sections, led by node b and node f: T = {b, f}. The numbers of invocations to b and f are 1000 and 200, respectively, so Nmin = 200. The coverage of this optimal tuning section selection is (80+18)/100 = 0.98, where the total execution time, Ttotal, is 100. . . . 57
4.2 The pseudo code for call graph simplification. The algorithm generates a call graph from profile data, and detects and discards recursive calls. Hence, the call graph is simplified to a directed acyclic graph. . . . 60
4.3 An example of call graph simplification. The graph is a call graph with node a as the main function. c is a self-recursive function. b and e recursively call each other. The weights on an edge are the number of invocations and the execution time in parentheses. After simplification, the loop at node c is discarded. The strongly connected component {b, e} is merged into one node be. The entry node b for this strongly connected component is kept. A new edge (b, be) is added. Edges (b, f) and (e, f) are merged to (be, f). The profile data on edges (b, be) and (be, f) are updated. . . . 61
4.4 Tuning section selection algorithm to maximize program coverage under the lower bound on the number of TS invocations, Nlb. This algorithm traverses the simplified call graph from the top down to find the code sections whose numbers of invocations are greater than Nlb. In addition, the algorithm finds the functions that may be manually partitioned to improve tuning section coverage. . . . 63
4.5 Update of the profile data after vi is chosen as the entry function to a tuning section. The updated profile reflects the execution times and invocation numbers after excluding the chosen tuning section. . . . 65
4.6 The final tuning section selection algorithm. This algorithm achieves both a large Nmin and a high coverage. It iteratively uses the method shown in Figure 4.4 to maximize the tuning section coverage under a series of thresholds Nlb, until the optimal Nlb is found. . . . 68
5.1 Block diagram of the PEAK performance tuning system . . . . . . . . . 75
5.2 An example of the tuning section calc1 in swim . . . . . . . . . . . . . . 79
5.3 The tuning section calc1 instrumented by the PEAK compiler . . . . . . 80
5.4 The initialization function instrumented by the PEAK compiler . . . . . 81
5.5 The exit function instrumented by the PEAK compiler . . . . . . . . . . 82
5.6 Normalized tuning time of the whole-program tuning and the PEAK system for SPEC CPU2000 FP benchmarks on Pentium IV. Lower is better. On average, PEAK gains a speedup of 20.3. . . . 84
5.7 Tuning time percentage of the six stages for SPEC CPU2000 FP benchmarks on Pentium IV. (TSS: tuning section selection, RMA: rating method analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning, FVG: final version generation.) The most time-consuming steps are PT, TSS and RMA. . . . 85
5.8 Program performance improvement relative to the baseline under O3 for SPEC CPU2000 FP benchmarks on Pentium IV. Higher is better. All the benchmarks use the train dataset as the input to the tuning process. Whole Train (PEAK Train) is the performance achieved by the whole-program tuning (the PEAK system) under the train dataset. Whole Ref and PEAK Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning. PEAK achieves equal or better program performance than the whole-program tuning. . . . 87
5.9 PEAK tuning time for INT benchmarks on Pentium IV . . . . . . . . . . 89
5.10 Program performance improvement relative to the baseline under O3 for SPEC CPU2000 INT benchmarks on Pentium IV. Higher is better. All the benchmarks use the train dataset as the input to the tuning process. Whole Train (PEAK Train) is the performance achieved by the whole-program tuning (the PEAK system) under the train dataset. Whole Ref and PEAK Ref use the ref dataset to evaluate the tuned program performance, but still the train dataset for tuning. . . . 90
A.1 Execution time of SPEC CPU 2000 benchmarks under different optimization levels compiled by GCC. (Four floating point benchmarks written in f90 are not included, since GCC does not compile them.) Each benchmark has four bars for O0 to O3. (a) and (c) show the integer benchmarks; (b) and (d) show the floating point benchmarks. (a) and (b) are the results on a Pentium IV machine; (c) and (d) are the results on a SPARC II machine. . . . 104
A.2 Relative improvement percentage of all O3 optimizations. . . . . . . . . . 107
A.3 Relative improvement percentage of all O3 optimizations. . . . . . . . . . 108
A.4 Relative improvement percentage of all O3 optimizations: sixtrack on a Pentium IV machine. . . . 109
A.5 Relative improvement percentage of strict aliasing. . . . . . . . . . . . . 110
A.6 Relative improvement percentage of global common subexpression elimination. . . . 112
A.7 Relative improvement percentage of if-conversion. . . . . . . . . . . . . . 113
ABSTRACT
Pan, Zhelong. Ph.D., Purdue University, May, 2006. PEAK – A Fast and Effective Performance Tuning System via Compiler Optimization Orchestration. Major Professor: Rudolf Eigenmann.
Compile-time optimizations generally improve program performance. Nevertheless,
degradations caused by individual compiler optimization techniques are to be
expected. Feedback-directed optimization orchestration solutions generate optimized
code versions under a series of optimization combinations, evaluate their performance
and search for the best version. One challenge to such systems is to tune program
performance quickly in an exponential search space. Another challenge is to achieve
high program performance, considering that optimizations interact.
The PEAK system in this thesis is an automated performance tuning system,
which searches for the best compiler optimization combinations for important code
sections in a program. It achieves fast tuning speed and high program performance.
The following contributions are made in this work: (1) An algorithm called Combined
Elimination (CE) is developed to explore the optimization space quickly and
effectively. (2) Three fast and accurate rating methods are designed to evaluate the
performance of an optimized code section based on a partial execution of the program.
(3) An algorithm is developed to identify important code sections as candidates for
performance tuning, trading off tuning speed and tuned program performance.
CE improves performance by 6.01% over GCC O3 for SPEC CPU2000, while
reducing tuning time to 57% of the closest alternative algorithm. Using SUN Forte
compilers, CE improves performance by 10.8%, compared to 5.6% improved by
manual tuning. Applying the rating methods, PEAK reduces tuning time further from
2.19 hours to 5.85 minutes, while achieving equal or better program performance.
1. INTRODUCTION
1.1 Motivation and Introduction
Although compiler optimizations generally yield significant performance
improvements in many programs on modern architectures, the potential for performance
degradation in certain program patterns is known to compiler writers and many
programmers. The state of the art is to let programmers deal with this problem through
compiler options. For example, a programmer can switch off an optimization after
finding that it causes performance degradation. The presence of these options
reflects the inability of today’s compilers to make the best optimization decision at
compile time. In this thesis, we refer to this process of finding the best optimization
combination for a target program as optimization orchestration. The large number
of compiler optimizations, the complicated interactions between optimizations, the
sophistication of computer architectures, and the complexity of the program itself
make this optimization orchestration problem difficult to solve.
This thesis develops the PEAK (Program Evolution by Adaptive Compilation)
system to automate performance tuning. It aims at orchestrating compiler
optimizations for a given scientific program, in a fast and effective way. PEAK adopts a
feedback-directed approach to performance tuning. It generates a series of
experimental versions, compiled under different optimization combinations, for every important
code segment. We call these code segments tuning sections. The performance of each
experimental version is rated based on a partial execution of the program, i.e., a few
invocations of the tuning section under a training input. Iteratively, our
orchestration algorithm chooses the next experimental optimization combinations, based on
these performance ratings, until convergence criteria are satisfied. In the end, the
final tuned program will take the best version found for each tuning section.
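The tuning loop just described can be sketched as follows. This is an illustrative sketch, not PEAK's actual code: `propose`, `compile_version`, and `rate` are hypothetical stand-ins for the orchestration algorithm, the compiler invocation, and the partial-execution rating step, respectively.

```python
def tune_section(propose, compile_version, rate):
    """Feedback-directed tuning of one tuning section (illustrative sketch).

    propose(ratings) returns the next optimization combination to try,
    or None once its convergence criteria are satisfied.
    compile_version(combo) builds an experimental version of the section.
    rate(version) scores that version from a few invocations under the
    training input (a lower rating means faster code).
    """
    ratings = {}
    while True:
        combo = propose(ratings)     # next experiment, chosen from feedback
        if combo is None:            # convergence: no more experiments
            break
        version = compile_version(combo)
        ratings[combo] = rate(version)
    # The final tuned program takes the best version found.
    return min(ratings, key=ratings.get)
```

For instance, with a proposer that simply enumerates the optimization levels O1, O2, and O3, `tune_section` returns whichever level rated fastest.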
To achieve the two goals of high program performance and fast tuning speed, this
thesis develops fast and effective orchestration algorithms to explore the optimization
space [1] and fast and accurate rating methods based on a partial execution of the
program to speed up performance evaluation [2]. Compiler tools are implemented to
analyze and to instrument the target program automatically. Specifically, this thesis
makes the following contributions:
1. An optimization orchestration algorithm is developed to explore the
optimization space quickly and effectively. This algorithm achieves equal or better
performance than the other comparable algorithms, but with less tuning time (57%
of the closest alternative).
2. Three accurate and fast performance rating methods based on a partial
execution of the program are designed to improve rating accuracy and to reduce
tuning time. These rating methods reduce tuning time from several hours to
several minutes, while achieving equal or higher program performance.
3. An algorithm to select important code segments as candidates for performance
tuning is presented. The selected tuning sections cover most (typically more
than 90%) of the total execution time. Each of them is invoked many times
(typically several hundred) in one run of the program. A high execution time
coverage leads to high tuned program performance; a large number of
tuning-section invocations leads to fast tuning, because it means that many optimized
versions can be evaluated in one run of the program.
4. An automatic performance tuning system via optimization orchestration is
implemented. The PEAK compiler analyzes and instruments the source program
before the tuning phase. The PEAK runtime system explores the optimization
space and evaluates the performance of optimized versions automatically.
5. Optimization orchestration performance is measured and analyzed
comprehensively for GCC and the SUN Forte compilers, with a focus on GCC. The
experiments are done for SPEC CPU2000 benchmarks on a Pentium IV machine
and a SPARC II machine.
1.2 Related Work
One attempt to alleviate the problem of optimization orchestration is to
improve compiler optimizations so as to reduce their potential performance
degradation. However, this approach has two problems: (1) Compilers can hardly consider
the interaction between all the optimizations, given the large number of
optimizations and the complexity of their interactions; (2) The compile-time performance
models used in the compiler are limited by the unavailability of program input data
and insufficient knowledge of the target architecture. Many projects try to improve
optimization performance in these two aspects, using either compile-time or runtime
techniques.
Some try to combine a few optimizations into one big pass, which considers the
interactions between the optimizations. For example, Wolf, Maydan and Chen [3]
develop an algorithm that applies fission, fusion, tiling, permutation and outer loop
unrolling to optimize loop nests. Similarly, Click and Cooper [4] show that combining
constant propagation, global value numbering, and dead code elimination leads to
more optimization opportunities. Nevertheless, it would be very difficult, or even
impossible, to combine all optimizations into one pass that removes all possible
performance degradation, because of the complexity of the compilation task and the
complicated interactions between optimizations.
Some postpone optimizations until runtime, when accurate knowledge about the
target architecture and the program input can be supplied to the compiler.
1. Similar to JVM JIT compilers [5–8], several projects aim at achieving both
portability and performance. DCG [9] proposes a retargetable dynamic code
generation system; VCODE [10] provides a machine-independent interface for
native machine code generation; DAISY [11] generates VLIW code on-the-fly
to emulate the existing architectures on a VLIW architecture.
2. Several projects generate optimized code using runtime input via Runtime
Specialization [12]. RCG and Fabius [13, 14] automatically translate ML
programs into code that generates native binary code at runtime. Calpa [15]
and DyC [16] form a staged compiler [17]: Calpa annotates the program at
compile-time; DyC creates a runtime compiler from the annotated program;
this runtime compiler in turn generates the executable using runtime values.
3. Some try to re-optimize binaries based on runtime information. Dynamo [18,
19] (for HP workstations) and DynamoRIO [20] (for IA-32 machines) find and
optimize hot traces for statically generated native binaries at runtime.
Similarly, in [21–23], techniques are developed to detect hot spots and to generate
new traces for runtime optimization via hardware support. ADAPT [24, 25]
compiles code intervals under different optimization configurations at runtime,
and chooses the best version for each interval. Continuous Program
Optimization [26] continually adjusts the storage layouts of dynamic data structures to
enhance the data locality and re-schedules the instructions based on runtime
profiling.
4. Besides the above runtime compilation systems, some develop specific runtime
optimization techniques. For example, a runtime data and iteration
reordering technique is applied to improve data locality in [27]. LRPD test [28, 29]
speculatively executes candidate loops in a parallel form, and re-executes the
loops serially when some data dependence is detected at runtime. In [30], Rus,
Rauchwerger and Hoeflinger analyze the memory references at both
compile-time and runtime to help parallelization. Runtime path profiling is proposed
to help compiler optimizations in [31]. Dynamic Feedback [32] produces
several versions under different synchronization optimization policies and
automatically chooses the best version by periodically sampling the performance of
each version.
The above techniques push the optimizations to runtime, so they need to reduce
or amortize the additional overhead introduced to program execution. Some systems
work on a slow baseline; therefore, there is an opportunity to amortize the overhead
by performance improvement from the optimizations. For example, Fabius [13, 14]
improves ML code, JIT [5–8] improves byte code, and Dynamo [18, 20] improves
un-optimized library code. Some [24, 25] off-load the compilation job to another
processor. Some [12–17] generate a small code generator before the production run,
so as to reduce the compilation overhead at runtime. Some [21–23] use hardware to
reduce the profiling overhead.
The goal of this thesis is to find the best compiler optimization combination
for scientific programs. These programs are mostly written in C or Fortran, and
the baseline, compiled under the default optimization setting, is already fast.
Meanwhile, a large number of optimizations (38 GCC optimizations in our experiments)
are involved. In this case, the performance improvement from optimization
orchestration can hardly amortize the tremendous compilation overhead, so solving
this orchestration problem at runtime can hardly yield a net performance
improvement. Instead, similar to profile-based optimizations, this thesis tunes
program performance under a training input, and the optimized final version is
then used at runtime. The tuning process can also be applied in between the
production runs.
This thesis adopts the commonly used feedback-directed optimization approach.
In this approach, many different binary code versions generated under different
experimental optimization combinations are evaluated. The performance of these
versions is compared using either measured execution times or profile-based estimates.
Iteratively, the orchestration algorithms use this information to decide the next
experimental optimization combinations, until convergence criteria are reached. In the
end, the optimization orchestration algorithm gives the final optimal version for the
entire program or important code sections in the program.
Many performance tuning systems use this feedback-directed approach. For
example, ATLAS [33] generates numerous variants of matrix multiplication to search
for the best one for a specific target machine. Similarly, Iterative Compilation [34]
searches through the transformation space to find the best block sizes and unrolling
factors. Meta optimization [35] uses machine-learning techniques to adjust several
compiler heuristics automatically.
The above three projects [33–35] have focused on a relatively small number of
optimization techniques, while this thesis tunes all optimizations that are controlled
by compiler options. All 38 GCC O3 optimization options are tuned in our exper-
iments. In other words, given a compiler and a program, this thesis tries to make
the best use of the compiler to generate the binary code with the best performance
for the program.
Several projects target the same optimization orchestration problem as this
thesis. The Optimization-Space Exploration (OSE) compiler [36] defines sets of
optimization configurations and an exploration space, which is traversed to find the
best configuration for the program using compile-time performance estimates as
feedback. Statistical Selection (SS) in [37] uses orthogonal arrays [38] to compute
the performance effects of the optimizations based on a statistical analysis of profile
information, which, in turn, is used to find the best optimization combination.
Compiler Optimization Selection [39] applies fractional factorial design to optimize the
selection of compiler options. Option Recommendation [40] chooses the PA-RISC
compiler options intelligently for an application, using heuristics based on
information from the user, the compiler, and the profiler. (Different from finding the best
optimization combination, Adaptive Optimizing Compiler [41] uses a biased random
search to discover the best order of optimizations.¹)
Still, there are two unsolved major issues regarding optimization orchestration.
1. How do we search through the optimization space quickly and effectively, given
that the search space is huge due to the interactions between compiler
optimizations? Complex algorithms, such as the exhaustive search in ATLAS [33]
and the automatic theorem prover in Denali [42], would be prohibitively slow
to solve this problem for a real scientific application. Also, this thesis looks for
a general solution, not heuristics for a special environment as in [40]. (We will
use OSE [36] and SS [37] as reference points.)
¹ Usually, a compiler does not have the option to specify the order of the optimizations. So, this thesis does not compare to [41], although the techniques developed in this thesis can be extended to search for the best order of optimizations.
2. How do we evaluate the optimized versions quickly and accurately? The most
accurate method is to use the real execution time to evaluate the performance
of the experimental versions. However, given the large number of experimental
versions and the execution time of a real application, this method leads to
excessive tuning times. The performance model used in [36] is fast; however,
it achieves significantly less program performance than the former method.
The goal of our PEAK system is to tune the performance of the important code
sections in a scientific program, in a fast and effective way, via orchestrating the
optimizations controlled by compiler options. PEAK is an automated system,
targeting the above two issues. First, a fast and effective feedback-directed algorithm
is designed to search through the optimization space, considering the interaction
between the optimizations. Second, PEAK uses fast and accurate performance
evaluation methods based on a partial execution of the program.
1.3 Thesis Organization
Chapter 2 presents a fast and effective search algorithm named Combined
Elimination (CE), which considers the interaction between optimizations. In the
experiments with 38 GCC optimizations on Pentium IV and SPARC II, this algorithm
takes the least tuning time, while achieving the same program performance as other,
comparable algorithms. Through orchestrating a small set of optimizations causing
the most degradation, we show that the performance achieved by CE is close to the
upper bound obtained by an exhaustive search algorithm. The gap is less than 0.2%
on average. Experiments on the SUN Forte compilers show that CE achieves
performance significantly better than the manually tuned peak performance presented
in the SPEC CPU2000 result reports.
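Chapter 2 defines CE precisely; the elimination idea behind it can be sketched as follows, assuming a `measure(enabled)` timing harness (a hypothetical stand-in for compiling the program under the given option set and timing it). This is a simplified sketch, not the exact algorithm of Chapter 2:

```python
def combined_elimination(options, measure):
    """Iteratively switch off options whose removal speeds up the program.

    measure(enabled) returns the runtime under the given set of enabled
    options (lower is better).  After each elimination, the remaining
    candidates are re-checked against the new baseline, so interactions
    between options are partially accounted for.
    """
    enabled = set(options)
    base = measure(enabled)
    while True:
        # RIP: relative improvement percentage of switching one option off.
        rips = {opt: (measure(enabled - {opt}) - base) / base * 100.0
                for opt in sorted(enabled)}
        harmful = sorted((r, o) for o, r in rips.items() if r < 0)
        if not harmful:                    # no enabled option hurts any more
            return enabled, base
        _, worst = harmful[0]              # drop the most harmful option first
        enabled.discard(worst)
        base = measure(enabled)
        for _, opt in harmful[1:]:         # re-check the rest vs. new baseline
            t = measure(enabled - {opt})
            if t < base:
                enabled.discard(opt)
                base = t
```

With a synthetic `measure` in which option a saves 10 time units while b and c cost 5 and 3, the sketch ends with only a enabled.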
Chapter 3 proposes fast and accurate rating methods to evaluate the performance
of optimized versions. These rating methods operate on important code sections,
called tuning sections, of a program. The rating for one optimized version of a
tuning section is generated based on the execution times of several invocations to
the version. In one run of the program, there are many invocations to each tuning section, so multiple versions are evaluated in each run. In this way, this approach reduces the tuning time significantly. Meanwhile, the rating methods achieve a fair comparison by identifying invocations that have the same workload, by finding mathematical relationships between different workloads, or by forcing re-execution of a tuning section under the same input.
Chapter 4 develops an algorithm for selecting the important code sections in a
program as tuning sections. This algorithm maximizes the number of invocations to
the tuning sections and their execution time coverage. In this way, the tuning section
selection algorithm aims at both tuning speed and tuned program performance.
Chapter 5 shows the design of our automated performance tuning system –
PEAK. This chapter discusses two primary components, the PEAK compiler and
the PEAK runtime system, as well as special implementation problems related to
runtime code generation and loading. The experimental results on SPEC CPU2000
are presented. On average, compared to whole-program tuning, PEAK reduces the tuning time from 2.19 hours to 5.85 minutes and improves the performance gain from 11.7% to 12.1% for FP benchmarks, by orchestrating optimizations for each tuning section.
Chapter 6 concludes this thesis. Future work is discussed as well.
The appendix discusses the performance behavior of all the GCC O3 optimization
options on the SPEC CPU2000 benchmarks, using a Pentium IV machine and a
SPARC II machine. The reasons for performance degradation are analyzed for several
important optimizations. One important finding is that optimizations may exhibit
unexpected performance behavior – even generally-beneficial techniques may degrade
performance. Degradations are often complex side-effects of the interaction with
other optimizations.
2. FAST AND EFFECTIVE OPTIMIZATION
ORCHESTRATION ALGORITHMS
2.1 Introduction
Compiler optimizations for modern architectures have reached a high level of so-
phistication. Although they yield significant improvements in many programs, the
potential for performance degradation in certain program patterns is known to com-
piler writers and many programmers. Today’s compilers have evolved to the point
where they present to programmers a large number of optimization options. For
example, GCC compilers include 38 options, roughly grouped into three optimiza-
tion levels, O1 through O3. On the other hand, compiler optimizations interact in
unpredictable manners, as many have observed [34,36,37,39,43]. How do we search
for the best optimization combination for a given program in order to achieve the
best performance? This chapter aims at developing a fast and effective algorithm to
do so. We call this process optimization orchestration. In this chapter, we apply
the algorithm to the entire program. From the next chapter on, we will apply it to
each important code segment of the program.
Several automatic performance tuning systems have taken a dynamic, feedback-
directed approach to orchestrate compiler optimizations. In this approach, many
different binary code versions generated under different experimental optimization
combinations are evaluated. The performance of these versions is compared
using either measured execution times or profile-based estimates. Iteratively, the
orchestration algorithms use this information to decide the next experimental opti-
mization combinations, until convergence criteria are reached.
The new algorithms presented in this thesis follow the above model. We first
develop two simple algorithms: (a) Batch Elimination (BE) identifies the harmful
optimizations and removes them in a batch. (b) Iterative Elimination (IE) succes-
sively removes harmful optimizations, measured through a series of program execu-
tions. Based on the above two algorithms, we design our final algorithm, Combined
Elimination (CE). We compare our algorithms with two algorithms proposed in the
literature: (i) The “compiler construction-time pruning” algorithm in Optimization-
Space Exploration (OSE) [36] iteratively constructs new optimization combinations
using “unions” of the ones in the previous iteration. (ii) Statistical Selection (SS)
in [37] uses orthogonal arrays [38] to compute the main effect of the optimizations
based on a statistical analysis of profile information, which in turn is used to find
the best optimization combination.
In addition to the above algorithms that we compare our work with, several other approaches have been proposed. Typically, they need hundreds of compilations and experimental runs, or more, when tuning a large number of optimizations (38 optimizations in our experiments). The goal of our algorithm is to reduce this number to several tens, while achieving comparable or even better program performance.
For the large number of benchmarks and optimizations experimented with in this thesis,
we can only apply the algorithms in [36] and [37], which are closest to our new
algorithm, CE, in terms of tuning time. To further verify that our CE algorithm
achieves program performance comparable to other existing algorithms, we use a
small set of optimizations and show that CE closely approaches the upper bound
represented by exhaustive search. The other existing algorithms are as follows.
In [39], a fractional factorial design is developed based on aliasing or confound-
ing [44]; it illustrates a half-fraction design with 2^(n−1) experiments. In [40], heuristics
are designed to select PA-RISC compiler options based on information from the user,
the compiler, and the profiler. While the use of a priori knowledge of the interaction
between optimization techniques may reduce the complexity of the search for the
best, it has been found by others [43] that the number of techniques that potentially
interact is still large. ATLAS [33] starts with a parameterized, hand-coded set of
matrix multiplication variants and evaluates them on the target machine to deter-
mine the optimum settings for that context. Similarly, Iterative Compilation [34]
searches through the transformation space to find the best block sizes and unrolling
factors. In more recent research [45], five different algorithms (genetic algorithm, simulated annealing, grid search, window search and random search) are used to find the best blocking and unrolling parameters. Based on the random search in [45], later work aims to find a general compiler optimization setting using GCC [46]. Meta op-
timization [35] uses machine-learning techniques to adjust the compiler heuristics
automatically.
Different from our goal of finding the best optimization combination is finding
the best order of optimization phases. In [41], a biased random search and a genetic algorithm are used to discover the best order of optimizations. Others have added hill climbing and greedy constructive algorithms [47]. Furthermore, the genetic algorithm has been improved to reduce its search time [48].
In this chapter, we make the following contributions:
• We present a new performance tuning algorithm, Combined Elimination (CE),
which aims at picking the best set of compiler optimizations for a program.
We show that this algorithm takes the shortest tuning time, while achieving
comparable or better performance than other algorithms. Using a small set
of (6) important optimizations, we also verify that CE closely approaches the
performance upper bound.
• We evaluate our and other algorithms on a large set of realistic programs. We
use all 23 SPEC CPU2000 benchmarks that are amenable to the GCC compiler
infrastructure (omitting 5 benchmarks, written in F90 and C++). By contrast,
many previous papers have used small kernel benchmarks. Among the papers
that used a large set of SPEC benchmarks are [2, 36,43].
• Our experiments use all (38) GCC O3 options, where the speed of the tuning
algorithm becomes of decisive importance. Except [36] and [46] that also use
a large number of optimizations, previous papers have generally evaluated a
small set of optimizations.
• Besides the GCC compiler, we apply our CE algorithm to the SUN Forte
compiler set as well, whose optimization options may have more than two values
instead of just “on” or “off”. CE achieves significantly better performance than
the peak setting in SPEC results [49], which we view as the manually tuned
results.
We apply the algorithms to tune the performance of SPEC CPU2000 benchmarks
on both a Pentium IV machine and a SPARC II machine. Using the full set of GCC
O3 optimizations, the average normalized tuning time, which will be defined formally
in Section 2.3.2, is 75.3 for our CE algorithm; 131.2 for the OSE algorithm (Algorithm 5 in Section 2.2); 313.9 for the SS algorithm (Algorithm 6). Hence, CE reduces
tuning time to 57% of the closest alternative. CE improves performance by 6.01%,
over O3, the highest optimization level; OSE by 5.68%; SS by 5.46%. (Compared to
unoptimized programs, performance improvement achieved by CE would amount to
56.4%, on average.)
In order to compare CE with the manually tuned performance reported in SPEC
results, we implement the algorithm using the Forte compiler set. The experiments
are conducted for SPEC CPU2000 benchmarks on a SPARC II machine. On average,
for floating point benchmarks, CE achieves 10.8% improvement relative to the base
setting, compared to 5.6% by the SPEC peak settings; for integer benchmarks, CE
achieves 8.1% compared to 4.1% by the SPEC peak settings.
The remainder of this chapter is organized as follows. 1 In Section 2.2, we
describe the orchestration algorithms that we use in our comparison. In Section 2.3,
we compare tuning time and tuned program performance of these algorithms under
38 optimizations. In Section 2.4, we compare the performance of CE with the upper
bound obtained using exhaustive search under a smaller set of optimizations. In
1The main work of this chapter has been published in [1].
Section 2.5, we extend the CE algorithm to handle non-on-off options, evaluate it on the Forte compiler, and compare it with the manually tuned results.
2.2 Orchestration Algorithms
2.2.1 Problem description
We define the goal of optimization orchestration as follows:
Given a set of compiler optimization options {F1, F2, ..., Fn}, find the combination
that minimizes the program execution time. Do this efficiently, without the use of
a priori knowledge of the optimizations and their interactions. (Here, n is the number
of optimizations.)
In this section, we give an overview of several algorithms that pursue this goal.
We first present the exhaustive search algorithm, ES. Then, we develop two of our
algorithms, BE and IE, on which our final CE method builds, followed by CE itself.
Next, we present two existing algorithms, OSE and SS, with which our algorithm
compares. Each algorithm makes a number of full program runs, using the resulting
run times as performance feedback for deciding on the next run. We keep the algo-
rithms general and independent of specific compilers and optimization techniques.
The algorithms tune the options available in the given compiler via command line
flags. Here, we focus on on-off options, similar to several of the papers [2,37,39,43].
In Section 2.5, we will extend our CE algorithm to handle non-on-off options.
2.2.2 Orchestration algorithms
Algorithm 1: Exhaustive Search (ES)
Due to the interaction of compiler optimizations, the exhaustive search approach,
which is called the factorial design in [37, 39], would try every optimization combi-
nation to find the best. This approach provides an upper bound of an application’s
performance after optimization orchestration. However, its complexity is O(2^n), which is prohibitive if a large number of optimizations are involved. For the 38 optimizations in our experiments, it would take up to 2^38 program runs – a million years
for a program that runs in two minutes. We will not evaluate this algorithm under
the full set of options. However, Section 2.4 will use a feasible set of (6) options
to compare our algorithm with this upper bound. Using pseudo code, ES can be
described as follows.
1. Get all 2^n combinations of the n options, {F1, F2, ..., Fn}.
2. Measure the application execution time of the optimized version compiled under every possible combination.
3. The best version is the one with the least execution time.
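The steps above can be sketched in Python as follows; `measure` is a hypothetical hook (not part of the thesis) standing in for compiling the program under a given tuple of 0/1 option values and timing the result, and the toy timing model is made up for illustration.

```python
from itertools import product

def exhaustive_search(n, measure):
    """Try every on/off combination of the n options; return the best.

    measure(config) is a hypothetical hook: it compiles the program
    under the given tuple of 0/1 option values and returns its run time.
    """
    best_config, best_time = None, float("inf")
    for config in product((0, 1), repeat=n):   # all 2^n combinations
        t = measure(config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

# Toy model: F1 helps by 10 units when on, F2 hurts by 5, F3 is neutral.
def toy_measure(config):
    return 100 - 10 * config[0] + 5 * config[1]

print(exhaustive_search(3, toy_measure))  # -> ((1, 0, 0), 90)
```

The `product` loop makes the O(2^n) cost visible: for n = 38 it would enumerate 2^38 configurations, which is why ES is only feasible for the small option set of Section 2.4.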
Algorithm 2: Batch Elimination (BE)
The idea of Batch Elimination (BE) is to identify the optimizations with nega-
tive effects and turn them off all at once. BE achieves good program performance,
when the optimizations do not interact with each other. It is the fastest among the
feedback-directed algorithms.
The negative effect of one optimization, Fi, can be represented by its Relative
Improvement Percentage (RIP), RIP(Fi), which is the relative difference of the ex-
ecution times of the two versions with and without Fi, T(Fi = 1) and T(Fi = 0). Fi = 1 means Fi is on; Fi = 0 means it is off.
RIP(Fi) = ( T(Fi = 0) − T(Fi = 1) ) / T(Fi = 1) × 100%    (2.1)
The baseline of this approach switches on all optimizations. T(Fi = 1) is the execu-
tion time of the baseline TB as shown in Equation 2.2. The performance improvement
by switching off Fi from the baseline B relative to the baseline performance can be
computed with Equation 2.3.
TB = T(Fi = 1) = T(F1 = 1, F2 = 1, ..., Fn = 1)    (2.2)
RIPB(Fi = 0) = ( T(Fi = 0) − TB ) / TB × 100%    (2.3)
If RIPB(Fi = 0) < 0, the optimization of Fi has a negative effect. The BE algorithm
eliminates the optimizations with negative RIPs in a batch to generate the final,
tuned version. This algorithm has a complexity of O(n).
1. Compile the application under the baseline B = {F1 = 1, F2 = 1, ..., Fn = 1}. Execute the generated code version to get the baseline execution time TB.
2. For each optimization Fi, switch it off from B and compile the application.
Execute the generated version to get T(Fi = 0), and compute the RIPB(Fi = 0)
according to Equation 2.3.
3. Disable all optimizations with negative RIPs to generate the final, tuned ver-
sion.
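A minimal Python sketch of BE, assuming a hypothetical `measure(config)` hook that compiles the program under the given list of 0/1 option values and returns its execution time (the toy timing model below is invented for illustration):

```python
def batch_elimination(n, measure):
    """Batch Elimination (BE): compute each option's RIP relative to
    the all-on baseline (Equation 2.3) and switch off, in one batch,
    every option whose RIP is negative.

    measure(config) is a hypothetical hook that compiles the program
    under the given list of 0/1 option values and returns its run time.
    """
    baseline = [1] * n
    t_base = measure(baseline)
    final = list(baseline)
    for i in range(n):
        trial = list(baseline)
        trial[i] = 0
        rip = (measure(trial) - t_base) / t_base * 100  # RIP_B(Fi = 0)
        if rip < 0:                 # switching Fi off made it faster
            final[i] = 0
    return final

# Toy model: F1 hurts by 5 units when on, F2 helps by 3 when on.
def toy_measure(config):
    return 100 + 5 * config[0] - 3 * config[1]

print(batch_elimination(2, toy_measure))  # -> [0, 1]
```

Note that each option is judged against the same all-on baseline, which is exactly why BE is O(n) and also why it cannot account for interactions between options.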
Algorithm 3: Iterative Elimination (IE)
We design Iterative Elimination (IE) to take the interaction of optimizations into
consideration. Unlike BE, which turns off all the optimizations with negative effects
at once, IE iteratively turns off one optimization with the most negative effect at a
time.
IE starts with the baseline that switches on all the optimizations. After com-
puting the RIPs of the optimizations according to Equation 2.3, IE switches off
the one optimization with the most negative effect from the baseline. This process
repeats with all remaining optimizations, until none of them causes performance
degradation. The complexity of IE is O(n^2).
1. Let B be the option combination for measuring the baseline execution time,
TB. Let S be the set of optimizations forming the optimization search space.
Initialize S = {F1, F2, ..., Fn} and B = {F1 = 1, F2 = 1, ..., Fn = 1}.
2. Compile and execute the application under the baseline setting to get the
baseline execution time TB.
3. For each optimization Fi ∈ S, switch Fi off from B and compile the application, execute the generated code version to get T(Fi = 0), and compute the RIP of
Fi relative to the baseline B, RIPB(Fi = 0), according to Equation 2.3.
4. Find the optimization Fx with the most negative RIP. Remove Fx from S,
and set Fx to 0 in B.
5. Repeat Steps 2, 3 and 4 until all options in S have non-negative RIPs. B
represents the final option combination.
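These steps might be sketched as follows; as before, `measure(config)` is a hypothetical compile-and-time hook, and the toy model (with an invented interaction term) is only illustrative:

```python
def iterative_elimination(n, measure):
    """Iterative Elimination (IE): repeatedly switch off the single
    remaining option with the most negative RIP, re-measuring the
    baseline each round, until no option degrades performance.

    measure(config) is a hypothetical compile-and-time hook.
    """
    B = [1] * n                 # current baseline combination
    S = set(range(n))           # options still under consideration
    while S:
        t_base = measure(B)
        rips = {}
        for i in S:
            trial = list(B)
            trial[i] = 0
            rips[i] = (measure(trial) - t_base) / t_base
        worst = min(rips, key=rips.get)
        if rips[worst] >= 0:    # nothing left hurts performance
            break
        B[worst] = 0
        S.remove(worst)
    return B

# Toy model with interaction: F1 and F2 each hurt, less so together.
def toy_measure(config):
    return 100 + 6 * config[0] + 2 * config[1] - 3 * config[0] * config[1]

print(iterative_elimination(2, toy_measure))  # -> [0, 0]
```

Because the baseline B is re-measured after every elimination, each remaining option is judged in the context of the options already switched off, which is how IE captures interactions that BE misses.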
Algorithm 4: Combined Elimination (CE)
CE, our final algorithm, combines the ideas of the two algorithms just described.
It has a similar iterative structure as IE; however, in each iteration, CE applies the
idea of BE: after identifying the optimizations with negative effects, in each iteration,
CE tries to eliminate these optimizations one by one in a greedy fashion.
We will see, in Section 2.3, that IE achieves better program performance than
BE, since it considers the interaction of optimizations. Nevertheless, when the in-
teractions have only small effects, BE may perform close to IE and more quickly
provide the solution. CE takes the advantages of both BE and IE. When the opti-
mizations interact weakly, CE eliminates the optimizations with negative effects in
one iteration, just like BE. Otherwise, CE eliminates them iteratively, like IE. As a
result, CE achieves both good program performance and fast tuning speed. CE has
a complexity of O(n^2).
1. Let B be the baseline option combination. Let S be the set of optimiza-
tions forming the optimization search space. Initialize these two sets: S =
{F1, F2, ..., Fn} and B = {F1 = 1, F2 = 1, ..., Fn = 1}.
2. Compile and execute the application under the baseline setting to get the
baseline execution time TB. Measure the RIPB(Fi = 0) of each optimization
option Fi in S relative to the baseline B.
3. Let X = {X1, X2, ..., Xl} be the set of optimization options with negative
RIPs. X is sorted in increasing order, that is, the first element, X1, has the most negative RIP. Remove X1 from S and set X1 to 0 in B. (B is changed
in this step.) For i from 2 to l,
∗ Measure the RIP of Xi relative to the baseline B.
∗ If the RIP of Xi is negative, remove Xi from S and set Xi to 0 in B.
4. Repeat Steps 2 and 3 until all options in S have non-negative RIPs. B repre-
sents the final solution.
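The CE steps can be sketched as follows; `measure(config)` is again a hypothetical compile-and-time hook, and the toy timing model is made up to show the greedy re-checking of X2..Xl against the updated baseline:

```python
def combined_elimination(n, measure):
    """Combined Elimination (CE): measure RIPs like IE, then greedily
    try to switch off every negative option in the same iteration,
    re-checking each candidate against the updated baseline B.

    measure(config) is a hypothetical compile-and-time hook.
    """
    B = [1] * n                    # baseline combination
    S = set(range(n))              # options still under consideration
    while True:
        t_base = measure(B)
        rips = {}
        for i in S:
            trial = list(B)
            trial[i] = 0
            rips[i] = (measure(trial) - t_base) / t_base
        negatives = sorted((i for i in S if rips[i] < 0), key=rips.get)
        if not negatives:          # no option degrades performance: done
            return B
        B[negatives[0]] = 0        # always drop X1, the most negative
        S.remove(negatives[0])
        for i in negatives[1:]:    # re-check X2..Xl one by one
            t_base = measure(B)
            trial = list(B)
            trial[i] = 0
            if (measure(trial) - t_base) / t_base < 0:
                B[i] = 0
                S.remove(i)

# Toy model: F1 hurts by 5, F2 helps by 2, F3 hurts by 3 when on.
def toy_measure(config):
    return 100 + 5 * config[0] - 2 * config[1] + 3 * config[2]

print(combined_elimination(3, toy_measure))  # -> [0, 1, 0]
```

When the options interact weakly, the inner loop eliminates all negative options in one pass, matching BE's speed; when they interact strongly, the re-measurement of `t_base` inside the loop makes CE behave like IE.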
Algorithm 5: Optimization Space Exploration (OSE)
In [36], the following method is used to orchestrate optimizations. First, a “com-
piler construction-time pruning” algorithm selects a small set of optimization combi-
nations that perform well on a given set of code segments. Then, these combinations
are used to construct a search tree, which is traversed to find good combinations for
code segments in a target program. To fairly compare this method with other or-
chestration algorithms, we slightly modify the “compiler construction-time pruning”
algorithm, which is then referred to as the OSE algorithm. (In [36], the pruning al-
gorithm aims at finding a set of good optimization combinations; while the modified
OSE algorithm in this thesis finds the best of this set. The modified algorithm is
applied to the whole application instead of code segments.)
The basic idea of the pruning algorithm is to iteratively find better optimization
combinations by merging the beneficial ones. In each iteration, a new test set Ω
is constructed by merging the optimization combinations in the old test set using
“union” operations. Next, after evaluating the optimization combinations in Ω, the
size of Ω is reduced to m by dropping the slowest combinations. The process repeats
until the performance increase in the Ω set of two consecutive iterations becomes
negligible. The complexity of OSE is O(m^2 × n). We use the same m = 12 as in [36]. Roughly, m can be viewed as O(n); hence, the complexity of OSE is approximately O(n^3). The specific steps are as follows:
1. Construct a set, Ω, which consists of the default optimization combination,
and n combinations, each of which assigns a non-default value to a single
optimization. (In our experiments, the default optimization combination, O3,
turns on all optimizations. The non-default value for each optimization is off.)
2. Measure the application execution time for each optimization combination in
Ω. Keep the m fastest combinations in Ω, and drop the rest.
3. Construct a new Ω set, each element in which is a union of two optimization
combinations in the old Ω set. (The “union” operation takes non-default values
of the options in both combinations.)
4. Repeat Steps 2 and 3 until no new combinations can be generated or the improvement of the fastest version in Ω becomes negligible. We use the fastest version in the final Ω as the final version.
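A sketch of the modified OSE loop: each combination is represented as the frozenset of options switched off, `measure(config)` is a hypothetical compile-and-time hook, and the "negligible improvement" test is approximated with a small tolerance `tol` (an assumption of this sketch, not a value from [36]):

```python
def ose(n, measure, m=12, tol=0.002):
    """Modified OSE sketch: start from the default (all options on)
    plus n single-flip combinations, keep the m fastest, union pairs,
    and repeat until the best time stops improving.

    A combination is the frozenset of options switched OFF;
    measure(config) is a hypothetical compile-and-time hook.
    """
    def time_of(off_set):
        return measure([0 if i in off_set else 1 for i in range(n)])

    omega = [frozenset()] + [frozenset([i]) for i in range(n)]
    best = float("inf")
    while True:
        omega = sorted(set(omega), key=time_of)[:m]    # keep m fastest
        new_best = time_of(omega[0])
        if new_best >= best * (1 - tol):               # negligible gain
            return omega[0]
        best = new_best
        omega = [a | b for a in omega for b in omega]  # all pair unions

# Toy model: every option hurts a little when on; best turns all off.
def toy_measure(config):
    return 100 + 5 * config[0] + 3 * config[1] + 1 * config[2]

print(ose(3, toy_measure))  # -> frozenset({0, 1, 2})
```

A production version would cache `time_of` results, since the same combination is re-evaluated across iterations; the sketch omits this for clarity.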
Algorithm 6: Statistical Selection (SS)
SS was developed in [37]. It uses a statistical method to identify the performance
effect of the optimization options. The options with positive effects are turned
on, while the ones with negative effects are turned off in the final version, in an
iterative fashion. This statistical method takes the interactions of optimizations into
consideration. (All the other algorithms except BE consider the interactions.)
The statistical method is based on orthogonal arrays (OA), which have been
proposed as an efficient design of experiments [38, 44]. Formally, an OA is an m × k
matrix of zeros and ones. Each column of the array corresponds to one compiler
option. Each row of the array corresponds to one optimization combination. SS
uses the OA with strength 2, that is, two arbitrary columns of the OA contain the
patterns 00, 01, 10, 11 equally often. Our experiments use the OA with 38 options
and 40 rows, which is constructed based on a Hadamard matrix taken from [50].
By a series of program runs, this SS approach identifies the options that have
the largest effect on code performance. Then, it switches on/off those options with
a large positive/negative effect. After iteratively applying the above procedure to the options that have not been set, SS arrives at a final combination of the options. SS
has a complexity of O(n^2). The pseudo code is as follows.
1. Compile the application with each row from orthogonal array A as the compiler
optimization combination and execute the optimized version.
2. Compute the relative effect, RE(Fi), of each option using Equations 2.4 and 2.5, where E(Fi) is the main effect of Fi, s is one row of A, and T(s) is the execution time of the version under s.

E(Fi) = ( Σ_{s∈A: si=1} T(s) − Σ_{s∈A: si=0} T(s) )^2 / m    (2.4)

RE(Fi) = E(Fi) / Σ_{j=1..k} E(Fj) × 100%    (2.5)
3. If the relative effect of an option is larger than a threshold of 10%,
∗ if the option has a positive improvement, I(Fi) > 0, according to Equa-
tion 2.6, switch the option on.
∗ else if it has a negative improvement, switch the option off.
I(Fi) = ( Σ_{s∈A: si=0} T(s) − Σ_{s∈A: si=1} T(s) ) / Σ_{s∈A: si=0} T(s)    (2.6)
4. Construct a new orthogonal array A by dropping the columns corresponding
to the options selected in the previous step.
5. Repeat all above steps until all of the options are set.
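The main-effect computation of Equations 2.4 and 2.5 can be illustrated as follows. The 4-row, 3-column orthogonal array below is a small strength-2 example for exposition only (the experiments in this thesis use a 40-row array for 38 options), and the timings are made up:

```python
def main_effects(oa, times):
    """Main effect E(Fi) and relative effect RE(Fi), per Equations 2.4
    and 2.5: oa is an m x k 0/1 orthogonal array and times[r] is the
    measured execution time of the version compiled under row r."""
    m, k = len(oa), len(oa[0])
    E = []
    for i in range(k):
        on = sum(t for row, t in zip(oa, times) if row[i] == 1)
        off = sum(t for row, t in zip(oa, times) if row[i] == 0)
        E.append((on - off) ** 2 / m)
    total = sum(E)
    return [e / total * 100 for e in E]  # RE(Fi) in percent

# A strength-2 orthogonal array for 3 options: every pair of columns
# contains the patterns 00, 01, 10, 11 equally often.
oa = [[0, 0, 0],
      [0, 1, 1],
      [1, 0, 1],
      [1, 1, 0]]
times = [100, 98, 110, 108]
print(main_effects(oa, times))  # F1 dominates the relative effect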
Summary of the orchestration algorithms
The goal of optimization orchestration is to find the optimal point in a high-
dimension space S = F1 × F2 × ... × Fn. BE probes each dimension to find and
adopt the ones that benefit performance. SS works in a similar way, but via a
statistical and iterative approach. OSE probes multiple directions, each of which
may involve multiple dimensions, and searches along the direction combinations that
may benefit performance. IE probes each dimension and fixes, one at a time, the dimension that yields the largest performance gain. CE probes each dimension and greedily
fixes the dimensions that benefit performance at each iteration.
Table 2.1 summarizes the complexities of all six algorithms compared in this
thesis.
Table 2.1
Orchestration algorithm complexity (n is the number of optimization options)

ES      BE    IE      OSE     SS      CE
O(2^n)  O(n)  O(n^2)  O(n^3)  O(n^2)  O(n^2)
2.3 Experimental Results
2.3.1 Experimental environment
We evaluate our algorithm using the optimization options of the GCC 3.3.3 com-
piler on two different computer architectures: Pentium IV and SPARC II. Our reasons for choosing GCC are that this compiler is widely used, has many easily accessible compiler optimizations, and is portable across many different computer architectures.
In this section, we use all 38 optimization options implied by “O3”, the highest
optimization level. These options are listed in Table 2.2 and are described in the
GCC manual [51].
We take our measurements using all SPEC CPU2000 benchmarks written in
F77 and C, which are amenable to GCC. To differentiate the effect of compiler
optimizations on integer (INT) and floating-point (FP) programs, we display the
results of these two benchmark categories separately. Our overall tuning process is
similar to profile-based optimizations. A train dataset is used to tune the program.
A different input, the SPEC ref dataset, is usually used to measure performance.
To separate the performance effects attributed to the tuning algorithms from those
caused by the input sets, we measure program performance under both the train
and ref datasets. For our detailed comparison of the tuning algorithms, we will start
with the train set. In Section 2.3.3, we will show that, overall, the tuned benchmark
suite achieves similar performance improvement under the train and ref datasets.
To ensure accurate measurements and eliminate perturbation by the operating
system, we re-execute each code version multiple times under a single-user environ-
Table 2.2
Optimization options in GCC 3.3.3
F1 rename-registers F2 inline-functions
F3 align-labels F4 align-loops
F5 align-jumps F6 align-functions
F7 strict-aliasing F8 reorder-functions
F9 reorder-blocks F10 peephole2
F11 caller-saves F12 sched-spec
F13 sched-interblock F14 schedule-insns2
F15 schedule-insns F16 regmove
F17 expensive-optimizations F18 delete-null-pointer-checks
F19 gcse-sm F20 gcse-lm
F21 gcse F22 rerun-loop-opt
F23 rerun-cse-after-loop F24 cse-skip-blocks
F25 cse-follow-jumps F26 strength-reduce
F27 optimize-sibling-calls F28 force-mem
F29 cprop-registers F30 guess-branch-probability
F31 delayed-branch F32 if-conversion2
F33 if-conversion F34 crossjumping
F35 loop-optimize F36 thread-jumps
F37 merge-constants F38 defer-pop
ment, until the three least execution times are within a range of [−1%,1%]. In most
of our experiments, each version is executed exactly three times. Hence, the impact
on tuning time is negligible.
In our experiments, the same code version may be generated under different opti-
mization combinations. 2 This observation allows us to reduce tuning time. We keep
a repository of code versions generated under different optimization combinations.
2 Comparing the binaries generated under two different optimization combinations via the UNIX utility diff can show whether these two binaries are identical.
The repository allows us to memorize and reuse their performance results. Different
orchestration algorithms use their own repositories and get affected in similar ways,
so that our comparison remains fair.
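The repository idea can be sketched as follows. Hashing the binary content stands in for the diff comparison mentioned above, and `run_and_time` is a hypothetical hook (not from the thesis) that executes a binary and returns its run time:

```python
import hashlib

class VersionRepository:
    """Memoize performance results by binary content: optimization
    combinations that produce identical binaries share one measurement.

    Hashing the binary plays the role of a diff comparison;
    run_and_time is a hypothetical hook that executes a binary and
    returns its run time.
    """
    def __init__(self, run_and_time):
        self.run_and_time = run_and_time
        self.cache = {}

    def time_of(self, binary):
        key = hashlib.sha256(binary).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.run_and_time(binary)
        return self.cache[key]

# Two combinations that happen to yield the same binary: one real run.
runs = []
def fake_run(binary):
    runs.append(binary)
    return 42.0

repo = VersionRepository(fake_run)
repo.time_of(b"identical-binary")
repo.time_of(b"identical-binary")  # served from the repository
print(len(runs))  # -> 1
```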
2.3.2 Metrics
Two important metrics characterize the behavior of orchestration algorithms:
1. The program performance of the best optimized version found by the orches-
tration algorithm. We define it as the performance improvement percentage
of the best version relative to the base version under the highest optimization
level O3.
2. The total tuning time spent in the orchestration process. Because the execution
times of different benchmarks are not the same, we normalize the tuning time
(TT ) by the time of evaluating the base version, i.e., one compilation time
(CTB) plus three execution times (ETB) of the base version.
NTT = TT/(CTB + 3 × ETB) (2.7)
This normalized tuning time (NTT ) roughly represents the number of experi-
mented versions. (The number may be larger or smaller than the actual number
of tested optimization combinations due to three effects: a) Some optimiza-
tions may not have any effect on the program, allowing the version repository
to reduce the number of experiments. b) Perturbation filtering mechanism
in Section 2.3.1 may increase the number of runs of some versions. c) The
experimental versions may be faster or slower than the base version.)
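As a concrete (made-up) example of Equation 2.7:

```python
def normalized_tuning_time(tt, ct_base, et_base):
    """Equation 2.7: total tuning time TT divided by the cost of
    evaluating the base version once, i.e., one compilation time
    plus three execution times."""
    return tt / (ct_base + 3 * et_base)

# E.g., 120 minutes of tuning with a 1-minute compile and 2-minute
# runs corresponds to roughly 17 experimental versions.
print(normalized_tuning_time(120.0, 1.0, 2.0))
```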
A good optimization orchestration method is meant to achieve both high program
performance and short normalized tuning time. We will show that our CE algorithm
has the shortest tuning time, while achieving comparable or better performance than
other algorithms.
[Bar charts omitted. (a) Normalized tuning time for SPEC CPU2000 FP benchmarks; (b) normalized tuning time for SPEC CPU2000 INT benchmarks. Each chart compares BE (Batch Elimination), IE (Iterative Elimination), OSE (Optimization Space Exploration), SS (Statistical Selection) and CE (Combined Elimination). CE is the algorithm proposed in this paper; BE and IE are steps towards CE; OSE and SS are alternatives proposed in related work.]

Fig. 2.1. Normalized tuning time of five optimization orchestration algorithms for SPEC CPU2000 benchmarks on Pentium IV. Lower is better. CE has the shortest tuning time in all except a few cases. In all those cases, the extended tuning time leads to higher performance.
[Bar charts omitted. (a) Program performance for SPEC CPU2000 FP benchmarks; (b) program performance for SPEC CPU2000 INT benchmarks, each reported as the performance improvement percentage relative to O3 for BE, IE, OSE, SS and CE.]

Fig. 2.2. Program performance achieved by five optimization orchestration algorithms relative to the highest optimization level "O3" for SPEC CPU2000 benchmarks on Pentium IV. Higher is better. In all cases, CE performs the best or within 1% of the best.
2.3.3 Results
In this section, we compare our final optimization orchestration algorithm CE
with the four algorithms BE, IE, OSE and SS. Recall that BE and IE are steps towards CE; OSE and SS are algorithms proposed in related work. Figure 2.1 and Figure 2.2 show the results of these five orchestration algorithms on the Pentium
IV machine for the SPEC CPU2000 FP and INT benchmarks in terms of the two
metrics. They provide evidence for our claim that CE has the fastest tuning speed
while achieving program performance comparable to the best alternatives. We will
discuss the basic BE method first, then the other four algorithms. Tuning time will
be analyzed first, then program performance.
Tuning time
For the applications used in our experiments, the slowest of the measured algo-
rithms takes up to several days to orchestrate the large number of optimizations.
Figure 2.1(a) and Figure 2.1(b) show that our new algorithm, CE, is the fastest
among the four orchestration algorithms that consider interactions. The absolute
tuning time for CE is 2.19 hours, on average, for FP benchmarks and 3.66 hours for INT benchmarks on the 2.8 GHz Pentium IV machine. On the 400 MHz SPARC II machine, it is 9.92 hours for FP benchmarks and 12.31 hours for INT benchmarks. To factor out these machine differences, we compare the algorithms by normalized tuning time, shown in Figure 2.1(a) and Figure 2.1(b).
Although BE achieves the least program performance, its tuning speed is the
fastest, which is consistent with its complexity of O(n). BE can be viewed as a
lower bound on the tuning time for a feedback-directed orchestration algorithm
that does not have a priori knowledge of the optimizations. For such an algorithm,
each optimization must be tried at least once to find its performance effect.
OSE is of higher complexity and thus slower than IE and CE. However, SS turns
out to be the slowest method, even though its complexity is O(n²), less than OSE's
O(n³). The reason for the long tuning time of SS is the higher number of iterations
it takes to converge.
Among the four algorithms (excluding BE), CE has the fastest average tuning
speed. For ammp, wupwise, bzip2, gap, gcc, perlbmk and vortex, CE is not the
fastest. However, the faster algorithms achieve their speed at significant expense of
program performance.
Program performance
In both Figure 2.2(a) and Figure 2.2(b), BE almost always achieves the least
program performance among the five algorithms. As described in Section 2.2, BE
ignores the interaction of the optimizations. Therefore, it does not achieve good
performance when the interaction has a significant negative performance effect. In
the cases of sixtrack and parser, BE even significantly degrades the performance.
In Figure 2.2(a), for art, all the algorithms improve performance by about 60%
on Pentium. This is mainly due to eliminating the option of “strict-aliasing”, which
does alias analysis, removes false data dependences, and increases register pressure.
This option results in lots of spill code for art, causing substantial performance degra-
dation. However, on the SPARC machine, the orchestration algorithms do not have
the above behavior for art. "Strict-aliasing" does not cause performance degradation,
as the SPARC machine has more registers than the Pentium machine. In [43],
we have analyzed in detail the reasons for the negative performance effects of several
optimizations.
The average performance improvement of all other orchestration algorithms,
which consider the interactions of optimizations, is about twice as high as BE's.
Moreover, Figure 2.2(a) shows that these four algorithms perform essentially the
same for the FP benchmarks. On one hand, the regularity of FP programs con-
tributes to this result. On the other hand, the optimizations in GCC limit per-
formance tuning on FP benchmarks, because GCC options do not include advanced
dependence-based transformations, such as loop tiling. We expect that such transfor-
27
mations would be amenable to our tuning method and yield tangible improvement.
In Figure 2.2(b), performance similarity still holds in most of the INT benchmarks,
with a few exceptions. For gap, twolf and vortex, IE does not achieve as good a
performance as CE, though the performance gap is small. SS does not produce consistent
performance: for bzip2, SS does not achieve any performance improvement, and for bzip2, gzip,
and vortex, SS's performance is significantly inferior to CE's. CE and OSE always
achieve good program performance improvement.
The fact that none of the algorithms consistently outperforms the others reflects
the exponential complexity of the optimization orchestration problem. All five algorithms
use heuristics, which lead to sub-optimal results. Among these algorithms,
CE achieves consistent performance. Although CE does not achieve the best performance
for crafty, parser, twolf, and vpr, the gap is less than 1%.
The small performance differences between the measured algorithms indicate that
all methods properly deal with the primary interactions between optimization tech-
niques. However, there are differences in the ways the algorithms deal with secondary
interactions. These properties are consistent with those of a general optimization
problem, in which the main effects tend to be larger than two-factor interactions,
which in turn tend to be larger than three-factor interactions, and so on [44].
In Figure 2.2, we measured program performance under the train dataset. It
is important to evaluate how the algorithm performs under a different input. To this
end, we measured execution times of each benchmark using the ref dataset as input,
for both the O3 version and the optimal version found by CE. (Still, the train dataset
is the input for the tuning process.) On average, CE improves FP benchmarks by
11.7% (compared to 11.9% under train) relative to O3; INT benchmarks by 3.9%
(4.4% under train). This shows that CE works well when the input is different
from the tuning input. On the other hand, we do find a few benchmarks that
do not achieve the same performance under the ref dataset as under train. The
largest differences are 1.95% for gzip and 2.18% for vortex. If the training input
of the orchestration algorithm differs significantly from actual workloads, our offline
Table 2.3
Mean performance on SPARC II. CE achieves both fast tuning speed and high program performance on SPARC II as well.

Benchmark   Algorithm   Improvement over "O3"   Normalized Tuning Time
FP          BE          -4.1 %                   30.8
FP          IE           4.1 %                  105.4
FP          OSE          4.0 %                  142.0
FP          SS           3.7 %                  384.9
FP          CE           4.1 %                   63.4
INT         BE          -0.8 %                   36.2
INT         IE           3.6 %                   98.7
INT         OSE          3.4 %                  130.0
INT         SS           3.1 %                  317.0
INT         CE           3.9 %                   88.4
(profile-based) tuning approach may not reach the full tuning potential. In that case,
an online approach [25] could tune the program using the actual input.
Overall comparison of algorithms
CE achieves both fast tuning speed and high program performance. It does so
by combining the advantages of IE and BE: Like IE, it considers the interaction of
optimizations, leading to high program performance; like BE, it keeps tuning time
short when the interaction does not have a significant performance effect.
Similar observations hold on the SPARC II machine. Table 2.3 lists the mean
performance of each algorithm across the integer and floating point benchmarks,
respectively.
Figure 2.3 provides an overall comparison of the algorithms. The X-axis is average
program performance achieved by the algorithm; the Y-axis is average normalized
tuning time. The averages are taken across all benchmarks and machines. (The
corresponding figure for each individual benchmark and machine setting would be similar.) A good
algorithm achieves high program performance and short tuning time, represented by the
[Figure 2.3: scatter plot of the five algorithms. X-axis: average program performance improvement percentage relative to "-O3" (%), from 0 to 7. Y-axis: average normalized tuning time, from 0 to 350. Points: BE, IE, OSE, SS, CE.]

Fig. 2.3. Overall comparison of the orchestration algorithms. CE achieves both fast tuning speed and high program performance.
bottom-right corner of Figure 2.3. The figure shows that CE is the best algorithm.
The runner-up is IE, which we developed as a step towards CE.
2.4 Upper Bound Analysis
We have shown that CE achieves good performance improvement. This section
attempts to answer the question of how much better than CE an algorithm could
perform. To this end, we look for a performance upper bound, which we find by
an exhaustive search (ES) through all optimization combinations. As it would be
impossible to do exhaustive search with 38 optimizations, we pick a small set of six
optimizations. This section will show that the performance improvement by CE is
close to this upper bound.
The six optimizations that have the largest performance effects are picked to con-
duct upper bound analysis. The performance effect of an optimization is the total
negative relative performance improvement of this optimization on all the bench-
marks. Figure 2.4 shows the effects of all 38 optimizations in a sorted fashion for
[Figure 2.4: two sorted bar charts showing, for each of the 38 optimization options F1-F38, the sum of negative RIPs (%) over all benchmarks. (a) On the Pentium IV machine, ranging from 0 down to about -70%. (b) On the SPARC II machine, ranging from 0 down to about -25%.]

Fig. 2.4. Total negative effects of all the GCC 3.3.3 O3 optimization options
each architecture. Comparing Figure 2.4(a) and Figure 2.4(b), we see that the
effect of an optimization is different on different architectures. So, these six opti-
mizations are picked separately on the SPARC II and Pentium IV machines. They
are strict-aliasing, schedule-insns2, regmove, gcse, rerun-loop-opt and force-mem for
Pentium IV, and rename-registers, reorder-blocks, sched-interblock, schedule-insns,
gcse and if-conversion for SPARC II.
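The selection rule just described can be written as a short ranking routine. The sketch below assumes a hypothetical `rips` table mapping each option to its per-benchmark RIPs (in percent); it sums only the negative entries and returns the k most harmful options.

```python
# Rank options by the sum of their negative RIPs across all benchmarks
# (the metric plotted in Figure 2.4) and keep the k largest effects.
def pick_top_options(rips, k=6):
    effect = {opt: sum(min(r, 0.0) for r in per_bench.values())
              for opt, per_bench in rips.items()}
    return sorted(effect, key=effect.get)[:k]    # most negative sum first
```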
2.4.1 Results
In Figure 2.5, ES represents the performance upper bound. Comparing the first
two columns in Figure 2.5(a) and Figure 2.5(b), we find that, under the 6 optimiza-
tions, CE performs close to ES. In about half of the benchmarks, they both find the
[Figure 2.5: four bar charts comparing ES_6, CE_6, CE_38, and BE_6 for each benchmark. (a) Program performance for SPEC CPU2000 FP benchmarks. (b) Program performance for SPEC CPU2000 INT benchmarks. (c) Normalized tuning time for SPEC CPU2000 FP benchmarks. (d) Normalized tuning time for SPEC CPU2000 INT benchmarks.]

Fig. 2.5. Upper bound analysis on Pentium IV. ES_6: exhaustive search with 6 optimizations; CE_6: combined elimination with 6 optimizations; CE_38: combined elimination with 38 optimizations; BE_6: batch elimination with 6 optimizations. CE_6 achieves nearly the same performance as ES_6, in all cases. CE_38 performs better. (Exhaustive search with 38 optimizations would be infeasible.) BE_6 is much worse than CE_6. CE_6 is about 4 times faster than ES_6.
Table 2.4
Upper bound analysis under four different machine and benchmark settings

Machine      Benchmark   RIP by ES over "O3"   RIP by CE over "O3"
Pentium IV   FP          10.6 %                10.4 %
Pentium IV   INT          2.7 %                 2.6 %
SPARC II     FP           2.9 %                 2.7 %
SPARC II     INT          3.2 %                 3.0 %
same best version. Another important fact shown in Figure 2.5(c) and Figure 2.5(d)
is that CE is more than 4 times as fast as ES, even for this small set of optimizations.
For comparison, the figures also show the tuning speed of CE for 38 options. ES for
38 options would take millions of years.
Figure 2.5 provides evidence that the heuristic-based search algorithms can achieve
performance close to the upper bound. This confirms our analysis in Section 2.3.3.
The heuristics find the primary and secondary performance effects, which are the
individual performance of an optimization and the main interaction with other optimizations,
respectively. These arguments also hold for the SPARC II machine. Table 2.4
shows average program performance achieved under different machine and bench-
mark settings.
Comparing CE_6 with CE_38, the performance gap for FP benchmarks is negligible,
but not for INT benchmarks. This result is consistent with the finding of [43]
that INT programs are sensitive to a larger number of interactions between optimiza-
tion techniques than FP programs. These results suggest that a priori knowledge
of a small set of potentially interacting optimizations may help tuning numerical
programs. Exhaustive search within this small set can be feasible. However, this is
not the case for non-numerical applications.
In order to verify that the interaction between these six optimizations has a
significant performance effect, we apply BE as well. The result is shown as the last
column, BE_6, in Figure 2.5. From this figure, the performance of BE_6 is much
worse than that of CE_6, for example, in ammp, apsi, sixtrack, crafty, parser, and vpr.
2.5 The General Combined Elimination Algorithm
Our CE algorithm can be easily extended to handle "non-on-off" options, although
the previous experiments were done with "on-off" options. (All the GCC
O3 optimization options are of this "on-off" type.) The "non-on-off" options are
the ones with more than two values. One example is the "-unroll" option in the SUN
Forte compilers [52], which takes an argument indicating the degree of loop unrolling.
Therefore, a general optimization orchestration problem can be described as follows.
Given a set of optimization options {F1, F2, ..., Fn}, where Fi has Ki possible
values {Vi,j, j=1..Ki} (i = 1..n), find the combination that minimizes the program
execution time. Here, n is the number of options. (Moreover, one Fi can actually
contain multiple optimizations that have a high possibility of interaction; the possible
values of this Fi are all possible combinations of the values of the two optimizations.)
We name our algorithm for handling "non-on-off" options the General Combined
Elimination (GCE) algorithm. GCE has an iterative structure similar to CE, with a
few extensions. The initial baseline of GCE is the default optimization setting used
in the compiler. In each iteration of GCE, all non-default values of the remaining op-
tions in the search space S are evaluated. For each of these options, GCE records the
value causing the most negative RIP, which is computed according to Equation 2.3.
(A negative RIP means that the corresponding value of the option improves pro-
gram performance.) GCE tries to apply these recorded values of the options with
these negative RIPs one by one in a greedy fashion just like CE. When none of
the remaining options in S has a value improving the performance, GCE gives the
final optimization setting for the program. GCE has a complexity of O((Σ_{i=1..n} K_i) × n).
In most cases, Σ_{i=1..n} K_i = O(n), so roughly the complexity is still O(n²). (The
techniques developed in [45] could also be included in GCE to handle options with
a large number of possible values, for example, blocking factors.) The pseudo code
of GCE is as follows.
1. Let B be the baseline option setting. Let S be the set of optimizations forming
the optimization search space. Initialize B = { F1 = f1, F2 = f2, ..., Fn = fn | fi is the default value of option Fi } and S = {F1, F2, ..., Fn}.

2. Measure the RIPs of all the non-default values of the options in S relative to
the baseline B, that is, RIP_B(Fi = Vi,j) s.t. Fi ∈ S and Vi,j ≠ fi. The definition
of RIP is the same as in Equation 2.3.

3. Let X = {X1 = x1, X2 = x2, ..., Xl = xl} be the set of options with negative
RIPs, where xi has the most negative RIP among the possible values of option
Xi. X is sorted in increasing order, that is, the first element in X, X1, has
the most negative RIP. Remove X1 from S and set X1 in B to be x1. (B is
changed in this step.) For i from 2 to l,

∗ Measure RIP_B(Xi = xi).

∗ If it is negative, remove Xi from S and set Xi in B to be xi.

4. Repeat Step 2 and Step 3 until all options in S have non-negative RIPs. B
represents the final option setting.
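The four steps above can be condensed into the following Python sketch. It illustrates the control flow only, not the PEAK implementation: `measure` is a hypothetical function that compiles and times the program under a full option setting, and `values[f][0]` is taken to be the default value of option f.

```python
# Sketch of General Combined Elimination (GCE). Assumes measure(setting)
# returns the program's execution time under a complete option setting.
def gce(values, measure):
    def rip(baseline, f, v):                     # RIP relative to B (Equation 2.3)
        t_base = measure(baseline)
        trial = dict(baseline); trial[f] = v
        return (measure(trial) - t_base) / t_base * 100.0

    baseline = {f: vals[0] for f, vals in values.items()}   # Step 1: defaults
    search = set(values)
    while search:
        # Step 2: RIPs of all non-default values of the remaining options;
        # record, per option, the value with the most negative RIP.
        best = {}
        for f in search:
            cand = min(((rip(baseline, f, v), v)
                        for v in values[f] if v != baseline[f]), default=None)
            if cand is not None and cand[0] < 0:
                best[f] = cand
        if not best:                             # Step 4: converged
            break
        # Step 3: apply greedily, most negative RIP first, re-measuring the rest.
        order = sorted(best, key=lambda f: best[f][0])
        lead = order[0]
        baseline[lead] = best[lead][1]; search.remove(lead)
        for f in order[1:]:
            if rip(baseline, f, best[f][1]) < 0:
                baseline[f] = best[f][1]; search.remove(f)
    return baseline
```

With two values per option this degenerates to the original on/off CE search; the re-measurement inside Step 3 is what lets GCE catch a value that stops being profitable once an interacting option has changed.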
2.5.1 Experimental results on SUN Forte compilers
We conduct an experiment to evaluate the GCE algorithm using SUN Forte
compilers. Another goal of this experiment is to compare the performance achieved
by our GCE algorithm with the one by manual tuning. The manual tuning result is
based on the SPEC CPU2000 performance results [49]. In such a result, there is a
base option setting and multiple peak option settings. The base setting is common
to all benchmarks. We use this as the baseline performance. The peak settings
may be different for different benchmarks. For each benchmark, the peak setting
stands for a tuned optimization setting, which achieves better performance than the
base setting. So, we will compare GCE with the peak setting, which represents the
manually tuned result.
The experiments are conducted on a Sun Enterprise 450 SPARC II machine.
We evaluate the GCE algorithm using the optimization flags of the Forte Developer
Table 2.5
Optimization flags orchestrated by GCE

Flag Name                     Experimented Values                  Meaning
-xarch                        v8, v8plus, generic                  target architecture instruction set
-xO                           3, 4, 5                              optimization level
-xalias_level                 std, strong, basic                   alias level
-stackvar                     on/off                               using the stack to hold local variables
-d                            y, n                                 allowing dynamic libraries
-xrestrict                    %all, %none                          pointer-valued parameters as restricted pointers
-xdepend                      on/off                               data dependence test and loop restructuring
-xsafe=mem                    on/off                               assuming no memory protection violations
-Qoption iropt -crit          on/off                               optimization of critical control paths
-Qoption iropt -Abopt         on/off                               aggressive optimizations of all branches
-Qoption iropt -whole         on/off                               whole program optimizations
-Qoption iropt -Adata_access  on/off                               analysis of data access patterns
-Qoption iropt -Mt            500, 1000, 2000, 6000, default       max size of a routine body eligible for inlining
-Qoption iropt -Mr            6000, 12000, 24000, 40000, default   max code increase due to inlining per routine
-Qoption iropt -Mm            6000, 12000, 24000, default          max code increase due to inlining per module
-Qoption iropt -Ma            200, 400, 800, default               max level of recursive inlining
6 compilers. The baseline flag setting is “-fast -xcrossfile -xprofile”. GCE tunes
the optimization flags that are used in the SPEC peak settings. Table 2.5 lists these
flags. The Forte compilers have some flags that can be passed directly to the compiler
components. These flags are passed by the “-W” flag for the C compiler and the
“-Qoption” flag for the Fortran compiler or the C++ compiler. In this table, we list
the “-Qoption” only (“-W” flags are similar). 3
Figure 2.6 shows the performance results. In summary, GCE achieves equal or
better performance for each benchmark. On average, for floating point benchmarks,
GCE achieves 10.8% improvement relative to the base setting, compared to 5.6% by
the peak settings. For integer benchmarks, GCE achieves 8.1% compared to 4.1%
by the peak settings.
3If the inlining flags are not specified, the compiler uses their default values, which are not listed in the manual.
[Figure 2.6: two bar charts showing, for each benchmark, "SPEC peak performance / base performance" and "GCE performance / base performance" as performance improvement percentages (%). (a) SPEC CPU2000 FP benchmarks. (b) SPEC CPU2000 INT benchmarks.]

Fig. 2.6. Program performance achieved by the GCE algorithm vs. the performance of the manually tuned results (peak setting). Higher is better. In all cases, GCE achieves equal or better performance. On average, GCE nearly doubles the performance.
3. FAST AND ACCURATE RATING METHODS
3.1 Introduction
Chapter 2 presented a fast and effective algorithm, Combined Elimination, to
search for the best optimization combination for a program. Although the tuning
process takes the least time among the alternatives (57% of the closest one), the
tuning time is still in the order of several hours. The reason is that Chapter 2
evaluates an optimized version based on the total execution time spent in one run
of the program. This method is accurate but slow. This chapter aims at developing
fast and accurate methods to evaluate the performance of an optimized version. We
will tune the program at a finer granularity than the whole-program level. We will
use a partial execution of the program to evaluate one optimized version so as to
reduce the tuning time. We call these performance evaluation methods rating
methods. The rating methods will be accurate enough to guarantee tuned
program performance.
From this chapter on, we will tune the important code segments in the program
separately. These code segments are called Tuning Sections (TS) in this thesis.
Roughly, a tuning section is a procedure including all its callees. (We will show
how to select the tuning sections in Chapter 4.) To tune a program at this tuning-
section level, we still apply the CE algorithm developed in Chapter 2 to search for
the best optimization combination for each individual TS. Noticing that each TS is
invoked many (usually hundreds or thousands of) times, we develop rating methods
to evaluate the performance of one optimized version based on a small number of
invocations of the TS. (The number of invocations used to evaluate the performance
of an optimized version is called a window.) In this way, a partial execution of the
program can be used to rate the performance of an optimized version, which leads
to a tremendous speedup of the tuning process.
Besides speed, accuracy and flexibility are two other important issues for rat-
ing methods. A fast rating method means a short tuning time. However, if the
rating method is inaccurate, it may lead to limited performance improvement or
even degradation. If the method is not flexible, it may apply to a limited set of
applications, optimization techniques, compilers, or architectures only. Many of the
proposed methods are slow due to executing the whole program to rate one
version [35, 41, 43], are not accurate enough for optimization orchestration [53], or are only
applicable to specific code [33, 34]. Another example of a fast rating method is the
performance model in [36], which estimates the performance of an optimized version
based on the profile information about data cache misses, instruction cache misses
and branch mis-prediction. This approach is fast but inaccurate. [36] shows that its
performance improvement is only about half of that achieved by using the accurate
execution times.
Rating based on a number of invocations of the tuning sections leads to fast
tuning speed, but the workload may change from one invocation to another. Directly
averaging the execution times of a number of invocations is not an accurate method.
This chapter presents the rating methods that fairly compare the invocation times
of the optimized versions under different invocations. (The computed ratings will
be used as the feedback to the optimization orchestration algorithm developed in
Chapter 2.) These methods can be applied to general, regular and irregular, code
sections.
Similar to our idea of running part of the program to speed up performance
evaluation, SimPoint [54–56] and Simulation Sampling [57–59] simulate important
intervals or sampling points to speed up the simulation process. However, these
techniques can hardly be applied to our system, mainly due to the following reasons.
(1) Our system requires that the machine state should be “warmed-up” so that the
program could execute directly from the middle of the program without impacting
performance evaluation accuracy. However, it is very difficult to warm up a real ma-
chine to an accurate state, especially for the caches. (Code Isolator [53] tries to do so,
nevertheless, it is not accurate enough for our optimization orchestration.) SimPoint
and Simulation Sampling can warm up the machine state via check-pointing or fast-
forwarding, because the simulator has full control of the simulated machine. (2) Our
system evaluates the performance of many optimized versions, while SimPoint and
Simulation Sampling evaluate the performance of multiple simulations using one bi-
nary version of a program. If their techniques were applied to our system, it would
cause tremendous overhead to find the sampling points for each version and would
still be difficult to compare the performance of different versions, because different
optimized versions have different numbers of instructions and basic blocks. (3) Our
compiler system works at the source program level, while SimPoint and Simulation
Sampling work at the binary level.
The key ideas of our rating methods are as follows. Context-Based Rating (CBR)
identifies and compares invocations of a tuning section that have the same work-
load, in the course of the program run. Model-Based Rating (MBR) formulates the
relationship between different workloads, which it factors into the comparison. Re-
execution-Based Rating (RBR) directly re-executes a tuning section under the same
input for fair comparison. This chapter also presents automated compiler techniques
to analyze the source code for choosing the most appropriate rating method for each
tuning section. 1
The remainder of this chapter is organized as follows. Section 3.2 presents three
rating methods – CBR, MBR and RBR – along with the relevant compiler techniques.
Section 3.3 shows the use of these methods in the PEAK system, including static
program analysis and a dynamic solution to determining the window size. Section 3.4
evaluates the applicability and accuracy of the rating methods on a number of code
sections. (The complete evaluation will be done in Chapter 5, after the tuning
section selection algorithm is presented in Chapter 4.)
1The main work of this chapter has been published in [2].
3.2 Rating Methods
The rating methods are applied in an offline performance tuning scenario as
follows. Before tuning, the program is partitioned by our compiler 2 into a number
of code sections, called tuning sections (TS). The tuning system runs the program one
or several times under a training input, while dynamically generating and swapping
in/out new optimized versions for each TS. The performance of these versions is
compared using the proposed rating methods. The winning version will be used in
the final tuned program. (We will discuss the complete PEAK system in Chapter 5.)
The key issue in rating these versions is to achieve fair comparison. Our rating
methods achieve this goal by either identifying TS invocations that use the same
workload (CBR), finding mathematical relationships between different workloads
(MBR), or forcing re-execution of a TS under the same input (RBR).
3.2.1 Context Based Rating (CBR)
Context-based rating identifies the invocations of a tuning section under the same
workload in the course of program execution. The PEAK compiler finds the set of
context variables, which are the program variables that influence the execution time
of the tuning section. Examples are the variables that determine the conditions of
control regions, such as if or loop constructs. (These variables can be function
parameters, global variables, or static variables.) Thus, the context variables determine
the workload of a tuning section. We define the context of one TS invocation as the
set of values of all context variables. Therefore, each context represents one unique
workload.
2Our compiler, called the PEAK compiler, is a source-to-source compiler, which analyzes and instruments the program before tuning. We developed the PEAK compiler based on the Polaris [60] compiler for programs written in Fortran and the SUIF2 [61] compiler for C. The backend compiler is the one that generates the optimized executables. We focus on the GCC compiler as the backend in this thesis. During tuning, the backend compiler is invoked with different option settings to control its optimizations.
CBR rates one optimized version under a certain context by using the average
execution time of several invocations. (Typically, this number is in the tens.) The
best versions for different contexts may be different, in which case CBR could report
the context-specific winners. PEAK makes use of only the best version under the
most important context, which covers most (e.g., more than 80%) of the execution
time spent in the TS. This major context is determined by one profile run of the
program. (If a tuning section has no major context or the number of invocations of
the major context is too small, the MBR method described next is preferred.)
In summary, the rating of a version v, R(v), is computed according to Equation 3.1,
where x is the most time-consuming context of version v, T(i, x) is the
execution time of the ith invocation under context x, and w is the window size, the
number of invocations used to rate one version. Var(v) records the variance of the
measurement.

R(v) = Σ_{i=1..w} T(i, x) / w    (3.1)

Var(v) = Σ_{i=1..w} (T(i, x) − R(v))² / w    (3.2)
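Equations 3.1 and 3.2 are simply the mean and variance of the measured invocation times within the window, e.g.:

```python
# Eq. 3.1 and Eq. 3.2: rating and variance over a window of w invocation
# times, all taken under the same (major) context.
def cbr_rating(times):
    w = len(times)
    r = sum(times) / w                            # R(v), Eq. 3.1
    var = sum((t - r) ** 2 for t in times) / w    # Var(v), Eq. 3.2
    return r, var
```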
Figure 3.1 shows the compiler analysis to find the context variable set so as to
determine the applicability of CBR. The algorithm traverses each control statement
and recursively finds the related variables; that is, it finds all of the input variables
that may influence the values used in control statements. All of these variables are
considered to be context variables. If there exist one or more non-scalar context
variables, there is usually no major context. So, in this case, CBR is not applicable.
Similarly, if the context variable is floating point, CBR is not applicable. To reduce
the context match overhead during tuning, we eliminate the runtime constants from
the context variable list. The runtime constant variables always have the same value
during all the invocations. This is done using the same profile run that determines
the major context.
In Figure 3.1, we illustrate the algorithm using use-def chains to track the data
flow. Static Single Assignment (SSA) can also be used to do so. In our Fortran
//ContextSet: the set of context variables.
VariableSet ContextSet;
//Return value: applicability of CBR on TS
Boolean GetContextSet(TuningSection TS)
{
ContextSet = {};
Set the state of each statement as "undone";
For each control statement s in TS {
For each variable v used in s {
if( GetStmtContextSet(v, s) == false )
return false;
}
}
Remove the constant variables from ContextSet;
return true;
}
Boolean GetStmtContextSet(Variable v, Statement s)
{
StatementSet SSet = Find_UD_Chain(v, s);
Set s as "done";
For each statement m in SSet {
if( m is the entry statement ) {
//v is in Input(TS).
if( v is scalar && v is not floating point )
put v into ContextSet;
else
return false;
}
if( m is "done" ) {//avoid loop.
continue;
}
For each variable r used in m {
if( GetStmtContextSet(r, m) == false )
return false;
}
}
return true;
}
Fig. 3.1. Pseudo code of context variable analysis
implementation, we use Gated Single Assignment (GSA) [62]. The algorithm is very
similar to Figure 3.1.
3.2.2 Model Based Rating (MBR)
Model-based rating formulates mathematical relationships between different con-
texts of a tuning section and adjusts the measured execution time accordingly. In
this way, different contexts become comparable.
The execution time of a tuning section consists of the execution time spent in all
of its basic blocks:
T_TS = Σ_b (T_b × C_b)    (3.3)

T_TS is the execution time in one invocation of the whole tuning section; T_b is the
execution time in one entry to the basic block b; and C_b is the number of entries to
the basic block b in the TS invocation.
If the numbers of entries of two basic blocks, C_b1 and C_b2, are linearly dependent
on each other in every TS invocation through the whole run of the program (that
is, C_b1 = α × C_b2 + β, where α and β are constants), our compiler merges the items
corresponding to these basic blocks into one component. Hence, MBR uses the
following execution time estimation model.
T_TS = Σ_{i=1..n} (T_i × C_i)    (3.4)

T_TS consists of several components, each of which has a component count C_i and
a component time T_i. We assume that there is always a constant component T_n,
with C_n = 1 for all TS invocations. Furthermore, MBR makes a number of simplifications:
(1) If two branches in a conditional statement have the same workload,
the components representing the branches are merged. (2) If the workload in condi-
tional statements is small, they are treated as normal statements. For example, an
if-statement with a simple increment statement is not treated as a basic block, but
as an increment statement. (3) Components that exhibit constant behavior are put
into the constant component.
The PEAK compiler finds the expression determining the number of entries to
each basic block b, C_b. If the TS contains an irregular code structure, for example a
while loop, MBR is not applicable. (In this case, the next rating method, RBR, will
be applied.) After a profile run, PEAK determines the relationships among the Cb's
and merges them into independent components.
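The merging check, whether two entry counts are related affinely across invocations, can be sketched numerically as follows. This is illustrative code: the function name and the numeric approach are assumptions, since PEAK derives the Cb expressions symbolically rather than from sampled counts.

```python
def affine_relation(c1, c2, tol=1e-9):
    """Test whether c1[j] = alpha*c2[j] + beta holds for constants alpha, beta
    across all profiled invocations; returns (alpha, beta) or None. A numeric
    sketch: PEAK derives the Cb expressions symbolically instead."""
    pairs = list(zip(c2, c1))
    x0, y0 = pairs[0]
    other = next((p for p in pairs if p[0] != x0), None)
    if other is None:                 # c2 is constant: c1 must be constant too
        return (0.0, y0) if all(abs(y - y0) <= tol for _, y in pairs) else None
    # Fit alpha and beta from two distinct points, then verify all invocations.
    alpha = (other[1] - y0) / (other[0] - x0)
    beta = y0 - alpha * x0
    if all(abs(y - (alpha * x + beta)) <= tol for x, y in pairs):
        return (alpha, beta)
    return None
```

When the relation holds, the two basic blocks' contributions can be folded into a single component with one count.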
During tuning, PEAK collects the execution times of a number of invocations
to the optimized version until the rating error is small, which we will discuss in
Section 3.3. It gathers the TS-invocation-time vector, Y , and the component-count
matrix, C, in which Y (j) is the TTS in the jth invocation and C(i, j) is the ith
component count Ci in the jth invocation. Solving the following linear regression
problem yields the component-time vector T .
Y = T × C (3.5)
Here, T = (T1, T2, ..., Tn) represents the component-time vector of one particular
version. The version with smaller Ti’s performs better. Hence, MBR may compare
different versions using the rating, R(v), computed based on their T vectors according
to the following equation.
R(v) = Σ(i=1..n) (Ti × Cavgi)    (3.6)

Var(v) = Σ(j=1..w) (Yj − Σ(i=1..n) (Ti × Ci,j))² / w    (3.7)
Cavgi is the average count of component i during one whole run of the program.
These data are obtained from the profile run. Here, w is the number of invocations
used to compute the rating. The variance of the rating, Var(v), is the residual error
of this linear regression.
Figure 3.2 (a) shows an example code with two components. The first component
is the loop body with a variable number, N, of entries during one invocation of the
tuning section. The second component is the tail code with one entry per invocation.
Figure 3.2 (b) shows the Y and C gathered by the performance rating system during
tuning. Each column of Y and C corresponds to the data in one invocation of the
DO I = 1, N
   ...loop body...
ENDDO
...tail code...
(a) A tuning section with two components
Y = [ 11015  5508  6626  6044  8793 ]

C = [ 100  50  60  55  80 ]
    [   1   1   1   1   1 ]

(b) TS-invocation-time vector Y and component-count matrix C collected during tuning

T = [ 110.05  3.75 ]

(c) Component-time vector T by linear regression
Fig. 3.2. A simple example of MBR
tuning section. Linear regression generates the component-time vector T , shown in
Figure 3.2 (c). Given Cavg1 = 75, the rating of this version is 110.05 × 75 + 3.75 =
8257.5.
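The fit in Figure 3.2 can be reproduced with ordinary least squares. The sketch below (plain Python; function and variable names are illustrative) fits the two-component model TTS ≈ T1·C1 + T2 to the figure's data and recovers the T vector and rating reported above.

```python
def fit_components(Y, counts):
    """Least-squares fit of Y[j] ~ t1*counts[j] + t2: t1 is the per-entry
    time of the varying component, t2 the constant component (Cn = 1)."""
    w = len(Y)
    mc = sum(counts) / w
    my = sum(Y) / w
    sxy = sum((c - mc) * (y - my) for c, y in zip(counts, Y))
    sxx = sum((c - mc) ** 2 for c in counts)
    t1 = sxy / sxx             # slope: time per loop-body entry
    t2 = my - t1 * mc          # intercept: constant (tail) component
    return t1, t2

# Data from Figure 3.2 (b): invocation times and loop-body entry counts
Y = [11015, 5508, 6626, 6044, 8793]
counts = [100, 50, 60, 55, 80]
t1, t2 = fit_components(Y, counts)     # -> roughly (110.05, 3.75)
rating = t1 * 75 + t2                  # Cavg1 = 75, as in the text
```

For more than two components, the same idea generalizes to solving Y = T × C in the least-squares sense.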
If there are many components in the execution time model, a large number of
invocations must be measured in order to perform an accurate linear regression.
MBR would lead to a long tuning time in this case and so is not applied. Instead,
RBR, described next, will be applied.
3.2.3 Re-execution Based Rating (RBR)
Re-execution-based rating forces a roll-back and re-execution of a tuning section
under the same input. It is applicable to all our tuning sections; however, it also
generally has the largest overhead. We first present a basic re-execution method,
followed by a method that reduces inaccuracies caused by cache effects.
Step 1. Save Input(TS)
Step 2. Time Version 1 (the current best version)
Step 3. Restore Input(TS)
Step 4. Time Version 2 (the experimental version)
Step 5. Return the two execution times
Fig. 3.3. Basic Re-execution-based rating method (RBR)
Basic RBR method
Figure 3.3 shows the basic idea of RBR. Before each invocation, the input data
to the TS is saved, then Version 1 is timed, the input is restored, and Version 2
is executed. These two execution times can be compared directly to decide which
version is better, since both versions are executed with the same input and, hence,
the same workload.
RBR directly generates a relative performance rating based on the execution times
of the two versions, which are executed during one TS invocation. Suppose that the
execution times of these two versions are Tv1 and Tv2. Then, the performance rating
of Version 2 relative to Version 1 is Rv2/v1.
Rv2/v1 = Tv1/Tv2 (3.8)
If Rv2/v1 is larger than 1, Version 2 performs better than Version 1. Otherwise,
Version 2 performs worse. For multiple versions, we compare their performance
relative to the same base version. For example, if Rv2/v1 is less than Rv3/v1, Version 3
performs better than Version 2. In our tuning system, we use the average of Rvx/vb’s
across a number of TS invocations as the rating of Version vx relative to the base
version vb. The rating of v is computed based on Equation 3.9, where w is the
number of invocations. Similar to CBR and MBR, we compute the rating variance
Var(v).
R(v) = Σ(i=1..w) Rv/vb(i) / w    (3.9)

Var(v) = Σ(i=1..w) (Rv/vb(i) − R(v))² / w    (3.10)
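Equations 3.8 through 3.10 amount to a few lines of arithmetic over the measured invocation times; a sketch with illustrative names:

```python
def rbr_rating(times_v, times_vb):
    """Rating of version v relative to base vb over a window of w invocations.
    Per Eq. 3.8 the per-invocation rating is R_v/vb(i) = T_vb(i) / T_v(i);
    Eqs. 3.9 and 3.10 take their mean and variance. Names are illustrative."""
    ratios = [tb / tv for tv, tb in zip(times_v, times_vb)]
    w = len(ratios)
    r = sum(ratios) / w                          # Eq. 3.9: the rating R(v)
    var = sum((x - r) ** 2 for x in ratios) / w  # Eq. 3.10: its variance
    return r, var
```

A ratio above 1 at some invocation means version v ran faster than the base at that invocation.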
The input set, Input(TS), is obtained through liveness analysis. Input(TS) is
equal to LiveIn(b1), the live-in set of the entry block in TS. (An eligible TS should
not call library functions with side effects, such as malloc, free, and I/O operations.
Right now, we exclude these function calls from the tuning section. As future work,
these functions could be re-written, so that they can be rolled back.)
Improved RBR method
Even under the same input, two invocations of a TS may result in different
execution times. The first invocation preconditions the cache, affecting the execution
time of the second invocation. To address this problem, the improved RBR method
(1) inserts a preconditional version before Version 1 to bring the used data into cache,
and (2) swaps Version 1 and Version 2 at each invocation, so that their order does
not bias the result.
In addition, the improved RBR method saves and restores only the input variables
that are modified in the invocation, the set of Modified Input(TS). (Def(TS) is
the def set of the TS.)
Modified Input(TS) = Input(TS) ∩ Def(TS) (3.11)
Compile time analysis may not be able to determine the exact Modified Input(TS)
set. Before write references to irregular arrays and pointers that are in this set,
inspector code is inserted into the preconditional version to record both the ad-
dresses and the values. The recorded data will be used to restore the input before
re-execution. (This technique is mostly used for C programs instead of Fortran
programs.)
Figure 3.4 shows the improved RBR. This method incurs three types of overhead:
(1) save and restore of the Modified Input(TS); (2) execution of the preconditional
version; and (3) execution of the second code version. The overhead of the save,
restore and precondition code can be reduced through a number of compiler opti-
mizations. For example, the save and restore overhead can be reduced by accurately
RBR(TuningSection TS):
1. Swap Version 1 and Version 2
2. Save the Modified Input(TS)
3. Run the preconditional version
4. Restore the Modified Input(TS)
5. Time Version 1
6. Restore the Modified Input(TS)
7. Time Version 2
8. Return the two execution times
Fig. 3.4. Improved Re-execution-based rating method
analyzing the Modified Input(TS) set. This can be achieved using symbolic range
analysis [63] for regular data accesses. Other optimizations include the combination
of a number of experimental runs into a batch, and the elimination of instructions
from the preconditional version that do not affect cache.
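The sequence of Figure 3.4 can be sketched as a small driver. Everything here is an illustrative stand-in: PEAK generates this instrumentation inside the compiled program rather than invoking Python callables.

```python
import time

def improved_rbr(ts_versions, precond, save_input, restore_input):
    """One invocation of the improved RBR method (Figure 3.4), as a sketch.
    ts_versions holds [current_best, experimental]; save_input/restore_input
    handle the Modified_Input(TS) set; precond warms the cache. All of these
    callables are illustrative stand-ins for PEAK's generated code."""
    ts_versions.reverse()                  # step 1: swap the two versions
    saved = save_input()                   # step 2: save Modified_Input(TS)
    precond()                              # step 3: run the preconditional version
    times = []
    for version in ts_versions:            # steps 4-7: restore input, time version
        restore_input(saved)
        t0 = time.perf_counter()
        version()
        times.append(time.perf_counter() - t0)
    return times                           # step 8: the two execution times
```

Swapping the versions on every invocation averages out any residual ordering bias that the preconditional run does not remove.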
3.3 The Use of Rating Methods in PEAK
We have presented three rating methods: CBR, MBR and RBR. Context-based
rating (CBR) has the least overhead but is not applicable to code without a major
context. Model-based rating (MBR) works for code without a major context, but
is not applicable to irregular programs. Re-execution-based rating (RBR) can be
applied to almost all programs; however, the overhead is the highest among the
three. Generally, the applicability of these three rating approaches increases in the
order of CBR, MBR and RBR; so does the overhead.
Before tuning, our PEAK compiler divides the target program into several tun-
ing sections. The original source program is analyzed according to the techniques
presented in the last section. After one profile run, the PEAK compiler finds the major
context, if it exists, for CBR, and the execution time model for MBR. From the
static analysis and profile information, the PEAK compiler decides which applicable
rating method should be used for each tuning section, in the priority order of CBR,
MBR and RBR. Then, the PEAK compiler inserts three kinds of instrumentation
code into the source program to construct a tuning driver: (1) code to activate per-
formance tuning; (2) code to measure the execution times and to trigger the rating
methods; (3) code to facilitate the rating methods, for example, context match code
for CBR and the preconditional version for RBR.
During tuning, the tuning driver generates the rating, R(v), and the rating vari-
ance, Var(v), across a number of TS invocations, which is called a window. The
tuning driver compares the R(v) of different versions to determine which version is
best. In summary, R(v) and Var(v) are computed as follows.
• CBR: Suppose that T (i, x) is the execution time of the ith invocation under
context x. R(v) and Var(v) under context x are the mean and the variance of
T (i, x), i = 1...w, where w is the window size. They are computed according
to Equations 3.1 and 3.2.
• MBR: R(v) is the execution time estimated from the execution time model, and
Var(v) is the residual error of the linear regression. They are computed ac-
cording to Equations 3.6 and 3.7.
• RBR: Suppose that Rv/vb(i) is the relative performance of version v over base
version vb at the ith invocation. R(v) and Var(v) are the mean and the
variance of Rv/vb(i), i = 1...w, according to Equations 3.9 and 3.10.
To improve rating accuracy, the tuning system applies two optimizations. (1) The
tuning system identifies and eliminates measurement outliers, which are far away
from the average. Such data may result from system perturbations, such as inter-
rupts. (2) The tuning system uses a dynamic approach to determining the window
size. It continually executes and rates a version until the rating variance Var(v)
falls below a threshold. This optimization is applied based on the observation that
V ar(v) decreases with increasing size of the window.
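The two accuracy optimizations, outlier elimination and dynamic window sizing, can be sketched together as follows. The thresholds, bounds, and the k-sigma outlier test here are illustrative assumptions, not PEAK's exact policy.

```python
def rate_version(measure_once, var_threshold=1e-4, min_w=10, max_w=160, k=3.0):
    """Dynamic-window rating sketch: keep collecting rating samples until
    the variance drops below var_threshold, discarding samples that lie
    far from the mean (system perturbations such as interrupts)."""
    samples = []
    for _ in range(2 * max_w):             # bound the total number of measurements
        samples.append(measure_once())
        if len(samples) < min_w:
            continue
        w = len(samples)
        mean = sum(samples) / w
        var = sum((x - mean) ** 2 for x in samples) / w
        # optimization (1): drop outliers beyond k standard deviations
        kept = [x for x in samples if (x - mean) ** 2 <= (k * k) * var]
        if len(kept) < len(samples):
            samples = kept                 # outliers removed; keep measuring
            continue
        # optimization (2): stop once the rating variance is small enough
        if var <= var_threshold or w >= max_w:
            break
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var
```

With well-behaved measurements the loop stops at the minimum window; noisy tuning sections automatically get a larger window.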
3.4 Evaluation on Rating Accuracy
This section evaluates the applicability and accuracy of the three rating methods.
Rating accuracy is represented by the mean and standard deviation of the ratings.
Our experimental system uniformly samples the ratings throughout the execution
under a training input. In this way, it gathers a vector of ratings, [R1, R2, ..., Rn],
where Ri is the R(v) computed at sampling time i, as described in Section 3.3. (Each
rating Ri is based on w invocations of the TS. The experimental version is optimized
under the default GCC -O3 setting, the same as the base version.) So, we can assume
that the ideal rating for CBR and MBR is the average of the Ri, denoted R. The
ideal rating for RBR is 1, since the experimental version is the same as the base version.
R = Σ(i=1..n) Ri / n    (3.12)
We compute the rating error, Xi, at sampling time i.
Xi = Ri/R − 1  for CBR and MBR;   Xi = Ri − 1  for RBR    (3.13)
Table 3.1 shows the statistical characteristics of the rating errors: the Mean, μ, and
the Standard Deviation, σ, which are the measures of rating accuracy.
μ = Σ(i=1..n) Xi / n    (3.14)

σ = √( Σ(i=1..n) (Xi − μ)² / (n − 1) )    (3.15)
High rating accuracy requires that the Mean, μ, be close to zero and that the Standard
Deviation, σ, be small. We also show how these two metrics change with the window
size in Table 3.1.
Table 3.1 shows the most important tuning sections for the selected benchmarks.
The upper half lists the floating point benchmarks; the lower half lists the integer
benchmarks. Integer code exhibits a large number of conditional statements. Be-
cause of this irregularity, the PEAK compiler applies the re-execution-based method
(RBR) to all the integer benchmarks. The floating point benchmarks are more reg-
Table 3.1
Rating accuracy for selected tuning sections

The columns from left to right show the benchmark name, the tuning section name,
the applicable rating approaches, the number of invocations of the tuning section
during one run of the benchmark, and the rating accuracy under different window
sizes. The numbers in the accuracy columns are multiplied by 100 for readability.
(For CBR, multiple rows are used for each tuning section, if there are multiple
contexts.)

Benchmark  Tuning            Rating    #invo-   Rating Accuracy: Mean (Standard Deviation) × 100
Name       Section           Approach  cations  w=10         w=20         w=40         w=80         w=160
applu      blts              CBR       250      0(0.71)      0(0.65)      0(0.57)      0(0.49)      0(0.18)
apsi       radb4(Context1)   CBR       1.37M    0(2.2)       0(2.6)       0(3.0)       0(2.7)       0(1.4)
           radb4(Context2)   CBR                0(0.7)       0(0.7)       0(0.7)       0(0.7)       0(0.5)
           radb4(Context3)   CBR                0(0.5)       0(0.4)       0(0.3)       0(0.3)       0(0.2)
art        match             RBR       250      -0.06(0.28)  -0.07(0.17)  -0.08(0.11)  -0.1(0.07)   -0.09(0.04)
mgrid      resid             MBR       2410     0(1.0)       0(0.82)      0(0.76)      0(0.63)      0(0.48)
equake     smvp              CBR       2709     0(2.7)       0(2.5)       0(2.4)       0(2.1)       0(1.6)
mesa       sample_1d_linear  RBR       193M     -0.05(1.3)   0.07(1.0)    0.03(0.78)   0.07(0.57)   0.02(0.36)
swim       calc3             CBR       198      0(0.33)      0(0.29)      0(0.19)      0(0.06)      0(0.01)
wupwise    zgemm(Context1)   CBR       22.5M    0(1.3)       0(1.1)       0(1.1)       0(0.94)      0(0.86)
           zgemm(Context2)   CBR                0(1.5)       0(1.6)       0(1.6)       0(1.7)       0(1.5)
bzip2      fullGtU           RBR       24.2M    0.95(2.6)    0.5(1.9)     0.27(1.3)    0.09(1.0)    0.07(0.7)
crafty     Attacked          RBR       12.3M    -0.91(2.3)   -0.43(1.7)   -0.25(1.5)   -0.33(1.2)   -0.16(0.8)
gzip       longest_match     RBR       82.6M    -1.0(2.7)    -0.14(1.2)   -0.08(1.1)   -0.1(0.9)    -0.05(0.7)
mcf        primal_bea_mpp    RBR       105K     -0.23(0.92)  -0.18(0.71)  -0.16(0.48)  -0.09(0.36)  -0.11(0.31)
twolf      new_dbox_a        RBR       3.19M    -0.56(1.9)   -0.45(1.3)   -0.36(1.0)   -0.23(0.58)  -0.13(0.37)
vortex     ChkGetChunk       RBR       80.4M    -0.12(3.0)   0.26(1.6)    0.18(1.2)    -0.16(0.97)  -0.11(0.76)
ular. The context-based rating (CBR) and the model-based rating (MBR) methods
are applicable to them.
The last five columns in Table 3.1 show the Mean and the Standard Deviation under
different window sizes. Generally, both metrics decrease with increasing window size.
RBR achieves a very small mean (< 0.002) and a small standard deviation (< 0.016)
with a reasonable window size for all cases. Equake has a relatively high variation,
which we attribute to its irregular memory access behavior, resulting from sparse
matrix operations. We conclude that our rating methods are accurate. Small tuning
sections exhibit more measurement variation but also tend to have higher numbers
of invocations. In these cases, accuracy is achieved through larger window sizes.
The fourth column in Table 3.1 shows the number of invocations to the tuning
section under the training dataset. For some benchmarks, the number of invocations
exceeds one million, while for others it is several hundred. In all benchmarks, the
system may rate multiple versions during one run of the program. The total number
of invocations needed in one tuning is roughly window size × number of versions.
Some benchmarks fit in one run; others require multiple runs. So, PEAK can reduce
the tuning time by a significant amount, which we will show in the following chapters.
4. TUNING SECTION SELECTION
4.1 Introduction
In previous chapters, we have shown that optimization orchestration improves
program performance, and that rating methods based on a partial execution of the
tuning sections can speed up the tuning process. This chapter deals with the problem
of how to select the important code sections in a program as the tuning sections, in
order to achieve fast tuning speed and high tuned program performance.
Basically, tuning sections need to meet the following requirements to achieve the
goal of improving tuning time and program performance.
1. A tuning section should be invoked a large number of times, for example, more
than 100 times, in one run of the program. A large number of invocations
usually means fast tuning. For tuning section TSi, let the average number of
invocations used to rate one optimized version be N1(TSi), the total number of
invocations to the tuning section be Nt(TSi), the number of optimized versions
rated in one run of the program be Nv(TSi).
Nv(TSi) = Nt(TSi)/N1(TSi) (4.1)
So, when the number of invocations Nt(TSi) is large, we may rate a large
number, Nv(TSi), of versions in each run of the program. 1 This means fast
tuning. If there are multiple tuning sections, the tuning time is bound by the
slowest one. Denote the smallest number of invocations to the tuning sections
as Nmin.
Nmin = min_i (Nt(TSi))    (4.2)
So, the first requirement for tuning section selection is to have a large Nmin.
1 Although N1(TSi) may not be the same for different tuning sections, generally, a large Nt(TSi) still means a large Nv(TSi).
2. Tuning sections should cover as large a part of the program as possible. The
coverage of the tuning sections is computed based on the execution times.
Denote the time spent in tuning section TSi as Tt(TSi), which includes the
time spent in all the functions/subroutines invoked within this tuning section,
and the total execution time of the program as Ttotal.
Coverage = (Σi Tt(TSi) / Ttotal) × 100%    (4.3)
A large coverage means that a big part of the program is tuned via optimization
orchestration, so we can achieve good program performance.
3. A tuning section should be large enough that the average execution time spent
in one invocation is sufficiently long, for example, greater than 100μsec. The average
execution time, Tavg(TSi), is computed based on the number of invocations,
Nt(TSi), and the execution time spent in TSi, Tt(TSi).
Tavg(TSi) = Tt(TSi)/Nt(TSi) (4.4)
We do not choose tiny tuning sections, because they usually cause a low timing
accuracy, even though we use a high-resolution timer. The low timing accuracy
results in a low rating accuracy and a large number of invocations per version,
N1(TSi).
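All three requirements are stated in terms of a few profile-derived quantities. Given per-tuning-section invocation counts and times, Equations 4.2 through 4.4 reduce to the following sketch (the profile's dictionary shape is an assumption for illustration):

```python
def ts_metrics(profile, t_total):
    """Metrics behind the three requirements. `profile` maps a candidate
    tuning section's entry function to (Nt, Tt): its invocation count and
    the time spent in it (illustrative shape, not PEAK's actual format).
    Returns per-TS (Nt, Tavg), plus Nmin (Eq. 4.2) and Coverage (Eq. 4.3)."""
    per_ts = {name: (nt, tt / nt) for name, (nt, tt) in profile.items()}  # Eq. 4.4
    n_min = min(nt for nt, _ in per_ts.values())                          # Eq. 4.2
    coverage = sum(tt for _, tt in profile.values()) / t_total            # Eq. 4.3
    return per_ts, n_min, coverage
```

The selection algorithm then wants a large n_min, a coverage near 1, and every per-TS average time above the timing-accuracy bound.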
Applying optimization orchestration to tuning sections separately may achieve
higher program performance than tuning the program as a whole, because different
tuning sections may favor different optimizations. It would be desirable if the code
sections that favor different optimizations could be separated into different tuning
sections. However, this is not practical, as we do not know what optimizations are
beneficial to a code section before tuning. Remember, our approach to performance
tuning is to search for the best optimization combinations for tuning sections.
This chapter presents a tuning section selection algorithm, which meets all the
three aforementioned requirements. Section 4.2 shows the call graph annotated with
execution time profiles used for tuning section selection. Based on the call graph,
the problem of tuning section selection is formally defined in Section 4.3. Section 4.4
presents our tuning section selection algorithm. In this algorithm, nodes for recursive
functions are merged to construct a simplified call graph, which is a directed acyclic
graph. A simple algorithm is designed to select the tuning sections by maximizing the
program coverage under a given constraint for Nmin, working on the simplified call
graph. The final algorithm iteratively calls the previous simple algorithm to trade
off the coverage and Nmin. In Section 4.5, the results of tuning section selection are
discussed.
4.2 Profile Data for Selecting Tuning Sections
From the previous section, a tuning section is selected based on its number of
invocations and its execution time. These data are collected from a profile pass.
In our implementation, we use the call graph profile generated by gprof [64]. This
call graph profile shows how much time is spent in each function and its children,
how many times each function is called and how many times the function calls its
children, during the profile run.
Our tuning section selection algorithm reads the output of gprof and generates a
call graph G = (V,E). 2 This call graph is a directed graph. It has one source (root)
node, δ, whose in-degree is 0, and a set, Γ, of sink (leaf) nodes whose out-degrees are
0. Each node v ∈ V identifies a function. δ identifies the function main. The nodes
in Γ identify the functions that do not call any other function. Each edge e ∈ E
identifies a function call. The associated profile information is as follows.
v = {fn} (4.5)
e = {s, t, n, tm} (4.6)
2 Here, call graph G contains the dynamic calls made during the profile run, not the static calls appearing in the program code. So, if a function call that appears in the code was not executed during the profile run, G does not include this call. However, a static call graph generated by a compiler should include this call. For the purpose of tuning section selection, the dynamic call graph is good enough. After tuning sections are selected, our compiler tools analyze and transform the program; during this process, a static call graph is used. Our tuning section selection algorithm can be applied to the static call graph as well, if profile information is assigned to its nodes and edges.
fn(v) is the function name of the node. s(e) identifies the caller node; t(e) identifies
the callee node; n(e) is the number of invocations to t(e) made by s(e); tm(e) is the
time spent in t(e) and its callees, when t(e) is called from s(e). The next section
will give an example of this call graph in Figure 4.1 and a formal description of the
tuning section selection problem.
Ideally, we could cut the program at any point, for example, at the beginning
of a basic block, to create a tuning section. In practice, we choose tuning sections at
the procedure level, since we use the call graph profile generated by gprof. If we had
accurate basic block profiles, we could select the tuning sections at the basic-block
level. The tuning section selection algorithm would be similar to the one presented
in the next sections. The difference would be that a larger graph would be used,
with basic blocks instead of functions/subroutines as the nodes.
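A minimal container for this annotated call graph might look as follows. The class and method names are illustrative, not PEAK's; only the edge annotations (s, t, n, tm) follow the definitions above.

```python
from collections import defaultdict

class CallGraph:
    """Directed call graph with gprof-style annotations: each edge carries
    n(e), the calls from s(e) to t(e), and tm(e), the time in t(e) and its
    callees on those calls. A sketch of the structure, not PEAK's code."""
    def __init__(self):
        self.edges = []                     # (s, t, n, tm) tuples
        self.succ = defaultdict(list)       # edges leaving each node
        self.pred = defaultdict(list)       # edges entering each node

    def add_call(self, s, t, n, tm):
        e = (s, t, n, tm)
        self.edges.append(e)
        self.succ[s].append(e)
        self.pred[t].append(e)

    def root(self):
        """The unique source node with in-degree 0 (the function main)."""
        nodes = set(self.succ) | set(self.pred)
        return next(v for v in nodes if not self.pred[v])

    def leaves(self):
        """The sink nodes with out-degree 0 (functions calling nothing)."""
        nodes = set(self.succ) | set(self.pred)
        return {v for v in nodes if not self.succ[v]}
```

The selection algorithms in the next sections only need this adjacency information plus the n and tm annotations.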
4.3 A Formal Description of the Tuning Section Selection Problem
Tuning section selection aims at partitioning the program into several compo-
nents, which meet the requirements in Section 4.1. Basically, a tuning section starts
with an entry function, including all the subroutines called directly or indirectly by
the entry function. If one subroutine is called by two different tuning sections, it is
replicated into these two tuning sections. In the representation of the call graph in
Section 4.2, tuning section selection tries to find a set of single entry-node regions
(subgraphs) in G; each region is a tuning section. The entry function of a region
identifies that region. One region can overlap with other regions, although it would
be better if no regions overlap.
The problem of tuning section selection can be described, in a formal way, as an
optimal edge cut problem. Given call graph G = (V, E), find an edge cut (Θ, Ω)
so as to maximize the invocation numbers and the coverage of the tuning sections.
Here, Θ and Ω are a partition of the node set V , such that Θ contains the source
node δ, and Ω contains the set of sink nodes in Γ. This edge cut (Θ, Ω) is a set
of edges, each of which leaves Θ and enters Ω. This edge cut determines the set
[Figure 4.1: a call graph with nodes a, b, c, d, e, f; each edge is labeled with its
number of invocations and, in parentheses, its execution time.]

Fig. 4.1. An example of tuning section selection. The graph is a
call graph with node a as the main function. The weights on an
edge are the number of invocations and the execution time in the
parentheses. The optimal edge cut is (Θ = {a, c}, Ω = {b, d, e, f}),
shown by the dashed curve. Edges (a, b) and (c, f) are chosen as the
S set. Edge (c, e) in the cut (Θ, Ω) is not included in S, because
its average execution time, 1/20000, is less than Tlb = 1e−4. There
are two tuning sections, led by node b and node f : T = {b, f}. The
numbers of invocations to b and f are 1000 and 200 respectively, so
Nmin = 200. The coverage of this optimal tuning section selection is
(80+18)/100 = 0.98, where the total execution time, Ttotal, is 100.
of tuning sections in two steps. (1) Find all the edges in this cut whose average
execution times are greater than Tlb, the lower bound on the average execution time.
i.e., for each edge e ∈ (Θ, Ω), put e in set S, if tm(e)/n(e) ≥ Tlb. This step is done
to meet Requirement 3 in Section 4.1. (2) The edges in set S point to the selected
tuning sections. i.e., make the entry-node set T = {v|v = t(ei), ei ∈ S}. Each node
v in set T identifies a tuning section. (Tuning sections are the subgraphs led by the
entry function v.) Figure 4.1 gives an example.
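The two steps that turn an edge cut into the sets S and T can be written out directly. The sketch below replays the Figure 4.1 example using the three cut edges discussed in the caption; the tuple encoding and names are illustrative.

```python
def select_from_cut(cut_edges, t_lb):
    """Steps (1) and (2): keep cut edges with tm(e)/n(e) >= t_lb (set S),
    then take their callee nodes as tuning-section entries (set T).
    Edges are (s, t, n, tm) tuples; a sketch, not PEAK's implementation."""
    S = [e for e in cut_edges if e[3] / e[2] >= t_lb]
    T = {e[1] for e in S}
    return S, T

# The cut of Figure 4.1: edges leaving Theta = {a, c}, with Tlb = 1e-4
cut = [('a', 'b', 1000, 80), ('c', 'f', 200, 18), ('c', 'e', 20000, 1)]
S, T = select_from_cut(cut, 1e-4)
n_min = min(sum(n for _, t, n, _ in S if t == v) for v in T)   # Eq. 4.8
coverage = sum(tm for *_, tm in S) / 100.0                     # Ttotal = 100
```

Edge (c, e) is filtered out because its average execution time, 1/20000, falls below the bound, matching the figure.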
The tuning section selection algorithm maximizes the invocation numbers and the
coverage of the tuning sections, which are computed from the aforementioned S
and T as follows.
1. The number of invocations to the tuning section v is denoted as Nt(v).
Nt(v) = Σ(e∈S, t(e)=v) n(e)    (4.7)
One goal of the tuning section selection algorithm is to maximize the smallest
Nt(v), denoted as Nmin. (Requirement 1 in Section 4.1.)
Nmin = min_(v∈T) (Nt(v))    (4.8)
2. The other goal of the tuning section selection algorithm is to maximize the
execution coverage. (Requirement 2 in Section 4.1.)
Coverage = Σ(e∈S) tm(e) / Ttotal    (4.9)
Ttotal is the total execution time of the program.
The tuning section selection problem does not always have a reasonable solution.
For example, suppose that a program has only one function, main(), which contains
a loop consuming most of the execution time. If main is chosen as the tuning
section, Nmin is 1 and coverage is 100%. Otherwise, coverage is 0%. The first
solution degrades to the whole-program tuning. The second solution does not find
any tuning section. Neither of them is acceptable. In fact, we should use the loop
body in main as a tuning section. Using a call graph profile, the selection algorithm
cannot identify the loops within a function. Some manual work is needed to find
the loop and extract the loop body into a separate function. We call this process
manual code partitioning. 3 After manual code partitioning, the loop body appears
in the call graph profile, which then is chosen as a tuning section by the selection
algorithm. Finding the functions that are not selected as tuning sections but worth
manual partitioning is another job of the algorithm.
4.4 The Tuning Section Selection Algorithm
From the previous section, the tuning section selection problem can be viewed as
a constrained max cut problem. (The original max cut problem is NP-complete [65].)
3 We could automate this code partitioning process if the profile were provided at the basic-block level.
In this section, we will develop a greedy algorithm to select the tuning sections so as
to get maximal Nmin and coverage. This algorithm solves the problem in two steps.
(1) We design an algorithm which aims to maximize coverage under the constraint
that the number of invocations to each selected tuning section is larger than a lower
bound Nlb. (2) The final algorithm raises the Nlb gradually to trade off the coverage.
It aims to achieve a large Nmin by tolerating a small decrease of the coverage. These
two steps will be presented in Sections 4.4.2 and 4.4.3, respectively. We discuss
the handling of recursive functions in Section 4.4.1.
4.4.1 Dealing with recursive functions
Some programs contain recursive functions. To call a recursive function, the
program makes an initial call to the function. Then the function will be called by
itself, in the case of self-recursion, or by its callees, in the case of mutual-recursion.
Both self-recursive calls and mutually-recursive calls are referred to as recursive calls,
which are different from the initial call. Our PEAK system treats initial calls to a
recursive function as normal function calls; while recursive calls can be viewed as loop
iterations, which are ignored by tuning section selection. In other words, the tuning
section selection algorithm does not choose the call graph edges that correspond to
recursive calls, but only the edges corresponding to initial calls.
In a call graph, the functions (nodes) that recursively call themselves or each
other form cycles (including loops). To exclude recursive calls from tuning section
selection, our algorithm identifies the cycles and ignores the edges that appear in the
cycles. We do this through a call graph simplification process. This process merges
the nodes involved in a common cycle into one node, removes the edges used inside
a cycle, and adjusts the edges entering or leaving the merged nodes. This process
uses the strongly connected components to find the nodes and edges that appear in
a cycle. It adjusts the profile data as well. The pseudo code of this simplification
process is shown in Figure 4.2. An example is shown in Figure 4.3.
Subroutine G = GenerateSimplifiedCallGraph(profile)
1. Construct call graph G = (V, E) according to the profile data as described in
   Section 4.2. For each node v ∈ V , v = {fn}: fn(v) is the function name. For
   each edge e ∈ E, e = {s, t, n, tm}: s(e) is the caller node; t(e) is the callee
   node; n(e) is the number of invocations to t(e) made by s(e); tm(e) is the time
   spent in t(e) and its callees, when t(e) is called from s(e).
2. Remove loops in G. (A loop identifies a self-recursive call.)
3. For each strongly connected component SCC that contains more than one
   node, do the following to remove the cycles by merging the nodes in SCC.
   (Such strongly connected components identify mutually-recursive calls.)
   (a) Construct a new node u for SCC. The name of u, fn(u), is the concate-
       nation of the names of all the nodes in SCC.
   (b) Remove the inner edges, i.e., the edges starting from and ending at SCC.
   (c) Keep all the edges leaving SCC and set their starting node to be the new
       node u. Merge the edges that start from u and end at the same node,
       and sum up the profile data for the merged edges.
   (d) For each node v in SCC, remove v if there is no edge entering v.
       Otherwise, add a new edge e, which leaves v and enters the new node u.
       The profile information for this e is the sum of the profile data for all the
       edges entering v:

       n(e) = Σ(x∈E, t(x)=v) n(x)    (4.10)

       tm(e) = Σ(x∈E, t(x)=v) tm(x)    (4.11)

Fig. 4.2. The pseudo code for call graph simplification. The algorithm
generates a call graph from profile data, and detects and discards
recursive calls. Hence, the call graph is simplified to a directed acyclic
graph.
[Figure 4.3: (a) the call graph before simplification; (b) the call graph after
simplification. Each edge is labeled with its number of invocations and, in
parentheses, its execution time.]

Fig. 4.3. An example of call graph simplification. The graph is a call
graph with node a as the main function. c is a self-recursive function.
b and e recursively call each other. The weights on an edge are the
number of invocations and the execution time in the parentheses.
After simplification, the loop at node c is discarded. The strongly
connected component {b, e} is merged into one node be. The entry
node b for this strongly connected component is kept. A new edge
(b, be) is added. Edges (b, f) and (e, f) are merged to (be, f). The
profile data on edges (b, be) and (be, f) are updated.
The simplification removes the self-recursive calls and makes a new node for
the functions that recursively call each other. The resulting graph is a Directed Acyclic
Graph (DAG). The call graph simplification algorithm maintains the profile infor-
mation for the new nodes and edges. So, the graph has the same profile information
as described in Section 4.2.
4.4.2 Maximizing tuning section coverage under Nlb
This section describes a tuning section selection algorithm, which aims to achieve
as large a coverage as possible, under the constraint that the number of invocations to
each selected tuning section is larger than a lower bound Nlb. This algorithm selects
the tuning sections and puts their entry functions into set T . It finds the functions
that are worth manual code partitioning and puts them into set M . Besides Nlb, this
algorithm uses two other parameters: (1) Tlb, the lower bound on average execution
times; (2) Plb, the lower bound on the execution percentage for a code section worth
manual partitioning.
Tlb is determined by the timing accuracy of the PEAK system. We use 100μsec
in our experiments. Plb is used to determine whether a code section is worth tuning.
We use 0.02. This means that the code section is worth tuning if its execution time
is greater than 2% of the total execution time. Nlb will be adjusted to trade off the
tuning section coverage in the final tuning section selection algorithm described in
Section 4.4.3. The optimal Nlb picked by the final algorithm usually ranges from
tens to thousands.
To maximize the coverage, the algorithm traverses the call graph top-down, in topological order, to select the code sections that meet the requirements. When a tuning section is selected, the
profile data are updated to reflect the execution times and invocation numbers after
excluding this selected tuning section. The execution time due to the selected tuning
section is deducted from the execution time of its ancestors as well. (Remember that
the execution time on node v includes the time spent in v itself and the descendants
of v.) After the selection process finishes, the remaining execution time on each
node v is used to judge whether it is worth manual partitioning. (It is worth manual
partitioning, if its execution time is greater than Plb of the total execution time.)
Figure 4.4 shows the pseudo code for the algorithm to maximize the tuning
section coverage. This algorithm constructs an acyclic call graph, annotated with
profile information after removing the recursive calls, according to the algorithm
described in Section 4.4.1. It ignores the edges whose average execution time is less
than threshold Tlb when computing the execution profile for a node. It goes through
the call graph in a topological order to find the nodes whose numbers of invocations
are greater than threshold Nlb. These nodes are selected as entry functions to the
tuning sections. The profile information of the relevant edges is adjusted to reflect
Subroutine [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb)

There are three thresholds used in this algorithm: Nlb, the lower bound on numbers of invocations; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth manual partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth manual code partitioning are put into M.

1. Construct the simplified call graph G = (V, E), according to Figure 4.2, i.e., G = GenerateSimplifiedCallGraph(profile).

2. Clear the selection flag for each edge e ∈ E: f(e) = 0. Clear the execution time due to the selected tuning sections for each node v: TX(v) = 0. Empty the sets T and M.

3. Mark the edges whose average execution times are less than Tlb, i.e., for each edge e ∈ E, set f(e) = −1 if tm(e)/n(e) < Tlb. (These edges will not be counted when summing the profile data of the edges for one node.)

4. Sort all the nodes into a topological order v1, v2, ..., i.e., if there is an edge (u, v), node u appears before node v. The nodes will be traversed in this order.

5. For node vi (i = 1, 2, ...), compute the total number of invocations to vi:

   n(vi) = Σ_{e∈E, t(e)=vi, f(e)=0} n(e)    (4.12)

   If n(vi) is greater than Nlb, put vi into T and set f(e) = 1 for each edge e with t(e) = vi; then update the profile information of G by calling UpdateProfile(G, vi, TX).

6. Put node v into M if v is not selected but consumes a large amount of execution time, i.e., Σ_{t(e)=v, f(e)≠1} tm(e) − TX(v) > Plb × Ttotal. (These nodes may be manually partitioned to improve tuning section coverage.)

Fig. 4.4. Tuning section selection algorithm to maximize program coverage under the lower bound on numbers of TS invocations, Nlb. This algorithm traverses the simplified call graph top-down to find the code sections whose numbers of invocations are greater than Nlb. In addition, the algorithm finds the functions that may be manually partitioned to improve tuning section coverage.
the number of invocations and execution time spent in the rest of the program,
when a node is selected. In the end, the algorithm finds the functions worth manual
partitioning, using the residual execution time after excluding the selected tuning
sections.
When a tuning section is selected, the profile data need to be updated in order
to exclude the execution information (time and number of invocations) due to the
selected tuning section. This requires a context-sensitive inclusive profile, which
lists the direct and indirect callers of a function, when the execution information
of this function is given. This profile should show the execution time and number
of invocations of each function call; it should also split this information for each
call path. Unfortunately, gprof does not provide such information. Instead, we use
an algorithm to do an estimation, which is described in Figure 4.5. It handles the
descendants and the ancestors of the selected node v separately.
For each descendant u, the algorithm estimates how often u is directly or indi-
rectly called from the selected node v, out of the total invocations to u. We denote
this execution frequency of u as q(u). For the selected node v, q(v) = 1.0. For each
descendant u, q(u) is computed based on the execution frequency of u’s parents and
the numbers of invocation to u directly from the parents. Equation 4.17 does the
estimation, where q(u) is the execution frequency of u.
q(u) = ( Σ_{e∈E, t(e)=u} q(s(e)) × n(e) ) / ( Σ_{e∈E, t(e)=u} n(e) )    (4.17)
Knowing the execution frequency of all the descendants, the profile information for
all the edges reachable from v can be estimated according to Equations 4.18 and 4.19.
n(e) = n(e) × (1 − q(s(e))) (4.18)
tm(e) = tm(e) × (1 − q(s(e))) (4.19)
The algorithm traverses the node list forwards from node v. (The nodes are sorted
in a topological order.)
For each ancestor u of the selected node v, the algorithm estimates how much
execution time is spent in u due to v, which is denoted as p(u). The time spent in
Subroutine UpdateProfile(G = (V, E), vi, TX)

This algorithm removes the execution time due to vi for each edge reachable from vi, and adds the time due to vi into TX for each ancestor of vi. (TX(v) records the time spent in v due to all the selected tuning sections. Manual code partitioning will use TX to estimate the remaining execution time after tuning section selection.)

Update the profile information of the edges that are reachable from node vi:

1. Set the execution frequency of all nodes except vi to 0, q(v) = 0, while q(vi) = 1.0, which means vi is executed 100% within the chosen tuning section.

2. For node vj (j = i+1, i+2, ...), compute the execution frequency of vj from the execution frequency of its parents:

   q(vj) = ( Σ_{e∈E, t(e)=vj} q(s(e)) × n(e) ) / ( Σ_{e∈E, t(e)=vj} n(e) )    (4.13)

3. For node vj (j = i+1, i+2, ...), adjust the profile information for the edges entering vj, i.e., the edges with t(e) = vj:

   n(e) = n(e) × (1 − q(s(e)))    (4.14)
   tm(e) = tm(e) × (1 − q(s(e)))    (4.15)

For the nodes that reach node vi, record how much execution time is spent due to vi. This time will be excluded when choosing manual code partitioning candidates:

1. For each node v, set the time spent in v due to vi, p(v), to 0.

2. Set the time spent in vi: p(vi) = Σ_{e∈E, t(e)=vi} tm(e).

3. For node vj (j = i, i−1, ..., 1), update the time spent in its parents due to the time spent in vi. For each edge e entering vj, i.e., t(e) = vj, adjust p(s(e)) as follows:

   p(s(e)) = p(s(e)) + ( tm(e) / Σ_{a∈E, t(a)=vj} tm(a) ) × p(vj)    (4.16)

4. For node vj (j = i−1, i−2, ..., 1), update the time spent in vj due to all the selected tuning sections: TX(vj) = TX(vj) + p(vj).

Fig. 4.5. Update of the profile data after vi is chosen as the entry function to a tuning section. The updated profile reflects the execution times and invocation numbers after excluding the chosen tuning section.
the selected node v, p(v), is equal to Σ_{e∈E, t(e)=v} tm(e). We assume that the time spent in u due to v is distributed to u's parent x in proportion to the time spent in u when u is called from x. That is, p(u) is distributed to p(x) in proportion to tm(e), where e leaves x and enters u. So, p(u) is distributed according to the following equation:

p(s(e)) = p(s(e)) + ( tm(e) / Σ_{a∈E, t(a)=u} tm(a) ) × p(u)    (4.20)
The algorithm traverses the node list backwards from node v.
Applying the coverage-maximizing algorithm to, for example, the call graph shown in Figure 4.1 with Nlb = 100 yields the optimal tuning section selection.
4.4.3 The final tuning section selection algorithm
The previous section describes an algorithm to maximize the tuning section cov-
erage with the constraint that the number of invocations to each selected tuning
section should be larger than Nlb. The algorithm described in this section aims to
achieve a large Nmin by tolerating a small decrease of the coverage. It does this via
raising the Nlb gradually to trade off the coverage.
Two new parameters are introduced to the final algorithm.
1. Clb, the lower bound on the tuning section coverage. If the coverage of the
selected tuning sections is smaller than Clb, manual partitioning is necessary.
We set it as 80% of the total execution time.
2. Rub, the upper bound on the coverage drop rate. The coverage drop rate is computed from two tuning section selection solutions as follows, where Nmin2 is larger than Nmin1:

   R = (coverage1 − coverage2) / (Nmin2 − Nmin1)    (4.21)
If, on average, increasing Nmin by 1 drops the coverage by more than Rub, the algorithm has found the trade-off point. In our experiments, we tolerate a 1% decrease of the coverage if Nmin can be improved by 100, so we set Rub = 0.01/100 = 1e−4.

Table 4.1
Tuning section selection for mgrid. The best Nlb is 400. The optimal coverage and Nmin are 0.957 and 2000.

  iteration   Nlb    coverage   Nmin
  1             10   0.998       400
  2            400   0.957      2000
  3           2000   0.808      2400
Figure 4.6 shows the pseudo code of this tuning section selection algorithm. This
algorithm iteratively uses the method shown in Figure 4.4 to maximize the tuning
section coverage under a series of thresholds Nlb’s. The new Nlb in the next iteration,
Nlb2, is equal to the Nmin obtained from the previous iteration. We notice that this
Nmin is greater than the old Nlb in the previous iteration, Nlb1, and that any threshold
value in [Nlb1, Nmin) gives the same solution to maximize the coverage. Using Nmin
from the previous iteration as the new threshold value for Nlb makes the trade-off
process fast. This process finishes when the coverage drops below Clb or the coverage
drop rate is greater than Rub.
For example, Table 4.1 shows the result of each iteration when this algorithm is applied to the benchmark mgrid. The second iteration obtains the optimal result, with
coverage = 0.957 and Nmin = 2000. (The initial Nlb is 10, which is a reasonable
boundary for our rating methods to achieve faster tuning than the whole-program
tuning.)
4.5 Results
We apply the tuning section selection algorithm to SPEC CPU2000 benchmarks.
The default threshold values are used, i.e., Rub = 1e−4, Clb = 80%, Tlb = 100μsec,
Plb = 2%. Focusing on the scientific programs, we analyze the FP benchmarks
Subroutine [T, M] = TSSelection(profile, Rub, Clb, Tlb, Plb)

There are four thresholds used in this algorithm: Rub, the upper bound on the coverage drop rate; Clb, the lower bound on the tuning section coverage; Tlb, the lower bound on average execution times; Plb, the lower bound on the execution percentage for a code section worth manual partitioning. The algorithm selects the tuning sections and puts their entry functions into set T. The functions that are worth manual code partitioning are put into M.

1. Initialization: Tbest = ∅, Mbest = ∅, coveragebest = 0, Nbest = 0, Nlb = 10.

2. Use the algorithm in Figure 4.4 to maximize the coverage: [T, M] = MaxCoverage(profile, Nlb, Tlb, Plb). Compute coverage and Nmin of the selected tuning sections T.

3. If coverage is less than Clb, do the following.
   (a) If Tbest is ∅, manual partitioning is necessary. Print T and M. Stop.
   (b) Else, Tbest and Mbest are the solution. Print Tbest and Mbest. Stop.

4. Trade off coverage and Nmin as follows.
   (a) If Tbest is ∅, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin.
   (b) Else, compute the coverage drop rate R:

       R = (coveragebest − coverage) / (Nmin − Nbest)    (4.22)

       If the drop rate R is less than Rub, record this solution: Tbest = T, Mbest = M, coveragebest = coverage, Nbest = Nmin. Otherwise, the trade-off point is found. Print Tbest and Mbest. Stop.

5. Set Nlb = Nmin and go to Step 2.
Fig. 4.6. The final tuning section selection algorithm. This algorithmachieves both a large Nmin and a high coverage. It iteratively uses themethod shown in Figure 4.4 to maximize the tuning section coverageunder a series of thresholds Nlb’s, until the optimal Nlb is found.
Table 4.2
Selected tuning sections in SPEC CPU2000 FP benchmarks. (Three manually partitioned benchmarks are annotated with '*'. The last row, wupwise+, uses a smaller Tlb = 1μsec.)

  Benchmark   coverage   Nmin       # of TS   TS names
  ammp        88.6       127        3   torsion, u_f_nonbon, angle
  applu       97.9       250        5   jacld, buts, blts, rhs, jacu
  apsi        87.8       720        9   dctdx, dvdtz, dudtz, dkzmh, dtdtz, dcdtz, leapfr, wcont, hyd
  art         99.9       250        2   match, train_match
  equake      54.6       2709       1   smvp
  equake*     99.0       2709       1   iter_body
  mesa        96.9       4000       1   general_textured_triangle
  mgrid       95.7       2000       4   interp, rprj3, resid, psinv
  sixtrack    10.4       208        2   phasad, clorb
  sixtrack*   97.9       1693       2   thin6d_sub, umlauf
  swim        83.9       198        3   calc3, calc2, calc1
  swim*       99.2       198        4   calc3, calc2, calc1, loop3500
  wupwise     91.7       22         1   matmul
  wupwise+    83.0       22528000   2   su3mul, gammul
in Section 4.5.1 in detail. The results on the INT benchmarks are presented in
Section 4.5.2. The data are listed in Table 4.2 and Table 4.3, respectively.
4.5.1 SPEC CPU2000 FP benchmarks
The second column in Table 4.2 shows the program execution time coverage of the
selected tuning sections. All the benchmarks cover most of the program execution
after code partitioning if needed. (For most benchmarks, the coverage is above 90%.)
The third column in the table shows Nmin, the minimum number of invocations
to the tuning sections. For all the benchmarks except wupwise, Nmin ranges from
hundreds to thousands. For wupwise, Nmin is 22. The reason turns out to be the small functions used in wupwise: these functions are invoked millions of times, while their average execution time is very small, on the order of microseconds. Given a small
threshold of Tlb = 1μsec, we redo the tuning section selection and get a huge Nmin
of 22528000. (Note that our timing implementation is accurate enough to handle microsecond-scale measurements.)
As the second and third columns show, our algorithm achieves the goal of maximizing both the program coverage and the minimum number of invocations to the
tuning sections. We will show the final tuned program performance in the next
chapter.
Table 4.2 shows that three benchmarks (equake, sixtrack and swim) need code partitioning. The general rule is to extract the body of the important loop in the candidate function into a separate function. The candidate functions are identified by our tuning section selection algorithm as well; essentially, these functions account for a large share of the execution time but receive only a few invocations. After partitioning, the new function covers a large part of the program execution time and is invoked a large number of times. The code partitioning is done as follows for the three benchmarks.
1. In equake, the function main contains a large loop. Since main is invoked only once, our tuning section selection algorithm does not pick it as a tuning section. So, we extract the loop body into a new function called iter_body, which is then invoked many times and selected as the tuning section. (The iter_body function calls the function smvp, which was selected before manual code partitioning.)

2. In sixtrack, the function thin6d contains a large loop. Since thin6d is invoked only once, it is not picked as a tuning section. As in equake, we extract the loop body into a separate function, thin6d_sub, which is then picked as a tuning section.

3. In swim, the main program contains an important loop, which is extracted into a new tuning section, loop3500.
4.5.2 SPEC CPU2000 INT benchmarks
As Table 4.3 shows, our tuning section selection algorithm achieves the goal of maximizing the program coverage and the minimum number of invocations to the tuning
sections for INT benchmarks as well.
In general, the profiles of SPEC CPU2000 INT benchmarks are different from
those of SPEC CPU2000 FP benchmarks: (1) The INT benchmarks have flatter
profiles with more functions in the call graphs. (2) The call graphs of the INT
benchmarks are deeper. (3) The functions in the INT benchmarks that are invoked
many times usually have small average execution times.
Due to (3), the default threshold of the average execution time, Tlb = 100μsec,
prohibits the selection of some important functions in gzip, mcf, perlbmk, vortex,
and vpr. After using a smaller Tlb ranging from 0.01μsec to 1μsec, their program
coverage is improved. (For these benchmarks, a high resolution timer is needed to
generate accurate performance ratings for the selected tuning sections.)
In bzip2, gap and perlbmk, some important functions are only invoked tens of
times, which brings down Nmin, the minimum number of invocations to the tuning
sections. For other benchmarks, Nmin is hundreds to thousands.
Table 4.3
Selected tuning sections in SPEC CPU2000 INT benchmarks. (The benchmarks annotated with '+' use smaller Tlb's.)

  Benchmark   coverage   Nmin    #TS   TS names
  bzip2       99.9       22      6   doReversibleTransformation, generateMTFValues, sendMTFValues, loadAndRLEsource, undoReversibleTransformation_fast, getAndMoveToFrontDecode
  crafty      100.0      1272    1   Search
  gap         96.6       24      3   EvFunccall, EvVarAss, EvElmList
  gcc         90.7       109     1   rest_of_compilation
  gzip        41.7       3668    3   fill_window, flush_block, inflate_dynamic
  gzip+       84.1       4021    5   ct_tally, fill_window, flush_block, inflate_dynamic, longest_match
  mcf         46.6       5235    1   refresh_potential
  mcf+        74.7       5235    2   refresh_potential, primal_bea_mpp
  parser      91.6       309     3   prepare_to_parse, parse, expression_prune
  perlbmk     76.2       11      5   incpush, Perl_newXS, Perl_av_push, Perl_gv_fetchpv, Perl_sv_free
  perlbmk+    92.6       11      6   incpush, Perl_newXS, Perl_av_push, Perl_gv_fetchpv, Perl_sv_free, Perl_runops_standard
  twolf       98.0       120     1   uloop
  vortex      78.0       6209    2   BMT_Validate, SaFindIn
  vortex+     95.5       4000    7   BMT_CommitPartDrawObj, BMT_Validate, PersonObjs_FindIn, BMT_DeletePartDrawObj, Object_Delete, OaDeleteFields, SetAddInto
  vpr         52.7       10746   1   route_net
  vpr+        98.3       10746   2   route_net, try_swap
5. THE PEAK SYSTEM
5.1 Introduction
It is well understood that optimization techniques do not always improve program performance significantly. They may have only negligible effects, or may even degrade performance in unexpected cases. The interaction between optimizations makes it difficult for a programmer to find the best optimization combination for a given program. Our PEAK system automates this process
using a feedback-directed approach. Chapter 2 presents a fast and effective orches-
tration algorithm, which iteratively generates optimized code versions and evaluates
their performance based on the execution time under a training input until the best
version is found. Noticing that using a partial execution of the program (i.e., a few
invocations to a code section) may achieve accurate performance evaluation in a
faster way, we developed three rating methods applied to important code sections,
called tuning sections, in Chapter 3. These rating methods evaluate the performance
of an optimized version of a tuning section based on a number of invocations to the
tuning section. Chapter 4 designs an algorithm to select the tuning sections out of
a program to maximize both the program execution time coverage and the number
of invocations, aiming at high tuned program performance and fast tuning speed.
This chapter puts everything together to construct an automated performance
tuning system. Section 5.2 shows the design of the PEAK system, which applies the
described techniques into two primary parts: the PEAK compiler and the PEAK
runtime system. Special implementation problems related to runtime code gener-
ation and loading are also discussed. Section 5.3 shows an example of using the
PEAK system. Section 5.4 analyzes the experimental results of PEAK focusing on
the SPEC CPU2000 FP benchmarks, using two metrics: tuning time and tuned
program performance. On average, compared to the whole-program tuning presented in Chapter 2, PEAK reduces the tuning time from 2.19 hours to 5.85 minutes and improves the performance gain from 11.7% to 12.1%.
5.2 Design of PEAK
5.2.1 The steps of automated performance tuning
The PEAK system has two major parts: the PEAK compiler and the PEAK
runtime system. The PEAK compiler is used before tuning, while the PEAK runtime
system is used during tuning.
Figure 5.1 shows a block diagram of the PEAK system, which lists all the compo-
nents in PEAK and all the performance tuning steps. Steps 1 to 4 are taken before
tuning to construct a tuning driver for a given program. In these steps, the PEAK
compiler analyzes and instruments the source code. During performance tuning at
Step 5, the tuning driver continually runs the program under a training input until
the best version is found for each tuning section. In this step, the PEAK runtime
system is involved in dynamically generating and loading optimized versions, rating
these versions, and feeding new optimization combinations to the tuning driver. Af-
ter tuning, in Step 6, each tuning section is compiled under its best optimization
combination and linked to the main program to generate the final tuned version. In
detail, PEAK takes the following steps.
1. The tuning section selector chooses the important code sections as the tuning
sections, using the call graph profile generated by gprof. It applies the algo-
rithm described in Chapter 4. The output of this step is a list of function
names, each of which identifies a tuning section.
2. The rating method consultant analyzes the source program to find the applica-
ble rating methods for each tuning section. The compiler techniques developed
in Chapter 3 are implemented here. This tool annotates the program with the
information about the context variables for CBR and the performance model
for MBR.
Fig. 5.1. Block diagram of the PEAK performance tuning system
3. The PEAK instrumentation tool applies the appropriate rating method to each
tuning section, after a profile run to find the major context for CBR and the
model parameters for MBR according to Chapter 3. It adds the initialization
and finalization functions to activate the PEAK runtime system and the func-
tions to load and save the tuning state of previous runs, since the performance
tuning driver may run the program multiple times in Step 5. The instru-
mentation tool retrieves each tuning section into a separate file, which will
be compiled at Step 5 under different optimization combinations to generate
optimized versions.
4. The instrumented code is compiled and linked with the PEAK runtime sys-
tem, which is provided in a library format, to construct the performance tuning
driver. The PEAK runtime system implements the three rating methods devel-
oped in Chapter 3 and the CE optimization orchestration algorithm developed
in Chapter 2. Special functions for dynamically loading the binary code during
tuning are also included in the PEAK runtime system. (The generation of the
tuning driver and the optimized tuning section versions is done by the backend
compiler, in this thesis, the GCC compiler.)
5. The performance tuning driver iteratively runs the program under a training
input until optimization orchestration finishes for all the tuning sections. At
each invocation to a tuning section, the driver takes over the control. It runs
and times the current experimental version and decides whether more invoca-
tions are needed to rate the performance of this version. After the rating of
this version is done (i.e., when the rating variance is small enough), the driver
generates new experimental versions according to the orchestration algorithm.
(The tuning sections are tuned independently.) The tuning process ends when
the best version is found for each tuning section.
6. After the tuning process finds the best optimized version for each tuning sec-
tion, these best versions are linked to the main program to generate the final
version. Here, the main program is the original source program with the tuning
sections removed. The final version is the one to be delivered to the end users.
This completes the tuning process.
5.2.2 Dynamic code generation and loading
PEAK tunes program performance via executing the program under a training
input. It generates and loads the binary code at runtime. (This distinguishing
feature is inherited from the ADAPT [24, 25] infrastructure.) To do so, PEAK uses
the dynamic linking facility functions: dlopen, dlsym, dlclose and dlerror. Basically,
these functions enable PEAK to load binary code into memory and to resolve the
address of the experimental version contained in that binary. (The binary code is
generated by the backend compiler, GCC, under the given optimization options.)
To generate an optimized version for a tuning section separately, excluding other
unrelated code, PEAK extracts each tuning section into a separate source file. This
source file includes the entry function to the tuning section and all its direct and
indirect callees. So, this source file can be compiled and optimized separately. Since
callees are included in the tuning section, inlining can be performed during code
generation. If a subroutine/function is called in two tuning sections, this subrou-
tine/function is replicated and renamed in the corresponding source files to avoid
name conflicts at link time. Such replication does not lead to code explosion, because
the number of tuning sections is fixed and usually small, around three. Our PEAK
compiler does the above job automatically using source-to-source compilation.
To load an optimized version successfully and correctly, PEAK needs to pay
attention to the global variables used in the tuning sections, especially the static
variables in C and the common blocks in Fortran. Multiple versions of the same
tuning section are invoked during one run of the program. Each global variable
used in these versions should resolve to the same address when the versions are
loaded. To solve this issue, a number of details are important:
1. When the tuning driver is compiled, all the global symbols are added to the
dynamic symbol table. The dynamic symbol table is the set of symbols which
are visible from dynamic objects at run time. When loading an optimized
version of a tuning section, the loader can use this table to locate the global
variables used in the optimized version. To put the global symbols into the
dynamic symbol table, the option of export-dynamic is passed to the linker
during the tuning driver generation.
2. The static variables in C programs are promoted to the global scope and are
given globally unique names. This is because an optimized version makes
a local copy of static variables during dynamic code loading. If the static
variables are not promoted, different code versions of the same tuning section
use different copies of the static variables, which leads to incorrect execution.
Our PEAK compiler does this promotion for both file-scope static variables
and function-scope static variables. (For Fortran programs, the local variables
are put into a common block, if a save statement is specified in the subroutine.)
3. For Fortran programs, our binary editing tool finds the common blocks in
the symbol table of an optimized binary code and makes the common blocks
linkable to the actual definition in the main program. This is because the
dynamic loader makes different copies of the same common block for different
code versions, just like static variables in C. Different from C, we need to work
at the binary level to solve this problem. After modifying the corresponding
attribute of these symbols, the binary editing tool makes the common block
linkable to the actual definition during dynamic loading. 1 This tool uses the
libelf functions.
5.3 An Example of Using PEAK
This section uses the benchmark swim to illustrate the six tuning steps of PEAK.
1. The tuning section selector uses a gprof output and selects four tuning sections:
calc1, calc2, calc3 and loop3500. Figure 5.2 shows the source code of the tuning
section calc1 as an example.
1 Our binary editing tool changes the symbol attribute "st_shndx" from "SHN_COMMON" to "SHN_UNDEF" for the common blocks. "SHN_COMMON" marks a common block, for which storage will be allocated during linking. "SHN_UNDEF" marks an undefined symbol, whose references will be linked to the actual definition during linking.
SUBROUTINE calc1
...
DO j = 1, n, 1
DO i = 1, m, 1
...
ENDDO
ENDDO
DO j = 1, n, 1
...
ENDDO
DO i = 1, m, 1
...
ENDDO
...
RETURN
END
Fig. 5.2. An example of the tuning section calc1 in swim
2. The rating method consultant analyzes the source code to find the applicable
rating methods. For calc1, all three rating methods are applicable. The context
variables are n and m. After one profile run, PEAK finds that n and m are
runtime constants. So, there is only one context for calc1 and context-based-
rating will be applied.
3. The PEAK instrumentation tool adds PEAK runtime library calls to the source
program. Each tuning section is retrieved into a separate file, which will be
compiled under different optimization combinations to generate experimental
versions during performance tuning. The instrumentation code is added to
each tuning section and the entry and exits of the program.
• Each entry to a tuning section is redirected to the corresponding instru-
mented code. Figure 5.3 shows the instrumented calc1, which calls two
PEAK runtime functions, DCGetVersion and DCMonRecordCBR.
(a) DCGetVersion implements the optimization orchestration algorithm
and dynamic code generation and loading. It returns the current
extern void calc1_old_();
void calc1_()
{
hrtime_t t0, t1;
typedef void(*FunType)();
FunType fun = NULL;
hrtime_t tm;
//Experiment the current optimized version
if( (fun=(FunType)DCGetVersion(0)) != NULL ) {
t0 = gethrtime();
fun(); //invoke the experimental version
t1 = gethrtime();
tm = t1 - t0; //time the current invocation
DCMonRecordCBR(0, tm); //rating generation
return;
}
//Optimization orchestration is done
calc1_old_();
}
Fig. 5.3. The tuning section calc1 instrumented by the PEAK compiler
experimental version. If the previous version has already been rated,
this call will generate and load a new optimized version.
(b) DCMonRecordCBR passes the invocation time obtained from a high
resolution timer to the context-based-rating method. The rating
method rates the current version based on a few invocations. The
orchestration algorithm uses the ratings of previous versions to guide
the generation of new optimized versions.
(c) The argument, 0, of these two functions identifies the tuning section
of calc1.
In the code, calc1_old_ is the original version of the tuning section.
• At the entry of the entire program, the function init_dc is called to set
up the tuning state. Figure 5.4 lists the instrumented code.
(a) DCInit(4) allocates the memory and initializes the tuning state for
four tuning sections.
void init_dc()
{
DCInit(4);
DCOrchGCCBaseO3();
DCOrchSetSuffix(".f");
DCSetTSProperty(0, DC_TS_NAME, "calc1");
DCSetTSProperty(0, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(0, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(1, DC_TS_NAME, "calc2");
DCSetTSProperty(1, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(1, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(2, DC_TS_NAME, "calc3");
DCSetTSProperty(2, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(2, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCSetTSProperty(3, DC_TS_NAME, "loop3500");
DCSetTSProperty(3, DC_TS_SEARCH_METHOD, DC_SEARCH_CE);
DCSetTSProperty(3, DC_TS_RATING_METHOD, DC_RATING_CBR);
DCLoad("DCdump.dat");
}
Fig. 5.4. The initialization function instrumented by the PEAK compiler
(b) DCOrchGCCBaseO3() sets up the orchestrated GCC O3 optimiza-
tions, using “O3” as the baseline.
(c) DCOrchSetSuffix(“.f”) specifies that the tuned program is in Fortran.
(d) The series of calls to DCSetTSProperty() specifies, for each tuning
section, its name, optimization orchestration algorithm and rating
method.
(e) DCLoad(“DCdump.dat”) loads the tuning state from the previous
run, because multiple runs of the program may be involved during
performance tuning.
• At the exits of the program, the function exit_dc is called to save the
tuning state. Figure 5.5 lists the instrumented code. DCDump() and
DCFinalize() are called to save the tuning state and to free the allocated
memory.

void exit_dc()
{
    DCDump("DCdump.dat");
    DCFinalize();
}

Fig. 5.5. The exit function instrumented by the PEAK compiler
4. The instrumented program is compiled and linked with the PEAK runtime
library to generate the performance tuning driver.
5. During performance tuning, the tuning driver repeatedly runs under a training
input until optimization orchestration is done for all the tuning sections. Its
output is the best optimization combination for each tuning section.
An example output for calc1 on a Pentium IV machine is
“g77 -c -O3 -fno-strength-reduce -fno-rename-registers -fno-align-loops calc1.f”.
This means that three optimizations, strength-reduce, rename-registers and
align-loops, should be turned off from the baseline O3.
6. In the final version generation stage, each tuning section adopts the correspond-
ing best optimization combination obtained from the previous step. These
optimized tuning sections are linked to the main program, which is compiled
under the default optimization combination, to generate the final version.
To summarize: in Step 1, the tuning section selection algorithm chooses tuning
sections based on an execution-time profile; in Steps 2 and 3, the PEAK compiler
analyzes and instruments the source code to prepare for performance tuning; in
Step 4, the instrumented code is compiled and linked with the PEAK runtime
system to generate a tuning driver; in Step 5, the tuning driver uses the PEAK
runtime system for dynamic code generation and loading, performance rating and
optimization orchestration; in Step 6, the final version is generated.
5.4 Experimental Results
We orchestrate the 38 GCC O3 optimizations for SPEC CPU2000 benchmarks,
on the same machines used in Chapter 2, a Pentium IV machine and a SPARC II
machine. For each benchmark, PEAK tunes the performance of the selected tuning
sections listed in Chapter 4. The goal of this experiment is to compare PEAK with
the whole-program tuning shown in Chapter 2. Similarly, we use two metrics: the
tuning time and the tuned program performance. Still, the tuning time is normalized
by the time to evaluate the performance of the base version according to Equation 2.7.
Since PEAK applies optimization orchestration to tuning sections separately, we
expect PEAK to outperform the whole-program tuning in two respects.
1. PEAK takes much less tuning time than the whole-program tuning. This is
because PEAK evaluates the performance of an optimized version based on a
partial execution of the program, while the latter uses the complete run.
2. PEAK achieves equal or better program performance than the whole-program
tuning. Since our tuning section selection algorithm covers most of the pro-
gram, our PEAK system should achieve performance at least equal to that of
the whole-program tuning. In cases where the tuning sections favor different
optimizations, PEAK can achieve better performance than the whole-program
tuning.
Focusing on scientific programs, we present the detailed experimental results on
SPEC CPU2000 FP benchmarks in terms of these two metrics separately in Sec-
tion 5.4.1 and Section 5.4.2. Section 5.4.3 presents the results on INT benchmarks.
5.4.1 Tuning time
Figure 5.6 shows the normalized tuning time of the whole-program tuning and
the PEAK system. (For PEAK, the tuning time includes the time spent in all
six tuning steps.) On average, the normalized tuning time is reduced from 68.3 to
3.36. So, PEAK gains a speedup of 20.3.

[Figure 5.6: bar chart comparing the normalized tuning time of Whole and PEAK
for each FP benchmark (ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack,
swim, wupwise) and the geometric mean.]

Fig. 5.6. Normalized tuning time of the whole-program tuning and the PEAK
system for SPEC CPU2000 FP benchmarks on Pentium IV. Lower is better. On
average, PEAK gains a speedup of 20.3.

The benchmarks that gain a high speedup usually have a large number of invocations
of the tuning sections, as shown in Table 4.2. This agrees with our assumption
during the tuning section selection in Chapter 4. On average, the absolute tuning
time is reduced from 2.19 hours to 5.85 minutes. Applying the rating methods of
Chapter 3 significantly reduces the tuning time.
Figure 5.7 shows the tuning time percentages of the six tuning steps. Most of
the time is spent in Step 5, the performance tuning (PT) stage. The second largest
portion of the tuning time is spent in Step 1, the tuning section selection (TSS)
stage, because Step 1 performs a profile run, which in some cases (e.g., wupwise)
takes more time than a normal run of the program. The third largest portion of the
tuning time is spent in Step 2, the rating method analysis (RMA) stage. Some of
this time is spent in data flow analysis; some is spent in the profile run that obtains
the context parameters for CBR and the execution-model parameters for MBR. For
programs with a large amount of source code (e.g., ammp and mesa), compiling the
source code into the internal representation of our source-to-source compiler also
accounts for a significant portion of the tuning time. (This compilation time comes
from the compiler infrastructure we use to implement our PEAK compiler.2)

[Figure 5.7: stacked bar chart of the percentage of the total tuning time spent in
each of the six stages for each FP benchmark.]

Fig. 5.7. Tuning time percentage of the six stages for SPEC CPU2000 FP
benchmarks on Pentium IV. (TSS: tuning section selection, RMA: rating method
analysis, CI: code instrumentation, DG: driver generation, PT: performance tuning,
FVG: final version generation.) The most time-consuming steps are PT, TSS and
RMA.
The results on SPARC II are similar. The normalized tuning time is reduced from
63.42 to 4.88, with a speedup of 13.0. The absolute tuning time is reduced from 9.83
hours to 43.7 minutes. (The SPARC II machine is slower than the Pentium IV
machine.)
2 Our PEAK compiler is developed based on the Polaris [60] compiler for Fortran programs and the SUIF2 [61] compiler for C programs.
5.4.2 Tuned program performance
Figure 5.8 shows the program performance achieved by the whole-program tuning
and the PEAK system on Pentium IV. We use the train dataset as the input to the
tuning process.
The first two bars show the performance of the final tuned version under the
same train dataset for the whole-program tuning and the PEAK system. For all
the benchmarks, PEAK achieves equal or better performance; it outperforms the
whole-program tuning by 1.5% on applu. Some benchmarks, such as equake and
mesa, have only one tuning section. Others, such as art, swim and mgrid, have
similar code structure in the selected tuning sections, so the tuning sections favor
similar optimizations. These two observations explain why PEAK does not
significantly outperform the whole-program tuning in terms of tuned program
performance.
A fair performance evaluation should use an input different from the training
input. To this end, we use the ref dataset to evaluate the performance of the tuned
version. (The train dataset remains the input to the tuning process.) The results
are shown by the last two bars. Using a different input still yields performance
similar to that of the first two bars. So, our tuning scenario does find a combination
of compiler optimizations that performs much better than the default optimization
configuration.
On average, PEAK improves performance by 12.0% and 12.1% with respect to
the train and ref datasets, while the whole-program tuning improves it by 11.9%
and 11.7%. The results on SPARC II are similar: PEAK improves performance by
4.1% and 3.7% with respect to train and ref, as does the whole-program tuning.
So, PEAK achieves equal or better program performance than the whole-program
tuning.
[Figure 5.8: bar chart of the relative performance improvement percentage for each
FP benchmark and the geometric mean; series Whole_Train, PEAK_Train,
Whole_Ref and PEAK_Ref.]

Fig. 5.8. Program performance improvement relative to the baseline under O3 for
SPEC CPU2000 FP benchmarks on Pentium IV. Higher is better. All the
benchmarks use the train dataset as the input to the tuning process. Whole_Train
(PEAK_Train) is the performance achieved by the whole-program tuning (the
PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref
dataset to evaluate the tuned program performance, but still the train dataset for
tuning. PEAK achieves equal or better program performance than the
whole-program tuning.
5.4.3 Integer benchmarks
Integer benchmarks generally have irregular code structures with many condi-
tional statements. Moreover, the use of pointers complicates rating-method analy-
sis. As a result, only re-execution-based rating is applicable. On the other hand,
the selected tuning sections in some benchmarks use library functions with side
effects, for example, the memory allocation functions. PEAK fails to tune these
benchmarks, because it cannot roll back the execution of the tuning section.3
To illustrate the performance of PEAK on INT benchmarks, we experiment with
five benchmarks, bzip2, crafty, gzip, mcf and parser, whose tuning sections do not
call such side-effecting functions. (One future research topic is to include the source
code of these library functions in the main program, so that the PEAK compiler
has a way to roll back the execution.)
Figure 5.9(a) shows the normalized tuning time. PEAK speeds up the tuning
process for INT benchmarks as well. On average, the normalized tuning time is
reduced from 73.48 to 13.37, and the absolute tuning time is reduced from 71.64
minutes to 14.67 minutes. bzip2 does not gain a high speedup due to its small
Nmin of 22. Besides, there are three major reasons why the speedup for INT
benchmarks is not as high as that for FP benchmarks: (1) Due to the irregularity
of the code, INT benchmarks can only use re-execution-based rating, which
introduces more overhead than the other two rating methods used in FP
benchmarks, mainly because of the execution of the preconditional version and the
base version. (2) The PEAK compiler spends more time analyzing INT
benchmarks, which generally have more code, especially crafty and parser. (3)
Unlike FP benchmarks, the tuning sections in INT benchmarks generally do not
have a similar code structure; different tuning sections may favor different
optimizations, which leads to more experimental versions, especially in parser.
Figure 5.9(b) shows the tuning time percentages of the six tuning steps. The
most time-consuming components are performance tuning, rating method analysis,
and tuning section selection. Due to the larger source code size, rating method
analysis and code instrumentation spend more time on INT benchmarks than on FP
benchmarks.
3 The INT benchmarks have more code than the FP benchmarks. Our PEAK compiler for C, which is based on SUIF2, has difficulty processing some of these benchmarks, failing to pass the C-to-SUIF conversion.
[Figure 5.9(a): bar chart comparing the normalized tuning time of Whole and
PEAK for bzip2, crafty, gzip, mcf, parser and the geometric mean.]

(a) Normalized tuning time of the whole-program tuning and the PEAK system for
SPEC CPU2000 INT benchmarks on Pentium IV. Lower is better. On average,
PEAK gains a speedup of 5.5.
[Figure 5.9(b): stacked bar chart of the percentage of the total tuning time spent in
each of the six stages for each INT benchmark.]

(b) Tuning time percentage of the six stages for SPEC CPU2000 INT benchmarks
on Pentium IV. (TSS: tuning section selection, RMA: rating method analysis, CI:
code instrumentation, DG: driver generation, PT: performance tuning, FVG: final
version generation.) The most time-consuming steps are PT, RMA and TSS.
Fig. 5.9. PEAK tuning time for INT benchmarks on Pentium IV
[Figure 5.10: bar chart of the relative performance improvement percentage for
each INT benchmark and the geometric mean; series Whole_Train, PEAK_Train,
Whole_Ref and PEAK_Ref.]

Fig. 5.10. Program performance improvement relative to the baseline under O3 for
SPEC CPU2000 INT benchmarks on Pentium IV. Higher is better. All the
benchmarks use the train dataset as the input to the tuning process. Whole_Train
(PEAK_Train) is the performance achieved by the whole-program tuning (the
PEAK system) under the train dataset. Whole_Ref and PEAK_Ref use the ref
dataset to evaluate the tuned program performance, but still the train dataset for
tuning.
Figure 5.10 shows the tuned program performance. On average, PEAK improves
performance by 4.6% and 4.2% with respect to the train and ref datasets, while
the whole-program tuning improves it by 4.4% and 4.2%, respectively. PEAK
achieves lower performance than the whole-program tuning on gzip, because of
gzip's low program coverage of 84%. For parser, PEAK achieves much better
performance than the whole-program tuning.
6. CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
The techniques developed in this thesis have led to the creation of the automated
performance tuning system called PEAK. PEAK searches for the best compiler opti-
mization combinations for important tuning sections in a program, using a fast and
effective optimization orchestration algorithm – Combined Elimination. Three fast
and accurate rating methods – CBR, MBR and RBR – are developed to evaluate the
performance of an optimized version based on a partial execution of the program.
The PEAK compiler selects the important code sections for performance tuning, an-
alyzes the source program for applicable rating methods, and instruments the source
code to construct a tuning driver. The tuning driver adopts a feedback-directed tun-
ing approach. It continually runs the program, loads experimental versions generated
under different optimization combinations, rates these versions based on the execu-
tion times, and explores the optimization space according to the generated ratings,
until the best optimization combination is found for each tuning section. In addi-
tion to the above functionalities, the PEAK runtime system provides the facilities
to dynamically load executables at runtime.
PEAK achieves fast tuning speed and high tuned program performance. When
our Combined Elimination (CE) algorithm is applied to the whole program, the
program performance is improved by 12% over GCC O3 for SPEC CPU2000 FP
benchmarks and 4% for INT benchmarks; the tuning time is reduced to 57% of the
closest alternative algorithm, on average. Using the Sun Forte compilers, CE improves
performance by 10.8% for FP benchmarks, compared to 5.6% achieved by manual
tuning, and by 8.1% versus 4.1% for INT benchmarks. After applying the rating methods to
the tuning sections, PEAK reduces tuning time from 2 hours to 4.9 minutes for FP
benchmarks, and from 1.2 hours to 13 minutes for selected INT benchmarks, while
achieving equal or better program performance.
PEAK can be applied to different computer architectures and backend compilers.
The PEAK compiler works at the source program level, and the PEAK runtime
system is provided in the form of a library, so PEAK can be easily ported to new
computer systems. Furthermore, PEAK invokes the backend compiler through a
command line, and the tuned optimizations are controlled via compiler options. So,
new optimization techniques can easily be plugged into PEAK, provided the
corresponding options are available.
6.2 Future Work
We have developed the PEAK system as a prototype for automatic performance
tuning. One could improve this system by working at the basic-block level: using
basic-block profiles, one could automate the manual code-partitioning process. One
could also experiment with more compiler optimizations. If the compiler source code
is available, one could even experiment with the internal optimization parameters
that are not usually exposed to the end user. Other directions for future work follow.
6.2.1 Performance analysis on compiler optimizations
This thesis shows that optimization orchestration can improve program perfor-
mance significantly. The main reasons for the negative effects of compiler optimiza-
tions are that the optimizations lack accurate information about the program and
that the interactions between optimizations are hard to predict. Sometimes, a
simplistic implementation of an optimization also causes performance degradation.
As an effort to analyze the performance effects of the compiler optimizations,
the appendix discusses the performance behavior of all the GCC O3 optimization
options on the SPEC CPU2000 benchmarks, using a Pentium IV machine and a
SPARC II machine. The reasons for performance degradation are analyzed for several
important optimizations. One important finding is that optimizations may exhibit
unexpected performance behavior – even generally-beneficial techniques may degrade
performance. Degradations are often complex side-effects of the interaction with
other optimizations. 1
More work is necessary on the performance analysis of compiler optimizations,
especially the interactions between optimizations. Such analysis can be used
to improve the compiler optimizations themselves. (A large number of
optimizations exist; this thesis analyzes only a few of them, shown in the appendix.)
6.2.2 Other tuning problems
The primary goal of PEAK is to find the best compiler optimization combination
for each tuning section, so that the tuned program achieves better performance
than the program compiled under the default optimization configuration. Moreover,
PEAK can be extended to solve other tuning problems.
PEAK can tune the performance of a library by finding the best optimized version
for each library function. Each library function works as a tuning section. The
task is then to create a set of driver routines that call the library functions with
representative calling parameters.
PEAK can tune the performance of backend compilers, so as to find the best
default compiler optimization configuration. In this case, PEAK rates the compiler
optimizations based on performance summaries on a set of benchmarks.
PEAK can tune parameters other than compiler optimizations. For example, it
can select the best among several candidate algorithms for solving the target problem;
here, each experimental version implements one algorithm. Another example is
using PEAK to optimize parallel programs.
PEAK can tune program performance in between production runs, instead of
only before production runs. If the compiler optimizations are sensitive to the
execution environment (e.g., the computer system configuration and the program
input), the tuning process can be repeated when this environment changes.

1 [43] provides a detailed report on the performance of GCC optimizations.
6.2.3 Adaptive performance tuning
PEAK tunes program performance using a profile-based mechanism: it uses a
training input rather than the actual input. Using the actual input would require
PEAK to tune program performance adaptively at runtime. There are two reasons
for our focus on the profile-based mechanism: (1) The compilation and tuning
overhead of adaptive performance tuning may be too high to be amortized by the
performance gain, especially when the search space is huge. (2) In many cases,
program performance is not very sensitive to the program input, so the profile-based
approach is good enough for performance tuning.
If the performance effects of some optimizations vary during the execution of the
program, adaptive performance tuning may achieve even better performance than
the profile-based approach. PEAK could be used to do this job, since it supports
runtime analysis: (1) the rating methods do not affect the correctness of the
program; (2) the rating methods cause little performance overhead; (3) the PEAK
runtime can generate and load optimized code at runtime.
Still, two important questions remain with regard to adaptive performance
tuning: (1) Which optimizations have such dynamic behavior or need runtime
information? (2) How can we minimize the tuning overhead so that it can be
amortized by the performance gained from adaptive tuning?
6.2.4 Program debugging
Although PEAK is developed for performance tuning, its techniques of code
instrumentation, runtime compilation, runtime code loading and performance rating
can be used to help program debugging.
Traditional compilers and debuggers are used separately. A compiler generates
binary code and a symbol table; then a debugger loads the binary and the symbol
table during the process of debugging. The debugging commands will use the symbol
table to locate code and data.
PEAK can be extended to invoke a compiler in the course of program debugging.
This mechanism of compilation while debugging can shorten the debugging time and
reduce manual work. For example, a programmer can apply this mechanism to the
following scenarios without restarting the debugged program:
1. The programmer compiles the program after fixing a small error and plugs the
generated binary into the running program. Our PEAK runtime can facilitate
this process of runtime code loading.
2. The programmer adds some instrumentation code during debugging. Our
PEAK compiler can do this instrumentation automatically for the purpose
of performance debugging. Our PEAK runtime can load the instrumented
code during debugging.
3. The programmer can apply different optimizations to a code section to compare
their performance. This is very similar to the original goal of performance
tuning in PEAK.
4. The programmer can use a compiler to do data flow analysis while debugging
the code. For example, after finding the input and output of a code section,
he or she can roll back program execution by restoring the input, or check the
correctness of an optimized version by comparing its output to that of the
un-optimized version. The re-execution-based rating technique can help with
this.
This kind of debugging system can enable a programmer to modify, analyze,
compile and load code while debugging a program.
LIST OF REFERENCES
[1] Z. Pan and R. Eigenmann, “Fast and effective orchestration of compiler opti-mizations for automatic performance tuning,” in The 4th Annual InternationalSymposium on Code Generation and Optimization (CGO), p. (12 pages), March2006.
[2] Z. Pan and R. Eigenmann, “Rating compiler optimizations for automatic per-formance tuning,” in SC2004: High Performance Computing, Networking andStorage Conference, p. (10 pages), November 2004.
[3] M. E. Wolf, D. E. Maydan, and D.-K. Chen, “Combining loop transforma-tions considering caches and scheduling,” in Proceedings of the 29th annualACM/IEEE international symposium on Microarchitecture, pp. 274–286, 1996.
[4] C. Click and K. D. Cooper, “Combining analyses, combining optimiza-tions,” ACM Transactions on Programming Languages and Systems (TOPLAS),vol. 17, no. 2, pp. 181–196, 1995.
[5] A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. M. Stich-noth, “Fast, effective code generation in a just-in-time java compiler,” in Pro-ceedings of the ACM SIGPLAN 1998 conference on Programming language de-sign and implementation, pp. 280–290, ACM Press, 1998.
[6] M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney, “Adaptive opti-mization in the Jalapeno JVM,” in Proceedings of the 15th ACM SIGPLANconference on Object-oriented programming, systems, languages, and applica-tions, pp. 47–65, ACM Press, 2000.
[7] M. Arnold, M. Hind, and B. G. Ryder, “Online feedback-directed optimiza-tion of java,” in Proceedings of the 17th ACM conference on Object-orientedprogramming, systems, languages, and applications, pp. 111–129, ACM Press,2002.
[8] M. Cierniak, G.-Y. Lueh, and J. M. Stichnoth, “Practicing JUDO: Java underdynamic optimizations,” in Proceedings of the ACM SIGPLAN 2000 conferenceon Programming language design and implementation, pp. 13–26, ACM Press,2000.
[9] D. R. Engler and T. A. Proebsting, “DCG: an efficient, retargetable dynamiccode generation system,” in Proceedings of the sixth international conferenceon Architectural support for programming languages and operating systems,pp. 263–272, ACM Press, 1994.
[10] D. R. Engler, “VCODE: a retargetable, extensible, very fast dynamic code gen-eration system,” SIGPLAN Not., vol. 31, no. 5, pp. 160–170, 1996.
97
[11] K. Ebcioglu and E. R. Altman, “DAISY: Dynamic compilation for 100architec-tural compatibility,” in ISCA, pp. 26–37, 1997.
[12] C. Consel and F. Noel, “A general approach for run-time specialization and itsapplication to c,” in Proceedings of the 23rd ACM SIGPLAN-SIGACT sympo-sium on Principles of programming languages, pp. 145–156, ACM Press, 1996.
[13] P. Lee and M. Leone, “Optimizing ML with run-time code generation,” inSIGPLAN Conference on Programming Language Design and Implementation,pp. 137–148, 1996.
[14] M. Leone and P. Lee, “Dynamic specialization in the fabius system,” ACMComput. Surv., vol. 30, no. 3es, p. 23, 1998.
[15] M. Mock, C. Chambers, and S. J. Eggers, “Calpa: a tool for automating selec-tive dynamic compilation,” in International Symposium on Microarchitecture,pp. 291–302, 2000.
[16] B. Grant, M. Philipose, M. Mock, C. Chambers, and S. J. Eggers, “An eval-uation of staged run-time optimizations in dyc,” in Proceedings of the ACMSIGPLAN 1999 conference on Programming language design and implementa-tion, pp. 293–304, ACM Press, 1999.
[17] J. Auslander, M. Philipose, C. Chambers, S. J. Eggers, and B. N. Bershad,“Fast, effective dynamic compilation,” in SIGPLAN Conference on Program-ming Language Design and Implementation, pp. 149–159, 1996.
[18] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: a transparent dynamicoptimization system,” in Proceedings of the ACM SIGPLAN 2000 conferenceon Programming language design and implementation, pp. 1–12, ACM Press,2000.
[19] E. Duesterwald and V. Bala, “Software profiling for hot path prediction: lessis more,” in Proceedings of the ninth international conference on Architecturalsupport for programming languages and operating systems, pp. 202–211, ACMPress, 2000.
[20] D. Bruening, T. Garnett, and S. Amarasinghe, “An infrastructure for adaptivedynamic optimization,” 2003.
[21] M. C. Merten, A. R. Trick, C. N. George, J. C. Gyllenhaal, and W. W. Hwu, “Ahardware-driven profiling scheme for identifying program hot spots to supportruntime optimization,” in Proceedings of the 26th annual international sympo-sium on Computer architecture, pp. 136–147, IEEE Computer Society, 1999.
[22] M. C. Merten, A. R. Trick, R. D. Barnes, E. M. Nystrom, C. N. George, J. C.Gyllenhaal, and W. mei W. Hwu, “An architectural framework for runtimeoptimization,” IEEE Transactions on Computers, vol. 50, no. 6, pp. 567–589,2001.
[23] E. M. Nystrom, R. D. Barnes, M. C. Merten, and W. mei W. Hwu, “Codereordering and speculation support for dynamic optimization systems,” in Pro-ceedings of the International Conference on Parallel Architectures and Compi-lation Techniques, September 2001.
98
[24] M. Voss and R. Eigenmann, “ADAPT: Automated de-coupled adaptive programtransformation,” in International Conference on Parallel Processing, pp. 163–,2000.
[25] M. J. Voss and R. Eigemann, “High-level adaptive program optimization withADAPT,” in Proceedings of the eighth ACM SIGPLAN symposium on Principlesand practices of parallel programming, pp. 93–102, ACM Press, 2001.
[26] T. Kistler and M. Franz, “Continuous program optimization: A case study,”ACM Trans. Program. Lang. Syst., vol. 25, no. 4, pp. 500–548, 2003.
[27] M. M. Strout, L. Carter, and J. Ferrante, “Compile-time composition of run-time data and iteration reorderings,” in Proceedings of the 2003 ACM SIGPLANConference on Programming Language Design and Implementation (PLDI),June 2003.
[28] L. Rauchwerger and D. A. Padua, “The LRPD test: Speculative run-time par-allelization of loops with privatization and reduction parallelization,” IEEETransactions on Parallel and Distributed Systems, vol. 10, no. 2, pp. 160–??,1999.
[29] F. Dang, H. Yu, and L. Rauchwerger, “The R-LRPD test: Speculative par-allelization of partially parallel loops,” in the 16th International Parallel andDistributed Processing Symposium (IPDPS ’02), 2002.
[30] S. Rus, L. Rauchwerger, and J. Hoeflinger, “Hybrid analysis: static & dynamicmemory reference analysis,” in Proceedings of the 16th international conferenceon Supercomputing, pp. 274–284, ACM Press, 2002.
[31] S. Nandy, X. Gao, and J. Ferrante, “TFP: Time-sensitive, flow-specific profilingat runtime,” in Workshop on Languages and Compiling for Parallel Computing(LCPC), October 2003.
[32] P. C. Diniz and M. C. Rinard, “Dynamic feedback: An effective techniquefor adaptive computing,” in SIGPLAN Conference on Programming LanguageDesign and Implementation, pp. 71–84, 1997.
[33] R. C. Whaley and J. Dongarra, “Automatically tuned linear algebra software,”in SuperComputing 1998: High Performance Networking and Computing, 1998.
[34] T. Kisuki, P. M. W. Knijnenburg, M. F. P. O’Boyle, F. Bodin, and H. A. G.Wijshoff, “A feasibility study in iterative compilation,” in International Sym-posium on High Performance Computing (ISHPC’99), pp. 121–132, 1999.
[35] M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O’Reilly, “Meta opti-mization: improving compiler heuristics with machine learning,” in Proceedingsof the ACM SIGPLAN 2003 conference on Programming language design andimplementation, pp. 77–90, ACM Press, 2003.
[36] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, “Compileroptimization-space exploration,” in Proceedings of the international symposiumon Code generation and optimization, pp. 204–215, 2003.
99
[37] R. P. J. Pinkers, P. M. W. Knijnenburg, M. Haneda, and H. A. G. Wijshoff, “Statistical selection of compiler options,” in The IEEE Computer Society’s 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS’04), (Volendam, The Netherlands), pp. 494–501, October 2004.
[38] A. Hedayat, N. Sloane, and J. Stufken, Orthogonal Arrays: Theory and Applications. Springer, 1999.
[39] K. Chow and Y. Wu, “Feedback-directed selection and characterization of compiler optimizations,” in Second Workshop on Feedback Directed Optimizations, (Israel), November 1999.
[40] E. D. Granston and A. Holler, “Automatic recommendation of compiler options,” in 4th Workshop on Feedback-Directed and Dynamic Optimization (FDDO-4), December 2001.
[41] K. D. Cooper, D. Subramanian, and L. Torczon, “Adaptive optimizing compilers for the 21st century,” The Journal of Supercomputing, vol. 23, no. 1, pp. 7–22, 2002.
[42] R. Joshi, G. Nelson, and K. Randall, “Denali: A goal-directed superoptimizer,” in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pp. 304–314, ACM Press, 2002.
[43] Z. Pan and R. Eigenmann, “Compiler optimization orchestration for peak performance,” Tech. Rep. TR-ECE-04-01, School of Electrical and Computer Engineering, Purdue University, 2004.
[44] G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley and Sons, 1978.
[45] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O’Boyle, “Combined selection of tile sizes and unroll factors using iterative compilation,” in IEEE PACT, pp. 237–248, 2000.
[46] M. Haneda, P. Knijnenburg, and H. Wijshoff, “Generating new general compiler optimization settings,” in Proceedings of the 19th ACM International Conference on Supercomputing, pp. 161–168, June 2005.
[47] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, “Finding effective compilation sequences,” in LCTES ’04: Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, (New York, NY, USA), pp. 231–239, ACM Press, 2004.
[48] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones, “Fast searches for effective optimization phase sequences,” in PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, (New York, NY, USA), pp. 171–182, ACM Press, 2004.
[49] SPEC, SPEC CPU2000 Results. http://www.spec.org/cpu2000/results, 2000.
[50] N. J. A. Sloane, A Library of Orthogonal Arrays. http://www.research.att.com/~njas/oadir/.
[51] GNU, GCC Online Documentation. http://gcc.gnu.org/onlinedocs/, 2005.
[52] Sun, Forte C 6 / Sun WorkShop 6 Compilers C User’s Guide. http://docs.sun.com/app/docs/doc/806-3567, 2000.
[53] Y.-J. Lee and M. Hall, “A code isolator: Isolating code fragments from large programs,” in LCPC, pp. 164–178, 2004.
[54] G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster and more flexible program analysis,” in Workshop on Modeling, Benchmarking and Simulation, June 2005.
[55] E. Perelman, G. Hamerly, M. Biesbrouck, T. Sherwood, and B. Calder, “Using SimPoint for accurate and efficient simulation,” in ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 2003.
[56] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[57] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling,” SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 84–97, 2003.
[58] T. F. Wenisch, R. E. Wunderlich, B. Falsafi, and J. C. Hoe, “TurboSMARTS: Accurate microarchitecture simulation sampling in minutes,” in SIGMETRICS ’05: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, (New York, NY, USA), pp. 408–409, ACM Press, 2005.
[59] T. F. Wenisch and R. E. Wunderlich, “SimFlex: Fast, accurate and flexible simulation of computer systems,” in International Symposium on Microarchitecture (MICRO-38), November 2005.
[60] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu, “Parallel programming with Polaris,” IEEE Computer, vol. 29, pp. 78–82, December 1996.
[61] SUIF2, The SUIF 2 Compiler System. http://suif.stanford.edu/suif/suif2/, 2005.
[62] P. Tu and D. A. Padua, “Gated SSA-based demand-driven symbolic analysis for parallelizing compilers,” in International Conference on Supercomputing, pp. 414–423, 1995.
[63] W. Blume and R. Eigenmann, “Symbolic range propagation,” in the 9th International Parallel Processing Symposium, pp. 357–363, 1995.
[64] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: A call graph execution profiler,” in SIGPLAN Symposium on Compiler Construction, pp. 120–126, 1982.
[65] R. Karp, “Reducibility among combinatorial problems,” in a Symposium on the Complexity of Computer Computations, (New York), pp. 85–103, Plenum Press, 1972.
[66] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, “Effective compiler support for predicated execution using the hyperblock,” in Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 45–54, IEEE Computer Society Press, 1992.
[67] S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W.-M. W. Hwu, “A comparison of full and partial predicated execution support for ILP processors,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 138–150, ACM Press, 1995.
[68] D. I. August, W.-M. W. Hwu, and S. A. Mahlke, “A framework for balancing control flow and predication,” in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 92–103, IEEE Computer Society, 1997.
[69] B. Pottenger and R. Eigenmann, “Idiom recognition in the Polaris parallelizing compiler,” in Proceedings of the 9th International Conference on Supercomputing, pp. 444–448, ACM Press, 1995.
[70] P. Briggs, K. D. Cooper, and L. Torczon, “Improvements to graph coloring register allocation,” ACM Transactions on Programming Languages and Systems, vol. 16, pp. 428–455, May 1994.
[71] P. Bergner, P. Dahl, D. Engebretsen, and M. T. O’Keefe, “Spill code minimization via interference region spilling,” in SIGPLAN Conference on Programming Language Design and Implementation, pp. 287–295, 1997.
[72] J. Park and M. Schlansker, “On predicated execution,” Tech. Rep. HPL-91-58, Hewlett-Packard Software Systems Laboratory, May 1991.
[73] K. M. Hazelwood and T. M. Conte, “A lightweight algorithm for dynamic if-conversion during dynamic optimization,” in 2000 International Conference on Parallel Architectures and Compilation Techniques, pp. 71–80, 2000.
APPENDIX: PERFORMANCE OF GCC OPTIMIZATIONS
1 Introduction
Although compiler optimizations yield significant improvements in many pro-
grams, the potential for performance degradation in certain program patterns is
known to compiler researchers and many users. Potential degradations are well
understood for some techniques, while they are unexpected in other cases. For ex-
ample, the difficulty of employing predicated execution [66–68] or parallel recurrence
substitutions [69] is evident. On the other hand, performance degradation as a re-
sult of alias analysis is generally unexpected. (We will discuss this case in detail in
Section 5.)
In order to quantitatively understand the performance effects of a large number
of compiler techniques, we measure the performance of SPEC CPU2000 benchmarks
under different compiler configurations. We obtain these results on two different
computer architectures, focusing on the GNU Compiler Collection (GCC). The ex-
periments are conducted to answer the following questions: (1) Is the default opti-
mization combination suggested by the compiler good enough? (2) What optimiza-
tions may not always help performance? (3) Is performance degradation specific to a
particular architecture? (4) Do these optimizations have different effects on integer
benchmarks and on floating-point benchmarks?
Section 3 shows the performance of SPEC benchmarks under different GCC op-
timization levels, which are the optimization combinations suggested by GCC. Sec-
tion 4 shows the performance of individual optimizations. Section 5 analyzes the
reasons for the major performance degradation as a result of individual optimiza-
tions. Section 6 summarizes the answers to the above questions.
2 Experimental Setup
We measure the performance of GCC 3.3 optimizations on two different com-
puter architectures: a Pentium IV machine and a SPARC II machine. We focus on
GCC, because it is portable across many different computer architectures, and its
open-source nature helps us to understand the performance behavior of its optimiza-
tions. To verify that our results hold beyond the GCC compiler, we conduct similar
experiments with the Forte compilers from Sun Microsystems [52]. Our conclusions
are valid for these compilers as well, although they generally outperform GCC on
the SPARC machine.
We take the measurements using SPEC CPU2000 benchmarks. To differentiate
the effect of compiler optimizations on integer (INT) and floating-point (FP) pro-
grams, we display the results of these two benchmark categories separately. Among
all the FP benchmarks, facerec, fma3d, galgel, and lucas are written in f90. Because
GCC cannot currently handle f90, we do not measure them.
To ensure reliable measurements, we run the experiments multiple times. The
average execution time represents the performance, while the minimum and the
maximum are shown through “error bars” to indicate the degree of fluctuation. This
fluctuation is relevant where the performance gains and losses of an optimization
technique are small.
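The timing methodology above can be sketched as follows. This is a minimal illustration, not the actual harness used in the experiments; `./benchmark` stands in for a hypothetical compiled SPEC binary.

```python
import subprocess
import time

def measure(cmd, runs=3):
    """Run a benchmark command several times and return (avg, min, max).

    The average represents the performance; the minimum and maximum
    provide the "error bars" that indicate run-to-run fluctuation.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # execute the compiled benchmark
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples), min(samples), max(samples)

# Example (hypothetical binary):
# avg, lo, hi = measure(["./benchmark"])
```

Reporting min and max alongside the average makes it visible when an optimization's apparent gain or loss is smaller than the measurement noise.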
3 Performance of Optimization Levels O1 through O3
GCC provides three optimization levels, O1 through O3 [51], each applying a
larger number of optimization techniques. O0 does not apply any substantial code
optimizations. From Figure A.1, we make the following observations.
1. There is consistent, significant performance improvement from O0 to O1. However, O2 and O3 do not always lead to additional gains; in some cases, performance even degrades. (In Section 5 we will analyze the significant degradation of art.) For different applications, any one of the three levels O1 through O3 may be the best.

[Figure A.1: Execution time of SPEC CPU 2000 benchmarks under different optimization levels compiled by GCC. Each benchmark has four bars, for O0 to O3. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine. The four floating point benchmarks written in f90 are not included, since GCC does not compile them.]
2. As expected, Table A.1 shows that O2 is better on average than O1² and, for
the integer benchmarks, O3 is better than O2. However, for the floating point
benchmarks O2 is better than or close to O3. Most of the performance is gained
from the optimizations in level O1. The performance increase from O1 to O2
is bigger than that from O2 to O3.
²Except the anomalous art, to be discussed in Section 5.
Table A.1: Average speedups of the optimization levels, relative to O0. In each entry, the first number is the arithmetic mean and the second is the geometric mean. For the floating point benchmarks on the Pentium IV machine, the averages without art are given in parentheses.

         INT, Pentium IV   FP, Pentium IV            INT, SPARC II   FP, SPARC II
    O1   1.49/1.47         1.74 (1.77)/1.65 (1.67)   2.32/2.28       3.17/2.88
    O2   1.53/1.50         1.81 (1.95)/1.60 (1.81)   2.50/2.43       4.40/3.78
    O3   1.55/1.51         1.80 (1.94)/1.60 (1.80)   2.58/2.52       4.38/3.79
3. Floating point benchmarks benefit more from compiler optimizations than in-
teger benchmarks. Possible reasons are that floating point benchmarks tend
to have fewer control statements than integer benchmarks and are written in a
more regular way. Six of them are written in Fortran 77.
4. Optimizations achieve higher speedups on the SPARC II machine than on
the Pentium IV machine. Possible reasons are the regularity of RISC versus
CISC instruction sets and the fact that SPARC II has more registers than
Pentium IV. The latter gives the compiler more freedom to allocate registers,
resulting in less register spilling on the SPARC II machine.
5. Eon, the only C++ benchmark, benefits more from optimization than all other
integer benchmarks. On the Pentium IV machine, the highest speedup of eon
is 2.65, while the highest one among other integer benchmarks is 1.73. On the
SPARC II machine, the highest speedup of eon is 3.70, while the highest one
among other integer benchmarks is 3.47.
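The two averages reported in Table A.1 (arithmetic and geometric mean of per-benchmark speedups relative to O0) can be computed as in this sketch; the speedup values below are illustrative, not the measured data.

```python
import math

def arithmetic_mean(speedups):
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Illustrative speedups of three benchmarks relative to O0 (made-up values):
speedups = [1.2, 1.5, 2.0]
print(round(arithmetic_mean(speedups), 3))  # 1.567
print(round(geometric_mean(speedups), 3))   # 1.533
```

The geometric mean is the conventional choice for averaging ratios such as speedups, since a single large outlier (like art here) dominates it less than the arithmetic mean.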
4 Performance of Individual Optimizations
This section discusses different performance behaviors of individual optimization
techniques used in GCC. We measure the execution time with all optimizations on as
the baseline performance. Then, for each optimization x, we measure the execution
time with all optimizations on except x. The performance of optimization x is
represented by its Relative Improvement Percentage (RIP), defined as follows:
    RIP = (execution time without x / execution time of baseline − 1) × 100    (A.1)

RIP represents the percent increase of the program execution time when disabling a given optimization technique. A larger RIP value indicates a bigger positive impact of the technique.
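Equation A.1, together with the one-at-a-time measurement it supports, can be sketched as follows. The flag names are real GCC options, but `time_with_extra_flags` is a hypothetical hook that would rebuild and time the benchmark; here it is fed made-up timings.

```python
def rip(time_without_x, time_baseline):
    """Relative Improvement Percentage of optimization x (Equation A.1):
    the percent increase in execution time when x is disabled."""
    return (time_without_x / time_baseline - 1.0) * 100.0

def rank_optimizations(time_with_extra_flags, flags):
    """The baseline turns all O3 optimizations on; each trial disables
    exactly one optimization via GCC's -fno-<flag> spelling."""
    baseline = time_with_extra_flags([])
    rips = {f: rip(time_with_extra_flags(["-fno-" + f]), baseline)
            for f in flags}
    # Largest RIP (most beneficial) first; degrading options (RIP < 0) last.
    return sorted(rips.items(), key=lambda kv: kv[1], reverse=True)

# Made-up timings in seconds, keyed by the extra flags passed to the compiler:
fake_times = {(): 100.0, ("-fno-gcse",): 104.0, ("-fno-strict-aliasing",): 97.0}
ranked = rank_optimizations(lambda extra: fake_times[tuple(extra)],
                            ["gcse", "strict-aliasing"])
for flag, value in ranked:
    print(flag, round(value, 1))  # beneficial gcse first, degrading flag last
```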
Due to the large amount of data, we do not show all the results, but only representative ones. We make a number of observations and discuss opportunities and needs for better orchestration of the techniques.
1. Ideally, one expects that most optimization techniques yield performance im-
provements with no degradation. This is the case for apsi on the SPARC II
machine, shown in Figure A.2 (a). This situation indicates little or no need and
opportunity for optimization orchestration.
2. In some benchmarks, only a few optimizations make a significant performance
difference, while others have very small effects. vortex on Pentium IV (Figure A.2 (b)) is such an example. Here also, little opportunity exists for performance gain through optimization orchestration.
3. It is possible that many optimizations cause performance degradation, as in twolf on Pentium IV (Figure A.2 (c)), or that individual degradations are large, as in sixtrack on Pentium IV (Figure A.4). In these cases, optimization orchestration may help significantly.
4. In some programs, the relative improvement percentages of individual optimizations are between −1.5 and 1.5. For example, in Figures A.3 (a) and A.3 (b) the improvements are of the same order of magnitude as their
variance. While optimization orchestration may combine small individual gains
to a substantial improvement, the need for accurate performance measurement
becomes evident. Effects such as OS activities need to be considered carefully.
5. The performance improvement or degradation may depend on the computer
architectures. According to the results of twolf on SPARC II, shown in Figure A.3 (c), and the results on Pentium IV in Figure A.2 (c), the optimizations causing degradations are completely different on these two platforms. This clearly shows that optimization orchestration needs to consider the application as well as the execution environment.

[Figure A.2: Relative improvement percentage of all O3 optimizations. Each panel plots the RIP of each individual O3 option, from rename-registers through defer-pop. (a) apsi on a SPARC II machine; (b) vortex on a Pentium IV machine; (c) twolf on a Pentium IV machine.]

[Figure A.3: Relative improvement percentage of all O3 optimizations. (a) bzip2 on a Pentium IV machine; (b) vpr on a Pentium IV machine; (c) twolf on a SPARC II machine.]

[Figure A.4: Relative improvement percentage of all O3 optimizations: sixtrack on a Pentium IV machine.]
In some experiments, few or none of the tested optimizations cause significant
speedups (e.g. Figure A.3 (a)); however, there is significant improvement from O0
to O1, as shown in Figure A.1. This is because some basic optimizations are not
controllable by compiler options. These optimizations include the expansion of built-
in functions, and basic local and global register allocation.
5 Optimizations with Major Negative Effects
This section shows the performance of three optimizations that cause major degradations in some of the benchmarks: strict aliasing, global common subexpression elimination, and if-conversion. We briefly discuss the reasons for these degradations.
[Figure A.5: Relative improvement percentage of strict aliasing. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
5.1 Strict aliasing
Strict aliasing is a simple alias analysis technique. When turned on, objects of
different types are always assumed to reside at different addresses.3 If strict aliasing
is turned off, GCC assumes the existence of aliases very conservatively [51].
Generally, one expects a consistently positive effect of strict aliasing, as it avoids
conservative assumptions. Figure A.5 confirms this view for most cases. However,
the technique also leads to significant degradation in art, shown in Figure A.5 (b).
Its RIP is −64.5.
The degradation in art on Pentium IV is due to the interaction between strict
aliasing and register allocation. GCC implements a graph coloring register allocator [70, 71].

³Strict aliasing in combination with type casting may lead to incorrect programs. We have not observed any such problems in the SPEC CPU2000 benchmarks.

With strict aliasing, the live ranges of the variables used in art become
longer, leading to higher register pressure and spilling. With more conservative
aliasing, the same variables incur memory transfers at the end of their (shorter)
live ranges as well. However, in the given compiler implementation, the spill code,
generated with strict aliasing on, includes substantially more memory accesses than
these transfers, generated without strict aliasing. Thus, strict aliasing causes performance degradation in art. (Unfortunately, this degradation is nearly impossible to predict at compile time, because register allocation is an NP-complete problem.)
As Figure A.5 (d) shows, on SPARC II strict aliasing does not degrade art, but
improves the performance by 10.7%. We attribute this improvement to less spilling
due to the larger number of registers on SPARC II than on Pentium IV.
5.2 Global common subexpression elimination
Global common subexpression elimination (GCSE) employs partial redundancy
elimination (PRE), global constant propagation, and copy propagation [51]. GCSE
removes redundant computation and, therefore, generally improves performance. In
rare cases it increases register pressure by keeping the expression values longer. PRE
may also create additional move instructions, as it attempts to place the results of
the same expression computed in different basic blocks into the same register.
We have also found that GCSE can degrade the performance, as it interacts with
other optimizations. In applu (Figure A.6 (b)), we observed a significant performance degradation in the subroutine JACLD. Detailed analysis showed that this
problem happens when GCSE is used together with the flag force-mem. This flag
forces memory operands to be copied into registers before arithmetic operations,
which generally improves code by making all memory references potential common
subexpressions. However, in applu, this pass evidently interferes with the GCSE
algorithm. Comparing the assembly code with and without force-mem, we found
the former recognized fewer common subexpressions.
[Figure A.6: Relative improvement percentage of global common subexpression elimination. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
5.3 If-conversion
If-conversion attempts to transform conditional jumps into branch-less equiva-
lents. It makes use of conditional moves, min, max, set flags and abs instructions,
and applies laws of standard arithmetic [51]. If the computer architecture supports
predication, if-conversion may be used to enable predicated instructions [72]. By
removing conditional jumps, if-conversion not only reduces the number of branches,
but also enlarges basic blocks, thus helping instruction scheduling. The potential overhead of such
transformations and the opportunities for dynamic optimization are well-known [73].
In our measurements, we found many cases where if-conversion degrades the perfor-
mance.
[Figure A.7: Relative improvement percentage of if-conversion. (a) INT benchmarks on a Pentium IV machine; (b) FP benchmarks on a Pentium IV machine; (c) INT benchmarks on a SPARC II machine; (d) FP benchmarks on a SPARC II machine.]
In vortex, there is a frequently called function named ChkGetChunk, which is
called billions of times in the course of the program. After if-conversion, the number
of basic blocks is reduced, but the number of instructions in the converted if-then-else
construct is still the same. However, the number of registers used in this function
is increased, because the coalesced basic block uses more physical registers to avoid
spilling. Thus, this function has to save and restore more registers. The additional
register saving causes substantial overhead, as the function is called many times.
6 Summary on GCC Optimization Performance
We make the following observations from the above discussion:
• The default optimization combinations do not guarantee the best performance.
There is still room for performance tuning.
• Optimizations may exhibit unexpected performance behavior. Even generally-
beneficial techniques may degrade performance. Degradations are often com-
plex side-effects of the interaction with other optimizations. They are near-
impossible to predict analytically.
• On different architectures, the optimizations may behave differently. This
means that compiler optimizations orchestrated on one architecture may not
suit another.
• A larger number of optimization techniques cause performance degradations in
integer benchmarks. Integer benchmarks often contain irregular code with many
control statements, which tends to reduce the effectiveness of optimizations. On
the other hand, larger degradations (of fewer techniques) occur in floating point
benchmarks. This is consistent with the generally larger effect of optimization
techniques in these programs.
VITA
Zhelong Pan was born in 1976 at Suzhou, China. He received his B.E. degree
in Electrical Engineering from Tsinghua University in July, 1998. He was awarded
“Tsinghua Honorable Excellent Graduate” the same year. Three years later, in July,
2001, he received his M.E. degree in Electrical Engineering from Tsinghua University, where he was awarded the “Excellent Master Thesis” honor. He then moved to West Lafayette, Indiana, to pursue his Ph.D. degree in the School of Electrical and Computer Engineering at Purdue University. He successfully defended his Ph.D. research in April,
2006 and received his Ph.D. degree in May, 2006.
During the period of his master’s program, Zhelong Pan participated in develop-
ing software for power system analysis and simulation, which had been applied to
more than ten power companies in China at that time. In his master’s thesis, he
developed a distributed genetic algorithm to solve the reactive power optimization
problem. As a Ph.D. student at Purdue, he worked on optimizing compilers and
parallel computing. He developed a tiling algorithm for locality enhancement, which
works in concert with the existing parallelization techniques in the Polaris compiler.
He implemented a user-level socket virtualization technique, which enables MPI pro-
gram execution on virtual machines in iShare, an open Internet sharing system. In
his Ph.D. dissertation, he developed fast and effective algorithms and compiler tools
for automatic performance tuning via selecting the best compiler optimization com-
bination.