A Practical Method for Quickly Improving Performance and Reducing Answer Time Through the Selection of Hot Loops Based on the Input Data
Lamia Atma Djoudi and Mohamed Amine Achab
Contact: [email protected]
Independent researchers
Abstract— A quick resolution of a performance problem depends on having a precise methodology and tools for exploring the relationship between performance and computation analysis. It is also necessary to take into account the parameters that have an impact on the methodology process. Input data is a main parameter that affects both the performance gain and the answer time.
In this paper, we propose a strategy that allows us to quickly improve performance by (1) focusing on the hot part of the code and (2) taking into account the input data. Our methodology provides a precise static/dynamic analysis of the selected part of the code and guides the user to apply the best transformation to it. Whatever the number and size of the input data, and whatever the size of the application, our methodology provides the best transformation for the selected part of the code in a short time.
I. INTRODUCTION
Before starting to optimize an application, we must identify
the main factors limiting its performance. For that, two types
of code analysis techniques can be used: static and dynamic
analysis. Most of the work to date has been based either on
static analysis or on measurement. Static analysis is usually
faster than dynamic analysis but less precise. It is therefore
often desirable to combine information from static analysis,
which checks properties of the code, with dynamic analysis,
which evaluates properties against events originating from a
concrete program execution.
Collecting static information is easier than collecting dynamic
information. Obtaining dynamic information on the behavior of
a program is relatively complex: on the one hand the application
size grows, and on the other hand the number, the complexity
and the interactions of the transformations (optimizations) to
apply are significant. Moreover, the validity of this information
depends on the input parameters of the program.
Applying several transformations to a part of the code (or to
the whole application) while taking into account the input data
implies launching the execution several times. The number of
executions depends mainly on the number of input data sets and
the number of transformations to be applied.
Another point to be discussed about transforming an application
to improve performance is: how can we be sure that the applied
transformation is the best one? Unfortunately, this can only be
verified after the execution of the application.
As we know, loop optimization is a critical part of code
optimization. A routinely stated rule is 90/10, i.e. 90% of the
execution time is spent in 10% of the code. Another question
can therefore be asked: is it necessary to optimize all loops,
or can we focus only on the critical ones?
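A simple way to make the 90/10 observation operational is to keep only the loops whose measured times together cover a target share of the total execution time. The sketch below is illustrative only (the profile numbers and loop names are invented, and this is not part of the MAKS-MAQAO tooling):

```python
def select_hot_loops(loop_times, coverage=0.90):
    """Return the smallest set of loops covering `coverage` of total time.

    loop_times: dict mapping loop id -> measured execution time (seconds).
    """
    total = sum(loop_times.values())
    hot, acc = [], 0.0
    # Greedily take the most expensive loops first.
    for loop, t in sorted(loop_times.items(), key=lambda kv: -kv[1]):
        if acc >= coverage * total:
            break
        hot.append(loop)
        acc += t
    return hot

# Invented profile: two loops dominate, as the 90/10 rule predicts.
profile = {"L1": 45.0, "L2": 48.0, "L3": 3.0, "L4": 2.0, "L5": 2.0}
print(select_hot_loops(profile))  # ['L2', 'L1']
```

Only two of the five loops survive the selection, so any subsequent analysis and transformation effort is concentrated where it pays off.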
Given all these constraints, we can imagine the execution time
and the answer time needed to improve the performance of a
large application with several large input data sets. This is why
we believe that we need a strategy that quickly obtains a
dynamic analysis by focusing on a selected part of the code
(source, assembly or binary). If a developer is focusing on the
implementation or the tuning of a key computation, it is far
more efficient and less cumbersome to run just that key
computation, isolated from the rest of the program. We also
need a methodology to determine which transformation should
be applied for each loop and each input data set.
In this paper, we propose an approach to improve performance
by taking into account the input data. With our approach, we
can also apply the best transformation, which means we reduce
the number of transformations applied to each part of the code.
This is done by using a precise static/dynamic analysis.
By selecting the best transformation and taking into account
the input data, we also reduce the answer time.
With our approach, a user (expert or not in code analysis
and optimization) can easily test several input data sets,
different hardware platforms and compilers.
A. Key Issues
Our approach addresses the following issues:
Application: For every application (C, Fortran, assembly),
we need to improve performance by taking into account the
input data. With our approach, we will show that several large
input data sets for a large application are not an obstacle to
quickly improving performance.
Transformations: One of our goals is to apply the best
transformation to each part of the code based on the input data.
This is done by selecting the hot part of the code, then
performing a precise analysis to choose the best transformation.
Answer time: Current scientific applications take a long
time to execute, so applying different transformations on
different architectures and compilers requires a huge amount of
time. Our objective is to propose a new methodology that
reduces the answer time.
B. Motivation
Taking into account the issues presented above, we consider
a large application with several input data sets and a significant
execution time. We use BLAST, the Basic Local Alignment
Search Tool, a set of algorithms used to find similar sequences
between several DNA chains or protein databases. Table I
presents the size and the number of files and loops of this
2012 IEEE 14th International Conference on High Performance Computing and Communications
978-0-7695-4749-7/12 $26.00 © 2012 IEEE
DOI 10.1109/HPCC.2012.298
application.
Using the 7 input data sets of the Blast application, Table II presents
TABLE I
BLAST APPLICATION
Application size Number of Files Number of source loops
45 MB 748 10895
TABLE II
BLAST: TIME EXECUTION FOR DIFFERENT INPUT DATA
Input data Execution Time(sec) Pin (sec) MAKS-MAQAO (sec)
I1 150,07 37153,59 268
I2 4,15 997,15 5,23
I3 24,99 6106,19 25,68
I4 205,08 52142,63 345,11
I5 53,32 13906,69 75,26
I6 211,31 53088,14 288,18
I7 953,4 130756,32 1614,14
the execution time of the original code and the execution
times when using the Pin [19] and MAKS-MAQAO [1] tools.
The choice of these two tools is based on a strategy of
selecting tools to apply selective instrumentation [17]. By
combining these tools, the selective methodology helps us
quickly obtain a precise static/dynamic analysis for a selected
part of the code. More information about MAKS-MAQAO, Pin
and their combination is given in section II-A.3.
In Table II, we note that the execution time using the
analysis tools is larger than the execution time of the original
code. Since we need these tools to analyze and improve
performance, we start by asking the following questions:
• Is it necessary to analyze all the 10895 source loops?
• How many transformations will be applied by taking into
account the 7 input data?
• How many times will we launch this large application?
• What is the overall time to run all the application with
all input data and all the transformations?
• How to choose a transformation? And can it be applied
to any loop?
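To put rough numbers on these questions, we can count the executions an exhaustive search would require: one baseline run per input data set, plus one run per (input data, loop, transformation) triple. A back-of-envelope sketch using the Blast figures from Table I, with 10 candidate transformations per loop assumed as in Section III:

```python
def exhaustive_runs(p, l, m):
    """Number of executions for an exhaustive search:
    p baseline runs + one run per (input, loop, transformation) triple."""
    return p + p * l * m

# Blast: 7 input data sets, 10895 source loops, 10 assumed transformations.
print(exhaustive_runs(p=7, l=10895, m=10))  # 762657
```

Over 760000 runs of an application whose single execution can take hundreds of seconds (Table II) is clearly intractable, which motivates the selection strategy proposed below.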
C. Proposition
Answering the previous questions can be done with the
methodology described in this paper, where we propose to:
• focus on the hot parts of the code;
• select the best transformation and guide the user;
• reduce the answer time.
These main goals are detailed in this paper: quickly finding
the best transformation for each loop while taking into account
the input data.
In the rest of this paper: Section 2 describes our method-
ology where we present a background, a description of our
methodology and an evaluation study. Experimental results are
presented in Section 3. Section 4 presents related work. We
conclude in Section 5.
II. METHODOLOGY
The focus of our effort has been to propose a methodology
that is (1) easy to use, (2) helps the user quickly improve the
performance of his application, and (3) takes into account the
input data. Rather than inventing new performance measures,
new ways to collect measurements, or new tools, we believe
that a good usage of existing tools and performance measures
can answer our needs.
In this section, we discuss our methodology by presenting
its usefulness and benefits. The process of our approach is also
presented in this section. But first, we present an overview
of the tools and performance measures used to achieve the
goals fixed in this paper.
A. Background
1) Analysis: Before starting the optimization of an
application, it is necessary to first identify the main factors
limiting its performance. For that, two types of code analysis
techniques can be used: static and dynamic analysis. Most of
the work has been based either on static analysis or on
measurement. Existing static and dynamic analysis methods
each have their advantages and limitations.
We believe that it is better to have analysis results at the
source, assembly and binary levels. The main goal in the
analysis phase is to perform an accurate analysis, be it static
or dynamic, high level or low level. We believe that combining
the benefits of each technique is superior to any single
technique used alone, at the cost of the added difficulty of
making the right choice between the techniques.
2) Analysis level: By analyzing the performance results, we
should be able (1) to determine the sources of performance
degradation and (2) to improve performance. To do this, we
must have all the information on the source, assembly and
binary codes. The idea is not to analyze each code separately
but to combine the information extracted from the three codes
to obtain a precise analysis.
For static analysis, we propose to extract information from the
source and assembly codes (post-compilation).
For dynamic analysis, we propose to work at the assembly and
binary levels. The idea is to have a collaborative relationship
with the compiler. This level may lack some expressiveness
(compared to compilation passes applied to an abstract
representation), but it does handle complex analyses and
transformations, and allows a direct and precise modeling of
the target platform.
This post-compilation approach has several advantages:
• At the assembly level, almost all compiler-performed
optimizations become visible, which is not the case for
higher-level representations.
• The exploitation of post-compiler optimization
opportunities is not intended as a compiler replacement:
it is guaranteed that no other code transformation will
undo or break the optimization. Also, operating after the
compilation phase allows a precise diagnostic of compiler
optimization successes and/or failures.
• Assembly language is still at a high enough level to
make development possible and optimization achievable.
In contrast to binary executable code, program areas such
as functions and basic blocks are still identifiable.
• The code can be compiled directly and can actually be
processed by any compiler/assembler, unlike an
intermediate representation.
3) Performance Analysis Tools: To obtain accurate static
and dynamic analyses at any level (source, assembly and
binary) and to detect the sources of performance degradation,
several tools have been developed.
For our approach, we need tools that answer our needs.
Mainly, we require tools that work at the source, assembly and
binary levels, to be sure to identify the source of the
performance degradation, and that generate static and dynamic
information at these three levels.
After a careful study of existing tools against these goals,
we chose the following tools:
a) MAKS-MAQAO [2], [1]: stands for Multi-
Architecture Knowledge-based System - Modular Quality
Analyzer and Optimizer. It is the implementation of our
optimization approach. It addresses the performance problem
in all its diversity: static analysis, support for hardware
counters, dynamic instrumentation and profiling, a hybrid
intelligent system in a knowledge-based system to process
the results, source transformations and automatic low-level
optimization for fine loop tuning. MAKS-MAQAO is a tool
to analyze and optimize assembly and source code, based on
compiler optimizations and user criteria, to exploit the
hardware resources.
A distinctive advantage is that this system strongly focuses on
versatility, i.e., users can specify their own analyses and enrich
the performance intelligent system. These capabilities enable
a better control of the optimization process and enhance the
productivity of programmers during code tuning.
b) Pin [19]: It instruments binary code in such a way that
when specific instructions are executed, they are caught and
user-defined instrumentation routines are executed. While
very useful, Pin is more oriented toward prospective
architecture simulation than code performance analysis.
c) M2Pin [17]: Our study and testing of MAKS-MAQAO
and Pin allowed us to identify their disadvantages.
MAKS-MAQAO: despite the quality of its dynamic analysis
and its usability by experts and non-experts alike, it has an
important limitation, namely memory tracing.
Pin: in addition to its execution time and its influence on the
results, as reported by the Pin community [21], it is a tool
dedicated to experts. It also lacks the concept of selectivity,
because it instruments the whole application.
To combine the advantages of the MAKS-MAQAO and Pin
tools (and overcome their disadvantages), a combination of
their strengths was proposed in previous work [17].
This combination enriches the knowledge base of MAKS-
MAQAO with the dynamic information of Pin. To overcome
the drawback of Pin (its execution time), we proposed a
selectivity approach [17]: MAKS-MAQAO selects a part of the
code, which is then instrumented by Pin.
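The selectivity idea behind this combination can be illustrated independently of the actual tools: only a user-selected region of the program is wrapped with measurement code, so the rest runs without instrumentation overhead. A minimal, tool-agnostic sketch (the region names are invented):

```python
import time
from contextlib import contextmanager

SELECTED = {"hot_loop"}   # regions the user chose to instrument
timings = {}

@contextmanager
def instrument(region):
    """Time `region` only if it was selected; otherwise add no overhead."""
    if region not in SELECTED:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[region] = timings.get(region, 0.0) + time.perf_counter() - start

with instrument("cold_setup"):      # not selected: runs uninstrumented
    sum(range(1000))
with instrument("hot_loop"):        # selected: timed
    sum(range(100000))

print(sorted(timings))  # ['hot_loop']
```

Only the selected region appears in the collected timings, which is exactly the property that keeps the instrumentation cost proportional to the hot code rather than to the whole application.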
B. Description
Figure 1 presents an overview of our approach. In order
to accelerate the answer time and improve performance by taking
Fig. 1. Approach overview
into account the input data, we propose to apply the following
steps:
1) The analyzer: The MAKS-MAQAO analyzer [1], [3],
[4] provides precise analysis (static and dynamic) results. A
key feature of MAKS-MAQAO is its ability to value-profile
the code at various granularities. In addition to timing, the
instrumentation also performs value profiling. Value profiling
is often the missing link between the behavior observed on the
hardware and the nature of the application, and it yields
numerous optimization opportunities. Time profiling allows
us to give a precise weight to every executed loop, thereby
highlighting hotspots. Value profiling monitors the iteration
count. Correlating this information provides the relevant
metric, i.e. which hot loops are short. This is a clear
illustration of the interest of a centralized approach to
performance analysis.
The analyzer results can be visualized by the user or presented
in an easy-to-understand profile guide. The hot loop(s) are
presented as the main keys in this profile guide.
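The correlation step described above, identifying which hot loops are short, can be sketched as a join between a time profile and an iteration-count (value) profile. Both profiles below are invented for illustration, and the thresholds are arbitrary:

```python
def short_hot_loops(time_share, trip_counts, time_min=0.30, trips_max=16):
    """Flag loops that are hot (large time share) but short (few iterations)."""
    return [lp for lp in time_share
            if time_share[lp] >= time_min and trip_counts[lp] <= trips_max]

time_share  = {"L7": 0.35, "L2": 0.33, "L9": 0.05}   # fraction of total time
trip_counts = {"L7": 8,    "L2": 4096, "L9": 12}     # median iteration count

print(short_hot_loops(time_share, trip_counts))  # ['L7']
```

Here L2 is hot but long (a candidate for, say, unrolling or blocking), while L7 is hot and short, the combination that neither profile reveals on its own.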
2) The transformer: This module is based on the
information provided by two modules of MAKS-MAQAO: (1)
MAQAOAdvisor [5], [1], which provides precise information
about the applied optimizations, and (2) the expert system [7],
[6], [1], which generates recommendations that guide the user
to apply the best transformation to improve performance.
At this step, once the expert system has generated its
recommendations and if the user is satisfied, the transformer
applies the optimization proposed in the recommendation.
The principal novelty of this transformer module [8] is
its integrated support for a set of source transformations,
directives and compiler pragmas. It gives the user control over
how they are applied. It enables complex optimizations to be
applied, achieving performance that was previously only
attainable through careful hand optimization.
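As an illustration of the kind of source transformation such a module can apply, the sketch below expresses loop unrolling in Python; the actual transformer targets C/Fortran sources, so this only demonstrates that the transformed loop computes the same result as the original:

```python
def dot(a, b):
    """Reference loop: dot product."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    """The same loop unrolled by a factor of 4, with a remainder loop."""
    s = 0.0
    n = len(a)
    i = 0
    while i + 4 <= n:
        s += a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3]
        i += 4
    for j in range(i, n):   # remainder iterations when n is not a multiple of 4
        s += a[j] * b[j]
    return s

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dot(x, x), dot_unrolled4(x, x))  # 91.0 91.0
```

Whether such a transformation actually helps depends on the loop's trip count and the target architecture, which is precisely why the expert system's recommendation is keyed to the input data.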
3) The instrumenter: After selecting the hot loop by taking
into account the input data, the best transformation to apply
is indicated by the execution summary. Once the automatic
transformer has applied this transformation, a selective
instrumentation [17] is launched.
This selective instrumentation is based on the M2Pin
combination (MAKS-MAQAO and Pin). The main axes of this
approach are: (1) the user selects a part of the code (source,
assembly or binary) to be analyzed, and (2) the dynamic
analysis technique is selected; for the selected technique, the
corresponding functionality of the selected tool is launched.
4) The gain calculator: The last step of our methodology
is the calculation of the gain. All the information is then saved
in the knowledge base of MAKS-MAQAO, to be reused in
future experiments or visualized by the user.
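The gain computation in this last step can be sketched as a comparison between the original and optimized timings. The paper does not spell out the exact formula used by the gain calculator, so the sketch below uses the usual definitions of speedup and relative gain, with hypothetical timings:

```python
def gain(t_original, t_optimized):
    """Speedup and relative gain (%) of the optimized code over the original.

    Hypothetical metric definitions; the paper's gain calculator may differ.
    """
    speedup = t_original / t_optimized
    percent = 100.0 * (t_original - t_optimized) / t_original
    return speedup, percent

s, g = gain(t_original=200.0, t_optimized=160.0)
print(s, g)  # 1.25 20.0
```

Storing these values in the knowledge base alongside the input data and the chosen transformation is what allows later runs to reuse past decisions.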
C. Evaluation study
In this section, we present an evaluation study to confirm
the advantages of our approach.
Let p be the number of input data sets, l the number of loops,
and m the number of optimizations for each loop.
For p input data sets, the total execution time for an application
is the sum of the execution times obtained by running the
application for each input data set:

Total(T_exec) = \sum_{i=1}^{p} Time_appli

We suppose that we apply the same set of optimizations to all
loops. In this case, the total execution time for the application
is the sum of the execution times of the original code for each
input data set, plus the sum of the execution times obtained by
running the application for each input data set, each loop and
each transformation:

Total(T_exec) = \sum_{i=1}^{p} Time_appli + \sum_{i=1}^{p} \sum_{j=1}^{l} \sum_{k=1}^{m} Time_appli    (1)
Using our approach, we show that each step of our process
yields a gain.
1) Using hot loops: By selecting the hot loops, suppose
that we have h hot loops (h < l). The total execution time for
the application is:

Total(T_exec) = \sum_{i=1}^{p} Time_appli + \sum_{i=1}^{p} \sum_{j=1}^{h} \sum_{k=1}^{m} Time_appli    (2)
2) Best transformation for each hot loop: With our
methodology, there is only one transformation (m = 1) to be
applied to each hot loop for each input data set. The total
execution time for the application is:

Total(T_exec) = \sum_{i=1}^{p} Time_appli + \sum_{i=1}^{p} \sum_{j=1}^{h} Time_appli    (3)
3) Selective instrumentation: By applying the selective
instrumentation, only the hot loops are run, so the execution
time is smaller than the previous ones:

Total(T_exec) = \sum_{i=1}^{p} Time_appli + \sum_{i=1}^{p} \sum_{j=1}^{h} Time_hotloop    (4)
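Under the simplifying assumption that one run of the whole application costs T per input data set and one instrumented hot-loop run costs T_hot, formulas (1)-(4) reduce to closed forms that can be compared directly. The unit costs below are illustrative, not the Blast measurements:

```python
def f1(p, l, m, T):      return p*T + p*l*m*T      # all loops, all transformations
def f2(p, h, m, T):      return p*T + p*h*m*T      # hot loops only
def f3(p, h, T):         return p*T + p*h*T        # best transformation (m = 1)
def f4(p, h, T, T_hot):  return p*T + p*h*T_hot    # selective instrumentation

p, l, h, m = 7, 10895, 7, 10      # Blast-like parameters
T, T_hot = 1.0, 0.25              # illustrative unit costs (seconds)
vals = [f1(p, l, m, T), f2(p, h, m, T), f3(p, h, T), f4(p, h, T, T_hot)]
print(vals)  # [762657.0, 497.0, 56.0, 19.25]
```

Each step of the methodology removes a multiplicative factor (first l/h, then m, then the full-application run cost), which matches the orders-of-magnitude gaps between F1 and F4 reported in Table III.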
4) Evaluation for one hot loop: Applying our methodology
for one hot loop, we have:
a) Performance: Let α be the fraction between the execution
time of the hot loop and the total execution time of the
application:

α = Time_hotloop / Time_appli

After applying the transformation recommended by the expert
system of MAKS-MAQAO, we have:

α' = Time_hotloop(optimized) / Time_appli, where α' < α.

For one input data set, the total time needed by our approach
is:

Total_time = Time_appli + Time_hotloop(optimized)
Total_time = (1 + α') * Time_appli

Knowing that α' < 1, the total time, i.e. the execution time of
the original code for the input data set plus the time to run the
optimized hot loop, is less than twice the execution time of
the original application. That means that within a short time
we can guide the user to apply the best transformation for the
hot loop while taking into account the input data.
b) Answer time: Applying our approach to one hot loop,
the answer time is derived as follows:

β = Time_instrumentation(hotloop) / Time_instrumentation(appli)

After applying the transformation recommended by the expert
system of MAKS-MAQAO, we have:

β' = Time_instrumentation-hotloop(optimized) / Time_instrumentation-appli, where β' < β.

For one input data set, the total answer time needed by our
approach is:

Answer_time = Time_instrumentation-appli + Time_instrumentation-hotloop(optimized)
Answer_time = (1 + β') * Time_instrumentation-appli
III. EXPERIMENTAL RESULTS
In this section, we evaluate our proposed approach. We
consider two scientific applications: Blast and NR. Experiments
were run on two machines:
• A BULL Itanium 2 NovaScale system, 1.6 GHz, 3 MB of
L3 cache. Codes were compiled using Intel ICC/IFORT 10.1.
• For x86, 4 quad-core sockets at 2.93 GHz, 48 GB of
memory, 1 x 146 GB disk. Codes were compiled using ICC 10.1.
A. Blast application
As described in section I-B, this large application has
several input data sets and 10895 loops. Applying our approach,
we can summarize the optimization steps as follows:
Evaluation: Following the evaluation study described in
section II-C, Table III presents the times obtained by applying
the four formulas described there.
TABLE III
BLAST: ANSWER TIME USING EVALUATION STUDY FORMULAS
F1 F2 F3 F4
Time (sec) 338744343000 3312487 55331 5418
TABLE IV
BLAST: SPEEDUP IN ANSWER TIME, GAIN IN PERFORMANCE BY APPLYING
OUR APPROACH FOR HOT LOOPS
Input data File Hot loop (line) % Exec. time Perf. gain Speedup ans. time
I1 ungapped 500 61,95 0,85 223,33
I2 seqport 376 10,70 2,65 60,65
I3 gapalign 2875 25,23 2,76 7,52
I4 ungapped 500 60,47 1,18 235,20
I5 ungapped 500 49,06 1,60 221,41
I6 ungapped 500 44,86 1,77 231,10
I7 ungapped 500 57,78 1,50 56,67
Note that:
• For the four formulas:
– F1 is the total execution time for the application: the
sum of the execution times of the original code for
each input data set, plus the sum of the execution
times obtained by running the application for each
input data set, each loop and each transformation.
– F2 is the same, except that only the hot loops are run
for each input data set and each transformation.
– F3 is the same, except that the hot loops are run for
each input data set with only the best transformation
proposed by our methodology.
– F4 is the total execution time when applying selective
instrumentation to each hot loop and applying the
best transformation.
• We have 7 input data sets and 7 hot loops.
• We take 10 transformations (it can be more) for each of
the 10895 loops.
In Table III, we see that a very large amount of time is needed
when executing the application for the 7 input data sets with
several transformations for each loop. This is why we focus on
the hot loops. Compared to the original execution time, the
time our approach needs (F4) is noticeable; but compared to
the other times (F1, F2 and F3), our methodology is clearly
the best.
Based on formula F4, Table IV presents, for each input data
set, the source line of the hot loop and the file that contains
it. The third column gives the percentage of the execution
time spent in the hot loop; most hot loops take a significant
share of the execution time. The two last columns summarize
the gain in performance and the speedup in answer time
obtained by applying our approach. This is done by focusing
on the hot loop: first finding the hot loop for each input data
set, then generating the static/dynamic analysis, and finally
applying the best transformation.
TABLE V
NR APPLICATIONS, FILES, LOOPS AND SIZE FOR SMALL AND LARGE
INPUT DATA:
Application GaussJordan Jacobi Mprove Toeplz Tridag
Files 8 5 9 4 2
loops 9 8 9 4 2
small I data 16 16 16 16 8192
Large I data 1024 1024 1024 1024 8388608
TABLE VI
NR: SPEEDUP IN ANSWER TIME AND GAIN IN PERFORMANCE BY APPLYING
OUR APPROACH FOR HOT LOOPS
Input data Loop % Exec. time Perf. gain Speedup ans. time
16 Loop7 34,28 1,65 2,13
32 Loop7 35,84 4,84 2,54
64 Loop7 36,20 8,02 3,81
128 Loop7 34,97 12,51 8,10
200 Loop8 33,46 15,16 11,54
256 Loop2 33,94 17,56 13,34
500 Loop2 54,43 25,48 14,11
512 Loop2 53,89 24,38 14,20
999 Loop2 32,64 1,57 12,71
1024 Loop2 32,38 1,99 12,91
1) Benefits of focusing on hot loops: As shown in Table
IV, the hot loops depend on the input data: for each input
data set, there is one hot loop in one hot file. This confirms
that we must take the input data into account to improve
performance. The percentage of execution time taken by each
hot loop confirms the benefit of selecting the hot loop among
the 10895 loops.
2) Gain: As described above, our main goals are
improving performance and reducing the answer time. For the
Blast application the performance gain is small, but the
speedup in answer time is very significant.
Despite these small performance gains, we presented this
application to show how large the execution time becomes when
applying transformations for multiple data sets. In this example,
we gain in answer time because we focus on the hot loop and
apply a single transformation for each input data set.
B. NR applications
We present experimental results for the Solution of Linear
Algebra Equations applications from the Numerical Recipes
(NR) collection. These applications have ten input data sets.
Table V summarizes the different NR applications, giving the
number of files and loops for each one, together with the sizes
of the smallest and largest input data sets.
Following the same process as for the Blast application, we
choose the GaussJordan application to present the advantages
of our approach. Table VI presents the hot loops, the percentage
of the execution time of the hot loops, and the gain in
performance and answer time. We note that:
• The hot loop is not the same for all input data
Fig. 2. NR hot loops: execution time on x86
Fig. 3. NR performance Gain
• The percentage of the execution time of the hot loop for
each input data encourages us to focus on analyzing and
optimizing just the hot loop.
• For most hot loops and input data, we have an important
gain in performance and also an important speedup in
answer time.
For all NR applications, we use the average and maximal
input data sizes to present the gain in answer time and in
performance.
Figure 2 presents the execution time of each hot loop in
each application (run on x86). Most of these hot loops take
more than 30% of the execution time of the application. This
is why we applied our approach by instrumenting just the
hot loops. On Itanium 2 the picture is the same.
Figure 3 presents the gain in performance using Itanium 2
and x86 architectures.
Figure 4 presents the speedup in answer time for all NR
applications, summarizing the speedup for each whole
application. We note that all speedups are significant, whatever
the architecture and the input data.
For the evaluation study (described in section II-C), Figures
5 and 6 present the execution times of the four formulas.
For each formula, we add the following values to the
execution time of the original application:
• For F1: the total execution time obtained by applying
several transformations for each loop and each input data set.
• For F2: the time to execute the hot loops for several
transformations and input data sets.
• For F3: the time to execute the application with the best
transformation applied, for each hot loop and each input
data set.
• For F4: the time to execute just the hot loop with the best
transformation for each input data set, under selective
instrumentation.
Fig. 4. NR Speedup in answer time
Fig. 5. Evaluation study for NR applications on X86
It is clearly visible that for both architectures our approach
is the best. For x86 (Figure 5), F4 (our approach) takes a very
short answer time compared to the others. For example, for
the Tridag application we need 53 seconds, while the execution
of the original code takes 23 seconds; running and
instrumenting Tridag for ten input data sets and two loops
would require 182321 seconds. In this figure, F1, F2 and F3
must be divided by 75, 25 and 10 respectively to compare the
results with our approach.
The same remarks hold for the Itanium architecture (Figure 6):
F1, F2 and F3 must be divided by 100, 25 and 10 respectively
to compare the results with our approach.
IV. RELATED WORK
In this section we briefly discuss related work on the
impact of input data on performance, on performance analysis
tools, and on selective instrumentation.
A. Input data and performance
Most researchers use a limited number of input data sets to
validate their research, mainly in code optimization. The main
reason for this limitation is that launching the application
several times requires a huge execution time. For this reason,
several researchers have proposed ways to investigate multiple
input data sets [9], [10], [11], [12], [13]. They mainly focus
on the impact of data sets on the parameterization and
selection of compiler optimizations.
Fig. 6. Evaluation study for NR applications on Itanium 2
Zhong et al. [14] present two techniques to predict how
program locality is affected across data sets. Several studies
[9], [10], [11], [12], [13] underscore the fact that a significant
number of iterations (tens or hundreds) are required to find
the best combination of compiler optimizations.
The number of input data sets is not an obstacle to our
approach, because we show that selecting the best optimization
requires much less time than the total execution time.
Chen et al. [15] evaluate the effectiveness of iterative
optimization across a large number of data sets. They show
the possibility of learning the best compiler optimizations
across distinct data sets. Unfortunately, their method is applied
to the whole program, which also requires a significant amount
of time to choose the best transformation.
With our approach, we investigate fine-grain optimization
because we focus on loops. Our execution time is also smaller
because we select only the hot loops to be analyzed and tested
with the candidate optimizations in order to select the best one.
B. Tools for Code Analysis and Optimization
Most performance analysis tools and toolkits fall into two
main classes: static and dynamic analysis.
Hardware monitors are extremely helpful for performance
tuning; they are the backbone of analysis tools like VTune [22]
and Cprof [18]. Their usage is so widespread that an API has
been standardized to describe their access [23]. Nevertheless,
hardware counters are limited to the dynamic description of an
application, and this picture needs to be correlated with other
metrics. DPCL [24], based on Dyninst [25], helps developers
to perform dynamic instrumentation of parallel jobs. Even if
dynamic instrumentation is very appealing, DPCL does not
include any notion of code inspection.
ATOM [26] and Pin [19] instrument assembly/binary code
in such a way that when specific instructions are executed,
they are caught and user-defined instrumentation routines are
executed. While very useful, ATOM and Pin are more oriented
toward prospective architecture simulation than code
performance analysis. EEL [16] belongs to the same category
of tools: this C++ library allows editing a binary and adding
code fragments on the edges of the disassembled application
CFG. It can therefore be used as a foundation for an analysis
tool, but does not provide performance analysis by itself. The
TAU [20] Performance System is a portable profiling and
tracing toolkit for the performance analysis of parallel
programs. TAU combines different tools, but there is no
information interchange between them; its other major
drawback is its reliance on source code instrumentation.
HPCview [27] and Finesse [28] address the analysis
problem from both the static and dynamic sides. HPCview
tackles the same problem as MAKS-MAQAO: the complex
interaction between source code, assembly, performance and
hardware monitors. HPCview presents a well-designed,
web-browser-based GUI displaying simultaneous views of the
source code, the assembly code and the dynamic information.
This interface is connected to a database storing, for each
statement of the assembly code, a summary of its dynamic
behavior. HPCview also lacks value profiling, even though it
is simple to implement.
Vista [29] is an interesting crossover between a compiler
and a performance tool. Addressing the issue of ordering compiler
optimization phases, this complete framework allows
interactive, step-by-step compilation. Coupled with its own
compiler, Vista lets users interactively test and configure
compilation phases for a code fragment. While conceptually
close to MAKS-MAQAO, Vista remains more a compiler
project than a performance analyzer.
Shark [30] offers a comprehensive interface for performance
problems. Like MAKS-MAQAO, it performs its analyses at the assembly
level and displays source code as well as profiling
information. Shark lacks instrumentation and value profiling.
DSPInst [33] is a binary instrumentation tool that lets the
user select the part of the code to be instrumented
(function, loop, etc.). Several results are generated (data cache
misses, memory traces, etc.). Despite its advantages, its one
major disadvantage is that it targets a single architecture, the
Blackfin.
C. Selectivity
Shende et al. [31] propose a selective mode on source code.
With their approach, it is possible to select the functions and instructions
to be instrumented. However, source instrumentation can
interfere with compiler optimizations, and it is not useful
when using libraries.
Hernandez et al. [32] apply selective instrumentation in
the OpenUH compiler. Based on static estimation, they propose
to instrument procedures. Their major drawback is that their tool
cannot select the hot function/loop/instruction.
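The idea behind such static-estimation-driven selection can be sketched as follows; the cost model and procedure attributes are purely illustrative, not the OpenUH implementation:

```python
# Sketch: select which procedures to instrument based on a static
# cost estimate, instead of instrumenting everything.
# The cost model below is illustrative only.

def estimated_cost(proc):
    """Very rough static weight: loop nesting depth dominates."""
    return proc["instructions"] * (10 ** proc["loop_depth"])

def select_for_instrumentation(procs, budget=2):
    """Keep only the `budget` procedures with the highest estimated cost."""
    ranked = sorted(procs, key=estimated_cost, reverse=True)
    return [p["name"] for p in ranked[:budget]]

procs = [
    {"name": "init",   "instructions": 200, "loop_depth": 0},
    {"name": "kernel", "instructions": 80,  "loop_depth": 3},
    {"name": "io",     "instructions": 500, "loop_depth": 1},
]
print(select_for_instrumentation(procs))  # ['kernel', 'io']
```

Because the estimate is computed without running the program, the selection itself adds no runtime overhead; the price is exactly the imprecision noted above, since a statically "cheap" procedure may still be the dynamic hot spot.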
There is much research on automatically selecting the
best compiler optimizations. The works [87, 115, 124] are
based on iteratively enabling certain optimizations, running the
compiled program, and, based on its performance, deciding
on a new optimization setting. Compilers apply a complete,
fixed pipeline of optimizations from the source code to the
binary [8]. Cohen et al. [21] and Cavazos et al. [67] use hardware
counters to generate heuristics that predict good optimizations.
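The iterative scheme these works share can be sketched as follows. The flag space and the timing function are made-up stand-ins: in a real system, `run_time` would compile the program with the given settings and time the resulting binary.

```python
import itertools

# Sketch of the iterative-compilation loop: enumerate optimization
# settings, "run" each one, and keep the fastest. run_time() is an
# analytical stand-in for compiling and timing the real program.

FLAGS = {"unroll": [1, 2, 4], "vectorize": [False, True]}

def run_time(unroll, vectorize):
    t = 100.0 / unroll + 2.0 * unroll   # unrolling helps, then costs code size
    if vectorize:
        t *= 0.6
    return t

best = min(
    (dict(zip(FLAGS, combo)) for combo in itertools.product(*FLAGS.values())),
    key=lambda s: run_time(**s),
)
print(best)  # {'unroll': 4, 'vectorize': True}
```

Even this toy space has 6 points to evaluate; with realistic flag sets the product grows combinatorially, which is precisely why the search space must be pruned, as discussed next.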
Our work concentrates on the post-compiler stage; hence we are
sure that the compiler does not undo our optimizations. For our
approach, we show that static analysis is an important step toward
proposing a good optimization. Hardware counters are used to
complete the MAKS-MAQAO process; they are implemented
in MAQAO, and MAQAOAdvisor guides users with hardware
counters. We propose to add an extra phase to the process of
iterative compilation in order to reduce its search space. For one execution of the source code, MAQAOAdvisor
guides the user to generate a small number of versions for
each hot loop. This number is limited by the maximum unrolling
factor, the code size, and the performance.
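How such bounds keep the per-loop version count small can be sketched as follows; the thresholds are illustrative, not MAQAOAdvisor's actual limits:

```python
# Sketch: enumerate unrolled versions of one hot loop, pruning by a
# maximum unrolling factor and a code-size budget. The numbers are
# illustrative, not MAQAOAdvisor's actual limits.

def candidate_versions(body_size, max_unroll=8, size_budget=256):
    """Return the unroll factors worth generating for one hot loop."""
    factors = []
    u = 1
    while u <= max_unroll:
        if body_size * u <= size_budget:   # unrolling copies the body u times
            factors.append(u)
        u *= 2                             # try powers of two: 1, 2, 4, 8
    return factors

print(candidate_versions(body_size=48))  # [1, 2, 4]
```

A loop with a large body thus gets few (or no) extra versions, so the number of compile-and-run evaluations per hot loop stays bounded regardless of the application's size.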
V. CONCLUSION
Automatic analysis and quick resolution of a performance
problem depend on having a precise methodology/tool
for exploring the relationship between performance and computation
analysis.
In this paper we have proposed an approach that improves
performance and reduces answer time by applying the
best transformation to each hot loop while taking the input data
into account. Our technique is faster than existing methods
because it selects and evaluates only the hot part of the code.
Using our methodology, we can easily understand the compiler
optimizations applied to an application, the source code
transformations, and the usage of hardware resources. It is
also possible to build a summary that defines an abstract representation
of the application, or of a selected part of the program,
in order to capture the parameters that affect performance. The
results are then presented in an elaborate format that can be
easily understood and interpreted by a user who is not an expert
in code optimization. The user can also be guided to apply
the best transformation to improve performance.
In the future, we plan to use large sets of input data (hundreds or
thousands of inputs). Focusing on fine-grain optimization
while using such a set of input data will help us investigate
iterative optimization, which has not been evaluated up to now.
We also plan to extend our approach in two important ways.
First, we plan to propose an infrastructure that cooperates with
parallel information; in this way, it becomes possible to combine
other tools in our infrastructure. Second, we plan to study tool
overhead: with our approach, it will be easier to study this
overhead and to propose ways to reduce it.
REFERENCES
[1] L. Djoudi. MAKS-MAQAO: An Intelligent Integrated Performance Analysis and Optimization Framework. PhD thesis, 2009.
[2] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J-T. Acquaviva,MAQAO: Modular Assembler Quality Analyzer and Optimizer for Ita-nium 2 Workshop on EPIC architectures and compiler technology, 2005.
[3] L.Djoudi, D.Barthou, O.Tomaz, A.Charif-Rubial, J.-T. Acquaviva,W.Jalby The Design and Architecture of MAQAOPROFILE: an In-strumentation MAQAO Module Workshop on EPIC Architectures andCompiler Technology, San Jose, Mar. 11-14, 21 pages(2007)
[4] Lamia Djoudi and William Jalby SA-IDMA: An Accurate and EffectiveMethodology of Combining Static and Dynamic Analysis Conferenceon Genie Electrique(CGE), polytechnique military school in Algiers, Apr.13-14, 7 pages(2009).
[5] Lamia Djoudi, Jose Noudohouenou and William Jalby The design andthe architecture of MAQAOAdvisor: A Live Tuning Guide InternationalConference on High Performance Computing (HiPC), India, Dec. 17-20,14 pages(2008).
[6] Lamia Djoudi and William Jalby KBS-MAQAO: A Knowledge-BasedSystem For MAQAO Tool High Performance Computing and Communi-cations (HPCC), Seoul, Jun. 25-27, 17 pages(2009).
[7] Lamia Djoudi and Mohamed Amine Achab The Design and Architectureof an Expert System for MAQAO Tool The 2010 World Congress inComputer Science, Computer Engineering, and Applied Computing, LasVegas, Jul. 12-15, 9 pages(2010)
[8] Submitted
[9] K. D. Cooper, A. Grosul, T. J. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 69–77, July 2005.
[10] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. Using machine learning to focus iterative optimization. International Symposium on Code Generation and Optimization (CGO), pages 295–305, March 2006.
[11] M. Stephenson, M. Martin, and U. O'Reilly. Meta optimization: Improving compiler heuristics with machine learning. Conference on Programming Language Design and Implementation (PLDI), pages 77–90, June 2003.
[12] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones. Fast searches for effective optimization phase sequences. Conference on Programming Language Design and Implementation (PLDI), pages 171–182, June 2004.
[13] B. Franke, M. O'Boyle, J. Thomson, and G. Fursin. Probabilistic source-level optimisation of embedded programs. Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 78–86, July 2005.
[14] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. Transactions on Programming Languages and Systems (TOPLAS), 31(6), Aug. 2009.
[15] Y. Chen, L. Eeckhout, G. Fursin, L. Peng, O. Temam, and C. Wu. Evaluating iterative optimization across 1000 data sets. Conference on Programming Language Design and Implementation (PLDI), 2010.
[16] J. R. Larus and E. Schnarr. EEL: Machine-Independent Executable Editing. PLDI 1995.
[17] Submitted.
[18] http://sourceforge.net/projects/cprof
[19] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. MICRO 37, Portland, 2004.
[20] K. Windisch, B. Mohr, and A. Malony. A brief technical overview of the TAU tools.
[21] Gang-Ryung Uh, Robert Cohn, Bharadwaj Yadavalli, Ramesh Peri, andRavi Ayyagari. Analyzing Dynamic Binary Instrumentation OverheadWorkshop on Binary Instrumentation and Application (2007).
[22] Intel Corporation. VTune Performance Analyzerhttp://www.intel.com/software/products/vtune
[23] Jack Dongarra, Kevin S. London, Shirley Moore, Philip Mucci, DanielTerpstra, Haihang You, Min Zhou. Experiences and Lessons Learnedwith a Portable Interface to Hardware Performance Counters. IPDPS03
[24] Luiz De Rose, Ted Hoover Jr. and Jeffrey K. Hollingsworth, TheDynamic Probe Class Library: An Infrastructure for Developing Instru-mentation for Performance Tools, IPDPS 2001: 66
[25] B. R. Buck and J. K. Hollingsworth. An API for runtime code patching. International Journal of High Performance Computing Applications, pages 317–329, 2000.
[26] Amitabh Srivastava and Alan Eustace. ATOM - A System for BuildingCustomized Program Analysis Tools. PLDI 1994: 196-205
[27] J. Mellor-Crummey, R. Fowler and G. Marin. HPCView: A tool for top-down analysis of node performance. Computer Science Institute SecondAnnual Symposium, Santa Fe, NM, October 2001.
[28] N. Mukherjee, G.D. Riley and J.R. Gurd. FINESSE: A PrototypeFeedback-guided Performance Enhancement System. Parallel and Dis-tributed Processing (PDP) 2000, Rhodes, Greece, January 2000
[29] W. Zhao and B. Cai and D. Whalley and M. Bailey and R. van Engelenand X. Yuan and J. Hiser and J. Davidson and K. Gallivan and D. Jones,Vista: a system for interactive code improvement, In Proceedings ofthe joint conference on Languages, compilers and tools for embeddedsystems, pages 155–164. ACM Press, 2002.
[30] Optimizing Your Application with Shark 4. http://developer.apple.com/tools/shark_optimize.html
[31] S. Shende, Allen D. Malony, A. Morris Optimization of Instrumentationin Parallel Performance Evaluation Tools PARA’06 Proceedings of the8th international conference on Applied parallel computing: state of theart in scientific computing
[32] O. Hernandez, H. Jin, B. Chapman. Compiler Support for EfficientInstrumentation PARA’07 Proceedings of the 8th international conferenceon Applied parallel computing: state of the art in scientific computing
[33] E. Sun, D. Kaeli Binary Instrumentation Tool for the Blackfin ProcessorWBIA ’09 Proceedings of the Workshop on Binary Instrumentation andApplications