
A Practical Method for Quickly Improving Performance and Reducing Answer Time Through the Selection of Hot Loops Based on the Input Data

Lamia Atma Djoudi and Mohamed Amine Achab

Contact: [email protected]

Independent researchers

Abstract— A quick resolution of the performance problem depends on having a precise methodology/tool for exploring the relationship between performance and computation analysis. It is also necessary to take into account the parameters that have an impact on the methodology process. Input data is a key parameter that directly affects both the performance gain and the answer time.

In this paper, we propose a strategy which allows us to quickly improve performance by (1) focusing on the hot part of the code and (2) taking into account the input data. Our methodology provides a precise static/dynamic analysis for the selected part of the code. It guides the user to apply the best transformation for the hot part of the code. Whatever the number and size of the input data, or the size of the application, our methodology provides in a short time the best transformation for the selected part of the code.

I. INTRODUCTION

Before starting to optimize an application, we must identify the main factors limiting its performance. For that, two types of code analysis techniques can be used: static and dynamic analysis. Most of the work to date has been based either on static analysis or on measurement. Static analysis is usually faster than dynamic analysis but less precise. It is therefore often desirable to combine information from static analysis, which checks properties of the code, with dynamic analysis, which evaluates properties against events originating from a concrete program execution.

Collecting static information is easier than obtaining dynamic information. Obtaining dynamic information on the behavior of a program is relatively complex: on the one hand the application size grows, and on the other hand the number, the complexity and the interactions of the transformations (optimizations) to apply are significant. Moreover, the validity of this information depends on the input parameters of the program. Applying several transformations to a part of the code (or to the whole application) while taking into account the input data implies launching the execution several times. The number of executions depends mainly on the number of input data and the number of transformations to be applied.

Another point to be discussed about transforming an application to improve performance is: how can we be sure that the applied transformation is the best one? Unfortunately, this is only verified after the execution of the application. Loop optimization is a critical part of code optimization; a routinely stated rule is 90/10, i.e., 90% of the execution time is spent on 10% of the code. Another question can then be asked: is it necessary to optimize all loops, or can we focus just on the critical loops?

Given all these constraints, one can imagine the execution time and the answer time needed to improve the performance of a large application with several large input data. This is why we believe that we need a strategy which allows us to quickly obtain a dynamic analysis by focusing on a selected part of the code (source, assembly or binary). If a developer is focusing on the implementation or the tuning of a key computation, it is far more efficient and less cumbersome to run just the key computation, isolated from the rest of the program. We also need a methodology to determine which transformation should be applied for each loop and each input data.

In this paper, we propose an approach to improve performance by taking into account the input data. With our approach, we can also apply the best transformation; that means we reduce the number of transformations applied to each part of the code. This can be done by using a precise static/dynamic analysis. By selecting the best transformation and taking into account the input data, we also reduce the answer time. With our approach, a user (expert or not in code analysis and optimization) can easily use and test several input data, different hardware and compilers.

A. Key Issues

Our approach addresses the following issues:

Application: For every application (C, Fortran, assembly), we need to improve performance by taking into account the input data. With our approach, we will show that several large input data for a large application are not a problem or an obstacle to quickly improving performance.

Transformations: One of our goals is to apply the best transformation to each part of the code based on the input data. This can be done by selecting the hot part of the code, then performing a precise analysis to apply the best transformation.

Answer time: Current scientific applications take a long time to execute, so applying different transformations on different architectures and compilers requires a huge amount of time. Our objective is to propose a new methodology that reduces the answer time to satisfy the user.

B. Motivation

Taking into account the issues presented above, we consider a large application that has several input data and a significant execution time. We use the Basic Local Alignment Search Tool (BLAST10), a set of algorithms used to find similar sequences between several DNA chains or protein databases. Table I presents the size and the number of files and loops of this application.

TABLE I
BLAST APPLICATION

Application size   Number of files   Number of source loops
45 MB              748               10895

Using the 7 input data of the Blast application, Table II presents the execution time of the original code and the execution times obtained when using the Pin [19] and MAKS-MAQAO [1] tools.

TABLE II
BLAST: EXECUTION TIME FOR DIFFERENT INPUT DATA

Input data   Execution time (sec)   Pin (sec)    MAKS-MAQAO (sec)
I1           150.07                 37153.59     268
I2           4.15                   997.15       5.23
I3           24.99                  6106.19      25.68
I4           205.08                 52142.63     345.11
I5           53.32                  13906.69     75.26
I6           211.31                 53088.14     288.18
I7           953.4                  130756.32    1614.14

The choice of these two tools is based on a strategy of tool selection for applying selective instrumentation [17]. With the combination of these tools, the selective methodology helps us to quickly obtain a precise static/dynamic analysis for a selected part of the code. More information about MAKS-MAQAO, Pin and their combination is given in section II-A.3.

In Table II, we remark that the execution time under the analysis tools is much larger than the execution time of the original code. Since we need these tools to analyze and improve performance, we start by asking the following questions:

• Is it necessary to analyze all 10895 source loops?
• How many transformations will be applied, taking into account the 7 input data?
• How many times will we launch this large application?
• What is the overall time to run the whole application with all input data and all transformations?
• How do we choose a transformation? And must it be applied to every loop?

C. Proposition

Answering the previous questions can be done by using the methodology described in this paper, where we propose to:

• focus on the hot parts of the code;
• select the best transformation and guide the user;
• reduce the answer time.

These main goals will be detailed in this paper in order to quickly find the best transformation for each loop while taking into account the input data.

In the rest of this paper: Section 2 describes our methodology, presenting background, a description of the approach and an evaluation study. Experimental results are presented in Section 3. Section 4 presents related work. We conclude in Section 5.

II. METHODOLOGY

The focus of our effort has been to propose a methodology that is (1) easy to use, (2) helps the user quickly improve the performance of his application, and (3) takes into account the input data. Rather than inventing new performance measures, new ways to collect measurements, or new tools, we believe that good usage of existing tools and performance measures can meet our needs.

In this section, we discuss our methodology by presenting its usefulness and benefits, together with the process of our approach. First, however, we present an overview of the tools and performance measures used to achieve the goals set in this paper.

A. Background

1) Analysis: Before starting the optimization of an application, it is necessary to first identify the main factors limiting its performance. For that, two types of code analysis techniques can be used: static and dynamic analysis. Most existing work has been based either on static analysis or on measurement, and existing static and dynamic analysis methods each have their advantages and limitations. We believe it is better to have analysis results at the source, assembly and binary levels. The main goal of the analysis phase is to perform an accurate analysis, be it static or dynamic, high level or low level. We believe that combining the benefits of each technique is superior to any single technique that excludes the other, at the added difficulty of making the right choice between the techniques.

2) Analysis level: By analyzing the performance results, we should be able (1) to determine the sources of performance degradation and (2) to improve performance. To do this, we must have all the information on the source, assembly and binary codes. The idea is not to analyze each code separately, but to exploit all the information extracted from the three codes in order to obtain a precise analysis.

For static analysis, we propose to extract information from the source and assembly codes (post-compilation). For dynamic analysis, we propose to work at the assembly and binary levels. The idea is to have a collaborative relationship with the compiler. This may lack some expressiveness (compared to compilation passes applied to an abstract representation), but it does handle complex analyses and transformations, and it allows a direct and precise modeling of the target platform. This post-compilation approach has several advantages:

• At the assembly level, almost all compiler optimizations become visible, which is not the case for higher-level representations.
• The exploitation of post-compiler optimization opportunities is not intended as a compiler replacement: it is guaranteed that no other code transformation will undo or break the optimization. Also, operating after the compilation phase allows a precise diagnostic of compiler optimization successes and/or failures.
• Assembly language is still at a high enough level to make development possible and optimization achievable. In contrast to binary executable code, program areas such as functions and basic blocks are still identifiable.
• The code can be compiled directly and can actually be processed by any compiler/assembler, unlike an intermediate representation.

3) Performance Analysis Tools: To obtain accurate static/dynamic analyses at every level (source, assembly and binary) and to detect the sources of performance degradation, several tools have been developed.

For our approach, we need tool(s) that answer our needs. Mainly, we require tool(s) that work at the source/assembly and binary levels, so as to reliably identify the sources of performance degradation. The tool(s) must also generate static/dynamic information at three levels: source, assembly and binary. After a careful study of existing tools against our goals, we chose the following tools:

a) MAKS-MAQAO [2], [1]: stands for Multi-Architecture Knowledge-based System - Modular Quality Analyzer and Optimizer. It is the implementation of our optimization approach. It addresses the performance problem in all its diversity: static analysis, support for hardware counters, dynamic instrumentation and profiling, a hybrid intelligent system in a knowledge-based system to process the results, source transformations, and automatic low-level optimization for fine loop tuning. MAKS-MAQAO is a tool to analyze and optimize assembly and source applications based on compiler optimizations and user criteria to exploit the hardware resources. A distinctive advantage is that this system strongly focuses on versatility, i.e., users can specify their own analyses and enrich the performance intelligent system. These capabilities enable better control of an optimization process and enhance the productivity of programmers during code tuning.

b) Pin [19]: It instruments binary codes in such a way that when specific instructions are executed, they are caught and user-defined instrumentation routines are executed. While very useful, Pin is more oriented toward prospective architecture simulation than code performance analysis.
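To make this instrumentation model concrete, the following is a minimal sketch of a Pin tool in the spirit of the classic instruction-counting example from the Pin documentation; it is our own illustration, not the tool used in this work. An instrumentation routine is invoked when Pin first translates each instruction and inserts a call to a user-defined analysis routine:

```cpp
// Minimal Pin tool sketch: counts every instruction executed by the target.
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: runs each time an instrumented instruction executes.
VOID docount() { icount++; }

// Instrumentation routine: called once per instruction at translation time;
// inserts a call to docount() before the instruction.
VOID Instruction(INS ins, VOID *v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

// Called when the instrumented application exits.
VOID Fini(INT32 code, VOID *v) {
    std::cerr << "Executed instructions: " << icount << std::endl;
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;        // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0); // hook every instruction
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```

Note that such a tool instruments every instruction of the whole application; this all-or-nothing behavior is exactly what motivates the selective approach described next.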

c) M2Pin [17]: Our study and testing of MAKS-MAQAO and Pin allowed us to pinpoint their disadvantages. MAKS-MAQAO: despite the importance of its dynamic analysis, and its usability by experts and non-experts alike, it has an important limitation, namely memory tracing. Pin: in addition to its execution time and its influence on the results, as proven by the Pin community [21], it is a tool dedicated to experts; it also lacks the concept of selectivity, because it instruments the whole application.

To retain the advantages of the MAKS-MAQAO and Pin tools (and overcome their disadvantages), a combination of their advantages was proposed in previous work [17]. Their combination enriches the knowledge base of MAKS-MAQAO with the dynamic information of Pin. To overcome the drawback of Pin (execution time), we proposed a selectivity approach [17]: MAKS-MAQAO selects a part of the code which will then be instrumented by Pin.

B. Description

Figure 1 presents an overview of our approach.

Fig. 1. Approach overview

In order to accelerate the answer time and to improve performance by taking into account the input data, we propose to apply the following steps of our approach:

1) The analyzer: The MAKS-MAQAO analyzer [1], [3], [4] provides precise analysis (static and dynamic) results. A key feature of MAKS-MAQAO is its ability to value-profile the code at various granularities. In addition to timing, the instrumentation also performs value profiling. Value profiling is often the missing link between the behavior observed on the hardware and the nature of the application, and it yields numerous optimization opportunities. Time profiling allows us to give a precise weight to every executed loop, thereby underscoring hotspots. Value profiling monitors the iteration count. Correlating these two pieces of information provides the relevant metric, i.e., which hot loops are short. This is a clear illustration of the interest of a centralized approach to performance analysis. The analyzer results can be visualized by the user or presented in an easy-to-understand profile guide, in which the hot loop(s) appear as the main keys.
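As an illustration of the time/value correlation described above, the sketch below flags loops that carry a large share of the execution time yet have a low average trip count. All names, thresholds and numbers are ours, chosen for illustration; they are not part of MAKS-MAQAO:

```cpp
// Correlate time profiling (loop weight) with value profiling (iteration
// counts) to find hot loops that are nevertheless short per visit.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct LoopProfile {
    std::string id;         // e.g. "file.c:123" (hypothetical identifier)
    double      seconds;    // cumulative time in the loop (time profile)
    uint64_t    visits;     // number of times the loop was entered
    uint64_t    iterations; // total iteration count (value profile)
};

int main() {
    std::vector<LoopProfile> profile = {
        {"gapalign.c:410", 31.2, 120000, 960000},  // illustrative numbers
        {"seqport.c:88",    4.7,    900, 4500000},
    };
    double total = 0;
    for (const auto& l : profile) total += l.seconds;

    for (const auto& l : profile) {
        double weight    = l.seconds / total;               // hotness
        double tripCount = double(l.iterations) / l.visits; // iters per visit
        // "Which hot loops are short": heavy weight, low trip count.
        if (weight > 0.3 && tripCount < 16.0)
            std::cout << l.id << ": hot but short (trip count "
                      << tripCount << ")\n";
    }
    return 0;
}
```

Short hot loops matter because many transformations (unrolling, software pipelining) behave differently at low trip counts, which is precisely where the input data changes the picture.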

2) The transformer: This module is based on the information provided by two modules of MAKS-MAQAO: (1) MAQAOAdvisor [5], [1], which provides precise information about the applied optimizations, and (2) the expert system [7], [6], [1], which generates recommendations that guide the user toward the best performance-improving transformation. At this step, when the expert system generates recommendations and the user is satisfied, the transformer applies the optimization proposed in the recommendation.

The principal novelty of this transformer module [8] is its integrated support for a set of source transformations, directives and compiler pragmas, and the control it gives the user over how they will be applied. It enables complex optimizations to be applied, achieving performance that was previously only attainable through careful hand optimization.
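As a concrete and purely illustrative example of the kind of source transformation such a module can apply, consider unrolling a hot loop by a factor of 4; the example below is ours and does not come from the paper:

```cpp
// Original hot loop:
//     for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
// Unrolled by 4, with a remainder loop for trip counts not divisible by 4.
void daxpy_unrolled(int n, double a, const double* x, double* y) {
    int i = 0;
    for (; i + 3 < n; i += 4) {   // main unrolled body
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)            // remainder iterations
        y[i] = a * x[i] + y[i];
}
```

Whether such a transformation pays off depends on the trip count, and hence on the input data; this is exactly why the methodology selects the transformation per hot loop and per input data.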

3) The instrumenter: After selecting the hot loop by taking into account the input data, the best transformation to apply is indicated by the execution summary. Once the automatic transformer has applied this transformation, a selective instrumentation [17] is launched. This selective instrumentation is based on the M2Pin combination (MAKS-MAQAO and Pin). The main axes of this approach are: (1) the user selects a part of the code (source, assembly or binary) to be analyzed, and (2) the dynamic analysis technique is selected; for the selected technique, the corresponding functionality of the selected tool is launched.

4) The gain calculator: The last step in our methodology is the calculation of the gain. All information is then saved in the knowledge base of MAKS-MAQAO, to be used in future experiments or visualized by the user.
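For clarity, the sketch below shows the bookkeeping we mean by "gain": a performance gain computed from the original and optimized execution times, and an answer-time speedup computed from the full and selective instrumentation times. The variable names and numeric values are ours, for illustration only:

```cpp
#include <iostream>

int main() {
    double t_orig   = 150.0;    // original execution time (sec), illustrative
    double t_opt    = 140.0;    // time after the chosen transformation (sec)
    double ans_full = 37000.0;  // answer time, full instrumentation (sec)
    double ans_sel  = 270.0;    // answer time, selective approach (sec)

    double gain_pct = 100.0 * (t_orig - t_opt) / t_orig; // performance gain, %
    double speedup  = ans_full / ans_sel;                // answer-time speedup

    std::cout << "gain = " << gain_pct << " %, answer-time speedup = "
              << speedup << "x\n";
    return 0;
}
```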

C. Evaluation study

In this section, we present an evaluation study to confirm the advantages of our approach. Let p be the number of input data, l the number of loops, and m the number of optimizations for each loop.

For p input data, the total execution time for an application is the sum of the execution times obtained by running the application for each input data:

Total(T_exec) = \sum_{i=1}^{p} Time_{appli}

Suppose now that we apply every optimization to all loops. In this case, the total execution time is the sum of the execution times of the original application for each input data, plus the sum of the execution times of the application for each input data and each transformation:

Total(T_exec) = \sum_{i=1}^{p} Time_{appli} + \sum_{i=1}^{p} \sum_{j=1}^{l} \sum_{k=1}^{m} Time_{appli}   (1)

Using our approach, we demonstrate that each step of our process yields a gain.

1) Using hot loops: By selecting the hot loops, suppose that we have h hot loops (h < l). The total execution time for the application is:

Total(T_exec) = \sum_{i=1}^{p} Time_{appli} + \sum_{i=1}^{p} \sum_{j=1}^{h} \sum_{k=1}^{m} Time_{appli}   (2)

2) Best transformation for each hot loop: Using our methodology, we have a single transformation (m = 1) to apply to each hot loop and each input data. The total execution time for the application is:

Total(T_exec) = \sum_{i=1}^{p} Time_{appli} + \sum_{i=1}^{p} \sum_{j=1}^{h} Time_{appli}   (3)

3) Selective instrumentation: By applying selective instrumentation, only the hot loops are timed, so the execution time is smaller than all the previous times:

Total(T_exec) = \sum_{i=1}^{p} Time_{appli} + \sum_{i=1}^{p} \sum_{j=1}^{h} Time_{hotloop}   (4)
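To make formulas (1)-(4) tangible, the sketch below evaluates them on the Blast execution times of Table II, with one hot loop per input data and ten candidate transformations. The 50% hot-loop share is an assumption of ours; the sketch illustrates orders of magnitude and does not reproduce the measured values reported later in Table III:

```cpp
// Cost model behind formulas (1)-(4): p input data, l loops, h hot loops,
// m transformations per loop. T_appli[i] is the run time for input i.
#include <iostream>
#include <vector>

int main() {
    const int l = 10895, h = 1, m = 10;  // Blast-like parameters
    const std::vector<double> T_appli =  // Table II, original times (sec)
        {150.07, 4.15, 24.99, 205.08, 53.32, 211.31, 953.4};

    double base = 0, f1 = 0, f2 = 0, f3 = 0, f4 = 0;
    for (double t : T_appli) {
        double t_hot = 0.5 * t;       // assumed hot-loop share (50%)
        base += t;                    // one original run per input
        f1 += double(l) * m * t;      // (1): all loops, all transformations
        f2 += double(h) * m * t;      // (2): hot loops only
        f3 += double(h) * t;          // (3): best transformation (m = 1)
        f4 += double(h) * t_hot;      // (4): selective instrumentation
    }
    std::cout << "F1=" << base + f1 << "  F2=" << base + f2
              << "  F3=" << base + f3 << "  F4=" << base + f4 << " sec\n";
    return 0;
}
```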

4) Evaluation for one hot loop: By applying our methodology to one hot loop, we have:

a) Performance: Let α be the fraction of the total execution time of the application spent in the hot loop:

α = Time_{hotloop} / Time_{appli}

After applying the transformation recommended by the expert system of MAKS-MAQAO, we have:

α' = Time_{hotloop}(optimized) / Time_{appli},   where α' < α.

For one input data, the total time needed by our approach is:

Total_{time} = Time_{appli} + Time_{hotloop}(optimized) = (1 + α') · Time_{appli}

Knowing that α' < 1, this total time, i.e., the time to run the original application once plus the time to run the optimized hot loop, is remarkably small: it is less than twice the execution time of the original application. That means that within a short time we can guide the user to apply the best transformation for the hot loop while taking into account the input data.

b) Answer time: Applying our approach to one hot loop, we derive the answer time as follows:

β = Time_{instrumentation}(hotloop) / Time_{instrumentation}(appli)

After applying the transformation recommended by the expert system of MAKS-MAQAO, we have:

β' = Time_{instrumentation-hotloop}(optimized) / Time_{instrumentation-application},   where β' < β.

For one input data, the total time needed by our approach is:

Answer_{time} = Time_{instrumentation-appli} + Time_{instrumentation-hotloop}(optimized) = (1 + β') · Time_{instrumentation-appli}
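As a worked numeric instance of this bound (the values below are ours, chosen for illustration, not measured in the paper):

```latex
% Suppose instrumenting the whole application takes
% T_{instr}(appli) = 1000 s, and instrumenting the optimized hot loop
% takes 150 s, so \beta' = 0.15. Then
\[
Answer_{time} = (1 + \beta')\, T_{instr}(appli)
              = 1.15 \times 1000\,\mathrm{s} = 1150\,\mathrm{s},
\]
% versus a naive scheme that re-instruments the full application for each
% of m = 10 transformations and p = 7 input data:
\[
10 \times 7 \times 1000\,\mathrm{s} = 70000\,\mathrm{s}.
\]
```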

III. EXPERIMENTAL RESULTS

In this section, we evaluate our proposed approach. We consider two scientific applications: Blast and NR. Experiments were run on two machines:

• A BULL Itanium 2 NovaScale system, 1.6 GHz, 3 MB of L3 cache. Codes were compiled using Intel ICC/IFORT 10.1.
• For x86, a 4-socket quad-core machine at 2.93 GHz, with 48 GB of memory and 1 x 146 GB of disk. Codes were compiled using ICC 10.1.

A. Blast application

As described in section I-B, this large application has several input data and 10895 loops. Applying our approach, we can summarize the optimization steps as follows.

Evaluation: Following the evaluation study described in section II-C, Table III presents the values obtained by applying the four formulas of section II-C.

TABLE III
BLAST: ANSWER TIME USING THE EVALUATION STUDY FORMULAS

             F1             F2        F3      F4
Time (sec)   338744343000   3312487   55331   5418

TABLE IV
BLAST: SPEEDUP IN ANSWER TIME AND GAIN IN PERFORMANCE BY APPLYING OUR APPROACH TO THE HOT LOOPS

Input data   File       Hot loop (line)   % Exec. time   Perf. gain   Answer-time speedup
I1           ungapped   500               61.95          0.85         223.33
I2           seqport    376               10.70          2.65         60.65
I3           gapalign   2875              25.23          2.76         7.52
I4           ungapped   500               60.47          1.18         235.20
I5           ungapped   500               49.06          1.60         221.41
I6           ungapped   500               44.86          1.77         231.1
I7           ungapped   500               57.78          1.50         56.67

Note that:

• For the four formulas:
  - F1 is the total execution time obtained by running the original application for each input data, plus the execution time of running the application for each input data and each transformation.
  - F2 is the total execution time obtained by running the original application for each input data, plus the execution time of running the hot loops for each input data and each transformation.
  - F3 is the total execution time obtained by running the original application for each input data, plus the execution time of running the hot loops for each input data with the best transformation proposed by our methodology.
  - F4 is the total execution time obtained by applying selective instrumentation to each hot loop and applying the best transformation.
• We have 7 input data and 7 hot loops.
• We take 10 transformations (it could be more) for each of the 10895 loops.

In Table III, we remark that a very large amount of time is needed to execute the application for the 7 input data with several transformations per loop. This is why we focus on the hot loops. Compared to the original execution time, the time our approach needs (F4) is significant; but compared to the other times (F1, F2 and F3), our methodology is clearly the best.

Based on the F4 formula, Table IV presents, for each input data, the source line of the hot loop and the file that contains it. The third column gives the percentage of the execution time spent in the hot loop; most hot loops take a significant share of the execution time. The last two columns summarize the gain in performance and the speedup in answer time obtained by applying our approach, i.e., by focusing on the hot loop: we start by finding the hot loop for each input data, generate the static/dynamic analysis, and then apply the best transformation.

Applying our approach, we can summarize the optimization results as follows:

1) Benefits of focusing on hot loops: As shown in Table IV, the hot loops depend on the input data: for each input data there is a hot loop in a hot file. This confirms that we must take the input data into account to improve performance. The percentage of execution time of each hot loop confirms the benefit of selecting the hot loop from among the 10895 loops.

2) Gain: As described in our approach, our main goals are improving performance and reducing the answer time. For the Blast application the performance gain is admittedly small, but the speedup in answer time is very important. Despite these small performance gains, we presented this application to show the magnitude of the execution time implied by applying transformations for multiple input data. In this example, we gain in answer time because we focus on the hot loop and apply a single transformation for each input data.

B. NR applications

We present experimental results for the Solution of Linear Algebra Equations applications from the Numerical Recipes (NR) suite. These applications have ten input data. Table V summarizes the different NR applications, giving the number of files and loops in each, as well as the sizes of the smallest and largest input data.

TABLE V
NR APPLICATIONS: FILES, LOOPS, AND SIZES OF SMALL AND LARGE INPUT DATA

Application        GaussJordan   Jacobi   Mprove   Toeplz   Tridag
Files              8             5        9        4        2
Loops              9             8        9        4        2
Small input data   16            16       16       16       8192
Large input data   1024          1024     1024     1024     8388608

Following the same process as for the Blast application, we choose the Gauss-Jordan application to present the advantages of our approach. Table VI presents the hot loops, the percentage of the execution time spent in them, and the gains in performance and answer time.

TABLE VI
NR: SPEEDUP IN ANSWER TIME AND GAIN IN PERFORMANCE BY APPLYING OUR APPROACH TO THE HOT LOOPS

Input data   Hot loop   % Exec. time   Perf. gain   Answer-time speedup
16           Loop7      34.28          1.65         2.13
32           Loop7      35.84          4.84         2.54
64           Loop7      36.20          8.02         3.81
128          Loop7      34.97          12.51        8.10
200          Loop8      33.46          15.16        11.54
256          Loop2      33.94          17.56        13.34
500          Loop2      54.43          25.48        14.11
512          Loop2      53.89          24.38        14.20
999          Loop2      32.64          1.57         12.71
1024         Loop2      32.38          1.99         12.91

We remark:

• The hot loop is not the same for all input data.
• The percentage of the execution time of the hot loop for each input data encourages us to focus on analyzing and optimizing just the hot loop.
• For most hot loops and input data, we obtain an important gain in performance and also an important speedup in answer time.

Fig. 2. NR hot loops: execution time on x86
Fig. 3. NR performance gain

For all NR applications, we use the average and the maximal values of the input data to present the gain in answer time and performance. Figure 2 presents the execution time of each hot loop in each application (run on x86). Most of these hot loops take more than 30% of the execution time of the application; this is why we applied our approach by instrumenting just the hot loops. On Itanium 2 we obtain the same picture.

Figure 3 presents the gain in performance on the Itanium 2 and x86 architectures. Figure 4 presents the speedup in answer time for all NR applications, summarizing the speedup for the whole of each application. We remark that all the speedups are important, whatever the architecture and the input data.

For the evaluation study (described in section II-C), Figures 5 and 6 present the execution times of the four formulas. For each formula, we add the following values to the execution time of the original application:

• For F1: the total execution time obtained by applying several transformations for each loop and each input data.
• For F2: the execution time of the hot loops under several transformations and input data.
• For F3: the execution time obtained by applying the best transformation for each input data.
• For F4: the execution time of just the hot loop with the best transformation for each input data.

Fig. 4. NR speedup in answer time
Fig. 5. Evaluation study for NR applications on x86

It is clearly visible that on both architectures our approach is the best. On x86 (Figure 5), F4 (our approach) takes a very short answer time compared to the others. For example, for the Tridag application we need 53 seconds, while the execution of the original code takes 23 seconds; running and instrumenting Tridag for ten input data and two loops would require 182321 seconds. In this figure, F1, F2 and F3 must be divided by 75, 25 and 10 respectively to compare the results with our approach. The same remarks hold for the Itanium architecture (Figure 6), where F1, F2 and F3 must be divided by 100, 25 and 10 respectively to compare the results with our approach.

Fig. 6. Evaluation study for NR applications on Itanium 2

IV. RELATED WORK

In this section we briefly discuss related work on the impact of input data on performance, on performance tools, and on selective instrumentation.

A. Input data and performance

Most researchers use a limited number of input data to validate their research, which is mainly in code optimization. The main reason for being limited to a small number of input data is that launching the application several times requires a huge execution time. For this reason, several researchers have proposed new ideas on how to investigate several input data [9], [10], [11], [12], [13]; mainly, they focus on the impact of data sets on the parameterization and selection of compiler optimizations.


Zhong et al. [14] present two techniques to predict how program locality is affected across data sets. Several studies [9], [10], [11], [12], [13] underscore the fact that a significant number of iterations (tens or hundreds) are required to find the best combination of compiler optimizations. The number of input data is not an obstacle to our approach, because we show that we need less time than the total execution time to select the best optimization.

Chen et al. [15] evaluate the effectiveness of iterative optimization across a large number of data sets. They propose the possibility of learning the best compiler optimizations across distinct data sets. Unfortunately, their method is applied to the whole program, which also requires a significant amount of time to choose the best transformation. With our approach, we investigate fine-grain optimization because we focus on loops; our execution time is also smaller because we select and focus on the hot loops to be analyzed and tested with the new optimizations in order to select the best one.

B. Tools for Code Analysis and Optimization

Most performance analysis tools/toolkits can be divided into two main classes: static and dynamic analysis. Hardware monitors are extremely helpful for performance tuning; they are the backbone of analysis tools like VTune [22] and Cprof [18]. Their usage is so widespread that an API has been standardized to describe their access [23]. Nevertheless, hardware counters are limited to a dynamic description of an application, and this picture needs to be correlated with other metrics. DPCL [24], based on Dyninst [25], helps developers with the dynamic instrumentation of parallel jobs. Even if dynamic instrumentation is very appealing, DPCL does not include any notion of code inspection.

ATOM [26] and Pin [19] instrument assembly/binary codes in such a way that when specific instructions are executed, they are caught and user-defined instrumentation routines are executed. While very useful, ATOM and Pin are more oriented toward prospective architecture simulation than code performance analysis. EEL [16] belongs to the same category of tools: this C++ library allows one to edit a binary and add code fragments on the edges of the disassembled application's CFG. It can therefore be used as a foundation for an analysis tool, but it does not provide performance analysis by itself. The TAU [20] Performance System is a portable profiling and tracing toolkit for the performance analysis of parallel programs. TAU combines different tools, but there is no information interchange between them; its other major inconvenience is source-code instrumentation.

HPCview [27] and Finesse [28] address the analysis problem from both the static and dynamic sides. HPCview tackles the same problem as MAKS-MAQAO: the complex interaction between source code, assembly, performance and hardware monitors. HPCview presents a well-designed, web-browser-based GUI displaying simultaneous views of the source code, the assembly code and the dynamic information. This interface is connected to a database storing, for each statement of the assembly code, a summary of its dynamic behavior. HPCview nevertheless lacks value profiling, which is simple to implement yet enables numerous optimizations.

Vista [29] is an interesting crossover between a compiler and a performance tool. Addressing the issue of ordering compiler optimization phases, this complete framework allows an interactive, step-by-step compilation. Plugged into its own compiler, Vista allows the user to interactively test and configure compilation phases for a code fragment. While conceptually close to MAKS-MAQAO, Vista remains more a compiler project than a performance analyzer.

Shark [30] offers a comprehensive interface for performance problems. Like MAKS-MAQAO, it operates at the assembly level for its analyses and displays source code as well as profiling information. Shark lacks instrumentation and value profiling.

DSPInst [33] is a binary instrumentation tool that allows the user to select the part of the code to be instrumented (function, loop, ...). Several results are generated (data cache misses, memory tracing, ...). Despite its advantages, its one major disadvantage is that it is focused on a single architecture, the Blackfin.

C. Selectivity

Shende et al. [31] propose a selective mode on source code; with their approach, it is possible to select the functions and instructions to be instrumented. However, source instrumentation can interfere with compiler optimizations, and it is not useful when using libraries. Hernandez et al. [32] apply selective instrumentation in the OpenUH compiler: based on static estimation, they propose to instrument procedures. The major inconvenience is that their tool cannot select the hot function/loop/instruction.

There is much research on automatically selecting the best compiler optimizations. Works [87, 115, 124] are based on iteratively enabling certain optimizations, running the compiled program and, based on its performance, deciding on a new optimization setting. Compilers apply a complete, fixed pipeline of optimizations from the source code to the binary [8]. Cohen et al. [21] and Cavazos et al. [67] use hardware counters to generate heuristics to predict good optimizations.

Our work concentrates on the post-compiler stage; hence we are sure that the compiler does not undo the optimizations. For our approach, we show that static analysis is an important step toward proposing a good optimization. Hardware counters are used to complete the MAKS-MAQAO process; they are implemented in MAQAO, and MAQAOAdvisor guides users with hardware counters. We propose to add an extra phase to the process of iterative compilation in order to reduce its search space. For one execution of the source code, MAQAOAdvisor guides the user to generate a small number of versions for each hot loop; this number is limited by the maximum unrolling factor, the code size and the performance.

V. CONCLUSION

An automatic analysis and a quick resolution of the performance problem depend on having a precise methodology/tool for exploring the relationship between performance and computation analysis. In this paper we have proposed an approach that improves performance and reduces the answer time by applying the best transformation to each hot loop while taking into account the input data. Our technique is faster than existing methods because it selects and evaluates just the hot part of the code.

Using our methodology, we can easily understand the compiler optimizations applied to an application, the source-code transformations and the usage of the hardware resources. It is also possible to build a summary that defines an abstract representation of the application, or of a selected part of the program, in order to capture the parameters that affect performance. The results are then presented in an elaborate format which can be easily understood and interpreted by a user who is not an expert in code optimization. The user can also be guided to apply the best transformation to improve performance.

In the future, we plan to use a large set of input data (hundreds or thousands of input data). Focusing on fine-grain optimization and using such a set of input data will help us to investigate iterative optimization, which has not been evaluated up to now. We also plan to extend our approach in two important ways. First, we plan to propose an infrastructure that cooperates with parallel information; in this way, it is possible to combine other tools into our infrastructure. Second, we plan to study tool overhead: with our approach, it will be easier to study this overhead and to propose how to reduce it.

REFERENCES

[1] L. Djoudi. MAKS-MAQAO: An Intelligent Integrated Performance Analysis and Optimization Framework. PhD thesis, 2009.
[2] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J.-T. Acquaviva. MAQAO: Modular Assembler Quality Analyzer and Optimizer for Itanium 2. Workshop on EPIC Architectures and Compiler Technology, 2005.
[3] L. Djoudi, D. Barthou, O. Tomaz, A. Charif-Rubial, J.-T. Acquaviva, W. Jalby. The Design and Architecture of MAQAOPROFILE: an Instrumentation MAQAO Module. Workshop on EPIC Architectures and Compiler Technology, San Jose, Mar. 11-14, 21 pages (2007).
[4] L. Djoudi and W. Jalby. SA-IDMA: An Accurate and Effective Methodology of Combining Static and Dynamic Analysis. Conference on Genie Electrique (CGE), Polytechnique Military School in Algiers, Apr. 13-14, 7 pages (2009).
[5] L. Djoudi, J. Noudohouenou and W. Jalby. The Design and the Architecture of MAQAOAdvisor: A Live Tuning Guide. International Conference on High Performance Computing (HiPC), India, Dec. 17-20, 14 pages (2008).
[6] L. Djoudi and W. Jalby. KBS-MAQAO: A Knowledge-Based System for the MAQAO Tool. High Performance Computing and Communications (HPCC), Seoul, Jun. 25-27, 17 pages (2009).
[7] L. Djoudi and M. A. Achab. The Design and Architecture of an Expert System for the MAQAO Tool. The 2010 World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, Jul. 12-15, 9 pages (2010).
[8] Submitted.
[9] K. D. Cooper, A. Grosul, T. J. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: Adaptive Compilation Made Efficient. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 69-77, July 2005.
[10] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. Using Machine Learning to Focus Iterative Optimization. International Symposium on Code Generation and Optimization (CGO), pages 295-305, March 2006.
[11] M. Stephenson, M. Martin, and U. O'Reilly. Meta Optimization: Improving Compiler Heuristics with Machine Learning. Conference on Programming Language Design and Implementation (PLDI), pages 77-90, June 2003.
[12] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones. Fast Searches for Effective Optimization Phase Sequences. Conference on Programming Language Design and Implementation (PLDI), pages 171-182, June 2004.
[13] B. Franke, M. O'Boyle, J. Thomson, and G. Fursin. Probabilistic Source-Level Optimisation of Embedded Programs. Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 78-86, July 2005.
[14] Y. Zhong, X. Shen, and C. Ding. Program Locality Analysis Using Reuse Distance. Transactions on Programming Languages and Systems (TOPLAS), 31(6), Aug. 2009.
[15] Y. Chen, L. Eeckhout, G. Fursin, L. Peng, O. Temam and C. Wu. Evaluating Iterative Optimization Across 1000 Data Sets. Conference on Programming Language Design and Implementation (PLDI), 2010.
[16] J. R. Larus and E. Schnarr. EEL: Machine-Independent Executable Editing. PLDI 1995.
[17] Submitted.
[18] Cprof. http://sourceforge.net/projects/cprof
[19] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. MICRO 37, Portland, 2004.
[20] K. Windisch, B. Mohr, and A. Malony. A Brief Technical Overview of the TAU Tools.
[21] G.-R. Uh, R. Cohn, B. Yadavalli, R. Peri, and R. Ayyagari. Analyzing Dynamic Binary Instrumentation Overhead. Workshop on Binary Instrumentation and Applications, 2007.
[22] Intel Corporation. VTune Performance Analyzer. http://www.intel.com/software/products/vtune
[23] J. Dongarra, K. S. London, S. Moore, P. Mucci, D. Terpstra, H. You, M. Zhou. Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters. IPDPS 2003.
[24] L. De Rose, T. Hoover Jr. and J. K. Hollingsworth. The Dynamic Probe Class Library: An Infrastructure for Developing Instrumentation for Performance Tools. IPDPS 2001.
[25] B. R. Buck and J. K. Hollingsworth. An API for Runtime Code Patching. Journal of High Performance Computing Applications, 317-329, 1994.
[26] A. Srivastava and A. Eustace. ATOM: A System for Building Customized Program Analysis Tools. PLDI 1994, pages 196-205.
[27] J. Mellor-Crummey, R. Fowler and G. Marin. HPCView: A Tool for Top-Down Analysis of Node Performance. Computer Science Institute Second Annual Symposium, Santa Fe, NM, October 2001.
[28] N. Mukherjee, G. D. Riley and J. R. Gurd. FINESSE: A Prototype Feedback-Guided Performance Enhancement System. Parallel and Distributed Processing (PDP) 2000, Rhodes, Greece, January 2000.
[29] W. Zhao, B. Cai, D. Whalley, M. Bailey, R. van Engelen, X. Yuan, J. Hiser, J. Davidson, K. Gallivan and D. Jones. Vista: A System for Interactive Code Improvement. In Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems, pages 155-164. ACM Press, 2002.
[30] Optimizing Your Application with Shark 4. http://developer.apple.com/tools/shark_optimize.html
[31] S. Shende, A. D. Malony, A. Morris. Optimization of Instrumentation in Parallel Performance Evaluation Tools. PARA'06: Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing.
[32] O. Hernandez, H. Jin, B. Chapman. Compiler Support for Efficient Instrumentation. PARA'07: Proceedings of the International Conference on Applied Parallel Computing: State of the Art in Scientific Computing.
[33] E. Sun, D. Kaeli. A Binary Instrumentation Tool for the Blackfin Processor. WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications.
