
Selective Methodology Based on User Criteria to Explore the Relationship Between Performance and Computation Analysis

Lamia Atma Djoudi and Mohamed Amine Achab
Contact: [email protected]

Independent Researchers
POSTER PAPER

Abstract— Automatic analysis and quick resolution of performance problems depend on having a precise methodology for developing tools that explore the relationship between performance and computation analysis. In this paper, we propose a strategy which allows the choice of analysis type, code level and tool used, by focusing on the hot part of the code. Taking user criteria into account, our system generates precise static/dynamic analysis for the selected part of the code. It requires smaller computation times and can be applied systematically without user intervention.

Index Terms— tools, static analysis, dynamic analysis, performance, instrumentation, selectivity.

I. INTRODUCTION

Before starting to optimize an application, we must identify the main factors limiting its performance. For that, two types of code analysis techniques can be used: static and dynamic analysis. Most of the work to date has been based either on static analysis or on measurement. Static analysis is usually faster than dynamic analysis but less precise. Therefore it is often desirable to retain information from static analysis for checking properties, and to use dynamic analysis, which evaluates properties against events originating from a concrete program execution.

Collecting static information is easier than obtaining dynamic information. Obtaining dynamic information on the behavior of a program is relatively complex: on one hand the application size is increased, and on the other hand the number, complexity and interactions of the transformations (optimizations) to apply are significant. Moreover, the validity of this information depends on the input parameters of the program.

We propose a strategy that combines the advantages of both analyses while overcoming their disadvantages. It allows us to quickly obtain precise analysis to improve performance.

As a large application may execute for hours and sometimes even days, if a developer is focusing on the implementation or the tuning of a key computation, it will be far more efficient and less cumbersome to run just the key computation, isolated from the rest of the program. We propose a strategy which allows us to quickly obtain dynamic analysis by focusing on the selected part of the code (source, assembly or binary).

Once the part of the code to be analyzed is selected, the main question is: which dynamic methodology is the best? Several dynamic techniques can be used, among them simulation and instrumentation, which can accurately obtain a broad range of performance metrics for a program. But their precision and flexibility come at a price. For this reason, we propose a synthesis of the most commonly used techniques and select the one that has the most advantages. In the next section, we present the different techniques and discuss how we select the best one.

Since our approach is based on selecting the part of the code to be analyzed and selecting the best dynamic technique, its realization can follow two paths: (1) developing a new system, or (2) combining existing tools. The first solution is very expensive and requires a lot of development work and time. The second one requires a good design and a precise selection of existing tools. Since several tools have been developed, and each tool has its own advantages and disadvantages, we believe that a study of the most widely used ones is necessary. This allows us to select the features that correspond to a selective and quick analysis. The most widely used tools are presented in the next section; then we discuss which tools we select for our approach.

Also, the fact that the user may or may not be an expert in the performance area has encouraged us to introduce his/her criteria into our system. With our strategy, we satisfy the user's criteria and answer his/her needs in a short time.

In this paper, we present a new approach called selectivity mode. Its main axes are: (1) selecting a part of the code (source, assembly or binary) to be analyzed by the user, and (2) selecting the dynamic analysis technique. For the selected technique, we select the corresponding functionality in the selected tool.

In the rest of this paper: Section 2 details the principles. A motivating example and a discussion of how to select the tools and the dynamic methodology are presented in Section 3. Section 4 details our approach. Experimental results are presented in Section 5. Section 6 presents related work. We conclude in Section 7.

II. PRINCIPLES

Before describing our approach, we discuss some issues. For each issue, we give an overview of how our approach will handle it.

A. Problems and Requirements

We consider performance studies for code optimization, starting when the developer writes the program and ending when the application finishes executing:

Application level: By determining the causes of performance degradation, the user is guided in restructuring the application to provide more opportunities to the compiler and/or hardware to improve performance. At this level, we must take into account that the user may or may not be an expert. A sound methodology is needed to guide the user to improve the source code or to use the best compiler options or pragmas to achieve good performance.
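As a small, hypothetical illustration of what such source-level guidance can mean (the kernel is invented for this example and the pragmas are Intel ICC hints, not something taken from the paper):

// Hypothetical kernel: the ICC pragmas assert that iterations are independent
// and suggest unrolling, giving the compiler more room to vectorize/schedule.
void scale(float *dst, const float *src, int n, float a) {
    #pragma ivdep       // ICC: ignore assumed loop-carried dependences
    #pragma unroll(4)   // ICC: suggest an unroll factor of 4
    for (int i = 0; i < n; ++i) {
        dst[i] = a * src[i];
    }
}

Other compilers simply ignore or warn about unknown pragmas, so such hints are a low-risk way for a non-expert user to act on the guidance.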



Compiler level: A compiler/pre-processor applies optimizations by performing a long sequence of transformations, so many chains of optimizations succeed one another several times. As the compiler is based on heuristics, it may not apply the optimal transformation. It is therefore very difficult to get compilers to produce code that makes optimal use of the machine.

Hardware level: To better exploit the architectural resources, all mechanisms and their interactions during program execution must be understood. Hardware is expensive to design and implement. Understanding the causes of performance degradation and applying the best transformations is difficult and requires precise knowledge of the hardware.

Taking into account these three levels, we conclude that performance studies are a very difficult task. They are usually left to experts with knowledge of the architecture, the compiler and program transformations.

For this issue, we believe that an approach based on the three levels and on the user level (expert or not) is much needed. The response time and the precision in finding the source of performance degradations are also important factors in the design of this approach.

B. Analysis Methods

Before starting the optimization of an application, it is necessary to first identify the main factors limiting its performance. For that, two types of code analysis techniques can be used: static and dynamic analysis. Most of the work has been based either on static analysis or on measurement. Existing static and dynamic analysis methods each have their advantages and limitations.

a) Static Analysis: It examines program code and reasons over all possible behaviors that might arise at run time. It usually works on an abstracted model of program state that loses some information, but which is more compact and easier to manipulate than a higher-fidelity model would be.

As a result, static analysis is usually faster than dynamic analysis, but it may be less precise (more approximate, more conservative) than the best results expressible in the grammar of the analysis. For this issue, we need tool(s) that present to the user a static analysis achieving results at a much lower cost and with better accuracy.

b) Dynamic Analysis: It operates by executing a program and observing the executions. It is precise because no approximation or abstraction needs to be done: the analysis can examine the actual, exact run-time behavior of the program. The disadvantage of dynamic analysis is that its results may not generalize to future executions; there is no guarantee that the test suite over which the program was run is characteristic of all possible program executions. Obtaining dynamic information on the behavior of a program is relatively complex. For example, performance analysis generally begins with finding the functions that account for a large percentage of the total execution time. Different methodologies can be applied to measure the performance of applications, depending on the metrics to be measured. We briefly describe some methods:

b-1) Instrumentation: inserts instructions to collect information. Several instrumentation methods exist: source modification, compiler-injected instrumentation, binary rewriting to obtain an instrumented version of an executable, and binary translation at runtime. Instrumentation adds code to increment counters at function entry/exit, to read hardware performance counters, or even to simulate hardware to get synthetic event counts. The instrumentation runtime can dramatically increase the execution time, to the point where time measurements become useless. It may also produce very large code. MAKS-MAQAO [2], [1] and EEL [3] are tools based on instrumentation.
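As a minimal sketch of the idea (based on the public Pin C++ API; the counter and output are illustrative only, not any tool from the paper), a runtime instrumentation tool can ask the framework to insert a counting call at every routine entry, which is exactly the kind of added code that inflates execution time:

// Minimal instrumentation sketch in the style of a Pin tool: a counter is
// incremented at every function (routine) entry, i.e. the classic
// "counters at function entry/exit" form of instrumentation described above.
#include "pin.H"
#include <iostream>

static UINT64 calls = 0;
static VOID OnEntry() { calls++; }

static VOID Routine(RTN rtn, VOID *) {
    RTN_Open(rtn);
    // Insert the counting call at the entry point of every routine.
    RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)OnEntry, IARG_END);
    RTN_Close(rtn);
}

static VOID Fini(INT32, VOID *) {
    std::cerr << "routine entries observed: " << calls << std::endl;
}

int main(int argc, char *argv[]) {
    PIN_InitSymbols();                  // resolve routine names/symbols
    if (PIN_Init(argc, argv)) return 1;
    RTN_AddInstrumentFunction(Routine, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                 // runs the application instrumented
    return 0;
}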

b-2) Sampling: consists in taking measurement points during short time intervals. It does not provide an exact result in the strict sense, because the validity of the results depends on the choice of the measurements and their duration. However, if properly used, the results are perfectly usable in most cases. Instead of constantly measuring the activity of a program, sampling takes a measurement over a short time interval. The relationship chosen between the sampling frequency and the duration of the measurements determines the validity of the results. CProf [4] uses this technique.
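As an illustration of interval-based sampling (not CProf itself; this is a generic POSIX sketch, and the 10 ms period and the busy loop are arbitrary choices), the process asks the operating system to deliver SIGPROF periodically and counts the samples; a real profiler would record the interrupted program counter at each tick:

// Sketch of interval-based sampling on a POSIX system (assumption: Linux-like
// environment). Only the number of samples is recorded here.
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

// Signal handler: a real profiler would inspect the interrupted context (PC).
static void on_sample(int) { ++samples; }

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, nullptr);

    // Deliver SIGPROF every 10 ms of CPU time consumed by the process.
    struct itimerval timer = {};
    timer.it_interval.tv_usec = 10000;
    timer.it_value.tv_usec = 10000;
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Hypothetical workload being sampled.
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; ++i) x += i * 1e-9;

    printf("samples taken: %d (x=%f)\n", (int)samples, (double)x);
    return 0;
}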

b-3) Simulation: makes it possible to understand the behavior of an architecture at a fine grain. Unfortunately, simulators are very expensive and very difficult to develop because the architecture is very complex: the more precise the architecture model, the slower the simulation. On one hand the application size is increased, and on the other hand the number, complexity and interactions of the optimizations to apply are significant. Pin [6] uses this technique.
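A toy sketch of why precise models are costly (assumption: a direct-mapped data cache with 64-byte lines and a synthetic address trace; this is not any simulator cited above, just an illustration that every access must pass through the model):

// Toy direct-mapped cache model: each simulated memory access does real work,
// which is why detailed simulation is orders of magnitude slower than native runs.
#include <cstdint>
#include <cstdio>
#include <vector>

struct DirectMappedCache {
    static constexpr int kLineBits = 6;            // 64-byte lines
    std::vector<uint64_t> tags;
    uint64_t hits = 0, misses = 0;

    explicit DirectMappedCache(int num_lines) : tags(num_lines, UINT64_MAX) {}

    void access(uint64_t addr) {
        uint64_t line = addr >> kLineBits;
        size_t   set  = line % tags.size();
        if (tags[set] == line) ++hits;
        else { ++misses; tags[set] = line; }
    }
};

int main() {
    DirectMappedCache cache(512);                  // 512 lines = 32 KB
    // Hypothetical address trace: a strided sweep over a 1 MB region.
    for (uint64_t addr = 0; addr < (1u << 20); addr += 8) cache.access(addr);
    std::printf("hits=%llu misses=%llu\n",
                (unsigned long long)cache.hits, (unsigned long long)cache.misses);
    return 0;
}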

There are many simulators, emulators and instrumentors. Unfortunately, for most users (computer architects, software developers and compiler writers), it is not always clear when their programs need tuning. Our approach is not to develop yet another analyzer, instrumentor or simulator; it is an infrastructure for accurately measuring and effectively analyzing the performance of an application. For this issue, we need to select the best tool(s)/method(s) to obtain quick and precise results.

For both static and dynamic analysis, it is better to have analysis results at the source, assembly and binary levels. The main goal in the analysis phase is to perform an accurate analysis, be it static or dynamic, high level or low level. We believe that combining the benefits of each technique will be superior to any single technique used on its own, with the added difficulty of making the right choice between the techniques.

C. Analysis level

By analyzing the performance results, we should be able (1) to determine the sources of performance degradation and (2) to improve performance. To do this, we must have all the information on the source, assembly and binary codes. The idea is not to analyze each code separately, but to propose how we can benefit from all the information extracted from the three codes in order to have a precise analysis.

For static analysis, we propose to extract information from the source and assembly codes (post-compilation). For dynamic analysis, we propose to work at the assembly and binary levels. The idea is to have a collaborative relationship with the compiler. It may lack some expressiveness (compared to compilation passes applied to an abstract representation), but it does handle complex analyses and transformations, and it allows a direct and precise modeling of the target platform. This post-compilation approach has several advantages:

• At the assembly level, almost all compiler-performed optimizations become visible, which is not the case for higher-level representations.

• The exploitation of post-compiler optimization opportunities is not intended as a compiler replacement: it is guaranteed that no other code transformation will undo or break the optimization. Also, operating after the compilation phase allows a precise diagnosis of compiler optimization successes and/or failures.

• Assembly language is still at a high enough level to make development possible and optimization achievable. In contrast to binary executable code, program areas such as functions and basic blocks are still identifiable.

• The code can be compiled directly and can actually be processed using any compiler/assembler, unlike an intermediate representation.

D. Performance Analysis Tools

To obtain accurate static/dynamic analyses at any level (source, assembly and binary) and detect the sources of performance degradation, several tools have been developed. Performance analysis tools/toolkits can be dispatched among two main classes: the first one is focused on the exploitation of hardware performance counters, while the second relies on code instrumentation or even transformation. The regular use of performance instrumentation and analysis tools to tune real applications is surprisingly uncommon. Traditionally, hardware counters, profiling information, static analysis and even expert knowledge are exploited individually or at best interconnected through ad hoc tools. In this paper we choose several tools in order to take advantage of their strengths and overcome their drawbacks, so as to detect the source of performance degradation and guide the user to apply the best transformation to improve performance. The choice of these tools is based on the following criteria:

• Working level: Does the tool work at the assembly level? Can it combine source/assembly/binary information?

• Application level: At which level can we obtain precise information: function, loop or instruction? Is it easy to obtain this information?

• Analysis level: For our approach, we need static/dynamic analysis; which tool can provide this?

• Execution time to generate dynamic information: every tool uses a dynamic methodology; which one is the most beneficial, and what is its execution time?

In this section, we outline the tasks of the main tools used to measure and evaluate performance and to answer our goals (other tools are described in the Related Work section).

MAKS-MAQAO [2], [1]: stands for Multi-Architecture Knowledge-based System - Modular Quality Analyzer and Optimizer. It is the implementation of our new optimization approach. It addresses the performance problem in all its diversity: static analysis, support for hardware counters, dynamic instrumentation and profiling, a hybrid intelligent system within a knowledge-based system to process the results, source transformations, and automatic low-level optimization for fine loop tuning. MAKS-MAQAO is a tool to analyze and optimize assembly and source applications based on compiler optimizations and user criteria in order to exploit the hardware resources. A distinctive advantage is that this system strongly focuses on versatility, i.e., users can specify their own analyses and enrich the performance intelligent system. These capabilities enable better control of an optimization process and enhance the productivity of programmers in the process of code tuning.

Pin [6]: instruments binary codes in such a way that when specific instructions are executed, they are caught and user-defined instrumentation routines are executed. While being very useful, Pin is more oriented toward prospective architecture simulation than code performance analysis.

TAU [7]: is a portable profiling and tracing toolkit for performance analysis of parallel programs. It provides interoperability with a range of instrumentation libraries, post-processing and tracing tools.

Gprof [5]: provides subprogram profiling and an exact count of the number of times every subprogram is called (using the call graph). It can isolate the most important functions.

HPCToolkit [8]: is an open-source suite of multi-platform tools for profile-based performance analysis of applications. The main tool is hpcview, which correlates program structure information, multiple sample-based performance profiles, and program source code to produce a performance database.

Table I summarizes these tools based on the different issues presented in this section.

TABLE I: TOOLS, ANALYSIS METHODS AND LEVELS

Tool        | Analysis method | Analysis level
TAU         | Static-Dynamic  | Source
HPCToolkit  | Static-Dynamic  | Source
GPROF       | Dynamic         | Binary
Pin         | Dynamic         | Binary
MAKS-MAQAO  | Static-Dynamic  | Source-Assembly

In the next section, we present the limitations of each tool to justify our selection of the tools used in our approach.

III. MOTIVATION AND DECISION

In this section, we present a motivating example to demonstrate the limits of each tool presented in the previous section. Then we select the tools to achieve the goals set in this paper.

A. Motivation

Our motivation is based on two scientific applications, a small one and a large one:

• Euler: a mathematical formula in complex analysis, implemented in one file of 853 bytes.



• Blast: the Basic Local Alignment Search Tool (BLAST) is a set of algorithms used to find similar sequences between several DNA chains or protein databases. This program is composed of 748 files (45 MB).

Using the tools described above, we have generated the dynamic information for Euler and Blast by using instrumentation at the function level. Table II summarizes the execution time of the original code and of the dynamic analysis for each tool and each application (Euler and Blast, respectively).

TABLE II: EXECUTION TIME AND PERCENTAGE OF INSTRUMENTATION TIME FOR THE EULER AND BLAST APPLICATIONS

Application |      | Orig   | TAU    | M-MAQAO | PIN      | GPROF  | HPCToolkit
Euler       | T(s) | 303.89 | 332.74 | 332.55  | 22342.43 | 331.12 | 334.74
            | %    |        | 1,09   | 1,09    | 73,54    | 1      | 1,1
Blast       | T(s) | 229.80 | 402.73 | 359.32  | 57888.47 | 284.60 | 230.64
            | %    |        | 1,75   | 1,56    | 251,9    | 1,23   | 1,03

B. Discussion

Based on Table II and the functionality of each tool, we have the following remarks:

1- Execution time: we note that Pin has a large execution time for both the small and the large application. That does not mean it is a bad tool, but generating the dynamic information takes more time. Its major drawback is that it instruments the whole application.
2- Analysis:
2-1 Static analysis: we select MAKS-MAQAO because it provides rich information at the source and assembly levels. Also, the link between the source and assembly levels helps us understand the static information easily.
2-2 Dynamic information: based on our objectives, we summarize the limits of each tool.
- TAU: its main drawback is source instrumentation.
- GPROF: its limits are also source instrumentation and a limited amount of dynamic information.
- HPCToolkit: being based on sampling, it cannot be exact every time; for example, it gives no information about memory tracing.
- MAKS-MAQAO: despite the importance of its dynamic analysis, and its usability by expert and non-expert users alike, it has an important limit, which is memory tracing.
- Pin: in addition to its execution time and its influence on the results, as shown by the Pin community [9], it is a tool dedicated to experts. It also lacks the concept of selectivity because it instruments the whole application.

C. Decision

Based on the objectives set in this paper, we select MAKS-MAQAO and Pin. Their combination enriches the knowledge base of MAKS-MAQAO with the dynamic information of Pin. To overcome the drawback of Pin (execution time), we propose a selectivity approach: MAKS-MAQAO selects a part of the code, which is then instrumented by Pin. This approach is detailed in the following section.

Fig. 1. Selectivity process by combining MAKS-MAQAO and PIN

IV. APPROACH

Taking into account user criteria, mainly the part of the code to be analyzed and the dynamic analysis level (application, function, loop, basic block or instruction), our system generates precise static/dynamic analysis for the selected part of the code. It requires smaller computation times and can be applied systematically without user intervention. In this section, we present more details about our approach.

A. Process

Figure 1 illustrates the way our system is organized. There are two main parts: MAKS-MAQAO and PIN. Our process starts with MAKS-MAQAO; then PIN is launched based on the MAKS-MAQAO output.

1) MAKS-MAQAO Tasks: We have the following tasks:
1- From the source code, the compiler generates an assembly file. This file is then analyzed by MAQAO in the MAKS-MAQAO infrastructure, which might already trigger some optimizations. In this step, MAKS-MAQAO can analyze several assembly versions of the same source code.

2- The second task is launched by the export module. It is an interface between MAKS-MAQAO and external tools. It is composed of the Extractor and the Translator. The Extractor extracts the information needed by the external tool. Each piece of extracted information is then translated for the external tool using the Translator. In previous work, this module was used to extract DDG information [10]. In this paper, the information extracted is the assembly files. The export module extracts a part of the code from the assembly files. Then, it provides information based on the query, which can be a user need or a MAKS-MAQAO recommendation. It summarizes all the information about the different transformations of this selected part (basic block, loop, function or application). The part of the code to be analyzed is generated in an assembly file called the selective-file.

Fig. 2. Selective-Pintool overview

3- The Translator submits the assembly file containing the part of the code to be analyzed, together with the selective-file.

2) PIN Tasks: Once the selected part is submitted to Pin, the assembly and binary code are also submitted to Pin.

In the Pin infrastructure, we have developed a pintool (selective-pintool). It takes as input the binary code of the application (bin-appli), the assembly file (asm-file), and the part of the code to be instrumented (select-asm), which is selected from the assembly file. Figure 2 presents the main selective-pintool functionalities, which are described below (a minimal sketch of the selectivity idea follows the list):

• The parser: selects from the bin-appli, the binary file (bin-file) corresponding to the asm-file.

• The corresponding function: uses the bin-file, the asm-file and the select-asm to generate two outputs, a binary and an assembly file. The two outputs, called func-bin and func-asm, contain only the function that includes the part of the code to be instrumented.

• Using the select-asm, the binary code, the func-bin and the func-asm, the selectivity function generates an intermediate representation. This intermediate representation corresponds to the binary code restricted to the selected part of the code, for which the results will be generated by Pin.

• In the last step, Pin executes the instrumented binary to generate results for the selected part of the code. These results depend on the pintool selected by the user.
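As an illustration of the selectivity idea only (this is not the authors' selective-pintool, whose inputs such as select-asm and func-asm are specific to their infrastructure), the sketch below uses the public Pin C++ API to restrict instrumentation to a single routine named on the command line; the knob name "func", the instruction counter and the output file "selective.out" are all invented for the example:

// Sketch: instrument only one routine chosen by the user, which is the core
// of the selectivity idea (everything else runs uninstrumented).
#include "pin.H"
#include <fstream>
#include <string>

KNOB<std::string> KnobFunc(KNOB_MODE_WRITEONCE, "pintool",
                           "func", "main", "routine to instrument");

static UINT64 icount = 0;
static VOID CountIns() { icount++; }

// Called once per routine at image load time: instrument only the selected one.
static VOID Routine(RTN rtn, VOID *) {
    if (RTN_Name(rtn) != KnobFunc.Value()) return;   // skip everything else
    RTN_Open(rtn);
    for (INS ins = RTN_InsHead(rtn); INS_Valid(ins); ins = INS_Next(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountIns, IARG_END);
    RTN_Close(rtn);
}

static VOID Fini(INT32, VOID *) {
    std::ofstream out("selective.out");
    out << KnobFunc.Value() << " executed " << icount << " instructions\n";
}

int main(int argc, char *argv[]) {
    PIN_InitSymbols();                  // needed to resolve routine names
    if (PIN_Init(argc, argv)) return 1;
    RTN_AddInstrumentFunction(Routine, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                 // never returns
    return 0;
}

Built as a regular pintool, it would be launched along the lines of "pin -t selective.so -func <routine> -- ./app" (exact paths depend on the Pin installation).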

3) Results Managing: Once Pin generates the results based on the user criteria, namely the selected part and the result type (cache misses, memory tracing, ...), our process retrieves these results to manage them properly. This is why we use the MAKS-MAQAO infrastructure:

TABLE III: COMPARING EXECUTION TIME FOR JACOBI: ORIGINAL CODE AND APPLICATION INSTRUMENTED BY MAKS-MAQAO AND PIN

Input  | Original       | MAKS-MAQAO     | PIN
       | AV     | MAX   | AV    | MAX    | AV       | MAX
Exec T | 40,85  | 213,4 | 52,61 | 299,3  | 26986,43 | 98899,23
%      |        |       | 1,28  | 1,40   | 660,62   | 463,44

TABLE IV: JACOBI HOT LOOP: INSTRUMENTATION TIME GAIN COMPARED TO THE INSTRUMENTATION TIME OF THE WHOLE APPLICATION

Input  | MAKS-MAQAO      | PIN
       | AV     | MAX    | AV        | MAX
Exec T | 44,75  | 241,87 | 2619,6525 | 9713,46
gain   | 1,17   | 1,23   | 10,30     | 10,18

• The Pin results are used by the import module of MAKS-MAQAO. In this module we have a reflector which interprets the PIN results back into MAKS-MAQAO.

• A translator is needed in this step to direct and send the results either (1) to the knowledge base of MAKS-MAQAO or (2) to the user for visualization.

V. EXPERIMENTAL RESULTS

In this section, we evaluate our proposed approach. We consider two sets of scientific applications: NR and NPB. For these applications, we use MAKS-MAQAO to generate static/dynamic analysis for the hot loops. We use PIN to generate memory cache results. Experiments were run on two machines:

• A BULL Itanium 2 NovaScale system, 1.6 GHz, 3 MB of L3 cache. Codes were compiled using Intel ICC/IFORT 10.1.

• For x86, we use a 4-socket quad-core machine at 2.93 GHz, with 48 GB of memory and 1 x 146 GB of storage. Codes were compiled using ICC 10.1.

A. NR applications

We present experimental results for the Solution of Linear Algebra Equations applications from the Numerical Recipes (NR). These applications have ten input data sets. For our experiments, we use the average and the maximal values.

1) Jacobi application: The first example presenting the results of our approach is Jacobi, an application used to solve the Helmholtz equation on a regular mesh, using an iterative Jacobi method with over-relaxation. It is run on Itanium 2. In Table III, we note that the PIN execution time is much larger than the execution time of the original code and of the MAKS-MAQAO instrumentation.

To demonstrate the advantage of our approach, we have selected the hot loop in the Jacobi application. For a large input data set, this hot loop can take up to 64.17% of the execution time of the application. We have then instrumented it with MAKS-MAQAO and Pin. With PIN, we prove the efficiency of our approach. Even if the gain with MAKS-MAQAO instrumentation is not remarkable, we should not forget that the selection process is done by MAKS-MAQAO, which is an important step in the selectivity process.
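The gain rows in Table IV appear to be the ratio of the whole-application instrumentation time (Table III) to the hot-loop instrumentation time; for the average input this works out to

\[
\mathrm{gain}_{\mathrm{PIN}} = \frac{T_{\mathrm{instr}}(\mathrm{application})}{T_{\mathrm{instr}}(\mathrm{hot\ loop})} = \frac{26986.43}{2619.65} \approx 10.30,
\qquad
\mathrm{gain}_{\mathrm{MAKS\text{-}MAQAO}} = \frac{52.61}{44.75} \approx 1.17 ,
\]

which matches the values reported in Table IV.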

Figure 3 summarizes the execution time of the instrumentation (application and hot loop) using different input data. We observe that the execution time when instrumenting only the selected loop is much lower than the instrumentation time with MAKS-MAQAO and Pin for the whole application.

Fig. 3. Jacobi: MAKS-MAQAO and PIN instrumentation gain by selecting the hot loop

Fig. 4. NR hot loops: execution time on x86

2) All NR applications: The NR applications are run on Itanium 2 and x86 using the average and the maximum values of the input data.

Figure 4 presents the execution time of each hot loop in each application (run on x86). Most of these hot loops take more than 30% of the execution time of the application. This is why we have applied our approach by instrumenting just the hot loops. On Itanium 2 we observe the same pattern.

Figure 5 summarizes the speedup of selective instrumentation compared to the instrumentation of the whole of each application. We note that all speedups are significant whatever the architecture and the input data.

B. NPB applications

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification.

These applications are run on x86. Following the same process as for the NR applications, Figure 6 shows the execution time of each hot loop in each application for the three input data sets of the NPB applications. Figure 7 presents the speedup, which is very significant using our approach.

VI. RELATED WORK

Most of the performance analysis tools/toolkits can be dispatched among two main classes: static and dynamic analysis.

Fig. 5. NR speedup using Itanium 2 and X86 using three input data

Fig. 6. NPB hot loops: execution time on x86

Fig. 7. NPB speedup using Itanium 2 and X86 for average max input data

Hardware monitors are extremely helpful for performance tuning; they are the backbone of analysis tools like VTune [11] and Cprof [4]. Their usage is so widespread that an API has been standardized to describe their access [12]. Nevertheless, hardware counters are limited to the dynamic description of an application, and this picture needs to be correlated with other metrics. DPCL [13] is based on Dyninst [14]. It helps developers to support dynamic instrumentation of parallel jobs. Even if dynamic instrumentation is very appealing, DPCL does not include any notion of code inspection.

ATOM [15] and Pin [6] instrument assembly/binary codes in such a way that when specific instructions are executed, they are caught and user-defined instrumentation routines are executed. While being very useful, ATOM and Pin are more oriented toward prospective architecture simulation than code performance analysis. EEL [3] belongs to the same category of tools. This C++ library allows editing a binary and adding code fragments on the edges of the disassembled application CFG. Therefore it can be used as a foundation for an analysis tool, but it does not provide performance analysis by itself. The TAU [7] Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs. TAU combines different tools, but there is no information interchange between them. Also, its major drawback is the source code instrumentation.

HPCview [16] and Finesse [17] address the analysis problem from both the static and dynamic sides. HPCview tackles the same problem as MAKS-MAQAO: the complex interaction between source code, assembly, performance and hardware monitors. HPCview presents a well designed web-browser-based GUI, displaying simultaneous views of source code, assembly code and dynamic information. This interface is connected to a database storing, for each statement of the assembly code, a summary of its dynamic behavior. HPCview also lacks value profiling, even though it is a simple optimization to implement.

Vista [18] is an interesting cross-over between a compiler and a performance tool. Addressing the issue of compiler optimization phase ordering, this complete framework allows an interactive, step-by-step compilation. Plugged into its own compiler, Vista allows interactively testing and configuring compilation phases for a code fragment. While being conceptually close to MAKS-MAQAO, Vista remains more a compiler project than a performance analyzer.

Shark [19] offers a comprehensive interface for performance problems. Like MAKS-MAQAO, it works at the assembly level for its analyses and displays source code as well as profiling information. Shark lacks instrumentation and value profiling.

DSPInst [22] is a binary instrumentation tool. It allows the user to select the part of the code to be instrumented (function, loop, ...). Several results are generated (data cache misses, memory tracing, ...). Despite its advantages, its only major disadvantage is that it is focused on a single architecture, the Blackfin. Shende et al. [20] propose a selective mode on source code. With their approach, it is possible to select the functions and instructions to be instrumented. However, the source instrumentation can interfere with compiler optimizations, and it is not useful when using libraries. Hernandez et al. [21] apply selective instrumentation in the OpenUH compiler. Based on static estimation, they propose to instrument procedures. The major drawback is that their tool cannot select the hot function/loop/instruction.

VII. CONCLUSION

In this paper we have proposed an approach which allows investigating the effectiveness of selective instrumentation by combining the advantages of performance tools and of static and dynamic analysis (source, assembly and binary code), while taking into account the user criteria. Our technique is faster than existing simulation and instrumentation techniques because it selects and evaluates just the hot part of the code.

The starting point of our work is the observation that in scientific computing a significant fraction of the execution time is spent in the hot part of the code. We come up with a novel method for quickly (1) selecting the hot part of the code and (2) generating static analysis to decide which dynamic analysis to apply. The static/dynamic analyses are presented to the user in a clear and comprehensible view. They can also be saved in the knowledge base to be used for future experiments. We have implemented our selectivity approach in MAKS-MAQAO and PIN. Currently these two tools are combined and provide results based on user criteria.

In the future, we plan to extend our approach in two important ways. First, we plan to propose an infrastructure to cooperate with parallel information; in this way, it is possible to combine other tools in our infrastructure. We also plan to study tool overhead. With our approach, it will be easier to study this overhead and propose how to reduce it.

REFERENCES

[1] L. Djoudi. MAKS-MAQAO: An Intelligent Integrated Performance Analysis and Optimization Framework. PhD thesis, 2009.
[2] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J-T. Acquaviva. MAQAO: Modular Assembler Quality Analyzer and Optimizer for Itanium 2. Workshop on EPIC Architectures and Compiler Technology, 2005.
[3] J. R. Larus and E. Schnarr. EEL: Machine-Independent Executable Editing. PLDI 1995.
[4] http://sourceforge.net/projects/cprof
[5] http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
[6] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. MICRO 37, Portland, 2004.
[7] Kurt Windisch, Bernd Mohr, and Al Malony. A Brief Technical Overview of the TAU Tools.
[8] L. Adhianto, N. Tallent, J. Mellor-Crummey, M. Fagan, and M. Krentel. HPCToolkit: Performance Tools for Scientific Computing. Journal of Physics: Conference Series, 2008.
[9] Gang-Ryung Uh, Robert Cohn, Bharadwaj Yadavalli, Ramesh Peri, and Ravi Ayyagari. Analyzing Dynamic Binary Instrumentation Overhead. Workshop on Binary Instrumentation and Applications, 2007.
[10] L. Djoudi and L. Kloul. Assembly Code Analysis Using Process Algebra. 5th LNCS European Performance Engineering Workshop (EPEW), 2008.
[11] Intel Corporation. VTune Performance Analyzer. http://www.intel.com/software/products/vtune
[12] Jack Dongarra, Kevin S. London, Shirley Moore, Philip Mucci, Daniel Terpstra, Haihang You, Min Zhou. Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters. IPDPS 2003.
[13] Luiz De Rose, Ted Hoover Jr. and Jeffrey K. Hollingsworth. The Dynamic Probe Class Library: An Infrastructure for Developing Instrumentation for Performance Tools. IPDPS 2001: 66.
[14] B. R. Buck and J. K. Hollingsworth. An API for Runtime Code Patching. Journal of High Performance Computing Applications, 317-329, 1994.
[15] Amitabh Srivastava and Alan Eustace. ATOM - A System for Building Customized Program Analysis Tools. PLDI 1994: 196-205.
[16] J. Mellor-Crummey, R. Fowler and G. Marin. HPCView: A Tool for Top-Down Analysis of Node Performance. Computer Science Institute Second Annual Symposium, Santa Fe, NM, October 2001.
[17] N. Mukherjee, G.D. Riley and J.R. Gurd. FINESSE: A Prototype Feedback-Guided Performance Enhancement System. Parallel and Distributed Processing (PDP) 2000, Rhodes, Greece, January 2000.
[18] W. Zhao, B. Cai, D. Whalley, M. Bailey, R. van Engelen, X. Yuan, J. Hiser, J. Davidson, K. Gallivan and D. Jones. Vista: A System for Interactive Code Improvement. Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems, pages 155-164. ACM Press, 2002.
[19] Optimizing Your Application with Shark 4. http://developer.apple.com/tools/shark optimize.html
[20] S. Shende, Allen D. Malony, A. Morris. Optimization of Instrumentation in Parallel Performance Evaluation Tools. PARA'06: Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing.
[21] O. Hernandez, H. Jin, B. Chapman. Compiler Support for Efficient Instrumentation. PARA'07: Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing.
[22] E. Sun, D. Kaeli. Binary Instrumentation Tool for the Blackfin Processor. WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications.
