
Tool for Performance Tuning and Regression Analyses of HPC Systems and Applications

Saumil Merchant, Giri Prabhakar
High Performance Computing, IBM India,

<smerchant, [email protected]>

Abstract—Increasing sophistication in High Performance Computing (HPC) system architectures, software, and user environments has substantially increased system complexity. For example, tuning an application on a given platform to maximize performance requires experimenting with a multitude of optimization flags and environment variables. This is typically a highly repetitive and ad hoc process of trying out different combinations of variable settings and manually comparing the results to find the optimum. The same holds for performance regression analyses of systems and applications, where one is interested in detecting performance regressions in the software version under test and analyzing the causes to fix issues. In both scenarios, the process involves creating a patchwork of scripts to deploy jobs, extract meaningful data from raw outputs, arrange this data in some reportable format for analysis, and perform tweaks to enable subsequent iterations. When repeated over time, this ad hoc process results in users re-writing similar sets of scripts again and again for different applications, architectures, or even new software builds on the same architecture, resulting in significant wastage of productive man-hours. This paper presents JACE (Job Auto-Creator and Executor), a tool that automates the creation and execution of complex functional and performance regression tests. JACE aims to address many pain points in performance engineering such as tedious scripting, parameter tuning, careful book-keeping, frequent debugging, assessing reliability of results, and comparative evaluation. The paper introduces the tool and describes its architecture and workflow. It also presents a sample walk-through of a performance regression analysis using JACE.

Index Terms—Performance Regression, Performance Analysis, Application Tuning, Application Optimization, High Performance Computing.

I. INTRODUCTION

Performance testing, analysis, and benchmarking of HPC systems is a highly repetitive process involving multiple iterations with different sets of input parameters to find the right combination that achieves optimal results. For each application or benchmark, the performance analysis cycle involves creating a patchwork of scripts to deploy jobs, extract meaningful data from raw outputs, arrange this data in some reportable format for analysis, and perform tweaks to enable subsequent iterations. Over time, this ad hoc process results in analysts re-writing similar sets of scripts again and again for different benchmarks, architectures, or even new software builds on the same architecture, resulting in significant wastage of productive man-hours. It takes significant extra effort to organize and document daily results and scripts in this ad hoc process to enable re-use of results, scripts, and configurations. Answering questions such as how do my current results on this version of the system/software compare with results from 3Q last year, or which configuration settings I had used to generate results last month for benchmark X, is an extremely tedious process of going through a myriad of directory trees to find the right data and/or scripts. Even after this step, it is very tedious to do a point-to-point comparison of results for evaluation. For a meaningful performance comparison, it is necessary to compare not just the output data, but also the ‘input’, viz., the benchmark configurations (see Figure 1). This information is often buried under layers of scripts. Also, since the process was ad hoc to begin with, it is highly likely that the input configurations are not standardized across the job executions under comparison to enable a simple point-to-point comparative analysis. Thus, significant manual effort is required to analyze and compare results from different sets of benchmark job executions.

Figure 1. Performance regression analysis process

This paper presents JACE (Job Auto-Creator and Executor), a tool that automates the creation and execution of complex functional and performance regression tests1 [1]. JACE interfaces with third-party job deployment and analysis tools and brings in configuration management to organize and automate job deployment and analysis. It aims to address many pain points in performance engineering such as tedious scripting, parameter tuning, careful book-keeping, frequent debugging, assessing reliability of results, and comparative evaluation. Simply put, JACE is a job configuration management tool with the following features:

• Assists in and automates the generation of complex performance analysis tests by allowing users to specify configuration parameters.

• Records a complete history of jobs executed over time and enables new jobs to inherit parameters from previously executed jobs. New jobs can then be quickly executed with minimal tweaks to configuration parameters as needed.

• Automates the generation of complex parameter sweeps without requiring any additional scripting from the user.

• Enables the generation of complex loops and conditional control of experiments with minimal scripting.

• Compares groups of complex jobs point-to-point and produces detailed spreadsheet and web reports with tables and charts.

1 JACE was previously called EPMT; hence some of the variables mentioned in this paper still use the older notation EPMT_varname.

978-1-4673-2371-0/12/$31.00 ©2012 IEEE

II. RELATED WORK

Not much has been reported in the literature on tools for automating the creation, deployment, and management of complex functional and performance regression tests for HPC applications and benchmarks. A quick search on the web reveals many tools that do functional and performance testing of small codes, typically for transactional computing. A list and brief explanations of these tools can be found in [2]. For HPC application performance analysis there are many tools within IBM that test at various depths and are organized through HPC Tools [3]. However, these offer a deeper level of performance analysis involving hardware performance counters, code profiling, etc., and serve a functionality complementary to JACE, which enables the creation, deployment, and analysis of performance regression tests. Similar tools are also available from other vendors such as Intel, as well as various tools from the open-source community [4], [5]. Platform Computing (recently acquired by IBM) has a well-developed suite of tools, IBM Platform HPC [6]. There are some conceptual similarities between JACE and the IBM Platform HPC tools, such as form-based job configuration and input, but there are also distinct differences. One distinct difference is that JACE automates workflow generation of complex parameter-sweep type experiments, whereas this appears to be manual in the Platform Computing toolsets. The Platform HPC suite is geared more towards job deployment and cluster management, whereas JACE focuses on application and system tuning and optimization. Both of these tools provide configuration management for jobs, although Platform Application Center is a more mature product in this arena as compared to JACE. Performance Regression Manager (PRM) is another performance monitoring tool, developed internally at IBM by Faraj et al. [9]. It is used for performance regression monitoring of HPC systems. JACE complements this tool by providing front-end configuration management for it. This is discussed further in the next section.

III. JACE: CONCEPT, ARCHITECTURE, AND DESIGN

Figure 2. JACE interface and functionality overview

JACE is a job configuration management tool that interfaces with third-party job deployment and analysis engines (JADE) such as PRM (Performance Regression Manager, previously called Performance Monitoring Tool), an internal IBM tool, to provide an end-to-end solution for performance regression analysis and management (see Figure 2) [9]. PRM is a performance monitoring tool developed by Faraj et al. JACE automates the creation of scripts that drive PRM's interface for job deployment and execution. Although PRM provides more functionality for job output analysis and comparison, JACE currently only uses PRM's job execution and web reporting capabilities. For performance regression analysis and spreadsheet reporting, JACE currently uses custom-built third-party tools. These will be discussed in later sections. JACE accepts inputs from the user via two types of application forms, or stencils: a configuration stencil and an environment parameter stencil. Using these, it generates various execution and analysis scripts which drive the JADE interface for deployment, execution, and analysis.
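As a rough illustration, a pair of stencils might look like the following sketch. Some field names are taken from examples later in this paper (EPMT_NUM_NODES, EPMT_TASKS_PER_NODE, MP_SHARED_MEMORY, MP_EAGER_LIMIT), while the others are hypothetical placeholders; the actual JACE form markers are not shown here.

# configuration stencil (illustrative sketch only)
EPMT_NUM_NODES="4"
EPMT_TASKS_PER_NODE="8"
EPMT_LAUNCH_NODE="node01"                            # hypothetical field name
EPMT_COMPUTE_NODES="node01 node02 node03 node04"     # hypothetical field name
EPMT_BINARY="/path/to/application"                   # hypothetical field name

# environment parameter stencil (illustrative sketch only)
MP_SHARED_MEMORY="yes"
MP_EAGER_LIMIT="64"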

JACE bridges an important gap in performance and regression analysis by addressing many pain points which exist in the current ad hoc methodology. It streamlines and automates job creation, deployment, and evaluation, and provides overall job configuration management. It addresses the following pain points:

1) Tedious Scripting: JACE automates the generation of scripts for job execution and deployment. With JACE, users can create complex functional and performance tests with minimal to zero scripting. Using job stencils, users can customize and specify job execution and analysis related information such as the number of tasks, tasks per node, job launch node, list of compute nodes, various environment parameters, application binary name and location, input files for the application, job comparison thresholds, etc. Once these are specified, JACE functions as a turn-key appliance that can create the relevant scripts, deploy jobs, collect and log outputs, and generate reports.

2) Application/Parameter Tuning: Users can automatically create complex parameter sweeps by merely specifying the parameter ranges in the job stencils. JACE automatically loops over the ranges and executes all the tests. As an example, consider evaluating the impact of the MP_SHARED_MEMORY environment variable, which has the two options 'yes' and 'no', on a particular application under test. Option 'yes' enables the application to use available shared memory within a node for communication between processes executing on different cores within the node. Option 'no' forces all inter-process communications to go through the network software stack. One can merely specify the two options in the job stencil form as a range, MP_SHARED_MEMORY="no yes". JACE will automatically create a loop over the two options and create scripts to execute both tests under the same job, once using MP_SHARED_MEMORY=no and a second time using MP_SHARED_MEMORY=yes. In fact, if we now specify a range on another variable, say TASKS_PER_NODE="2 4 8", JACE will create nested loops across these two variables and execute all six combinations of the two variables, viz. (2, no); (2, yes); (4, no); (4, yes); (8, no); (8, yes) (see the sketch following this list). Similarly, one can do scalability tests for an application by specifying a range on the number of tasks, MP_TASKS. JACE will execute all combinations, and store and log all the results.

3) Book-keeping of jobs and results: JACE maintains databases of all created and launched jobs. Each event, such as stencil creation, job deployment, and comparisons performed, is logged in these databases. At the end of each job execution, JACE prompts the user to input information on the executed job. This is recorded and stored as notes for that particular job launch event. Thus, users can view the notes for launched jobs at a later date to get unique information about a job. This is in addition to the job stencil forms and output/comparison data, which are duly stored in appropriate directories, providing careful book-keeping.

4) Frequent Debugging & Reliability of Results: Once the target application is set up, scripts for job creation, deployment, and comparison are automated. Thus, zero to minimal scripting is required for all subsequent job deployments, eliminating scripting errors and saving all the debugging effort. Features such as stencil syntax checking, failure point analysis, and system health checks are currently at various stages of implementation in JACE. These will significantly enhance the reliability of results.

5) Comparative Analyses and Reports: A regression analysis tool is incomplete without the ability to compare and evaluate results from two or more runs side-by-side. JACE enables the creation of scripts that drive the interface of third-party custom-built scripts for job comparison and analysis. These enable point-to-point comparison of two jobs and produce reports. As discussed in Section I (refer to Figure 1), for a meaningful performance comparison it is important to compare not just the output data but the input as well, viz. the benchmark configurations. Since JACE standardizes the input parameter formats via stencils, it enables easy comparison between two job stencils along with result comparisons for root cause analysis.

6) Potential Loss of Performance Insights: Haphazard maintenance of performance records results in valuable comparison job information being hidden across many layers of directories and scripts. This makes it significantly difficult to trace, retrieve, and compare historical (and cross-configuration) performance trends. Also, since a lot of automatable work such as scripting is done manually, significant productive man-hours are wasted which could have been better utilized analyzing the performance data and generating insights. Since JACE automates job deployment and performance comparisons, and systematically stores all the performance records, all productive hours can be better invested in analysis and the generation of performance insights.

Figure 3. JACE high-level user workflow

Figure 4. JACE job launch workflow
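The parameter-sweep behaviour described in pain point 2 above can be pictured with the following sketch. It is illustrative only: the variable names follow the example in that item, and the loop is a plain shell rendering of the kind of sweep JACE generates, not JACE's actual output.

MP_SHARED_MEMORY="no yes"
TASKS_PER_NODE="2 4 8"
for tpn in $TASKS_PER_NODE; do
    for smem in $MP_SHARED_MEMORY; do
        echo "run: TASKS_PER_NODE=$tpn MP_SHARED_MEMORY=$smem"   # six runs in total
    done
done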

A. JACE Design and Workflow

Figure 3 illustrates the high-level user workflow for JACE, starting from integrating a new application (or a benchmark) to generating job execution and analysis reports. At a high level it is a five-step process involving the use of five different components of JACE.

Figure 5. Performance job launch framework

Not all five steps need to be followed for every single job execution by the user. The first two steps of integrating an application and creating scripts only need to be performed once, at the time of adding a new application or benchmark to JACE. JACE organizes all launched jobs and their reports under Application Stencils. An Application Stencil is a base stencil from which all jobs launched within it inherit configuration and environment parameters. To launch a new job for an application, the user needs to specify the Application Stencil he wishes to use and give a name or title to identify the job. The Job Launcher component automatically creates new forms for the job by inheriting all the parameters from the specified Application Stencil. The user can then edit these forms to customize the job parameters as required and launch the job. The launch script copies the required files over to the compute node specified in the configuration form, runs the job, organizes the output, and copies it back to the launch node. This is illustrated in Figure 4. New jobs can also inherit parameters from previously launched jobs instead of from the Application Stencil. Thus, users can quickly launch new jobs with minor tweaks. This is very useful in many scenarios. For example, to compare the effect of shared memory on job performance, one can launch two jobs which are almost identical (the second job inherits from the first) except for the value of the MP_SHARED_MEMORY environment parameter, which is set to 'no' and 'yes' respectively for the two jobs. The results of the two jobs can then be compared point-to-point using JACE and analyzed. Currently, JACE uses PRM for executing performance jobs and a custom-built comparison engine for launching comparison jobs. The performance and comparison job launch processes are illustrated in Figures 5 and 6.
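Conceptually, the second job is just a copy of the first with one value overridden. Outside of JACE the idea could be sketched roughly as below; the directory and file names are hypothetical, and within JACE this inheritance is performed by the Job Launcher through the stencil forms rather than by hand.

# derive a 'shared memory on' job from an existing 'shared memory off' job (hypothetical paths)
cp -r jobs/app_shmem_no jobs/app_shmem_yes
sed -i 's/^MP_SHARED_MEMORY=.*/MP_SHARED_MEMORY="yes"/' jobs/app_shmem_yes/env.stencil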

Thus, an Application Stencil groups all similar jobs together in a single bin. An application can have multiple Application Stencils under it. New Application Stencils can be added using the Script Builder tool. The launch and execution scripts for an application are created when a new Application Stencil is created. Hence, all jobs under a stencil have a fixed subset of configuration parameters which cannot be changed at the job launch level, such as application-specific command line and run options. To change this fixed subset of parameters, a new Application Stencil must be created. All other parameters, such as all environment variables, number of tasks, tasks per node, compute nodes, launch node, etc., are configurable at the job launch level. The jobs within a stencil can thus vary from each other based on the values of these configurable parameters.

Job stencil forms are format sensitive, and inadvertently deleting specific form markers may alter their behavior. Hence, a tool, the JACE Stencil Assistant, has been developed which assists users with editing the configuration parameters within the stencil forms. Although one can directly edit these forms using text editors such as vi or emacs, it is highly recommended that novice users use the JACE Stencil Assistant to make any changes to the job forms. The Stencil Assistant is fully integrated into the JACE framework and is part of the Stencil Builder component of JACE.

B. JACE Architecture

Table I lists all the components of JACE and gives a short description of their functions. Due to the page limit we will not go into a detailed description of each component. Section III-A described the function and use of some of the components, such as the Application Builder, Job Launcher, Stencil Builder, and Script Builder. We briefly explain the function of some other components of JACE in this section.

Figure 6. Comparison job launch framework

Table I
JACE COMPONENTS AND THEIR FUNCTIONALITY

Application Builder: Framework and tools for integrating new applications into JACE. Manages creation of application-specific scripts for processing.

Script Builder: Tools for generating an automated environment for third-party analysis and deployment engines (JADE). Builds driver, launch, run, and processing scripts.

Stencil Builder: Manages the stencil creator, editor, and hierarchy. Provides a menu-driven interface for stencil editing. Manages the Stencil Database.

Job Launcher: Manages setup and launch of jobs. Currently supports launching performance and comparison jobs. Uses job forms and JADE interface scripts to launch jobs using JADE.

Report Generator: Generates job-specific reports from the outputs of JADE.

User Interface: Currently supports two user interfaces, viz. a command-line and a menu-driven interface.

Events Database: Logs for application builds, script builds, and job launch events.

Stencils/Forms Database: Database of existing job forms and stencils. New forms can be created by inheriting existing stencils.

Reports Database: Database of job execution and comparison reports.

1) Script Builder: This is the core component of JACE. Its base function was briefly described in Section III-A above. This component uses the job stencils as input and generates scripts which drive the JADE interfaces. It generates four types of scripts, viz. driver, launch, run, and processing scripts. One of its core functions, which enables it to automate the generation of complex functional and performance tests, is its ability to generate different types of loops. These are discussed next. It should be noted that all of these types of loops can co-exist in a single job.

• Loop over an export parameter: This feature enables users to create complex environment parameter sweeps. For example, varying the eager limit over a range:

MP_EAGER_LIMIT="16 32 64 128 256"


JACE automatically creates the required execution loops to execute all combinations of the tests. This feature was discussed in detail in Section III above.

• Runtime command line loops: This feature enables users to create loops over run-time parameters, such as varying the tasks-per-node or the number of nodes, or looping over different benchmarks, e.g. benchmarks within IMB such as Alltoall, Bcast, Allreduce, etc., where

tasks_per_node="1 2 4 8 ..." and EPMT_bm="allreduce allgather ..."

• Nested loops: Nested loops are created when one specifies ranges for multiple parameters. For example, specifying ranges on the EPMT_NUM_NODES and EPMT_TASKS_PER_NODE variables will create nested loops as shown below:

# specify the following ranges on variables in configuration stencil
EPMT_NUM_NODES="2 4 8"
EPMT_TASKS_PER_NODE="1 2 4"

Script Builder will automatically create nested loops in the execute script as shown below:

[...]
for epmt_num_nodes in $EPMT_NUM_NODES; do
    for epmt_tasks_per_node in $EPMT_TASKS_PER_NODE; do
        run benchmark
    done
done
[...]

The example above shows how easily one can execute scaling tests on applications/benchmarks using JACE without the need for any scripting.

• Conditional loops: Conditional loops are loops that inherit values dependent on other loop indices. Here is an example that shows how to test and compare individual collective algorithms.

# specify algorithms to test for each collective in configuration form
Alltoall="I1:Alltoall:P2P:P2P I0:Pairwise:P2P:P2P I0:M2MComposite:P2P:P2P"
Allreduce="I1:Allreduce:P2P:P2P I0:Binomial:P2P:P2P"
EPMT_BM_GROUP="Alltoall Allreduce"
EPMT_symbolic_bm="`eval echo \\\\$\\$EPMT_BM_GROUP`"

Script Builder will automatically create conditional loops in the execute script as shown below. Thus, depending on the value of the variable $EPMT_BM_GROUP, i.e. Alltoall or Allreduce, the respective algorithms will be selected for execution.

[...]
for EPMT_bm_group in $EPMT_BM_GROUP; do
    for EPMT_symbolic_bm in `eval echo \\$\$EPMT_bm_group`; do
        run benchmark
    done
done
[...]

2) User Interface: Currently, JACE supports two user interfaces. The first is the command-line interface for advanced users who may prefer using all of JACE's functionality directly from the command prompt, using commands such as those discussed in the sections above. The second is the menu-driven interface. This interface presents users with menus through which they can browse and execute all of JACE's functionality, such as adding new applications, creating and deploying performance and comparison jobs, generating reports, and other management tasks.

3) Report Generator: The Report Generator component generates reports for the outputs produced by the Job Launcher component. Currently it produces reports in three output formats, viz. text, spreadsheet, and web. The text report is natively generated by the component with the help of user-provided post-processing scripts. For the spreadsheet and web formats it uses third-party tools for report generation. The spreadsheet format uses custom Perl scripts built on the Spreadsheet::WriteExcel and Spreadsheet::ParseExcel Perl packages to generate the report [8]. The web report uses PRM's web reporting capability. Section IV shows some sample screenshots of the spreadsheet report.

4) JACE Databases: JACE has two types of databases. The first type stores content and includes the Reports Database and the Stencils/Forms Database. The Reports Database stores all the reports generated by the Report Generator component. The Stencils/Forms Database stores all the stencils of launched jobs. Currently, both of these databases are organized under a hierarchical, browsable directory structure. A more mature and searchable database of reports is under construction. The other type of database in JACE is the event log. The events which JACE logs are job launch events (performance and comparison), report generation events, and stencil creation/deletion events.

IV. SAMPLE PERFORMANCE REGRESSION CASE STUDY

In this section we present a sample case study of evaluating three PAMI [10] Broadcast collective algorithms against the default baseline Broadcast algorithm using the Intel MPI Benchmark suite (IMB) [7]. IMB was the first benchmark integrated into JACE and as such comes as the pre-configured default application in JACE. The test setup is as follows. We execute two separate jobs, viz. Baseline and Contender. The Baseline job is set to execute the default PAMI Broadcast algorithm, whereas the Contender job executes the three different PAMI Broadcast algorithms. Since the objective of the case study is to demonstrate the tool, we will not name these three PAMI algorithms. It should be noted that not much effort was spent on optimizing any of these job runs, since, as mentioned before, the objective was to demonstrate the tool. So the numbers reported may not be the best possible numbers for the collective algorithms. For the same reason the system configuration is also not revealed here. It should also be noted that the images in Figures 7 and 8 have been edited to remove any references to the specific algorithms. The following parameters need to be set in the configuration stencils for the baseline and contender configurations:

Bcast="baseline"                        # in case of baseline job
Bcast="pami_alg1 pami_alg2 pami_alg3"   # in case of contender job

The following are some common settings for the baseline and contender job stencils.

BM_GROUP="Bcast"

- - - - - -

EPMT_bm_group="$BM_GROUP"
EPMT_symbolic_bm="`eval echo \\\\$\\$EPMT_bm_group`"

The following lines select the specific PAMI algorithm in the case of the contender job and the default for the baseline job.

export MP_S_PAMI_FOR=$EPMT_bm_group

if [[ $EPMT_symbolic_bm != "baseline" ]]; then
    MP_S_$EPMT_bm_group=$EPMT_symbolic_bm
fi

Both of these job configurations are executed, and the output data is compared point-to-point by JACE, producing comparison worksheets. Sample worksheets are shown in Figures 7 and 8. Figure 8 shows the summary worksheet, which gives a single-page summary of the comparisons. Looking at the summary worksheet, we can see that all the contender PAMI algorithms perform better than the default baseline for almost all message sizes. Note that since the %dev numbers are for t_avg, a negative deviation means that the contender did better than the baseline (PASS), and a positive deviation means that the baseline did better than the contender (FAIL).
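As an illustration of this sign convention, and assuming %dev is computed as the relative change in t_avg of the contender with respect to the baseline, i.e.

%dev = 100 × (t_avg(contender) − t_avg(baseline)) / t_avg(baseline),

a contender average latency of 9.0 µs against a baseline of 10.0 µs gives %dev = −10%, which is reported as a PASS. The exact formula used by the comparison engine is not given here, so this is only an illustration of the convention.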

V. CONCLUSIONS

Increasing complexity for HPC users and analysts, brought about by advances in system architectures, software stacks, and environments, has resulted in many ad hoc procedures in performance analysis and benchmarking. JACE, presented in this paper, aims to streamline and automate this process, thus bridging a significant gap in the performance analysis and regression testing cycle. It is a tool for automating complex functional and performance regression analyses of HPC applications and benchmarks. JACE is developed and maintained by the IBM India HPC Cluster Performance team. Many new features, such as a job scheduler, enhanced reporting, and failure point analysis, are currently being added to the tool, enhancing its usability and functionality.

REFERENCES

[1] G. Prabhakar, S. Merchant, "An Environment for Automating HPC Application Deployment", The IBM HPC Systems Scientific Computing User Group, 2012.

[2] Load Testing: http://en.wikipedia.org/wiki/Load_testing

[3] HPC Toolkit: http://researcher.ibm.com/view_project.php?id=2754

[4] L. Adhianto et al., "HPCToolkit: Tools for performance analysis of optimized parallel programs", Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010.

[5] Philip Mucci, "Open Source Performance Tools for HPC/Linux", The 3rd High Performance Computing in the Arctic Conference, Tromso, Norway, December 2007.

[6] Platform HPC: http://www.platform.com/cluster-computing/platform-hpc

[7] S. Saini et al., "Performance evaluation of supercomputers using HPCC and IMB benchmarks", 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 25-29 April 2006.

[8] J. McNamara, Perl Spreadsheet::WriteExcel and Spreadsheet::ParseExcel packages: www.cpan.org

[9] D. A. Faraj, S. Bodda, K. Davis, "PRM: Performance Regression Manager for Large Scale Systems", to be submitted to the 27th IEEE International Parallel & Distributed Processing Symposium, Cambridge, Boston, Massachusetts, USA.

[10] PAMI: https://wiki.alcf.anl.gov/parts/index.php/PAMI

Figure 7. Sample output worksheet

Figure 8. Sample Summary Worksheet