
Parallel Performance Analysis with Open|SpeedShop: Half Day Tutorial @ SC 2008, Austin, TX


Page 1

Parallel Performance Analysis with Open|SpeedShop
Half Day Tutorial @ SC 2008, Austin, TX

Page 2

What is Open|SpeedShop?

A comprehensive open source performance analysis framework
Combines profiling and tracing
Common workflow for all experiments
Flexible instrumentation
Extensible through plugins

Partners: DOE/NNSA Tri-Labs (LLNL, LANL, SNL), the Krell Institute, and the Universities of Wisconsin and Maryland

Page 3

Highlights

Open Source Performance Analysis Tool Framework
The most common performance analysis steps, all in one tool
Extensible by using plugins for data collection and representation

Several instrumentation options
All work on unmodified application binaries
Offline and online data collection / attach to running applications

Flexible and easy to use
User access through GUI, command line, and Python scripting

Large range of platforms
Linux clusters with x86, IA-64, Opteron, and EM64T CPUs
Easier portability with the offline data collection mechanism

Availability
Used at all three ASC labs with lab-size applications
Full source available on sourceforge.net

Page 4

O|SS Target Audience

Programmers/code teams
Use Open|SpeedShop out of the box
Powerful performance analysis
Ability to integrate O|SS into projects

Tool developers
Single, comprehensive infrastructure
Easy deployment of new tools
Project/product-specific customizations
Predefined/custom experiments

Page 5

Tutorial Goals

Introduce Open|SpeedShop
Basic concepts & terminology
Running first examples

Provide an overview of features
Sampling & tracing in O|SS
Performance comparisons
Parallel performance analysis

Overview of advanced techniques
Interactive performance analysis
Scripting & Python integration

Page 6

"Rules"

Let's keep this interactive
Feel free to ask as we go along
Online demos as we go along

Feel free to play along
Live CDs with O|SS installed (for PCs)
Ask us if you get stuck

Feedback on O|SS
What is good/missing in the tool?
What should be done differently?
Please report bugs/incompatibilities

Page 7

Presenters

Martin Schulz, LLNL
Jim Galarowicz, Krell
Don Maghrak, Krell
David Montoya, LANL
Scott Cranford, Sandia

Larger team:
William Hachfeld, Krell
Samuel Gutierrez, LANL
Joseph Kenny, Sandia
Chris Chambreau, LLNL

Page 8

Outline

Introduction & Overview
Running a First Experiment
O|SS Sampling
Simple Comparisons
Break (30 minutes)
I/O Tracing Experiments
Parallel Performance Analysis
Installation Requirements and Process
Advanced Capabilities

Page 9

Section 1: Overview & Terminology
Parallel Performance Analysis with Open|SpeedShop

Page 10

Experiment Workflow

(Workflow diagram) Run the application under an "Experiment", which consists of one or more data "Collectors". Results are stored in an SQL database and can be displayed using several "Views". A Process Management Panel controls the profiled processes.

Page 11

High-level Architecture

(Architecture diagram) User interfaces (GUI, CLI, pyO|SS) sit on top of Experiments, Code Instrumentation, and supporting Open Source Software, targeting AMD- and Intel-based clusters/SSI systems running Linux.

Page 12

Performance Experiments

Concept of an experiment
What to measure and analyze?
Experiment chosen by the user
Any experiment can be applied to any code

Consists of Collectors and Views
Collectors define specific data sources, e.g., hardware counters or tracing of certain routines
Views specify data aggregation and presentation
Multiple collectors per experiment are possible

Page 13

Experiment Types in O|SS

Sampling experiments
Periodically interrupt the run and record the location
Report the statistical distribution of these locations
Typically provides a good overview
Overhead mostly low and uniform

Tracing experiments
Gather and store individual application events, e.g., function invocations (MPI, I/O, ...)
Provides detailed, low-level information
Higher overhead, potentially bursty

Page 14

Sampling Experiments (* Updated)

PC Sampling (pcsamp)
Record the PC at user-defined time intervals
Low-overhead overview of the time distribution

User Time (usertime)
PC sampling + call stacks for each sample
Provides inclusive & exclusive timing data

Hardware Counters (hwc, hwctime)
Sample HWC overflow events
Access to data like cache and TLB misses

Page 15

Tracing Experiments (* Updated)

I/O Tracing (io, iot)
Record the invocation of all POSIX I/O events
Provides aggregate and individual timings

MPI Tracing (mpi, mpit, mpiotf)
Record the invocation of all MPI routines
Provides aggregate and individual timings

Floating Point Exception Tracing (fpe)
Triggered by any FPE caused by the code
Helps pinpoint numerical problem areas

Page 16

Parallel Experiments

O|SS supports MPI and threaded codes
Tested with a variety of MPI implementations
Thread support based on POSIX threads
OpenMP only supported through POSIX threads

Any experiment can be parallel
Automatically applied to all tasks/threads
Default views aggregate across all tasks/threads
Data from individual tasks/threads is available

Specific parallel experiments (e.g., MPI)

Page 17

High-level Architecture

(Architecture diagram, repeated) GUI, CLI, and pyO|SS interfaces over Experiments, Code Instrumentation, and Open Source Software on AMD- and Intel-based Linux clusters/SSI systems.

Page 18

Code Instrumentation in O|SS

Dynamic instrumentation through DPCL
Data delivered directly to the tool online
Ability to attach to a running application

Page 19

Initial Solution: DPCL

(Diagram) The MPI application is instrumented via DPCL, which feeds data directly to O|SS; noted drawbacks are a communication bottleneck and OS limitations.

Page 20

Code Instrumentation in O|SS

Dynamic instrumentation through DPCL
Data delivered directly to the tool online
Ability to attach to a running application

Offline/external data collection
Instrument the application at startup
Write data to raw files and convert to O|SS

Page 21

Offline Data Collection

(Diagram) In offline mode the MPI application writes raw data that O|SS reads post-mortem, shown alongside the online DPCL path.

Page 22

Code Instrumentation in O|SS

Dynamic instrumentation through DPCL
Data delivered directly to the tool online
Ability to attach to a running application

Offline/external data collection
Instrument the application at startup
Write data to raw files and convert to O|SS

Scalable data collection with MRNet
Similar to DPCL, but with a scalable transport layer
Ability for interactive online analysis

Page 23

Hierarchical Online Collection

(Diagram) The MRNet path adds a hierarchical, scalable transport between the MPI application and O|SS, shown alongside the offline (post-mortem) and DPCL paths.

Page 24

Code Instrumentation in O|SS

Dynamic instrumentation through DPCL
Data delivered directly to the tool online
Ability to attach to a running application

Offline/external data collection
Instrument the application at startup
Write data to raw files and convert to O|SS

Scalable data collection with MRNet
Similar to DPCL, but with a scalable transport layer
Ability for interactive online analysis

Page 25

High-level Architecture

(Architecture diagram, repeated) GUI, CLI, and pyO|SS interfaces over Experiments, Code Instrumentation, and Open Source Software on AMD- and Intel-based Linux clusters/SSI systems.

Page 26

Three Interfaces

Experiment commands: expAttach, expCreate, expDetach, expGo, expView
List commands: listExp, listHosts, listStatus
Session commands: setBreak, openGui

Python scripting example:

import openss

my_filename = openss.FileList("myprog.a.out")
my_exptype = openss.ExpTypeList("pcsamp")
my_id = openss.expCreate(my_filename, my_exptype)

openss.expGo()

my_metric_list = openss.MetricList("exclusive")
my_viewtype = openss.ViewTypeList("pcsamp")
result = openss.expView(my_id, my_viewtype, my_metric_list)

Page 27

Summary

Open|SpeedShop provides comprehensive performance analysis options

Important terminology
Experiments: types of performance analysis
Collectors: data sources
Views: data presentation and aggregation

Sampling vs. Tracing
Sampling: overview data at low overhead
Tracing: details, but at higher cost

Page 28

Section 2: Running your First Experiment
Parallel Performance Analysis with Open|SpeedShop

Page 29

Running your First Experiment

What do we mean by an experiment?

Running a very basic experiment
What does the command syntax look like?
What are the outputs from the experiment?

Viewing and interpreting the gathered measurements

Introducing additional command syntax

Page 30

What is an Experiment?

The concept of an experiment
Identify the application/executable to profile
Identify the type of performance data that is to be gathered
Together they form what we call an experiment

Application/executable
Doesn't need to be recompiled, but needs a -g type option to associate gathered data with functions and/or statements

Type of performance data (metric)
Sampling based (program counter, call stack, # of events)
Tracing based (wrap functions and record information)

Page 31

Basic experiment syntax

openss -offline -f "executable" pcsamp

openss is the command to invoke Open|SpeedShop
-offline indicates the user interface to use (immediate command); there are a number of user interface options
-f is the option for specifying the executable name; the "executable" can be a sequential or parallel command
pcsamp indicates what type of performance data (metric) you will gather
Here, pcsamp indicates that we will periodically take a sample of the address the program counter is pointing to and associate that address with a function and/or source line
There are several existing performance metric choices

Page 32

What are the outputs?

Outputs from: openss -offline -f "executable" pcsamp

Normal program output while the executable is running

The sorted list of performance information
A list of the top time-taking functions
The corresponding sample-derived time for each function

A performance information database file
The database file contains all the information needed to view the data at any time in the future, without the executable(s):
Symbol table information from the executable(s) and system libraries
The performance data openss gathered
Time stamps for when DSOs were loaded and unloaded

Page 33

Example Run with Output (* Updated)

openss -offline -f "orterun -np 128 sweep3d.mpi" pcsamp

Page 34

Example Run with Output, continued (* Updated)

openss -offline -f "orterun -np 128 sweep3d.mpi" pcsamp

Page 35

Using the Database file (* Updated)

The database file is one of the outputs from running: openss -offline -f "executable" pcsamp
Use this file to view the data

How to open the database file with openss:
openss -f <database file name>
openss (then use the menus or wizard to open it)
openss -cli, then exprestore -f <database file name>

In this example, we show both:
openss -f X.0.openss (GUI)
openss -cli -f X.0.openss (CLI)
X.0.openss is the file name openss creates by default
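A minimal CLI sketch of reopening the default database and printing the default view, using only the commands listed above and on the following slides:

# open the saved database in the command-line interface
openss -cli -f X.0.openss
# then, at the openss>> prompt:
expstatus
expview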

Page 36

Output from Example Run (* NEW)

Loading the database file: openss -cli -f X.0.openss

Page 37

Process Management Panel (* NEW)

openss -f X.0.openss
Control your job, focus the stats panel, and create process subsets

Page 38

GUI view of gathered data (* Updated)

Page 39

Associate Source & Data (* Updated)

Page 40

Additional experiment syntax

openss -offline -f "executable" pcsamp
-offline indicates the user interface is immediate command mode
Uses the offline (LD_PRELOAD) collection mechanism

openss -cli -f "executable" pcsamp
-cli indicates the user interface is the interactive command line
Uses the online (dynamic instrumentation) collection mechanism

openss -f "executable" pcsamp
No option indicates the graphical user interface
Uses the online (dynamic instrumentation) collection mechanism

openss -batch < input.commands.file
Executes from a file of CLI commands
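As a sketch of the batch form, assuming a database saved from an earlier run (file names are placeholders) and using only CLI commands shown elsewhere in this tutorial:

# input.commands.file: reopen a saved database, check status, print the default view
exprestore -f X.0.openss
expstatus
expview

# run the command file non-interactively
openss -batch < input.commands.file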

Page 41

Demo of Basic Experiment

Run Program Counter Sampling (pcsamp) on a sequential application.

Page 42

Section 3: Sampling Experiments
Parallel Performance Analysis with Open|SpeedShop

Page 43

Sampling Experiments

PC Sampling
Approximates CPU time for each line and function
No call stacks

User Time
Inclusive vs. exclusive time
Includes call stacks

HW Counters
Samples hardware counter overflows

Page 44

Sampling - Considerations

Sampling records a statistical subset of all events
Low overhead
Low perturbation
Good for getting an overview / finding hotspots

Page 45

Example 1

Offline usertime experiment on smg2000:

[samuel@yra084 test]$ openss -cli
Welcome to OpenSpeedShop 1.6
openss>>RunOfflineExp("mpirun -np 16 smg2000 -n 100 100 100","usertime")

(Callouts: "usertime" is the experiment; the mpirun command line is the application.)

Page 46

Example 1 - Views

Default View
Values aggregated across all ranks
Manually include/exclude individual processes

Page 47

Example 1 - Views, cont.

Load Balance View
Calculates min, max, and average across ranks, processes, or threads

Page 48

Example 1 - Views, cont. (* Updated)

Butterfly View
Callers of hypre_SMGResidual
Callees of hypre_SMGResidual

Page 49

Example 1, cont. (* Updated)

Source code mapping

Page 50

Example 2 - hwc

Offline hwc experiment on gamess:

openss -offline -f "./rungms tools/fmo/samples/PIEDA 01 1" hwc

-OR-

openss -cli
Welcome to OpenSpeedShop 1.6
openss>>RunOfflineExp("./rungms tools/fmo/samples/PIEDA 01 1","hwc")

Page 51

Example 2 - hwc, cont.

Offline hwc experiment: default view

Page 52

Section 4: Simple Comparisons
Parallel Performance Analysis with Open|SpeedShop

Page 53

Simple comparisons

Views provide basic "default" views
Sorted by statements/functions/objects
Summed across all tasks

How to get more insight?
Compare results from different sources
Combine multiple metrics

O|SS allows users to customize views
Independent of experiment or metric
Predefined "canned" views
User-defined views

Page 54

Compare Wizard (* NEW)

Compare entire experiments against each other
Select the compare experiments option from the Intro Wizard
Select experiments from the dialog page
Follow the wizard instructions
Creates a side-by-side compare panel

Example shown in the next slides
Create & run the executable from the "orig" directory
Copy the source to a "modified" directory & change it
Run the new "modified" version & compare the results

Page 55

Compare Wizard: Select Compare Saved Performance Data (* NEW)

Page 56

Compare Wizard: Choose the experiments you want to compare (* NEW)

Page 57

Compare Wizard: Side-by-side performance results (* NEW)

Page 58

Compare Wizard: Go to the source for the selected function (* NEW)

Page 59

Compare Wizard: Side-by-side source for the two versions (* NEW)

Page 60

Customizing Views

Pick your own views
Compare entire experiments against each other
Combine any data from any active experiment
Select metrics provided by any collector
Restrict data to arbitrary node sets

Configurable through the "Customize Stats Panel" menu option
Activate from the stats panel context menu or select the "CC" icon in the toolbar
Allows definition of the content for each column

Page 61

Terminology: Views / Columns

(Screenshot with callouts labeling a View and a Column)

Page 62

Designing a New View

Decide on what to compare
Load all necessary experiments
Create all columns using the context menu

Select data for each column
Pick the experiment/collector/metric
Select the process and thread set, either by drag and drop from the Process Management Panel or via the Process Add/Remove dialog box

"Focus StatsPanel" to activate the view
Found in the context menu of the Customize View StatsPanel

Page 63

Customize Comparison Panel

Activate from the context menu or select the icon

Page 64

Customize Comparison Panel

Use the "Load another experiment" menu item to load additional experiments

Page 65

Customize Comparison Panel

Choose the experiment to load

Page 66

Customize Comparison Panel

Add a compare column ("Add compare column" menu item)

Page 67

Customize Comparison Panel

Choose the experiment to compare in the new column (2); "Available Experiments" now has two choices

Page 68

Customize Comparison Panel

Choose "Focus StatsPanel" to create the new panel; this creates a comparison stats panel

Page 69

Comparing Experiments

View after choosing "Focus StatsPanel"
Data from the "pcsamp" experiment with ID 1
Data from the "hwc" experiment with ID 2

Page 70

Summary

Simple comparison of two different metrics on the same executable

Views can be customized
Enables detailed analysis

Customized views to contrast
Results from multiple experiments
Multiple metrics from several collectors

Page 71

Section 5: I/O Tracing Experiments
Parallel Performance Analysis with Open|SpeedShop

Page 72

I/O Tracing - Overview

What does IO tracing record?
I/O events
Intercepts all calls to I/O functions
Records the current stack trace & start/end time

What about IOT tracing?
Collects additional information
Function parameters
Function return value

Page 73

I/O Tracing - Overview, cont. (* Updated)

What I/O events are recorded?
read, readv, pread, pread64
write, writev, pwrite, pwrite64
open, open64
close
pipe, dup, dup2
creat, creat64
lseek, lseek64

Page 74

I/O Tracing - Considerations

Trace experiments
Collect large amounts of data
More overhead (compared to sampling)
More perturbation (compared to sampling)
Allow for fine-grained analysis

Page 75

Example 1

Offline IOT experiment on gamess:

openss -offline -f "./ddikick.x gamess.01.x tools/efp/lysine -ddi 1 1 lanz -scr /scr/scranfo" iot

-OR-

openss -cli
Welcome to OpenSpeedShop 1.6
openss>>RunOfflineExp("./ddikick.x gamess.01.x tools/efp/lysine -ddi 1 1 lanz -scr /scr/scranfo","iot")

Page 76

Example 1, cont.

View experiment status (CLI): expstatus

Page 77

Example 1, cont.

View experiment output (CLI): expview
CLI view help: help expview
View experiment output (GUI): opengui

Page 78

Example 1, cont. - expview

Show the top N I/O calls: expview -m count iotN

openss>>expview -m count iot6
Number of Calls  Function (defining location)
6994  __libc_write (/lib/libpthread-2.4.so)
 548  llseek (/lib/libpthread-2.4.so)
 384  __libc_read (/lib/libpthread-2.4.so)
   8  __libc_close (/lib/libpthread-2.4.so)
   5  __libc_open64 (/lib/libpthread-2.4.so)
   5  __dup (/lib/libc-2.4.so)

Page 79

Example 1, cont. - expview

Show the time spent in a particular function: expview -f <functionName>

openss>>expview -f __libc_read
Exclusive I/O Call Time(ms)  % of Total  Function (defining location)
11.742272                    0.416965    __libc_read (/lib/libpthread.so)

Page 80

Example 1, cont. - expview

Show the call count for a particular function: expview -f <functionName> -m count

openss>>expview -f __libc_read -m count
Number of Calls  Function (defining location)
384  __libc_read (/lib/libpthread-2.4.so)

Page 81

Example 1, cont. - expview

Show the return value & start time for the top N calls:
expview -v trace -f <function name> -m retval -m start_time iotN

openss>>expview -v trace -f __libc_read -m retval -m start_time iot5
Function Dependent Return Value  Start Time           Call Stack Function (defining location)
32720                            2008/07/08 15:36:37  >>__libc_read (/lib/libpthread-2.4.so)
32720                            2008/07/08 15:36:38  >>__libc_read (/lib/libpthread-2.4.so)
20960                            2008/07/08 15:36:37  >>__libc_read (/lib/libpthread-2.4.so)
 8192                            2008/07/08 15:36:38  >>__libc_read (/lib/libpthread-2.4.so)
 8192                            2008/07/08 15:36:38  >>__libc_read (/lib/libpthread-2.4.so)

Page 82

Example 1, cont. - expview (* Updated)

Show the top N most time-consuming call trees: expview -v calltrees,fullstack iotN

openss>>expview -v calltrees,fullstack -m count,time iot1
Number of Calls  Exclusive I/O Call Time(ms)  Call Stack Function (defining location)
                                              _start (gamess.01.x)
                                              ...
                                              >>>>main (gamess.01.x)
                                              >>>>> @ 545 in MAIN__ (gamess.01.x: gamess.f,352)
                                              >>>>>> @ 741 in brnchx_ (gamess.01.x: gamess.f,695)
                                              >>>>>>> @ 989 in energx_ (gamess.01.x: gamess.f,777)
                                              >>>>>>>> @ 1742 in wfn_ (gamess.01.x: gamess.f,1637)
                                              ...
                                              >>>>>>>>>>>>>>>find_or_create_unit (libgfortran.so.1.0.0)
                                              >>>>>>>>>>>>>>>>find_or_create_unit (libgfortran.so.1.0.0)
845              167.248327                   >>>>>>>>>>>>>>>>>__libc_write (libpthread-2.4.so)

Page 83

Example 1, cont. - opengui (* Updated)

Page 84

Example 2 - Parallel App.

Let's look at a parallel example: an offline IO experiment on smg2000

[samuel@yra131 smg2000]$ openss -cli
Welcome to OpenSpeedShop 1.6
openss>>RunOfflineExp("mpirun -np 16 smg2000 -n 100 100 100","io")
...
openss>>opengui

Page 85

Example 2 - Parallel App. (* Updated)

Page 86

Summary

I/O collectors
Intercept all calls to I/O functions
Record the current stack trace & start/end time
Can collect detailed ancillary data (IOT)

Trace experiments
Collect large amounts of data
Allow for fine-grained analysis

Page 87

Section 6: Parallel Performance Analysis
Parallel Performance Analysis with Open|SpeedShop

Page 88

Experiment Types

Open|SpeedShop is designed to work on parallel jobs
Focus here: parallelism using MPI

Sequential experiments
Apply the experiment/collectors to all nodes
By default, display aggregate results
Optionally select individual groups of processes

MPI experiments
Tracing of MPI calls
Can be combined with sequential collectors

Page 89

Configuration

Multiple MPI versions are possible for the Open|SpeedShop MPI and VampirTrace collectors

OPENSS_MPI_LAM           set to the LAM/MPI installation dir
OPENSS_MPI_OPENMPI       set to the Open MPI installation dir
OPENSS_MPI_MPICH         set to the MPICH installation dir
OPENSS_MPI_MPICH2        set to the MPICH2 installation dir
OPENSS_MPI_MPICH_DRIVER  mpich/mpich2 driver name [ch-p4]
OPENSS_MPI_MPT           set to the SGI MPT installation dir
OPENSS_MPI_MVAPICH       set to the MVAPICH installation dir

Specify during installation (using the install.sh script)
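For example, a minimal sketch of setting one of these before running the install script; the /opt/openmpi path is just an example, matching the build-environment slide later in this tutorial:

# point O|SS at the MPI installation to be wrapped (example path)
export OPENSS_MPI_OPENMPI=/opt/openmpi
# then run the installation script mentioned above
./install.sh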

Page 90

MPI Job Start & Attach (offline)

MPI process control
Link the O|SS collector into the MPI application

Examples:
openss -offline -f "mpirun -np 4 sweep3d.mpi" pcsamp
openss -offline -f "srun -N 4 -n 16 sweep3d.mpi" pcsamp
openss -offline -f "orterun -np 16 sweep3d.mpi" usertime

Page 91

Parallel Result Analysis

Default views
Values aggregated across all ranks
Manually include/exclude individual processes

Rank comparisons
Use the Customize Stats Panel view
Create columns for process groups

Cluster analysis
Automatically creates process groups of similar processes
Available from the Stats Panel context menu

Page 92

Viewing Results by Process

(Screenshot: choice of ranks)

Page 93

MPI Tracing

Similar to I/O tracing
Record all MPI call invocations
By default: record call times (mpi)
Optional: record all arguments (mpit)

Equal events will be aggregated
Saves space in the database
Reduces overhead

Future plans: full MPI traces in a public format
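As a sketch, the MPI tracing experiments plug into the same offline command pattern used throughout the tutorial (the application command line is taken from the earlier smg2000 example):

# record aggregated MPI call times
openss -offline -f "mpirun -np 16 smg2000 -n 100 100 100" mpi
# additionally record the arguments of each MPI call
openss -offline -f "mpirun -np 16 smg2000 -n 100 100 100" mpit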

Page 94

Tracing Results: Default View (* Updated)

Page 95

Tracing Results: Event View (* Updated)

Page 96

Tracing Results: Creating the Event View (* Updated)

Page 97

Tracing Results: Specialized Event View

Page 98

Results / Show: Callstacks

Page 99

Predefined Analysis Views

O|SS provides common analysis functions
Designed for quick analysis of MPI applications
Create new views in the StatsPanel
Accessible through the context menu or toolbar

Load Balance View
Calculates min, max, and average across ranks, processes, or threads

Comparative Analysis View
Uses a "cluster analysis" algorithm to group like-performing ranks, processes, or threads

Page 100

Quick Min, Max, Average View

Select "LB" in the toolbar

Page 101

Comparative Analysis: Clustering Ranks

Select "CA" in the toolbar

Page 102

Comparing Ranks (1)

Use the Customize StatsPanel
Create columns for each process set to compare
Select the set of ranks for each column

Page 103

Comparing Ranks (2)

(Screenshot: Rank 0 and Rank 1 shown in separate columns)

Page 104

Summary

Open|SpeedShop manages MPI jobs
Works with multiple MPI implementations
Process control using the MPIR interface (dynamic)

Parallel experiments
Apply sequential collectors to all nodes
Specialized MPI tracing experiments

Results
By default, aggregated across ranks
Optional: select individual processes
Compare or group ranks & specialized views


Section 7: System Requirements & Installation

Parallel Performance Analysis with Open|SpeedShop


System Requirements

System Architecture
AMD Opteron/Athlon
Intel x86, x86-64, and Itanium-2

Operating System
Tested on many popular Linux distributions
SLES
RHEL
Fedora Core
Etc.


System Software Prerequisites

libelf (devel)
System-specific version highly recommended

Python (devel)

Qt3 (devel)

Typical Linux development environment
GNU Autotools
GNU libtool
Etc.
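
As an illustration, on an RPM-based system these prerequisites can usually be installed through the package manager. The package names below are assumptions and differ between SLES, RHEL, and Fedora releases (for example, libelf may be packaged as elfutils-libelf-devel), so treat this only as a sketch:

# Illustrative only; adjust package names for your distribution
yum install elfutils-libelf-devel python-devel qt3-devel
yum install gcc-c++ autoconf automake libtool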


Getting The Source

Sourceforge Project Home
http://sourceforge.net/projects/openss

CVS Access
http://sourceforge.net/cvs/?group_id=176777

Packages
Accessible from the project home Download tab

Additional Information
http://www.openspeedshop.org/
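
As a sketch, an anonymous SourceForge CVS checkout of that era typically followed the pattern below; the exact CVSROOT and module name (assumed here to be OpenSpeedShop) should be confirmed on the CVS access page above:

# Verify the CVSROOT and module name against the SourceForge CVS page
cvs -d:pserver:anonymous@openss.cvs.sourceforge.net:/cvsroot/openss login
cvs -z3 -d:pserver:anonymous@openss.cvs.sourceforge.net:/cvsroot/openss checkout OpenSpeedShop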


Preparing The Build Environment

Important Environment Variables
OPENSS_PREFIX
OPENSS_INSTRUMENTOR
OPENSS_MPI_<IMPL>
QTDIR

Simple bash Example
export OPENSS_PREFIX=/home/skg/local
export OPENSS_INSTRUMENTOR=mrnet
export OPENSS_MPI_OPENMPI=/opt/openmpi
export QTDIR=/usr/lib64/qt-3.3


Building OpenSpeedShop

Preparing the directory structure for the install script
Needed if OSS & OSS_ROOT were pulled from CVS
Rename OpenSpeedShop to openspeedshop-1.6
tar -czf openspeedshop-1.6.tar.gz openspeedshop-1.6
Note: in general, openspeedshop-<currentVersion>
Move openspeedshop-1.6 to OSS_ROOT/SOURCES

Navigate to OpenSpeedShop_ROOT. We now have two options:
Use install.sh to build prerequisites/OSS interactively
Use option 9 for an automatic build
./install.sh [--with-option] [OPTION]
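
Putting these steps together, a CVS-based preparation might look like the following sketch. OSS_ROOT is a placeholder for your checkout of the install-script tree, and placing the tarball under SOURCES is an assumption based on the layout described above:

# Prepare the source tarball expected by install.sh
mv OpenSpeedShop openspeedshop-1.6
tar -czf openspeedshop-1.6.tar.gz openspeedshop-1.6
mv openspeedshop-1.6.tar.gz $OSS_ROOT/SOURCES

# Run the installer interactively (option 9 performs an automatic build)
cd $OSS_ROOT
./install.sh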


Installation Script Overview


Basic Installation Structure


Post-Installation Setup

Important Environment Variables (All Instrumentors)
OPENSS_PREFIX
Path to the installation directory
OPENSS_PLUGIN_PATH
Path to the directory where plugins are stored
OPENSS_MPI_IMPLEMENTATION
If there are multiple MPI implementations, this points openss at the one you are using in your application
QTDIR, LD_LIBRARY_PATH, PATH
Qt installation directory and the usual Linux path variables


Post-Installation Setup Cont.

Simple bash Example

export OPENSS_PREFIX=/home/skg/local
export OPENSS_MPI_IMPLEMENTATION=openmpi
export OPENSS_PLUGIN_PATH=$OPENSS_PREFIX/lib64/openspeedshop
export LD_LIBRARY_PATH=$OPENSS_PREFIX/lib64:$LD_LIBRARY_PATH
export QTDIR=/usr/lib64/qt-3.3
export PATH=$OPENSS_PREFIX/bin:$PATH


Post-Installation Setup (MRNet)

Additional MRNet-specific environment variables
OPENSS_MRNET_TOPOLOGY_FILE
Additional info: http://www.paradyn.org/mrnet
DYNINSTAPI_RT_LIB
MRNET_RSH
XPLAT_RSHCOMMAND
OPENSS_RAWDATA_DIR

Simple bash Example
export OPENSS_MRNET_TOPOLOGY_FILE=/home/skg/oss.top
export DYNINSTAPI_RT_LIB=$OPENSS_PREFIX/lib64/libdyninstAPI_RT.so.1
export MRNET_RSH=ssh
export XPLAT_RSHCOMMAND=ssh
export OPENSS_RAWDATA_DIR=/panfs/scratch2/vol3/skg/rawdata


Post-Installation Setup (Offline)

Additional offline-specific environment variables
OPENSS_RAWDATA_DIR

OPENSS_RAWDATA_DIR must be defined within the user's environment before OpenSpeedShop is invoked. If OPENSS_RAWDATA_DIR is not defined in the user's current terminal session before the invocation of OpenSpeedShop, then all raw performance data files will be placed in /tmp/$USER/offline-oss; that is, OPENSS_RAWDATA_DIR will automatically be initialized to /tmp/$USER/offline-oss for the duration of the OpenSpeedShop session.

NOTE: In general, it is best if a user selects a target directory located on scratch space.

Simple bash Example
export OPENSS_RAWDATA_DIR=/panfs/scratch2/vol3/skg/rawdata


Section 8: Advanced Capabilities

Parallel Performance Analysis with Open|SpeedShop


Advanced Features Overview

Dynamic instrumentation and online analysis
Pros & cons, or what to use when
Interactive performance analysis wizards

Advanced GUI features

Scripting Open|SpeedShop
Using the command line interface
Integrating O|SS into Python

Plugin concept to extend Open|SpeedShop

Future plugin plans


Interactive Analysis

Dynamic Instrumentation
Works on binaries
Add/change instrumentation at runtime

Hierarchical Communication
Efficient broadcast of commands
Online data reduction

Interactive Control
Available through GUI and CLI
Start/stop/adjust data collection

(Diagram: O|SS connects to the MPI application through MRNet.)


Pros & Cons

Advantage: New capabilities
Attach to/detach from a running job
Intermediate/partial results

Advantage: Easier use
Control directly from one environment

Disadvantage: Longer wait times
Less information available at database creation time
Binary parsing is expensive

Disadvantage: Stability
Techniques are low-level and OS-specific


Using Dynamic Instrumentation

MRNet instrumentor
Can be installed in parallel with the offline capabilities
Site-specific daemon launch instructions (site.py)

Launch control from within O|SS
Select the binary at experiment creation

Attach option from within O|SS
Select the PID of the target process at experiment creation

Alternative option: Wizards
The GUI offers wizards for all default experiments
Easy-to-use walk-through of all options


Wizard Panels (1)

Gather data from new runs

Analyze and/or compare existing data from previous runs

O|SS Command Line Interface


Wizard Panels (2)

Select type of data to be gathered by Open|SpeedShop


Using the Wizards

Wizards guide you through the standard steps
Access to all available experiments
Options in separate screens
Both attaching to and starting applications are possible

Example: pcsamp Wizard
Select “… where my program spends time …”
Select the sampling rate (100 is a good default)
Select the program to run or attach to
Ensure all online daemons are started (site.py)
Review and select Finish


Dynamic MPI Job Start & Attach

MPI process control
Starting: depends on the MPI launcher
Uses the TotalView MPIR interface for MPI-1

Examples
Example MPICH: run on the actual binary
Example SLURM: run on “srun <application>”

Attach: attach to a process with an MPIR table
Select “-v mpi” or check “Attach to MPI job”
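
As a hedged illustration of the SLURM example above (the launch string, rank count, and experiment name are placeholders that depend on your site and MPI stack), creating and running an MPI tracing experiment from the interactive CLI could look like this:

openss -cli
openss>>expcreate -v mpi -f "srun -n 32 ./app" mpi
openss>>expgo
openss>>expview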


Process Management

Execution Control

Process Overview

Process Details


Experiment Control

Process management panel
Access to all tasks/processes/threads
Information on status and location

Execution control
Start/pause/resume/stop jobs

Select process groups for views

Collector control
Add additional collectors to tasks
Create custom experiments


General GUI Features

GUI panel management
Peel off and rearrange any panel
Color-coded panel groups per experiment

Context-sensitive menus
Right-click at any location
Access to different views
Activate additional panels

Access to the source location of events
Double-click on the stats panel
Opens the source panel with (optional) statistics


Leaving the GUI

Three different options to run without the GUI
Equal functionality
Can transfer state/results

Interactive Command Line Interface
openss -cli

Batch Interface
openss -batch < openss_cmd_file
openss -batch -f <exe> <experiment>

Python Scripting API
python openss_python_script_file.py


CLI Language

An interactive command line interface
gdb/dbx-like processing

Several interactive commands
Create experiments
Provide process/thread control
View experiment results

Where possible, commands execute asynchronously

http://www.openspeedshop.org/docs/cli_doc/


User-Time Example

lnx-jeg.americas.sgi.com-17> openss -cli
openss>>Welcome to OpenSpeedShop 1.9
openss>>expcreate -f test/executables/fred/fred usertime
The new focused experiment identifier is: -x 1
openss>>expgo
Start asynchronous execution of experiment: -x 1
openss>>Experiment 1 has terminated.

Create experiments and load application

Start application


Showing CLI Results

openss>>expview
Excl CPU time  Inclu CPU time  % of Total Exclusive  Function
in seconds.    in seconds.     CPU Time              (defining location)
       5.2571          5.2571               49.7297  f3 (fred: f3.c,2)
       3.3429          3.3429               31.6216  f2 (fred: f2.c,2)
       1.9714          1.9714               18.6486  f1 (fred: f1.c,2)
       0.0000         10.5714                0.0000  __libc_start_main (libc.so.6)
       0.0000         10.5714                0.0000  _start (fred)
       0.0000         10.5429                0.0000  work (fred: work.c,2)
       0.0000         10.5714                0.0000  main (fred: fred.c,5)


CLI Batch Scripting (1)

Create a batch file with CLI commands
Plain text file
Example:

# Create batch file
echo expcreate -f fred pcsamp >> input.script
echo expgo >> input.script
echo expview pcsamp10 >> input.script

# Run OpenSpeedShop
openss -batch < input.script


CLI Batch Scripting (2)

Open|SpeedShop Batch Example Results

The new focused experiment identifier is: -x 1
Start asynchronous execution of experiment: -x 1
Experiment 1 has terminated.

  CPU Time  Function (defining location)
   24.2700  f3 (mutatee: mutatee.c,24)
   16.0000  f2 (mutatee: mutatee.c,15)
    8.9400  f1 (mutatee: mutatee.c,6)
    0.0200  work (mutatee: mutatee.c,33)


CLI Batch Scripting (3)

Open|SpeedShop Batch Example: direct

# Run Open|SpeedShop as a single non-interactive command
openss -batch -f fred pcsamp

The new focused experiment identifier is: -x 1
Start asynchronous execution of experiment: -x 1
Experiment 1 has terminated.

  CPU Time  Function (defining location)
   24.2700  f3 (mutatee: mutatee.c,24)
   16.0000  f2 (mutatee: mutatee.c,15)
    8.9400  f1 (mutatee: mutatee.c,6)
    0.0200  work (mutatee: mutatee.c,33)


CLI Integration in GUI

Program Output & Command Line Panel


Command Line Panel

Integrated with the output panel
O|SS command line language
Optional Python support

Access to all loaded experiments
Same context as the GUI
Use the experiment ID listed in the panel name

“History” command
Display all commands issued by the GUI
Cut & paste the output for scripting


History Command


CLI Command Overview

Experiment Creation
– expcreate
– expattach

Experiment Control
– expgo
– expwait
– expdisable
– expenable

Experiment Storage
– expsave
– exprestore

Result Presentation
– expview
– opengui

Misc. Commands
– help
– list
– log
– record
– playback
– history
– quit
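
For example, to illustrate how results move between sessions and interfaces, a previously saved experiment database might be reloaded in the CLI and then handed to the GUI. The filename is a placeholder, and the exact expsave/exprestore argument forms are assumptions that should be checked against the CLI documentation linked earlier:

openss -cli
openss>>exprestore -f fred-pcsamp.openss
openss>>expview
openss>>opengui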


Python Scripting

Open|SpeedShop Python API that executes the “same” interactive/batch Open|SpeedShop commands

Users can intersperse “normal” Python code with the Open|SpeedShop Python API

Run Open|SpeedShop experiments via the Open|SpeedShop Python API


Python Example (1)

Necessary steps:
Import the O|SS Python module
Prepare arguments for the target application
Set the view and experiment type
Create the experiment

import openss

my_filename = openss.FileList("usability/phaseII/fred")
my_viewtype = openss.ViewTypeList()
my_viewtype += "pcsamp"
exp1 = openss.expCreate(my_filename, my_viewtype)


Python Example (2)

After experiment creation
Start the target application (asynchronous!)
Wait for completion
Write results

try:
    openss.expGo()
    openss.wait()
except openss.error:
    print "expGo(exp1, my_modifier) failed"

openss.dumpView()


Python Example Output

Two interfaces to dump data
Plain text (similar to the CLI) for viewing
As Python objects for post-processing

>python example.py
/work/jeg/OpenSpeedShop/usability/phaseII/fred: successfully completed.

Excl. CPU time  % of CPU Time  Function (def. location)
        4.6700        47.7994  f3 (fred: f3.c,23)
        3.5100        35.9263  f2 (fred: f2.c,2)
        1.5900        16.2743  f1 (fred: f1.c,2)


Summary

Multiple non-graphical interfaces
Interactive command line
Batch scripting
Python module

Equal functionality
Similar commands in all interfaces

Results transferable
E.g., run in Python and view in the GUI
Possibility to switch GUI ↔ CLI


Extensibility

O|SS is more than a performance tool
All functionality in one toolset with one interface
General infrastructure to create new tools

Plugins to add new functionality
Cover all essential steps of performance analysis
Automatically loaded at O|SS startup

Three types of plugins
Collectors: How to acquire performance data?
Views: How to aggregate and present data?
Panels: How to visualize data in the GUI?


Plugin Dataflow

(Diagram: the Wizard, Collector, View, and Panel plugins, the Instrumentor, the Data Abstraction layer, the CLI, and the SQL database make up the plugin dataflow within the execution environment.)


Plugin Plans

Existing extensions to base experiments
MPI tracing directly to OTF
Extended parameter capture for MPI and I/O traces

New plugins planned
Code coverage
Memory profiling
Extended hardware counter profiling

Interested in creating your own plugins?
Step 1: Design the tool workflow and map it to plugins
Step 2: Define the data format (XDR encoding)
Examples and a plugin creation guide are available


Summary / Advanced Features

Two techniques for instrumentation
Online vs. offline
Different strengths for different target scenarios

Flexible GUI that can be customized

Several compatible scripting options
Command line language
Direct batch interface
Integration of O|SS into Python

GUI and scripting are interoperable

Plugin concept to extend Open|SpeedShop


Conclusions

Parallel Performance Analysis with Open|SpeedShop


Open|SpeedShop

Goal: One-stop performance analysis
All basic experiments in one tool
Extensible using the plugin mechanism
Supports parallel programs

Multiple interfaces
Flexible GUI
Multiple scripting and batch options

Rich features
Comparing results
Tracebacks/callstacks in most experiments
Full I/O tracing support


Documentation

Open|SpeedShop User Guide Documentation
http://www.openspeedshop.org/docs/users_guide/
/opt/OSS/share/doc/packages/OpenSpeedShop/users_guide
Where /opt/OSS is the installation directory

Python Scripting API Documentation
http://www.openspeedshop.org/docs/pyscripting_doc/
/opt/OSS/share/doc/packages/OpenSpeedShop/pyscripting_doc
Where /opt/OSS is the installation directory

Command Line Interface Documentation
http://www.openspeedshop.org/docs/cli_doc/
/opt/OSS/share/doc/packages/OpenSpeedShop/cli_doc
Where /opt/OSS is the installation directory


Current Status

Open|SpeedShop 1.9 available
Packages and source from sourceforge.net
Tested on a variety of platforms

Keep in mind, though:
Open|SpeedShop is still under development
Low-level, OS-specific components

Keep us informed
We are happy to help with problems
Interested in getting feedback


Building a Community

More than yet another research tool
Large-scale environment
Flexible framework
Cooperative atmosphere

Wanted:
New users and open source tool developers
Tools/software companies

Goal: a self-sustaining project


Availability and Contact

Open|SpeedShop website:
http://www.openspeedshop.org/

Download options:
Package with install script
Source for the tool and base libraries

Feedback
Bug tracking available from the website
Contact information on the website
Feel free to contact the presenters directly