The TAU Performance System
Sameer Shende, Allen D. Malony, Robert Bell
University of Oregon

The TAU Performance System, NRL Monterey, CA, Dec. 2003

Outline

• Motivation
• Part I: What is TAU?
• Performance Analysis and Visualization with TAU
  • Pprof
  • ParaProf
  • ParaVis
  • Performance Database
• Part II: Using TAU – a tutorial
• Part III: A Case Study – Uintah [U. Utah] (if time permits)
• Conclusion

Research Motivation

• Tools for performance problem solving
  • Empirical-based performance optimization process
  • Performance technology concerns

[Figure: empirical performance optimization process with the stages Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation, linked by hypotheses, properties, and characterization. Supporting performance technology: instrumentation, measurement, analysis, visualization, experiment management, performance database.]

TAU Performance System

• Tuning and Analysis Utilities (11+ year project effort)
• Performance system framework for scalable parallel and distributed high-performance computing
• Targets a general complex system computation model
  • nodes / contexts / threads
  • Multi-level: system / software / parallelism
  • Measurement and analysis abstraction
• Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  • Portable performance profiling and tracing facility
  • Open software approach with technology integration
• University of Oregon, Forschungszentrum Jülich, LANL

TAU Performance System Goals

• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid
• Support for performance mapping
• Support for object-oriented and generic programming
• Integration in complex software systems and applications

Definitions – Profiling

• Profiling
  • Recording of summary information during execution
    • inclusive, exclusive time, # calls, hardware statistics, …
  • Reflects performance behavior of program entities
    • functions, loops, basic blocks
    • user-defined “semantic” entities
  • Very good for low-cost performance assessment
  • Helps to expose performance bottlenecks and hotspots
  • Implemented through
    • sampling: periodic OS interrupts or hardware counter traps
    • instrumentation: direct insertion of measurement code

Definitions – Tracing

• Tracing
  • Recording of information about significant points (events) during program execution
    • entering/exiting code region (function, loop, block, …)
    • thread/process interactions (e.g., send/receive message)
  • Save information in event record
    • timestamp
    • CPU identifier, thread identifier
    • Event type and event-specific information
  • Event trace is a time-sequenced stream of event records
  • Can be used to reconstruct dynamic program behavior
  • Typically requires code instrumentation
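The event record and the time-sequenced stream described on this slide can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `EventRecord` layout and the `merge_traces` helper are hypothetical names chosen here, not TAU's trace format or API, and the sketch assumes the per-CPU timestamps are already comparable (no clock adjustment).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event record holding the fields listed on the slide.
struct EventRecord {
    std::uint64_t timestamp;  // when the event occurred
    std::string cpu;          // CPU identifier
    int thread;               // thread identifier
    std::string type;         // event type, e.g. "ENTER", "SEND"
    std::string info;         // event-specific information
};

// Merge per-CPU event streams into one time-sequenced trace.
// stable_sort keeps the input order of records that share a timestamp.
std::vector<EventRecord> merge_traces(
        const std::vector<std::vector<EventRecord>>& streams) {
    std::vector<EventRecord> merged;
    for (const auto& stream : streams) {
        merged.insert(merged.end(), stream.begin(), stream.end());
    }
    auto by_time = [](const EventRecord& a, const EventRecord& b) {
        return a.timestamp < b.timestamp;
    };
    std::stable_sort(merged.begin(), merged.end(), by_time);
    assert(std::is_sorted(merged.begin(), merged.end(), by_time));
    return merged;
}
```

Feeding it one stream per CPU yields a single list ordered by timestamp, which is the form a timeline display would consume.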

Event Tracing: Instrumentation, Monitor, Trace

Event definition:
  1  master
  2  slave
  3  ...

CPU A:
  void master() {
    trace(ENTER, 1);
    ...
    trace(SEND, B);
    send(B, tag, buf);
    ...
    trace(EXIT, 1);
  }

CPU B:
  void slave() {
    trace(ENTER, 2);
    ...
    recv(A, tag, buf);
    trace(RECV, A);
    ...
    trace(EXIT, 2);
  }

MONITOR (merged trace; the first column is the timestamp):
  58  A  ENTER  1
  60  B  ENTER  2
  62  A  SEND   B
  64  A  EXIT   1
  68  B  RECV   A
  ...
  69  B  EXIT   2
  ...

Event Tracing: “Timeline” Visualization

[Figure: timeline visualization of the same event trace (58 A ENTER 1 through 69 B EXIT 2). One row per CPU (A, B), time axis from 58 to 70, legend: main, master, slave.]

General Complex System Computation Model

• Node: physically distinct shared memory machine
  • Message passing node interconnection network
• Context: distinct virtual memory space within node
• Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes with node memory, joined by an interconnection network) beside the model view (nodes containing contexts, each a VM space with threads, linked by inter-node message communication).]

TAU Performance System Architecture

[Figure: TAU Performance System architecture diagram; surviving labels: EPILOG, Paraver.]

Strategies for Empirical Performance Evaluation

• Empirical performance evaluation as a series of performance experiments
  • Experiment trials describing instrumentation and measurement requirements
  • Where/When/How axes of empirical performance space
    • where are performance measurements made in program: routines, loops, statements…
    • when is performance instrumentation done: compile-time, while pre-processing, runtime…
    • how are performance measurement/instrumentation chosen: profiling with hw counters, tracing, callpath profiling…

TAU Instrumentation Approach

• Support for standard program events
  • Routines
  • Classes and templates
  • Statement-level blocks
• Support for user-defined events
  • Begin/End events (“user-defined timers”)
  • Atomic events
  • Selection of event statistics
• Support definition of “semantic” entities for mapping
• Support for event groups
• Instrumentation optimization

TAU Instrumentation

• Flexible instrumentation mechanisms at multiple levels
  • Source code
    • manual
    • automatic
      • C, C++, F77/90/95 (Program Database Toolkit (PDT))
      • OpenMP (directive rewriting (Opari), POMP spec)
  • Object code
    • pre-instrumented libraries (e.g., MPI using PMPI)
    • statically-linked and dynamically-linked
  • Executable code
    • dynamic instrumentation (pre-execution) (DynInstAPI)
    • virtual machine instrumentation (e.g., Java using JVMPI)

Multi-Level Instrumentation

• Targets common measurement interface
  • TAU API
• Multiple instrumentation interfaces
  • Simultaneously active
• Information sharing between interfaces
  • Utilizes instrumentation knowledge between levels
• Selective instrumentation
  • Available at each level
  • Cross-level selection
• Targets a common performance model
• Presents a unified view of execution
  • Consistent performance events

Program Database Toolkit (PDT)

• Program code analysis framework
  • develop source-based tools
• High-level interface to source code information
• Integrated toolkit for source code parsing, database creation, and database query
  • Commercial grade front-end parsers
  • Portable IL analyzer, database format, and access API
  • Open software approach for tool development
• Multiple source languages
• Implement automatic performance instrumentation tools
  • tau_instrumentor

Program Database Toolkit (PDT)

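[Figure: PDT architecture. Application / library source passes through the C/C++ parser or the Fortran (F77/90/95) parser to an intermediate language (IL), then through the C/C++ or Fortran IL analyzer into program database files. DUCTAPE gives tools access to those files: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), TAU_instr (automatic source instrumentation).]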

PDT 3.0 Functionality

• C++ statement-level information implementation
  • for, while loops, declarations, initialization, assignment…
  • PDB records defined for most constructs
• DUCTAPE
  • Processes PDB 1.x, 2.x, 3.x uniformly
• PDT applications
  • XMLgen
    • PDB to XML converter
    • Used for CHASM and CCA tools
  • PDBstmt
    • Statement callgraph display tool

PDT 3.0 Functionality (continued)

• Cleanscape Flint parser fully integrated for F90/95
  • Flint parser is very robust
  • Produces PDB records for TAU instrumentation (stage 1)
    • Linux (x86, IA-64, Opteron, Power4), HP Tru64, IBM AIX, Cray X1, T3E, Solaris, SGI, Apple, Windows
  • Full PDB 2.0 specification (stage 2) [Q1 ’04]
  • Statement level support (stage 3) [Q3 ’04]
• PDT 3.0 released at SC2003
  • URL: http://www.cs.uoregon.edu/research/paracomp/pdtoolkit

TAU Performance Measurement

• TAU supports profiling and tracing measurement
• Robust timing and hardware performance support using PAPI
• Support for online performance monitoring
  • Profile and trace performance data export to file system
  • Selective exporting
• Extension of TAU measurement for multiple counters
  • Creation of user-defined TAU counters
  • Access to system-level metrics
• Support for callpath measurement
• Integration with system-level performance data
  • Linux MAGNET/MUSE (Wu Feng, LANL)

TAU Measurement with Multiple Counters

• Extend event measurement to capture multiple metrics
  • Begin/end (interval) events
  • User-defined (atomic) events
  • Multiple performance data sources can be queried
• Associate counter function list to event
  • Defined statically or dynamically
• Different counter sources
  • Timers and hardware counters
  • User-defined counters (application specified)
  • System-level counters
• Counters must be monotonically increasing for begin/end events
• Extend user-defined counters to system-level counter
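To make the "counter function list" idea concrete, here is a minimal C++ sketch of a begin/end event that samples several counters at once. It assumes each counter is a callable returning a monotonically increasing value; `Counter` and `IntervalEvent` are hypothetical names for illustration and are not part of the TAU API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// A counter source: any callable returning a monotonically increasing value
// (a timer, a hardware counter read, an application-defined count, ...).
using Counter = std::function<long long()>;

// Hypothetical interval event with an associated list of counter functions.
class IntervalEvent {
public:
    explicit IntervalEvent(std::vector<Counter> counters)
        : counters_(std::move(counters)),
          start_(counters_.size(), 0),
          total_(counters_.size(), 0) {}

    // Snapshot every counter at the start of the interval.
    void begin() {
        for (std::size_t i = 0; i < counters_.size(); ++i) {
            start_[i] = counters_[i]();
        }
    }

    // Accumulate the per-counter difference at the end of the interval.
    void end() {
        for (std::size_t i = 0; i < counters_.size(); ++i) {
            long long delta = counters_[i]() - start_[i];
            assert(delta >= 0);  // fails if a counter is not monotonic
            total_[i] += delta;
        }
    }

    const std::vector<long long>& totals() const { return total_; }

private:
    std::vector<Counter> counters_;
    std::vector<long long> start_;
    std::vector<long long> total_;
};
```

The assertion shows why monotonicity matters: the interval value is a difference of two samples, so a counter that can decrease would produce meaningless negative deltas.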

TAU Measurement

• Performance information
  • Performance events
  • High-resolution timer library (real-time / virtual clocks)
  • General software counter library (user-defined events)
  • Hardware performance counters
    • PAPI (Performance API) (UTK, Ptools Consortium)
    • consistent, portable API
• Organization
  • Node, context, thread levels
  • Profile groups for collective events (runtime selective)
  • Performance data mapping between software levels

TAU Measurement Options

• Parallel profiling
  • Function-level, block-level, statement-level
  • Supports user-defined events
  • TAU parallel profile data stored during execution
  • Hardware counter values
  • Support for multiple counters
  • Support for callgraph and callpath profiling
• Tracing
  • All profile-level events
  • Inter-process communication events
  • Trace merging and format conversion

Grouping Performance Data in TAU

• Profile Groups
  • A group of related routines forms a profile group
  • Statically defined
    • TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …
  • Dynamically defined
    • group name based on string, such as “adlib” or “particles”
    • runtime lookup in a map to get unique group identifier
    • uses tau_instrumentor to instrument
  • Ability to change group names at runtime
  • Group-based instrumentation and measurement control
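The runtime lookup of a group name in a map can be sketched as below. This is an illustration only: `GroupRegistry` is a hypothetical class, and representing each group as one bit of a 64-bit mask is an assumption made here so that groups can be switched on and off cheaply; the slide does not say how TAU encodes its identifiers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using GroupId = std::uint64_t;

// Hypothetical registry mapping group names to unique identifiers.
class GroupRegistry {
public:
    // Return the identifier for `name`, allocating a new bit on first use.
    GroupId lookup(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) {
            return it->second;
        }
        assert(next_bit_ < 64);  // this sketch supports at most 64 groups
        GroupId id = GroupId{1} << next_bit_++;
        ids_.emplace(name, id);
        return id;
    }

    // Group-based measurement control: toggle and query whole groups.
    void enable(GroupId id) { enabled_ |= id; }
    void disable(GroupId id) { enabled_ &= ~id; }
    bool is_enabled(GroupId id) const { return (enabled_ & id) != 0; }

private:
    std::unordered_map<std::string, GroupId> ids_;
    unsigned next_bit_ = 0;
    GroupId enabled_ = 0;
};
```

A timer tagged with the identifier of “adlib” would then check `is_enabled` before measuring, which is one way group-based control could work.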

TAU Analysis

• Parallel profile analysis
  • Pprof
    • parallel profiler with text-based display
  • ParaProf
    • Graphical, scalable, parallel profile analysis and display
• Trace analysis and visualization
  • Trace merging and clock adjustment (if necessary)
  • Trace format conversion (ALOG, SDDF, VTF, Paraver)
  • Trace visualization using Vampir (Pallas/Intel)

Pprof Output (NAS Parallel Benchmark – LU)

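[Figure: pprof text output for the LU benchmark, F90 + MPICH on an Intel quad Pentium III Xeon. Profiles are listed per node, context, and thread; events cover application code and MPI.]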

Terminology – Example

For routine “int main( )”:
• Exclusive time: 100-20-50-20 = 10 secs
• Inclusive time: 100 secs
• Calls: 1 call
• Subrs (no. of child routines called): 3
• Inclusive time/call: 100 secs

int main( )
{ /* takes 100 secs */
  f1(); /* takes 20 secs */
  f2(); /* takes 50 secs */
  f1(); /* takes 20 secs */
  /* other work */
}
/* Time can be replaced by counts */
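The arithmetic in this example can be written out as a short C++ sketch. The `RoutineProfile` struct and both helper functions are hypothetical names used for illustration; they show the definitions of exclusive time and inclusive time per call, not how TAU stores its profiles.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical summary of one routine's profile.
struct RoutineProfile {
    double inclusive_secs;                     // time in routine plus children
    long calls;                                // number of times it was called
    std::vector<double> child_inclusive_secs;  // one entry per child call
};

// Exclusive time = inclusive time minus time spent in called children.
double exclusive_secs(const RoutineProfile& p) {
    double children = std::accumulate(p.child_inclusive_secs.begin(),
                                      p.child_inclusive_secs.end(), 0.0);
    return p.inclusive_secs - children;
}

// Inclusive time per call.
double inclusive_per_call(const RoutineProfile& p) {
    assert(p.calls > 0);
    return p.inclusive_secs / static_cast<double>(p.calls);
}
```

For `main` above, the child list is {20, 50, 20}, so the exclusive time is 100 - 90 = 10 seconds and the number of child calls is 3. As the slide notes, the same subtraction applies when the metric is a count instead of time.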

ParaProf (NAS Parallel Benchmark – LU)

[Figure: ParaProf displays for the LU benchmark: global profiles per node, context, thread; routine profile across all nodes; event legend; individual profile.]

TAU + Vampir (NAS Parallel Benchmark – LU)

[Figure: Vampir views of the TAU trace: timeline display, callgraph display, parallelism display, communications display.]

PETSc ex19 (Tracing)

Commonly seen communication behavior

TAU’s EVH1 Execution Trace in Vampir

MPI_Alltoall is an execution bottleneck

TAU Performance System Status

• Computing platforms (selected)
  • IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
• Programming languages
  • C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
• Thread libraries
  • pthreads, SGI sproc, Java, Windows, OpenMP
• Compilers (selected)
  • Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), Compaq, NEC, Intel

Performance Analysis and Visualization

• Analysis of parallel profile and trace measurement
• Parallel profile analysis
  • ParaProf
  • ParaVis
  • Profile generation from trace data
• Performance database framework (PerfDBF)
• Parallel trace analysis
  • Translation to VTF 3.0 and EPILOG
  • Integration with VNG (Technical University of Dresden)
• Online parallel analysis and visualization

ParaProf Framework Architecture

• Portable, extensible, and scalable tool for profile analysis
• Try to offer “best of breed” capabilities to analysts
• Build as profile analysis framework for extensibility

Profile Manager Window

Structured AMR toolkit (SAMRAI++), LLNL

Full Profile Window (Exclusive Time)

[Figure: full profile display, 512 processes]

Node / Context / Thread Profile Window

Derived Metrics

Full Profile Window (Metric-specific)

[Figure: metric-specific full profile display, 512 processes]

ParaProf Enhancements

• Readers completely separated from the GUI
• Access to performance profile database
• Profile translators
  • mpiP, papiprof, dynaprof
• Callgraph display
  • prof/gprof style with hyperlinks
• Integration of 3D performance plotting library
• Scalable profile analysis
  • Statistical histograms, cluster analysis, …
• Generalized programmable analysis engine
• Cross-experiment analysis

ParaVis

• Scalable parallel profile analysis
• Scalable performance displays
  • 3D graphics
• Analysis across profile samples
  • Allow for runtime use
• Animated / interactive visualization
• Initially develop with SCIRun
  • Computational environment
• Performance graphics toolkit
  • Portable plotting library
  • OpenGL

[Figure: ParaVis components: Performance Data Reader, Performance Analyzer, Performance Visualizer.]

Performance Visualization in SCIRun

SCIRun program

EVH1, IBM

EVH1, Linux IA-32

“Terrain” Visualization (Full profile)

[Figure: terrain visualization of a full profile, Uintah]

“Scatterplot” Visualization

• Each point coordinate determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
• Min/Max value range
• Effective for cluster analysis

[Figure: 3D scatterplot, Uintah]

Empirical-Based Performance Optimization

[Figure: the empirical optimization process (Performance Tuning, Performance Diagnosis, Performance Experimentation, Performance Observation, linked by hypotheses, properties, and characterization), extended with observability requirements and experiment management: process, experiment schemas, experiment trials.]

TAU Performance Database Framework

• profile data only
• XML representation
• project / experiment / trial

[Figure: raw performance data passes through PerfDML translators into a performance data description, is stored in PerfDB (ORDB, PostgreSQL), and is reached through the performance analysis and query toolkit by performance analysis programs and other tools.]

PerfDBF Components

• Performance Data Meta Language (PerfDML)
  • Common performance data representation
  • Performance meta-data description
  • PerfDML translators to common data representation
• Performance DataBase (PerfDB)
  • Standard database technology (SQL)
  • Free, robust database software (PostgreSQL, MySQL)
  • Commonly available APIs
• Performance DataBase Toolkit (PerfDBT)
  • Commonly used modules for query and analysis
  • PerfDB API to facilitate analysis tool development

PerfDBF Browser

PerfDBF Cross-Trial Analysis

TAU Performance System Status

• Computing platforms
  • IBM SP, SGI Origin 2K/3K, ASCI Red, Apple, Cray X1, SV1, T3E, HP/Compaq SC, HP Superdome, Sun, Windows, Linux (IA-32, Opteron, IA-64, Alpha…), NEC, Hitachi, …
• Programming languages
  • C, C++, Fortran 77/90, HPF, Java, Python
• Thread libraries
  • pthreads, Java, Windows, Tulip, SMARTS, OpenMP
• Compilers
  • Cray, KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, IBM, HP-Compaq, NEC, Hitachi, HP, Absoft, NAGWare, Intel…
• Version 2.13 available from: http://www.cs.uoregon.edu/research/paracomp/tau

Using TAU – A tutorial

• Configuration
• Instrumentation
  • Manual
  • PDT – Source rewriting for C, C++, F77/90/95
  • MPI – Wrapper interposition library
  • OpenMP – Directive rewriting
  • Binary Instrumentation
    • DyninstAPI – Runtime/Rewriting binary
    • Java – Runtime instrumentation
    • Python – Runtime instrumentation
• Measurement
• Performance Analysis

TAU Performance System Architecture

[Figure: TAU Performance System architecture diagram; surviving labels: EPILOG, Paraver.]

Using TAU

• Install TAU
  % configure ; make clean install
• Instrument application
  • TAU Profiling API
• Typically modify application makefile
  • include TAU’s stub makefile, modify variables
• Set environment variables
  • directory where profiles/traces are to be stored
• Execute application
  % mpirun -np <procs> a.out
• Analyze performance data
  • paraprof, vampir, pprof, paraver …

Description of Optional Packages

• PAPI – Measures hardware performance data, e.g., floating point instructions, L1 data cache misses, etc.
• DyninstAPI – Helps instrument an application binary at runtime or rewrites the binary
• EPILOG – Trace library. Epilog traces can be analyzed by EXPERT [FZJ], an automated bottleneck detection tool.
• Opari – Tool that instruments OpenMP programs
• Vampir – Commercial trace visualization tool [Pallas]
• Paraver – Trace visualization tool [CEPBA]

TAU Measurement System Configuration

configure [OPTIONS]
  {-c++=<CC>, -cc=<cc>}    Specify C++ and C compilers
  {-pthread, -sproc}       Use pthread or SGI sproc threads
  -openmp                  Use OpenMP threads
  -jdk=<dir>               Specify Java instrumentation (JDK)
  -opari=<dir>             Specify location of Opari OpenMP tool
  -papi=<dir>              Specify location of PAPI
  -pdt=<dir>               Specify location of PDT
  -dyninst=<dir>           Specify location of DynInst Package
  -mpi[inc/lib]=<dir>      Specify MPI library instrumentation
  -python[inc/lib]=<dir>   Specify Python instrumentation
  -epilog=<dir>            Specify location of EPILOG

TAU Measurement System Configuration

configure [OPTIONS]
  -TRACE              Generate binary TAU traces
  -PROFILE (default)  Generate profiles (summary)
  -PROFILECALLPATH    Generate call path profiles
  -PROFILESTATS       Generate std. dev. statistics
  -MULTIPLECOUNTERS   Use hardware counters + time
  -CPUTIME            Use usertime+system time
  -PAPIWALLCLOCK      Use PAPI’s wallclock time
  -PAPIVIRTUAL        Use PAPI’s process virtual time
  -SGITIMERS          Use fast IRIX timers
  -LINUXTIMERS        Use fast x86 Linux timers

TAU Measurement Configuration – Examples

./configure -c++=xlC_r -pthread
  • Use TAU with xlC_r and pthread library under AIX
  • Enable TAU profiling (default)

./configure -TRACE -PROFILE
  • Enable both TAU profiling and tracing

./configure -c++=xlC_r -cc=xlc_r -papi=/usr/local/packages/papi -pdt=/usr/local/pdtoolkit-3.0 -arch=ibm64 -mpiinc=/usr/lpp/ppe.poe/include -mpilib=/usr/lpp/ppe.poe/lib -MULTIPLECOUNTERS
  • Use IBM’s xlC_r and xlc_r compilers with PAPI, PDT, MPI packages and multiple counters for measurements

• Typically configure multiple measurement libraries

TAU Manual Instrumentation API for C/C++

• Initialization and runtime configuration
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(myNode);
  TAU_PROFILE_SET_CONTEXT(myContext);
  TAU_PROFILE_EXIT(message);
  TAU_REGISTER_THREAD();
• Function and class methods
  TAU_PROFILE(name, type, group);
• Template
  TAU_TYPE_STRING(variable, type);
  TAU_PROFILE(name, type, group);
  CT(variable);
• User-defined timing
  TAU_PROFILE_TIMER(timer, name, type, group);
  TAU_PROFILE_START(timer);
  TAU_PROFILE_STOP(timer);

TAU Measurement API (continued)

• User-defined events
  TAU_REGISTER_EVENT(variable, event_name);
  TAU_EVENT(variable, value);
  TAU_PROFILE_STMT(statement);
• Mapping
  TAU_MAPPING(statement, key);
  TAU_MAPPING_OBJECT(funcIdVar);
  TAU_MAPPING_LINK(funcIdVar, key);
  TAU_MAPPING_PROFILE(funcIdVar);
  TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar);
  TAU_MAPPING_PROFILE_START(timer);
  TAU_MAPPING_PROFILE_STOP(timer);
• Reporting
  TAU_REPORT_STATISTICS();
  TAU_REPORT_THREAD_STATISTICS();

Manual Instrumentation – C++ Example

#include <TAU.h>

/* N and work() are assumed to be defined elsewhere in the application */
int foo(void);

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}

int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo …
  return 0;
}
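One plausible way a single-line macro such as TAU_PROFILE can measure an entire routine is a scoped object that starts timing in its constructor and stops in its destructor. The sketch below illustrates that technique with standard C++ only. `ScopedTimer`, `TimerStats`, `timer_table`, and `foo_sketch` are hypothetical names, and the code is not TAU's implementation; it is also not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulated statistics for one named timer.
struct TimerStats {
    long calls = 0;
    std::chrono::steady_clock::duration inclusive{0};
};

// Hypothetical global table of timers, keyed by name.
std::map<std::string, TimerStats>& timer_table() {
    static std::map<std::string, TimerStats> table;
    return table;
}

// Starts timing on construction, stops and records on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(const std::string& name)
        : stats_(timer_table()[name]),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        stats_.inclusive += std::chrono::steady_clock::now() - start_;
        ++stats_.calls;
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimerStats& stats_;  // std::map references stay valid across inserts
    std::chrono::steady_clock::time_point start_;
};

// Mirrors the shape of foo() above: one timer for the whole routine,
// a second one scoped to the loop only.
int foo_sketch(int n) {
    assert(n >= 0);
    ScopedTimer whole("int foo_sketch(int)");
    int sum = 0;
    {
        ScopedTimer loop("foo_sketch(): for loop");
        for (int i = 0; i < n; ++i) {
            sum += i;
        }
    }
    return sum;
}
```

Because the destructor runs on every exit path, the routine-level timer needs no explicit stop call, whereas the loop timer corresponds to the explicit start/stop pair in the slide.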

Manual Instrumentation – C Example

#include <TAU.h>

/* N and work() are assumed to be defined elsewhere in the application */
int foo(void);

int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  TAU_PROFILE_STOP(tmain);
  return 0;
}

int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  return 0;
}

Manual Instrumentation – F90 Example

cc34567 Cubes program – comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
        ! This program prints all 3-digit numbers that
        ! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES

Instrumenting Multithreaded Applications

#include <pthread.h>
#include <TAU.h>

void work(void); /* application routine, assumed to be defined elsewhere */

void * threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls
  TAU_PROFILE("void * threaded_function", " ", TAU_DEFAULT);
  work();
  return NULL;
}

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* single node */
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, &attr, threaded_function, NULL);
  pthread_join(tid, NULL); /* wait for the thread before exiting */
  return 0;
}

Compiling

% configure [options]
% make clean install

Creates the <arch>/lib/Makefile.tau<options> stub Makefile and the <arch>/lib/libTau<options>.a [.so] libraries, which together define a single configuration of TAU.

Compiling: TAU Makefiles

• Include TAU Stub Makefile (<arch>/lib) in the user’s Makefile.
• Variables:
  TAU_CXX          Specify the C++ compiler used by TAU
  TAU_CC, TAU_F90  Specify the C, F90 compilers
  TAU_DEFS         Defines used by TAU. Add to CFLAGS
  TAU_LDFLAGS      Linker options. Add to LDFLAGS
  TAU_INCLUDE      Header files include path. Add to CFLAGS
  TAU_LIBS         Statically linked TAU library. Add to LIBS
  TAU_SHLIBS       Dynamically linked TAU library
  TAU_MPI_LIBS     TAU’s MPI wrapper library for C/C++
  TAU_MPI_FLIBS    TAU’s MPI wrapper library for F90
  TAU_FORTRANLIBS  Must be linked in with C++ linker for F90
  TAU_CXXLIBS      Must be linked in with F90 linker
  TAU_DISABLE      TAU’s dummy F90 stub library
• Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for f90).


Including TAU Makefile - Example

include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out

$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CXX) $(CFLAGS) -c $< -o $@


Setup: Running Applications

% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib

For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS

For PAPI (multiple counters):
% setenv COUNTER1 PAPI_FP_INS      (PAPI’s floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM      (PAPI’s L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME   (PAPI’s virtual time)
% setenv COUNTER4 LINUX_TIMERS     (Wallclock time)

% mpirun -np <n> <application>
% llsubmit job.sh
% paraprof (for performance analysis)


Using TAU – A tutorial

Configuration
Instrumentation
  Manual
  PDT: Source rewriting for C, C++, F77/90/95
  MPI: Wrapper interposition library
  OpenMP: Directive rewriting
  Binary Instrumentation
    DyninstAPI: Runtime/Rewriting binary
    Java: Runtime instrumentation
    Python: Runtime instrumentation
Measurement
Performance Analysis


Using Program Database Toolkit (PDT)

Step I: Configure PDT:
% configure -arch=IRIX64 -CC
% make clean; make install
Builds <pdtdir>/<arch>/bin/cxxparse, cparse, f90parse and f95parse
Builds <pdtdir>/<arch>/lib/libpdb.a. See <pdtdir>/README file.

Step II: Configure TAU with PDT for auto-instrumentation of source code:
% configure -arch=sgi64 -c++=CC -cc=cc -pdt=/usr/contrib/TAU/pdtoolkit-3.0
% make clean; make install
Builds <taudir>/<arch>/bin/tau_instrumentor, <taudir>/<arch>/lib/Makefile.tau<options> and libTau<options>.a
See <taudir>/INSTALL file.


TAU Makefile for PDT

include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out

$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp -f select.dat
	$(CXX) $(CFLAGS) -c $*.inst.cpp -o $@


Instrumentation Control

Selection of which performance events to observe
  Could depend on scope, type, level of interest
  Could depend on instrumentation overhead
How is selection supported in instrumentation system?
  No choice
  Include / exclude routine and file lists (TAU)
  Environment variables
  Static vs. dynamic
Problem: Controlling instrumentation of small routines
  High relative measurement overhead
  Significant intrusion and possible perturbation


Using PDT: tau_instrumentor

% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]

For selective instrumentation, use the -f option:

% tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list of routines/files.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
BEGIN_FILE_INCLUDE_LIST
Main.cpp
Foo?.c
*.C
END_FILE_INCLUDE_LIST
# Instruments routines in Main.cpp, Foo?.c and *.C files only
# Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
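
To make the selective.dat semantics concrete, here is a minimal Python sketch of how such a file could be read and applied. It is not TAU's parser: the function names parse_select_file and should_instrument are hypothetical, and two points are assumptions: the file patterns (Foo?.c, *.C) are treated as case-sensitive shell-style wildcards, and a routine is instrumented only if its file passes the include list and its name is not on the exclude list.

```python
from fnmatch import fnmatchcase

def parse_select_file(text):
    """Collect the entries of each BEGIN_x ... END_x block, keyed by x."""
    lists, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.startswith("BEGIN_"):
            current = line[len("BEGIN_"):]
            lists[current] = []
        elif line.startswith("END_"):
            current = None
        elif current is not None:
            lists[current].append(line)
    return lists

def should_instrument(routine, filename, lists):
    """Assumed rule: file include list first, then routine exclude list."""
    file_patterns = lists.get("FILE_INCLUDE_LIST")
    if file_patterns and not any(fnmatchcase(filename, p) for p in file_patterns):
        return False
    return routine not in lists.get("EXCLUDE_LIST", [])

SELECT = """\
# Selective instrumentation
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
BEGIN_FILE_INCLUDE_LIST
Main.cpp
Foo?.c
*.C
END_FILE_INCLUDE_LIST
"""

lists = parse_select_file(SELECT)
print(should_instrument("int main(int, char **)", "Main.cpp", lists))
print(should_instrument("void quicksort(int *, int, int)", "Main.cpp", lists))
print(should_instrument("void helper()", "util.cpp", lists))
```

With the file above, main in Main.cpp is kept, quicksort is dropped by the exclude list, and anything in util.cpp is dropped because the file matches no include pattern.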


tau_reduce: Rule-Based Overhead Analysis

Analyze the performance data to determine events with high (relative) overhead performance measurements
Create a select list for excluding those events
Rule grammar (used in tau_reduce tool)
  [GroupName:] Field Operator Number
  GroupName indicates rule applies to events in group
  Field is an event metric attribute (from profile statistics)
    numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  Operator is one of >, <, or =
  Number is any number
  Compound rules possible using & between simple rules


Example Rules

# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that have less than 1000
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
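
The rule grammar can be illustrated with a small Python sketch. This is not the tau_reduce implementation; parse_rule, matches and exclude_list are hypothetical helpers, the event records are made-up sample data, and the sketch assumes that terms joined by & must all hold while separate rule lines are alternatives (OR), as the third example suggests.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def parse_rule(rule):
    """Split '[Group:] field op number & ...' into (group, [(field, op, number)])."""
    group = None
    if ":" in rule:
        group, rule = rule.split(":", 1)
        group = group.strip()
    terms = []
    for simple in rule.split("&"):
        for symbol, func in OPS.items():
            if symbol in simple:
                field, number = simple.split(symbol, 1)
                # float() also accepts scientific notation such as 4e5
                terms.append((field.strip(), func, float(number)))
                break
        else:
            raise ValueError("no operator in rule term: %r" % simple)
    return group, terms

def matches(rule, event):
    """True if the event is in the rule's group (if any) and every term holds."""
    group, terms = parse_rule(rule)
    if group is not None and group not in event["groups"]:
        return False
    return all(func(event[field], number) for field, func, number in terms)

def exclude_list(rules, events):
    """An event is excluded if any one rule line matches it."""
    return [e["name"] for e in events if any(matches(r, e) for r in rules)]

# Made-up profile statistics, for illustration only.
events = [
    {"name": "void interchange(int *, int *)", "groups": ("TAU_USER",),
     "usec": 250.0, "numcalls": 400, "usecs/call": 0.625, "percent": 0.1},
    {"name": "int main(int, char **)", "groups": ("TAU_DEFAULT",),
     "usec": 5.0e6, "numcalls": 1, "usecs/call": 5.0e6, "percent": 100.0},
    {"name": "void setup()", "groups": ("TAU_DEFAULT",),
     "usec": 800.0, "numcalls": 1, "usecs/call": 800.0, "percent": 0.02},
]
rules = ["TAU_USER:usec < 1000", "usec < 1000 & numcalls = 1"]
print(exclude_list(rules, events))
```

The returned names are what would go between BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST in a selective instrumentation file.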


TAU_REDUCE

Reads profile files and rules
Creates selective instrumentation file
  Specifies which routines should be excluded from instrumentation

[Diagram: rules and profile are inputs to tau_reduce, which outputs the selective instrumentation file]


Using MPI Wrapper Interposition Library

Step I: Configure TAU with MPI:
% configure -mpiinc=/usr/include -mpilib=/usr/lib64 -arch=sgi64 -c++=CC -cc=cc -pdt=/usr/contrib/TAU/pdtoolkit-3.0
% make clean; make install
Builds <taudir>/<arch>/lib/libTauMpi<options>, <taudir>/<arch>/lib/Makefile.tau<options> and libTau<options>.a


TAU’s MPI Wrapper Interposition Library

Uses standard MPI Profiling Interface
  Provides name shifted interface
    MPI_Send = PMPI_Send
  Weak bindings
Interpose TAU’s MPI wrapper library between MPI and TAU
  -lmpi replaced by -lTauMpi -lpmpi -lmpi


MPI Library Instrumentation (MPI_Send)

int MPI_Send(…) /* TAU redefines MPI_Send */
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  /* Wrapper calls PMPI_Send */
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
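
The interposition idea (same public name, measurement wrapped around the name-shifted call) can be mimicked outside MPI. The Python sketch below is only an analogy: FakeMPI and ProfiledMPI are hypothetical stand-ins, not an MPI binding, and the method signatures are simplified assumptions.

```python
import time

class FakeMPI:
    """Stand-in for an MPI library; it only pretends to send data."""
    def PMPI_Send(self, buf, dest, tag):
        return len(buf)                  # report how many bytes were "sent"
    # Without a wrapper, the public name is an alias of the name-shifted one.
    MPI_Send = PMPI_Send

class ProfiledMPI(FakeMPI):
    """Interposed layer: redefines MPI_Send and forwards to PMPI_Send."""
    def __init__(self):
        self.calls = 0
        self.seconds = 0.0
        self.bytes_sent = 0

    def MPI_Send(self, buf, dest, tag):
        start = time.perf_counter()      # plays the role of TAU_PROFILE_START
        try:
            return self.PMPI_Send(buf, dest, tag)
        finally:                         # plays the role of TAU_PROFILE_STOP
            self.calls += 1
            self.seconds += time.perf_counter() - start
            self.bytes_sent += len(buf)

mpi = ProfiledMPI()
mpi.MPI_Send(b"hello", dest=1, tag=0)
mpi.MPI_Send(b"world!", dest=1, tag=0)
print(mpi.calls, mpi.bytes_sent)
```

Application code keeps calling MPI_Send unchanged; only the layer it resolves to differs, which is what replacing -lmpi with the wrapper library achieves at link time.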


Including TAU’s stub Makefile

include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE) $(TAU_MPI_INCLUDE)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
LDFLAGS = $(TAU_LDFLAGS)
OBJS = ...
TARGET = a.out

$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CXX) $(CFLAGS) -c $< -o $@


Using Opari with TAU

Step I: Configure KOJAK/opari [Download from http://www.fz-juelich.de/zam/kojak/]
% cd kojak-0.99; cp mf/Makefile.defs.sgi Makefile.defs; edit Makefile
% make
Builds opari

Step II: Configure TAU with Opari (used here with MPI and PDT)
% configure -opari=/usr/contrib/TAU/kojak-0.99/opari -mpiinc=/usr/include -mpilib=/usr/lib64 -arch=sgi64 -c++=CC -cc=cc -pdt=/usr/contrib/TAU/pdtoolkit-3.0
% make clean; make install


Instrumentation of OpenMP Constructs

OPARI: OpenMP Pragma And Region Instrumentor
Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
Done: Supports
  Fortran77 and Fortran90, OpenMP 2.0
  C and C++, OpenMP 1.0
  POMP Extensions
  EPILOG and TAU POMP implementations
  Preserves source code information (#line line file)
Work in Progress: Investigating standardization through OpenMP Forum


OpenMP API Instrumentation

Transform
  omp_#_lock() -> pomp_#_lock()
  omp_#_nest_lock() -> pomp_#_nest_lock()
  [ # = init | destroy | set | unset | test ]
POMP version
  Calls omp version internally
  Can do extra stuff before and after call


Example: !$OMP PARALLEL DO Instrumentation

Original:

!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

After instrumentation (the combined directive is split and POMP calls are inserted):

      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)


Opari Instrumentation: Example

OpenMP directive instrumentation

pomp_for_enter(&omp_rd_2);

#line 252 "stommel.c"

#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait

for( i=i1;i<=i2;i++) {

for(j=j1;j<=j2;j++){

new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]

+ a4*psi[i][j-1] - a5*the_for[i][j];

diff=diff+fabs(new_psi[i][j]-psi[i][j]);

}

}

pomp_barrier_enter(&omp_rd_2);

#pragma omp barrier

pomp_barrier_exit(&omp_rd_2);

pomp_for_exit(&omp_rd_2);

#line 261 "stommel.c"


OPARI: Basic Usage (f90)

Reset OPARI state information
  rm -f opari.rc
Call OPARI for each input source file
  opari file1.f90
  ...
  opari fileN.f90
Generate OPARI runtime table, compile it with ANSI C
  opari -table opari.tab.c
  cc -c opari.tab.c
Compile modified files *.mod.f90 using OpenMP
Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL


OPARI: Makefile Template (C/C++)

OMPCC  = ... # insert C OpenMP compiler here
OMPCXX = ... # insert C++ OpenMP compiler here

.c.o:
	opari $<
	$(OMPCC) $(CFLAGS) -c $*.mod.c

.cc.o:
	opari $<
	$(OMPCXX) $(CXXFLAGS) -c $*.mod.cc

opari.init:
	rm -rf opari.rc

opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c

myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp

myfile1.o: myfile1.c myheader.h
myfile2.o: ...


OPARI: Makefile Template (Fortran)

OMPF77 = ... # insert f77 OpenMP compiler here
OMPF90 = ... # insert f90 OpenMP compiler here

.f.o:
	opari $<
	$(OMPF77) $(CFLAGS) -c $*.mod.F

.f90.o:
	opari $<
	$(OMPF90) $(CXXFLAGS) -c $*.mod.F90

opari.init:
	rm -rf opari.rc

opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c

myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp

myfile1.o: myfile1.f90
myfile2.o: ...


Tracing Hybrid Executions – TAU and Vampir


Profiling Hybrid Executions


OpenMP + MPI Ocean Modeling (HW Profile)

% configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib

[Figure: hardware counter profile showing FP instructions for integrated OpenMP + MPI events]


Dynamic Instrumentation

TAU uses DyninstAPI for runtime code patching
tau_run (mutator) loads measurement library
Instruments mutatee
MPI issues:
  one mutator per executable image [TAU, DynaProf]
  one mutator for several executables [Paradyn, DPCL]


Using DyninstAPI with TAU

Step I: Install DyninstAPI [Download from http://www.dyninst.org]
% cd dyninstAPI-4.0.2/core; make
Set DyninstAPI environment variables (including LD_LIBRARY_PATH)

Step II: Configure TAU with Dyninst
% configure -dyninst=/usr/local/dyninstAPI-4.0.2
% make clean; make install
Builds <taudir>/<arch>/bin/tau_run

% tau_run [<-o outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>

% tau_run -o a.inst.out a.out
Rewrites a.out

% tau_run klargest
Instruments klargest with TAU calls and executes it

% tau_run -XrunTAUsh-papi a.out
Loads libTAUsh-papi.so instead of libTAU.so for measurements

NOTE: Not all compilers and platforms are supported yet (work in progress)


SIMPLE Hydrodynamics Benchmark


Multi-Threading Performance Measurement

General issues
  Thread identity and per-thread data storage
  Performance measurement support and synchronization
  Fine-grained parallelism
    different forms and levels of threading
    greater need for efficient instrumentation
TAU general threading and measurement model
  Common thread layer and measurement support
  Interface to system specific libraries (reg, id, sync)
Target different thread systems with core functionality
  Pthreads, Windows, Java, SMARTS, Tulip, OpenMP


Virtual Machine Performance Instrumentation

Integrate performance system with VM
  Captures robust performance data (e.g., thread events)
  Maintain features of environment
    portability, concurrency, extensibility, interoperation
  Allow use in optimization methods
JVM Profiling Interface (JVMPI)
  Generation of JVM events and hooks into JVM
  Profiler agent (TAU) loaded as shared object
    registers events of interest and address of callback routine
  Access to information on dynamically loaded classes
  No need to modify Java source, bytecode, or JVM


Using TAU with Java Applications

Step I: Sun JDK 1.2+ [download from www.javasoft.com]

Step II: Configure TAU with JDK (v 1.2 or better)
% configure -jdk=/usr/java2 -TRACE -PROFILE
% make clean; make install
Builds <taudir>/<arch>/lib/libTAU.so

For Java (without instrumentation):
% java application

With instrumentation:
% java -XrunTAU application
% java -XrunTAU:exclude=sun/io,java application
Excludes sun/io/* and java/* classes


TAU Profiling of Java Application (SciVis)

[Screenshot annotations:]
  Profile for each Java thread
  Captures events for different Java packages
  24 threads of execution!
  Global routine profile


TAU Tracing of Java Application (SciVis)

[Screenshot annotations:]
  Performance groups
  Timeline display
  Parallelism view


Vampir Dynamic Call Tree View (SciVis)

[Screenshot annotations:]
  Per thread call tree
  Annotated performance
  Expanded call tree


Using TAU with Python Applications

Step I: Configure TAU with Python

% configure -pythoninc=/usr/include/python2.2/include
% make clean; make install
Builds <taudir>/<arch>/lib/<bindings>/pytau.py and tau.py packages for manual and automatic instrumentation respectively
% setenv PYTHONPATH $PYTHONPATH\:<taudir>/<arch>/lib/[<dir>]


Python Manual Instrumentation Example

#!/usr/bin/env python
import pytau
from time import sleep

x = pytau.profileTimer("Timer A")
pytau.start(x)
print " Sleeping for 5 seconds "
sleep(5)
pytau.stop(x)

Running:
% setenv PYTHONPATH <tau>/<arch>/lib
% ./application.py


Python Automatic Instrumentation Example

#!/usr/bin/env python
import tau
from time import sleep

def f2():
    print " In f2: Sleeping for 2 seconds "
    sleep(2)

def f1():
    print " In f1: Sleeping for 3 seconds "
    sleep(3)
    f2()  # call f2 so that it appears in the profile

def OurMain():
    f1()

tau.run('OurMain()')

Running:
% setenv PYTHONPATH <tau>/<arch>/lib
% ./auto.py

Instruments OurMain, f1, f2, print…


TAU Measurement

Performance information
  High-resolution timer library (real-time / virtual clocks)
  General software counter library (user-defined events)
  Hardware performance counters
    PAPI (Performance API) (UTK, Ptools Consortium)
    consistent, portable API
    one event or multiple counters
Organization
  Node, context, thread levels
  Profile groups for collective events (runtime selective)
  Performance data mapping between software levels


TAU Measurement (continued)

Parallel profiling
  Function-level, block-level, statement-level
  Supports user-defined events
  TAU parallel profile database
  Call path profiles
  Hardware counter values (in place of time)
Tracing
  All profile-level events
  Interprocess communication events
User-configurable measurement library (user controlled)


Setup: Running Applications

% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01   (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib

For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS

For PAPI (multiple counters):
% setenv COUNTER1 PAPI_FP_INS      (PAPI’s floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM      (PAPI’s L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME   (PAPI’s virtual time)
% setenv COUNTER4 LINUX_TIMERS     (Wallclock time)

% mpirun -np <n> <application>
% llsubmit job.sh

For Callpath Profiling:
% setenv TAU_CALLPATH_DEPTH 10


Performance Mapping

Associate performance with “significant” entities (events)
Source code points are important
  Functions, regions, control flow events, user events
Execution process and thread entities are important
Some entities are more abstract, harder to measure
Consider callgraph (callpath) profiling
  Measure time (metric) along an edge (path) of callgraph
  Incident edge gives parent / child view
  Edge sequence (path) gives parent / descendant view
Problem: Callpath profiling when callgraph is unknown
  Determine callgraph dynamically at runtime
  Map performance measurement to dynamic call path state


Callgraph (Callpath) Profiling

[Figure: example callgraph with nodes A, B, C, D, E, F, G, H, I]

Measure time (metric) along an edge (path) of callgraph
  Incident edge gives parent / child view
  Edge sequence (path) gives parent / descendant view
1-level callpath
  Immediate descendant
  AB, EI, DH
  CH ?
k-level callpath
  k call descendant
  2-level: AD, CI
  2-level: AI ?
  3-level: AH


k-Level Callpath Implementation in TAU

TAU maintains a performance event (routine) callstack
Profiled routine (child) looks in callstack for parent
  Previous profiled performance event is the parent
  A callpath profile structure created first time parent calls
  TAU records parent in a callgraph map for child
  String representing k-level callpath used as its key
    “a( )=>b( )=>c()” : name for time spent in “c” when called by “b” when “b” is called by “a”
Map returns pointer to callpath profile structure
  k-level callpath is profiled using this profiling data
  Set environment variable TAU_CALLPATH_DEPTH to depth
Build upon TAU’s performance mapping technology
Measurement is independent of instrumentation
Use -PROFILECALLPATH to configure TAU
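
The callstack-plus-map scheme can be sketched in a few lines of Python. This is a conceptual sketch, not TAU's implementation: the profiled decorator, the CALLPATH_DEPTH constant (standing in for TAU_CALLPATH_DEPTH) and the record layout are assumptions, the key uses bare routine names rather than full signatures, and depth is taken here to mean the number of routine names kept in the key.

```python
import time
from collections import defaultdict

CALLPATH_DEPTH = 2        # assumed meaning: number of routine names in a key
callstack = []            # names of the profiled routines currently active
callpath_profiles = defaultdict(lambda: {"calls": 0, "inclusive": 0.0})

def callpath_key(depth):
    """Join the innermost `depth` callstack entries into 'a=>b=>c' form."""
    return "=>".join(callstack[-depth:])

def profiled(func):
    """Record calls and inclusive time under the current callpath key."""
    def wrapper(*args, **kwargs):
        callstack.append(func.__name__)
        key = callpath_key(CALLPATH_DEPTH)   # the parent is read off the callstack
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record = callpath_profiles[key]  # created the first time the path occurs
            record["calls"] += 1
            record["inclusive"] += time.perf_counter() - start
            callstack.pop()
    return wrapper

@profiled
def c():
    pass

@profiled
def b():
    c()

@profiled
def a():
    b()
    c()

a()
print(sorted(callpath_profiles))
```

Time spent in c is split between the keys a=>c and b=>c, which a flat profile would merge into a single entry for c.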


k-Level Callpath Implementation in TAU


Using TAU with Vampir

Configure TAU with -TRACE option
  % configure -TRACE -SGITIMERS …
Execute application
  % mpirun -np 4 a.out
  This generates TAU traces and event descriptors
Merge all traces using tau_merge
  % tau_merge *.trc app.trc
Convert traces to Vampir Trace format using tau_convert
  % tau_convert -pv app.trc tau.edf app.pv
  Note: Use -vampir instead of -pv for multi-threaded traces
Load generated trace file in Vampir
  % vampir app.pv


Case Study: Utah ASCI/ASAP Level 1 Center

C-SAFE was established at U. Utah to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  Fundamental chemistry and engineering physics models
  Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  Very large-scale simulations
Computer science problems:
  Coupling of multiple simulation codes
  Software engineering across diverse expert teams
  Achieving high performance on large-scale systems


Example C-SAFE Simulation Problems

Heptane fire simulation

Material stress simulation

Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics


Uintah High-Level Component View


Uintah Computational Framework

Execution model based on software (macro) dataflow
  Exposes parallelism and hides data transport latency
Computations expressed as directed acyclic graphs of tasks
  Each task consumes input and produces output (input to future task)
  Inputs/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory
  DataWarehouse
  Directory mapping names to values (array structured)
  Write value once then communicate to awaiting tasks
Task graph gets mapped to processing resources
Communications schedule approximates global optimal


Uintah Task Graph (Material Point Method)

Diagram of named tasks (ovals) and data (edges)
Imminent computation
  Dataflow-constrained
MPM
  Newtonian material point motion time step
  Solid: values defined at material point (particle)
  Dashed: values defined at vertex (grid)
  Prime ('): values updated during time step


Uintah PSE

UCF automatically sets up:
  Domain decomposition
  Inter-processor communication with aggregation/reduction
  Parallel I/O
  Checkpoint and restart
  Performance measurement and analysis (stay tuned)
Software engineering
  Coding standards
  CVS (Commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
  Correctness regression testing with bugzilla bug tracking
  Nightly build (parallel compiles)
  170,000 lines of code (Fortran and C++ tasks supported)


Performance Technology Integration

Uintah presents challenges to performance integration
  Software diversity and structure
    UCF middleware, simulation code modules
    component-based hierarchy
  Portability objectives
    cross-language and cross-platform
    multi-parallelism: thread, message passing, mixed
  Scalability objectives
  High-level programming and execution abstractions
Requires flexible and robust performance technology
Requires support for performance mapping


Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in MPI library
[Screenshot annotations:]
  Task execution time dominates (what task?)
  MPI communication overheads (where?)
  Task execution time distribution
Need to map performance data!


Semantics-Based Performance Mapping

Associate performance measurements with high-level semantic abstractions

Need mapping support in the performance measurement system to assign data correctly


Semantic Entities/Attributes/Associations (SEAA)

New dynamic mapping scheme
  Entities defined at any level of abstraction
  Attribute entity with semantic information
  Entity-to-entity associations

Two association types (implemented in TAU API)
  Embedded: extends data structure of associated object to store performance measurement entity
  External: creates an external look-up table using address of object as the key to locate performance measurement entity
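The two association types can be sketched in plain C++. This is only an illustration of the idea, not TAU's implementation: `PerfData`, `EmbeddedTask`, `PlainTask`, and `ExternalAssociation` are hypothetical names introduced here, and the TAU API itself is not used.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a performance measurement entity (not a TAU type).
struct PerfData {
    std::string name;
    long calls = 0;
    double seconds = 0.0;
};

// Embedded association: the application object's data structure is extended
// with a pointer to its performance data, so lookup is one field access.
struct EmbeddedTask {
    std::string type;
    PerfData* perf = nullptr;  // field added to the object itself
};

// Application object that cannot (or should not) be modified.
struct PlainTask {
    std::string type;
};

// External association: a look-up table kept outside the object, keyed by
// the object's address, locates the performance data.
class ExternalAssociation {
public:
    void link(const void* object, PerfData* perf) {
        assert(object != nullptr);
        table_[object] = perf;
    }

    // Returns nullptr when the object was never linked.
    PerfData* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, PerfData*> table_;
};
```

The embedded form avoids a table lookup but requires changing the object's definition; the external form leaves the object untouched at the cost of a hash lookup per access.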


Uintah Task Performance Mapping

Uintah partitions individual particles across processing elements (processes or threads)
Simulation tasks in task graph work on particles
  Tasks have domain-specific character in the computation
    “interpolate particles to grid” in Material Point Method
  Task instances generated for each partitioned particle set
  Execution scheduled with respect to task dependencies
How to attribute execution time among different tasks
  Assign semantic name (task type) to a task instance
    SerialMPM::interpolateParticleToGrid
  Map TAU timer object to (abstract) task (semantic entity)
  Look up timer object using task type (semantic attribute)
  Further partition along different domain-specific axes
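A minimal sketch of looking up a timer by task type, so that all instances of one task type accumulate into a single timer. `TaskTimer` and `TaskTypeProfile` are hypothetical names used only for illustration; this does not use TAU timer objects, and wall-clock timing via `std::chrono` is an assumption made here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical timer record; TAU's real timer objects are not modeled here.
struct TaskTimer {
    long calls = 0;
    double seconds = 0.0;
};

class TaskTypeProfile {
public:
    // Look up (or create on first use) the timer for a task type name.
    TaskTimer& timerFor(const std::string& taskType) {
        assert(!taskType.empty());
        return timers_[taskType];
    }

    // Run one task instance and charge its wall-clock time to its task type.
    template <typename Work>
    void run(const std::string& taskType, Work work) {
        TaskTimer& timer = timerFor(taskType);
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        timer.calls += 1;
        timer.seconds += std::chrono::duration<double>(stop - start).count();
    }

    // Number of distinct task types seen so far.
    std::size_t typeCount() const { return timers_.size(); }

private:
    std::map<std::string, TaskTimer> timers_;
};
```

Calling `run("SerialMPM::interpolateParticleToGrid", work)` once per partitioned particle set would then report one timer for that task type, with `calls` equal to the number of instances executed.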


Using External Associations

Two level mappings:
  Level 1: <task name, timer>
  Level 2: <task name, patch, timer>

[Figure: embedded association vs. external association. Labels: Data (object), Performance Data, Hash Table]
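The two mapping levels might be kept as two tables, one keyed by task name and one keyed by task name plus patch. The sketch below is an assumption-laden illustration: `Timer` and `TwoLevelMapping` are hypothetical names, patches are identified by an integer id for simplicity, and ordered `std::map` tables stand in for the hash table shown on the slide.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical timer record (not a TAU type).
struct Timer {
    long calls = 0;
    double seconds = 0.0;
};

class TwoLevelMapping {
public:
    // Charge one measured interval to both mapping levels.
    void record(const std::string& taskName, int patchId, double seconds) {
        assert(seconds >= 0.0);
        Timer& byTask = level1_[taskName];
        byTask.calls += 1;
        byTask.seconds += seconds;

        Timer& byPatch = level2_[std::make_pair(taskName, patchId)];
        byPatch.calls += 1;
        byPatch.seconds += seconds;
    }

    // Level 1 lookup: <task name, timer>. Returns nullptr if unknown.
    const Timer* find(const std::string& taskName) const {
        auto it = level1_.find(taskName);
        return it == level1_.end() ? nullptr : &it->second;
    }

    // Level 2 lookup: <task name, patch, timer>. Returns nullptr if unknown.
    const Timer* find(const std::string& taskName, int patchId) const {
        auto it = level2_.find(std::make_pair(taskName, patchId));
        return it == level2_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Timer> level1_;
    std::map<std::pair<std::string, int>, Timer> level2_;
};
```

Level 1 answers "how much time did this task type take overall", while level 2 splits the same time by patch, which is one way to expose imbalance across patches.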


Task Performance Mapping Instrumentation

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler,0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}


Task Performance Mapping (Profile)

Performance mapping for different tasks

Mapped task performance across processes


Task Performance Mapping (Trace)

Work packet computation events colored by task type

Distinct phases of computation can be identified based on task


Task Performance Mapping (Trace - Zoom)

Startup communication imbalance


Task Performance Mapping (Trace - Parallelism)

Communication/load imbalance


Comparing Uintah Traces for Scalability Analysis

[Figure: Uintah trace displays for 8 processes and for 32 processes]


Scaling Performance Optimizations

Last year: initial “correct” scheduler

Reduce communication by 10 x

Reduce task graph overhead by 20 x

ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory


Scalability to 2000 Processors (Fall 2001)

ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory


Concluding Remarks

Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools

To build more sophisticated performance tools, existing proven performance technology must be utilized

Performance tools must be integrated with software and systems models and technology
  Performance engineered software
  Function consistently and coherently in software and system environments

PAPI and TAU performance systems offer robust performance technology that can be broadly integrated


Support Acknowledgements

Department of Energy (DOE)
  Office of Science contracts
  University of Utah DOE ASCI Level 1 sub-contract
  DOE ASCI Level 3 (LANL, LLNL)

NSF National Young Investigator (NYI) award

Research Centre Juelich
  John von Neumann Institute for Computing
  Dr. Bernd Mohr

Los Alamos National Laboratory


Setup for Training Session

SSH: Go to Edit -> Properties -> Tunnelling and check Tunnel X11 connections

Log in to daley.nrlmry.navy.mil

Try:
  % xclock &


Training Session

% cp -r /usr/contrib/TAU/tau-2.13 <yourdir>
% set path=(/usr/java2/bin $path)
% set path=($path /usr/contrib/TAU/tau-2.13/sgi64/bin)
% setenv PAL_ROOT /usr/contrib/TAU/vampir-4.0
% set path=($path $PAL_ROOT/bin)
% cd <yourdir>/tau-2.13
% configure -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64
% make clean install; cd examples/NPB2.3
% gmake clean; gmake; cd bin;
% mpirun -np 4 sp.W.4; paraprof
% set path=($path /usr/contrib/TAU/pdtoolkit-3.0/sgi64/bin)


Using TAU+PAPI+PDT+MPI for F90

This will run your job in the queue:

~campbell/com/runsmpi <exe> <numprocs>

% cd <yourdir>/
% ls /usr/contrib/TAU/tau-2.13/sgi64/lib/Makefile.*

For the F90+MPI example, you want:

Makefile.tau-papiwallclock-multiplecounters-papivirtual-mpi-papi-pdt

To see data in textual format:

% pprof | more

To see data in paraprof:

% paraprof


Using TAU with VAMPIR

In your .cshrc file:

setenv PAL_ROOT /usr/contrib/TAU/vampir-4.0
set path=($path $PAL_ROOT/bin)

Select /usr/contrib/TAU/tau-2.13/sgi64/Makefile.tau-mpi-pdt-trace as your stub makefile

Run your application (e.g., examples/mpi_pdt/f90). TAU produces *.edf and *.trc files

Merge and convert the trace files:

% tau_merge tautrace*.trc app.trc
% tau_convert -pv app.trc tau.edf app.pv
% vampir app.pv

See Global displays -> Timeline, Zoom in/out of segments.