Simplifying the Usage of Performance Evaluation Tools: Experiences with TAU and DyninstAPI
Paradyn/Condor Week 2010, Rm 221, Fluno Center, U. of Wisconsin, Madison, 10:45am – 11:30am Tuesday, 14th April, 2010
Sameer Shende, Allen D. Malony, Alan Morris
Performance Research Laboratory, University of Oregon, Eugene, OR
{sameer, malony, amorris}@cs.uoregon.edu
http://tau.uoregon.edu

Acknowledgements: University of Oregon

Dr. Allen D. Malony, Professor, CIS Dept, and Director, NeuroInformatics Center
Alan Morris, Senior software engineer
Dr. Chee Wai Lee, Research faculty
Wyatt Spear, Software engineer
Scott Biersdorff, Software engineer
Dr. Robert Yelle, Research faculty
Suzanne Millstein, Ph.D. student
And Matt Legendre and Dan McNulty, University of Wisconsin at Madison

Motivation

We have made great advances in instrumentation, measurement and analysis techniques

Tools are rich in features and have complex tool dependencies

Tools are getting more complex to use and to install

We need to simplify the usage of our performance evaluation tools!

TAU Performance System®

Integrated toolkit for performance problem solving
  Instrumentation, measurement, analysis, visualization
  Portable performance profiling and tracing facility
  Performance data management and data mining
Based on direct performance measurement approach
Open source
Available on all HPC platforms
Partners
  LLNL, ANL, ORNL, LANL, PNNL, LBL
  Research Centre Jülich, TU Dresden

Figure: TAU Architecture

TAU Parallel Performance System Goals

Portable (open source) parallel performance system
  Computer system architectures and operating systems
  Different programming languages and compilers
Multi-level, multi-language performance instrumentation
Flexible and configurable performance measurement
Support for multiple parallel programming paradigms
  Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component-based
Support for performance mapping
Integration of leading performance technology
Scalable (very large) parallel performance analysis

TAU Performance System Components

Figure: TAU Architecture. Panels: Program Analysis, Parallel Profile Analysis, Performance Data Mining, Performance Monitoring. Components labeled: PDT, PerfDMF, ParaProf, TAUoverMRNet (ToM), PerfExplorer

TAU Performance System Architecture

TAU Performance System Architecture

Parallel Profile Visualization: ParaProf

Scalable Visualization: ParaProf (128k cores)

Scatter Plot: ParaProf (128k cores)

ParaProf: Communication Matrix Display

Comparing Effects of Multi-Core Processors

AORSA2D magnetized plasma simulation
Automatic loop level instrumentation
Blue is single node
Red is dual core
Cray XT3 (4K cores)

ParaProf: Mflops Sorted by Exclusive Time

low mflops?

Performance Regression Testing

Usage Scenarios: Evaluate Scalability

Scaling NAMD with CUDA (Jumpshot with TAU)

Data transfer

Measuring Performance of PGI Accelerated Code

TAU and Eclipse

Provide an interface for configuring TAU’s automatic instrumentation within Eclipse’s build system

Manage runtime configuration settings and environment variables for execution of TAU instrumented programs

Figure (workflow diagram). Boxes: C/C++/Fortran project in Eclipse; Add or modify an Eclipse build configuration w/ TAU; Temporary copy of instrumented code; Compilation/linking with TAU libraries; TAU instrumented libraries; Program execution; Performance data; Program output

TAU and Eclipse

PerfDMF

Choosing PAPI Counters with TAU in Eclipse

TAU Performance System Architecture

TAU Instrumentation Approach

Support for standard program events
  Routines, classes and templates
  Statement-level blocks
  Begin/End events (Interval events)
Support for user-defined events
  Begin/End events specified by user
  Atomic events (e.g., size of memory allocated/freed)
  Selection of event statistics
Support definition of “semantic” entities for mapping
Support for event groups (aggregation, selection)
Instrumentation optimization
  Eliminate instrumentation in lightweight routines

TAU Instrumentation Mechanisms

Source code
  Manual (TAU API, TAU component API)
  Automatic (robust)
    C, C++, F77/90/95 (Program Database Toolkit (PDT))
    OpenMP (directive rewriting (Opari), POMP2 spec)
Object code
  Compiler-based instrumentation (-optCompInst)
  Pre-instrumented libraries (e.g., MPI using PMPI)
  Statically-linked and dynamically-linked (tau_wrap)
Executable code
  Binary re-writing and dynamic instrumentation (DyninstAPI, U. Wisconsin, U. Maryland)
  Virtual machine instrumentation (e.g., Java using JVMPI)
  Interpreter based instrumentation (Python)
  Kernel based instrumentation (KTAU)

Program Database Toolkit (PDT)

Figure (PDT architecture). Application / Library -> C / C++ parser and Fortran parser (F77/90/95) -> IL -> C / C++ IL analyzer and Fortran IL analyzer -> Program Database Files -> DUCTAPE -> tools:
  PDBhtml: Program documentation
  SILOON: Application component glue
  CHASM: C++ / F90/95 interoperability
  TAU_instr: Automatic source instrumentation

Automatic Source-Level Instrumentation in TAU

Figure (workflow diagram). Boxes: Application source; TAU source analyzer; Parsed program; Instrumentation specification file; tau_instrumentor; Instrumented source

TAU v2.19.1+: If source based instrumentation fails, compiler-based instrumentation is used automatically

Using TAU with Source Code Instrumentation

TAU supports several measurement options (profiling, tracing, profiling with hardware counters, etc.)

Each measurement configuration of TAU corresponds to a unique stub makefile that is generated when you configure it

To instrument source code using PDT
  Choose an appropriate TAU stub makefile in <arch>/lib:
  % export TAU_MAKEFILE=/usr/local/packages/tau/x86_64/lib/Makefile.tau-mpi-pdt
  % export TAU_OPTIONS='-optVerbose …' (see tau_compiler.sh -help)
  And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
  % mpif90 foo.f90
  changes to
  % tau_f90.sh foo.f90
  Execute application and analyze performance data:
  % pprof (for text based profile display)
  % paraprof (for GUI)

TAU Measurement Configuration – Examples

% cd /usr/local/packages/tau/x86_64/lib; ls Makefile.*
Makefile.tau-pdt
Makefile.tau-mpi-pdt
Makefile.tau-papi-mpi-pdt
Makefile.tau-pthread-pdt
Makefile.tau-pthread-mpi-pdt
Makefile.tau-openmp-opari-pdt
Makefile.tau-openmp-opari-mpi-pdt
Makefile.tau-papi-openmp-opari-mpi-pdt
…

For an MPI+F90 application, you may want to start with:
Makefile.tau-mpi-pdt
  Supports MPI instrumentation & PDT for automatic source instrumentation
  % setenv TAU_MAKEFILE /usr/local/packages/tau/x86_64/lib/Makefile.tau-mpi-pdt
  % tau_f90.sh application.f90; mpirun -np 256 ./a.out
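
The makefile names listed on this slide appear to encode the configured features as dash-separated tags. Assuming that convention holds on a given installation, the following sketch lists the stub makefiles that contain every requested tag. The helper name pick_tau_makefile and the demo directory are hypothetical and are not part of TAU.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAU): print the stub makefiles in a
# directory whose names contain every requested feature tag.
# Assumes the Makefile.tau-<feature>-<feature>... naming shown on the slide.
pick_tau_makefile() {
    libdir=$1; shift
    for mk in "$libdir"/Makefile.tau-*; do
        [ -e "$mk" ] || continue
        name=${mk##*/}
        # Turn "Makefile.tau-papi-mpi-pdt" into "-papi-mpi-pdt-" so each
        # tag is matched as a whole word between dashes.
        tags="-${name#Makefile.tau-}-"
        ok=1
        for want in "$@"; do
            case $tags in
                *"-$want-"*) ;;
                *) ok=0 ;;
            esac
        done
        [ "$ok" -eq 1 ] && echo "$mk"
    done
    return 0
}

# Demo with empty placeholder files standing in for a real <arch>/lib.
demo=$(mktemp -d)
for f in tau-pdt tau-mpi-pdt tau-papi-mpi-pdt tau-pthread-pdt; do
    : > "$demo/Makefile.$f"
done
matches=$(pick_tau_makefile "$demo" mpi pdt)
echo "$matches"
rm -rf "$demo"
```

On a real installation the first argument would be the <arch>/lib directory, and one of the printed paths could then be assigned to TAU_MAKEFILE.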

Compile-Time Environment Variables

Optional parameters for TAU_OPTIONS: [tau_compiler.sh -help]

-optVerbose            Turn on verbose debugging messages
-optCompInst           Use compiler based instrumentation
-optNoCompInst         Do not revert to compiler instrumentation if source instrumentation fails.
-optDetectMemoryLeaks  Turn on debugging memory allocations/de-allocations to track leaks
-optKeepFiles          Does not remove intermediate .pdb and .inst.* files
-optPreProcess         Preprocess Fortran sources before instrumentation
-optTauSelectFile=""   Specify selective instrumentation file for tau_instrumentor
-optLinking=""         Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
-optCompile=""         Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtF95Opts=""      Add options for Fortran parser in PDT (f95parse/gfparse)
-optPdtF95Reset=""     Reset options for Fortran parser in PDT (f95parse/gfparse)
-optPdtCOpts=""        Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtCxxOpts=""      Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)

...
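
As an illustration of how several of these options might be combined, the sketch below writes a small selective instrumentation file and references it from TAU_OPTIONS. The excluded routine signature and the source file foo.c are hypothetical. The BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST layout follows the selective instrumentation file format as we understand it, so check it against tau_compiler.sh -help and the TAU documentation for your version.

```shell
#!/bin/sh
# Sketch: create a selective instrumentation file and pass it to the TAU
# compiler wrappers through TAU_OPTIONS. The routine named here is
# hypothetical; replace it with routines from your own application.
select_file=$(mktemp)
cat > "$select_file" <<'EOF'
BEGIN_EXCLUDE_LIST
void tiny_helper(int)
END_EXCLUDE_LIST
EOF

# Options from the table on this slide: verbose output, keep the
# intermediate .pdb/.inst.* files, and use the selection file.
TAU_OPTIONS="-optVerbose -optKeepFiles -optTauSelectFile=$select_file"
export TAU_OPTIONS
echo "TAU_OPTIONS=$TAU_OPTIONS"

# Compile only when TAU is set up and the (hypothetical) source exists.
if command -v tau_cc.sh >/dev/null 2>&1 \
   && [ -n "${TAU_MAKEFILE:-}" ] && [ -f foo.c ]; then
    tau_cc.sh -c foo.c
else
    echo "TAU wrapper, TAU_MAKEFILE or foo.c missing; skipping compile"
fi
```

Keeping the intermediate files with -optKeepFiles makes it easier to see which routines the instrumentor actually touched.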

Runtime Environment Variables in TAU

Environment Variable                  Default  Description
TAU_TRACE                             0        Setting to 1 turns on tracing
TAU_CALLPATH                          0        Setting to 1 turns on callpath profiling
TAU_TRACK_HEAP or TAU_TRACK_HEADROOM  0        Setting to 1 turns on tracking heap memory/headroom at routine entry & exit using context events (e.g., Heap at Entry: main=>foo=>bar)
TAU_CALLPATH_DEPTH                    2        Specifies depth of callpath. Setting to 0 generates no callpath or routine information, setting to 1 generates flat profile and context events have just parent information (e.g., Heap Entry: foo)
TAU_SYNCHRONIZE_CLOCKS                1        Synchronize clocks across nodes to correct timestamps in traces
TAU_COMM_MATRIX                       0        Setting to 1 generates communication matrix display using context events
TAU_THROTTLE                          1        Setting to 0 turns off throttling. Enabled by default to remove instrumentation in lightweight routines that are called frequently
TAU_THROTTLE_NUMCALLS                 100000   Specifies the number of calls before testing for throttling
TAU_THROTTLE_PERCALL                  10       Specifies value in microseconds. Throttle a routine if it is called over 100000 times and takes less than 10 usec of inclusive time per call
TAU_COMPENSATE                        0        Setting to 1 enables runtime compensation of instrumentation overhead
TAU_PROFILE_FORMAT                    Profile  Setting to “merged” generates a single file. “snapshot” generates xml format
TAU_METRICS                           TIME     Setting to a comma separated list generates other metrics. (e.g., TIME:linuxtimers:PAPI_FP_OPS:PAPI_NATIVE_<event>)
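
The three TAU_THROTTLE rows read as one rule: a routine is throttled when it is called more than TAU_THROTTLE_NUMCALLS times and its inclusive time per call is below TAU_THROTTLE_PERCALL microseconds. The sketch below expresses only that rule as stated in the table. The function would_throttle is hypothetical, and TAU's internal check may differ in detail.

```shell
#!/bin/sh
# Defaults taken from the table; values already set in the environment win.
: "${TAU_THROTTLE:=1}"
: "${TAU_THROTTLE_NUMCALLS:=100000}"
: "${TAU_THROTTLE_PERCALL:=10}"

# would_throttle CALLS TOTAL_INCLUSIVE_USEC
# Hypothetical illustration: succeeds (exit 0) when a routine called CALLS
# times, with TOTAL_INCLUSIVE_USEC of inclusive time, meets the rule.
would_throttle() {
    calls=$1; incl_usec=$2
    [ "$TAU_THROTTLE" -eq 1 ] || return 1
    [ "$calls" -gt "$TAU_THROTTLE_NUMCALLS" ] || return 1
    # per-call time < PERCALL  <=>  total < PERCALL * calls (integers only)
    [ "$incl_usec" -lt $((TAU_THROTTLE_PERCALL * calls)) ]
}

would_throttle 2000000 4000000  && echo "2M calls at 2 usec each: throttled"
would_throttle 2000000 40000000 || echo "2M calls at 20 usec each: kept"
would_throttle 500 1000         || echo "500 calls: kept (below call threshold)"
```

Setting TAU_THROTTLE=0 before a run disables the rule altogether, which can help when a frequently called lightweight routine is exactly what needs to be measured.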

Simplifying Instrumentation using DyninstAPI

TAU uses DyninstAPI to create a binary re-writer (tau_run)
TAU’s measurement library (DSO) is loaded by tau_run
Both runtime instrumentation and binary re-writing are supported
Selection of files and routines based on exclude/include lists
Simplifies tool usage greatly!
Available on POINT LiveDVD [http://tau.uoregon.edu/point.iso]
Usage:
  % tau_run a.out -o a.inst.out
  % mpirun -np 4 a.inst.out
  % paraprof
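
The usage above can be combined with runtime settings from the "Runtime Environment Variables in TAU" slide. This is a minimal sketch, assuming an uninstrumented MPI binary named a.out in the current directory. The callpath depth of 10 is an arbitrary illustrative value, and the script does nothing beyond setting variables when tau_run is not installed.

```shell
#!/bin/sh
# Sketch of the rewrite-then-run workflow with a few runtime settings.
export TAU_CALLPATH=1          # turn on callpath profiling
export TAU_CALLPATH_DEPTH=10   # illustrative value; the listed default is 2
export TAU_COMM_MATRIX=1       # collect data for the communication matrix

app=${1:-a.out}                # uninstrumented binary (name assumed)

if command -v tau_run >/dev/null 2>&1 && [ -x "$app" ]; then
    tau_run "$app" -o "$app.inst"   # binary rewriting, as in the usage above
    mpirun -np 4 "./$app.inst"
    pprof                           # text summary; paraprof opens the GUI
else
    echo "tau_run or $app not available; nothing was instrumented"
fi
```

Because the runtime variables are read when the instrumented binary executes, the same rewritten binary can be rerun with different settings without rewriting it again.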

Issues

Re-writing static executables limited to gcc, limited platforms in beta
  Currently, we support dynamic executables (v6.1)
  We are working on supporting both static and dynamic executables
  We hope to support more platforms, compilers and runtime systems in the future
Rewriting shared libraries used by the application
  LD_PRELOAD’able wrapper libraries can be created using tau_wrap
  requires interface information in header file

Binary Rewriting in TAU using DyninstAPI

Wish List for tau_run

Support for more platforms
  Apple Mac OS X, Windows, IBM BG/P, AIX, …
Support for more compilers
Support for rewriting shared objects
Support for static binary rewriting with validation for compilers other than gcc
  XLC, PathScale, Cray CCE, Intel, PGI, …

Other Tools…

Other TAU tools that use technologies from the ParaDyn/DyninstAPI group
  TAU over MRNet (ToM) for runtime
  Stackwalker API for accessing callstack

StackWalkerAPI in TAU

Requirements overview:
  Minimal information required (PC is enough)
  Threaded support necessary
  Low overhead (for high sample rates)
  Stack unwinding from a signal handler
    Malloc could be interrupted
    Need to walk through signal handler frame

Issues encountered with StackWalkerAPI

StackWalkerAPI:
  Isn’t thread safe (and locking to use it can cause significant overhead)
  Uses malloc/new (and so do dependent libraries such as libdwarf)
  C++ (we would prefer C)
  Issues walking certain kinds of stack frames
  Matt Legendre was able to help us out a lot though!
Alternatives:
  TAU is currently using stack walking constructs from HPCToolkit

Online Monitoring using TAU over MRNet (ToM)

Back-End (BE)
  TAU adapter offloads performance data
Filters
  reduction
  distributed analysis
  upstream / downstream
Front-End (FE)
  unpacks, interprets, stores
Paths
  reverse data reduction path
  multicast control path
Push-Pull model
  source pushes, sink pulls

Conclusions

TAU and DyninstAPI represent mature technology for performance instrumentation, measurement and analysis

Using DyninstAPI’s binary re-writing capabilities, we have produced a tool that simplifies code instrumentation

We hope to collaborate on other projects and include support for an enhanced stack walker API

Questions?

Support Acknowledgements

Department of Energy (DOE)
  Office of Science
  MICS, Argonne National Lab
  ASC/NNSA
  University of Utah ASC/NNSA Level 1
  ASC/NNSA, LLNL
Department of Defense (DoD)
NSF SDCI
Partners:
  Research Centre Juelich
  LBL, ORNL, ANL, LANL, PNNL, LLNL
  TU Dresden
  ParaTools, Inc.