
Page 1: Paradyn Evaluation Report

Paradyn Evaluation Report

Adam Leko, UPC Group

HCS Research Laboratory, University of Florida

Color encoding key:

Blue: Information

Red: Negative note

Green: Positive note

Page 2: Paradyn Evaluation Report

Basic Information

Name: Paradyn
Developer: University of Wisconsin-Madison
Current versions:
- Paradyn: 4.1.1
- DynInst: 4.1.1
- KernInst: 2.0.1
Website: http://www.paradyn.org/index.html
Contact: Matthew Legendre

Page 3: Paradyn Evaluation Report

What is Paradyn?
- A performance analysis tool (PAT) for sequential and parallel programs
- Uses dynamic binary instrumentation to record program metrics (may use unmodified executables)
- Visualizations
  - Metric-focus grids (right, top): rows are performance metrics, columns are resources to collect a performance metric from
  - Metrics can be reported as current values, statistics (min/max/average), or time-histograms (right, bottom)
- Performance consultant
  - Automated search to identify bottlenecks in the program
  - Uses the W3 model: why, where, when

Paradyn is also a generic project that includes several tools related to performance analysis:
- Paradyn: the PAT itself
- DynInst: a dynamic binary instrumentation library
- KernInst: a dynamic instrumentation library for instrumenting running operating system (OS) kernels; not very useful for a PAT unless the PAT needs to be applied to an OS kernel
- MRNet: a high-performance communication library supporting master-slave software architectures ("Multicast/Reduction Network"); not immediately useful for the design phase of our PAT

Example metric-focus grid:

            CPU usage    Bandwidth
  Node 1    84%          [histogram]
  Node 2    95%          [histogram]
  Node 3    33%          [histogram]
  Node 4    65%          [histogram]

[Figure: visual representation of an example time-histogram; x-axis: time, y-axis: bandwidth]
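A metric-focus grid is essentially the cross product of metrics and foci, with each cell holding its own samples. A minimal sketch of the idea (hypothetical names and values, not Paradyn code):

```python
# Toy metric-focus grid: rows are metrics, columns are foci (resources).
metrics = ["CPU usage", "Bandwidth"]
foci = ["Node 1", "Node 2", "Node 3", "Node 4"]

# One cell per (metric, focus) pair; in Paradyn each cell would hold a
# current value, min/max/average statistics, or a time-histogram.
grid = {(m, f): [] for m in metrics for f in foci}

def record(metric, focus, value):
    """Append a sampled value to the cell's series."""
    grid[(metric, focus)].append(value)

def summarize(metric, focus):
    """Report a cell the way a summary table would: (min, max, average)."""
    samples = grid[(metric, focus)]
    return min(samples), max(samples), sum(samples) / len(samples)

record("CPU usage", "Node 1", 0.84)
record("CPU usage", "Node 2", 0.95)
```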

Page 4: Paradyn Evaluation Report

General Paradyn Architecture

Image courtesy [1]

Four main components:
- User interface (green)
- Visualization (red)
- Performance consultant (purple)
- Instrumentation (blue)

Thick circles represent running processes; dotted circles represent threads within a single process.

Will present each component using a "bottom-up" approach.

Page 5: Paradyn Evaluation Report


Part 1: Instrumentation

Page 6: Paradyn Evaluation Report

Instrumentation Overview

Paradyn terminology: points, primitives, and predicates
- Points: places where instrumentation code can be placed. Supported points: procedure entry, procedure exit, and individual call statements.
- Primitives: simple operations that change the value of a counter or timer.
- Predicates: boolean expressions that guard execution of primitives.
- Using predicates and primitives, points in a program may be instrumented. Predicates and primitives are controlled via PCL and MDL (discussed later).

Paradyn uses dynamic instrumentation to record performance data at points
- Paradyn attaches its performance daemons to a running process, or starts a new process using an unmodified binary.

Instrumentation workflow:
1. The user or performance consultant requests a metric-focus from Paradyn.
2. The data manager in Paradyn uses Remote Procedure Call (RPC) to ask remote processes to start instrumentation for the specific metric-focus. RPC allows heterogeneity in the runtime environment.
3. The metric manager receives the instrumentation request and turns it into an abstract, machine-independent request.
4. The instrumentation manager inserts code into the executable corresponding to the machine-independent abstraction: the executable is stopped, the code is inserted, and the executable resumes running.
5. The instrumented data is periodically sampled by the metric manager and sent back to the data manager.
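The points/primitives/predicates model can be sketched in a few lines. This is a toy illustration with hypothetical names, not Paradyn's implementation:

```python
# Toy model of Paradyn's terminology: a *point* is an instrumentable
# location, a *primitive* updates a counter or timer, and a *predicate*
# guards whether the primitive runs.

counters = {}

def increment(name, amount=1):
    """Primitive: bump a counter."""
    counters[name] = counters.get(name, 0) + amount

class Point:
    """An instrumentable location (procedure entry/exit or a call site)."""
    def __init__(self, location):
        self.location = location
        self.instrumentation = []  # list of (predicate, primitive) pairs

    def add(self, predicate, primitive):
        self.instrumentation.append((predicate, primitive))

    def fire(self, context):
        # Execute each primitive only when its guarding predicate holds.
        for predicate, primitive in self.instrumentation:
            if predicate(context):
                primitive()

# Count entries to a hypothetical "foo", but only for messages over 1 KB.
entry = Point("foo:entry")
entry.add(lambda ctx: ctx["msg_bytes"] > 1024,
          lambda: increment("foo_big_msgs"))

entry.fire({"msg_bytes": 4096})   # predicate true: counter incremented
entry.fire({"msg_bytes": 128})    # predicate false: counter unchanged
```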

Page 7: Paradyn Evaluation Report

Binary Instrumentation

Binary instrumentation is accomplished by inserting base trampolines at each instrumentation point
- Base trampolines handle storing the current state of the program so instrumentation does not affect execution.
- On some architectures, only the registers in use are saved (when this can be inferred from the machine's calling convention).
- Mini trampolines are the machine-specific realizations of predicates and primitives.
- One base trampoline may handle many mini trampolines, but a base trampoline is needed for every instrumentation point.
- Basic flow of a trampoline shown at right, top; mini trampoline assembly code for a SPARC machine shown at right, bottom.

Binary instrumentation is difficult! Have to deal with:
- Compiler optimizations
- Branch delay slots
- Variable-length x86 instructions (may increase the number of instructions that must be relocated)
- Creating and inserting mini trampolines somewhere in the program (at the end?); limited-range jumps may complicate this

Luckily, the DynInst library is available separately for use in other applications.

Paradyn's instrumentation cost is <= 80 clock cycles per base trampoline [2].

Trampoline flow (courtesy [2])

Mini trampoline (courtesy [2])

Page 8: Paradyn Evaluation Report

PCL & MDL

Paradyn provides a Tcl-like language to configure and add metrics without recompiling or modifying Paradyn
- Stored in paradyn.rc; users may supply their own version (.paradynrc)

PCL – Paradyn Control Language
- Controls the available daemons (MPI, sequential, etc.)
- Can add processes automatically at startup (which programs to record performance data for)
- Can customize Paradyn options (colors and other "tunable constants")
- Can add visualizations (described later)
- Can add metrics via MDL

MDL – Metric Description Language
- Sublanguage of PCL used to describe metrics
- Types provided: counters and timers
- Can specify constraints for each metric that limit how it can be used and what it can be used with
- Metrics may be exclusive or inclusive (include a point's calls to other procedures, or just the point's own cost, excluding time spent in procedures called from that point)
- The language is not Turing-complete: no looping construct is provided

Example counter shown at right: counter MDL code (courtesy [2])
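The exclusive/inclusive distinction can be made concrete with a small sketch (not MDL or Paradyn code; timestamps are passed in explicitly so the example is deterministic):

```python
# Inclusive time charges a procedure for time spent in its callees;
# exclusive time subtracts the callees' time out.

inclusive = {}
exclusive = {}
stack = []  # entries: [name, entry_time, time_attributed_to_callees]

def enter(name, now):
    stack.append([name, now, 0.0])

def leave(now):
    name, entered, callee_time = stack.pop()
    elapsed = now - entered
    inclusive[name] = inclusive.get(name, 0.0) + elapsed
    # Exclusive time excludes everything spent in procedures called from here.
    exclusive[name] = exclusive.get(name, 0.0) + elapsed - callee_time
    if stack:
        stack[-1][2] += elapsed  # charge our whole span to the caller

# Hypothetical run: main executes from t=0 to t=10, calling helper from 2 to 7.
enter("main", 0.0)
enter("helper", 2.0)
leave(7.0)    # helper: 5 units, all its own
leave(10.0)   # main: 10 units inclusive, 5 units exclusive
```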

Page 9: Paradyn Evaluation Report

Paradyn Overhead

Instrumentation overhead was very low for most test programs with 5 metrics collected on all functions.

Communication metrics:
- Number of messages sent
- Number of point-to-point messages
- Number of collective messages
- I/O bytes

CPU metric:
- CPU utilization

Instrumenting CAMEL's main routine had 800% overhead
- Instrumenting a function also instruments its call sites
- The main routine made many small function calls

The performance consultant (discussed later) added a large amount of overhead to most programs during searches.

[Figure: bar chart of Paradyn overhead (instrumented/uninstrumented) per benchmark, with values ranging from 0% to 5%. Benchmarks: CAMEL, NAS LU (8p, W), PP: Big message, PP: Diffuse procedure*, PP: Hot procedure, PP: Intensive server, PP: Ping pong, PP: Random barrier, PP: Small messages*, PP: System time, PP: Wrong way*]

Page 10: Paradyn Evaluation Report


Part 2: Performance Consultant

Page 11: Paradyn Evaluation Report

Performance Consultant (PC) Overview

The PC performs an automated search on the program
- Identifies bottlenecks in programs
- Uses the W3 search (described on the next slide)

The search is guided, based on the program's call graph [3]
- Iterative method that tests hypotheses against sections of code
- Starts with main and examines subroutine calls
- "Drills down" and examines subroutines based on how frequently they are called

The call graph search method was successfully applied to several large programs containing thousands of lines of code
- However, the method can miss a function called by more than one parent when no individual parent appears as a "problem function"

The call graph is automatically generated from the executable's symbol table.

An example PC run is shown at top right; the corresponding call graph for the application is shown at bottom right.

Example PC run

Call graph used in PC run

Page 12: Paradyn Evaluation Report

W3 Model: Why, Where, When

Paradyn's goal: "... to assist the user in locating performance problems in a program; a performance problem is a part of the program that contributes a significant amount of time to its execution"

The W3 model attempts to answer:
- Why is the program performing poorly?
- Where is the program performing poorly?
- When is the program performing poorly?

The performance consultant shows the why and where axes graphically to the user (see right)
- Yellow line: why-axis refinement
- Purple line: where-axis refinement

W3 refinements (blue=true, pink=false)

Page 13: Paradyn Evaluation Report

W3 Model: Why, Where, When (2)

Why axis
- Paradyn applies hypotheses to the code: ExcessiveSyncWaitingTime? CPUBound? ExcessiveIOBlockingTime? TooManySmallIOOps?
- Each hypothesis is represented by a tunable predicate, e.g., CPUBound := CPUTime > 20%
- Once a hypothesis is determined to be false, no more searching is done for that type of bottleneck

Where axis
- Once a hypothesis tests true (why refinement), an automated search is started to determine where the problem lies
- Each subroutine is examined to see if the hypothesis is also true there (where refinement)
- The program's call graph is used to guide the search of subroutines
- The where axis is iteratively searched down to the deepest node of the call graph for which the hypothesis tests true

When axis
- Indirectly supported through the use of "phases"
- Phases are defined by the user and represent specific time intervals in a program's execution
- When-axis refinement relies on the user's interaction

While axis refinements are made, the performance consultant automatically requests instrumentation
- The frequency of instrumentation and a limit on the number of concurrent instrumentations can be set by the user

W3 refinements (blue=true, pink=false)
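The why/where search amounts to pruned recursion over the call graph: test a tunable hypothesis at a node, and drill down only while it holds. A minimal sketch with a made-up call graph and CPU-time data (not the actual Performance Consultant):

```python
# Hypothetical call graph and per-function CPU-time fractions.
call_graph = {
    "main":     ["solve", "report"],
    "solve":    ["exchange", "compute"],
    "exchange": [],
    "compute":  [],
    "report":   [],
}
cpu_time = {"main": 0.9, "solve": 0.8, "exchange": 0.1,
            "compute": 0.7, "report": 0.05}

CPU_BOUND_THRESHOLD = 0.20  # tunable, like Paradyn's "tunable constants"

def cpu_bound(func):
    """Why-axis hypothesis, e.g. CPUBound := CPUTime > 20%."""
    return cpu_time[func] > CPU_BOUND_THRESHOLD

def refine(func):
    """Where-axis refinement: drill down while the hypothesis stays true."""
    if not cpu_bound(func):
        return []                      # false here: prune this subtree
    guilty = [c for c in call_graph[func] if cpu_bound(c)]
    if not guilty:
        return [func]                  # deepest node where hypothesis holds
    results = []
    for child in guilty:
        results.extend(refine(child))
    return results

bottlenecks = refine("main")           # -> ["compute"]
```

Note how report (5% CPU) is pruned immediately, mirroring the behavior described above: once a hypothesis tests false for a subtree, no more searching is done there.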

Page 14: Paradyn Evaluation Report

Bottleneck Identification Test Suite

Testing metric: what did the Performance Consultant tell us? Program correctness was not affected by instrumentation.

CAMEL: PASSED
- Identified the program as CPU-bound
- However, the Performance Consultant added much overhead and produced a misdetection on the where axis

LU: TOSS-UP
- Identified an excessive-sync-time bottleneck
- Not further resolved to too many small messages; only able to track down to the ssor.f source file

Big messages: PASSED
- Identified excessive sync time at the Grecv_message function

Diffuse procedure: FAILED
- Identified excessive sync time at MPI_Barrier, but did not localize to the bottleneck procedure
- Missed picking up on the diffuse CPU-bound behavior

Page 15: Paradyn Evaluation Report

Bottleneck Identification Test Suite (2)

Hot procedure: PASSED
- Correctly identified the CPU-bound bottleneck procedure
- Due to excessive instrumentation overhead, the Performance Consultant slightly misdiagnosed the where location: attributed to all nodes except one when all nodes exhibit the problem

Intensive server: TOSS-UP
- Identified excessive sync waiting time on Grecv_message from main
- However, due to the lack of a trace view, it would be difficult or impossible to see all threads waiting on the master thread

Ping-pong: PASSED
- Identified excessive sync waiting time on Grecv_message

Random barrier: TOSS-UP
- Identified excessive sync waiting time on the barrier in main
- No trace view means it would be nearly impossible to see the randomness of which node was (inconsistently) taking more time

Page 16: Paradyn Evaluation Report

Bottleneck Identification Test Suite (3)

Small messages: TOSS-UP
- Identified excessive sync waiting time on Gsendmessage in main
- Did not localize to a particular node, though

System time: FAILED
- The Performance Consultant failed to instrument the code
- Possibly due to the OS being too busy with user code to handle dynamic binary instrumentation

Wrong way: TOSS-UP
- Identified excessive sync waiting time on messages in main
- Would best be seen with a trace, but at least the classification here differed from the other communication-based bottlenecks

Page 17: Paradyn Evaluation Report


Part 3: Visualizations

Page 18: Paradyn Evaluation Report

Visualizations Overview

Paradyn supports several types of built-in visualizations (visis) for metrics:
- Bar charts
- Histograms (right, top)
- Tables (text representation; can show current/max/min values for each metric)
- "Terrain" – a 3D histogram (right, bottom) whose axes are time, metric, and location

Visualizations may handle multiple metrics at once.

Visualizations are implemented as separate processes; callback functions provide continuous data to the visualization programs.

Users may add custom visualizations:
- Paradyn provides a simple library and RPC interface
- Configured to show up in the interface via PCL files (paradyn.rc, .paradynrc)

Terrain visualization

Histogram visualization

Page 19: Paradyn Evaluation Report

Visualizations Overview (2)

When a user creates a visualization, Paradyn automatically instruments the running program accordingly
- The visualization continues until the user closes it
- After closing, Paradyn automatically removes the instrumented code

Histograms are stored using a fixed-size data structure
- Metric values are sorted into "buckets"
- When the buckets fill, the data is reorganized: adjacent buckets are merged and the time interval each bucket covers doubles, so the structure stays a fixed size
- As execution time increases, the sampling rate decreases logarithmically to keep data sizes small

Terrain visualization

Histogram visualization
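The fixed-size, fold-and-double behavior can be sketched as follows (an illustration under stated assumptions, not Paradyn's actual data structure):

```python
# Fixed-size time-histogram: when execution time outgrows the buckets,
# adjacent buckets are merged and each bucket's time interval doubles,
# so older data gets coarser as the run gets longer.

NUM_BUCKETS = 8  # hypothetical capacity

class TimeHistogram:
    def __init__(self, bucket_width=1.0):
        self.bucket_width = bucket_width
        self.buckets = [0.0] * NUM_BUCKETS

    def add(self, timestamp, value):
        while timestamp >= NUM_BUCKETS * self.bucket_width:
            self._fold()
        self.buckets[int(timestamp / self.bucket_width)] += value

    def _fold(self):
        # Merge adjacent pairs and double the interval per bucket;
        # the structure itself stays a fixed size.
        half = [self.buckets[2 * i] + self.buckets[2 * i + 1]
                for i in range(NUM_BUCKETS // 2)]
        self.buckets = half + [0.0] * (NUM_BUCKETS // 2)
        self.bucket_width *= 2

h = TimeHistogram()
for t in range(20):   # samples at t = 0..19 force two folds
    h.add(float(t), 1.0)
```

After the loop the histogram has folded twice (bucket width 1 → 2 → 4), which is exactly why the time-histogram loses resolution as execution continues.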

Page 20: Paradyn Evaluation Report


Part 4: User Interface

Page 21: Paradyn Evaluation Report

User Interface Overview

The current interface uses Tcl/Tk for graphics (right)
- Multiple windows for everything, which makes for a cluttered interface
- Tcl/Tk provides a usable but crude-looking interface

Example Paradyn session

Page 22: Paradyn Evaluation Report

Paradyn Bugs

- Can't detect the end of an MPI program run (Paradyn will crash unless you start over from scratch); the program crashes almost every time shortly after an MPI program completes
- Buggy startup code (starting a new process twice gives errors; the program must be restarted)
- Doesn't work with code compiled with debugging information (gcc -g); see the error dialog at right: "Can't read .shstrtab section"
- Pausing execution and then adding a visualization crashes Paradyn (the program continues execution while Paradyn thinks it is still paused)
- Often leaves zombie child processes, even on error-free runs
- Paradyn left unkillable processes hanging around after crashes on the etas; even killall -9 could not get rid of them

Page 23: Paradyn Evaluation Report

Paradyn Complaints

- Slow startup (~5 seconds for each MPI node on the etas)
- The performance consultant takes a while to identify bottlenecks; although the search is entirely automated, it only seems to pick up on code that exhibits obvious bottlenecks
- Cluttered and confusing interface: why are there separate windows for the call graph and the where axis?
- Many bugs; most are handled by displaying a nice dialog box, but some necessitate a Paradyn restart
- The function list in the "where axis" dialog box contains a huge number of functions for MPI programs (~100+; it includes the MPI functions, which makes it hard to single out your application's functions)
- The phase function is difficult to use; it should be easier to define phases, or to base phases on subroutine entry/exit points
- No "stop process" button!

Page 24: Paradyn Evaluation Report

Paradyn General Comments

W3 search hypotheses and threshold functions seem overly simplistic (-)
- Doesn't seem to work well on code that alternates quickly between communication and computation
- Small number of hypotheses, perhaps due to the large cost of evaluating each one?
- Cutoff values for the hypotheses seem arbitrary; they are tunable, but is a fixed cutoff appropriate?
- The performance consultant was not able to detect or classify a sleep(1) statement inserted for a single MPI process; it should have labeled the receiving node as ExcessiveSyncWaitingTime, but did not label the process at all
- Quick changes between computation and communication may have fooled it; adjusting thresholds might have helped, but how would you know which thresholds to change?

How useful is the information provided by the W3 search?
- Seems to only be able to pick out obvious things
- Says what the problem is, but does it offer insight on how to fix it?

Overhead introduced by dynamic instrumentation seems very small (+++)
- < 1% for 16 metrics being collected on a 16-node MPI application
- However, overhead can increase dramatically for functions that call other (lightweight) functions many times over

Page 25: Paradyn Evaluation Report

Paradyn General Comments (2)

Platform support (--)
- Paradyn: no support for 64-bit applications or Cray platforms!
- DynInst: no support for 64-bit Opteron or Cray platforms! (Support for Itanium is provided, though)
- The dependence on DynInst, combined with the difficulty of porting DynInst to new platforms, is a potential problem

Adding and removing instrumentation is fast and works well (+++)
- DynInst seems to be much more stable than Paradyn, minus the parsing bugs for executables compiled with gcc -g
- Adding instrumentation to code usually takes one second or less
- Helps reduce the measure stage of the "measure-modify" approach
- However, the time needed to start programs is significantly increased, especially with many processes (-)
- The extra delays incurred during instrumentation also hurt the ability to gather traces of program execution

Is dynamic instrumentation necessary?
- Things are greatly simplified when dynamic binary-level instrumentation is not implemented
- Is it worth the added cost and complexity?

Fairly complex piece of software; takes a while to learn how to use effectively, even with the tutorials (-)
- This, along with its complicated installation procedure, may discourage its use
- Though the documentation is pretty good

PCL and MDL allow configuration and addition of user-defined metrics (+++)

Page 26: Paradyn Evaluation Report

Feasibility for UPC & SHMEM

In order to add support for UPC & SHMEM:
- Need to create Paradyn daemons for UPC and SHMEM codes; this may be very difficult, since Paradyn daemons need to handle instrumentation
- For UPC, how should communication be handled? Instrument the runtime libraries?
  - Which runtimes should be supported? Is it feasible to support all runtimes of interest?
  - What about proprietary UPC compilers and runtimes? This could be an insurmountable problem
- Paradyn has been around for a long time; is there a lot of crufty code in the source that is left alone because no one understands it?
- Is the current user interface (Tcl/Tk) acceptable?

Also:
- Would need to port DynInst to the targeted architectures; this may be problematic for architectures with no publicly available information on executable file formats, etc.
- Should include performance metrics as recorded by PAPI; MDL should help, but will MDL present too large an overhead for the level of granularity needed by PAPI?
- Is the lack of tracing ability acceptable? What if more details are needed?

Page 27: Paradyn Evaluation Report

Evaluation (1)

Available metrics: 5/5
- Many built in:
  - Number of CPUs, number of active threads, CPU time and inclusive CPU time
  - Function calls to and by
  - Synchronization (# operations, wait time, inclusive wait time)
  - Overall, collective, and point-to-point communication (# messages, bytes sent and received for each)
  - I/O (# operations, wait time, inclusive wait time, total bytes)
- Can add more using MDL and PCL

Cost: 5/5 (free)

Documentation quality: 4/5
- Tutorials for using sequential and MPI programs with Paradyn
- Well-written manuals
- Programming guides included for DynInst, the visualization library, and MDL

Extensibility: 2/5
- Creating a SHMEM daemon wouldn't take a lot of work
- Creating a UPC daemon will be problematic for proprietary runtimes
- Depends on DynInst, and porting DynInst to a new platform may take an immense amount of work

Filtering and aggregation: 3/5
- Only supports rudimentary aggregation on metrics (min, max, averages)

Hardware support: 2/5
- No support for Opteron, Itanium, or Cray architectures in Paradyn (DynInst does support Itanium)
- Porting DynInst (which Paradyn depends on) would be very difficult

Heterogeneity support: 5/5
- The authors claim Paradyn supports heterogeneity due to its use of RPC interfaces
- Not directly supported by the user interface for MPI programs, so we could not test this

Page 28: Paradyn Evaluation Report

Evaluation (2)

Installation: 2/5
- Binaries are easy to find: http://www.paradyn.org/html/paradyn4.1-software.html
- Compiling from source is extremely difficult and error-prone
- Relies on specific versions of libdwarf (Linux only) and Tcl/Tk (all platforms), which complicates installation if your distribution or OS uses incompatible versions
- Installation time: approximately 2-3 hours for a shared environment
- Need to create scripts that set about 6 environment variables before the program will run correctly

Interoperability: 1/5
- Paradyn can save output in a simple, documented format, but the usefulness of the data is unknown
- No detailed, trace-like information can be provided, as it is not collected
- Dynamic instrumentation interferes with tracing due to timing perturbations

Learning curve: 2/5
- Difficult, complex program with many parts
- Took approximately 1 week to get comfortable with the program
- Manuals and tutorials are very helpful

Manual overhead: 5/5
- No modification needed for executables

Measurement accuracy: 4/5
- Dynamic instrumentation incurs very low overhead (~80 cycles of trampoline overhead)
- The time-histogram loses accuracy as time goes on due to its fixed size

Multiple execution: 0/5 (not supported)

Page 29: Paradyn Evaluation Report

Evaluation (3)

Multiple views and analyses: 5/5
- Several visualization types are supported for all metrics: histograms, bar charts, 3D histograms, tables, summary tables (min/max/average)
- Users can add new visualization programs as desired using Paradyn's RPC interface and visualization library
- Call graphs and the where axis give users a hierarchical view of their code
- Default visualizations support zooming and panning

Performance bottleneck identification: 2.5/5
- The performance consultant can help identify "obvious" bottlenecks automatically
- Due to the limited search space (only 4 types of bottlenecks), bottleneck identification is limited
- Tweaking the thresholds used for the search may be necessary to identify bottlenecks

Profiling/tracing support: 2/5
- Uses a "hybrid" approach of sampling and tracing
- Detailed tracing information cannot be logged for later analysis; cannot create a trace file of when MPI functions were called (e.g., what you'd need for Jumpshot)
- Paradyn daemons report values back to the main Paradyn process at time intervals, so the main process only has an approximation of the "real" metric values at any given time
- However, values recorded by the main Paradyn process can be exported (uses a simple, documented format)

Response time: 5/5
- Dynamic instrumentation allows instrumentation to be turned on and off arbitrarily without needing to restart or recompile the application
- Only takes a few seconds to start collecting metrics once they are requested

Page 30: Paradyn Evaluation Report

Evaluation (4)

Software support: 3/5
- Supported languages: threaded C code, MPI code
- Supported software platforms: Linux kernel versions 2.4 & 2.6, AIX, Tru64, Windows 2000/XP, and IRIX

Source code correlation: 2/5
- Can correlate back to the function-name level (reads the executable's symbol table)
- No line numbers or statement information available

Searching: 0/5 (not supported)

System stability: 2/5
- Many bugs in the Linux version, but the bugs seem to be limited to the Paradyn GUI; DynInst seems very stable

Technical support: 4/5
- Helpful responses from our contact within 24 hours

Page 31: Paradyn Evaluation Report

References

[1] B. P. Miller et al., "The Paradyn parallel performance measurement tool," IEEE Computer, November 1995, pp. 37-46.

[2] J. K. Hollingsworth et al., "MDL: A Language and Compiler for Dynamic Program Instrumentation," IEEE PACT, 1997, p. 201.

[3] H. Cain, B. P. Miller, and B. J. N. Wylie, "A Callgraph-Based Search Strategy for Automated Performance Diagnosis," European Conference on Parallel Computing (Euro-Par), Munich, Germany, August 2000, p. 108.