Paradyn Evaluation Report
Adam Leko, UPC Group
HCS Research Laboratory, University of Florida
Color encoding key:
Blue: Information
Red: Negative note
Green: Positive note

Basic Information
Name: Paradyn
Developer: University of Wisconsin-Madison
Current versions:
  Paradyn: 4.1.1
  DynInst: 4.1.1
  KernInst: 2.0.1
Website: http://www.paradyn.org/index.html
Contact: Matthew Legendre

What is Paradyn?
A performance analysis tool (PAT) for sequential and parallel programs
  Uses dynamic binary instrumentation to record program metrics (may use unmodified executables)
  Visualizations
    Metric-focus grids (right, top)
      Rows: performance metrics
      Columns: resources to collect a performance metric from
    Metrics can be reported as current value, statistics (min/max/average), or time-histograms (right, bottom)
  Performance consultant
    Automated search to identify bottlenecks in program
    Uses W3 model: where, when, why
A generic project that includes tools related to performance analysis
  Paradyn PAT
  DynInst: a dynamic binary instrumentation library
  KernInst: a dynamic instrumentation library for instrumenting running operating system (OS) kernels
    Not very useful for a PAT unless the PAT needs to be applied to an OS kernel
  MRNet: a high-performance communication library supporting master-slave software architectures
    “Multicast/Reduction Network”
    Not immediately useful for the design phase of our PAT

Example metric-focus grid
          CPU usage   Bandwidth
  Node 1  84%         [histogram]
  Node 2  95%         [histogram]
  Node 3  33%         [histogram]
  Node 4  65%         [histogram]

Visual representation of example time-histogram (axes: time, bandwidth)

General Paradyn Architecture
Image courtesy [1]
Four main components
  User interface (green)
  Visualization (red)
  Performance consultant (purple)
  Instrumentation (blue)
Thick circles represent running processes; dotted circles represent threads within a single process
Will present each using a “bottom-up” approach

Part 1: Instrumentation

Instrumentation Overview
Paradyn terminology: points, primitives, and predicates
  Points: places where instrumentation code can be placed
    Supported points: procedure entry, procedure exit, individual call statements
  Primitives: simple operations that change the value of a counter or timer
  Predicates: Boolean expressions that guard execution of primitives
  Using predicates and primitives, points in a program may be instrumented
  Predicates and primitives are controlled via PCL and MDL (discussed later)
Paradyn uses dynamic instrumentation to record performance data at points
  Paradyn attaches its performance daemons to a running process or starts a new process using an unmodified binary
Instrumentation workflow:
  User or performance consultant requests a metric-focus from Paradyn
  Data manager in Paradyn uses Remote Procedure Call (RPC) to communicate with remote processes, asking them to start instrumentation for a specific metric-focus
    RPC allows heterogeneity in the runtime environment
  Metric manager receives the instrumentation request and turns it into an abstract, machine-independent request
  Instrumentation manager inserts code into the executable corresponding to the machine-independent abstraction
    Executable is stopped
    Code is inserted
    Executable resumes running
  Instrumented data is periodically sampled by the metric manager and sent back to the data manager
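
The relationship between points, primitives, and predicates can be illustrated with a small Python model. This is only a sketch of the terminology, not Paradyn code: the class names, the `fire` method, and the message-counting example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Counter:
    value: int = 0


@dataclass
class Snippet:
    predicate: Callable[[dict], bool]   # Boolean guard
    primitive: Callable[[dict], None]   # changes a counter or timer


@dataclass
class Point:
    """A place in the program where instrumentation can be attached."""
    name: str
    snippets: List[Snippet] = field(default_factory=list)

    def fire(self, context: dict) -> None:
        # Called whenever execution reaches this point.
        for snippet in self.snippets:
            if snippet.predicate(context):
                snippet.primitive(context)


bytes_to_node1 = Counter()


def add_bytes(context: dict) -> None:
    bytes_to_node1.value += context["nbytes"]


# Hypothetical point: entry of a send routine.
send_entry = Point("send:entry")
send_entry.snippets.append(
    Snippet(predicate=lambda ctx: ctx["dest"] == 1, primitive=add_bytes)
)

# Simulate three calls reaching the point; only sends to node 1 are counted.
for dest, nbytes in [(1, 100), (2, 50), (1, 25)]:
    send_entry.fire({"dest": dest, "nbytes": nbytes})

print(bytes_to_node1.value)  # 125
```

Here the predicate restricts the metric to a focus (messages to node 1), which mirrors how a metric-focus pair narrows what is measured.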

Binary Instrumentation
Binary instrumentation is accomplished by inserting base trampolines for each instrumentation point
  Base trampolines handle storing the current state of the program so instrumentation does not affect execution
    On some architectures, only registers that are used are saved (if this can be inferred from the machine calling convention)
  Mini trampolines are the machine-specific realizations of predicates and primitives
  One base trampoline may handle many mini trampolines, but a base trampoline is needed for every instrumentation point
Basic flow of a trampoline shown at right, top
Mini trampoline assembly code for a SPARC machine shown at right, bottom
Binary instrumentation is difficult!
  Have to deal with
    Compiler optimizations
    Branch delay slots
    Different sizes of instructions on x86 (may increase the number of instructions that have to be relocated)
    Creating and inserting mini trampolines somewhere in the program (at the end?)
      Limited-range jumps may complicate this
  Luckily, the DynInst library is available separately for use in other applications
Paradyn’s instrumentation cost: <= 80 clock cycles per base trampoline [2]
Trampoline flow (courtesy [2])
Mini trampoline (courtesy [2])
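
The trampoline idea can be mimicked in a few lines of Python. This is a toy simulation under simplifying assumptions (registers are a dictionary, every register is saved, and all mini trampolines run before the relocated instruction); the function names are hypothetical and none of this is DynInst code.

```python
def make_base_trampoline(relocated_instruction):
    """Build a toy base trampoline for one instrumentation point."""
    mini_trampolines = []  # snippets attached to this point

    def base_trampoline(registers):
        saved = dict(registers)            # save program state
        for mini in mini_trampolines:
            mini(registers)                # snippets may clobber registers
        registers.clear()
        registers.update(saved)            # restore program state
        relocated_instruction(registers)   # run the displaced instruction

    return base_trampoline, mini_trampolines


call_count = {"foo": 0}


def count_call(registers):
    registers["r1"] = 999                  # scratch use of a register
    call_count["foo"] += 1                 # the actual primitive


def original_instruction(registers):
    registers["r2"] = registers["r1"] + 1


base, minis = make_base_trampoline(original_instruction)
minis.append(count_call)

regs = {"r1": 41, "r2": 0}
base(regs)
print(regs, call_count)  # {'r1': 41, 'r2': 42} {'foo': 1}
```

Because the base trampoline restores the saved state, the relocated instruction sees `r1 == 41` even though the snippet overwrote it, which is the property that keeps instrumentation from changing program behavior.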

PCL & MDL
Paradyn provides a Tcl-like language to configure and add metrics without recompiling or modifying Paradyn
  Stored in paradyn.rc; users may use their own version (.paradynrc)
PCL: Paradyn Control Language
  Controls available daemons (MPI, sequential, etc.)
  Can add processes automatically at startup (which programs to record performance data for)
  Can customize Paradyn options (colors and other “tunable constants”)
  Can add visualizations (described later)
  Can add metrics via MDL
MDL: Metric Description Language
  Sublanguage of PCL
  Describes metrics
    Types provided: counters and timers
  Can specify constraints for each metric that limit how it can be used and what it can be used with
  Metrics may be inclusive (include a point’s calls to other procedures) or exclusive (include only the point’s own cost, excluding time spent in other procedures called from this point)
  Language is not Turing complete: no looping construct provided
Example counter shown at right
Counter MDL code (courtesy [2])
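
The difference between inclusive and exclusive timers can be shown on a small enter/exit event trace. The sketch below is plain Python rather than MDL, assumes non-recursive calls, and uses a made-up trace; the function name `inclusive_exclusive` is hypothetical.

```python
def inclusive_exclusive(events):
    """Compute inclusive and exclusive time per function.

    events: list of (kind, function_name, timestamp) with kind in
    {"enter", "exit"}, properly nested and non-recursive.
    """
    inclusive, exclusive = {}, {}
    stack = []  # each entry: [name, enter_time, time_spent_in_callees]
    for kind, name, t in events:
        if kind == "enter":
            stack.append([name, t, 0])
        else:
            fname, start, child_time = stack.pop()
            total = t - start
            inclusive[fname] = inclusive.get(fname, 0) + total
            exclusive[fname] = exclusive.get(fname, 0) + total - child_time
            if stack:
                stack[-1][2] += total  # charge this call to the caller
    return inclusive, exclusive


trace = [
    ("enter", "main", 0),
    ("enter", "solve", 2),
    ("exit", "solve", 9),
    ("exit", "main", 10),
]
inc, exc = inclusive_exclusive(trace)
print(inc)  # {'solve': 7, 'main': 10}
print(exc)  # {'solve': 7, 'main': 3}
```

In this trace `main` has an inclusive time of 10 but an exclusive time of only 3, since 7 time units were spent in `solve`.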

Paradyn Overhead
Instrumentation overhead very low for most test programs with 5 metrics on all functions
  Communication metrics
    Number of messages sent
    Number of point-to-point messages
    Number of collective messages
    I/O bytes
  CPU metric
    CPU utilization
Instrumenting CAMEL’s main routine had 800% overhead
  Instrumenting a function also instruments its call sites
  main routine had many small function calls
Performance consultant (discussed later) added a large amount of overhead to most programs during searches
[Figure: Paradyn overhead (bar chart). Axes: benchmark vs. overhead (instrumented/uninstrumented), 0% to 5%. Benchmarks: CAMEL; NAS LU (8p, W); PP: Big message; PP: Diffuse procedure*; PP: Hot procedure; PP: Intensive server; PP: Ping pong; PP: Random barrier; PP: Small messages*; PP: System time; PP: Wrong way*. Bar labels as extracted (pairing with benchmarks not recoverable): 1%, 1%, 4%, 3%, 0%, 5%, 2%, 3%, 0%, 0%, 0%.]
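
A back-of-the-envelope calculation shows why a routine that makes many small function calls can suffer far more overhead than the averages above suggest. The 80-cycle figure is the per-trampoline bound quoted from [2]; the function body sizes and the number of points hit per call are illustrative assumptions, not measurements.

```python
TRAMPOLINE_CYCLES = 80  # upper bound per base trampoline, from [2]


def overhead_ratio(body_cycles, points_hit_per_call):
    """Added cycles divided by uninstrumented cycles for one call."""
    added = TRAMPOLINE_CYCLES * points_hit_per_call
    return added / body_cycles


# Assumed: a tiny 20-cycle function with entry and exit points instrumented.
print(overhead_ratio(20, 2))       # 8.0, i.e. 800%

# Assumed: a 100,000-cycle function with the same two points.
print(overhead_ratio(100_000, 2))  # 0.0016, i.e. 0.16%
```

The fixed per-point cost is negligible for long-running functions but dominates when the instrumented functions are only a few instructions long.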

Part 2: Performance Consultant

Performance Consultant (PC) Overview
PC performs an automated search on the program
  Identifies bottlenecks in programs
  Uses W3 search (described on the next slide)
Search is guided, based on the program’s call graph [3]
  Iterative method that tests hypotheses against sections of code
  Starts with main and examines subroutine calls
  “Drills down” and examines subroutines based on how frequently they are called
  Call graph search method was successfully applied to several large programs containing thousands of lines of code
  However, the method can miss a function called by more than one parent function if none of its parent functions appears as a “problem function”
  Call graph is automatically generated from the executable’s symbol table
Example PC run shown at top right; corresponding call graph for the application shown at bottom right
Example PC run
Call graph used in PC run
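
A simplified version of the call-graph drill-down, including the case it can miss, might look like the following Python sketch. It assumes the hypothesis is a fixed threshold on each function's inclusive share of run time; the function name, the threshold, and the example graph are hypothetical, and the real search in [3] is more elaborate.

```python
def where_axis_search(call_graph, fraction, root="main", threshold=0.20):
    """Return the deepest functions, along refined paths, that test true."""
    found, visited = [], set()

    def visit(fn):
        if fn in visited:
            return
        visited.add(fn)
        hot_children = [c for c in call_graph.get(fn, [])
                        if fraction.get(c, 0.0) > threshold]
        if not hot_children:
            found.append(fn)       # cannot refine any further
        for child in hot_children:
            visit(child)           # drill down only into "problem" callees

    if fraction.get(root, 0.0) > threshold:
        visit(root)
    return found


# util is called from both a and b and accounts for 25% of run time overall,
# but neither parent exceeds the threshold on its own.
call_graph = {"main": ["a", "b", "c"], "a": ["util"], "b": ["util"], "c": []}
fraction = {"main": 1.00, "a": 0.15, "b": 0.15, "c": 0.60, "util": 0.25}

result = where_axis_search(call_graph, fraction)
print(result)  # ['c']
```

The search reports `c` but never examines `util`, because it only descends through parents that already test true.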

W3 Model: Why, Where, When
Paradyn’s goal: “… to assist the user in locating performance problems in a program; a performance problem is a part of the program that contributes a significant amount of time to its execution”
W3 model attempts to answer:
  Why is the program performing poorly?
  Where is the program performing poorly?
  When is the application performing poorly?
Performance consultant shows the why and where axes graphically to the user (see right)
  Yellow line: why axis refinement
  Purple line: where axis refinement
W3 refinements (blue=true, pink=false)

W3 Model: Why, Where, When (2)
Why axis
  Paradyn applies hypotheses to code
    ExcessiveSyncWaitingTime?
    CPUBound?
    ExcessiveIOBlockingTime?
    TooManySmallIOOps?
  Each hypothesis is represented by a tunable predicate
    E.g., CPUBound := CPUTime > 20%
  After a hypothesis is determined to be false, no more searching is done for that type of bottleneck
Where axis
  Once a hypothesis tests true (why refinement), an automated search is started to determine where the problem lies
  Each subroutine is examined to see if the hypothesis is also true (where refinement)
  The program’s call graph is used to guide the search of subroutines
  Where axis is iteratively searched until reaching the deepest node of the call graph for which the hypothesis tests true
When axis
  Indirectly supported through the use of “phases”
  Phases are defined by the user
  Phases represent specific time intervals in a program’s execution
  When axis refinement relies on the user’s interaction
While axis refinements are made, the performance consultant automatically requests instrumentation
  Frequency of instrumentation and a limit on the number of concurrent instrumentations can be set by the user
W3 refinements (blue=true, pink=false)
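
The why-axis hypotheses can be thought of as threshold predicates over measured fractions of run time. In the sketch below only the CPUBound cutoff (20%) comes from the slide; the other cutoffs, the metric names, and the sample measurements are assumptions made for illustration.

```python
# hypothesis name -> (metric it tests, cutoff as a fraction of run time)
HYPOTHESES = {
    "CPUBound": ("cpu_time", 0.20),                   # cutoff from the slide
    "ExcessiveSyncWaitingTime": ("sync_wait", 0.20),  # assumed cutoff
    "ExcessiveIOBlockingTime": ("io_wait", 0.20),     # assumed cutoff
}


def why_axis(metrics, hypotheses=HYPOTHESES):
    """Return the hypotheses that test true for the given measurements."""
    return [name for name, (metric, cutoff) in hypotheses.items()
            if metrics.get(metric, 0.0) > cutoff]


# A program dominated by computation and synchronization waits.
print(why_axis({"cpu_time": 0.35, "sync_wait": 0.50, "io_wait": 0.05}))
# ['CPUBound', 'ExcessiveSyncWaitingTime']

# A program whose time is spread just under every cutoff is never flagged.
print(why_axis({"cpu_time": 0.18, "sync_wait": 0.18, "io_wait": 0.18}))
# []
```

The second case shows how a fixed cutoff can leave a program unclassified even though no single category looks healthy, which is why the thresholds are tunable.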

Bottleneck Identification Test Suite
Testing metric: what did Performance Consultant tell us?
Program correctness not affected by instrumentation
CAMEL: PASSED
  Identified program as CPU-bound
  However, Performance Consultant added much overhead, which resulted in a misdetection on the where axis
LU: TOSS-UP
  Identified as excessive sync time bottleneck
  Not further resolved to too many small messages; was only able to track it down to the ssor.f source code file
Big messages: PASSED
  Identified excessive sync time @ Grecv_message function
Diffuse procedure: FAILED
  Identified excessive sync time at MPI_Barrier, but did not localize to bottleneck procedure
  Missed picking up on diffuse CPU-bound behavior

Bottleneck Identification Test Suite (2)
Hot procedure: PASSED
  Correctly identified CPU-bound bottleneck procedure
  Due to overhead from excessive instrumentation, Performance Consultant slightly misdiagnosed the where-axis location
    Attributed to all nodes except one when all nodes exhibit the problem
Intensive server: TOSS-UP
  Identified excessive sync waiting time on Grecv_message from main
  However, due to lack of trace view, it would be difficult/impossible to see all threads waiting on the master thread
Ping-pong: PASSED
  Identified excessive sync waiting time on Grecv_message
Random barrier: TOSS-UP
  Identified excessive sync waiting time on barrier in main
  No trace view means it would be nearly impossible to see randomness of which node was (inconsistently) taking more time

Bottleneck Identification Test Suite (3)
Small messages: TOSS-UP
  Identified excessive sync waiting time on Gsendmessage in main
  Did not localize to a particular node, though
System time: FAILED
  Performance Consultant failed to instrument code
    Possibly due to OS being too busy with user code to handle dynamic binary instrumentation
Wrong way: TOSS-UP
  Identified excessive sync waiting time on messages in main
  Would best be seen by a trace, but classification here was different from other communication-based bottlenecks

Part 3: Visualizations

Visualizations Overview
Paradyn supports several types of built-in visualizations (visis) for metrics
  Bar charts
  Histograms (right, top)
  Table (text representation, can show current/max/min values for each metric)
  “Terrain”: 3D histogram (see right, bottom)
    Axes are time, metric, location
Visualizations may handle multiple metrics at once
Visualizations are implemented as separate processes
  Callback functions are used to provide continuous data to visualization programs
Users may add custom visualizations
  Paradyn provides a simple library and RPC interface
  Configured to show up in interface via PCL files (paradyn.rc, .paradynrc)
Terrain visualization
Histogram visualization

Visualizations Overview (2)
When a user creates a visualization, Paradyn automatically instruments the running program accordingly
  Visualization continues until the user closes it
  After closing, Paradyn automatically removes the instrumented code
Histograms are stored using a fixed-size data structure
  Metric values sorted into “buckets”
  When buckets fill, data is reorganized and the time span covered by each bucket doubles (keeping the structure a fixed size)
  As execution time increases, sampling rate decreases logarithmically to keep data sizes small
Terrain visualization
Histogram visualization
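
A fixed-size time histogram of this kind can be sketched in Python as follows. The sketch assumes that "reorganizing" means merging adjacent pairs of buckets so that each bucket covers twice as much time; the class and method names are hypothetical and the bucket count is kept tiny for readability.

```python
class TimeHistogram:
    """Fixed-size histogram whose buckets widen as execution time grows."""

    def __init__(self, num_buckets=4, bucket_width=1.0):
        assert num_buckets % 2 == 0, "folding merges buckets in pairs"
        self.buckets = [0.0] * num_buckets
        self.width = bucket_width

    def add(self, t, value):
        # Fold until time t fits inside the fixed number of buckets.
        while t >= self.width * len(self.buckets):
            self._fold()
        self.buckets[int(t // self.width)] += value

    def _fold(self):
        n = len(self.buckets)
        merged = [self.buckets[i] + self.buckets[i + 1]
                  for i in range(0, n, 2)]
        self.buckets = merged + [0.0] * (n - len(merged))
        self.width *= 2  # each bucket now spans twice as much time


h = TimeHistogram(num_buckets=4, bucket_width=1.0)
for t in range(5):       # samples at t = 0, 1, 2, 3, 4
    h.add(t, 1.0)

print(h.buckets, h.width)  # [2.0, 2.0, 1.0, 0.0] 2.0
```

Memory use stays constant, but after each fold the time resolution is halved, so detail about early execution is gradually lost.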

Part 4: User Interface

User Interface Overview
Current interface uses Tcl/Tk for graphics (right)
Multiple windows for everything
  Makes for a cluttered interface
Tcl/Tk provides a usable but crude-looking interface
Example Paradyn session

Paradyn Bugs
Can’t detect end of MPI program run (Paradyn will crash unless you start over from scratch)
  Program crashes almost every time shortly after MPI program completes
Buggy startup code (starting a new process twice gives errors; program must be restarted)
Doesn’t work with code compiled with debugging information (gcc -g); see error dialog to right
  “Can’t read .shstrtab section”
Pausing execution and adding a visualization crashes Paradyn (program continues execution while Paradyn thinks it is still paused)
Often leaves zombie child processes, even on error-free runs
Paradyn left unkillable processes hanging around after crashes on etas
  killall -9 could not get rid of them

Paradyn Complaints
Slow startup (~5 seconds for each MPI node on etas)
Performance consultant takes a while to identify bottlenecks
  Although search is entirely automated
  However, only seems to pick up on code that exhibits obvious bottlenecks
Cluttered and confusing interface
  Why are there separate windows for the call graph and where axes?
Many bugs, although most are handled by displaying a nice dialog box
  However, some bugs necessitate a Paradyn restart
Function list on “where axis” dialog box contains a huge number of functions for MPI programs (~100+; includes MPI functions, which makes it hard to single out your application’s functions)
Phase function difficult to use
  Should be easier to define phases, or base phases on subroutine entry/exit points
No “stop process” button!

Paradyn General Comments
W3 search hypotheses and threshold functions seem overly simplistic (-)
  Doesn’t seem to work well on code that alternates quickly between communication and computation
  Small number of hypotheses, perhaps due to large cost of evaluating each one?
  Cutoff values for hypotheses seem arbitrary
    Are tunable, but is a fixed cutoff appropriate?
  Performance consultant was not able to detect/classify a sleep(1) statement inserted for a single MPI process
    Should have labeled the receiving node as ExcessiveSyncWaitingTime, but did not label the process at all
    Quick changes between computation and communication may have fooled it; perhaps adjusting thresholds would have helped
    How would you know which thresholds to change?
How useful is the information provided by the W3 search?
  Seems to only be able to pick out obvious things
  Says what the problem is, but does it offer insight on how to fix it?
Overhead introduced by dynamic instrumentation seems very tiny (+++)
  < 1% for 16 metrics being collected on a 16-node MPI application
  However, overhead can increase dramatically for functions that call other (lightweight) functions many times over

Paradyn General Comments (2)
Platform support (--)
  Paradyn: No support for 64-bit applications or Cray platforms!
  DynInst: No support for 64-bit Opteron or Cray platforms! (Support for Itanium is provided, though)
  Dependence on DynInst, combined with difficulty in porting DynInst to new platforms, is a potential problem
Adding and removing instrumentation is fast and works well (+++)
  DynInst seems to be much more stable than Paradyn, minus the parsing bugs for executables compiled with gcc -g
  Adding instrumentation to code usually takes one second or less
    Helps reduce the measure stage of the “measure-modify” approach
  However, time needed to start programs significantly increased, especially with many processes (-)
  However, extra delays incurred during instrumentation affect the ability to gather traces of program execution
Is dynamic instrumentation necessary?
  Things are greatly simplified when dynamic binary-level instrumentation is not implemented
  Is it worth the added cost and complexity?
Fairly complex piece of software, takes a while to learn how to use effectively, even with tutorials (-)
  This, along with its complicated installation procedure, may discourage its use
  Though documentation is pretty good
PCL and MDL allow configuration and addition of user-defined metrics (+++)

Feasibility for UPC & SHMEM
In order to add support for UPC & SHMEM:
  Need to create Paradyn daemons for UPC and SHMEM codes
    This may be very difficult, since Paradyn daemons need to handle instrumentation
  For UPC, how should communication be handled?
    Instrument runtime libraries?
      Which runtimes should be supported?
      Is it feasible to support all runtimes of interest?
    What about proprietary UPC languages and runtimes? This could be an insurmountable problem
  Paradyn has been around for a long time
    Is there a lot of crufty code in the source code that is left alone because no one understands it?
    Is the current user interface (Tcl/Tk) acceptable?
Also:
  Would need to port DynInst to targeted architectures
    This may be problematic for architectures with no publicly available information on executable file formats, etc.
  Should include performance metrics as recorded by PAPI
    MDL should help, but will MDL present too large an overhead for the level of granularity needed by PAPI?
  Is a lack of tracing ability acceptable?
    What if more details are needed?

Evaluation (1)
Available metrics: 5/5
  Many built in
    Number of CPUs, number of active threads, CPU and inclusive CPU time
    Function calls to and by
    Synchronization (# operations, wait time, inclusive wait time)
    Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received)
    I/O (# operations, wait time, inclusive wait time, total bytes)
  Can add more using MDL and PCL
Cost: 5/5 (free)
Documentation quality: 4/5
  Tutorial for using sequential and MPI programs with Paradyn
  Well-written manuals
  Programming guides included for DynInst, visualization library, and MDL
Extensibility: 2/5
  Creating a SHMEM daemon wouldn’t take a lot of work
  Creating a UPC daemon will be problematic for proprietary runtimes
  Depends on DynInst, and porting DynInst to a new platform may take an immense amount of work
Filtering and aggregation: 3/5
  Only supports rudimentary aggregation on metrics (min, max, averages)
Hardware support: 2/5
  No support for Opteron, Itanium, or Cray architectures in Paradyn
    DynInst supports Itanium
  Porting DynInst (which Paradyn depends on) would be very difficult
Heterogeneity support: 5/5
  Authors claim Paradyn supports heterogeneity due to use of RPC interfaces
  Not directly supported by user interface for MPI programs, so cannot test

Evaluation (2)
Installation: 2/5
  Binaries are easy to find
    http://www.paradyn.org/html/paradyn4.1-software.html
  Compiling from source extremely difficult and error-prone
    Relies on specific versions of libdwarf (Linux only) and Tcl/Tk (all), which complicates the installation if your distribution or OS uses incompatible versions
  Installation time: approximately 2-3 hours for a shared environment
  Need to create scripts that set about 6 environment variables before program will run correctly
Interoperability: 1/5
  Paradyn can save output in simple, documented format, but usefulness of data unknown
  No detailed, trace-like information can be provided as it is not collected
  Dynamic instrumentation interferes with tracing due to timing perturbations
Learning curve: 2/5
  Difficult, complex program with many parts
  Took approximately 1 week to get comfortable with the program
  Manuals and tutorials are very helpful
Manual overhead: 5/5
  No modification needed for executables
Measurement accuracy: 4/5
  Dynamic instrumentation incurs very low overhead (~80 cycles for trampoline overhead)
  Time-histogram loses accuracy as time goes on due to fixed size
Multiple execution: 0/5 (not supported)

Evaluation (3)
Multiple views and analyses: 5/5
  Several visualization types are supported for all metrics
    Histograms, bar charts, 3D histograms, tables, summary tables (min/max/average)
  Users can add new visualization programs as desired using Paradyn’s RPC interface and visualization library
  Call graphs and where axis give user a hierarchical view of their code
  Default visualizations support zooming and panning
Performance bottleneck identification: 2.5/5
  Performance consultant can help identify “obvious” bottlenecks automatically
  Due to limited search space (only 4 types of bottlenecks), bottleneck identification is limited
  Tweaking thresholds used for search may be necessary to identify bottlenecks
Profiling/tracing support: 2/5
  Uses a “hybrid” approach of sampling and tracing
  Detailed tracing information cannot be logged for later analysis
    Cannot create a trace file of when MPI functions were called (e.g., what you’d need for Jumpshot)
  Paradyn daemons report values back to main Paradyn process at time intervals
    Main Paradyn process only has an approximation of “real” values for metrics at any given time
    However, values recorded by main Paradyn process can be exported (uses a simple documented format)
Response time: 5/5
  Dynamic instrumentation allows arbitrarily turning instrumentation on and off without needing to restart or recompile the application
  Only takes a few seconds to start collecting metrics once they are requested

Evaluation (4)
Software support: 3/5
  Supported languages: threaded C code, MPI code
  Supported software platforms: Linux kernel versions 2.4 & 2.6, AIX, Tru64, Windows 2000/XP, and IRIX
Source code correlation: 2/5
  Can correlate back to the function name level (reads executable symbol tables)
  No line numbers or statement information available
Searching: 0/5 (not supported)
System stability: 2/5
  Many bugs in Linux version, but bugs seem to be limited to Paradyn GUI; DynInst seems very stable
Technical support: 4/5
  Helpful responses from our contact within 24 hours

References
[1] B. P. Miller et al., “The Paradyn parallel performance measurement tool,” IEEE Computer, November 1995, pp. 37-46.
[2] J. K. Hollingsworth et al., “MDL: A Language and Compiler for Dynamic Program Instrumentation,” IEEE PACT, 1997, p. 201.
[3] H. Cain, B. P. Miller, and B. J. N. Wylie, “A Callgraph-Based Search Strategy for Automated Performance Diagnosis,” European Conference on Parallel Computing (Euro-Par), Munich, Germany, August 2000, p. 108.