UPC/SHMEM Performance Analysis Tool
Project
Alan D. George, Hung-Hsun Su, Bryan Golden, Adam Leko
HCS Research Laboratory, University of Florida
2
Outline
Overview
PAT Project Status
Approach/Framework Detail
Collaborations
Benchmarking and Lessons Learned
Questions
3
Overview
Motivations
UPC programs do not yield the expected performance. Why? Due to the complexity of parallel computing, this is difficult to determine without tools for performance analysis. Discouraging for users, new and old; few options for shared-memory computing in the UPC and SHMEM communities.
Goals
Identify important performance “factors” in UPC computing. Research topics relating to performance analysis. Develop a framework for a performance analysis tool, as a new tool or as an extension/redesign of existing non-UPC tools. Design with both performance and user productivity in mind. Attract new UPC users and support improved performance.
PAT Project Status
5
Approach
Application Layer
Language Layer
Compiler Layer
Middleware Layer
Hardware Layer
Performance Analysis Tool
Measurable Factor List
Major Factor List
Minor Factor List
Tool Study
Experimental Study II
Literature Study / Intuition
Relevant Factor List
PAT
Survey of existing literature plus ideas
from past experience
Survey of existing tools with list of features as
end result
Experimental Study I
Preliminary experiments designed to test the validity
of each factor
Irrelevant Factor List
Additional experiments to understand the degree of
sensitivity, condition, etc. of each factor
Features from tool study plus analyses and factors from literature study that are
measurable
Factors shown not to be applicable to
performance
Collection of factors shown to have significant effect on program performance
Collection of factors shown NOT to have significant effect on program performance
Updated list including factors shown to have effect on program performance
Define layers to divide the workload. Conduct existing-tool study and
performance layers study in parallel to: Minimize development time. Maximize usefulness of PAT.
6
Framework (1)
11/29/04
All
Obtain Desirable Factor List
10/10/04 11/6/04
Hung-Hsun
Obtain Existing Tool Factor List
11/30/04 12/3/04
All
Filter Desirable Factor List
12/04/04 3/7/05
All
Design Test Plan and Start Testing
9/13/04 11/30/04
Hung-Hsun
Understand Tool Design
10/10/04 11/30/04
Hung-Hsun
Install and Experiment with Tools
10/4/04 11/29/04
Adam
Review Benchmarks
10/4/04 11/29/04
Bryan
Review HW/OS/System Metrics
10/4/04 11/29/04
Adam
Obtain Factors from Developers
11/29/04
All
Develop Factor Criteria
10/17/04 11/29/04
Bryan
L.S. (Literature Study) on Factor Classification and Determination
11/8/04 11/29/04
All
Survey #1
2/8/05 3/7/05
All
Survey #2
10/4/04 11/29/04
Adam
L.S. on Analytical Model
10/17/04
All
L.S. on Tool Framework/Approach
Black Box: Major tasks
Green Box: Research areas
Orange Box: Minor tasks
7
Framework (2)
12/04/04 3/7/05
All
Design Test Plan and Start Testing
2/8/05 3/7/05
All
Survey #2
3/8/05 3/11/05
All
Obtain Preliminary Important Factor List
2/8/05 3/8/05
All
Develop Usability Rating System
3/12/05 3/18/05
All
Tool and Layer Study Report
3/12/05
All
Usability Study Result
3/18/05 3/25/05
All
Prototype Design Plan
10/17/04 3/12/05
Hung-Hsun
L.S. on Presentation Methodology
10/17/04 3/12/05
Bryan
L.S. on General Usability Issues
10/17/04 3/25/05
Hung-Hsun
L.S. on Profiling/Tracing Methods
10/17/04 3/12/05
Adam
L.S. on Program/Compiler Optimization Techniques
10/17/04 3/25/05
Hung-Hsun
L.S. on Tool Design
10/17/04 3/12/05
Hung-Hsun
L.S. on Tool Evaluation Strategy
10/17/04 3/12/05
Bryan
Categorization of Platforms/Systems
All
Prototype Design
8
Profiling/Tracing Methods (1)
Event based: Hierarchical Layered Model
1. Monitoring: HW / SW / hybrid.
2. Event trace access. Objective: translate the bits and bytes of monitoring data (a sequence of event records) into an abstraction of an event trace (1 event). Hierarchical trace file format: event trace / trace segment / event record / record field. Event trace Description Language (EDL).
9
Profiling/Tracing Methods (2)
Hierarchical Layered Model (cont.)
3. Filtering: filtering and clustering of events. Distributed Program Monitor.
4. Tool support: functional modules independent of the semantics of the object system and analysis (graphical display, statistical functions).
5. Tool: validation of event traces, statistical evaluation, visualization, load balancing.
6. Application support: user interface.
Values needed for performance statistics (95%): frequency, distance in time, duration, count, value stored.
Flow of control and distributed data association with parallel programs. Static analysis techniques based on dataflow analysis.
10
Tool Evaluation Strategy Cost Functionality
Languages and libraries supported. Ways to present data. Usefulness of representation.
Source level information. Performance
Overhead introduced. Portability
Multiple platform support. Support for hybrid environments.
Robustness: coping with errors (display appropriate messages). Crash rate.
Scalability Ability to handle large number of
processes and long-running programs.
Trace file formats. Mechanism for turning tracing on and
off. Filtering and aggregation. Zooming.
Tracing / profiling Real-time vs. postmortem
Usability Ease of using the tool. Documentation quality. Data collection method
(automatic/manual instrumentation). Graphical representation quality. User interface type.
Versatility Provide performance data analysis in
different ways. Provide different views. Interoperability with other trace format
and tools.
11
Tool Design and Installation Current tool list
Paradyn + DynInst, TAU, Vampirtrace, PAPI, SvPablo, Dimemas, Kojak, Perfometer, CrayPat, MPE Logging and Jumpshot, Prism, AIMS, MPICL and Paragraph, DEEP/MPI
Understanding tool design: Review documents provided. Contact developer and ask specific questions. List of possible measurements from tool. Design methodology. Factor determination strategy. Willingness of developer to cooperate.
Install and experiment with tools Obtain and install tool on Kappa (Xeon cluster) / Lambda (Opteron cluster) / Marvel (AlphaServer). Experiment with tools on
C/MPI program. UPC program if possible.
Determine Measurements. Useful features. Ease of extension. Ease of installation.
12
Tool: Desired Information Name Developer Programming model support
(Functionality) Platform support (Portability) Version Website Contact Cost Features and modules Data collection method
Profiling / tracing Automatic / manual (Usability)
Documentation helpfulness (Usability)
Compatibility with other tools (Versatility)
Installation (Usability) Process Length Ease of installation
Execution Process Error handling capability
(Robustness) Representation
Easy to understand? (Usability)
Provide helpful information? (Functionality)
Have multiple views? (Versatility)
Provide source level information? (Functionality)
User interface: command line vs. GUI (Usability)
Tool overhead (Performance) Scalability estimation
Comments
13
Tool: TAU (1) Name: Tuning and Analysis Utilities (TAU) Developer: University of Oregon Programming model support (Functionality): MPI, OpenMP, SHMEM (work in progress), C, C++,
Fortran, POOMA, ACLMPL, A++/P++, PAWS, ACLVIS, MC++, Conejo, pC++, HPC++, Blitz++, Java, Python.
Platform support (Portability): SGI Origin 2000, SGI Power Challenge, Intel Teraflop Machine (ASCI Red), Sun Workstations, Linux PC, HP 9000 Workstations, DEC Alpha Workstations, Cray T3E.
Version: TAU 2.13.7, pdtoolkit 3.2.1 (program database toolkit, used by TAU to instrument programs).
Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/ Contact: Sameer Shende Cost: Free Features and modules: Barrier breakpoint (BREEZY), call graph browser, file and source code
browser, dynamic analysis tools (RACY), static analysis tools (FANCY), dynamic execution behavior viewer (EASY), performance extrapolation (SPEEDY).
Data collection method Profiling / tracing: Profiling + tracing, function statistics, HW counters (PAPI). Automatic / manual (Usability): Automatic (DynInst) + manual (insert profiling API calls).
Documentation helpfulness (Usability): Fairly good, was able to get support from the developer fairly soon (but maybe because of his interest in SHMEM).
Compatibility with other tools (Versatility): Uses PAPI. Can visualize using Vampir. Installation (Usability)
Process: Fairly straightforward after reading the user guide, but might have trouble figuring out which options are needed.
Length: Few hours with help from support. Ease of installation: Fairly easy process.
14
Tool: TAU (2) Execution
Process Manual
1. Insert API calls in source file.
2. Generate instrumented C file.
3. Compile instrumented C file with regular compiler.
4. Execute instrumented program.
5. View with paraprof or Vampir.
Automatic: Pending. Error handling capability (Robustness): Pending. Representation: Pending.
Easy to understand? (Usability) Provide helpful information? (Functionality) Have multiple views? (Versatility) Provide source level information? (Functionality)
User interface: command line vs. GUI (Usability): command line and GUI. Tool overhead (Performance): Design document states the overhead is ~5% (need to
verify). Scalability estimation: Pending.
Comments Might be possible to instrument UPC program with intermediate C file (work in progress). Contact has shown interest in our project, has asked if we are interested in adding support
for GPSHMEM (implement wrappers). Might be worth doing to get a deeper understanding of the system.
New factors are added based solely on user request.
[Diagram] TAU instrumentation flow: source C file → cparse → pdb file → tau_instrumentor → instrumented C file → native C compiler → executable → execute → profiling/tracing files 1..N → paraprof (GUI).
15
Tool: Paradyn (1) Name: Paradyn Developer: University of Wisconsin-Madison Programming model support (Functionality): UNIX IPC, Solaris 2 thread and
synchronization primitives, CM-5 CMMD, CM Fortran CM-RTS, PVM, MPI. Platform support (Portability): Linux PC (2.4 & 2.6), Windows PC (2000 & XP), AIX
5.1, TMC CM-5, SPARCstation (Solaris 8 & 9), HP PA-RISC (HP/UX). Version: Paradyn 4.1.1, DynInst 4.1.1, KernInst 2.0.1, MRNet 1.0 Website: http://www.paradyn.org/index.html Contact: Matthew Legendre Cost: Free Feature and modules: Dynamic Instrumentation (during execution), performance
consultant (W3 Search Model), visualization manager, data manager, user interface manager, metric-focus grids (performance metrics/individual program components), time histograms (record behavior of a metric as it varies with time).
Data collection method Profiling / tracing: Tracing. Automatic / manual (Usability): Automatic.
Documentation helpfulness (Usability): Pending. Compatibility with other tools (Versatility): Pending. Installation (Usability)
Process: Need to install Tcl/Tk first. Length: Pending. Ease of installation: Pending.
16
Tool: Paradyn (2) Execution: Pending. Comments
W3 search model seems very useful in identifying bottlenecks. Uses a configuration language (Paradyn Configuration Language, PCL). Provides 6 primitives: set counter, add to counter, subtract from counter, set timer, start timer, stop timer.
17
Presentation Methodology (1) Multiple views
Observation levels. Perspectives. Alternative views.
Semantic context User provides context. Program control, data abstractions, programming model, system
mapping, runtime execution, machine architecture, computation behaviors.
Addresses specific problems users encounter while following general principles and guidelines for good visualization design.
User interaction Allow user to find the best visualization scenario.
Modular, dataflow environment. Data spreadsheet model. Graphical object programming system.
18
Presentation Methodology (2) Visualization techniques (improve scalability)
Adaptive graphical representation: max detail; prevent complexity from interfering with perception. Small data size → (discrete, detail); large data size → (continuous, complexity-reducing).
Reduction and filtering
Traditional: Sum, max, min, etc.. Graphical: Through graphical representation.
Spatial arrangement: produce scalable visualizations by arranging graphical elements so that as the dataset scales, display size and complexity increase at a slower rate. Shape construction: defines properties of a 3D structure by characteristics of
performance data. Generalized scrolling
Use of variety of methods to present a continuous, localized view of a larger mass of information.
Spatial (zoom in on local regions) and temporal (different animation width) scrolling.
19
Performance Tool Usability: Issues Problem
Difficult problem for tool developers: inherently unstable execution environment; monitoring behavior may disturb original behavior; short lifetime of parallel computers.
Users: tools too difficult to use; too complex; unsuitable for size and structure of real-world applications; skeptical about value of tools; tools may be misapplied.
20
Performance Tool Usability: Discussion (1) What the user seeks (conceptual framework)
Identification: Is there a problem? What are the symptoms? How serious are they?
Localization: At what point in execution is performance degrading? What is causing the problem?
Repair: What must be changed to fix the problem?
Verification: Did the “fix” improve performance? If not, try something else.
Validation: Is there still a performance problem? If so, go back to identification.
21
Performance Tool Usability: Discussion (2) How the user seeks it (task structure)
Problem stabilization This task accomplishes identification, verification, and validation. Timing information gathered for multiple runs. Timing data compared for consistency. Correlation to anticipated time.
Search space reduction Accomplishes localization. User makes educated guess about cause of behavior. Based on intuition and previous experience. User tests hypothesis with hand-coded instrumentation.
Selective modification Accomplishes sub goal of repair. Manual inspection of one potential problem location at a time. Modification of potential problem areas, then verification. Once satisfied, move to the next location.
22
Performance Tool Usability: Implications and Key Insights Implications for performance tools and solutions
Many tools begin by presenting windows with detailed info on a performance metric.
Users prefer a broader perspective on application behavior. Tools provide multiple views of program behavior.
Good in general. Need support for comparing different metrics.
Ex: if CPU utilization drops in the same place the L1 cache miss rate rises. PAPI provides this functionality.
To be useful, essential to provide localization relative to source code.
Current tools are lacking in the area of selective modification. Group common blocks together. Suggest improvements. User doesn’t want info that can’t be used to fix the code.
Key insights User acceptance will not increase until tools are usable in real world
environment. Identify successful user strategies for real applications. Devise ways to apply strategies to tools in an intuitive manner. Use this functionality in the development of new tools.
23
Hardware Counters Hardware counter software support
Performance Counter Library (PCL) Primarily for processors used in clusters.
Pentium on Linux 2.x.x AMD Athlon Etc.
Requires kernel patch for Linux. Provides common interface to access hardware counters. Counting may be done in either system or user mode.
Performance Application Programming Interface (PAPI) New version 3.0.6 released 10-20-2004. De facto tool for accessing performance monitoring counters. Support for many architectures.
24
Survey #1 (1) Survey #1
Use the list from tools/brainstorming + some general questions. Attempt to gather more factors. Current list of factors
Application layer Variable access pattern / memory placement schema
Remote access (count / time) Local access count (count / time)
System size (optimal size, scalability) Time (load balancing, etc.)
Total Computation Communication (transfer / synchronization / rate) I/O Function (interval / count / etc.)
Language layer Construct performance
Compiler layer Compiler-added overhead
25
Survey #1 (2) Current list of factors (Cont.)
Middleware layer Synchronization time (HW support, etc.) Runtime system overhead OS overhead Thread management I/O Communication (latency / throughput)
Hardware layer CPU utilization FLOPS Branch prediction Cache (size / miss rate / penalty) Memory (size / latency / throughput) Network (latency / throughput / overhead / congestion / topology) I/O (latency / throughput / rate)
Sample questions Are you currently using any performance analysis tools? If so, which ones? Why? What features do you think are most important in deciding which tool to use? What kind of information is most important for you when determining performance
bottlenecks? Which platforms do you target most? Which compilers? From past experience, what coding practices typically lead to most of the performance
bottlenecks (for example: bad data affinity to node)?
26
Survey #2 and Benchmark Review Survey #2
User rating of preliminary factors. Help determine importance of factors from user’s perspective. Good rating system needs to be established.
Benchmark review Attempt to extrapolate performance factors from existing benchmarks. Currently reviewing GWU benchmarks
N-Queens Performs well on clusters and shared-memory machines.
Sobel edge detection Currently performs poorly on clusters with Berkeley compiler.
Berkeley compiler more sensitive to compiler “tricks”. Common optimizations
“Prefetch” read-only data all threads access frequently. Cast pointer-to-shared with affinity to current thread to private pointers.
These optimizations fall under the category of language-layer performance factors. GWU benchmarks are too simplistic, need more complex benchmarks to identify
performance factors. Future work: review NAS benchmarks
FFT, integer sort, etc.. NAS benchmarks should result in more performance factors being obtained.
27
Performance Models (1) Purpose: to identify and characterize impacts of possible performance factors by incorporating their effects into a performance model. Existing models
“Classic” models: LogP and variations, PRAM, BSP. Models of this type are loose abstractions of real machines (limited fidelity). Possible to tune these types of models to existing network/CPU hardware, but it remains to be seen whether a connection to an algorithm or program can be made. PRAM is better suited to model generic algorithms, but the model is missing too much information to transfer to real-world performance. Other models
P3T automatic Fortran “parallelizer” Driven by performance model with coarse-grained characteristics. Captures program performance using profiler, maps out parallel version based on prediction
from model. Successfully applied to simple scientific computation kernels.
Resource OCCupancy model (ROCC) Predict overhead of measurement/instrumentation systems. Model was calibrated to Paradyn with success.
28
Performance Models (2) UPC-specific models
Zhang Zhang (MTU) has created an initial performance model for evaluating UPC code (PhD thesis, on-going work).
Preliminary work uses a simplistic remote reference/local reference count and a cost for each (in terms of memory bandwidth), with ~5% to ~30% error.
Future work Ideal case: leverage Zhang Zhang’s model. Possible: create a performance model for the specific purpose of
identifying performance factors (more research needed). Coarse-grained simulation is another possible strategy to pursue. Research other models used to predict program behavior under different
types of execution environments. Parallel/MPI performance prediction models. Shared-memory performance prediction models.
29
Factor Information Name Relevancy
Cite literature findings. Provide test programs and/or cases.
Measurement How? Existing tools measuring it.
Experiment How? Sensitivity.
Attributes HW/SW? Architecture specific? Language specific? Compiler specific? Layer. Relation with other factors.
User rating Tracing / profiling Comment, other useful information Presentation strategy
30
Other Topics
Algorithm analysis Compiler optimizations Factor classification and determination Performance analysis strategy Tool design: production / theoretical Tool framework/approach: theoretical and
experimental
Collaborations
32
Collaborations (1) Cray
Contact: David Greene Status: No response yet.
GWU Contact: Graduate students Status: Willing to help, no involvement yet.
HP Contact: Brian Wibecan (UPC), Bruce Trull (SHMEM) Status:
Brian: Asked several questions. Shown interest in continuing their previous work with Vampir. Currently busy, will respond soon.
Bruce: Asked about native SHMEM for Marvel. Will respond when he finds something. Doesn’t seem likely SHMEM exists for Tru64.
IBM Contact: Raymond Mak, Kit Barton Status: No response yet.
Intrepid Contact: Gary Funck / Nenad Vukicevic Status: Helpful in providing possible factors and documentation/hints on the GCC-UPC compiler.
33
Collaborations (2) MTU
Contact: Zhang Zhang, Phil Merkey Status:
Received Zhang’s dissertation (performance model and possible factors). Haven’t heard back from Phil yet.
PModel Contact: Ricky Kendall Status: In contact with, involvement pending.
Sandia Contact: Zhaofang Wen, Rob Brightwell Status: Shown interest in collaboration, details need to be determined.
Sun Contact: Yuan Lin Status: Will contact us when compiler becomes publicly available.
UC Berkeley Contact: Dan Bonachea Status: In constant contact.
Benchmarking and Lessons Learned
35
UPC Benchmarking: Overview Goals
Build programming skills in UPC/SHMEM. Become familiar with all aspects of languages on various platforms and compilers.
Help identify desirable metrics through performance optimization. Perform performance analysis from the user's perspective.
Testbed Xeon cluster (Kappa)
Nodes: Dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset.
SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus.
MPI: MPICH 1.2.5. RedHat 9.0 with gcc compiler V 3.3.2, Berkeley UPC runtime system 2.0.
Opteron cluster (Lambda) Nodes: Dual AMD Opteron 240, 1GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S
server motherboard. InfiniBand: 4x (10Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, 24-port
switch from Voltaire. MPI: MPI package provided by Voltaire. SuSE 9.0 with gcc compiler V 3.3.3, Berkeley UPC runtime system 2.0.
ES80 AlphaServer (Marvel) Four 1GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor
connections. Tru64 5.1B Unix, HP UPC V2.3-test compiler.
36
UPC Benchmarking: Depth First Search Description
Tree represented with an array. Search starts with the root node; if a match is found, stop, else try matching all children. Complexity O(N) sequential, O(log N) parallel. Depending on the match process, could be either computation intensive or communication intensive.
Metrics used
Total time Function time
Initialization Search
Conclusion Dynamic programming impossible?
UPC version is different from the sequential implementation. Less intuitive to program in UPC.
Xeon cluster with SCI: computation-to-communication ratio important. for() { if (MYTHREAD == …) … } vs. upc_forall():
1 – 2 nodes: comparable, sometimes faster, sometimes slower. 4 nodes: faster (more parallel speedup).
Exact same code does not work on Opteron cluster with InfiniBand
[Figure] Example tree layout: array indices 0..6 map to a binary tree with root 0, children 1-2, grandchildren 3-6 (node i has children 2i+1 and 2i+2).
37
UPC Benchmarking: Concurrent Wave Equation Description
A vibrating string is decomposed into points. In the parallel version, each processor is responsible for updating the amplitude of N/P points. Each iteration, each processor exchanges boundary points with its nearest neighbors. Coarse-grained communication. Algorithm complexity of O(N).
Versions Sequential C
Baseline version. Hand optimized version.
Removed redundant calculations. Used global variables. Used temp variables to store intermediate results.
UPC: based on the modified sequential version. 32/64-bit version: Xeon / Opteron. Each processor gets a contiguous range of N/P points. Communication only on local endpoints.
38
Program Analysis: Results
[Chart] UPC Concurrent Wave Equation Results: execution time (sec, 0 to 1.4) vs. number of points (1E6, 0.5 to 3). Series: Xeon-sequential, Xeon-sequential mod, Xeon-upc-1, Xeon-upc-2, Xeon-upc-4, Opteron-sequential, Opteron-sequential mod, Opteron upc-1, Opteron upc-2, Opteron upc-4.
39
UPC Benchmarking: Concurrent Wave Equation Conclusion
Sequential C Modified version was 30% faster than baseline for Xeon, but
only 17% faster for Opteron. Opteron and Xeon sequential unmodified code have nearly
identical execution times. UPC
Near linear speedup. Fairly straightforward to port from sequential code. upc_forall loop ran faster with ‘array+j’ as affinity expression
than with ‘&(array[j])’.
40
UPC Benchmarking: Bench 9 (Mod 2^N Inverse)
Description
Given: list A of size listsize (64-bit integers); values range from 0 to 2^j − 1. Compute
list B, where B_i = A_i “right justified”. list C, such that (B_i * C_i) mod 2^j = 1 (iterative algorithm). Computation is embarrassingly parallel.
Check section (gather): first node checks all values to verify (B_i * C_i) mod 2^j = 1. Benchmark text: “Output selected values from A, B, and C”. Very communication intensive!
Benchmark parameters: listsize = 5 × 10^6,
j = 48 Optimizations: UPC version used all combinations of the following
Casting shared pointers with affinity to local thread to private pointers (cast). Explicitly stating work distribution via for loops instead of using upc_forall (for). Copying B & C arrays to main thread before check (get).
MPI, UPC, and SHMEM versions used same algorithm MPI + SHMEM equivalent to get UPC optimization.
41
Programming Practices: Bench 9, Effect of Optimizations (AlphaServer)
[Chart] Bench 9 Optimizations - AlphaServer: time (seconds, 0 to 6) for each optimization (forall, forall cast, for, for cast, get forall, get forall cast, get for, get for cast). Series: sequential, cc; sequential, upc; upc, 1 thread; upc, 4 threads.
42
Programming Practices: Bench 9, Effect of Optimizations (Opteron Cluster)
[Chart] Bench 9 Optimizations - Opteron Cluster: time (seconds, 0 to 12) for each optimization (forall, forall cast, for, for cast, get forall, get forall cast, get for, get for cast). Series: sequential, gcc; sequential, upcc; upc, 1 thread; upc, 4 threads.
43
Programming Practices: Bench 9, Performance (AlphaServer)
[Chart] Bench 9 - AlphaServer: time (seconds, 0 to 3) vs. number of threads (1 to 4). Series: upc (get for cast); gpshmem; mpi.
44
Programming Practices: Bench 9, Performance (Opteron Cluster)
[Chart] Bench 9 - Opteron Cluster: time (seconds, 0 to 6) vs. number of threads (1 to 16). Series: gpshmem; mpi; bupc-vapi (get for cast, BS=100); bupc-vapi (get for cast, BS=MAX).
45
UPC Benchmarking: Bench 9, Conclusions
Overall performance AlphaServer: Relatively good, UPC, SHMEM, and MPI comparable performance. Opteron
Bad performance; larger list sizes result in worse performance. MPI gave best results (more mature compiler), although code was much harder to write. GPSHMEM over MPI does not give good performance vs. plain MPI, although it is much easier to write with one-sided communication patterns. MPI could use asynchronous calls to hide latency on check, although communication time dominates by a large factor.
Performance of code can be closely tied to “tricks” used on compilers. AlphaServer: about 33% performance difference depending on which optimizations are used. Opteron: with the Berkeley compiler, sometimes over 2x performance lost if no optimizations are used.
UPC compiler overhead: compiling sequential code through the UPC compiler instead of cc on Marvel results in a 45% performance hit. Overhead in Berkeley compiler significant if no casting is used. for vs. upc_forall in Berkeley compiler:
The implementation of upc_forall translates this code:
  upc_forall (i = 0; i < NUM; i++; &A[i]) { A[i] = rand(); }
to:
  for (i = 0; i < NUM; i++) { if (upc_threadof(&A[i]) == MYTHREAD) { A[i] = rand(); } }
The extra instructions slow execution (branch prediction, context switches, etc.). Reverse on AlphaServer: upc_forall better than for.
46
UPC Benchmarking: Convolution Description
Compute convolution of two sequences. “Classic” definition of convolution (not image processing definition). Embarrassingly parallel (gives an idea of language overhead). O(n2) algorithm complexity.
Benchmark parameters Sequence sizes: 100,000 elements. Data types: 32-bit integer, double-precision floating point.
Optimizations: UPC version used all combinations of the following Casting shared pointers with affinity to local thread to private pointers (cast). Explicitly stating work distribution via for loops instead of using upc_forall (for).
Incorporates casting shared pointers with affinity to local thread to private pointers (previously cast).
Copying X & H arrays to main thread before check (get). MPI, UPC, and SHMEM versions used same algorithm
C[n] = Σ_k X[k] * H[n − k]
47
Programming Practices: Convolution, Effect of Optimizations (AlphaServer)
[Chart] Integer Convolution Optimizations - AlphaServer: time (seconds, 0 to 14) for each optimization (naïve forall, naïve for, get forall, get for). Series: sequential; upc, 1 thread; upc, 4 threads.
48
Programming Practices: Convolution, Effect of Optimizations (Opteron Cluster)
[Chart] Integer Convolution Optimizations - Opteron Cluster: time (seconds, 0 to 80) for each optimization (naïve forall, naïve for, get forall, get for). Series: sequential; upc, 1 thread; upc, 4 threads.
49
Programming Practices: Convolution, Performance (AlphaServer)
[Chart] Integer Convolution - AlphaServer: time (seconds, 0 to 60) vs. number of threads (1 to 4). Series: upc (get for); gpshmem; mpi.
50
Programming Practices: Convolution, Performance (Opteron Cluster)
[Chart] Integer Convolution - Opteron Cluster: time (seconds, 0 to 60) vs. number of threads (1 to 16). Series: bupc-vapi (get for); gpshmem; mpi.
51
Programming Practices: Convolution, Performance (AlphaServer)
[Chart] Double Precision Floating Point Convolution - AlphaServer: time (seconds, 0 to 120) vs. number of threads (1 to 4). Series: upc (get for); gpshmem; mpi.
52
Programming Practices: Convolution, Performance (Opteron Cluster)
[Chart] Double Precision Floating Point Convolution - Opteron Cluster: time (seconds, 0 to 140) vs. number of threads (1 to 16). Series: bupc-vapi (get for); gpshmem; mpi.
53
UPC Benchmarking: Convolution, Conclusions
Optimizations Similar results to NSA Bench 9
Naïve version (treat shared A and B variables as though they were local) has abysmal performance on both Marvel and Lambda.
Less compiler overhead Almost no difference between sequential versions compiled with gcc/cc and sequential versions compiled
with UPC compilers. Benchmark less memory intensive: UPC compilers compiling variables as volatile? Performance of double-precision floating point actually improved by HP UPC compiler on AlphaServer.
Same flags (-O2) passed to both upc and cc compilers, perhaps HP UPC compiler does a better job scheduling the floating-point units?
Overall language overhead MPI was most work to code, but had least amount of overhead. SHMEM slightly easier because of one-sided communication functions. UPC easiest (conversion of sequential code very easy), but has potentially worse performance.
Overall language performance overhead On AlphaServer:
MPI had most runtime overhead in most cases. GPSHMEM performed surprisingly well.
Favorable to UPC and MPI with integer convolution. Slightly better than MPI for double-precision floating point convolution.
On Opteron: Runtime overhead MPI (least) < SHMEM < Berkeley UPC.
54
UPC Benchmarking: Conclusion
Benchmarking effort accomplished what we intended: gained practical experience in UPC/SHMEM/MPI; have a better understanding of programming styles between different languages; found factors/metrics that are useful for performance analysis; found differences in compiler efficiencies; created new types of UPC benchmarks that can be used later to validate factors.
Benchmarking process near completion Plan to develop a few more programs. Already accomplished most of what we set out to do.
55
Questions
Target platforms / architectures in mind?
Others that should be contacted.