UPC/SHMEM Performance Analysis Tool
Project
Alan D. George, Hung-Hsun Su, Bryan Golden, Adam Leko
HCS Research Laboratory, University of Florida
2
Outline
Overview
PAT Project Status
Approach/Framework Detail
Collaborations
Benchmarking and Lessons Learned
Questions
3
Overview
Motivations
UPC programs do not yield the expected performance. Why? Due to the complexity of parallel computing, this is difficult to determine without tools for performance analysis. Discouraging for users, new and old; few options for shared-memory computing in the UPC and SHMEM communities.
Goals
Identify important performance “factors” in UPC computing. Research topics relating to performance analysis. Develop a framework for a performance analysis tool, as a new tool or as an extension/redesign of existing non-UPC tools. Design with both performance and user productivity in mind. Attract new UPC users and support improved performance.
PAT Project Status
5
Approach
Application Layer
Language Layer
Compiler Layer
Middleware Layer
Hardware Layer
Performance Analysis Tool
Measurable Factor List
Major Factor List
Minor Factor List
Tool Study
Experimental Study II
Literature Study / Intuition
Relevant Factor List
PAT
Survey of existing literature plus ideas
from past experience
Survey of existing tools with list of features as
end result
Experimental Study I
Preliminary experiments designed to test the validity
of each factor
Irrelevant Factor List
Additional experiments to understand the degree of
sensitivity, condition, etc. of each factor
Features from tool study plus analyses and factors from literature study that are
measurable
Factors shown not to be applicable to
performance
Collection of factors shown to have significant effect on program performance
Collection of factors shown NOT to have significant effect on program performance
Updated list including factors shown to have effect on program performance
Define layers to divide the workload. Conduct existing-tool study and
performance layers study in parallel to: Minimize development time. Maximize usefulness of PAT.
6
Framework (1)
11/29/04
All
Obtain Desirable Factor List
10/10/04 11/6/04
Hung-Hsun
Obtain Existing Tool Factor List
11/30/04 12/3/04
All
Filter Desirable Factor List
12/04/04 3/7/05
All
Design Test Plan and Start Testing
9/13/04 11/30/04
Hung-Hsun
Understand Tool Design
10/10/04 11/30/04
Hung-Hsun
Install and Experiment with Tools
10/4/04 11/29/04
Adam
Review Benchmarks
10/4/04 11/29/04
Bryan
Review HW/OS/System Metrics
10/4/04 11/29/04
Adam
Obtain Factors from Developers
11/29/04
All
Develop Factor Criteria
10/17/04 11/29/04
Bryan
L.S. (Literature Study) on Factor Classification and Determination
11/8/04 11/29/04
All
Survey #1
2/8/05 3/7/05
All
Survey #2
10/4/04 11/29/04
Adam
L.S. on Analytical Model
10/17/04
All
L.S. on Tool Framework/Approach
Black Box: Major tasks
Green Box: Research areas
Orange Box: Minor tasks
7
Framework (2)
12/04/04 3/7/05
All
Design Test Plan and Start Testing
2/8/05 3/7/05
All
Survey #2
3/8/05 3/11/05
All
Obtain Preliminary Important Factor List
2/8/05 3/8/05
All
Develop Usability Rating System
3/12/05 3/18/05
All
Tool and Layer Study Report
3/12/05
All
Usability Study Result
3/18/05 3/25/05
All
Prototype Design Plan
10/17/04 3/12/05
Hung-Hsun
L.S. on Presentation Methodology
10/17/04 3/12/05
Bryan
L.S. on General Usability Issues
10/17/04 3/25/05
Hung-Hsun
L.S. on Profiling/Tracing Methods
10/17/04 3/12/05
Adam
L.S. on Program/Compiler Optimization Techniques
10/17/04 3/25/05
Hung-Hsun
L.S. on Tool Design
10/17/04 3/12/05
Hung-Hsun
L.S. on Tool Evaluation Strategy
10/17/04 3/12/05
Bryan
Categorization of Platforms/Systems
All
Prototype Design
8
Profiling/Tracing Methods (1)
Event based: Hierarchical Layered Model
1. Monitoring: HW / SW / hybrid.
2. Event trace access. Objective: translate the bits and bytes of monitoring data (a sequence of event records) into an abstraction of an event trace (1 event). Hierarchical trace file format: event trace / trace segment / event record / record field. Event trace Description Language (EDL).
9
Profiling/Tracing Methods (2)
Hierarchical Layered Model (cont.)
3. Filtering: filtering and clustering of events. Distributed Program Monitor.
4. Tool support: functional modules independent of the semantics of the object system and analysis (graphical display, statistical functions).
5. Tool: validation of event traces, statistical evaluation, visualization, load balancing.
6. Application support: user interface.
Values needed for performance statistics (95%): frequency, distance in time, duration, count, value stored.
Flow of control and distributed data association with parallel programs. Static analysis techniques based on dataflow analysis.
10
Tool Evaluation Strategy Cost Functionality
Languages and libraries supported. Ways to present data. Usefulness of representation.
Source level information. Performance
Overhead introduced. Portability
Multiple platform support. Support for hybrid environments.
Robustness: coping with errors (display appropriate messages). Crash rate.
Scalability Ability to handle large number of
processes and long-running programs.
Trace file formats. Mechanism for turning tracing on and
off. Filtering and aggregation. Zooming.
Tracing / profiling Real-time vs. postmortem
Usability Ease of using the tool. Documentation quality. Data collection method
(automatic/manual instrumentation). Graphical representation quality. User interface type.
Versatility Provide performance data analysis in
different ways. Provide different views. Interoperability with other trace format
and tools.
11
Tool Design and Installation Current tool list
Paradyn + DynInst, TAU, Vampirtrace, PAPI, SvPablo, Dimemas, Kojak, Perfometer, CrayPat, MPE Logging and Jumpshot, Prism, AIMS, MPICL and Paragraph, DEEP/MPI
Understanding tool design: Review documents provided. Contact developer and ask specific questions. List of possible measurements from tool. Design methodology. Factor determination strategy. Willingness of developer to cooperate.
Install and experiment with tools Obtain and install tool on Kappa (Xeon cluster) / Lambda (Opteron cluster) / Marvel (AlphaServer). Experiment with tools on
C/MPI program. UPC program if possible.
Determine Measurements. Useful features. Ease of extension. Ease of installation.
12
Tool: Desired Information Name Developer Programming model support
(Functionality) Platform support (Portability) Version Website Contact Cost Features and modules Data collection method
Profiling / tracing Automatic / manual (Usability)
Documentation helpfulness (Usability)
Compatibility with other tools (Versatility)
Installation (Usability) Process Length Ease of installation
Execution Process Error handling capability
(Robustness) Representation
Easy to understand? (Usability)
Provide helpful information? (Functionality)
Have multiple views? (Versatility)
Provide source level information? (Functionality)
User interface: command line vs. GUI (Usability)
Tool overhead (Performance) Scalability estimation
Comments
13
Tool: TAU (1) Name: Tuning and Analysis Utilities (TAU) Developer: University of Oregon Programming model support (Functionality): MPI, OpenMP, SHMEM (work in progress), C, C++,
Fortran, POOMA, ACLMPL, A++/P++, PAWS, ACLVIS, MC++, Conejo, pC++, HPC++, Blitz++, Java, Python.
Platform support (Portability): SGI Origin 2000, SGI Power Challenge, Intel Teraflop Machine (ASCI Red), Sun Workstations, Linux PC, HP 9000 Workstations, DEC Alpha Workstations, Cray T3E.
Version: TAU 2.13.7, pdtoolkit 3.2.1 (program database toolkit, used by TAU to instrument programs).
Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/ Contact: Sameer Shende Cost: Free Features and modules: Barrier breakpoint (BREEZY), call graph browser, file and source code
browser, dynamic analysis tools (RACY), static analysis tools (FANCY), dynamic execution behavior viewer (EASY), performance extrapolation (SPEEDY).
Data collection method Profiling / tracing: Profiling + tracing, function statistics, HW counters (PAPI). Automatic / manual (Usability): Automatic (DynInst) + manual (insert profiling API calls).
Documentation helpfulness (Usability): Fairly good, was able to get support from the developer fairly soon (but maybe because of his interest in SHMEM).
Compatibility with other tools (Versatility): Uses PAPI. Can visualize using Vampir. Installation (Usability)
Process: Fairly straightforward after reading the user guide, but might have trouble figuring out which options are needed.
Length: Few hours with help from support. Ease of installation: Fairly easy process.
14
Tool: TAU (2) Execution
Process Manual
1. Insert API calls in source file.
2. Generate instrumented C file.
3. Compile instrumented C file with regular compiler.
4. Execute instrumented program.
5. View with paraprof or Vampir.
Automatic: Pending. Error handling capability (Robustness): Pending. Representation: Pending.
Easy to understand? (Usability) Provide helpful information? (Functionality) Have multiple views? (Versatility) Provide source level information? (Functionality)
User interface: command line vs. GUI (Usability): command line and GUI. Tool overhead (Performance): Design document states the overhead is ~5% (need to
verify). Scalability estimation: Pending.
Comments Might be possible to instrument UPC program with intermediate C file (work in progress). Contact has shown interest in our project, has asked if we are interested in adding support
for GPSHMEM (implement wrappers). Might be worth doing to get a deeper understanding of the system.
New factors are added based solely on user request.
[Diagram] TAU instrumentation flow: source C file → cparse → pdb file → tau_instrumentor → instrumented C file → native C compiler → executable → execute → profiling/tracing files 1..N → paraprof (GUI).
15
Tool: Paradyn (1) Name: Paradyn Developer: University of Wisconsin-Madison Programming model support (Functionality): UNIX IPC, Solaris 2 thread and
synchronization primitives, CM-5 CMMD, CM Fortran CM-RTS, PVM, MPI. Platform support (Portability): Linux PC (2.4 & 2.6), Windows PC (2000 & XP), AIX
5.1, TMC CM-5, SPARCstation (Solaris 8 & 9), HP PA-RISC (HP/UX). Version: Paradyn 4.1.1, DynInst 4.1.1, KernInst 2.0.1, MRNet 1.0 Website: http://www.paradyn.org/index.html Contact: Matthew Legendre Cost: Free Feature and modules: Dynamic Instrumentation (during execution), performance
consultant (W3 Search Model), visualization manager, data manager, user interface manager, metric-focus grids (performance metrics/individual program components), time histograms (record behavior of a metric as it varies with time).
Data collection method Profiling / tracing: Tracing. Automatic / manual (Usability): Automatic.
Documentation helpfulness (Usability): Pending. Compatibility with other tools (Versatility): Pending. Installation (Usability)
Process: Need to install Tcl/Tk first. Length: Pending. Ease of installation: Pending.
16
Tool: Paradyn (2) Execution: Pending. Comments
W3 search model seems very useful in identifying bottlenecks. Uses a configuration language (Paradyn Configuration Language, PCL). Provides 6 primitives: set counter, add to counter, subtract from counter, set timer, start timer, stop timer.
17
Presentation Methodology (1) Multiple views
Observation levels. Perspectives. Alternative views.
Semantic context User provides context. Program control, data abstractions, programming model, system
mapping, runtime execution, machine architecture, computation behaviors.
Addresses specific problems users encounter while following general principles and guidelines for good visualization design.
User interaction Allow user to find the best visualization scenario.
Modular, dataflow environment. Data spreadsheet model. Graphical object programming system.
18
Presentation Methodology (2) Visualization techniques (improve scalability)
Adaptive graphical representation: max detail; prevent complexity from interfering with perception. Small data size → (discrete, detail); large data size → (continuous, complexity-reducing).
Reduction and filtering
Traditional: Sum, max, min, etc.. Graphical: Through graphical representation.
Spatial arrangement: produce scalable visualizations by arranging graphical elements so that as the dataset scales, display size and complexity increase at a slower rate. Shape construction: defines properties of a 3D structure by characteristics of
performance data. Generalized scrolling
Use of variety of methods to present a continuous, localized view of a larger mass of information.
Spatial (zoom in on local regions) and temporal (different animation width) scrolling.
19
Performance Tool Usability: Issues Problem
Difficult problem for tool developers: inherently unstable execution environment; monitoring behavior may disturb original behavior; short lifetime of parallel computers.
Users: tools too difficult to use; too complex; unsuitable for size and structure of real-world applications; skeptical about value of tools; tools may be misapplied.
20
Performance Tool Usability: Discussion (1) What the user seeks (conceptual framework)
Identification: Is there a problem? What are the symptoms? How serious are they?
Localization: At what point in execution is performance degrading? What is causing the problem?
Repair: What must be changed to fix the problem?
Verification: Did the “fix” improve performance? If not, try something else.
Validation: Is there still a performance problem? If so, go back to identification.
21
Performance Tool Usability: Discussion (2) How the user seeks it (task structure)
Problem stabilization This task accomplishes identification, verification, and validation. Timing information gathered for multiple runs. Timing data compared for consistency. Correlation to anticipated time.
Search space reduction Accomplishes localization. User makes educated guess about cause of behavior. Based on intuition and previous experience. User tests hypothesis with hand-coded instrumentation.
Selective modification Accomplishes sub goal of repair. Manual inspection of one potential problem location at a time. Modification of potential problem areas, then verification. Once satisfied, move to the next location.
22
Performance Tool Usability: Implications and Key Insights Implications for performance tools and solutions
Many tools begin by presenting windows with detailed info on a performance metric.
Users prefer a broader perspective on application behavior. Tools provide multiple views of program behavior.
Good in general. Need support for comparing different metrics.
Ex: if CPU utilization drops in the same place the L1 cache miss rate rises. PAPI provides this functionality.
To be useful, essential to provide localization relative to source code.
Current tools are lacking in the area of selective modification. Group common blocks together. Suggest improvements. User doesn’t want info that can’t be used to fix the code.
Key insights User acceptance will not increase until tools are usable in real world
environment. Identify successful user strategies for real applications. Devise ways to apply strategies to tools in an intuitive manner. Use this functionality in the development of new tools.
23
Hardware Counters Hardware counter software support
Performance Counter Library (PCL) Primarily for processors used in clusters.
Pentium on Linux 2.x.x AMD Athlon Etc.
Requires kernel patch for Linux. Provides common interface to access hardware counters. Counting may be done in either system or user mode.
Performance Application Programming Interface (PAPI) New version 3.0.6 released 10-20-2004. De facto tool for accessing performance monitoring counters. Support for many architectures.
24
Survey #1 (1) Survey #1
Use the list from tools/brainstorming + some general questions. Attempt to gather more factors. Current list of factors
Application layer Variable access pattern / memory placement schema
Remote access (count / time) Local access count (count / time)
System size (optimal size, scalability) Time (load balancing, etc.)
Total Computation Communication (transfer / synchronization / rate) I/O Function (interval / count / etc.)
Language layer Construct performance
Compiler layer Compiler-added overhead
25
Survey #1 (2) Current list of factors (Cont.)
Middleware layer Synchronization time (HW support, etc.) Runtime system overhead OS overhead Thread management I/O Communication (latency / throughput)
Hardware layer CPU utilization FLOPS Branch prediction Cache (size / miss rate / penalty) Memory (size / latency / throughput) Network (latency / throughput / overhead / congestion / topology) I/O (latency / throughput / rate)
Sample questions Are you currently using any performance analysis tools? If so, which ones? Why? What features do you think are most important in deciding which tool to use? What kind of information is most important for you when determining performance
bottlenecks? Which platforms do you target most? Which compilers? From past experience, what coding practices typically lead to most of the performance
bottlenecks (for example: bad data affinity to node)?
26
Survey #2 and Benchmark Review Survey #2
User rating of preliminary factors. Help determine importance of factors from user’s perspective. Good rating system needs to be established.
Benchmark review Attempt to extrapolate performance factors from existing benchmarks. Currently reviewing GWU benchmarks
N-Queens Performs well on clusters and shared-memory machines.
Sobel edge detection Currently performs poorly on clusters with Berkeley compiler.
Berkeley compiler more sensitive to compiler “tricks”. Common optimizations
“Prefetch” read-only data all threads access frequently. Cast pointer-to-shared with affinity to current thread to private pointers.
These optimizations fall under the category of language-layer performance factors. GWU benchmarks are too simplistic, need more complex benchmarks to identify
performance factors. Future work: review NAS benchmarks
FFT, integer sort, etc.. NAS benchmarks should result in more performance factors being obtained.
27
Performance Models (1) Purpose: to identify and characterize impacts of possible performance factors by incorporating their effects into a performance model. Existing models
“Classic” models: LogP and variations, PRAM, BSP. Models of this type are loose abstractions of real machines (limited fidelity). Possible to tune these types of models to existing network/CPU hardware, but it remains to be seen whether a connection to an algorithm or program can be made. PRAM is better suited to model generic algorithms, but the model is missing too much information to transfer to real-world performance. Other models
P3T automatic Fortran “parallelizer” Driven by performance model with coarse-grained characteristics. Captures program performance using profiler, maps out parallel version based on prediction
from model. Successfully applied to simple scientific computation kernels.
Resource OCCupancy model (ROCC) Predict overhead of measurement/instrumentation systems. Model was calibrated to Paradyn with success.
28
Performance Models (2) UPC-specific models
Zhang Zhang (MTU) has created an initial performance model for evaluating UPC code (PhD thesis, on-going work).
Preliminary work uses a simplistic remote reference/local reference count and a cost for each (in terms of memory bandwidth), with ~5% to ~30% error.
Future work Ideal case: leverage Zhang Zhang’s model. Possible: create a performance model for the specific purpose of
identifying performance factors (more research needed). Coarse-grained simulation is another possible strategy to pursue. Research other models used to predict program behavior under different
types of execution environments. Parallel/MPI performance prediction models. Shared-memory performance prediction models.
29
Factor Information Name Relevancy
Cite literature findings. Provide test programs and/or cases.
Measurement How? Existing tools measuring it.
Experiment How? Sensitivity.
Attributes HW/SW? Architecture specific? Language specific? Compiler specific? Layer. Relation with other factors.
User rating Tracing / profiling Comment, other useful information Presentation strategy
30
Other Topics
Algorithm analysis Compiler optimizations Factor classification and determination Performance analysis strategy Tool design: production / theoretical Tool framework/approach: theoretical and
experimental
Collaborations
32
Collaborations (1) Cray
Contact: David Greene Status: No response yet.
GWU Contact: Graduate students Status: Willing to help, no involvement yet.
HP Contact: Brian Wibecan (UPC), Bruce Trull (SHMEM) Status:
Brian: Asked several questions. Shown interest in continuing their previous work with Vampir. Currently busy, will respond soon.
Bruce: Asked about native SHMEM for Marvel. Will respond when he finds something. Doesn’t seem likely SHMEM exists for Tru64.
IBM Contact: Raymond Mak, Kit Barton Status: No response yet.
Intrepid Contact: Gary Funck / Nenad Vukicevic Status: Helpful in providing possible factors and documentation/hints on the GCC-UPC compiler.
33
Collaborations (2) MTU
Contact: Zhang Zhang, Phil Merkey Status:
Received Zhang’s dissertation (performance model and possible factors). Haven’t heard back from Phil yet.
PModel Contact: Ricky Kendall Status: In contact with, involvement pending.
Sandia Contact: Zhaofang Wen, Rob Brightwell Status: Shown interest in collaboration, details need to be determined.
Sun Contact: Yuan Lin Status: Will contact us when compiler becomes publicly available.
UC Berkeley Contact: Dan Bonachea Status: In constant contact.
Benchmarking and Lessons Learned
35
UPC Benchmarking: Overview Goals
Build programming skills in UPC/SHMEM. Become familiar with all aspects of languages on various platforms and compilers.
Help identify desirable metrics through performance optimization. Perform performance analysis from the user's perspective.
Testbed Xeon cluster (Kappa)
Nodes: Dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset.
SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus.
MPI: MPICH 1.2.5. RedHat 9.0 with gcc compiler V 3.3.2, Berkeley UPC runtime system 2.0.
Opteron cluster (Lambda) Nodes: Dual AMD Opteron 240, 1GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S
server motherboard. InfiniBand: 4x (10Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, 24-port
switch from Voltaire. MPI: MPI package provided by Voltaire. SuSE 9.0 with gcc compiler V 3.3.3, Berkeley UPC runtime system 2.0.
ES80 AlphaServer (Marvel) Four 1GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor
connections. Tru64 5.1B Unix, HP UPC V2.3-test compiler.
36
UPC Benchmarking: Depth First Search Description
Tree represented with an array. Search starts with the root node; if a match is found, stop, else try matching all children. Complexity O(N) sequential, O(log N) parallel. Depending on the match process, could be either computation intensive or communication intensive.
Metrics used
Total time Function time
Initialization Search
Conclusion Dynamic programming impossible?
UPC version is different from the sequential implementation. Less intuitive to program in UPC.
Xeon cluster with SCI: computation-to-communication ratio important. for() { if (MYTHREAD == …) … } vs. upc_forall():
1 – 2 nodes: comparable, sometimes faster, sometimes slower. 4 nodes: faster (more parallel speedup).
Exact same code does not work on Opteron cluster with InfiniBand
[Figure] Example tree layout: array indices 0..6 map to a binary tree with root 0, children 1-2, grandchildren 3-6 (node i has children 2i+1 and 2i+2).
37
UPC Benchmarking: Concurrent Wave Equation Description
A vibrating string is decomposed into points. In the parallel version, each processor is responsible for updating the amplitude of N/P points. Each iteration, each processor exchanges boundary points with its nearest neighbors. Coarse-grained communication. Algorithm complexity of O(N).
Versions Sequential C
Baseline version. Hand optimized version.
Removed redundant calculations. Used global variables. Used temp variables to store intermediate results.
UPC: based on the modified sequential version. 32/64-bit version: Xeon / Opteron. Each processor gets a contiguous range of N/P points. Communication only on local endpoints.
38
Program Analysis: Results
[Chart] UPC Concurrent Wave Equation Results: execution time (sec, 0 to 1.4) vs. number of points (1E6, 0.5 to 3). Series: Xeon-sequential, Xeon-sequential mod, Xeon-upc-1, Xeon-upc-2, Xeon-upc-4, Opteron-sequential, Opteron-sequential mod, Opteron upc-1, Opteron upc-2, Opteron upc-4.
39
UPC Benchmarking: Concurrent Wave Equation Conclusion
Sequential C Modified version was 30% faster than baseline for Xeon, but
only 17% faster for Opteron. Opteron and Xeon sequential unmodified code have nearly
identical execution times. UPC
Near linear speedup. Fairly straightforward to port from sequential code. upc_forall loop ran faster with ‘array+j’ as affinity expression
than with ‘&(array[j])’.
40
UPC Benchmarking: Bench 9 (Mod 2^N Inverse)
Description
Given: list A of size listsize (64-bit integers); values range from 0 to 2^j − 1. Compute
list B, where B_i = A_i “right justified”. list C, such that (B_i * C_i) mod 2^j = 1 (iterative algorithm). Computation is embarrassingly parallel.
Check section (gather): first node checks all values to verify (B_i * C_i) mod 2^j = 1. Benchmark text: “Output selected values from A, B, and C”. Very communication intensive!
Benchmark parameters: listsize = 5 × 10^6,
j = 48 Optimizations: UPC version used all combinations of the following
Casting shared pointers with affinity to local thread to private pointers (cast). Explicitly stating work distribution via for loops instead of using upc_forall (for). Copying B & C arrays to main thread before check (get).
MPI, UPC, and SHMEM versions used same algorithm MPI + SHMEM equivalent to get UPC optimization.
41
Programming Practices: Bench 9, Effect of Optimizations (AlphaServer)
[Chart] Bench 9 Optimizations - AlphaServer: time (seconds, 0 to 6) for each optimization (forall, forall cast, for, for cast, get forall, get forall cast, get for, get for cast). Series: sequential, cc; sequential, upc; upc, 1 thread; upc, 4 threads.
42
Programming Practices: Bench 9, Effect of Optimizations (Opteron Cluster)
[Chart] Bench 9 Optimizations - Opteron Cluster: time (seconds, 0 to 12) for each optimization (forall, forall cast, for, for cast, get forall, get forall cast, get for, get for cast). Series: sequential, gcc; sequential, upcc; upc, 1 thread; upc, 4 threads.
43
Programming Practices: Bench 9, Performance (AlphaServer)
[Chart] Bench 9 - AlphaServer: time (seconds, 0 to 3) vs. number of threads (1 to 4). Series: upc (get for cast); gpshmem; mpi.
44
Programming Practices: Bench 9, Performance (Opteron Cluster)
[Chart] Bench 9 - Opteron Cluster: time (seconds, 0 to 6) vs. number of threads (1 to 16). Series: gpshmem; mpi; bupc-vapi (get for cast, BS=100); bupc-vapi (get for cast, BS=MAX).
45
UPC Benchmarking: Bench 9, Conclusions
Overall performance AlphaServer: Relatively good, UPC, SHMEM, and MPI comparable performance. Opteron
Bad performance; larger list sizes result in worse performance. MPI gave best results (more mature compiler), although code was much harder to write. GPSHMEM over MPI does not give good performance vs. plain MPI, although it is much easier to write with one-sided communication patterns. MPI could use asynchronous calls to hide latency on check, although communication time dominates by a large factor.
Performance of code can be closely tied to “tricks” used on compilers. AlphaServer: about 33% performance difference depending on which optimizations are used. Opteron: with the Berkeley compiler, sometimes over 2x performance lost if no optimizations are used.
UPC compiler overhead: compiling sequential code through the UPC compiler instead of cc on Marvel results in a 45% performance hit. Overhead in Berkeley compiler significant if no casting is used. for vs. upc_forall in Berkeley compiler:
The implementation of upc_forall translates this code:
  upc_forall (i = 0; i < NUM; i++; &A[i]) { A[i] = rand(); }
to:
  for (i = 0; i < NUM; i++) { if (upc_threadof(&A[i]) == MYTHREAD) { A[i] = rand(); } }
The extra instructions slow execution (branch prediction, context switches, etc.). Reverse on AlphaServer: upc_forall better than for.
46
UPC Benchmarking: Convolution Description
Compute convolution of two sequences. “Classic” definition of convolution (not image processing definition). Embarrassingly parallel (gives an idea of language overhead). O(n2) algorithm complexity.
Benchmark parameters Sequence sizes: 100,000 elements. Data types: 32-bit integer, double-precision floating point.
Optimizations: UPC version used all combinations of the following Casting shared pointers with affinity to local thread to private pointers (cast). Explicitly stating work distribution via for loops instead of using upc_forall (for).
Incorporates casting shared pointers with affinity to local thread to private pointers (previously cast).
Copying X & H arrays to main thread before check (get). MPI, UPC, and SHMEM versions used same algorithm
C[n] = Σ_k X[k] * H[n − k]
47
Programming Practices: Convolution, Effect of Optimizations (AlphaServer)
[Chart] Integer Convolution Optimizations - AlphaServer: time (seconds, 0 to 14) for each optimization (naïve forall, naïve for, get forall, get for). Series: sequential; upc, 1 thread; upc, 4 threads.
48
Programming Practices: Convolution, Effect of Optimizations (Opteron Cluster)
[Chart] Integer Convolution Optimizations - Opteron Cluster: time (seconds, 0 to 80) for each optimization (naïve forall, naïve for, get forall, get for). Series: sequential; upc, 1 thread; upc, 4 threads.
49
Programming Practices: Convolution, Performance (AlphaServer)
[Chart] Integer Convolution - AlphaServer: time (seconds, 0 to 60) vs. number of threads (1 to 4). Series: upc (get for); gpshmem; mpi.
50
Programming Practices: Convolution, Performance (Opteron Cluster)
[Chart] Integer Convolution - Opteron Cluster: time (seconds, 0 to 60) vs. number of threads (1 to 16). Series: bupc-vapi (get for); gpshmem; mpi.
51
Programming Practices: Convolution, Performance (AlphaServer)
[Chart] Double Precision Floating Point Convolution - AlphaServer: time (seconds, 0 to 120) vs. number of threads (1 to 4). Series: upc (get for); gpshmem; mpi.
52
Programming Practices: Convolution, Performance (Opteron Cluster)
[Chart] Double Precision Floating Point Convolution - Opteron Cluster: time (seconds, 0 to 140) vs. number of threads (1 to 16). Series: bupc-vapi (get for); gpshmem; mpi.
53
UPC Benchmarking: Convolution, Conclusions
Optimizations Similar results to NSA Bench 9
Naïve version (treat shared A and B variables as though they were local) has abysmal performance on both Marvel and Lambda.
Less compiler overhead Almost no difference between sequential versions compiled with gcc/cc and sequential versions compiled
with UPC compilers. Benchmark less memory intensive: UPC compilers compiling variables as volatile? Performance of double-precision floating point actually improved by HP UPC compiler on AlphaServer.
Same flags (-O2) passed to both upc and cc compilers, perhaps HP UPC compiler does a better job scheduling the floating-point units?
Overall language overhead MPI was most work to code, but had least amount of overhead. SHMEM slightly easier because of one-sided communication functions. UPC easiest (conversion of sequential code very easy), but has potentially worse performance.
Overall language performance overhead On AlphaServer:
MPI had most runtime overhead in most cases. GPSHMEM performed surprisingly well.
Favorable to UPC and MPI with integer convolution. Slightly better than MPI for double-precision floating point convolution.
On Opteron: Runtime overhead MPI (least) < SHMEM < Berkeley UPC.
54
UPC Benchmarking: Conclusion
Benchmarking effort accomplished what we intended: gained practical experience in UPC/SHMEM/MPI; have a better understanding of programming styles between different languages; found factors/metrics that are useful for performance analysis; found differences in compiler efficiencies; created new types of UPC benchmarks that can be used later to validate factors.
Benchmarking process near completion Plan to develop a few more programs. Already accomplished most of what we set out to do.
55
Questions
Target platforms / architectures in mind?
Others that should be contacted.