Emergence of Performance Analysis Tools for Shared-Memory HPC in UPC or SHMEM

11 July 2005

Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant

HCS Research Laboratory
University of Florida



Outline

- Motivations and Objectives
- Background
- Approach
- Studies and Findings
  - Tools
  - Language
  - Usability
- High-level Design
- Future Plans
- Q&A

Motivations and Objectives

- Motivations
  - UPC program does not yield the expected performance. Why?
  - Due to complexity of parallel computing, difficult to determine without tools for performance analysis and optimization
  - Discouraging for users, new & old; few options for shared-memory computing in UPC and SHMEM communities
- Objectives
  - Identify important performance “factors” in UPC computing
  - Research topics relating to performance analysis
  - Develop framework for a performance analysis tool
    - As new tool or as extension/redesign of existing non-UPC tools
  - Design with both performance and user productivity in mind
  - Attract new UPC users and support improved performance

Need for Performance Analysis

- Performance analysis of sequential applications can be challenging
- Performance analysis of explicitly communicating parallel applications is significantly more difficult
  - Mainly due to increase in number of processing nodes
- Performance analysis of implicitly communicating parallel applications is even more difficult
  - Non-blocking, one-sided communication is tricky to track and analyze accurately

Figure: three programming models
- Sequential programming: a single processing node
- Explicit parallel programming (ex: MPI): processing node ... processing node
- Implicit parallel programming (ex: UPC, SHMEM): processing node ... processing node

Background - SHMEM

- SHared MEMory library
  - Based on SPMD model
  - Available for C / Fortran
  - Available for servers and clusters
  - PSHMEM available
- Hybrid programming model
  - Traits of message passing
    - Explicit communication, replication and synchronization
    - Need to give remote data location (processing element ID)
  - Traits of shared memory
    - Provides logically shared memory system view
    - Non-blocking, one-sided communication
- Lower latency, higher bandwidth communication
- Easier to program than MPI

Figure: P.E. X and P.E. Y each hold int x and float y at the SAME ADDRESS in remotely accessible memory, alongside their own private memory.

Background - UPC

- Unified Parallel C (UPC)
  - A distributed shared memory parallel programming language
  - Common and familiar syntax and semantics for parallel C with simple extensions to ANSI C
  - Many implementations
    - Open source: Berkeley UPC, Michigan Tech UPC, GCC-UPC
    - Proprietary: HP-UPC, Cray-UPC
  - Easier to program than MPI, software more scalable
  - With hand-tuning, UPC performance compares favorably with MPI

Background – Performance Analysis

- Three general performance analysis approaches
  - Analytical modeling
    - Mostly predictive methods
    - Could also be used in conjunction with experimental performance measurement
    - Pros: easy to use, fast, can be done without running the program
    - Cons: usually not very accurate
  - Simulation
    - Pros: allows performance estimation of program with various system architectures
    - Cons: slow, not generally applicable for regular UPC/SHMEM users
  - Experimental performance measurement
    - Strategy used by most modern PATs
    - Uses actual event measurement to perform the analysis
    - Pros: most accurate
    - Cons: can be time-consuming

PAT = Performance Analysis Tool

Background - Experimental Performance Measurement Stages

- Instrumentation – user-assisted or automatic insertion of instrumentation code
- Measurement – actual measuring stage
- Analysis – filtering, aggregation, analysis of data gathered
- Presentation – display of analyzed data to user, deals directly with user
- Optimization – process of finding and resolving bottlenecks

Figure: cycle of stages (Original code, Instrumentation, Measurement, Analysis, Presentation, Optimization), with Analytical modeling and Simulation shown alongside.
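The first four stages can be illustrated with a minimal Python sketch. It is not part of the PAT design: the names instrument, EVENTS, analyze, present, and compute are hypothetical, and a decorator stands in for the instrumentation code that a real tool would insert into UPC or SHMEM programs.

```python
import functools
import time
from collections import defaultdict

EVENTS = []  # raw event log filled in during the measurement stage


def instrument(func):
    """Instrumentation: wrap a function so each call records a timed event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Measurement: store (name, elapsed seconds) for later analysis.
            EVENTS.append((func.__name__, time.perf_counter() - start))
    return wrapper


def analyze(events):
    """Analysis: aggregate raw events into per-function call count and total time."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for name, elapsed in events:
        summary[name]["count"] += 1
        summary[name]["total"] += elapsed
    return dict(summary)


def present(summary):
    """Presentation: one text line per function, most expensive first."""
    rows = sorted(summary.items(), key=lambda kv: kv[1]["total"], reverse=True)
    return [f"{name}: {s['count']} calls, {s['total']:.6f} s" for name, s in rows]


@instrument
def compute(n):
    """Hypothetical workload standing in for application code."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    for _ in range(3):
        compute(10_000)
    print("\n".join(present(analyze(EVENTS))))
```

Optimization, the fifth stage, is left to the user: the presented summary only points at where time is spent.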

Approach

- Define layers to divide workload
- Conduct multiple studies (language, tool, usability) in parallel to:
  - Minimize development time
  - Maximize usefulness of PAT

Figure (layers): Application Layer, Language Layer, Compiler Layer, Middleware Layer, Hardware Layer

Figure (workflow across Phase 1 and Phase 2), boxes shown: Existing tools; Preliminary factor list & requirement; Tool study; Sensitivity study; Language study; PAT prototype; Experimental study; Usability study; Preliminary user interface; PAT high-level design; PAT low-level design; User-friendly interface; User-interface evaluation

Tools Study: Overview

- Purpose
  - Investigate analysis methods used in existing tools
  - Examine extensibility of existing tools
- Methodology
  - Generate list of desirable characteristics for performance tools
    - Rate each on a scale from 1 to 5 (0 = not applicable)
    - Use quantitative rating scheme where possible
  - Install and use tool on several test codes with performance problems
    - PPerfMark (Grindstone) suite
    - Large code set (NAS NPB LU v2.3)
    - “Control” code that has no problems (CAMEL)
  - Examine tool software architecture and estimate extensibility

Tools Study: List of Tools

- Profiling tools
  - TAU (Univ. of Oregon)
  - mpiP (ORNL, LLNL)
  - HPCToolkit (Rice Univ.)
  - SvPablo (Univ. of Illinois, Urbana-Champaign)
  - DynaProf (Univ. of Tennessee, Knoxville)
- Tracing tools
  - Intel Cluster Tools (Intel)
  - MPE/Jumpshot (ANL)
  - Dimemas & Paraver (European Ctr. for Parallelism of Barcelona)
  - MPICL/ParaGraph (Univ. of Illinois, Univ. of Tennessee, ORNL)

Tools Study: List of Tools (2)

- Other tools
  - KOJAK (Forschungszentrum Jülich, ICL @ UTK)
  - Paradyn (Univ. of Wisconsin, Madison)
- Also quickly reviewed
  - CrayPat/Apprentice2 (Cray)
  - DynTG (LLNL)
  - AIMS (NASA)
  - Eclipse Parallel Tools Platform (LANL)
  - Open/Speedshop (SGI)

Tools Study: Lessons Learned

- Most effective tools present data to user at many levels
  - Profile and summary data that give top-level view of program’s execution
  - Trace data that shows exact behavior and communication patterns
  - Hardware counters that show how well program runs on current architecture
- Common problems experienced
  - Difficult installations
  - Poor documentation, no “quickstart” guide
  - Too much user overhead
  - Not enough data available from tools to troubleshoot performance problems
  - Difficult to navigate tool to get to performance data or nonstandard user interfaces
  - Inability to relate data back to source code

Tools Study: Extensibility Study

- Adding SHMEM support
  - Use wrapper profiling libraries (like PMPI)
    - Challenge: relate data back to source line
    - Existing techniques that can help: HPCToolkit, mpiP
  - Add better support for recording performance data for one-sided memory operations
  - Add SHMEM-specific visualizations that accurately reflect SHMEM’s programming model
- Adding UPC support
  - Same issues as with adding SHMEM support
    - Source-code correlation with profiling libraries
    - One-sided operations
  - Additional problem: getting accurate data from UPC programs at runtime
    - Profiling interface initiative

Language Study

- Purpose
  - Determine performance factors purely from language’s perspective
  - Gain insight into how to best incorporate performance measurement with various implementations
- Method
  - Create a complete and minimal factor list
  - Analyze UPC and SHMEM (Quadrics) specs
  - Analyze various UPC/SHMEM implementations
- Status
  - Correlated performance factors to individual UPC/SHMEM constructs for UPC and SHMEM (Quadrics) specs
  - Correlated performance factors to individual UPC/SHMEM constructs for Berkeley UPC, Michigan Tech UPC, HP UPC (in progress), GPSHMEM
  - Identified implementation details important for incorporating performance measurement

Language Study – Factor List

Factors and their basic events:

- Computation
  - Execution (useful work): time
  - System overhead: compiler, runtime, OS, thread, I/O
- Communication
  - Bulk transfer: variable name, size, count, transfer time, overhead
  - Small transfer: variable name, total time (transfer + overhead), count
  - Synchronization: notify time, wait time, count, overhead
  - Completion notification: wait time, count, overhead
- Memory
  - Local access: variable name, time, count
  - Cache (remote data): size, miss/hit count, variable name
- Other
  - Scalability: system size
  - Resource management: I/O, etc.
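As an illustration of how one row of the factor list might map onto recorded data, the sketch below models bulk-transfer events in Python. The record layout, field names, and sample values are assumptions made for this example; the slides do not define an event format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BulkTransfer:
    """One bulk-transfer event; fields mirror the basic events in the factor list."""
    variable: str         # name of the shared variable moved
    size_bytes: int       # payload size
    transfer_time: float  # seconds spent moving data
    overhead: float       # seconds of software overhead around the transfer


def summarize(events):
    """Aggregate events per variable: count, bytes, transfer time, overhead."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0,
                                  "transfer_time": 0.0, "overhead": 0.0})
    for ev in events:
        t = totals[ev.variable]
        t["count"] += 1
        t["bytes"] += ev.size_bytes
        t["transfer_time"] += ev.transfer_time
        t["overhead"] += ev.overhead
    return dict(totals)


# Made-up sample events, for illustration only.
sample = [
    BulkTransfer("grid", 4096, 0.5, 0.25),
    BulkTransfer("grid", 4096, 0.25, 0.25),
    BulkTransfer("halo", 512, 0.125, 0.125),
]
```

The count in the factor list falls out of aggregation rather than being stored per event, which keeps individual records small.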

Language Study - Summary

- SHMEM
  - Wrapper library approach appears ideal
  - Push for SHMEM standardization will simplify development
    - Each implementation comes with a different set of functions
- UPC
  - Hybrid pre-processor/wrapper library approach appears appropriate
    - UPC functions: wrappers (difficult with GCC-UPC)
    - Others (direct memory access, upc_forall, etc.): pre-processor
  - Similar implementation techniques used
  - Noteworthy differences
    - Berkeley UPC
      - Uses centralized barrier mechanism (soon to be replaced by logarithmic dissemination barrier) or hardware-supplied barrier
    - Michigan Tech UPC
      - Program + communication thread: more difficult to track communication events
      - Uses flat broadcast and tree broadcast
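For readers unfamiliar with the logarithmic dissemination barrier mentioned above, the following Python sketch simulates the textbook signalling pattern. It is not Berkeley UPC's implementation; the function names are hypothetical. In round k, thread i signals thread (i + 2^k) mod n, so after ceil(log2 n) rounds every thread has learned, directly or transitively, that all others have arrived. A centralized barrier, by contrast, has all n threads update one shared counter.

```python
def dissemination_rounds(n):
    """Return, per round, the list of (sender, receiver) signals for n threads."""
    rounds = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    return [[(i, (i + 2 ** k) % n) for i in range(n)] for k in range(rounds)]


def knowledge_after(n):
    """Simulate the barrier; known[i] is the set of threads whose arrival i has learned of."""
    known = [{i} for i in range(n)]
    for signals in dissemination_rounds(n):
        # Signals within a round are concurrent, so read from a snapshot.
        snapshot = [set(s) for s in known]
        for sender, receiver in signals:
            known[receiver] |= snapshot[sender]
    return known


if __name__ == "__main__":
    for k, signals in enumerate(dissemination_rounds(8)):
        print(f"round {k}: {signals}")
```

From a measurement standpoint, the difference matters because a dissemination barrier spreads wait time over several point-to-point signals per thread instead of a single shared location.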

Usability Study

- Purpose
  - Provide a discussion on factors influencing usability of performance tools
  - Outline how to incorporate user-centered design into PAT
  - Discuss common problems seen in performance tools
  - Present solutions to these problems
- Methods
  - Elicit user feedback through a Performance Tool Usability Survey – limited success thus far
    - http://www.hcs.ufl.edu/prj/upcgroup/survey.html
  - Review and develop a concise summary of literature regarding usability for parallel performance tools

Usability Study - Usability Factors*

- Ease-of-learning
  - Important for attracting new users
  - Inconsistency leads to confusion
  - Tool should be internally and externally consistent
  - Stick to established conventions
  - Provide as uniform an interface as possible
- Ease-of-use
  - Amount of effort required to accomplish work with tool
  - Provide a simple interface
  - Make all user-required actions concrete and logical
- Usefulness
  - How directly a tool helps a user achieve their goal
  - Make common case simple even if that makes rare case complex
- Throughput
  - How tool contributes to user productivity in general
  - Keep in mind that inherent goal of tool is to increase user productivity

* C. M. Pancake, “Improving the Usability of Numerical Software through User-Centered Design,” The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice, Chapman & Hall, pp. 44-60, 1997.

Usability Study - User-Centered Design

- General principles
  - Usability will be achieved only if software design process is user-driven
  - Understand target users
  - Usability should be driving factor in tool design
- Four-step model to incorporate user feedback (chronological)
  1. Ensure initial functionality is based on user needs
     - Solicit input directly from user
     - Use an analytical user interface evaluation technique in absence of survey data
  2. Analyze how users identify and correct performance problems
  3. Implement incrementally
     - Organize interface so that most useful features are best supported
  4. Have users evaluate every aspect of tool’s interface structure and behavior
     - Alpha/Beta testing (agency testers needed!)
     - Feature-by-feature refinement in response to specific user feedback

PAT High-Level Design – Complete Version

Figure: block diagram of the complete PAT design.
- Units: P UNIT, I UNIT, M UNIT, A UNIT
- User-facing modules: GUI; Command line; Input interface
- Instrumentation: Automatic source instrumentation manager; Binary instrumentation manager
- Measurement: Measurement modules (hardware modules, ex: PAPI; software modules); Low-level event read module; Low-level event write module; Event file manager; Trace format converter
- Analysis: Aggregation module; High-level event module; Analysis/data manager; Search module; Search/indexing module; Performance data manager; Analysis modules with filtering and aggregation (standard, advanced, other)
- Presentation: Visualization manager; Standard view modules; Advanced view modules; Plug-in view modules
- Other: Simulator
- STORAGE: Basic event data; Complex event data; Analysis data
- Legend: CORE MODULE, ADVANCED MODULE, OTHER MODULE

PAT High-Level Design – Simplified Version

Figure: block diagram of the simplified PAT design.
- Units: P UNIT, I UNIT, M UNIT, A UNIT
- Modules shown: Input interface; Automatic source instrumentation manager; Measurement modules; Event file manager; Low-level event read/write module; High-level event module; Analysis/data manager; Analysis modules; Visualization manager
- STORAGE: Basic event data; Complex event data; Analysis data

PAT High-Level Design – Design Choices

- Semi-automatic source-level instrumentation as default
  - Minimize user workload
  - Reduce development complexity
  - Provide easy source-code correlation
- Timing events: execution, transfer, synchronization, completion notification
- Only the P unit and part of the I unit are visible to user
  - Single user-interface is sufficient
- PAPI will be used
  - Fast becoming the standard for accessing hardware counters
- Tracing mode as default with profiling support
  - Profiling has limited use

PAT High-Level Design – Design Choices (2)

- Post-mortem data processing and analysis
  - On-line processing and analysis likely to disrupt program behavior
- Analyses: load balancing, scalability, memory system
- Visualizations:
  - Timeline display
  - Speedup chart
  - Call-tree graph
  - Communication volume graph
  - Memory access graph
  - Profiling table
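To make the tracing-with-profiling-support and post-mortem analysis choices concrete, here is a small Python sketch that derives a profile from trace records after the run and computes two simple metrics. The trace record layout, the function names, and the max/mean imbalance ratio are assumptions chosen for illustration; the slides do not specify the tool's actual analyses.

```python
def profile_from_trace(trace, event="compute"):
    """Collapse trace records (thread, event, start, end) into per-thread time for one event type."""
    busy = {}
    for thread, name, start, end in trace:
        if name == event:
            busy[thread] = busy.get(thread, 0.0) + (end - start)
    return busy


def load_imbalance(busy):
    """Max per-thread time divided by mean per-thread time; 1.0 means perfectly balanced."""
    times = list(busy.values())
    return max(times) / (sum(times) / len(times))


def speedup(t_serial, t_parallel):
    """Classic speedup: serial run time over parallel run time."""
    return t_serial / t_parallel


# Made-up trace: thread 1 finishes computing early and waits at a barrier.
trace = [
    (0, "compute", 0.0, 4.0),
    (1, "compute", 0.0, 2.0),
    (1, "wait", 2.0, 4.0),
]

if __name__ == "__main__":
    busy = profile_from_trace(trace)
    print(busy, load_imbalance(busy), speedup(8.0, 4.0))
```

Because the profile is computed from the stored trace, nothing beyond event recording has to run while the program executes, which is the point of post-mortem processing.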

Visualization Example – Timeline Display

Visualization Example – Call-Tree Graph

Visualization Example – Communication Volume Graph

Visualization Example – Profiling Table

Future Plans

- Experimental testing to identify appropriate reusable components from existing tools
- Program analysis and performance bottleneck studies
- Prototype development
  - Preliminary plans for platform support (subject to change)
    - UPC: Berkeley (open source/cluster), HP (proprietary/server)
    - SHMEM: GPSHMEM (open source/cluster), Cray (proprietary/server)
  - Pipelined I, M, A, and P development
  - Parallel development to provide support for various implementations
  - Beta-testing volunteers needed!!

Schedule (Gantt chart, Oct 2005 through Mar 2007), tasks by ID:
1. Develop detailed design
2. Construct, test, and integrate I&M modules
3. Construct, test, and integrate A module
4. Construct, test, and integrate P module
5. Conduct beta tests and refine tool design and implementation
6. Design and implement advanced features
7. Preliminary research in performance optimization via prediction

Q & A

- General feedback?
- Preferred platform support?
- Optimization process (when and how)?
- Fortran SHMEM/GPSHMEM users?


Schedule of Detailed Discussions

2:00 – 3:15 PM Briefing on tool evaluations

3:20 – 3:50 PM Briefing on UPC/SHMEM language analysis and usability study

3:50 – 4:45 PM Briefing on PAT requirements, design, and development plan

You are invited to fill out our short survey!!
http://www.hcs.ufl.edu/prj/upcgroup/survey.html