11 July 2005
Emergence of Performance Analysis Tools for Shared-Memory HPC in UPC or SHMEM

Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant

HCS Research Laboratory
University of Florida
Outline
Motivations and Objectives
Background
Approach
Studies and Findings
- Tools
- Language
- Usability
High-level Design
Future Plans
Q&A
Motivations and Objectives

Motivations
- UPC programs often do not yield the expected performance. Why?
- Due to the complexity of parallel computing, the cause is difficult to determine without tools for performance analysis and optimization
- Discouraging for users, new and old; few options for shared-memory computing in the UPC and SHMEM communities

Objectives
- Identify important performance "factors" in UPC computing
- Research topics relating to performance analysis
- Develop a framework for a performance analysis tool, either as a new tool or as an extension/redesign of existing non-UPC tools
- Design with both performance and user productivity in mind
- Attract new UPC users and support improved performance
Need for Performance Analysis
- Performance analysis of sequential applications can be challenging
- Performance analysis of explicitly communicating parallel applications is significantly more difficult, mainly due to the increase in the number of processing nodes
- Performance analysis of implicitly communicating parallel applications is even more difficult: non-blocking, one-sided communication is tricky to track and analyze accurately

[Diagram: sequential programming with a single processing node; explicit parallel programming (ex: MPI) and implicit parallel programming (ex: UPC, SHMEM), each with multiple processing nodes]
Background - SHMEM
SHMEM (SHared MEMory) library
- Based on the SPMD model
- Available for C / Fortran
- Available for servers and clusters
- PSHMEM available
Hybrid programming model
- Traits of message passing: explicit communication, replication, and synchronization; must give the remote data location (processing element ID)
- Traits of shared memory: provides a logically shared memory system view; non-blocking, one-sided communication (see the sketch below)
- Lower latency, higher bandwidth communication
- Easier to program than MPI
[Diagram: two processing elements, P.E. X and P.E. Y, each with private memory and remotely accessible memory; symmetric variables (int x, float y) reside at the SAME ADDRESS in the remotely accessible memory of both]
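As a rough illustration of this one-sided style (an example added here for context, not from the original slides), a minimal SHMEM program in which PE 0 writes directly into PE 1's copy of a symmetric variable, naming the remote side only by its processing element ID:

/* Minimal SHMEM sketch; uses OpenSHMEM-style names (older Cray/SGI SHMEM
 * initializes with start_pes(0) instead of shmem_init()). */
#include <shmem.h>
#include <stdio.h>

int dest = 0;                 /* global, hence symmetric: same address on every PE */

int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    int src = 42;
    if (me == 0 && shmem_n_pes() > 1)
        shmem_int_put(&dest, &src, 1, 1);   /* one-sided: write one int into PE 1's dest */

    shmem_barrier_all();      /* ensure completion and visibility before reading */
    if (me == 1)
        printf("PE %d received %d\n", me, dest);

    shmem_finalize();
    return 0;
}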
Background – UPC
Unified Parallel C (UPC)
- A distributed shared-memory parallel programming language
- Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C (a minimal example follows below)
- Many implementations
  - Open source: Berkeley UPC, Michigan UPC, GCC-UPC
  - Proprietary: HP-UPC, Cray-UPC
- Easier to program than MPI, software more scalable
- With hand-tuning, UPC performance compares favorably with MPI
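For readers new to the language, a minimal UPC sketch (added for illustration, not part of the original slides) of the implicit communication a performance tool has to capture: a shared array is distributed across threads, upc_forall expresses an affinity-driven loop, and remote elements are read through ordinary array syntax.

/* Minimal UPC sketch: shared data plus an affinity-driven loop. */
#include <upc.h>
#include <stdio.h>

#define N 100

shared int a[N];      /* elements distributed round-robin across THREADS by default */

int main(void) {
    int i;

    /* Each thread initializes only the elements it has affinity to. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i;

    upc_barrier;      /* make all writes visible before remote reads */

    if (MYTHREAD == 0) {
        long sum = 0;
        for (i = 0; i < N; i++)
            sum += a[i];   /* reads of remote elements become implicit communication */
        printf("sum = %ld\n", sum);
    }
    return 0;
}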
Background – Performance Analysis
Three general performance analysis approaches:

Analytical modeling
- Mostly predictive methods; could also be used in conjunction with experimental performance measurement
- Pros: easy to use, fast, can be done without running the program
- Cons: usually not very accurate

Simulation
- Pros: allows performance estimation of a program on various system architectures
- Cons: slow, not generally applicable for regular UPC/SHMEM users

Experimental performance measurement
- Strategy used by most modern PATs
- Uses actual event measurement to perform the analysis
- Pros: most accurate
- Cons: can be time-consuming

PAT = Performance Analysis Tool
Background - Experimental Performance Measurement Stages
- Instrumentation – user-assisted or automatic insertion of instrumentation code (sketched below)
- Measurement – actual measuring stage
- Analysis – filtering, aggregation, and analysis of the gathered data
- Presentation – display of analyzed data to the user; deals directly with the user
- Optimization – process of finding and resolving bottlenecks

[Diagram: cycle from the original code through instrumentation, measurement, analysis, and presentation to optimization, with analytical modeling and simulation shown as alternative inputs]
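To make the instrumentation and measurement stages concrete, a small self-contained sketch (my illustration; pat_region_begin/pat_region_end are hypothetical hooks, not an existing API) of user-assisted source instrumentation around a code region:

/* Sketch of user-assisted source instrumentation. The pat_* hooks are
 * invented names, stubbed here with simple wall-clock timing; a real tool
 * would record events for later analysis and presentation. */
#include <stdio.h>
#include <sys/time.h>

static double wall(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static double region_start;

static void pat_region_begin(const char *name) { (void)name; region_start = wall(); }
static void pat_region_end(const char *name) {
    printf("%s: %.6f s\n", name, wall() - region_start);
}

int main(void) {
    double x[1000];
    pat_region_begin("compute_step");          /* instrumentation inserted by user or tool */
    for (int i = 0; i < 1000; i++) x[i] = i * 0.5;
    pat_region_end("compute_step");            /* measurement happens here at runtime */
    return 0;
}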
Approach
- Define layers to divide the workload
- Conduct multiple studies (language, tool, usability) in parallel to minimize development time and maximize the usefulness of the PAT

[Diagram: system layers (Application, Language, Compiler, Middleware, Hardware) alongside a two-phase plan. Phase 1: tool, sensitivity, language, and usability studies of existing tools produce a preliminary factor list and requirements, a preliminary user interface, and the PAT high-level design. Phase 2: PAT low-level design, PAT prototype, experimental study, and user-interface evaluation lead to a user-friendly interface.]
Tools Study: Overview
Purpose
- Investigate analysis methods used in existing tools
- Examine extensibility of existing tools
Methodology
- Generate a list of desirable characteristics for performance tools
  - Rate each on a scale from 1 to 5 (0 = not applicable)
  - Use a quantitative rating scheme where possible
- Install and use each tool on several test codes with performance problems
  - PPerfMark (Grindstone) suite
  - Large code set (NAS NPB LU v2.3)
  - "Control" code that has no problems (CAMEL)
- Examine tool software architecture and estimate extensibility
Tools Study: List of Tools
Profiling tools
- TAU (Univ. of Oregon)
- mpiP (ORNL, LLNL)
- HPCToolkit (Rice Univ.)
- SvPablo (Univ. of Illinois, Urbana-Champaign)
- DynaProf (Univ. of Tennessee, Knoxville)
Tracing tools
- Intel Cluster Tools (Intel)
- MPE/Jumpshot (ANL)
- Dimemas & Paraver (European Ctr. for Parallelism of Barcelona)
- MPICL/ParaGraph (Univ. of Illinois, Univ. of Tennessee, ORNL)
Tools Study: List of Tools (2)
Other tools
- KOJAK (Forschungszentrum Jülich, ICL @ UTK)
- Paradyn (Univ. of Wisconsin, Madison)
Also quickly reviewed
- CrayPat/Apprentice2 (Cray)
- DynTG (LLNL)
- AIMS (NASA)
- Eclipse Parallel Tools Platform (LANL)
- Open|SpeedShop (SGI)
Tools Study: Lessons Learned
Most effective tools present data to the user at many levels
- Profile and summary data that give a top-level view of the program's execution
- Trace data that shows exact behavior and communication patterns
- Hardware counters that show how well the program runs on the current architecture
Common problems experienced
- Difficult installations
- Poor documentation, no "quickstart" guide
- Too much user overhead
- Not enough data available from tools to troubleshoot performance problems
- Difficult navigation to reach the performance data, or nonstandard user interfaces
- Inability to relate data back to source code
Tools Study: Extensibility Study
Adding SHMEM support
- Use wrapper profiling libraries (like PMPI); a wrapper sketch follows below
  - Challenge: relating data back to the source line
  - Existing techniques that can help: HPCToolkit, mpiP
- Add better support for recording performance data for one-sided memory operations
- Add SHMEM-specific visualizations that accurately reflect SHMEM's programming model
Adding UPC support
- Same issues as with adding SHMEM support
  - Source-code correlation with profiling libraries
  - One-sided operations
- Additional problem: getting accurate data from UPC programs at runtime
  - Profiling interface initiative
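To make the wrapper-library idea concrete, a hedged sketch (an assumption about how such a layer could look, not this project's actual code): the tool's library defines a SHMEM routine, records timing and message volume, and forwards to the implementation through a PSHMEM-style profiling entry point where one exists. The pat_record_put hook is an invented name for illustration.

/* Hypothetical SHMEM profiling wrapper in the spirit of PMPI. */
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

/* Real routine supplied by implementations that provide a PSHMEM interface. */
void pshmem_int_put(int *target, const int *source, size_t nelems, int pe);

static double wall(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Invented recording hook; a real tool would emit a trace or profile event. */
static void pat_record_put(const char *op, size_t bytes, int pe, double seconds) {
    fprintf(stderr, "%s: %zu bytes to PE %d in %.6f s\n", op, bytes, pe, seconds);
}

void shmem_int_put(int *target, const int *source, size_t nelems, int pe) {
    double t0 = wall();
    pshmem_int_put(target, source, nelems, pe);   /* forward to the real routine */
    pat_record_put("shmem_int_put", nelems * sizeof(int), pe, wall() - t0);
}

Relating each call back to a source line would still require capturing the caller's address or unwinding the stack at wrapper entry, which is where techniques from HPCToolkit and mpiP can help.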
Language Study
Purpose
- Determine performance factors purely from the language's perspective
- Gain insight into how to best incorporate performance measurement with various implementations
Method
- Create a complete and minimal factor list
- Analyze the UPC and SHMEM (Quadrics) specs
- Analyze various UPC/SHMEM implementations
Status
- Correlated performance factors to individual UPC/SHMEM constructs for the UPC and SHMEM (Quadrics) specs
- Correlated performance factors to individual UPC/SHMEM constructs for Berkeley UPC, Michigan Tech UPC, HP UPC (in progress), and GPSHMEM
- Identified implementation details important for incorporating performance measurement
Language Study – Factor List
Factors and their basic events:

Computation
- Execution (useful work): time
- System overhead: compiler, runtime, OS, thread, I/O

Communication
- Bulk transfer: variable name, size, count, transfer time, overhead
- Small transfer: variable name, total time (transfer + overhead), count
- Synchronization: notify time, wait time, count, overhead
- Completion notification: wait time, count, overhead

Memory
- Local access: variable name, time, count
- Cache (remote data): size, miss/hit count, variable name

Other
- Scalability: system size
- Resource management: I/O, etc.
Language Study - Summary
SHMEM
- Wrapper library approach appears ideal
- Push for SHMEM standardization will simplify development; each implementation currently comes with a different set of functions
UPC
- Hybrid pre-processor/wrapper library approach appears appropriate (see the sketch below)
  - UPC function calls: wrappers (difficult with GCC-UPC)
  - Others (direct memory accesses, upc_forall, etc.): pre-processor
- Similar implementation techniques used across implementations
Noteworthy differences
- Berkeley UPC: uses a centralized barrier mechanism (soon to be replaced by a logarithmic dissemination barrier) or a hardware-supplied barrier
- Michigan Tech UPC: separate program and communication threads make communication events more difficult to track; uses flat broadcast and tree broadcast
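As an illustration of the pre-processor half of that hybrid approach (my own sketch, not taken from any of these compilers), a source-to-source pass could bracket implicit shared accesses with event-recording hooks; the pat_* names below are invented placeholders:

/* Sketch of what an instrumenting pre-processor could emit for an implicit
 * remote read in UPC. The pat_* hooks are hypothetical and stubbed out here. */
#include <upc.h>

shared int a[THREADS];

static void pat_shared_read_begin(const char *var) { (void)var; /* record start event */ }
static void pat_shared_read_end(const char *var)   { (void)var; /* record end event   */ }

int read_neighbor(void) {
    int x;
    /* original statement:  x = a[(MYTHREAD + 1) % THREADS];  */
    pat_shared_read_begin("a");
    x = a[(MYTHREAD + 1) % THREADS];   /* possibly remote: implicit communication */
    pat_shared_read_end("a");
    return x;
}

Library-level calls such as upc_memget would instead be intercepted with wrappers, as in the SHMEM example earlier.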
Usability Study
Purpose
- Provide a discussion of the factors influencing the usability of performance tools
- Outline how to incorporate user-centered design into the PAT
- Discuss common problems seen in performance tools
- Present solutions to these problems
Methods
- Elicit user feedback through a Performance Tool Usability Survey – limited success thus far
  http://www.hcs.ufl.edu/prj/upcgroup/survey.html
- Review and develop a concise summary of the literature on usability for parallel performance tools
Usability Study - Usability Factors*
Ease-of-learning
- Important for attracting new users
- Inconsistency leads to confusion: the tool should be internally and externally consistent
- Stick to established conventions; provide as uniform an interface as possible
Ease-of-use
- Amount of effort required to accomplish work with the tool
- Provide a simple interface
- Make all user-required actions concrete and logical
Usefulness
- How directly a tool helps a user achieve their goal
- Make the common case simple, even if that makes the rare case complex
Throughput
- How the tool contributes to user productivity in general
- Keep in mind that the inherent goal of the tool is to increase user productivity

* C. M. Pancake, "Improving the Usability of Numerical Software through User-Centered Design," The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice, Chapman & Hall, pp. 44-60, 1997.
Usability Study - User-Centered Design
General principles
- Usability will be achieved only if the software design process is user-driven
- Understand the target users
- Usability should be the driving factor in tool design
Four-step model to incorporate user feedback (chronological)
1. Ensure initial functionality is based on user needs
   - Solicit input directly from users
   - Use an analytical user-interface evaluation technique in the absence of survey data
   - Analyze how users identify and correct performance problems
2. Implement incrementally
   - Organize the interface so that the most useful features are best supported
3. Have users evaluate every aspect of the tool's interface structure and behavior
   - Alpha/beta testing (agency testers needed!)
4. Feature-by-feature refinement in response to specific user feedback
PAT High-Level Design – Complete Version

[Architecture diagram: four units (I = instrumentation, M = measurement, A = analysis, P = presentation) plus storage. The I unit contains a GUI/command-line input interface, an automatic source instrumentation manager, and a binary instrumentation manager. The M unit contains measurement modules backed by hardware modules (ex: PAPI) and software modules, low-level event read/write modules, an event file manager, and a trace format converter. The A unit contains an analysis/data manager, high-level event module, aggregation module, search/indexing module, performance data manager, standard/advanced/other analysis modules (with filtering and aggregation), and a simulator. The P unit contains a visualization manager driving standard, advanced, and plug-in view modules. Storage holds basic event data, complex event data, and analysis data. Modules are classified as core, advanced, or other.]
PAT High-Level Design – Simplified Version

[Simplified architecture diagram: I unit (input interface, automatic source instrumentation manager); M unit (measurement modules, low-level event read/write module, event file manager); A unit (high-level event module, analysis/data manager, analysis modules); P unit (visualization manager); storage for basic event data, complex event data, and analysis data.]
PAT High-Level Design – Design Choices
- Semi-automatic source-level instrumentation as the default
  - Minimizes user workload
  - Reduces development complexity
  - Provides easy source-code correlation
- Timing events: execution, transfer, synchronization, completion notification
- Only the P unit and part of the I unit are visible to the user; a single user interface is sufficient
- PAPI will be used: fast becoming the standard for accessing hardware counters (see the sketch below)
- Tracing mode as the default, with profiling support; profiling alone has limited use
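For context, a minimal sketch of hardware-counter access through PAPI's classic high-level API (my illustration of the kind of call a measurement module might make, not this tool's actual code; this particular API has since been superseded in newer PAPI releases):

/* Hedged sketch of PAPI hardware-counter access around a code region. */
#include <papi.h>
#include <stdio.h>

int main(void) {
    int events[2] = { PAPI_TOT_CYC, PAPI_L1_DCM };   /* total cycles, L1 data-cache misses */
    long_long values[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    /* ... region of interest ... */
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += i * 0.5;

    if (PAPI_stop_counters(values, 2) != PAPI_OK)
        return 1;
    printf("cycles=%lld  L1 data-cache misses=%lld\n", values[0], values[1]);
    return 0;
}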
PAT High-Level Design – Design Choices (2)
- Post-mortem data processing and analysis: on-line processing and analysis is likely to disrupt program behavior
- Analyses: load balancing, scalability, memory system
- Visualizations: timeline display, speedup chart, call-tree graph, communication volume graph, memory access graph, profiling table
Visualization Example – Timeline Display
[Screenshot of a timeline display]

Visualization Example – Call-Tree Graph
[Screenshot of a call-tree graph]

Vis. Example – Comm. Volume Graph
[Screenshot of a communication volume graph]

Visualization Example – Profiling Table
[Screenshot of a profiling table]
Future Plans
- Experimental testing to identify appropriate reusable components from existing tools
- Program analysis and performance bottleneck studies
- Prototype development
- Preliminary plans for platform support (subject to change)
  - UPC: Berkeley (open source/cluster), HP (proprietary/server)
  - SHMEM: GPSHMEM (open source/cluster), Cray (proprietary/server)
- Pipelined I, M, A, and P development
- Parallel development to provide support for various implementations
- Beta-testing volunteers needed!
[Gantt chart, Oct 2005 – Mar 2007, with the following tasks:]
1. Develop detailed design
2. Construct, test, and integrate I & M modules
3. Construct, test, and integrate A module
4. Construct, test, and integrate P module
5. Conduct beta tests and refine tool design and implementation
6. Design and implement advanced features
7. Preliminary research in performance optimization via prediction
Q & A
- General feedback?
- Preferred platform support?
- Optimization process (when and how)?
- Fortran SHMEM/GPSHMEM users?
Schedule of Detailed Discussions
2:00 – 3:15 PM Briefing on tool evaluations
3:20 – 3:50 PM Briefing on UPC/SHMEM language analysis and usability study
3:50 – 4:45 PM Briefing on PAT requirements, design, and development plan
You are invited to fill out our short survey! http://www.hcs.ufl.edu/prj/upcgroup/survey.html