
A Tool for Performance Analysis of GPU-Accelerated Applications

Keren Zhou
Department of Computer Science
Rice University, Houston, Texas, USA
keren.zhou@rice.edu

John Mellor-Crummey
Department of Computer Science
Rice University, Houston, Texas, USA
johnmc@rice.edu

Abstract—Architectures for High-Performance Computing (HPC) now commonly employ accelerators such as Graphics Processing Units (GPUs). High-level programming abstractions for accelerated computing include OpenMP as well as RAJA and Kokkos, programming abstractions based on C++ templates. Such programming models hide GPU architectural details and generate sophisticated GPU code organized as many small procedures. For example, a dot product kernel expressed using RAJA atop NVIDIA's Thrust templates yields 35 procedures. Existing performance tools are ill-suited for analyzing such complex kernels because they lack a comprehensive profile view. At best, tools such as NVIDIA's nvvp provide a profile view that shows only limited CPU calling contexts and omits both calling contexts and loops in GPU code. To address this problem, we extended Rice University's HPCToolkit to build a complete profile view for GPU-accelerated applications.
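For reference, a dot product of the kind mentioned above might be written in RAJA roughly as follows. This is a sketch based on RAJA's documented reduction API; the policy names and the block size of 256 are illustrative and vary across RAJA versions. The compiler expands the templates and lambda into the many small GPU procedures described in the abstract.

#include <RAJA/RAJA.hpp>

// Sketch of a RAJA dot product offloaded to a CUDA GPU. The arrays a
// and b are assumed to live in device-accessible (e.g., unified) memory.
double dot(const double *a, const double *b, int n) {
  RAJA::ReduceSum<RAJA::cuda_reduce, double> sum(0.0);
  RAJA::forall<RAJA::cuda_exec<256>>(
      RAJA::RangeSegment(0, n),
      [=] RAJA_DEVICE (int i) { sum += a[i] * b[i]; });
  return sum.get();
}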

I. APPROACH

We extended HPCToolkit [1], [2] with a wait-free measurement subsystem to attribute costs of a GPU kernel, measured using instruction samples, back to the worker thread that launched the kernel. GPU instruction samples are collected using NVIDIA's CUPTI API, which spawns a background CUPTI thread at program launch. When an application thread T launches a GPU kernel, it tags the kernel with a correlation ID C and notifies the CUPTI thread that C belongs to T; when samples associated with C are collected, the CUPTI thread uses this information to attribute the samples back to thread T. The CUPTI thread and application threads coordinate to attribute performance information using wait-free queues. Using this strategy, HPCToolkit can monitor long-running applications with low memory and time overhead.
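To make the coordination concrete, below is a minimal sketch, not HPCToolkit's actual implementation, of the kind of single-producer/single-consumer wait-free queue an application thread could use to tell the monitor thread that correlation ID C belongs to thread T. All type and member names here are hypothetical.

#include <atomic>
#include <cstddef>
#include <cstdint>

struct CorrelationRecord {
  uint64_t correlation_id;  // ID tagged onto the launched kernel
  uint32_t thread_id;       // worker thread that launched the kernel
};

// One queue per application thread: that thread is the only producer
// and the CUPTI monitor thread is the only consumer, so plain atomic
// loads/stores suffice and both operations finish in bounded steps.
template <size_t N>  // N must be a power of two
class WaitFreeQueue {
  CorrelationRecord buf_[N];
  std::atomic<size_t> head_{0};  // advanced by the consumer
  std::atomic<size_t> tail_{0};  // advanced by the producer
public:
  // Producer: record "correlation ID C belongs to thread T".
  bool push(const CorrelationRecord &r) {
    size_t t = tail_.load(std::memory_order_relaxed);
    if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
    buf_[t % N] = r;
    tail_.store(t + 1, std::memory_order_release);  // publish the record
    return true;
  }
  // Consumer: drain pending records before attributing samples.
  bool pop(CorrelationRecord &out) {
    size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
    out = buf_[h % N];
    head_.store(h + 1, std::memory_order_release);
    return true;
  }
};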

To attribute costs to GPU code, HPCToolkit recovers loops and calling contexts in GPU machine code. HPCToolkit reconstructs GPU control flow graphs by parsing branch targets and analyzes the graphs to identify loop nests. To understand calling contexts in GPU code, HPCToolkit recovers static call graphs and transforms them into calling context trees by splitting call edges and cloning called procedures.
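The transformation from a static call graph to a calling context tree can be pictured with a small sketch. Assuming an acyclic graph (recursion is handled separately via the supernodes described below), each call edge is split into its own tree node and the callee is cloned beneath it, so a procedure appears once per calling context rather than once per program. Names and types here are illustrative, not HPCToolkit's code.

#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct CallSite { uint64_t pc; std::string callee; };

// Static call graph: procedure name -> call sites it contains.
using CallGraph = std::map<std::string, std::vector<CallSite>>;

struct CCTNode {
  std::string proc;
  uint64_t call_pc;  // call instruction that created this context
  std::vector<std::unique_ptr<CCTNode>> children;
};

// Clone 'proc' and, transitively, everything it calls: each call edge
// becomes a distinct child node in the calling context tree.
std::unique_ptr<CCTNode> clone(const CallGraph &cg,
                               const std::string &proc, uint64_t pc) {
  auto node = std::make_unique<CCTNode>();
  node->proc = proc;
  node->call_pc = pc;
  auto it = cg.find(proc);
  if (it != cg.end())
    for (const CallSite &cs : it->second)
      node->children.push_back(clone(cg, cs.callee, cs.pc));
  return node;
}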

To analyze performance data for each GPU calling context, HPCToolkit first attributes costs to inlined functions, loop nests, and individual source lines in GPU procedures. Next, it apportions the costs of each GPU procedure to its call sites. If necessary, HPCToolkit uses the sample count of each call instruction to divide a GPU procedure's costs among its multiple call sites. To cope with recursive calls, HPCToolkit merges all GPU procedures in the same strongly connected component into a "supernode" and apportions costs for the supernode as a whole.

Fig. 1. HPCToolkit's hpcviewer attributing performance metrics to a GPU context that includes a call chain and a loop
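A hedged sketch of the apportioning step: given the instruction samples observed at each call instruction, a procedure's cost is divided among its call sites in proportion to those counts. The names below are hypothetical, not HPCToolkit's API.

#include <cstdint>
#include <vector>

struct CallSiteSamples {
  uint64_t call_pc;  // address of the call instruction
  uint64_t samples;  // instruction samples attributed to that call
};

// Returns the share of 'total_cost' assigned to each call site.
std::vector<double> apportion(double total_cost,
                              const std::vector<CallSiteSamples> &sites) {
  uint64_t total = 0;
  for (const auto &s : sites) total += s.samples;
  std::vector<double> shares;
  shares.reserve(sites.size());
  for (const auto &s : sites)
    // If no call instruction was sampled, split the cost evenly.
    shares.push_back(total ? total_cost * s.samples / total
                           : total_cost / sites.size());
  return shares;
}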

II. INITIAL RESULTS

We have used HPCToolkit to analyze accelerated applications written using RAJA and OpenMP on the Summit supercomputer, whose compute nodes are equipped with IBM POWER9 processors and NVIDIA Volta GPUs. Figure 1 shows HPCToolkit's detailed analysis of calling contexts and loops in GPU code, which enables precise attribution of costs for executions of GPU-accelerated applications and helps users identify hotspots and bottlenecks.

ACKNOWLEDGMENTS

This research was partially supported by LLNL Subcontract B626605 of DOE Prime Contract DE-AC52-07NA27344 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (the Office of Science and the National Nuclear Security Administration).

REFERENCES

[1] L. Adhianto et al., "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.

[2] Rice University, "HPCToolkit performance tools project," http://hpctoolkit.org.
