Introduction to Nsight Systems for Summit

Holly Wilper, [email protected], March 9, 2020


Page 2: Summit Introduction to Nsight Systems for

2

Legacy Transition

CPU-GPU interactions & triage

Low overhead capture

GPU compute & graphics

Faster GUI + more data

GPU CUDA kernel analysis & debug

Very high freq GPU perf counters

Compare results (diff)

Incredible statistics & customizable

NVIDIA Visual Profiler: standalone UI

nvprof: command-line tool

Nsight Systems: standalone GUI + CLI

Nsight Compute: standalone GUI + CLI

Page 3: Summit Introduction to Nsight Systems for

3

Nsight Product Family

Nsight Systems - Analyze application algorithm system-wide

Nsight Compute - Debug/optimize CUDA kernel

Nsight Graphics - Debug/optimize graphics workloads

Workflow

Page 4: Summit Introduction to Nsight Systems for

4

Tuning an Orchestra of Tasks

Page 5: Summit Introduction to Nsight Systems for

5

Simulation: Lattice Microbes

Page 6: Summit Introduction to Nsight Systems for

6

Pro Visualization & Games

Page 7: Summit Introduction to Nsight Systems for

7

Overview

• System-wide application algorithm tuning with multi-process tree support
• Locate optimization opportunities
    • Visualize millions of events on a very fast GUI timeline
    • See gaps of unused CPU and GPU time
• Balance your workload across multiple CPUs and GPUs
    • CPU algorithms, utilization, and thread state
    • GPU streams, kernels, memory transfers, etc.
• Multi-platform:
    • Host GUI: Linux, Windows, Mac
    • Target arch: x86-64, IBM Power, ARM server, Tegra

Page 8: Summit Introduction to Nsight Systems for

8

Timeline Features on Power Arch

• Compute
    • CUDA 10+ API & GPU workload ranges & memory transfers with correlation
    • OpenACC
    • OpenMP (next release)
• OS
    • Thread state and CPU utilization
    • MPI (OpenMPI & MPICH)
• NVTX User Annotations API
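The trace sources listed on this slide are chosen on the `nsys profile` command line. The following is a minimal sketch, not taken from the slides: `./my_app` and the report name `trace_report` are placeholders, and which `--trace` values are accepted depends on the nsys version and platform, so check `nsys profile --help` on the target first. The helper only prints the command it would run.

```shell
#!/bin/bash
# Sketch: build an nsys command line that enables selected trace sources.
# ASSUMPTIONS: ./my_app and trace_report are placeholders; the set of valid
# --trace values depends on the installed nsys version and the platform.

trace_cmd() {
    local traces="$1"; shift
    # --trace picks the APIs to record; --mpi-impl tells nsys which MPI
    # flavor (openmpi or mpich) the application is linked against.
    printf 'nsys profile --trace=%s --mpi-impl=openmpi -o trace_report' "$traces"
    printf ' %s' "$@"
    printf '\n'
}

# Print (do not run) a command tracing CUDA, NVTX, OpenACC and MPI.
trace_cmd cuda,nvtx,openacc,mpi ./my_app
```

Dropping a source from the comma-separated list reduces overhead and report size when that API is not of interest.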

Page 9: Summit Introduction to Nsight Systems for

9

Other Key Features

• Thread call-stack periodic sampling
    • Backtraces via frame pointers (or DWARF unwind in next release)
    • Hot functions
• Command Line Interface (CLI)
    • No host PC required to record
    • Works in containers & VMs
    • Usable with access limitations
    • Scriptable / interactive mode
    • Multiple sessions
    • Multiple reports per launch

Page 10: Summit Introduction to Nsight Systems for

Demo

Page 11: Summit Introduction to Nsight Systems for

Feature Highlights

Page 12: Summit Introduction to Nsight Systems for

12

Processes and threads

OpenGL not on roadmap for Power

Multi-GPU

Kernel and memory transfer activities

cuDNN and cuBLAS trace

Thread/core migration

Thread state

Page 13: Summit Introduction to Nsight Systems for

13

User annotations APIs for CPU & GPU

Example: Visual Molecular Dynamics (VMD) algorithms visualized with NVTX on CPU

Page 14: Summit Introduction to Nsight Systems for

14

OS runtime trace (OSRT) (coming to Power in a future release)

Page 15: Summit Introduction to Nsight Systems for

15

Function table shows statistics from periodic call-stack backtraces

Page 16: Summit Introduction to Nsight Systems for

16

Event Table

Page 17: Summit Introduction to Nsight Systems for

17

OpenMP 5.0 trace (coming soon) v4+

Page 18: Summit Introduction to Nsight Systems for

18

MPI trace

Page 19: Summit Introduction to Nsight Systems for

19

GPU API launch to HW workload correlation

Page 20: Summit Introduction to Nsight Systems for

20

NVTX Domains - Hoisting & Hierarchies

Domains look like APIs, and ‘/’ forms a hierarchy

Page 21: Summit Introduction to Nsight Systems for

21

CUDA graph launches show all related GPU ranges

Page 22: Summit Introduction to Nsight Systems for

Correlation highlights in ruler

Hidden to right

Hidden below

Hidden in sub-row

Row has highlights

CPU-GPU correlation & location assistance

Page 23: Summit Introduction to Nsight Systems for

23

cudaLaunchCooperativeKernelMultiDevice

From Caffe2 Resnet50 within a container on a DGX-2

CPU launch

Multi-GPU workload

Page 24: Summit Introduction to Nsight Systems for

24

Zooming in reveals gaps where there were valleys

GPU utilization based on percentage time coverage
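Why zooming turns valleys into gaps can be seen with a small calculation. The sketch below is an illustration only, not how Nsight Systems computes its utilization row: it assumes non-overlapping kernel intervals given as `start end` pairs, and the timestamps are made up. Utilization per bin is the fraction of the bin's time covered by kernels.

```shell
#!/bin/bash
# Sketch: utilization as percentage time coverage, at two zoom levels.
# ASSUMPTIONS: intervals are non-overlapping "start end" pairs in the same
# time unit; the sample timestamps below are hypothetical.

coverage() {  # usage: coverage BIN_WIDTH TOTAL_TIME < intervals
    awk -v w="$1" -v total="$2" '
    { s[NR] = $1 + 0; e[NR] = $2 + 0 }
    END {
        for (b = 0; b < total; b += w) {
            cov = 0
            for (i = 1; i <= NR; i++) {
                # Clip each interval to the current bin [b, b + w).
                lo = (s[i] > b) ? s[i] : b
                hi = (e[i] < b + w) ? e[i] : b + w
                if (hi > lo) cov += hi - lo
            }
            printf "%d-%d: %d%%\n", b, b + w, 100 * cov / w
        }
    }'
}

kernels='0 10
20 30
40 50
60 70'

echo "zoomed out (one 80-unit bin):"
echo "$kernels" | coverage 80 80
echo "zoomed in (10-unit bins):"
echo "$kernels" | coverage 10 80
```

At the coarse level the single bin reports 50% (a valley); at the fine level the bins alternate between 100% and 0%, which is the gap pattern the slide describes.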

Page 25: Summit Introduction to Nsight Systems for

25

CUDA memory transfer color palette shows direction and pageable memory hazards

Page 26: Summit Introduction to Nsight Systems for

26

CUDA unified virtual memory (UVM) transfers

Page 27: Summit Introduction to Nsight Systems for

27

TensorRT trace

From DeepStream

Page 28: Summit Introduction to Nsight Systems for

28

NVTX ranges projected onto the GPU

From DeepStream

CPU launch

GPU workload

Page 29: Summit Introduction to Nsight Systems for

29

NVTX deep highlighting and correlation

From DeepStream

CPU launches

GPU workloads

Page 30: Summit Introduction to Nsight Systems for

30

TensorFlow Resnet50 DNN nodes as NVTX ranges projected onto the GPU

Page 31: Summit Introduction to Nsight Systems for

31

CLI statistics and export (SQLite & JSON)
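Besides `--stats=true` at profile time, recent nsys releases can summarize and export an existing report. The sketch below assumes a report named `new_0.qdrep` (a placeholder; `.qdrep` is the extension used by 2020-era releases) and that the installed version provides the `stats` and `export` subcommands. The `run` helper is hypothetical: it always prints the command and executes it only when nsys and the report are present.

```shell
#!/bin/bash
# Sketch: summarize and export an existing report from the command line.
# ASSUMPTIONS: the report name is a placeholder; `nsys stats` and
# `nsys export` must be available in the installed nsys version.
report="${1:-new_0.qdrep}"

run() {
    # Always show the command; execute only if nsys and the report exist.
    echo "+ $*"
    if command -v nsys >/dev/null 2>&1 && [ -f "$report" ]; then
        "$@"
    fi
}

# Print the summary tables (CUDA API, kernels, memory operations, ...).
run nsys stats "$report"

# Export the raw data for custom queries or post-processing.
run nsys export --type sqlite --output "${report%.qdrep}.sqlite" "$report"
run nsys export --type json --output "${report%.qdrep}.json" "$report"
```

The SQLite file can then be queried with any SQLite client; table names vary between versions, so inspect the schema before relying on them.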

Page 32: Summit Introduction to Nsight Systems for

32

Stats/Export - CUDA API summary

Page 33: Summit Introduction to Nsight Systems for

33

Stats/Export - CUDA kernel summary

Page 34: Summit Introduction to Nsight Systems for

34

Stats/Export - OS Runtime API summary - in 2020.2

Page 35: Summit Introduction to Nsight Systems for

35

Stats/Export - NVTX code annotations

Note this includes TensorRT domains

Page 36: Summit Introduction to Nsight Systems for

36

Export - Thread call stack samples

Page 37: Summit Introduction to Nsight Systems for

37

Running on Summit

Page 38: Summit Introduction to Nsight Systems for

38

Load Nsys Module
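The slide shows this step as a screenshot. A minimal sketch of the step follows, assuming the module is named `nsight-systems`; that name is an assumption, so confirm it with `module avail` on Summit.

```shell
#!/bin/bash
# Sketch: make nsys available in a login or batch shell.
# ASSUMPTION: the module is called "nsight-systems"; override NSYS_MODULE
# if `module avail` shows a different name or a versioned variant.
NSYS_MODULE="${NSYS_MODULE:-nsight-systems}"

load_nsys() {
    # `module` is a shell function provided by the site's module system.
    if type module >/dev/null 2>&1; then
        module load "$NSYS_MODULE"
    fi
    # Report whether nsys ended up on PATH.
    if command -v nsys >/dev/null 2>&1; then
        echo "nsys found: $(command -v nsys)"
    else
        echo "nsys not found; try: module avail $NSYS_MODULE"
    fi
}

load_nsys
```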

Page 39: Summit Introduction to Nsight Systems for

39

Create and run script

#!/bin/bash
#BSUB -P VEN201
#BSUB -W 2:00
#BSUB -nnodes 2
#BSUB -alloc_flags gpumps
#BSUB -o stdoutput.%J
#BSUB -e stderror.%J

cd /gpfs/alpine/world-shared/ven201/skottap/GROMACS_2020_NEW_VERSION/water_boxes/water-cut1.0_GMX50_bare/0048

export OMP_NUM_THREADS=7

jsrun -n 1 -a 6 -c 42 -g 6 -r 1 -l CPU-CPU -d plane:6 -b packed:7 \
    --smpiargs="-disable_gpu_hooks" \
    nsys profile -o /gpfs/alpine/world-shared/ven201/skottap/GROMACS_2020_NEW_VERSION/new_%q{OMPI_COMM_WORLD_RANK} -f true --stats=true \
    /gpfs/alpine/world-shared/ven201/skottap/GROMACS_2020_NEW_VERSION/gromacs-2020/build/bin/gmx_mpi mdrun -ntomp 7 -pme gpu -npme 1 -noconfout -nb gpu -pin off -nsteps 10000
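In the jsrun line, `%q{OMPI_COMM_WORLD_RANK}` asks nsys to substitute the value of that environment variable into the output name, so each MPI rank writes its own report. The sketch below shows the same naming done in the shell by a hypothetical wrapper function (`nsys_rank_cmd` is not part of nsys or of the slides). It only prints the command, and it assumes the MPI launcher exports `OMPI_COMM_WORLD_RANK` to every rank.

```shell
#!/bin/bash
# Sketch of a per-rank wrapper (hypothetical helper, not from the slides).
# ASSUMPTION: the MPI launcher exports OMPI_COMM_WORLD_RANK to each rank.

nsys_rank_cmd() {
    local prefix="$1"; shift
    local rank="${OMPI_COMM_WORLD_RANK:-0}"   # fall back to 0 outside MPI
    # -o, -f and --stats are the options used in the slide's jsrun line.
    printf 'nsys profile -o %s_%s -f true --stats=true' "$prefix" "$rank"
    printf ' %s' "$@"
    printf '\n'
}

# Print the command that rank 3 would run.
OMPI_COMM_WORLD_RANK=3 nsys_rank_cmd new ./gmx_mpi mdrun -nsteps 10000
```

With six ranks this naming yields six reports, `new_0` through `new_5`, which avoids ranks overwriting one another's output.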

Page 40: Summit Introduction to Nsight Systems for

40

Copy File back for Viewing
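The slide shows this step as a screenshot. One way to do it is `scp` from a local machine, sketched below. The user name, remote directory, and host are placeholders or assumptions (your site may recommend a dedicated data-transfer node instead), and the helper only prints the command.

```shell
#!/bin/bash
# Sketch: build the scp command that pulls per-rank reports to a laptop.
# ASSUMPTIONS: user, directories and host are placeholders; reports from
# 2020-era nsys releases carry the .qdrep extension.

copy_cmd() {
    local user="$1" remote_dir="$2" local_dir="${3:-.}"
    # The remote glob is quoted so it expands on the remote side.
    printf "scp '%s@summit.olcf.ornl.gov:%s/*.qdrep' %s\n" \
        "$user" "$remote_dir" "$local_dir"
}

# Print the command; run it from the machine that has the Nsight Systems GUI.
copy_cmd myuser /path/to/reports ./reports
```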

Page 41: Summit Introduction to Nsight Systems for

41

Open File for Visualization

Page 42: Summit Introduction to Nsight Systems for

42

Example problems

Page 43: Summit Introduction to Nsight Systems for

43

GPU idle and low utilization level of detail

Page 44: Summit Introduction to Nsight Systems for

44

Fusion opportunities

CPU launch cost + small GPU work size ≈ GPU sparse idle

This can apply to DNN nodes/layers

Page 45: Summit Introduction to Nsight Systems for

45

cudaMemcpyAsync behaving synchronously

Device to host pageable memory

Mitigate with pinned memory

~150us

~1.2us

Page 46: Summit Introduction to Nsight Systems for

46

Example GPU idle caused by stream synchronization

Page 47: Summit Introduction to Nsight Systems for

47

OS Runtime API Trace

Example: Mask-RCNN

Map/unmap hiccups

Mitigate by pipelining

● Map 1 batch ahead
● Unmap last batch
● Swap pointers here instead

Page 48: Summit Introduction to Nsight Systems for

48

Nsight Product Family

Nsight Systems - Analyze application algorithm system-wide

Nsight Compute - Debug/optimize CUDA kernel

Nsight Graphics - Debug/optimize graphics workloads

Workflow

Page 49: Summit Introduction to Nsight Systems for

49

Download https://developer.nvidia.com/nsight-systems

NOTE: Versions on the website are newer than the one in the CUDA Toolkit

Forums https://devtalk.nvidia.com

Email [email protected]

THANK YOU!

Page 50: Summit Introduction to Nsight Systems for

Backup