CUDA Dynamic Parallelism - NVIDIA · 2014. 4. 18. · The first focus will be on how dynamic parallelism is implemented and what users need to know to understand what is really going

CUDA Dynamic Parallelism

A Debugger Developer's Take on the Kernel of a Revolution

Copyright © 2011 Rogue Wave Software | All Rights Reserved

What are we talking about?

• Problem: recursion & similar

o Solution: CUDA dynamic parallelism

• Problem: debugging CUDA is hard

o Dynamic or not: it’s still hard

o Solution: TotalView


Your speaker

• Rogue Wave Software

o Since 1989

o Tools.h++ - the proto-C++ Standard Library

o Acquired TotalView in 2009

• Larry Edelstein

o Around even longer

o Salesforce.com, Lotus, CNET, Klout

o Technical sales and solutions architecture


Some workloads are so hard (for CUDA)!

• Parallel tasks that create more parallel tasks

• Parallel recursive tasks


Quicksort

• Partition the array

• Recurse
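As a reference point, the two steps above can be sketched as plain host-side code. This is a minimal illustration and is not taken from the talk: the names `partitionRange` and `quicksortHost`, and the choice of the last element as pivot (Lomuto partition), are assumptions made for the example.

```cuda
#include <utility>  // std::swap

// Lomuto partition (assumed scheme): moves every element <= pivot to the
// left part of data[lo..hi] and returns the pivot's final index.
int partitionRange(int* data, int lo, int hi) {
    int pivot = data[hi];
    int i = lo;
    for (int j = lo; j < hi; ++j) {
        if (data[j] <= pivot) {
            std::swap(data[i], data[j]);
            ++i;
        }
    }
    std::swap(data[i], data[hi]);
    return i;
}

// Sorts data[lo..hi] (inclusive bounds): partition, then recurse on each side.
void quicksortHost(int* data, int lo, int hi) {
    if (lo >= hi) return;
    int p = partitionRange(data, lo, hi);
    quicksortHost(data, lo, p - 1);
    quicksortHost(data, p + 1, hi);
}
```

The two recursive calls work on disjoint ranges, which is what makes the algorithm a candidate for parallel execution.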


Quicksort

Source:

http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/


How do you parallelize quicksort?

• Save tasks on stack

o Code complexity - shared CPU-GPU work stack

• Run a stage at a time

• Synch after each stage

o Costly: short sorts must wait for long sorts
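The stage-at-a-time approach can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code behind the slide: `Range`, `partitionStage`, and `quicksortStaged` are hypothetical names, each pending range is partitioned by a single thread, the work list lives on the host and is copied to the device each stage, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Range { int lo, hi; };  // inclusive bounds of one pending sub-array

// One block per pending range; its single thread partitions the range
// (Lomuto scheme, assumed) and records where the pivot ended up.
__global__ void partitionStage(int* data, const Range* ranges, int* pivots) {
    Range r = ranges[blockIdx.x];
    int pivot = data[r.hi];
    int i = r.lo;
    for (int j = r.lo; j < r.hi; ++j) {
        if (data[j] <= pivot) {
            int t = data[i]; data[i] = data[j]; data[j] = t;
            ++i;
        }
    }
    int t = data[i]; data[i] = data[r.hi]; data[r.hi] = t;
    pivots[blockIdx.x] = i;
}

// Host-driven quicksort over a device array d_data holding n ints.
void quicksortStaged(int* d_data, int n) {
    std::vector<Range> work;
    if (n > 1) work.push_back({0, n - 1});
    while (!work.empty()) {
        int count = static_cast<int>(work.size());
        Range* d_ranges = nullptr;
        int* d_pivots = nullptr;
        cudaMalloc(&d_ranges, count * sizeof(Range));
        cudaMalloc(&d_pivots, count * sizeof(int));
        cudaMemcpy(d_ranges, work.data(), count * sizeof(Range),
                   cudaMemcpyHostToDevice);
        partitionStage<<<count, 1>>>(d_data, d_ranges, d_pivots);
        // Stage barrier: every range, short or long, must finish before
        // the host can schedule the next stage.
        cudaDeviceSynchronize();
        std::vector<int> pivots(count);
        cudaMemcpy(pivots.data(), d_pivots, count * sizeof(int),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_ranges);
        cudaFree(d_pivots);
        // Build the next stage from the sub-ranges that still need sorting.
        std::vector<Range> next;
        for (int k = 0; k < count; ++k) {
            if (work[k].lo < pivots[k] - 1) next.push_back({work[k].lo, pivots[k] - 1});
            if (pivots[k] + 1 < work[k].hi) next.push_back({pivots[k] + 1, work[k].hi});
        }
        work.swap(next);
    }
}
```

Every stage costs a host-device round trip, and the `cudaDeviceSynchronize()` call is where short sorts end up waiting for long ones.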


Need a better way

• Dynamic workloads

• Move logic into kernel

• Recurse within kernel


Dynamic parallelism

• Introduced in CUDA 5.0

• Familiar syntax:

__global__ void myKernel(..) {
    doWork();
    myOtherKernel<<<x, y>>>(..);
    doMoreWork();
}
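Applied to quicksort, a kernel that launches its own children might look like the sketch below. It is a hedged illustration in the spirit of a single-thread-per-kernel quicksort and not the code shown in the talk: `quicksortKernel`, `selectionSort`, `quicksortDynamic`, `MAX_DEPTH`, and `SMALL_RANGE` are assumed names and values, and error checking is omitted. Building it needs a device of compute capability 3.5 or higher, relocatable device code (`-rdc=true`), and linking against `cudadevrt`.

```cuda
#include <cuda_runtime.h>

#define MAX_DEPTH 16    // assumed cap, kept below the device nesting limit
#define SMALL_RANGE 32  // assumed cutoff for switching to a simple sort

// Simple in-place sort used for small ranges or when recursion is too deep.
__device__ void selectionSort(int* data, int lo, int hi) {
    for (int i = lo; i <= hi; ++i) {
        int minIdx = i;
        for (int j = i + 1; j <= hi; ++j)
            if (data[j] < data[minIdx]) minIdx = j;
        int t = data[i]; data[i] = data[minIdx]; data[minIdx] = t;
    }
}

// Sorts data[lo..hi] (inclusive). Launched with one thread per grid.
__global__ void quicksortKernel(int* data, int lo, int hi, int depth) {
    if (depth >= MAX_DEPTH || hi - lo <= SMALL_RANGE) {
        selectionSort(data, lo, hi);
        return;
    }
    // Hoare-style partition around the middle element.
    int pivot = data[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            int t = data[i]; data[i] = data[j]; data[j] = t;
            ++i; --j;
        }
    }
    // Recurse inside the kernel: each half gets its own child grid, in its
    // own stream so the two halves may run concurrently.
    if (lo < j) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        quicksortKernel<<<1, 1, 0, s>>>(data, lo, j, depth + 1);
        cudaStreamDestroy(s);
    }
    if (i < hi) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        quicksortKernel<<<1, 1, 0, s>>>(data, i, hi, depth + 1);
        cudaStreamDestroy(s);
    }
}

// Host entry point: one launch, one wait. A parent grid is not complete
// until all of its children are, so this covers the whole tree.
void quicksortDynamic(int* d_data, int n) {
    if (n < 2) return;
    quicksortKernel<<<1, 1>>>(d_data, 0, n - 1, 0);
    cudaDeviceSynchronize();
}
```

The host no longer keeps a work stack or synchronizes per stage. This sketch is not tuned for performance, and very large inputs may need the device runtime's pending-launch limit raised.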


Not dynamic vs. dynamic

[Figure: side-by-side code listings. The non-dynamic version additionally needs all the code required to share a stack between CPU and GPU.]


Performance


That’s great!

but


Debugging CUDA is a challenge

• Two separate realms of processing

• Highly parallel

• Dynamic

o It’s a complex graph of grids

• Call stack? Not exactly.

• Steer using logical and device coordinates


If we had a debugger that could...

• Show me the active kernels on the device

• Let me set a breakpoint in any kernel

• Help me navigate from kernel to kernel

• Tell me the relationships between kernels


TotalView with CUDA support

• CUDA and host code display the same

• Set breakpoints, see variables

• Control execution as much as possible

o control by warp

• Navigate device threads

o logical coordinates

o device coordinates
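To make the two coordinate systems concrete, the sketch below has each thread record both views of itself. It is an assumption-laden illustration and is not how TotalView obtains this information: `ThreadCoords`, `recordCoords`, and `captureCoords` are hypothetical names, and the device-side values are read from the PTX special registers `%smid`, `%warpid`, and `%laneid`. The warp ID in particular is only a snapshot and may change during a thread's lifetime.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct ThreadCoords {
    unsigned blockX, threadX;  // logical coordinates (1-D launch assumed)
    unsigned sm, warp, lane;   // device (hardware) coordinates
};

__global__ void recordCoords(ThreadCoords* out) {
    unsigned sm, warp, lane;
    // PTX special registers describing where this thread physically runs.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(sm));
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(warp));
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid].blockX = blockIdx.x;
    out[gid].threadX = threadIdx.x;
    out[gid].sm = sm;
    out[gid].warp = warp;
    out[gid].lane = lane;
}

// Launches the kernel and returns one record per thread (hypothetical helper).
std::vector<ThreadCoords> captureCoords(int blocks, int threadsPerBlock) {
    int total = blocks * threadsPerBlock;
    ThreadCoords* d_out = nullptr;
    cudaMalloc(&d_out, total * sizeof(ThreadCoords));
    recordCoords<<<blocks, threadsPerBlock>>>(d_out);
    std::vector<ThreadCoords> result(total);
    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, total * sizeof(ThreadCoords),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

Logical coordinates are fixed by the launch configuration, while the device coordinates depend on how the hardware scheduled the blocks and can differ from run to run.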


TotalView with dynamic support

• TotalView can debug CUDA dynamic parallelism programs using the CUDA 5.5 toolchain and runtime

• Dynamically launched CUDA kernels report their parent kernels (the kernels that launched them)


TotalView details

• Linux, Unix, and Mac OS X

• C/C++ and Fortran



Device status display


Questions


Acknowledgements

• http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/

• https://www.hackerrank.com/challenges/quicksort2