CUDA Dynamic Parallelism
A Debugger Developer's Take on the Kernel of a Revolution
Copyright © 2011 Rogue Wave Software | All Rights Reserved
What are we talking about?
• Problem: recursion and similar workloads
o Solution: CUDA dynamic parallelism
• Problem: debugging CUDA is hard
o Dynamic or not: it’s still hard
o Solution: TotalView
Your speaker
• Rogue Wave Software
o Since 1989
o Tools.h++ - the proto-C++ Standard Library
o Acquired TotalView in 2009
• Larry Edelstein
o Around even longer
o Salesforce.com, Lotus, CNET, Klout
o Technical sales and solutions architecture
Some workloads are so hard (for CUDA)!
• Parallel tasks that create more parallel tasks
• Parallel recursive tasks
Quicksort
• Partition the array
• Recurse
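The two steps above can be sketched on the host in plain C++. This is a minimal illustration, not code from the talk: the names `partition_range` and `quicksort`, the half-open ranges, and the choice of the last element as pivot (Lomuto partitioning) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: Lomuto partition over the half-open range [lo, hi).
// Uses the last element as the pivot, moves smaller elements to the left,
// and returns the pivot's final index.
std::size_t partition_range(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    assert(lo < hi && hi <= a.size());
    int pivot = a[hi - 1];
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        if (a[i] < pivot) {
            std::swap(a[i], a[store]);
            ++store;
        }
    }
    std::swap(a[store], a[hi - 1]);
    return store;
}

// Step 1: partition the array. Step 2: recurse on each side of the pivot.
void quicksort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;  // ranges of 0 or 1 elements are already sorted
    std::size_t p = partition_range(a, lo, hi);
    quicksort(a, lo, p);
    quicksort(a, p + 1, hi);
}
```

Calling `quicksort(v, 0, v.size())` sorts `v` in place. The two recursive calls are independent of each other, which is what makes the algorithm attractive to parallelize and awkward to express without recursion.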
Quicksort
Source: http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/
How do you parallelize quicksort?
• Save pending tasks on a stack
o Code complexity: a work stack shared between CPU and GPU
• Run one stage at a time
• Synchronize after each stage
o Costly: short sorts must wait for long sorts
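The stage-at-a-time scheme can be sketched on the host as follows. This is an illustration under stated assumptions, not the talk's implementation: `Range`, `partition_stage`, and `staged_quicksort` are hypothetical names, and the inner loop stands in for what would be a single kernel launch on a GPU (one pending range per block), followed by a synchronization before the next stage.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Range { std::size_t lo, hi; };  // half-open range [lo, hi)

// Hypothetical helper: Lomuto partition of one range, last element as pivot.
std::size_t partition_stage(std::vector<int>& a, Range r) {
    assert(r.lo < r.hi && r.hi <= a.size());
    int pivot = a[r.hi - 1];
    std::size_t store = r.lo;
    for (std::size_t i = r.lo; i + 1 < r.hi; ++i) {
        if (a[i] < pivot) {
            std::swap(a[i], a[store]);
            ++store;
        }
    }
    std::swap(a[store], a[r.hi - 1]);
    return store;
}

// Sorts `a` one stage at a time and returns the number of stages, i.e. the
// number of points at which every range must finish before work continues.
std::size_t staged_quicksort(std::vector<int>& a) {
    std::vector<Range> pending;  // the work stack
    if (a.size() > 1) pending.push_back({0, a.size()});
    std::size_t stages = 0;
    while (!pending.empty()) {
        std::vector<Range> next;
        // On a GPU, this loop would be one launch covering all pending ranges.
        for (Range r : pending) {
            std::size_t p = partition_stage(a, r);
            if (p - r.lo > 1) next.push_back({r.lo, p});
            if (r.hi - (p + 1) > 1) next.push_back({p + 1, r.hi});
        }
        // Synchronization point: the next stage starts only after the
        // slowest range in this stage has been partitioned.
        pending = std::move(next);
        ++stages;
    }
    return stages;
}
```

The returned stage count makes the cost visible: a short range that finishes early still waits at each synchronization point for the longest range in the same stage.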
Need a better way
• Dynamic workloads
• Move logic into kernel
• Recurse within kernel
Dynamic parallelism
• Introduced in CUDA 5.0
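• Familiar syntax: a kernel launches another kernel with the usual <<<...>>> notation
    __global__ void myKernel(..) {
        doWork();
        myOtherKernel<<<x, y>>>(..);  // child grid of x blocks, y threads each
        doMoreWork();
    }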
Not dynamic vs. dynamic
(Side-by-side code comparison. The non-dynamic version also needs all the code required to share a stack between CPU and GPU.)
Performance
That’s great!
but
Debugging CUDA is a challenge
• Two separate realms of processing
• Highly parallel
• Dynamic
o It’s a complex graph of grids
• Call stack? Not exactly.
• Steer using logical and device coordinates
If we had a debugger that could...
• Show me the active kernels on the device
• Let me set a breakpoint in any kernel
• Help me navigate from kernel to kernel
• Tell me the relationships between kernels
TotalView with CUDA support
• CUDA and host code display the same
• Set breakpoints, see variables
• Control execution as much as possible
o control by warp
• Navigate device threads
o logical coordinates
o device coordinates
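To make the two coordinate systems concrete: logical coordinates name a thread by its block and thread indices, while device coordinates name the hardware that runs it (SM, warp, lane). The sketch below shows only the part of the mapping that can be computed: a thread's warp and lane within its block. The names `Dim3`, `WarpLane`, and `warp_and_lane` are hypothetical, and a warp size of 32 is assumed. The SM and the hardware warp slot are chosen by the scheduler at run time, so a debugger has to report them; they cannot be derived this way.

```cpp
#include <cassert>

constexpr unsigned kWarpSize = 32;  // assumed warp size

struct Dim3 { unsigned x, y, z; };               // stand-in for CUDA's dim3
struct WarpLane { unsigned warp; unsigned lane; };

// Linear index of a thread within its block: x varies fastest, then y, then z.
unsigned linear_thread_id(Dim3 thread, Dim3 block_dim) {
    assert(thread.x < block_dim.x && thread.y < block_dim.y && thread.z < block_dim.z);
    return thread.x + thread.y * block_dim.x + thread.z * block_dim.x * block_dim.y;
}

// Warp index within the block (not the hardware warp slot) and lane within
// that warp, derived from the logical thread coordinates.
WarpLane warp_and_lane(Dim3 thread, Dim3 block_dim) {
    unsigned tid = linear_thread_id(thread, block_dim);
    return {tid / kWarpSize, tid % kWarpSize};
}
```

For example, thread (5, 2, 0) in a 16 x 16 x 1 block has linear index 37, so it is lane 5 of the block's second warp (warp index 1). Threads that share a warp are the ones that move together when execution is controlled by warp.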
TotalView with dynamic support
• TotalView can debug CUDA dynamic parallelism programs using the CUDA 5.5 toolchain and runtime
• Dynamically launched CUDA kernels report which kernels launched them (their parent kernels)
TotalView details
• Linux, Unix, and Mac OS X
• C/C++ and Fortran
Device status display
Questions
Acknowledgements
• http://blogs.nvidia.com/blog/2012/09/12/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/
• https://www.hackerrank.com/challenges/quicksort2