© John A. Stratton 2009, ECE 498AL, University of Illinois, Urbana-Champaign
ECE 498AL
Lecture 23: Kernel and Algorithm Patterns for CUDA
Objective
• Learn about algorithm patterns and principles
  – What are my threads? Where does my data go?
• This lecture covers several patterns that have seen significant success with CUDA, and how they got there:
  – Input/Output Convolution
  – Bounded Input/Output Convolution
  – Stencil Computation
  – Input/Input Convolution
  – Bounded Input/Input Convolution
Two Questions
• For every application, a CUDA implementation begins with two questions:
  – What work does each thread do?
  – In which memory space should each piece of data live?
Assumptions
• Computation is independent (parallel) unless otherwise stated
  – i.e. reductions are the only real presence of serialization
• A work unit (as presented) is reasonably small for one CUDA thread
  – We won't discuss cases where the “tasks” are just too large to fit into a thread
• Global memory is big enough to hold your entire dataset
  – There is another level of issues to address when it isn't
A Few Commonalities: Reductions and Memory Patterns
Reduction patterns in CUDA
• Local
  – One thread performs an entire, unique reduction
  – Example: Matrix Multiplication
• In-Block
  – Only threads within a block contribute to a reduction
  – Only slightly less efficient than local reduction, especially on newer hardware
• Global
  – Every thread contributes to the reduction: two subtypes
  – Blocked: threads within a block contribute to the same reduction, allowing some component of block reduction
    • The more reduction you can do within a block, the better
  – Scattered: threads in a block do not contribute to the same reduction
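The blocked-global pattern above can be sketched as a kernel that reduces in shared memory first, then contributes one value per block to the global reduction. This is an illustrative sketch, not a kernel from the course materials: the kernel name and the 256-thread, power-of-two block size are assumptions, and the final per-block sums are left for a second kernel (or the host) to combine, which avoids relying on hardware floating-point atomics.

```cuda
#define BLOCK 256  // assumed block size; must be a power of two here

// Each block tree-reduces its threads' values in shared memory, so only one
// global contribution per block remains.
__global__ void blockedSumReduce(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element (0 if out of range).
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // In-block tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One value per block goes to global memory; a short second pass
    // (another kernel launch or a host-side sum) finishes the reduction.
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}
```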
Mapping data into CUDA's memories
• Output must finally end up in global memory
  – No other way to communicate results to the rest of the world
  – Intermediate outputs can (and should) be stored in registers or shared memory
• Globally-shared input goes in constant memory
  – Run several kernels to process chunks at a time
• Input shared only by adjacent threads should be tiled into shared memory
  – Example: Matrix Multiplication tiles
Mapping data continued...
• Input not shared by adjacent threads should just be loaded from global memory
  – There are cases where shared memory is still useful
    • E.g. coalescing data-structure loads from global memory
• Texture memory should really only be used if its specialized indexing features are useful
  – Just accessing it “normally” is usually not worth it
  – Applications needing specific features might find it helpful
    • Linear interpolation (good for FP function lookup tables)
    • Array bounds clipping or wraparound
    • Subword type unpacking
Input/Output Convolution: e.g. MRI, Direct Summation CP
Generic Algorithm Description
• Every input element contributes to every output element
• Each output element is dependent on all input elements
• Input contributions are combined through some reduction operation
• Assumptions:
  – All input contributions and output elements are independent
  – An interaction is reasonably small for one CUDA thread
[Figure: N input elements (0..N-1), each contributing to all M output elements (0..M-1)]
What could each thread be assigned?
• Input elements
  – O(N) threads
  – Each contributes to M global reductions (scattered reduce)
• Output elements
  – ~M threads
  – Each reads N input elements; local reduction only
• Input/Output pairs
  – O(N*M) threads
  – Each thread contributes to one of M global reductions
• Pros and cons to each possibility!
[Figure: N inputs (0..N-1) mapped to M outputs (0..M-1)]
Thread Assignment Tradeoffs
• Input elements / global reductions
  – Usually ineffective, as it requires M global reductions
• Input/Output pairs
  – Effective when you need the extra parallelism
  – You can group threads into blocks based on input or output elements
    • Basically a choice between blocked and scattered reductions
• Output elements / local reductions
  – Very effective if a reasonable amount of input can fit into constant memory
What memory space does the data use?
• Output has to be in global memory
• If input is globally shared (threads assigned output elements), constant memory is best
  – Again, it's likely that the whole input won't fit in constant memory at once
  – Break up your implementation into “mini-kernels”, each reading a chunk of the input at a time from constant memory
• Even if constant memory doesn't make sense (threads assigned Input/Output pairs), shared memory can probably help some
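The “mini-kernel” idea can be sketched as follows: one thread per output element, with the host streaming the input through constant memory one chunk at a time and the kernel accumulating each chunk into a running total. This is an illustrative sketch, not the course's MRI or CP code; the chunk size, the `float2` input layout, and the placeholder interaction formula are all assumptions.

```cuda
#include <cuda_runtime.h>

#define CHUNK 4096                      // assumed chunk size (fits in 64 KB constant memory)
__constant__ float2 inChunk[CHUNK];     // (input position, input value) pairs

// One thread per output element; local reduction into `sum`, which carries
// over across mini-kernel launches via the running total stored in `out`.
__global__ void accumulateChunk(float *out, const float *outPos,
                                int m, int chunkSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    float pos = outPos[i];
    float sum = out[i];                 // total from previous chunks
    for (int j = 0; j < chunkSize; ++j)
        // Placeholder interaction term; a real app would use its own kernel function.
        sum += inChunk[j].y / (1.0f + fabsf(pos - inChunk[j].x));
    out[i] = sum;
}

// Host side: stream the input through constant memory, one launch per chunk.
void directSum(const float2 *input, int n, float *d_out,
               const float *d_outPos, int m)
{
    for (int base = 0; base < n; base += CHUNK) {
        int size = (n - base < CHUNK) ? (n - base) : CHUNK;
        cudaMemcpyToSymbol(inChunk, input + base, size * sizeof(float2));
        accumulateChunk<<<(m + 255) / 256, 256>>>(d_out, d_outPos, m, size);
    }
}
```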
Bounded Input/Output Convolution: e.g. Cutoff Summation (and Matrix Multiplication)
Generic problem description
• Input elements will contribute to a bounded range of output elements
  – Conversely, each output element is affected by a limited range/set of input elements
• Usually arises from cutoff distances in spatial representations
  – O(# output elements) instead of O(# output elements * # input elements)
[Figure: only input elements within the cutoff distance of an output element contribute to it]
Revisiting thread-assignment tradeoffs
• Input elements / “global” reductions
  – The reductions that an input element affects are restricted
  – Might be reasonable if the “global” reduction can become mostly an in-block reduction
• Input/Output pairs
  – Still most effective when you need the extra parallelism
  – Try as much as possible to keep conceptually “global” reductions as in-block reductions in reality
    • If it works, this strategy will likely be very competitive
• Output elements / local reductions
  – Still most effective if feasible
Data Mapping?
• Input isn't globally shared anymore
  – Constant memory doesn't make sense because most threads won't need a particular input element
• Read “tiles” of input data relevant to a tile of output data into shared memory
  – Not all threads in the grid will need the data, but adjacent threads will with high probability
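A minimal 1-D sketch of this tiling: each block computes a tile of outputs and first stages into shared memory every input element that can possibly fall within the cutoff of that tile. The tile size, the fixed `HALO` bound on the cutoff range, and the plain-sum interaction are illustrative assumptions (the block size is assumed to equal the tile size).

```cuda
#define TILE 256   // outputs per block; assumed equal to blockDim.x
#define HALO 32    // assumed bound: an input at most HALO elements away can contribute

__global__ void cutoffSum1D(const float *in, float *out, int n)
{
    // Input tile plus a halo on each side, shared by adjacent threads.
    __shared__ float tile[TILE + 2 * HALO];
    int gid = blockIdx.x * TILE + threadIdx.x;

    // Cooperatively load tile + halo; out-of-range positions read as 0.
    for (int idx = threadIdx.x; idx < TILE + 2 * HALO; idx += blockDim.x) {
        int src = blockIdx.x * TILE + idx - HALO;
        tile[idx] = (src >= 0 && src < n) ? in[src] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        // Only inputs within the cutoff (here, +/- HALO positions) contribute,
        // and all of them are already in shared memory.
        for (int k = -HALO; k <= HALO; ++k)
            sum += tile[threadIdx.x + HALO + k];
        out[gid] = sum;
    }
}
```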
Stencil Computation: e.g. Fluid Dynamics, Image Convolution
Generic Algorithm Description
• Class of in-place applications where the next “step” for an element depends on a predetermined set of other (usually adjacent) elements
• Dataset should either be double-buffered or red-black colored to prevent dependencies
[Figure: grid of elements advancing from T=0 to T=1]
Basic Questions again
• What does each thread do?
  – One input component for the element?
    • Thread blocks compute one or a small number of elements
  – The whole computation for one element?
    • Thread blocks compute tiles of elements
• Again, tradeoffs: mostly determined by how much work goes into an element for the next timestep
  – Directly related to the size of the stencil
What memory space?
• Depends on the app
  – If adjacent elements share input values: the ideal case for shared memory tiling
  – If entire thread blocks compute single elements, tiling doesn't help
    • No intra-block sharing
[Figure: overlapping input tiles at T=0 produce non-overlapping output tiles at T=1]
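The shared-memory tiling case can be sketched as a double-buffered 5-point stencil: each block stages an overlapping (BDIM+2)² input tile and writes a non-overlapping BDIM×BDIM output tile. This is an illustrative sketch, not code from the course; the averaging stencil, the 16×16 block size, and the assumption that the grid dimensions are multiples of BDIM (with clamp-to-edge boundaries) are all assumptions.

```cuda
#define BDIM 16   // assumed block dimensions; w and h assumed multiples of BDIM

// Reads from `cur` and writes to `next` — the double-buffering the slide
// describes, so no element's update depends on already-updated neighbors.
__global__ void stencilStep(const float *cur, float *next, int w, int h)
{
    __shared__ float tile[BDIM + 2][BDIM + 2];
    int x = blockIdx.x * BDIM + threadIdx.x;
    int y = blockIdx.y * BDIM + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    // Interior element, then edge threads also fetch the halo
    // (clamping to the nearest in-range element at the domain border).
    tile[ty][tx] = cur[y * w + x];
    if (threadIdx.x == 0)
        tile[ty][0] = cur[y * w + ((x > 0) ? x - 1 : x)];
    if (threadIdx.x == BDIM - 1)
        tile[ty][BDIM + 1] = cur[y * w + ((x + 1 < w) ? x + 1 : x)];
    if (threadIdx.y == 0)
        tile[0][tx] = cur[((y > 0) ? y - 1 : y) * w + x];
    if (threadIdx.y == BDIM - 1)
        tile[BDIM + 1][tx] = cur[((y + 1 < h) ? y + 1 : y) * w + x];
    __syncthreads();

    // Placeholder 5-point stencil: average of the element and its 4 neighbors.
    next[y * w + x] = 0.2f * (tile[ty][tx] + tile[ty][tx - 1] + tile[ty][tx + 1]
                              + tile[ty - 1][tx] + tile[ty + 1][tx]);
}
```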
What if basic tiling isn't good enough?
• Sometimes, the bandwidth of loading and storing tiles far outweighs the needed computation
• Multi-step kernels: larger input tiles, multiple steps within a block
  – Means there's some redundant computation for the edges of the intermediate tiles
[Figure: input tiles at T=0, overlapping intermediate tiles at T=1, output tiles at T=2]
Input/Input Convolution: e.g. N-body Interaction
Generic Algorithm Description
• All input elements interact
  – Pairwise most common, sometimes even higher-degree
• Interactions usually contribute to reductions
  – Either a per-element reduction or a global reduction, depending on the app
• Examples: gravitational or electrical points in space, two-point autocorrelation function in astronomy
• Threads and data storage?
What does each thread do?
• If the reduction is per-element, this looks a lot like the input/output convolution case
  – The input pair is the new “element”
  – Apply the tradeoffs from that case
• If the reduction is global, it looks a lot like a simple reduction
  – Again: the input pair is the new “element”
  – Try to load and reduce many “elements” in-block
What memory space does the data use?
• Per-element reductions are handled the same way as input/output convolution
  – Tile one copy of the input through constant memory; read in the other from global memory
• Global reductions could be handled by loading input tile pairs into shared memory, and doing as much in-block reduction as possible
  – N^2 interactions per block for two N-element input tiles
  – Should prevent memory bandwidth from being a bottleneck if N is reasonable
[Figure: pairs of input tiles in shared memory; output as tiles or partial reductions]
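The tile-pair scheme for a global pairwise reduction can be sketched as follows: each block loads two tiles into shared memory, accumulates all TILE² interactions in-block, and emits one partial sum per block. This is an illustrative sketch only; the tile size, the placeholder interaction function, and the mapping of blocks to tile pairs via the 2-D grid are assumptions (as written it visits both (A,B) and (B,A), which a real app would deduplicate).

```cuda
#define TILE 128   // elements per tile; assumed equal to blockDim.x, power of two

__device__ float interact(float4 a, float4 b)
{
    // Placeholder pairwise term (softened inverse distance, weighted by .w).
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return a.w * b.w * rsqrtf(dx * dx + dy * dy + dz * dz + 1e-6f);
}

// Grid is 2-D: block (bx, by) handles tile pair (by, bx).
__global__ void tilePairReduce(const float4 *bodies, float *blockSums)
{
    __shared__ float4 ta[TILE], tb[TILE];
    __shared__ float partial[TILE];
    int t = threadIdx.x;

    // 2*TILE global loads buy TILE^2 interactions per block.
    ta[t] = bodies[blockIdx.y * TILE + t];
    tb[t] = bodies[blockIdx.x * TILE + t];
    __syncthreads();

    // Each thread interacts one element of tile A with all of tile B.
    float sum = 0.0f;
    for (int j = 0; j < TILE; ++j)
        sum += interact(ta[t], tb[j]);
    partial[t] = sum;
    __syncthreads();

    // In-block tree reduction, then one partial result per block.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0)
        blockSums[blockIdx.y * gridDim.x + blockIdx.x] = partial[0];
}
```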
Bounded Input/Input Convolutions: e.g. N-body approximations (NAMD)
Generic Algorithm
• Input/Input convolution with some cutoff
  – O(N) instead of O(N^2)
• Similar to the bounded input/output convolution approach
  – Use some kind of spatial binning to reduce algorithmic complexity
Modifications from the Unbounded Case
• An input “tile” is naturally defined by the binning process
  – A tile is all input elements in one (or several) bins
• Each tile has a limited number of other tiles to interact with
  – For per-element reductions, this looks a lot like the bounded input/output convolution case
    • Load a tile, and interact with every other relevant tile within one block
  – For global reductions, this is essentially unchanged from the unbounded case, except that fewer input pairs are considered for reduction contributions
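The per-element binned case can be sketched as one block per bin: the block walks a precomputed neighbor-bin list, stages each neighbor bin in shared memory, and accumulates a per-element result with the cutoff test applied to every candidate pair. Everything here is an assumption made for illustration: the fixed bin capacity (with padded bins), the flat neighbor-list format, and the placeholder interaction and cutoff.

```cuda
#define BIN_CAP 64        // assumed fixed bin capacity; bins padded with zero-weight elements
#define CUTOFF2 1.0f      // assumed squared cutoff distance

__global__ void binnedInteract(const float4 *binData,    // elements grouped by bin
                               const int *neighborList,  // per-bin neighbor bin ids
                               int neighborsPerBin,
                               float *out)               // one result per element
{
    __shared__ float4 other[BIN_CAP];
    int myBin = blockIdx.x;
    int t = threadIdx.x;                 // blockDim.x assumed == BIN_CAP
    float4 me = binData[myBin * BIN_CAP + t];
    float acc = 0.0f;

    // Only a bounded set of neighbor bins can contain in-cutoff partners.
    for (int nb = 0; nb < neighborsPerBin; ++nb) {
        int obin = neighborList[myBin * neighborsPerBin + nb];
        other[t] = binData[obin * BIN_CAP + t];
        __syncthreads();
        for (int j = 0; j < BIN_CAP; ++j) {
            float dx = me.x - other[j].x, dy = me.y - other[j].y,
                  dz = me.z - other[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < CUTOFF2 && r2 > 0.0f)   // cutoff test; skip self-pairs
                acc += me.w * other[j].w * rsqrtf(r2);
        }
        __syncthreads();
    }
    out[myBin * BIN_CAP + t] = acc;      // per-element (not global) reduction
}
```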