© John A. Stratton 2009, ECE 498AL, University of Illinois, Urbana-Champaign
ECE 498AL
Lecture 23: Kernel and Algorithm Patterns for CUDA
Objective
• Learn about algorithm patterns and principles
  – What are my threads? Where does my data go?
• This lecture covers several patterns that have seen significant success with CUDA, and how they got there:
  – Input/Output Convolution
  – Bounded Input/Output Convolution
  – Stencil Computation
  – Input/Input Convolution
  – Bounded Input/Input Convolution
Two Questions
• For every application, a CUDA implementation begins with two questions:
  – What work does each thread do?
  – In which memory space should each piece of data live?
Assumptions
• Computation is independent (parallel) unless otherwise stated
  – i.e. reductions are the only real presence of serialization
• A work unit (as presented) is reasonably small for one CUDA thread
  – We won't discuss cases where the “tasks” are just too large to fit into a thread
• Global memory is big enough to hold your entire dataset
  – There is another level of issues to address when it isn't
A Few Commonalities: Reductions and Memory Patterns
Reduction patterns in CUDA
• Local
  – One thread performs an entire, unique reduction
  – Example: Matrix Multiplication
• In-Block
  – Only threads within a block contribute to a reduction
  – Only slightly less efficient than local reduction, especially on newer hardware
• Global
  – Every thread contributes to the reduction: two subtypes
  – Blocked: threads within a block contribute to the same reduction, allowing some component of block reduction
    • The more reduction you can do within a block, the better
  – Scattered: threads in a block do not contribute to the same reduction
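The blocked-global pattern above can be sketched as a kernel that reduces in shared memory first, then contributes one value per block to the global reduction. This is an illustrative sketch, not a kernel from the course materials: the kernel name and the 256-thread, power-of-two block size are assumptions, and the final per-block sums are left for a second kernel (or the host) to combine, which avoids relying on hardware floating-point atomics.

```cuda
#define BLOCK 256  // assumed block size; must be a power of two here

// Each block tree-reduces its threads' values in shared memory, so only one
// global contribution per block remains.
__global__ void blockedSumReduce(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element (0 if out of range).
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // In-block tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One value per block goes to global memory; a short second pass
    // (another kernel launch or a host-side sum) finishes the reduction.
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}
```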
Mapping data into CUDA's memories
• Output must finally end up in global memory
  – No other way to communicate results to the rest of the world
  – Intermediate outputs can (and should) be stored in registers or shared memory
• Globally-shared input goes in constant memory
  – Run several kernels to process chunks at a time
• Input shared only by adjacent threads should be tiled into shared memory
  – Example: Matrix Multiplication tiles
Mapping data continued...
• Input not shared by adjacent threads should just be loaded from global memory
  – There are cases where shared memory is still useful
    • E.g. coalescing data-structure loads from global memory
• Texture memory should really only be used if its specialized indexing features are useful
  – Just accessing it “normally” is usually not worth it
  – Applications needing specific features might find it helpful
    • Linear interpolation (good for FP function lookup tables)
    • Array bounds clipping or wraparound
    • Subword type unpacking
Input/Output Convolution: e.g. MRI, Direct Summation CP
Generic Algorithm Description
• Every input element contributes to every output element
• Each output element is dependent on all input elements
• Input contributions are combined through some reduction operation
• Assumptions:
  – All input contributions and output elements are independent
  – An interaction is reasonably small for one CUDA thread
[Figure: N input elements (0..N-1), each contributing to all M output elements (0..M-1)]
What could each thread be assigned?
• Input elements
  – O(N) threads
  – Each contributes to M global reductions (scattered reduce)
• Output elements
  – ~M threads
  – Each reads N input elements; local reduction only
• Input/Output pairs
  – O(N*M) threads
  – Each thread contributes to one of M global reductions
• Pros and cons to each possibility!
[Figure: N inputs (0..N-1) mapped to M outputs (0..M-1)]
Thread Assignment Tradeoffs
• Input elements / global reductions
  – Usually ineffective, as it requires M global reductions
• Input/Output pairs
  – Effective when you need the extra parallelism
  – You can group threads into blocks based on input or output elements
    • Basically a choice between blocked and scattered reductions
• Output elements / local reductions
  – Very effective if a reasonable amount of input can fit into constant memory
What memory space does the data use?
• Output has to be in global memory
• If input is globally shared (threads assigned output elements), constant memory is best
  – Again, it's likely that the whole input won't fit in constant memory at once
  – Break up your implementation into “mini-kernels”, each reading a chunk of the input at a time from constant memory
• Even if constant memory doesn't make sense (threads assigned Input/Output pairs), shared memory can probably help some
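The “mini-kernel” idea can be sketched as follows: one thread per output element, with the host streaming the input through constant memory one chunk at a time and the kernel accumulating each chunk into a running total. This is an illustrative sketch, not the course's MRI or CP code; the chunk size, the `float2` input layout, and the placeholder interaction formula are all assumptions.

```cuda
#include <cuda_runtime.h>

#define CHUNK 4096                      // assumed chunk size (fits in 64 KB constant memory)
__constant__ float2 inChunk[CHUNK];     // (input position, input value) pairs

// One thread per output element; local reduction into `sum`, which carries
// over across mini-kernel launches via the running total stored in `out`.
__global__ void accumulateChunk(float *out, const float *outPos,
                                int m, int chunkSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    float pos = outPos[i];
    float sum = out[i];                 // total from previous chunks
    for (int j = 0; j < chunkSize; ++j)
        // Placeholder interaction term; a real app would use its own kernel function.
        sum += inChunk[j].y / (1.0f + fabsf(pos - inChunk[j].x));
    out[i] = sum;
}

// Host side: stream the input through constant memory, one launch per chunk.
void directSum(const float2 *input, int n, float *d_out,
               const float *d_outPos, int m)
{
    for (int base = 0; base < n; base += CHUNK) {
        int size = (n - base < CHUNK) ? (n - base) : CHUNK;
        cudaMemcpyToSymbol(inChunk, input + base, size * sizeof(float2));
        accumulateChunk<<<(m + 255) / 256, 256>>>(d_out, d_outPos, m, size);
    }
}
```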
Bounded Input/Output Convolution: e.g. Cutoff Summation (and Matrix Multiplication)
Generic problem description
• Input elements will contribute to a bounded range of output elements
  – Conversely, each output element is affected by a limited range/set of input elements
• Usually arises from cutoff distances in spatial representations
  – O(# output elements) instead of O(# output elements * # input elements)
[Figure: only input elements within the cutoff distance of an output element contribute to it]
Revisiting thread-assignment tradeoffs
• Input elements / “global” reductions
  – The reductions that an input element affects are restricted
  – Might be reasonable if the “global” reduction can become mostly an in-block reduction
• Input/Output pairs
  – Still most effective when you need the extra parallelism
  – Try as much as possible to keep conceptually “global” reductions as in-block reductions in reality
    • If it works, this strategy will likely be very competitive
• Output elements / local reductions
  – Still most effective if feasible
Data Mapping?
• Input isn't globally shared anymore
  – Constant memory doesn't make sense because most threads won't need a particular input element
• Read “tiles” of input data relevant to a tile of output data into shared memory
  – Not all threads in the grid will need the data, but adjacent threads will with high probability
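A minimal 1-D sketch of this tiling: each block computes a tile of outputs and first stages into shared memory every input element that can possibly fall within the cutoff of that tile. The tile size, the fixed `HALO` bound on the cutoff range, and the plain-sum interaction are illustrative assumptions (the block size is assumed to equal the tile size).

```cuda
#define TILE 256   // outputs per block; assumed equal to blockDim.x
#define HALO 32    // assumed bound: an input at most HALO elements away can contribute

__global__ void cutoffSum1D(const float *in, float *out, int n)
{
    // Input tile plus a halo on each side, shared by adjacent threads.
    __shared__ float tile[TILE + 2 * HALO];
    int gid = blockIdx.x * TILE + threadIdx.x;

    // Cooperatively load tile + halo; out-of-range positions read as 0.
    for (int idx = threadIdx.x; idx < TILE + 2 * HALO; idx += blockDim.x) {
        int src = blockIdx.x * TILE + idx - HALO;
        tile[idx] = (src >= 0 && src < n) ? in[src] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        // Only inputs within the cutoff (here, +/- HALO positions) contribute,
        // and all of them are already in shared memory.
        for (int k = -HALO; k <= HALO; ++k)
            sum += tile[threadIdx.x + HALO + k];
        out[gid] = sum;
    }
}
```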
Stencil Computation: e.g. Fluid Dynamics, Image Convolution
Generic Algorithm Description
• Class of in-place applications where the next “step” for an element depends on a predetermined set of other (usually adjacent) elements
• Dataset should either be double-buffered or red-black colored to prevent dependencies
[Figure: grid of elements advancing from T=0 to T=1]
Basic Questions again
• What does each thread do?
  – One input component for the element?
    • Thread blocks compute one or a small number of elements
  – The whole computation for one element?
    • Thread blocks compute tiles of elements
• Again, tradeoffs: mostly determined by how much work goes into an element for the next timestep
  – Directly related to the size of the stencil
What memory space?
• Depends on the app
  – If adjacent elements share input values: the ideal case for shared memory tiling
  – If entire thread blocks compute single elements, tiling doesn't help
    • No intra-block sharing
[Figure: overlapping input tiles at T=0 produce non-overlapping output tiles at T=1]
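The shared-memory tiling case can be sketched as a double-buffered 5-point stencil: each block stages an overlapping (BDIM+2)² input tile and writes a non-overlapping BDIM×BDIM output tile. This is an illustrative sketch, not code from the course; the averaging stencil, the 16×16 block size, and the assumption that the grid dimensions are multiples of BDIM (with clamp-to-edge boundaries) are all assumptions.

```cuda
#define BDIM 16   // assumed block dimensions; w and h assumed multiples of BDIM

// Reads from `cur` and writes to `next` — the double-buffering the slide
// describes, so no element's update depends on already-updated neighbors.
__global__ void stencilStep(const float *cur, float *next, int w, int h)
{
    __shared__ float tile[BDIM + 2][BDIM + 2];
    int x = blockIdx.x * BDIM + threadIdx.x;
    int y = blockIdx.y * BDIM + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    // Interior element, then edge threads also fetch the halo
    // (clamping to the nearest in-range element at the domain border).
    tile[ty][tx] = cur[y * w + x];
    if (threadIdx.x == 0)
        tile[ty][0] = cur[y * w + ((x > 0) ? x - 1 : x)];
    if (threadIdx.x == BDIM - 1)
        tile[ty][BDIM + 1] = cur[y * w + ((x + 1 < w) ? x + 1 : x)];
    if (threadIdx.y == 0)
        tile[0][tx] = cur[((y > 0) ? y - 1 : y) * w + x];
    if (threadIdx.y == BDIM - 1)
        tile[BDIM + 1][tx] = cur[((y + 1 < h) ? y + 1 : y) * w + x];
    __syncthreads();

    // Placeholder 5-point stencil: average of the element and its 4 neighbors.
    next[y * w + x] = 0.2f * (tile[ty][tx] + tile[ty][tx - 1] + tile[ty][tx + 1]
                              + tile[ty - 1][tx] + tile[ty + 1][tx]);
}
```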
What if basic tiling isn't good enough?
• Sometimes, the bandwidth of loading and storing tiles far outweighs the needed computation
• Multi-step kernels: larger input tiles, multiple steps within a block
  – Means there's some redundant computation for the edges of the intermediate tiles
[Figure: input tiles at T=0, overlapping intermediate tiles at T=1, output tiles at T=2]
Input/Input Convolution: e.g. N-body Interaction
Generic Algorithm Description
• All input elements interact
  – Pairwise most common, sometimes even higher-degree
• Interactions usually contribute to reductions
  – Either a per-element reduction or a global reduction, depending on the app
• Examples: gravitational or electrical points in space, two-point autocorrelation function in astronomy
• Threads and data storage?
What does each thread do?
• If the reduction is per-element, this looks a lot like the input/output convolution case
  – The input pair is the new “element”
  – Apply the tradeoffs from that case
• If the reduction is global, it looks a lot like a simple reduction
  – Again: the input pair is the new “element”
  – Try to load and reduce many “elements” in-block
What memory space does the data use?
• Per-element reductions are handled the same way as input/output convolution
  – Tile one copy of the input through constant memory; read in the other from global memory
• Global reductions could be handled by loading input tile pairs into shared memory, and doing as much in-block reduction as possible
  – N^2 interactions per block for two N-element input tiles
  – Should prevent memory bandwidth from being a bottleneck if N is reasonable
[Figure: pairs of input tiles in shared memory; output as tiles or partial reductions]
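The tile-pair scheme for a global pairwise reduction can be sketched as follows: each block loads two tiles into shared memory, accumulates all TILE² interactions in-block, and emits one partial sum per block. This is an illustrative sketch only; the tile size, the placeholder interaction function, and the mapping of blocks to tile pairs via the 2-D grid are assumptions (as written it visits both (A,B) and (B,A), which a real app would deduplicate).

```cuda
#define TILE 128   // elements per tile; assumed equal to blockDim.x, power of two

__device__ float interact(float4 a, float4 b)
{
    // Placeholder pairwise term (softened inverse distance, weighted by .w).
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return a.w * b.w * rsqrtf(dx * dx + dy * dy + dz * dz + 1e-6f);
}

// Grid is 2-D: block (bx, by) handles tile pair (by, bx).
__global__ void tilePairReduce(const float4 *bodies, float *blockSums)
{
    __shared__ float4 ta[TILE], tb[TILE];
    __shared__ float partial[TILE];
    int t = threadIdx.x;

    // 2*TILE global loads buy TILE^2 interactions per block.
    ta[t] = bodies[blockIdx.y * TILE + t];
    tb[t] = bodies[blockIdx.x * TILE + t];
    __syncthreads();

    // Each thread interacts one element of tile A with all of tile B.
    float sum = 0.0f;
    for (int j = 0; j < TILE; ++j)
        sum += interact(ta[t], tb[j]);
    partial[t] = sum;
    __syncthreads();

    // In-block tree reduction, then one partial result per block.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0)
        blockSums[blockIdx.y * gridDim.x + blockIdx.x] = partial[0];
}
```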
Bounded Input/Input Convolutions: e.g. N-body approximations (NAMD)
Generic Algorithm
• Input/Input convolution with some cutoff
  – O(N) instead of O(N^2)
• Similar to the bounded input/output convolution approach
  – Use some kind of spatial binning to reduce algorithmic complexity
Modifications from the Unbounded Case
• An input “tile” is naturally defined by the binning process
  – A tile is all input elements in one (or several) bins
• Each tile has a limited number of other tiles to interact with
  – For per-element reductions, this looks a lot like the bounded input/output convolution case
    • Load a tile, and interact with every other relevant tile within one block
  – For global reductions, this is essentially unchanged from the unbounded case, except that fewer input pairs are considered for reduction contributions
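The per-element binned case can be sketched as one block per bin: the block walks a precomputed neighbor-bin list, stages each neighbor bin in shared memory, and accumulates a per-element result with the cutoff test applied to every candidate pair. Everything here is an assumption made for illustration: the fixed bin capacity (with padded bins), the flat neighbor-list format, and the placeholder interaction and cutoff.

```cuda
#define BIN_CAP 64        // assumed fixed bin capacity; bins padded with zero-weight elements
#define CUTOFF2 1.0f      // assumed squared cutoff distance

__global__ void binnedInteract(const float4 *binData,    // elements grouped by bin
                               const int *neighborList,  // per-bin neighbor bin ids
                               int neighborsPerBin,
                               float *out)               // one result per element
{
    __shared__ float4 other[BIN_CAP];
    int myBin = blockIdx.x;
    int t = threadIdx.x;                 // blockDim.x assumed == BIN_CAP
    float4 me = binData[myBin * BIN_CAP + t];
    float acc = 0.0f;

    // Only a bounded set of neighbor bins can contain in-cutoff partners.
    for (int nb = 0; nb < neighborsPerBin; ++nb) {
        int obin = neighborList[myBin * neighborsPerBin + nb];
        other[t] = binData[obin * BIN_CAP + t];
        __syncthreads();
        for (int j = 0; j < BIN_CAP; ++j) {
            float dx = me.x - other[j].x, dy = me.y - other[j].y,
                  dz = me.z - other[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < CUTOFF2 && r2 > 0.0f)   // cutoff test; skip self-pairs
                acc += me.w * other[j].w * rsqrtf(r2);
        }
        __syncthreads();
    }
    out[myBin * BIN_CAP + t] = acc;      // per-element (not global) reduction
}
```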