Programming in CUDA
October 21, 2010
Malik M Khan
Outline
• Reminder of CUDA Architecture
• Execution Model
  - Brief mention of control flow
• Heterogeneous Memory Hierarchy
  - Locality through data placement
  - Maximizing bandwidth through global memory coalescing
  - Avoiding memory bank conflicts
• Tiling and its Applicability to CUDA Code Generation
  - Example: Matrix-Vector Multiplication
This lecture includes slides provided by Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), see http://courses.ece.uiuc.edu/ece498/al1/, and by Austin Robison (NVIDIA).
Reading
•David Kirk and Wen-mei Hwu manuscript (in progress)
- http://www.toodoc.com/CUDA-textbook-by-David-Kirk-from-NVIDIA-and-Prof-Wen-mei-Hwu-pdf.html
•CUDA 2.x Manual, particularly Chapters 2 and 4 (download from nvidia.com/cudazone)
• Nice series from Dr. Dobb's Journal by Rob Farber
  - http://www.ddj.com/cpp/207200659
CUDA Programming Model: A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
• Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
  - Very little creation overhead
  - A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
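For concreteness, here is a minimal sketch of this model (not from the lecture; the kernel name vecAdd, the array size, and the 256-thread block size are illustrative): each lightweight GPU thread handles one element of a data-parallel vector addition, while the host allocates device DRAM and launches the kernel.

#include <cuda_runtime.h>

// Kernel: the data-parallel portion, executed on the device by many threads.
__global__ void vecAdd(float* a, const float* b, const float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
  if (i < n) a[i] = b[i] + c[i];
}

int main() {
  const int n = 1 << 20;
  float *d_a, *d_b, *d_c;                          // buffers in device DRAM
  cudaMalloc((void**)&d_a, n * sizeof(float));
  cudaMalloc((void**)&d_b, n * sizeof(float));
  cudaMalloc((void**)&d_c, n * sizeof(float));
  // (host-to-device copies of the inputs omitted for brevity)
  vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // thousands of lightweight threads
  cudaDeviceSynchronize();
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}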
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  - All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 onto Grid 1 and Kernel 2 onto Grid 2 on the device; Grid 1 is an array of blocks, Block(0,0)…Block(2,1), and Block(1,1) is expanded into its threads, Thread(0,0)…Thread(4,2). Courtesy: NVIDIA]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
Block and Thread IDs
• Threads and blocks have IDs
  - So each thread can decide what data to work on
  - Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
  - Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
• Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …
[Figure: Grid 1 on the device as an array of blocks, Block(0,0)…Block(2,1), with Block(1,1) expanded into its threads, Thread(0,0)…Thread(4,2). Courtesy: NVIDIA]
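To make the ID arithmetic concrete, here is an illustrative kernel (not from the slides; the image layout, the invert operation, and the 16x16 block shape are assumptions) that combines 2D block and thread IDs into pixel coordinates:

__global__ void invert(unsigned char* img, int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
  int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
  if (x < width && y < height)
    img[y * width + x] = 255 - img[y * width + x];
}

// Launch with a 2D grid of 2D blocks that covers the image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// invert<<<grid, block>>>(d_img, width, height);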
Hardware Implementation: A Set of SIMD Multiprocessors
• A device has a set of multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  - Shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size
[Figure: a device containing Multiprocessor 1…N; each multiprocessor holds Processor 1…M driven by a single instruction unit]
Hardware Execution Model
I. SIMD execution of warpsize=M threads (from a single block)
   - Result is a set of instruction streams roughly equal to the number of threads in a block divided by the warp size
II. Multithreaded execution across different instruction streams within a block
   - Also possibly across different blocks if there are more blocks than SMs
III. Each block is mapped to a single SM
   - No direct interaction across SMs
[Figure: a device with Multiprocessor 1…N above device memory; each multiprocessor contains Processor 1…M, each with its own registers, plus a shared instruction unit, on-chip shared memory, a constant cache, and a texture cache]
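The warp size is fixed by the hardware, not the programmer, and can be queried at run time; a small sketch using the standard cudaGetDeviceProperties call (querying device 0 is an assumption):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                // properties of device 0
  printf("warp size = %d, multiprocessors = %d\n",
         prop.warpSize, prop.multiProcessorCount);  // warpSize is 32 on NVIDIA GPUs to date
  return 0;
}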
A Very Simple Execution Model
• No branch prediction
  - Just evaluate branch targets and wait for resolution
  - But the wait is only a small number of cycles
• No speculation
  - Only execute useful instructions
Terminology
• Divergent paths
  - Different threads within a warp take different control flow paths within a kernel function
  - N divergent paths in a warp? The warp is serially issued over the N different paths, using a hardware stack and per-thread predication logic to write back only the results from the threads taking each divergent path
  - Performance decreases by about a factor of N
Hardware Implementation: Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory
• Each multiprocessor has:
  - A set of 32-bit registers per processor
  - On-chip shared memory, where the shared memory space resides
  - A read-only constant cache, to speed up access to the constant memory space
  - A read-only texture cache, to speed up access to the texture memory space
[Figure: same multiprocessor diagram as before; registers, shared memory, the constant cache, and the texture cache are on chip, while the global, constant, and texture memories reside in device memory]
Programmer’s View: Memory Spaces
• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
[Figure: a grid containing Block (0,0) and Block (1,0); each block has its own shared memory, and each thread its own registers and local memory; the per-grid global, constant, and texture memories are shared by all blocks and accessible from the host]
• The host can read/write global, constant, and texture memory
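A small hypothetical kernel touching each of these spaces might look like the following (names and sizes are illustrative; the host would fill coeff with cudaMemcpyToSymbol before launching a single 256-thread block):

__constant__ float coeff[16];                   // per-grid constant memory (read-only in kernels)

__global__ void spaces(float* g_out, const float* g_in) {   // g_in/g_out point into global memory
  __shared__ float s_buf[256];                  // per-block shared memory
  float r = g_in[threadIdx.x];                  // the scalar r lives in a per-thread register
  float scratch[64];                            // per-thread arrays may be placed in local memory
  scratch[0] = r * coeff[0];
  s_buf[threadIdx.x] = scratch[0];
  __syncthreads();
  g_out[threadIdx.x] = s_buf[threadIdx.x];
}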
Now Let’s Look at Shared Memory
•Common Programming Pattern (5.1.2 of CUDA manual)
- Load data into shared memory
- Synchronize (if necessary)
- Operate on data in shared memory
- Synchronize (if necessary)
- Write intermediate results to global memory
- Repeat until done
[Figure: data is staged from global memory into shared memory and back]
Familiar concept?
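A concrete sketch of this pattern (illustrative, not from the manual; it assumes a 256-thread block and a simple 3-point sum so the staging steps stand out):

#define BLOCK 256
__global__ void sum3(float* out, const float* in, int n) {
  __shared__ float s[BLOCK + 2];                       // block's tile plus a halo element on each side
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // 1. Load data into shared memory
  s[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
  if (threadIdx.x == 0)         s[0]         = (i > 0)     ? in[i - 1] : 0.0f;
  if (threadIdx.x == BLOCK - 1) s[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
  // 2. Synchronize so the whole tile is visible to every thread
  __syncthreads();
  // 3. Operate on data in shared memory
  float v = s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2];
  // 4. (Synchronize again before the tile is reused or overwritten)
  // 5. Write results to global memory
  if (i < n) out[i] = v;
}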
Mechanics of Using Shared Memory
• __shared__ type qualifier required
• Must be allocated from a global/device function, or as "extern"
• Examples:

extern __shared__ float d_s_array[];

/* a form of dynamic allocation: MEMSIZE is the per-block shared memory size,
   passed as the third launch configuration parameter */
__host__ void outerCompute() {
  compute<<<gs, bs, MEMSIZE>>>();
}
__global__ void compute() {
  d_s_array[i] = …;
}

__global__ void compute2() {
  __shared__ float d_s_array[M];   /* static per-block allocation */
  /* create or copy from global memory */
  d_s_array[j] = …;
  /* write result back to global memory */
  d_g_array[j] = d_s_array[j];
}
Bandwidth to Shared Memory:Parallel Memory Accesses
• Consider each thread accessing a different location in shared memory
• Bandwidth is maximized if each one is able to proceed in parallel
• Hardware to support this:
  - Banked memory: each bank can support an access on every memory cycle
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
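In code, the fast pattern is simply each thread of a half-warp indexing shared memory with its own thread ID (a minimal kernel sketch; the 256-element array and the names are assumptions):

__global__ void bankFriendly(float* out, const float* in) {
  __shared__ float s_data[256];
  s_data[threadIdx.x] = in[threadIdx.x];   // thread t touches bank t % 16: one access per bank,
  __syncthreads();                         // so all 16 accesses of a half-warp proceed in parallel
  out[threadIdx.x] = s_data[threadIdx.x];
}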
Bank Addressing Examples
• No bank conflicts
  - Linear addressing, stride == 1
• No bank conflicts
  - Random 1:1 permutation

[Figure: left, threads 0…15 map to banks 0…15 in order (stride 1); right, threads 0…15 map onto banks 0…15 through a 1:1 permutation]
Bank Addressing Examples
• 2-way bank conflicts
  - Linear addressing, stride == 2
• 8-way bank conflicts
  - Linear addressing, stride == 8

[Figure: left, stride-2 addressing sends pairs of threads to the same bank (2-way conflict); right, stride-8 addressing sends eight threads each to banks 0 and 8 (8-way conflict)]
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  - So bank = (32-bit word address) % 16
  - Same as the size of a half-warp
  - No bank conflicts between different half-warps, only within a single half-warp
Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
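A common place conflicts appear is column-wise access to a 2D shared array, as in a tiled transpose; padding each row by one word shifts the columns across banks (an illustrative sketch, assuming a 16x16 thread block and a square matrix whose width is a multiple of 16):

#define TILE 16
__global__ void transposeTile(float* out, const float* in, int width) {
  // Without the "+ 1" padding, the column read below would have all 16 threads of a
  // half-warp hitting the same bank (stride of 16 words): a 16-way conflict.
  __shared__ float tile[TILE][TILE + 1];
  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write: conflict-free
  __syncthreads();
  x = blockIdx.y * TILE + threadIdx.x;                   // transposed block origin
  y = blockIdx.x * TILE + threadIdx.y;
  out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // column-wise read: padding spreads
                                                         // the accesses across all 16 banks
}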
Global Memory Accesses
• Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
• Given an address to load or store, memory returns/updates "segments" of either 32, 64, or 128 bytes
• Maximizing bandwidth:
  - Operate on an entire 128-byte segment for each memory transfer
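An illustrative contrast (not from the slides; the stride parameter is an assumption) between an access pattern that stays inside one segment per half-warp and one that scatters across segments:

__global__ void copyCoalesced(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];       // a half-warp reads 16 consecutive floats: one 64-byte
                                   // segment, so the accesses coalesce into a single transfer
}

__global__ void copyStrided(float* out, const float* in, int n, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  if (i < n) out[i] = in[i];       // with a large stride each thread lands in a different
                                   // segment: up to 16 separate transfers per half-warp
}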
Understanding Global Memory Accesses
Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1)
•Start with memory request by smallest numbered thread. Find the memory segment that contains the address (32, 64 or 128 byte segment, depending on data type)
•Find other active threads requesting addresses within that segment and coalesce
•Reduce transaction size if possible
•Access memory and mark threads as “inactive”
•Repeat until all threads in half-warp are serviced
*Includes Tesla and GTX platforms
Memory Layout of a Matrix in C
[Figure: a 4x4 matrix M laid out row by row in linear memory (M0,0 M1,0 M2,0 M3,0, then M0,1 … M3,1, and so on); threads T1…T4 access one set of elements in time period 1 and the next set in time period 2, illustrating how the kernel's access direction interacts with the row-major layout and hence with coalescing]
Using Loop Transformations in GPU Code Generation
Code transformation | Application on conventional architectures | Application on GPUs
Tiling | Manage reuse in limited storage | Manage reuse in limited storage; partition parallel execution at two levels
Data-copy | Eliminate conflict misses in cache | Copy data to specialized memory structures
Permutation | Reorder loop structure to enable other optimizations | Reorder loop structure to enable other optimizations
Unrolling | Exposes fine-grain parallelism by replicating the loop body | Exposes fine-grain parallelism by replicating the loop body
…
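As a plain-C sketch of the tiling transformation applied to the matrix-vector loop nest used later (TI and TJ are illustrative tile sizes, and N is assumed to be a multiple of both), the outer ii loop becomes the candidate for CUDA blocks and the inner i loop for threads:

/* Original loop nest:
   for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
       a[i] = a[i] + c[j][i] * b[j];                                  */

/* Tiled by TI x TJ: each TJ-sized chunk of b is reused across TI iterations of i. */
for (int ii = 0; ii < N; ii += TI)          /* maps to CUDA blocks      */
  for (int jj = 0; jj < N; jj += TJ)        /* staging loop for b       */
    for (int i = ii; i < ii + TI; i++)      /* maps to CUDA threads     */
      for (int j = jj; j < jj + TJ; j++)
        a[i] = a[i] + c[j][i] * b[j];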
Tiling for Computation Mapping in GPUs
[Figure: the I-J iteration space partitioned into tiles, with tiles mapped to thread blocks]
Matrix-Vector Multiply: A Naïve Strategy
N = 1024

Sequential code*
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];

CUDA kernel
__global__ void GPU_MV(float* a, float* b, float** c) {
  int i = blockIdx.x * 512 + threadIdx.x;
  for (int j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];
}

Host code
... copy data to device ...
dim3 dimGrid(N/512, 1);
dim3 dimBlock(512, 1);
GPU_MV<<<dimGrid, dimBlock>>>(gpuA, gpuB, gpuC);
... copy data back to host ...

* Following CUBLAS notation for matrix-vector multiplication, which takes the transposed matrix as input.

Performance: < 3.75 GFLOPS, below CUBLAS 2.2
Decision Algorithm: Computational Decomposition
• Block parallelism
    tile_by_index({"i"}, {TI}, {l1_control="ii"}, {"ii","i","j"})
    cudaize( block{"ii"}, thread{} )
• Thread parallelism
    cudaize( block{"ii"}, thread{"i"} )

Decision Algorithm: Data Staging
• Data staging
  - Shared memory
  - Registers
• Cudaize
• Unrolling

tile_by_index({"j"}, {TJ}, {l1_control="jj"}, {"ii","jj","i","j"})
cudaize( block{"ii"}, thread{"i"} )
unroll_to_depth(1)
(slide annotations: data reuse inside thread, data reuse across threads, final loop order)
CUDA-CHiLL Recipe
N = 1024
TI = TJ = 32

tile_by_index({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii","jj","i","j"})
normalize_index("i")
cudaize("mv_GPU", {a=N, b=N, c=N*N}, {block={"ii"}, thread={"i"}})
copy_to_shared("tx", "b", 1)
copy_to_registers("jj", "a")
unroll_to_depth(1)
Final MV Generated Code: with data staged in shared memory and registers

__global__ void GPU_MV(float* a, float* b, float** c) {
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  __shared__ float bcpy[32];
  float acpy = a[tx + 32 * bx];
  for (int jj = 0; jj < 32; jj++) {
    bcpy[tx] = b[32 * jj + tx];
    __syncthreads();
    // this loop is actually fully unrolled
    for (int j = 32 * jj; j < 32 * jj + 32; j++)
      acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * jj];
    __syncthreads();
  }
  a[tx + 32 * bx] = acpy;
}
Generated Code: with computational decomposition only

__global__ void GPU_MV(float* a, float* b, float** c) {
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  int i = 32 * bx + tx;
  for (int j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];
}
Matrix-Vector Multiply: GPU Code
Performance always tops CUBLAS 2.2, with roughly a 4x improvement over the naïve GPU kernel.
[Results figures: BLAS library kernel optimization (MV and MM results), 2D convolution, matrix transpose, and memory copy and stencil optimization]
Summary of Lecture
• A deeper probe of performance issues
  - Execution model
  - Control flow
  - Heterogeneous memory hierarchy: locality and bandwidth
  - Tiling for CUDA code generation: an example of the tiling transformation with matrix-vector multiplication