CS179: GPU Programming
Lecture 7: Lab 3 Recitation


Page 1: CS179: GPU Programming

CS179: GPU Programming
Lecture 7: Lab 3 Recitation

Page 2: CS179: GPU Programming

Today:

- Miscellaneous CUDA syntax
- Recap on CUDA and buffers
- Shared memory for an N-body simulation
- Flocking simulations
- Integrators

Page 3: CS179: GPU Programming

CUDA Kernels

Launching the kernel:

    kernel<<<gridDim, blockDim, sMemSize>>>(args);

- You need to know gridDim, blockDim, and sMemSize (and args).
- If no sMemSize is set, it defaults to 0.
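
As a minimal sketch of a launch end to end (the kernel name, data pointer, and sizes here are illustrative, not from the lab):

    // Hypothetical kernel: add 1.0f to each of n floats.
    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;
    }

    // Host side: one thread per element; no sMemSize given, so it defaults to 0.
    // devData is assumed to be a device pointer from cudaMalloc.
    int n = 1 << 20;
    int blockDim = 512;
    int gridDim = (n + blockDim - 1) / blockDim;
    addOne<<<gridDim, blockDim>>>(devData, n);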

Page 4: CS179: GPU Programming

CUDA Kernels

Grid and block architecture:

- Grids can be 1D, 2D, or (on CUDA 2.x+) 3D; blocks can be 1D, 2D, or 3D.
- 1024 threads per block maximum (512 on older systems).
- Dimension is only for convenience; choose what's best for you.
  - Most applications are fine in 1D; image processing may lend itself more intuitively to a 2D block/grid (see the sketch below).
- Shared memory size: the requirement is application-dependent, and the limit depends on your CUDA version (probably 48 kB for you).
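
For instance, a 2D launch for image processing might look like this (a sketch; processImage and the image size are made up):

    int width = 512, height = 512;      // hypothetical image dimensions
    dim3 blockDim(16, 16);              // 256 threads per block, laid out 2D
    dim3 gridDim((width + 15) / 16, (height + 15) / 16);
    processImage<<<gridDim, blockDim>>>(devImage, width, height);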

Page 5: CS179: GPU Programming

CUDA Functions

Three different kinds of CUDA functions:

- __host__: runs on the CPU (the __host__ keyword is superfluous).
- __device__: runs on the GPU, only callable from the GPU. Think of these as helper functions.
- __global__: runs on the GPU, only callable from the CPU. These are our kernel functions.
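
A minimal sketch showing all three qualifiers together (the function names are illustrative):

    __device__ float square(float x) {              // GPU helper, callable only from GPU code
        return x * x;
    }

    __global__ void squareAll(float *data, int n) { // kernel, launched from the CPU
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ void run(float *devData, int n) {      // __host__ here is superfluous
        squareAll<<<(n + 511) / 512, 512>>>(devData, n);
    }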

Page 6: CS179: GPU Programming

CUDA Functions

Things to be aware of:

- On older CUDA, __device__ and __global__ functions cannot recurse.
- You cannot take function pointers to __device__ functions.
- Restrictions on __global__ functions:
  - Must return void.
  - 64 kB maximum size for parameters.

Page 7: CS179: GPU Programming

CUDA Functions

Error checking for memory calls:

- You can check the status of a call using cudaGetErrorString().
- For lab 3, we provide you a macro:

    #define gpuErrchk(ans) { gpuAssert((ans), (char*)__FILE__, __LINE__); }

    inline void gpuAssert(cudaError_t code, char* file, int line, bool abort=true)
    {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n",
                    cudaGetErrorString(code), file, line);
            if (abort)
                exit(code);
        }
    }

Call as gpuErrchk(cudaMemcpy(…));

Page 8: CS179: GPU Programming

CUDA Variables

Like functions, variables come in a few different types:

- __device__ / __constant__:
  - Stored in global/constant memory, respectively.
  - Accessible by all threads and blocks.
  - Set using cudaMalloc, cudaMemset, cudaMemcpy, etc.
  - We can also write to __device__ memory on the GPU.
- __shared__:
  - Lives in shared memory.
  - Accessible only by threads within the associated block.
  - Requires a __syncthreads() call to guarantee correctness (see the sketch below).
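
A minimal sketch of __shared__ plus __syncthreads() (the kernel is illustrative and assumes it is launched with 512 threads per block):

    __global__ void reverseBlock(float *data) {
        __shared__ float tile[512];        // one element per thread in the block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];

        __syncthreads();                   // every thread's write is now visible

        data[i] = tile[blockDim.x - 1 - threadIdx.x];
    }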

Page 9: CS179: GPU Programming

CUDA Variables

Some CUDA vector variable types:

char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2, …

- Vector components are accessed via .x, .y, .z, .w (e.g. var.x).

- Make vectors with make_<type>(args): var = make_float3(1.0, 2.0, 3.0);

- dim3: used for assigning block/grid size.
  - Essentially just a uint3.
  - Each component of a dim3 must be at least 1 (unspecified components default to 1).
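
A quick sketch of these types in use (values illustrative):

    float3 v = make_float3(1.0f, 2.0f, 3.0f);
    float len2 = v.x * v.x + v.y * v.y + v.z * v.z;  // components via .x/.y/.z

    dim3 grid(64);        // y and z default to 1, so this is a 1D grid of 64 blocks
    dim3 block(16, 16);   // a 2D block of 16 x 16 x 1 threads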

Page 10: CS179: GPU Programming

CUDA and Buffers

We need to know how to link OpenGL buffers into CUDA. Nothing conceptually new, just some functions (see the sketch below):

- cudaGLRegisterBufferObject(bufferObj)
  - Used to first register the buffer into CUDA (done once).
- cudaGLUnregisterBufferObject(bufferObj)
  - Once we're done with it, we unregister (done once).
- cudaGLMapBufferObject((void**)&devPtr, bufferObj);
  - Associates CUDA memory with the buffer (done once per kernel call).
- cudaGLUnmapBufferObject(bufferObj);
  - Disassociates the buffer so OpenGL can read it (done once per kernel call, after the kernel finishes).

Remember to include <cuda_gl_interop.h>.
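
Putting those together, a typical lifetime looks something like this (a sketch; updatePositions and the launch sizes are illustrative):

    #include <cuda_gl_interop.h>

    // Once, at startup:
    cudaGLRegisterBufferObject(bufferObj);

    // Once per kernel call (e.g. every frame):
    float4 *devPtr;
    gpuErrchk(cudaGLMapBufferObject((void**)&devPtr, bufferObj));
    updatePositions<<<gridDim, blockDim>>>(devPtr, numParticles);
    gpuErrchk(cudaGLUnmapBufferObject(bufferObj));  // let OpenGL read it again

    // Once, at shutdown:
    cudaGLUnregisterBufferObject(bufferObj);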

Page 11: CS179: GPU Programming

N-Body Simulation

- 1 thread = 1 particle.
- A kernel call handles one step of the simulation:
  - Calculate acceleration, then velocity, then position (not quite, as we'll see in a few slides).
- One block's shared memory won't be enough for all of the particles, so how do we share all positions?
  - Load as much of global memory as fits into shared memory, calculate acceleration based on those positions, update velocity, then load new global memory and repeat (see the sketch below).
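
A sketch of that tiling loop (TILE, the pairwise force addForce, and the assumptions that n is a multiple of TILE and blockDim.x == TILE are all illustrative):

    #define TILE 256

    __global__ void accelKernel(float3 *pos, float3 *accel, int n) {
        __shared__ float3 tile[TILE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float3 myPos = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);

        // Walk through all particles one shared-memory tile at a time.
        for (int base = 0; base < n; base += TILE) {
            tile[threadIdx.x] = pos[base + threadIdx.x];
            __syncthreads();                       // tile fully loaded

            for (int j = 0; j < TILE; j++)
                a = addForce(a, myPos, tile[j]);   // hypothetical pairwise force

            __syncthreads();                       // done reading before next load
        }
        accel[i] = a;
    }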

Page 12: CS179: GPU Programming

N-Body Simulation

Page 13: CS179: GPU Programming

Flocking First, a video:

https://www.youtube.com/watch?v=ctMty7av0jc

Page 14: CS179: GPU Programming

Flocking

Two main ideas (three for bird flocking):

- Separation: bugs will try to stay away from other bugs.
- Cohesion: bugs will try to stay near the center of the swarm.
- Alignment: birds will try to head towards the average heading (not present in bug flocking algorithms).

Page 15: CS179: GPU Programming

Flocking

Separation: think repelling magnets. An inverse-square law works pretty well:

    accel ~= 1/d^2, where d is the distance between two particles
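
A sketch of that rule as a per-neighbor device helper (K_SEP and EPS are illustrative knobs, not from the lab):

    #define K_SEP 0.05f
    #define EPS   1e-6f

    // Push myPos away from other with a 1/d^2 falloff, accumulating into a.
    __device__ float3 separation(float3 myPos, float3 other, float3 a) {
        float3 d = make_float3(myPos.x - other.x, myPos.y - other.y, myPos.z - other.z);
        float d2 = d.x * d.x + d.y * d.y + d.z * d.z + EPS;  // EPS avoids divide-by-zero
        float s = K_SEP * rsqrtf(d2) / d2;                   // unit direction times 1/d^2
        return make_float3(a.x + s * d.x, a.y + s * d.y, a.z + s * d.z);
    }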

Page 16: CS179: GPU Programming

Flocking

Cohesion: move towards the average position. Cohesion fights separation, so try to find factors that balance the two out well.
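
A matching sketch for cohesion (K_COH is illustrative; avgPos would be accumulated over the same shared-memory tiles as the positions):

    #define K_COH 0.01f

    // Accelerate towards the average position of the swarm.
    __device__ float3 cohesion(float3 myPos, float3 avgPos, float3 a) {
        return make_float3(a.x + K_COH * (avgPos.x - myPos.x),
                           a.y + K_COH * (avgPos.y - myPos.y),
                           a.z + K_COH * (avgPos.z - myPos.z));
    }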

Page 17: CS179: GPU Programming

Flocking

Alignment: steer towards the average heading of your neighbors.

- Dependent on both positions AND velocities!
- Requires you to make more buffers to store velocities.
- A fair amount more work... a good candidate for extra credit!
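
In the same style, a sketch of alignment (K_ALIGN is illustrative; avgVel comes from the extra velocity buffers mentioned above):

    #define K_ALIGN 0.02f

    // Steer towards the neighbors' average velocity (heading).
    __device__ float3 alignment(float3 myVel, float3 avgVel, float3 a) {
        return make_float3(a.x + K_ALIGN * (avgVel.x - myVel.x),
                           a.y + K_ALIGN * (avgVel.y - myVel.y),
                           a.z + K_ALIGN * (avgVel.z - myVel.z));
    }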

Page 18: CS179: GPU Programming

Integrators

After acceleration is calculated, update. Simple (forward) Euler is easiest... but it is a bad integrator! Symplectic Euler works better:

- Basic idea: update velocity based on the old position, then update the position based on the new velocity.

    new_vel = old_vel + dt * accel(old_pos)
    new_pos = old_pos + dt * new_vel
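
A sketch of that update as a kernel (buffer names are illustrative):

    __global__ void integrate(float3 *pos, float3 *vel, float3 *accel, int n, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        // Velocity first, from acceleration at the old position...
        vel[i].x += dt * accel[i].x;
        vel[i].y += dt * accel[i].y;
        vel[i].z += dt * accel[i].z;

        // ...then position, from the NEW velocity.
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }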

More complex integrators can be even more accurate, but are even harder to implement. If you have time, try implementing a different one (Runge-Kutta, maybe?) for extra credit.

Page 19: CS179: GPU Programming

Pingponging

Two sets of buffers, one for the new state, one for the old. Why?

- Suppose one block finishes while another block is still reading: new positions would be used for what should be old-position calculations!
- Solution: pingponging with 2 buffers (see the sketch below).
  - Both buffers are already made for you.
  - Use one set for the old state, one for the new state, then flip when done.
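
The flip itself is just a pointer swap on the host (step and the buffer names are illustrative):

    // Read from the "old" buffers, write into the "new" ones...
    step<<<gridDim, blockDim>>>(devPosOld, devVelOld, devPosNew, devVelNew, n, dt);

    // ...then flip: what we just wrote becomes the next step's old state.
    float3 *tmp;
    tmp = devPosOld; devPosOld = devPosNew; devPosNew = tmp;
    tmp = devVelOld; devVelOld = devVelNew; devVelNew = tmp;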

Page 20: CS179: GPU Programming

Final Notes

When loading from shared memory, be sure not to access out-of-bounds memory:

- You can use %, i.e. mod by the shared memory size.
- Problem: % is slow!
- Solution: we've provided a WRAP macro for you:

#define WRAP(x,m) ((x)<(m)?(x):((x)-(m)))
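
WRAP only avoids the slow % because it assumes the index overshoots by less than one full wrap. A usage sketch (tile, TILE, and offset are illustrative):

    // Circular indexing into a shared-memory tile of size TILE;
    // valid as long as threadIdx.x + offset < 2 * TILE.
    float3 p = tile[WRAP(threadIdx.x + offset, TILE)];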

Page 21: CS179: GPU Programming

Final Notes

You will also need to set initial positions and velocities. This can be done however you'd like!

- Idea: have a few initial clusters with semi-random velocities (see the sketch below).
- Don't feel restricted to this!
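
A sketch of the cluster idea, on the host (all names and constants illustrative):

    #include <cstdlib>

    float frand(float lo, float hi) {
        return lo + (hi - lo) * rand() / (float) RAND_MAX;
    }

    // Scatter each particle near one of a few cluster centers,
    // with a small semi-random velocity.
    for (int i = 0; i < n; i++) {
        float3 c = centers[i % numClusters];   // hypothetical cluster centers
        pos[i] = make_float3(c.x + frand(-1.0f, 1.0f),
                             c.y + frand(-1.0f, 1.0f),
                             c.z + frand(-1.0f, 1.0f));
        vel[i] = make_float3(frand(-0.1f, 0.1f),
                             frand(-0.1f, 0.1f),
                             frand(-0.1f, 0.1f));
    }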

Page 22: CS179: GPU Programming

Final Notes

gluPerspective controls the camera:

- Based on your simulation, the current setup might not fit everything. Feel free to adjust!
- Signature: gluPerspective(float fov, float aspect_ratio, float near, float far)
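
For example, widening the field of view and pushing the far plane out keeps a larger swarm visible (the values here are illustrative):

    gluPerspective(60.0, (double) width / height, 0.1, 1000.0);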

Page 23: CS179: GPU Programming

Final Notes

- Due Wednesday, 5 PM.
- OH at the regular posted times.
- Important note: this lab will NOT work remotely!
  - Trying to ssh in and compile will be fine, but running will throw crazy errors!
- 2 new CUDA-capable computers are coming to 104ANB soon… For now, get your work done early if you need to use minuteman.