Programming with CUDA WS 08/09 Lecture 8 Thu, 18 Nov, 2008


Previously

CUDA Runtime Component
– Common Component
    Data types, math functions, timing, textures
– Device Component
    Math functions, warp voting, atomic functions, synch function, texturing
– Host Component
    High-level runtime API
    Low-level driver API

Previously

CUDA Runtime Component
– Host Component APIs
    Mutually exclusive
    Runtime API is easier to program, hides some details from programmer
    Driver API gives low level control, harder to program
    Provide: device initialization, management of device, streams and events

Today

CUDA Runtime Component
– Host Component APIs
    Provide: management of memory & textures, OpenGL/Direct3D interoperability (NOT covered)
    Runtime API provides: emulation mode for debugging
    Driver API provides: management of contexts & modules, execution control

Final Projects

Memory Management: Linear Memory
– CUDA Runtime API
    Declare: TYPE*
    Allocate: cudaMalloc, cudaMallocPitch
    Copy: cudaMemcpy, cudaMemcpy2D
    Free: cudaFree
– CUDA Driver API
    Declare: CUdeviceptr
    Allocate: cuMemAlloc, cuMemAllocPitch
    Copy: cuMemcpy, cuMemcpy2D
    Free: cuMemFree

Host Runtime Component

Memory Management: Linear Memory
– Pitch (stride) – expected, but WRONG:
    // host code
    float *array2D;
    cudaMallocPitch((void**)&array2D, width * sizeof(float), height);  // WRONG: no pitch argument
    // device code
    int size = width * sizeof(float);
    for (int r = 0; r < height; ++r) {
        float *row = (float*)((char*)array2D + r * size);  // WRONG: rows are not size bytes apart
        for (int c = 0; c < width; ++c)
            float element = row[c];
    }


Memory Management: Linear Memory
– Pitch (stride) – CORRECT:
    // host code
    float *array2D; size_t pitch;
    cudaMallocPitch((void**)&array2D, &pitch, width * sizeof(float), height);
    // device code
    for (int r = 0; r < height; ++r) {
        float *row = (float*)((char*)array2D + r * pitch);  // pitch = allocated row width in bytes
        for (int c = 0; c < width; ++c)
            float element = row[c];
    }


Memory Management: Linear Memory
– Pitch (stride) – why?
    Allocation using pitch functions appropriately pads memory for efficient transfer and copy
    Width of allocated rows may exceed width*sizeof(float)
    True width given by pitch
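The padding idea can be tried out on the host without a GPU. The sketch below is a plain C model, not CUDA code: `padded_pitch`, `alloc_pitched` and `element_at` are hypothetical helper names, and the alignment value is an assumption passed in by the caller (on a real device the pitch is chosen by the allocation function and should always be read back rather than computed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Round the requested row width up to a multiple of `alignment` bytes.
   The alignment is an assumption of this model, not a CUDA constant. */
static size_t padded_pitch(size_t width_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

/* Allocate `height` rows of floats and report the row stride in bytes,
   mimicking the way a pitch allocation returns both pointer and pitch. */
static float *alloc_pitched(size_t width, size_t height,
                            size_t alignment, size_t *pitch)
{
    assert(pitch != NULL);
    *pitch = padded_pitch(width * sizeof(float), alignment);
    return (float *)calloc(height, *pitch);
}

/* Address of element (r, c): rows are `pitch` bytes apart,
   which may be more than width * sizeof(float). */
static float *element_at(float *base, size_t pitch, size_t r, size_t c)
{
    return (float *)((char *)base + r * pitch) + c;
}
```

With 10 floats per row and an assumed 64-byte alignment, each row needs 40 bytes but rows start 64 bytes apart, so stepping by `width * sizeof(float)` would land inside the padding of the previous row.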


Memory Management: CUDA Arrays
– CUDA Runtime API
    Declare: cudaArray*
    Channel: cudaChannelFormatDesc, cudaCreateChannelDesc<TYPE>
    Allocate: cudaMallocArray
    Copy (from linear): cudaMemcpy2DToArray
    Free: cudaFreeArray


Memory Management: CUDA Arrays
– CUDA Driver API
    Declare: CUarray
    Channel: CUDA_ARRAY_DESCRIPTOR object
    Allocate: cuArrayCreate
    Copy (from linear): CUDA_MEMCPY2D object
    Free: cuArrayDestroy


Memory Management: various other functions to copy from
– Linear memory to CUDA arrays
– Host to constant memory
– See Reference Manual


Texture Management
– Run-time API: texture type derived from
    struct textureReference {
        int normalized;
        enum cudaTextureFilterMode filterMode;
        enum cudaTextureAddressMode addressMode[3];
        struct cudaChannelFormatDesc channelDesc;
    };
– normalized: 0: false, otherwise true


Texture Management
– filterMode:
    cudaFilterModePoint: no filtering, returned value is of nearest texel
    cudaFilterModeLinear: filters 2/4/8 neighbors for 1D/2D/3D texture, floats only
– addressMode: (x,y,z)
    cudaAddressModeClamp
    cudaAddressModeWrap: normalized coordinates only
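To make the two filter modes and two address modes concrete, here is a small host-side model of a 1D texture fetch in plain C. It is only a sketch under stated assumptions: all names are hypothetical, the coordinate `u` is normalized so that both address modes apply, texel centres are assumed to sit at `i + 0.5`, and the reduced precision that real texture hardware uses for the interpolation weight is ignored.

```c
#include <assert.h>

/* Hypothetical names: this models the behaviour, it is not the CUDA API. */
enum address_mode { ADDRESS_CLAMP, ADDRESS_WRAP };

/* floor() for floats without needing the math library. */
static int floor_to_int(float x)
{
    int i = (int)x;                       /* truncates toward zero */
    return ((float)i > x) ? i - 1 : i;
}

/* Map a possibly out-of-range texel index into [0, n-1]. */
static int resolve_index(int i, int n, enum address_mode mode)
{
    assert(n > 0);
    if (mode == ADDRESS_WRAP) {
        int m = i % n;                    /* C remainder may be negative */
        return m < 0 ? m + n : m;
    }
    if (i < 0) return 0;                  /* clamp to the first texel */
    if (i >= n) return n - 1;             /* clamp to the last texel */
    return i;
}

/* Point mode: return the texel that contains coordinate u * n. */
static float sample_point(const float *texels, int n, float u,
                          enum address_mode mode)
{
    return texels[resolve_index(floor_to_int(u * (float)n), n, mode)];
}

/* Linear mode in 1D: blend the 2 texels on either side of u * n. */
static float sample_linear(const float *texels, int n, float u,
                           enum address_mode mode)
{
    float xb = u * (float)n - 0.5f;       /* shift so centres are integers */
    int i = floor_to_int(xb);
    float alpha = xb - (float)i;          /* weight of the right neighbour */
    float left = texels[resolve_index(i, n, mode)];
    float right = texels[resolve_index(i + 1, n, mode)];
    return (1.0f - alpha) * left + alpha * right;
}
```

For texels {10, 20, 30, 40}, a linear fetch at `u = 0.25` falls halfway between the first two texel centres and blends them equally. A fetch at `u = 0` shows the address modes diverging: clamp reuses the first texel as its own left neighbour, while wrap blends in the last texel.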


Texture Management
– channelDesc: texel type
    struct cudaChannelFormatDesc {
        int x,y,z,w;
        enum cudaChannelFormatKind f;
    };
    x,y,z,w: #bits per component
    f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat
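A quick way to read a channel descriptor is to ask how many components a texel has and how many bytes it occupies. The sketch below mirrors the slide's struct in plain C so it runs on the host; the `model_` names and the two helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of the descriptor on the slide; "model_" names are hypothetical. */
enum model_format_kind {
    MODEL_KIND_SIGNED,
    MODEL_KIND_UNSIGNED,
    MODEL_KIND_FLOAT
};

struct model_channel_desc {
    int x, y, z, w;                 /* bits per component, 0 if absent */
    enum model_format_kind f;       /* how the bits are interpreted */
};

/* Number of components actually present in a texel. */
static int component_count(struct model_channel_desc d)
{
    return (d.x > 0) + (d.y > 0) + (d.z > 0) + (d.w > 0);
}

/* Size of one texel in bytes. */
static size_t texel_bytes(struct model_channel_desc d)
{
    int bits = d.x + d.y + d.z + d.w;
    assert(bits > 0 && bits % 8 == 0);
    return (size_t)bits / 8;
}
```

A single `float` texel is described as 32 bits in `x`, nothing in `y`, `z`, `w`, and kind float, giving 4 bytes. A four-component float texel uses 32 bits in every field and occupies 16 bytes.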


Texture Management
– Run-time API: texture type derived from struct textureReference (fields as above)
– The normalized, filterMode and addressMode settings apply only to texture references bound to CUDA arrays


Texture Management
– Binding a texture reference to a texture
    Runtime API:
    – Linear memory: cudaBindTexture
    – CUDA Array: cudaBindTextureToArray
    Driver API:
    – Linear memory: cuTexRefSetAddress
    – CUDA Array: cuTexRefSetArray


Runtime API: debugging using the emulation mode
– No native debug support for device code
– Code should be compiled either for device emulation OR execution: mixing not allowed
– Device code is compiled for the host


Runtime API: debugging using the emulation mode
– Features
    Each CUDA thread is mapped to a host thread, plus one master thread
    Each thread gets 256KB on stack


Runtime API: debugging using the emulation mode
– Advantages
    Can use host debuggers
    Can use otherwise disallowed functions in device code, e.g. printf
    Device and host memory are both readable from either device or host


Runtime API: debugging using the emulation mode
– Advantages
    Any device or host specific function can be called from either device or host code
    Runtime detects incorrect use of synch functions


Runtime API: debugging using the emulation mode
– Some errors may still remain hidden
    Memory access errors
    Out of context pointer operations
    Incorrect outcome of warp vote functions as warp size is 1 in emulation mode
    Result of FP operations often different on host and device
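The warp vote point can be illustrated with a host-side model. The function below is a hypothetical stand-in for an "all threads agree" vote, written in plain C: it returns 1 for a thread only if every thread in that thread's warp has a non-zero predicate. The warp size is a parameter so the same data can be evaluated with a warp of 32 and with the warp of 1 that the slide attributes to emulation mode.

```c
#include <assert.h>

/* Model of an "all" vote as seen by thread `tid`.
   Threads are grouped into consecutive warps of `warp_size` threads. */
static int vote_all(const int *pred, int n_threads, int warp_size, int tid)
{
    assert(warp_size > 0 && tid >= 0 && tid < n_threads);
    int first = tid / warp_size * warp_size;   /* first thread of tid's warp */
    int last = first + warp_size;              /* one past the last thread */
    if (last > n_threads)
        last = n_threads;
    for (int i = first; i < last; ++i) {
        if (!pred[i])
            return 0;                          /* one dissenter fails the vote */
    }
    return 1;
}
```

If 31 of 32 threads have a true predicate, thread 0 sees a failed vote with a warp size of 32 but a passed vote with a warp size of 1, because in the second case it only consults itself. Code that branches on the vote can therefore take a different path under emulation than on the device.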


Driver API: Context management
– A context encapsulates all resources and actions performed within the driver API
– Almost all CUDA functions operate in a context, except those dealing with
    Device enumeration
    Context management


Driver API: Context management
– Each host thread can have only one current device context at a time
– Each host thread maintains a stack of current contexts
– cuCtxCreate()
    Creates a context
    Pushes it to the top of the stack
    Makes it the current context


Driver API: Context management
– cuCtxPopCurrent()
    Detaches the current context from the host thread – makes it "uncurrent"
    The context is now floating
    It can be pushed to any host thread's stack
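The per-thread stack and the notion of a floating context can be mimicked with a few lines of plain C. This is a bookkeeping model only: the struct and function names are hypothetical, nothing here talks to a driver, and the fixed stack depth is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 8   /* arbitrary limit for this sketch */

struct model_context {
    int id;
    int floating;           /* 1 while not attached to any thread */
};

struct model_thread {
    struct model_context *stack[MODEL_MAX_DEPTH];
    int depth;
};

/* The current context of a thread is the top of its stack. */
static struct model_context *current(const struct model_thread *t)
{
    return t->depth > 0 ? t->stack[t->depth - 1] : NULL;
}

/* Push a floating context onto a thread's stack, making it current. */
static void push_current(struct model_thread *t, struct model_context *c)
{
    assert(t->depth < MODEL_MAX_DEPTH);
    assert(c->floating);
    c->floating = 0;
    t->stack[t->depth++] = c;
}

/* Pop the current context; it becomes floating and can move elsewhere. */
static struct model_context *pop_current(struct model_thread *t)
{
    assert(t->depth > 0);
    struct model_context *c = t->stack[--t->depth];
    c->floating = 1;
    return c;
}

/* Creation initialises a context and immediately makes it current. */
static void create_context(struct model_thread *t,
                           struct model_context *c, int id)
{
    c->id = id;
    c->floating = 1;
    push_current(t, c);
}
```

Creating two contexts on one thread leaves the second one current. Popping it exposes the first again, and the popped context can then be pushed onto a different thread's stack, which is the hand-over pattern the slide describes.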


Driver API: Context management
– Each context has a usage count
    cuCtxCreate creates a context with a usage count of 1
    cuCtxAttach increments the usage count
    cuCtxDetach decrements the usage count


Driver API: Context management
– A context is destroyed when its usage count reaches 0.
    cuCtxDetach, cuCtxDestroy
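The usage count behaves like ordinary reference counting. The plain C sketch below models only the counting rule stated on the slides (create sets the count to 1, attach increments, detach decrements, zero means destroyed); the names are hypothetical and no driver call is involved.

```c
#include <assert.h>

struct model_ctx {
    int usage_count;
    int alive;              /* 0 once the context has been destroyed */
};

/* A new context starts with a usage count of 1. */
static void ctx_create(struct model_ctx *c)
{
    c->usage_count = 1;
    c->alive = 1;
}

/* Another user (for example a library) registers its interest. */
static void ctx_attach(struct model_ctx *c)
{
    assert(c->alive);
    ++c->usage_count;
}

/* A user is done; the last one out destroys the context. */
static void ctx_detach(struct model_ctx *c)
{
    assert(c->alive && c->usage_count > 0);
    if (--c->usage_count == 0)
        c->alive = 0;
}
```

An application that creates a context and hands it to a library which attaches to it needs two detach calls before the context goes away, so neither side can destroy it while the other still depends on it.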


Driver API: Module management
– Modules are dynamically loadable packages of device code and data output by nvcc
    Similar to DLLs


Driver API: Module management
– Dynamically loading a module and accessing its contents
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "myModule.cubin");
    CUfunction cuFunction;
    cuModuleGetFunction(&cuFunction, cuModule, "myKernel");


Driver API: Execution control
– Set kernel parameters
    cuFuncSetBlockShape()
    – #threads/block for the function
    – How thread IDs are assigned
    cuFuncSetSharedSize()
    – Size of shared memory
    cuParam*()
    – Specify other parameters for next kernel launch
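The block shape determines both the number of threads per block and how a thread's (x, y, z) index maps to a linear thread ID. The plain C sketch below reproduces the usual CUDA mapping, with x varying fastest; the function names are hypothetical and the block dimensions are passed in directly instead of being set through the driver.

```c
#include <assert.h>

/* Threads in a block of shape (dx, dy, dz). */
static int threads_per_block(int dx, int dy, int dz)
{
    assert(dx > 0 && dy > 0 && dz > 0);
    return dx * dy * dz;
}

/* Linear thread ID of index (x, y, z) in a block of shape (dx, dy, dz):
   x varies fastest, then y, then z. */
static int thread_id(int x, int y, int z, int dx, int dy, int dz)
{
    assert(x >= 0 && x < dx);
    assert(y >= 0 && y < dy);
    assert(z >= 0 && z < dz);
    return x + y * dx + z * dx * dy;
}
```

For a block shape of (16, 8, 2) there are 256 threads, and the thread with index (3, 2, 1) gets ID 3 + 2*16 + 1*16*8 = 163. Because consecutive IDs are grouped into warps, the choice of block shape also decides which threads end up in the same warp.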


Driver API: Execution control
– Launch kernel
    cuLaunch(), cuLaunchGrid()
– Example 4.5.3.5 in Prog Guide


Final Projects

Ideas?
– DES cracker
– Image editor
    Resize and smooth an image
    Gamut mapping?
– 3D Shape matching

All for today

Next time
– Memory and Instruction optimizations

On to exercises!
