Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
GPU architecture review
CUDA
First of two or three dedicated classes
Acknowledgements
Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work:
http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
GPUs are: parallel, multithreaded, many-core
GPUs have: tremendous computational horsepower, high memory bandwidth
GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!)
Transistors are devoted to: processing
Not: data caching, flow control
GPU Architecture Review
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Transistor Usage
Slides from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Threading Hardware in G80
Sources
Slides by ECE 498 AL: Programming Massively Parallel Processors: Wen-mei Hwu; John Nickolls, NVIDIA
[Diagram: the graphics pipeline: 3D Application or Game → 3D API (OpenGL or Direct3D) → GPU Front End → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; vertex and fragment stages transform pre-transformed vertices and fragments; the CPU-GPU boundary is AGP/PCIe]
Fixed-function pipeline
[Diagram: the same pipeline with Programmable Vertex Processors and Programmable Fragment Processors replacing the fixed-function vertex and fragment stages]
Programmable pipeline
[Diagram: the same pipeline with a Unified Vertex, Fragment, Geometry Processor replacing the separate programmable stages]
Unified Programmable pipeline
General Diagram (6800/NV40)
TurboCache
Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units
Why NV40 series was better
Massive parallelism
Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
Computation power: 222 million transistors; first to comply with Microsoft's DirectX 9 spec
Dynamic Branching in pixel shaders
Dynamic Branching
Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz; this is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won't get much battery-based runtime for a 3D game
GeForce 7800 GTX Parallelism
[Diagram: 8 vertex engines → triangle setup/raster and Z-cull → shader instruction dispatch → 24 pixel shaders → fragment crossbar → 16 raster operation pipelines → 4 memory partitions]
G80 – Graphics Mode
[Diagram: Host → Input Assembler → Vtx / Geom / Pixel Thread Issue → Setup / Rstr / ZCull → Thread Processor managing an array of SP pairs, each with an L1 cache and texture fetch (TF) unit, backed by L2 caches and frame buffer (FB) partitions]
The future of GPUs is programmable processing, so build the architecture around the processor
G80 CUDA mode – A Device Example
Processors execute computing threads
New operating mode / HW interface for computing
[Diagram: Host → Input Assembler → Thread Execution Manager → array of multiprocessors, each with a Parallel Data Cache and texture unit, with load/store paths to Global Memory]
The GPU has evolved into a very flexible and powerful processor:
It's programmable using high-level languages
It supports 32-bit floating point precision
It offers lots of GFLOPS
GPU in every PC and workstation
[Chart: GFLOPS over time for NV30 (GeForce FX 5800), NV35 (GeForce FX 5950 Ultra), NV40 (GeForce 6800 Ultra), G70 (GeForce 7800 GTX), G71 (GeForce 7900 GTX), and G80 (GeForce 8800 GTX)]
Why Use the GPU for Computing ?
What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
So, more transistors can be devoted to data processing rather than data caching and flow control
The fast-growing video game industry exerts strong economic pressure that forces constant innovation
[Diagram: CPU (large control logic and cache, few ALUs, DRAM) vs. GPU (many ALUs, small control and cache, DRAM)]
What is (Historical) GPGPU ?
General Purpose computation using GPU and graphics API in applications other than 3D graphics
GPU accelerates critical path of application
Data parallel algorithms leverage GPU attributes: large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating point (FP) computation
Applications – see GPGPU.org: game effects (FX), physics, image processing
Previous GPGPU Constraints
Dealing with graphics API: working with the corner cases of the graphics API
Addressing modes: limited texture size/dimension
Shader capabilities: limited outputs
Instruction sets: lack of integer & bit ops
Communication limited: between pixels; scatter a[i] = p
[Diagram: the fragment-program model: per-thread input and temp registers, per-shader constants, and per-context textures feed a Fragment Program, which writes output registers to FB memory]
An Example of Physical Reality Behind CUDA
[Diagram: CPU (host) connected to a GPU with local DRAM (device)]
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions
[Diagram: threads 0-7, each indexed by threadID, all executing:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]
[Diagram: Thread Block 0 through Thread Block N - 1, each block running the same body:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
Threads in different blocks cannot cooperate
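As a hedged illustration of that cooperation (the kernel name, block size, and reversal task are made up for this sketch, not taken from the slides):

// Hypothetical kernel: each 256-thread block reverses its slice of `in`
// using per-block shared memory and a barrier.
__global__ void reverseSlice(const float* in, float* out)
{
    __shared__ float tile[256];                      // per-block shared memory (assumes blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    tile[threadIdx.x] = in[i];                       // each thread loads one element
    __syncthreads();                                 // barrier: the whole block has loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read a value loaded by another thread
}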
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution, for hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
[Diagram: the Host launches Kernel 1 on Grid 1 (a 3x2 array of Blocks) and Kernel 2 on Grid 2 on the Device; Block (1, 1) is expanded into a 5x3 array of Threads. Courtesy: NVIDIA]
Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data (see the sketch below)
Image processing, solving PDEs on volumes, …
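For example, an image-processing kernel can map a 2D grid of 2D blocks directly onto pixels (an illustrative sketch, not from the slides):

// One thread per pixel; launched with 2D blocks and a 2D grid.
__global__ void addBrightness(unsigned char* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)                     // guard against partial blocks
        img[y * width + x] = min(img[y * width + x] + 10, 255);
}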
[Diagram: grid of blocks and expanded block of threads, as above. Courtesy: NVIDIA]
CUDA Device Memory Space Overview
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: (Device) Grid containing Blocks (0, 0) and (1, 0); each block has Shared Memory and per-thread Registers and Local Memory; the grid shares Global, Constant, and Texture Memory with the Host]
Host: the host can R/W global, constant, and texture memories
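A rough sketch of how those spaces appear in CUDA C (names and sizes are illustrative; texture access is omitted):

__constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)
__device__   float table[1024];        // per-grid global memory, statically allocated

__global__ void memorySpaces(float* buf)   // buf: global memory from cudaMalloc()
{
    __shared__ float tile[128];        // per-block shared memory (assumes blockDim.x <= 128)
    float tmp;                         // registers (spills go to per-thread local memory)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp = buf[i] * coeff[i % 16] + table[i % 1024];
    tile[threadIdx.x] = tmp;
    __syncthreads();
    buf[i] = tile[threadIdx.x];
}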
Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory: main means of communicating R/W data between host and device; contents visible to all threads
Texture and constant memories: constants initialized by host; contents visible to all threads
[Diagram: the same device memory hierarchy as above. Courtesy: NVIDIA]
[Diagram: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) is expanded into Threads with 3D indices, Thread (0, 0, 0) through Thread (3, 1, 0) and (0, 0, 1) through (3, 0, 1). Courtesy: NVIDIA]
Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
Image processing, solving PDEs on volumes, …
CUDA Memory Model Overview
Global memory: main means of communicating R/W data between host and device; contents visible to all threads; long latency access
We will focus on global memory for now
[Diagram: Host and a Grid of Blocks; each Block has Shared Memory and per-thread Registers; all Blocks access Global Memory]
Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
Available in laptops, desktops, and clusters
GPU parallelism is doubling every year
Programming model scales transparently
Programmable in C with CUDA tools
GeForce 8800
Tesla S870
Tesla D870
Single-Program Multiple-Data (SPMD)
CUDA integrated CPU + GPU application C program
Serial C code executes on the CPU
Parallel kernel C code executes on GPU thread blocks
[Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args) on Grid 0, more CPU serial code, then GPU parallel kernel KernelB<<< nBlk, nTid >>>(args) on Grid 1]
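A minimal sketch of that structure (the kernels and arguments here are placeholders; only the <<< >>> launch syntax is the real CUDA mechanism):

__global__ void KernelA(float* d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] += 1.0f; }
__global__ void KernelB(float* d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] *= 2.0f; }

void run(float* d_data, int nBlk, int nTid)
{
    // ... serial C code on the CPU ...
    KernelA<<<nBlk, nTid>>>(d_data);   // Grid 0: parallel kernel on the GPU
    // ... more serial C code on the CPU ...
    KernelB<<<nBlk, nTid>>>(d_data);   // Grid 1: second kernel launch
    cudaDeviceSynchronize();           // wait for the device before using the results
}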
[Diagram: Host, Kernel 1 and Kernel 2 launching Grid 1 and Grid 2 of Blocks on the Device; Block (1, 1) is expanded into Threads with 3D indices. Courtesy: NVIDIA]
Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share global memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution using a barrier
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
CUDA Thread Block
Programmer declares a (Thread) Block:
Block size: 1 to 512 concurrent threads
Block shape: 1D, 2D, or 3D
Block dimensions in threads
All threads in a Block execute the same thread program
Threads share data and synchronize while doing their share of the work
Threads have thread id numbers within the Block
[Diagram: a CUDA Thread Block with thread IDs 0 1 2 3 … m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Diagram: Streaming Processor Array of Texture Processor Clusters (TPCs); each TPC contains a TEX unit and two SMs; each SM has an Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
SPA: Streaming Processor Array (variable across GeForce 8-series, 8 in GeForce 8800)
TPC: Texture Processor Cluster (2 SM + TEX)
SM: Streaming Multiprocessor (8 SP); multi-threaded processor core; fundamental processing unit for a CUDA thread block
SP: Streaming Processor; scalar ALU for a single CUDA thread
CUDA Processor Terminology
Streaming Multiprocessor (SM)
8 Streaming Processors (SP)
2 Super Function Units (SFU)
Multi-threaded instruction dispatch
1 to 512 threads active
Shared instruction fetch per 32 threads
Cover latency of texture/memory loads
20+ GFLOPS
16 KB shared memory
Texture and global memory access
[Diagram: a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
G80 Thread Computing Pipeline
Processors execute computing threads
Alternative operating mode specifically for computing
[Diagram: in compute mode, the Host feeds an Input Assembler and a Thread Execution Manager that generates thread grids based on kernel calls; the SMs have Parallel Data Caches and texture units, with load/store paths to Global Memory]
The future of GPUs is programmable processing, so build the architecture around the processor
[Diagram: the G80 graphics-mode pipeline (Host → Input Assembler → Vtx / Geom / Pixel Thread Issue → Setup / Rstr / ZCull → SP arrays with L1/TF, L2, FB) shown for comparison]
Thread Life Cycle in HW
Grid is launched on the SPA
Thread Blocks are serially distributed to all the SMs
Potentially more than one Thread Block per SM
Each SM launches Warps of Threads
2 levels of parallelism
SM schedules and executes Warps that are ready to run
As Warps and Thread Blocks complete, resources are freed and the SPA can distribute more Thread Blocks
[Diagram: Host, Kernel 1 and Kernel 2 launching Grid 1 and Grid 2 of Blocks on the Device; Block (1, 1) is expanded into a 5x3 array of Threads]
SM Executes Blocks
Threads are assigned to SMs in Block granularity
Up to 8 Blocks to each SM, as resources allow
An SM in G80 can take up to 768 threads
Could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
[Diagram: SM 0 and SM 1, each with MT IU, SP, and Shared Memory, holding Blocks of threads t0 t1 t2 … tm; both share TF, Texture L1, L2, and Memory]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps
This is an implementation decision, not part of the CUDA programming model
Warps are scheduling units in an SM
If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
Each Block is divided into 256/32 = 8 Warps
There are 8 * 3 = 24 Warps
[Diagram: Block 1 Warps and Block 2 Warps (each warp is threads t0 t1 t2 … t31) queued on a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
[Diagram: the SM multithreaded Warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
SM Instruction Buffer – Warp Scheduling
Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot
Issue one "ready-to-go" warp instruction/cycle from any warp's instruction buffer slot
Operand scoreboarding used to prevent hazards
Issue selection based on round-robin/age of warp
SM broadcasts the same instruction to the 32 threads of a Warp
[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF / C$ L1 / Shared Mem → Operand Select → MAD and SFU units]
Scoreboarding
All register operands of all instructions in the Instruction Buffer are scoreboarded
Instruction becomes ready after the needed values are deposited; prevents hazards; cleared instructions are eligible for issue
Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue; allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
Granularity Considerations
For Matrix Multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?
For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, thus there will be only 128 threads in each SM!
There are 8 warps but each warp is only half full.
For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
There are 16 warps available for scheduling in each SM
Each warp spans four slices in the y dimension
For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
There are 24 warps available for scheduling in each SM
Each warp spans two slices in the y dimension
Memory Hardware in G80
CUDA Device Memory Space: Review
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: the device memory hierarchy as above: Grid, Blocks with Shared Memory, per-thread Registers and Local Memory, plus Global, Constant, and Texture Memory]
Host: the host can R/W global, constant, and texture memories
Parallel Memory Sharing
Local Memory: per-thread
Private per thread; auto variables, register spill
Shared Memory: per-Block
Shared by threads of the same block; inter-thread communication
Global Memory: per-application
Shared by all threads; inter-Grid communication
[Diagram: a Thread with its Local Memory; a Block with its Shared Memory; sequential Grids in time (Grid 0, Grid 1, …) sharing Global Memory]
SM Memory Architecture
Threads in a block share data & results
In Memory and Shared Memory
Synchronize at barrier instruction
Per-Block Shared Memory Allocation keeps data close to the processor
[Diagram: SM 0 and SM 1 holding Blocks of threads t0 t1 t2 … tm, each SM with MT IU, SP, and Shared Memory, sharing TF, Texture L1, L2, and Memory]
Courtesy: John Nickolls, NVIDIA
SM Register File
Register File (RF): 32 KB (8K entries) for each SM in G80
TEX pipe can also read/write the RF; 2 SMs share 1 TEX
Load/Store pipe can also read/write the RF
[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF / C$ L1 / Shared Mem → Operand Select → MAD and SFU units]
Programmer View of Register File
There are 8192 registers in each SM in G80
This is an implementation decision, not part of CUDA
Registers are dynamically partitioned across all blocks assigned to the SM
Once assigned to a Block, a register is not accessible by threads in other Blocks
[Figure: the register file split among 4 blocks vs. 3 blocks]
Matrix Multiplication Example
If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
Each Block requires 10*256 = 2560 registers
8192 = 3 * 2560 + change
So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases the use of registers by 1?
Each Block now requires 11*256 = 2816 registers
8192 < 2816 * 3, so only two Blocks can run on an SM
More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers
One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
This allows for finer grain threading than traditional CPU threading models.
The compiler can trade off between instruction-level parallelism and thread-level parallelism
Let’s program this thing!
GPU Computing History
2001/2002 – researchers see the GPU as a data-parallel coprocessor
The GPGPU field is born
2007 – NVIDIA releases CUDA
CUDA – Compute Unified Device Architecture
GPGPU shifts to GPU Computing
2008 – Khronos releases the OpenCL specification
CUDA Abstractions
A hierarchy of thread groups
Shared memories
Barrier synchronization
CUDA Terminology
Host – typically the CPU
Code written in ANSI C
Device – typically the GPU (data-parallel)
Code written in extended ANSI C
Host and device have separate memories
CUDA Program: contains both host and device code
CUDA Terminology
Kernel – data-parallel function
Invoking a kernel creates lightweight threads on the device
Threads are generated and scheduled with hardware
Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
Executed N times in parallel by N different CUDA threads
[Code figure callouts: declaration specifier, execution configuration, thread ID]
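A minimal sketch tying those three callouts together (it follows the vector-add pattern of the CUDA C Programming Guide; the names here are illustrative):

__global__ void VecAdd(const float* A, const float* B, float* C)   // __global__ is the declaration specifier
{
    int i = threadIdx.x;               // thread ID
    C[i] = A[i] + B[i];
}

// In host code, the execution configuration goes between <<< and >>>:
//   VecAdd<<<1, N>>>(d_A, d_B, d_C);   // 1 block of N threads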
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Grid – one or more thread blocks; 1D or 2D
Block – array of threads; 1D, 2D, or 3D
Each block in a grid has the same number of threads
Each thread in a block can:
Synchronize
Access shared memory
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block – 1D, 2D, or 3D
Example: index into vector, matrix, volume
Thread Hierarchies
Thread ID: scalar thread identifier
Thread Index: threadIdx
1D: Thread ID == Thread Index
2D block with size (Dx, Dy): Thread ID of index (x, y) == x + y * Dx
3D block with size (Dx, Dy, Dz): Thread ID of index (x, y, z) == x + y * Dx + z * Dx * Dy
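In code, the same flattening can be written in terms of threadIdx and blockDim, with the block dimensions playing the role of Dx, Dy, Dz (a sketch):

__device__ int flatThreadId()
{
    // 3D block of size (Dx, Dy, Dz) = (blockDim.x, blockDim.y, blockDim.z)
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}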
Thread Hierarchies
[Figure: a single thread block as a 2D block with a 2D index]
Thread Hierarchies
Thread Block: group of threads
G80 and GT200: up to 512 threads; Fermi: up to 1024 threads
Reside on the same processor core; share the memory of that core
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block Index: blockIdx
Dimension: blockDim (1D or 2D)
Thread Hierarchies
2D Thread Block: 16x16 threads per block
Thread Hierarchies
Example: N = 32
16x16 threads per block (independent of N)
threadIdx ([0, 15], [0, 15])
2x2 thread blocks in the grid
blockIdx ([0, 1], [0, 1])
blockDim = 16
i = [0, 1] * 16 + [0, 15]
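In kernel code, that example corresponds to the usual global-index computation (a sketch matching the numbers above; the kernel name is made up):

// Launched as: index32<<<dim3(2, 2), dim3(16, 16)>>>(d_out);
__global__ void index32(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // [0, 1] * 16 + [0, 15]  ->  0..31
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // same for the y dimension
    out[j * 32 + i] = i;                             // one thread per element of a 32x32 domain
}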
Thread Hierarchies
Thread blocks execute independently
In any order: parallel or series
Scheduled in any order by any number of cores
Allows code to scale with core count
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Thread Hierarchies
Threads in a block:
Share (limited) low-latency memory
Synchronize execution, to coordinate memory accesses
__syncthreads()
Barrier: threads in the block wait until all threads reach it
Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Host can transfer to/from device:
Global memory
Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
cudaMalloc(): allocate global memory on the device
cudaFree(): free memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
cudaMalloc() takes a pointer to device memory and a size in bytes
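The referenced code figure is not reproduced above; a minimal sketch of the calls it annotates (the matrix name and width are illustrative):

#include <cuda_runtime.h>

void allocateExample(int width)
{
    float* Md = NULL;                              // pointer to device (global) memory
    size_t size = width * width * sizeof(float);   // size in bytes
    cudaMalloc((void**)&Md, size);                 // allocate global memory on the device
    // ... launch kernels that use Md ...
    cudaFree(Md);                                  // release the allocation
}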
CUDA Memory Transfers
cudaMemcpy(): memory transfer
Host to host
Host to device
Device to host
Device to device
[Diagram: Host and Device with its Global Memory]
Does this remind you of VBOs in OpenGL?
Note: plain cudaMemcpy() transfers are blocking; asynchronous transfers use cudaMemcpyAsync()
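A sketch of the corresponding calls; the last argument selects the direction, and the destination pointer comes first (names are illustrative):

void copyExample(float* M, float* P, float* Md, float* Pd, size_t size)
{
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host to device
    // ... launch a kernel that reads Md and writes Pd ...
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // device to host
}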
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
[Code figure callouts: a host-to-device cudaMemcpy() with the destination (device pointer) first and the source (host pointer) second]
Matrix Multiply
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
P = M * N
Assume M and N are square for simplicity
Is this data-parallel?
Matrix Multiply
1,000 x 1,000 matrix
1,000,000 dot products
Each is 1,000 multiplies and 1,000 adds
Matrix Multiply: CPU Implementation
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
Step 1: add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
[Code figure callouts: allocate input, allocate output, read back from device]
Does this remind you of GPGPU with GLSL?
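The referenced code is not reproduced above; a hedged reconstruction of Step 1, following the structure the callouts describe (parameter names may differ from the textbook's):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc((void**)&Md, size);                      // allocate input
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);                      // allocate input
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);                      // allocate output

    // Step 2/3: kernel definition and invocation go here

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);    // read back from device
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);           // free device matrices
}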
Matrix Multiply
Step 2: implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Accessing a matrix, so using a 2D block
Each kernel computes one output
Where did the two outer for loops in the CPU implementation go?
No locks or synchronization, why?
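A hedged reconstruction of the kernel those callouts describe: one output element per thread, indexed by a 2D thread ID within a single block (the simple, non-tiled version):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;                      // 2D thread index within the block
    int ty = threadIdx.y;

    float Pvalue = 0;                          // each thread computes one element of Pd
    for (int k = 0; k < width; ++k)            // the outer i/j loops became the thread grid
    {
        float Melement = Md[ty * width + k];   // walk a row of Md
        float Nelement = Nd[k * width + tx];   // walk a column of Nd
        Pvalue += Melement * Nelement;
    }
    Pd[ty * width + tx] = Pvalue;              // no locks needed: each thread writes its own element
}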
Matrix Multiply
Step 3: invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
One block with width by width threads
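A hedged reconstruction of the launch, matching the callout (a single block of width x width threads); it slots into MatrixMulOnDevice() where the Step 2/3 comment appears in the earlier sketch:

dim3 dimGrid(1, 1);            // one block
dim3 dimBlock(width, width);   // width x width threads in that block
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);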
Matrix Multiply
One Block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread:
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Diagram: Grid 1, Block 1: Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd (e.g. the dot product of (3, 2, 5, 4) with (2, 4, 2, 6) = 48); matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?