General-Purpose Computation on General-Purpose Computation on Graphics HardwareGraphics Hardware
David Luebke University of Virginia
Course IntroductionCourse Introduction
• The GPU on commodity video cards has evolved into an extremely flexible and powerful processor– Programmability– Precision– Power
• We are interested in how to harness that power for general-purpose computation
Motivation: Computational PowerMotivation: Computational Power
• GPUs are fast…– 3 GHz Pentium4 theoretical: 6 GFLOPS, 5.96 GB/sec peak– GeForceFX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak– GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak– GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec
peak (reference here)• That’s almost 100 times faster than a 3 Ghz Pentium4!
• GPUs are getting faster, faster– CPUs: annual growth 1.5× decade growth 60× – GPUs: annual growth > 2.0× decade growth > 1000
Courtesy Kurt Akeley,Ian Buck & Tim Purcell, GPU Gems (see course notes)
Motivation:Computational PowerMotivation:Computational Power
Courtesy Naga Govindaraju
GPU
CPU
Motivation:Computational PowerMotivation:Computational Power
Courtesy Ian Buck
GPU
GFL
OPS
multiplies per second
NVIDIA NV30, 35, 40
ATI R300, 360, 420
Pentium 4
July 01 Jan 02 July 02 Jan 03 July 03 Jan 04
An Aside: Computational PowerAn Aside: Computational Power
• Why are GPUs getting faster so fast?– Arithmetic intensity: the specialized nature of GPUs makes it
easier to use additional transistors for computation not cache– Economics: multi-billion dollar video game market is a
pressure cooker that drives innovation
Motivation:Flexible and preciseMotivation:Flexible and precise
• Modern GPUs are deeply programmable– Programmable pixel, vertex, video engines– Solidifying high-level language support
• Modern GPUs support high precision– 32 bit floating point throughout the pipeline– High enough for many (not all) applications
Motivation:The Potential of GPGPUMotivation:The Potential of GPGPU
• The power and flexibility of GPUs makes them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
The Problem:Difficult To UseThe Problem:Difficult To Use
• GPUs designed for and driven by video games– Programming model is unusual & tied to computer graphics– Programming environment is tightly constrained
• Underlying architectures are:– Inherently parallel– Rapidly evolving (even in basic feature set!)– Largely secret
• Can’t simply “port” code written for the CPU!
GPU Fundamentals:The Graphics PipelineGPU Fundamentals:The Graphics Pipeline
• A simplified graphics pipeline– Note that pipe widths vary– Many caches, FIFOs, and so on not shown
GPUCPU
ApplicationApplication TransformTransform RasterizerRasterizer ShadeShade VideoMemory
(Textures)
VideoMemory
(Textures)VerticesVertices
(3D)(3D)Xformed,Xformed,
LitLitVerticesVertices
(2D)(2D)
FragmentsFragments(pre-pixels)(pre-pixels)
FinalFinalpixelspixels
(Color, Depth)(Color, Depth)
Graphics StateGraphics State
Render-to-textureRender-to-texture
GPU Fundamentals:The Modern Graphics PipelineGPU Fundamentals:The Modern Graphics Pipeline
• Programmable vertex processor!
• Programmable pixel processor!
GPUCPU
ApplicationApplication VertexProcessor
VertexProcessor RasterizerRasterizer Pixel
ProcessorPixel
ProcessorVideo
Memory(Textures)
VideoMemory
(Textures)VerticesVertices
(3D)(3D)Xformed,Xformed,
LitLitVerticesVertices
(2D)(2D)
FragmentsFragments(pre-pixels)(pre-pixels)
FinalFinalpixelspixels
(Color, Depth)(Color, Depth)
Graphics StateGraphics State
Render-to-textureRender-to-texture
VertexProcessor
VertexProcessor
FragmentProcessorFragmentProcessor
GPU Pipeline: TransformGPU Pipeline: Transform
• Vertex Processor (multiple operate in parallel)– Transform from “world space” to “image space”– Compute per-vertex lighting
GPU Pipeline: RasterizerGPU Pipeline: Rasterizer
• Rasterizer– Convert geometric rep. (vertex) to image rep. (fragment)
• Fragment = image fragment– Pixel + associated data: color, depth, stencil, etc.
– Interpolate per-vertex quantities across pixels
GPU Pipeline: ShadeGPU Pipeline: Shade
• Fragment Processors (multiple in parallel)– Compute a color for each pixel– Optionally read colors from textures (images)
Importance of Data ParallelismImportance of Data Parallelism
• GPU: Each vertex / fragment is independent– Temporary registers are zeroed– No static data– No read-modify-write buffers
• Data parallel processing– Best for ALU-heavy architectures: GPUs
• Multiple vertex & pixel pipelines– Hide memory latency (with more computation)
Courtesy of Ian Buck
Arithmetic IntensityArithmetic Intensity
• Lots of ops per word transferred
• GPGPU demands high arithmetic intensity for peak performance– Ex: solving systems of linear equations– Physically-based simulation on lattices– All-pairs shortest paths
Courtesy of Pat Hanrahan
Data Streams & KernelsData Streams & Kernels
• Streams– Collection of records requiring similar computation
• Vertex positions, Voxels, FEM cells, etc.– Provide data parallelism
• Kernels– Functions applied to each element in stream
• Transforms, PDE, …– No dependencies between stream elements
• Encourage high arithmetic intensity
Courtesy of Ian Buck
Example: Simulation GridExample: Simulation Grid
• Common GPGPU computation style– Textures represent computational grids = streams
• Many computations map to grids– Matrix algebra– Image & Volume processing– Physical simulation– Global Illumination
• ray tracing, photon mapping, radiosity
• Non-grid streams can be mapped to grids
Stream ComputationStream Computation
• Grid Simulation algorithm– Made up of steps– Each step updates entire grid– Must complete before next step can begin
• Grid is a stream, steps are kernels– Kernel applied to each stream element
Scatter vs. GatherScatter vs. Gather
• Grid communication– Grid cells share information
Computational Resources InventoryComputational Resources Inventory
• Programmable parallel processors– Vertex & Fragment pipelines
• Rasterizer– Mostly useful for interpolating addresses (texture coordinates)
and per-vertex constants
• Texture unit– Read-only memory interface
• Render to texture– Write-only memory interface
Vertex ProcessorVertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather– Can change the location of current vertex– Cannot read info from other vertices– Can only read a small constant memory
• Future hardware enables gather!– Vertex textures
Fragment ProcessorFragment Processor
• Fully programmable (SIMD)
• Processes 4-vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter– No random access memory writes – Output address fixed to a specific pixel
• Typically more useful than vertex processor– More fragment pipelines than vertex pipelines– RAM read– Direct output
CPU-GPU AnalogiesCPU-GPU Analogies
• CPU programming is familiar– GPU programming is graphics-centric
• Analogies can aid understanding
CPU-GPU AnalogiesCPU-GPU Analogies
CPU GPU
Stream / Data Array = Texture
Memory Read = Texture Sample
CPU-GPU AnalogiesCPU-GPU Analogies
Loop body / kernel / algorithm step = Fragment Program
CPU GPU
FeedbackFeedback
• Each algorithm step depend on the results of previous steps
• Each time step depends on the results of the previous time step
CPU-GPU AnalogiesCPU-GPU Analogies
. .
.
Grid[i][j]= x; . . .
Array Write = Render to Texture
CPU GPU
GPU Simulation OverviewGPU Simulation Overview
• Analogies lead to implementation– Algorithm steps are fragment programs
• Computational “kernels”– Current state variables stored in textures– Feedback via render to texture
• One question: how do we invoke computation?
Invoking ComputationInvoking Computation
• Must invoke computation at each pixel– Just draw geometry!– Most common GPGPU invocation is a full-screen quad
Typical “Grid” ComputationTypical “Grid” Computation
• Initialize “view” (so that pixels:texels::1:1)
glMatrixMode(GL_MODELVIEW);glLoadIdentity();glMatrixMode(GL_PROJECTION);glLoadIdentity();glOrtho(0, 1, 0, 1, 0, 1);glViewport(0, 0, outTexResX, outTexResY);
• For each algorithm step:– Activate render-to-texture– Setup input textures, fragment program– Draw a full-screen quad (1x1)
Branching TechniquesBranching Techniques
• Fragment program branches can be expensive– No true fragment branching on NV3X & R3X0
• Better to move decisions up the pipeline– Replace with math– Occlusion Query– Domain decomposition– Z-cull– Pre-computation
Branching with OQBranching with OQ
• Use it for iteration terminationDo { // outer loop on CPU
BeginOcclusionQuery { // Render with fragment program that
// discards fragments that satisfy
// termination criteria} EndQuery
} While query returns > 0
• Can be used for subdivision techniques– Demo later
Domain DecompositionDomain Decomposition
• Avoid branches where outcome is fixed– One region is always true, another false– Separate FPs for each region, no branches
• Example: boundaries
Z-CullZ-Cull
• In early pass, modify depth buffer– Write depth=0 for pixels that should not be modified by later
passes– Write depth=1 for rest
• Subsequent passes– Enable depth test (GL_LESS)– Draw full-screen quad at z=0.5– Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Not available on NV3X– Depth replace disables ZCull
Pre-computationPre-computation
• Pre-compute anything that will not change every iteration!
• Example: arbitrary boundaries– When user draws boundaries, compute texture containing
boundary info for cells– Reuse that texture until boundaries modified– Combine with Z-cull for higher performance!
High-level Shading LanguagesHigh-level Shading LanguagesHigh-level Shading LanguagesHigh-level Shading Languages
• Cg, HLSL, & GLSlang– Cg:
• http://www.nvidia.com/cg
– HLSL:• http://msdn.microsoft.com/library/default.asp?url=/library/en-us/
directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
– GLSlang: • http://www.3dlabs.com/support/developer/ogl2/whitepapers/
index.html
float3 cPlastic = Cd * (cAmbi + cDiff) + Cs*cSpec
….MULR R0.xyz, R0.xxx, R4.xyxx;MOVR R5.xyz, -R0.xyzz;DP3R R3.x, R0.xyzz, R3.xyzz;…
GPGPU LanguagesGPGPU Languages
• Why do we want them?
– Make programming GPUs easier!• Don’t need to know OpenGL, DirectX, or ATI/NV extensions• Simplify common operations• Focus on the algorithm, not on the implementation
• Sh
– University of Waterloo– http://libsh.sourceforge.net– http://www.cgl.uwaterloo.ca
• Brook
– Stanford University– http://brook.sourceforge.net– http://graphics.stanford.edu
Brook: general purpose streaming languageBrook: general purpose streaming language
• stream programming model– enforce data parallel computing
• streams– encourage arithmetic intensity
• kernels
• C with stream extensions
• GPU = streaming coprocessor
system outlinesystem outline
.br
Brook source files
brcc
source to source compiler
brt
Brook run-time library
Brook language
streamsBrook language
streams
• streams– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
– similar to arrays, but…• index operations disallowed: position[i]• read/write stream operators
streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);
– encourage data parallelism
Brook language
kernelsBrook language
kernels
• kernels
– functions applied to streams• similar to for_all construct
kernel void foo (float a<>, float b<>, out float result<>) {
result = a + b;}
float a<100>;float b<100>;float c<100>;
foo(a,b,c);for (i=0; i<100; i++)
c[i] = a[i]+b[i];
Brook language
kernelsBrook language
kernels
• kernels
– functions applied to streams• similar to for_all construct
kernel void foo (float a<>, float b<>, out float result<>) {
result = a + b;}
– no dependencies between stream elements– encourage high arithmetic intensity
Brook language
kernelsBrook language
kernels
• kernels arguments– input/output streams– constant parameters– gather streams– iterator streams
kernel void foo (float a<>, float b<>, float t, float array[], iter float n<>, out float result<>) {
result = array[a] + t*b + n;}
float a<100>;float b<100>;float c<100>;float array<25>iter float n<100> = iter(0, 10);
foo(a,b,3.2f,array,n,c);
Brook language
kernelsBrook language
kernels• ray triangle intersection
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[], RayState oldraystate<>, GridTrilist trilist[], out Hit candidatehit<>) { float idx, det, inv_det; float3 edge1, edge2, pvec, tvec, qvec; if(oldraystate.state.y > 0) { idx = trilist[oldraystate.state.w].trinum; edge1 = tris[idx].v1 - tris[idx].v0; edge2 = tris[idx].v2 - tris[idx].v0; pvec = cross(ray.d, edge2); det = dot(edge1, pvec); inv_det = 1.0f/det; tvec = ray.o - tris[idx].v0; candidatehit.data.y = dot( tvec, pvec ) * inv_det; qvec = cross( tvec, edge1 ); candidatehit.data.z = dot( ray.d, qvec ) * inv_det; candidatehit.data.x = dot( edge2, qvec ) * inv_det; candidatehit.data.w = idx; } else { candidatehit.data = float4(0,0,0,-1); }}
Brook language
reductionsBrook language
reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>, reduce float r<>)
r += a;}
float a<100>;float r;
sum(a,r);r = a[0];for (int i=1; i<100; i++) r += a[i];
Brook language
reductionsBrook language
reductions
• reductions – associative operations only
(a+b)+c = a+(b+c)• sum, multiply, max, min, OR, AND, XOR• matrix multiply
Brook language
reductionsBrook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce function
reduce void sum (float a<>, reduce float r<>)
r += a;}
float a<20>;float r<5>;
sum(a,r); for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j];
Brook language
stream repeat & strideBrook language
stream repeat & stride
• kernel arguments of different shape
– resolved by repeat and stride
kernel void foo (float a<>, float b<>, out float result<>);
float a<20>;float b<5>;float c<10>;
foo(a,b,c);
foo(a[0], b[0], c[0])foo(a[2], b[0], c[1])foo(a[4], b[1], c[2])foo(a[6], b[1], c[3])foo(a[8], b[2], c[4])foo(a[10], b[2], c[5])foo(a[12], b[3], c[6])foo(a[14], b[3], c[7])foo(a[16], b[4], c[8])foo(a[18], b[4], c[9])
Brook language
matrix vector multiplyBrook language
matrix vector multiply
kernel void mul (float a<>, float b<>, out float result<>)
{ result = a*b; }
reduce void sum (float a<>, reduce float result<>)
{ result += a; }
float matrix<20,10>;float vector<1, 10>;float tempmv<20,10>;float result<20, 1>;
mul(matrix,vector,tempmv);sum(tempmv,result);
MV
V
V
T=
Brook language
matrix vector multiplyBrook language
matrix vector multiply
kernel void mul (float a<>, float b<>, out float result<>)
{ result = a*b; }
reduce void sum (float a<>, reduce float result<>)
{ result += a; }
float matrix<20,10>;float vector<1, 10>;float tempmv<20,10>;float result<20, 1>;
mul(matrix,vector,tempmv);sum(tempmv,result);
RT sum
Running BrookRunning Brook
• Compiling .br filesBrook CG CompilerVersion: 0.2 Built: Apr 24 2004, 18:11:59brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader ] foo.br -h help (print this message) -v verbose (print intermediate generated code) -n no codegen (just parse and reemit the input) -d debug (print cTool internal state) -k keep generated fragment program (in foo.cg) -t disable kernel call type checking -y emit code for ATI 4-output hardware -A enable address virtualization (experimental) -N deny support for kernels calling other kernels -o prefix prefix prepended to all output files -w workspace workspace size (16 - 2048, default 1024) -p shader cpu / ps20 / fp30 / cpumt (can specify multiple)
Running BrookRunning Brook
• BRT_RUNTIME selects platform
CPU Backend: BRT_RUNTIME = cpu
CPU Multithreaded Backend: BRT_RUNTIME = cpumt
NVIDIA NV30 Backend: BRT_RUNTIME = nv30
OpenGL ARB Backend: BRT_RUNTIME = arb
DirectX9 Backend: BRT_RUNTIME = dx9
RuntimeRuntime
• accessing stream data for graphics aps
– Brook runtime api available in c++ code– autogenerated .hpp files for brook code
brook::initialize( "dx9", (void*)device );
// Create streams
fluidStream0 = stream::create<float4>( kFluidSize, kFluidSize );
normalStream = stream::create<float3>( kFluidSize, kFluidSize );
// Get a handle to the texture being used by
// the normal stream as a backing store
normalTexture = (IDirect3DTexture9*)
normalStream->getIndexedFieldRenderData(0);
// Call the simulation kernel
simulationKernel( fluidStream0, fluidStream0, controlConstant,
fluidStream1 );
ApplicationsApplications
• Includes lots of sample applications– Ray-tracer– FFT– Image segmentation– Linear algebra
Brook performanceBrook performance
2-3x faster than CPU implementation
compared against 3GHz P4:• Intel Math Library• FFTW• Custom cached-blocked segment C code
GPUs still lose against SSEcache friendly code.Super-optimizations
• ATLAS• FFTW
ATI Radeon 9800 XT
NVIDIA GeForce 6800
Brook for GPUsBrook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
– http://graphics.stanford.edu/projects/brook
• Source
– http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Brook for GPUs: Stream Computing on Graphics Hardware
– Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
Fly-fishing fly images from The English Fly Fishing Shop
GPGPU ExamplesGPGPU Examples
• Fluids
• Reaction/Diffusion
• Multigrid
• Tone mapping