General Purpose Computation on Graphics Processors...

General Purpose Computation on Graphics Processors (GPGPU)

Mike Houston, Stanford University

2Mike Houston - Stanford University Graphics Lab

A little about me

http://graphics.stanford.edu/~mhoustonEducation:– UC San Diego, Computer Science BS– Stanford University, Computer Science MS– Currently a PhD candidate at Stanford University

Research– Parallel Rendering– High performance computing– Computation on graphics processors (GPGPU)

What can you do on GPUs other than graphics?

Large matrix/vector operations (BLAS)Protein Folding (Molecular Dynamics)FFT (SETI, signal processing)Ray TracingPhysics Simulation [cloth, fluid, collision]Sequence Matching (Hidden Markov Models)Speech Recognition (Hidden Markov Models, Neural nets)DatabasesSort/SearchMedical Imaging (image segmentation, processing)And many, many more…

http://www.gpgpu.org

Why use GPUs?

COTS– In every machine

Performance– Intel 3.0 GHz Pentium 4

• 12 GFLOPs peak (MAD)• 5.96 GB/s to main memory

– ATI Radeon X1800XT• 120 GFLOPs peak (fragment engine)• 42 GB/s to video memory

Task vs. Data parallelism

Task parallel– Independent processes with little communication– Easy to use

• “Free” on modern operating systems with SMP

Data parallel– Lots of data on which the same computation is being executed– No dependencies between data elements in each step in the

computation– Can saturate many ALUs– But often requires redesign of traditional algorithms

CPU vs. GPU

CPU– Really fast caches (great for data reuse)– Fine branching granularity– Lots of different processes/threads– High performance on a single thread of execution

GPU– Lots of math units– Fast access to onboard memory– Run a program on each fragment/vertex– High throughput on parallel tasks

CPUs are great for task parallelismGPUs are great for data parallelism

The Importance of Data Parallelism for GPUs

GPUs are designed for highly parallel tasks like renderingGPUs process independent vertices and fragments– Temporary registers are zeroed– No shared or static data– No read-modify-write buffers– In short, no communication between vertices or fragments

Data-parallel processing– GPU architectures are ALU-heavy

• Multiple vertex & pixel pipelines• Lots of compute power

– GPU memory systems are designed to stream data• Linear access patterns can be prefetched• Hide memory latency

Courtesy GPGPU.org

GPGPU Terminology

Arithmetic Intensity

Arithmetic intensity– Math operations per word transferred– Computation / bandwidth

Ideal apps to target GPGPU have:– Large data sets– High parallelism– Minimal dependencies between data elements– High arithmetic intensity– Lots of work to do without CPU intervention

Courtesy GPGPU.org

Data Streams & Kernels

Streams– Collection of records requiring similar computation

• Vertex positions, Voxels, FEM cells, etc.– Provide data parallelism

Kernels– Functions applied to each element in stream

• transforms, PDE, …– No dependencies between stream elements

• Encourage high Arithmetic Intensity

Courtesy GPGPU.org

Scatter vs. Gather

Gather– Indirect read from memory ( x = a[i] )– Naturally maps to a texture fetch– Used to access data structures and data streams

Scatter– Indirect write to memory ( a[i] = x )– Difficult to emulate:

• Render to vertex array• Sorting buffer

– Needed for building many data structures– Usually done on the CPU

Mapping algorithms to the GPU

Mapping CPU algorithms to the GPU

Basics– Stream/Arrays -> Textures– Parallel loops -> Quads– Loop body -> vertex + fragment program– Output arrays -> render targets– Memory read -> texture fetch– Memory write -> framebuffer write

Controlling the parallel loop– Rasterization = Kernel Invocation– Texture Coordinates = Computational Domain– Vertex Coordinates = Computational Range

Courtesy GPGPU.org

Computational Resources

Programmable parallel processors– Vertex & Fragment pipelines

Rasterizer– Mostly useful for interpolating values (texture coordinates) and

per-vertex constants

Texture unit– Read-only memory interface

Render to texture– Write-only memory interface

Courtesy GPGPU.org

Vertex Processors

Fully programmable (SIMD / MIMD)Processes 4-vectors (RGBA / XYZW)Capable of scatter but not gather– Can change the location of current vertex– Cannot read info from other vertices– Can only read a small constant memory

Vertex Texture Fetch– Random access memory for vertices– Limited gather capabilities

• Can fetch from texture• Cannot fetch from current vertex stream

Courtesy GPGPU.org

Fragment Processors

Fully programmable (SIMD)Processes 4-component vectors (RGBA / XYZW)Random access memory read (textures)Generally capable of gather but not scatter– Indirect memory read (texture fetch), but no indirect memory write– Output address fixed to a specific pixel

Typically more useful than vertex processor– More fragment pipelines than vertex pipelines– Direct output (fragment processor is at end of pipeline)– Better memory read performance

For GPGPU, we mainly concentrate on using the fragment processors– Most of the flops– Highest memory bandwidth

Courtesy GPGPU.org

GPGPU example – Adding Vectors

float a[5*5];float b[5*5];float c[5*5];//initialize vector a//initialize vector bfor(int i=0; i<5*5; i++){

c[i] = a[i] + b[i];}

0 1 2 3 4

5 6 7 8 9

10 11 12 13 14

15 16 17 18 19

20 21 22 23 24

Place arrays into 2D texturesConvert loop body into a shaderLoop body = Render a quad

– Needs to cover all the pixels in the output

– 1:1 mapping between pixels and texelsReadback framebuffer into result array

!!ARBfp1.0

TEMP R0;

TEMP R1;

TEX R0, fragment.position, texture[0], 2D;

ADD R0, R0, R1;

MOV fragment.color, R0;

How this basically works – Adding vectors

Bind Input Textures

Bind Render Targets

Load Shader

Render Quad

Readback Buffer

Set Shader Params

!!ARBfp1.0

TEMP R0;

TEMP R1;

ADD R0, R0, R1;

MOV fragment.color, R0;

Vector A Vector B

Vector C

C = A+B

Rolling your own GPGPU apps

Lots of information on GPGPU.orgFor those with a strong graphics background:– Do all the graphics setup yourself– Write your kernels:

• Use high level languages – Cg, HLSL, ASHLI

• Or, direct assembly– ARB_fragment_program, ps20, ps2a, ps2b, ps30

High level languages and systems to make GPGPU easier– Brook (http://graphics.stanford.edu/projects/brookgpu/)– Sh (http://libsh.org)

BrookGPU

History– Developed at Stanford University– Goal: allow non-graphics users to use GPUs for computation– Lots of GPGPU apps written in Brook

Design– C based language with streaming extensions– Compiles kernels to DX9 and OpenGL shading models– Runtimes (DX9/OpenGL) handle all graphics commands

Performance– 80-90% of hand tuned GPU application in many cases

GPGPU and the ATI X1800

GPGPU on the ATI X1800

IEEE 32-bit floating point– Simplifies precision issues in applications

Long programs– We can now handle larger applications– 512 static instructions– Effectively unlimited dynamic instructions

Branching and Looping– No performance cliffs for dynamic branching and looping– Fine branch granularity: ~16 fragments

Faster upload/download– 50-100% increase in PCIe bandwidth over last generation

GPGPU on the ATI X1800, cont.

Advanced memory controller– Latency hiding for streaming reads and writes to memory

• With enough math ops you can hide all memory access!– Large bandwidth improvement over previous generation

Scatter support (a[i] = x)– Arbitrary number of float outputs from fragment processors– Uncached reads and writes for register spilling

F-Buffer– Support for linearizing datasets– Store temporaries “in flight”

GPGPU on the ATI X1800, cont.

Flexibility– Unlimited texture reads– Unlimited dependent texture reads– 32 hardware registers per fragment

512MB memory support– Larger datasets without going to system memory

Performance basics for GPGPU – X1800XT (from GPUBench)

Compute– 83 GFLOPs (MAD)

Memory– 42 GB/s cache bandwidth– 21 GB/s streaming bandwidth– 4 cycle latency for a float4 fetch (cache hit)– 8 cycle latency for a float4 fetch (streaming)

Branch granularity – 16 fragmentsOffload to GPU– Download (GPU -> CPU): 900 MB/s – Upload (CPU -> GPU): 1.4 GB/s

http://graphics.stanford.edu/projects/gpubench

A few examples on the

ATI X1800XT

Basic Linear Algebra Subprograms– High performance computing

• The basis for LINPACK benchmarks• Heavily used in simulation

– Ubiquitous in many math packages• MatLab™• LAPACK

BLAS 1: scalar, vector, vector/vector operationsBLAS 2: matrix-vector operations BLAS 3: matrix-matrix operations

BLAS GPU Performance

saxpy (BLAS1) – single precision vector-scalar productsgemv (BLAS2) – single precision matrix-vector productsgemm (BLAS3) – single precision matrix-matrix product

Relative Performance

0123456789

101112

saxpy sgemv sgemm

3.0GHz P42.5GHz G5Nvidia 7800GTXATI X1800XT

HMMer – Protein sequence matching

Goal– Find matching patterns between protein sequences– Relationship between diseases and genetics– Genetic relationships between species

Problem– HUGE databases to search against– Queries take lots of time to process

• Researches start searches and go home for the nightCore Algorithm (hmmsearch)– Viterbi algorithm– Compare a Hidden Markov Model against a large database of

protein sequencesPaper at IEEE Supercomputing 2005– http://graphics.stanford.edu/papers/clawhmmer/

HMMer – Performance

X1800XT

Protein Folding

GROMACS provides extremely high performance compared to all other programs.Lot of algorithmic optimizations:– Own software routines to calculate the

inverse square root.– Inner loops optimized to remove all

conditionals.– Loops use SSE and 3DNow! multimedia

instructions for x86 processors– For Power PC G4 and later processors:

Altivec instructions providedNormally 3-10 times faster than any other program.Core algorithm in Folding@Homehttp://www.gromacs.org

GROMACS - Performance

0 0.5 1 1.5 2 2.5 3 3.5 4

3.0GHz P4

2.5GHz G5

NVIDIA 7800GTX

ATI X1800XT

Force Kernel Only

Complete Application

GROMACS – GPU Implementation

Written using Brook by non-graphics programmers– Offloads force calculation to GPU (~80% of CPU time)– Force calculation on X1800XT is ~3.5X a 3.0GHz P4– Overall speed up on X1800XT is ~2.5X a 3.0GHz P4

Not yet optimized for X1800XT– Using ps2b kernels, i.e. no looping– Not making use of new scatter functionality

The revenge of Ahmdal’s law– Force calculation no longer bottleneck (38% of runtime)– Need to also accelerate data structure building (neighbor lists)

• MUCH easier with scatter support

This looks like a very promising application for GPUs– Combine CPU and GPU processing for a folding monster!

Making GPGPU easier

What GPGPU needs from vendors

More information– Shader ISA– Latency information– GPGPU Programming guide (floating point)

• How to order code for ALU efficiency• The “real” cost of all instructions• Expected latencies of different types of memory fetches

Direct access to the hardware– GL/DX is not what we want to be using

• We don’t need state tracking• Using graphics commands is odd for doing computation• The graphics abstractions aren’t useful for us

– Better memory managementFast transfer to and from GPU– Non-blocking

Consistent graphics drivers– Some optimizations for games hurt GPGPU performance

What GPGPU needs from the community

Data Parallel programming languages– Lots of academic research

“GCC” for GPUsParallel data structuresMore applications– What will make the average user care about GPGPU?– What can we make data parallel and run fast?

Thanks

The BrookGPU team – Ian Buck, Tim Foley, Jeremy Sugerman, Daniel Horn, Kayvon FatahalianGROMACS – Vishal Vaidyanathan, Erich Elsen, Vijay Pande, Eric DarveHMMer – Daniel Horn, Eric LindahlPat HanrahanEveryone at ATI Technologies

Questions?

I’ll also be around after the talkEmail: mhouston@stanford.eduWeb: http://graphics.stanford.edu/~mhouston

For lots of great GPGPU information:– GPGPU.org (http://www.gpgpu.org)

General Purpose Computation on Graphics Processors...

Documents

SKIN TREATMENTS DELUXE RITUALS GUVON SPA … · Kloofzicht Treat (Pregnancy safe) 60 minutes R520 Back, neck and shoulder massage followed by a Sole Soother Blissful Treat 1 hour

treatment menu - CRT Group · From R1700 From R1250 R180 R120 R70 R70 R450 R200 R490 R730 R510 R410 R324 R630 R520 R260 From R2680 From R1380 From R1700 From …

Advanced Programming (GPGPU) - Computer graphicsgraphics.stanford.edu/~mhouston/public_talks/cs448-gpgpu.pdf · • FFT (SETI, signal processing) ... • Speech Recognition (Hidden

Galileo’s Dilemma: Ptolemy, Copernicus, and the ...hubblesite.org/about_us/public_talks/presentations/...Galileo’s Dilemma: Ptolemy, Copernicus, and the Transformation of the Heavens

Part 2: Parallel Renderinggraphics.stanford.edu/~mhouston/VisWorkshop04/ChromiumVis200… · Sort-last Rendering Concepts Basically, N instances of a parallel application each render

Transiting Extrasolar Planets - HubbleSitehubblesite.org/about_us/public_talks/presentations/burke_2007_02... · Motivated to find bright transiting planets XO Transit Survey PI Peter

This document is for informational purposes only. Dell ...binar.pl/p/poweredge-r520/poweredge-r520_tech_guide.pdf · This document is for informational purposes only. ... PowerEdge,

Dell PowerEdge R520 Technical Guide - English · 2013-04-05 · Dell server management operations ... iv PowerEdge R520 Technical Guide ... The new embedded system management solution

Production Cluster Visualization: Experiences and Challengesgraphics.stanford.edu/~mhouston/VisWorkshop03/RandyFrank.pdf · 2003-10-24 · Production Cluster Visualization: Experiences

General Purpose Computation on Graphics …mhouston/public_talks/R...4 Mike Houston - Stanford University Graphics Lab Why use GPUs? COTS – In every machine Performance – Intel

Spoilt for choice - Keltruck Scania · Spoilt for choice “The 18-tonners ... 2015 Scania R520 LA6x2/2MNA Topline Griffin ... Keltruck, but it does have a full Scania history. It

Retail Product Catalogue 2020 - mksalon.co.za...NIOXIN Trial Kit R787 R318 R956 R389 R520 R1, 056 1 litre 300 ml 1 litre 300 ml 100 ml 1 pack (300 ml cleanser, 300 ml conditioner,

Service Manual - Daewoosvc.daewoo.com.mx/bbs/manual/dg-r520 service manual 060323.pdf · NORTH AMERICA BS MTK/SONY TRAVERSE Record Mode VR Mode ... JPEG Play w/ MP3 Play X x2 Soft

ParaView All you need for parallel visualizationgraphics.stanford.edu/~mhouston/VisWorkshop04/ParaView.pdf · 3 of 32 ParaView – What Is It? zTurn key application zParaView is built

A Portable Runtime Interface For Multi-Level Memory ...graphics.stanford.edu/~mhouston/papers/mhouston-thesis.pdf · A PORTABLE RUNTIME INTERFACE FOR MULTI-LEVEL MEMORY HIERARCHIES

Ray Casting on Programmable Graphics Hardwaregraphics.stanford.edu/~mhouston/VisWorkshop03/martin.pdf · • Ray casting on a GPU for uniform meshes • Features: – early ray termination

Current Biology Magazineprotists acquire food using whole prey cell phagocytosis, which is Current Biology Magazine Current Biology 30, R451–R520, May 18, 2020 R511 accomplished

R520 Series Router Datasheet - microdatafi.myqnapcloud.commicrodatafi.myqnapcloud.com/pdf/WLINK/WL-R520.pdf · B u il t- nwa ch d og, M tik e 8 02. 1 b/g n W i-F su po r t, 5 M d

Designing Graphics Clusters - Computer graphicsgraphics.stanford.edu/~mhouston/VisWorkshop04/GraphicsClusters.pdf · – Intel –AMD – PowerPC ... Assume a Graphics Cluster is

Dell PowerEdge R520 Owner's Manual · 2018. 10. 15. · Citrix ®, Xen®, XenServer ... Rev. A04. Contents 1 About Your System.....9 Front-Panel Features And Indicators ... 101 Safety