




THE GPU COMPUTING ERA

John Nickolls and William J. Dally, NVIDIA

GPU COMPUTING IS AT A TIPPING POINT, BECOMING MORE WIDELY USED IN DEMANDING

CONSUMER APPLICATIONS AND HIGH-PERFORMANCE COMPUTING. THIS ARTICLE

DESCRIBES THE RAPID EVOLUTION OF GPU ARCHITECTURES—FROM GRAPHICS

PROCESSORS TO MASSIVELY PARALLEL MANY-CORE MULTIPROCESSORS, RECENT

DEVELOPMENTS IN GPU COMPUTING ARCHITECTURES, AND HOW THE ENTHUSIASTIC

ADOPTION OF CPU+GPU COPROCESSING IS ACCELERATING PARALLEL APPLICATIONS.

As we enter the era of GPU computing, demanding applications with substantial parallelism increasingly use the massively parallel computing capabilities of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that we previously thought infeasible because of long execution times.

With the GPU's rapid evolution from a configurable graphics processor to a programmable parallel processor, the ubiquitous GPU in every PC, laptop, desktop, and workstation is a many-core multithreaded multiprocessor that excels at both graphics and computing applications. Today's GPUs use hundreds of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems having substantial inherent parallelism. They're now the most pervasive massively parallel processing platform ever available, as well as the most cost-effective.

Using NVIDIA GPUs as examples, this article describes the evolution of GPU computing and its parallel computing model, the enabling architecture and software developments, how computing applications use CPU+GPU coprocessing, example application performance speedups, and trends in GPU computing.

GPU computing's evolution

Why have GPUs evolved to have large numbers of parallel threads and many cores? The driving force continues to be the real-time graphics performance needed to render complex, high-resolution 3D scenes at interactive frame rates for games.

Rendering high-definition graphics scenes is a problem with tremendous inherent parallelism. A graphics programmer writes a single-thread program that draws one pixel, and the GPU runs multiple instances of this thread in parallel—drawing multiple pixels in parallel. Graphics programs, written in shading languages such as Cg or High-Level Shading Language (HLSL), thus scale transparently over a wide range of thread and processor parallelism. Also, GPU computing programs—written in C or C++ with the CUDA parallel computing model [1, 2], or using a parallel computing API inspired by CUDA such as DirectCompute [3] or OpenCL [4]—scale transparently over a wide range of parallelism. Software scalability, too, has enabled GPUs to rapidly increase their parallelism and performance with increasing transistor density.

GPU technology development




The demand for faster and higher-definition graphics continues to drive the development of increasingly parallel GPUs. Table 1 lists significant milestones in NVIDIA GPU technology development that drove the evolution of unified graphics and computing GPUs. GPU transistor counts increased exponentially, doubling roughly every 18 months with increasing semiconductor density. Since their 2006 introduction, CUDA parallel computing cores per GPU have also doubled nearly every 18 months.

In the early 1990s, there were no GPUs. Video graphics array (VGA) controllers generated 2D graphics displays for PCs to accelerate graphical user interfaces. In 1997, NVIDIA released the RIVA 128 3D single-chip graphics accelerator for games and 3D visualization applications, programmed with Microsoft Direct3D and OpenGL. Evolving to modern GPUs involved adding programmability incrementally—from fixed-function pipelines to microcoded processors, configurable processors, programmable processors, and scalable parallel processors.

Early GPUs

The first GPU was the GeForce 256, a single-chip 3D real-time graphics processor introduced in 1999 that included nearly every feature of high-end workstation 3D graphics pipelines of that era. It contained a configurable 32-bit floating-point vertex transform and lighting processor, and a configurable integer pixel-fragment pipeline, programmed with OpenGL and Microsoft DirectX 7 (DX7) APIs.

GPUs first used floating-point arithmetic to calculate 3D geometry and vertices, then applied it to pixel lighting and color values to handle high-dynamic-range scenes and to simplify programming. They implemented accurate floating-point rounding to eliminate frame-varying artifacts on moving polygon edges that would otherwise sparkle at real-time frame rates.

As programmable shaders emerged, GPUs became more flexible and programmable. In 2001, the GeForce 3 introduced the first programmable vertex processor that executed vertex shader programs, along with a configurable 32-bit floating-point pixel-fragment pipeline, programmed with OpenGL and DX8. The ATI Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL. The GeForce FX and GeForce 6800 [5] featured programmable 32-bit floating-point pixel-fragment processors and vertex processors, programmed with Cg programs, DX9, and OpenGL.


Table 1. NVIDIA GPU technology development.

Date  Product           Transistors  CUDA cores  Technology
1997  RIVA 128          3 million    —           3D graphics accelerator
1999  GeForce 256       25 million   —           First GPU, programmed with DX7 and OpenGL
2001  GeForce 3         60 million   —           First programmable shader GPU, programmed with DX8 and OpenGL
2002  GeForce FX        125 million  —           32-bit floating-point (FP) programmable GPU with Cg programs, DX9, and OpenGL
2004  GeForce 6800      222 million  —           32-bit FP programmable scalable GPU, GPGPU, Cg programs, DX9, and OpenGL
2006  GeForce 8800      681 million  128         First unified graphics and computing GPU, programmed in C with CUDA
2007  Tesla T8, C870    681 million  128         First GPU computing system programmed in C with CUDA
2008  GeForce GTX 280   1.4 billion  240         Unified graphics and computing GPU, IEEE FP, CUDA C, OpenCL, and DirectCompute
2008  Tesla T10, S1070  1.4 billion  240         GPU computing clusters, 64-bit IEEE FP, 4-Gbyte memory, CUDA C, and OpenCL
2009  Fermi             3.0 billion  512         GPU computing architecture, IEEE 754-2008 FP, 64-bit unified addressing, caching, ECC memory, CUDA C, C++, OpenCL, and DirectCompute



These processors were highly multithreaded, creating a thread and executing a thread program for each vertex and pixel fragment. The GeForce 6800 scalable processor core architecture facilitated multiple GPU implementations with different numbers of processor cores.

Developing the Cg language [6] for programming GPUs provided a scalable parallel programming model for the programmable floating-point vertex and pixel-fragment processors of the GeForce FX, GeForce 6800, and subsequent GPUs. A Cg program resembles a C program for a single thread that draws a single vertex or single pixel. The multithreaded GPU created independent threads that executed a shader program to draw every vertex and pixel fragment.

In addition to rendering real-time graphics, programmers also used Cg to compute physical simulations and other general-purpose GPU (GPGPU) computations. Early GPGPU computing programs achieved high performance, but were difficult to write because programmers had to express nongraphics computations with a graphics API such as OpenGL.

Unified computing and graphics GPUs

The GeForce 8800, introduced in 2006, featured the first unified graphics and computing GPU architecture [7, 8], programmable in C with the CUDA parallel computing model in addition to using DX10 and OpenGL. Its unified streaming processor cores executed vertex, geometry, and pixel shader threads for DX10 graphics programs, and also executed computing threads for CUDA C programs. Hardware multithreading enabled the GeForce 8800 to efficiently execute up to 12,288 threads concurrently in 128 processor cores. NVIDIA deployed the scalable architecture in a family of GeForce GPUs with different numbers of processor cores for each market segment.

The GeForce 8800 was the first GPU to use scalar thread processors rather than vector processors, matching standard scalar languages like C, and eliminating the need to manage vector registers and program vector operations. It added instructions to support C and other general-purpose languages, including integer arithmetic, IEEE 754 floating-point arithmetic, and load/store memory access instructions with byte addressing. It provided hardware and instructions to support parallel computation, communication, and synchronization—including thread arrays, shared memory, and fast barrier synchronization.

GPU computing systems

At first, users built personal supercomputers by adding multiple GPU cards to PCs and workstations, and assembled clusters of GPU computing nodes. In 2007, responding to demand for GPU computing systems, NVIDIA introduced the Tesla C870, D870, and S870 GPU card, deskside, and rack-mount GPU computing systems containing one, two, and four T8 GPUs. The T8 GPU was based on the GeForce 8800 GPU, configured for parallel computing.

The second-generation Tesla C1060 and S1070 GPU computing systems introduced in 2008 used the T10 GPU, based on the GPU in the GeForce GTX 280. The T10 featured 240 processor cores, a 1-teraflop-per-second peak single-precision floating-point rate, IEEE 754-2008 double-precision 64-bit floating-point arithmetic, and 4-Gbyte DRAM memory. Today there are Tesla S1070 systems with thousands of GPUs widely deployed in high-performance computing systems in production and research.

NVIDIA introduced the third-generation Fermi GPU computing architecture in 2009 [9]. Based on user experience with prior generations, it addressed several key areas to make GPU computing more broadly applicable. Fermi implemented IEEE 754-2008 and significantly increased double-precision performance. It added error-correcting code (ECC) memory protection for large-scale GPU computing, 64-bit unified addressing, a cached memory hierarchy, and instructions for C, C++, Fortran, OpenCL, and DirectCompute.

GPU computing ecosystem

The GPU computing ecosystem is expanding rapidly, enabled by the deployment of more than 180 million CUDA-capable GPUs.



Researchers and developers have enthusiastically adopted CUDA and GPU computing for a diverse range of applications [10], publishing hundreds of technical papers, writing parallel programming textbooks [11], and teaching CUDA programming at more than 300 universities. The CUDA Zone (see http://www.nvidia.com/object/cuda_home_new.html) lists more than 1,000 links to GPU computing applications, programs, and technical papers. The 2009 GPU Technology Conference (see http://www.nvidia.com/object/research_summit_posters.html) published 91 research posters.

Library and tools developers are making GPU development more productive. GPU computing languages include CUDA C, CUDA C++, Portland Group (PGI) CUDA Fortran, DirectCompute, and OpenCL. GPU mathematics packages include MathWorks Matlab, Wolfram Mathematica, National Instruments Labview, SciComp SciFinance, and PyCUDA. NVIDIA developed the parallel Nsight GPU development environment, debugger, and analyzer integrated with Microsoft Visual Studio. GPU libraries include C++ productivity libraries, dense linear algebra, sparse linear algebra, FFTs, video and image processing, and data-parallel primitives. Computer system manufacturers are developing integrated CPU+GPU coprocessing systems in rack-mount server and cluster configurations.

CUDA scalable parallel architecture

CUDA is a hardware and software coprocessing architecture for parallel computing that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL, DirectCompute, and other languages. Because most languages were designed for one sequential thread, CUDA preserves this model and extends it with a minimalist set of abstractions for expressing parallelism. This lets the programmer focus on the important issues of parallelism—how to design efficient parallel algorithms—using a familiar language.

By design, CUDA enables the development of highly scalable parallel programs that can run across tens of thousands of concurrent threads and hundreds of processor cores. A compiled CUDA program executes on any size GPU, automatically using more parallelism on GPUs with more processor cores and threads.

A CUDA program is organized into a host program, consisting of one or more sequential threads running on a host CPU, and one or more parallel kernels suitable for execution on a parallel computing GPU. A kernel executes a sequential program on a set of lightweight parallel threads. As Figure 1 shows, the programmer or compiler organizes these threads into a grid of thread blocks. The threads comprising a thread block can synchronize with each other via barriers and communicate via a high-speed, per-block shared memory.

Threads from different blocks in the same grid can coordinate via atomic operations in a global memory space shared by all threads. Sequentially dependent kernel grids can synchronize via global barriers and coordinate via global shared memory.
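The sketch below makes these abstractions concrete. It is a hypothetical kernel (the names blockSum and partial and the 256-thread block size are illustrative, not from the article) that uses a per-block shared array and barrier synchronization within each thread block, while the host launches a grid of blocks sized to the problem.

__global__ void blockSum(const float *in, float *blockSums, uint n)
{
    __shared__ float partial[256];           // per-block shared memory
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // barrier: all of the block's loads complete

    // Tree reduction within the thread block
    for (uint stride = blockDim.x/2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = partial[0];  // one result per block, written to global memory
}

// Host code launches a grid of enough 256-thread blocks to cover n elements:
//     blockSum<<<(n + 255)/256, 256>>>(d_in, d_blockSums, n);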


Figure 1. The CUDA hierarchy of threads, thread blocks, and grids of blocks, with corresponding memory spaces: per-thread private local, per-block shared, and per-application global memory spaces.



CUDA requires that thread blocks be independent, which provides scalability to GPUs with different numbers of processor cores and threads. Thread blocks implement coarse-grained scalable data parallelism, while the lightweight threads comprising each thread block provide fine-grained data parallelism. Thread blocks executing different kernels implement coarse-grained task parallelism. Threads executing different paths implement fine-grained thread-level parallelism. Details of the CUDA programming model are available in the programming guide [2].

Figure 2 shows some basic features of parallel programming with CUDA. It contains sequential and parallel implementations of the SAXPY routine defined by the basic linear algebra subroutines (BLAS) library. Given scalar a and vectors x and y containing n floating-point numbers, it performs the update y = ax + y. The serial implementation is a simple loop that computes one element of y per iteration. The parallel kernel executes each of these independent iterations in parallel, assigning a separate thread to compute each element of y. The __global__ modifier indicates that the procedure is a kernel entry point, and the extended function-call syntax saxpy<<<B, T>>>(...) launches the kernel saxpy() in parallel across B blocks of T threads each. Each thread determines which element it should process from its integer thread block index blockIdx.x, its thread index within its block threadIdx.x, and the total number of threads per block blockDim.x.

This example demonstrates a common parallelization pattern, where we can transform a serial loop with independent iterations to execute in parallel across many threads. In the CUDA paradigm, the programmer writes a scalar program—the parallel saxpy() kernel—that specifies the behavior of a single thread of the kernel. This lets CUDA leverage the standard C language with only a few small additions, such as built-in thread and block index variables. The SAXPY kernel is also a simple example of data parallelism, where parallel threads each produce assigned result data elements.

GPU computing architecture

To address different market segments, GPU architectures scale the number of processor cores and memories to implement different products for each segment while using the same scalable architecture and software. NVIDIA's scalable GPU computing architecture varies the number of streaming multiprocessors to scale computing performance, and varies the number of DRAM memories to scale memory bandwidth and capacity.

Each multithreaded streaming multiprocessor provides sufficient threads, processor cores, and shared memory to execute one or more CUDA thread blocks. The parallel processor cores within a streaming multiprocessor execute instructions for parallel threads. Multiple streaming multiprocessors provide coarse-grained scalable data and task parallelism to execute multiple coarse-grained thread blocks (possibly running different kernels) in parallel.


(a) Serial SAXPY in C:

void saxpy(uint n, float a, float *x, float *y)
{
    uint i;
    for (i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void serial_sample()
{
    // Call serial SAXPY function
    saxpy(n, 2.0, x, y);
}

(b) Parallel SAXPY in CUDA C:

__global__ void saxpy(uint n, float a, float *x, float *y)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

void parallel_sample()
{
    // Launch parallel SAXPY kernel using ceil(n/256) blocks of 256 threads each
    saxpy<<<(n + 255)/256, 256>>>(n, 2.0, x, y);
}

Figure 2. Serial (a) and parallel CUDA (b) SAXPY kernels computing y = ax + y.



Multithreading and parallel-pipelined processor cores within each streaming multiprocessor implement fine-grained data and thread-level parallelism to execute hundreds of fine-grained threads in parallel. Application programs using the CUDA model thus scale transparently to small and large GPUs with different numbers of streaming multiprocessors and processor cores.

Fermi computing architecture

To illustrate GPU computing architecture, Figure 3 shows the third-generation Fermi computing architecture configured with 16 streaming multiprocessors, each with 32 CUDA processor cores, for a total of 512 cores. The GigaThread work scheduler distributes CUDA thread blocks to streaming multiprocessors with available capacity, dynamically balancing the computing workload across the GPU, and running multiple kernel tasks in parallel when appropriate. The multithreaded streaming multiprocessors schedule and execute CUDA thread blocks and individual threads. Each streaming multiprocessor executes up to 1,536 concurrent threads to help cover long-latency loads from DRAM memory. As each thread block completes executing its kernel program and releases its streaming multiprocessor resources, the work scheduler assigns a new thread block to that streaming multiprocessor.

The PCIe host interface connects the GPU and its DRAM memory with the host CPU and system memory. The CPU+GPU coprocessing and data transfers use the bidirectional PCIe interface. The streaming multiprocessor threads access system memory via the PCIe interface, and CPU threads access GPU DRAM memory via PCIe.
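A minimal host-side sketch of this coprocessing flow, using standard CUDA runtime calls (the buffer names, the scale() kernel, and its launch configuration are hypothetical):

__global__ void scale(float *x, uint n, float a)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void run_scale_on_gpu(float *h_x, uint n)
{
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));                              // allocate GPU DRAM
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    scale<<<(n + 255)/256, 256>>>(d_x, n, 2.0f);                      // parallel work on the GPU
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU over PCIe
    cudaFree(d_x);
}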

The GPU architecture balances its parallel computing power with parallel DRAM memory controllers designed for high memory bandwidth. The Fermi GPU in Figure 3 has six high-speed GDDR5 DRAM interfaces, each 64 bits wide. Its 40-bit addresses handle up to 1 Tbyte of address space for GPU DRAM and CPU system memory for large-scale computing.



Figure 3. Fermi GPU computing architecture with 512 CUDA processor cores organized as 16 streaming multiprocessors (SMs) sharing a common second-level (L2) cache, six 64-bit DRAM interfaces, and a host interface with the host CPU, system memory, and I/O devices. Each streaming multiprocessor has 32 CUDA cores.



Cached memory hierarchy

Fermi introduces a parallel cached memory hierarchy for load, store, and atomic memory accesses by general applications. Each streaming multiprocessor has a first-level (L1) data cache, and the streaming multiprocessors share a common 768-Kbyte unified second-level (L2) cache. The L2 cache connects with six 64-bit DRAM interfaces and the PCIe interface, which connects with the host CPU, system memory, and PCIe devices. It caches DRAM memory locations and system memory pages accessed via the PCIe interface. The unified L2 cache services load, store, atomic, and texture instruction requests from the streaming multiprocessors and requests from their L1 caches, and fills the streaming multiprocessor instruction caches and uniform data caches.

Fermi implements a 40-bit physical address space that accesses GPU DRAM, CPU system memory, and PCIe device addresses. It provides a 40-bit virtual address space to each application context and maps it to the physical address space with translation lookaside buffers and page tables.

ECC memory

Fermi introduces ECC memory protection to enhance data integrity in large-scale GPU computing systems. Fermi ECC corrects single-bit errors and detects double-bit errors in the DRAM memory, GPU L2 cache, L1 caches, and streaming multiprocessor registers. ECC lets us integrate thousands of GPUs in a system while maintaining a high mean time between failures (MTBF) for high-performance computing and supercomputing systems.

Streaming multiprocessor

The Fermi streaming multiprocessor introduces several architectural features that deliver higher performance, improve its programmability, and broaden its applicability. As Figure 4 shows, the streaming multiprocessor execution units include 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). It has a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers and instruction dispatch units.

Efficient multithreading

The streaming multiprocessor implements zero-overhead multithreading and thread scheduling for up to 1,536 concurrent threads. To efficiently manage and execute this many individual threads, the multiprocessor employs the single-instruction multiple-thread (SIMT) architecture introduced in the first unified computing GPU [7, 8]. The SIMT instruction logic creates, manages, schedules, and executes concurrent threads in groups of 32 parallel threads called warps. A CUDA thread block comprises one or more warps. Each Fermi streaming multiprocessor has two warp schedulers and two dispatch units that each select a warp and issue an instruction from the warp to 16 CUDA cores, 16 load/store units, or four SFUs. Because warps execute independently, the streaming multiprocessor can issue two warp instructions to appropriate sets of CUDA cores, load/store units, and SFUs.

To support C, C++, and standard single-thread programming languages, each streaming multiprocessor thread is independent, having its own private registers, condition codes and predicates, private per-thread memory and stack frame, instruction address, and thread execution state. The SIMT instructions control the execution of an individual thread, including arithmetic, memory access, and branching and control flow instructions. For efficiency, the SIMT multiprocessor issues an instruction to a warp of 32 independent parallel threads.

The streaming multiprocessor realizes full efficiency and performance when all threads of a warp take the same execution path. If threads of a warp diverge at a data-dependent conditional branch, execution serializes for each branch path taken, and when all paths complete, the threads converge to the same execution path. The Fermi streaming multiprocessor extends the flexibility of the SIMT independent thread control flow with indirect branch and function-call instructions, and trap handling for exceptions and debuggers.
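A small sketch of this divergence behavior (hypothetical kernel; the branch condition is only for illustration):

__global__ void divergent(const float *in, float *out, uint n)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If threads within one warp evaluate this data-dependent condition
    // differently, the warp executes both paths one after the other,
    // masking off the threads that did not take each path.
    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;   // path A
    else
        out[i] = -in[i];         // path B
    // Once both paths complete, the warp's threads reconverge and
    // continue together at full SIMT efficiency.
}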

Thread instructions

Parallel thread execution (PTX) instructions describe the execution of a single thread in a parallel CUDA program.



The PTX instructions focus on scalar (rather than vector) operations to match standard scalar programming languages. Fermi implements the PTX 2.0 instruction set architecture (ISA), which targets C, C++, Fortran, OpenCL, and DirectCompute programs. Instructions include

• 32-bit and 64-bit integer, addressing, and floating-point arithmetic;
• load, store, and atomic memory access;
• texture and multidimensional surface access;
• individual thread flow control with predicated instructions, branching, function calls, and indirect function calls for C++ virtual functions; and
• parallel barrier synchronization.

CUDA cores

Each pipelined CUDA core executes a scalar floating-point or integer instruction per clock for a thread. With 32 cores, the streaming multiprocessor can execute up to 32 arithmetic thread instructions per clock.


Figure 4. The Fermi streaming multiprocessor has 32 CUDA processor cores (each containing a floating-point unit and an integer unit), 16 load/store units, four special function units (SFUs), a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers and instruction dispatch units.



The integer unit implements 32-bit precision for scalar integer operations, including 32-bit multiply and multiply-add operations, and efficiently supports 64-bit integer operations. The Fermi integer unit adds bit-field insert and extract, bit reverse, and population count.
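From CUDA C, these integer operations are reachable through ordinary operators and device intrinsics; a brief sketch (the kernel and the operand values are illustrative only):

__global__ void bit_ops(uint x, uint *out)
{
    out[0] = __popc(x);     // population count: number of set bits in x
    out[1] = __brev(x);     // bit reverse: reverses the order of the 32 bits
    out[2] = x * 3u + 7u;   // 32-bit integer multiply-add
}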

IEEE 754-2008 floating-point arithmetic

The Fermi CUDA core floating-point unit implements the IEEE 754-2008 floating-point arithmetic standard for 32-bit single-precision and 64-bit double-precision results, including fused multiply-add (FMA) instructions. FMA computes D = A × B + C with no loss of precision by retaining full precision in the intermediate product and addition, then rounding the final sum to form the result. Using FMA enables fast division and square-root operations with exactly rounded results.

Fermi raises the throughput of 64-bit double-precision operations to half that of single-precision operations, a dramatic improvement over the T10 GPU. This performance level enables broader deployment of GPUs in high-performance computing. The floating-point instructions handle subnormal numbers at full speed in hardware, allowing small values to retain partial precision rather than flushing them to zero or calculating subnormal values in multicycle software exception handlers as most CPUs do.

The SFUs execute 32-bit floating-point instructions for fast approximations of reciprocal, reciprocal square root, sin, cos, exp, and log functions. The approximations are precise to better than 22 mantissa bits.

Unified memory addressing and access

The streaming multiprocessor load/store units execute load, store, and atomic memory access instructions. A warp of 32 active threads presents 32 individual byte addresses, and the instruction accesses each memory address. The load/store units coalesce 32 individual thread accesses into a minimal number of memory block accesses.

Fermi implements a unified thread address space that accesses the three separate parallel memory spaces of Figure 1: the per-thread local, per-block shared, and global memory spaces. A unified load/store instruction can access any of the three memory spaces, steering the access to the correct memory, which enables general C and C++ pointer access anywhere. Fermi provides a terabyte 40-bit unified byte address space, and the load/store ISA supports 64-bit byte addressing for future growth. The ISA also provides 32-bit addressing instructions when the program can limit its accesses to the lower 4 Gbytes of address space.

Configurable shared memory and L1 cache

On-chip shared memory provides low-latency, high-bandwidth access to data shared by cooperating threads in the same CUDA thread block. Fast shared memory significantly boosts the performance of many applications having predictable, regular addressing patterns, while reducing DRAM memory traffic.

Fermi introduces a configurable-capacity L1 cache to aid unpredictable or irregular memory accesses, along with a configurable-capacity shared memory. Each streaming multiprocessor has 64 Kbytes of on-chip memory, configurable as 48 Kbytes of shared memory and 16 Kbytes of L1 cache, or as 16 Kbytes of shared memory and 48 Kbytes of L1 cache.
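The CUDA runtime exposes this split as a per-kernel preference; a minimal sketch, assuming a hypothetical kernel named stencil():

// Favor the larger L1 cache for a kernel with irregular accesses, or favor
// shared memory for a kernel with heavy block-level data reuse. The runtime
// treats the setting as a preference, not a guarantee.
cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferL1);
cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferShared);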

CPU+GPU coprocessing

Heterogeneous CPU+GPU coprocessing systems evolved because the CPU and GPU have complementary attributes that allow applications to perform best using both types of processors. CUDA programs are coprocessing programs—serial portions execute on the CPU, while parallel portions execute on the GPU. Coprocessing optimizes total application performance.

With coprocessing, we use the right core for the right job. We use a CPU core (optimized for low latency on a single thread) for a code's serial portions, and we use GPU cores (optimized for aggregate throughput) for the code's parallel portions. This approach gives more performance per unit area or power than either CPU or GPU cores alone.

The comparison in Table 2 illustrates the advantage of CPU+GPU coprocessing using Amdahl's law.


The table compares the performance of four configurations:

• a system containing one latency-optimized (CPU) core,
• a system containing 500 throughput-optimized (GPU) cores,
• a system containing 10 CPU cores, and
• a coprocessing system that contains a single CPU core and 450 GPU cores.

Table 2 assumes that a CPU core is 5× faster and 50× the area of a GPU core—numbers consistent with contemporary CPUs and GPUs. The coprocessing system devotes 10 percent of its area to the single CPU core and 90 percent of its area to the 450 GPU cores.

Table 2 compares the four configurations on both a parallel-intensive program (with a serial fraction of only 0.5 percent) and on a mostly sequential program (with a serial fraction of 75 percent). The coprocessing architecture is the fastest on both programs. On the parallel-intensive program, the coprocessing architecture is slightly slower on the parallel portion than the pure GPU configuration (0.44 seconds versus 0.40 seconds) but more than makes up for this by running the tiny serial portion 5× faster (1 second versus 5 seconds). The heterogeneous architecture has an advantage over the pure throughput-optimized configuration here because serial performance is important even for mostly parallel codes.

On the mostly sequential program, the coprocessing configuration matches the performance of the multi-CPU configuration on the code's serial portion but runs the parallel portion 45× as fast, giving slightly faster overall performance. Even on mostly sequential codes, it's more efficient to run the code's parallel portion on a throughput-optimized architecture.

The coprocessing architecture provides the best performance across a wide range of the serial fraction because it uses the right core for each task. By using a latency-optimized CPU to run the code's serial fraction, it gives the best possible performance on the serial fraction—which is important even for mostly parallel codes. By using throughput-optimized cores to run the code's parallel portion, it gives near-optimal performance on the parallel fraction as well—which becomes increasingly important as codes become more parallel. It's wasteful to use large, inefficient latency-optimized cores to run parallel code segments.
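A small sketch of the arithmetic behind Table 2, under one reading of its assumptions: the workload takes 200 time units on one CPU core, a CPU core runs serial code five times faster than a GPU core, and every core contributes equal throughput on the parallel portion.

#include <stdio.h>

/* Execution time for a 200-unit workload split into serial and parallel parts. */
static double exec_time(double serial_frac, int cpu_cores, int gpu_cores)
{
    double total    = 200.0;
    double serial   = serial_frac * total;
    double parallel = total - serial;

    /* Serial code runs on one CPU core if available, else on one GPU core at 1/5 speed. */
    double t_serial = (cpu_cores > 0) ? serial : serial * 5.0;
    /* Parallel code spreads across all cores. */
    double t_parallel = parallel / (cpu_cores + gpu_cores);
    return t_serial + t_parallel;
}

int main(void)
{
    /* Coprocessing column of Table 2: prints 1.44 and 150.11. */
    printf("%.2f\n", exec_time(0.005, 1, 450));
    printf("%.2f\n", exec_time(0.75,  1, 450));
    return 0;
}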

Application performance

Many applications consist of a mixture of fundamentally serial control logic and inherently parallel computations. Furthermore, these parallel computations are frequently data-parallel in nature. This directly matches the CUDA coprocessing programming model, namely a sequential control thread capable of launching a series of parallel kernels.


Table 2. CPU+GPU coprocessing execution time, assuming that a CPU core is 5× faster and 50× the area of a GPU core.

Processing time:

                              1 CPU core   500 GPU cores   10 CPU cores   1 CPU core + 450 GPU cores
Area (GPU-core equivalents)   50           500             500            500

Parallel-intensive program
  0.5% serial code            1.0          5.0             1.0            1.00
  99.5% parallelizable code   199.0        0.4             19.9           0.44
  Total                       200.0        5.4             20.9           1.44

Mostly sequential program
  75% serial code             150.0        750.0           150.0          150.0
  25% parallelizable code     50.0         0.1             5.0            0.11
  Total                       200.0        750.1           155.0          150.11



The use of parallel kernels launched from a sequential program also makes it relatively easy to parallelize an application's individual components rather than rewrite the entire application.

Many applications—both academic research and industrial products—have been accelerated using CUDA to achieve significant parallel speedups [10]. Such applications fall into a variety of problem domains, including quantum chemistry, molecular dynamics, computational fluid dynamics, quantum physics, finite element analysis, astrophysics, bioinformatics, computer vision, video transcoding, speech recognition, computed tomography, and computational modeling.

Table 3 lists some representative applications along with the runtime speedups obtained for the whole application using CPU+GPU coprocessing over CPU alone, as measured by application developers [12-22]. The speedups using the GeForce 8800, Tesla T8, GeForce GTX 280, Tesla T10, and GeForce GTX 285 range from 9× to more than 130×, with the higher speedups reflecting applications where more of the work ran in parallel on the GPU. The lower speedups—while still quite attractive—represent applications that are limited by the code's CPU portion, coprocessing overhead, or by divergence in the code's GPU fraction.

The speedups achieved on this diverse set of applications validate the programmability of GPUs—in addition to their performance. The breadth of applications that have been ported to CUDA (more than 1,000 on the CUDA Zone) demonstrates the range of the programming model and architecture. Applications with dense matrices, sparse matrices, and arbitrary pointer structures have all been successfully implemented in CUDA with impressive speedups. Similarly, applications with diverse control structures and significant data-dependent control, such as ray tracing, have achieved good performance in CUDA.

Many real-world applications (such as interactive ray tracing) are composed of many different algorithms, each with varying degrees of parallelism. OptiX, our interactive ray-tracing software developer's kit built on the CUDA architecture, provides a mechanism to control and schedule a wide variety of tasks on both the CPU and GPU. Some tasks are primarily serial and execute on the CPU, such as compilation, data structure management, and coordination with the operating system and user interaction.


Table 3. Representative CUDA application coprocessing speedups.

Application                                         Field                                    Speedup
Two-electron repulsion integral [12]                Quantum chemistry                        130×
Gromacs [13]                                        Molecular dynamics                       137×
Lattice Boltzmann [14]                              3D computational fluid dynamics (CFD)    100×
Euler solver [15]                                   3D CFD                                   16×
Lattice quantum chromodynamics [16]                 Quantum physics                          10×
Multigrid finite element method and partial
  differential equation solver [17]                 Finite element analysis                  27×
N-body physics [18]                                 Astrophysics                             100×
Protein multiple sequence alignment [19]            Bioinformatics                           36×
Image contour detection [20]                        Computer vision                          130×
Portable media converter*                           Consumer video                           20×
Large vocabulary speech recognition [21]            Human interaction                        9×
Iterative image reconstruction [22]                 Computed tomography                      130×
Matlab accelerator**                                Computational modeling                   100×

* Elemental Technologies, Badaboom media converter, 2009; http://badaboomit.com.
** Accelereyes, Jacket GPU engine for Matlab, 2009; http://www.accelereyes.com.



Other tasks, such as building an acceleration structure or updating animations, may run either on the CPU or the GPU depending on the choice of algorithms and the performance required. However, the real power of OptiX is the manner in which it manages parallel operations that are either regular or highly irregular in a fine-grained fashion, and this is executed entirely on the GPU.

To accomplish this, OptiX decomposes the algorithm into a series of states. Some of these states consist of uniform operations, such as generating rays according to a camera model or applying a tone-mapping operator. Other states consist of highly irregular computation, such as traversing an acceleration structure to locate candidate geometry, or intersecting a ray with that candidate geometry. These operations are highly irregular because they require an indefinite number of operations and can vary in cost by an order of magnitude or more between rays. A user-provided program that executes within an abstract ray-tracing machine controls the details of each operation.

To execute this state machine, each warp of parallel threads on the GPU selects a handful of rays that have matching state. Each warp executes in SIMT fashion, causing each ray to transition to a new state (which might not be the same for all rays). This process repeats until all rays complete their operations. OptiX can control both the prioritization of the states and the scope of the search for rays to target trade-offs in different GPU generations. Even if rays temporarily diverge to different states, they'll return to a small number of common states—usually traversal and intersection. This state-machine model gives the GPU an opportunity to reunite diverged threads with other threads executing the same code, which wouldn't occur in a straightforward SIMT execution model.

The net result is that CPU+GPU coprocessing enables fast, interactive ray tracing of complex scenes while you watch, which is an application that researchers previously considered too irregular for a GPU.

GPU computing is at the tipping point. Single-threaded processor performance is no longer scaling at historic rates. Thus, we must use parallelism for the increased performance required to deliver more value to users. A GPU that's optimized for throughput delivers parallel performance much more efficiently than a CPU that's optimized for latency. A heterogeneous coprocessing architecture that combines a single latency-optimized core (a CPU) with many throughput-optimized cores (a GPU) performs better than either alternative alone. This is because it uses the right processor for the right job—the CPU for serial sections and critical paths and the GPU for the parallel sections.

Because they efficiently execute programs across a range of parallelism, heterogeneous CPU+GPU architectures are becoming pervasive. In high-performance computing, technical computing, and consumer media processing, CPU+GPU coprocessing has become the architecture of choice. We expect the rate of adoption of GPU computing in these beachhead areas to accelerate, and GPU computing to spread to broader application areas. In addition to accelerating existing applications, we expect GPU computing to enable new classes of applications, such as building compelling virtual worlds and performing universal speech translation.

As the GPU computing era progresses, GPU performance will continue to scale at Moore's law rates—about 50 percent per year—giving an order-of-magnitude increase in performance in five to six years. At the same time, GPU architecture will evolve to further increase the span of applications that it can efficiently address. GPU cores will not become CPUs—they will continue to be optimized for throughput rather than latency. However, they will evolve to become more agile and better able to handle arbitrary control and data access patterns. GPUs and their programming systems will also evolve to make GPUs even easier to program—further accelerating their rate of adoption.

Acknowledgments

We thank Jen-Hsun Huang of NVIDIA for his Hot Chips 21 keynote [23] that inspired this article, and the entire NVIDIA team that brings GPU computing to market.



References

1. J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40-53.
2. NVIDIA, NVIDIA CUDA Programming Guide, 2009; http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf.
3. C. Boyd, "DirectCompute: Capturing the Teraflop," Microsoft Personal Developers Conf., 2009; http://ecn.channel9.msdn.com/o9/pdc09/ppt/CL03.pptx.
4. Khronos, The OpenCL Specification, 2009; http://www.khronos.org/OpenCL.
5. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, 2005, pp. 41-51.
6. W.R. Mark et al., "Cg: A System for Programming Graphics Hardware in a C-like Language," Proc. Special Interest Group on Computer Graphics (Siggraph), ACM Press, 2003, pp. 896-907.
7. E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55.
8. J. Nickolls and D. Kirk, "Graphics and Computing GPUs," Computer Organization and Design: The Hardware/Software Interface, D.A. Patterson and J.L. Hennessy, 4th ed., Morgan Kaufmann, 2009, pp. A2-A77.
9. NVIDIA, "Fermi: NVIDIA's Next Generation CUDA Compute Architecture," 2009; http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
10. M. Garland et al., "Parallel Computing Experiences with CUDA," IEEE Micro, vol. 28, no. 4, 2008, pp. 13-27.
11. D.B. Kirk and W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
12. I.S. Ufimtsev and T.J. Martinez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, pp. 222-231.
13. M.S. Friedrichs et al., "Accelerating Molecular Dynamic Simulation on Graphics Processing Units," J. Computational Chemistry, vol. 30, no. 6, 2009, pp. 864-872.
14. J. Tolke and M. Krafczyk, "TeraFLOP Computing on a Desktop PC with GPUs for 3D CFD," Int'l J. Computational Fluid Dynamics, vol. 22, no. 7, 2008, pp. 443-456.
15. T. Brandvik and G. Pullan, "Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware," Proc. 46th Am. Inst. of Aeronautics and Astronautics (AIAA) Aerospace Sciences Meeting, AIAA, 2008; http://www.aiaa.org/agenda.cfm?lumeetingid=1065&dateget=08-Jan-08#session8907.
16. M.A. Clark et al., "Solving Lattice QCD Systems of Equations Using Mixed Precision Solvers on GPUs," Computer Physics Comm., 2009; http://arxiv.org/abs/0911.3191v2.
17. D. Goddeke and R. Strzodka, "Performance and Accuracy of Hardware-Oriented Native-, Emulated-, and Mixed-Precision Solvers in FEM Simulations (Part 2: Double Precision GPUs)," Ergebnisberichte des Instituts fur Angewandte Mathematik [Reports on Findings of the Inst. for Applied Mathematics], Dortmund Univ. of Technology, no. 370, 2008; http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/GTX280_mixedprecision.pdf.
18. R.G. Belleman, J. Bedorf, and S.P. Zwart, "High Performance Direct Gravitational N-body Simulations on Graphics Processing Units II: An Implementation in CUDA," New Astronomy, vol. 13, no. 2, 2008, pp. 103-112.
19. Y. Liu, B. Schmidt, and D.L. Maskell, "MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA," Proc. 20th IEEE Int'l Conf. Application-Specific Systems, Architectures and Processors, IEEE CS Press, 2009, pp. 121-128.
20. B. Catanzaro et al., "Efficient, High-Quality Image Contour Detection," Proc. IEEE Int'l Conf. Computer Vision, IEEE CS Press, 2009; http://www.cs.berkeley.edu/~catanzar/Damascene/iccv2009.pdf.
21. J. Chong et al., "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors," tech. report UCB/EECS-2008-69, Univ. of California at Berkeley, 2008; http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-69.pdf.
22. Y. Pan et al., "Feasibility of GPU-Assisted Iterative Image Reconstruction for Mobile C-Arm CT," Proc. Int'l Soc. for Optics and Photonics (SPIE), vol. 7258, SPIE, 2009; http://www.sci.utah.edu/~ypan/Pan_SPIE2009.pdf.
23. J.H. Huang, "2009: The GPU Computing Tipping Point," Proc. IEEE Hot Chips 21, 2009; http://www.hotchips.org/archives/hc21.

John Nickolls is the director of architecture at NVIDIA for GPU computing. His technical interests include developing parallel processing products, languages, and architectures. Nickolls has a PhD in electrical engineering from Stanford University.

William J. Dally is the chief scientist and senior vice president of research at NVIDIA and the Willard R. and Inez Kerr Bell Professor of Engineering at Stanford University. His technical interests include parallel computer architecture, parallel programming systems, and interconnection networks. He's a member of the National Academy of Engineering and a fellow of IEEE, ACM, and the American Academy of Arts and Sciences. Dally has a PhD in computer science from the California Institute of Technology.

Direct questions and comments about this article to John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; [email protected].

