2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, April 21–23, 2013



Characterizing Scalar Opportunities in GPGPU Applications

Zhongliang Chen, David Kaeli

Department of Electrical and Computer Engineering

Northeastern University

Boston, MA 02115

Email: {zhonchen, kaeli}@ece.neu.edu

Norman Rubin

NVIDIA Corporation

Email: [email protected]

Abstract—General Purpose computing with Graphics Processing Units (GPGPU) has gained widespread adoption in both the high performance and general purpose communities. In most GPU computation, execution exploits a Single Instruction Multiple Data (SIMD) model. However, GPU execution typically pays little attention to whether the data operated upon by the SIMD units is the same or different. When SIMD computation operates on multiple copies of the same data, redundant computations are generated, which presents an opportunity to improve efficiency by broadcasting the result of a single computation to multiple outputs. To better serve these operations, modern GPUs are armed with scalar units. SIMD instructions that operate on the same input data operands can then be directed to scalar units, requiring only a single copy of the data, and leaving the data-parallel SIMD units available to execute non-scalar operations.

In this paper, we first characterize a number of CUDA programs taken from the NVIDIA SDK to quantify the potential for scalar execution. We observe that 38% of static SIMD instructions are recognized by the compiler to operate on the same data, and their dynamic occurrences account for 34% of the total dynamic instruction execution. We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture. Our results show that scalar units are utilized 51% of the time during execution, though their use places additional pressure on the interconnect and memory, as shown in the results of our study.

I. INTRODUCTION

General Purpose computing with Graphics Processing Units (GPGPU) is an attractive platform for a growing number of applications. GPUs were traditionally designed to be stream processors for 3-D computer graphics, though they can also be effectively used as many-core data-parallel processors capable of high execution throughput and memory bandwidth. Figure 1 compares the single-precision peak performance of GPUs and CPUs [1]–[3]. As shown in the figure, in 2008 GPUs were 13x faster than CPUs. In the past 5 years, this gap has widened; currently GPUs provide 17x greater computational horsepower than CPUs. Today, GPUs are being deployed in a wide range of acceleration roles for general purpose applications.

The execution model for modern GPUs is based on the Single Instruction Multiple Data (SIMD) model, which allows multiple processing elements to perform the same operation on multiple data simultaneously. In terms of system architecture, a GPU device is an array of multiprocessors (NVIDIA's streaming

Fig. 1. Performance comparison between GPUs and CPUs [1]–[3]

multiprocessors or AMD's compute units), each of which contains SIMD units and on-chip shared memory (or AMD's local data store). A SIMD unit further contains an array of basic processing elements, each containing one ALU. Shared memory provides the GPU with the ability to share data among processing elements. Also, the multiprocessor supports barrier operations to provide synchronization at the thread-block/work-group level.

The two most popular GPU programming models are CUDA (Compute Unified Device Architecture) [4] and OpenCL (Open Computing Language) [5]. They support both data-parallel and task-parallel models. The most commonly exploited model on the GPU is the data-parallel model, which is the focus of this paper. In a data-parallel model, computation is represented by a sequence of instructions that execute on a number of indexed threads (or OpenCL's work-items). All of the threads are explicitly or implicitly divided into thread blocks (or OpenCL's work-groups). In a thread block, threads share data through fast memory (CUDA's shared memory or OpenCL's local memory), and can be synchronized.

When a GPU program is executed, thread blocks are first scheduled onto multiprocessors, and the individual threads in the block are further scheduled onto SIMD units in the multiprocessor. Each thread is processed by one processing element on a SIMD lane.

While data-parallel processing can achieve high speedups,

978-1-4673-5779-1/13/$31.00 ©2013 IEEE


the standard data-parallel model does not consider the case when the input operands to a SIMD instruction are all the same. Threads are mapped to SIMD units no matter what data they operate on. When the computation is performed with multiple copies of the same data, the parallel operations can be reduced to Single Instruction Single Data (SISD) execution, which we refer to as a scalar opportunity. If we continue to use the SIMD hardware for these SISD operations, we are wasting resources and burning unnecessary power. Instead, we turn to a scalar-vector GPU architecture armed with both scalar and SIMD (i.e., vector) units.

On the scalar-vector GPU architecture, scalar opportunities are executed on scalar units so that the SIMD engines can be used to execute true SIMD operations. A good example of just such an architecture is AMD's Graphics Core Next (GCN) architecture. This design adds a scalar coprocessor to each compute unit. The scalar coprocessor has a fully functional integer ALU, with independent instruction arbitration and decode logic, and also a scalar register file. This new unit helps execute a variety of control flow instructions, including jumps, calls, and returns. The scalar coprocessor presents new opportunities in terms of performance and power efficiency [6].

Our scalar-vector GPU design aims to be more flexible than GCN. We designed our scalar unit to handle both integer and floating-point instructions. Also, the scalar unit does not need to have a scalar register file (the scalar units can use the vector register file in order to incur fewer hardware changes). The proposed architecture should be capable of effectively utilizing onboard scalar units to serve scalar opportunities in applications at low cost.

This paper provides a first glimpse of the scalar opportunities present in GPU applications. Furthermore, we evaluate the potential impact that adding a scalar unit can have on a conventional GPU architecture. We examine the challenges and opportunities for various design alternatives on different microarchitectural components, including the multiprocessor pipeline, interconnection network, and memory subsystem.

To the best of our knowledge, this paper is the first attempt to evaluate scalar opportunities in the microarchitecture of a GPU. This paper makes the following contributions. From the perspective of software, our scalar design identifies scalar opportunities in GPU applications using static analysis, and uses this information to guide scalar unit design. From the perspective of hardware, we evaluate the impact of scalar opportunities when run on a scalar-vector GPU architecture. We discuss and also address opportunities and challenges introduced by scalar opportunities.

This paper is organized as follows. Section II presents background on GPU programming models and architecture. Section III introduces and defines scalar opportunities in GPGPU applications. Section IV describes the proposed scalar-vector GPU architecture and various design alternatives, and also discusses implementation details. Section V presents our experimental setup and modeling results. Section VI discusses related work. The paper is concluded in Section VII.

Fig. 2. CUDA programming model [4]

Fig. 3. NVIDIA Fermi GPU architecture [11]

II. BACKGROUND

A. GPGPU Programming Models

There are two popular models widely used today for GPU programming: 1) CUDA and 2) OpenCL. CUDA is a general purpose parallel programming model, introduced by NVIDIA [4]. As shown in Figure 2, CUDA allows the programmer to partition a problem into multiple subproblems that can be solved independently in parallel by blocks of threads. Each subproblem can be further subdivided into finer elements that can be solved cooperatively in parallel by all threads within the block.

A CUDA program usually has two parts: 1) host code running on the CPU and 2) device code running on the GPU. The compilation of device code works as follows: it is extracted by the NVIDIA CUDA C Compiler (NVCC) [7] first, and then compiled to intermediate PTX (Parallel Thread eXecution) [8] code. The PTX code is further compiled and optimized at run time by the NVIDIA proprietary Optimized Code Generator [9] to native SASS instructions (the NVIDIA ISA) [10].

The OpenCL programming model [5], which is managed


Fig. 4. AMD Graphics Core Next compute unit architecture [6]

by the Khronos Group, is an open standard for general purpose parallel programming across CPUs, GPUs, and other devices, giving programmers a portable language to target a range of heterogeneous processing platforms. In OpenCL, data is mapped to work-items in an index space, and all work-items are explicitly or implicitly divided into work-groups.

B. GPU Architecture

GPUs usually adopt a massively parallel model to achieve high throughput. Most of the device real estate on a GPU is dedicated to computation rather than control logic or cache. The NVIDIA Fermi architecture [11], shown in Figure 3, features up to 16 streaming multiprocessors, each of which has 32 CUDA cores, 16 load/store units, and 4 special function units. Each CUDA core has a fully pipelined ALU, and can execute an ALU instruction per clock for each thread. Each load/store unit allows source and destination addresses to be calculated per thread per clock. Each special function unit can execute a transcendental instruction per thread per clock [11].

When a GPU kernel is launched, the global scheduler distributes thread blocks to the local schedulers in each streaming multiprocessor. Threads are further scheduled to SIMD units in warps (groups of 32 threads). Each multiprocessor has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Two groups of 16 cores each are used to execute two instructions from two different warps per cycle.

The homogeneous SIMD-only architecture described above recently underwent a major microarchitectural change. Figure 4 shows a state-of-the-art AMD Graphics Core Next compute unit architecture [6], where scalar units are integrated into compute units, introducing heterogeneity within the GPU. Unlike standard SIMD units, the scalar units provide fast and efficient integer SISD execution. They are mainly used to expedite address generation and control flow execution in GCN. Furthermore, SIMD units can execute other SIMD instructions at the same time as the scalar units execute SISD operations.

This heterogeneous architecture provides more flexibility forGPU applications.

III. SCALAR OPPORTUNITIES IN GPU APPLICATIONS

In GPU programming models, computation is represented by a sequence of SIMD instructions, each of which operates on vector operands in multiple threads. Each component of a vector operand participates in a single computation on one ALU on the GPU. To address thread divergence, an active mask (defined as a bit map) can be used to indicate whether an individual thread is active or not. If a thread is active, its results are confirmed and kept in the updated microarchitectural state. Otherwise, the results are simply discarded.

We define a scalar opportunity as a SIMD instruction operating on the same data in all of its active threads. A typical example of a scalar opportunity is loading a constant value: each active thread loads the same value from memory and then stores it in the corresponding component of the destination vector register, so those components end up holding the same value.
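As a rough illustration of this definition, the check below (a sketch with hypothetical helper names, modeling warp lanes as list elements) treats a SIMD instruction as a scalar opportunity exactly when every source vector operand is uniform across the active lanes:

```python
def is_uniform(values, active_mask):
    """A vector operand is uniform if all components belonging to
    active threads hold the same value (inactive lanes are ignored)."""
    active_values = [v for v, active in zip(values, active_mask) if active]
    return len(set(active_values)) <= 1

def is_scalar_opportunity(source_operands, active_mask):
    """A SIMD instruction is a scalar opportunity when every source
    vector operand is uniform across the active threads."""
    return all(is_uniform(op, active_mask) for op in source_operands)

# Loading the same constant in every active lane is a scalar opportunity;
# an operand holding per-thread indices is not.
mask = [True, True, False, True]
const_load = [[42, 42, 7, 42]]   # inactive lane 2 may hold anything
per_thread = [[0, 1, 2, 3]]      # thread indices differ per lane
print(is_scalar_opportunity(const_load, mask))   # True
print(is_scalar_opportunity(per_thread, mask))   # False
```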

Scalar opportunity analysis can be performed at different abstraction levels. Compiler-level analysis is more flexible and needs zero hardware modifications, but it can only identify scalar opportunities within a thread block or a coarser structure, since intra-thread-block information is dynamic. Also, it may be conservative, since the compiler has to consider all possible control flow paths. Architecture/microarchitecture-level analysis is more informative because it is equipped with run-time information, and can handle scalar opportunities within a finer grained structure such as a warp, but at a hardware cost. In this paper we work at the compiler level, mainly due to its lower cost and high flexibility.

We carry out the characterization of scalar opportunities on NVIDIA PTX code for the following two reasons. One is that PTX is stable across multiple GPU generations, which makes our approach more general. The other is that several existing PTX research tools are available in the literature. However, we claim that our analysis is independent of any specific SIMD programming model, and thus applies to other SIMD-based instruction sets besides PTX, including NVIDIA SASS, AMD IL, and AMD ISA.

To better understand scalar opportunities in GPU applications, consider vector addition as an example, as shown in Figure 5. As seen in the CUDA code, the variable i is first initialized to the global thread index, which is computed using the thread block dimension, thread block index, and local thread index within a thread block. The corresponding PTX code uses three vector registers r1, r0, and r3 to keep track of those three operands, respectively. The thread block dimension and index are the same for every thread in a thread block, so the first two PTX instructions are scalar opportunities (marked 1 at the beginning of the line). The following multiply instruction uses r0 and r1 to compute an intermediate value, and is also a scalar opportunity. Afterwards, the fourth instruction moves the thread index to r3, which obviously processes different data for each thread. It is therefore a true SIMD instruction (marked 0 at the beginning of the line).

Then a conditional branch follows. Suppose that the size of the arrays N is not evenly divisible by the size


Fig. 5. CUDA and PTX code of vector addition

of a thread block NB, and the arrays can have different data in their elements. Then the last thread block contains the threads [NB·⌊N/NB⌋, NB·⌈N/NB⌉−1], where the threads [NB·⌊N/NB⌋, N−1] execute the if block (BB_1_3), while the threads [N, NB·⌈N/NB⌉−1] execute the else block (BB_1_4). When the if block is executed, the threads [N, NB·⌈N/NB⌉−1] are inactive, which explains why the load parameter operations are scalar opportunities. However, the addition is not a scalar opportunity, since all the active threads operate on data that can be different.
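The thread ranges above can be checked with a few lines of arithmetic; the values of N and NB below are chosen arbitrarily for illustration:

```python
import math

N, NB = 1000, 256                   # array size, thread block size

first = NB * math.floor(N / NB)     # first thread of the last block
last = NB * math.ceil(N / NB) - 1   # last thread of the last block

# Threads [first, N-1] execute the if block; threads [N, last] are inactive.
if_threads = list(range(first, N))
inactive = list(range(N, last + 1))

print(first, last)                     # 768 1023
print(len(if_threads), len(inactive))  # 232 24
```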

IV. SCALAR-VECTOR GPU ARCHITECTURE

Scalar opportunities rely on heterogeneous scalar-vector GPU architectures equipped with SIMD and scalar units to improve resource utilization, performance, and power. The AMD Graphics Core Next architecture is currently the only commercially available example of such an architecture. Its scalar unit design includes an integer ALU to execute arithmetic and logical integer operations, and also a scalar register file to hold the operands for scalar opportunities.

We extend the functionality of the AMD scalar unit in our scalar-vector architecture implementation to support any non-transcendental instructions, including integer, floating-point, and others. It can perform general computation in addition to address generation and condition manipulation, which implies that more scalar opportunities can be executed on our scalar unit. Such design choices may result in higher design complexity or added latency. However, some overhead is easily tolerable as long as overall execution is more efficient due to the addition of the scalar units.

Another major modification is that we do not support separate scalar and vector register files; instead, instructions always use a single vector register file. The primary advantage of this scheme is that we avoid expensive data movement between two register files. Consider a SIMD instruction that has a data dependence on a previous scalar opportunity. If the scalar unit and SIMD unit have separate register files, we have to design a mechanism to broadcast the scalar results to the vector source operands required by the SIMD instruction, which could limit the benefits of scalar processing. Another benefit is that the vector portion of the existing instruction set, designed for a traditional SIMD execution model, does not have to be changed in order to benefit from scalar opportunities. Designers only need to focus on the new scalar instructions.

The downside of employing a combined scalar-vector register file is that we may need to add extra read ports to the vector registers. Adding read ports can be expensive and incur additional power. In order to limit the impact of this choice, we employ NVIDIA's operand collector architecture [12] in our design. The operand collector is used to simulate a multi-ported memory using lower port count memory banks. It uses multiple collector units to buffer the operands of instructions, and a bank request arbitration unit to schedule register accesses from collector units. An instruction can be issued from a collector unit to an execution unit when all of its operands are ready. We add a collector unit for each scalar unit. The collector unit stores the warp identifier, instruction opcode, register identifiers, and operands. The operand field merely stores the operands for one thread, since a scalar unit only needs the scalar operands for any one of the active threads. It incurs much less storage overhead than the collector units for SIMD units. Moreover, the scalar unit reads no more than one component of a vector register, and the arbitration unit is able to freely choose the optimal component so that current read requests in the queue incur fewer bank conflicts. Since some components may correspond to inactive threads, the arbitration unit uses thread divergence information to read one component from an active thread.
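The selection policy just described can be sketched as follows; the lane-to-bank mapping and all names here are illustrative assumptions, not the actual hardware or simulator implementation. The arbiter reads only one component of the vector register for a scalar unit, chosen from an active lane whose register bank is least contended:

```python
def pick_component(active_mask, pending_per_bank, num_banks=4):
    """Choose one active lane whose register component a scalar unit
    should read. Assume lane i's component lives in bank (i % num_banks);
    prefer the active lane whose bank has the fewest pending requests,
    so the scalar read adds the fewest bank conflicts."""
    candidates = [i for i, active in enumerate(active_mask) if active]
    if not candidates:
        return None  # no active thread: nothing to read
    return min(candidates, key=lambda i: pending_per_bank[i % num_banks])

# Lane 0 maps to a busy bank, so the arbiter picks lane 1 instead
# (lane 2 is inactive and is never considered).
mask = [True, True, False, True]
pending = [5, 0, 2, 2]           # outstanding requests per bank
print(pick_component(mask, pending))   # 1
```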

We implemented our new scalar unit design and related microarchitectural components in the GPGPU-Sim version 3.1.0 simulator [13]. The model is based on an NVIDIA Fermi GPU architecture. GPGPU-Sim is a cycle-level GPU performance simulator, composed of Single Instruction Multiple Thread (SIMT) cores connected via an on-chip interconnection network to memory partitions that interface with graphics DRAM (dynamic random-access memory). A SIMT core models a highly multithreaded pipelined SIMD multiprocessor very similar in design to an NVIDIA Streaming Multiprocessor or an AMD Compute Unit. A processing element corresponds to a lane within an ALU pipeline in a SIMT core [13].

As shown in Figure 6, a SIMD instruction is executed on a SIMT core as follows. First, the instruction is fetched from the instruction cache, decoded, and then stored in the instruction buffer. The instruction buffer is statically partitioned so that all warps running on the SIMT core have dedicated storage in which to place instructions. Then the issue logic checks all the valid instructions, which are decoded but not yet issued, to establish issue eligibility. A valid instruction can be issued if the


Fig. 6. Overview of the scalar-vector GPU architecture implemented in GPGPU-Sim

following three requirements are all satisfied: (1) its warp is not waiting at a barrier, (2) it passes the Write After Write (WAW) and Read After Write (RAW) hazard checks in the scoreboard, and (3) the operand access stage of the instruction pipeline is not stalled. Memory instructions are issued to the memory pipeline. The other instructions always prefer SIMD units to special function units (SFUs), unless they have to be executed on special function units. The pipeline also maintains a SIMT stack per warp to handle branch divergence. Moreover, an operand collector offers a set of buffers and arbitration logic used to provide the appearance of a multi-ported register file using multiple banks of single-ported RAMs. The buffers hold the source operands of instructions in collector units. When all the operands are ready, the instruction is issued to an execution unit.
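The three issue conditions can be captured in a small predicate; this is a sketch with hypothetical field names, not GPGPU-Sim code:

```python
from dataclasses import dataclass

@dataclass
class InstrState:
    at_barrier: bool              # (1) is the warp waiting at a barrier?
    hazard: bool                  # (2) WAW/RAW hazard flagged by the scoreboard?
    operand_stage_stalled: bool   # (3) is the operand access stage stalled?

def can_issue(s: InstrState) -> bool:
    """A valid (decoded, not yet issued) instruction is eligible to
    issue only when all three conditions from the text are satisfied."""
    return not s.at_barrier and not s.hazard and not s.operand_stage_stalled

print(can_issue(InstrState(False, False, False)))   # True
print(can_issue(InstrState(False, True, False)))    # False
```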

Figure 6 shows the major modifications we made to GPGPU-Sim to model our scalar-vector design. These changes include:

  • Execution unit. We added a configurable number of scalar units, each of which is pipelined and can execute all types of ALU instructions except transcendentals. They have the same speed as SIMT units (i.e., they execute one instruction per cycle). Each unit has an independent issue port from the operand collector, and shares the same output pipeline register as the other execution units that are connected to a common writeback stage.

  • Operand collector. We added a configurable number of collector units to each scalar unit. The collector units have a similar structure to those for SIMD units, but store the operands for only one active thread.

  • Issue logic. We modified the warp scheduler so that scalar instructions can be issued to scalar units at the same time SIMD instructions are issued to SIMD units; otherwise, the instructions would never run in parallel on both units. The issue width of our simulator is configurable. Moreover, scalar opportunities should be able to run on SIMD units as well for flexibility, though we may choose to restrict this option when optimizing for power.

  • Configuration options. The configurable parameters described above are added as new configuration options for GPGPU-Sim.

V. MODELING RESULTS

In this section, we describe our experimental setup for characterizing scalar opportunities. As mentioned before, we first identify scalar opportunities using a compiler pass, and then simulate the program on a modified GPU simulator, using the information gathered by the compiler pass to investigate which microarchitectural components scalar opportunities utilize.

A. Experimental Setup

As described in Section III, a scalar opportunity is a SIMD instruction operating on the same data across all the active threads in a thread block. So our first step is to determine whether a vector operand contains the same data in all the components corresponding to active threads. If this condition is satisfied, we call it a uniform vector; otherwise, it is divergent. A scalar opportunity requires that all of its source operands are uniform.

We use the static variable divergence analysis approach proposed by Coutinho et al. [14] to decide whether an operand is uniform or divergent. It first performs a PTX-to-PTX code transformation in order to handle both data dependences and sync dependences. Next, a data dependence graph reachability analysis starts from the apparent divergent variables (e.g., thread IDs) and the variables defined by atomic instructions, as shown in Figure 7(a). All the variables reached are marked divergent (black circles); the others are uniform (white circles).
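The reachability step can be sketched as a simple graph traversal. The variable names below are hypothetical, and the sketch omits the sync-dependence handling of the full analysis of Coutinho et al.:

```python
from collections import deque

def mark_divergent(uses, seeds):
    """uses maps each variable to the variables whose definitions use it
    (data-dependence edges). Every variable reachable from a seed
    (thread IDs, atomic results) is marked divergent; the rest stay uniform."""
    divergent = set(seeds)
    work = deque(seeds)
    while work:
        v = work.popleft()
        for w in uses.get(v, ()):
            if w not in divergent:
                divergent.add(w)
                work.append(w)
    return divergent

# Modeled on the vector-add example: tmp = blockIdx * blockDim is uniform,
# but i = tmp + tid inherits divergence from the thread index tid.
uses = {"tid": ["i"], "blockIdx": ["tmp"], "blockDim": ["tmp"], "tmp": ["i"]}
print(sorted(mark_divergent(uses, {"tid"})))   # ['i', 'tid']
```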

When performing variable divergence analysis on a data dependence graph, we add a tag to each variable indicating whether it is uniform or divergent. Then we carry out static scalar opportunity analysis on a control flow graph using the tags previously generated, as illustrated in Figure 7(b). A SIMD instruction is recognized as a scalar opportunity (white box) if and only if all of its source operands are uniform.

Static statistics are insufficient to arrive at the best use of the scalar units. For example, assume a program has 10 static instructions, where 5 instructions are non-scalar opportunities in a loop executing 100 iterations, and the other 5 are scalar opportunities outside the loop. Then the percentage of static scalar opportunities is 5/10 = 50%, while that of dynamic scalar opportunities is 5/(5 + 5·100) ≈ 1%. Scalar units will be underutilized if a program has limited dynamic scalar opportunities. Hence, we also count dynamic occurrences of static scalar opportunities.
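The arithmetic in this example can be checked directly:

```python
static_scalar, static_nonscalar = 5, 5   # static instruction counts
loop_iterations = 100

# Static view: half of the static instructions are scalar opportunities.
static_pct = static_scalar / (static_scalar + static_nonscalar)

# Dynamic view: the scalar opportunities execute once (outside the loop),
# while the non-scalar instructions execute on every loop iteration.
dynamic_scalar = static_scalar
dynamic_total = dynamic_scalar + static_nonscalar * loop_iterations
dynamic_pct = dynamic_scalar / dynamic_total

print(f"{static_pct:.0%}")    # 50%
print(f"{dynamic_pct:.1%}")   # 1.0%
```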


Fig. 7. An example of scalar opportunity analysis (vector addition)

Note that static analysis does not use run-time information, and thus it may be conservative. Specifically, uniform vectors may be recognized as divergent in variable divergence analysis, and so some scalar opportunities are not detected. For instance, an instruction subtracting a divergent vector from itself produces a uniform vector 0. However, because the result (i.e., 0) has a data dependency on two divergent source vectors in the data flow graph, it is labeled a divergent vector. If a following instruction adds the previous result 0 to a uniform vector, it will be identified as a non-scalar opportunity, since the 0 was recognized as a divergent vector. Another example is a conditional branch that is taken (or not taken) by all the threads when the program is executed, i.e., thread divergence may not happen. Then the uniform variables defined between the branch and its immediate post-dominator stay uniform. However, they are recognized as divergent, since the compiler has to consider all the possibilities. Dynamic analysis can generate run-time statistics under those circumstances. Nevertheless, dynamic analysis requires hardware modification, which incurs high cost, and run-time information may heavily depend on program inputs, resulting in statistics that are very specific to certain inputs. Thus we do not consider dynamic analysis in this paper.

We added a compiler pass to GPU Ocelot [15] to perform our static analysis. Ocelot is a modular dynamic compilation framework for heterogeneous systems, providing various back-end targets for CUDA programs and analysis modules for the PTX virtual instruction set. In the experiments, we compiled CUDA source code to PTX code first, and then used Ocelot to generate flags indicating whether a static instruction is a scalar opportunity, along with the instruction type (e.g., integer, floating-point, memory, etc.). This information is read later during simulation.

B. Results

We ran 20 CUDA benchmarks chosen from the NVIDIA CUDA SDK version 4.0. We follow the methodology presented in the previous subsection to collect our results. As shown in Table I, the benchmarks range from scientific algorithms (e.g., the discrete cosine transform) to financial applications (e.g., binomial option pricing). In this subsection, we characterize the number of scalar opportunities in these benchmarks first, and then discuss their impact on the scalar-vector GPU microarchitecture components.

We count the number of static scalar opportunities using Ocelot, and profile their dynamic occurrences during simulation using GPGPU-Sim. As Figure 8 shows, 38% of static SIMD instructions on average are detected by the compiler as scalar opportunities. The results imply that scalar opportunities always exist in GPGPU applications, even if we use SIMD programming models to write and optimize our programs.

We also break down all the static and dynamic scalar opportunities into individual instruction types, as shown in Figures 9 and 10, respectively. Parallelism instructions include barrier synchronization, reduction operations on global and shared memory, and vote instructions. Special function instructions are transcendental operations running on special function units. We can see that most scalar opportunities are integer, floating-point, or memory operations. Since memory


TABLE I. BENCHMARKS

Benchmark             Description
BlackScholes          Evaluation of fair call and put prices for a given set of European options by the Black-Scholes formula
MersenneTwister       Mersenne Twister random number generator and Cartesian Box-Muller transformation
MonteCarlo            Evaluation of fair call price for a given set of European options using a Monte Carlo approach
SobolQRNG             Sobol quasirandom sequence generator
binomialOptions       Evaluation of fair call price for a given set of European options under the binomial model
convolutionSeparable  A separable convolution filter of a 2D signal with a Gaussian kernel
dct8x8                Discrete Cosine Transform for blocks of 8 by 8 pixels
dwtHaar1D             Discrete Haar wavelet decomposition for 1D signals whose length is a power of 2
eigenvalues           A bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size
fastWalshTransform    Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths
histogram256          256-bin histogram
histogram64           64-bin histogram
mergeSort             Merge sort algorithm
quasirandomGenerator  Niederreiter quasirandom sequence generator and Inverse Cumulative Normal Distribution function for standard normal distribution generation
reduction             Summation of a large array of values
scalarProd            Scalar products of a given set of input vector pairs
scan                  Parallel prefix sum (given an array of numbers, compute a new array in which each element is the sum of all the elements before it in the input array)
sortingNetworks       Bitonic sort and odd-even merge sort algorithms
transpose             Matrix transpose
vectorAdd             Vector addition

Fig. 8. Percentage of static scalar opportunities and their dynamic occurrences

Fig. 9. Instruction type breakdown of static scalar opportunities

Fig. 10. Instruction type breakdown of dynamic scalar opportunities

instructions are always executed on the load/store units, we need to enable scalar unit support for at least integer and floating-point instructions in order to obtain most of the benefits available from adding scalar units. Also note that transcendental calculations such as sine and cosine are all normal SIMD instructions across the benchmarks, which suggests that a simple scalar unit is enough (there is no need to add transcendental support). In addition, atomic instructions cannot be executed on scalar units, since they contain memory operands and need to access the load/store units.

In some benchmarks, such as binomialOptions, the percentage of static scalar opportunities is significantly higher than the percentage measured during runtime. This trend implies that scalar opportunities in those benchmarks are likely present in the initialization phase of the code, and thus those programs will not benefit from scalar opportunities in the main loops. On the other hand, in some benchmarks such as histogram256, the percentage of static scalar opportunities is much lower than the number executed. In such cases, scalar opportunities are very likely to be present in the main loops of these benchmarks.

A similar scenario can be seen for selected types of scalar opportunities. Take BlackScholes for example, where floating-point operations account for 65% of all static scalar opportunities but only 7% of all dynamic scalar opportunities. The main reason is that these instructions are located in the initialization phase of the code. In contrast, 17% of the static scalar opportunities are integer instructions, while their dynamic occurrences account for 92% of dynamic scalar opportunities. Therefore, integer scalar opportunities determine the benefits of the scalar units to a large degree.

We evaluate the utilization of four types of execution units on a scalar-vector GPU architecture: scalar units, SIMD units, special function units, and load/store units. As illustrated in Figure 11, scalar units have high utilization, with 51% occupancy on average across all of the benchmarks. Note that the presence of a large number of scalar opportunities does not directly imply higher utilization of the scalar units, since effective exploitation of these units depends on other microarchitecture features. For example, if the load/store units cannot keep up with supplying source operands to the scalar opportunities, execution will stall.


Fig. 11. Utilization of four types of execution units

Fig. 12. Utilization of SIMD units and load/store units in homogeneous and scalar-vector GPU architectures

Fig. 13. Stall cycle differences on a scalar-vector GPU architecture over a homogeneous architecture.

From the figure, we can also see that the utilization of scalar units is generally lower than that of SIMD units. The reason is that we schedule scalar opportunities on available SIMD units when all scalar units are busy, issuing the scalar opportunity into every lane of the SIMD unit for execution. This scheduling policy decreases stalls, though it results in less efficient execution.
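The issue policy described above can be sketched in a few lines. This is a minimal illustration with hypothetical structures, not the simulator's actual scheduler: a scalar opportunity prefers a free scalar unit, but falls back to a free SIMD unit rather than stalling.

```python
class Unit:
    def __init__(self):
        self.free = True

def issue(is_scalar_opportunity, scalar_units, simd_units):
    """Return the kind of unit issued to, or None if the warp stalls."""
    if is_scalar_opportunity:
        for u in scalar_units:
            if u.free:
                u.free = False
                return "scalar"
        # All scalar units are busy: fall back to a SIMD unit, where
        # the operation is replicated into every lane. Only one lane
        # does useful work, so execution is less efficient, but the
        # warp does not stall.
    for u in simd_units:
        if u.free:
            u.free = False
            return "simd"
    return None

scalar_units = [Unit()]
simd_units = [Unit(), Unit()]
print(issue(True, scalar_units, simd_units))   # scalar
print(issue(True, scalar_units, simd_units))   # simd (scalar unit busy)
print(issue(False, scalar_units, simd_units))  # simd
print(issue(False, scalar_units, simd_units))  # None (all units busy)
```

The fallback path is what makes the measured scalar-unit occupancy lower than the raw fraction of scalar opportunities: some of them execute, inefficiently, on SIMD units.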

By comparing the utilization of SIMD units and load/store units in homogeneous and scalar-vector GPU architectures, we can see in Figure 12 that the utilization of SIMD units decreases when we introduce the additional scalar units, while the utilization of the load/store units remains the same. However, scalar units can place more pressure on memory, as explained below.

Scalar opportunities can put pressure on the multiprocessor pipeline, interconnection network, and memory subsystem, as shown in Figure 13. Multiprocessor pipeline stalls can be caused by shared memory bank conflicts, non-coalesced memory accesses, or serialized memory accesses. Interconnection network stalls happen when DRAM channels cannot accept requests from the interconnect. Memory stalls result from interconnection network congestion, when DRAM channels cannot send packets. In the figure, a positive number indicates that stall cycles increase on a scalar-vector GPU architecture relative to a homogeneous GPU architecture, while a negative number indicates that they decrease.

The benchmarks show varied results. BlackScholes and quasirandomGenerator place more pressure on multiprocessor pipelines, while MersenneTwister places less. Moreover, for binomialOptions, scalar units relieve much of the pressure on the interconnect. In contrast, several other benchmarks, including dwtHaar1D and scalarProd, experience more interconnect stalls. Additionally, some benchmarks, such as MonteCarlo, place additional stress on memory.

The pressure largely results from the parallel execution of scalar units and SIMD units. Additional source operands have to be read to fill the scalar unit pipeline. This can result in more


traffic on the memory and interconnects versus the SIMD-only architecture.

When designing a scalar-vector GPU architecture, we need to keep in mind that when we add scalar units to the microarchitecture, we may need to increase our interconnect and memory bandwidth to guarantee that data is delivered to these units. We need to consider the entire data path so that we do not create another hotspot in the microarchitecture.

VI. RELATED WORK

Previous research on divergence in GPGPU applications offered us helpful ideas. Coutinho et al. proposed variable divergence analysis and optimization algorithms [14]. They introduced a static analysis to determine which vector variables in a program have the same values for every processing element. They also described a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Their analysis is used in our work to determine scalar opportunities.
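The core of such a uniformity analysis can be sketched as a simple forward dataflow pass. This is an illustrative simplification, not the algorithm of [14]: a variable is divergent if it depends, transitively, on a thread-dependent source such as threadIdx; instructions whose inputs are all uniform are scalar opportunities.

```python
def uniform_variables(instructions, divergent_sources=("threadIdx",)):
    """Each instruction is (destination, [source names])."""
    divergent = set(divergent_sources)
    changed = True
    while changed:  # iterate to a fixed point over the def-use chains
        changed = False
        for dst, srcs in instructions:
            if dst not in divergent and any(s in divergent for s in srcs):
                divergent.add(dst)
                changed = True
    defined = {dst for dst, _ in instructions}
    return defined - divergent

code = [
    ("i", ["threadIdx"]),  # divergent: depends on the thread ID
    ("n", ["blockDim"]),   # uniform: same for every thread
    ("x", ["n", "n"]),     # uniform -> a scalar opportunity
    ("y", ["i", "x"]),     # divergent: tainted through i
]
print(sorted(uniform_variables(code)))  # ['n', 'x']
```

A real analysis must also handle control-flow-induced divergence (values merged under divergent branches), which this straight-line sketch omits.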

Collange et al. presented a technique for dynamic detection of uniform and affine vectors in GPGPU computations [16]. They concentrated on two forms of value locality specific to vector computations in GPUs. The first form corresponds to the uniform pattern present when computing conditions that avoid divergence in sub-vectors. The second form corresponds to the affine pattern used to access memory efficiently. They proposed using both forms of value locality, combined with hardware modifications, to significantly reduce the power required for data transfers between the register file and the functional units. They also looked at how to reduce the power drawn by the SIMD arithmetic units. Their work analyzed variables only, which differs from our approach based on computations.
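The two value patterns above can be sketched with a small classifier. This is illustrative only; the detection in [16] is a dynamic hardware mechanism, not software: a vector is uniform if every lane holds the same value, and affine if the lane values form base + lane * stride.

```python
def classify_vector(lanes):
    """Classify the per-lane values of one vector register."""
    if all(v == lanes[0] for v in lanes):
        return "uniform"          # representable as a single scalar
    stride = lanes[1] - lanes[0]
    if all(lanes[i] == lanes[0] + i * stride for i in range(len(lanes))):
        return "affine"           # representable as a (base, stride) pair
    return "generic"              # needs the full vector

print(classify_vector([7] * 32))                          # uniform
print(classify_vector([100 + 4 * i for i in range(32)]))  # affine
print(classify_vector([3, 1, 4, 1, 5, 9]))                # generic
```

Uniform and affine vectors can be stored and moved in compressed form, which is the source of the register-file and datapath power savings that work targets.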

Collange later proposed a mechanism to identify scalar behavior in CUDA kernels [17]. This prior work describes a compiler analysis pass that statically identifies several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses to consecutive memory locations, and uniform control flow. While it is of high quality, this prior work did not consider a scalar-vector GPU architecture.

Stratton et al. described a microthreading approach to efficiently compile fine-grained single-program multiple-data threaded programs for multicore CPUs [18]. They enabled redundancy removal in both computation and data storage as a primary optimization, where variance analysis discovers which portions of the code produce the same value for all threads. Our work differs from theirs in that we target scalar-vector GPU architectures rather than multicore CPUs.

Hong and Kim proposed an integrated power and performance modeling system for GPUs [19], which uses an empirical modeling approach to model GPU power. They used the power and timing models to predict performance per watt, as well as the optimal number of cores to achieve energy savings. Their work is based on a conventional SIMD-only GPU architecture, while we focus on a scalar-vector GPU architecture.

VII. CONCLUSION

In this paper we characterized scalar opportunities in GPU applications using a quantitative approach. The goal was to motivate the need for GPU architectures to evolve from homogeneity (i.e., SIMD only) to heterogeneity (i.e., scalar plus SIMD). Our static-analysis-guided approach is flexible and low-cost, requiring zero hardware modifications. We have designed and implemented a detailed heterogeneous scalar-vector GPU architecture on a cycle-level simulator, and evaluated hardware resource utilization using commonly used GPU benchmarks. We have also evaluated the impact of scalar opportunities on multiprocessor pipelines, interconnection networks, and memory subsystems.

The presence of scalar opportunities in common applications provides us with opportunities to pursue performance improvements and reduce power consumption. In order to achieve those efficiencies, we need to carefully tune hardware components. Our future work will investigate how to build and optimize scalar-vector GPU architectures to maximize the benefits from scalar opportunities in a wider range of GPGPU applications.

ACKNOWLEDGMENT

The authors would like to thank Yoav Etsion, who helped to improve the quality of our final paper. This work is supported by an NSF EEC Innovations Program award, number EEC-0946463, and by both AMD and NVIDIA. The authors would also like to thank the GPGPU-Sim and Ocelot teams for the use of their toolsets.

REFERENCES

[1] (2012) Comparison of AMD graphics processing units. [Online]. Available: http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units

[2] (2012) Comparison of Nvidia graphics processing units. [Online]. Available: http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

[3] Intel Corporation. (2012) Intel microprocessor export compliance metrics. [Online]. Available: http://www.intel.com/support/processors/sb/CS-017346.htm

[4] NVIDIA Corporation. (2012) NVIDIA CUDA C programming guideversion 4.2.

[5] Khronos OpenCL Working Group. (2011) The OpenCL specificationversion 1.2.

[6] Advanced Micro Devices, Inc. (2012) AMD Graphics Cores Next(GCN) architecture whitepaper.

[7] NVIDIA Corporation. (2011) The CUDA Compiler Driver NVCC.

[8] (2011) PTX: Parallel Thread Execution ISA version 2.3.

[9] M. Murphy. (2011) NVIDIA’s experience with Open64.

[10] NVIDIA Corporation. (2011) cuobjdump application note.

[11] NVIDIA Corporation. (2009) NVIDIA Fermi whitepaper version 1.1. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

[12] S. Liu, E. Lindholm, M. Y. Siu, B. W. Coon, and S. F. Oberman,“Operand collector architecture,” U.S. Patent 7 834 881B2, Nov. 16,2010.

[13] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in Proceedings of the IEEE 2009 International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, Apr. 2009, pp. 163–174.


[14] B. Coutinho, D. Sampaio, F. Pereira, and W. Meira, “Divergence analysis and optimizations,” in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, TX, USA, Oct. 2011, pp. 320–329.

[15] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, New York, NY, USA, Sep. 2010, pp. 353–364.

[16] S. Collange, D. Defour, and Y. Zhang, “Dynamic detection of uniform and affine vectors in GPGPU computations,” in Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip, Delft, The Netherlands, Aug. 2009, pp. 38–47.

[17] S. Collange, “Identifying scalar behavior in CUDA kernels,” INRIA,France, Tech. Rep. hal-00555134, Jan. 2011.

[18] J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W.-m. W. Hwu, “Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs,” in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, New York, NY, USA, Apr. 2010, pp. 111–119.

[19] S. Hong and H. Kim, “An integrated GPU power and performance model,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, New York, NY, USA, Jun. 2010, pp. 280–289.
