
HAL Id: hal-00641726
https://hal.inria.fr/hal-00641726

Submitted on 16 Nov 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


TEG: GPU Performance Estimation Using a Timing Model

Junjie Lai, André Seznec

To cite this version: Junjie Lai, André Seznec. TEG: GPU Performance Estimation Using a Timing Model. [Research Report] RR-7804, INRIA. 2011. ⟨hal-00641726⟩


ISSN 0249-6399   ISRN INRIA/RR--7804--FR+ENG

RESEARCH REPORT N° 7804, November 2011

Project-Team ALF

TEG: GPU Performance Estimation Using a Timing Model

Junjie Lai, André Seznec


RESEARCH CENTRE RENNES – BRETAGNE ATLANTIQUE

Campus universitaire de Beaulieu, 35042 Rennes Cedex

TEG: GPU Performance Estimation Using a Timing Model

Junjie Lai, André Seznec

Project-Team ALF

Research Report n° 7804 — November 2011 — 16 pages

Abstract: Modern Graphics Processing Units (GPUs) offer significant performance speedup over conventional processors, and programming GPUs for general purpose applications has become an important research area. The CUDA programming model provides a C-like interface and is widely accepted. However, since hardware vendors do not disclose enough underlying architecture details, programmers have to optimize their applications without fully understanding the performance characteristics. In this paper we present a GPU timing model to provide more insight into an application's performance on the GPU. A GPU CUDA program timing estimation tool (TEG) is developed based on this timing model. In particular, TEG illustrates how performance scales from one warp (CUDA thread group) to multiple concurrent warps on an SM (Streaming Multiprocessor). Because TEG takes native GPU assembly code as input, it can estimate the execution time with only a small error. TEG can help programmers better understand performance results and quantify the performance effects of bottlenecks.

Key-words: GPGPU, CUDA, Performance Estimation, Analytical Model


TEG: GPU Performance Estimation Using a Timing Model

Résumé : In this report, we propose a model of the microarchitecture of a GPU in order to provide a better understanding of an application's performance on the GPU. TEG is a program execution-time estimation tool based on this model.

Mots-clés : GPGPU, CUDA, Performance Estimation, Analytical Model


1 Introduction

In recent years, more and more HPC researchers have begun to pay attention to the potential of GPUs, since GPUs can provide enormous computing power and memory bandwidth. Today, many applications have been ported to GPU platforms with programming interfaces like CUDA [7] or OpenCL [8]. Though little effort is needed to functionally port applications to a GPU, programmers still have to spend a lot of time optimizing their applications to achieve better performance. However, because few architecture details are disclosed, it is hard to gain insight into GPU performance results. Normally, programmers have to exhaustively explore the design space or rely on their programming experience [10]. Better understanding the performance remains a challenge for the GPGPU community.

We present a GPU timing model to analyze GT200 GPU performance, and the tool TEG to estimate GPU kernel execution time. TEG takes as input the output of the NVIDIA disassembler tool cuobjdump [7], which processes a CUDA binary file and generates assembly code. TEG does not execute the code, but only uses information such as instruction types, operands, etc. With the instruction trace and some other necessary output of a functional simulator, TEG can give a timing estimation at a cycle-approximate level. It thus allows programmers to better understand the performance bottlenecks and how much penalty each bottleneck introduces: we simply remove a bottleneck's effects from TEG, estimate the performance again, and compare. In our study, we make the following contributions. First, we present a timing model based on micro-benchmarks and a new approach to estimate GPU performance from instruction traces. Second, we show how performance scales from one warp to multiple concurrent warps on an SM, so that TEG gives good estimations for applications with very few concurrent warps on an SM.

Several works on GPU analytical models have been presented to help developers optimize GPU applications. Hong et al. [4] proposed a simple GPU analytical model that takes PTX code as input and statically predicts execution time by analyzing memory warp parallelism and computation warp parallelism. Another GPU analytical model, introduced by Baghsorkhi et al. [1], uses abstract interpretation of a GPU kernel to predict the performance of general purpose applications on GPU architectures. Kim et al. [6] presented a tool called CuMAPz to estimate GPU memory performance. CuMAPz collects performance-critical effects, such as shared memory bank conflicts, and calculates a memory utilization parameter based on all these factors. Recently, Zhang et al. [12] proposed a quantitative performance model for GPUs, which is similar to our approach. The main difference is that in our study, we use an instruction trace instead of instruction statistics as the tool input, and demonstrate finer-grained performance scaling behavior.

The paper is organized as follows: Section 2 introduces our target GPU architecture and the CUDA programming model. In Section 3 we present our timing model for the GPU. Section 4 demonstrates TEG's workflow. In Section 5 we evaluate TEG with two cases. Section 6 concludes this study and presents future directions.


2 Background

2.1 GT200 GPU Architecture

GPUs represent a major trend in recent architectural advances for high performance computing. A GPU processor includes a large number of fairly simple cores. We use the NVIDIA GT200 architecture (Figure 1) as the target of our research.

GT200 is composed of 10 TPCs (Thread Processing Clusters), each of which includes 3 SMs (Streaming Multiprocessors). Each SM further includes 8 SPs (Streaming Processors) and 2 SFUs (Special Function Units). If we consider an SP as one core, then one GPU processor comprises 240 cores.

For the GeForce GTX 280 model, with a 1296MHz shader clock, the single precision peak performance can reach around 933GFlops. The GTX 280 has eight 64-bit wide GDDR3 memory controllers; with a 2214MHz effective memory clock, the memory bandwidth can reach around 140GB/s. Besides, within each TPC there is a 24KB L1 texture cache, and a 256KB L2 texture cache is shared among the TPCs.
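These peak figures follow from the per-unit rates. As a back-of-the-envelope check (our addition; the factor of 3 flops per SP per cycle assumes GT200's dual issue of a MAD, 2 flops, plus a MUL, 1 flop, per cycle):

$$240\,\text{SPs} \times 1.296\,\text{GHz} \times 3\,\text{flops/cycle} \approx 933\,\text{GFlops}$$

$$8\,\text{channels} \times 8\,\text{B} \times 2.214\,\text{GT/s} \approx 141.7\,\text{GB/s} \approx 140\,\text{GB/s}$$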

Figure 1: Block Diagram of GT200 GPU. The diagram shows the PCI-Express 2.0 x16 host interface, the interconnect network, eight DRAM interfaces each paired with a texture L2 slice, and ten TPCs (TPC 0 to TPC 9), each containing an SM controller, three SMs with their shared memories, texture units, and a texture L1 cache.

NVIDIA has announced the new generation of GPUs with the code name Fermi. There are a few differences from the GT200 generation. First of all, the TPC is removed from the architecture, and each SM contains 32 SPs, many more than the 8 of GT200. Fermi provides a real L1 and L2 cache hierarchy. Double precision computation power increases dramatically for the Tesla version of Fermi. Since at our present research stage cache effects are ignored, Fermi will be our next research target.


2.2 CUDA Programming Model

The Compute Unified Device Architecture (CUDA) [7] is a C-like abstraction for NVIDIA GPUs with very good programmability. A CUDA program is composed of host code, which runs on the host CPU, and device code, or kernel, which runs on the GPU processor. The device code is first compiled into PTX code, and then into native GPU binary code. In PTX code, register resources are assumed to be unlimited. Resource allocation and some optimization techniques, such as instruction reordering, are applied when compiling PTX code into binary code.

A typical CUDA program normally creates thousands of threads to hide latency, using a very light-weight context switch mechanism. These threads are grouped into 1D to 3D blocks, and blocks are grouped into 1D or 2D grids. Threads within one block can share data in shared memory. Programmers need to provide a large enough number of threads to get good occupancy, so as to hide arithmetic pipeline latency and memory access latency.
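As a minimal illustration of this hierarchy, the following kernel (our own sketch, not taken from the report) adds two vectors; each thread computes one element, addressed by its block and thread indices:

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index: offset of this block plus the thread's offset within it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be only partially used
        c[i] = a[i] + b[i];
}

// Host-side launch: a 1D grid of 1D blocks, here with 256 threads per block.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);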

Each block is distributed to one SM at execution time. A barrier synchronization operation can only be applied to threads of the same block. The basic execution and scheduling units are warps; for GT200 GPUs, each warp contains 32 threads. Because processor resources, such as registers and shared memory, are limited, only a limited set of warps can run concurrently on an SM; these are called active warps. We only need to model a few warps running concurrently on each SM to estimate the GPU's performance, so a good approximation can be obtained while keeping the problem size limited.
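As a worked example of this resource limit (our own, using publicly documented GT200 figures rather than numbers from the report): each GT200 SM has 16,384 32-bit registers and supports at most 32 active warps (1,024 threads). A kernel using 32 registers per thread can therefore keep at most 16384 / (32 × 32) = 16 warps active per SM, half of the hardware maximum.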

There are three levels of memory space in the CUDA abstraction. Each thread has its own local memory. Each block has a private shared memory partition. All threads can access the global memory, the constant memory space and the texture memory space. Local memory, global memory, constant memory and texture memory are mapped to off-chip DRAM; constant memory and texture memory are cached through read-only caches. The per-block shared memory is mapped to on-chip SRAM.
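In CUDA source code these spaces correspond to distinct declarations; a schematic example (identifier names are ours, for illustration only):

__constant__ float coeffs[16];        // constant memory: off-chip, cached, read-only in kernels

__global__ void scale(float *gdata)   // gdata points into global memory (off-chip DRAM)
{
    __shared__ float tile[256];       // per-block shared memory (on-chip SRAM)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = gdata[i];             // scalars live in registers (or spill to local memory)
    tile[threadIdx.x] = tmp;
    __syncthreads();                  // barrier: valid only within one block
    gdata[i] = tile[threadIdx.x] * coeffs[threadIdx.x % 16];
}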

3 Model Setup

In this section, we present an analytical model for the GT200 GPU and the key parameters of the model. Then we discuss some performance effects that TEG can demonstrate.

3.1 GPU Analytical Model

Our model for the GT200 GPU is illustrated in Figure 2. In our model, each SM is taken as one processor core. The SM is fed with warp instructions. Inside one SM, there are an issue engine, shared memory, a register file and functional units such as the SPs, DPU, SFU and LD/ST unit. Global memory load/store instructions are issued to the LD/ST unit.

Every 2 cycles, the issue engine selects one ready warp instruction from the active warps and issues it to a ready functional unit according to the instruction type. A warp instruction can be issued when all its source operands are ready. The GPU uses a scoreboard mechanism to select a warp with a ready warp instruction; in our model, different scoreboard policies are implemented.


For each warp, since instructions are issued in program order, if one instruction's source operands are not ready, all successive instructions of that warp have to wait.

The three SMs of one TPC share the same memory pipeline, and thus share 1/10 of the peak global memory bandwidth. Eight channels connect the device memory chips to the GPU processor, and each channel's bandwidth cannot exceed 1/8 of the peak global memory bandwidth. We do not model the on-die routing of memory requests, since the hardware details have not been disclosed.

Figure 2: GPU Analytical Model. The GPU comprises 10 TPCs, each with a memory pipeline shared by its 3 SMs. Each SM contains an issue engine fed by the active warps (Warp 0, Warp 1, ...), shared memory and a register file, and the SP, DPU, SFU and LD/ST functional units. Device memory is reached through 8 main memory channels.

3.2 Model Parameters

To use the analytical model in TEG, we need to estimate the model parameters. In this section, the main parameters are introduced. Much work has been done to understand GPU architecture through micro-benchmarking [11]; many results and ideas are borrowed from that work.

3.2.1 Instruction Latency

Two types of instruction latency are of interest. One is the execution latency: the number of cycles an instruction needs to execute in a functional unit. The other is the issue latency: the number of cycles the scheduler has to wait after issuing one warp instruction before it can issue another. A warp instruction launches 32 operations.


The issue latency is calculated from instruction throughput information. For example, the throughput of integer add is 8 ops/clock, so the issue latency is 32/8 = 4 cycles: when there are enough active warps, the scheduler can issue a warp instruction from another warp 4 cycles after issuing one integer add instruction. However, when only a single warp is active on the SM, the instruction throughput cannot reach the peak. From the instruction throughput of one warp, we can calculate the issue latency of instructions within one warp; more details are given in Section 3.2.2. We illustrate below the methodology that we use to measure instruction execution latency and issue latency.

The typical technique to measure instruction latency is to use the clock() function, which returns the value of a per-TPC counter. To measure instruction execution latency, we can simply put dependent instructions between two clock() calls. For example, the CUDA code in Listing 1 is translated into the PTX code in Listing 2.

t0 = clock();
r1 = r1 + r3;
r1 = r1 + r3;
...
r1 = r1 + r3;
t1 = clock();

Listing 1: CUDA Code Example

mov.u32 %r6, %clock;
add.f32 %f4, %f4, %f3;
add.f32 %f4, %f4, %f3;
...
add.f32 %f4, %f4, %f3;
mov.u32 %r7, %clock;

Listing 2: PTX Code Example

S2R R3, SR1;
SHL R3, R3, 0x1;
FADD32 R4, R4, R2;
FADD32 R4, R4, R2;
...
FADD32 R4, R4, R2;
S2R R4, SR1;
SHL R4, R4, 0x1;

Listing 3: Disassembly Binary Code Example

The assembly code obtained after compiling the PTX code to binary code is shown in Listing 3. The S2R instruction moves the clock register into a general purpose register. The dependent shift operation after S2R suggests that the clock counter is incremented at half the shader clock frequency. An extra 28 cycles are introduced by the dependence between SHL and S2R (24 cycles) plus the issue latency of SHL (4 cycles).
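A self-contained version of this microbenchmark could look as follows (our sketch, not the report's exact code; writing the result back to global memory keeps the compiler from removing the dependent chain):

__global__ void fadd_latency(float *data, unsigned int *cycles)
{
    float r1 = data[0], r3 = data[1];
    unsigned int t0 = clock();  // per-TPC cycle counter, read via S2R
    r1 = r1 + r3;               // chain of dependent single-precision adds:
    r1 = r1 + r3;               // each FADD32 waits for the previous one
    r1 = r1 + r3;
    // ... extended to 21 dependent adds in the actual measurement ...
    unsigned int t1 = clock();
    data[0] = r1;               // keep the add chain live
    *cycles = t1 - t0;          // includes the S2R/SHL overhead discussed above
}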


Instruction Type       Execution Latency   Issue Latency      Issue Latency
                                           (multiple warps)   (same warp)
Integer ADD            24                  4                  8
Integer MUL (16-bit)   24                  4                  8
Integer MAD (16-bit)   24                  4                  8
Float ADD              24                  4                  8
Float MUL              24                  2                  8
Float MAD              24                  4                  8
Double ADD             48                  32                 32
Double MUL             48                  32                 32
Double FMA             48                  32                 32

Table 1: Arithmetic Instruction Latency (cycles)

For 21 dependent FADD32 instructions between the two clock measurements, the measured time is 514 cycles, so the execution latency of FADD32 is

(514 − 28 − 8) / 20 ≈ 24 cycles,

where the 8 cycles are the issue latency of FADD32 within one warp (see Section 3.2.2 for more details).

The execution and issue latencies of some arithmetic instructions are listed in Table 1. Since the float MUL operation can be issued to both the SPs and the SFUs, it has a higher throughput and a shorter issue latency. In the table we only present the latencies of 16-bit integer MUL and MAD, since 32-bit integer MUL and MAD operations are translated into the native 16-bit integer instructions plus a few other instructions. In each SM, there is only one unit (the DPU) that can process double precision arithmetic instructions; thus the issue latency is much longer for double precision arithmetic instructions.

3.2.2 Performance Scaling on One SM

In the previous section, the issue latency was calculated assuming that several warps run concurrently. For example, the float MAD instruction's issue latency across multiple warps is 4 cycles, but if we run only one warp, the measured issue latency is 8 cycles. For a global memory load instruction GLD.U32, the issue latency within the same warp is around 60 cycles, while the issue latency across multiple warps is much smaller (we use 4 cycles in TEG). Similar results are obtained for the other arithmetic and memory instructions. This suggests that a warp stays occupied while issuing one instruction, and that this occupied period is normally longer than the time the scheduler must wait before issuing a new instruction from another warp. Thus it is not possible to achieve peak performance with only one active warp on an SM, even if most nearby instructions of that warp are independent.

After one warp instruction is issued, the scheduler can switch to another warp and issue another instruction without much waiting. However, if the scheduler keeps issuing instructions from the same warp, the longer issue latency applies. This undocumented behavior may affect performance when there are very few active warps on an SM.


3.2.3 Masked Instructions

All 32 threads within a warp execute the same warp instruction at a time. When the threads of a warp diverge due to a data-dependent branch, they may have different execution paths, and the GPU executes each path serially. In that case, the warp instruction is masked by a condition dependent on the thread index. For masked arithmetic instructions, we find that the behavior remains the same as the unmasked behavior; that is, the issue latency and execution latency are the same as those of unmasked arithmetic instructions. For memory operations, since less data needs to be transferred, the latency is shorter and less memory bandwidth is occupied.

3.2.4 Memory Access

We consider memory accesses separately from the other instructions for three reasons. First, the other functional units belong to one SM only, but the 3 SMs within one TPC share the same memory pipeline, and all SMs share the same 8 global memory channels. Second, the scheduler needs to wait around 60 cycles after issuing one global memory instruction before issuing another instruction from the same warp, but it can issue another instruction very quickly if it switches to another warp (see Section 3.2.2). Third, memory accesses have a much more complex behavior. For shared memory accesses, there might be bank conflicts (Section 3.3.3), in which case all memory accesses of one half-warp are serialized. For global memory accesses, there are coalesced and uncoalesced accesses (Section 3.3.4).

The typical shared memory latency is about 38 cycles, and the global memory latency without a TLB miss is about 436 to 443 cycles [11].

Let $C_{mem}$ represent the maximum number of concurrent memory transactions per TPC. It is calculated as follows:

$$\frac{N_{TPC} \cdot N_{Warp} \cdot ele\_size \cdot C_{mem}}{mem\_latency \cdot \frac{1}{Clk}} = B_{peak}$$

$$C_{mem} = \frac{B_{peak} \cdot mem\_latency}{N_{TPC} \cdot N_{Warp} \cdot ele\_size \cdot Clk}$$

where $N_{TPC}$, $N_{Warp}$, $ele\_size$, $mem\_latency$, $Clk$, and $B_{peak}$ denote the number of TPCs, the number of threads per warp, the accessed data type size, the global memory latency, the processor clock frequency, and the peak global memory bandwidth, respectively. For double precision memory transactions, $C_{mem} \approx 18$; thus the number of unfinished double precision memory transactions in the memory pipeline of a TPC cannot exceed 18.
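As a numerical check (our own, plugging in the GTX 280 figures quoted earlier: $B_{peak} \approx 140$ GB/s, $mem\_latency \approx 440$ cycles, within the measured 436 to 443 range, $N_{TPC} = 10$, $N_{Warp} = 32$, $ele\_size = 8$ bytes for double precision, $Clk = 1.296$ GHz):

$$C_{mem} = \frac{140 \times 10^{9} \times 440}{10 \times 32 \times 8 \times 1.296 \times 10^{9}} \approx 18.6$$

which is consistent with the value of 18 used above.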

3.3 Performance Effects

3.3.1 Branch Divergence

Masked instructions (Section 3.2.3) are warp instructions carrying a warp-sized mask; each bit of the mask indicates whether the corresponding thread is active for the instruction. Threads of the same warp may have different execution paths. Since the SM has to finish each path serially before the paths rejoin, extra execution time is introduced.
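A schematic example of such divergence (illustrative code, not from the report):

__global__ void diverge(int *data)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)      // even and odd threads of one warp take different
        data[tid] *= 2;    // paths: the SM runs each path serially with the
    else                   // inactive threads masked off, then the warp
        data[tid] += 1;    // reconverges
}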


3.3.2 Instruction Dependence and Memory Access Latency

One of the motivations and advantages of GPU processors is that they can hide latency due to instruction dependences or memory accesses by forking a large number of threads. However, when there are very few active warps, it is possible that at some point all warps are occupied issuing instructions: the scheduler is available, but none of the active warps can be released. The latency then cannot be perfectly hidden and may become an important factor of performance degradation.

3.3.3 Bank Conflicts in Shared Memory

The shared memory is divided into 16 memory modules, or banks, with a bank width of 4 bytes. The banks are interleaved so that successive 4-byte words of the shared memory space fall in successive banks. Threads of a half-warp should access different banks to achieve the maximum shared memory bandwidth; otherwise the accesses are serialized [9], except when all threads of the half-warp read the same shared memory address.

For example, the float ADD instruction

FADD32 R2, g[A1+0xb], R2;

has an operand g[A1+0xb] located in shared memory. Its execution latency is around 74 cycles without bank conflicts. If all threads within a half-warp access the same bank, the execution latency becomes about 266 cycles.
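The two access patterns can be contrasted as follows (illustrative kernel assuming GT200's 16 banks of 4-byte words and half-warp granularity; launched with 16 threads):

__global__ void bank_demo(float *out)
{
    __shared__ float s[16 * 16];
    int tid = threadIdx.x;
    s[tid] = (float)tid;
    __syncthreads();
    float a = s[tid];       // stride 1: the 16 threads of a half-warp hit
                            // 16 different banks -> conflict-free
    float b = s[16 * tid];  // stride 16: all 16 addresses map to the same
                            // bank -> 16-way conflict, accesses serialized
    out[tid] = a + b;
}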

3.3.4 Uncoalesced Memory Access in Global Memory

The global memory of the GPU has a very high access latency compared to shared memory. For the global memory accesses of a half-warp, if certain conditions are satisfied, the memory accesses can be coalesced into one or two transactions. The required conditions depend on the GPU hardware and the CUDA compute capability; the general guideline is that the threads of one half-warp should access adjacent memory elements. If the coalescing conditions cannot be met, more memory transactions are needed, introducing a large performance loss. For example, if every thread loads 4 bytes from global memory, in the worst case 16 separate 32-byte transactions are issued to serve the half-warp, and 87.5% of the global memory bandwidth is wasted.
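For example (illustrative kernel with 4-byte elements; the buffers are assumed large enough for the strided read):

__global__ void coalesce_demo(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[tid];           // consecutive threads read consecutive words:
                                 // a half-warp is served by 1-2 transactions
    float y = in[tid * stride];  // with a large stride, each thread needs its
                                 // own 32-byte transaction: up to 16 per half-warp
    out[tid] = x + y;
}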

3.3.5 Channel Skew in Global Memory

The global memory of the GT200 GPU is divided into 8 partitions, so global memory is accessed through 8 channels. The channel width is 256 bytes (32 × 8B) [9]. As with shared memory, concurrent accesses to global memory should be distributed evenly among the partitions to achieve high global memory bandwidth. Load imbalance over the memory channels may significantly impair performance: if the application's memory access pattern is significantly imbalanced over the channels, a large performance degradation is introduced.
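As a worked example (our own, following the partition layout above): with 8 channels of 256 bytes each, consecutive 256-byte blocks of the address space rotate through the channels, and the mapping wraps every 8 × 256B = 2KB. If concurrent warps access addresses that are 2KB apart, say the elements of one column of a 512-float-wide matrix, all their requests land in the same partition, and the usable bandwidth drops toward 1/8 of the peak.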


Figure 3: Workflow of TEG. The source code is compiled by the NVIDIA compiler into binary code; the binary code is disassembled into assembly code and also fed to a functional simulator that produces the instruction trace; instruction analysis extracts operators, operands and dependences; the issue engine model then drives the functional unit models, and an information collector gathers the timing results.

4 Workflow of TEG

Based on our GPU timing model, we have developed the GPU timing estimation tool TEG. The workflow of TEG is illustrated in Figure 3. The CUDA source code is first compiled into binary code with the NVIDIA compiler collection; the binary code includes the native kernel code that runs on the GPU device. Second, the binary code is disassembled using the cuobjdump tool provided by NVIDIA [7]. Third, TEG analyzes the generated assembly code and obtains information such as instruction types and operand types.

In many cases we need the actual instruction trace. It can be obtained with detailed GPU simulators such as Barra [3] or GPGPU-Sim [2]; in our study, the instruction trace is provided by the Barra simulator.

After the third step, the assembly code information and the instruction trace are fed to the issue engine model (see Figure 3). The issue engine model issues all the warp instructions to the corresponding functional unit models according to the instruction trace and our GPU timing model. At this stage, all runtime timing information can be collected by our information collector.
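The following sketch shows how such an issue-engine loop can be structured. This is our illustrative reconstruction from the description above, not TEG's actual source; the types and the round-robin selection policy are assumptions:

#include <cstddef>
#include <vector>

// One traced instruction, annotated with the model parameters of Section 3.2.
struct Instr {
    int id;             // global instruction id
    int unit;           // target functional unit (SP, DPU, SFU, LD/ST, ...)
    int issue_lat;      // issue latency seen by the scheduler (multiple warps)
    int same_warp_lat;  // issue latency for the next instruction of the same warp
    int exec_lat;       // execution latency
    int dep;            // id of the producer instruction, or -1 if none
};

struct Warp { std::vector<Instr> trace; size_t pc = 0; long ready_at = 0; };

long simulate(std::vector<Warp>& warps, int num_units, size_t num_instrs)
{
    std::vector<long> done(num_instrs, 0);      // completion cycle of each instruction
    std::vector<long> unit_free(num_units, 0);  // cycle when each unit accepts work again
    long cycle = 0, finish = 0;
    size_t remaining = 0;
    for (auto& w : warps) remaining += w.trace.size();
    while (remaining > 0) {
        for (auto& w : warps) {                 // round-robin scan of the active warps
            if (w.pc == w.trace.size()) continue;
            const Instr& in = w.trace[w.pc];
            bool ready = w.ready_at <= cycle && unit_free[in.unit] <= cycle &&
                         (in.dep < 0 || done[in.dep] <= cycle);  // scoreboard check
            if (ready) {
                unit_free[in.unit] = cycle + in.issue_lat;
                w.ready_at = cycle + in.same_warp_lat;  // Section 3.2.2 behavior
                done[in.id] = cycle + in.exec_lat;
                if (done[in.id] > finish) finish = done[in.id];
                ++w.pc; --remaining;
                break;                          // one issue slot per scheduling step
            }
        }
        cycle += 2;                             // the issue engine selects every 2 cycles
    }
    return finish;                              // estimated kernel execution time in cycles
}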

We can vary the configuration of TEG, such as the number of active warps per SM, to observe how performance scales from one warp to multiple concurrent warps. We can also compare the performance with and without one bottleneck by choosing whether or not to apply the bottleneck's effects in TEG. We can thus quantify how much performance would be gained by eliminating the bottleneck, and programmers can decide whether it is worth the optimization effort.


WarpNum             1         2         4         8         16
Measured (cycles)   55605     55803     71465     107668    186958
Predicted (cycles)  52590     52878     64578     109364    200538
Error               -5.73%    -5.53%    -10.66%   1.55%     6.77%

Table 2: C = AB^T with Bank Conflict

Figure 4: C = AB^T with Bank Conflict. Measured vs. predicted speedup for 1 to 16 active warps per SM.

5 Evaluation

5.1 Dense Matrix Multiplication

We choose a dense matrix multiplication example from the CUDA SDK; to demonstrate the function of TEG, we change C = AB into C = AB^T:

$$C(i, j) = \sum_{k} A(i, k) \cdot B(j, k)$$

In the implementation, the three matrices A, B and C are partitioned into 16x16 sub-matrices. The computation of one C sub-matrix is assigned to one CUDA block. A block is composed of 256 (16x16) threads, and each thread computes one element of the C sub-matrix. In the CUDA kernel, at each step, the block of threads first loads the A and B sub-matrices into shared memory. After a barrier synchronization of the block, each thread loads A(i, k) and B(j, k) from shared memory and accumulates the product into C(i, j).

However, a half-warp of threads loads B(j, k), B(j + 1, k), ..., B(j + 15, k). For a shared memory allocation like B[16][16], these 16 elements all reside in the same bank, so there are bank conflicts in shared memory, as the sketch below illustrates.
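A sketch of the kernel's inner structure (our reconstruction of an SDK-style tiled kernel adapted to C = AB^T, not the exact code of the experiment; n is assumed to be a multiple of 16):

__global__ void matmul_abt(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];  // padding to Bs[16][17] is the usual conflict fix
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * 16 + ty;
    int col = blockIdx.x * 16 + tx;
    float acc = 0.0f;
    for (int k0 = 0; k0 < n; k0 += 16) {
        As[ty][tx] = A[row * n + k0 + tx];                    // tile of A
        Bs[ty][tx] = B[(blockIdx.x * 16 + ty) * n + k0 + tx]; // row j of Bs holds B(j, *)
        __syncthreads();
        for (int k = 0; k < 16; ++k)
            acc += As[ty][k] * Bs[tx][k];  // Bs[tx][k]: a half-warp reads 16 elements
                                           // that are 16 floats apart -> same bank
        __syncthreads();
    }
    C[row * n + col] = acc;
}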

In the following experiment, we assign each warp the same amount of workload and run 1 to 16 warps concurrently on one SM. We use the clock() function to measure the on-device execution time of one block, since barrier synchronization is only applicable within one block. For the total execution time of multiple blocks, we use the measured host time to calculate the device execution time.


WarpNum             1         2         4         8         16
Measured (cycles)   17511     17291     18330     23228     33227
Predicted (cycles)  16746     17528     19510     23630     34896
Error               -4.57%    1.35%     6.05%     1.70%     4.78%

Table 3: C = AB^T Modified

For example, when there are 30 blocks, each SM is assigned one block, and when there are 60 blocks, each SM has two blocks to execute. We then compare the host times of the two configurations and calculate the number of cycles needed for 2 blocks (16 warps) to finish on one SM.

The measured and predicted execution times of 1 to 16 concurrent warps on one SM are given in Table 2. We then normalize the execution time by the workload and show the speedup from 1 to 16 active warps per SM in Figure 4.

Figure 5: C = AB^T Modified. Measured vs. predicted speedup for 1 to 16 active warps per SM.

For GPU performance optimization, programmers often face the question of how much performance is lost due to one degradation factor. With TEG, this question is fairly easy to answer: we just need to change the configuration. In this case, we let the tool assume that all shared memory accesses are conflict-free. We can thus estimate the performance without shared memory bank conflicts, as illustrated in Figure 5 and Table 3.

We then modified the CUDA code to eliminate the bank conflicts and compared the result with TEG's output. The comparison shows a very good approximation.

5.2 Lattice QCD

Quantum chromodynamics (QCD) is the physical theory of the strong interaction, and lattice QCD is a numerical approach to QCD. Lattice QCD simulation is one of the challenging problems for the high performance computing community.


WarpNum             1          2         4
Measured (cycles)   51053      68383     122430
Predicted (cycles)  46034      66674     110162
Error               -10.90%    -2.56%    -11.14%

Table 4: Hopping_Matrix kernel with Uncoalesced Accesses

We select one kernel of the Hopping_Matrix routine [5] as our example. The inputs of the Hopping_Matrix kernel are the spinor field and the gauge field; the output is the result spinor field. The spinor field resides on the 4D space-time sites and is represented by a 3x4 complex matrix data structure. The gauge field, on the links connecting neighboring sites, is implemented as a 3x3 complex matrix. The half spinor field is represented by a 3x2 complex matrix; it is the temporary data generated in each of the 8 space-time directions for one full spinor.

The functionality of the kernel is not important to our discussion; the memory layout is what matters here. In the first version of our implementation, all the data is organized as arrays of structures. This is the typical data layout for conventional processors, chosen to obtain good cache hit rates. However, a GPU has many more concurrent threads, and normally different threads are assigned different data structures, so the accesses of the threads in a warp have a long stride equal to the size of the data structure. Accesses to global memory therefore cannot be coalesced (see the sketch below). The predicted and measured execution results are given in Table 4 and Figure 6. Since each thread occupies many registers, the number of active warps is limited.
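The contrast between the two layouts can be sketched as follows (hypothetical types of our own; the actual structures are the complex matrices described above):

// Array of structures: consecutive threads read addresses sizeof(Site)
// bytes apart, so the loads of a warp cannot be coalesced.
struct Site { float re[12]; float im[12]; };
__global__ void aos_read(const Site *field, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = field[t].re[0] + field[t].im[0];
}

// Structure of arrays: component 0 of all sites is contiguous, so
// consecutive threads read consecutive words and the loads coalesce.
struct Field { float *re[12]; float *im[12]; };
__global__ void soa_read(Field f, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = f.re[0][t] + f.im[0][t];
}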

Figure 6: Hopping_Matrix kernel with Uncoalesced Accesses. Measured vs. predicted speedup for 1 to 4 active warps per SM.

If we reorganize the data layout into structures of arrays, the memory accesses of the threads of a warp become adjacent and can thus be coalesced. The results are shown in Table 5 and Figure 7. This case also shows that TEG can easily quantify the performance loss due to bottlenecks such as uncoalesced memory accesses.


WarpNum             1         2         4
Measured (cycles)   37926     47038     73100
Predicted (cycles)  36202     45204     68104
Error               -4.76%    -4.06%    -7.34%

Table 5: Hopping_Matrix kernel with Coalesced Accesses

Figure 7: Hopping_Matrix kernel with Coalesced Accesses. Measured vs. predicted speedup for 1 to 4 active warps per SM.

6 Conclusion

In this paper, we presented our timing model for the GT200 GPU and the timing estimation tool TEG. With the timing model and the assembly code as input, TEG estimates GPU performance at a cycle-approximate level. The evaluation cases show that TEG achieves a very close performance approximation; in particular, TEG approximates well applications with very few active warps on an SM. We can thus better understand the GPU's performance results and quantify the performance effects of bottlenecks. Present profiling tools can only provide programmers with bottleneck statistics, like the number of shared memory bank conflicts. TEG allows programmers to understand how much performance one bottleneck costs and to foresee the benefit of eliminating it.

The main limitation is that TEG cannot handle situations with heavy memory traffic and a lot of memory contention, because we lack knowledge of the detailed on-die memory controller organization and such an analysis is far too complicated for our method. Our future plans include studying the new Fermi architecture and introducing cache effects into our model.

References

[1] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu. An adaptive performance modeling tool for GPU architectures. SIGPLAN Not., 45:105–114, January 2010.


[2] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, pages 163–174. IEEE, 2009.

[3] S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: A parallel functional simulator for GPGPU. In 2010 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 351–360. IEEE, 2010.

[4] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News, 37(3):152–163, 2009.

[5] K. Z. Ibrahim and F. Bodin. Efficient SIMDization and data management of the lattice QCD computation on the Cell Broadband Engine. Sci. Program., 17(1-2):153–172, 2009.

[6] Y. Kim and A. Shrivastava. CuMAPz: a tool to analyze memory access patterns in CUDA. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 128–133. IEEE, 2011.

[7] NVIDIA. NVIDIA CUDA C Programming Guide 4.0.

[8] OpenCL. http://www.khronos.org/opencl/.

[9] G. Ruetsch and P. Micikevicius. Optimizing matrix transpose in CUDA. 2009.

[10] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO '08: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, pages 195–204, New York, NY, USA, 2008. ACM.

[11] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In ISPASS'10, pages 235–246, 2010.

[12] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA 17), Feb. 2011.


Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex. inria.fr

ISSN 0249-6399