
Correctly Rounded Dot Product in CUDA

Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg

Chair of Computer Science II

Software Engineering

Universität Würzburg

Germany

SWIM 2014



Outline

1 Introduction

2 CUDA

3 Error Free Transformations

4 Implementation

5 Experimental Results



1 Introduction



Motivation: Computer Architecture

A change in the design of computer architecture

Single-core ⇒ multi-core

Circumvent physical constraints

Parallelize the computation



Motivation: Graphics Devices

GPGPU

Performance

Highly parallel

CUDA (Compute Unified Device Architecture)

NVIDIA GPUs

OpenCL (Open Computing Language)

Open standard

C-like programming language

CPUs, GPUs, and many other devices



2 CUDA




CUDA: Compute Unified Device Architecture

GPU architecture for General Purpose Computing

PCI Express add-in boards

Highly parallel

Several streaming multiprocessors (SM)

Each consisting of several cores

Compute capability = classification

Usable like a many-core CPU

CUDA C



CUDA Thread Hierarchy

Grid = group of thread blocks

Thread block = group of threads running on one SM

Warp = 32 threads of a thread block

SIMT (Single Instruction, Multiple Threads)

Fundamental unit

[Figure: a grid of 3 × 2 thread blocks, Block(0,0) to Block(2,1). Block (1,1) is expanded to show its 4 × 4 threads, Thread (0,0) to Thread (3,3).]
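The hierarchy can be made concrete with a little index arithmetic. The sketch below is plain Python run on the host, not CUDA C, and its function names are made up for illustration. It relies on the CUDA convention that thread (x, y) in a block of width Dx has the linear ID x + y · Dx, and that warps are formed from 32 consecutive linear IDs.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide

def linear_thread_id(tx, ty, block_dim_x):
    """Linear ID of thread (tx, ty) inside its block."""
    return tx + ty * block_dim_x

def warp_index(tx, ty, block_dim_x):
    """Index of the warp (within the block) that executes thread (tx, ty)."""
    return linear_thread_id(tx, ty, block_dim_x) // WARP_SIZE

def total_threads(grid_dim, block_dim):
    """Number of threads launched for a 2D grid of 2D blocks."""
    return grid_dim[0] * grid_dim[1] * block_dim[0] * block_dim[1]

print(linear_thread_id(3, 3, 4))       # 15: last thread of a 4 x 4 block
print(total_threads((3, 2), (4, 4)))   # 96: the grid drawn in the figure
```

A 4 × 4 block holds only 16 threads, so all of them fall into warp 0; a block of 512 threads would be split into 16 warps.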



CUDA Memory Model

Memory    Access  Scope                 Lifetime
Register  R/W     1 thread              Thread
Local     R/W     1 thread              Thread
Shared    R/W     All threads in block  Block
Global    R/W     All threads + host    Host allocation
Constant  R       All threads + host    Host allocation
Texture   R       All threads + host    Host allocation



CUDA Processing flow



3 Error Free Transformations




Error Free Transformations

[Figure: the exact sum a + b splits into the rounded sum s = fl(a + b) and the remainder r.]

The rounding error of the sum of two floating-point numbers is itself a floating-point number

Computable by a simple algorithm

(s, r) = twoSum(a, b):

s = a + b

a’ = s - b

b’ = s - a’

δa = a - a’

δb = b - b’

r = δa + δb
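The steps of twoSum translate directly into code. The following is a minimal Python sketch on host doubles, not the CUDA device function from the talk; the variable names are chosen for illustration, and it assumes round-to-nearest arithmetic without overflow.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, r) with s = fl(a + b) and s + r == a + b exactly."""
    s = a + b
    a_virtual = s - b            # a' on the slide
    b_virtual = s - a_virtual    # b'
    delta_a = a - a_virtual      # part of a that did not make it into s
    delta_b = b - b_virtual      # part of b that did not make it into s
    return s, delta_a + delta_b

s, r = two_sum(1e16, 1.0)        # 1.0 is lost in the rounded sum ...
print(s, r)                      # ... and reappears as the error term

# The pair (s, r) represents the exact sum; checked here with rationals.
print(Fraction(s) + Fraction(r) == Fraction(1e16) + Fraction(1.0))
```

The six floating-point operations contain no branch, which suits the SIMT execution model.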



twoProduct

(s, r) = twoProduct (a, b):

s = a · b

(a1, a2) = split (a)

(b1, b2) = split (b)

r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)

(x,y) = split (a):

factor = 2^s + 1 (s = ⌈p/2⌉ for a p-bit significand: 27 for double, 12 for float)

c = factor · a

x = c - (c - a)

y = a - x
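A minimal Python sketch of split and twoProduct as written above, for IEEE doubles (so the split exponent is 27). It runs on the host rather than on the GPU, the names are illustrative, and it assumes that no overflow or underflow occurs.

```python
from fractions import Fraction

SPLIT_FACTOR = 2.0**27 + 1.0     # 2^s + 1 with s = 27 for doubles

def split(a):
    """Split a into a high part x and a low part y with x + y == a."""
    c = SPLIT_FACTOR * a
    x = c - (c - a)
    y = a - x
    return x, y

def two_product(a, b):
    """Return (s, r) with s = fl(a * b) and s + r == a * b exactly."""
    s = a * b
    a1, a2 = split(a)
    b1, b2 = split(b)
    r = a2 * b2 - (((s - a1 * b1) - a2 * b1) - a1 * b2)
    return s, r

a = 1.0 + 2.0**-30
s, r = two_product(a, a)         # a * a needs 61 bits, so r is non-zero
print(Fraction(s) + Fraction(r) == Fraction(a) * Fraction(a))
```

Each half produced by split is short enough that the four partial products are exact, which is why the nested subtraction recovers the rounding error.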

(s, r) = twoProduct_fast (a, b):

s = a · b

r = fma(a, b, -s)
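The fused multiply-add variant can be tried out in Python as follows. This is a sketch under assumptions: `math.fma` exists only from Python 3.13, so the helper `fma` below (a name introduced here) falls back to exact rational arithmetic with a single final rounding on older versions. On the GPU the hardware FMA instruction plays this role.

```python
import math
from fractions import Fraction

def fma(a, b, c):
    """Compute a * b + c with a single rounding."""
    if hasattr(math, "fma"):                 # available from Python 3.13
        return math.fma(a, b, c)
    # Fallback for illustration only: exact rationals, rounded once at the end.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def two_product_fast(a, b):
    """Return (s, r) with s = fl(a * b) and s + r == a * b exactly."""
    s = a * b
    r = fma(a, b, -s)                        # exact product minus rounded product
    return s, r

a = 1.0 + 2.0**-30
s, r = two_product_fast(a, a)
print(Fraction(s) + Fraction(r) == Fraction(a) * Fraction(a))
```

Two operations replace the seventeen of the split-based version, which is the reason for the separate "fast" measurements later in the talk.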



4 Implementation




Long Accumulator

Array of 67 * 64 Bit (unsigned long long int)

32 Bit data

32 Bit overflow

2 accumulators

positive

negative



Long Accumulator

addFloatToAccuDev

Device function

Atomic add

Memory accesses

double: max. 3

float: max. 2

SIMT
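The limit of three memory accesses for a double follows from the layout: a 53-bit significand shifted by up to 31 bits covers at most 84 bits, which is at most three 32-bit data fields. The Python sketch below illustrates this for one accumulator. It is not the device code: the bit origin `MIN_EXP` (bit 0 of word 0 has weight 2^-1074) and the function names are assumptions made for this example, and the plain `+=` stands in for the atomic add.

```python
import math
from fractions import Fraction

WORDS = 67            # accumulator length from the slides
DATA_BITS = 32        # low half of each 64-bit word carries data
MIN_EXP = -1074       # assumption: bit 0 of word 0 has weight 2**-1074

def add_to_accu(acc, x):
    """Add a finite x >= 0 to acc; return how many words were updated."""
    if x == 0.0:
        return 0
    m, e = math.frexp(x)                 # x == m * 2**e with 0.5 <= m < 1
    mant = int(m * 2**53)                # significand as a 53-bit integer
    shift = e - 53 - MIN_EXP             # bit position of mant's lowest bit
    if shift < 0:                        # subnormal: the dropped bits are zero
        mant >>= -shift
        shift = 0
    word, bit = divmod(shift, DATA_BITS)
    bits = mant << bit                   # at most 53 + 31 = 84 bits wide
    touched = 0
    while bits:
        chunk = bits & (2**DATA_BITS - 1)
        if chunk:
            acc[word] += chunk           # an atomic add on the device
            touched += 1
        bits >>= DATA_BITS
        word += 1
    return touched

def accu_value(acc):
    """Exact value held by the accumulator, as a rational number."""
    total = sum(w << (DATA_BITS * i) for i, w in enumerate(acc))
    return Fraction(total, 2**-MIN_EXP)

acc = [0] * WORDS
print(add_to_accu(acc, 0.1), accu_value(acc) == Fraction(0.1))
```

Because every addition lands in the low 32 bits of a 64-bit word, the upper 32 bits absorb carries, and the words never have to be normalized while the threads are still adding.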



Long Accumulator

addFloatsToAccuKernel

1 Allocate memory on device

2 Add all floats to the accumulators (positive and negative)

addFloatToAccuDev

3 Propagate carry bits

4 Positive accu + negative accu

5 Determine exactly rounded floating point number
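Steps 2 to 5 can be sketched sequentially in Python. This is an illustration under assumptions rather than the kernel itself: the loop over the values replaces the parallel threads, Python's arbitrary-precision integers replace the word-by-word subtraction of step 4, and the final integer division (which Python rounds correctly) replaces the bit inspection of step 5. All names are introduced here.

```python
import math

WORDS, DATA_BITS, MIN_EXP = 67, 32, -1074
MASK = 2**DATA_BITS - 1

def add_to_accu(acc, x):
    """Step 2 helper: add |x| (finite) to acc in 32-bit chunks."""
    num, den = abs(x).as_integer_ratio()     # |x| == num / den, den a power of two
    bits = num * (2**-MIN_EXP // den)        # |x| as a multiple of 2**-1074
    word = 0
    while bits:
        acc[word] += bits & MASK
        bits >>= DATA_BITS
        word += 1

def propagate_carries(acc):
    """Step 3: move the overflow half of each word into the next word."""
    for i in range(len(acc) - 1):
        acc[i + 1] += acc[i] >> DATA_BITS
        acc[i] &= MASK

def accu_to_int(acc):
    return sum(w << (DATA_BITS * i) for i, w in enumerate(acc))

def accu_sum(values):
    pos, neg = [0] * WORDS, [0] * WORDS      # step 1: two accumulators
    for x in values:                         # step 2: one thread per value on the GPU
        add_to_accu(pos if x >= 0 else neg, x)
    propagate_carries(pos)                   # step 3
    propagate_carries(neg)
    total = accu_to_int(pos) - accu_to_int(neg)   # step 4
    return total / 2**-MIN_EXP               # step 5: one correctly rounded division

print(accu_sum([1e16, 1.0, -1e16, 1.0]))
```

Keeping separate positive and negative accumulators means that step 2 only ever adds, so no borrow handling is needed until the single subtraction in step 4.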



Long Accumulator

Propagate carry bits

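[Figure: carry propagation on 16-bit example words. The upper (overflow) half of one word is added to the next higher word: 1100101101111111 followed by 0001000010000000 turns the higher word into 0001000101001011. The word 1100101010110101 is also shown.]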
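The bit patterns on this slide can be reproduced with a few lines of Python. The sketch assumes that the figure uses 16-bit words with 8 data bits and 8 overflow bits as a scaled-down stand-in for the real 32 + 32 layout; the function name is chosen for illustration.

```python
def propagate_carries(words, data_bits):
    """Add the overflow part of each word to the next higher word."""
    mask = (1 << data_bits) - 1
    for i in range(len(words) - 1):
        words[i + 1] += words[i] >> data_bits   # overflow bits move one word up
        words[i] &= mask                        # only the data bits stay behind
    return words

words = propagate_carries([0b1100101101111111, 0b0001000010000000], data_bits=8)
print([format(w, "016b") for w in words])
```

The higher word becomes 0001000101001011, matching the last pattern on the slide. The pass runs from the lowest word upwards, so a carry produced in one word is already included when the next word is processed.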



Long Accumulator

Exactly rounded floating point number



EFT accumulation

addArrayWithCuda

Compute sum of n floating point numbers

Iterative execution

Repetitive kernel executions

Tree-like

Parameters

inputs

errors

maxExp

numErrNonZero
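A sequential Python sketch of the tree-like reduction: each call to `reduce_pass` plays the role of one kernel execution and halves the number of inputs. The function names are hypothetical, and treating maxExp as the largest unbiased IEEE exponent among the non-zero errors is an assumption; the slides only name the parameter.

```python
import math

def two_sum(a, b):
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    return s, (a - a_virtual) + (b - b_virtual)

def reduce_pass(inputs, errors):
    """One 'kernel execution': add neighbouring pairs, collect non-zero errors."""
    out = []
    for i in range(0, len(inputs) - 1, 2):
        s, r = two_sum(inputs[i], inputs[i + 1])
        out.append(s)
        if r != 0.0:
            errors.append(r)
    if len(inputs) % 2:                      # odd element is carried over unchanged
        out.append(inputs[-1])
    return out

def add_array(values):
    """Return (sum, errors, maxExp, numErrNonZero) after the full reduction."""
    inputs, errors = list(values), []
    while len(inputs) > 1:
        inputs = reduce_pass(inputs, errors)
    # frexp returns e with |x| < 2**e, so the unbiased IEEE exponent is e - 1.
    max_exp = max((math.frexp(e)[1] - 1 for e in errors), default=None)
    return (inputs[0] if inputs else 0.0), errors, max_exp, len(errors)

print(add_array([1e16, 1.0, -1e16, 1.0]))
```

In this example the rounded sum is 0.0 while the two recorded errors add up to 2.0, which shows why the errors have to be accumulated as well.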



EFT accumulation



EFT accumulation: error handling

errorFreeSum

1 Compute the sum S and store the result in inputs. Store the errors in errors.

2 Compute the sum E of all errors and add E to S. Store the new errors in errors.

3 Estimate the remaining error:

$\mathit{numErrNonZero} \cdot 2^{\mathit{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \mathit{errors}$

4 Go to 2 if the remaining error can still influence the sum.
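The loop can be sketched in Python as follows. Several details are assumptions, since the slides do not spell them out: maxExp is taken as the largest unbiased exponent among the errors, "no influence" is tested as S ± bound rounding back to S, and `max_rounds` is a safety limit added here. The sketch does not handle overflow and does not by itself guarantee a correctly rounded result when the limit is reached.

```python
import math

def two_sum(a, b):
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    return s, (a - a_virtual) + (b - b_virtual)

def tree_sum(values):
    """Pairwise two_sum reduction; returns (rounded sum, non-zero errors)."""
    level, errors = list(values), []
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            s, r = two_sum(level[i], level[i + 1])
            nxt.append(s)
            if r != 0.0:
                errors.append(r)
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return (level[0] if level else 0.0), errors

def error_free_sum(values, max_rounds=64):
    s, errors = tree_sum(values)                 # step 1
    for _ in range(max_rounds):
        e, errors = tree_sum(errors)             # step 2: E and the new errors
        s, r = two_sum(s, e)
        if r != 0.0:
            errors.append(r)
        if not errors:                           # nothing left: s is exact
            break
        # Step 3: |sum of errors| <= numErrNonZero * 2**(maxExp + 1)
        max_exp = max(math.frexp(x)[1] for x in errors) - 1
        bound = math.ldexp(len(errors), max_exp + 1)
        # Step 4: stop once the bound cannot move s in either direction.
        if s + bound == s and s - bound == s:
            break
    return s

print(error_free_sum([1e16, 1.0, -1e16, 1.0]))
```

The stopping test is deliberately conservative: it may run extra rounds when the remaining error is close to half a unit in the last place of S.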



Dot product

$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$

1 Compute the products $a_i \cdot b_i$ using

twoProduct

twoProduct_fast

2 Accumulate the products using

long accumulator

errorFreeSum
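The two steps combine into a short Python sketch. Step 1 uses the split-based twoProduct for doubles. For step 2, `math.fsum` is swapped in as a stand-in for the long accumulator or errorFreeSum, because it also returns a correctly rounded sum on IEEE 754 platforms; it is not the accumulation method of the talk. The sketch assumes that no overflow or underflow occurs in the products.

```python
import math
from fractions import Fraction

SPLIT_FACTOR = 2.0**27 + 1.0     # 2^s + 1 with s = 27 for doubles

def split(a):
    c = SPLIT_FACTOR * a
    x = c - (c - a)
    return x, a - x

def two_product(a, b):
    """Return (s, r) with s = fl(a * b) and s + r == a * b exactly."""
    s = a * b
    a1, a2 = split(a)
    b1, b2 = split(b)
    return s, a2 * b2 - (((s - a1 * b1) - a2 * b1) - a1 * b2)

def dot(a, b):
    """Dot product rounded once, from exact products and an exact sum."""
    terms = []
    for x, y in zip(a, b):
        p, r = two_product(x, y)     # step 1: p + r == x * y exactly
        terms += [p, r]
    return math.fsum(terms)          # step 2: stand-in for the exact accumulation

a = [1.0 + 2.0**-30, 1e8, -1.0]
b = [1.0 + 2.0**-30, 1e-8, 1.0]
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
print(dot(a, b) == float(exact))     # compare with the rounded exact value
```

A vector of length n yields 2n summands, so the accumulation step sees twice as many inputs as a plain summation of the same length.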



5 Experimental Results




Test System

Intel Core i5-2500K 3.30 GHz

NVIDIA GeForce GTX 760

Compute capability 3.0

1152 CUDA cores

Windows 7 Professional (64 Bit)



Inc Input const Threads: Float

512 Threads * 512 Blocks = 262144 Threads



Inc Input const Threads: Double

512 Threads * 512 Blocks = 262144 Threads



Inc Input inc Threads: Float

Threads / Input = 8



Inc Input inc Threads: Double

Threads / Input = 8



Inc pos Input inc Threads: Float



Inc pos Input inc Threads: Double



Inc small pos Input inc Threads: Float



Inc small pos Input inc Threads: Double



Dot Product: Float



Dot Product: Double



Dot Product Fast: Float



Dot Product Fast: Double



Questions?

Marco Nehmeier

