Introduction CUDA Error Free Transformations Implementation Experimental Results

Correctly Rounded Dot Product in CUDA

Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg

Chair of Computer Science II, Software Engineering, Universität Würzburg

Germany

SWIM 2014

Marco Nehmeier — Correctly Rounded Dot Product in CUDA 1/38

Outline

1 Introduction

2 CUDA

3 Error Free Transformations

4 Implementation

5 Experimental Results

Motivation: Computer Architecture

A change in the design of computer architecture

Single-core ⇒ multi-core

Circumvent physical constraints

Parallelize the computation

Motivation: Graphics Devices

GPGPU

Performance

Highly parallel

CUDA (Compute Unified Device Architecture)

NVIDIA GPUs

OpenCL (Open Computing Language)

Open standard

C-like programming language

CPUs, GPUs, and many other devices

CUDA: Compute Unified Device Architecture

GPU architecture for General Purpose Computing

PCI Express add-in boards

Highly parallel

Several streaming multiprocessors (SM)

Each consisting of several cores

Compute capability = classification

Usable like a many-core CPU

CUDA C

CUDA Thread Hierarchy

Grid = group of thread blocks

Thread block = group of threads running on one SM

Warp = 32 threads of a thread block

SIMT (Single Instruction Multiple Threads)

Fundamental unit

[Figure: a grid of 3 × 2 thread blocks, Block(0,0) to Block(2,1), with Block(1,1) expanded into 4 × 4 threads, Thread(0,0) to Thread(3,3)]
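
The hierarchy above determines how a thread locates its share of the data. The following host-side C++ sketch only mirrors the index arithmetic a CUDA kernel would typically perform with the built-in variables `blockIdx`, `blockDim` and `threadIdx`. The struct `Dim2` and the function names are hypothetical and not taken from the slides.

```cpp
#include <cassert>

// Hypothetical two-dimensional stand-in for CUDA's built-in index types.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Global coordinate of a thread within the whole grid.
Dim2 global_index(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    return Dim2{blockIdx.x * blockDim.x + threadIdx.x,
                blockIdx.y * blockDim.y + threadIdx.y};
}

// Linear thread id inside its block; x varies fastest.
unsigned linear_thread_id(Dim2 blockDim, Dim2 threadIdx) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    return threadIdx.y * blockDim.x + threadIdx.x;
}

// Warp number inside the block: 32 consecutive linear ids form one warp.
unsigned warp_of(Dim2 blockDim, Dim2 threadIdx) {
    const unsigned warp_size = 32;
    return linear_thread_id(blockDim, threadIdx) / warp_size;
}
```

With 4 × 4 blocks as in the figure, all 16 threads of a block fall into a single warp.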

CUDA Memory Model

Memory    Access  Scope                 Lifetime

Register  R/W     1 thread              Thread
Local     R/W     1 thread              Thread
Shared    R/W     All threads in block  Block
Global    R/W     All threads + host    Host allocation
Constant  R       All threads + host    Host allocation
Texture   R       All threads + host    Host allocation

CUDA Processing flow

Error Free Transformations

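[Figure: number line with the floating point numbers f1 and f2, the exact sum a + b, the rounded sum s = fl(a + b), and the rounding error r]

The error of the summation of two floating point numbers is also a floating point number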

Computable by a simple algorithm

(s, r) = twoSum(a, b):

s = a + b

a’ = s - b

b’ = s - a’

δa = a - a’

δb = b - b’

r = δa + δb
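
The twoSum steps above translate almost line by line into code. The following is a minimal host-side C++ sketch in double precision; the struct `SumErr` and the name `two_sum` are illustrative and not from the slides. It assumes round-to-nearest arithmetic, no overflow, and a compiler that does not reassociate floating point expressions (for example, no `-ffast-math`).

```cpp
#include <cassert>
#include <cmath>

// Rounded sum s and rounding error r, so that a + b == s + r exactly.
struct SumErr {
    double s;
    double r;
};

SumErr two_sum(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    double s = a + b;    // rounded sum
    double a1 = s - b;   // a' : the part of s that came from a
    double b1 = s - a1;  // b' : the part of s that came from b
    double da = a - a1;  // what was lost from a
    double db = b - b1;  // what was lost from b
    return SumErr{s, da + db};
}
```

For example, `two_sum(1.0, 1e-17)` keeps the small addend, which the rounded sum drops, in the error term `r`.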

twoProduct

(s, r) = twoProduct (a, b):

s = a · b

(a1, a2) = split (a)

(b1, b2) = split (b)

r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)

(x, y) = split (a):

factor = 2^s + 1

c = factor · a

x = c - (c - a)

y = a - x

(s, r) = twoProduct_fast (a, b):

s = a · b

r = fma(a, b, -s)
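
Both product variants can be sketched in host-side C++ as follows. The slides do not state the split exponent; this sketch assumes double precision and uses 27 (half of the 53-bit significand, rounded up), so the factor is 2^27 + 1. The struct and function names are illustrative. The sketch further assumes that no overflow or underflow occurs in the intermediate products.

```cpp
#include <cassert>
#include <cmath>

// Rounded product s and rounding error r, so that a * b == s + r exactly.
struct ProdErr {
    double s;
    double r;
};

// High and low part of a double, hi + lo == a.
struct Split {
    double hi;
    double lo;
};

Split split(double a) {
    const double factor = 134217729.0;  // 2^27 + 1 (assumed exponent for double)
    double c = factor * a;
    double x = c - (c - a);
    return Split{x, a - x};
}

// twoProduct without FMA: the partial products of the halves are exact.
ProdErr two_product(double a, double b) {
    double s = a * b;
    assert(std::isfinite(s));
    Split as = split(a);
    Split bs = split(b);
    double r = as.lo * bs.lo -
               (((s - as.hi * bs.hi) - as.lo * bs.hi) - as.hi * bs.lo);
    return ProdErr{s, r};
}

// twoProduct_fast: a fused multiply-add yields the error in one operation.
ProdErr two_product_fast(double a, double b) {
    double s = a * b;
    return ProdErr{s, std::fma(a, b, -s)};
}
```

In device code the same `fma` call maps to a hardware instruction, which is presumably why the slides call this variant fast.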

Long Accumulator

Array of 67 * 64 Bit (unsigned long long int)

32 Bit data

32 Bit overflow

2 accumulators

positive

negative

Long Accumulator

addFloatToAccuDev

Device function

Atomic add

Memory accesses

double: max. 3

float: max. 2

SIMT
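
The bound of two memory accesses for a float follows from the word layout: a 24-bit significand shifted to an arbitrary bit position can straddle at most two 32-bit data fields. The host-side C++ sketch below illustrates this for non-negative floats. The layout is an assumption, not the authors' code: 10 words suffice for float, bit 0 of word 0 stands for 2^-149, and all names are hypothetical. On the device each `+=` would be an atomic add; CUDA's `atomicAdd` has an `unsigned long long int` overload.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed layout: each 64-bit word holds 32 data bits (low half)
// plus 32 bits of headroom for deferred carries (high half).
constexpr int kWords = 10;     // enough for the float range in this sketch
constexpr int kMinExp = -149;  // exponent of the smallest subnormal float
using Accu = std::array<std::uint64_t, kWords>;

void add_float_to_accu(Accu& acc, float x) {
    assert(std::isfinite(x) && x >= 0.0f);
    if (x == 0.0f) return;
    int e = 0;
    float m = std::frexp(x, &e);  // x == m * 2^e with 0.5 <= m < 1
    // 24-bit integer significand: x == mant * 2^(e - 24)
    std::uint64_t mant = static_cast<std::uint64_t>(std::ldexp(m, 24));
    int pos = (e - 24) - kMinExp;  // bit position of mant's lowest bit
    if (pos < 0) {                 // subnormal input: the shifted-out bits are zero
        mant >>= -pos;
        pos = 0;
    }
    int word = pos / 32;
    int shift = pos % 32;
    std::uint64_t shifted = mant << shift;    // at most 55 bits
    acc[word] += shifted & 0xFFFFFFFFull;     // first memory access
    acc[word + 1] += shifted >> 32;           // second memory access
}
```

A double has a 53-bit significand, so the same reasoning gives at most three words, matching the numbers on the slide.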

Long Accumulator

addFloatsToAccuKernel

1 Allocate memory on device

2 Add all floats to the accumulators (positive and negative)

addFloatToAccuDev

3 Propagate carry bits

4 Positive accu + negative accu

5 Determine exactly rounded floating point number

Long Accumulator

Propagate carry bits

1100101010110101

1100101010110101

1100101101111111

0001000010000000

0001000101001011
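
Carry propagation can be sketched as a single sweep from the least significant word upwards. The host-side C++ sketch below assumes the layout described for the long accumulator (32 data bits in the low half of each 64-bit word, 32 overflow bits in the high half); the function name is hypothetical and the sweep is sequential, whereas a device implementation may organize it differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Moves the overflow half of every word into the next higher word.
// Returns the carry out of the most significant word (0 if the value fits).
std::uint64_t propagate_carries(std::vector<std::uint64_t>& words) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::uint64_t w = words[i] + carry;
        assert(w >= carry);            // 64-bit wraparound: headroom was exhausted
        words[i] = w & 0xFFFFFFFFull;  // keep the 32 data bits
        carry = w >> 32;               // overflow bits go to the next word
    }
    return carry;
}
```

After the sweep every word holds plain 32-bit data, so the positive and negative accumulators can be combined word by word.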

Long Accumulator

Exactly rounded floating point number

EFT accumulation

addArrayWithCuda

Compute sum of n floating point numbers

Iterative execution

Repeated kernel executions

Tree-like

Parameters

inputs

errors

maxExp

numErrNonZero
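
The tree-like iteration can be sketched on the host as repeated passes that each halve the number of partial sums; every pass stands in for one kernel launch, and every loop iteration for the work of one thread. This C++ sketch is an illustration under those assumptions, not the authors' kernel. The names `tree_sum`, `inputs` and `errors` follow the parameter names on the slide only loosely, and the `maxExp` and `numErrNonZero` bookkeeping is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums inputs pairwise in a tree, overwriting inputs with partial sums.
// Every non-zero twoSum rounding error is appended to errors, so that
// (return value) + sum(errors) equals the exact sum of the original inputs.
double tree_sum(std::vector<double>& inputs, std::vector<double>& errors) {
    assert(!inputs.empty());
    errors.clear();
    std::size_t n = inputs.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;
        // One pass; on the device each i would be handled by its own thread.
        for (std::size_t i = 0; i + half < n; ++i) {
            double a = inputs[i];
            double b = inputs[i + half];
            double s = a + b;
            double a1 = s - b;
            double b1 = s - a1;
            double r = (a - a1) + (b - b1);  // twoSum rounding error
            inputs[i] = s;
            if (r != 0.0) errors.push_back(r);
        }
        n = half;
    }
    return inputs[0];
}
```

For the input {1e100, 1.0, 1.0, -1e100} the floating point sum is 0, while the two lost ones survive in `errors`.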

EFT accumulation: error handling

errorFreeSum

1 Compute the sum S and store the result in inputs. Store the errors in errors.

2 Compute the sum E of all errors and add E to S. Store the new errors in errors.

3 Estimate the remaining error.

$\mathrm{numErrNonZero} \cdot 2^{\mathrm{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \mathrm{errors}$

4 Go to 2 if the remaining error has an influence on the sum.
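
The four steps can be sketched sequentially in host-side C++. The slides do not spell out when the remaining error "has influence"; this sketch assumes a simple rule: stop once the bound from step 3 is smaller than half the distance from S to its nearest neighbouring double. Exact ties never satisfy that rule, so a hypothetical `max_rounds` guard ends the loop and reports `converged == false`. All names are illustrative, and finite inputs without overflow are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct EfsResult {
    double sum;      // current floating point approximation S
    bool converged;  // true if the error bound cannot change the rounding of S
};

// Returns the floating point sum of v and overwrites v with the twoSum errors,
// so that (return value) + sum(v afterwards) == sum(v before) exactly.
double sum_pass(std::vector<double>& v) {
    double s = 0.0;
    for (double& x : v) {
        double t = s + x;
        double s1 = t - x;
        double x1 = t - s1;
        double r = (s - s1) + (x - x1);
        s = t;
        x = r;
    }
    return s;
}

EfsResult error_free_sum(std::vector<double> inputs, int max_rounds = 16) {
    assert(max_rounds > 0);
    double S = sum_pass(inputs);  // step 1: inputs now holds the errors
    std::vector<double>& errors = inputs;
    const double inf = std::numeric_limits<double>::infinity();
    for (int round = 0; round < max_rounds; ++round) {
        // Step 2: sum the errors, add E to S, keep the new errors.
        double E = sum_pass(errors);
        double t = S + E;
        double s1 = t - E;
        double e1 = t - s1;
        errors.push_back((S - s1) + (E - e1));
        S = t;
        // Step 3: bound the remaining error by numErrNonZero * 2^(maxExp+1).
        int num_err_non_zero = 0;
        int max_exp = std::numeric_limits<int>::min();
        for (double e : errors) {
            if (e != 0.0) {
                ++num_err_non_zero;
                max_exp = std::max(max_exp, std::ilogb(e));
            }
        }
        if (num_err_non_zero == 0) return EfsResult{S, true};
        double bound = num_err_non_zero * std::ldexp(1.0, max_exp + 1);
        // Step 4 (assumed rule): can the remaining error still move S?
        double gap = std::min(S - std::nextafter(S, -inf),
                              std::nextafter(S, inf) - S);
        if (bound < 0.5 * gap) return EfsResult{S, true};
    }
    return EfsResult{S, false};
}
```

Calling `error_free_sum({1.0, 1e100, 1.0, -1e100})` recovers the two ones that a plain left-to-right sum loses.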

Dot product

$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$

1 Compute the products $a_i \cdot b_i$ using

twoProduct

twoProduct_fast

2 Accumulate the products using

long accumulator

errorFreeSum
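
The two stages can be combined as in the following host-side C++ sketch. It computes each product with the FMA variant, but for brevity it swaps in a much simpler accumulation than either option on the slide: a single compensation term that collects the product errors and the twoSum errors. That gives extra accuracy but, unlike the long accumulator or errorFreeSum, no guarantee of a correctly rounded result. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product with one compensation term (not correctly rounded in general).
double dot_compensated(const std::vector<double>& a,
                       const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;  // running floating point sum of the products
    double c = 0.0;  // running sum of all rounding errors
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Stage 1: product and its exact rounding error (twoProduct_fast).
        double p = a[i] * b[i];
        double e = std::fma(a[i], b[i], -p);
        // Stage 2: add p to s with twoSum and remember the rounding error.
        double t = s + p;
        double s1 = t - p;
        double p1 = t - s1;
        double r = (s - s1) + (p - p1);
        s = t;
        c += e + r;
    }
    return s + c;
}
```

For the vectors (1 + 2^-30, -1) and (1 - 2^-30, 1) the exact result is -2^-60, which a plain floating point dot product rounds away to 0.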

Test System

Intel Core i5-2500K 3.30 GHz

NVIDIA GeForce GTX 760

Compute capability 3.0

1152 CUDA cores

Windows 7 Professional (64 Bit)

Inc Input const Threads: Float

512 Threads * 512 Blocks = 262144 Threads

Inc Input const Threads: Double

512 Threads * 512 Blocks = 262144 Threads

Inc Input inc Threads: Float

Threads / Input = 8

Inc Input inc Threads: Double

Threads / Input = 8

Inc pos Input inc Threads: Float

Inc pos Input inc Threads: Double

Inc small pos Input inc Threads: Float

Inc small pos Input inc Threads: Double

Dot Product: Float

Dot Product: Double

Dot Product Fast: Float

Dot Product Fast: Double

Questions?

Marco Nehmeier

nehmeier@informatik.uni-wuerzburg.de
