Introduction CUDA Error Free Transformations Implementation Experimental Results

Correctly Rounded Dot Product in CUDA

Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg

Chair of Computer Science II, Software Engineering, Universität Würzburg

Germany

SWIM 2014

Marco Nehmeier — Correctly Rounded Dot Product in CUDA 1/38

Outline

1 Introduction

2 CUDA

3 Error Free Transformations

4 Implementation

5 Experimental Results

Motivation: Computer Architecture

A change in the design of computer architecture

Single-core ⇒ multi-core

Circumvent physical constraints

Parallelize the computation

Motivation: Graphic Devices

GPGPU

Performance

Highly parallel

CUDA (Compute Unified Device Architecture)

NVIDIA GPUs

OpenCL (Open Computing Language)

Open standard

C-like programming language

CPUs, GPUs, and many others

CUDA: Compute Unified Device Architecture

GPU architecture for General Purpose Computing

PCI Express add-in boards

Highly parallel

Several streaming multiprocessors (SM)

Each consisting of several cores

Compute capability = classification

Usable like a many-core CPU

CUDA C

CUDA Thread Hierarchy

Grid = group of thread blocks

Thread block = group of threads running on one SM

Warp = 32 threads of a thread block

SIMT (Single Instruction Multiple Threads)

Fundamental unit

[Figure: a grid of 3 × 2 thread blocks, Block (0,0) to Block (2,1); Block (1,1) is expanded into 4 × 4 threads, Thread (0,0) to Thread (3,3)]
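
The index arithmetic implied by this hierarchy can be sketched on the host in plain C++. This is only an illustration under the assumption of a one-dimensional launch (the figure uses two-dimensional indices); `LaunchShape` and the function names are hypothetical. In CUDA C the same quantities come from the built-ins `blockIdx`, `blockDim` and `threadIdx`.

```cpp
#include <cstddef>

constexpr unsigned kWarpSize = 32;  // threads per warp, as stated on the slide

// Hypothetical description of a 1D kernel launch.
struct LaunchShape {
    unsigned blocks;           // thread blocks in the grid
    unsigned threadsPerBlock;  // threads in each block
};

// Total number of threads in the grid.
constexpr unsigned totalThreads(LaunchShape s) {
    return s.blocks * s.threadsPerBlock;
}

// Global linear index of a thread; mirrors
// blockIdx.x * blockDim.x + threadIdx.x in CUDA C.
constexpr unsigned globalThreadIndex(LaunchShape s, unsigned block, unsigned thread) {
    return block * s.threadsPerBlock + thread;
}

// Index of the warp (within its block) that a thread belongs to.
constexpr unsigned warpInBlock(unsigned thread) {
    return thread / kWarpSize;
}

// Number of warps needed to cover one block (rounded up).
constexpr unsigned warpsPerBlock(LaunchShape s) {
    return (s.threadsPerBlock + kWarpSize - 1) / kWarpSize;
}
```

With 512 blocks of 512 threads each, `totalThreads` gives 262144 threads and every block is covered by 16 warps.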

CUDA Memory Model

Memory    Access  Scope                 Lifetime
Register  R/W     1 thread              Thread
Local     R/W     1 thread              Thread
Shared    R/W     All threads in block  Block
Global    R/W     All threads + host    Host allocation
Constant  R       All threads + host    Host allocation
Texture   R       All threads + host    Host allocation

CUDA Processing flow

Error Free Transformations

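[Figure: two floating-point numbers f1 and f2; the exact sum a + b splits into the rounded sum s = fl(a + b) and the error r]

The error of the summation of two floating-point numbers is also a floating-point number

Computable by a simple algorithm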

(s, r) = twoSum(a, b):

s = a + b

a’ = s - b

b’ = s - a’

δa = a - a’

δb = b - b’

r = δa + δb
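
The twoSum steps above translate almost line by line into C++. This is a minimal host-side sketch for `double`, not the authors' device code; the variable names `a1`, `b1`, `da`, `db` stand in for a', b', δa, δb. It assumes the compiler does not reorder floating-point operations (no `-ffast-math`).

```cpp
#include <utility>

// Returns {s, r} with s = fl(a + b) and s + r == a + b exactly
// (barring overflow). Requires strict IEEE 754 evaluation, so do
// not compile with value-changing options such as -ffast-math.
std::pair<double, double> twoSum(double a, double b) {
    const double s  = a + b;    // rounded sum
    const double a1 = s - b;    // a' on the slide
    const double b1 = s - a1;   // b'
    const double da = a - a1;   // δa: part of a lost in s
    const double db = b - b1;   // δb: part of b lost in s
    return {s, da + db};        // r = δa + δb
}
```

For example, `twoSum(1.0, 1e-20)` returns the rounded sum `1.0` together with the error `1e-20`, which the plain addition would have discarded.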

twoProduct

(s, r) = twoProduct (a, b):

s = a · b

(a1, a2) = split (a)

(b1, b2) = split (b)

r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)

(x,y) = split (a):

factor = 2^s + 1 (here s is the number of bits to split off, not the product s above)

c = factor · a

x = c - (c - a)

y = a - x

(s, r) = twoProduct_fast (a, b):

s = a · b

r = fma(a, b, -s)
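
A host-side C++ sketch of the three routines above for `double`. The split constant 2^27 + 1 is an assumption (the usual choice for a 53-bit significand; the slide leaves s open), and the function names are adapted to C++ style. The `volatile` in `split` is only there to keep a compiler from fusing the product into the following subtraction.

```cpp
#include <cmath>
#include <utility>

// Splits a into x + y == a, where x carries the upper half of the
// significand. Assumes double precision and factor = 2^27 + 1.
std::pair<double, double> split(double a) {
    const double factor = 134217729.0;   // 2^27 + 1
    volatile double cv = factor * a;     // force a separately rounded product
    const double c = cv;
    const double x = c - (c - a);
    const double y = a - x;
    return {x, y};
}

// Returns {s, r} with s = fl(a * b) and s + r == a * b
// (barring overflow and underflow), without using an FMA.
std::pair<double, double> twoProduct(double a, double b) {
    const double s = a * b;
    const std::pair<double, double> sa = split(a);  // a1 = sa.first, a2 = sa.second
    const std::pair<double, double> sb = split(b);  // b1 = sb.first, b2 = sb.second
    const double r = sa.second * sb.second
        - (((s - sa.first * sb.first) - sa.second * sb.first) - sa.first * sb.second);
    return {s, r};
}

// Same result with a fused multiply-add: the FMA rounds a*b - s only once.
std::pair<double, double> twoProductFast(double a, double b) {
    const double s = a * b;
    const double r = std::fma(a, b, -s);
    return {s, r};
}
```

For a = b = 1 + 2^-30 the exact product is 1 + 2^-29 + 2^-60; both variants return s = 1 + 2^-29 and r = 2^-60.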

Long Accumulator

Array of 67 × 64 bit (unsigned long long int)

32 bit data

32 bit overflow

2 accumulators

positive

negative

Long Accumulator

addFloatToAccuDev

Device function

Atomic add

Memory accesses

double: max. 3

float: max. 2

SIMT
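
How one value might be added to such an accumulator can be sketched in host-side C++. This is not the authors' addFloatToAccuDev: the function name `addMagnitudeToAccu`, the convention that bit 0 of word 0 has weight 2^-1074, and the handling of the sign by the caller (choosing the positive or negative accumulator) are all assumptions. On the device each `+=` would be an atomic add on an `unsigned long long`.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kWords = 67;  // word count from the slide
// Each word: low 32 bits hold data, high 32 bits collect overflow.
using Accu = std::array<std::uint64_t, kWords>;

// Adds |x| to acc. Assumption: bit 0 of word 0 has weight 2^-1074,
// the smallest positive double. The caller picks the positive or
// negative accumulator according to the sign of x.
void addMagnitudeToAccu(Accu& acc, double x) {
    if (x == 0.0 || !std::isfinite(x)) return;
    int e = 0;
    const double m = std::frexp(std::fabs(x), &e);  // |x| = m * 2^e, 0.5 <= m < 1
    // 53-bit integer significand: |x| = mant * 2^(e - 53)
    std::uint64_t mant = static_cast<std::uint64_t>(std::ldexp(m, 53));
    int offset = e - 53 + 1074;  // accumulator bit of mant's lowest bit
    if (offset < 0) {            // subnormal input: low bits of mant are zero
        mant >>= -offset;
        offset = 0;
    }
    const int word  = offset / 32;
    const int shift = offset % 32;
    const std::uint64_t mask = 0xFFFFFFFFull;
    const std::uint64_t rest = mant >> (32 - shift);  // bits above the first word
    // A 53-bit significand touches at most three 32-bit data fields.
    acc[word]     += (mant << shift) & mask;
    acc[word + 1] += rest & mask;
    acc[word + 2] += rest >> 32;
}
```

Under these assumptions the value 1.0 lands on bit 1074 = 33 · 32 + 18, that is bit 18 of word 33, and the three updates correspond to the "double: max. 3" memory accesses on the slide.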

Long Accumulator

addFloatsToAccuKernel

1 Allocate memory on device

2 Add all floats to the accumulators (positive and negative)

addFloatToAccuDev

3 Propagate carry bits

4 Positive accu + negative accu

5 Determine exactly rounded floating point number

Long Accumulator

Propagate carry bits

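[Figure: carry propagation shown on 16-bit words. The upper half of the first word 1100101010110101 is added to the next word 1100101010110101, giving 1100101101111111; the upper half of that result is added to the following word 0001000010000000, giving 0001000101001011]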
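
The carry propagation step can be sketched as a sequential C++ loop. The function name `propagateCarries` and the `dataBits` parameter are hypothetical; the default of 32 matches the 32 bit data / 32 bit overflow layout of the accumulator, while 8 reproduces the 16-bit words of the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Moves the overflow part (bits above dataBits) of every word into the
// next higher word, lowest word first. Word 0 is the least significant.
// The topmost word keeps its overflow part. Requires 0 < dataBits < 64.
void propagateCarries(std::vector<std::uint64_t>& acc, unsigned dataBits = 32) {
    const std::uint64_t mask = (std::uint64_t{1} << dataBits) - 1;
    for (std::size_t i = 0; i + 1 < acc.size(); ++i) {
        acc[i + 1] += acc[i] >> dataBits;  // carry into the next word
        acc[i] &= mask;                    // keep only the data bits
    }
}
```

With the words from the illustration, {0xCAB5, 0xCAB5, 0x1080} and `dataBits = 8`, the second word passes through 0xCB7F before its carry is handed on, and the top word ends up as 0x114B.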

Long Accumulator

Exactly rounded floating point number

EFT accumulation

addArrayWithCuda

Compute sum of n floating point numbers

Iterative execution

Repetitive kernel executions

Tree-like

Parameters

inputs

errors

maxExp

numErrNonZero

EFT accumulation

EFT accumulation: error handling

errorFreeSum

1 Compute sum S and store result in inputs. Store errors in errors.

2 Compute the sum E of all errors and add E to S. Store errors in errors.

3 Estimate the remaining error.

$\mathit{numErrNonZero} \cdot 2^{\mathit{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \mathit{errors}$

4 Go to 2 if the remaining error can influence the sum.
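
The four steps can be sketched sequentially in C++. This is an illustration only: it sums in a chain rather than tree-like over repeated kernel launches, and the names `sumAndKeepErrors`, `errorFreeSumSketch` and `maxRounds` are hypothetical. The stopping test in step 4 is one possible reading of "can influence the sum" and does not by itself prove correct rounding in every case.

```cpp
#include <cmath>
#include <vector>

struct SumErr { double s, r; };

// twoSum: s = fl(a + b), s + r == a + b exactly.
SumErr twoSum(double a, double b) {
    const double s  = a + b;
    const double a1 = s - b;
    const double b1 = s - a1;
    return {s, (a - a1) + (b - b1)};
}

// Sums v with twoSum and overwrites each element with the
// rounding error of the addition that consumed it.
double sumAndKeepErrors(std::vector<double>& v) {
    double s = 0.0;
    for (double& x : v) {
        const SumErr t = twoSum(s, x);
        s = t.s;
        x = t.r;
    }
    return s;
}

double errorFreeSumSketch(std::vector<double> inputs, int maxRounds = 64) {
    double S = sumAndKeepErrors(inputs);   // step 1: inputs now holds the errors
    std::vector<double>& errors = inputs;
    for (int round = 0; round < maxRounds; ++round) {
        const double E = sumAndKeepErrors(errors);  // step 2: sum the errors
        const SumErr t = twoSum(S, E);              //         and add E to S
        S = t.s;
        errors.push_back(t.r);                      // error of S + E is kept too
        // Step 3: bound |sum of errors| by numErrNonZero * 2^(maxExp + 1).
        int numErrNonZero = 0;
        int maxExp = 0;
        for (double e : errors) {
            if (e == 0.0) continue;
            int ex = 0;
            std::frexp(e, &ex);                     // |e| < 2^ex
            if (numErrNonZero == 0 || ex - 1 > maxExp) maxExp = ex - 1;
            ++numErrNonZero;
        }
        if (numErrNonZero == 0) break;              // nothing left to add
        const double bound =
            std::ldexp(static_cast<double>(numErrNonZero), maxExp + 1);
        // Step 4: stop once the bound is too small to change S.
        if (S + bound == S && S - bound == S) break;
    }
    return S;
}
```

For the input {1e100, 1.0, -1e100} a plain left-to-right sum returns 0, while the sketch recovers the 1.0 from the stored errors in the first round.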

Dot product

$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$

1 Compute the products $a_i \cdot b_i$ using

twoProduct

twoProduct_fast

2 Accumulate the products using

long accumulator

errorFreeSum
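
How the two steps fit together can be sketched in sequential C++. Note the substitution: to stay short, the accumulation here is a simple compensated sum (in the style of the Dot2 algorithm of Ogita, Rump and Oishi) standing in for the long accumulator or errorFreeSum, so the result is more accurate than a naive loop but not guaranteed to be correctly rounded. The name `dotCompensated` is hypothetical, and the code assumes no floating-point contraction or reordering by the compiler (for example `-ffp-contract=off`, no `-ffast-math`).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product with error-free products and a compensated sum.
// Stand-in for the accumulation step: NOT a long accumulator and
// not guaranteed to return the correctly rounded result.
double dotCompensated(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double sum  = 0.0;  // running rounded sum of the products
    double comp = 0.0;  // running sum of all rounding errors
    for (std::size_t i = 0; i < n; ++i) {
        // Step 1: twoProduct_fast, p + pErr == a[i] * b[i]
        const double p    = a[i] * b[i];
        const double pErr = std::fma(a[i], b[i], -p);
        // Step 2: twoSum(sum, p), s + sErr == sum + p
        const double s    = sum + p;
        const double a1   = s - p;
        const double b1   = s - a1;
        const double sErr = (sum - a1) + (p - b1);
        sum = s;
        comp += pErr + sErr;
    }
    return sum + comp;
}
```

For a = (1e8, 1, -1e8) and b = (1e8, 1, 1e8) the naive loop loses the middle term and returns 0, whereas the sketch returns 1.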

Test System

Intel Core i5-2500K 3.30 GHz

NVIDIA GeForce GTX 760

Compute capability 3.0

1152 CUDA cores

Windows 7 Professional (64 Bit)

Inc Input const Threads: Float

512 Threads * 512 Blocks = 262144 Threads

Inc Input const Threads: Double

512 Threads * 512 Blocks = 262144 Threads

Inc Input inc Threads: Float

Threads / Input = 8

Inc Input inc Threads: Double

Threads / Input = 8

Inc pos Input inc Threads: Float

Inc pos Input inc Threads: Double

Inc small pos Input inc Threads: Float

Inc small pos Input inc Threads: Double

Dot Product: Float

Dot Product: Double

Dot Product Fast: Float

Dot Product Fast: Double

Questions?

Marco Nehmeier