

Correctly Rounded Dot Product in CUDA

Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg

Chair of Computer Science II, Software Engineering, Universität Würzburg

Germany

SWIM 2014


Outline

1 Introduction

2 CUDA

3 Error Free Transformations

4 Implementation

5 Experimental Results

1 Introduction

Motivation: Computer Architecture

A change in the design of computer architecture

Single-core ⇒ multi-core

Circumvent physical constraints

Parallelize the computation

Motivation: Graphics Devices

Performance

Highly parallel

CUDA (Compute Unified Device Architecture)

NVIDIA GPUs

OpenCL (Open Computing Language)

Open standard

C-like programming language

CPUs, GPUs, and many other devices

2 CUDA

CUDA: Compute Unified Device Architecture

GPU architecture for General Purpose Computing

PCI Express add-in boards

Highly parallel

Several streaming multiprocessors (SM)

Each consisting of several cores

Compute capability = classification of the hardware feature set

Usable like a many-core CPU

CUDA C

CUDA Thread Hierarchy

Grid = group of thread blocks

Thread block = group of threads running on one SM

Warp = 32 threads of a thread block

SIMT (Single Instruction, Multiple Threads)

Fundamental unit

[Figure: a grid of 3 × 2 thread blocks, Block (0,0) to Block (2,1); Block (1,1) is expanded into 4 × 4 threads, Thread (0,0) to Thread (3,3)]
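The index arithmetic behind this hierarchy can be illustrated on the host. The following is a minimal sketch in plain C++, not CUDA code: `Index2D` and all function names are hypothetical stand-ins for the built-in `blockIdx`, `blockDim`, `gridDim` and `threadIdx` variables that a CUDA kernel would read, and a row-major linearization (x varies fastest) is assumed.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's built-in 2D index variables.
struct Index2D { unsigned x, y; };

constexpr unsigned kWarpSize = 32;  // warp size as stated on the slide

// Linear index of a thread inside its block (x varies fastest).
constexpr unsigned thread_in_block(Index2D thread, Index2D block_dim) {
    return thread.y * block_dim.x + thread.x;
}

// Linear index of a block inside the grid.
constexpr unsigned block_in_grid(Index2D block, Index2D grid_dim) {
    return block.y * grid_dim.x + block.x;
}

// Globally unique linear thread index across the whole grid.
constexpr unsigned global_thread(Index2D block, Index2D grid_dim,
                                 Index2D thread, Index2D block_dim) {
    return block_in_grid(block, grid_dim) * (block_dim.x * block_dim.y)
         + thread_in_block(thread, block_dim);
}

// Index of the warp (group of 32 consecutive threads) within the block.
constexpr unsigned warp_in_block(Index2D thread, Index2D block_dim) {
    return thread_in_block(thread, block_dim) / kWarpSize;
}
```

For the layout in the figure (3 × 2 blocks of 4 × 4 threads), Thread (2,3) of Block (1,1) would get the linear index 4 · 16 + 14 = 78 under these assumptions.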

CUDA Memory Model

Memory    Access  Scope                 Lifetime
Register  R/W     1 thread              Thread
Local     R/W     1 thread              Thread
Shared    R/W     All threads in block  Block
Global    R/W     All threads + host    Host allocation
Constant  R       All threads + host    Host allocation
Texture   R       All threads + host    Host allocation

CUDA Processing flow

3 Error Free Transformations

Error Free Transformations

s = fl(a + b)

The error of the sum of two floating point numbers is also a floating point number

Computable by a simple algorithm

(s, r) = twoSum(a, b):

s = a + b

a’ = s - b

b’ = s - a’

δa = a - a’

δb = b - b’

r = δa + δb
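The twoSum steps above translate almost line by line into code. This is a minimal host-side sketch in C++ for `double`; the struct `SumErr` and the function name `two_sum` are illustrative choices, not names from the talk. It assumes round-to-nearest arithmetic and a compiler that does not reorder floating point operations.

```cpp
#include <cassert>
#include <cmath>

// Result pair: s is the rounded sum, r the rounding error, so a + b == s + r exactly.
struct SumErr { double s; double r; };

// Branch-free twoSum, transcribed from the slide.
// Compile without value-changing optimizations such as -ffast-math,
// otherwise the compiler may simplify the error terms to zero.
inline SumErr two_sum(double a, double b) {
    double s  = a + b;     // s  = fl(a + b)
    double a1 = s - b;     // a' = s - b
    double b1 = s - a1;    // b' = s - a'
    double da = a - a1;    // δa = a - a'
    double db = b - b1;    // δb = b - b'
    return {s, da + db};   // r  = δa + δb
}
```

For example, `two_sum(1.0, 1e-20)` keeps `s == 1.0` and recovers the lost addend in `r`.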

twoProduct

(s, r) = twoProduct (a, b):

s = a · b

(a1, a2) = split (a)

(b1, b2) = split (b)

r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)

(x,y) = split (a):

factor = 2^s + 1

c = factor · a

x = c - (c - a)

y = a - x
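A minimal C++ sketch of twoProduct with the splitting step, for `double`. It assumes the usual choice for the split exponent, s = 27 for the 53-bit significand of binary64, so factor = 2^27 + 1; the slide does not state this value. The names `Halves`, `ProdErr`, `split` and `two_product` are illustrative, and overflow and underflow are ignored.

```cpp
#include <cassert>
#include <cmath>

struct Halves  { double hi, lo; };  // a == hi + lo, each half short enough to multiply exactly
struct ProdErr { double s, r; };    // a * b == s + r exactly (barring overflow/underflow)

// Splitting as on the slide; factor = 2^27 + 1 is an assumption for binary64.
inline Halves split(double a) {
    const double factor = 134217729.0;  // 2^27 + 1
    double c = factor * a;
    double x = c - (c - a);
    double y = a - x;
    return {x, y};
}

// twoProduct as on the slide, with (a1, a2) = (hi, lo).
inline ProdErr two_product(double a, double b) {
    double s = a * b;
    Halves as = split(a);
    Halves bs = split(b);
    double r = as.lo * bs.lo
             - (((s - as.hi * bs.hi) - as.lo * bs.hi) - as.hi * bs.lo);
    return {s, r};
}
```

With a = 1 + 2^-30, the exact square is 1 + 2^-29 + 2^-60; the sketch returns s = 1 + 2^-29 and r = 2^-60.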

(s, r) = twoProduct_fast (a, b):

s = a · b

r = fma(a, b, -s)
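The fused multiply-add variant is a two-liner. This sketch uses `std::fma` from the C++ standard library on the host; in CUDA device code the corresponding call would be `fma` or the `__fma_rn` intrinsic. The names `FastProd` and `two_product_fast` are illustrative.

```cpp
#include <cassert>
#include <cmath>

struct FastProd { double s, r; };  // a * b == s + r exactly (barring overflow/underflow)

// twoProduct_fast: the FMA computes a * b - s with a single rounding,
// and since a * b - s is representable, that result is the exact error.
inline FastProd two_product_fast(double a, double b) {
    double s = a * b;
    double r = std::fma(a, b, -s);
    return {s, r};
}
```

Compared with the splitting version, this needs 2 floating point operations instead of 17, which is why it is attractive wherever a hardware FMA is available.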

4 Implementation

Long Accumulator

Array of 67 * 64 Bit (unsigned long long int)

32 Bit data

32 Bit overflow

2 accumulators

positive

negative

Long Accumulator

addFloatToAccuDev

Device function

Atomic add

Memory accesses

double: max. 3

float: max. 2
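Why a float touches at most two accumulator words can be seen in a small host-side sketch. Everything here is an assumption for illustration: the name `add_float_to_accu`, a 9-word accumulator sized for single `float` values only (the talk uses 67 words), bit 0 weighted 2^-149, and a plain `+=` where the device function would use an atomic add such as `atomicAdd` on `unsigned long long`. The caller is assumed to pick the positive or negative accumulator and pass the magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kMinExp = -149;  // assumed weight of accumulator bit 0: 2^-149
constexpr int kWords  = 9;     // assumed size: 9 * 32 = 288 data bits cover every finite float

// Adds a finite, non-negative float to the accumulator.
// Each 64-bit word holds 32 data bits in its low half; the high half collects overflow.
// Returns the number of words that were modified (at most 2).
inline int add_float_to_accu(std::uint64_t* accu, float x) {
    assert(std::isfinite(x) && x >= 0.0f);
    if (x == 0.0f) return 0;

    int e = 0;
    float frac = std::frexp(x, &e);  // x = frac * 2^e with 0.5 <= frac < 1
    std::uint64_t m = static_cast<std::uint64_t>(std::ldexp(frac, 24));  // integer significand < 2^24
    int bit = (e - 24) - kMinExp;    // accumulator position of m's lowest bit
    if (bit < 0) {                   // subnormal input: the low bits of m are zero
        m >>= -bit;
        bit = 0;
    }

    int word = bit / 32;
    std::uint64_t shifted = m << (bit % 32);   // at most 24 + 31 = 55 bits
    std::uint64_t lo = shifted & 0xFFFFFFFFull;
    std::uint64_t hi = shifted >> 32;

    int touched = 0;
    if (lo != 0) { accu[word]     += lo; ++touched; }  // device code: atomic add
    if (hi != 0) { accu[word + 1] += hi; ++touched; }  // device code: atomic add
    return touched;
}
```

A 24-bit significand shifted by up to 31 bits spans at most two 32-bit data fields; by the same argument a 53-bit `double` significand spans at most three.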

Long Accumulator

addFloatsToAccuKernel

1 Allocate memory on device

2 Add all floats to the accumulators (positive and negative)

addFloatToAccuDev

3 Propagate carry bits

4 Positive accu + negative accu

5 Determine exactly rounded floating point number

Long Accumulator

Propagate carry bits

1100101010110101

1100101101111111

0001000010000000

0001000101001011
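Carry propagation can be sketched sequentially on the host. This assumes the layout from the earlier slide (32 data bits in the low half of each 64-bit word, accumulated overflow in the high half); the function name `propagate_carries` is hypothetical, and the kernel in the talk may organize this step differently, for example in parallel passes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Moves the overflow bits of each word into the next higher word, so that
// afterwards every word holds only its 32 data bits.
// Returns the carry out of the most significant word (0 if nothing overflowed).
inline std::uint64_t propagate_carries(std::uint64_t* accu, std::size_t n) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t v = accu[i] + carry;  // assumes this sum still fits in 64 bits
        accu[i] = v & 0xFFFFFFFFull;        // keep the 32 data bits
        carry   = v >> 32;                  // pass the overflow upwards
    }
    return carry;
}
```

The 32 overflow bits allow roughly 2^32 additions into one word before a propagation step becomes mandatory.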

Long Accumulator

Exactly rounded floating point number

EFT accumulation

addArrayWithCuda

Compute sum of n floating point numbers

Iterative execution

Repetitive kernel executions

Tree-like

Parameters

inputs

errors

maxExp

numErrNonZero

EFT accumulation

EFT accumulation: error handling

errorFreeSum

1 Compute the sum S and store the result in inputs. Store the errors in errors.

2 Compute the sum E of all errors and add E to S. Store the new errors in errors.

3 Estimate the remaining error:

$\mathit{numErrNonZero} \cdot 2^{\mathit{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \mathit{errors}$

4 Go to step 2 if the remaining error can influence the sum.
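The four steps can be sketched as a sequential host-side loop. This is only an illustration under several assumptions: the names `sum_pass` and `error_free_sum_sketch` are hypothetical, the summation is a left-to-right pass instead of the tree-like kernel executions, inputs are finite, and "has influence on the sum" is approximated by checking whether adding or subtracting the bound changes S. That test is simpler than what a guaranteed correct rounding would need near rounding ties.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cmath>
#include <vector>

// twoSum: returns fl(a + b) and writes the exact rounding error to r.
inline double two_sum(double a, double b, double& r) {
    double s  = a + b;
    double a1 = s - b;
    double b1 = s - a1;
    r = (a - a1) + (b - b1);
    return s;
}

// Sums v left to right and overwrites each element with the error of its addition.
inline double sum_pass(std::vector<double>& v) {
    double s = 0.0;
    for (double& x : v) {
        double r;
        s = two_sum(s, x, r);
        x = r;
    }
    return s;
}

inline double error_free_sum_sketch(std::vector<double> inputs, int max_rounds = 64) {
    assert(max_rounds > 0);
    double S = sum_pass(inputs);            // step 1: inputs now holds the errors
    std::vector<double>& errors = inputs;
    for (int round = 0; round < max_rounds; ++round) {
        double E = sum_pass(errors);        // step 2: sum the errors ...
        double r;
        S = two_sum(S, E, r);               // ... add E to S ...
        errors.push_back(r);                // ... and keep the new error as well

        int num_err_non_zero = 0;           // step 3: bound the remaining error
        int max_exp = INT_MIN;
        for (double e : errors) {
            if (e != 0.0) {
                ++num_err_non_zero;
                max_exp = std::max(max_exp, std::ilogb(e));
            }
        }
        if (num_err_non_zero == 0) break;   // nothing left: S is exact
        double bound = std::ldexp(static_cast<double>(num_err_non_zero), max_exp + 1);
        if (S + bound == S && S - bound == S) break;  // step 4 (simplified test)
    }
    return S;
}
```

For the input {1e100, 1.0, -1e100} a naive left-to-right sum returns 0, while the sketch recovers 1.0 after one round of error summation.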

Dot product

$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$

1 Compute the products $a_i \cdot b_i$ using

twoProduct

twoProduct_fast

2 Accumulate the products using

long accumulator

errorFreeSum
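How the two stages fit together can be shown in a compact host-side sketch. To stay short it swaps in a different accumulation technique: a compensated sum in the style of the Dot2 algorithm of Ogita, Rump and Oishi replaces the long accumulator and errorFreeSum. The result is therefore usually far more accurate than a naive loop, but it is not guaranteed to be correctly rounded. The name `dot_product_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compensated dot product (Dot2 style); NOT the correctly rounded method of the talk.
// Compile without -ffast-math so the error terms are not optimized away.
inline double dot_product_sketch(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;  // running rounded sum of the products
    double c = 0.0;  // running sum of all captured rounding errors
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Step 1: twoProduct_fast, p + pe == a[i] * b[i] exactly.
        double p  = a[i] * b[i];
        double pe = std::fma(a[i], b[i], -p);

        // Step 2: twoSum, t + se == s + p exactly.
        double t  = s + p;
        double a1 = t - p;
        double b1 = t - a1;
        double se = (s - a1) + (p - b1);

        s = t;
        c += pe + se;  // the errors themselves are accumulated in ordinary arithmetic
    }
    return s + c;
}
```

With x = 1 + 2^-30, the vectors a = (x, -1) and b = (x, 1 + 2^-29) have the exact dot product 2^-60; the rounded products cancel completely, and only the captured product error survives in the result.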

5 Experimental Results

Test System

Intel Core i5-2500K 3.30 GHz

NVIDIA Geforce GTX 760

Compute capability 3.0

1152 CUDA cores

Windows 7 Professional (64 Bit)

Inc Input const Threads: Float

512 Threads * 512 Blocks = 262144 Threads

Inc Input const Threads: Double

512 Threads * 512 Blocks = 262144 Threads

Inc Input inc Threads: Float

Threads / Input = 8

Inc Input inc Threads: Double

Threads / Input = 8

Inc pos Input inc Threads: Float

Inc pos Input inc Threads: Double

Inc small pos Input inc Threads: Float

Inc small pos Input inc Threads: Double

Dot Product: Float

Dot Product: Double

Dot Product Fast: Float

Dot Product Fast: Double

Questions?

Marco Nehmeier

nehmeier@informatik.uni-wuerzburg.de
