Introduction CUDA Error Free Transformations Implementation Experimental Results
Correctly Rounded Dot Product in CUDA
Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg
Chair of Computer Science II, Software Engineering
Universität Würzburg
Germany
SWIM 2014
Marco Nehmeier — Correctly Rounded Dot Product in CUDA 1/38
Outline
1 Introduction
2 CUDA
3 Error Free Transformations
4 Implementation
5 Experimental Results
Motivation: Computer Architecture
A change in the design of computer architecture
  Single-core ⇒ multi-core
  Circumvent physical constraints
  Parallelize the computation
Motivation: Graphic Devices
GPGPU
  Performance
  Highly parallel
CUDA (Compute Unified Device Architecture)
  NVIDIA GPUs
OpenCL (Open Computing Language)
  Open standard
  C-like programming language
  CPUs, GPUs, and many others
CUDA: Compute Unified Device Architecture
GPU architecture for General Purpose Computing
PCI Express add-in boards
Highly parallel
  Several streaming multiprocessors (SM)
  Each consisting of several cores
Compute capability = classification
Usable like a many-core CPU
CUDA C
CUDA Thread Hierarchy
Grid = group of thread blocks
Thread block = group of threads running on one SM
Warp = 32 threads of a thread block
  SIMT (Single Instruction, Multiple Threads)
  Fundamental unit
[Figure: a grid of 3 x 2 thread blocks, Block(0,0) to Block(2,1); Block(1,1) is expanded into 4 x 4 threads, Thread(0,0) to Thread(3,3)]
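The hierarchy in the figure can be mirrored on the host to see how a thread derives its position in the grid and its warp. This is a plain C++ sketch, not device code: the type `Dim2` and the functions `globalCoord` and `warpOf` are hypothetical names chosen for illustration. In CUDA C the corresponding values come from the built-in variables `blockIdx`, `blockDim`, and `threadIdx`.

```cpp
#include <cassert>

// Hypothetical 2D index type for this sketch; CUDA provides its own built-ins.
struct Dim2 { unsigned x, y; };

// Global coordinate of a thread inside the grid:
// block index * block size + thread index, per dimension.
inline Dim2 globalCoord(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return Dim2{blockIdx.x * blockDim.x + threadIdx.x,
                blockIdx.y * blockDim.y + threadIdx.y};
}

// Warp of a thread within its block: threads are linearized row by row
// (x fastest) and grouped into chunks of 32.
inline unsigned warpOf(Dim2 blockDim, Dim2 threadIdx) {
    const unsigned linear = threadIdx.y * blockDim.x + threadIdx.x;
    return linear / 32u;
}
```

For the figure's Block(1,1) with 4 x 4 threads, Thread(2,3) maps to the global coordinate (6,7), and all 16 threads of the block fall into warp 0.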
CUDA Memory Model
Memory    Access  Scope                  Lifetime
Register  R/W     1 thread               Thread
Local     R/W     1 thread               Thread
Shared    R/W     All threads in block   Block
Global    R/W     All threads + host     Host allocation
Constant  R       All threads + host     Host allocation
Texture   R       All threads + host     Host allocation
CUDA Processing flow
Error Free Transformations
[Figure: the exact sum a + b of two floating point numbers, split into the rounded sum s = fl(a + b) and the remainder r]
Error of the summation of two floating point numbers is also a floating point number
Computable by a simple algorithm
(s, r) = twoSum(a, b):
s = a + b
a’ = s - b
b’ = s - a’
δa = a - a’
δb = b - b’
r = δa + δb
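The twoSum pseudocode translates almost line by line into C++. The sketch below assumes IEEE 754 double precision with round-to-nearest, no overflow, and a compiler that does not reorder floating point operations (no fast-math); the function name and the `std::pair` return type are choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Returns (s, r) with s = fl(a + b) and s + r == a + b exactly.
inline std::pair<double, double> twoSum(double a, double b) {
    const double s  = a + b;
    const double a1 = s - b;    // a' on the slide
    const double b1 = s - a1;   // b'
    const double da = a - a1;   // delta a
    const double db = b - b1;   // delta b
    return {s, da + db};
}
```

For example, `twoSum(1e100, 1.0)` yields `s = 1e100` and `r = 1.0`: the summand that is lost in the rounded sum is recovered in the error term.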
twoProduct
(s, r) = twoProduct (a, b):
s = a · b
(a1, a2) = split (a)
(b1, b2) = split (b)
r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)
(x,y) = split (a):
factor = 2^s + 1   (here s = ⌈p/2⌉ for a p-bit significand, not the product s above)
c = factor · a
x = c - (c - a)
y = a - x
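A minimal C++ sketch of split and twoProduct for double precision follows. It assumes a 53-bit significand, so the split exponent is 27 and the factor is 2^27 + 1; it also assumes that no intermediate product overflows or underflows and that the compiler neither reorders nor fuses the floating point operations in `split`.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Splits a into a high part x and a low part y with a == x + y exactly.
// Each part is short enough that products of two parts are exact.
inline std::pair<double, double> split(double a) {
    const double factor = 134217729.0;   // 2^27 + 1 for IEEE 754 double
    const double c = factor * a;
    const double x = c - (c - a);        // high part
    const double y = a - x;              // low part
    return {x, y};
}

// Returns (s, r) with s = fl(a * b) and s + r == a * b exactly.
inline std::pair<double, double> twoProduct(double a, double b) {
    const double s = a * b;
    const std::pair<double, double> as = split(a);   // (a1, a2)
    const std::pair<double, double> bs = split(b);   // (b1, b2)
    const double r = as.second * bs.second
        - (((s - as.first * bs.first) - as.second * bs.first)
           - as.first * bs.second);
    return {s, r};
}
```

With a = 1 + 2^-30 and b = 1 - 2^-30 the exact product is 1 - 2^-60, which rounds to s = 1; the sketch returns the missing part as r = -2^-60.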
(s, r) = twoProduct_fast (a, b):
s = a · b
r = fma(a, b, -s)
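The fast variant relies on a fused multiply-add, which computes a · b - s with a single rounding. A host-side C++ sketch using `std::fma` is shown below; the function name `twoProductFast` is chosen for this example, and the sketch assumes no overflow or underflow. In device code the CUDA math function `fma` would take the place of `std::fma`.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Returns (s, r) with s = fl(a * b) and s + r == a * b exactly.
inline std::pair<double, double> twoProductFast(double a, double b) {
    const double s = a * b;
    const double r = std::fma(a, b, -s);   // exact a * b - s, rounded once
    return {s, r};
}
```

Compared with the split-based version, this needs two floating point operations instead of seventeen, which is why it is attractive wherever hardware FMA is available.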
Long Accumulator
Array of 67 * 64 Bit (unsigned long long int)
  32 Bit data
  32 Bit overflow
2 accumulators
  positive
  negative
Long Accumulator
addFloatToAccuDev
Device function
Atomic add
Memory accesses
  double: max. 3
  float: max. 2
SIMT
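To illustrate why a double touches at most three accumulator words, here is a host-side C++ sketch of adding one value. It is not the authors' device function: the name `addToAccu`, the constant `kBias`, and the assumption that bit 0 of the accumulator has weight 2^-1074 are choices made for this example, and the plain `+=` stands in for the atomic add used on the device.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kWords = 67;   // 67 words of 64 bits, as on the slide
constexpr int kBias  = 1074; // assumption: accumulator bit 0 has weight 2^-1074
using LongAccu = std::array<std::uint64_t, kWords>;

// Adds a finite, non-negative double exactly. Only the lower 32 bits of each
// word carry data; the upper 32 bits collect carries until they are propagated.
inline void addToAccu(LongAccu& acc, double x) {
    assert(std::isfinite(x) && x >= 0.0);
    if (x == 0.0) return;
    int e = 0;
    const double m = std::frexp(x, &e);            // x = m * 2^e, 0.5 <= m < 1
    std::uint64_t mant = static_cast<std::uint64_t>(std::ldexp(m, 53));
    int pos = e - 53 + kBias;                      // accumulator bit of mant's bit 0
    if (pos < 0) { mant >>= -pos; pos = 0; }       // subnormal: shifted-out bits are zero
    const int word  = pos / 32;
    const int shift = pos % 32;
    const std::uint64_t lo = (mant & 0xFFFFFFFFull) << shift;  // below 2^63
    const std::uint64_t hi = (mant >> 32) << shift;            // below 2^52
    acc[word]     += lo & 0xFFFFFFFFull;
    acc[word + 1] += (lo >> 32) + (hi & 0xFFFFFFFFull);
    acc[word + 2] += hi >> 32;
}
```

A 53-bit significand shifted by up to 31 bits spans at most 84 bits, hence three 32-bit data fields. A float has a 24-bit significand, so 55 bits and two words suffice, which matches the limits on the slide.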
Long Accumulator
addFloatsToAccuKernel
1 Allocate memory on device
2 Add all floats to the accumulators (positive and negative)
addFloatToAccuDev
3 Propagate carry bits
4 Positive accu + negative accu
5 Determine exactly rounded floating point number
Long Accumulator
Propagate carry bits
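[Figure: words shown as overflow bits | data bits; the overflow bits of each word are added to the next higher word]
11001010|10110101 + 11001010 = 11001011|01111111
00010000|10000000 + 11001011 = 00010001|01001011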
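Carry propagation can be sketched as a single pass from the lowest to the highest word. The C++ version below is a sequential host-side illustration with 32 data bits and 32 overflow bits per word; the function name `propagateCarries` is hypothetical, and how the work is distributed across threads on the device is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Moves the overflow half (upper 32 bits) of every word into the next higher
// word, so that afterwards each word holds only its 32 data bits.
template <std::size_t N>
void propagateCarries(std::array<std::uint64_t, N>& acc) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const std::uint64_t v = acc[i] + carry;
        acc[i] = v & 0xFFFFFFFFull;   // keep the data bits
        carry  = v >> 32;             // hand the overflow bits upwards
    }
    assert(carry == 0 && "accumulator overflow");
}
```

After this pass the words form one long binary number, which is the precondition for subtracting the negative accumulator and for rounding.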
Long Accumulator
Exactly rounded floating point number
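One way to obtain the rounded result from a carry-free, non-negative accumulator is to take the 53 leading bits and decide the rounding from the next bit and the remaining "sticky" bits. The C++ sketch below is an illustration under assumptions, not the authors' code: the names `bitAt` and `roundToNearest` are hypothetical, bit 0 is assumed to have weight 2^-1074, and every word is assumed to hold only its 32 data bits.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 67;
constexpr int kBias = 1074;   // assumption: accumulator bit 0 has weight 2^-1074
using LongAccu = std::array<std::uint64_t, kWords>;

// Bit i of the long number; bits below position 0 read as zero.
inline bool bitAt(const LongAccu& acc, int i) {
    return i >= 0 && ((acc[static_cast<std::size_t>(i) / 32] >> (i % 32)) & 1u) != 0;
}

// Rounds the accumulator to the nearest double, ties to even.
inline double roundToNearest(const LongAccu& acc) {
    int high = static_cast<int>(kWords) * 32 - 1;
    while (high >= 0 && !bitAt(acc, high)) --high;        // highest set bit
    if (high < 0) return 0.0;
    const int low = high > 52 ? high - 52 : 0;            // lowest of the 53 kept bits
    std::uint64_t mant = 0;
    for (int i = high; i >= low; --i) mant = (mant << 1) | (bitAt(acc, i) ? 1u : 0u);
    const bool roundBit = bitAt(acc, low - 1);
    bool sticky = false;
    for (int i = low - 2; i >= 0 && !sticky; --i) sticky = bitAt(acc, i);
    if (roundBit && (sticky || (mant & 1u) != 0)) ++mant; // round up, ties to even
    return std::ldexp(static_cast<double>(mant), low - kBias);
}
```

Because the accumulator holds the exact sum, this single rounding step is the only place where precision is lost.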
EFT accumulation
addArrayWithCuda
Compute sum of n floating point numbers
Iterative execution
  Repetitive kernel executions
  Tree-like
Parameters
inputs
errors
maxExp
numErrNonZero
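The tree-like scheme can be sketched on the host as a pairwise reduction in which every loop iteration plays the role of one kernel execution. The function `treeSum` below is a hypothetical stand-in for illustration, not the signature of addArrayWithCuda; it returns the floating point sum and collects the nonzero rounding errors, so that the returned sum plus all collected errors equals the exact sum of the inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// twoSum as in the Error Free Transformations section.
inline std::pair<double, double> twoSum(double a, double b) {
    const double s  = a + b;
    const double a1 = s - b;
    const double b1 = s - a1;
    return {s, (a - a1) + (b - b1)};
}

// Pairwise reduction; appends every nonzero rounding error to `errors`.
inline double treeSum(std::vector<double> values, std::vector<double>& errors) {
    if (values.empty()) return 0.0;
    while (values.size() > 1) {                 // one pass ~ one kernel execution
        std::size_t out = 0;
        for (std::size_t i = 0; i + 1 < values.size(); i += 2) {
            const std::pair<double, double> sr = twoSum(values[i], values[i + 1]);
            values[out++] = sr.first;
            if (sr.second != 0.0) errors.push_back(sr.second);
        }
        if (values.size() % 2 == 1) values[out++] = values.back();  // odd element passes through
        values.resize(out);
    }
    return values[0];
}
```

For the inputs 1e100, 1.0, -1e100 the floating point sum is 0, but the error list contains 1.0, so no information is lost.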
EFT accumulation
EFT accumulation: error handling
errorFreeSum
1 Compute the sum S and store the result in inputs. Store the errors in errors.
2 Compute the sum E of all errors and add E to S. Store the errors in errors.
3 Estimate the remaining error:
$\text{numErrNonZero} \cdot 2^{\text{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \text{errors}$
4 Go to 2 if the remaining error can still influence the sum.
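The estimate in step 3 can be sketched as follows. The slide does not define maxExp; this example assumes it is the largest binary exponent among the nonzero errors, so that every |e_i| is below 2^(maxExp+1). The function name `remainingErrorBound` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Upper bound for |sum of all errors|: numErrNonZero * 2^(maxExp + 1).
inline double remainingErrorBound(const std::vector<double>& errors) {
    int numErrNonZero = 0;
    int maxExp = 0;
    for (double e : errors) {
        if (e == 0.0) continue;
        const int ex = std::ilogb(e);            // floor(log2(|e|))
        if (numErrNonZero == 0 || ex > maxExp) maxExp = ex;
        ++numErrNonZero;
    }
    if (numErrNonZero == 0) return 0.0;          // nothing left: S is exact
    return numErrNonZero * std::ldexp(1.0, maxExp + 1);
}
```

For the errors 0.75, -0.5, 0, 3.0 the bound is 3 · 2^2 = 12, while the actual sum is 3.25. The bound is cheap because it needs only a count and a maximum, both of which fit a parallel reduction.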
Dot product
$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$
1 Compute the products a_i · b_i using
  twoProduct
  twoProduct_fast
2 Accumulate the products using
  long accumulator
  errorFreeSum
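Step 1 turns a dot product of length n into 2n summands whose exact sum equals the exact dot product; step 2 then only has to sum them without error. The C++ sketch below covers step 1 with the FMA-based product. The function name `dotSummands` is hypothetical, and the sketch assumes that no product overflows or underflows.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns s_0, r_0, s_1, r_1, ... with s_i + r_i == a[i] * b[i] exactly.
inline std::vector<double> dotSummands(const std::vector<double>& a,
                                       const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out;
    out.reserve(2 * a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double s = a[i] * b[i];
        const double r = std::fma(a[i], b[i], -s);   // rounding error of the product
        out.push_back(s);
        out.push_back(r);
    }
    return out;
}
```

For a = (1 + 2^-30, 1) and b = (1 - 2^-30, -1) the rounded products cancel to 0, whereas the summands 1, -2^-60, -1, 0 still carry the exact result -2^-60.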
Test System
Intel Core i5-2500K 3.30 GHz
NVIDIA Geforce GTX 760
  Compute capability 3.0
  1152 CUDA cores
Windows 7 Professional (64 Bit)
Inc Input const Threads: Float
512 Threads * 512 Blocks = 262144 Threads
Inc Input const Threads: Double
512 Threads * 512 Blocks = 262144 Threads
Inc Input inc Threads: Float
Threads / Input = 8
Inc Input inc Threads: Double
Threads / Input = 8
Inc pos Input inc Threads: Float
Inc pos Input inc Threads: Double
Inc small pos Input inc Threads: Float
Inc small pos Input inc Threads: Double
Dot Product: Float
Dot Product: Double
Dot Product Fast: Float
Dot Product Fast: Double
Questions?
Marco Nehmeier