A hybrid Cholesky decomposition algorithm for multicore CPUs with GPU accelerators
Gary Macindoe
Department of Statistical Science, University College London
8th February 2013
Cholesky Decomposition
Used throughout Computational Statistics and Machine Learning.
Finds L such that A = LL^T: the "square root" of a matrix.
O(N³) operations, so it is a performance bottleneck.
Applies only to symmetric, square, positive definite matrices.
Operates in the upper or lower triangle.
Provides fast ways of computing the inverse and determinant.
Example use of Cholesky Decomposition
Used in the multivariate Normal distribution.
To generate random vectors x ~ N(µ, Σ): draw z ~ N(0, I) and set x = µ + √Σ z, where √Σ is the Cholesky factor of Σ.
To calculate the probability density function:
(2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
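As a concrete illustration (a hedged sketch, not taken from the slides): once the lower Cholesky factor L of Σ is available, a sample is x = µ + Lz and the log-determinant needed by the density is 2 Σ_i log L_{i,i}. The helper below is hypothetical; it assumes column-major storage, LAPACKE for the factorisation and a caller-supplied vector z of standard Normal draws.

#include <lapacke.h>
#include <math.h>

/* Hypothetical sketch: sigma (n x n, column-major) is overwritten by its lower Cholesky
 * factor L; x = mu + L z is one N(mu, Sigma) sample; the return value is log|Sigma|. */
double mvn_sample(int n, double *sigma, int lds, const double *mu,
                  const double *z, double *x, int *info) {
  *info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, sigma, lds); /* Sigma = L L^T */
  if (*info != 0) return 0.0;                                   /* not positive definite */
  double logdet = 0.0;
  for (int i = 0; i < n; i++) {
    double s = mu[i];
    for (int j = 0; j <= i; j++)                                /* L is lower triangular */
      s += sigma[i + j * lds] * z[j];
    x[i] = s;                                                   /* x = mu + L z */
    logdet += 2.0 * log(sigma[i + i * lds]);                    /* |Sigma| = prod L_ii^2 */
  }
  return logdet;
}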
Calculating the Cholesky Decomposition
Definition: the lower triangular Cholesky decomposition A = LL^T has

L_{i,j} = sqrt( A_{j,j} − Σ_{k=1}^{j−1} L_{j,k}² )               if i = j
L_{i,j} = ( A_{i,j} − Σ_{k=1}^{j−1} L_{i,k} L_{j,k} ) / L_{j,j}   if i > j

Each element L_{i,j} can only be calculated after all the elements to its left on the same row, L_{i,0→j}, and the elements of the row containing the diagonal element above it, L_{j,0→j}. If the quantity under the square root is negative then A is not positive definite.
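This definition maps directly onto code. A minimal unblocked sketch in C (assuming column-major storage with leading dimension lda; deliberately unoptimised):

#include <math.h>

/* Overwrites the lower triangle of A with L such that A = L L^T.
 * Returns 0 on success, or j + 1 if column j makes the square root argument
 * non-positive (i.e. A is not positive definite). */
int chol_unblocked(int n, double *A, int lda) {
  for (int j = 0; j < n; j++) {
    double d = A[j + j * lda];
    for (int k = 0; k < j; k++)
      d -= A[j + k * lda] * A[j + k * lda];      /* A_jj - sum_k L_jk^2 */
    if (d <= 0.0) return j + 1;
    double ljj = sqrt(d);
    A[j + j * lda] = ljj;
    for (int i = j + 1; i < n; i++) {
      double s = A[i + j * lda];
      for (int k = 0; k < j; k++)
        s -= A[i + k * lda] * A[j + k * lda];    /* A_ij - sum_k L_ik L_jk */
      A[i + j * lda] = s / ljj;                  /* divide by L_jj */
    }
  }
  return 0;
}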
Blocked Cholesky decomposition
Divide the matrix into blocks and update each block using a high performance linear algebra library (BLAS).
[Diagram: the lower triangle is partitioned into a diagonal block B, the panel A to its left, the panel D below B and the block C below A. B starts at the top left and moves to the bottom right.]
1: B = B − AA^T
2: D = D − CA^T
3: B = chol(B)
4: D = D(B^{−1})^T
Step 2 is independent of steps 1 and 3.
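A minimal CPU-only sketch of this loop (assumptions: column-major storage, CBLAS and LAPACKE available, LAPACKE_dpotrf standing in for the unblocked chol(B) step; the independence of step 2 from steps 1 and 3 is not exploited here):

#include <cblas.h>
#include <lapacke.h>

int chol_blocked(int n, double *A, int lda, int nb) {
  for (int j = 0; j < n; j += nb) {
    int jb = (n - j < nb) ? (n - j) : nb;
    /* 1: B = B - A A^T  (diagonal block updated by the panel to its left) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, jb, j,
                -1.0, &A[j], lda, 1.0, &A[j + j * lda], lda);
    /* 2: D = D - C A^T  (panel below the diagonal block) */
    if (j + jb < n)
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, n - j - jb, jb, j,
                  -1.0, &A[j + jb], lda, &A[j], lda, 1.0, &A[(j + jb) + j * lda], lda);
    /* 3: B = chol(B) */
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, &A[j + j * lda], lda);
    if (info != 0) return j + info;              /* A is not positive definite */
    /* 4: D = D (B^{-1})^T */
    if (j + jb < n)
      cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                  n - j - jb, jb, 1.0, &A[j + j * lda], lda, &A[(j + jb) + j * lda], lda);
  }
  return 0;
}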
GPU Hardware
The GPU dedicates more die area to data processing; the CPU dedicates more die area to control flow and cache.
GPU Hardware
A GPU is an array of Streaming Multiprocessors (SMs).
Each SM has its own shared memory and registers.
Each SM is simpler than a CPU core:
better at simple arithmetic (+, −, ×)
worse at complex arithmetic (÷, sin, cos, tan, log, exp, ...)
GPU Programming
GPUs execute kernel functions written in CUDA-C:

template <typename T>
__global__ void scale(int n, T alpha, T * x, int incx) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i * incx] *= alpha;
}

The nVidia CUDA compiler converts CUDA-C code into GPU binary files.
The CUDA runtime library provides an API to:
transfer compiled code and data onto the GPU
launch kernel functions using a 3D grid of 3D thread blocks
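A hedged host-side sketch (not from the slides) of how the runtime API is used with the scale kernel above: allocate and fill global memory, launch over a 1D grid of 1D thread blocks, copy the result back.

#include <cuda_runtime.h>

void scale_on_gpu(int n, float alpha, float * x_host) {
  float * x_dev;
  cudaMalloc(&x_dev, n * sizeof(float));                                 // allocate global memory
  cudaMemcpy(x_dev, x_host, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer data onto the GPU
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;                        // 1D grid of 1D thread blocks
  scale<float><<<blocks, threads>>>(n, alpha, x_dev, 1);                 // launch the kernel
  cudaMemcpy(x_host, x_dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy the result back
  cudaFree(x_dev);
}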
GPU Thread hierarchy
Each thread block runs on one SM.
Each SM runs more than one thread block.
Within each thread block threads are multitasked in groups of 32 called warps.
Within a warp threads are SIMD: they run the same instruction at the same time.
Threads within a block can synchronize with each other, which ensures all threads in the block are at the same instruction.
GPU Memory hierarchy
GPUs have three types of memory to store data:
Global memory - large, slow, accessible from all threads (and the CPU)
Shared memory - small, fast, shared between threads in a block
Registers - small, fastest, private to each thread
In global and shared memory the highest bandwidth is obtained when consecutive threads access consecutive elements.
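A tiny illustrative kernel (not from the talk): because thread i touches element i, consecutive threads in a warp access consecutive words, which is the coalesced pattern that gives the highest global-memory bandwidth.

__global__ void axpy(int n, float alpha, const float * x, float * y) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] += alpha * x[i];   // thread i reads/writes element i: consecutive threads, consecutive elements
}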
GPU vs CPU
The CPU is better at:
complex maths (like the square root in Cholesky)
branching (if/then/else)
The GPU is better at:
simple maths (sums, multiplies)
executing operations in parallel
Hybrid Blocked Cholesky decomposition
[Diagram: the same block partition as before — diagonal block B, panel A to its left, block C below A, panel D below B. B starts at the top left and moves to the bottom right.]
1: B = B − AA^T
2: D = D − CA^T
3: B = chol(B)
4: D = D(B^{−1})^T
Have the GPU perform the BLAS operations in steps 1, 2 and 4, which contain lots of simple parallel operations, while the CPU performs the smaller Cholesky decomposition in step 3, which contains more complex serial operations.
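A single-stream sketch of this hybrid loop (assumptions: double precision, the whole matrix resident in GPU memory in column-major order with leading dimension ld, cuBLAS performing steps 1, 2 and 4 and LAPACKE on the host performing step 3; the asynchronous overlap the real implementation relies on, and all error checking, are omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <lapacke.h>

/* dA: n x n matrix in GPU memory; work: host buffer of at least nb*nb doubles. */
int chol_hybrid(cublasHandle_t handle, int n, double * dA, int ld, int nb, double * work) {
  const double one = 1.0, m_one = -1.0;
  for (int j = 0; j < n; j += nb) {
    int jb = (n - j < nb) ? (n - j) : nb;
    /* 1: B = B - A A^T (GPU) */
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                &m_one, dA + j, ld, &one, dA + j + (size_t)j * ld, ld);
    /* 2: D = D - C A^T (GPU) */
    if (j + jb < n)
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                  &m_one, dA + j + jb, ld, dA + j, ld,
                  &one, dA + (j + jb) + (size_t)j * ld, ld);
    /* 3: B = chol(B) (CPU): copy the diagonal block down, factorise, copy back */
    cublasGetMatrix(jb, jb, sizeof(double), dA + j + (size_t)j * ld, ld, work, jb);
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, work, jb);
    if (info != 0) return j + info;               /* not positive definite */
    cublasSetMatrix(jb, jb, sizeof(double), work, jb, dA + j + (size_t)j * ld, ld);
    /* 4: D = D (B^{-1})^T (GPU) */
    if (j + jb < n)
      cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                  CUBLAS_DIAG_NON_UNIT, n - j - jb, jb, &one,
                  dA + j + (size_t)j * ld, ld, dA + (j + jb) + (size_t)j * ld, ld);
  }
  return 0;
}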
GPU Matrix Multiply
C = αAB + βC
Each element of C is independent.
Have a 2D grid of 2D thread blocks, each running C_{i,j} = α Σ_{l=0}^{k} A_{i,l} B_{l,j} + β C_{i,j} in parallel.
Calculating one element of C requires reading k elements from A_{i,0→k} and k elements from B_{0→k,j}.
Calculating all of C requires reading 2mnk elements from global memory.
Problem: reading elements of A and B from global memory is slow.
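A sketch of the naive kernel just described (assuming single precision and column-major storage): one thread per element of C, a 2D grid of 2D thread blocks, every operand read straight from global memory.

__global__ void gemm_naive(int m, int n, int k, float alpha,
                           const float * A, int lda, const float * B, int ldb,
                           float beta, float * C, int ldc) {
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < m && j < n) {
    float sum = 0.0f;
    for (int l = 0; l < k; l++)
      sum += A[i + l * lda] * B[l + j * ldb];            // k reads of A and k reads of B per element
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
  }
}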
Blocked GPU Matrix Multiply
[Diagram: blocks of A are read left to right and blocks of B top to bottom to form a block of C.]
Divide A into blocks of mb × kb, B into blocks of kb × nb and C into blocks of mb × nb.
Store one block of C in registers and process a 1 × nb row per thread, using thread blocks of mb × 1.
Read blocks of A and B into shared memory and access them from all threads in the thread block.
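For illustration, a simplified square-tile shared-memory variant (not the mb × 1 register-blocked scheme described above; assumes single precision, column-major storage and 16 × 16 thread blocks): each block stages tiles of A and B through shared memory so that every thread in the block reuses them.

#define TILE 16

__global__ void gemm_tiled(int m, int n, int k, float alpha,
                           const float * A, int lda, const float * B, int ldb,
                           float beta, float * C, int ldc) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];
  const int i = blockIdx.y * TILE + threadIdx.y;   // row of C
  const int j = blockIdx.x * TILE + threadIdx.x;   // column of C
  float sum = 0.0f;
  for (int t = 0; t < k; t += TILE) {
    // Cooperatively load one tile of A and one tile of B into shared memory.
    As[threadIdx.y][threadIdx.x] =
        (i < m && t + threadIdx.x < k) ? A[i + (t + threadIdx.x) * lda] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] =
        (t + threadIdx.y < k && j < n) ? B[(t + threadIdx.y) + j * ldb] : 0.0f;
    __syncthreads();                               // tile fully loaded before use
    for (int l = 0; l < TILE; l++)
      sum += As[threadIdx.y][l] * Bs[l][threadIdx.x];
    __syncthreads();                               // finished with the tile before reloading
  }
  if (i < m && j < n)
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
}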
Bandwidth Reduction
There are (m/mb) × (n/nb) blocks of C.
Calculating all of C now requires reading (m/mb) × (n/nb) × (k/kb) × mb × kb elements of A and (m/mb) × (n/nb) × (k/kb) × kb × nb elements of B,
or m × n × k × (1/mb + 1/nb) elements in total.
This is 2 / (1/mb + 1/nb) times less than when no blocking is used (mb = 1 and nb = 1).
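For example, with the blocking factors chosen later (mb = 64 and nb = 16) the reduction is 2 / (1/64 + 1/16) = 2 / 0.078125 = 25.6×.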
Compute Bound Matrix Multiply
Theorem: if the bandwidth reduction is greater than the FLOP:word ratio then the algorithm will be compute bound [1].
What is the FLOP:word ratio?
The ratio of floating point operations (FLOPs) performed to the memory bandwidth required, expressed as a number of elements (words).
It can be worked out from the GPU documentation (GTX 285):
Throughput (FLOPs): 30 SMs × 1.476 GHz × 16 operations per clock cycle = 708.48 GFLOPs/s
Bandwidth (words): 512-bit memory interface × 2.484 GHz / 32 bits per word = 39.744 × 10⁹ words/s
FLOP:word ratio: 708.48 / 39.744 = 17.82
Choosing mb = 64 and nb = 16 gives a bandwidth reduction of 25.6×.
GPU Symmetric Rank-K Update
C = αAA^T + βC
Similar to matrix multiplication with B = A^T, except only the lower half of C is written to.
[Diagram: A is read to update the lower triangle of C.]
Can use the same code, modified so that thread blocks strictly above the diagonal exit early and those on the diagonal only write to the lower half.
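A naive sketch of that modification (assuming single precision, column-major storage and 2D thread blocks; the real kernel is blocked like the matrix multiply): blocks strictly above the diagonal return immediately and threads only write elements on or below it.

__global__ void syrk_lower_naive(int n, int k, float alpha, const float * A, int lda,
                                 float beta, float * C, int ldc) {
  // Thread blocks strictly above the diagonal have no lower-triangular work: exit early.
  if (blockIdx.x * blockDim.x > blockIdx.y * blockDim.y + blockDim.y - 1)
    return;
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < n && j <= i) {                                 // only write to the lower half
    float sum = 0.0f;
    for (int l = 0; l < k; l++)
      sum += A[i + l * lda] * A[j + l * lda];            // (A A^T)_{i,j}
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
  }
}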
GPU Triangular Solve
Solves XA^T = αB by calculating B = αB(A^{−1})^T, where X overwrites B. With A lower triangular, each element is
B_{i,j} = ( αB_{i,j} − Σ_{k=0}^{j−1} B_{i,k} A_{j,k} ) / A_{j,j}
where the B_{i,k} on the right are already-updated values, so each row needs to be updated left-to-right.
Schedule a column of thread blocks and use a loop to enforce the ordering (slow).
[Diagram: A is lower triangular; each row of B is swept from left to right.]
Hybrid Cholesky Decomposition - Results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Performance reaches 180 GFLOPs/s.
Replacing Triangular Solve
The triangular solve is slow.
B = αB(A^{−1})^T
It contains an inverse (slow) and a matrix multiplication (fast).
Separate it into A = A^{−1} and B = αBA^T.
GPU Triangular Matrix Multiply
B = αBA^T
An implementation which updates B in place has similar dependencies to the triangular solve (so similar performance):
B_{i,j} = α Σ_{k=0}^{j} B_{i,k} A_{j,k}
Elements of B which have not yet been calculated are used to update the current element.
In an "out of place" implementation each element is independent:
X_{i,j} = α Σ_{k=0}^{j} B_{i,k} A_{j,k}
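A naive sketch of the out of place version (assumptions: single precision, column-major storage, A lower triangular n × n, B and X both m × n; the real kernel is blocked like the matrix multiply): X is written while B is only read, so no ordering between elements is needed.

__global__ void trmm_right_lower_t(int m, int n, float alpha,
                                   const float * A, int lda,
                                   const float * B, int ldb,
                                   float * X, int ldx) {
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of B and X
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of B and X
  if (i < m && j < n) {
    float sum = 0.0f;
    for (int k = 0; k <= j; k++)                         // A lower triangular: A_{j,k} = 0 for k > j
      sum += B[i + k * ldb] * A[j + k * lda];            // (B A^T)_{i,j}
    X[i + j * ldx] = alpha * sum;                        // X written, B only read: no ordering needed
  }
}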
Calculating the Inverse
Have replaced the triangular solve B = αB(A^{−1})^T with the triangular multiply X = αBA^T.
Now need to form the inverse of A.
A is the diagonal block in the blocked Cholesky decomposition.
Have just computed the Cholesky decomposition of A on the CPU.
The Cholesky decomposition provides a faster calculation of the inverse.
Calculate the inverse of the diagonal block A on the CPU and copy it into a temporary block on the GPU.
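A hypothetical host-side helper for this step, assuming LAPACKE: the diagonal block is factorised and its lower-triangular factor is replaced by its inverse (LAPACKE_dtrtri), ready to be copied into the temporary block on the GPU.

#include <lapacke.h>

int chol_and_invert(int nb, double * a, int lda) {
  int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, a, lda);   /* B = chol(B) */
  if (info != 0) return info;                                     /* not positive definite */
  return LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', nb, a, lda);  /* Z = B^{-1} (triangular inverse) */
}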
Hybrid Cholesky decomposition without triangular solve
[Diagram: the block partition as before, with a temporary block X for the out of place multiply and a temporary block Z for the inverse of the diagonal block.]
1: B = B − AA^T
2: X = D − CA^T
3: B = chol(B)
4: Z = B^{−1}
5: D = XZ^T
Use the out of place matrix multiply to populate X, then the triangular multiply to copy back to D.
Hybrid Cholesky decomposition without triangular solve - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Improving diagonal block transfer
Each memory copy has an overhead:
t = n / bandwidth + overhead
Copying matrices
Matrices are stored in memory as an array of n columns of m elements ("column major").
Each column is padded so that the next column is aligned on a memory boundary.
Submatrices share the same padding as the larger matrix.
Each column must be copied separately:
t(m, n) = n × ( m / bandwidth + overhead )
If the number of rows is a multiple of the memory alignment there is no padding:
t(m, n) = m × n / bandwidth + overhead
But submatrices are always padded.
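A hedged sketch of the copy (assuming doubles and the CUDA runtime API): when the leading dimension equals the number of rows the data are contiguous and a single flat copy pays the overhead once; otherwise cudaMemcpy2D transfers the n padded columns, paying the per-column cost described above.

#include <cuda_runtime.h>

cudaError_t copy_submatrix_to_gpu(int m, int n, const double * h_src, size_t h_ld,
                                  double * d_dst, size_t d_ld) {
  if (h_ld == (size_t)m && d_ld == (size_t)m)      // no padding: one contiguous transfer
    return cudaMemcpy(d_dst, h_src, sizeof(double) * m * n, cudaMemcpyHostToDevice);
  return cudaMemcpy2D(d_dst, d_ld * sizeof(double), h_src, h_ld * sizeof(double),
                      (size_t)m * sizeof(double), n, cudaMemcpyHostToDevice);
}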
Block Column Copy
[Diagram: the block column containing the diagonal block B and the panel D below it, alongside the temporary blocks X and Z.]
Define a column around B.
There is no padding when n is a multiple of the memory alignment.
It is faster to copy the (larger) column when
n × nb / bandwidth + overhead < nb × ( nb / bandwidth + overhead )
Don't have to worry about overwriting the updated D, as the matrix multiply is now out of place.
Block Column Copy - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Tuning the block size
For CPU blocked algorithms the block size is chosen so that the blocks fit in the CPU cache.
The hybrid algorithm introduces an extra level of blocking.
Aim to choose the block size so that the workload is balanced between the computing devices.
Static block size
Aim to minimise the area between the two curves.
Dynamic block size
Can change block size on each iteration and still have a correctalgorithm
Dynamic block size - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
GPU Cholesky decomposition
Now that the block size changes on each iteration it can become too small for the matrix multiply to sufficiently overlap the 2× copy and the Cholesky decomposition on the CPU.
Implement an unblocked Cholesky decomposition for the GPU:
due to data dependencies it only runs on one SM, so that threads can share results
it uses triangular packed storage to make the most efficient use of shared memory and to get a large number of threads
The GPU is already performing the matrix multiply.
Possible to overlap both on the GPU?
Multiple kernels
nVidia CUDA Programming Guide (2008): maximum performance occurs when no threads execute divergent branches.
nVidia CUDA Programming Guide (2011): maximum performance occurs when no threads within the same warp execute divergent branches.
Write a combined matrix multiply and unblocked Cholesky decomposition kernel:
the first n − 1 thread blocks perform the matrix multiplication
thread block n performs the Cholesky decomposition
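A structural sketch of the combined launch (assumptions: single precision, unpacked column-major storage for the nd × nd diagonal block rather than the packed shared-memory storage used in the talk, blockDim.x ≥ nd, and a simple element-wise update standing in for the real matrix-multiply work): the branch is taken at thread-block granularity, so no warp ever diverges.

__device__ void chol_device(int nd, float * D, int ldd) {
  const int i = threadIdx.x;                        // one thread per row of the diagonal block
  for (int j = 0; j < nd; j++) {
    if (i == j)
      D[j + j * ldd] = sqrtf(D[j + j * ldd]);       // column j is already fully updated
    __syncthreads();
    if (i > j && i < nd)
      D[i + j * ldd] /= D[j + j * ldd];             // L_ij = A_ij / L_jj
    __syncthreads();
    if (i > j && i < nd)
      for (int l = j + 1; l <= i; l++)
        D[i + l * ldd] -= D[i + j * ldd] * D[l + j * ldd];   // trailing update of the lower triangle
    __syncthreads();
  }
}

__device__ void update_device(int nwork, float alpha, float * work) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;     // stand-in for the matrix-multiply update
  if (idx < nwork)
    work[idx] *= alpha;
}

__global__ void multiply_and_factorise(int nwork, float alpha, float * work,
                                       int nd, float * D, int ldd) {
  if (blockIdx.x == gridDim.x - 1)
    chol_device(nd, D, ldd);                        // last thread block: unblocked Cholesky
  else
    update_device(nwork, alpha, work);              // all other thread blocks: parallel update
}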
Multiple Kernels - results
[Performance plot against matrix size n, for n up to 4500.]
Final Results
[Performance plot against matrix size n, for n up to 4500.]
Conclusions
It is possible to get large speed improvements for inherently sequential algorithms such as the Cholesky decomposition by carefully considering the structure of the algorithm and the type of operations performed at each step.
Having a GPU available allows processing to overlap, making maximum use of the available parallelism.
Different types of parallel workloads can be sent to the most appropriate device.
References
[1] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.