A hybrid Cholesky decomposition algorithm for multicore CPUs with GPU accelerators
Gary Macindoe
Department of Statistical Science, University College London
8th February 2013
Cholesky Decomposition
Used throughout Computational Statistics and Machine Learning.
Finds L such that A = LL^T: the "square root" of a matrix.
O(N³) operations, so it is a performance bottleneck.
Applies only to symmetric, square, positive definite matrices.
Operates in the upper or lower triangle.
Provides fast ways of computing the inverse and determinant.
Example use of Cholesky Decomposition
Used in the multivariate Normal distribution.
To generate random vectors x ~ N(µ, Σ): draw z ~ N(0, I) and set x = µ + √Σ z, where √Σ is the Cholesky factor of Σ.
To calculate the probability density function:
(2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
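As a concrete illustration (a hedged sketch, not taken from the slides): once the lower Cholesky factor L of Σ is available, a sample is x = µ + Lz and the log-determinant needed by the density is 2 Σ_i log L_{i,i}. The helper below is hypothetical; it assumes column-major storage, LAPACKE for the factorisation and a caller-supplied vector z of standard Normal draws.

#include <lapacke.h>
#include <math.h>

/* Hypothetical sketch: sigma (n x n, column-major) is overwritten by its lower Cholesky
 * factor L; x = mu + L z is one N(mu, Sigma) sample; the return value is log|Sigma|. */
double mvn_sample(int n, double *sigma, int lds, const double *mu,
                  const double *z, double *x, int *info) {
  *info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, sigma, lds); /* Sigma = L L^T */
  if (*info != 0) return 0.0;                                   /* not positive definite */
  double logdet = 0.0;
  for (int i = 0; i < n; i++) {
    double s = mu[i];
    for (int j = 0; j <= i; j++)                                /* L is lower triangular */
      s += sigma[i + j * lds] * z[j];
    x[i] = s;                                                   /* x = mu + L z */
    logdet += 2.0 * log(sigma[i + i * lds]);                    /* |Sigma| = prod L_ii^2 */
  }
  return logdet;
}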
Calculating the Cholesky Decomposition
Definition: the lower triangular Cholesky decomposition A = LL^T has

L_{i,j} = sqrt( A_{j,j} − Σ_{k=1}^{j−1} L_{j,k}² )               if i = j
L_{i,j} = ( A_{i,j} − Σ_{k=1}^{j−1} L_{i,k} L_{j,k} ) / L_{j,j}   if i > j

Each element L_{i,j} can only be calculated after all the elements to its left on the same row, L_{i,0→j}, and the elements of the row containing the diagonal element above it, L_{j,0→j}. If the quantity under the square root is negative then A is not positive definite.
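This definition maps directly onto code. A minimal unblocked sketch in C (assuming column-major storage with leading dimension lda; deliberately unoptimised):

#include <math.h>

/* Overwrites the lower triangle of A with L such that A = L L^T.
 * Returns 0 on success, or j + 1 if column j makes the square root argument
 * non-positive (i.e. A is not positive definite). */
int chol_unblocked(int n, double *A, int lda) {
  for (int j = 0; j < n; j++) {
    double d = A[j + j * lda];
    for (int k = 0; k < j; k++)
      d -= A[j + k * lda] * A[j + k * lda];      /* A_jj - sum_k L_jk^2 */
    if (d <= 0.0) return j + 1;
    double ljj = sqrt(d);
    A[j + j * lda] = ljj;
    for (int i = j + 1; i < n; i++) {
      double s = A[i + j * lda];
      for (int k = 0; k < j; k++)
        s -= A[i + k * lda] * A[j + k * lda];    /* A_ij - sum_k L_ik L_jk */
      A[i + j * lda] = s / ljj;                  /* divide by L_jj */
    }
  }
  return 0;
}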
Blocked Cholesky decomposition
Divide the matrix into blocks and update each block using a high performance linear algebra library (BLAS).
[Diagram: the lower triangle is partitioned into a diagonal block B, the panel A to its left, the panel D below B and the block C below A. B starts at the top left and moves to the bottom right.]
1: B = B − AA^T
2: D = D − CA^T
3: B = chol(B)
4: D = D(B^{−1})^T
Step 2 is independent of steps 1 and 3.
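A minimal CPU-only sketch of this loop (assumptions: column-major storage, CBLAS and LAPACKE available, LAPACKE_dpotrf standing in for the unblocked chol(B) step; the independence of step 2 from steps 1 and 3 is not exploited here):

#include <cblas.h>
#include <lapacke.h>

int chol_blocked(int n, double *A, int lda, int nb) {
  for (int j = 0; j < n; j += nb) {
    int jb = (n - j < nb) ? (n - j) : nb;
    /* 1: B = B - A A^T  (diagonal block updated by the panel to its left) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, jb, j,
                -1.0, &A[j], lda, 1.0, &A[j + j * lda], lda);
    /* 2: D = D - C A^T  (panel below the diagonal block) */
    if (j + jb < n)
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, n - j - jb, jb, j,
                  -1.0, &A[j + jb], lda, &A[j], lda, 1.0, &A[(j + jb) + j * lda], lda);
    /* 3: B = chol(B) */
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, &A[j + j * lda], lda);
    if (info != 0) return j + info;              /* A is not positive definite */
    /* 4: D = D (B^{-1})^T */
    if (j + jb < n)
      cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                  n - j - jb, jb, 1.0, &A[j + j * lda], lda, &A[(j + jb) + j * lda], lda);
  }
  return 0;
}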
GPU Hardware
The GPU dedicates more die area to data processing; the CPU dedicates more die area to control flow and cache.
GPU Hardware
A GPU is an array of Streaming Multiprocessors (SMs).
Each SM has its own shared memory and registers.
Each SM is simpler than a CPU core:
better at simple arithmetic (+, −, ×)
worse at complex arithmetic (÷, sin, cos, tan, log, exp, ...)
GPU Programming
GPUs execute kernel functions written in CUDA-C:

template <typename T>
__global__ void scale(int n, T alpha, T * x, int incx) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i * incx] *= alpha;
}

The nVidia CUDA compiler converts CUDA-C code into GPU binary files.
The CUDA runtime library provides an API to:
transfer compiled code and data onto the GPU
launch kernel functions using a 3D grid of 3D thread blocks
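A hedged host-side sketch (not from the slides) of how the runtime API is used with the scale kernel above: allocate and fill global memory, launch over a 1D grid of 1D thread blocks, copy the result back.

#include <cuda_runtime.h>

void scale_on_gpu(int n, float alpha, float * x_host) {
  float * x_dev;
  cudaMalloc(&x_dev, n * sizeof(float));                                 // allocate global memory
  cudaMemcpy(x_dev, x_host, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer data onto the GPU
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;                        // 1D grid of 1D thread blocks
  scale<float><<<blocks, threads>>>(n, alpha, x_dev, 1);                 // launch the kernel
  cudaMemcpy(x_host, x_dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy the result back
  cudaFree(x_dev);
}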
GPU Thread hierarchy
Each thread block runs on one SM.
Each SM runs more than one thread block.
Within each thread block threads are multitasked in groups of 32 called warps.
Within a warp threads are SIMD: they run the same instruction at the same time.
Threads within a block can synchronize with each other, which ensures all threads in the block are at the same instruction.
GPU Memory hierarchy
GPUs have three types of memory to store data:
Global memory - large, slow, accessible from all threads (and the CPU)
Shared memory - small, fast, shared between threads in a block
Registers - small, fastest, private to each thread
In global and shared memory the highest bandwidth is obtained when consecutive threads access consecutive elements.
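A tiny illustrative kernel (not from the talk): because thread i touches element i, consecutive threads in a warp access consecutive words, which is the coalesced pattern that gives the highest global-memory bandwidth.

__global__ void axpy(int n, float alpha, const float * x, float * y) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] += alpha * x[i];   // thread i reads/writes element i: consecutive threads, consecutive elements
}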
GPU vs CPU
The CPU is better at:
complex maths (like the square root in Cholesky)
branching (if/then/else)
The GPU is better at:
simple maths (sums, multiplies)
executing operations in parallel
Hybrid Blocked Cholesky decomposition
[Diagram: the same block partition as before — diagonal block B, panel A to its left, block C below A, panel D below B. B starts at the top left and moves to the bottom right.]
1: B = B − AA^T
2: D = D − CA^T
3: B = chol(B)
4: D = D(B^{−1})^T
Have the GPU perform the BLAS operations in steps 1, 2 and 4, which contain lots of simple parallel operations, while the CPU performs the smaller Cholesky decomposition in step 3, which contains more complex serial operations.
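A single-stream sketch of this hybrid loop (assumptions: double precision, the whole matrix resident in GPU memory in column-major order with leading dimension ld, cuBLAS performing steps 1, 2 and 4 and LAPACKE on the host performing step 3; the asynchronous overlap the real implementation relies on, and all error checking, are omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <lapacke.h>

/* dA: n x n matrix in GPU memory; work: host buffer of at least nb*nb doubles. */
int chol_hybrid(cublasHandle_t handle, int n, double * dA, int ld, int nb, double * work) {
  const double one = 1.0, m_one = -1.0;
  for (int j = 0; j < n; j += nb) {
    int jb = (n - j < nb) ? (n - j) : nb;
    /* 1: B = B - A A^T (GPU) */
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                &m_one, dA + j, ld, &one, dA + j + (size_t)j * ld, ld);
    /* 2: D = D - C A^T (GPU) */
    if (j + jb < n)
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                  &m_one, dA + j + jb, ld, dA + j, ld,
                  &one, dA + (j + jb) + (size_t)j * ld, ld);
    /* 3: B = chol(B) (CPU): copy the diagonal block down, factorise, copy back */
    cublasGetMatrix(jb, jb, sizeof(double), dA + j + (size_t)j * ld, ld, work, jb);
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, work, jb);
    if (info != 0) return j + info;               /* not positive definite */
    cublasSetMatrix(jb, jb, sizeof(double), work, jb, dA + j + (size_t)j * ld, ld);
    /* 4: D = D (B^{-1})^T (GPU) */
    if (j + jb < n)
      cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                  CUBLAS_DIAG_NON_UNIT, n - j - jb, jb, &one,
                  dA + j + (size_t)j * ld, ld, dA + (j + jb) + (size_t)j * ld, ld);
  }
  return 0;
}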
GPU Matrix Multiply
C = αAB + βC
Each element of C is independent.
Have a 2D grid of 2D thread blocks, each running C_{i,j} = α Σ_{l=0}^{k} A_{i,l} B_{l,j} + β C_{i,j} in parallel.
Calculating one element of C requires reading k elements from A_{i,0→k} and k elements from B_{0→k,j}.
Calculating all of C requires reading 2mnk elements from global memory.
Problem: reading elements of A and B from global memory is slow.
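A sketch of the naive kernel just described (assuming single precision and column-major storage): one thread per element of C, a 2D grid of 2D thread blocks, every operand read straight from global memory.

__global__ void gemm_naive(int m, int n, int k, float alpha,
                           const float * A, int lda, const float * B, int ldb,
                           float beta, float * C, int ldc) {
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < m && j < n) {
    float sum = 0.0f;
    for (int l = 0; l < k; l++)
      sum += A[i + l * lda] * B[l + j * ldb];            // k reads of A and k reads of B per element
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
  }
}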
Blocked GPU Matrix Multiply
[Diagram: blocks of A are read left to right and blocks of B top to bottom to form a block of C.]
Divide A into blocks of mb × kb, B into blocks of kb × nb and C into blocks of mb × nb.
Store one block of C in registers and process a 1 × nb row per thread, using thread blocks of mb × 1.
Read blocks of A and B into shared memory and access them from all threads in the thread block.
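For illustration, a simplified square-tile shared-memory variant (not the mb × 1 register-blocked scheme described above; assumes single precision, column-major storage and 16 × 16 thread blocks): each block stages tiles of A and B through shared memory so that every thread in the block reuses them.

#define TILE 16

__global__ void gemm_tiled(int m, int n, int k, float alpha,
                           const float * A, int lda, const float * B, int ldb,
                           float beta, float * C, int ldc) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];
  const int i = blockIdx.y * TILE + threadIdx.y;   // row of C
  const int j = blockIdx.x * TILE + threadIdx.x;   // column of C
  float sum = 0.0f;
  for (int t = 0; t < k; t += TILE) {
    // Cooperatively load one tile of A and one tile of B into shared memory.
    As[threadIdx.y][threadIdx.x] =
        (i < m && t + threadIdx.x < k) ? A[i + (t + threadIdx.x) * lda] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] =
        (t + threadIdx.y < k && j < n) ? B[(t + threadIdx.y) + j * ldb] : 0.0f;
    __syncthreads();                               // tile fully loaded before use
    for (int l = 0; l < TILE; l++)
      sum += As[threadIdx.y][l] * Bs[l][threadIdx.x];
    __syncthreads();                               // finished with the tile before reloading
  }
  if (i < m && j < n)
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
}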
Bandwidth Reduction
There are (m/mb) × (n/nb) blocks of C.
Calculating all of C now requires reading (m/mb) × (n/nb) × (k/kb) × mb × kb elements of A and (m/mb) × (n/nb) × (k/kb) × kb × nb elements of B,
or m × n × k × (1/mb + 1/nb) elements in total.
This is 2 / (1/mb + 1/nb) times less than when no blocking is used (mb = 1 and nb = 1).
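For example, with the blocking factors chosen later (mb = 64 and nb = 16) the reduction is 2 / (1/64 + 1/16) = 2 / 0.078125 = 25.6×.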
Compute Bound Matrix Multiply
Theorem: if the bandwidth reduction is greater than the FLOP:word ratio then the algorithm will be compute bound [1].
What is the FLOP:word ratio?
The ratio of floating point operations (FLOPs) performed to the memory bandwidth required, expressed as a number of elements (words).
It can be worked out from the GPU documentation (GTX 285):
Throughput (FLOPs): 30 SMs × 1.476 GHz × 16 operations per clock cycle = 708.48 GFLOPs/s
Bandwidth (words): 512-bit memory interface × 2.484 GHz / 32 bits per word = 39.744 × 10⁹ words/s
FLOP:word ratio: 708.48 / 39.744 = 17.82
Choosing mb = 64 and nb = 16 gives a bandwidth reduction of 25.6×.
GPU Symmetric Rank-K Update
C = αAA^T + βC
Similar to matrix multiplication with B = A^T, except only the lower half of C is written to.
[Diagram: A is read to update the lower triangle of C.]
Can use the same code, modified so that thread blocks strictly above the diagonal exit early and those on the diagonal only write to the lower half.
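A naive sketch of that modification (assuming single precision, column-major storage and 2D thread blocks; the real kernel is blocked like the matrix multiply): blocks strictly above the diagonal return immediately and threads only write elements on or below it.

__global__ void syrk_lower_naive(int n, int k, float alpha, const float * A, int lda,
                                 float beta, float * C, int ldc) {
  // Thread blocks strictly above the diagonal have no lower-triangular work: exit early.
  if (blockIdx.x * blockDim.x > blockIdx.y * blockDim.y + blockDim.y - 1)
    return;
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < n && j <= i) {                                 // only write to the lower half
    float sum = 0.0f;
    for (int l = 0; l < k; l++)
      sum += A[i + l * lda] * A[j + l * lda];            // (A A^T)_{i,j}
    C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
  }
}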
GPU Triangular Solve
Solves XA^T = αB by calculating B = αB(A^{−1})^T, where X overwrites B. With A lower triangular, each element is
B_{i,j} = ( αB_{i,j} − Σ_{k=0}^{j−1} B_{i,k} A_{j,k} ) / A_{j,j}
where the B_{i,k} on the right are already-updated values, so each row needs to be updated left-to-right.
Schedule a column of thread blocks and use a loop to enforce the ordering (slow).
[Diagram: A is lower triangular; each row of B is swept from left to right.]
Hybrid Cholesky Decomposition - Results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Performance reaches 180 GFLOPs/s.
Replacing Triangular Solve
The triangular solve is slow.
B = αB(A^{−1})^T
It contains an inverse (slow) and a matrix multiplication (fast).
Separate it into A = A^{−1} and B = αBA^T.
GPU Triangular Matrix Multiply
B = αBA^T
An implementation which updates B in place has similar dependencies to the triangular solve (so similar performance):
B_{i,j} = α Σ_{k=0}^{j} B_{i,k} A_{j,k}
Elements of B which have not yet been calculated are used to update the current element.
In an "out of place" implementation each element is independent:
X_{i,j} = α Σ_{k=0}^{j} B_{i,k} A_{j,k}
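A naive sketch of the out of place version (assumptions: single precision, column-major storage, A lower triangular n × n, B and X both m × n; the real kernel is blocked like the matrix multiply): X is written while B is only read, so no ordering between elements is needed.

__global__ void trmm_right_lower_t(int m, int n, float alpha,
                                   const float * A, int lda,
                                   const float * B, int ldb,
                                   float * X, int ldx) {
  const int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of B and X
  const int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of B and X
  if (i < m && j < n) {
    float sum = 0.0f;
    for (int k = 0; k <= j; k++)                         // A lower triangular: A_{j,k} = 0 for k > j
      sum += B[i + k * ldb] * A[j + k * lda];            // (B A^T)_{i,j}
    X[i + j * ldx] = alpha * sum;                        // X written, B only read: no ordering needed
  }
}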
Calculating the Inverse
Have replaced the triangular solve B = αB(A^{−1})^T with the triangular multiply X = αBA^T.
Now need to form the inverse of A.
A is the diagonal block in the blocked Cholesky decomposition.
Have just computed the Cholesky decomposition of A on the CPU.
The Cholesky decomposition provides a faster calculation of the inverse.
Calculate the inverse of the diagonal block A on the CPU and copy it into a temporary block on the GPU.
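A hypothetical host-side helper for this step, assuming LAPACKE: the diagonal block is factorised and its lower-triangular factor is replaced by its inverse (LAPACKE_dtrtri), ready to be copied into the temporary block on the GPU.

#include <lapacke.h>

int chol_and_invert(int nb, double * a, int lda) {
  int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, a, lda);   /* B = chol(B) */
  if (info != 0) return info;                                     /* not positive definite */
  return LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', nb, a, lda);  /* Z = B^{-1} (triangular inverse) */
}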
Hybrid Cholesky decomposition without triangular solve
[Diagram: the block partition as before, with a temporary block X for the out of place multiply and a temporary block Z for the inverse of the diagonal block.]
1: B = B − AA^T
2: X = D − CA^T
3: B = chol(B)
4: Z = B^{−1}
5: D = XZ^T
Use the out of place matrix multiply to populate X, then the triangular multiply to copy back to D.
Hybrid Cholesky decomposition without triangular solve - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Improving diagonal block transfer
Each memory copy has an overhead:
t = n / bandwidth + overhead
Copying matrices
Matrices are stored in memory as an array of n columns of m elements ("column major").
Each column is padded so that the next column is aligned on a memory boundary.
Submatrices share the same padding as the larger matrix.
Each column must be copied separately:
t(m, n) = n × ( m / bandwidth + overhead )
If the number of rows is a multiple of the memory alignment there is no padding:
t(m, n) = m × n / bandwidth + overhead
But submatrices are always padded.
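A hedged sketch of the copy (assuming doubles and the CUDA runtime API): when the leading dimension equals the number of rows the data are contiguous and a single flat copy pays the overhead once; otherwise cudaMemcpy2D transfers the n padded columns, paying the per-column cost described above.

#include <cuda_runtime.h>

cudaError_t copy_submatrix_to_gpu(int m, int n, const double * h_src, size_t h_ld,
                                  double * d_dst, size_t d_ld) {
  if (h_ld == (size_t)m && d_ld == (size_t)m)      // no padding: one contiguous transfer
    return cudaMemcpy(d_dst, h_src, sizeof(double) * m * n, cudaMemcpyHostToDevice);
  return cudaMemcpy2D(d_dst, d_ld * sizeof(double), h_src, h_ld * sizeof(double),
                      (size_t)m * sizeof(double), n, cudaMemcpyHostToDevice);
}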
Block Column Copy
[Diagram: the block column containing the diagonal block B and the panel D below it, alongside the temporary blocks X and Z.]
Define a column around B.
There is no padding when n is a multiple of the memory alignment.
It is faster to copy the (larger) column when
n × nb / bandwidth + overhead < nb × ( nb / bandwidth + overhead )
Don't have to worry about overwriting the updated D, as the matrix multiply is now out of place.
Block Column Copy - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
Tuning the block size
For CPU blocked algorithms the block size is chosen so that the blocks fit in the CPU cache.
The hybrid algorithm introduces an extra level of blocking.
Aim to choose the block size so that the workload is balanced between the computing devices.
Static block size
Aim to minimise the area between the two curves.
Dynamic block size
Can change block size on each iteration and still have a correctalgorithm
Dynamic block size - results
[Performance plot: GFLOPs/s against matrix size n, for n up to 4500.]
GPU Cholesky decomposition
Now that the block size changes on each iteration it can become too small for the matrix multiply to sufficiently overlap the 2× copy and the Cholesky decomposition on the CPU.
Implement an unblocked Cholesky decomposition for the GPU:
due to data dependencies it only runs on one SM, so that threads can share results
it uses triangular packed storage to make the most efficient use of shared memory and to get a large number of threads
The GPU is already performing the matrix multiply.
Possible to overlap both on the GPU?
Multiple kernels
nVidia CUDA Programming Guide (2008): maximum performance occurs when no threads execute divergent branches.
nVidia CUDA Programming Guide (2011): maximum performance occurs when no threads within the same warp execute divergent branches.
Write a combined matrix multiply and unblocked Cholesky decomposition kernel:
the first n − 1 thread blocks perform the matrix multiplication
thread block n performs the Cholesky decomposition
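A structural sketch of the combined launch (assumptions: single precision, unpacked column-major storage for the nd × nd diagonal block rather than the packed shared-memory storage used in the talk, blockDim.x ≥ nd, and a simple element-wise update standing in for the real matrix-multiply work): the branch is taken at thread-block granularity, so no warp ever diverges.

__device__ void chol_device(int nd, float * D, int ldd) {
  const int i = threadIdx.x;                        // one thread per row of the diagonal block
  for (int j = 0; j < nd; j++) {
    if (i == j)
      D[j + j * ldd] = sqrtf(D[j + j * ldd]);       // column j is already fully updated
    __syncthreads();
    if (i > j && i < nd)
      D[i + j * ldd] /= D[j + j * ldd];             // L_ij = A_ij / L_jj
    __syncthreads();
    if (i > j && i < nd)
      for (int l = j + 1; l <= i; l++)
        D[i + l * ldd] -= D[i + j * ldd] * D[l + j * ldd];   // trailing update of the lower triangle
    __syncthreads();
  }
}

__device__ void update_device(int nwork, float alpha, float * work) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;     // stand-in for the matrix-multiply update
  if (idx < nwork)
    work[idx] *= alpha;
}

__global__ void multiply_and_factorise(int nwork, float alpha, float * work,
                                       int nd, float * D, int ldd) {
  if (blockIdx.x == gridDim.x - 1)
    chol_device(nd, D, ldd);                        // last thread block: unblocked Cholesky
  else
    update_device(nwork, alpha, work);              // all other thread blocks: parallel update
}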
Multiple Kernels - results
[Performance plot against matrix size n, for n up to 4500.]
Final Results
[Performance plot against matrix size n, for n up to 4500.]
Conclusions
It is possible to get large speed improvements for inherently sequential algorithms such as the Cholesky decomposition by carefully considering the structure of the algorithm and the type of operations performed at each step.
Having a GPU available allows processing to overlap, making maximum use of the available parallelism.
Different types of parallel workloads can be sent to the most appropriate device.
References
[1] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.