A hybrid Cholesky decomposition algorithm for multicore CPUs with GPU accelerators

Gary Macindoe
Department of Statistical Science
University College London

8th February 2013


Cholesky Decomposition

• Used throughout Computational Statistics and Machine Learning
• Finds $L$ such that $A = LL^T$
  • "Square root" of a matrix
  • $O(N^3)$ operations
  • Performance bottleneck
• Applies only to symmetric, square, positive definite matrices
• Operates in the upper or lower triangle
• Provides fast ways of computing the inverse and determinant

Example use of Cholesky Decomposition

Used in multivariate Normal distribution

• To generate random vectors $\sim \mathcal{N}(\mu, \Sigma)$

  $$z \sim \mathcal{N}(0, 1)$$

  $$x = \mu + \sqrt{\Sigma}\, z$$

• To calculate the probability density function

  $$(2\pi)^{-\frac{n}{2}} \, |\Sigma|^{-\frac{1}{2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
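As a sketch of why the factorisation helps in both cases, assume $\Sigma = LL^T$ with $L$ lower triangular, take $L$ as the matrix square root, and write $y = L^{-1}(x - \mu)$ (the symbol $y$ is introduced here for illustration only). The covariance, determinant and quadratic form then reduce to operations on the triangle:

```latex
\begin{aligned}
x &= \mu + L z
  &&\Rightarrow\quad \operatorname{Cov}(x) = L \operatorname{Cov}(z) L^T = LL^T = \Sigma \\
|\Sigma| &= |L|\,|L^T| = \prod_{i=1}^{n} L_{i,i}^2
  &&\Rightarrow\quad \log|\Sigma| = 2 \sum_{i=1}^{n} \log L_{i,i} \\
(x-\mu)^T \Sigma^{-1} (x-\mu) &= (x-\mu)^T L^{-T} L^{-1} (x-\mu) = y^T y
  &&\text{where } L y = x - \mu
\end{aligned}
```

The last line needs only a forward substitution with $L$, so $\Sigma^{-1}$ is never formed explicitly.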

Calculating the Cholesky Decomposition

Definition
Lower triangular Cholesky Decomposition $A = LL^T$

$$
L_{i,j} =
\begin{cases}
\sqrt{A_{j,j} - \sum_{k=1}^{j-1} L_{j,k}^2} & \text{if } i = j \\
\dfrac{1}{L_{j,j}} \left( A_{i,j} - \sum_{k=1}^{j-1} L_{i,k} L_{j,k} \right) & \text{if } i > j
\end{cases}
$$

Each element $L_{i,j}$ of the lower triangular Cholesky decomposition can only be calculated after all the elements to the left on the same row, $L_{i,0 \to j}$, and on the diagonal row above, $L_{j,0 \to j}$. If the quantity under the square root is negative or zero then $A$ is not positive definite.
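The definition above can be sketched as a short unblocked routine. This is a minimal illustration rather than the implementation from the talk: the function name `cholesky_lower`, the row-major storage and the zero-based indices are all assumptions made here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unblocked lower-triangular Cholesky, A = L * L^T.
// `a` holds an n x n symmetric matrix in row-major order (an assumption of
// this sketch); only its lower triangle is read. On success the lower
// triangle of `l` holds L and the function returns true. It returns false
// when a pivot is not positive, i.e. A is not positive definite.
bool cholesky_lower(const std::vector<double>& a, std::size_t n,
                    std::vector<double>& l) {
  l.assign(n * n, 0.0);
  for (std::size_t j = 0; j < n; ++j) {
    // Diagonal element: needs every L(j,k) to its left.
    double d = a[j * n + j];
    for (std::size_t k = 0; k < j; ++k) d -= l[j * n + k] * l[j * n + k];
    if (!(d > 0.0)) return false;
    l[j * n + j] = std::sqrt(d);
    // Elements below the diagonal in column j.
    for (std::size_t i = j + 1; i < n; ++i) {
      double s = a[i * n + j];
      for (std::size_t k = 0; k < j; ++k) s -= l[i * n + k] * l[j * n + k];
      l[i * n + j] = s / l[j * n + j];
    }
  }
  return true;
}
```

The loop order makes the dependency visible: column `j` is only touched once every column to its left is final, which is what limits parallelism in the unblocked form.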

Blocked Cholesky decomposition

Divide the matrix into blocks and update each block using a high performance linear algebra library (BLAS).

[Diagram: lower triangle of the matrix partitioned into blocks

    A B
    C D

with B on the diagonal, A to its left, C below A and D below B.]

B starts at top left and moves to bottom right

1: $B = B - AA^T$
2: $D = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $D = D(B^{-1})^T$

Step 2 is independent of steps 1 and 3.
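The four steps can be sketched on the CPU with plain loops standing in for the BLAS calls. This is an illustrative sketch only: the name `blocked_cholesky`, the row-major in-place storage and the block size parameter `nb` are assumptions, and a real implementation would call a tuned library for steps 1, 2 and 4.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place blocked lower Cholesky on a row-major n x n matrix `m`
// (layout and names are assumptions of this sketch). Returns false if
// a pivot is not positive. Only the lower triangle is read or written.
bool blocked_cholesky(std::vector<double>& m, std::size_t n, std::size_t nb) {
  auto at = [&](std::size_t i, std::size_t j) -> double& { return m[i * n + j]; };
  for (std::size_t j0 = 0; j0 < n; j0 += nb) {
    const std::size_t j1 = std::min(j0 + nb, n);  // B spans [j0, j1)
    // Step 1: B = B - A A^T, with A = rows [j0,j1), columns [0,j0).
    for (std::size_t i = j0; i < j1; ++i)
      for (std::size_t j = j0; j <= i; ++j)
        for (std::size_t k = 0; k < j0; ++k) at(i, j) -= at(i, k) * at(j, k);
    // Step 2: D = D - C A^T, with C = rows [j1,n), columns [0,j0).
    for (std::size_t i = j1; i < n; ++i)
      for (std::size_t j = j0; j < j1; ++j)
        for (std::size_t k = 0; k < j0; ++k) at(i, j) -= at(i, k) * at(j, k);
    // Step 3: B = chol(B), unblocked on the diagonal block.
    for (std::size_t j = j0; j < j1; ++j) {
      for (std::size_t k = j0; k < j; ++k) at(j, j) -= at(j, k) * at(j, k);
      if (!(at(j, j) > 0.0)) return false;
      at(j, j) = std::sqrt(at(j, j));
      for (std::size_t i = j + 1; i < j1; ++i) {
        for (std::size_t k = j0; k < j; ++k) at(i, j) -= at(i, k) * at(j, k);
        at(i, j) /= at(j, j);
      }
    }
    // Step 4: D = D (B^{-1})^T, a triangular solve along each row of D.
    for (std::size_t i = j1; i < n; ++i)
      for (std::size_t j = j0; j < j1; ++j) {
        for (std::size_t k = j0; k < j; ++k) at(i, j) -= at(i, k) * at(j, k);
        at(i, j) /= at(j, j);
      }
  }
  return true;
}
```

Step 2 touches only D, C and A, while steps 1 and 3 touch only B and A, which is why they can proceed independently.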

GPU Hardware

• GPU dedicates more die area to data processing
• CPU dedicates more die area to control flow and cache
• A GPU is an array of Streaming Multiprocessors (SMs)
• Each SM has its own shared memory and registers
• Each SM is simpler than a CPU core
  • Better at simple arithmetic (+, −, ×)
  • Worse at complex arithmetic (÷, sin, cos, tan, log, exp, ...)

GPU Programming

• GPUs execute kernel functions written in CUDA-C

    template <typename T>
    __global__ void scale(int n, T alpha, T * x, int incx) {
      // Global index of this thread within the 1D grid
      const int i = blockIdx.x * blockDim.x + threadIdx.x;
      // Threads past the end of the vector do nothing
      if (i < n)
        x[i * incx] *= alpha;
    }

• NVIDIA CUDA compiler converts CUDA-C code into GPU binary files
• CUDA runtime library provides an API to
  • transfer compiled code and data onto the GPU
  • launch kernel functions using a 3D grid of 3D thread blocks

GPU Thread hierarchy

• Each thread block runs on one SM
• Each SM runs more than one thread block
• Within each thread block threads are multitasked in groups of 32 called warps
• Within a warp threads are SIMD
  • Run the same instruction at the same time
• Threads within a block can synchronize with each other
  • Ensures all threads in the block are at the same instruction

GPU Memory hierarchy

GPUs have three types of memory to store data:

• Global memory - large, slow, accessible from all threads (and the CPU)
• Shared memory - small, fast, shared between threads in a block
• Registers - small, fastest, private to each thread

In global and shared memory highest bandwidth is obtained when consecutive threads access consecutive elements.

GPU vs CPU

• CPU better at
  • complex maths (like square root in Cholesky)
  • branching (if/then/else)
• GPU better at
  • simple maths (sums, multiplies)
  • executing operations in parallel

Hybrid Blocked Cholesky decomposition

[Diagram: lower triangle of the matrix partitioned into blocks

    A B
    C D

with B on the diagonal, A to its left, C below A and D below B.]

B starts at top left and moves to bottom right

1: $B = B - AA^T$
2: $D = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $D = D(B^{-1})^T$

Have GPU perform BLAS operations in steps 1, 2 and 4, which contain lots of simple parallel operations, while CPU performs smaller Cholesky decomposition in step 3, which contains more complex serial operations.

GPU Matrix Multiply

$$C = \alpha AB + \beta C$$

• Each element of $C$ is independent
• Have a 2D grid of 2D thread blocks each running $C_{i,j} = \alpha \sum_{l=0}^{k-1} A_{i,l} B_{l,j} + \beta C_{i,j}$ in parallel
• Calculating one element of $C$ requires reading $k$ elements from row $i$ of $A$ and $k$ elements from column $j$ of $B$
• Calculating all of an $m \times n$ matrix $C$ requires reading $2mnk$ elements from global memory
• Problem: reading elements of $A$ and $B$ from global memory is slow

Blocked GPU Matrix Multiply

[Diagram: blocks of A, B and C.]

• Divide $A$ into blocks of $m_b \times k_b$, $B$ into blocks of $k_b \times n_b$ and $C$ into blocks of $m_b \times n_b$
• Store one block of $C$ in registers and process a $1 \times n_b$ row per thread using thread blocks of $m_b \times 1$
• Read blocks of $A$ and $B$ into shared memory and access from all threads in the thread block

Bandwidth Reduction

• There are $\frac{m}{m_b} \times \frac{n}{n_b}$ blocks of $C$
• Calculating all of $C$ now requires reading

  $$\frac{m}{m_b} \times \frac{n}{n_b} \times \frac{k}{k_b} \times m_b \times k_b \text{ elements of } A$$

  and

  $$\frac{m}{m_b} \times \frac{n}{n_b} \times \frac{k}{k_b} \times k_b \times n_b \text{ elements of } B$$

  or

  $$m \times n \times k \times \left( \frac{1}{m_b} + \frac{1}{n_b} \right)$$

  elements in total
• This is

  $$\frac{2}{\frac{1}{m_b} + \frac{1}{n_b}}$$

  times fewer than when no blocking is used ($m_b = 1$ and $n_b = 1$)
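The read counts above can be checked with a small CPU sketch of the blocked multiply. Local buffers stand in for shared memory and registers, and a counter tracks every element of A and B fetched from the large arrays. The function name `blocked_gemm`, the row-major layout and the counter are assumptions made for illustration; this is not the GPU kernel itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked C = alpha*A*B + beta*C on row-major matrices (A is m x k,
// B is k x n, C is m x n). Tiles of A and B are staged in local buffers
// standing in for shared memory; the return value counts elements of A
// and B read from the large ("global") arrays. All names are hypothetical.
std::size_t blocked_gemm(std::size_t m, std::size_t n, std::size_t k,
                         std::size_t mb, std::size_t nb, std::size_t kb,
                         double alpha, const std::vector<double>& a,
                         const std::vector<double>& b, double beta,
                         std::vector<double>& c) {
  std::size_t loads = 0;
  std::vector<double> ta(mb * kb), tb(kb * nb), acc(mb * nb);
  for (std::size_t i0 = 0; i0 < m; i0 += mb)
    for (std::size_t j0 = 0; j0 < n; j0 += nb) {
      const std::size_t mi = std::min(mb, m - i0), nj = std::min(nb, n - j0);
      std::fill(acc.begin(), acc.end(), 0.0);  // block of C kept "in registers"
      for (std::size_t l0 = 0; l0 < k; l0 += kb) {
        const std::size_t kl = std::min(kb, k - l0);
        for (std::size_t i = 0; i < mi; ++i)  // stage a tile of A
          for (std::size_t l = 0; l < kl; ++l, ++loads)
            ta[i * kb + l] = a[(i0 + i) * k + l0 + l];
        for (std::size_t l = 0; l < kl; ++l)  // stage a tile of B
          for (std::size_t j = 0; j < nj; ++j, ++loads)
            tb[l * nb + j] = b[(l0 + l) * n + j0 + j];
        for (std::size_t i = 0; i < mi; ++i)  // multiply the staged tiles
          for (std::size_t j = 0; j < nj; ++j)
            for (std::size_t l = 0; l < kl; ++l)
              acc[i * nb + j] += ta[i * kb + l] * tb[l * nb + j];
      }
      for (std::size_t i = 0; i < mi; ++i)  // write the finished block of C
        for (std::size_t j = 0; j < nj; ++j) {
          double& cij = c[(i0 + i) * n + j0 + j];
          cij = alpha * acc[i * nb + j] + beta * cij;
        }
    }
  return loads;
}
```

For `m = n = k = 4` the counter returns 128 with 1 × 1 blocks and 64 with 2 × 2 blocks, matching $2mnk$ and $mnk(1/m_b + 1/n_b)$ respectively. Note that the block size along $k$ does not appear in the total.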

Compute Bound Matrix Multiply

Theorem
If the bandwidth reduction is greater than the FLOP:word ratio then the algorithm will be compute bound. [1]

• What is the FLOP:word ratio?
  • ratio of floating point operations (FLOPs) performed to memory bandwidth required expressed as a number of elements (words)
• Can be worked out using the GPU documentation (GTX 285)
  • Throughput (FLOPs): 30 SMs × 1.476 GHz × 16 operations per clock cycle = 708.48 GFLOPs/s
  • Bandwidth (words): 512 bit memory interface × 2.484 GHz / 32 bits per word = 39.744 × 10^9 words/s
  • FLOP:word ratio: 708.48 / 39.744 ≈ 17.83
• Choosing $m_b = 64$ and $n_b = 16$ gives a bandwidth reduction of 25.6×
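Written out step by step, and using only the GTX 285 figures quoted above, the arithmetic behind the comparison is:

```latex
\begin{aligned}
\text{throughput} &= 30 \times 1.476 \times 10^{9} \times 16
  = 708.48 \times 10^{9}\ \text{FLOPs/s} \\
\text{bandwidth} &= \frac{512 \times 2.484 \times 10^{9}}{32}
  = 39.744 \times 10^{9}\ \text{words/s} \\
\text{FLOP:word ratio} &= \frac{708.48}{39.744} \approx 17.83 \\
\text{bandwidth reduction} &= \frac{2}{\frac{1}{64} + \frac{1}{16}}
  = \frac{2}{\frac{5}{64}} = \frac{128}{5} = 25.6 > 17.83
\end{aligned}
```

Since 25.6 exceeds the ratio, the theorem's condition holds for this block size and the blocked multiply should be compute bound.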

GPU Symmetric Rank-K Update

$$C = \alpha AA^T + \beta C$$

Similar to matrix multiplication with $B = A^T$ except only the lower half of $C$ is written to

[Diagram: A and the lower triangle of C.]

Can use the same code modified so that thread blocks strictly above the diagonal exit early and those on the diagonal only write to the lower half
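The two modifications can be sketched on the CPU by looping over blocks of C the way a grid of thread blocks would. This is an illustration under stated assumptions, not the GPU code: the name `syrk_lower_blocked`, the row-major layout and the square block size `bs` are choices made here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Lower-triangular C = alpha*A*A^T + beta*C, organised by blocks of C to
// mirror a grid of thread blocks. A is n x k and C is n x n, both row-major
// (layout, names and the square block size `bs` are assumptions).
void syrk_lower_blocked(std::size_t n, std::size_t k, std::size_t bs,
                        double alpha, const std::vector<double>& a,
                        double beta, std::vector<double>& c) {
  for (std::size_t bi = 0; bi < n; bi += bs)
    for (std::size_t bj = 0; bj < n; bj += bs) {
      if (bj > bi) continue;  // block strictly above the diagonal: exit early
      for (std::size_t i = bi; i < std::min(bi + bs, n); ++i)
        for (std::size_t j = bj; j < std::min(bj + bs, n); ++j) {
          if (j > i) continue;  // diagonal block: write the lower half only
          double s = 0.0;
          for (std::size_t l = 0; l < k; ++l) s += a[i * k + l] * a[j * k + l];
          c[i * n + j] = alpha * s + beta * c[i * n + j];
        }
    }
}
```

Elements above the diagonal are never written, so whatever was stored there beforehand is left untouched.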

GPU Triangular Solve

Solves $XA^T = \alpha B$ by calculating

$$B = \alpha B(A^{-1})^T$$

where $X$ overwrites $B$. For lower triangular $A$:

$$B_{i,j} = \frac{1}{A_{j,j}} \left( \alpha B_{i,j} - \sum_{k=0}^{j-1} A_{j,k} B_{i,k} \right)$$

• Each row needs to be updated left-to-right
• Schedule column of thread blocks and use a loop to enforce ordering (slow)

[Diagram: lower triangular A and the rows of B, each updated left to right.]
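The left-to-right dependency is easiest to see in a sequential sketch. The routine below assumes a lower triangular A with a non-zero diagonal and row-major storage; the name `trsm_right_lower_trans` is hypothetical and the code is a CPU illustration, not the GPU kernel.

```cpp
#include <cstddef>
#include <vector>

// Solve X * A^T = alpha * B for X, overwriting B (rows x n, row-major).
// A is n x n lower triangular with a non-zero diagonal, row-major.
// Names and layout are assumptions of this sketch.
void trsm_right_lower_trans(std::size_t rows, std::size_t n, double alpha,
                            const std::vector<double>& a,
                            std::vector<double>& b) {
  for (std::size_t i = 0; i < rows; ++i)   // rows are independent
    for (std::size_t j = 0; j < n; ++j) {  // columns must go left to right
      double s = alpha * b[i * n + j];
      for (std::size_t k = 0; k < j; ++k)
        s -= a[j * n + k] * b[i * n + k];  // uses X(i,k) already computed
      b[i * n + j] = s / a[j * n + j];
    }
}
```

Only the outer loop over rows can run in parallel; the loop over `j` carries the ordering constraint that makes the operation slow on a GPU.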

Hybrid Cholesky Decomposition - Results

[Figure: performance in GFLOPs/s (axis 0 to 180) against matrix size n (axis 0 to 4500).]

Performance reaches 180 GFLOPs/s

Replacing Triangular Solve

• Triangular Solve is slow

  $$B = \alpha B(A^{-1})^T$$

• Contains inverse (slow) and matrix multiplication (fast)
• Separate into $A = A^{-1}$ and $B = \alpha BA^T$

GPU Triangular Matrix Multiply

$$B = \alpha BA^T$$

• Implementation which updates $B$ in place has similar dependencies to triangular solve (so similar performance). For lower triangular $A$:

  $$B_{i,j} = \alpha \sum_{k=0}^{j} A_{j,k} B_{i,k}$$

• Elements of $B$ which have not yet been calculated are used to update the current element
• In an "out of place" implementation each element is independent

  $$X_{i,j} = \alpha \sum_{k=0}^{j} A_{j,k} B_{i,k}$$

Calculating the Inverse

• Have replaced triangular solve $B = \alpha B(A^{-1})^T$ with triangular multiply $X = \alpha BA^T$
• Now need to form inverse of $A$
  • $A$ is diagonal block in blocked Cholesky decomposition
  • Have just computed Cholesky decomposition of $A$ using CPU
  • Cholesky decomposition provides faster calculation of inverse
  • Calculate inverse of diagonal block $A$ on CPU and copy into temporary block on GPU.

Hybrid Cholesky decomposition without triangular solve

[Diagram: lower triangle of the matrix partitioned into blocks

    A B
    C D

with temporary blocks X and Z.]

1: $B = B - AA^T$
2: $X = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $Z = B^{-1}$
5: $D = XZ^T$

Use out of place matrix multiply to populate X then triangular multiply to copy back to D

Hybrid Cholesky decomposition without triangular solve

[Figure: performance in GFLOPs/s (axis 0 to 180) against matrix size n (axis 0 to 4500).]

Improving diagonal block transfer

Each memory copy has overhead

$$t = \frac{n}{\text{bandwidth}} + \text{overhead}$$

Copying matrices

• Matrices are stored in memory as an array of n columns of m elements ("column major")
• Each column is padded so that the next column is aligned on a memory boundary
• Submatrices share the same padding as the larger matrix
• Each column must be copied separately

  $$t(m,n) = n \times \left( \frac{m}{\text{bandwidth}} + \text{overhead} \right)$$

• If the number of rows is a multiple of the memory alignment there is no padding

  $$t(m,n) = \frac{m \times n}{\text{bandwidth}} + \text{overhead}$$

• But submatrices are always padded

Block Column Copy

[Diagram: lower triangle of the matrix partitioned into blocks

    A B
    C D

with temporary blocks X and Z.]

• Define column around B
• No padding when n is a multiple of the memory alignment
• Faster to copy (larger) column when

  $$\frac{n \times n_b}{\text{bandwidth}} + \text{overhead} < n_b \times \left( \frac{n_b}{\text{bandwidth}} + \text{overhead} \right)$$

• Don't have to worry about overwriting updated D as matrix multiply is now out of place
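Rearranging the inequality shows what it asks of the hardware. Writing $\beta$ for the bandwidth and $o$ for the per-copy overhead (both symbols are shorthand introduced here) and assuming $n_b > 1$:

```latex
\begin{aligned}
\frac{n\,n_b}{\beta} + o &< n_b \left( \frac{n_b}{\beta} + o \right) \\
\frac{n\,n_b}{\beta} - \frac{n_b^2}{\beta} &< n_b\,o - o \\
\frac{(n - n_b)\,n_b}{\beta} &< (n_b - 1)\,o \\
o &> \frac{(n - n_b)\,n_b}{(n_b - 1)\,\beta}
\end{aligned}
```

The left side of the third line is the time spent moving the extra $(n - n_b) \times n_b$ elements of the full column, and the right side is the overhead saved by issuing one copy instead of $n_b$. The single large copy wins when the saved overhead is the larger of the two.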

Block Column Copy - results

[Figure: performance in GFLOPs/s (axis 0 to 350) against matrix size n (axis 0 to 4500).]

Page 79: A hybrid Cholesky decomposition algorithm for multicore CPUs with

Tuning the block size

For CPU blocked algorithms the block size is chosen so that blocks fit in the CPU cache
Introduced an extra level of blocking for the hybrid algorithm
Aim to choose the block size so that the workload is balanced between computing devices

Static block size

Aim to minimise area between two curves

Dynamic block size

Can change the block size on each iteration and still have a correct algorithm
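The correctness claim can be checked with a small CPU-only sketch. It assumes a right-looking blocked factorisation of the lower triangle, which may differ from the variant used in the talk, and the `block_sizes` schedule is a hypothetical input rather than the tuning rule itself.

```python
import math


def cholesky_dynamic(a, block_sizes):
    """Return the lower Cholesky factor of a, taking a new block size each iteration.

    a: symmetric positive definite matrix as a list of rows.
    block_sizes: iterable of requested block sizes (hypothetical schedule);
    once it is exhausted, the remainder is treated as a single block.
    """
    n = len(a)
    fac = [row[:] for row in a]  # work on a copy of the input
    sizes = iter(block_sizes)
    lo = 0
    while lo < n:
        nb = max(1, min(next(sizes, n - lo), n - lo))
        hi = lo + nb
        # Factor the diagonal block and solve for the panel below it.
        for j in range(lo, hi):
            d = fac[j][j] - sum(fac[j][k] ** 2 for k in range(lo, j))
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            fac[j][j] = math.sqrt(d)
            for i in range(j + 1, n):
                s = sum(fac[i][k] * fac[j][k] for k in range(lo, j))
                fac[i][j] = (fac[i][j] - s) / fac[j][j]
        # Trailing update of the remaining lower triangle (the matrix-multiply step).
        for i in range(hi, n):
            for j in range(hi, i + 1):
                fac[i][j] -= sum(fac[i][k] * fac[j][k] for k in range(lo, hi))
        lo = hi
    # Zero the strict upper triangle so that only L is returned.
    for i in range(n):
        for j in range(i + 1, n):
            fac[i][j] = 0.0
    return fac
```

Calling it with schedules such as `[1, 1, 1, 1, 1, 1]`, `[2, 3, 1]` or `[6]` on the same 6 × 6 matrix gives the same factor up to rounding, because the block boundaries only change the order in which the same updates are applied.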

Dynamic block size - results

[Figure: performance in GFLOPs/s (0 to 350) against n (0 to 4500)]

GPU Cholesky decomposition

Now that the block size changes on each iteration it can get too small for the matrix multiply to sufficiently overlap the 2× copy and Cholesky decomposition on the CPU
Implement unblocked Cholesky decomposition for the GPU
Due to data dependencies only runs on one SM so that threads can share results
Uses triangular packed storage to make most efficient use of shared memory and get large number of threads
GPU is already performing matrix multiply
Possible to overlap both on the GPU?
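To make the shared memory argument concrete, the sketch below shows one common packed layout: the lower triangle stored column by column, as in LAPACK's packed format. The slides do not say which layout the kernel uses, so the layout, the function names and the capacity figure in the usage note are assumptions.

```python
import math


def packed_index(i, j, n):
    """Offset of element (i, j), with i >= j, in a column-major packed lower triangle."""
    if not 0 <= j <= i < n:
        raise IndexError("(i, j) must lie in the lower triangle of an n x n matrix")
    # j * (2n - j - 1) is always even, so the integer division is exact.
    return i + j * (2 * n - j - 1) // 2


def largest_full(capacity):
    """Largest n such that a full n x n matrix fits in `capacity` elements."""
    return math.isqrt(capacity)


def largest_packed(capacity):
    """Largest n such that a packed triangle of n(n + 1)/2 elements fits."""
    return (math.isqrt(8 * capacity + 1) - 1) // 2
```

For a hypothetical capacity of 6144 elements (48 KB of double-precision values), `largest_full(6144)` is 78 whereas `largest_packed(6144)` is 110, so packing lets a noticeably larger diagonal block stay resident in shared memory.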


Multiple kernels

nVidia CUDA Programming Guide (2008): maximum performance occurs when no threads execute divergent branches
nVidia CUDA Programming Guide (2011): maximum performance occurs when no threads within the same warp execute divergent branches
Write combined matrix multiply and unblocked Cholesky decomposition kernel
First n − 1 thread blocks perform matrix multiplication
Thread block n performs Cholesky decomposition
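The 2011 wording matters because a branch on the thread block index is uniform within every warp. The sketch below is a host-side Python model rather than CUDA code: the role labels and function names are hypothetical, and it only counts which warps would contain threads taking different branches.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def role_by_block(block_idx, thread_idx, num_blocks):
    """Branch on the block index: the last block factorises, the rest multiply."""
    return "cholesky" if block_idx == num_blocks - 1 else "gemm"


def role_by_thread(block_idx, thread_idx, num_blocks):
    """Counter-example: branch on the thread index within a block."""
    return "cholesky" if thread_idx % 2 == 0 else "gemm"


def divergent_warps(role, num_blocks, threads_per_block):
    """Count warps whose threads do not all take the same branch."""
    count = 0
    for b in range(num_blocks):
        for start in range(0, threads_per_block, WARP_SIZE):
            stop = min(start + WARP_SIZE, threads_per_block)
            roles = {role(b, t, num_blocks) for t in range(start, stop)}
            if len(roles) > 1:
                count += 1
    return count
```

With 8 blocks of 64 threads, `divergent_warps(role_by_block, 8, 64)` is 0, while `divergent_warps(role_by_thread, 8, 64)` is 16: dispatching on the block index keeps every warp on a single code path, which is what allows the two operations to share one kernel.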


Multiple Kernels - results

[Figure: results plot, vertical axis 0 to 350, horizontal axis 0 to 4500]

Final Results

[Figure: results plot, vertical axis 0 to 350, horizontal axis 0 to 4500]

Conclusions

Possible to get large speed improvements for inherently sequential algorithms such as the Cholesky decomposition by carefully considering the structure of the algorithm and the type of operations performed at each step.
Having a GPU available allows processing to overlap, making maximum use of the available parallelism.
Different types of parallel workloads can be sent to the most appropriate device.


References

Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.