Batched Gauss-Jordan Elimination for Block-Jacobi ...hanzt/talks/BGJE.pdf · Emilia_923 G3_circuitHook_1498ML_GeerFlan_1565 10-5 10-4 10-3 10-2 10-1 Block Jacobi generation time [s]

Batched Gauss-Jordan Elimination forBlock-Jacobi Preconditioner Generation on GPUsHartwig Anzt, Jack Dongarra, Goran Flegar, Enrique Quintana-OrtiPMAM’17, Austin, TX02/05/2017

http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf

2

• Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM)

• Often superior to scalar Jacobi preconditioner

• Light-weight preconditioner (compared to ILU)

• Parallel preconditioner application ( batched trsv / SpMV + vector scaling )

• Factorization of diag. blocks in prec. setup + trsv in prec. application

• Inversion of diag. blocks in prec. setup + SpMV in prec. application

Block-Jacobi preconditioner

3








4

• Preconditioner setup (inversion of diagonal blocks) can become a challenge

• Batched inversion of many small systems

• Diagonal blocks are typically of small size








5

......

extract diagonal block from sparse data structure

1.

2.

insert solution as diagonal block into the preconditioner matrix

invert diagonal block via Gauss-Jordan elimination

3.

Block-Jacobi preconditioner generation

6

......


1.

2.



3.


7

• Limit the size to 32 x 32

• Use one warp per system

• Inversion in registers

• Communicate via __shfl()

• Implicit pivoting: • accumulate row/col swaps • Realize them when writing results

• No operations on triangular systems

• Implicit pivoting

• Implementation can use static system size

Advantages over LU-based approach:

Batched Gauss-Jordan elimination (BGJE) on GPUs

8

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Batch size ×105

0

2

4

6

8

10

12

GF

lop

/s

K40

K80

K20

GJELU

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Batch size ×105

0

20

40

60

80

100

120

140

GF

lop

/s

K40

K20

K80

GJELU

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Batch size ×105

0

5

10

15

20

25

30

35

GF

lop

/s

GJELU

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Batch size ×105

0

100

200

300

400

500

600

GF

lop

/s

GJELU

NVIDIA Pascal architecture (P100)

NVIDIA Kepler architecture (K20/K40/K80)

Inversion 8 x 8

Inversion 32 x 32

Inversion 32 x 32

Inversion 8 x 8

9

NVIDIA Pascal architecture (P100)

0 1 2 3 4 5

Batch size ×104

0

100

200

300

400

500

600

GF

lop

/s

Gauss-Jordan elimination (2 n3 Flops)

MAGMA dgetrf (2/3 n3 Flops)

cuBLAS dgetrf (2/3 n 3 Flops)

0 1 2 3 4 5

Batch size ×104

0

0.5

1

1.5

2

2.5

3

3.5

time/m

atr

ix [m

s]

×10-3

Gauss-Jordan elimination (Inversion)MAGMA dgetrf (Factorization)cuBLAS dgetrf (Factorization)

• Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak

• BGJE (inversion) completes 2x faster than factorization only

• 60 SMX, 5.3 TF DP, 720 GB/s ( 7.36 : 1 )

10

......


1.

2.



3.


11

Challenge: diagonal block extraction from sparse data structures

......


1.

2.



3.

• Matrix in CSR format

• Diagonal block extraction costly

• Assign one thread per row

• Unbalanced nonzero distribution results in load imbalance

• “Dense” rows become bottleneck

• Uncoalescent memory access

12








shared memoryextraction

memory requests

...

...

...

...

...

...

...

...

......

......

......

......

......

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

cachedextraction

1

2

3

4

5

col-indicesrow-ptr col-indicesrow-ptr shared_mem

13








• Add intermediate step:• Extract diag. blocks to shared mem.• Multiple threads for each row• Coalescent access to data in CSR• Col-major storage


memory requests

...

...

...

...

...

...

...

...

......

......

......

......

......

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

cachedextraction

1

2

3

4

5



memory requests

...

...

...

...

...

...

...

...

......

......

......

......

......

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

cachedextraction

1

2

3

4

5


14









15









• Merge kernels vs. 3 separate kernels:[ extraction + inversion + insertion ][ extraction ] + [ inversion ] + [ insertion ]


memory requests

...

...

...

...

...

...

...

...

......

......

......

......

......

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

cachedextraction

1

2

3

4

5


0 10 20 30 40 50 60Test matrices

0

2

4

6

8

10

12

BJ

genera

tion tim

e / B

GJE

tim

e USUCMSMC

0 10 20 30 40 50 60Test matrices

1

2

3

4

5

6

7

BJ

ge

ne

ratio

n t

ime

/ B

GJE

tim

e USUCMSMC

16

K40

P100

Diagonal block extraction from sparse data structures

Results show:

• Runtime block-Jacobi vs. BGJE

• Jacobi block size 32

• 56 test matrices (SuiteSparse)

• Data arranged w.r.t. performanceof (avg.) best strategy

US: unmerged + shared extractionUC: unmerged + cached extractionMS: merged + shared extractionMC: merged + cached extraction

0 10 20 30 40 50 60Test matrices

1

2

3

4

5

6

7

BJ

ge

ne

ratio

n t

ime

/ B

GJE

tim

e USUCMSMC

0 10 20 30 40 50 60Test matrices

1

2

3

4

5

6

BJ

ge

ne

ratio

n t

ime

/ B

GJE

tim

e USUCMSMC

17

K40

P100


Results show:

• Runtime block-Jacobi vs. BGJE

• Jacobi block size 8

• 56 test matrices (SuiteSparse)

• Data arranged w.r.t. performanceof (avg.) best strategy

US: unmerged + shared extractionUC: unmerged + cached extractionMS: merged + shared extractionMC: merged + cached extraction

18


0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size

×104

1

1.5

2

2.5

3

3.5

4

4.5

BJ

genera

tion tim

e / B

GJE

tim

e USUCMSMC

Tridiagonal-type

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size

×104

0

5

10

15

20

25

BJ

ge

ne

ratio

n t

ime

/ B

GJE

tim

e USUCMSMC

Arrow-type

Tridiagonal-type matrix structure:- balanced nnz/row- few nnz per row- each thread has similar work- shared memory and cache-based strategies withsimilar performance

Arrow-type matrix structure:- few rows with many elements- corresponding threads are bottleneck for cached extraction

K40

K40

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size

×104

0.5

1

1.5

2

2.5

3

BJ

genera

tion tim

e / B

GJE

tim

e USUCMSMC

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size

×104

0

5

10

15

20

BJ

genera

tion tim

e / B

GJE

tim

e USUCMSMC

19


Tridiagonal-type

Arrow-type

P100

P100Tridiagonal-type matrix structure:

- balanced nnz/row- few nnz per row- each thread has similar work- shared memory and cache-based strategies withsimilar performance

Arrow-type matrix structure:- few rows with many elements- corresponding threads are bottleneck for cached extraction

Cheby

shev

2

dw10

24

dw20

48

LeGre

sley_

2508

saylr

4

Cheby

shev

3

olm

5000

sts4

098

piston

nasa

2910

s3rm

t3m

3

dw40

96

dw81

92

s3rm

t3m

1

s2rm

t3m

1

s1rm

t3m

1

s3rm

q4m

1

s2rm

q4m

1

linve

rse

CurlC

url_0Kuu

bcss

tk18

nem

eth1

5

bcss

tk17

bcss

tk38

Pres_

Poiss

on

ABACUS_she

ll_ud

spm

srtls

cbuc

kle

msc

1084

8

cz40

948

gridge

na

2D_5

4019

_highKnd

3k

ibm

_mat

rix_2

3D_5

1448

_3D

rail_

7984

1

gas_

sens

or

sme3

Dc

mat

rix_9 dc

3nd

6k

mat

rix-n

ew_3

ship_0

03

cran

kseg

_1

nd12

k

CurlC

url_1

nd24

k

af_s

hell3 F1

ecolog

y2

G3_

circu

it

Emilia

_923

Hook_

1498

Flan_

1565

ML_

Gee

r10-5

10-4

10-3

10-2

10-1

Blo

ck J

aco

bi g

enera

tion tim

e [s]

Block size 4Block size 8Block size 16Block size 24Block size 32

20

Block-Jacobi preconditioner setup

K40

• Preconditioner setup needs up to 0.1 s on K40 GPU.

21

Block-Jacobi preconditioner setup

K40

Cheby

shev

2

dw20

48

dw10

24

LeGre

sley_

2508

Cheby

shev

3

saylr

4

sts4

098

olm

5000

nasa

2910

s2rm

q4m

1

s2rm

t3m

1

s3rm

t3m

1

s3rm

t3m

3

s3rm

q4m

1

s1rm

t3m

1

dw81

92

dw40

96Kuu

bcss

tk38

CurlC

url_0

linve

rse

nem

eth1

5

bcss

tk18

piston

bcss

tk17

nd3k

cbuc

kle

Pres_

Poiss

on

msc

1084

8

ABACUS_she

ll_ud

nd6k

spm

srtls

cz40

948

gridge

na

ibm

_mat

rix_2

sme3

Dc

3D_5

1448

_3D

2D_5

4019

_highK

nd12

k

gas_

sens

or

rail_

7984

1

cran

kseg

_1

mat

rix_9dc

3

mat

rix-n

ew_3

nd24

k

ship_0

03

CurlC

url_1 F1

af_s

hell3

ecolog

y2

Emilia

_923

G3_

circu

it

Hook_

1498

ML_

Gee

r

Flan_

1565

10-5

10-4

10-3

10-2

10-1

Blo

ck J

aco

bi g

enera

tion tim

e [s]

Block size 4Block size 8Block size 16Block size 24Block size 32

P100

• Preconditioner setup needs up to 0.1 s on K40 GPU.

• Preconditioner setup needs up to 0.01 s on P100 GPU.

• Irrelevant for overall Iterative solver execution time.

22

This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (University of Jaume I), and partly funded by the Department of Energy.

http://icl.cs.utk.edu/magma/

All functionalities are part of the MAGMA-sparse project.

http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf

Documents

Batched Gauss-Jordan Elimination for Block-Jacobi ...hanzt/talks/BGJE.pdf · Emilia_923 G3_circuitHook_1498ML_GeerFlan_1565 10-5 10-4 10-3 10-2 10-1 Block Jacobi generation time [s]