Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
Hartwig Anzt, Jack Dongarra, Goran Flegar, Enrique Quintana-Orti
PMAM’17, Austin, TX, 02/05/2017
http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf


Block-Jacobi preconditioner

• Can reflect the block structure of sparse matrices (e.g. higher-order FEM)

• Often superior to the scalar Jacobi preconditioner

• Lightweight preconditioner (compared to ILU)

• Parallel preconditioner application (batched trsv / SpMV + vector scaling)

• Factorization of diagonal blocks in preconditioner setup + trsv in preconditioner application

• Inversion of diagonal blocks in preconditioner setup + SpMV in preconditioner application


• Preconditioner setup (inversion of diagonal blocks) can become a challenge

• Batched inversion of many small systems

• Diagonal blocks are typically of small size

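With explicitly inverted diagonal blocks, applying the preconditioner reduces to one small dense matrix-vector product per block, and the blocks are independent of each other. A minimal sequential sketch of that idea, assuming the inverted blocks are stored as a list of row-major nested lists (the name `apply_block_jacobi` and this layout are illustrative assumptions, not the MAGMA-sparse interface):

```python
def apply_block_jacobi(inv_blocks, r):
    """Return z = M^{-1} r for a block-diagonal M^{-1}.

    inv_blocks: list of already inverted diagonal blocks (nested lists).
    r: flat vector whose length equals the sum of the block sizes.
    """
    z = []
    offset = 0
    for block in inv_blocks:  # blocks are independent of each other
        n = len(block)
        segment = r[offset:offset + n]
        # small dense matrix-vector product for this block
        z.extend(sum(block[i][j] * segment[j] for j in range(n))
                 for i in range(n))
        offset += n
    return z


# Hypothetical example: one 2 x 2 block and one 1 x 1 block, already inverted.
inv_blocks = [[[0.5, 0.0], [0.0, 0.25]], [[2.0]]]
z = apply_block_jacobi(inv_blocks, [2.0, 4.0, 3.0])
```

On a GPU the loop over blocks would be the parallel dimension; the sketch only shows the arithmetic.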

Block-Jacobi preconditioner generation

1. Extract diagonal block from sparse data structure

2. Invert diagonal block via Gauss-Jordan elimination

3. Insert solution as diagonal block into the preconditioner matrix


Batched Gauss-Jordan elimination (BGJE) on GPUs

• Limit the size to 32 x 32

• Use one warp per system

• Inversion in registers

• Communicate via __shfl()

• Implicit pivoting:
  • Accumulate row/col swaps
  • Realize them when writing results

Advantages over LU-based approach:

• No operations on triangular systems

• Implicit pivoting

• Implementation can use static system size
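The implicit pivoting idea can be illustrated with a sequential sketch: rows are never physically swapped during elimination, the chosen pivot rows are only recorded, and the accumulated permutation is applied once when the inverse is written out. This is a plain Python illustration under those assumptions, not the warp-level register kernel from the talk; the function name `gauss_jordan_inverse` is hypothetical.

```python
def gauss_jordan_inverse(a):
    """Invert a small dense matrix by in-place Gauss-Jordan elimination
    with implicit (recorded, not executed) row pivoting."""
    n = len(a)
    w = [row[:] for row in a]   # working copy; rows stay where they are
    perm = []                   # perm[k] = row chosen as pivot in step k
    used = [False] * n
    for k in range(n):
        # pivot: largest magnitude in column k among rows not used so far
        p = max((i for i in range(n) if not used[i]),
                key=lambda i: abs(w[i][k]))
        if w[p][k] == 0.0:
            raise ValueError("matrix is singular")
        used[p] = True
        perm.append(p)
        d = w[p][k]
        w[p][k] = 1.0                      # column k now stores the inverse
        w[p] = [x / d for x in w[p]]
        for i in range(n):
            if i != p:
                f = w[i][k]
                w[i][k] = 0.0
                w[i] = [x - f * y for x, y in zip(w[i], w[p])]
    # realize the accumulated row and column permutation while writing results
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            inv[i][perm[k]] = w[perm[i]][k]
    return inv


inv = gauss_jordan_inverse([[0.0, 2.0], [1.0, 1.0]])
```

The example matrix has a zero in the leading position, so the first pivot is taken from the second row without moving any data.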

[Figure: batched inversion performance in GFlop/s versus batch size (up to 2 × 10^5), GJE compared with LU. Four panels: NVIDIA Kepler architecture (K20/K40/K80) and NVIDIA Pascal architecture (P100), each for inversion of 8 x 8 and 32 x 32 systems.]

NVIDIA Pascal architecture (P100)

[Figure, first panel: performance in GFlop/s versus batch size (up to 5 × 10^4) for Gauss-Jordan elimination (2n^3 flops), MAGMA dgetrf (2/3 n^3 flops), and cuBLAS dgetrf (2/3 n^3 flops). Second panel: time per matrix [ms] (axis scaled by 10^-3) versus batch size for Gauss-Jordan elimination (inversion), MAGMA dgetrf (factorization), and cuBLAS dgetrf (factorization).]

• Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak

• BGJE (inversion) completes 2x faster than the factorization alone

• P100: 60 SMX, 5.3 TFlop/s DP peak, 720 GB/s memory bandwidth (flop-to-byte ratio 7.36 : 1)
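The numbers quoted above can be checked with a few lines of arithmetic. The sketch below uses only the flop counts (2n^3 for Gauss-Jordan inversion, 2/3 n^3 for the LU factorization) and the P100 figures from the slide; the variable and function names are illustrative assumptions.

```python
def gje_flops(n):
    """Flop count for inverting an n x n matrix via Gauss-Jordan elimination."""
    return 2 * n**3


def lu_flops(n):
    """Flop count for the LU factorization (dgetrf) of an n x n matrix."""
    return 2 * n**3 / 3


peak_dp_gflops = 5300.0   # 5.3 TFlop/s double-precision peak
bandwidth_gbs = 720.0     # memory bandwidth in GB/s

flop_to_byte = peak_dp_gflops / bandwidth_gbs   # about 7.36
ten_percent_of_peak = 0.1 * peak_dp_gflops      # about 530 GFlop/s
work_ratio = gje_flops(32) / lu_flops(32)       # inversion does 3x the flops
```

For a 32 x 32 system the inversion performs three times the arithmetic of the factorization alone, which puts the reported 2x shorter runtime in context.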


Challenge: diagonal block extraction from sparse data structures


• Matrix in CSR format

• Diagonal block extraction costly

• Assign one thread per row

• Unbalanced nonzero distribution results in load imbalance

• “Dense” rows become bottleneck

• Non-coalesced memory access


• Add intermediate step:
  • Extract diagonal blocks to shared memory
  • Multiple threads for each row
  • Coalesced access to data in CSR
  • Column-major storage


• Merged kernel vs. 3 separate kernels:
  • [ extraction + inversion + insertion ]
  • [ extraction ] + [ inversion ] + [ insertion ]

[Figure: memory requests for the two strategies, cached extraction (row-ptr, col-indices) and shared memory extraction (row-ptr, col-indices, shared_mem).]
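The extraction step itself can be sketched sequentially: walk the CSR rows that belong to one diagonal block and copy the entries whose column index falls inside the block into a dense, column-major buffer. This is a simplified illustration of the data movement only; it does not model threads, shared memory, or coalescing, and the function name `extract_diagonal_block` and the example matrix are assumptions.

```python
def extract_diagonal_block(row_ptr, col_idx, values, start, size):
    """Copy the size x size diagonal block beginning at row/column `start`
    out of a CSR matrix into a dense column-major list."""
    block = [0.0] * (size * size)   # entry (i, j) lives at j * size + i
    for i in range(size):           # one CSR row of the block per iteration
        row = start + i
        for pos in range(row_ptr[row], row_ptr[row + 1]):
            j = col_idx[pos] - start
            if 0 <= j < size:       # keep only entries inside the block
                block[j * size + i] = values[pos]
    return block


# Hypothetical 4 x 4 matrix in CSR format:
# [[4, 1, 0, 0],
#  [2, 5, 0, 3],
#  [0, 0, 6, 0],
#  [7, 0, 8, 9]]
row_ptr = [0, 2, 5, 6, 9]
col_idx = [0, 1, 0, 1, 3, 2, 0, 2, 3]
values = [4.0, 1.0, 2.0, 5.0, 3.0, 6.0, 7.0, 8.0, 9.0]

first = extract_diagonal_block(row_ptr, col_idx, values, 0, 2)
second = extract_diagonal_block(row_ptr, col_idx, values, 2, 2)
```

The inner loop length depends on the number of nonzeros in the row, which is why a one-thread-per-row mapping suffers when a few rows are much denser than the rest.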

Diagonal block extraction from sparse data structures

[Figure: block-Jacobi (BJ) generation time / BGJE time for each test matrix, for the strategies US, UC, MS, and MC. Two panels: K40 and P100.]

Results show:

• Runtime block-Jacobi vs. BGJE

• Jacobi block size 32

• 56 test matrices (SuiteSparse)

• Data arranged w.r.t. performance of (avg.) best strategy

US: unmerged + shared extraction
UC: unmerged + cached extraction
MS: merged + shared extraction
MC: merged + cached extraction

[Figure: block-Jacobi (BJ) generation time / BGJE time for each test matrix, for the strategies US, UC, MS, and MC. Two panels: K40 and P100.]

Results show:

• Runtime block-Jacobi vs. BGJE

• Jacobi block size 8

• 56 test matrices (SuiteSparse)

• Data arranged w.r.t. performance of (avg.) best strategy

[Figure: BJ generation time / BGJE time versus matrix size (up to 2 × 10^4) on K40 for the strategies US, UC, MS, and MC. Two panels: tridiagonal-type and arrow-type.]

Tridiagonal-type matrix structure:
- balanced nnz/row
- few nnz per row
- each thread has similar work
- shared memory and cache-based strategies with similar performance

Arrow-type matrix structure:
- few rows with many elements
- corresponding threads are bottleneck for cached extraction

[Figure: BJ generation time / BGJE time versus matrix size (up to 2 × 10^4) on P100 for the strategies US, UC, MS, and MC. Two panels: tridiagonal-type and arrow-type.]
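The difference between the two matrix structures can be made concrete by counting nonzeros per row, since with one thread per row the densest row determines how long the slowest thread runs. The sketch below assumes a particular arrow shape (dense first row, all other rows holding a diagonal entry plus one entry in the first column); that shape and the helper names are illustrative assumptions, not the exact test matrices from the talk.

```python
def tridiagonal_row_nnz(n):
    """Nonzeros per row of an n x n tridiagonal matrix (n >= 2)."""
    return [2 if i in (0, n - 1) else 3 for i in range(n)]


def arrow_row_nnz(n):
    """Nonzeros per row of an assumed arrow-type matrix:
    dense first row, remaining rows with diagonal + first-column entry."""
    return [n] + [2] * (n - 1)


def imbalance(row_nnz):
    """Ratio of the densest row to the average row length.
    A value near 1 means every thread has similar work."""
    return max(row_nnz) / (sum(row_nnz) / len(row_nnz))


n = 1000
tri = imbalance(tridiagonal_row_nnz(n))    # close to 1
arrow = imbalance(arrow_row_nnz(n))        # grows with n
```

Both matrices hold the same total number of nonzeros in this setup, so the gap in the ratio comes purely from how the nonzeros are distributed across rows.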

Block-Jacobi preconditioner setup

[Figure: block-Jacobi generation time [s] (log scale, 10^-5 to 10^-1) on K40 for each test matrix, for block sizes 4, 8, 16, 24, and 32.]

• Preconditioner setup needs up to 0.1 s on K40 GPU.

[Figure: block-Jacobi generation time [s] (log scale, 10^-5 to 10^-1) on P100 for each test matrix, for block sizes 4, 8, 16, 24, and 32.]


• Preconditioner setup needs up to 0.01 s on P100 GPU.

• Irrelevant for the overall iterative solver execution time.


This research is a collaboration between Hartwig Anzt and Jack Dongarra (University of Tennessee) and Goran Flegar and Enrique Quintana-Orti (Universitat Jaume I), and is partly funded by the Department of Energy.

All functionalities are part of the MAGMA-sparse project.

http://icl.cs.utk.edu/magma/

http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf