Upload
others
View
17
Download
0
Embed Size (px)
Citation preview
Batched Gauss-Jordan Elimination forBlock-Jacobi Preconditioner Generation on GPUsHartwig Anzt, Jack Dongarra, Goran Flegar, Enrique Quintana-OrtiPMAM’17, Austin, TX02/05/2017
http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf
2
• Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM)
• Often superior to scalar Jacobi preconditioner
• Light-weight preconditioner (compared to ILU)
• Parallel preconditioner application ( batched trsv / SpMV + vector scaling )
• Factorization of diag. blocks in prec. setup + trsv in prec. application
• Inversion of diag. blocks in prec. setup + SpMV in prec. application
Block-Jacobi preconditioner
3
• Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM)
• Often superior to scalar Jacobi preconditioner
• Light-weight preconditioner (compared to ILU)
• Parallel preconditioner application ( batched trsv / SpMV + vector scaling )
• Factorization of diag. blocks in prec. setup + trsv in prec. application
• Inversion of diag. blocks in prec. setup + SpMV in prec. application
Block-Jacobi preconditioner
4
• Preconditioner setup (inversion of diagonal blocks) can become a challenge
• Batched inversion of many small systems
• Diagonal blocks are typically of small size
• Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM)
• Often superior to scalar Jacobi preconditioner
• Light-weight preconditioner (compared to ILU)
• Parallel preconditioner application ( batched trsv / SpMV + vector scaling )
• Factorization of diag. blocks in prec. setup + trsv in prec. application
• Inversion of diag. blocks in prec. setup + SpMV in prec. application
Block-Jacobi preconditioner
5
......
extract diagonal block from sparse data structure
1.
2.
insert solution as diagonal block into the preconditioner matrix
invert diagonal block via Gauss-Jordan elimination
3.
Block-Jacobi preconditioner generation
6
......
extract diagonal block from sparse data structure
1.
2.
insert solution as diagonal block into the preconditioner matrix
invert diagonal block via Gauss-Jordan elimination
3.
Block-Jacobi preconditioner generation
7
• Limit the size to 32 x 32
• Use one warp per system
• Inversion in registers
• Communicate via __shfl()
• Implicit pivoting: • accumulate row/col swaps • Realize them when writing results
• No operations on triangular systems
• Implicit pivoting
• Implementation can use static system size
Advantages over LU-based approach:
Batched Gauss-Jordan elimination (BGJE) on GPUs
8
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Batch size ×105
0
2
4
6
8
10
12
GF
lop
/s
K40
K80
K20
GJELU
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Batch size ×105
0
20
40
60
80
100
120
140
GF
lop
/s
K40
K20
K80
GJELU
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Batch size ×105
0
5
10
15
20
25
30
35
GF
lop
/s
GJELU
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Batch size ×105
0
100
200
300
400
500
600
GF
lop
/s
GJELU
NVIDIA Pascal architecture (P100)
NVIDIA Kepler architecture (K20/K40/K80)
Inversion 8 x 8
Inversion 32 x 32
Inversion 32 x 32
Inversion 8 x 8
9
NVIDIA Pascal architecture (P100)
0 1 2 3 4 5
Batch size ×104
0
100
200
300
400
500
600
GF
lop
/s
Gauss-Jordan elimination (2 n3 Flops)
MAGMA dgetrf (2/3 n3 Flops)
cuBLAS dgetrf (2/3 n 3 Flops)
0 1 2 3 4 5
Batch size ×104
0
0.5
1
1.5
2
2.5
3
3.5
time/m
atr
ix [m
s]
×10-3
Gauss-Jordan elimination (Inversion)MAGMA dgetrf (Factorization)cuBLAS dgetrf (Factorization)
• Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak
• BGJE (inversion) completes 2x faster than factorization only
• 60 SMX, 5.3 TF DP, 720 GB/s ( 7.36 : 1 )
10
......
extract diagonal block from sparse data structure
1.
2.
insert solution as diagonal block into the preconditioner matrix
invert diagonal block via Gauss-Jordan elimination
3.
Block-Jacobi preconditioner generation
11
Challenge: diagonal block extraction from sparse data structures
......
extract diagonal block from sparse data structure
1.
2.
insert solution as diagonal block into the preconditioner matrix
invert diagonal block via Gauss-Jordan elimination
3.
• Matrix in CSR format
• Diagonal block extraction costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become bottleneck
• Uncoalescent memory access
12
Challenge: diagonal block extraction from sparse data structures
• Matrix in CSR format
• Diagonal block extraction costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become bottleneck
• Uncoalescent memory access
shared memoryextraction
memory requests
...
...
...
...
...
...
...
...
......
......
......
......
......
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
cachedextraction
1
2
3
4
5
col-indicesrow-ptr col-indicesrow-ptr shared_mem
13
Challenge: diagonal block extraction from sparse data structures
• Matrix in CSR format
• Diagonal block extraction costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become bottleneck
• Uncoalescent memory access
• Add intermediate step:• Extract diag. blocks to shared mem.• Multiple threads for each row• Coalescent access to data in CSR• Col-major storage
shared memoryextraction
memory requests
...
...
...
...
...
...
...
...
......
......
......
......
......
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
cachedextraction
1
2
3
4
5
col-indicesrow-ptr col-indicesrow-ptr shared_mem
shared memoryextraction
memory requests
...
...
...
...
...
...
...
...
......
......
......
......
......
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
cachedextraction
1
2
3
4
5
col-indicesrow-ptr col-indicesrow-ptr shared_mem
14
Challenge: diagonal block extraction from sparse data structures
• Matrix in CSR format
• Diagonal block extraction costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become bottleneck
• Uncoalescent memory access
• Add intermediate step:• Extract diag. blocks to shared mem.• Multiple threads for each row• Coalescent access to data in CSR• Col-major storage
15
Challenge: diagonal block extraction from sparse data structures
• Matrix in CSR format
• Diagonal block extraction costly
• Assign one thread per row
• Unbalanced nonzero distribution results in load imbalance
• “Dense” rows become bottleneck
• Uncoalescent memory access
• Add intermediate step:• Extract diag. blocks to shared mem.• Multiple threads for each row• Coalescent access to data in CSR• Col-major storage
• Merge kernels vs. 3 separate kernels:[ extraction + inversion + insertion ][ extraction ] + [ inversion ] + [ insertion ]
shared memoryextraction
memory requests
...
...
...
...
...
...
...
...
......
......
......
......
......
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
cachedextraction
1
2
3
4
5
col-indicesrow-ptr col-indicesrow-ptr shared_mem
0 10 20 30 40 50 60Test matrices
0
2
4
6
8
10
12
BJ
genera
tion tim
e / B
GJE
tim
e USUCMSMC
0 10 20 30 40 50 60Test matrices
1
2
3
4
5
6
7
BJ
ge
ne
ratio
n t
ime
/ B
GJE
tim
e USUCMSMC
16
K40
P100
Diagonal block extraction from sparse data structures
Results show:
• Runtime block-Jacobi vs. BGJE
• Jacobi block size 32
• 56 test matrices (SuiteSparse)
• Data arranged w.r.t. performanceof (avg.) best strategy
US: unmerged + shared extractionUC: unmerged + cached extractionMS: merged + shared extractionMC: merged + cached extraction
0 10 20 30 40 50 60Test matrices
1
2
3
4
5
6
7
BJ
ge
ne
ratio
n t
ime
/ B
GJE
tim
e USUCMSMC
0 10 20 30 40 50 60Test matrices
1
2
3
4
5
6
BJ
ge
ne
ratio
n t
ime
/ B
GJE
tim
e USUCMSMC
17
K40
P100
Diagonal block extraction from sparse data structures
Results show:
• Runtime block-Jacobi vs. BGJE
• Jacobi block size 8
• 56 test matrices (SuiteSparse)
• Data arranged w.r.t. performanceof (avg.) best strategy
US: unmerged + shared extractionUC: unmerged + cached extractionMS: merged + shared extractionMC: merged + cached extraction
18
Diagonal block extraction from sparse data structures
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size
×104
1
1.5
2
2.5
3
3.5
4
4.5
BJ
genera
tion tim
e / B
GJE
tim
e USUCMSMC
Tridiagonal-type
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size
×104
0
5
10
15
20
25
BJ
ge
ne
ratio
n t
ime
/ B
GJE
tim
e USUCMSMC
Arrow-type
Tridiagonal-type matrix structure:- balanced nnz/row- few nnz per row- each thread has similar work- shared memory and cache-based strategies withsimilar performance
Arrow-type matrix structure:- few rows with many elements- corresponding threads are bottleneck for cached extraction
K40
K40
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size
×104
0.5
1
1.5
2
2.5
3
BJ
genera
tion tim
e / B
GJE
tim
e USUCMSMC
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2Matrix size
×104
0
5
10
15
20
BJ
genera
tion tim
e / B
GJE
tim
e USUCMSMC
19
Diagonal block extraction from sparse data structures
Tridiagonal-type
Arrow-type
P100
P100Tridiagonal-type matrix structure:
- balanced nnz/row- few nnz per row- each thread has similar work- shared memory and cache-based strategies withsimilar performance
Arrow-type matrix structure:- few rows with many elements- corresponding threads are bottleneck for cached extraction
Cheby
shev
2
dw10
24
dw20
48
LeGre
sley_
2508
saylr
4
Cheby
shev
3
olm
5000
sts4
098
piston
nasa
2910
s3rm
t3m
3
dw40
96
dw81
92
s3rm
t3m
1
s2rm
t3m
1
s1rm
t3m
1
s3rm
q4m
1
s2rm
q4m
1
linve
rse
CurlC
url_0Kuu
bcss
tk18
nem
eth1
5
bcss
tk17
bcss
tk38
Pres_
Poiss
on
ABACUS_she
ll_ud
spm
srtls
cbuc
kle
msc
1084
8
cz40
948
gridge
na
2D_5
4019
_highKnd
3k
ibm
_mat
rix_2
3D_5
1448
_3D
rail_
7984
1
gas_
sens
or
sme3
Dc
mat
rix_9 dc
3nd
6k
mat
rix-n
ew_3
ship_0
03
cran
kseg
_1
nd12
k
CurlC
url_1
nd24
k
af_s
hell3 F1
ecolog
y2
G3_
circu
it
Emilia
_923
Hook_
1498
Flan_
1565
ML_
Gee
r10-5
10-4
10-3
10-2
10-1
Blo
ck J
aco
bi g
enera
tion tim
e [s]
Block size 4Block size 8Block size 16Block size 24Block size 32
20
Block-Jacobi preconditioner setup
K40
• Preconditioner setup needs up to 0.1 s on K40 GPU.
21
Block-Jacobi preconditioner setup
K40
Cheby
shev
2
dw20
48
dw10
24
LeGre
sley_
2508
Cheby
shev
3
saylr
4
sts4
098
olm
5000
nasa
2910
s2rm
q4m
1
s2rm
t3m
1
s3rm
t3m
1
s3rm
t3m
3
s3rm
q4m
1
s1rm
t3m
1
dw81
92
dw40
96Kuu
bcss
tk38
CurlC
url_0
linve
rse
nem
eth1
5
bcss
tk18
piston
bcss
tk17
nd3k
cbuc
kle
Pres_
Poiss
on
msc
1084
8
ABACUS_she
ll_ud
nd6k
spm
srtls
cz40
948
gridge
na
ibm
_mat
rix_2
sme3
Dc
3D_5
1448
_3D
2D_5
4019
_highK
nd12
k
gas_
sens
or
rail_
7984
1
cran
kseg
_1
mat
rix_9dc
3
mat
rix-n
ew_3
nd24
k
ship_0
03
CurlC
url_1 F1
af_s
hell3
ecolog
y2
Emilia
_923
G3_
circu
it
Hook_
1498
ML_
Gee
r
Flan_
1565
10-5
10-4
10-3
10-2
10-1
Blo
ck J
aco
bi g
enera
tion tim
e [s]
Block size 4Block size 8Block size 16Block size 24Block size 32
P100
• Preconditioner setup needs up to 0.1 s on K40 GPU.
• Preconditioner setup needs up to 0.01 s on P100 GPU.
• Irrelevant for overall Iterative solver execution time.
22
This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (University of Jaume I), and partly funded by the Department of Energy.
http://icl.cs.utk.edu/magma/
All functionalities are part of the MAGMA-sparse project.
http://www.icl.utk.edu/~hanzt/talks/BGJE.pdf