

Received: Added at production Revised: Added at production Accepted: Added at production

DOI: xxx/xxxx

ARTICLE TYPE

ILU Smoothers for C-AMG with Scaled Triangular Factors

Stephen Thomas1 | Arielle K. Carr*2 | Paul Mullowney3 | Katarzyna Świrydowicz4 | Marcus S. Day5

1 National Renewable Energy Laboratory, Golden, CO, USA, Email: [email protected]

2 Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA, Email: [email protected]

3 National Renewable Energy Laboratory, Golden, CO, USA, Email: [email protected]

4 Pacific Northwest National Laboratory, Richland, WA, USA, Email: [email protected]

5 National Renewable Energy Laboratory, Golden, CO, USA, Email: [email protected]

Correspondence: *Corresponding author

Summary

ILU smoothers can be effective in the algebraic multigrid V-cycle. However, direct triangular solves are comparatively slow on GPUs. Previous work by Chow and Patel 1 and Anzt et al. 2 proposed an iterative approach to solve these systems. Unfortunately, when the Jacobi iteration is applied to highly non-normal upper or lower triangular factors, the iterations will diverge. An ILU smoother is introduced for classical Ruge-Stüben C-AMG that applies row and/or column scaling to mitigate the non-normality of the upper triangular factor. Our approach facilitates the use of Jacobi iteration in place of the inherently sequential triangular solve. Because the scaling is applied to the upper triangular factor, it can be done locally for a diagonal block of the global matrix. An ILUT Schur complement smoother, which solves the Schur system along subdomain (MPI rank) boundaries using GMRES, maintains a constant iteration count and improves strong scaling. Numerical results and parallel performance are presented for the Nalu-Wind and PeleLM 3 pressure solvers. For large problem sizes, GMRES+AMG with iterative triangular solves executes at least five times faster than when using direct solves on the NREL Eagle supercomputer.

KEYWORDS: iterative triangular solvers, GPU acceleration, incomplete LU smoother, algebraic multigrid

1 INTRODUCTION

Incomplete LU factorizations (e.g., ILU(k), ILUTP) can be employed as a smoother in algebraic multigrid (AMG) methods 4-6. However, when applying the incomplete factors in the AMG solve phase, the requisite direct triangular solves may result in a performance bottleneck on massively parallel architectures such as GPUs 2. Iterative methods, such as the stationary Jacobi iteration, can be an effective means of approximating the sparse triangular solution 2, 7. Our contribution is to introduce an approach for mitigating the effects of problematic factors that may inhibit the convergence of the Jacobi iteration. Specifically, we are concerned with situations in which L or U has a high condition number or a high degree of non-normality. When these difficult cases are properly dealt with, it becomes possible to use the significantly faster sparse matrix-vector (SpMV) products appearing in the Jacobi iteration on GPU architectures, thus achieving substantial acceleration in the AMG solve phase.

Consider the case of solving a linear system of the form Ax = b, where A is a sparse symmetric n × n matrix that is (highly) ill-conditioned, where the singular values span a dynamic range of more than 16 orders of magnitude. Such systems may arise, for example, in "projection" methods for evolving variable-density incompressible and reacting flows in the low Mach flow regime, particularly when using cut-cell approaches to complex geometries, where non-covered cells that are cut by the domain boundary can have arbitrarily small volumes and areas 3. Equilibration techniques are designed to reduce the condition number of A, κ(A), 8, 9 and can therefore improve the accuracy of the solution, especially with direct techniques 1. However,

1 The instability of the solution to a linear system with an ill-conditioned coefficient matrix is well known; see 10, Chapter 4.2.



equilibration has received considerably less attention when used in conjunction with AMG. Indeed, equilibration of A can lead to highly non-normal factors L and U when subsequently computing the ILU factorization of A 11. Even in the case where equilibration is not performed, the L and U factors can be highly non-normal. Each situation renders the use of Jacobi iterative triangular solvers ineffective when employing the L and U factors in the AMG solve phase. Because fast, accurate solves are essential to application codes that rely on sophisticated smoothing algorithms for solving the linear systems, it is critical to avoid exacerbating their departure from normality. Consequently, a new technique is proposed in this paper that avoids scaling the coefficient matrix A. Instead, row and column scaling methods are applied when ill-conditioned and/or highly non-normal L and U factors are encountered.

Specifically, consider the incomplete LU factorization, such that A ≈ LDU, and the Ruiz strategy 12, such that A ≈ L Dr U Dc.

Here, D is a diagonal matrix extracted from U by row scaling.2 Dr and Dc represent row and column scaling, respectively. In many cases, and for the applications considered here, scaling leads to a significant reduction in both the condition number and departure from normality of these factors 13. To our knowledge, scaling of the L or U factors to reduce the condition number of the factors and mitigate high non-normality is a novel approach when applied within an ILU smoother for C-AMG. The method avoids block Jacobi relaxation, another technique employed to account for non-normal factors when using iterative methods to compute the triangular solve 7. Block Jacobi requires a reordering strategy such as RCM 7, which may reduce the departure from normality, but can also increase computational cost in parallel for large matrices.

An ILUT Schur complement smoother maintains a constant Krylov solver iteration count as the number of parallel processes

(subdomains or MPI ranks) increases. The Schur complement system represents the interface degrees of freedom at subdomain boundaries. These are associated with the column indices corresponding to row indices owned by other MPI ranks in the hypre ParCSR block partitioning of the global matrix 14. A single GMRES iteration is employed to solve the Schur complement system for the interface variables, followed by back-substitution for the internal variables. Iterative triangular solves are applied in the latter case. A key observation is that the explicit residual computation r(k) = b − Ax(k) is not needed for this single GMRES iteration, resulting in a significant computational cost saving. A hierarchical basis formulation of AMG based on the C-F block matrix partitioning and Schur complements is considered in Chow and Vassilevski 15. Their proposed method can become expensive as the number of nonzeros in the coefficient matrix increases. A similar increase in cost is observed here, along with an increase in the departure from normality of U and ‖Us‖2. These can both be mitigated by limiting the fill level in the ILU factorization. Thus, Jacobi iterations can be effectively applied, resulting in comparatively inexpensive solution of the triangular systems for the internal variables on the GPU.

To obtain our results, relatively large matrices of dimension > 10 million were extracted from the PeleLM nodal pressure

projection solver 3 and the Nalu-Wind pressure continuity solver 16. For these applications, scaling L was found to be unnecessary, but our approach can be easily extended to cases when scaling is needed for both triangular factors. Scaling reduces the condition number of U significantly. Furthermore, applying scaling directly to the non-normal U factor to reduce its departure from normality facilitates the use of a Neumann series (Jacobi) iteration to solve the triangular system. A performance model is provided together with numerical and parallel scaling results in the following sections.

This paper is organized as follows. In section §2 the classical algebraic multigrid linear solver is reviewed, along with

Jacobi iteration for triangular systems and metrics for non-normal matrices. In section §3, incomplete LU factorizations are described together with LDU row scaling and Ruiz row/column scaling. An algorithm for applying these scaling techniques within C-AMG is then given, and an empirical analysis demonstrates the effectiveness of these approaches to reduce dep(U) and κ(U). In section §4, the hypre implementations of the C-AMG considered in this paper are described, a performance model is provided, and results are presented for linear systems from the PeleLM pressure continuity solver 3. In both the results presented in section §4 as well as preceding sections where an experimental analysis is provided, it will be made clear which solver is employed. Finally, in section §7, conclusions are provided and future work is proposed.

Table 1 summarizes the notation used throughout this paper.

2 ALGEBRAIC MULTIGRID

In this paper, we introduce an ILU smoother for AMG with iterative (Jacobi) triangular solves for the L and U factors. Similar (asynchronous) relaxation-type smoothers have been considered in other contexts; for example, in geometric multigrid (GMG)

2 Note that D can also be factored out from U.


Notation | Definition
Lowercase letters, e.g., v | column vector
Uppercase letters, e.g., V, A, H, D | matrix
Lowercase letter with two subscripts, e.g., aij | element of matrix A in row i and column j
Uppercase letter with subscript, e.g., Ak | k-th matrix in a sequence
Lowercase letter with subscript, e.g., vk | k-th vector in a sequence
Uppercase L | lower triangular matrix
Uppercase U | upper triangular matrix
Us | strictly upper triangular matrix
Ls | strictly lower triangular matrix
‖A‖2 | the 2-norm of A
‖A‖F | the Frobenius norm of A
‖A‖∞ | the infinity norm of A
κ(A) | 2-norm condition number of A
dep(U) | departure from normality of U

TABLE 1 Notation used in this paper for vectors, matrices, and operators.

methods for structured grids on the GPU, these smoothers were shown to be highly efficient 17. However, to our knowledge, this is the first application of iterative triangular solvers for AMG. Furthermore, to ensure Jacobi does not diverge, we demonstrate how to mitigate a high degree of non-normality in the triangular factors by using row and column scaling of the triangular factors themselves.

Algebraic multigrid methods (AMG) are effective and scalable solvers for unstructured grids that are well-suited for high-

performance computer architectures. When employed as a stand-alone solver or as a preconditioner for a Krylov iteration such as BiCGStab 18, AMG can, in theory, solve a linear system with n unknowns in O(n) operations. An AMG preconditioner accelerates the Krylov iterative solution of a linear system

Ax = b (1)

through error reduction by using a sequence of coarser matrices called a hierarchy, which is denoted as Ak, where k = 0, 1, …, m, and A0 is the coefficient matrix from (1) with dimensions n × n. Each Ak has dimensions mk × mk where mk > mk+1 for k < m. For the purposes of this paper, it will be assumed that

Ak = Rk Ak−1 Pk , (2)

for k > 0, where Pk is a rectangular matrix with dimensions mk−1 × mk. Pk is referred to as a prolongation matrix or prolongator. Rk is the restriction matrix, and Rk = Pk^T in the Galerkin formulation of AMG. Associated with each Ak, k < m, is a solver called a smoother, which is usually an inexpensive iterative method, e.g., Gauss-Seidel, polynomial, or incomplete factorization. The coarse solver associated with Am is often a direct solver, although it may be an iterative method if Am is singular.

The setup phase of AMG is nontrivial for several reasons. Each prolongator Pk is developed algebraically from Ak−1 (hence the name of the method). Once the transfer matrices are determined, the coarse-matrix representations are recursively computed from A through sparse matrix-matrix multiplication. In the AMG solve phase, a small number of sweeps is applied after fine-to-coarse pre-smoothing or coarse-to-fine post-smoothing residual transfers in the V-cycle. The solution of the coarsest-system solve is then interpolated to the next finer level, where it becomes a correction for that level's previous approximate solution. A few steps of post-smoothing are applied after this correction. This process is repeated until the finest level is reached. This describes a V-cycle, the simplest complete AMG cycle; see Algorithm 1 for the detailed description. AMG methods achieve optimality (constant work per degree of freedom in A0) through complementary error reductions by the smoother and solution corrections propagated from coarser levels.


Algorithm 1 Multigrid single-cycle algorithm (γ = 1 yields V-cycle) for solving Ax = b. The hierarchy has m + 1 levels.
// Solve Ax = b.
Set x = 0.
Set γ = 1 for V-cycle.
call MULTILEVEL(A, b, x, 0, γ).

function MULTILEVEL(Ak, b, x, k, γ)
  // Solve Ak x = b (k is current grid level)
  // Pre-smoothing step
  x = Sk^1(Ak, b, x)
  if (k ≠ m) then
    // Pk is the interpolant of Ak
    // Rk is the restrictor of Ak
    rk+1 = Rk(b − Ak x)
    Ak+1 = Rk Ak Pk
    v = 0
    for i = 1 … γ do
      MULTILEVEL(Ak+1, rk+1, v, k + 1, γ)
    end for
    x = x + Pk v
    // Post-smoothing step
    x = Sk^2(Ak, b, x)
  end if
end function
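For concreteness, the recursion in Algorithm 1 can be written in a few lines of Python. This is only an illustrative sketch (γ = 1 gives the V-cycle); the hierarchy arrays A, P, R and the smooth routine are assumed to be supplied by an AMG setup phase and are not the hypre data structures.

```python
# Illustrative sketch of Algorithm 1 (gamma = 1 yields the V-cycle).
# A, P, R are lists of per-level matrices produced by an AMG setup phase;
# smooth(A_k, b, x) is any smoother routine.
import numpy as np


def multilevel(A, P, R, smooth, b, x, k=0, gamma=1):
    """One multigrid cycle applied on level k of an (m+1)-level hierarchy."""
    m = len(A) - 1
    x = smooth(A[k], b, x)                     # pre-smoothing
    if k != m:
        r_coarse = R[k] @ (b - A[k] @ x)       # restrict the residual
        v = np.zeros(R[k].shape[0])
        for _ in range(gamma):                 # gamma recursive corrections
            v = multilevel(A, P, R, smooth, r_coarse, v, k + 1, gamma)
        x = x + P[k] @ v                       # prolongate the coarse correction
        x = smooth(A[k], b, x)                 # post-smoothing
    return x
```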

2.1 Classical Ruge-Stüben AMG

A brief overview of classical Ruge-Stüben AMG 19 is now given, starting with some notation that will be used in the subsequent discussions. Point j is a neighbor of i if and only if there is a non-zero element aij of the matrix A. Point j strongly influences i if and only if

|aij| ≥ θ max_{k≠i} |aik|, (3)

where θ is the strength of connection threshold, 0 < θ ≤ 1. This strong influence relation is used to select coarse points. The selected coarse points are retained in the next coarser level, and the remaining fine points are dropped. Let Ck and Fk be the coarse and fine points selected at level k, and let mk be the number of grid points at level k (m0 = n). Then mk = |Ck| + |Fk|, mk+1 = |Ck|, Ak is an mk × mk matrix, and Pk is an mk−1 × mk matrix. Here, the coarsening is performed row-wise by interpolating between coarse and fine points. The coarsening generally attempts to fulfill two contradictory criteria. In order to ensure that a chosen interpolation scheme is well-defined and of good quality, some close neighborhood of each fine point must contain a sufficient number of coarse points to interpolate from. Hence the set of coarse points must be rich enough. However, the set of coarse points should be sufficiently small in order to achieve a reasonable coarsening rate. The interpolation should lead to a reduction of roughly five times the number of non-zeros at each level of the V-cycle.

A smoother is employed at each level, as described above. In this paper, an ILU factorization is computed and the Jacobi iteration is used to solve the resulting triangular systems. In section 3, we will describe the ILU factorizations considered for this smoother and how our approach mitigates the effects of potentially problematic factors, but first the necessary background for the Jacobi iteration is provided. We note that the Jacobi iteration may also be used as a smoother in C-AMG, but here it is used solely as an algorithm for triangular solves required by an ILU smoother.
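Returning to the strong-influence test (3), the sketch below (our illustration, not the hypre coarsening code) marks, for each row i, the neighbors j that strongly influence i for a given threshold θ.

```python
# Sketch of the strong-influence test (3): j strongly influences i when
# |a_ij| >= theta * max_{k != i} |a_ik|.
import numpy as np
import scipy.sparse as sp


def strong_influences(A, theta=0.25):
    """Return a boolean CSR matrix S with S[i, j] = True when j strongly influences i."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    rows, cols = [], []
    for i in range(n):
        lo, hi = A.indptr[i], A.indptr[i + 1]
        idx, vals = A.indices[lo:hi], np.abs(A.data[lo:hi])
        off = idx != i                          # exclude the diagonal entry
        if not off.any():
            continue
        threshold = theta * vals[off].max()
        strong = off & (vals >= threshold)
        rows.extend([i] * int(strong.sum()))
        cols.extend(idx[strong].tolist())
    data = np.ones(len(rows), dtype=bool)
    return sp.csr_matrix((data, (rows, cols)), shape=(n, n))
```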


2.2 Jacobi Iteration

Here, the notation employed by Anzt et al. 2 is adopted for the iterative solution of triangular systems. The Jacobi iteration for solving Ax = b can be written in the compact form

xk+1 = G xk +D−1b (4)

with the regular splitting A = M − N, M = D and N = D − A. The iteration matrix G is defined as G = I − D^{-1}A. Given the preconditioner M = D, the iteration can also be written in the non-compact form

xk+1 = xk +D−1 ( b − A xk ) (5)

where D is the diagonal part of A. For the triangular systems resulting from the ILU factorization A ≈ LU (as opposed to the regular splitting A = D + L + U), the iteration matrices are denoted GL and GU for the lower and upper triangular factors, L and U. Let DL and DU be the diagonal parts of the triangular factors L and U, and let I denote the identity matrix. Assuming L has a unit diagonal, then

GL = DL^{-1} (DL − L) = I − L (6)

GU = DU^{-1} (DU − U) = I − DU^{-1} U (7)

A sufficient condition for the Jacobi iteration to converge is that the spectral radius of the iteration matrix is less than unity, ρ(G) < 1. GL is strictly lower and GU is strictly upper triangular, which implies that the spectral radius of both iteration matrices is zero. Therefore, the Jacobi iteration converges in the asymptotic sense for any triangular system. However, the convergence (or not) of the Jacobi method in practice also depends on the degree of non-normality of the coefficient matrix.
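A minimal sketch of iteration (5) applied to a triangular factor T (either L or U) is shown below; a fixed number of Jacobi sweeps replaces the forward or backward substitution, and the factor names and sweep counts are illustrative.

```python
# Sketch of the Jacobi iteration (5) used as an approximate triangular solve.
# T is a sparse (or dense) triangular factor; the sweep count is fixed.
import numpy as np


def jacobi_triangular_solve(T, b, sweeps):
    """Approximate T x = b with a fixed number of Jacobi sweeps."""
    d_inv = 1.0 / T.diagonal()                 # D_T^{-1}
    x = np.zeros_like(b)
    for _ in range(sweeps):
        x = x + d_inv * (b - T @ x)            # x_{k+1} = x_k + D^{-1}(b - T x_k)
    return x


# With factors from an ILU such that A ~ L U (L unit lower triangular):
#   y = jacobi_triangular_solve(L, b, sweeps=3)   # forward sweep,  L y ~ b
#   x = jacobi_triangular_solve(U, y, sweeps=2)   # backward sweep, U x ~ y
```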

2.3 Non-Normality of Matrices

A normal matrix A ∈ ℂn×n satisfies A∗A = AA∗, and this property is referred to as normality throughout this paper. Naturally, a matrix that is non-normal can be defined in terms of the difference between A∗A and AA∗. For example, one might use

‖A∗A − AA∗‖F / ‖A∗A‖F (8)

to indicate the degree of non-normality. In the current paper, Henrici's definition of the departure from normality of a matrix is employed,

dep(A) = √(‖A‖F^2 − ‖D‖F^2), (9)

where D ∈ ℂn×n is the diagonal matrix containing the eigenvalues of A 13. Further information on metrics and bounds describing (non)normality of matrices can be found in 13, 20-22 and references therein.

It follows directly from the definition of normality that an upper (or lower) triangular matrix cannot be normal unless it is a diagonal matrix (see 23, Lemma 1.13 for a proof). Therefore, some degree of non-normality is expected in the L and U factors. However, if the departure from normality is too great, the Jacobi iterations may diverge 2, 24. Because the condition number of a matrix can also be a good indicator of its departure from normality, known strategies for reducing a high condition number are applied to reduce a high degree of non-normality of the factors. There exist several ways to characterize the degree of non-normality of a matrix (see 20, Chapter 48 for a summary). While no single metric can be generalized to all matrices, dep(A) proves to be a sufficient criterion for the application considered in this paper. Our approach is described next in detail.
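For illustration, (9) can be evaluated directly for the moderate matrix sizes used in Tables 2 and 3; the sketch below is dense and uses the fact that, for a triangular factor, the eigenvalues are simply the diagonal entries.

```python
# Sketch of Henrici's departure from normality (9) for a dense matrix.
import numpy as np


def departure_from_normality(A):
    """dep(A) = sqrt(||A||_F^2 - sum_i |lambda_i|^2)."""
    A = np.asarray(A, dtype=complex)
    eigenvalues = np.linalg.eigvals(A)
    gap = np.linalg.norm(A, 'fro') ** 2 - np.sum(np.abs(eigenvalues) ** 2)
    return np.sqrt(max(gap, 0.0))              # guard against round-off


def departure_triangular(U):
    """For triangular U the eigenvalues are diag(U), so only norms are needed."""
    U = np.asarray(U, dtype=complex)
    gap = np.linalg.norm(U, 'fro') ** 2 - np.sum(np.abs(np.diag(U)) ** 2)
    return np.sqrt(max(gap, 0.0))
```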

3 AN ILU SMOOTHER WITH SCALED FACTORS

A sparse triangular solver is a critical kernel in many applications, and significant efforts have been devoted to improving the performance of a general-purpose sparse triangular solver for GPUs 25. For example, the traditional parallel algorithm is based upon level-set scheduling, derived from the sparsity structure of the triangular matrix 26. Independent computations proceed within each level of the elimination tree. Overall, the sparsity pattern of the triangular matrix can result in an extremely deep and narrow tree and thereby limit the amount of available parallelism for the solver to fully utilize many-core architectures, especially when compared to a sparse matrix-vector multiply (SpMV). An iterative approach was proposed by Anzt et al. 2, and


specifically the Jacobi iteration, to solve triangular systems. Adopting such an approach allows us to take advantage of the speed of sparse matrix-vector products on GPUs.

Before considering how best to take advantage of the GPU architecture, let us first examine the potential inaccuracy associated

with triangular solves when the L and U factors have high condition numbers. When using a threshold parameter (or drop tolerance) to limit the number of non-zeros in a row (or column) of the factors, the resulting factorization of a symmetric matrix could be highly nonsymmetric 11. One observable indicator is the vertical striping in the sparsity pattern of L + U, which signifies orders of magnitude difference in the entries of a row (or column) of the coefficient matrix A 11. In fact, exactly this behavior is observed in the non-zero patterns of the (symmetric) coefficient matrices for the systems considered in this paper when using a very small drop tolerance and allowing for a high amount of fill; see Figure 1 for the sparsity pattern of L + U at the first four levels of C-AMG using AMGToolbox for matrix dimension N = 14186.3 Here, the drop tolerance is set to 1.e−15 and the fill to 200 using the ILUTP implementation in Carr et al. 27 with pivoting turned off.4

This striping can be attributed to small pivots when computing the ILU factorization, resulting in large amounts of fill-in. This

can be mitigated, in part, by enforcing a smaller fill level, as discussed in Chow and Saad 11. However, for highly ill-conditioned A, as is the case for the matrices considered here, limiting the fill level may be insufficient to prevent ill-conditioned L and U. Figure 2 displays the nonzero pattern of L + U for the same matrices as in Figure 1, but with the fill level now set to 10. The dramatic striping pattern is no longer observed, although some striping is evident, indicating that L and U are still ill-conditioned despite restricting the amount of fill.

(a) First Level (b) Second Level (c) Third Level (d) Fourth Level

FIGURE 1 Non-zero patterns of L + U for matrix size N = 14186 for the first four levels using AMGToolbox. Drop tolerance is set to 1.e−15 and fill to 200.

An additional approach is to apply a scaling strategy to reduce the condition number of A prior to computing an ILU factorization. Even so, scaling A directly presents two potential problems: first, it requires working on the global A, which is distributed across all of the MPI ranks in parallel; and second, scaling A may also increase its departure from normality (and therefore that of its corresponding L and U factors) 11. Thus, we consider an approach where the triangular factors are scaled directly, and show this addresses both problems, ultimately allowing us to reliably employ the Jacobi iteration.

3.1 Row and Column Scaling of L and U to Reduce Non-Normality

We consider row and column scaling as well as row scaling alone. For the former, the Ruiz algorithm is applied, which is an iterative procedure that scales the matrix such that the resulting diagonal entries are one, and all other elements are less than (or equal to) one 12. Scaling reduces the condition number of U, κ(U), as well as the high degree of non-normality of U, which is measured using dep(U) as defined in (9). The Ruiz scaling algorithm 12 is provided in Algorithm 2.

Row and column scaling results in a linear system that has been equilibrated and now takes the form

L Dr U Dc x = b

3 We also observe a similar vertical striping pattern in the matrices for N = 331 and N = 2110 but omit these figures for brevity.
4 Note that the term pivoting refers to row or column exchanges that are employed when a small pivot, or divisor, is encountered in the factorization. As such, with pivoting turned off, the pivot here is always the diagonal element.


(a) First Level (b) Second Level (c) Third Level (d) Fourth Level

FIGURE 2 Non-zero patterns of L + U for matrix size N = 14186 for the first four levels using AMGToolbox. Drop tolerance is set to 1.e−15 and fill to 10.

Algorithm 2 Ruiz algorithm for row and column scaling of a coefficient matrix A and corresponding right-hand side b.
// Iteratively scale A ∈ ℂn×n, b ∈ ℂn
k = 0
Set Ak = A, bk = b, Dk = In
while not converged do
  Dr = diag(√‖ai,:^T‖∞)   // ai,:^T the row vectors of A
  Dc = diag(√‖a:,j‖∞)     // a:,j the column vectors of A
  Ak+1 = Dr^{-1} Ak Dc^{-1}
  bk+1 = Dr bk
  Dk+1 = Dk Dc^{-1}
  k = k + 1
end while
// After solving Ak x = bk, x = Dk^{-1} x
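A minimal SciPy sketch of Algorithm 2 is given below. It is an illustration, not the implementation used in the paper: the iteration limit and tolerance are arbitrary choices, the matrix is assumed to have no zero rows or columns, and the unscaling convention follows the standard equilibration identity A' = Dr^{-1} A Dc^{-1}, b' = Dr^{-1} b, x = Dc^{-1} y.

```python
# Sketch of Ruiz row/column scaling (Algorithm 2) with SciPy sparse matrices.
# Assumes A has no zero rows or columns; tolerances are illustrative.
import numpy as np
import scipy.sparse as sp


def ruiz_scale(A, b, tol=1e-4, max_iters=10):
    """Equilibrate A and b; return (A_scaled, b_scaled, column_scaling)."""
    A_s = sp.csr_matrix(A, dtype=float)
    b_s = np.asarray(b, dtype=float).copy()
    d_c_total = np.ones(A_s.shape[1])
    for _ in range(max_iters):
        abs_A = abs(A_s)
        row_max = abs_A.max(axis=1).toarray().ravel()
        col_max = abs_A.max(axis=0).toarray().ravel()
        if max(np.abs(row_max - 1.0).max(), np.abs(col_max - 1.0).max()) < tol:
            break                               # rows and columns are equilibrated
        d_r, d_c = np.sqrt(row_max), np.sqrt(col_max)
        A_s = sp.diags(1.0 / d_r) @ A_s @ sp.diags(1.0 / d_c)
        b_s = b_s / d_r
        d_c_total = d_c_total * d_c             # accumulate the column scaling
    return A_s, b_s, d_c_total


# After solving the scaled system A_s y = b_s, recover x = y / d_c_total.
```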

and the Jacobi iterations are applied to the upper triangular matrix Dr U Dc (in addition to the lower triangular factor L). Because Dr U Dc has a unit diagonal, the iterations can be expressed in terms of a Neumann series, which is discussed below. When an LDU or LDLT factorization is available, the diagonal matrix D can represent row scaling for either the L or U matrix. For the applications considered in this paper, we scale the U matrix. As such, when employing only row scaling, given an incomplete LU factorization, D can be written as

D = diag(U), (10)

and the scaled U is subsequently defined as

U = D^{-1} U (11)

to obtain the incomplete LDU factorization from the incomplete LU factorization.

Table 2 displays the reduction in the departure from normality of U, dep(U), when using Ruiz scaling and row scaling, as

well as dep(L), for a selection of matrices extracted from the SuiteSparse Matrix Collection 28 (first five rows). Results for three matrices extracted from PeleLM (final three rows) are also provided. Note again that scaling is only applied to the U factor for the application considered in this paper. To generate these results, Matlab's ilu was applied with setup type 'nofill' (i.e. ILU(0)). The condition numbers of the matrices are listed in Table 3, for A, L, U, and U scaled with both Ruiz and row scaling strategies.

Our results show that the scaling strategies substantially reduce both κ(U) and dep(U), and that in many cases, the Ruiz

strategy produces the larger reduction compared with row scaling alone. This may be attributed to the (additional) column scaling in the Ruiz approach. However, in some cases, row scaling results in a lower departure from normality. In practice, we observe that dep(U) is minimal after only the first or second iteration of the Ruiz algorithm, and slightly increases for the remaining iterations. Early termination of the algorithm can be exercised; however, in all cases where row scaling results in a smaller dep(U), both are nearly the same, or at least of the same order of magnitude. Furthermore, row scaling reduces κ(U)


Matrix | Dimension | dep(L) | dep(U) | dep(D^{-1}U) (row scaling) | dep(Dr U Dc) (Ruiz scaling)
af_0_0_k101 | 503625 | 326.95 | 1.84e8 | 326.95 | 320.89
af_shell1 | 504855 | 386.66 | 1.52e8 | 386.66 | 407.35
bundle_adj | 513351 | 8.52e6 | 4.52e11 | 8.52e6 | 438.70
F1 | 343791 | 335.52 | 4.89e8 | 335.52 | 331.79
offshore | 259789 | 231.86 | 7.05e15 | 231.86 | 222.71
PeleLM331 | 331 | 8.37 | 1.50e6 | 8.37 | 4.13
PeleLM2110 | 2110 | 16.99 | 1.09e7 | 16.99 | 9.33
PeleLM14186 | 14186 | 1.45e4 | 1.00e6 | 1.45e4 | 26.33

TABLE 2 Departure from normality for the L and U factors when applying an ILU(0) factorization to several matrices, followed by the departure from normality after row scaling and Ruiz scaling are applied to the U factor. The first five come from 28, and the last three are extracted from PeleLM. Matlab's ilu with type 'nofill' was computed for all matrices.

Matrix | κ(A) | κ(L) | κ(U) | κ(D^{-1}U) (row scaling) | κ(Dr U Dc) (Ruiz scaling)
af_0_0_k101 | 3.60e8 | 156.54 | 1.02e3 | 75.78 | 108.83
af_shell1 | 1.72e10 | 49.99 | 231.94 | 116.42 | 171.30
bundle_adj | 6.10e15 | 4.59e12 | 3.53e14 | 2.93e12 | 2.37e3
F1 | 3.26e7 | 6.51e3 | 1.34e5 | 2.67e4 | 1.54e4
offshore | 2.32e13 | 96.79 | 7.56e10 | 148.35 | 156.99
PeleLM331 | 3.48e17 | 14.06 | 3.87e9 | 43.06 | 18.72
PeleLM2110 | 3.21e17 | 13.39 | 4.41e9 | 34.02 | 12.31
PeleLM14186 | 6.64e15 | 1.83e8 | 6.87e12 | 1.74e7 | 9.51

TABLE 3 Condition number for A, L, and U when applying an ILU(0) factorization to several matrices, followed by the condition numbers after row scaling and Ruiz scaling are applied to the U factor. The first five come from 28, and the last three are extracted from PeleLM. Matlab's ilu function with type 'nofill' was computed for all matrices.

more than the Ruiz strategy does for a subset of the matrices. We note that while the Ruiz strategy preserves symmetry, it is not guaranteed to provide the minimal condition number compared with other scaling strategies.

Both Ruiz scaling with ILU and row scaling with incomplete LDU produce a matrix U = I + Us, with unit diagonal and strictly upper triangular Us. When an LDU or LDLT factorization is available, the Neumann series can be directly obtained from U or LT as these are strictly upper triangular matrices. The matrix L already has a unit diagonal, and a Neumann series can again be obtained. Thus, in either approach and considering only the U factor, the inverse of such a matrix can be expressed as a Neumann series

(I + Us)^{-1} = I − Us + Us^2 − ⋯ = ∑_{i=0}^{p} (−1)^i Us^i

Because Us is upper triangular and nilpotent, the above sum is finite. The series is also guaranteed to converge when ‖Us‖2 < 1. In practice, this is true for the ILU(0) and ILUT smoothers for certain drop tolerances. Even in cases when ‖Us‖2 ≥ 1, we find in practice that ‖Us^p‖2 tends to be less than one for small p. Figure 3 displays ‖Us^p‖2 when employing Ruiz scaling at the finest level (i.e., referred to as l = 1 in later results). Here, Matlab's ilu is used with type 'ilutp', threshold 0 (i.e., no pivoting), and various drop tolerances. For larger drop tolerances (i.e. droptol = 1.e−2 for both Ruiz and row scaling), ‖Us^p‖2 < 1 for p = 1, giving convergence of the Neumann series. However, for smaller drop tolerances this is no longer the case. Furthermore, in some cases ‖Us^p‖2 actually increases for the first few values of p, but eventually decreases and falls below 1. When applying row scaling, similar results to those obtained when using the Ruiz strategy are observed for N = 331 and N = 2110, and those results are omitted for brevity. Figure 4 displays ‖Us^p‖2 for matrix dimension N = 14186 when using only row scaling. In some cases, ‖Us^p‖2 < 1 for modest p (e.g., p = 7 and droptol = 1.e−2). For smaller drop tolerances, p can be moderately large (e.g., p = 45 for matrix dimension N = 14186).
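The truncated series can be applied with SpMV products only, never forming powers of Us explicitly; a short sketch (assuming U has already been scaled to a unit diagonal) follows.

```python
# Sketch of the truncated Neumann series for U = I + U_s with unit diagonal:
# v ~ (I + U_s)^{-1} y = sum_{i=0}^{p} (-U_s)^i y, applied with SpMV only.
import scipy.sparse as sp


def neumann_apply(U, y, p):
    """Approximate (I + U_s)^{-1} y with p + 1 terms, where U_s = U - I."""
    U_s = sp.csr_matrix(U) - sp.identity(U.shape[0], format='csr')
    v = y.copy()
    term = y.copy()
    for _ in range(p):
        term = -(U_s @ term)                    # term becomes (-U_s)^i y
        v = v + term
    return v
```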


(a) Matrix dimension N = 331  (b) Matrix dimension N = 2110  (c) Matrix dimension N = 14186

FIGURE 3 ‖Us^p‖2 for p = 1, 2, …, 40, for U = I + Us scaled using the Ruiz strategy, where Us is the strictly upper triangular part of U. The black dotted line shows the bound 1.

That ‖Us^p‖2 eventually decreases is not unexpected, as the number of possible nonzeros in Us^p necessarily grows smaller as p grows larger when Us is dense. However, numerical nilpotence is observed for p ≪ n in many cases for sparse Us, and the size of p clearly depends on the number of nonzeros allowed in U and consequently Us (either by imposition of a small droptol or conservative fill, or both). In other words, ‖Us^p‖2 is zero, or nearly so, much sooner than the theoretically guaranteed ‖Us^n‖2. Analyzing this result and the theoretical implications of ‖Us^p‖2 < 1 for modest p > 1 (e.g., p = 2 or 3) when using our approach is left for future consideration.

Finally, the algorithm for an ILU smoother that uses the Ruiz scaling applied to non-normal triangular factors is given in Algorithm 3. The algorithm is described as it is applied to produce the results presented next; in other words, only the U factor is scaled. However, this algorithm can be extended to also scale the L factor in cases when the problem demands that both L and U be scaled. For the applications considered in this paper, we need only scale the U factor. To employ only row scaling within Algorithm 3, in lieu of the call to Ruiz and the subsequent formation of Du, one would construct D as in (10) and update U as in (11).

4 KRYLOV METHODS WITH C-AMG PRECONDITIONER

Our new approach is applied to linear systems taken from the "nodal projection" component of the time stepping strategy used in PeleLM 3. PeleLM is an adaptive mesh low Mach number combustion code developed and supported under DOE's


FIGURE 4 ‖Us^p‖2 for p = 1, 2, …, 40, for U = I + Us using row scaling LDU, where Us is the strictly upper triangular part of U. The black dotted line shows the bound 1. Matrix dimension N = 14186. Not shown in figure: ‖Us^p‖2 drops below 1 at p = 45.

Algorithm 3 ILU+Jacobi smoother for C-AMG with Ruiz scaling for non-normal upper triangular factors.
Given A ∈ ℂn×n, b ∈ ℂn
Define droptol and fill
Compute AP ≈ LU with droptol and fill imposed
Define mL and mU, the total number of Jacobi iterations for solving L and U
Define y = 0, v = y
// Jacobi iteration to solve Ly = b
for k = 1 : mL do
  y = y + (b − Ly)
end for
Call Algorithm 2 (Ruiz scaling) with U and y to obtain scaled U and y, and Dk
Let Du = diag(U)
Define D = Du^{-1}
// Jacobi iteration to solve Uv = y
for k = 1 : mU do
  v = v + D(y − Uv)
end for
// Update and unpermute the solution
v = Dk v
x = P^{-1} v
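A condensed sketch of the smoothing step in Algorithm 3 is shown below. For brevity it omits the permutation P and the Ruiz scaling/unscaling of U and y (those steps are assumed to have been folded into the factors during setup), so it should be read as an illustration of the two Jacobi sweeps, not as the hypre implementation.

```python
# Sketch of the two Jacobi sweeps in Algorithm 3 (permutation and Ruiz
# scaling/unscaling omitted; L is unit lower triangular, U upper triangular).
import numpy as np


def ilu_jacobi_smooth(L, U, b, m_L=3, m_U=2):
    """Approximately apply (L U)^{-1} b with fixed numbers of Jacobi sweeps."""
    y = np.zeros_like(b)
    for _ in range(m_L):                       # solve L y = b; D_L = I
        y = y + (b - L @ y)
    d_inv = 1.0 / U.diagonal()                 # D = diag(U)^{-1}
    v = np.zeros_like(b)
    for _ in range(m_U):                       # solve U v = y
        v = v + d_inv * (y - U @ v)
    return v
```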

Exascale Computing Program. PeleLM features the use of a variable-density projection scheme to ensure that the velocity field used to advect the state satisfies an elliptic divergence constraint. Physically, this constraint enforces that the resulting flow evolves consistently with a spatially uniform thermodynamic pressure across the domain. A key feature of the model is that the fluid density may vary considerably across the computational domain, and this can lead to highly ill-conditioned matrices that represent the elliptic projection operator. Section 5 shows that the standard Jacobi and Gauss-Seidel smoothers are less effective in these cases at reducing the residual error at each level of the C-AMG V-cycle, and this can lead to very large iteration counts for the Krylov solver.


4.1 Stopping Criteria

The stopping criterion for Krylov methods is an important consideration and is related to the backward error for solving linear systems Ax = b. The most common convergence criterion found in existing iterative solver frameworks is based upon the relative residual, defined by

‖rk‖2 / ‖b‖2 = ‖b − A xk‖2 / ‖b‖2 < tol, (12)

where rk and xk represent, respectively, the residual and approximate solution after k iterations of the iterative solver. An alternative metric commonly employed in direct solvers is the norm-wise relative backward error (NRBE)

NRBE = ‖rk‖2 / ( ‖b‖2 + ‖A‖∞ ‖xk‖2 ). (13)
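The two criteria (12) and (13) are inexpensive to evaluate; a small sketch is given below (the sparse branch assumes SciPy's matrix norm, and the names are illustrative).

```python
# Sketch of the relative residual (12) and norm-wise relative backward error (13).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def relative_residual(A, x, b):
    return np.linalg.norm(b - A @ x) / np.linalg.norm(b)


def nrbe(A, x, b):
    """||r_k||_2 / (||b||_2 + ||A||_inf ||x_k||_2)."""
    r = b - A @ x
    A_inf = spla.norm(A, np.inf) if sp.issparse(A) else np.linalg.norm(A, np.inf)
    return np.linalg.norm(r) / (np.linalg.norm(b) + A_inf * np.linalg.norm(x))
```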

In our numerical experiments, the norm-wise relative backward error for the solution of linear systems with BiCGStab was sometimes found to be lower than when right-preconditioned GMRES was employed 29. Indeed, the latter exhibited false convergence (the implicit GMRES residual norm did not agree with the norm of the explicit residual rk = b − A xk) when executed in parallel for highly ill-conditioned problems, κ(A) = 1e+15. Flexible FGMRES 23 was found to be the most effective Krylov solver in combination with AMG and did not exhibit false convergence. Our test problems are based on pressure linear systems extracted from the PeleLM 3 and Nalu-Wind 16 models, solved using either the AMGToolbox, a prototype for the LAMG framework from Joubert and Cullum 30, 31, or the hypre-BoomerAMG library 5. The systems are iterated up to six BiCGStab iterations and the NRBE is O(ε).

4.2 AMG Implementations

The hypre-BoomerAMG library was designed for massively parallel computation 32, 33 and now also supports GPU acceleration of key solver components 34. hypre implements classical Ruge-Stüben C-AMG. Interpolation operators in AMG transfer residual errors between adjacent levels. There are a variety of interpolation schemes available in BoomerAMG on CPUs. Direct interpolation 4 is straightforward to implement on GPUs because the interpolatory set of a fine point i is just a subset of the neighbors of i, and thus the interpolation weights can be determined solely by the i-th equation. The weights wij are computed by solving the local optimization problem

min ‖aii wi^T + a_{i,Ci^s}‖2   s.t.   wi^T f_{Ci^s} = fi,

where wi is a vector that contains wij, Ci^s denotes the strong C-neighbors of i, and f is a target vector that needs to be interpolated exactly. For elliptic problems where the near null-space is spanned by constant vectors, i.e., f = 1, the closed-form solution of (4.2) is given by

wij = − (aij + αi / n_{Ci^s}) / (aii + ∑_{k∈Ni^w} aik),   αi = ∑_{k∈{Fi ∪ Ci^w}} aik,

where n_{Ci^s} denotes the number of points in Ci^s, Ci^w the weak C-neighbors of i, Fi the F-neighbors, and Ni^w the weak neighbors.

With minor modifications to the original form, it turns out that the extended interpolation operator can be rewritten by using

standard sparse matrix computations such as matrix-matrix (M-M) multiplication and diagonal scaling with certain FF- and FC-sub-matrices. The coarse-fine C-F splitting of the coarse matrix A and the full prolongation operator P are given by

A = [ AFF  AFC ; ACF  ACC ],   P = [ W ; I ],

where A is assumed to be decomposed into A = D + As + Aw, the diagonal, the strong part and the weak part, respectively, and AFF^w, AFC^w, AFF^s and AFC^s are the corresponding sub-matrices of Aw and As.

The extended “MM-ext” interpolation takes the form

W = − [ (DFF + Dγ)^{-1} (AFF^s + Dβ) ] [ Dβ^{-1} AFC^s ],

with Dβ = diag(AFC^s 1C), Dγ = diag(AFF^w 1F + AFC^w 1C).

This formulation allows simple and efficient implementations that can utilize available optimized sparse kernels on GPUs. Similar approaches, referred to as "MM-ext+i" (modified from the original extended+i algorithm 35) and "MM-ext+e", are also available in BoomerAMG. See 36 for details on the class of M-M based interpolation operators. A recursive polynomial-type smoother has also been implemented, based upon a Gauss-Seidel iteration which generates a Neumann series.


A polynomial-type smoother 37 can be derived from a Jacobi iteration applied to the triangular system for (D + L) in the Gauss-Seidel scheme and then used to solve the linear system Ax = b, with residual r = b − Ax, where D is the diagonal of A. An alternate formulation is to replace (D + L)^{-1} with (I + D^{-1}L)^{-1} D^{-1} in the preconditioned iteration, and replace the matrix inverse with a truncated Neumann series,

x(k+1) = x(k) + ∑_{j=0}^{p} (−D^{-1}L)^j D^{-1} r(k)

In practice, the Neumann series converges rapidly for nearly normal matrices where the off-diagonal elements of L decay rapidly to zero. Because the matrix L is once again strictly lower triangular, it is nilpotent and the Neumann series is a finite sum.

Following Saad 23, Chapter 14.2, Osei-Kuffuor 38, and Falgout et al. 34, in order to derive a Schur complement preconditioner

for the partitioned linear system

A [ x ; y ] = [ f ; g ],

consider the block A = LU factorization of the coefficient matrix A

A = [ B  E ; F  C ] = [ I  0 ; F B^{-1}  I ] [ B  E ; 0  S ]

The block matrix B is associated with the diagonal block (subdomain) of the global matrix distributed across MPI ranks by hypre. The Schur complement is S = C − F B^{-1} E, and the reduced system for the interface variables, y, is given by

S y = g − F B−1 f (14)

Then the internal, or local, variables represented by x are obtained by back-substitution according to the expression

x = B−1 ( f − E y )

An ILUT Schur complement smoother for one level of the V-cycle in hypre is implemented as a single iteration of a GMRES solver for the global interface system (14). The local systems involving B^{-1} are solved by computing an ILUT factorization of the matrix B ≈ LDU. Rather than applying a direct triangular solver for these systems, the Neumann (Jacobi) iterations described previously in Section 2.2 are employed. It is important to note that the residual vector r(k) = b − A y(k) is not required for the Schur complement GMRES solver for a fixed number of iterations. The initial guess is set to x(0) = 0 and r(k) = b, without a convergence check with the explicit r(1). This observation results in a significant computational savings.
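The structure of the Schur complement smoother can be summarized as follows; this dense, single-rank sketch applies B^{-1} exactly for clarity, whereas in hypre S is never formed, a single GMRES iteration is applied to (14), and B^{-1} is replaced by the ILUT factors with Jacobi sweeps.

```python
# Dense, single-rank sketch of the Schur complement solve for
# [[B, E], [F, C]] [x; y] = [f; g]; in practice B^{-1} is an ILUT + Jacobi
# approximation and (14) is treated with one GMRES iteration.
import numpy as np


def schur_complement_solve(B, E, F, C, f, g):
    B_inv = np.linalg.inv(B)                   # stand-in for the approximate local solve
    S = C - F @ B_inv @ E                      # Schur complement
    y = np.linalg.solve(S, g - F @ B_inv @ f)  # interface unknowns, system (14)
    x = B_inv @ (f - E @ y)                    # interior unknowns by back-substitution
    return x, y
```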

5 NUMERICAL RESULTS

5.1 PeleLM Combustion Model

Pressure linear systems are taken from the "nodal projection" component of the time integrator used in PeleLM 3. PeleLM is an adaptive mesh low Mach number combustion code developed and supported under DOE's Exascale Computing Program. PeleLM features the use of a variable-density projection scheme to ensure that the velocity field used to advect the state satisfies an elliptic divergence constraint. Physically, this constraint enforces that the resulting flow evolves consistently with a spatially uniform thermodynamic pressure across the domain. A key feature of the model is that the fluid density may vary considerably across the computational domain. Extremely ill-conditioned problems arise for incompressible and reacting flows in the low Mach flow regime, particularly when using cut-cell approaches to complex geometries, where non-covered cells that are cut by the domain boundary can have arbitrarily small volumes and areas. The standard Jacobi and Gauss-Seidel smoothers are less effective in these cases at reducing the residual error at each level of the C-AMG V-cycle, and this can lead to very large iteration counts for the GMRES+AMG solver.

A sequence of two different size problems was examined, based on matrices extracted from the PeleLM pressure continuity

solver 3. The BiCGStab+AMG solver using the AMGToolbox 30 was run for the dimension N = 14186 matrix with ILUT smoothing only on the finest level l = 1, then with ILUT on all levels, and with polynomial Gauss-Seidel smoothers. Iterative Jacobi triangular solvers are employed. One pre- and post-smoothing sweep was applied on all V-cycle levels, except for the coarse level direct solve. The AMG strength of connection threshold was set to θ = 0.25.

Table 4 examines the relationship between the drop tolerance imposed for the ILU factorization and the required number of

Jacobi iterations. For the results shown in the first four columns, Matlab’s built-in ilu function is applied with type ‘ilutp’, with


pivoting turned off (i.e., the threshold is set to 0), and drop tolerances varied between 1e−2 and 1e−5, to matrices with dimension N = 14186. For the larger drop tolerances 1e−2 and 1e−3, the average number of nnz per row in L is limited to approximately 13 and 50, respectively. For 1e−4 and 1e−5, this increases to approximately 160 and 360, respectively. The average nnz per row in the corresponding U factors is more or less the same, except for drop tolerance 1e−5, where there are 385 nnz per row in U. For drop tolerance 1e−5, the number of Jacobi iterations remains constant for both L and U in order to achieve a decrease in the BiCGStab iterations. It was found that regardless of drop tolerance, the number of Jacobi iterations remains constant at 3 for L and 2 for U, and the resulting BiCGStab iteration count is constant at 6.

However, because the cost of sparse matrix-vector products with the L and U factors obviously depends on the number of

nonzeros in the L and U matrices, limiting the fill level may be appropriate, especially for the drop tolerances 1e−4 and 1e−5. However, because Matlab's ilu does not allow for control of the fill level explicitly, the ILUT implementation in 27 is employed. In the final two columns of Table 4, the fill is limited to 10 for these drop tolerances. Even with the low number of Jacobi iterations for the triangular solves, the BiCGStab iteration count remains constant at six (6). Thus, to achieve an accurate solution from the iterative solver, we conclude that a larger drop tolerance with low fill and just a few Jacobi iterations for the triangular solve are sufficient.

ILU droptol | 1e−2 (no fill specified) | 1e−3 (no fill specified) | 1e−4 (no fill specified) | 1e−5 (no fill specified) | 1e−4 (fill = 10) | 1e−5 (fill = 10)
nnz(L) | 197162 | 702482 | 2268634 | 5139896 | 133552 | 134543
nnz(U) | 190983 | 709452 | 2492995 | 5473513 | 141276 | 141490
L Iterations | 3 | 3 | 3 | 3 | 3 | 3
U Iterations | 2 | 2 | 2 | 2 | 2 | 2
BiCGStab Iterations | 6 | 6 | 6 | 6 | 6 | 6

TABLE 4 Jacobi iterations for the upper and lower triangular solves versus BiCGStab iterations to reach tolerance 1e−6 for different drop tolerances with row scaling LDU and row/column Ruiz scaling using AMGToolbox. Both scaling strategies require the same number of Jacobi iterations for L and U, and result in the same number of BiCGStab iterations. The first four columns use Matlab's ilu and fill levels are not specified. The last two columns show results with fill limited to 10 for drop tolerances 1e−4 and 1e−5 using the ILUT described in Carr et al. 27. Matrix size N = 14186.

Results using hypre-BoomerAMG for the N = 1.4 million linear system are plotted in Figure 5. These results were obtained on the NREL Eagle supercomputer with Intel Skylake CPUs and NVIDIA V100 GPUs. PMIS with aggressive coarsening and "MM-ext+i" interpolation are employed, with a strength of connection threshold θ = 0.25. Because the problem is very ill-conditioned, flexible FGMRES achieves the best convergence rates and the lowest NRBE. Iterative triangular solvers were employed in these tests with three (3) Jacobi iterations. The convergence histories are plotted for hybrid-ILUT and polynomial Gauss-Seidel smoothers. The ILUT parameters were droptol = 1e−2 and lfill = 10. The lowest time on a single GPU was achieved by the ILUT smoother with Jacobi iterations, with a solve time of 0.11 seconds.

5.2 Exa-Wind Fluid Mechanics Models

The ExaWind ECP project aims to simulate the atmospheric boundary layer air flow through an entire wind farm on next-generation exascale-class computers. The primary physics codes in the ExaWind simulation environment are Nalu-Wind and AMR-Wind. Nalu-Wind and AMR-Wind are finite-volume-based CFD codes for the incompressible-flow Navier-Stokes governing equations. Nalu-Wind is an unstructured-grid solver that resolves the complex geometry of wind turbine blades and thin blade boundary layers. AMR-Wind is a block-structured-grid solver with adaptive mesh refinement (AMR) capabilities that captures the background turbulent atmospheric flow and turbine wakes. Nalu-Wind and AMR-Wind models are coupled through overset meshes. The equations consist of the mass-continuity equation for pressure and Helmholtz-type equations for transport of momentum and other scalars (e.g. those for turbulence models). For Nalu-Wind, simulation times are dominated by linear-system setup and solution of the continuity and momentum equations. Both PeleLM and AMR-Wind are built on the AMReX


FIGURE 5 Hypre-BoomerAMG GPU results. Convergence history of (F)GMRES+AMG with polynomial and hybrid ILU smoothers with iterative triangular solves. Matrix size N = 1.4 million.

FIGURE 6 Hypre-BoomerAMG GPU results. Nalu-Wind. Convergence history of (F)GMRES+AMG with polynomial and ILUT Schur complement smoothers with iterative triangular solves. Matrix size N = 23 million.

software stack 39 and employ geometric multigrid as the primary solver, although they have the option of using the hypre library 5 as a solver at the coarsest AMR level.

The NREL 5-MW turbine 40 is a notional reference turbine with a 126 meter rotor that is appropriate for offshore wind studies.

The simulations performed here use the model described in Thomas et al. 16 and Sprague et al. 41, but with rigid blades, and they include low- and high-resolution models of a single turbine. These models use inflow and outflow boundary conditions in the directions normal to the blade rotation and symmetry boundary conditions in the other directions. For each simulation, 50 time steps are taken from a cold start with four Picard iterations per time step. The cold start implies that the simulation will undergo an initial transient phase from a non-physical initial solution guess before settling into a quasi-steady solution state. This initial transient phase is more challenging for the linear-system solvers and will require more GMRES iterations per equation system. Convergence histories for one such pressure linear system (after reaching steady state) are displayed in Figure 6 for the ILUT Schur complement and polynomial Gauss-Seidel smoothers. The former requires half as many iterations to reach the 1e−5 convergence tolerance. Furthermore, a coarse-fine (C-F) ordering of the degrees of freedom results in fewer iterations. The strength of connection parameter was set to θ = 0.52, which contributes to a reduction in the AMG set-up time. In addition, the


ILU drop tolerance was droptol = 0.21, with a fill level of lfill = 2. Sufficient smoothing was achieved with 18 iterations for the lower triangular L solve and 31 for the U solve. Aggressive coarsening was not specified and a single level of ILU smoothing was applied.

6 PARALLEL PERFORMANCE

A hybrid C-AMG algorithm is obtained by employing the ILU smoother on the finest levels of the AMG V-cycle hierarchy (e.g. level 1), followed by the polynomial Gauss-Seidel smoother applied on the remaining levels. In the numerical results reported earlier, the ILU smoothers are applied on all levels and also in the hybrid configuration for comparison. The latter requires fewer sparse matrix-vector multiplies and thus is more efficient computationally.

The computational cost of the V-cycle, besides the high set-up cost, is determined by the number of non-zeros in the triangular

factors. The ILU smoother requires 3-10 Jacobi iterations and one outer sweep. These are applied during pre- and post-smoothing. Therefore, the total number of flops required for the V-cycle with ILU(0) smoothing on every level is given by the sum

flops = ∑_{l=1}^{Nl−1} nnz(Al) × 6, (15)

where the factor 6 is replaced by 4 for the polynomial smoother of degree two. This makes a compelling case for the hybrid C-AMG approach with a combination of ILU and polynomial G-S smoothers when the convergence rate is not adversely affected. The cost of the coarse grid direct solve is O(Nc^3), where Nc is the dimension of the coarsest level matrix Ac, and is small in comparison. The cost of the Krylov iteration is dominated by the sparse matrix-vector product (SpMV) with the matrix A, whose cost is determined by 2 × nnz(A).

The cost of a sparse direct triangular solver on a many-core GPU architecture, such as from the NVIDIA cuSparse library, can be 10 to 25× slower than the SpMV 2. For the NVIDIA V100 GPU architecture, the SpMV can now achieve on the order of 50-100 GigaFlops/sec in double precision floating point arithmetic. The cost of a V-cycle for our third problem with matrix dimension N = 14186 would be 1.8 × 10^6 flops, whereas the SpMV in the Krylov iteration costs 2.8 × 10^6 flops. The former would require 0.00004 seconds to execute on the V100 GPU (assuming 50 GigaFlops/sec sustained performance), and the latter would take 0.0003 seconds. The compute time for the BiCGStab solver is 0.00034 seconds per iteration, and six (6) iterations would execute in 0.002 seconds. When the number of BiCGStab+AMG iterations to achieve the same NRBE remains less than two times larger, the case for employing the hybrid scheme on GPUs becomes rather compelling.

Level | nl | nnz(Al)
1 | 14186 | 291068
2 | 228 | 7398
3 | 32 | 694
4 | 5 | 25

TABLE 5 Size and number of non-zeros for A at each level, where nl denotes the matrix size at level l. Hypre-BoomerAMG. Matrix size N = 14186.
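As a quick check of the cost model (15) against the hierarchy in Table 5 (the coarsest level uses a direct solve and is excluded), the snippet below reproduces the 1.8 × 10^6 flop estimate quoted above.

```python
# Cost model (15) evaluated with the Table 5 hierarchy for N = 14186.
nnz_per_level = [291068, 7398, 694]     # levels 1-3; level 4 is the coarse direct solve
flops_ilu0 = 6 * sum(nnz_per_level)     # ILU(0) smoothing on every smoothed level
flops_poly = 4 * sum(nnz_per_level)     # degree-two polynomial smoother
print(flops_ilu0, flops_poly)           # 1794960 1196640, i.e. about 1.8e6 and 1.2e6
```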

Finally, the compute times of the GMRES+AMG solver in hypre with an incomplete LDU smoother, using either direct or iterative triangular solvers in the ILU smoother, are compared. The compute times for a single pressure solve are given in Table 6 for the N = 14186 dimension matrix. The ILU(0) and ILUT smoothers are included for comparison. Both the CPU and GPU times are reported. In all cases, one Gauss-Seidel and one ILU sweep are employed. The solver time reported again corresponds to when the relative residual has been reduced below 1e−5.

First, let us consider the CPU compute times for a single solve. The results indicate that the GMRES+AMG solver time

using the hybrid V-cycle with an ILU(0) smoother on the first level, with a direct solver for the L and U factors, costs less than when a Gauss-Seidel smoother is applied on all levels. The PMIS algorithm is employed along with aggressive coarsening on the first V-cycle level. One sweep of the G-S smoother is employed in both configurations. The longer time is primarily due to the higher number of Krylov iterations required in this case. The ILUT smoother with iterative triangular solvers is the more


            Gauss-Seidel   Poly G-S   ILUT direct   ILUT Jacobi   ILU(0) Jacobi
iterations  7              9          7             7             5
CPU (sec)   0.037          0.025      0.025         0.035         0.038
GPU (sec)   0.021          0.0067     0.032         0.0065        0.0048

TABLE 6 GMRES+AMG compute time. Gauss-Seidel, poly Gauss-Seidel, and ILU smoothers. N = 14186.

With only three (3) Jacobi iterations, i.e. sparse matrix-vector (SpMV) products, to solve the L triangular system and two (2) Jacobi iterations to solve the U triangular system, the computational speed of the GPU for the SpMV kernel is more than sufficient to overcome the cost of a direct sparse triangular solve. Our compute time estimate of 0.002 sec for the solver when N = 14186 is fairly accurate, given six iterations to converge; the use of aggressive coarsening in hypre reduces the nnz(A_l) per level 42. The compute times for a larger dimension problem, where N = 1.4 million, are reported in Table 7.
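For concreteness, the sketch below (Python with SciPy; the factor T, right-hand side b, and sweep count are placeholders rather than the hypre data structures) shows the Jacobi iteration that stands in for the direct triangular solve: each sweep costs one SpMV with the off-diagonal part of the factor, which is why a handful of sweeps can be cheaper on a GPU than a sequential forward or backward substitution.

import numpy as np
import scipy.sparse as sp

def jacobi_triangular_solve(T, b, num_sweeps, x0=None):
    # Approximate the solution of T x = b for a sparse triangular factor T
    # with the Jacobi iteration x_{k+1} = D^{-1} (b - (T - D) x_k).
    d = T.diagonal()                        # D, the diagonal of the factor
    N = T - sp.diags(d)                     # strictly triangular (off-diagonal) part
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(num_sweeps):             # e.g. 2-4 sweeps for U, 3 for L
        x = (b - N @ x) / d                 # one SpMV per sweep
    return x

Because D^{-1}(T − D) is strictly triangular and hence nilpotent, the iteration terminates exactly after at most n sweeps; it is the transient growth associated with a highly non-normal U that the scaling described earlier is intended to suppress.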

Here it was observed that the ILU(0) compute time on the GPU is lower than with the polynomial Gauss-Seidel smoother. Most notably, the GPU compute time for ILU(0) using four (4) Jacobi iterations for U is two times faster than ILUT with direct triangular solves. To further explore the parallel strong-scaling behaviour of the iterative and direct solvers within the ILU smoothers, the GMRES+AMG solver was employed to solve a PeleLM linear system of dimension N = 11 million. The LDU form of the incomplete factorization with row scaling was again employed, and ten (10) Jacobi iterations provide sufficient smoothing for this much larger problem. The linear system solver was tested on the NREL Eagle supercomputer using two NVIDIA Volta V100 GPUs per node. Notably, the solver with the iterative Jacobi scheme achieves a faster solve time than the direct triangular solver, as displayed in Figure 7. The convergence histories of the GMRES+AMG solver with the polynomial two-stage Gauss-Seidel and the ILU(0) direct and iterative Jacobi smoothers are plotted in Figure 8.

The strong-scaling performance for the low-resolution NREL 5 MegaWatt single-turbine mesh is shown in Figure 9.

The matrix dimension for this problem is N = 23 million. The total setup plus solve time is displayed for (F)GMRES+AMG executing on the NREL Eagle supercomputer using two NVIDIA V100 GPUs per node, on up to 20 nodes or 40 GPUs. The solve time is plotted for the polynomial Gauss-Seidel and ILUT Schur complement smoothers. In the latter case, a single iteration of the iterative GMRES Schur solver, without residual computations, results in a steeper decrease in the execution time and improved strong-scaling performance. In addition, the number of GMRES+AMG solver iterations required to reach a relative residual tolerance of 1e−5 remains constant at eleven (11) as the number of compute nodes is increased.
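To make the structure of such a smoothing step concrete, the following schematic sketch (Python with SciPy; the index sets interior and boundary, the ILU drop tolerance, and the single-iteration GMRES call are illustrative placeholders, not the hypre implementation) applies an ILU solve on the local diagonal block followed by one GMRES iteration on an approximate interface Schur complement.

import numpy as np
import scipy.sparse.linalg as spla

def schur_complement_smooth(A, interior, boundary, r, gmres_iters=1):
    # One schematic sweep: ILU on the local (interior) diagonal block and a
    # single GMRES iteration on the approximate interface Schur complement.
    A = A.tocsr()
    A_II = A[interior][:, interior].tocsc()
    A_IB = A[interior][:, boundary]
    A_BI = A[boundary][:, interior]
    A_BB = A[boundary][:, boundary]

    ilu_II = spla.spilu(A_II, drop_tol=1e-3)        # stand-in for the ILUT factorization

    # S = A_BB - A_BI A_II^{-1} A_IB, applied matrix-free via the ILU solve.
    S = spla.LinearOperator(
        (len(boundary), len(boundary)),
        matvec=lambda v: A_BB @ v - A_BI @ ilu_II.solve(A_IB @ v))

    x = np.zeros(A.shape[0])
    x[interior] = ilu_II.solve(r[interior])          # local diagonal block solve
    rhs_B = r[boundary] - A_BI @ x[interior]
    x[boundary], _ = spla.gmres(S, rhs_B, restart=gmres_iters, maxiter=1)
    return x

In this sketch the GMRES call performs a single Krylov iteration and no residual norm is examined, mirroring the configuration that gave the steeper decrease in execution time above.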

            Gauss-Seidel   Poly G-S   ILUT direct   ILUT Jacobi   ILU(0) Jacobi
iterations  7              9          8             8             4
CPU (sec)   9.2            9.6        4.3           6.9           6.8
GPU (sec)   0.29           0.055      0.098         0.058         0.042

TABLE 7 GMRES+AMG compute time. Gauss-Seidel, poly Gauss-Seidel, and ILU smoothers. N = 1.4 million.

7 CONCLUSIONS

A novel approach was developed for the solution of the sparse triangular systems arising from the L and U factors of an ILU smoother for C-AMG. Previous work by H. Anzt and E. Chow demonstrated that these factors can be highly non-normal matrices, even after appropriate re-ordering and scaling of the linear system Ax = b. When the Jacobi stationary relaxation is applied to solve such non-normal triangular systems, the iterations may diverge. In order to mitigate these effects, either a row or a row/column Ruiz scaling is applied to the U factor at each level of the V-cycle. Our results demonstrated that a reduction of several orders of magnitude or more in the departure from normality dep(U) is possible, thus leading to robust convergence.
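As a small illustration of the quantity being monitored, the sketch below (Python with SciPy; the factor U is a placeholder, and the simple row equilibration shown is only a stand-in for the row and Ruiz scalings used in this work) computes Henrici's departure from normality, dep_F(U) = (||U||_F^2 − Σ_i |λ_i|^2)^{1/2}, before and after scaling; for a triangular factor the eigenvalues are simply the diagonal entries.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def departure_from_normality(U):
    # Henrici's measure dep_F(U) = sqrt(||U||_F^2 - sum_i |lambda_i|^2);
    # for a triangular factor the eigenvalues are the diagonal entries.
    fro2 = spla.norm(U, 'fro') ** 2
    eig2 = np.sum(np.abs(U.diagonal()) ** 2)
    return np.sqrt(max(fro2 - eig2, 0.0))

def row_scale(U):
    # Scale each row by the inverse of its largest magnitude entry
    # (a simple equilibration; a Ruiz iteration alternates rows and columns).
    row_max = np.abs(U).max(axis=1).toarray().ravel()
    D_inv = sp.diags(1.0 / np.where(row_max > 0.0, row_max, 1.0))
    return D_inv @ U, D_inv

# dep_before = departure_from_normality(U)
# U_scaled, _ = row_scale(U)
# dep_after = departure_from_normality(U_scaled)   # typically orders of magnitude smaller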

In order to further improve the efficiency of the PeleLM (F)GMRES+AMG pressure solver on many-core GPU architectures, we implemented C-AMG V-cycles in which the ILU smoother is applied on the fine levels together with a polynomial smoother on the remaining coarse levels.


FIGURE 7 Hypre-BoomerAMG GPU results. Strong-scaling of (F)GMRES+AMG with ILU(0) smoothers using direct and iterative triangular solves. Matrix size N = 11 million.

FIGURE 8 Hypre-BoomerAMG GPU results. Convergence histories of (F)GMRES+AMG with polynomial two-stage Gauss-Seidel and ILU(0) smoothers using direct and iterative triangular solves. Matrix size N = 11 million.

It was found that the convergence rates for the hybrid AMG are almost identical to those obtained using ILU on all levels, thus leading to significant cost reductions. For a large problem, N = 11 million, solved on the NREL Eagle supercomputer, the iterative Jacobi triangular solve for LDU with row scaling led to a five times speed-up over the direct triangular solve within the GMRES+AMG V-cycles. Furthermore, the strong-scaling curve for the solver run time was close to linear.

For the Nalu-Wind CFD pressure continuity equation, an ILUT Schur complement smoother with iterative triangular solves on the local block diagonal systems was applied.

Pressure linear systems from NREL 5 MegaWatt reference turbine simulations were employed to assess numerical accuracy and performance. The linear solver exhibits improved parallel strong-scaling characteristics with this new smoother and improves upon the solver compute times when using a higher number of GPU compute nodes. It was important to remove residual computations from the single iteration of the GMRES-Schur solver in order to reduce the overall execution time. Our future plans include implementing the fixed-point iteration algorithms of Chow and Anzt 1,43 to compute the ILU factorization on GPUs.


FIGURE 9 Hypre-BoomerAMG GPU results. Nalu-Wind NREL 5MW wind turbine mesh. Strong-scaling of (F)GMRES+AMG with the ILUT Schur complement smoother using iterative triangular solves versus the polynomial Gauss-Seidel smoother. Matrix size N = 23 million.

ACKNOWLEDGMENT

This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding was provided by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (ASCR and the NNSA).

References

1. Chow E, and Patel A. Fine-Grained Parallel Incomplete LU Factorization. SIAM Journal on Scientific Computing. 2015;37(2):C169–C193.

2. Anzt H, Chow E, and Dongarra J. Iterative sparse triangular solves for preconditioning. In: European Conference on Parallel Processing. Springer; 2015. p. 650–661.

3. Nonaka A, Bell JB, and Day MS. A conservative, thermodynamically consistent numerical approach for low Mach number combustion. I. Single-level integration. Combust Theor Model. 2018;22(1):156–184.

4. Stüben K. Algebraic multigrid (AMG): an introduction with applications. In: Trottenberg U, and Schuller A, editors. Multigrid. USA: Academic Press, Inc.; 2000.

5. Falgout RD, and Meier-Yang U. hypre: A Library of High Performance Preconditioners. In: International Conference on Computational Science; 2002. p. 632–641.

6. Stüben K. A review of algebraic multigrid. Numerical Analysis: Historical Developments in the 20th Century. 2001; p. 331–359.

7. Chow E, Anzt H, Scott J, and Dongarra J. Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning. Journal of Parallel and Distributed Computing. 2018;119:219–230.

8. Van der Sluis A. Condition numbers and equilibration of matrices. Numerische Mathematik. 1969;14(1):14–23.

9. Bauer FL. Optimally scaled matrices. Numerische Mathematik. 1963;5(1):73–87.


10. Stewart GW. Matrix Algorithms: Volume 1: Basic Decompositions. SIAM; 1998.

11. Chow E, and Saad Y. Experimental study of ILU preconditioners for indefinite matrices. Journal of Computational and Applied Mathematics. 1997;86:387–414.

12. Knight PA, Ruiz D, and Uçar B. A symmetry preserving algorithm for matrix scaling. SIAM Journal on Matrix Analysis and Applications. 2014;35(3):931–955.

13. Henrici P. Bounds for iterates, inverses, spectral variation and fields of values of non-normal matrices. Numerische Mathematik. 1962;4(1):24–40.

14. Falgout RD, Jones JE, and Yang UM. The Design and Implementation of hypre, a Library of Parallel High Performance Preconditioners. In: Bruaset AM, and Tveito A, editors. Numerical Solution of Partial Differential Equations on Parallel Computers. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 267–294.

15. Chow E, and Vassilevski PS. Multilevel block factorizations in generalized hierarchical bases. Numerical Linear Algebra with Applications. 2003;10(1-2):105–127.

16. Thomas SJ, Ananthan S, Yellapantula S, Hu JJ, Lawson M, and Sprague MA. A Comparison of Classical and Aggregation-Based Algebraic Multigrid Preconditioners for High-Fidelity Simulation of Wind Turbine Incompressible Flows. SIAM Journal on Scientific Computing. 2019;41(5):S196–S219.

17. Anzt H, Tomov S, Gates M, Dongarra J, and Heuveline V. Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems. Procedia Computer Science. 2012;9:7–16. Proceedings of the International Conference on Computational Science, ICCS 2012.

18. van der Vorst HA. Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM Journal on Scientific and Statistical Computing. 1992;13(2):631–644.

19. Ruge JW, and Stüben K. Algebraic multigrid. In: Multigrid methods. SIAM; 1987. p. 73–130.

20. Trefethen L, and Embree M. Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press; 2005.

21. Ipsen IC. A note on the field of values of non-normal matrices. North Carolina State University, Center for Research in Scientific Computation; 1998.

22. Elsner L, and Paardekooper M. On measures of nonnormality of matrices. Linear Algebra and its Applications. 1987;92:107–123.

23. Saad Y. Iterative Methods for Sparse Linear Systems, 2nd Ed. SIAM; 2003.

24. Eiermann M. Fields of values and iterative methods. Linear Algebra and its Applications. 1993;180:167–197. Available from: https://www.sciencedirect.com/science/article/pii/0024379593905302.

25. Li R, and Zhang C. Efficient Parallel Implementations of Sparse Triangular Solves for GPU Architecture. In: Proceedings of SIAM Conference on Parallel Proc. for Sci. Comput.; 2020. p. 118–128.

26. Naumov M. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. Tech. Rep. NVR-2011; 2011.

27. Carr A, de Sturler E, and Gugercin S. Preconditioning Parametrized Linear Systems. SIAM Journal on Scientific Computing. 2021;43(3):A2242–A2267.

28. Davis TA, and Hu Y. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS). 2011;38(1):1–25.

29. Saad Y, and Schultz MH. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing. 1986;7(3):856–869.

30. Verbeek M, Cullum J, and Joubert W. AMGToolBox. Los Alamos National Laboratory; 2002.


31. Joubert W, and Cullum J. Scalable Algebraic Multigrid on 3500 processors. Electronic Transactions on Numerical Analysis. 2006;23:105–128.

32. Baker AH, Falgout RD, Kolev TV, and Yang UM. Multigrid Smoothers for Ultraparallel Computing. SIAM J Sci Comput. 2011;33:2864–2887.

33. Baker AH, Falgout RD, Kolev TV, and Meier-Yang U. Scaling Hypre's Multigrid Solvers to 100,000 Cores. London: Springer London; 2012. p. 261–279.

34. Falgout RD, Li R, Sjögreen B, Wang L, and Yang UM. Porting hypre to heterogeneous computer architectures: Strategies and experiences. Parallel Computing. 2021;108:102840. Available from: https://www.sciencedirect.com/science/article/pii/S0167819121000867.

35. De Sterck H, Falgout RD, Nolting JW, and Meier-Yang U. Distance-two interpolation for parallel algebraic multigrid. Numerical Linear Algebra with Applications. 2008;15(2-3):115–139.

36. Li R, Sjogreen B, and Meier-Yang U. A new class of AMG interpolation operators based on matrix-matrix multiplications. To appear, SIAM Journal on Scientific Computing. 2020.

37. Mullowney P, Li R, Thomas S, Ananthan S, Sharma A, Rood J, et al. Preparing an Incompressible-Flow Fluid Dynamics Code for Exascale-Class Wind Energy Simulations. In: Proceedings of Supercomputing 2021. IEEE/ACM; 2021. p. 1–12.

38. Xu T, Li R, and Osei-Kuffuor D. A two-level GPU-accelerated incomplete LU preconditioner for general sparse linear systems. Numer Linear Algebra Appl. 2020; Submitted for publication.

39. Zhang W, et al. AMReX: A Framework for Block-Structured Adaptive Mesh Refinement. Journal of Open Source Software. 2019;4(37):1370.

40. Jonkman J, Butterfield S, Musial W, and Scott G. Definition of a 5-MW reference wind turbine for offshore system development. NREL/TP-500-38060. National Renewable Energy Laboratory; 2009.

41. Sprague MA, Ananthan S, Vijayakumar G, and Robinson M. ExaWind: A multifidelity modeling and simulation environment for wind energy. Journal of Physics: Conference Series. 2020;1452.

42. Yang UM. On long range interpolation operators for aggressive coarsening. Numerical Linear Algebra with Applications. 2010;17:453–472.

43. Anzt H, Ribizel T, Flegar G, Chow E, and Dongarra J. ParILUT - A Parallel Threshold ILU for GPUs. In: International Conference on Parallel Processing and Applied Mathematics. Springer; 2011. p. 133–142.

AUTHOR BIOGRAPHY

How to cite this article: Thomas, S. J., Carr, A. K., Mullowney, P., Świrydowicz, K., and Day, M. (2022), ILU Smoothers for C-AMG with Scaled Triangular Factors, NAME, VOL.