ANALYSIS OF A SPLITTING APPROACH FOR THE PARALLELhomepages.cae.wisc.edu/~negrut/PDFpapers/siamLinSysGPU.pdf · Developing an approach and implementing parallel code for solving sparse

SIAM J. SCI. COMPUT. c© 2017 Society for Industrial and Applied MathematicsVol. 39, No. 3, pp. C215–C237

ANALYSIS OF A SPLITTING APPROACH FOR THE PARALLELSOLUTION OF LINEAR SYSTEMS ON GPU CARDS∗

ANG LI† , RADU SERBAN‡ , AND DAN NEGRUT†‡

Abstract. We discuss an approach for solving sparse or dense banded linear systems Ax = bon a graphics processing unit (GPU) card. The matrix A ∈ RN×N is possibly nonsymmetric andmoderately large, i.e., 10, 000 ≤ N ≤ 500, 000. The split and parallelize (SaP) approach seeks topartition the matrix A into diagonal subblocks Ai, i = 1, . . . , P , which are independently factoredin parallel. The solution may choose to consider or to ignore the matrices that couple the diagonalsubblocks Ai. This approach, along with the Krylov-subspace-based iterative method that it pre-conditions, are implemented in a solver called SaP::GPU, which is compared in terms of efficiencywith three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, whichruns entirely on the GPU except for several stages involved in preliminary row and column permuta-tions, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In acomparison against Intel’s MKL, SaP::GPU also fared well when used to solve dense banded systemsthat are close to being diagonally dominant. SaP::GPU is publicly available and distributed as opensource under a permissive BSD-3 license.

Key words. sparse linear system solution, parallel computing, GPU computing, Krylov-subspace method, preconditioning, work splitting, matrix reordering

AMS subject classifications. 65F10, 65F50, 65Y05, 65Y10

DOI. 10.1137/15M1039523

1. Introduction. Previously used in niche applications and by a small group ofenthusiasts, general purpose computing on graphics processing unit (GPU) cards hasgained widespread popularity after the release in 2007 of the CUDA programmingenvironment [22]. Owing also to the release of the OpenCL specification [26] in2008, GPU computing has been rapidly adopted by numerous groups with computingneeds originating in a broad spectrum of application areas. In several of these areasthough, when compared to the library ecosystem enabling sequential and/or parallelcomputing on x86 chips, GPU computing library support continues to be spotty. Thisobservation motivated the effort whose outcomes are reported in this paper, which isconcerned with solving sparse linear systems of equations on the GPU.

Developing an approach and implementing parallel code for solving sparse linearsystems is not trivial. This and the relative novelty of GPU computing explain thescarcity of solutions for solving Ax = b on the GPU, when A ∈ RN×N is possiblynonsymmetric, sparse, and moderately large, i.e., 10,000 ≤ N ≤ 500,000. An inven-tory of software solutions as of 2015 produced a short list of codes for solving linearsystems on the GPU: cuSOLVER [23], Paralution [18], and SuperLU [7], the latterfocused on distributed memory architectures and leveraging GPU computing at thenode level only. Several CPU multicore approaches exist and are well established; see,for instance, [12, 32, 3, 7]. For a domain-specific application implemented on the GPUthat calls for solving Ax = b, one alternative would be to fall back on one of these

∗Submitted to the journal’s Software and High-Performance Computing section September 14,2015; accepted for publication (in revised form) January 5, 2017; published electronically May 25,2017.

http://www.siam.org/journals/sisc/39-3/M103952.htmlFunding: This work was supported by National Science Foundation grant SI2-SSE–1147337.†Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI 53706

([email protected], [email protected]).‡Mechanical Engineering, University of Wisconsin–Madison, Madison, WI 53706 (serban@

wisc.edu).C215

http://www.siam.org/journals/sisc/39-3/M103952.html

mailto:[email protected]




C216 ANG LI, RADU SERBAN, AND DAN NEGRUT

CPU-based solutions. This strategy usually impacts the overall performance of thealgorithm due to the back-and-forth data movement across the PCI host-device in-terconnect, which in practice presently supports bandwidths of the order of 10 GB/s.Herein, the focus is not on this strategy. Instead, we are interested in carrying outthe solution, including factorization, on the GPU when the possibly nonsymmetricmatrix A is sparse or dense banded with narrow bandwidth.

There are pros and cons to having a linear solver on the GPU. On the upside,since a parallel implementation of an LU factorization is memory bound, particularlyfor sparse systems, the GPU is attractive due to its high bandwidths and relativelylow latencies. At roughly 300 GB/s, the GPU has four to five times higher main-memory bandwidth when compared to a modern multicore CPU. On the downside, theirregular memory access patterns associated with sparse matrix factorization ablatethis GPU-over-CPU advantage, which is further eroded by the intense logic and integerarithmetic requirements associated with existing algorithms. The approach discussedherein alleviates these two pitfalls by embracing a splitting strategy described forCPU-centric multicore and/or multinode computing in [24]. Two successive row-column permutations attempt to increase the diagonal dominance of the matrix andreduce its bandwidth, respectively. Ideally, the reordered matrix would be (i) diagonaldominant and (ii) dense banded. If (i) is accomplished, no pivoting is necessary in theLU factorization, thus avoiding logic and branching tasks at which the GPU does notshine. Additionally, if (ii) holds, coalesced memory access patterns associated withdense matrix operations can capitalize on the GPU’s high bandwidth.

The overall solution strategy adopted herein solves Ax = b using a Krylov-subspace method and employs LU preconditioning with work splitting and drop-off.Specifically, each outer Krylov-subspace iteration takes at least one preconditionersolve step that involves solving Ay = b on the GPU, where A ∈ RN×N is a densebanded matrix obtained from A after a sequence of possibly two reordering stages thatcan include element drop-off. Regardless of whether A is sparse, the salient attributeof the approach is the casting of the preconditioning step as a dense linear algebraproblem. Thus, a reordering process is employed to obtain a narrow-band, dense A,which is subsequently LU-factored. For the reordering, a strategy that combines twostages, namely, diagonal dominance boosting and bandwidth reduction, has yieldedwell-balanced coefficient matrices that can be factored fast on the GPU leveraginga single instruction multiple data (SIMD) friendly underlying data structure. The

LU factorization relies on a splitting of the matrix A in several diagonal blocks thatare factored independently and a correction process to account for the interdiagonalblock coupling. The implementation takes advantage of the GPU’s deep memoryhierarchy, its multi–stream multiprocessor (SM) layout, and its predilection for SIMDcomputation.

This paper is organized as follows. Section 2 summarizes the solution algorithm.The discussion covers first the dense banded matrix case. Subsequently, the sparsecase brings into focus strategies for matrix reordering. Section 3 summarizes aspectsrelated to the GPU implementation of the proposed solution methods. Results fromnumerical experiments for both dense banded and sparse linear systems are reportedin section 4. The paper concludes with a summary of lessons learned and directionsof future work.

2. Description of the methodology.

2.1. The dense banded linear system case. Assume that the banded densematrix A ∈ RN×N has half-bandwidth K � N . Following an approach discussed

ANALYSIS OF SaP SOLUTION OF LINEAR SYSTEMS ON GPU C217

Fig. 1. Factorization of the matrix A with P = 3.

in [28, 24, 25], we partition the banded matrix A into a block-tridiagonal form with

P diagonal blocks Ai ∈ RNi×Ni , where∑Pi Ni = N . For each partition i, let Bi,

i = 1, . . . , P −1, and Ci, i = 2, . . . , P , be the super- and subdiagonal coupling blocks,respectively—see Figure 1. Each coupling block has dimension K × K for bandedmatrices with half-bandwidth K = maxi,j,aij 6=0 |i− j|.

As illustrated in Figure 1, the banded matrix A is expressed as the product of ablock-diagonal matrix D and a so-called spike matrix S [28]. The latter is made upof identity diagonal blocks of dimension Ni and off-diagonal spike blocks, each havingK columns. Specifically,

(2.1) A = DS ,

where D = diag(A1, . . . ,AP ). Assuming that Ai are nonsingular, the so-called leftand right spikes Wi and Vi associated with partition j, each of dimension Ni ×K,are given by

A1V1 =

00

B1

,(2.2a)

Ai [Wi | Vi] =

Ci 00 00 Bi

, i = 2, . . . , P − 1,(2.2b)

APWP =

CP

00

.(2.2c)

Solving the linear system Ax = b is thus reduced to solving

Dg = b,(2.3)

Sx = g .(2.4)

Since D is block-diagonal, solving for the modified right-hand-side g from (2.3) istrivially parallelizable, as the work is split across P processes, each charted to solveAigi = bi, i = 1, . . . , P . Note that the same decoupling is manifest in (2.2) and thework is also spread over P processes.

The remaining question is how to efficiently solve the linear system in (2.4). This

problem can be reduced to one of smaller size, Sx = g. To that end, the spikes Vi


and Wi, as well as the modified right-hand-side gi and the unknown vectors xi in(2.4) are partitioned into their top K rows, the middle Ni−2K rows, and the bottomK rows as

Vi =

V(t)i

V′iV

(b)i

, Wi =

W(t)i

W′i

W(b)i

,(2.5a)

gi =

g(t)i

g′ig(b)i

, xi =

x(t)i

x′ix(b)i

.(2.5b)

A block-tridiagonal reduced system is obtained by eliminating the middle partitionsof the spike matrices, resulting in the linear system

(2.6)

R1 M1

. . .

Ni Ri Mi

. . .

NP−1 RP−1

x1

...xi...

xP−1

=

g1

...gi...

gP−1

,

denoted Sx = g, of dimension 2K(P − 1)� N , where

Ni =

[W

(b)i 0

0 0

], i = 2, . . . , P − 1,(2.7a)

Ri =

[IM V

(b)i

W(t)i+1 IM

], i = 1, . . . , P − 1,(2.7b)

Mi =

[0 0

0 V(t)k+1

], i = 1, . . . , P − 2,(2.7c)

and

(2.8) xi =

[x(b)i

x(t)i+1

], gi =

[g(b)i

g(t)i+1

], i = 1, . . . , P − 1 .

Two strategies are proposed in [24] to solve (2.6): (i) an exact reduction and (ii) anapproximate reduction, which sets Ni ≡ 0 and Mi ≡ 0 and results in a block-diagonalmatrix S. The solution approach adopted herein is based on (ii) and therefore eachsubsystem Rixi = gi is solved independently using the following steps:

Form Ri = IM −W(t)i+1V

(b)i ,(2.9a)

Solve Rix(t)i+1 = g

(t)i+1 −W

(t)i+1g

(b)i ,(2.9b)

Calculate x(b)i = g

(b)i −V

(b)i x

(t)i+1 .(2.9c)

Note that a tilde was used to denote the approximate values x(t)i and x

(b)i obtained

upon dropping the Ni and Mi terms. An approximation of the solution of the originalproblem is finally obtained by solving P systems independently and in parallel usingthe available LU factorizations of the Ai matrices:


A1x1 = b1 −

00

B1x(t)2

,(2.10a)

Aixi = bi −

Cix(b)i−1

00

−

00

Bix(t)i+1

, i = 2, . . . , P − 1,(2.10b)

APxP = bP−

CP x(b)P−1

00

.(2.10c)

Three remarks are in order. First, note that if an LU factorization of the diagonal

blocks Ai is available, the bottom block of the right spike, i.e., V(b)i , can be obtained

from (2.2a) using only the bottom K × K blocks of L and U. However, obtainingthe top block of the left spike requires calculating the entire spike Wi. An effective

alternative is to perform an additional UL factorization of Ai, in which case W(t)i can

be obtained using only the top K ×K blocks of the new U and L.Next, note that the decision to set Ni ≡ 0 and Mi ≡ 0, i.e., using the approximate

reduction rather than the exact reduction, relegates the algorithm to preconditionerstatus. This decision is justified as follows: although the dimension of the reducedlinear system in (2.6) is smaller than that of the original problem, its half-bandwidthis at least three times larger. The memory footprint of exactly solving (2.6), i.e.,employing the exact reduction, is large, thus limiting the size of problems that canbe tackled on the GPU due to the relatively small amount of memory available onthe device. Specifically, at each recursive step, additional memory that is required tostore the new reduced matrix cannot be deallocated until the global solution is fullyrecovered.

Finally, it is manifest that the quality of the preconditioner is dictated by theentries in Ni and Mi. For the sake of this discussion, assume that the matrix A isdiagonally dominant with a degree of diagonal dominance d ≥ 1, i.e.,

(2.11) |aii| ≥ d∑j 6=i

|aij | ∀i = 1, . . . , N .

When d > 1, the elements of the left spikes Wi decay in magnitude from top tobottom; those of the right spikes Vi decay from bottom to top [21]. This decay,which is more pronounced the larger the degree of diagonal dominance, justifies theapproximation Ni ≡ 0 and Mi ≡ 0. However, having A be diagonal dominant,although desirable, is not mandatory, as demonstrated by numerical experiments re-ported herein. Truncating when d < 1 will lead to a preconditioner of lesser quality.

Targeted for execution on the GPU, the methodology outlined above becomes thefoundation of a parallel implementation called herein “split and parallelize” (SaP).The matrix A is split into block-diagonal matrices Ai, which are processed in parallel.The code implementing this strategy is called SaP::GPU. Several flavors of SaP::GPUcan be envisioned. At one end of the spectrum, the solution path would implement theexact reduction; as justified above, this strategy is not pursued herein. At the otherend of the spectrum, SaP::GPU solves the block-diagonal linear system in 2.3 and forpreconditioning purposes uses the approximation x ≈ g. In what follows, this will becalled the decoupled approach, SaP::GPU-D. The middle ground is the approximatereduction, which sets Ni ≡ 0 and Mi ≡ 0. This will be called the coupled approach,


SaP::GPU-C, owing to the coupling that occurs through the truncated spikes, i.e.,

V(b)i and W

(t)i+1.

Neither the coupled nor the decoupled approaches qualify as direct solvers. Con-sequently, SaP::GPU employs an outer Krylov subspace scheme to solve Ax = b.The solver uses BiCGStab(`) [34] and left-preconditioning, unless the matrix A issymmetric and positive definite, in which case the outer loop implements a conjugategradient method [27]. SaP::GPU is open source and available at [30, 31].

2.2. The sparse linear system case. The discussion focuses next on solvingAsx = b, where As ∈ RN×N is assumed to be sparse. The salient attribute ofthe solution strategy is its fallback on the dense banded approach of section 2.1.Specifically, several permutation steps are employed to transform As into a matrixA that has a large d and small K. Although the reordered matrix will remain sparsewithin the band, it will be regarded as dense banded and LU- and/or UL-factoredaccordingly.

When As is nonsymmetric or has low d, a first reordering, called diagonal boosting(DB), is applied as QAsx = Qb to maximize the product of the absolute values ofthe diagonal entries [9, 10]. This is done via a depth first search with a look-aheadtechnique similar to the one in the Harwell Software Library (HSL) [12]. A detaileddescription of the DB implementation in SaP is available in [16]. Therein, we providea performance comparison against HSL on a set of more than 100 matrices.

While the purpose of the first reordering QAs is to yield a diagonally “heavy”matrix, a second reordering seeks to reduce K by using the traditional Cuthill–McKee(CM) algorithm [6]. Since the diagonal entries should not be relocated, the second per-mutation is applied to the symmetric matrix QAs+AT

s QT . A detailed description ofthe CM implementation in SaP is available in [15]. Therein, we provide a performancecomparison against HSL on a set of more than 130 matrices.

Following these two reorderings, the resulting matrix A is split to obtain A1

through AP . A third CM reordering is then applied to each Ai for further bandwidthreduction. Straightforward to implement in SaP::GPU-D, this third-stage reorder-ing in SaP::GPU-C mandates computation of the entire spikes, an operation thatcan significantly increase the memory footprint and flop count of the numerical solu-tion. Note that third-stage reordering in SaP::GPU-C renders the UL factorizationsuperfluous since computing only the top of a spike is insufficient.

If Ai is diagonally dominant, the LU and/or UL factorization can be safely carriedout without pivoting [11]. We always perform factorizations of the diagonal blocksAi without pivoting but with pivot boosting. Specifically, if a pivot becomes smallerthan a threshold value, it is boosted to a small, user controlled value ε. This yields afactorization of a slightly perturbed diagonal block, LiUi = Ai+δAi, where ‖δAi‖ =O(u‖A‖) and u is the unit roundoff [19]. The semantics of diagonal boosting and pivotboosting are different. Herein, the former refers to a matrix reordering; the latterconcerns a modification of a pivot during the LU process.

3. Brief implementation details. This section discusses SaP::GPU imple-mentation details for solving both dense banded and sparse linear systems. A dis-cussion of the DB and CM reordering algorithms and their GPU implementation fallsoutside the scope of this contribution. The interested reader is referred to [17].

3.1. Dense banded matrix factorization details. This subsection providesimplementation details regarding how the P partitions Ai are determined, how thebanded matrix A is stored, and how the LU/UL steps are implemented on the GPU.


Number of partitions and partition size. The selection of P must strikea balance between two conflicting requirements. On the one hand, having a largeP is attractive given that the LU/UL factorization of Ai for i = 1, . . . , P can bedone independently and simultaneously. On the other hand, a large P negativelyimpacts the quality of the resulting preconditioner due to the increase in the numberof instances in which the coupling of the diagonal blocks Ai and Ai+1 is approximated.In the current implementation, no attempt is made to optimally select P and someexperimentation is required. Given a P value, the size of the diagonal blocks Ai isselected to achieve load balancing. The first Pr partitions are of size bN/P c+1, whilethe remaining are of size bN/P c, where N = P bN/P c+ Pr.

Matrix storage. Dense banded matrices Ai are stored in a “tall and thin”column-major order. All diagonal elements are stored in the Kth column. The rest ofthe elements are correspondingly distributed columnwise. This strategy, shown belowfor a matrix with N = 8 and K = 2, groups the operands of the LU/UL factorizationsand allows coalesced memory accesses that can fully leverage the GPU’s bandwidth.

∗ ∗ a11 a21 a31∗ a12 a22 a32 a42a13 a23 a33 a43 a53a24 a34 a44 a54 a64a35 a45 a55 a65 a75a46 a56 a66 a76 a86a57 a67 a77 a87 ∗a68 a78 a88 ∗ ∗

LU/UL factorizations. The solution strategy pursued calls for an LU and an

optional UL factorization of each dense banded diagonal block Ai. As i = 1, . . . , P ,these P LU/UL factorizations can proceed independently and in parallel. One impor-tant question is whether Ai can be factored using one block of threads. The answeris relevant given the GPU’s lack of native, low overhead, support for synchroniza-tion between threads running in different blocks. The established GPU strategy forinterblock synchronization is “exit and launch a new kernel.” This guarantees syn-chronization at the GPU-grid level at the cost of nonnegligible overhead. In a trade-offbetween minimizing the overhead of kernel launches and maximizing the occupancyof the GPU, we adopted two, mutually exclusive, execution paths: one for K < 64and one for larger bandwidths. As a side note, the threshold value of 64 was selectedthrough numerical experimentation over a variety of problems and was dictated bythe number of threads that can be organized in a block in CUDA [22].

For K < 64, the code was designed to reduce the kernel launch count. In-stead of having Ni − 1 kernel launches, each completing a step of the factorization ofAi = LiUi by updating entries in a (K + 1)× (K + 1) window of elements, a singlekernel is launched to factor Ai. It uses min(K2, 1024) threads per block and relieson low-overhead SM synchronization support within the block, without any need forglobal synchronization. In a so-called window-sliding method, at each step of the fac-torization, i.e., during the process of computing column entries in L and row entriesof U, each thread updates a fixed number of Ai entries. On current GPU hardware,this fixed number is between 1 and 4. Once all threads in the block complete theirwork, they are synchronized and the (K + 1) × (K + 1) window slides down by onerow and to the right by one column. The value 4 is explained as follows. Assumethat K = 63. Then, the sliding window has size 64 × 64. Since the two-dimensional


GPU thread block size is 1024 = 32× 32, each thread will handle four entries of thewindow of focus.

For K ≥ 64, SaP uses multiple blocks of threads to update L and U entries. Onthe upside, there are more threads working on the window of focus. On the downside,there is overhead associated with leaving and reentering the kernel, a process thathas the side effect of flushing the shared memory and registers. The window is largerthan K ×K, and it slides at a stride of eight, i.e., moves down by eight rows and tothe right by eight columns upon exiting and reentering the LU factorization kernel.

Use of registers and shared memory. If the user decides to employ a third-stage reordering, the coupling subblocks Bi and Ci are used to compute the entirespikes in a scheme that renders a UL factorization superfluous. In this case, Bi andCi are each first partitioned into subblocks of dimension L ×K, where L is at most20. Each forward/backward sweep to get the spikes is unrolled, and in each iterationof the new loop, one entire subblock, rather than a vector of length K, is calculated.To this end, the corresponding elements in the matrix Ai are prefetched into sharedmemory and the entries of the subblock are preloaded into registers. This strategy,in which all operations to calculate the spikes draw on registers and shared memory,leads to 50% to 70% improvement in performance when compared with an alternativethat calculates the spike elements in a loop without leveraging the low latency/highbandwidth of the GPU register file and shared memory.

Mixed precision strategy. SaP::GPU uses a mixed-precision implementation.Single precision is used in obtaining and applying the preconditioner. Double precisionis used in the sparse matrix vector multiplications embedded in the Krylov iterativecomponent of the solution. A battery of tests indicates that this strategy results ina 50% average reduction in time to solution when compared with an approach whereall calculations are performed in double precision.

3.2. SaP::GPU components and computational flow. The SaP::GPU densebanded linear system solver is relatively straightforward to implement. Upon parti-tioning A into diagonal blocks Ai, each Ai is subject to an LU factorization thatrequires an amount of time TLU . Next, in TBC time, the coupling block matrices Bi

and Ci are extracted on the GPU. The Vi and Wi spikes are subsequently computedin an operation that requires TSPK time. Afterward, in TLUrdcd time, the spikesare truncated and the steps outlined in (2.9) are taken to produce the intermediary

values x(t)i and x

(b)i . At this point, the preprocessing step is over and two sets of fac-

torizations, for Ai and Ri, are available for preconditioning during the iterative phaseof the solution. The amount of time spent iterating is TKry, the iterative methodsconsidered being BiCGStab(2) and conjugate gradient.

The SaP::GPU sparse linear system solution is slightly more convoluted at thefront end. A sequence of two permutations, DB requiring TDB and CM requiring TCMtime, is carried out to increase the size of the diagonal elements and reduce bandwidth.An additional amount of time TDrop might be spent to drop off-diagonal elements inorder to decrease the bandwidth of the reordered A matrix. Since the DB and CMreorderings are hybrid, TDtransf is used to keep track of the overhead associated withmoving data back and forth between the CPU and GPU during the reordering process.An amount of time TAsmbl is spent on the GPU in bookkeeping required to turn thereordered sparse matrix into a dense banded matrix.

The definition of the times to solution is tied to the computational flow in Figure 2,where subscripts d and s are used to differentiate between the dense and sparse paths,respectively. For a sparse linear system solve that uses the coupled approach, i.e.,


Fig. 2. Computational flow for SaP::GPU. SaP::GPU-Dd is the decoupled dense solver;SaP::GPU-Cd is the coupled dense solver. The analogue sparse solvers have an s subscript. Thethird-stage reordering in square brackets is optional but typically leads to substantial reductions intime to solution.

SaP::GPU-Cs, the total time is TTotSparse = TPrepSp + TTotDense, where TPrepSp =TDB + TCM + TDtransf + TDrop + TAsmbl and TTotDense = TLU + TBC + TSPK +TLUrdcd + TKry. For SaP::GPU-Dd, owing to the decoupled nature of the solution,TTotDense = TLU+TKry, where TLU includes a CM process that reduces the bandwidthof each Ai. The names introduced, i.e., TDB , TCM , TLUrdcd, etc., are referenced inthe profiling study discussed in section 4.2.1 and used ad verbum on the SaP::GPUweb page [31] to report profiling results for approximately 120 linear systems.

3.3. Solver parameter selection aspects. No analytically backed strategy isavailable in SaP::GPU to select (i) the bandwidth K that a sparse linear systemshould be solved with, (ii) the number of partitions P , or (iii) the SaP::GPU-C orSaP::GPU-D flavor. These choices are both problem and GPU architecture specific.Insofar as the architecture is concerned, GPUs with a large number of SM will workbest with large values of P , which will ensure that each SM will be assigned a task andavoid hardware underutilization. For sparse linear systems, the user has to specifythe preferred K value; if none is specified, the default value implies no drop-off, whichcan lead to large bandwidths. Note that K selection is relevant only in the contextof solving sparse linear systems; for dense banded, K is a given. We rely on a SaPutility that for a given linear system, via a parametric study, helps the user identifythe best P , K (for sparse problems) and SaP::GPU-C versus SaP::GPU-D choice.Obviously, if one is interested in solving a linear system once, this utility is of nouse—the solver will fall back on default choices, which can be controlled throughthe SaP interface. However, when SaP is used in the context of GPU applicationswhere numerous and qualitatively similar matrices are handled, the upfront effortof identifying a suitable parameter choice pays off. For instance, SaP::GPU is wellpositioned to tackle nonlinear computational dynamics applications where one wouldrepeatedly solve a nonlinear algebraic system using a Newton method [33]. The lattercalls for the solution of a sparse linear system several times per time step. Then,presuming that the dimension of the matrix does not change over time and that thesparsity pattern stays roughly the same, a set of P , K and solver flavor determinedat the beginning of the analysis is used throughout the simulation.


4. Numerical experiments. The next two subsections summarize results fromtwo numerical experiments concerned with the solution of dense banded and sparselinear systems, respectively. The hardware/software setup for these numerical exper-iments is as follows. The GPU used was Tesla K20X [2, 1]. SaP::GPU uses CUDA7.0 [22], cusp [4], and Thrust [5]. The CPU used was the 3 GHz, 25 MB lastlevel cache, Intel Xeon E5-2690v2. The node used hosted two such CPUs, which isthe maximum possible for this type of chip, for a total of 20 cores executing up to40 HTT threads; the amount of memory per node was 64 GB. When invoking themultithreaded libraries Intel MKL v-13.0.1, PARDISO [32], MUMPS [3], or SuperLU[7], they used both CPUs on the node. SaP always ran on one GPU card.

When reporting below the results of several numerical experiments, one legiti-mate question is whether it is justifiable to compare performance results obtained onone GPU with results obtained on two multicore CPUs. The multicore CPU is notthe fastest, as Intel chips with more cores are available. Additionally, the Intel chip’smicroarchitecture is not Haswell, which is more recent than the Ivy Bridge microar-chitecture of the Xeon E5-2690v2. Likewise, on the GPU side, one could have useda Tesla K80 card, which has roughly four times more memory than K20x and twiceits memory bandwidth. Moreover, pricewise, the K80 would have been closer to thecost of two CPUs than K20x was. Finally, Kepler is not the latest microarchitectureeither, as Nvidia released the Maxwell architecture in early 2014 and Pascal in 2016.We do not attempt to answer these questions and hope that the interested reader willmodulate this study’s conclusions by factoring in unavoidable CPU–GPU hardwaredifferences. No claim is made herein of one architecture being superior since such aclaim could be easily proved wrong by moving from algorithm to algorithm or fromdiscipline to discipline. The sole purpose of this section is to gauge the performanceof SaP::GPU relative to that of existing solvers, in other words, to understand wherethis GPU solver fits in the big picture and what levels of expectations one should havein relation to its use. In the interest of full disclosure, we point out that being aniterative solver, some parameters need to be set for SaP, i.e., P , K, and strategy, i.e.,SaP::GPU-C versus SaP::GPU-D—see the discussion in section 3.3. Herein, we relyon a SaP utility to select these parameters, which are both problem and hardwaredependent. This might give SaP an unfair advantage comparing against PARDISO,MUMPS, and SuperLU as no attempt has been made to optimize the behavior of thesedirect solvers. A GPU-to-GPU comparison was also carried out and the results arereported in section 4.2.3. We found the performance of the GPU alternative, i.e.,cuSOLVER, lacking to the point where it could not provide that “placing SaP in thebig picture” aspect that was met by using the CPU and direct solvers mentionedabove.

Unless otherwise stated, all times reported below are in seconds and were ob-tained on a dedicated machine. In an attempt to avoid warm-up overhead, the resultsreported represent averages that drew on multiple successive identical runs. SaP stopsits iterative solution as soon as the condition

(4.1) ||Ax− b||2 ≤ atol + rtol · ||b||2

is satisfied; here x is a solution candidate produced during the iterative process.Unless specified by the user, SaP uses the default values atol = 0 and rtol = 10−10.These default values were used for all numerical experiments reported herein.

4.1. Numerical experiments related to dense banded linear systems.SaP::GPU-Dd and SaP::GPU-Cd are entirely implemented on the GPU. The


0 10 20 30 40 50 60 70 80 90 100

1

2

3

4

P

Exec

tim

e(s

)

SaP::GPU-CSaP::GPU-D

Fig. 3. Time to solution as a function of the number of partitions P . Study carried out for adense banded linear system with N = 200,000, K = 200, and d = 1.

subscript d will be dropped for convenience. This discussion draws on ample nu-merical experiments reported in [13].

4.1.1. Sensitivity with respect to P . The results in Figure 3 summarize thesensitivity of the time to solution with respect to the number of partitions P . Thisbehavior, namely, relatively small gains after a threshold value of P , is typical. It isinstructive to see how the solution time is spent by SaP::GPU-C and SaP::GPU-Dand understand how changing P influences this distribution of the time to solutionbetween the major implementation components. The results in Table 1 provide thisinformation as they compare the coupled and decoupled strategies in regards to thefactorization times, Dpre versus Cpre; number of iterations in the Krylov solver, Dit

versus Cit; amount of time spent iterating to find the solution, DKry versus CKry;and the total times, DTot versus CTot. These times are defined as Dpre = TLU ,Cpre = TLU + TBC + TSPK + TLUrdcd, DTot = Dpre + DKry, and CTot = Cpre +CKry. Note that for SaP::GPU, the number of iterations is reported in incrementsof 0.25, indicating convergence at one of four possible points during a BiCGStab(2)iteration.

The number of iterations to convergence suggests that the quality of the coupledversion of the preconditioner is superior. Yet the cost for getting this better precon-ditioner is higher and SaP::GPU-D ends up winning by taking as little as half thetime required by SaP::GPU-C. When the same factorization is used multiple times,this conclusion could change since the metric that controls the performance would beDKry and CKry, or its number of iterations for convergence proxy. Also note thatthe return on increasing the number of partitions gradually fades away and for thecoupled strategy there is no reason to go beyond P = 50.

4.1.2. Sensitivity with respect to d. Next, we report on the performance ofSaP::GPU for a dense banded linear system with N = 200,000 and K = 200, fordegrees of diagonal dominance in the range 0.06 ≤ d ≤ 1.2; see (2.11). The entriesin the matrix are randomly generated and P = 50. The findings are summarized inFigure 4, where SaP::GPU-C and SaP::GPU-D are compared against the banded


Table 1Performance comparison over a spectrum of number of partitions P for coupled (C) ver-

sus decoupled (D) strategies in SaP::GPU. All timings are in milliseconds. Problem parameters:N = 200,000, d = 1, K = 200. The symbols used are as follows: Dpre, amount of time spent in pre-processing by the decoupled strategy; Dit–number of Krylov iterations for convergence; DTot, amountof time to converge. Similar values are reported for the coupled scenario. SpdUp = DTot/CTot.

P Dpre Cpre Dit Cit DKry CKry DTot CTot SpdUp

2 1,016.8 1,987.6 1.75 0.75 2,127 1,742.4 3,143.8 3,730 0.843 803.7 1,672.5 1.75 0.75 1,446.4 1,179.2 2,250.1 2,851.7 0.794 694.7 1,480.7 1.75 0.75 1,105.9 896.3 1,800.6 2,377 0.765 630.1 1,371.5 1.75 0.75 900.1 722.7 1,530.2 2,094.2 0.736 595.1 1,304.4 1.75 0.75 766.1 611.3 1,361.2 1,915.7 0.718 535 1,210.5 1.75 0.75 593.2 471 1,128.3 1,681.5 0.6710 500 1,166.7 1.75 0.75 491 385.6 991.1 1,552.4 0.6420 442 1,099.9 1.75 0.75 290.2 220.4 732.1 1,320.3 0.5530 432.7 1,098.5 1.75 0.75 225 167.7 657.8 1,266.2 0.5240 410.2 1,087.2 1.75 0.75 186.9 141 597.1 1,228.2 0.4950 403.5 1,094.8 1.75 0.75 166.6 125.1 570.2 1,219.9 0.4760 408.4 1,115.9 1.75 0.75 152.7 113.7 561.1 1,229.6 0.4670 405 1,126.7 1.75 0.75 148.8 105.7 553.8 1,232.4 0.4580 397.3 1,132.9 1.75 0.75 137.7 101.7 535 1,234.6 0.4390 397 1,151.4 1.75 0.75 133.5 101.9 530.5 1,253.3 0.42100 387.8 1,155.9 1.75 0.75 131.6 101.8 519.4 1,257.6 0.41

linear solver in MKL. When d > 1 the impact of the truncation becomes increasinglyirrelevant—a situation that places SaP::GPU at an advantage. As such, there is noreason to go beyond d = 1.2 since, if anything, the results will get better. The moreinteresting range is d < 1, when the diagonal dominance requirement is violated. TheSaP::GPU solver demonstrates uniform performance over a wide range of degreesof diagonal dominance. For instance, SaP::GPU-C typically required less than oneKrylov iteration for all d > 0.08. As the degree of diagonal dominance decreasesfurther, the number of iterations and hence the time to solution increase significantlyas a consequence of truncating the spikes that now contain nonnegligible values.

It is instructive to see how the solution time is spent by SaP::GPU-C andSaP::GPU-D and understand how changing d influences the split of the time tosolution between the major components of the implementation. The results reportedin Table 2 provide this information as they help answer the following question: canone still use a decoupled approach for matrices that are far from being diagonal dom-inant? The answer is yes, except in the most extreme case, when d = 0.06. Note thatthe number of iterations to convergence for the decoupled approach quickly recoversaway from small values of d. In the end, the same SaP::GPU-D over SaP::GPU-C2× speedup factor is obtained virtually over the entire spectrum of d values.

4.1.3. Comparison with Intel’s MKL over a spectrum of N and K.This section summarizes results of a two-dimensional sweep over N and K. In thisexercise, prompted by the results reported in Figures 3 and 4, we fixed P = 50 andchose matrices for which d = 1. Each row in Table 3 lists the value of N , which runsfrom 1000 to 1,000,000. Each column lists the dimension of half-bandwidth K, whichruns from 10 to 500. Each table row is split into three subrows: SaP::GPU-D resultsare reported in the first subrow, SaP::GPU-C in the second subrow, and MKL in thethird subrow, All timings are in milliseconds. “OOM” stands for “out-of-memory”—a situation that arises when SaP::GPU exhausts during the solution of the linearsystem the GPU’s 6 GB of global memory.


Table 2Influence of d for coupled (C) versus decoupled (D) strategies in SaP::GPU (N = 200,000,

P = 50, K = 200). All timings are in milliseconds. Symbols used are as specified for Table 1.

d Dpre Cpre Dit Cit DKry CKry DTot CTot SpdUp

6·10−2 402.5 1,098.1 353.25 4.25 25,344.3 525.5 25,746.8 1,623.6 15.868·10−2 403.6 1,097.3 8.75 0.75 675.3 128 1,079 1,225.3 0.880.1 403.5 1,096.9 6.25 0.75 492.6 128.4 896.1 1,225.2 0.730.2 403.4 1,097.5 3.75 0.75 312.1 127.3 715.6 1,224.8 0.580.3 404.7 1,096.7 2.75 0.75 248.9 127.2 653.6 1,223.9 0.530.4 404 1,096.8 2.75 0.75 240.6 127.4 644.6 1,224.2 0.530.5 404.4 1,094.9 2.25 0.75 236.7 125.3 641 1,220.2 0.530.6 404 1,096.9 2.25 0.75 202.1 127.5 606.1 1,224.4 0.50.7 403.4 1,097.6 2.25 0.75 200.1 128.3 603.5 1,225.9 0.490.8 402.4 1,097.1 2.25 0.75 197.5 128.3 599.9 1,225.5 0.490.9 403.5 1,096.7 1.75 0.75 162.3 127.3 565.8 1,224 0.461 402.6 1,097.6 1.75 0.75 162.5 127.4 565.2 1,225 0.461.1 402.5 1,097.1 1.75 0.75 162.4 128.3 564.9 1,225.4 0.461.2 403.1 1,097.2 1.75 0.75 172 128 575.1 1,225.2 0.47

Table 3Performance comparison, two-dimensional sweep over N and K for P = 50 and d = 1. For each

value N , the three rows correspond to the SaP::GPU-D, SaP::GPU-C, and MKL solvers, respectively.

NK

10 20 50 100 200 500

10002.433E1 1.755E1 1.816E1 2.067E1 2.755E1 2.952E1

6.637 7.354 1.106E1 1.866E1 2.937E1 2.955E11.145E1 1.080E1 1.281E1 2.208E1 2.145E2 2.208E2

20002.224E1 1.873E1 1.911E1 2.149E1 2.725E1 5.638E1

6.158 8.514 1.328E1 2.464E1 3.569E1 9.514E11.255E1 1.100E1 1.324E1 2.214E1 2.214E2 2.357E2

50002.517E1 2.062E1 2.101E1 2.327E1 3.259E1 8.002E1

7.597 9.266 1.622E1 3.049E1 5.866E1 2.372E21.307E1 1.233E1 2.145E1 3.827E1 2.531E2 2.944E2

10,0002.823E1 2.758E1 2.385E1 2.686E1 4.509E1 1.183E21.019E1 1.168E1 1.887E1 4.561E1 1.060E2 4.737E21.560E1 1.509E1 2.959E1 5.881E1 3.009E2 3.928E2





500,0003.720E2 3.651E2 4.425E2 7.240E2 1.411E3 OOM2.037E2 2.380E2 4.424E2 1.229E3 2.928E3 OOM2.135E2 2.924E2 8.969E2 2.539E3 6.231E3 1.071E4

1,000,0007.242E2 7.092E2 9.788E2 1.442E3 OOM OOM3.970E2 4.633E2 8.640E2 2.443E3 OOM OOM3.486E2 5.692E2 1.778E3 4.712E3 1.137E4 2.159E4


0.1 0.3 0.5 0.7 0.9 1.10

2

4

6

Sp

eed

up

SaP::GPU-C over MKLSaP::GPU-D over MKL

0.1 0.3 0.5 0.7 0.9 1.1

1

2

3

4

d

Exec

tim

e(s

)

SaP::GPU-CSaP::GPU-DMKL

Fig. 4. Influence of the diagonal dominance d, with 0.06 ≤ d ≤ 1.2, for fixed values N =200,000, K = 200, and P = 50.

The results reported in Table 3 are statistically summarized in Figure 5, whichprovides SaP over MKL speedup information. Assume that a test “α” successfullyran to completion in SaP::GPU-D, requiring T SaP::GPU−D

α , and/or in SaP::GPU-C,requiring T SaP::GPU−C

α . By convention, in case of failing to solve, a negative value(−1) is assigned to T SaP::GPU−D

α or T SaP::GPU−Cα . If a test “α” runs to completion in both

SaP and MKL, the speedup value used to generate Figure 5 is sBD ≡ T MKLα /T SaP

α , whereT MKLα is MKL’s time to solution and T SaP

α ≡ min(max(T SaP::GPU−Dα , 0),max(T SaP::GPU−C

α , 0)).Given that N assumes 10 values and K takes 6 values, “α” can be one of 60 tests.Since three (N,K) tests, namely, (1,000,000,200), (1,000,000,500), and (500,000,500),failed to solve in SaP, the sample population for the statistical study in Figure 5 is57. Of 57 tests, sBD > 1 in all but two cases: for (1,000,000,10) when sBD = 0.87825,and for (2000,50) when sBD = 0.99706. The highest speedup was sBD = 8.1255,for (2000,200). The median is slightly higher than 2.0, which indicates that of the57 tests, half were completed by SaP two times faster than by MKL. The figure alsoshows that about 25% of the tests run approximately three to six times faster in SaP.The crosses in the figure represent outliers.


Fig. 5. SaP speedup over Intel’s MKL—statistical analysis based on values in Table 3.

4.2. Numerical experiments related to sparse linear systems.

4.2.1. Profiling results. Figure 6 plots statistical results that summarize howthe time to solution, i.e., finding x in Ax = b, is spent in SaP::GPU. The raw dataused in this analysis is available online [31]. A discussion of exactly what “finding thesolution of the linear system” means is postponed to section 4.2.3. The labels used inFigure 6 are inspired by the notation used in section 3.2 and Figure 2. Consider, forinstance, the diagonal boosting reordering DB employed by SaP. The percent of timeto solution spent in DB is represented using a median-quartile method to measurestatistical spread. The raw data used to generate the DB box was obtained as follows.If a test “α” that runs to completion requires TDBα > 0 for DB completion, then thistest will generate one data entry in an array of data subsequently used to produce thestatistical result. The actual entry that is used is 100×TDBα /TTotα , where TTotα is thetotal amount of time that test “α” takes for completion. In other words, the entry isthe percent of time spent when solving this particular linear system for performingthe diagonal boosting reordering. The bars for the K-reducing reordering (CM), formultiple data transfers between CPU and GPU (Dtrsf), etc., are similarly obtained.Not all bars in Figure 6 were generated using the same number of data entries; i.e.,some tests contributed to some, but not all bars. For instance, a symmetric positivedefinite linear system requires no DB step and so this test won’t contribute an entryto the array of data used to determine the DB box. Of a batch of 85 tests thatran to completion with SaP, the sample population used to generate the bars isas follows: 85 data points for CM, Dtrsf, and Kry; 63 data points for DB; 60 forLU; 32 data points for Drop; and 9 data points for BC, SPK, and LUrdcd. Thesecounts provide insight into the solution path adopted by SaP in solving the 85 linearsystems. For instance, the coupled approach; i.e., the SPIKE method of [24], hasbeen employed in the solution of 9 of the 85 linear systems. The rest were solved viaSaP::GPU-D. Of 85 linear systems, 25 were most effectively solved by SaP resortingto diagonal preconditioning, i.e., after DB all the entries were dropped off except theheavy diagonal ones. Also, note that several of the linear systems considered weresymmetric positive definite, from where the 60 points count for DB.

A statistical analysis of the time spent in the Krylov-subspace component of thesolution reveals that the median time was 55.84%. The median times for the othercomponents of the solution are listed in the first row of data in Table 4. The secondrow of data provides the median values when the Krylov-subspace component, whichdwarfs most of the solution components, is eliminated. In this case, the entry forDB, for instance, was obtained based on data points 100 × TDBα /TTotα , where thistime around TTotα included everything except the time spent in the Krylov-subspacecomponent of the solution. In other words, TTotα is the time required to compute fromscratch the preconditioner. The median values should be used in conjunction withthe median-quartile boxplot of Figure 6 for the first row of data and Figure 7 for thesecond row of data. Consider, for instance, the results associated with the drop-off


Table 4Median information for the SaP solution components as % of the time for solution. Two

scenarios are considered: the first data row provides values when the total time, i.e., 100%, includedthe time spent by SaP in the Krylov-subspace component. The second row of data is obtained byconsidering 100% to be the time required to compute a factorization of the preconditioner. Note thatvalues in each row of data do not add up to 100% for several reasons. First, these are statisticalmedian values. Second, there are very many tests that do not include all the components of thesolution. For instance, SPK is computed based on a set of 9 points, while Drop is computed using32 data points.

DB CM Dtransf Drop Asmbl BC LU SPK LUrdcd

3.4 1.4 1.9 4.1 0.7 1.4 24.8 23 4.111.4 3.7 4.1 25.5 2.7 2.3 73.4 41.8 6.4

Fig. 6. Profiling results obtained for a set of 85 linear systems that, of a collection of 114,could be solved by SaP::GPU.

operation. In the Krylov-inclusive measurement, Drop has a median of 4.1%; i.e., halfof the 32 tests which employed drop-off spent more than that amount in performingthe drop-off, while half were quicker. The spread is rather large and there are severaloutliers that suggest that a handful of tests require a very large amount of time bespent in the drop-off part of the solution.

The results in Figure 6 and Table 4 suggest where the optimization efforts shouldconcentrate in the future. For instance, the time required for the CPU↔GPU datatransfer is, in the overall picture, rather insignificant. Somewhat unexpected, theamount of time spent in drop-off came out higher than anticipated, at least in relativeterms. One caveat is that no effort was made to optimize this component of thesolution. Instead, the effort went into optimizing the DB and CM solution components.This paid off, as matrix reordering in SaP, particularly for large matrices, is fastwhen compared to Harwell and it reached the point where the drop-off became amore significant bottleneck. Another unexpected observation was the relatively smallnumber of cases in which SaP::GPU-C was preferred over SaP::GPU-D, i.e., in whichthe SPIKE strategy [24] was employed. This observation, however, should not begeneralized as it might be specific to the SaP implementation. Indeed, it simply states


Fig. 7. Profiling results obtained for a set of 85 linear systems that, of a collection of 114,could be solved by SaP::GPU.

that in the current implementation, a large number of iterations associated with aless sophisticated preconditioner is preferred to a smaller count of expensive iterationsassociated with SaP::GPU-C. Of a sample population of 85 tests, when SaP::GPU-Cwas used, the median number of iterations to solution was 6.75. Conversely, whenSaP::GPU-D was preferred, the median count was 29.375 [31].

4.2.2. Comparison against state of the art. A set of 114 matrices [29],of which 105 are from the Florida matrix collection, is used herein to compare thetime to solution and robustness of SaP, PARDISO, SuperLU, and MUMPS. In thiscontext, robustness is regarded as a qualitative measure of the ability of a solver tofind the solution of linear systems whose coefficient matrices (i) have widely differentdimensions N and (ii) are associated with practical problems that originate in vastlydifferent application areas. The 114 matrices were selected on the following basis: atleast one of the four solvers can retrieve the solution x within 1% relative accuracy.For a sparse linear system Ax = b, this relative accuracy was measured using themethod of manufactured solutions: an exact solution x? was first chosen and thenthe right-hand side was set to b = Ax?. Each sparse linear solver s attemptedto produce an approximation xs of the solution x?. If this approximation satisfied||xs − x?||2/||x?||2 ≤ 0.01, then the solve was considered to have been successful. Forall tests, x? had its entries roughly distributed on a parabola starting from 1.0 as thefirst entry, approaching the value 400 at N/2, and decreasing to 1.0 for the Nth andlast entry of x?. Figure 8 employs a median-quartile method to measure the statisticalspread of the 114 matrices used in this sparse solver comparison. In terms of size,N is between 8192 and 4,690,002. In terms of nonzeros, nnz is between 41,746 and46,522,475. The median for N is 71,328. The median for nnz is 1,167,967.

All solvers have been used in all tests “out of the box,” i.e., given a linear system,no attempt was made to change parameters that might optimize the process of findinga solution—the solvers were used with default values. For SaP, this meant atol = 0and rtol = 10−10—see (4.1). Given that SaP::GPU is an iterative solver, in many


Fig. 8. Statistical information—the dimension N and number of nonzeros nnz for the 114coefficient matrices used to compare SaP, PARDISO, SuperLU, and MUMPS.

instances the initial guess can be selected to be relatively close to the actual solution.This is avoided here by choosing as initial guess x(0) = 0N , i.e., far from x?. Againstthis backdrop, is the quality of the solution produced by solver s1 with its defaultsettings comparable with what s2 produces with its very own default settings? Thisquestion is answered in Figure 9, which can be interpreted as a cumulative probabilitydistribution. For a specified value ε, the plot shows for SaP, MUMPS, and SuperLUwhat fraction of the 114 tests were solved by a solver s at a value of relative errortighter than ε, i.e., so that ||xs−x?||2/||x?||2 < ε. For instance, using default settings,about 45%, 55%, and 70% of the tests were solved with a relative error smaller thanor equal to ε = 10−8 by SaP, MUMPS, and SuperLU, respectively. One thing thatis not apparent from this plot is the fact that when one solver produced a solutionwith a larger relative error, i.e., lower quality approximation, more often than not theother solvers produced low quality approximations as well.

In terms of robustness, SaP::GPU failed on 28 linear systems. In 23 cases, SaPran out of GPU memory. In the remaining five cases, SaP::GPU failed to converge.The rest of the solvers failed as follows: PARDISO, 40 times; SuperLU, 22 times; andMUMPS, 35 times. These results should be qualified as follows. The GPU card had 6GB of GDDR5-type memory. Given that in its current implementation SaP::GPU isan in-core solver, it does not swap data in and out of the GPU. Consequently, in 23instances it ran against this memory-size hard constraint. This issue can be partiallyalleviated by considering a better GPU card. Indeed, there are cards that have asmuch as 24 GB of memory, which yet comes short of the 64 GB that PARDISO,SuperLU, and MUMPS could tap into. There were three main failures modes forPARDISO, SuperLU, and MUMPS, i.e., the direct solvers. They produced a solutionthat was not within 1% accuracy, they seg faulted, or they started solving the problembut hung.


Fig. 9. Cumulative distribution of relative errors reached by SaP, MUMPS, and SuperLU whenused with defult settings to solve 114 linear systems [29].

In terms of speed, PARDISO was the fastest, followed by MUMPS, then SaP, andfinally SuperLU. Of the 57 linear systems solved by both SaP and PARDISO, SaPwas faster 20 times. Of the 71 linear systems solved by both SaP and SuperLU, SaPwas faster 38 times. Of the 60 linear systems solved by both SaP and MUMPS, SaPwas faster 27 times. Of the 60 linear systems solved by both PARDISO and SuperLU,PARDISO was faster 60 times. Of the 57 linear systems solved by both SaP andMUMPS, PARDISO was faster 57 times. And finally, of the 64 linear systems solvedboth by SuperLU and MUMPS, SuperLU was faster 24 times. This relative speedissue is revisited in section 4.2.4.

We compare next the four solvers using a median-quartile method to measurestatistical spread. Assume that T SaP

α and T PARDISOα represent the times required by

SaP::GPU and PARDISO, respectively, to finish test α. A relative speedup is com-puted as

(4.2) SSaP−PARDISOα = log2

T PARDISOα

T SaPα

with SSaP−MUMPSα and SSaP−SuperLUα similarly computed. These SSaP−PARDISOα values, whichcan be either positive or negative, are collected in a set SSaP−PARDISO which is used togenerate a box plot in Figure 10. The figure also reports results on SSaP−SuperLU andSSaP−MUMPS. Note that the number of tests used to produce these statistical measuresis different for each comparison: 57 linear systems for SSaP−PARDISO, 71 for SSaP−SuperLU,and 60 for SSaP−MUMPS. The median values for SSaP−PARDISO, SSaP−SuperLU, and SSaP−MUMPSare −1.4036, 0.0934, and −0.3242, respectively. These results suggest that when itfinishes, PARDISO can be expected to be about two times faster than SaP. MUMPSis marginally faster than SaP, which on average can be expected to be only slightlyfaster than SuperLU.

Crosses are used in Figure 10 to show statistical outliers. Favorably, most of theSaP’s outliers are large and positive. For instance, there are three linear systems


Fig. 10. Statistical spread for SaP::GPU’s performance relative to that of PARDISO, SuperLU,and MUMPS. Referring to (4.2), the results were obtained using the data sets SSaP−PARDISO (with 57values), SSaP−SuperLU (71 values), and SSaP−MUMPS (60 values).

for which, when compared to PARDISO, SaP finishes significantly faster, four linearsystems for which it is significantly faster than SuperLU, and four linear systemsfor which it is significantly faster than MUMPS. On the flip side, there are two testswhere SaP runs slower than MUMPS and one test where it runs significantly slowerthen SuperLU. The results also suggest that about 50% of the linear systems run inSaP in the range between “as fast as PARDISO or two to three times slower,” and 50%of the linear systems run in SaP in the range “between four times faster to four timesslower then SuperLU.” Relative to MUMPS, the situation is just like for SuperLU ifonly slightly shifted toward negative territory: the second and third quartiles suggestthat 50% of the linear systems run in SaP in the range “between three times fasterto three times slower then MUMPS.” Again, favorably for SaP, the last quartile is longand reaches well into high positive values. In other words, when it beats the othersolvers, it does so by a large margin.

4.2.3. Comparison against another GPU solver. The same set of 114 ma-trices used in the comparison against PARDISO, SuperLU, and MUMPS was consideredto compare SaP::GPU with the sparse direct QR solver in the cuSOLVER library [23].For cuSOLVER, the QR solver was run in two configurations: with or without theapplication of a reversed Cuthill–McKee (RCM) reordering before solving the system.RCM was optionally applied given that it can potentially reduce the QR factoriza-tion fill-in. cuSOLVER successfully solved 45 of 114 systems. There were only threelinear systems: ABACUS shell ud, ex11, and jan99jac120, which were success-fully solved by cuSOLVER but not by SaP::GPU. Of the 42 systems solved by bothSaP::GPU and cuSOLVER, cuSOLVER was faster than SaP::GPU in five cases. Inall 69 systems cuSOLVER failed to solve, the implementation ran out of memory.

4.2.4. Performance evaluation. We provide a final sparse linear solver per-formance comparison using performance profiles [8]. Consider the collection P ofnP = 114 test problems used herein [29], along with the set S of nS = 5 sparse linear


Fig. 11. Performance profiles in two different ranges: τ ∈ [1, 10] on the left and τ ∈ [1, 100] onthe right.

solvers: PARDISO, SuperLU, MUMPS, cuSOLVER, and SaP::GPU. Following the no-tation in [8], let tp,s be the computing time required to solve problem p using solvers. The performance ratio

rp,s =tp,s

min{tp,s : s ∈ S}

compares the performance of solver s on problem p with the best performance of anysolver on that problem. If solver s does not solve problem p, we set the correspondingperformance ratio to a value rM > rp,s ∀p, s. It can be shown that such an rM doesnot affect the results of the analysis [8]. Here, we use rM = 105.

A global estimate of the performance of a solver s can then be obtained by definingthe probability that the performance ratio rp,s is within a factor τ ≥ 1 of the bestpossible performance ratio:

ρs(τ) =1

nPsize {p ∈ P : rp,s ≤ τ} .

As the cumulative distribution function for the performance ratio, ρs is monotonicallyincreasing and ρs(τ) ≤ 1. Termed here performance profiles, distribution functionsof a performance metric have been shown to be robust benchmarking indicators. Aplot of the performance profile unveils several main performance characteristics. Theinterested reader is directed to [8] for more details. Here we only mention that,assuming a set of problems P that is sufficiently large and representative of problemsof interest, solvers with a large probability ρs(τ) are to be preferred. As a particularcase, ρs(1) represents the probability that solver s will have the best performance overall other solvers.

Figure 11 shows the performance profiles for the five solvers considered herein onthe set of 114 matrices used in the benchmark [29]. The plot on the left, for smallvalues of τ , reveals the solvers with the most wins (or close to the winner). PARDISOhas the most wins, i.e., has the highest probability of being the best solver—thisprobability being almost 0.5. SaP::GPU comes in second, with a value ρs(1) ≈ 0.3.cuSOLVER, i.e., the other GPU solver considered in this benchmarking, has effectively0 probability of ever being the optimal solver on a given problem. The plot on theright, showing the trends of the performance profiles for larger values of τ , allows forselection of the solver able to solve most problems. Indeed, since 1− ρs(τ) represents


the fraction of problems in the set P that solver s cannot solve within a factor τ ofthe best solver, including the cases for which that solver fails, for large values of τthis quantity is an indication of the solver robustness (defined here as the probabilityto solve more problems, regardless of the time to solution). The plot on the right inFigure 11 shows SaP::GPU as being able to solve nearly 80% of the problem set.

5. Conclusions and future work. This contribution discusses parallel strate-gies to solve dense banded and sparse linear systems on GPU cards. BSD-3 opensource implementations of all these strategies are available at [30, 31] as part of asoftware package called SaP. At the time of this study and over a broad range ofdense matrix sizes and bandwidths, SaP was likely to run two times faster than In-tel’s MKL. This conclusion should be modulated by hardware considerations and alsoby the observation that the diagonal dominance of the dense banded matrix is a per-formance factor. On the sparse linear system side, the most surprising result wasthe robustness of SaP. Of a set of 114 tests, most of them using matrices from theUniversity of Florida sparse matrix collection, SaP failed only 28 times, of which 23were “out-of-memory” failures owing to a 6 GB limit on the size of the GPU mem-ory. In terms of performance, SaP was compared against PARDISO, MUMPS, andSuperLU. The straight split-and-parallelize strategy, without the coupling involvedin the SPIKE-type strategy, emerged as the more often solution approach adopted bySaP.

SaP has been successfully used for the implicit numerical integration of flexiblemultibody dynamics, in a Newton–Krylov context [33], and current efforts targetits use in conjunction with interior point methods for large-scale granular dynamicssimulations [20].

Several issues remain to be investigated. First, since more than 50% of the time tosolution is spent in the iterative solver, it is worth considering the techniques analyzedin [14], which sometimes double the flop rate in sparse matrix-vector multiplicationoperations upon changing the matrix storage scheme, i.e., moving from CSR to ELLor hybrid. Second, an out-of-core and/or multi-GPU implementation might enableSaP to handle larger problems while possibly reducing time to solution. Third, theCM bandwidth reduction strategy implemented is dated; spectral and/or hypergraphpartitioning for load balancing could lead to superior matrix splitting. Finally, as itstands, with the exception of parts of the matrix reordering, SaP is entirely a GPUsolution. It would be worth investigating how the CPU can be involved in other phasesof the solution. Such an investigation would be well justified given the imminent tightintegration of the CPU and GPU memories.

Acknowledgments. This work benefited from discussions with Matt Knepleyand Ahmed Sameh.

REFERENCES

[1] NVIDIA Tesla Kepler GPU Accelerators, http://www.nvidia.com/content/tesla/pdf/Tesla-KSeries-Overview-LR.pdf (2012).

[2] Tesla K20 GPU Accelerator, http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf (2012).

[3] P. R. Amestoy, I. S. Duff, and J.-Y. L’Excellent, Multifrontal parallel distributed sym-metric and unsymmetric solvers, Comput. Methods Appl. Mech. Engrg., 184 (2000),pp. 501–520.

[4] N. Bell and M. Garland, Cusp: Generic parallel algorithms for sparse matrix and graphcomputations, Version 0.3.0, http://cusplibrary.github.io, (2012).

http://www.nvidia.com/content/tesla/pdf/Tesla-KSeries-Overview-LR.pdf

http://www.nvidia.com/content/tesla/pdf/Tesla-KSeries-Overview-LR.pdf

http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf

http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf

http://cusplibrary.github.io


[5] N. Bell and J. Hoberock, Thrust: A productivity-oriented library for CUDA, GPU Comput.Gems Jade Edition, 2 (2011), pp. 359–371.

[6] E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, in Proceed-ings of the 24th ACM Conference, New York, 1969, pp. 157–172.

[7] J. W. Demmel, SuperLU Users’ Guide, Lawrence Berkeley National Laboratory, Berkeley, CA,2011.

[8] E. D. Dolan and J. J. More, Benchmarking optimization software with performance profiles,Math. Program., 91 (2002), pp. 201–213.

[9] I. Duff and J. Koster, The design and use of algorithms for permuting large entries to thediagonal of sparse matrices, SIAM J. Matrix Anal. Appl., 20 (1999), pp. 889–901.

[10] I. Duff and J. Koster, On algorithms for permuting large entries to the diagonal of a sparsematrix, SIAM J. Matrix Anal. Appl., 22 (2001), pp. 973–996.

[11] G. H. Golub and C. F. V. Loan, Matrix Computations, Johns Hopkins University Press,Baltimore, MD, 1980.

[12] M. Hopper, Harwell Subroutine Library. A Catalogue of Subroutines, Tech. report, DTIC, Ft,Belvoir, VA, 1973.

[13] A. Li, O. Deshmukh, R. Serban, and D. Negrut, A Comparison of the Performance ofSaP::GPU and Intel’s Math Kernel Library for Solving Dense Banded Linear Systems,Tech. Report TR-2012-07, SBEL, University of Wisconsin–Madison, 2014, http://sbel.wisc.edu/documents/TR-2014-07.pdf.

[14] A. Li, H. Mazhar, R. Serban, and D. Negrut, Comparison of SPMV Performance onMatrices with Different Matrix Format Using CUSP, cuSPARSE and ViennaCL, Tech.Report TR-2015-02, SBEL, University of Wisconsin–Madison, 2015, http://sbel.wisc.edu/documents/TR-2015-02.pdf.

[15] A. Li, R. Serban, and D. Negrut, A Hybrid GPU-CPU Parallel CM Reordering Algorithmfor Bandwidth Reduction of Large Sparse Matrices, Tech. Report TR-2014-12, SBEL, Uni-versity of Wisconsin–Madison, 2014, http://sbel.wisc.edu/documents/TR-2014-12.pdf.

[16] A. Li, R. Serban, and D. Negrut, An Implementation of a Reordering Approach for Increas-ing the Product of Diagonal Entries in a Sparse Matrix, Tech. Report TR-2014-01, SBEL,University of Wisconsin–Madison, 2014, http://sbel.wisc.edu/documents/TR-2014-01.pdf.

[17] A. Li, R. Serban, and D. Negrut, Analysis of A Splitting Approach for the Parallel Solu-tion of Linear Systems on GPU Cards, Tech. Report TR-2015-12, SBEL, University ofWisconsin–Madison, 2015, http://sbel.wisc.edu/documents/TR-2015-12.pdf.

[18] D. Lukarski and N. Trost, Paralution Project, http://www.paralution.com.[19] M. Manguoglu, A. Sameh, and O. Schenk, PSPIKE: A parallel hybrid sparse linear system

solver, in Proceedings of the 15th International Euro-Par Conference on Parallel Processing,Delft, The Netherlands, Springer-Verlag, Berlin, 2009, pp. 797–808.

[20] D. Melanz, L. Fang, J. Jayakumar, and D. Negrut, A comparison of numerical methodsfor solving multibody dynamics problems with frictional contact modeled via differentialvariational inequalities, Comput. Methods Appl. Mech. Engrg., 320 (2017), pp. 668–693.

[21] C. Mikkelsen and M. Manguoglu, Analysis of the truncated SPIKE algorithm, SIAM J.Matrix Anal. Appl., 30 (2008), pp. 1500–1519.

[22] NVIDIA, CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html (2015).

[23] NVIDIA, cuSOLVER, https://developer.nvidia.com/cusolver (2015).[24] E. Polizzi and A. Sameh, A parallel hybrid banded system solver: The SPIKE algorithm,

Parallel Comput., 32 (2006), pp. 177–194.[25] E. Polizzi and A. Sameh, SPIKE: A parallel environment for solving banded linear systems,

Comput. & Fluids, 36 (2007), pp. 113–120.[26] The OpenCL Specification, Khronos OpenCL Working Group, 2008.[27] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, 2003.[28] A. Sameh and D. Kuck, On stable parallel linear system solvers, J. ACM, 25 (1978), pp. 81–91.[29] Sparse Matrices Used for Benchmarking SaP::GPU, http://sbel.wisc.edu/documents/

testSparseMatricesSaP.zip (2015).[30] SaP::GPU Github, https://github.com/spikegpu/SaPLibrary.[31] SaP::GPU Website, URL http://sapgpu.sbel.org.[32] O. Schenk and K. Gartner, Solving unsymmetric sparse systems of linear equations with

Pardiso, Future Generation Comput. Syst., 20 (2004), pp. 475–487.[33] R. Serban, D. Melanz, A. Li, I. Stanciulescu, P. Jayakumar, and D. Negrut, A GPU-

based preconditioned Newton-Krylov solver for flexible multibody dynamics, Internat. J.Numer. Methods Engrg., 102 (2015), pp. 1585–1604, https://doi.org/10.1002/nme.4876.

[34] G. Sleijpen and D. Fokkema, BiCGStab(l) for linear equations involving unsymmetric ma-trices with complex spectrum, Electron. Trans. Number. Anal., 1 (1993), pp. 11–32.

http://sbel.wisc.edu/documents/TR-2014-07.pdf







http://www.paralution.com

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://developer.nvidia.com/cusolver

http://sbel.wisc.edu/documents/testSparseMatricesSaP.zip

http://sbel.wisc.edu/documents/testSparseMatricesSaP.zip

https://github.com/spikegpu/SaPLibrary

http://sapgpu.sbel.org

https://doi.org/10.1002/nme.4876

Documents

ANALYSIS OF A SPLITTING APPROACH FOR THE PARALLELhomepages.cae.wisc.edu/~negrut/PDFpapers/siamLinSysGPU.pdf · Developing an approach and implementing parallel code for solving sparse