Sparse factorizations: Towards optimal complexity and resilience at exascale Xiaoye Sherry Li Lawrence Berkeley National Laboratory Challenges in 21st Century Experimental Mathematical Computation Workshop, ICERM, Brown Univ., July 21-25, 2014.

Page 1

Sparse factorizations: Towards optimal complexity and resilience at exascale

Xiaoye Sherry Li Lawrence Berkeley National Laboratory

Challenges in 21st Century Experimental Mathematical Computation Workshop, ICERM, Brown Univ., July 21-25, 2014.

Page 2

Introduction

!   DOE SciDAC programs (Scientific Discovery through Advanced Computing)
!   FASTMath Institute (2011-2016, Frameworks, Algorithms, and Scalable Technologies for Mathematics)
•  Software: SuperLU, PETSc, Trilinos, Chombo, mesh, …
•  (3 other Institutes)
!   Science Applications (many, mostly Partial Differential Equations)
•  CEMM (2009-2015, Center for Extended MHD Modeling, fusion energy)
•  ComPASS (2008-2015, Community Petascale Project for Accelerator Science and Simulation)
!   LBNL focuses
•  Direct solvers (SuperLU): scaling to 1000s of cores
•  Hybrid solvers (direct & iterative): scaling to 10,000 cores
•  Low-rank HSS preconditioner: nearly linear complexity for certain PDEs


Page 3

Application 1: Burning plasma for fusion energy

!   ITER – a new fusion reactor being constructed in Cadarache, France
•  International collaboration: China, the European Union, India, Japan, Korea, Russia, and the United States
•  Study how to harness fusion, creating clean energy using nearly inexhaustible hydrogen as the fuel. ITER promises to produce 10 times as much energy as it uses — but that success hinges on accurately designing the device.
!   One major simulation goal is to predict microscopic MHD instabilities of burning plasma in ITER. This involves solving extended and nonlinear magnetohydrodynamics equations.


Page 4

Application 1: ITER modeling

!   Center for Extended Magnetohydrodynamic Modeling (CEMM), PI: S. Jardin, PPPL.
!   Develop simulation codes to predict microscopic MHD instabilities of burning magnetized plasma in a confinement device (e.g., the tokamak used in ITER experiments).
•  Efficiency of the fusion configuration increases with the ratio of thermal and magnetic pressures, but MHD instabilities become more likely at higher ratios.
!   Code suite includes M3D-C1, NIMROD

[Figure: (R, ϕ, Z) torus geometry (S. Jardin). At each ϕ = constant plane, scalar 2D data is represented using 18-degree-of-freedom quintic triangular finite elements (Q18); coupling is along the toroidal direction.]

Page 5

ITER modeling: 2-Fluid 3D MHD Equations

∂n/∂t + ∇•(nV) = 0    (continuity)

∂B/∂t = −∇×E,   ∇•B = 0,   µ0 J = ∇×B    (Maxwell)

n Mi (∂V/∂t + V•∇V) + ∇p = J×B − ∇•Π_GV − ∇•Π_µ    (momentum)

E + V×B = ηJ + (1/ne)(J×B − ∇pe − ∇•Π_e)    (Ohm's law)

(3/2) ∂pe/∂t + ∇•((3/2) pe V) = −pe ∇•V + ηJ² − ∇•qe + Q_Δ    (electron energy)

(3/2) ∂pi/∂t + ∇•((3/2) pi V) = −pi ∇•V − Π_µ•∇V − ∇•qi − Q_Δ    (ion energy)


The objective of the M3D-C1 code is to solve these equations as accurately as possible in 3D toroidal geometry with realistic B.C. and optimized for a low-β torus with a strong toroidal field.

Page 6

Application 2: particle accelerator cavity design

•  Community Petascale Project for Accelerator Science and Simulation (ComPASS), PI: P. Spentzouris, Fermilab
•  Development of a comprehensive computational infrastructure for accelerator modeling and optimization
•  RF cavity: Maxwell equations for the electromagnetic field
•  FEM in the frequency domain leads to a large sparse eigenvalue problem; needs to solve shifted linear systems

Closed cavity — linear eigenvalue problem: (K − σ²M0) x = M0 b
Open cavity (waveguide BCs) — nonlinear complex eigenvalue problem: (K + iσW − σ²M0) x = b

[Figure: closed cavity with boundaries ΓE, ΓM; open cavity with waveguide BCs; RF unit in ILC. (L.-Q. Lee)]

Page 7

Sparse: lots of zeros in matrix

!   fluid dynamics, structural mechanics, chemical process simulation, circuit simulation, electromagnetic fields, magneto-hydrodynamics, seismic-imaging, economic modeling, optimization, data analysis, statistics, . . .

!   Example: A of dimension 10^6, with 10–100 nonzeros per row
!   Matlab: >> spy(A)

[Figure: sparsity patterns of Mallya/lhr01 (chemical eng.) and Boeing/msc00726 (structural eng.)]

Page 8

Strategies of sparse linear solvers

§  Solving a system of linear equations Ax = b
•  Sparse: many zeros in A; worth special treatment
§  Iterative methods (CG, GMRES, …)
•  A is not changed (read-only)
•  Key kernel: sparse matrix-vector multiply
•  Easier to optimize and parallelize
•  Low algorithmic complexity, but may not converge
§  Direct methods
•  A is modified (factorized)
•  Harder to optimize and parallelize
•  Numerically robust, but higher algorithmic complexity
§  Often use a direct method (factorization) to precondition an iterative method
•  Solve an easier system: M⁻¹Ax = M⁻¹b

Page 9

Gaussian Elimination (GE)

!   Solving a system of linear equations Ax = b

!   First step of GE
!   Repeat GE on C
!   Results in the LU factorization A = LU
–  L lower triangular with unit diagonal, U upper triangular
!   Then x is obtained by solving two triangular systems with L and U

A = ( α  wᵀ ) = ( 1    0 ) • ( α  wᵀ ),   where C = B − (1/α)·v·wᵀ
    ( v  B  )   ( v/α  I )   ( 0  C  )
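The recursion above can be sketched in a few lines of numpy. This is an illustrative unpivoted dense LU, not SuperLU: each step peels off one pivot α and forms the Schur complement C = B − v·wᵀ/α in place.

```python
import numpy as np

def lu_no_pivot(A):
    """Apply the first GE step repeatedly: A = [[alpha, w^T], [v, B]]
    gives one column of L, one row of U, and the trailing Schur
    complement C = B - v w^T / alpha.  No pivoting; illustration only."""
    A = A.astype(float)          # work on a copy
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for j in range(n):
        alpha = A[j, j]
        U[j, j:] = A[j, j:]              # row [alpha, w^T] of U
        L[j+1:, j] = A[j+1:, j] / alpha  # column v / alpha of L
        # Schur complement update: C = B - v w^T / alpha
        A[j+1:, j+1:] -= np.outer(L[j+1:, j], A[j, j+1:])
    return L, U

A = np.array([[4., 3.], [6., 3.]])
L, U = lu_no_pivot(A)
# L @ U reproduces A; L is unit lower triangular, U upper triangular
```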

Page 10

Sparse factorization

!   Store A explicitly … many sparse compressed formats
!   “Fill-in” … new nonzeros in L & U
!   Graph algorithms: directed/undirected graphs, bipartite graphs, paths, elimination trees, depth-first search, heuristics for NP-hard problems, cliques, graph partitioning, …
!   Unfriendly to high-performance, parallel computing
•  Irregular memory access, indirect addressing, strong task/data dependency

[Figure: sparse matrix with fill-in entries in L and U, and the graph of the filled matrix.]

Page 11

Graph tool: reachable set, fill-path

!   Edge (x,y) exists in the filled graph G+ due to the path: x → 7 → 3 → 9 → y
!   Finding fill-ins ⟷ finding the transitive closure of G(A)

[Figure: matrix with original entries (o) and fill entries (+), illustrating the fill path x → 7 → 3 → 9 → y.]
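The fill-path rule can be checked mechanically. A minimal sketch (not production code): simulate elimination of the vertices in order and record the new edges, which is equivalent to the transitive-closure characterization above. The vertex numbering and example graph are made up for illustration.

```python
def fill_edges(n, edges):
    """Fill-path rule: (x, y) is an edge of the filled graph G+ iff the
    original graph has a path x -> ... -> y whose intermediate vertices
    are all numbered lower than min(x, y).  Computed by simulating
    symbolic elimination of vertices 0, 1, 2, ... in order."""
    adj = {v: set() for v in range(n)}
    for x, y in edges:
        adj[x].add(y); adj[y].add(x)
    fill = set()
    for v in range(n):                       # eliminate v
        nbrs = [u for u in adj[v] if u > v]  # uneliminated neighbours
        for i in range(len(nbrs)):           # connect them into a clique
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b); adj[b].add(a)
                    fill.add((min(a, b), max(a, b)))
    return fill

# Path 1-0-2 with the centre eliminated first: the path 1 -> 0 -> 2
# goes through lower-numbered vertex 0, so (1, 2) fills in.
print(fill_edges(3, [(0, 1), (0, 2)]))  # {(1, 2)}
```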

Page 12

Algorithmic phases in sparse GE

1.  Minimize number of fill-ins, maximize parallelism
!   Sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph)
!   Ordering (combinatorial algorithms; “NP-complete” to find the optimum [Yannakakis ’81]; use heuristics)
2.  Predict the fill-in positions in L & U
!   Symbolic factorization (combinatorial algorithms)
3.  Design efficient data structures for storage and quick retrieval of the nonzeros
!   Compressed storage schemes
4.  Perform factorization and triangular solutions
!   Numerical algorithms (F.P. operations only on nonzeros)
!   Usually dominates the total runtime
!   For sparse Cholesky and QR, the steps can be separate; for sparse LU with pivoting, steps 2 and 4 may be interleaved.
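Phase 3 (compressed storage) can be illustrated with the common CSR format; a minimal pure-Python sketch, not any particular library's layout:

```python
import numpy as np

def to_csr(dense):
    """Compress a dense matrix to CSR: only nonzeros are stored, reached
    by indirect addressing through indptr/indices."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j); data.append(v)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data, dtype=float)

def csr_matvec(indptr, indices, data, x):
    """y = A x, performing F.P. operations only on the stored nonzeros."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

A = [[4, 0, 1], [0, 3, 0], [0, 2, 5]]
indptr, indices, data = to_csr(A)
y = csr_matvec(indptr, indices, data, np.array([1., 1., 1.]))  # [5., 3., 7.]
```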

Page 13

Distributed-memory parallelization

!   2D block-cyclic matrix distribution
!   Scalability challenges:
•  High degree of data & task dependency (DAG)
•  Irregular, indirect memory access
•  Low arithmetic intensity

For j = 1, 2, 3, … number of supernodes:
1.  Block LU factorization: L(j,j), U(j,j) ← LU(A(j,j))
2.  L update: L(k,j) ← A(k,j) U(j,j)⁻¹,  k > j
3.  U update: U(j,k) ← L(j,j)⁻¹ A(j,k),  k > j
4.  Rank-K update: A(i,k) ← A(i,k) − L(i,j) U(j,k),  i, k > j
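The supernodal loop above, sketched densely in numpy: unpivoted, with a fixed block size standing in for supernodes (SuperLU_DIST additionally keeps the blocks sparse and distributes them 2D block-cyclically).

```python
import numpy as np

def blocked_lu(A, nb):
    """Right-looking blocked LU without pivoting, mirroring steps 1-4:
    factor the diagonal block, update the L and U panels, then apply
    the rank-nb trailing update.  Dense; illustration only."""
    A = A.astype(float)
    n = A.shape[0]
    for j in range(0, n, nb):
        e = min(j + nb, n)
        # 1. factor diagonal block A(j,j) in place -> L(j,j), U(j,j)
        for k in range(j, e):
            A[k+1:e, k] /= A[k, k]
            A[k+1:e, k+1:e] -= np.outer(A[k+1:e, k], A[k, k+1:e])
        Ljj = np.tril(A[j:e, j:e], -1) + np.eye(e - j)
        Ujj = np.triu(A[j:e, j:e])
        # 2. L panel: L(k,j) <- A(k,j) U(j,j)^-1
        A[e:, j:e] = np.linalg.solve(Ujj.T, A[e:, j:e].T).T
        # 3. U panel: U(j,k) <- L(j,j)^-1 A(j,k)
        A[j:e, e:] = np.linalg.solve(Ljj, A[j:e, e:])
        # 4. rank-nb update: A(i,k) <- A(i,k) - L(i,j) U(j,k)
        A[e:, e:] -= A[e:, j:e] @ A[j:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(0)
M = rng.random((6, 6)) + 6 * np.eye(6)  # diagonally dominant: safe unpivoted
L, U = blocked_lu(M, nb=2)
assert np.allclose(L @ U, M)
```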

Page 14

SuperLU_DIST 2.5 on Cray XE6 (hopper@nersc)

!   Profiling using IPM
!   Synchronization dominates on a large number of cores
•  up to 96% of factorization time

[Figure: factorization vs. communication time on 8–2048 cores. Left: Accelerator (sym), n = 2.7M, fill-ratio = 12. Right: DNA, n = 445K, fill-ratio = 609.]

Page 15

SuperLU_DIST 3.0: better DAG scheduling

!   Implemented new static scheduling and flexible look-ahead algorithms that shortened the length of the critical path.
!   Idle time was significantly reduced (speedup up to 2.6x)
!   To further improve performance:
•  more sophisticated scheduling schemes
•  hybrid programming paradigms

[Figure: DAG schedule with look-ahead window; factorization/communication time of version 2.5 vs. version 3.0 on 8–2048 cores. Left: Accelerator, n = 2.7M, fill-ratio = 12. Right: DNA, n = 445K, fill-ratio = 609.]

Page 16

Performance on larger matrices

v  Sparsity ordering: METIS applied to the structure of Aᵀ+A

Name        Application                     Data type  N        |A|/N  |L\U| (10^6)  Fill-ratio
matrix211   Fusion, MHD eqns (M3D-C1)       Real       801,378  161    1276.0        9.9
cc_linear2  Fusion, MHD eqns (NIMROD)       Complex    259,203  109    199.7         7.1
matick      Circuit sim., MNA method (IBM)  Complex    16,019   4005   64.3          1.0
cage13      DNA electrophoresis             Real       445,315  17     4550.9        608.5

Page 17

Strong scaling: MPI, Cray XE6 (hopper@nersc)

v  Up to 1.4 Tflops factorization rate

§  2 × 12-core AMD ‘MagnyCours’ per node, 2.1 GHz

Page 18

Variety of node architectures

•  Cray XE6: dual-socket × 2-die × 6-core, 24 cores
•  Cray XC30: dual-socket × 8-core, 16 cores
•  Cray XK7: 16-core AMD + K20X GPU
•  Intel MIC: 16-core host + 60+ core co-processor

Page 19

Multicore / GPU-Aware SuperLU

!   New hybrid programming code: MPI+OpenMP+CUDA, able to use all the CPUs and GPUs on manycore computers.
!   Algorithmic changes:
•  Aggregate small BLAS operations into larger ones.
•  Multithread the Scatter/Gather operations on the CPU.
•  Hide long-latency operations.
!   Results: using 100 nodes of a GPU cluster, up to 2.7x faster, 2x-5x memory saving.
!   New SuperLU_DIST 4.0 release, August 2014.

Page 20

CPU + GPU algorithm

①  Aggregate small blocks
②  GEMM of large blocks
③  Scatter

GPU acceleration: software pipelining to overlap GPU execution with CPU scatter and data transfer.
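Steps ①–③ can be sketched in numpy: many small GEMMs that share one L panel are replaced by a single large GEMM, whose result is then scattered back per block. The shapes and block counts here are made up for illustration.

```python
import numpy as np

# Instead of one small GEMM per destination block, stack the U blocks
# that multiply the same L panel into one wide matrix, do one large
# GEMM, then scatter the result columns back (steps 1-3 above).
rng = np.random.default_rng(1)
L_panel = rng.random((8, 4))                       # one supernodal L panel
u_blocks = [rng.random((4, 3)) for _ in range(5)]  # small U blocks

# (1)+(2) aggregate: one (8x4)x(4x15) GEMM replaces five (8x4)x(4x3) GEMMs
U_big = np.hstack(u_blocks)
V_big = L_panel @ U_big

# (3) scatter the wide result back into per-block updates
updates = np.hsplit(V_big, len(u_blocks))
assert all(np.allclose(upd, L_panel @ ub)
           for upd, ub in zip(updates, u_blocks))
```

The aggregated GEMM has much higher arithmetic intensity than the five small ones, which is what makes it worth offloading to the GPU.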

Page 21

Software issues

!   Use preprocessing to produce 4 precision versions {s, d, c, z}
•  Creating the macro-enabled “basefile” the first time is clumsy; later maintenance is easier.
•  C++ “template”s would be better.
!   Performance portability?
•  Need to adjust the block size for each architecture
-  Larger blocks are better for a uniprocessor
-  Smaller blocks are better for parallelism and load balance
•  Open problem: automatic tuning of the block size?
!   Flexible interface?
•  Example: block-diagonal preconditioner M⁻¹Ax = M⁻¹b with M = diag(A11, A22, A33) → use SuperLU_DIST for each diagonal block
!   No explicit funding for user support (other than SciDAC apps.)
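The block-diagonal preconditioner example can be sketched in pure numpy. This is a toy stand-in only: np.linalg.solve plays the role of the per-block SuperLU_DIST factorization, and a preconditioned Richardson iteration stands in for an outer Krylov solver; the matrix is made up.

```python
import numpy as np

# M = diag(A11, A22, A33): solve each diagonal block directly, then
# iterate x <- x + M^-1 (b - A x), i.e. solve M^-1 A x = M^-1 b.
rng = np.random.default_rng(2)
n, nb = 30, 10
A = 0.05 * rng.random((n, n)) + np.diag(2 + rng.random(n))  # block-dominant

# one "direct factorization" per diagonal block (stand-in for SuperLU_DIST)
blocks = [A[i:i+nb, i:i+nb] for i in range(0, n, nb)]

def apply_Minv(r):
    """Apply M^-1 block by block."""
    return np.concatenate([np.linalg.solve(B, r[i*nb:(i+1)*nb])
                           for i, B in enumerate(blocks)])

b = np.ones(n)
x = np.zeros(n)
for _ in range(200):                     # preconditioned Richardson
    x += apply_Minv(b - A @ x)
assert np.linalg.norm(A @ x - b) < 1e-8
```

The same `apply_Minv` callback is exactly what a flexible solver interface would accept as the preconditioner operator.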


Page 23

Towards exascale

!   Exascale machines will have hierarchical organization
•  Hierarchical memory, NUMA nodes: multicore, manycore
!   Exascale applications will encompass multiphysics (coupled PDEs) and multiscale (time and space) models
!   Hierarchical algorithms and parallelism match machine and application features

Studying two classes of algorithms for sparse linear systems:
1.  Domain-decomposition hybrid method
•  General algebraic solver
2.  Low-rank factorization employing hierarchical matrices and randomization
•  Targets PDE applications

Page 24

1. Domain decomposition, Schur complement (PDSLin: http://portal.nersc.gov/project/sparse/pdslin/)

1.  Graph-partition into subdomains, so that A11 is block diagonal:

( A11  A12 ) ( x1 )   ( b1 )          ( D1              E1  )
( A21  A22 ) ( x2 ) = ( b2 ),   A =   (     D2          E2  )
                                      (         …       …   )
                                      (             Dk  Ek  )
                                      ( F1  F2  …   Fk  A22 )

2.  Schur complement, with A11 = LU:

S = A22 − A21 A11⁻¹ A12 = A22 − (U⁻ᵀ A21ᵀ)ᵀ • (L⁻¹ A12) ≡ A22 − W•G

S couples the interface (separator) variables; no need to form it explicitly.

3.  Hybrid solution methods:

(1) x2 = S⁻¹ (b2 − A21 A11⁻¹ b1)   ← iterative solver
(2) x1 = A11⁻¹ (b1 − A12 x2)   ← direct solver
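Steps (1)-(2) above, run on a dense toy problem. In PDSLin S is kept implicit and handled iteratively; here S is formed explicitly, and the diagonally dominant matrix is made up for illustration.

```python
import numpy as np

# Factor A11 directly, form the Schur complement, solve the interface
# unknowns x2 first, then back-substitute for the interior unknowns x1.
rng = np.random.default_rng(3)
n1, n2 = 8, 3
A = rng.random((n1 + n2, n1 + n2)) + (n1 + n2) * np.eye(n1 + n2)
b = rng.random(n1 + n2)
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

S = A22 - A21 @ np.linalg.solve(A11, A12)                     # Schur complement
x2 = np.linalg.solve(S, b2 - A21 @ np.linalg.solve(A11, b1))  # (1) interface
x1 = np.linalg.solve(A11, b1 - A12 @ x2)                      # (2) interior

assert np.allclose(A @ np.concatenate([x1, x2]), b)
```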

Page 25

Hierarchical parallelism

!   Multiple processors per subdomain
•  e.g., one subdomain per 2×3 process grid (SuperLU_DIST)
!   Advantages:
•  Constant number of subdomains, Schur size, and convergence rate, regardless of core count.
•  Needs only a modest level of parallelism from the direct solver.

[Figure: 24 processes split as P(0:5), P(6:11), P(12:17), P(18:23) over subdomains D1–D4 with border blocks E1–E4, F1–F4, and A22.]

Page 26

PDSLin in Omega3P: Cryomodule

PIP2 cryomodule consisting of 8 cavities

Computation parameters
§  2.3M elements
§  First-order finite elements (p = 1)
-  39M nonzeros, 2.5M DOFs
-  Solution time on hopper using 50 nodes and 600 cores: 863 ms (total)
§  Second-order finite elements (p = 2)
-  590M nonzeros, 14M DOFs
-  Solution time on edison using 400 nodes, 4800 cores: 5:40 min (wall)
-  Using MUMPS with 400 nodes, 800 cores, solution time: 6:46 min (wall)

Page 27

New mathematical algorithms

!   K-way, multi-constraint graph partitioning
•  Small separator, similar subdomains, similar connectivity
•  Both intra- and inter-group load balance
!   Sparse triangular solves with many sparse RHS (intra-subdomain)
!   Sparse matrix–matrix multiplication (inter-subdomain)

S = A22 − Σ_l (U_l⁻ᵀ F_lᵀ)ᵀ • (L_l⁻¹ E_l) = A22 − Σ_l W_l•G_l,   where D_l = L_l U_l

W ← sparsify(W, σ1);   G ← sparsify(G, σ1)
T(p) ← W(p) • G(p)
S(p) ← A22(p) − Σ_q T(q)(p);   S ← sparsify(S, σ2)

I. Yamazaki, F.-H. Rouet, X.S. Li, B. Uçar, “On partitioning and reordering problems in a hierarchically parallel hybrid linear solver”, IPDPS / PDSEC Workshop, May 24, 2013.

Page 28

2. HSS-embedded sparse factorization

!   Dense, but data-sparse: hierarchically semi-separable (HSS) structure
!   For PDEs with smooth kernels, off-diagonal blocks are rank deficient
!   Recursion leads to hierarchical partitioning
!   Key to low complexity: nested bases
!   Sparse case: apply HSS to the dense separators/supernodes

A ≈ ( A11  A12 ),   where
    ( A21  A22 )

A11 = ( D1        U1B1V2ᵀ ),   A22 = ( D4        U4B4V5ᵀ ),
      ( U2B2V1ᵀ   D2      )          ( U5B5V4ᵀ   D5      )

A12 = ( U1R1 ) • B3 • ( W4ᵀV4ᵀ  W5ᵀV5ᵀ ),   A21 = ( U4R4 ) • B6 • ( W1ᵀV1ᵀ  W2ᵀV2ᵀ )
      ( U2R2 )                                    ( U5R5 )

[Figure: the corresponding HSS tree.]

Nested tree-parallelism:
•  Outer tree: separator tree
•  Inner tree: HSS tree
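One level of this structure can be sketched in numpy: compress each off-diagonal block of a smooth-kernel matrix as U·B·Vᵀ via a truncated SVD. The kernel and tolerance are made up for illustration, and real HSS additionally nests the bases across levels, which is the key to low complexity.

```python
import numpy as np

n, tol = 64, 1e-8
x = np.linspace(0, 1, 2 * n)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))  # smooth kernel matrix

def truncated_svd(B, tol):
    """Compress B ~ U @ S @ V.T, keeping singular values above tol*s[0]."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # U, B, V

U1, B1, V2 = truncated_svd(A[:n, n:], tol)         # A12 ~ U1 B1 V2^T
U2, B2, V1 = truncated_svd(A[n:, :n], tol)         # A21 ~ U2 B2 V1^T

# the off-diagonal numerical rank is far below n for this smooth kernel
assert B1.shape[0] < n // 2
assert np.allclose(U1 @ B1 @ V2.T, A[:n, n:], atol=1e-6)
```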

Page 29

3D Helmholtz

!   Helmholtz equation with PML boundary:

[ −Δ − ω² / v(x)² ] u(x, ω) = s(x, ω)

!   N = 300³ = 27M, procs = 1024
!   Max rank = 1391 (tolerance = 1e-4)

           Time (s)  Gflops (% peak)  Comm %  Mem (GB)
MF         4206      2385 (27.7%)     32.6%   3144
MF + HSS   2171      2511 (29.2%)     41.2%   1104
 HSS-compr 1388                       15.3%

Page 30

New compression kernel: randomized sampling

!   Traditional methods: SVD, rank-revealing QR
•  Difficult to scale up
•  Extend-add of HSS structures of different shapes
!   Randomized sampling:
1.  Pick a random matrix Ω of size n × (k+p), p small, e.g. 10
2.  Form the sample matrix S = AΩ, with slight oversampling p
3.  Compute Q = ON-basis(S), an orthonormal basis of S
Accuracy: with high probability ≥ 1 − 6p⁻ᵖ,
‖A − QQ*A‖ ≤ (1 + 11·√(k+p)·√(min(m,n))) · σₖ₊₁
!   Benefits: kernel becomes dense matrix–matrix multiply
•  Extend-add of tall-skinny dense matrices of conforming shapes
•  Scalable and resilient algorithms exist
•  Even faster if a fast matrix–vector multiply is available (e.g. FMM)
•  Matrix-free solver, if only the matrix–vector action is available
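Steps 1-3 above in numpy, on a made-up exactly rank-k matrix. The general guarantee is the probabilistic bound quoted above; for an exactly rank-k matrix the sampled basis captures the whole range, so the error is at machine precision.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k, p = 200, 150, 10, 10
A = rng.random((m, k)) @ rng.random((k, n))   # exactly rank k

Omega = rng.standard_normal((n, k + p))       # 1. random test matrix
S = A @ Omega                                 # 2. sample matrix S = A*Omega
Q, _ = np.linalg.qr(S)                        # 3. orthonormal basis of S

# relative error of the range approximation A ~ Q Q^T A
err = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
assert err < 1e-10
```

Note that the only access to A is through matrix products, which is why a fast matrix-vector multiply (or a matrix-free operator) can be substituted.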

Page 31

Summary, forward looking . . .

!   Direct solvers can scale to 1000s of cores
!   Domain-decomposition hybrid solvers can scale to 10,000s of cores
•  Can also maintain robustness
!   Expect to scale further with low-rank structured factorization methods
•  Extend to a general solver framework; examine feasibility on a wider class of problems