42
Towards a Scalable Parallel Sparse Linear System Solver M. Manguoglu METU, Turkey A. Sameh* Purdue University, U.S.A. O. Schenk & M. Sathe Univ. of Basel, Switzerland Sparse Days 2011 CERFACS, September 6 & 7 Support*: ARO, Intel, NSF. Acknowledge: F. Saied (CS, Purdue)

Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Towards  a  Scalable  Parallel  Sparse  Linear  System  Solver  

M.  Manguoglu          METU,  Turkey  

A.  Sameh*              Purdue  University,  U.S.A.  O.  Schenk  &  M.  Sathe        Univ.  of  Basel,  Switzerland  

Coupled  2011  Kos  Island  June  20-­‐22  

Sparse  Days  2011  CERFACS,  September    6  &  7  Support*:  ARO,  Intel,  NSF.  

Acknowledge:  F.  Saied  (CS,  Purdue)  

Page 2: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Outline  •  MoUvaUng  applicaUons  –  a  sample  •  The  parallel  scalability  of  the  SPIKE  family  of  banded  linear  system  solvers  

•  Extensions  for  solving  general  sparse  linear  systems:  

– The  Pardiso-­‐SPIKE  hybrid  solver  (PSPIKE)  •  Role  of  reordering  for  faster  MATVEC  and  extracUon  of  precondiUoners:  

– TraceMIN  –  a  parallel  eigensolver  for  compuUng  the  Fiedler  vector  (spectral  reordering)  

2  

Page 3: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Target  ComputaUonal  Loop  Integra(on  

Newton  itera(on  

Linear  system  solvers  

ηk

∈k

Δt 3  

Page 4: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Role of Reordering Schemes

•  “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or spectral reordering,

•  This yields more scalable parallel matrix-vector multiplications, and helps in extracting more effective “banded” preconditioners.

a4er  RCM  reordering  

Banded  PrecondiUoners  

Page 5: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Factorization Triangular solves

Total

MKL TA0

Time (sec) on a 4-core Intel Clovertown

A x = f n = 600,000 bw = 99

5  

Page 6: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

0

0.5

1

1.5

2

2.5

3

3.5

4

Factorization Triangular solvers

Total

MKL TA0

off-chip data accessed (in bytes)

X 109

6  

“Analyzing  memory  access  intensity  in  parallel  programs  for  mul(core  architectures”  L.  Liu,  Z.  Li,  and  A.  S.    

Page 7: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Spike-­‐based  sparse  solver:  PSPIKE      

Solve  Mz  =  r      (via  Pardiso-­‐SPIKE)        

BiCGstab  (or  any  Krylov  subspace  method)  

Step  1:  reordering  (equivalent  to  HSL  -­‐-­‐  MC64  +  MC73)  Step  2:  extract  a  “banded”  precondiUoner  M  

7  

Page 8: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Weighted  Spectral  Reordering  (WSO)  

Main  ObjecUves:  – encapsulaUng  as  many  of  the  heaviest  sparse  matrix  elements  as  possible  within  a  central  band  to  be  used  as  a  precondiUoner,  • reducing  the  rank  of  the  matrix  lying  outside  the  central  band.  

– realizing  a  faster  “matvec”        (bandedmatrix  +  few  elements  outside  the  band)    (banded  matrix  +  few  nonzero  elements  outside  the  band)  

Page 9: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

UFL:  smt  -­‐-­‐  structural  mechanics  N:  25,710    NNZ:  3,749,582  

Original  matrix   AGer  MC73   AGer  TraceMin-­‐Fiedler  

ager    TraceMIN-­‐Fiedler  ager  HSL-­‐MC73  

9  

obtaining  the  Fiedler  vector  via  the  eigensolver:  TraceMIN    (Wisniewski  and  A.S.  -­‐-­‐  SINUM,  ’82)  

Page 10: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

UFL:  f2  -­‐-­‐  structural  mechanics  N:  71,505    NNZ:  5,294,285  

Original  matrix   AGer  MC73   AGer  TraceMin-­‐Fiedler  

10  TraceMIN-­‐Fiedler:  Murat  Manguoglu  et.  al  -­‐-­‐  submiled  

Page 11: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Before  reordering   Ager  reordering  via  TraceMIN-­‐Fiedler  

UFL  –  f2    

Page 12: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

M11

M22

M33

M44

B1

C2

C3

C4

B2

B3

z1

z4

z3

z2

r1

r4

r3

r2

=

M z = r (M is “banded”)

P = M + δ(M) = D’ * S’

(i)  Solve D’ y = r

(ii) Solve S’ z = y

PSPIKE:  Pardiso-­‐SPIKE  

12  

Each  Mkk  is  a  general  sparse  

matrix    

Solving  systems  involving    The  precondiPoner  M  z  =r  

Page 13: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Reduced  system  reduced  system  

Page 14: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

GeneraUng  Ups  of  the  spikes  

Mkk  0?

=  

Obtain  the  upper  and  lower  Ups  of  the  soluUon  block  via  the  modified  direct    

sparse  system  solver  “Pardiso”.  

Page 15: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

How  expensive  is  spectral  reordering?  Integra,on  

Newton  Itera,on  

Linear  system  solvers  

ηk

∈k

Δt

Cost  of  weighted  Spectral  reordering  could  be  amor(zed  over  several  (me  steps.    

Cost  of  WSO  is  amorPzed  across  several  Pme  steps.  

Will  return  to  this  issue  later  will  return  to  this  issue  later   15  

Page 16: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

σ  =  T(MC73)  on  1  core  ÷  T(TraceMIN-­‐Fiedler)  on  4  nodes  Matrix   n   nnz   σ  

Rajat31   ~4.7  M   ~  20  M   85  

Schenk/nlpkkt20   ~  3.5  M   ~  95  M   8  

Freescale1   ~  3.4  M   ~  17  M   4  

KKT-­‐power   ~  2.1  M   ~  13  M   641  

Page 17: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Mul(level  Parallelism  of  PSPIKE    for  solving  M  z  =  r  

Node  1   Node  2   Node  3   Node  4  

Pardiso   Pardiso   Pardiso   Pardiso  

SPIKE  A1

A2

A3

A4

C2

C3

C4

B1

B2

B3

Precondi(oner:  M  

17  

PrecondiPoner  M  is  “generalized-­‐banded”  Other  opPons:  •   narrow-­‐banded  •   wide-­‐banded  

PrecondiUoner  M  is  generalized-­‐banded  (MBP)  Other  cases:  •   narrow-­‐banded  (NBP)  •   wide-­‐banded  (WBP)  

Page 18: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

18  

ager  TraceMIN-­‐Fiedler  Circuit  SimulaUon  

Wide-­‐Banded  PrecondiUoner  

Page 19: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

WBP:  wide-­‐banded  precondiUoners  precondiUoner  consists  of  overlapped  blocks  

Naumov,  Manguoglu,  A.S.  :  A  Tearing-­‐based  Hybrid  Parallel  Sparse                                                                                                    Linear  System  Solver:  JACM,  2010.    

19  

Page 20: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Solve:  Ax = f    ;    A  :=  nonsingular  

m << n

A1 A2

20  

Page 21: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Determine  that              which  assures  that      

is  the  soluPon  of  the  balance  system  (order  =  size  of  overlap)  

M  y  =  g  M  and  g  are  not  available  explicitly  

A1  

A2  

21  

Page 22: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

The  Balance  System  

•  Solve  the  system  My=g  using  a  Krylov  subspace  scheme  (CG  or  Bicgstab).  

•  Two  Ume-­‐consuming  kernels:  

–   q  =  M*p  

–   r(p)  =  g  –  M*p  

22  

Page 23: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

ObservaUons  •  r(p)  =  g  –  M*p  

                     =  x2̂(p)  –  x2(p)  

•  r(0)  =  g                        =  x2̂(0)  –  x2(0)  

•  q  =  M*p                =  r(0)  –  r(p)    

23  

Page 24: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

PSPIKE  Stage  1  –  

•   ExtracUon  of  an  effecUve  “banded”  precondiUoner        M  –  with  or  without  reordering.  •   norm(M,’fro’)  =  (1-­‐  ε)  norm(A,’fro’);    • “bandwidth”  ≤  β(n).  

Stage  2  –  •   Use  an  outer  Krylov  subspace  method  (BiCGstab)  to  solve  A  x  =  f  with  precondiUoner  M.    •   A  modified  Pardiso  (by  O.  Schenk)  is  a  major  kernel  for  solving  M  z  =  r  .  •   Classes  of  banded  precondiUoners:  

narrow-­‐banded,  generalized-­‐banded,  wide-­‐banded  

Page 25: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

CompuUng  Playorm  • Intel  Cluster  – Westmere  X5670,  2.93  GHz  – Infiniband  interconnecUon  – 12-­‐core  nodes  

Page 26: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

a.  PSPIKE-­‐NBP  

Page 27: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

UFL  –  Rajat31  (circuit  simulaUon)  

N    ~      4.7  M  nnz  ~      20  M  nonsymmetric  

Page 28: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Scalability  of  PSPIKE  vs.  direct  solvers  (for  Rajat31)  

•  Direct  solvers:  – WSMP    

•  matrix  input  centralized  on  Node  0  

– MUMPS  •  matrix  input  distributed  by  block  rows  •  soluUon  vector  distributed  •  ParMETIS  

•  PSPIKE:  (bandwidth  of  precondiUoner  =  5)  – TraceMIN-­‐Fiedler  

– matrix  input  distributed  by  block  rows  

Page 29: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

1  

10  

100  

4   8   16   32   64   128   256   512   1024  

PSPIKE  –  WSMP  -­‐-­‐  MUMPS  

Time  in  

seconds  

factorizaUon  

Total  Ume  for    WSMP  &  MUMPS  

#  of  cores  (8  cores  per  node)  

~37  

~25  

~17  

~2.8  

PSPIKE:  1  MPI  proc./node                  8  OMP  threads/MPI  proc.  

8  MPI  proc./node  

rel.  res.  =  O(10-­‐5)  rel.  res.  =  O(10-­‐5)  

Page 30: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

System  size:  

N  =  11,333,520  

#  of  nonzeros:  61,026,416  

bandwidth:  334,613  

MEMS  simulaUon  benchmark  1    

stopping  criterion:  rel.  res.  =  O(  10-­‐2)  

30  

Page 31: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Scalability  of  TraceMin-­‐Fiedler  

31  

0  5  10  15  20  25  30  35  40  

0   5   10   15   20   25   30   35  

~  Time  in  seconds  

Nodes:  1  to  32  

37  

21  

12  

6  3   1.7  

Page 32: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

MEMS  SimulaUon  –  benchmark-­‐1  

WSMP  total  

WSMP  factor  

1  

10  

100  

1000  

4   8   16   32   64   128   256   512   1024  

PSPIKE-­‐NBP,  rel.  res.  =  5(10-­‐5)    

~345  

~31  

~86  

~4  

~  2.4  

β  =  5  β  =  5  Time  in  sec.  

cores  

Variable  #    of  OMP  threads/MPI  proc.  PSPIKE:  variable  #  of  OMP  threads/MPI  proc.  

Page 33: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

•  Strong  scalability  of  PSPIKE      Fixed  problem  size  –  1  to  64  nodes  (or  8  to  512  cores)      

•  Comparison  with  AMG-­‐precondiUoned  Krylov  subspace  solvers  in:  •   Hypre                  (LLNL)  •  Trilinos-­‐ML  (Sandia)  

•  Smoother  –  •  Chebyshev                          fastest  •  Jacobi  •  Gauss-­‐Seidel  

Scalability  of  PSPIKE  vs.  Trilinos  Intel  Harpertown  

33  

Page 34: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

0.1

1

10

100 1 2 4 8 16 32 64

Speed  Improvement  over  Trilinos-­‐ML  (nodes)  

Time  (Trilinos-­‐ML)  ÷  Time  (PSPIKE)  

PSPIKE:  k  threads  per  MPI  process    

break-­‐even  @  4  nodes    

#  of  nodes  k    1  to  4      1  8  to  16    4  >  16      8  Intel  Harpertown  

34  MEMS  benchmark  1  

Page 35: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

0

1

2

3

4

5

6

64 128 256 512 1024

time

(sec

)

number of cores

5.4  

2.9  

1.4  0.9  

0.7  

Strong  Scalability  on  Intel  Nehalem  for  a  larger  MEMS  system  of  order  ~  23M  (benchmark  2)  

35  

~  7.7  speed  improv.  :  64  to  1024  cores  

Page 36: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

b.  PSPIKE-­‐WBP  

Page 37: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

PSPIKE  for  ASIC680K  system    

Time  on  1  node:    Pardiso              –  23  (sec).                                                              ILU-­‐BiCGstab    –  34  (sec).    Time  on  2  nodes:      PSPIKE        –      6  (sec).  

relaUve  residual    Pardiso:            O(10-­‐9)                  ILU+BiCGstab:      O(10-­‐7)  PSPIKE-­‐WBP:        O(10-­‐7)  

PrecondiPoner    consisPng  of  two  overlapped  blocks  

Precondi(oner    consis(ng  of  two  overlapped  blocks  (overlap  =  11)  

37  

Page 38: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Precondi(oner    

A    Overlapped  diagonal  blocks  

Overlapped  diagonal  blocks  

M

PSPIKE-WBP 38  

Page 39: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

ASIC_680k_8  nonsymmetric  size:        ~    5.5  M  nnz:        ~31.0  M  

PrecondiUoner:  16  overlapped    diagonal  blocks    overlap  size  =  100  

PSPIKE-­‐WBP  

N  =  16  m  –  15  β  N  =  16  m  –  15  β  

Page 40: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

PSPIKE-­‐WBP  vs.    ILU-­‐precondiUoned  BiCGstab  

•  ILUT-­‐BiCGstab    (one  core):    – drop  tolerance  =  10-­‐3  –   fill_in  per  row  =  10%  – 16  iteraUons,    rel.  res.  =  O(10-­‐6)  – Time  ~  276  sec.  

•  PSPIKE-­‐WBP  :  – 1  iteraUon,  rel.  res  =  O(10-­‐9)  

Page 41: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

PSPIKE-­‐WBP  (mulUple  nodes)  vs.    

ILU-­‐precondiUoned  BiCGstab  (one  core)  

#  of  cores   32  (4  nodes)  

64  (8  nodes)  

128  (16  nodes)  

Time  (sec.)   113   31   9  

Speed  improv.  over  ILU-­‐BiCGstab    (one  core)  

2.4   8.9   30.7  

Page 42: Towards(a(Scalable(Parallel(Sparse( Linear(SystemSolver · Role of Reordering Schemes • “Banded”, or low-rank perturbations of banded, systems are often obtained after RCM or

Thank  You!