Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy

Diego Rivera (1), David Kaeli (1) and Misha Kilmer (2)

(1) Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
{drivera, kaeli}@ece.neu.edu

(2) Department of Mathematics, Tufts University, Medford, MA
[email protected]

www.ece.neu.edu/students/drivera/tlg/tunlib.html

ICSS: Institute for Complex Scientific Software

• Approximate inverse preconditioner: SPAI, MR, etc.

• The PIN tool was used to capture cache events. LRU and random replacement policies were modeled

• Several matrices were evaluated. Results from four representative matrices are shown below:

Plans and future work

• Developing a benchmark suite for evaluating how fill-in can best be used for a given memory hierarchy and application code

• Arriving at an algorithmic approach to select the best values of the preconditioner parameters for a given memory hierarchy

• Proposing a new portable ILU-type preconditioner that performs dynamic matrix fill-in:

    - A reordering technique for improving temporal locality

    - Adapting the number of non-zero elements to the block size of the highest cache level for improving spatial locality

Objective

• To improve the performance of preconditioners targeting sparse matrices

• To accelerate the memory accesses associated with these codes

Motivation

• Prior work targeted Krylov subspace methods

• However, little has been done in the case of preconditioners

“Nothing will be more central to computational science in the next century than the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly. For Krylov subspace matrix iterations, this is preconditioning” from Numerical Linear Algebra by Trefethen and Bau (1997).

Common target applications

• Computational time is a barrier in these applications

• Parallel processing can be used to lower this barrier

• The sparsity of the data reduces the effectiveness of direct parallel computation

• Preconditioners can be used to accelerate the convergence of Krylov subspace methods

• A drawback of these approaches is that it is difficult to choose good values for their tuning parameters

• Choosing good values depends heavily on the structure of non-zero elements of the coefficient matrix

• In our work we have found that it also depends on the memory hierarchy of the machine used to compute the solution

• What about tuning memory access patterns of preconditioner techniques?

Acknowledgement

This project is supported by the National Science Foundation’s Computing and Communication Foundations Division, grant number CCF-0342555, and the Institute for Complex Scientific Software.

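[Figure: common target applications (weather simulations, turbulence problems in airplanes, DNA models), with the linear system $A_{(m,m)}\, x_{(m)} = b_{(m)}$.]

[Diagram: $Ax = b$ → preconditioner → $M^{-1}Ax = M^{-1}b$ → iterative method → solution to the linear system.]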
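The flow from $Ax = b$ through a preconditioner to an iterative method can be sketched with SciPy. This is only an illustration under stated assumptions: it uses SciPy's `spilu` (a SuperLU-based threshold ILU) rather than the preconditioner codes evaluated on the poster, and the function name `solve_with_ilu` and the small 1-D Poisson test matrix are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_with_ilu(A, b, drop_tol=1e-4, fill_factor=10.0):
    """Solve Ax = b with GMRES, using an incomplete LU factorization as M^{-1}."""
    A = sp.csc_matrix(A)
    # Incomplete factorization M = LU ~ A; drop_tol and fill_factor are the tuning knobs.
    ilu = spla.spilu(A, drop_tol=drop_tol, fill_factor=fill_factor)
    # M^{-1} is never formed: each application performs the two triangular solves.
    M_inv = spla.LinearOperator(A.shape, matvec=ilu.solve, dtype=A.dtype)
    x, info = spla.gmres(A, b, M=M_inv)
    return x, info  # info == 0 means GMRES reported convergence

# Hypothetical test problem (not one of the poster's matrices): 1-D Poisson matrix.
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)
x, info = solve_with_ilu(A, b)
```

A smaller `drop_tol` or a larger `fill_factor` generally yields a more accurate but more expensive factorization, which is the trade-off the tuning parameters control.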

[Figure: bar chart of % good duples (y-axis, 0 to 100) for the matrices Zd_jac3, Cage13, Ohne2, Torso3, Ldoor and Venkat01.]

Results for the ILUD preconditioner with the GMRES method, 14 possible values for each parameter (drop tolerance, diagonal compensation parameter). There are 378 possible combinations.
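A parameter sweep of this kind can be sketched as follows. Several things here are assumptions rather than the poster's setup: SciPy offers no ILUD, so `spilu` with its `drop_tol` and `fill_factor` parameters stands in for it; a duple is counted as "good" when GMRES reports convergence within the iteration budget, which is only one plausible reading of the chart; and the function name and test matrix are hypothetical.

```python
import itertools
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def percent_good_duples(A, b, drop_tols, fill_factors, maxiter=50):
    """Percentage of (drop_tol, fill_factor) duples for which GMRES converges."""
    A = sp.csc_matrix(A)
    duples = list(itertools.product(drop_tols, fill_factors))
    good = 0
    for drop_tol, fill_factor in duples:
        try:
            ilu = spla.spilu(A, drop_tol=drop_tol, fill_factor=fill_factor)
        except RuntimeError:
            continue  # factorization broke down: counted as a bad duple
        M_inv = spla.LinearOperator(A.shape, matvec=ilu.solve, dtype=A.dtype)
        _, info = spla.gmres(A, b, M=M_inv, restart=20, maxiter=maxiter)
        if info == 0:  # assumed definition of "good": GMRES converged
            good += 1
    return 100.0 * good / len(duples)

# Hypothetical test problem and parameter grid (not the poster's values).
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)
pct = percent_good_duples(A, b, drop_tols=[1e-1, 1e-2, 1e-4], fill_factors=[1, 5, 10])
```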

Target preconditioners

Preconditioner   Description of parameters
ILU()            level-of-fill
ILUT             level-of-fill, drop tolerance
ILUTP            level-of-fill, drop tolerance and tolerance ratio (permtol)
ILUD             drop tolerance, diagonal compensation parameter
ILUDP            drop tolerance, diagonal compensation parameter and tolerance ratio (permtol)
…                …

Evaluation environment

                        Intel XEON 3.06 GHz    Ultra Sparc-III 750 MHz
Level 1                 8 KB 4-way for data    64 KB 4-way for data
Level 2                 512 KB 8-way           8 MB 2-way
Level 3                 1 MB 8-way             N/A
Replacement algorithm   pseudo-LRU at all      pseudo-random at all
                        cache levels           cache levels
RAM                     2 GB                   1 GB
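A cache level from the evaluation environment can be modeled with a small set-associative simulator. The sketch below is an assumption-laden illustration, not the model used with PIN: it implements true LRU rather than the pseudo-LRU of the real hardware, the 64-byte line size is assumed, and the class name `SetAssociativeLRUCache` is hypothetical.

```python
from collections import OrderedDict

class SetAssociativeLRUCache:
    """Set-associative cache with true LRU replacement, counting hits and misses."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set: keys are tags, order is least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)     # evict the least recently used line
        lines[tag] = None
        self.misses += 1
        return False

# 8 KB, 4-way: the Xeon Level 1 data cache; the 64-byte line size is an assumption.
l1 = SetAssociativeLRUCache(size_bytes=8 * 1024, ways=4, line_bytes=64)
for addr in range(0, 16 * 1024, 8):       # stream once through 16 KB of 8-byte values
    l1.access(addr)
```

In this streaming pattern each 64-byte line is missed once and then hit for its remaining seven 8-byte elements, which shows spatial locality alone; a random-replacement variant would only change the eviction line.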

Matrices

Name       Rows        Non-zero elements   Numerical symmetry (NS)   NS/B
Raefsky3   21,200      1,488,768           48%                       4.3
Ldoor      952,203     42,493,817          100%                      1.06
Cage14     1,505,785   27,130,349          21%                       0.48
Torso3     259,156     4,429,042           0%                        0

The ratio of numerical symmetry to matrix bandwidth (NS/B) decreases in the order Raefsky3, Ldoor, Cage14, Torso3.

Final error norm for the first 13 duples (l-fill, tol), sorted in increasing order of execution time of ILUT and GMRES

[Figure panels: Raefsky3 (y-axis: final error norm, 104.5 to 107; x-axis: duple(l-fill, tol)) and Ldoor (y-axis: final error norm, 0 to 0.05; x-axis: duple(l-fill, tol)), followed by the two duple tables below.]

Xeon                        Ultra
fill-in    drop tolerance   fill-in    drop tolerance
1.00E+00   5.00E-01         1.00E+00   5.00E-01
1.00E+00   2.50E-01         1.00E+00   2.50E-01
1.00E+00   1.00E-01         1.00E+00   1.00E-01
1.00E+00   5.00E-02         1.00E+00   5.00E-02
1.00E+00   2.50E-02         1.00E+00   2.50E-02
1.00E+00   1.00E-02         1.00E+00   1.00E-02
1.00E+00   1.00E-03         1.00E+00   1.00E-03
1.00E+00   1.00E-04         1.00E+00   1.00E-04
1.00E+00   1.00E-05         1.00E+00   1.00E-05
1.00E+00   1.00E-10         1.00E+00   1.00E-06
1.00E+00   1.00E-20         1.00E+00   1.00E-07
1.00E+00   1.00E-30         1.00E+00   1.00E-20
1.00E+00   1.00E-06         1.00E+00   1.00E-30

Xeon                        Ultra
fill-in    drop tolerance   fill-in    drop tolerance
5.00E+01   1.00E-02         5.00E+01   1.00E-02
5.00E+01   1.00E-03         5.00E+01   1.00E-07
5.00E+01   1.00E-04         5.00E+01   1.00E-06
5.00E+01   1.00E-07         5.00E+01   1.00E-03
5.00E+01   1.00E-06         5.00E+01   1.00E-05
5.00E+01   1.00E-10         5.00E+01   1.00E-10
5.00E+01   1.00E-05         5.00E+01   1.00E-04
5.00E+01   1.00E-01         5.00E+01   1.00E-01
4.00E+01   1.00E-01         5.00E+01   5.00E-02
5.00E+01   5.00E-02         5.00E+01   2.50E-02
5.00E+01   2.50E-02         4.00E+01   1.00E-01
5.00E+01   2.50E-01         4.00E+01   5.00E-02
4.00E+01   5.00E-02         4.00E+01   2.50E-02

[Figure panel: Torso3 (y-axis: final error norm, 0 to 0.006; x-axis: duple(l-fill, tol)).]

Xeon                        Ultra
fill-in    drop tolerance   fill-in    drop tolerance
1.00E+00   5.00E-01         1.50E+01   2.50E-01
4.00E+01   5.00E-01         1.30E+01   2.50E-01
2.00E+00   5.00E-01         4.00E+01   1.00E-01
2.00E+01   5.00E-01         3.00E+01   1.00E-01
1.70E+01   5.00E-01         5.00E+01   1.00E-01
3.00E+00   5.00E-01         2.00E+01   1.00E-01
1.50E+01   5.00E-01         7.00E+00   5.00E-01
5.00E+00   5.00E-01         9.00E+00   5.00E-01
3.00E+01   5.00E-01         1.30E+01   5.00E-01
9.00E+00   5.00E-01         1.10E+01   5.00E-01
5.00E+01   5.00E-01         2.00E+00   5.00E-01
1.10E+01   5.00E-01         1.50E+01   5.00E-01
1.30E+01   5.00E-01         5.00E+00   5.00E-01

[Figure panel: Cage14 (y-axis: final error norm, 0 to 0.03; x-axis: duple(l-fill, tol)).]

Xeon                        Ultra
fill-in    drop tolerance   fill-in    drop tolerance
1.70E+01   5.00E-02         4.00E+01   2.50E-02
1.50E+01   5.00E-02         5.00E+01   2.50E-02
1.10E+01   5.00E-02         2.00E+01   2.50E-02
2.00E+01   5.00E-02         3.00E+01   2.50E-02
3.00E+01   5.00E-02         1.50E+01   2.50E-02
1.30E+01   5.00E-02         1.70E+01   2.50E-02
4.00E+01   5.00E-02         1.70E+01   5.00E-02
5.00E+01   5.00E-02         4.00E+01   5.00E-02
1.70E+01   2.50E-02         5.00E+01   5.00E-02
3.00E+01   2.50E-02         1.50E+01   5.00E-02
1.50E+01   2.50E-02         3.00E+01   5.00E-02
5.00E+01   2.50E-02         1.30E+01   5.00E-02
2.00E+01   2.50E-02         1.10E+01   5.00E-02

[Figure: correlation of load accesses and execution time. Panels: Ultra Sparc-III (DTLB, DL1, L2) and Intel XEON (DTLB, DL1, L2, L3); y-axis: value of correlation coefficient, 0 to 1.]

Correlation of load accesses and execution time
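A correlation coefficient between per-level load accesses and execution time can be computed as sketched below. The poster does not say which coefficient was used, so the Pearson coefficient is an assumption here, and the sample numbers are hypothetical placeholders rather than measured data.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                     # center both samples
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical measurements, one entry per duple (not data from the poster).
l2_load_accesses = [1.2e6, 2.9e6, 4.1e6, 6.3e6, 7.8e6]
execution_time_s = [0.8, 1.9, 2.6, 4.4, 5.1]
r = pearson_correlation(l2_load_accesses, execution_time_s)
```

A value of `r` close to 1 for a given level (DTLB, DL1, L2 or L3) would indicate that load accesses at that level track execution time closely.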

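Cost of preconditioner vs. cost of Krylov method

• Cost of preconditioner = cost of Krylov method: not interesting, usually easy problems

• Cost of preconditioner >> cost of Krylov method (case for Xeon): the cost of the preconditioner dominates execution time. It is desirable to pay less for the preconditioner

• Cost of preconditioner << cost of Krylov method (case for Ultra): the cost of the Krylov method dominates execution time. It is desirable to pay more for the preconditioner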

[Figure legend: in the i-th iteration of the outer loop]

• Data accessed but not modified

• The i-th row: data accessed and modified

• Data not accessed
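The access pattern in the legend matches a row-wise (IKJ) incomplete factorization, which can be sketched as below. This is an assumption about which variant the figure depicts; the code is a dense-array ILU(0) written for readability, not the implementation that was measured, and `ilu0_ikj` is a hypothetical name.

```python
import numpy as np

def ilu0_ikj(A):
    """Row-wise (IKJ) ILU(0): entries are updated only where A is already non-zero."""
    LU = np.array(A, dtype=float)
    n = LU.shape[0]
    pattern = LU != 0                     # sparsity pattern of A, kept fixed
    for i in range(1, n):                 # outer loop: row i is accessed and modified
        for k in range(i):                # rows k < i are accessed but not modified
            if not pattern[i, k]:
                continue
            LU[i, k] /= LU[k, k]          # multiplier, stored in row i
            for j in range(k + 1, n):
                if pattern[i, j]:         # no fill-in outside the pattern
                    LU[i, j] -= LU[i, k] * LU[k, j]
        # rows below i are not accessed in this iteration
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    return L, U

# Tridiagonal example: LU of a tridiagonal matrix creates no fill, so ILU(0) is exact.
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
L, U = ilu0_ikj(A)
```

Because every earlier row is re-read while only row i is written, how many of those rows stay cached between iterations depends on the amount of fill-in, which is where the preconditioner parameters meet the memory hierarchy.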