BLAS Specification Revisited Linda Kaufman William Paterson University

BLAS Specification Revisited - William Paterson Universitycs.wpunj.edu/~kaufmanl/BLAS.pdf · BLAS Specification Revisited Linda Kaufman William Paterson University . Wish list •


BLAS Specification Revisited

Linda Kaufman

William Paterson University


Wish list

• Simultaneous Orthogonal Transformations

– Generation

– Application

• Simultaneous elementary transformations

• Simultaneous gemv with different matrices

• Simultaneous Householders

• Symmetric rank-k update that manufacturers might not balk at implementing


Simultaneous orthogonal transformations

A. QZhes: reduction of A to Hessenberg form and B to triangular form

B. QR iteration for symmetric tridiagonal eigenvalues

C. Reduction of a narrow banded matrix to tridiagonal form in order to solve an eigenvalue problem Ax = λx

    A. Approximation of a 1-dimensional PDE with a Rayleigh-Ritz-Galerkin approach using cubic or quintic B-splines

    B. Periodic boundary conditions of a 1-dimensional PDE with finite elements

    C. Coupling several problems with 1-dimensional PDEs, as in designing optical fibers

D. Ax = λBx, A and B symmetric, B positive definite, A and B banded

    A. B tridiagonal mass matrix in finite element approximation of a 1-D problem

E. Banded singular value decomposition (S. Rajamanickam, thesis under Tim Davis)

F. To prevent fill-in when pivoting in symmetric indefinite banded factorization

    A. Optimization problem with negative curvature


QZ phase 1: reduction of A to upper Hessenberg form and B to triangular form for solving the nonsymmetric problem Ax = λBx

Assume we have used orthogonal transformations to reduce B to triangular form and have applied them to A. We now have

[Diagram: sparsity patterns of A and B]

LAPACK gets rid of elements of A in the order of

[Diagram: elimination order]

But there are independent operations that can be done simultaneously using the ordering

[Diagram: alternative ordering]

In general 2n simultaneous operations (B. Kågström and Dackland, 1999). One can look at these as blocks or individual elements.


IMTQL1: finding the eigenvalues of Ax = λx for A symmetric tridiagonal

1. Compute shift μ, form B = A − μI

2. Find Q1 that annihilates B21 and form Q1 B Q1^T

3. Chase the unwanted element down the matrix

Parallel QR: keep on determining shifts and do simultaneous chases. Van de Geijn (1993) needed 3 times as many chases, but Kaufman (1994) showed one can get a factor of 2 reduction.
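Steps 1 to 3 can be sketched on a small dense matrix. This is only an illustration under stated assumptions: the function names `rotate` and `qr_chase` are hypothetical, the matrix is stored as a full list of lists rather than as two vectors, and the shift is passed in rather than computed. IMTQL1 itself uses the QL variant and compact storage.

```python
import math

def rotate(A, i, j, c, s):
    """Similarity transform A <- G A G^T with a plane rotation in (i, j)."""
    n = len(A)
    for k in range(n):                      # update rows i and j
        a, b = A[i][k], A[j][k]
        A[i][k], A[j][k] = c * a + s * b, -s * a + c * b
    for k in range(n):                      # update columns i and j
        a, b = A[k][i], A[k][j]
        A[k][i], A[k][j] = c * a + s * b, -s * a + c * b

def qr_chase(A, mu):
    """One implicit shifted QR step on a symmetric tridiagonal matrix.

    A is a dense list of lists and is modified in place.
    """
    n = len(A)
    # Step 2: rotation that would annihilate entry (1, 0) of B = A - mu*I.
    x, z = A[0][0] - mu, A[1][0]
    for k in range(n - 1):
        r = math.hypot(x, z)
        c, s = (1.0, 0.0) if r == 0.0 else (x / r, z / r)
        rotate(A, k, k + 1, c, s)
        # Step 3: the unwanted element now sits at (k + 2, k);
        # the next rotation is chosen to remove it.
        if k < n - 2:
            x, z = A[k + 1][k], A[k + 2][k]
```

Each rotation touches only two adjacent rows and columns, which is why several chases, started a few rows apart, can proceed at the same time.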


Diagram of annihilation using Givens rotations for the banded eigenvalue problem

Eventually every kth row has an element that could be annihilated for r 2k-1 diagonals (Kaufman, 1984).

Implemented in LAPACK (Christian Bischof, Bruno Lang, and Xiaobai Sun, SBR toolbox, 2000).

Saw a reduction by a factor of 5 for narrow bands on a Cray.

Would like to have been able to generate Givens rotations simultaneously; killed by manufacturers.
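What "generating Givens rotations simultaneously" might look like as an interface can be sketched as follows. The names `givens`, `givens_batch`, and `apply_batch` are hypothetical and do not correspond to any existing BLAS routine; the point is only that the rotations are independent, so both generation and application can be done for a whole batch at once.

```python
import math

def givens(a, b):
    """Return (c, s, r) with c*a + s*b = r and -s*a + c*b = 0."""
    if b == 0.0:
        return 1.0, 0.0, a
    r = math.hypot(a, b)
    return a / r, b / r, r

def givens_batch(pairs):
    """Generate one rotation per (a, b) pair; the pairs are independent."""
    return [givens(a, b) for a, b in pairs]

def apply_batch(rotations, rows_i, rows_j):
    """Apply rotation k to the row pair (rows_i[k], rows_j[k]) in place."""
    for (c, s, _), x, y in zip(rotations, rows_i, rows_j):
        for col in range(len(x)):
            xi, yj = x[col], y[col]
            x[col] = c * xi + s * yj
            y[col] = -s * xi + c * yj
```

In a vectorized or threaded implementation the list comprehension and the outer loop in `apply_batch` would be the parallel dimension.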


Diagrams of parallel Crawford for reducing symmetric tridiagonal Ax = λBx to a standard eigenvalue problem

[Diagrams: panels for A and B]


Simultaneous stabilized elementary transformations

• Simultaneously factoring banded linear systems

• Tinvit: given several eigenvalues (λ1, λ2, …, λk) of a tridiagonal system, determine their eigenvectors. Solve with A − λ1 I, A − λ2 I, …, A − λk I simultaneously

• Shotgun bisection

• Two-dimensional separable elliptic PDEs solved using marching (Bank), Rayleigh-Ritz-Galerkin (Kaufman and Warner), or collocation (Fairweather)

• Matrix has the form A = [tensor product], where the S’s and M’s are banded

• Queueing problems leading to separable matrices (Kaufman, 1983)

• Symmetric indefinite using stabilized elementary transformations to prevent fill-in
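The Tinvit item can be illustrated with a sketch that solves several shifted tridiagonal systems at once. It is a simplification under loud assumptions: the function name is hypothetical, there are no row interchanges (so it is not the stabilized version, and it assumes each shifted matrix is safely nonsingular, for example diagonally dominant), and a single right-hand side is shared by all shifts.

```python
def solve_shifted_tridiagonal(d, e, shifts, b):
    """Solve (T - s*I) x = b for every shift s.

    T is symmetric tridiagonal with diagonal d and off-diagonal e.
    The row loop is outermost, so each elimination step is carried
    out for all shifts at once. No pivoting is performed.
    """
    n, k = len(d), len(shifts)
    piv = [[0.0] * k for _ in range(n)]   # piv[i][j]: i-th pivot for shift j
    rhs = [[b[i]] * k for i in range(n)]  # rhs[i][j]: i-th right-hand side entry
    for j in range(k):
        piv[0][j] = d[0] - shifts[j]
    for i in range(1, n):                 # forward elimination
        for j in range(k):                # independent across shifts
            m = e[i - 1] / piv[i - 1][j]
            piv[i][j] = d[i] - shifts[j] - m * e[i - 1]
            rhs[i][j] -= m * rhs[i - 1][j]
    x = [[0.0] * k for _ in range(n)]
    for j in range(k):
        x[n - 1][j] = rhs[n - 1][j] / piv[n - 1][j]
    for i in range(n - 2, -1, -1):        # back substitution
        for j in range(k):
            x[i][j] = (rhs[i][j] - e[i] * x[i + 1][j]) / piv[i][j]
    return x  # x[i][j] is component i of the solution for shift j
```

The inner loop over `j` has unit stride and no dependencies, which is the property a simultaneous elementary-transformation kernel would exploit.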


Separable matrix in solving 2-dimensional PDEs

Matrix has the form A = [tensor product]

where Sx and Mx are m x m banded and Sy and My are n x n banded; Mx, My, and Sy are symmetric and My is positive definite.

Need to solve Av = f, where A has mn rows and columns.

Algorithm:

(1) Find D and Z such that Z^T My Z = I and Z^T Sy Z = D (a generalized eigenvalue problem)

(2) Compute g = [...] f

(3) Solve the n banded systems given by [...] h = g

(4) Compute v = [...] h

Steps 2 and 4 just use matrix-matrix multiply and are fast. Sometimes Z and D are known a priori, as for Poisson’s equation with a uniform grid.

One can reduce the cost of Step (3) by a factor of 4 by using simultaneous axpy’s.
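A worked version of steps (1) to (4), under an assumed form of A. The assumption, which is not taken from the slides, is that v and f are arranged as m x n matrices V and F and that Av = f reads S_x V M_y + M_x V S_y = F, which corresponds to A = M_y ⊗ S_x + S_y ⊗ M_x acting on the stacked columns of V. Under that reading the algorithm follows from a change of variables:

```latex
\begin{align*}
S_x V M_y + M_x V S_y &= F
  && \text{assumed form of } Av = f \\
V &= H Z^{T}
  && \text{change of variables} \\
S_x H \,(Z^{T} M_y Z) + M_x H \,(Z^{T} S_y Z) &= F Z
  && \text{multiply on the right by } Z \\
S_x H + M_x H D &= G, \quad G = F Z
  && \text{step (2), using step (1)} \\
(S_x + d_j M_x)\, h_j &= g_j, \quad j = 1, \dots, n
  && \text{step (3): } n \text{ banded } m \times m \text{ systems} \\
V &= H Z^{T}
  && \text{step (4)}
\end{align*}
```

Here h_j and g_j are the columns of H and G and d_j is the jth diagonal entry of D. The n systems in step (3) share the band structure of S_x and M_x and differ only in d_j, so they can be factored in lockstep.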


Queueing Problems

Often working with singular matrices A with the form

[formula]

where the B’s are not symmetric but might have symmetric zero structure. There exist matrices Q and Z such that Q Bj Z is diagonal. Usually all the B’s are the identity matrix except for one.

The variable q denotes the number of queues. The problems get large quickly: if one has 10 waiting spaces in q queues, the number of variables is O(10^q).

As in the PDE case, one can reduce the problem using generalized eigendecompositions to diagonal blocks containing tridiagonal matrices, which here could be unsymmetric.


Symmetric rank k updates

Originally updates could have the form A = A + α X X^T

This could not accommodate the quasi-Newton BFGS update of the approximate Hessian in optimization.

LAPACK did not use it and instead treated the lower triangular part as shown, where the rectangles used GEMM and the small lines used DGEMV.

[Diagram: lower triangle split into rectangles and small lines]

Could use simultaneous DGEMV here
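The splitting of the lower triangle into rectangles and small lines can be sketched in plain Python. The helper names are hypothetical and the loops stand in for GEMM and DGEMV calls; this is an illustration of the partitioning, not LAPACK's actual code.

```python
def gemm_update(A, X, alpha, rows, cols):
    """Rectangle: A[rows, cols] += alpha * X[rows, :] @ X[cols, :]^T."""
    for i in rows:
        for j in cols:
            A[i][j] += alpha * sum(xi * xj for xi, xj in zip(X[i], X[j]))

def gemv_update(A, X, alpha, i, cols):
    """One row of a diagonal block: A[i, cols] += alpha * X[cols, :] @ X[i, :]."""
    for j in cols:
        A[i][j] += alpha * sum(xi * xj for xi, xj in zip(X[i], X[j]))

def lower_rank_k_update(A, X, alpha, nb):
    """Update only the lower triangle of A with alpha * X X^T, block size nb."""
    n = len(A)
    for j0 in range(0, n, nb):
        j1 = min(j0 + nb, n)
        # Small lines: rows inside the diagonal block, one GEMV-like call each.
        for i in range(j0, j1):
            gemv_update(A, X, alpha, i, range(j0, i + 1))
        # Rectangle below the diagonal block: one GEMM-like call.
        gemm_update(A, X, alpha, range(j1, n), range(j0, j1))
```

The `gemv_update` calls within one diagonal block are short and independent of one another, which is where a simultaneous DGEMV would apply.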


Workarounds for symmetric updates

At the 2011 Householder conference, Jennifer Scott suggested adding extra space so that one can use GEMMs throughout.

For symmetric indefinite linear systems, the reduction uses A = A + Y D Y^T, where D is either 1 x 1 or 2 x 2. For a block version, D would be a sequence of 1 x 1 or 2 x 2 blocks.

The 2002 BLAS suggests A = A + Y J Y^T, where J is tridiagonal. I don’t know of any implementations.

Perhaps better to have A = A + X Y^T but only update the triangular part of A.


Symmetric banded factorization

For symmetric banded matrices, Kaufman’s retraction algorithm requires (2m+1)n space even though the original matrix can be specified using (m+1)n. The extra space is to store complications for 2 x 2 pivots. Thus one can imagine the image below, where the stuff above the diagonal is just scratch space with 1 x 1’s.

[Diagram: band storage with scratch space above the diagonal]


Bunch-Kaufman for symmetric indefinite non-banded

Partition A as

    A = [ D  Y^T ]
        [ Y  B   ]

where D is either 1 x 1 or 2 x 2.

Reset B to B’ = B − Y D^-1 Y^T when deleting Y’s.

Choice of dimension of D depends on the magnitude of a11 versus other elements.

Continue with B’ and partition it as above.

Bandwidth spreads with Bunch-Kaufman on a banded matrix because of pivoting for stability.
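One elimination step of this kind can be sketched as follows, assuming the partition has D in the leading p x p block, Y below it, and B in the trailing block. The function name `eliminate` is hypothetical, the matrix is stored densely, and no pivot selection or interchange is done here.

```python
def eliminate(A, p):
    """One step on symmetric A (list of lists) with a p x p pivot D,
    p = 1 or 2: return B' = B - Y D^{-1} Y^T."""
    n = len(A)
    if p == 1:
        Dinv = [[1.0 / A[0][0]]]
    else:
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        Dinv = [[A[1][1] / det, -A[0][1] / det],
                [-A[1][0] / det, A[0][0] / det]]
    # Y holds rows p..n-1 of the first p columns.
    Y = [A[i][:p] for i in range(p, n)]
    # Z = D^{-1} Y^T, stored by columns: Zt[j] is column j of Z.
    Zt = [[sum(Dinv[a][b] * y[b] for b in range(p)) for a in range(p)] for y in Y]
    # B' = B - Y Z
    return [[A[p + i][p + j] - sum(Y[i][a] * Zt[j][a] for a in range(p))
             for j in range(n - p)] for i in range(n - p)]
```

For example, with the zero-diagonal matrix `[[0, 1, 2], [1, 0, 3], [2, 3, 4]]` a 1 x 1 pivot is impossible, while `eliminate(A, 2)` uses the 2 x 2 block `[[0, 1], [1, 0]]` as D.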


Banded algorithm based on B-K

1) Let c = |ar1| = max in abs. value in col. 1

2) If |a11| >= w c, use a 1 x 1 pivot. Here w is a scalar to balance element growth, like 1/3

Else

3) Let f = max element in abs. value in column r

4) If w c*c <= |a11| f, use a 1 x 1 pivot

Else

5) Interchange the rth and second rows and columns of A

6) Do a sequence of orthogonal or elementary transformations to prevent fill-in while performing a 2 by 2 pivot

7) Perform a 2 x 2 pivot

Never pivot with 1 x 1
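The pivot decision in steps 1 to 5 can be sketched on a dense symmetric matrix. Assumptions made here and not stated on the slide: indices are 0-based, the maxima in steps 1 and 3 are taken over off-diagonal entries (as in the standard Bunch-Kaufman test), and the function name `choose_pivot` is hypothetical. Steps 6 and 7 are not shown.

```python
def choose_pivot(A, w=1.0 / 3.0):
    """Return ("1x1", None) or ("2x2", r) for symmetric A.

    r is the 0-based row/column to interchange with row/column 1
    before performing a 2 x 2 pivot.
    """
    n = len(A)
    if n == 1:
        return "1x1", None
    # Step 1: largest off-diagonal entry in column 0 and its row r.
    r = max(range(1, n), key=lambda i: abs(A[i][0]))
    c = abs(A[r][0])
    # Step 2: the diagonal entry is large enough.
    if abs(A[0][0]) >= w * c:
        return "1x1", None
    # Step 3: largest off-diagonal entry in column r.
    f = max(abs(A[i][r]) for i in range(n) if i != r)
    # Step 4: growth test.
    if w * c * c <= abs(A[0][0]) * f:
        return "1x1", None
    # Step 5 onward: interchange r with 1, then use a 2 x 2 pivot.
    return "2x2", r
```

Note that a 1 x 1 pivot is only ever taken in place, which matches the slide's rule of never interchanging for a 1 x 1.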


Pivoting for stability can ruin bandwidth

Worst case r = m: what happens in pivoting

[Diagram: sparsity pattern after pivoting; the entries a, b, c, d appear in one row and, symmetrically, in one column among the x entries]


Reset B’ = B − Y D^-1 Y^T = B − Y Z

Let Z = D^-1 Y^T. Then B’ looks like

[Diagram: band pattern with product entries such as bp, cp, dp, cq, dq, dr]

which by the elimination of 1 element becomes

[Diagram: the pattern after the elimination, with zeros in place of bp, cp, dp]

Continuing in this way gets us back to band form.


In practice: pretreat Z to make zeros so that the rank-2 change does not produce nonzeros outside the band

Partition A as

    A = [ D  Y^T ]
        [ Y  B   ]

where D is 2 x 2.

Let Z = D^-1 Y^T

Reset B’ = Q^T (B − Y Z) Q = Q^T B Q − H G, with Q from the fixup, where H = Q^T Y and G = Z Q

Construct Q so that G = Z Q looks like

    x x x x x x
    x x x 0 0 0

Use a sequence of Givens transformations or stabilized planar elementary transformations to form Q so that the banded structure of Q^T B Q is not upset.

Because H = Q^T Y has the form

[Diagram: sparsity pattern of H]

H G will not extend beyond the band.


Comparison with LAPACK on positive definite matrices, n=2000

         mine-no block   dgbtrf   dgbtf2   dpbtrf   dpbtf2   mine,nb=16
  100    0.223           0.266    0.389    0.17     0.218    0.13
  200    0.782           0.873    1.46     0.67     0.773    0.382
  300    1.64            1.65     3.12     1.49     1.62     0.834
  400    2.78            2.41     5.28     2.59     2.76     1.25
  500    4.2             3.78     8.14     3.9      4.13     1.76
  600    5.61            4.81     11.31    5.62     5.63     2.343
  700    7.35            6.3      15.36    7.88     7.33     2.977

[Plot of the values above against the first column; series: mine-dsyr, dgbtrf, dgbtf2, dpbtrf, dpbtf2, mine,nb=16]


Block version on random matrices, n=2000

  m      nonblock-retraction   block retraction   dgbtf2   dgbtrf   2x2 maxr ave
  100    0.327                 0.315              0.682    0.451    444    49
  200    0.986                 0.81               2.58     1.3      315    98
  300    2.08                  1.79               5.22     2.58     365    141
  400    3.37                  2.6                8.92     3.93     327    201
  500    5.45                  4.38               13.97    5.85     370    231
  600    7.19                  5.61               24.84    7.64     344    299
  700    10.23                 8.46               37.03    9.85     421    293

Only blocking for 1 x 1; stop accumulating when a 2 x 2 is reached. Elementary transformations for “pretreating” Z with 2 x 2.

[Plot: time as a function of the number of planar transformations, m=400, n=2000; series: retraction, dgbtrf]


Possible ways to speed up retraction for consecutive 2 x 2s: each column involves 2 full daxpys plus orthogonal transformations or cut-up daxpys to the same column

(1) Marching

    (1) Work on column i+j when elimination starts at i

    (2) Work on column i+j-1 with elimination starting at i+2

    (3) Work on column i+j-2 with elimination starting at i+4

    Requires simultaneous dgemvs or daxpys

(2) 2 sets of transformations

    (1) Cut-up daxpy or orthogonal transformation applied to i+j stemming from i

    (2) Dgemv involving 4 columns (i, i+1, i+2, i+3) applied to i+j

    (3) Cut-up daxpy applied to i+j stemming from i+2

Back to requesting simultaneous orthogonal or elementary transformations

Van de Geijn to the rescue