Efficient Parallel Implementation of Classical Gram …pmaa06.irisa.fr/pres/14-Yokozawa-PMAA06.pdfSep. 7, 06 PMAA06 1 Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization

Sep. 7, 06 PMAA06 1

Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization Using Matrix Multiplication

Graduate School of Systems and Information Engineering,University of Tsukuba, Japan

Takuya Yokozawa Daisuke Takahashi

Taisuke Boku Mitsuhisa Sato

Sep. 7, 06 PMAA06 2

Overview

Background and objectiveClassical Gram-Schmidt orthogonalization

Using matrix multiplicationProposed algorithm

Recursive CGSPerformance resultsConclusion and future works

Sep. 7, 06 PMAA06 3

Background

Gram-Schmidt orthogonalization process is one of the fundamental algorithms in linear algebraTwo basic computational variants of the Gram-Schmidt process exist:

Classical Gram-Schmidt algorithm (CGS) Modified Gram-Schmidt algorithm (MGS)

Often selected for practical applicationMore stable than the CGS algorithm

Sep. 7, 06 PMAA06 4

Background (cont’d)

MGS is stable but, Cannot be expressed by Level-2 BLASParallel implementation requires additional communication

CGS is unstable but,Can be expressed by Level-2 BLASCGS with DGKS correction [Daniel et al. ‘76] is one of the most efficient way to perform the orthogonalization process

In this work, we use the CGS

Sep. 7, 06 PMAA06 5

In this work

We study an efficient implementation of the CGS orthogonalization using matrix multiplicationWe propose a parallel recursive CGS algorithm to perform QR decompositionWe report some performance results on PC-cluster

Sep. 7, 06 PMAA06 6

Related works

Recursion leads to automatic variable blocking for dense linear-algebra algorithms [Gustavson 1997]

Parallel QR factorization[Elmroth and Gustavson 2000]

Recursive QR factorization of by matrixOn 4-way SMP computer

m n

Sep. 7, 06 PMAA06 7

Classical Gram-Schmidt algorithmfor matrix

The CGS orthogonalization can be performed by using Level-2 BLASSuitable for parallelizationThe CGS algorithm is not stable

jjj qqq =

ijijj qaqaq ),( −=

do end

1,n j =dojj aq =

11 do ,j-i =

do end),( ji aq

ja Raw data vector

jq

Inner product

Orthogonalized vector

jq Euclid norm

Sep. 7, 06 PMAA06 8

Naïve implementationCGS for by matrix is computed by

Matrix-vector multiplications (GEMV; Level-2 BLAS) normalizations (NRM2, SCAL; Level-1 BLAS)

66656146136126116166 qqq)qa(q)qa(q)qa(q)qa(q)qa(qaq =−−−−−= ,,,,,,555555555 ,,,,, qqq)qa(q)qa(q)qa(q)qa(qaq 41312111 =−−−−=

44444444 ,,,, qqq)qa(q)qa(q)qa(qaq 312111 =−−−=33332333 ,,, qqq)qa(q)qa(qaq 211 =−−=

2222122 qqq)qa(qaq =−= ,, 1

11111 , qqqaq ==

m n

mm2

GEMV

NRM2, SCAL

)( n21 aaaA Λ= is vector ia

Sep. 7, 06 PMAA06 9

CGS using matrix multiplicationsThe CGS orthogonalization of a matrix can be changed into a matrix multiplication. [Samukawa ’95]The orthogonalization processes of the can be separated

Depend on and Depend onNormalization

312111 )qa(q)qa(q)qa(qaq 44444 ⋅−⋅−⋅−= 444 qqq =

35125115155 )qa(q)qa(q)qa(qaq ⋅−⋅−⋅−=

36126116166 )qa(q)qa(q)qa(qaq ⋅−⋅−⋅−=

451 )qa(q ⋅−

561461 )qa(q)qa(q ⋅−⋅−

555 qqq =

666 qqq =⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛

⋅⋅⋅⋅⋅⋅⋅⋅⋅

⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛−⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛=

)()()()()()()()()(

6

5

4

616161

515151

414141

3

2

1

6

5

4

aqaqaqaqaqaqaqaqaq

qqq

aaa

qqq

( )⎟⎟⎟⎟

⎠

⎞

⎜⎜⎜⎜

⎝

⎛

⎟⎟⎟⎟

⎠

⎞

⎜⎜⎜⎜

⎝

⎛

⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛−⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛= 654

3

2

1

3

2

1

6

5

4

aaaqqq

qqq

aaa

qqq

t

t

t

6

5

4

1q 2q 3q 4a 5a 6a4q 5q

4q 5q 6q

Sep. 7, 06 PMAA06 10

Concept of proposed algorithm

The CGS orthogonalization can also be extended with matrix multiplication into a recursive formulation

The parts of A and C are same structure as original CGS calculation spaceThe parts of Ab and Cb are computed by matrix multiplication

A

CB

Ab

Cb

Matrix-Vector Multiplication （GEMV)Matrix-Matrix Multiplication （GEMM)

Sep. 7, 06 PMAA06 11

ParallelizationWe use row-wise distribution

The proposed recursive algorithm use matrix multiplication and matrix vector multiplication to compute inner products

Column-wise distribution Whole elements of a vector are in one processorAdditional communication is required to compute inner products.

Row-wise distributionPart of elements of a vector are in each processorCorrective communication is required to sums up and distributes inner products

Sep. 7, 06 PMAA06 12

Recursive CGS

else

then) ( if NBm ≤)(GSrecursiveCbegin A,Q,k,m

)GEMM( 2222 ,S,AQ )),(m/(m/kT

)),(m/(m/k ++

)GEMM( 2222 )),(m/(m/k)),(m/(m/k ,QS,Q ++

if endend

kkk qqq =m, kkj ++= 1 do

)GEMV( wa ,,Q jT

kk,j−)GEMV( jkk,j ,,Q qw−

jjj qqq =do end

)2,GS(recursiveC mk,A,Q

)(GSrecursiveC 2m,2mA,Q,k +

: Blocking Size NB

m2m4m

NBNB

4m

2m

Matrix-Vector Multiplication （GEMV)Matrix-Matrix Multiplication （GEMM)

Sep. 7, 06 PMAA06 13

Performance Results

To evaluate the proposed recursive CGS algorithm, we compared

Proposed recursive CGS algorithmNaïve implementation of the CGS algorithm using Level-2 BLAS

The CGS orthogonalization processes were performed on double-precision real data

Sep. 7, 06 PMAA06 14

Evaluation EnvironmentA 32-node Xeon PC-cluster

Xeon 3GHz, 1GB DDR2 Memory1000Base-T Gigabit EthernetLAM/MPI 7.1.1Intel MKL 8.1 gcc 4.0.2Linux 2.6.17All programs were run in 64-bit mode

Sep. 7, 06 PMAA06 15

Performance Results on 32-node Xeon PC cluster

0

2

4

6

8

10

12

14

16

0 2000 4000 6000 8000Matrix Size

GFLO

PS

Recursive

Naïve

about up to 1.72 times

Sep. 7, 06 PMAA06 16

Discussion(1)

The recursive CGS algorithm improved the performance by up to about 1.72 times when matrix size is 8000Cache reusability is improved

Naive implementation uses Level-2 BLASThe recursive CGS uses Level-3 BLAS

Sep. 7, 06 PMAA06 17

Performance Results (Matrix size N=8000)

0

2

4

6

8

10

12

14

16

18

20

0 4 8 12 16 20 24 28 32 36Number of Processors

GF

LO

PS

Recursive Naïve

Up to about 3.6 times

Sep. 7, 06 PMAA06 18

Discussion(2)

We could improve the performance by up to about 3.6 times when matrix size is 8000

Scalability is limitedMPI_ALLREDUCE communication time dominates in total execution time For larger problem size, matrix multiplications should be distributed

Sep. 7, 06 PMAA06 19

Breakdown of performance resultsMatrix Size N=8000

0

400

800

1200

1600

1 2 4 8 16 32

Number of Porcessors

Tim

e [

Sec]

naive communication

recursive communication

naive computation

recursive computation

Sep. 7, 06 PMAA06 20

Conclusion

We proposed a recursive CGS algorithm using matrix multiplicationThe proposed algorithm improved the performance by naïve implementation

Cache reusability was improved since using matrix multiplication

The performance results demonstrate that the recursive CGS algorithm is efficient for reduce execution time of QR decomposition.

Sep. 7, 06 PMAA06 21

Future works

Improve scalability Using other data distribution algorithm

Development algorithm to compute optimal blocking size “NB”

Documents

Efficient Parallel Implementation of Classical Gram …pmaa06.irisa.fr/pres/14-Yokozawa-PMAA06.pdfSep. 7, 06 PMAA06 1 Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization