7
Fast Vector-Scalar-Multiply-and-Add Subroutines for VAX Computers Gerard0 Cisneros Seccion de Graduados, Escuela Superior de Ingenieria Mecanica y Electrica, Instituto Politecnico Nacional, 07738 Mexico, D .F., Mexico Carlos F. Bunge Znstituto de Fisica, Universidad Nacional Autonoma de Mexico, Apartado Postal 20364, 01000 Mexico, D. F., Mexico C. C. J. Roothaan Departments of Physics and Chemistry, University of Chicago, Chicago, Illinois 60637 Received 19 March, 1986; accepted 7 August, 1986 We describe a set of simple VAX assembly language, Fortran-callable subroutines for performing vector- scalar-multiply-and-add operations which can increase processing speed by more than 10%. The routines are simple enough that they may be translated readily for use on other machines. An operation very commonly performed in scientific calculations is the linear combina- tion of vectors. In large scale applications, execution of this operation may represent a sizable fraction of the overall computation; this is the case, for instance, in the four-index transformation of two-electron integrals over atomic orbitals into integrals over molecular orbitals required in quantum chemical cal- culations going beyond the independent par- ticle model.' A linear combination of N vectors may be performed by a succession of vector-scalar- multiply-and-add (VSMA) operations: A(I) = Sl*Vl(Kl + I) + S2*V2(K2 + I) + . .. + SN*Vn(Kn + I) + A(I) where the equal sign denotes assignment (as in Fortran) and n is chosen to optimize the operation on a given machine. The Ki in the above expression are offsets included for generality; in fact, the Vi could represent different portions of the same array, as in a matrix-vector product. Clearly, N/n VSMA operations involving n product terms, plus one involving N.mod(n) product terms are required. On a VAX-11/780 computer, the above con- struct written in Fortran executes at 0.230 double precision (DP) Mflop/s when n = 4, which is close to the VAX maximum speed of 0.264 DP Mflop/s (obtained by executing 500000 DP additions and 500000 DP multi- plications wholly within machine registers in a three instruction loop). Efficient as this may seem, up to a 13% improvement is achieved by recasting the algorithm in assembly language, as illustrated in Table I, where execution times are given for both imple- mentations. In fact, assembly language coded VSMA operations for n = 6 run at the maxi- mum possible speed. Table I. CPU execution times (best of 10 runs, in s) on VAX VMS (version 3.3) for VSMA operations on vectors of dimension 100, requiring 500000 DP multiplications and 500000 DP additions. n Fortran Assembly code 1 5.78 5.38 2 4.58 4.42 3 4.39 4.05 4 4.34 3.85 5 4.35 3.83 6 4.35 3.78 Fastest 1 DP Mflop" 3.78 "Program executing 500000 DP additions and 500000 DP multiplications wholly within machine registers in a three-instruction loop. Journal of Computational Chemistry, Vol. 8, No. 5, 618-624 (1987) 0 1987 by John Wiley & Sons, Inc. CCC 0192-8651/87/050618-07$04.00

Fast vector-scalar-multiply-and-add subroutines for VAX computers

Embed Size (px)

Citation preview

Fast Vector-Scalar-Multiply-and-Add Subroutines for VAX Computers

Gerard0 Cisneros Seccion de Graduados, Escuela Superior de Ingenieria Mecanica y Electrica, Instituto Politecnico Nacional, 07738 Mexico, D. F., Mexico

Carlos F. Bunge Znstituto de Fisica, Universidad Nacional Autonoma de Mexico, Apartado Postal 20364, 01000 Mexico, D. F., Mexico

C. C. J. Roothaan Departments of Physics and Chemistry, University of Chicago, Chicago, Illinois 60637

Received 19 March, 1986; accepted 7 August, 1986

We describe a set of simple VAX assembly language, Fortran-callable subroutines for performing vector- scalar-multiply-and-add operations which can increase processing speed by more than 10%. The routines are simple enough that they may be translated readily for use on other machines.

An operation very commonly performed in scientific calculations is the linear combina- tion of vectors. In large scale applications, execution of this operation may represent a sizable fraction of the overall computation; this is the case, for instance, in the four-index transformation of two-electron integrals over atomic orbitals into integrals over molecular orbitals required in quantum chemical cal- culations going beyond the independent par- ticle model.'

A linear combination of N vectors may be performed by a succession of vector-scalar- multiply-and-add (VSMA) operations: A(I) = Sl*Vl (Kl + I) + S2*V2(K2 + I)

+ . .. + SN*Vn(Kn + I) + A(I) where the equal sign denotes assignment (as in Fortran) and n is chosen to optimize the operation on a given machine. The Ki in the above expression are offsets included for generality; in fact, the Vi could represent different portions of the same array, as in a matrix-vector product. Clearly, N/n VSMA operations involving n product terms, plus one involving N.mod(n) product terms are required.

On a VAX-11/780 computer, the above con- struct written in Fortran executes at 0.230

double precision (DP) Mflop/s when n = 4, which is close to the VAX maximum speed of 0.264 DP Mflop/s (obtained by executing 500000 DP additions and 500000 DP multi- plications wholly within machine registers in a three instruction loop). Efficient as this may seem, up to a 13% improvement is achieved by recasting the algorithm in assembly language, as illustrated in Table I, where execution times are given for both imple- mentations. In fact, assembly language coded VSMA operations for n = 6 run at the maxi- mum possible speed.

Table I. CPU execution times (best of 10 runs, in s) on VAX VMS (version 3.3) for VSMA operations on vectors of dimension 100, requiring 500000 DP multiplications and 500000 DP additions.

n Fortran Assembly code

1 5.78 5.38 2 4.58 4.42 3 4.39 4.05 4 4.34 3.85 5 4.35 3.83 6 4.35 3.78

Fastest 1 DP Mflop" 3.78

"Program executing 500000 DP additions and 500000 DP multiplications wholly within machine registers in a three-instruction loop.

Journal of Computational Chemistry, Vol. 8, No. 5, 618-624 (1987) 0 1987 by John Wiley & Sons, Inc. CCC 0192-8651/87/050618-07$04.00

Vector-Scalar Subroutines for Vax Computers 619

Figure 1 shows the best six Fortran sub- routines for n = 1 through n = 6. More straightforward Fortran programs are pos-

sible than those shown, namely, programs in which the scalar arguments are not copied to local variables (e.g., the statements NL = N,

C Module file VPLIB.FOR: vector processing utility library. c----------------------------------

SUBROUTINE VSMAlD (N,Sl,Vl,A) c---------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),A(N) NL = N SlL= s1 NREST = MOD(NL,6) DO 10 I=l,NREST

DO 20 I-l+NREST,NL,G 10 A(I 1 = SlL*Vl(I )+A(I )

A(I 1 = SlL*Vl(I )+A(I ) A(I+l) = SlL*Vl(I+l)+A(I+l) A(I+2) = SlL*Vl(I+2)+A(I+2) A(I+3) = SlL*Vl(I+3)+A(I+3) A(I+4) = SlL*Vl(I+4)+A(I+4)

20 A(I+5) = SlL*Vl(I+S)+A(I+S) RETURN END c---------------------------------------- SUBROUTINE VSMA2D (N,Sl,S2,Vl,V2,A) c---------------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),V2(N),A(N) NL = N SlL= s1 SZL= s2 NREST = MOD(NL,6) DO 10 I=l,NREST

DO 20 I=l+NREST,NL,G 10 A(I 1 = SlL*Vl(I )+SZL*VZ(I )+A(I )

A(I 1 = SlL*Vl(I )+S2L*V2(I )+A(I 1 A(I+l) = SlL*Vl(I+l)+S2L*V2(I+l)+A(I+1) A(I+2) = SlL*Vl(I+2)+S2L*V2(1+2)+A(I+2) A(I+3) = SlLAV1(1+3)+S2L*V2(1+3)+A(I+3) A(I+4) = SlL*Vl(I+4)+SZL*V2(1+4)+A(I+4)

20 A(I+5) = SlL*Vl(I+S)+SZL*V2(1+5)+A(I+5) RETURN END c---------------------------------------------- SUBROUTINE VSMA3D (N,Sl,S2,S3,Vl,V2,V3,A) c---------------------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),V2(N),V3(N),A(N) NL = N SlL= s1 S2L= s2 S3L= s3 NREST = MOD (NL,6) DO 10 I=l,NREST

Figure 1. Fortran subroutines for VSMA, for n up to 6.

620 Cisneros, Bunge, and Roothaan

10 A(I 1 = SlL*Vl(I )+SZL*VZ(I )+S3L*V3(1 )+A(I )

A(I 1 = SlL*Vl(I )+SZL*VZ(I )+S3L*V3(1 ) + A ( I ) A(I+l) = S1L*V1(1+1)+S2L*V2(1+1~+S3L*V3(1+1~+A~1+1~ A(I+2) = SlL*Vl(I+2)+S2L*V2(1+2)+S3L*V3(1+2)+A(I+2) A(I+3) = SlL*V1(1+3)+S2L*VZ(I+3)+S3L*V3(1+3)+A(I+3) A(I+4) = SlL*Vl(I+4)+SZL*V2(1+4)+S3L*V3(1+4)+A(I+4)

20 A(I+5) = SlL*Vl(I+5)+S2L*V2(1+5)+S3L*V3(1+5)+A(I+5)

DO 20 I=l+NREST,NL,G

RETURN END c---------------------------------------------------- SUBROUTINE VSMA4D (N,Sl,SZ,S3,S4,Vl,VZ,V3,V4,A) c---------------------------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),VZ(N),V3(N),V4(N),A(N) NL = N SlL= s1 S2L= s2 S3L= s3 S4L= s4 DO 10 I=l,NL

RETURN END c---------------------------------------------------------- SUBROUTINE VSMASD (N,Sl,SZ,S3,S4,S5,Vl,VZ,V3,V4,V5,A) c---------------------------------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),VZ(N),V3(N),V4(N),V5(N),A(N) NL = N SlL= s1 S2L= s2 S3L= s3 S4L= 54 S5L= s5 DO 10 I=l,NL

10 A ( 1 ) = SlL*Vl(I)+S2L*V2(I)+S3L*V3(1)+S4L*V4(I)+A(I)

10 A(1) = SlL*Vl(I)+S2L*V2(1)+S3L*V3(I)+S4L*V4(1) * +SSL*V5(I)+A(I) RETURN END c---------------------------------------------------------------- SUBROUTINE VSMA6D (N,S1,S2,S3,S4,S5,S6,Vl,V2,V3,V4,V5,V6,A) c---------------------------------------------------------------- IMPLICIT DOUBLE PRECISION (A-H,O-Z) DIMENSION Vl(N),VZ(N),V3(N),V4(N),V5(N),VG(N,AN) NL = N SlL= s1 S2L= s2 S3L= s3 S4L= s4 S5L= s5 S6L= S6 DO 10 I=l,NL

10 A ( 1 ) = SlL*Vl(I)+S2L*VZ(I)+S3L*V3(I)+S4L*V4(1) rt +S5L*V5(I)+S6L*V6(I)+A(I)

RETURN END

Figure 1 (continued)

Vector-Scalar Subroutines for Vax Computers 62 1

etc.). However, this copying to local variables results in a 5% improvement in execution speed. Unrolling of DO loops results in a 10% gain in efficiency for n = 1 and smaller gains for higher n values up to n = 3, whereas for n > 4 unrolling is slightly detrimental to efficiency. Maximum efficiency in Fortran is achieved for n = 4.

Figure 2 shows the corresponding routines in VAX assembly language. The key idea is to minimize the number of memory references within the loop, so that results are kept in machine registers as much as possible. All subroutines require a sequence of in- structions to retrieve the arguments whose addresses are stored in a short array pointed at by the VAX Argument Pointer (AP) regis- ter. For n > 2, there are not enough registers in which to keep all scalars Si, so that

the machine stack [pointed at by the Stack Pointer (SP) register] is used to store some or all of the scalars, depending on n. Vector ele- ments are retrieved and stored using auto- increment addressing, keeping the addresses of all vectors in machine registers. The de- scription of the code is completed by the com- ments accompanying each instruction.

In order to make use of an assembler sub- routine, say, subroutine VAX-VSMABD, file VPLIBMAC.MAFt given in Figure 2 must be compiled and linked to a calling program con- taining the following lines of code

IF (VAX) THEN CALL VAX-VSMA2D (N, S1, S2, V1, V2, A)

CALL VSMA2D (N, S1, S2, V1, V2, A) ELSE

END IF

C Module file VPLIBMAC.MAR: vector processing utility library. C The prefix VAX- is used to distinguish these subroutines from C their Fortran language counterparts.

.TITLE VSMAD

.PSECT VSMCODE,PIC,CON,REL,LCL,SHR,EXE,RD,NOWRT,LONG

; SUBROUTINE VAX-VSMAlD (N,Sl,Vl,A)

.ENTRY ADDL2 MOVL M O W MOVL MOVL

LOOP1: MULD3 ADDD2 SOBGTR RET

VAX_VSMAlD,"M<R2,R3,R4,R8> #4 ,AP ;skip over number of arguments @(AP)+,R4 ;get vector size W A P ) + ,R2 ;get S1 (AP)+,R8 ;get address of V1 (AP)+ ,R12 ;get address of A R2,(R8)+,RO ;do Sl*Vl(I) RO,(R12)+ ;add to A(1) R4 ,LOOP1 ;N=N-1, loop while N>O

I SUBROUTINE VAX-VSMA2D (N,Sl,SZ,Vl,VZ,A)

.ENTRY VAX-VSMA2D,"M<R2,R3,R4,R6,R7,R8,R9,RlO,Rll> ADDL2 #4,AP ;skip over number of arguments MOVL (3 ( AP 1 + ,R4 ;get value of N into R4 MOVD + ,R6 ;value of S1 into R6-7 M O W @(AP)+,R8 ;value of S2 into R8-9 MOVL ( AP ) + , R10 ;address of V1 into R10 MOVL ( AP) + ,R11 ;address of V2 into R11 MOVL (AF')+,R12 ;address of A into R12

LOOP2: MULD3 R6,(RlO)+,RO ;Sl*Vl(I) into RO MULD3 R8,(Rll)+,R2 ;S2*V2(I) into R2 ADDD2 R2,RO ;add them together ADDD2 RO,(R12)+ ;now add sum to A(1)

Figure 2. Assembly language subroutines for VSMA, for n up to 6.

622 Cisneros, Bunge, and Roothaan

SOBGTR R4 ,LOOP2 ;N=N-1, loop while N)O RET

c SUBROUTINE VAX-VSMA3D (N,Sl,SZ,S3,Vl,VZ,V3,A)

. ENTRY ADDL2 MOVL M O W MOVD MOVD MOVL MOVL MOVL MOVL

LOOP3: MULD3 MULD3 ADDDZ MULD3 ADDD2 ADDDZ SOBGTR RET

#4,W ;skip over number of arguments @ ( Ap) +,R8 ;value of N to R8 @(AP)+,-(SP) ;value of S1 to the stack @(AP)+,R4 ;value of S2 to R4-5 (3 ( Ap ) + .R6 ;value of S3 to R6-7 (AP)+,R9 ;address of V1 to R9 ( AP) + ,R10 ;address of V2 to R10 (AP)+,R11 ;address of V3 to R11 (AP) +,R12 ;address of A to R12 (SP),(RS)+,RO ;Sl*Vl(I) to RO R4,(RlO)+,R2 ;S2*V2(I) to R2 R2 ,RO ;add them together R6,(Rll)+,R2 ;S3*V3(1) to R2 R2 ,RO ;add to previous sum RO,(R12)+ ;add sum to A(1) R8, LOOP3 ;N=N-1, loop while N)O

# SUBROUTINE VAX-VSMA4D (N,Sl,SZ,S3,S4,Vl,VZ,V3,V4,A)

.ENTRY ADDL 2 MOVL MOVD MOVD M O W MOVD MOVL MOVL MOVL MOVL MOVL

LOOP4: MULD3 MULD3 ADDD2 MULD3 ADDDZ MULD3 ADDD2 ADDDZ SOBGTR RET

VAX-VSMA4D,"M<R2,R3,R4,R6,R7,R8,R9,RlO,Rll> #4 ,AP ;skip over number of arguments @(AP)+,R4 ;value of N to R4 @(AP)+,-(SP) ;value of S1 to the stack @(AP)+,-(SP) ;value of S2 to the stack @(AP)+,-(SP) ;value of S3 to the stack @ (AP) + ,R6 ;value of S4 to R6-7 (AP)+,R8 ;address of V1 to R8 ( AP 1 + , R9 ;address of V2 to R9 (AP)+,R10 ;address of V3 to R10 (AP)+,R11 ;address of V4 to R11 (AP)+,R12 ;address of A to R12 lG(SP),(RB)+,RO ;Sl*Vl(I) to RO 8(SP),(R9)+,R2 ;S2*V2(I) to RZ R2 ,RO ;add these two into RO (SP),(RlO)+,RZ ;S3*V3(1) to R2 R2 ,RO ;add it to previous two R6,(Rll)+,R2 ;S4*V4(1) to R2 R2 ,RO ;add it to other three RO,(R12)+ ;and add total to A(1) R4, LOOP4 ;N=N-1, loop while N>O

I SUBROUTINE VAX-VSMA5D (N,Sl,S2,S3,S4,SS,Vl,V2,V3,V4,V5,A)

.ENTRY VAX-VSMA5D,AM<R2,R3,R4,R5,R6,R7,R8,R9,R10,R11> ADDL2 #4,AP ;skip over number of arguments MOVL @(AP)+,R6 ;value of N to R6 M O W @(AP)+,RQ ;value of S1 to R4-5 MOW @(API+,-(SP) ;value of 52 to the stack MOVD @(AP)+,-(SP) ;value of S3 to the stack

Figure 2 (continued)

Vector-Scalar Subroutines for Vax Computers 623

M O W @(AP)+,-(SP) ;value of S4 to the stack MOVD @(AP)+,-(SP) ;value of S5 to the stack MOVL (AP)+,R7 ;address of V1 to R7 MOVL ( Ap 1 + ,R8 ;address of V2 to R8 MOVL ( AP 1 + , R9 ;address of V3 to R9 MOVL ( AP) + ,R10 ;address of V4 to R10 MOVL ( Ap 1 + ,R11 ;address of V5 to R11 MOVL (AP)+,R12 ;address of A to R1

LOOP5: MULD3 R4,(R7)+,RO ;Sl*Vl(I) to RO MULD3 24(SP),(R8)+,R2 ;S2*V2(I) to R2 ADDDZ R2,RO ;add these two into RO MULD3 16(SP),(R9)+,R2 ;S3*V3(1) to R2 ADDD2 R2,RO ;add it to previous two MULD3 8(SP),(RlO)+,R2 ;S4*V4(1) TO R2 ADDD2 R2,RO ;add it to other three MULD3 (SP),(Rll)+,RZ ;SS*V5(1) TO R2 ADDD2 R2,RO ;add it to other four ADDD2 RO , (R12 1 + ;and add total to A(1) SOBGTR R6,LOOPS ;N=N-1, loop while N>O RET

SUBROUTINE VAX-VSMAGD (N,Sl,S2,S3,S4,S5,S6,Vl,VZ,V3,V4,VS,V6,A;

. ENTRY ADDL 2 MOVL MOVD MOVD M O W MOVD MOVD M O W MOVL MOVL MOVL MOVL MOVL MOVL MOVL

LOOP6: MULD3 MULD3 ADDD2 MULD3 ADDD2 MULD3 ADDD2 MULD3 ADDD2 MULD3 ADDD2 ADDD2 SOBGTR REX

. END Figure 2 (continued)

VAX_VSMA6D,"M<R2,R3,R4,R6,R7,R8,R9,RlO,Rll> #4 ,AP ;skip over number of arguments @ ( AP) + ,R4 ;value of N to R4 @(Ap )+, - ( SP) ;value of S1 to the stack @(AP)+,-(SP) ;value of S2 to the stack @(AP)+,-(SP) ;value of S3 to the stack @(AP)+,-(SP) ;value of S4 to the stack @(Ap)+,-(SP) ;value of S 5 to the stack @(AP)+,-(SP) ;value of S6 to the stack (AP)+,R6 ;address of V1 to R6 ( AP 1 + , R7 ;address of V2 to R7 (AP)+,R8 ;address of V3 to R8 ( AP 1 + , R9 ;address of V4 to R9 (AP)+ ,R10 ;address of V5 to R10 ( Ap) + ,R11 ;address of V6 to R11 ( AP 1 + , R12 ;address of A to R12 40(SP),(R6)+,RO ;Sl*Vl(I) to RO 32(SP),(R7)+,R2 ;S2*V2(I) to R2 R2 ,RO ;add these two into RO 24(SP),(R8)+,R2 ;S3*V3(1) to R2 R2 ,RO ;add it to previous two 16(SP),(R9)+,R2 ;S4*V4(1) to R2 R2 ,RO ;add it to previous three 8(SP),(RlO)+,R2 ;SS*VS(I) to R2 R2 ,RO ;add it to other four (SP),(Rll)+,RZ ;S6*V6(1) to R2 R2 ,RO ;add it to other five RO,(R12)+ ;and add total to A(I) R4, LOOP6 ;N=N-1, loop while N>O

624 Cisneros, Bunge, and Roothaan

where VAX is a logical variable whose value is .TRUE. for a VAX computer and .FALSE. otherwise.

Subroutine VSMAlD is an improved ver- sion of a similar routine of the BLAS (Basic Linear Algebra Subprograms) package: on account of having some of its arguments con- verted into local variables- The other Fortran routines, together with their assembly lan-

guage counterparts, may be considered as a useful extension of the BLAS package.

References

1. G. Cisneros and C.F. Bunge, Znt. J . Quant. Chern., S19, 193 (1985).

2. C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, ACM Tmns. Math. Soft., 6,308 (1979).