
Semantic Structures for Efficient Code Generation on a Stack Machine

John Couch, Hewlett-Packard
Terry Hamm, Tektronix

Introduction

Since the expression is the fundamental building block of any programming language, its evaluation is an integral part of program compilation and execution. In particular, the evaluation techniques and data structures are determined by the class of expressions acceptable by the language. This paper serves as a framework for expression evaluation on a stack machine by presenting a set of optimized algorithms and internal data structures developed during the implementation of several compilers for the Hewlett-Packard computer systems.

Expressions

In general, an expression consists of a sequence of operators and operands where each operator is assigned a precedence determining the evaluation order of the expression. A formal definition of the class of expressions addressed in this paper is given by the following context-free grammar:

(1)  <expression>           ::= <logical factor>
(2)                         ::= <expression> or <logical factor>
(3)  <logical factor>       ::= <logical secondary>
(4)                         ::= <logical factor> and <logical secondary>
(5)  <logical secondary>    ::= <logical primary>
(6)                         ::= not <logical primary>
(7)  <logical primary>      ::= <sum>
(8)                         ::= <sum> > <sum>
(9)                         ::= <sum> >= <sum>
(10)                        ::= <sum> <> <sum>
(11)                        ::= <sum> <= <sum>
(12)                        ::= <sum> < <sum>
(13)                        ::= <sum> = <sum>
(14) <sum>                  ::= <term>
(15)                        ::= <sum> + <term>
(16)                        ::= <sum> - <term>
(17)                        ::= - <term>
(18)                        ::= + <term>
(19) <term>                 ::= <factor>
(20)                        ::= <term> * <factor>
(21)                        ::= <term> / <factor>
(22) <factor>               ::= <primary>
(23)                        ::= <factor> ** <primary>
(24) <primary>              ::= <variable>
(25)                        ::= <constant>
(26)                        ::= <function reference>
(27)                        ::= (<expression>)
(28) <variable>             ::= <simple variable>
(29)                        ::= <subscripted variable>
(30) <subscripted variable> ::= <simple variable> (<expression>)
(31)                        ::= <simple variable> (<expression>, <expression>)
(32) <function reference>   ::= <function name>
(33)                        ::= <function name> (<parameter list>)
(34) <parameter list>       ::= <parameter>
(35)                        ::= <parameter list>, <parameter>
(36) <parameter>            ::= <expression>

<integer>, <constant>, <simple variable>, <subscripted variable>, and <function name> are tokens, i.e., those items recognized by the scanner. In addition to defining the class of syntactically valid expressions, the grammar assigns each operator a precedence according to its position in the grammar. The higher the production number, the greater the precedence of the operator. For example, the grammar produces the evaluation or parse tree in Figure 1 for the expression B*C - A + D*E. Note that * has greater precedence than + or - (B*C must be evaluated before (B*C)-A). This is no surprise since * occurs in production 20 while + and - occur in productions 15 and 16.
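As a rough illustration of how position in the grammar yields precedence, the sketch below parses only the arithmetic subset (binary +, -, *, /, ** over simple variables and parentheses). It uses recursive descent purely for brevity, whereas the paper's parser is bottom-up, and the function names and the tuple representation of the tree are assumptions made for this example.

```python
import re

def tokenize(text):
    # Identifiers and the operators **, *, /, +, -, (, ).
    return re.findall(r"[A-Za-z]\w*|\*\*|[-+*/()]", text)

def parse_sum(tokens):
    # <sum> ::= <term> | <sum> + <term> | <sum> - <term>
    node = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        node = (op, node, parse_term(tokens))
    return node

def parse_term(tokens):
    # <term> ::= <factor> | <term> * <factor> | <term> / <factor>
    node = parse_factor(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        node = (op, node, parse_factor(tokens))
    return node

def parse_factor(tokens):
    # <factor> ::= <primary> | <factor> ** <primary>
    node = parse_primary(tokens)
    while tokens and tokens[0] == "**":
        tokens.pop(0)
        node = ("**", node, parse_primary(tokens))
    return node

def parse_primary(tokens):
    # <primary> ::= <simple variable> | ( <sum> )   (subset of the grammar)
    token = tokens.pop(0)
    if token == "(":
        node = parse_sum(tokens)
        tokens.pop(0)  # discard the closing ")"
        return node
    return token

tree = parse_sum(tokenize("B*C - A + D*E"))
print(tree)  # ('+', ('-', ('*', 'B', 'C'), 'A'), ('*', 'D', 'E'))
```

Each function corresponds to one nonterminal, and a nonterminal defined by later productions is reached only through those defined earlier, so its operators bind more tightly.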

COMPUTER42

Expression evaluation. Recall that the parsing process is conceptually one of constructing a derivation tree for the given program string. For instance, the derivation tree for the expression B*C-A+D*E is given in Figure 2. Note that each node in the derivation tree is nothing more than a two-dimensional representation of a production in the grammar. No compiler retains a complete derivation tree in memory for a program; however, every compiler maintains at least one path in the tree in the form of a push-down stack, procedure calls, or state transitions.

The derivation tree, or parts of the derivation tree (parse tree), may in fact be stored internally as a data structure consisting of a pointer in a semantics stack (the root), which is directed to other data elements. Each of the data elements may contain pointers to one or more other data elements; these latter contain sons of the father data element. For example, a binary operation as a tree node would carry a pointer to the left member of the operation and a pointer to the right member. A procedure call element might carry several pointers, one to each actual parameter structure.

A tree structure must be built in a bottom-up parser whenever the emitted machine code sequence for a given input fragment depends on contextual information preceding or following it. For example, consider the following simple grammar:

(1) <statement>  ::= IF <expression> THEN <statement> ELSE <statement>
(2)              ::= <identifier> := <expression>
(3) <expression> ::= <expression> <log op> <term>
(4)              ::= <term>
(5) <log op>     ::= AND
(6)              ::= OR
(7) <term>       ::= <identifier>
(8)              ::= <constant>

Figure 1. Parse tree for B*C - A + D*E.

Figure 2. Derivation tree for B*C - A + D*E.

43May 1977

Figure 3. Data formats.

Note that <expression> can appear in the language as either a branching condition (in the IF-THEN-ELSE) or as the object of a replacement. In a bottom-up process, the <expression> is first built of its constituent constants, identifiers, and logical operators, and last as part of the replacement or conditional statement. (It is possible, but poor style, to ask the lexical analyzer to note whether it has seen ":=" or "IF" as a means of determining the form of code emitted.)

The parser should therefore construct an <expression> tree, which will be available through a pointer in the semantics stack when the IF-THEN or the replacement production is selected by the parser. At that time, the <expression> tree may be evaluated by a recursive top-down procedure with one of these goals: (1) given the IF-THEN-ELSE, we evaluate the tree as conditional branches based on the AND and OR operators found therein, or (2) given the replacement, we evaluate the tree as an expression calling for bit-by-bit ANDing and ORing.

The evaluation process is considerably aided by certain tree node information that can be propagated up the tree during the tree-building phase. Thus it is useful to know, looking at a certain node, whether the connecting tree contains a logical, relational, or arithmetic operator at the next lower level.

Mixed-mode arithmetic also requires a structure, since typing information must be propagated up the tree if a stack machine is the target machine.

To demonstrate the requirement for an internal structure and present a viable solution, we will adopt the following data formats (see Figure 3). Assuming each variable in the expression B*C-A+D*E is type integer, the code generated directly from the parser and its corresponding run-time stack environment are shown in Figure 4.

However, a runtime predicament will occur if the variable A is allowed to be real (see Figure 5). In order to perform the subtraction, the value of B*C must be converted to the type of A. Unfortunately, the integer value B*C is not on top of the stack. To avoid situations like this, the second operand must be known at the time the first operand is evaluated (stacked) in order to generate the correct conversion code. One method to ensure this is to have the parser create a parse tree with type information. Thus the scanning and parsing of the expression B*C-A+D*E, where A is real, B and C are integer, and D and E are long, produces the internal structure depicted by Figure 6.

The remaining sections of this paper define the parse tree construction process and the optimized evaluation algorithms for arithmetic and boolean expressions.
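The runtime predicament can be reproduced with a small simulation. The sketch below emits code in postfix order, as a parser without a tree would, and then tracks only the operand types on a simulated stack. The function names, the tuple tree, and the type ordering integer < real < long are assumptions made for illustration; they are not taken from the paper.

```python
RANK = ["integer", "real", "long"]          # assumed ordering, narrowest first
OPCODE = {"+": "ADD", "-": "SUB", "*": "MPY"}

def emit_postfix(tree, code):
    """Emit stack code in postfix order from an (op, left, right) tuple tree."""
    if isinstance(tree, str):               # a simple variable
        code.append(("LOAD", tree))
    else:
        op, left, right = tree
        emit_postfix(left, code)
        emit_postfix(right, code)
        code.append((OPCODE[op], None))
    return code

def buried_conversions(code, types):
    """Report each operator whose left operand, the one lying beneath the
    top of stack, is narrower than the right operand and so needs converting."""
    stack, problems = [], []
    for position, (opcode, name) in enumerate(code):
        if opcode == "LOAD":
            stack.append(types[name])
            continue
        right, left = stack.pop(), stack.pop()
        if RANK.index(left) < RANK.index(right):
            problems.append((position, opcode, left, right))
        stack.append(max(left, right, key=RANK.index))
    return problems

tree = ("+", ("-", ("*", "B", "C"), "A"), ("*", "D", "E"))
code = emit_postfix(tree, [])
all_integer = dict.fromkeys("ABCDE", "integer")

print([op if name is None else f"{op} {name}" for op, name in code])
print(buried_conversions(code, all_integer))                  # []
print(buried_conversions(code, {**all_integer, "A": "real"}))
# [(4, 'SUB', 'integer', 'real')]
```

With every variable integer the sequence needs no conversion. Once A is real, the SUB finds the integer product beneath the real value of A, which is the situation the typed parse tree is meant to prevent.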

Data structure construction

The construction of the internal structures can be tied to the productions by maintaining a semantic stack in the bottom-up parse. Each semantic stack element then becomes a link to some partial parse tree.

In general, the semantic application of a production rule A -> a_n a_(n-1) ... a_1 a_0 involves the creation of a new parse tree that connects the subtrees associated with the symbols a_n ... a_0. For example, the production <term> ::= <term> * <factor> calls for the linking of the <term> subtree to the <factor> subtree. The connection is made by creating a new tree node of the form

    * | LTYPE | RTYPE
    P1
    P2

Figure 4. Runtime environment for B*C - A + D*E. Code sequence: LOAD B, LOAD C, MPY, LOAD A, SUB, LOAD D, LOAD E, MPY, ADD.


where P1 is a link to the <term> subtree and P2 a link to the <factor> subtree. LTYPE and RTYPE are the types of the resulting values of the left subtree and right subtree. This node is then associated with the left part of the production

<term> ::= <term> * <factor>

and replaces the <term> and <factor> subtrees on the semantic stack.

Many productions require no semantic operations; in particular, single productions such as

<term> ::= <factor>

leave the semantic stack unchanged.

Figure 5. Runtime predicament.

Each production requiring some semantic action carries a semantic operation #N, as follows:

(1)  <expression>           ::= <logical factor>
(2)                         ::= <expression> or <logical factor>              #1
(3)  <logical factor>       ::= <logical secondary>
(4)                         ::= <logical factor> and <logical secondary>      #2
(5)  <logical secondary>    ::= <logical primary>
(6)                         ::= not <logical primary>                         #3
(7)  <logical primary>      ::= <sum>
(8)                         ::= <sum> > <sum>                                 #4
(9)                         ::= <sum> >= <sum>                                #5
(10)                        ::= <sum> <> <sum>                                #6
(11)                        ::= <sum> <= <sum>                                #7
(12)                        ::= <sum> < <sum>                                 #8
(13)                        ::= <sum> = <sum>                                 #9
(14) <sum>                  ::= <term>
(15)                        ::= <sum> + <term>                                #10
(16)                        ::= <sum> - <term>                                #11
(17)                        ::= - <term>                                      #12
(18)                        ::= + <term>
(19) <term>                 ::= <factor>
(20)                        ::= <term> * <factor>                             #13
(21)                        ::= <term> / <factor>                             #14
(22) <factor>               ::= <primary>
(23)                        ::= <factor> ** <primary>                         #15
(24) <primary>              ::= <variable>
(25)                        ::= <constant>                                    #16
(26)                        ::= <function reference>
(27)                        ::= (<expression>)
(28) <variable>             ::= <simple variable>                             #17
(29)                        ::= <subscripted variable>
(30) <subscripted variable> ::= <simple variable> (<expression>)              #18
(31)                        ::= <simple variable> (<expression>, <expression>) #18
(32) <function reference>   ::= <function name>                               #19
(33)                        ::= <function name> (<parameter list>)            #19
(34) <parameter list>       ::= <parameter>                                   #20
(35)                        ::= <parameter list>, <parameter>                 #20
(36) <parameter>            ::= <expression>

The semantic operations are defined as follows:

#1 OR: Create a structure of the form

    OR | LTYPE | RTYPE
    P1
    P2

by linking P1 to the <expression> subtree and P2 to the <logical factor> subtree.

#2 AND: Create a structure similar to that for OR; however, AND is the operator. P1 links to the <logical factor> subtree and P2 links to the <logical secondary> subtree.

#3 NOT: Create a structure of the form

    NOT | TYPE
    P1

by assigning the operator NOT and the TYPE of <logical primary> to the first word and linking P1 to the <logical primary> subtree.

#4, #5, #6, #7, #8, #9: Create a structure of the form

    operator | LTYPE | RTYPE
    P1
    P2

where operator may be >, >=, <>, <, <=, or =. LTYPE and RTYPE are the types of the left subtree <sum> and the right subtree <sum>. Again P1 and P2 are links to the subtrees for <sum> and <sum>.

#10, #11: Create a structure similar to the relational operators structure, except place the operator + or - in the first word. In this case P1 and P2 are links to the subtrees for <sum> and <term>.

#12 UNARY MINUS: A structure similar to the node constructed for NOT is built. P1 links to the subtree for <term>.

#13, #14, #15: Nodes are created for *, /, or ** in a fashion similar to the relational operator structures.

#16 CONSTANT: Create a structure of the form

    TYPE | INDEX

where TYPE is the data type of the constant linked to by INDEX.

#17 SIMPLE VARIABLE: A structure of the form

    INDEX

is built. INDEX links to the symbol table entry for the simple variable.

#18 SINGLE SUBSCRIPTED VARIABLE: Create a structure of the form

    ARRAYREF | TYPE | #S
    SYMBOL TABLE INDEX
    P1
    P2

where #S is the number of subscripts, and P1 and P2 link to the subtree or subtrees for <expression>.

#19 FUNCTION NAME: Create an entry of the form

    FUNCTREF | TYPE | T | #PARMS
    SYMBOL TABLE INDEX
    P1
    P2
    ...
    PN

where T determines whether the function is built in or user defined, and P1, P2 ... PN link to each <expression> subtree. The number of subtrees is determined by the number of parameters (#PARMS).

#20 PARAMETER: The semantic action for parameter is to build an entry in the parse tree of the form

    P | INDEX

P determines its type (value or reference) and INDEX is its entry in the symbol table.

The derivation tree in Figure 2 can now be built in a bottom-up manner by the parser, and the semantic actions are applied when building the parse tree. The set of semantic operations 17, 17, 13, 17, 11, 17, 17, 13, 10 will cause the construction of the internal structure shown in Figure 7. Redrawing as a tree, we see that we have constructed the decorated parse tree from Figure 6.

Assuming the existence of a set of structures of this form, the remaining sections of the paper present optimized evaluation algorithms for arithmetic and boolean expressions.

Figure 6. Internal tree representation for B*C - A + D*E.
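The effect of replaying that operation sequence on a semantic stack can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Leaf` and `Node` classes, the `build` function, and the choice to feed variable names in by hand (a real parser would receive them from the scanner) are all assumptions, as is the type ordering integer < real < long.

```python
from dataclasses import dataclass

RANK = ["integer", "real", "long"]      # assumed ordering, narrowest first
BINARY = {10: "+", 11: "-", 13: "*"}    # semantic operation -> operator

@dataclass
class Leaf:                # built by operation #17 (simple variable)
    index: str             # stands in for the symbol table index
    vtype: str

@dataclass
class Node:                # built by operations #10, #11, #13
    op: str
    ltype: str
    rtype: str
    p1: object             # link to the left subtree
    p2: object             # link to the right subtree

def result_type(tree):
    # Type of the value a subtree leaves on the run-time stack.
    if isinstance(tree, Leaf):
        return tree.vtype
    return max(tree.ltype, tree.rtype, key=RANK.index)

def build(operations, names, symbol_types):
    stack = []                           # the semantic stack
    names = iter(names)
    for number in operations:
        if number == 17:
            name = next(names)
            stack.append(Leaf(name, symbol_types[name]))
        else:
            p2 = stack.pop()             # right subtree is on top
            p1 = stack.pop()
            stack.append(Node(BINARY[number],
                              result_type(p1), result_type(p2), p1, p2))
    return stack.pop()

types = {"A": "real", "B": "integer", "C": "integer", "D": "long", "E": "long"}
root = build([17, 17, 13, 17, 11, 17, 17, 13, 10], "BCADE", types)
print(root.op, root.ltype, root.rtype)                 # + real long
print(root.p1.op, root.p1.ltype, root.p1.rtype)        # - integer real
```

Each binary operation pops two links and pushes one, so the stack holds a single link to the whole tree when the expression has been reduced.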

Arithmetic expression evaluation

Given an internal decorated parse tree, code for expressions can now be generated by implementing a recursive procedure EVAL. Starting at the root of the tree, EVAL generates code for the subtree pointed to by the root node's left branch. The subsequent action of EVAL is determined by the operators and operands of that node. If both operands are subtrees (expressions), the left subtree is always evaluated first. In the case where the left operand is a simple variable and the right operand an expression, EVAL is recursively called to evaluate that subtree. In general, the algorithm for EVAL is as follows:

(1) Determine the result type (RESULTYPE) by taking the maximum type between the left and right node types.
(2) Evaluate the left operand (node).
(3) Convert the value to RESULTYPE.
(4) If binary operator then evaluate right operand.
(5) Convert the value to RESULTYPE.
(6) Emit code for the operation according to type.

Figure 7. Internal structure.
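The six steps can be sketched for binary operators as below. The mnemonics LOAD, LDD, LDL, MPY, FLT, FSUB, DZRO, EMPY, and EADD are taken from the paper's own listings; everything else is an assumption for illustration: the tuple tree, the function names, the F and E prefixes for operators the listings do not show, and treating integer-to-long conversion as FLT followed by DZRO. Unlike the paper's nodes, which store LTYPE and RTYPE, this sketch recomputes types on demand.

```python
RANK = ["integer", "real", "long"]                  # assumed ordering
LOAD = {"integer": "LOAD", "real": "LDD", "long": "LDL"}
PREFIX = {"integer": "", "real": "F", "long": "E"}  # e.g. SUB, FSUB, ESUB
OPNAME = {"+": "ADD", "-": "SUB", "*": "MPY"}
WIDEN = {"integer": "FLT", "real": "DZRO"}          # one-rank conversions

def type_of(tree):
    if len(tree) == 2:                  # leaf: (name, type)
        return tree[1]
    _, left, right = tree               # node: (op, left, right)
    return max(type_of(left), type_of(right), key=RANK.index)

def convert(src, dst, code):
    # Widen the value on top of the stack one rank at a time.
    for rank in RANK[RANK.index(src):RANK.index(dst)]:
        code.append(WIDEN[rank])

def eval_tree(tree, code):
    if len(tree) == 2:
        name, vtype = tree
        code.append(f"{LOAD[vtype]} {name}")
        return code
    op, left, right = tree
    result = type_of(tree)                      # step 1
    eval_tree(left, code)                       # step 2
    convert(type_of(left), result, code)        # step 3
    eval_tree(right, code)                      # step 4
    convert(type_of(right), result, code)       # step 5
    code.append(PREFIX[result] + OPNAME[op])    # step 6
    return code

expr = ("+",
        ("-", ("*", ("B", "integer"), ("C", "integer")), ("A", "real")),
        ("*", ("D", "long"), ("E", "long")))
print(eval_tree(expr, []))
# ['LOAD B', 'LOAD C', 'MPY', 'FLT', 'LDD A', 'FSUB', 'DZRO',
#  'LDL D', 'LDL E', 'EMPY', 'EADD']
```

Because the left value is converted while it is still on top of the stack, before the right operand is loaded, no conversion is ever needed beneath the top of stack.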

The code generated for the mixed mode expression B*C-A+D*E, where B and C are integers, A is real, and D and E are long, becomes

    LOAD B          load integer B
    LOAD C          load integer C
    MPY, FLT        multiply and convert value to real
    LDD A           load real A on stack
    FSUB, DZRO      real subtract and convert to long
    LDL D           load long D
    LDL E           load long E
    EMPY, EADD      multiply long and add long

In addition to generating mixed mode expression code, the internal tree structure can also be used to generate optimized code for the arithmetic operators and relational operators when the operands are all type integer by employing memory immediate and memory to memory instructions. Recall that the code sequence for the all integer expression B*C-A+D*E was

    LOAD B          load integer B
    LOAD C          load integer C
    MPY             multiply
    LOAD A          load integer A
    SUB             subtract
    LOAD D          load integer D
    LOAD E          load integer E
    MPY, ADD        multiply, then add

However, by employing memory instructions (i.e., memory to top of stack), the following code sequence can be generated by checking types of operators and operands (subtree or simple variable):

    LOAD B          load integer B
    MPYM C          multiply TOS by C
    SUBM A          subtract integer A
    LOAD D          load integer D
    MPYM E          multiply TOS by E
    ADD             add

Other simple optimizations include constant and simple subscript optimization. If both lower subtrees represent constants, they may be subsumed by the compiler. Also, a pointer is kept to the last subscript subtree; then during the evaluation, a compile-time tree comparison can be performed. If the two subtrees are equivalent, the value of the subscript will be in the index register.

The remaining section employs the internal parse tree to generate optimized branching code for boolean expressions.

Boolean expression evaluation

By employing the same internal parse tree, it is possible to generate optimal boolean expression code for the operators AND, OR, and NOT. Rather than generate code for full evaluation of all subexpressions, jumps out of the expression are generated as soon as the truth or falsity of the expression is determined. Less desirable code would be to evaluate each subexpression, leaving a logical result as a temporary result. These logicals would be combined with AND, OR, and NOT instructions leaving the final result for the true or false jump. For example, two possible sets of code generated for the expression A < B AND (C=1 OR C=7) follow. (The internal parse tree allows us to generate the second optimal sequence.)

Full expression evaluation

          LOAD A          load value of A
          CMPM B          compare with B
          BLT L1          jump on less to L1
          LOAD FALSE      load FALSE on TOS
          BR L2           jump to L2
    L1    LOAD TRUE       load TRUE on TOS
    L2    LOAD C
          CMPI 1          compare with integer 1
          BEQ L3          jump on equal to L3
          LOAD FALSE
          BR L4
    L3    LOAD TRUE
    L4    LOAD C
          CMPI 7
          BEQ L5
          LOAD FALSE
          BR L6
    L5    LOAD TRUE
    L6    OR              OR top two stack values
          AND             AND top two values
          BF (FALSE)      jump to false location
    TRUE  ...

Optimized evaluation

          LOAD A          load value of A
          CMPM B          compare A with B
          BGE FALSE       jump on greater or equal to FALSE
          LOAD C
          CMPI 1          compare C with integer 1
          BEQ TRUE        jump on equal to TRUE
          LOAD C
          CMPI 7
          BNE FALSE       jump not equal to FALSE
    TRUE  ...
    FALSE ...

Boolean expression evaluation can be implemented by a recursive tree evaluation similar to arithmetic expression evaluation, using the earlier defined EVAL for non-boolean operators. The algorithm evaluates the left branch pointed to by the node and then generates a jump instruction. Next the right branch is evaluated. Three parameters are used on each recursive call in boolean evaluation. The parameters are a logic value, used to determine whether a FALSE or TRUE jump is to be generated, the false label destination, and the true label destination. The initial logic value is FALSE. Depending on the operator, this value is reset for each left branch subexpression evaluation. AND sets it FALSE and OR sets it TRUE. Subsequent NOT operators will complement the logic value. The true and false labels are necessary for determining the destinations of jumps generated. An AND node generates a jump to the false label, and an OR node generates a jump to the true label. By examining code generated for (A OR B) AND C OR D, note that intermediate labels must be generated. A, B, C and D may be any mixed mode expressions.

          EVAL A          evaluate A
          BT TL1          jump on TRUE to label TL1
          EVAL B          evaluate B
          BF FL1          jump on FALSE to label FL1
    TL1   EVAL C
          BT TLABEL       jump on TRUE to TLABEL
    FL1   EVAL D
          BF FLABEL       jump on FALSE to FLABEL

After the expression A is evaluated, it is necessary to jump on true to the code for the evaluation of expression C. Thus, the intermediate label TL1 must be defined. The algorithm defines a new label associated with the right branch of each AND and OR node. The label may or may not be used as the destination label of a jump instruction. The AND node associates a new true label, and the OR node associates a new false label with the right branch. The new labels are only used when evaluating the left branch subexpression. The NOT operator redefines the labels in such a way that the true label becomes the false label and the false label becomes the true label. The general algorithm is BOOLEVAL (flag, flabel, tlabel) and is as follows:

AND
(1) Set logic value false (local flag = false).
(2) Associate new true label with right branch.
(3) Evaluate left branch; BOOLEVAL (lflg, flbl, newtlbl).
(4) Emit jump for left subexpression using lflg. The destination is false label (flbl).
(5) Fix up any references to newtlbl.
(6) Evaluate right branch; BOOLEVAL (flag, flbl, tlbl).

OR
(1) Set logic value true.
(2) Associate new false label with right branch.
(3) Evaluate left branch; BOOLEVAL (lflg, newflbl, tlbl).
(4) Emit jump for left subexpression using lflg. The destination is true label (tlbl).
(5) Fix up any references to newflbl.
(6) Evaluate right branch; BOOLEVAL (flag, flbl, tlbl).

NOT
(1) Complement logic value (flag = NOT flag).
(2) Reverse true and false labels.
(3) Evaluate the branch of the node; BOOLEVAL (flag, tlabel, flabel).
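The three cases can be sketched as one recursive procedure. The code below is an illustrative reading of the steps above rather than the authors' implementation: the `Flag` wrapper (standing in for the reference parameter), the `BoolGen` class, the tuple tree with string leaves, the TLn and FLn numbering scheme, and the final jump emitted by `generate` on behalf of the caller are all assumptions. Labels that no jump refers to are dropped when the listing is assembled.

```python
import itertools

class Flag:
    """Mutable logic value standing in for the paper's reference parameter."""
    def __init__(self, value):
        self.value = value

class BoolGen:
    def __init__(self):
        self.code = []                      # ("label", name) or ("instr", text)
        self.true_ids = itertools.count(1)
        self.false_ids = itertools.count(1)

    def jump(self, flag, label):
        self.code.append(("instr", ("BT " if flag.value else "BF ") + label))

    def booleval(self, tree, flag, flbl, tlbl):
        op = tree[0] if isinstance(tree, tuple) else None
        if op == "AND":
            lflg = Flag(False)                               # (1)
            newtlbl = f"TL{next(self.true_ids)}"             # (2)
            self.booleval(tree[1], lflg, flbl, newtlbl)      # (3)
            self.jump(lflg, flbl)                            # (4)
            self.code.append(("label", newtlbl))             # (5)
            self.booleval(tree[2], flag, flbl, tlbl)         # (6)
        elif op == "OR":
            lflg = Flag(True)
            newflbl = f"FL{next(self.false_ids)}"
            self.booleval(tree[1], lflg, newflbl, tlbl)
            self.jump(lflg, tlbl)
            self.code.append(("label", newflbl))
            self.booleval(tree[2], flag, flbl, tlbl)
        elif op == "NOT":
            flag.value = not flag.value                      # complement
            self.booleval(tree[1], flag, tlbl, flbl)         # labels reversed
        else:
            self.code.append(("instr", f"EVAL {tree}"))      # non-boolean leaf

def generate(tree, flabel="FLABEL", tlabel="TLABEL"):
    gen, flag = BoolGen(), Flag(False)      # initial logic value is FALSE
    gen.booleval(tree, flag, flabel, tlabel)
    gen.jump(flag, flabel)                  # jump for the whole expression
    used = {text.split()[1] for kind, text in gen.code
            if kind == "instr" and text.startswith(("BT ", "BF "))}
    lines, pending = [], ""
    for kind, text in gen.code:
        if kind == "label":
            pending = text if text in used else ""
        else:
            lines.append((pending, text))
            pending = ""
    return lines

for label, instr in generate(("OR", ("AND", ("OR", "A", "B"), "C"), "D")):
    print(f"{label:<6}{instr}")
```

For (A OR B) AND C OR D this prints the same eight instructions, with labels TL1 and FL1 in the same places, as the listing given earlier for that expression.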

There are two key points to the algorithm. First, the logic value is a reference parameter. It is only used when determining whether to generate a true or false jump at the node where it is defined. Lower level NOT operators will complement the flag. Secondly, the jump instructions are only generated for the left branch of a node. The right branch of any node is really a left branch of some node above it. By exiting out of any recursive calls, the right branch will become a left branch and a jump will be generated.

The expression (NOT (A AND B) AND C OR D) AND E shows the parameters for each node evaluation and the labels associated with the right branches of each node. The code generated is also given. The jump destinations and type of jump are determinable by the parameters.

    NODE                  PARAMETERS      LABEL OF RIGHT BRANCH
    AND (... AND E)       F, FL, TL       TL1
    OR (... OR D)         F, FL, TL1      FL1
    AND (... AND C)       T, FL1, TL1     TL2
    NOT                   F, FL1, TL2
    AND (A AND B)         T, TL2, FL1

    LABELS  CODES
            EVAL A
            BF TL2
            EVAL B
            BT FL1
    TL2     EVAL C
            BT TL1
    FL1     EVAL D
            BF FL
    TL1     EVAL E
            BF FL

Translator writers should be able to extrapolate from these ideas and develop additional techniques to solve their machine-dependent problems. And, hopefully, stack machine instruction set designers will get the flavor of the class of instructions required for quick and clean type conversion.


John D. Couch is currently an engineering section manager responsible for language, utility, and application development at Hewlett-Packard's General Systems Division. Since joining HP in 1972, he has been a member of the HP3000 Fortran and Basic compiler projects, a project leader for a systems programming language, and a project manager responsible for language development and machine architectural design.

Couch received his AB in computer science and MS in electrical engineering-computer science from the University of California, Berkeley, in 1969 and 1971, respectively. He has lectured at UCB and, since 1974, has taught the graduate classes in compiler construction and software design at California State University, San Jose. He served as technical program vice chairman for COMPCON 76 Spring and is the co-author of Syntax Directed Compiler Construction, a college textbook to be published by SRA in 1978.

Terry Hamm is a firmware engineer at Tektronix doing language design and development. Prior to joining Tektronix he spent five years at Hewlett-Packard doing language translator development. His interests include language design, translator design and development, and software engineering.

He received the BA in mathematics and the MS in computer science from Washington State University in 1968 and 1969, respectively. He is a member of the ACM and IEEE Computer Society.
