
Efficient Stack Allocation for Tail-Recursive Languages

Chris Hanson
MIT Artificial Intelligence Laboratory

1 Introduction

The Scheme dialect of Lisp [9] is properly tail-recursive: it relies entirely on procedure calls to express iteration. In Scheme, a tail-recursive procedure call (that is, a call in which the calling procedure does not do any further processing of the returned value) is essentially a goto that passes arguments, as was first pointed out by Steele [13]. In a properly tail-recursive language, there is no need for any explicit iteration constructs such as do or while; these can all be defined in terms of ordinary procedure calls.
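For example (a minimal sketch of our own, not taken from the paper), a loop that would be written with while in C can be written in Scheme as a procedure that calls itself in tail position; since the call is tail-recursive, it behaves as the goto just described, and the loop runs in constant stack space:

    ;; Sum the integers 0, 1, ..., n-1 with no iteration construct.
    (define (sum-below n)
      (letrec ((loop (lambda (i total)
                       (if (< i n)
                           (loop (+ i 1) (+ total i))  ; tail call: a goto
                           total))))                   ; loop exit
        (loop 0 0)))

    (sum-below 10)  ; => 45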

As elegant as tail-recursion may be from the perspective of the programmer or the theoretician, it poses challenges for the compiler designer. One of the crucial decisions in the design of a compiler is the formation of a strategy for memory allocation and deallocation. An important aspect of this strategy is the treatment of memory locations used to hold the bindings of local variables. Because local variables play a significant role in most computer languages, their treatment can have a noticeable impact on a program's execution speed and run-time space requirements. Compilers for many block-structured languages use a simple strategy when allocating local variables: stack allocation. This strategy is supported by hardware on many computers, and by software mechanisms such as the access link and the display. However, the standard methods for implementing stack allocation assume that the language is not tail-recursive, and a straightforward application of these methods to a tail-recursive language can result in non-tail-recursive compiled code.¹

This paper describes stack-allocation techniques for compiling tail-recursive languages. We do not claim that these are the only techniques that can be used to solve the problem, nor do we compare them to other techniques. Instead, we use our techniques as a concrete example to demonstrate that it is possible to implement stack allocation of local variables without sacrificing tail recursion, which to our knowledge has not previously been shown.

We have implemented these techniques in the MIT Scheme compiler [11]. Although efficiency is a secondary issue for this paper, it is nonetheless important: we show that the performance of these techniques is comparable to that of implementations of non-tail-recursive languages. In particular, the code sequences generated for tail-recursive procedure calls are as efficient as those that implement the special-purpose iteration constructs of non-tail-recursive languages.

¹In contrast, the implementation of a tail-recursive interpreter is relatively simple, as described in [1]. The interpreter does not attempt to optimize storage for local variables, which is the efficiency issue of concern in this paper.

Sections 2 and 3 present a standard implementation of stack allocation and demonstrate why it fails for a tail-recursive language. Section 4 shows how the implementation can be modified to remedy this failure. The modified implementation is somewhat impractical, so section 5 presents a refinement that results in a simple and practical implementation; this is the paper's main contribution. Section 6 shows additional static analysis techniques that can be used to make the implementation more efficient, and section 7 compares this final implementation to a non-tail-recursive one, in which iteration is accomplished using explicit iteration constructs rather than ordinary procedure calls.

2 General Framework

We’ll follow an empirical approach in this paper, starting with techniques from a well-established compiler text and making successive modifications until we achieve the desired results. Our source language will be Scheme, because it is tail-recursive, widely known, and fun to program in.

Standard texts on compilation, such as [2], inform us that stack allocation of run-time storage is a very common technique, applicable to such lexically-scoped languages as Algol, Pascal, and C; this is a good sign, as Scheme is closely related to Algol. The stack is allocated in segments called activation records, each containing the information associated with a single procedure invocation. Each activation record consists of a collection of fields. The particular stack discipline described in chapter 7 of [2] uses seven fields, of which we need only four:

Actual Parameters The argument values of the called procedure.

Saved Machine Status The return address indicating where to continue execution of the calling procedure.

Access Link A pointer to the activation record of the procedure that is the lexical parent of the called procedure. This link is used to implement lexical variable reference.

Temporaries Storage for temporary values that arise during expression evaluation.


The remaining three fields won’t be used:

Local Data Because Scheme has no local declarations, it needs no local data. The role played by these declarations in other languages is subsumed by Scheme's lambda bindings.

Control Link A pointer to the activation record of the procedure that invoked the called procedure. This field is normally used for three purposes:

• Analogous to the access link, the control link can be used to implement "deep access" in a dynamically-bound language.

• Non-local control transfers can be implemented by searching the control link chain to find a particular activation record.

• Procedures that accept varying numbers of arguments can use the control link to find out how many arguments they were passed, or to return to their caller without actually counting the arguments.

None of these purposes is directly relevant to our discussion. The MIT Scheme implementation that we will discuss uses alternative techniques to solve all of these problems, so we won't use this field.

Returned Value We'll assume that all procedures return at most one value, and that a register is used to hold a procedure's returned value, so this field won't be needed.

Although Scheme is closely related to Algol, it differs in that Scheme's procedures are first-class objects. The major consequence of this difference is that the activation records for some Scheme procedures must be allocated on the heap, and deallocated by garbage collection. Fortunately, use of stack allocation and heap allocation may be combined in the same implementation, and there are methods for determining when stack allocation is applicable [11]. The other relevant difference between Scheme and Algol is that Scheme has first-class continuations, but even so, standard techniques permit the use of stack allocation [4]. This paper ignores these issues and focuses on the relationship between stack allocation and tail recursion.

Now let's choose an example program and simulate its run-time behavior for our proposed compiler. The example we choose is deliberately simple; it shows the basic relationship between stack allocation and tail recursion. The procedure accumulate, shown in Figure 1, is a simple iterative procedure that is sometimes used to implement the variable-arity equivalent of a binary procedure (e.g. +), where the set of arguments is represented as a list.

To examine the run-time behavior of our example, we'll use a visual model showing the arrangement of activation records on the stack at particular moments during execution. By choosing appropriate moments, we'll see only those activation records that we're interested in, not uninteresting ones such as those generated for calls to primitives. Also, we'll "crop" the snapshots to eliminate any activation records associated with the caller of the example procedure.

Each activation record is represented by a box, and the boxes are "stacked" one on top of another; the top of the stack is the topmost box. Within each box, the fields of the activation record are represented as follows:

(define accumulate
  (lambda (binary-op initial items)
    (if (null? items)
        initial
        (letrec ((loop
                  (lambda (value items)
                    (if (null? items)
                        value
                        (loop (binary-op value (car items))
                              (cdr items))))))
          (loop (car items) (cdr items))))))

Figure 1: The accumulate procedure.

• Each actual parameter is represented by a line containing name = value.

• An access link, if any, is represented by a line containing access link, with an arrow showing where the access link points. The arrow is suppressed in the outermost activation record, because we're not concerned with how "top-level" variables are accessed.

• The return address, if any, is represented by a line containing return address.

• The activation record's temporary storage is not shown.

3 A Traditional Implementation

Now we choose the compiler's procedure call and return sequences. Because the stack-allocation techniques we've adopted are specifically designed for non-tail-recursive languages, we initially ignore tail recursion, and later we'll modify the implementation to introduce it. So we begin with the following procedure call sequence:

• Push the return address.

• Push the access link.

• Push the called procedure's arguments.

• Jump to the called procedure's start address.

and the following return sequence:

• Pop the topmost activation record off the stack.

• Jump to that record's return address.

Figure 2 shows the activation records generated by this call sequence when accumulate is called with some typical arguments: (a) shows the initial activation record for the procedure accumulate; (b) shows the additional record generated by the letrec expression that binds loop; and (c) through (e) show the records generated by successive invocations of the procedure loop. The figure doesn't show the effect of the return sequence, which simply pops the records off the stack one at a time.



As we expect for a non-tail-recursive language, Figure 2 shows the typical pattern associated with a recursive process: the number of activation records increases in proportion to the number of times we invoke loop. If our compiler were properly tail-recursive we would instead see the pattern of an iterative process: the number of activation records would be constant as we went around the loop (see p. 32 of [1] for a discussion of recursive and iterative processes).

4 A Tail-Recursive Implementation

Now let's see what must be done to make our compiler produce tail-recursive code. The basic problem is that we are storing too much information on the stack and must discard some. We'll begin this section by refining our concepts of "activation record" and "procedure call". Then we'll discuss what information we should discard, when it can be discarded, and why discarding it is sufficient to make our implementation tail-recursive. We'll finish the section by detailing the tail-recursive call and return sequences.

The traditional model's notion of an activation record is somewhat clumsy for the discussion that follows. As we shall soon see, an activation record's return address must sometimes be separated from the rest of the record. To facilitate this, we change our model by splitting each activation record into two parts: a control record containing a return address, and an environment record containing bindings and an access link; either kind of record may also contain temporaries. We'll use the term activation record to refer to either kind; what we formerly called an activation record will now be thought of as a combination of a control record and an environment record. Note that this doesn't change the contents of the stack, but only how we interpret them.
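For the code sketches that follow (ours, not the paper's), it is convenient to have a toy model of these two kinds of record, with the stack represented as an ordinary Scheme list whose first element is the topmost record:

    ;; A control record holds only a return address; it is pushed for
    ;; subproblems.  An environment record holds an access link and
    ;; its bindings.  (Temporaries are omitted from the model.)
    (define-record-type control-record
      (make-control-record return-address)
      control-record?
      (return-address control-record-return-address))

    (define-record-type environment-record
      (make-environment-record access-link bindings)
      environment-record?
      (access-link environment-record-access-link)
      (bindings environment-record-bindings))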

Another useful refinement to our model is to classify each procedure call as either a subproblem or a reduction. When the calling procedure will perform more computation after the called procedure has finished executing, the procedure call is a subproblem; otherwise it is a reduction. This classification is easy to implement in a compiler as it is a surface property of the program text. For example, procedure calls appearing as the last expression of a lambda or letrec body or as one of the arms of an if expression are reductions. Procedure calls appearing as arguments to other procedure calls or as the predicate of an if expression are subproblems. In accumulate, the calls to the procedures null?, car, cdr, and binary-op are subproblems and the calls to loop are reductions. (The letrec expression also expands into a reduction, regardless of how the recursion is implemented.)
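To show how shallow this classification is, here is a sketch of our own (not the MIT compiler's analyzer) that collects the reductions of an expression appearing in tail position; for brevity it handles only if expressions and calls, but lambda and letrec bodies would recur the same way:

    ;; Collect the calls in tail position (the reductions).  Calls in
    ;; argument or predicate position are subproblems, not collected.
    (define (reduction-calls exp)
      (cond ((not (pair? exp)) '())       ; a variable or constant
            ((eq? (car exp) 'if)          ; both arms are in tail position
             (append (reduction-calls (caddr exp))
                     (reduction-calls (cadddr exp))))
            (else (list exp))))           ; a call in tail position

    ;; The body of loop in accumulate yields its single reduction:
    (reduction-calls '(if (null? items)
                          value
                          (loop (binary-op value (car items)) (cdr items))))
    ;; => ((loop (binary-op value (car items)) (cdr items)))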

With these refinements, we can say that achieving tail recursion requires (1) avoiding pushing unnecessary control records, and (2) popping environment records when they are no longer needed.

Unnecessary control records are those that correspond to reductions. It is easy to see that we need never push a control record for a reduction, because the action of its return address is known at compile time: it pops the caller's environment and control records, then jumps to the return address of the latter. Thus if we omit the control record from every reduction, the return sequence should be modified to pop environment records off the stack until it reaches a control record, then to pop that control record and jump to its return address. On the other hand, we must push control records for subproblems, because the compiler doesn't usually know what the caller will do after the return.² Referring to Figure 2(e), this control-record optimization eliminates all but one of the return addresses shown, the exception being that of the bottom record.

Now let's try to eliminate some of the environment records. As a consequence of the control-record optimization just described, the return sequence now discards any environment records that are above the topmost control record on the stack. But some environment records can and should be discarded sooner. In Figure 2(e), the second and third environment records from the stack's top can play no useful role in the computation, as neither their variable bindings nor their access links can be reached. Each of these two records becomes useless at the procedure call to loop that occurs within loop itself, indicating that we should attempt to detect and eliminate useless environment records during the call sequence. As our example indicates, we can sometimes find useless records at a reduction; however, we'll never find any useless environment records at a subproblem, because all of the records are needed to continue the calling procedure after the called procedure returns.

So which environment records can we pop at a reduction? At most, we can pop all environment records between the stack's top and the nearest control record; environment records below that control record may not be popped because they are needed by the caller who pushed the control record. However, we can't always pop all of the environment records above the control record, because some of these may be in the callee procedure's access chain and thus will be needed by the callee. So what we need is a simple algorithm to decide which of these environment records, if any, is in the access chain of the callee.

Our popping algorithm is a loop that examines the topmost record on the stack: if that record is a control record or the environment record that the callee's access link points to (i.e. the callee's lexical parent), then we're done; otherwise we pop the record and continue the loop.
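In the toy record model introduced above, this loop looks like the following sketch (ours, not the compiler's emitted code):

    ;; Pop records off the (list-modeled) stack until we reach a
    ;; control record or the callee's lexical parent; everything
    ;; above those is unreachable at a reduction.
    (define (pop-for-reduction stack lexical-parent)
      (if (or (control-record? (car stack))
              (eq? (car stack) lexical-parent))
          stack
          (pop-for-reduction (cdr stack) lexical-parent)))

For example, if two useless environment records sit above the callee's lexical parent, both are discarded:

    (define parent (make-environment-record #f '((x . 1))))
    (define junk (make-environment-record parent '((i . 0))))
    (pop-for-reduction (list junk parent (make-control-record 'ret)) parent)
    ;; => the list whose first record is parent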

The algorithm is based upon the following claim: if two environment records are adjacent on the stack, then the access link of the upper record points to the lower record. The claim implies that all of the environment records between the topmost control record and the stack's top are part of a single access chain. If the callee shares part of this access chain, it will be the lower part, so we must pop the upper, unshared part before pushing the callee's new environment record.

This claim about adjacency is true because environment records are only pushed by procedure calls, and furthermore an environment record can only be adjacent to another one when the upper one has been pushed by a reduction, since a subproblem always pushes a control record before pushing its environment record. A reduction's record-popping algorithm guarantees that the new environment record's adjacent record is either a control record, or is the callee's lexical parent record. In the former case the new environment record will not be adjacent to any environment record, while in the latter case it will be adjacent to its lexical parent. In both cases our claim is supported.

Let's summarize how the new call and return sequences affect the stack. The subproblem call sequence doesn't discard any records, because they will all be needed after the called procedure is finished; it pushes both a control record and an environment record.

²Sometimes the compiler can determine that a particular procedure will always be called as a subproblem with a particular known return address, and in that case it need not push the control record. This is a useful optimization, but it is not needed to achieve tail-recursive behavior.



Figure 2: Activation records created by evaluation of the expression (accumulate + 0 '(1 2 3)) when accumulate is compiled using non-tail-recursive methods. [The original figure shows five stack snapshots, (a) through (e), drawn in the box notation of section 2; the diagram itself is not reproducible here.]


The reduction call sequence discards any environment records that the callee doesn't need, and then pushes a new environment record. The return sequence discards any environment records that the returnee doesn't need, and then pops the returnee's control record and invokes it.

By comparing this use of the stack with that of a tail-recursive interpreter, such as that described in [1], we can informally show that our new call and return sequences are tail-recursive. For a subproblem, the interpreter pushes a return address and a pointer to the current environment, then allocates a new environment frame for the callee and makes that the current environment. For a reduction, it pushes nothing, discarding the current environment pointer and replacing it with that of the callee's newly-allocated environment frame. For a return, it discards the current environment, pops a return address and an environment off the stack, and makes that environment the current one. If we identify our environment records with the interpreter's environment frames, and our control records with the interpreter's return addresses, it is clear that the interpreter and the compiled code are saving exactly the same information, and thus the compiled code is tail-recursive. The only difference between them is that the compiled code allocates the environment records directly on the stack, while the interpreter allocates them in the heap and saves their pointers on the stack.

Concluding this section, here are the new tail-recursive call and return sequences in detail (a code sketch of the record-popping loop follows the return sequence). For a subproblem, we use the same calling sequence as we did for the non-tail-recursive implementation. For a reduction, we use this sequence:

• Compute the access link and bindings of the new environment record, saving them somewhere other than the stack.

• If the stack's topmost record is an environment record other than the callee's lexical parent, pop it. Continue this process until reaching either a control record or the lexical parent.

• Push the access link and bindings that were computed earlier.

• Jump to the called procedure's start address.

Our new return sequence is:

• If the stack's topmost record is an environment record, pop it. Continue this process until the topmost record on the stack is a control record.

• Pop the control record off the stack and jump to its return address.
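In our toy model, the return sequence's popping loop reads as follows (a sketch of ours; the reduction case was sketched earlier, and the return case needs no lexical-parent test):

    ;; Return sequence: discard environment records, then pop the
    ;; control record; the caller jumps to its return address.
    (define (pop-for-return stack)
      (if (control-record? (car stack))
          (cdr stack)                     ; pop the control record too
          (pop-for-return (cdr stack))))  ; discard an environment record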

Figure 3 shows the effect of these new code sequences; it should be compared to Figure 2. Clearly accumulate now executes as an iterative process.

5 A Practical Implementation

The call and return sequences described in section 4 give correct tail-recursive behavior, but a straightforward implementation of them is both complicated and inefficient. Not only must we be able to distinguish environment and control records at run time, implying that each record must somehow be marked to indicate its type, but the compiler must emit record-popping loops for each reduction and return sequence; such loops are likely to consist of several machine instructions, and their execution time will be linear in the number of records popped. If we could statically determine how many records to pop, the code to pop them would merely change the stack pointer. This would execute in constant time, and on most machines would be a single instruction. Furthermore, without the need to dynamically distinguish record types, we would save the space needed to mark each record's type.

It is clear that compiler analysis to determine the discardable records is justified. However, despite its initial appeal, there are two serious drawbacks associated with static analysis: it's difficult and expensive to do well, and it's sometimes theoretically impossible to statically determine how many records to pop. So before looking at compiler analysis, we should first find a reasonably efficient mechanism that implements the record popping dynamically. Such a mechanism can be used either as an alternative to static analysis, or as a backup method when static analysis fails.

The popper mechanism, invented by Guillermo Rozas [11] and implemented in two different versions of the MIT Scheme compiler, is one possibility. This mechanism works by adding a "popper" field to each record, which contains a tiny program that dynamically decides whether the record needs to be popped; the program takes arguments that provide it with contextual information. The compiled code jumps into the popper field of the topmost record. The code in that field either jumps to the popper of the next record, after modifying the arguments if necessary, or else stops and adjusts the stack pointer.

Rozas was dissatisfied with the performance and complexity of the popper mechanism, and a subsequent discussion between him and Jonathan Rees, sparked by a paper contrasting stack allocation to heap allocation [3], led to their discovery of a simpler and more efficient mechanism [10], which has since been implemented by Richard Kelsey [8] for the Scheme 48 system. The author modified their mechanism to eliminate certain disadvantages (while introducing others) and implemented the modified mechanism in the MIT Scheme compiler in December 1987. The remainder of this section will describe the modified mechanism. We'll discuss the original mechanism later, as it remains an interesting possibility for future work.

Recall that earlier we examined the uses of the control-link field and concluded that it was unnecessary. Rozas and Rees made the discovery that a tail-recursive implementation could take advantage of this field even when a non-tail-recursive implementation could not. The reason is that control links, which chain together the stack's control records, can be used to identify the topmost control record on the stack. In our original terminology, an activation record's control link points to the return address in the activation record immediately below. In our new terminology, the control link of an environment record points to the nearest control record that is below the environment record on the stack. The control link of the stack's topmost environment record points to the topmost control record.³

³Note that, except during the call and return sequences themselves, the topmost record on the stack is always an environment record.


Figure 3: Activation records created by evaluation of the expression (accumulate + 0 '(1 2 3)) when accumulate is compiled using tail-recursive methods. [The original figure's stack snapshots are not reproducible here.]


We can restate the record-discarding algorithms in a slightly different form to help show how the control link can be used: the return sequence discards all environment frames above the topmost control record, while the reduction sequence discards all environment frames above either the topmost control record or the callee's lexical parent record, whichever is closer to the top of the stack. We can implement the first case by moving the control link into the stack pointer, and the second by comparing the control link and the callee's access link, moving the topmost one into the stack pointer. Both of these are simple to implement and reasonably efficient.
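A sketch of why both cases are constant-time (our own model, assuming the links and the stack pointer are simply numeric addresses and, as on the 68020, the stack grows toward lower addresses, so the smaller address is nearer the top):

    ;; Return: cut the stack back to the topmost control record.
    (define (return-stack-pointer control-link)
      control-link)

    ;; Reduction: cut back to the control record or the callee's
    ;; lexical parent, whichever is closer to the top of the stack.
    (define (reduction-stack-pointer control-link callee-access-link)
      (min control-link callee-access-link))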

Let's design new call and return sequences that take advantage of the control link. If we store a procedure's control link in its environment record, we'll have to copy the control link every time a reduction is executed. Instead, we introduce a register to hold the control link; when we execute a subproblem, we'll save the contents of this register in the control record, along with the return address, and set the register to point to the new control record. This is a tradeoff that simplifies reductions at the expense of subproblems.

Here is the new call sequence for a subproblem:

• Push the return address.

• Push the contents of the control-link register.

• Copy the contents of the stack-pointer register into the control-link register.

• Push the access link and bindings of the called procedure.

• Jump to the called procedure's start address.

The new return sequence is:

• Copy the contents of the control-link register into the stack-pointer register.

• Pop the previous control link off the stack and put it in the control-link register.

• Pop the return address off the stack and jump to it.

The new call sequence for a reduction is:

• Compute the access link and bindings of the new environment record, saving them somewhere other than the stack.

• Compare the access link to the contents of the control-link register. Set the stack-pointer register to whichever of these is closer to the top of the stack.

• Push the access link and bindings that were computed earlier.

• Jump to the called procedure's start address.

The efficiency of these new code sequences is significantly better than that of the previous sequences: the new ones execute in constant time, and the number of machine instructions needed to encode them is small. Their major drawbacks are the need to dedicate a machine register to hold the control link, and to push an extra word for each control record.

Figure 4 shows the activation records that result from these call and return sequences. The notation "CL" and an arrow is used to indicate the contents of the control-link register, and a saved control link is indicated by a line containing control link. Comparison to Figure 3 shows that these snapshots differ only in the handling of the control link.

6 An Efficient Implementation

The control-link mechanism of section 5 provides a correct implementation of tail recursion with acceptable performance. The mechanism dynamically determines the records to discard. If we make this determination statically we can generate even more efficient code. What follows are some useful ad hoc rules, with emphasis on what can be accomplished by a simple compiler using these rules.

Static analysis readily admits two relevant improvements to our code sequences. If we predict the value of a control link, then we need not generate the instructions to load, save, and restore the register for the link; if we predict the value of an access link, we may eliminate the link from its record, reducing code size. Here "predict the value" means the value can be determined at run time by adding a constant offset to some other known quantity. A control link is predicted by determining its offset from the stack pointer, while an access link is predicted by determining its offset from the beginning of the record that it is stored in. With good predictions, the compiled code can refer to most quantities as offsets from the stack pointer. Many machines have instructions supporting this form of reference, and we'd expect a good C compiler, for example, to generate such code.

We can predict the value of a record's access link in several situations. One situation occurs commonly and is relatively easy to determine: when a given procedure is always called as a reduction, the environment record of the procedure's lexical parent is always immediately below the environment record of the procedure itself (assuming that the lexical parent record is stack-allocated; otherwise this analysis is not interesting). In accumulate, loop is such a procedure, whose parent is the letrec that binds the variable loop. The letrec also expands into such a procedure, whose parent is the procedure accumulate. In typical Scheme code most let expressions occur this way. Performing the optimization transforms such a let expression into a sequence of stack pushes.
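To illustrate with our own example (not one from the paper), the two definitions below are equivalent: a let is precisely a reduction whose operator is a lambda expression, so the record holding value and rest always sits immediately above the record binding items, and its access link is a known constant:

    ;; A typical let ...
    (define (first-step items)
      (let ((value (car items))
            (rest  (cdr items)))
        (cons value rest)))

    ;; ... is the same reduction written out explicitly:
    (define (first-step-expanded items)
      ((lambda (value rest) (cons value rest))
       (car items)
       (cdr items)))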

A simple analyzer can find many of the environment records that this rule applies to. A reduction whose operator is a lambda expression is an obvious case, and a common one: a let expression expands into such code. Another common case is a lambda expression that is bound to a variable by a let or letrec expression, where every reference to that variable appears as the operator of a procedure call. Both of these cases can be located without sophisticated analysis, and represent many access links. Of course, if more complete dataflow information is available the analyzer can take advantage of it to find other cases.

Elimination of control links is harder than elimination of access links, and control links that can be eliminated are less common. For these reasons, the designer of a simple compiler may wish to omit control-link analysis entirely. However, one rule for eliminating control links is so simple, and applies to so many procedures, that most compilers should implement it: when a procedure's lexical-parent environment is not stack-allocated, the procedure does not need a control link: the control record is always immediately below the procedure's environment record (recall that the record below an environment record is either a control record or the lexical parent's environment record).⁴

⁴The methods that determine when stack allocation is used also determine when this rule should be used; since we aren't discussing these methods we won't describe exactly when to use the rule. But clearly a top-level procedure (e.g. accumulate) does not have a stack-allocated parent record and need not use a control link.


Figure 4: Activation records created by evaluation of the expression (accumulate + 0 '(1 2 3)) when accumulate is compiled using the control-link mechanism. [The original figure's snapshots, showing the CL register and saved control links, are not reproducible here.]



Beyond this simple rule is another that requires dataflow analysis. For each procedure, find all of the calls to the procedure. If all of the calls are subproblems then a control link is not needed: the control record will always be immediately below the procedure's environment record. If all of the calls are reductions, then the environment record of the procedure's lexical parent is always immediately below: a control link is needed only if the lexical-parent procedure needs one. If some of the calls are reductions, and some are subproblems, then a control link is needed.
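This rule is easy to encode once the call sites are classified; here is a sketch of ours, assuming the classification of section 4 is available as a list of the symbols subproblem and reduction:

    ;; Decide whether a procedure needs a control link, given the
    ;; classification of every call to it.
    (define (needs-control-link? call-kinds parent-needs-link?)
      (cond ((all? call-kinds 'subproblem) #f) ; control record adjacent
            ((all? call-kinds 'reduction)      ; parent record adjacent
             parent-needs-link?)
            (else #t)))                        ; mixed subproblems/reductions

    (define (all? kinds kind)
      (or (null? kinds)
          (and (eq? (car kinds) kind) (all? (cdr kinds) kind))))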

The compiler analysis we've just sketched results in each procedure being marked to indicate whether or not it requires a control link or an access link. The compiler's code generator examines these marks to generate the correct call and return sequences. In the following descriptions of the new code sequences, the decisions made by the code generator at compile time appear as the leading conditional clauses, distinguishing them from the actions and decisions that occur at run time.

Subproblem sequence:

• If the calling procedure has a control-link mark: push the contents of the control-link register.

• Push the return address.

• If the called procedure is known and has a control-link mark: copy the contents of the stack-pointer register into the control-link register.

• If the called procedure is known and has an access-link mark: compute the access link and push it.

• Push the bindings of the called procedure.

• Jump to the called procedure's start address.

Return sequence:

• If the returning procedure has a control-link mark: copy the contents of the control-link register into the stack-pointer register.

• If the returning procedure doesn't have a control-link mark: adjust the stack-pointer register so that it points at the topmost control record. The compiler knows exactly what adjustment is needed.

• Pop the return address off the stack and jump to it. Note that the code at the return address is now responsible for popping the calling procedure's saved control link, if any. The returning procedure can't do this because it does not know whether the returnee procedure requires a control link.

Reduction sequence:

• If the called procedure is known and has an access-link mark: compute the access link and save it somewhere other than the stack.

• Compute the bindings of the called procedure, saving them somewhere other than the stack.

• Adjust the stack pointer. The action to be taken is one of the following cases. The compiler chooses the first case that holds and generates code for it:

  - The called procedure is unknown or its lexical-parent environment is not stack-allocated: two subcases. If the calling procedure has a control-link mark: copy the contents of the control-link register into the stack-pointer register. Otherwise: adjust the stack-pointer register so that it points at the topmost control record. The compiler knows exactly what adjustment is needed.

  - The calling procedure is the lexical parent of the called procedure: do nothing, the stack pointer is already correct.

  - The calling procedure and the called procedure are lexical siblings (i.e. they have the same lexical parent): pop the calling procedure's environment record.

  - The calling procedure has a control-link mark: compare the called procedure's access link to the contents of the control-link register and set the stack pointer to whichever of these is closer to the top of the stack.

  - The calling procedure is always invoked as a reduction: set the stack pointer to the called procedure's access link.

  - Otherwise, the calling procedure is always invoked as a subproblem: pop the calling procedure's environment record.

• Push the access link (if any) and bindings that were computed earlier.

• If the calling procedure does not have a control-link mark, but the called procedure does: compute the control link and store it in the control-link register.

• Jump to the called procedure's start address.

Figure 5 shows the effect of these new optimized sequences on the accumulate procedure. As we can see, accumulate is optimal in that the compiler is able to eliminate all of its access and control links, except for the special access link in the bottommost record.

7 Comparison to Standard Techniques

Having described an efficient tail-recursive implementation of stack allocation, we can ask how the quality of its generated code compares to code generated by traditional compilation techniques for non-tail-recursive languages. There are two cases in which direct comparison of tail-recursive and non-tail-recursive stack-allocation techniques is straightforward. For a purely recursive loop, the two techniques generate essentially identical code.


Figure 5: Activation records created by evaluation of the expression (accumulate + 0 '(1 2 3)) when accumulate is compiled using static analysis to eliminate access and control links. [The original figure's snapshots are not reproducible here.]

object accumulate (binary_op, initial, items)
     object binary_op;
     object initial;
     struct pair * items;
{
  object value;
  struct pair * l;
  if (items == 0)
    return (initial);
  else
    {
      value = items->car;
      l = items->cdr;
    loop:
      if (l == 0)
        return (value);
      else
        {
          value = apply_2 (binary_op, value, l->car);
          l = l->cdr;
          goto loop;
        }
    }
}

Figure 6: The accumulate procedure, translated to C.


The more interesting comparison is for a purely iterative loop. In Scheme, this appears as a procedure whose recursive call is a reduction, while in C this is written using an iteration construct, like while or goto, combined with side-effects and sequencing. Figure 6 shows the accumulate program translated to C. The goto construct is used to express the iteration because it is closest in form to Scheme's procedure call. The declarations of the local variables value and l appear in the outermost block to emphasize the fact that most C implementations allocate all local variables immediately on procedure entry.

We will compare the object code of LIAR, the MIT Scheme compiler, to that of GCC, the GNU C Compiler [12], when each is used to compile accumulate.

Figure 7 shows LIAR's output for the procedure accumulate, and Figure 8 shows GCC's output. Both outputs are 68020 instructions [6] in the syntax of the Hewlett-Packard HP-UX assembler [5]. Both have been edited in a variety of ways to eliminate inessential differences: extraneous labels and declarations have been removed; various no-op instruction sequences have been deleted; and explanatory comments have been added. Additionally, LIAR's output has had all tag manipulation, GC formatting, debugging information, and interrupt polling removed; the "procedure value" register has been altered to %d0 to match GCC's convention; and the order of the basic blocks has been changed to match that of GCC's output. GCC was run with its optimizer turned off because the optimizer causes the object code to use registers instead of the stack, which would have defeated the comparison.

LIAR's output differs in two ways from that of the hypothetical compiler we have been discussing. First, it does not push an environment record containing the binding of loop; it determines that the value of this variable is always known and need not be constructed. Second, while our hypothetical compiler would have computed the callee's record, stored it somewhere (perhaps in registers), adjusted the stack, and then pushed the record, LIAR's output constructs the environment record of the callee by incrementally overwriting that of the caller (see lines 25-28). This valuable optimization causes the object code to use side-effects on stack-allocated locations, even though no explicit side-effects appear in the source code. As an alternative to this optimization we could have written the Scheme program using explicit side-effects to make it more like the C program. However, we cannot rewrite the C program to eliminate those explicit side-effects without changing it from an iterative process to a recursive process, because C has no iteration constructs that bind variables.

Summarizing, the differences between the compilers' outputs are entirely due to calling and representational conventions. There is no fundamental difference in their efficiency:

• GCC's output uses a frame pointer while LIAR's output does not (e.g. Fig. 8 lines 2 and 10).

• GCC uses the convention that the caller deallocates the arguments of a procedure call, while with LIAR the callee does. (See Fig. 8 line 24, Fig. 7 line 16.)

• GCC uses the jsr instruction to push the return address, jumping to the callee at the same time; LIAR uses the pea instruction and does other work before jumping to the callee. (See Fig. 8 line 23, Fig. 7 lines 19 and 23.) Note that this convention is a consequence of tail recursion: the return address appears below a procedure's arguments so that the arguments may be popped without popping the return address.

• GCC's output allocates its local variables immediately on entry to the procedure; LIAR's output allocates them when they are needed. (See Fig. 8 line 2, Fig. 7 lines 9 and 10.)

This shows that our techniques work as well as existing non-tail-recursive techniques for simple iterative loops of this kind.

It would be interesting to do a more thorough comparison of the two compilation techniques. However, this will require writing pairs of programs that generate equivalent stack-allocation patterns in tail-recursive and non-tail-recursive languages, which is difficult. Tail-recursive procedure calls are easily used to create processes with complex mixtures of recursive and iterative stack-allocation behavior, while most non-tail-recursive languages' iteration mechanisms have little or no control over stack-allocation behavior.

8 Future Directions

We've seen a specific set of methods to implement stack allocation of local variables for a tail-recursive language. Our use of the control link is a general solution to the problem of deallocating stack-allocated environment records. The compiler optimizations admit an implementation that is comparable in performance to an implementation for a non-tail-recursive language.


 1 accumulate:
 2         mov.l   8(%sp),%a0
 3         tst.l   %a0              # test (NULL? ITEMS)
 4         bne     label_18         # branch if not null
 5         mov.l   4(%sp),%d0       # value-register := INITIAL
 6         lea     12(%sp),%sp      # deallocate arguments
 7         rts                      # return
 8 label_18:
 9         mov.l   4(%a0),-(%sp)    # allocate L := (CDR ITEMS)
10         mov.l   (%a0),-(%sp)     # allocate VALUE := (CAR ITEMS)
11 lambda_4:
12         mov.l   4(%sp),%a0
13         tst.l   %a0              # test (NULL? L)
14         bne     label_19         # branch if not null
15         mov.l   (%sp),%d0        # value-register := VALUE
16         lea     20(%sp),%sp      # deallocate args, L, VALUE
17         rts                      # return
18 label_19:
19         pea     continuation_2   # push return address
20         mov.l   (%a0),-(%sp)     # push (CAR L)
21         mov.l   8(%sp),-(%sp)    # push VALUE
22         mov.l   20(%sp),-(%sp)   # push BINARY-OP
23         jmp     apply_2          # invoke BINARY-OP on two args
24 continuation_2:
25         mov.l   %d0,(%sp)        # VALUE := result
26         mov.l   4(%sp),%a0
27         mov.l   4(%a0),4(%sp)    # L := (CDR L)
28         bra     lambda_4         # continue loop

Figure 7: MIT Scheme compiler output for accumulate.

 1 accumulate:
 2         link.w  %a6,&-8          # allocate VALUE, L
 3         mov.l   16(%a6),%a0
 4         tst.l   %a0              # test (ITEMS == 0)
 5         bne     L2               # branch if not null
 6         mov.l   12(%a6),%d0      # value-register := INITIAL
 7         unlk    %a6              # deallocate VALUE, L
 8         rts                      # return
 9 L2:
10         mov.l   (%a0),-4(%a6)    # VALUE := ITEMS->CAR
11         mov.l   4(%a0),-8(%a6)   # L := ITEMS->CDR
12 L3:
13         mov.l   -8(%a6),%a0
14         tst.l   %a0              # test (L == 0)
15         bne     L4               # branch if not null
16         mov.l   -4(%a6),%d0      # value-register := VALUE
17         unlk    %a6              # deallocate VALUE, L
18         rts                      # return
19 L4:
20         mov.l   (%a0),-(%sp)     # push L->CAR
21         mov.l   -4(%a6),-(%sp)   # push VALUE
22         mov.l   8(%a6),-(%sp)    # push BINARY_OP
23         jsr     apply_2          # invoke BINARY_OP on two args
24         add.w   &12,%sp          # discard call frame
25         mov.l   %d0,-4(%a6)      # VALUE := result
26         mov.l   -8(%a6),%a0
27         mov.l   4(%a0),-8(%a6)   # L := L->CDR
28         bra     L3               # continue loop

Figure 8: GNU C compiler output for accumulate.


What we haven't seen is perhaps more interesting. This paper has just touched the surface of how the methods work, ignoring any serious analysis or proof. A thorough treatment of the topic would compare these methods to alternatives such as register allocation and heap allocation to determine the performance tradeoffs. It should also compare these techniques to lambda-lifting [7], a popular technique that solves this problem by copying environment records rather than attempting to share them. The compiler analysis would be better presented as one part of a comprehensive analysis with deep understanding of the relationships between a program's parts. The ad hoc rules we've seen here are crude approximations to this.

Another useful experiment would be to compare the performance of the original form of the control-link mechanism, as invented by Rozas and Rees, to that of the form presented here. In the original form, records are popped in the return sequence but not the call sequence. If the stack overflows, a compaction process examines the records on the stack and removes those that are useless. This form has the advantage of eliminating some code at reductions, but the disadvantages of making access-link optimization more difficult and using more memory (thereby decreasing the effectiveness of caching); a hybrid form might capture the best of both.

Clearly, much interesting work remains.

Acknowledgments

Thank you to the wonderful people who helped create this paper. Some read drafts and debugged my writing (if only program debuggers were so good!), and all had interesting and useful comments: Hal Abelson, Olivier Danvy, Mike Eisenberg, Arthur Gleckler, David Kranz, Jonathan Rees, Jerry Sussman, and Franklyn Turbak. Special thanks to Bill Rozas, who not only read the paper, but was always there with a helpful word when the ideas were being implemented. Both the paper and the MIT Scheme compiler owe a great deal to his imagination and energy.

References

[1] Harold Abelson and Gerald Jay Sussman with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, 1985.

[2] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Publishing Company, Reading, 1986.

[3] Andrew W. Appel. Garbage Collection can be Faster than Stack Allocation. Princeton University Department of Computer Science CS-TR-045-86, June 1986.

[4] William D. Clinger, Anne H. Hartheimer, and Eric M. Ost. Implementation Strategies for Continuations. In Proceedings of the 1988 ACM Conference on Lisp and Functional Programming, pages 124-131.

[5] HP-UX Assembler Reference and Supporting Documents. Hewlett-Packard Company, Fort Collins, 1988.

[6] Motorola Inc. MC68020 32-Bit Microprocessor User's Manual. 2d ed. Prentice-Hall, Inc., Englewood Cliffs, 1985.

[7] Simon L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, New York, 1987.

[8] Jonathan Rees. Personal communication.

[9] Jonathan Rees and William Clinger, editors. The Revised³ Report on the Algorithmic Language Scheme. In ACM SIGPLAN Notices 21(12), ACM, December 1986.

[10] Jonathan Rees and Guillermo Juan Rozas. Personal communication.

[11] Guillermo Juan Rozas. Liar, an Algol-like Compiler for Scheme. S.B. thesis, MIT Department of Electrical Engineering and Computer Science, January 1984.

[12] Richard M. Stallman. Using and Porting GNU CC. Free Software Foundation, Inc., February 1990.

[13] Guy Lewis Steele Jr. Debunking the "Expensive Procedure Call" Myth, or Procedure Call Implementations Considered Harmful, or Lambda, the Ultimate GOTO. In ACM Conference Proceedings, pages 153-162. ACM, 1977.
