

International Journal of Parallel Programming, Vol. 17, No. 5, 1988

Multiprocessor Execution of Functional Programs(1)

Benjamin Goldberg(2)

Received October 1988; Revised April 1989

Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times.

Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel.

The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.

KEY WORDS: Functional languages; parallelism; graph reduction; combinators.

(1) This research was supported in part by National Science Foundation grants DCR-8302018 and DCR-8521451, by a DARPA subcontract with SDC/Unisys, and by gifts from Burroughs Austin Research Center and the Intel Corporation.

(2) Department of Computer Science, New York University, 251 Mercer Street, New York, NY 10012.


0885-7458/88/1000-0425$06.00/0 © 1988 Plenum Publishing Corporation


1. INTRODUCTION

Functional languages have gained attention as vehicles for programming in a concise and elegant manner.(1-3) It has also been suggested that functional programming provides a natural methodology for programming multiprocessor computers. Although several prototype machines have been built specifically for the purpose of executing functional programs in parallel, we are interested in using general purpose parallel machines for functional programming. This paper describes the first working implementation of a non-strict, higher-order functional language on two commercially available multiprocessors.

1.1. Objectives

This paper seeks to answer the following question:

Is it feasible to execute conventional functional programs on current multiprocessors, such that a significant reduction in the execution time is achieved?

Some of the terms used in this question need to be defined.

• Conventional functional programs: We seek to create an implementation for a functional language that does not contain special constructs for specifying the parallel behavior of a program. Our implementation must be able to automatically decompose functional programs to run on a multiprocessor.

• Currently available multiprocessors: The multiprocessors available today are generally composed of processors designed to execute programs written in sequential imperative languages. No special hardware support is provided for executing functional programs.

• Reduction in execution time: We are investigating whether a functional program can run significantly faster on a multiprocessor than on a sequential (uniprocessor) computer. We would ultimately like to show that functional programming is the most appropriate method for programming parallel computers. However, in this paper we restrict ourselves to the investigation of the advantages of using parallel machines instead of sequential machines to execute functional programs.

Alfalfa and Buckwheat are two prototype systems implemented at Yale University that were built to answer this question. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor and Buckwheat is implemented on the Encore Multimax shared-memory multiprocessor.


In this paper, we assume that the reader has some familiarity with functional languages, combinators, and graph reduction. If not, an excellent discussion of these topics can be found in Ref. 4. We give an extremely brief introduction in Section 2. In Sections 3-6, we describe the compile-time methods used to partition functional programs for efficient multiprocessor execution. We describe Alfalfa in Section 7, Buckwheat in Section 8 and present experimental results for each.

2. FUNCTIONAL LANGUAGES, GRAPH REDUCTION, AND COMBINATORS

2.1. Functional Languages

Functional languages are programming languages exhibiting the following characteristics:

• Mathematical Notation: The programs are written in a high-level notation resembling that of mathematics.

• Referential Transparency: There is no side-effect operator (such as assignment). Thus the programs exhibit referential transparency, the property in which identical expressions have identical values (within the same lexical scope).

• Applicative Structure: A program consists of a collection of function and constant definitions, and an expression whose value constitutes the result of the program. Each expression in the program consists only of constants, identifiers, function applications, and perhaps nested definitions.

There are many functional languages. Some of the better known ones are FP,(1) ML,(5) Miranda(6) ["Miranda" is a trademark of Research Software Ltd.], and LML.(7)

Many modern functional languages exhibit some additional properties:

• Higher-Order Functions: Functions are treated as first-class objects in these languages. They can be passed as arguments to other functions and may be returned as the result of a function application. Functions that take functions as arguments or return functions as values are called higher-order functions. These languages generally allow function applications to be curried. That is, if a function is defined to take several arguments, then it can be thought of as a function that takes one argument and returns a function that takes another argument, and so on.


• Non-strict Semantics: In functional programs written in languages with non-strict semantics, an argument in a function application is evaluated only if its value is required. This corresponds to normal order evaluation in the lambda calculus.

For the rest of this paper, the term functional language will be used to refer only to non-strict, higher-order functional languages. Of the languages mentioned here, Miranda and LML are higher-order and non-strict.
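The two properties above are easy to mimic in a mainstream language. The following Python sketch (our illustration, not from the paper) simulates currying with nested closures and non-strictness with thunks; the names `add`, `first`, and `diverge` are ours.

```python
# Currying: a two-argument function as nested one-argument functions.
add = lambda x: lambda y: x + y
inc = add(1)                  # partial application yields a new function
assert inc(41) == 42

# Non-strict argument passing, simulated with thunks (zero-argument
# functions that delay evaluation until the value is actually needed).
def diverge():
    raise RuntimeError("this argument is never needed")

def first(x_thunk, y_thunk):
    return x_thunk()          # the second argument is never forced

assert first(lambda: 7, diverge) == 7
```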

A strict functional language called SISAL(8) was implemented on the Denelcor HEP multiprocessor by Allen and Oldehoeft prior to the work described here. They describe their implementation in several chapters of Ref. 9. Further discussion can be found in Ref. 10. Our goal is to execute higher-order and non-strict functional languages. A discussion of the reasons for choosing strict or non-strict languages is, however, beyond the scope of this paper.

2.2. ALFL: A Non-strict, Higher-order Functional Language

This paper describes an implementation of ALFL, a non-strict, higher-order functional language.(11) ALFL is similar in many ways to the other non-strict higher-order functional languages mentioned earlier. It is weakly typed and requires run-time type checking. ALFL functions are fully curried, although many common infix operators, such as arithmetic operators, are provided for convenience. Like many other functional languages, ALFL provides pattern matching as an elegant way to define functions based on the structure of their arguments (we will not discuss pattern matching further in this paper, however).

An ALFL program consists of an equation group which is a set of equations and a result expression delimited by braces. Each equation defines either a function or a constant:

{ f1 x11 ... x1m1 == e1;
  ...
  fn xn1 ... xnmn == en;
  result e; }

Each expression may consist of applications of functions, applications of primitive operators (such as + and -), and nested equation groups. ALFL uses block structure and static scoping to resolve identifiers. Functions defined within the same equation group may be mutually recursive. Here is a sample ALFL program that defines and uses the higher-order


map function to form a list whose elements are the square of the elements of a given list:

{ map f l == l=[] -> [], f (hd l) ^ map f (tl l);
  square l == map { sq n == n*n;
                    result sq; }
                  l;
  result square [1,2,3,4,5]; }

The conditional operator in ALFL has the form p -> c, a where p, c, and a are the predicate, consequent, and alternate respectively. The infix operator ^ denotes the list construction operator (similar to cons in Lisp) and [] denotes the null list.
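For readers unfamiliar with the ALFL syntax, here is a direct transcription of the map/square program into Python (an illustration of ours; ALFL's laziness is not modeled).

```python
# map f l == l=[] -> [], f (hd l) ^ map f (tl l)
def alfl_map(f, l):
    if l == []:                              # the predicate l=[]
        return []                            # the consequent
    return [f(l[0])] + alfl_map(f, l[1:])    # f (hd l) ^ map f (tl l)

# square l applies the local function sq n == n*n over the list.
def square(l):
    return alfl_map(lambda n: n * n, l)

assert square([1, 2, 3, 4, 5]) == [1, 4, 9, 16, 25]
```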

The abstract syntax of the simplified version of ALFL that we have implemented is described here. Arithmetic operators follow the usual precedence rules and function applications associate to the left.

program        ::= equation_group
equation_group ::= { (equation ;)* result exp ; }
equation       ::= id (id)* == exp
exp            ::= id | constant | exp bin_op exp | -exp | exp -> exp, exp | (exp)
bin_op         ::= + | - | * | / | ^ | == | < | > | ...
constant       ::= integer | float | [] | predefined identifiers

2.3. Graph Reduction

Graph reduction(12) is the evaluation method most often used to execute non-strict functional programs. It can be thought of as the graphical equivalent of reduction in the lambda calculus, and supports higher-order functions and lazy evaluation in a natural manner. In graph reduction a program, along with its data, is represented as a graph. During execution, reductions (conversions) are applied to the graph until it has been reduced to a normal form, to which no more reductions can be applied. For example, the initial graph representing the ALFL program

{ f x y == h (x y) y;
  g a b == a + b;
  h c d == c * d;
  result f (g 2) 3; }


Fig. 1. The initial graph.

is shown in Fig. 1. The "@" symbol represents function application. Notice that the application of f to two arguments is curried and is represented by two application nodes. Reduction proceeds via the construction of an instance of f's body with its formal parameters replaced by pointers to the corresponding arguments. This is shown in Fig. 2. According to the definitions of g and h, the reduction of the graph proceeds as shown in Fig. 3.

In this example, the function identifier in an application resides at a leaf in the graph to support currying. Each interior node represents the application of its left child to its right child. If a function is supplied with all the arguments it needs (as in this application of f), an uncurried application could be represented by a node containing the function and arguments. For example, the uncurried version of f (g 2)3 in the above program could be represented as shown in Fig. 4. In this case, the node serves as an activation record for the function call. In applications that cannot be uncurried, an explicit apply node is required.

Graph reduction can be used to support parallel execution of func- tional programs. Any sections of the graph that are eligible to be reduced (without violating ALFL's non-strict semantics) can be reduced in parallel. For example, the parallel reduction of f(g 1 2) (h 3 4), represented in Fig. 5, could proceed by the evaluation of (g 1 2) and (h 3 4) in parallel, as

Fig. 2. The graph after the first reduction step.


Fig. 3. The reduction of the graph.

Fig. 4. The uncurried version of (f (g 2) 3).

Fig. 5. The applications of g and h can be evaluated in parallel.


long as f required both of their values. In this case, the functions in a program, such as g and h here, specify the behavior of parallel tasks.
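The reduction sequence of Figs. 1-3 can be checked with a small Python sketch (ours, with partial application standing in for curried application nodes):

```python
from functools import partial

def g(a, b): return a + b     # g a b == a + b
def h(c, d): return c * d     # h c d == c * d

def f(x, y):
    # f x y == h (x y) y; x arrives as the partial application (g 2)
    return h(x(y), y)

# f (g 2) 3 reduces to h (g 2 3) 3 = h 5 3 = 15, as in Fig. 3.
assert f(partial(g, 2), 3) == 15
```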

2.4. Lambda Lifting and Supercombinators

An ALFL function may contain free variables: variables that do not occur in the function's formal parameter list. There are essentially two ways to provide access to free variables:

• A hierarchical environment structure could be maintained that provides a path between the use of a free variable and the activation record of the function in which the variable was bound. In conventional languages, this is usually implemented by either a static chain or a display.

• All functions can be transformed into combinators. Combinators are simply functions that contain no free variables. In a combinator body, all variable references are to variables that occur in the formal parameter list. No hierarchical environment structure is required to support the evaluation of combinators, since the values of all variables in a combinator body have been passed as arguments.

There is an advantage to using combinators that is particular to parallel implementations. In an implementation using environments, the use of a variable could occur on a processor other than the one on which the variable was bound. Thus, interprocessor communication would be required to resolve the variable reference. This would not be the case with combinators, since all variables are local.

Any functional program can be translated into an expression containing only references to a fixed set of combinators.(13) In fact, the two combinators S and K,

S f g x = f x (g x)

K x y = x

are sufficient. As an alternative, Hughes observed that one could derive a different set of combinators for each program. (14) He called these derived combinators supercombinators.
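The fixed combinator base S and K can be written down directly; this Python sketch (ours) checks the classical fact that S K K behaves as the identity.

```python
# S f g x = f x (g x)
S = lambda f: lambda g: lambda x: f(x)(g(x))
# K x y = x
K = lambda x: lambda y: x

# S K K x = K x (K x) = x, so S K K is the identity combinator.
I = S(K)(K)
assert I(7) == 7
assert K("a")("b") == "a"
```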

The translation of expressions in a functional language into supercombinators is called lambda lifting.(15) In its simplest form, lambda lifting adds all free variables in a function definition to its formal parameter list. Any


application of the function is also modified to include the free variables as arguments. For example, the ALFL program

{ f x == { g y == x + y;
           result g 1; };
  result f 7; }

would be transformed by lambda lifting to

{ f x == g x 1;
  g x y == x + y;
  result f 7; }

This form of lambda lifting creates supercombinators that may be less efficient than those generated by Hughes's method of lambda lifting. Hughes's supercombinators can be considered more lazy than the ones generated via this method (see Ref. 14).

In our implementations, we perform lambda lifting on ALFL functions to create supercombinators that are even more efficient than Hughes's. For the purposes of this paper, we assume that ALFL functions have already been transformed to supercombinators before the algorithms described here are applied. For a description of our lambda lifting method, see Ref. 16.
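The effect of the simple form of lambda lifting shown above can be replayed in Python (our transcription; the function names are ours):

```python
# Before lifting: g is nested in f and closes over the free variable x.
def f_nested(x):
    def g(y):
        return x + y          # x is free in g
    return g(1)

# After lifting: x is added to g's parameter list and the call site
# of g passes x explicitly; g is now a combinator.
def g_lifted(x, y):
    return x + y

def f_lifted(x):
    return g_lifted(x, 1)

# Both versions compute the same result for f 7.
assert f_nested(7) == f_lifted(7) == 8
```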

2.5. Related Work

A substantial amount of work has been dedicated to designing architectures specifically for parallel graph reduction of functional programs (generally in supercombinator form). Such projects include: the AMPS project at the University of Utah,(17,18) the ALICE project at Imperial College,(19) and the GRIP project at University College London.(20) Our approach differs in that we desire to execute functional languages on commercial multiprocessors. There has been work done on compiling functional languages efficiently for conventional uniprocessors. One such project is the LML compiler using the abstract G-machine model developed at Chalmers University.(7,21)

3. TASKS AND PROGRAM GRANULARITY

In order to run on a multiprocessor, a functional program must be decomposed into tasks, each of which executes on a single processor and


in parallel with other tasks. These tasks may themselves create other tasks, may synchronize with other tasks, and may return values to other tasks. In order to generate code for these tasks, the compiler must assign a procedural description to the work that a task performs. One way to view this procedural description is as an intermediate representation of the source program in which explicit synchronization between parallel components of the program has been included.

In any multiprocessor system, there is an overhead cost involved in creating parallel tasks. In many architectures this cost is significant and, depending on the particular tasks being executed, may outweigh any benefit gained from exploiting parallelism. When a task is created, some communication must occur between the processor that initiated the creation and the processor that will execute the new task. This communication may be as inexpensive as accessing a shared queue of tasks or as expensive as sending a message over a network.

This cost leads to the notion of the granularity of a parallel computation. Granularity is a measure of how much computation occurs on each processor between periods of communication, and is an indication of how often the execution of a task will incur communication overhead. If the grain size is large, resulting in a coarse-grained computation, a large amount of computation is done between periods of communication. Thus communication costs are incurred relatively infrequently. If the grain size is small, resulting in a fine-grained computation, then the communication overhead will be incurred often during the computation.

Our goal is to extract as much parallelism as possible from the program by decomposing expressions into parallel tasks. However, even if an expression contains some parallelism, unless that parallelism is useful the expression should be evaluated by a single task on one processor. We define an expression to be practically sequential if it contains no useful parallelism. Notice that whether an expression is practically sequential or not depends upon the communication costs of the particular multiprocessor being utilized.

4. SERIAL COMBINATORS

In our system, the procedural description of each task is expressed as a serial combinator. A serial combinator is a function whose body contains constructs for creating and synchronizing the execution of tasks. The body of each serial combinator is executed sequentially. Each call to a serial combinator creates a new task, along with a new node in the program graph to hold the state information for that task. Like any combinator, a serial combinator does not require a hierarchical environment structure.


Every variable accessed by a serial combinator can be found in the local environment represented by its activation record.

A serial combinator is defined to be a function with the following properties:

1. It is a combinator; its body contains no free variables.

2. Its body is practically sequential and contains constructs for synchronizing its execution with other tasks.

3. It is the largest possible function that satisfies properties 1 and 2. That is, its body could not occur as a subexpression within the body of another serial combinator.

The third property listed here reflects the fact that the program should be partitioned into as few serial combinators as possible without reducing the potential for exploiting useful parallelism. If a practically sequential expression is decomposed into several serial combinators, its execution time will be adversely affected by the overhead of the serial combinator calls.

4.1. Serial Combinators and Tasks

Serial combinators were chosen to specify the behavior of a task because they are natural extensions of the functions used in uniprocessor graph reduction. On a uniprocessor system, graph reduction provides the mechanism for lazy evaluation and maintaining shared expressions via the creation of nodes in the graph. On a multiprocessor, the evaluation of serial combinators via graph reduction accomplishes all of the following:

1. It supports lazy evaluation via the creation of nodes representing delayed expressions in the same manner as uniprocessor graph reduction.

2. It supports sharing of expressions via the manipulation of arcs between nodes, also in the same way as uniprocessor graph reduction.

3. It provides a representation for the state of a task via the nodes in the graph.

4. It provides a multi-threaded dynamic chain as a mechanism for parallel activations of serial combinators to return values to the task that spawned them. The dynamic links for many currently executing functions may point to the same activation record.

Multiprocessor serial combinator reduction subsumes all the functionality of uniprocessor graph reduction. Thus, every function in the partitioned program will be a serial combinator and every serial combinator will generate a node in the graph, whether or not explicit synchronization or lazy evaluation is required. In Section 6 a new model of graph reduction is discussed that lifts this requirement.

4.2. Constructs for Creating Tasks and Synchronization

The synchronization constructs that serial combinators contain are the spawn, wait, and demand constructs. For ease of explanation, we will represent serial combinators using S-expression syntax much like that of Lisp. Each construct is described here.

4.2.1. The Demand Construct

The demand construct has the form

(demand (v1 ... vn)
  body)

and indicates that the values of variables v1 ... vn, which may be bound to unevaluated expressions, should be demanded in parallel. The evaluation of body begins as soon as the values of v1 ... vn have been demanded and does not wait for the values to return. Because serial combinators preserve the termination properties of lazy evaluation, we must be certain, using strictness analysis, that the values of v1 ... vn will be needed at some point in the computation.

4.2.2. The Wait Construct

A wait construct has the form

(wait (v1 ... vn)
  body)

and indicates that the values of v1 ... vn must be available before the evaluation of body can begin. If the values of v1 ... vn are still being computed, the evaluation of the serial combinator is suspended. Evaluation is resumed when the needed values have returned. Each of v1 ... vn must have already been demanded or spawned (see next section). Although the evaluation of a serial combinator call may be suspended, the local processor is free to execute any other available task.


4.2.3. The Spawn Construct

The spawn construct has the form

(spawn ((v1 exp1) ... (vn expn))
  body)

and specifies that each expression expi should be evaluated by creating a new task, along with a corresponding node in the program graph. Every function call that creates a node in the graph must be an invocation of a serial combinator. Therefore, each expi must be a serial combinator call. When expi becomes evaluated, the value returned by the corresponding task is bound to the variable vi. The evaluation of body proceeds without blocking on the values of v1 ... vn, and thus each vi must occur within a wait construct when its value is needed.

The spawn construct is the only way to specify the creation of a node. Therefore all serial combinator applications must occur within a spawn construct. If the compiler determines that one of the serial combinator applications in the spawn construct should be evaluated locally, the corresponding task will be executed on the local processor without invoking the dynamic scheduler. In this case, the variable-expression pair in the spawn construct is marked "local." An example of the use of the spawn and wait constructs is:

(spawn ((v1 (f x y)) (local v2 (g x y)))
  (wait (v1 v2)
    (+ v1 v2)))

It may seem strange that the variable v2 here, which was computed locally, occurs in a wait construct. The evaluation of (g x y) may suspend, and it is possible for the value of v1 to become available before the value of v2. If this happens, we do not want the evaluation of (+ v1 v2) to begin when v1 arrives, but rather to wait until the values of both variables are available.
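The spawn/wait behavior can be approximated with futures; this Python sketch (our rough model, not the Alfalfa or Buckwheat runtime) treats spawn as submitting tasks and wait as blocking on all of their results at once.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x, y): return x * y     # hypothetical serial combinators
def g(x, y): return x + y

with ThreadPoolExecutor() as pool:
    v1 = pool.submit(f, 2, 3)          # (spawn ((v1 (f x y)) ...)
    v2 = pool.submit(g, 2, 3)          # ... (local v2 (g x y))), modeled as a future
    # (wait (v1 v2) (+ v1 v2)): block until both values are available.
    result = v1.result() + v2.result()

assert result == 11
```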

4.2.4. The Let Construct

A final construct that is of less interest but is still useful is the let construct. It has the form

(let ((v1 exp1) ... (vn expn))
  body)

and is similar to spawn with the exception that each expi is an expression that can be executed immediately as part of the current task on the local


processor without invoking graph reduction. This means that expi is restricted to an application of a primitive operator that requires no activation record. Each expi will be evaluated in a conventional manner (using registers, temporary locations, etc.) on the local processor.

The evaluation of body continues as soon as the values have been computed. Naturally, no vi need appear in a demand or a wait. The let construct is primarily used to show that certain arguments in a serial com- binator application are not worth evaluating in parallel but should still be evaluated immediately. An example of the use of the let construct is

(let ((v1 (+ x y)) (v2 (* y z)))
  (spawn ((local v3 (f v1 v2)))
    (wait (v3)
      (+ v3 2))))

It is worth noting that the constructs described earlier can be viewed as annotations to functional programs. As such, these constructs could be provided to allow a programmer to specify explicitly the desired parallel behavior of his program. Research into para-functional programming(22,23) has explored the possibility of including these kinds of annotations in functional languages.

4.3. The Placement of Spawns and Demands

To maximize the amount of useful parallelism, serial combinator invocations should be spawned as soon as it is known that their values are needed. Likewise, variables bound to non-strict arguments should be demanded as soon as it is known that their values are needed. For the spawns and demands to occur as soon as possible during execution, they must be lifted, or hoisted, up the serial combinator body from where they originally occurred to the first place at which it can be determined that they will be needed.

An expression is first determined to be safely evaluable at one of two places:

1. At the branches of a conditional: Before the predicate of a conditional is evaluated, those expressions or variables that occur in either branch (but not both branches) of the conditional cannot be spawned or demanded. Once the predicate has been evaluated, all serial combinator calls whose values will be needed in the appropriate branch can immediately be spawned. For example, suppose the following expression comprises the body of an ALFL function:

(f x y) -> (g x y) + (h x y), 1


where f, g, and h are all sufficiently complex to occur in spawn constructs and the values of x and y are available. The corresponding serial combinator expression could be (ignoring the need for a wait construct),

(spawn ((local v1 (f x y)))
  (if v1
      (spawn ((v2 (g x y)) (local v3 (h x y)))
        (+ v2 v3))
      1))

Notice that the calls to g and h cannot be hoisted any higher without being premature.

2. At the top of serial combinator definitions: If there are serial combinator applications that will always be evaluated within a serial combinator body, then these applications should occur in a spawn construct at the top of the serial combinator definition. Likewise, any variable that may be bound to a delayed expression and will eventually be referenced should occur in a demand construct at the top of the serial combinator definition. In the previous example, the spawn construct containing the call to f occurred at the top of the serial combinator body.

4.4. The Placement of Waits

In order to minimize the time that tasks spend waiting, each serial combinator should accomplish as much as possible before encountering a wait construct. Therefore wait constructs should occur as low as possible in the body of each serial combinator.

The most obvious place for a wait construct is immediately before a variable reference, where it would contain only that variable. Suppose that the variable reference occurs within an arithmetic expression. Such an expression should be evaluated using the registers and stack of the local processor to hold the operands and intermediate values. If the expression (+ (* x y) (- z w)) occurs within the body of a function and only the values of x and y are known to be available, then z and w will have to occur in a wait construct. A straightforward translation of the expression into serial combinator form would be

(+ (* x y) (- (wait (z) z) (wait (w) w)))

However, there are two problems with this expression:

1. Intermediate values: If evaluation of the expression has to suspend until the value of z or w has arrived, then the value of (* x y) may already have been computed and would need to be saved. Having to save intermediate values during suspension could result in a significant overhead in space and time.

2. Multiple Suspensions: After the value of z had returned, the task evaluating the expression would resume executing. However, before any useful computation could be performed, the task would again be suspended if the value of w had not yet arrived. Therefore, the execution of the task would have been suspended and resumed twice, even though resuming execution after the value of z returned provided no benefit.

A better translation of the original expression would be

(wait (z w) (+ (* x y) (- z w)))

Once the evaluation of the arithmetic part of this new expression begins, all of the needed values are available and execution will not have to suspend. This means that no intermediate value will have to be preserved during a suspension, and only one wait construct (and thus only one suspend/resume) is sufficient to evaluate the expression.

In general, the execution times of expressions involving only primitive operations are short. Therefore, lifting a wait construct out of such an expression should not cause a task to wait much earlier than necessary.
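As a sketch of this placement rule (not the compiler's code), one can compute the unavailable free variables of a primitive expression and wrap the whole expression in a single wait; the nested-tuple encoding of expressions is invented for this sketch:

```python
# Hypothetical sketch: gather every variable of the expression that is not yet
# available, then emit one wait around the whole expression, so that
# evaluation suspends at most once and keeps no intermediate values.

def free_vars(expr):
    if isinstance(expr, str):                 # a variable
        return {expr}
    if isinstance(expr, tuple):               # (op, arg1, arg2, ...)
        return set().union(*map(free_vars, expr[1:]))
    return set()                              # a constant

def place_wait(expr, available):
    missing = sorted(free_vars(expr) - available)
    return ('wait', missing, expr) if missing else expr
```

For (+ (* x y) (- z w)) with only x and y available, this yields a single wait on w and z around the whole expression, as in the better translation above.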

The only places a wait construct can occur without creating the problem of storing intermediate values during suspension are at the top of the body of a serial combinator definition and at a conditional. For example,

(f x y) -> g x, h y

can be translated as follows:

(spawn ((local v1 (f x y)))
  (wait (v1)
    (if v1
        (spawn ((local v2 (g x)))
          (wait (v2)
            v2))
        (spawn ((local v3 (h y)))
          (wait (v3)
            v3)))))

Not all conditionals may contain wait constructs, however. If a conditional expression is nested within another expression, a wait construct might create the problem of storing intermediate values. In this case, we can treat a nested conditional expression as an invocation of a serial combinator, IF, that behaves like a conditional. For the rest of this section, we will assume that all conditionals that should not contain wait constructs have already been translated into calls to the IF combinator. This will simplify our presentation of the translation of supercombinators into serial combinators.

4.5. The Creation of Delayed Expressions

In graph reduction, the delayed evaluation of an expression is represented by a node in the program graph. The node serves as a closure, containing all the information that will be needed when the expression is ready to be evaluated.

We have not yet described how the creation of nodes representing the delayed evaluation of expressions is specified in serial combinators. The evaluation of an expression is delayed when the expression occurs as a non-strict argument in a serial combinator application. As mentioned in Section 4.2, every serial combinator application must occur in a spawn construct in order to be invoked. The arguments in these applications are assumed to be non-strict unless they are simply variables that have already been bound to expressions in spawn, demand, or let constructs. Given the expression

(f (g 1) 2)

if f is not strict in its first argument, the serial combinator version of the above expression would be

(spawn ((local v1 (f (g 1) 2)))
  (wait (v1)
    v1))

Since (g 1) was not explicitly spawned, a node in the graph is created to represent its delayed evaluation. If the value of the corresponding bound variable in the body of f is demanded then evaluation of (g 1) will commence.
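The behavior of such a node can be sketched as a cached thunk; the class name, field names, and demand method here are invented stand-ins for the graph node and the demand operation:

```python
# Hypothetical sketch of a delayed-evaluation node: the closure stores the
# suspended call, and demanding it forces the call at most once, caching the
# result so later demands reuse the value.

class DelayedNode:
    def __init__(self, fn, *args):
        self.state = 'unevaluated'
        self.fn, self.args = fn, args
        self.value = None

    def demand(self):
        if self.state != 'evaluated':
            self.value = self.fn(*self.args)   # evaluation commences here
            self.state = 'evaluated'
        return self.value
```

For the example, the node for (g 1) would be built when f is invoked, but g itself runs only if f demands the corresponding bound variable.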

If, however, f is strict in its first argument (and g is sufficiently complex) then the serial combinator version of the previous expression would be

(spawn ((v1 (g 1)) (local v2 (f v1 2)))
  (wait (v2)
    v2))

In this case, we can see that the evaluation of (g 1) has already started by the time f is called. In Section 7 we describe the mechanism for passing unevaluated (or currently evaluating) arguments to serial combinators.


4.5.1. Lazy Creation of Delayed Expressions

In his dissertation, Hughes(24) describes an optimization that reduces the cost of representing delayed expressions by ensuring that at most one node is created for each non-strict argument in a function invocation. In conventional graph reduction, several nodes may be created to represent a non-strict argument. For example, in the serial combinator expression

(spawn ((v1 (f (g (h x 1) (h y 2)) 3))) v1)

where f is not strict in its first argument, the straightforward way to create a delayed expression for (g (h x 1) (h y 2)) would be to create a node representing the delayed invocation of g and two other nodes representing the delayed invocations of h. Figure 6 illustrates how three nodes are used to represent the delayed expression (g (h x 1) (h y 2)). A new serial combinator foo is defined as

foo x y == g (h x 1) (h y 2);

and our original expression becomes (f (foo x y) 3). We call this optimization lazy creation. [Hughes left it unnamed.] Only one node is needed to represent the delayed expression (foo x y), as shown in Fig. 7. Only if the value of (g (h x 1) (h y 2)) is needed will the subgraph pictured in Fig. 6 be constructed. In general, a subgraph representing a delayed expression may be arbitrarily large. Thus lazy creation can provide a significant savings.
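Lazy creation can be sketched with an ordinary closure (Python rather than ALFL; the function `lazy_argument` is invented): one closure stands in for the single foo node, and the inner applications of g and h come into existence only if the value is demanded.

```python
# Hypothetical sketch of lazy creation: rather than building three nodes up
# front for (g (h x 1) (h y 2)), build one closure for the whole argument.

def lazy_argument(g, h, x, y):
    def foo():                        # the single node for the whole argument
        return g(h(x, 1), h(y, 2))    # subgraph built only on demand
    return foo
```

Constructing the closure performs no calls at all; forcing it evaluates the whole subexpression, mirroring how the foo subgraph is built only when its value is needed.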

Fig. 6. Multiple nodes representing a delayed expression.

Fig. 7. A single node representing an arbitrarily large delayed expression.

5. SERIAL COMBINATOR GENERATION

We now present the algorithm for transforming ALFL functions (in supercombinator form) into serial combinators.

5.1. Estimating Execution Time

In order to determine whether it is worthwhile to partition an expression into parallel tasks, it is necessary to be able to estimate its execution time. Determining the execution time precisely is, in general, impossible. We must be content to use some heuristic, and, in fact, we use an extremely simple one. It consists of assigning each primitive operator a complexity (based on the number of instructions required to perform the operation). A call to a non-recursive function is assigned the complexity of the body of that function. The complexity of a call to a recursive function or an unknown function is assumed to be infinite. This means that such a call is always assumed to have sufficient complexity to constitute a task that can be executed on a remote processor. Of course, this assumption occasionally will be wrong.
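The heuristic can be sketched as a recursive walk over an expression tree; the operator costs, the tuple encoding of expressions, and the function names below are all invented for this sketch and are not the compiler's actual values:

```python
# Hypothetical sketch of the complexity heuristic: primitive operators carry
# fixed costs, non-recursive calls cost what their bodies cost, and recursive
# or unknown calls are treated as infinitely complex.

import math

PRIM_COST = {'+': 1, '-': 1, '*': 3, '/': 5}   # invented instruction counts

def complexity(expr, defs, seen=()):
    """Estimate execution time of expr given combinator bodies in defs."""
    if not isinstance(expr, tuple):             # constant or variable
        return 0
    tag = expr[0]
    if tag == 'op':                             # ('op', operator, e1, e2)
        _, op, e1, e2 = expr
        return PRIM_COST[op] + complexity(e1, defs, seen) + complexity(e2, defs, seen)
    if tag == 'call':                           # ('call', name, arg, ...)
        name, args = expr[1], expr[2:]
        arg_cost = sum(complexity(a, defs, seen) for a in args)
        if name in seen or name not in defs:    # recursive or unknown call
            return math.inf
        return arg_cost + complexity(defs[name], defs, seen + (name,))
    raise ValueError(tag)
```

A self-referential definition is detected via the `seen` chain and reported as infinite, so it always qualifies as a remotely executable task.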

Once we have estimated the complexity of expressions in the program using this method, we are ready to translate the supercombinators into serial combinators. For purposes of this paper, we have simplified the discussion (in the following section) of how the complexity measure is used. A more complete discussion can be found in Ref. 25 where a whole chapter is devoted to this topic.

Estimated execution times can also be used to order the variable-expression pairs in a spawn, since tasks are spawned in left-to-right order. A complete discussion of this can also be found in Ref. 25. For the purpose of this paper, however, we choose to order the variable-expression pairs of a spawn construct in the order in which the variables will be needed in the body of the spawn. This way, the variables that are earlier will be spawned (and hopefully return) earlier. Thus, if the compiler decides to evaluate one of the spawned expressions locally, it will choose the first expression in the spawn construct (see Section 5.4).

5.2. The Serialize Algorithm

The procedure that performs the bulk of the work in transforming a supercombinator expression e into a serial combinator expression is called serialize. It takes the expression e and returns a tuple

⟨e', s, w, d, l, C⟩

where

• e' is a serial combinator expression which may contain spawn, wait, demand, and let constructs.

• s is a set of variable-expression pairs that should occur in a spawn construct before e' is evaluated.

• w is a set of variables that should occur in a wait construct before e' is evaluated.

• d is a set of variables that should occur in a demand construct before e' is evaluated.

• l is a set of variable-expression pairs that should occur in a let construct before e' is evaluated.

• C is a set of new serial combinator definitions that were generated by serialize.

We define serialize later in this section. The serialize_prog algorithm transforms a program, represented as a set of supercombinators, into a set of serial combinators. Here is the definition of serialize_prog(P), where P is a refined supercombinator program:

1. Let P = { F1 x11 ... x1k1 == e1;
             ...
             Fn xn1 ... xnkn == en;
             result e }

2. For each i, 1 ≤ i ≤ n, let ⟨e'i, si, wi, di, li, Ci⟩ = serialize(ei)

3. Let ⟨e', s, w, d, l, C⟩ = serialize(e)


4. For each i, 1 ≤ i ≤ n, redefine each Fi to be a serial combinator:

Fi xi1 ... xiki == insert_constructs(e'i, si, wi, di, li)

where insert_constructs creates an expression that contains the necessary spawn, wait, demand, and let constructs for si, wi, di, and li, respectively. The body of the expression (after these constructs) is e'i. We define insert_constructs later in this section.

5. Let CP be the set containing the definition of every serial combinator Fi generated in the previous step.

6. Let e'' = insert_constructs(e', s, w, d, l)

7. The serial combinator version of the program P consists of the set of new combinator definitions described by

CP ∪ C ∪ (C1 ∪ ... ∪ Cn)

and the result expression e''.

In simple terms, serialize decomposes a function application or binary expression as follows:

1. For each strict argument of sufficient complexity, serialize creates a serial combinator definition and puts a call to that serial combinator into a spawn construct.

2. Each strict argument that is not worth spawning is placed in a let construct.

3. For each non-strict argument, a new serial combinator definition is created and a call to that serial combinator is substituted into the expression. Since this call is not explicitly spawned, a node in the graph is created to represent its delayed evaluation (see Section 4.5).

Here is the formal definition of serialize. It takes an expression e and performs the following actions:

• If e is a constant c then return ⟨c, {}, {}, {}, {}, {}⟩

• If e is a bound variable x then return ⟨x, {}, {x}, {x}, {}, {}⟩

• If e is a conditional (if e1 e2 e3) then:

1. Let ⟨e'1, s1, w1, d1, l1, C1⟩ = serialize(e1)
       ⟨e'2, s2, w2, d2, l2, C2⟩ = serialize(e2)
       ⟨e'3, s3, w3, d3, l3, C3⟩ = serialize(e3)


2. Let d' = d1 ∪ (d2 ∩ d3)

3. Let e' = (if e'1
                insert_constructs(e'2, s2, (w2 - w1), (d2 - d'), l2)
                insert_constructs(e'3, s3, (w3 - w1), (d3 - d'), l3))

4. Return ⟨e', s1, w1, d', l1, (C1 ∪ C2 ∪ C3)⟩

• If e is a strict binary operation, say (+ e1 e2), perform the following steps. Assume that the complexity of e1 is less than the complexity of e2 (otherwise exchange e1 and e2 in the following steps).

1. Let ⟨e'1, s1, w1, d1, l1, C1⟩ = serialize(e1)
       ⟨e'2, s2, w2, d2, l2, C2⟩ = serialize(e2)

2. If the complexity of e1 is large enough (depending on the target machine) to warrant executing in parallel with e2 then

(a) Create a new identifier V.

(b) Let e''1 = insert_constructs(e'1, s1, w1, d1, l1). Thus e''1 consists of the expression e'1 preceded by the spawns contained in s1, the waits contained in w1, and so on.

(c) Define a new combinator F1 as follows:

F1 a1 ... an == e''1

where a1 ... an are the free variables in e1. Let C be the singleton set containing this new combinator definition.

(d) Let e' = (+ V e'2)

(e) Let p be the variable-expression pair (V (F1 a1 ... an)).

(f) Return ⟨e', (s2 ∪ {p}), (w2 ∪ {V}), d2, l2, C1 ∪ C2 ∪ C⟩

3. Otherwise, (+ e1 e2) should not be decomposed. Return

⟨(+ e'1 e'2), (s1 ∪ s2), (w1 ∪ w2), (d1 ∪ d2), (l1 ∪ l2), (C1 ∪ C2)⟩

• If e is an application (e0 e1 ... en) then

1. For each i, 0 ≤ i ≤ n, let ⟨e'i, si, wi, di, li, Ci⟩ = serialize(ei)

2. Define the following sets:

S = {i | e0 is strict in its ith argument}
P = {i | i ∈ S and ei is too simple to evaluate remotely}
Q = {j | j ∈ S and ej is sufficiently complex to evaluate remotely}
R = {k | k ∉ S, 1 ≤ k ≤ n}

3. Let vi be a new identifier for each i ∈ S, and let V also be a new identifier.


4. For each j ∈ (Q ∪ R), let e''j = insert_constructs(e'j, sj, wj, dj, lj) and define a new combinator Fj:

Fj aj1 ... ajnj == e''j

where aj1 ... ajnj are the free variables in ej. Let C be the set of all the new combinator definitions.

5. Let l = {(vj e'j) | j ∈ P}

6. For each j ∈ Q, let pj be the variable-expression pair (vj (Fj aj1 ... ajnj)). Also, let p be the variable-expression pair (V (e'0 x1 ... xn)) where, for 1 ≤ i ≤ n,

xi = vi if i ∈ S
xi = (Fi ai1 ... aini) if i ∈ R

7. Let s = {pj | j ∈ Q} ∪ {p}

8. Let w = ∪_{j∈P} wj, and let d = ∪_{j∈P} dj.

9. Return ⟨V, s, w, d, l, (C0 ∪ ... ∪ Cn) ∪ C⟩

5.3. Insert_constructs and Top_sort

Insert_constructs is defined as follows:

insert_constructs(e, s, w, d, l) = (demand d
                                      top_sort(e, s, w, l))

It creates a demand construct containing the variables in d. The body of the demand construct is a serial combinator expression generated by top_sort. Since only formal parameters can occur in a demand list, they can be demanded before any spawns or lets are required for binding variable names.

Top_sort takes a list of spawns, waits, and lets and creates a serial combinator expression with the appropriate constructs. Since expressions in a let construct may reference variables bound in a spawn construct and vice versa, the let, spawn, and wait constructs have to be (topologically) sorted so that variables are bound before they are used. [A topological sort is required because spawn, wait, and let expressions are placed in separate lists during serialize. This contributes to the loss of information about the order in which the expressions occurred in the original program.]

Top_sort(e, s, w, l) behaves as follows:

• If s = {} and l = {} then let e' = (wait w e) and return e'.



• Otherwise, define the following sets:

s' = {(vi expi) | (vi expi) ∈ s and expi contains no occurrence of a variable bound in s}

l' = {(vj expj) | (vj expj) ∈ l and vj occurs free in an expression in s'}

w' = {vk | vk ∈ w and vk occurs free in an expression in l'}

• Let e' = top_sort(e, (s - s'), (w - w'), (l - l')) and return

(wait w'
  (let l'
    (spawn insert_spawn(s', e')
      e')))

The procedure insert_spawn(s', e') arranges the variable-expression pairs in s' in the order in which the variables are referenced in e'.

Consider the following example of the use of insert_constructs. If

s = {(v1 (f x y)) (v2 (g v4 2)) (v3 (f 2 3))}
l = {(v4 (+ x y))}
w = {v1, v2, v3, x}
d = {x}

then insert_constructs((+ v1 (* v3 v2)), s, w, d, l) will return

(demand (x)
  (spawn ((v1 (f x y)) (v3 (f 2 3)))
    (wait (x)
      (let ((v4 (+ x y)))
        (spawn ((v2 (g v4 2)))
          (wait (v1 v2 v3)
            (+ v1 (* v3 v2))))))))
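The ordering problem behind top_sort can be sketched independently of the construct syntax; this simplified Python version (with an invented dict-of-dependencies encoding) merely orders variable-expression pairs so that every variable is bound before an expression that uses it:

```python
# Simplified sketch of top_sort's dependency ordering: repeatedly emit every
# pair whose expression mentions no still-unplaced variable, failing on cycles.

def order_bindings(pairs):
    """pairs maps each variable to the set of variables its expression uses."""
    ordered, pending = [], dict(pairs)
    while pending:
        ready = [v for v in pending
                 if not (pending[v] & (pending.keys() - {v}))]
        if not ready:
            raise ValueError('cyclic bindings')
        for v in sorted(ready):
            ordered.append(v)
            del pending[v]
    return ordered
```

On the dependency sets of the example (v1 and v3 depend only on formals, v4 on the formals x and y, v2 on v4), it yields v1, v3, v4, v2, matching the nesting of constructs in the result.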

5.4. The Clean-up Phase

A serial combinator expression generated by serialize may contain redundant spawn or wait constructs. For example, the expression

(if (= x 1) (if (= y 1) (+ x y) 1) 2)


would be translated into

(demand (x)
  (wait (x)
    (if (= x 1)
        (demand (y)
          (wait (y)
            (if (= y 1)
                (demand (x)
                  (wait (x)
                    (+ x y)))
                1)))
        2)))

The last phase of the translation of supercombinators into serial combinators, called the clean-up phase, is a pre-order tree traversal over the serial combinator bodies (generated by serialize) to remove redundant demands and waits. The cleaned-up version of this serial combinator expression would be

(demand (x)
  (wait (x)
    (if (= x 1)
        (demand (y)
          (wait (y)
            (if (= y 1)
                (+ x y)
                1)))
        2)))

The clean-up phase also attaches the label "local" to a variable-expression pair in a spawn construct if appropriate. Any spawn construct satisfying the following conditions will have its first variable-expression pair modified with the local label.

�9 The spawn construct is immediately followed by a wait construct.

�9 The first variable bound in the spawn construct occurs in the wait construct.

Since evaluation will have to suspend until the first expression in the spawn construct becomes evaluated, that expression should be executed locally.

For example, the expression (+ (f x y) (g x y)) would be translated by serialize into

(spawn ((v1 (f x y)) (v2 (g x y)))
  (wait (v1 v2)
    (+ v1 v2)))


This expression would be translated by the clean-up phase into

(spawn ((local v1 (f x y)) (v2 (g x y)))
  (wait (v1 v2)
    (+ v1 v2)))

If the first variable bound in the spawn construct does not occur in the wait list, no variable bound in the spawn construct can occur in the wait list. This is because the first variable bound in the spawn construct is the first variable whose value is required in the body (see insert_spawn).
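The local-label rule can be sketched as a one-line rewrite; the list-of-tuples representation of spawn pairs and the function name are invented for illustration:

```python
# Hypothetical sketch of the clean-up rule: given the pairs of a spawn
# construct and the variable list of the wait construct that immediately
# follows it, tag the first pair "local" when its variable is waited on.

def mark_local(spawn_pairs, wait_vars):
    if spawn_pairs and spawn_pairs[0][0] in wait_vars:
        return [('local',) + spawn_pairs[0]] + spawn_pairs[1:]
    return spawn_pairs
```

Applied to the pairs of the example above, only the first pair acquires the label, since evaluation must suspend on it anyway.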

5.5. Example

Here is a simple example of the translation from ALFL programs into serial combinators. The serial combinator code is the actual output of the compiler (before code generation). The program is a divide-and-conquer factorial program:

{ pfac l h == l=h -> l,
              { pfacl mid == pfac l mid + pfac (mid + 1) h;
                result pfacl } ((l+h)/2);
  result pfac 1 10;
}

The only place where a significant amount of parallelism occurs is in the body of pfacl, in which pfac is invoked twice in parallel.

The supercombinator version is:

{ pfac l h == l = h -> l, pfacl l h ((l + h) / 2);
  pfacl l h mid == pfac l mid + pfac (mid + 1) h;
  result pfac 1 10 }

and the serial combinator version is:

{ pfac l h == (demand (l h)
                (wait (l h)
                  (if (= l h)
                      l
                      (let ((v1 (/ (+ l h) 2)))
                        (spawn ((local v2 (pfacl l h v1)))
                          (wait (v2)
                            v2))))))


pfacl l h mid == (demand (mid l h)
                   (wait (mid)
                     (let ((v4 (+ mid 1)))
                       (spawn ((v5 (pfac v4 h))
                               (local v3 (pfac l mid)))
                         (wait (v3 v5)
                           (+ v3 v5))))))

result pfac 1 10 }

Even though pfac contains a spawn construct, the spawned expression is evaluated locally, and thus no attempt is made at exploiting (non-existent) parallelism. Only in the body of pfacl is parallelism exploited, by spawning two expressions simultaneously (one of them locally).

6. A HETEROGENEOUS GRAPH REDUCTION MODEL

Previously, we stated that every function call in the program had to be an invocation of a serial combinator and had to create a node in the program graph. This is because a node is the only kind of activation record available in graph reduction. However, for many serial combinator invocations, the full power of graph reduction, namely support for lazy evaluation, sharing, and parallelism, is not needed. In this section we describe a heterogeneous evaluation model that incorporates both graph reduction and conventional stack-based evaluation.

The graph reduction model, while extremely powerful and general, fails to exploit a particular strength of current multiprocessor architectures: The hardware and instruction set of each processor have been optimized for the execution of sequential programs written in first-order, call-by-value programming languages. The organization of these machines is centered around the use of registers for performing primitive operations on data and the use of a stack to provide a mechanism for executing procedure calls.

Many parts of a functional program may be practically sequential, and for some practically sequential expressions, applicative order evaluation preserves the termination properties of normal order evaluation. Any expression that exhibits both these properties can therefore be efficiently executed in a conventional manner, utilizing only the stack and registers of the host processor (although higher-order functions may require the creation of closures in a heap).


6.1. Stack Execution of Serial Combinators

We say a serial combinator is stack executable if:

• It calls at most one function at a time; that is, it never makes several function calls in parallel.

• It never "forks" a function call. That is, it never proceeds without waiting for the value of a function call to return.

• It only calls functions that are themselves stack executable.

Likewise, we define a stack executable spawn to be a spawn construct with the following properties:

1. It contains a single variable-expression pair (v e).

2. It is immediately followed by a wait construct containing the variable v.

3. The expression e is an application of a stack-executable serial combinator to arguments that are already evaluated.

All spawns in a stack executable serial combinator must be stack executable spawns.

6.2. Modifying Serial Combinators

We define a new construct, called stack-spawn, to indicate that a serial combinator call can be stack allocated. This construct is of the form,

(stack-spawn ((v1 exp1) ... (vn expn)) body)

and indicates that each expression expi is a serial combinator call and should be evaluated immediately on the local processor by creating an activation record for it on the stack. The value returned by this conventional evaluation of expi is bound to the variable vi. The stack-spawn construct is similar to the let construct, with the exception that an activation record is required for each expression.

The first step in creating a new set of serial combinator definitions that can be executed on the stack is to determine which serial combinators are stack executable according to the conditions listed previously. For any program represented by serial combinators, we would like to find the set S of stack executable serial combinators. Obviously, the third condition stated above requires the solution of a recursive set equation. In order for a function f to be an element of S, all functions called by f must be elements of S. If stack(f, S) indicates, given S, whether the function f is stack executable according to the conditions stated before, then

S = {f | stack(f, S) = true}

A fixpoint iteration method is used to solve for S. The initial set S^0 consists of all serial combinators in the program.

S^(i+1) = {f | stack(f, S^i) = true}

When a fixpoint is reached, i.e., S^(j+1) = S^j for some value of j, then we have solved the set equation and S = S^j. We are guaranteed to reach a fixpoint because each iteration can only remove functions from a set that was originally finite. Furthermore, since S^0 contains all functions, our fixpoint will be the largest set containing only stack-executable functions.
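The iteration can be sketched directly; the program encoding, mapping each combinator to the set of combinators it calls plus a flag for the two purely local conditions, is invented for illustration:

```python
# Hypothetical sketch of the fixpoint iteration for the stack-executable set.
# program maps name -> (called_names, sequential), where `sequential` says the
# combinator satisfies the first two (call-at-a-time, no-fork) conditions.

def stack_executable(program):
    S = set(program)                       # S^0: all serial combinators
    while True:
        S_next = {f for f, (calls, sequential) in program.items()
                  if sequential and calls <= S}
        if S_next == S:                    # fixpoint reached
            return S
        S = S_next
```

Each pass can only shrink the set, so termination is guaranteed, and starting from all combinators yields the greatest solution, as argued above.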

The definition of stack(f, S) is straightforward. It traverses the serial combinator version of the body of f to determine if it is stack executable, again according to the conditions (given the set S of stack executable functions).

For each stack executable serial combinator f, two definitions are generated. The first definition specifies the behavior of the combinator when a call is to be evaluated using graph reduction. This is necessary in case some of the arguments in a call represent unevaluated expressions. The second definition of f specifies its behavior when executed on the stack. In either case, all occurrences of stack executable spawns are converted to stack-spawns.

As a simple example, consider the factorial function (written in ALFL):

fac x == x=0 -> 1, x * fac (x-1);

The serial combinator version of fac would be:

fac x == (demand (x)
           (wait (x)
             (if (= x 0)
                 1
                 (let ((v1 (- x 1)))
                   (spawn ((local v2 (fac v1)))
                     (wait (v2)
                       (* x v2)))))))

Since the spawn in the body of fac is a stack executable spawn and fac is stack executable, the two definitions for fac are:


g_fac x == (demand (x)
             (wait (x)
               (if (= x 0)
                   1
                   (let ((v1 (- x 1)))
                     (stack-spawn ((v2 (s_fac v1)))
                       (* x v2))))))

s_fac x == (if (= x 0)
               1
               (let ((v1 (- x 1)))
                 (stack-spawn ((v2 (s_fac v1)))
                   (* x v2))))

g_fac is the version that utilizes graph reduction and s_fac is the stack executable version. Any call to fac with arguments that are either unevaluated or currently evaluating must be a call to g_fac. Within g_fac the argument to the recursive call is already evaluated; thus g_fac calls s_fac.

7. ALFALFA: GRAPH REDUCTION ON A HYPERCUBE MULTIPROCESSOR

In the preceding sections we described the translation of an ALFL program into a set of serial combinators that specify the behavior of parallel tasks. In this section we describe an implementation, called Alfalfa, of a heterogeneous graph reducer on the Intel iPSC hypercube multiprocessor.(26) In particular, we describe the Alfalfa run-time system that supports serial combinator reduction and dynamic load balancing.

7.1. The Intel iPSC

The Intel iPSC is a MIMD multiprocessor that can be configured with up to 128 Intel 80386 microprocessors (although our experiments were run on a machine using Intel 80286 microprocessors). Each processor has its own memory, and there is no shared memory. The processors are linked via a hypercube network; each processor sits at a vertex of an N-dimensional hypercube, where N is an integer between 0 and 7.

In an N-dimensional hypercube, there are 2^N processors. Each processor has N neighboring processors with which it can communicate directly. All communication between processors in the iPSC occurs via the passing of messages. Messages between nonneighboring processors are forwarded by intervening processors. The longest distance a message must travel within the iPSC is through N links (or "hops"). Message routing is performed by the operating system and is transparent to the programmer. The operating system provides the user with only a few communication primitives, such as send, blocking receive, and non-blocking receive.
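The addressing scheme can be sketched with bit operations: if processors are numbered 0 to 2^N - 1, neighbors differ in exactly one address bit, and the hop count between two processors is the Hamming distance of their addresses. This is an illustrative sketch, not iPSC system code:

```python
# Sketch of hypercube addressing: flipping each of the N address bits of a
# processor yields its N neighbors; the route length between two processors
# is the number of bits in which their addresses differ.

def neighbors(p, N):
    return [p ^ (1 << i) for i in range(N)]

def hops(p, q):
    return bin(p ^ q).count('1')
```

In a 3-dimensional cube, processor 0 has neighbors 1, 2, and 4, and the longest route (from 0 to 7) is 3 hops, matching the N-link bound stated above.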

7.2. The Alfalfa Run-time System

The Alfalfa run-time system is a completely distributed mechanism for distributing and executing the serial combinator code generated by the Alfalfa compiler. It is replicated identically on each processor in the system, and each copy is solely responsible for reducing the portion of the graph residing in the local memory. The major components of the run-time system, pictured in Fig. 8, are the graph reducer, message handler, diffusion scheduler, and storage manager.

7.3. The Graph Reducer

The synchronization constructs that serial combinators contain are simply calls to routines in the graph reducer module. These routines perform the necessary transformations on the graph. Before discussing how these operations are performed, we describe the data structures involved.

7.3.1. Data Structures

A node in the graph is a contiguous block of bytes that contains the following fields:

[Figure: each processor contains the serial combinator code, graph reducer, task queue, storage manager, message handler, and diffusion scheduler; local, outgoing, and returning tasks flow between the task queue and the network.]

Fig. 8. The Alfalfa run-time system.


• State: Either "unevaluated," "pending" (which means that the node is in the process of being evaluated), or "evaluated."

• Value: If the node has been evaluated, the value field contains the result. Otherwise it contains a pointer to code that specifies the computation to be performed when the value of the node is requested.

• Args: A vector containing the values of the arguments in the function call represented by the node. Each element contains either a value or a pointer to another node in the graph.

• Requests: A list of other nodes that have requested the value of this node.

• Evalfield: A bitfield indicating the status of each element in the args vector. If the ith bit of the bitfield is 1, then the ith argument has already been evaluated and contains a value. Otherwise the ith argument is a pointer to another node.

• Waitmask: A bitfield indicating which arguments must be evaluated before evaluation of the node can proceed. Evaluation proceeds when, for every 1 in the waitmask, there is a corresponding 1 in the evalfield.

• RefCount: The reference count of this node, for storage reclamation purposes.
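The evalfield/waitmask interaction reduces to two bit operations; the function names in this sketch are invented:

```python
# Sketch of the readiness test: a suspended node may resume exactly when
# every bit set in the waitmask is also set in the evalfield.

def mark_evaluated(evalfield, i):
    return evalfield | (1 << i)          # record that argument i has a value

def ready(evalfield, waitmask):
    return evalfield & waitmask == waitmask
```

A node waiting on arguments 0 and 2 (waitmask 0b101) stays suspended after argument 0 returns and wakes only once argument 2 has returned as well.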

A pointer to a node in the graph has two fields:

• Processor: The address (processor id) of the processor upon which the node resides.

• Node: The address of the node on its host processor.

A task is an instruction for the run-time system. [Notice that the task data structure is different from the notion of a task as a process used in previous sections of this paper.] Program execution proceeds by repeatedly removing tasks from the task queue of each processor and performing the action specified by the task. There are three kinds of tasks:

• An evaltask contains pointers to a target node and a source node. It indicates that the value of the target node is being requested by the source node.

• A returntask contains a pointer to a target node and a value. It indicates that the value is being returned to the target node as a result of evaluating some other node.

• A buildtask contains a node description and indicates that a new node matching the node description should be built and evaluated.


7.4. Execution

7.4.1. Handling Tasks

Execution begins via the creation of a collection of nodes representing the initial graph. An evaltask requesting the value of the root node of the graph is placed on the task queue of the processor on which the root node resides. The run-time system on each processor then removes tasks, if present, from the local task queue.

If an evaltask is encountered by the reducer and the target node n is unevaluated, evaluation of n proceeds by a jump to the code pointed to by n's value field. This code, of course, is the code generated for a serial combinator by the Alfalfa compiler. When the serial combinator code finishes executing, a returntask is created to return the resulting value v to any requesting node; n's state is changed to "evaluated" and its value field is overwritten with v. If n's state is already "evaluated" when the evaltask is received, the reducer immediately creates a returntask carrying n's value.

If a buildtask is encountered, a node (or collection of nodes) matching the description is created in the local graph space, and execution proceeds as though an evaltask had been encountered for the root of the new subgraph.

If a returntask returning a value to a node n is encountered, the appropriate elements of n's args vector and evalfield are updated. If the evalfield now has a 1 in every bit position that the waitmask does, n is ready to be awakened. This is accomplished by simply jumping to the code pointed to by n's value field. Otherwise, no action is taken.
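The awakening test ("a 1 in every bit position that the waitmask has") is a single mask comparison. A minimal sketch in C; the helper name is ours:

```c
#include <stdint.h>

/* A node may resume when every bit set in its waitmask is also set in
   its evalfield, i.e. every awaited argument has returned a value. */
int node_ready(uint32_t evalfield, uint32_t waitmask)
{
    return (evalfield & waitmask) == waitmask;
}
```

For example, with waitmask 101 (arguments 0 and 2 awaited), an evalfield of 100 is not ready, while 101 or 111 is.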

7.4.2. Executing the Synchronization Constructs

We now describe what effects the demand, wait, and spawn constructs have on a node n during execution. Figure 9 shows the state of a node before it is evaluated.

In a demand construct, the list of variables has been translated into a list of indices into n's args vector. For each argument index i being demanded, if the ith bit of n's evalfield is 1 (i.e. the ith argument has already been evaluated) then no action is taken. Otherwise an evaltask is created to request the value of the ith argument. The execution of n's code continues without blocking.
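The demand logic above can be sketched as follows. This is a hypothetical reconstruction: the function computes which demanded arguments still need evaltasks issued; its name and calling convention are ours, not Alfalfa's.

```c
#include <stdint.h>

/* Given the node's evalfield and the compiled list of demanded argument
   indices, collect the indices for which an evaltask must be created.
   Arguments whose evalfield bit is already 1 need no action; execution
   of the node's code then continues without blocking. */
int demand_requests(uint32_t evalfield,
                    const int *indices, int n_indices,
                    int *requests)
{
    int n_requests = 0;
    for (int i = 0; i < n_indices; i++) {
        int idx = indices[i];
        if (!((evalfield >> idx) & 1u))    /* argument idx not yet a value */
            requests[n_requests++] = idx;  /* an evaltask will be created  */
    }
    return n_requests;
}
```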

[Fig. 9. A node in its initial state. In graph space, the node's evalfield is 010 and its waitmask is 000; its value field points into code space, to the code for F:

    if (n->args[1].value == 6) then
        getvalue(n,0); getvalue(n,2);
        n->waitmask = 5; n->value.cont = Fcont1;
    else returnvalue(n,1);

followed by the code for Fcont1.]

The execution of a wait construct causes n's waitmask to be modified so that if the ith argument occurs in the wait, the ith bit of the waitmask is set to 1. n's value field is modified so that it points to a continuation, namely the code that will be executed when the needed arguments return. If all the arguments occurring in the wait construct have already been evaluated, then execution proceeds via a jump to the continuation. Otherwise, control is relinquished to the run-time system to carry out the next task. Figure 10 shows the state of a node when all needed arguments have returned and it is about to resume.

When the value of one of n's arguments becomes available, the appropriate elements of n's args vector and evalfield are updated. If the evalfield now has a 1 in every bit position that the waitmask does, n is ready to be awakened. Again, this is accomplished by simply jumping to the code pointed to by n's value field. Otherwise, no action is taken.

[Fig. 10. A suspended node about to resume. The node's evalfield is now 101, matching its waitmask (5); its value field points to the code for Fcont1:

    returnvalue(n, n->args[0].value + n->args[2].value);
]

The spawn construct causes a buildtask to be created for each expression in the spawn. For each expression, the run-time system invokes its dynamic scheduler to select a processor on which to create the new node representing the spawned expression. A description of the node, as specified by the compiler, is inserted into the task, and the task is placed in the appropriate processor's task queue. A slot in n's args vector is reserved for the value of the new node when it returns.

The result of the program is the value returned by the root node of the graph.

7.5. Storage Reclamation

For simplicity, Alfalfa uses a distributed reference counting algorithm similar to that described in Ref. 27 for storage reclamation. Unfortunately, this involves substantial message passing overhead to maintain correctness. In future versions of Alfalfa, we intend to utilize either generational reference counting(28) or weighted reference counting,(29,30) two schemes with substantially better communication performance.

7.6. Dynamic Scheduling

The purpose of Alfalfa's dynamic scheduler is to choose a processor on which to allocate each new node in the graph. We tested a number of scheduling algorithms on Alfalfa, each belonging to a category that we call diffusion scheduling.

Diffusion scheduling is a class of algorithms for dynamically scheduling tasks on distributed memory multiprocessors. Diffusion scheduling is completely decentralized; each processor is responsible for deciding whether to execute a task locally or to send it to another processor. The decisions of whether to allocate work on another processor, and on which processor to allocate the work, are made locally.

In the diffusion scheduling algorithms that we have tested on Alfalfa, the information available to a processor is restricted to local load information and possibly load information about its nearest neighbors. In general, diffusion scheduling algorithms may allow each processor to possess information about an arbitrary number of processors. A processor can only allocate work onto a nearest neighbor, and cannot directly affect the state of a non-neighboring processor. A processor only sends work to a neighbor when it believes that the neighbor has a lower work load. At the start of the computation, work is generally allocated to a small number of processors. As the parallelism of the computation increases, the work diffuses over the other processors.

In a multiprocessor such as the Intel iPSC, in which communication is expensive, the diffusion scheduling algorithms that we use have two attractive features:


1. Since each processor need only be aware of the load of its neighbors, the amount of communication required to keep the information up to date is relatively small.

2. Since each processor can allocate work only on neighboring processors, most communication occurs between neighbors.

However, our diffusion scheduling algorithms are less responsive to changes in the computational demands of a program than less restrictive methods would be. If a few processors are suddenly saturated by an explosive growth in parallelism, they are less able to send some of the work to remote processors that may be underutilized. This can lead to a poorer distribution of work through the system.

The diffusion scheduling used by Alfalfa is similar in spirit to the scheduling used in the Rediflow multiprocessor(31) (which has only been simulated to date). The results of simulation experiments involving fixed combinator reduction using diffusion scheduling were reported previously by Hudak and Goldberg.(32) Those preliminary experiments indicated that simple load balancing heuristics often perform as well as more sophisticated strategies. As we shall see, the results presented here support this conclusion.

In Alfalfa, nodes are non-migrating: once the diffusion scheduler has allocated a node on a given processor, the node remains there until it is reclaimed. The rationale for this is that the overhead required to migrate a previously allocated node is often greater than the computation required to evaluate it. This is based on observations made during simulations prior to Alfalfa's implementation.

In Alfalfa, a processor's load is measured by the length of its task queue. In the following sections, we describe how and when this load information is transmitted to neighboring processors.

7.7. Test Programs

We tested Alfalfa on four programs. They were:

1. Parallel Factorial (pfac): This is a simple divide-and-conquer algorithm for computing factorial. The ALFL and serial combinator code for pfac was given in Section 5.5.

2. Eight Queens (queens): This program finds all solutions to the eight queens problem by performing a parallel search through possible board configurations. The queens program decomposes nicely for multiprocessor execution. Large numbers of serial combinator calls may be evaluated in parallel, and each serial combinator has a substantial sequential component. Unfortunately, space considerations preclude us from showing the ALFL and serial combinator code for this program and the following programs.

3. Adaptive Quadrature (quad): This is a numerical algorithm for approximating the area under a curve. The interval of interest is partitioned into subintervals whose areas are approximated by trapezoids. In order to increase accuracy, the subintervals have varying widths based on the shape of the curve. In sections where the curve is relatively straight, the subintervals may be large; the subintervals are smaller in sections where the curve is less well-behaved. Adaptive quadrature is highly parallel. However, the computation of the area of each trapezoid is a very simple operation, and thus the granularity of the computation is relatively fine. Nevertheless, the performance of Alfalfa when executing this program was quite good.

4. Matrix Multiplication (matmult): This program performs standard matrix multiplication; each matrix is represented as a vector of vectors. Although matrix multiplication has a high degree of parallelism, the execution of matmult requires the distribution of copies of the rows and columns of the matrices. This has a significant negative effect on Alfalfa's performance because of its high communication costs. When this program is run on a single processor, no copying of the rows and columns of the initial matrices is required.

Alfalfa performed quite well on the first three programs. Using a variety of diffusion methods, significant (although not linear) speedup over the single-processor case was achieved.

The results of experiments using Alfalfa are described in detail in Refs. 23 and 33. In this section, we give a brief description of two of the diffusion methods used and present the best results gained from using them for each test program.

7.8. Non-Communicating Diffusion Scheduling

In non-communicating diffusion scheduling, no load information is transmitted between processors. Each processor chooses, based only on its local state, whether to execute a new task locally or to send it to a neighboring processor.

In our non-communicating diffusion strategy, a processor chooses to send a new task to a remote processor whenever the number of tasks on its local task queue surpasses some threshold. This threshold depends on the number of processors in the system: the greater the number of processors, the lower the threshold. The intent is to distribute tasks more widely throughout a large multiprocessor system. Once the threshold is surpassed, a processor allocates new tasks to neighboring processors in round-robin fashion.
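The non-communicating policy can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the structure, names, and the way the threshold is supplied are ours; the paper does not give the actual code.

```c
/* Non-communicating diffusion: run locally until the local queue length
   exceeds a threshold (chosen lower for larger machines), then deal new
   tasks to neighbors round-robin. */
typedef struct {
    int queue_len;      /* length of the local task queue */
    int threshold;      /* derived from the system size at startup */
    int n_neighbors;
    int next_neighbor;  /* round-robin cursor */
} Sched;

/* Returns -1 to execute the task locally, else the neighbor to send to. */
int place_task(Sched *s)
{
    if (s->queue_len <= s->threshold) {
        s->queue_len++;                  /* keep the task on this processor */
        return -1;
    }
    int n = s->next_neighbor;            /* overloaded: pick next neighbor */
    s->next_neighbor = (s->next_neighbor + 1) % s->n_neighbors;
    return n;
}
```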

7.9. Communicating Diffusion Scheduling

In communicating diffusion scheduling, load information is transmitted between neighboring processors during execution. In our strategy, the load of a processor is reported to its neighbors whenever its task queue length differs by at least a factor of two from the previously reported value and either the old value or the new value is greater than some threshold. Since a processor's task queue length will often fluctuate between 0 and 1 or between 1 and 2, the threshold should be large enough to prevent messages from being sent due to minor changes in load. A large threshold value means that fewer load messages will be sent by lightly loaded processors. The factor-of-two requirement was chosen somewhat arbitrarily, and arose from the observation that small changes in queue length are more significant on a lightly loaded processor than on a heavily loaded one.
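The reporting rule just described (a factor-of-two change, gated by a threshold) reduces to a small predicate; a sketch with a name of our choosing:

```c
/* Report the load when the queue length has at least doubled or halved
   relative to the last reported value, AND at least one of the two
   lengths exceeds the threshold. */
int should_report(int reported, int current, int threshold)
{
    int changed = (current >= 2 * reported) || (2 * current <= reported);
    int big = (reported > threshold) || (current > threshold);
    return changed && big;
}
```

With a threshold of 3, for instance, a jump from 4 to 8 is reported, while a fluctuation from 1 to 2 is suppressed even though it is a factor-of-two change.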

A new task is allocated on the neighboring processor with the least reported load, but only if the local processor's load is greater than that neighbor's load by some amount. Our experiments varied the difference in load required for a task to be sent remotely, as well as the frequency with which load messages were sent (based on the threshold described in the preceding paragraph).

7.10. Comparing the Diffusion Methods

The best performance of each diffusion method for each program is plotted in Figs. 11-14.

The most striking aspect of these results is that (at least in Alfalfa) there is little difference between the best performances of the communicating and non-communicating diffusion algorithms on 32 processors. As one would expect, communicating diffusion performed better in systems with small numbers of processors, where all the processors became saturated. If a program is very large, or if many programs are running simultaneously, communicating diffusion may perform better on large multiprocessor systems as well. Nevertheless, it is clear that in many cases there is no need to develop sophisticated load balancing strategies, because simple strategies work just as well.

[Fig. 11. The execution times for pfac on Alfalfa: non-communicating diffusion, communicating diffusion, and linear speedup, plotted against the number of processors.]

[Fig. 12. The execution times for queens on Alfalfa, plotted in the same form.]

[Fig. 13. The execution times for quad on Alfalfa.]

[Fig. 14. The execution times for matmult on Alfalfa.]


Both strategies performed well on all programs except matmult. This is due to the large amount of shared data in the program that must be sent via message passing between processors. As we shall see in the next section, this problem goes away on a shared-memory multiprocessor.

8. BUCKWHEAT: GRAPH REDUCTION ON A SHARED MEMORY MULTIPROCESSOR

Buckwheat is an implementation of a heterogeneous graph reducer on the Encore Multimax, a shared memory multiprocessor.(34) In many ways, Buckwheat is similar to Alfalfa. We therefore present a brief description of Buckwheat that covers only those aspects that differ from Alfalfa.

8.1. The Encore Multimax

The Encore Multimax is a bus-based shared memory multiprocessor. Buckwheat was implemented on a system that contained twelve processors, each a 10 MHz National Semiconductor NS32032 microprocessor. [Like the Intel iPSC, the Encore Multimax has been upgraded since this research was performed.] Any location in memory can be accessed by any processor over a very fast bus called the Nanobus. An important feature of the shared memory in the Multimax is that any byte can be used as a lock (for enforcing mutual exclusion, etc.). Atomic test-and-set instructions are supported in order to set and reset these locks.
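The byte-as-lock mechanism can be illustrated with C11 atomics, whose atomic_flag provides the same test-and-set primitive. This is a portable sketch, not the NS32032 instruction sequence Buckwheat actually used:

```c
#include <stdatomic.h>

/* Any byte can serve as a lock; atomic test-and-set acquires it. */
typedef atomic_flag ByteLock;

void lock_acquire(ByteLock *l)
{
    while (atomic_flag_test_and_set(l))
        ;  /* spin until the previous holder clears the flag */
}

void lock_release(ByteLock *l)
{
    atomic_flag_clear(l);
}
```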

8.2. System Organization

In graph reduction, the program graph logically resides in a single space. Thus, a shared memory multiprocessor is the most natural architecture on which to implement graph reduction. On the Multimax, any processor can access any component of the program graph. Naturally, access to any node in the graph that is being mutated must be restricted to the processor performing the mutation.

Buckwheat's processors are self-scheduled. That is, when a processor becomes free it removes a task from a shared task queue and performs the action dictated by the task. No processor needs to be aware of the state of any other processor in the system.

The organization of Buckwheat is pictured in Fig. 15. Each processor has a private copy of the graph reducer module, the serial combinator code, and the storage manager. Even though the Multimax has a single physical memory, multiple copies of these modules allow the processors to execute the routines without memory contention. Of course, there may still be contention for the bus. However, the Nanobus is fast enough that the effect of bus contention is minimal.

[Fig. 15. The Buckwheat system: each processor holds its own reducer and storage management modules, while the graph space and the task queue structure (current and new tasks, values and pointers, graph transformations) reside in shared memory.]

The graph space and task queue structure reside in a shared area of memory. In its simplest form, the queue structure consists of a single queue from which all processors access tasks to be executed. A more sophisticated task queue structure is described in Section 8.4.

8.3. Node Representation

The graph structures in Buckwheat are identical to those in Alfalfa with two exceptions:

1. A node pointer is a standard (32 bit) pointer.

2. Each node contains an additional byte that serves as a lock for mutual exclusion.

Like Alfalfa, Buckwheat uses reference counting for storage reclamation.

8.4. Queue-based Scheduling

Processor scheduling is accomplished by maintaining a central queue structure that every processor accesses. The simplest approach would be for every processor to remove tasks from a single shared queue. However, a shared queue causes contention among the processors attempting to access it, a problem that is exacerbated as the number of processors in the system grows. Unless the hardware supports efficient access to a central queue, it is often necessary to modify the queue structure to prevent contention.

[Fig. 16. Buckwheat's two-level queue structure: each primary queue is shared by a cluster of processors, with a single secondary queue shared by all.]

The solution we have implemented for Buckwheat is a two-level queue structure, illustrated in Fig. 16. A processor can directly access a task queue, called a primary queue, that it shares with a small number of other processors. There may be many primary queues in the system. Each primary queue has a rather small fixed size. We define the set of processors accessing a single primary queue to be a primary cluster.

If a processor is ready to execute a task and its primary task queue is empty, it can access another queue, called the secondary queue, which is shared among all the processors in the system. Similarly, if a processor attempts to put a task onto its primary queue and its primary queue is full, then the task is put onto the secondary queue.

There are several advantages to the two-level queue structure:

1. Since a primary queue is shared by a relatively small number of processors, contention for the queue is reduced.

2. The secondary queue provides a way to send tasks from a busy primary cluster to other primary clusters. The cost of the extra indirection needed to access the secondary queue is only incurred by idle processors in idle primary clusters or when a primary cluster becomes very busy. If the size of the primary queue is chosen appropriately, the vast majority of queue accesses will be to primary queues.
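The overflow and underflow discipline described above can be sketched as follows. This is an illustrative reconstruction: the queue types, capacities, and names are ours, and the per-queue locking that a real implementation needs is omitted for clarity.

```c
#define PRIMARY_CAP   20    /* small fixed primary size (see Section 8.5) */
#define SECONDARY_CAP 1024

typedef struct { int tasks[PRIMARY_CAP];   int head, count; } Primary;
typedef struct { int tasks[SECONDARY_CAP]; int head, count; } Secondary;

/* Returns 0 if the task went on the primary queue, 1 on overflow to
   the secondary queue. */
int enqueue(Primary *p, Secondary *s, int task)
{
    if (p->count < PRIMARY_CAP) {                          /* fast path */
        p->tasks[(p->head + p->count++) % PRIMARY_CAP] = task;
        return 0;
    }
    s->tasks[(s->head + s->count++) % SECONDARY_CAP] = task;
    return 1;
}

/* Returns 1 and stores a task, preferring the primary queue; 0 if both
   queues are empty. */
int dequeue(Primary *p, Secondary *s, int *task)
{
    if (p->count > 0) {
        *task = p->tasks[p->head];
        p->head = (p->head + 1) % PRIMARY_CAP; p->count--;
        return 1;
    }
    if (s->count > 0) {            /* primary empty: fall back globally */
        *task = s->tasks[s->head];
        s->head = (s->head + 1) % SECONDARY_CAP; s->count--;
        return 1;
    }
    return 0;
}
```

With an appropriately sized primary queue, the secondary path is taken only by idle or very busy clusters, matching the advantages listed above.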

8.5. Execution Results

The four applications programs (pfac, queens, quad, matmult) were executed on Buckwheat. Almost 600 runs were performed to measure:


1. The performance of Buckwheat using a single shared task queue.

2. The effect of using a two-level queue structure. The number of processors in a primary cluster, as well as the sizes of the primary queues, were varied in order to find the best task queue configuration.

The results of these experiments are described in detail in Ref. 35. Figures 17-20 show the best performance on each of the test programs using a single queue and the two-level queue. On the twelve-processor system, Buckwheat performed best when there were four processors in each primary cluster and each primary queue had a size of twenty or less (below twenty, the performance did not vary greatly). Buckwheat performed extremely well on every test program. It performed especially well on matmult, the program on which Alfalfa performed poorly. This is indicative of the advantage that shared memory provides for programs with large shared data structures such as matrices.

[Fig. 17. The execution times for pfac on Buckwheat: single queue, two-level queue, and linear speedup, plotted against the number of processors.]

[Fig. 18. The execution times for queens on Buckwheat.]

[Fig. 19. The execution times for quad on Buckwheat.]

[Fig. 20. The execution times for matmult on Buckwheat.]

8.6. Comparison with Uniprocessor Implementations

It may be interesting to compare Alfalfa's and Buckwheat's performance on the test programs against efficient uniprocessor implementations of the same programs written in C, Pascal, or some other sequential language. We have not done so, and feel the results of such a comparison would be difficult to interpret, for the following reasons:

• Both Alfalfa and Buckwheat are research prototypes, not production-quality systems. Little effort was made to optimize the system code and the code generated by the compilers. The compilers generated C code which, especially in this case, is much less efficient than native code.

• A large amount of run-time overhead is incurred by run-time type checking and run-time tests to support lazy evaluation.

These are the reasons (among others) that our stated objective was to examine introspective speedups: the speedups of a multiprocessor system over the same system running on a single processor.

9. CONCLUSIONS

Did we succeed in demonstrating that it is feasible to execute conventional functional programs on currently available multiprocessors? Buckwheat has shown that shared memory multiprocessor execution of functional languages can be efficient over a wide range of programs that exhibit inherent parallelism. Although we hesitate to draw conclusions about exploiting massive parallelism on very large shared memory machines, we find our results encouraging. Our results on Alfalfa indicate that for large classes of functional programs, execution on loosely coupled multiprocessors is feasible. However, we recognize that more work remains to be done on partitioning data for these machines. In both cases, the serial combinator approach appears to have been successful in detecting and exploiting the inherent parallelism in functional programs. Furthermore, very simple load balancing schemes provided excellent performance in distributing tasks throughout the multiprocessor systems.

10. ACKNOWLEDGMENTS

I would like to thank Paul Hudak for his advice and his contributions to this research. I would also like to thank those in the wrestling group at Yale and the C-10 group at Los Alamos National Laboratory. I would also like to express my appreciation to Wendy Goldberg for her patience and support.

REFERENCES

1. J. Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of programs, CACM 21(8):613-641 (August 1978).

2. D. A. Turner, The semantic elegance of applicative languages. In Functional Programming Languages and Computer Architecture, p. 85-92, ACM (1981).

3. S. L. Peyton Jones, Directions in functional programming research. In Distributed Computing Systems Programme, Chapter 14, p. 220-249, Peter Peregrinus Ltd., London (1984).

4. S. L. Peyton Jones, The Implementation of Functional Programming Languages. Prentice Hall (1987).

5. Robin Milner, The Standard ML core language, Polymorphism, Vol. 2, No. 2 (October 1985).

6. D. A. Turner, Miranda: a non-strict functional language with polymorphic types. In Functional Programming Languages and Computer Architecture, p. 1-16, Springer-Verlag LNCS 201 (September 1985).

7. J. McGraw et al., SISAL: Streams and Iteration in a Single Assignment Language, Language Reference Manual, Version 1.2. Technical Report M-146, LLNL (March 1985).

9. J. Kowalik (ed.). Parallel MIMD Computation: The HEP Supercomputer and its Applica- tions, MIT Press (1985).

10. R. Oldehoeft and D. Cann, Applicative parallelism on a shared-memory multiprocessor. IEEE Software (January 1988).

11. P. Hudak, ALFL Reference Manual and Programmer's Guide. Research Report YALEU/DCS/RR-322, Second Edition, Yale University (October 1984).

Page 48: Multiprocessor execution of functional programs

472 Goldberg

12. C. P. Wadsworth, Semantics and Pragmatics of the Lambda Calculus. PhD thesis, Oxford University (1971).

13. H. B. Curry and R. Feys, Combinatory Logic, North-Holland Pub. Co., Amsterdam (1958).

14. R. J. M. Hughes, Super-combinators: a new implementation method for applicative languages. In Proc. 1982 ACM Conf. on LISP and Functional Prog., p. 1-10, ACM (August 1982).

15. T. Johnsson, Lambda Lifting: Transforming programs to recursive equations. In Functional Programming Languages and Computer Architecture, p. 190-203, Springer- Verlag LNCS 201 (September 1985).

16. B. Goldberg, Detecting sharing of partial applications in functional programs. In Proceedings of 1987 Functional Programming Languages and Computer Architecture Conference, p. 408-425, Springer-Verlag LNCS 274 (September 1987).

17. R. M. Keller, G. Lindstrom, and S. Patil, A loosely-coupled applicative multi-processing system. In AFIPS, p. 613-622 (June 1979).

18. R. M. Keller, G. Lindstrom, and S. Patil, An Architecture for a Loosely-Coupled Parallel Processor, Technical Report UUCS-78-105, University of Utah (October 1978).

19. J. Darlington and M. Reeve, Alice: a multi-processor reduction machine for the parallel evaluation of applicative languages. In Functional Programming Languages and Computer Architecture, ACM, p. 65-76 (October 1981).

20. S. L. Peyton Jones, C. Clack, J. Salkild, and M. Hardie, GRIP--A high-performance architecture for parallel graph reduction. In Proceedings of 1987 Functional Programming Languages and Computer Architecture Conference, p. 98-112, Springer-Verlag LNCS 274 (September 1987).

21. T. Johnsson, Efficient compilation of lazy evaluation. In Proceedings of the SIGPLAN'84 Symposium on Compiler Construction, p. 58-69 (June 1984).

22. P. Hudak and L. Smith, Para-functional programming: A paradigm for programming multiprocessor systems. In Proc. 12th Sym. on Prin. of Prog. Lang., p. 243-254, ACM (January 1986).

23. P. Hudak, Para-functional programming, Computer 19(8):60-71 (August 1986).

24. R. J. M. Hughes, The Design and Implementation of Programming Languages. PhD thesis, Oxford University (July 1983).

25. B. Goldberg, Multiprocessor Execution of Functional Programs. PhD thesis, Yale University, Department of Computer Science (May 1988).

26. iPSC User's Guide - Preliminary. Intel Corporation (July 1985).

27. Claus-Werner Lermen and Dieter Maurer, A protocol for distributed reference counting. In Proc. 1986 ACM Conference on Lisp and Functional Programming, p. 343-350, ACM SIGPLAN/SIGACT/SIGART, Cambridge, Massachusetts (August 1986).

28. B. Goldberg, Generational Reference Counting: A reduced-communication storage reclamation scheme. In Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, ACM (June 1989).

29. D. I. Bevan, Distributed garbage collection using reference counting. In PARLE Parallel Architectures and Languages Europe, p. 176-187, Springer-Verlag LNCS 259 (June 1987).

30. Paul Watson and Ian Watson, An efficient garbage collection scheme for parallel computer architectures. In PARLE Parallel Architectures and Languages Europe, p. 432-443, Springer-Verlag LNCS 259 (June 1987).

31. R. M. Keller, F. C. H. Lin, and J. Tanaka, Rediflow multiprocessing. In Proc. Compcon Spring 84, p. 410-417 (February 1984).

32. P. Hudak and B. Goldberg, Experiments in diffused combinator reduction. In Proc. 1984 ACM Conf. on LISP and Functional Prog., p. 167-176, ACM (August 1984).


33. B. Goldberg and P. Hudak, Implementing functional programs on a hypercube multiprocessor. In Proceedings of Third Conference on Hypercube Concurrent Computers and Applications, ACM (January 1988).

34. Multimax Technical Summary, Encore Computer Corporation, Marlborough, Massachusetts (1986).

35. B. Goldberg, Buckwheat: Graph reduction on a shared memory multiprocessor. In Proceedings of the 1988 ACM Conference on Lisp and Functional Programming, p. 40-51 (July 1988).
