
International Journal of Parallel Programming, Vol. 22, No. 6, 1994

Parallel-Access Memory Management Using Fast-Fits

Theodore Johnson¹

Received November 13, 1992
The two most common approaches to managing shared-access memory--free lists and buddy systems--have significant drawbacks. Free list algorithms have poor memory access characteristics, and buddy systems utilize their space inefficiently. In this paper, we present an alternative approach to parallel-access memory management based on the fast-fits algorithm. A fast-fits memory manager stores free blocks in a tree structure, providing fast access and efficient space use. Since the fast-fits algorithm accesses fewer blocks than a free list algorithm, it reduces the amount of cache invalidation overhead due to the memory manager. Our performance experiments show that the parallel-access fast-fits memory manager allows significantly greater access rates than a serial-access fast-fits memory manager does. We note that shared-memory multiprocessor systems need efficient dynamic storage allocators, both for system purposes and to support parallel programs.

KEY WORDS: Memory manager; shared memory; concurrent data structure.

1. INTRODUCTION

A memory manager accepts two kinds of operations: requests to allocate and requests to release arbitrary size blocks of memory. For example, the UNIX system calls malloc() and free() are requests to a memory manager.

A concurrent (or parallel-access) memory manager handles requests for shared memory in a multiprogrammed uniprocessor, a shared-memory multiprocessor, or a distributed shared virtual memory environment. Examples of applications that require parallel memory managers include parallel sparse matrix factorization algorithms(1,2) and communications software on clustered parallel systems.(8)

¹ Department of Computer and Information Science, University of Florida, Gainesville, Florida 32611-2024.

A memory manager for a parallel system should have certain desirable characteristics. First, a memory manager for a parallel application should be concurrent. Bigler et al.(4) have found that concurrent memory managers are more appropriate for use by parallel programs than are serial memory managers protected by a critical section. Second, the memory manager should also be space efficient. Parallel supercomputers tend to be efficient only on large problems. If the problem does not fit into available memory, an out-of-core solution must be used, which greatly degrades performance.(13) An improvement in memory efficiency permits the solution of previously unsolvable problems. Third, the memory manager should access as few cache lines as possible, to avoid imposing a cache invalidation overhead on the rest of the system. Grunwald et al.(6) have found that conventional free list algorithms cause a large volume of cache misses and page faults when run on a uniprocessor. In this paper, we present a parallel memory manager that is concurrent, is space efficient, and accesses few blocks on each allocate and release request, avoiding excessive cache invalidations.

The algorithms that we present are relatively application and architecture independent. We assume only a shared memory parallel processor and a parallel application that requests and releases blocks of memory of a variety of sizes. Given an architecture and an application, the implementor can tune the memory manager to achieve the best performance. However, the implementor still requires a memory manager of the type described in this work.

Applications often make most of their memory requests for blocks of a few sizes.(7,8) This behavior is to be expected when the application maintains dynamic lists and trees. However, a typical application that is run on a parallel supercomputer is likely to be a numeric application, and will make many requests to allocate space for temporary vectors and matrices. The request sizes will typically vary greatly during the execution of the program.(1,2,5) If the program often requests blocks of a particular size, the memory manager can use a segregated storage technique(9) and keep a pool of the popular block sizes. Highly parallel techniques for managing concurrent pools have been discussed in the literature.(10,11) However, the program will still need a concurrent memory manager that can allocate blocks of arbitrary sizes.

A parallel memory manager must be tuned for the architecture it is implemented on. For example, memory should always be allocated in units of a cache line to avoid the "false sharing" or "ping-pong" effect.(12) The implementation must consider the available locking mechanisms and whether locality hints are necessary. For example, locality and concurrency can be improved by partitioning the memory space, and managing each memory space independently. An operation prefers to allocate from its local memory space, and tries remote spaces if the allocation request on the local memory space fails (this approach is discussed in Ref. 13). However, a concurrent memory manager is still needed for each partition. The memory manager should be highly concurrent to reduce the number of partitions that are needed, since partitioning will reduce space efficiency.

1.1. Previous Work

Most heap memory management algorithms use one of two main methods: free lists and buddy systems. In a free list algorithm,(14) the free blocks are linked together in a list. Initially, all of the memory is free, and the free list consists of a block containing the entire memory. An allocate operation searches the free list and returns a portion of a selected free block. The method by which the allocate operation selects a free block determines the free list algorithm. Concurrent free list managers typically use first fit--that is, they allocate from the first block that is large enough. This algorithm is used because only local information is needed to decide whether or not to allocate from a block. A release operation adds the released block to the free list. The released block is merged with an existing block on the free list, if possible. Both the allocate and the release operations require that much of the free list be searched, so free list algorithms are considered to be slow, though space efficient.(15)
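
As a concrete illustration of the first-fit policy just described, the following is a minimal serial sketch in C. The block layout and function name are assumptions made for illustration, not code from any of the cited algorithms; coalescing on release and minimum split sizes are omitted.

```c
#include <stddef.h>

/* A free block header; the layout is a minimal assumption. */
typedef struct free_block {
    size_t             size;  /* size of this free block, in bytes */
    struct free_block *next;  /* next block on the free list       */
} free_block;

/* First fit: allocate from the first block that is large enough.
 * A larger block is split by returning its tail; an exact fit is
 * unlinked from the list. Returns NULL if no block fits. */
void *first_fit_alloc(free_block **head, size_t request)
{
    for (free_block **link = head; *link != NULL; link = &(*link)->next) {
        free_block *b = *link;
        if (b->size < request)
            continue;                /* too small: keep searching     */
        if (b->size > request) {     /* split: hand out the tail      */
            b->size -= request;
            return (char *)b + b->size;
        }
        *link = b->next;             /* exact fit: unlink whole block */
        return b;
    }
    return NULL;                     /* no free block satisfies it    */
}
```

Note how the search may touch many widely scattered blocks before finding a fit; this is the memory-reference behavior criticized above.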

An alternative memory manager can be built using a buddy system.(16) In a buddy system, memory blocks are available only as one of several fixed sizes. Each memory block has a buddy, with which it can combine and form a larger size block, which can in turn be split into its constituent buddies. A common buddy system is the binary buddy system, in which all blocks are of size c·2^i. There is a free list for each block size, making allocate operations fast when there is an available block of the right size. Buddy systems are considered to be very fast, but space inefficient.(15)
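
The speed of the binary buddy system comes from the fact that a block's buddy can be computed with bit arithmetic. The sketch below is illustrative and assumes offsets are measured in the base allocation unit c, with sizes that are exact powers of two in that unit; the function names are hypothetical.

```c
#include <stdint.h>

/* The buddy of a block is the sibling half of its parent: flip the bit
 * that distinguishes the two halves. Valid only for power-of-two sizes
 * and size-aligned offsets, as assumed above. */
static inline uintptr_t buddy_of(uintptr_t offset, uintptr_t size)
{
    return offset ^ size;            /* sibling half of the parent   */
}

/* The two buddies recombine into a parent block of twice the size.   */
static inline uintptr_t parent_of(uintptr_t offset, uintptr_t size)
{
    return offset & ~size;           /* lower of the two buddies     */
}
/* Releasing a block of size s at offset o: if the block at
 * buddy_of(o, s) is also free, remove it from the size-s free list
 * and insert the merged block at parent_of(o, s); otherwise put o on
 * the size-s free list. */
```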

Stephenson(17,18) proposes a fast free list memory management algorithm which he calls fast-fits. The free blocks are organized into a tree structure, so that a block can be allocated or released in an expected O(log n) time. The free blocks in a fast-fits memory manager can be of arbitrary size, so that fast-fits suffers little internal fragmentation. Bozman et al.(15) have found that fast-fits is both fast (though not as fast as a buddy system) and space efficient (though not as space efficient as a free list). We note that the fast-fits memory manager is used in a number of commercial operating systems, including SUN Microsystems' SunOS Unix.(19)


Since many programs allocate memory blocks in only a few sizes, the segregated storage technique(9) can greatly improve the speed of memory allocation and deallocation. Free blocks of a few special sizes are stored in separate free lists. Blocks of these sizes are put into the special free lists when released, and an allocation request for one of these blocks is an O(1) operation if a block of that size is available. If the special free list is empty, or if a different sized block is requested, then one of the previously discussed memory managers is used.
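
A serial sketch of segregated storage follows. The popular sizes, the list count, and the general_alloc/general_free fallbacks are all hypothetical names for illustration; a concurrent version would manage each pool with the pool techniques cited earlier, and blocks must be at least one pointer wide so the free list can be threaded through them.

```c
#include <stddef.h>

extern void *general_alloc(size_t size);     /* fallback heap manager */
extern void  general_free(void *blk, size_t size);

#define NPOPULAR 3
static const size_t popular_size[NPOPULAR] = { 16, 64, 256 };
static void *pool[NPOPULAR];         /* one free list per popular size */

void *seg_alloc(size_t request)
{
    for (int i = 0; i < NPOPULAR; i++)
        if (request == popular_size[i] && pool[i] != NULL) {
            void *blk = pool[i];
            pool[i] = *(void **)blk; /* pop: first word links the list */
            return blk;              /* O(1) for a popular size        */
        }
    return general_alloc(request);   /* fall through to the heap       */
}

void seg_free(void *blk, size_t size)
{
    for (int i = 0; i < NPOPULAR; i++)
        if (size == popular_size[i]) {
            *(void **)blk = pool[i]; /* push onto the size's list      */
            pool[i] = blk;
            return;
        }
    general_free(blk, size);
}
```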

A number of parallel free list algorithms have been proposed. Stone(20) proposes a first-fit free list algorithm that uses the fetch-and-add instruction(21) and locking for concurrency control. Memory is added to or removed from a free block using the fetch-and-add instruction, but blocks must be exclusively locked whenever a free block is added to or removed from the list. Bigler et al.(4) compare three algorithms, one of which is a concurrent algorithm that searches the free list using lock coupling (the successor block must be locked before the lock on the current block may be released). The authors find that the concurrent algorithm is more appropriate for a parallel processing environment than are the serial algorithms. Ellis and Olson(13) propose two concurrent free list algorithms. Their first algorithm is similar to Stone's,(20) but breaks the memory being managed into several segments, each of which has an independent free list. Ellis and Olson's second algorithm greatly simplifies the locking performed on the free blocks, and keeps empty blocks in the structure until it is guaranteed that no operation will read the header information for that block. Ford(22) compares a locking approach to an optimistic approach in several real-time memory managers, and concludes that the optimistic approach is better because the critical sections are shorter.

Gottlieb and Wilson develop parallel buddy systems that use the fetch-and-add instruction to coordinate processors. Their first algorithm(23,24) models a buddy system as a tree. The number of blocks of each size that are contained in the subtree rooted at a node is stored at each node. Concurrent allocators use this information to navigate the tree. Their second algorithm(8,24) is a concurrent version of the commonly described buddy algorithm. Both of Gottlieb and Wilson's algorithms suffer from excessive fragmentation, because operations that access the memory manager don't cooperate. Johnson and Davis(25) present a space-efficient parallel buddy memory manager in which allocate and release operations cooperate to avoid splitting blocks needlessly.

While a parallel first fit memory manager is space efficient, it requires processes to access many widely scattered blocks to perform an operation. As a result, free list algorithms have a poor memory reference pattern. In a serial environment, this problem can cause excessive paging.(6) In a parallel environment, the poor memory reference pattern can cause excessive network traffic and many cache invalidations. A fast fit memory manager requires that far fewer blocks be examined, reducing the load that an allocate or release operation places on the memory system. Bozman et al.(15) report that a free list algorithm requires an operation to reference 3 to 30 times more blocks than the fast-fits algorithm does, while providing only slightly better space utilization (90% to 93% for both approaches). The parallel buddy memory manager of Johnson and Davis(25) is fast and has low contention,(3) but has a much lower space utilization (about 65% to 70%).(15,25)

In this paper, we show how to implement a parallel fast-fits memory manager by modifying the procedures presented by Stephenson(17,18) and applying tree-locking protocols.(26,27) We discuss some implementation issues, examine the performance of the parallel fast-fits algorithms, discuss the value of several optimizations, and compare the performance of parallel fast-fits to a serial implementation. We find that the parallel fast-fits algorithm is a practical alternative to other parallel memory management solutions.

2. THE FAST-FITS MEMORY MANAGER

The fast-fits memory manager keeps the free blocks in a Cartesian tree,(28,29) which is a binary tree that stores (X, Y) points from the two-dimensional plane. If n is a node in a Cartesian tree, then let n.X be the stored X coordinate and let n.Y be the stored Y coordinate. The nodes in the Cartesian tree are ordered by the following two rules (see Fig. 1):

1. If n1.X < n2.X, then n1 comes before n2 in the inorder traversal of the Cartesian tree.

2. If n1 is a descendant of n2, then n1.Y ≤ n2.Y.

One can implement search, insert, and delete operations on a Cartesian tree using local rebalancing only(18,28) (as we will discuss in a later section). Vuillemin(28) shows that the expected number of comparisons needed to insert a point into a Cartesian tree is O(log m) if there are m nodes in the tree (assuming that the Y components of the entries in the tree form a random permutation).

When the Cartesian tree is used for memory management, the X coordinate is the address of the free block, and the Y coordinate is the size of the free block. This ordering makes sense for a memory manager. By rule 1, blocks are stored in binary search tree order by address. Therefore, a release operation can find neighboring blocks (the blocks with the closest addresses) by searching down one path of the tree. By rule 2, blocks are stored according to heap order by size. Therefore, if an allocate operation sees that a node n is too small to satisfy the request, all nodes in the subtree rooted at n are too small to satisfy the request, and should not be considered. As a corollary, if the root is too small to satisfy the request, no free block can satisfy the request. Figure 1 shows an example of a fast-fits tree.

Fig. 1. Example Cartesian tree.
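
A node layout that captures this mapping is sketched below. The header lives inside the free block itself, so the block's own address serves as the X coordinate and only the size (Y) needs a field; the exact layout is an assumption for illustration, not Stephenson's.

```c
#include <stddef.h>

/* Sketch of a serial fast-fits tree node, stored in the free block. */
typedef struct ff_node {
    size_t          size;   /* Y coordinate: size of this free block  */
    struct ff_node *left;   /* free blocks at lower addresses (X)     */
    struct ff_node *right;  /* free blocks at higher addresses (X)    */
} ff_node;

/* Rule 2 gives the pruning property noted above: if root == NULL or
 * root->size < request, no block in the entire tree can satisfy the
 * request. */
```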

Stephenson(18) provides algorithms for implementing allocate and release operations on a Cartesian tree. The allocate operation searches the Cartesian tree until it finds a node n that can satisfy the request, but neither of n's children can satisfy the request. If at least one of n's children can satisfy the request, the allocate operation will continue its search in the child's subtree. The search algorithm is defined by the action taken when both children can satisfy the request. Leftmost fit searches the left subtree, and better fit searches the subtree rooted at the smaller child. After n is found, the block needed to satisfy the request is taken from n. If not all of n is allocated, but n becomes smaller than one of its children, n must be demoted until rule 2 is satisfied (a demote is an optimization over a delete and an insert).
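
The descent just described can be sketched as follows, using the ff_node layout above. This is a serial sketch of better fit only; splitting the chosen block, and demoting it if it shrinks below a child, are omitted.

```c
/* Descend until the current node fits the request but neither child
 * does. "Better fit" picks the smaller child when both fit. */
ff_node *find_fit(ff_node *root, size_t request)
{
    if (root == NULL || root->size < request)
        return NULL;                     /* rule 2: nothing can fit   */
    ff_node *n = root;
    for (;;) {
        ff_node *l = n->left, *r = n->right;
        int lfits = (l != NULL && l->size >= request);
        int rfits = (r != NULL && r->size >= request);
        if (!lfits && !rfits)
            return n;                    /* allocate from n           */
        if (lfits && rfits)              /* better fit: smaller child */
            n = (l->size <= r->size) ? l : r;
        else
            n = lfits ? l : r;
    }
}
```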

When a block of memory is released, the release operation must check the set of free blocks to find any neighboring blocks that can be merged with the released block. The mergeable nodes are deleted from the Cartesian tree and combined with the released block, which is then inserted into the tree. Stephenson also presents algorithms that take advantage of special cases to speed the execution of the release operation.


The allocate and release operations search only one path in the Cartesian tree. The evidence that the Cartesian tree will have O(log m) depth suggests that allocate and release operations will be fast. Since allocations are made using "best-fit"-like algorithms, fast-fits should be space efficient. The measurements taken by Stephenson(18) and Bozman et al.(15) show that the fast-fits memory manager is in fact fast and space efficient.

3. PARALLEL-ACCESS FAST-FITS

Serial fast-fits operations tend to move down a single path of the Cartesian tree, so the serial algorithms can be naturally parallelized by using a tree-locking protocol.(27) To do so, however, we need to modify the serial algorithms somewhat. The tree locking protocol requires that if process P holds a lock on the tree, then P can lock node n only if P holds a lock on n's parent. For concurrency, we need to release locks as early as possible.(3) As a result, we must modify the serial release procedure, which occasionally needs to move combined blocks higher in the tree. In such a case, the parallel release procedure will be required to make another pass starting from the root.

In this section, we describe two algorithms, one that relies on exclusive locks (the W-only algorithm) and one that uses both shared and exclusive locks (the RWU algorithm). We also discuss several optimizations of the algorithms, the value of which we examine in Section 4. Finally, we discuss methods for implementing the required locks in a practical shared-memory multiprocessor.

3.1. W-Only Algorithm

The W-only algorithm relies on exclusive locks to prevent interference between processes that access the Cartesian tree concurrently. The algorithms for performing fast-fits memory management are not described well in easily available literature(17) (a good description of the algorithms is contained in the technical report(18)), so we provide another description here. In this section, we provide a high level description only. The pseudocode for selected procedures is listed in the appendix. A full listing of the procedures appears in the technical report.(31)

We will need to modify the fast-fits data structures slightly. A node in the serial fast-fits tree needs to store its size and pointers to both (potential) children. A node in the parallel fast-fits tree might need a storage location for locking, depending on the multiprocessor architecture. In the algorithms we present, we store the size of both children in the parent, which permits a simpler and more concurrent algorithm than is otherwise possible. Increasing the size of a fast-fits node increases the minimum size allocation. However, this is not likely to be a problem because memory should be allocated in units of a cache line to prevent false sharing.(12)

The root node of the fast-fits tree is a free block that might be chosen for allocation, so the root of the fast-fits tree will change periodically. We use an anchor node to provide a stable reference to the root. To simplify the algorithms we use a fast-fits node structure for the anchor, setting the size field to the size of managed memory plus one.
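
A sketch of the anchor initialization follows, reusing the serial ff_node layout from Section 2 for brevity. Which child pointer holds the root is an implementation choice, and the function name is hypothetical.

```c
/* The anchor is an ordinary fast-fits node whose size field is one
 * more than the managed memory, so every real block compares smaller
 * and rule 2 is never violated at the anchor. */
static ff_node anchor;

void ff_init(ff_node *all_of_memory, size_t managed_size)
{
    anchor.size  = managed_size + 1; /* larger than any real block    */
    anchor.left  = NULL;
    anchor.right = all_of_memory;    /* the root: one block, all free */
}
```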

The allocate and release operations use several common sub-operations. We describe the W-only algorithm by discussing each procedure in turn.

Allocate: To prevent interference, the concurrent algorithms require that a node be locked whenever it is accessed. The allocate procedure begins by locking the anchor and the root of the Cartesian tree (which is the only child of the anchor). The tree is then searched for a suitable free block to allocate from, using the lock coupling protocol(26,27,32): the child node is locked before the parent node is unlocked. The search criteria can be leftmost fit or better-fit(18) (as described in Section 2). For the parallel-access memory manager, we propose the random better-fit algorithm. When random better-fit has the choice of two subtrees to search, it chooses between them randomly. Introducing randomness to the search algorithm hashes the memory manager operations to different subtrees, increasing the possible concurrency. We note that the roots of the subtrees don't need to be locked when a search decision is made, because the child node sizes are stored in the parent.
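
A sketch of this lock-coupled descent is shown below. The node layout (child sizes cached in the parent) follows the text; the lock type is an assumption, with plain pthread mutexes standing in for the per-node W locks, and a thread-safe random number source is elided.

```c
#include <pthread.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct pnode {
    size_t          size;           /* this free block's size          */
    size_t          lsize, rsize;   /* cached sizes of the children    */
    struct pnode   *left, *right;
    pthread_mutex_t lk;             /* exclusive (W) lock for the node */
} pnode;

/* On entry the caller holds locks on parent and n (the anchor and the
 * root on the first call), and n is known to fit the request. Returns
 * the locked node to allocate from, with its locked parent in *pp. */
pnode *descend(pnode *parent, pnode *n, size_t request, pnode **pp)
{
    for (;;) {
        /* Child sizes live in the parent, so no child lock is needed
         * to make the search decision. */
        int lfits = (n->lsize >= request);
        int rfits = (n->rsize >= request);
        if (!lfits && !rfits) {
            *pp = parent;            /* allocate from n; parent stays  */
            return n;                /* locked since both are modified */
        }
        pnode *next = (lfits && rfits)
            ? ((rand() & 1) ? n->left : n->right)  /* random better fit */
            : (lfits ? n->left : n->right);
        pthread_mutex_lock(&next->lk);     /* lock coupling: lock the  */
        pthread_mutex_unlock(&parent->lk); /* child, drop grandparent  */
        parent = n;
        n = next;
    }
}
```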

When the allocate procedure is searching for the free block to allocate from, it releases the lock on node n's parent only after it determines that it can allocate from one of n's children. When the appropriate node, n, to allocate from is found, the operation has a W lock on n and n's parent (both of which will be modified). If all of n must be allocated to satisfy the request, the calling process uses the delete procedure to remove n from the fast-fits tree. Otherwise the calling process uses the demote procedure to restore rule 2 for a Cartesian tree.

Release: The release operation returns a block of memory to the memory manager. When the block is returned, it must be combined with any currently existing free blocks to defragment memory. There are two possible neighbors (adjacent blocks). They can be found by searching the tree for the address of the released block until two neighbors are found or a leaf is reached.(18)

To perform the release, the serial algorithm makes two passes over the fast-fits tree. The first pass finds and deletes any neighboring free blocks, which are combined with the released block. On the second pass, the combined block is inserted into the tree. Stephenson describes an optimization that makes a single pass and promotes the released block if it becomes larger than its parent due to merging.

Neither of the serial algorithms can be directly implemented as a parallel-access memory manager. Let us consider the two-pass release algorithm. Suppose process P1 is releasing block B1, process P2 is releasing neighboring block B2, and |B1| > |B2|. Suppose further that P1 and P2 complete their first passes at about the same time, and don't find any neighboring blocks. Now, P1 and P2 will insert their blocks into the fast-fits tree and terminate. The tree will contain two neighboring free blocks, and memory will become needlessly fragmented. Next, let us consider the one-pass algorithm. In order to satisfy the tree locking criteria, a lock must be held on a node if it might be accessed again.(26,27) Therefore, a lock must be held on the smallest node in the search path that is larger than the released block can become when it combines with its neighbors. It is an easy matter to bound the size of the released block by examining its current size and the size of the current tree node. However, the bound is rather pessimistic, so that locks will be held on nodes near the root for a long period of time, leading to poor performance.(3)

The parallel release algorithm simultaneously searches for the place to insert the released block and for the released block's neighbors. If one of the released block's neighbors is larger than the released block, the release procedure will find the neighbor first. In this case, the neighbor is deleted from the tree, the free blocks are combined, and the combined block is released into the tree. If the position to insert the released block is found first, the release procedure inserts the released block. The delete and insert procedures search for any additional neighboring blocks while they restructure the tree. Like the allocate procedure, the release procedure uses lock coupling, and ensures that a lock is held on the parent of the node to be inserted or deleted.

The release procedure might make several passes through the fast-fits tree. If the initial pass doesn't find a neighbor, the operation terminates after performing the insert. If the initial pass finds one or two neighbors, the neighbors are deleted (possibly as a side effect of the insert or delete operations). These blocks are inserted into the tree during subsequent passes. We note that the subsequent passes might also delete blocks (either the released block, or blocks released by concurrent operations).

If the released block's neighbor is found first, it doesn't always need to be deleted from the tree. If the combined block isn't larger than the parent, it can be left in its place. The remaining task is to find the other neighbor in the tree and delete it, if it exists. The find procedure performs this task.

Demote: The demote procedure restores the second property of the Cartesian tree (a block must be larger than its children). Due to the address ordering of the Cartesian tree, the demoted node will become the child of a node on the rightmost path of the left subtree, or on the leftmost path of the right subtree (see Fig. 2). The demote procedure combines these two paths, ordering by the size of the blocks. At every restructuring step, the demote procedure examines the size component of the demoted node, the root of the right tree, and the root of the left tree; whichever is largest becomes the child of the parent. The initial right and left trees are the original right and left subtrees of the demoted node. Figures 3 and 4 illustrate the demote procedure. Thirty-four words have been allocated from node (280, 54) of the tree in Fig. 1. In Fig. 3, the root of the right tree is attached to the parent because it is the largest block. In Fig. 4, the left tree and the demoted node are added. When the demoted node is added to the tree, the remaining right and left subtrees are attached to the demoted node.

Fig. 2. Right(left)most paths in the left (right) subtree.

Fig. 3. Demote operation restructuring example.

Fig. 4. Demote operation restructuring example.

The demote procedure ensures that the node that it needs to modify is locked. Initially, this node is the parent; afterwards, it is the node most recently re-incorporated into the tree. The roots of the unincorporated right and left subtrees don't need to be locked because any operation that wished to modify their contents would need a lock on their parents, but these locks are held by the demote procedure. We have omitted from the appendix the pseudocode of the demote operation for brevity. However, it is similar to the code for the delete operation, which we discuss next.

Delete: The delete procedure is similar to the demote procedure, as again the rightmost path of the left subtree and the leftmost path of the right subtree are merged. In the delete procedure, however, only two nodes are compared (as illustrated in Figs. 3 and 4, except without the demoted block). While the tree is being restructured after deleting node n, the neighbors of n are detected, deleted from the tree, and combined with n. This step is necessary because a block might be deleted because it merged with the released block and became larger than its parent. If neighbors of n are in the subtree, they will either be the last node on the rightmost path of the left subtree, or the last node on the leftmost path of the right subtree (see Fig. 2). These nodes are easy to delete because they will have only one child, which can be attached to the parent in place of n.

Insert: The insert procedure differs from the demote and delete procedures because the insert procedure must create the leftmost path of the right subtree and the rightmost path of the left subtree, rather than merge them. The parent of the node being inserted is assumed to be locked when the procedure is entered. The inserted node is made a child of the parent, displacing the subtree that is to be split. The inserted node becomes the last node on the initial leftmost path of the right subtree (called righthook) and also the last node on the initial rightmost path of the left subtree (called lefthook). On each step, the root of the displaced subtree is attached to righthook or lefthook depending on whether its X value is larger or smaller than that of the inserted node. The root of the displaced subtree then becomes righthook (lefthook), and the left (right) subtree becomes the displaced subtree. The steps of an insert procedure are shown in Figs. 5 and 6. The node (240, 30) is inserted into the tree left after demoting node (280, 20). In Fig. 5, the inserted node initially contains the lefthook and righthook pointers. Since the root of the displaced subtree is to the left of the inserted node, it is attached to lefthook, and becomes the rightmost node of the left subtree. In Fig. 6, the child is to the right of the inserted node, so it is attached to righthook. After this step, the detached subtree is empty, so the restructuring is finished. A serial sketch of this restructuring loop appears after the discussion of locking below.

Fig. 5. Insert operation restructuring example.

Fig. 6. Insert operation restructuring example.

Both righthook and lefthook are always W-locked, and the procedure applies lock-coupling by locking the new righthook (lefthook) before releasing the lock on the old one. If the insert procedure detects a neighbor of the node being inserted, then that node will be the last node on the right(left)most path of the left (right) subtree. The left subtree of the neighbor will be to the left of the node being inserted, and the right subtree to the right, so the children can be immediately attached. All that remains is to search for the other neighbor.
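
The promised sketch of the insert restructuring follows: the displaced subtree is split along the inserted node's address, building the rightmost path of the left subtree (via lefthook) and the leftmost path of the right subtree (via righthook). Nodes live at the addresses of their free blocks, so pointer comparison is the X-coordinate comparison; ff_node is the serial layout sketched in Section 2, and locks and neighbor detection are omitted.

```c
#include <stdint.h>

void insert_split(ff_node *inserted, ff_node *displaced)
{
    ff_node **lefthook  = &inserted->left;   /* tail of left path  */
    ff_node **righthook = &inserted->right;  /* tail of right path */
    while (displaced != NULL) {
        if ((uintptr_t)displaced < (uintptr_t)inserted) {
            /* Smaller address: this node and its left subtree belong
             * to the left of the inserted node; its right subtree is
             * still unplaced and becomes the new displaced subtree. */
            *lefthook = displaced;
            lefthook  = &displaced->right;
            displaced = displaced->right;
        } else {
            *righthook = displaced;
            righthook  = &displaced->left;
            displaced  = displaced->left;
        }
    }
    *lefthook  = NULL;   /* terminate both constructed paths */
    *righthook = NULL;
}
```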

Find: If a release operation finds a neighbor n while searching for the position to insert the newly freed block, the release operation merges the freed block with n. If the merged block n is not larger than the parent (at this point the operation has a lock on n and n's parent), n is in the proper place in the tree and does not need to be removed. In this case, all that remains is to search for the other neighbor of the released block that might exist in the subtree rooted at n. The find operation searches the rightmost (leftmost) path of the left (right) subtree for a neighbor of the root node. As in the cases of the insert and delete procedures, if the neighbor exists it will be the last node on the path, and so is easy to delete.

3.1.1. Correctness

In this section, we show that the W-only algorithm is correct in the sense that after an arbitrary concurrent execution of a set O of allocate and release operations in which every allocate is successful, the fast-fits tree meets the two Cartesian tree ordering conditions (from Section 2), and reflects a linearizable execution of the operations. [An execution of concurrent operations is linearizable(33) if each operation appears to occur at a point in time during its actual execution.]

A release operation might make several passes through the tree, so we assign P to be the passes made by the operations in O: if o_i ∈ O is an allocate operation, then put P_{o_i} in P, and if o_i is a release operation that makes k passes, put P_{o_i,1}, ..., P_{o_i,k} in P. We have:

Theorem 1. The concurrent execution of the passes in P has an equivalent serial execution E such that if p locks the anchor before p', then p < p' in E.

Proof. Since each pass satisfies the tree locking criteria,(27) the theorem follows. ∎

We next need to show that operations as well as the passes have an equivalent serial execution. Since an allocate operation makes a single pass, we need only consider the release operations. Let us assume that O is finite, and that all allocate operations are successful. Then, each release terminates after a finite number of passes, since blocks are combined on the passes. The release operation puts the block it is releasing into the tree on one of its passes, so its execution occurs at that point.

We note that if O is infinite a release operation might not terminate, due to interfering operations. This problem is extremely rare, and we never observed its occurrence. If some allocate operations are unsuccessful, O might not be linearizable. A release operation might delete a block that the allocate needs to be successful. In this case, we can introduce imaginary allocate and release operations that occur when a (real) release operation makes a pass, and perform the allocation and deletion of the neighbors that the pass removes from and returns to the tree.

3.2. The RWU Algorithm

The W-only algorithm permits several operations to execute concurrently because the operations may travel to different branches of the Cartesian tree. However, the root is a serialization bottleneck since every pass must place an exclusive lock on the root at some point. The exclusive lock on the root places a total order on the passes. This is a stricter ordering than is necessary, because two passes might restructure different subtrees, and ordering is only necessary among operations with conflicting read and write sets.

Our second parallel-access fast-fits algorithm, the RWU algorithm, makes an initial search using shared (R) locks and lock coupling. When the operation finds the place to perform restructuring, it (effectively) releases its locks and places exclusive locks. Between the time that the shared locks are released and the exclusive lock is set, a concurrent operation might modify the tree. Therefore, after the exclusive lock is obtained, the operation must resume its search, possibly starting over from the root.

The first exclusive lock that an operation places is an upgrade lock (U lock); subsequent locks are W locks. A lock compatibility chart is listed in Table I. The R and W locks are the usual read and write locks, and are granted in FCFS order. An R lock can be upgraded to a U lock, which will be granted before any W locks are granted, but after all previous U locks and currently held R locks are released. This protocol is necessary to ensure that when an operation places its first exclusive lock, the node it will access still exists in the tree.

Table I. Lock Compatibility Chart

       R     W     U
  R   yes    no    no
  W   no     no    no
  U   no     no    no

The RWU algorithm uses the same insert, delete, demote, and find operations as the W-only algorithm, so we only need to discuss the allocate and release operations. The code for the allocate operation is in the appendix (we omit the code for the release operation to conserve space).

Allocate: The RWU allocate operation searches the fast-fits tree the same way that the W-only allocate operation does, except that it places R locks. When the operation finds the node to allocate from, it places exclusive locks to prevent interference.

When an allocate operation decides to place exclusive locks on a node n, it has R locks on n and on n's parent p. The pointer in p to n might change, but p's location won't change. Thus p is the highest node, and therefore the first node (by the tree locking protocol) that must be exclusively locked. Placing the first exclusive lock requires a special protocol. Let us consider the possibilities: Since the operations place shared locks during their search phase, two allocate operations o1 and o2 might have shared locks on p and n and decide to place their first exclusive lock on p. If o1 obtains an exclusive lock on p before o2 does, then o2 will be blocked by o1 at p. Next, o1 will attempt to place an exclusive lock on n, and will be blocked by o2, resulting in deadlock. Therefore, an allocate operation must release its lock on n before placing an exclusive lock on p.


Suppose that when operation o1 places its first exclusive lock on p, it retains its R lock on p. Then, o1 and o2 will deadlock on node p. Suppose that operation o1 releases its R lock on p before placing an exclusive lock on p. Then, some other operation o3 might place an exclusive lock on p and delete it before o1 places its exclusive lock in the queue. Suppose that o1 upgrades its lock, removing its R lock from the head of the queue and placing a U lock at the tail of the queue. Then, if o3 already had its lock in the queue, o1 would lock a deleted node.

These considerations lead us to a system of three locks (R, U, and W), where U locks have priority over W locks. When o1 upgrades its R lock to a U lock, it will succeed in obtaining the lock before any operation obtains a W lock on the node. Since operations place U locks only on the parents of the modified nodes, the node will still exist when the lock is granted.

Several operations might place a U lock on p simultaneously, so o1 might obtain its lock on p before o2 does. If o1 allocates from n, then o2 might find that n is too small to allocate from when it obtains the U lock on p. If o2 finds that neither of p's children is large enough to allocate from, it must search the fast-fits tree again starting from the anchor. In the appendix, the place where the allocate operation makes this decision is marked by (1). To ensure that the allocate operation terminates, the second pass is made using the W-only algorithm. If o2 finds that at least one of p's children is large enough to allocate from, it searches the tree starting at p using the W-only algorithm. If a release operation was performed in the subtree rooted at p, then the operation might be able to perform the allocate deeper in the tree.
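
The upgrade-and-revalidate step can be summarized as a control-flow sketch, under the assumptions stated here: the lock primitives and the two W-only entry points are assumed names (r_unlock releases a shared lock, u_lock performs the R-to-U upgrade, w_alloc_from() finishes the allocation below p with the W-only code, and w_only_alloc() is a full W-only pass from the anchor), and pnode is the layout sketched for the W-only descent.

```c
extern void  r_unlock(pnode *n);
extern void  u_lock(pnode *n);        /* upgrade R -> U, may block    */
extern void *w_alloc_from(pnode *p, size_t request);
extern void *w_only_alloc(pnode *anchor, size_t request);

void *rwu_upgrade(pnode *anchor, pnode *p, pnode *n, size_t request)
{
    r_unlock(n);   /* drop n first: two upgraders at p would otherwise */
                   /* deadlock on each other's R lock at n             */
    u_lock(p);     /* granted before any queued W lock, so p still     */
                   /* exists when the lock is granted                  */
    if (p->lsize < request && p->rsize < request) {
        /* A concurrent allocate shrank the subtree: restart from the
         * anchor with the W-only algorithm so the operation
         * terminates -- the point marked (1) in the appendix. */
        return w_only_alloc(anchor, request);
    }
    return w_alloc_from(p, request);  /* some child still fits */
}
```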

Release: An RWU release operation is produced by transforming the W-only release operation in a manner similar to that used to create the RWU allocate operation. The release operation searches the tree for the position to take an action using R locks, then places a U lock on the parent of the node that is modified. When the U lock is granted on p, the release operation performs the W-only algorithm starting at p. As in the case of the allocate operation, other release or allocate operations might require further searching (if n is no longer smaller than the released block, or is no longer a neighbor of the released block). However, since the size of p is guaranteed to remain the same, the place to perform the release action is guaranteed to be below p, so the release operation never restarts.

3.2.1. Correctness

The W-only algorithm orders all operations by the time at which they lock the anchor. The RWU algorithm gains concurrency by loosening the operation ordering, and as a result one operation might overtake another while they read the data structure.


In order to analyze the RWU algorithm, we need to establish an ordering between operations that have read-write conflicts on a node.

Definition. We will say that O1 precedes O2 at node n if O1 places lock L1 on n before O2 places lock L2 on n, and

1. L1 is a U or W lock, and O2 never places a U or W lock on n, or

2. O1 never places a U or W lock on n, and L2 is a U or W lock, or

3. both L1 and L2 are U or W locks.

By using this definition of operation ordering, the rules of tree locking still apply, and the algorithm can be seen to be correct.

3.2.2. Lock Implementations

The performance of the search structure algorithms depends on the overhead due to the lock queues. In this section, we examine some methods for low-overhead synchronization in the fast-fits tree.

W-only: In the W-only algorithm, multiple processes can contend for a lock on the anchor. Spin locking for a critical section can saturate the network of a shared-memory multiprocessor.(34,35) Solutions include test-and-test-and-set spin locking, which uses cache coherence to reduce network traffic, and the contention-free MCS lock.(36) The memory architecture might directly support exclusive locks. For example, the Kendall Square KSR1(37) allows the user to access a cache line in "atomic" mode, which prevents cache invalidations.

On any nonanchor node in the fast-fits tree, there can be at most one lock waiting in any lock queue (this follows from the lock-coupling protocol). As a result, nonanchor nodes can be locked by setting a lock bit. When an operation needs to lock a new node, it checks to see if the lock bit is set. If so, the operation spins, waiting for the bit to be reset. The first access to the lock bit goes over the network; subsequent accesses are satisfied by the local cache. The spinning processor is informed that the lock bit is reset by the cache coherence protocol. When the spinning processor finds that the lock bit is reset, it sets the bit and continues its accesses.
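A C11 sketch of this test-and-test-and-set lock bit follows. The waiting loop spins on plain loads, which are served from the local cache until the coherence protocol invalidates the line, matching the behavior described above; the type and function names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } lockbit;

static void lockbit_acquire(lockbit *l)
{
    for (;;) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;                                   /* spin in-cache      */
        if (!atomic_exchange_explicit(&l->locked, true,
                                      memory_order_acquire))
            return;                             /* bit was clear: won */
    }
}

static void lockbit_release(lockbit *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```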

RWU: Assuming that U locks are available to the implementor simplifies the description of the algorithm. However, their semantics are unusual, so that while R and W locks should be available, it is unlikely that U locks will be available. We next describe a simple method to implement the equivalent of U locks by using R and W locks and examining the lock queue.


For this implementation, we assume (at first) a spin-lock implementation of shared and exclusive locks in which the head of the queue can be read by the processes (such as the one described by Mellor-Crummey and Scott(36)). We will call the shared locks provided by the MCS algorithm r locks and the exclusive locks w locks. The key to the simulation is the observation that in the nonanchor nodes, at most one W lock will be in a node's lock queue at a time. At the anchor, there are no W locks.

The processes use the following protocol to place locks:

1. To place an R lock, the operation places an r lock.

2. To upgrade from an R lock to a U lock, the process requests a w lock, then releases its r lock.

3. To set a W lock, the process enqueues a w lock. After obtaining the w lock, the process checks to see if its lock is the only one in the queue. If so, the process proceeds. Otherwise, the w lock is released, and the operation places another w lock.

Since an operation that places a W lock must have a lock on the parent of the node, no more operations will join the node's lock queue, so the W-locking operation won't starve. If a process obtains a W lock on a node and there are other locks in the node's queue, the other lock requests must be U locks. By releasing and relocking the node, the W-locking operation flushes out the U locks and gives them priority. If the size of the lock queue cannot be determined, but locks are granted in FCFS order, then the writer always releases the first w lock, then obtains and uses the second w lock. Again, this protocol has the effect of flushing any existing U locks.
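The FCFS fallback of step 3 can be written compactly, as sketched below. The lock type and functions are assumed wrappers (e.g., around MCS-style queue locks granted in FCFS order), not a real API; the release-and-relock flushes any upgrading (U) requests queued behind the first w lock, giving them priority as the text requires.

```c
typedef struct queue_rwlock queue_rwlock;  /* opaque FCFS r/w lock   */
extern void qrw_wrlock(queue_rwlock *lk);
extern void qrw_unlock(queue_rwlock *lk);

void place_W_lock(queue_rwlock *lk)
{
    qrw_wrlock(lk);  /* first w lock: U requests may sit behind it    */
    qrw_unlock(lk);  /* release it, letting queued upgraders through  */
    qrw_wrlock(lk);  /* second w lock: now safe to modify the node    */
}
```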

4. PERFORMANCE

We studied the performance of the concurrent fast-fits algorithms to answer several questions about the value of implementing the algorithms:

• What is the performance of the parallel-access memory managers as compared to each other and to a serial fast-fits memory manager?

• What performance benefit do the different optimizations give?

• What is the memory efficiency of the parallel-access memory manager?

We wrote a concurrent fast-fits simulator to study the performance of the algorithms and their optimizations. The simulator starts with all of memory free. We ran the simulator on two types of workloads: synthetic and trace-driven.


In the synthetic workload, allocate requests arrive according to a Poisson process. The allocated block is released after an exponentially distributed length of time. Each node access time is exponentially distributed. Each simulation ran until 100,000 blocks had been allocated and released, and statistics were collected after 20,000 blocks had been allocated and released. The 95% confidence intervals on the response time data are well within ±2%.

To generate a workload for the trace-driven simulations, we instrumented the code of two multifrontal sparse matrix factorization algorithms: MUPS(38) and AFstack.(39) Conventional sparse matrix factorization algorithms perform the factorization in place. Unfortunately, this approach creates a widely scattered memory reference pattern, which results in poor performance. Frontal algorithms gather the nonzeros of the sparse matrix into dense kernels which are factored using BLAS routines.(40) Multifrontal algorithms factor many dense kernels simultaneously. Each front must be allocated a block of memory to hold the dense matrix kernel and the supporting data structures. After a front is factored, part of the factorization is used as the results, and part is passed on as a contribution to other fronts. When all contributions of a block have been incorporated into their destination fronts, the memory that holds the factored front can be released. Because of their heavy use of dynamic memory, sparse matrix factorization algorithms are a good source of memory use traces.

In a parallel implementation, each dense kernel is a task to be executed. The instrumented code recorded the memory allocation and deallocation performed by each task, and also the task execution time and task dependencies. This information was fed to a task graph scheduler which simulated the execution of a specified number of processors on the task graph. The task graph scheduler assigned request and release times to the blocks of memory used in the factorization algorithms by correlating the requests to the times at which the tasks are scheduled. The resulting trace is a realistic simulation of the memory request pattern of a parallel sparse matrix factorization algorithm.

4.1. Synthetic Workload

The first set of experiments is intended to be an application-independent and an architecture-independent study. The vagaries of the request patterns generated by applications and the performance details of particular architectures make such a study valuable. For this reason we use a synthetic workload and an idealized machine.

Concurrency: Our first set of experiments compares the performance of the different algorithms. Figs. 7 and 8 show the response times of the allocate and the release operations, respectively, with an increasing arrival rate. The increasing arrival rate can be interpreted as more frequent requests to the memory manager by each parallel process, or as an increasing number of parallel processes. Each node access requires an expected one time unit. The total memory size is 2M (2^21) words, and the average request size is 500 words. Half of the request sizes were chosen from a uniform distribution, and half from an exponential distribution (truncated to the memory size). A block is released an expected 5000 time units after being allocated.

Fig. 7. Allocate response time vs. allocate operation arrival rate (RWU/Find, W-only/Find, W-only, serial).

The first algorithm that we simulate is the W-only algorithm that does not use the find procedure. The response time of the allocate operations is directly measurable, but a release operation might make several passes. The curve for the W-only algorithm in Fig. 8 is the response time for a single pass of a release operation multiplied by the average number of passes per block release. In this experiment, each block release requires about 1.94 passes.

The large number of release operations per block deallocation makes the W-only algorithm inefficient, and decreases the rate at which operations can be processed due to contention for the root. We ran simulations of the W-only algorithm that use the find operation (W-only with Find), and plot the results. The comparison of the allocate and release response times shows that using the find operation significantly decreases the response times and increases the maximum throughput. The W-only with Find algorithm requires a maximum of 1.36 release operations per block release, which accounts for the increased maximum throughput and for the bulk of the decrease in the response time of the release operation.

Fig. 8. Release response time vs. allocate operation arrival rate (RWU/Find, W-only/Find, W-only, serial).

Since the find operation clearly improves performance, we only implemented the RWU algorithm with finds, which we plot in Figs. 7 and 8 also. The response time of the RWU algorithm contains very little waiting time. The increase in response time is due to an increase in the number of nodes examined per path. The response curve ends well before the apparent onset of a serialization bottleneck. We could not get our simulator to successfully complete with a higher arrival rate, because of excessive lock queuing. When the simulator starts, the tree is small, and operations place W locks near the root. So, contention is high during the startup transient, although the contention disappears when the tree becomes large (the steady state).

For comparison, we implemented a serial fast-fits memory manager by retaining the anchor lock in the W-only with Find algorithm for the duration of the operation.


We plot the performance of the serial algorithm in Figs. 7 and 8. All three concurrent algorithms compare favorably to executing the fast-fits algorithm in a critical section. The W-only with Find algorithm can support an access rate about five times that of the serial algorithm. This can be expected because the serial algorithm holds the anchor lock for about five times longer than the W-only algorithm does. The improvement of the W-only algorithm over the serial algorithm increases as the number of free blocks increases (due to the increase in execution time). The access rate for the RWU algorithm can be considerably larger still.

We note that in our comparison of the serial and the parallel access algorithms, we did not model the additional times required to obtain locks. However, the W-only algorithm, like the serial algorithm, requires only one full lock. The synchronization of the remaining nodes is performed when the operation reads the node. Therefore, locking in the W-only algorithm does not impose a significant overhead as compared to the serial algorithm. Depending on the architecture, the RWU algorithm might require a software reader/writer lock at every node, and may create more problems than it solves. However, the hardware might provide inexpensive reader/writer locks.(37)

On a related note, we are interested in how well balanced the fast-fits tree is. When we ran the simulations of the RWU algorithm, we collected the average number of free blocks and the average number of blocks that an operation accessed. We plot the average path length of the allocate and release operations against the log of the average number of free blocks in Fig. 9. The plots are nearly linear, showing that the execution time of the fast-fits operations grows proportionally to the log of the number of free blocks. We find that the path length of the allocate operations is about k_a log_1.82(F), and the path length of one of the passes of the release operation is k_r log_1.7(F), where F is the number of free blocks.

RWU 2-restart: We investigated the benefit of two different optimizations. The first is the use of the find operation, which has already been discussed. The second optimization concerns the actions taken when a release operation finds that it must start over. We recommended that the release operation use the W-only algorithm on its second pass, to ensure that it completes in a finite number of steps.

We implemented the RWU 2-restart optimization, in which an allocate operation uses the W-only algorithm the second time that it must restart. The 1-restart algorithm requires that 0.9% of allocate operations use the W-only algorithm when the arrival rate is 0.34, while the 2-restart requires that only 0.053% of the allocate operations use the W-only algorithm at the same arrival rate. The restart rate is very small for the 1-restart algorithm, and the 2-restart optimization had little effect on performance, so we don't plot the results.


Fig. 9. Path length vs. number of tree nodes (RWU algorithm); allocate and release path lengths, plotted against the number of free blocks.

Fig. 10. Comparison of space efficiency for different algorithms: probability of abort vs. percentage of memory used, under uniform and exponential block size distributions, for RWU/random fit, W-only/random fit, RWU/better fit, and first fit.


Memory Efficiency: The primary motivation for using a parallel fast-fits algorithm over using a parallel buddy system is the superior space utilization of fast-fits. We ran a set of experiments that tested the memory efficiency of the algorithms. The arrival rate is held constant at 0.05, and the request size is varied. The time between the allocation and the deallocation of a block is an expected 7500 time units. We increased the request size to simulate an increasing demand for memory, and plotted the abort rate due to lack of memory in Fig. 10.

We plot the abort rate against the expected amount of memory used (the actual peak memory use was somewhat higher). The random fit algorithm that we used provided about 80% memory efficiency. For comparison, we tested the memory efficiency of the more standard better fit algorithm, and found that its memory efficiency was slightly lower. Finally, we tested the memory efficiency of the serial first fit algorithm, and found that it had a negligibly lower abort rate. We conclude that the parallel fast-fits memory manager has the same memory efficiency as the serial fast-fits memory manager. Bozman et al. (15) and Stephenson (18) have shown that the fast-fits memory manager has a memory efficiency which is slightly lower than that of the first fit algorithm.
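The fit policies differ only in how the descent chooses between subtrees that contain a large-enough block. The sketch below contrasts the two choices; the node layout (subtree-maximum fields) is our assumption, not necessarily Stephenson's data structure:

/* Sketch of random fit vs. better fit; node layout is hypothetical. */
#include <stdlib.h>

struct tnode {
    size_t size;                  /* size of the free block stored here      */
    size_t max_left;              /* largest free block in the left subtree  */
    size_t max_right;             /* largest free block in the right subtree */
    struct tnode *left, *right;
};

/* Choose which subtree to search for a request of `want` words.
   Random fit: pick randomly among the subtrees that can satisfy it.
   Better fit: pick the subtree whose largest block is smaller but
   still sufficient, to reduce fragmentation. */
static struct tnode *pick_child(struct tnode *n, size_t want, int better_fit)
{
    int lok = n->left  && n->max_left  >= want;
    int rok = n->right && n->max_right >= want;

    if (lok && rok) {
        if (better_fit)
            return (n->max_left <= n->max_right) ? n->left : n->right;
        return (rand() & 1) ? n->left : n->right;   /* random fit */
    }
    return lok ? n->left : (rok ? n->right : NULL);
}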

4.2. Trace-Driven Simulation

In order to verify the performance characteristics of the algorithms, we tested them on a trace-driven simulation. We instrumented the AFstack and the MUPS algorithms, and executed them on gemat11, a matrix derived from electrical power simulation. (41) We used the resulting traces to test the serial fast-fits algorithm, the W-only algorithm, and the RWU algorithm (each using the find optimization) with between 8 and 128 processors.

The algorithms tended to request small blocks: 75% of the requests were for 20 words or less. However, the request size distribution has a long tail; ten percent of the requests were for 40 words or more, and five percent were for more than 100 words. Thus, a segregated storage technique can help to improve performance, but a general heap memory manager is still required. Due to the difficulty of allocating memory blocks in FORTRAN, the data structures that comprise a front are allocated separately in both algorithms, which tends to increase the number of requests and to exaggerate the number of small requests. A more efficient approach would allocate all memory required for a front in a single request, since all components are allocated and deallocated simultaneously.
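The suggested improvement amounts to carving all of a front's arrays out of one request. A minimal sketch in C, with a hypothetical front layout:

/* Hypothetical frontal-matrix record: in the traced codes each array
   was requested separately; here one request serves them all, since
   the pieces are allocated and deallocated together. */
#include <stdlib.h>

struct front {
    double *values;      /* nrow * ncol numerical entries */
    int    *row_index;   /* nrow row indices              */
    int    *col_index;   /* ncol column indices           */
};

static int front_alloc(struct front *f, size_t nrow, size_t ncol)
{
    size_t bytes = nrow * ncol * sizeof(double)
                 + nrow * sizeof(int) + ncol * sizeof(int);
    char *base = malloc(bytes);            /* one request, not three */
    if (!base) return -1;
    f->values    = (double *)base;
    f->row_index = (int *)(base + nrow * ncol * sizeof(double));
    f->col_index = f->row_index + nrow;
    return 0;
}
/* A single free(f->values) releases the whole front. */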

The average response time of fast fits memory managers using the AFstack trace is shown in Fig. 11, and the performance using the MUPS trace is shown in Fig. 12. These results verify the conclusions reached in the


[Figure omitted: AFstack trace. Allocate response time (y-axis, 0 to 250) plotted against number of processors (x-axis, 8 to 128), with curves for the serial, W-only, and RWU algorithms.]

Fig. 11. Trace-driven performance comparison using the AFstack trace.

[Figure omitted: MUPS trace. Allocate response time (y-axis, 0 to 40) plotted against number of processors (x-axis, 8 to 128), with curves for the serial, W-only, and RWU algorithms.]

Fig. 12. Trace-driven performance comparison using the MUPS trace.


previous section. The serial algorithm is a severe serialization bottleneck, the W-only algorithm has good performance under all but the highest degrees of parallelism, and the RWU algorithm permits very highly concurrent access.

5. CONCLUSION

We present two concurrent memory managers based on the fast-fits algorithm. Both algorithms have good space efficiency and fast response times, and are easily implemented. The W-only algorithm supports moderate concurrency levels, and the RWU algorithm supports high concurrency levels. The parallelism available through the RWU algorithm increases as the size of the fast-fits tree increases. Both concurrent algorithms provide a significantly higher throughput than a serial fast-fits algorithm, and their improvement increases as the tree size increases.

We examined several possible optimizations and found that the use of the find operation yields a significant improvement, while allowing multiple restarts has little benefit. Our random fit allocation strategy has slightly better space efficiency than Stephenson's better fit, and both return about 80% space efficiency on the workload that we applied. The serial first-fit algorithm returns the same memory utilization, indicating that parallelism does not incur a storage-efficiency penalty.

The concurrent fast-fits memory managers offer a good alternative to the existing concurrent buddy or concurrent free list algorithms. Concurrent buddy systems (25) are a viable option if space efficiency is not an issue. However, most applications expand to fill the available space. We feel that free list algorithms have serious drawbacks in a parallel environment. Locking the free list will create a serial bottleneck due to the lengthy execution times. Parallel-access free lists require that each operation place a large number of locks. In addition, free list algorithms have poor memory reference patterns, which can cause excessive cache invalidations and network traffic. (6)

In contrast to a buddy system, fast-fits has good space utilization. In addition, a fast-fits operation accesses only a few nodes, so that it doesn't generate excessive numbers of widely dispersed shared-memory references. A fast-fits memory manager that is protected by a lock can make a good shared-memory manager. A simple modification to the serial fast-fits algo- rithm produces the parallel-access W-only algorithm, which produces significantly less contention than the serial algorithm. We found that the W-only algorithm supports an access rate five times that provided by the serial algorithm, on relatively small free lists. For even greater concurrency we provide the RWU algorithm, which makes use of shared locks to remove the root bottleneck. In addition, the RWU algorithm allows the


cache lines that store the tree to be accessed in shared mode, reducing cache invalidations and improving the cache hit rate. If software reader/writer locks must be used, the overhead of setting locks is likely to make the RWU algorithm impractical. However, modern memory architectures can permit inexpensive reader/writer locks. For example, the Kendall Square KSR1 allows shared and exclusive access to subpages in the "lock" mode. (37) As a result, setting read and write locks costs the same as making read and write memory accesses.

ACKNOWLEDGMENTS

We would like to thank C. J. Stephenson for discussing his fast-fits memory manager with us.

REFERENCES

1. T. A. Davis and P. C. Yew, A nondeterministic parallel algorithm for general unsymmetric sparse LU factorization, SIAM J. Matrix Anal. Appl. 11(3):383-402 (1990).
2. I. S. Duff, Multiprocessing a sparse matrix code on the Alliant FX/8, J. Comp. Appl. Math. 27:229-239 (1989).
3. F. J. Roeber, Raytheon Submarine Signals Division, Personal communication (1991).
4. B. Bigler, S. Allan, and R. Oldehoeft, Parallel dynamic storage allocation, Proc. Int'l. Conf. on Parallel Processing, pp. 272-275 (1985).
5. I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, Oxford (1986).
6. D. Grunwald, B. Zorn, and R. Henderson, Improving the cache locality of memory allocation, SIGPLAN Conf. on Programming Language Design and Implementation, pp. 177-186 (1993).
7. B. Zorn and D. Grunwald, Empirical measurements of six allocation-intensive C programs, ACM SIGPLAN Notices 27(2):71-80 (1992).
8. A. Gottlieb and J. Wilson, Parallelizing the usual buddy algorithm, Ultracomputer System Software Note 37, Courant Institute (1982).
9. T. Standish, Data Structures Techniques, Addison-Wesley (1980).
10. U. Manber, On maintaining dynamic information in a concurrent environment, SIAM Journal on Computing 15(4):1130-1142 (1986).
11. D. Kotz and C. S. Ellis, Evaluation of concurrent pools, Proc. Int'l. Conf. on Distrib. Comput. Syst., pp. 378-385 (1989).
12. S. J. Eggers and T. E. Jeremiassen, Eliminating false sharing, Proc. Int'l. Conf. on Parallel Processing, pp. 377-381 (1991).
13. C. S. Ellis and T. Olson, Concurrent dynamic storage allocation, Proc. Int'l. Conf. on Parallel Processing, pp. 502-511 (1987).
14. D. Knuth, The Art of Computer Programming, Volume 1, Addison-Wesley (1968).
15. G. Bozman, W. Buco, T. P. Daly, and W. H. Tetzlaff, Analysis of free storage algorithms--revisited, IBM Systems Journal 23(1):44-64 (1984).
16. J. L. Peterson and T. A. Norman, Buddy systems, Comm. of the ACM 20(6):421-431 (1977).
17. C. J. Stephenson, Fast fits: New methods for dynamic storage allocation, Proc. of the Ninth ACM Symp. on Oper. Syst. Principles, pp. 30-32 (1983).


18. C. J. Stephenson, Fast fits: New methods for dynamic storage allocation, Technical report, IBM T. J. Watson Research Center, Yorktown Heights, New York (1983).
19. As noted in the SunOS 4.1.2 malloc man page.
20. H. Stone, Parallel memory allocation using the fetch-and-add instruction, Technical Report RC 9674, IBM T. J. Watson Research Center, Yorktown Heights, New York (1982).
21. A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors, ACM Trans. on Programming Languages and Systems 5(2):164-189 (1983).
22. R. Ford, Concurrent algorithms for real time memory management, IEEE Software 5(5):10-23 (September 1988).
23. A. Gottlieb and J. Wilson, Using the buddy system for concurrent memory allocation, Ultracomputer System Software Note 6, Courant Institute (1981).
24. J. Wilson, Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add, Ph.D. thesis, NYU (1988).
25. T. Johnson and T. Davis, Parallel buddy memory management, Parallel Processing Letters 2(4):391-398 (1992).
26. R. Bayer and M. Schkolnick, Concurrency of operations on B-trees, Acta Informatica 9:1-21 (1977).
27. P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley (1987).
28. J. Vuillemin, A unifying look at data structures, Commun. of the ACM 23(4):229-239 (1980).
29. C. Aragon and R. Seidel, Randomized search trees, Proc. of the 30th Symp. on the Foundations of Computer Science, pp. 540-545 (1989).
30. T. Johnson and D. Shasha, The performance of concurrent data structure algorithms, ACM Trans. on Database Systems, pp. 51-101 (March 1993).
31. T. Johnson, A concurrent fast-fits memory manager, Technical Report TR91-009, available at anonymous ftp site ftp.cis.ufl.edu:/cis/tech-reports/tr91/tr91-009.ps.Z, University of Florida, Department of CIS (1991).
32. D. Shasha and N. Goodman, Concurrent search structure algorithms, ACM Trans. on Database Systems 13(1):53-90 (1988).
33. M. Herlihy and J. Wing, Linearizability: A correctness condition for concurrent objects, ACM Trans. on Programming Languages and Systems 12(3):463-492 (1990).
34. T. E. Anderson, The performance of spin lock alternatives for shared memory multiprocessors, IEEE Trans. on Parallel and Distrib. Syst. 1(1):6-16 (1990).
35. R. R. Glenn, D. V. Pryor, J. M. Conroy, and T. Johnson, Characterizing memory hotspots in a shared memory MIMD machine, Supercomputing, pp. 554-566, IEEE and ACM SIGARCH (1991).
36. J. M. Mellor-Crummey and M. L. Scott, Synchronization without contention, Fourth Int'l. Conf. on Architect. Support for Programming Languages and Oper. Syst., pp. 269-278 (1991).
37. Kendall Square Research, 170 Tracer Lane, Waltham, Massachusetts 02154-1379, KSR1 Principles of Operation (1992).
38. I. S. Duff and J. K. Reid, The multifrontal solution of unsymmetric sets of linear equations, SIAM J. Sci. Statist. Comput. 5(3):633-641 (1984).
39. T. A. Davis, A combined unifrontal/multifrontal method for unsymmetric sparse matrices, Proc. of the Fifth SIAM Conf. on Applied Linear Algebra, Snowbird, Utah, pp. 413-417 (1994).
40. J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of level-3 basic linear algebra subprograms, ACM Trans. on Math. Software 16:1-17 (1990).
41. I. S. Duff, R. G. Grimes, and J. G. Lewis, Sparse matrix test problems, ACM Trans. Math. Software 15:1-14 (1989).