

Concurrent programming for dummies (and smart people too)

Tim Harris
University of Cambridge Computer Laboratory

J J Thomson Avenue, Cambridge, UK, CB3 0FD. Tel: +44 1223 334476. [email protected]

Abstract—Concurrent programming is notoriously difficult. Current abstractions are intricate to use and make it difficult to design computer systems that are reliable and scalable. They require programmers to commit to a particular locking discipline at an early stage. We argue in favour of a declarative style of concurrency control in which programmers directly indicate the safety properties that they require, rather than how to enforce them.

Our alternative stands to be easier to use while also delivering higher performance. It avoids the problems of priority inversion and deadlock. It can be readily introduced into mainstream languages such as Java and C++. Furthermore, it can be implemented efficiently and can make it easier to exploit the hardware parallelism available in today's commodity SMP and SMT systems.

    I. INTRODUCTION

Computing hardware has developed substantially since the first workshop in this series in 1987. As at any time over those last 15 years, increases in processor speed, network bandwidth, storage capacities and so on are frequently reported. However, beyond such quantitative improvements, the setting in which execution occurs has more fundamentally changed. Concurrent execution and distributed computing are now commonplace rather than exotic; even single-CPU machines now use simultaneous multi-threading (SMT) to provide parallelism.

Despite this, there has been little development in mainstream techniques for writing concurrent applications. POSIX, Win32 and the Java and CLR virtual machines have converged on priority-based multi-threaded execution controlled by mutual exclusion locks and condition variables. Is this because that is the right choice? Is it because these are convenient, powerful and readily-understood abstractions? The answer to both of those questions is no.

In Section II we present evidence in support of this assertion. Then, in Section III we show what should be done about it. Readers may experience a sense of déjà vu at that point: we propose Hoare's conditional critical regions (CCRs) [9]. We go on to discuss in Section IV how modern techniques finally make this attractive construct amenable to efficient implementation; a statement we substantiate with initial performance measurements.

    II. WHY CURRENT ABSTRACTIONS FAIL

A change is needed in the tools we have for writing concurrent systems, whether at the level of individual applications, at the level of large programs and application servers, or at the level of operating systems and their constituent device drivers and protocol implementations.

While mutual-exclusion locks can readily be used to enforce safety properties, they make it hard to ensure good progress. Programmers must decide what granularity of locking is appropriate. Protecting large data structures with a single lock makes programming easier but reduces the parallelism that can be exploited. Using many smaller locks may allow better parallelism, but leads to intricate code which spends much of its time juggling locks. The optimal selection depends on the system's workload, meaning that an informed decision is difficult in operating systems and library code.

Any system using mutual-exclusion locks must be designed to avoid deadlock. Engineering rules such as defining an order in which locks must be acquired are difficult to apply in any large system because they require global knowledge.

Furthermore, existing abstractions are entangled and difficult to explain in isolation. Condition variables cannot be understood separately from mutual exclusion locks; a "wait" operation on a condition variable must be combined atomically with a "release" on a lock. The semantics of a "notify" operation differ between systems – the programmer must understand exactly how many threads may (or must) be woken and how their subsequent scheduling interacts with that of the notifier. The same questions arise in systems offering a monitor abstraction.

A programmer using priority-based scheduling and mutual exclusion locks must understand the problem of priority inversion. Some cases can be handled by more sophisticated scheduling (e.g. priority inheritance), others again require global knowledge (e.g. a priority ceiling protocol).

The root of these problems is the imperative style in which existing facilities for concurrency control are exposed to programmers. The resulting hand-compilation into these operations obfuscates the safety and progress properties that the programmer is actually trying to provide and commits the code, at the point at which it is written, to following a particular locking discipline.

    III. BACK TO CONDITIONAL CRITICAL REGIONS

The observation that existing concurrency-control features are difficult to use is far from novel. In fact, standard texts on concurrent programming often introduce the subject by showing examples written using apparently unrealistic language constructs before introducing the features that are currently available [2], [4].

In specifying concurrent systems, notation such as ⟨await B → S⟩ is common, indicating a conditional critical region (CCR) containing statements S that should execute atomically when a boolean condition B is satisfied. CCRs could be exposed in a modern language by introducing a new keyword, say atomically, to group together blocks of statements and to provide conditions for their execution.

To illustrate the power of this construct, consider the implementation of a single-cell shared buffer. An operation to store a value into it would simply be:

while (!done) {
  atomically (!full) {
    full = true;
    value = new_value;
    done = true;
  }
}

The idea is that atomically (cond) statements evaluates the condition cond and, if true, executes the statements, all as-if atomically. If cond is false then the thread blocks until the condition may have become true. The condition can be an arbitrary boolean expression, accessing shared fields, invoking methods and so on. Compare this with an alternative based on mutual exclusion locks and condition variables: we no longer have to select which lock will protect the shared fields and we no longer have to be concerned with lost-wake-up problems.
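For comparison, the same single-cell buffer can be written in the conventional lock-based style. The following is a minimal Java sketch of that idiom (the class and field names are ours, for illustration only): the "wait" must occur while holding the monitor lock, the guard must be re-checked in a loop, and notifyAll must be chosen over notify to avoid lost wake-ups.

class LockBasedCell {
  private int value;
  private boolean full = false;

  // Store a value, blocking while the cell is full.
  public synchronized void put(int newValue) throws InterruptedException {
    while (full) {        // guard re-checked after every wake-up
      wait();             // atomically releases the monitor lock
    }
    value = newValue;
    full = true;
    notifyAll();          // notify() alone risks a lost wake-up
  }

  // Remove the value, blocking while the cell is empty.
  public synchronized int take() throws InterruptedException {
    while (!full) {
      wait();
    }
    full = false;
    notifyAll();
    return value;
  }
}

Each of these details is a place to introduce a lost-wake-up or deadlock bug; the atomically form leaves them all to the implementation.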

Unfortunately, no efficient implementation has previously been known for general CCRs [3]. There are two reasons for this. Firstly, the statements may be arbitrary and so the appearance of atomicity can only be guaranteed by actually serializing execution. The second problem is that every thread that is waiting on a condition must re-evaluate it every time that anything it could have depended on has been updated. Again, there is no association between the updates made in one critical region and the need to re-evaluate other specific conditions.

We have solved both of these problems. Building on our work on non-blocking algorithms and, in particular, on the abstraction of a software transactional memory, we have built a practical implementation of CCRs. We introduce these topics in Sections III-A and III-B respectively, before showing how they can be extended to support general CCRs in Section III-C.

    A. Non-blocking algorithms

Over recent years we have been working, alongside other research groups, on practical non-blocking algorithms for shared-memory systems. These algorithms are designed to work correctly in concurrent systems without relying on high-level features such as mutual exclusion locks and condition variables. Instead, they are built directly from the individual operations that a particular processor guarantees to be atomic – principally from the word-sized memory reads, memory writes and compare-and-swap operations that are provided (or can be built) on all mainstream processor families.
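As a concrete example of this style, the classic lock-free stack due to Treiber can be built from nothing but an atomic compare-and-swap on the top-of-stack pointer. The Java sketch below (our illustration, using java.util.concurrent.atomic rather than raw processor operations) shows the characteristic read-modify-CAS retry loop:

import java.util.concurrent.atomic.AtomicReference;

class LockFreeStack<T> {
  private static final class Node<T> {
    final T item;
    final Node<T> next;
    Node(T item, Node<T> next) { this.item = item; this.next = next; }
  }

  private final AtomicReference<Node<T>> top = new AtomicReference<>();

  public void push(T item) {
    Node<T> oldTop;
    Node<T> newTop;
    do {
      oldTop = top.get();
      newTop = new Node<>(item, oldTop);
    } while (!top.compareAndSet(oldTop, newTop));   // retry if another
  }                                                 // thread won the race

  public T pop() {
    Node<T> oldTop;
    do {
      oldTop = top.get();
      if (oldTop == null) {
        return null;                                // stack is empty
      }
    } while (!top.compareAndSet(oldTop, oldTop.next));
    return oldTop.item;
  }
}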

Non-blocking algorithms can make strong progress guarantees, either on a per-thread basis for real-time behaviour [6], or on a system-wide basis to preclude deadlock and priority inversion [12], [8]. Massalin and Pu showed the value of these techniques in the Synthesis kernel [10].

Several powerful general-purpose non-blocking algorithms have been developed. We recently developed a multi-word compare-and-swap operation (NCAS) for shared-memory systems that reads the contents of a series of locations, compares these against specified values and, if they all match, updates the locations with a further set of values [5]. All of this is performed as-if atomically. NCAS is a useful abstraction because it allows a data structure to be updated from one consistent state to another by constructing a multi-word update that acts on the locations involved.
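Rendered as a Java-style signature, the operation might look as follows; this is a hypothetical sketch for illustration, since the interface in [5] is expressed over raw memory words and may differ:

interface NCas {
  // Atomically: if addrs[i] holds expected[i] for every i, store
  // update[i] into each addrs[i] and return true; otherwise leave
  // every location unchanged and return false.
  boolean ncas(long[] addrs, long[] expected, long[] update);
}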

For a non-contended lock-free update to N locations our published algorithm uses 3N + 1 word-sized CAS operations and requires 2 bits to be reserved in each location that may be updated. Subsequent refinements reduce this to 2N + 1 and a single reserved bit – usually an otherwise-zero bit in an aligned pointer value. Herlihy, Luchangco and Moir developed an even more streamlined obstruction-free NCAS implementation [7].

We believe that multi-word atomic update operations, such as NCAS, should form the main concurrent programming abstractions in computer systems. They provide an attractive distinction between the application's responsibility for setting out what update should be made and the implementor's responsibility for selecting an appropriate progress guarantee. They allow complex concurrent algorithms to be implemented without having to identify particular locks to protect particular data (and without the risk that this identification is done incorrectly).

So, should NCAS itself be provided as an alternative to mutual exclusion locks and condition variables? We believe not, as a number of problems remain. Firstly, NCAS is still a somewhat intricate abstraction to program with, requiring re-implementations of existing data structures. Secondly, although it allows multi-word updates to be made as-if atomically, it does not provide any mechanism for a thread to block – for example if it attempts to remove an item from a buffer which is currently empty. Implementations using NCAS would have to revert to polling.
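For example, a consumer built directly over NCAS has no way to sleep until the buffer fills; it can only retry, as in the sketch below (removeWithNcas stands in for a hypothetical single NCAS-based removal attempt):

// Busy-waits whenever the buffer is empty, wasting CPU time.
Object takeByPolling() {
  while (true) {
    Object item = removeWithNcas();  // one attempt; null if buffer empty
    if (item != null) {
      return item;
    }
    Thread.yield();                  // spin until an item appears
  }
}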

    B. Improving transparency

The first of these problems with NCAS can readily be addressed by providing a higher-level software transactional memory (STM) interface built over the same multi-word update algorithm [11]. The STM interface allows memory accesses to be grouped into transactions; essentially the STM keeps track of the accesses that are being made during an operation on a concurrent data structure, rather than requiring that the application does so.

Transaction management:
  void STMStart()
  void STMAbort()
  boolean STMCommit()

An STMStart operation starts a new transaction within the executing thread. STMAbort aborts the transaction in progress by the executing thread. An STMCommit operation attempts to commit the transaction in progress by the executing thread, returning true if the commit succeeds and false if it aborts.

The STM exposes two separate sets of read and write operations. The first set are for use within transactions. The second, external, set are for use by memory accesses occurring outside any transaction.

Memory accesses:
  stm_word STMRead(addr a)
  void STMWrite(addr a, stm_word w)

  stm_word STMExtRead(addr a)
  void STMExtWrite(addr a, stm_word w)
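To make the shape of this interface concrete, here is how a simple atomic transfer between two STM-managed words might be written against it. This is our sketch: it assumes a hypothetical Java binding Stm whose methods mirror the C-style operations listed above.

// Move an amount between two locations, retrying until the commit
// succeeds; a failed STMCommit indicates a conflicting transaction.
void transfer(Stm stm, long from, long to, long amount) {
  boolean done = false;
  while (!done) {
    stm.STMStart();                  // open a new transaction
    long a = stm.STMRead(from);      // transactional reads
    long b = stm.STMRead(to);
    stm.STMWrite(from, a - amount);  // transactional writes
    stm.STMWrite(to, b + amount);
    done = stm.STMCommit();          // true only if no conflict
  }
}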

Building on this STM interface we have developed an extension to the Java programming language which provides transactional updates with a high degree of transparency. This provides the first part of our new atomically keyword, allowing an arbitrary group of statements to be identified for transactional execution. For example, a programmer given access to an ordinary hash-table through the local variable ht could write

atomically {
  if (new_val == NULL) {
    ht.delete (key);
  } else {
    ht.insert (key, new_val);
  }
}

The JVM uses the transactional interface to implement all accesses to potentially-shared data within the atomically block. Method calls are recognized by the bytecode-to-native-code compiler and dispatched to alternate implementations which also perform STM accesses. This means that an existing data structure can be used concurrently simply by wrapping its operations within atomically.
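Conceptually, the compiled form of the hash-table example behaves like the following retry loop over the STM interface (our illustrative desugaring, not the code the JVM actually generates):

boolean done = false;
while (!done) {
  STMStart();                 // one transaction per attempt
  if (new_val == NULL) {
    ht.delete (key);          // dispatched to an STM-aware implementation
  } else {
    ht.insert (key, new_val);
  }
  done = STMCommit();         // a conflicting transaction simply retries
}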

    C. Synchronization

Previous STM implementations have not considered whether threads can block part-way through a transaction. Using explicit condition variables for synchronization would be a shame given that one of the benefits of our atomically construct is that it frees programmers from having to identify explicit locks to protect data structures.

Our approach, which goes beyond what has been done in those existing STMs, is to provide sufficient facilities to structure a complete concurrent program using only the STM abstraction – i.e. allowing it to be used in place of mutual exclusion locks and condition variables, as well as alongside them. In terms of a procedural STM interface this adds just one new operation:

Transaction synchronization:
  void STMWait()

STMWait validates the execution thus far and, if this validation succeeds, it blocks the thread. All of the existing STM implementations have some notion of a thread owning a set of memory locations that it is interested in, and this same mechanism can be used to trigger a wake-up to a thread when its condition should be re-evaluated. Once woken, the transaction is aborted and, in the example shown, will continue around the loop and re-evaluate the condition. This completes the facilities needed for implementing CCRs.
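Putting the pieces together, a CCR of the form atomically (cond) { S } can be realised over the extended interface roughly as follows; this is our sketch of the scheme the text describes:

boolean done = false;
while (!done) {
  STMStart();
  if (cond) {                 // guard evaluated inside the transaction
    // ...the statements S, performed through STMRead/STMWrite...
    done = STMCommit();       // retry from the top on conflict
  } else {
    STMWait();                // validate, block until a location read so
  }                           // far is updated, then abort and loop again
}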

    IV. IMPLEMENTATION STATUS

Our prototype system supporting atomically is based on version 1.2.2 of the Sun Java Virtual Machine for Research. This JVM implementation has already undergone extensive optimization; we are comparing our initial prototype against a best-of-breed system [1]. The STM implementation we have developed has a number of notable features:
- It does not need to reserve any storage space in the locations that may be updated; the STM can hold word-size integer values as well as pointers.
- We can trade off the likelihood that non-conflicting CCRs can execute in parallel against the space overheads of managing the heap.
- Ordinary heap accesses can be implemented using standard memory reads and ordinary memory writes; no penalty is imposed on non-transactional accesses to fields.

Our current implementation is not yet non-blocking; we are extending it to support both obstruction-free updates (allowing one transaction to wrest an ownership record from another) and lock-free updates (allowing one transaction to help another that it encounters to complete).

    A. Results

We can already provide absolute performance that is competitive with lock-based schemes. To illustrate this, we return to the example of shared buffers, each able to contain a single value. The experimental configuration has n threads conceptually arranged in a ring with a shared buffer between each adjacent pair of threads. Of these buffers, t initially contain tokens and the remainder are empty. Each thread then loops a fixed number of times removing an item from the buffer on its right and placing it in the buffer on its left. Figure 1 illustrates this configuration.

Fig. 1. Experimental configuration (n threads in a ring, with t of the buffers initially holding tokens).

We measure the elapsed wall-clock time for each thread to perform a fixed number of iterations, running on an unloaded 4-processor Sun Fire V480 server.

We will examine performance more thoroughly in a forthcoming paper, but here we will briefly present two extreme configurations for this experiment. In the first case we set t = 1 so that, ideally, only one thread should ever be running and the total execution time should scale linearly with the number of threads. In the second case we set t = n so that there are the same number of tokens as there are threads. In principle all of the threads can execute without blocking so long as they operate in lock-step.

Figure 2 shows our results from these initial experiments. In every case the performance of our STM-based atomically construct, implementing general conditional critical regions, is competitive with an implementation based directly on mutual exclusion locks and condition variables. In fact, in many cases we achieve better performance in an absolute sense. Furthermore, unlike the lock-based implementation, our abstraction could be built over a non-blocking STM providing correspondingly stronger progress guarantees. The CPU-time consumption is the same between the STM-based and original implementations.

Fig. 2. Experiment execution time: elapsed time (ms) against number of threads for the STM-based and original lock-based implementations, with (a) t = 1 and (b) t = n.

    V. CONCLUSION AND FUTURE WORK

In this paper we have argued that concurrent programming can be made easier by moving away from the abstractions of locks and condition variables and instead providing facilities that more closely capture the safety properties that a programmer is trying to implement.

We have shown how general conditional critical regions can be supported and how this construct can out-perform a lock-based scheme. This approach makes it substantially easier to write reliable concurrent systems; it is no coincidence that the same construct is frequently used in textbooks and in the specification of concurrent systems.

We should emphasise that although our current implementation is based on the JVM, it can readily be adapted to a C++ or C setting, either using compiler support to provide a similar level of transparency, or simply using programming conventions or operator overloading.

In future work we will provide a more thorough evaluation of our system and implement the full range of non-blocking STMs. In doing so we will provide a system which is not only easier to program, but which allows higher concurrency and theoretically-stronger progress guarantees.

    VI. ACKNOWLEDGEMENTS

This work has been supported by a donation from the Scalable Synchronization Research Group at Sun Labs Massachusetts.

REFERENCES

[1] AGESEN, O., DETLEFS, D., GARTHWAITE, A., KNIPPEL, R., RAMAKRISHNA, Y. S., AND WHITE, D. An efficient meta-lock for implementing ubiquitous synchronization. In Object-Oriented Programming, Systems, Languages & Applications (OOPSLA '99) (Nov. 1999), vol. 34(10) of ACM SIGPLAN Notices, pp. 207–222.

[2] ANDREWS, G. R. Concurrent Programming: Principles and Practice. Benjamin/Cummings Publishing Company, Inc., Redwood City, California, 1991.

[3] ANDREWS, G. R., AND SCHNEIDER, F. B. Concepts and notations for concurrent programming. Computing Surveys 15, 1 (Mar. 1983), 3–43.

[4] BACON, J., AND HARRIS, T. L. Operating Systems: Concurrent and Distributed Software Design, 3rd ed. Addison Wesley, 2003.

[5] HARRIS, T. L., FRASER, K., AND PRATT, I. A. A practical multi-word compare-and-swap operation. In Proceedings of the 16th International Symposium on Distributed Computing (Oct. 2002).

[6] HERLIHY, M. Wait-free synchronization. ACM Transactions on Programming Languages and Systems 13, 1 (Jan. 1991), 124–149.

    [7] HERLIHY, M., LUCHANGCO, V., AND MOIR, M. Obstruction-free software NCAS and transactional memory. To appear.

[8] HERLIHY, M., LUCHANGCO, V., AND MOIR, M. Obstruction-free synchronization: Double-ended queues as an example. To appear.

[9] HOARE, C. A. R. Towards a theory of parallel programming. In Operating Systems Techniques (London, 1972), C. A. R. Hoare and R. H. Perrott, Eds., vol. 9 of A.P.I.C. Studies in Data Processing, Academic Press, pp. 61–71.

[10] MASSALIN, H., AND PU, C. A lock-free multiprocessor OS kernel. Tech. Rep. CUCS-005-91, Columbia University, Department of Computer Science, June 1991.

[11] SHAVIT, N., AND TOUITOU, D. Software transactional memory. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing (Aug. 1995), ACM Press, pp. 204–213.

[12] VALOIS, J. D. Lock-Free Data Structures. PhD thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.