



Introduction to Transactional Memory

Sami Kiminki

2009-03-12

Presentation outline

Contents

1 Introduction

2 High-level programming with TM

3 TM implementations

4 TM in Sun Rock processor

1 Introduction

Motivation

• Lock-based pessimistic critical section synchronization is problematic

• For example

– Coarse-grained locking does not scale well

– Fine-grained locking is tedious to write

– Combined sequences of fine-grained operations must often be converted into a coarse-grained operation, e.g., move an item atomically from collection A to collection B

– Not all problems are easy to scale with locking, e.g., graph updates

– Deadlocks

– Debugging is sometimes very difficult

• Critical section locking is superfluous most of the time

• Obtaining and releasing locks requires memory writes

• Could we be more optimistic about synchronization?
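The "move item from collection A to collection B" case above is a good illustration: with per-collection locks, both locks must be held at once, and acquiring them in inconsistent order across threads can deadlock. A minimal C++ sketch (types and names invented for illustration) that sidesteps the ordering problem with std::scoped_lock:

```cpp
#include <map>
#include <mutex>
#include <string>

// Two independently locked collections; moving an item needs both locks.
struct LockedMap {
    std::mutex m;
    std::map<int, std::string> data;
};

// Deadlock-free move: std::scoped_lock acquires both mutexes together
// using a deadlock-avoidance algorithm. Hand-locking a.m then b.m in two
// threads with opposite orders could deadlock.
inline bool move_item(LockedMap& a, LockedMap& b, int key) {
    std::scoped_lock lock(a.m, b.m);  // coarse-grained combined critical section
    auto it = a.data.find(key);
    if (it == a.data.end()) return false;
    b.data[key] = it->second;
    a.data.erase(it);
    return true;
}
```

The cost is exactly what the slide complains about: two fine-grained operations had to be fused into one coarse-grained critical section.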



The idea of transactional computing

• Optimistic approach

– Instead of assuming that conflicts will happen in critical sections, assume they don't

– Rely on conflict detection: abort and retry if necessary

• If critical section locking is superfluous most of the time, aborts are rare.

– Typically threads manipulate different parts of the shared memory

– Consider, e.g., web server serving pages for different users

High hopes for transactional computing

Some often-pronounced hopes for transactional computing, still with little backing from experimental evidence in real-life implementations:

• Almost infinite linear scalability

• Scalability to “non-scalable” algorithms

• Relaxation of cache coherency requirements ⇒ still more hardware scalability

• Effortless parallel programming

• Fewer and easier-to-solve bugs due to the lack of locks

• Saviour from the parallel programming crisis

Not a silver bullet

• No deadlocks but prone to livelocks

• Not all algorithms can be made parallel even with speculation

• Mobile concerns: failed speculation means wasted energy

• Real-time concerns: predictability

Transactional memory (TM)

• Technique to implement transactional computing

• The idea

– Work is performed in atomic isolated transactions

– Track all memory accesses

– If no conflicts occurred with other transactions, write modifications to main memory atomically at commit

• Conflict

– Memory that has been read is changed before the transaction commits — i.e., the input has changed before the output is produced

– The transaction is aborted, but may later be retried automatically or manually
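As an illustration only (not any real TM system), the track-validate-commit idea above can be sketched with per-cell version counters and a commit-time read-set check; all names here are invented:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Toy transactional memory over an array of ints. Each cell has a version
// counter; a transaction records the version of every cell it reads and
// buffers its writes (lazy versioning). Commit validates the read set and
// publishes the write buffer (single-threaded illustration, no real locking).
struct ToyTM {
    std::vector<int> cells;
    std::vector<unsigned> versions;
    explicit ToyTM(std::size_t n) : cells(n, 0), versions(n, 0) {}
};

struct Tx {
    ToyTM& tm;
    std::map<std::size_t, unsigned> read_set;  // cell -> version seen
    std::map<std::size_t, int> write_buf;      // cell -> pending value
    explicit Tx(ToyTM& t) : tm(t) {}

    int read(std::size_t i) {
        auto w = write_buf.find(i);
        if (w != write_buf.end()) return w->second;  // read-your-own-write
        read_set.emplace(i, tm.versions[i]);         // remember version at first read
        return tm.cells[i];
    }
    void write(std::size_t i, int v) { write_buf[i] = v; }

    // Returns false (abort) if any cell read has changed since it was read.
    bool commit() {
        for (auto& [i, ver] : read_set)
            if (tm.versions[i] != ver) return false;
        for (auto& [i, v] : write_buf) {
            tm.cells[i] = v;
            ++tm.versions[i];
        }
        return true;
    }
};
```

A writer that commits between a transaction's read and its commit bumps the version, so commit() detects the conflict and the caller can retry.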



Some basic implementation characteristics

• Isolation level

– weak — transactions are isolated only from other transactions

– strong — transactions are isolated also from non-transactional code

• Workset limitations

– maximum memory footprint

– maximum execution time

– maximum nesting depth

– or unbounded if no fundamental limitations

• Conflict detection granularity

Annotations

A good introduction to transactional memory can be found in [1]. Transactional memory is an active research topic, as indicated by the number of recently published articles in various journals and conference proceedings; see, e.g., the bibliography section.

Arguably, transactional memory techniques were sparked by Tom Knight's work with LISP in 1986, which considers making LISP programming easier for developers by utilizing small transactions [11]. The modern era, with the current semantics, begins with [10].

Transactional memory techniques are interesting because they can potentially enable various other techniques. For example, cache coherence protocols in multicore systems could benefit from utilizing transactional memories [9].

However, until very recently, almost all results have been more or less academic. In particular, hardware-related results have almost invariably been produced by simulations, so one could question the feasibility of these results. Pure software approaches have been criticized in high-profile publications [4].

Finally, even if transactional memory techniques prove successful, it is important to note that they are not likely to revolutionize the world — by themselves, at least. Instead, in the author's opinion, these techniques should be considered complementary.

2 High-level programming with TM

Section outline

A quick glance at high-level programming interfaces:

• Transactional statement in C++ (Sun/Google approach)

• OpenTM

• A low-level interface will be introduced later in Sun Rock section

Transactional statements in C++ (1/3)

• Sun/Google consideration, but not a final solution

• Basic syntax: transaction compound statement

• Target: STM, weak isolation, closed nesting, I/O prohibited



Transactional statements in C++ (2/3)

• Starting and ending a transaction:

– Tx begins just before execution of the transactional compound statement

– Tx commits on normal exit (statement executed, or continue, break, return, goto)

– Tx aborts on a conflict, on a thrown exception, or on a longjmp that results in exiting the transactional compound statement

• Special considerations for throwing exceptions

– How to throw an exception if everything is rolled back, including the construction of the thrown object(!)

– Restrictions on referencing memory from thrown objects are likely to apply

Transactional statements in C++ (3/3)

Example code:

    // atomic_map
    //
    // Implemented by inheriting std::map and wrapping all
    // data manipulator methods into transactions

    #include <map>

    template<class key_type, class mapped_type>
    class atomic_map : public std::map<key_type, mapped_type> {
    public:
        std::pair<typename atomic_map::iterator, bool>
        insert(const typename atomic_map::value_type &v) {
            transaction {
                return std::map<key_type, mapped_type>::insert(v);
            }
        }
        ...
    };

Annotations

The Sun/Google consideration, with open issues, is presented in [5]. It is mentioned that they are more inclined to get some useful bits working quickly than to make a full specification in which everything is considered. Considering the authors and timing, this work is likely connected to the forthcoming Rock processor.

OpenTM (1/3)

• Extension to OpenMP

• Targets: strong isolation, open and closed transaction nesting, I/O prohibited

• Speculative parallelism

OpenTM (2/3)

• New constructs to specify transactions

– #pragma omp transaction — atomic transaction

– #pragma omp transfor — each iteration is a transaction, may be executed in parallel



– #pragma omp transsections / #pragma omp transsection — OpenMP parallel sections, transactionally executed

– #pragma omp orelse — executed if the transaction was aborted

• Additional clauses to specify commit ordering, transaction chunk sizes, etc.

OpenTM (3/3)

Example code:

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
      #pragma omp transaction
      { bin[A[i]] = bin[A[i]] + 1; }
    }

    #pragma omp transfor schedule(static, 42, 6)
    for (i = 0; i < N; i++) {
      bin[A[i]] = bin[A[i]] + 1;
    }

    #pragma omp transsections ordered
    {
      #pragma omp transsection
      WORK_A();
      #pragma omp transsection
      WORK_B();
    }

Source: http://tcc.stanford.edu/publications/tcc_pact2007_talk.pdf

Annotations

OpenTM [2] is a much broader approach to utilizing transactional memories than the Sun/Google consideration of TM-enhanced C++. There is much more consideration of practical issues such as commit ordering, nesting styles, and speculative parallelization. A GCC 4.3-based compiler implementation and a simulator exist; see http://opentm.stanford.edu/ for details.

3 TM implementations

Section outline

A glance at transactional memory implementations:

• Fundamentals

• Software transactional memory

• Hardware-accelerated software transactional memory

• Hardware transactional memory

• Hybrid transactional memory

• Note on supporting legacy software

Fundamentals (1/3)

Data versioning

• Lazy versioning

– Transaction hosts local copy of accessed data

– Writes go to commit buffer

– Data is written into main memory when transaction commits

• Eager versioning



– Transactions write data immediately into main memory. Isolation is provided by locking and/or aborting conflicting transactions

– Overwritten values go to an undo buffer

– The undo buffer is replayed when the transaction aborts
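A toy sketch of the eager-versioning scheme just described, with an undo buffer replayed in reverse on abort (illustrative names, no real locking):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Eager-versioning sketch: writes hit main memory immediately; the
// overwritten values go to an undo buffer, replayed newest-first on abort.
struct EagerTx {
    std::vector<int>& mem;
    std::vector<std::pair<std::size_t, int>> undo;  // (index, old value)
    explicit EagerTx(std::vector<int>& m) : mem(m) {}

    void write(std::size_t i, int v) {
        undo.emplace_back(i, mem[i]);  // log the old value first
        mem[i] = v;                    // then update in place
    }
    void commit() { undo.clear(); }    // nothing to publish; just drop the log
    void abort() {                     // restore old values in reverse order
        for (auto it = undo.rbegin(); it != undo.rend(); ++it)
            mem[it->first] = it->second;
        undo.clear();
    }
};
```

Note the asymmetry with lazy versioning: here commit is free and abort pays, which is why eager versioning suits workloads where aborts are rare.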

Fundamentals (2/3)

Conflict detection

• Pessimistic conflict detection

– Conflicts are detected progressively with reads and writes

– Conflicts are resolved by aborting or stalling progress

– Circular conflicts may halt progress altogether unless specifically detected

• Optimistic conflict detection

– Conflicts are detected at commit time and resolved by aborts

– Works only with lazy versioning

– Efficient only when conflict probability is low

– Perhaps lower latency, but more wasted work than with pessimistic detection

• Granularity of conflict detection is an important design property

– Fine granularity makes conflict detection slow, e.g., word granularity

– Coarse granularity makes conflict detection report false conflicts, e.g., page-level granularity
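The granularity trade-off is easy to demonstrate: map word addresses to detection "lines" and report a conflict whenever two access sets share a line. With an (arbitrarily chosen) 8-word line, accesses to different words of the same line produce a false conflict:

```cpp
#include <cstddef>
#include <set>

// Conflict detection at a configurable granularity: addresses are mapped
// to detection units ("lines"), and two access sets conflict if any units
// overlap. Coarser lines mean cheaper tracking but more false conflicts.
constexpr std::size_t kWordsPerLine = 8;  // illustrative line size

inline std::size_t line_of(std::size_t word_addr) {
    return word_addr / kWordsPerLine;
}

inline bool conflicts(const std::set<std::size_t>& accessed_a,
                      const std::set<std::size_t>& accessed_b) {
    std::set<std::size_t> lines_a;
    for (std::size_t w : accessed_a) lines_a.insert(line_of(w));
    for (std::size_t w : accessed_b)
        if (lines_a.count(line_of(w))) return true;
    return false;
}
```

Words 0 and 7 share line 0 and are flagged as conflicting even though the accesses are disjoint; words 0 and 8 are not.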

Fundamentals (3/3)

Transaction nesting

• flat

• closed

• open

Software transactional memory (STM)

• Compiler and runtime operation, no hardware support

• High overhead

– Every memory access must be tracked ⇒ extra memory traffic

– Conflict detection is expensive

– Typical real-world experimental results: 30–90% of time spent in STM, scalability far from linear

• Legacy code must be specifically considered

• However, a flexible solution, as there are no HW requirements

• Unbounded transactions are easily implemented

• Strong isolation is expensive



Annotations

The STM compiler by Intel is presented in [20] and can be obtained from http://whatif.intel.com. Criticism of STM is found in [4].

Hardware-accelerated software transactional memory (HASTM)

• The common bottleneck, i.e., memory access tracking and conflict detection, is accelerated by employing mark bits in the cache

• Almost HTM speeds are claimed

• Approach by Intel

Annotations

Hardware-accelerated STM is presented in [18]. In that work, accelerating the bottlenecks of STM with simple best-effort hardware support is considered. The authors claim almost the speed of unbounded HTM. This work could hint at Intel's future directions in transactional memories.

Hardware transactional memory (HTM)

• Hardware support for providing atomicity, versioning and conflict detection

• Versioning typically (but not always) implemented in the data cache, using existing cache coherency protocols for conflict detection ⇒ almost zero overhead

• Transactions are far from unbounded in memory footprint, execution time, and nesting

– Although resource virtualization can overcome hardware limitations (compare to memory virtualization)

• Interrupts, context switches, and other irregularities can cause false aborts

Annotations

An important real-world ISA discussion is found in [13], which takes a more holistic approach. In the Sun Rock processor, however, the ISA extension (Sec. 4) is much smaller. But then again, Rock is best-effort only.

A different approach to HTM is LogTM-SE, which does not rely on caches for its implementation. Instead, it uses signature techniques. LogTM-SE provides unbounded transactional memory by utilizing virtualization techniques. [22]

Hybrid transactional memory (HyTM)

• Use HTM but fall back to STM when HW limits are reached

• HTM mode incurs some overhead compared to pure HTM, as checks must be made whether HW operation is safe

• Typical overhead around 10–20% compared to pure HTM

• Much faster than pure STM, but without HW limits

• Most transactions are small enough for HTM; only a few of them fall back to STM

• Approach by Sun



Annotations

HyTM is initially presented in [6], although some previous considerations exist. A wrap-up of the HyTM techniques utilized by Sun in Rock research can be found in [8] and its references.

Perhaps the biggest performance problem in HyTM is that hardware transactions must always check whether possibly conflicting software transactions are present. Speeding this up by utilizing memory protection is considered in [3]. It is also worth noting that HyTM requires both an HTM and an STM version of every piece of software that might be run inside a transaction.

To conclude the considerations of different implementations, phased transactional memory (PhTM) is an attempt to bring together the best of many worlds. PhTM can switch between multiple implementation strategies, e.g., pure HTM to HyTM, based on the current workload [12].

Coping with legacy software

• Characteristics of legacy code:

– Code using locks to synchronize critical sections

– STM: code which is not produced by an STM compiler, i.e., memory accesses are not instrumented

• Workarounds:

– Critical sections: convert into transactions by using speculative lock elision

– Memory accesses: apply dynamic binary translation to instrument memory accesses

• Support for legacy code is important if existing libraries are to be used insidetransactions!
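The lock-elision workaround can be sketched as a retry-then-fallback policy: attempt the critical section speculatively a few times, then acquire the real lock. The hardware transaction is modeled here by a caller-supplied callback that reports commit or abort (everything below is a hypothetical illustration, not Sun's actual TLE code):

```cpp
#include <functional>
#include <mutex>

// Transactional lock elision sketch: try_tx models one best-effort
// hardware-transaction attempt at the critical section (true = committed,
// false = aborted). After max_retries failed attempts, we give up on
// speculation and run body under the elided lock.
// Returns true if the work completed transactionally, false if under the lock.
inline bool elide_lock(std::mutex& lock,
                       const std::function<bool()>& try_tx,
                       const std::function<void()>& body,
                       int max_retries = 3) {
    for (int i = 0; i < max_retries; ++i)
        if (try_tx()) return true;           // committed speculatively
    std::lock_guard<std::mutex> g(lock);     // fallback: real critical section
    body();
    return false;
}
```

In a real implementation try_tx would execute the body inside a transaction; here it is a stand-in so the retry/fallback policy itself can be shown.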

Annotations

Transactional lock elision (TLE) is a speculative lock elision technique implemented by utilizing transactions [17]. TLE is also used in work by Sun to speculatively execute lock-synchronized blocks in Java and C++ [8].

Dynamic binary translation techniques for transactional lock elision and for instrumenting memory accesses are discussed in [21].

4 TM in Sun Rock processor

Section outline

Transactional memory of the Sun Rock processor:

• Sun Rock processor overview

• Transactional memory implementation

• HTM ISA

• Applications

This is preproduction information; details are subject to change.



Sun Rock processor overview (1/2)

Source: http://www.opensparc.net/pubs/preszo/08/RockHotChips.pdf

Sun Rock processor overview (2/2)

• Rock is the next-generation SPARC processor

• 16-core design, organized in 4x4 groups

• L1 ICache (32 kB) per core group and L1 DCache (32 kB) per core pair; on-chip 4×512 kB L2 cache

• Each core executes 2 software threads and 1 or 2 (configurable) speculative "scout" threads

• Transactional memory support

• 321M transistors, 65nm process, 250W @ 2.1GHz

• General availability in 2009H2

Annotations

Some sources for information:

• publications [19, 7, 8]

• http://www.opensparc.net/pubs/preszo/08/RockHotChips.pdf

• http://en.wikipedia.org/wiki/Rock_processor

Note that there are already at least two revisions of Rock. Generally, material released in 2008 refers to R1 and material released in 2009 refers to R2. Notably, the HTM implementation has changed a bit.



TM in Rock

• Lazy versioning:

– Speculation bits in the L1 DCache track memory accesses and also hold the modified data

– A 16- or 32-entry commit buffer contains the list of modified cache lines; the buffer size depends on the scout thread configuration

– Modified lines are flushed to main memory at commit

– Abort simply discards the commit buffer and the modified cache lines

• Optimistic conflict detection

– Invalidated lines abort transaction

– Cache line granularity

– Based on existing cache coherency protocols

• Best effort only:

– Interrupts, exceptions, TLB misses, and branch speculation misses abort the ongoing transaction

– Also "difficult" instructions, such as some common procedure entry/epilogue instructions and the div family

ISA support

• Basically, three instructions

– chkpt <fail pc> — start transaction and specify abort address

– commit — commit transaction

– rd %cps, <dest reg> — read transaction abort status

• Contention management and retry/fallback policies are implemented in software
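Since the retry/fallback policy is software's job, its core is a decision function over the abort status. The status codes below only mimic the kind of information Rock's %cps register provides; the names and thresholds are invented placeholders:

```cpp
// Hypothetical abort-status codes, loosely modeled on the kind of
// information a failed transaction's status register might expose.
enum class AbortStatus {
    Coherence,    // another thread touched our data: transient, worth retrying
    TlbMiss,      // mapping can be warmed up by retrying
    Overflow,     // workset exceeded hardware limits: retrying won't help
    Instruction   // "difficult" instruction inside tx: won't succeed in HW
};

enum class Policy { RetryHardware, FallbackSoftware };

// Software-side contention management: decide from the abort reason and
// the attempt count whether another hardware attempt is worthwhile.
inline Policy decide(AbortStatus s, int attempts, int max_attempts = 4) {
    if (attempts >= max_attempts) return Policy::FallbackSoftware;
    switch (s) {
        case AbortStatus::Coherence:
        case AbortStatus::TlbMiss:
            return Policy::RetryHardware;
        default:
            return Policy::FallbackSoftware;
    }
}
```

Transient reasons get a bounded number of hardware retries; persistent reasons (overflow, unsupported instructions) fail over to a software path immediately.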

Annotations

The Rock TM instruction set architecture is explained in [14] (Rock R1), but seealso [8] for some changes in Rock R2.

Example applications

• Efficient synchronization primitive implementations

• Atomic container updates

• Speculative execution of restricted critical sections

• Implementing new synchronization primitives, such as double-CAS

• HTM part for hybrid HTM/STM implementations
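Double-CAS from the list above shows why small transactions are attractive: two compare-and-swaps become atomic as a unit. In this sketch a mutex stands in for the transactional (chkpt ... commit) region:

```cpp
#include <mutex>

// Double compare-and-swap: atomically, if *a == ea and *b == eb, set
// *a = na and *b = nb. On HTM this is one small transaction; here a
// mutex models the transactional region for illustration.
inline bool dcas(std::mutex& m, int* a, int ea, int na,
                 int* b, int eb, int nb) {
    std::lock_guard<std::mutex> g(m);  // stand-in for the transaction
    if (*a != ea || *b != eb) return false;
    *a = na;
    *b = nb;
    return true;
}
```

Either both locations are updated or neither is, which is the property single-word CAS cannot give without extra protocol machinery.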

Annotations

More application considerations with simulated results are found in [7].



Some published experimental results (1/3)

[Figure 1: throughput (ops/µsec, log scale) vs. number of threads (1–16) for phtm, phtm-tl2, hytm, stm, stm-tl2, and one-lock; HashTable test with 0% lookups.]

Figure 1. HashTable with 50% inserts, 50% deletes: (a) key range 256 (b) key range 128,000.

even for the single thread case (these retries explain why the single lock outperforms HyTM and PhTM somewhat in the single-threaded case). In contrast, with the 256 key range experiment (scenario (a)), only 0.02% of hardware transactions are retries in the single thread case, and even at 16 threads only 16% are retries.

Furthermore, the distribution of CPS values from failed transactions in the 16-thread, 256 key range case is dominated by COH, while in the 128,000 key range case it is dominated by ST and CTI. This makes sense because there is more contention in the smaller key range case (resulting in the CPS register being set to COH), and worse locality in the larger one. Poor locality can cause transactions to fail for a variety of reasons, including micro-DTLB mappings that need to be reestablished (resulting in ST), and mispredicted branches (resulting in CTI).

Finally, this experiment and the Red-Black Tree experiment (see Section 6) highlighted the possibility of the code in the fail-retry path interfering with subsequent retry attempts. Issues with cache displacement, TLB displacement and even modifications to branch-predictor state can arise, wherein code in the fail-retry path interferes with subsequent retries, sometimes repeatedly. Transaction failures caused by these issues can be very difficult to diagnose, especially because adding code to record and analyze failure reasons can change the behavior of the subsequent retries, resulting in a severe probe effect. As discussed further in (6), the logic for deciding whether to retry in hardware or fail to software was heavily influenced by these issues, and we hope to improve it further after understanding some remaining issues we have not had time to resolve yet.

6. Red-Black Tree

Next, we report on experiments similar to those in the previous section, but using a red-black tree, which is considerably more challenging than a simple hash table for several reasons. First, transactions are longer, access more data, and have more data dependencies. Second, when a red-black tree becomes unbalanced, new insertion operations perform "rotations" to rebalance it, and such rotations can occasionally propagate all the way to the root, resulting in longer transactions that perform more stores. Third, mispredicted branches are much more likely when traversing a tree.

We used an iterative version of the red-black tree (5), so as to avoid recursive function calls, which are likely to cause transactions to fail in Rock. We experimented with various key ranges, and various mixes of operations. In each experiment, we prepopulate the tree to contain about half the keys in the specified key range, and then measure the time required for all threads to perform 1,000,000 operations each on the tree, according to the specified operation distribution; we report results as throughput in total operations per microsecond. Figure 2(a) shows results for the "easy" case of a small tree (128 keys) and 100% lookup operations. Figure 2(b) shows a more challenging case with a larger tree (2048 keys), with 96% lookups, 2% inserts and 2% deletes.

The 100% lookup experiment on the small tree yields excellent results, similar to those shown in the previous section. For example, at 16 threads, PhTM outperforms the single lock by a factor of more than 50. However, as we go to larger trees and/or introduce even a small fraction of operations that modify the tree, our results are significantly less encouraging, as exemplified by the experiment shown in Figure 2(b). While PhTM continues to outperform the single lock in almost every case, in many cases it performs worse than the TL2 STM system (7). A key design principle for PhTM was to be able to compete with the best STM systems in cases in which we are not able to effectively exploit HTM transactions. Although we have not yet done it, it is trivial to make PhTM stop attempting to use hardware transactions, so in principle we should be able to get the benefit of the hardware transactions when there is a benefit, suffering only a negligible overhead when there is not. The challenge is in

Source: Dice et al: Early Experience with a Commercial Hardware Transactional Memory Implementation,

ASPLOS’09.

© Sun Microsystems, Inc.

Some published experimental results (2/3)

[Figure 2: throughput (ops/µsec, log scale) vs. number of threads (1–16) for phtm, phtm-tl2, hytm, stm, stm-tl2, and one-lock.]

Figure 2. Red-Black Tree. (a) 128 keys, 100% reads (b) 2048 keys, 96% reads, 2% inserts, 2% deletes.

deciding when to stop attempting hardware transactions, but in extreme cases this is easy.

Before giving up on getting any benefit from HTM in such cases, however, we want to understand the behavior better, and explore whether better retry heuristics can help.

As discussed earlier, understanding the reasons for transaction failure can be somewhat challenging. Although the mentioned CPS improvements have alleviated this problem to some extent, it is still possible for different failure reasons to set the same CPS values. Therefore, we are motivated to think about different ways of analyzing and inferring reasons for failures. Below we discuss an initial approach we have taken to understanding our red-black tree data.

6.1 Analyzing Transaction Failures

Significant insight into the reason for a transaction failing can be gained if we know what addresses are read and written by it. We added a mechanism to the PhTM library that allows the user to register a call-back function to be called at the point that a software transaction attempts to commit; furthermore, we configured the library to switch to a software phase in which only the switching thread attempts a software transaction. This gives us the ability to examine the software transaction executed by a thread that has just failed to execute the same operation as a hardware transaction.

We used this mechanism to collect the following information about operations that failed to complete using hardware transactions: operation name (Get, Insert or Delete); read set size (number of cache lines¹); maximum number of cache lines mapping to a single L1 cache set; write set size (number of cache lines and number of words); number of words in the write set that map to each bank of the store queue; number of write upgrades (cache lines that were read and then written); and number of stack writes.

¹ In practice, we collected the number of ownership records covering the read set. Since each cache line maps to exactly one ownership record, and since the size of our ownership table is very large, we believe that the two are essentially the same.

We profiled single-threaded PhTM runs with various tree sizes and operation distributions. Furthermore, because the sequence of operations is deterministic (we fixed the seed for the pseudo-random number generator used to choose operations), we could also profile all operations using an STM-only run, and use the results of the PhTM runs to eliminate the ones that failed in hardware. This way, we can compare characteristics of transactions that succeed in hardware to those that don't, and look for interesting differences that may give clues about reasons for transaction failures.

Results of Analysis. In addition to the experiments described above, we also tried experiments with larger trees (by increasing the key range), and found that many operations fail to complete using hardware transactions, even for single-threaded runs with 100% lookup operations. This does not seem too surprising: the transactions read more locations walking down a deeper tree, and thus have a higher chance of failing to fit in the L1 cache.

We used the above-described tools to explore in more depth, and we were surprised to find out that the problem was not overflowing of L1 cache sets, nor exceeding the store queue limitation. Even for a 24,000-element tree, none of the failed operations had a read set that overflowed any of the L1 cache sets (in fact, it was rare to see more than 2 loads hit the same 4-way cache set). Furthermore, none of the transactions exceeded the store queue limitation. Putting this information together with the CPS values of the failed transactions, we concluded that most failures were because too many instructions were deferred due to the high number of cache misses. Indeed, when we then increased the number of times we attempt a hardware transaction before switching to software, we found that we could significantly decrease the number of such failing transactions, because the additional retries served to bring needed data into the cache, thereby reducing the need to defer instructions.

Even though we were able to get the hardware transactions to commit by retrying more times, the additional re-

Source: Dice et al: Early Experience with a Commercial Hardware Transactional Memory Implementation,

ASPLOS’09.

© Sun Microsystems, Inc.

Some published experimental results (3/3)

[Figure 3: throughput (ops/µsec, log scale) vs. number of threads (1–16). (a) STLVector test (initsize=100, ctr-range=40) comparing htm and noTM variants of oneLock and rwLock; (b) TLE with Hashtable in Java, locks vs. TLE for put:get:remove mixes 0:10:0, 1:8:1, 2:6:2, and 4:2:4.]

Figure 3. (a) TLE in C++ with STL vector (b) TLE in Java with Hashtable.

for Hashtable are shown in Figure 3; a curve labeled with 2-6-2 indicates 20% puts, 60% gets, and 20% removes.

With 100% get operations, TLE is highly successful, and the throughput achieved scales well with the number of threads. As we increase the proportion of operations that modify the Hashtable, more transactions fail, the lock is acquired more often, contention increases, and performance diminishes. Nonetheless, even when only 20% of the operations are gets, TLE outperforms the lock everywhere except the single-threaded case. We hope to improve performance under contention, for example by adaptively throttling concurrency when contention arises.

We also conducted similar experiments for HashMap. As before (5), we found that HashMap performed similarly to Hashtable in the read-only test. When we introduced operations that modify the collection, however, while we still achieve some performance improvement over the lock, so far our results are not as good as for Hashtable. We have made some interesting observations in this regard.

We observed good performance with HashMap comparable to Hashtable, but noticed that later in the same experiment, performance degraded and became comparable to the original lock. After some investigation, we determined that the difference was caused by the JIT compiler changing its decision about how to inline code. At first, it would inline the synchronized collection wrapper together with each of the HashMap's put, get and remove methods. Thus, when the JVM converted the synchronized methods to transactions, the code to be executed was all in the same method.

Later, however, the JIT compiler revisited this decision, and in the case of put, instead inlined the synchronized collection wrapper into the worker loop body and then emitted a call to a method that implements HashMap.put(). As a result, when the TLE-enabled JVM converts the synchronized method to a transaction, the transaction contains a function call, which—as discussed in Section 3—can often abort transactions in Rock. If the compiler were aware of TLE, it could avoid making such decisions that are detrimental to transaction success.
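The inlining issue can be made concrete with the standard synchronized-wrapper shape the text refers to. In the JDK, Collections.synchronizedMap wraps every operation in a synchronized region and then calls into the backing map; whether the JIT inlines that inner HashMap.put call into the synchronized region, or emits it as a real call, determines whether the elided transaction contains a function call. This shows only the code shape involved, not Rock's actual code generation:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// The synchronized collection wrapper pattern discussed in the text: each
// operation is a synchronized region (the region TLE turns into a
// transaction) whose body calls the backing HashMap. If the JIT inlines
// HashMap.put into that region, the transaction body is straight-line
// code; if it emits a real call instead, the transaction contains a
// function call, which can often abort transactions on Rock.
public class WrapperShape {
    public static void main(String[] args) {
        Map<String, Integer> map =
                Collections.synchronizedMap(new HashMap<>());
        map.put("a", 1);          // synchronized(mutex) { backing.put(...) }
        map.put("a", map.get("a") + 1);
        System.out.println(map.get("a")); // prints 2
    }
}
```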

We also tested TreeMap from java.util.concurrent, another red-black tree implementation. Again, we achieved good results with small trees and read-only operations, but performance degraded with larger trees and/or more mutation. We have not investigated in detail.

We are of course also interested in exploiting Rock's HTM in more realistic applications than the microbenchmarks discussed so far. As a first step, we have experimented with the VolanoMark™ benchmark (18). With the code for TLE emitted, but with the feature disabled, we observed a 3% slowdown, presumably due to increased register and cache pressure because of the code bloat introduced. When we enabled TLE, it did not slow down the benchmark further, as we had expected, and in fact it regained most of the lost ground, suggesting that it was successful in at least some cases. However, a similar test with an internal benchmark yielded a 20% slowdown, more in line with our expectation that blindly attempting TLE for every contended critical section would severely impact performance in many cases.

This experience reinforces our belief that TLE must be applied selectively to be useful in general. We are working towards being able to do so. As part of this work we have built a JVM variant that includes additional synchronization observability and diagnostic infrastructure, with the purpose of exploring an application and characterizing its potential to profit from TLE and understanding which critical sections are amenable to TLE, and the predominant reasons in cases that are not. We hope to report in more detail on our experience with the tool soon.

8. Minimum Spanning Forest algorithm

Kang and Bader (10) present an algorithm that uses transactions to build a Minimum Spanning Forest (MSF) in parallel given an input graph. Their results using an STM for the transactions showed good scalability, but the overhead

Source: Dice et al.: Early Experience with a Commercial Hardware Transactional Memory Implementation, ASPLOS '09.

© Sun Microsystems, Inc.

11


References

[1] Ali-Reza Adl-Tabatabai, Christos Kozyrakis, and Bratin Saha. Unlocking concurrency. ACM Queue, 4(10):24–33, 2007.

[2] Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, and Kunle Olukotun. The OpenTM transactional application programming interface. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT '07), pages 376–387, Washington, DC, USA, 2007. IEEE Computer Society.

[3] Lee Baugh, Naveen Neelakantam, and Craig Zilles. Using hardware memory protection to build a high-performance, strongly-atomic hybrid transactional memory. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), pages 115–126, Washington, DC, USA, 2008. IEEE Computer Society.

[4] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory: Why is it only a research toy? ACM Queue, 6(5):46–58, 2008.

[5] Lawrence Crowl, Yossi Lev, Victor Luchangco, Mark Moir, and Dan Nussbaum. Integrating transactional memory into C++. In Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2007) [15]. http://www.cs.rochester.edu/meetings/TRANSACT07/.

[6] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel Nussbaum. Hybrid transactional memory. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), pages 336–346, New York, NY, USA, 2006. ACM.

[7] Dave Dice, Maurice Herlihy, Doug Lea, Yossi Lev, Victor Luchangco, Wayne Mesard, Mark Moir, Kevin Moore, and Dan Nussbaum. Applications of the adaptive transactional memory test platform. In Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2008) [16]. http://www.unine.ch/transact08/program.html.

[8] Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum. Early experience with a commercial hardware transactional memory implementation. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIV), 2009.

[9] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional memory coherence and consistency. In 31st Annual International Symposium on Computer Architecture (ISCA '04), pages 102–113, Washington, DC, USA, June 2004. IEEE Computer Society.

[10] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. ACM SIGARCH Computer Architecture News, 21(2):289–300, 1993.

[11] Tom Knight. An architecture for mostly functional languages. In LFP '86: Proceedings of the 1986 ACM Conference on LISP and Functional Programming, pages 105–112, New York, NY, USA, 1986. ACM.

[12] Yossi Lev, Mark Moir, and Dan Nussbaum. PhTM: Phased transactional memory. In Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2007) [15]. http://www.cs.rochester.edu/meetings/TRANSACT07/.

[13] Austen McDonald, JaeWoong Chung, Brian D. Carlstrom, Chi Cao Minh, Hassan Chafi, Christos Kozyrakis, and Kunle Olukotun. Architectural semantics for practical transactional memory. In Proceedings of the 33rd International Symposium on Computer Architecture (ISCA '06), pages 53–65, Washington, DC, USA, 2006. IEEE Computer Society.

[14] Mark Moir, Kevin Moore, and Dan Nussbaum. The adaptive transactional memory test platform: a tool for experimenting with transactional code for Rock. In Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2008) [16]. http://www.unine.ch/transact08/program.html.

[15] Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2007), August 2007. http://www.cs.rochester.edu/meetings/TRANSACT07/.

[16] Proceedings of the ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2008), February 2008. http://www.unine.ch/transact08/program.html.

[17] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based programs. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), pages 5–17, New York, NY, USA, 2002. ACM.

[18] Bratin Saha, Ali-Reza Adl-Tabatabai, and Quinn Jacobson. Architectural support for software transactional memory. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39), pages 185–196, 2006.

12


[19] Marc Tremblay and Shailender Chaudhry. A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 82–83, February 2008.

[20] Cheng Wang, Wei-Yu Chen, Youfeng Wu, Bratin Saha, and Ali-Reza Adl-Tabatabai. Code generation and optimization for transactional memory constructs in an unmanaged language. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '07), pages 34–48, March 2007.

[21] Cheng Wang, Victor Ying, and Youfeng Wu. Supporting legacy binary code in a software transaction compiler with dynamic binary translation and optimization. In International Conference on Compiler Construction (CC 2008), volume 4959 of Lecture Notes in Computer Science, pages 291–306. Springer Berlin / Heidelberg, 2008.

[22] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07), pages 261–272, Washington, DC, USA, 2007. IEEE Computer Society.

13