
Transactional WaveCache: Towards Speculative and Out-of-Order DataFlow Execution of Memory Operations

Leandro A. J. Marzulo, Felipe M. G. França
Universidade Federal do Rio de Janeiro

Systems Engineering and Computer Science Program, COPPE
Rio de Janeiro - Brazil

{lmarzulo, felipe}@cos.ufrj.br

Vítor Santos Costa
Universidade do Porto

Departamento de Ciência de Computadores
Porto - Portugal

[email protected]

Abstract

WaveScalar is the first DataFlow architecture that can efficiently provide the sequential memory semantics required by imperative languages. This work presents a speculative memory disambiguation mechanism for this architecture, the Transactional WaveCache. Our mechanism maintains the execution order of memory operations within blocks of code, called Waves, but adds the ability to speculatively execute, out of order, operations from different waves. The mechanism is inspired by progress in supporting Transactional Memories: Waves are treated as atomic regions and executed as nested transactions. A Wave that has finished the execution of all its memory operations commits as soon as the previous waves have also committed. If a hazard is detected in a speculative Wave, all the following Waves (its children) are aborted and re-executed. We evaluated the Transactional WaveCache on a set of benchmarks from SPEC 2000, Mediabench and MiBench (telecomm). Speedups ranging from 1.31 to 2.24 (relative to the original WaveScalar) were observed when a benchmark does not perform many emulated function calls or access memory very often. Low speedups of 1.1 down to slowdowns of 0.96 were observed when the opposite happens or when memory concurrency was high.

1. Introduction

As the speed increase of the single processor slows down, multi-core machines are becoming the standard in environments ranging from desktop and portable computing up to servers. This makes parallelism an important concern, since in order to achieve high performance with the new architectures one must be able to extract parallelism from the application. Unfortunately, this is often difficult with the standard Von Neumann approach to computing, motivating aggressive research on alternative models such as TRIPS [14] and RAW [20]. A popular alternative was the DataFlow model, where instructions execute as soon as their input operands are available [7, 6, 13], instead of following program order. Although dataflow ideas are in use, e.g., in Tomasulo's algorithm [19], actual DataFlow machines never became popular. Arguably, a major problem was that it was hard to support the memory semantics required by imperative languages. Moving to DataFlow thus required both a new architecture and a new programming language.

Swanson's WaveScalar architecture is a radical rethinking of DataFlow concepts that addresses this problem by providing a memory interface that executes memory accesses according to program order [17, 18, 16]. The key idea is that computation is divided into waves, such that each wave runs as a dataflow computation, but the sequencing of waves guarantees the traditional memory access order. It thus becomes possible to run imperative programs on a dataflow machine and still obtain significant speedups. Simulation results for WaveScalar show that performance can be comparable to conventional superscalar designs and even CMPs, for single-threaded programs, but with 30% less area [16]. When it comes to multithreaded applications, WaveScalar performs 2 to 11 times faster than CMPs [16].



The tiled organization of the WaveScalar architecture and its placement algorithms allow efficient use of the processor's resources.

In the original WaveScalar design, memory requests from different waves always follow program order. This can be more restrictive than current designs, where write buffering and read reordering are used extensively. Work by the WaveScalar group shows that such techniques can indeed be very useful in the WaveScalar context [16]. In this work we present and study an even more aggressive approach. We propose an optimistic speculative memory access algorithm where different waves that access memory caches can run out of order. If conflicts arise, speculative waves are forced to undo their memory accesses. The approach was inspired by the observation that waves can be regarded as memory transactions, so the work on Transactional Memories should apply. Notice that although it allows an increase in parallelism, this approach requires more complex hardware (on the other hand, such hardware could also be used to support actual transactions).

We evaluated the Transactional WaveCache on a set of benchmarks from SPEC 2000, Mediabench and MiBench (telecomm). Speedups ranging from 1.31 to 2.24 were observed when a benchmark does not perform many emulated function calls. Low speedups of 1.1 down to slowdowns of 0.96 were observed when the opposite happens or when memory concurrency was high.

Section 2 discusses the Transactional Memory concepts that influenced the creation of the Transactional WaveCache. Section 3 describes the WaveScalar instruction set, memory interface and processor architecture. Section 4 presents the Transactional WaveCache, and Section 5 presents and discusses experiments and results. Conclusions and future work are presented in Section 6.

2. Transactional Memories

The term Transactional Memory (TM) was coined by Herlihy and Moss [9] as "a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion". The motivation was to avoid the complexity of lock-based programming. The advent of CMPs (single-Chip MultiProcessors) has increased the demand for parallel programming models that are easier to use, and has thus made the subject popular.

A transaction is an atomic block of code. In TM, transactions can execute speculatively without blocking. To do so, a record of memory and register-file changes must be kept by the transactions. In case of a conflict between two transactions, one of them is aborted and re-executed. If a transaction finishes without conflicts, a commit is made. There are four major problems that must be considered to create a Transactional Memory solution: data versioning, conflict detection, nesting and virtualization.

In eager data versioning, an undo log keeps a backup of the memory regions and registers changed by transactions, and it is used to perform the restoration in case of an abort. If a commit happens, the undo log is erased. In lazy data versioning, a write buffer keeps the changes made by each transaction. Those changes are applied to memory and registers upon a commit and erased on an abort. Conflict detection can also be eager or lazy: in the former, conflicts are detected as soon as they happen; in the latter, they are only detected when a transaction finishes (it will commit if no conflicts were detected).
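To make the two versioning policies concrete, here is a minimal Python sketch, assuming a plain dict stands in for memory; the class and method names are illustrative and are not taken from any real TM system.

    # Minimal sketch of the two data-versioning policies described above.
    class EagerTx:
        """Eager versioning: write memory in place, keep an undo log."""
        def __init__(self, memory):
            self.mem, self.undo = memory, []

        def write(self, addr, value):
            self.undo.append((addr, self.mem.get(addr)))  # back up the old value
            self.mem[addr] = value                        # update in place

        def commit(self):
            self.undo.clear()                             # just drop the log

        def abort(self):
            for addr, old in reversed(self.undo):         # restore the backups
                self.mem[addr] = old
            self.undo.clear()

    class LazyTx:
        """Lazy versioning: buffer writes, apply them only on commit."""
        def __init__(self, memory):
            self.mem, self.buf = memory, {}

        def write(self, addr, value):
            self.buf[addr] = value                        # memory stays untouched

        def read(self, addr):
            return self.buf.get(addr, self.mem.get(addr))  # see own writes first

        def commit(self):
            self.mem.update(self.buf)                     # apply the write buffer
            self.buf.clear()

        def abort(self):
            self.buf.clear()                              # discard the buffer

Under eager versioning commits are cheap and aborts pay the restoration cost; lazy versioning inverts this trade-off.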

One further problem in implementing TM is that CPUs may start new transactions while still executing another transaction, the so-called nested transactions. Such transactions may be treated as a single transaction (flattening), or may be treated independently (closed and open nesting [11, 12]). Finally, Transactional Memory systems should ensure the correct execution of programs even when a transaction exceeds the scheduler time slice, the cache and memory capacities, or the number of independent nesting levels allowed by the hardware [3].

Transactional Memory mechanisms may be implemented in hardware [9, 2, 3], where higher performance can be achieved, but complete solutions to the nesting and virtualization problems may increase design complexity and make it too expensive. Software implementations [15, 8, 1] allow more complex solutions, with lower performance. Hybrid implementations [10, 4, 1] try to get the best of both worlds.

3. WaveScalar

WaveScalar is the DataFlow architecture on which the Transactional WaveCache, presented in this work, is implemented. This section briefly describes the WaveScalar instruction set, memory interface and processor architecture.

3.1 Instruction Set

A DataFlow graph describes a program in the DataFlow model. The nodes in the graph are instructions; instructions are intelligent in the sense that each has an associated functional unit. Edges represent the operands exchanged between instructions. The WaveScalar instruction set starts from the Alpha ISA [5]. The main difference is that branches must be transformed into a mechanism that selects the consumers of values: (i) Select (φ) receives two values v1 and v2 and a boolean selector s; one of the values v1 or v2 is sent, according to s; (ii) Steer (ρ) receives a value v and a boolean b; the value v is sent to one of two paths, depending on b.
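The semantics of the two control instructions can be summarized in a short Python sketch; the 'T'/'F' path tags in steer are merely illustrative of the two possible destinations.

    def select(v1, v2, s):
        """Select (phi): consumes both values, forwards the one chosen by s."""
        return v1 if s else v2

    def steer(v, b):
        """Steer (rho): sends v down the True or the False path, chosen by b."""
        return ('T', v) if b else ('F', v)

Note that Select consumes both inputs and forwards one of them, whereas Steer consumes a single input and chooses where it goes.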

184

Page 3: [IEEE 2008 20th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) - Campo Grande, Brazil (2008.10.29-2008.11.1)] 2008 20th International Symposium

3.1.1 Waves

Waves are connected, acyclic fragments of the control flow graph with a single entrance. Waves extend hyper-blocks in that they can also contain joins. The WaveScalar compiler (or binary translator) partitions a program into a set of maximal waves. Waves are then ordered through Wave-ordering annotations.

Notice that different iterations of a loop may execute in parallel in DataFlow mode, if there are no dependencies between their instructions. When an instruction in the loop receives an operand, there must be a way to identify the iteration to which the operand is destined. To accomplish that, every operand carries a tag that indicates the iteration number, or Wave number. The Wave-Advance instruction advances the wave numbers for the input operands of a wave. It is associated with every input argument, and is often merged with the next instruction to reduce overhead.
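A sketch of the tag mechanism follows, with an illustrative Operand type: Wave-Advance produces a copy of its input with the wave number incremented, so that operands from different iterations never match each other.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Operand:
        wave: int       # Wave-ordering tag: the iteration this value belongs to
        value: object

    def wave_advance(op):
        """Wave-Advance: bump the wave number as an operand enters the next wave."""
        return Operand(op.wave + 1, op.value)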

3.1.2 Memory-ordering

During compilation, all memory operations further receive a key <P, C, S> (Predecessor, Current and Successor) that allows the memory system to establish a chain connecting all memory requests in a Wave. A memory request can only be executed if the previous request in the chain and all memory requests from the previous Wave have already been executed. When a memory operation is the first or the last of a Wave, P="." or S=".", respectively (the wildcard "." denotes a nonexistent operation). Operations that occur before and after a branch block have S="?" and P="?", respectively (the wildcard "?" denotes an unknown operation). If there are no memory operations in one of the paths of a branch, there is no way to establish a chain between the operations that come before and after the branch block. To solve this problem, a MemNop instruction must be inserted in that path. Figure 1 shows a piece of code with an IF-THEN-ELSE block (a) and the related DataFlow graph (b). The Wave-ordering annotations of the memory operations are also shown, with the dashed lines indicating the chain formed between them.
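The firing condition for a wave-ordered request can be sketched as below, assuming 'executed' holds the Current numbers already performed in the wave. The handling of the "?" wildcard is simplified here: the real memory system resolves unknown links by matching the S and P fields of neighbouring requests.

    def may_fire(P, executed, prev_wave_done):
        """A request <P, C, S> may issue once all requests of the previous wave
        are done and its predecessor in the intra-wave chain has executed."""
        if not prev_wave_done:
            return False              # inter-wave order is enforced first
        if P == '.':                  # '.': no predecessor (first op of the wave)
            return True
        if P == '?':                  # '?': predecessor unknown (after a branch);
            return False              # simplified: wait until the chain resolves
        return P in executed          # predecessor's Current number has executed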

Wave-ordered memory incorporates simple run-time memory disambiguation inside each wave by taking advantage of the fact that the address of a Store is sometimes ready before the data value. When this happens, the memory system can safely proceed with future memory operations to different addresses. Stores are broken into two requests: Store-Address-Request and Store-Data-Request. Store-Address-Requests arriving without the corresponding Store-Data-Request are inserted into partial store queues that hold future operations to the same address. Other operations that access the same address are inserted in the same queue. Once the Store-Data-Request arrives for that address, the memory system can apply all the operations in the partial store queue in quick succession. Decoupling the address and data of Stores increases memory parallelism by 30% on average [16].
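Partial store queues can be pictured as follows; this is a hypothetical sketch of the behaviour just described, with illustrative names, not the actual WaveCache hardware.

    class PartialStoreQueue:
        """Created when a Store-Address-Request arrives before its data."""
        def __init__(self, addr):
            self.addr, self.waiting = addr, []

        def enqueue(self, op):
            self.waiting.append(op)           # later ops to this address wait here

        def data_arrives(self, memory, value):
            """Store-Data-Request arrives: apply the Store, drain the queue."""
            memory[self.addr] = value
            for op in self.waiting:           # apply buffered ops in order
                if op.kind == 'Load':
                    op.result = memory[self.addr]
                else:                         # a buffered Store
                    memory[self.addr] = op.value
            self.waiting.clear()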

Figure 1. WaveScalar DataFlow graph. (a) Code with an IF-THEN-ELSE block: if(V[0]==0) V[1]=3; else V[1]=2; V[2]=2; (b) the corresponding DataFlow graph of const, Load, comparison and Steer (ρ) instructions, where the memory operations carry the Wave-ordering annotations Load <.,0,?>, Store <0,1,3>, Store <0,2,3> and Store <?,3,.>, with dashed lines indicating the chain formed between them.

3.2 The WaveScalar architecture

The WaveScalar architecture is called the WaveCache, and it comprises all the hardware, except main memory, required to run a WaveScalar program. It is designed as a scalable grid of identical dataflow processing elements, the wave-ordered memory hardware, and a hierarchical interconnect to support communication. The building block of the WaveCache is the Cluster, each containing four Domains. A Cluster has an L1 cache, a StoreBuffer that interfaces with the wave-ordered memory, and a Switch providing intra- and inter-Cluster communication. Each Domain has eight processing elements grouped in Pods of two PEs each. The Clusters are replicated across the die, forming a matrix connected to the L2 cache. Figure 2 (reproduced, with permission of the copyright owner, from [16]) shows an overview of the WaveCache.

The PEs implement the DataFlow firing rule and thus execute instructions. Each PE has an ALU, memory structures to hold operands, logic to control execution and communication, and an instruction buffer. When a program executes in WaveScalar, multiple instructions are mapped to the same PE by a placement algorithm. As the program evolves, some instructions become unnecessary and are replaced by new ones. The firing rule ensures that an instruction is executed only when all its operands are available. Every PE has a matching table that holds all operands destined to instructions mapped to that PE, until the instructions are ready to be fired, causing those operands to be consumed and possibly producing result operands.

Figure 2. The WaveCache architecture: a grid of Clusters, each containing four Domains of PE Pods, a StoreBuffer (SB), a Switch (Net) and an L1 data cache (D$), connected to surrounding L2 banks.

Each Cluster has a StoreBuffer (SB) that is responsible for the memory ordering mechanism of some Waves. The Wave Map holds the information about which StoreBuffer has custody of which Wave. This Map is stored in main memory; a unique view of the lines used by each SB is guaranteed by the cache coherence protocol. Memory requests are sent from the PEs to their local StoreBuffers and routed, if needed, to the remote SB that is responsible for the request's Wave. The SB then inserts the request in a list for that wave. The request will be executed as soon as all the previous waves have executed all their memory operations and a chain is established between the previously executed memory operation and the present one.

4. The Transactional WaveCache

By default, WaveScalar follows a strict ordering mechanism for memory accesses. Partial Stores can add parallelism to the execution of memory operations within a wave, but parallel execution between waves is only possible if later waves do not need memory. The motivation for our work is that more parallelism becomes available if we allow disjoint memory operations to execute in parallel.

This work presents an alternative memory ordering mechanism that maintains the execution order of memory operations within a wave, but adds the ability to speculatively execute, out of order, operations from different waves. This ordering mechanism is inspired by the way Transactional Memories work. Waves are considered atomic regions and executed as nested transactions. In a nutshell, if a wave has finished the execution of all its memory operations, it can commit as soon as the previous waves have committed. If a hazard is detected in a speculative wave, all the following waves (its children) are aborted and re-executed.

4.1 Transactions in WaveScalar

The large body of work in TM suggests a number of alternatives for implementing our approach. Next, we present our algorithm. The first idea is that each Wave has an F bit that indicates whether it has finished all its memory operations. This bit is stored in the Wave Map (see Section 3.2). Moreover, each StoreBuffer has an extra attribute called lastCommittedWave that holds the number of the last committed wave. The F bit and lastCommittedWave indicate the transaction state:

1. Wave X is non-speculative if all previous Waves are committed, i.e., X = lastCommittedWave + 1. The transactional context does not need to be kept for non-speculative Waves;

2. At initialization, SB.lastCommittedWave = -1, as the first Wave (Wave.Id = 0) is always non-speculative;

3. At initialization, for all waves W, W.F = FALSE;

4. When a wave W finishes the execution of all its memory operations, W.F = TRUE;

5. A Wave W is pending if (W.Id > lastCommittedWave + 1) ∧ W.F;

6. A Wave W can commit when (W.Id = lastCommittedWave + 1) ∧ W.F.

In contrast to standard TM, transaction size is not a problem here: transactions that need more resources than are available should simply stall until resources become available or until they become non-speculative.
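Rules 1-6 amount to a small state machine; the following Python sketch transcribes them directly (the attribute names mirror the text, everything else is illustrative).

    class Wave:
        def __init__(self, wave_id):
            self.Id = wave_id
            self.F = False                    # rule 3: set at initialization

    class StoreBuffer:
        def __init__(self):
            self.lastCommittedWave = -1       # rule 2: Wave 0 starts non-speculative

        def is_non_speculative(self, W):      # rule 1
            return W.Id == self.lastCommittedWave + 1

        def finish_memory_ops(self, W):       # rule 4
            W.F = True

        def is_pending(self, W):              # rule 5
            return W.Id > self.lastCommittedWave + 1 and W.F

        def can_commit(self, W):              # rule 6
            return W.Id == self.lastCommittedWave + 1 and W.F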

4.2 The Transactional context

In Transactional Memory for Von Neumann machines, before a transaction starts it is necessary to save the contents of all registers used by the program. A history of all changes to memory must also be kept. If a hazard is detected, this information is used to restore the memory and register file. In WaveScalar there is no register file, but all operands used by a transaction that were produced outside the transaction need to be re-sent. Those operands form the read set of the transaction or Wave.

We observed that the Wave-Advance instructions control how operands reach a wave. In order to keep the read set for each Wave, a natural step is to modify those instructions so that they send a copy of their operand to the StoreBuffer, which in turn inserts them into a novel structure called the Wave-Context-Table (WCT). A StoreBuffer may have several WCTs: each WCT holds the read set for one wave, and each of its lines holds one Wave-Advance operand. If the WCT fills up, the Wave-Advance operands remain in the PEs' output buffers and block wave advance.

We further need to keep a memory log: this is implemented through a table called the MemOp-History (MOH). For each operation, the MOH stores the Wave number, the Current number of the operation's memory annotation, the operation type, and a data backup for Store operations. Each StoreBuffer has an associated MOH. Notice that one problem is that waves belonging to different SBs can interfere. All the involved SBs would need a unique view of the MOH so that hazards between those waves could be detected: the MOH would need to reside in memory, bringing high communication costs. We avoid this by ensuring that a speculative wave must use the same SB as its closest non-speculative wave. Arguably, this solution could limit parallelism, but in fact waves in the same thread are usually under the responsibility of the same SB [16].
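The two bookkeeping structures can be summarized with illustrative record types; the MOH field list follows the text above, while the names and types themselves are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class WaveContextTable:       # one WCT per speculative wave: its read set
        wave: int
        operands: list = field(default_factory=list)  # copies of Wave-Advance operands

    @dataclass
    class MemOpRecord:            # one MemOp-History line per executed operation
        wave: int                 # Wave number
        current: int              # 'Current' field of the <P, C, S> annotation
        op_type: str              # 'Load' or 'Store'
        backup: object = None     # old memory value, kept for Stores only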

4.3 Speculation and Re-execution

We used the two previous data structures so that WaveScalar's original memory ordering mechanism would allow memory requests from different waves to be executed concurrently. In order to study how speculation affects performance, we can impose a limit on the number of speculative waves that may execute. We call this limit the Speculation Window.

Before we go on, we observe that problems arise because an instruction may receive operands from both old and recent executions. In the worst case, operands might match in the Matching Table (MT) (see Section 3.2) and be consumed, producing wrong results. We therefore add an execution number ExeN to the operands' tags and change the firing rule so that ExeN is also used to match operands. An instruction that executes with operands of ExeN = K produces operands also with ExeN = K.

The StoreBuffer maintains a lastExeN attribute that holds the number of the last execution. Waves also have a CurrentExeN attribute: a memory request can only be accepted by Wave W if ExeN ≥ W.CurrentExeN.

We can now describe the rollback algorithm. If a hazard is found in Wave W, the WCTs of all waves Y such that Y.Id > W.Id are discarded; all operands in the WCT of W are then re-sent, causing the re-execution of all following waves. Notice that at the moment of a re-execution, some operands that belong to the read set of W might not be present in the WCT yet. This is not a problem: they simply have not yet reached the Wave-Advance instruction. When they do, (i) the operands will be inserted in the WCT, (ii) they will have their ExeN updated, and (iii) they will be sent to the new instance of wave W.

Notice that when operands are re-sent by the StoreBuffers, they have their ExeN changed to a unique number that identifies the new execution. More precisely, when Wave W causes a re-execution, all waves ≥ W have their currentExeN set to StoreBuffer.lastExeN. Thus, memory requests from old executions will not be accepted by the memory system.

4.4 Hazard Detection

A hazard may exist between two memory operations if they access the same memory address. As the Transactional WaveCache maintains WaveScalar memory ordering within waves, operations in the same Wave will never cause a hazard. When a memory operation belonging to Wave X is executed, a hazard can only occur between Wave X and some other wave Y such that X.Id > Y.Id.

In our architecture, the MOH structure is the key to recognizing hazards. Consider two memory operations A and B, and their respective waves X and Y, such that X.Id > Y.Id. We assume that there were no previous speculative stores between them and that A has executed. When B finally executes, the possible hazards and respective solutions are as follows:

RAW (Read After Write): Occurs when A is a Load and B is a Store. All waves ≥ X must be aborted and re-executed. If B is speculative, it is inserted in the MOH and its backup field is either (i) copied from A's value field in the MOH or (ii) obtained from a Load.

WAW (Write After Write): When both A and B are Stores, Wave X does not need to be re-executed. If B is speculative, it is inserted in the MOH and its backup field receives A's backup. Either way, the Store need not go to main memory, and A's backup field in the MOH receives the value of Store B.

WAR (Write After Read): When A is a Store and B is a Load, we need not re-execute X. Instead, the backup field of A can be the return value for the Load. If B is speculative, it should be inserted in the MOH.
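The three cases can be gathered into a single dispatch routine; this is a sketch with hypothetical names (abort_from, moh_insert, result), assuming A has already executed when B, earlier in program order, arrives.

    def resolve_hazard(A, B, sb):
        """A (wave X) and B (wave Y) access the same address, with X.Id > Y.Id."""
        if A.op_type == 'Load' and B.op_type == 'Store':      # RAW
            sb.abort_from(A.wave)            # abort and re-execute all waves >= X
            if B.speculative:
                sb.moh_insert(B, backup=A.value)  # or obtained from a fresh Load
        elif A.op_type == 'Store' and B.op_type == 'Store':   # WAW
            if B.speculative:
                sb.moh_insert(B, backup=A.backup)  # B logically precedes A
            A.backup = B.value               # A now undoes to B's Store value
        else:                                # WAR: A is a Store, B is a Load
            B.result = A.backup              # the Load returns the pre-A value
            if B.speculative:
                sb.moh_insert(B)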

4.5 Commit and RollBack

Commits execute whenever a non-speculative wave finishes its execution in the memory system. A commit in Wave X makes wave X + 1 non-speculative. Therefore, the transactional context of X + 1 can be erased; if X + 1 has completed (i.e., F = TRUE), it will commit and we can proceed to the next wave. Notice that when a Wave commits, the next Wave may be under the responsibility of a different StoreBuffer. If so, a message is sent to that SB so that it knows that it now has the non-speculative wave and can start executing memory operations. The lastExeN attribute is also sent and updated in the destination SB, to be used in case of future re-executions.

Committing requires cleaning the MOH. One possibility would be to scan the whole structure. Our prototype uses a Search Catalog to speed up this operation: for each wave, the Search Catalog points to the list of all operations for that wave.

We are now in a position to present the necessary steps to re-execute a program when wave X is aborted (a sketch in code follows the list):

1. Erase the lines L in the Search Catalog where L.Wave > X;

2. Restore the memory to its previous state (using the backup field of the Stores in the MOH) before the execution of all operations of waves ≥ X. During this process, the MOH entries for those waves must also be erased;

3. Clean the WCTs I where I.Wave > X;

4. Clean all requests for Waves ≥ X in the local StoreBuffer. Requests in remote SBs will be cleaned lazily when they receive requests for the new execution;

5. Increment the lastExeN attribute in the local StoreBuffer and copy that value to the currentExeN of all Waves ≥ X;

6. Re-send all the operands in X's read set.
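A sketch of the six steps over illustrative in-memory structures (sb.catalog, sb.moh and sb.wcts map wave numbers to their entries; sb.resend stands for the hardware action of re-sending an operand; none of these names come from the actual design):

    def rollback(X, sb):
        sb.catalog = {w: l for w, l in sb.catalog.items() if w <= X}       # step 1
        for w in sorted((w for w in sb.moh if w >= X), reverse=True):      # step 2
            for rec in reversed(sb.moh[w]):        # undo speculative Stores
                if rec.op_type == 'Store':
                    sb.memory[rec.addr] = rec.backup
            del sb.moh[w]
        sb.wcts = {w: t for w, t in sb.wcts.items() if w <= X}             # step 3
        sb.requests = [r for r in sb.requests if r.wave < X]               # step 4
        sb.lastExeN += 1                                                   # step 5
        for wave in sb.waves:
            if wave.Id >= X:
                wave.currentExeN = sb.lastExeN
        for operand in sb.wcts[X].operands:                                # step 6
            sb.resend(operand, exen=sb.lastExeN)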

4.6 Erasing Old Operands

One problem with speculative execution is that operands from old executions compete for resources with operands from the current execution. In the worst case, old operands may find their way to memory access instructions and generate memory requests that will never be answered by the memory system. To make things worse, those operands will not be consumed from the PE queues and may eventually spill into memory, further degrading system performance. We propose two mechanisms to address this problem by removing old operands from the processing elements: MT Checkups and Execution Maps.

MT Checkups scan the matching tables, when an operand is received, for operands destined to the same instruction, wave, thread and application, but with a different ExeN. Operands from older executions are erased, and the newer operand is inserted in the matching table. If the received operand is the older one, it is ignored. If the executions are the same, the operand is simply inserted in the matching table.
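A sketch of the checkup logic, with 'table' mapping (instruction, wave) to the operands currently waiting in the matching table (thread and application identifiers omitted for brevity):

    from collections import defaultdict

    def mt_checkup(table, incoming):
        key = (incoming.instr, incoming.wave)
        if any(op.exen > incoming.exen for op in table[key]):
            return                                  # incoming is stale: ignore it
        # erase operands from older executions, keep those of the same execution
        table[key] = [op for op in table[key] if op.exen == incoming.exen]
        table[key].append(incoming)                 # insert the received operand

    matching_table = defaultdict(list)              # one per processing element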

For operands that are the first to arrive at an instruction, there are no other operands to compare against, so MT Checkups are not useful. Also, old executions are usually ahead of new ones, and the former will only be reached if they get stuck in memory access instructions. Execution Maps give the PEs the ability to eliminate operands based on a local view of the execution. A table containing pairs of <Wave, ExeN> is kept in each PE to hold information on allowed operands. If a PE has the pairs <0, 0> and <5, 1> in its Execution Map, it means that from Wave 0 to Wave 4 operands with ExeN ≥ 0 can be accepted and, for waves ≥ 5, only operands with ExeN ≥ 1 can be accepted. If an operand with Wave number 3 and ExeN = 1 arrives, it will be accepted and the pair <5, 1> will be replaced by the pair <3, 1> in the Execution Map.
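The acceptance test and the update rule can be sketched as follows; the update generalizes the single example in the text, so treat it as one plausible reading rather than the exact hardware rule.

    class ExecutionMap:
        """Per-PE pairs <Wave, ExeN>: for waves >= Wave, only operands with
        execution number >= ExeN are accepted."""
        def __init__(self):
            self.pairs = [(0, 0)]

        def required_exen(self, wave):
            # strongest constraint among the pairs whose wave bound applies
            return max((e for w, e in self.pairs if wave >= w), default=0)

        def accept(self, wave, exen):
            if exen < self.required_exen(wave):
                return False            # operand from an old execution: drop it
            # e.g. an operand <Wave=3, ExeN=1> replaces the pair <5,1> by <3,1>
            self.pairs = [(w, e) for w, e in self.pairs
                          if not (e == exen and w > wave)]
            if not any(w <= wave and e >= exen for w, e in self.pairs):
                self.pairs.append((wave, exen))
                self.pairs.sort()
            return True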

It is important to observe that these mechanisms do not guarantee that all old operands will be erased from the PEs, but they can minimize the explosion of parallelism caused by the Transactional WaveCache.

5. Experiments and Results

5.1 Methodology

Currently, WaveScalar programs must be compiled with the Alpha Tru64 cc compiler and then translated to WaveScalar assembly, using a binary translator. The binary translator does not support the full Alpha assembly yet, hence most programs run partially as WaveScalar and partially as Alpha programs. In this work, we extended the Kahuna WaveScalar architectural simulator [16] to support the Transactional WaveCache. The Transactional WaveCache does not yet support multi-threaded execution or Decoupled Stores. Alpha emulation support is available, but before control is transferred to the Alpha simulator, we need to stall speculation and wait for all waves to commit.

A set of benchmarks from SPEC 2000, Mediabench and MiBench was used to evaluate performance. The benchmarks were not executed to completion and, since the number of instructions executed can vary with the speculation window, we changed the simulator to allow a semantic cut that is performed upon commit. In doing so, we can compare the memory state of every scenario with the original WaveScalar. For the SPEC 2000 benchmarks (mcf, art and equake) we executed the waves corresponding to the first 2 million instructions; they were compiled with the O3 optimization level. The Mediabench (G721 and EPIC) and MiBench (CRC) benchmarks were compiled with O2, and their first 500 thousand instructions were executed.

Table 1 shows the architectural parameters used in the simulations. All the applications were executed on the original WaveScalar, without any memory disambiguation mechanism, and also with Decoupled Stores. They were also executed using the Transactional WaveCache with a Speculation Window size of 2. In order to have a more detailed understanding of the effect of the Speculation Window size on performance, EQUAKE and EPIC were also executed with Speculation Window sizes of 3, 5, 10, 20, 30 and with an infinite Speculation Window. The Transactional WaveCache structures (Wave-Context-Tables, Search Catalogs, MemOp-Histories and Execution Maps) were not limited in the implementation.

Table 1. Architectural parameters

Placement algorithm: Exp2 (dynamic)
Cache L1: direct-mapped; 1024 lines of 32 bits; access time: 1 cycle
Cache L2: 4-way set associative; 131072 lines of 128 bits; access time: 7 cycles
Memory access time: 100 cycles
Processing Elements: operand queue size: 10000000 lines; queue search time: 0; executions per cycle: 1; input operands per instruction per cycle: 3; number of instructions: 8
StoreBuffers: 4 input ports; 4 output ports

5.2 Results

Figure 3 shows the speedups of the Transactional WaveCache (TWC) and of Decoupled Stores for all applications, compared to the original WaveScalar without memory disambiguation. Floating point applications (ART and EQUAKE) show a significant speedup, even for a small Speculation Window (1.35 and 1.31, respectively). MCF and EPIC perform many function calls that are emulated on the Alpha machine. Since we have to disable speculation to do that, we only pay the overhead of the Transactional WaveCache when that happens: a speedup of 1.02 for MCF and a slowdown of 0.96 for EPIC were observed. This is caused by the emulated function calls, which would not be necessary if we had a WaveScalar compiler that could translate all instructions. If memory accesses happen very often, memory concurrency will be high and the Transactional WaveCache will tend not to provide speedups. Moreover, since we are executing only the first instructions of each benchmark, if the variable initialization phase is too long there will be a long period in which our mechanism does not help much with final performance.

Decoupled Stores performs better than the Transactional WaveCache for most applications. Since both techniques are orthogonal, a combination of the two is possible and is the subject of ongoing work.

Figure 3. TWC vs. Decoupled Stores

Figure 4 shows the speedups achieved for Speculation Window sizes varying from 2 to infinite for the EQUAKE benchmark. The speedup grows with the Speculation Window, although the maximum speedup (2.24) is reached for a Speculation Window of less than 20. This shows that the Transactional WaveCache structures can be small and still provide speedups. The same experiment was performed for EPIC. The slowdown is constant (0.96) for all Speculation Window lengths, showing that the window size has no influence on the speedup, since speculation is disabled very often due to emulated function calls.

Figure 4. EQUAKE - speedup vs. Speculation Window size

Among all the benchmarks, ART was the only application where RAW hazards, and therefore re-executions, happened. This was expected for small speculation windows, but for EQUAKE and EPIC, RAW hazards were not found even for big Speculation Windows. EPIC is limited by emulated function calls, and EQUAKE does not access the same memory address very often, which explains the high speedups achieved with no hazards found.

6. Conclusions and Future Work

The advent of multi-cores has kindled interest in novel architectures, such as WaveScalar, a DataFlow-style architecture that can run imperative programs. We presented the Transactional WaveCache, a memory disambiguation technique for WaveScalar that allows memory operations from different waves to execute out of order, in a speculative way. Initial results show that very significant speedups can be achieved through this mechanism, even in the presence of hazards. We believe our results show that aggressive speculation on memory accesses can improve performance for WaveScalar-style architectures.

Our results confirm previous work indicating that the default WaveScalar memory architecture can be a limiting factor in performance. We obtain results similar to those of the Decoupled Stores mechanism, which performs memory disambiguation within a wave. Since the Transactional WaveCache acts across different waves, the two contributions are orthogonal, suggesting that a combination of the two techniques should be a subject of further study.

Our work motivates a large number of future research directions. First, our implementation relies on prior work: the Kahuna simulator and the WaveScalar binary translator. As a next step, we plan to improve both systems so that the Transactional WaveCache can run more applications. This will allow a more complete understanding of the benefits and limitations of our technique.

We should also notice that our design is one point in the very large design space that has been explored by the Transactional Memory community. In the future, benefiting from our experience with the Transactional WaveCache, we plan to study alternative designs, namely towards simplifying design complexity. We also plan to study other user and/or low-level mechanisms to throttle parallelism, and to study whether these data structures could also be used to achieve synchronization.

References

[1] A.-R. Adl-Tabatabai, B. T. Lewis, V. Menon, B. R. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In Proceedings of the 2006 Conference on Programming Language Design and Implementation, pages 26-37, Jun 2006.

[2] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded transactional memory. IEEE Micro, 26(1):59-69, 2006.

[3] J. Chung, C. C. Minh, A. McDonald, T. Skare, H. Chafi, B. D. Carlstrom, C. Kozyrakis, and K. Olukotun. Tradeoffs in transactional memory virtualization. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 371-381, New York, NY, USA, 2006. ACM.

[4] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 336-346, New York, NY, USA, 2006. ACM.

[5] R. Desikan, D. C. Burger, S. W. Keckler, and T. Austin. Sim-alpha: A validated, execution-driven Alpha 21264 simulator. Technical Report TR-01-23, UT-Austin Computer Sciences, 2001.

[6] V. G. Grafe, G. S. Davidson, J. E. Hoch, and V. P. Holmes. The Epsilon dataflow processor. In ISCA '89: Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 36-45, New York, NY, USA, 1989. ACM Press.

[7] J. R. Gurd, C. C. Kirkham, and I. Watson. The Manchester prototype dataflow computer. Commun. ACM, 28(1):34-52, 1985.

[8] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software transactional memory for dynamic-sized data structures. Pages 92-101, Jul 2003.

[9] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, pages 289-300, 1993.

[10] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid transactional memory. In Proceedings of the Symposium on Principles and Practice of Parallel Programming, Mar 2006.

[11] A. McDonald, J. Chung, B. D. Carlstrom, C. C. Minh, H. Chafi, C. Kozyrakis, and K. Olukotun. Architectural semantics for practical transactional memory. In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 53-65, Washington, DC, USA, 2006. IEEE Computer Society.

[12] J. E. B. Moss. Open nested transactions: Semantics and support. In WMPI, Austin, TX, February 2006.

[13] G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In ISCA '90: Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 82-91, New York, NY, USA, 1990. ACM Press.

[14] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. SIGARCH Comput. Archit. News, 31(2):422-433, 2003.

[15] N. Shavit and D. Touitou. Software transactional memory. In PODC '95: Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, pages 204-213, New York, NY, USA, 1995. ACM.

[16] S. Swanson. The WaveScalar Architecture. PhD thesis, University of Washington, 2006.

[17] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In MICRO-36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 291-302, 2003.

[18] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. Eggers. The WaveScalar architecture. ACM Transactions on Computer Systems (TOCS), 25, February 2007.

[19] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11:25-33, Jan 1967.

[20] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. Computer, 30(9):86-93, Sept. 1997.
