
Efficient GPU Synchronization without Scopes: Saying No to …rsim.cs.illinois.edu/Pubs/15-MICRO-scopes.pdf · 2018-02-19 · Efficient GPU Synchronization without Scopes: Saying



Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models

Matthew D. Sinclair† Johnathan Alsop† Sarita V. Adve†‡

† University of Illinois at Urbana-Champaign  ‡ École Polytechnique Fédérale de Lausanne

[email protected]

ABSTRACT
As GPUs have become increasingly general purpose, applications with more general sharing patterns and fine-grained synchronization have started to emerge. Unfortunately, conventional GPU coherence protocols are fairly simplistic, with heavyweight requirements for synchronization accesses. Prior work has tried to resolve these inefficiencies by adding scoped synchronization to conventional GPU coherence protocols, but the resulting memory consistency model, heterogeneous-race-free (HRF), is more complex than the common data-race-free (DRF) model. This work applies the DeNovo coherence protocol to GPUs and compares it with conventional GPU coherence under the DRF and HRF consistency models. The results show that the complexity of the HRF model is neither necessary nor sufficient to obtain high performance. DeNovo with DRF provides a sweet spot in performance, energy, overhead, and memory consistency model complexity.

Specifically, for benchmarks with globally scoped fine-grained synchronization, compared to conventional GPU with HRF (GPU+HRF), DeNovo+DRF provides 28% lower execution time and 51% lower energy on average. For benchmarks with mostly locally scoped fine-grained synchronization, GPU+HRF is slightly better – however, this advantage requires a more complex consistency model and is eliminated with a modest enhancement to DeNovo+DRF. Further, if HRF’s complexity is deemed acceptable, then DeNovo+HRF is the best protocol.

Categories and Subject Descriptors
B.3.2 [Hardware]: Memory Structures – Cache memories; Shared memory; C.1.2 [Processor Architectures]: Single-instruction-stream, multiple-data-stream processors (SIMD); I.3.1 [Computer Graphics]: Graphics processors

Keywords
GPGPU, cache coherence, memory consistency models, data-race-free models, synchronization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Copyright 2015 ACM 978-1-4503-4034-2/15/12 ...$15.00.
DOI: http://dx.doi.org/10.1145/2830772.2830821

1. INTRODUCTION
GPUs are highly multithreaded processors optimized for data-parallel execution. Although initially used for graphics applications, GPUs have become more general-purpose and are increasingly used for a wider range of applications. In an ongoing effort to make GPU programming easier, industry has integrated CPUs and GPUs into a single, unified address space [1, 2]. This allows GPU data to be accessed on the CPU and vice versa without an explicit copy. While the ability to access data simultaneously on the CPU and GPU has the potential to make programming easier, GPUs need better support for issues such as coherence, synchronization, and memory consistency.

Previously, GPUs focused on data-parallel, mostly streaming, programs which had little or no sharing or data reuse between Compute Units (CUs). Thus, GPUs used very simple software-driven coherence protocols that assume data-race-freedom, regular data accesses, and mostly coarse-grained synchronization (typically at GPU kernel boundaries). These protocols invalidate the cache at acquires (typically the start of the kernel) and flush (writethrough) all dirty data before the next release (typically the end of the kernel) [3]. The dirty data flushes go to the next level of the memory hierarchy shared between all participating cores and CUs (e.g., a shared L2 cache). Fine-grained synchronization (implemented with atomics) was expected to be infrequent and executed at the next shared level of the hierarchy (i.e., bypassing private caches).

Thus, unlike conventional multicore CPU coherence protocols, conventional GPU-style coherence protocols are very simple, without need for writer-initiated invalidations, ownership requests, downgrade requests, protocol state bits, or directories. Further, although GPU memory consistency models have been slow to be clearly defined [4, 5], GPU coherence implementations were amenable to the familiar data-race-free model widely adopted for multicores today.

However, the rise of general-purpose GPU (GPGPU) computing has made GPUs desirable for applications with more general sharing patterns and fine-grained synchronization [6, 7, 8, 9, 10, 11]. Unfortunately, conventional GPU-style coherence schemes involving full cache invalidates, dirty data flushes, and remote execution at synchronizations are inefficient for these emerging workloads. To overcome these inefficiencies, recent work has proposed associating synchronization accesses with a scope that indicates the level of the memory hierarchy where the synchronization should occur [8, 12]. For example, a synchronization access with a local scope indicates that it synchronizes only the data accessed by the thread blocks within its own CU (which share the L1 cache). As a result, the synchronization can execute at the CU’s L1 cache, without invalidating or flushing data to lower levels of the memory hierarchy (since no other CUs are intended to synchronize through this access). For synchronizations that can be identified as having local scope, this technique can significantly improve performance by eliminating virtually all sources of synchronization overhead.

Although the introduction of scopes is an efficient solution to the problem of fine-grained GPU synchronization, it comes at the cost of programming complexity. Data-race-free is no longer a viable memory consistency model since locally scoped synchronization accesses potentially lead to “synchronization races” that can violate sequential consistency in non-intuitive ways (even for programs deemed to be well synchronized by the data-race-free memory model). Recently, Hower et al. addressed this problem by formalizing a new memory model, heterogeneous-race-free (HRF), to handle scoped synchronization. The Heterogeneous System Architecture (HSA) Foundation [1], a consortium of several industry vendors, and OpenCL 2.0 [13] have recently adopted a model similar to HRF with scoped synchronization.

Although HRF is a very well-defined model, it cannot hide the inherent complexity of using scopes. Intrinsically, scopes are a hardware-inspired mechanism that exposes the memory hierarchy to the programmer. Using memory models to expose a hardware feature is consistent with the past evolution of memory models (e.g., the IBM 370 and total store order (TSO) models essentially expose hardware store buffers), but is discouraging when considering the past confusion generated by such an evolution. Previously, researchers have argued against such a hardware-centric view and proposed more software-centric models such as data-race-free [14]. Although data-race-free is widely adopted, it is still a source of much confusion [15]. Given the subtleties and complexities associated even with the so-called simplest models, we argue GPU consistency models should not be even more complex than the CPU models.

We therefore ask the question – can we develop coherence protocols for GPUs that are close to the simplicity of conventional GPU protocols but give the performance benefits of scoped synchronization, while enabling a memory model no more complex than data-race-free?

We show that this is possible by considering a recent hardware-software hybrid protocol, DeNovo [16, 17, 18], originally proposed for CPUs. DeNovo does not require writer-initiated invalidations or directories, but does obtain ownership for written data. The key additional overhead for DeNovo over GPU-style coherence with HRF is 1 bit per word at the caches (previous work on DeNovo uses software regions [16] for selective invalidations; our baseline DeNovo protocol does not use these to minimize overhead). Specifically, this paper makes the following contributions:

• We identify DeNovo (without regions) as a viable coherence protocol for GPUs. Due to its use of ownership on writes, DeNovo is able to exploit reuse of written data and synchronization variables across synchronization boundaries, without the additional complexity of scopes.

• We compare DeNovo with the DRF consistency model (i.e., no scoped synchronization) to GPU-style coherence with the DRF and HRF consistency models (i.e., without and with scoped synchronization). As expected, GPU+DRF performs poorly for applications with fine-grained synchronization. However, DeNovo+DRF provides a sweet spot in terms of performance, energy, implementation overhead, and memory model complexity.

Specifically, we find the following when comparing the performance and energy for DeNovo+DRF and GPU+HRF. For benchmarks with no fine-grained synchronization, DeNovo is comparable to GPU. For microbenchmarks with globally scoped fine-grained synchronization, DeNovo is better than GPU (on average 28% lower execution time and 51% lower energy). For microbenchmarks with mostly locally scoped synchronization, GPU does better than DeNovo – on average 6% lower execution time, with a maximum reduction of 13% (4% and 10% lower, respectively, for energy). However, GPU+HRF’s modest benefit for locally scoped synchronization must be weighed against HRF’s higher complexity and GPU’s much lower performance for globally scoped synchronization.

• For completeness, we also enhance DeNovo+DRF with selective invalidations to avoid invalidating valid, read-only data regions at acquires. The addition of a single software (read-only) region does not add any additional state overhead, but does require the software to convey the read-only region information. This enhanced DeNovo+DRF protocol provides the same performance and energy as GPU+HRF on average.

• For cases where HRF’s complexity is deemed acceptable, we develop a version of DeNovo for the HRF memory model. We find that DeNovo+HRF is the best performing protocol – it is either comparable to or better than GPU+HRF for all cases and significantly better for applications with globally scoped synchronization.

Although conventional hardware protocols such as MESI support fine-grained synchronization with DRF, we do not compare to them here because prior research has observed that they incur significant complexity (e.g., writer-initiated invalidations, directory overhead, and many transient states leading to cache state overhead) and are a poor fit for conventional GPU applications [3, 19] (also evidenced by HSA’s adoption of HRF). Additionally, the DeNovo project has shown that for CPUs, DeNovo provides comparable or better performance than MESI at much less complexity [16, 17, 18].

This work is the first to show that GPUs can support fine-grained synchronization efficiently without resorting to the complexity of the HRF consistency model, at modest hardware overhead. DeNovo with DRF provides a sweet spot for performance, energy, overhead, and memory model complexity, questioning the recent move towards memory models for GPUs that are more complex than those for CPUs.

2. BACKGROUND: MEMORY CONSISTENCY MODELS

Depending on whether the coherence protocol uses scoped synchronization or not, we assume either data-race-free (DRF) [14] or heterogeneous-race-free (HRF) [8] as our memory consistency model.

DRF ensures sequential consistency (SC) to data-race-free programs. A program is data-race-free if its memory accesses are distinguished as data or synchronization, and, for all its SC executions, all pairs of conflicting data accesses are ordered by DRF’s happens-before relation. The happens-before relation is the irreflexive, transitive closure of program order and synchronization order, where the latter orders a synchronization write (release) before a synchronization read (acquire) if the write occurs before the read.
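To make the definition above concrete, the following is a minimal Python sketch of a DRF-style race check. The event encoding, names, and the tiny two-thread example are our own assumptions, not the paper's: happens-before is computed as the transitive closure of program order and synchronization order, and a data race is a pair of conflicting data accesses that happens-before leaves unordered.

```python
from itertools import product

# Events: (id, thread, kind, addr); kind is "data-read", "data-write",
# "release", or "acquire". `po` / `so` are program-order and sync-order edges.

def happens_before(po, so):
    """Irreflexive transitive closure of program order + synchronization order."""
    hb = set(po) | set(so)
    changed = True
    while changed:  # naive fixed-point transitive closure
        changed = False
        for a, b in list(hb):
            for b2, c in list(hb):
                if b2 == b and (a, c) not in hb:
                    hb.add((a, c))
                    changed = True
    return hb

def has_data_race(events, po, so):
    hb = happens_before(po, so)
    data = [e for e in events if e[2].startswith("data")]
    for x, y in product(data, data):
        if x[0] >= y[0] or x[1] == y[1] or x[3] != y[3]:
            continue  # consider each cross-thread, same-address pair once
        if "write" not in x[2] and "write" not in y[2]:
            continue  # two reads never conflict
        if (x[0], y[0]) not in hb and (y[0], x[0]) not in hb:
            return True  # conflicting accesses unordered by happens-before
    return False

# Thread 0 writes X then releases S; thread 1 acquires S then reads X.
ev = [(0, 0, "data-write", "X"), (1, 0, "release", "S"),
      (2, 1, "acquire", "S"), (3, 1, "data-read", "X")]
po = {(0, 1), (2, 3)}
so = {(1, 2)}  # the release of S occurs before the acquire that reads it

assert not has_data_race(ev, po, so)      # properly synchronized: race-free
assert has_data_race(ev, po, set())       # without sync order, the accesses race
```

This is the DRF intuition exactly: the release/acquire edge on S is what orders the write of X before the read of X; remove it and the program is racy.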

HRF is defined similarly to DRF except that each synchronization access has a scope attribute and HRF’s synchronization order only orders synchronization accesses with the same scope. There are two variants of HRF: HRF-Direct, which requires all threads that synchronize to use the same scope, and HRF-Indirect, which builds on HRF-Direct by providing extra support for transitive synchronization between different scopes. One key issue is the prospect of synchronization races – conflicting synchronization accesses to different scopes that are not ordered by HRF’s happens-before. Such races are not allowed by the model and cannot be used to order data accesses.

Common implementations of DRF and HRF enforce a program order requirement: an access X must complete before an access Y if X is program ordered before Y and either (1) X is an acquire and Y is a data access, (2) X is a data access and Y is a release, or (3) X and Y are both synchronization. For systems with caches, the underlying coherence protocol governs the program order requirement by defining what it means for an access to complete, as discussed in the next section.
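The three-rule program order requirement can be written down directly. This is a sketch with our own encoding of access kinds; the rule numbers correspond to (1)–(3) above:

```python
SYNC = {"acquire", "release"}

def must_complete_before(x, y):
    """True if access kind x, program ordered before y, must complete first."""
    if x == "acquire" and y == "data":   # rule (1): acquire before later data
        return True
    if x == "data" and y == "release":   # rule (2): data before a later release
        return True
    if x in SYNC and y in SYNC:          # rule (3): synchronization is ordered
        return True
    return False

assert must_complete_before("acquire", "data")
assert must_complete_before("data", "release")
assert must_complete_before("release", "acquire")
assert not must_complete_before("data", "data")     # data accesses may overlap
assert not must_complete_before("release", "data")  # no ordering required
```

Note the two negative cases: ordinary data accesses may be reordered or overlapped with each other, and data accesses after a release need not wait for it.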

3. A CLASSIFICATION OF COHERENCE PROTOCOLS

The end-goal of a coherence protocol is to ensure that a read returns the correct value from the cache. For the DRF and HRF models, this is the value from the last conflicting write as ordered by the happens-before relation for the model. Following the observations made for the DeNovo protocol [16, 17], we divide the task of a coherence protocol into the following:

          Invalidation   Tracking           Different
          Initiator      up-to-date copy    scopes?
Conv HW   writer         ownership          yes
SW        reader         writethrough       yes
Hybrid    reader         ownership          yes

Table 1: Classification of protocols covering conventional HW (e.g., MESI), SW (e.g., GPU), and Hybrid (e.g., DeNovo) coherence protocols.

(1) No stale data: A load hit in a private cache should never see stale data.

(2) Locatable up-to-date data: A load miss in a private cache must know where to get the up-to-date copy.

Table 1 classifies three classes of hardware coherence protocols in terms of how they enforce these requirements. Modern coherence protocols accomplish the first task through invalidation operations, which may be initiated by the writer or the reader of the data. The responsibility for the second task is usually handled by the writer, which either registers its ownership (e.g., at a directory) or uses writethroughs to keep a shared cache up-to-date. The HRF consistency model adds an additional dimension of whether a protocol can be enhanced with scoped synchronization.

Although our taxonomy is by no means comprehensive, it covers the space of protocols commonly used in CPUs and GPUs as well as recent work on hybrid software-hardware protocols. We next describe example implementations from each class. Without loss of generality, we assume a two-level cache hierarchy with private L1 caches and a shared last-level L2 cache. In a GPU, the private L1 caches are shared by thread blocks [20] executing on the corresponding GPU CU.

Conventional Hardware Protocols used in CPUs

CPUs conventionally use pure hardware coherence protocols (e.g., MESI) that rely on writer-initiated invalidations and ownership tracking. They typically use a directory to maintain the list of (clean) sharers or the current owner of (dirty) data (at the granularity of a cache line). If a core issues a write to a line that it does not own, then it requests ownership from the directory, sending invalidations to any sharers or the previous owner of the line. For the purpose of invalidations and ownership, data and synchronization accesses are typically treated uniformly. For the program order constraint described in Section 2, a write is complete when its invalidations reach all sharers or the previous owner of the line. A read completes when it returns its value and that value is globally visible.
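The writer-initiated part of this scheme can be sketched as a tiny directory model. This is our own simplification (per-line state only, no transient states or network modeling), not a full MESI implementation:

```python
class Directory:
    """Per-line directory: tracks clean sharers or a single dirty owner."""

    def __init__(self):
        self.sharers = {}  # line -> set of core ids holding a clean copy
        self.owner = {}    # line -> core id holding the dirty (owned) copy

    def read(self, core, line):
        # A clean read adds the core to the sharer list.
        self.sharers.setdefault(line, set()).add(core)

    def write(self, core, line):
        """Grant ownership; return the cores that must invalidate the line."""
        invalidated = self.sharers.pop(line, set()) - {core}
        prev = self.owner.get(line)
        if prev is not None and prev != core:
            invalidated.add(prev)
        self.owner[line] = core
        return invalidated

d = Directory()
d.read(0, "A")
d.read(1, "A")
assert d.write(2, "A") == {0, 1}  # write invalidates both clean sharers
assert d.write(3, "A") == {2}     # next write invalidates the previous owner
```

The returned set is exactly why these protocols need invalidation and acknowledgment traffic, directory storage, and transient states while invalidations are in flight, which is the complexity the paper is arguing GPUs can avoid.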

Although such protocols have not been explored with the HRF memory model, it is possible to exploit scoped synchronization with them. However, the added benefits are unclear. Furthermore, as discussed in Section 1, conventional CPU protocols are a poor fit for GPUs and are included here primarily for completeness.

Software Protocols used in GPUs

GPUs use simple, primarily software-based coherence mechanisms, without writer-initiated invalidations or ownership tracking. We first consider the protocols without scoped synchronization.


GPU protocols use reader-initiated invalidations – an acquire synchronization (e.g., atomic reads or kernel launches) invalidates the entire cache so future reads do not return stale values. A write results in a writethrough to a cache (or memory) shared by all the cores participating in the coherence protocol (the L2 cache with our assumptions) – for improved performance, these writethroughs are buffered and coalesced until the next release (or until the buffer is full). Thus, a (correctly synchronized) read miss can always obtain the up-to-date copy from the L2 cache.
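The buffered, coalescing writethrough behavior can be sketched as follows. The class name, capacity parameter, and dict-based L2 are our own illustrative choices:

```python
class WritethroughBuffer:
    """Sketch of a GPU-style store buffer: coalesce writes, flush at release."""

    def __init__(self, l2, capacity=4):
        self.l2 = l2              # shared L2, modeled as a dict
        self.capacity = capacity
        self.pending = {}         # addr -> latest buffered value

    def write(self, addr, value):
        self.pending[addr] = value        # repeated writes to addr coalesce
        if len(self.pending) >= self.capacity:
            self.flush()                  # buffer full: drain early

    def flush(self):
        """Performed at a release: all writethroughs reach the shared L2."""
        self.l2.update(self.pending)
        self.pending.clear()

l2 = {}
buf = WritethroughBuffer(l2)
buf.write("X", 1)
buf.write("X", 2)                         # coalesced into a single L2 write
assert l2 == {} and buf.pending == {"X": 2}
buf.flush()                               # release: L2 is now up to date
assert l2["X"] == 2 and buf.pending == {}
```

The flush is the cost a release pays in this scheme: it cannot complete until every buffered writethrough reaches the L2, which is the bursty traffic the qualitative analysis in Section 4 returns to.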

Since GPU protocols do not have writer-initiated invalidations, ownership tracking, or scoped synchronization, they perform synchronization accesses at the shared L2 (more generally, the closest memory shared by all participating cores). For the program order requirement, preceding writes are now considered complete by a release when their writethroughs reach the shared L2 cache. Synchronization accesses are considered complete when they are performed at the shared L2 cache.

The GPU protocols are simple, do not require protocol state bits (other than valid bits), and do not incur invalidation and other protocol traffic overheads. However, synchronization operations are expensive – the operations are performed at the L2 (or the closest shared memory), an acquire invalidates the entire cache, and a release must wait until all previous writethroughs reach the shared L2. Scoped synchronizations reduce these penalties for local scopes.

In our two-level hierarchy, there are two scopes – private L1 (shared by thread blocks on a CU) and shared L2 (shared by all cores and CUs). We refer to these as local and global scopes, respectively. A locally scoped synchronization does not have to invalidate the L1 (on an acquire), does not have to wait for writethroughs to reach the L2 (on a release), and is performed locally at the L1. Globally scoped synchronization is similar to synchronization accesses without scopes.
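The local/global asymmetry can be made explicit in a short sketch (the function signatures and log strings are our own, chosen only for illustration): a global synchronization pays the invalidate/flush costs and runs at the L2, while a local one skips all three.

```python
def acquire(scope, l1_valid, log):
    """Global acquire invalidates the entire L1; local acquire runs at L1."""
    if scope == "global":
        l1_valid.clear()
        log.append("invalidate-L1")
    log.append("performed-at-" + ("L2" if scope == "global" else "L1"))

def release(scope, store_buffer, l2, log):
    """Global release flushes buffered writethroughs; local release does not."""
    if scope == "global":
        l2.update(store_buffer)
        store_buffer.clear()
        log.append("flush-writethroughs")
    log.append("performed-at-" + ("L2" if scope == "global" else "L1"))

log = []
valid, sb, l2 = {"A"}, {"X": 1}, {}
acquire("local", valid, log)
release("local", sb, l2, log)
assert valid == {"A"} and sb == {"X": 1}   # nothing invalidated or flushed
acquire("global", valid, log)
assert valid == set() and "invalidate-L1" in log
```

This also makes the programming hazard visible: a "local" label is only safe if no other CU synchronizes through the access, which is exactly the burden scopes place on the programmer.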

Although scopes reduce the performance penalty, they complicate the programming model, effectively exposing the memory hierarchy to the programmer.

DeNovo: A Hybrid Hardware-Software Protocol

DeNovo is a recent hybrid hardware-software protocol that uses reader-initiated invalidations with hardware-tracked ownership. Since there are no writer-initiated invalidations, no directory is needed to track sharer lists. DeNovo uses the shared L2’s data banks to track ownership – either the data bank has the up-to-date copy of the data (no L1 cache owns it) or it keeps the ID of the core that owns the data. DeNovo refers to the L2 as the registry and the obtaining of ownership as registration. DeNovo has three states – Registered, Valid, and Invalid – similar to the Modified, Shared, and Invalid states of the MSI protocol. The key difference from MSI is that DeNovo has precisely these three states with no transient states, because DeNovo exploits data-race-freedom and does not have writer-initiated invalidations. A consequence of exploiting data-race-freedom is that the coherence states are stored at word granularity (although tags and data communication are at a larger conventional line granularity, like sector caches).¹
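The three stable per-word states and their key consequence can be sketched directly. The cache model below is our own simplification (no registry messages, single cache):

```python
INVALID, VALID, REGISTERED = "I", "V", "R"

class DeNovoL1:
    """Sketch of a DeNovo L1 with word-granularity coherence state."""

    def __init__(self):
        self.state = {}  # word address -> I / V / R
        self.data = {}

    def load(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            return None                    # miss: must go to the registry/L2
        return self.data[addr]             # hit: Valid or Registered data

    def store(self, addr, value):
        # A write obtains registration (ownership) for exactly this word.
        self.state[addr] = REGISTERED
        self.data[addr] = value

    def acquire_invalidate(self):
        # Registered words are guaranteed up to date, so they survive the
        # self-invalidation an acquire performs; only Valid words are dropped.
        for addr, s in self.state.items():
            if s == VALID:
                self.state[addr] = INVALID

c = DeNovoL1()
c.store("X", 7)                   # X becomes Registered
c.state["Y"], c.data["Y"] = VALID, 3
c.acquire_invalidate()
assert c.load("X") == 7           # owned data reused across the acquire
assert c.load("Y") is None        # merely-Valid data was invalidated
```

The final two assertions are the crux of DeNovo's advantage over GPU coherence: written (registered) data survives synchronization boundaries instead of being flushed and refetched.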

Like GPU protocols, DeNovo invalidates the cache on an acquire; however, these invalidations can be selective in several ways. Our baseline DeNovo protocol exploits the property that data in registered state is up-to-date and thus does not need to be invalidated (even if the data is accessed globally by multiple CUs). Previous DeNovo work has also explored additional optimizations such as software regions and touched bits. We explore a simple variant where we identify read-only data regions and do not invalidate those on acquires (for simplicity, we do not explore more comprehensive regions or touched bits). The read-only region is a hardware-oblivious, program-level property and is easier to determine than annotating all synchronization accesses with (hardware- and schedule-specific) scope information.
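The read-only-region variant adds one more exemption to the acquire-time self-invalidation. A minimal sketch, with the state names and region encoding assumed by us:

```python
def acquire_invalidate(state, read_only):
    """Invalidate Valid words, sparing Registered words and read-only regions."""
    for addr, s in state.items():
        if s == "Valid" and addr not in read_only:
            state[addr] = "Invalid"
    return state

state = {"X": "Registered", "Y": "Valid", "Z": "Valid"}
acquire_invalidate(state, read_only={"Z"})
assert state == {"X": "Registered", "Y": "Invalid", "Z": "Valid"}
```

Because read-only-ness is a program-level property ("this data is never written during this phase"), software can convey it once per region rather than reasoning about scopes at every synchronization access.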

For synchronization accesses, we use the DeNovoSync0 protocol [18], which registers both read and write synchronizations. That is, unless the location is in registered state in the L1, it is treated as a miss for both (synchronization) reads and writes and requires a registration operation. This potentially provides better performance than conventional GPU protocols, which perform all synchronization at the L2 (i.e., no synchronization hits).

DeNovoSync0 serves racy synchronization registrations immediately at the registry, in the order in which they arrive. For an already registered word, the registry forwards a new registration request to the registered L1. If the request reaches the L1 before the L1’s own registration acknowledgment, it is queued at the L1’s MSHR. In a high contention scenario, multiple racy synchronizations from different cores will form a distributed queue. Multiple synchronization requests from the same CU (from different thread blocks) are coalesced within the CU’s MSHR and all are serviced before any queued remote request, thereby exploiting locality even under contention. As noted in previous work, the distributed queue serializes registration acknowledgments from different CUs – this throttling is beneficial when the contending synchronizations will be unsuccessful (e.g., unsuccessful lock accesses) but can add latency to the critical path if several of these synchronizations (usually readers) are successful. As discussed in [18], the latter case is uncommon.
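A much-simplified model of this registration flow is sketched below. It is our own abstraction, not the paper's hardware: it keeps the forwarding queue at the registry itself rather than distributed across CU MSHRs, but it preserves the arrival-order grant and FIFO hand-off behavior described above.

```python
from collections import deque

class Registry:
    """Sketch of DeNovoSync0-style registration at the shared registry."""

    def __init__(self):
        self.owner = {}      # word -> CU currently registered
        self.waiting = {}    # word -> FIFO of CUs whose requests were forwarded

    def register(self, cu, word):
        if word not in self.owner:
            self.owner[word] = cu          # grant immediately, arrival order
            return "granted"
        # Already registered: forward the request toward the owning CU.
        self.waiting.setdefault(word, deque()).append(cu)
        return f"forwarded-to-CU{self.owner[word]}"

    def release(self, word):
        """Owner finishes; hand registration to the next queued CU, if any."""
        q = self.waiting.get(word)
        if q:
            self.owner[word] = q.popleft()
        else:
            del self.owner[word]

r = Registry()
assert r.register(0, "lock") == "granted"
assert r.register(1, "lock") == "forwarded-to-CU0"
assert r.register(2, "lock") == "forwarded-to-CU0"
r.release("lock")
assert r.owner["lock"] == 1    # FIFO hand-off: the queue serializes acks
```

The final assertion illustrates the serialization the paper discusses: queued registrants are acknowledged one at a time, which throttles failed lock attempts cheaply but can add latency when many queued synchronizations would have succeeded.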

DeNovoSync optimizes DeNovoSync0 by incorporating a backoff mechanism on registered reads when there is too much read-read contention. We do not explore it, for simplicity.

To enforce the program order requirement, DeNovo considers a data write and a synchronization (read or write) complete when it obtains registration. As before, data reads are complete when they return their value.

DeNovo has not been previously evaluated with scoped synchronization, but can be extended in a natural way. Local acquires and releases do not invalidate the cache or flush the store buffer. Additionally, local synchronization operations can delay obtaining ownership.

¹ This does not preclude byte granularity accesses as discussed in [16]. None of our benchmarks, however, have byte granularity accesses.

4. QUALITATIVE ANALYSIS OF THE PROTOCOLS

4.1 Qualitative Performance Analysis
We study the GPU and DeNovo protocols, with and without scopes, as described in Section 3. In order to understand the advantages and disadvantages of each protocol, Table 2 qualitatively compares coherence protocols across several key features that are important for emerging workloads with fine-grained synchronization: exploiting reuse of data across synchronization points (in L1), avoiding bursty traffic (especially for writes), decreasing network traffic by avoiding overheads like invalidations and acknowledgment messages, only transferring useful data by decoupling the coherence and transfer granularity, exploiting reuse of synchronization variables (in L1), and efficient support for dynamic sharing patterns such as work stealing. The coherence protocols have different advantages and disadvantages based on their support for these features:

GPU coherence, DRF consistency (GPU-D): Conventional GPU protocols with DRF do not require invalidation or acknowledgment messages because they self-invalidate all valid data at all synchronization points and write through all dirty data to the shared, backing LLC. However, there are also several inefficiencies which stem from poor support for fine-grained synchronization and not obtaining ownership. Because GPU coherence protocols do not obtain ownership (and don’t have writer-initiated invalidations), they must perform synchronization accesses at the LLC, they must flush all dirty data from the store buffer on releases, and they must self-invalidate the entire cache on acquires. As a result, GPU-D cannot reuse any data across synchronization points (e.g., acquires, releases, and kernel boundaries). Flushing the store buffer at releases and kernel boundaries also causes bursty writethrough traffic. GPU coherence protocols also transfer data at a coarse granularity to exploit spatial locality; for emerging workloads with fine-grained synchronization or strided accesses, this can be sub-optimal. Furthermore, algorithms with dynamic sharing must synchronize at the LLC to prevent stale data from being accessed.

GPU coherence, HRF consistency (GPU-H): Changing the memory model from DRF to HRF removes several inefficiencies from GPU coherence protocols while retaining the benefit of no invalidation or acknowledgment messages. Although globally scoped synchronization accesses have the same behavior as GPU-D, locally scoped synchronization accesses occur locally and do not require bursty writebacks, self-invalidations, or flushes, improving support for fine-grained synchronization and allowing data to be reused across synchronization points. However, scopes do not provide efficient support for algorithms with dynamic sharing because programmers must conservatively use a global scope for these algorithms to prevent stale data from being accessed.

DeNovo coherence, DRF consistency (DeNovo-D): The DeNovo coherence protocol with DRF has several advantages over GPU-D. DeNovo-D’s use of ownership enables it to provide several of the advantages of GPU-H without exposing the memory hierarchy to the programmer. For example, DeNovo-D can reuse written data across synchronization boundaries since it does not self-invalidate registered data on an acquire. With the read-only optimization, this benefit also extends to read-only data. DeNovo-D also sees hits on synchronization variables with temporal locality both within a thread block and across thread blocks on the same CU. Obtaining ownership also allows DeNovo-D to avoid bursty writebacks at releases and kernel boundaries. Unlike GPU-H, obtaining ownership specifically provides efficient support for applications with dynamic sharing and also transfers less data by decoupling the coherence and transfer granularity.

Although obtaining ownership usually results in a higher hit rate, it can sometimes increase miss latency; e.g., an extra hop if the requested word is in a remote L1 cache or additional serialization for some synchronization patterns with high contention (Section 3). The benefits, however, dominate in our results.

DeNovo coherence, HRF consistency (DeNovo-H): Using the HRF memory model with the DeNovo coherence protocol combines all the advantages of ownership that DeNovo-D enjoys with the advantages of local scopes that GPU-H enjoys.
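For contrast with the GPU-D behavior, the key DeNovo-D difference can be sketched in the same toy-model style (again with invented names, and with registration messages to the directory elided): registered words are known to be up to date, so an acquire only self-invalidates unowned valid words.

```python
# Toy model (invented names) of DeNovo-style ownership under DRF:
# words this core has registered (obtained ownership for) survive
# acquires, so written data is reused across synchronization points.
class ToyDeNovoL1:
    def __init__(self, llc):
        self.llc = llc       # backing store / directory: addr -> value
        self.valid = {}      # unowned clean words (invalidated on acquire)
        self.owned = {}      # registered words (survive acquires)

    def load(self, addr):
        if addr in self.owned:
            return self.owned[addr], "hit"
        if addr in self.valid:
            return self.valid[addr], "hit"
        value = self.llc[addr]
        self.valid[addr] = value
        return value, "miss"

    def store(self, addr, value):
        # Obtain ownership (registration) instead of writing through;
        # later reads and writes of this word hit locally.
        self.owned[addr] = value

    def acquire(self):
        self.valid.clear()   # owned words are NOT invalidated

    def release(self):
        pass                 # owned dirty data need not be flushed
```

Unlike the GPU-D sketch, a word written before an acquire still hits after it; only unowned valid data must be refetched.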

4.2 Protocol Implementation Overheads

Each of these protocols has several sources of implementation overhead:

GPU-D: Since GPU-D does not track ownership, the L1 and L2 caches only need 1 bit (a valid bit) per line to track the state of the cache line. GPU coherence also needs support for flash invalidating the entire cache on acquires and buffering writes until a release occurs.

GPU-H: GPU coherence with the HRF memory model additionally requires a bit per word in the L1 caches to keep track of partial cache block writes (3% overhead compared to GPU-D's L1 cache). Like GPU-D, GPU-H also requires support for flash invalidating the cache for globally scoped acquires and releases and has an L2 overhead of 1 valid bit per cache line.

DeNovo-D and DeNovo-H: DeNovo needs per-word state bits for the DRF and HRF memory models because DeNovo tracks coherence at the word granularity. Since DeNovo has 3 coherence states, at the L1 cache we need 2 bits per word (3% overhead over GPU-H). At the L2, DeNovo needs one valid and one dirty bit per line and one bit per word (3% overhead versus GPU-H).

DeNovo-D with read-only optimization (DeNovo-D+RO): Logically, DeNovo needs an additional bit per word at the L1 caches to store the read-only information. However, to avoid incurring additional overhead, we reuse the extra, unused state from DeNovo's coherence bits. There is some overhead to convey the region information from the software to the hardware. We pass this information through an opcode bit for memory instructions.
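As a sanity check on the ~3% figures above: one extra state bit per 32-bit word costs 1/32 ≈ 3% of the data bits. The short calculation below assumes 4-byte words and 16-word cache lines (sizes not stated in the text, chosen for illustration):

```python
# Back-of-the-envelope check of the per-word state-bit overheads quoted
# above. Word and line sizes are assumptions for illustration.
WORD_BITS = 32
WORDS_PER_LINE = 16
LINE_BITS = WORD_BITS * WORDS_PER_LINE            # 512 data bits per line

gpu_d_l1_bits  = 1                                # GPU-D: 1 valid bit per line
gpu_h_l1_bits  = gpu_d_l1_bits + WORDS_PER_LINE   # +1 bit/word (partial writes)
denovo_l1_bits = 2 * WORDS_PER_LINE               # DeNovo: 2 bits/word (3 states)

# Each extra bit-per-word is ~1/32 of the data bits, i.e. ~3%.
gpu_h_overhead  = (gpu_h_l1_bits - gpu_d_l1_bits) / LINE_BITS
denovo_overhead = (denovo_l1_bits - gpu_h_l1_bits) / LINE_BITS
```

Both ratios round to 3%, matching the overheads quoted in the text.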

Page 6: Efficient GPU Synchronization without Scopes: Saying No to …rsim.cs.illinois.edu/Pubs/15-MICRO-scopes.pdf · 2018-02-19 · Efficient GPU Synchronization without Scopes: Saying

Feature | Benefit | GD | GH | DD | DH
Reuse Written Data | Reuse written data across synch points | ✗ | ✓ (if local scope) | ✓ | ✓
Reuse Valid Data | Reuse cached valid data across synch points | ✗ | ✓ (if local scope) | ✗² | ✓ (if local scope)
No Bursty Traffic | Avoid bursts of writes | ✗ | ✓ (if local scope) | ✓ | ✓
No Invalidations/ACKs | Decreased network traffic | ✓ | ✓ | ✓ | ✓
Decoupled Granularity | Only transfer useful data | ✗ | ✗ | ✓ | ✓
Reuse Synchronization | Efficient support for fine-grained synch | ✗ | ✓ (if local scope) | ✓ | ✓
Dynamic Sharing | Efficient support for work stealing | ✗ | ✗ | ✓ | ✓

Table 2: Comparison of studied coherence protocols.

[Figure 1: Baseline heterogeneous architecture [21]. CPU cores and GPU CUs (each with a private L1 cache; GPU CUs also have a scratchpad) connect via an interconnect to the banks of a shared L2 cache.]


5. METHODOLOGY

Our work is influenced by previous work on DeNovo [16, 17, 18, 21]. We leverage the project's existing infrastructure [21] and extend it to support GPU synchronization operations based on the DeNovoSync0 coherence protocol for multicore CPUs [18].

5.1 Baseline Heterogeneous Architecture

We model a tightly coupled CPU-GPU architecture with a unified shared memory address space and coherent caches. The system connects all CPU cores and GPU Compute Units (CUs) via an interconnection network. Like prior work, each CPU core and each GPU CU (analogous to an NVIDIA SM) is on a separate network node. Each network node has an L1 cache (local to the CPU core or GPU CU) and a bank of the shared L2 cache (shared by all CPU cores and GPU CUs).3 The GPU nodes also have a scratchpad. Figure 1 illustrates this baseline system, which is similar to our prior work [21]. The coherence protocol, consistency model, and write policy depend on the system configuration studied (Section 5.3).

5.2 Simulation Environment and Parameters

We simulate the above architecture using an integrated CPU-GPU simulator built from the Simics full-system functional simulator to model the CPUs, the Wisconsin GEMS memory timing simulator [22], and GPGPU-Sim v3.2.1 [23] to model the GPU (the GPU is similar to an NVIDIA GTX 480). The simulator also uses Garnet [24] to model a 4x4 mesh interconnect with a GPU CU or a CPU core at each node. We use CUDA 3.1 [20] for the GPU kernels in the applications since this is the latest version of CUDA that is fully supported in GPGPU-Sim. Table 3 summarizes the common key parameters of our system.

2 Mitigated by the read-only enhancement.
3 HRF [8] uses a three-level cache hierarchy; we use two levels because the GEMS simulation environment (Section 5.2) only supports two levels. We believe our results are not qualitatively affected by the depth of the memory hierarchy.

CPU Parameters
Frequency | 2 GHz
Cores | 1

GPU Parameters
Frequency | 700 MHz
CUs | 15

Memory Hierarchy Parameters
L1 Size (8 banks, 8-way assoc.) | 32 KB
L2 Size (16 banks, NUCA) | 4 MB
Store Buffer Size | 256 entries
L1 hit latency | 1 cycle
Remote L1 hit latency | 35−83 cycles
L2 hit latency | 29−61 cycles
Memory latency | 197−261 cycles

Table 3: Simulated heterogeneous system parameters.

For energy modeling, GPU CUs use GPUWattch [25] and the NoC energy measurements use McPAT v1.1 [26] (our tightly coupled architecture more closely resembles a multicore system's NoC than the NoC modeled in GPUWattch). We do not model the CPU core or CPU L1 energy since the CPU is only functionally simulated and not the focus of this work.

We provide an API for manually inserted annotations for region information (for DeNovo-D+RO) and distinguishing synchronization instructions and their scope (for HRF).

5.3 Configurations

We evaluate the following configurations with GPU and DeNovo coherence protocols combined with DRF and HRF consistency models, using the implementations described in Section 3. The CPU always uses the DeNovo coherence protocol. For all configurations we assume 256-entry coalescing store buffers next to the L1 caches. We also assume support for performing synchronization accesses (using atomics) at the L1 and L2. We do not allow relaxed atomics [12] since precise semantics for them are under debate and their use is discouraged [15, 27].

GPU-D (GD): GD combines the baseline DRF memory model (no scopes) with GPU coherence and performs all synchronization accesses at the L2 cache.


Benchmark | Input

No Synchronization
Backprop (BP) [28] | 32 KB
Pathfinder (PF) [28] | 10 x 100K matrix
LUD [28] | 256x256 matrix
NW [28] | 512x512 matrix
SGEMM [29] | medium
Stencil (ST) [29] | 128x128x4, 4 iters
Hotspot (HS) [28] | 512x512 matrix
NN [28] | 171K records
SRAD v2 (SRAD) [28] | 256x256 matrix
LavaMD (LAVA) [28] | 2x2x2 matrix

Global Synchronization
FA Mutex (FAM_G), Sleep Mutex (SLM_G), Spin Mutex (SPM_G), Spin Mutex+backoff (SPMBO_G) | 3 TBs/CU, 100 iters/TB/kernel, 10 Ld&St/thr/iter

Local or Hybrid Synchronization
FA Mutex (FAM_L), Sleep Mutex (SLM_L), Spin Mutex (SPM_L), Spin Mutex+backoff (SPMBO_L), Tree Barr+local exch (TBEX_LG), Tree Barr (TB_LG) | 3 TBs/CU, 100 iters/TB/kernel, 10 Ld&St/thr/iter
Spin Sem (SS_L), Spin Sem+backoff (SSBO_L) [6] | 3 TBs/CU, 100 iters/TB/kernel, readers: 10 Ld/thr/iter, writers: 20 St/thr/iter
UTS [8] | 16K nodes

Table 4: Benchmarks with input sizes. All thread blocks (TBs) in the synchronization microbenchmarks execute the critical section or barrier many times. Microbenchmarks with local and global scope are denoted with a '_L' and '_G', respectively.

GPU-H (GH): GH uses GPU coherence and HRF's HRF-Indirect memory model. GH performs locally scoped synchronization accesses at the L1s and globally scoped synchronization accesses at the L2.

DeNovo-D (DD): DD uses the DeNovoSync0 coherence protocol (without regions), a DRF memory model, and performs all synchronization accesses at the L1 (after registration).

DeNovo-D with read-only optimization (DD+RO): DD+RO augments DD with selective invalidations to avoid invalidating valid read-only data on acquires.

DeNovo-H (DH): DH combines DeNovo-D with the HRF-Indirect memory model. Like GH, local scope synchronizations always occur at the L1 and do not require invalidations or flushes.
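The five configurations above can also be restated as data, which makes the distinctions easy to check mechanically. This is just a restatement of the definitions; the field names and the helper are invented:

```python
# The studied configurations restated as data. Field names are invented;
# the values restate the definitions in Section 5.3.
CONFIGS = {
    # name:  (consistency, coherence, ownership?, where global syncs occur)
    "GD":    ("DRF", "GPU",    False, "L2"),
    "GH":    ("HRF", "GPU",    False, "L2"),  # local syncs occur at L1
    "DD":    ("DRF", "DeNovo", True,  "L1 (after registration)"),
    "DD+RO": ("DRF", "DeNovo", True,  "L1 (after registration)"),
    "DH":    ("HRF", "DeNovo", True,  "L1 (after registration)"),
}

def uses_scopes(name):
    """HRF configurations expose scoped synchronization to the programmer."""
    return CONFIGS[name][0] == "HRF"
```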

5.4 Benchmarks

Evaluating our configurations is challenging because there are very few GPU application benchmarks that use fine-grained synchronization. Thus, we use a combination of application benchmarks and microbenchmarks to cover the space of use cases with (1) no synchronization within a GPU kernel, (2) synchronization that requires global scope, and (3) synchronization with mostly local scope. All codes execute GPU kernels on 15 GPU CUs and use a single CPU core. Parallelizing the CPU portions is left for future work.

5.4.1 Applications without Intra-Kernel Synchronization

We examine 10 applications from modern heterogeneous computing suites such as Rodinia [28, 30] and Parboil [29]. None of these applications use synchronization within the GPU kernel, and they are also not written to exploit reuse across kernels. These applications therefore primarily serve to establish DeNovo as a viable protocol for today's use cases. The top part of Table 4 summarizes these applications and their input sizes.

5.4.2 (Micro)Benchmarks with Intra-Kernel Synchronization

Most GPU applications do not use fine-grained synchronization because it is not well supported on current GPUs. Thus, to examine the performance for benchmarks with various kinds of synchronization we use a set of synchronization primitive microbenchmarks, developed by Stuart and Owens [6] – these include mutex locks, semaphores, and barriers. We also use the Unbalanced Tree Search (UTS) benchmark [8], the only benchmark that uses fine-grained synchronization in the HRF paper.4 The microbenchmarks include centralized and decentralized algorithms with a wide range of stall cycles and scalability characteristics. The amount of work per thread also varies: the mutex and tree barrier algorithms access the same amount of data per thread while UTS and the semaphores access different amounts of data per thread. The bottom part of Table 4 summarizes the benchmarks and their input sizes.

We modified the original synchronization primitive microbenchmarks to perform data accesses in the critical section such that the mutex microbenchmarks have two versions: one performs local synchronization and accesses unique data per CU while the other uses global synchronization because the same data is accessed by all thread blocks. We also changed the globally synchronized barrier microbenchmark to use local and global synchronization with a tree barrier: all thread blocks on a CU access unique data and join a local barrier before one thread block from each CU joins the global barrier. After the global barrier, thread blocks exchange data for the subsequent iteration of the compute phase. We also added a version of the tree barrier where each CU exchanges data locally before joining the global barrier. Additionally, we changed the semaphores to use a reader-writer format with local synchronization: each CU has one writer thread block and two reader thread blocks. Each reader reads half of the CU's data. The writer shifts the reader thread blocks' data to the right such that all elements are written except for the first element of each thread block. To ensure that no stale data is accessed, the writers obtain the entire semaphore. The working set fits in the L1 cache for all microbenchmarks except TBEX_LG and TB_LG, which have larger working sets because they repeatedly exchange data across CUs.

4 RemoteScopes [11] uses several GPU benchmarks with fine-grained synchronization from Pannotia [10], but these benchmarks are not publicly available.


[Figure 2: Execution time (a), dynamic energy (b), and network traffic (c) for G* and D*, normalized to D*, for benchmarks without synchronization (BP, PF, LUD, NW, SGEMM, ST, HS, NN, SRAD, LAVA). Energy is divided into GPU core+, scratchpad, L1 D$, L2 $, and network components; traffic is divided into read, registration, WB/WT, and atomic components.]

Similar to the tree barrier, the UTS benchmark utilizes both local and global synchronization. By performing local synchronization accesses, UTS quickly completes its work. However, since the tree is unbalanced, it is likely that some thread blocks will complete before others. To mitigate load imbalance, CUs push to and pull from a global task queue when their local queues become full or empty, respectively.
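The local/global queueing that UTS uses can be sketched as follows. This is a hypothetical simplification of the task-queue structure (the class and its methods are invented, not the benchmark's actual code):

```python
# Sketch of UTS-style hybrid task queues: each CU works from a bounded
# local queue and falls back to a shared global queue when its local
# queue fills up (push) or runs dry (pop).
from collections import deque

class HybridQueues:
    def __init__(self, num_cus, local_capacity):
        self.local = [deque() for _ in range(num_cus)]
        self.capacity = local_capacity
        self.global_q = deque()  # requires globally visible synchronization

    def push(self, cu, task):
        if len(self.local[cu]) < self.capacity:
            self.local[cu].append(task)   # common case: local only
        else:
            self.global_q.append(task)    # overflow: share via global queue

    def pop(self, cu):
        if self.local[cu]:
            return self.local[cu].pop()   # common case: local only
        if self.global_q:
            return self.global_q.popleft()  # local queue empty: pull globally
        return None                         # no work left anywhere
```

A CU whose local queue drains early pulls from the global queue, which is how the benchmark mitigates the load imbalance of the unbalanced tree.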

6. RESULTS

Figures 2, 3, and 4 show our results for the applications without fine-grained synchronization, for microbenchmarks with globally scoped fine-grained synchronization, and for codes with locally scoped or hybrid synchronization, respectively. Parts (a)-(c) in each figure show execution time, energy consumed, and network traffic, respectively. Energy is divided into multiple components based on the source of energy: GPU core+,5 scratchpad, L1, L2, and network. Network traffic is measured in flit crossings and is also divided into multiple components: data reads, data registrations (writes), writebacks/writethroughs, and atomics.

[Figure 3: Execution time (a), dynamic energy (b), and network traffic (c) for G* and D*, normalized to G*, for globally scoped synchronization benchmarks (FAM_G, SLM_G, SPM_G, SPMBO_G, and their average).]

In Figures 2 and 3, we only show the GPU-D and DeNovo-D configurations because HRF does not affect these cases (there is no local synchronization) – we denote the systems as G* to indicate that GPU-D and GPU-H obtain the same results and D* to indicate that DeNovo-D and DeNovo-H obtain the same results.6 For Figure 4, we show all five configurations, denoted as GD, GH, DD, DD+RO, and DH.

5 GPU core+ includes the instruction cache, constant cache, register file, SFU, FPU, scheduler, and the core pipeline.
6 We do not show the read-only enhancement here because D* is significantly better than G*.

Overall, compared to the best GPU coherence protocol (GPU-H), we find that DeNovo-D is comparable for


applications with no fine-grained synchronization and better for microbenchmarks that employ synchronization with only global scopes (average of 28% lower execution time, 51% lower energy). For microbenchmarks with mostly locally scoped synchronization, GPU-H is better (on average 6% lower execution time and 4% lower energy) than DeNovo-D. This modest benefit of GPU-H comes at the cost of a more complex memory model – adding a read-only region enhancement to DeNovo-D removes most of this benefit, and using HRF with DeNovo makes it the best performing protocol.

6.1 GPU-D vs. GPU-H

Figure 4 shows that when locally scoped synchronization can be used, GPU-H can significantly improve performance over GPU-D, as noted in prior work [8]. On average, GPU-H decreases execution time by 46% and energy by 42% for benchmarks that use local synchronization. There are two main sources of improvement. First, the latency of locally scoped acquires is much smaller because they are performed at L1 (which reduces atomic traffic by an average of 94%). Second, local acquires do not invalidate the cache and local releases do not flush the store buffer. As a result, data can be reused across local synchronization boundaries. Since accesses hit more frequently in the L1 cache, execution time, energy, and network traffic improve. On average, the L1, L2, and network energy components decrease by 71% for GPU-H while data (non-atomic) network traffic decreases by an average of 78%.

6.2 DeNovo-D vs. GPU Coherence

6.2.1 Traditional GPU Applications

For the ten applications studied that do not use fine-grained synchronization, Figure 2 shows there is generally little difference between DeNovo* and GPU*. DeNovo* increases execution time and energy by 0.5% on average and reduces network traffic by 5% on average.

For LavaMD, DeNovo* significantly decreases network traffic because LavaMD overflows the store buffer, which prevents multiple writes to the same location from being coalesced in GPU*. As a result, each of these writes has to be written through separately to the L2. Unlike GPU*, after DeNovo* obtains ownership of a word, all subsequent writes to that word hit and do not need to use the store buffer.
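The LavaMD effect can be illustrated with a toy count of L2 write messages for repeated writes to a single word. This is a simplification with invented helpers and assumed parameters, not a measurement:

```python
# Toy traffic count for N writes to one word, illustrating why store
# buffer overflows hurt GPU* but not an ownership protocol. The
# overflow model and parameters are assumptions for illustration.
import math

def gpu_writethroughs(writes_to_word, overflow_period):
    """GPU*: writes to a word coalesce in the store buffer only between
    flushes; each store-buffer overflow writes the word through to L2."""
    return math.ceil(writes_to_word / overflow_period)

def denovo_l2_messages(writes_to_word):
    """DeNovo*: one ownership request for the word; all later writes hit
    in the L1 and bypass the store buffer."""
    return 1 if writes_to_word > 0 else 0
```

With a large enough buffer both protocols send one message for the word, but frequent overflows multiply GPU*'s writethroughs while DeNovo* still pays only the initial ownership request.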

For some other applications, obtaining ownership causes DeNovo* to slightly increase network traffic and energy. First, multiple writes to the same word may require multiple ownership requests if the word is evicted from the cache before the last write. GPU* may be able to coalesce these writes in the store buffer and incur a single writethrough to the L2. Second, DeNovo* may incur a read or registration miss for a word registered at another core, requiring an extra hop on the network compared to GPU* (which always hits in the L2). In our applications, however, these sources of overhead are minimal and do not affect performance. In general, the first source (obtaining ownership) is not on the critical path for performance and the second source (remote L1 miss) can be partly mitigated (if needed) using direct cache-to-cache transfers as enabled by DeNovo [16].

6.2.2 Global Synchronization Benchmarks

Figure 3 shows the execution time, energy, and network traffic for the four benchmarks that use only globally scoped fine-grained synchronization. For these benchmarks, HRF has no effect because there are no synchronizations with local scope.

The main difference between GPU* and DeNovo* is that DeNovo* obtains ownership for written data and global synchronization variables, which gives the following key benefits for our benchmarks with global synchronization. First, once DeNovo* obtains ownership of a synchronization variable, subsequent accesses from all thread blocks on the same CU incur hits (until another CU is granted ownership or the variable is evicted from the cache). These hits reduce average synchronization latency and network traffic for DeNovo*. Second, DeNovo* also benefits because owned data is not invalidated on an acquire, resulting in data reuse across synchronization boundaries for all thread blocks on a CU. Finally, release operations require getting ownership for dirty data instead of writing the data through to L2, resulting in less traffic.
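The first benefit can be made concrete with a toy count of ownership transfers for a single synchronization variable (an invented helper, assuming the variable is never evicted):

```python
# Toy count of traffic for atomics on one synchronization variable under
# an ownership protocol: an atomic hits in the L1 while the issuing CU
# holds ownership, and costs a transfer only when ownership migrates.
def count_sync_transfers(access_sequence):
    """access_sequence: list of CU ids issuing atomics on one variable.
    Returns (ownership transfers, local hits), assuming no evictions."""
    owner = None
    transfers = hits = 0
    for cu in access_sequence:
        if cu == owner:
            hits += 1        # all thread blocks on the owning CU hit
        else:
            transfers += 1   # registration moves ownership to this CU
            owner = cu
    return transfers, hits
```

Runs of accesses from the same CU (common when several thread blocks share a CU) turn into local hits, which is the reuse the text describes; a GPU*-style protocol would instead send every atomic to the L2.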

As discussed in Section 4.1, obtaining ownership can incur overheads relative to GPU coherence in some cases. However, for our benchmarks with global synchronization, these overheads are compensated by the reuse effects mentioned above. As a result, on average, DeNovo* reduces execution time, energy, and network traffic by 28%, 51%, and 81%, respectively, relative to GPU*.

6.2.3 Local Synchronization Benchmarks

For the microbenchmarks with mostly locally scoped synchronization, we focus on comparing DeNovo-D with GPU-H since Figure 4 shows that the latter is the best GPU protocol.

DeNovo-D and GPU-H both increase reuse and synchronization efficiency relative to GPU-D for applications that use fine-grained synchronization, but they do so in different ways. GPU-H enables data reuse across local synchronization boundaries and can perform locally scoped synchronization operations at L1. Therefore, these benefits can only be achieved if the application can explicitly define locally scoped synchronization points. In contrast, DeNovo enables reuse implicitly because owned data can be reused across any type of synchronization point. In addition, DeNovo-D obtains ownership for all synchronization operations, so even global synchronization operations can be performed locally. Like the globally scoped benchmarks, obtaining ownership for atomics also improves reuse and locality for benchmarks like TB_LG and TBEX_LG that have both global and local synchronization.

Since GPU-H does not obtain ownership, on a globally scoped release it must flush and downgrade all dirty data to the L2. As a result, if the store buffer is too small, then GPU-H may see limited coalescing of writes to the same location, as described in Section 6.2.1. TB_LG and TBEX_LG exhibit this effect.


[Figure 4: Execution time (a), dynamic energy (b), and network traffic (c) for all configurations (GD, GH, DD, DD+RO, DH), normalized to GD, for synchronization benchmarks that use mostly local synchronization (SPM_L, SPMBO_L, FAM_L, SLM_L, SS_L, SSBO_L, TBEX_LG, TB_LG, UTS, and their average).]


DeNovo-D also occasionally suffers from full store buffers for these benchmarks, but its cost for flushing is lower – each dirty cache line only needs to send an ownership request to the L2. Furthermore, once DeNovo-D obtains ownership, any additional writes will hit and do not need to use the store buffer, effectively reducing the number of flushes of a full store buffer. By obtaining ownership of the data, DeNovo-D is able to exploit more reuse. In doing so, DeNovo reduces network traffic and energy relative to GPU-H for these applications.

Conversely, DeNovo only enables reuse for owned data; i.e., there is no reuse across synchronization boundaries for read-only data. This hurts SS_L's performance and increases network traffic with locally scoped synchronization. SS_L is also hurt by the order in which the readers and writers enter the critical section: many readers enter first, so read-write data is invalidated until the writer enters and obtains ownership of it. DeNovo-D also performs slightly worse than GPU-H for UTS because DeNovo-D uses global synchronization and must frequently invalidate the cache and flush the store buffer. Although ownership mitigates many disadvantages of global synchronization, frequent invalidations and store buffer flushes limit the effectiveness of DeNovo-D.

On average, GPU-H shows 6% lower execution time and 4% lower energy than DeNovo-D, with maximum benefits of 13% and 10%, respectively. However, GPU-H's advantage comes at the cost of increased memory model complexity.

6.3 DeNovo-D with Selective (RO) Invalidations

DeNovo-D's inability to avoid invalidating read-only data is a key reason GPU-H outperforms it for the locally scoped microbenchmarks. Using the read-only region enhancement for DeNovo-D, however, removes any performance and energy benefit of GPU-H on average. In some cases, GPU-H is better, but only by up to 7% for execution time and 4% for energy. Although DeNovo-D+RO needs more program information, unlike HRF, this information is hardware agnostic.

6.4 Applying HRF to DeNovo

DeNovo-H enjoys the benefits of ownership for data accesses and globally scoped synchronization accesses as well as the benefits of locally scoped synchronization. Reuse in L1 is possible for owned data across global synchronization points and for all data across local synchronization points. Local synchronization operations are always performed locally, and global synchronization operations are performed locally once ownership is acquired for the synchronization variable.

Compared to DeNovo-D, DeNovo-H provides some additional benefits. With DeNovo-D, many synchronization accesses that would be locally scoped already occur at L1 and much data locality is already exploited through ownership. However, by explicitly defining local synchronization accesses, DeNovo-H is able to reuse read-only data and data that is read multiple times before it is written across local synchronization points. It is also able to delay obtaining ownership for both local writes and local synchronization operations. As a result, compared to DeNovo-D, DeNovo-H reduces execution time, energy, and network traffic for all applications with local scope.

Although DeNovo-D+RO allows reuse of read-only data, DeNovo-H's additional advantages described above also provide it a slight benefit over DeNovo-D+RO in a few cases.

Compared to GPU-H, DeNovo-H is able to exploit more locality because owned data can be reused across any synchronization scope and because registration of synchronization variables allows global synchronization requests to also be executed locally.

These results show that DeNovo-H is the best configuration of those studied because it combines the advantages of ownership (from DeNovo-D) and scoped synchronization (from GPU-H) to minimize synchronization overhead and maximize data reuse across all synchronization points. However, DeNovo-H significantly increases memory model complexity and does not provide significantly better results than DeNovo-D+RO, which uses a simpler memory model but has some overhead to identify the read-only data.

7. RELATED WORK

7.1 Consistency

Previous work on memory consistency models for GPUs found that the TSO and relaxed memory models did not significantly outperform SC in a system with MOESI coherence and writeback caches [31, 32]. However, that work does not measure the coherence overhead of the studied configurations or evaluate alternative coherence protocols. The DeNovo coherence protocol also has several advantages over an ownership-based MOESI protocol, as discussed in Section 2.

7.2 Coherence Protocols

There has also been significant prior work on optimizing coherence protocols for standalone GPUs or CPU-GPU systems. Table 5 compares DeNovo-D to the most closely related prior work across the key features from Table 2:

HSC [33]: Heterogeneous System Coherence (HSC) is a hierarchical, ownership-based CPU-GPU cache coherence protocol. HSC provides the same advantages as the ownership-based protocols we discussed in Section 2. By adding coarse-grained hardware regions to MOESI, HSC aggregates coherence traffic and reduces MOESI's network traffic overheads when used with GPUs. However, HSC's coarse regions restrict data layout and the types of communication that can effectively occur. Furthermore, HSC's coherence protocol is significantly more complex than DeNovo's.

Stash [21], TemporalCoherence [19], FusionCoherence [34]: Stash uses an extension of DeNovo for CPU-GPU systems, but focuses on integrating specialized, private memories like scratchpads into the unified address space. It does not provide support for fine-grained synchronization and does not draw any comparisons with conventional GPU-style coherence or comment on


Feature | Benefit | HSC [33] | Stash [21], TC [19], FC [34] | QuickRelease [3] | RemoteScopes [11] | DD
Reuse Written Data | Reuse written data across synchs | ✓ | ✗ | ✓ | ✓ | ✓
Reuse Valid Data | Reuse cached valid data across synchs | ✓ | ✗ | ✓ | ✗ | ✗
No Bursty Traffic | Avoid bursts of writes | ✓ | ✓ | ✗ | ✗ | ✓
No Invalidations/ACKs | Decreased network traffic | ✗ | ✓ | ✗ | ✗ | ✓
Decoupled Granularity | Only transfer useful data | ✗ | ✓ | ✓ (for STs) | ✓ (for STs) | ✓
Reuse Synchronization | Efficient support for fine-grained synch | ✓ | ✗ | ✓ | ✓ | ✓
Dynamic Sharing | Efficient support for work stealing | ✓ | ✓ | ✗ | ✓ | ✓

Table 5: Comparison of DeNovo to other GPU coherence schemes. The read-only region enhancement to DeNovo also allows valid data reuse for read-only data.

consistency models. By extending the cache coherence protocol used in this work to support local and global GPU synchronization operations, DeNovo-D reuses written data across synchronization points. We also explore DRF and HRF consistency models, while the stash assumes a DRF consistency model. FusionCoherence and TemporalCoherence use timestamp-based protocols that utilize self-invalidations and self-downgrades and thus provide many of the same benefits as DeNovo-D. However, this work does not consider fine-grained synchronization or its impact on consistency models.

QuickRelease [3]: QuickRelease reduces the overhead of synchronization operations in conventional GPU coherence protocols and allows data to be reused across synchronization points. However, QuickRelease requires broadcast invalidations to ensure that no stale data can be accessed. Additionally, QuickRelease does not have efficient support for algorithms with dynamic sharing.

RemoteScopes [11]: RemoteScopes improves on QuickRelease by providing better support for algorithms with dynamic sharing. In the common case, dynamically shared data synchronizes with a local scope, and when data is shared, RemoteScopes "promotes" the scope of the synchronization access to a larger common scope to synchronize properly. Although RemoteScopes improves performance for applications with dynamic sharing, because it does not obtain ownership it must use heavyweight hardware mechanisms to ensure that no stale data is accessed. For example, RemoteScopes flushes the entire cache on acquires and uses broadcast invalidations and acknowledgments to ensure data is flushed.

Overall, while each of the coherence protocols in previous work provides some of the same benefits as DeNovo-D, none of them provides all of the benefits of DeNovo-D. Furthermore, none of the above work explores the impact of ownership on consistency models.

8. CONCLUSION
GPGPU applications with more general sharing patterns and fine-grained synchronization have recently emerged. Unfortunately, conventional GPU coherence protocols do not provide efficient support for them. Past work has proposed HRF, which uses scoped synchronization, to address the inefficiencies of conventional GPU coherence protocols. In this work, we choose instead to resolve these inefficiencies by extending DeNovo's software-driven, hardware coherence protocol to GPUs. DeNovo is a hybrid coherence protocol that provides the best features of ownership-based and GPU-style coherence protocols. As a result, DeNovo provides efficient support for applications with fine-grained synchronization. Furthermore, the DeNovo coherence protocol enables a simple SC-for-DRF memory consistency model. Unlike HRF, SC-for-DRF does not expose the memory hierarchy or require programmers to carefully annotate all synchronization accesses with scope information.

Across 10 CPU-GPU applications, which do not use fine-grained synchronization or dynamic sharing, DeNovo provides comparable performance to a conventional GPU coherence protocol. For applications that utilize fine-grained, globally scoped synchronization, DeNovo significantly outperforms a conventional GPU coherence protocol. For applications that utilize fine-grained, locally scoped synchronization, GPU coherence with HRF modestly outperforms the baseline DeNovo protocol, but at the cost of a more complex consistency model. Augmenting DeNovo with selective invalidations for read-only regions allows it to obtain the same average performance and energy as GPU coherence with HRF. Furthermore, if HRF's complexity is deemed acceptable, then a (modest) variant of DeNovo under HRF provides better performance and energy than conventional GPU coherence with HRF. These findings show that DeNovo with DRF provides a sweet spot for performance, energy, hardware overhead, and memory model complexity – HRF's complexity is not needed for efficient fine-grained synchronization on GPUs. Moving forward, we will analyze full-sized applications with fine-grained synchronization, once they become available, to ensure that DeNovo with DRF provides similar benefits for them. We will also examine what additional benefits we can obtain by applying additional optimizations, such as direct cache-to-cache transfers [16], to DeNovo with DRF.

Acknowledgments
This work was supported in part by a Qualcomm Innovation Fellowship for Sinclair; the Center for Future Architectures Research (C-FAR), one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA; and the National Science Foundation under grants CCF-1018796 and CCF-1302641. We would also like to thank Brad Beckmann and Mark Hill for their insightful comments that improved the quality of this paper.


9. REFERENCES
[1] “HSA Platform System Architecture Specification.” http://www.hsafoundation.com/?ddownload=4944, 2015.

[2] IntelPR, “Intel Discloses Newest Microarchitecture and 14 Nanometer Manufacturing Process Technical Details,” Intel Newsroom, 2014.

[3] B. Hechtman, S. Che, D. Hower, Y. Tian, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, “QuickRelease: A Throughput-Oriented Approach to Release Consistency on GPUs,” in IEEE 20th International Symposium on High Performance Computer Architecture, 2014.

[4] T. Sorensen, J. Alglave, G. Gopalakrishnan, and V. Grover, “ICS: U: Towards Shared Memory Consistency Models for GPUs,” in International Conference on Supercomputing, 2013.

[5] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson, “GPU Concurrency: Weak Behaviours and Programming Assumptions,” in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.

[6] J. A. Stuart and J. D. Owens, “Efficient Synchronization Primitives for GPUs,” CoRR, vol. abs/1110.4623, 2011.

[7] M. Burtscher, R. Nasre, and K. Pingali, “A Quantitative Study of Irregular Programs on GPUs,” in IEEE International Symposium on Workload Characterization, 2012.

[8] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, “Heterogeneous-Race-Free Memory Models,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.

[9] J. Y. Kim and C. Batten, “Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists,” in 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.

[10] S. Che, B. Beckmann, S. Reinhardt, and K. Skadron, “Pannotia: Understanding Irregular GPGPU Graph Applications,” in IEEE International Symposium on Workload Characterization, 2013.

[11] M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood, “Synchronization Using Remote-Scope Promotion,” in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.

[12] B. R. Gaster, D. Hower, and L. Howes, “HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models,” ACM Transactions on Architecture and Code Optimization, vol. 12, April 2015.

[13] L. Howes and A. Munshi, “The OpenCL Specification, Version 2.0.” Khronos Group, 2015.

[14] S. Adve and M. Hill, “Weak Ordering – A New Definition,” in Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990.

[15] S. V. Adve and H.-J. Boehm, “Memory Models: A Case for Rethinking Parallel Languages and Hardware,” Communications of the ACM, pp. 90–101, August 2010.

[16] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. Adve, V. Adve, N. Carter, and C.-T. Chou, “DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism,” in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, 2011.

[17] H. Sung, R. Komuravelli, and S. V. Adve, “DeNovoND: Efficient Hardware Support for Disciplined Non-determinism,” in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 13–26, 2013.

[18] H. Sung and S. V. Adve, “DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations,” in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.

[19] I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt, “Cache Coherence for GPU Architectures,” in 19th International Symposium on High Performance Computer Architecture, 2013.

[20] NVIDIA, “CUDA SDK 3.1.” http://developer.nvidia.com/object/cuda_3_1_downloads.html.

[21] R. Komuravelli, M. D. Sinclair, J. Alsop, M. Huzaifa, P. Srivastava, M. Kotsifakou, S. V. Adve, and V. S. Adve, “Stash: Have Your Scratchpad and Cache It Too,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 707–719, 2015.

[22] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset,” SIGARCH Computer Architecture News, 2005.

[23] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[24] N. Agarwal, T. Krishna, L.-S. Peh, and N. Jha, “GARNET: A Detailed On-chip Network Model Inside a Full-system Simulator,” in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[25] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: Enabling Energy Optimizations in GPGPUs,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[26] S. Li, J.-H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” in 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.

[27] H.-J. Boehm and B. Demsky, “Outlawing Ghosts: Avoiding Out-of-thin-air Results,” in Proceedings of the Workshop on Memory Systems Performance and Correctness, 2014.

[28] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in IEEE International Symposium on Workload Characterization, 2009.

[29] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W. Hwu, “Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing,” tech. rep., Department of ECE and CS, University of Illinois at Urbana-Champaign, 2012.

[30] S. Che, J. Sheaffer, M. Boyer, L. Szafaryn, L. Wang, and K. Skadron, “A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads,” in IEEE International Symposium on Workload Characterization, 2010.

[31] B. Hechtman and D. Sorin, “Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips,” in IEEE International Symposium on Performance Analysis of Systems and Software, 2013.

[32] B. A. Hechtman and D. J. Sorin, “Exploring Memory Consistency for Massively-threaded Throughput-oriented Processors,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[33] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, “Heterogeneous System Coherence for Integrated CPU-GPU Systems,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[34] S. Kumar, A. Shriraman, and N. Vedula, “Fusion: Design Tradeoffs in Coherence Cache Hierarchies for Accelerators,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.