
Page 1: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu

ASPLOS 2010 -- Pittsburgh, 3/17/2010

Page 2: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


1. Introduction: Heterogeneous Computing


Heterogeneous parallel systems:
  • CPU: sequential, control-intensive code
  • Accelerators: massively data-parallel code

Existing programming models are DMA-based:
  • Explicit memory copies
  • Programmer-managed memory coherence

[Diagram: CPU and accelerator with separate memories; IN and OUT data are explicitly copied between them]

Page 3: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Outline

1. Introduction
2. Motivation
3. ADSM: Asymmetric Distributed Shared Memory
4. GMAC: Global Memory for ACcelerators
5. Experimental Results
6. Conclusions

Page 4: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.1 Motivation: Reference System

[Diagram: reference system. A multi-core CPU (N cores) with system RAM is connected over a PCIe bus to a GPU-like accelerator with its own device RAM. System memory: low latency, strong consistency, small page size. Device memory: high bandwidth, weak consistency, large page size.]

Page 5: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.2 Motivation: Memory Requirements

  • High memory bandwidth requirements
  • Non fully-coherent systems: long-latency coherence traffic, different coherence protocols
  • Accelerator memory is always growing (e.g. 6GB NVIDIA Fermi, 16GB PowerXCell 8i)

Page 6: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.3 Motivation: DMA-Based Programming


• Duplicated Pointers

• Explicit Coherence Management

• CUDA Sample Code

[Diagram: the same logical object foo is duplicated as separate copies in CPU and GPU memories]

void compute(FILE *file, int size)
{
    float *foo, *dev_foo;

    /* Host copy of the data */
    foo = malloc(size);
    fread(foo, size, 1, file);

    /* Device copy, explicitly allocated and kept coherent by the programmer */
    cudaMalloc((void **)&dev_foo, size);
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);

    kernel<<<Dg, Db>>>(dev_foo, size);

    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);

    cudaFree(dev_foo);
    free(foo);
}

Page 7: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.1 ADSM: Unified Virtual Address Space

  • Unified virtual shared address space
  • CPU: accesses both system and accelerator memory
  • Accelerator: accesses only its own memory
  • Under ADSM, both use the same virtual address when referencing a shared object

[Diagram: bar and baz live in system memory only; the shared data object foo appears at the same virtual address in both system and device memory]

Page 8: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.2 ADSM: Simplified Code

  • Simpler CPU code than in DMA-based programming models
  • Hardware-independent code
  • Callouts on the code below: single pointer, data assignment, peer DMA, legacy support

void compute(FILE *file, int size)
{
    float *foo;

    foo = adsmMalloc(size);            /* single pointer, valid on CPU and accelerator */
    fread(foo, size, 1, file);         /* data assignment: legacy I/O fills the shared object */

    kernel<<<Dg, Db>>>(foo, size);     /* same pointer passed to the accelerator */

    cpuComputation(foo);               /* legacy CPU code reads the results directly */
    adsmFree(foo);
}

[Diagram: a single object foo, visible to both CPU and GPU]

Page 9: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.3 ADSM: Memory Distribution

  • Asymmetric Distributed Shared Memory principles:
      • CPU accesses objects in accelerator memory, but not vice versa
      • All coherence actions are performed by the CPU
  • Thrashing is unlikely to happen:
      • Synchronization variables: interrupt-based and dedicated hardware
      • False sharing: data objects are the sharing granularity

Page 10: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.4 ADSM: Consistency and Coherence

  • Release consistency:
      • Consistency is only relevant from the CPU perspective
      • Implicit release/acquire at accelerator call/return (sketched below)
  • Memory coherence:
      • Data ownership information enables eager data transfers
      • CPU maintains coherence

[Diagram: ownership of foo is released to the accelerator at the call and re-acquired by the CPU at the return]
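Since release and acquire are implicit at the accelerator call and return, a runtime can wrap every launch in the equivalent of the following. This is only a hedged illustration of the idea: gmac_release_objects() and gmac_acquire_objects() are hypothetical names for the work the runtime performs, not GMAC calls.

/* Hypothetical sketch of implicit release consistency at kernel boundaries.
 * gmac_release_objects() and gmac_acquire_objects() are invented names that
 * stand for the runtime work the slide implies; they are not real GMAC calls. */
void launch_with_implicit_sync(float *foo, int size)
{
    gmac_release_objects();          /* release: flush dirty shared blocks to accelerator memory */
    kernel<<<Dg, Db>>>(foo, size);   /* accelerator call: the accelerator now owns the shared objects */
    cudaThreadSynchronize();         /* wait for the accelerator to finish */
    gmac_acquire_objects();          /* acquire: the CPU regains ownership of the shared objects */
}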

Page 11: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4. Global Memory for Accelerators

• ADSM implementation

• User-level shared library

• GNU / Linux Systems

• NVIDIA CUDA GPUs


Page 12: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.1 GMAC: Overall Design

  • Layered design (an illustrative sketch follows the layer list):
      • Multiple memory consistency protocols
      • Operating-system- and accelerator-independent code

  • Layers, top to bottom:
      • CUDA-like front-end
      • Memory manager (different policies)
      • Kernel scheduler (FIFO)
      • Operating system abstraction layer
      • Accelerator abstraction layer (CUDA)
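A minimal sketch of how such a layering could be expressed in C, assuming a table of function pointers per layer; every type and field name below is illustrative, not the actual GMAC interface.

/* Illustrative sketch of a layered runtime; all names here are assumptions. */
#include <stddef.h>

typedef struct os_layer {            /* OS abstraction layer: paging, signals, threads */
    int (*protect)(void *addr, size_t len, int prot);
} os_layer_t;

typedef struct acc_layer {           /* accelerator abstraction layer: wraps CUDA */
    int (*alloc)(void **dev_ptr, size_t len);
    int (*copy_to_acc)(void *dst, const void *src, size_t len);
    int (*copy_to_host)(void *dst, const void *src, size_t len);
} acc_layer_t;

typedef struct protocol {            /* memory manager: one pluggable coherence policy */
    int (*on_kernel_call)(void);     /* e.g. flush dirty blocks before the launch */
    int (*on_kernel_return)(void);   /* e.g. invalidate or lazily fetch results */
    int (*on_cpu_fault)(void *addr); /* e.g. fetch a block the CPU just touched */
} protocol_t;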

Page 13: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.2 GMAC: Unified Address Space

  • Virtual address space formed by GPU and system physical memories
  • The GPU memory address range cannot be selected
  • Allocate the same virtual address range in both GPU and CPU (a sketch of one possible approach follows)
  • Accelerator virtual memory would ease this process

[Diagram: the GPU physical address space is mapped into the system virtual address space]
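A hedged sketch of how a user-level library might obtain the same range on both sides: allocate the accelerator memory first (its address cannot be chosen) and then hint the host mapping at that address, failing if the host range is taken. The shared_alloc name and the overall scheme are assumptions, not the GMAC implementation.

/* Sketch: allocate accelerator memory first (its address cannot be chosen),
 * then ask the OS for host pages at that same virtual address.  This is an
 * assumption about one possible user-level implementation, not GMAC's code. */
#include <sys/mman.h>
#include <cuda_runtime.h>

void *shared_alloc(size_t size)
{
    void *dev_ptr = NULL;
    if (cudaMalloc(&dev_ptr, size) != cudaSuccess)
        return NULL;

    /* Hint the host mapping at the device address and check we really got it. */
    void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host_ptr != dev_ptr) {                  /* the range is already in use on the host */
        if (host_ptr != MAP_FAILED)
            munmap(host_ptr, size);
        cudaFree(dev_ptr);
        return NULL;
    }
    return host_ptr;                            /* same address valid on CPU and GPU */
}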

Page 14: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.3 GMAC: Coherence Protocols

  • Batch-Update: copy all shared objects
  • Lazy-Update: copy only modified / needed shared objects
      • Data object granularity
      • Detect CPU read/write accesses to shared objects (see the sketch after this list)
  • Rolling-Update: copy only modified / needed memory
      • Memory block size granularity
      • Fixed maximum number of modified blocks in system memory; flush data when the maximum is reached
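A common user-level way to detect CPU accesses to shared objects is to protect their pages and catch the resulting faults. The following sketch illustrates that technique under the assumption of 4 KiB pages; it is not the actual GMAC code.

/* Sketch: detect CPU writes to shared objects with page protection and a
 * SIGSEGV handler.  Hypothetical illustration of the technique only. */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL                 /* assuming 4 KiB pages */

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    void *page = (void *)((uintptr_t)info->si_addr & ~(PAGE_SIZE - 1));
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* let the faulting write proceed */
    /* ...here the runtime would mark the containing block as Dirty... */
    (void)sig; (void)ctx;
}

static void install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}

static void track_object(void *obj, size_t size)
{
    /* obj must be page-aligned (e.g. obtained from mmap).  The first CPU
     * write after this call faults, and the handler records the access. */
    mprotect(obj, size, PROT_READ);
}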

Page 15: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.1 Results: GMAC vs. CUDA

  • Batch-Update overheads:
      – Copy output data on call
      – Copy non-used data
  • Similar performance for CUDA, Lazy-Update and Rolling-Update

Page 16: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.2 Results: Lazy vs. Rolling on 3D Stencil

  • Extra data copy for small data objects
  • Trade-off between bandwidth and page fault overhead

Page 17: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


6. Conclusions

  • A unified virtual shared address space simplifies programming of heterogeneous systems
  • Asymmetric Distributed Shared Memory:
      • CPU accesses accelerator memory, but not vice versa
      • Coherence actions are executed only by the CPU
  • Experimental results show no performance degradation
  • Memory translation in accelerators is key to implementing ADSM efficiently

Page 18: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Thank you for your attention

Eager to start using GMAC? http://code.google.com/p/adsm/

[email protected]@googlegroups.com


Page 19: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Backup Slides


Page 20: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.4 GMAC: Memory Mapping

  • Allocating the same virtual address range might fail if the range is already in use on the host
  • Software workaround: allocate a different address range and provide a translation function, gmacSafePtr() (a sketch follows)
  • Hardware solution: implement virtual memory in the GPU

[Diagram: system virtual address space vs. GPU physical address space]
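A hedged sketch of what the software translation could look like. Only the name gmacSafePtr() comes from the slide; the lookup table and the function body are illustrative assumptions.

/* Sketch: software translation when host and device ranges differ.
 * Only the name gmacSafePtr appears on the slide; everything else here
 * is an illustrative assumption. */
#include <stddef.h>

typedef struct mapping {
    void   *host_addr;    /* address the CPU code uses */
    void   *dev_addr;     /* address of the corresponding GPU buffer */
    size_t  size;
} mapping_t;

static mapping_t mappings[64];
static int       n_mappings;

void *gmacSafePtr(void *host_ptr)
{
    for (int i = 0; i < n_mappings; i++) {
        char *base = (char *)mappings[i].host_addr;
        if ((char *)host_ptr >= base && (char *)host_ptr < base + mappings[i].size)
            return (char *)mappings[i].dev_addr + ((char *)host_ptr - base);
    }
    return host_ptr;       /* not a shared object: pass through unchanged */
}

Under this scheme, kernels would be launched with the translated pointer, e.g. kernel<<<Dg, Db>>>(gmacSafePtr(foo), size).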

Page 21: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.5 GMAC: Protocol States

• Protocol States: Invalid, Read-only, Dirty

[Diagram: state machines. Batch-Update uses a two-state machine (Invalid, Dirty) driven by Call / Return. Lazy-Update and Rolling-Update use the three states Invalid, Read-Only and Dirty, with transitions on Read, Write, Call, Return and Flush.]

  • Batch-Update events: Call / Return
  • Lazy-Update events: Call / Return, Read / Write
  • Rolling-Update events: Call / Return, Read / Write, Flush (a transition sketch follows)
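A minimal sketch of the three states and one plausible per-block transition function for a Lazy/Rolling-style protocol. The states and events come from the slide; the exact transitions are assumptions, not the definitive GMAC protocol.

/* Sketch: protocol states and one plausible per-block transition function.
 * The states and events come from the slide; the transitions shown are an
 * illustrative assumption. */
typedef enum { STATE_INVALID, STATE_READ_ONLY, STATE_DIRTY } block_state_t;
typedef enum { EV_READ, EV_WRITE, EV_CALL, EV_RETURN, EV_FLUSH } event_t;

block_state_t next_state(block_state_t s, event_t ev)
{
    switch (ev) {
    case EV_READ:   return (s == STATE_INVALID) ? STATE_READ_ONLY : s;  /* fetch from device */
    case EV_WRITE:  return STATE_DIRTY;                                 /* CPU modified the block */
    case EV_CALL:   return STATE_INVALID;   /* release: dirty data copied to the device */
    case EV_RETURN: return STATE_INVALID;   /* results stay on the device until the CPU touches them */
    case EV_FLUSH:  return STATE_READ_ONLY; /* rolling: dirty block written back, kept readable */
    }
    return s;
}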

Page 22: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.6 GMAC: Rolling vs. Lazy

  • Batch-Update: transfer on kernel call
  • Rolling-Update: transfer while the CPU computes (sketched below)
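A hedged sketch of the overlap that Rolling-Update implies, written with plain CUDA streams. produce_block(), produce_next_block() and the buffer arguments are hypothetical; only the CUDA calls themselves are real API.

/* Sketch: Rolling-Update eagerly pushes modified blocks while the CPU keeps
 * computing, and only synchronizes at the kernel call.  The producers and
 * buffers are hypothetical; host_block should be page-locked (cudaMallocHost)
 * for the copy to be truly asynchronous. */
extern void produce_block(float *block);        /* hypothetical CPU producer */
extern void produce_next_block(void);           /* hypothetical follow-up CPU work */

void rolling_transfer(float *host_block, float *dev_block, size_t block_size)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    produce_block(host_block);                       /* CPU fills one block              */
    cudaMemcpyAsync(dev_block, host_block, block_size,
                    cudaMemcpyHostToDevice, stream); /* push it to the GPU eagerly       */
    produce_next_block();                            /* CPU keeps computing meanwhile    */

    cudaStreamSynchronize(stream);                   /* wait for the copy only at launch */
    kernel<<<Dg, Db, 0, stream>>>(dev_block, (int)block_size);
    cudaStreamDestroy(stream);
}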

Page 23: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.3 Results: Break-down of Execution


Page 24: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.4 Results: Rolling Size vs. Block Size

  • No appreciable effect on most benchmarks
  • A small rolling size leads to performance aberrations
  • Prefer relatively large rolling sizes

Page 25: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


6.1 Conclusions: Wish-list

  • GPU anonymous memory mappings:
      • GPU-to-CPU mappings never fail
      • Dynamic memory re-allocations
  • GPU dynamic pinned memory:
      • No intermediate data copies on flush
  • Peer DMA:
      • Speed up I/O operations
      • No intermediate copies on GPU-to-GPU transfers