
Extending Unified Parallel C for GPU Computing

Yili Zheng, Costin Iancu, Paul Hargrove, Seung-Jai Min, Katherine Yelick

Lawrence Berkeley National Lab


GPU Cluster with Hybrid Memory

[Figure: two compute nodes connected by a network; each node has multiple CPUs sharing host (CPU) memory and multiple GPUs, each with its own GPU memory.]


Current Programming Model for GPU Clusters

MPI + CUDA/OpenCL

[Figure: several nodes, each with CPUs and a GPU; MPI connects the CPUs across nodes, while CUDA/OpenCL drives the GPU within each node.]


PGAS Programming Model for Hybrid Multi-Core Systems

[Figure: the same two-node cluster, with a single PGAS address space spanning the CPU and GPU memories of both nodes across the network.]


PGAS Example: Global Matrix Distribution

[Figure: Global Matrix View vs. Distributed Matrix Storage: the 16 blocks of a 4×4 block matrix in their global layout, and the same blocks regrouped into the per-thread storage of four threads.]


PGAS Example: Fast Fourier Transform

[Figure: Global Matrix View vs. Distributed Matrix Storage for a 4×4 matrix split row-wise across GPU I-IV; the transpose redistributes the data among the GPUs.]

shared [4] A[4][4];

1D_FFT(A);     // row FFTs
A' = A^T;      // global transpose
1D_FFT(A');    // column FFTs
A = A'^T;      // transpose back
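A slightly fuller UPC rendering of the same row-FFT / transpose / column-FFT pattern is sketched below. It assumes THREADS == 4 (one row per thread), fft_1d_row() is a hypothetical local helper, and the element-wise upc_memcpy transpose is for illustration only, not an optimized all-to-all.

    #include <upc.h>
    #include <complex.h>

    #define N 4
    shared [N] double complex A[N][N];   /* row i has affinity to thread i */
    shared [N] double complex B[N][N];   /* holds the transpose */

    void fft_2d(void)
    {
        int t;
        fft_1d_row((double complex *)&A[MYTHREAD][0]);   /* hypothetical: FFT of the local row */
        upc_barrier;
        for (t = 0; t < THREADS; t++)                    /* gather column MYTHREAD of A into local row of B */
            upc_memcpy(&B[MYTHREAD][t], &A[t][MYTHREAD], sizeof(double complex));
        upc_barrier;
        fft_1d_row((double complex *)&B[MYTHREAD][0]);   /* column FFTs, stored as rows of B */
        upc_barrier;
    }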


Hybrid Partitioned Global Address Space

[Figure: four UPC threads, each with a private segment in host memory and a private segment in GPU memory; threads 1 and 2 have their shared segment in host memory, while threads 3 and 4 have theirs in GPU memory.]

• Each thread has only one shared segment, which can be either in host memory or in GPU memory, but not both.

• The memory model is decoupled from the execution model, so different execution models are supported.

• Backward compatible with existing UPC and CUDA/OpenCL programs.


Execution Models

• Synchronous model

• Virtual GPU model

• Hybrid model


[Figure: CPU/GPU configurations illustrating the synchronous, virtual GPU, and hybrid execution models.]


UPC Overview

• PGAS dialect of ISO C99

• Distributed shared arrays

• Dynamic shared-memory allocation

• One-sided shared-memory communication

• Synchronization: barriers, locks, memory fences

• Collective communication library

• Parallel I/O library
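As a quick illustration of these features, here is a minimal UPC fragment (array size and contents are arbitrary) using a distributed shared array, the upc_forall work-sharing loop, and a barrier:

    #include <upc.h>
    #include <stdio.h>

    #define N 1024
    shared int a[N];                          /* distributed cyclically across all threads */

    int main(void)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i]) /* each thread updates the elements it owns */
            a[i] = i;
        upc_barrier;
        if (MYTHREAD == 0)
            printf("a[N-1] = %d\n", a[N-1]);  /* one-sided read of (likely remote) shared data */
        return 0;
    }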


Hybrid PGAS Example

[Figure: the four-thread layout from the previous slide, showing where each allocation below is placed.]

Standard C:
    int *p;
    p = malloc(4);          // private host memory; only the allocating thread can use *p

UPC:
    shared int *sp;
    sp = upc_alloc(4);      // shared segment in host memory; any thread can access *sp

UPC with GPU shared heap:
    shared int *sp;
    sp = upc_alloc(4);      // same call, but this thread's shared segment lives in GPU memory

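A small self-contained sketch of the contrast (values and thread numbering chosen only for illustration): the malloc'ed integer is private to its thread, while the upc_alloc'ed integer can be written by another thread through the shared pointer.

    #include <upc.h>
    #include <stdlib.h>

    shared [] int * shared sp;            /* one pointer-to-shared, itself shared */

    int main(void)
    {
        int *p = malloc(sizeof(int));     /* private: only this thread can dereference p */
        *p = MYTHREAD;

        if (MYTHREAD == 0)
            sp = upc_alloc(sizeof(int));  /* shared object with affinity to thread 0 */
        upc_barrier;

        if (MYTHREAD == 1)
            *sp = 42;                     /* remote write through the pointer-to-shared */
        upc_barrier;

        free(p);
        return 0;
    }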


Data Transfer Example

Goal: copy nbytes of data from src on GPU 1 to dst on GPU 2.

• MPI + CUDA

Process 1:
    send_buffer = malloc(nbytes);
    cudaMemcpy(send_buffer, src);
    MPI_Send(send_buffer);
    free(send_buffer);

Process 2:
    recv_buffer = malloc(nbytes);
    MPI_Recv(recv_buffer);
    cudaMemcpy(dst, recv_buffer);
    free(recv_buffer);

• UPC with GPU extensions

Thread 1:
    upc_memcpy(dst, src, nbytes);

Thread 2:
    // no operation required

Advantages of PGAS on GPU clusters:
• No explicit buffer management by the user
• Facilitates end-to-end optimizations such as data-transfer pipelining
• One-sided communication maps well to DMA transfers for GPU devices
• Concise code

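One payoff of letting the runtime own the whole transfer is pipelining. The sketch below shows the general shape of what a runtime-side sender could do for a GPU-to-remote put: stage the next chunk from the GPU while the previous chunk is on the wire. The chunk size, the double-buffered staging arrangement, and the use of a blocking gasnet_put are illustrative assumptions, not the actual Berkeley UPC code.

    #include <gasnet.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   /* staging chunk size: an arbitrary choice for the sketch */

    /* Copy nbytes from GPU memory (gpu_src) to remote_dst on dest_node,
     * overlapping the GPU->host staging of chunk i+1 with the put of chunk i.
     * staging[0] and staging[1] are pinned host buffers of at least CHUNK bytes. */
    void pipelined_gpu_put(gasnet_node_t dest_node, void *remote_dst,
                           const void *gpu_src, size_t nbytes,
                           void *staging[2], cudaStream_t stream)
    {
        size_t off = 0, len = (nbytes < CHUNK) ? nbytes : CHUNK;
        int cur = 0;

        cudaMemcpyAsync(staging[cur], gpu_src, len, cudaMemcpyDeviceToHost, stream);
        while (off < nbytes) {
            size_t next_off = off + len, next_len = 0;
            cudaStreamSynchronize(stream);            /* current chunk is now in host memory */
            if (next_off < nbytes) {                  /* start staging the next chunk */
                next_len = (nbytes - next_off < CHUNK) ? (nbytes - next_off) : CHUNK;
                cudaMemcpyAsync(staging[1 - cur], (const char *)gpu_src + next_off,
                                next_len, cudaMemcpyDeviceToHost, stream);
            }
            gasnet_put(dest_node, (char *)remote_dst + off, staging[cur], len);
            off = next_off;
            len = next_len;
            cur = 1 - cur;
        }
    }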


PGAS GPU Code Example

• Thread 1

    // use GPU memory for the shared segment
    bupc_attach_gpu(gpu_id);

    shared [] int * shared sp;

    // shared memory is allocated on the GPU
    sp = upc_alloc(sizeof(int));
    *sp = 4;              // write to GPU memory
    upc_barrier;

• Thread 2

    shared [] int * shared sp;

    upc_barrier;
    // read from remote GPU memory
    printf("%d", *sp);


Berkeley UPC Software Stack

[Figure: the software stack, top to bottom: UPC Applications → UPC-to-C Translator → Translated C code with Runtime Calls → UPC Runtime → GASNet Communication Library → Network Drivers and OS Libraries.]


Translation and Call Graph Example

    shared [] int * shared sp;
    *sp = a;

        |   (UPC-to-C Translator)
        v
    UPCR_PUT_PSHARED_VAL(sp, a);

        |   (UPC Runtime: is *sp on the GPU?)
        v
    No:  gasnet_put(sp, a);           (GASNet)
    Yes: gasnet_put_to_gpu(sp, a);    (GASNet + CUDA)

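A schematic of that dispatch in C: gasnet_put_to_gpu is the extension named on the slide, while the wrapper name, the boolean flag, and the argument layout here are just for illustration.

    #include <gasnet.h>

    /* Illustrative only: the real runtime derives the owning node, the raw
     * address, and the "is it on the GPU?" answer from the pointer-to-shared. */
    static void put_shared_value(gasnet_node_t node, void *remote_addr,
                                 void *local_src, size_t nbytes, int remote_is_gpu)
    {
        if (remote_is_gpu)
            gasnet_put_to_gpu(node, remote_addr, local_src, nbytes);  /* GASNet + CUDA path */
        else
            gasnet_put(node, remote_addr, local_src, nbytes);         /* ordinary GASNet put */
    }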


Active Messages

• Active messages = Data + Action

• Key enabling technology for both one-sided and two-sided communication
  – Software implementation of Put/Get
  – Eager and rendezvous protocols

• Remote procedure calls
  – Facilitate "owner computes"
  – Execute asynchronous tasks

[Figure: node A sends a Request to node B, where a request handler runs; B sends a Reply back to A, where a reply handler runs.]

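A minimal GASNet-style request/reply sketch of the data + action idea. The handler indices and payload handling are illustrative, and handler registration (gasnet_attach) is omitted; the AM calls themselves follow the standard GASNet API.

    #include <gasnet.h>

    #define REQ_IDX 201   /* handler table indices chosen for illustration */
    #define REP_IDX 202

    /* Runs on B when the request arrives: consume the data, then reply. */
    void req_handler(gasnet_token_t token, void *buf, size_t nbytes)
    {
        /* ... act on buf[0 .. nbytes) here: the "action" part of the message ... */
        gasnet_AMReplyShort0(token, REP_IDX);
    }

    /* Runs back on A when the reply arrives: e.g. mark the operation complete. */
    void rep_handler(gasnet_token_t token)
    {
        /* ... signal completion ... */
    }

    /* On A: ship nbytes starting at buf to node B and run req_handler there. */
    void send_request(gasnet_node_t B, void *buf, size_t nbytes)
    {
        gasnet_AMRequestMedium0(B, REQ_IDX, buf, nbytes);
    }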


GASNet Extensions for GPU One-sided Communication

• How to transfer data to/from remote GPU device memory?
  – Active Messages (AM)
  – CUDA operations must run outside the AM handler context because they may block
  – Solution: an asynchronous GPU task queue

• How do we know when the data transfer is done?
  – Send an ACK message after the operation completes on the GPU device
  – Solution: GPU task queue polling and callback support

[Figure: Host1 and Host2, each attached to a GPU; a CUDA memcpy on Host1, an Active Message Request from Host1 to Host2, a CUDA memcpy between Host2 and GPU2, and an Active Message Reply back to Host1.]

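A minimal sketch of the asynchronous GPU task queue idea: the AM handler only records the work, and the host-side progress loop issues the CUDA copy, tests for completion, and then fires a callback (for example, to send the ACK). The task structure, its states, and the callback are assumptions; the CUDA calls are standard.

    #include <cuda_runtime.h>

    typedef void (*gpu_task_cb)(void *arg);     /* completion action, e.g. send the ACK */

    typedef struct {
        void *dst, *src;
        size_t nbytes;
        enum cudaMemcpyKind kind;
        cudaEvent_t done;
        gpu_task_cb cb;
        void *cb_arg;
        enum { TASK_QUEUED, TASK_IN_FLIGHT, TASK_DONE } state;
    } gpu_task_t;

    /* Called from the AM handler: record the work only, never touch CUDA here. */
    void gpu_task_enqueue(gpu_task_t *t, void *dst, void *src, size_t nbytes,
                          enum cudaMemcpyKind kind, gpu_task_cb cb, void *cb_arg)
    {
        t->dst = dst;  t->src = src;  t->nbytes = nbytes;  t->kind = kind;
        t->cb = cb;    t->cb_arg = cb_arg;
        t->state = TASK_QUEUED;
    }

    /* Called repeatedly from the runtime's host-side progress loop. */
    void gpu_task_progress(gpu_task_t *t, cudaStream_t stream)
    {
        if (t->state == TASK_QUEUED) {                        /* issue the copy asynchronously */
            cudaMemcpyAsync(t->dst, t->src, t->nbytes, t->kind, stream);
            cudaEventCreateWithFlags(&t->done, cudaEventDisableTiming);
            cudaEventRecord(t->done, stream);
            t->state = TASK_IN_FLIGHT;
        } else if (t->state == TASK_IN_FLIGHT &&
                   cudaEventQuery(t->done) == cudaSuccess) {  /* copy finished on the GPU */
            cudaEventDestroy(t->done);
            t->state = TASK_DONE;
            t->cb(t->cb_arg);                                 /* e.g. issue the ACK Active Message */
        }
    }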


Implementation

• UPC-to-C translator
  – No change needed because the UPC runtime API is unchanged
  – UPC code and CUDA code are compiled separately, and the object files are linked together with the libraries

• UPC runtime extensions
  – Shared-heap management for GPU device memory
  – Accesses to shared data on the GPU (via pointers-to-shared)
  – Interoperability of UPC and GPU (CUDA) code

• GASNet extensions
  – Put and Get operations for GPUs
  – An asynchronous GPU task queue for running GPU operations outside the AM handler context


Summary

• Runtime extensions for enabling PGAS on GPU clusters
  – Unified API for data management and communication
  – High-level expression of data movement, enabling end-to-end optimizations
  – Compatible with different execution models and with existing GPU applications

• Reusable modular components in the implementation
  – Task queue for asynchronous task execution
  – Communication protocols for heterogeneous processors
  – Portable to other GPU SDKs, e.g., OpenCL; platform-specific (CUDA) code is limited and encapsulated

• Work in progress


Thank You!

• MS 31

UPC at Scale

1:20 PM - 3:20 PM

Room: Leonesa II

• MS 52

Getting Multicore Performance with UPC

1:20 PM - 3:20 PM

Room: Eliza Anderson Amphitheater
