Upload
others
View
23
Download
0
Embed Size (px)
Citation preview
Extending Unified Parallel C for GPU Computing
Yili Zheng, Costin Iancu, Paul Hargrove, Seung-Jai Min, Katherine Yelick
Lawrence Berkeley National Lab
GPU Cluster with Hybrid Memory
Computer Node
CPU Memory
GPU
GPUMemory
CPU CPU
GPU
GPUMemory
Computer Node
CPU Memory
GPU
GPUMemory
CPU CPU
GPU
GPUMemory
Network
2/24/2010 2SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Current Programming Model for GPU Clusters
MPI + CUDA/OpenCL
CPU
CPU
GPU
CPU
CPU
GPU
CPU
CPU
GPU
MPI CUDA/OpenCL
2/24/2010 3SIAM PP10 -- Extending Unified Parallel C for GPU Computing
PGAS Programming Model for Hybrid Multi-Core Systems
Computer Node
CPU Memory
GPU
GPUMemory
CPU CPU
GPU
GPUMemory
Computer Node
CPU Memory
GPU
GPUMemory
CPU CPU
GPU
GPUMemory
Network
PGAS
2/24/2010 4SIAM PP10 -- Extending Unified Parallel C for GPU Computing
PGAS Example: Global Matrix Distribution
Global Matrix View Distributed Matrix Storage
1
3
2
4
9
11
10
12
5
7
6
8
13
15
14
16
1
9
5
13
3
11
7
15
2
10
6
14
4
12
8
16
2/24/2010 5SIAM PP10 -- Extending Unified Parallel C for GPU Computing
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
PGAS Example: Fast Fourier Transform
GPU I
Global Matrix View Distributed Matrix Storage
Shared [4] A[4][4];
1D_FFT(A); // row fft
A’=AT;
1D_FFT(A’); // col fft
A= A’T
GPU II
GPU III
GPU IV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
2/24/2010 6SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Hybrid Partitioned Global Address Space
Private Segment on Host Memory
Thread 1
Shared Segment on Host Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 2
Shared Segment on Host Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 3
Shared Segment on GPU Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 4
Shared Segment on GPU Memory
Private Segment on GPU Memory
Each thread has only one shared segment, which can be either in host memory or in GPU memory, but not both.
Decouple the memory model from execution models; therefore it supports various execution models.
Backward compatible with current UPC and CUDA/OpenCLprograms.
2/24/2010 7SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Execution Models
• Synchronous model
• Virtual GPU model
• Hybrid model
2/24/2010 8
CPU CPU CPU CPU
GPUGPU
CPU CPU CPU CPU
CPU CPU CPU CPU GPU GPU
GPU GPU GPU GPU
CPU CPU CPU CPU
CPU CPU CPU CPU
SIAM PP10 -- Extending Unified Parallel C for GPU Computing
UPC Overview
• PGAS dialect of ISO C99
• Distributed shared arrays
• Dynamic shared-memory allocation
• One-sided shared-memory communication
• Synchronization: barriers, locks, memory fences
• Collective communication library
• Parallel I/O library2/24/2010 9SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Hybrid PGAS Example
Private Segment on Host Memory
Thread 1
Shared Segment on Host Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 2
Shared Segment on Host Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 3
Shared Segment on GPU Memory
Private Segment on GPU Memory
Private Segment on Host Memory
Thread 4
Shared Segment on GPU Memory
Private Segment on GPU Memory
p = malloc(4) *p
sp = upc_alloc(4)*sp*sp
Standard C
UPC with GPU shared heap
UPC sp = upc_alloc(4) *sp*sp
*sp
*sp
*p *p
int *p
shared int *sp
2/24/2010 10SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Data Transfer Example• MPI+CUDA
• UPC with GPU extensions
Process 1:send_buffer = malloc(nbytes);cudaMemcpy(send_buffer, src); MPI_Send(send_buffer);free(send_buffer);
Process 2:recv_buffer = malloc(nbytes);
MPI_Recv(recv_buffer);cudaMemcpy(dst, recv_buffer);free(recv_buffer);
Thread 1:upc_memcpy(dst, src, nbytes);
Advantages of PGAS on GPU clusters Don’t need explicit buffer management by the user. Facilitate end-to-end optimizations such as data transfer pipelining. One-sided communications map well to DMA transfers for GPU devices. Concise code
copy nbytes data from src on GPU 1 to dst on GPU 2
Thread 2:// no operation required
2/24/2010 11SIAM PP10 -- Extending Unified Parallel C for GPU Computing
PGAS GPU Code Example
• Thread 1 • Thread 2
// use GPU memory for shared segmentbupc_attach_gpu(gpu_id);
shared [] int * shared sp;
// shared memory is allocatd on GPU sp = upc_alloc(sizeof(int)); *sp = 4; // write to GPU memoryupc_barrier;
shared [] int * shared sp;
upc_barrier;// read from remote GPU memoryprintf(“%d”, *sp);
2/24/2010 12SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Berkeley UPC Software Stack
UPC-to-C Translator
UPC Applications
UPC Runtime
GASNet Communication Library
Network Drivers and OS Libraries
Translated C code with Runtime Calls
2/24/2010 13SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Translation and Call Graph Example
shared [] int * shared sp;*sp = a;
UPC-to-C Translator
UPCR_PUT_PSHARED_VAL(sp, a);
gasnet_put(sp, a); gasnet_put_to_gpu(sp, a);
UPC Runtime
GASNet GASNet+CUDA
Is *sp on GPU?
No Yes
2/24/2010 14SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Active Messages
• Active messages = Data + Action
• Key enabling technology for both one-sided and two-sided communications– Software implementation of Put/Get
– Eager and Rendezvous protocols
• Remote Procedural Calls– Facilitate “owner-computes”
– Execute asynchronous tasks
Request
Reply
A B
Request handler
Reply handler
2/24/2010 15SIAM PP10 -- Extending Unified Parallel C for GPU Computing
GASNet Extensions for GPUOne-sided Communication
• How to transfer to/from remote GPU device memory?– Active Messages (AM)– Need to execute CUDA
operations outside of AM handler context because they may block
– Solution: asynchronous GPU task queue
• How to know when the data transfer is done?– Send an ACK message after
the GPU op is done on the GPU device
– Solution: GPU task queue polling and callback support
Host1 Host2
GPU2GPU1
Active Message Request
Active Message
ReplyCUDA
MemcpyCUDA
Memcpy
2/24/2010 16SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Implementation
• UPC-to-C translator– No change because the UPC runtime API is intact.– Compile UPC code and CUDA code separately and then link
the object files with libs together.
• UPC runtime extensions– Shared-heap management for GPU device memory– Accesses to shared data on GPU (via pointer-to-shared)– Interoperability of UPC and GPU (CUDA)
• GASNet extensions– Put and Get operations for GPU– Asynchronous GPU task queue for running GPU operations
outside of AM handler context
2/24/2010 17SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Summary
• Runtime extensions for enabling PGAS on GPU clusters– Unified API for data management and communication – High-level expressions of data movement enabling end-to-
end optimizations– Compatible with different execution models and existing
GPU applications
• Reusable modular components in the implementation– Task queue for asynchronous task execution– Communication protocols for heterogeneous processors – Portable to other GPU SDK, e.g., OpenCL. Platform (CUDA)
specific codes are limited and encapsulated.
• Work in progress
2/24/2010 18SIAM PP10 -- Extending Unified Parallel C for GPU Computing
Thank You!
• MS 31
UPC at Scale
1:20 PM - 3:20 PM
Room: Leonesa II
• MS 52
Getting Multicore Performance with UPC
1:20 PM - 3:20 PM
Room: Eliza Anderson Amphitheater
2/24/2010 19SIAM PP10 -- Extending Unified Parallel C for GPU Computing