
GMAC Global Memory for Accelerators

Wen-mei W. Hwu, Isaac Gelado and Javier Cabezas

GMAC in a nutshell

• GMAC: Unified Virtual Address Space for OpenCL

– Simplifies the CPU code

– Exploits advanced OpenCL features for free

– Transparent memory consistency management

• Vector addition example – Really simple kernel code

– But, what about the CPU code?

__kernel void vecAdd(__global float *c, __global float *a, __global float *b)
{
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

AMD Fusion Summit 2011, 6/15/11

CPU OpenCL code (I)

• Set-up OpenCL

#include <assert.h>
#include <stdlib.h>
#include <CL/cl.h>

/* kernel_source, vecSize, read_file() and save_file() are assumed to be defined elsewhere. */

int main(int argc, char *argv[])
{
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue command_queue;
    cl_program program;
    cl_kernel kernel;
    cl_int error_code;
    float *a, *b, *c;
    cl_mem d_a, d_b, d_c;

    /* Start setting up OpenCL */
    error_code = clGetPlatformIDs(1, &platform, NULL);
    assert(error_code == CL_SUCCESS);
    error_code = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    assert(error_code == CL_SUCCESS);
    context = clCreateContext(0, 1, &device, NULL, NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    command_queue = clCreateCommandQueue(context, device, 0, &error_code);
    assert(error_code == CL_SUCCESS);
    program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    error_code = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    assert(error_code == CL_SUCCESS);
    kernel = clCreateKernel(program, "vecAdd", &error_code);
    assert(error_code == CL_SUCCESS);


CPU OpenCL code (II)

• Allocate memory and initialize data

    /* Alloc & init input data */
    assert((a = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_a = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    read_file("vector_A.data", a, vecSize);

    assert((b = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    read_file("vector_B.data", b, vecSize);

    /* Alloc output data */
    assert((c = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_c = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);

    /* Copy data to the device */
    assert(clEnqueueWriteBuffer(command_queue, d_a, CL_FALSE, 0, vecSize * sizeof(float), a, 0, NULL, NULL) == CL_SUCCESS);
    assert(clEnqueueWriteBuffer(command_queue, d_b, CL_FALSE, 0, vecSize * sizeof(float), b, 0, NULL, NULL) == CL_SUCCESS);


CPU OpenCL code (III)

• Call the kernel and save the output

    /* Set kernel arguments */
    assert(clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_c) == CL_SUCCESS);
    assert(clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_a) == CL_SUCCESS);
    assert(clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_b) == CL_SUCCESS);

    /* Call the kernel */
    size_t global_size = vecSize;
    assert(clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL) == CL_SUCCESS);
    assert(clFinish(command_queue) == CL_SUCCESS);

    /* Get the results back; the read is blocking so that c is complete before it is saved */
    assert(clEnqueueReadBuffer(command_queue, d_c, CL_TRUE, 0, vecSize * sizeof(float), c, 0, NULL, NULL) == CL_SUCCESS);
    save_file("vector_C.data", c, vecSize);

    /* Release memory */
    clReleaseMemObject(d_c); free(c);
    clReleaseMemObject(d_b); free(b);
    clReleaseMemObject(d_a); free(a);

    /* Release the remaining OpenCL objects */
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    return 0;
}


GMAC code sample


#include <assert.h>
#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *a, *b, *c;

    assert(eclCompileSource(kernel_source) == eclSuccess);

    /* Alloc & init input data */
    assert(eclMalloc((void **)&a, vecSize * sizeof(float)) == eclSuccess);
    read_file("vector_A.data", a, vecSize);
    assert(eclMalloc((void **)&b, vecSize * sizeof(float)) == eclSuccess);
    read_file("vector_B.data", b, vecSize);

    /* Alloc output data */
    assert(eclMalloc((void **)&c, vecSize * sizeof(float)) == eclSuccess);

    /* Call the kernel; the same pointers are used on the CPU and on the GPU */
    ecl_kernel kernel;
    size_t globalSize = vecSize;
    assert(eclGetKernel("vecAdd", &kernel) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 0, c) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 1, a) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 2, b) == eclSuccess);
    assert(eclCallNDRange(kernel, 1, NULL, &globalSize, NULL) == eclSuccess);

    save_file("vector_C.data", c, vecSize);

    eclFree(a);
    eclFree(b);
    eclFree(c);
    return 0;
}

GMAC Supported Platforms

• Any OpenCL 1.1 compatible stack, with optimizations for:

– AMD Fusion devices

– AMD Radeon HD

– NVIDIA Tesla

• Windows 7 (64 and 32 bits)

• GNU/Linux (64 and 32 bits)


Outline

• Introduction

• GMAC Memory Model

– Asymmetric Memory

– Global Memory

• Performance Evaluation

• Conclusions


GMAC Memory Model

• Unified CPU / GPU virtual address space

• Asymmetric address space accessibility

[Figure: CPU, memory and GPU. Shared data is accessed by the CPU and the GPU via the same pointer; a separate region is labeled "CPU Data".]


GMAC Implementation

• Fusion APU

• AMD Radeon HD


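[Figure: two block diagrams. One shows a CPU and a GPU attached to a single physical memory; the other shows CPUs and GPUs with separate physical memories, linked by a "Coherence" label.]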

GMAC Consistency Model

• Implicit acquire / release primitives at accelerator call / return boundaries


[Figure: CPU and GPU]

GMAC Coherence

• Avoid unnecessary data copies

• Lazy-update:

– Call: transfer modified data

– Return: transfer when needed


[Figure: CPU, system memory, accelerator and accelerator memory]
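The lazy-update policy above can be illustrated with a small host-only model. This is a sketch under stated assumptions, not GMAC's implementation: the names `shared_obj`, `cpu_write`, `cpu_read` and `call_kernel` are hypothetical, `memcpy` stands in for a host/accelerator transfer, and the model assumes every kernel may modify the object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4

/* State of the host copy relative to the accelerator copy. */
enum copy_state { IN_SYNC, HOST_DIRTY, HOST_INVALID };

/* Hypothetical model of one shared object; not part of the GMAC API. */
struct shared_obj {
    float host[N];          /* copy in system memory */
    float device[N];        /* stand-in for accelerator memory */
    enum copy_state state;
    int to_device;          /* host-to-accelerator transfers performed */
    int to_host;            /* accelerator-to-host transfers performed */
};

/* Fetch the accelerator copy only if the host copy is stale. */
static void make_host_valid(struct shared_obj *o)
{
    if (o->state == HOST_INVALID) {
        memcpy(o->host, o->device, sizeof o->host);
        o->to_host++;
        o->state = IN_SYNC;
    }
}

static float cpu_read(struct shared_obj *o, size_t i)
{
    assert(i < N);
    make_host_valid(o);
    return o->host[i];
}

static void cpu_write(struct shared_obj *o, size_t i, float v)
{
    assert(i < N);
    make_host_valid(o);
    o->host[i] = v;
    o->state = HOST_DIRTY;
}

/* Call: transfer only if the CPU modified the data.
 * Return: nothing is copied back; the host copy is just marked stale. */
static void call_kernel(struct shared_obj *o, void (*kernel)(float *, size_t))
{
    if (o->state == HOST_DIRTY) {
        memcpy(o->device, o->host, sizeof o->device);
        o->to_device++;
    }
    kernel(o->device, N);
    o->state = HOST_INVALID;
}

/* Example "kernel" that runs on the accelerator copy. */
static void scale_by_two(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}
```

In this model, two back-to-back kernel calls after a single CPU write cost one transfer to the accelerator and none back; the copy to the host happens only when the CPU next touches the data.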

GMAC Memory API

• Allocate shared memory

eclError_t eclMalloc(void **ptr, size_t size)

– ptr: allocated memory address (returned by reference)

– size: size of the data to be allocated

– Returns an error code, eclSuccess if no error

• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if((error = eclMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
}


GMAC Memory API

• Release shared memory

eclError_t eclFree(void *ptr)

– ptr: memory address to be released


• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if((error = eclMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
    eclFree(foo);
}


GMAC Built-in Optimizations

• Functions overridden (interposition) by GMAC:

– Standard C Library memory functions: memset(), memcpy()

– Standard C Library and POSIX I/O: fread(), fwrite(), read(), write()

– MPI: MPI_Send(), MPI_Recv()

• Get advanced OpenCL features for free

– Asynchronous highly optimized data transfers

– Pre-pinned memory

[Figure: calls to fread(); data transfers wait for kernel completion]
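To illustrate why overriding fread() helps, here is a minimal sketch of a block-wise read into a shared object. It is not GMAC's code: `shared_fread`, `copy_to_accelerator` and `BLOCK_SIZE` are hypothetical names, and `memcpy` stands in for the asynchronous transfer a real runtime would enqueue so that the copy of one block can overlap the read of the next.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed staging-block size, not a GMAC constant */

/* Stand-in for a host-to-accelerator copy. A real runtime would enqueue an
 * asynchronous transfer from pre-pinned memory here. */
static void copy_to_accelerator(unsigned char *dev, const unsigned char *src, size_t n)
{
    memcpy(dev, src, n);
}

/* Hypothetical replacement for fread() when the destination is a shared
 * object: the file is read block by block into a staging buffer and each
 * block is forwarded as soon as it is available.
 * Returns the number of bytes read; *blocks receives the transfer count. */
static size_t shared_fread(unsigned char *dev, size_t size, FILE *fp, int *blocks)
{
    unsigned char staging[BLOCK_SIZE];
    size_t done = 0;

    *blocks = 0;
    while (done < size) {
        size_t want = size - done < BLOCK_SIZE ? size - done : BLOCK_SIZE;
        size_t got = fread(staging, 1, want, fp);
        if (got == 0)
            break;                       /* end of file or read error */
        copy_to_accelerator(dev + done, staging, got);
        done += got;
        (*blocks)++;
    }
    return done;
}
```

The application keeps calling what looks like an ordinary read; the block-wise staging is the part an interposed fread() can hide.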



GMAC Global Memory

• For multi-GPU systems: data accessible by all accelerators, but owned by the CPU

• Example: the medium matrix in FDTD (finite-difference time-domain) simulations

[Figure: CPU, memory and two GPUs]


GMAC Global Memory

• Read-only data structures

– Zero-copy memory if read only once by the GPU

– Replicated data if read often by the GPU

• GMAC Global memory:

– Pre-pinned zero-copy in AMD Fusion

– Discrete GPU (e.g. Radeon HD):

• Replicated data copies if enough GPU memory

• Pre-pinned zero-copy otherwise
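The placement rules above amount to a small decision function. The sketch below only restates the bullets in code; `device_info`, `place_global` and the field names are hypothetical and do not come from the GMAC sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a global (read-only, CPU-owned) object ends up. */
enum placement { ZERO_COPY_PINNED, REPLICATED };

/* Hypothetical description of one accelerator. */
struct device_info {
    bool shares_physical_memory;   /* true for an APU such as AMD Fusion */
    size_t free_bytes;             /* free accelerator memory (discrete GPU) */
};

/* Placement policy as listed on the slide:
 *  - fused CPU/GPU: pre-pinned zero-copy memory
 *  - discrete GPU:  replicate if the object fits, zero-copy otherwise */
static enum placement place_global(const struct device_info *dev, size_t size)
{
    assert(dev != NULL);
    if (dev->shares_physical_memory)
        return ZERO_COPY_PINNED;
    return size <= dev->free_bytes ? REPLICATED : ZERO_COPY_PINNED;
}
```

With eclGlobalMalloc() the application does not make this choice itself; the sketch only shows the kind of decision the runtime is described as taking.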


GMAC Global memory API

• Allocate global shared memory

eclError_t eclGlobalMalloc(void **ptr, size_t size)

– ptr: allocated memory address (returned by reference)

– size: size of the data to be allocated


• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if((error = eclGlobalMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
}




GMAC Performance

• Vector Addition: worst case scenario


[Figure: execution time (seconds) and speed-up versus vector size. Series: SpeedUp, OpenCL, GMAC.]

GMAC Performance

• Sobel filtering on video stream

• OpenCL:

– 2.5 ms per frame

– 192 lines of code

• GMAC:

– 1.5 ms per frame

– 91 lines of code

• Both OpenCL and GMAC are faster than a CPU implementation
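For readers unfamiliar with the benchmark, the following is a plain C reference of a Sobel edge filter on one 8-bit grayscale frame. It is not the OpenCL or GMAC code measured above; the function name, the |Gx| + |Gy| magnitude approximation and the zeroed border are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Sobel filter on an 8-bit grayscale image stored row by row.
 * The gradient magnitude is approximated by |Gx| + |Gy| and clamped to 255.
 * Border pixels, which lack a full 3x3 neighborhood, are set to 0. */
static void sobel(const unsigned char *in, unsigned char *out, int width, int height)
{
    assert(in != NULL && out != NULL && width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                out[y * width + x] = 0;
                continue;
            }
            const unsigned char *p = in + y * width + x;
            /* Horizontal gradient: right column minus left column. */
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            /* Vertical gradient: bottom row minus top row. */
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];
            int mag = abs(gx) + abs(gy);
            out[y * width + x] = (unsigned char)(mag > 255 ? 255 : mag);
        }
    }
}
```

Each output pixel depends only on its 3x3 input neighborhood, so every pixel can be computed independently, which is what makes the filter a natural fit for a GPU kernel.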


GMAC Hands-on

• Sobel Filtering Example

• Bullet Particle Collision Demo

– OpenCL

– GMAC




Conclusions

• Single virtual address space for CPUs and GPUs

• Use OpenCL advanced features

– Automatic overlap of data communication and computation

– Get access to any GPU from any CPU thread

• Get more performance from your application more easily



http://www.multicorewareinc.com

Backup Slides

Rolling Update Data Transfers

• Overlap CPU execution and data transfers

• Minimal on-demand transfers

• Rolling-update:

– Memory-block size granularity


[Figure: CPU, system memory, accelerator and accelerator memory]
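One way to picture block-granularity tracking is the host-only model below. It is a sketch, not GMAC's implementation: it assumes a fixed cap on the number of dirty blocks and flushes the oldest dirty block when the cap is exceeded, so transfers can start while the CPU is still producing data. All names and constants are hypothetical, and `memcpy` stands in for an asynchronous transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 1024   /* assumed block size */
#define NUM_BLOCKS  8
#define MAX_DIRTY   2      /* assumed cap on simultaneously dirty blocks */

/* Hypothetical model of one shared object split into fixed-size blocks. */
struct rolling_obj {
    unsigned char host[NUM_BLOCKS * BLOCK_BYTES];
    unsigned char device[NUM_BLOCKS * BLOCK_BYTES];  /* stand-in for GPU memory */
    int dirty_fifo[MAX_DIRTY];   /* indices of dirty blocks, oldest first */
    int num_dirty;
    int transfers;               /* block transfers performed so far */
};

/* Copy one block to the accelerator copy. */
static void flush_block(struct rolling_obj *o, int b)
{
    size_t off = (size_t)b * BLOCK_BYTES;
    memcpy(o->device + off, o->host + off, BLOCK_BYTES);
    o->transfers++;
}

/* CPU store: mark the block dirty; if too many blocks are dirty,
 * transfer the oldest one eagerly. */
static void cpu_store(struct rolling_obj *o, size_t addr, unsigned char v)
{
    assert(addr < sizeof o->host);
    int b = (int)(addr / BLOCK_BYTES);

    o->host[addr] = v;
    for (int i = 0; i < o->num_dirty; i++)
        if (o->dirty_fifo[i] == b)
            return;                       /* block is already tracked */
    if (o->num_dirty == MAX_DIRTY) {
        flush_block(o, o->dirty_fifo[0]);
        memmove(o->dirty_fifo, o->dirty_fifo + 1,
                (MAX_DIRTY - 1) * sizeof o->dirty_fifo[0]);
        o->num_dirty--;
    }
    o->dirty_fifo[o->num_dirty++] = b;
}

/* At a kernel call only the blocks that are still dirty remain to be sent. */
static void before_kernel_call(struct rolling_obj *o)
{
    for (int i = 0; i < o->num_dirty; i++)
        flush_block(o, o->dirty_fifo[i]);
    o->num_dirty = 0;
}
```

Blocks the CPU never wrote are never transferred, and at most `MAX_DIRTY` blocks are left to send when the kernel is launched.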

GMAC and Multi-threading

• In the past, one host thread had one CPU

• In GMAC, each host thread has:

– One CPU

– One GPU

• A GMAC thread runs either on the GPU or on the CPU, but not on both at the same time

• Create threads using what you already know

– pthread_create(...)


GMAC and Multi-threading

• Virtual memory accessibility:

– Complete address space in CPU mode

– Partial address space in GPU mode

[Figure: two CPUs, two GPUs and GPU memory]




