17
The Open Standard for Parallel Programming of Heterogeneous systems James Xu

The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Embed Size (px)

Citation preview

Page 1: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

The Open Standard for Parallel Programming of Heterogeneous systems

James Xu

Page 2: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

IntroductionParallel Applications Becoming common

placeGPGPUMATLABQuad Cores

Page 3: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

ChallengesVendor specific APIsCPU – GPGPU Programming gap

Page 4: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCLOpen Computing LangauageIntroduces uniformity“Close-to-silicon”Parallel Computing using all possible

resources on end systemInitially by AppleKhronos group, OpenGL, OpenALMajor Vendor support

Page 5: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL OverviewAll computational resources on an end

system seen as peersCPU, GPU, ARM, DSPs etcStrict IEEE 754 Floating Point specification.

Fixed rounding, errorDefines architecture models and software

stack

Page 6: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture Model – Platform

Page 7: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture – Execution ModelKernel – Smallest unit of execution, like a C

functionHost program – A collection of kernelsWork item, an instance of kernel at run timeWork group, a collection of work items

Page 8: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture – Execution Model

Page 9: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture – Memory Model

Page 10: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture – Programming ModelData Parallel, work group consist of instances

of same kernel (work items)Different data elements are fed into the work

items in the groupTask Parallel, work group consist of a single

work item (instance of kernel)Work group can run independentlyEach compute device sees a number of work

groups in parallel, thus task parallel

Page 11: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

Architecture – Programming ModelOnly CPUs are expected to have task parallel

mechanismsData parallel model must be present on all

OpenCL compatible devices

Page 12: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL RuntimeLanguage derived from ISO C99 (C

Language)Restrictions:

No recursionno function points

All standard data types, including vectorsOpenGL extension

Page 13: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL Software Stack

Shows the steps to develop an OpenCL program

Page 14: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL Example in C

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,

__local float *sMemx, __local float *sMemy) {

int blockIdx = get_group_id(0) * 1024 + tid;float2 data[16];in = in + blockIdx; out = out + blockIdx;

globalLoads(data, in, 64);

FFT Example using GPU

Page 15: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL Example in CfftRadix16Pass(data);twiddleFactorMul(data, tid, 1024, 0);localShuffle(data, sMemx, sMemy, tid,(((tid&15)*65) + (tid >> 4)));fftRadix16Pass(data);twiddleFactorMul(data, tid, 64, 4);localShuffle(data, sMemx, sMemy, tid,(((tid>>4)*64) + (tid & 15)));fftRadix4Pass(data);fftRadix4Pass(data + 4);fftRadix4Pass(data + 8);fftRadix4Pass(data + 12);

globalStores(data, out, 64);

}

Page 16: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL Example in Ccontext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);queue = clCreateWorkQueue(context, NULL, NULL, 0);

memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY |CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,sizeof(float)*2*num_entries, NULL);

program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);clBuildProgramExecutable(program, false, NULL, NULL);kernel = clCreateKernel(program, "fft1D_1024");

global_work_size[0] = n;local_work_size[0] = 64;range = clCreateNDRangeContainer(context, 0, 1, global_work_size,local_work_size);

Page 17: The Open Standard for Parallel Programming of Heterogeneous systems James Xu

OpenCL Example in CclSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);