A Brief Introduction to OpenCL • Reference - Programming Massively Parallel
Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11
• What is OpenCL?
– It is a standardized, cross-platform, parallel-computing API
– It is designed to be portable & work with heterogeneous systems
– Unlike earlier models, such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing
– Having a more general standard means that it is also more complex; not all devices may support all features and it may be necessary to write adaptable code
– In this brief introduction we will look at the data parallelism model and briefly see its application to the molecular visualization problem
Data Parallelism Model • There is a direct correspondence with CUDA
• Host programs are used to launch kernels on OpenCL devices
• The index space maps data to the work items
• Work items are grouped into work groups, like blocks in CUDA
• Work items in the same group can be synchronized
• The next slide shows a 2D NDRange (index space); it is very similar to the CUDA model (except that the work-group indices are in the expected order!)
Parallel Execution Model
Getting global and local values • Thread IDs and Sizes
– The API calls and equivalent CUDA code are shown below for dimension 0 (the x dimension)
– If the parameter is 1, it corresponds to the y dimension, and 2 for the z dimension
Device Architecture • The CPU is a traditional computer that executes the host program
• Here is an OpenCL device
– It contains one or more compute units (CU)
– Each CU contains one or more processing elements (PE)
– There are a variety of memory types
Memory Characteristics • Global memory – dynamically allocated by host, has
read/write access by both host and devices
• Constant memory – dynamically allocated by host (unlike CUDA), read/write by host and read-only by device; a query returns the maximum size supported by the device
• Local memory – most closely corresponds to CUDA shared memory, can be dynamically allocated by the host and statically allocated by the device; cannot be accessed by the host (same as CUDA) but can be accessed by all workers in the work group
• Private memory – corresponds to CUDA local memory
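In OpenCL C these memory types appear as address-space qualifiers on kernel parameters and variables; a minimal (non-runnable) sketch, with the parameter names invented for illustration:

```
/* Address-space qualifiers as they appear in OpenCL kernel code: */
__kernel void example(__global float *out,       /* global: host-allocated, R/W by host and device */
                      __constant float *coeff,   /* constant: host-allocated, read-only on device  */
                      __local float *scratch)    /* local: shared by all work items in the group   */
{
    float tmp;                                   /* private: one copy per work item                */
    /* ... */
}
```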
Kernel Functions • Similarities with CUDA
– __kernel corresponds to __global__ in CUDA
– A vector add kernel is shown below; two input vectors a and b and one output vector result
– All three vectors reside in global memory, the two inputs are const
– This is a 1D problem so get_global_id(0) is used to get the thread index
– The addition is performed as expected
Device Management & Kernel Launch • Now for the “ugly” side of OpenCL
– CUDA, which deals with uniform devices from a single manufacturer, can hide the details of device setup and kernel launch
– This is not possible in OpenCL, which is designed for many widely varied devices from many manufacturers
• An OpenCL context
– Use clCreateContext()
– Use clGetDeviceIDs() to find all devices
– Create a command queue for each device
– A sequence of function calls inserts the kernel code, with its execution parameters, into the command queue
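The seven-line setup sequence that the next two slides walk through can be sketched as below (reconstructed from the line-by-line description; the identifier names clerr, clctx, parmsz, cldevs, and clcmdq follow the slides, and the numbered comments match the slides' line references). This needs an OpenCL SDK to compile and a device to run:

```
#include <CL/cl.h>
#include <stdlib.h>

cl_int clerr = CL_SUCCESS;                               /* 1: error code, set to success      */
cl_context clctx = clCreateContextFromType(0,            /* 2: create context over all         */
        CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);         /*    device types; last arg
                                                               receives the error code         */
size_t parmsz;                                           /* 3: size of the device-list buffer  */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 4: query size only; param 4 is     */
        0, NULL, &parmsz);                               /*    NULL, size returned in parmsz   */
cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);  /* 5: allocate the buffer             */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 6: fetch the device list           */
        parmsz, cldevs, NULL);
cl_command_queue clcmdq = clCreateCommandQueue(clctx,    /* 7: command queue for the first     */
        cldevs[0], 0, &clerr);                           /*    device in the returned list     */
```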
A “Simple” Example - 1 • Line by Line
– Set error code to success
– Call clCreateContextFromType()
• Include all device types (param 2)
• The last argument receives the error code
– Line 3 declares parmsz, the size of the memory buffer
– Line 4 is the first call to clGetContextInfo
• clctx from line 2 is the first param
• Param 4 is NULL since the size is not known
• There will be another call in line 6 where the missing information is supplied
A “Simple” Example - 2 • Line by Line
– Line 5 uses malloc to allocate a parmsz-byte buffer and assigns its address to cldevs
– Call clGetContextInfo again in line 6
• The third param is set to parmsz
• The fourth param is set to cldevs
• The error code is returned
– Line 7 creates a command queue for the first device
• cldevs is treated as an array and the 2nd param is cldevs[0]
• This generates a command queue for the first device in the list returned by clGetContextInfo
Electrostatic Potential Map in OpenCL • Step 1: design the organization of NDRange
– Threads are now work items; blocks are work groups
– Each work item calculates up to eight grid points
– Each work group has 64 to 256 work items
Mapping DCS NDRange to OpenCL Device • The structure is the same; only the nomenclature is changed
Changes in Data Access Indexing • The changes are relatively minor
– __global__ becomes __kernel
– Accesses to the .x and .y fields, and the associated arithmetic, are replaced by calls specifying dimensions 0 and 1
The Inner Loop of the DCS kernel • The OpenCL code is shown
– The logic is basically the same
– rsqrtf() has been changed to native_rsqrt()
Building an OpenCL kernel – Line 1 – declares entire DCS kernel as a string
– Line 3 – delivers source code string to the OpenCL run time system
– Line 4 – sets up the compiler flags
– Line 5 – invokes the runtime compiler to build program
– Line 6 – obtains a handle to the kernel that can be submitted to a command queue
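A reconstructed sketch of that sequence (the identifier names clenergysrc, clpgm, and clkern are assumptions; the kernel name "clenergy" follows the DCS example; the numbered comments match the slide's line references). This needs an OpenCL SDK:

```
#include <CL/cl.h>
#include <stdio.h>

const char *clenergysrc =                               /* 1: entire DCS kernel as a string  */
    "__kernel void clenergy( /* ...kernel body... */ ) { }";
cl_program clpgm = clCreateProgramWithSource(clctx, 1,  /* 3: deliver source to the OpenCL   */
        &clenergysrc, NULL, &clerr);                    /*    run time system                */
char clcompileflags[4096];
sprintf(clcompileflags, "-cl-mad-enable");              /* 4: set up the compiler flags      */
clerr = clBuildProgram(clpgm, 0, NULL,                  /* 5: invoke the runtime compiler    */
        clcompileflags, NULL, NULL);
cl_kernel clkern = clCreateKernel(clpgm,                /* 6: handle for the command queue   */
        "clenergy", &clerr);
```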
Host Code for the kernel Launch - 1
Host Code for the kernel Launch - 2 – Lines 1 & 2 : allocate memory for energy grid and atoms
– Lines 3 – 6 : set up arguments to be passed to the kernel
– Line 8 : submits the DCS kernel for launch
– Lines 9-10 : check for errors, if any
– Line 11 : transfers result data in energy array back to host memory
– Lines 12-13 : release memory
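A reconstructed sketch of that launch sequence (buffer and size names such as devenergy, volmemsz, Gsz, and Bsz are assumptions; the numbered comments match the slide's line references). This needs an OpenCL SDK:

```
#include <CL/cl.h>

cl_mem devenergy = clCreateBuffer(clctx, CL_MEM_READ_WRITE,   /* 1: energy grid buffer     */
        volmemsz, NULL, NULL);
cl_mem devatominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,  /* 2: atom data buffer       */
        atommemsz, NULL, NULL);
clerr = clSetKernelArg(clkern, 0, sizeof(int), &numatoms);    /* 3-6: kernel arguments     */
clerr = clSetKernelArg(clkern, 1, sizeof(float), &gridspacing);
clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &devatominfo);
clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &devenergy);
cl_event ev;
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL,       /* 8: submit DCS kernel      */
        Gsz, Bsz, 0, NULL, &ev);
clerr = clWaitForEvents(1, &ev);                              /* 9-10: check for errors    */
if (clerr != CL_SUCCESS) { /* report and bail out */ }
clEnqueueReadBuffer(clcmdq, devenergy, CL_TRUE, 0, volmemsz,  /* 11: copy energy results   */
        energy, 0, NULL, NULL);                               /*     back to host memory   */
clReleaseMemObject(devenergy);                                /* 12-13: release memory     */
clReleaseMemObject(devatominfo);
```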