A Brief Introduction to OpenCL • Reference - Programming Massively Parallel
Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11
• What is OpenCL?
– It is a standardized, cross-platform, parallel-computing API
– It is designed to be portable & work with heterogeneous systems
– Unlike earlier models, such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing
– Having a more general standard means that it is also more complex; not all devices may support all features and it may be necessary to write adaptable code
– In this brief introduction we will look at the data parallelism model and briefly see its application to the molecular visualization problem
Data Parallelism Model • There is a direct correspondence with CUDA
• Host programs are used to launch kernels on OpenCL devices
• The index space maps data to the work items
• Work items are grouped into work groups, like blocks in CUDA
• Work items in the same group can be synchronized
• The next slide shows a 2D NDRange (index space); it is very similar to the CUDA model (except that the work-group indices are in the expected order!)
Parallel Execution Model
Getting global and local values • Thread IDs and Sizes
– The API calls and equivalent CUDA code are shown below for dimension 0 (the x dimension)
– If the parameter is 1, it corresponds to the y dimension, and 2 for the z dimension
Device Architecture • The CPU is a traditional computer that executes the host program
• Here is an OpenCL device
– It contains one or more compute units (CU)
– Each CU contains one or more processing elements (PE)
– There are a variety of memory types
Memory Characteristics • Global memory – dynamically allocated by host, has
read/write access by both host and devices
• Constant memory – dynamically allocated by host (unlike CUDA), read/write by host and read-only by device; a query returns the maximum size supported by the device
• Local memory – most closely corresponds to CUDA shared memory, can be dynamically allocated by the host and statically allocated by the device; cannot be accessed by the host (same as CUDA) but can be accessed by all workers in the work group
• Private memory – corresponds to CUDA local memory
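In OpenCL C these memory types appear as address-space qualifiers on kernel parameters and variables; a minimal (non-runnable) sketch, with the parameter names invented for illustration:

```
/* Address-space qualifiers as they appear in OpenCL kernel code: */
__kernel void example(__global float *out,       /* global: host-allocated, R/W by host and device */
                      __constant float *coeff,   /* constant: host-allocated, read-only on device  */
                      __local float *scratch)    /* local: shared by all work items in the group   */
{
    float tmp;                                   /* private: one copy per work item                */
    /* ... */
}
```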
Kernel Functions • Similarities with CUDA
– __kernel corresponds to __global__ in CUDA
– A vector add kernel is shown below; two input vectors a and b and one output vector result
– All three vectors reside in global memory, the two inputs are const
– This is a 1D problem so get_global_id(0) is used to get the thread index
– The addition is performed as expected
Device Management & Kernel Launch • Now for the “ugly” side of OpenCL
– CUDA, which deals with uniform devices from a single manufacturer, can hide the details of device setup and kernel launch
– This is not possible in OpenCL, which is designed for many widely varied devices from many manufacturers
• An OpenCL context
– Use clCreateContext()
– Use clGetDeviceIDs() to find all devices
– Create a command queue for each device
– A sequence of function calls inserts the kernel code, with its execution parameters, into the command queue
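The seven-line setup sequence that the next two slides walk through can be sketched as below (reconstructed from the line-by-line description; the identifier names clerr, clctx, parmsz, cldevs, and clcmdq follow the slides, and the numbered comments match the slides' line references). This needs an OpenCL SDK to compile and a device to run:

```
#include <CL/cl.h>
#include <stdlib.h>

cl_int clerr = CL_SUCCESS;                               /* 1: error code, set to success      */
cl_context clctx = clCreateContextFromType(0,            /* 2: create context over all         */
        CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);         /*    device types; last arg
                                                               receives the error code         */
size_t parmsz;                                           /* 3: size of the device-list buffer  */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 4: query size only; param 4 is     */
        0, NULL, &parmsz);                               /*    NULL, size returned in parmsz   */
cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);  /* 5: allocate the buffer             */
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,      /* 6: fetch the device list           */
        parmsz, cldevs, NULL);
cl_command_queue clcmdq = clCreateCommandQueue(clctx,    /* 7: command queue for the first     */
        cldevs[0], 0, &clerr);                           /*    device in the returned list     */
```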
A “Simple” Example - 1 • Line by Line
– Set error code to success
– Call clCreateContextFromType()
• Include all device types (param 2)
• The last argument receives the error code
– Line 3 declares parmsz, the size of the memory buffer
– Line 4 is the first call to clGetContextInfo
• clctx from line 2 is the first param
• Param 4 is NULL since the size is not known
• There will be another call in line 6 where the missing information is supplied
A “Simple” Example - 2 • Line by Line
– Line 5 uses malloc to allocate a parmsz-byte buffer and assigns its address to cldevs
– Call clGetContextInfo again in line 6
• The third param is set to parmsz
• The fourth param is set to cldevs
• The error code is returned
– Line 7 creates a command queue for the first device
• cldevs is treated as an array and the 2nd param is cldevs[0]
• This generates a command queue for the first device in the list returned by clGetContextInfo
Electrostatic Potential Map in OpenCL • Step 1: design the organization of NDRange
– Threads are now work items; blocks are work groups
– Each work item calculates up to eight grid points
– Each work group has 64 to 256 work items
Mapping DCS NDRange to OpenCL Device • The structure is the same; only the nomenclature is changed
Changes in Data Access Indexing • The changes are relatively minor
– __global__ becomes __kernel
– Accesses to the .x and .y fields, and the associated arithmetic, are replaced by calls specifying dimensions 0 and 1
The Inner Loop of the DCS kernel • The OpenCL code is shown
– The logic is basically the same
– rsqrtf() has been changed to native_rsqrt()
Building an OpenCL kernel – Line 1 – declares entire DCS kernel as a string
– Line 3 – delivers source code string to the OpenCL run time system
– Line 4 – sets up the compiler flags
– Line 5 – invokes the runtime compiler to build program
– Line 6 – obtains a handle to the kernel that can be submitted to a command queue
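A reconstructed sketch of that sequence (the identifier names clenergysrc, clpgm, and clkern are assumptions; the kernel name "clenergy" follows the DCS example; the numbered comments match the slide's line references). This needs an OpenCL SDK:

```
#include <CL/cl.h>
#include <stdio.h>

const char *clenergysrc =                               /* 1: entire DCS kernel as a string  */
    "__kernel void clenergy( /* ...kernel body... */ ) { }";
cl_program clpgm = clCreateProgramWithSource(clctx, 1,  /* 3: deliver source to the OpenCL   */
        &clenergysrc, NULL, &clerr);                    /*    run time system                */
char clcompileflags[4096];
sprintf(clcompileflags, "-cl-mad-enable");              /* 4: set up the compiler flags      */
clerr = clBuildProgram(clpgm, 0, NULL,                  /* 5: invoke the runtime compiler    */
        clcompileflags, NULL, NULL);
cl_kernel clkern = clCreateKernel(clpgm,                /* 6: handle for the command queue   */
        "clenergy", &clerr);
```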
Host Code for the kernel Launch - 1
Host Code for the kernel Launch - 2 – Lines 1 & 2 : allocate memory for energy grid and atoms
– Lines 3 – 6 : set up arguments to be passed to the kernel
– Line 8 : submits the DCS kernel for launch
– Lines 9-10 : check for errors, if any
– Line 11 : transfers result data in energy array back to host memory
– Lines 12-13 : release memory
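A reconstructed sketch of that launch sequence (buffer and size names such as devenergy, volmemsz, Gsz, and Bsz are assumptions; the numbered comments match the slide's line references). This needs an OpenCL SDK:

```
#include <CL/cl.h>

cl_mem devenergy = clCreateBuffer(clctx, CL_MEM_READ_WRITE,   /* 1: energy grid buffer     */
        volmemsz, NULL, NULL);
cl_mem devatominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,  /* 2: atom data buffer       */
        atommemsz, NULL, NULL);
clerr = clSetKernelArg(clkern, 0, sizeof(int), &numatoms);    /* 3-6: kernel arguments     */
clerr = clSetKernelArg(clkern, 1, sizeof(float), &gridspacing);
clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &devatominfo);
clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &devenergy);
cl_event ev;
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL,       /* 8: submit DCS kernel      */
        Gsz, Bsz, 0, NULL, &ev);
clerr = clWaitForEvents(1, &ev);                              /* 9-10: check for errors    */
if (clerr != CL_SUCCESS) { /* report and bail out */ }
clEnqueueReadBuffer(clcmdq, devenergy, CL_TRUE, 0, volmemsz,  /* 11: copy energy results   */
        energy, 0, NULL, NULL);                               /*     back to host memory   */
clReleaseMemObject(devenergy);                                /* 12-13: release memory     */
clReleaseMemObject(devatominfo);
```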