OpenCL – The Open Standard for Programming Heterogeneous Parallel Hardware
Master Seminar Winter Term 2008/09 – Multicore Parallel Programming
Peter Thoman, 04-12-2008
Outline
● Introduction & Motivation
● Background:
  ● GPGPU Programming History
  ● Task-based Multicore CPU Programming
● OpenCL
  ● Design Overview
  ● Components
  ● Execution Model
  ● Memory Model
  ● Examples
● Open Questions & Research Opportunities
Introduction
● Recent years:
  ● Proliferation of parallel computing devices:
    – Multicore CPUs, GPUs, Cell, ... soon: manycore CPUs
  → A standardized programming environment is desirable
● OpenCL is intended to be that standard
  ● Allows targeting various computing devices with the same program
  ● Simplifies development for “exotic” hardware
  → Stimulates further growth beyond HPC & research
Motivation – why bother?
● Higher levels of parallelism & specialization generally yield higher maximum performance
[Chart: peak computation (GFlop/s) and bandwidth (GB/s), scale 0–1000, comparing the Intel Core 2 Quad Q9450, IBM Cell BE and NVIDIA GeForce GTX 260]
GPGPU History
● Starting around 2003:
  ● Programmable shaders introduced on GPUs
    – Originally intended for lighting calculations on surfaces etc.
  ● Side effect – they allow GPUs to be used as general-purpose computing devices
  → GPGPU born
● Two broad phases so far:
  ● Early GPGPU: 2003–07 – graphics APIs (DirectX/OpenGL) used to write GPGPU programs
  ● Current GPU computing: 2007–? – vendor-supplied APIs
Early GPGPU
● Graphics APIs used
  ● “Rendering” with pixel shaders and ping-ponging
● Disadvantages
  ● Programmer must know graphics APIs and concepts
  ● Overheads introduced by the graphics pipeline
  ● No communication and synchronization primitives
Current GPU Computing
● Vendor-supplied APIs: CUDA, CTM
● CUDA far more popular
  ● CudaZone lists 144 projects in a large variety of fields
  ● With speedups (over CPU) from factor 2 to 480
● Advantages:
  ● Standard C with simple extensions
  ● Arbitrary reads/writes from/to memory (no texture restriction)
  ● Small high-speed shared memory as a manual cache or for communication
  ● Traditional CPU functionality like bitwise integer ops
● Disadvantage: vendor/hardware specific
Task-based Parallelism on CPUs
● As opposed to the data-parallel model used on GPUs
● Long history of implicitly task-based systems:
  ● MPI or other message passing
  ● Basic threading
  ● Even fork-join
● Explicitly task-based models are rather new:
  ● OpenMP 3.0
  ● Research projects like Star Superscalar
    – Presented last week!
OpenMP 3.0 Task Model
● Simple spawning and synchronization of tasks
● Same memory model as existing OpenMP constructs
● No dependency handling
● Example: parallel post-order tree traversal
OpenCL
● Important:
  ● Specification not yet released; all information is based on public presentations given at SIGGRAPH and SC08
● Timeline:
  [Figure: OpenCL standardization timeline]
OpenCL
● Broad industry support
  ● The next version of Apple's Mac OS X will most likely include the first implementation
OpenCL – Design Goals
● Enable use of all computational resources in a system
  – Allow programming GPUs, CPUs, Cell, etc.
● Support data- and task-parallel compute models
● Approachable low-level, high-performance abstraction with silicon portability
● Familiar C-like parallel programming model
● Drive future hardware requirements, including floating-point precision limits
● Close integration with OpenGL for visualization
OpenCL – Design Illustration
● Convergence of both hardware and programming models
  [Diagram: hardware convergence – STI Cell, NVIDIA GTX 280, ATI/AMD RV770, AMD Phenom, Intel Nehalem]
OpenCL Components (1)
● OpenCL consists of 3 components:
  ● Platform layer
  ● Runtime system
  ● Compiler/language specification
● Platform layer:
  ● Query, select and initialize devices
  ● Create compute contexts and command queues
● Runtime system:
  ● Resource management (memory, program scheduling)
  ● Executing compute kernels
OpenCL Components (2)
● Compiler (either online or offline compilation)
  ● Builds components written in the compute kernel language
● Language:
  ● Based on ISO C99; no recursion or function pointers
  ● Built-in types:
    – Scalar and vector data types, pointers
    – Data type conversion functions
    – Image-related types
  ● Built-in functions (required):
    – Work-item and synchronization functions
    – Math: math.h, relational and geometric functions
    – Functions to read and write images
  ● Built-in functions (optional):
    – Double-precision support and rounding modes
    – Atomics to global and shared memory
    – Writes to 3D images
OpenCL Execution Model
● Components:
  ● Compute kernels:
    – Basic units of computation, similar to C functions
  ● Compute programs:
    – Collections of kernels and internal functions
● Components are queued in a command queue to execute on a specific device
● Two different execution models:
  ● Data-parallel
  ● Task-parallel
OpenCL Data-Parallel Model
● Programmer specifies an N-dimensional computation domain
  ● Every element is a work-item
  ● Total number of items = global work size
  ● The global work size is the maximum degree of parallelism for this computation
● Work-items can be grouped into work-groups
  ● Mapped either explicitly or implicitly
  ● Items in a group can communicate and synchronize
  ● Work-groups can also be executed in parallel
OpenCL Task-Parallel Model
● Optional for compute devices
  ● Most current GPUs probably won't support it
● Tasks are executed as a single work-item
● Unlike the data-parallel model, tasks can be written in either the OpenCL kernel language or native C/C++
● No clearer specification for now; conjectured to be similar to the OpenMP 3.0 model
OpenCL – Memory Model
● Relaxed-consistency shared memory model
● Multiple distinct address spaces, which can be collapsed on some devices:
  ● Private memory – per work-item
  ● Local memory – per compute unit
  ● Global/constant memory
● Qualifiers: __private, __local, __constant and __global
OpenCL – Examples (1)
● Simple vector addition kernel (compute-device code)
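The slide's listing is not part of this text; a minimal sketch of such a kernel in the OpenCL kernel language (following the released 1.0 kernel language; the kernel and argument names are illustrative) looks like:

```c
// OpenCL kernel language (C99-based): one work-item per vector element.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);   // this work-item's index in the 1-D domain
    c[i] = a[i] + b[i];
}
```

The loop over elements found in a serial C version disappears: the N-dimensional domain enumerates the iterations, and each work-item handles one of them.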
OpenCL – Examples (2)
● Host code – initialization of a GPU device and associated context / command queue
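A sketch of such host code against the final OpenCL 1.0 API (the 2008 pre-release API differed in details such as platform handling; error handling is omitted for brevity):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: pick the first GPU of the first platform and set up a
   compute context and command queue for it. */
void init_gpu(cl_context *ctx, cl_command_queue *queue, cl_device_id *dev)
{
    cl_int err;
    cl_platform_id platform;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, dev, NULL);

    /* Context groups devices and resources; the queue orders commands
       (kernel launches, memory transfers) for one specific device. */
    *ctx   = clCreateContext(NULL, 1, dev, NULL, NULL, &err);
    *queue = clCreateCommandQueue(*ctx, *dev, 0, &err);
}
```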
OpenCL – Examples (3)
● Host code – allocate device memory buffers and create / build program
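A sketch of this step against the OpenCL 1.0 API (function and variable names are illustrative; `src` would hold the kernel source string):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: create two input buffers initialized from host memory, an
   output buffer, and compile a program from source for one device. */
void setup_buffers_and_program(cl_context ctx, cl_device_id dev,
                               const float *host_a, const float *host_b,
                               size_t n, const char *src,
                               cl_mem *a_buf, cl_mem *b_buf, cl_mem *c_buf,
                               cl_program *prog)
{
    cl_int err;
    *a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), (void *)host_a, &err);
    *b_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), (void *)host_b, &err);
    *c_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                            n * sizeof(float), NULL, &err);

    /* Online compilation: the kernel source is built at runtime for
       whatever device was selected. */
    *prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(*prog, 1, &dev, NULL, NULL, NULL);
}
```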
OpenCL – Examples (4)
● Host code – create and run compute kernel
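A sketch of the final step against the OpenCL 1.0 API (the kernel name `vec_add` and the helper's signature are illustrative assumptions):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: create a kernel object, bind its arguments, enqueue it over a
   1-D domain of n work-items, and read the result back (blocking). */
void run_vec_add(cl_command_queue queue, cl_program prog,
                 cl_mem a_buf, cl_mem b_buf, cl_mem c_buf,
                 size_t n, float *host_c)
{
    cl_int err;
    cl_kernel kernel = clCreateKernel(prog, "vec_add", &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);

    /* Global work size n; passing NULL for the local work size lets the
       runtime choose the work-group size. */
    size_t global_work_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_work_size, NULL, 0, NULL, NULL);

    /* CL_TRUE makes the read blocking, so host_c is valid on return. */
    clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0,
                        n * sizeof(float), host_c, 0, NULL, NULL);
}
```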
OpenCL – Examples (5)
● Kernel code – matrix transpose
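A sketch of a work-group-tiled transpose in the OpenCL kernel language, using __local memory as the manually managed cache described on the memory-model slide (kernel name, argument layout, and the caller-allocated __local tile are illustrative assumptions; square work-groups are assumed):

```c
// Transpose a row-major width x height matrix into out (height columns).
// Each work-group stages one tile in __local memory so that both the
// global read and the global write stay contiguous (coalesced).
__kernel void transpose(__global float *out, __global const float *in,
                        uint width, uint height, __local float *tile)
{
    uint lx = get_local_id(0), ly = get_local_id(1);
    uint gx = get_global_id(0), gy = get_global_id(1);
    uint lsz = get_local_size(0);          // assumes square work-groups

    // Contiguous read from global memory into the local tile
    if (gx < width && gy < height)
        tile[ly * lsz + lx] = in[gy * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);          // tile fully loaded

    // Write the tile back with group coordinates swapped
    uint ox = get_group_id(1) * lsz + lx;
    uint oy = get_group_id(0) * lsz + ly;
    if (ox < height && oy < width)
        out[oy * height + ox] = tile[lx * lsz + ly];
}
```

The barrier is the work-group synchronization primitive from the execution model: every item must finish loading before any item reads a transposed element written by a neighbor.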
Open Questions / Research
● Shared code for vastly different hardware enables research opportunities:
● Distribution of kernels across devices
  ● Run a given kernel on the GPU or CPU, or maybe split it?
● Requires:
  ● Analysis of kernels, either static or dynamic
  ● Lookup or benchmarking of available hardware at runtime
  ● A fast decision algorithm using this information
    – Either analytical or machine-learning-based
Summary
● Modern and future systems contain massively parallel, heterogeneous hardware
  ● Worth the headache because of the performance potential
● OpenCL
  ● Open standard platform for programming such systems
  ● Data- and task-parallel execution models
    – In the tradition of GPU programming models for the former, and mainstream CPU parallelization for the latter
  ● Relaxed-consistency shared memory
    – Distinct, collapsible address spaces
  ● Release soon!
Thank you!
Consult the accompanying seminar document for a complete list of references.