OpenCL – The Open Standard for Programming Heterogeneous Parallel Hardware
Master Seminar Winter Term 2008/09 – Multicore Parallel Programming
Peter Thoman, 04-12-2008
Outline
● Introduction & Motivation
● Background:
  ● GPGPU Programming History
  ● Task-based Multicore CPU Programming
● OpenCL
  ● Design Overview
  ● Components
  ● Execution Model
  ● Memory Model
  ● Examples
● Open Questions & Research Opportunities
Introduction
● Recent years:
  ● Proliferation of parallel computing devices:
    – Multicore CPUs, GPUs, Cell, ... soon: manycore CPUs
  → A standardized programming environment is desirable
● OpenCL is intended to be that standard
  ● Allows targeting various computing devices with the same program
  ● Simplifies development for “exotic” hardware
  → Stimulates further growth beyond HPC & research
Motivation – why bother?
● Higher levels of parallelism & specialization generally yield higher maximum performance
[Chart: peak computation (GFlop/s) and bandwidth (GB/s), scale 0–1000, comparing the Intel Core 2 Quad Q9450, IBM Cell BE and NVIDIA GeForce GTX 260]
GPGPU History
● Starting around 2003:
  ● Programmable shaders introduced on GPUs
    – Originally intended for lighting calculations on surfaces etc.
  ● Side effect – they allow GPUs to be used as general-purpose computing devices
  → GPGPU born
● Two broad phases so far:
  ● Early GPGPU: 2003–07 – graphics APIs (DirectX/OpenGL) used to write GPGPU programs
  ● Current GPU computing: 2007–? – vendor-supplied APIs
Early GPGPU
● Graphics APIs used
  ● “Rendering” with pixel shaders and ping-ponging
● Disadvantages
  ● Programmer must know graphics APIs and concepts
  ● Overheads introduced by the graphics pipeline
  ● No communication and synchronization primitives
Current GPU Computing
● Vendor-supplied APIs: CUDA, CTM
● CUDA far more popular
  ● CudaZone lists 144 projects in a large variety of fields
  ● With speedups (over CPU) from factor 2 to 480
● Advantages:
  ● Standard C with simple extensions
  ● Arbitrary reads/writes from/to memory (no texture restriction)
  ● Small high-speed shared memory as a manual cache or for communication
  ● Traditional CPU functionality like bitwise integer ops
● Disadvantage: vendor/hardware specific
Task-based Parallelism on CPUs
● As opposed to the data-parallel model used on GPUs
● Long history of implicitly task-based systems:
  ● MPI or other message passing
  ● Basic threading
  ● Even fork-join
● Explicitly task-based models are rather new:
  ● OpenMP 3.0
  ● Research projects like Star Superscalar
    – Presented last week!
OpenMP 3.0 Task Model
● Simple spawning and synchronization of tasks
● Same memory model as existing OpenMP constructs
● No dependency handling
● Example: parallel post-order tree traversal
OpenCL
● Important:
  ● Specification not yet released; all information is based on public presentations given at SIGGRAPH and SC08
● Timeline:
  [Figure: OpenCL standardization timeline]
OpenCL
● Broad industry support
  ● The next version of Apple's Mac OS X will most likely include the first implementation
OpenCL – Design Goals
● Enable use of all computational resources in a system
  – Allow programming GPUs, CPUs, Cell, etc.
● Support data- and task-parallel compute models
● Approachable low-level, high-performance abstraction with silicon portability
● Familiar C-like parallel programming model
● Drive future hardware requirements, including floating-point precision limits
● Close integration with OpenGL for visualization
OpenCL – Design Illustration
● Convergence of both hardware and programming models
  [Diagram: hardware convergence – STI Cell, NVIDIA GTX 280, ATI/AMD RV770, AMD Phenom, Intel Nehalem]
OpenCL Components (1)
● OpenCL consists of 3 components:
  ● Platform layer
  ● Runtime system
  ● Compiler/language specification
● Platform layer:
  ● Query, select and initialize devices
  ● Create compute contexts and command queues
● Runtime system:
  ● Resource management (memory, program scheduling)
  ● Executing compute kernels
OpenCL Components (2)
● Compiler (either online or offline compilation)
  ● Builds components written in the compute kernel language
● Language:
  ● Based on ISO C99; no recursion or function pointers
  ● Built-in types:
    – Scalar and vector data types, pointers
    – Data type conversion functions
    – Image-related types
  ● Built-in functions (required):
    – Work-item and synchronization functions
    – Math: math.h, relational and geometric functions
    – Functions to read and write images
  ● Built-in functions (optional):
    – Double-precision support and rounding modes
    – Atomics to global and shared memory
    – Writes to 3D images
OpenCL Execution Model
● Components:
  ● Compute kernels:
    – Basic units of computation, similar to C functions
  ● Compute programs:
    – Collections of kernels and internal functions
● Components are queued in a command queue to execute on a specific device
● Two different execution models:
  ● Data-parallel
  ● Task-parallel
OpenCL Data-Parallel Model
● Programmer specifies an N-dimensional computation domain
  ● Every element is a work-item
  ● Total number of items = global work size
  ● The global work size is the maximum degree of parallelism for this computation
● Work-items can be grouped into work-groups
  ● Mapped either explicitly or implicitly
  ● Items in a group can communicate and synchronize
  ● Work-groups can also be executed in parallel
OpenCL Task-Parallel Model
● Optional for compute devices
  ● Most current GPUs probably won't support it
● Tasks are executed as a single work-item
● Unlike the data-parallel model, tasks can be written in either the OpenCL kernel language or native C/C++
● No clearer specification for now; conjectured to be similar to the OpenMP 3.0 model
OpenCL – Memory Model
● Relaxed-consistency shared memory model
● Multiple distinct address spaces, which can be collapsed on some devices:
  ● Private memory – per work-item
  ● Local memory – per compute unit
  ● Global/constant memory
● Qualifiers: __private, __local, __constant and __global
OpenCL – Examples (1)
● Simple vector addition kernel (compute-device code)
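The slide's listing is not part of this text; a minimal sketch of such a kernel in the OpenCL kernel language (following the released 1.0 kernel language; the kernel and argument names are illustrative) looks like:

```c
// OpenCL kernel language (C99-based): one work-item per vector element.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);   // this work-item's index in the 1-D domain
    c[i] = a[i] + b[i];
}
```

The loop over elements found in a serial C version disappears: the N-dimensional domain enumerates the iterations, and each work-item handles one of them.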
OpenCL – Examples (2)
● Host code – initialization of a GPU device and associated context / command queue
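A sketch of such host code against the final OpenCL 1.0 API (the 2008 pre-release API differed in details such as platform handling; error handling is omitted for brevity):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: pick the first GPU of the first platform and set up a
   compute context and command queue for it. */
void init_gpu(cl_context *ctx, cl_command_queue *queue, cl_device_id *dev)
{
    cl_int err;
    cl_platform_id platform;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, dev, NULL);

    /* Context groups devices and resources; the queue orders commands
       (kernel launches, memory transfers) for one specific device. */
    *ctx   = clCreateContext(NULL, 1, dev, NULL, NULL, &err);
    *queue = clCreateCommandQueue(*ctx, *dev, 0, &err);
}
```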
OpenCL – Examples (3)
● Host code – allocate device memory buffers and create / build program
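A sketch of this step against the OpenCL 1.0 API (function and variable names are illustrative; `src` would hold the kernel source string):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: create two input buffers initialized from host memory, an
   output buffer, and compile a program from source for one device. */
void setup_buffers_and_program(cl_context ctx, cl_device_id dev,
                               const float *host_a, const float *host_b,
                               size_t n, const char *src,
                               cl_mem *a_buf, cl_mem *b_buf, cl_mem *c_buf,
                               cl_program *prog)
{
    cl_int err;
    *a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), (void *)host_a, &err);
    *b_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), (void *)host_b, &err);
    *c_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                            n * sizeof(float), NULL, &err);

    /* Online compilation: the kernel source is built at runtime for
       whatever device was selected. */
    *prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(*prog, 1, &dev, NULL, NULL, NULL);
}
```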
OpenCL – Examples (4)
● Host code – create and run compute kernel
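A sketch of the final step against the OpenCL 1.0 API (the kernel name `vec_add` and the helper's signature are illustrative assumptions):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: create a kernel object, bind its arguments, enqueue it over a
   1-D domain of n work-items, and read the result back (blocking). */
void run_vec_add(cl_command_queue queue, cl_program prog,
                 cl_mem a_buf, cl_mem b_buf, cl_mem c_buf,
                 size_t n, float *host_c)
{
    cl_int err;
    cl_kernel kernel = clCreateKernel(prog, "vec_add", &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);

    /* Global work size n; passing NULL for the local work size lets the
       runtime choose the work-group size. */
    size_t global_work_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_work_size, NULL, 0, NULL, NULL);

    /* CL_TRUE makes the read blocking, so host_c is valid on return. */
    clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0,
                        n * sizeof(float), host_c, 0, NULL, NULL);
}
```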
OpenCL – Examples (5)
● Kernel code – matrix transpose
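A sketch of a work-group-tiled transpose in the OpenCL kernel language, using __local memory as the manually managed cache described on the memory-model slide (kernel name, argument layout, and the caller-allocated __local tile are illustrative assumptions; square work-groups are assumed):

```c
// Transpose a row-major width x height matrix into out (height columns).
// Each work-group stages one tile in __local memory so that both the
// global read and the global write stay contiguous (coalesced).
__kernel void transpose(__global float *out, __global const float *in,
                        uint width, uint height, __local float *tile)
{
    uint lx = get_local_id(0), ly = get_local_id(1);
    uint gx = get_global_id(0), gy = get_global_id(1);
    uint lsz = get_local_size(0);          // assumes square work-groups

    // Contiguous read from global memory into the local tile
    if (gx < width && gy < height)
        tile[ly * lsz + lx] = in[gy * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);          // tile fully loaded

    // Write the tile back with group coordinates swapped
    uint ox = get_group_id(1) * lsz + lx;
    uint oy = get_group_id(0) * lsz + ly;
    if (ox < height && oy < width)
        out[oy * height + ox] = tile[lx * lsz + ly];
}
```

The barrier is the work-group synchronization primitive from the execution model: every item must finish loading before any item reads a transposed element written by a neighbor.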
Open Questions / Research
● Shared code for vastly different hardware enables research opportunities:
● Distribution of kernels across devices
  ● Run a given kernel on the GPU or CPU, or maybe split it?
● Requires:
  ● Analysis of kernels, either static or dynamic
  ● Lookup or benchmarking of available hardware at runtime
  ● A fast decision algorithm using this information
    – Either analytical or machine-learning-based
Summary
● Modern and future systems contain massively parallel, heterogeneous hardware
  ● Worth the headache because of the performance potential
● OpenCL
  ● Open standard platform for programming such systems
  ● Data- and task-parallel execution models
    – In the tradition of GPU programming models for the former, and mainstream CPU parallelization for the latter
  ● Relaxed-consistency shared memory
    – Distinct, collapsible address spaces
  ● Release soon!
Thank you!
Consult the accompanying seminar document for a complete list of references.