20101030 opencl intro

Brief Introduction to OpenCL

Hu Zi Ming

[email protected]

2010-10-30

Outline

1 Some Background about OpenCL: CPU vs. GPU, What is OpenCL, Advantages & Disadvantages

2 Programming with OpenCL

3 Demo about OpenCL


CPU vs. GPU

CPU:

Makes a single thread fast

Hides latency through large caches

GPU:

Improves throughput

Hides latency through parallelism



Before OpenCL...

NVIDIA CUDA

ATI Stream

Microsoft DirectCompute

...

Apple said, "Let there be a standard"

And there was OpenCL



What is OpenCL

Open Computing Language

Kernel language is based on C99; similar to C for CUDA but slightly lower-level

Originally developed by Apple

Now maintained by the Khronos Group

Used for parallel computing


Advantages

Supports heterogeneous platforms

Task-based (CPU) and data-based (GPU) parallelism for parallel computing

Exploits the high memory bandwidth and compute throughput of GPUs

Harnesses GPU power without being locked in to one manufacturer

Supports extensions, as OpenGL does

Supports an embedded profile (comparable to OpenGL ES) for mobile devices


Disadvantages

Tuning is hardware-specific

Algorithms are bound to the data shape

Recursion is currently not available

Function pointers are currently not supported


Outline

1 Some Background about OpenCL

2 Programming with OpenCL: Prerequisite, Main Flow of Host Code, Four Models

3 Demo about OpenCL


Prerequisite

A driver that supports OpenCL

ATI Stream SDK / NVIDIA CUDA Toolkit / ...

Host code: controls the kernel code

OpenCL kernel code: written in OpenCL C and runs on devices



Main Flow of Host Code

Get information about the platform and devices

Select devices to be used in execution

Create an OpenCL context

Create a command queue

Create memory buffer objects

Create program object

Load the kernel source code and compile it

Create kernel object

Set kernel arguments

Execute the kernel

Copy the results from device (GPU) memory back to host (CPU) memory


OpenCL Summary [figure]

Four Models

Platform model

Execution model

Memory model

Programming model


Platform Model

A host is connected to one or more OpenCL devices

A device is divided into one or more compute units (CUs)

A compute unit is further divided into one or more processing elements (PEs)

The application submits commands from the host to execute computations on the PEs

PEs within a CU execute instructions as SIMD or SPMD units


Platform Model (Cont.) [figure]

Execution Model

A work-item is the basic unit of work

A kernel is the code run by each work-item

Kernels execute on OpenCL devices; a kernel is basically a C function

The host program executes on the host

The host program creates an index space defined by an NDRange

Work-items are organized into work-groups
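How work-items, work-groups, and the NDRange index space relate can be sketched in plain C, with no OpenCL runtime involved. The helper names `global_id` and `num_groups` are hypothetical (they are not OpenCL API functions); they mirror, for one dimension, what `get_global_id`, `get_group_id`, and `get_local_id` report, assuming a zero global offset and a global size that is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global ID of a work-item, given its work-group ID,
 * the work-group size, and its local ID inside that group
 * (one dimension, zero global offset assumed). */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Hypothetical helper: number of work-groups in one dimension,
 * assuming global_size is a multiple of local_size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For an NDRange of 16 work-items split into work-groups of 4, there are 4 work-groups, and the work-item with local ID 3 in work-group 2 has global ID 2 * 4 + 3 = 11.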


Execution Model (Cont.) [figure]

Memory Model

Global memory: readable and writable by all work-items in all work-groups

Constant memory: a region of global memory that remains constant during kernel execution

Local memory: local to a work-group

Private memory: private to a work-item

Data movement path: host -> global -> local, and back
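The data movement path can be imitated on the CPU to make the roles of the four memory regions concrete. This is only a sketch in plain C: the function `double_via_local` and the cap `MAX_LOCAL` are hypothetical, ordinary arrays stand in for device memory, and the "work-items" run sequentially instead of in parallel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LOCAL 64  /* hypothetical cap on the emulated work-group size */

/* Emulates host -> global -> local -> global -> host on the CPU.
 * Doubles every element of host_in into host_out.
 * Assumes n > 0 is a multiple of group_size and group_size <= MAX_LOCAL.
 * Returns 0 on success, -1 if the buffer cannot be allocated. */
int double_via_local(const int *host_in, int *host_out, size_t n, size_t group_size)
{
    assert(n > 0 && group_size > 0 && group_size <= MAX_LOCAL);
    assert(n % group_size == 0);

    int *global_mem = malloc(n * sizeof *global_mem);     /* stands in for global memory */
    if (global_mem == NULL)
        return -1;
    memcpy(global_mem, host_in, n * sizeof *global_mem);  /* host -> global */

    for (size_t group = 0; group < n / group_size; group++) {
        int local_mem[MAX_LOCAL];                         /* shared by one work-group */
        int *tile = global_mem + group * group_size;

        for (size_t lid = 0; lid < group_size; lid++)
            local_mem[lid] = tile[lid];                   /* global -> local */

        for (size_t lid = 0; lid < group_size; lid++) {
            int priv = local_mem[lid] * 2;                /* private to one work-item */
            tile[lid] = priv;                             /* result back to global */
        }
    }

    memcpy(host_out, global_mem, n * sizeof *global_mem); /* global -> host */
    free(global_mem);
    return 0;
}
```

On a real device the two inner loops would be one kernel, with a barrier between the load into local memory and its use.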


Memory Model [figure]

Programming Model

Data parallel programming model

Task parallel programming model

Synchronization


Outline

1 Some Background about OpenCL

2 Programming with OpenCL

3 Demo about OpenCL: Matrix Add, Matrix Multiply


Kernel Code

normal add

__kernel void add(__global int *a, __global int *b, __global int *c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
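To see what the add kernel computes without an OpenCL device, its body can be run once per index in plain C. The names `add_work_item` and `add_reference` are hypothetical; the loop is a sequential stand-in for launching a one-dimensional NDRange of `n` work-items, and could serve as a host-side check of the device result.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the add kernel for a single work-item;
 * gid plays the role of get_global_id(0). */
static void add_work_item(const int *a, const int *b, int *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for a 1-D NDRange of n work-items:
 * on a device these calls would run in parallel, here they run in order. */
void add_reference(const int *a, const int *b, int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t gid = 0; gid < n; gid++)
        add_work_item(a, b, c, gid);
}
```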


Normal Kernel Code

normal multiply

/* WA, WB, WC: widths of a, b, c; assumed to be macros defined at build time */
__kernel void mul(__global int *a, __global int *b, __global int *c) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int sum = 0;
    for (int i = 0; i < WA; i++) {
        sum += a[y * WA + i] * b[i * WB + x];
    }
    c[y * WC + x] = sum;
}
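A plain C reference for the same product is handy for checking a kernel's output on the host. This is a sketch, not part of the original demo: `matmul_reference` is a hypothetical name, and the matrix dimensions are passed as parameters instead of the `WA`/`WB`/`WC` macros the kernel relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Reference product c = a * b for row-major matrices.
 * a is ha x wa, b is wa x wb, c is ha x wb. */
void matmul_reference(const int *a, const int *b, int *c,
                      size_t ha, size_t wa, size_t wb)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t y = 0; y < ha; y++) {        /* plays the role of get_global_id(0) */
        for (size_t x = 0; x < wb; x++) {    /* plays the role of get_global_id(1) */
            int sum = 0;
            for (size_t i = 0; i < wa; i++)
                sum += a[y * wa + i] * b[i * wb + x];
            c[y * wb + x] = sum;
        }
    }
}
```

Each iteration of the two outer loops corresponds to one work-item of the kernel; the inner loop is the dot product that the work-item computes.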


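Kernel Code with Block Support

multiply with block support

/* Assumes BLOCK_SIZE x BLOCK_SIZE work-groups, WA a multiple of BLOCK_SIZE,
   and WA, WB, BLOCK_SIZE defined as macros at build time. */
__kernel void mul(__global float *a, __global float *b, __global float *c,
                  __local float *as, __local float *bs) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int tx = get_local_id(1);   /* column within the block */
    int ty = get_local_id(0);   /* row within the block */

    float tmp_val = 0.0f;
    for (int i = 0; i < WA / BLOCK_SIZE; i++) {
        /* Each work-item loads one element of the current block of a and of b. */
        as[ty * BLOCK_SIZE + tx] = a[y * WA + i * BLOCK_SIZE + tx];
        bs[ty * BLOCK_SIZE + tx] = b[(i * BLOCK_SIZE + ty) * WB + x];
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait until both blocks are loaded */

        for (int j = 0; j < BLOCK_SIZE; j++) {
            tmp_val += as[ty * BLOCK_SIZE + j] * bs[j * BLOCK_SIZE + tx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait before the blocks are overwritten */
    }
    c[y * WB + x] = tmp_val;
}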
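The blocking idea can be traced on the CPU with ordinary arrays standing in for local memory. The sketch below assumes square n x n matrices with n a multiple of the block edge; `matmul_blocked` and `BLOCK` are hypothetical names, and the work-items of a work-group are emulated by sequential loops, so the two phases that a barrier separates on a device simply run one after the other.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 2  /* hypothetical block edge; stands in for BLOCK_SIZE */

/* CPU emulation of blocked multiplication c = a * b for n x n
 * row-major matrices. Assumes n > 0 is a multiple of BLOCK. */
void matmul_blocked(const float *a, const float *b, float *c, size_t n)
{
    assert(n > 0 && n % BLOCK == 0);

    for (size_t by = 0; by < n / BLOCK; by++) {
        for (size_t bx = 0; bx < n / BLOCK; bx++) {     /* one work-group per block of c */
            float as[BLOCK][BLOCK], bs[BLOCK][BLOCK];   /* "local memory" blocks */
            float acc[BLOCK][BLOCK] = {{0.0f}};         /* one accumulator per work-item */

            for (size_t t = 0; t < n / BLOCK; t++) {
                /* Load phase: work-item (ty, tx) copies one element of each block. */
                for (size_t ty = 0; ty < BLOCK; ty++)
                    for (size_t tx = 0; tx < BLOCK; tx++) {
                        as[ty][tx] = a[(by * BLOCK + ty) * n + t * BLOCK + tx];
                        bs[ty][tx] = b[(t * BLOCK + ty) * n + bx * BLOCK + tx];
                    }
                /* Compute phase: on a device a barrier separates it from the loads. */
                for (size_t ty = 0; ty < BLOCK; ty++)
                    for (size_t tx = 0; tx < BLOCK; tx++)
                        for (size_t j = 0; j < BLOCK; j++)
                            acc[ty][tx] += as[ty][j] * bs[j][tx];
            }

            for (size_t ty = 0; ty < BLOCK; ty++)
                for (size_t tx = 0; tx < BLOCK; tx++)
                    c[(by * BLOCK + ty) * n + bx * BLOCK + tx] = acc[ty][tx];
        }
    }
}
```

Each element of a and b is read from the large arrays once per block rather than once per output element, which is the saving that local memory is meant to provide on a device.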



Q AND A

