Brief Introduction to OpenCL

Hu Zi Ming

[email protected]

2010-10-30

Outline

1 Some Background about OpenCL: CPU vs. GPU; What is OpenCL; Advantages & Disadvantages

2 Programming with OpenCL

3 Demo about OpenCL

CPU vs. GPU

CPU:

Makes a single thread fast

Hides latency through large caches

GPU:

Improves throughput

Hides latency through parallelism

Before OpenCL...

NVIDIA CUDA

ATI Stream

Microsoft DirectCompute

...

Apple said, "Let there be a standard"

And there was OpenCL

What is OpenCL

Open Computing Language

Kernel language based on C99; similar to C for CUDA but slightly lower-level

Originally developed by Apple

Now maintained by the Khronos Group

Used for parallel computing

Advantages

Supports heterogeneous platforms

Supports both task-based (CPU) and data-based (GPU) parallelism

Gives access to the much higher memory bandwidth and compute throughput of GPUs

Exploits GPU power without being locked in to one manufacturer

Supports extensions, as OpenGL does

Supports an embedded profile (comparable to OpenGL ES) for mobile devices

Disadvantages

Tuning is hardware-specific

Algorithms are bound to the shape of the data

Recursion is not currently supported in kernels

Function pointers are not currently supported in kernels

Outline

1 Some Background about OpenCL

2 Programming with OpenCL: Prerequisite; Main Flow of Host Code; Four Models

3 Demo about OpenCL

Prerequisite

A driver that supports OpenCL

An SDK: ATI Stream SDK, NVIDIA CUDA Toolkit, ...

Host code: runs on the host and controls the kernel code

Kernel code: written in OpenCL C and runs on the devices

Main Flow of Host Code

Get information about the platform and devices

Select devices to be used in execution

Create an OpenCL context

Create a command queue

Create memory buffer objects

Create a program object

Load the kernel source code and compile it

Create a kernel object

Set the kernel arguments

Execute the kernel

Copy the results from device memory back to host memory

OpenCL Summary

Four Models

Platform model

Execution model

Memory model

Programming model

Platform Model

A host connected to one or more OpenCL devices

Each device is divided into one or more compute units (CUs)

Each compute unit is further divided into one or more processing elements (PEs)

The application submits commands from the host; the devices execute them on their PEs

The PEs within a CU execute instructions as SIMD or SPMD units

Execution Model

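A work-item is the basic unit of work

A kernel is the code executed by each work-item

Kernels execute on OpenCL devices; a kernel is basically a C function

The host program executes on the host

The host defines an N-dimensional index space (NDRange); one work-item runs for each point in it

Work-items are organized into work-groups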
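
The relation between the index space and the work-groups can be sketched on the CPU. The plain C below is only an illustration, not OpenCL code: the type `work_item_ids` and the functions `split_global_id` and `join_ids` are hypothetical names, and the sketch assumes a 1-D NDRange with a zero global offset. Under those assumptions, a global ID equals the work-group ID times the work-group size plus the local ID.

```c
#include <assert.h>
#include <stddef.h>

/* IDs of one work-item in a 1-D NDRange (hypothetical helper type). */
typedef struct {
    size_t global_id; /* value get_global_id(0) would return */
    size_t group_id;  /* value get_group_id(0) would return */
    size_t local_id;  /* value get_local_id(0) would return */
} work_item_ids;

/* Split a global ID into work-group ID and local ID.
   Assumes a zero global offset. */
work_item_ids split_global_id(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    assert(local_size > 0);
    ids.global_id = global_id;
    ids.group_id = global_id / local_size;
    ids.local_id = global_id % local_size;
    return ids;
}

/* Inverse mapping: global ID from work-group ID and local ID. */
size_t join_ids(size_t group_id, size_t local_id, size_t local_size)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

For example, with a work-group size of 4, `split_global_id(10, 4)` places work-item 10 in work-group 2 at local position 2, and `join_ids(2, 2, 4)` gives 10 again.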

Memory Model

Global memory: readable and writable by all work-items in all work-groups

Constant memory: a region of global memory that remains constant during kernel execution

Local memory: shared by the work-items of one work-group

Private memory: private to a single work-item

Data movement path: host -> global -> local, and back
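
The data movement path can be imitated in plain C. This is a CPU-only sketch, not OpenCL code: ordinary arrays stand in for the memory regions, and the names `group_sum`, `sum_groups`, `LOCAL_SIZE` and `NUM_GROUPS` are made up for the illustration. Each emulated work-group copies its slice from "global" to "local" memory, sums it in a "private" variable, and the host copies the per-group results back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 4 /* work-items per work-group (assumed value) */
#define NUM_GROUPS 2 /* number of work-groups (assumed value) */

/* Emulates one work-group summing its slice of global memory. */
int group_sum(const int *global_mem, size_t group_id)
{
    int local_mem[LOCAL_SIZE]; /* stands in for __local memory */
    int private_sum = 0;       /* stands in for __private memory */
    size_t lid;

    assert(group_id < NUM_GROUPS);

    /* global -> local: each work-item copies one element */
    for (lid = 0; lid < LOCAL_SIZE; lid++)
        local_mem[lid] = global_mem[group_id * LOCAL_SIZE + lid];

    /* a real kernel needs barrier(CLK_LOCAL_MEM_FENCE) at this point */

    /* local -> private: accumulate the group's sum */
    for (lid = 0; lid < LOCAL_SIZE; lid++)
        private_sum += local_mem[lid];
    return private_sum;
}

/* Emulates the host side: host -> global, run every group, global -> host. */
void sum_groups(const int host_in[NUM_GROUPS * LOCAL_SIZE],
                int host_out[NUM_GROUPS])
{
    int global_in[NUM_GROUPS * LOCAL_SIZE]; /* stands in for a device buffer */
    int global_out[NUM_GROUPS];             /* stands in for a device buffer */
    size_t g;

    memcpy(global_in, host_in, sizeof global_in);    /* host -> global */
    for (g = 0; g < NUM_GROUPS; g++)
        global_out[g] = group_sum(global_in, g);
    memcpy(host_out, global_out, sizeof global_out); /* global -> host */
}
```

On a real device the two `memcpy` calls would correspond to buffer transfers issued through the command queue, and the groups would run concurrently rather than in a loop.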

Programming Model

Data parallel programming model

Task parallel programming model

Synchronization

Outline

1 Some Background about OpenCL

2 Programming with OpenCL

3 Demo about OpenCL: Matrix Add; Matrix Multiply

Kernel Code

normal add

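/* One work-item per element: work-item i adds the i-th elements of a and b. */
__kernel void add(__global const int *a, __global const int *b, __global int *c) {
    int i = get_global_id(0);   /* index of this work-item in dimension 0 */
    c[i] = a[i] + b[i];
}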

Normal Kernel Code

normal multiply

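/* WA, WB and WC are the widths (column counts) of a, b and c. They are
   assumed to be compile-time constants, e.g. supplied as -D build options.
   WA is also the shared inner dimension of the product. */
__kernel void mul(__global const int *a, __global const int *b, __global int *c) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int sum = 0;                /* accumulate in private memory */
    for (int i = 0; i < WA; i++) {
        sum += a[y * WA + i] * b[i * WB + x];
    }
    c[y * WC + x] = sum;
}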

Kernel Code with Block Support

multiply with block support

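/* Assumes WA, WB and BLOCK_SIZE are compile-time constants, the work-group
   size is BLOCK_SIZE x BLOCK_SIZE, and WA is a multiple of BLOCK_SIZE.
   as and bs each hold BLOCK_SIZE * BLOCK_SIZE floats. */
__kernel void mul(__global const float *a, __global const float *b, __global float *c,
                  __local float *as, __local float *bs) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int tx = get_local_id(1);   /* column within the block */
    int ty = get_local_id(0);   /* row within the block */

    float tmp_val = 0.0f;
    for (int i = 0; i < WA / BLOCK_SIZE; i++) {
        /* Each work-item loads one element of the current block of a and of b. */
        as[ty * BLOCK_SIZE + tx] = a[y * WA + (i * BLOCK_SIZE + tx)];
        bs[ty * BLOCK_SIZE + tx] = b[(i * BLOCK_SIZE + ty) * WB + x];
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait until both blocks are loaded */

        for (int j = 0; j < BLOCK_SIZE; j++) {
            tmp_val += as[ty * BLOCK_SIZE + j] * bs[j * BLOCK_SIZE + tx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait before the blocks are overwritten */
    }
    c[y * WB + x] = tmp_val;
}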
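
The blocking idea can be checked on the CPU before it is run on a device. The plain C below is a sketch under stated assumptions, not the kernel itself: all matrices are square with edge `N`, `BLOCK_SIZE` divides `N`, and the function names `mul_naive`, `mul_blocked` and `blocked_matches_naive` are hypothetical. The two outer loops stand in for work-groups, the small arrays stand in for local memory, and the load and compute phases are the two steps a kernel separates with a barrier.

```c
#include <assert.h>
#include <stddef.h>

#define N 4          /* all matrices are N x N (assumed for brevity) */
#define BLOCK_SIZE 2 /* edge of one block; must divide N */

/* Reference result: one output element per (y, x), no blocking. */
void mul_naive(const float *a, const float *b, float *c)
{
    size_t x, y, i;
    for (y = 0; y < N; y++) {
        for (x = 0; x < N; x++) {
            float sum = 0.0f;
            for (i = 0; i < N; i++)
                sum += a[y * N + i] * b[i * N + x];
            c[y * N + x] = sum;
        }
    }
}

/* Blocked product: the (by, bx) loops stand in for work-groups, (ty, tx) for
   work-items, as_blk/bs_blk for __local memory, acc for private sums. */
void mul_blocked(const float *a, const float *b, float *c)
{
    size_t by, bx, k, ty, tx, j;
    assert(N % BLOCK_SIZE == 0);
    for (by = 0; by < N; by += BLOCK_SIZE) {
        for (bx = 0; bx < N; bx += BLOCK_SIZE) {
            float acc[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};
            for (k = 0; k < N; k += BLOCK_SIZE) {
                float as_blk[BLOCK_SIZE][BLOCK_SIZE];
                float bs_blk[BLOCK_SIZE][BLOCK_SIZE];
                /* Load phase: each work-item copies one element per block. */
                for (ty = 0; ty < BLOCK_SIZE; ty++)
                    for (tx = 0; tx < BLOCK_SIZE; tx++) {
                        as_blk[ty][tx] = a[(by + ty) * N + (k + tx)];
                        bs_blk[ty][tx] = b[(k + ty) * N + (bx + tx)];
                    }
                /* Compute phase: runs after the barrier in a real kernel. */
                for (ty = 0; ty < BLOCK_SIZE; ty++)
                    for (tx = 0; tx < BLOCK_SIZE; tx++)
                        for (j = 0; j < BLOCK_SIZE; j++)
                            acc[ty][tx] += as_blk[ty][j] * bs_blk[j][tx];
            }
            for (ty = 0; ty < BLOCK_SIZE; ty++)
                for (tx = 0; tx < BLOCK_SIZE; tx++)
                    c[(by + ty) * N + (bx + tx)] = acc[ty][tx];
        }
    }
}

/* Returns 1 if both versions agree on a small integer-valued test case. */
int blocked_matches_naive(void)
{
    float a[N * N], b[N * N], c1[N * N], c2[N * N];
    size_t i;
    for (i = 0; i < N * N; i++) {
        a[i] = (float)(i + 1);
        b[i] = (float)(2 * i);
    }
    mul_naive(a, b, c1);
    mul_blocked(a, b, c2);
    for (i = 0; i < N * N; i++)
        if (c1[i] != c2[i])
            return 0;
    return 1;
}
```

The test data is integer-valued so that both summation orders give exactly the same floats; with arbitrary inputs the comparison would need a tolerance.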

Q AND A
