
Programming GPUs with SYCL

Gordon Brown – Staff Software Engineer, SYCL

C++ Edinburgh – July 2016

© 2016 Codeplay Software Ltd.2

Agenda

• Introduction to GPGPU
  • Why program GPUs?
  • CPU vs GPU architecture
  • General GPU programming tips

• SYCL for OpenCL
  • Overview
  • Features

• SYCL example
  • Vector add

© 2016 Codeplay Software Ltd.3

Introduction to GPGPU

© 2016 Codeplay Software Ltd.4

Why Program GPUs?

• Need for parallelism to gain performance
  • The “free lunch” provided by Moore’s law is over
  • Adding even more CPU cores is showing diminishing returns

• GPUs are extremely efficient for
  • Data-parallel tasks
  • Arithmetic-heavy computations

© 2016 Codeplay Software Ltd.5

CPU vs GPU

© 2016 Codeplay Software Ltd.6

CPU vs GPU

• GPU: data parallelism
  • Large number of small execution units
  • A single instruction executed across all execution units in lock-step
  • Lower power consumption
  • Higher memory bandwidth
  • Sequential memory access

• CPU: task parallelism
  • Small number of large cores
  • Separate instructions on each core, executed independently
  • Higher power consumption
  • Lower memory bandwidth
  • Random memory access

© 2016 Codeplay Software Ltd.7

Common CPU Architecture

[Diagram: several CPU cores, each with its own registers, sharing a cache that connects to RAM]

© 2016 Codeplay Software Ltd.8

Common GPU Architecture

[Diagram: multiple compute units, each containing processing elements (PE) with private memory (PM) and a shared memory, all behind a cache connected to graphics RAM]

© 2016 Codeplay Software Ltd.9

Common System Architecture

[Diagram: the CPU with its RAM and the GPU with its graphics RAM, connected over the PCI bus]

Lower bandwidth data transfer over the PCI bus

© 2016 Codeplay Software Ltd.10

GPUs Execute in Lock-step

[Diagram: a wave of 16 PEs executing the code below in lock-step]

int v = 0;
if (i < 10) {
  v = foo(i);
} else {
  v = bar(i);
}
a[i] = v;

Waves of PEs are executed in lock-step

PEs that don’t follow a branch still execute, with a mask
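To make the masking concrete, here is a small illustrative sketch, not from the slides, of what a wave effectively does when it hits the branch above: every lane steps through both sides, and the mask decides which result is kept. foo, bar and wave_step are placeholder names.

// Scalar model of one wave of 16 PEs executing the branch in lock-step.
int foo(int i) { return i * 2; }   // placeholder body
int bar(int i) { return i - 1; }   // placeholder body

void wave_step(int a[], int first) {
  for (int lane = 0; lane < 16; ++lane) {
    int i = first + lane;
    bool mask = (i < 10);                  // which lanes take the 'if' side
    int ifResult = foo(i);                 // every lane executes foo ...
    int elseResult = bar(i);               // ... and bar
    a[i] = mask ? ifResult : elseResult;   // the mask selects the value kept
  }
}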

© 2016 Codeplay Software Ltd.11

GPUs Access Memory Sequentially

[Diagram: four PEs (PE 1–PE 4) reading from a row of memory addresses 0, 1, 2, 3, …, 32, 33, 34, 35, …, 64, 65, 66, 67, …, 96, 97, 98, 99, …]

Non-coalesced memory access: neighbouring PEs read addresses that are far apart

Coalesced memory access: neighbouring PEs read consecutive addresses
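A hedged sketch, not from the slides, of how the two patterns tend to appear in kernel code; idx stands for the work-item index, and data and stride are illustrative names.

#include <cstddef>

// Coalesced: neighbouring work-items touch neighbouring elements,
// so a wave's loads can be serviced in few memory transactions.
float load_coalesced(const float *data, std::size_t idx) {
  return data[idx];
}

// Non-coalesced: neighbouring work-items touch elements a large stride apart,
// so each wave's loads hit many separate memory transactions.
float load_strided(const float *data, std::size_t idx, std::size_t stride) {
  return data[idx * stride];
}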

© 2016 Codeplay Software Ltd.12

General GPU Programming Tips

• Ensure the task is suitable
  • GPUs are most efficient for data-parallel tasks
  • The performance gained from the computation must outweigh the cost of moving the data

• Avoid branching
  • Waves of processing elements execute in lock-step
  • Both sides of a branch execute, with the inactive side masked

• Avoid non-coalesced memory access
  • GPUs access memory more efficiently when it is accessed as contiguous blocks

• Avoid expensive data movement
  • The bottleneck in GPU programming is data movement between CPU and GPU memory
  • It’s important to keep data as close to the processing as possible

© 2016 Codeplay Software Ltd.13

SYCL for OpenCL

© 2016 Codeplay Software Ltd.14

What is OpenCL?

• Allows you to write kernels that execute on accelerators

• Allows you to copy data between the host CPU and accelerators

• Supports a wide range of devices

• Comes in two components:
  • A host-side C API for enqueueing kernels and copying data
  • A device-side OpenCL C language for writing kernels
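As a hedged illustration of the device side, not taken from the slides, an OpenCL C kernel is typically carried around by the host program as a string, because it is compiled at runtime. The kernel name vec_add and the parameter names are assumptions.

// A hypothetical OpenCL C vector-add kernel held as a string, ready to be
// handed to the host API for runtime compilation.
const char *vecAddSrc = R"(
  __kernel void vec_add(__global const float *a,
                        __global const float *b,
                        __global float *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
  }
)";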

© 2016 Codeplay Software Ltd.15

Motivation of SYCL

• Make heterogeneous programming more accessible
  • Provide a foundation for efficient and portable template algorithms

• Create a C++ ecosystem for OpenCL
  • Define an open, portable standard
  • Provide the performance and portability of OpenCL
  • Base it only on standard C++

• Provide a high-level shared-source model
  • Provide a high-level abstraction over OpenCL boilerplate code
  • Allow C++ template libraries to target OpenCL
  • Allow type safety across host and device

© 2016 Codeplay Software Ltd.16

SYCL for OpenCL

A cross-platform, single-source, high-level C++ programming layer, built on top of OpenCL and based on standard C++14

© 2016 Codeplay Software Ltd.17

The SYCL Ecosystem

[Diagram: a C++ application built on C++ template libraries, which are built on SYCL for OpenCL, which is built on OpenCL, which targets CPU and GPU devices]

© 2016 Codeplay Software Ltd.18

How Does Shared Source Work? Regular C++ (Single Source)

C++ Source File → CPU Compiler → CPU Object → Linker → Executable

© 2016 Codeplay Software Ltd.19

How Does Shared Source Work? OpenCL (Separate Source)

C++ Source File → CPU Compiler → CPU Object → Linker → Executable

OpenCL C Source kept separate: OpenCL C code is compiled at runtime

© 2016 Codeplay Software Ltd.20

How Does Shared Source Work? SYCL (Shared Source)

The same C++ Source File is compiled twice:
  • SYCL Compiler → SPIR Binary
  • CPU Compiler → CPU Object
The Linker then combines the SPIR binary and the CPU object into a single Executable

© 2016 Codeplay Software Ltd.21

Separating Data & Access

[Diagram: a buffer accessed through a host accessor on the host and a device accessor on the device]

Buffers allow type-safe access across host and device

Accessors are used to detect dependencies
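A minimal sketch, not from the slides, of the buffer/accessor split using the same cl::sycl API that appears in the vector add example later; the function name, data pointer and access modes are illustrative.

#include <CL/sycl.hpp>

class copy_kernel;  // kernel name

void buffer_accessor_sketch(float *data, size_t n) {
  // The buffer owns synchronisation of 'data' between host and device.
  cl::sycl::buffer<float, 1> buf(data, cl::sycl::range<1>(n));

  cl::sycl::queue q;
  q.submit([&](cl::sycl::handler &cgh) {
    // Device accessor: requested through the handler, so the runtime knows
    // this command group reads and writes the buffer.
    auto dev = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for<copy_kernel>(cl::sycl::range<1>(n),
      [=](cl::sycl::id<1> idx) { dev[idx] = dev[idx] * 2.0f; });
  });

  // Host accessor: requested without a handler; it waits for device work on
  // this buffer to finish before giving the host access to the data.
  auto host = buf.get_access<cl::sycl::access::mode::read>();
  float first = host[0];
  (void)first;
}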

© 2016 Codeplay Software Ltd.22

Dependency Task Graphs

[Diagram: buffers A–D connected to kernels A, B and C through read and write accessors; kernel C reads buffers written by kernels A and B, so the accessors define a dependency task graph in which kernel C runs after kernels A and B]
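A hedged sketch, not from the slides, of how such a graph arises: two command groups submitted back to back, where the second reads a buffer the first writes, so the runtime orders the second kernel after the first automatically. Buffer, kernel and function names are illustrative.

#include <CL/sycl.hpp>

class kernel_a;
class kernel_b;

void task_graph_sketch(float *in, float *tmp, float *out, size_t n) {
  cl::sycl::buffer<float, 1> bufIn(in, cl::sycl::range<1>(n));
  cl::sycl::buffer<float, 1> bufTmp(tmp, cl::sycl::range<1>(n));
  cl::sycl::buffer<float, 1> bufOut(out, cl::sycl::range<1>(n));

  cl::sycl::queue q;

  // Kernel A: reads bufIn, writes bufTmp.
  q.submit([&](cl::sycl::handler &cgh) {
    auto r = bufIn.get_access<cl::sycl::access::mode::read>(cgh);
    auto w = bufTmp.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel_a>(cl::sycl::range<1>(n),
      [=](cl::sycl::id<1> i) { w[i] = r[i] + 1.0f; });
  });

  // Kernel B: reads bufTmp (written by kernel A), writes bufOut.
  // The read accessor on bufTmp makes the runtime schedule B after A.
  q.submit([&](cl::sycl::handler &cgh) {
    auto r = bufTmp.get_access<cl::sycl::access::mode::read>(cgh);
    auto w = bufOut.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel_b>(cl::sycl::range<1>(n),
      [=](cl::sycl::id<1> i) { w[i] = r[i] * 2.0f; });
  });
}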

© 2016 Codeplay Software Ltd.23

Supported Subset of C++ in Device Code

Supported features:

• Static polymorphism

• Lambdas

• Classes

• Operator overloading

• Templates

• Placement new

Non-supported features:

• Dynamic polymorphism

• Dynamic allocation

• Exception handling

• RTTI

• Static variables

• Function pointers
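A small sketch, not from the slides, showing several of the supported features (templates, classes, operator overloading, lambdas) inside device code; the names scaled, scale_kernel and scale are assumptions, and the buffer/accessor setup mirrors the vector add example later in the deck.

#include <CL/sycl.hpp>

template <typename T>
struct scaled {                                     // class + templates: supported
  T factor;
  T operator*(T v) const { return factor * v; }     // operator overloading: supported
};

template <typename T>
class scale_kernel;                                 // kernel name

template <typename T>
void scale(T *data, T factor, size_t size) {
  cl::sycl::buffer<T, 1> buf(data, cl::sycl::range<1>(size));
  cl::sycl::queue q;
  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    scaled<T> s{factor};
    cgh.parallel_for<scale_kernel<T>>(cl::sycl::range<1>(size),
      [=](cl::sycl::id<1> idx) {                    // lambda: supported
        acc[idx] = s * acc[idx];                    // no dynamic allocation, exceptions or RTTI here
      });
  });
}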

© 2016 Codeplay Software Ltd.24

Compact High-level API

[Side-by-side code comparison: OpenCL host code for a simple kernel next to the equivalent, much shorter SYCL code]
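The comparison image did not survive extraction. As a rough illustration of the OpenCL side, a minimal sketch, not taken from the slides, of the OpenCL 1.2 host boilerplate for launching one pre-written kernel is shown below; error handling is omitted, and the kernel name vec_add and all variable names are assumptions. The SYCL side of the comparison is the vector add example on the following slides.

#include <CL/cl.h>

// A minimal sketch of OpenCL 1.2 host boilerplate (error checks omitted).
void run_vec_add(const char *kernelSrc, float *a, float *b, float *out, size_t n) {
  cl_platform_id platform; cl_device_id device; cl_int err;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  size_t bytes = n * sizeof(float);
  cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, &err);
  cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, &err);
  cl_mem bufOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);

  // The OpenCL C kernel source is compiled at runtime.
  cl_program program = clCreateProgramWithSource(ctx, 1, &kernelSrc, nullptr, &err);
  clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(program, "vec_add", &err);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufOut);

  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(queue, bufOut, CL_TRUE, 0, bytes, out, 0, nullptr, nullptr);

  // Releases (clReleaseMemObject, clReleaseKernel, ...) omitted for brevity.
}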

© 2016 Codeplay Software Ltd.25

Example: Vector Add

© 2016 Codeplay Software Ltd.26

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

}

Include sycl.hpp for the whole SYCL runtime

© 2016 Codeplay Software Ltd.27

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

}

Create buffers to maintain the data across host and device

The buffers synchronise upon destruction

© 2016 Codeplay Software Ltd.28

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

cl::sycl::queue defaultQueue;

}

Create a queue to enqueue work

© 2016 Codeplay Software Ltd.29

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

cl::sycl::queue defaultQueue;

defaultQueue.submit([&] (cl::sycl::handler &cgh) {

});

}

Create a command group to define an asynchronous task

The scope of the command group is defined by a lambda

© 2016 Codeplay Software Ltd.30

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

cl::sycl::queue defaultQueue;

defaultQueue.submit([&] (cl::sycl::handler &cgh) {

auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);

auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);

auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

});

}

Create accessors to give access to the data on the device

© 2016 Codeplay Software Ltd.31

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

cl::sycl::queue defaultQueue;

defaultQueue.submit([&] (cl::sycl::handler &cgh) {

auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);

auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);

auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(size),

[=](cl::sycl::id<1> idx) {

});

});

}

Create a parallel_for to define a kernel

© 2016 Codeplay Software Ltd.32

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size) {

cl::sycl::buffer<T, 1> inputABuf(inputA, size);

cl::sycl::buffer<T, 1> inputBBuf(inputB, size);

cl::sycl::buffer<T, 1> outputBuf(output, size);

cl::sycl::queue defaultQueue;

defaultQueue.submit([&] (cl::sycl::handler &cgh) {

auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);

auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);

auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(size),

[=](cl::sycl::id<1> idx) {

outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];

});

});

}

Access the data via the accessor’s subscript operator

You must provide a name for the lambda

© 2016 Codeplay Software Ltd.33

Example: Vector Add

template <typename T>

void parallel_add(T *inputA, T *inputB, T *output, size_t size);

int main() {

constexpr size_t count = 1024; // element count; value chosen for illustration

float inputA[count] = { /* input a */ };

float inputB[count] = { /* input b */ };

float output[count] = { /* output */ };

parallel_add(inputA, inputB, output, count);

}

The result is stored in output upon returning from parallel_add

© 2016 Codeplay Software Ltd.34

Community Edition

Coming soon!

Watch this space:

http://sycl.tech/

@codeplaysoft [email protected]

Thank You