Advanced / Other Programming Models Sathish Vadhiyar

OpenCL – Command Queues, Runtime Compilation, Multiple Devices

Sources: OpenCL overview from AMD; OpenCL learning kit from AMD

Introduction

OpenCL is a programming framework for heterogeneous computing resources

Resources include CPUs, GPUs, Cell Broadband Engine, FPGAs, DSPs

Many similarities with CUDA

Command Queues

A command queue is the mechanism by which the host requests that an action be performed by the device: perform a memory transfer, begin executing a kernel, etc.

Kernels are enqueued, and dependencies between commands are satisfied using events.

A separate command queue is required for each device.

Commands within the queue can be synchronous or asynchronous.

Commands can execute in-order or out-of-order.

Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Example – Image Rotation

See slides 8 and 11-16 of lecture 5 in the OpenCL University Kit.

Synchronization

Synchronization in OpenCL

Synchronization is required if we use an out-of-order command queue or multiple command queues

Coarse synchronization granularity: per command queue.

Finer synchronization granularity: per OpenCL operation, using events.

OpenCL Command Queue Control

Command queue synchronization methods work on a per-queue basis.

Flush: clFlush(cl_command_queue)
  Sends all commands in the queue to the compute device.
  There is no guarantee that they will be complete when clFlush returns.

Finish: clFinish(cl_command_queue)
  Waits for all commands in the command queue to complete before proceeding (the host blocks on this call).

Barrier: clEnqueueBarrier(cl_command_queue)
  Enqueues a synchronization point that ensures all prior commands in a queue have completed before any further commands execute.
  Note: clEnqueueBarrier is deprecated as of OpenCL 1.2 in favor of clEnqueueBarrierWithWaitList.

OpenCL Events

The command queue synchronization functions (flush, finish, barrier) operate only at per-command-queue granularity.

OpenCL events are needed to synchronize at the granularity of individual commands (function calls).

Explicit synchronization is required for:
  Out-of-order command queues
  Multiple command queues
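The role of events in an out-of-order queue can be illustrated with a small host-only model. This is a sketch in standard C++, not OpenCL code: the `Command` struct, its `wait_list` of indices, and `run_out_of_order` are hypothetical names introduced here, and the back-to-front scan is only an assumed way of making the reordering visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of an enqueued command. wait_list holds the indices of
// earlier commands whose "events" must be complete before this one may start.
struct Command {
    std::string name;
    std::vector<std::size_t> wait_list;
};

// Simulates an out-of-order queue: any command whose wait list is satisfied
// may run. The queue is scanned from the back so that, without wait lists,
// commands would run in reverse enqueue order.
std::vector<std::string> run_out_of_order(const std::vector<Command>& queue) {
    std::vector<bool> done(queue.size(), false);
    std::vector<std::string> order;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t k = queue.size(); k-- > 0;) {
            if (done[k]) continue;
            bool ready = true;
            for (std::size_t dep : queue[k].wait_list) {
                assert(dep < queue.size());  // wait lists must name real commands
                if (!done[dep]) { ready = false; break; }
            }
            if (ready) {
                done[k] = true;              // the command's event is now complete
                order.push_back(queue[k].name);
                progressed = true;
            }
        }
    }
    return order;
}
```

For a queue of two independent writes, a kernel that waits on both, and a read that waits on the kernel, the two writes may run in either order, but the kernel always runs after both and the read always runs last. The wait lists alone enforce that ordering, which is the job events do in a real out-of-order queue.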

Using User Events

A simple example of user events being triggered and used in a command queue

//Create user event which will start the write of buf1
cl_event user_event = clCreateUserEvent(ctx, NULL);
clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
//The write of buf1 is now enqueued and waiting on user_event

X = foo(); //Lots of complicated host processing code

clSetUserEventStatus(user_event, CL_COMPLETE);
//The clEnqueueWriteBuffer to buf1 can now proceed, using the output of foo()

Multiple Devices

Multiple Devices

OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP, etc.).

OpenCL does not assume that data can be transferred directly between devices, so commands exist only to move data from the host to a device or from a device to the host.

Copying from one device to another requires an intermediate transfer to the host.

OpenCL events are used to synchronize execution on different devices within a context

Compiling Code for Multiple Devices

Charm++

Source: Tutorial slides from the Parallel Programming Lab, UIUC. Authors: Laxmikant Kale, Eric Bohm

Virtualization: Object-based Decomposition

In MPI, the number of processes is typically equal to the number of processors

Virtualization: divide the computation into a large number of pieces.
  The number of pieces is independent of the number of processors.
  It is typically larger than the number of processors.

Let the system map objects to processors

The Charm++ Model

Parallel objects (chares) communicate via asynchronous method invocations (entry methods).

The runtime system maps chares onto processors and schedules execution of entry methods.

Chares can be created dynamically on any available processor and can be accessed from remote processors.

Processor Virtualization

[Figure: user view versus system implementation]

The user is concerned only with the interaction between objects (virtual processors, VPs).

Adaptive Overlap via Data-driven Objects

Problem: processors wait too long at “receive” statements.

With virtualization, you get data-driven execution:
  There are multiple entities (objects, threads) on each processor.
  No single object or thread holds up the processor.
  Each one is “continued” when its data arrives.

So: automatic and adaptive overlap of computation and communication is achieved.
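The benefit of data-driven execution can be made concrete with a small timing model. This is a sketch under simple assumptions: one processor, one message per object, and a fixed compute cost once the message arrives. The `Task` struct and both function names are hypothetical, not part of Charm++.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of one object's work: its data arrives at time `arrival`
// and it then needs `cost` time units of computation.
struct Task {
    int arrival;
    int cost;
};

// Blocking style: objects are served in a fixed program order, so the
// processor idles at each "receive" until that object's data has arrived.
int finish_time(const std::vector<Task>& tasks) {
    int now = 0;
    for (const Task& t : tasks) {
        assert(t.arrival >= 0 && t.cost >= 0);
        now = std::max(now, t.arrival);  // wait for the data if it is not here yet
        now += t.cost;
    }
    return now;
}

// Data-driven style: the scheduler continues whichever object's data arrived
// first, modelled here by serving the tasks in arrival order.
int data_driven_finish_time(std::vector<Task> tasks) {
    std::stable_sort(tasks.begin(), tasks.end(),
                     [](const Task& a, const Task& b) { return a.arrival < b.arrival; });
    return finish_time(tasks);
}
```

With three objects whose data arrives at times 5, 1 and 3 and which each compute for 2 units, the fixed order finishes at time 11, while the data-driven order finishes at time 7. The work is the same; only the waiting has been overlapped with useful computation.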

Load Balancing

Using Dynamic Mapping to Processors

Migration
  Charm objects can migrate from one processor to another.
  Migration creates a new object on the destination processor while destroying the original.

Use migration for dynamic (and static, initial) load balancing.

Measurement-based, predictive strategies: based on object communication patterns and computational loads.
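A measurement-based strategy can be sketched as follows. This is one common greedy heuristic, shown as an illustration rather than as the strategy Charm++ itself uses. It considers only measured computational load and ignores communication patterns; `greedy_balance` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy sketch: visit objects from heaviest to lightest measured load and
// place each on the processor that is currently least loaded.
// Returns assignment[i] = processor chosen for object i.
std::vector<int> greedy_balance(const std::vector<int>& load, int num_procs) {
    assert(num_procs > 0);
    std::vector<std::size_t> order(load.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&load](std::size_t a, std::size_t b) { return load[a] > load[b]; });

    std::vector<int> proc_load(static_cast<std::size_t>(num_procs), 0);
    std::vector<int> assignment(load.size(), -1);
    for (std::size_t obj : order) {
        // Least-loaded processor so far (ties go to the lowest index).
        int target = static_cast<int>(
            std::min_element(proc_load.begin(), proc_load.end()) - proc_load.begin());
        assignment[obj] = target;
        proc_load[static_cast<std::size_t>(target)] += load[obj];
    }
    return assignment;
}
```

For measured loads {5, 1, 1, 1, 1, 1} on two processors, the heavy object is placed alone on processor 0 and the five light objects on processor 1, giving each processor a total load of 5. In a migratable-object system, the difference between the old and new assignments tells the runtime which objects to migrate.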

Summary: Primary Advantages

Automatic mapping
Migration and load balancing
Asynchronous and message-driven communications
Computation-communication overlap

How does it look?

Asynchronous Hello World

Program’s asynchronous flow:
  Mainchare sends a message to the Hello object
  Hello object prints “Hello World!”
  Hello object sends a message back to the mainchare
  Mainchare quits the application
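The flow above can be mimicked in plain C++ with a single message queue. This is a sketch of the message-driven idea only, not Charm++ code: `Scheduler`, `hello_flow`, and the closures standing in for entry methods are hypothetical, and a real program would declare its chares and entry methods through the Charm++ interface file and proxies instead.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime scheduler: a queue of pending
// entry-method invocations ("messages") that are delivered one at a time.
struct Scheduler {
    std::deque<std::function<void()>> messages;
    std::vector<std::string> log;
    bool quit = false;

    // Asynchronous send: the invocation is queued and the caller continues.
    void send(std::function<void()> m) {
        assert(static_cast<bool>(m));
        messages.push_back(std::move(m));
    }

    void run() {
        while (!quit && !messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

std::vector<std::string> hello_flow() {
    Scheduler s;
    // "Entry methods" modelled as closures.
    std::function<void()> main_done = [&s] {
        s.log.push_back("Main: done, exiting");
        s.quit = true;                      // mainchare quits the application
    };
    std::function<void()> hello_say_hi = [&s, &main_done] {
        s.log.push_back("Hello World!");    // Hello object prints
        s.send(main_done);                  // and messages the mainchare back
    };

    s.send(hello_say_hi);                   // mainchare sends, does not wait
    s.log.push_back("Main: sent sayHi, returning to scheduler");
    s.run();
    return s.log;
}
```

The log returned by `hello_flow()` has three entries, and the mainchare's "sent" entry comes before "Hello World!". The send returns immediately, and the Hello object only runs once the scheduler delivers the message.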

Code and Workflow

Hello World: Array Version

Main Code

Array Code

Result

$ ./charmrun +p3 ./hello 10
Running "Hello World" with 10 elements using 3 processors.
"Hello" from Hello chare #0 on processor 0 (told by -1)
"Hello" from Hello chare #1 on processor 0 (told by 0)
"Hello" from Hello chare #2 on processor 0 (told by 1)
"Hello" from Hello chare #3 on processor 0 (told by 2)
"Hello" from Hello chare #4 on processor 1 (told by 3)
"Hello" from Hello chare #5 on processor 1 (told by 4)
"Hello" from Hello chare #6 on processor 1 (told by 5)
"Hello" from Hello chare #7 on processor 2 (told by 6)
"Hello" from Hello chare #8 on processor 2 (told by 7)
"Hello" from Hello chare #9 on processor 2 (told by 8)
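The placement in this run (chares 0-3 on processor 0, 4-6 on processor 1, 7-9 on processor 2) is consistent with a contiguous block mapping in which the leftover elements go to the lowest-numbered processors. The sketch below reproduces that placement. It is inferred from this one output and is an assumption, not a statement of how the Charm++ runtime maps array elements in general; `block_map` is a hypothetical helper.

```cpp
#include <cassert>

// Block mapping sketch: num_elements indices are split into num_procs
// contiguous blocks; the first (num_elements % num_procs) processors get one
// extra element each. Returns the processor that owns `index`.
int block_map(int index, int num_elements, int num_procs) {
    assert(num_procs > 0 && index >= 0 && index < num_elements);
    int base = num_elements / num_procs;   // minimum block size
    int extra = num_elements % num_procs;  // number of blocks with base + 1 elements
    int boundary = extra * (base + 1);     // first index owned by a "small" block
    if (index < boundary) {
        return index / (base + 1);
    }
    return extra + (index - boundary) / base;
}
```

For 10 elements on 3 processors, base is 3 and extra is 1, so processor 0 owns four elements and the other two own three each, matching the output above.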
