Making OpenCL™ Simple with Haskell - AMD


  • Making OpenCL™ Simple with Haskell

    Benedict R. Gaster

    January, 2011

  • 2 | Making OpenCL™ Simple | January, 2011 | Public

    Attribution and WARNING

    • The ideas and work presented here are in collaboration with:

    – Garrett Morris (AMD intern 2010 & PhD student, Portland State)

    Garrett is the Haskell expert and knows a lot more than I do about transforming Monads!

  • 3 | Making OpenCL™ Simple | January, 2011 | Public

    AGENDA

    Motivation

    Whistle stop introduction to OpenCL

    Bringing OpenCL to Haskell

    Lifting to something more in the spirit of Haskell

    Quasiquotation

    Issues we face using Haskell at AMD

    Questions

  • 4 | Making OpenCL™ Simple | January, 2011 | Public

    Motivation

  • 5 | Making OpenCL™ Simple | January, 2011 | Public

    OPENCL™ PROGRAM STRUCTURE

    [Figure: host C/C++ code runs on the CPU (platform and runtime APIs); OpenCL C device code runs on the device]

  • 7 | Making OpenCL™ Simple | January, 2011 | Public

    HELLO WORLD OPENCL™ C SOURCE

    __constant char hw[] = "Hello World\n";

    __kernel void hello(__global char * out) {
        size_t tid = get_global_id(0);
        out[tid] = hw[tid];
    }

    • This is a separate source file (or string)

    • Cannot directly access host data

    • Compiled at runtime
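
The host program in this deck passes a `program_source` string to `clCreateProgramWithSource` but never shows how that string is defined. A minimal sketch, assuming the kernel is embedded in the host program as a C string literal; the helper `kernel_source_length` is hypothetical and exists only for illustration:

```c
#include <stddef.h>
#include <string.h>

/* The OpenCL C kernel from the slide, held as an ordinary host string.
 * The OpenCL runtime compiles this text when the program is built,
 * so the host compiler never treats it as code. */
static const char *program_source =
    "__constant char hw[] = \"Hello World\\n\";\n"
    "__kernel void hello(__global char * out) {\n"
    "    size_t tid = get_global_id(0);\n"
    "    out[tid] = hw[tid];\n"
    "}\n";

/* Hypothetical helper: length of the source text in bytes, excluding the NUL. */
static size_t kernel_source_length(void) {
    return strlen(program_source);
}
```

Because the kernel is just text to the host compiler, mistakes in it only surface when the program is built at runtime.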

  • 8 | Making OpenCL™ Simple | January, 2011 | Public

    HELLO WORLD - HOST PROGRAM

    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // get the list of GPU devices associated with context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory object
    memobjs[0] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_char) * strlen("Hello World"), NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel
    kernel = clCreateKernel(program, "hello", NULL);

    // set the args values (argument size comes before the value pointer)
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);

    // set work-item dimensions
    global_work_size[0] = strlen("Hello World");

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

    // read output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[0], CL_TRUE, 0,
                              strlen("Hello World") * sizeof(cl_char), dst, 0, NULL, NULL);

  • 9 | Making OpenCL™ Simple | January, 2011 | Public

    HELLO WORLD - HOST PROGRAM

    Define platform and queues

        // create the OpenCL context on a GPU device
        context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

        // get the list of GPU devices associated with context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
        devices = malloc(cb);
        clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

        // create a command-queue
        cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    Define memory objects

        // allocate the buffer memory object
        memobjs[0] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                    sizeof(cl_char) * strlen("Hello World"), NULL, NULL);

    Create the program

        program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

    Build the program

        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    Create and set up the kernel

        kernel = clCreateKernel(program, "hello", NULL);

        // set the args values (argument size comes before the value pointer)
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);

    Execute the kernel

        // set work-item dimensions
        global_work_size[0] = strlen("Hello World");

        err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

    Read results on the host

        err = clEnqueueReadBuffer(cmd_queue, memobjs[0], CL_TRUE, 0,
                                  strlen("Hello World") * sizeof(cl_char), dst, 0, NULL, NULL);

  • 10 | Making OpenCL™ Simple | January, 2011 | Public

    USING HASKELL, OUR GOAL IS TO WRITE THIS

    import Language.OpenCL.Module

    hstr = "Hello world\n"
    hlen = length hstr + 1

    prog = initCL [$cl|
      __constant char hw[] = $hstr;

      __kernel void hello(__global char * out) {
        size_t tid = get_global_id(0);
        out[tid] = hw[tid];
      }
    |]

    main :: IO ()
    main = withNew prog $
      using (bufferWithFlags hlen [WriteOnly]) $ \b ->
        do [k]

  • 13 | Making OpenCL™ Simple | January, 2011 | Public

    LEARN FROM COMMON USES

    • In OpenCL we generally see:

    – Pick a single device (often GPU or CL_DEVICE_TYPE_DEFAULT)

    – All “kernels” in the cl_program object are used in the application

    • In CUDA the default for runtime mode is:

    – Pick a single device (always GPU)

    – All “kernels” in scope are exported to the host application for the given translation unit, i.e. calling a kernel is syntactic and behaves much like static linkage.

  • 14 | Making OpenCL™ Simple | January, 2011 | Public

    Whistle stop introduction to OpenCL™

  • 15 | Making OpenCL™ Simple | January, 2011 | Public

    IT’S A HETEROGENEOUS WORLD

    • A modern platform includes:

    – One or more CPUs

    – One or more GPUs

    – And more

    [Figure: platform block diagram showing two CPUs, a Fusion GPU, the north bridge, system memory, and a discrete GPU with its graphics memory on PCIe]

  • 16 | Making OpenCL™ Simple | January, 2011 | Public

    OPENCL™ PLATFORM MODEL

    • One Host + one or more Compute Devices

    – Each Compute Device is composed of one or more Compute Units

    – Each Compute Unit is further divided into one or more Processing Elements
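
The hierarchy above can be pictured as nested data. This is a rough sketch only: the struct and function names are hypothetical illustrations and are not part of the OpenCL API, which exposes this information through device queries instead.

```c
#include <stddef.h>

/* Hypothetical types mirroring the platform hierarchy (not OpenCL API types). */
typedef struct {
    size_t processing_elements;   /* processing elements in this compute unit */
} ComputeUnit;

typedef struct {
    const ComputeUnit *units;     /* compute units on this device */
    size_t unit_count;
} ComputeDevice;

typedef struct {
    const ComputeDevice *devices; /* compute devices attached to the one host */
    size_t device_count;
} Platform;

/* Total processing elements reachable from the host. */
static size_t total_processing_elements(const Platform *p) {
    size_t total = 0;
    for (size_t d = 0; d < p->device_count; ++d)
        for (size_t u = 0; u < p->devices[d].unit_count; ++u)
            total += p->devices[d].units[u].processing_elements;
    return total;
}
```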

  • 17 | Making OpenCL™ Simple | January, 2011 | Public

    OPENCL™ EXECUTION MODEL

    • An OpenCL application runs on a host, which submits work to the compute devices

    – Work item: the basic unit of work on an OpenCL device

    – Kernel: the code for a work item; basically a C function

    – Program: a collection of kernels and other functions (analogous to a dynamic library)

    – Context: the environment within which work-items execute; includes devices and their memories and command queues

    • Applications queue kernel execution instances

    – Queued in-order … one queue to a device

    – Executed in-order or out-of-order

    [Figure: a context containing a GPU and a CPU, each with its own queue]
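
To make "queued in-order" concrete, here is a small host-only sketch of a per-device command queue whose commands run in submission order. Everything in it (`Command`, `CommandQueue`, `enqueue`, `drain`, and the two example commands) is hypothetical and stands in for the real runtime; it does not model out-of-order execution.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 16

/* A queued command: a function plus its argument.
 * Hypothetical stand-in for an enqueued kernel launch. */
typedef struct {
    void (*run)(void *arg);
    void *arg;
} Command;

/* Fixed-size FIFO, one per device in this sketch. */
typedef struct {
    Command items[QUEUE_CAPACITY];
    size_t head;   /* index of the next command to execute */
    size_t count;  /* number of commands waiting */
} CommandQueue;

/* Submit a command; returns 0 on success, -1 if the queue is full. */
static int enqueue(CommandQueue *q, void (*run)(void *), void *arg) {
    assert(q != NULL && run != NULL);
    if (q->count == QUEUE_CAPACITY) return -1;
    size_t tail = (q->head + q->count) % QUEUE_CAPACITY;
    q->items[tail].run = run;
    q->items[tail].arg = arg;
    q->count++;
    return 0;
}

/* Execute every waiting command in submission order. */
static void drain(CommandQueue *q) {
    while (q->count > 0) {
        Command c = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        c.run(c.arg);
    }
}

/* Example commands whose combined result depends on execution order. */
static void add_one(void *arg)   { *(int *)arg += 1; }
static void times_ten(void *arg) { *(int *)arg *= 10; }
```

Enqueuing `add_one` and then `times_ten` on a zero-initialised `int` leaves it untouched until `drain` is called; afterwards the order of execution is visible in the result.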

  • 18 | Making OpenCL™ Simple | January, 2011 | Public

    THE BIG IDEA BEHIND OPENCL™

    OpenCL execution model …

    – Define N-dimensional computation domain

    – Execute a kernel at each point in computation domain

    void
    trad_mul(int n,
             const float *a,
             const float *b,
             float *c)
    {
        int i;
        for (i=0; i
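
The listing above is cut off mid-loop in this copy. Below is a sketch of the contrast the slide sets up, assuming the loop multiplies `a` and `b` element-wise into `c` (inferred from the function's name and parameters, not from the source). `dp_mul_item` and `run_1d` are hypothetical names: the first is written the way a kernel would be, handling one index, and the second stands in for the runtime by invoking it at every point of a 1-D domain.

```c
#include <stddef.h>

/* Traditional version: one call walks the whole array.
 * Loop body is an assumption (element-wise multiply); the slide text is truncated. */
static void trad_mul(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Kernel-style version: handles exactly one point of the domain.
 * `id` plays the role of get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *c) {
    c[id] = a[id] * b[id];
}

/* Stand-in for the runtime: run the kernel once per point of a 1-D domain.
 * A real device would run these instances in parallel, in no fixed order. */
static void run_1d(size_t global_size, const float *a, const float *b, float *c) {
    for (size_t id = 0; id < global_size; id++)
        dp_mul_item(id, a, b, c);
}
```

The loop moves out of the function and into the runtime, which is what lets the device decide how many instances to run at once.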
