On OpenCL and Chaotic Phenomena

Department of Computer Science & Engineering, University of Colorado Denver

By Ali Alkhathlan, Ali Alsaadi, and Mohamed Khalifa

Goal of the project:

The project aims to utilize GPGPU through the use of OpenCL, an open standard for computing on heterogeneous platforms, including CPUs and GPUs,

◦ for the computation involved in chaotic phenomena: the Mandelbrot Set and the bifurcation diagram of the logistic map.

The performance provided by the GPU through OpenCL will be compared to CPU performance through OpenCL and to plain C++.

Outline:

Introduction. Background. Implementation. Methodology, Results, and Analysis. Conclusion.


Introduction.

Introduction of the project

Problem. Objectives. Approach.

Problem:

The investigation of chaotic phenomena requires heavy computation.
◦ However, many chaotic phenomena exhibit large amounts of data parallelism,
◦ in that the same computation is performed over and over on differing inputs.

The problem to be tackled by this project is
o the usage of GPGPU, through OpenCL, to perform the computations required to investigate chaotic phenomena,
o in particular, the Mandelbrot Set and the bifurcation diagram of the logistic map.

Objectives:

Successful implementation of algorithms for the computation of the Mandelbrot Set and the bifurcation diagram of the logistic map, both in OpenCL and in plain C++.

Comparison of performance between OpenCL running on a GPU, OpenCL running on a CPU, and as control, a plain serial C++ implementation of the above mentioned chaotic phenomena.

Analysis of the benefits and complications derived from the usage of OpenCL for the computation of chaotic phenomena.

Approach:

The investigation of the effectiveness of OpenCL for the computation of chaotic phenomena:

◦ It will be performed by comparing implementations of the algorithms for the computation of the Mandelbrot Set as well as the bifurcation diagram of the logistic map:

first in C++, then in OpenCL on the GPU and on the CPU.


Background

Key Concepts:

GPGPU:

It is the usage of graphics processing units for non-graphics-related computation.

Graphics processing units allow for massive parallelism due to their architecture.

GPU Architecture:
- The architecture of graphics processing units is oriented towards performing tasks that involve data parallelism.

http://pds.ucdenver.edu/index.php?p=video&c=tech&a=t&i=2

A GPU has up to hundreds of cores (as compared to CPUs, which have 8-16 at the most).

Each of those cores is capable of executing dozens of instruction streams at the same time.

Important Information:

The general flow of computation on a GPU starts with the loading of data onto the GPU.
◦ This is often one of the most expensive operations, since the data has to travel through the bus.
• In graphics, this step is the transfer of graphics primitives, such as an image to be rendered.
• In GPGPU, this is the transfer of the data to be operated on.
• In GPGPU, a shader is called a kernel instead, to reflect the more general scope.
• Afterwards, data is transferred back to main memory, where the CPU can operate on it once again.

Chaotic Phenomena:

Mandelbrot Set.
The iteration is $z_{n+1} = z_n^2 + c$ with $z_0 = c$, where c is the original point.
It is often visualized by coloring the complex plane according to the number of iterations it takes for a point to escape the circle of radius 2.
Once $|z| > 2$, the orbit is guaranteed to tend to infinity; i.e., once z exits a circle of radius 2 centered around the origin.

Bifurcation diagram of the logistic map.
It is a plot of the long-term iterates of the logistic map, with r varying.
Defined by the mapping $x_{n+1} = r x_n (1 - x_n)$, where $r > 0$ and $0 \le x \le 1$.

Both are ideal candidates for GPGPU, as their computation involves the repeated application of the same map to many independent inputs.

http://www.rationalsys.com/robertpirsig.html
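The two maps above can be written down almost verbatim. The following is a minimal sketch, not the project's code: the function names `escapeTime` and `logisticStep` are hypothetical, and `std::complex` is used here only to keep the code close to the formulas.

```cpp
#include <complex>

// Counts iterations of z -> z*z + c, starting from z = c, until |z| reaches 2.
// The count is capped at maxIter; points that never escape return maxIter.
int escapeTime(std::complex<float> c, int maxIter) {
    std::complex<float> z = c;
    int n = 0;
    while (n < maxIter && std::norm(z) < 4.0f) {  // std::norm(z) is |z|^2
        z = z * z + c;
        ++n;
    }
    return n;
}

// One application of the logistic map x -> r*x*(1 - x).
float logisticStep(float r, float x) {
    return r * x * (1.0f - x);
}
```

Each call depends only on its own input (c, or r and x), which is what makes one work-item per point a natural fit.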

Is the computation of the Mandelbrot Set and the logistic map highly amenable to parallelization?

Figures: Mandelbrot Set; bifurcation diagram of the logistic map

OpenCL:

What is it?

Open Computing Language is a language and a framework for:
- Developing and executing programs over heterogeneous devices and platforms.
- For example: CPUs and GPUs.

It includes:
- A language (based on C99) for writing kernels.
- APIs to define and control the platforms.

All major hardware manufacturers support OpenCL, including Nvidia, Intel [6], and AMD/ATI [2].


OpenCL:


Choosing Devices

Hardware Overview:

The figure shows a simplified block diagram of a generalized GPU compute device.

The ATI Stream Computing Implementation of OpenCL:

This figure illustrates the relationship of the ATI Stream Computing components.

The ATI Stream Computing Implementation of OpenCL:

◦ GPU compute devices can execute non-graphics functions by using kernels.

◦ Each instance of a kernel running on a compute unit is called a work-item.

◦ All the work-items are scheduled onto a group of stream cores.

◦ OpenCL maps the total number of work-items to be launched onto an n-dimensional grid.

◦ The developer can specify how to divide these items into work-groups.

◦ There is an integer number of wavefronts in each work-group.
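The grouping described above can be pictured for a one-dimensional grid with a little index arithmetic. This is only a sketch under stated assumptions: `WorkItemPlacement` and `place` are hypothetical names, the work-group and wavefront sizes are passed in by the caller, and consecutive local IDs are assumed to share a wavefront (the real assignment is up to the hardware).

```cpp
#include <cstddef>

// Hypothetical record of where a work-item with a 1-D global ID lands.
struct WorkItemPlacement {
    std::size_t group;      // index of the work-group
    std::size_t local;      // index of the item within its work-group
    std::size_t wavefront;  // index of the wavefront within the work-group
};

// Assumes work-groups of localSize items, split into wavefronts of
// wavefrontSize consecutive items (an assumption made for illustration).
WorkItemPlacement place(std::size_t globalId,
                        std::size_t localSize,
                        std::size_t wavefrontSize) {
    WorkItemPlacement p;
    p.group = globalId / localSize;
    p.local = globalId % localSize;
    p.wavefront = p.local / wavefrontSize;
    return p;
}
```

For instance, with work-groups of 256 items and wavefronts of 64 items, global ID 1000 falls in work-group 3, at local index 232, in that group's wavefront 3.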

The ATI Stream Computing Implementation of OpenCL:

Figure: Work-Item Grouping Into Work-Groups and Wavefronts


Global and local dimensions


Synchronization within work-items

OpenCL Memory Model

Work-Item Processing:

◦ All stream cores within a compute unit execute the same instruction for each cycle.

◦ A work-item can issue one VLIW instruction per clock cycle.

◦ To hide latencies due to memory accesses and processing element operations, up to four work-items from the same wavefront are pipelined on the same stream core.

◦ Compute units operate independently of each other, so it is possible for each compute unit to execute different instructions.

Flow Control:

Flow control, for example branching, is done by combining all necessary paths as a wavefront. If work-items within a wavefront diverge, all paths are executed serially.

Masking of wavefronts is effected by constructs such as:

    if (x) {
        // items within these braces = A
        ...
    } else {
        // items within these braces = B
        ...
    }

- The wavefront mask is set true for lanes (elements/items) in which x is true, and A is executed.
- The mask is then inverted, and B is executed.
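The masking scheme can be imitated on the CPU to see why divergence costs time. The sketch below is a hypothetical simulation, not hardware behavior in detail: `runDivergentBranch` is an invented name, branch A (doubling) and branch B (negation) are arbitrary stand-ins, and skipping a pass when no lane needs it is an assumption about the scheduler.

```cpp
#include <cstddef>
#include <vector>

// Simulates one wavefront executing `if (x) A else B` over all lanes.
// A and B are run one after the other; the mask decides which lanes are
// updated in each pass. Returns the number of passes that were executed.
int runDivergentBranch(const std::vector<bool>& x, std::vector<int>& data) {
    bool anyTrue = false;
    bool anyFalse = false;
    for (std::size_t lane = 0; lane < x.size(); ++lane) {
        if (x[lane]) anyTrue = true; else anyFalse = true;
    }

    int passes = 0;
    if (anyTrue) {   // mask = x: execute A on the lanes where x is true
        for (std::size_t lane = 0; lane < x.size(); ++lane) {
            if (x[lane]) data[lane] *= 2;
        }
        ++passes;
    }
    if (anyFalse) {  // mask inverted: execute B on the remaining lanes
        for (std::size_t lane = 0; lane < x.size(); ++lane) {
            if (!x[lane]) data[lane] = -data[lane];
        }
        ++passes;
    }
    return passes;
}
```

A wavefront whose lanes all agree on x needs one pass; a divergent one needs two, so its run time is roughly the sum of both branches.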

Compute Kernel:

A kernel is a small, user-developed program that is run repeatedly on a stream of data.

There are multiple kernel types: vertex, pixel, geometry, domain, hull, and now compute.

Compute kernel: a specific type of kernel that is not part of the traditional graphics pipeline.

Compute Kernel:

Before the development of compute kernels, pixel shaders were responsible for non-graphics computing.

However, new hardware supports compute kernels, which are better suited for non-graphics computations (applications).

The compute kernel type can also be used for graphics.

Wavefronts and Work-groups

Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity.

A wavefront executes a single instruction over all of its work-items in parallel. It is the lowest level that flow control can affect.

This means that if two work-items inside of a wavefront take divergent paths of flow control, all work-items in the wavefront go through both paths of flow control.

Work-groups are composed of wavefronts. Best performance is attained when the group size is an integer multiple of the wavefront size.
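One way to see why a multiple of the wavefront size is preferred is to count the lanes that would sit idle otherwise. The helpers below are a hypothetical sketch (the names `roundUpToMultiple` and `idleLanes` are not from the slides), assuming a partially filled wavefront still occupies a full wavefront's worth of lanes.

```cpp
#include <cstddef>

// Smallest multiple of `multiple` that is >= n (multiple must be > 0).
std::size_t roundUpToMultiple(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Lanes left without a work-item when a work-group of groupSize items is
// packed into whole wavefronts of wavefrontSize lanes.
std::size_t idleLanes(std::size_t groupSize, std::size_t wavefrontSize) {
    return roundUpToMultiple(groupSize, wavefrontSize) - groupSize;
}
```

With a wavefront of 64 lanes, a work-group of 100 items would leave 28 lanes idle, while a work-group of 256 items leaves none.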

Memory Architecture and Access:

OpenCL has four memory domains: private, local, global, and constant.

The AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory.

Memory Architecture and Access:

private memory
- specific to a work-item; it is not visible to other work-items.

local memory
- specific to a work-group; accessible only by work-items belonging to that work-group.

global memory
- accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).

constant memory
- read-only region for host-allocated and -initialized objects that are not changed during kernel execution.

Memory Architecture and Access:

host (CPU) memory
- host-accessible region for an application's data structures and program data.

PCIe memory
- part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device.

◦ Modifying this memory requires synchronization between the GPU compute device and the CPU.


Interrelationship of the memory domains:

Interrelationship of the memories, cont.

• Copies occur from host memory to PCIe memory and from PCIe memory to the GPU compute device.
• Memory Access
  • Local memory is faster than global memory, and global memory is faster than PCIe memory.
• Global Buffer
  • It permits applications to read from and write to arbitrary locations in memory.
• Image Read/Write
  • Image reads are cached through the texture system.
  • A read is done by addressing the desired location in input memory using the fetch unit.
• Memory Load/Store
  • Only constants (read-only buffers) are cached.
  • Each work-item can write to an arbitrary location within the global buffer.
• Communication between Host and GPU
  • PCI Express bus
  • Command processor or processor API calls
  • DMA transfer

The figure illustrates the standard dataflow between host (CPU) and GPU.

How to copy data? There are two ways to copy data from the host to the GPU compute device memory:
• Implicitly: by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
• Explicitly: through clEnqueueReadBuffer and clEnqueueWriteBuffer (or clEnqueueReadImage and clEnqueueWriteImage).

Global Memory Optimization

Figure: block diagram of the GPU memory system. Up arrows are read paths; down arrows are write paths. WC is the write cache.

Global Memory Optimization cont.

The GPU memory diagram consists of multiple compute units, each containing:
◦ 32 KB of local memory
◦ L1 cache
◦ Registers
◦ 16 processing elements, each a five-way VLIW processor

• The L1 cache is 8 KB per compute unit, i.e. 160 KB for the 20 compute units of the ATI Radeon.
• L1 cache bandwidth on the ATI Radeon is about one terabyte per second.
• Multiple compute units share an L2 cache, with a size of 512 KB on the ATI Radeon.
• Bandwidth between the L1 caches and the shared L2 cache is 435 GB/s.

ATI Radeon HD 5870
• The ATI Radeon™ HD 5870 GPU has eight memory controllers connected to multiple banks of GDDR5 memory.
• The memory clock speed is 1200 MHz, with a data rate of 4800 Mb/s per pin.
• Peak bandwidth = (8 memory controllers) * (4800 Mb/s per pin) * (32 pins) * (1 B / 8 b) = 154 GB/s


Comparing Local, Global and Single Cache Miss Rate

Global Miss Rate and Local Miss Rate

• The miss rate decreases as the cache size increases.
• Up to the L1 cache level, it decreases by up to 10%.
• At L2, the global miss rate decreases by a further 10%.
• It is similar to the single-cache miss rate at the level 2 cache.
• L2 is not tied to the CPU clock cycle; it only affects the miss penalty of the first-level cache.
• For L2, the global miss rate should be considered.
• The local miss rate is not a good measure of the second-level cache.
• The local miss rate is a function of the first-level cache, so it varies if the first-level cache varies.

Single Miss Rate

It shows roughly the same variation at the level 1 and level 2 caches.

For the single-cache miss rate, the variation remains the same across both L1 and L2.

As the cache size increases, the single-cache miss rate decreases.

GPU Compute Device Scheduling:

◦ GPU compute devices are very efficient at parallelizing large numbers of work-items in a manner transparent to the application.

◦ Each GPU compute device uses the large number of wavefronts to hide memory access latencies by having the resource scheduler switch the active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete.

◦ Hiding memory access latencies requires that each work-item contain a large number of ALU operations per memory load/store.

GPU Compute Device Scheduling:

Figure: Simplified Execution Of Work-Items On A Single Stream Core


Implementation.

Implementation

Comparison will be done pairwise between:

Single-threaded C++ implementation

OpenCL backed by a CPU driver: Intel OpenCL driver

OpenCL backed by a GPU driver: AMD/ATI OpenCL driver (APP)


Mandelbrot Set

Implementation – Mandelbrot Set

The 2-dimensional region of the complex plane will be divided into a 1024x1024 grid.

Each cell of the grid corresponds to a pixel in the visualization.

The Mandelbrot map will be iterated up to 1024 times, or until the pixel escapes.

Implementation – Mandelbrot Set

The iterations will be implemented as:
◦ A simple for-loop for the C++ implementation
◦ An OpenCL kernel for the OpenCL implementation

Each pixel will be iterated for up to 1024 iterations or until it escapes the circle of radius 2.

Afterwards, the pixels will be colored according to the number of iterations.

Implementation – Mandelbrot, C++:

    #include <cmath>     // acos, cos
    #include <ctime>     // clock, CLOCKS_PER_SEC
    #include <fstream>   // ofstream
    using namespace std;

    int main() {
        // left end of the x-axis
        const float xL[] = {-2, 0.3f, -0.333939f, -0.44545f, -0.4222f};
        // right end of the x-axis
        const float xR[] = {1, 0.4f, -0.22282f, -0.11212f, -0.31111f};
        // left (top) end of the y-axis
        const float yL[] = {-1, 0.3f, -0.67946f, -0.81202f, -0.7076431f};
        // right (bottom) end of the y-axis
        const float yR[] = {1, 0.4f, -0.54478f, -0.40979f, -0.5729629f};
        // number of sets
        const int nSets = 5;
        // maximum number of iterations
        const int maxIter = 1024;
        // PI
        const float PI = 2*acos(0.0f);
        // matrix containing number of iterations
        int *mat = NULL;
        // number of grid cells along each axis of the matrix
        const int nEl = 1024;
        // size of matrix in bytes
        size_t datasize = sizeof(int)*(nEl*nEl);
        // allocate the matrix before it is written to
        mat = new int[nEl*nEl];

Implementation – Mandelbrot, C++:

        // timings file
        ofstream timef("timings.txt", fstream::app);
        for (int setN = 0; setN < nSets; ++setN) {
            timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
                  << " ; y = " << yL[setN] << ":" << yR[setN] << endl;
            const int N_TIMES = 10;
            clock_t start = clock();

            for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
                // perform calculation
                for (int i = 0; i < nEl; ++i) {
                    for (int j = 0; j < nEl; ++j) {
                        int idx = nEl*i + j;
                        float x0 = xL[setN] + (xR[setN] - xL[setN])*j/nEl;
                        float y0 = yL[setN] + (yR[setN] - yL[setN])*i/nEl;

                        float x = x0;
                        float y = y0;

                        int nIter = 0;
                        for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
                            float x_ = x*x - y*y + x0;
                            y = 2*x*y + y0;
                            x = x_;
                        }

                        mat[idx] = nIter;
                    }
                }
            }

Implementation – Mandelbrot.cl:

    // grid width; must match nEl in the host code
    #define N_EL 1024

    // calculates the mandelbrot set
    __kernel void mandelbrot(__global int* mat, float xL, float xR, float yL, float yR, int maxIter) {
        int idx = get_global_id(0);
        int i = idx / N_EL;
        int j = idx % N_EL;

        // initial x and y
        float x0 = xL + (xR - xL)*j/N_EL;
        float y0 = yL + (yR - yL)*i/N_EL;
        float x = x0;
        float y = y0;
        int nIter = 0;
        // iterate until escape or maximum iterations
        for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
            float x_ = x*x - y*y + x0;
            y = 2*x*y + y0;
            x = x_;
        }
        mat[idx] = nIter;
    }

Implementation – Mandelbrot, OpenCL:

    // create a context on the chosen platform
    cl_context_properties cprops[3] =
        {CL_CONTEXT_PLATFORM, (cl_context_properties)(plat)(), 0};

    cl::Context ctx(CL_DEVICE_TYPE_ALL,
        cprops,
        NULL,
        NULL,
        &status);

    // output buffer on the device
    cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    checkErr(status, "Buffer()");

    // get devices
    vector<cl::Device> devices;
    devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
    timef << "# of devices: " << devices.size() << endl;

Implementation – Mandelbrot, OpenCL:

    // select device
    cl::Device device = devices[0];

    string devName;
    device.getInfo(CL_DEVICE_NAME, &devName);
    timef << "Device Name: " << devName << endl;

    // load program
    ifstream f("mandelbrot.cl");
    std::string progStr(istreambuf_iterator<char>(f),
        (istreambuf_iterator<char>()));

    cl::Program::Sources source(1, std::make_pair(progStr.c_str(), progStr.length()+1));

    cl::Program program(ctx, source);
    status = program.build(devices, "");
    checkErr(status, "Program::build()");

We invoke this code to perform the computation for the OpenCL implementations.

Implementation – Mandelbrot, OpenCL:

    // get kernel
    cl::Kernel kernel(program, "mandelbrot", &status);
    checkErr(status, "Kernel");

    status = kernel.setArg(0, buff);
    checkErr(status, "Kernel::setArg(0)");
    status = kernel.setArg(5, maxIter);
    checkErr(status, "Kernel::setArg(5)");

    // calculate over sets
    for (int setN = 0; setN < nSets; ++setN) {
        status = kernel.setArg(1, xL[setN]);
        checkErr(status, "Kernel::setArg(1)");
        status = kernel.setArg(2, xR[setN]);
        checkErr(status, "Kernel::setArg(2)");
        status = kernel.setArg(3, yL[setN]);
        checkErr(status, "Kernel::setArg(3)");
        status = kernel.setArg(4, yR[setN]);
        checkErr(status, "Kernel::setArg(4)");

        timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
              << " ; y = " << yL[setN] << ":" << yR[setN] << endl;

        cl::CommandQueue queue(ctx, device, 0, &status);
        checkErr(status, "CommandQueue()");

        const int N_TIMES = 10;
        clock_t start = clock();

Implementation – Mandelbrot, OpenCL:

        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // enqueue kernel
            cl::Event event;
            status = queue.enqueueNDRangeKernel(kernel,
                cl::NullRange,
                cl::NDRange(nEl*nEl),
                cl::NullRange,
                NULL,
                &event);
            checkErr(status, "enqueue()");

            // wait for kernel to finish
            event.wait();
            // read to matrix (blocking)
            status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize, mat, NULL, NULL);
            checkErr(status, "Read()");
        }

Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.

Implementation – Mandelbrot, OpenCL:

        clock_t end = clock();
        double sec = (end - start)/ (double) CLOCKS_PER_SEC;
        timef << "Time: " << sec << endl;
        timef << "Per iter: " << sec / (double) N_TIMES << endl;

        // create the image
        CImage im;
        im.Create(nEl, nEl, 24);
        // paint the image
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < nEl; ++j) {
                float u = mat[nEl*i + j];
                if (u == maxIter) {
                    // if part of set, pixel is black
                    im.SetPixelRGB(j, i, 0, 0, 0);
                } else {
                    // otherwise, color it based on number of iterations
                    float x = xL[setN] + (xR[setN] - xL[setN])*j/nEl;
                    float y = yL[setN] + (yR[setN] - yL[setN])*i/nEl;
                    float v = u;
                    float c = v * 2.0f * PI / 256.0f;
                    im.SetPixelRGB(j, i,
                        ((1.0f + cos(c))*0.5f)*255,
                        ((1.0f + cos(2.0f*c + 2.0f*PI/3.0f))*0.5f)*255,
                        ((1.0f + cos(c - 2.0f*PI/3.0f))*0.5f)*255);
                }
            }
        }

Implementation – Mandelbrot, OpenCL:

        // save the image
        ostringstream strm;
        strm << "mandelbrot" << setN << ".bmp";
        im.Save(strm.str().c_str());
    }
    timef.close();
}


Logistic Map

Implementation – Logistic Map

The 1-dimensional interval of the real axis (the r-values) will be divided into 1024 divisions.

Each division corresponds to a column in the diagram.

$2^{20}$ iterations will be performed to warm up, starting with x = 0.4.

Implementation – Logistic Map

Following the warmup, $2^{10} = 1024$ iterations will be recorded.

Afterwards, these will be plotted along the column.

Again, the iterations will be implemented as:
◦ A simple for-loop for the C++ implementation
◦ An OpenCL kernel for the OpenCL implementation

Implementation – Logistic Map, C++

    // timings file
    ofstream timef("timings.txt", fstream::app);

    // number of times to perform map (for benchmarking)
    const int N_TIMES = 10;

    for (int setN = 0; setN < nSets; ++setN) {
        timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;

        clock_t start = clock();

        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // Iterate the map
            for (int idx = 0; idx < nEl; ++idx) {
                float r = rL[setN] + (rR[setN] - rL[setN])*idx/nEl;
                float x = 0.4f;
                for (int i = 0; i < warmup; ++i) {
                    x = r*x*(1-x);
                }
                for (int i = 0; i < maxIter; ++i) {
                    mat[maxIter*idx + i] = x = r*x*(1-x);
                }
            }
        }

Implementation – Logistic.cl

    // number of r-values; must match nEl in the host code
    #define N_EL 1024

    // calculates the logistic function
    __kernel void logistic(__global float* mat, float rL, float rR, int warmup, int maxIter) {
        int idx = get_global_id(0);

        // r of the map to iterate on
        float r = rL + (rR - rL)*idx/N_EL;
        float x = 0.4f;
        // warmup
        for (int i = 0; i < warmup; ++i) {
            x = r*x*(1-x);
        }
        // plotted iterates
        for (int i = 0; i < maxIter; ++i) {
            mat[maxIter*idx + i] = x = r*x*(1-x);
        }
    }

Implementation – Logistic Map, OpenCL

    // get platforms
    cl_uint nPlatforms = 0;
    cl_platform_id *platforms = NULL;
    vector<cl::Platform> platformList;

    cl::Platform::get(&platformList);

    ofstream timef("timings.txt", fstream::app);

    string vendor;
    platformList[0].getInfo((cl_platform_info) CL_PLATFORM_VENDOR, &vendor);

    timef << "Platform by: " << vendor << endl;

Implementation – Logistic Map, OpenCL

    // get context
    cl_context_properties cprops[3] =
        {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[1])(), 0};

    cl::Context ctx(CL_DEVICE_TYPE_ALL,
        cprops,
        NULL,
        NULL,
        &status);

    cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    checkErr(status, "Buffer()");

    // get devices
    vector<cl::Device> devices;
    devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
    timef << "# of devices: " << devices.size() << endl;

    cl::Device device = devices[0];

    string devName;
    device.getInfo(CL_DEVICE_NAME, &devName);
    timef << "Device Name: " << devName << endl;

Implementation – Logistic Map, OpenCL

    // get program
    ifstream f("logistic.cl");
    std::string progStr(istreambuf_iterator<char>(f),
        (istreambuf_iterator<char>()));

    cl::Program::Sources source(1, std::make_pair(progStr.c_str(), progStr.length()+1));

    cl::Program program(ctx, source);
    status = program.build(devices, "");
    checkErr(status, "Program::build()");

    // get kernel
    cl::Kernel kernel(program, "logistic", &status);
    checkErr(status, "Kernel");

We invoke this function to perform the computation for the OpenCL implementations.

Implementation – Logistic Map, OpenCL

    // load arguments
    status = kernel.setArg(0, buff);
    checkErr(status, "Kernel::setArg(0)");
    status = kernel.setArg(3, warmup);
    checkErr(status, "Kernel::setArg(3)");
    status = kernel.setArg(4, maxIter);
    checkErr(status, "Kernel::setArg(4)");

    // calculate over sets
    for (int setN = 0; setN < nSets; ++setN) {
        status = kernel.setArg(1, rL[setN]);
        checkErr(status, "Kernel::setArg(1)");
        status = kernel.setArg(2, rR[setN]);
        checkErr(status, "Kernel::setArg(2)");

        timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;

        cl::CommandQueue queue(ctx, device, 0, &status);
        checkErr(status, "CommandQueue()");

        const int N_TIMES = 10;
        clock_t start = clock();

Implementation – Logistic Map, OpenCL

        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // enqueue kernel
            cl::Event event;
            status = queue.enqueueNDRangeKernel(kernel,
                cl::NullRange,
                cl::NDRange(nEl),
                cl::NullRange,
                NULL,
                &event);
            checkErr(status, "enqueue()");
            // wait for kernel to finish
            event.wait();
            // read buffer to memory (blocking)
            status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize, mat, NULL, NULL);
            checkErr(status, "Read()");
        }

Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.

Implementation – Logistic Map, OpenCL

        // Output timings
        clock_t end = clock();
        double sec = (end - start) / (double) CLOCKS_PER_SEC;
        timef << "Time: " << sec << endl;
        timef << "Per iter: " << sec / (double) N_TIMES << endl;

        // Create image
        CImage im;
        im.Create(nEl, -height, 24);
        // White out the image
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < height; ++j) {
                im.SetPixelRGB(i, j, 255, 255, 255);
            }
        }

Implementation – Logistic Map, OpenCL

        // Plot the iterates
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < maxIter; ++j) {
                float u = mat[nEl*i + j];
                if (xL <= u && u < xR) {
                    // only plot u if in range
                    u -= xL;
                    im.SetPixelRGB(i, height - 1 - (u*height/xR), 0, 0, 0);
                }
            }
        }
        // Save the image
        ostringstream strm;
        strm << "logistic" << setN << ".bmp";
        im.Save(strm.str().c_str());
    }
    timef.close();
}
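The kernel source (logistic.cl) is not reproduced on these slides. As a rough serial reference for what each work-item presumably computes, the sketch below iterates x <- r*x*(1-x) for nEl evenly spaced values of r in [rL, rR], discards the first warmup iterates, and stores the next maxIter iterates. The function name logisticReference, the starting value 0.5, and the storage layout (maxIter values per r) are assumptions made for illustration, not the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for a logistic-map bifurcation diagram (illustrative sketch).
// Assumed layout: out[i * maxIter + j] is the j-th stored iterate for the i-th r value.
std::vector<float> logisticReference(float rL, float rR, int nEl, int warmup, int maxIter) {
    assert(nEl > 1 && warmup >= 0 && maxIter > 0);
    std::vector<float> out(static_cast<std::size_t>(nEl) * maxIter);
    for (int i = 0; i < nEl; ++i) {
        // r for column i, evenly spaced over [rL, rR]
        const float r = rL + (rR - rL) * static_cast<float>(i) / static_cast<float>(nEl - 1);
        float x = 0.5f;  // assumed starting value
        // discard the transient so only the long-term behaviour is stored
        for (int k = 0; k < warmup; ++k) {
            x = r * x * (1.0f - x);
        }
        for (int j = 0; j < maxIter; ++j) {
            x = r * x * (1.0f - x);
            out[static_cast<std::size_t>(i) * maxIter + j] = x;
        }
    }
    return out;
}
```

Each value of r is independent of the others, which is why the outer loop maps naturally onto one OpenCL work-item per r.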

Methodology, Results, Analysis

Methodology
Results
Analysis

Methodology:

The performance of each implementation for each chaotic phenomenon will be measured by timing it.
◦ Each implementation will be timed over 10 runs for each set.
◦ Each set is a 2-D region (Mandelbrot Set) or a 1-D interval (Logistic Map).
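The timing procedure described above can be sketched as a small helper. The code on the implementation slides uses clock(); the version here swaps in std::chrono::steady_clock as an alternative, since clock() is specified to report processor time, which on some platforms differs from elapsed time when several threads are busy. The name meanSecondsPerRun is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` nTimes back to back and returns the mean elapsed
// (wall-clock) time per run, in seconds.
double meanSecondsPerRun(const std::function<void()>& work, int nTimes = 10) {
    assert(nTimes > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nTimes; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count() / nTimes;
}
```

A caller would pass a lambda that enqueues the kernel and reads the buffer back, so that the transfer cost is included in the measured time.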

Methodology:

The mean run times will be compared and graphed.
Afterwards, accuracy will be determined.
◦ Generated graphs will be compared both visually and numerically.
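The slides do not say how the numerical comparison was carried out. One plausible approach, shown here only as a sketch, is to compare the raw output buffers of two implementations element by element and report how many entries differ and by how much. The names DiffSummary and compareOutputs and the tolerance parameter are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct DiffSummary {
    std::size_t mismatches;  // number of elements differing by more than tol
    float maxAbsDiff;        // largest absolute difference seen
};

// Element-wise comparison of two output buffers of equal length.
DiffSummary compareOutputs(const std::vector<float>& a,
                           const std::vector<float>& b,
                           float tol = 0.0f) {
    assert(a.size() == b.size());
    DiffSummary s{0, 0.0f};
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > tol) ++s.mismatches;
        if (d > s.maxAbsDiff) s.maxAbsDiff = d;
    }
    return s;
}
```

With tol left at zero, a result of zero mismatches means the two buffers are identical.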

Methodology:

Hardware used:
◦ Intel Core i5 Dual-Core M460 @ 2.53GHz
◦ AMD Radeon Mobility HD 5145 (ATI RV710)

Results:

Mandelbrot Set.
Logistic Map.

Results – Mandelbrot Set:

[Bar chart, not reproduced: Sets 1–5 on the horizontal axis; vertical axis ticks from 0 to 8; series: Plain C++; OpenCL, GPU; OpenCL, CPU]

Results – Mandelbrot Set:

Performance
◦ The OpenCL implementations are roughly 10 times faster than plain C++.
◦ OpenCL running on the CPU and on the GPU both take less than a second.
◦ This is an order of magnitude difference in runtime.

Results – Mandelbrot Set:

Performance
◦ OpenCL on the GPU runs in ¾ the time of OpenCL on the CPU.
◦ The difference is smaller than expected (more on this later).

Results – Mandelbrot Set:

Accuracy
◦ To be determined by both visual and numerical comparison of the generated visualizations
◦ Visualizations follow on the next slides

Mandelbrot Set – Visualizations:

[Images not reproduced. For each of Sets 1–5, three renderings were shown: C++; OpenCL, GPU; OpenCL, CPU.]

Mandelbrot Set – Results:

Accuracy
◦ Visual comparison gives no apparent difference
◦ Numerical comparison confirms this: no difference in number of iterations
◦ Perfect accuracy
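The "number of iterations" compared here is, in the usual formulation, the escape-time count of each pixel. The sketch below shows that count for a single point c, assuming the standard bailout radius of 2; the function name escapeCount is hypothetical and this section does not reproduce the authors' own kernel.

```cpp
#include <cassert>
#include <complex>

// Number of iterations of z <- z*z + c (starting from z = 0) performed
// before |z| exceeds 2, capped at maxIter.
int escapeCount(std::complex<float> c, int maxIter) {
    assert(maxIter > 0);
    std::complex<float> z(0.0f, 0.0f);
    for (int n = 0; n < maxIter; ++n) {
        if (std::norm(z) > 4.0f) return n;  // std::norm is |z|^2
        z = z * z + c;
    }
    return maxIter;  // treated as "did not escape"
}
```

Because the result is an integer per pixel, two implementations can be checked for exact agreement by comparing these counts for equality.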

Logistic Map – Results:

[Bar chart, not reproduced: Sets 1–4 on the horizontal axis; vertical axis ticks from 0 to 8; series: Plain C++; OpenCL, GPU; OpenCL, CPU]

Logistic Map – Results:

Performance
◦ Similar results to the Mandelbrot Set
◦ Plain C++ takes roughly 7 seconds
◦ OpenCL running on the CPU and on the GPU both take less than a second
◦ Order of magnitude difference in runtime

Logistic Map – Results:

Performance
◦ OpenCL on the GPU runs in ½ the time of OpenCL on the CPU.
◦ This is a greater difference than for the Mandelbrot Set, but still smaller than expected (more on this later).

Logistic Map – Results:

Accuracy
◦ Also to be determined by both visual and numerical comparison of the generated visualizations
◦ Visualizations follow on the next slides

Logistic Map – Visualizations:

[Images not reproduced. For each of Sets 1–4, three renderings were shown: C++; OpenCL, GPU; OpenCL, CPU.]

Logistic Map – Results:

Accuracy
◦ Once again, no noticeable difference can be observed visually
◦ Numerical comparison also confirms this

Analysis:

For both chaotic phenomena investigated, an order of magnitude difference in speed was observed between the OpenCL and plain C++ implementations.
Also, no visible difference in accuracy was found.
Thus, OpenCL can be considered an excellent way to boost performance.

Analysis:

For both chaotic phenomena, GPU was faster than CPU.

However, this difference is smaller than expected, considering the parallelism of the problem and of GPUs.

Analysis:

Possible explanation:
◦ Bus transfer from CPU to GPU takes too much time.
Possible solution:
◦ Increase workload of the kernel so as to minimize required transfer
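The effect of transfer time on the observed speedup can be illustrated with a toy cost model: accelerated time is a fixed transfer cost plus the compute time divided by the parallel speedup of the device. The function modeledSpeedup and every number used with it are hypothetical; none of them are measurements from these slides.

```cpp
#include <cassert>

// Toy model: time on the accelerator = transfer time + serial compute time
// divided by the device's parallel factor. Returns serial time / accelerated time.
double modeledSpeedup(double serialComputeSec, double parallelFactor, double transferSec) {
    assert(serialComputeSec > 0.0 && parallelFactor > 0.0 && transferSec >= 0.0);
    const double acceleratedSec = transferSec + serialComputeSec / parallelFactor;
    return serialComputeSec / acceleratedSec;
}
```

For instance, with 8 s of serial work and a 16x parallel factor, the model gives a 16x speedup with no transfer cost but only 8x once 0.5 s of transfer is added. Giving the kernel more work per transfer raises the compute share and moves the result back toward the parallel factor.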

Analysis:

CPU performance in and of itself is remarkable.
The improvement over plain C++ is an order of magnitude, even though only a dual-core CPU was used.
Expected improvement: factor of 2
Actual improvement: factor of 10
Possible explanation:
◦ Excellent optimization by the Intel OpenCL driver

Conclusions:

Summary
Contributions
Future Work

Summary:

GPGPU provides access to massive parallelism
◦ But only data parallelism
This is due to GPU architecture being specialized for massive data parallelism
OpenCL gives us easy access to GPGPU
◦ Along with parallelization for CPUs and embedded devices

Summary:

Chaotic phenomena require large amounts of computation
However, this computation is usually very data-parallel
Prime examples:
◦ Mandelbrot Set
◦ Bifurcation diagram of the logistic map

Summary:

OpenCL was used to investigate how useful GPGPU can be for investigation of chaotic phenomena
Results are spectacular:
◦ 10x improvement over plain C++, even for CPU-driven OpenCL
◦ GPU-driven OpenCL still faster than CPU-driven OpenCL

Summary:

Analysis of benefits and complications from using OpenCL
◦ Speed
◦ Accuracy
◦ Code complexity

Contributions:

This project shows that OpenCL can be used to greatly speed up computations for investigation of chaotic phenomena

And in general, computation of highly data-parallel work

Contributions:

OpenCL can be used regardless of whether a GPU is available
◦ OpenCL can be used to parallelize serial implementations for CPU
◦ Still have massive improvements

Future Work:

Increase workload of the kernel, thus reducing data transfer required and latency incurred.
◦ Data transfer to GPU is very slow
◦ May have to contend with other bus-users
◦ Even without data, latency is high (off-chip)

Future Work:

Investigate using highly-optimized code for both OpenCL and C++
◦ More realistic comparison between OpenCL and C++
◦ However, may accidentally lead to optimizing C++ more than OpenCL, or vice versa

Future Work:

Investigation of other chaotic phenomena
◦ Lorenz strange attractor
◦ Burning ship fractal
◦ Mandelbar fractal
In general, highly data-parallel work.
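Two of the fractals listed above differ from the Mandelbrot Set only in the per-iteration step, so an escape-time routine could in principle be reused with a one-line change. The sketch below uses the standard definitions: the Burning Ship takes absolute values of the real and imaginary parts before squaring, and the Mandelbar (tricorn) conjugates z before squaring. The names Fractal, step, and variantEscapeCount are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

enum class Fractal { Mandelbrot, BurningShip, Mandelbar };

// One iteration step; only the pre-processing of z differs between the fractals.
std::complex<float> step(Fractal kind, std::complex<float> z, std::complex<float> c) {
    switch (kind) {
        case Fractal::BurningShip:
            z = std::complex<float>(std::fabs(z.real()), std::fabs(z.imag()));
            break;
        case Fractal::Mandelbar:
            z = std::conj(z);
            break;
        case Fractal::Mandelbrot:
            break;
    }
    return z * z + c;
}

// Escape-time count with bailout radius 2, capped at maxIter.
int variantEscapeCount(Fractal kind, std::complex<float> c, int maxIter) {
    assert(maxIter > 0);
    std::complex<float> z(0.0f, 0.0f);
    for (int n = 0; n < maxIter; ++n) {
        if (std::norm(z) > 4.0f) return n;
        z = step(kind, z, c);
    }
    return maxIter;
}
```

As with the Mandelbrot Set, every point c is processed independently, so these variants are equally data-parallel.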

These slides contain material developed and copyrighted by Gita Alaghband (UC Denver).

Questions?