Department of Computer Science & Engineering University of Colorado Denver By Ali Alkhathlan, Ali Alsaadi, and Mohamed Khalifa

On OpenCL and Chaotic Phenomena


Goal of the project:

The project aims to use GPGPU through OpenCL, an open standard for computing on heterogeneous platforms (including CPUs and GPUs),
◦ for the computation involved in two chaotic phenomena: the Mandelbrot Set and the bifurcation diagram of the logistic map.

The performance provided by the GPU through OpenCL will be compared with CPU performance through OpenCL and with plain C++.

Outline:

Introduction. Background. Implementation. Methodology, Results, and Analysis. Conclusion.

Introduction.

Introduction of the project

Problem. Objectives. Approach.

Problem:

The investigation of chaotic phenomena requires heavy computation.
◦ However, many chaotic phenomena exhibit large amounts of data parallelism,
◦ in that the same computation is performed over and over on differing inputs.

The problem tackled by this project is
◦ the use of GPGPU, through OpenCL, to perform the computations required to investigate chaotic phenomena,
◦ in particular the Mandelbrot Set and the bifurcation diagram of the logistic map.

Objectives:

◦ Successful implementation of algorithms for the computation of the Mandelbrot Set and the bifurcation diagram of the logistic map, both in OpenCL and in plain C++.
◦ Comparison of performance between OpenCL running on a GPU, OpenCL running on a CPU, and, as a control, a plain serial C++ implementation of the above-mentioned chaotic phenomena.
◦ Analysis of the benefits and complications derived from the use of OpenCL for the computation of chaotic phenomena.

Approach:

The effectiveness of OpenCL for the computation of chaotic phenomena will be investigated
◦ by comparing implementations of the algorithms for the computation of the Mandelbrot Set and the bifurcation diagram of the logistic map,
◦ first in C++, then in OpenCL on the GPU and on the CPU.

Background

GPGPU:

GPGPU is the use of graphics processing units for non-graphics-related computation.

Graphics processing units allow for massive parallelism due to their architecture.

Key Concepts:

GPU Architecture:
◦ The architecture of graphics processing units is oriented towards performing tasks that involve data parallelism.
◦ A GPU has up to hundreds of cores (as compared to CPUs, which have 8-16 at the most).
◦ Each of those cores is capable of executing dozens of instruction streams at the same time.

http://pds.ucdenver.edu/index.php?p=video&c=tech&a=t&i=2

Important Information:

The general flow of computation on a GPU starts with the loading of data onto the GPU.
◦ This is often one of the most expensive operations, since the data has to travel through the bus.
◦ In graphics, this step is the transfer of graphics primitives, such as an image to be rendered.
• In GPGPU, this is the transfer of the data to be operated on.
• The GPU then runs a program over the data. In GPGPU, this program is called a kernel rather than a shader, to reflect the more general scope.
• Afterwards, the data is transferred back to main memory, where the CPU can operate on it once again.

Chaotic Phenomena:

Mandelbrot Set.
◦ Defined by the iteration $z_{n+1} = z_n^2 + c$ with $z_0 = c$, where $c$ is the original point.
◦ Once $|z| > 2$, i.e., once $z$ exits the circle of radius 2 centered at the origin, the iterates are guaranteed to tend to infinity.
◦ It is often visualized by coloring the complex plane according to the number of iterations it takes for a point to escape the circle of radius 2.

Bifurcation diagram of the logistic map.
◦ The logistic map is defined by $x_{n+1} = r x_n (1 - x_n)$, where $r > 0$ and $0 \le x \le 1$.
◦ The diagram is a plot of the long-term iterates of the logistic map, with $r$ varying.

Both are ideal candidates for GPGPU, as their computation involves the repeated application of the same map to many independent inputs.
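The escape-time rule for the Mandelbrot Set can be sketched in a few lines of C++. This is a minimal illustration, not the code used in the project: the function name `escapeTime` and the use of `std::complex` are assumptions made here for readability, and the loop stops as soon as |z| reaches 2, matching the `< 4` test on the squared magnitude used in the slides' code.

```cpp
#include <cassert>
#include <complex>

// Counts how many times z -> z*z + c is applied, starting from z0 = c,
// before |z| reaches 2. The count is capped at maxIter, which is the
// value returned for points treated as members of the set.
// (escapeTime is an illustrative name, not part of the project code.)
int escapeTime(std::complex<float> c, int maxIter) {
    assert(maxIter >= 0);
    std::complex<float> z = c;
    int n = 0;
    // std::norm(z) is |z|^2, so comparing with 4 avoids a square root.
    while (n < maxIter && std::norm(z) < 4.0f) {
        z = z * z + c;
        ++n;
    }
    return n;
}
```

Each point of the plane is handled independently of every other point, which is what makes the computation data-parallel.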

http://www.rationalsys.com/robertpirsig.html


Is the computation of the Mandelbrot Set and the logistic map highly amenable to parallelization?

Mandelbrot Set. Bifurcation diagram of the logistic map.

OpenCL: What is it?

Open Computing Language (OpenCL) is a language and a framework for developing and executing programs over heterogeneous devices and platforms, for example CPUs and GPUs.

It includes:
- A language (based on C99) for writing kernels.
- APIs to define and control the platforms.

All major hardware manufacturers support OpenCL, including Nvidia, Intel [6], and AMD/ATI [2].

OpenCL:

Choosing Devices

Hardware Overview:

The figure shows a simplified block diagram of a generalized GPU compute device.

The ATI Stream Computing Implementation of OpenCL:

The figure illustrates the relationship of the ATI Stream Computing components.

◦ GPU compute devices can execute non-graphics functions by using kernels.

◦ Each instance of a kernel running on a compute unit is called a work-item.

◦ All the work-items are scheduled onto a group of stream cores.

◦ OpenCL maps the total number of work-items to be launched onto an n-dimensional grid.

◦ The developer can specify how to divide these items into work-groups.

◦ Each work-group contains an integer number of wavefronts.


Figure: Work-Item Grouping into Work-Groups and Wavefronts


Global and local dimensions

Synchronization within work-items

OpenCL- Memory Model

Work-Item Processing:

◦ All stream cores within a compute unit execute the same instruction in each cycle.
◦ A work-item can issue one VLIW instruction per clock cycle.
◦ To hide latencies due to memory accesses and processing element operations, up to four work-items from the same wavefront are pipelined on the same stream core.
◦ Compute units operate independently of each other, so it is possible for each compute unit to execute different instructions.

Flow Control:

Flow control, for example branching, is done by combining all necessary paths as a wavefront. If work-items within a wavefront diverge, all paths are executed serially.

Masking of wavefronts is effected by constructs such as:

    if (x) {
        // items within these braces = A
    } else {
        // items within these braces = B
    }

- The wavefront mask is set to true for the lanes (elements/items) in which x is true, and A is executed.
- The mask is then inverted, and B is executed.
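The masking behavior can be imitated on the CPU to make the cost of divergence concrete. The following is only a sketch under stated assumptions: the function name `runDivergentBranch` is hypothetical, path A (doubling a value) and path B (negating it) are arbitrary stand-ins, and a path is assumed to be skipped entirely when no lane takes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates one wavefront executing `if (x) A else B`.
// Pass 1 runs A on the lanes where the mask (x) is true.
// Pass 2 inverts the mask and runs B on the remaining lanes.
// Returns how many passes the whole wavefront had to execute.
// A and B are illustrative: A doubles the value, B negates it.
int runDivergentBranch(const std::vector<bool>& x, std::vector<int>& data) {
    assert(x.size() == data.size());
    bool anyTrue = false;
    bool anyFalse = false;
    for (std::size_t lane = 0; lane < x.size(); ++lane) {
        if (x[lane]) anyTrue = true; else anyFalse = true;
    }
    int passes = 0;
    if (anyTrue) {   // mask = x
        for (std::size_t lane = 0; lane < x.size(); ++lane) {
            if (x[lane]) data[lane] *= 2;
        }
        ++passes;
    }
    if (anyFalse) {  // mask = !x
        for (std::size_t lane = 0; lane < x.size(); ++lane) {
            if (!x[lane]) data[lane] = -data[lane];
        }
        ++passes;
    }
    return passes;
}
```

A wavefront whose lanes all agree on x needs one pass, while a divergent wavefront needs two, even though each lane only uses the result of one path.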

Compute Kernel:

A kernel is a small, user-developed program that is run repeatedly on a stream of data.

There are multiple kernel types: vertex, pixel, geometry, domain, hull, and now compute.

A compute kernel is a specific type of kernel that is not part of the traditional graphics pipeline.

Before the development of compute kernels, pixel shaders were responsible for non-graphics computing.

However, new hardware supports compute kernels, which are better suited for non-graphics computations (applications).

The compute kernel type can also be used for graphics.

Wavefronts and Workgroups

Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity.

A single instruction is executed over all work-items in a wavefront in parallel. The wavefront is the lowest level that flow control can affect.

This means that if two work-items inside a wavefront take divergent paths of flow control, all work-items in the wavefront go through both paths.

Work-groups are composed of wavefronts. Best performance is attained when the group size is an integer multiple of the wavefront size.
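One way to follow the sizing advice above is to round a desired group size up to the next multiple of the wavefront size. The helper below is a sketch; its name `roundUpToMultiple` is hypothetical, and the wavefront size is left as a parameter because it depends on the device (64 is used in the usage note purely as an assumed example).

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the nearest multiple of wavefrontSize, so that a
// work-group of the returned size contains only full wavefronts.
// (Illustrative helper; not part of the project code.)
std::size_t roundUpToMultiple(std::size_t n, std::size_t wavefrontSize) {
    assert(wavefrontSize > 0);
    return ((n + wavefrontSize - 1) / wavefrontSize) * wavefrontSize;
}
```

For example, with an assumed wavefront size of 64, a requested size of 1000 becomes 1024, and 1024 is left unchanged.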

Memory Architecture and Access:

OpenCL has four memory domains: private, local, global, and constant.

The AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory.

private memory
- specific to a work-item; it is not visible to other work-items.

local memory
- specific to a work-group; accessible only by work-items belonging to that work-group.

global memory
- accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).

constant memory
- read-only region for host-allocated and host-initialized objects that are not changed during kernel execution.

host (CPU) memory
- host-accessible region for an application's data structures and program data.

PCIe memory
- part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device.
◦ Modifying this memory requires synchronization between the GPU compute device and the CPU.


Interrelationship of the memory domains:

Copying occurs from the host to PCIe memory and from PCIe memory to the GPU compute device.

• Memory Access
  • Local memory access is faster than global memory access, and global memory access is faster than PCIe access.
• Global Buffer
  • It permits applications to read from and write to arbitrary locations in memory.
• Image Read/Write
  • Image reads are cached through the texture system.
  • A read is done by addressing the desired location in input memory using the fetch unit.
• Memory Load/Store
  • Only constants (read-only buffers) are cached.
  • Each work-item can write to an arbitrary location within the global buffer.
• Communication between Host and GPU
  • PCI Express bus
  • Command processor or processor API calls
  • DMA transfer

The figure illustrates the interrelationship of the memories (continued).

The figure illustrates the standard dataflow between host (CPU) and GPU.

How to copy data?

There are two ways to copy data from the host to the GPU compute device memory:
• Implicitly, by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
• Explicitly, through clEnqueueReadBuffer and clEnqueueWriteBuffer (or clEnqueueReadImage and clEnqueueWriteImage).

The figure shows a block diagram of the GPU memory system. Up arrows are read paths; down arrows are write paths. WC is the write cache.

Global Memory Optimization

The GPU memory diagram consists of multiple compute units, each containing:
◦ 32 kB local memory
◦ L1 cache
◦ Registers
◦ 16 processing elements, each a five-way VLIW processor

• The L1 cache is 8 kB per compute unit, i.e., 160 kB for the 20 compute units of the ATI Radeon.
• The L1 cache bandwidth on the ATI Radeon is one terabyte per second.
• Multiple compute units share the L2 cache, which is 512 kB on the ATI Radeon.
• The bandwidth between the L1 caches and the shared L2 cache is 435 GB/s.

ATI Radeon HD 5870
• The ATI Radeon™ HD 5870 GPU has eight memory controllers connected to multiple banks of GDDR5 memory.
• The memory clock speed is 1200 MHz, with a data rate of 4800 Mb/s per pin.
• Peak bandwidth = (8 memory controllers) * (4800 Mb/s per pin) * (32 bits) * (1 B / 8 b) = 154 GB/s
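The peak-bandwidth arithmetic above can be checked with a few lines of code. This is a small sketch; the function name `peakBandwidthGBs` and its parameter names are made up for the illustration, and 1 GB is taken as 1000 MB, as in the figure quoted on the slide.

```cpp
#include <cassert>

// Peak memory bandwidth in GB/s from the number of memory controllers,
// the per-pin data rate in Mb/s, and the bus width (pins) per controller.
// (Illustrative helper; the names are assumptions, not vendor API.)
double peakBandwidthGBs(int controllers, double megabitsPerSecPerPin,
                        int pinsPerController) {
    assert(controllers > 0 && pinsPerController > 0);
    double megabitsPerSec = controllers * megabitsPerSecPerPin * pinsPerController;
    double megabytesPerSec = megabitsPerSec / 8.0;  // 8 bits per byte
    return megabytesPerSec / 1000.0;                // MB/s -> GB/s
}
```

With the values quoted for the ATI Radeon HD 5870, `peakBandwidthGBs(8, 4800.0, 32)` evaluates to 153.6, which the slide rounds to 154 GB/s.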

Comparing Local, Global and Single Cache Miss Rate

Global Miss Rate and Local Miss Rate
◦ The miss rate decreases as the cache size increases.
◦ Up to the L1 cache level, it decreases by up to 10%.
◦ At L2, the global miss rate decreases by a further 10%.
◦ At the L2 level, it is similar to the single cache miss rate.
◦ L2 is not tied to the CPU clock cycle; it affects the miss penalty of the first-level cache.
◦ For L2, the global miss rate should be considered; the local miss rate is not a good measure of the second-level cache.
◦ The local miss rate is a function of the first-level cache, so it varies if the first-level cache varies.

Single Miss Rate
◦ It shows roughly the same variation at the L1 and L2 cache levels.
◦ As the cache size increases, the single miss rate decreases.
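The difference between local and global miss rates is easiest to see with numbers. The sketch below uses the standard textbook definitions (local: misses in a cache divided by accesses to that cache; global: misses in a cache divided by all accesses issued by the processor). The struct and function names, and the counts in the usage note, are illustrative assumptions rather than measurements from this project.

```cpp
#include <cassert>

// Access counts for a two-level cache hierarchy (illustrative).
struct CacheCounts {
    long cpuAccesses;  // accesses issued by the processor; all reach L1
    long l1Misses;     // accesses that miss in L1; these are what L2 sees
    long l2Misses;     // accesses that also miss in L2
};

// Miss rate of L1 (local and global are the same at the first level).
double l1MissRate(const CacheCounts& c) {
    assert(c.cpuAccesses > 0);
    return static_cast<double>(c.l1Misses) / c.cpuAccesses;
}

// Local miss rate of L2: relative to the accesses that reach L2.
double l2LocalMissRate(const CacheCounts& c) {
    assert(c.l1Misses > 0);
    return static_cast<double>(c.l2Misses) / c.l1Misses;
}

// Global miss rate of L2: relative to all processor accesses.
double l2GlobalMissRate(const CacheCounts& c) {
    assert(c.cpuAccesses > 0);
    return static_cast<double>(c.l2Misses) / c.cpuAccesses;
}
```

With 1000 accesses, 40 L1 misses, and 20 L2 misses, the L2 local miss rate is 50% while its global miss rate is only 2%. The local figure depends heavily on how many accesses L1 filters out, which is why it is a poor measure of the second-level cache.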

GPU Compute Device Scheduling:

◦ GPU compute devices are very efficient at parallelizing large numbers of work-items in a manner transparent to the application.
◦ Each GPU compute device uses the large number of wavefronts to hide memory access latencies by having the resource scheduler switch the active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete.
◦ Hiding memory access latencies requires that each work-item contain a large number of ALU operations per memory load/store.

Figure: Simplified Execution of Work-Items on a Single Stream Core

Implementation

Comparison will be done pairwise between:
◦ Single-threaded C++ implementation
◦ OpenCL backed by a CPU driver: Intel OpenCL driver
◦ OpenCL backed by a GPU driver: AMD/ATI OpenCL driver (APP)

Mandelbrot Set

Implementation – Mandelbrot Set

◦ The 2-dimensional region of the complex plane will be divided into a 1024x1024 grid.
◦ Each cell of the grid corresponds to a pixel in the visualization.
◦ Each pixel will be iterated for up to 1024 iterations, or until it escapes the circle of radius 2.
◦ Afterwards, the pixels will be colored according to the number of iterations.

The iterations will be implemented as:
◦ A simple for-loop for the C++ implementation
◦ An OpenCL kernel for the OpenCL implementation
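The mapping from a grid cell to a point of the complex plane can be sketched as a small helper. It mirrors the index arithmetic used in the project code (row-major flat index, left/top edges included), but the struct `PlanePoint` and the function `pixelToPlane` are names introduced here only for illustration.

```cpp
#include <cassert>

// A point of the complex plane: x is the real part, y the imaginary part.
struct PlanePoint {
    float x;
    float y;
};

// Maps the flat, row-major index of a cell in an nEl x nEl grid to the
// corresponding point of the region [xL, xR) x [yL, yR).
// (Illustrative helper; not part of the project code.)
PlanePoint pixelToPlane(int idx, int nEl, float xL, float xR, float yL, float yR) {
    assert(nEl > 0 && idx >= 0 && idx < nEl * nEl);
    int i = idx / nEl;  // row, along the imaginary axis
    int j = idx % nEl;  // column, along the real axis
    PlanePoint p;
    p.x = xL + (xR - xL) * j / nEl;
    p.y = yL + (yR - yL) * i / nEl;
    return p;
}
```

For the first region used in the project (x from -2 to 1, y from -1 to 1), index 0 maps to (-2, -1) and the cell in row 512, column 512 maps to (-0.5, 0).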

Implementation – Mandelbrot, C++:

    #include <cmath>    // acos
    #include <ctime>    // clock, clock_t
    #include <fstream>  // ofstream
    using namespace std;

    int main() {
        // left end of the x-axis
        const float xL[] = {-2, 0.3f, -0.333939f, -0.44545f, -0.4222f};
        // right end of the x-axis
        const float xR[] = {1, 0.4f, -0.22282f, -0.11212f, -0.31111f};
        // left (top) end of the y-axis
        const float yL[] = {-1, 0.3f, -0.67946f, -0.81202f, -0.7076431f};
        // right (bottom) end of the y-axis
        const float yR[] = {1, 0.4f, -0.54478f, -0.40979f, -0.5729629f};
        // number of sets
        const int nSets = 5;
        // maximum number of iterations
        const int maxIter = 1024;
        // PI
        const float PI = 2*acos(0.0f);
        // matrix containing number of iterations
        int *mat = NULL;
        // number of elements along each side of the matrix
        const int nEl = 1024;
        // size of matrix in bytes
        size_t datasize = sizeof(int)*(nEl*nEl);
        // allocate the matrix before it is written to
        mat = new int[nEl*nEl];

        // timings file
        ofstream timef("timings.txt", fstream::app);
        for (int setN = 0; setN < nSets; ++setN) {
            timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
                  << " ; y = " << yL[setN] << ":" << yR[setN] << endl;
            const int N_TIMES = 10;
            clock_t start = clock();
            for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
                // perform calculation
                for (int i = 0; i < nEl; ++i) {
                    for (int j = 0; j < nEl; ++j) {
                        int idx = nEl*i + j;
                        float x0 = xL[setN] + (xR[setN] - xL[setN])*j/nEl;
                        float y0 = yL[setN] + (yR[setN] - yL[setN])*i/nEl;
                        float x = x0;
                        float y = y0;
                        int nIter = 0;
                        // iterate until escape or maximum iterations
                        for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
                            float x_ = x*x - y*y + x0;
                            y = 2*x*y + y0;
                            x = x_;
                        }
                        mat[idx] = nIter;
                    }
                }
            }

Implementation – Mandelbrot.cl:

    // number of elements along each side of the grid; must match nEl in the host code
    #define N_EL 1024

    // calculates the mandelbrot set
    __kernel void mandelbrot(__global int* mat, float xL, float xR,
                             float yL, float yR, int maxIter) {
        // one work-item per pixel
        int idx = get_global_id(0);
        int i = idx / N_EL;
        int j = idx % N_EL;
        // initial x and y
        float x0 = xL + (xR - xL)*j/N_EL;
        float y0 = yL + (yR - yL)*i/N_EL;
        float x = x0;
        float y = y0;
        int nIter = 0;
        // iterate until escape or maximum iterations
        for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
            float x_ = x*x - y*y + x0;
            y = 2*x*y + y0;
            x = x_;
        }
        mat[idx] = nIter;
    }

Implementation – Mandelbrot, OpenCL:

        // Get devices
        cl_context_properties cprops[3] =
            {CL_CONTEXT_PLATFORM, (cl_context_properties)(plat)(), 0};
        cl::Context ctx(CL_DEVICE_TYPE_ALL, cprops, NULL, NULL, &status);
        cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
        checkErr(status, "Buffer()");
        vector<cl::Device> devices;
        devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
        timef << "# of devices: " << devices.size() << endl;

        // select device
        cl::Device device = devices[0];
        string devName;
        device.getInfo(CL_DEVICE_NAME, &devName);
        timef << "Device Name: " << devName << endl;

        // load program
        ifstream f("mandelbrot.cl");
        std::string progStr(istreambuf_iterator<char>(f),
                            (istreambuf_iterator<char>()));
        cl::Program::Sources source(1,
            std::make_pair(progStr.c_str(), progStr.length()+1));
        cl::Program program(ctx, source);
        status = program.build(devices, "");
        checkErr(status, "Program::build()");

This code is invoked to perform the computation for the OpenCL implementations.

        // get kernel
        cl::Kernel kernel(program, "mandelbrot", &status);
        checkErr(status, "Kernel");
        status = kernel.setArg(0, buff);
        checkErr(status, "Kernel::setArg(0)");
        status = kernel.setArg(5, maxIter);
        checkErr(status, "Kernel::setArg(5)");

        // calculate over sets
        for (int setN = 0; setN < nSets; ++setN) {
            status = kernel.setArg(1, xL[setN]);
            checkErr(status, "Kernel::setArg(1)");
            status = kernel.setArg(2, xR[setN]);
            checkErr(status, "Kernel::setArg(2)");
            status = kernel.setArg(3, yL[setN]);
            checkErr(status, "Kernel::setArg(3)");
            status = kernel.setArg(4, yR[setN]);
            checkErr(status, "Kernel::setArg(4)");
            timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
                  << " ; y = " << yL[setN] << ":" << yR[setN] << endl;
            cl::CommandQueue queue(ctx, device, 0, &status);
            checkErr(status, "CommandQueue()");
            const int N_TIMES = 10;
            clock_t start = clock();


            for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
                // enqueue kernel
                cl::Event event;
                status = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                    cl::NDRange(nEl*nEl), cl::NullRange, NULL, &event);
                checkErr(status, "enqueue()");
                // wait for kernel to finish
                event.wait();
                // read to matrix (blocking)
                status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize, mat,
                                                 NULL, NULL);
                checkErr(status, "Read()");
            }

Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.


            clock_t end = clock();
            double sec = (end - start)/ (double) CLOCKS_PER_SEC;
            timef << "Time: " << sec << endl;
            timef << "Per iter: " << sec / (double) N_TIMES << endl;

            // create the image
            CImage im;
            im.Create(nEl, nEl, 24);
            // paint the image
            for (int i = 0; i < nEl; ++i) {
                for (int j = 0; j < nEl; ++j) {
                    int u = mat[nEl*i + j];
                    if (u == maxIter) {
                        // if part of set, pixel is black
                        im.SetPixelRGB(j, i, 0, 0, 0);
                    } else {
                        // otherwise, color it based on number of iterations
                        float c = u * 2.0f * PI / 256.0f;
                        im.SetPixelRGB(j, i,
                            ((1.0f + cos(c))*0.5f)*255,
                            ((1.0f + cos(2.0f*c + 2.0f*PI/3.0f))*0.5f)*255,
                            ((1.0f + cos(c - 2.0f*PI/3.0f))*0.5f)*255);
                    }
                }
            }

            // save the image
            ostringstream strm;
            strm << "mandelbrot" << setN << ".bmp";
            im.Save(strm.str().c_str());
        }
        timef.close();
    }
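The coloring rule used when painting the image can be isolated into a pure function, which makes it easy to check without an image library. The sketch below reproduces the cosine palette from the painting loop; the struct `Rgb` and the function `iterationColor` are names assumed here, and the computation uses `double` rather than `float`, so individual channel values may differ by one from the project's output.

```cpp
#include <cassert>
#include <cmath>

// One 24-bit color, each channel in [0, 255].
struct Rgb {
    int r;
    int g;
    int b;
};

// Maps an iteration count to a color: black for points that never
// escaped (nIter == maxIter), otherwise a periodic cosine palette.
// (Illustrative helper; not part of the project code.)
Rgb iterationColor(int nIter, int maxIter) {
    assert(nIter >= 0 && nIter <= maxIter);
    if (nIter == maxIter) {
        return Rgb{0, 0, 0};
    }
    const double PI = std::acos(-1.0);
    const double c = nIter * 2.0 * PI / 256.0;
    Rgb out;
    out.r = static_cast<int>((1.0 + std::cos(c)) * 0.5 * 255.0);
    out.g = static_cast<int>((1.0 + std::cos(2.0 * c + 2.0 * PI / 3.0)) * 0.5 * 255.0);
    out.b = static_cast<int>((1.0 + std::cos(c - 2.0 * PI / 3.0)) * 0.5 * 255.0);
    return out;
}
```

Because the palette is periodic in the iteration count, neighboring bands of the fractal receive smoothly varying colors, while members of the set stay black.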

Logistic Map

Implementation – Logistic Map

◦ The 1-dimensional interval of the real axis (the r-values) will be divided into 1024 divisions.
◦ Each division corresponds to a column in the diagram.
◦ $2^{20}$ iterations will be performed to warm up, starting with x = 0.4.
◦ Following that, $2^{10} = 1024$ iterations will be recorded.
◦ Afterwards, these will be plotted along the column.

Again, the iterations will be implemented as:
◦ A simple for-loop for the C++ implementation
◦ An OpenCL kernel for the OpenCL implementation
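The warm-up-then-record scheme for a single column of the diagram can be sketched as follows. The function name `logisticOrbit` and its signature are assumptions for this illustration; the project code writes into a flat matrix instead of returning a vector.

```cpp
#include <cassert>
#include <vector>

// Iterates the logistic map x -> r*x*(1-x) for one value of r.
// The first `warmup` iterates are discarded so that transients die out;
// the next `recorded` iterates are returned for plotting.
// (Illustrative helper; not part of the project code.)
std::vector<float> logisticOrbit(float r, int warmup, int recorded, float x0 = 0.4f) {
    assert(warmup >= 0 && recorded >= 0);
    float x = x0;
    for (int i = 0; i < warmup; ++i) {
        x = r * x * (1.0f - x);
    }
    std::vector<float> out;
    out.reserve(recorded);
    for (int i = 0; i < recorded; ++i) {
        x = r * x * (1.0f - x);
        out.push_back(x);
    }
    return out;
}
```

For r = 2 the recorded iterates all sit at the fixed point 0.5, while for r = 3.2 they alternate between two values. This period doubling is what the bifurcation diagram displays as r varies.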

Implementation – Logistic Map, C++

        // timings file
        ofstream timef("timings.txt", fstream::app);
        // number of times to perform map (for benchmarking)
        const int N_TIMES = 10;
        for (int setN = 0; setN < nSets; ++setN) {
            timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;
            clock_t start = clock();
            for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
                // Iterate the map
                for (int idx = 0; idx < nEl; ++idx) {
                    float r = rL[setN] + (rR[setN] - rL[setN])*idx/nEl;
                    float x = 0.4f;
                    // warmup
                    for (int i = 0; i < warmup; ++i) {
                        x = r*x*(1-x);
                    }
                    // plotted iterates
                    for (int i = 0; i < maxIter; ++i) {
                        mat[maxIter*idx + i] = x = r*x*(1-x);
                    }
                }
            }

Implementation – Logistic.cl

    // number of r-values (columns); must match nEl in the host code
    #define N_EL 1024

    // calculates the logistic function
    __kernel void logistic(__global float* mat, float rL, float rR,
                           int warmup, int maxIter) {
        // one work-item per r-value
        int idx = get_global_id(0);
        // r of the map to iterate on
        float r = rL + (rR - rL)*idx/N_EL;
        float x = 0.4f;
        // warmup
        for (int i = 0; i < warmup; ++i) {
            x = r*x*(1-x);
        }
        // plotted iterates
        for (int i = 0; i < maxIter; ++i) {
            mat[maxIter*idx + i] = x = r*x*(1-x);
        }
    }

Implementation – Logistic Map, OpenCL

        // get platforms
        cl_uint nPlatforms = 0;
        cl_platform_id *platforms = NULL;
        vector<cl::Platform> platformList;
        cl::Platform::get(&platformList);
        ofstream timef("timings.txt", fstream::app);
        string vendor;
        platformList[0].getInfo((cl_platform_info) CL_PLATFORM_VENDOR, &vendor);
        timef << "Platform by: " << vendor << endl;

        // get context
        cl_context_properties cprops[3] =
            {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[1])(), 0};
        cl::Context ctx(CL_DEVICE_TYPE_ALL, cprops, NULL, NULL, &status);
        cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
        checkErr(status, "Buffer()");

        // get devices
        vector<cl::Device> devices;
        devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
        timef << "# of devices: " << devices.size() << endl;
        cl::Device device = devices[0];
        string devName;
        device.getInfo(CL_DEVICE_NAME, &devName);
        timef << "Device Name: " << devName << endl;

        // get program
        ifstream f("logistic.cl");
        std::string progStr(istreambuf_iterator<char>(f),
                            (istreambuf_iterator<char>()));
        cl::Program::Sources source(1,
            std::make_pair(progStr.c_str(), progStr.length()+1));
        cl::Program program(ctx, source);
        status = program.build(devices, "");
        checkErr(status, "Program::build()");

        // get kernel
        cl::Kernel kernel(program, "logistic", &status);
        checkErr(status, "Kernel");

This code is invoked to perform the computation for the OpenCL implementations.

        // load arguments
        status = kernel.setArg(0, buff);
        checkErr(status, "Kernel::setArg(0)");
        status = kernel.setArg(3, warmup);
        checkErr(status, "Kernel::setArg(3)");
        status = kernel.setArg(4, maxIter);
        checkErr(status, "Kernel::setArg(4)");

        // calculate over sets
        for (int setN = 0; setN < nSets; ++setN) {
            status = kernel.setArg(1, rL[setN]);
            checkErr(status, "Kernel::setArg(1)");
            status = kernel.setArg(2, rR[setN]);
            checkErr(status, "Kernel::setArg(2)");
            timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;
            cl::CommandQueue queue(ctx, device, 0, &status);
            checkErr(status, "CommandQueue()");
            const int N_TIMES = 10;
            clock_t start = clock();

            for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
                // enqueue kernel
                cl::Event event;
                status = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                    cl::NDRange(nEl), cl::NullRange, NULL, &event);
                checkErr(status, "enqueue()");
                // wait for kernel to finish
                event.wait();
                // read buffer to memory (blocking)
                status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize, mat,
                                                 NULL, NULL);
                checkErr(status, "Read()");
            }

Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.

            // Output timings
            clock_t end = clock();
            double sec = (end - start)/ (double) CLOCKS_PER_SEC;
            timef << "Time: " << sec << endl;
            timef << "Per iter: " << sec / (double) N_TIMES << endl;

            // Create image (a negative height requests a top-down bitmap)
            CImage im;
            im.Create(nEl, -height, 24);
            // White out the image
            for (int i = 0; i < nEl; ++i) {
                for (int j = 0; j < height; ++j) {
                    im.SetPixelRGB(i, j, 255, 255, 255);
                }
            }

            // Plot the iterates
            for (int i = 0; i < nEl; ++i) {
                for (int j = 0; j < maxIter; ++j) {
                    // the kernel stores the iterates of column i contiguously
                    float u = mat[maxIter*i + j];
                    if (xL <= u && u < xR) {
                        // only plot u if in range; scale [xL, xR) to [0, height)
                        u -= xL;
                        im.SetPixelRGB(i, height - 1 - (int)(u*height/(xR - xL)), 0, 0, 0);
                    }
                }
            }

            // Save the image
            ostringstream strm;
            strm << "logistic" << setN << ".bmp";
            im.Save(strm.str().c_str());
        }
        timef.close();
    }

Methodology, Results, Analysis

Methodology. Results. Analysis.

Methodology:

The performance of each implementation for each chaotic phenomenon will be measured by timing it.
◦ Each implementation will be timed over 10 runs for each set.
◦ Each set is a 2-D region (Mandelbrot Set) or a 1-D interval (logistic map).
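The timing procedure (run the workload a fixed number of times and report the mean) can be written as a small reusable helper. This sketch swaps in `std::chrono::steady_clock` for the `clock()` calls used in the project code, because `clock()` reports processor time on some platforms and wall-clock time on others. The function name `meanSecondsPerRun` is an assumption.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` nTimes and returns the mean wall-clock time per run, in
// seconds. `work` is any callable taking no arguments.
// (Illustrative helper; the project code uses clock() instead.)
template <typename Work>
double meanSecondsPerRun(Work work, int nTimes) {
    assert(nTimes > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nTimes; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;
    return elapsed.count() / nTimes;
}
```

Timing all runs together and dividing, as the project code also does, averages out the per-run noise, at the cost of hiding any one-time start-up overhead inside the mean.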

The mean times will be compared and graphed.

Afterwards, accuracy will be determined.
◦ Generated graphs will be compared visually and numerically.
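A numerical comparison of two Mandelbrot runs can be as simple as counting the cells whose iteration counts differ. The helper below is a sketch of that idea; the name `countMismatches` is hypothetical, and the slides do not say how the project's own comparison was implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the positions at which two iteration-count matrices (stored as
// flat vectors of equal length) disagree. A result of 0 means the two
// implementations produced identical output.
// (Illustrative helper; not part of the project code.)
std::size_t countMismatches(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Exact equality is a reasonable test for integer iteration counts. For the floating-point iterates of the logistic map, a tolerance-based comparison would be the analogous check.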

Hardware used:
◦ Intel Core i5 Dual-Core M460 @ 2.53GHz
◦ AMD Radeon Mobility HD 5145 (ATI RV710)

Results:

Mandelbrot Set. Logistic Map.

Results – Mandelbrot Set:

[Chart: Sets 1 to 5 on the horizontal axis; vertical axis from 0 to 8; series: Plain C++; OpenCL, GPU; OpenCL, CPU]

Results – Mandelbrot Set:

Performance
◦ The OpenCL implementations are faster than plain C++ by roughly 10 times.
◦ OpenCL running on the CPU and on the GPU both take less than a second.
◦ This is an order-of-magnitude difference in runtime.
◦ OpenCL running on the GPU runs in ¾ the time of OpenCL on the CPU.
◦ This is less of a difference than expected (more on this later).

Accuracy
◦ To be determined by both visual comparison and numerical comparison of the generated visualizations.
◦ Visualizations follow on the next slides.

Mandelbrot Set – Set 1 - C++

Mandelbrot Set – Set 1 – OpenCL, GPU

Mandelbrot Set – Set 1 – OpenCL, CPU

Mandelbrot Set – Set 2 - C++



Mandelbrot Set – Set 3 – C++


Mandelbrot Set – Set 3 – OpenCL , CPU







Mandelbrot Set – Results:

Accuracy
◦ Visual comparison gives no apparent difference.
◦ Numerical comparison confirms this: no difference in the number of iterations.
◦ Perfect accuracy.

Logistic Map – Results:

[Chart: Sets 1 to 4 on the horizontal axis; vertical axis from 0 to 8; series: Plain C++; OpenCL, GPU; OpenCL, CPU]


Performance
◦ Similar results to the Mandelbrot Set.
◦ Plain C++ takes roughly 7 seconds.
◦ OpenCL running on the CPU and on the GPU both take less than a second.
◦ This is an order-of-magnitude difference in runtime.
◦ OpenCL running on the GPU runs in ½ the time of OpenCL on the CPU.
◦ This is a greater difference, but still less than expected (more on this later).

Accuracy
◦ Also to be determined by both visual comparison and numerical comparison of the generated visualizations.
◦ Visualizations follow on the next slides.

Logistic Map – Set 1 – C++

Logistic Map – Set 1 – OpenCL, GPU

Logistic Map – Set 1 – OpenCL, CPU










Logistic Map – Results:

Accuracy
◦ Once again, no noticeable difference can be observed visually.
◦ Numerical comparison also confirms this.

Analysis:

For both chaotic phenomena investigated, an order-of-magnitude difference in speed was observed between the OpenCL and plain C++ implementations.

Also, no visible difference in accuracy was found.

Thus, OpenCL can be considered an excellent way to boost performance.

For both chaotic phenomena, the GPU was faster than the CPU.

However, this difference is smaller than expected, considering the parallelism of the problem and of GPUs.

Possible explanation:
◦ Bus transfer between the CPU and the GPU takes too much time.

Possible solution:
◦ Increase the workload of the kernel so as to minimize the required transfer.

CPU performance in and of itself is remarkable.
◦ The improvement over plain C++ is an order of magnitude, but only a dual-core CPU was used.
◦ Expected improvement: factor of 2.
◦ Actual improvement: factor of 10.

Possible explanation:
◦ Excellent optimization by the Intel OpenCL driver.

Conclusions:

Summary. Contributions. Future Work.

Summary:

GPGPU provides access to massive parallelism.
◦ But only data parallelism.
◦ This is due to the GPU architecture being specialized for massive data parallelism.

OpenCL gives us easy access to GPGPU.
◦ Along with parallelization for CPUs and embedded devices.

Chaotic phenomena require large amounts of computation.
◦ However, this computation is usually very data-parallel.
◦ Prime examples: the Mandelbrot Set and the bifurcation diagram of the logistic map.

OpenCL was used to investigate how useful GPGPU can be for the investigation of chaotic phenomena.
◦ Roughly 10x improvement over plain C++, even for CPU-driven OpenCL.
◦ GPU-driven OpenCL is still faster than CPU-driven OpenCL.

Analysis of benefits and complications from using OpenCL:
◦ Speed
◦ Accuracy
◦ Code complexity

Contributions:

This project shows that OpenCL can be used to greatly speed up computations for the investigation of chaotic phenomena,
◦ and, in general, the computation of highly data-parallel work.

OpenCL can be used regardless of whether a GPU is available.
◦ OpenCL can be used to parallelize serial implementations for the CPU.
◦ These still show massive improvements.

Future Work:

Increase the workload of the kernel, thus reducing the data transfer required and the latency incurred.
◦ Data transfer to the GPU is very slow.
◦ It may have to contend with other bus users.
◦ Even without data, latency is high (off-chip).

Investigate using highly optimized code for both OpenCL and C++.
◦ This gives a more realistic comparison between OpenCL and C++.
◦ However, it may accidentally lead to optimizing C++ more than OpenCL, or vice versa.

Investigate other chaotic phenomena.
◦ Lorenz strange attractor
◦ Burning ship fractal
◦ Mandelbar fractal
◦ In general, highly data-parallel work.

References:

These slides contain material developed and copyrighted by Gita Alaghband (UC Denver).
http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx

[1] Alligood, K. T., Sauer, T., and Yorke, J. A. Chaos: An Introduction to Dynamical Systems. New York City, NY: Springer-Verlag, 1997. Print.
[2] “AMD Accelerated Parallel Processing SDK.” AMD Developer Central. AMD, n.d. Web. 6 Mar 2012.
[3] Devaney, Robert L. An Introduction to Chaotic Dynamical Systems, 2nd ed. Boulder, CO: Westview Press, 2003. Print.
[4] Garcia, V., E. Debreuve, and M. Barlaud. “Fast k nearest neighbor search using GPU.” In Proceedings of the CVPR Workshop on Computer Vision on GPU, 2008. Print.
[5] Harrison, Owen, and John Waldron. “AES on SM3.0 compliant GPUs.” In Proceedings of CHES 2007. Print.
[6] “Intel® OpenCL SDK.” Intel Visual Computing Source. Intel Corporation, n.d. Web. 6 Mar 2012.
[7] Mancheril, Naju. “GPU-based Sorting in PostgreSQL.” Thesis, School of Computer Science, Carnegie Mellon University. Print.
[8] Milnor, John W. Dynamics in One Complex Variable. 3rd ed. Annals of Mathematics Studies 160. Princeton, NJ: Princeton University Press, 2006.
[9] “OpenCL.” Nvidia Developer Zone. Nvidia, n.d. Web. 6 Mar 2012.
[10] “OpenCL.” OpenCL. Khronos Group, n.d. Web. 6 Mar 2012.
[11] Scarpino, Matthew. OpenCL in Action. Greenwich, CT: Manning Publications, 2011. Print.
[12] Strogatz, Steven. Nonlinear Dynamics and Chaos. Perseus Publishing, 2000.
[13] Vasiliadis, Giorgos, et al. “GrAVity: A Massively Parallel Antivirus Engine.” In Proceedings of RAID 2010. Print.
[14] Vasiliadis, Giorgos, et al. “Regular Expression Matching on Graphics Hardware for Intrusion Detection.” In Proceedings of RAID 2009. Print.
[16] “CUDA Zone.” Nvidia Developer Zone. Nvidia, n.d. Web. 27 Mar 2012.
[17] “Next Generation CUDA Architecture, Code Named Fermi.” Nvidia, n.d. Web. 27 Mar 2012.
[18] Friedrichs, M. S., et al. “Accelerating Molecular Dynamic Simulation on Graphics Processing Units.” Journal of Computational Chemistry 30 (6): 864–72, 2009. Web. 27 Mar 2012.
[19] Pande, Vijay, and Stanford University. “Folding@home.” Stanford, California: Stanford University, 2012. Web. 27 Mar 2012.
[20] Pande, Vijay, and Stanford University. “Folding@home team stats pages.” Stanford, California: Stanford University, 2012. Web. 27 Mar 2012.
[21] Fung, et al. “Mediated Reality Using Computer Graphics Hardware for Computer Vision.” In Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002), Seattle, WA, 7-10 2002, 83-89. Web. 27 Mar 2012.
[22] Harris, Mark. “Mapping computational concepts to GPUs.” In ACM SIGGRAPH 2005 Courses (Los Angeles, California, 31 July – 4 August 2005). J. Fujii, Ed. SIGGRAPH '05. ACM Press, New York, NY, 50. Web. 27 Mar 2012.
[23] “About the Khronos Group.” Khronos Group, n.d. Web. 27 Mar 2012.
[24] Fang, Jianbin, et al. “A Comprehensive Performance Comparison of CUDA and OpenCL.” In Parallel Processing (ICPP), 2011 International Conference on, 13-16 Sept. 2011. Web. 27 Mar 2012.
[25] Jaaskelainen, P. O. “OpenCL-based design methodology for application-specific processors.” In Embedded Computer Systems (SAMOS), 2010 International Conference on, pp. 223-230. Web. 27 Mar 2012.
[26] Li, T. Y., and Yorke, J. A. “Period Three Implies Chaos” (PDF). American Mathematical Monthly 82 (10): 985–92, 1975. Web. 27 Mar 2012.
[27] Farber, Rob. “Cuda, Supercomputing for the Masses: Part 17.” In Dr. Dobb’s, 14 Apr 2010. Web. 27 Mar 2012.
[28] Weisstein, Eric W. “Logistic Map.” From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/LogisticMap.html
[29] “Mandel zoom 01 head and shoulder.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[30] “Mandel zoom 00 mandelbrot set.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[31] “TwoLorenzOrbits.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[32] “Logistic Bifurcation map High Resolution.png.” Wikimedia Commons. Web. 27 Mar 2012.

http://www.rationalsys.com/robertpirsig.html


Questions?
