
LITERATURE REVIEW:

Parallel Computational Geometry on Multicore Processors

A Review of Basic Algorithms

Hamidreza Zaboli
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6
[email protected]

October 13, 2008

1 Introduction

Parallel computing has long been one of the most important topics in computer science, as well as a way to satisfy the growing need for computation speed. Although parallel computing has been an area of interest for decades, it is now becoming popular and useful among ordinary users. In the past, the area was out of reach for home users because of the high cost of installing a parallel machine; the large space needed to install such systems, the lack of efficient algorithms, and similar obstacles kept the area popular only among computer scientists. Today, as the cost of producing parallel systems with various architectures falls, such machines can be found in every PC. The Intel multicore processor is one of the most common parallel processors in current desktops and laptops. Easy access to such simple parallel processors is driving experts to redesign and revise old scientific algorithms and software architectures that were originally designed for sequential processing. In fact, to keep today's computations efficient, most of the basic algorithms that are optimal for single-core processors must be modified; examples are sorting algorithms, searching algorithms, matrix operations, and graph algorithms. As a result of revising the basic algorithms, advanced algorithms in different scientific areas, which use the basic algorithms as small portions of their bodies, need to be revised as well. Clearly, scientific algorithms in the fields of computational geometry, image processing, bioinformatics, and so on use the basic algorithms mentioned above to solve bigger problems in their own fields. Therefore, because of the widespread use of parallel processors, especially multicore processors, and in order to keep their use efficient, it is necessary to redesign both basic and advanced algorithms in different scientific fields. In this project, we review basic and advanced problems of computational geometry on parallel machines. The parallel platforms we chose are current multicore platforms: the Cell-BE of IBM, Sony, and Toshiba, and the multicore processors of Intel.

In past years, researchers have done a great deal of work on parallel computational geometry, mostly on expensive, large parallel machines and sometimes on virtual, theoretical parallel models such as the PRAM. Today, with small parallel processing


units on PCs (multiprocessors and multicores), it has become important to revise the algorithms of computational geometry in order to make them optimal for multicore processors.

Thus, in this work, we start by reviewing current work on these processors; we are especially interested in basic algorithms that are widely used in computational geometry, such as list ranking and sorting. Because the multicore processors mentioned above are new, there are only a few works on them. In this survey, we review these works and point out related useful work.

In addition, we give a quick review of work on basic and advanced computational geometry algorithms on the graphics processing unit (GPU). This is interesting because the GPU is a helpful co-processor that can take over a significant portion of the computation, especially graphic and geometric computation. Current GPUs are becoming huge SIMD (single instruction, multiple data) processors inside PCs. Though they were originally designed to perform graphic computations on the pixels of the output image, their high capacity for parallel computation and the increasing number of simple processors inside a GPU have made them interesting parallel machines for researchers. As illustrated in Fig. 1, GPUs are large parallel processors compared to current parallel CPUs. For example, one of the most powerful recent Cell processors (the Cell Broadband Engine) has 8 parallel cores (Fig. 2), which is few compared to an Nvidia GeForce 8 GPU with 128 simple thread processors. Although the computation power of a single core of a multicore processor is much greater than that of a single GPU core, the number of parallel streams that can be computed on a GPU is many times larger than on a multicore processor. Especially when computations can be SIMDized, the memory-access workload of a multicore processor is time-consuming compared to that of a GPU. Besides their hugely parallel architecture, GPUs have higher bandwidth to main memory. Note that even in current multicore processors, data transfer between CPU and main memory and the available bandwidth are problematic issues that experts are forced to deal with.

These advantages of GPUs over CPUs in parallel computation create a good opportunity to transfer some problems and computations from CPUs to GPUs, even problems unrelated to graphic and geometric computation. On the other hand, transferring and solving problems, especially computational geometry problems, on GPUs necessitates revising and assessing some basic algorithms that are widely used in advanced computational geometry problems. In this literature review, we are interested in basic and advanced computational geometry problems, and we present a brief review of them.

Fig. 1. The architecture of an Nvidia GeForce 8 GPU. Each group of 8 thread processors shares a local memory of 16 KB [1].


Fig. 2. The architecture of a Cell processor with 1 PPE and 8 SPEs[2].

Afterwards, as a perspective of the project, we will look for possible optimizations, modifications, and solutions applicable both to the basic algorithms studied and to advanced problems of computational geometry on the multicore platform. We expect to find an encouraging variety of computational geometry problems that can be made to work more efficiently on multicore platforms.

2 Literature Review

In this section, we start with a general review of the studies on multicore platforms and address related work. Next, we concentrate on the basic algorithms and problems of interest, i.e., list ranking and sorting on multicore platforms, and present and discuss our selected works in detail. We assume the reader is familiar with the importance of these problems in the field of computational geometry; as a hint, we point out that most graph and tree problems cannot be solved without knowledge of sorting and list ranking. Please refer to [3] for more information about the use of sorting and list ranking algorithms in different computational geometry problems.

2.1 Hardware-Related Optimization Techniques on Multicore Processors

In recent years, with the introduction of multicore processors, researchers have focused on these processors and studied their various properties, trying to determine their impact on the execution of different problems and algorithms. Because multicore architectures are new, researchers have, as is traditional, first tried to introduce them and the special characteristics that may affect algorithms and various properties of scientific computation, such as computational complexity, communication complexity, execution time, and I/O loads. For example, K. Nyberg in [4] and V. Pankratius et al. in [5] introduce the opportunities that the multicore architecture provides for a new generation of algorithms and software. K. Nyberg in [4] discussed the old multi-tasking scheme of the Ada programming language and tried to renovate Ada's multi-tasking programming environment with respect to the new multicore architecture; the paper tests the impact of Ada multi-tasking on multicore. General-purpose applications and aspects of software engineering on multicore, on the other hand, are presented in [5]. That paper considers four case studies of general-purpose applications coded in different programming languages (C++ with OpenMP, Java, C#) and shows the results of running them on two multicore platforms manufactured by Intel and Sun Microsystems.

In addition to basic concepts of algorithms and software systems on the new parallel architecture, i.e., multicore, an important area that has caught the attention of researchers is how to find and design methods that help achieve better performance on multicore.


There are some reports on the tricks and techniques applied to algorithms to enhance performance on multicore. Most of these approaches try to hide the latency of memory accesses behind DMA requests in order to keep all cores of the processor busy. Generally speaking, these techniques try to use all resources of the machine (the multicore processor as well as memory, I/O, and the GPU) as much as possible. M. Rafique et al. in [6] presented an approach to dealing with I/O-intensive workloads on the Cell processor. The paper overlaps the time spent getting/putting data from/to memory using a technique called asynchronous prefetching: the blocks of data that will be processed in the next iterations of the execution are requested in advance, while the cores of the Cell work on the current data (which was itself prefetched in previous iterations). Briefly, the paper presents studies on the following items:

1. the I/O path in the Cell processor
2. I/O tricks and techniques applicable to the Cell processor
3. evaluation of data prefetching techniques on the Cell processor

A fairly popular prefetching technique is double buffering, which has been discussed in [7]; we study that paper in detail in the next section. A thread-to-core assignment method has been proposed in [8]. In the parallel execution of a parallel code, the code is divided into parallel parts so that each part can be run by a thread, and every thread is then sent to a core. The paper uses the fact that each thread, depending on its code section, needs certain resources, while each core of the CPU at any given moment has access to only a subset of the resources. The method tries to assign threads to cores so that each allocated thread-core pair is the best match at that moment and the core can provide the thread with the required resources; moreover, fairness and overall throughput are taken into account. The method relies on the observation that most codes have several execution phases, each with approximately similar runtime characteristics compared to the other phases; if we can approximately predict the similarity among program phases, we can use this information to assign threads to cores optimally [8]. M. Chu et al. in [9] and Anderson in [10] proposed similar methods in that they partition data-access requests over cores and their associated local stores, instead of partitioning the execution of the code. L. Liu et al. in [11] show that if the number of cores exceeds a certain threshold, performance degrades due to the bandwidth problem between CPU and memory. As can be seen in Fig. 3, Intel's multicore processor has 4 cores connected to main memory through a memory controller hub (MCH), which also provides a connection for data transfer between main memory and other devices. As the number and volume of main-memory accesses grow, the MCH becomes a bottleneck [11]. The paper defines a measure called "memory access intensity" and uses it to determine the mentioned threshold.

Fig. 3. Block diagram of a 4-core Intel multicore processor[11].


2.1.1 Double-Buffering Technique on Multicore

In this section we briefly review an efficient technique called double buffering, which is used for data prefetching by the cores of multicore processors in order to hide data-transfer latency. Double buffering is a special case of a more general technique called multi-buffering. Double buffering and multi-buffering are popular techniques with applications in many areas of computer science and engineering, both hardware- and software-related. Double buffering is interesting because it is the most effective and suitable form of multi-buffering given the small local stores of the cores of multicore processors [7]; for example, the local store available to each core of the Cell processor is only 256 KB. For more information on multi-buffering techniques on multicore, including single, double, and triple buffering, please refer to [12]. Note that using more buffers avoids most of the complexities of double buffering, but it needs more memory, which is undesirable on multicore processors [7]. Double buffering has recently been used in basic algorithms on multicore platforms; for example, B. Gedik et al. in [2] and D. Bader et al. in [13] used double buffering to improve the performance of sorting and list ranking algorithms on multicore platforms, in particular to hide memory-access latency.

In what follows we give a brief review of a recent work by J. Sancho et al. in [7] that analyzed the impact of double buffering on two multicore processors: the Cell processor of IBM, Sony, and Toshiba, and the Quad-core Opteron of AMD. Consider the two major operations performed when data must be transferred from the cores of the CPU to main memory and in the reverse direction; we call these operations "put transfer" and "get transfer", respectively [7]. We place two separate buffers in the local store of each core, one allocated for computation and the other for getting/putting data from/to main memory. Data transfers are performed on one buffer while the other buffer is used for computation [7]. This model is shown in Fig. 4. Double buffering is possible and useful for computations in which it is possible to know or predict what data will be needed in the next iterations. Using this method, data transfers are overlapped with computation and, consequently, the latency of data transfer is hidden.
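As a concrete illustration of this model, the following C++ sketch shows the core double-buffering loop. The DMA interface (dma_get_async, dma_put_async, dma_wait) and the compute kernel are hypothetical placeholders standing in for, e.g., the Cell SDK's DMA calls; the sketch shows only the overlap structure, not a real API.

    // Hypothetical asynchronous DMA interface (assumed for this sketch only).
    int  dma_get_async(float* local_dst, const float* mem_src, int n); // returns a tag
    int  dma_put_async(float* mem_dst, const float* local_src, int n);
    void dma_wait(int tag);              // block until the tagged transfer is done
    void compute(float* buf, int n);     // user-supplied computation kernel
    const int BLOCK_MAX = 4096;

    // Double buffering: while block i is computed in one buffer, block i+1
    // is fetched into the other, hiding the get-transfer latency.
    void process_all_blocks(float* mem, int num_blocks, int block_size) {
        float buf[2][BLOCK_MAX];
        int tag[2] = {0, 0};
        tag[0] = dma_get_async(buf[0], mem, block_size);  // prefetch block 0
        for (int i = 0; i < num_blocks; ++i) {
            int cur = i % 2, nxt = (i + 1) % 2;
            if (i + 1 < num_blocks)                       // start get of block i+1
                tag[nxt] = dma_get_async(buf[nxt], mem + (i + 1) * block_size,
                                         block_size);
            dma_wait(tag[cur]);                           // ensure block i arrived
            compute(buf[cur], block_size);                // overlap: compute block i
            dma_put_async(mem + i * block_size, buf[cur], block_size);
            // (a full implementation would also wait on the put tag
            //  before reusing this buffer two iterations later)
        }
    }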

Experiments were performed on the two multicore processors mentioned above. In the first experiment, on the Quad-core Opteron, the hardware prefetcher is used to prefetch data from main memory into the L1 cache.

Fig. 4. Double-buffering model for the local store of a core: one buffer is designated for computation and the other for getting/putting data from/to memory [7].


To do this, the hardware prefetcher watches memory accesses during a set of iterations, predicts the memory blocks or lines most likely to be requested in the next iterations, and prefetches them. For example, if two consecutive accesses to memory occur in blocks n and n+1, the hardware prefetcher prefetches block n+2. Although the hardware prefetcher works automatically and needs no particular attention, it does not have the flexibility of software prefetching. The Cell processor, on the other hand, has dedicated DMA engines to transfer data between the cores of the Cell processor (called SPEs) and main memory. These DMA engines can be controlled by the user, and they can transfer data while the SPEs are processing other data. The following plots illustrate the elapsed time and the speedup achieved with the double-buffering technique in comparison to using a single buffer. Fig. 5 shows the results of double buffering on 4 Quad-core Opteron processors working in parallel [7]. Fig. 5(a) shows the execution time of the double-buffering technique compared to single buffering as the number of cores increases; note that the Y axis is on a logarithmic scale. As can be seen, the execution time of both techniques decreases as the number of cores increases. Fig. 5(b) shows the speedup of double buffering over single buffering for two prefetching cases, stride-1 and stride-1K, where the stride is the distance between two consecutive blocks of the data being prefetched from memory.
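The next-block rule described at the start of the preceding paragraph can be written down directly; this toy C++ model (an illustration only, not the Opteron's actual predictor logic) simply continues the observed stride:

    // Toy stride predictor: observing consecutive accesses to blocks n and n+1
    // (stride 1), it predicts block n+2 as the next prefetch target.
    long predict_next_block(long prev_block, long last_block) {
        long stride = last_block - prev_block;  // 1 for the n, n+1 example
        return last_block + stride;             // prefetch candidate: n+2
    }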

Fig. 5. (a) Execution time of the double- and single-buffering schemes as the number of cores increases, on 4 Quad-core Opterons; (b) speedup of the double-buffering technique over the single-buffering technique with two strides, as computation intensity increases [7].

Finally, Fig. 6 compares the execution times of the double-buffering technique on the Cell processor and the Quad-core Opteron as computation intensity increases. As can be seen, the Quad-core Opteron outperforms the Cell processor; as the authors state, this superiority is due to its higher aggregate memory bandwidth and higher peak processing rate [7]. Although the Opteron's advantage over the Cell shrinks as computation intensity increases, it remains ahead: with a computation intensity of 1, the Opteron operates about 6 times faster than the Cell, while with a computation intensity of 20 it is only about 2 times faster. Please refer to [7] for more information on computation intensity and how it is measured by the authors.

2.2 Basic Problems and Algorithms on Multicore Processors

Up to this point, we have reviewed studies on multicore processors that are not about algorithms but about hardware aspects and techniques that can improve the performance and speed of execution.


Fig. 6. Execution times of the double-buffering technique on the Quad-core Opteron and the Cell processor [7].

As mentioned earlier, because multicore processors are new, there are not many such works; in fact, works on different problems and algorithms on multicore are few, and they concern basic methods and algorithms such as sorting, matrix multiplication, dynamic programming, and list ranking. Another important issue is the compilers and programming environments necessary for executing and evaluating algorithms and codes on multicore. Unfortunately, there is no dedicated compiler for multicore platforms so far, and current programming and execution are done in traditional environments originally designed for sequential processing. At best, the programming environment is a modified or optimized compiler or programming language that may not achieve the highest possible performance of the multicore processor. Some of these environments are MPI, OpenMP, and Cilk. MPI is a message-passing interface, usable from C++, designed for programming in message-passing parallel models. Nevertheless, A. Kumar et al. in [14] studied the feasibility of using MPI for programming on the Cell multicore processor. OpenMP is a directive-based extension of C++ designed for shared-memory parallel architectures. As can be seen in the architecture of multicore processors, and as treated so far in recent research and projects, multicore processors fall mostly into the category of shared-memory architectures; therefore, using OpenMP for programming on multicore is preferable. In addition to the mentioned environments, a new extension of C++ called Cilk has been developed for programming on multicore platforms; please refer to [15] for more information.

Other instances of basic algorithms on multicore are dynamic programming and matrix operations, discussed in [16] and [17], respectively. The first evaluates three classes of dynamic programming algorithms on multicore: local-dependency dynamic programming, the Gaussian elimination paradigm, and the parenthesis problem, with experiments performed on an 8-core AMD Opteron. The second evaluates linear-algebra matrix operations on multicore.

Recently, new work on the problem of sequence alignment has been performed on the Cell processor. Although this problem is not an advanced problem, it is considered one level above the basic algorithms mentioned so far; it seems that researchers are gradually moving from basic algorithms toward complex, specialized algorithms that have applications only in a particular scientific field. A. Sarje et al. in [18] presented two advanced alignment techniques, called spliced alignment and syntenic alignment, implemented on the Cell


processor. These alignment techniques are used in biological applications. Experimental results show speedups of about 4 on the Cell processor compared to serial algorithms on the Cell and Pentium 4 processors [18].

2.3 Sorting Algorithms on Multicore Processors

In this section we give a brief review of sorting algorithms on multicore processors. There is some work on big parallel machines and multiprocessors, which can be found in [19]. However, as we have seen so far, only a few works have studied the optimization of sorting algorithms on multicore [2], [20]. In most cases, the routine for designing a sorting algorithm for a multicore processor is as follows. First, a basic sorting algorithm that seems effective and efficient for parallel sorting is chosen. Then it is used as the kernel of the whole sorting process and may be modified for better performance. Finally, some optimizations are applied to the whole process for a more effective implementation; multi-buffering and memory-latency hiding are popular ones. In the following, we discuss the CellSort algorithm by B. Gedik et al., which uses a bitonic sorting kernel to build a sorting process on the Cell processor. More information about this algorithm can be found in [2].

2.3.1 CellSort: A sorting algorithm on the Cell Processor

CellSort, as the authors state in their report, is based on a distributed bitonic merge with a SIMDized bitonic sorting kernel [2]. The sorting process has three levels. At the innermost level, called single-SPE local sort, the data items local to one core of the Cell processor are sorted using only that core; we refer to the cores of the Cell processor as SPEs. At the second level, the data items stored in the local stores of the SPEs are sorted using a distributed bitonic sort. This level is called in-core sort because sorting at this level is performed inside the Cell processor, between SPEs or inside a single SPE; the SPEs are connected by a bus called the Element Interconnect Bus (EIB). The architecture of the Cell processor, with its SPEs, PPE, and EIB, is shown in Fig. 2. In practice, we need to sort data that does not fit in the local stores of the Cell processor, so the sorting algorithm must be able to handle such situations. CellSort handles this by moving data back and forth between the Cell processor and main memory; as the authors state, this level is called distributed out-of-core sort. It is very costly and time-consuming because of the data transfer between the Cell processor and main memory.

Single-SPE local sort is a simple bitonic sort applied to the data items in the local memory of an SPE. Fig. 7 shows a simple scheme of bitonic sort [2]. Given a list of n items, bitonic sort starts with lists of size 1 and merges them at each stage, doubling the size of the lists, until it merges two sorted bitonic lists of size n/2. This bitonic merging process is shown in Fig. 7. The computational complexity of simple bitonic sort for m items is O(m·log²m).
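For reference, a compact sequential version of this kernel is shown below (a textbook power-of-two bitonic sort in C++, not CellSort's tuned SPE code); the two nested loops over size and stride correspond to the merge stages of Fig. 7.

    #include <algorithm>
    #include <vector>

    // Sequential bitonic sort for arrays whose length is a power of two.
    // Outer loop: size of the bitonic lists being merged (2, 4, ..., n).
    // Inner loop: compare-exchange stages of each bitonic merge.
    void bitonic_sort(std::vector<int>& a) {
        const std::size_t n = a.size();            // must be a power of two
        for (std::size_t size = 2; size <= n; size <<= 1) {
            for (std::size_t stride = size >> 1; stride > 0; stride >>= 1) {
                for (std::size_t i = 0; i < n; ++i) {
                    std::size_t j = i ^ stride;    // compare-exchange partner
                    if (j > i) {
                        bool up = ((i & size) == 0);  // direction of this run
                        if (up ? (a[i] > a[j]) : (a[i] < a[j]))
                            std::swap(a[i], a[j]);
                    }
                }
            }
        }
    }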

One optimization that can be applied to the sorting process is to use the SIMD instructions provided by the SPEs of the Cell processor. Using the SIMD instruction set, two vectors of 4 items each can be compared and sorted with only three SIMD instructions, whereas without them up to 8 operations are needed [2]. Given two vectors of size 4 in the bitonic merging process, the SIMD comparison instruction combined with a selection instruction can put the 4 lower items into one vector and the 4 higher items into the other. These instructions, and the


support the Cell processor provides for using them, improve the whole sorting process and yield a higher speedup. The SIMD instruction set can be used to compare any two vectors with 4 or more data items; even vectors with fewer than 4 items can be rearranged so that SIMD instructions can be applied.
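A sketch of this three-instruction compare-exchange using the Cell SPU intrinsics spu_cmpgt and spu_sel from spu_intrinsics.h (the exact types and usage here are this sketch's assumptions, not code from [2]):

    #include <spu_intrinsics.h>

    // Lane-wise compare-exchange of two 4-int vectors in three SIMD ops:
    // one compare produces a mask, two selects route the minima and maxima.
    // Combined with shuffles (Fig. 8), this is the core step of the kernel.
    void simd_compare_exchange(vec_int4& lo, vec_int4& hi) {
        vec_uint4 gt   = spu_cmpgt(lo, hi);    // mask of lanes where lo > hi
        vec_int4  mins = spu_sel(lo, hi, gt);  // pick hi where lo was larger
        vec_int4  maxs = spu_sel(hi, lo, gt);  // pick lo where lo was larger
        lo = mins;
        hi = maxs;
    }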

Fig. 7. Phases of the bitonic merging process for sorting 8 numbers [2].

For example, consider the last phase of comparisons in Fig. 7, in which comparisons are performed on every two consecutive data items. In this case, SIMD instructions cannot be used directly to compare and swap each item with its neighbor. But it is possible to change the locations of the items so that each item is compared and swapped with its neighbor using SIMD instructions. This can be done by shuffling: Fig. 8 shows a possible shuffling for the last stage of Fig. 7. After shuffling, the SIMD instructions are used, and finally another shuffle is applied to the items.

Fig. 8. Using SIMD instructions for comparing and swapping vectors with fewer than 4 items [2].

After the local sorts, the next level is the distributed in-core sort, in which all SPEs cooperate to sort a vector of size 8m, where m is the number of items in the local store of one SPE. At this level, first every SPE sorts its local data items using the single-SPE local sort, in such a way that each two consecutive SPEs sort their local items in opposite orders (one ascending, the other descending). We can then apply bitonic merges, starting from lists of m items (which produces sorted lists of length 2m) up to the final list of length 8m in the last phase.

Note that when merging the lists of two consecutive SPEs, it is possible to divide the lists so that each SPE compares and swaps only half of the numbers. For example, consider the first phase of merging the SPEs' lists: every SPE holds a list of numbers sorted in the opposite order of its neighboring SPE's list. Therefore, each number of the first list can be compared with its corresponding number in the second list, putting the smaller number in the first list and the bigger number in the second


list. Because the compare-and-swap of each number with its corresponding number is independent of the other numbers, the lists can be divided and assigned to different SPEs. To do so, we split each of the two consecutive lists into two equal halves, as shown in Fig. 9. We assign the first half of the first list to the first SPE, to be compared and swapped with the first half of the second list; similarly, we assign the second half of the first list to the second SPE, to be compared and swapped with the second half of the second list. This way both SPEs stay busy with equal workloads, and each SPE has to transfer only half of its list to the other SPE, which results in fewer communications between SPEs. Fig. 9 shows this optimization for the phase in which each list has length 2m.
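A minimal sketch of this element-wise compare-and-swap (plain C++, with the SPE-to-SPE data transfers left out); each half of the index range can be handled by a different SPE exactly as described above:

    // Compare-split step of a bitonic merge: 'asc' is sorted ascending and
    // 'desc' descending. Afterwards, asc holds the element-wise minima and
    // desc the maxima; iterations are independent, so the range [0, m) can
    // be split between two SPEs.
    void compare_split(int* asc, int* desc, int m) {
        for (int i = 0; i < m; ++i) {
            if (asc[i] > desc[i]) {
                int t   = asc[i];
                asc[i]  = desc[i];
                desc[i] = t;
            }
        }
    }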

Fig. 9. Dividing the lists of SPEs and assigning them to different SPEs, for the phase in which the length of a sorted list is 2m [2].

The last level of bitonic sorting on the Cell processor is the distributed out-of-core sort. At this level, a series of in-core sorts is first performed so that all data items are divided into sorted lists of size 8m, each list sorted in the opposite order of its neighbors. Then the out-of-core bitonic merge is applied to these lists until the final sorted list containing all data items is obtained [2]. The whole bitonic sorting process is thus a distributed bitonic merge comprising many in-core bitonic merges and single-SPE local sorts. Note that the distributed out-of-core merge involves a huge number of very time-consuming data transfers between the SPEs and main memory. To hide this latency, the prefetching described in Section 2.1.1 is used, implemented with the DMA requests available to the SPEs: every SPE can request a DMA to get the (i+1)-th block of data items while sorting the i-th block, and can also put the (i-1)-th block back into main memory. This way, a great deal of the latency is overlapped with the SPEs' computations. The total computational complexity of the whole sorting process, with p SPEs and a total of n data items, is O((n/p)·log²n) [2]. CellSort has been evaluated on an IBM QS20 Cell blade using 16 SPEs, and its results have been compared to the following algorithms on the Cell processor, Intel Xeon, and Intel Pentium 4: simple bitonic sort, SIMDized bitonic sort, shell sort, and quick sort. The maximum number of items in a single SPE is 32K; with 16 SPEs, the maximum number of sorted items is 128M. The evaluation results are reported separately for the three levels: single-SPE local sort, distributed in-core sort, and distributed out-of-core sort. The first two plots, shown in Fig. 10, are the results of sorting 32K numbers using the single-SPE local sort [2]; they show the sorting time of the SIMDized bitonic sort relative to the other algorithms on the Cell and Intel processors. Fig. 10(a) shows the sorting time for integer numbers and Fig. 10(b) for float numbers. As can be seen, the SIMDized bitonic sort achieved the best results.
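The get/sort/put rotation described above can be sketched with three buffers, reusing the hypothetical DMA interface from the double-buffering sketch in Section 2.1.1 (again an illustration of the structure, not CellSort's actual code):

    // Declarations as in the double-buffering sketch (hypothetical API).
    int  dma_get_async(float* local_dst, const float* mem_src, int n);
    int  dma_put_async(float* mem_dst, const float* local_src, int n);
    void dma_wait(int tag);
    void sort_block(float* buf, int n);     // in-core sort of one block
    const int BLOCK_MAX = 4096;

    // At step i: block i+1 is being fetched, block i is being sorted, and
    // block i-1 is still draining to main memory -- three blocks in flight.
    void out_of_core_pass(float* mem, int num_blocks, int bs) {
        float buf[3][BLOCK_MAX];
        int get_tag[3] = {0, 0, 0};
        get_tag[0] = dma_get_async(buf[0], mem, bs);
        for (int i = 0; i < num_blocks; ++i) {
            if (i + 1 < num_blocks)         // start get of block i+1
                get_tag[(i + 1) % 3] = dma_get_async(buf[(i + 1) % 3],
                                                     mem + (i + 1) * bs, bs);
            dma_wait(get_tag[i % 3]);       // block i is now in the local store
            sort_block(buf[i % 3], bs);     // compute on block i
            dma_put_async(mem + i * bs, buf[i % 3], bs);  // drain block i
            // (a full implementation would also wait on put tags before
            //  reusing a buffer three iterations later)
        }
    }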


Fig. 10. Single-SPE local sort: (a) integer numbers, (b) float numbers [2].

The next experiment is the distributed in-core sort. Again, the results of bitonic sort are compared to those of the other algorithms on the processors mentioned above; Fig. 11 summarizes these results. Note that the Y axis shows the time of the other algorithms relative to the time of the in-core bitonic sort. Fig. 11(a) shows the sorting time for integer numbers and Fig. 11(b) for float numbers. Again, the in-core bitonic sort on the Cell processor outperforms the other algorithms; it even shows better results than bitonic sort implemented on the Intel Xeon processor.

Fig. 11. Distributed in-core bitonic sort: (a) integer numbers, (b) float numbers [2].

The last and most comprehensive experiment concerns the distributed out-of-core sort, in which 16 SPEs are used. As in the two previous experiments, the evaluation is performed separately for integer and float numbers, for up to 128M data items, and the results are shown in Fig. 12(a),(b). As the figures show, the distributed out-of-core bitonic sort on the Cell processor achieved the best results among all combinations of algorithms and processors.

Fig. 12. Distributed out-of-core bitonic sort: (a) integer numbers, (b) float numbers [2].


2.4 List Ranking on Multicore

List ranking is one of the most important problems and a prerequisite for many problems in the field of computational geometry. Given a list of nodes, the list ranking problem determines the rank (distance) of each node relative to the first node. As far as we have seen, there is only one study of list ranking on multicore platforms, done by D. Bader et al. in [13]; we briefly present it here and refer the reader to [13] for more information. Assume a list of n nodes, each with a value property and a prefix property, such that the value of each node is 1 and the prefix of each node is the sum of its value and the prefix of its previous node. Under these assumptions, list ranking in [13] is performed as follows: first, the algorithm partitions the input list into s sublists by randomly choosing n/(s−1) nodes; then, the prefix of each node within its sublist is computed; finally, each node computes its final prefix by adding its local prefix to the prefix of the last node of the previous sublist [13]. As in CellSort, prefetching with DMA requests is used to hide the latency of getting data from memory. The total number of DMA requests is O(n), and the computational complexity of the algorithm is O(n/p) per core; thus, on a Cell processor with 8 SPEs, the computational complexity per SPE is O(n/8).
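The following sequential C++ sketch shows the sublist structure of this algorithm (sublist heads, local prefixes, and the carried offset); the Cell-specific parallelization and DMA are omitted, and the data layout is this sketch's assumption:

    #include <vector>

    // Sublist-based list ranking, sequential sketch. next[v] is the successor
    // of node v (-1 at the tail); heads holds the sublist head nodes in list
    // order, heads[0] being the head of the whole list. Every node has value 1.
    std::vector<int> list_rank(const std::vector<int>& next,
                               const std::vector<int>& heads) {
        std::vector<bool> is_head(next.size(), false);
        for (int h : heads) is_head[h] = true;

        std::vector<int> prefix(next.size(), 0);
        int carry = 0;  // prefix of the last node of the previous sublist
        for (int h : heads) {
            // Walk one sublist, stopping at the next sublist's head.
            // (In [13] the sublist walks run in parallel on the SPEs and
            //  the carries are fixed up afterwards; here the phases are fused.)
            int p = carry;
            for (int v = h; v != -1 && (v == h || !is_head[v]); v = next[v]) {
                prefix[v] = ++p;           // add this node's value (1)
            }
            carry = p;
        }
        return prefix;
    }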

Although prefetching can reduce the latency of data transfer, especially in sorting, it cannot help much in list ranking, because the location in main memory of the node following the current one is unknown. In this case, the prefetching technique can only fetch the blocks of data around the current node and make predictions, and even these predictions do not help much. As a result, cache misses occur whenever the next node is not in the currently fetched block of data, and the number of DMA requests, and consequently the amount of data transfer, will be high.

To solve this problem and hide these latencies, multi-threading is used in [13]. Because SPEs do not support hardware multi-threading, a software-managed multi-threading is used, in which many threads, each one handling a sublist, are allocated to each SPE. Whenever a miss occurs, i.e., the next node is not inside the fetched block of data, a thread switch is performed and a DMA request is issued; meanwhile, the SPE is kept busy running the next thread. Threads are assigned to an SPE in a round-robin scheme. Fig. 13 shows the software-managed multi-threading technique. With this technique, as the authors state, there are no stalls if there is a sufficient number of threads.
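The switch-on-miss behavior can be sketched as a round-robin loop over per-sublist cursors; in_local_store, dma_get_node_async, and next_of are hypothetical helpers standing in for the SPE's local-store check, DMA issue, and pointer dereference:

    bool in_local_store(int node);     // hypothetical: is this node fetched yet?
    void dma_get_node_async(int node); // hypothetical: issue a DMA for the node
    int  next_of(int node);            // successor of node, -1 at the tail

    struct Thread { int cur; int prefix; bool done; };

    // Round-robin software multi-threading on one SPE: on a miss, issue the
    // DMA and switch to the next thread instead of stalling.
    void rank_sublists(Thread* th, int num_threads) {
        int remaining = num_threads;
        for (int t = 0; remaining > 0; t = (t + 1) % num_threads) {
            Thread& T = th[t];
            if (T.done) continue;
            if (!in_local_store(T.cur)) {  // miss: node not in the local store
                dma_get_node_async(T.cur); // start the transfer...
                continue;                  // ...and run another thread meanwhile
            }
            T.prefix += 1;                 // process the current node
            T.cur = next_of(T.cur);
            if (T.cur == -1) { T.done = true; --remaining; }
        }
    }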

Fig. 13. Software-managed multi-threading technique: many threads are allocated to each SPE [13].

Experiments were performed on an IBM BladeCenter QS20 with two 3.2 GHz Cell processors, one of which is used for measuring performance. Fig. 14 illustrates the running time of the list ranking algorithm on each SPE; as the figure shows, the running times differ because the lengths of the lists allocated to the SPEs differ. The experiment is performed for two cases: one without software multi-threading and the other


using multi-threading. As can be seen, using multi-threading with 64 sublists (8 sublists allocated to each SPE), the total time required for list ranking is significantly less than without multi-threading.

Fig. 14. Results of list ranking with and without multi-threading, and running times of the SPEs [13].

In another experiment, the proposed list ranking algorithm on the Cell processor is compared to the algorithm on other single and parallel processors, with a list of 8 million nodes. Fig. 15 shows these comparisons for two types of lists: random and ordered. Fig. 15(a) compares the algorithm on the Cell processor with single processors; the Cell processor achieved a smaller running time than all of them. Fig. 15(b) shows the same comparison against other parallel processors: although the Cell processor is not the best in this figure, it performed list ranking comparably to the other parallel processors.

Fig. 15. Comparison of list ranking on the Cell processor with (a) single processors and (b) parallel processors, with 8 million ordered and random nodes [13].

2.5 Graphic Processor Units: Basic and Advanced Problems

As discussed in Section 1, GPUs are efficient parallel processors that can act as co-processors and, in some situations, even achieve higher performance than CPUs, especially for computations that can be SIMDized. GPUs were originally designed for graphic and geometric computations. There is research, both basic and advanced, in the fields of image processing and computational geometry implemented on GPUs, including Delaunay triangulation, biomedical image analysis, and motion estimation. These


problems have been discussed and implemented for the GPU in [21], [22], and [23], respectively. GPU programming needs special environments designed for this purpose; without them, efficient use of the many simple parallel cores of a GPU is not possible. The Compute Unified Device Architecture (CUDA) is one such programming environment, and it has been used to code different algorithms and problems on recent generations of Nvidia GPUs. In the following, we review a sorting algorithm called GPUTeraSort, which is implemented on the GPU and uses bitonic sort as its kernel. For more detail, please refer to [24].

2.5.1 Sorting Algorithms on GPU

The idea behind sorting on GPUs is that GPUs are large parallel vector co-processors, while the disadvantage of CPU-based sorting algorithms is significant cache misses on large datasets [24]. Current GPUs have 10x higher main-memory bandwidth and use data parallelism to achieve 10x more operations per second than CPUs [24]. Many researchers have proposed GPU-based sorting algorithms. T. Purcell et al. in [25] proposed a bitonic sort method on the GPU. N. Govindaraju in [26] proposed a sorting algorithm based on a periodic balanced sorting network, using the texture mapping and blending operations of the GPU. Another recent work by N. Govindaraju in [24], called GPUTeraSort, uses bitonic sort as its sorting algorithm and implements it on the GPU. The implementation handles the sorting of wide keys with huge numbers of records residing in external memory, and it tries to exploit the maximum processing power of both the GPU and the CPU. Most external-memory sorting algorithms pass through two phases: the first produces a set of locally sorted files, and the second sorts all input files globally [24]. External sorting algorithms must deal with I/O communication, because they sort large sets of data and files that cannot fit in main memory. GPUTeraSort has 5 stages [24]; a sketch of the key-pointer idea follows the list:

1. Reader: reads input files into a main-memory buffer. To gain speed, files are divided across parallel disks so that data can be transferred to memory in parallel.

2. Key-Generator: computes (key, pointer) pairs for the records of data in the buffer.

3. Sorter: reads and sorts the key-pointer pairs. This stage is both computation-intensive and memory-intensive, because it must read the pairs from main memory, sort them, and write them back.

4. Reorder: rearranges the buffer according to the sorted key-pointer pairs to generate a sorted output buffer.

5. Writer: writes the output buffer to the parallel disks.

In fact, in GPUTeraSort the sorting stage is performed by the GPU, which can sort at a higher parallel rate than the CPU; in particular, this frees the CPU to achieve higher I/O performance [24]. Besides that, the GPU accesses main memory with a higher bandwidth than the CPU, which provides higher throughput. Fig. 16 compares the total sorting time, as the number of records increases, of GPUTeraSort and some high-performance CPU-based sorting algorithms; the CPU-based algorithms are Quicksort implementations evaluated on three multicore processors, while GPUTeraSort is evaluated on an Nvidia 7900 GTX GPU. As the figure shows, GPUTeraSort's performance is comparable to the CPU-based algorithms, especially when the price of the platform is taken into account [24]. Another comparison, illustrated in Fig. 17, is between GPUTeraSort and the GPU-based algorithms proposed in [25] and [26]: GPUTeraSort outperforms the other GPU-based algorithms, taking less time for sorting as the number of records increases.


Fig. 16. Comparison of the total sorting time of GPUTeraSort to Quicksort implementations on multicore processors, as the number of records increases [24].

Fig. 17. Comparison of the total sorting time of GPUTeraSort to the other GPU-based algorithms proposed in [25] and [26], as the number of records increases [24].

The computational complexity of GPUTeraSort is O((n·log²n)/2) and its communication complexity is O(n), so data transfer takes only about 10 percent of the total sorting time. This is shown in Fig. 18(a), which plots the time required for data transfer against the total sorting time as the number of records increases. Fig. 18(b) shows the time required for each stage of GPUTeraSort on three different GPUs: Nvidia 6800, Nvidia 6800 Ultra, and Nvidia 7800 GT.

Fig. 18. (a) Comparison of data-transfer time with the total sorting time of GPUTeraSort [24]; (b) time required for each stage of GPUTeraSort on three different GPUs.


References

[1] Tom R. Halfhill, Parallel Processing with CUDA, Microprocessor Report, www.MPRonline.com, 2008.

[2] Bugra Gedik, Rajesh R. Bordawekar, Philip S. Yu, CellSort: High Performance Sorting on the Cell Processor, Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1286-1297, Austria, 2007.

[3] M. de Berg, M. van Kreveld, M. Overmars, O. Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer, 1997.

[4] K. Nyberg, Multi-core + multi-tasking = multi-opportunity?, ACM SIGAda Ada Letters, Volume XXVII, Issue 3, pages 79-82, 2007.

[5] Victor Pankratius, Christoph Schaefer, Ali Jannesari, Walter F. Tichy, Software engineering for multicore systems: an experience report, IWMSE '08: Proceedings of the 1st International Workshop on Multicore Software Engineering, May 2008.

[6] M. Mustafa Rafique, Ali R. Butt, Dimitrios S. Nikolopoulos, DMA-based prefetching for I/O-intensive workloads on the Cell architecture, Proceedings of the 2008 Conference on Computing Frontiers, pages 23-32, May 2008.

[7] J. Sancho, D. Kerbyson, Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE, IEEE International Symposium on Parallel and Distributed Processing, pages 1-12, 2008.

[8] Tyler Sondag, Viswanath Krishnamurthy, Hridesh Rajan, Predictive thread-to-core assignment on a heterogeneous multi-core processor, Proceedings of the 4th Workshop on Programming Languages and Operating Systems, 2007.

[9] M. Chu, R. Ravindran, S. Mahlke, Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures, 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 369-380, 2007.

[10] James H. Anderson, John M. Calandrino, Parallel task scheduling on multicore platforms, ACM SIGBED Review, Volume 3, Issue 1 (special issue: the work-in-progress session of RTSS 2005), pages 1-6, 2006.

[11] Lixia Liu, Zhiyuan Li, Ahmed H. Sameh, Analyzing memory access intensity in parallel programs on multicore, International Conference on Supercomputing, pages 359-367, 2008.

[12] T. Chen, Z. Sura, K. O'Brien, K. O'Brien, Optimizing the Use of Static Buffers for DMA on a CELL Chip, Proceedings of the 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), New Orleans, Louisiana, 2006.

[13] D. Bader, V. Agarwal, K. Madduri, On the design and analysis of irregular algorithms on the Cell processor: A case study on list ranking, Proceedings of IEEE IPDPS, 2007.

[14] A. Kumar, N. Jayam, A. Srinivasan, G. Senthilkumar, P. Baruah, S. Kapoor, M. Krishna, R. Sarma, Feasibility study of MPI implementation on the heterogeneous multi-core Cell BE architecture, Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 55-56, 2007.

[15] http://www.cilk.com

[16] R. Chowdhury, V. Ramachandran, Cache-efficient dynamic programming algorithms for multicores, Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, pages 207-216, 2008.

[17] E. Chan, E. Quintana-Orti, G. Quintana-Orti, R. van de Geijn, Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 116-125, 2007.

[18] A. Sarje, S. Aluru, Parallel biological sequence alignments on the Cell Broadband Engine, IEEE International Symposium on Parallel and Distributed Processing, pages 1-11, 2008.

[19] S. G. Akl, Parallel Sorting Algorithms, Academic Press Inc., 1990.

[20] H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani, AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors, Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 189-198, 2007.

[21] G. Rong, Tiow-Seng Tan, Thanh-Tung Cao, Computing two-dimensional Delaunay triangulation using graphics hardware, Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 89-97, 2008.

[22] T. Hartley, U. Catalyurek, A. Ruiz, F. Igual, R. Mayo, M. Ujaldon, Biomedical image analysis on a cooperative cluster of GPUs and multicores, Proceedings of the 22nd Annual International Conference on Supercomputing, pages 15-25, 2008.

[23] Wei-Nien Chen, Hsueh-Ming Hang, H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA), IEEE International Conference on Multimedia and Expo, 2008.

[24] N. Govindaraju, J. Gray, R. Kumar, D. Manocha, GPUTeraSort: high performance graphics co-processor sorting for large database management, International Conference on Management of Data, pages 325-336, 2006.

[25] T. Purcell, C. Donner, M. Cammarano, H. Jensen, P. Hanrahan, Photon mapping on programmable graphics hardware, ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41-50, 2003.

[26] N. Govindaraju, N. Raghuvanshi, D. Manocha, Fast and approximate stream mining of quantiles and frequencies using graphics processors, Proceedings of ACM SIGMOD, 2005.
