Aa sort-v4

AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors

By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani

Presented By: M. Edirisinghe, H. Nawarathna

Content

• Introduction

• SIMD instruction set

• AA-sort algorithm

• In-core algorithm

• Out-of-core algorithm

• Sorting scheme in AA-sort

• Experimental results

Introduction

• High-performance processors provide multiple hardware threads within one physical processor by combining multiple cores with simultaneous multithreading

• Many processors provide Single Instruction Multiple Data (SIMD) instructions

SIMD instructions

• Multiple processing elements perform the same operation on multiple data points simultaneously

SIMD Instructions

• Advantages:
– Data parallelism
– Reduces the number of conditional branches in programs (vector compare and vector select can be used instead)
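The branch-reduction point can be sketched in plain Python. This is only a model: a 4-lane register is a list, and `vector_cmpgt`, `vector_select` and `vector_min_max` are illustrative names, not real VMX intrinsics.

```python
# A 4-lane SIMD register is modelled as a Python list of four values.
# All function names here are hypothetical stand-ins for vector instructions.

def vector_cmpgt(a, b):
    """Lane-wise compare: True in each lane where a[i] > b[i]."""
    return [x > y for x, y in zip(a, b)]

def vector_select(a, b, mask):
    """Lane-wise select: take b[i] where mask[i] is set, otherwise a[i]."""
    return [y if m else x for x, y, m in zip(a, b, mask)]

def vector_min_max(a, b):
    """Lane-wise minimum and maximum built from one compare and two selects."""
    mask = vector_cmpgt(a, b)
    lo = vector_select(a, b, mask)  # where a > b, the smaller value is b
    hi = vector_select(b, a, mask)  # where a > b, the larger value is a
    return lo, hi

lo, hi = vector_min_max([5, 1, 7, 3], [2, 4, 6, 8])
print(lo, hi)  # [2, 1, 6, 3] [5, 4, 7, 8]
```

On real hardware the compare and the selects are single instructions over all four lanes, so no per-element conditional branch is executed; the Python conditionals above only imitate that behaviour.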

SIMD Instruction Set

• Uses Vector Multimedia eXtension (VMX, also known as AltiVec) instructions

• Provides a set of 128-bit vector registers
– Each register holds four 32-bit values

• Useful VMX instructions for sorting:
– Vector compare
– Vector select
– Vector permute

Sorting Algorithms and SIMD

• Many sorting algorithms require unaligned or element-wise memory access (e.g., quicksort)

• Such accesses incur additional overhead and attenuate the benefits of SIMD instructions

Paper’s Contribution

• Proposes Aligned-Access sort (AA-sort), a new parallel sorting algorithm that exploits both the SIMD instructions and the thread-level parallelism available on today’s multi-core processors, with a computational complexity of O(N log(N))

AA-Sort Algorithm

• Assumptions:
– The first element of the array to be sorted is aligned on a 128-bit boundary
– The number of elements in the array, N, is a multiple of four

AA-Sort Algorithm

• Array of integer values a[N] is equivalent to an array of vector integers va[N/4]
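A minimal sketch of this equivalence, assuming four lanes per vector. The helper name `as_vectors` is hypothetical, and the list-of-lists is only a model of the in-memory view.

```python
def as_vectors(a, lanes=4):
    """View a flat array a[N] as N/lanes vectors, so va[i][j] == a[lanes * i + j]."""
    if len(a) % lanes != 0:
        raise ValueError("N must be a multiple of the vector width")
    return [a[i:i + lanes] for i in range(0, len(a), lanes)]

va = as_vectors([10, 11, 12, 13, 14, 15, 16, 17])
print(va)  # [[10, 11, 12, 13], [14, 15, 16, 17]]
```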

AA-Sort Algorithm

• Consists of two algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm

• Phases of execution:
– Divide all of the data into blocks that fit into the cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core sorting algorithm

Combsort

• Extension of bubble sort that eliminates "turtles" (small values near the end of the array)

• Compares and swaps non-adjacent elements

• Improves performance over bubble sort

• Computational complexity: O(N log(N)) on average

• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies
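For reference, a scalar combsort can be sketched as follows. This is the textbook form, not the SIMD variant; the default shrink factor of 1.28 is the value reported on the implementation slide of this deck.

```python
def combsort(values, shrink=1.28):
    """Scalar combsort: bubble-sort passes with a gap that shrinks towards 1."""
    a = list(values)
    n = len(a)
    gap = n
    swapped = True
    # Keep going until the gap has reached 1 and a full pass made no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(n - gap):
            # Each comparison touches a[i] and a[i + gap]: arbitrary offsets
            # (unaligned for SIMD), and each step depends on earlier swaps.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a

print(combsort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```

The inner loop shows both problems listed above: the pair `a[i]`, `a[i + gap]` starts at an arbitrary offset, and the outcome of one comparison can change the input of a later one.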

Combsort

In-Core Algorithm

• Execution steps:
1. Sort values within each vector in ascending order
2. Execute combsort to sort the values into the transposed order

In-Core Algorithm

• Use extended Combsort

In-Core Algorithm

3. Reorder the values from the transposed order into the original order

In-Core Algorithm

• All 3 steps can be executed using SIMD instructions without unaligned memory access

• Computational complexity is dominated by step 2
– Average: O(N log N)
– Worst case: O(N^2)

• Poor memory access locality
– Performance degrades if the data cannot fit into the cache of the processor
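The three steps can be modelled in Python as below. This is a sketch under stated assumptions: vectors are 4-element lists, `sorted()` stands in for the in-register sort of step 1, and `cmpswap` / `cmpswap_skew` are hypothetical names for the lane-wise compare-and-swap and its skewed variant that crosses lane boundaries. In "transposed order" the element va[i][j] has rank j * (N/4) + i.

```python
SHRINK = 1.28  # shrink factor reported on the implementation slide

def cmpswap(va, i, k):
    """Lane-wise compare-and-swap: va[i] keeps the minima, va[k] the maxima."""
    a, b = va[i], va[k]
    swapped = any(x > y for x, y in zip(a, b))
    va[i] = [min(x, y) for x, y in zip(a, b)]
    va[k] = [max(x, y) for x, y in zip(a, b)]
    return swapped

def cmpswap_skew(va, i, k):
    """Compare lane j of va[i] with lane j + 1 of va[k], for j = 0, 1, 2."""
    a, b = va[i], va[k]
    swapped = False
    for j in range(3):
        if a[j] > b[j + 1]:
            a[j], b[j + 1] = b[j + 1], a[j]
            swapped = True
    return swapped

def in_core_sort(values):
    if len(values) % 4 != 0:
        raise ValueError("N must be a multiple of four")
    n = len(values) // 4
    if n == 0:
        return []
    # Step 1: sort the four values inside each vector.
    va = [sorted(values[4 * i:4 * i + 4]) for i in range(n)]
    # Step 2: combsort on whole vectors, producing the transposed order.
    gap = int(n / SHRINK)
    while gap > 1:
        for i in range(n - gap):
            cmpswap(va, i, i + gap)
        for i in range(n - gap, n):       # pairs that wrap into the next lane
            cmpswap_skew(va, i, i + gap - n)
        gap = int(gap / SHRINK)
    swapped = True
    while swapped:                        # gap 1: repeat until nothing moves
        swapped = False
        for i in range(n - 1):
            swapped |= cmpswap(va, i, i + 1)
        swapped |= cmpswap_skew(va, n - 1, 0)
    # Step 3: read out lane by lane to restore the original (row) order.
    return [va[i][j] for j in range(4) for i in range(n)]
```

Every operation in step 2 works on whole vectors at vector-aligned indices, which is what lets the real implementation avoid unaligned loads and stores.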

Out of core Algorithm

• Used to merge two sorted vectors
– a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
– c = [b:a] = merge and sort (a, b)

[Figure: sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) are combined by [b:a] = vector_merge(a, b) into the sorted sequence c0, c1, c2, c3, c4, c5, c6, c7]

Dataflow of Merge

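[Figure: data flow of merging sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3). The first stage compares ai with bi to produce min00, max00, min11, max11, min22, max22, min33, max33; two further compare stages follow.]

lg(P) + 1 stages, where P is the number of elements in a vector

Here P = 4, so lg(P) + 1 = 3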
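A three-stage merge of two sorted 4-element vectors can be sketched as follows, assuming the network is Batcher's odd-even merge (which has exactly three stages for P = 4). Indices 0 to 3 refer to a0..a3 and indices 4 to 7 to b0..b3; `STAGES` and `vector_merge` here are a Python model, not the VMX code.

```python
# Comparator pairs per stage, as indices into the sequence a0..a3, b0..b3.
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],  # stage 1: a_i against b_i (min_ii, max_ii)
    [(2, 4), (3, 5)],                  # stage 2
    [(1, 2), (3, 4), (5, 6)],          # stage 3
]

def vector_merge(a, b):
    """Merge two sorted 4-element vectors; returns (lower four, upper four)."""
    v = list(a) + list(b)
    for stage in STAGES:
        # Comparators within one stage touch disjoint positions, so a SIMD
        # implementation can evaluate a whole stage with compare and select.
        for lo, hi in stage:
            if v[lo] > v[hi]:
                v[lo], v[hi] = v[hi], v[lo]
    return v[:4], v[4:]

print(vector_merge([1, 4, 6, 9], [2, 3, 7, 8]))  # ([1, 2, 3, 4], [6, 7, 8, 9])
```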

Merge Operation

Out of core Algorithm

• No unaligned memory accesses

• Better memory access locality compared with the in-core sorting algorithm
– Higher performance when the data cannot fit in the cache
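One way to merge two sorted arrays one whole vector at a time is sketched below. It is a model under assumptions: both lengths are multiples of four, `merge_registers` is a hypothetical stand-in that uses `sorted()` in place of the in-register merge network, and the control flow is an illustration rather than the authors' exact loop.

```python
def merge_registers(vmin, vmax):
    """Stand-in for the in-register merge: returns (lower four, upper four)."""
    merged = sorted(vmin + vmax)
    return merged[:4], merged[4:]

def merge_sorted_arrays(a, b):
    """Merge sorted arrays a and b, loading and storing whole vectors only."""
    va = [a[i:i + 4] for i in range(0, len(a), 4)]
    vb = [b[i:i + 4] for i in range(0, len(b), 4)]
    if not va:
        return list(b)
    if not vb:
        return list(a)
    out = []
    i = j = 1                      # index of the next unread vector in va, vb
    vmin, vmax = va[0], vb[0]
    while True:
        vmin, vmax = merge_registers(vmin, vmax)
        out.extend(vmin)           # the lower four values are final: store them
        # Load the next vector from the input whose next vector starts lower.
        if i < len(va) and (j == len(vb) or va[i][0] <= vb[j][0]):
            vmin = va[i]
            i += 1
        elif j < len(vb):
            vmin = vb[j]
            j += 1
        else:
            break                  # both inputs are exhausted
    out.extend(vmax)               # flush the upper four values still held
    return out

print(merge_sorted_arrays([1, 10, 11, 12], [2, 3, 4, 5]))
# [1, 2, 3, 4, 5, 10, 11, 12]
```

Both inputs and the output are walked sequentially, which is the source of the better locality noted above.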

Overall AA Sort Scheme

• Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor

• Sort each block with the in-core sorting algorithm in parallel using multiple threads, where each thread processes an independent block.

• Merge the sorted blocks with the out-of-core sorting algorithm using multiple threads
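The structure of the three phases can be sketched with Python's standard library. Everything here is a stand-in: `sorted()` replaces the in-core algorithm, `heapq.merge` replaces the out-of-core merge, the function name `aa_sort_scheme` is hypothetical, and Python threads only illustrate the division of work rather than real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def aa_sort_scheme(data, block_size, threads=4):
    """Block-wise sort followed by stage-by-stage pairwise merging."""
    # Phase 1: divide the data into blocks (sized to fit the cache).
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each worker sorts independent blocks.
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge pairs of sorted runs until a single run remains.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(*p)), pairs))
            if len(runs) % 2:          # an odd run out waits for the next stage
                merged.append(runs[-1])
            runs = merged
    return runs[0] if runs else []

print(aa_sort_scheme([9, 4, 7, 1, 8, 2, 6, 3], block_size=2))
# [1, 2, 3, 4, 6, 7, 8, 9]
```

In the later merge stages there are fewer pairs than workers, so some workers sit idle in this sketch.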

Overall AA Sort Scheme Contd.

No of elements of data = N
No of elements per block = B
No of blocks = N/B

Considering the in-core sorting phase:
Computational time for the in-core sorting of each block is proportional to B log(B)
Complexity of in-core sorting = O(N), because B is a constant fixed by the cache size

Considering the out-of-core sorting phase:
Merging sorted blocks in out-of-core sorting involves log(N/B) stages
Computational complexity of each stage = O(N)
Complexity of out-of-core sorting = O(N log(N))

Hence, the computational complexity of the entire AA-sort = O(N log(N))
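Written out, the two phases combine as follows. The symbols T with subscripts are introduced here for illustration only, and B is assumed to be a constant determined by the cache size.

```latex
\begin{align*}
T_{\text{in-core}} &= \frac{N}{B} \cdot O(B \log B) = O(N \log B) = O(N)
  && \text{($B$ constant)} \\
T_{\text{out-of-core}} &= \log\!\left(\frac{N}{B}\right) \cdot O(N) = O(N \log N) \\
T_{\text{AA-sort}} &= O(N) + O(N \log N) = O(N \log N)
\end{align*}
```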

Overall AA Sort Scheme Contd.

An example of the entire AA-sort process, where number of blocks (N/B) = 8 and the number of threads = 4

Experimental Setup

• PowerPC 970MP system
– Two 2.5 GHz dual-core processors
– 8 GB system memory
– Each core had 1 MB L2 cache memory
– Linux kernel 2.6.20

• System with Cell BE processors
– Two 2.4 GHz processors
– 1 GB system memory
– Only SPE cores were used (16 SPE cores with 256 KB local memory each)
– Linux kernel 2.6.15

Implementation

• Block size: half of the L2 cache (half of the local memory on the SPE)
– 512 KB (128K 32-bit values) on the PowerPC 970MP
– 128 KB (32K 32-bit values) on the SPE

• Shrink factor: 1.28

• Multiway merge technique with out-of-core sorting
– 4-way merge
– Number of merging stages reduced from log2(N/B) to log4(N/B)
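A quick arithmetic check of the stage counts, assuming an input of N = 32 * 2^20 values (an assumed reading of a 32-million-integer input) and the PowerPC 970MP block size of 128K values. The helper `merge_stages` is hypothetical.

```python
def merge_stages(num_blocks, ways):
    """Number of merge stages needed when each stage merges `ways` runs into one."""
    stages = 0
    runs = num_blocks
    while runs > 1:
        runs = -(-runs // ways)  # ceiling division: runs left after this stage
        stages += 1
    return stages

N = 32 * 2**20        # assumed input size
B = 128 * 2**10       # block size in 32-bit values on the PowerPC 970MP
blocks = N // B       # 256 blocks

print(merge_stages(blocks, 2))  # 8, i.e. log2(256)
print(merge_stages(blocks, 4))  # 4, i.e. log4(256)
```

Halving the number of stages halves the number of full passes over the data in memory.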

Effects of Using SIMD Instructions

Acceleration by SIMD instructions for sorting 16 K random integers on one core of PowerPC 970MP

Branch misprediction rate.

Performance for 32 bit Integers

Performance of sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes.

Performance for 32 bit Integers Contd.

Performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers.


The execution time of parallel versions of AA-sort and GPUTeraSort on up to 4 cores of PowerPC 970MP.


Scalability with increasing number of cores on Cell BE for 32 million integers

Conclusions

• Describes a new parallel sorting algorithm called Aligned Access Sort

• The algorithm does not involve any unaligned memory accesses

• Evaluated on PowerPC 970MP and Cell Broadband Engine Processors

• Demonstrated better scalability and performance than the other algorithms evaluated, in both sequential and parallel versions

Conclusions Contd.

• Evaluation was performed only on 32-bit integers

• Performance comparison was performed on a limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand

• Does not discuss how multiple threads cooperate on one merge operation when the number of blocks becomes smaller than the number of threads

Thank You.
