AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani. Presented By: M. Edirisinghe, H. Nawarathna


Page 1: Aa sort-v4

AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors

By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani

Presented By: M. Edirisinghe, H. Nawarathna

Page 2: Aa sort-v4

Content

• Introduction

• SIMD instruction set

• AA-sort algorithm

• In-core algorithm

• Out-of-core algorithm

• Sorting scheme in AA-sort

• Experimental results

Page 3: Aa sort-v4

Introduction

• High-performance processors provide multiple hardware threads within one physical processor through multiple cores and simultaneous multithreading

• Many processors provide Single Instruction Multiple Data (SIMD) instructions


Page 4: Aa sort-v4

SIMD instructions

• Multiple processing elements perform the same operation on multiple data points simultaneously


Page 5: Aa sort-v4

SIMD Instructions

• Advantages:
– Data parallelism
– Fewer conditional branches in programs (vector compare and vector select can be used instead)
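The compare-and-select idea can be sketched in plain Python. This is only an emulation of the dataflow, not VMX code: the function names `vector_compare_gt`, `vector_select`, and `vector_min_max` are hypothetical, and a 4-element list stands in for a 128-bit register. On real hardware each helper would correspond to a single instruction, so no per-element branch would be needed.

```python
def vector_compare_gt(a, b):
    """Emulates a vector compare: per-lane mask, True where a[k] > b[k]."""
    return [x > y for x, y in zip(a, b)]


def vector_select(a, b, mask):
    """Emulates a vector select: lane k takes b[k] where mask[k] is set, else a[k]."""
    return [y if m else x for x, y, m in zip(a, b, mask)]


def vector_min_max(a, b):
    """Lane-wise minimum and maximum built from one compare and two selects."""
    mask = vector_compare_gt(a, b)
    lo = vector_select(a, b, mask)  # b where a > b, otherwise a
    hi = vector_select(b, a, mask)  # a where a > b, otherwise b
    return lo, hi


lo, hi = vector_min_max([5, 1, 7, 3], [2, 4, 6, 8])
print(lo, hi)  # [2, 1, 6, 3] [5, 4, 7, 8]
```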


Page 6: Aa sort-v4

SIMD Instruction Set

• Uses Vector Multimedia eXtension (VMX, also known as AltiVec) instructions

• Provides a set of 128-bit vector registers
– Each register holds four 32-bit values

• Useful VMX instructions for sorting:
– Vector compare
– Vector select
– Vector permute


Page 7: Aa sort-v4

Sorting Algorithms and SIMD

• Many sorting algorithms require unaligned or element-wise memory accesses (e.g., quicksort)

• These accesses incur additional overhead and attenuate the benefits of SIMD instructions


Page 8: Aa sort-v4

Paper’s Contribution

• Proposes Aligned-Access sort (AA-sort), a new parallel sorting algorithm that exploits both the SIMD instructions and the thread-level parallelism available on today’s multi-core processors, with a computational complexity of O(N log(N))


Page 9: Aa sort-v4

AA-Sort Algorithm

• Assumptions:
– The first element of the array to be sorted is aligned on a 128-bit boundary
– The number of elements in the array, N, is a multiple of four


Page 10: Aa sort-v4

AA-Sort Algorithm

• Array of integer values a[N] is equivalent to an array of vector integers va[N/4]


Page 11: Aa sort-v4

AA-Sort Algorithm

• Consists of 2 algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm

• Phases of execution:
– Divide all of the data into blocks that fit into the cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core sorting algorithm

Page 12: Aa sort-v4

Combsort

• Extension to bubble sort that eliminates "turtles" (small values near the end of the array)

• Compares and swaps non-adjacent elements

• Improves performance over bubble sort

• Computational complexity O(N log(N)) on average

• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies


Page 13: Aa sort-v4

Combsort
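As a reference point, scalar combsort can be sketched as follows. This is a minimal textbook-style version, not the presenters' code; the default shrink factor of 1.28 is borrowed from the Implementation slide, and the function name `combsort` is an assumption for illustration.

```python
def combsort(values, shrink=1.28):
    """Scalar combsort: bubble sort whose comparison gap starts large and shrinks to 1."""
    a = list(values)
    gap = len(a)
    swapped = True
    # Keep passing over the data until the gap is 1 and a full pass makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            # a[i + gap] depends on earlier iterations once gap is small:
            # this is the loop-carried dependency that hinders SIMD.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


print(combsort([8, 3, 9, 1, 7, 2, 6, 4, 5, 0]))
```

The large early gaps move small values that sit near the end of the array toward the front in a few steps, which is what plain bubble sort does slowly.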

Page 14: Aa sort-v4

In-Core Algorithm

• Execution steps:
1. Sort values within each vector in ascending order
2. Execute combsort to sort the values into the transposed order


Page 15: Aa sort-v4

In-Core Algorithm

• Use extended Combsort


Page 16: Aa sort-v4

In-Core Algorithm

3. Reorder the values from the transposed order into the original order
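The three steps can be emulated in plain Python to show the data movement. This is a sketch under several assumptions: a 4-element list stands in for a vector register, the helper names `cmpswap` and `cmpswap_skew` are hypothetical, and the final phase simply repeats gap-1 passes until no lane changes. "Transposed order" here means that lane j of vector i holds the element of rank j * (N/4) + i.

```python
SHRINK = 1.28  # shrink factor quoted on the Implementation slide


def cmpswap(a, b):
    """Lane-wise compare-and-swap: a keeps the minima, b the maxima."""
    swapped = False
    for k in range(4):
        if a[k] > b[k]:
            a[k], b[k] = b[k], a[k]
            swapped = True
    return swapped


def cmpswap_skew(a, b):
    """Skewed compare-and-swap: lane k of a against lane k + 1 of b."""
    swapped = False
    for k in range(3):
        if a[k] > b[k + 1]:
            a[k], b[k + 1] = b[k + 1], a[k]
            swapped = True
    return swapped


def in_core_sort(values):
    assert len(values) % 4 == 0 and values, "N must be a positive multiple of four"
    m = len(values) // 4
    # Step 1: sort the four values inside each vector.
    va = [sorted(values[4 * i:4 * i + 4]) for i in range(m)]
    # Step 2: combsort on whole vectors, producing the transposed order.
    gap = int(m / SHRINK)
    while gap > 1:
        for i in range(m - gap):
            cmpswap(va[i], va[i + gap])
        for i in range(m - gap, m):  # pairs that wrap around the end of the array
            cmpswap_skew(va[i], va[i + gap - m])
        gap = int(gap / SHRINK)
    swapped = True
    while swapped:  # gap-1 passes until nothing moves
        swapped = False
        for i in range(m - 1):
            swapped |= cmpswap(va[i], va[i + 1])
        swapped |= cmpswap_skew(va[m - 1], va[0])
    # Step 3: read the lanes column by column to restore the original order.
    return [va[i][j] for j in range(4) for i in range(m)]


print(in_core_sort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Every comparison pairs whole vectors at aligned positions, which is why no unaligned access is needed.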


Page 17: Aa sort-v4

In-Core Algorithm

• All 3 steps can be executed using SIMD instructions without unaligned memory access

• Computational complexity is dominated by step 2
– Average: O(N log N)
– Worst case: O(N^2)

• Poor memory access locality
– Performance degrades if the data cannot fit into the cache of the processor

Page 18: Aa sort-v4


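Out-of-core Algorithm

• Used to merge two sorted vectors
– a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are each sorted
– [b:a] = vector_merge(a, b) merges and sorts a and b, so that the eight values c0 to c7 are sorted across the two vectors

[Figure: sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) merged into the sorted sequence c0 c1 c2 c3 c4 c5 c6 c7]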

Page 19: Aa sort-v4


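Dataflow of Merge

[Figure: dataflow of merging the sorted vectors (a0, a1, a2, a3) and (b0, b1, b2, b3). The first stage of compare ("<") operations produces min00, max00, min11, max11, min22, max22, min33, max33; two further stages of compare operations follow.]

lg(P) + 1 stages, where P = number of elements in a vector

Here P = 4, so lg(P) + 1 = 3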
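One network with this shape is Batcher's odd-even merge, which uses 4, 2, and 3 comparators in its three stages for P = 4. The sketch below emulates it in plain Python; whether the slide's figure wires the later stages exactly this way is an assumption, and `vector_merge` returning a (low, high) pair is an illustrative convention.

```python
def vector_merge(a, b):
    """Merge two sorted 4-element vectors with a 3-stage odd-even merge network.

    Returns (low, high): the four smallest and the four largest values, each sorted.
    """
    # Stage 1: compare a[k] with b[k] in every lane (4 comparators).
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    # Stage 2: finish merging the even lanes and the odd lanes (2 comparators).
    d0, d3 = lo[0], hi[2]
    d1, d2 = min(hi[0], lo[2]), max(hi[0], lo[2])
    e0, e3 = lo[1], hi[3]
    e1, e2 = min(hi[1], lo[3]), max(hi[1], lo[3])
    # Stage 3: interleave the two merged halves (3 comparators).
    c = [d0,
         min(e0, d1), max(e0, d1),
         min(e1, d2), max(e1, d2),
         min(e2, d3), max(e2, d3),
         e3]
    return c[:4], c[4:]


print(vector_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # ([1, 2, 3, 4], [5, 6, 7, 8])
```

Each stage consists only of lane-wise min/max operations on rearranged data, which maps onto vector compare, select, and permute instructions.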

Page 20: Aa sort-v4


Merge Operation
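Merging two sorted arrays vector by vector can be sketched as below. This is a minimal emulation, assuming each array is a non-empty list of 4-element vectors; `vector_merge` is replaced by a stand-in built on `sorted()` rather than the comparator network, and `merge_sorted_arrays` is a hypothetical name. The key idea is that the smaller half of each merge is emitted, the larger half is kept, and the next vector is loaded from whichever array has the smaller next element.

```python
def vector_merge(a, b):
    """Stand-in for the SIMD merge: four smallest and four largest values of a and b."""
    c = sorted(a + b)
    return c[:4], c[4:]


def merge_sorted_arrays(va, vb):
    """Merge two sorted arrays given as lists of 4-element vectors."""
    out = []
    ai, bi = 1, 1
    vmin, vmax = va[0], vb[0]
    while True:
        vmin, vmax = vector_merge(vmin, vmax)
        out.append(vmin)  # the smaller vector is final output
        # Load the next vector from the array whose next element is smaller.
        if ai < len(va) and bi < len(vb):
            if va[ai][0] < vb[bi][0]:
                vmin, ai = va[ai], ai + 1
            else:
                vmin, bi = vb[bi], bi + 1
        elif ai < len(va):
            vmin, ai = va[ai], ai + 1
        elif bi < len(vb):
            vmin, bi = vb[bi], bi + 1
        else:
            break
    out.append(vmax)  # the last kept vector holds the four largest values
    return out


va = [[1, 3, 5, 7], [9, 11, 13, 15]]
vb = [[2, 4, 6, 8], [10, 12, 14, 16]]
print(merge_sorted_arrays(va, vb))
```

All loads and stores move whole vectors, so the loop never touches memory at an unaligned address.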

Page 21: Aa sort-v4


Out-of-core Algorithm

• No unaligned memory accesses

• Better memory access locality compared with the in-core sorting algorithm
– Higher performance when the data cannot fit in the cache

Page 22: Aa sort-v4


Overall AA Sort Scheme

• Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor

• Sort each block with the in-core sorting algorithm in parallel using multiple threads, where each thread processes an independent block.

• Merge the sorted blocks with the out-of-core sorting algorithm using multiple threads
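The three phases can be sketched with a thread pool. This is only a structural illustration: Python's built-in `sorted` stands in for the in-core algorithm, a scalar two-way merge stands in for the out-of-core algorithm, and the names `aa_sort_scheme`, `merge_two`, `block_size`, and `num_threads` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_two(x, y):
    """Scalar two-way merge of sorted lists (stand-in for the out-of-core merge)."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            out.append(x[i])
            i += 1
        else:
            out.append(y[j])
            j += 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out


def aa_sort_scheme(data, block_size, num_threads=4):
    # Phase 1: divide the data into cache-sized blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Phase 2: sort every block independently, one block per task.
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge pairs of sorted runs, stage by stage.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: merge_two(*p), pairs))
            if len(runs) % 2:
                merged.append(runs[-1])  # odd run out waits for the next stage
            runs = merged
    return runs[0] if runs else []


print(aa_sort_scheme([9, 4, 7, 1, 8, 2, 6, 3, 5, 0], block_size=3))
```

In the last stages there are fewer pairs than threads, so some workers sit idle in this sketch.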

Page 23: Aa sort-v4


Overall AA Sort Scheme Contd.

Number of data elements = N
Number of elements per block = B
Number of blocks = N/B

Considering the in-core sorting phase:
– Computational time for the in-core sorting of each block is proportional to B log(B)
– Complexity of in-core sorting = O(N): the N/B blocks cost (N/B) · B log(B) = N log(B) in total, and B is a constant fixed by the cache size

Considering the out-of-core sorting phase:
– Merging sorted blocks in out-of-core sorting involves log(N/B) stages
– Computational complexity of each stage = O(N)
– Complexity of out-of-core sorting = O(N log(N))

Hence, the computational complexity of the entire AA-sort = O(N log(N))

Page 24: Aa sort-v4


Overall AA Sort Scheme Contd.

An example of the entire AA-sort process, where number of blocks (N/B) = 8 and the number of threads = 4

Page 25: Aa sort-v4


Experimental Setup

• PowerPC 970MP system
– Two 2.5 GHz dual-core processors
– 8 GB system memory
– 1 MB L2 cache per core
– Linux kernel 2.6.20

• System with Cell BE processors
– Two 2.4 GHz processors
– 1 GB system memory
– Only SPE cores were used (16 SPE cores with 256 KB local memory each)
– Linux kernel 2.6.15

Page 26: Aa sort-v4


Implementation

• Block size is half the size of the L2 cache (half of the local memory on the SPE)
– 512 KB (128K 32-bit values) on the PowerPC 970MP
– 128 KB (32K 32-bit values) on the SPE

• Shrink factor: 1.28

• Multiway merge technique in out-of-core sorting
– 4-way merge
– Number of merging stages reduced from log2(N/B) to log4(N/B)
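A quick calculation shows the effect of the 4-way merge on the stage count. The figures are assumptions for illustration: N is taken as 32 × 2^20 values (reading "32 million integers" as a power of two) and B as the 128K-value block of the PowerPC 970MP; `merge_stages` is a hypothetical helper.

```python
def merge_stages(num_blocks, ways):
    """Number of merge stages needed to reduce num_blocks sorted runs to one."""
    stages, runs = 0, num_blocks
    while runs > 1:
        runs = -(-runs // ways)  # ceiling division: runs left after one stage
        stages += 1
    return stages


N = 32 * 2**20   # assumed data size: 32 Mi values
B = 128 * 2**10  # block size on the PowerPC 970MP: 128K values
num_blocks = N // B

print(num_blocks, merge_stages(num_blocks, 2), merge_stages(num_blocks, 4))
# prints: 256 8 4
```

Under these assumptions the 4-way merge halves the number of passes over the data, from 8 to 4.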

Page 27: Aa sort-v4


Effects of Using SIMD Instructions

Acceleration by SIMD instructions for sorting 16K random integers on one core of PowerPC 970MP.

Branch misprediction rate.

Page 28: Aa sort-v4


Performance for 32 bit Integers

Performance of sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes.

Page 29: Aa sort-v4


Performance for 32 bit Integers Contd.

Performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers.

Page 30: Aa sort-v4


Performance for 32 bit Integers Contd.

The execution time of parallel versions of AA-sort and GPUTeraSort on up to 4 cores of PowerPC 970MP.

Page 31: Aa sort-v4


Performance for 32 bit Integers Contd.

Scalability with increasing number of cores on Cell BE for 32 million integers

Page 32: Aa sort-v4


Conclusions

• Describes a new parallel sorting algorithm called Aligned Access Sort

• The algorithm does not involve any unaligned memory accesses

• Evaluated on PowerPC 970MP and Cell Broadband Engine Processors

• Demonstrated better scalability and performance in both sequential and parallel versions

Page 33: Aa sort-v4


Conclusions Contd.

• Evaluation was performed only on 32-bit integers

• Performance comparison was performed on a limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand

• Does not discuss how multiple threads cooperate on one merge operation when number of blocks becomes smaller than number of threads

Page 34: Aa sort-v4


Thank You.