AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors
By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
Presented By: M. Edirisinghe, H. Nawarathna
Content
• Introduction
• SIMD instruction set
• AA-sort algorithm
• In-core algorithm
• Out-of-core algorithm
• Sorting scheme in AA-sort
• Experimental results
Introduction
• High-performance processors provide multiple hardware threads within one physical processor by combining multiple cores with simultaneous multithreading
• Many processors provide Single Instruction Multiple Data (SIMD) instructions
SIMD Instructions
• Multiple processing elements perform the same operation on multiple data points simultaneously
SIMD Instructions
• Advantages:
– Data parallelism
– Fewer conditional branches in programs (vector compare and vector select can be used instead)
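The branch-free idea can be sketched in plain Python. This is only a simulation under stated assumptions: a "vector" is a list of four non-negative 32-bit values, and the helper names `vec_cmpgt` and `vec_sel` are hypothetical stand-ins chosen to echo the vector compare and vector select operations, not real intrinsics. Real hardware produces the mask and the selection in single instructions with no per-lane branch.

```python
MASK32 = 0xFFFFFFFF  # all-ones pattern for one 32-bit lane

def vec_cmpgt(a, b):
    """Lane-wise a > b; each lane becomes all ones (true) or all zeros (false)."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]

def vec_sel(a, b, mask):
    """Bitwise select: take bits from b where the mask bit is 1, else from a."""
    return [((x & ~m) | (y & m)) & MASK32 for x, y, m in zip(a, b, mask)]

def vec_min(a, b):
    """Lane-wise minimum built from compare + select, with no data-dependent branch."""
    return vec_sel(a, b, vec_cmpgt(a, b))

def vec_max(a, b):
    """Lane-wise maximum built from the same mask with the operands exchanged."""
    return vec_sel(b, a, vec_cmpgt(a, b))

a = [5, 1, 7, 3]
b = [2, 6, 7, 0]
print(vec_min(a, b))  # → [2, 1, 7, 0]
print(vec_max(a, b))  # → [5, 6, 7, 3]
```

A compare-and-swap of two vectors is then just one `vec_min` and one `vec_max`, which is the building block the sorting steps later in the deck rely on.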
SIMD Instruction Set
• Uses the Vector Multimedia eXtension (VMX, also known as AltiVec) instructions
• Provides a set of 128-bit vector registers
– Each register holds four 32-bit values
• Useful VMX instructions for sorting:
– Vector compare
– Vector select
– Vector permute
Sorting Algorithms and SIMD
• Many sorting algorithms require unaligned or element-wise memory access (e.g., quicksort)
• These accesses incur additional overhead and attenuate the benefits of SIMD instructions
Paper’s Contribution
• Proposes Aligned-Access sort (AA-sort), a new parallel sorting algorithm that exploits both SIMD instructions and the thread-level parallelism available on today’s multi-core processors, with computational complexity of O(N log(N))
AA-Sort Algorithm
• Assumptions:
– First element of the array to be sorted is aligned on a 128-bit boundary
– Number of elements in the array, N, is a multiple of four
AA-Sort Algorithm
• Array of integer values a[N] is equivalent to an array of vector integers va[N/4]
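A small Python sketch of this equivalence, assuming the platform's C `int` is 32 bits wide (true on common systems, but still an assumption). The helper name `as_vectors` is hypothetical. The view shares memory with the array, so no values are copied.

```python
from array import array

def as_vectors(a):
    """View a flat array of N 32-bit ints as N/4 rows of four lanes, without copying."""
    assert a.itemsize == 4, "sketch assumes 32-bit ints"
    assert len(a) % 4 == 0, "N must be a multiple of four"
    # Cast to raw bytes first, then to a two-dimensional int view.
    return memoryview(a).cast('B').cast('i', shape=[len(a) // 4, 4])

a = array('i', range(8))   # a[N] with N = 8
va = as_vectors(a)         # va[N/4] with two vectors
print(va.tolist())         # → [[0, 1, 2, 3], [4, 5, 6, 7]]

a[4] = 99                  # writing through a[] is visible through va[]
print(va.tolist()[1])      # → [99, 5, 6, 7]
```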
AA-Sort Algorithm
• Consists of 2 algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm
• Phases of execution:
– Divide all of the data into blocks that fit into the cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core sorting algorithm
Combsort
• Extension of bubble sort that eliminates "turtles" (small values near the end of the array)
• Compares and swaps non-adjacent elements
• Improves performance over bubble sort
• Computational complexity O(N log(N)) on average
• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies
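For reference, a minimal scalar combsort in Python. The default shrink factor of 1.28 is the value the deck reports in its implementation notes; everything else is a generic textbook sketch, not the authors' code.

```python
def combsort(data, shrink=1.28):
    """Sort a list in place with combsort and return it."""
    n = len(data)
    gap = n
    swapped = True
    # Keep going until the gap has shrunk to 1 and a full pass makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(n - gap):
            # Compare elements that are `gap` apart, not just neighbours.
            if data[i] > data[i + gap]:
                data[i], data[i + gap] = data[i + gap], data[i]
                swapped = True
    return data

print(combsort([7, 3, 9, 1, 8, 2, 6, 0]))  # → [0, 1, 2, 3, 6, 7, 8, 9]
```

The inner loop shows both SIMD problems listed above: `data[i + gap]` falls on a vector boundary only when the gap is a multiple of four, and once the gap is smaller than four an iteration can read a value written by the previous one.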
Combsort
In-Core Algorithm
• Execution steps:
1. Sort values within each vector in ascending order
2. Execute combsort to sort the values into the transposed order
In-Core Algorithm
• Use extended Combsort
In-Core Algorithm
3. Reorder the values from the transposed order into the original order
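The three steps can be sketched in Python as follows. This is a simulation under stated assumptions, not the authors' implementation: vectors are four-element lists, the helper names `cmpswap` and `cmpswap_skew` are hypothetical, and "transposed order" is taken to mean that lane j of vector i ends up holding the value of rank j * (N/4) + i. The skewed compare pairs lane j of one vector with lane j + 1 of another, which is how a comparison wraps past the end of the vector array without an unaligned access.

```python
LANES = 4
SHRINK = 1.28  # shrink factor reported in the deck's implementation notes

def cmpswap(a, b):
    """Lane-wise compare-and-swap: a keeps the minima, b the maxima."""
    swapped = False
    for j in range(LANES):
        if a[j] > b[j]:
            a[j], b[j] = b[j], a[j]
            swapped = True
    return swapped

def cmpswap_skew(a, b):
    """Compare lane j of a with lane j + 1 of b (used when the gap wraps around)."""
    swapped = False
    for j in range(LANES - 1):
        if a[j] > b[j + 1]:
            a[j], b[j + 1] = b[j + 1], a[j]
            swapped = True
    return swapped

def in_core_sort(values):
    """Return a sorted copy of values; len(values) must be a multiple of four."""
    assert len(values) % LANES == 0
    n = len(values) // LANES
    if n == 0:
        return []
    # Step 1: sort the four values inside each vector.
    va = [sorted(values[i * LANES:(i + 1) * LANES]) for i in range(n)]
    # Step 2: combsort over whole vectors, producing the transposed order.
    gap = int(n / SHRINK)
    while gap > 1:
        for i in range(n - gap):
            cmpswap(va[i], va[i + gap])
        for i in range(n - gap, n):
            cmpswap_skew(va[i], va[i + gap - n])
        gap = int(gap / SHRINK)
    swapped = True
    while swapped:  # gap == 1 passes until nothing moves
        swapped = False
        for i in range(n - 1):
            swapped |= cmpswap(va[i], va[i + 1])
        swapped |= cmpswap_skew(va[n - 1], va[0])
    # Step 3: read lane by lane to convert transposed order to the original order.
    return [va[i][j] for j in range(LANES) for i in range(n)]

print(in_core_sort([7, 3, 9, 1, 8, 2, 6, 0]))  # → [0, 1, 2, 3, 6, 7, 8, 9]
```

Every access in step 2 touches a whole vector at index `i`, `i + gap`, or the wrapped index, so all loads and stores stay on vector boundaries.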
In-Core Algorithm
• All 3 steps can be executed using SIMD instructions without unaligned memory access
• Computational complexity dominated by step 2
– Average O(N log N)
– Worst case O(N^2)
• Poor memory access locality
– Performance degrades if the data cannot fit into the cache of the processor
Out of core Algorithm
• Built on an operation that merges two sorted vectors
– a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are each sorted
– c = [b:a] = vector_merge(a, b) holds all eight values in sorted order
[Figure: sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) are merged into the sorted sequence c0 ... c7, with [b:a] = vector_merge(a, b)]
Dataflow of Merge
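[Figure: dataflow of merging sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3). The first stage takes the min and max of the pairs (a0, b0), (a1, b1), (a2, b2), (a3, b3); it is followed by a stage of two comparisons and a stage of three comparisons.]
• lg(P) + 1 stages, where P is the number of elements in a vector
• Here P = 4, so lg(P) + 1 = 3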
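A three-stage merge of two sorted four-element vectors can be sketched with Batcher's odd-even merge network, which has stages of four, two, and three comparisons. The function name `vector_merge` follows the deck; the comparator table and the list-based simulation are assumptions for illustration, and real SIMD code would perform each stage with vector min, max, and permute operations.

```python
# Comparator positions over the 8 slots [a0, a1, a2, a3, b0, b1, b2, b3].
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],  # stage 1: min/max of (a_i, b_i)
    [(2, 4), (3, 5)],                  # stage 2
    [(1, 2), (3, 4), (5, 6)],          # stage 3
]

def vector_merge(a, b):
    """Merge two sorted 4-element lists; return (lower four, upper four)."""
    c = list(a) + list(b)
    for stage in STAGES:
        # Comparators inside one stage touch disjoint slots, so they are independent.
        for lo, hi in stage:
            c[lo], c[hi] = min(c[lo], c[hi]), max(c[lo], c[hi])
    return c[:4], c[4:]

print(vector_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # → ([1, 2, 3, 4], [5, 6, 7, 8])
```

The stage count matches lg(P) + 1 = 3 for P = 4, since merging two sorted runs of P values is a merge over 2P slots.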
Merge Operation
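The slide's figure is not reproduced here. One common way to merge two sorted arrays with a vector-merge primitive is sketched below; it is offered as an assumption about the general technique, not as the authors' exact procedure. The function name `merge_sorted_arrays` is hypothetical, and `vector_merge` is a stand-in that uses Python's `sorted` in place of a SIMD merge network.

```python
def vector_merge(a, b):
    """Stand-in for a SIMD merge of two sorted 4-element vectors."""
    c = sorted(a + b)
    return c[:4], c[4:]

def merge_sorted_arrays(xs, ys):
    """Merge two sorted lists whose lengths are positive multiples of four."""
    va = [xs[i:i + 4] for i in range(0, len(xs), 4)]
    vb = [ys[i:i + 4] for i in range(0, len(ys), 4)]
    ia, ib = 1, 1
    lo, hi = vector_merge(va[0], vb[0])
    out = list(lo)  # the lower four values are final and can be written out
    while ia < len(va) or ib < len(vb):
        # Load the next whole vector from the input whose next head value is smaller.
        if ib >= len(vb) or (ia < len(va) and va[ia][0] <= vb[ib][0]):
            nxt = va[ia]
            ia += 1
        else:
            nxt = vb[ib]
            ib += 1
        # Merge it with the upper four values carried over from the previous step.
        lo, hi = vector_merge(hi, nxt)
        out.extend(lo)
    out.extend(hi)
    return out

print(merge_sorted_arrays([1, 2, 3, 50, 51, 52, 53, 54], [4, 5, 6, 7, 8, 9, 10, 11]))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50, 51, 52, 53, 54]
```

Both inputs and the output are read and written one whole vector at a time, and the only scalar comparison is the one that picks which input to load from next.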
Out of core Algorithm
• No unaligned memory accesses
• Better memory access locality compared with the in-core sorting algorithm
– Higher performance when data cannot fit in the cache
Overall AA Sort Scheme
• Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor
• Sort each block with the in-core sorting algorithm in parallel using multiple threads, where each thread processes an independent block.
• Merge the sorted blocks with the out-of-core sorting algorithm using multiple threads
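The overall scheme can be outlined in Python as below. This shows only the structure: `sorted` stands in for the in-core algorithm, `heapq.merge` stands in for the out-of-core merge, and the function name `aa_sort_outline` and its parameters are hypothetical. Threads in CPython will not give a real speedup here; they only illustrate which pieces of work are independent.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def aa_sort_outline(data, block_size, threads=4):
    """Sort data by sorting fixed-size blocks, then merging them pairwise in stages."""
    # Phase 1: divide the data into blocks (sized to fit in cache on real hardware).
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each block is independent, so blocks are sorted in parallel.
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge pairs of sorted runs; each stage halves the number of runs.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(*p)), pairs))
            if len(runs) % 2:
                merged.append(runs[-1])  # an odd run waits for the next stage
            runs = merged
    return runs[0] if runs else []

print(aa_sort_outline([9, 4, 7, 1, 8, 2, 6, 3, 5, 0], block_size=4))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the later merge stages there are fewer pairs than threads, so this simple outline leaves some threads idle.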
Overall AA Sort Scheme Contd.
• Number of elements of data = N
• Number of elements per block = B
• Number of blocks = N/B
• In-core sorting phase
– Computational time for the in-core sorting of each block is proportional to B log(B)
– Complexity of in-core sorting over all N/B blocks = O(N log(B)), which is O(N) because B is a constant fixed by the cache size
• Out-of-core sorting phase
– Merging the sorted blocks involves log(N/B) stages
– Computational complexity of each stage = O(N)
– Complexity of out-of-core sorting = O(N log(N))
• Hence, computational complexity of the entire AA-sort = O(N log(N))
Overall AA Sort Scheme Contd.
An example of the entire AA-sort process, where number of blocks (N/B) = 8 and the number of threads = 4
Experimental Setup
• PowerPC 970MP system
– Two 2.5 GHz dual-core processors
– 8 GB system memory
– Each core had 1 MB of L2 cache memory
– Linux kernel 2.6.20
• System with Cell BE processors
– Two 2.4 GHz processors
– 1 GB system memory
– Only SPE cores were used (16 SPE cores with 256 KB local memory each)
– Linux kernel 2.6.15
Implementation
• Block size is half the size of the L2 cache (half the local memory on the SPE)
– 512 KB (128K 32-bit values) on PowerPC 970MP
– 128 KB (32K 32-bit values) on the SPE
• Shrink factor: 1.28
• Multiway merge technique for out-of-core sorting
– 4-way merge
– Number of merging stages reduced from log2(N/B) to log4(N/B)
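A quick worked example of the stage count, assuming "32 million integers" means 32 × 2^20 values and using the PowerPC 970MP block size of 128K values. The helper name `merge_stages` is hypothetical.

```python
def merge_stages(blocks, ways):
    """Count merge stages needed to reduce `blocks` sorted runs to one."""
    stages = 0
    while blocks > 1:
        blocks = -(-blocks // ways)  # ceiling division: runs left after one stage
        stages += 1
    return stages

N = 32 * 2**20   # assumed reading of "32 million" 32-bit integers
B = 128 * 2**10  # 128K values per block (512 KB)
blocks = N // B

print(blocks)                   # → 256
print(merge_stages(blocks, 2))  # → 8, i.e. log2(256)
print(merge_stages(blocks, 4))  # → 4, i.e. log4(256)
```

Each stage reads and writes all N values once, so halving the number of stages roughly halves the memory traffic of the out-of-core phase.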
Effects of Using SIMD Instructions
Acceleration by SIMD instructions for sorting 16K random integers on one core of PowerPC 970MP.
Branch misprediction rate.
Performance for 32 bit Integers
Performance of sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes.
Performance for 32 bit Integers Contd.
Performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers.
Performance for 32 bit Integers Contd.
The execution time of parallel versions of AA-sort and GPUTeraSort on up to 4 cores of PowerPC 970MP.
Performance for 32 bit Integers Contd.
Scalability with increasing number of cores on Cell BE for 32 million integers
Conclusions
• Describes a new parallel sorting algorithm called Aligned Access Sort
• The algorithm does not involve any unaligned memory accesses
• Evaluated on PowerPC 970MP and Cell Broadband Engine Processors
• Demonstrated better scalability and performance in both sequential and parallel versions
Conclusions Contd.
• Evaluation was performed only on 32-bit integers
• Performance comparison was performed on a limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand
• Does not discuss how multiple threads cooperate on one merge operation when the number of blocks becomes smaller than the number of threads
Thank You.