AA-sort with SSE4.1


Cybozu Labs

2012/6/16 MITSUNARI Shigeo(@herumi)

x86/x64 optimization seminar 4(#x86opti)


Agenda

Introduction of AA-sort
classic combsort
vectorized combsort
vectorized merge
benchmark

2012/6/16 #x86opti 4


AA-sort

Aligned-Access sort, proposed by Hiroshi Inoue et al. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011

http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm

http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm

For SIMD: fewer conditional branches, no unaligned data access

For multicore processors: they implemented it for PowerPC and Cell BE

O(n log n) complexity

I tried it on Intel CPUs (not complete): https://github.com/herumi/opti/blob/master/intsort.hpp

The current version uses only one processor

AA-sort

vectorized combsort for each block (<= L2 cache?)
vectorized merge of the sorted blocks

    input array:  block 0 | block 1 | block 2 | block 3 | ...
    sort:         each block is sorted on its own
    merge:        sorted blocks 0 and 1 are merged, likewise 2 and 3, ...
    merge:        the merged runs are merged again, until one sorted run remains
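The two phases above can be sketched in scalar form. This is only an outline of the block structure, under the assumption that std::sort and std::merge stand in for the vectorized combsort and the vectorized merge; the name blockSortThenMerge and its parameters are hypothetical and do not come from intsort.hpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sort `a` block by block, then merge the sorted blocks.
// std::sort and std::merge are scalar stand-ins (an assumption of this sketch).
void blockSortThenMerge(std::vector<uint32_t>& a, size_t blockSize)
{
    assert(blockSize > 0);
    const size_t n = a.size();
    // Phase 1: sort each block independently.
    for (size_t pos = 0; pos < n; pos += blockSize) {
        const size_t end = std::min(pos + blockSize, n);
        std::sort(a.begin() + pos, a.begin() + end);
    }
    // Phase 2: merge neighbouring sorted runs, doubling the run width each pass.
    std::vector<uint32_t> tmp(n);
    for (size_t width = blockSize; width < n; width *= 2) {
        for (size_t pos = 0; pos < n; pos += 2 * width) {
            const size_t mid = std::min(pos + width, n);
            const size_t end = std::min(pos + 2 * width, n);
            std::merge(a.begin() + pos, a.begin() + mid,
                       a.begin() + mid, a.begin() + end,
                       tmp.begin() + pos);
        }
        a.swap(tmp);
    }
}
```

Each merge pass reads and writes every element once, and the number of passes grows with the logarithm of the number of blocks.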

AA-sort algorithm

sort each block: O(n log n)
merge sorted blocks: O(n)

classic combsort(1/2)

improved bubble sort
unstable
O(n log n) on average (empirically)
compare two elements separated by a gap (>= 1)
the gap is divided by a shrink factor (about 1.3) after each pass

    size_t nextGap(size_t N)
    {
        return (N * 10) / 13;
    }

    void combsort(uint32_t *a, size_t N)
    {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
            }
            gap = nextGap(gap);
        }
        …

classic combsort(2/2)

gap = 1 is bubble sort
loop until the array is fully sorted

        …
        for (;;) {
            bool isSwapped = false;
            for (size_t i = 0; i + 1 < N; i++) { // i + 1 < N also handles N == 0
                if (a[i] > a[i + 1]) {
                    std::swap(a[i], a[i + 1]);
                    isSwapped = true;
                }
            }
            if (!isSwapped) return;
        }
    }
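For reference, the two halves of the classic combsort can be joined into one compilable unit. This is a minimal sketch that follows the slide code; the only change is that the loop bounds are written as i + gap < N and i + 1 < N so that an empty array is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Shrink the gap by a factor of about 1.3.
size_t nextGap(size_t n)
{
    return (n * 10) / 13;
}

void combsort(uint32_t *a, size_t N)
{
    // First half: compare elements that are `gap` apart while the gap shrinks.
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i + gap < N; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
        gap = nextGap(gap);
    }
    // Second half: gap == 1 is bubble sort; repeat until a pass makes no swap.
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}
```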

gap function

Combsort11: ending the gap sequence with [11, 8, 6, 4, 3, 2, 1] seems good, according to

http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm

a little faster if line (*) is added

    size_t nextGap(size_t n)
    {
        n = (n * 10) / 13;
        if (n == 9 || n == 10) return 11; // (*)
        return n;
    }
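To see the effect of line (*), the gaps can be listed for a given array size. The helper gapSequence below is hypothetical and exists only for illustration; nextGap is the function from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

size_t nextGap(size_t n)
{
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // (*)
    return n;
}

// Hypothetical helper: every gap used for an array of N elements, down to 1.
std::vector<size_t> gapSequence(size_t N)
{
    std::vector<size_t> gaps;
    for (size_t gap = nextGap(N); gap >= 1; gap = nextGap(gap)) {
        gaps.push_back(gap);
        if (gap == 1) break; // gap == 1 is the final bubble sort phase
    }
    return gaps;
}
```

For N = 100 this gives 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1. Without line (*) the step after 14 would be 10 instead of 11, and the tail would no longer be the Combsort11 pattern.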

vectorized combsort

step1 : sort values within each vector (32bit x 4)
step2 : SIMD version of combsort
step3 : reorder data

[Figure: the input array is loaded as vectors v0, v1, v2, v3, ... with lanes +0 to +3. step1 sorts the values within each vector. After step2 each lane holds a sorted run (0 1 3 … 101 in lane +0, 102 104 105 … 380 in lane +1, 389 391 392 … 502 in lane +2, 511 515 612 … 973 in lane +3). step3 reorders the data into 0 1 3 … 101 102 104 105 … 380 389 391 392 …]

step1

step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3
step1.2 : transpose

    input                after step1.1 (sort)    after step1.2 (transpose)
        v0 v1 v2 v3          v0 v1 v2 v3             v0 v1 v2 v3
    +0   3  5  0  8      +0   0  3  5  8         +0   0  1  4  9
    +1   2  7  1  2      +1   1  2  2  7         +1   3  2  8 14
    +2   8 12  4 13      +2   4  8 12 13         +2   5  2 12 15
    +3   9 14 20 15      +3   9 14 15 20         +3   8  7 13 20

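sort of 4 items

use pmaxud, pminud for uint32_t x 4

    stage 1:  min01 = min(v0, v1), max01 = max(v0, v1), min23 = min(v2, v3), max23 = max(v2, v3)
    stage 2:  min0123 = min(min01, min23), s = max(min01, min23), t = min(max01, max23), max0123 = max(max01, max23)
    stage 3:  min(s, t), max(s, t)
    sorted:   min0123 <= min(s, t) <= max(s, t) <= max0123

Each compare-exchange maps (a, b) to (min(a, b), max(a, b)).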

source of step1.1

V128 is a type of 32-bit integer x 4
pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

    void sort_step1_vec(V128 x[4])
    {
        V128 min01 = pminud(x[0], x[1]);
        V128 max01 = pmaxud(x[0], x[1]);
        V128 min23 = pminud(x[2], x[3]);
        V128 max23 = pmaxud(x[2], x[3]);
        x[0] = pminud(min01, min23);
        x[3] = pmaxud(max01, max23);
        V128 s = pmaxud(min01, min23);
        V128 t = pminud(max01, max23);
        x[1] = pminud(s, t);
        x[2] = pmaxud(s, t);
    }

transpose of 4x4 matrix

use unpcklps and unpckhps

        x0 x1 x2 x3
    +0   3  5  0  8
    +1   2  7  1  2
    +2   8 12  4 13
    +3   9 14 20 15

    t0 = unpcklps(x0, x2)
    t2 = unpckhps(x0, x2)
    t1 = unpcklps(x1, x3)
    t3 = unpckhps(x1, x3)

        t0 t1 t2 t3
    +0   3  5  8 12
    +1   0  8  4 13
    +2   2  7  9 14
    +3   1  2 20 15

    x0 = unpcklps(t0, t1)
    x1 = unpckhps(t0, t1)
    x2 = unpcklps(t2, t3)
    x3 = unpckhps(t2, t3)

        x0 x1 x2 x3
    +0   3  2  8  9
    +1   5  7 12 14
    +2   0  1  4 20
    +3   8  2 13 15

source of transpose and step1

    void transpose(V128 x[4])
    {
        V128 x0 = x[0];
        V128 x1 = x[1];
        V128 x2 = x[2];
        V128 x3 = x[3];
        V128 t0 = unpcklps(x0, x2);
        V128 t1 = unpcklps(x1, x3);
        V128 t2 = unpckhps(x0, x2);
        V128 t3 = unpckhps(x1, x3);
        x[0] = unpcklps(t0, t1);
        x[1] = unpckhps(t0, t1);
        x[2] = unpcklps(t2, t3);
        x[3] = unpckhps(t2, t3);
    }

    void sort_step1(V128 *va, size_t N)
    {
        for (size_t i = 0; i < N; i += 4) {
            sort_step1_vec(&va[i]);
            transpose(&va[i]);
        }
    }
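step1 can be checked without SSE4.1 hardware by modelling V128 as four scalar lanes. In the sketch below the std::array-based V128 and the scalar bodies of pminud, pmaxud, unpcklps and unpckhps are assumptions standing in for the real instructions; sort_step1_vec and transpose follow the slide code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar stand-in for a 128-bit register of 4 x uint32_t (an assumption).
typedef std::array<uint32_t, 4> V128;

V128 pminud(const V128& a, const V128& b)
{
    V128 r;
    for (int i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}

V128 pmaxud(const V128& a, const V128& b)
{
    V128 r;
    for (int i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}

// Interleave the low halves: (a0, b0, a1, b1).
V128 unpcklps(const V128& a, const V128& b)
{
    return V128{{a[0], b[0], a[1], b[1]}};
}

// Interleave the high halves: (a2, b2, a3, b3).
V128 unpckhps(const V128& a, const V128& b)
{
    return V128{{a[2], b[2], a[3], b[3]}};
}

// Sort the four vectors lane by lane with a 5-comparator network.
void sort_step1_vec(V128 x[4])
{
    V128 min01 = pminud(x[0], x[1]);
    V128 max01 = pmaxud(x[0], x[1]);
    V128 min23 = pminud(x[2], x[3]);
    V128 max23 = pmaxud(x[2], x[3]);
    x[0] = pminud(min01, min23);
    x[3] = pmaxud(max01, max23);
    V128 s = pmaxud(min01, min23);
    V128 t = pminud(max01, max23);
    x[1] = pminud(s, t);
    x[2] = pmaxud(s, t);
}

// 4x4 transpose built from two rounds of unpack operations.
void transpose(V128 x[4])
{
    V128 t0 = unpcklps(x[0], x[2]);
    V128 t1 = unpcklps(x[1], x[3]);
    V128 t2 = unpckhps(x[0], x[2]);
    V128 t3 = unpckhps(x[1], x[3]);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}
```

With the 4x4 example from the slides, x0 = (3, 2, 8, 9), x1 = (5, 7, 12, 14), x2 = (0, 1, 4, 20), x3 = (8, 2, 13, 15), running sort_step1_vec and then transpose leaves every vector sorted internally: (0, 3, 5, 8), (1, 2, 2, 7), (4, 8, 12, 13), (9, 14, 15, 20).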

SIMD version combsort

first half of the code
uses vector_cmpswap and vector_cmpswap_skew

    bool sort_step2(V128 *va, size_t N)
    {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                vector_cmpswap(va[i], va[i + gap]);
            }
            for (size_t i = N - gap; i < N; i++) {
                vector_cmpswap_skew(va[i], va[i + gap - N]);
            }
            gap = nextGap(gap);
        }
        ...

vector_cmpswap

no conditional branch

the scalar code

    if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

is vectorised as

    void vector_cmpswap(V128& a, V128& b)
    {
        V128 t = pmaxud(a, b);
        a = pminud(a, b);
        b = t;
    }

(a, b) becomes (min(a,b), max(a,b)) in every lane

vector_cmpswap_skew

for the boundary of the array

(a', b') = vector_cmpswap_skew(a, b)

    a  = [a3 : a2 : a1 : a0]
    b  = [b3 : b2 : b1 : b0]
    a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]
    b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]
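The difference between the plain and the skewed compare-and-swap is easiest to see in a scalar model. The sketch below assumes V128 is modelled as a std::array of four lanes and that a and b are distinct objects; the real version would use shuffles with pminud and pmaxud rather than a loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar stand-in for a 128-bit register of 4 x uint32_t (an assumption).
typedef std::array<uint32_t, 4> V128;

// Lane i of a is paired with lane i of b.
void vector_cmpswap(V128& a, V128& b)
{
    for (int i = 0; i < 4; i++) {
        const uint32_t lo = std::min(a[i], b[i]);
        const uint32_t hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// Lane i of a is paired with lane i + 1 of b; a[3] and b[0] are left untouched.
// Assumes a and b are distinct objects.
void vector_cmpswap_skew(V128& a, V128& b)
{
    for (int i = 0; i < 3; i++) {
        const uint32_t lo = std::min(a[i], b[i + 1]);
        const uint32_t hi = std::max(a[i], b[i + 1]);
        a[i] = lo;
        b[i + 1] = hi;
    }
}
```

The skew is what connects the end of one lane to the start of the next lane, so that the four lanes together behave like one long array.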

isSortedVec

check whether the array is sorted
ptest_zf(a, b) is true if (a & b) == 0
a <= b <=> max(a,b) == b <=> c := max(a,b) - b == 0
pcmpgtd is for int32_t, so we can't use it

    bool isSortedVec(const V128 *va, size_t N)
    {
        for (size_t i = 0; i + 1 < N; i++) {
            V128 a = va[i];
            V128 b = va[i + 1];
            V128 c = pmaxud(a, b);
            c = psubd(c, b);
            if (!ptest_zf(c, c)) {
                return false;
            }
        }
        return true;
    }
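The reason for going through max(a,b) - b instead of a signed compare can be shown on single values. The two helper names below are hypothetical; lessEqSigned models what a signed comparison such as pcmpgtd would see when given the same bit patterns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Unsigned a <= b, expressed as max(a, b) - b == 0.
bool lessEqViaMax(uint32_t a, uint32_t b)
{
    return std::max(a, b) - b == 0;
}

// The same bits compared as int32_t.
bool lessEqSigned(uint32_t a, uint32_t b)
{
    int32_t sa, sb;
    std::memcpy(&sa, &a, sizeof sa);
    std::memcpy(&sb, &b, sizeof sb);
    return sa <= sb;
}
```

For a = 0x80000000 and b = 1 the unsigned answer is "a > b", but the signed view treats a as negative and reports a <= b, which would make an unsorted array look sorted.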

loop for gap == 1

vectorised bubble sort for gap == 1
give up if the loop count reaches maxLoop
    fall back to std::sort
    this is rare

    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop; loop++) {
        for (size_t i = 0; i < N - 1; i++) {
            vector_cmpswap(va[i], va[i + 1]);
        }
        vector_cmpswap_skew(va[N - 1], va[0]);
        if (isSortedVec(va, N)) return true;
    }
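Putting both halves of step2 together with a scalar version of step3 gives a model that can be run anywhere. This is a sketch under several assumptions: V128 is a std::array of four lanes, the compare-and-swap helpers use scalar branches instead of pminud/pmaxud, inputs with fewer than two vectors are left to the fallback, and reorder is a hypothetical name for the lane-by-lane read-out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::array<uint32_t, 4> V128; // scalar stand-in (an assumption)

size_t nextGap(size_t n)
{
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11;
    return n;
}

void vector_cmpswap(V128& a, V128& b)
{
    for (int i = 0; i < 4; i++) {
        if (a[i] > b[i]) std::swap(a[i], b[i]);
    }
}

void vector_cmpswap_skew(V128& a, V128& b)
{
    for (int i = 0; i < 3; i++) {
        if (a[i] > b[i + 1]) std::swap(a[i], b[i + 1]);
    }
}

bool isSortedVec(const std::vector<V128>& va)
{
    for (size_t i = 0; i + 1 < va.size(); i++) {
        for (int j = 0; j < 4; j++) {
            if (va[i][j] > va[i + 1][j]) return false;
        }
    }
    return true;
}

// Returns false when maxLoop passes were not enough; the caller then falls back.
bool sort_step2(std::vector<V128>& va)
{
    const size_t N = va.size();
    if (N < 2) return false; // this sketch leaves tiny inputs to the fallback
    size_t gap = nextGap(N);
    while (gap > 1) {
        for (size_t i = 0; i < N - gap; i++) vector_cmpswap(va[i], va[i + gap]);
        // Wrap around: the end of lane j continues at the start of lane j + 1.
        for (size_t i = N - gap; i < N; i++) vector_cmpswap_skew(va[i], va[i + gap - N]);
        gap = nextGap(gap);
    }
    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop; loop++) {
        for (size_t i = 0; i + 1 < N; i++) vector_cmpswap(va[i], va[i + 1]);
        vector_cmpswap_skew(va[N - 1], va[0]);
        if (isSortedVec(va)) return true;
    }
    return false;
}

// Scalar step3: read lane 0 of every vector, then lane 1, and so on.
std::vector<uint32_t> reorder(const std::vector<V128>& va)
{
    std::vector<uint32_t> out;
    for (int j = 0; j < 4; j++) {
        for (size_t i = 0; i < va.size(); i++) out.push_back(va[i][j]);
    }
    return out;
}
```

When sort_step2 returns true, reorder(va) is the sorted sequence. When it returns false the data is still a permutation of the input, so falling back to std::sort on it is safe.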

AA-sort algorithm

sort each block: O(n log n)
merge sorted blocks: O(n)

merge two sorted vectors

a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
c = [b:a] = merge and sort (a, b)

[b:a] = vector_merge(a, b)

[Figure: the sorted inputs a = (a0 a1 a2 a3) and b = (b0 b1 b2 b3) are merged; on output the eight values are sorted across a (lower four) and b (upper four).]

data flow of merge

[Figure: data flow of merge. The sorted inputs a0 a1 a2 a3 and b0 b1 b2 b3 are first compared lane by lane, giving min00, max00, min11, max11, min22, max22, min33, max33; further compare-exchange stages then produce the sorted output.]

source of vector_merge

Too complex. Any good idea?

    void vector_merge(V128& a, V128& b)
    {
        V128 m = pminud(a, b);
        V128 M = pmaxud(a, b);
        V128 s0 = punpckhqdq(m, m);
        V128 s1 = pminud(s0, M);
        V128 s2 = pmaxud(s0, M);
        V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
        V128 s4 = punpcklqdq(s2, m);
        s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
        V128 s5 = pminud(s3, s4);
        V128 s6 = pmaxud(s3, s4);
        V128 s7 = pinsrd<2>(s5, movd(s6));
        V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
        a = pshufd<PACK(1, 2, 0, 3)>(s7);
        b = pshufd<PACK(3, 2, 0, 1)>(s8);
    }
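The contract of vector_merge (two sorted vectors in, the lower four values in a and the upper four in b) can be met by other networks as well. The scalar sketch below uses Batcher's bitonic merge, which is a different network from the shuffle sequence on the slide; V128 as a std::array and the helper cmpex are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

typedef std::array<uint32_t, 4> V128; // scalar stand-in (an assumption)

// Branch-free compare-exchange on two scalars.
static void cmpex(uint32_t& lo, uint32_t& hi)
{
    const uint32_t m = std::min(lo, hi);
    hi = std::max(lo, hi);
    lo = m;
}

// a and b must be sorted on entry. On return a holds the four smallest values
// and b the four largest, each sorted.
void vector_merge(V128& a, V128& b)
{
    // a followed by reversed b forms a bitonic sequence.
    uint32_t x[8] = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
    for (int i = 0; i < 4; i++) cmpex(x[i], x[i + 4]); // distance 4
    for (int i = 0; i < 8; i += 4) {                   // distance 2
        cmpex(x[i], x[i + 2]);
        cmpex(x[i + 1], x[i + 3]);
    }
    for (int i = 0; i < 8; i += 2) cmpex(x[i], x[i + 1]); // distance 1
    for (int i = 0; i < 4; i++) {
        a[i] = x[i];
        b[i] = x[i + 4];
    }
}
```

Each of the three stages is a set of independent min/max pairs, which is the property that lets such a network map onto pminud and pmaxud with shuffles in between.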

std::merge()

merge [begin1, end1) and [begin2, end2)

    template <class In1, class In2, class Out>
    Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
    {
        for (;;) {
            *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
            if (begin1 == end1) return std::copy(begin2, end2, out);
            if (begin2 == end2) return std::copy(begin1, end1, out);
        }
    }

vectorised merge

merge arrays with vector_merge()

    void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
    {
        uint32_t aPos = 0, bPos = 0, outPos = 0;
        V128 vMin = va[aPos++];
        V128 vMax = vb[bPos++];
        for (;;) {
            vector_merge(vMin, vMax);
            vo[outPos++] = vMin;
            if (aPos < aN) {
                if (bPos < bN) {
                    V128 ta = va[aPos];
                    V128 tb = vb[bPos];
                    if (movd(ta) <= movd(tb)) { // compare ta0 with tb0
                        vMin = ta;
                        aPos++;
                    } else {
                        vMin = tb;
                        bPos++;
                    }
                    …
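The slide shows only the start of the merge loop. One way the loop could be completed is sketched below as a scalar model. Everything after the head comparison (what happens when one input runs out, and the final store of vMax) is an assumption and not taken from the slides; the name vectorised_merge is hypothetical, and vector_merge here is a std::merge-based stand-in with the same contract.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::array<uint32_t, 4> V128; // scalar stand-in (an assumption)

// Stand-in with the contract of vector_merge: a gets the low half, b the high half.
void vector_merge(V128& a, V128& b)
{
    std::array<uint32_t, 8> c;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    std::copy(c.begin(), c.begin() + 4, a.begin());
    std::copy(c.begin() + 4, c.end(), b.begin());
}

// Lane 0 of a vector, which is what movd would extract.
uint32_t movd(const V128& v)
{
    return v[0];
}

// Merges sorted va (aN vectors) and vb (bN vectors) into vo (aN + bN vectors).
// Requires aN >= 1 and bN >= 1.
void vectorised_merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
{
    assert(aN >= 1 && bN >= 1);
    size_t aPos = 0, bPos = 0, outPos = 0;
    V128 vMin = va[aPos++];
    V128 vMax = vb[bPos++];
    for (;;) {
        vector_merge(vMin, vMax);
        vo[outPos++] = vMin; // the low half is final
        if (aPos < aN && (bPos >= bN || movd(va[aPos]) <= movd(vb[bPos]))) {
            vMin = va[aPos++]; // a has the smaller head, or b is exhausted
        } else if (bPos < bN) {
            vMin = vb[bPos++];
        } else {
            vo[outPos++] = vMax; // both inputs exhausted: flush the high half
            return;
        }
    }
}
```

Loading from the input with the smaller head works because every value still held in vMax came from vectors loaded earlier, so it cannot exceed either of the current heads.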

block size and rate of sort

What is a good block size for the vectorised sort?
half the size of L2 is recommended for PowerPC 970MP
    L2 = 1MiB => 512KiB => block size = 128Ki elements of uint32_t
BS = 32Ki seems good for Xeon and Core i7

[Figure: profile of sort and merge. merge(%) and sort(%) on a 0 to 100 scale, with x-axis ticks 64Ki, 128Ki, 256Ki, 512Ki, 1Mi, 2Mi, 4Mi, 8Mi.]
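The block size arithmetic for the PowerPC 970MP case can be written out as a compile-time check. The helper name blockElements is hypothetical; the half-of-L2 rule is the recommendation quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements in a block that fills half of the L2 cache.
constexpr size_t blockElements(size_t l2Bytes, size_t elementBytes)
{
    return (l2Bytes / 2) / elementBytes;
}

// L2 = 1 MiB => 512 KiB per block => 128Ki elements of uint32_t.
static_assert(blockElements(1024 * 1024, sizeof(uint32_t)) == 128 * 1024,
              "half of a 1 MiB L2 holds 128Ki uint32_t values");
```

The 32Ki figure for Xeon and Core i7 is an empirical choice from the slides, not something this formula derives.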

Benchmark(1/3)

AA-sort vs std::sort for random data
Xeon X5650 + gcc-4.6.3
    4 times faster for # < 64Ki, 2.85 times faster for # = 4Mi

[Figure: clock cycles (log scale, 1 to 10000000) against # of uint32_t (16 to 8Mi); the legend includes std::sort, and an arrow marks the "fast" direction.]

Benchmark(2/3)

sort 64Ki uint on Xeon + gcc-4.6.3
AA-sort speed does not strongly depend on the input pattern

[Figure: std::sort and AA-sort on a 0 to 25000 scale for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted, reversed; an arrow marks the "fast" direction.]

Benchmark(3/3)

sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11

[Figure: std::sort(gcc), AA-sort(gcc), std::sort(VC), AA-sort(VC) on a 0 to 16000 scale for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted, reversed; an arrow marks the "fast" direction.]
