AA-sort with SSE4.1
Cybozu Labs
2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)
Agenda
- Introduction of AA-sort
- classic combsort
- vectorized combsort
- vectorized merge
- benchmark
2012/6/16 #x86opti 4
AA-sort
- Aligned-Access sort, proposed by Hiroshi Inoue et al. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011
  http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
  http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
- For SIMD: fewer conditional branches, no unaligned data access
- For multicore processors: they implemented it for PowerPC and Cell BE
- O(n log n) complexity
- I tried it on Intel CPUs (not complete): https://github.com/herumi/opti/blob/master/intsort.hpp
- the current version is for only one processor
AA-sort
- vectorized combsort for each block (<= L2 cache?)
- vectorized merge of the sorted blocks

[Figure: the input array is split into block 0, block 1, block 2, block 3, ...; each block is sorted, then neighbouring sorted blocks are merged pairwise, level by level, until a single sorted array remains.]
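The two phases in the figure can be sketched with standard-library stand-ins. This is only an illustration of the control flow: `blockSortThenMerge` is a hypothetical name, and `std::sort` / `std::inplace_merge` take the place of the vectorized combsort and the vectorized merge described on the following slides.

```cpp
#include <algorithm>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Phase 1 sorts fixed-size blocks; phase 2 merges neighbouring sorted runs
// until one run covers the whole array.
// std::sort and std::inplace_merge are stand-ins for the SIMD kernels.
void blockSortThenMerge(std::vector<uint32_t>& a, size_t blockSize)
{
    assert(blockSize > 0);
    const size_t n = a.size();
    // Phase 1: sort each block independently.
    for (size_t begin = 0; begin < n; begin += blockSize) {
        const size_t end = std::min(begin + blockSize, n);
        std::sort(a.begin() + begin, a.begin() + end);
    }
    // Phase 2: merge runs of length `run` into runs of length 2 * run.
    for (size_t run = blockSize; run < n; run *= 2) {
        for (size_t begin = 0; begin + run < n; begin += 2 * run) {
            const size_t mid = begin + run;
            const size_t end = std::min(begin + 2 * run, n);
            std::inplace_merge(a.begin() + begin, a.begin() + mid, a.begin() + end);
        }
    }
}
```

With a block size chosen to fit in cache, phase 1 works on cache-resident data and phase 2 streams through memory once per merge level.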
AA-sort algorithm
- sort each block: O(n log n)
- merge sorted blocks: O(n)
classic combsort (1/2)
- an improved bubble sort
- unstable, O(n log n)
- compares two elements separated by a gap (>= 1)
- the gap is divided by a shrink factor (about 1.3) after each pass

    size_t nextGap(size_t N) { return (N * 10) / 13; }

    void combsort(uint32_t *a, size_t N)
    {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
            }
            gap = nextGap(gap);
        }
        …
classic combsort (2/2)
- gap = 1 means bubble sort
- loop until the array is fully sorted

        …
        for (;;) {
            bool isSwapped = false;
            // i + 1 < N instead of i < N - 1, so that N == 0 does not wrap around
            for (size_t i = 0; i + 1 < N; i++) {
                if (a[i] > a[i + 1]) {
                    std::swap(a[i], a[i + 1]);
                    isSwapped = true;
                }
            }
            if (!isSwapped) return;
        }
    }
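Putting the two halves of the slide code together gives the following self-contained version. It is a sketch that follows the slides, with one assumption of mine: the loop bounds are written as `i + gap < N` and `i + 1 < N` so that empty input is handled.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <utility>

size_t nextGap(size_t n) { return (n * 10) / 13; }

void combsort(uint32_t* a, size_t N)
{
    // Phase 1: compare elements `gap` apart, shrinking the gap after each pass.
    for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (size_t i = 0; i + gap < N; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
    }
    // Phase 2: gap == 1 is a bubble sort; repeat until a pass makes no swap.
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}
```

The final bubble-sort phase is what guarantees a sorted result; the gapped passes only make that phase finish after few iterations.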
gap function
- Combsort11: a final gap sequence of [11, 8, 6, 4, 3, 2, 1] seems good, according to
  http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm
- a little faster if line (*) is added

    size_t nextGap(size_t n)
    {
        n = (n * 10) / 13;
        if (n == 9 || n == 10) return 11; // (*)
        return n;
    }
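To see the effect of line (*), the gap sequence can be listed for a given array size. The helper `gapSequence` below is a hypothetical name introduced only for this illustration; `nextGap` is the function from the slide.

```cpp
#include <cassert>
#include <stddef.h>
#include <vector>

size_t nextGap(size_t n)
{
    n = (n * 10) / 13;
    if (n == 9 || n == 10) return 11; // Combsort11 rule, line (*) on the slide
    return n;
}

// Collect every gap used for an array of N elements, down to gap == 1.
std::vector<size_t> gapSequence(size_t N)
{
    std::vector<size_t> gaps;
    for (size_t gap = nextGap(N); gap >= 1; gap = nextGap(gap)) {
        gaps.push_back(gap);
        if (gap == 1) break; // nextGap(1) is 0, so stop here
    }
    return gaps;
}
```

For N = 100 the sequence is 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1: the gap of 10 that plain division would give is replaced by 11, so the tail is the [11, 8, 6, 4, 3, 2, 1] pattern.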
vectorized combsort
- step1: sort the values within each vector (32 bit x 4)
- step2: SIMD version of combsort
- step3: reorder the data

[Figure: the input array is loaded as vectors v0, v1, v2, v3, ... with lanes +0 to +3. step1 sorts the values inside each vector, step2 applies the SIMD combsort across the vectors, and step3 reorders the result into the final sorted array.]
step1
- step1.1: sort [v[i][j] | i <- [0..3]] for j = 0, 1, 2, 3
- step1.2: transpose

          input           after step1.1     after step1.2
          v0 v1 v2 v3     v0 v1 v2 v3       v0 v1 v2 v3
    +0     3  5  0  8      0  3  5  8        0  1  4  9
    +1     2  7  1  2      1  2  2  7        3  2  8 14
    +2     8 12  4 13      4  8 12 13        5  2 12 15
    +3     9 14 20 15      9 14 15 20        8  7 13 20
sort of 4 items
- use pmaxud, pminud for uint32_t x 4

[Figure: sorting network for v0, v1, v2, v3. Each comparator maps (a, b) to (min(a,b), max(a,b)). First stage: min01, max01, min23, max23. Second stage: min0123 = min(min01, min23), max0123 = max(max01, max23), s = max(min01, min23), t = min(max01, max23). Sorted output: min0123 < min(s,t) < max(s,t) < max0123.]
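The network in the figure can be checked with a scalar model. In this sketch `Vec4`, `vmin`, `vmax` and `sortLanes` are names I made up: `Vec4` stands in for one 128-bit register, and `vmin` / `vmax` model the lane-wise behaviour of pminud / pmaxud. It is not SSE code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

using Vec4 = std::array<uint32_t, 4>; // scalar stand-in for one 128-bit register

Vec4 vmin(const Vec4& a, const Vec4& b) // models pminud
{
    Vec4 r;
    for (int i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
    return r;
}

Vec4 vmax(const Vec4& a, const Vec4& b) // models pmaxud
{
    Vec4 r;
    for (int i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
    return r;
}

// After the call, x[0][j] <= x[1][j] <= x[2][j] <= x[3][j] for every lane j.
void sortLanes(Vec4 x[4])
{
    Vec4 min01 = vmin(x[0], x[1]);
    Vec4 max01 = vmax(x[0], x[1]);
    Vec4 min23 = vmin(x[2], x[3]);
    Vec4 max23 = vmax(x[2], x[3]);
    Vec4 s = vmax(min01, min23);
    Vec4 t = vmin(max01, max23);
    x[0] = vmin(min01, min23); // smallest of the four
    x[1] = vmin(s, t);
    x[2] = vmax(s, t);
    x[3] = vmax(max01, max23); // largest of the four
}
```

Ten min/max operations sort four independent columns at once, with no data-dependent branch.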
source of step1.1
- V128 is a type of 32-bit integer x 4
- pminud(a, b): min(a_i, b_i) for i = 0, 1, 2, 3

    void sort_step1_vec(V128 x[4])
    {
        V128 min01 = pminud(x[0], x[1]);
        V128 max01 = pmaxud(x[0], x[1]);
        V128 min23 = pminud(x[2], x[3]);
        V128 max23 = pmaxud(x[2], x[3]);
        x[0] = pminud(min01, min23);
        x[3] = pmaxud(max01, max23);
        V128 s = pmaxud(min01, min23);
        V128 t = pminud(max01, max23);
        x[1] = pminud(s, t);
        x[2] = pmaxud(s, t);
    }
transpose of 4x4 matrix
- use unpcklps and unpckhps

          x0 x1 x2 x3
    +0     3  5  0  8
    +1     2  7  1  2
    +2     8 12  4 13
    +3     9 14 20 15

    t0 = unpcklps(x0, x2)
    t2 = unpckhps(x0, x2)
    t1 = unpcklps(x1, x3)
    t3 = unpckhps(x1, x3)

          t0 t1 t2 t3
    +0     3  5  8 12
    +1     0  8  4 13
    +2     2  7  9 14
    +3     1  2 20 15

    x0 = unpcklps(t0, t1)
    x1 = unpckhps(t0, t1)
    x2 = unpcklps(t2, t3)
    x3 = unpckhps(t2, t3)

          x0 x1 x2 x3
    +0     3  2  8  9
    +1     5  7 12 14
    +2     0  1  4 20
    +3     8  2 13 15
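The two unpack rounds can be traced with a scalar model of the instructions. In this sketch `Vec4` is a hypothetical stand-in for a 128-bit register (index i is lane +i), and the two helper functions only model how unpcklps and unpckhps interleave 32-bit lanes.

```cpp
#include <array>
#include <cassert>
#include <stdint.h>

using Vec4 = std::array<uint32_t, 4>; // index i is lane +i

// unpcklps interleaves the low two lanes of a and b.
Vec4 unpcklps(const Vec4& a, const Vec4& b) { return Vec4{{a[0], b[0], a[1], b[1]}}; }
// unpckhps interleaves the high two lanes of a and b.
Vec4 unpckhps(const Vec4& a, const Vec4& b) { return Vec4{{a[2], b[2], a[3], b[3]}}; }

// Transpose the 4x4 matrix whose columns are x[0] .. x[3].
void transpose(Vec4 x[4])
{
    Vec4 t0 = unpcklps(x[0], x[2]);
    Vec4 t1 = unpcklps(x[1], x[3]);
    Vec4 t2 = unpckhps(x[0], x[2]);
    Vec4 t3 = unpckhps(x[1], x[3]);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}
```

Running it on the slide's input x0 = (3, 2, 8, 9), x1 = (5, 7, 12, 14), x2 = (0, 1, 4, 20), x3 = (8, 2, 13, 15) reproduces the last table: x0 = (3, 5, 0, 8), x1 = (2, 7, 1, 2), x2 = (8, 12, 4, 13), x3 = (9, 14, 20, 15).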
source of transpose and step1

    void transpose(V128 x[4])
    {
        V128 x0 = x[0];
        V128 x1 = x[1];
        V128 x2 = x[2];
        V128 x3 = x[3];
        V128 t0 = unpcklps(x0, x2);
        V128 t1 = unpcklps(x1, x3);
        V128 t2 = unpckhps(x0, x2);
        V128 t3 = unpckhps(x1, x3);
        x[0] = unpcklps(t0, t1);
        x[1] = unpckhps(t0, t1);
        x[2] = unpcklps(t2, t3);
        x[3] = unpckhps(t2, t3);
    }

    void sort_step1(V128 *va, size_t N)
    {
        for (size_t i = 0; i < N; i += 4) {
            sort_step1_vec(&va[i]);
            transpose(&va[i]);
        }
    }
SIMD version combsort
- first half of the code
- uses vector_cmpswap and vector_cmpswap_skew

    bool sort_step2(V128 *va, size_t N)
    {
        size_t gap = nextGap(N);
        while (gap > 1) {
            for (size_t i = 0; i < N - gap; i++) {
                vector_cmpswap(va[i], va[i + gap]);
            }
            for (size_t i = N - gap; i < N; i++) {
                vector_cmpswap_skew(va[i], va[i + gap - N]);
            }
            gap = nextGap(gap);
        }
        ...
vector_cmpswap
- no conditional branch
- vectorised form of: if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

    void vector_cmpswap(V128& a, V128& b)
    {
        V128 t = pmaxud(a, b);
        a = pminud(a, b);
        b = t;
    }

[Figure: (a, b) becomes (min(a,b), max(a,b)), so a < b holds lane by lane afterwards.]
vector_cmpswap_skew
- for the boundary of the array

    a  = [a3 : a2 : a1 : a0]
    b  = [b3 : b2 : b1 : b0]
    a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]
    b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]
    (a', b') = vector_cmpswap_skew(a, b)
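The difference between the two compare-and-swap operations is easiest to see in a scalar model. This is a sketch under my own assumptions: `Vec4` is a hypothetical stand-in for V128 with index i meaning lane i (a0 to a3), and the loops describe the lane pairing shown above rather than the actual SSE instruction sequence.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

using Vec4 = std::array<uint32_t, 4>; // index i is lane i (a0 .. a3)

// Lane i of a is paired with lane i of b.
void vector_cmpswap(Vec4& a, Vec4& b)
{
    for (int i = 0; i < 4; i++) {
        uint32_t lo = std::min(a[i], b[i]);
        uint32_t hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// Skewed variant: lane i of a is paired with lane i + 1 of b;
// a[3] and b[0] pass through unchanged.
void vector_cmpswap_skew(Vec4& a, Vec4& b)
{
    for (int i = 0; i < 3; i++) {
        uint32_t lo = std::min(a[i], b[i + 1]);
        uint32_t hi = std::max(a[i], b[i + 1]);
        a[i] = lo;
        b[i + 1] = hi;
    }
}
```

For example, with a = (9, 1, 7, 3) and b = (0, 2, 8, 4) in lane order, the skewed version yields a = (2, 1, 4, 3) and b = (0, 9, 8, 7).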
isSortedVec
- checks whether the array is sorted
- ptest_zf(a, b) is true if (a & b) == 0
- a <= b  <=>  max(a, b) == b  <=>  c := max(a, b) - b == 0
- pcmpgtd is for int32_t, so we can't use it

    bool isSortedVec(const V128 *va, size_t N)
    {
        // i + 1 < N instead of i < N - 1, so that N == 0 does not wrap around
        for (size_t i = 0; i + 1 < N; i++) {
            V128 a = va[i];
            V128 b = va[i + 1];
            V128 c = pmaxud(a, b);
            c = psubd(c, b);
            if (!ptest_zf(c, c)) {
                return false;
            }
        }
        return true;
    }
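The reason for the max-and-subtract test can be shown on single values. The two helpers below are hypothetical names for this illustration: the first is the unsigned identity from the slide, the second shows what a signed 32-bit comparison (the kind pcmpgtd performs) concludes for the same bit patterns.

```cpp
#include <algorithm>
#include <cassert>
#include <stdint.h>

// Unsigned test from the slide: a <= b  <=>  max(a, b) - b == 0.
bool lessEqualUnsigned(uint32_t a, uint32_t b)
{
    uint32_t c = std::max(a, b) - b;
    return c == 0;
}

// The same bit patterns compared as signed 32-bit integers.
bool lessEqualSigned(uint32_t a, uint32_t b)
{
    return static_cast<int32_t>(a) <= static_cast<int32_t>(b);
}
```

With a = 1 and b = 0x80000000 the unsigned test reports a <= b, while the signed comparison sees b as a negative number and reports the opposite. That is why the code uses pmaxud and psubd instead of pcmpgtd.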
loop for gap == 1
- vectorised bubble sort for gap == 1
- give up if the loop count reaches maxLoop
- falling back to std::sort is very rare

        const int maxLoop = 10;
        for (int loop = 0; loop < maxLoop; loop++) {
            for (size_t i = 0; i < N - 1; i++) {
                vector_cmpswap(va[i], va[i + 1]);
            }
            vector_cmpswap_skew(va[N - 1], va[0]);
            if (isSortedVec(va, N)) return true;
        }
AA-sort algorithm
- sort each block: O(n log n)
- merge sorted blocks: O(n)
merge two sorted vectors
- a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
- c = [b:a] = merge and sort (a, b)
- [b:a] = vector_merge(a, b)

[Figure: two sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) go in; the merged, sorted eight values come out in a and b.]

data flow of merge

[Figure: from the sorted inputs a0..a3 and b0..b3, the first stage computes min00/max00, min11/max11, min22/max22 and min33/max33 with four comparators; two further stages with two and three comparators complete the merge.]
source of vector_merge
- too complex; is there a better idea?

    void vector_merge(V128& a, V128& b)
    {
        V128 m = pminud(a, b);
        V128 M = pmaxud(a, b);
        V128 s0 = punpckhqdq(m, m);
        V128 s1 = pminud(s0, M);
        V128 s2 = pmaxud(s0, M);
        V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
        V128 s4 = punpcklqdq(s2, m);
        s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
        V128 s5 = pminud(s3, s4);
        V128 s6 = pmaxud(s3, s4);
        V128 s7 = pinsrd<2>(s5, movd(s6));
        V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
        a = pshufd<PACK(1, 2, 0, 3)>(s7);
        b = pshufd<PACK(3, 2, 0, 1)>(s8);
    }
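What vector_merge has to compute can be written as a plain scalar comparator network. The sketch below uses Batcher's odd-even merge network for 4 + 4 elements, which is a swapped-in textbook technique and not a transcription of the shuffle sequence above. `Vec4` and `cmpswap` are hypothetical names, with `Vec4` standing in for V128.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

using Vec4 = std::array<uint32_t, 4>; // index i is lane i

// One comparator: afterwards lo <= hi.
static void cmpswap(uint32_t& lo, uint32_t& hi)
{
    uint32_t mn = std::min(lo, hi);
    uint32_t mx = std::max(lo, hi);
    lo = mn;
    hi = mx;
}

// Merge two sorted vectors. On return a holds the four smallest values
// and b the four largest, both in ascending lane order.
void vector_merge(Vec4& a, Vec4& b)
{
    uint32_t x[8] = {a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]};
    // Stage 1: a_i against b_i (four comparators).
    cmpswap(x[0], x[4]); cmpswap(x[1], x[5]);
    cmpswap(x[2], x[6]); cmpswap(x[3], x[7]);
    // Stage 2: two comparators.
    cmpswap(x[2], x[4]); cmpswap(x[3], x[5]);
    // Stage 3: three comparators.
    cmpswap(x[1], x[2]); cmpswap(x[3], x[4]); cmpswap(x[5], x[6]);
    a = Vec4{{x[0], x[1], x[2], x[3]}};
    b = Vec4{{x[4], x[5], x[6], x[7]}};
}
```

The network needs only nine comparators in three stages. The complexity of the SSE version comes from moving lanes into position between the stages, since pminud and pmaxud only compare lanes at the same index.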
std::merge()
- merges [begin1, end1) and [begin2, end2)

    // simplified: assumes both input ranges are non-empty
    template <class In1, class In2, class Out>
    Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
    {
        for (;;) {
            *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
            if (begin1 == end1) return std::copy(begin2, end2, out);
            if (begin2 == end2) return std::copy(begin1, end1, out);
        }
    }
vectorised merge
- merge arrays with vector_merge()

    void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
    {
        uint32_t aPos = 0, bPos = 0, outPos = 0;
        V128 vMin = va[aPos++];
        V128 vMax = vb[bPos++];
        for (;;) {
            vector_merge(vMin, vMax);
            vo[outPos++] = vMin;
            if (aPos < aN) {
                if (bPos < bN) {
                    V128 ta = va[aPos];
                    V128 tb = vb[bPos];
                    if (movd(ta) <= movd(tb)) { // compare ta0 with tb0
                        vMin = ta;
                        aPos++;
                    } else {
                        vMin = tb;
                        bPos++;
                    }
                    ...
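The slide code is cut off, so the following scalar sketch completes the loop under my own assumptions about the missing part: once one input is exhausted the remaining vectors of the other are fed in, and when both are exhausted vMax is written out last. `Vec4` and `mergeVectors` are hypothetical names, and `vector_merge` here is a `std::merge`-based stand-in for the register-level merge.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

using Vec4 = std::array<uint32_t, 4>; // index 0 is the lowest lane

// Stand-in for the register-level merge: a gets the 4 smallest, b the 4 largest.
void vector_merge(Vec4& a, Vec4& b)
{
    uint32_t x[8];
    std::merge(a.begin(), a.end(), b.begin(), b.end(), x);
    std::copy(x, x + 4, a.begin());
    std::copy(x + 4, x + 8, b.begin());
}

// Merge sorted vector arrays va (aN vectors) and vb (bN vectors) into vo.
// Assumes aN >= 1 and bN >= 1; vo must have room for aN + bN vectors.
void mergeVectors(Vec4* vo, const Vec4* va, size_t aN, const Vec4* vb, size_t bN)
{
    size_t aPos = 0, bPos = 0, outPos = 0;
    Vec4 vMin = va[aPos++];
    Vec4 vMax = vb[bPos++];
    for (;;) {
        vector_merge(vMin, vMax);
        vo[outPos++] = vMin; // the 4 smallest pending values are final
        if (aPos < aN && bPos < bN) {
            // Take the next vector from the input whose head is smaller.
            if (va[aPos][0] <= vb[bPos][0]) vMin = va[aPos++];
            else vMin = vb[bPos++];
        } else if (aPos < aN) {
            vMin = va[aPos++];
        } else if (bPos < bN) {
            vMin = vb[bPos++];
        } else {
            vo[outPos++] = vMax; // both inputs consumed: flush the rest
            return;
        }
    }
}
```

Only one scalar comparison per four output elements decides which input to read next, which is how the branch count drops compared with the element-wise std::merge loop.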
block size and rate of sort
- What is a good block size for the vectorised sort?
- half the size of L2 is recommended for PowerPC 970MP
  L2 = 1MiB => 512KiB => block size = 128Ki elements of uint32_t
- BS = 32Ki seems good for Xeon, Core i7
- profile of sort and merge

[Figure: stacked bars showing the share of time spent in merge (%) and sort (%), on a 0 to 100 scale, for array sizes 64Ki, 128Ki, 256Ki, 512Ki, 1Mi, 2Mi, 4Mi and 8Mi.]
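The block-size arithmetic for the PowerPC 970MP recommendation can be written out as a one-line helper. `blockElems` is a hypothetical name used only for this illustration, and the half-of-L2 rule is the recommendation quoted on the slide, not a tuned value for any other CPU.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Number of uint32_t elements in a block that occupies half of the L2 cache.
constexpr size_t blockElems(size_t l2Bytes)
{
    return (l2Bytes / 2) / sizeof(uint32_t);
}
```

For a 1 MiB L2 this gives 512 KiB / 4 bytes = 128Ki elements. The BS = 32Ki that the slide reports as good for Xeon and Core i7 corresponds to blocks of 128 KiB.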
Benchmark (1/3)
- AA-sort vs std::sort for random data
- Xeon X5650 + gcc-4.6.3
- 4 times faster for # < 64Ki, 2.85 times faster for # = 4Mi

[Figure: clock cycles (log scale, 1 to 10,000,000) against # of uint32_t (16 to 8Mi) for AA-sort and std::sort; an arrow marks the "fast" direction.]
Benchmark (2/3)
- sort 64Ki uint on Xeon + gcc-4.6.3
- AA-sort speed does not strongly depend on the data pattern

[Figure: bar chart (scale 0 to 25000) comparing std::sort and AA-sort for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted and reversed; an arrow marks the "fast" direction.]
Benchmark (3/3)
- sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11

[Figure: bar chart (scale 0 to 16000) comparing std::sort(gcc), AA-sort(gcc), std::sort(VC) and AA-sort(VC) for the patterns random, 16bit random, 8bit random, all zero, almost sorted, sorted and reversed; an arrow marks the "fast" direction.]