Improving the Performance of Morton Layout by Array Alignment and
Loop Unrolling
Reducing the Price of Naivety
Jeyarajan Thiyagalingam
Olav Beckmann and Paul H.J. Kelly
Software Performance Optimisation Group,
Imperial College, London
Motivation
• Consider two code variants of a matrix multiply:
IJK variant:
  for( i=0; i<N; i++ )
    for( j=0; j<N; j++ )
      for( k=0; k<N; k++ )
        C[i,j] += A[i,k] * B[k,j]
IKJ variant:
  for( i=0; i<N; i++ )
    for( k=0; k<N; k++ )
      for( j=0; j<N; j++ )
        C[i,j] += A[i,k] * B[k,j]
• Both code variants are valid and apparently have the same complexity.
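The two variants can be written as compilable C along the following lines. This is only a sketch: it assumes square n-by-n matrices stored row-major in flat arrays, and the function names `mm_ijk` and `mm_ikj` are illustrative rather than taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* IJK variant: the innermost loop walks B down a column (stride n). */
void mm_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* IKJ variant: the innermost loop walks B and C along a row (stride 1). */
void mm_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Both functions accumulate into C, so the caller is expected to zero C first. Only the order of the two inner loops differs.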
The price of naivety
• Depending on problem size and architecture, the IKJ variant can be up to 10 times faster than IJK.
Performance Programming Model
• Naively-written code can suffer a factor 10 performance hit
• Sometimes the compiler can help; none of the compilers we used interchanged these loops.
• A robust performance programming model would have to account for the capabilities of the compiler
• Offering a clear Performance Programming Model should be part of Compiler Research.
Compromise – blocked layout
Reason for differences in performance:
Row-major traversal uses 4 words per block
But column-major traversal uses only 1 word per block
Bandwidth wasted with CM
Row-major layout (offset of each element):
   0  1  2  3  4  5  6  7
   8  9 10 11 12 13 14 15
  16 17 18 19 20 21 22 23
  24 25 26 27 28 29 30 31
  32 33 34 35 36 37 38 39
  40 41 42 43 44 45 46 47
  48 49 50 51 52 53 54 55
  56 57 58 59 60 61 62 63
Blocked: 4-word cache block contains 2x2 subarray:
Row-major traversal uses 2 words per block
Column-major traversal uses 2 words per block
Blocked layout (offset of each element):
   0  1  4  5  8  9 12 13
   2  3  6  7 10 11 14 15
  16 17 20 21 24 25 28 29
  18 19 22 23 26 27 30 31
  32 33 36 37 40 41 44 45
  34 35 38 39 42 43 46 47
  48 49 52 53 56 57 60 61
  50 51 54 55 58 59 62 63
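The blocked numbering can be reproduced with a short offset function. The sketch below assumes an n-by-n array with n even, 2x2 blocks that are themselves ordered row-major, and row-major order inside each block; the name `blocked2x2_offset` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i,j) in an n-by-n array stored as 2x2 blocks. */
size_t blocked2x2_offset(size_t n, size_t i, size_t j)
{
    assert(n % 2 == 0 && i < n && j < n);
    /* Index of the 2x2 block, counting blocks in row-major order. */
    size_t block = (i / 2) * (n / 2) + (j / 2);
    /* Four words per block, then row-major position inside the block. */
    return block * 4 + (i % 2) * 2 + (j % 2);
}
```

Along any row or any column, two consecutive elements share a block before the traversal moves to the next one, which is where the figure of 2 words per block for both traversal orders comes from.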
Recursively-blocked layout
• Real machines have deep memory hierarchies
• Therefore, need to apply blocking recursively
• Layout of the blocks: Z-Morton (one of a number of space-filling curves)
Offset  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
(i,j)   (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
Offset  16    17    18    19    20    21    22    23    24    25    26    27    28    29    30    31
(i,j)   (0,4) (0,5) (1,4) (1,5) (0,6) (0,7) (1,6) (1,7) (2,4) (2,5) (3,4) (3,5) (2,6) (2,7) (3,6) (3,7)
[Figure: array elements numbered 0 to 63 in Z-Morton order]
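The Z-Morton offsets in the table follow from interleaving the bits of the two indices. A minimal sketch, assuming 16-bit indices, with the bits of j in the even positions and the bits of i in the odd positions (the convention the table implies); the name `morton_offset` is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Z-Morton offset of (i,j): bit b of j goes to position 2b,
 * bit b of i goes to position 2b+1. */
uint32_t morton_offset(uint32_t i, uint32_t j)
{
    assert(i <= 0xFFFFu && j <= 0xFFFFu);
    uint32_t s = 0;
    for (unsigned b = 0; b < 16; b++) {
        s |= ((j >> b) & 1u) << (2 * b);
        s |= ((i >> b) & 1u) << (2 * b + 1);
    }
    return s;
}
```

The bit loop is written for clarity, not speed. It is far too slow for an inner loop, which is the cost the address-calculation slides set out to reduce.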
Morton Layout – A Compromise
Morton storage layout is unbiased towards either row- or column-major traversal.
Row-major Traversal
Block Size   RM Array   Morton Array
32B          75%        50%
128B         93.7%      75%
8kB page     99.9%      96.87%
Column-major Traversal
Block Size   RM Array   Morton Array
32B          0%         50%
128B         0%         75%
8kB          0%         96.87%
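The Morton and row-major-traversal percentages can be reproduced with simple arithmetic, assuming they are spatial-locality hit rates for 8-byte doubles (so a 32B block holds 4 words) and that each block is aligned and touched once per traversal. The function names below are hypothetical.

```c
#include <assert.h>

/* Row-major array traversed in row-major order: one miss per block. */
double rm_hit_rate(unsigned words_per_block)
{
    assert(words_per_block > 0);
    return 1.0 - 1.0 / words_per_block;
}

/* Morton array traversed along a row or a column: a block of 4^k words
 * is a square of side 2^k, so each visit uses 2^k of its words. */
double morton_hit_rate(unsigned words_per_block)
{
    unsigned side = 1;
    while (side * side < words_per_block)
        side *= 2;
    assert(side * side == words_per_block); /* even power of two only */
    return 1.0 - 1.0 / side;
}
```

For 4, 16 and 1024 words per block (32B, 128B and 8kB) this gives 75%, 93.75% and about 99.9% for the row-major array, and 50%, 75% and 96.875% for the Morton array in either traversal direction.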
So have we solved the problem?
• Unfortunately, the basic Morton Scheme often performs disappointingly.
• At least Morton does not seem to suffer from pathological drops in performance.
Alignment
The statement that Morton layout is unbiased turns out to be based on the assumption that each cache line maps to the start of a Morton block.
Offset  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
(i,j)   (0,0) (0,1) (1,0) (1,1) (0,2) (0,3) (1,2) (1,3) (2,0) (2,1) (3,0) (3,1) (2,2) (2,3) (3,2) (3,3)
Alignment
It turns out that Morton layout is only unbiased for even power-of-two cache line sizes, that is, lines holding 4^k array elements, which can cover a square Morton block.
The same problems happen when the base address is misaligned.
Alignment
We calculated miss-rates systematically for all levels of the memory hierarchy.
In each case, we calculated the miss-rates for all possible alignments of the base address.
The difference in miss-rates between the best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines, and a factor of 2 for odd power-of-two cache lines.
Alignment
The overall miss-rates drop exponentially with block size, but access times are generally assumed to increase geometrically with block size.
[Figure: Morton Order Miss-rates for Row-major and Column-major Traversal. x-axis: block size in double words (4 to 1024); y-axis: miss-rate (0 to 1.2); series: RM Maximum, RM Minimum, CM Maximum, CM Minimum]
Alignment
With canonical layouts, it is often necessary to pad the row or column length in order to avoid pathological behaviour. Finding the right amount of padding is not trivial.
Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. the page size.
Aliasing in the memory hierarchy can spoil the theory.
For example, on Pentium 4, the following aliasing patterns cause problems:
  2K: map to the same L1 cache line
  16K: aliases in store-forwarding logic
  32K: map to the same L2 cache line
  64K: indistinguishable in L1 cache
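Aligning the base address can be sketched with the C11 `aligned_alloc` function. The helper names below are hypothetical, the alignment is left as a parameter so that values other than the page size can be tried, and the sketch assumes a square power-of-two array of doubles.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate an n-by-n array of doubles whose base address is a multiple
 * of `alignment` bytes (for example, the 8kB page size). */
double *alloc_aligned_morton(size_t n, size_t alignment)
{
    assert(n > 0 && (n & (n - 1)) == 0);                  /* power of two */
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    size_t bytes = n * n * sizeof(double);
    /* C11 expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);             /* NULL on failure */
}

/* Returns 1 if p is a multiple of `alignment` bytes, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return (uintptr_t)p % alignment == 0;
}
```

The returned block is released with `free`. Because of the aliasing effects listed above, page alignment is a starting point to measure from, not a guaranteed optimum.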
Address calculation
With lexicographic (aka canonical) layout, it’s easy to calculate the offset S of A[i,j] in an N×M array A:
Srm(i,j) = Ni + j
Scm(i,j) = i + Mj
(if N and M are powers of two, this is bit-wise concatenation of i and j)
In loops, the multiplication is replaced by an increment.
When unrolling loops, the address calculation can be strength-reduced.
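As a small illustration of replacing the multiplication by an increment, the sketch below sums one column of a row-major array in two ways. It uses Srm(i,j) = Ni + j with N taken as the row length; the function names are illustrative, and in practice a compiler usually performs this transformation itself.

```c
#include <assert.h>
#include <stddef.h>

/* Column sum, recomputing Srm(i,j) = N*i + j on every iteration. */
double col_sum_mul(const double *A, size_t rows, size_t N, size_t j)
{
    assert(j < N);
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        s += A[N * i + j];
    return s;
}

/* Same sum with the multiplication strength-reduced to "offset += N". */
double col_sum_inc(const double *A, size_t rows, size_t N, size_t j)
{
    assert(j < N);
    double s = 0.0;
    size_t offset = j;
    for (size_t i = 0; i < rows; i++) {
        s += A[offset];
        offset += N;
    }
    return s;
}
```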
How can we calculate the Morton offset?
Address calculation
Morton indices can be calculated by using the bit-concatenation idea of RM/CM for power-of-two arrays recursively:
For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.
Let D0(i) = i_n 0 … i_1 0 i_0 0
Let D1(i) = 0 i_n … 0 i_1 0 i_0
Then Smz(i,j) = D0(i) | D1(j)
Dilation is rather expensive for an inner loop.
Strength reduction (Wise et al.):
D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0
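The two increment formulas can be checked with a short sketch. It reads D0 as placing the bits of i in the odd positions and D1 as placing them in the even positions, and it assumes Ones0 is the mask with ones in the even positions and Ones1 the mask with ones in the odd positions. The slide does not define the masks, so these values are an assumption, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

#define ONES0 0x55555555u  /* assumed: ones in even bit positions */
#define ONES1 0xAAAAAAAAu  /* assumed: ones in odd bit positions  */

/* D1: spread the low 16 bits of x into the even bit positions. */
uint32_t dilate1(uint32_t x)
{
    assert(x <= 0xFFFFu);
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* D0: the same bits, shifted into the odd bit positions. */
uint32_t dilate0(uint32_t x)
{
    return dilate1(x) << 1;
}

/* Filling the unused positions with ones lets the +1 carry ripple
 * across them; the final mask clears them again. */
uint32_t next_d0(uint32_t d) { return ((d | ONES0) + 1u) & ONES1; }
uint32_t next_d1(uint32_t d) { return ((d | ONES1) + 1u) & ONES0; }
```

With these definitions `dilate0(i) | dilate1(j)` is the Morton offset of (i,j), and `next_d0` and `next_d1` step a dilated index by one without dilating again.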
Address calculation
Idea: use lookup tables for D0(i) and D1(j)
A[MortonTabEven[i] + MortonTabOdd[j]]
When can we do strength reduction? In general Smz(i,j+1) could be anywhere
D0(i + 1) = ???
D0(i + k) = D0(i) + D0(k) if i’s and k’s bits do not overlap.
We can do strength reduction
D0(i + k) = D0(i) + D0(k) as long as i = 2^n and k < 2^n
With this, we can do loop unrolling
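The additive property can be checked with a pair of small lookup tables. In this sketch the table names `tab_lo` (bits in even positions) and `tab_hi` (bits in odd positions) and the table size are hypothetical, and the condition is generalised slightly to any i that is a multiple of 2^n, which still guarantees that the bits of i and k do not overlap.

```c
#include <assert.h>
#include <stdint.h>

enum { TAB_SIZE = 256 };
static uint32_t tab_lo[TAB_SIZE];  /* index bits in positions 0, 2, 4, ... */
static uint32_t tab_hi[TAB_SIZE];  /* index bits in positions 1, 3, 5, ... */

/* Bit-by-bit dilation; only used to fill the tables. */
static uint32_t dilate_slow(uint32_t x)
{
    uint32_t d = 0;
    for (unsigned b = 0; b < 16; b++)
        d |= ((x >> b) & 1u) << (2 * b);
    return d;
}

void init_morton_tables(void)
{
    for (uint32_t x = 0; x < TAB_SIZE; x++) {
        tab_lo[x] = dilate_slow(x);
        tab_hi[x] = dilate_slow(x) << 1;
    }
}

/* Nonzero when D(i + k) == D(i) + D(k) is guaranteed:
 * i is a multiple of 2^n and k < 2^n. */
int can_strength_reduce(uint32_t i, uint32_t k, unsigned n)
{
    assert(n < 32);
    uint32_t block = 1u << n;
    return (i % block == 0) && (k < block);
}
```

For a loop unrolled by 4 that starts at a multiple of 4, the dilated offsets of the next three iterations relative to the first are the constants 1, 4, 5 in the even-position table and 2, 8, 10 in the odd-position table, so one table lookup per group of four iterations is enough.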
Unrolled Code with Strength-Reduction
/* sz must be a multiple of 4 so that every unrolled group starts at k = 4m. */
void mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
{
  unsigned i, j, k;
  for (i = 0; i < sz; i++) {
    unsigned int t1i = MortonTabOdd[i];
    for (j = 0; j < sz; j++) {
      unsigned int t0j = MortonTabEven[j];
      for (k = 0; k < sz; k += 4) {
        unsigned int t0k = MortonTabEven[k];
        unsigned int t1k = MortonTabOdd[k];
        /* The constants are the dilated offsets of k+1, k+2 and k+3 relative to k. */
        C[t1i+t0j] += A[t1i+t0k     ] * B[t1k+t0j    ];
        C[t1i+t0j] += A[t1i+t0k +  2] * B[t1k+t0j + 1];
        C[t1i+t0j] += A[t1i+t0k +  8] * B[t1k+t0j + 4];
        C[t1i+t0j] += A[t1i+t0k + 10] * B[t1k+t0j + 5];
      }
    }
  }
}
So have we solved the problem?
• Unrolling significantly reduces the overhead of the Basic Morton Scheme.
• IKJ is still faster than IJK – might be due to having two table lookups in the inner loop.
Benchmarks
• Suite of simple numerical kernels operating on 2D arrays of doubles
• Used the compilers and flags which the vendors used for their SPEC-CFP2000 results
MM-ijk     Matrix Multiply, ijk loop nest
MM-ikj     Matrix Multiply, ikj loop nest
Cholk      Cholesky, K variant
Jacobi2d   2-dimensional, 4-point stencil smoother
ADI        Alternating Direction Implicit kernel, ij, ij order
Experimental Setup
• We used identical clusters of (student) lab machines during off-peak periods
• Extensive scripting to automate data collection
• Dixon Test to remove outliers from the measurements
• Use median instead of mean.
• Overall more than 26M measurements
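Taking the median of repeated timings can be sketched as below. The function name `median` and the copy-then-sort approach are illustrative only and are not a description of the scripts actually used; the Dixon outlier test is not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1]; the input array is left unchanged. */
double median(const double *v, size_t n)
{
    assert(v != NULL && n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        abort();
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2 != 0) ? tmp[n / 2]
                            : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}
```

A single slow run, for example one disturbed by another user of a shared lab machine, shifts the mean considerably but barely moves the median.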
Architectures
AMD, Thunderbird 1.8GHz, 512MB DDR-RAM
  64KB, 2-way, 64Byte block L1 cache; 256KB, 8-way, 64B block L2.
  Intel C Compiler v7.1 for Linux. Flags: “-xK -ipo -static +FDO”
Pentium III, Coppermine 450MHz, 256MB SDRAM
  16KB, 4-way, 32Byte block L1 cache; 256KB, 8-way, 32B block L2.
  Intel C Compiler v7.1 for Linux. Flags: “-xK -ipo -O3 -static +FDO”
Sun, SunFire 6800, UltraSparc III 750MHz
  64KB, 4-way, 32Byte block L1 cache; 8MB Direct Mapped L2 Cache.
  Sun Workshop Compiler V6. Flags: “-fast -xcrossfile -xalias_level=std +FDO”
Alpha, Compaq AlphaServer ES40, 21264 (EV6) 500MHz
  64KB, 2-way, 64Byte block L1 cache; 4MB Direct Mapped L2 Cache.
  Compaq C Compiler V6. Flags: “-arch ev6 -fast -O4”
Pentium 4, 2.0GHz, 512MB DDR-RAM
  8KB, 8-way, 64Byte block L1 cache; 256KB, 8-way, 64B block L2.
  Intel C Compiler v7 for Linux. Flags: “-xW -ipo -O3 -static +FDO”
Alpha (L1:64KB/2-w/64Byte, L2:4MB/DM)
Athlon (L1:64KB/2-w/64Byte, L2:256KB/8-w/64B)
Pentium III (L1:16KB/4-w/32Byte, L2:256KB/8-w/32B)
Pentium 4 (L1:8KB/8-w/64Byte, L2:256KB/8-w/64B)
Sparc (L1:64KB/4-w/32Byte, L2:8MB Direct-mapped/64Byte)
Summary
• The Basic Morton Scheme often performs disappointingly.
• Page-aligning the base address theoretically maximises spatial locality.
• Unrolling is facilitated by carefully aligning the start iteration of unrolled loops to power-of-two indices into the array.
• With base-address alignment and unrolling for strength-reduction of index calculation, Morton layout is beginning to actually work.
Future Work
• Larger Factors of Unrolling
  – Until now only factor 4, hand-unrolled
  – We have used code generation to unroll by larger factors, and it seems that there are more improvements to be had.
• Prefetching
  – It’s likely that hardware prefetching will fetch the wrong things
  – Turn off hardware prefetching, use the right, compiler-directed prefetching instead.
• Tiling
  – Storage layout transformations and iteration space transformations are complementary
  – But we should do both.