Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling

Reducing the Price of Naivety

Jeyarajan Thiyagalingam, Olav Beckmann and Paul H.J. Kelly

Software Performance Optimisation Group, Imperial College, London

Motivation

• Consider two code variants of a matrix multiply

IJK Variant

    for( i=0; i<N; i++ )
      for( j=0; j<N; j++ )
        for( k=0; k<N; k++ )
          C[i,j] += A[i,k] * B[k,j]

IKJ Variant

    for( i=0; i<N; i++ )
      for( k=0; k<N; k++ )
        for( j=0; j<N; j++ )
          C[i,j] += A[i,k] * B[k,j]

• Both code variants are valid and apparently have the same complexity.
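A compilable C rendering of the two variants may help when reproducing the comparison. This is only a sketch: it assumes square n x n matrices stored row-major in flat arrays of double, and the function names are illustrative rather than taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, loops ordered i, j, k.
   The inner loop walks B down a column (stride n). */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Same computation, loops ordered i, k, j.
   The inner loop walks B and C along a row (stride 1). */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Both functions add the products for a given C[i,j] in the same order of k, so they return identical results; only the memory access pattern of the inner loop differs.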

The price of naivety

• Depending on problem size and architecture, the IKJ variant can be up to 10 times faster than IJK.

Performance Programming Model

• Naively-written code can suffer a factor of 10 performance hit.
• Sometimes the compiler can help; none of the compilers we used interchanged these loops.
• A robust performance programming model would have to account for the capabilities of the compiler.
• Offering a clear Performance Programming Model should be part of Compiler Research.

Compromise – blocked layout

Reason for differences in performance (row-major layout, 4-word cache blocks):

• Row-major traversal uses all 4 words of each cache block.
• Column-major traversal uses only 1 word of each cache block.
• Bandwidth is wasted with column-major (CM) traversal.

Blocked layout: a 4-word cache block contains a 2x2 subarray.

• Row-major traversal uses 2 words per block.
• Column-major traversal uses 2 words per block.

Row-major layout of an 8x8 array (offsets):

     0  1  2  3  4  5  6  7
     8  9 10 11 12 13 14 15
    16 17 18 19 20 21 22 23
    24 25 26 27 28 29 30 31
    32 33 34 35 36 37 38 39
    40 41 42 43 44 45 46 47
    48 49 50 51 52 53 54 55
    56 57 58 59 60 61 62 63

Blocked layout of the same 8x8 array with 2x2 blocks (offsets):

     0  1  4  5  8  9 12 13
     2  3  6  7 10 11 14 15
    16 17 20 21 24 25 28 29
    18 19 22 23 26 27 30 31
    32 33 36 37 40 41 44 45
    34 35 38 39 42 43 46 47
    48 49 52 53 56 57 60 61
    50 51 54 55 58 59 62 63
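The "2 words per block" claim for the blocked layout can be checked with a small calculation. The sketch below assumes 2x2 blocks stored in row-major block order, with elements row-major inside each block and the array base aligned to a 4-word cache block; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) in an n x n array stored as 2x2 blocks:
   blocks in row-major order, elements row-major inside each block.
   n must be even. */
size_t blocked2x2_offset(size_t n, size_t i, size_t j)
{
    assert(n % 2 == 0 && i < n && j < n);
    size_t block = (i / 2) * (n / 2) + (j / 2);   /* which 2x2 block */
    return block * 4 + (i % 2) * 2 + (j % 2);     /* position inside it */
}

/* Number of times the 4-word cache block changes while walking one
   row (by_row != 0) or one column (by_row == 0). Offsets increase
   along a row or column, so this equals the number of distinct
   cache blocks touched. */
size_t blocks_touched(size_t n, size_t fixed, int by_row)
{
    size_t count = 0, last = (size_t)-1;
    for (size_t t = 0; t < n; t++) {
        size_t off = by_row ? blocked2x2_offset(n, fixed, t)
                            : blocked2x2_offset(n, t, fixed);
        size_t line = off / 4;
        if (line != last) { count++; last = line; }
    }
    return count;
}
```

For an 8x8 array, walking any row or any column touches 4 cache blocks for 8 elements, which is the 2 words per block stated above.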

Recursively-blocked layout

• Real machines have deep memory hierarchies.
• Therefore, we need to apply blocking recursively.
• Layout of the blocks: Z-Morton (one of a number of space-filling curves)

Offset   0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
(i,j)    (0,0)  (0,1)  (1,0)  (1,1)  (0,2)  (0,3)  (1,2)  (1,3)  (2,0)  (2,1)  (3,0)  (3,1)  (2,2)  (2,3)  (3,2)  (3,3)

Offset   16     17     18     19     20     21     22     23     24     25     26     27     28     29     30     31
(i,j)    (0,4)  (0,5)  (1,4)  (1,5)  (0,6)  (0,7)  (1,6)  (1,7)  (2,4)  (2,5)  (3,4)  (3,5)  (2,6)  (2,7)  (3,6)  (3,7)
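The offset table corresponds to interleaving the bits of i and j, with j in the even bit positions and i in the odd ones. The following is a minimal sketch of that mapping for indices below 65536; the function names are illustrative, and the bit-spreading sequence is a widely used idiom rather than anything specific to the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit b moves to bit 2*b,
   leaving zeros in the odd positions. */
uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-Morton offset as in the table: j in the even bits, i in the odd bits. */
uint32_t morton_offset(uint32_t i, uint32_t j)
{
    assert(i < 65536u && j < 65536u);
    return (spread_bits(i) << 1) | spread_bits(j);
}
```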

Morton Layout – A Compromise

Morton storage layout is unbiased towards either row- or column-major traversal.

Row-major Traversal

    Block size   RM array   Morton array
    32B          75%        50%
    128B         93.7%      75%
    8kB page     99.9%      96.87%

Column-major Traversal

    Block size   RM array   Morton array
    32B          0%         50%
    128B         0%         75%
    8kB page     0%         96.87%
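The percentages in these tables are consistent with a simple reading that the slides do not spell out, so treat it as an assumption: elements are 8-byte doubles, each block is fetched once per pass over it, and the figure is the fraction of accesses that find their block already fetched. Under that reading a row-major array traversed in row-major order reuses all but one word of a block, and a Morton array reuses all but one word per side of the square tile held in the block. The helper names below are hypothetical.

```c
#include <assert.h>

/* Number of 8-byte doubles held by a block of the given size. */
unsigned words_per_block(unsigned block_bytes)
{
    return block_bytes / 8;
}

/* Row-major array, row-major traversal: one fetch per w accesses. */
double hit_rate_rm(unsigned block_bytes)
{
    unsigned w = words_per_block(block_bytes);
    assert(w >= 1);
    return 1.0 - 1.0 / w;
}

/* Morton array, either traversal order. Assumes the block holds a
   square tile, i.e. w is an even power of two (side * side == w). */
double hit_rate_morton(unsigned block_bytes)
{
    unsigned w = words_per_block(block_bytes);
    unsigned side = 1;
    while (side * side < w)
        side++;
    assert(side * side == w);
    return 1.0 - 1.0 / side;
}
```

For example, a 128B block holds 16 doubles, giving 1 - 1/16 for the row-major array and 1 - 1/4 for the Morton array.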

So have we solved the problem?

• Unfortunately, the basic Morton Scheme often performs disappointingly.

• At least Morton does not seem to suffer from pathological drops in performance.

Alignment

The statement that Morton layout is unbiased turns out to rest on the assumption that a cache line maps to the start of a Morton block.

Offset   0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
(i,j)    (0,0)  (0,1)  (1,0)  (1,1)  (0,2)  (0,3)  (1,2)  (1,3)  (2,0)  (2,1)  (3,0)  (3,1)  (2,2)  (2,3)  (3,2)  (3,3)

Alignment

• It turns out that Morton layout is only unbiased for even power-of-two cache line sizes, i.e. lines that hold 2^(2m) array elements and therefore a square Morton block.
• The same problems happen when mis-aligning the base address.

Alignment

• We calculated miss-rates systematically for all levels of the memory hierarchy.
• In each case, we calculated the miss-rates for all possible alignments of the base address.
• The difference in miss-rates between the best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines and a factor of 2 for odd power-of-two cache lines.

Alignment

The overall miss-rates drop exponentially with block size, but access times are generally assumed to increase geometrically with block size.

[Figure: Morton order miss rates for row-major and column-major traversal. x-axis: block size in double words (4 to 1024); y-axis: miss rate (0 to 1.2); series: RM maximum, RM minimum, CM maximum, CM minimum.]

Alignment

• With canonical layouts, it is often necessary to pad the row or column length in order to avoid pathological behaviour. Finding the right amount of padding is not trivial.
• Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. the page size.
• Aliasing in the memory hierarchy can spoil the theory. For example, on Pentium 4, the following aliasing patterns cause problems:

    2K – map to same L1 cache line
    16K – aliases in store-forwarding logic
    32K – map to the same L2 cache line
    64K – indistinguishable in L1 cache
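Aligning the base address of an array to the page size, as suggested above, can be done with the C11 function aligned_alloc. The sketch below is one possible way to do it; the helper name is hypothetical, and the caller is assumed to know the page size of the target machine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n*n doubles whose base address is a multiple of `align`
   bytes. `align` must be a power of two (for example the page size).
   Returns NULL on failure; release the memory with free(). */
double *alloc_aligned_matrix(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t bytes = n * n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the
       alignment, so round the request up. */
    size_t rounded = (bytes + align - 1) / align * align;
    if (rounded == 0)
        rounded = align;
    return aligned_alloc(align, rounded);
}
```

A call such as alloc_aligned_matrix(1024, 8192) would request an 8kB-aligned 1024 x 1024 array. Whether page alignment is actually the best choice on a given machine depends on the aliasing effects listed above.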

Address calculation

With lexicographic (aka canonical) layout, it is easy to calculate the offset S of A[i,j] in an N×M array A:

    Srm(i,j) = N*i + j
    Scm(i,j) = i + M*j

(if N and M are powers of two, this is bit-wise concatenation of i and j)

• In loops, the multiplication is replaced by an increment.
• When unrolling loops, the address calculation can be strength-reduced.
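The replacement of the multiplication by an increment can be illustrated on a loop that walks down one column of a row-major array. This is a sketch with hypothetical function names; `cols` is the row length, i.e. the factor that multiplies i in the row-major offset.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of column j, recomputing the offset cols*i + j with a
   multiplication in every iteration. */
double column_sum_multiply(const double *a, size_t rows, size_t cols, size_t j)
{
    assert(a && j < cols);
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        s += a[cols * i + j];
    return s;
}

/* Strength-reduced version: the offset is carried between iterations
   and advanced by cols, so the loop body contains no multiplication. */
double column_sum_increment(const double *a, size_t rows, size_t cols, size_t j)
{
    assert(a && j < cols);
    double s = 0.0;
    size_t off = j;
    for (size_t i = 0; i < rows; i++, off += cols)
        s += a[off];
    return s;
}
```

Optimising compilers usually perform this transformation themselves for canonical layouts, which is part of why Morton address calculation is comparatively expensive.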

How can we calculate the Morton offset?

Address calculation

Morton indices can be calculated by applying the bit-concatenation idea of RM/CM for power-of-two arrays recursively.

For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.

    Let D0(i) = i_n 0 … i_1 0 i_0 0
    Let D1(i) = 0 i_n … 0 i_1 0 i_0

    Then Smz(i,j) = D0(i) | D1(j)

Dilation is rather expensive for the inner loop. Strength reduction (Wise et al):

    D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
    D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0

For these formulas to hold, Ones0 = 0101…01 must be the mask with ones in the even bit positions and Ones1 = 1010…10 the mask with ones in the odd bit positions.
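The dilation functions and the increment formulas can be written down directly in C. The sketch below assumes 32-bit offsets and indices below 65536; the slides do not define Ones0 and Ones1, so the mask values used here are inferred from the requirement that the formulas produce the next dilated value, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define ONES0 0x55555555u  /* assumed: ones in the even bit positions */
#define ONES1 0xAAAAAAAAu  /* assumed: ones in the odd bit positions  */

/* D1: bits of i placed at the even positions, one bit at a time. */
uint32_t dilate1(uint32_t i)
{
    assert(i < 65536u);
    uint32_t d = 0;
    for (unsigned b = 0; b < 16; b++)
        d |= ((i >> b) & 1u) << (2 * b);
    return d;
}

/* D0: bits of i placed at the odd positions. */
uint32_t dilate0(uint32_t i)
{
    return dilate1(i) << 1;
}

/* D0(i+1) from D0(i): filling the even positions with ones lets the
   carry of the +1 ripple through them into the odd positions. */
uint32_t next_d0(uint32_t d0)
{
    return ((d0 | ONES0) + 1u) & ONES1;
}

/* D1(i+1) from D1(i): the same idea with the masks exchanged. */
uint32_t next_d1(uint32_t d1)
{
    return ((d1 | ONES1) + 1u) & ONES0;
}
```

In a loop over i, calling next_d0 once per iteration replaces a full dilation with three cheap integer operations.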

Address calculation

Idea: use lookup tables for D0(i) and D1(j)

A[MortonTabEven[i] + MortonTabOdd[j]]

When can we do strength reduction? In general Smz(i,j+1) could be anywhere

D0(i + 1) = ???

D0(i + k) = D0(i) + D0(k) if i’s and k’s bits do not overlap.

We can do strength reduction

D0(i + k) = D0(i) + D0(k) as long as i = 2^n and k < 2^n (more generally, whenever i is a multiple of 2^n)

With this, we can do loop unrolling
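The lookup-table idea and the additivity condition can be sketched as follows. The table names tab_d0 and tab_d1 are stand-ins for the slide's tables, whose contents are not shown; here tab_d0 holds D0 (index bits in the odd positions) and tab_d1 holds D1 (index bits in the even positions), and the table size of 1024 entries is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>

#define TAB_SIZE 1024u

static uint32_t tab_d0[TAB_SIZE];  /* D0(i): bits of i at odd positions  */
static uint32_t tab_d1[TAB_SIZE];  /* D1(i): bits of i at even positions */

/* Fill both tables once, before any lookups. */
void init_morton_tables(void)
{
    for (uint32_t i = 0; i < TAB_SIZE; i++) {
        uint32_t d = 0;
        for (unsigned b = 0; b < 10; b++)
            d |= ((i >> b) & 1u) << (2 * b);
        tab_d1[i] = d;
        tab_d0[i] = d << 1;
    }
}

/* Morton offset of (i, j) by two table lookups and one addition.
   The two dilated values share no bits, so + and | give the same result. */
uint32_t morton_lookup(uint32_t i, uint32_t j)
{
    assert(i < TAB_SIZE && j < TAB_SIZE);
    return tab_d0[i] + tab_d1[j];
}

/* Returns 1 if i and k have no set bit in common, which is the
   condition under which D0(i + k) == D0(i) + D0(k). */
int bits_disjoint(uint32_t i, uint32_t k)
{
    return (i & k) == 0;
}
```

When i is a multiple of 4 and k is 0, 1, 2 or 3, the bits never overlap, so D0(i + k) is D0(i) plus one of the constants 0, 2, 8, 10, and D1(i + k) is D1(i) plus one of 0, 1, 4, 5. These are the constants that a loop unrolled by four can use in place of further table lookups.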

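Unrolled Code with Strength Reduction

    /* FLOATTYPE, MortonTabEven and MortonTabOdd are defined elsewhere.
       The function returns nothing, so it is declared void here.
       sz must be a multiple of 4. */
    void mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
    {
      unsigned i, j, k;
      for (i = 0; i < sz; i++) {
        unsigned int t1i = MortonTabOdd[i];
        for (j = 0; j < sz; j++) {
          unsigned int t0j = MortonTabEven[j];
          for (k = 0; k < sz; k += 4) {
            /* k is a multiple of 4, so the offsets for k+1, k+2 and k+3
               are the offsets for k plus a constant. */
            unsigned int t0k = MortonTabEven[k];
            unsigned int t1k = MortonTabOdd[k];
            C[t1i+t0j] += A[t1i+t0k]      * B[t1k+t0j];
            C[t1i+t0j] += A[t1i+t0k + 2]  * B[t1k+t0j + 1];
            C[t1i+t0j] += A[t1i+t0k + 8]  * B[t1k+t0j + 4];
            C[t1i+t0j] += A[t1i+t0k + 10] * B[t1k+t0j + 5];
          }
        }
      }
    }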
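A self-contained variant of the unrolled kernel can be checked against a version that performs a table lookup for every element. The sketch below makes several assumptions that are not stated on the slide: the constants 2, 8, 10 and 1, 4, 5 are taken to mean that the table used for column indices holds the index bits in the odd positions and the table used for row indices holds them in the even positions; the matrix size is fixed at 8; and the names row_tab, col_tab and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SZ 8u   /* matrix dimension: a power of two and a multiple of 4 */

static unsigned row_tab[SZ];   /* assumed: index bits in even positions */
static unsigned col_tab[SZ];   /* assumed: index bits in odd positions  */

/* Fill both dilation tables once, before calling the kernels. */
void init_tabs(void)
{
    for (unsigned x = 0; x < SZ; x++) {
        unsigned d = 0;
        for (unsigned b = 0; b < 8; b++)
            d |= ((x >> b) & 1u) << (2 * b);
        row_tab[x] = d;
        col_tab[x] = d << 1;
    }
}

/* Reference kernel: table lookups for every element, no unrolling. */
void mm_morton_plain(const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++)
            for (unsigned k = 0; k < SZ; k++)
                C[row_tab[i] + col_tab[j]] +=
                    A[row_tab[i] + col_tab[k]] * B[row_tab[k] + col_tab[j]];
}

/* Unrolled by 4: k is always a multiple of 4, so the offsets for
   k+1, k+2 and k+3 are the offset for k plus a constant. */
void mm_morton_unrolled(const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (unsigned i = 0; i < SZ; i++) {
        unsigned ri = row_tab[i];
        for (unsigned j = 0; j < SZ; j++) {
            unsigned cj = col_tab[j];
            for (unsigned k = 0; k < SZ; k += 4) {
                unsigned ck = col_tab[k];
                unsigned rk = row_tab[k];
                C[ri + cj] += A[ri + ck]      * B[rk + cj];
                C[ri + cj] += A[ri + ck + 2]  * B[rk + cj + 1];
                C[ri + cj] += A[ri + ck + 8]  * B[rk + cj + 4];
                C[ri + cj] += A[ri + ck + 10] * B[rk + cj + 5];
            }
        }
    }
}
```

Both kernels add the products for each element of C in the same order, so on identical inputs they should agree exactly, which makes the plain version a convenient correctness check when experimenting with larger unrolling factors.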

So have we solved the problem?

• Unrolling significantly reduces the overhead of the Basic Morton Scheme.

• IKJ is still faster than IJK – might be due to having two table lookups in the inner loop.

Benchmarks

• Suite of simple numerical kernels operating on 2D arrays of doubles
• Used the compilers and flags which the vendors used for their SPEC-CFP2000 results

    MM-ijk     Matrix Multiply, ijk loop nest
    MM-ikj     Matrix Multiply, ikj loop nest
    Cholk      Cholesky, K variant
    Jacobi2d   2-dimensional, 4-point stencil smoother
    ADI        Alternating Direction Implicit kernel, ij, ij order

Experimental Setup

• We used identical clusters of (student) lab machines during off-peak periods

• Extensive scripting to automate data collection

• Dixon Test to remove outliers from the measurements

• Use median instead of mean.

• Overall more than 26M measurements
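The two statistical steps can be sketched in C. The slides do not say which variant of the Dixon test or which significance level was used, so the sketch only computes the basic Q statistic for the largest sample and leaves the comparison against a critical value to the caller; the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples. Sorts the array in place. */
double median(double *v, size_t n)
{
    assert(v && n > 0);
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Dixon's Q statistic for the largest value of an already sorted
   sample: the gap to its nearest neighbour divided by the range.
   The value is treated as an outlier if Q exceeds the critical value
   for the chosen sample size and confidence level. */
double dixon_q_high(const double *sorted, size_t n)
{
    assert(sorted && n >= 3 && sorted[n - 1] > sorted[0]);
    return (sorted[n - 1] - sorted[n - 2]) / (sorted[n - 1] - sorted[0]);
}
```

The median is robust against a few slow runs caused by other activity on shared lab machines, which is a plausible reason for preferring it to the mean here.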

Architectures

• AMD Thunderbird 1.8GHz, 512MB DDR-RAM
  64KB, 2-way, 64Byte block L1 cache; 256KB, 8-way, 64B block L2.
  Intel C Compiler v7.1 for Linux. Flags: "-xK -ipo -static +FDO"

• Pentium III Coppermine 450MHz, 256MB SDRAM
  16KB, 4-way, 32Byte block L1 cache; 256KB, 8-way, 32B block L2.
  Intel C Compiler v7.1 for Linux. Flags: "-xK -ipo -O3 -static +FDO"

• Sun SunFire 6800, UltraSparc III 750MHz
  64KB, 4-way, 32Byte block L1 cache; 8MB direct-mapped L2 cache.
  Sun Workshop Compiler V6. Flags: "-fast -xcrossfile -xalias_level=std +FDO"

• Alpha Compaq AlphaServer ES40, 21264 (EV6) 500MHz
  64KB, 2-way, 64Byte block L1 cache; 4MB direct-mapped L2 cache.
  Compaq C Compiler V6. Flags: "-arch ev6 -fast -O4"

• Pentium 4, 2.0GHz, 512MB DDR-RAM
  8KB, 8-way, 64Byte block L1 cache; 256KB, 8-way, 64B block L2.
  Intel C Compiler v7 for Linux. Flags: "-xW -ipo -O3 -static +FDO"

Alpha (L1:64KB/2-w/64Byte, L2:4MB/DM)

Athlon (L1:64KB/2-w/64Byte, L2:256KB/8-w/64B)

Pentium III (L1:16KB/4-w/32Byte, L2:256KB/8-w/32B)

Pentium 4 (L1:8KB/8-w/64Byte, L2:256KB/8-w/64B)

Sparc (L1:64KB/4-w/32Byte, L2:8MB Direct-mapped/64Byte)

Summary

• The Basic Morton Scheme often performs disappointingly.
• Page-aligning the base address theoretically maximises spatial locality.
• Unrolling is facilitated by carefully aligning the start iteration of unrolled loops to power-of-two indices into the array.
• With base-address alignment and unrolling for strength-reduction of index calculation, Morton layout is beginning to actually work.

Future Work

• Larger Factors of Unrolling
  – Until now only factor 4, hand-unrolled
  – We have used code generation to unroll by larger factors, and it seems that there are more improvements to be had.
• Prefetching
  – It is likely that hardware prefetching will fetch the wrong things
  – Turn off hardware prefetching and use the right, compiler-directed prefetching instead.
• Tiling
  – Storage layout transformations and iteration space transformations are complementary
  – But we should do both.