Fast Matrix Multiplication Over GF3
Matthew Lambert
Advised by Dr. B. David Saunders & Dr. Stephen Siegel
February 13, 2015
Matrix Multiplication
Outline

- Fast Matrix Multiplication
  - Strassen's Algorithm: well-studied O(n^2.807) divide-and-conquer algorithm
  - Method of the Four Russians (Arlazarov et al., 1970): O(n^3/log(n)) algorithm that is effective over small finite fields (GF2, GF3, GF5). Very well studied for GF2 and anecdotally good for GF3.
- Aim: find thresholds for each algorithm over GF3.
Arithmetic over GF3

- Simply arithmetic mod 3.
- Addition:

    + | 0 1 2
    --+------
    0 | 0 1 2
    1 | 1 2 0
    2 | 2 0 1

- Multiplication:

    × | 0 1 2
    --+------
    0 | 0 0 0
    1 | 0 1 2
    2 | 0 2 1
Representation of GF3

- With only three possible values we ideally need two bits for each element.
- Packed storage: elements stored consecutively, e.g. the vector (0, 2, 1, 1) as 00 10 01 01.
- To account for a possible carry when adding, we actually need 3 bits for each element: 5 bit ops to add two 21-element words.
- Bitsliced storage: low bits and high bits stored separately (Boothby & Bradshaw, 2009). Store 2 as 11_2 instead of 10_2. The same vector becomes

    low:  0 1 1 1
    high: 0 1 0 0

- 6 bit ops to add or subtract two 64-element SlicedUnits; 1 bit op needed to negate.
Matrix Multiplication over GF2

    [0 1 1 0]   [1 0 0 0]   [0 1 0 1]
    [1 0 0 1] × [0 1 0 0] = [1 0 1 0]
    [1 1 0 1]   [0 0 0 1]   [1 1 1 0]
    [0 0 1 1]   [0 0 1 0]   [0 0 1 1]

- Row i of A determines row i of C.
- C_i = sum_{j=0}^{n-1} A_{i,j} × B_j, where B_j is row j of B.
- Over GF2, addition is an xor operation.
- We perform up to n^2 row additions, yielding O(n^3) running time.
- We get a speedup if we do not have to do O(n^2) row additions.
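As a concrete sketch of the row-based classical product (rows stored as integer bitmasks, with bit j of A[i] holding A_{i,j}; the function name is mine, not from the talk):

```python
def classical_gf2(A, B, n):
    """Classical GF(2) product: row i of C is the xor of the rows
    B_j for which A_{i,j} = 1. Rows are integer bitmasks."""
    C = [0] * n
    for i in range(n):
        for j in range(n):
            if (A[i] >> j) & 1:
                C[i] ^= B[j]  # row addition over GF(2) is xor
    return C
```

Note that in a binary literal like `0b0110` the bits read right to left relative to the slide's left-to-right row order.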
Four Russians Multiplication over GF2

    [0 1 1 0]   [1 0 0 0]
    [1 0 0 1] × [0 1 0 0]
    [1 1 0 1]   ---------
    [0 0 1 1]   [0 0 0 1]
                [0 0 1 0]

- Instead of indexing one bit at a time, we index with t = 2 bits at a time, yielding n^2/t additions and thus O(n^3/t) running time.
- With multiple bits as index, we are adding multiple rows at once.
- We need to quickly compute the linear combinations of the t = 2 rows we are adding.
Four Russians Multiplication over GF2

    [0 1 1 0]   [1 0 0 0]
    [1 0 0 1] × [0 1 0 0]
    [1 1 0 1]   ---------
    [0 0 1 1]   [0 0 0 1]
                [0 0 1 0]

- Table for the first block of rows:

    index  description  contents
    00     0            0 0 0 0
    10     r1           1 0 0 0
    01     r2           0 1 0 0
    11     r1 + r2      1 1 0 0

- Partial product from the first two columns of A:

    [0 1 0 0]
    [1 0 0 0]
    [1 1 0 0]
    [0 0 0 0]
Four Russians Multiplication over GF2

    [0 1 1 0]   [1 0 0 0]
    [1 0 0 1] × [0 1 0 0]
    [1 1 0 1]   ---------
    [0 0 1 1]   [0 0 0 1]
                [0 0 1 0]

- Table for the second block of rows:

    index  description  contents
    00     0            0 0 0 0
    10     r3           0 0 0 1
    01     r4           0 0 1 0
    11     r3 + r4      0 0 1 1

- Adding the two partial products gives the result:

    [0 1 0 0]   [0 0 0 1]   [0 1 0 1]
    [1 0 0 0] + [0 0 1 0] = [1 0 1 0]
    [1 1 0 0]   [0 0 1 0]   [1 1 1 0]
    [0 0 0 0]   [0 0 1 1]   [0 0 1 1]
Four Russians Multiplication over GF2: Fast table creation

- We perform n^2/t row additions using n/t tables.
- The additions are performed in O(n^3/t) time.
- The tables can be constructed in O(2^t n^2/t) time: i.e., one vector addition for each of the 2^t rows in each of the n/t tables.
- Each table is built incrementally, one vector addition per new entry:

    index  description
    0      0
    1      r1

    index  description
    00     0
    10     r1
    01     r2
    11     r1 + r2

    index  description
    000    0
    100    r1
    010    r2
    110    r1 + r2
    001    r3
    101    r1 + r3
    011    r2 + r3
    111    r1 + r2 + r3
Four Russians Multiplication over GF2: Algorithm

    Data: A, B, C: n × n matrices
    Result: C ← A × B
    for i ← 0 to n/t do
        T ← table of the 2^t combinations of rows [i*t, (i+1)*t) of B
        for j ← 0 to n do
            a ← bits [i*t, (i+1)*t) of row j of A
            C_j ← C_j + T[a]
        end
    end

- Running time: n^2/t vector adds, totaling O(n^3/t) time; n/t tables created, totaling O(2^t n^2/t) time.
- Let t = log2(n); then the additions take O(n^3/log2(n)) time and the tables take O(n^3/log2(n)) time to create.
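The pseudocode above can be sketched in Python (rows as integer bitmasks, bit j of a row = column j; names are illustrative and t is assumed to divide n). Each table entry costs one xor, as in the fast-table-creation slide:

```python
def four_russians_gf2(A, B, n, t=2):
    """Method of the Four Russians over GF(2). Rows are integer
    bitmasks (bit j of A[i] is A_{i,j}); assumes t divides n."""
    C = [0] * n
    for i in range(0, n, t):
        # Table of all 2^t combinations of rows [i, i+t) of B;
        # entry k with bit b set includes row i+b. Each entry is a
        # previous entry (lowest set bit stripped) plus one row of B.
        T = [0] * (1 << t)
        for k in range(1, 1 << t):
            low = k & -k
            T[k] = T[k ^ low] ^ B[i + low.bit_length() - 1]
        mask = (1 << t) - 1
        for j in range(n):
            a = (A[j] >> i) & mask  # t index bits of row j of A
            C[j] ^= T[a]
    return C
```

With the slide's 4 × 4 example this reproduces the product computed earlier, using one table lookup per row per block instead of one per bit.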
Four Russians Multiplication over GF3: Considerations

- In GF2, we had 2^t row combinations to compute. In GF3, we have to compute 3^t combinations of rows.
- In GF2, extracting bits for indices was easy; with the bitsliced representation, this is problematic for GF3.
- How do we use

    low bits:  1 1 0
    high bits: 1 0 0

  as an index that corresponds to r2 - r3, which is nominally 21 (0 + 3 + 18 = 10101_2)?
- If t = 3, there are only 27 combinations of rows, but both the low bits and the high bits range between 000 and 111.
- Too expensive to map an index to the range 0 to 3^t directly, so we concatenate the high and low bits to give an index between 0 and 4^t.
- How do our tables look with these indices?
Four Russians Multiplication over GF3: Tables

- Three approaches were considered to address the creation of row combinations and the indexing problem:
- Use one table of 2^t rows and add with it twice.
- Use one table of 4^t rows and add with it once.
- Use one table of 3^t rows plus one of 4^t indices, and add with it once.
Four Russians Multiplication over GF3: 2^t approach

- First approach: create a table of 2^t combinations of rows as in the GF2 case. Index once into the table with the low t bits and once with the high t bits. The 2 = 11_2 representation used in the bitsliced storage is advantageous: an element 2 sets both its low and high bit, so the two lookups add the corresponding row twice, giving 2r.
- Table:

    index  contents
    00     0
    10     r1
    01     r2
    11     r1 + r2
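A minimal sketch of the 2^t double-lookup idea, using plain lists of ints mod 3 rather than the talk's bit-packed SlicedUnits (names illustrative; t is assumed to divide n). An element 2 sets both its low and high index bit, so its row is added by both lookups, i.e. with coefficient 2:

```python
def four_russians_gf3_2t(A, B, n, t=2):
    """GF(3) Four Russians, 2^t-table variant: one table of 0/1 row
    combinations, indexed once with the low bits and once with the
    high bits of each length-t block of a row of A."""
    C = [[0] * n for _ in range(n)]
    for i in range(0, n, t):
        # Subset sums of rows [i, i+t) of B; entry k with bit b set
        # includes row i+b.
        T = [[0] * n]
        for b in range(t):
            T += [[(x + y) % 3 for x, y in zip(row, B[i + b])]
                  for row in T]
        for j in range(n):
            block = A[j][i:i + t]
            low = sum((a >= 1) << b for b, a in enumerate(block))
            high = sum((a == 2) << b for b, a in enumerate(block))
            for col in range(n):
                C[j][col] = (C[j][col] + T[low][col] + T[high][col]) % 3
    return C
```

An entry a = 1 has low = 1, high = 0 (one lookup hits the row); a = 2 has low = high = 1 (both lookups hit it), which is exactly why the 11_2 encoding of 2 pays off here.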
Four Russians Multiplication over GF3: 4^t approach

- Second approach: create a table of 4^t rows containing all 3^t combinations of rows and some unused rows. We index into the rows directly.
- Table (blank entries are unused indices):

    index  contents       index  contents
    0000   0              0001
    1000   r1             1001
    0100   r2             0101   -r2
    1100   r1 + r2        1101   r1 - r2
    0010                  0011
    1010   -r1            1011
    0110                  0111
    1110   -r1 + r2       1111   -r1 - r2
Four Russians Multiplication over GF3: 3^t approach

- Third approach: create a table of 3^t rows and a table of 4^t indices or pointers to map an index to a combination.
- Tables (blank destinations are unused indices):

    index  contents       index  destination
    0000   0              0000   0000
    1000   r1             1000   1000
    0100   -r1            0100   1100
    1100   r2             1100   0010
    0010   r1 + r2        0010
    1010   -r1 + r2       1010   0100
    0110   -r2            0110
    1110   -r1 - r2       1110   1010
    0001   r1 - r2        0001
                          1001
                          0101   0110
                          1101   0001
                          0011
                          1011
                          0111
                          1111   1110
Four Russians Multiplication over GF3: Table Summary

    method  memory cost               element access  adds  ops per table
    2^t     2^t rows                  direct          2     6 × 2^t n
    3^t     3^t rows and 4^t indices  indirect        1     3.5 × 3^t n
    4^t     4^t rows                  direct          1     3.5 × 4^t n

- Because the 3^t and 4^t approaches contain ± each combination of rows, we only need to use the six-operation addition to construct half of the entries. We can then use the one-operation negation to construct the other half, for an average of 3.5 ops per entry.
- Only one supplemental index table for the 3^t approach needs to be created for each value of t used.
Four Russians Multiplication over GF3: Results

- Initial development showed the 4^t method to be considerably slower, so it was abandoned in favor of optimizing the other two approaches.
Four Russians Multiplication over GF3: Results

    n     3^t time (s)  2^t time (s)  classical time (s)
    64    0.000021      0.000012      0.000083
    128   0.000073      0.00006       0.000331
    256   0.000326      0.000317      0.00116
    512   0.00195       0.00203       0.00599
    1024  0.0133        0.0142        0.0637
    2048  0.0941        0.100         0.238
    4096  1.674         1.069         2.577
Four Russians Multiplication over GF3: Conclusions

- Reasonable speedup over classical: 2.5x (and improving).
- The 3^t approach is not as successful as it theoretically should be.
  - Larger tables do not fit into L1 cache as well as 2^t-sized tables?
  - More complicated access increases cost?
- Some precedent for multiple additions: M4RI (Albrecht, 2010) increases the number of additions if it means smaller tables, so multiple tables can fit into L1 cache.
- Preliminary testing with multiple tables sees the 3^t approach compute a 4096 × 4096 product in 0.868 seconds, and the 2^t approach in 1.01 seconds.
Other Four Russians Thoughts

- We assume we can operate on full 64-bit words. Performance decreases if we must not modify unused bits.
- If t = log(n), then the table of row combinations is the same size as the matrix, so there are definite practical limitations.
Strassen's Algorithm vs Classical Divide-and-Conquer

Classical divide-and-conquer multiplication vs Strassen multiplication, both on top of the 3^t approach of the Method of the Four Russians. Strassen was faster at all tested dimensions.
Strassen's Algorithm: 2^t vs 3^t as base case

Three levels of recursion were performed. Even with base cases where the 3^t approach should outperform the 2^t approach, the 2^t base case yielded faster results in all cases (though the multiple-tables method outperforms 2^t by about 14%).
Strassen's Algorithm: Conclusions and Thresholds

- Strassen's algorithm is faster than classical and, due to memory requirements, is at some point faster than the Method of the Four Russians.
- When using a base case of the 2^t approach, the threshold to switch from Strassen's algorithm to Four Russians on the machine used is somewhere between 960 and 1728.
- Normally Strassen's algorithm works best on matrices with dimensions a power of 2, or a power of 2 times the base-case dimension. With bitsliced or packed storage it is important to try to maintain matrices with dimensions that are multiples of 64 at all levels of recursion, as working with less than one machine word is more expensive.
- Further improvements are ongoing.
Summary

- Successfully developed and implemented the Method of the Four Russians over GF3, yielding a noticeable performance gain over classical multiplication.
- Implemented Strassen's algorithm on top of the Method of the Four Russians.
- Different implementations of the Method of the Four Russians may be preferred depending on the specific problem and machine.
Best value of t

For GF2 we can assume all operations are essentially equal in time: the only differing cost is extracting bits. All other costs are 64-bit xors. For (m × n) * (n × k) = (m × k), we spend mnk/t time adding and nk 2^t/t time creating tables.

Minimize mnk/t + nk 2^t/t.

Partial derivative w.r.t. t: kn(2^t (t log(2) − 1) − m)/t^2.

Set equal to zero and solve for t = (W(m/e) + 1)/log(2).

    dimension  64    128   256   512   1024  2048  4096
    t          4.79  5.51  6.26  7.04  7.85  8.67  9.52
    log2(dim)  6     7     8     9     10    11    12
Selecting t in GF3

For the 3^t approach: we spend mnk/t time adding and nk 3^t/t time creating tables; t = (W(m/e) + 1)/log(3).

For the 2^t approach: we spend 2mnk/t time adding and nk 2^t/t time creating tables; t = (W(2m/e) + 1)/log(2).

    dimension       64    128   256   512   1024  2048  4096
    3^t min t       3.02  3.47  3.94  4.44  4.95  5.47  6.00
    experimental t  3     4     4     4     4     5     6
    log3(dim)       3.78  4.42  5.05  5.68  6.31  6.94  7.57

    dimension       64    128   256   512   1024  2048  4096
    2^t min t       4.41  6.26  7.04  7.85  8.67  9.51  10.37
    experimental t  6     6     7     7     8     9     9
    log2(dim)       6     7     8     9     10    11    12
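These optima are easy to evaluate numerically; a sketch using a simple Newton iteration for the principal branch of the Lambert W function (helper names mine, not from the talk):

```python
import math

def lambert_w(x, iters=60):
    """Principal branch W(x) for x >= 0, via Newton on w*e^w = x."""
    w = math.log(1 + x)  # starting point above the root
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1))
    return w

def best_t(m, base, adds=1):
    """Minimizer of adds*m*n*k/t + n*k*base**t/t (n, k cancel):
    t = (W(adds*m/e) + 1) / log(base)."""
    return (lambert_w(adds * m / math.e) + 1) / math.log(base)
```

For example, best_t(64, 2) gives roughly 4.79 and best_t(64, 3) roughly 3.02, matching the GF2 and GF3 3^t tables above.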
GF3 bitsliced operations

    s ← x0 ⊕ y1
    t ← x1 ⊕ y0
    r0 ← (x0 ⊕ y0) ∨ (t ⊕ y1)
    r1 ← s ∧ t

Figure: GF3 bitsliced addition in six operations: r ← x + y

    t ← x0 ⊕ y0
    r0 ← t ∨ (x1 ⊕ y1)
    r1 ← (t ⊕ y1) ∧ (y0 ⊕ x1)

Figure: GF3 bitsliced subtraction in six operations: r ← x − y

    r0 ← x0
    r1 ← x0 ⊕ x1

Figure: GF3 bitsliced negation in one operation: r ← −x
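These circuits can be checked exhaustively. The sketch below applies them to Python ints acting as bitmasks (one element per word here for clarity), assuming the 0 → 00, 1 → 01, 2 → 11 encoding with index 0 the low bit; the addition circuit is a six-operation realization that I verified over all nine input pairs, while subtraction and negation follow the figures:

```python
def gf3_add(x0, x1, y0, y1):
    """r <- x + y in six bit operations (x0/y0 low bits, x1/y1 high)."""
    s = x0 ^ y1
    t = x1 ^ y0
    return (x0 ^ y0) | (t ^ y1), s & t  # (r0, r1)

def gf3_sub(x0, x1, y0, y1):
    """r <- x - y in six bit operations."""
    t = x0 ^ y0
    return t | (x1 ^ y1), (t ^ y1) & (y0 ^ x1)

def gf3_neg(x0, x1):
    """r <- -x in one bit operation."""
    return x0, x0 ^ x1

# element value -> (low bit, high bit): 0 -> 00, 1 -> 01, 2 -> 11
ENC = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
DEC = {v: k for k, v in ENC.items()}
```

On 64-bit words the same expressions process 64 elements at once, which is the six-op SlicedUnit add/subtract and one-op negate counted earlier.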