Fast Matrix Multiplication Over GF3

Matthew Lambert
Advised by Dr. B. David Saunders & Dr. Stephen Siegel

February 13, 2015

Matrix Multiplication

Outline

• Fast Matrix Multiplication
  • Strassen’s Algorithm: well-studied $O(n^{2.807})$ divide-and-conquer algorithm.
  • Method of the Four Russians (Arlazarov et al., 1970): $O(n^3 / \log n)$ algorithm that is effective over small finite fields (GF2, GF3, GF5). Very well studied for GF2 and anecdotally good for GF3.
• Aim to find thresholds for each algorithm over GF3.

Arithmetic over GF3

• Simply arithmetic mod 3.

• Addition:

    + | 0 1 2
    --+------
    0 | 0 1 2
    1 | 1 2 0
    2 | 2 0 1

• Multiplication:

    × | 0 1 2
    --+------
    0 | 0 0 0
    1 | 0 1 2
    2 | 0 2 1
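As a quick check of the two tables above, here is a minimal Python sketch that rebuilds them from arithmetic mod 3. The names `gf3_add`, `gf3_mul`, `add_table`, and `mul_table` are illustrative and do not come from the slides.

```python
# Addition and multiplication tables for GF3, built from arithmetic mod 3.
P = 3


def gf3_add(a: int, b: int) -> int:
    """Sum of two GF3 elements."""
    return (a + b) % P


def gf3_mul(a: int, b: int) -> int:
    """Product of two GF3 elements."""
    return (a * b) % P


# Row a, column b of each table holds a + b (resp. a * b) in GF3.
add_table = [[gf3_add(a, b) for b in range(P)] for a in range(P)]
mul_table = [[gf3_mul(a, b) for b in range(P)] for a in range(P)]
```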

Representation of GF3

• With only three possible values we ideally need two bits for each element.
• Packed storage: elements stored consecutively, e.g. the elements 0, 2, 1, 1 as

    00 10 01 01

• To account for a possible carry when adding, we actually need 3 bits for each element. 5 bit ops to add two 21-element words.
• Bitsliced storage: low bits and high bits stored separately (Boothby & Bradshaw, 2009). Store 2 as $11_2$ instead of $10_2$. The same four elements become

    low bits:   0 1 1 1
    high bits:  0 1 0 0

• 6 bit ops to add or subtract two 64-element SlicedUnits. 1 bit op needed to negate.
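The bitsliced representation above can be sketched in Python as follows. This is a sketch under stated assumptions, not the author's implementation: the slides give only the operation counts, so the six-operation addition formula below is one that agrees with the mod-3 addition table and may differ from the one used by Boothby & Bradshaw. The names `pack`, `unpack`, `add`, and `neg` are hypothetical, and Python integers stand in for 64-bit words.

```python
from typing import List, Tuple

# (low, high) bit-planes; 0 -> (0, 0), 1 -> (1, 0), 2 -> (1, 1).
Sliced = Tuple[int, int]


def pack(elems: List[int]) -> Sliced:
    """Pack GF3 elements into two bit-planes; element k uses bit k of each plane."""
    low = high = 0
    for k, e in enumerate(elems):
        if e % 3 != 0:
            low |= 1 << k
        if e % 3 == 2:
            high |= 1 << k
    return low, high


def unpack(x: Sliced, n: int) -> List[int]:
    """Recover n elements; each value is its low bit plus its high bit."""
    low, high = x
    return [((low >> k) & 1) + ((high >> k) & 1) for k in range(n)]


def add(x: Sliced, y: Sliced) -> Sliced:
    """Element-wise sum over GF3 using six bitwise operations (assumed formula)."""
    x0, x1 = x
    y0, y1 = y
    s = x0 ^ y1
    t = x1 ^ y0
    return (x0 ^ y0) | (s ^ x1), s & t


def neg(x: Sliced) -> Sliced:
    """Element-wise negation (swaps 1 and 2) using one bitwise operation."""
    x0, x1 = x
    return x0, x0 ^ x1
```

Because every bit position is processed independently, one call to `add` on 64-bit planes adds 64 elements at once, which is the source of the speedup over packed storage.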

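Matrix Multiplication over GF2

$$
\begin{pmatrix} 0&1&1&0 \\ 1&0&0&1 \\ 1&1&0&1 \\ 0&0&1&1 \end{pmatrix}
\times
\begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&0&1 \\ 0&0&1&0 \end{pmatrix}
=
\begin{pmatrix} 0&1&0&1 \\ 1&0&1&0 \\ 1&1&1&0 \\ 0&0&1&1 \end{pmatrix}
$$

• Row $i$ of $A$ determines row $i$ of $C$.
• $C_i = \sum_{j=0}^{n-1} A_{i,j} \times B_j$
• Over GF2, addition is an xor operation.
• We perform up to $n^2$ row additions, yielding $O(n^3)$ running time.
• We have a speedup if we do not have to do $O(n^2)$ row additions.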
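The row-oriented view above can be sketched in Python. Each row of $B$ is held as one integer so that a row addition is a single xor; the function name `gf2_matmul_rows` and the list-of-lists matrix format are assumptions made for illustration.

```python
from typing import List


def gf2_matmul_rows(A: List[List[int]], B: List[List[int]]) -> List[List[int]]:
    """Classical product over GF2: row i of C is the xor of the rows of B
    selected by the 1 bits in row i of A."""
    width = len(B[0])
    # Store each row of B as an integer so that a row addition is one xor.
    b_rows = [int("".join(map(str, row)), 2) for row in B]
    C = []
    for a_row in A:
        acc = 0
        for j, bit in enumerate(a_row):
            if bit:
                acc ^= b_rows[j]  # up to n row additions per row of A
        C.append([int(c) for c in format(acc, f"0{width}b")])
    return C
```

With $n$ rows in $A$ and up to $n$ xors per row, this performs up to $n^2$ row additions, each costing $O(n)$ bit operations.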

Four Russians Multiplication over GF2

$$
\begin{pmatrix} 0&1&1&0 \\ 1&0&0&1 \\ 1&1&0&1 \\ 0&0&1&1 \end{pmatrix}
\times
\left(\begin{array}{cccc} 1&0&0&0 \\ 0&1&0&0 \\ \hline 0&0&0&1 \\ 0&0&1&0 \end{array}\right)
$$

• Instead of indexing one bit at a time, we index with $t = 2$ bits at a time, yielding $n^2/t$ additions and thus $O(n^3/t)$ running time.
• With multiple bits as index, we are adding multiple rows at once.
• We need to quickly compute the linear combinations of the $t = 2$ rows we are adding.

• Table for rows $r_1$ and $r_2$ of $B$, indexed by the first $t = 2$ bits of each row of $A$:

    index  description  contents
    00     0            0 0 0 0
    10     r1           1 0 0 0
    01     r2           0 1 0 0
    11     r1 + r2      1 1 0 0

• Partial product after one table lookup per row of $A$:

$$
\begin{pmatrix} 0&1&0&0 \\ 1&0&0&0 \\ 1&1&0&0 \\ 0&0&0&0 \end{pmatrix}
$$

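• Table for rows $r_3$ and $r_4$ of $B$, indexed by the last $t = 2$ bits of each row of $A$:

    index  description  contents
    00     0            0 0 0 0
    10     r3           0 0 0 1
    01     r4           0 0 1 0
    11     r3 + r4      0 0 1 1

• Adding the two partial products gives $C$:

$$
\begin{pmatrix} 0&1&0&0 \\ 1&0&0&0 \\ 1&1&0&0 \\ 0&0&0&0 \end{pmatrix}
+
\begin{pmatrix} 0&0&0&1 \\ 0&0&1&0 \\ 0&0&1&0 \\ 0&0&1&1 \end{pmatrix}
=
\begin{pmatrix} 0&1&0&1 \\ 1&0&1&0 \\ 1&1&1&0 \\ 0&0&1&1 \end{pmatrix}
$$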

Four Russians Multiplication over GF2: Fast table creation

• We perform $n^2/t$ row additions using $n/t$ tables.
• The additions are performed in $O(n^3/t)$ time.
• The tables can be constructed in $O(2^t n^2 / t)$ time: i.e., one vector addition for each of the $2^t$ rows in each of the $n/t$ tables.
• Table for $t = 1$:

    index  description
    0      0
    1      r1

• Table for $t = 2$:

    index  description
    00     0
    10     r1
    01     r2
    11     r1 + r2

• Table for $t = 3$:

    index  description
    000    0
    100    r1
    010    r2
    110    r1 + r2
    001    r3
    101    r1 + r3
    011    r2 + r3
    111    r1 + r2 + r3
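The tables for $t = 1, 2, 3$ suggest how a table can be built with one vector addition per entry: adding a new row doubles the table, and every new entry is an existing entry xored with that row. A minimal Python sketch of that doubling follows; `build_table` is a hypothetical name, and rows are assumed to be stored as integers so that addition over GF2 is xor.

```python
from typing import List


def build_table(rows: List[int]) -> List[int]:
    """Return T with T[idx] = xor of rows[k] for every set bit k of idx.

    Bit 0 of idx selects rows[0], which matches the tables above, where the
    index is written with the bit for r1 first.
    """
    table = [0]
    for row in rows:
        # Each new entry costs exactly one vector addition (one xor).
        table += [entry ^ row for entry in table]
    return table
```

For $t$ rows this performs $2^t - 1$ xors in total, in line with the one-addition-per-entry bound on the slide.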

Four Russians Multiplication over GF2: Algorithm

    Data: A, B, C, n × n matrices
    Result: C ← A × B
    for i ← 0 to n/t − 1 do
        T ← table of 2^t combinations of rows [i∗t, (i+1)∗t) of B
        for j ← 0 to n − 1 do
            a ← bits [i∗t, (i+1)∗t) of row j of A
            C_j ← C_j + T[a]
        end
    end

• Running time: $n^2/t$ vector adds, totaling $O(n^3/t)$ time; $n/t$ tables created, totaling $O(2^t n^2/t)$ time.
• Let $t = \log_2(n)$; then $2^t = n$, so the additions take $O(n^3/\log_2 n)$ time and the tables take $O(n^3/\log_2 n)$ time to create.
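The pseudocode above can be turned into a short runnable Python sketch. It assumes square 0/1 matrices given as lists of lists and that $t$ divides $n$; the name `four_russians_gf2` is hypothetical, and rows are held as Python integers so that a row addition is one xor.

```python
from typing import List


def four_russians_gf2(A: List[List[int]], B: List[List[int]], t: int) -> List[List[int]]:
    """Method of the Four Russians over GF2, following the pseudocode above."""
    n = len(A)
    assert n % t == 0, "this sketch assumes t divides n"
    width = len(B[0])
    b_rows = [int("".join(map(str, row)), 2) for row in B]
    C = [0] * n
    for i in range(n // t):
        # T[idx] = xor of rows i*t + k of B for every set bit k of idx.
        T = [0]
        for row in b_rows[i * t:(i + 1) * t]:
            T += [entry ^ row for entry in T]
        for j in range(n):
            # Bits [i*t, (i+1)*t) of row j of A, first bit least significant.
            a = sum(A[j][i * t + k] << k for k in range(t))
            C[j] ^= T[a]  # one row addition replaces up to t of them
    return [[int(c) for c in format(row, f"0{width}b")] for row in C]
```

Calling it on the 4 × 4 example with `t = 2` builds exactly the two tables shown earlier, one for rows r1 and r2 and one for rows r3 and r4.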

Four Russians Multiplication over GF3: Considerations

• In GF2, we had $2^t$ row combinations to compute. In GF3, we have to compute $3^t$ combinations of rows.
• In GF2, extracting bits for indices was easy; with the bitsliced representation, this is problematic for GF3.
• How do we use

    low bits:   1 1 0
    high bits:  1 0 0

  as an index that corresponds to $r_2 - r_3$, which is nominally 21 ($0 + 3 + 18 = 21 = 10101_2$)?
• If $t = 3$, there are only 27 combinations of rows, but both the low bits and the high bits range between 000 and 111.
• Too expensive to map the index to the range 0 to $3^t$ directly, so we concatenate the high and low bits to give an index between 0 and $4^t$.
• How do our tables look with these indices?
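The indexing problem above can be made concrete with a small Python sketch. It computes both the nominal base-3 index and the concatenated index for a vector of $t$ coefficients. The function names are hypothetical, and placing the high bits in the upper half of the concatenated index is an assumption; the slides only say that the two bit groups are concatenated.

```python
from itertools import product
from typing import List, Set, Tuple


def slice_bits(coeffs: List[int]) -> Tuple[int, int]:
    """Low and high bit-planes of t coefficients, with 2 stored as binary 11."""
    low = sum(1 << k for k, c in enumerate(coeffs) if c in (1, 2))
    high = sum(1 << k for k, c in enumerate(coeffs) if c == 2)
    return low, high


def concat_index(coeffs: List[int]) -> int:
    """Concatenate the high and low planes into one index in [0, 4**t)."""
    t = len(coeffs)
    low, high = slice_bits(coeffs)
    return (high << t) | low


def base3_index(coeffs: List[int]) -> int:
    """The nominal index in [0, 3**t): coefficients read as base-3 digits."""
    return sum(c * 3 ** k for k, c in enumerate(coeffs))


def valid_indices(t: int) -> Set[int]:
    """All concatenated indices that correspond to a real row combination."""
    return {concat_index(list(c)) for c in product(range(3), repeat=t)}
```

For $r_2 - r_3$ the coefficients are `[0, 1, 2]`, so `slice_bits` returns the planes `0b110` and `0b100` shown on the slide, while `base3_index` returns 21. Only $3^t$ of the $4^t$ concatenated indices are ever produced, which is why a table indexed this way has unused rows.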

Four Russians Multiplication over GF3: Tables

• Three approaches considered to address the creation of row combinations and the indexing problem:
  • Use one table of $2^t$ rows and add with it twice.
  • Use one table of $4^t$ rows and add with it once.
  • Use one table of $3^t$ rows and one of $4^t$ indices, and add with it once.

Four Russians Multiplication over GF3: $2^t$ approach

• First approach: create a table of $2^t$ combinations of rows as in the GF2 case. Index once into the table with the low $t$ bits and once with the high $t$ bits. The $2 = 11_2$ representation used in the bitsliced storage is advantageous.

    index  contents
    00     0
    10     r1
    01     r2
    11     r1 + r2
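Why two lookups suffice: with 1 stored as $01_2$ and 2 as $11_2$, each coefficient equals its low bit plus its high bit, so the wanted combination is `T[low] + T[high]`. The Python sketch below illustrates this with plain lists of integers mod 3 instead of bitsliced words, purely to keep it short; the names `add_rows` and `four_russians_gf3_2t` are hypothetical, and $t$ is assumed to divide $n$.

```python
from typing import List


def add_rows(u: List[int], v: List[int]) -> List[int]:
    """Element-wise sum of two rows over GF3."""
    return [(a + b) % 3 for a, b in zip(u, v)]


def four_russians_gf3_2t(A: List[List[int]], B: List[List[int]], t: int) -> List[List[int]]:
    """GF3 product using one 2**t-row table per block and two lookups per row."""
    n = len(A)
    assert n % t == 0, "this sketch assumes t divides n"
    width = len(B[0])
    C = [[0] * width for _ in range(n)]
    for i in range(n // t):
        # T[idx] = sum (mod 3) of rows i*t + k of B for every set bit k of idx.
        T = [[0] * width]
        for row in B[i * t:(i + 1) * t]:
            T += [add_rows(entry, row) for entry in T]
        for j in range(n):
            coeffs = A[j][i * t:(i + 1) * t]
            low = sum(1 << k for k, c in enumerate(coeffs) if c != 0)
            high = sum(1 << k for k, c in enumerate(coeffs) if c == 2)
            # coefficient = low bit + high bit, so add with the table twice
            C[j] = add_rows(add_rows(C[j], T[low]), T[high])
    return C
```

The table is no larger than in the GF2 case, at the price of a second row addition for every lookup.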

Four Russians Multiplication over GF3: $4^t$ approach

• Second approach: create a table of $4^t$ rows containing all $3^t$ combinations of rows and some unused rows. We will index into the rows directly.

    index  contents
    0000   0
    1000   r1
    0100   r2
    1100   r1 + r2
    0010   (unused)
    1010   −r1
    0110   (unused)
    1110   −r1 + r2
    0001   (unused)
    1001   (unused)
    0101   −r2
    1101   r1 − r2
    0011   (unused)
    1011   (unused)
    0111   (unused)
    1111   −r1 − r2

Four Russians Multiplication over GF3: $3^t$ approach

• Third approach: create a table of $3^t$ rows and a table of $4^t$ indices or pointers to map an index to a combination.

    index  contents
    0000   0
    1000   r1
    0100   −r1
    1100   r2
    0010   r1 + r2
    1010   −r1 + r2
    0110   −r2
    1110   −r1 − r2
    0001   r1 − r2

    index  destination
    0000   0000
    1000   1000
    0100   1100
    1100   0010
    0010   (unused)
    1010   0100
    0110   (unused)
    1110   1010
    0001   (unused)
    1001   (unused)
    0101   0110
    1101   0001
    0011   (unused)
    1011   (unused)
    0111   (unused)
    1111   1110
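The two tables above can be generated together. The Python sketch below adds each new row to every existing entry and then negates those sums, which reproduces the row order of the first table, and it records the concatenated index of each entry to build the second table. It uses plain lists mod 3 rather than bitsliced words; `build_3t_tables` is a hypothetical name, and putting the high bits in the upper half of the index is an assumption.

```python
from typing import Dict, List, Tuple


def build_3t_tables(rows: List[List[int]]) -> Tuple[List[List[int]], Dict[int, int]]:
    """Return (table of 3**t row combinations, map from 4**t-style index to position)."""
    t = len(rows)
    width = len(rows[0])
    table = [[0] * width]
    coeffs = [[0] * t]  # coefficient vector of each table entry
    for k, r in enumerate(rows):
        # Half of the new entries come from an addition ...
        plus = [[(a + b) % 3 for a, b in zip(e, r)] for e in table]
        # ... and the other half from negating those sums.
        minus = [[(-a) % 3 for a in p] for p in plus]
        plus_c = [c[:k] + [1] + c[k + 1:] for c in coeffs]
        minus_c = [[(-x) % 3 for x in c] for c in plus_c]
        table += plus + minus
        coeffs += plus_c + minus_c
    index_map = {}
    for pos, c in enumerate(coeffs):
        low = sum(1 << k for k, x in enumerate(c) if x != 0)
        high = sum(1 << k for k, x in enumerate(c) if x == 2)
        index_map[(high << t) | low] = pos
    return table, index_map
```

For $t = 2$ the map has nine entries. For example, the index written `0101` in the table above (value 10 when the first bit is least significant) maps to position 6, the row holding $-r_2$.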

Four Russians Multiplication over GF3: Table Summary

    method  memory cost                     element access  adds  ops per table
    $2^t$   $2^t$ rows                      direct          2     $6 \times 2^t n$
    $3^t$   $3^t$ rows and $4^t$ indices    indirect        1     $3.5 \times 3^t n$
    $4^t$   $4^t$ rows                      direct          1     $3.5 \times 4^t n$

• Because the $3^t$ and $4^t$ approaches contain ± each combination of rows, we only need to use the six-operation addition to construct half of the elements. We can then use the one-operation negation to construct the other half.
• Only one supplemental table for the $3^t$ approach needs to be created for each value of $t$ used.
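One way to read the factor 3.5 in the table, assuming it is the average cost per table entry when half of the entries are built with the six-operation addition and the other half with the one-operation negation:

```latex
\frac{1}{2}\cdot 6 + \frac{1}{2}\cdot 1 = 3.5
\qquad\text{versus}\qquad
6 \ \text{for every entry in the } 2^t \text{ approach.}
```

Under that reading, the $3^t$ and $4^t$ tables have more entries than a $2^t$ table but a lower cost per entry, and they need only one row addition per lookup instead of two.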

Four Russians Multiplication over GF3: Results

• Initial development showed the $4^t$ method to be considerably slower, so it was abandoned in favor of optimizing the other two approaches.

Four Russians Multiplication over GF3: Results

I

n      3^t time (s)   2^t time (s)   classical time (s)
64     0.000021       0.000012       0.000083
128    0.000073       0.00006        0.000331
256    0.000326       0.000317       0.00116
512    0.00195        0.00203        0.00599
1024   0.0133         0.0142         0.0637
2048   0.0941         0.100          0.238
4096   1.674          1.069          2.577
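
To make the comparison easier to read, the speedup of each Four Russians variant over classical multiplication can be computed directly from the timings. The sketch below copies the numbers from the table; the names `TIMINGS` and `speedups` are illustrative, and the times are assumed to be in seconds.

```python
# n -> (3^t time, 2^t time, classical time), copied from the results table.
TIMINGS = {
    64: (0.000021, 0.000012, 0.000083),
    128: (0.000073, 0.00006, 0.000331),
    256: (0.000326, 0.000317, 0.00116),
    512: (0.00195, 0.00203, 0.00599),
    1024: (0.0133, 0.0142, 0.0637),
    2048: (0.0941, 0.100, 0.238),
    4096: (1.674, 1.069, 2.577),
}


def speedups(timings):
    """Return n -> (speedup of 3^t, speedup of 2^t) relative to classical."""
    return {
        n: (classical / t3, classical / t2)
        for n, (t3, t2, classical) in timings.items()
    }


for n, (s3, s2) in speedups(TIMINGS).items():
    print(f"{n:5d}  3^t: {s3:5.2f}x  2^t: {s2:5.2f}x")
```

At n = 4096 the 2^t variant comes out at roughly 2.4 times faster than classical.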

Four Russians Multiplication over GF3: Conclusions

- Reasonable speedup over classical: 2.5x (and improving).

- The 3^t approach is not as successful as it theoretically should be.

  - Larger tables do not fit into L1 cache as well as 2^t-sized tables?

  - More complicated access increases cost?

- Some precedent for multiple additions: M4RI (Albrecht, 2010) increases the number of additions if it means smaller tables, so multiple tables can fit into L1 cache.

- Preliminary testing with multiple tables sees the 3^t approach compute 4096 × 4096 in 0.868 seconds, and the 2^t approach in 1.01 seconds.
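
The cache question can be made concrete with a rough footprint estimate. The sketch below is only a back-of-the-envelope model under loud assumptions: two bits per element (bitsliced storage), a 32 KiB L1 data cache, and no accounting for the extra 4^t index table of the 3^t approach. The function names are hypothetical.

```python
L1_BYTES = 32 * 1024  # assumption: a typical L1 data cache size


def table_bytes(base, t, n_cols, bits_per_element=2):
    """Bytes needed for a table of base**t rows of n_cols GF(3) elements."""
    return base ** t * n_cols * bits_per_element // 8


def max_t_fitting(base, n_cols, cache_bytes=L1_BYTES):
    """Largest t whose table of base**t rows fits in cache_bytes."""
    t = 0
    while table_bytes(base, t + 1, n_cols) <= cache_bytes:
        t += 1
    return t


# Full-width tables for a 4096-column matrix are far larger than L1.
print(table_bytes(2, 9, 4096), table_bytes(3, 6, 4096))
# For a strip of 64 columns (one machine word per bit plane):
print(max_t_fitting(2, 64), max_t_fitting(3, 64))
```

Under these assumptions a full-width table at n = 4096 is hundreds of kilobytes for either approach, which is consistent with smaller tables being attractive.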

Other Four Russians Thoughts

- We assume we can operate on full 64-bit words. Performance decreases if unused bits must be left unmodified.

- If t = log(n), then the table of row combinations is the same size as the matrix, so there are definite practical limits on t.

Strassen’s Algorithm vs Classical Divide-and-Conquer

Classical divide-and-conquer multiplication vs Strassen multiplication on top of the 3^t approach of the Method of the Four Russians. Strassen was faster at all tested dimensions.

Strassen’s Algorithm: 2t vs 3t as base case

Three levels of recursion performed. Even with base cases where the 3^t approach should outperform the 2^t approach, the 2^t base case yielded faster results in all cases (though the multiple tables method outperforms 2^t by about 14%).

Strassen’s Algorithm: Conclusions and Thresholds

- Strassen’s algorithm is faster than classical and, due to memory requirements, is at some point faster than the Method of the Four Russians.

- When using a base case of the 2^t approach, the threshold to switch from Strassen’s algorithm to Four Russians on the machine used is somewhere between 960 and 1728.

- Normally Strassen’s algorithm works best on matrices whose dimensions are a power of 2, or a power of 2 times the base case dimension. With bitsliced or packed storage it is important to try to keep matrix dimensions at multiples of 64 at all levels of recursion, as working with less than one machine word is more expensive.

- Further improvements ongoing?
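
The word-alignment point can be illustrated with two small helpers. These are hypothetical (neither `aligned_levels` nor `padded_dim` comes from the talk), and padding is only one possible way to keep every recursion level at a multiple of 64; the slides do not say which strategy was used.

```python
def aligned_levels(n, word=64):
    """Number of times n can be halved while staying a multiple of `word`."""
    if n <= 0 or n % word != 0:
        raise ValueError("n must be a positive multiple of the word size")
    levels = 0
    while n % (2 * word) == 0:
        n //= 2
        levels += 1
    return levels


def padded_dim(n, levels, word=64):
    """Smallest dimension >= n that stays word-aligned for `levels` halvings."""
    unit = word << levels  # word * 2**levels
    return -(-n // unit) * unit  # round n up to a multiple of unit


print(aligned_levels(1024), aligned_levels(960))
print(padded_dim(960, 3), padded_dim(1728, 3), padded_dim(4096, 3))
```

For example, 1024 can be halved four times before dropping below a full word per row, while 960 cannot be halved even once without leaving a partial word.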

Summary

- Successfully developed and implemented the Method of the Four Russians over GF3, yielding a noticeable performance gain over classical multiplication.

- Implemented Strassen’s algorithm on top of the Method of the Four Russians.

- Different implementations of the Method of the Four Russians might be preferred depending on the specific problem and machine.

Best value of t

For GF2 we can assume all operations take essentially equal time: the only different cost is extracting bits, and all other costs are 64-bit xors. For (m × n) ∗ (n × k) = (m × k), we spend mnk/t time adding and nk·2^t/t time creating tables.

Minimize

\[ \frac{mnk}{t} + \frac{nk\,2^{t}}{t}. \]

Partial derivative with respect to t:

\[ \frac{kn\left(2^{t}\,(t\log 2 - 1) - m\right)}{t^{2}}. \]

Set equal to zero and solve for t, where W is the Lambert W function:

\[ t = \frac{W(m/e) + 1}{\log 2}. \]

dimension    64     128    256    512    1024   2048   4096
t            4.79   5.51   6.26   7.04   7.85   8.67   9.52
log2(dim)    6      7      8      9      10     11     12
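
The closed form can be evaluated numerically without any external library. The sketch below is an illustration only: it assumes the cost model stated above, and `lambert_w` is a hand-rolled Newton iteration for the principal branch on positive arguments rather than a library routine. The name `best_t_gf2` is hypothetical.

```python
import math


def lambert_w(x, tol=1e-12):
    """Principal branch W(x) for x > 0, via Newton's method on w * e^w = x."""
    w = math.log(1.0 + x)  # starts at or above the root for x > 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def best_t_gf2(m):
    """Real-valued t minimizing mnk/t + nk*2^t/t (independent of n and k)."""
    return (lambert_w(m / math.e) + 1.0) / math.log(2.0)


for m in (64, 128, 256, 512, 1024, 2048, 4096):
    print(m, round(best_t_gf2(m), 2))
```

The printed values can be compared against the t row of the table; in each case the model's optimum sits noticeably below log2(dim).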

Selecting t in GF3

For the 3^t approach: we spend mnk/t time adding and nk·3^t/t time creating tables, giving

\[ t = \frac{W(m/e) + 1}{\log 3}. \]

For the 2^t approach: we spend 2mnk/t time adding and nk·2^t/t time creating tables, giving

\[ t = \frac{W(2m/e) + 1}{\log 2}. \]

3^t approach:

dimension        64     128    256    512    1024   2048   4096
min t            3.02   3.47   3.94   4.44   4.95   5.47   6.00
experimental t   3      4      4      4      4      5      6
log3(dim)        3.78   4.42   5.05   5.68   6.31   6.94   7.57

2^t approach:

dimension        64     128    256    512    1024   2048   4096
min t            4.41   6.26   7.04   7.85   8.67   9.51   10.37
experimental t   6      6      7      7      8      9      9
log2(dim)        6      7      8      9      10     11     12
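
Since t has to be an integer in practice, the same cost model can also be evaluated at integer values only. The sketch below does that by brute force; `cost` and `best_integer_t` are hypothetical names, the common factor nk is dropped, and the model deliberately ignores cache and access costs, so it should not be expected to reproduce every experimental value.

```python
def cost(t, m, base, adds_per_lookup):
    """Modelled cost per unit of n*k: adding time plus table creation time."""
    return (adds_per_lookup * m + base ** t) / t


def best_integer_t(m, base, adds_per_lookup, t_max=16):
    """Integer t in 1..t_max with the smallest modelled cost."""
    return min(
        range(1, t_max + 1),
        key=lambda t: cost(t, m, base, adds_per_lookup),
    )


# 3^t approach: one addition per lookup, tables of 3^t rows.
print([best_integer_t(m, 3, 1) for m in (64, 4096)])
# 2^t approach: two additions per lookup, tables of 2^t rows.
print(best_integer_t(4096, 2, 2))
```

For the 3^t approach this gives t = 3 at dimension 64 and t = 6 at dimension 4096, matching the experimental row at those two sizes.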

GF3 bitsliced operations

    s ← x0 ⊕ y1
    t ← x1 ⊕ y0
    r0 ← (s ⊕ x1) ∨ (t ⊕ y1)
    r1 ← s ∧ t

Figure: GF3 bitsliced addition in six operations: r ← x + y

    t ← x0 ⊕ y0
    r0 ← t ∨ (x1 ⊕ y1)
    r1 ← (t ⊕ y1) ∧ (y0 ⊕ x1)

Figure: GF3 bitsliced subtraction in six operations: r ← x − y

    r0 ← x0
    r1 ← x0 ⊕ x1

Figure: GF3 bitsliced negation in one operation: r ← −x
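
The bitsliced operations can be checked exhaustively with a short script. This is a sketch under an assumed encoding, 0 → (0,0), 1 → (1,0), 2 → (1,1), with x0 as the nonzero bit and x1 as the sign bit; that encoding is inferred from the negation and subtraction formulas rather than stated on the slide. The addition is written in a six-operation form that was checked against this encoding, so it may differ in layout from the expression in the talk. Python integers stand in for 64-bit words, and `pack` and `unpack` are hypothetical helpers.

```python
def pack(values):
    """Pack GF(3) values into two bit planes (x0 = nonzero, x1 = sign)."""
    x0 = x1 = 0
    for i, v in enumerate(values):
        v %= 3
        if v != 0:
            x0 |= 1 << i
        if v == 2:
            x1 |= 1 << i
    return x0, x1


def unpack(x, count):
    """Inverse of pack: (0,0) -> 0, (1,0) -> 1, (1,1) -> 2."""
    x0, x1 = x
    return [((x0 >> i) & 1) + ((x1 >> i) & 1) for i in range(count)]


def add(x, y):
    """r = x + y in six bitwise operations."""
    (x0, x1), (y0, y1) = x, y
    s = x0 ^ y1
    t = x1 ^ y0
    return (s ^ x1) | (t ^ y1), s & t


def sub(x, y):
    """r = x - y in six bitwise operations."""
    (x0, x1), (y0, y1) = x, y
    t = x0 ^ y0
    return t | (x1 ^ y1), (t ^ y1) & (y0 ^ x1)


def neg(x):
    """r = -x in one bitwise operation."""
    x0, x1 = x
    return x0, x0 ^ x1


# All nine (a, b) pairs at once, one pair per bit position.
a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
b = [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(unpack(add(pack(a), pack(b)), 9))
print(unpack(sub(pack(a), pack(b)), 9))
print(unpack(neg(pack(a)), 9))
```

Because each bit position is independent, the nine-position check covers every input pair, and the same functions apply unchanged to full 64-element words.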