
A Vector Implementation of Gaussian Elimination over GF(2): Exploring the Design-Space of

Strassen’s Algorithm as a Case Study

Enric Morancho

Departament d’Arquitectura de Computadors

Universitat Politècnica de Catalunya, BarcelonaTech

Barcelona, [email protected]

23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Turku, Finland, March 4th − 6th, 2015


Outline

1 Introduction

2 Background

3 Vector implementation of Gaussian Elimination

4 Case study

5 Conclusions and Future work


Introduction

Gaussian Elimination (GE) is one of the key algorithms in linear algebra

We discuss a vector implementation of GE over GF(2)

We apply this implementation to a case study:

Enumerating all matrix-multiply algorithms over GF(2) similar to Strassen’s algorithm

Strassen’s algorithm: first known sub-cubic matrix-multiply algorithm

The search engine relies on solving more than $10^{12}$ Gaussian Eliminations over GF(2) matrices


Gaussian Elimination (GE)

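One of the key algorithms in linear algebra

Applications: solving linear systems of equations (LSEs), inverting nonsingular matrices, ...

GE transforms a matrix into a matrix in row (column) echelon form

Forward elimination

$$
\begin{pmatrix} 2 & 1 & 4 & 3 \\ 2 & 4 & 10 & 3 \\ 6 & 6 & 18 & 9 \\ 4 & 2 & 8 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 2 & 1 & 4 & 0 \\ 0 & 3 & 6 & 3 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix}
$$

Gauss-Jordan Elimination (GJE) transforms a matrix into a matrix in reduced row (column) echelon form

Forward elimination + Back substitution

$$
\begin{pmatrix} 2 & 1 & 4 & 3 \\ 2 & 4 & 10 & 3 \\ 6 & 6 & 18 & 9 \\ 4 & 2 & 8 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 2 & 1 & 4 & 0 \\ 0 & 3 & 6 & 3 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}
$$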
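The reduced row echelon form of the example matrix can be reproduced with a few lines of exact rational arithmetic. The following is a minimal sketch, not the implementation discussed in these slides; the function name `rref` is a hypothetical helper, and `fractions.Fraction` is used so that no round-off error appears.

```python
from fractions import Fraction

def rref(matrix):
    """Gauss-Jordan elimination over the rationals (illustrative sketch)."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0  # index of the next pivot row
    for c in range(cols):
        # Look for a non-zero entry in column c, at or below row r.
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]           # swap rows
        p = m[r][c]
        m[r] = [x / p for x in m[r]]              # scale the pivot row
        for i in range(rows):                     # clear the rest of column c
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return m

example = [[2, 1, 4, 3],
           [2, 4, 10, 3],
           [6, 6, 18, 9],
           [4, 2, 8, 3]]
reduced = rref(example)
```

Because the reduced row echelon form of a matrix is unique, `reduced` matches the last matrix of the slide regardless of the order in which the elementary transformations are applied.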


Gaussian Elimination (GE)

GE iteratively applies three elementary transformations:

Swapping rows (columns)

Scaling rows (columns)

Adding to a row (column) a scalar multiple of another row (column)

GE is defined over an algebraic field

Infinite fields: Q, R, ...

Computer arithmetic over infinite fields introduces round-off errors

GE implementations use pivoting techniques to minimize them

Finite fields: the Galois Field of two elements (GF(2)), ...

Computer arithmetic over finite fields is always exact



Gaussian Elimination over GF(2)

GF(2) is the Galois field of two elements (aka F2, binary field)

GF(2) = {0, 1}

addition ≡ bitwise XOR

subtraction and addition are the same operation (+1 = −1)

multiplication ≡ bitwise AND

Implementation remarks

Gaussian Elimination can be specialized for GF(2)

The only element different from 0 is 1

Gauss-Jordan Elimination can be easily merged into GE

Pivoting is unnecessary

Computer arithmetic over GF(2) is always exact

Applications

Factoring large integer numbers, cryptography, pattern matching, ...
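The remarks above can be illustrated with a scalar sketch of Gauss-Jordan elimination over GF(2). It assumes each row is stored as a Python integer whose bit c holds column c; the function name `gje_gf2` is hypothetical. Row addition becomes a single XOR, no scaling step is needed because the only non-zero element is 1, and rows above and below the pivot are cleared in the same pass.

```python
def gje_gf2(rows, ncols):
    """Gauss-Jordan elimination over GF(2); bit c of each int is column c."""
    rows = list(rows)
    r = 0  # index of the next pivot row
    for c in range(ncols):
        bit = 1 << c
        # Any row with a 1 in column c is a valid pivot: no numerical pivoting.
        pivot = next((i for i in range(r, len(rows)) if rows[i] & bit), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i] & bit:
                rows[i] ^= rows[r]   # addition over GF(2) is bitwise XOR
        r += 1
    return rows

# 3 x 3 example: rows (1 1 0), (0 1 1), (1 0 1), written with column 0 as the lowest bit.
reduced = gje_gf2([0b011, 0b110, 0b101], 3)
```

In this example the third row is the sum of the first two, so the result has two pivot rows and one zero row.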



Vector extensions

Most current processors implement vector instructions

The x86 family supports several vector-instruction sets and lengths

MMX: 64-bit vector registers

SSE: 128-bit vector registers

AVX2: 256-bit vector registers

Scalar processors may emulate vector instructions

SWAR: SIMD within a register

Interprets general-purpose registers as bit vectors

Bitwise operations (AND, shifts, ...) can be seen as SIMD operations
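As a small illustration of SWAR, a 64-bit word can be treated as 64 independent GF(2) elements, so one XOR performs 64 additions and one AND performs 64 multiplications. The sketch below emulates a 64-bit register with a masked Python integer; `pack`, `unpack`, `gf2_add` and `gf2_mul` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit general-purpose register

def pack(bits):
    """Store bits[i] in bit position i of one word."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word

def unpack(word, n):
    """Read back the n lowest bits as a list."""
    return [(word >> i) & 1 for i in range(n)]

def gf2_add(x, y):
    """Element-wise addition over GF(2): one XOR for all lanes."""
    return (x ^ y) & MASK64

def gf2_mul(x, y):
    """Element-wise multiplication over GF(2): one AND for all lanes."""
    return (x & y) & MASK64

a = pack([1, 0, 1, 1])
b = pack([1, 1, 0, 1])
sums = unpack(gf2_add(a, b), 4)
products = unpack(gf2_mul(a, b), 4)
```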



Preliminaries

We focus on row echelon forms

Vector instructions allow us to exploit the parallelism available in the matrix transformations

We represent GF(2) matrices in vector registers

Both row-major and column-major layouts

Main steps of GE

For each column:

Finding the pivot

Row swapping

Forward elimination and Back substitution



Finding the pivot element

For each column, GE must locate a pivot (an element ≠ 0)

First: the column must be extracted

Depends on matrix layout

Second: the pivot must be located

Naïve solution: iterating bit by bit

Optimized solution: using bit-scanning machine instructions

tzcnt (part of the BMI1 extension, available on AVX2-capable Haswell processors)
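The two steps can be sketched in scalar form as follows. This is an illustration under stated assumptions, not the vector code of the slides: rows are integers with bit c holding column c, the extracted column is an integer with bit i holding row i, and `tzcnt` here is a Python stand-in for the machine instruction (it returns -1 for a zero operand, whereas the hardware instruction returns the operand width).

```python
def extract_column(rows, c):
    """Row-major layout: gather bit c of every row into one word (bit i = row i)."""
    col = 0
    for i, row in enumerate(rows):
        col |= ((row >> c) & 1) << i
    return col

def tzcnt(x):
    """Index of the lowest set bit of x, or -1 if x is zero."""
    return (x & -x).bit_length() - 1

def find_pivot(column, first_row):
    """Lowest row index >= first_row holding a 1 in the column, or -1."""
    candidates = column & ~((1 << first_row) - 1)  # ignore rows above first_row
    return tzcnt(candidates)

rows = [0b010, 0b001, 0b011]        # 3 x 3 matrix, column 0 is the lowest bit
column0 = extract_column(rows, 0)   # rows 1 and 2 have a 1 in column 0
```

In a column-major layout the extraction step disappears, since each column is already stored as one word.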



Row swapping

Depends on matrix layout

Row-major layout:

Vector instruction sets offer permute instructions

Column-major layout:

Several vector instructions are required


Row swapping

Example of swapping rows 3 and 5 on a 6 × 4 GF(2) matrix stored in column-major layout
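In a column-major layout, swapping two rows means exchanging two bit positions inside every column word. A common bit trick for this is the delta swap (shift, XOR, mask, XOR); the sketch below applies it column by column and is only an assumption about how "several vector instructions" could be combined, not the instruction sequence of the slide. The function name `swap_rows_colmajor` is hypothetical, and rows 3 and 5 of the slide correspond to 0-based bit positions 2 and 4.

```python
def swap_rows_colmajor(columns, i, j):
    """Swap bit positions i and j in every column word (bit k = row k)."""
    swapped = []
    for col in columns:
        # diff is 1 only when the two bits differ; flipping both then swaps them.
        diff = ((col >> i) ^ (col >> j)) & 1
        swapped.append(col ^ ((diff << i) | (diff << j)))
    return swapped

# 6 x 4 matrix as four 6-bit column words.
columns = [0b000100, 0b010000, 0b010100, 0b000000]
result = swap_rows_colmajor(columns, 2, 4)
```

With the whole matrix held in one vector register, the same shifts, masks and XORs would act on all columns at once instead of inside a loop.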



Forward elimination and Back substitution

Add pivot row to the rows with non-zero entries in the pivot column

We perform all additions at the same time

Example on a 4 × 6 GF(2) matrix stored in row-major layout
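One way to perform all the additions at once can be sketched by emulating a vector register with a single Python integer that holds one row per 8-bit lane. The lane width `W`, the helper names and the multiplication-based broadcast are assumptions made for this illustration; real vector code would use broadcast and mask instructions instead.

```python
W = 8      # lane width in bits: one matrix row per lane (assumes at most 8 columns)
ROWS = 4   # number of rows packed into the emulated register

def pack_rows(rows):
    """Pack the row bit masks into one integer, row i in lane i."""
    reg = 0
    for i, row in enumerate(rows):
        reg |= row << (W * i)
    return reg

def unpack_rows(reg):
    return [(reg >> (W * i)) & ((1 << W) - 1) for i in range(ROWS)]

def eliminate(reg, pivot_row, pivot_col):
    """Add the pivot row to every other row with a 1 in the pivot column."""
    lane_lsb = sum(1 << (W * i) for i in range(ROWS))   # bit 0 of every lane
    pivot = (reg >> (W * pivot_row)) & ((1 << W) - 1)
    # Bit 0 of each lane becomes that row's entry in the pivot column.
    select = (reg >> pivot_col) & lane_lsb
    select &= ~(1 << (W * pivot_row))                    # never add the pivot to itself
    # Multiplying the 0/1 selectors by the pivot row copies it into each selected lane.
    addend = select * pivot
    return reg ^ addend                                  # all GF(2) additions in one XOR

rows = [0b000101, 0b000011, 0b000110, 0b100001]  # 4 x 6 matrix, column 0 = lowest bit
after = unpack_rows(eliminate(pack_rows(rows), pivot_row=0, pivot_col=0))
```

Because the selector covers every row except the pivot row, rows above and below the pivot are handled together, which is how forward elimination and back substitution merge into one step.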



Block-recursive matrix-multiply algorithms

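$$
A \cdot B =
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\cdot
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
= C
$$

a) Conventional, complexity $n^{3}$

$$
\begin{aligned}
P_1 &= A_{11} \cdot B_{11} \\
P_2 &= A_{12} \cdot B_{21} \\
P_3 &= A_{11} \cdot B_{12} \\
P_4 &= A_{12} \cdot B_{22} \\
P_5 &= A_{21} \cdot B_{11} \\
P_6 &= A_{22} \cdot B_{21} \\
P_7 &= A_{21} \cdot B_{12} \\
P_8 &= A_{22} \cdot B_{22} \\
C_{11} &= P_1 + P_2 \\
C_{12} &= P_3 + P_4 \\
C_{21} &= P_5 + P_6 \\
C_{22} &= P_7 + P_8
\end{aligned}
$$

b) Strassen’s algorithm, complexity $n^{2.807}$

$$
\begin{aligned}
P_1 &= (A_{11} + A_{22}) \cdot (B_{11} + B_{22}) \\
P_2 &= (A_{21} + A_{22}) \cdot B_{11} \\
P_3 &= A_{11} \cdot (B_{12} - B_{22}) \\
P_4 &= A_{22} \cdot (-B_{11} + B_{21}) \\
P_5 &= (A_{11} + A_{12}) \cdot B_{22} \\
P_6 &= (-A_{11} + A_{21}) \cdot (B_{11} + B_{12}) \\
P_7 &= (A_{12} - A_{22}) \cdot (B_{21} + B_{22}) \\
C_{11} &= P_1 + P_4 - P_5 + P_7 \\
C_{12} &= P_3 + P_5 \\
C_{21} &= P_2 + P_4 \\
C_{22} &= P_1 - P_2 + P_3 + P_6
\end{aligned}
$$

c) Winograd’s variant, complexity $n^{2.807}$

$$
\begin{aligned}
P_1 &= A_{11} \cdot B_{11} \\
P_2 &= A_{12} \cdot B_{21} \\
P_3 &= A_{22} \cdot (B_{11} - B_{12} - B_{21} + B_{22}) \\
P_4 &= (A_{11} - A_{21}) \cdot (-B_{12} + B_{22}) \\
P_5 &= (A_{21} + A_{22}) \cdot (-B_{11} + B_{12}) \\
P_6 &= (A_{11} + A_{12} - A_{21} - A_{22}) \cdot B_{22} \\
P_7 &= (A_{11} - A_{21} - A_{22}) \cdot (B_{11} - B_{12} + B_{22}) \\
C_{11} &= P_1 + P_2 \\
C_{12} &= P_1 + P_5 + P_6 - P_7 \\
C_{21} &= P_1 - P_3 + P_4 - P_7 \\
C_{22} &= P_1 + P_4 + P_5 - P_7
\end{aligned}
$$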
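Strassen’s seven products can be checked against the conventional eight-product scheme on a small numeric case. The sketch below treats the four blocks of each operand as plain integers (1 × 1 blocks); the function names are hypothetical.

```python
def conventional_2x2(A, B):
    """Eight multiplications, as in scheme a)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    return [[a11 * b11 + a12 * b21, a11 * b12 + a12 * b22],
            [a21 * b11 + a22 * b21, a21 * b12 + a22 * b22]]

def strassen_2x2(A, B):
    """Seven multiplications, as in scheme b)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (-b11 + b21)
    p5 = (a11 + a12) * b22
    p6 = (-a11 + a21) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    return [[p1 + p4 - p5 + p7, p3 + p5],
            [p2 + p4, p1 - p2 + p3 + p6]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = strassen_2x2(A, B)
```

Applied recursively to blocks instead of scalars, saving one of the eight block products at each level is what lowers the exponent from 3 to about 2.807.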




Several matrix-multiply algorithms with sub-cubic complexity

$n^{2.81}$ [Strassen, 1969]

$n^{2.79}$ [Pan, 1979]

$n^{2.78}$ [Bini, 1979]

$n^{2.55}$ [Schönhage, 1981]

$n^{2.373}$ [Coppersmith and Winograd, 1987]

$n^{2.37286}$ [LeGall, 2014]

But Strassen’s algorithm is optimal for 2 × 2 matrices

Are there other algorithms similar to Strassen’s?

Using a genetic algorithm, [Oh and Moon, 2010] searched for Strassen-like algorithms over R

Their partial search discovered 608 Strassen-like algorithms


Case Study

Enumerating all the Strassen-like algorithms for 2 × 2 GF(2) matrices


Formulation

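Strassen-like algorithms make 7 recursive calls

Each call computes a bilinear form

$$
P_i = (\alpha_{i1} A_{11} + \alpha_{i2} A_{12} + \alpha_{i3} A_{21} + \alpha_{i4} A_{22}) \cdot (\beta_{i1} B_{11} + \beta_{i2} B_{12} + \beta_{i3} B_{21} + \beta_{i4} B_{22}), \qquad \alpha_{ij}, \beta_{ij} \in GF(2)
$$

There are $(2^4 - 1)^2 = 225$ possible $P_i$’s and $\binom{225}{7} \approx 5.27 \times 10^{12}$ candidate algorithms

An algorithm is correct (computes $A \cdot B$) if the following four linear combinations exist simultaneously:

$$
\begin{aligned}
C_{11}: \quad & \sum_{k=1}^{7} \phi_{k1} P_k = A_{11} \cdot B_{11} + A_{12} \cdot B_{21} \\
C_{12}: \quad & \sum_{k=1}^{7} \phi_{k2} P_k = A_{11} \cdot B_{12} + A_{12} \cdot B_{22} \\
C_{21}: \quad & \sum_{k=1}^{7} \phi_{k3} P_k = A_{21} \cdot B_{11} + A_{22} \cdot B_{21} \\
C_{22}: \quad & \sum_{k=1}^{7} \phi_{k4} P_k = A_{21} \cdot B_{12} + A_{22} \cdot B_{22}
\end{aligned}
\qquad \phi_{ij} \in GF(2)
$$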
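The size of the search space follows directly from the formulation: each factor of a product is a non-zero 4-bit coefficient vector, and an algorithm is an unordered choice of 7 distinct products. A short check using the standard-library function `math.comb`:

```python
from math import comb

nonzero_vectors = 2**4 - 1          # non-zero choices for (alpha_i1..alpha_i4), same for beta
products = nonzero_vectors ** 2     # possible bilinear forms P_i
candidates = comb(products, 7)      # unordered sets of 7 distinct products

print(products, candidates)
```

The order of the seven products does not matter, which is why a binomial coefficient is used rather than a count of ordered tuples.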


Representation

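We represent each candidate algorithm as a 16 × 11 GF(2) matrix

$\delta_{ijk} = 1 \Leftrightarrow \alpha_{ij} = \beta_{ik} = 1$

$$
\left(\begin{array}{ccc|cccc}
\delta_{1,1,1} & \cdots & \delta_{7,1,1} & 1 & 0 & 0 & 0 \\
\delta_{1,1,2} & \cdots & \delta_{7,1,2} & 0 & 1 & 0 & 0 \\
\delta_{1,1,3} & \cdots & \delta_{7,1,3} & 0 & 0 & 0 & 0 \\
\delta_{1,1,4} & \cdots & \delta_{7,1,4} & 0 & 0 & 0 & 0 \\
\delta_{1,2,1} & \cdots & \delta_{7,2,1} & 0 & 0 & 0 & 0 \\
\delta_{1,2,2} & \cdots & \delta_{7,2,2} & 0 & 0 & 0 & 0 \\
\delta_{1,2,3} & \cdots & \delta_{7,2,3} & 1 & 0 & 0 & 0 \\
\delta_{1,2,4} & \cdots & \delta_{7,2,4} & 0 & 1 & 0 & 0 \\
\delta_{1,3,1} & \cdots & \delta_{7,3,1} & 0 & 0 & 1 & 0 \\
\delta_{1,3,2} & \cdots & \delta_{7,3,2} & 0 & 0 & 0 & 1 \\
\delta_{1,3,3} & \cdots & \delta_{7,3,3} & 0 & 0 & 0 & 0 \\
\delta_{1,3,4} & \cdots & \delta_{7,3,4} & 0 & 0 & 0 & 0 \\
\delta_{1,4,1} & \cdots & \delta_{7,4,1} & 0 & 0 & 0 & 0 \\
\delta_{1,4,2} & \cdots & \delta_{7,4,2} & 0 & 0 & 0 & 0 \\
\delta_{1,4,3} & \cdots & \delta_{7,4,3} & 0 & 0 & 1 & 0 \\
\delta_{1,4,4} & \cdots & \delta_{7,4,4} & 0 & 0 & 0 & 1
\end{array}\right)
$$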


Resolution

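We apply Gauss-Jordan elimination to the 16 × 11 matrix

If the form of the resulting matrix is:

$$
\begin{pmatrix} I_7 & \phi_{7 \times 4} \\ Z_{9 \times 7} & Z_{9 \times 4} \end{pmatrix}
$$

$I_7$: 7 × 7 identity matrix

$Z_{9 \times 7}$, $Z_{9 \times 4}$: 9 × 7 and 9 × 4 zero matrices

$\phi_{7 \times 4}$: 7 × 4 arbitrary matrix

then $\phi_{7 \times 4}$ are the coefficients of the linear combinations of $P_1 \ldots P_7$ that generate the $C_{ij}$’s of $A \cdot B$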
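The representation and the resolution step can be tried on Strassen’s algorithm itself, with its coefficients reduced modulo 2 (signs vanish over GF(2)). This is a scalar sketch under stated assumptions, not the vector engine of the slides: each of the 16 rows is an integer whose bits 0 to 6 are the product columns and bits 7 to 10 the $C_{11}$, $C_{12}$, $C_{21}$, $C_{22}$ columns, and all names are hypothetical.

```python
# Coefficients of Strassen's products over (A11, A12, A21, A22) and (B11, B12, B21, B22).
ALPHA = [(1,0,0,1), (0,0,1,1), (1,0,0,0), (0,0,0,1), (1,1,0,0), (1,0,1,0), (0,1,0,1)]
BETA  = [(1,0,0,1), (1,0,0,0), (0,1,0,1), (1,0,1,0), (0,0,0,1), (1,1,0,0), (0,0,1,1)]
# 0-based (A block, B block) pairs appearing in C11, C12, C21, C22.
TARGETS = [[(0,0), (1,2)], [(0,1), (1,3)], [(2,0), (3,2)], [(2,1), (3,3)]]

def build_matrix(alpha, beta):
    """Row 4*j+k describes A_j . B_k; bit i is delta_ijk, bit 7+t the target C_t."""
    rows = [0] * 16
    for i in range(7):
        for j in range(4):
            for k in range(4):
                if alpha[i][j] and beta[i][k]:
                    rows[4 * j + k] |= 1 << i
    for t, pairs in enumerate(TARGETS):
        for j, k in pairs:
            rows[4 * j + k] |= 1 << (7 + t)
    return rows

def gauss_jordan(rows, pivot_cols):
    """GJE over GF(2) on the first pivot_cols columns; None if a pivot is missing."""
    rows = list(rows)
    for c in range(pivot_cols):
        bit = 1 << c
        pivot = next((i for i in range(c, len(rows)) if rows[i] & bit), None)
        if pivot is None:
            return None
        rows[c], rows[pivot] = rows[pivot], rows[c]
        for i in range(len(rows)):
            if i != c and rows[i] & bit:
                rows[i] ^= rows[c]
    return rows

def solve(alpha, beta):
    """Return the 7 x 4 matrix phi if the candidate is correct, else None."""
    reduced = gauss_jordan(build_matrix(alpha, beta), 7)
    if reduced is None or any(reduced[7:]):   # the nine lower rows must be all zero
        return None
    return [[(reduced[i] >> (7 + t)) & 1 for t in range(4)] for i in range(7)]

phi = solve(ALPHA, BETA)
```

Each column of `phi` lists which products are added to form one $C_{ij}$; for instance the first column selects $P_1$, $P_4$, $P_5$ and $P_7$, matching $C_{11} = P_1 + P_4 - P_5 + P_7$ modulo 2.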


Search engines

Search engine

Iterates through $\binom{225}{7} \approx 5.2 \times 10^{12}$ candidate algorithms

Applies Gauss-Jordan elimination to the related 16 × 11 GF(2) matrices

Checks the form of the resulting matrix

Specializations

If a pivot is not found in any one of the first seven columns, the algorithm is discarded

Redundant $P_i$’s

Gauss-Jordan is applied just on the first seven columns

Avoid explicitly swapping rows

Filtering out algorithms that do not use all eight subproducts

$A_{11} \cdot B_{11}$, $A_{12} \cdot B_{21}$, $A_{11} \cdot B_{12}$, $A_{12} \cdot B_{22}$, $A_{21} \cdot B_{11}$, $A_{22} \cdot B_{21}$, $A_{21} \cdot B_{12}$, $A_{22} \cdot B_{22}$
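The last specialization can be read as a cheap necessary condition that is tested before any elimination: if one of the eight subproducts appears in none of the seven $P_i$’s, no linear combination can produce the corresponding $C_{ij}$. The sketch below is one plausible way to express that filter, assuming each product is described by two 4-bit masks (bit j set when block j takes part); the function name and the mask encoding are hypothetical.

```python
# 0-based (A block, B block) pairs of the eight subproducts listed above.
REQUIRED = [(0, 0), (1, 2), (0, 1), (1, 3), (2, 0), (3, 2), (2, 1), (3, 3)]
REQUIRED_MASK = sum(1 << (4 * j + k) for j, k in REQUIRED)

def uses_all_subproducts(alpha_masks, beta_masks):
    """True if every required A_j . B_k occurs in at least one product P_i."""
    covered = 0  # bit 4*j+k is set when some P_i contains A_j . B_k
    for a, b in zip(alpha_masks, beta_masks):
        for j in range(4):
            if (a >> j) & 1:
                covered |= b << (4 * j)
    return (covered & REQUIRED_MASK) == REQUIRED_MASK

# Strassen's products over GF(2); bit 0 = A11 or B11, ..., bit 3 = A22 or B22.
strassen_alpha = [0b1001, 0b1100, 0b0001, 0b1000, 0b0011, 0b0101, 0b1010]
strassen_beta  = [0b1001, 0b0001, 0b1010, 0b0101, 0b1000, 0b0011, 0b1100]
keeps_strassen = uses_all_subproducts(strassen_alpha, strassen_beta)
```

A candidate that fails this test can be skipped without building its 16 × 11 matrix.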


Experimental setup

Evaluated search engines

6 scenarios:

AVX2, SSE, Scalar SWAR, Scalar no-SWAR, M4RI library (GE), M4RI library (GJE)

2 search engines for each scenario

Generic: applies GJE/GE to all candidate algorithms

Specialized: implements the specializations

In M4RI scenarios, specialization includes only the filtering mechanism

Search engines differ in the implementation of Gauss-Jordan elimination

Column-major layout

Evaluation platform

Intel Xeon CPU E3-1220 v3 (3.1 GHz, 4-core, Haswell)

Implements AVX2 vector extension (256-bit)

Search space is dynamically distributed among the 4 cores


Performance results

Scenario                          Generic   Specialized
AVX2 (256-bit vectors)               3.74          1.00
SSE (128-bit vectors)                5.01          1.28
Scalar SWAR (64-bit vectors)         7.43          1.92
Scalar no-SWAR (pure scalar)        21.13          6.88
M4RI (library GE)                   11.30          5.96
M4RI (library GJE)                  17.22          9.18

Relative execution time of the search engines

With respect to the specialized AVX2 engine

The lower, the better


Performance results: vector vs. scalar


Speedup of the AVX2 implementation with respect to the other vector and pure scalar implementations

AVX2 vs. SSE: ≈ 1.3X

AVX2 vs. Scalar SWAR: ≈ 1.9X

AVX2 vs. Scalar no-SWAR: ≈ 5.7X-6.8X

Scalar implementations are not competitive

The larger the vector-register length, the better the performance
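These speedups are ratios of the relative execution times reported in the performance table. The sketch below recomputes them; the dictionary layout and the function name are hypothetical, and the numbers are copied from the table.

```python
# Relative execution times: (generic engine, specialized engine).
TIMES = {
    "AVX2": (3.74, 1.00),
    "SSE": (5.01, 1.28),
    "Scalar SWAR": (7.43, 1.92),
    "Scalar no-SWAR": (21.13, 6.88),
}
GENERIC, SPECIALIZED = 0, 1

def avx2_speedup(scenario, engine):
    """How many times faster the AVX2 engine is than the given scenario."""
    return TIMES[scenario][engine] / TIMES["AVX2"][engine]

for name in ("SSE", "Scalar SWAR", "Scalar no-SWAR"):
    print(name, round(avx2_speedup(name, GENERIC), 2),
          round(avx2_speedup(name, SPECIALIZED), 2))

# Effect of the specializations on the AVX2 engine itself.
specialization_gain = TIMES["AVX2"][GENERIC] / TIMES["AVX2"][SPECIALIZED]
```

Each scenario yields two ratios, one for the generic engines and one for the specialized ones, which is why the no-SWAR comparison is quoted as a range.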


Performance results: vector vs. algebra library


In our case study, library implementations are not competitive

They are optimized for larger matrices


Performance results: impact of code specializations


Impact of code specialization: almost 4X

Larger than doubling or even quadrupling the vector length

Impact breakdown:

Discarding algorithms if a pivot is not found: negligible

Applying elimination just on seven columns: 1.4X - 1.6X

Avoiding row interchange: 1.2X - 1.4X

Filtering out algorithms without eight subproducts: 1.9X


Case-study results

We have found 20 Strassen-like algorithms

Excluding permutations and symmetric versions

Detailed in our paper

Results are consistent with [Oh and Moon, 2010]

They found additional algorithms where coefficient 0.5 is involved



Conclusions

We have discussed a vector implementation of Gaussian Elimination

Our evaluation develops a case study

Requires solving more than $10^{12}$ GEs over 16 × 11 GF(2) matrices

Vector implementations clearly outperform scalar implementations

We point out the impact of code specialization for our case study

Speedup: almost 4X

Impact larger than doubling or quadrupling the vector-register length

Performance of the analyzed algebra libraries is not competitive

Libraries are optimized for larger matrices


Future work

Developing efficient implementations of Gaussian Elimination for larger matrices

