Upload
nguyenquynh
View
215
Download
1
Embed Size (px)
Citation preview
A Vector Implementation of Gaussian Eliminationover GF(2): Exploring the Design-Space of
Strassen’s Algorithm as a Case Study
Enric Morancho
Departament d’Arquitectura de ComputadorsUniversitat Politècnica de Catalunya, BarcelonaTech
Barcelona, [email protected]
23rd Euromicro International Conference onParallel, Distributed, and Network-Based Processing
Turku, Finland, March 4th − 6th, 2015
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 1 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 2 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 3 / 45
Introduction
Gaussian Elimination (GE) is one of the key algorithms in linearalgebraWe discuss a vector implementation of GE over GF(2)We apply this implementation to a case study:
Enumerating all matrix-multiply algorithms over GF(2) similar toStrassen’s algorithm
Strassen’s algorithm: first known sub-cubic matrix-multiply algorithm
The search engine relies on solving more than 1012 GaussianEliminations over GF(2) matrices
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 4 / 45
Outline
1 Introduction
2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 5 / 45
Outline
1 Introduction
2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 6 / 45
Gaussian Elimination (GE)
One of the key algorithms in linear algebraApplications: Solving LSE’s, inverting nonsingular matrices,...
GE transforms a matrix into a matrix in row (column) echelon formForward elimination
2 1 4 32 4 10 36 6 18 94 2 8 3
2 1 4 00 3 6 30 0 0 30 0 0 0
Gauss-Jordan Elimination (GJE) transforms a matrix into a matrixin reduced row (column) echelon form
Forward elimination + Back substitution2 1 4 32 4 10 36 6 18 94 2 8 3
2 1 4 00 3 6 30 0 0 30 0 0 0
1 0 1 00 1 2 00 0 0 10 0 0 0
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 7 / 45
Gaussian Elimination (GE)
GE iteratively applies three elementary transformations:Swapping rows (columns)Scaling rows (columns)Adding to a row (column) a scalar multiple of another row (column)
GE is defined over an algebraic fieldInfinite fields: Q, R, ...
Computer arithmetic over infinite fields introduces round-off errorsGE implementations use pivoting techniques to minimize them
Finite fields: the Galois Field of two elements (GF(2)), ...Computer arithmetic over finite fields is always exact
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 8 / 45
Outline
1 Introduction
2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 9 / 45
Gaussian Elimination over GF(2)
GF(2) is the Galois field of two elements (aka F2, binary field)GF (2) = {0,1}addition ≡ bitwise XOR
subtraction and addition are the same operation (+1 = -1)
multiplication ≡ bitwise ANDImplementation remarks
Gaussian Elimination can be specialized for GF(2)The only element different from 0 is 1
Gauss-Jordan Elimination can be easily merged into GEPivoting is unnecessary
Computer arithmetic over GF(2) is always exact
ApplicationsFactoring large integer numbers, cryptography, pattern matching,...
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 10 / 45
Outline
1 Introduction
2 BackgroundGaussian EliminationGaussian Elimination over GF(2)Vector extensions
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 11 / 45
Vector extensions
Most current processors implement vector instructionsx86 family supports several vector-instruction sets and lengths
MMX: 64-bit vector registersSSE: 128-bit vector registersAVX2: 256-bit vector registers
Scalar processors may emulate vector instructionsSWAR: SIMD within a registerInterprets general-purpose registers as a bitvectorsBitwise operations (AND, shifts,...) can be seen as SIMD operations
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 12 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 13 / 45
Preliminaries
We focus in row echelon formsVector instructions allow us to exploit the parallelism available inthe matrix transformationsWe represent GF(2) matrices as vector registers
Both row-major and column-major layoutsMain steps of GE
For each column:Finding the pivotRow swappingForward elimination and Back substitution
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 14 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 15 / 45
Finding the pivot element
For each column, GE must locate a pivot (an element 6= 0)First: column must be extracted
Depends on matrix layoutSecond: pivot must be located
Naïve solution: iterating bit by bitOptimized solution: using bit-scanning machine instructions
tzcnt (AVX2)
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 16 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 17 / 45
Row swapping
Depends on matrix layoutRow-major layout:
Vector instruction sets offer permute instructionsColumn-major layout:
Several vector instructions are required
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 18 / 45
Row swapping
Example of swapping rows 3 and 5 on a 6× 4 GF(2) matrix storedin column-major layout
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 19 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian EliminationFinding the pivot elementRow swappingForward elimination and Back substitution
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 20 / 45
Forward elimination and Back substitution
Add pivot row to the rows with non-zero entries in the pivot columnWe perform all additions at the same timeExample on a 4× 6 GF(2) matrix stored in row-major layout
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 21 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 22 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 23 / 45
Block-recursive matrix-multiply algorithms
A · B =
(A11 A12A21 A22
)·(
B11 B12B21 B22
)=
(C11 C12C21 C22
)= C
P1 = A11 · B11
P2 = A12 · B21
P3 = A11 · B12
P4 = A12 · B22
P5 = A21 · B11
P6 = A22 · B21
P7 = A21 · B12
P8 = A22 · B22
C11 = P1 + P2
C12 = P3 + P4
C21 = P5 + P6
C22 = P7 + P8
a) Conventional
n 3
P1 = (A11 + A22) · (B11 + B22)
P2 = (A21 + A22) · B11
P3 = A11 · (B12 − B22)
P4 = A22 · (−B11 + B21)
P5 = (A11 + A12) · B22
P6 = (−A11 + A21) · (B11 + B12)
P7 = (A12 − A22) · (B21 + B22)
C11 = P1 + P4 − P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 − P2 + P3 + P6
b) Strassen’s algorithm
n 2.807
P1 = A11 · B11
P2 = A12 · B21
P3 = A22 · (B11 − B12 − B21 + B22)
P4 = (A11 − A21) · (−B12 + B22)
P5 = (A21 + A22) · (−B11 + B12)
P6 = (A11 + A12 − A21 − A22) · B22
P7 = (A11 − A21 − A22) · (B11 − B12 + B22)
C11 = P1 + P2
C12 = P1 + P5 + P6 − P7
C21 = P1 − P3 + P4 − P7
C22 = P1 + P4 + P5 − P7
c) Winograd’s variant
n 2.807
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 24 / 45
Block-recursive matrix-multiply algorithms
A · B =
(A11 A12A21 A22
)·(
B11 B12B21 B22
)=
(C11 C12C21 C22
)= C
P1 = A11 · B11
P2 = A12 · B21
P3 = A11 · B12
P4 = A12 · B22
P5 = A21 · B11
P6 = A22 · B21
P7 = A21 · B12
P8 = A22 · B22
C11 = P1 + P2
C12 = P3 + P4
C21 = P5 + P6
C22 = P7 + P8
a) Conventional
n 3
P1 = (A11 + A22) · (B11 + B22)
P2 = (A21 + A22) · B11
P3 = A11 · (B12 − B22)
P4 = A22 · (−B11 + B21)
P5 = (A11 + A12) · B22
P6 = (−A11 + A21) · (B11 + B12)
P7 = (A12 − A22) · (B21 + B22)
C11 = P1 + P4 − P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 − P2 + P3 + P6
b) Strassen’s algorithm
n 2.807
P1 = A11 · B11
P2 = A12 · B21
P3 = A22 · (B11 − B12 − B21 + B22)
P4 = (A11 − A21) · (−B12 + B22)
P5 = (A21 + A22) · (−B11 + B12)
P6 = (A11 + A12 − A21 − A22) · B22
P7 = (A11 − A21 − A22) · (B11 − B12 + B22)
C11 = P1 + P2
C12 = P1 + P5 + P6 − P7
C21 = P1 − P3 + P4 − P7
C22 = P1 + P4 + P5 − P7
c) Winograd’s variant
n 2.807
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 25 / 45
Block-recursive matrix-multiply algorithms
A · B =
(A11 A12A21 A22
)·(
B11 B12B21 B22
)=
(C11 C12C21 C22
)= C
P1 = A11 · B11
P2 = A12 · B21
P3 = A11 · B12
P4 = A12 · B22
P5 = A21 · B11
P6 = A22 · B21
P7 = A21 · B12
P8 = A22 · B22
C11 = P1 + P2
C12 = P3 + P4
C21 = P5 + P6
C22 = P7 + P8
a) Conventional
n 3
P1 = (A11 + A22) · (B11 + B22)
P2 = (A21 + A22) · B11
P3 = A11 · (B12 − B22)
P4 = A22 · (−B11 + B21)
P5 = (A11 + A12) · B22
P6 = (−A11 + A21) · (B11 + B12)
P7 = (A12 − A22) · (B21 + B22)
C11 = P1 + P4 − P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 − P2 + P3 + P6
b) Strassen’s algorithm
n 2.807
P1 = A11 · B11
P2 = A12 · B21
P3 = A22 · (B11 − B12 − B21 + B22)
P4 = (A11 − A21) · (−B12 + B22)
P5 = (A21 + A22) · (−B11 + B12)
P6 = (A11 + A12 − A21 − A22) · B22
P7 = (A11 − A21 − A22) · (B11 − B12 + B22)
C11 = P1 + P2
C12 = P1 + P5 + P6 − P7
C21 = P1 − P3 + P4 − P7
C22 = P1 + P4 + P5 − P7
c) Winograd’s variant
n 2.807
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 26 / 45
Block-recursive matrix-multiply algorithms
Several matrix-multiply algorithms with sub-cubic complexityn2.81 [Strassen, 1969]n2.79 [Pan, 1979]n2.78 [Bini, 1979]n2.55 [Schönhage, 1981]n2.373 [Coppersmith and Winograd, 1987]n2.37286 [LeGall, 2014]
But Strassen’s algorithm is optimal for 2× 2 matricesThere are other algorithms similar to Strassen’s?
Using a genetic algorithm, [Oh and Moon, 2010] searched forStrassen-like algorithms over R
Their partial search discovered 608 Strassen-like algorithms
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 27 / 45
Case Study
Enumerating all the Strassen-like algorithms for 2× 2 GF(2)matrices
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 28 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 29 / 45
Formulation
Strassen-like algorithms make 7 recursive callsEach call computes a bilinear form
Pi =(αi1A11 + αi2A12 + αi3A21 + αi4A22)
·(βi1B11 + βi2B12 + βi3B21 + βi4B22) αij , βij ∈ GF (2)
There are (24 − 1)2 = 225 possible Pi ’s and(225
7
)≈ 5.27× 1012
candidate algorithms
An algorithm is correct (computes A · B) if there existsimultaneously next four linear combinations:
C11 :∑7
k=1 φk1Pk = A11 · B11 + A12 · B21
C12 :∑7
k=1 φk2Pk = A11 · B12 + A12 · B22
C21 :∑7
k=1 φk3Pk = A21 · B11 + A22 · B21
C22 :∑7
k=1 φk4Pk = A21 · B12 + A22 · B22
φij ∈ GF (2)
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 30 / 45
Representation
We represent each candidate algorithm as a 16× 11 GF(2) matrix
δijk = 1⇔ αij = βik = 1
δ1,1,1 ... δ7,1,1 1 0 0 0δ1,1,2 ... δ7,1,2 0 1 0 0δ1,1,3 ... δ7,1,3 0 0 0 0δ1,1,4 ... δ7,1,4 0 0 0 0δ1,2,1 ... δ7,2,1 0 0 0 0δ1,2,2 ... δ7,2,2 0 0 0 0δ1,2,3 ... δ7,2,3 1 0 0 0δ1,2,4 ... δ7,2,4 0 1 0 0δ1,3,1 ... δ7,3,1 0 0 1 0δ1,3,2 ... δ7,3,2 0 0 0 1δ1,3,3 ... δ7,3,3 0 0 0 0δ1,3,4 ... δ7,3,4 0 0 0 0δ1,4,1 ... δ7,4,1 0 0 0 0δ1,4,2 ... δ7,4,2 0 0 0 0δ1,4,3 ... δ7,4,3 0 0 1 0δ1,4,4 ... δ7,4,4 0 0 0 1
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 31 / 45
Resolution
We apply Gauss-Jordan elimination to the 16× 11 matrixIf the form of the resulting matrix is:
(I7 φ7×4Z9×7 Z9×4
) I7 7× 7 identity matrixZ9×7, Z9×4 9× 7, 9× 4 zero matricesφ7×4 7× 4 arbitrary matrix
then φ7×4 are the coefficients of the linear combinations of P1...P7that generate the Cij ’s of A · B
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 32 / 45
Search engines
Search engineIterates through
(2257
)≈ 5.2× 1012 candidate algorithms
Applies Gauss-Jordan elimination to the related 16× 11 GF(2)matricesChecks the form of the resulting matrix
SpecializationsIf a pivot is not found in anyone of the first seven columns, thealgorithm is discarded
Redundant Pi ’s
Gauss-Jordan is applied just on the first seven columnsAvoid explicitly swapping rowsFiltering-out algorithms that do not use all eight subproducts
A11 ·B11,A12 ·B21,A11 ·B12,A12 ·B22,A21 ·B11,A22 ·B21,A21 ·B12,A22 ·B22
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 33 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 34 / 45
Experimental setup
Evaluated search engines6 scenarios:
AVX2, SSE, Scalar SWAR, Scalar no-SWAR, M4RI library (GE),M4RI library (GJE)
2 search engines for each scenarioGeneric: applies GJE/GE to all candidate algorithmsSpecialized: implement specializationsIn M4RI scenarios, specialization includes only the filteringmechanism
Search engines differ on the implementation of Gauss-Jordanelimination
Column-major layoutEvaluation platform
Intel Xeon CPU E3-1220 v3 (3.1 GHz, 4-core, Haswell)Implements AVX2 vector extension (256-bit)
Search space is dynamically distributed among the 4 cores
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 35 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case studyPreliminariesAdaptation to the case studyExperimental setupResults
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 36 / 45
Performance results
Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18
Relative execution time of the search enginesWith respect to specialized AVX2 engineThe lower, the better
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 37 / 45
Performance results: vector vs. scalar
Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18
Speedup of AVX2 impl. wrt. vector and pure scalar impl.AVX2 vs. SSE: ≈ 1.3XAVX2 vs. Scalar SWAR: ≈ 1.9XAVX2 vs. Scalar no-SWAR: ≈ 5.7X-6.8X
Scalar implementations are not competitiveThe larger the vector-register length, the better the performance
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 38 / 45
Performance results: vector vs. algebra library
Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18
In our case study, library implementations are not competitiveThey are optimized for larger matrices
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 39 / 45
Performance results: impact of code specializations
Generic Special.AVX2 (256-bit vectors) 3.74 1.00SSE (128-bit vectors) 5.01 1.28Scalar SWAR (64-bit vectors) 7.43 1.92Scalar no-SWAR (pure scalar) 21.13 6.88M4RI (library GE) 11.30 5.96M4RI (library GJE) 17.22 9.18
Impact of code specialization: almost 4XLarger than doubling or even quadruplicating vector lengthImpact break down:
Discarding algorithms if a pivot is not found: negligibleApplying elimination just on seven columns: 1.4X - 1.6XAvoiding row interchange: 1.2X - 1.4XFiltering-out algorithms without eight subproducts: 1.9X
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 40 / 45
Case-study results
We have found 20 Strassen-like algorithmsExcluding permutations and symmetric versionsDetailed in our paper
Results coherent with [Oh and Moon, 2010]They found additional algorithms where coefficient 0.5 is involved
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 41 / 45
Outline
1 Introduction
2 Background
3 Vector implementation of Gaussian Elimination
4 Case study
5 Conclusions and Future work
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 42 / 45
Conclusions
We have discussed a vector implementation of GaussianEliminationOur evaluation develops a case study
Requires solving more than 1012 GE’s over 16× 11 GF(2) matricesVector implementations clearly outperform scalar implementationWe point out the impact of code specialization for our case study
Speedup: almost 4XImpact larger than doubling or quadruplicating vector-register length
Performance of analyzed algebra libraries is not competitiveLibraries are optimized for larger matrices
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 43 / 45
Future work
Developing efficient implementations of Gaussian Elimination forlarger matrices
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 44 / 45
A Vector Implementation of Gaussian Eliminationover GF(2): Exploring the Design-Space of
Strassen’s Algorithm as a Case Study
Enric Morancho
Departament d’Arquitectura de ComputadorsUniversitat Politècnica de Catalunya, BarcelonaTech
Barcelona, [email protected]
23rd Euromicro International Conference onParallel, Distributed, and Network-Based Processing
Turku, Finland, March 4th − 6th, 2015
Enric Morancho (UPC) A Vector Implementation of GE EuroPDP-2015 45 / 45