High-Performance Tensor Contraction without BLAS
Devin A. Matthews
Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin
Tensor computations – in particular tensor contraction (TC) – are important kernels in many scientific computing applications (SCAs). Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the much more flexible BLIS framework, which allows reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation also supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in SCAs.
References
• Goto approach: K. Goto and R. A. van de Geijn, ACM Trans. Math. Softw. 34, 3, Article 12 (2008).
• BLIS: github.com/flame/blis; shpc.ices.utexas.edu/publications.html
• TBLIS: github.com/devinamatthews/tblis; D. Matthews, arXiv:1607.00291.
• GETT (independent "native" TC algorithm): P. Springer and P. Bientinesi, arXiv:1607.00145.
[Figure: the five loops of the Goto/BLIS matrix multiplication algorithm. The 5th through 1st loops around the microkernel partition A, B, and C using the blocksizes nC, kC, mC, nR, and mR; the blocks Ai and Bp are packed into Ai~ and Bp~, and data moves through main memory, the L3, L2, and L1 caches, and registers down to the mR x nR microkernel update.]
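The loop structure in the figure can be sketched in plain Python. This is an illustrative sketch only: the blocksizes are made up rather than tuned, and the packing of Ai and Bp into contiguous buffers (a key part of the real algorithm) is omitted for brevity.

```python
# Illustrative blocksizes (real values are chosen per architecture).
NC, KC, MC, NR, MR = 4, 3, 4, 2, 2

def gemm_blocked(A, B, C, m, n, k):
    for jc in range(0, n, NC):                     # 5th loop: NC-wide column panels
        for pc in range(0, k, KC):                 # 4th loop: KC-deep panels (pack Bp here)
            for ic in range(0, m, MC):             # 3rd loop: MC-tall blocks (pack Ai here)
                for jr in range(jc, min(jc + NC, n), NR):      # 2nd loop around microkernel
                    for ir in range(ic, min(ic + MC, m), MR):  # 1st loop around microkernel
                        # "Microkernel": an MR x NR rank-KC update of C.
                        for i in range(ir, min(ir + MR, m)):
                            for j in range(jr, min(jr + NR, n)):
                                C[i][j] += sum(A[i][p] * B[p][j]
                                               for p in range(pc, min(pc + KC, k)))
    return C

m, n, k = 5, 7, 6
A = [[i + p for p in range(k)] for i in range(m)]
B = [[p * j + 1 for j in range(n)] for p in range(k)]
C = gemm_blocked(A, B, [[0] * n for _ in range(m)], m, n, k)
```

Each (i, j, p) triple is visited exactly once across the partitioning loops, so the result matches an unblocked triple loop; the point of the blocking is purely to control which data is resident in each cache level.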
[Figure: the TBLIS adaptation of the BLIS loop structure. The tensor physical layout (C += A · B over tensor blocks) is treated as a matrix-like logical layout during partitioning; the tensor-aware packing steps Pack "Ai" → Ai~ and Pack "Bp" → Bp~ produce the same packed internal layout consumed by the unmodified microkernel, using the same blocksizes (mC, kC, nC, mR, nR) and cache hierarchy.]
[Figure: a tensor-to-matrix packing kernel and a (possibly) scattered update, illustrated for the tensor T_abcde of size 6x3x2x3x4 viewed as the 18x24 matrix M_(cdb)(ae).]

In the tensor, the "a" dimension is stride-1, and the other dimensions have increasing strides (6, 18, 36, 108). In the matrix view, the "cdb" dimension has the stride of "c" (6x3 = 18) most of the time; blocking lets us use this constant stride when possible, and a scattered algorithm otherwise. The "ae" dimension similarly is stride-1 most of the time (i.e. M is "row-major"). The vectors rscatM and cscatM store the offset of each row and column position, while rbsM and cbsM store the stride of each block, or zero for irregular blocks.
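The scatter vectors for this example can be reproduced in a few lines. This is a minimal sketch, not TBLIS code: the `scat` and `block_strides` helpers and the register blocksize of 6 are assumptions for illustration.

```python
# Scatter vectors for T_abcde with sizes (6, 3, 2, 3, 4) and strides
# (1, 6, 18, 36, 108), viewed as the 18x24 matrix M_(cdb)(ae).
size   = dict(a=6, b=3, c=2, d=3, e=4)
stride = dict(a=1, b=6, c=18, d=36, e=108)

def scat(dims):
    # Offset of every position along a group of dimensions;
    # the first dimension listed varies fastest.
    offs = [0]
    for d in dims:
        offs = [o + i * stride[d] for i in range(size[d]) for o in offs]
    return offs

rscat = scat(['c', 'd', 'b'])   # 18 row offsets for the "cdb" dimension
cscat = scat(['a', 'e'])        # 24 column offsets for the "ae" dimension

def block_strides(offs, blk):
    # Stride of each register-sized block, or 0 for irregular blocks.
    out = []
    for i in range(0, len(offs), blk):
        diffs = {offs[j + 1] - offs[j]
                 for j in range(i, min(i + blk, len(offs)) - 1)}
        out.append(diffs.pop() if len(diffs) == 1 else 0)
    return out

rbs = block_strides(rscat, 6)   # every row block has constant stride 18
cbs = block_strides(cscat, 6)   # every "a" run is unit-stride
```

With these vectors, the packing kernel can read a regular block with a single constant stride (fast path) and fall back to per-element offsets only for irregular blocks.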
Both current approaches are illustrated for the example contraction t_aeim −= Σ_kc W_akic · V_mkec:

"TTDT" (BLAS): tensor transposes reshape the operands into matrices so that the contraction becomes a single matrix multiplication A = B · C, with B = W_(ai)(ck) and C = V_(ck)(em); the product A_(ai)(em) is then transposed back into t_aeim.
• Large memory overhead.
• Transpose does not parallelize well.
• Difficult to optimize out transposes.

"LoG" (Loop-over-GEMM): explicit loops over some indices with a small GEMM inside, e.g. for i, for e, for c: [t_am]_ei −= [W_ak]_ic · [V_mk]_ec, ∀ i, e, c.
• Low data reuse.
• Poor cache behavior.
• Inner kernel is not always GEMM (may be level 1 or 2 BLAS).
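As a concrete illustration, the TTDT path for this example can be sketched in plain Python. The sizes and data are made up, and the flattening orders (ai)(ck) and (ck)(em) follow the matrices named above; the accumulation sign is omitted for simplicity.

```python
from itertools import product

# Small, made-up dimension sizes for t_aeim = sum_{kc} W_akic * V_mkec.
Da, De, Di, Dm, Dk, Dc = 2, 3, 2, 2, 3, 2

W = {idx: sum(idx) % 5 + 1 for idx in product(range(Da), range(Dk), range(Di), range(Dc))}
V = {idx: sum(idx) % 7 + 1 for idx in product(range(Dm), range(Dk), range(De), range(Dc))}

# Direct ("native") evaluation, for reference.
t_direct = {(a, e, i, m): sum(W[a, k, i, c] * V[m, k, e, c]
                              for k in range(Dk) for c in range(Dc))
            for a, e, i, m in product(range(Da), range(De), range(Di), range(Dm))}

# Step 1: transpose (reshape) W -> B_(ai)(ck) and V -> C_(ck)(em).
B = [[W[a, k, i, c] for c in range(Dc) for k in range(Dk)]
     for a in range(Da) for i in range(Di)]
C = [[V[m, k, e, c] for e in range(De) for m in range(Dm)]
     for c in range(Dc) for k in range(Dk)]

# Step 2: one ordinary matrix multiplication A = B * C.
A = [[sum(B[r][x] * C[x][s] for x in range(Dc * Dk)) for s in range(De * Dm)]
     for r in range(Da * Di)]

# Step 3: transpose the result back into t_aeim.
t_ttdt = {(a, e, i, m): A[a * Di + i][e * Dm + m]
          for a, e, i, m in product(range(Da), range(De), range(Di), range(Dm))}
```

Steps 1 and 3 are exactly the explicit transposes whose memory traffic and workspace TBLIS eliminates.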
"Native" (TBLIS): the contraction t_aeim −= W_akic · V_mkec is performed directly, with no explicit transposition.
• Performance nearly identical to GEMM.
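The key idea behind the native approach can be sketched in a few lines of Python: given a tensor in flat storage and its scatter vectors, a matrix view is available with no copy or reshape. The tensor, its sizes, and its strides here are hypothetical, chosen only to keep the example small.

```python
# A tensor T_abc of size 2x3x2 in a flat buffer, with strides (1, 2, 6).
shape  = dict(a=2, b=3, c=2)
stride = dict(a=1, b=2, c=6)
buf = [float(x) for x in range(12)]

def scat(dims):
    # Offset of every position along a group of dimensions;
    # the first dimension listed varies fastest.
    offs = [0]
    for d in dims:
        offs = [o + i * stride[d] for i in range(shape[d]) for o in offs]
    return offs

rscat = scat(['a'])        # matrix rows    <- "a"  (2 of them)
cscat = scat(['c', 'b'])   # matrix columns <- "cb" (6 of them)

# M[r][s] is T viewed as the 2x6 matrix M_(a)(cb), read straight from buf;
# a real kernel would consume these offsets in place of a fixed stride.
M = [[buf[r + s] for s in cscat] for r in rscat]
```

Because the row and column index groups touch disjoint dimensions, the element offset is just the sum of a row offset and a column offset, which is what lets the packing and update kernels work directly on the tensor.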
Tensor Contraction

Tensor contraction:
C_ijk = Σ_mn A_kmni · B_njm
is essentially a generalization of matrix multiplication:
C_ij = Σ_k A_ki · B_kj
for n-dimensional arrays (tensors). The most common current methods for computing tensor contractions are:
• Transpose-transpose-DGEMM-transpose (TTDT), which reshapes (transposes) the tensors into matrices.
• Loop-over-GEMM (LoG), which uses explicit loops over a small GEMM operation.
A high-performance "native" algorithm is ideal.
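For concreteness, the contraction above can be written as nested loops. This pure-Python sketch uses made-up sizes and deterministic sample data; it is a reference definition, not an efficient implementation.

```python
from itertools import product

# Naive evaluation of C_ijk = sum_{mn} A_kmni * B_njm (sizes are illustrative).
I, J, K, M, N = 2, 3, 4, 2, 3

A = [[[[k + 2 * m + 3 * n + i for i in range(I)] for n in range(N)]
      for m in range(M)] for k in range(K)]
B = [[[n - j + m for m in range(M)] for j in range(J)] for n in range(N)]

C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
for i, j, k, m, n in product(range(I), range(J), range(K), range(M), range(N)):
    C[i][j][k] += A[k][m][n][i] * B[n][j][m]
```

Grouping the free indices of each operand ((i, k) and (j) here, with (m, n) contracted) turns these loops into exactly the matrix multiplication form, which is what both TTDT and the native approach exploit.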
Leveraging the BLIS Framework

BLIS extends the high-performance Goto approach to matrix multiplication, as used in GotoBLAS and OpenBLAS, by adding two partitioning loops in plain C and requiring only a small microkernel to be highly optimized in assembly.

TBLIS uses this five-loop framework to perform tensor contraction by logically treating the tensor as a matrix during partitioning, requiring extra tensor-aware steps only during packing of the input tensors and updating of the output tensor. The high-performance microkernel and cache blocking parameters are used unchanged, and the overall overhead is small.

Threading is simple to implement and can be added hierarchically to four of the five loops, leading to high scalability.

Tensors as Matrices: The Block-Scatter Matrix

BLIS partitions the input matrices into successively smaller blocks, eventually reaching the small, fixed-size microkernel, such as 4x4, 8x4, or 6x8. These same "register blocksizes" are used during packing.

TBLIS uses this fact to encode tensor blocks as regular matrix blocks whenever possible, so that no overhead is incurred. Irregular blocks, which do not have a constant row or column stride, are handled specially by storing the offset of each row and column individually.

The tensor dimensions can also be reordered to ensure unit-stride access in at least two of the tensors (see the next panel for the effects of non-unit-stride access).

Results and Conclusions

Random "square" tensor contractions with TTDT and TBLIS, compared to matrix multiplication: even for "square" cases where overhead is minimized, TTDT incurs as much as a 50% penalty, while TBLIS consistently stays within 5-10%. TBLIS is also highly scalable. Since TBLIS uses the BLIS microkernels and blocking parameters, performance improvements in BLIS and ports to new architectures automatically carry over to TBLIS.

Speedup of TBLIS over TTDT for the GETT benchmark (arXiv:1607.00145), which spans memory-bound contractions (left) to compute-bound contractions (right): TTDT is efficient for compute-bound problems because transposition is only O(n²), but the TBLIS speedup increases sharply for more memory-bound problems. The left-most contractions have non-unit-stride access during packing, which hinders TBLIS. This overhead can be overcome by using a 3D packing kernel (future work).
[Figure: performance results on a Xeon E5-2690 v3, 1 core and 12 cores. Top: GFLOPS/s vs. approximate M = N = K for MKL, BLIS, TBLIS, and TTDT. Bottom: per-core speedup of TBLIS over TTDT for each contraction in the GETT benchmark.]
[Figure: the scatter and block-scatter vectors for the M_(cdb)(ae) example. cscatM = 0, 1, ..., 5, 108, ..., 113, 216, ..., 221, 324, ..., 329; rscatM = 0, 18, 36, 54, 72, 90, 6, 24, 42, 60, 78, 96, 12, 30, 48, 66, 84, 102; rbsM and cbsM hold the constant stride of each regular block (e.g. 18 and 1) and zero for irregular blocks.]