A CUDA Library for High-Performance Tensor Primitives
CUTENSOR
Paul Springer, November 20th 2019
CONTRIBUTORS
Chenhan Yu
Markus Höhnerbach
Miguel Ferrer Avila
Andrew Kerr
Manish Gupta
Timothy Costa
Alex Fit-Florea
WHAT IS A TENSOR?
• mode-0: scalar 𝛼
• mode-1: vector 𝐴𝑖
• mode-2: matrix 𝐴𝑖,𝑗
• mode-n: general tensor 𝐴𝑖,𝑗,𝑘,𝑙,𝑚,…
BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)
A Success Story
• 1969 – BLAS Level 1: Vector-Vector (𝑦 = 𝛼𝑥 + 𝑦)
• 1972 – BLAS Level 2: Matrix-Vector (𝑦 = 𝐴 ∗ 𝑥)
• 1980 – BLAS Level 3: Matrix-Matrix (𝐶 = 𝐴 ∗ 𝐵)
• Now? – BLAS Level 4: Tensor-Tensor (𝐷 = 𝐴 ∗ 𝐵)
CUTENSOR
A High-Performance CUDA Library for Tensor Primitives
• Tensor contractions (generalization of matrix-matrix multiplication)
• Tensor reductions
• Element-wise operations (e.g., permutations, additions)
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶 // contraction
𝐷 = ∑(𝐴) + 𝐶 // reduction
𝐷 = 𝐴 + 𝐵 + 𝐶 // element-wise
CUTENSOR
Key Features
• Tensor-Core support
• Built on CUTLASS 2.0
• Flexible and modern interface
• No mallocs inside the lib
• Tensor Cores active by default
• Transpose-free contractions
• Arbitrary data layouts
• Extensive mixed-precision support
• Complex-times-Real
• FP64 data + FP32 compute
• FP32 data + FP16 compute
TENSORS ARE UBIQUITOUS
Potential Use Cases: HPC & AI
TAL-SH: https://github.com/DmitryLyakh/TAL_SH
ITensor: http://itensor.org & https://github.com/kshyatt/ITensors.jl
Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU
TAL-SH
Scikit-CUDA
Pyro
Tensor Contractions
TENSOR CONTRACTIONS
Examples
• Einstein notation (einsum)
• Modes that appear in both A and B (but not in D) are contracted
• Examples
• 𝐷𝑚,𝑛 = α ∑𝑘 𝐴𝑚,𝑘 ∗ 𝐵𝑘,𝑛 // GEMM
• 𝐷𝑚1,𝑛,𝑚2 = α 𝐴𝑚1,𝑘,𝑚2 ∗ 𝐵𝑘,𝑛 // tensor contraction
• 𝐷𝑚1,𝑛1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1 // tensor contraction
• 𝐷𝑚1,𝑛1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘1,𝑚2,𝑘2 ∗ 𝐵𝑘1,𝑘2,𝑛2,𝑛1 // multi-mode tensor contraction
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶
TF: https://www.tensorflow.org/api_docs/python/tf/einsum
PyTorch: https://pytorch.org/docs/stable/torch.html#torch.einsum
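The Einstein-notation examples above map directly onto einsum; a small NumPy sketch of the same semantics (shapes, mode letters, and names here are illustrative only, not part of cuTENSOR):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0

# GEMM: D_{m,n} = alpha * sum_k A_{m,k} * B_{k,n}
A = rng.random((4, 6))
B = rng.random((6, 5))
D = alpha * np.einsum('mk,kn->mn', A, B)
assert np.allclose(D, alpha * (A @ B))  # identical to a plain matrix product

# Multi-mode contraction: D_{m1,n1,n2,m2} = alpha * A_{m1,k1,m2,k2} * B_{k1,k2,n2,n1}
m1, m2, n1, n2, k1, k2 = 2, 3, 4, 5, 6, 7
A4 = rng.random((m1, k1, m2, k2))
B4 = rng.random((k1, k2, n2, n1))
# a=m1, c=m2, x=n1, y=n2; k and l are the contracted modes k1 and k2
D4 = alpha * np.einsum('akcl,klyx->axyc', A4, B4)
assert D4.shape == (m1, n1, n2, m2)
```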
TENSOR CONTRACTIONS
Examples (cont.)
• 𝐷𝑚,𝑛 = α 𝐴𝑚 ∗ 𝐵𝑛 // outer product
• 𝐷𝑚1,𝑛,𝑚2 = α 𝐴𝑚1,𝑚2 ∗ 𝐵𝑛 // outer product
• 𝐷𝑚1,𝑚2 = α 𝐴𝑚1,𝑘1,𝑚2,𝑘2 ∗ 𝐵𝑘1,𝑘2 // GEMV-like tensor contraction
• 𝐷𝑚1,𝑛1,𝑙1 = α 𝐴𝑚1,𝑘,𝑙1 ∗ 𝐵𝑘,𝑛1,𝑙1 // batched GEMM
• 𝐷𝑚1,𝑛1,𝑙1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘,𝑙1,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1,𝑙1 // single-mode batched tensor contraction
• 𝐷𝑚1,𝑛1,𝑙1,𝑛2,𝑚2,𝑙2 = α 𝐴𝑚1,𝑘,𝑙2,𝑙1,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1,𝑙1,𝑙2 // multi-mode batched tensor contraction
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶
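A batched mode is one that appears in A, B, and D and is never summed; a minimal NumPy sketch of the batched-GEMM example above (extents are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, l = 2, 3, 4, 5   # l is the batched mode
alpha = 1.5
A = rng.random((m, k, l))
B = rng.random((k, n, l))

# Batched GEMM: D_{m,n,l} = alpha * A_{m,k,l} * B_{k,n,l}
D = alpha * np.einsum('mkl,knl->mnl', A, B)

# Equivalent to one independent GEMM per batch index l
ref = np.stack([alpha * (A[:, :, i] @ B[:, :, i]) for i in range(l)], axis=-1)
assert np.allclose(D, ref)
```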
TENSOR CONTRACTIONS
Key Challenges
• Keep the fast FPUs busy
• Reuse data in shared memory & registers as much as possible
• Coalesced accesses to/from global memory
CUTLASS 2.0
TENSOR CONTRACTIONS
Key Challenges
• Loading a scalar 𝛼 ✅
• Loading a vector 𝐴𝑖 ✅
• Loading a matrix 𝐴𝑖,𝑗 (✅)
• Loading a general tensor 𝐴𝑖,𝑗,𝑘 ((✅))
TENSOR CONTRACTIONS
Technical Insight
𝐷 = 𝐴 ∗ 𝐵
• Sub-tensors 𝒜 and ℬ are loaded to shared memory (SMEM)
• The contraction is computed as 𝒟 = GEMM-like(𝒜, ℬ)
• The result 𝒟 is stored back to global memory
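Conceptually, the GEMM-like step folds each tensor's free modes into one matrix dimension and its contracted modes into the other. A NumPy sketch of that mapping follows; note that the explicit transpose/reshape here is exactly what cuTENSOR's transpose-free approach avoids materializing, so this is only a conceptual model:

```python
import numpy as np

# D_{m1,n,m2} = A_{m1,k,m2} * B_{k,n}
m1, k, m2, n = 2, 3, 4, 5
rng = np.random.default_rng(2)
A = rng.random((m1, k, m2))
B = rng.random((k, n))

# Fold A's free modes (m1, m2) into matrix rows, contracted mode k into columns
A_mat = A.transpose(0, 2, 1).reshape(m1 * m2, k)  # explicit copy; cuTENSOR skips this
D_mat = A_mat @ B                                  # a plain GEMM
D = D_mat.reshape(m1, m2, n).transpose(0, 2, 1)    # unfold back to modes (m1, n, m2)

assert np.allclose(D, np.einsum('akb,kn->anb', A, B))
```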
TENSOR CONTRACTIONS
Performance Guidelines
• Arrange modes similarly in all tensors
• E.g., 𝐶𝑎,𝑏,𝑐 = 𝐴𝑎,𝑘,𝑐𝐵𝑘,𝑏 is preferable to 𝐶𝑎,𝑏,𝑐 = 𝐴𝑐,𝑘,𝑎𝐵𝑘,𝑏
• Keep the extent of the fastest-varying mode as large as possible
• E.g., 𝐶𝑎,𝑏,𝑐 ∈ 𝑅1000×100×10 is preferable to 𝐶𝑐,𝑏,𝑎 ∈ 𝑅10×100×1000
• Keep batched modes as the slowest-varying modes (i.e., with the largest strides)
• E.g., 𝐶𝑎,𝑏,𝑐,𝑙 = 𝐴𝑎,𝑘,𝑐,𝑙𝐵𝑘,𝑏,𝑙 is preferable to 𝐶𝑎,𝑏,𝑐,𝑙 = 𝐴𝑎,𝑘,𝑙,𝑐𝐵𝑙,𝑘,𝑏
*assuming generalized column-major layout
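In a generalized column-major layout, the stride of a mode is the product of the extents of all faster-varying modes before it; a quick illustrative helper (not part of the cuTENSOR API) shows why a large fastest-varying extent and slow batched modes matter:

```python
def col_major_strides(extents):
    """Stride of mode i = product of extents of all earlier (faster-varying) modes."""
    strides, s = [], 1
    for e in extents:
        strides.append(s)
        s *= e
    return strides

# C_{a,b,c} with extents (1000, 100, 10): mode 'a' has stride 1, so consecutive
# elements along the large mode are contiguous, enabling coalesced accesses.
assert col_major_strides([1000, 100, 10]) == [1, 1000, 100000]
# The reversed layout C_{c,b,a} puts stride 1 on the small mode c instead.
assert col_major_strides([10, 100, 1000]) == [1, 10, 1000]
```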
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity; ~7x speedup]
Random tensor contractions:
• 3D to 6D tensors
• FP64
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity]
Random tensor contractions:
• 3D to 6D tensors
• FP64 (data) & FP32 (compute)
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity]
Random tensor contractions:
• 3D to 6D tensors
• FP32 (data) + Tensor Cores
• Accumulation in FP32
• Inputs truncated to FP16
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
ITENSOR – PEPS SIMULATION
Credits:
• Katharine Hyatt (Flatiron Institute) – https://github.com/itensor/ITensorsGPU.jl
• Miles Stoudenmire and Matt Fishman (Flatiron Institute) – https://github.com/ITensor/ITensors.jl
• Tim Besard (Ghent University) – https://github.com/JuliaGPU
*results are based on cuTENSOR’s early-access version
𝑆𝑧𝑆𝑧 + 𝑆+𝑆− + 𝑆−𝑆+ + …
Speedups: up to 23.6x
Tensor Reductions
TENSOR REDUCTIONS
Examples
• 𝐷𝑤,ℎ,𝑛 = α Reduce( 𝐴𝑐,𝑤,ℎ,𝑛 , ADD) // Reduction over mode c
• 𝐷𝑤,𝑛 = α Reduce( 𝐴𝑐,𝑤,ℎ,𝑛, ADD) // Reduction over modes c and h
• 𝐷𝑤,𝑛 = α Reduce( 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛), ADD) // RELU + reduction over modes c and h
• 𝐷𝑤,𝑛 = α Reduce( 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛), MAX) // RELU + max-reduction over modes c and h
𝐷 = ∑(𝐴) + 𝐶
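The reductions above are sums or maxes over the reduced modes, optionally after a unary op; an illustrative NumPy equivalent (extents are arbitrary):

```python
import numpy as np

c, w, h, n = 2, 3, 4, 5
alpha = 0.5
rng = np.random.default_rng(3)
A = rng.random((c, w, h, n)) - 0.5  # mix of positive and negative entries

# D_{w,h,n} = alpha * Reduce(A_{c,w,h,n}, ADD): sum over mode c (axis 0)
D1 = alpha * A.sum(axis=0)
assert D1.shape == (w, h, n)

# D_{w,n} = alpha * Reduce(RELU(A), MAX): max-reduction over modes c and h
relu = np.maximum(A, 0.0)
D2 = alpha * relu.max(axis=(0, 2))
assert D2.shape == (w, n)
```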
Element-wise Operations
[Diagram: inputs A, B, and C each pass through a data-type conversion, a unary op, and a scale; A and B are combined via a binary op, the result is combined with C via a second binary op, converted to D's data type, and stored to D]
ELEMENT-WISE TENSOR OPERATIONS
Examples
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛 + β𝐵𝑐,𝑤,ℎ,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = min( α 𝐴𝑐,𝑤,ℎ,𝑛 , β 𝐵𝑐,𝑤,ℎ,𝑛)
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛 + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛) + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = 𝐹𝑃32( α 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛) + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛)
𝐷 = α𝐴 + β𝐵 + 𝛾𝐶
Enables users to fuse multiple element-wise calls.
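Each operand may carry its own mode order, so a permutation is folded into the element-wise expression; an illustrative NumPy sketch of one of the examples above:

```python
import numpy as np

w, h, c, n = 2, 3, 4, 5
alpha, beta, gamma = 1.0, 2.0, 0.5
rng = np.random.default_rng(4)
A = rng.random((c, w, h, n)) - 0.5  # note A's permuted layout (c, w, h, n)
B = rng.random((w, h, c, n))
C = rng.random((w, h, c, n))

# D_{w,h,c,n} = alpha * RELU(A_{c,w,h,n}) + beta * B_{w,h,c,n} + gamma * C_{w,h,c,n}
A_perm = np.transpose(A, (1, 2, 0, 3))  # permute modes (c,w,h,n) -> (w,h,c,n)
D = alpha * np.maximum(A_perm, 0.0) + beta * B + gamma * C
assert D.shape == (w, h, c, n)
```

Fusing the permutation, unary op, and scaled additions into one pass is what saves the extra memory round-trips that separate element-wise calls would incur.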
cuTENSOR’s API
CONTRACTION OPERATION
API
𝐶 = 𝐴 ∗ 𝐵

cutensorStatus_t cutensorInitContractionDescriptor(
    const cutensorHandle_t *handle,
    cutensorContractionDescriptor_t *desc,
    const cutensorTensorDescriptor_t *descA, const int modeA[], const uint32_t alignmentRequirementA,
    const cutensorTensorDescriptor_t *descB, const int modeB[], const uint32_t alignmentRequirementB,
    const cutensorTensorDescriptor_t *descC, const int modeC[], const uint32_t alignmentRequirementC,
    const cutensorTensorDescriptor_t *descD, const int modeD[], const uint32_t alignmentRequirementD,
    cutensorComputeType_t typeCompute);

cutensorStatus_t cutensorInitContractionFind(
    const cutensorHandle_t *handle,
    cutensorContractionFind_t *find,
    const cutensorAlgo_t algo);

cutensorStatus_t cutensorInitContractionPlan(
    const cutensorHandle_t *handle,
    cutensorContractionPlan_t *plan,
    const cutensorContractionDescriptor_t *desc,
    const cutensorContractionFind_t *find,
    const uint64_t workspaceSize);

cutensorStatus_t cutensorContraction(
    const cutensorHandle_t *handle,
    const cutensorContractionPlan_t *plan,
    const void *alpha, const void *A, const void *B,
    const void *beta,  const void *C, void *D,
    void *workspace, uint64_t workspaceSize, cudaStream_t stream);
CONTRACTION OPERATION
API
𝐶 = 𝐴 ∗ 𝐵
• 𝐷𝑚,𝑛,𝑢 = α 𝐴𝑚,𝑘,𝑢 ∗ 𝐵𝑘,𝑛

cutensorInitContractionDescriptor(&handle, &desc,               // Define problem
                                  descA, {'m','k','u'}, 256,
                                  descB, {'k','n'}, 256,
                                  descC, {'m','n','u'}, 256,
                                  descD, {'m','n','u'}, 256,
                                  CUTENSOR_R_MIN_32F);

cutensorInitContractionFind(&handle, &find, ...);               // Define search space

cutensorInitContractionPlan(&handle, &plan, &desc, &find, ...); // Initialize plan

cutensorContraction(handle, &plan,                              // Execute contraction
                    alpha, A, B, beta, C, D,
                    workspace, workspaceSize, stream);
CONCLUSION
Available at: https://developer.nvidia.com/cuTENSOR
Your feedback is highly appreciated.
• Tensor contractions, 𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶: ~5x speedup
• Element-wise operations, 𝐷 = α𝐴 + β𝐵 + 𝛾𝐶: ~12x speedup
cuTENSOR: A CUDA library for high-performance tensor primitives
REDUCTION OPERATION
API
𝐷 = ∑(𝐴) + 𝐶

cutensorStatus_t cutensorReduction(
    const cutensorHandle_t *handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int modeA[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t *descC, const int modeC[],
                       void *D,       const cutensorTensorDescriptor_t *descD, const int modeD[],
    cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, cudaStream_t stream);

• 𝐷𝑤,ℎ,𝑛 = 𝛼 Reduce(𝐴𝑐,𝑤,ℎ,𝑛) + 𝛽𝐶𝑤,ℎ,𝑛

auto status = cutensorReduction(handle,
                                alpha, A, descA, {'c','w','h','n'},
                                beta,  C, descC, {'w','h','n'},
                                       D, descC, {'w','h','n'},
                                CUTENSOR_OP_ADD, CUTENSOR_R_MIN_32F,
                                stream);
ELEMENT-WISE OPERATION
API
𝐷 = α𝐴 + β𝐵 + 𝛾𝐶

cutensorStatus_t cutensorElementwiseTrinary(
    const cutensorHandle_t *handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int modeA[],
    const void *beta,  const void *B, const cutensorTensorDescriptor_t *descB, const int modeB[],
    const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int modeC[],
                       void *D,       const cutensorTensorDescriptor_t *descD, const int modeD[],
    cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeCompute,
    cudaStream_t stream);

• 𝐷𝑤,ℎ,𝑐,𝑛 = min(α𝐴𝑐,𝑤,ℎ,𝑛, 𝛽𝐵𝑐,𝑤,ℎ) + 𝛾𝐶𝑤,ℎ,𝑐,𝑛

auto status = cutensorElementwiseTrinary(handle,
                                         alpha, A, descA, {'c','w','h','n'},
                                         beta,  B, descB, {'c','w','h'},
                                         gamma, C, descC, {'w','h','c','n'},
                                                D, descD, {'w','h','c','n'},
                                         CUTENSOR_OP_MIN, CUTENSOR_OP_ADD, CUTENSOR_R_MIN_16F,
                                         stream);
PERFORMANCE
Element-wise Operation
𝐶 = α𝐴 + β𝐵
* FP32 tensor permutation (e.g., reformatting)
~5x over two-socket CPU
HPTT (https://github.com/springer13/hptt)