A CUDA Library for High-Performance Tensor Primitives
CUTENSOR
Paul Springer, November 20th 2019
CONTRIBUTORS
Chenhan Yu
Markus Höhnerbach
Miguel Ferrer Avila
Andrew Kerr
Manish Gupta
Timothy Costa
Alex Fit-Florea
WHAT IS A TENSOR?
• mode-0: scalar 𝛼
• mode-1: vector 𝐴𝑖
• mode-2: matrix 𝐴𝑖,𝑗
• mode-n: general tensor 𝐴𝑖,𝑗,𝑘,𝑙,𝑚,…
BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)
A Success Story
• 1969 – BLAS Level 1: Vector-Vector (𝑦 = 𝛼𝑥 + 𝑦)
• 1972 – BLAS Level 2: Matrix-Vector (𝑦 = 𝐴 ∗ 𝑥)
• 1980 – BLAS Level 3: Matrix-Matrix (𝐶 = 𝐴 ∗ 𝐵)
• Now? – BLAS Level 4: Tensor-Tensor (𝐷 = 𝐴 ∗ 𝐵)
CUTENSOR
A High-Performance CUDA Library for Tensor Primitives
• Tensor contractions (generalization of matrix-matrix multiplication)
• Tensor reductions
• Element-wise operations (e.g., permutations, additions)
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶 // contraction
𝐷 = ∑(𝐴) + 𝐶 // reduction
𝐷 = 𝐴 + 𝐵 + 𝐶 // element-wise
CUTENSOR
Key Features
• Tensor-Core support
• Built on CUTLASS 2.0
• Flexible and modern interface
• No mallocs inside the lib
• Tensor Cores active by default
• Transpose-free contractions
• Arbitrary data layouts
• Extensive mixed-precision support
• Complex-times-Real
• FP64 data + FP32 compute
• FP32 data + FP16 compute
TENSORS ARE UBIQUITOUS
Potential Use Cases: HPC & AI
TAL-SH: https://github.com/DmitryLyakh/TAL_SH
ITensor: http://itensor.org & https://github.com/kshyatt/ITensors.jl
Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU
TAL-SH
Scikit-CUDA
Pyro
Tensor Contractions
TENSOR CONTRACTIONS
Examples
• Einstein notation (einsum)
• Modes that appear in both A and B (but not in D) are contracted
• Examples
• 𝐷𝑚,𝑛 = α ∑𝑘 𝐴𝑚,𝑘 ∗ 𝐵𝑘,𝑛 // GEMM
• 𝐷𝑚1,𝑛,𝑚2 = α 𝐴𝑚1,𝑘,𝑚2 ∗ 𝐵𝑘,𝑛 // tensor contraction
• 𝐷𝑚1,𝑛1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1 // tensor contraction
• 𝐷𝑚1,𝑛1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘1,𝑚2,𝑘2 ∗ 𝐵𝑘1,𝑘2,𝑛2,𝑛1 // multi-mode tensor contraction
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶
TF: https://www.tensorflow.org/api_docs/python/tf/einsum
PyTorch: https://pytorch.org/docs/stable/torch.html#torch.einsum
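The Einstein-notation examples above map directly onto einsum; a small NumPy sketch of the same semantics (shapes, mode letters, and names here are illustrative only, not part of cuTENSOR):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0

# GEMM: D_{m,n} = alpha * sum_k A_{m,k} * B_{k,n}
A = rng.random((4, 6))
B = rng.random((6, 5))
D = alpha * np.einsum('mk,kn->mn', A, B)
assert np.allclose(D, alpha * (A @ B))  # identical to a plain matrix product

# Multi-mode contraction: D_{m1,n1,n2,m2} = alpha * A_{m1,k1,m2,k2} * B_{k1,k2,n2,n1}
m1, m2, n1, n2, k1, k2 = 2, 3, 4, 5, 6, 7
A4 = rng.random((m1, k1, m2, k2))
B4 = rng.random((k1, k2, n2, n1))
# a=m1, c=m2, x=n1, y=n2; k and l are the contracted modes k1 and k2
D4 = alpha * np.einsum('akcl,klyx->axyc', A4, B4)
assert D4.shape == (m1, n1, n2, m2)
```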
TENSOR CONTRACTIONS
Examples (cont.)
• 𝐷𝑚,𝑛 = α 𝐴𝑚 ∗ 𝐵𝑛 // outer product
• 𝐷𝑚1,𝑛,𝑚2 = α 𝐴𝑚1,𝑚2 ∗ 𝐵𝑛 // outer product
• 𝐷𝑚1,𝑚2 = α 𝐴𝑚1,𝑘1,𝑚2,𝑘2 ∗ 𝐵𝑘1,𝑘2 // GEMV-like tensor contraction
• 𝐷𝑚1,𝑛1,𝑙1 = α 𝐴𝑚1,𝑘,𝑙1 ∗ 𝐵𝑘,𝑛1,𝑙1 // batched GEMM
• 𝐷𝑚1,𝑛1,𝑙1,𝑛2,𝑚2 = α 𝐴𝑚1,𝑘,𝑙1,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1,𝑙1 // single-mode batched tensor contraction
• 𝐷𝑚1,𝑛1,𝑙1,𝑛2,𝑚2,𝑙2 = α 𝐴𝑚1,𝑘,𝑙2,𝑙1,𝑚2 ∗ 𝐵𝑘,𝑛2,𝑛1,𝑙1,𝑙2 // multi-mode batched tensor contraction
𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶
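A batched mode is one that appears in A, B, and D and is never summed; a minimal NumPy sketch of the batched-GEMM example above (extents are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, l = 2, 3, 4, 5   # l is the batched mode
alpha = 1.5
A = rng.random((m, k, l))
B = rng.random((k, n, l))

# Batched GEMM: D_{m,n,l} = alpha * A_{m,k,l} * B_{k,n,l}
D = alpha * np.einsum('mkl,knl->mnl', A, B)

# Equivalent to one independent GEMM per batch index l
ref = np.stack([alpha * (A[:, :, i] @ B[:, :, i]) for i in range(l)], axis=-1)
assert np.allclose(D, ref)
```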
TENSOR CONTRACTIONS
Key Challenges
• Keep the fast FPUs busy
• Reuse data in shared memory & registers as much as possible
• Coalesced accesses to/from global memory
CUTLASS 2.0
TENSOR CONTRACTIONS
Key Challenges
• Loading a scalar 𝛼 ✅
• Loading a vector 𝐴𝑖 ✅
• Loading a matrix 𝐴𝑖,𝑗 (✅)
• Loading a general tensor 𝐴𝑖,𝑗,𝑘 ((✅))
TENSOR CONTRACTIONS
Technical Insight
𝐷 = 𝐴 ∗ 𝐵
• Sub-tensors 𝒜 and ℬ are loaded to shared memory (SMEM)
• The contraction is computed as 𝒟 = GEMM-like(𝒜, ℬ)
• The result 𝒟 is stored back to global memory
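Conceptually, the GEMM-like step folds each tensor's free modes into one matrix dimension and its contracted modes into the other. A NumPy sketch of that mapping follows; note that the explicit transpose/reshape here is exactly what cuTENSOR's transpose-free approach avoids materializing, so this is only a conceptual model:

```python
import numpy as np

# D_{m1,n,m2} = A_{m1,k,m2} * B_{k,n}
m1, k, m2, n = 2, 3, 4, 5
rng = np.random.default_rng(2)
A = rng.random((m1, k, m2))
B = rng.random((k, n))

# Fold A's free modes (m1, m2) into matrix rows, contracted mode k into columns
A_mat = A.transpose(0, 2, 1).reshape(m1 * m2, k)  # explicit copy; cuTENSOR skips this
D_mat = A_mat @ B                                  # a plain GEMM
D = D_mat.reshape(m1, m2, n).transpose(0, 2, 1)    # unfold back to modes (m1, n, m2)

assert np.allclose(D, np.einsum('akb,kn->anb', A, B))
```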
TENSOR CONTRACTIONS
Performance Guidelines
• Arrange modes similarly in all tensors
• E.g., 𝐶𝑎,𝑏,𝑐 = 𝐴𝑎,𝑘,𝑐𝐵𝑘,𝑏 is preferable to 𝐶𝑎,𝑏,𝑐 = 𝐴𝑐,𝑘,𝑎𝐵𝑘,𝑏
• Keep the extent of the fastest-varying mode as large as possible
• E.g., 𝐶𝑎,𝑏,𝑐 ∈ 𝑅1000×100×10 is preferable to 𝐶𝑐,𝑏,𝑎 ∈ 𝑅10×100×1000
• Keep batched modes as the slowest-varying modes (i.e., with the largest strides)
• E.g., 𝐶𝑎,𝑏,𝑐,𝑙 = 𝐴𝑎,𝑘,𝑐,𝑙𝐵𝑘,𝑏,𝑙 is preferable to 𝐶𝑎,𝑏,𝑐,𝑙 = 𝐴𝑎,𝑘,𝑙,𝑐𝐵𝑙,𝑘,𝑏
*assuming generalized column-major layout
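In a generalized column-major layout, the stride of a mode is the product of the extents of all faster-varying modes before it; a quick illustrative helper (not part of the cuTENSOR API) shows why a large fastest-varying extent and slow batched modes matter:

```python
def col_major_strides(extents):
    """Stride of mode i = product of extents of all earlier (faster-varying) modes."""
    strides, s = [], 1
    for e in extents:
        strides.append(s)
        s *= e
    return strides

# C_{a,b,c} with extents (1000, 100, 10): mode 'a' has stride 1, so consecutive
# elements along the large mode are contiguous, enabling coalesced accesses.
assert col_major_strides([1000, 100, 10]) == [1, 1000, 100000]
# The reversed layout C_{c,b,a} puts stride 1 on the small mode c instead.
assert col_major_strides([10, 100, 1000]) == [1, 10, 1000]
```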
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity; ~7x speedup]
Random tensor contractions:
• 3D to 6D tensors
• FP64
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity]
Random tensor contractions:
• 3D to 6D tensors
• FP64 (data) & FP32 (compute)
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
PERFORMANCE
Tensor Contractions
𝐶 = 𝐴 ∗ 𝐵
[Plot: performance vs. arithmetic intensity]
Random tensor contractions:
• 3D to 6D tensors
• FP32 (data) + Tensor Cores
• Accumulation in FP32
• Inputs truncated to FP16
TBLIS: https://github.com/devinamatthews/tblis
Benchmark available at: http://tensornetwork.org/benchmarks/
Hardware:
• Tesla GV100
• 2x Intel Xeon Platinum 8168
ITENSOR – PEPS SIMULATION
Credits:
• Katharine Hyatt (Flatiron Institute) – https://github.com/itensor/ITensorsGPU.jl
• Miles Stoudenmire and Matt Fishman (Flatiron Institute) – https://github.com/ITensor/ITensors.jl
• Tim Besard (Ghent University) – https://github.com/JuliaGPU
*results are based on cuTENSOR’s early-access version
𝑆𝑧𝑆𝑧 + 𝑆+𝑆− + 𝑆−𝑆+ + …
Speedups: up to 23.6x
Tensor Reductions
TENSOR REDUCTIONS
Examples
• 𝐷𝑤,ℎ,𝑛 = α Reduce( 𝐴𝑐,𝑤,ℎ,𝑛 , ADD) // Reduction over mode c
• 𝐷𝑤,𝑛 = α Reduce( 𝐴𝑐,𝑤,ℎ,𝑛, ADD) // Reduction over modes c and h
• 𝐷𝑤,𝑛 = α Reduce( 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛), ADD) // RELU + reduction over modes c and h
• 𝐷𝑤,𝑛 = α Reduce( 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛), MAX) // RELU + max-reduction over modes c and h
𝐷 = ∑(𝐴) + 𝐶
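The reductions above are sums or maxes over the reduced modes, optionally after a unary op; an illustrative NumPy equivalent (extents are arbitrary):

```python
import numpy as np

c, w, h, n = 2, 3, 4, 5
alpha = 0.5
rng = np.random.default_rng(3)
A = rng.random((c, w, h, n)) - 0.5  # mix of positive and negative entries

# D_{w,h,n} = alpha * Reduce(A_{c,w,h,n}, ADD): sum over mode c (axis 0)
D1 = alpha * A.sum(axis=0)
assert D1.shape == (w, h, n)

# D_{w,n} = alpha * Reduce(RELU(A), MAX): max-reduction over modes c and h
relu = np.maximum(A, 0.0)
D2 = alpha * relu.max(axis=(0, 2))
assert D2.shape == (w, n)
```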
Element-wise Operations
[Diagram: inputs A, B, and C each pass through a data-type conversion, a unary op, and a scale; A and B are combined via a binary op, the result is combined with C via a second binary op, converted to D's data type, and stored to D]
ELEMENT-WISE TENSOR OPERATIONS
Examples
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛 + β𝐵𝑐,𝑤,ℎ,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = min( α 𝐴𝑐,𝑤,ℎ,𝑛 , β 𝐵𝑐,𝑤,ℎ,𝑛)
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝐴𝑐,𝑤,ℎ,𝑛 + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = α 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛) + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛
• 𝐷𝑤,ℎ,𝑐,𝑛 = 𝐹𝑃32( α 𝑅𝐸𝐿𝑈(𝐴𝑐,𝑤,ℎ,𝑛) + β 𝐵𝑤,ℎ,𝑐,𝑛 + 𝛾 𝐶𝑤,ℎ,𝑐,𝑛)
𝐷 = α𝐴 + β𝐵 + 𝛾𝐶
Enables users to fuse multiple element-wise calls.
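Each operand may carry its own mode order, so a permutation is folded into the element-wise expression; an illustrative NumPy sketch of one of the examples above:

```python
import numpy as np

w, h, c, n = 2, 3, 4, 5
alpha, beta, gamma = 1.0, 2.0, 0.5
rng = np.random.default_rng(4)
A = rng.random((c, w, h, n)) - 0.5  # note A's permuted layout (c, w, h, n)
B = rng.random((w, h, c, n))
C = rng.random((w, h, c, n))

# D_{w,h,c,n} = alpha * RELU(A_{c,w,h,n}) + beta * B_{w,h,c,n} + gamma * C_{w,h,c,n}
A_perm = np.transpose(A, (1, 2, 0, 3))  # permute modes (c,w,h,n) -> (w,h,c,n)
D = alpha * np.maximum(A_perm, 0.0) + beta * B + gamma * C
assert D.shape == (w, h, c, n)
```

Fusing the permutation, unary op, and scaled additions into one pass is what saves the extra memory round-trips that separate element-wise calls would incur.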
cuTENSOR’s API
CONTRACTION OPERATION
API
𝐶 = 𝐴 ∗ 𝐵

cutensorStatus_t cutensorInitContractionDescriptor(
    const cutensorHandle_t *handle,
    cutensorContractionDescriptor_t *desc,
    const cutensorTensorDescriptor_t *descA, const int modeA[], const uint32_t alignmentRequirementA,
    const cutensorTensorDescriptor_t *descB, const int modeB[], const uint32_t alignmentRequirementB,
    const cutensorTensorDescriptor_t *descC, const int modeC[], const uint32_t alignmentRequirementC,
    const cutensorTensorDescriptor_t *descD, const int modeD[], const uint32_t alignmentRequirementD,
    cutensorComputeType_t typeCompute);

cutensorStatus_t cutensorInitContractionFind(
    const cutensorHandle_t *handle,
    cutensorContractionFind_t *find,
    const cutensorAlgo_t algo);

cutensorStatus_t cutensorInitContractionPlan(
    const cutensorHandle_t *handle,
    cutensorContractionPlan_t *plan,
    const cutensorContractionDescriptor_t *desc,
    const cutensorContractionFind_t *find,
    const uint64_t workspaceSize);

cutensorStatus_t cutensorContraction(
    const cutensorHandle_t *handle,
    const cutensorContractionPlan_t *plan,
    const void *alpha, const void *A, const void *B,
    const void *beta,  const void *C, void *D,
    void *workspace, uint64_t workspaceSize, cudaStream_t stream);
CONTRACTION OPERATION
API
𝐶 = 𝐴 ∗ 𝐵
• 𝐷𝑚,𝑛,𝑢 = α 𝐴𝑚,𝑘,𝑢 ∗ 𝐵𝑘,𝑛

cutensorInitContractionDescriptor(&handle, &desc,               // Define problem
                                  descA, {'m','k','u'}, 256,
                                  descB, {'k','n'}, 256,
                                  descC, {'m','n','u'}, 256,
                                  descD, {'m','n','u'}, 256,
                                  CUTENSOR_R_MIN_32F);

cutensorInitContractionFind(&handle, &find, ...);               // Define search space

cutensorInitContractionPlan(&handle, &plan, &desc, &find, ...); // Initialize plan

cutensorContraction(handle, &plan,                              // Execute contraction
                    alpha, A, B, beta, C, D,
                    workspace, workspaceSize, stream);
CONCLUSION
Available at: https://developer.nvidia.com/cuTENSOR
Your feedback is highly appreciated.
• Tensor contractions, 𝐷 = ∑(𝐴 ∗ 𝐵) + 𝐶: ~5x speedup
• Element-wise operations, 𝐷 = α𝐴 + β𝐵 + 𝛾𝐶: ~12x speedup
cuTENSOR: A CUDA library for high-performance tensor primitives
REDUCTION OPERATION
API
𝐷 = ∑(𝐴) + 𝐶

cutensorStatus_t cutensorReduction(
    const cutensorHandle_t *handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int modeA[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t *descC, const int modeC[],
                       void *D,       const cutensorTensorDescriptor_t *descD, const int modeD[],
    cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, cudaStream_t stream);

• 𝐷𝑤,ℎ,𝑛 = 𝛼 Reduce(𝐴𝑐,𝑤,ℎ,𝑛) + 𝛽𝐶𝑤,ℎ,𝑛

auto status = cutensorReduction(handle,
                                alpha, A, descA, {'c','w','h','n'},
                                beta,  C, descC, {'w','h','n'},
                                       D, descC, {'w','h','n'},
                                CUTENSOR_OP_ADD, CUTENSOR_R_MIN_32F,
                                stream);
ELEMENT-WISE OPERATION
API
𝐷 = α𝐴 + β𝐵 + 𝛾𝐶

cutensorStatus_t cutensorElementwiseTrinary(
    const cutensorHandle_t *handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int modeA[],
    const void *beta,  const void *B, const cutensorTensorDescriptor_t *descB, const int modeB[],
    const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int modeC[],
                       void *D,       const cutensorTensorDescriptor_t *descD, const int modeD[],
    cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeCompute,
    cudaStream_t stream);

• 𝐷𝑤,ℎ,𝑐,𝑛 = min(α𝐴𝑐,𝑤,ℎ,𝑛, 𝛽𝐵𝑐,𝑤,ℎ) + 𝛾𝐶𝑤,ℎ,𝑐,𝑛

auto status = cutensorElementwiseTrinary(handle,
                                         alpha, A, descA, {'c','w','h','n'},
                                         beta,  B, descB, {'c','w','h'},
                                         gamma, C, descC, {'w','h','c','n'},
                                                D, descD, {'w','h','c','n'},
                                         CUTENSOR_OP_MIN, CUTENSOR_OP_ADD, CUTENSOR_R_MIN_16F,
                                         stream);
PERFORMANCE
Element-wise Operation
𝐶 = α𝐴 + β𝐵
* FP32 tensor permutation (e.g., reformatting)
~5x over two-socket CPU
HPTT (https://github.com/springer13/hptt)