
a brief survey of tensors
multilinear algebra, decompositions, method of moments and deep learning

Berton Earnshaw, Senior Scientist, Savvysherpa

[email protected]

tensors, tensors everywhere…

tensor = multidimensional array

vector matrix tensor

v ∊ ℝ^64   X ∊ ℝ^(8×8)   𝓧 ∊ ℝ^(4×4×4)

third-order tensors

𝓧 ∊ ℝ^(7×5×8)

color image is 3rd-order tensor

color video is 4th-order tensor

MNIST is third-order tensor

facial images database is 6th-order tensor

tensors are just bookkeeping devices!

easy right?

multilinear algebra

wrong!

multilinear algebra basics: mode-n product

𝓧 ∊ ℝ^(3×4×2)

U ∊ ℝ^(2×3)

𝓨 = 𝓧 ×₁ U ∊ ℝ^(2×4×2)
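
A minimal NumPy sketch of this mode-1 product (my own illustration, not part of the deck), using the slide's shapes:

    import numpy as np

    # mode-1 product: contract the columns of U against the first mode of X
    X = np.random.rand(3, 4, 2)   # 𝓧 ∊ ℝ^(3×4×2)
    U = np.random.rand(2, 3)      # U ∊ ℝ^(2×3)
    Y = np.einsum('ij,jkl->ikl', U, X)
    print(Y.shape)                # (2, 4, 2)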

multilinear algebra basics: mode-n matricization (unfolding)

𝓧 ∊ ℝ^(3×4×2)

mode-1

mode-2

mode-3
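
A sketch of mode-n unfolding in NumPy (my own, assuming the Kolda & Bader column ordering, which Fortran-order reshaping reproduces):

    import numpy as np

    def unfold(X, mode):
        # bring `mode` to the front, then flatten the remaining modes
        # (Fortran order matches the Kolda & Bader fiber ordering)
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    X = np.random.rand(3, 4, 2)   # 𝓧 ∊ ℝ^(3×4×2)
    print(unfold(X, 0).shape)     # (3, 8)
    print(unfold(X, 1).shape)     # (4, 6)
    print(unfold(X, 2).shape)     # (2, 12)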

multilinear algebra basics

Kronecker product

A ∊ ℝ^(I×J), B ∊ ℝ^(K×L)

A ⊗ B ∊ ℝ^(IK×JL)

mode-n matricization of multilinear product
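
To make this last point concrete: in the Kolda & Bader convention, 𝓨 = 𝓧 ×₁ A ×₂ B ×₃ C implies Y₍₁₎ = A X₍₁₎ (C ⊗ B)ᵀ. A numerical spot check (my own sketch, reusing the unfold function above):

    import numpy as np

    def unfold(X, mode):
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    def mode_product(X, M, mode):
        # contract M against the given mode of X, keeping the mode order intact
        return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

    X = np.random.rand(3, 4, 2)
    A, B, C = np.random.rand(5, 3), np.random.rand(6, 4), np.random.rand(7, 2)

    Y = mode_product(mode_product(mode_product(X, A, 0), B, 1), C, 2)
    print(np.allclose(unfold(Y, 0), A @ unfold(X, 0) @ np.kron(C, B).T))  # True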

tensor decompositions

decomposition = compression

X ∊ ℝ^(I×J), A ∊ ℝ^(I×K), B ∊ ℝ^(J×K), K ≪ I, J

X ≈ ABᵀ

matrix   # entries
X        IJ
ABᵀ      (I+J)K

compression = understanding

X ∊ ℝ^(I×J), A ∊ ℝ^(I×K), B ∊ ℝ^(J×K), K ≪ I, J

X ≈ ABᵀ

columns of B = basis of low-dim. subspace
rows of A = low-dim. representation of X

A,B encode X using fewer bits!

X = abᵀ = a ○ b


rank-1 matrices

X ≈ ABᵀ

ABᵀ = ∑_{k=1}^K a_k b_kᵀ = ∑_{k=1}^K a_k ○ b_k

decomposition = sum rank-1 matrices
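
One common way to obtain such an A and B is a truncated SVD; a small sketch (mine, not from the deck):

    import numpy as np

    # rank-K approximation X ≈ ABᵀ from the truncated SVD
    I, J, K = 100, 80, 5
    X = np.random.rand(I, K) @ np.random.rand(K, J) + 0.01 * np.random.rand(I, J)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    A = U[:, :K] * s[:K]      # I × K
    B = Vt[:K].T              # J × K
    print(np.linalg.norm(X - A @ B.T) / np.linalg.norm(X))  # small relative error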

𝓧 = a ○ b ○ c

rank-1 tensors

𝓧 ≈ [[A, B, C]]

[[A, B, C]] = ∑_{k=1}^K a_k ○ b_k ○ c_k

When K is minimal, this is called the Canonical Polyadic Decomposition (CPD).

decomposition = sum rank-1 tensors
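
For concreteness, here is a bare-bones alternating-least-squares fit of a rank-K CP model for a 3rd-order tensor (a sketch of the standard CP-ALS updates, not necessarily the code used for the MNIST experiments below):

    import numpy as np

    def unfold(X, mode):
        # Kolda & Bader mode-n matricization
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    def khatri_rao(A, B):
        # column-wise Kronecker product: column r is A[:, r] ⊗ B[:, r]
        return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

    def cp_als(X, K, n_iter=200, seed=0):
        # fit 𝓧 ≈ [[A, B, C]] by cycling least-squares updates over the factors
        rng = np.random.default_rng(seed)
        A, B, C = (rng.standard_normal((d, K)) for d in X.shape)
        for _ in range(n_iter):
            A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
            B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
            C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        return A, B, C

    # sanity check on a random tensor of exact rank 3
    rng = np.random.default_rng(1)
    A0, B0, C0 = rng.random((6, 3)), rng.random((5, 3)), rng.random((4, 3))
    X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
    A, B, C = cp_als(X, K=3)
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # typically near zero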

MNIST decomposition

MNIST as matrix

MNIST matrix decomposition
X ≈ ABᵀ = ∑_{k=1}^K a_k ○ b_k (K = 10)

MNIST matrix reconstruction

representation   size                                 % of original
original         70,000 × (28 × 28) = 54,880,000      100%
matrix           (70,000 + 28 × 28) × 10 = 707,840    1.29%

matrix decomposition requires ~1% of the original memory!

representation size

MNIST as tensor

MNIST tensor decomposition
𝓧 ≈ [[A, B, C]] = ∑_{k=1}^K a_k ○ b_k ○ c_k (K = 10)

MNIST tensor reconstruction

representation   size                                 % of original
original         70,000 × (28 × 28) = 54,880,000      100%
matrix           (70,000 + 28 × 28) × 10 = 707,840    1.29%
tensor           (70,000 + 28 + 28) × 10 = 700,560    1.28%

the tensor decomposition requires less memory than the matrix decomposition

representation size

𝓧 = [[A, B, C]] = ∑_{k=1}^K a_k ○ b_k ○ c_k

Kruskal’s rank theorem: the CPD is unique (up to unavoidable scaling and permutation of the columns of A, B and C) if

r_A + r_B + r_C ≥ 2K + 2,

where r_M is the Kruskal rank of the matrix M (the largest number r such that every choice of r columns of M is linearly independent).

Note: rank(M) ≥ r_M

uniqueness of decomposition
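
The Kruskal rank can be checked by brute force on small matrices; a toy sketch (mine, not from the deck) illustrating that rank(M) ≥ r_M can be strict:

    import numpy as np
    from itertools import combinations

    def kruskal_rank(M):
        # largest r such that *every* choice of r columns is linearly independent
        n_cols = M.shape[1]
        for r in range(1, n_cols + 1):
            for cols in combinations(range(n_cols), r):
                if np.linalg.matrix_rank(M[:, list(cols)]) < r:
                    return r - 1
        return n_cols

    M = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])          # first and third columns coincide
    print(np.linalg.matrix_rank(M), kruskal_rank(M))  # 2 1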

learning via method of moments

𝓧 = [[𝓖; A, B, C]] = 𝓖 ×₁ A ×₂ B ×₃ C

Tucker decomposition
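
A compact way to compute such a Tucker model is the (truncated) higher-order SVD; a rough sketch (mine, not necessarily how the deck's figures were produced):

    import numpy as np

    def unfold(X, mode):
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    def mode_product(X, M, mode):
        return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

    def hosvd(X, ranks):
        # factor n = leading left singular vectors of the mode-n unfolding;
        # core 𝓖 = X multiplied by the factor transposes in every mode
        factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
                   for n, r in enumerate(ranks)]
        G = X
        for n, U in enumerate(factors):
            G = mode_product(G, U.T, n)
        return G, factors

    X = np.random.rand(6, 5, 4)
    G, (A, B, C) = hosvd(X, ranks=(3, 3, 2))
    X_hat = mode_product(mode_product(mode_product(G, A, 0), B, 1), C, 2)
    print(G.shape, np.linalg.norm(X - X_hat) / np.linalg.norm(X))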

classification via convolutional networks

Task: learn Y score functions h_y such that the prediction rule

ŷ = argmax_y h_y(x)

is as accurate as possible.

coefficient tensor

tensor decomposition

hierarchical Tucker decomposition

universality and depth efficiency

references

https://github.com/bearnshaw/mnist-factorization

• Kolda, Tamara G., and Brett W. Bader. "Tensor decompositions and applications." SIAM review 51.3 (2009): 455-500.

• Cichocki, Andrzej, et al. "Tensor decompositions for signal processing applications: From two-way to multiway component analysis." IEEE Signal Processing Magazine 32.2 (2015): 145-163.

• Anandkumar, Animashree, et al. "Tensor decompositions for learning latent variable models." Journal of Machine Learning Research 15.1 (2014): 2773-2832.

• Sedghi, Hanie, and Anima Anandkumar. "Training Input-Output Recurrent Neural Networks through Spectral Methods." arXiv preprint arXiv:1603.00954 (2016).

• Cohen, Nadav, Or Sharir, and Amnon Shashua. "On the expressive power of deep learning: A tensor analysis." arXiv preprint arXiv:1509.05009 (2015).

• Cohen, Nadav, and Amnon Shashua. "Convolutional rectifier networks as generalized tensor decompositions." International Conference on Machine Learning (ICML). 2016.