
Tensor Train decomposition in machine learning



Tensor Train in machine learning

Alexander Novikov

October 11, 2016


Recommender systems

Assume low-rank structure.


Tensor Train summary

Tensor Train (TT) decomposition [Oseledets 2011]:

A compact representation for tensors (= multidimensional arrays);

Allows for efficient application of linear algebra operations.


Low-rank decomposition

An example of computing one element, $A_{23}$ (i.e. $i_1 = 2$, $i_2 = 3$), as the product of a row of $G_1$ and a column of $G_2$:

$$A_{i_1 i_2} = \underbrace{G_1[i_1]}_{1 \times r}\, \underbrace{G_2[i_2]}_{r \times 1}, \qquad A = G_1 G_2.$$

$G_1$ – collection of rows, $G_2$ – collection of columns.


Tensor Train decomposition

An example of computing one element of a 4-dimensional tensor, $A_{2423}$ (i.e. $i_1 = 2$, $i_2 = 4$, $i_3 = 2$, $i_4 = 3$), as a product of matrix slices of the cores $G_1, G_2, G_3, G_4$:

$$A_{i_1 \ldots i_d} = \underbrace{G_1[i_1]}_{1 \times r}\, \underbrace{G_2[i_2]}_{r \times r} \cdots \underbrace{G_d[i_d]}_{r \times 1}.$$
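To make the formula concrete, here is a minimal NumPy sketch (ours, not the talk's code) of evaluating a single element from a list of TT-cores. It assumes each core is stored as an array of shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1; the name `tt_element` is hypothetical:

```python
import numpy as np

def tt_element(cores, index):
    """Evaluate A[i_1, ..., i_d] as a chain of small matrix products."""
    res = np.ones((1, 1))                 # 1 x r_0 "row vector"
    for core, ik in zip(cores, index):    # core has shape (r_{k-1}, n_k, r_k)
        res = res @ core[:, ik, :]        # multiply by the ik-th matrix slice
    return res[0, 0]                      # final shape is 1 x r_d = 1 x 1
```

Each element costs a product of $d$ small matrices, i.e. $O(d r^2)$ operations.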


Tensor Train decomposition Cont’d

A tensor $A$ is said to be in the TT-format if
$$A_{i_1,\ldots,i_d} = G_1[i_1]\, G_2[i_2] \cdots G_d[i_d], \qquad i_k \in \{1, \ldots, n\},$$
where $G_k[i_k]$ is a matrix of size $r_{k-1} \times r_k$, with $r_0 = r_d = 1$.

Notation & terminology: $G_k$ – TT-cores; $r_k$ – TT-ranks; $r = \max_{k=0,\ldots,d} r_k$ – the maximal TT-rank.

The TT-format uses $O(d\,n\,r^2)$ memory to store $n^d$ elements. Efficient only if the TT-rank is small.


TT-format: example

$$A_{i_1,i_2,i_3} = i_1 + i_2 + i_3, \qquad i_1 \in \{1,2,3\},\; i_2 \in \{1,2,3,4\},\; i_3 \in \{1,2,3,4,5\}.$$

$$A_{i_1,i_2,i_3} = G_1[i_1]\, G_2[i_2]\, G_3[i_3], \qquad
G_1[i_1] = \begin{bmatrix} i_1 & 1 \end{bmatrix}, \quad
G_2[i_2] = \begin{bmatrix} 1 & 0 \\ i_2 & 1 \end{bmatrix}, \quad
G_3[i_3] = \begin{bmatrix} 1 \\ i_3 \end{bmatrix}.$$

Let's check:

$$A(i_1, i_2, i_3) =
\begin{bmatrix} i_1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ i_2 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ i_3 \end{bmatrix}
= \begin{bmatrix} i_1 + i_2 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ i_3 \end{bmatrix}
= i_1 + i_2 + i_3.$$



TT-format: example

$$A_{i_1,i_2,i_3} = i_1 + i_2 + i_3, \qquad i_1 \in \{1,2,3\},\; i_2 \in \{1,2,3,4\},\; i_3 \in \{1,2,3,4,5\},$$
$$A_{i_1,i_2,i_3} = G_1[i_1]\, G_2[i_2]\, G_3[i_3].$$

$$G_1 = \left( \begin{bmatrix} 1 & 1 \end{bmatrix},\; \begin{bmatrix} 2 & 1 \end{bmatrix},\; \begin{bmatrix} 3 & 1 \end{bmatrix} \right)$$

$$G_2 = \left( \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix},\; \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix},\; \begin{bmatrix} 1 & 0 \\ 3 & 1 \end{bmatrix},\; \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix} \right)$$

$$G_3 = \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; \begin{bmatrix} 1 \\ 2 \end{bmatrix},\; \begin{bmatrix} 1 \\ 3 \end{bmatrix},\; \begin{bmatrix} 1 \\ 4 \end{bmatrix},\; \begin{bmatrix} 1 \\ 5 \end{bmatrix} \right)$$

The tensor has $3 \cdot 4 \cdot 5 = 60$ elements; the TT-format uses 32 parameters to describe it.
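The same example takes a few lines of NumPy. A sketch under the same (r_{k-1}, n_k, r_k) storage convention, reusing the hypothetical `tt_element` helper from the earlier sketch:

```python
import numpy as np

n1, n2, n3 = 3, 4, 5
G1 = np.zeros((1, n1, 2))
G2 = np.zeros((2, n2, 2))
G3 = np.zeros((2, n3, 1))
for i1 in range(n1):
    G1[0, i1] = [i1 + 1, 1]               # G1[i1] = [i1  1]
for i2 in range(n2):
    G2[:, i2] = [[1, 0], [i2 + 1, 1]]     # G2[i2] = [[1, 0], [i2, 1]]
for i3 in range(n3):
    G3[:, i3, 0] = [1, i3 + 1]            # G3[i3] = [1, i3]^T

# verify all 3 * 4 * 5 = 60 elements against the analytic formula (1-based indices)
for idx in np.ndindex(n1, n2, n3):
    assert tt_element([G1, G2, G3], idx) == sum(idx) + 3

print(G1.size + G2.size + G3.size)        # 6 + 16 + 10 = 32 parameters
```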


Sum of tensors

Tensors $A$ and $B$ are in the TT-format:
$$A_{i_1 \ldots i_d} = G_1^A[i_1] \cdots G_d^A[i_d], \qquad B_{i_1 \ldots i_d} = G_1^B[i_1] \cdots G_d^B[i_d].$$

Find the TT-format of $C = A + B$, i.e. $C_{i_1 \ldots i_d} = A_{i_1 \ldots i_d} + B_{i_1 \ldots i_d}$.

TT-cores of the result:
$$G_k^C[i_k] = \begin{bmatrix} G_k^A[i_k] & 0 \\ 0 & G_k^B[i_k] \end{bmatrix}, \quad k = 2, \ldots, d-1,$$
$$G_1^C[i_1] = \begin{bmatrix} G_1^A[i_1] & G_1^B[i_1] \end{bmatrix}, \qquad G_d^C[i_d] = \begin{bmatrix} G_d^A[i_d] \\ G_d^B[i_d] \end{bmatrix}.$$

TT-ranks of the result are sums of the TT-ranks.
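The block construction translates directly into code. A sketch with our own naming and the same (r_{k-1}, n_k, r_k) core layout as in the earlier sketches:

```python
import numpy as np

def tt_sum(cores_a, cores_b):
    """TT-cores of C = A + B built from the TT-cores of A and B."""
    d = len(cores_a)
    out = []
    for k, (ga, gb) in enumerate(zip(cores_a, cores_b)):
        ra_l, n, ra_r = ga.shape
        rb_l, _, rb_r = gb.shape
        if k == 0:                        # first core: [G_1^A  G_1^B]
            gc = np.concatenate([ga, gb], axis=2)
        elif k == d - 1:                  # last core: stack G_d^A over G_d^B
            gc = np.concatenate([ga, gb], axis=0)
        else:                             # middle cores: block-diagonal
            gc = np.zeros((ra_l + rb_l, n, ra_r + rb_r))
            gc[:ra_l, :, :ra_r] = ga
            gc[ra_l:, :, ra_r:] = gb
        out.append(gc)
    return out
```

Since the ranks add up, sums are usually followed by the rounding procedure from the next slide.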



TT-rounding

Given a tensor $A$ in the TT-format with rank $r$, TT-rounding [Oseledets, 2011]
$$\widehat{A} = \operatorname{tt-round}(A, \varepsilon), \qquad \varepsilon > 0,$$
finds a tensor $\widehat{A}$ such that

1. $\|A - \widehat{A}\|_F \le \varepsilon \|A\|_F$;
2. the TT-rank of $\widehat{A}$ is minimal among all $B$ with $\|A - B\|_F \le \frac{\varepsilon}{\sqrt{d-1}} \|A\|_F$,

where $\|A\|_F = \sqrt{\sum_{i_1,\ldots,i_d} A^2_{i_1,\ldots,i_d}}$.
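For reference, a simplified sketch of the rounding procedure (right-to-left orthogonalization followed by truncated SVDs, in the spirit of [Oseledets, 2011]). This is our reimplementation under the same core layout as the earlier sketches, not the author's code:

```python
import numpy as np

def tt_round(cores, eps=1e-10):
    """Reduce the TT-ranks of `cores` with relative accuracy eps."""
    cores = [g.copy() for g in cores]
    d = len(cores)
    # 1) right-to-left orthogonalization of cores d, ..., 2
    for k in range(d - 1, 0, -1):
        r_l, n, r_r = cores[k].shape
        q, r = np.linalg.qr(cores[k].reshape(r_l, n * r_r).T)
        cores[k] = q.T.reshape(-1, n, r_r)
        cores[k - 1] = np.tensordot(cores[k - 1], r.T, axes=(2, 0))
    # after this, the tensor norm sits entirely in the first core
    delta = eps * np.linalg.norm(cores[0]) / max(np.sqrt(d - 1), 1.0)
    # 2) left-to-right compression with truncated SVDs
    for k in range(d - 1):
        r_l, n, r_r = cores[k].shape
        u, s, vt = np.linalg.svd(cores[k].reshape(r_l * n, r_r), full_matrices=False)
        tail = np.cumsum(s[::-1] ** 2)                 # energy of the smallest values
        rank = max(1, s.size - np.searchsorted(tail, delta ** 2, side='right'))
        cores[k] = u[:, :rank].reshape(r_l, n, rank)
        carry = np.diag(s[:rank]) @ vt[:rank, :]
        cores[k + 1] = np.tensordot(carry, cores[k + 1], axes=(1, 0))
    return cores
```

A typical use is `tt_round(tt_sum(a_cores, b_cores), eps)` to keep the ranks from doubling after a sum.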


How to find TT-decomposition of a given tensor

Analytical formulas for special cases;

An exact algorithm based on SVD for medium-sized tensors. E.g. for a $5^8 \approx 400\,000$-element tensor it takes 8 ms on my laptop (a sketch follows this list);

For large tensors (e.g. $2^{50}$ elements), approximate algorithms that look at a fraction of the tensor elements: DMRG-cross [Savostyanov and Oseledets, 2011], AMEn-cross [Dolgov and Savostyanov, 2013].
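The exact SVD-based algorithm (TT-SVD) fits in a dozen lines of NumPy. A sketch with our own naming; the per-unfolding threshold eps/sqrt(d-1) * ||A|| keeps the relative Frobenius error below eps:

```python
import numpy as np

def tt_svd(tensor, eps=1e-10):
    """Decompose a full tensor into TT-cores via sequential truncated SVDs."""
    shape = tensor.shape
    d = len(shape)
    delta = eps * np.linalg.norm(tensor) / max(np.sqrt(d - 1), 1.0)
    cores, r_prev, c = [], 1, tensor
    for k in range(d - 1):
        c = c.reshape(r_prev * shape[k], -1)           # k-th unfolding
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        tail = np.cumsum(s[::-1] ** 2)                 # energy of the smallest values
        rank = max(1, s.size - np.searchsorted(tail, delta ** 2, side='right'))
        cores.append(u[:, :rank].reshape(r_prev, shape[k], rank))
        c = np.diag(s[:rank]) @ vt[:rank, :]           # carry the remainder to the right
        r_prev = rank
    cores.append(c.reshape(r_prev, shape[-1], 1))
    return cores

# e.g. the tensor A[i, j, k] = i + j + k (zero-based) is recovered with TT-ranks (2, 2):
cores = tt_svd(np.fromfunction(lambda i, j, k: i + j + k, (3, 4, 5)))
print([g.shape for g in cores])                        # [(1, 3, 2), (2, 4, 2), (2, 5, 1)]
```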


TT-format operations

Operation              Rank of the result
C = c · A              r(C) = r(A)
C = A + c              r(C) = r(A) + 1
C = A + B              r(C) ≤ r(A) + r(B)
C = A ⊙ B              r(C) ≤ r(A) · r(B)
C = round(A, ε)        r(C) ≤ r(A)
sum(A)                 –
‖A‖F                   –

(⊙ denotes the element-wise product.)

(Ask me about differential equations)
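As one example of how such operations are implemented, the element-wise product corresponds to taking Kronecker products of matching core slices, which is exactly why the ranks multiply. A sketch (our naming, same core layout as above):

```python
import numpy as np

def tt_hadamard(cores_a, cores_b):
    """TT-cores of the element-wise product C = A ⊙ B."""
    out = []
    for ga, gb in zip(cores_a, cores_b):
        n = ga.shape[1]
        # slice-wise Kronecker products: r(C) = r(A) * r(B)
        out.append(np.stack([np.kron(ga[:, i, :], gb[:, i, :]) for i in range(n)], axis=1))
    return out
```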


Example application: TensorNet

1. Neural networks use fully-connected layers: $y = f(Wx + b)$.
2. The matrix $W$ has millions of parameters.
3. Let's store and train the matrix $W$ in the TT-format.

This can't work for general matrices, but for the VGG-16 network we compressed a 4096 × 4096 matrix to 320 parameters without loss of accuracy.
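One way to see how a dense layer becomes a TT object: reshape the weight matrix into a tensor whose k-th mode groups the k-th "digits" of the row and column indices, then compress it with TT-SVD. A rough sketch reusing the hypothetical `tt_svd` from above, on a small 256 × 256 stand-in rather than an actual VGG-16 layer (this only illustrates the mechanics, not the reported compression):

```python
import numpy as np

m = n = (4, 4, 4, 4)                                   # 4**4 = 256 rows and columns
W = np.random.randn(256, 256)                          # stand-in for a trained layer
T = W.reshape(m + n)                                   # axes (m1, ..., m4, n1, ..., n4)
perm = [a for k in range(4) for a in (k, 4 + k)]       # interleave row and column modes
T = T.transpose(perm).reshape([mk * nk for mk, nk in zip(m, n)])
cores = tt_svd(T, eps=0.1)                             # coarse TT approximation of the layer
print([g.shape for g in cores], sum(g.size for g in cores), "parameters")
```

In TensorNet the cores themselves are trained, so the full matrix $W$ is never materialized.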


Linear model

Model:
$$y(x) = w^\top x + b, \qquad b \in \mathbb{R},\; w \in \mathbb{R}^d.$$

Loss function:
$$\sum_{k=1}^{N} \ell\big(w^\top x^{(k)} + b,\, y^{(k)}\big).$$

Covers linear regression, logistic regression, linear SVM, ...


Need for interactions

Linear models give everyone the same recommendations.
The same story holds, e.g., in bag-of-words text tasks.
Use interactions (products of features)!


Models with interactions

$$y(x) = b + w^\top x + \sum_{i,j} P_{ij}\, x_i x_j, \qquad b \in \mathbb{R},\; w \in \mathbb{R}^d,\; P \in \mathbb{R}^{d \times d}.$$

For $d$ features there are $d^2$ parameters: overfitting on sparse data.

Inference complexity is also $O(d^2)$.

For recommender systems $d$ is in the millions.

An SVM with a polynomial kernel has the same drawbacks.


Factorization machines

$$y(x) = b + w^\top x + \sum_{i,j} P_{ij}\, x_i x_j.$$

Factorization machines [Rendle 2010] use rank $r$ for $P$:
$$y(x) = b + w^\top x + \sum_{i,j} \Big( \sum_{f=1}^{r} V_{if} V_{jf} \Big) x_i x_j, \qquad b \in \mathbb{R},\; w \in \mathbb{R}^d,\; V \in \mathbb{R}^{d \times r}.$$

The matrix $P = V V^\top$ is not sparse, but structured (low rank); see the sketch below.
Control the number of parameters with $r$.
Can represent almost any matrix with large $r$.
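The low-rank structure also makes inference cheap: the quadratic term equals $x^\top V V^\top x = \|V^\top x\|^2$, so a prediction costs $O(dr)$. A minimal sketch (our naming):

```python
import numpy as np

def fm_predict(x, b, w, V):
    """Second-order factorization machine score for one example x of length d."""
    z = V.T @ x                     # shape (r,), costs O(d r)
    return b + w @ x + z @ z        # b + w^T x + sum_{i,j} (V V^T)_{ij} x_i x_j
```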


High order analysis

Factorization machines model (3rd order):

$$y(x) = b + w^\top x + \sum_{i,j} \Big( \sum_{f=1}^{r} V_{if} V_{jf} \Big) x_i x_j + \sum_{i,j,k} \Big( \sum_{f=1}^{r} U_{if} U_{jf} U_{kf} \Big) x_i x_j x_k.$$

In fact, factorization machines just use the CP-decomposition for the weight tensor $P_{i,j,k}$:
$$P_{ijk} = \sum_{f=1}^{r} U_{if} U_{jf} U_{kf}.$$

But they converge poorly with high order, and the complexity of inference and learning grows with the order.


Exponential machines

Let's encode interactions by a binary code: every bit indicates whether the corresponding feature is included in the current interaction. Exponential machines example (d = 3):

$$\begin{aligned}
y(x) = {} & W_{000} + W_{100}\, x_1 + W_{010}\, x_2 + W_{001}\, x_3 \\
& + W_{110}\, x_1 x_2 + W_{101}\, x_1 x_3 + W_{011}\, x_2 x_3 \\
& + W_{111}\, x_1 x_2 x_3.
\end{aligned}$$

In general:
$$y(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} W_{i_1,\ldots,i_d}\, x_1^{i_1} \cdots x_d^{i_d}, \qquad W \in \mathbb{R}^{2 \times \ldots \times 2} \text{ with TT-rank } r.$$

Captures all $2^d$ interactions.
Control the number of parameters with the TT-rank $r$.
Can represent any polynomial function with large $r$.



Exponential machines inference

Linear (in $d$) $O(r^2 d)$ inference:

$$\begin{aligned}
y(x) &= \sum_{i_1,\ldots,i_d} G_1[i_1] \cdots G_d[i_d] \Big( \prod_{k=1}^{d} x_k^{i_k} \Big) \\
&= \sum_{i_1,\ldots,i_d} x_1^{i_1} G_1[i_1] \cdots x_d^{i_d} G_d[i_d] \\
&= \Big( \sum_{i_1=0}^{1} x_1^{i_1} G_1[i_1] \Big) \cdots \Big( \sum_{i_d=0}^{1} x_d^{i_d} G_d[i_d] \Big) \\
&= \underbrace{A_1}_{1 \times r}\, \underbrace{A_2}_{r \times r} \cdots \underbrace{A_d}_{r \times 1}.
\end{aligned}$$
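This derivation is exactly how the prediction can be computed in code: collapse each core over $i_k \in \{0, 1\}$ and multiply the resulting small matrices. A sketch under the usual (r_{k-1}, 2, r_k) core layout (our naming, not the reference implementation):

```python
import numpy as np

def exm_predict(cores, x):
    """Exponential machines prediction y(x) in O(d r^2) time."""
    res = np.ones((1, 1))
    for g, xk in zip(cores, x):
        a_k = g[:, 0, :] + xk * g[:, 1, :]   # A_k = G_k[0] + x_k * G_k[1]
        res = res @ a_k
    return res[0, 0]
```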


Exponential machines learning

$$\underset{W}{\text{minimize}} \;\; \sum_{k=1}^{N} \ell\big(\langle W, X^{(k)} \rangle,\, y^{(k)}\big) \quad \text{subject to} \quad \operatorname{TT-rank}(W) = r_0.$$

1. Autodiff to compute gradients with respect to the TT-cores $G_k$, OR
2. Riemannian optimization.

Theorem [Holtz, 2012]: the set of all $d$-dimensional tensors with fixed TT-rank $r$,
$$\mathcal{M}_r = \{ W \in \mathbb{R}^{2 \times \ldots \times 2} : \operatorname{TT-rank}(W) = r \},$$
forms a Riemannian manifold.


Riemannian optimization

[Figure: one step of Riemannian optimization. From the current point $W_t$ on the manifold $\mathcal{M}_r$, the antigradient $-\partial L / \partial W_t$ is projected onto the tangent space $T_{W_t}\mathcal{M}_r$, giving $-G_t$; a step along $-G_t$ is followed by TT-rounding back to $\mathcal{M}_r$, yielding $W_{t+1}$.]


Riemannian optimization Cont’d

Loss function:
$$L(W) = \sum_{k=1}^{N} \ell\big(\langle W, X^{(k)} \rangle,\, y^{(k)}\big).$$

Gradient:
$$\frac{\partial L}{\partial W} = \sum_{k=1}^{N} \frac{\partial \ell}{\partial y}\, X^{(k)},$$

where the data tensor $X$ is of TT-rank 1!
$$X_{i_1 \ldots i_d} = \prod_{k=1}^{d} x_k^{i_k}.$$
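The rank-1 structure of the data tensor is easy to write down: each TT-core is a 1 × 2 × 1 array with slices 1 and $x_k$, and $\langle W, X \rangle$ then reduces to the same chain of small matrix products as in the inference formula above. A sketch (our naming):

```python
import numpy as np

def data_tensor_cores(x):
    """Rank-1 TT-cores of X_{i1...id} = prod_k x_k^{i_k}."""
    return [np.array([1.0, xk]).reshape(1, 2, 1) for xk in x]
```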


Experiments: optimization

[Figure: training loss vs. time (log-log axes) on (a) the Car dataset and (b) the HIV dataset, comparing gradient descent and stochastic gradient descent (mini-batches of 100 and 500) over the TT-cores ("Cores GD", "Cores SGD 100/500") with their Riemannian counterparts ("Riemann GD", "Riemann 100/500"), plus Riemannian GD with random initialization.]


Experiments: classification

1. We generated $10^5$ train and $10^5$ test objects with $d = 30$ features.
2. $X_{ij} \sim U\{-1, +1\}$.
3. Ground truth for 3 interactions of order 2:
$$y(x) = \varepsilon_1 x_1 x_5 + \varepsilon_2 x_3 x_8 + \varepsilon_3 x_4 x_5, \qquad \varepsilon_1, \varepsilon_2, \varepsilon_3 \sim U(-1, 1).$$
4. We used 20 interactions of order 6.

Method           Test AUC       Training time (s)   Inference time (s)
Log. reg.        0.50 ± 0.0     0.4                 0.0
RF               0.55 ± 0.0     21.4                1.3
SVM RBF          0.50 ± 0.0     2262.6              1076.1
SVM poly. 2      0.50 ± 0.0     1152.6              852.0
SVM poly. 6      0.56 ± 0.0     4090.9              754.8
2-nd order FM    0.50 ± 0.0     638.2               0.1
6-th order FM    0.57 ± 0.05    1412.0              0.2
ExM rank 2       0.54 ± 0.05    198.4               0.1
ExM rank 4       0.69 ± 0.02    443.0               0.1
ExM rank 8       0.75 ± 0.02    998.3               0.2


Conclusion

The Tensor Train decomposition compactly represents tensors.

Can parametrize machine learning models with TT-tensors.

E.g. the weights of a neural network.

Or modeling all $2^d$ interactions (products of features).

Control the number of underlying parameters via TT-rank.

Learning with Riemannian optimization sometimes outperforms SGD on the TT-cores.

There is Python code for everything: TT, TensorNet, and Exponential Machines.
