ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY
BEYOND MATRIX COMPLETION
Based on joint work with Boaz Barak (Harvard)
Part I:
MATRIX COMPLETION
THE NETFLIX PROBLEM
users × movies, ratings of 1-5 stars
480,189 users, 17,770 movies, ~1% of entries observed
Can we (approximately) fill in the missing entries?
MATRIX COMPLETION
Let M be an unknown, approximately low-rank matrix:
M ≈ (comedy) + (drama) + … + (sports), a sum of a few rank-one factors, one per genre
Model: we are given random observations M_{i,j} for all (i,j) ∈ Ω
Is there an efficient algorithm to recover M?
MATRIX COMPLETION
The natural formulation is non-convex, and NP-hard:
min rank(X)  s.t.  (1/|Ω|) Σ_{(i,j)∈Ω} |X_{i,j} − M_{i,j}| ≤ η
There is a powerful, convex relaxation…
THE NUCLEAR NORM
Consider the singular value decomposition of X:
X = U Σ Vᵀ   (U orthogonal, Σ diagonal, Vᵀ orthogonal)
Let σ₁ ≥ σ₂ ≥ … ≥ σᵣ > σᵣ₊₁ = … = σₘ = 0 be the singular values
Then rank(X) = r, and ‖X‖_* = σ₁ + σ₂ + … + σᵣ   (the nuclear norm)
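A minimal numpy sketch of these definitions (our illustration, not part of the talk): rank(X) counts the nonzero singular values, and the nuclear norm is their sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3

# Build a rank-3 matrix as a sum of three rank-one factors.
X = sum(np.outer(rng.standard_normal(n), rng.standard_normal(n)) for _ in range(r))

sigma = np.linalg.svd(X, compute_uv=False)    # singular values, descending
print("numerical rank:", int(np.sum(sigma > 1e-10)))             # -> 3
print("nuclear norm:", sigma.sum())
print("agrees with numpy:", np.isclose(sigma.sum(), np.linalg.norm(X, ord="nuc")))
```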
[Fazel], [Srebro, Shraibman], [Recht, Fazel, Parrilo], [Candes, Recht], [Candes, Tao], [Candes, Plan], [Recht], …
This yields a convex relaxation that can be solved efficiently:
(P)  min ‖X‖_*  s.t.  (1/|Ω|) Σ_{(i,j)∈Ω} |X_{i,j} − M_{i,j}| ≤ η
Theorem: If M is n × n and has rank r, and is C-incoherent, then (P) recovers M exactly from C⁶ nr log²n observations
This is nearly optimal, since there are O(nr) parameters
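As a hedged illustration of (P), the sketch below solves the nuclear-norm relaxation with cvxpy on a small synthetic instance; the solver, sampling rate, and sizes are our assumptions, not the talk's.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r ground truth

obs = np.argwhere(rng.random((n, n)) < 0.5)   # observed positions Omega
rows, cols = obs[:, 0], obs[:, 1]
eta = 1e-3

X = cp.Variable((n, n))
# Average absolute deviation on Omega, as in the constraint of (P).
avg_dev = cp.sum(cp.abs(X[rows, cols] - M[rows, cols])) / len(obs)
prob = cp.Problem(cp.Minimize(cp.norm(X, "nuc")), [avg_dev <= eta])
prob.solve()
print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```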
Robust PCA [Candes et al.], [Chandrasekaran et al.], …
Can we recover a low-rank matrix from sparse corruptions?
Superresolution, compressed sensing off-the-grid [Candes, Fernandez-Granda], [Tang et al.], …
Can we recover well-separated points (point sources f1, f2, f3, f4 in the figure) from low-frequency measurements?
Higher order structure?
Part II:
TENSOR COMPLETION
TENSOR COMPLETION
Can using more than two attributes lead to better recommendations?
e.g. Groupon: users × activities × time
time: season, time of day, weekday/weekend, etc.
T ≈ Σ_{i=1}^{r} a_i ⊗ b_i ⊗ c_i   (e.g. one factor for snow sports)
Can we (approximately) fill in the missing entries?
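A small numpy sketch of this model (the sizes and random factors are our assumptions): a rank-r third-order tensor over users × activities × time, with only ~1% of entries observed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_activities, n_times, r = 40, 30, 20, 3

A = rng.standard_normal((n_users, r))
B = rng.standard_normal((n_activities, r))
C = rng.standard_normal((n_times, r))

# T[i,j,k] = sum_l A[i,l] * B[j,l] * C[k,l], i.e. T = sum_l a_l (x) b_l (x) c_l
T = np.einsum("il,jl,kl->ijk", A, B, C)

mask = rng.random(T.shape) < 0.01        # the observed set Omega
print("observed:", int(mask.sum()), "of", T.size, "entries")
```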
THE TROUBLE WITH TENSORS
Natural approach (suggested by many authors):
(P)  min ‖X‖_*  s.t.  (1/|Ω|) Σ_{(i,j,k)∈Ω} |X_{i,j,k} − T_{i,j,k}| ≤ η   (tensor nuclear norm)
[Gurvits], [Liu], [Harrow, Montanaro]: The tensor nuclear norm is NP-hard to compute!
In fact, most of the linear algebra toolkit is ill-posed, or computationally hard, for tensors…
e.g. [Hillar, Lim] "Most Tensor Problems are NP-Hard":
Table I. Tractability of Tensor Problems
Problem Complexity
Bivariate Matrix Functions over R, C Undecidable (Proposition 12.2)
Bilinear System over R, C NP-hard (Theorems 2.6, 3.7, 3.8)
Eigenvalue over R NP-hard (Theorem 1.3)
Approximating Eigenvector over R NP-hard (Theorem 1.5)
Symmetric Eigenvalue over R NP-hard (Theorem 9.3)
Approximating Symmetric Eigenvalue over R NP-hard (Theorem 9.6)
Singular Value over R, C NP-hard (Theorem 1.7)
Symmetric Singular Value over R NP-hard (Theorem 10.2)
Approximating Singular Vector over R, C NP-hard (Theorem 6.3)
Spectral Norm over R NP-hard (Theorem 1.10)
Symmetric Spectral Norm over R NP-hard (Theorem 10.2)
Approximating Spectral Norm over R NP-hard (Theorem 1.11)
Nonnegative Definiteness NP-hard (Theorem 11.2)
Best Rank-1 Approximation NP-hard (Theorem 1.13)
Best Symmetric Rank-1 Approximation NP-hard (Theorem 10.2)
Rank over R or C NP-hard (Theorem 8.2)
Enumerating Eigenvectors over R #P-hard (Corollary 1.16)
Combinatorial Hyperdeterminant NP-, #P-, VNP-hard (Theorems 4.1, 4.2, Corollary 4.3)
Geometric Hyperdeterminant Conjectures 1.9, 13.1
Symmetric Rank Conjecture 13.2
Bilinear Programming Conjecture 13.4
Bilinear Least Squares Conjecture 13.5
Note: Except for positive definiteness and the combinatorial hyperdeterminant, which apply to 4-tensors, all problems refer to the 3-tensor case.
FLATTENING A TENSOR
Many tensor methods rely on flattening:
flat(T) is the n₁ × n₂n₃ matrix (users × (activities, time)) with entries flat(T)_{i,(j,k)} = T_{i,j,k}
This is a rearrangement of the entries, into a matrix, that does not increase its rank:
flat( Σ_{i=1}^{r} a_i ⊗ b_i ⊗ c_i ) = Σ_{i=1}^{r} a_i vec(b_i c_iᵀ)ᵀ
where vec(b_i c_iᵀ) is an n₂n₃-dimensional vector
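In numpy terms, flattening is just a reshape; a quick sketch (our illustration) checks both that the flattened matrix has rank at most r and that the identity above holds entry-by-entry.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, n3, r = 20, 15, 10, 4
A = rng.standard_normal((n1, r))
B = rng.standard_normal((n2, r))
C = rng.standard_normal((n3, r))
T = np.einsum("il,jl,kl->ijk", A, B, C)

flat_T = T.reshape(n1, n2 * n3)          # row i holds T[i, :, :] flattened
print("rank of flat(T):", np.linalg.matrix_rank(flat_T))   # at most r -> 4

# flat(sum a_i (x) b_i (x) c_i) = sum a_i vec(b_i c_i^T)^T, checked directly:
flat_direct = sum(np.outer(A[:, l], np.outer(B[:, l], C[:, l]).ravel())
                  for l in range(r))
print("identity holds:", np.allclose(flat_T, flat_direct))
```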
Let n₁ = n₂ = n₃ = n. We would need O(n²r) observations to fill in flat(T)
There are many other variants of flattening, but with comparable guarantees:
[Liu, Musialski, Wonka, Ye], [Gandy, Recht, Yamada], [Signoretto, De Lathauwer, Suykens], [Tomioka, Hayashi, Kashima], [Mu, Huang, Wright, Goldfarb], …
Can we beat flattening?
Can we make better predictions than we do by treating each activity × time pair as unrelated?
Part III:
Nearly optimal algorithms for noisy tensor completion
OUR RESULTS
Suppose we are given |Ω| = m noisy observations from T:
T = Σ_{i=1}^{r} σ_i a_i ⊗ b_i ⊗ c_i + noise (bounded by η), with |σ_i|, ‖a_i‖∞, ‖b_i‖∞, ‖c_i‖∞ ≤ C
Theorem: There is an efficient algorithm that with probability 1 − δ outputs X with
(1/n³) Σ_{i,j,k} |X_{i,j,k} − T_{i,j,k}| ≤ 2C³r √(n^{3/2} log⁴n / m) + C³r √(ln(2/δ) / m) + 2η
SOME REMARKS
In many settings, a 1 − o(1) fraction of the entries of T are at least r^{1/2}/polylog(n) in magnitude
When m = Ω(n^{3/2}r), the average error is asymptotically smaller, and hence X_{i,j,k} = (1 ± o(1)) T_{i,j,k} for a 1 − o(1) fraction of the entries
"Almost all of the entries, almost entirely correct"
For r = n^{3/2−δ} (highly overcomplete), we only need to observe an n^{−δ} fraction of the entries
Algorithms for decomposing such tensors given all the entries need stronger assumptions (e.g. that the factors are random)
LOWER BOUNDS
Not only is the tensor nuclear norm hard to compute, but…
noisy tensor completion with m observations ⟹ refute random 3-SAT with m clauses,
which is believed to be hard if m = n^{3/2−δ}
Part IV:
Matrix completion revisited: Connections to random CSPs
Can we distinguish between low-rank and random?
Case #1: Approximately low-rank. M = aaᵀ + noise, where each a_i = ±1:
for each (i,j) ∈ Ω, M_{i,j} = a_i a_j w/ probability ¾, and a random ±1 w/ probability ¼
Case #2: Random. For each (i,j) ∈ Ω, M_{i,j} = random ±1
In Case #1 the entries are (somewhat) predictable, but in Case #2 they are completely unpredictable
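A hedged sketch of this distinguishing problem: in Case #1 the observed matrix correlates with aaᵀ, so its top singular value stands out. Thresholding on the spectral norm is our illustration of the test, not necessarily the talk's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 20000                         # m = omega(n) observations

a = rng.choice([-1, 1], size=n)
idx = rng.integers(0, n, size=(m, 2))     # random observed positions (i, j)

def observed_matrix(case):
    M = np.zeros((n, n))
    for i, j in idx:
        if case == 1 and rng.random() < 0.75:
            M[i, j] = a[i] * a[j]         # planted low-rank signal
        else:
            M[i, j] = rng.choice([-1, 1]) # pure noise
    return M

for case in (1, 2):
    top = np.linalg.svd(observed_matrix(case), compute_uv=False)[0]
    print(f"Case #{case}: top singular value = {top:.1f}")
```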
There are two very different communities that (essentially) attacked this same distinguishing problem:
the community working on matrix completion, and the community working on refuting random CSPs
AN INTERPRETATION
We can interpret (i₁, j₁; σ₁), (i₂, j₂; σ₂), …, (i_m, j_m; σ_m) with ±1 r.v.'s as a random 2-XOR formula ψ (and vice-versa)
In particular, each observation/function value maps to a clause:
(i, j, σ)  ↦  v_i · v_j = σ   (variables v_i, v_j; right-hand side σ is the constraint)
STRONG REFUTATION
We will say that an algorithm strongly refutes* random 2-XOR with m clauses if:
(1) on any 2-XOR formula ψ, it outputs val(ψ) with OPT(ψ) ≤ val(ψ), where OPT(ψ) is the largest fraction of clauses of ψ that can be satisfied
(2) with high probability (for random ψ with m clauses): val(ψ) = 1/2 + o(1)
Lemma: If (i₁, j₁; σ₁), …, (i_m, j_m; σ_m) ∼ ψ and A is the n × n matrix with entry σ_a in position (i_a, j_a) for each clause, then
OPT(ψ) − 1/2 ≤ (n / 2m) ‖A‖
Proof: Map the assignment to a unit vector x so that x_i = ±1/√n, and take the quadratic form xᵀAx on A
For random ψ, (1/m)‖A‖ ≤ 1/√(mn), so OPT(ψ) ≤ 1/2 + o(1) whenever m = ω(n)
This solves the strong refutation problem…
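A sketch of the resulting certificate, using the bound as reconstructed above (the constant in front of ‖A‖ is our reading of the garbled slide): val = 1/2 + (n/2m)·‖A‖ upper-bounds OPT(ψ) and tends to 1/2 once m = ω(n).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 300, 30000                       # m = omega(n) random clauses

i_idx = rng.integers(0, n, size=m)
j_idx = rng.integers(0, n, size=m)
sigma = rng.choice([-1, 1], size=m)

A = np.zeros((n, n))
np.add.at(A, (i_idx, j_idx), sigma)     # entry sigma_a at position (i_a, j_a)

spec = np.linalg.svd(A, compute_uv=False)[0]
print(f"certified bound: OPT(psi) <= {0.5 + (n / (2 * m)) * spec:.3f}")
```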
There are two very different communities that (essentially) attacked this same distinguishing problem:
the community working on matrix completion, and the community working on refuting random CSPs
The same spectral bound implies:
(1) an algorithm for strongly refuting random 2-XOR
(2) an algorithm for the distinguishing problem
(3) generalization bounds for the nuclear norm
It also yields bounds on how well the solution to the convex program generalizes [Srebro, Shraibman]…
min ‖X‖_*  s.t.  (1/|Ω|) Σ_{(i,j)∈Ω} |X_{i,j} − M_{i,j}| ≤ η
An approach through statistical learning theory:
empirical error: (1/|Ω|) Σ_{(i,j)∈Ω} |X_{i,j} − M_{i,j}|   (≤ η)
prediction error: (1/n²) Σ_{i,j} |X_{i,j} − M_{i,j}|
Then if we let K = { X s.t. ‖X‖_* ≤ 1 } = conv{ abᵀ s.t. ‖a‖, ‖b‖ ≤ 1 },
generalization error: sup_{X∈K} | empirical error(X) (on Ω) − prediction error(X) |
Theorem: "generalization error ≤ best agreement with a random function"   (the Rademacher complexity)
More precisely:
Rademacher complexity: R_m(K) = E sup_{X∈K} (1/m) | Σ_{a=1}^{m} σ_a X_{i_a,j_a} | = (1/m) ‖A‖
Since (1/m)‖A‖ ≤ 1/√(mn), we get R_m(K) = o(1/n) whenever m = ω(n)
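Spelling out the duality step (our own reconstruction of the calculation behind these slides): the spectral norm is dual to the nuclear norm, so the supremum over K = {X : ‖X‖_* ≤ 1} is exactly ‖A‖/m.

```latex
\[
  R_m(K)
  = \mathop{\mathbb{E}}_{\sigma} \sup_{X \in K}
      \frac{1}{m}\Bigl|\sum_{a=1}^{m} \sigma_a X_{i_a, j_a}\Bigr|
  = \frac{1}{m}\,\mathop{\mathbb{E}}_{\sigma}
      \sup_{\|X\|_* \le 1} \bigl|\langle A, X\rangle\bigr|
  = \frac{1}{m}\,\mathop{\mathbb{E}}_{\sigma} \|A\|,
  \qquad A = \sum_{a=1}^{m} \sigma_a\, e_{i_a} e_{j_a}^{\mathsf{T}}.
\]
\[
  \frac{1}{m}\|A\| \le \frac{1}{\sqrt{mn}}
  \;\Longrightarrow\;
  R_m(K) = o\!\left(\frac{1}{n}\right)
  \text{ whenever } m = \omega(n).
\]
```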
There are two very different communities that (essentially) attacked this same distinguishing problem:
the community working on matrix completion, and the community working on refuting random CSPs
noisy matrix completion with m observations ⟷ (Rademacher complexity) ⟷ strongly refute* random 2-XOR/2-SAT with m clauses
noisy tensor completion with m observations ⟷ (Rademacher complexity) ⟷ strongly refute* random 3-XOR/3-SAT with m clauses [Coja-Oghlan, Goerdt, Lanka]
*Want an algorithm that certifies a formula is far from satisfiable
[Coja-Oghlan, Goerdt, Lanka]: Reduce 3-XOR to 4-XOR
Pair up clauses that share their first variable, and multiply them to eliminate it:
(vi · vj · vk = σ) and (vi · vj′ · vk′ = σ′)  ⟹  (vj · vk · vj′ · vk′ = σ · σ′)
This yields n⁵p² = n² log^{O(1)}n clauses
Warning: The 4-XOR clauses are not independent!
Build the matrix A with rows indexed by pairs (j, j′), columns indexed by pairs (k, k′), and entry σ · σ′ for each derived clause
Hence the paired variables for the rows (and columns) come from different clauses!
A + trace method
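A sketch of the pairing step (our illustration only; the actual analysis also needs the trace method to bound ‖A‖ under the dependencies noted above).

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(6)
n, m = 25, 2000

# Random 3-XOR clauses (i, j, k; sigma).
clauses = [(int(rng.integers(n)), int(rng.integers(n)), int(rng.integers(n)),
            int(rng.choice([-1, 1]))) for _ in range(m)]

by_first = defaultdict(list)
for i, j, k, s in clauses:
    by_first[i].append((j, k, s))

# Multiply pairs of clauses sharing v_i: entry sigma*sigma' goes to
# row (j, j'), column (k, k') of an n^2 x n^2 matrix.
A = np.zeros((n * n, n * n))
for group in by_first.values():
    for (j, k, s), (j2, k2, s2) in combinations(group, 2):
        A[j * n + j2, k * n + k2] += s * s2

derived = sum(len(g) * (len(g) - 1) // 2 for g in by_first.values())
print("derived 4-XOR clauses:", derived)
print("spectral norm of A:", round(float(np.linalg.svd(A, compute_uv=False)[0]), 2))
```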
There are two very different communities that (essentially) attacked this same distinguishing problem:
the community working on matrix completion, and the community working on refuting random CSPs
strongly refute* random 3-XOR/3-SAT with m clauses [Coja-Oghlan, Goerdt, Lanka] ⟹ (embedding in SOS) ⟹ noisy tensor completion with m observations
*Want an algorithm that certifies a formula is far from satisfiable
We then embed this algorithm into the sixth level of the sum-of-squares hierarchy, to get a relaxation for tensor prediction
GENERALIZATION BOUNDS
Suppose we are given |Ω| = m noisy observations T_{i,j,k} ± η, and the factors of T are C-incoherent
Theorem: There is an efficient algorithm that with probability 1 − δ outputs X with
(1/n³) Σ_{i,j,k} |X_{i,j,k} − T_{i,j,k}| ≤ 2C³r √(n^{3/2} log⁴n / m) + C³r √(ln(2/δ) / m) + 2η
This comes from giving an efficiently computable norm ‖·‖_K whose Rademacher complexity is asymptotically smaller than the trivial bound whenever m = Ω(n^{3/2} log⁴n)
SUMMARY
New Algorithm: We gave an algorithm for 3rd-order tensor prediction that uses m = n^{3/2} r log⁴n observations
An Inefficient Algorithm: There is an inefficient algorithm (via the tensor nuclear norm) that uses m = nr log n observations
A Phase Transition: Even for n^δ rounds of the powerful sum-of-squares hierarchy, no norm solves tensor prediction with m = n^{3/2−δ} r observations
Epilogue:
New directions in computational vs. statistical tradeoffs
DISCUSSION
Convex programs are unreasonably effective for linear inverse problems!
But we gave simple linear inverse problems that exhibit striking gaps between efficient and inefficient estimators
Where else are there computational vs. statistical tradeoffs?
New Direction: Explore computational vs. statistical tradeoffs through the powerful sum-of-squares hierarchy