BEYONDMATRIXCOMPLETION& - People | MIT CSAIL

ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY

BEYOND MATRIX COMPLETION

Based on joint work with Boaz Barak (Harvard)

Matrix compleMon

Part I:

THE NETFLIX PROBLEM

movies

THE NETFLIX PROBLEM

movies

1-‐5 stars

THE NETFLIX PROBLEM

movies

480,189

17,770

THE NETFLIX PROBLEM

movies

480,189 1%

observed

17,770

THE NETFLIX PROBLEM

movies

480,189 1%

observed

17,770

Can we (approximately) fill-‐in the missing entries?

MATRIX COMPLETION Let M be an unknown, approximately low-‐rank matrix

≈ + … + M +

comedy drama

≈ + … + M +

comedy drama sports

≈ + … + M +

comedy drama sports

Model: we are given random observaMons Mi,j for all i,j Ω

≈ + … + M +

comedy drama sports

Model: we are given random observaMons Mi,j for all i,j Ω

Is there an efficient algorithm to recover M?

MATRIX COMPLETION

The natural formulaMon is non-‐convex, and NP-‐hard

min rank(X) s.t. (i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

MATRIX COMPLETION

The natural formulaMon is non-‐convex, and NP-‐hard

min rank(X) s.t. (i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

There is a powerful, convex relaxaMon…

THE NUCLEAR NORM

Consider the singular value decomposi;on of X:

X = U Σ VT

orthogonal orthogonal diagonal

THE NUCLEAR NORM

X = U Σ VT

Let σ1 ≥ σ2 ≥ … σr > σr+1 = … σm = 0 be the singular values

THE NUCLEAR NORM

X = U Σ VT

Let σ1 ≥ σ2 ≥ … σr > σr+1 = … σm = 0 be the singular values

Then rank(X) = r, and X = σ1 + σ2 + … + σr * (nuclear norm)

[Fazel], [Srebro, Shraibman], [Recht, Fazel, Parrilo], [Candes, Recht], [Candes, Tao], [Candes, Plan], [Recht],

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

This yields a convex relaxaMon, that can be solved efficiently:

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

Theorem: If M is n x n and has rank r, and is C-‐incoherent then (P) recovers M exactly from C6nrlog2n observaMons

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

This is nearly opMmal, since there are O(nr) parameters

Theorem: If M is n x n and has rank r, and is C-‐incoherent then (P) recovers M exactly from C6nrlog2n observaMons

Robust PCA [Candes et al.], [Chandrasekaran et al.], … Can we recover a low rank matrix from sparse corrupMons?

Superresolu;on, compressed sensing off-‐the-‐grid [Candes, Fernandez-‐Granda], [Tang et al.], …

Can we recover well-‐separated points from low-‐frequency measurements?

f1 f2 f3 f4

Higher order structure?

Part II:

TENSOR COMPLETION

Can using more than two auributes can lead to beuer recommendaMons?

TENSOR COMPLETION

e.g. Groupon

ac;vi;es

TENSOR COMPLETION users

;me: season, Mme of day, weekday/weekend, etc

e.g. Groupon

TENSOR COMPLETION

T ≈ i = 1

snow sports

;me: season, Mme of day, weekday/weekend, etc

e.g. Groupon

TENSOR COMPLETION

T ≈ ai bi ci × × i = 1

TENSOR COMPLETION

T ≈ ai bi ci × × i = 1

Can we (approximately) fill-‐in the missing entries?

THE TROUBLE WITH TENSORS Natural approach (suggested by many authors):

(P) min X s.t. * (i,j,k) Ω

|Xi,j,k–Ti,j,k| ≤ η 1 |Ω|

tensor nuclear norm

THE TROUBLE WITH TENSORS

(P) min X s.t. * (i,j,k) Ω

|Xi,j,k–Ti,j,k| ≤ η 1 |Ω|

tensor nuclear norm

[Gurvits], [Liu], [Harrow, Montanaro]

The tensor nuclear norm is NP-‐hard to compute!

Natural approach (suggested by many authors):

In fact, most of the linear algebra toolkit is ill-‐posed, or computa;onally hard for tensors…

e.g. [Hillar, Lim] “Most Tensor Problems are NP-‐Hard”

Most Tensor Problems are NP-Hard 0:3

Table I. Tractability of Tensor Problems

Problem Complexity

Bivariate Matrix Functions over R, C Undecidable (Proposition 12.2)

Bilinear System over R, C NP-hard (Theorems 2.6, 3.7, 3.8)

Eigenvalue over R NP-hard (Theorem 1.3)

Approximating Eigenvector over R NP-hard (Theorem 1.5)

Symmetric Eigenvalue over R NP-hard (Theorem 9.3)

Approximating Symmetric Eigenvalue over R NP-hard (Theorem 9.6)

Singular Value over R, C NP-hard (Theorem 1.7)

Symmetric Singular Value over R NP-hard (Theorem 10.2)

Approximating Singular Vector over R, C NP-hard (Theorem 6.3)

Spectral Norm over R NP-hard (Theorem 1.10)

Symmetric Spectral Norm over R NP-hard (Theorem 10.2)

Approximating Spectral Norm over R NP-hard (Theorem 1.11)

Nonnegative Definiteness NP-hard (Theorem 11.2)

Best Rank-1 Approximation NP-hard (Theorem 1.13)

Best Symmetric Rank-1 Approximation NP-hard (Theorem 10.2)

Rank over R or C NP-hard (Theorem 8.2)

Enumerating Eigenvectors over R #P-hard (Corollary 1.16)

Combinatorial Hyperdeterminant NP-, #P-, VNP-hard (Theorems 4.1 , 4.2, Corollary 4.3)

Geometric Hyperdeterminant Conjectures 1.9, 13.1

Symmetric Rank Conjecture 13.2

Bilinear Programming Conjecture 13.4

Bilinear Least Squares Conjecture 13.5

Note: Except for positive definiteness and the combinatorial hyperdeterminant, which apply to 4-tensors,all problems refer to the 3-tensor case.

and n be positive integers. For the purposes of this article, a 3-tensor A over F is anl ⇥m⇥ n array of elements of F:

A = Jaijk

Kl,m,n

i,j,k=1 2 Fl⇥m⇥n. (1)

These objects are natural multilinear generalizations of matrices in the following way.For any positive integer d, let e1, . . . , ed denote the standard basis1 in the F-vector

space Fd. A bilinear function f : Fm⇥Fn ! F can be encoded by a matrix A = [aij

i,j=1 2Fm⇥n, in which the entry a

records the value of f(ei

) 2 F. By linearity in eachcoordinate, specifying A determines the values of f on all of Fm ⇥ Fn; in fact, we havef(u,v) = u

>Av for any vectors u 2 Fm and v 2 Fn. Thus, matrices both encode 2-dimensional arrays of numbers and specify all bilinear functions. Notice also that ifm = n and A = A> is symmetric, then

f(u,v) = u

>Av = (u

>A>u = v

>Au = f(v,u).

Thus, symmetric matrices are bilinear maps invariant under coordinate exchange.

1Formally, ei is the vector in Fd with a 1 in the ith coordinate and zeroes everywhere else. In this article,vectors in Fn will always be column-vectors.

Journal of the ACM, Vol. 0, No. 0, Article 0, Publication date: June 2013.

FLATTENING A TENSOR

Many tensor methods rely on flaTening:

FLATTENING A TENSOR

flat( ) = n 1

FLATTENING A TENSOR

flat( ) = n 1

ac;vi;es ;me

FLATTENING A TENSOR

flat( ) = n 1

Ti,j,k

FLATTENING A TENSOR

flat( ) = n 1

Ti,j,k

This is a rearrangement of the entries, into a matrix, that does not increase its rank

FLATTENING A TENSOR

flat( ) = n 1

Ti,j,k

flat( ai bi ci × × i = 1

) = ai vec(bici) × i = 1

n2n3-‐dimensional vector

Let n1 = n2 = n3 = n

We would need O(n2r) observaMons to fill-‐in flat(T)

Let n1 = n2 = n3 = n

There are many other variants of flaTening, but with comparable guarantees

[Liu, Musialski, Wonka, Ye], [Gandy, Recht, Yamada], [SignoreTo, De Lathauwer, Suykens], [Tomioko, Hayashi, Kashima], [Mu, Huang, Wright, Goldfarb], …

Let n1 = n2 = n3 = n

Can we beat flauening?

Let n1 = n2 = n3 = n

Can we beat flauening?

Can we make beuer predicMons than we do by treaMng each ac;vity x ;me as unrelated?

Nearly opMmal algorithms for noisy tensor compleMon

Part III:

Suppose we are given |Ω|= m noisy observaMons from T:

OUR RESULTS

T = ai bi ci × × i = 1

with |σi|, |ai|∞, |bi|∞, |ci|∞ ≤ C

+ noise

bdd by η

Theorem: There is an efficient algorithm that with prob 1-‐δ, outputs X with

Suppose we are given |Ω|= m noisy observaMons from T:

OUR RESULTS

|Xi,j,k–Ti,j,k| ≤ 1 n3

+ ln(2/δ)

m 2C3r n3/2log4n m C3r + 2η

T = ai bi ci × × i = 1

with |σi|, |ai|∞, |bi|∞, |ci|∞ ≤ C

+ noise

bdd by η

SOME REMARKS

+ ln(2/δ)

SOME REMARKS

+ ln(2/δ)

In many se}ngs, 1-‐o(1) fracMon of entries of T are at least r1/2/polylog(n) in magnitude

SOME REMARKS

When m = Ω (n3/2r), average error is asymptoMcally smaller and hence Xi,j,k = (1±o(1))Ti,j,k for 1-‐o(1) fracMon of entries

+ ln(2/δ)

SOME REMARKS

When m = Ω (n3/2r), average error is asymptoMcally smaller and hence Xi,j,k = (1±o(1))Ti,j,k for 1-‐o(1) fracMon of entries

+ ln(2/δ)

“Almost all of the entries, almost en;rely correct”

SOME REMARKS

+ ln(2/δ)

SOME REMARKS

+ ln(2/δ)

For r = n3/2-‐δ (highly overcomplete), we only need to observe an n-‐δ fracMon of the entries

SOME REMARKS

Algorithms for decomposing such tensors given all the entries need stronger (e.g. factors are random) assumpMons

+ ln(2/δ)

For r = n3/2-‐δ (highly overcomplete), we only need to observe an n-‐δ fracMon of the entries

LOWER BOUNDS

Not only is the tensor nuclear norm hard to compute, but…

LOWER BOUNDS

Noisy tensor compleMon with m observaMons

Refute random 3-‐SAT with m clauses

LOWER BOUNDS

Refute random 3-‐SAT with m clauses

Believed to be hard, If m = n3/2-‐δ

Matrix compleMon revisited: ConnecMons to random CSPs

Part IV:

Can we disMnguish between low-‐rank and random?

+ noise

Case #1: Approximately low-‐rank

Mi,j = random ±1 w/ probability ¼

+ noise

For each (i,j) Ω

aiaj w/ probability ¾

where each ai = ±1

Case #1: Approximately low-‐rank

, Mi,j = random ±1 For each (i,j) Ω

Case #2: Random

random

, Mi,j = random ±1 For each (i,j) Ω

Case #2: Random

random

In Case #1 the entries are (somewhat) predictable, but in Case #2 they are completely unpredictable

There are two very different communiMes that (essenMally) auacked this same disMnguishing problem:

The community working on matrix comple;on

The community working on matrix comple;on The community working on refu;ng random CSPs

AN INTERPRETATION

(i1, j1; σ1), (i2, j2; σ2), …, (im, jm; σm)

We can interpret:

±1 r.v. as a random 2-‐XOR formula ψ

AN INTERPRETATION

In parMcular each observaMon/fctn value maps to a clause:

(i, j, σ)

We can interpret:

(i1, j1; σ1), (i2, j2; σ2), …, (im, jm; σm)

vi vj = σ �

variables constraint

AN INTERPRETATION

(and vice-‐versa)

In parMcular each observaMon/fctn value maps to a clause:

(i, j, σ) vi vj = σ �

We can interpret:

(i1, j1; σ1), (i2, j2; σ2), …, (im, jm; σm)

variables constraint

STRONG REFUTATION

We will say that an algorithm strongly refutes* random 2-‐XOR with m clauses if:

STRONG REFUTATION

(1) On any 2-‐XOR formula ψ, it outputs val where:

OPT(ψ) ≤ val(ψ)

STRONG REFUTATION

OPT(ψ) ≤ val(ψ)

largest fracMon of clauses of ψ that can be saMsfied

STRONG REFUTATION

OPT(ψ) ≤ val(ψ)

STRONG REFUTATION

OPT(ψ) ≤ val(ψ)

(2) With high probability (for random ψ with m clauses):

val(ψ) = 1 2 + o(1)

Lemma: If (i1, j1; σ1), …, (im, jm; σm) ψ then

A 2 OPT(ψ) – ≤ 1

n 1 m 2

Lemma: If (i1, j1; σ1), …, (im, jm; σm) ψ then j

Proof: Map the assignment to a unit vector so that xi = ±1/√n and take the quadraMc form on A

2 OPT(ψ) – ≤ 1 n

1 m A 1

2 OPT(ψ) – ≤ 1 n

1 m A 1

mn OPT(ψ) ≤

1 2 + o(1)

2 OPT(ψ) – ≤ 1 n

m = ω(n)

This solves the strong refutaMon problem…

2 OPT(ψ) – ≤ 1 n

1 m A 1

mn OPT(ψ) ≤

1 2 + o(1) m = ω(n)

The same spectral bound implies:

An algorithm for strongly refuMng random 2-‐XOR (1)

The community working on refu;ng random CSPs

An algorithm for strongly refuMng random 2-‐XOR An algorithm for the disMnguishing problem

GeneralizaMon bounds for the nuclear norm (3)

It also yields bounds on how well the soluMon to the convex program generalizes [Srebro, Shraibman] …

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

An approach through sta;s;cal learning theory:

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

empirical error: (i,j) Ω

|Xi,j–Mi,j| 1 |Ω|

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

empirical error: (i,j) Ω

(≤ η)

min X s.t. *

(i,j) Ω

|Xi,j–Mi,j| ≤ η 1 |Ω|

empirical error:

predic;on error:

(i,j) Ω

|Xi,j–Mi,j| 1

(≤ η)

Then if we let

K = { X s.t. * X ≤ 1 } = } abT s.t. conv { a b ≤ , 1

Then if we let

sup X K

generaliza;on error:

empirical error (X) predic;on error (X) (on Ω)

Then if we let

generalizaMon error ≤ best agreement with random funcMon Theorem: “ ”

(on Ω)

sup X K

Then if we let

generalizaMon error ≤ best agreement with random funcMon Theorem: “ ”

(on Ω)

Rademacher complexity

sup X K

More precisely:

sup X K a = 1

X ia,ja σa 1 m

Rademacher complexity (Rm(K))

More precisely:

sup X K a = 1

X ia,ja σa 1 m

= 1 m A

More precisely:

sup X K a = 1

X ia,ja σa 1 m

= 1 m A

1 m A 1

More precisely:

sup X K a = 1

X ia,ja σa 1 m

= 1 m A

Rm(K) = o( ) 1 n

1 m A 1

mn m = ω(n)

Noisy matrix compleMon with m observaMons

Strongly refute* random 2-‐XOR/2-‐SAT with m clauses

*Want an algorithm that cerMfies a formula is far from saMsfiable

Rademacher Complexity

Strongly refute* random 3-‐XOR/3-‐SAT with m clauses Rademacher

Complexity

[Coja-‐Oghlan, Goerdt, Lanka]

Complexity

vi vj’ vk’ = σ’ vi vj vk = σ � � ( ) � � � ( )

vi σ' vj

[Coja-‐Ohglan, Goerdt, Lanka]: Reduce 3-‐XOR to 4-‐XOR

vi vj’ vk’ = σ’ vi vj vk = σ � � ( ) � � � ( ) ( vj vk vj’ vk’ = σ σ’ � � � )

vi σ' vj

σ vj'

vk σ σ’

vi σ' vj

σ vj'

vk σ σ’

This yields n5p2 = n2 logO(1)n clauses

vi σ' vj

σ vj'

vk σ σ’

This yields n5p2 = n2 logO(1)n clauses

Warning: The 4-‐XOR clauses are not independent!

vi vj’ vk’ = σ’ vi vj vk = σ � � ( ) � � � ( ) ( vj vk vj’ vk’ = σ σ’ � � � ) [Coja-‐Ohglan, Goerdt, Lanka]: Reduce 3-‐XOR to 4-‐XOR

(j, j’)

(k,k’)

vk σ σ’

(j, j’)

(k,k’)

Hence the paired variables for the rows (and colns) come from different clauses!

vk σ σ’

(j, j’)

(k,k’)

Hence the paired variables for the rows (and colns) come from different clauses!

vk σ σ’

(j, j’)

(k,k’)

A + trace method

Complexity

Strongly refute* random 3-‐XOR/3-‐SAT with m clauses

We then embed this algorithm into the sixth level of the sum-‐of-‐squares hierarchy, to get a relaxaMon for tensor predicMon

Embedding in SOS

+ ln(2/δ)

GENERALIZATION BOUNDS

Suppose we are given |Ω|= m noisy observaMons Ti,j,k ± η, and the factors of T are C-‐incoherent:

This comes from giving an efficiently computable norm whose Rademacher complexity is asymptoMcally smaller than the trivial bound whenever m=Ω(n3/2log4n)

Suppose we are given |Ω|= m noisy observaMons Ti,j,k ± η, and the factors of T are C-‐incoherent:

GENERALIZATION BOUNDS

+ ln(2/δ)

SUMMARY

We gave an algorithm for 3rd-‐order tensor predicMon that uses m = n3/2rlog4n observaMons

New Algorithm:

SUMMARY

There is an inefficient algorithm that use m = nrlogn observaMons

New Algorithm:

An Inefficient Algorithm:

(via tensor nuclear norm)

SUMMARY

There is an inefficient algorithm that use m = nrlogn observaMons

Even for nδ rounds of the powerful sum-‐of-‐squares hierarchy, no norm solves tensor predicMon with m = n3/2-‐δr observaMons

New Algorithm:

An Inefficient Algorithm:

A Phase Transi;on:

(via tensor nuclear norm)

New direcMons in computaMonal vs. staMsMcal tradeoffs

Epilogue:

DISCUSSION

Convex programs are unreasonably effecMve for linear inverse problems!

DISCUSSION

But we gave simple linear inverse problems that exhibit striking gaps between efficient and inefficient esMmators

DISCUSSION

Where else are there computaMonal vs staMsMcal tradeoffs?

DISCUSSION

Where else are there computaMonal vs staMsMcal tradeoffs?

New Direc;on: Explore computaMonal vs. staMsMcal tradeoffs through the powerful sum-‐of-‐squares hierarchy

BEYONDMATRIXCOMPLETION& - People | MIT CSAIL

Documents

Key-Value Memory Networks for - People | MIT CSAIL

Halle - MIT CSAIL

Height and Gradient from Shading - People | MIT CSAIL

Intro Python Material - People | MIT CSAIL

Pregel - MIT CSAIL

Random - People | MIT CSAIL

The “99% Robot” Report - People | MIT CSAIL

Hydraulic artificial muscles - MIT CSAIL

1 Introduction - People | MIT CSAIL

CMMD Reference Manual - MIT CSAIL

Last Time? - Research | MIT CSAIL

lecture18 - People | MIT CSAIL

Passive Navigation - CSAIL People - MIT

Steerable Part Models - MIT CSAIL

Tina Ann Nolte - MIT CSAIL

The Baskets Queue - MIT CSAIL

Hynek Schlawack - People | MIT CSAIL

Tommi S. Jaakkola MIT CSAIL

Bluetooth for Programmers - People | MIT CSAIL

The Case for RC6 as the AES - People | MIT CSAIL