22
PCA with Outliers and Missing Data Sujay Sanghavi Electrical and Computer Engg. University of Texas, Austin Joint w/ C. Caramanis, Y. Chen, H. Xu

PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

PCA with Outliers and Missing Data

Sujay Sanghavi

Electrical and Computer Engg. University of Texas, Austin

Joint w/ C. Caramanis, Y. Chen, H. Xu

Page 2: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Outline

PCA and Outliers

- Why SVD fails

- Corrupted features vs. corrupted points

Our idea + Algorithms Results

- Full observation

- Missing Data Framework for Robustness in High Dimensions

Page 3: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Principal Components Analysis

Data Points

�Low Rank Matrix

Classical technique: 1.  Organize points as matrix 2.  Take SVD 3.  Top singular vectors span space

Given points that lie on/near a Lower dimensional subspace, find this subspace.

Page 4: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Fragility

Gross errors of even one/few points can completely throw off PCA

Reason: Classical PCA minimizes error, which is susceptible to gross outliers `2

Page 5: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Two types of gross errors

-  Individual entries corrupted - Entire columns corrupted

.. and missing data versions of both

Corrupted Features Outliers

Page 6: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

PCA with Outliers Points

Objective: find identities of outliers (and hence col. space of true matrix)

Page 7: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Outlier Pursuit - Idea Points

minL

⇥M � L⇥F

minL,C

⇥M � L� C⇥F

col(C) = c

Standard PCA

s.t. rank(L) = rs.t. rank(L) = r

Page 8: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Outlier Pursuit - Method Points

minL,C

⇥M � L� C⇥F + �1⇥L⇥� + �2⇥C⇥1,2We propose:

Convex surrogate for Rank constraint

Convex surrogate for Column-sparsity

Page 9: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

When does it (not) work ?

When certain directions of column space of poorly represented

This vector has large inner product with some coordinate axes

maxi

�V �ei� is large

L�

L� = U�V ⇥

Page 10: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Results

Assumption: Columns of true are incoherent:

r � µr � nNote:

maxi⇥V �ei⇥2 � µr

nL�

First consider: Noiseless case

minL,C

kLk⇤ + �kCk1,2

s.t. L+ C = M

Page 11: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Results

Assumption: Columns of true are incoherent:

Theorem: (noiseless case) Our convex program can identify upto a fraction of outliers as long as

1� �⇥ c

µr

r � µr � nNote:

maxi⇥V �ei⇥2 � µr

nL�

⇥ =3

7��n

� >1

r + 1Outer bound: makes the problem un-identifiable

Page 12: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Proof Technique

A point is the optimum of a convex function

Zero lies in the (sub) gradient of at

xf

�f(x)f x�

Steps: 1. guess a “nice” point, -- oracle problem 2. show it is the optimum by showing zero is in subgradient

Page 13: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Proof Technique

Guessing a “nice” optimum

(Note: in “single structure” problems like matrix completion, compressed sensing etc., this is not an issue)

Oracle Problem:

minL,C

⇥M � L� C⇥F + �1⇥L⇥� + �2⇥C⇥1,2

ColSpace(L) � ColSpace(L�)

s.t. ColSupp(C) � ColSupp(C�)

(�L, �C) is, by definition, a nice point. Rest of proof: showing it is the optimum of original program,

under our assumption.

Page 14: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Performance

fraction of outliers !

ran

k r

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

21

19

17

15

13

fraction of ourliers !

ran

k r

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

21

19

17

15

13

L + C formulation L + S formulation ( from [Chandrasekaran et. al.], [Candes,et. al.] )

Page 15: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Another view…

Mean is solution of

minx

i

(xi � x)2

Median is solution of

minx

i

|xi � x|

Standard PCA of M is solution of

Our method is (convex rel. of)

X

j

kMj � Ljk

Fragile: Can be easily skewed by one / few points

Robust: skewing requires Error in constant fraction of pts

rank(L) r

rank(L) r

X

j

kMj � Ljk2

Page 16: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Collaborative Filtering w/ Adversaries

Page 17: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Collaborative Filtering w/ Adversaries

Users

Low-rank matrix that -  Is partiallly observed

-  Has some corrupted columns

== outliers with missing data !

Our setting: -  Good users == random sampling of incoherent matrix (as in matrix completion)

-  Manipulators == completely arbitrary sampling, values

Page 18: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Outlier Pursuit with Missing Data

min ||L||� + �||C||1,2

s.t. lij + cij = mij (i, j)for observed

Now: need row space to be incoherent as well

- since we are doing matrix completion and manipulator identification

Page 19: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Our Result

� � c1µ2r2 log3(4n)

p

1� �⇥ c2

⇥2

(1 + µr��

p )µ2r2 log6(4n)

Theorem: Convex program optimum is such that has the correct column space and the support of is exactly the set of manipulators, whp, provided

Sampling density

Fraction of users that are manipulators

(�L, �C) �L�C

Note: no assumptions on manipulators

n � p

Page 20: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Robust Collaborative Filtering

fraction of manipulators !

fra

ctio

n o

f o

bse

rve

d e

ntr

ies "

0.02 0.05 0.1 0.2 0.4

1

0.8

0.6

0.45

0.3

fraction of manipulators !

fra

ctio

n o

f o

bse

rve

d e

ntr

ies "

0.02 0.05 0.1 0.2 0.4

1

0.8

0.6

0.45

0.3

Algo: Partially observed Low-rank + Column-sparse

Algo: Partially observed Sparse + Low-rank

Page 21: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

More generally …

minX

L(y,A;X) + � r(X)Several methods in High-dim. Statistics

Our approach:

Loss function regularizer

(same) Loss function

minX1,X2

L(y,A;X1 + X2) + �1 r1(X1) + �2 r2(X2)

Weighted sum of regularizers

Yields robustness + flexibility in several settings. Today: PCA wit Outliers + missing data

Page 22: PCA with Outliers and Missing Datalerman/Meetings/SIAM2012_Sujay.pdfcompressed sensing etc., this is not an issue) Oracle Problem: min L,C ⇥M L C⇥ F + 1⇥L⇥ + 2⇥C⇥ 1,2 ColSpace(L)

Latent factors in time series (ICML’12) Matrix completion from Errors and Erasures (ISIT 2011, SIAM J. Optim. 2011) Graph clustering (ICML 2011) Robust Recommender Systems (ICML 2011) Multiple Sparse Regression (NIPS 2010) PCA that is robust to Outliers (NIPS 2010, Trans IT)

All papers on my website, Arxiv.

Ali Jalali

Constantine Caramanis

Yudong Chen

Pradeep Ravikumar

Huan Xu