Maximum Likelihood Matrix Completion Under Sparse Factor Models:
Error Guarantees and Efficient Algorithms

Jarvis Haupt
Department of Electrical and Computer Engineering, University of Minnesota

Institute for Computational and Experimental Research in Mathematics (ICERM)
Workshop on Approximation, Integration, and Optimization
October 1, 2014
Background and Motivation Problem Statement Error Bounds Algorithmic Approach Experimental Results Acknowledgments
Section 1
Background and Motivation
A Classical Example

Sampling Theorem: (Whittaker/Kotelnikov/Nyquist/Shannon, 1930s-1950s)

(Figure: original signal in red, samples in black; accurate recovery (and imputation) via ideal low-pass filtering when the original signal is bandlimited.)

Basic "formula" for inference: to draw inferences from limited data (or here, to impute missing elements), we need to leverage underlying structure in the signal being inferred.
A Contemporary Example

Matrix Completion: (Candes & Recht; Keshavan et al.; Candes & Tao; Candes & Plan; Negahban & Wainwright; Koltchinskii et al.; Davenport et al.; ... 2009-)

(Figure: sampled entries; accurate recovery (and imputation) via convex optimization when the original matrix is low-rank.)

The low-rank modeling assumption is commonly utilized in collaborative filtering applications (e.g., the Netflix Prize) to describe settings where each observed value depends on only a few latent factors or features.
Beyond Low-Rank Models?

Low-Rank Models: all columns of the matrix are well-approximated as vectors in a common linear subspace.

Union-of-Subspaces Model: all columns of the matrix are well-approximated as vectors in a union of linear subspaces.

Union-of-subspaces models are at the essence of sparse subspace clustering (Elhamifar & Vidal; Soltanolkotabi et al.; Eriksson et al.; Balzano et al.) and dictionary learning (Olshausen & Field; Aharon et al.; Mairal et al.; ...).

Here, we examine the efficacy of such models in matrix completion tasks.
Section 2
Problem Statement
"Sparse Factor" Data Models

We assume the unknown X∗ ∈ R^{n1×n2} we seek to estimate admits a factorization of the form

X∗ = D∗A∗,  D∗ ∈ R^{n1×r},  A∗ ∈ R^{r×n2},

where
• ‖D∗‖max := max_{i,j} |D∗_{i,j}| ≤ 1 (essentially to fix scaling ambiguities)
• ‖A∗‖max ≤ Amax for a constant 0 < Amax ≤ (n1 ∨ n2)
• ‖X∗‖max ≤ Xmax/2 for a constant Xmax ≥ 1

Our Focus: sparse factor models, characterized by (approximately or exactly) sparse A∗.
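As a concrete illustration of the model, a minimal sketch (illustrative code, not from the talk; the function name and sizes are our choices) that draws a random matrix satisfying X∗ = D∗A∗ with ‖D∗‖max ≤ 1 and a fixed number of nonzeros per column of A∗:

```python
# Hypothetical generator for a sparse factor matrix X* = D* A*.
import numpy as np

def sparse_factor_matrix(n1, n2, r, nnz_per_col, A_max=1.0, seed=0):
    """Return (X, D, A) with entries of D in [-1, 1] and column-sparse A."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(n1, r))   # enforces ||D||_max <= 1
    A = np.zeros((r, n2))
    for j in range(n2):
        # place nnz_per_col nonzeros at random rows of column j
        rows = rng.choice(r, size=nnz_per_col, replace=False)
        A[rows, j] = rng.uniform(-A_max, A_max, size=nnz_per_col)
    return D @ A, D, A

X, D, A = sparse_factor_matrix(100, 1000, 20, 4)
print(X.shape, np.count_nonzero(A))   # (100, 1000) 4000
```

The sizes mirror the synthetic experiment later in the talk (100 × 1000, r = 20, 4 nonzeros per column).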
Observation Model

We observe X∗ only at a subset S ⊆ {1, 2, . . . , n1} × {1, 2, . . . , n2} of its locations. For some γ ∈ (0, 1], each (i, j) is in S independently with probability γ; we interpret γ = m(n1n2)^{-1}, so that m = γ n1 n2 is the nominal number of observations.

The observations {Y_{i,j}}_{(i,j)∈S} =: Y_S are conditionally independent given S, modeled via the joint density

p_{X∗_S}(Y_S) = ∏_{(i,j)∈S} p_{X∗_{i,j}}(Y_{i,j}),

a product of scalar densities.
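The sampling model above is just an independent Bernoulli(γ) mask; a short sketch (illustrative names, our own code) of drawing S with γ = m/(n1·n2):

```python
# Sketch of the observation model: each entry lands in S independently
# with probability gamma = m / (n1 * n2).
import numpy as np

def sample_mask(n1, n2, m, seed=0):
    rng = np.random.default_rng(seed)
    gamma = m / (n1 * n2)
    return rng.random((n1, n2)) < gamma   # Boolean mask of observed entries

S = sample_mask(100, 1000, m=20000, seed=1)
print(S.mean())   # close to gamma = 0.2 (the realized count is random)
```

Note that |S| is itself random; m is only the nominal (expected) number of observations.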
Estimation Approach

We estimate X∗ via a sparsity-penalized maximum likelihood approach: for λ > 0, we take

X̂ = arg min_{X=DA ∈ X} { − log p_{X_S}(Y_S) + λ‖A‖0 }.

The set X of candidate reconstructions is any subset of X′, where

X′ := {X = DA : D ∈ D, A ∈ A, ‖X‖max ≤ Xmax},

where
• D: the set of all matrices D ∈ R^{n1×r} whose elements are discretized to one of L uniformly-spaced values in the range [−1, 1]
• A: the set of all matrices A ∈ R^{r×n2} whose elements either take the value zero, or are discretized to one of L uniformly-spaced values in the range [−Amax, Amax]
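The discretization used to define D and A can be sketched as snapping each entry to its nearest of L uniformly-spaced levels (our own illustrative code; the talk only defines the sets, not a rounding rule):

```python
# Hedged sketch: snap a matrix to L uniformly-spaced levels in [-bound, bound].
import numpy as np

def discretize(M, L, bound):
    levels = np.linspace(-bound, bound, L)
    idx = np.abs(M[..., None] - levels).argmin(axis=-1)  # index of nearest level
    return levels[idx]

M = np.array([[0.31, -0.97], [0.02, 0.66]])
print(discretize(M, L=5, bound=1.0))   # levels are -1, -0.5, 0, 0.5, 1
```

With L = (n1 ∨ n2)^β (as chosen in the theorem below), the discretization error is polynomially small in the matrix dimensions.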
Section 3
Error Bounds
A General "Sparse Factor" Matrix Completion Error Guarantee

Theorem (A. Soni, S. Jain, J.H., and S. Gonella, 2014)

Let β > 0 and set L = (n1 ∨ n2)^β. If CD satisfies CD ≥ max_{X∈X} max_{i,j} D(p_{X∗_{i,j}} ‖ p_{X_{i,j}}), then for any

λ ≥ 2(β + 2)(1 + 2CD/3) log(n1 ∨ n2),

the sparsity-penalized ML estimate X̂ = arg min_{X=DA∈X} { − log p_{X_S}(Y_S) + λ‖A‖0 } satisfies the (normalized, per-element) error bound

E_{S,Y_S}[ −2 log A(p_X̂, p_X∗) ] / (n1 n2)
  ≤ 8 CD (log m)/m
    + 3 min_{X=DA∈X} { D(p_X∗ ‖ p_X)/(n1 n2) + ( λ + 4CD(β + 2) log(n1 ∨ n2)/3 ) ( (n1 r + ‖A‖0)/m ) }.

Here:
• A(p_X, p_X∗) := ∏_{i,j} A(p_{X_{i,j}}, p_{X∗_{i,j}}), where A(p_{X_{i,j}}, p_{X∗_{i,j}}) := E_{p_{X∗_{i,j}}}[ √(p_{X_{i,j}}/p_{X∗_{i,j}}) ] is the Hellinger affinity
• D(p_X∗ ‖ p_X) := ∑_{i,j} D(p_{X∗_{i,j}} ‖ p_{X_{i,j}}), where D(p_{X∗_{i,j}} ‖ p_{X_{i,j}}) := E_{p_{X∗_{i,j}}}[ log(p_{X∗_{i,j}}/p_{X_{i,j}}) ] is the KL divergence

Next, we instantiate this result for some specific cases (using a specific choice of β, λ).
Additive White Gaussian Noise Model

Suppose each observation is corrupted by zero-mean AWGN with known variance σ², so that

p_{X∗_S}(Y_S) = (2πσ²)^{−|S|/2} exp( −(1/2σ²) ∑_{(i,j)∈S} (Y_{i,j} − X∗_{i,j})² ).

Let X = X′, essentially (a discretization of) a set of rank- and max-norm-constrained matrices.

Gaussian Noise (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( (σ² + X²max) ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ).
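The objective being minimized in this instantiation is the masked Gaussian negative log-likelihood; a small sketch (our own code, following the density above) of evaluating it on the observed entries only:

```python
# Sketch of -log p_{X_S}(Y_S) for the AWGN model, summed over entries in S.
import numpy as np

def gaussian_nll(Y, X, S, sigma2):
    resid2 = np.where(S, (Y - X) ** 2, 0.0).sum()   # only observed residuals
    m_obs = int(S.sum())                            # |S|
    return 0.5 * m_obs * np.log(2 * np.pi * sigma2) + resid2 / (2 * sigma2)

Y = np.array([[1.0, 0.0], [2.0, 1.0]])
X = np.array([[1.0, 5.0], [2.0, 1.0]])
S = np.array([[True, False], [True, True]])
print(gaussian_nll(Y, X, S, sigma2=1.0))   # residuals vanish on S, so only the constant term remains
```

In this toy case X matches Y wherever S is True, so the data-fit term is zero and the value reduces to (|S|/2)·log(2πσ²).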
AWGN – Our Result in Context

Gaussian Noise (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( (σ² + X²max) ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ).

Compare with the result of (Koltchinskii et al., 2011): when X∗ is max-norm- and rank-constrained, nuclear-norm-penalized optimization yields an estimate satisfying

‖X∗ − X̂‖²_F / (n1 n2) = O( (σ² + X²max) ( (n1 + n2) r / m ) log(n1 ∨ n2) )

with high probability.

Note: our guarantees can have improved error performance in the case where ‖A∗‖0 ≪ n2 r. The two bounds coincide when A∗ is not sparse (take ‖A∗‖0 = n2 r in our error bounds).
AWGN Model (Extension to Approximately Sparse Factor Model)

Recall: for p ≤ 1, a vector x ∈ R^n is said to belong to a weak-ℓp ball of radius R > 0, denoted x ∈ wℓp(R), if its ordered elements |x_(1)| ≥ |x_(2)| ≥ · · · ≥ |x_(n)| satisfy

|x_(i)| ≤ R i^{−1/p} for all i ∈ {1, 2, . . . , n}.

With this, we can state a variant of the above for when the columns of A∗ are approximately sparse.

Gaussian Noise (Approximately Sparse Factor Model)

Consider the same Gaussian noise model described above. If for some p ≤ 1 all columns of A∗ belong to a weak-ℓp ball of radius Amax, then for α = 1/p − 1/2,

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2)
  = O( A²max (n2/m)^{2α/(2α+1)} + (σ² + X²max) ( n1 r/m + (n2/m)^{2α/(2α+1)} ) log(n1 ∨ n2) ).

Note: (n2/m)^{2α/(2α+1)} ≤ n2 m^{−2α/(2α+1)} ⇐ the aggregate error of estimating n2 vectors in wℓp from noisy observations.
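The weak-ℓp membership condition is easy to check numerically; a minimal sketch (our own code, directly implementing the definition above):

```python
# Test whether x lies in the weak-l_p ball wl_p(R): sorted magnitudes must
# satisfy |x_(i)| <= R * i^{-1/p}.
import numpy as np

def in_weak_lp(x, p, R):
    mags = np.sort(np.abs(x))[::-1]          # |x_(1)| >= |x_(2)| >= ...
    i = np.arange(1, len(x) + 1)
    return bool(np.all(mags <= R * i ** (-1.0 / p)))

x = 2.0 * np.arange(1, 101) ** (-2.0)        # magnitudes decay like i^{-2}
print(in_weak_lp(x, p=0.5, R=2.0))           # True: bound is exactly 2 i^{-2}
print(in_weak_lp(x, p=0.5, R=1.0))           # False: first element 2 > 1
```

Smaller p forces faster decay of the sorted magnitudes, i.e., "more sparse" columns.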
Additive Laplace Noise Model

Suppose each observation is corrupted by additive Laplace noise with known parameter τ > 0, so

p_{X∗_S}(Y_S) = (τ/2)^{|S|} exp( −τ ∑_{(i,j)∈S} |Y_{i,j} − X∗_{i,j}| ).

Let X = X′, essentially (a discretization of) a set of rank- and max-norm-constrained matrices.

Laplace Noise (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( (1/τ² + X²max) · τ Xmax · ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ),

where the first factor is O(variance + X²max), and the ratio (n1 r + ‖A∗‖0)/m has the "parametric-like" form, similar to the sparse-model Gaussian-noise case.

Can also obtain results for the approximately sparse case here, analogously to above...
Poisson-distributed Observations

Suppose that each element of X∗ satisfies X∗_{i,j} ≥ Xmin for some Xmin > 0, and that each observation is Poisson-distributed, so that Y_S ∈ N^{|S|} and

p_{X∗_S}(Y_S) = ∏_{(i,j)∈S} (X∗_{i,j})^{Y_{i,j}} e^{−X∗_{i,j}} / (Y_{i,j})!.

Let X = {X ∈ X′ : X_{i,j} ≥ 0 for all (i, j) ∈ {1, 2, . . . , n1} × {1, 2, . . . , n2}} (to allow only non-negative rate estimates).

Poisson-distributed Observations (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( ( Xmax + X²max · (Xmax/Xmin) ) ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ),

where the first factor is O(worst-case variance + X²max) when Xmax/Xmin = O(1).

Can also obtain results for the approximately sparse case here, analogously to above...
One-bit Observations

Let F : R → [0, 1] be a differentiable link function with f(t) = (d/dt) F(t). Suppose each observation Y_{i,j} for (i, j) ∈ S is Bernoulli(F(X∗_{i,j}))-distributed, so that

p_{X∗_S}(Y_S) = ∏_{(i,j)∈S} [F(X∗_{i,j})]^{Y_{i,j}} [1 − F(X∗_{i,j})]^{1−Y_{i,j}}.

Assume F(Xmax) < 1, F(−Xmax) > 0, and inf_{|t|≤Xmax} f(t) > 0.

One-bit Observations (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( (c_{F,Xmax}/c′_{F,Xmax}) ( 1/c_{F,Xmax} + X²max ) ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ),

where

c_{F,Xmax} := ( sup_{|t|≤Xmax} 1/(F(t)(1 − F(t))) ) · ( sup_{|t|≤Xmax} f²(t) ),
c′_{F,Xmax} := inf_{|t|≤Xmax} f²(t)/(F(t)(1 − F(t))).

Can also obtain results for the approximately sparse case here, analogously to above...
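For concreteness, a short sketch of generating one-bit observations under a logistic link (the logit link is one of the choices mentioned for this model; the code and names are our own illustration):

```python
# Generate Y_ij ~ Bernoulli(F(X*_ij)) on the sampled entries, with F the
# logistic link. Unobserved entries are filled with 0 as a placeholder.
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))        # F(t), maps R into (0, 1)

def one_bit_obs(X, S, seed=0):
    rng = np.random.default_rng(seed)
    P = logistic(X)                         # success probabilities F(X*_ij)
    Y = (rng.random(X.shape) < P).astype(int)
    return np.where(S, Y, 0)                # keep only entries in S

X = np.array([[3.0, -3.0], [0.0, 2.0]])
S = np.ones_like(X, dtype=bool)
Y = one_bit_obs(X, S, seed=0)
print(Y.shape)
```

The logistic link satisfies the stated assumptions for any finite Xmax: F(±Xmax) ∈ (0, 1) and f(t) = F(t)(1 − F(t)) is bounded away from zero on |t| ≤ Xmax.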
Comparisons to "One-bit Matrix Completion"

One-bit Observations (Exact Sparse Factor Model)

If A∗ is exactly sparse with ‖A∗‖0 nonzero elements, the sparsity-penalized ML estimate satisfies

E_{S,Y_S}[ ‖X∗ − X̂‖²_F ] / (n1 n2) = O( (c_{F,Xmax}/c′_{F,Xmax}) ( 1/c_{F,Xmax} + X²max ) ( (n1 r + ‖A∗‖0)/m ) log(n1 ∨ n2) ).

Compare with the low-rank recovery result of (Davenport et al., 2012): maximum likelihood optimization over a set of max-norm- and nuclear-norm-constrained candidates yields an estimate satisfying

‖X∗ − X̂‖²_F / (n1 n2) = O( C_{F,Xmax} Xmax √((n1 + n2) r / m) )

with high probability, where C_{F,Xmax} is analogous to the (c_{F,Xmax}/c′_{F,Xmax}) factor in our bounds.

There is an extra loss of Xmax log(n1 ∨ n2) in our bound, but a faster "parametric-like" dependence on m (in addition to the "sparse factor" improvement). Lower bounds for the "sparse factor" model are still open (we think!).
Section 4
Algorithmic Approach
A Non-Convex Problem...

Our optimizations take the general form

min_{D∈R^{n1×r}, A∈R^{r×n2}, X∈R^{n1×n2}}  ∑_{i,j} −s_{i,j} log p_{X_{i,j}}(Y_{i,j}) + I_X(X) + I_D(D) + I_A(A) + λ‖A‖0
  s.t. X = DA,

where s_{i,j} = 1 if (i, j) ∈ S (and 0 otherwise), and I_X(·), I_D(·), I_A(·) are indicator functions.

Multiple sources of non-convexity:
• the ℓ0 regularizer
• the discretized sets D and A
• the inherent bilinearity of the model!

We propose an approach based on the Alternating Direction Method of Multipliers (ADMM).
A General-Purpose ADMM-based Approach

We form the augmented Lagrangian

L(D, A, X, Λ) = −∑_{i,j} s_{i,j} log p_{X_{i,j}}(Y_{i,j}) + I_X(X) + I_D(D) + I_A(A) + λ‖A‖0 + tr(Λ(X − DA)) + (ρ/2)‖X − DA‖²_F,

where Λ is the Lagrange multiplier for the equality constraint and ρ > 0 is a parameter, and solve:

(S1) X^{k+1} := arg min_{X∈R^{n1×n2}} L(D^k, A^k, X, Λ^k)
(S2) A^{k+1} := arg min_{A∈R^{r×n2}} L(D^k, A, X^{k+1}, Λ^k)
(S3) D^{k+1} := arg min_{D∈R^{n1×r}} L(D, A^{k+1}, X^{k+1}, Λ^k)
(S4) Λ^{k+1} = Λ^k + ρ(X^{k+1} − D^{k+1}A^{k+1}).
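The iteration S1-S4 can be sketched end-to-end for the Gaussian-noise case. This is a simplified illustration under our own assumptions, not the authors' implementation: the discretized sets are relaxed to boxes, S2 is approximated by a least-squares fit followed by a single hard-thresholding pass (standing in for the IHT inner loop described on the next slide), and S3 by a projected least-squares fit:

```python
# Simplified ADMM sketch for sparse-factor matrix completion, Gaussian model.
import numpy as np

def admm_sparse_mc(Y, S, r, lam, rho=1.0, sigma2=1.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2 = Y.shape
    D = rng.uniform(-1.0, 1.0, (n1, r))
    A = np.zeros((r, n2))
    Lam = np.zeros((n1, n2))
    for _ in range(iters):
        # S1: elementwise prox of the masked Gaussian negative log-likelihood
        Z = D @ A - Lam / rho
        X = np.where(S, (Y / sigma2 + rho * Z) / (1.0 / sigma2 + rho), Z)
        # S2: least-squares fit in A, then one hard-thresholding pass
        A = np.linalg.lstsq(D, X + Lam / rho, rcond=None)[0]
        A[A ** 2 <= 2.0 * lam / rho] = 0.0
        # S3: least-squares fit in D, projected onto the box [-1, 1]
        D = np.linalg.lstsq(A.T, (X + Lam / rho).T, rcond=None)[0].T
        D = np.clip(D, -1.0, 1.0)
        # S4: dual ascent on the constraint X = DA
        Lam = Lam + rho * (X - D @ A)
    return D @ A

# smoke test on a small noisy low-rank matrix with 80% sampling
rng = np.random.default_rng(1)
Dt = rng.uniform(-1, 1, (20, 2))
At = rng.uniform(-1, 1, (2, 30))
Xt = Dt @ At
S = rng.random(Xt.shape) < 0.8
Y = np.where(S, Xt + 0.01 * rng.standard_normal(Xt.shape), 0.0)
Xhat = admm_sparse_mc(Y, S, r=2, lam=1e-4, sigma2=1e-4, iters=300)
print(Xhat.shape)
```

Since the underlying problem is non-convex, this scheme carries no global-convergence guarantee; in practice it behaves like alternating least squares with a dual correction.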
Efficiently Solvable Subproblems

We relax D, A, X to closed convex sets, and solve S1-S4 iteratively, as follows...

Step S1: after simplification, the solution can be written in terms of scalar prox functions:

X^{k+1}_{i,j} = arg min_{X_{i,j}∈R} −s_{i,j} log p_{X_{i,j}}(Y_{i,j}) + I_X(X_{i,j}) + (ρ/2)( X_{i,j} − (D^k A^k)_{i,j} + (Λ^k)_{i,j}/ρ )²
            =: prox_{−s_{i,j} log p_·(Y_{i,j}) + I_X(·)}( (D^k A^k)_{i,j} − (Λ^k)_{i,j}/ρ ).

(Closed-form for three of our examples; use Newton's method for the one-bit model with probit or logit link.)

Step S2: the subproblem takes the form

min_{A∈R^{r×n2}} I_A(A) + λ‖A‖0 + (ρ/2)‖X^{k+1} − D^k A + Λ^k/ρ‖²_F.

(Solved via majorization-minimization: Iterative Hard Thresholding (Blumensath & Davies, 2008).)

Step S3: the subproblem takes the form

min_{D∈R^{n1×r}} I_D(D) + (ρ/2)‖X^{k+1} − D A^{k+1} + Λ^k/ρ‖²_F.

(Efficiently solved via Newton's method or in closed form.)
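The key scalar operation inside the IHT iterations of Step S2 is the prox of the ℓ0 penalty, which is a standard hard-thresholding rule (stated here as our own sketch): minimizing λ·1[a ≠ 0] + (ρ/2)(a − v)² keeps v when v² > 2λ/ρ and returns zero otherwise.

```python
# Hard-thresholding prox for the l0 penalty: compare the cost lambda of
# keeping a nonzero against the quadratic cost (rho/2) v^2 of zeroing it.
import numpy as np

def hard_threshold(v, lam, rho):
    out = v.copy()
    out[v ** 2 <= 2.0 * lam / rho] = 0.0   # zero entries below the threshold
    return out

v = np.array([0.05, -0.3, 1.2, 0.0])
print(hard_threshold(v, lam=0.02, rho=1.0))   # keeps -0.3 and 1.2, zeroes the rest
```

With lam = 0.02 and rho = 1 the threshold on v² is 0.04, so 0.05 is zeroed while -0.3 and 1.2 survive.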
Section 5
Experimental Results
A Comparison with Synthetic Data

Preliminary experimental results: we evaluated each of these methods on matrices of size 100 × 1000 with r = 20 and 4 nonzero elements per column of A∗, for varying sampling rates (and different likelihood models). For each, we evaluated the average (over 5 trials) normalized reconstruction error as a function of the sampling rate.

The Gaussian and Laplace noises have the same variances.

For sampling rates γ > 10^{−0.4} ≈ 40%, the error exhibits the predicted decay (slope of ≈ −1 on the log-log scale).

(Figure: log10 of the normalized error E‖X̂ − X∗‖²_F/(n1 n2) vs. log10(γ), for the Gaussian, Laplace, and Poisson models.)
Imaging Example – Gaussian Noise

Original 512 × 512 image reshaped into a 256 × 1024 matrix (0.005 ≤ X∗_{i,j} ≤ 1.05 for all i, j).
Inner dimension r = 25, noise standard deviation σ = 0.01, sampling rate 50%.

(Figure panels: Original Image, Samples, Estimated Image, Estimated A.)
Imaging Example – Laplace Noise

Original 512 × 512 image reshaped into a 256 × 1024 matrix (0.005 ≤ X∗_{i,j} ≤ 1.05 for all i, j).
Inner dimension r = 25, noise standard deviation √(2/τ²) = 0.01, sampling rate 50%.

(Figure panels: Original Image, Samples, Estimated Image, Estimated A.)
Imaging Example – Poisson-distributed Observations

Original 512 × 512 image reshaped into a 256 × 1024 matrix (0.005 ≤ X∗_{i,j} ≤ 1.05 for all i, j).
Inner dimension r = 25, sampling rate 50%.

(Figure panels: Original Image, Samples, Estimated Image, Estimated A.)
Imaging Example – One-bit Observations

Original 512 × 512 image reshaped into a 256 × 1024 matrix (0.005 ≤ X∗_{i,j} ≤ 1.05 for all i, j).
Inner dimension r = 25, sampling rate 50%.

(Figure panels: Original Image, Samples, Estimated Image, Estimated A.)
Section 6
Acknowledgments
Acknowledgments

Collaborators/Co-authors:
Akshay Soni (UMN ECE PhD Student), Swayambhoo Jain (UMN ECE PhD Student), Prof. Stefano Gonella (UMN Civil Engr.)

Research Support:
NSF EARS (Enhancing Access to the Radio Spectrum) Program
DARPA Young Faculty Award

Thanks!
jdhaupt@umn.edu
www.ece.umn.edu/~jdhaupt

(Special thanks to Prof. Julian Wolfson, UMN Dept. of Biostatistics, for the Beamer template!)