Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables
Dominique Brunet, Hanne Kekkonen, Vitor Nunes, Iryna Sivak
Fields-MITACS Thematic Program on Inverse Problems and Imaging
Group on Bayesian/Statistical Inverse Problems
July 27, 2012
Menu
• Hanne: Problem Formulation and Motivation; Frequentist and Bayesian Origins; Relation to Imaging
• Iryna: The LASSO Methods
• Vitor: Newton Methods and Quasi-Newton Methods; The QUIC Trick; Thoughts on Future Work
• Dominique: Simulation and Results
• Conclusions
Bayes’ formula brings a priori information and measured data together
Statistical Inversion: We consider the linear measurement model
m = Ax + ε
where x ∈ Rp and m ∈ Rp are treated as random variables.
Bayes’ formula gives us the posterior distribution π(x | m):
π(x | m) ∝ π(m | x)π(x)
We recover x as a point estimate from π(x | m).
Likelihood distribution describes the measurement
If we assume that the noise ε is Gaussian
ε_k ∼ N(µ, σ²), 1 ≤ k ≤ p,

we can write the likelihood distribution, which measures the data misfit, as

π(m | x) = π_ε(‖Ax − m‖) = 1/(σ√(2π)) · exp( −‖Ax − m‖₂² / (2σ²) ).
Prior distribution contains all other information about the unknown x
• The prior distribution shouldn’t depend on the measurement.
• The prior distribution should assign a clearly higher probability to those values of x that we expect to see than to unexpected x.
• Designing a good prior distribution is one of the main difficulties in statistical inversion.
Gaussian prior distribution
Gaussian model: If we assume that x is also Gaussian,

x ∼ Np(µ, Σ),

we can write

π(x) = (det Σ⁻¹)^{1/2} / (2π)^{p/2} · exp( −(x − µ)ᵀ Σ⁻¹ (x − µ) / 2 )
Estimating the inverse covariance matrix Σ⁻¹

We consider the problem of finding a good estimator for the inverse covariance matrix Σ⁻¹ under the constraint that certain given pairs of variables are conditionally independent.

The conditional independence constraints describe the sparsity pattern of the inverse covariance matrix Σ⁻¹: zeros indicate conditional independence between the corresponding variables.
Why the inverse covariance matrix?
Simple example: Consider a line of objects connected by springs.

If we move the leftmost object y₁, the rightmost object y₅ will also move, so clearly y₁ and y₅ are not independent.
However, y₁ and y₅ are conditionally independent given all the other variables: once we know how y₂ moves, knowing y₅ gives us no further information about y₁. In fact, each yᵢ is conditionally dependent only on its neighbours.
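To make the spring example concrete, here is a minimal numpy sketch (our illustration, not part of the original talk): a five-variable chain with a tridiagonal precision matrix Θ, whose zeros encode exactly these conditional independences even though the covariance Σ = Θ⁻¹ is dense.

    import numpy as np

    # Tridiagonal precision matrix for the chain y1 - y2 - y3 - y4 - y5:
    # each variable interacts only with its immediate neighbours.
    Theta = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
    Theta[0, 0] = Theta[4, 4] = 1.5   # boundary terms; Theta stays positive definite

    Sigma = np.linalg.inv(Theta)

    # Theta[0, 4] == 0: y1 and y5 are conditionally independent given the rest,
    # yet Sigma[0, 4] != 0: they are still marginally correlated.
    print(Theta[0, 4], Sigma[0, 4].round(3))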
What does this have to do with pictures?
Neighbourhood rule: We are interested in exploring the assumption that every pixel is conditionally dependent only on its neighbours.
[Figure: sparsity pattern of the 100 × 100 inverse covariance matrix; nz = 838]
In this case the inverse covariance matrix has non-zero elements only on nine ’diagonals’.
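One way to see where the nine diagonals come from (a hypothetical sketch, not code from the talk): under an 8-neighbour rule on an nx × ny pixel grid, the allowed sparsity pattern is the Kronecker product of two tridiagonal indicator matrices.

    import numpy as np
    from scipy.sparse import diags, kron

    def neighbour_pattern(nx, ny):
        """0/1 sparsity pattern for the 8-neighbour rule on an nx-by-ny pixel grid."""
        Tx = diags([1, 1, 1], [-1, 0, 1], shape=(nx, nx))  # tridiagonal indicator
        Ty = diags([1, 1, 1], [-1, 0, 1], shape=(ny, ny))
        return kron(Ty, Tx).tocsr()   # non-zeros lie on nine 'diagonals'

    P = neighbour_pattern(10, 10)
    print(P.shape, P.nnz)   # (100, 100), 784 non-zeros on this 10 x 10 grid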
Log-likelihood function
Likelihood function: We want to estimate the parameters Σ⁻¹ and µ of Np(µ, Σ), based on N independent samples x_i. The likelihood function is

∏_{i=1}^N π(x_i | Σ, µ) = (det Σ)^{−N/2} (2π)^{−Np/2} exp{ −(1/2) ∑_{j=1}^N (x_j − µ)ᵀ Σ⁻¹ (x_j − µ) }
Log-likelihood function: The log-likelihood of the data is, up to a constant,

L(Σ⁻¹, µ) = N log det Σ⁻¹ − ∑_{j=1}^N (x_j − µ)ᵀ Σ⁻¹ (x_j − µ)
How to obtain a sparse Σ⁻¹

Maximum likelihood estimate: Maximizing the log-likelihood with respect to Σ⁻¹ leads to the maximum likelihood estimate Σ⁻¹ = S⁻¹, which usually isn’t sparse (here S is the sample covariance obtained from the data). Moreover, when p > N, S is singular, so the maximum likelihood estimate cannot be computed.
Penalised log-likelihood function: To find a sparse Σ⁻¹ we use the penalised log-likelihood function

argmin_{Σ⁻¹ ≻ 0} { −log det Σ⁻¹ + tr(SΣ⁻¹) + ‖P ∗ Σ⁻¹‖₁ }

where S is the sample covariance obtained from the data and P is our sparsity constraint.
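The penalised objective is simple to evaluate numerically; a minimal numpy sketch (our illustration, reading ∗ as elementwise multiplication and taking Theta, S, P as given):

    import numpy as np

    def penalised_neg_loglik(Theta, S, P):
        """-log det(Theta) + tr(S Theta) + ||P * Theta||_1 for a candidate precision Theta."""
        sign, logdet = np.linalg.slogdet(Theta)
        if sign <= 0:
            return np.inf            # Theta must be positive definite
        return -logdet + np.trace(S @ Theta) + np.abs(P * Theta).sum()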
In a nutshell
π(x | m) ∝ π(m | x)π(x)

Expect that x ∼ Np(µ, Σ). What is Σ⁻¹?

Estimating Σ⁻¹: minimise the penalised log-likelihood function; use the algorithms below to find Σ⁻¹.

Back to the original inverse problem: solve the original inverse problem using Σ⁻¹,

argmin_x ‖Ax − m‖² + (x − µ)ᵀ Σ⁻¹ (x − µ)
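Both terms in this point estimate are quadratic in x, so the minimiser solves a linear system; a minimal numpy sketch (our illustration, with the noise weighting absorbed into Theta):

    import numpy as np

    def map_estimate(A, m, mu, Theta):
        """Minimise ||A x - m||^2 + (x - mu)^T Theta (x - mu) for precision Theta.
        Setting the gradient to zero gives (A^T A + Theta) x = A^T m + Theta mu."""
        return np.linalg.solve(A.T @ A + Theta, A.T @ m + Theta @ mu)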
Graphical Lasso
Σ₁⁻¹ = [ a11 a12 a13 ]        Σ₂⁻¹ = [ b11   0  b13 ]
       [ a12 a22 a23 ]               [   0 b22 b23 ]
       [ a13 a23 a33 ]               [ b13 b23 b33 ]

The zero entries of Σ₂⁻¹ encode conditional independence between the corresponding pairs of variables.
GLASSO
Θ := Σ⁻¹

Θ = argmin_{Θ ≻ 0} −log det(Θ) + tr(SΘ) + λ‖Θ‖₁

⇕

Θ⁻¹ − S − λΓ(Θ) = 0,

where (Γ(Θ))ij = Γ(Θij) is the subgradient of |Θij|:

Γ(Θij) = sign(Θij) if Θij ≠ 0,
Γ(Θij) ∈ [−1, 1] if Θij = 0.
GLASSO
Θ := Σ⁻¹,  W := Θ⁻¹

Θ = [ Θ11 θ12 ]    S = [ S11 s12 ]    W = [ W11 w12 ]
    [ θ21 θ22 ]        [ s21 s22 ]        [ w21 w22 ]

Start with W = S + λI (we also added thresholding). Cycle around the rows/columns till convergence:

• solve the QP
  β = argmin_{β ∈ R^{p−1}} (1/2) β′W11β + β′s12 + λ‖β‖₁,   W11 ≻ 0,
  using as warm start the solution from the previous round for this column
• update w12 = −W11β
• save β for this column in the matrix B

For every row/column compute θ22 = (s22 + λ − β′w12)⁻¹ and θ12 = β θ22.
[J. Friedman, T. Hastie, R. Tibshirani, 2007]
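For reference, here is a compact, didactic numpy sketch of this block coordinate descent in the standard Friedman-Hastie-Tibshirani sign convention (which differs from the slide's by the substitution β → −β); the function names are ours and this is not the talk's implementation. In practice one would call a packaged solver such as sklearn.covariance.graphical_lasso.

    import numpy as np

    def _lasso_cd(W11, s12, lam, beta, n_sweeps=100, tol=1e-6):
        """Coordinate descent for min_b 0.5 b'W11 b - b's12 + lam*||b||_1."""
        for _ in range(n_sweeps):
            delta = 0.0
            for j in range(len(beta)):
                # partial residual excluding coordinate j
                r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
                new = np.sign(r) * max(abs(r) - lam, 0.0) / W11[j, j]
                delta = max(delta, abs(new - beta[j]))
                beta[j] = new
            if delta < tol:
                break
        return beta

    def glasso(S, lam, n_outer=50, tol=1e-4):
        """Graphical lasso via block coordinate descent (FHT 2007 conventions)."""
        p = S.shape[0]
        W = S + lam * np.eye(p)       # initialise W = S + lambda*I
        B = np.zeros((p - 1, p))      # warm starts: one beta column per variable
        for _ in range(n_outer):
            W_old = W.copy()
            for i in range(p):
                idx = np.arange(p) != i
                W11 = W[np.ix_(idx, idx)]
                B[:, i] = _lasso_cd(W11, S[idx, i], lam, B[:, i])
                w12 = W11 @ B[:, i]   # update the off-diagonal block of W
                W[idx, i] = w12
                W[i, idx] = w12
            if np.abs(W - W_old).max() < tol:
                break
        # recover Theta = W^{-1} from the stored betas
        Theta = np.zeros((p, p))
        for i in range(p):
            idx = np.arange(p) != i
            theta22 = 1.0 / (W[i, i] - W[idx, i] @ B[:, i])
            Theta[i, i] = theta22
            Theta[idx, i] = -B[:, i] * theta22
            Theta[i, idx] = Theta[idx, i]
        return Theta, W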
GLASSO 1.7
Theorem. The solution to the graphical lasso problem is block diagonal with blocks C1, C2, ..., CK ⇔ |Sij| ≤ λ for all i ∈ Ck, j ∈ Cl, k ≠ l.

Θ = [ Θ1              ]
    [      Θ2         ]
    [          ...    ]
    [              ΘK ]

where Θk solves the graphical lasso problem applied to the submatrix of S consisting of the entries whose indices are in Ck.
[D.M. Witten, J. Friedman, N. Simon, 2011]
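This screening rule is cheap to apply before running any solver: threshold |S| at λ and take connected components. A minimal scipy sketch (our illustration):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def glasso_blocks(S, lam):
        """Index sets of the diagonal blocks of the graphical lasso solution at penalty lam."""
        adj = csr_matrix(np.abs(S) > lam)                 # keep entries with |S_ij| > lam
        n_blocks, labels = connected_components(adj, directed=False)
        return [np.flatnonzero(labels == k) for k in range(n_blocks)]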
Problem Formulation
Cost Functional Definition: Our goal is to minimize the following functional over the set of positive definite matrices:

min_{Σ⁻¹ ≻ 0} f(Σ⁻¹) = −log(det Σ⁻¹) + tr(Σ⁻¹S) + (ρ/2)‖R(Σ⁻¹)‖₂²   (1)

Minimizing the above is equivalent, writing Θ := Σ⁻¹, to minimizing:

min_{Θ ≻ 0} f(Θ) = −log(det Θ) + tr(ΘS) + (ρ/2)‖R(Θ)‖₂²   (2)
Model Reduction: Closest Neighbor

Neighborhood rule: Reduce the model by assuming that each pixel is related only to itself and to those pixels with which it shares a boundary or a corner. [Figure: a pixel and its eight neighbours]

This rule reduces the problem dimension from n = (nx ny)² to m = 5 nx ny − 3 nx − 3 ny + 2, where nx and ny are the number of pixels on the x-axis and y-axis respectively.
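The count m is the number of distinct pixel pairs allowed by the rule, plus the diagonal; a quick check (our illustration, reusing the Kronecker pattern from earlier):

    import numpy as np
    from scipy.sparse import diags, kron, triu

    def free_parameters(nx, ny):
        """Upper-triangular (incl. diagonal) non-zeros under the 8-neighbour rule."""
        T = lambda n: diags([1, 1, 1], [-1, 0, 1], shape=(n, n))
        return triu(kron(T(ny), T(nx))).nnz

    nx, ny = 10, 10
    assert free_parameters(nx, ny) == 5 * nx * ny - 3 * nx - 3 * ny + 2   # = 442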
Model Reduction: Problem Reformulation

Transformation: Once g : R^m → R^{n,n} is defined, the problem becomes an unconstrained optimization problem on R^m. Therefore the minimization problem (2) is written as:

min_{v ∈ R^m} J(v) = f(g(v)) = −log(det g(v)) + tr(g(v)S) + (ρ/2)‖R(g(v))‖₂²   (3)
Numerical Approach
Quasi-Newton’s Method: Expanding J(v) to a quadratic function we get:

J(v + ∆) ≈ J(v) + ∇J(v)·∆ + (1/2) ∆ᵗH(v)∆   (4)

Therefore the ∆ that minimizes J(v + ∆) must be the solution of ∆ = −H⁻¹(v)∇J(v). So, given an initial guess v₀, one can solve the iterative sequence:

v_k = v_{k−1} − α_k H_k⁻¹(v_{k−1}) J′(v_{k−1})   (5)
Gradient computation: The gradient of J(v) is computed exactly by

J′(v) = ∇f(g(v)) · g′(v) ∈ R^m   (6)

where ∇f(Θ) is given by

(∇f)_{i,j} = −(Θ⁻¹)_{j,i} + S_{j,i} + ρ[R′(Θ) ∂Θ/∂σ_{i,j}]_{i,j}   (7)

Hessian computation: Since it is computationally expensive to compute the Hessian, one should use an approximation method such as BFGS.
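This gradient-only setup plugs directly into a limited-memory BFGS routine; a minimal scipy sketch (our illustration: obj_and_grad uses the trivial full parameterization g and omits the ρ-term, so every name here is a stand-in):

    import numpy as np
    from scipy.optimize import minimize

    def obj_and_grad(v, S, rho):
        """Hypothetical J(v) and its exact gradient; g is the full parameterization."""
        p = S.shape[0]
        Theta = v.reshape(p, p)
        Theta = 0.5 * (Theta + Theta.T)              # symmetrize
        sign, logdet = np.linalg.slogdet(Theta)
        if sign <= 0:
            return 1e10, np.zeros_like(v)            # crude barrier: keep iterates PD
        J = -logdet + np.trace(S @ Theta)            # rho-term omitted in this toy sketch
        grad = (-np.linalg.inv(Theta) + S).reshape(-1)
        return J, grad

    p = 5
    S = np.cov(np.random.default_rng(0).normal(size=(p, 200)))
    v0 = np.eye(p).reshape(-1)                       # initial guess: identity
    res = minimize(obj_and_grad, v0, args=(S, 0.0), jac=True, method="L-BFGS-B")
    Theta_hat = res.x.reshape(p, p)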
Numerical Results, n = 100

[Figure: error analysis, ‖g(v)S − I‖₂]

[Figure: error analysis, ‖g(v) − Θ‖₂]
QUIC Algorithm (Cho-Jui Hsieh, University of Texas):
Cost function:

min_Θ J(Θ) = −log(det Θ) + tr(ΘS) + (ρ/2)‖Θ‖₁   (8)

Split into a differentiable function and a non-differentiable one:

g(Θ) = −log(det Θ) + tr(ΘS)   (9)

h(Θ) = (ρ/2)‖Θ‖₁   (10)
QUIC Algorithm (Cho-Jui Hsieh, University of Texas):

Second-order approximation of g(·):

G(∆) = g(Θ + ∆) ≈ g(Θ) + ∇g(Θ)·∆ + (1/2) ∆ᵗH∆   (11)

New cost functional:

min_∆ F(∆) = G(∆) + h(Θ + ∆)   (12)

Choice of ∆, one coordinate pair at a time:

∆ = µ[e_j e_iᵗ + e_i e_jᵗ]   (13)

min_µ F(µ)   (14)
Overview / Future Work
Overview
• Model Reduction
• Numerical Approach
• Numerical Results
• QUIC Algorithm

Future Work
• Improve Memory Efficiency
• Better Choice of x₀
• Different Rules of Reduction
• Incorporate the Model Reduction into QUIC
Simulation
Inverse Covariance Test Matrix:
Σ⁻¹ = [Figure: 100 × 100 sparsity pattern; nz = 838]
↓
Multivariate Normal Sampling:
x₁, x₂, . . . , x_N
↓
Sample Covariance:
S = xxᵀ/N
↓
Inverse Covariance Estimation:
estimated Σ⁻¹
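A minimal numpy sketch of this pipeline (our illustration; a tridiagonal test Theta stands in for the talk's nine-diagonal matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    p, N = 100, 100

    # Sparse test precision matrix (tridiagonal stand-in for the 9-diagonal pattern).
    Theta = 2.1 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)

    # Sample x ~ N(0, Sigma) with Sigma = Theta^{-1}: if Theta = L L^T and z ~ N(0, I),
    # then x = L^{-T} z has covariance (L L^T)^{-1} = Theta^{-1}.
    L = np.linalg.cholesky(Theta)
    Z = rng.standard_normal((p, N))
    X = np.linalg.solve(L.T, Z)

    S = X @ X.T / N   # sample covariance (zero mean assumed, as on the slide)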
Timing (s)

Sample data: X ∈ R^{d×N} with N = 100. Covariance matrix S ∈ R^{d×d}.

Algorithm      | d = 49      | d = 100     | d = 200     | d = 400
S⁻¹            | 0.0004      | 0.0018      | 0.0083      | 0.0477
GLASSO         | 0.01 ± 0.01 | 0.17 ± 0.01 | 2.03 ± 0.03 | 23.51 ± 0.22
QUIC           | 0.04 ± 0.02 | 0.39 ± 0.03 | 4.72 ± 0.11 | 77.15 ± 2.46
quasi-Newton   | 10.5        | 56.2        | ∗           | ∗
Error Analysis (N = 100)

[Figure: Frobenius norm of the estimation error versus dimension d (0 to 400) for quasi-Newton, QUIC or GLASSO, and S inverse]
Training Set extracted from
Results
[Figure: denoising results; panels: original, noisy (σ = 5%), GMRF denoising, TV denoising]
Remarks
1. Computed the inverse covariance matrix on 15 × 17 subblocks and aggregated blocks;

2. Assumed stationarity;

3. Assumed Gaussianity of the image pixels;

4. Performed very basic (poor) alignment of the training set;

5. Had a small (29 images) and unhealthy training set.
Nevertheless, it performs better than TV denoising!
Future Challenges
1. Computational challenge: computing the inverse covariance matrix on full images (e.g. 40,000 × 40,000)

• Could Iteratively Re-weighted Least Squares Minimization for Sparse Recovery help?

2. Parameter selection: “warm start” and generalized cross-validation;

3. Non-stationary model: use a bigger and better-aligned database;

4. Try the Gaussianity model in a transform domain, or higher-order models.
THANK YOU FOR YOUR TIME!!!