Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables
Dominique Brunet, Hanne Kekkonen, Vitor Nunes, Iryna Sivak
Fields-MITACS Thematic Program on Inverse Problems and Imaging
Group on Bayesian/Statistical Inverse Problems
July 27, 2012
Menu
• Hanne: Problem Formulation and Motivation; Frequentist and Bayesian Origins; Relation to Imaging
• Iryna: The LASSO Methods
• Vitor: Newton Methods and Quasi-Newton Methods; The QUIC Trick; Thoughts on Future Work
• Dominique: Simulation and Results
• Conclusions
Bayes’ formula brings a priori information and measured data together
Statistical Inversion: We consider the linear measurement model
m = Ax + ε
where x ∈ Rp and m ∈ Rp are treated as random variables.
Bayes’ formula gives us the posterior distribution π(x | m):
π(x | m) ∝ π(m | x)π(x)
We recover x as a point estimate from π(x | m).
Likelihood distribution describes the measurement
If we assume that the noise ε is Gaussian
ε_k ∼ N(µ, σ²), 1 ≤ k ≤ p,

we can write the likelihood distribution, which measures the data misfit, as

π(m | x) = π_ε(‖Ax − m‖) = 1/(σ√(2π)) · exp( −‖Ax − m‖₂² / (2σ²) ).
Prior distribution contains all other information about the unknown x
• The prior distribution shouldn’t depend on the measurement.
• The prior distribution should assign a clearly higher probability to those values of x that we expect to see than to unexpected x.
• Designing a good prior distribution is one of the main difficulties in statistical inversion.
Gaussian prior distribution
Gaussian model: If we assume that x is also Gaussian,

x ∼ Np(µ, Σ),

we can write

π(x) = (det Σ⁻¹)^{1/2} / (2π)^{p/2} · exp( −(x − µ)ᵀ Σ⁻¹ (x − µ) / 2 )
Estimating the inverse covariance matrix Σ⁻¹

We consider the problem of finding a good estimator for the inverse covariance matrix Σ⁻¹ under the constraint that certain given pairs of variables are conditionally independent.

The conditional independence constraints describe the sparsity pattern of the inverse covariance matrix Σ⁻¹: zeros indicate conditional independence between the corresponding variables.
Why the inverse covariance matrix?
Simple example: Consider a line of objects connected by springs.

If we move the leftmost object y₁, the rightmost object y₅ will also move, so clearly y₁ and y₅ are not independent.
However, y₁ and y₅ are conditionally independent given all the other variables: once we know how y₂ moves, knowing y₅ gives us no further information about y₁. In fact, each yᵢ is conditionally dependent only on its neighbours.
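To make the spring example concrete, here is a minimal numpy sketch (our illustration, not part of the original talk): a five-variable chain with a tridiagonal precision matrix Θ, whose zeros encode exactly these conditional independences even though the covariance Σ = Θ⁻¹ is dense.

    import numpy as np

    # Tridiagonal precision matrix for the chain y1 - y2 - y3 - y4 - y5:
    # each variable interacts only with its immediate neighbours.
    Theta = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
    Theta[0, 0] = Theta[4, 4] = 1.5   # boundary terms; Theta stays positive definite

    Sigma = np.linalg.inv(Theta)

    # Theta[0, 4] == 0: y1 and y5 are conditionally independent given the rest,
    # yet Sigma[0, 4] != 0: they are still marginally correlated.
    print(Theta[0, 4], Sigma[0, 4].round(3))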
What does this have to do with pictures?
Neighbourhood rule: We are interested in exploring the assumption that every pixel is conditionally dependent only on its neighbours.
[Figure: sparsity pattern of the 100 × 100 inverse covariance matrix; nz = 838]
In this case the inverse covariance matrix has non-zero elements only on nine ’diagonals’.
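One way to see where the nine diagonals come from (a hypothetical sketch, not code from the talk): under an 8-neighbour rule on an nx × ny pixel grid, the allowed sparsity pattern is the Kronecker product of two tridiagonal indicator matrices.

    import numpy as np
    from scipy.sparse import diags, kron

    def neighbour_pattern(nx, ny):
        """0/1 sparsity pattern for the 8-neighbour rule on an nx-by-ny pixel grid."""
        Tx = diags([1, 1, 1], [-1, 0, 1], shape=(nx, nx))  # tridiagonal indicator
        Ty = diags([1, 1, 1], [-1, 0, 1], shape=(ny, ny))
        return kron(Ty, Tx).tocsr()   # non-zeros lie on nine 'diagonals'

    P = neighbour_pattern(10, 10)
    print(P.shape, P.nnz)   # (100, 100), 784 non-zeros on this 10 x 10 grid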
Log-likelihood function
Likelihood function: We want to estimate the parameters Σ⁻¹ and µ of Np(µ, Σ), based on N independent samples x_i. The likelihood function is

∏_{i=1}^N π(x_i | Σ, µ) = (det Σ)^{−N/2} (2π)^{−Np/2} exp{ −(1/2) ∑_{j=1}^N (x_j − µ)ᵀ Σ⁻¹ (x_j − µ) }
Log-likelihood function: The log-likelihood of the data is, up to a constant,

L(Σ⁻¹, µ) = N log det Σ⁻¹ − ∑_{j=1}^N (x_j − µ)ᵀ Σ⁻¹ (x_j − µ)
How to obtain a sparse Σ⁻¹

Maximum likelihood estimate: Maximizing the log-likelihood with respect to Σ⁻¹ leads to the maximum likelihood estimate Σ⁻¹ = S⁻¹, which usually isn’t sparse (here S is the sample covariance obtained from the data). Moreover, when p > N, S is singular, so the maximum likelihood estimate cannot be computed.
Penalised log-likelihood function: To find a sparse Σ⁻¹ we use the penalised log-likelihood function

argmin_{Σ⁻¹ ≻ 0} { −log det Σ⁻¹ + tr(SΣ⁻¹) + ‖P ∗ Σ⁻¹‖₁ }

where S is the sample covariance obtained from the data and P is our sparsity constraint.
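The penalised objective is simple to evaluate numerically; a minimal numpy sketch (our illustration, reading ∗ as elementwise multiplication and taking Theta, S, P as given):

    import numpy as np

    def penalised_neg_loglik(Theta, S, P):
        """-log det(Theta) + tr(S Theta) + ||P * Theta||_1 for a candidate precision Theta."""
        sign, logdet = np.linalg.slogdet(Theta)
        if sign <= 0:
            return np.inf            # Theta must be positive definite
        return -logdet + np.trace(S @ Theta) + np.abs(P * Theta).sum()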
In a nutshell
π(x | m) ∝ π(m | x)π(x)

Expect that x ∼ Np(µ, Σ). What is Σ⁻¹?

Estimating Σ⁻¹: minimise the penalised log-likelihood function; use the algorithms below to find Σ⁻¹.

Back to the original inverse problem: solve the original inverse problem using Σ⁻¹,

argmin_x ‖Ax − m‖² + (x − µ)ᵀ Σ⁻¹ (x − µ)
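Both terms in this point estimate are quadratic in x, so the minimiser solves a linear system; a minimal numpy sketch (our illustration, with the noise weighting absorbed into Theta):

    import numpy as np

    def map_estimate(A, m, mu, Theta):
        """Minimise ||A x - m||^2 + (x - mu)^T Theta (x - mu) for precision Theta.
        Setting the gradient to zero gives (A^T A + Theta) x = A^T m + Theta mu."""
        return np.linalg.solve(A.T @ A + Theta, A.T @ m + Theta @ mu)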
Graphical Lasso
Σ₁⁻¹ = [ a11 a12 a13 ]        Σ₂⁻¹ = [ b11   0  b13 ]
       [ a12 a22 a23 ]               [   0 b22 b23 ]
       [ a13 a23 a33 ]               [ b13 b23 b33 ]

The zero entries of Σ₂⁻¹ encode conditional independence between the corresponding pairs of variables.
GLASSO
Θ := Σ⁻¹

Θ = argmin_{Θ ≻ 0} −log det(Θ) + tr(SΘ) + λ‖Θ‖₁

⇕

Θ⁻¹ − S − λΓ(Θ) = 0,

where (Γ(Θ))ij = Γ(Θij) is the subgradient of |Θij|:

Γ(Θij) = sign(Θij) if Θij ≠ 0,
Γ(Θij) ∈ [−1, 1] if Θij = 0.
GLASSO
Θ := Σ⁻¹,  W := Θ⁻¹

Θ = [ Θ11 θ12 ]    S = [ S11 s12 ]    W = [ W11 w12 ]
    [ θ21 θ22 ]        [ s21 s22 ]        [ w21 w22 ]

Start with W = S + λI (we also added thresholding). Cycle around the rows/columns till convergence:

• solve the QP
  β = argmin_{β ∈ R^{p−1}} (1/2) β′W11β + β′s12 + λ‖β‖₁,   W11 ≻ 0,
  using as warm start the solution from the previous round for this column
• update w12 = −W11β
• save β for this column in the matrix B

For every row/column compute θ22 = (s22 + λ − β′w12)⁻¹ and θ12 = β θ22.
[J. Friedman, T. Hastie, R. Tibshirani, 2007]
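For reference, here is a compact, didactic numpy sketch of this block coordinate descent in the standard Friedman-Hastie-Tibshirani sign convention (which differs from the slide's by the substitution β → −β); the function names are ours and this is not the talk's implementation. In practice one would call a packaged solver such as sklearn.covariance.graphical_lasso.

    import numpy as np

    def _lasso_cd(W11, s12, lam, beta, n_sweeps=100, tol=1e-6):
        """Coordinate descent for min_b 0.5 b'W11 b - b's12 + lam*||b||_1."""
        for _ in range(n_sweeps):
            delta = 0.0
            for j in range(len(beta)):
                # partial residual excluding coordinate j
                r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
                new = np.sign(r) * max(abs(r) - lam, 0.0) / W11[j, j]
                delta = max(delta, abs(new - beta[j]))
                beta[j] = new
            if delta < tol:
                break
        return beta

    def glasso(S, lam, n_outer=50, tol=1e-4):
        """Graphical lasso via block coordinate descent (FHT 2007 conventions)."""
        p = S.shape[0]
        W = S + lam * np.eye(p)       # initialise W = S + lambda*I
        B = np.zeros((p - 1, p))      # warm starts: one beta column per variable
        for _ in range(n_outer):
            W_old = W.copy()
            for i in range(p):
                idx = np.arange(p) != i
                W11 = W[np.ix_(idx, idx)]
                B[:, i] = _lasso_cd(W11, S[idx, i], lam, B[:, i])
                w12 = W11 @ B[:, i]   # update the off-diagonal block of W
                W[idx, i] = w12
                W[i, idx] = w12
            if np.abs(W - W_old).max() < tol:
                break
        # recover Theta = W^{-1} from the stored betas
        Theta = np.zeros((p, p))
        for i in range(p):
            idx = np.arange(p) != i
            theta22 = 1.0 / (W[i, i] - W[idx, i] @ B[:, i])
            Theta[i, i] = theta22
            Theta[idx, i] = -B[:, i] * theta22
            Theta[i, idx] = Theta[idx, i]
        return Theta, W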
GLASSO 1.7
Theorem. The solution to the graphical lasso problem is block diagonal with blocks C1, C2, ..., CK ⇔ |Sij| ≤ λ for all i ∈ Ck, j ∈ Cl, k ≠ l.

Θ = [ Θ1              ]
    [      Θ2         ]
    [          ...    ]
    [              ΘK ]

where Θk solves the graphical lasso problem applied to the submatrix of S consisting of the entries whose indices are in Ck.
[D.M. Witten, J. Friedman, N. Simon, 2011]
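This screening rule is cheap to apply before running any solver: threshold |S| at λ and take connected components. A minimal scipy sketch (our illustration):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def glasso_blocks(S, lam):
        """Index sets of the diagonal blocks of the graphical lasso solution at penalty lam."""
        adj = csr_matrix(np.abs(S) > lam)                 # keep entries with |S_ij| > lam
        n_blocks, labels = connected_components(adj, directed=False)
        return [np.flatnonzero(labels == k) for k in range(n_blocks)]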
Problem Formulation
Cost Functional Definition: Our goal is to minimize the following functional over the set of positive definite matrices:

min_{Σ⁻¹ ≻ 0} f(Σ⁻¹) = −log(det Σ⁻¹) + tr(Σ⁻¹S) + (ρ/2)‖R(Σ⁻¹)‖₂²   (1)

Minimizing the above is equivalent, writing Θ := Σ⁻¹, to minimizing:

min_{Θ ≻ 0} f(Θ) = −log(det Θ) + tr(ΘS) + (ρ/2)‖R(Θ)‖₂²   (2)
Model Reduction: Closest Neighbor

Neighborhood rule: Reduce the model by assuming that each pixel is related only to itself and to those pixels with which it shares a boundary or a corner. [Figure: a pixel and its eight neighbours]

This rule reduces the problem dimension from n = (nx ny)² to m = 5 nx ny − 3 nx − 3 ny + 2, where nx and ny are the number of pixels on the x-axis and y-axis respectively.
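The count m is the number of distinct pixel pairs allowed by the rule, plus the diagonal; a quick check (our illustration, reusing the Kronecker pattern from earlier):

    import numpy as np
    from scipy.sparse import diags, kron, triu

    def free_parameters(nx, ny):
        """Upper-triangular (incl. diagonal) non-zeros under the 8-neighbour rule."""
        T = lambda n: diags([1, 1, 1], [-1, 0, 1], shape=(n, n))
        return triu(kron(T(ny), T(nx))).nnz

    nx, ny = 10, 10
    assert free_parameters(nx, ny) == 5 * nx * ny - 3 * nx - 3 * ny + 2   # = 442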
Model Reduction: Problem Reformulation

Transformation: Once g : R^m → R^{n,n} is defined, the problem becomes an unconstrained optimization problem on R^m. Therefore the minimization problem (2) is written as:

min_{v ∈ R^m} J(v) = f(g(v)) = −log(det g(v)) + tr(g(v)S) + (ρ/2)‖R(g(v))‖₂²   (3)
Numerical Approach
Quasi-Newton’s Method: Expanding J(v) to a quadratic function we get:

J(v + ∆) ≈ J(v) + ∇J(v)·∆ + (1/2) ∆ᵗH(v)∆   (4)

Therefore the ∆ that minimizes J(v + ∆) must be the solution of ∆ = −H⁻¹(v)∇J(v). So, given an initial guess v₀, one can solve the iterative sequence:

v_k = v_{k−1} − α_k H_k⁻¹(v_{k−1}) J′(v_{k−1})   (5)
Gradient computation: The gradient of J(v) is computed exactly by

J′(v) = ∇f(g(v)) · g′(v) ∈ R^m   (6)

where ∇f(Θ) is given by

(∇f)_{i,j} = −(Θ⁻¹)_{j,i} + S_{j,i} + ρ[R′(Θ) ∂Θ/∂σ_{i,j}]_{i,j}   (7)

Hessian computation: Since it is computationally expensive to compute the Hessian, one should use an approximation method such as BFGS.
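This gradient-only setup plugs directly into a limited-memory BFGS routine; a minimal scipy sketch (our illustration: obj_and_grad uses the trivial full parameterization g and omits the ρ-term, so every name here is a stand-in):

    import numpy as np
    from scipy.optimize import minimize

    def obj_and_grad(v, S, rho):
        """Hypothetical J(v) and its exact gradient; g is the full parameterization."""
        p = S.shape[0]
        Theta = v.reshape(p, p)
        Theta = 0.5 * (Theta + Theta.T)              # symmetrize
        sign, logdet = np.linalg.slogdet(Theta)
        if sign <= 0:
            return 1e10, np.zeros_like(v)            # crude barrier: keep iterates PD
        J = -logdet + np.trace(S @ Theta)            # rho-term omitted in this toy sketch
        grad = (-np.linalg.inv(Theta) + S).reshape(-1)
        return J, grad

    p = 5
    S = np.cov(np.random.default_rng(0).normal(size=(p, 200)))
    v0 = np.eye(p).reshape(-1)                       # initial guess: identity
    res = minimize(obj_and_grad, v0, args=(S, 0.0), jac=True, method="L-BFGS-B")
    Theta_hat = res.x.reshape(p, p)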
Numerical Results, n = 100

[Figure: error analysis, ‖g(v)S − I‖₂]

[Figure: error analysis, ‖g(v) − Θ‖₂]
QUIC Algorithm (Cho-Jui Hsieh, University of Texas):
Cost function:

min_Θ J(Θ) = −log(det Θ) + tr(ΘS) + (ρ/2)‖Θ‖₁   (8)

Split into a differentiable function and a non-differentiable one:

g(Θ) = −log(det Θ) + tr(ΘS)   (9)

h(Θ) = (ρ/2)‖Θ‖₁   (10)
QUIC Algorithm (Cho-Jui Hsieh, University of Texas):

Second-order approximation of g(·):

G(∆) = g(Θ + ∆) ≈ g(Θ) + ∇g(Θ)·∆ + (1/2) ∆ᵗH∆   (11)

New cost functional:

min_∆ F(∆) = G(∆) + h(Θ + ∆)   (12)

Choice of ∆, one coordinate pair at a time:

∆ = µ[e_j e_iᵗ + e_i e_jᵗ]   (13)

min_µ F(µ)   (14)
Overview / Future Work
Overview
• Model Reduction
• Numerical Approach
• Numerical Results
• QUIC Algorithm

Future Work
• Improve Memory Efficiency
• Better Choice of x₀
• Different Rules of Reduction
• Incorporate the Model Reduction into QUIC
Simulation
Inverse Covariance Test Matrix:
Σ⁻¹ = [Figure: 100 × 100 sparsity pattern; nz = 838]
↓
Multivariate Normal Sampling:
x₁, x₂, . . . , x_N
↓
Sample Covariance:
S = xxᵀ/N
↓
Inverse Covariance Estimation:
estimated Σ⁻¹
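A minimal numpy sketch of this pipeline (our illustration; a tridiagonal test Theta stands in for the talk's nine-diagonal matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    p, N = 100, 100

    # Sparse test precision matrix (tridiagonal stand-in for the 9-diagonal pattern).
    Theta = 2.1 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)

    # Sample x ~ N(0, Sigma) with Sigma = Theta^{-1}: if Theta = L L^T and z ~ N(0, I),
    # then x = L^{-T} z has covariance (L L^T)^{-1} = Theta^{-1}.
    L = np.linalg.cholesky(Theta)
    Z = rng.standard_normal((p, N))
    X = np.linalg.solve(L.T, Z)

    S = X @ X.T / N   # sample covariance (zero mean assumed, as on the slide)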
Timing (s)

Sample data: X ∈ R^{d×N} with N = 100. Covariance matrix S ∈ R^{d×d}.

Algorithm      | d = 49      | d = 100     | d = 200     | d = 400
S⁻¹            | 0.0004      | 0.0018      | 0.0083      | 0.0477
GLASSO         | 0.01 ± 0.01 | 0.17 ± 0.01 | 2.03 ± 0.03 | 23.51 ± 0.22
QUIC           | 0.04 ± 0.02 | 0.39 ± 0.03 | 4.72 ± 0.11 | 77.15 ± 2.46
quasi-Newton   | 10.5        | 56.2        | ∗           | ∗
Error Analysis (N = 100)

[Figure: Frobenius norm of the estimation error versus dimension d (0 to 400) for quasi-Newton, QUIC or GLASSO, and S inverse]
Training Set extracted from
Results
[Figure: denoising results; panels: original, noisy (σ = 5%), GMRF denoising, TV denoising]
Remarks
1. Computed the inverse covariance matrix on 15 × 17 subblocks and aggregated blocks;

2. Assumed stationarity;

3. Assumed Gaussianity of the image pixels;

4. Performed very basic (poor) alignment of the training set;

5. Had a small (29 images) and unhealthy training set.
Nevertheless, it performs better than TV denoising!
Future Challenges
1. Computational challenge: computing the inverse covariance matrix on full images (e.g. 40,000 × 40,000)

• Could Iteratively Re-weighted Least Squares Minimization for Sparse Recovery help?

2. Parameter selection: “warm start” and generalized cross-validation;

3. Non-stationary model: use a bigger and better-aligned database;

4. Try the Gaussianity model in a transform domain, or higher-order models.
THANK YOU FOR YOUR TIME!!!