Hierarchical Bayesian Methods
Bhaskar D Rao
University of California, San Diego
Thanks to David Wipf, Jason Palmer, Zhilin Zhang, R. Prasad and Ritwik Giri
Bhaskar D Rao University of California, San Diego
Bayesian Methods
1. MAP Estimation Framework (Type I)
2. Hierarchical Bayesian Framework (Type II)
MAP Estimation Framework (Type I)
Problem Statement
x̂ = arg max_x p(x|y) = arg max_x p(y|x) p(x)

Choosing a Laplacian prior p(x) = (a/2) e^(−a|x|) and a Gaussian likelihood p(y|x) leads to the familiar LASSO framework.
Hierarchical Bayes (Type II): Sparse Bayesian Learning (SBL)
MAP methods use only the mode of the posterior p(x|y); SBL uses posterior information beyond the mode, i.e. the full posterior distribution p(x|y).
Problem
For such sparse priors the normalized posterior p(x|y) cannot be computed in closed form, so approximations are needed.
Hierarchical Bayesian Framework (Type II)
Problem Statement
γ̂ = arg max_γ p(γ|y) = arg max_γ ∫ p(y|x) p(x|γ) p(γ) dx

Using this estimate of γ, one can compute the desired posterior p(x|y; γ̂).
Hierarchical Bayesian Framework (Type II)
Potential Advantages
Averaging over x leads to fewer minima in p(γ|y).
γ can tie several parameters, leading to fewer parameters.
Example: Bayesian LASSO
The Laplacian prior p(x) can be represented as a Gaussian scale mixture in this fashion:

p(x) = ∫ p(x|γ) p(γ) dγ
     = ∫ (1/√(2πγ)) exp(−x²/(2γ)) · (a²/2) exp(−(a²/2)γ) dγ
     = (a/2) exp(−a|x|)
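This identity can be checked numerically by integrating the Gaussian against the exponential mixing density; a small sketch (the value a = 1.5 is an arbitrary illustrative choice):

```python
import math
from scipy.integrate import quad

def laplacian_via_gsm(x, a):
    """Integrate N(x; 0, γ) against the mixing density (a²/2) exp(−(a²/2)γ)."""
    def integrand(g):
        gauss = math.exp(-x * x / (2 * g)) / math.sqrt(2 * math.pi * g)
        mixing = (a * a / 2) * math.exp(-(a * a / 2) * g)
        return gauss * mixing
    val, _ = quad(integrand, 0, math.inf)
    return val

a = 1.5
for x in (0.3, 1.0, 2.5):
    direct = (a / 2) * math.exp(-a * abs(x))
    print(x, laplacian_via_gsm(x, a), direct)
```

The two columns agree to quadrature accuracy, confirming the closed-form result above.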
Hierarchical Bayes: Sparse Bayesian Learning (SBL)
Instead of solving a MAP problem in x, this framework estimates the hyperparameters γ and then an estimate of the posterior distribution for x, i.e. p(x|y; γ̂). (Sparse Bayesian Learning)
Useful Representation for Sparse Priors
In order for this framework to be useful, we need tractable representations: Gaussian Scale Mixtures (GSM).
Gaussian Scale Mixtures: model for a random variable X

X = √γ G, where G ∼ N(g; 0, 1)

γ is a positive random variable, independent of G.

p(x) = ∫ p(x|γ) p(γ) dγ = ∫ N(x; 0, γ) p(γ) dγ
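A quick simulation illustrating the representation: drawing γ from an exponential mixing density and multiplying √γ into a standard normal yields Laplacian-distributed samples (a = 1 and the sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.0
n = 200_000

gamma = rng.exponential(scale=2 / a**2, size=n)  # p(γ) = (a²/2) exp(−(a²/2)γ)
x = np.sqrt(gamma) * rng.standard_normal(n)      # x | γ ~ N(0, γ)

# The Laplacian p(x) = (a/2) e^(−a|x|) has variance 2/a² and E|x| = 1/a
print(x.var(), 2 / a**2)
print(np.abs(x).mean(), 1 / a)
```

The empirical moments match the Laplacian's moments, as the mixture identity requires.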
Gaussian Scale Mixtures
Most sparse priors over x (including those with concave g) can be represented in this GSM form; different scale mixing densities p(γ_i) lead to different sparse priors [Palmer et al., 2006].
Example: Laplacian density

p(x; a) = (a/2) exp(−a|x|)

Scale mixing density: p(γ) = (a²/2) exp(−(a²/2)γ), γ ≥ 0.
Examples of Gaussian Scale Mixtures
Student-t Distribution

p(x; a, b) = (b^a Γ(a + 1/2)) / ((2π)^(1/2) Γ(a)) · 1 / (b + x²/2)^(a+1/2)

Scale mixing density: Inverse Gamma distribution

p(γ) = (b^a / Γ(a)) γ^(−(a+1)) e^(−b/γ) u(γ)

Generalized Gaussian

p(x; p) = 1 / (2Γ(1 + 1/p)) · e^(−|x|^p)

Scale mixing density: positive alpha-stable density of order p/2.
Generalized logistic density

p(x; α) = (Γ(2α) / Γ(α)²) · e^(−αx) / (1 + e^(−x))^(2α)

Scale mixing density: related to Kolmogorov–Smirnov distance statistics.
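The Student-t case can be verified numerically in the same spirit as the Laplacian example; a sketch with arbitrary illustrative values a = 2, b = 1:

```python
import math
from scipy.integrate import quad

def student_t_via_gsm(x, a, b):
    """Integrate N(x; 0, γ) against the Inverse-Gamma(a, b) mixing density."""
    def integrand(g):
        gauss = math.exp(-x * x / (2 * g)) / math.sqrt(2 * math.pi * g)
        inv_gamma = (b ** a / math.gamma(a)) * g ** (-(a + 1)) * math.exp(-b / g)
        return gauss * inv_gamma
    val, _ = quad(integrand, 0, math.inf)
    return val

def student_t_direct(x, a, b):
    """Closed form from the slide: b^a Γ(a+1/2) / (√(2π) Γ(a)) (b + x²/2)^-(a+1/2)."""
    c = b ** a * math.gamma(a + 0.5) / (math.sqrt(2 * math.pi) * math.gamma(a))
    return c / (b + x * x / 2) ** (a + 0.5)

print(student_t_via_gsm(0.7, 2, 1), student_t_direct(0.7, 2, 1))
```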
Sparse Bayesian Learning (Tipping)
y = Ax + v
Solving for the MAP estimate of γ:

γ̂ = arg max_γ p(γ|y) = arg max_γ p(y, γ) = arg max_γ p(y|γ) p(γ)

What is p(y|γ)?
Given γ, x is Gaussian with mean zero and covariance matrix Γ = diag(γ), i.e. p(x|γ) = N(x; 0, Γ) = Π_i N(x_i; 0, γ_i).
Then p(y|γ) = N(y; 0, Σ_y), where Σ_y = σ²I + AΓAᵀ:

p(y|γ) = (1/√((2π)^N |Σ_y|)) exp(−(1/2) yᵀ Σ_y⁻¹ y)
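A small numpy sketch of this marginal likelihood (the dimensions and noise level are illustrative assumptions); `slogdet` and `solve` avoid forming Σ_y⁻¹ explicitly:

```python
import numpy as np

def log_evidence(y, A, gamma, sigma2):
    """log p(y|γ) = log N(y; 0, σ²I + A Γ Aᵀ) with Γ = diag(γ)."""
    N = len(y)
    Sigma_y = sigma2 * np.eye(N) + (A * gamma) @ A.T   # A Γ Aᵀ without forming Γ
    sign, logdet = np.linalg.slogdet(Sigma_y)
    quad_term = y @ np.linalg.solve(Sigma_y, y)        # yᵀ Σ_y⁻¹ y
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad_term)

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 12))
gamma = rng.uniform(0.1, 2.0, size=12)
y = rng.standard_normal(5)
val = log_evidence(y, A, gamma, sigma2=0.1)
print(val)
```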
MAP estimate of γ
γ̂ = arg min_γ ( log|Σ_y| + yᵀ Σ_y⁻¹ y − 2 Σ_i log p(γ_i) )
Computational Methods
Many options exist for solving this optimization problem, e.g. Majorization-Minimization and Expectation-Maximization (EM).
Sparse Bayesian Learning
y = Ax + v
Computing Posterior
Because of our convenient GSM choice, the posterior can be easily computed: p(x|y; γ̂) = N(µ_x, Σ_x), where

µ_x = E[x|y; γ̂] = Γ̂Aᵀ(σ²I + AΓ̂Aᵀ)⁻¹ y
Σ_x = Cov[x|y; γ̂] = Γ̂ − Γ̂Aᵀ(σ²I + AΓ̂Aᵀ)⁻¹AΓ̂

µ_x can be used as a point estimate.
Sparsity of µ_x is achieved through sparsity in γ.
Another quantity of interest for the EM algorithm:

E(x_i²|y, γ̂) = µ_x(i)² + Σ_x(i, i)
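These moments translate directly into numpy; a minimal sketch (problem sizes in the test are assumptions):

```python
import numpy as np

def posterior_moments(y, A, gamma, sigma2):
    """Posterior mean and covariance of x given y, under p(x|γ) = N(0, diag(γ))."""
    N = len(y)
    AG = A * gamma                                  # A Γ̂ (scales columns by γ)
    Sigma_y = sigma2 * np.eye(N) + AG @ A.T         # σ²I + A Γ̂ Aᵀ
    K = np.linalg.solve(Sigma_y, AG).T              # Γ̂ Aᵀ Σ_y⁻¹, shape M x N
    mu_x = K @ y                                    # posterior mean
    Sigma_x = np.diag(gamma) - K @ AG               # posterior covariance
    return mu_x, Sigma_x
```

`K` is computed with a linear solve rather than an explicit inverse, which is both cheaper and numerically safer.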
EM algorithm: Updating γ
Treat (y, x) as the complete data, with x as the hidden variable.

log p(y, x, γ) = log p(y|x) + log p(x|γ) + log p(γ)

E step

Q(γ|γᵏ) = E_{x|y; γᵏ}[log p(y|x) + log p(x|γ) + log p(γ)]

M step

γᵏ⁺¹ = arg max_γ Q(γ|γᵏ) = arg max_γ E_{x|y; γᵏ}[log p(x|γ) + log p(γ)]
     = arg min_γ E_{x|y; γᵏ}[ Σ_{i=1}^M ( x_i²/(2γ_i) + (1/2) log γ_i ) − log p(γ) ]

Solving this optimization problem with a non-informative prior p(γ):

γ_iᵏ⁺¹ = E(x_i²|y, γᵏ) = µ_x(i)² + Σ_x(i, i)
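Putting the E and M steps together gives a compact SBL-EM loop. This is a minimal sketch under a non-informative prior on γ; the problem sizes, noise level, and iteration count are illustrative assumptions:

```python
import numpy as np

def sbl_em(y, A, sigma2, n_iter=300):
    """EM for SBL with a non-informative prior on γ: γ_i ← E[x_i² | y, γ]."""
    N, M = A.shape
    gamma = np.ones(M)
    mu = np.zeros(M)
    for _ in range(n_iter):
        AG = A * gamma                                  # A Γ
        Sigma_y = sigma2 * np.eye(N) + AG @ A.T         # σ²I + A Γ Aᵀ
        K = np.linalg.solve(Sigma_y, AG).T              # Γ Aᵀ Σ_y⁻¹
        mu = K @ y                                      # posterior mean µ_x (E step)
        diag_Sigma = gamma - np.sum(K * AG.T, axis=1)   # diag(Σ_x)
        gamma = mu**2 + diag_Sigma                      # M-step update
    return mu, gamma

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 40))
A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(40)
x0[[3, 11, 30]] = [1.5, -1.0, 2.0]                      # 3-sparse truth
y = A @ x0
mu, gamma = sbl_em(y, A, sigma2=1e-4)
```

On this near-noiseless example, the off-support γ_i shrink toward zero, which is how µ_x inherits sparsity from γ.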
SBL properties
Local minima are sparse, i.e. have at most N nonzero γ_i.
The cost function p(γ|y) is generally much smoother than the associated MAP estimation objective p(x|y), with fewer local minima.
At high signal-to-noise ratio, the global minimum is the sparsest solution. No structural problems.
Attempts to approximate the posterior distribution p(x|y) in the region with significant mass.
Algorithmic Variants
Fixed-point iteration based on setting the derivative of the objective function to zero (Tipping)
Sequential search for the significant γ's (Tipping and Faul)
Majorization-Minimization based approach (Wipf and Nagarajan)
Reweighted ℓ1 and ℓ2 algorithms (Wipf and Nagarajan)
Approximate Message Passing (Al-Shoukairi and Rao)
Empirical Comparison
For each test case
1. Generate a random dictionary A with 50 rows and 250 columns from the normal distribution and normalize each column to have unit 2-norm.
2. Select the support for the true sparse coefficient vector x0 randomly.
3. Generate the non-zero components of x0 from the normal distribution.
4. Compute the signal, y = Ax0 (noiseless case).
5. Compare SBL with previous methods with regard to estimating x0.
6. Average over 1000 independent trials.
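Steps 1–4 above can be sketched as follows; the sparsity level k is a free parameter of the experiment:

```python
import numpy as np

def make_trial(n=50, m=250, k=10, rng=None):
    """One test case: normalized Gaussian dictionary, random k-sparse x0, y = A x0."""
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((n, m))
    A /= np.linalg.norm(A, axis=0)              # unit 2-norm columns
    x0 = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    x0[support] = rng.standard_normal(k)        # Gaussian non-zero components
    return A, x0, A @ x0                        # noiseless signal y
```

Each of the 1000 trials would call this once and feed (A, y) to the competing algorithms.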
Empirical Comparison: 1000 trials
Figure: Probability of successful recovery vs. number of nonzero coefficients.
Useful Extensions
Multiple Measurement Vectors (MMV)
Block Sparsity
Block MMV
MMV with time varying sparsity
Multiple Measurement Vectors (MMV)
Multiple measurements: L measurements
Common Sparsity Profile: k nonzero rows
Block Sparsity
Variations include equal blocks, unequal blocks, and block boundaries known or unknown.
MMV solutions
Greedy Search Algorithms: extend MP, OMP to search for row sparsity.
Regularization methods

X̂ = arg min_X [ ‖Y − AX‖_F² + λ G(X) ]

Choice of G(X):
G(X) = Σ_{i=1}^M ‖X_{i,·}‖₂, where X_{i,·} is the i-th row of matrix X (extension of ℓ1)
G(X) = Σ_{i=1}^M log(‖X_{i,·}‖₂ + ε) (extension of the Candès, Wakin and Boyd penalty)
G(X) = Σ_{i=1}^M log(‖X_{i,·}‖₂² + ε) (extension of the Chartrand and Yin penalty)
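The three penalties differ only in how the row 2-norms enter; a small sketch computing all of them (the ε value is an illustrative assumption):

```python
import numpy as np

def row_penalties(X, eps=1e-3):
    """The three row-sparsity penalties G(X), built from the row 2-norms of X."""
    r = np.linalg.norm(X, axis=1)            # ‖X_{i,·}‖₂ for each row i
    return {
        "l2_rows": r.sum(),                  # extension of ℓ1
        "log_l2": np.log(r + eps).sum(),     # Candès-Wakin-Boyd style
        "log_sq": np.log(r**2 + eps).sum(),  # Chartrand-Yin style
    }

X = np.array([[3.0, 4.0], [0.0, 0.0]])       # one active row, one zero row
print(row_penalties(X))
```

The logarithmic variants penalize the zero row far less than an active row of any size, which is what encourages row sparsity.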
Bayesian Methods
Representation for Random Vectors

X = √γ G, where G ∼ N(g; 0, B)

γ is a positive random variable, independent of G.

p(x) = ∫ p(x|γ) p(γ) dγ = ∫ N(x; 0, γB) p(γ) dγ

B = I if the row entries are assumed independent.
EM algorithm: Updating γ
Treat (Y, X) as the complete data, with X as the hidden variable.

log p(Y, X, γ) = log p(Y|X) + log p(X|γ) + log p(γ)

E step

Q(γ|γᵏ) = E_{X|Y; γᵏ}[log p(Y|X) + log p(X|γ) + log p(γ)]

M step

γᵏ⁺¹ = arg max_γ Q(γ|γᵏ) = arg max_γ E_{X|Y; γᵏ}[log p(X|γ) + log p(γ)]
     = arg min_γ E_{X|Y; γᵏ}[ Σ_{i=1}^M ( ‖X_{i,·}‖₂²/(2γ_i) + (L/2) log γ_i ) − log p(γ) ]

Solving this optimization problem with a non-informative prior p(γ):

γ_iᵏ⁺¹ = (1/L) E(‖X_{i,·}‖₂²|Y, γᵏ) = (1/L) Σ_{l=1}^L ( µ_x(i, l)² + Σ_x(i, i, l) )
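A sketch of one such MMV update, reusing the single-vector posterior machinery and averaging the second moments over the L columns (dimensions and noise level in the test are illustrative assumptions; with a common A and shared γ, the posterior covariance is identical across columns):

```python
import numpy as np

def mmv_em_update(Y, A, gamma, sigma2):
    """One EM update of γ for Y = AX + V, with the rows of X tied by γ."""
    N, L = Y.shape
    AG = A * gamma
    Sigma_y = sigma2 * np.eye(N) + AG @ A.T
    K = np.linalg.solve(Sigma_y, AG).T             # Γ Aᵀ Σ_y⁻¹, shape M x N
    Mu = K @ Y                                     # posterior means, one column per l
    diag_Sigma = gamma - np.sum(K * AG.T, axis=1)  # diag(Σ_x), shared across columns
    return (Mu**2).sum(axis=1) / L + diag_Sigma    # average second moment per row
```

Iterating this update drives γ_i to zero for rows outside the common sparsity profile, exactly as in the single-measurement case.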
Empirical Comparison: Multiple Measurement Vectors (MMV)
Generate the data matrix via Y = ΦX₀ (noiseless), where:
1. X₀ is 100-by-5 with random non-zero rows.
2. Φ is 50-by-100 with i.i.d. Gaussian entries.
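The experimental setup above can be generated in a few lines. The number of non-zero rows is not specified on the slide, so k = 10 below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 100, 50, 5      # dictionary columns, measurements, snapshots
k = 10                    # number of non-zero rows (assumed; not given above)

Phi = rng.standard_normal((N, M))            # 50-by-100, i.i.d. Gaussian
support = rng.choice(M, size=k, replace=False)
X0 = np.zeros((M, L))
X0[support] = rng.standard_normal((k, L))    # random non-zero rows
Y = Phi @ X0                                 # noiseless MMV measurements
```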
Summary
Sparse Signal Recovery (SSR) and Compressed Sensing (CS) are interesting new signal processing tools with many potential applications.
Many algorithmic options exist to solve the underlying sparse signal recovery problem: greedy search techniques, regularization methods, and Bayesian methods, among others.
Nice theoretical results exist, particularly for the greedy search algorithms and ℓ1 recovery methods.
Bayesian methods offer interesting algorithmic options for the Sparse Signal Recovery problem:
MAP methods (reweighted ℓ1 and ℓ2 methods)
Hierarchical Bayesian methods (Sparse Bayesian Learning)
Versatile, and can be more easily employed in problems with structure
Algorithms can often be justified by studying the resulting objective functions.
More applications to come, enriching the field.