
U.U.D.M. Project Report 2009:17

Degree project in mathematical statistics, 30 credits
Supervisor and examiner: Silvelyn Zwanzig

September 2009

Department of Mathematics, Uppsala University

On Simulation Methods for Two Component Normal Mixture Models under Bayesian Approach

Liwen Liang


Abstract

The EM-Algorithm and the Gibbs sampler are two useful Bayesian simulation methods for parameter estimation in finite normal mixture models. The EM-Algorithm is an iterative maximum likelihood estimation method for incomplete-data problems. The Gibbs sampler is an approach for generating random samples from a multivariate distribution. We introduce and derive the Dempster EM-Algorithm for the two-component normal mixture model to obtain iterative parameter estimates, and we use data augmentation and the general Gibbs sampler to draw samples from the posterior distribution under a conjugate prior. The estimation results of both simulation methods under the two-component normal mixture model with unknown mean parameters are compared, and the connections and differences between the two methods are presented. A data set from astronomy is used for the comparison.


Acknowledgement

I would like to thank my supervisor Silvelyn Zwanzig for the patience, guidance and encouragement she always gave me, not only for this thesis but throughout my studies in statistics.

I would also like to thank my friend Han Jun for the assistance with LaTeX, Alena for the data source, and my parents for the spiritual and material support and wholehearted love they have given me all my life.

Finally, I would like to thank the Department of Mathematics of Uppsala University for giving me the opportunity to study.


Contents

1 Introduction
  1.1 Data source and notation
  1.2 Two-component normal mixture model

2 Iterative estimation and the EM-Algorithm
  2.1 Iterative estimation for nonlinear likelihood function
  2.2 The EM-Algorithm
    2.2.1 Latent variable model
    2.2.2 General Dempster EM-Algorithm
    2.2.3 Dempster EM-Algorithm for latent model
    2.2.4 HTF EM-Algorithm
  2.3 Convergence of the EM-Algorithm

3 Posterior distribution and Gibbs sampler
  3.1 Indicator normal mixture model
  3.2 Conjugate prior and posterior
  3.3 Gibbs sampler
    3.3.1 General Gibbs sampler and its properties
    3.3.2 Gibbs sampler for normal mixture model
    3.3.3 Gibbs sampler for normal mixture with known σ_j^2 and π

4 Application
  4.1 Simulation results
    4.1.1 Simulation results for the fictitious data set
    4.1.2 Simulation results for the astronomy data set
  4.2 Confidence interval
  4.3 Discussion

A Appendix: Astronomic background of the data set

B Appendix: Programme results
  B.1 Simulation results

C Appendix: R programme code
  C.1 R code for Algorithm 3 with σ_1^2, σ_2^2 and π known
  C.2 R code for Algorithm 7
  C.3 R code for plots

References

1 Introduction

The Bayesian approach has been widely used in the social sciences and has developed rapidly for statistical analysis since the 1990s, when Markov chain Monte Carlo methods became widely available. There are many simulation methods for parameter estimation in finite mixture models, especially finite mixtures of exponential family distributions. We focus on the EM-Algorithm and the Gibbs sampler; both are useful tools for solving the difficult computational problems of finite mixture models.

The EM-Algorithm is an iterative maximum likelihood estimation method for incomplete-data¹ problems. The Gibbs sampler is an approach for generating random samples from a multivariate distribution. The finite normal mixture model is a classical example of how the EM-Algorithm fits the incomplete-data situation, and we focus on the two-component normal mixture model in this thesis.

In Section 1 we first introduce the two-component normal mixture model with some basic properties. In Section 2 we propose an iterative plug-in procedure for solving the nonlinear likelihood equations to obtain the parameter estimators and present the EM-Algorithm; its monotonicity and convergence are discussed. In Section 3 we present an indicator (latent) normal mixture model to derive the Bayesian posterior distribution and introduce the Gibbs sampler. Finally, in Section 4 we compare the parameter estimates obtained from both algorithm procedures. We find that the two simulation methods give similar estimates; the EM-Algorithm is more stable but carries less information, while the Gibbs sampler is just the opposite.

1.1 Data source and notation

Two data sets are used for the parameter estimation comparison. One is the fictitious data set from Hastie, Tibshirani and Friedman (2001), which contains 20 observations. The other is a real astronomical data set observed in 2006 by FLAMES-ARGUS, a spectrograph on the Very Large Telescope (VLT)². It contains 81 observations ranging from -1250.6222 to 502910.81. A detailed background of the astronomical data is given in Appendix A.

¹ The incomplete-data situation covers missing data, truncated distributions, and censored or grouped observations. In fact, the EM-Algorithm can also be applied to a whole variety of situations where the incompleteness of the data is not so obvious, such as the latent variable structure introduced later for our model. For details see [10].

² The VLT is a system built and operated by the European Southern Observatory (ESO). It consists of four separate optical telescopes: the Antu, Kueyen, Melipal and Yepun telescopes. Combined, the VLT provides the light-collecting power of a 16-metre single telescope, making it one of the largest optical telescope facilities in the world.


Figure 1: Picture of Planetary Nebulae (PNe)

We primarily follow the symbols and presentation of [1] for the basic expressions, but also give forms from other books and authors for comparison. We now introduce the two-component normal mixture model.

1.2 Two-component normal mixture model

Suppose Y is a mixture of two normal distributions: Y_1 ∼ N(µ_1, σ_1^2), Y_2 ∼ N(µ_2, σ_2^2), where Y_1 and Y_2 are independent. Then we have the two-component normal mixture model

    Y = (1 - ∆) · Y_1 + ∆ · Y_2,                                            (1.1)

where ∆ = 0 or 1 with Pr(∆ = 1) = π, i.e. ∆ ∼ Ber(π). ∆ is called the unobservable or latent data.

Let θ = (π, µ_1, σ_1^2, µ_2, σ_2^2) denote the unknown parameter vector, where θ_1 = (µ_1, σ_1^2) and θ_2 = (µ_2, σ_2^2). Denote the parameter space by Ω, where

    Ω = [0, 1] × R^2 × R_+^2.

Let ϕ_{θ_j}(y), j = 1, 2, denote the normal density of component j. Then the density of the two-component normal mixture Y under parameter θ is

    g_Y(y) = (1 - π) ϕ_{θ_1}(y) + π ϕ_{θ_2}(y).                             (1.2)

Since Y_1 and Y_2 are independent and ∆ is independent of them, we have the following basic properties of Y:

    E[Y | θ]   = (1 - π) µ_1 + π µ_2,
    E[Y^2 | θ] = (1 - π)(µ_1^2 + σ_1^2) + π(µ_2^2 + σ_2^2).
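To make the model concrete, here is a minimal R sketch (our own illustration, not part of the thesis) that simulates draws from model (1.1) and checks the first moment E[Y | θ] empirically; the parameter values are the illustrative ones used later in Section 4.1.1.

```r
# Simulate the two-component normal mixture (1.1) and check E[Y | theta].
# Parameter values are illustrative (taken from Section 4.1.1).
set.seed(1)
N  <- 1e5
p  <- 0.546; mu1 <- 4.62; s1 <- sqrt(0.87); mu2 <- 1.06; s2 <- sqrt(0.77)
delta <- rbinom(N, 1, p)                                  # Delta ~ Ber(pi)
y     <- (1 - delta) * rnorm(N, mu1, s1) + delta * rnorm(N, mu2, s2)
c(empirical = mean(y), theoretical = (1 - p) * mu1 + p * mu2)
```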

2 Iterative estimation and the EM-Algorithm

We are interested in parameter estimation. First introduce the training data Z = {Z_1, ..., Z_N}, where the Z_i are i.i.d. copies of Y, with Y_{1i} ∼ N(µ_1, σ_1^2), Y_{2i} ∼ N(µ_2, σ_2^2) and ∆_i ∼ Ber(π), i.i.d., i = 1, ..., N, so that

    Z_i = (1 - ∆_i) Y_{1i} + ∆_i Y_{2i},

and

    Z_i | ∆_i = 1 ∼ N(µ_2, σ_2^2),
    Z_i | ∆_i = 0 ∼ N(µ_1, σ_1^2),
    p(Z_i | θ) = (1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i).

From equations (1.1) and (1.2), the log-likelihood of model (1.1) based on the N training cases Z is

    ℓ(θ; Z) = ∑_{i=1}^N log[(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)].            (2.1)

Then we have the partial derivatives with respect to each parameter:

(∗)
    ∂ℓ/∂π     = ∑_{i=1}^N [ϕ_{θ_2}(y_i) - ϕ_{θ_1}(y_i)] / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)],

    ∂ℓ/∂µ_1   = ∑_{i=1}^N { (1 - π)ϕ_{θ_1}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] } · (y_i - µ_1)/σ_1^2,

    ∂ℓ/∂σ_1^2 = ∑_{i=1}^N { (1 - π)ϕ_{θ_1}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] } · (1/(2σ_1^2)) ( (y_i - µ_1)^2/σ_1^2 - 1 ),

    ∂ℓ/∂µ_2   = ∑_{i=1}^N { πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] } · (y_i - µ_2)/σ_2^2,

    ∂ℓ/∂σ_2^2 = ∑_{i=1}^N { πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] } · (1/(2σ_2^2)) ( (y_i - µ_2)^2/σ_2^2 - 1 ).

We want to solve ∂ℓ/∂θ = 0 to get the MLEs. Maximizing ℓ(θ; Z) directly is quite difficult, since these equations are nonlinear and no analytic solution can be found. Numerical procedures such as iterative optimization methods are therefore used to obtain successive approximations of the solution. In this section we focus on the EM-Algorithm. First consider the idea of the iterative plug-in procedure.

2.1 Iterative estimation for nonlinear likelihood function

From the derivative system (∗) we notice that the expressions

    (1 - π)ϕ_{θ_1}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)]    or    πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)]

appear in the last four equations, and the first equation can also be written as

    ∂ℓ/∂π = ∑_{i=1}^N [ϕ_{θ_2}(y_i) - ϕ_{θ_1}(y_i)] / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)]
          = ∑_{i=1}^N ( πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] · 1/π
                        - (1 - π)ϕ_{θ_1}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] · 1/(1 - π) ).

So we denote

    γ_i(θ) = πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)]            (2.2)

and introduce γ_i(θ) into (∗), which gives the plug-in derivative system

(∗∗)
    ∂ℓ/∂π     = ∑_{i=1}^N ( γ_i(θ)/π - (1 - γ_i(θ))/(1 - π) ),
    ∂ℓ/∂µ_1   = ∑_{i=1}^N (1 - γ_i(θ)) (y_i - µ_1)/σ_1^2,
    ∂ℓ/∂σ_1^2 = ∑_{i=1}^N [(1 - γ_i(θ))/(2σ_1^2)] ( (y_i - µ_1)^2/σ_1^2 - 1 ),
    ∂ℓ/∂µ_2   = ∑_{i=1}^N γ_i(θ) (y_i - µ_2)/σ_2^2,
    ∂ℓ/∂σ_2^2 = ∑_{i=1}^N [γ_i(θ)/(2σ_2^2)] ( (y_i - µ_2)^2/σ_2^2 - 1 ).

Setting ∂ℓ/∂θ = 0, the equation systems (∗) and (∗∗) can be written in matrix form as

    L'(θ_MLE) = 0,
    H(γ(θ_MLE), θ_MLE) = 0.                                                 (2.3)

We want to solve (2.3) and write the parameter estimator as

    θ_MLE = h(γ(θ_MLE)).

Notice that θ is also contained in γ(θ), which is called the plug-in estimator, so we propose an iterative procedure to solve problem (2.3):

1. Let θ^(j) denote the current parameter value at the j-th step, so γ_i^(j) = γ_i(θ^(j)) is the current plug-in estimator, where

       γ_i^(j) = π^(j)ϕ_{θ_2^(j)}(y_i) / [(1 - π^(j))ϕ_{θ_1^(j)}(y_i) + π^(j)ϕ_{θ_2^(j)}(y_i)].

2. Introduce γ_i^(j) into equation (2.3) and obtain the iterative plug-in parameter estimator

       θ^(j+1) = (π^(j+1), µ_1^(j+1), (σ_1^2)^(j+1), µ_2^(j+1), (σ_2^2)^(j+1)),

   where

       π^(j+1) = (1/N) ∑_{i=1}^N γ_i^(j),

       µ_1^(j+1)     = ∑_{i=1}^N (1 - γ_i^(j)) y_i / ∑_{i=1}^N (1 - γ_i^(j)),
       (σ_1^2)^(j+1) = ∑_{i=1}^N (1 - γ_i^(j)) (y_i - µ_1^(j+1))^2 / ∑_{i=1}^N (1 - γ_i^(j)),        (2.4)

       µ_2^(j+1)     = ∑_{i=1}^N γ_i^(j) y_i / ∑_{i=1}^N γ_i^(j),
       (σ_2^2)^(j+1) = ∑_{i=1}^N γ_i^(j) (y_i - µ_2^(j+1))^2 / ∑_{i=1}^N γ_i^(j).

By this iterative procedure we obtain the parameter estimator

    θ^(j+1) = h(γ(θ^(j))).

Given an initial value θ^(0), we continue this iterative procedure for j = 1, 2, ..., n, ... and take the iterative parameter estimator θ^(n), n → ∞, as an approximation of the solution of (2.3), provided the iteration converges. This procedure is similar to iteratively reweighted least squares; in fact, for the normal mixture model the EM-Algorithm has exactly this form. We discuss the general form of the EM-Algorithm in the following section.
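As an illustration of one sweep of this plug-in iteration, the following R sketch (our own, not from the thesis; the function name and list element names are assumptions) computes γ_i(θ^(j)) and the updated estimates (2.4) from a current parameter value.

```r
# One iteration of the plug-in update (2.4): theta_new = h(gamma(theta_old)).
# theta is a list with elements p, mu1, s1sq, mu2, s2sq (names are our own).
plug_in_step <- function(y, theta) {
  g <- with(theta, p * dnorm(y, mu2, sqrt(s2sq)) /
                   ((1 - p) * dnorm(y, mu1, sqrt(s1sq)) +
                    p * dnorm(y, mu2, sqrt(s2sq))))        # gamma_i(theta), eq. (2.2)
  mu1 <- sum((1 - g) * y) / sum(1 - g)
  mu2 <- sum(g * y) / sum(g)
  list(p    = mean(g),
       mu1  = mu1, s1sq = sum((1 - g) * (y - mu1)^2) / sum(1 - g),
       mu2  = mu2, s2sq = sum(g * (y - mu2)^2) / sum(g))
}
```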

2.2 The EM-Algorithm

The earliest literature related to an EM-type algorithm appears in Newcomb (1886), with the estimation of the parameters of a mixture of two univariate normal distributions. McKendrick (1926) also gives a method in the basic spirit of the EM-Algorithm.

The formulation of the EM-Algorithm was first introduced by Dempster, Laird and Rubin in 1977. The convergence and other basic properties of the EM-Algorithm under general conditions were established in their paper.

Before presenting the EM-Algorithm procedure, we first introduce the latent variable normal mixture model (called the latent model for short in this thesis) to give a further understanding of the latent variable. The latent model is another form of the two-component normal mixture model which treats the latent data ∆ as part of the model (see also [1]).

2.2.1 Latent variable model

Recall the original model (1.1),

    Y = (1 - ∆) · Y_1 + ∆ · Y_2,

where ∆ ∼ Ber(π). For the training data Z = {Z_1, ..., Z_N}, where the Z_i are i.i.d. copies of Y, i = 1, ..., N, we have

    Z_i = (1 - ∆_i) Y_{1i} + ∆_i Y_{2i},

and the log-likelihood function

    ℓ(θ; Z) = ∑_{i=1}^N log[(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)].

Now consider the unobserved latent variables ∆_i: when ∆_i = 0, Y_i comes from N(µ_1, σ_1^2); when ∆_i = 1, Y_i comes from N(µ_2, σ_2^2). If we knew the values of the ∆_i, we would obtain the latent variable model with log-likelihood

    ℓ_0(θ; Z, ∆) = ∑_{i=1}^N [(1 - ∆_i) log ϕ_{θ_1}(y_i) + ∆_i log ϕ_{θ_2}(y_i)].            (2.5)

In fact the values of the ∆_i are unknown, so we substitute each ∆_i in (2.5) by the expected value of ∆_i | θ, Z. We have

    E(∆_i | θ, Z) = p(∆_i = 1 | θ, Z_i)
                  = p(∆_i = 1, Z_i | θ) / p(Z_i | θ)
                  = p(Z_i | θ, ∆_i = 1) p(∆_i = 1 | θ) / p(Z_i | θ)
                  = πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)]
                  = γ_i(θ).                                                                  (2.6)

In fact, γ_i(θ) is defined as the expected value of ∆_i | θ, Z, which is called the responsibility in [1]. We now state the definition.

Definition 1. The expected value of ∆_i | θ, Z is called the responsibility of the model for observation i, denoted γ_i(θ):

    γ_i(θ) = E(∆_i | θ, Z).

Then we have the expected latent log-likelihood function

    ℓ_0(θ; Z) = E_∆[ℓ_0(θ; Z, ∆) | θ, Z]
              = E_∆( ∑_{i=1}^N [(1 - ∆_i) log ϕ_{θ_1}(y_i) + ∆_i log ϕ_{θ_2}(y_i)] )
              = ∑_{i=1}^N [ E(1 - ∆_i) log ϕ_{θ_1}(y_i) + E(∆_i) log ϕ_{θ_2}(y_i) ]
              = ∑_{i=1}^N [ (1 - γ_i) log ϕ_{θ_1}(y_i) + γ_i log ϕ_{θ_2}(y_i) ]
              = -(N/2) log(2π) - (1/2) ∑_{i=1}^N [ (1 - γ_i)(log σ_1^2 + (y_i - µ_1)^2/σ_1^2)
                                                   + γ_i(log σ_2^2 + (y_i - µ_2)^2/σ_2^2) ].        (2.7)

Obviously π̂ = (1/N) ∑_{i=1}^N γ_i, since ∑_{i=1}^N γ_i estimates the expected number of observations Y_i coming from N(µ_2, σ_2^2). Setting ∂ℓ_0/∂µ_j = ∂ℓ_0/∂σ_j^2 = 0 and following the iterative procedure of Section 2.1, we obtain the same iterative parameter estimators as in (2.4).

The latent data ∆ is part of the model in our latent normal mixture situation. In general problems the latent data could also be actual data that should have been observed but is missing. We now turn to the EM-Algorithm procedures.

2.2.2 General Dempster EM-Algorithm

Denote Z as the observed incomplete data, Z_m as the latent data (or missing data), and T = (Z, Z_m) as the unobserved complete data, where t(T) = Z collapses T to Z. First we give the general form of the Dempster EM-Algorithm (see also [2] and [12]):

Algorithm 1. Dempster EM-Algorithm

1. Choose a start value θ^(0).

2. Expectation step: at the j-th step, calculate

       Q(θ | θ^(j)) = E[ln f(T | θ) | Z, θ^(j)].

3. Maximization step: determine the new estimate θ^(j+1) as

       θ^(j+1) = arg max_θ Q(θ | θ^(j)).

4. Iterate steps 2 and 3 until convergence.

The essence of the EM-Algorithm is that maximizing Q(θ | θ^(j)) leads to an increase in the log-likelihood of the observed data. Recall Jensen's inequality and the information inequality, which are used to prove this property.

Proposition 2.1 (Jensen's Inequality). Assume that the values of the random variable W are confined to the possibly infinite interval (a, b). If h(w) is convex on (a, b), then E[h(W)] ≥ h(E[W]), provided both expectations exist. For a strictly convex function h(w), equality holds in Jensen's inequality if and only if W = E(W) almost surely.

Proof. See [12].

Proposition 2.2 (Information Inequality). Let f and g be probability densities with respect to a measure µ. Suppose f > 0 and g > 0 almost everywhere relative to µ. If E_f denotes expectation with respect to the probability measure f dµ, then E_f(ln f) ≥ E_f(ln g), with equality only if f = g almost everywhere relative to µ.

Proof. See [12].

Now we give the fundamental property of the EM-Algorithm in general (see also [12]).

Proposition 2.3. Suppose that g(Z | θ) and f(T | θ) are the probability densities of the observed and complete data, respectively. Then the EM iterates obey

    ln g(Z | θ^(j+1)) ≥ ln g(Z | θ^(j)),

with strict inequality when f(T | θ^(j+1))/g(Z | θ^(j+1)) and f(T | θ^(j))/g(Z | θ^(j)) are different conditional densities, or when the surrogate function Q(θ | θ^(j)) satisfies

    Q(θ^(j+1) | θ^(j)) > Q(θ^(j) | θ^(j)).

Proof. (See [12].) Both f(T | θ)/g(Z | θ) and f(T | θ^(j))/g(Z | θ^(j)) are conditional densities of T on {T : t(T) = Z} with respect to some measure µ_Z. By the information inequality, for Q(θ | θ^(j)) = E[ln f(T | θ) | Z, θ^(j)] we have

    Q(θ | θ^(j)) - ln g(Z | θ) = E[ ln( f(T | θ)/g(Z | θ) ) | Z, θ^(j) ]
                               ≤ E[ ln( f(T | θ^(j))/g(Z | θ^(j)) ) | Z, θ^(j) ]
                               = Q(θ^(j) | θ^(j)) - ln g(Z | θ^(j)).

Hence the difference Q(θ | θ^(j)) - ln g(Z | θ) attains its maximum at θ = θ^(j). If we choose θ^(j+1) to maximize Q(θ | θ^(j)), then

    ln g(Z | θ^(j+1)) = Q(θ^(j+1) | θ^(j)) - [Q(θ^(j+1) | θ^(j)) - ln g(Z | θ^(j+1))]
                      ≥ Q(θ^(j) | θ^(j)) - [Q(θ^(j) | θ^(j)) - ln g(Z | θ^(j))]
                      = ln g(Z | θ^(j)).

Proposition 2.3 shows that an EM iteration never decreases the log-likelihood of the observed data, so the algorithm works in general; an algorithm whose M-step merely increases Q(θ | θ^(j)) rather than maximizing it is called a generalized EM algorithm (GEM).

2.2.3 Dempster EM-Algorithm for latent model

Now apply the Dempster EM-Algorithm to the latent model. We have T = (Z, ∆) and the complete-data log-likelihood as in (2.5):

    ln f(T | θ) = ℓ_0(θ; Z, ∆) = ∑_{i=1}^N [(1 - ∆_i) log ϕ_{θ_1}(y_i) + ∆_i log ϕ_{θ_2}(y_i)].

Then

    Q(θ | θ^(j)) = E_{∆ | θ^(j)}[ℓ_0(θ; Z, ∆)] = ℓ_0(θ; Z, γ(θ^(j))),

according to (2.7). Hence the Dempster EM-Algorithm for the latent model is as follows.

Algorithm 2. Dempster EM-Algorithm for the latent model

1. Choose a start value θ^(0).

2. Expectation step: at the j-th step, calculate

       Q(θ | θ^(j)) = ℓ_0(θ; Z, γ(θ^(j))),

   where ℓ_0(θ; Z, γ(θ^(j))) is defined as in (2.7).

3. Maximization step: determine the new estimate θ^(j+1) as

       θ^(j+1) = arg max_θ Q(θ | θ^(j)).

4. Iterate steps 2 and 3 until convergence.

2.2.4 HTF EM-Algorithm

We regard the iterative plug-in parameter estimator of Section 2.1 as the maximizer of ℓ_0(θ; Z, ∆) obtained by this iteration.

Since the iterative estimation results for the original normal mixture model (1.2) and the latent model (2.5) coincide, we obtain a concrete form of the EM-Algorithm for the two-component normal mixture model, which we call the HTF EM-Algorithm³:

Algorithm 3 (see also [1]). HTF EM-Algorithm

1. Take initial guesses for the parameters π, µ_1, σ_1^2, µ_2, σ_2^2.

2. Expectation step: compute the responsibilities

       γ_i = πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)],    i = 1, 2, ..., N.

3. Maximization step: compute the weighted means and variances

       µ_1 = ∑_{i=1}^N (1 - γ_i) y_i / ∑_{i=1}^N (1 - γ_i),    σ_1^2 = ∑_{i=1}^N (1 - γ_i)(y_i - µ_1)^2 / ∑_{i=1}^N (1 - γ_i),

       µ_2 = ∑_{i=1}^N γ_i y_i / ∑_{i=1}^N γ_i,    σ_2^2 = ∑_{i=1}^N γ_i (y_i - µ_2)^2 / ∑_{i=1}^N γ_i,

   and π = ∑_{i=1}^N γ_i / N.

4. Iterate steps 2 and 3 until convergence.

The HTF EM-Algorithm is substantively the same as the iterative procedure introduced in Section 2.1. For the initial guesses of the parameters one usually takes π = 0.5, takes two of the y_i at random as initial guesses for µ_1 and µ_2, and takes σ_1^2 = σ_2^2 = ∑_{i=1}^N (y_i - ȳ)^2 / N. For details see [1].

³ We call this the HTF EM-Algorithm since Hastie, Tibshirani and Friedman introduced this form of the algorithm procedure in [1].
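As a self-contained illustration of Algorithm 3 with the initialization just described, here is a minimal R sketch (our own, not the thesis code; the appendix contains the author's version with known variances). It repeats the E- and M-steps until the parameter change is small.

```r
# HTF EM-Algorithm (Algorithm 3) for a two-component normal mixture.
# Initialization follows the rule quoted above from [1].
em_mixture <- function(y, max_iter = 500, tol = 1e-8) {
  p  <- 0.5
  mu <- sample(y, 2)                                  # two of the y_i at random
  s2 <- rep(sum((y - mean(y))^2) / length(y), 2)      # pooled variance guess
  for (it in 1:max_iter) {
    # E-step: responsibilities gamma_i (2.2)
    g <- p * dnorm(y, mu[2], sqrt(s2[2])) /
         ((1 - p) * dnorm(y, mu[1], sqrt(s2[1])) + p * dnorm(y, mu[2], sqrt(s2[2])))
    # M-step: weighted means, variances and mixing proportion
    mu_new <- c(sum((1 - g) * y) / sum(1 - g), sum(g * y) / sum(g))
    s2_new <- c(sum((1 - g) * (y - mu_new[1])^2) / sum(1 - g),
                sum(g * (y - mu_new[2])^2) / sum(g))
    p_new  <- mean(g)
    if (max(abs(c(mu_new - mu, s2_new - s2, p_new - p))) < tol) break
    mu <- mu_new; s2 <- s2_new; p <- p_new
  }
  list(pi = p_new, mu = mu_new, sigma2 = s2_new, iterations = it)
}
```

Depending on the random initialization, the two component labels may come out switched relative to the presentation in [1]; this label switching is a general feature of mixture estimation.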

2.3 Convergence of the EM-Algorithm

We have only proved the monotonicity of the EM-Algorithm in Section 2.2.2; now we are interested in the relationship between the iterative plug-in parameter estimator sequence {θ^(n)}, n → ∞, and the maximum likelihood estimator θ_MLE. Wu (1983) gave the convergence properties of the EM-Algorithm, and we state the main convergence theorem here without proof.

Theorem 2.1 (see also [24] and [26]). Let {θ^(j)} be an instance of a GEM algorithm generated by θ^(j+1) = arg max_θ Q(θ | θ^(j)). Suppose that

1. θ^(j+1) = arg max_θ Q(θ | θ^(j)) is closed over the complement of S, the set of stationary points in the interior of Ω, and

2. ℓ_0(θ; Z, γ(θ^(j+1))) > ℓ_0(θ; Z, γ(θ^(j))) for all θ^(j) not in S.

Then all limit points of {θ^(j)} are stationary points, and ℓ_0(θ; Z, γ(θ^(j))) converges monotonically to ℓ_0^* = ℓ_0(θ; Z, γ(θ^*)) for some stationary point θ^*.

Wu (1983) and Hartley & Hocking (1971) gave several very useful theorems about the convergence properties of an EM sequence (see [24], pp. 82-84 for details); we focus on the situation where there is only one stationary point in the interior of Ω. We then have the following convergence theorem for the iterative plug-in parameter estimator sequence to a unique maximum likelihood estimate.

Theorem 2.2 (see also [24]). Suppose that ℓ_0(θ; Z) is unimodal in Ω with θ^* being the only stationary point, and that ∂Q(θ, ϕ)/∂θ is continuous in θ and ϕ. Then any EM sequence {θ^(j)} converges to the unique maximizer θ^* of ℓ_0(θ; Z); that is, it converges to the unique MLE of θ.

Proof. See [24].

There are many other methods for computing the MLE, such as Newton-Raphson and Fisher's scoring method. The EM-Algorithm has several advantages relative to these iterative algorithms, such as numerical stability, reliable global convergence, easy implementation and small storage requirements. It also has a few disadvantages, such as slow convergence, but these do not prevent the EM-Algorithm from being one of the most appealing iterative methods for finding the MLE. For details see [24].

3 Posterior distribution and Gibbs sampler

In this section we derive the posterior distribution for the two-component normal mixture model under a conjugate prior and present the Gibbs sampler, both in the general case and in the normal mixture application. The Gibbs sampler is a useful simulation method which generates samples from the posterior distribution.

3.1 Indicator normal mixture model

Calculating the posterior under a conjugate prior directly for the normal mixture model (1.2),

    g_Y(y) = (1 - π)ϕ_{θ_1}(y) + πϕ_{θ_2}(y),

is very complicated (see [3]), so we introduce the zero-one component indicator vector z_i = (z_{i1}, z_{i2})⁴ into the model to make the calculation easier and the result more transparent. Consider

    z_{ij} = 1 if y_i comes from ϕ_{θ_j}, and z_{ij} = 0 otherwise.

Since each y_i can belong to only one normal component, we have z_{i1} + z_{i2} = 1. The z_{ij} are unobservable, with conditional expectation

    τ_{ij} = E[z_{ij} | y] = π_j ϕ_{θ_j}(y_i) / [π_1 ϕ_{θ_1}(y_i) + π_2 ϕ_{θ_2}(y_i)],

where π_1 = 1 - π and π_2 = π.

Now consider the joint density of the observed data y and the unobserved data z:

    f(y, z; θ) = f(z | θ) · f(y | z, θ).

From the definition of z we have, for a single observation,

    f(z | θ) = (1 - π)^{z_1} π^{z_2},
    f(y | z_1 = 1, θ) = ϕ_{θ_1}(y) = (ϕ_{θ_1}(y))^{z_1},
    f(y | z_2 = 1, θ) = ϕ_{θ_2}(y) = (ϕ_{θ_2}(y))^{z_2}.

So we get the indicator normal mixture model with joint density

    f(y, z; θ) = ∏_{i=1}^N [(1 - π)ϕ_{θ_1}(y_i)]^{z_{i1}} · [πϕ_{θ_2}(y_i)]^{z_{i2}}.            (3.1)

⁴ Here the indicator vector z plays the same role as ∆ in Section 2; in fact z_{i1} = 1 - ∆_i and z_{i2} = ∆_i. We change the symbol only for the convenience of the later calculations.

The log-likelihood function under the indicator normal mixture model is

    ℓ(θ) = ∑_{i=1}^N [z_{i1} log(1 - π) + z_{i2} log π] + ∑_{i=1}^N [z_{i1} log ϕ_{θ_1}(y_i) + z_{i2} log ϕ_{θ_2}(y_i)].

If we apply the iterative procedure of Section 2.1 to model (3.1), using τ_{ij} in place of z_{ij} and setting

    ∂ℓ(θ)/∂π_j = 0,    ∂ℓ(θ)/∂µ_j = 0,    ∂ℓ(θ)/∂σ_j^2 = 0,

we get the same iterative plug-in estimators:

    π_j^(k+1) = (1/N) ∑_{i=1}^N τ_{ij}^(k),

    µ_j^(k+1) = ∑_{i=1}^N τ_{ij}^(k) y_i / ∑_{i=1}^N τ_{ij}^(k),

    (σ_j^2)^(k+1) = ∑_{i=1}^N τ_{ij}^(k) (y_i - µ_j^(k+1))^2 / ∑_{i=1}^N τ_{ij}^(k).

So the parameter estimation results of the original normal mixture model, the latent model and the indicator normal mixture model through the EM-Algorithm are identical. For more details on the g-component normal mixture model see [10], pp. 16-20 and pp. 68-71.

3.2 Conjugate prior and posterior

Choosing a proper prior distribution is very important for the Bayesian method, since an unsuitable prior may not lead to an analytically tractable posterior. Specifying a conjugate prior guarantees an easily calculable form of the posterior. First we give the definition of conjugacy.

Definition 2. A family F of probability distributions on Θ is said to be conjugate for a likelihood function f(x | θ) if, for every prior π ∈ F, the posterior distribution π(θ | x) also belongs to F.

We already know that the inverse gamma and normal priors are conjugate for the variance and mean parameters of the normal model (see [3], [22] and [9]). The following theorem states the conjugacy property of the normal, inverse gamma and beta priors for the two-component normal mixture model.

Theorem 3.1 (Conjugacy of the normal, inverse gamma and beta priors for the two-component normal mixture model). In the two-component normal mixture model, a normal prior together with the mixture normal likelihood produces a normal posterior for each mean parameter; an inverse gamma prior with the same likelihood produces an inverse gamma posterior for each variance parameter; and a beta prior with the same likelihood produces a beta posterior for the proportion parameter.

Proof. Suppose the prior distributions are

    µ_j | σ_j^2 ∼ N(ξ_j, σ_j^2/n_j),
    σ_j^2 ∼ IG(ν_j/2, s_j^2/2),                                            (3.2)
    π ∼ Be(α, β),

with density functions

    p(µ_j | ξ_j, σ_j^2/n_j) ∝ (σ_j^2)^{-1/2} exp[ -(µ_j - ξ_j)^2 / (2σ_j^2/n_j) ],
    p(σ_j^2 | ν_j/2, s_j^2/2) ∝ (σ_j^2)^{-(ν_j/2 + 1)} exp[ -s_j^2 / (2σ_j^2) ],
    p(π | α, β) ∝ π^{α-1} (1 - π)^{β-1}.

For the indicator normal mixture model (3.1), the joint posterior of all parameters satisfies

    p(θ | y, z) ∝ p(y, z | θ) p(π | α, β) ∏_{j=1}^2 [ p(σ_j^2 | ν_j/2, s_j^2/2) p(µ_j | ξ_j, σ_j^2/n_j) ]

                = (1 - π)^{∑_i z_{i1} + β - 1} π^{∑_i z_{i2} + α - 1} × ∏_{i=1}^N (ϕ_{θ_1}(y_i))^{z_{i1}} ∏_{i=1}^N (ϕ_{θ_2}(y_i))^{z_{i2}}
                  × ∏_{j=1}^2 (σ_j^2)^{-ν_j/2 - 3/2} exp[ -s_j^2/(2σ_j^2) - (µ_j - ξ_j)^2/(2σ_j^2/n_j) ].            (3.3)

Denote z̄_j = ∑_{i=1}^N z_{ij} and ȳ_j(z) = (1/z̄_j) ∑_{i=1}^N z_{ij} y_i. From (3.3) we get the posterior distribution of θ = (π, µ_1, σ_1^2, µ_2, σ_2^2):

    p(θ | y, z) ∝ (1 - π)^{z̄_1 + β - 1} π^{z̄_2 + α - 1} ∏_{j=1}^2 (σ_j^2)^{-z̄_j/2 - ν_j/2 - 3/2}
                  × ∏_{j=1}^2 exp[ -s_j^2/(2σ_j^2) - (1/(2σ_j^2)) ∑_{i=1}^N z_{ij}(y_i - µ_j)^2 - (µ_j - ξ_j)^2/(2σ_j^2/n_j) ]

                = (1 - π)^{z̄_1 + β - 1} π^{z̄_2 + α - 1} ∏_{j=1}^2 (σ_j^2)^{-z̄_j/2 - ν_j/2 - 3/2}
                  × ∏_{j=1}^2 exp[ -s_j^2/(2σ_j^2) - (1/(2σ_j^2)) ( ∑_{i=1}^N z_{ij} y_i^2 - 2 z̄_j ȳ_j(z) µ_j + z̄_j µ_j^2 ) - (µ_j^2 - 2µ_j ξ_j + ξ_j^2)/(2σ_j^2/n_j) ]

                = (1 - π)^{z̄_1 + β - 1} π^{z̄_2 + α - 1} ∏_{j=1}^2 (σ_j^2)^{-z̄_j/2 - ν_j/2 - 1/2} exp[ -s_j^2/(2σ_j^2) - (1/(2σ_j^2)) ( ∑_{i=1}^N z_{ij} y_i^2 - k_j ȳ^2 ) ]
                  × ∏_{j=1}^2 (σ_j^2)^{-1} exp[ -(1/(2σ_j^2)) ( (z̄_j + n_j) µ_j^2 - 2 (z̄_j ȳ_j(z) + ξ_j n_j) µ_j + (k_j ȳ^2 + n_j ξ_j^2) ) ],            (3.4)

where

    k_j ȳ^2 = [ (z̄_j ȳ_j(z) + ξ_j n_j)^2 - (z̄_j + n_j) n_j ξ_j^2 ] / (z̄_j + n_j).

The second product in (3.4) consists of two normal kernels in µ_1 and µ_2, while the first part contains only π and the σ_j^2; given (y, z), the parameters π, (µ_1, σ_1^2) and (µ_2, σ_2^2) are mutually independent. Integrating over π, µ_1, µ_2 and σ_2^2 gives the posterior of σ_1^2:

    p(σ_1^2 | y, z) = ∫_0^∞ ∫_{-∞}^∞ ∫_{-∞}^∞ ∫_0^1 p(θ | y, z) dπ dµ_1 dµ_2 dσ_2^2
                    ∝ (σ_1^2)^{-(z̄_1 + ν_1)/2 - 1} exp[ -(1/σ_1^2) ( s_1^2/2 + (1/2)( ∑_{i=1}^N z_{i1} y_i^2 - k_1 ȳ^2 ) ) ];

here integrating out µ_1 contributes a factor (σ_1^2)^{1/2}, and the constant left over from completing the square vanishes by the definition of k_1 ȳ^2. Similarly,

    p(σ_2^2 | y, z) ∝ (σ_2^2)^{-(z̄_2 + ν_2)/2 - 1} exp[ -(1/σ_2^2) ( s_2^2/2 + (1/2)( ∑_{i=1}^N z_{i2} y_i^2 - k_2 ȳ^2 ) ) ].

Hence the posterior distribution of σ_j^2 is again inverse gamma:

    σ_j^2 | y, z ∼ IG( (ν_j + z̄_j)/2 , (1/2)[ s_j^2 + ∑_{i=1}^N z_{ij} (y_i - ȳ_j(z))^2 + n_j z̄_j (ȳ_j(z) - ξ_j)^2 / (n_j + z̄_j) ] ).            (3.5)

Then the posterior distribution of (π, µ_1, µ_2) is

    p(π, µ_1, µ_2 | y, z) = p(θ | y, z) / ∏_{j=1}^2 p(σ_j^2 | y, z)
        ∝ (1 - π)^{z̄_1 + β - 1} π^{z̄_2 + α - 1} ∏_{j=1}^2 (σ_j^2)^{-1} exp[ -(1/(2σ_j^2)) ( (z̄_j + n_j) µ_j^2 - 2 (z̄_j ȳ_j(z) + ξ_j n_j) µ_j + (k_j ȳ^2 + n_j ξ_j^2) ) ].

Integrating over π and µ_2 gives the posterior of µ_1:

    p(µ_1 | σ_1^2, y, z) ∝ σ_1^{-2} exp[ -(1/(2σ_1^2/(z̄_1 + n_1))) ( µ_1^2 - 2 µ_1 (z̄_1 ȳ_1(z) + ξ_1 n_1)/(z̄_1 + n_1) + (k_1 ȳ^2 + n_1 ξ_1^2)/(z̄_1 + n_1) ) ],

and similarly for µ_2:

    p(µ_2 | σ_2^2, y, z) ∝ σ_2^{-2} exp[ -(1/(2σ_2^2/(z̄_2 + n_2))) ( µ_2^2 - 2 µ_2 (z̄_2 ȳ_2(z) + ξ_2 n_2)/(z̄_2 + n_2) + (k_2 ȳ^2 + n_2 ξ_2^2)/(z̄_2 + n_2) ) ].

Hence the posterior distribution of µ_j is again normal:

    µ_j | σ_j^2, y, z ∼ N( (n_j ξ_j + z̄_j ȳ_j(z))/(n_j + z̄_j) , σ_j^2/(n_j + z̄_j) ).            (3.6)

Finally, the posterior distribution of π is

    p(π | y, z) = ∫_0^∞ ∫_0^∞ ∫_{-∞}^∞ ∫_{-∞}^∞ p(θ | y, z) dµ_1 dµ_2 dσ_1^2 dσ_2^2 ∝ (1 - π)^{z̄_1 + β - 1} π^{z̄_2 + α - 1},

so the posterior distribution of π is again Beta:

    π | y, z ∼ Be( z̄_2 + α , z̄_1 + β ).            (3.7)

Summarizing the calculation: for the two-component normal mixture model, if

    µ_j | σ_j^2 ∼ N(ξ_j, σ_j^2/n_j),    σ_j^2 ∼ IG(ν_j/2, s_j^2/2),    π ∼ Be(α, β),

then

    µ_j | σ_j^2, y, z ∼ N( (n_j ξ_j + z̄_j ȳ_j(z))/(n_j + z̄_j) , σ_j^2/(n_j + z̄_j) ),

    σ_j^2 | y, z ∼ IG( (ν_j + z̄_j)/2 , (1/2)[ s_j^2 + ∑_{i=1}^N z_{ij} (y_i - ȳ_j(z))^2 + n_j z̄_j (ȳ_j(z) - ξ_j)^2 / (n_j + z̄_j) ] ),

    π | y, z ∼ Be( z̄_2 + α , z̄_1 + β ).

This property also holds for the g-component normal mixture model if we use the g-category Dirichlet prior D(π | α_1, ..., α_g) for the proportion parameters; the Dirichlet distribution is the multi-category generalization of the Beta distribution, see [22] and [9] for details.
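The conjugate forms above make posterior sampling straightforward. The following R sketch (our own illustration; the function name, argument names and default hyperparameters are assumptions, not from the thesis) draws (π, µ_j, σ_j^2) once from (3.5)-(3.7) for a given allocation z, using the fact that σ^2 ∼ IG(a, b) is equivalent to 1/σ^2 ∼ Gamma(a, rate = b).

```r
# One draw from the conjugate posteriors (3.5)-(3.7), given data y and a 0/1
# allocation vector z2 (z2[i] = 1 means y_i is assigned to component 2).
# Hyperparameters: xi = prior means, n0 = prior sample sizes, nu and s2 give
# the IG(nu/2, s2/2) priors, (alpha, beta) the Beta prior; defaults are
# illustrative, and both components are assumed non-empty.
draw_posterior <- function(y, z2, xi = c(0, 0), n0 = c(1, 1),
                           nu = c(1, 1), s2 = c(1, 1), alpha = 1, beta = 1) {
  z    <- cbind(1 - z2, z2)                     # indicator matrix, columns = components
  zbar <- colSums(z)                            # bar z_j
  ybar <- as.vector(crossprod(z, y)) / zbar     # bar y_j(z)
  rate <- 0.5 * (s2 + colSums(z * outer(y, ybar, "-")^2) +
                 n0 * zbar * (ybar - xi)^2 / (n0 + zbar))
  sigma2 <- 1 / rgamma(2, shape = (nu + zbar) / 2, rate = rate)   # (3.5)
  mu <- rnorm(2, (n0 * xi + zbar * ybar) / (n0 + zbar),
                 sqrt(sigma2 / (n0 + zbar)))                      # (3.6)
  p  <- rbeta(1, zbar[2] + alpha, zbar[1] + beta)                 # (3.7)
  list(pi = p, mu = mu, sigma2 = sigma2)
}
```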

3.3 Gibbs sampler

The Gibbs sampler, originating with Geman and Geman (1984), is a useful approach for drawing samples from a joint posterior when the joint distribution is complicated and difficult to handle directly. It is a Markov chain Monte Carlo (MCMC) procedure which samples from the conditional distribution of each parameter given the other parameters and the observed data Z.

In this section we first give the general Gibbs sampler procedure and a discussion of stationarity. Then we apply the Gibbs sampler to the two-component normal mixture model, both with unknown and with known variance parameters.

We start with a review of the basic Bayesian framework for the prior p(θ) and the posterior p(θ | X):

    p(θ | X) = p(θ) ℓ(θ | X) / p(X) = p(θ) ℓ(θ | X) / ∫_Θ p(θ) ℓ(θ | X) dθ,

    E(θ | X) = ∫_Θ θ p(θ | X) dθ.

Since ∫_Θ p(θ) ℓ(θ | X) dθ is an expression for p(X) only, we have

    p(θ | X) ∝ p(θ) × ℓ(θ | X),
    posterior probability ∝ prior probability × likelihood.

The posterior distribution can therefore be viewed as the conditional distribution of an unknown parameter given the other parameters and the observed data, and we generate parameter samples from these posterior (full conditional) distributions.

3.3.1 General Gibbs sampler and its properties

Consider random variables U_1, U_2, ..., U_K. We simulate from the conditional distributions P(U_j | U_1, ..., U_{j-1}, U_{j+1}, ..., U_K), j = 1, 2, ..., K, instead of from the joint distribution. The general Gibbs sampler is as follows (see also [1]):

Algorithm 4. General Gibbs sampler

1. Take some initial values U_k^(0), k = 1, 2, ..., K.

2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate U_k^(t) from

       P(U_k^(t) | U_1^(t), ..., U_{k-1}^(t), U_{k+1}^(t-1), ..., U_K^(t-1)).

3. Continue step 2 until the joint distribution of (U_1^(t), U_2^(t), ..., U_K^(t)) does not change.

If the explicit form of the conditional density P(U_k | U_l, l ≠ k) is available, we can estimate the marginal density of U_k by

    P̂_{U_k}(u) = (1/(M - m + 1)) ∑_{t=m}^M P(u | U_l^(t), l ≠ k).

This estimate follows from the formula

    P(A) = ∫ P(A | B) dP(B).
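As a small self-contained illustration of Algorithm 4 (our own example, not from the thesis), take K = 2 and a standard bivariate normal target with correlation ρ, whose full conditionals are U_1 | U_2 ∼ N(ρU_2, 1 - ρ^2) and symmetrically for U_2.

```r
# Gibbs sampler for a standard bivariate normal with correlation rho.
set.seed(2)
rho <- 0.8; n_iter <- 5000
u <- matrix(0, n_iter, 2)
for (t in 2:n_iter) {
  u[t, 1] <- rnorm(1, rho * u[t - 1, 2], sqrt(1 - rho^2))  # U1 | current U2
  u[t, 2] <- rnorm(1, rho * u[t, 1],     sqrt(1 - rho^2))  # U2 | new U1
}
cor(u[-(1:500), ])[1, 2]   # after burn-in, close to rho = 0.8
```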

Now we discuss the convergence of the Gibbs sampler procedure. The Gibbs sampler produces a Markov chain whose stationary distribution is the true joint distribution. We take data augmentation as an example to explain this stationarity. Data augmentation, introduced by Tanner and Wong (1987), is a simple form of the general Gibbs sampler with K = 2. The algorithm is as follows (see also [3]):

Algorithm 5. Data augmentation

1. Initialization: start with an arbitrary value λ^(0).

2. Iteration t: given λ^(t-1), generate

   (a) θ^(t) according to p_1(θ | x, λ^(t-1)),
   (b) λ^(t) according to p_2(λ | x, θ^(t)).

This data augmentation algorithm has good convergence properties:

Proposition 3.1. If p_1(θ | x, λ) > 0 on Θ (respectively p_2(λ | x, θ) > 0 on Λ), then both sequences (θ^(m)) and (λ^(m)) are ergodic Markov chains with invariant distributions p(θ | x) and p(λ | x).

Proof. See [15].

Proposition 3.2 (Duality Principle). If the convergence is uniformly geometric for one of the two chains, e.g. if it takes values in a finite space, then the convergence to the stationary distribution is also uniformly geometric for the other chain.

Proof. See [15].

Extending these properties to the general Gibbs sampler, the joint distribution P(U_1, U_2, ..., U_K) is stationary at each step, since the P_k's are the full conditionals of P(U_1, U_2, ..., U_K); hence the whole procedure has the correct stationary distribution.

Concerning the convergence of the Gibbs sampler estimates, Chao (1970) gives the following proposition:

Proposition 3.3. If θ̂ is the MLE and θ̃ is the posterior mean from a Bayesian model using the same likelihood but any proper prior (and most improper priors), then

    √n (θ̂ - θ̃) → 0  as  n → ∞

almost surely, for reasonable starting values of θ. For details see [9] and [23].

3.3.2 Gibbs sampler for normal mixture model

Now we apply the Gibbs sampler to the indicator normal mixture model; we need the conditional distributions of all parameters.

As calculated in the previous section, we choose the conjugate prior for the indicator normal mixture model to obtain the posterior distribution. In addition, we treat z as an additional parameter. So we have the following joint distributions:

    z_{i2} | θ ∼ Ber(π)  (with z_{i1} = 1 - z_{i2}),
    y_i | z_i, θ ∼ N( ∏_{j=1}^2 µ_j^{z_{ij}} , ∏_{j=1}^2 (σ_j^2)^{z_{ij}} ).

According to the computation in Section 3.2, we have the following posterior (full conditional) distributions:

For i = 1, ..., N and j = 1, 2,

    z_{i2} | y, θ ∼ Ber( πϕ_{θ_2}(y_i) / [(1 - π)ϕ_{θ_1}(y_i) + πϕ_{θ_2}(y_i)] ),

    µ_j | σ_j^2, y, z ∼ N( (n_j ξ_j + z̄_j ȳ_j(z))/(n_j + z̄_j) , σ_j^2/(n_j + z̄_j) ),

    π | y, z ∼ Be( z̄_2 + α , z̄_1 + β ),

    σ_j^2 | y, z ∼ IG( (ν_j + z̄_j)/2 , (1/2)[ s_j^2 + ∑_{i=1}^N z_{ij} (y_i - ȳ_j(z))^2 + n_j z̄_j (ȳ_j(z) - ξ_j)^2 / (n_j + z̄_j) ] ).            (3.8)

We can now state the Gibbs sampler for the two-component normal mixture model.

Algorithm 6. Gibbs sampler for two-component normal mixtures

1. Take some initial values θ^(0) = (π^(0), µ_1^(0), µ_2^(0), (σ_1^2)^(0), (σ_2^2)^(0)), drawn from the prior distributions (3.2).

2. Repeat for t = 1, 2, ...:

   (a) For i = 1, 2, ..., N, generate z_{i2}^(t) ∈ {0, 1} with

           z_{i2}^(t) ∼ Ber( π^(t)ϕ_{θ_2}(y_i) / [(1 - π^(t))ϕ_{θ_1}(y_i) + π^(t)ϕ_{θ_2}(y_i)] ).

   (b) For j = 1, 2, generate the parameters as follows⁵:

           π^(t+1) ∼ Be( z̄_2^(t) + α , z̄_1^(t) + β ),

           (σ_j^2)^(t+1) ∼ IG( (ν_j + z̄_j^(t))/2 , (1/2)[ s_j^2 + ∑_{i=1}^N z_{ij}^(t) (y_i - ȳ_j(z)^(t))^2 + n_j z̄_j^(t) (ȳ_j(z)^(t) - ξ_j)^2 / (n_j + z̄_j^(t)) ] ),

           µ_j^(t+1) ∼ N( (n_j ξ_j + z̄_j^(t) ȳ_j(z)^(t))/(n_j + z̄_j^(t)) , (σ_j^2)^(t+1)/(n_j + z̄_j^(t)) ).

3. Continue step 2 until the joint distribution of (z^(t), θ^(t)) does not change.

⁵ The priors, and the posteriors given the other parameters, must be proper, otherwise geometric convergence cannot be established. [1] gives an example with the simulation

    σ_j^2 | y, z, µ_j ∼ IG( (ν_j + n_j + 1)/2 , (1/2)[ s_j^2 + ∑_{i=1}^N z_{ij} (y_i - ξ_j)^2 ] ),

for which convergence is much more difficult to establish, since geometric convergence then requires restrictions on σ_j^2. So µ_j should not be involved in the conditional posterior of σ_j^2; for details see [3].
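For concreteness, here is a minimal R sketch of Algorithm 6 applied to the 20 fictitious data points listed in Appendix C.1. The prior hyperparameters below are illustrative assumptions of ours, not the values used in the thesis (those are reported in Section 4.3).

```r
# Gibbs sampler (Algorithm 6) for the fictitious data, with assumed priors.
set.seed(3)
y <- c(-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72, 4.28, 4.92, 5.53,
       0.06, 0.48, 1.01, 1.68, 1.80, 3.25, 4.12, 4.60, 5.28, 6.22)
N <- length(y)
alpha <- 1; beta <- 1                     # Beta prior for pi (assumed)
xi <- c(1, 4); n0 <- c(1, 1)              # normal prior means / prior sample sizes (assumed)
nu <- c(2, 2); s2 <- c(1, 1)              # inverse gamma prior parameters (assumed)
p <- 0.5; mu <- c(1, 4); sig2 <- c(1, 1)  # initial values
n_iter <- 2000
out <- matrix(NA, n_iter, 5,
              dimnames = list(NULL, c("pi", "mu1", "mu2", "sigma1sq", "sigma2sq")))
for (t in 1:n_iter) {
  # step 2(a): sample the component indicators z_i2
  g  <- p * dnorm(y, mu[2], sqrt(sig2[2])) /
        ((1 - p) * dnorm(y, mu[1], sqrt(sig2[1])) + p * dnorm(y, mu[2], sqrt(sig2[2])))
  z2 <- rbinom(N, 1, g)
  z  <- cbind(1 - z2, z2)
  zbar <- pmax(colSums(z), 1)             # guard against an empty component
  ybar <- as.vector(crossprod(z, y)) / zbar
  # step 2(b): sample pi, sigma_j^2 and mu_j from (3.8)
  p    <- rbeta(1, zbar[2] + alpha, zbar[1] + beta)
  rate <- 0.5 * (s2 + colSums(z * outer(y, ybar, "-")^2) +
                 n0 * zbar * (ybar - xi)^2 / (n0 + zbar))
  sig2 <- 1 / rgamma(2, shape = (nu + zbar) / 2, rate = rate)
  mu   <- rnorm(2, (n0 * xi + zbar * ybar) / (n0 + zbar), sqrt(sig2 / (n0 + zbar)))
  out[t, ] <- c(p, mu, sig2)
}
colMeans(out[-(1:500), ])                 # posterior means after burn-in
```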

3.3.3 Gibbs sampler for normal mixture with known σ_j^2 and π

Finally we give a specific application to the normal mixture model with the variances and the mixture proportion known.

Algorithm 7. Gibbs sampler for two-component normal mixtures with known variances and mixture proportion

1. Take some initial values θ^(0) = (µ_1^(0), µ_2^(0)).

2. Repeat for t = 1, 2, ...:

   (a) For i = 1, 2, ..., N, generate ∆_i^(t) ∈ {0, 1} with Pr(∆_i^(t) = 1) = γ_i(θ^(t)), by equation (2.6).

   (b) Set

           µ̂_1 = ∑_{i=1}^N (1 - ∆_i^(t)) y_i / ∑_{i=1}^N (1 - ∆_i^(t)),    µ̂_2 = ∑_{i=1}^N ∆_i^(t) y_i / ∑_{i=1}^N ∆_i^(t),            (3.9)

       and generate µ_1^(t+1) ∼ N(µ̂_1, σ_1^2) and µ_2^(t+1) ∼ N(µ̂_2, σ_2^2).

3. Continue step 2 until the joint distribution of (∆^(t), µ_1^(t), µ_2^(t)) does not change.

Compare this Gibbs sampler with the corresponding quantities in Algorithm 3:

    µ_1 = ∑_{i=1}^N (1 - γ_i) y_i / ∑_{i=1}^N (1 - γ_i),    µ_2 = ∑_{i=1}^N γ_i y_i / ∑_{i=1}^N γ_i.

Notice that the iterative means µ̂_j in (3.9) have a similar form to the plug-in estimators of the EM-Algorithm, but the two approaches differ essentially in both parameters and procedure.

In the Gibbs sampler we treat the latent data ∆ as an additional parameter. Compared with the "E" step of the EM-Algorithm, the Gibbs sampler uses latent data ∆_i generated from the distributions P(∆_i | θ, Z) instead of the responsibilities γ_i(θ) = E(∆_i | θ, Z). Compared with the "M" step of the EM-Algorithm, the Gibbs sampler generates the iterates from the conditional distribution P(µ_1, µ_2 | ∆, Z) instead of maximizing the posterior P(µ_1, µ_2, ∆ | Z).

4 Application

In the last part of Section 3.3.3 we compared the EM-Algorithm and the Gibbs sampler procedurally through a simplified normal mixture model. We now make a further, practical comparison of the parameter estimates obtained from the two algorithm procedures.

4.1 Simulation results

We run both algorithm procedures on a simplified two-component normal mixture model with the variances σ_1^2, σ_2^2 and the mixture proportion π known. We focus on the estimation of the unknown parameters µ_1 and µ_2.

4.1.1 Simulation results for the fictitious data set

Figure 4 shows the density plot of the fictitious data set, and it indicates a two-component normal mixture model.

In Section 2.2.4 we mentioned the common way of choosing initial guesses for the unknown parameters. In fact, different initial guesses lead to different iterative estimates, and those achieving the highest maximized likelihood are the best. Hastie, Tibshirani and Friedman ran the EM-Algorithm on the 20 fictitious data points and obtained a best group of estimates in [1]. We therefore use the best estimates given in [1] as the values of the known parameters σ_1^2, σ_2^2 and π, and also as the initial guesses for µ_1 and µ_2. Thus σ_1^2 = 0.87, σ_2^2 = 0.77 and π = 0.546 (see [1], pp. 237-240).

Running Algorithm 3 and Algorithm 7 in R with initial guesses µ_1^(0) = 4.62 and µ_2^(0) = 1.06, we obtain the simulation results for the fictitious data set. From Figure 5 and Figure 6 we see that the two algorithms give similar estimates: the EM-Algorithm estimator, the mean of the Gibbs sampler draws and the Gibbs sampler draw with the highest density are very close in Figure 6. But the EM-Algorithm is much more stable and faster for this data sample; from Figure 5 it converges in fewer than 10 steps, while the Gibbs sampler is still fluctuating widely after 200 iterations.

4.1.2 Simulation results for the astronomy data set

Now consider the real astronomical data. We use a data set of 81 observations giving the number of photons over wavelength in nm (we call it the photon data). From the density plot (Figure 4) we treat the photon data as following a two-component normal mixture model. Using the common way of choosing the initial parameter values and running the EM-Algorithm and the Gibbs sampler, we get the estimation results for the photon data (Figure 8 and Figure 9).

Both algorithms again give similar estimates for the real photon data, and the EM-Algorithm still appears faster and more stable. But we notice that the density plot of the Gibbs sampler estimators has more than one local maximum (Figure 9), which suggests there is some information outside our model. Returning to the description of the data set, a noise component at each point was mentioned. Treating this kind of noise requires more astronomical background, which is beyond this thesis, so we simply use the original photon data.

Here we can see the advantage of the Gibbs sampler: it carries more information. The EM-Algorithm only gives the stable convergent estimate, while the Gibbs sampler shows more information about the data set itself besides the estimate, which is important for statistical analysis, especially with large data sets.

The estimators generated by the Gibbs sampler should follow a normal distribution, but the densities in Figure 9 are obviously not normal. To check whether this is a problem of the data set and the model or of the Gibbs sampler itself, we generate data from a two-component normal mixture model with parameters

    π = 0.5, µ_1 = 18, σ_1^2 = 0.8, µ_2 = 25, σ_2^2 = 4.

We then treat the mean parameters as unknown and generate the mean parameter estimators by the Gibbs sampler; the results are shown in Figure 11. From Figure 11 we see that the parameter estimators generated by the Gibbs sampler for the generated two-component normal mixture data are clearly normally distributed, and the estimates are close to the original parameter values. So the Gibbs sampler itself is probably not responsible for the poor results on the photon data.

Returning to Figure 4: we assumed that the photon data follow a two-component normal mixture model based on the shape of the density plot, but they may also follow a finite normal mixture with more than two components. Figure 12 shows density plots of a three-component and a four-component normal mixture model with parameters

    π_1 = 0.2, µ_1 = 18, σ_1^2 = 0.8;  π_2 = 0.3, µ_2 = 25, σ_2^2 = 3.4;  π_3 = 0.5, µ_3 = 60, σ_3^2 = 11.2;

and

    π_1 = 0.1, µ_1 = 18, σ_1^2 = 0.8;  π_2 = 0.2, µ_2 = 25, σ_2^2 = 3.4;  π_3 = 0.3, µ_3 = 40, σ_3^2 = 11.2;  π_4 = 0.4, µ_4 = 110, σ_4^2 = 34.

Both density plots are similar to the density plot of the photon data, and it is difficult to judge the exact number of components for the photon data. So the poor results of the Gibbs sampler on the photon data are probably caused by the data and the model, not by the simulation method, and the Gibbs sampler may well be a good simulation method under a two-component normal mixture model. The case of finite normal mixtures with more components is worth further study.

4.2 Confidence interval

We first give the definition of a Bayesian credible interval.

Definition 3 (Credible set; see also [9]). Define C as a subset of the parameter space Θ such that a 100(1 - α)% credible interval meets the condition

    1 - α = ∫_C p(θ | X) dθ.

Denote by T the estimate of the parameter θ and by a_p the p-quantile of T - θ. Then

    P(T - θ ≤ a_α) = P(T - θ ≥ a_{1-α}) = α.

Suppose T is continuous and we want an equal-tail interval with tail errors equal to α. The limits of the 1 - 2α equal-tail interval are

    θ̂_α = t - a_{1-α},    θ̂_{1-α} = t - a_α.

Using a t-test we obtain two-sided confidence intervals for both data sets. The 95% confidence intervals for µ_1 and µ_2 under the fictitious data are (3.819986, 4.193414) and (1.555847, 1.922822). The 95% confidence intervals for µ_1 and µ_2 under the photon data are (189317.6, 191720.1) and (-380.9429, -356.9297). The box plots show similar results; see Figure 7 and Figure 10.
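An equal-tail interval can also be read directly off the Gibbs sampler output using empirical quantiles; a minimal sketch of ours (with placeholder draws so that it runs stand-alone) is:

```r
# Equal-tail 95% interval for mu1 from posterior draws.
mu1_draws <- rnorm(2000, 4.6, 0.2)             # placeholder draws for illustration;
                                               # in practice use the post-burn-in
                                               # "mu1" column of the Gibbs output
quantile(mu1_draws, probs = c(0.025, 0.975))   # limits of the equal-tail interval
```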

4.3 Discussion

For the general Gibbs sampler, the iterative samples are generated from the posterior distribution. One important point is the choice of the parameters of the prior distribution. Since we already have some initial guesses for the unknown parameter vector θ, the prior parameters need to be chosen carefully so that the initial guesses fit the prior distribution. Many values of the prior parameters were tested, and the final choices for our model are:

    α = 2,  β = 2.48,
    ν_1/2 = 5.46,  s_1^2/2 = 1.032 × 10^10,
    ν_2/2 = 5.46,  s_2^2/2 = 10^6,
    ξ_1 = 191942.2,  σ_1^2 = 2847896799,  n_1 = 5 × 10^8,
    ξ_2 = -379.9798,  σ_2^2 = 206657.3,  n_2 = 42000.

The selection of the prior parameters is important but complicated work, and further study of more efficient ways to choose proper prior parameters would be attractive.

The EM-Algorithm and the Gibbs sampler are both good simulation methods for parameter estimation, and they usually give similar results. The EM-Algorithm needs no prior information and is more stable. The Gibbs sampler is more complicated computationally but also carries more information, which is better for further statistical analysis. Summarizing the simulation results, we suggest using the EM-Algorithm for small data sets, and the Gibbs sampler when more data and information need to be dealt with.

A Appendix: Astronomic background of the data set

Planetary nebulae (PNe) have a strong emission-line spectrum, which can be used for studying their expansion properties. When a PN expands, the Doppler shift causes a broadening of the emission line.

Figure 2: Picture of Planetary Nebulae (PNe)

Two different velocities v1 and v2 (Figure 3) are obtained, which result in two symmetric peaks, and the Doppler shift can be assumed to cause a mixture of two normal distributions.

The spectrograph detects the number of photons in different wavelength bins. The resulting spectrum (number of photons over wavelength in nm) shows a line broadening or, in some cases, a double line profile. It is possible to translate the wavelength into radial velocity (RV). So the number of photons can be used to fit the two-component normal mixture model. The number of photons in the "StrWr2 Hr8 O2 sigmaright" data set is used for the comparison in this thesis.

Figure 3: Picture of Velocities


B Appendix: Programme results

B.1 Simulation results

Figure 4: Density plots for both data sets (fictitious data; photon data, number of photons)

Figure 5: Mean value estimators for the Fictitious data (EM iterations and Gibbs sampler simulation for mu1 and mu2)

Figure 6: Density plots from Gibbs sampler for the Fictitious data (densities of mu1 and mu2, with the Gibbs sampler mean and the EM estimator marked)

Figure 7: Box plot from Gibbs sampler for the Fictitious data

Figure 8: Mean value estimators for the Photon data (EM iterations and Gibbs sampler simulation for mu1 and mu2)

Figure 9: Density plots from Gibbs sampler for the Photon data (densities of mu1 and mu2, with the Gibbs sampler mean and the EM estimator marked)

Figure 10: Box plots from Gibbs sampler for the Photon data

Figure 11: Gibbs sampler for the generated two-component normal mixture (density of the generated data and densities of mu1 and mu2)

Figure 12: Density plots for the generated three-component and four-component normal mixture models

C Appendix: R programme code

C.1 R code for Algorithm 3 with σ_1^2, σ_2^2 and π known

```r
data <- c(-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72, 4.28, 4.92, 5.53,
           0.06, 0.48, 1.01, 1.68, 1.80, 3.25, 4.12, 4.60, 5.28, 6.22)
pi      <- 0.546
sigmas1 <- 0.87
sigmas2 <- 0.77
mu1 <- numeric(0)
mu2 <- numeric(0)
r   <- numeric(0)
R1  <- matrix(0, 20, 100)
mu1[1] <- 4.62
mu2[1] <- 1.06
for (j in 1:100) {
  # E-step: responsibilities gamma_i
  for (i in 1:20) {
    r[i] <- pi * dnorm(data[i], mu2[j], sigmas2^(1/2)) /
            ((1 - pi) * dnorm(data[i], mu1[j], sigmas1^(1/2)) +
             pi * dnorm(data[i], mu2[j], sigmas2^(1/2)))
    R1[i, j] <- r[i]
  }
  # M-step: weighted means (variances and pi are known)
  mu1[j + 1] <- sum((1 - r) * data) / sum(1 - r)
  mu2[j + 1] <- sum(r * data) / sum(r)
}
Muu1 <- mu1[j + 1]
Muu2 <- mu2[j + 1]
Muu1
Muu2
x11()
layout(matrix(c(1, 2)))
plot(mu1, type = "l", main = "", xlab = "EM Iteration for the Fictitious Data")
plot(mu2, type = "l", main = "", xlab = "EM Iteration for the Fictitious Data")
```

C.2 R code for Algorithm 7

```r
# Uses data, pi, sigmas1, sigmas2, mu1, mu2 as initialised in C.1.
delta <- numeric(0)
r     <- numeric(0)
R2    <- matrix(0, 20, 200)
for (j in 1:200) {
  for (i in 1:20) {
    # responsibility gamma_i(theta^(t)), equation (2.6)
    r[i] <- pi * dnorm(data[i], mu2[j], sigmas2^(1/2)) /
            ((1 - pi) * dnorm(data[i], mu1[j], sigmas1^(1/2)) +
             pi * dnorm(data[i], mu2[j], sigmas2^(1/2)))
    # draw the latent indicator Delta_i
    delta[i] <- rbinom(1, 1, r[i])
    R2[i, j] <- r[i]
  }
  # weighted means (3.9) and new draws of mu1, mu2
  muhat1 <- sum((1 - delta) * data) / sum(1 - delta)
  muhat2 <- sum(delta * data) / sum(delta)
  mu1[j + 1] <- rnorm(1, muhat1, 0.87^(1/2))
  mu2[j + 1] <- rnorm(1, muhat2, 0.77^(1/2))
}
MU1 <- mu1[j + 1]
MU2 <- mu2[j + 1]
MU1
MU2
x11()
layout(matrix(c(1, 2)))
plot(mu1, type = "l", main = "", xlab = "Gibbs Sampler Simulation for the Fictitious Data")
plot(mu2, type = "l", main = "", xlab = "Gibbs Sampler Simulation for the Fictitious Data")
```

C.3 R code for plots

```r
x11()
plot(density(mu1), type = "l", main = "", ylab = "Density of mu1",
     xlab = "Gibbs Sampler Simulation for the Fictitious Data")
points(mean(mu1), 0, pch = 19)
abline(v = mean(mu1), col = "gray60")
points(4.637871, 0, pch = 21)
abline(v = 4.637871, col = "gray60")
legend(locator(n = 1), legend = c("Gibbs Sampler Mean", "EM Estimator"), pch = c(19, 21))

t.test(mu1, conf.level = 0.95)
t.test(mu2, conf.level = 0.95)

x11()
boxplot(mu1, mu2, names = c("mu1", "mu2"),
        main = "Gibbs Sampler Simulation for the Fictitious Data")
```

References

[1] Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, ISBN 0-387-95284-5

[2] Liero, H., Zwanzig, S. (2008). Script for "Computer Intensive Methods in Statistics — a short course with R", home page of Silvelyn Zwanzig, Uppsala University, http://www.math.uu.se/~zwanzig/index.html

[3] Robert, Christian P. (2001). The Bayesian Choice, 2nd edition, Springer, ISBN 0-387-95231-4

[4] Marin, J.-M., Robert, Christian P. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics, Springer, ISBN 978-0-387-38979-0

[5] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife, The Annals of Statistics, Vol. 7, No. 1, pp. 1-26

[6] Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and their Application, Cambridge University Press, ISBN 978-0-521-57391-7

[7] Efron, B., Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman & Hall, ISBN 0-412-04231-2

[8] Woodward, M. (1999). Epidemiology: Study Design and Data Analysis, Chapman & Hall/CRC, ISBN 1-58488-009-0

[9] Gill, J. (2008). Bayesian Methods: A Social and Behavioral Sciences Approach, 2nd edition, Chapman & Hall/CRC, ISBN 1-58488-562-9

[10] McLachlan, G.J., Krishnan, T. (1997). The EM Algorithm and Extensions, John Wiley & Sons, ISBN 0-471-12358-7

[11] Ghosh, J.K., Ramamoorthi, R.V. (2003). Bayesian Nonparametrics, Springer, ISBN 0-387-95537-2

[12] Lange, K. (1998). Numerical Analysis for Statisticians, Springer, ISBN 0-387-94979-8

[13] Titterington, D.M. (1985). Statistical Analysis of Finite Mixture Distributions, John Wiley & Sons, ISBN 0-471-90763-4

[14] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, pp. 1-38

[15] Diebolt, J., Robert, C.P. (1994). Estimation of Finite Mixture Distributions through Bayesian Sampling, Journal of the Royal Statistical Society, Series B, Vol. 56, No. 2, pp. 363-375

[16] Verzani, J. (2004). Using R for Introductory Statistics, Chapman & Hall/CRC, ISBN 1-58488-450-9

[17] Albert, J. (2007). Bayesian Computation with R, Springer, ISBN 978-0-387-71384-7

[18] Givens, G.H., Hoeting, J.A. (2005). Computational Statistics, John Wiley & Sons, ISBN 0-471-46124-5

[19] Garthwaite, P.H., Jolliffe, I.T., Jones, B. (2002). Statistical Inference, 2nd edition, Oxford University Press, ISBN 0-19-857226-3

[20] Figueiredo, M.A.T. (2004). Lecture Notes on the EM Algorithm, Instituto de Telecomunicacoes, Instituto Superior Tecnico, 1049-001 Lisboa, Portugal, [email protected]

[21] Dellaert, F., Hoeting, J.A. (2002). The Expectation Maximization Algorithm, College of Computing, Georgia Institute of Technology, Technical Report GIT-GVU-02-20

[22] Marin, J.-M., Mengersen, K., Robert, C.P. (2005). Bayesian Modelling and Inference on Mixtures of Distributions, in Handbook of Statistics (D. Dey & C.R. Rao, eds.), Vol. 25, Elsevier, ISBN 978-0-444-51539-1

[23] Chao, M.T. (1970). The Asymptotic Behavior of Bayes' Estimators, Annals of Mathematical Statistics, Vol. 41, pp. 601-608

[24] McLachlan, G.J., Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd edition, John Wiley & Sons, ISBN 978-0-471-20170-0

[25] Ishwaran, H., James, L.F. (2002). Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information, Journal of Computational and Graphical Statistics, Vol. 11, No. 3, pp. 1-26

[26] Wu, C.F.J. (1983). On the Convergence Properties of the EM Algorithm, Annals of Statistics, Vol. 11, pp. 95-103