FACULTY OF ENGINEERING AND ARCHITECTURE
Mathematical Techniques in Engineering Science
Module Statistics
Lecture 7+8
Estimation of parameters: Fisher estimation, Bayesian estimation
Stijn De Vuyst
30 November 2016
Statistics Lecture 7+8
Fisher estimation
    Likelihood function
    Score function
    Fisher information
    MSE: bias and variance
    Unbiased estimators: Cramér-Rao Lower Bound
    Biased estimators
    Sufficient statistics
    Rao-Blackwellisation
    Maximum-likelihood estimator
    The EM algorithm
Example: censored data
Bayesian estimation
Estimation of parameters: two approaches

[diagram: population X with unknown parameter θ → sample x → estimate θ̂]

Classical framework
Developed in the 1920s and 1930s by Ronald Fisher, Karl Pearson, Jerzy Neyman, ...; later also C. R. Rao, H. Cramér, Egon Pearson, D. Blackwell, ...
- θ is unknown, but deterministic
- θ ∈ S, the parameter space

Bayesian framework
18th-century concepts by Thomas Bayes and Pierre-Simon Laplace; huge following after the 1950s due to the availability of computer-intensive methods
- θ is an unknown realisation of a random variable Θ
- Θ ∈ S
Classical setting: Fisher estimation

[diagram: population X (system, process, ...) with parameter θ → data/observations/sample x → estimate θ̂; θ is a scalar here, but could also be a vector θ in some parameter space S]

The sample
n independent members taken from the population (n is the sample size)
- X = (X₁, X₂, ..., Xₙ) before observation
- x = (x₁, x₂, ..., xₙ) after observation
- x ∈ Ω, with Ω = ℝⁿ for real populations, Ω = {0, 1}ⁿ for Bernoulli populations, ...

The 'model': likelihood function
- p(x; θ) = Prob[observe X = x if the true parameter is θ]
- p(x; θ) is called the likelihood function, ln p(x; θ) the log-likelihood
- p(x; θ) can be either a density (X continuous) or a mass function (X discrete)
Example: likelihood function for a Bernoulli population
Assume a Bernoulli population: X ~ Bern(θ), i.e. X = 1 with probability θ and X = 0 otherwise.
- The observed sample (n = 6) is x = (0, 0, 1, 0, 1, 0)
- Likelihood: p(x; θ) = ∏ᵢ₌₁⁶ p(xᵢ; θ) = (1 − θ)⁴θ², for θ ∈ S = [0, 1]
[figure: the likelihood p(x; θ) = (1 − θ)⁴θ² as a function of θ ∈ [0, 1], maximal at θ̂_ML = 1/3]

Maximum-likelihood estimate for the parameter θ:

$$ \hat\theta_{\mathrm{ML}} = \operatorname*{arg\,max}_{\theta} p(\mathbf{x};\theta) = \frac{\text{count of 1s in the data}}{n} = \frac{c}{n} = \frac{2}{6} = \frac{1}{3} $$
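As a quick numerical check, here is a minimal sketch in plain NumPy (the variable names are chosen here, not from the slides): it evaluates the likelihood on a grid over S = [0, 1] and confirms that the grid maximiser agrees with the closed-form θ̂_ML = c/n.

```python
import numpy as np

x = np.array([0, 0, 1, 0, 1, 0])      # the observed sample from the slide (n = 6)
thetas = np.linspace(0.0, 1.0, 1001)  # grid over the parameter space S = [0, 1]

# likelihood p(x; theta) = theta^(count of 1s) * (1 - theta)^(count of 0s)
likelihood = thetas**x.sum() * (1.0 - thetas)**(len(x) - x.sum())

print(thetas[np.argmax(likelihood)])  # grid maximiser, ~0.333
print(x.mean())                       # closed form c/n = 2/6 = 1/3
```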
Score function
The score function of the model is

$$ S(\theta,\mathbf{x}) = \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = \frac{\partial p(\mathbf{x};\theta)/\partial\theta}{p(\mathbf{x};\theta)} $$

- S(θ, x) indicates the relative change in likelihood
- it indicates the sensitivity of the log-likelihood to its parameter θ

Expected value and variance of the score
If X is not yet observed, the score S(θ, X) at θ is a random variable. What are its mean and variance?
- The expected score is 0:

$$ E[S(\theta,\mathbf{X})] = \int_\Omega \Big(\frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)\Big)\,p(\mathbf{x};\theta)\,d\mathbf{x} = \int_\Omega \frac{\partial p(\mathbf{x};\theta)/\partial\theta}{p(\mathbf{x};\theta)}\,p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \frac{\partial}{\partial\theta}\int_\Omega p(\mathbf{x};\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta}\,1 = 0 $$

- The variance of the score is called the Fisher information J(θ):

$$ \mathrm{Var}[S(\theta,\mathbf{X})] = E[S^2(\theta,\mathbf{X})] = E\Big[\Big(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\Big)^2\Big] \triangleq J(\theta) $$
Fisher information
- J(θ) is the variance of the score function S(θ, X), averaged over all possible samples X in Ω
- J(θ) is a measure of how much you can expect to learn about the parameter θ from the sample X

Property

$$ J(\theta) = E\Big[\Big(\frac{\partial}{\partial\theta}\ln p(\mathbf{X};\theta)\Big)^2\Big] = -\,E\Big[\frac{\partial^2}{\partial\theta^2}\ln p(\mathbf{X};\theta)\Big] $$

Proof: The first equality is the definition of the Fisher information. The second follows from E[S(θ, X)] = 0, ∀θ, which means that also

$$ 0 = \frac{\partial}{\partial\theta}E[S(\theta,\mathbf{X})] = \frac{\partial}{\partial\theta}\int_\Omega \Big(\frac{\partial}{\partial\theta}\ln p\Big)p\,d\mathbf{x} \overset{\text{REG}}{=} \int_\Omega \Big[\Big(\frac{\partial^2}{\partial\theta^2}\ln p\Big)p + \Big(\frac{\partial}{\partial\theta}\ln p\Big)\frac{\partial p}{\partial\theta}\Big]d\mathbf{x} $$
$$ = \int_\Omega \Big(\frac{\partial^2}{\partial\theta^2}\ln p\Big)p\,d\mathbf{x} + \int_\Omega \Big(\frac{\partial}{\partial\theta}\ln p\Big)^2 p\,d\mathbf{x} = E\Big[\frac{\partial^2}{\partial\theta^2}\ln p\Big] + E\Big[\Big(\frac{\partial}{\partial\theta}\ln p\Big)^2\Big] \qquad\text{QED} $$

(!) Note that we assume sufficient 'regularity' (REG) of the likelihood function p(x; θ), so that differentiation over θ and integration over x can be interchanged.
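To make E[S] = 0 and Var[S] = J(θ) concrete, here is a hedged Monte Carlo sketch (my own construction, not from the slides) for a single Bernoulli observation, whose score is x/θ − (1 − x)/(1 − θ) and whose Fisher information works out to 1/(θ(1 − θ)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3

# score of one Bernoulli observation: d/dθ ln p(x;θ) = x/θ - (1-x)/(1-θ)
x = rng.binomial(1, theta, size=200_000)
score = x / theta - (1 - x) / (1 - theta)

print(score.mean())                  # ≈ 0, the expected score
print(score.var())                   # ≈ J(θ) = 1/(θ(1-θ)) ≈ 4.76
print(1 / (theta * (1 - theta)))
```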
Estimators for a parameter θ

Definition
An estimator θ̂ is a statistic Ω → S : x ↦ θ̂(x) (not depending on any unknown parameters!) giving values that are hopefully 'close' to the true θ.

(!) After observation, θ̂(x) is a deterministic number; before observation, θ̂(X) is a random variable.
→ θ̂ is a shorthand notation for either, depending on the context.

MEAN → bias
- E[θ̂ − θ] = E[θ̂(X)] − θ is the bias.
  If the bias is 0 for all θ ∈ S → the estimator is 'unbiased';
  if the estimator is not unbiased → the estimator is biased.
Estimators for a parameter θ

VARIANCE → Mean Square Error
- The variance of the estimator θ̂ is the expected square deviation from E[θ̂]:

$$ \mathrm{Var}[\hat\theta] = E\Big[\big(\hat\theta(\mathbf{X}) - E[\hat\theta(\mathbf{X})]\big)^2\Big] = E\Big[\big((\hat\theta-\theta) - (E[\hat\theta]-\theta)\big)^2\Big] $$
$$ = E[(\hat\theta-\theta)^2] - 2\,(E[\hat\theta]-\theta)\,E[\hat\theta-\theta] + (E[\hat\theta]-\theta)^2 = \underbrace{E[(\hat\theta-\theta)^2]}_{\text{MSE}} - \big(\underbrace{E[\hat\theta]-\theta}_{\text{bias}}\big)^2 $$

The Mean Square Error (MSE) is the expected square deviation from the true θ.

⟹ MSE(θ̂) = bias² + Var[θ̂]
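This decomposition is easy to verify by simulation. A small sketch (the estimator and the numbers are my own illustration): a deliberately shrunk sample mean of a normal population, whose empirical MSE matches bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 10, 100_000

samples = rng.normal(theta, 1.0, size=(reps, n))
est = 0.9 * samples.mean(axis=1)   # a deliberately biased estimator: 0.9 * sample mean

mse = ((est - theta)**2).mean()
bias = est.mean() - theta
print(mse, bias**2 + est.var())    # the two agree up to Monte Carlo noise
```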
Minimum-Variance Unbiased estimator (MVU)
If θ̂ is unbiased and has lower variance than any other unbiased estimator, for all θ ∈ S
→ the estimator is 'MVU'
Estimators for a parameter θ
Often, the asymptotic distribution of an estimator is of interest
→ how does θ̂(X) behave when the sample size n becomes very large?

An estimator θ̂ₙ = θ̂(X₁, ..., Xₙ) of θ is consistent if and only if θ̂ₙ converges to θ ('in probability') for n → ∞, ∀θ ∈ S, i.e.

$$ \lim_{n\to\infty}\mathrm{Prob}\big[|\hat\theta_n-\theta|>\varepsilon\big] = 0,\ \forall\varepsilon>0, \qquad\text{or}\qquad \operatorname*{plim}_{n\to\infty}\hat\theta_n = \theta,\ \forall\theta\in S $$

Consistency vs. bias, examples:
- θ̂ₙ = X̄: unbiased and consistent
- θ̂ₙ = (X₁ + X₂ + X₃)/3, for n > 3: unbiased but not consistent
- θ̂ₙ = −1/n + (1/n)∑ᵢ₌₁ⁿ Xᵢ: biased but consistent
- θ̂ₙ = a ≠ θ: biased and not consistent

[figure: sampling distributions of these four estimators around θ for n = 1, 2, 3, 5, 10, 50]
Unbiased estimators: Cramér-Rao Lower Bound (CRLB)
There may be many plausible estimators θ̂ for θ.
? Which is the 'best'?
Several criteria for a suitable estimator are possible, but suppose we aim for an MVU estimator (unbiased and with minimal MSE).

Lower bound for the MSE of unbiased estimators
Given the model p(x; θ), there is a lower bound on the MSE that any unbiased estimator θ̂ can possibly achieve:

$$ \mathrm{MSE}(\hat\theta(\mathbf{X})) \;\geq\; \frac{1}{J(\theta)} \qquad\longrightarrow\quad \text{the 'Cramér-Rao Lower Bound' (CRLB)} $$

If θ̂ reaches this bound, MSE(θ̂(X)) = 1/J(θ) → the estimator is 'efficient'.
- the CRLB is the inverse of the Fisher information
- having a lot of information in the sample about the true θ (high J(θ)) allows for estimators with very low variance
- efficient ⇒ MVU, but MVU ⇏ efficient, because the CRLB cannot always be reached by MVU estimators
Cramér-Rao Lower Bound (CRLB): proof
- The Cauchy-Schwarz inequality, best known in Euclidean vector spaces ℝⁿ:
  u = (u₁, ..., uₙ) ∈ ℝⁿ is an n-dimensional vector,
  ||u|| = √(u₁² + ... + uₙ²) is the Euclidean length of u,
  and the inner (dot) product is u · v = ||u|| ||v|| cos α = u₁v₁ + ... + uₙvₙ, with cos α ∈ [−1, 1].

  Cauchy-Schwarz: (u · v)² ≤ ||u||² ||v||², with equality iff u = kv,

  or: (∑ᵢ uᵢvᵢ)² ≤ (∑ᵢ uᵢ²)(∑ᵢ vᵢ²), with equality iff uᵢ = kvᵢ, ∀i

- If n → ∞, ℝⁿ becomes a Hilbert space or 'function space':

$$ \Big(\int u(\mathbf{x})\,v(\mathbf{x})\,d\mathbf{x}\Big)^2 \leq \Big(\int u(\mathbf{x})^2\,d\mathbf{x}\Big)\Big(\int v(\mathbf{x})^2\,d\mathbf{x}\Big), \quad\text{equality iff } u(\mathbf{x}) = k\,v(\mathbf{x}),\ \forall\mathbf{x} $$
Cramér-Rao Lower Bound (CRLB): proof
θ̂(x) is an unbiased estimator for θ, so E[θ̂(X) − θ] = 0

$$ \Rightarrow\ 0 = \frac{\partial}{\partial\theta}E[\hat\theta(\mathbf{X})-\theta] = \frac{\partial}{\partial\theta}\int \big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\,d\mathbf{x} \overset{\text{REG}}{=} \int \frac{\partial}{\partial\theta}\Big(\big(\hat\theta(\mathbf{x})-\theta\big)\,p(\mathbf{x};\theta)\Big)d\mathbf{x} $$
$$ = \underbrace{\int (0-1)\,p(\mathbf{x};\theta)\,d\mathbf{x}}_{-1} + \int \big(\hat\theta(\mathbf{x})-\theta\big)\,\underbrace{\frac{\partial}{\partial\theta}p(\mathbf{x};\theta)}_{p(\mathbf{x};\theta)\,S(\theta,\mathbf{x})}\,d\mathbf{x} $$
$$ \Rightarrow\ 1 = \int_\Omega \underbrace{\big(\hat\theta(\mathbf{x})-\theta\big)\sqrt{p(\mathbf{x};\theta)}}_{u(\mathbf{x})}\ \underbrace{\sqrt{p(\mathbf{x};\theta)}\,S(\theta,\mathbf{x})}_{v(\mathbf{x})}\,d\mathbf{x} = \int_\Omega u(\mathbf{x})\,v(\mathbf{x})\,d\mathbf{x} $$

In particular, for these two functions:

$$ \int u(\mathbf{x})^2\,d\mathbf{x} = \int \big(\hat\theta(\mathbf{x})-\theta\big)^2 p(\mathbf{x};\theta)\,d\mathbf{x} = E[(\hat\theta-\theta)^2] = \mathrm{MSE}(\hat\theta) $$
$$ \int v(\mathbf{x})^2\,d\mathbf{x} = \int S^2(\theta,\mathbf{x})\,p(\mathbf{x};\theta)\,d\mathbf{x} = E[S^2(\theta,\mathbf{X})] = J(\theta) $$
Cramér-Rao Lower Bound (CRLB): proof
So, by the Cauchy-Schwarz inequality in Hilbert space:

$$ \underbrace{\Big(\int u(\mathbf{x})\,v(\mathbf{x})\,d\mathbf{x}\Big)^2}_{1} \;\leq\; \underbrace{\int u(\mathbf{x})^2\,d\mathbf{x}}_{\mathrm{MSE}(\hat\theta)} \cdot \underbrace{\int v(\mathbf{x})^2\,d\mathbf{x}}_{J(\theta)} $$

which proves the theorem: MSE(θ̂) = Var[θ̂] ≥ 1/J(θ). QED

Efficient form
The bound becomes an equality if (and only if) u(x) = k v(x), i.e. iff

$$ S(\theta,\mathbf{x}) = k(\theta)\,\big(\hat\theta(\mathbf{x})-\theta\big) \qquad\text{'efficient form'} $$

If the score function can be written as k(θ)(θ̂ − θ) for all θ ∈ S
→ the estimator θ̂ is 'efficient'.
Example: estimate the variance of a normal population
Assume a zero-mean normal population: X ~ N(0, σ²).
? How to estimate σ² (= θ), given only the data x = (x₁, ..., xₙ)?

- Likelihood:
$$ p(\mathbf{x};\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\theta}}\exp\Big(-\frac{x_i^2}{2\theta}\Big) $$
- Log-likelihood:
$$ \ln p(\mathbf{x};\theta) = -n\ln\sqrt{2\pi} - \frac{n}{2}\ln\theta - \frac{1}{2}\sum_{i=1}^{n}\frac{x_i^2}{\theta} $$
- Score:
$$ S(\theta,\mathbf{x}) = \frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta) = -\frac{n}{2\theta} + \frac{1}{2}\sum_{i=1}^{n}\frac{x_i^2}{\theta^2} = \underbrace{\frac{n}{2\theta^2}}_{k(\theta)}\Big(\underbrace{\frac{1}{n}\sum_{i=1}^{n}x_i^2}_{\hat\theta(\mathbf{x})} - \theta\Big) $$

The score function can be written in efficient form!
So θ̂(x) = (1/n)∑ᵢ₌₁ⁿ xᵢ² is an unbiased and efficient estimator for θ = σ².
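A Monte Carlo sketch (the values σ² = 4 and n = 20 are chosen here purely for illustration) checking that this estimator is unbiased and that its MSE attains the CRLB 1/J(θ) = 2θ²/n, with J(θ) = k(θ) = n/(2θ²):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 20, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
theta_hat = (x**2).mean(axis=1)            # the efficient estimator (1/n) Σ x_i²

print(theta_hat.mean(), sigma2)            # ≈ θ : unbiased
print(((theta_hat - sigma2)**2).mean())    # empirical MSE ...
print(2 * sigma2**2 / n)                   # ... attains the CRLB 1/J(θ) = 2θ²/n
```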
Example: estimate the intensity of a Poisson process
A Poisson process with intensity λ is a point process such that the times between 'events' are independent and exponentially distributed with mean τ = 1/λ.

[figure: interval (0, t) containing N = n events, with ~Expon(λ) inter-event times]
The number of events N in an interval of length t is Poiss(λt).

- Likelihood: p(n; λ) = Prob[n events in an interval of length t] = e^{−λt}(λt)ⁿ/n!
- Log-likelihood: ln p(n; λ) = −λt + n ln(λt) − ln n!
- Score:
$$ S(\lambda,n) = \frac{\partial}{\partial\lambda}\ln p(n;\lambda) = -t + \frac{n}{\lambda} = \underbrace{\frac{t}{\lambda}}_{k(\lambda)}\Big(\underbrace{\frac{n}{t}}_{\hat\lambda(n)} - \lambda\Big) $$

This is the efficient form, so λ̂(n) = n/t is an unbiased efficient estimator for λ!

(!) However, the inverse 1/λ̂ = t/n is not an efficient estimator for τ:

$$ p(n;\tau) = e^{-t/\tau}\frac{(t/\tau)^n}{n!}, \quad\text{so the score is}\quad S(\tau,n) = -\frac{n}{\tau} + \frac{t}{\tau^2} = \frac{n}{\tau^2}\Big(\frac{t}{n}-\tau\Big) $$

The factor n/τ² depends on the data n and not only on τ, so this is impossible to write in efficient form: no unbiased efficient estimator for τ exists!
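The λ̂ = n/t result can be checked by simulation (the parameter values below are my own choice): its mean is λ and its variance attains the CRLB 1/J(λ) = λ/t.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t = 2.5, 10.0

n = rng.poisson(lam * t, size=200_000)  # event counts over an interval of length t
lam_hat = n / t                         # the efficient estimator n/t

print(lam_hat.mean(), lam)              # ≈ λ : unbiased
print(lam_hat.var(), lam / t)           # variance attains the CRLB 1/J(λ) = λ/t
```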
Biased estimators
Should we always try to find unbiased estimators? No!
- They may not exist:
  e.g. no unbiased estimator for 1/p from a Bern(p) population exists
- They may be unreasonable:
  e.g. the MVU estimate of p from X ~ Geom(p) is p̂(X) = 1⟨X = 1⟩; this estimate is always 0 or 1
- They may have an extremely large variance (= MSE)

So unbiased estimators do not always minimise the MSE: MSE(θ̂) = bias² + Var[θ̂]
→ Sometimes it is better to sacrifice unbiasedness for lower variance.

Minimising the MSE
We require:
- the concept of sufficient statistics
- the Rao-Blackwell theorem
Sufficient statistics
Recall: a statistic T(x) is any function of the sample data not depending on unknown parameters.
It could also be vector-valued: T(x) : Ω → ℝᵐ, with m < n typically.

A statistic T(x) is sufficient with respect to the model p(x; θ) if
p(x | T(x); θ) = p(x | T(x)), ∀x,
i.e. if the distribution of X, given that T(X) = t, is independent of θ.
→ "All you can learn about θ from the data X, you can also learn from the statistic T(X)."

- If X is a book in which θ is a character, then a summary T(X) is sufficient if it gives all the information about θ that is also in the book.
- Sufficiency can be checked using the Neyman-Fisher criterion.
Sufficient statistics

Neyman-Fisher factorisation criterion
A statistic T(x) is sufficient with respect to the model p(x; θ)

$$ \Leftrightarrow\quad p(\mathbf{x};\theta) = \underbrace{g(\mathbf{x})}_{\text{independent of }\theta}\ \cdot \underbrace{h\big(\mathbf{T}(\mathbf{x}),\theta\big)}_{\text{depends on }\mathbf{x}\text{ only through }\mathbf{T}(\mathbf{x})} \qquad \forall\mathbf{x}\in\Omega $$

Proof (assuming X is discrete):
First note that if t = T(x), then "T(X) = t, X = x" and "X = x" are the same event!
→ p(x; θ) = p(x, t; θ)

(⇒) p(x; θ) = p(x, t; θ) = p(x | t; θ) · p(t; θ), which by sufficiency equals p(x | t) · p(t; θ); take g(x) = p(x | t) and h(t, θ) = p(t; θ).

(⇐)
$$ p(\mathbf{x}\,|\,\mathbf{t};\theta) = \frac{p(\mathbf{x},\mathbf{t};\theta)}{p(\mathbf{t};\theta)} = \frac{p(\mathbf{x},\mathbf{t};\theta)}{\sum_{\mathbf{x}'} p(\mathbf{x}',\mathbf{t};\theta)\,1\langle \mathbf{T}(\mathbf{x}')=\mathbf{t}\rangle} = \frac{g(\mathbf{x})\,h(\mathbf{t},\theta)}{\sum_{\mathbf{x}'} g(\mathbf{x}')\,h(\mathbf{t},\theta)\,1\langle \mathbf{T}(\mathbf{x}')=\mathbf{t}\rangle} $$
which is independent of θ (the factor h(t, θ) cancels), so p(x | t; θ) = p(x | t) → sufficiency. QED
Example: sample mean for a Bernoulli population
Assume again a Bernoulli population: X ~ Bern(θ), i.e. p(x; θ) = θˣ(1 − θ)¹⁻ˣ for x ∈ {0, 1}.
- Sample size n
- Take as statistic the sample mean T(X) = X̄ = (1/n)∑ᵢ₌₁ⁿ Xᵢ = C/n, with C the count of 1s in the sample.

$$ p(\mathbf{x};\theta) = \prod_{i=1}^{n}p(x_i;\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \underbrace{\theta^{nT(\mathbf{x})}(1-\theta)^{n-nT(\mathbf{x})}}_{h(T(\mathbf{x}),\,\theta)} \cdot \underbrace{1}_{g(\mathbf{x})} $$

Neyman-Fisher checks out, so the sample mean is a sufficient statistic for θ.

→ T(x) is also efficient, since S(θ, x) = [n/(θ(1 − θ))] (T(x) − θ)
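Sufficiency can also be checked here by brute force. A small sketch (my own construction, not from the slides): enumerate all samples of size n = 4 with a fixed count c of 1s, and confirm that the conditional distribution p(x | C = c; θ) equals the uniform 1/C(n, c) for every θ.

```python
from itertools import product
from math import comb

n, c = 4, 2
for theta in (0.2, 0.5, 0.8):
    # joint probabilities of all samples x with sum(x) = c
    probs = [theta**c * (1 - theta)**(n - c)
             for x in product((0, 1), repeat=n) if sum(x) == c]
    cond = [p / sum(probs) for p in probs]
    print(theta, cond[0], 1 / comb(n, c))  # conditional prob is 1/C(n,c), θ-free
```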
Rao-Blackwellisation of an estimator

Rao-Blackwell theorem
For a model p(x; θ), let θ̂(x) be an estimator for θ such that Var[θ̂] exists. If T(x) is a sufficient statistic, then for the new estimator
θ̂*(t) = E[θ̂(X) | T(X) = t]:
1) the new estimator θ̂* is a statistic, i.e. does not depend on θ;
2) if θ̂ is unbiased, then θ̂* is also unbiased;
3) MSE(θ̂*) ≤ MSE(θ̂) → so the new estimator may be 'better'!
4) MSE(θ̂*) = MSE(θ̂) iff θ̂(x) depends on x only through T(x).

- This process of improving existing estimators is called 'Rao-Blackwellisation'.
- The process is idempotent: repeating it will give no further improvement.
- The proof is essentially based on the law of total expectation:
  let f(t) = E[· | T = t]; then E[·] = E_T[f(T)] = E_T[E_X[· | T]],
  where the inner expectation is over all X for which T(X) is fixed, and the outer expectation is over all T.
Rao-Blackwellisation of an estimator

Proof:
1) θ̂* is a statistic because of the sufficiency of T(x):
$$ \hat\theta^*(\mathbf{t}) = \sum_{\mathbf{x}}\hat\theta(\mathbf{x})\,p(\mathbf{x}\,|\,\mathbf{t}) \quad\text{is independent of }\theta $$
2) By the unbiasedness of θ̂: θ = E[θ̂] = E_T[E_X[θ̂(X) | T]] = E_T[θ̂*(T)] = E[θ̂*].
3) Since both estimators are unbiased, their MSE equals their variance, so
$$ \mathrm{MSE}(\hat\theta)-\mathrm{MSE}(\hat\theta^*) = \mathrm{Var}[\hat\theta]-\mathrm{Var}[\hat\theta^*] = E[\hat\theta^2]-\underbrace{\big(E[\hat\theta]\big)^2}_{\theta^2} - E[\hat\theta^{*2}] + \underbrace{\big(E[\hat\theta^*]\big)^2}_{\theta^2} $$
$$ = E[\hat\theta^2(\mathbf{X})] - E[\hat\theta^{*2}(\mathbf{T})] = E_{\mathbf{T}}\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\,|\,\mathbf{T}] - \hat\theta^{*2}(\mathbf{T})\Big] $$
$$ = E_{\mathbf{T}}\Big[E_{\mathbf{X}}[\hat\theta^2(\mathbf{X})\,|\,\mathbf{T}] - \big(E_{\mathbf{X}}[\hat\theta(\mathbf{X})\,|\,\mathbf{T}]\big)^2\Big] = E_{\mathbf{T}}\Big[\underbrace{\mathrm{Var}[\hat\theta(\mathbf{X})\,|\,\mathbf{T}]}_{\geq 0}\Big] \geq 0 $$
4) The inequality in 3) is an equality iff Var[θ̂(X) | T = t] = 0, ∀t
→ given T(X) = t, θ̂ is fixed, so θ̂(x) depends on x only through T(x). QED
Example: estimate the maximum of a uniform distribution
Observe X₁, ..., Xₙ ~ Unif(0, a). How to estimate the upper bound a?

[figure: data points x₂, x₄, x₁, x₃ on the interval (0, a), with max(x) = t and the unknown a beyond it]

- Original (naive) estimator: since E[Xᵢ] = a/2, one could propose
$$ \hat a(\mathbf{x}) = 2\bar x = \frac{2}{n}\sum_{i=1}^{n}x_i \quad\longrightarrow\quad E[\hat a] = a,\quad \mathrm{MSE}(\hat a) = \frac{a^2}{3n}\ \text{(exercise)} $$
- T(x) = max(x) is sufficient for a, since Neyman-Fisher checks out:
$$ p(\mathbf{x};a) = \prod_{i=1}^{n}\frac{1}{a}\,1\langle 0\leq x_i\leq a\rangle = \underbrace{\frac{1}{a^n}\,1\langle T(\mathbf{x})\leq a\rangle}_{h(T(\mathbf{x}),\,a)} \cdot \underbrace{\prod_{i=1}^{n}1\langle 0\leq x_i\rangle}_{g(\mathbf{x})} $$
- Rao-Blackwell new estimator (suppose n > 1): given max(X) = t, the remaining n − 1 observations are Unif(0, t) with mean t/2, so
$$ \hat a^*(t) = E[\hat a(\mathbf{X})\,|\,T(\mathbf{X})=t] = E\Big[\frac{2}{n}\Big(\sum_{i=1}^{n-1}X_i + t\Big)\,\Big|\,T(\mathbf{X})=t\Big] = \frac{2t}{n} + (n-1)\frac{t}{n} $$
$$ = \frac{n+1}{n}\,t = \frac{n+1}{n}\max(\mathbf{x}) \quad\longrightarrow\quad E[\hat a^*] = a,\quad \mathrm{MSE}(\hat a^*) = \frac{a^2}{n(n+2)}\ \text{(exercise)} $$

We find that indeed MSE(â*) < MSE(â), ∀n > 1.
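A Monte Carlo comparison of the two estimators (a = 3 and n = 8 are arbitrary illustration values), reproducing the two MSE formulas above:

```python
import numpy as np

rng = np.random.default_rng(4)
a, n, reps = 3.0, 8, 200_000

x = rng.uniform(0.0, a, size=(reps, n))
a_naive = 2 * x.mean(axis=1)                         # 2 * sample mean
a_rb = (n + 1) / n * x.max(axis=1)                   # Rao-Blackwellised estimator

print(((a_naive - a)**2).mean(), a**2 / (3 * n))     # ≈ a²/(3n)
print(((a_rb - a)**2).mean(), a**2 / (n * (n + 2)))  # ≈ a²/(n(n+2)), much smaller
```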
The maximum-likelihood estimator
For a model p(x; θ), the maximum-likelihood estimator θ̂_ML (MLE) for θ is the value of θ for which the model produces the highest probability of observing the sample X = x:

$$ \hat\theta_{\mathrm{ML}}(\mathbf{x}) = \operatorname*{arg\,max}_{\theta\in S}\ p(\mathbf{x};\theta) $$

[figure: likelihood p(x; θ) as a function of θ, maximised at θ̂_ML]

- Finding θ̂_ML is a maximisation problem:
$$ \frac{\partial}{\partial\theta}p(\mathbf{x};\theta) = 0 \ \Rightarrow\ \underbrace{\frac{\partial}{\partial\theta}\ln p(\mathbf{x};\theta)}_{\text{score function}} = 0 \ \Rightarrow\ S(\theta,\mathbf{x}) = 0 $$
→ so it involves finding zeroes of the score function, which usually requires numerical (search) algorithms; see the sketch below.
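As an illustration of such a numerical search (not from the slides: a Cauchy location model, whose MLE has no closed form), one can simply minimise the negative log-likelihood with an off-the-shelf optimiser:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.standard_cauchy(50) + 1.5     # Cauchy sample with true location θ = 1.5

def neg_log_lik(theta):
    # Cauchy log-likelihood: ln p(x;θ) = -Σ ln(π (1 + (x_i - θ)²))
    return np.sum(np.log(np.pi * (1.0 + (x - theta)**2)))

res = minimize_scalar(neg_log_lik, bounds=(-10, 10), method="bounded")
print(res.x)                          # numerical θ̂_ML, close to 1.5
```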
The maximum-likelihood estimator

Properties
- Any unbiased efficient estimator θ̂ is also the MLE:
  the score has the efficient form S(θ, x) = k(θ)(θ̂(x) − θ), so S(θ, x) = 0 at θ = θ̂(x) → θ̂ is the MLE.
  The converse is not true: not all MLEs are efficient.
- Under some regularity conditions however, for increasing sample size n → ∞, the MLE
  - is consistent: plim_{n→∞} θ̂_{ML,n} = θ
  - is asymptotically efficient: lim_{n→∞} Var[θ̂_{ML,n}] / (1/(nJ(θ))) = 1, with J(θ) here the Fisher information of a single observation
  - is asymptotically normal: θ̂_{ML,n} → N(θ, 1/(nJ(θ))) as n → ∞
EM algorithm (Expectation-Maximisation) for finding the MLE

Observed data vs. complete data
The log-likelihood ln p(x; θ) may be a complicated function of θ, so that
finding arg max_θ ln p(x; θ) → is difficult.
But in the case where the observed data x is only part of the underlying complete data (x: observed, y: hidden), the complete-data log-likelihood problem,
finding arg max_θ ln p(x, y; θ) → is often easy.

EM algorithm
- Numerical search algorithm: θ₀ →E→M→ θ₁ →E→M→ θ₂ →E→M→ ... → θ̂_ML
- Sure to converge to a local likelihood maximum
EM algorithm

$$ p(\mathbf{x},\mathbf{y};\theta) = p(\mathbf{x};\theta)\,p(\mathbf{y}\,|\,\mathbf{x};\theta) \quad\longrightarrow\quad \underbrace{\ln p(\mathbf{x};\theta)}_{\substack{\text{observed LL}\\ \text{max is difficult}}} = \underbrace{\ln p(\mathbf{x},\mathbf{y};\theta)}_{\substack{\text{complete LL}\\ \text{max is easy}}} - \underbrace{\ln p(\mathbf{y}\,|\,\mathbf{x};\theta)}_{\substack{\text{hidden data,}\\ \text{conditional on }\mathbf{x}}} $$

EM approaches the argmax of the observed LL by iteratively maximising the complete LL:

E-step (expectation)
So we need to maximise ln p(x, y; θ) ... but how, if y is unknown!?
- Trick 1: replace the complete LL by its expected value:
$$ L_{\mathbf{x}}(\theta) = E[\ln p(\mathbf{x},\mathbf{Y};\theta)] = \int \ln p(\mathbf{x},\mathbf{y};\theta)\ p(\mathbf{y}\,|\,\mathbf{x};\theta)\,d\mathbf{y} $$
- Trick 2: use the current estimate θ_k of θ to fix the distribution of the hidden data
  → replace p(y | x; θ) by p(y | x; θ_k) and calculate
$$ L_{\mathbf{x}}(\theta\,|\,\theta_k) = \int \ln p(\mathbf{x},\mathbf{y};\theta)\ p(\mathbf{y}\,|\,\mathbf{x};\theta_k)\,d\mathbf{y} $$

M-step (maximisation)
The next estimate of θ is: θ_{k+1} ← arg max_θ L_x(θ | θ_k)
EM algorithm
- It can be shown that, for the observed LL:
  ln p(x; θ_{k+1}) ≥ ln p(x; θ_k)
  So if the likelihood has a local maximum, the EM algorithm will converge to it.
- In fact, the EM algorithm is especially useful when the parameter to be estimated is a vector θ = (θ₁, ..., θ_h), so that the 'search space' S is very large.
Example: censored data
An electricity company has a power line to a part of the city with fluctuating daily demand. It is known/assumed that the demand W of one day, measured in MWh, is N(µ, 1). That is, the variance is known (σ = 1 MWh) but the mean is not.
To estimate the mean daily power demand µ = E[W], the company asks n = 5 employees to measure the power, on 5 different days and each with a different power meter. Unfortunately, the meters have a limited range rᵢ, i = 1, ..., n. If Wᵢ > rᵢ, the meter fails (×) and does not give a reading.

employee (i) | meter range rᵢ (MWh) | measurement xᵢ (MWh)
     1       |           7          |         ×
     2       |           5          |        4.2
     3       |           8          |         ×
     4       |           6          |        4.7
     5       |          10          |        6.9

→ We try to find the MLE for µ. The mean of the observed readings is x̄ = (4.2 + 4.7 + 6.9)/3 ≈ 5.27.
Example: censored data
Direct maximisation of the observed LL
Suppose the first m ≤ n measurements succeeded, x = (x₁, ..., x_m) (observed), and the rest failed, Y = (Y_{m+1}, ..., Y_n) (hidden) → Yᵢ > rᵢ for m < i ≤ n.

$$ p(\mathbf{x};\mu) = \prod_{i=1}^{m}\varphi(x_i-\mu)\ \prod_{i=m+1}^{n}\big(1-\Phi(r_i-\mu)\big) $$
$$ \ell_{\mathrm{obs}}(\mu) = \ln p(\mathbf{x};\mu) = -\frac{m}{2}\ln(2\pi) - \sum_{i=1}^{m}\frac{1}{2}(x_i-\mu)^2 + \sum_{i=m+1}^{n}\ln\big(1-\Phi(r_i-\mu)\big) $$

µ̂_ML satisfies ℓ′_obs(µ) = 0, or:

$$ m(\mu-\bar x) = \sum_{i=m+1}^{n}\frac{\varphi(r_i-\mu)}{1-\Phi(r_i-\mu)} $$

This is a transcendental equation, difficult to solve; it can only be done numerically.
→ So let us use the EM algorithm instead!

[figure: observed LL ℓ_obs(µ) = ln p(x; µ) for µ ∈ [5.0, 8.0]; the maximum can be found using numerical techniques]
Example: censored data
E-step
The complete LL is

$$ \ln p(\mathbf{x},\mathbf{Y};\mu) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^{n}(Y_i-\mu)^2 $$

- Trick 1: replace the complete LL by its expected value:
$$ E[\ln p(\mathbf{x},\mathbf{Y};\mu)] = -\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^{n}E[(Y_i-\mu)^2] + c, \qquad c\ \text{some constant indep. of }\mu $$
$$ E[(Y_i-\mu)^2] = \int_{r_i}^{\infty}(y-\mu)^2\,p(y;\mu)\,dy, \qquad\text{with}\quad p(y_i;\mu) = \frac{\varphi(y_i-\mu)}{1-\Phi(r_i-\mu)} $$
(the density of the hidden Yᵢ is that of W conditional on W > rᵢ, a truncated normal)

- Trick 2: ... and use the current estimate µ_k for the distribution of the hidden data:
$$ E_{\mu_k}[(Y_i-\mu)^2] = \int_{r_i}^{\infty}(y-\mu)^2\,p(y;\mu_k)\,dy = \int_{r_i}^{\infty}(-2y\mu + \mu^2 + y^2)\,p(y;\mu_k)\,dy $$
$$ = -2\mu\underbrace{\int_{r_i}^{\infty}y\,p(y;\mu_k)\,dy}_{E_{\mu_k}[Y]\,=\,E_{\mu_k}[W\,|\,W>r_i]} +\ \mu^2\underbrace{\int_{r_i}^{\infty}p(y;\mu_k)\,dy}_{1} +\ c $$
$$ = -2\mu\Big(\mu_k + \frac{\varphi(r_i-\mu_k)}{1-\Phi(r_i-\mu_k)}\Big) + \mu^2 + c $$
Example: censored data
M-step

$$ L_{\mathbf{x}}(\mu\,|\,\mu_k) = -\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^{n}\Big[-2\mu\Big(\mu_k + \frac{\varphi(r_i-\mu_k)}{1-\Phi(r_i-\mu_k)}\Big) + \mu^2\Big] + c $$
$$ L'_{\mathbf{x}}(\mu\,|\,\mu_k) = 0 \ \Leftrightarrow\ m\bar x - n\mu + (n-m)\mu_k + \sum_{i=m+1}^{n}\frac{\varphi(r_i-\mu_k)}{1-\Phi(r_i-\mu_k)} = 0 $$

So we update:

$$ \mu_{k+1} \leftarrow \frac{m}{n}\bar x + \frac{n-m}{n}\mu_k + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\varphi(r_i-\mu_k)}{1-\Phi(r_i-\mu_k)} $$

[figure: observed LL ℓ_obs(µ) with the iterates µ₀, µ₁, µ₂ marked; started with µ₀ = x̄; convergence is very fast, only 2 or 3 iterations are required here]
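A sketch of this update for the slide's data, using scipy.stats.norm for φ and 1 − Φ (norm.sf); starting from µ₀ = x̄ it settles in a handful of iterations:

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])   # observed readings (m = 3)
r = np.array([7.0, 8.0])        # ranges of the meters that failed (n - m = 2)
n, m = len(x) + len(r), len(x)

mu = x.mean()                   # start from the observed sample mean, μ0 = x̄
for k in range(50):
    # inverse Mills ratio φ(r_i - μ_k)/(1 - Φ(r_i - μ_k)) for each censored meter
    mills = norm.pdf(r - mu) / norm.sf(r - mu)
    mu_next = (m * x.mean() + (n - m) * mu + mills.sum()) / n
    if abs(mu_next - mu) < 1e-10:
        break
    mu = mu_next
print(mu)                       # ≈ μ̂_ML, noticeably above the naive x̄ ≈ 5.27
```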
Example: censored data
What if σ is also unknown!?
No problem, the EM algorithm can be used to approximate θ = (µ, σ²):

$$ \mu_{k+1} \leftarrow \frac{m}{n}\bar x + \frac{n-m}{n}\mu_k + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\sigma_k\,\varphi\big((r_i-\mu_k)/\sigma_k\big)}{1-\Phi\big((r_i-\mu_k)/\sigma_k\big)} $$
$$ \sigma^2_{k+1} \leftarrow \frac{1}{n}\sum_{i=1}^{m}x_i^2 + \frac{n-m}{n}\big(\mu_k^2+\sigma_k^2\big) + \frac{1}{n}\sum_{i=m+1}^{n}\frac{\sigma_k(\mu_k+r_i)\,\varphi\big((r_i-\mu_k)/\sigma_k\big)}{1-\Phi\big((r_i-\mu_k)/\sigma_k\big)} - \mu_{k+1}^2 $$

[figure: EM iterates µ_k and σ_k converging to µ_ML and σ_ML; the observed LL at the optimum is −5.91; started with µ₀ = x̄, σ₀² = 1; convergence is again very fast, only 6 or 7 iterations are required here]