MACHINE LEARNING 2: EM ALGORITHM, LECTURE 2
Royal Institute of Technology
EXTENDED STUDENT EXAMPLE
[Figure: the extended student example Bayesian network; node value labels are B (better), H (higher), L (less)]
D-SEPARATION
★ A path is d-separated by O if it has
• a chain X → Y → Z where Y ∈ O
• a fork X ← Y → Z where Y ∈ O
• a v-structure X → Y ← Z where (Y ⋃ desc(Y)) ⋂ O = ∅
[Figure: the three cases, chain X → Y → Z, fork X ← Y → Z, and v-structure X → Y ← Z, with the blocking conditions Y ∈ O, Y ∈ O, and Y ∉ O with desc(Y) ∩ O = ∅]
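To make the three cases concrete, here is a minimal Python sketch (not part of the lecture) that tests whether a single path is d-separated by an observed set O. The dict-of-parent-sets representation and the function names are assumptions made for illustration.

```python
# A minimal sketch of the d-separation test for one path, assuming the DAG is
# given as a dict mapping each node to its set of parents.

def descendants(dag, y):
    """All descendants of y: nodes reachable by following child edges."""
    children = {v: {w for w in dag if v in dag[w]} for v in dag}
    stack, seen = [y], set()
    while stack:
        v = stack.pop()
        for w in children[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def path_d_separated(dag, path, O):
    """True if the undirected path (a node sequence) is blocked by the set O."""
    for X, Y, Z in zip(path, path[1:], path[2:]):
        x_to_y = X in dag[Y]            # edge X -> Y exists
        y_to_z = Y in dag[Z]            # edge Y -> Z exists
        chain = (x_to_y and y_to_z) or (not x_to_y and not y_to_z)  # X->Y->Z or X<-Y<-Z
        fork = (not x_to_y) and y_to_z                              # X<-Y->Z
        collider = x_to_y and (not y_to_z)                          # X->Y<-Z
        if (chain or fork) and Y in O:
            return True
        if collider and not (({Y} | descendants(dag, Y)) & O):
            return True
    return False

# Example: the chain A -> B -> C is blocked exactly when B is observed.
dag = {"A": set(), "B": {"A"}, "C": {"B"}}
print(path_d_separated(dag, ["A", "B", "C"], O={"B"}))  # True
print(path_d_separated(dag, ["A", "B", "C"], O=set()))  # False
```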
D-SEPARATION SETS AND CI OF DAGS
★ A is d-separated from B given O if every undirected path between A and B is d-separated by O
★ In a DAG G, we write x_A \perp_G x_B \mid x_O when A is d-separated from B given O
EQUIVALENCE OF INDEPENDENCE DEFINITIONS
★ Global (G): d-separation
★ Local (L): x_v \perp x_{\mathrm{nd}(v) \setminus \mathrm{pa}(v)} \mid x_{\mathrm{pa}(v)}, where nd(v) are the non-descendants of v
★ Ordered (O): x_v \perp x_{\mathrm{pred}(v) \setminus \mathrm{pa}(v)} \mid x_{\mathrm{pa}(v)}, where pred is according to a topological order
★ Factorized (F): p(x) can be family-factorized, p(x) = \prod_v p(x_v \mid x_{\mathrm{pa}(v)})
★ Theorem: G ⇔ L ⇔ O ⇔ F
MARKOV BLANKET
★ A minimal set B s.t. X_t is independent of X_{V \setminus (B \cup \{t\})} given X_B is a Markov blanket
★ For t, pa(t) ⋃ c(t) ⋃ pa(c(t)) is a Markov blanket – i.e., parents, children, and co-parents (necessary due to v-structures)
[Figure: a seven-node example DAG (nodes 1-7) illustrating the Markov blanket of a node]
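A small sketch (not from the slides) of how pa(t) ∪ c(t) ∪ pa(c(t)) can be collected from a parent-list representation of the DAG; the dict format and names are illustrative assumptions.

```python
# Markov blanket of node t: parents, children, and co-parents (parents of t's
# children, excluding t itself), assuming a dict mapping each node to its parents.

def markov_blanket(dag, t):
    parents = dag[t]
    children = {v for v in dag if t in dag[v]}
    co_parents = {p for c in children for p in dag[c]} - {t}
    return parents | children | co_parents

# Example DAG: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5; the blanket of node 3 is {1, 2, 4, 5}.
dag = {1: set(), 2: set(), 3: {1, 2}, 4: set(), 5: {3, 4}}
print(markov_blanket(dag, 3))  # {1, 2, 4, 5}
```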
LATENT = HIDDEN
★ Can reduce #parameters
★ Can represent common causes
[Figure: the same model with and without a latent common-cause variable, comparing the number of parameters]
LEARNING PARAMETERS COMPLETE DATA
★ “...Bayesian view, the parameters are unknown variables and should also be inferred”
★ Learning from complete data
★ Likelihood
P(D \mid \theta) = \prod_{n=1}^{N} P(x_n \mid \theta) = \prod_{n=1}^{N} \prod_{v=1}^{V} P(x_{nv} \mid x_{n,\mathrm{pa}(v)}, \theta) = \prod_{v=1}^{V} \prod_{n=1}^{N} P(x_{nv} \mid x_{n,\mathrm{pa}(v)}, \theta) = \prod_{v=1}^{V} P(D_v \mid \theta_v)
where D_v is the values of v together with its parents and \theta_v is v's CPD
MLE FOR CAT CPDS
★ Each P(D_v \mid \theta_v), i.e., here each \theta_{vc}, can be maximized independently
★ So, the MLE is \hat{\theta}_{vck} = N_{vck}/N_{vc}
★ where
N_{vck} = \sum_{n=1}^{N} I(x_{nv} = k,\ x_{n,\mathrm{pa}(v)} = c), \qquad N_{vc} = \sum_{n=1}^{N} I(x_{n,\mathrm{pa}(v)} = c)
SUFFICIENT STATISTICS
★ N_{vck} and N_{vc} are the sufficient statistics of the categorical CPDs
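As an illustration of the counts above, a hedged Python sketch that accumulates N_vck and N_vc from complete data and returns the relative-frequency estimates; the data format (list of dicts) and the function name are assumptions made here.

```python
# MLE for categorical CPDs from complete data: theta_vck = N_vck / N_vc.

from collections import Counter, defaultdict

def mle_cat_cpds(data, parents):
    """data: list of dicts {variable: value}; parents: dict variable -> tuple of parents.
    Returns theta[v][(c, k)] = estimate of P(X_v = k | X_pa(v) = c)."""
    N_vck = defaultdict(Counter)   # N_vck[v][(c, k)]
    N_vc = defaultdict(Counter)    # N_vc[v][c]
    for x in data:
        for v, pa in parents.items():
            c = tuple(x[p] for p in pa)   # parent configuration
            k = x[v]                      # value of v
            N_vck[v][(c, k)] += 1
            N_vc[v][c] += 1
    return {v: {(c, k): n / N_vc[v][c] for (c, k), n in N_vck[v].items()}
            for v in parents}

# Toy example with two binary variables A -> B.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
theta = mle_cat_cpds(data, {"A": (), "B": ("A",)})
print(theta["B"][((1,), 1)])  # 1.0: B = 1 in both samples with A = 1
```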
EXTENDED STUDENT EXAMPLE
GIVEN DATA AND NON-OPTIMAL PARAMETERS
[Figure: a two-state HMM over a Fair die and a Biased/loaded die, with transition probabilities p, 1-p, q, 1-q]
★ We observe the sequence of dice outcomes of visited vertices
MIXTURE MODELS AND THE EXPECTATION MAXIMIZATION (EM) ALGORITHM
Used for MLE of
★ mixture models
★ parameters of DGMs, HMMs
★ DAG in DGMs (the structural EM by Nir Friedman)
[Figure: a DGM with a hidden variable Z as parent of X_1, …, X_V]
EM iteratively improves parameters
E-step: compute expected sufficient statistics (ESS) w.r.t. p(z|x, θ), the posterior over the hidden variables given the data and the current parameters
M-step: find optimal θ' using the ESS
Chapter 3
EXPECTATION-MAXIMIZATION
THEORY
3.1 Introduction
Learning networks are commonly categorized in terms of supervised and unsupervised networks. In unsupervised learning, the training set consists of input training patterns only. In contrast, in supervised learning networks, the training data consist of many pairs of input/output patterns. Therefore, the learning process can benefit greatly from the teacher's assistance. In fact, the amount of adjustment of the updating coefficients often depends on the difference between the desired teacher value and the actual response. As demonstrated in Chapter 5, many supervised learning models have been found to be promising for biometric authentication; their implementation often hinges on an effective data-clustering scheme, which is perhaps the most critical component in unsupervised learning methods. This chapter addresses a data-clustering algorithm, called the expectation-maximization (EM) algorithm, when complete or partial information of observed data is made available.
3.1.1 K-Means and VQ algorithms
An effective data-clustering algorithm is known as K-means [85], which is very similar to another clustering scheme known as the vector quantization (VQ) algorithm [118]. Both methods classify data patterns based on the nearest-neighbor criterion.
Verbally, the problem is to cluster a given data set X = {x_t; t = 1, …, T} into K groups, each represented by its centroid denoted by µ(j), j = 1, …, K. The task is (1) to determine the K centroids {µ(1), µ(2), …, µ(K)} and (2) to assign each pattern x_t to one of the centroids. The nearest-neighbor rule assigns a pattern x to the class associated with its nearest centroid, say µ(i).
Mathematically speaking, one denotes the centroid associated with x_t as µ_t, where µ_t ∈ {µ(1), µ(2), …, µ(K)}. Then the objective of the K-means algorithm is to minimize the sum of squared distances between each pattern and its assigned centroid, \sum_t \|x_t - \mu_t\|^2.
GAUSSIAN – MVN
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)
N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)
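A minimal numpy sketch evaluating the MVN density exactly as written above; in practice scipy.stats.multivariate_normal computes the same quantity.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) = (2*pi)^(-D/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(x, mu, Sigma))
```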
TWO DIMENSIONAL NORMAL
[Figure: two scatter plots of weight (y-axis, 80-280) against height (x-axis, 55-80); red = female, blue = male]
GAUSSIAN MIXTURE MODELS (GMM)
Z hidden, Z \sim \mathrm{Cat}(\pi)
D = (x_1, \ldots, x_N), \quad x_n = (x_{n1}, \ldots, x_{nD})
X \mid Z = c \sim N(\mu_c, \Sigma_c), \quad \text{i.e. } p(X \mid Z = c) = p_c(X) = N(X \mid \mu_c, \Sigma_c)
1-DIM GAUSSIAN MIXTURE MODELS
Z hidden, Z \sim \mathrm{Cat}(\pi)
D = (x_1, \ldots, x_N)
X \mid Z = c \sim N(\mu_c, \sigma^2_c), \quad \text{i.e. } p(X \mid Z = c) = p_c(X) = N(X \mid \mu_c, \sigma^2_c)
EXAMPLE
★ z_n is red with probability 1/2, green with probability 3/10, blue with probability 1/5
★ x_n is generated from the Gaussian indicated by z_n
★ We get x_1, …, x_N
[Figure: three Gaussian densities (red, green, blue); a sample x_n drawn from the blue component, i.e., z_n = blue]
GMM
★ p(z_n = c) = \pi_c and p(x_n \mid z_n = c) = N(x_n \mid \mu_c, \sigma^2_c)
★ So, p(x_n, z_n) = p(z_n)\, p(x_n \mid z_n)
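To make the generative story p(x_n, z_n) = p(z_n) p(x_n|z_n) concrete, a sketch that samples from a 1-D mixture using the weights from the example slide (1/2, 3/10, 1/5); the component means and standard deviations below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # p(z_n = c), from the example slide
mu = np.array([-2.0, 0.0, 3.0])     # assumed component means
sigma = np.array([0.5, 1.0, 0.7])   # assumed component standard deviations

def sample_gmm(N):
    z = rng.choice(len(pi), size=N, p=pi)   # z_n ~ Cat(pi)
    x = rng.normal(mu[z], sigma[z])         # x_n ~ N(mu_{z_n}, sigma_{z_n}^2)
    return z, x

z, x = sample_gmm(5)
print(z, x)
```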
COMPLETE LOG LIKELIHOOD (KNOWING ALL Z)
N_c = \sum_n I(z_n = c)
\theta' denotes all parameters (mixture weights \pi'_c and component parameters \theta'_c)
MLE FOR GAUSSIAN
• Maximizing
\ell(\theta'; D) = \sum_c N_c \log \pi'_c + \sum_c \sum_{n: I(z_n=c)} \log p(x_n \mid \theta'_c)
• Boils down to maximizing each
\sum_{n: I(z_n=c)} \log p(x_n \mid \theta'_c)
that is
\sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu_c)^2\right)
MLE FOR GAUSSIAN
Derivation:
f(\sigma'_c, \mu'_c) = \sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2\right)
= \sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_{n: I(z_n=c)} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \mu'_c} = \sum_{n: I(z_n=c)} \frac{2}{2\sigma'^2_c}(x_n - \mu'_c)
So,
\frac{\partial f}{\partial \mu'_c} = 0 \iff \sum_{n: I(z_n=c)} x_n = \sum_{n: I(z_n=c)} \mu'_c = N_c \mu'_c
Solving,
\mu'_c = \frac{\sum_{n: I(z_n=c)} x_n}{N_c}, \quad \text{where } N_c = \sum_n I(z_n = c)
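A short sketch of the complete-data MLE just derived: with z_n known, each component's parameters are simply the sample statistics of the points assigned to it (illustrative code, not from the slides).

```python
import numpy as np

def gaussian_mle_complete(x, z, C):
    """x: (N,) data, z: (N,) integer labels in {0,...,C-1}. Returns pi, mu, var per component."""
    N = len(x)
    pi, mu, var = np.zeros(C), np.zeros(C), np.zeros(C)
    for c in range(C):
        xc = x[z == c]
        pi[c] = len(xc) / N                  # N_c / N
        mu[c] = xc.mean()                    # sum_{n: z_n = c} x_n / N_c
        var[c] = ((xc - mu[c]) ** 2).mean()  # MLE variance (divides by N_c)
    return pi, mu, var

x = np.array([1.0, 1.2, 0.8, 5.0, 5.5])
z = np.array([0, 0, 0, 1, 1])
print(gaussian_mle_complete(x, z, C=2))
```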
K-MEANS
★ Data vectors D = {x_1, …, x_N}
★ Randomly selected classes z_1, …, z_N
★ Iteratively do
\mu_c = \frac{1}{N_c} \sum_{n: z_n = c} x_n, \quad \text{where } N_c = |\{n : z_n = c\}|
z_n = \arg\min_c \|x_n - \mu_c\|^2
★ One step is O(NKD), can be improved
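A compact sketch of the K-means loop above, alternating the mean update and the nearest-mean assignment; the initialization and stopping rule are illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (N, D) data. Returns means (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(X))                    # random initial classes
    for _ in range(n_iter):
        # mu_c = mean of the points assigned to c (random data point if c is empty)
        mu = np.array([X[z == c].mean(axis=0) if np.any(z == c)
                       else X[rng.integers(len(X))] for c in range(K)])
        # z_n = argmin_c ||x_n - mu_c||^2; one step is O(NKD)
        new_z = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_z, z):
            break
        z = new_z
    return mu, z

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
mu, z = kmeans(X, K=2)
print(mu)
```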
ASSIGN POINT TO MEANS
[Figure: 2-D data points assigned to their nearest cluster means]
K-MEANS AS GMM
★ Fixed variance, only means must be estimated
★ Idea: each point can belong to several means (clusters)
★ Use responsibilities to find means
r_{nc} = p(z_n = c \mid x_n, \theta) = \frac{p(z_n = c \mid \theta)\, p(x_n \mid z_n = c, \theta)}{\sum_{c'=1}^{C} p(z_n = c' \mid \theta)\, p(x_n \mid z_n = c', \theta)}
\mu_c = \frac{1}{N_c} \sum_n r_{nc} x_n, \quad \text{where } N_c = \sum_n r_{nc}
\theta_c = (\mu_c, \sigma^2)
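A sketch of the responsibility computation r_nc = p(z_n = c | x_n, θ) for a 1-D GMM, following the formula on this slide; the helper names are made up.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(x, pi, mu, var):
    """x: (N,), pi/mu/var: (C,). Returns r with r[n, c] = p(z_n = c | x_n, theta)."""
    joint = pi[None, :] * normal_pdf(x[:, None], mu[None, :], var[None, :])  # (N, C)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over components

x = np.array([-2.0, 0.1, 3.2])
r = responsibilities(x, pi=np.array([0.5, 0.5]), mu=np.array([-2.0, 3.0]), var=np.array([1.0, 1.0]))
print(r)  # each row sums to 1
```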
EM & EXPECTED LOG LIKELIHOOD (Q-TERM)
• Iteratively maximizing the expected log likelihood in practice always leads to a local maximum
• The expectation is over latent variables given data and current parameters
• We maximize the expression by choosing new parameters.
EM FOR GMM
★ Problem with the previous, most similar approach (MLE by differentiation):
★ How to handle the sum? The log cannot be pushed inside.
\ell(\theta') = \sum_{n=1}^{N} \log p(x_n \mid \theta') = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n \mid \theta')
★ Idea: use z_n from p(z_n \mid x_n, \theta), where \theta are the current parameters
LOG LIKELIHOOD
L(\theta') = \prod_{n=1}^{N} p(x_n \mid \theta') = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n \mid \theta')
where for a one-dim Gaussian \theta_c = (\mu_c, \sigma^2_c)
EXPECTED LOG LIKELIHOOD (Q-TERM)
E_{p(z_n|x_n,\theta)}[\ell(\theta'; D)] = E_{p(z_n|x_n,\theta)}\left[\log \prod_n p(x_n, z_n \mid \theta')\right]
= \sum_n E\left[\log \prod_c \left(\pi'_c\, p(x_n \mid z_n = c, \theta')\right)^{I(z_n = c)}\right]
= \sum_n \sum_c E\left[I(z_n = c) \log\left(\pi'_c\, p(x_n \mid \theta'_c)\right)\right]
= \sum_n \sum_c p(z_n = c \mid x_n, \theta) \log\left(\pi'_c\, p(x_n \mid \theta'_c)\right)
= \sum_n \sum_c r_{nc} \log \pi'_c + \sum_n \sum_c r_{nc} \log p(x_n \mid \theta'_c)
HOW TO FIND θ'?
★ We want \arg\max_{\theta'} E_{p(z_n|x_n,\theta)}[\ell(\theta'; D)]
★ The two sums \sum_c \left(\sum_n r_{nc}\right) \log \pi'_c and \sum_c \sum_n r_{nc} \log p(x_n \mid \theta'_c) are independent
★ So, \pi'_c = \sum_n r_{nc}/N = r_c/N
★ In the second, different c indices are independent
★ So, we want to maximize each \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu_c)^2\right)
WEIGHTED GAUSS - MEAN
Derivation:
f(\sigma'_c, \mu'_c) = \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2\right)
= \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_n r_{nc} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \mu'_c} = \sum_n r_{nc} \frac{2}{2\sigma'^2_c}(x_n - \mu'_c)
So,
\frac{\partial f}{\partial \mu'_c} = 0 \iff \sum_n r_{nc} x_n = \left(\sum_n r_{nc}\right) \mu'_c = r_c \mu'_c
Solving,
\mu'_c = \frac{\sum_n r_{nc} x_n}{r_c}
VARIANCE
Let \lambda'_c = 1/\sigma'_c.
Derivation:
f(\lambda'_c, \mu'_c) = \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_n r_{nc} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
= \sum_n r_{nc} \log \frac{\lambda'_c}{\sqrt{2\pi}} - \sum_n r_{nc} \frac{\lambda'^2_c}{2}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \lambda'_c} = \sum_n \frac{r_{nc}}{\lambda'_c} - \sum_n r_{nc} \lambda'_c (x_n - \mu'_c)^2
So,
\frac{\partial f}{\partial \lambda'_c} = 0 \iff \frac{r_c}{\lambda'_c} = \sum_n r_{nc} \lambda'_c (x_n - \mu'_c)^2
Solving,
\sigma'^2_c = \frac{1}{\lambda'^2_c} = \frac{\sum_n r_{nc}(x_n - \mu'_c)^2}{r_c}
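Putting the E-step and the M-step updates derived above together, a sketch of the full EM loop for a 1-D GMM (π'_c = r_c/N, µ'_c = Σ_n r_nc x_n / r_c, σ'²_c = Σ_n r_nc (x_n − µ'_c)² / r_c); the initialization and iteration count are illustrative choices.

```python
import numpy as np

def em_gmm_1d(x, C, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(C, 1.0 / C)
    mu = rng.choice(x, size=C, replace=False)       # initialize means at data points
    var = np.full(C, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities r_nc = p(z_n = c | x_n, theta)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the Q-term
        rc = r.sum(axis=0)                           # r_c = sum_n r_nc
        pi = rc / N
        mu = (r * x[:, None]).sum(axis=0) / rc
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / rc
    return pi, mu, var

x = np.concatenate([np.random.default_rng(1).normal(-2, 0.5, 200),
                    np.random.default_rng(2).normal(3, 1.0, 300)])
print(em_gmm_1d(x, C=2))
```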
THEORETICAL BASIS FOR EM
★ Section 11.4.7 is starred, but read it.
★ Three prerequisites
• Jensen’s inequality – do not read
• Entropy – 2.8.1, read
• Kullback-Leibler (KL) divergence – 2.8.2, read
Or make sure to understand these slides or read the EM theory text
Entropy and KL are interesting and included, but not necessary for the slides
JENSEN’S INEQUALITY
Let s \in [0,1] and c = a + s(b - a) = (1 - s)a + sb
Let c' = (1 - s)\log a + s \log b
Then c'' = \log c = \log[(1 - s)a + sb] \ge (1 - s)\log a + s \log b = c'
a, b, c can be values that f takes on
In general,
\log \sum_x p(x) f(x) \ge \sum_x p(x) \log f(x), \quad \text{i.e., } \log E[f(x)] \ge E[\log f(x)]
EXPECTED COMPLETE LOG-LIKELIHOOD – Q
\log p(x \mid \theta')
= \log \sum_z p(x, z \mid \theta')
= \log \sum_z p(z \mid x, \theta) \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)}
= \log E_z\left[\frac{p(x, z \mid \theta')}{p(z \mid x, \theta)} \,\Big|\, x, \theta\right]
\ge_{\text{Jensen}} E_z\left[\log \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)} \,\Big|\, x, \theta\right]
= \sum_z p(z \mid x, \theta) \log \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)}
= \sum_z p(z \mid x, \theta) \log p(x, z \mid \theta') - \sum_z p(z \mid x, \theta) \log p(z \mid x, \theta)
= Q(\theta'; \theta) - R(\theta; \theta)
Expected complete: notation
\log p(x_n \mid \theta')
= \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \log E_{z_n}\left[\frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)} \,\Big|\, x_n, \theta\right]
\ge_{\text{Jensen}} E_{z_n}\left[\log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)} \,\Big|\, x_n, \theta\right]
= \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta'; \theta) - R_n(\theta; \theta)
Expected complete: fewer steps
\log p(x_n \mid \theta')
= \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
\ge \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta'; \theta) - R_n(\theta; \theta)
Expected complete: of all data
\log p(D \mid \theta')
= \sum_n \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \sum_n \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
\ge \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= \underbrace{\sum_n Q_n(\theta'; \theta)}_{Q(\theta'; \theta)} - \underbrace{\sum_n R_n(\theta; \theta)}_{R(\theta; \theta)}
Expected complete: for θ
\log p(x_n \mid \theta)
= [\log p(x_n \mid \theta)] \sum_{z_n} p(z_n \mid x_n, \theta)
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n \mid \theta)
= \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(z_n, x_n \mid \theta)}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n, x_n \mid \theta) - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta, \theta) - R_n(\theta, \theta)
RELATIONS BETWEEN LOG-LIKELIHOODS AND Q-TERMS
★ Theorem: for \theta' = \arg\max_{\theta''} Q(\theta''; \theta),
\log p(D \mid \theta') \ge Q(\theta'; \theta) - R(\theta; \theta) \ge Q(\theta; \theta) - R(\theta; \theta) = \log p(D \mid \theta)
★ So by maximizing the Q-term (through ESS) we monotonically increase the likelihood.
★ The Q-term may not increase in every step!
PRACTICAL ISSUES
★ Starting points
★ Number of starting points
★ Sieving starting points
★ The competition
• The first iterations of EM show huge improvement in the likelihood. These are then followed by many iterations that slowly increase the likelihood. Conjugate gradient shows the opposite behaviour.
THE END
MLE FOR MVN
\pi'_c = r_c/N
\mu'_c = \frac{\sum_n r_{nc} x_n}{r_c}
\Sigma'_c = \frac{\sum_n r_{nc} x_n x_n^T}{r_c} - \mu'_c \mu'^T_c
Z hidden, Z \sim \mathrm{Cat}(\pi)
X \mid Z = c \sim N(\mu_c, \Sigma_c)
ENTROPY
★ Definition: H(q) = -\sum_x q(x) \log q(x)
★ q a fair coin: H(q) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = -1 \cdot (-1) = 1
★ q uniform on [K]: H(q) = -\sum_{k=1}^{K} \frac{1}{K}\log\frac{1}{K} = \log K
★ q on [K] with q(1) = 1: H(q) = -1 \log 1 - \sum_{k=2}^{K} 0 \log 0 = 0
[Figure: the binary entropy H(X) as a function of p(X = 1), maximal at p(X = 1) = 0.5]
ENTROPY VS EXPECTATION
★ Entropy depends on the distribution only
value:       1    2    3     μ = 1/2 + 1/2 + 3/4 = 7/4
value:       1    9    27    μ = 1/2 + 9/4 + 27/4 = 38/4
probability: 1/2  1/4  1/4   H = -(log 1/2)/2 - 2(log 1/4)/4 = 3/2
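A tiny sketch reproducing the numbers above: the base-2 entropy depends only on the probabilities (3/2 for both value rows), while the means differ (7/4 vs. 38/4).

```python
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                       # 0 * log 0 is taken as 0
    return -(q * np.log2(q)).sum()

p = np.array([0.5, 0.25, 0.25])
print(entropy(p))                          # 1.5
print(p @ np.array([1, 2, 3]))             # 1.75 = 7/4
print(p @ np.array([1, 9, 27]))            # 9.5  = 38/4
print(entropy([0.5, 0.5]), entropy([1.0])) # 1.0 (fair coin), 0.0 (point mass)
```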
KL-DIVERGENCE
★ Definition: KL(q\|p) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}
★ Alternative: KL(q\|p) = \sum_x p(x) \log \frac{p(x)}{q(x)}
★ Theorem (you do not have to read the proof): KL(q\|p) \ge 0 with equality iff p = q
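A sketch of the KL divergence in the slide's notation, with a quick numeric check that it is nonnegative and zero when the two distributions coincide; it assumes q_k > 0 wherever p_k > 0.

```python
import numpy as np

def kl(q, p):
    """KL(q||p) in the slide's notation: sum_k p_k log(p_k / q_k)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = p > 0                        # terms with p_k = 0 contribute 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(kl(q, p))   # > 0
print(kl(p, p))   # 0.0
```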
HMMS (LAYERED OR NOT)
[Figure: (a) a multiple alignment of short sequences (bat, rat, cat, gnat, goat); (b) the profile-HMM architecture with Begin/End and match (M), insert (I), and delete (D) states at each model position; (c) the observed emission and transition counts derived from the alignment]
GIVEN DATA AND NON-OPTIMAL PARAMETERS
[Figure: a two-state HMM over a Fair die and a Biased/loaded die, with transition probabilities p, 1-p, q, 1-q]
★ We observe the sequence of dice outcomes of visited vertices