MACHINE LEARNING 2: EM ALGORITHM, LECTURE 2
Royal Institute of Technology
EXTENDED STUDENT EXAMPLE
[Figure: the extended student example Bayesian network; node value labels are B (better), H (higher), L (less)]
D-SEPARATION
★ A path is d-separated by O if it has
• a chain X → Y → Z where Y ∈ O
• a fork X ← Y → Z where Y ∈ O
• a v-structure X → Y ← Z where (Y ⋃ desc(Y)) ⋂ O = ∅
[Figure: the three cases, chain X → Y → Z, fork X ← Y → Z, and v-structure X → Y ← Z, with the blocking conditions Y ∈ O, Y ∈ O, and Y ∉ O with desc(Y) ∩ O = ∅]
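To make the three cases concrete, here is a minimal Python sketch (not part of the lecture) that tests whether a single path is d-separated by an observed set O. The dict-of-parent-sets representation and the function names are assumptions made for illustration.

```python
# A minimal sketch of the d-separation test for one path, assuming the DAG is
# given as a dict mapping each node to its set of parents.

def descendants(dag, y):
    """All descendants of y: nodes reachable by following child edges."""
    children = {v: {w for w in dag if v in dag[w]} for v in dag}
    stack, seen = [y], set()
    while stack:
        v = stack.pop()
        for w in children[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def path_d_separated(dag, path, O):
    """True if the undirected path (a node sequence) is blocked by the set O."""
    for X, Y, Z in zip(path, path[1:], path[2:]):
        x_to_y = X in dag[Y]            # edge X -> Y exists
        y_to_z = Y in dag[Z]            # edge Y -> Z exists
        chain = (x_to_y and y_to_z) or (not x_to_y and not y_to_z)  # X->Y->Z or X<-Y<-Z
        fork = (not x_to_y) and y_to_z                              # X<-Y->Z
        collider = x_to_y and (not y_to_z)                          # X->Y<-Z
        if (chain or fork) and Y in O:
            return True
        if collider and not (({Y} | descendants(dag, Y)) & O):
            return True
    return False

# Example: the chain A -> B -> C is blocked exactly when B is observed.
dag = {"A": set(), "B": {"A"}, "C": {"B"}}
print(path_d_separated(dag, ["A", "B", "C"], O={"B"}))  # True
print(path_d_separated(dag, ["A", "B", "C"], O=set()))  # False
```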
D-SEPARATION SETS AND CI OF DAGS
★ A is d-separated from B given O if every undirected path between A and B is d-separated by O
★ In a DAG G, we write x_A \perp_G x_B \mid x_O when A is d-separated from B given O
EQUIVALENCE OF INDEPENDENCE DEFINITIONS
★ Global (G): d-separation
★ Local (L): x_v \perp x_{\mathrm{nd}(v) \setminus \mathrm{pa}(v)} \mid x_{\mathrm{pa}(v)}, where nd(v) are the non-descendants of v
★ Ordered (O): x_v \perp x_{\mathrm{pred}(v) \setminus \mathrm{pa}(v)} \mid x_{\mathrm{pa}(v)}, where pred is according to a topological order
★ Factorized (F): p(x) can be family-factorized, p(x) = \prod_v p(x_v \mid x_{\mathrm{pa}(v)})
★ Theorem: G ⇔ L ⇔ O ⇔ F
MARKOV BLANKET
★ A minimal set B s.t. X_t is independent of X_{V \setminus (B \cup \{t\})} given X_B is a Markov blanket
★ For t, pa(t) ⋃ c(t) ⋃ pa(c(t)) is a Markov blanket – i.e., parents, children, and co-parents (necessary due to v-structures)
[Figure: a seven-node example DAG (nodes 1-7) illustrating the Markov blanket of a node]
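A small sketch (not from the slides) of how pa(t) ∪ c(t) ∪ pa(c(t)) can be collected from a parent-list representation of the DAG; the dict format and names are illustrative assumptions.

```python
# Markov blanket of node t: parents, children, and co-parents (parents of t's
# children, excluding t itself), assuming a dict mapping each node to its parents.

def markov_blanket(dag, t):
    parents = dag[t]
    children = {v for v in dag if t in dag[v]}
    co_parents = {p for c in children for p in dag[c]} - {t}
    return parents | children | co_parents

# Example DAG: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5; the blanket of node 3 is {1, 2, 4, 5}.
dag = {1: set(), 2: set(), 3: {1, 2}, 4: set(), 5: {3, 4}}
print(markov_blanket(dag, 3))  # {1, 2, 4, 5}
```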
LATENT = HIDDEN
★ Can reduce #parameters
★ Can represent common causes
[Figure: the same model with and without a latent common-cause variable, comparing the number of parameters]
LEARNING PARAMETERS COMPLETE DATA
★ “...Bayesian view, the parameters are unknown variables and should also be inferred”
★ Learning from complete data
★ Likelihood
P(D \mid \theta) = \prod_{n=1}^{N} P(x_n \mid \theta) = \prod_{n=1}^{N} \prod_{v=1}^{V} P(x_{nv} \mid x_{n,\mathrm{pa}(v)}, \theta) = \prod_{v=1}^{V} \prod_{n=1}^{N} P(x_{nv} \mid x_{n,\mathrm{pa}(v)}, \theta) = \prod_{v=1}^{V} P(D_v \mid \theta_v)
where D_v is the values of v together with its parents and \theta_v is v's CPD
MLE FOR CAT CPDS
★ Each P(D_v \mid \theta_v), i.e., here each \theta_{vc}, can be maximized independently
★ So, the MLE is \hat{\theta}_{vck} = N_{vck}/N_{vc}
★ where
N_{vck} = \sum_{n=1}^{N} I(x_{nv} = k,\ x_{n,\mathrm{pa}(v)} = c), \qquad N_{vc} = \sum_{n=1}^{N} I(x_{n,\mathrm{pa}(v)} = c)
SUFFICIENT STATISTICS
★ N_{vck} and N_{vc} are the sufficient statistics of the categorical CPDs
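As an illustration of the counts above, a hedged Python sketch that accumulates N_vck and N_vc from complete data and returns the relative-frequency estimates; the data format (list of dicts) and the function name are assumptions made here.

```python
# MLE for categorical CPDs from complete data: theta_vck = N_vck / N_vc.

from collections import Counter, defaultdict

def mle_cat_cpds(data, parents):
    """data: list of dicts {variable: value}; parents: dict variable -> tuple of parents.
    Returns theta[v][(c, k)] = estimate of P(X_v = k | X_pa(v) = c)."""
    N_vck = defaultdict(Counter)   # N_vck[v][(c, k)]
    N_vc = defaultdict(Counter)    # N_vc[v][c]
    for x in data:
        for v, pa in parents.items():
            c = tuple(x[p] for p in pa)   # parent configuration
            k = x[v]                      # value of v
            N_vck[v][(c, k)] += 1
            N_vc[v][c] += 1
    return {v: {(c, k): n / N_vc[v][c] for (c, k), n in N_vck[v].items()}
            for v in parents}

# Toy example with two binary variables A -> B.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
theta = mle_cat_cpds(data, {"A": (), "B": ("A",)})
print(theta["B"][((1,), 1)])  # 1.0: B = 1 in both samples with A = 1
```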
EXTENDED STUDENT EXAMPLE
GIVEN DATA AND NON-OPTIMAL PARAMETERS
[Figure: a two-state HMM over a Fair die and a Biased/loaded die, with transition probabilities p, 1-p, q, 1-q]
★ We observe the sequence of dice outcomes of visited vertices
MIXTURE MODELS AND THE EXPECTATION MAXIMIZATION (EM) ALGORITHM
Used for MLE of
★ mixture models
★ parameters of DGMs, HMMs
★ DAG in DGMs (the structural EM by Nir Friedman)
[Figure: a DGM with a hidden variable Z as parent of X_1, …, X_V]
EM iteratively improves parameters
E-step: compute expected sufficient statistics (ESS) w.r.t. p(z|x, θ), the posterior over the hidden variables given the data and the current parameters
M-step: find optimal θ' using the ESS
Chapter 3
EXPECTATION-MAXIMIZATION
THEORY
3.1 Introduction
Learning networks are commonly categorized in terms of supervised and unsupervised networks. In unsupervised learning, the training set consists of input training patterns only. In contrast, in supervised learning networks, the training data consist of many pairs of input/output patterns. Therefore, the learning process can benefit greatly from the teacher's assistance. In fact, the amount of adjustment of the updating coefficients often depends on the difference between the desired teacher value and the actual response. As demonstrated in Chapter 5, many supervised learning models have been found to be promising for biometric authentication; their implementation often hinges on an effective data-clustering scheme, which is perhaps the most critical component in unsupervised learning methods. This chapter addresses a data-clustering algorithm, called the expectation-maximization (EM) algorithm, when complete or partial information of observed data is made available.
3.1.1 K-Means and VQ algorithms
An effective data-clustering algorithm is known as K-means [85], which is very similar to another clustering scheme known as the vector quantization (VQ) algorithm [118]. Both methods classify data patterns based on the nearest-neighbor criterion.
Verbally, the problem is to cluster a given data set X = {x_t; t = 1, …, T} into K groups, each represented by its centroid denoted by µ(j), j = 1, …, K. The task is (1) to determine the K centroids {µ(1), µ(2), …, µ(K)} and (2) to assign each pattern x_t to one of the centroids. The nearest-neighbor rule assigns a pattern x to the class associated with its nearest centroid, say µ(i).
Mathematically speaking, one denotes the centroid associated with x_t as µ_t, where µ_t ∈ {µ(1), µ(2), …, µ(K)}. Then the objective of the K-means algorithm is to minimize the sum of squared distances between each pattern and its assigned centroid, \sum_t \|x_t - \mu_t\|^2.
GAUSSIAN – MVN
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)
N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)
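A minimal numpy sketch evaluating the MVN density exactly as written above; in practice scipy.stats.multivariate_normal computes the same quantity.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) = (2*pi)^(-D/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(x, mu, Sigma))
```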
TWO DIMENSIONAL NORMAL
[Figure: two scatter plots of weight (y-axis, 80-280) against height (x-axis, 55-80); red = female, blue = male]
GAUSSIAN MIXTURE MODELS (GMM)
Z hidden, Z \sim \mathrm{Cat}(\pi)
D = (x_1, \ldots, x_N), \quad x_n = (x_{n1}, \ldots, x_{nD})
X \mid Z = c \sim N(\mu_c, \Sigma_c), \quad \text{i.e. } p(X \mid Z = c) = p_c(X) = N(X \mid \mu_c, \Sigma_c)
1-DIM GAUSSIAN MIXTURE MODELS
Z hidden, Z \sim \mathrm{Cat}(\pi)
D = (x_1, \ldots, x_N)
X \mid Z = c \sim N(\mu_c, \sigma^2_c), \quad \text{i.e. } p(X \mid Z = c) = p_c(X) = N(X \mid \mu_c, \sigma^2_c)
EXAMPLE
★ z_n is red with probability 1/2, green with probability 3/10, blue with probability 1/5
★ x_n is generated from the Gaussian indicated by z_n
★ We get x_1, …, x_N
[Figure: three Gaussian densities (red, green, blue); a sample x_n drawn from the blue component, i.e., z_n = blue]
GMM
★ p(z_n = c) = \pi_c and p(x_n \mid z_n = c) = N(x_n \mid \mu_c, \sigma^2_c)
★ So, p(x_n, z_n) = p(z_n)\, p(x_n \mid z_n)
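To make the generative story p(x_n, z_n) = p(z_n) p(x_n|z_n) concrete, a sketch that samples from a 1-D mixture using the weights from the example slide (1/2, 3/10, 1/5); the component means and standard deviations below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # p(z_n = c), from the example slide
mu = np.array([-2.0, 0.0, 3.0])     # assumed component means
sigma = np.array([0.5, 1.0, 0.7])   # assumed component standard deviations

def sample_gmm(N):
    z = rng.choice(len(pi), size=N, p=pi)   # z_n ~ Cat(pi)
    x = rng.normal(mu[z], sigma[z])         # x_n ~ N(mu_{z_n}, sigma_{z_n}^2)
    return z, x

z, x = sample_gmm(5)
print(z, x)
```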
COMPLETE LOG LIKELIHOOD (KNOWING ALL Z)
N_c = \sum_n I(z_n = c)
\theta' denotes all parameters (mixture weights \pi'_c and component parameters \theta'_c)
MLE FOR GAUSSIAN
• Maximizing
\ell(\theta'; D) = \sum_c N_c \log \pi'_c + \sum_c \sum_{n: I(z_n=c)} \log p(x_n \mid \theta'_c)
• Boils down to maximizing each
\sum_{n: I(z_n=c)} \log p(x_n \mid \theta'_c)
that is
\sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu_c)^2\right)
MLE FOR GAUSSIAN
Derivation:
f(\sigma'_c, \mu'_c) = \sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2\right)
= \sum_{n: I(z_n=c)} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_{n: I(z_n=c)} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \mu'_c} = \sum_{n: I(z_n=c)} \frac{2}{2\sigma'^2_c}(x_n - \mu'_c)
So,
\frac{\partial f}{\partial \mu'_c} = 0 \iff \sum_{n: I(z_n=c)} x_n = \sum_{n: I(z_n=c)} \mu'_c = N_c \mu'_c
Solving,
\mu'_c = \frac{\sum_{n: I(z_n=c)} x_n}{N_c}, \quad \text{where } N_c = \sum_n I(z_n = c)
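A short sketch of the complete-data MLE just derived: with z_n known, each component's parameters are simply the sample statistics of the points assigned to it (illustrative code, not from the slides).

```python
import numpy as np

def gaussian_mle_complete(x, z, C):
    """x: (N,) data, z: (N,) integer labels in {0,...,C-1}. Returns pi, mu, var per component."""
    N = len(x)
    pi, mu, var = np.zeros(C), np.zeros(C), np.zeros(C)
    for c in range(C):
        xc = x[z == c]
        pi[c] = len(xc) / N                  # N_c / N
        mu[c] = xc.mean()                    # sum_{n: z_n = c} x_n / N_c
        var[c] = ((xc - mu[c]) ** 2).mean()  # MLE variance (divides by N_c)
    return pi, mu, var

x = np.array([1.0, 1.2, 0.8, 5.0, 5.5])
z = np.array([0, 0, 0, 1, 1])
print(gaussian_mle_complete(x, z, C=2))
```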
K-MEANS
★ Data vectors D = {x_1, …, x_N}
★ Randomly selected classes z_1, …, z_N
★ Iteratively do
\mu_c = \frac{1}{N_c} \sum_{n: z_n = c} x_n, \quad \text{where } N_c = |\{n : z_n = c\}|
z_n = \arg\min_c \|x_n - \mu_c\|^2
★ One step is O(NKD), can be improved
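A compact sketch of the K-means loop above, alternating the mean update and the nearest-mean assignment; the initialization and stopping rule are illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (N, D) data. Returns means (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(X))                    # random initial classes
    for _ in range(n_iter):
        # mu_c = mean of the points assigned to c (random data point if c is empty)
        mu = np.array([X[z == c].mean(axis=0) if np.any(z == c)
                       else X[rng.integers(len(X))] for c in range(K)])
        # z_n = argmin_c ||x_n - mu_c||^2; one step is O(NKD)
        new_z = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_z, z):
            break
        z = new_z
    return mu, z

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
mu, z = kmeans(X, K=2)
print(mu)
```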
ASSIGN POINT TO MEANS
[Figure: 2-D data points assigned to their nearest cluster means]
K-MEANS AS GMM
★ Fixed variance, only means must be estimated
★ Idea: each point can belong to several means (clusters)
★ Use responsibilities to find means
r_{nc} = p(z_n = c \mid x_n, \theta) = \frac{p(z_n = c \mid \theta)\, p(x_n \mid z_n = c, \theta)}{\sum_{c'=1}^{C} p(z_n = c' \mid \theta)\, p(x_n \mid z_n = c', \theta)}
\mu_c = \frac{1}{N_c} \sum_n r_{nc} x_n, \quad \text{where } N_c = \sum_n r_{nc}
\theta_c = (\mu_c, \sigma^2)
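A sketch of the responsibility computation r_nc = p(z_n = c | x_n, θ) for a 1-D GMM, following the formula on this slide; the helper names are made up.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(x, pi, mu, var):
    """x: (N,), pi/mu/var: (C,). Returns r with r[n, c] = p(z_n = c | x_n, theta)."""
    joint = pi[None, :] * normal_pdf(x[:, None], mu[None, :], var[None, :])  # (N, C)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over components

x = np.array([-2.0, 0.1, 3.2])
r = responsibilities(x, pi=np.array([0.5, 0.5]), mu=np.array([-2.0, 3.0]), var=np.array([1.0, 1.0]))
print(r)  # each row sums to 1
```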
EM & EXPECTED LOG LIKELIHOOD (Q-TERM)
• Iteratively maximizing the expected log likelihood in practice always leads to a local maximum
• The expectation is over latent variables given data and current parameters
• We maximize the expression by choosing new parameters.
EM FOR GMM
★ Problem with the previous, most similar approach (MLE by differentiation):
★ How to handle the sum? The log cannot be pushed inside.
\ell(\theta') = \sum_{n=1}^{N} \log p(x_n \mid \theta') = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n \mid \theta')
★ Idea: use z_n from p(z_n \mid x_n, \theta), where \theta are the current parameters
LOG LIKELIHOOD
L(\theta') = \prod_{n=1}^{N} p(x_n \mid \theta') = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n \mid \theta')
where for a one-dim Gaussian \theta_c = (\mu_c, \sigma^2_c)
EXPECTED LOG LIKELIHOOD (Q-TERM)
E_{p(z_n|x_n,\theta)}[\ell(\theta'; D)] = E_{p(z_n|x_n,\theta)}\left[\log \prod_n p(x_n, z_n \mid \theta')\right]
= \sum_n E\left[\log \prod_c \left(\pi'_c\, p(x_n \mid z_n = c, \theta')\right)^{I(z_n = c)}\right]
= \sum_n \sum_c E\left[I(z_n = c) \log\left(\pi'_c\, p(x_n \mid \theta'_c)\right)\right]
= \sum_n \sum_c p(z_n = c \mid x_n, \theta) \log\left(\pi'_c\, p(x_n \mid \theta'_c)\right)
= \sum_n \sum_c r_{nc} \log \pi'_c + \sum_n \sum_c r_{nc} \log p(x_n \mid \theta'_c)
HOW TO FIND θ'?
★ We want \arg\max_{\theta'} E_{p(z_n|x_n,\theta)}[\ell(\theta'; D)]
★ The two sums \sum_c \left(\sum_n r_{nc}\right) \log \pi'_c and \sum_c \sum_n r_{nc} \log p(x_n \mid \theta'_c) are independent
★ So, \pi'_c = \sum_n r_{nc}/N = r_c/N
★ In the second, different c indices are independent
★ So, we want to maximize each \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu_c)^2\right)
WEIGHTED GAUSS - MEAN
Derivation:
f(\sigma'_c, \mu'_c) = \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} \exp\left(-\frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2\right)
= \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_n r_{nc} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \mu'_c} = \sum_n r_{nc} \frac{2}{2\sigma'^2_c}(x_n - \mu'_c)
So,
\frac{\partial f}{\partial \mu'_c} = 0 \iff \sum_n r_{nc} x_n = \left(\sum_n r_{nc}\right) \mu'_c = r_c \mu'_c
Solving,
\mu'_c = \frac{\sum_n r_{nc} x_n}{r_c}
VARIANCE
Let \lambda'_c = 1/\sigma'_c.
Derivation:
f(\lambda'_c, \mu'_c) = \sum_n r_{nc} \log \frac{1}{\sqrt{2\pi\sigma'^2_c}} - \sum_n r_{nc} \frac{1}{2\sigma'^2_c}(x_n - \mu'_c)^2
= \sum_n r_{nc} \log \frac{\lambda'_c}{\sqrt{2\pi}} - \sum_n r_{nc} \frac{\lambda'^2_c}{2}(x_n - \mu'_c)^2
\frac{\partial f}{\partial \lambda'_c} = \sum_n \frac{r_{nc}}{\lambda'_c} - \sum_n r_{nc} \lambda'_c (x_n - \mu'_c)^2
So,
\frac{\partial f}{\partial \lambda'_c} = 0 \iff \frac{r_c}{\lambda'_c} = \sum_n r_{nc} \lambda'_c (x_n - \mu'_c)^2
Solving,
\sigma'^2_c = \frac{1}{\lambda'^2_c} = \frac{\sum_n r_{nc}(x_n - \mu'_c)^2}{r_c}
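Putting the E-step and the M-step updates derived above together, a sketch of the full EM loop for a 1-D GMM (π'_c = r_c/N, µ'_c = Σ_n r_nc x_n / r_c, σ'²_c = Σ_n r_nc (x_n − µ'_c)² / r_c); the initialization and iteration count are illustrative choices.

```python
import numpy as np

def em_gmm_1d(x, C, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(C, 1.0 / C)
    mu = rng.choice(x, size=C, replace=False)       # initialize means at data points
    var = np.full(C, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities r_nc = p(z_n = c | x_n, theta)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the Q-term
        rc = r.sum(axis=0)                           # r_c = sum_n r_nc
        pi = rc / N
        mu = (r * x[:, None]).sum(axis=0) / rc
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / rc
    return pi, mu, var

x = np.concatenate([np.random.default_rng(1).normal(-2, 0.5, 200),
                    np.random.default_rng(2).normal(3, 1.0, 300)])
print(em_gmm_1d(x, C=2))
```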
THEORETICAL BASIS FOR EM
★ Section 11.4.7 is starred, but read it.
★ Three prerequisites
• Jensen’s inequality – do not read
• Entropy – 2.8.1, read
• Kullback-Leibler (KL) divergence – 2.8.2, read
Or make sure to understand these slides or read the EM theory text
Entropy and KL are interesting and included, but not necessary for the slides
JENSEN’S INEQUALITY
Let s \in [0,1] and c = a + s(b - a) = (1 - s)a + sb
Let c' = (1 - s)\log a + s \log b
Then c'' = \log c = \log[(1 - s)a + sb] \ge (1 - s)\log a + s \log b = c'
a, b, c can be values that f takes on
In general,
\log \sum_x p(x) f(x) \ge \sum_x p(x) \log f(x), \quad \text{i.e., } \log E[f(x)] \ge E[\log f(x)]
EXPECTED COMPLETE LOG-LIKELIHOOD – Q
\log p(x \mid \theta')
= \log \sum_z p(x, z \mid \theta')
= \log \sum_z p(z \mid x, \theta) \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)}
= \log E_z\left[\frac{p(x, z \mid \theta')}{p(z \mid x, \theta)} \,\Big|\, x, \theta\right]
\ge_{\text{Jensen}} E_z\left[\log \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)} \,\Big|\, x, \theta\right]
= \sum_z p(z \mid x, \theta) \log \frac{p(x, z \mid \theta')}{p(z \mid x, \theta)}
= \sum_z p(z \mid x, \theta) \log p(x, z \mid \theta') - \sum_z p(z \mid x, \theta) \log p(z \mid x, \theta)
= Q(\theta'; \theta) - R(\theta; \theta)
Expected complete: notation
\log p(x_n \mid \theta')
= \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \log E_{z_n}\left[\frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)} \,\Big|\, x_n, \theta\right]
\ge_{\text{Jensen}} E_{z_n}\left[\log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)} \,\Big|\, x_n, \theta\right]
= \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta'; \theta) - R_n(\theta; \theta)
Expected complete: fewer steps
\log p(x_n \mid \theta')
= \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
\ge \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta'; \theta) - R_n(\theta; \theta)
Expected complete: of all data
\log p(D \mid \theta')
= \sum_n \log \sum_{z_n} p(x_n, z_n \mid \theta')
= \sum_n \log \sum_{z_n} p(z_n \mid x_n, \theta) \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
\ge \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(x_n, z_n \mid \theta')}{p(z_n \mid x_n, \theta)}
= \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n, z_n \mid \theta') - \sum_n \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= \underbrace{\sum_n Q_n(\theta'; \theta)}_{Q(\theta'; \theta)} - \underbrace{\sum_n R_n(\theta; \theta)}_{R(\theta; \theta)}
Expected complete: for θ
\log p(x_n \mid \theta)
= [\log p(x_n \mid \theta)] \sum_{z_n} p(z_n \mid x_n, \theta)
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(x_n \mid \theta)
= \sum_{z_n} p(z_n \mid x_n, \theta) \log \frac{p(z_n, x_n \mid \theta)}{p(z_n \mid x_n, \theta)}
= \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n, x_n \mid \theta) - \sum_{z_n} p(z_n \mid x_n, \theta) \log p(z_n \mid x_n, \theta)
= Q_n(\theta, \theta) - R_n(\theta, \theta)
RELATIONS BETWEEN LOG-LIKELIHOODS AND Q-TERMS
★ Theorem: for \theta' = \arg\max_{\theta''} Q(\theta''; \theta),
\log p(D \mid \theta') \ge Q(\theta'; \theta) - R(\theta; \theta) \ge Q(\theta; \theta) - R(\theta; \theta) = \log p(D \mid \theta)
★ So by maximizing the Q-term (through ESS) we monotonically increase the likelihood.
★ The Q-term may not increase in every step!
PRACTICAL ISSUES
★ Starting points
★ Number of starting points
★ Sieving starting points
★ The competition
• The first iterations of EM show huge improvement in the likelihood. These are then followed by many iterations that slowly increase the likelihood. Conjugate gradient shows the opposite behaviour.
THE END
MLE FOR MVN
\pi'_c = r_c/N
\mu'_c = \frac{\sum_n r_{nc} x_n}{r_c}
\Sigma'_c = \frac{\sum_n r_{nc} x_n x_n^T}{r_c} - \mu'_c \mu'^T_c
Z hidden, Z \sim \mathrm{Cat}(\pi)
X \mid Z = c \sim N(\mu_c, \Sigma_c)
ENTROPY
★ Definition: H(q) = -\sum_x q(x) \log q(x)
★ q a fair coin: H(q) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = -1 \cdot (-1) = 1
★ q uniform on [K]: H(q) = -\sum_{k=1}^{K} \frac{1}{K}\log\frac{1}{K} = \log K
★ q on [K] with q(1) = 1: H(q) = -1 \log 1 - \sum_{k=2}^{K} 0 \log 0 = 0
[Figure: the binary entropy H(X) as a function of p(X = 1), maximal at p(X = 1) = 0.5]
ENTROPY VS EXPECTATION
★ Entropy depends on the distribution only
value:       1    2    3     μ = 1/2 + 1/2 + 3/4 = 7/4
value:       1    9    27    μ = 1/2 + 9/4 + 27/4 = 38/4
probability: 1/2  1/4  1/4   H = -(log 1/2)/2 - 2(log 1/4)/4 = 3/2
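A tiny sketch reproducing the numbers above: the base-2 entropy depends only on the probabilities (3/2 for both value rows), while the means differ (7/4 vs. 38/4).

```python
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                       # 0 * log 0 is taken as 0
    return -(q * np.log2(q)).sum()

p = np.array([0.5, 0.25, 0.25])
print(entropy(p))                          # 1.5
print(p @ np.array([1, 2, 3]))             # 1.75 = 7/4
print(p @ np.array([1, 9, 27]))            # 9.5  = 38/4
print(entropy([0.5, 0.5]), entropy([1.0])) # 1.0 (fair coin), 0.0 (point mass)
```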
KL-DIVERGENCE
★ Definition: KL(q\|p) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}
★ Alternative: KL(q\|p) = \sum_x p(x) \log \frac{p(x)}{q(x)}
★ Theorem (you do not have to read the proof): KL(q\|p) \ge 0 with equality iff p = q
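A sketch of the KL divergence in the slide's notation, with a quick numeric check that it is nonnegative and zero when the two distributions coincide; it assumes q_k > 0 wherever p_k > 0.

```python
import numpy as np

def kl(q, p):
    """KL(q||p) in the slide's notation: sum_k p_k log(p_k / q_k)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = p > 0                        # terms with p_k = 0 contribute 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(kl(q, p))   # > 0
print(kl(p, p))   # 0.0
```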
HMMS (LAYERED OR NOT)
[Figure: (a) a multiple alignment of short sequences (bat, rat, cat, gnat, goat); (b) the profile-HMM architecture with Begin/End and match (M), insert (I), and delete (D) states at each model position; (c) the observed emission and transition counts derived from the alignment]
GIVEN DATA AND NON-OPTIMAL PARAMETERS
[Figure: a two-state HMM over a Fair die and a Biased/loaded die, with transition probabilities p, 1-p, q, 1-q]
★ We observe the sequence of dice outcomes of visited vertices