EM algorithm and applications


Page 1:

EM algorithm and applications

Fall 04-5: I only got as far as the proof of EM, not including it (I progressed slowly and there were questions - it may be possible to cover more). Fall 05-6: reached page 15 (end of the 2-coins example) in the first hour, but I finished everything; many questions and slow progress, time was a bit tight. Several changes - mainly shortening (and improving?) the ABO example.
Page 2:

Relative Entropy

Let p,q be two probability distributions on the same sample space. The relative entropy between p and q is defined by

H(p||q) = ∑x p(x) log[p(x)/q(x)] = ∑x p(x) log(1/q(x)) − ∑x p(x) log(1/p(x)).

“The inefficiency of assuming distribution q when the correct distribution is p”.
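A quick numeric illustration (added here, not part of the original slides): a minimal Python sketch computing H(p||q); the two distributions below are made-up examples.

import math

def relative_entropy(p, q):
    """H(p||q) = sum_x p(x) * log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Made-up example distributions over a 3-element sample space
p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print(relative_entropy(p, q))  # > 0, since p != q
print(relative_entropy(p, p))  # 0, since H(p||p) = 0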

Page 3:

EM algorithm: approximating MLE from Incomplete Data

• Finding MLE parameters: a nonlinear optimization problem.

General idea of EM: use the "current point" θ to construct an alternative function Qθ (which is "nice"); if Qθ(λ) > Qθ(θ), then λ has higher likelihood than θ.

[Figure: the log-likelihood log P(x|λ) and the auxiliary function Eθ[log P(x,y|λ)], plotted as functions of λ.]

Page 4:

The EM algorithm

Consider a model where, for observed data x and model parameters θ:

p(x|θ) = ∑y p(x,y|θ)    (y are the "hidden data").

The EM algorithm receives x and parameters θ, and returns new parameters λ*, s.t. p(x|λ*) > p(x|θ).

Page 5:

The EM algorithm

Finding λ* which maximizes

p(x|λ*) = ∑y p(x,y|λ*)

is equivalent to finding λ* which maximizes the logarithm

log p(x|λ*) = log(∑y p(x,y|λ*)),

which is what the EM algorithm attempts to do.

Page 6:

The EM algorithm

In each iteration the EM algorithm does the following.

• (E step): Calculate Qθ(λ) = ∑y p(y|x,θ) log p(x,y|λ)

• (M step): Find λ* which maximizes Qθ(λ)

(The next iteration sets θ ← λ* and repeats.)
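A minimal generic sketch of this loop (an addition for illustration; the function names, the representation of the parameters as a tuple of numbers, and the stopping rule are my own assumptions, not from the slides):

def em(x, theta0, e_step, m_step, n_iter=100, tol=1e-8):
    """Generic EM loop.

    e_step(x, theta) -> a representation of Q_theta (e.g. expected hidden-data counts)
    m_step(q)        -> the parameters lambda* maximizing Q_theta
    """
    theta = theta0
    for _ in range(n_iter):
        q = e_step(x, theta)        # E step: expectations under p(y | x, theta)
        new_theta = m_step(q)       # M step: maximize Q_theta(lambda)
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            break
        theta = new_theta
    return theta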

Page 7:

Example: Baum-Welch = EM for HMM

The Baum-Welch algorithm is the EM algorithm for HMMs:

• E step for HMM: Qθ(λ) = ∑s p(s|x,θ) log p(s,x|λ), where λ are the new parameters {akl, ek(b)}.

Writing the complete-data log-likelihood in terms of counts:

log(p(s,x|λ)) = log( ∏k,l (akl)^(A^s_kl) · ∏k,b (ek(b))^(E^s_k(b)) )
             = ∑k,l A^s_kl log(akl) + ∑k,b E^s_k(b) log(ek(b))

(The A^s_kl's and E^s_k(b)'s are the counts of state transitions and symbol emissions in (s,x).)

(This and the next slide were added in fall 05-6.)
Page 8:

Baum-Welch = EM for HMM

• M step for HMM: Find λ* which maximizes Qθ(λ). As we proved, λ* is given by the relative frequencies of the Akl's and the Ek(b)'s.

Thus,

Qθ(λ) = ∑s p(s|x,θ) log(p(s,x|λ))
      = ∑s p(s|x,θ) [ ∑k,l A^s_kl log(akl) + ∑k,b E^s_k(b) log(ek(b)) ]
      = ∑k,l Akl log(akl) + ∑k,b Ek(b) log(ek(b)),

where Akl = ∑s p(s|x,θ) A^s_kl and Ek(b) = ∑s p(s|x,θ) E^s_k(b) are the expected transition and emission counts.
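Not in the original slides: a minimal numpy sketch of one such iteration - forward-backward to obtain the expected counts Akl and Ek(b), then the relative-frequency M step. It omits numerical scaling and the re-estimation of the initial distribution, and the function name and array layout are my own choices.

import numpy as np

def baum_welch_iteration(x, a, e, pi):
    """One EM (Baum-Welch) iteration for a discrete HMM (sketch, no scaling).

    x  : observation sequence as integer symbol indices
    a  : K x K transition matrix, a[k, l]
    e  : K x M emission matrix, e[k, b]
    pi : initial state distribution (not re-estimated here)
    Returns (new_a, new_e) obtained from the expected counts A_kl and E_k(b).
    """
    K, M, T = a.shape[0], e.shape[1], len(x)

    # Forward: f[t, k] = p(x_1..x_t, s_t = k)
    f = np.zeros((T, K))
    f[0] = pi * e[:, x[0]]
    for t in range(1, T):
        f[t] = (f[t - 1] @ a) * e[:, x[t]]

    # Backward: bwd[t, k] = p(x_{t+1}..x_T | s_t = k)
    bwd = np.zeros((T, K))
    bwd[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = a @ (e[:, x[t + 1]] * bwd[t + 1])

    px = f[T - 1].sum()                      # p(x | current parameters)

    # E step: expected transition counts A[k, l] and emission counts E[k, b]
    A = np.zeros((K, K))
    for t in range(T - 1):
        A += np.outer(f[t], e[:, x[t + 1]] * bwd[t + 1]) * a / px
    gamma = f * bwd / px                     # gamma[t, k] = p(s_t = k | x)
    E = np.zeros((K, M))
    for t in range(T):
        E[:, x[t]] += gamma[t]

    # M step: relative frequencies of the expected counts
    new_a = A / A.sum(axis=1, keepdims=True)
    new_e = E / E.sum(axis=1, keepdims=True)
    return new_a, new_e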

Page 9:

The simplest example: EM for 2 coin tosses

• Given a coin with two possible outcomes, H (head) and T (tail), with probabilities qH and qT = 1 − qH.

• The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*).

• We wish to apply the EM algorithm to get parameters that increase the likelihood of the data.

• Let the initial parameters be θ = (qH, qT) = (¼, ¾).

Page 10:

EM for 2 coin tosses

The hidden data which can produce x are the sequences

y1= (T,H); y2=(T,T);

Hence the likelihood of x with parameters (qH, qT) is

p(x|θ) = P(x,y1|θ) + P(x,y2|θ) = qHqT + qT²

For the initial parameters θ = (¼, ¾), we have:

p(x|θ) = ¾ · ¼ + ¾ · ¾ = ¾

Note that in this case P(x,yi|θ) = P(yi|θ), for i = 1,2. We can always define y so that (x,y) = y (otherwise we set y′ ≡ (x,y) and replace the "y"s by "y′"s).

Page 11:

EM for 2 coin tosses - E step

Calculate Qθ(λ) = Qθ(qH,qT).

Qθ(λ) = p(y1|x,θ) log p(x,y1|λ) + p(y2|x,θ) log p(x,y2|λ)

p(y1|x,θ) = p(y1,x|θ)/p(x|θ) = (¾ · ¼)/(¾) = ¼

p(y2|x,θ) = p(y2,x|θ)/p(x|θ) = (¾ · ¾)/(¾) = ¾

Thus we have

Qθ(λ) = ¼ log p(x,y1|λ) + ¾ log p(x,y2|λ)

Page 12:

EM for 2 coin tosses - E step

For a sequence y of coin tosses, let NH(y) be the number of H's in y, and NT(y) be the number of T's in y. Then

log p(y|λ) = NH(y) log qH + NT(y) log qT

In our example:

y1= (T,H); y2=(T,T), hence:

NH(y1) = NT(y1)=1, NH(y2) =0, NT(y2)=2

Page 13:

Example: 2 coin tosses - E step

Thus

¼ log p(x,y1|λ) = ¼ (NH(y1) log qH + NT(y1) log qT) = ¼ (log qH + log qT)

¾ log p(x,y2|λ) = ¾ (NH(y2) log qH + NT(y2) log qT) = ¾ (2 log qT)

Substituting in the equation for Qθ(λ):

Qθ(λ) = ¼ log p(x,y1|λ) + ¾ log p(x,y2|λ)
      = (¼ NH(y1) + ¾ NH(y2)) log qH + (¼ NT(y1) + ¾ NT(y2)) log qT

Qθ(λ) = NH log qH + NT log qT,  where NH = ¼ and NT = 7/4.

Page 14:

EM for 2 coin tosses - M step

Find λ* which maximizes Qθ(λ):

Qθ(λ) = NH log qH + NT log qT = ¼ log qH + 7/4 log qT

We saw earlier that this is maximized when:

qH* = NH/(NH + NT) ;  qT* = NT/(NH + NT)

That is, qH* = (¼)/(¼ + 7/4) = 1/8 and qT* = (7/4)/(¼ + 7/4) = 7/8, and

p(x|λ*) = qH*qT* + (qT*)² = 7/8 > ¾ = p(x|θ).

The optimal parameters (0,1) will never be reached by the EM algorithm!
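Not part of the slides: a small Python sketch that runs this update repeatedly, reproducing qH* = 1/8 after one iteration and showing qH shrinking toward (but never reaching) 0.

def em_two_tosses(q_h, n_iter=10):
    """EM for x = (T, *): one T observed, the second toss hidden.

    Hidden completions: y1 = (T,H) with prob q_T*q_H, y2 = (T,T) with prob q_T**2.
    """
    for i in range(n_iter):
        q_t = 1.0 - q_h
        p_x = q_t * q_h + q_t ** 2          # p(x | theta)
        w1 = (q_t * q_h) / p_x              # p(y1 | x, theta)
        w2 = (q_t ** 2) / p_x               # p(y2 | x, theta)
        n_h = w1 * 1                        # expected number of H's
        n_t = w1 * 1 + w2 * 2               # expected number of T's
        q_h = n_h / (n_h + n_t)             # M step: relative frequency
        print(f"iteration {i+1}: q_H = {q_h:.6f}, q_T = {1-q_h:.6f}")
    return q_h

em_two_tosses(q_h=0.25)   # the first iteration gives q_H = 1/8 = 0.125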

Page 15:

EM for a single random variable (dice)

Now, the probability of each y (≡ (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities q1,..,qm. Let Nl(y) = #(outcome l occurs in y). Then

log p(y|λ) = ∑l=1..m Nl(y) log ql

Let Nl be the expected value of Nl(y), given x and θ:

Nl = E(Nl|x,θ) = ∑y p(y|x,θ) Nl(y)

Page 16:

Q(λ) for one dice

Qθ(λ) = ∑y p(y|x,θ) log p(y|λ)
      = ∑y p(y|x,θ) ∑l=1..m Nl(y) log ql
      = ∑l=1..m [∑y p(y|x,θ) Nl(y)] log ql
      = ∑l=1..m Nl log ql

which is maximized for

ql* = Nl / ∑l'=1..m Nl'
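A one-line justification, added here since the slides state the result without proof: maximizing ∑l Nl log ql subject to ∑l ql = 1 with a Lagrange multiplier gives the relative-frequency solution.

\text{Maximize } \sum_{l=1}^{m} N_l \log q_l \ \text{ subject to } \sum_{l=1}^{m} q_l = 1.
\text{Lagrangian: } \mathcal{L}(q,\mu) = \sum_{l} N_l \log q_l - \mu\Big(\sum_{l} q_l - 1\Big)
\frac{\partial \mathcal{L}}{\partial q_l} = \frac{N_l}{q_l} - \mu = 0 \;\Rightarrow\; q_l = \frac{N_l}{\mu}
\sum_l q_l = 1 \;\Rightarrow\; \mu = \sum_{l'} N_{l'} \;\Rightarrow\; q_l^* = N_l \Big/ \sum_{l'} N_{l'}.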

Page 17:

EM algorithm for n independent observations x1,…, xn :

Expectation step: It can be shown that, if the xj are independent, then:

Nl = ∑j=1..n ∑yj p(yj|xj,θ) Nl(xj,yj) = ∑j=1..n Nlj

where Nlj, the expected count of outcome l in the j-th observation, is

Nlj = p(xj|θ)⁻¹ ∑yj p(yj,xj|θ) Nl(xj,yj)

Page 18:

Example: The ABO locus

A locus is a particular place on the chromosome. Each locus' state (called genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportion in a population of the 6 genotypes.

Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then the MLE is given by the relative frequencies:

qa/a = Na/a / N ;  qa/o = Na/o / N ;  qb/b = Nb/b / N ;  qb/o = Nb/o / N ;  qa/b = Na/b / N ;  qo/o = No/o / N

Page 19:

The ABO locus

However, testing individuals for their genotype is very expensive. Can we estimate the proportions of genotypes using the common cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?

The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?

Page 20:

The ABO locus

The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles qa, qb, qo in the population determine the frequencies of the genotypes as follows: qa/b = 2qaqb, qa/o = 2qaqo, qb/o = 2qbqo, qa/a = [qa]², qb/b = [qb]², qo/o = [qo]².

In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y.
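A tiny numeric check (added; the allele frequencies below are made up): the genotype frequencies under Hardy-Weinberg sum to 1, since (qa + qb + qo)² expands to exactly these six terms.

qa, qb, qo = 0.3, 0.1, 0.6   # made-up allele frequencies, must sum to 1

genotype_freqs = {
    "a/a": qa**2, "b/b": qb**2, "o/o": qo**2,
    "a/b": 2*qa*qb, "a/o": 2*qa*qo, "b/o": 2*qb*qo,
}
print(genotype_freqs)
print(sum(genotype_freqs.values()))   # 1.0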

Page 21:

The ABO locus

The dice's outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, which form an "ordered genotype pair" – this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a).

So we have three parameters of one dice – qa, qb, qo – that we need to estimate.

Page 22:

EM setting for the ABO locus problem

The observed data x = (x1,..,xn) is a sequence of letters (blood types) from the alphabet {A,B,AB,O}. E.g.: (B,A,B,B,O,A,B,A,O,B,AB) are observations (x1,…,x11).

The hidden data (i.e. the y's) for each letter xj is the set of ordered pairs of alleles that generates it. For instance, for A it is the set {aa, ao, oa}.

The parameters λ = {qa, qb, qo} are the probabilities of the alleles.

We need to find the parameters λ = {qa, qb, qo} that maximize the likelihood of the given data. We do this by the EM algorithm:

Page 23:

EM for ABO locus problem

For each observed blood type xj ∈ {A,B,AB,O} and for each allele z in {a,b,o} we compute Nz(xj), the expected number of times that z appears in xj:

Nz(xj) = ∑yj p(yj|xj,θ) Nz(yj)

where the sum is taken over the ordered "genotype pairs" yj, and Nz(yj) is the number of times allele z occurs in the pair yj. E.g.,

Na((o,b)) = 0 ;  Nb((o,b)) = No((o,b)) = 1.

Page 24:

EM for ABO locus problem

The computation for blood type B:

P(B|λ) = P((b,b)|λ) + p((b,o)|λ) + p((o,b)|λ) = qb² + 2qbqo.

Since Nb((b,b)) = 2, and Nb((b,o)) = Nb((o,b)) = No((o,b)) = No((b,o)) = 1,

No(B) and Nb(B), the expected numbers of occurrences of o and b in B, are given by:

No(B) = ∑y p(y|B,λ) No(y) = 2qbqo / (qb² + 2qbqo)

Nb(B) = ∑y p(y|B,λ) Nb(y) = (2qb² + 2qbqo) / (qb² + 2qbqo)

Observe that Nb(B) + No(B) = 2

Page 25:

EM for ABO loci

Similarly, P(A|λ) = qa² + 2qaqo, and

Na(A) = (2qa² + 2qaqo) / (qa² + 2qaqo) ;  No(A) = 2qaqo / (qa² + 2qaqo)

P(AB|λ) = p((b,a)|λ) + p((a,b)|λ) = 2qaqb ;

P(O|λ) = p((o,o)|λ) = qo²

Na(AB) = Nb(AB) = 1

No(O) = 2

[ Nb(O) = Na(O) = No(AB) = Nb(A) = Na(B) = 0 ]

Page 26:

E step: compute Na, Nb and No

Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the number of observations of A, B, AB, and O respectively.

Na = #(A)·Na(A) + #(AB)·Na(AB)

Nb = #(B)·Nb(B) + #(AB)·Nb(AB)

No = #(A)·No(A) + #(B)·No(B) + #(O)·No(O)

Note that Na + Nb + No = 2N = 22.

M step: set λ* = (qa*, qb*, qo*), where

qa* = Na/2N ;  qb* = Nb/2N ;  qo* = No/2N
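Not in the original slides: a short Python sketch of the full EM loop for this data set (#(A)=3, #(B)=5, #(AB)=1, #(O)=2), using the expected-count formulas above; the starting allele frequencies are an arbitrary choice.

counts = {"A": 3, "B": 5, "AB": 1, "O": 2}   # observed blood-type counts (from the slides)
N = sum(counts.values())                     # 11 individuals, i.e. 2N = 22 alleles

qa, qb, qo = 1/3, 1/3, 1/3                   # arbitrary starting point

for it in range(20):
    # E step: expected allele counts, using the formulas above
    Na_A = (2*qa**2 + 2*qa*qo) / (qa**2 + 2*qa*qo)
    No_A = (2*qa*qo) / (qa**2 + 2*qa*qo)
    Nb_B = (2*qb**2 + 2*qb*qo) / (qb**2 + 2*qb*qo)
    No_B = (2*qb*qo) / (qb**2 + 2*qb*qo)

    Na = counts["A"]*Na_A + counts["AB"]*1
    Nb = counts["B"]*Nb_B + counts["AB"]*1
    No = counts["A"]*No_A + counts["B"]*No_B + counts["O"]*2

    # M step: relative frequencies (Na + Nb + No = 2N)
    qa, qb, qo = Na/(2*N), Nb/(2*N), No/(2*N)

print(qa, qb, qo)   # converges to the maximum-likelihood allele frequencies for this sample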

Page 27:

Example: the Motif Finding Problem

• Given a set of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the motif in each of the individual sequences

Page 28:

Motif Finding Problem: a reformulation

• Collect all substrings of the same length k from the input sequences;

• With N sequences of the same length L, n = N(L−k+1) substrings can be derived;

• Find a significant number of substrings that can be described by a profile model.

Page 29:

Fitting a mixture model by EM

• A finite mixture model: data X = (X1,…,Xn) arises from two or more groups with g models θ = (θ1, …, θg).

• Indicator vectors Z = (Z1,…,Zn), where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and = 0 otherwise.

• P(Zij = 1|λ) = λj. For any given i, all Zij are 0 except for one j.

• g = 2: class 1 (the motif) and class 2 (the background) are given by a position-specific and a general multinomial distribution, respectively.

Page 30:

Complete data likelihood

• Under the assumption that the pairs (Zi,Xi) are mutually independent, their joint density may be written

P(Z, X|λ,θ) = ∏i∏j [λj P(Xi|θj)]^Zij

• The log likelihood of the model is thus

log L(λ,θ|Z,X) = ∑i∑j Zij log[λj P(Xi|θj)].

• The EM algorithm iteratively computes the expectation of the likelihood given the observed data X and initial estimates λ′ and θ′ of λ and θ (the E-step), and then maximizes the result in the free variables λ and θ, leading to new estimates λ″ and θ″ (the M-step).

Page 31:

Mixture models: the E-step

• Since the log likelihood is a sum over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such term, given Xi. Using initial parameter values λ′ and θ′, and the fact that the Zij are binary, we get

E(Zij|X,λ′,θ′) = λ′j P(Xi|θ′j) / ∑k λ′k P(Xi|θ′k) = Z′ij

Page 32:

Mixture models: the M-step

• Now we want to maximize the result of an E-step:

∑i∑j Z′ij log λj + ∑i∑j Z′ij log P(Xi|θj).

• The maximization over λ is independent of the rest and is readily achieved with

λ″j = ∑i Z′ij / n.

Page 33:

Mixture models: the M-step

Note: P(Xi|θ1) = ∏j∏k fjk^I(k,Xij), and P(Xi|θ2) = ∏j∏k f0k^I(k,Xij),

where Xij is the letter in the jth position of sample i, and I(k,a) = 1 if a = ak, and = 0 otherwise. Let c0k = ∑i∑j Z′i2 I(k,Xij), and cjk = ∑i Z′i1 I(k,Xij). Here c0k is the expected number of times letter ak appears in the background, and cjk the expected number of times ak appears at position j of occurrences of the motif in the data. Then

f″jk = cjk / ∑k′=1..L cjk′ ,  j = 0,1,…,w; k = 1,…,L.

In practice, care must be taken to avoid zero frequencies, so one adds pseudo-counts: small constants βk with ∑k βk = β, giving

f″jk = (cjk + βk) / (∑k′=1..L cjk′ + β) ,  j = 0,1,…,w; k = 1,…,L.
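To tie the pieces together (an addition, not from the slides): a rough numpy sketch of the E-step and M-step above for g = 2, with the motif modeled by position-specific letter frequencies and the background by a single multinomial. The function name, initialization, and pseudo-count value are my own assumptions.

import numpy as np

def mixture_em(X, w_len, n_letters, n_iter=50, seed=0):
    """EM for the two-component mixture above: class 1 = motif (position-specific
    frequencies f[1..w, k]), class 2 = background (f[0, k]).

    X : integer array of shape (n, w_len); X[i, j] is the letter index at position j of sample i.
    """
    rng = np.random.default_rng(seed)
    lam = np.array([0.5, 0.5])                              # mixing weights lambda_1, lambda_2
    f = rng.dirichlet(np.ones(n_letters), size=w_len + 1)   # f[0] = background, f[1..w] = motif
    beta = 0.1                                              # pseudo-count per letter

    for _ in range(n_iter):
        # E step: Z'_ij = lambda_j P(X_i | theta_j) / sum_k lambda_k P(X_i | theta_k)
        log_p_motif = sum(np.log(f[j + 1][X[:, j]]) for j in range(w_len))
        log_p_bg = sum(np.log(f[0][X[:, j]]) for j in range(w_len))
        p1 = lam[0] * np.exp(log_p_motif)
        p2 = lam[1] * np.exp(log_p_bg)
        z1 = p1 / (p1 + p2)
        z2 = 1.0 - z1

        # M step: lambda''_j = sum_i Z'_ij / n
        lam = np.array([z1.mean(), z2.mean()])

        # M step: expected letter counts c_jk, normalized with pseudo-counts
        for j in range(w_len):
            c_motif = np.array([(z1 * (X[:, j] == k)).sum() for k in range(n_letters)])
            f[j + 1] = (c_motif + beta) / (c_motif.sum() + beta * n_letters)
        c_bg = np.array([(z2[:, None] * (X == k)).sum() for k in range(n_letters)])
        f[0] = (c_bg + beta) / (c_bg.sum() + beta * n_letters)

    return lam, f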