
CS344: Introduction to Artificial Intelligence

(associated lab: CS386)

Pushpak Bhattacharyya

CSE Dept., IIT Bombay

Lecture 22, 25: EM, Baum-Welch
28th March and 1st April, 2014

(Lecture 21 was by Girish on Markov Logic Networks)

Expectation Maximization

One of the key ideas of Statistical AI, ML, NLP, CV

Iterative procedure: find parameters, find hidden variables, maximize the data likelihood.

The coin tossing problem

Case of 1 coin: Suppose there are N tosses of a coin. N_H = the number of heads. What is the probability of a head, i.e., P_H?

Observed variable

#Observation = N

X : x_1, x_2, x_3, \ldots, x_N, \quad where x_i = 1 when the i-th toss produces a head and 0 otherwise.

Therefore, P_H = \frac{1}{N} \sum_{i=1}^{N} x_i

Each observation is a Bernoulli trial where P_H is the probability of success, i.e., getting a head, and 1 - P_H is the probability of failure, i.e., getting a tail.

Likelihood of X

• Likelihood of X, i.e., probability of Observation Sequence X is:

L(X; P_H) = \prod_{i=1}^{N} P_H^{x_i} (1 - P_H)^{1 - x_i}

Each trial is identical and independent. Maximum likelihood of the data requires us to make \frac{dL}{dP_H} = 0 and thus get the expression for P_H.

Mathematical Convenience

Take log of the likelihood.

Differentiating w.r.t. P_H: to get the expression for P_H, make \frac{dLL}{dP_H} = 0.

LL(X; P_H) = \log L(X; P_H) = \sum_{i=1}^{N} \left[ x_i \log P_H + (1 - x_i) \log(1 - P_H) \right]

\frac{dLL}{dP_H} = \sum_{i=1}^{N} \frac{x_i}{P_H} - \sum_{i=1}^{N} \frac{1 - x_i}{1 - P_H}

Equating \frac{dLL}{dP_H} to 0 gives the expression for P_H:

\frac{\sum_{i=1}^{N} x_i}{P_H} = \frac{N - \sum_{i=1}^{N} x_i}{1 - P_H}

\Rightarrow \quad P_H = \frac{\sum_{i=1}^{N} x_i}{N}
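To make the closed-form result concrete, here is a minimal Python sketch (not from the lecture; the coin bias 0.7 and N = 1000 are arbitrary) that simulates tosses and checks numerically that the head fraction maximizes the log likelihood.

import random
import math

random.seed(0)
N, true_ph = 1000, 0.7                                          # arbitrary illustration values
x = [1 if random.random() < true_ph else 0 for _ in range(N)]   # 1 = head, 0 = tail

# Closed-form MLE derived above: P_H = (sum of x_i) / N
p_mle = sum(x) / N

def log_likelihood(p):
    """LL(X; P_H) = sum_i [x_i log p + (1 - x_i) log(1 - p)]"""
    return sum(xi * math.log(p) + (1 - xi) * math.log(1 - p) for xi in x)

# Numerically confirm that no nearby p gives a higher log likelihood.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

print("MLE (closed form):", p_mle)
print("argmax over grid :", p_grid)   # agrees with p_mle up to grid resolution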

Maximum Entropy

Suppose we do not know how to get the MLE, or the likelihood expression is impossible to obtain; then we use Maximum Entropy. Example: in problems like co-reference resolution.

Entropy = -P_H \log P_H - (1 - P_H) \log(1 - P_H) (to be elaborated later).

Case for Expectation Maximization

Instead of one coin, we toss two coins. Parameters: \langle P, P_1, P_2 \rangle

P = probability of choosing the first coin
P_1 = probability of a head from the first coin
P_2 = probability of a head from the second coin

We do not know which coin each observation came from.

X : x_1, x_2, x_3, \ldots, x_N

EM continued..

Z_1, Z_2, Z_3, \ldots, Z_N is the hidden sequence running alongside x_1, x_2, x_3, \ldots, x_N, where Z_i = 1 if the i-th observation came from coin 1, and 0 otherwise.

Z : z_1, z_2, z_3, \ldots, z_N, \quad \theta = \langle P, P_1, P_2 \rangle

\Pr(X; \theta) = \sum_{Z} \Pr(X, Z; \theta)

Y : (x_1, z_1), (x_2, z_2), (x_3, z_3), \ldots, (x_N, z_N)

Cntd.

We want to work with \Pr(Y; \theta):

\Pr(Y; \theta) = P(X, Z; \theta) = \prod_{i=1}^{N} \left[ P \cdot P_1^{x_i} (1 - P_1)^{1 - x_i} \right]^{z_i} \cdot \left[ (1 - P) \cdot P_2^{x_i} (1 - P_2)^{1 - x_i} \right]^{1 - z_i}

Invoke convexity/concavity and the expectation of Z_i, and work with \log(\Pr(Y; \theta)):

\Pr(X; \theta) = \sum_{Z} P(X, Z; \theta), \quad LL(X; \theta) = \log(P(X, Z; \theta))

LL(X; \theta) = \sum_{i=1}^{N} \Big[ E(z_i) \big( \log P + x_i \log P_1 + (1 - x_i) \log(1 - P_1) \big) + (1 - E(z_i)) \big( \log(1 - P) + x_i \log P_2 + (1 - x_i) \log(1 - P_2) \big) \Big]

Log Likelihood of the Data

IMPORTANT POINTS TO NOTE

The log moves inside the product term. The \sum_Z disappears, giving rise to E(Z_i) in place of Z_i.

Differentiate w.r.t. P, P_1, P_2, equate to 0, and get the results:

P_1 = \frac{\sum_{i=1}^{N} E(z_i) x_i}{\sum_{i=1}^{N} E(z_i)}

P_2 = \frac{M - \sum_{i=1}^{N} E(z_i) x_i}{N - \sum_{i=1}^{N} E(z_i)}, \quad where M = observed number of heads

P = \frac{\sum_{i=1}^{N} E(z_i)}{N}
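The slide gives the M-step updates for P, P_1, P_2; the E-step used in the sketch below (the posterior probability that toss i came from coin 1) is the standard one for this two-coin mixture and is my assumption rather than something spelled out above. All simulated values and initial guesses are arbitrary.

import random

random.seed(1)

# Simulate tosses from two coins; these generating values are arbitrary.
true_p, true_p1, true_p2 = 0.6, 0.8, 0.3
N = 2000
x = []
for _ in range(N):
    coin1 = random.random() < true_p
    ph = true_p1 if coin1 else true_p2
    x.append(1 if random.random() < ph else 0)

# Initial guesses for the parameters <P, P1, P2>.
p, p1, p2 = 0.5, 0.6, 0.4

for _ in range(200):
    # E-step (assumed standard form): E(z_i) = posterior prob. that toss i used coin 1.
    ez = []
    for xi in x:
        a = p * (p1 if xi else (1 - p1))
        b = (1 - p) * (p2 if xi else (1 - p2))
        ez.append(a / (a + b))

    # M-step: the closed-form updates from the slide.
    sum_ez = sum(ez)
    sum_ez_x = sum(e * xi for e, xi in zip(ez, x))
    M = sum(x)                      # observed number of heads
    p1 = sum_ez_x / sum_ez
    p2 = (M - sum_ez_x) / (N - sum_ez)
    p = sum_ez / N

print(p, p1, p2)   # should land near a (possibly label-swapped) optimum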

Application of EM: HMM Training

Baum Welch or Forward Backward Algorithm

Key Intuition

Given: a training sequence. Initialization: probability values. Compute: Pr(state sequence | training sequence); get the expected count of each transition; recompute the rule probabilities.

Approach: initialize the probabilities and recompute them iteratively, an EM-like approach.

[Figure: an HMM with two states q and r; every transition between them emits symbol a or b.]

Baum-Welch algorithm: counts

String = abb aaa bbb aaa

Sequence of states with respect to input symbols:

[Figure: a state sequence over states q and r aligned, symbol by symbol, with the output sequence a b b a a a b b b a a a.]

Calculating probabilities from the table of counts. T = #states, A = #alphabet symbols.

Now, if we have non-deterministic transitions, then multiple state sequences are possible for the given output sequence (see the previous slide's figure). Our aim is to find the expected counts.

P(q \xrightarrow{b} q) = 3/8

Src Dest O/P Count

q r a 5

q q b 3

r q a 3

r q b 2

P(q \xrightarrow{a} r) = 5/8

In general,

P(s_i \xrightarrow{w_k} s_j) = \frac{c(s_i \xrightarrow{w_k} s_j)}{\sum_{l=1}^{T} \sum_{m=1}^{A} c(s_i \xrightarrow{w_m} s_l)}
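A tiny sketch of the count-to-probability step, using the numbers from the table above; the dictionary encoding is just one convenient choice.

# counts[(src, dest, symbol)] taken from the table of counts above
counts = {("q", "r", "a"): 5, ("q", "q", "b"): 3,
          ("r", "q", "a"): 3, ("r", "q", "b"): 2}

def prob(src, dest, sym):
    # P(s_i -sym-> s_j) = c(s_i -sym-> s_j) / sum of counts of all transitions out of s_i
    total = sum(c for (s, _, _), c in counts.items() if s == src)
    return counts.get((src, dest, sym), 0) / total

print(prob("q", "r", "a"))   # 5/8 = 0.625
print(prob("q", "q", "b"))   # 3/8 = 0.375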

Interplay Between Two Equations

P(s_i \xrightarrow{W_k} s_j) = \frac{C(s_i \xrightarrow{W_k} s_j)}{\sum_{l=0}^{T} \sum_{m=0}^{A} C(s_i \xrightarrow{W_m} s_l)}

C(s_i \xrightarrow{W_k} s_j) = \sum_{S_{0,n+1}} P(S_{0,n+1} \mid W_{0,n}) \; n(s_i \xrightarrow{W_k} s_j, S_{0,n+1}, W_{0,n})

where n(\cdot) is the number of times the transition s_i \rightarrow s_j (emitting W_k) occurs in the state sequence for the string.

Illustration

[Figure: the actual (desired) HMM over states q and r, with transition probabilities a:0.67, a:0.16, b:0.17, b:1.0, next to the initial guess with probabilities a:0.04, a:0.48, b:0.48, b:1.0.]

One run of Baum-Welch algorithm: string ababb

State sequence        P(path)    q->r (a)   r->q (b)   q->q (a)   q->q (b)
q r q r q q           0.00077    0.00154    0.00154    0          0.00077
q r q q q q           0.00442    0.00442    0.00442    0.00442    0.00884
q q q r q q           0.00442    0.00442    0.00442    0.00442    0.00884
q q q q q q           0.02548    0.0        0.0        0.05096    0.07644
Rounded Total         0.035      0.01       0.01       0.06       0.095
New Probabilities (P)            0.06       1.0        0.36       0.581

Each transition column holds P(path) multiplied by the number of times that transition occurs in the path. A new probability is the rounded total for a transition normalized over all transitions leaving the same source state, e.g., P(q \xrightarrow{a} r) = 0.01 / (0.01 + 0.06 + 0.095) \approx 0.06.

* \epsilon is considered as the starting and ending symbol of the input sequence string.

Through multiple iterations the probability values will converge.
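To mirror the table above, here is a brute-force Python sketch that enumerates every state sequence for the string, weights each transition count by P(path), and renormalizes. The initial-guess probabilities in P are placeholders (the exact assignment of the numbers in the slide's figure is not fully recoverable here), so the printed values will not reproduce the table exactly; the procedure is the point.

from itertools import product

string = "ababb"
states = ["q", "r"]

# Placeholder initial-guess HMM: P[(src, symbol, dest)]; the probabilities leaving each
# state sum to 1. These numbers are assumptions for illustration only.
P = {("q", "a", "r"): 0.48, ("q", "a", "q"): 0.04, ("q", "b", "q"): 0.48,
     ("r", "b", "q"): 1.0}

def path_prob(path):
    prob = 1.0
    for (s, t, w) in zip(path, path[1:], string):
        prob *= P.get((s, w, t), 0.0)
    return prob

expected = {}        # expected count of each (src, symbol, dest) transition
total = 0.0          # sum of P(path) over all paths, i.e. P(W) under these assumptions
for path in product(states, repeat=len(string) + 1):
    if path[0] != "q":          # assume q is the start state, as in the table
        continue
    pp = path_prob(path)
    if pp == 0.0:
        continue
    total += pp
    for (s, t, w) in zip(path, path[1:], string):
        expected[(s, w, t)] = expected.get((s, w, t), 0.0) + pp

print("P(W) =", round(total, 5))

# Re-estimate probabilities exactly as in the "New Probabilities" row:
# normalize the expected counts over all transitions leaving the same state.
for (s, w, t), c in sorted(expected.items()):
    denom = sum(c2 for (s2, _, _), c2 in expected.items() if s2 == s)
    print(f"P({s} -{w}-> {t}) = {c / denom:.3f}")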

Computational part (1/2)

C(s_i \xrightarrow{W_k} s_j) = \sum_{S_{0,n+1}} P(S_{0,n+1} \mid W_{0,n}) \; n(s_i \xrightarrow{W_k} s_j, S_{0,n+1}, W_{0,n})

= \frac{1}{P(W_{0,n})} \sum_{S_{0,n+1}} P(S_{0,n+1}, W_{0,n}) \; n(s_i \xrightarrow{W_k} s_j, S_{0,n+1}, W_{0,n})

= \frac{1}{P(W_{0,n})} \sum_{S_{0,n+1}} \sum_{t=0}^{n} P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, S_{0,n+1}, W_{0,n})

= \frac{1}{P(W_{0,n})} \sum_{t=0}^{n} P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, W_{0,n})

[Trellis: states S_0, S_1, \ldots, S_i, S_j, \ldots, S_n, S_{n+1} emitting the symbols w_0, w_1, w_2, \ldots, w_k, \ldots, w_{n-1}, w_n.]

Computational part (2/2)

P(S_t = s_i, S_{t+1} = s_j, W_t = w_k, W_{0,n})
= P(S_t = s_i, W_{0,t-1}) \; P(S_{t+1} = s_j, W_t = w_k \mid S_t = s_i) \; P(W_{t+1,n} \mid S_{t+1} = s_j)
= F(t-1, i) \; P(S_{t+1} = s_j, W_t = w_k \mid S_t = s_i) \; B(t+1, j)
= F(t-1, i) \; P(s_i \xrightarrow{w_k} s_j) \; B(t+1, j)

Therefore,

C(s_i \xrightarrow{w_k} s_j) = \frac{1}{P(W_{0,n})} \sum_{t=0}^{n} F(t-1, i) \; P(s_i \xrightarrow{w_k} s_j) \; B(t+1, j)

where F is the forward probability (of the prefix ending in s_i) and B is the backward probability (of the suffix given s_j).

[Trellis: states S_0, S_1, \ldots, S_i, S_j, \ldots, S_n, S_{n+1} emitting the symbols w_0, w_1, w_2, \ldots, w_k, \ldots, w_{n-1}, w_n.]
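Below is a minimal Python sketch of the forward-backward computation of the expected counts derived above, for the same style of machine (each transition emits a symbol). The toy parameters are placeholders, and the indexing convention for F and B is my own choice rather than the slide's exact notation.

# Toy HMM in the lecture's style: each transition src -w-> dest has a probability.
# P[src][(w, dest)]; these parameter values are placeholders for illustration.
P = {"q": {("a", "r"): 0.48, ("a", "q"): 0.04, ("b", "q"): 0.48},
     "r": {("b", "q"): 1.0}}
states = ["q", "r"]
W = "ababb"
n = len(W)

# Forward: F[t][s] = P(w_0 .. w_{t-1}, state after t symbols = s); start state assumed q.
F = [{s: 0.0 for s in states} for _ in range(n + 1)]
F[0]["q"] = 1.0
for t in range(n):
    for s in states:
        for (w, d), p in P[s].items():
            if w == W[t]:
                F[t + 1][d] += F[t][s] * p

# Backward: B[t][s] = P(w_t .. w_{n-1} | state before emitting w_t = s).
B = [{s: 0.0 for s in states} for _ in range(n + 1)]
for s in states:
    B[n][s] = 1.0
for t in range(n - 1, -1, -1):
    for s in states:
        for (w, d), p in P[s].items():
            if w == W[t]:
                B[t][s] += p * B[t + 1][d]

prob_W = F[n]["q"] + F[n]["r"]          # P(W_{0,n})

# Expected count C(s_i -w_k-> s_j) = (1/P(W)) * sum_t F[t][s_i] * P(s_i -w_k-> s_j) * B[t+1][s_j]
C = {}
for t in range(n):
    for s in states:
        for (w, d), p in P[s].items():
            if w == W[t]:
                C[(s, w, d)] = C.get((s, w, d), 0.0) + F[t][s] * p * B[t + 1][d] / prob_W

for k, v in sorted(C.items()):
    print(k, round(v, 4))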

Discussions

1. Symmetry breaking: with a symmetric initialization the values never change, i.e., the symmetry is never broken (see the example below).
2. Getting stuck in local maxima.
3. Label bias problem: probabilities have to sum to 1, so values can rise only at the cost of a fall in the values of others.

[Figure: the desired HMM (transition probabilities a:1.0, b:1.0, a:0.5, b:0.5) and a symmetric initialization (a:0.5, b:0.5, a:0.25, b:0.25, a:0.25, b:0.25, a:0.5, b:0.5) over the same states, labelled Desired and Initialized.]

Another application of EM

WSD

Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, It takes two to Tango: A Bilingual Unsupervised Approach for estimating Sense Distributions using Expectation Maximization, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.

Definition: WSD

Given a context, get the "meaning" of a set of words (targeted WSD) or of all words (all-words WSD).

The "meaning" is usually given by the id of a sense in a sense repository, usually the wordnet.

Example: "operation" (from Princeton Wordnet)

Operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1

Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1

mathematical process, mathematical operation, operation -- ((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1

WSD for ALL Indian languages. Critical resource: INDOWORDNET.

[Figure: the Hindi Wordnet linked to the Marathi, Sanskrit, Bengali, Punjabi, Konkani, Urdu, Gujarati, Oriya, Kashmiri, English, Dravidian language and North East language wordnets.]

Synset Based Multilingual Dictionary

Expansion approach for creating wordnets [Mohanty et al., 2008]: instead of creating synsets from scratch, link to the synsets of an existing wordnet; relations get borrowed from the existing wordnet.

[Figure: a sample entry from the MultiDict, showing linked Hindi and Marathi synset graphs (S1-S7).]

Hypothesis

Sense distributions across languages are invariant: the proportion of times a sense appears in a language is uniform across languages.

E.g., the proportion of times the sense of "sun" appears in any language through "sun" and its synonyms remains the same.

ESTIMATING SENSE DISTRIBUTIONS

If a sense-tagged Marathi corpus were available, we could have estimated the sense distribution directly, but such a corpus is not available.

EM for estimating sense distributions

Problem: 'galaa' itself is ambiguous, so its raw count cannot be used as it is.

Solution: its count should be weighted by the probability of the sense given the word, which is what the E-step below computes.

Word correspondences:

Sense in English | S_mar (Marathi sense number) | words_mar (partial list) | S_hin = \pi(S_mar) (projected Hindi sense number) | words_hin (partial list of words in the projected Hindi sense)
Neck    | 1 | maan, greeva            | 1 | gardan, galaa
Respect | 2 | maan, satkaar, sanmaan  | 3 | izzat, aadar
Voice   | 3 | awaaz, swar             | 2 | galaa

EM for estimating sense distributions

E-step:

P(S_{hin1} \mid galaa) =
  \frac{P(S_{mar1} \mid maan)\,\#(maan) + P(S_{mar1} \mid greeva)\,\#(greeva)}
       {P(S_{mar1} \mid maan)\,\#(maan) + P(S_{mar1} \mid greeva)\,\#(greeva) + P(S_{mar3} \mid awaaz)\,\#(awaaz) + P(S_{mar3} \mid swar)\,\#(swar)}

M-step:

P(S_{mar1} \mid maan) =
  \frac{P(S_{hin1} \mid gardan)\,\#(gardan) + P(S_{hin1} \mid galaa)\,\#(galaa)}
       {P(S_{hin1} \mid gardan)\,\#(gardan) + P(S_{hin1} \mid galaa)\,\#(galaa) + P(S_{hin3} \mid aadar)\,\#(aadar) + P(S_{hin3} \mid izzat)\,\#(izzat)}

General Algo

(2) E-step:

P(S_{L1_i} \mid u_{L1}) =
  \frac{\sum_{v \in crosslinks_{L2}(u, S_{L1_i})} P(\pi(S_{L1_i}) \mid v)\,\#(v)}
       {\sum_{S_{L1_j} \in senses_{L1}(u)} \sum_{x \in crosslinks_{L2}(u, S_{L1_j})} P(\pi(S_{L1_j}) \mid x)\,\#(x)}

(3) M-step:

P(S_{L2_k} \mid v_{L2}) =
  \frac{\sum_{y \in crosslinks_{L1}(v, S_{L2_k})} P(\pi(S_{L2_k}) \mid y)\,\#(y)}
       {\sum_{S_{L2_m} \in senses_{L2}(v)} \sum_{y \in crosslinks_{L1}(v, S_{L2_m})} P(\pi(S_{L2_m}) \mid y)\,\#(y)}

where \pi(S) is the projection of sense S into the other language, and the cross-links of a word in a given sense are the words of the linked synset in the other language.
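A heavily hedged sketch of the iteration defined by equations (2) and (3): the data structures, the toy corpus counts, and the uniform initialization are all illustrative assumptions of mine, with the cross-link table mirroring the maan/galaa example above.

from collections import defaultdict

# Toy cross-linked synsets built from the maan/galaa table; everything here
# (counts, structures, initialization) is illustrative, not from the paper.
# Each entry: (marathi_sense, marathi_words, hindi_sense, hindi_words)
synsets = [
    ("mar1", ["maan", "greeva"],             "hin1", ["gardan", "galaa"]),   # neck
    ("mar2", ["maan", "satkaar", "sanmaan"], "hin3", ["izzat", "aadar"]),    # respect
    ("mar3", ["awaaz", "swar"],              "hin2", ["galaa"]),             # voice
]

# Hypothetical raw corpus counts #(word) in each language.
count = {"maan": 10, "greeva": 4, "satkaar": 3, "sanmaan": 2, "awaaz": 6, "swar": 5,
         "gardan": 7, "galaa": 12, "izzat": 8, "aadar": 4}

# senses[lang][word] = list of (own_sense, projected_sense, cross_linked_words)
senses = {"mar": defaultdict(list), "hin": defaultdict(list)}
for smar, mwords, shin, hwords in synsets:
    for w in mwords:
        senses["mar"][w].append((smar, shin, hwords))
    for w in hwords:
        senses["hin"][w].append((shin, smar, mwords))

# Sense distributions P(sense | word), initialized uniformly.
P = {lang: {w: {s: 1.0 / len(entries) for s, _, _ in entries}
            for w, entries in senses[lang].items()} for lang in ("mar", "hin")}

def update(lang, other):
    """One half-step: re-estimate P(sense|word) in `lang` from counts in `other` (eq. 2 or 3)."""
    new = {}
    for w, entries in senses[lang].items():
        scores = {}
        for s, proj, linked_words in entries:
            scores[s] = sum(P[other][v][proj] * count[v] for v in linked_words)
        z = sum(scores.values()) or 1.0
        new[w] = {s: sc / z for s, sc in scores.items()}
    P[lang] = new

for _ in range(10):
    update("hin", "mar")   # E-step: Hindi sense distributions from Marathi counts
    update("mar", "hin")   # M-step: Marathi sense distributions from Hindi counts

print(P["hin"]["galaa"])   # e.g. P(hin1 | galaa) vs P(hin2 | galaa)
print(P["mar"]["maan"])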

Results (Marathi):

Algorithm                                                      | P %   | R %   | F %
IWSD (training on self corpora; no parameter projection)       | 81.29 | 80.42 | 80.85
IWSD (training on Hindi and projecting parameters for Marathi) | 73.45 | 70.33 | 71.86
EM (no sense corpora in either Hindi or Marathi)               | 68.57 | 67.93 | 68.25
Wordnet Baseline                                               | 58.07 | 58.07 | 58.07

Results & Discussions

Performance of projection using manual cross linkages (MCL) is within 7% of self-training.

Performance of projection using probabilistic cross linkages (PCL) is within 10-12% of self-training, which is remarkable since no additional cost is incurred in the target language.

Both MCL and PCL give a 10-14% improvement over the Wordnet first-sense baseline.

It is not prudent to stick to knowledge-based and unsupervised approaches: they come nowhere close to MCL or PCL.

[Chart legend: manual cross linkages; probabilistic cross linkages; skyline (self training data is available); Wordnet first sense baseline; S-O-T-A knowledge based approach; S-O-T-A unsupervised approach; our values.]

Delving deeper into EM

Some Useful mathematical concepts

Convex/concave functions, Jensen's inequality, Kullback-Leibler distance/divergence.

[Figure: a convex function; the chord value \lambda f(x_1) + (1 - \lambda) f(x_2) lies above the function value f(\lambda x_1 + (1 - \lambda) x_2), where z = \lambda x_1 + (1 - \lambda) x_2.]

Criteria for convexity

A function f(x) is said to be convex in the interval [a, b] iff

f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \quad \forall x_1, x_2 \in [a, b], \; 0 \le \lambda \le 1

Jensen’s inequality

For any convex function f(x)

f\left( \sum_{i=1}^{n} \lambda_i x_i \right) \le \sum_{i=1}^{n} \lambda_i f(x_i)

where \sum_{i=1}^{n} \lambda_i = 1 and 0 \le \lambda_i \le 1.

Proof of Jensen's inequality

Method: by induction on N.

Base case N = 1: f(\lambda_1 x_1) = \lambda_1 f(x_1) where \lambda_1 = 1, trivially true.

Another base case

N = 2

f(\lambda_1 x_1 + \lambda_2 x_2) = f(\lambda_1 x_1 + (1 - \lambda_1) x_2) \quad since \lambda_1 + \lambda_2 = 1
\le \lambda_1 f(x_1) + (1 - \lambda_1) f(x_2) \quad since f(x) is convex

Hypothesis

Suppose the inequality is true for N = k, i.e.,

f\left( \sum_{i=1}^{k} \lambda_i x_i \right) \le \sum_{i=1}^{k} \lambda_i f(x_i)

Induction Step

Show that

f\left( \sum_{i=1}^{k+1} \lambda_i x_i \right) \le \sum_{i=1}^{k+1} \lambda_i f(x_i)

given \lambda_1 + \lambda_2 + \lambda_3 + \ldots + \lambda_{k+1} = 1.

Proof

f\left( \sum_{i=1}^{k+1} \lambda_i x_i \right) = f(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 + \ldots + \lambda_{k+1} x_{k+1})

= f\left( (1 - \lambda_{k+1}) \sum_{i=1}^{k} \frac{\lambda_i}{1 - \lambda_{k+1}} x_i + \lambda_{k+1} x_{k+1} \right)

\le (1 - \lambda_{k+1}) f\left( \sum_{i=1}^{k} \frac{\lambda_i}{1 - \lambda_{k+1}} x_i \right) + \lambda_{k+1} f(x_{k+1}) \quad by convexity

= (1 - \lambda_{k+1}) f\left( \sum_{i=1}^{k} \mu_i x_i \right) + \lambda_{k+1} f(x_{k+1}), \quad where \mu_i = \frac{\lambda_i}{1 - \lambda_{k+1}}

Continued...

Examine each µi

\mu_i = \frac{\lambda_i}{1 - \lambda_{k+1}} \ge 0, and

\sum_{i=1}^{k} \mu_i = \frac{\lambda_1 + \lambda_2 + \lambda_3 + \ldots + \lambda_k}{1 - \lambda_{k+1}} = \frac{1 - \lambda_{k+1}}{1 - \lambda_{k+1}} = 1

Continued...

Therefore,

(1 - \lambda_{k+1}) f\left( \sum_{i=1}^{k} \mu_i x_i \right) + \lambda_{k+1} f(x_{k+1})
\le (1 - \lambda_{k+1}) \sum_{i=1}^{k} \mu_i f(x_i) + \lambda_{k+1} f(x_{k+1}) \quad by the hypothesis for N = k
= \sum_{i=1}^{k} \lambda_i f(x_i) + \lambda_{k+1} f(x_{k+1})
= \sum_{i=1}^{k+1} \lambda_i f(x_i)

Finally, at the induction step,

f\left( \sum_{i=1}^{k+1} \lambda_i x_i \right) \le \sum_{i=1}^{k+1} \lambda_i f(x_i)

Thus Jensen's inequality is proved.

KL-divergence

We will use the discrete form of the probability distributions. Given two probability distributions P, Q on the random variable X : x_1, x_2, x_3, \ldots, x_N:

P : p_1 = p(x_1), p_2 = p(x_2), \ldots, p_N = p(x_N)
Q : q_1 = q(x_1), q_2 = q(x_2), \ldots, q_N = q(x_N)

KLD definition

D_{KL}(P, Q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}

D_{KL} is asymmetric and \ge 0. It can also be written as

D_{KL}(P, Q) = \sum_{i} p_i \log p_i - \sum_{i} p_i \log q_i = E_P(\log P) - E_P(\log Q)
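A small sketch computing the discrete KL divergence exactly as defined above, illustrating that it is non-negative and asymmetric; the two distributions are arbitrary.

import math

def kld(p, q):
    """D_KL(P, Q) = sum_i p_i * log(p_i / q_i); assumes p_i, q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

print(kld(P, Q))   # >= 0
print(kld(Q, P))   # also >= 0, but generally different: KLD is asymmetric
print(kld(P, P))   # 0 exactly when the distributions coincide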

Proof: KLD>=0

KL(P, Q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} = - \sum_{i=1}^{N} p_i \log \frac{q_i}{p_i}

-\log x is convex in x > 0, so Jensen's inequality can be applied with weights p_i, where \sum_{i=1}^{N} p_i = 1 and p_i = p(x_i).

Proof cntd.

Apply Jensen’s inequality

So \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} = - \sum_{i=1}^{N} p_i \log \frac{q_i}{p_i} \ge - \log \left( \sum_{i=1}^{N} p_i \frac{q_i}{p_i} \right) = - \log \left( \sum_{i=1}^{N} q_i \right) = - \log 1 = 0

Convexity of –log x

We need to show that

-\log(\lambda x_1 + (1 - \lambda) x_2) \le -\lambda \log x_1 - (1 - \lambda) \log x_2

i.e. \log(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda \log x_1 + (1 - \lambda) \log x_2

i.e. \lambda x_1 + (1 - \lambda) x_2 \ge x_1^{\lambda} x_2^{1 - \lambda}

Dividing both sides by x_2 and putting y = x_1 / x_2, this becomes \lambda y + (1 - \lambda) \ge y^{\lambda}, which holds for y > 0 and 0 \le \lambda \le 1.

Interesting problem

Try to prove:-

\frac{w_1 x_1 + w_2 x_2}{w_1 + w_2} \ge \left( x_1^{w_1} x_2^{w_2} \right)^{\frac{1}{w_1 + w_2}}
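A numerical spot-check of the exercise (the weighted AM-GM inequality, which is what the convexity of -log x reduces to); the values are random, and this is an illustration, not a proof.

import random

random.seed(2)
for _ in range(5):
    x1, x2 = random.uniform(0.1, 10), random.uniform(0.1, 10)
    w1, w2 = random.uniform(0.1, 5), random.uniform(0.1, 5)
    lhs = (w1 * x1 + w2 * x2) / (w1 + w2)                    # weighted arithmetic mean
    rhs = (x1 ** w1 * x2 ** w2) ** (1.0 / (w1 + w2))         # weighted geometric mean
    print(lhs >= rhs - 1e-12)                                # True every time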

2nd definition of convexity

Theorem:

Theorem: if f(x) is twice differentiable in [a, b] and f''(x) \ge 0 \; \forall x \in [a, b], then f(x) is convex in [a, b]. So -\log x is convex.

Lemma 1

If f''(x) \ge 0 in [a, b], then f'(t) \ge f'(s) for t \ge s, \; s, t \in [a, b].

[Figure: points a \le s \le z \le t \le b on the number line.]

Mean Value Theorem

For any (differentiable) function f(x): f(n) = f(m) + (n - m) f'(p), \; where m \le p \le n.

For example, f(z) = f(a) + (z - a) f'(s), \; s \in (a, z).

Alternative form of z

z = \lambda x_1 + (1 - \lambda) x_2

Add -\lambda z to both sides:

(1 - \lambda) z = \lambda (x_1 - z) + (1 - \lambda) x_2, \quad i.e. \quad \lambda (z - x_1) = (1 - \lambda)(x_2 - z)

Alternative form of convexity

f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2), \quad i.e. \quad f(z) \le \lambda f(x_1) + (1 - \lambda) f(x_2)

Add -\lambda f(z) to both sides:

(1 - \lambda) f(z) \le \lambda (f(x_1) - f(z)) + (1 - \lambda) f(x_2), \quad i.e. \quad (1 - \lambda)(f(z) - f(x_2)) \le \lambda (f(x_1) - f(z))

Proof: second derivative \ge 0 implies convexity (1/2)

We have that z = \lambda x_1 + (1 - \lambda) x_2, and we want to show f(z) \le \lambda f(x_1) + (1 - \lambda) f(x_2). By the alternative form of convexity this is equivalent to (1), while the alternative form of z gives the identity (2):

(1 - \lambda)[f(z) - f(x_2)] \le \lambda [f(x_1) - f(z)] \quad \ldots (1)

(1 - \lambda)[x_2 - z] = \lambda [z - x_1] \quad \ldots (2)

Second derivative >=0 implies convexity (2/2)

By the mean value theorem, f(z) - f(x_1) = (z - x_1) f'(s) and f(x_2) - f(z) = (x_2 - z) f'(t) for some s and t, where x_1 \le s \le z \le t \le x_2.

Now since f''(x) \ge 0, we have f'(t) \ge f'(s). Multiplying the two sides of (2) by f'(t) and f'(s) respectively,

(1 - \lambda)(x_2 - z) f'(t) \ge \lambda (z - x_1) f'(s)

Combining this with the mean value theorem identities above gives (1), and the result is proved.

Why all this? In EM, we maximize the expectation of the log likelihood of the data. Log is a concave function. We have to take iterative steps to get to the maximum. There are two unknown quantities: Z (unobserved data) and \theta (parameters). From \theta, get a new value of Z (E-step); from Z, get a new value of \theta (M-step).

How to change \theta: how to choose the next \theta? Take

\theta_{n+1} = \arg\max_{\theta} \left( LL(X, Z : \theta) - LL(X, Z : \theta_n) \right)

where X: observed data, Z: unobserved data, \theta: parameters, and LL(X, Z : \theta_n) is the log likelihood of the complete data with the parameter value at \theta_n.

This is in lieu of, for example, gradient ascent. At every step LL(.) will increase, ultimately reaching a local/global maximum.

Why expectation of log likelihood? (1/2)

P(X; \theta) may not be a convenient mathematical expression, so deal with P(X, Z; \theta), marginalized over Z. \log\left( \sum_Z P(X, Z; \theta) \right) is processed by multiplying and dividing by P(Z \mid X; \theta_n), which for each Z is between 0 and 1 and sums to 1. Then Jensen's inequality gives:

\log(P(X; \theta)) = \log\left( \sum_Z P(X, Z; \theta) \right) = \log\left( \sum_Z P(Z \mid X; \theta_n) \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)} \right)

(multiply and divide by P(Z \mid X; \theta_n), where P(Z \mid X; \theta_n) is the probability at \theta_n)

\ge \sum_Z P(Z \mid X; \theta_n) \log\left( \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)} \right)

Why expectation of log likelihood? (2/2)

LL(X; \theta) - LL(X; \theta_n) = \log(P(X; \theta)) - \log(P(X; \theta_n))

= \log\left( \sum_Z P(Z \mid X; \theta_n) \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)} \right) - \log(P(X; \theta_n))

\ge \sum_Z P(Z \mid X; \theta_n) \log\left( \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n) \, P(X; \theta_n)} \right) \quad since \sum_Z P(Z \mid X; \theta_n) = 1

So,

\arg\max_{\theta} \left( LL(X; \theta) - LL(X; \theta_n) \right) = \arg\max_{\theta} \sum_Z P(Z \mid X; \theta_n) \log(P(X, Z; \theta)) = \arg\max_{\theta} E_Z(\log(P(X, Z; \theta)))

where E_Z(.) is the expectation of the log likelihood of the complete data w.r.t. Z.

Why expectation of Z?

If the log likelihood is a linear function of Z, then the expectation can be carried inside the log likelihood and E(Z) is computed.

The above is true when the hidden variables form a mixture of distributions (e.g., in tosses of two coins), and each distribution is an exponential-family distribution like the multinomial/normal/Poisson.

Recommended