Speech, NLP and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lectures 7, 9 and 10: Theoretical Underpinnings - Maximum Likelihood and Maximum Entropy Principles
(Lecture 8 was on NLTK, by Abhijit)
Fundamental principles of machine learning
Learning in a vacuum is impossible: prior knowledge matters.
Inductive bias: what to learn, and in what form to learn it, are pre-decided.
Structure learning and parameter learning
Structure: parts and their relationships
Parameters: probabilities
Example (1/2): transition table

      ^    N    V    O    .
 ^    0   0.6  0.2  0.2   0
 N    0   0.1  0.4  0.3  0.2
 V    0   0.3  0.1  0.3  0.3
 O    0   0.3  0.2  0.3  0.2
 .    1    0    0    0    0

This transition table will change from language to language due to language divergences.
(Figure: partial sequence graph over the states ^, N, V, ., showing transitions such as N to N and N to V.)
Example (2/2): Lexical Probability Table
Size of this table = (# POS tags in tagset) x (vocabulary size);
vocabulary size = # unique words in the corpus

      ε      people    laugh     ...  ...
 ^    1      0         0         ...  0
 N    0      1x10^-3   1x10^-5   ...  ...
 V    0      1x10^-6   1x10^-3   ...  ...
 O    0      0         1x10^-9   ...  ...
 .    1      0         0         0    0
Structure and parameter
N, "people", next tag N: (1x10^-3) x 0.1
N, "laugh", next tag N: (1x10^-5) x 0.1
N, "people", next tag V: (1x10^-3) x 0.4
N, "laugh", next tag V: (1x10^-5) x 0.4
…
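A minimal sketch, assuming made-up dictionary encodings of the two tables above, of how structure (the tag sequence) and parameters (lexical and transition probabilities) combine to score a candidate tagging:

```python
# Hedged sketch: score a tagged sequence as the product of
# P(tag_i | tag_{i-1}) * P(word_i | tag_i), using the illustrative
# values from the transition and lexical tables above.
trans = {('^', 'N'): 0.6, ('N', 'N'): 0.1, ('N', 'V'): 0.4, ('V', '.'): 0.3}
lex = {('N', 'people'): 1e-3, ('N', 'laugh'): 1e-5, ('V', 'laugh'): 1e-3}

def score(words, tags):
    p = 1.0
    prev = '^'                                   # sentence-start pseudo tag
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * lex.get((t, w), 0.0)
        prev = t
    return p

# 'people laugh' tagged N V: (0.6 * 1e-3) * (0.4 * 1e-3) = 2.4e-7
print(score(['people', 'laugh'], ['N', 'V']))
```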
PCFG rules (structure + parameter)
S → NP VP    1.0
NP → DT NN   0.5
NP → NNS     0.3
NP → NP PP   0.2
PP → P NP    1.0
VP → VP PP   0.6
VP → VBD NP  0.4
DT → the       1.0
NN → gunman    0.5
NN → building  0.5
VBD → sprayed  1.0
NNS → bullets  1.0
P → with       1.0
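As a hedged illustration of "structure + parameter", the probability of a derivation under a PCFG is the product of the probabilities of the rules it uses. The sketch below encodes the rule table above; the particular derivation scored (PP attached to the VP) is an assumption made for this example.

```python
# Hedged sketch: a PCFG derivation's probability is the product of its
# rule probabilities. Rules mirror the slide's table.
rules = {
    ('S', ('NP', 'VP')): 1.0, ('NP', ('DT', 'NN')): 0.5,
    ('NP', ('NNS',)): 0.3, ('NP', ('NP', 'PP')): 0.2,
    ('PP', ('P', 'NP')): 1.0, ('VP', ('VP', 'PP')): 0.6,
    ('VP', ('VBD', 'NP')): 0.4, ('DT', ('the',)): 1.0,
    ('NN', ('gunman',)): 0.5, ('NN', ('building',)): 0.5,
    ('VBD', ('sprayed',)): 1.0, ('NNS', ('bullets',)): 1.0,
    ('P', ('with',)): 1.0,
}

def parse_prob(derivation):
    p = 1.0
    for lhs, rhs in derivation:
        p *= rules[(lhs, rhs)]
    return p

# "the gunman sprayed the building with bullets", PP attached to the VP:
deriv = [('S', ('NP', 'VP')), ('NP', ('DT', 'NN')), ('DT', ('the',)),
         ('NN', ('gunman',)), ('VP', ('VP', 'PP')), ('VP', ('VBD', 'NP')),
         ('VBD', ('sprayed',)), ('NP', ('DT', 'NN')), ('DT', ('the',)),
         ('NN', ('building',)), ('PP', ('P', 'NP')), ('P', ('with',)),
         ('NP', ('NNS',)), ('NNS', ('bullets',))]
print(parse_prob(deriv))   # 0.0045
```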
Expectation Maximization
One of the key ideas of Statistical AI, ML, NLP, CV
Iterative procedure: find parameters, find hidden variables, maximize data likelihood.
The coin tossing problem
Case of 1 coin: suppose there are N tosses of a coin, and N_H is the number of heads. What is the probability of a head, i.e., P_H?
Observed variable
#Observations = N
X : x_1, x_2, x_3, …, x_N
where x_i = 1 when the i-th toss produces a head, and 0 otherwise.
Therefore P_H = \sum_{i=1}^{N} x_i / N
Each observation is a Bernoulli trial, where
P_H is the probability of success, i.e., getting a head, and
1 - P_H is the probability of failure, i.e., getting a tail.
Likelihood of X
The likelihood of X, i.e., the probability of the observation sequence X, is:
L(X, P_H) = \prod_{i=1}^{N} P_H^{x_i} (1 - P_H)^{1 - x_i}
Each trial is identical and independent. Maximizing the likelihood of the data requires setting
dL / dP_H = 0
and thus getting the expression for P_H.
Mathematical Convenience
Take the log of the likelihood:
LL(X; P_H) = \sum_{i=1}^{N} [ x_i \log P_H + (1 - x_i) \log(1 - P_H) ]
Differentiating w.r.t. P_H:
dLL / dP_H = \sum_{i=1}^{N} [ x_i / P_H - (1 - x_i) / (1 - P_H) ]
To get the expression for P_H, set dLL / dP_H = 0.
Equating to 0, the expression for P_H:
\sum_{i=1}^{N} x_i / P_H = ( N - \sum_{i=1}^{N} x_i ) / (1 - P_H)
which gives
P_H = \sum_{i=1}^{N} x_i / N
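A minimal sketch of the closed form just derived; the toss sequence is made up for illustration.

```python
# MLE for the one-coin problem: with x_i = 1 for a head and 0 for a tail,
# the maximum-likelihood estimate is P_H = (sum of x_i) / N.
tosses = [1, 0, 1, 1, 0, 1, 0, 1]
p_h = sum(tosses) / len(tosses)
print(p_h)   # 5 heads in 8 tosses -> 0.625
```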
Maximum Entropy
Suppose we do not know how to get the MLE, or the likelihood expression is impossible to obtain; then we use maximum entropy, for example in problems like co-reference resolution.
Entropy (to be elaborated later):
H = -[ P_H \log P_H + (1 - P_H) \log(1 - P_H) ]
Case for Expectation Maximization
Instead of one coin we toss two coins.
Parameters <P, P1, P2>:
P = probability of choosing the first coin
P1 = probability of a head from the first coin
P2 = probability of a head from the second coin
We do not know which coin each observation came from.
X : x_1, x_2, x_3, …, x_N
EM continued..
Z_1, Z_2, Z_3, …, Z_N is the hidden sequence running alongside X_1, X_2, X_3, …, X_N,
where Z_i = 1 if the i-th observation came from coin 1, and 0 otherwise.
Z : z_1, z_2, z_3, …, z_N;   θ = <P, P1, P2>
\Pr(X; θ) = \sum_Z \Pr(X, Z; θ)
Y : x_1 z_1, x_2 z_2, x_3 z_3, …, x_N z_N
Cntd.
We want to work with
\Pr(Y; θ) = P(X, Z; θ) = \prod_{i=1}^{N} ( P \cdot P1^{x_i} (1 - P1)^{1 - x_i} )^{z_i} \cdot ( (1 - P) \cdot P2^{x_i} (1 - P2)^{1 - x_i} )^{1 - z_i}
LL(X; θ) = \log( \sum_Z P(X, Z; θ) )
Invoke convexity/concavity and the expectation of Z_i, and work with \log(\Pr(Y; θ)).
Log Likelihood of the Data
LL(X; θ) = \sum_{i=1}^{N} [ E(z_i) ( \log P + x_i \log P1 + (1 - x_i) \log(1 - P1) )
           + (1 - E(z_i)) ( \log(1 - P) + x_i \log P2 + (1 - x_i) \log(1 - P2) ) ]
IMPORTANT POINTS TO NOTE
Log moves inside the product term.
The Σ over Z disappears, giving rise to E(Z_i) in place of Z_i.
Differentiate w.r.t. P, P1, P2, equate to 0, and get the results.
P, P1, P2
p = \sum_{i=1}^{N} E(z_i) / N
p1 = \sum_{i=1}^{N} E(z_i) x_i / \sum_{i=1}^{N} E(z_i)
p2 = ( M - \sum_{i=1}^{N} E(z_i) x_i ) / ( N - \sum_{i=1}^{N} E(z_i) )
where M = observed number of heads.
E(z_i) = P(z_i = 1 | x_i) = P(x_i | z_i = 1) \cdot P(z_i = 1) / P(x_i)
       = P \cdot P1^{x_i} (1 - P1)^{1 - x_i} / [ P \cdot P1^{x_i} (1 - P1)^{1 - x_i} + (1 - P) \cdot P2^{x_i} (1 - P2)^{1 - x_i} ]
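Putting the E-step (the E(z_i) formula above) and the M-step (the closed-form updates for P, P1, P2) together gives the following hedged sketch; the observations and starting values are invented for illustration.

```python
# Hedged sketch of the full EM loop for the two-coin problem.
xs = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]              # observed tosses (made up)
P, P1, P2 = 0.6, 0.7, 0.4                        # initial guesses (made up)
N = len(xs)

for _ in range(50):
    # E-step: E(z_i) = P*P1^x*(1-P1)^(1-x) / (that + (1-P)*P2^x*(1-P2)^(1-x))
    ez = []
    for x in xs:
        a = P * (P1 if x else 1 - P1)            # coin-1 responsibility mass
        b = (1 - P) * (P2 if x else 1 - P2)      # coin-2 responsibility mass
        ez.append(a / (a + b))
    # M-step: the closed-form updates derived above
    M = sum(xs)                                  # observed number of heads
    heads1 = sum(e * x for e, x in zip(ez, xs))  # expected heads from coin 1
    P = sum(ez) / N
    P1 = heads1 / sum(ez)
    P2 = (M - heads1) / (N - sum(ez))

print(P, P1, P2)
```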
Another application of EM
WSD
Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, It Takes Two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
Definition: WSD
Given a context, get the "meaning" of
a set of words (targeted WSD), or
all words (all-words WSD).
The "meaning" is usually given by the ID of senses in a sense repository, usually the wordnet.
Example: “operation” (from Princeton WordNet)
1. operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1
2. operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1
3. mathematical process, mathematical operation, operation -- ((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1
WSD for ALL Indian languages; the critical resource: INDOWORDNET
(Figure: the IndoWordNet constellation, with the Hindi Wordnet at the hub, linked to the English Wordnet and to the Marathi, Sanskrit, Bengali, Punjabi, Konkani, Urdu, Gujarati, Oriya, Kashmiri, Dravidian-language and North East-language wordnets.)
Synset Based Multilingual Dictionary
Expansion approach for creating wordnets [Mohanty et al., 2008]:
Instead of creating synsets from scratch, link to the synsets of an existing wordnet.
Relations get borrowed from the existing wordnet.
(Figure: a sample entry from the MultiDict; the same synset graph S1-S7 replicated and linked across the Hindi and Marathi wordnets.)
Hypothesis
Sense distributions across languages are invariant!
The proportion of times a sense appears in a language is uniform across languages.
E.g., the proportion of times a given sense of "sun" appears, through "sun" and its synonyms, remains the same in any language!
ESTIMATING SENSE DISTRIBUTIONS
If a sense-tagged Marathi corpus were available, we could have estimated the sense distribution directly from counts.
But such a corpus is not available.
EM for estimating sense distributions
Problem: 'galaa' itself is ambiguous, so its raw count cannot be used as is.
Solution: its count should be weighted by the probability of the sense in question, as in the E-step below.
Word correspondences

Sense in English | S_mar (Marathi sense number) | words_mar (partial list) | S_hin = π(S_mar) (projected Hindi sense number) | words_hin (partial list of words in projected Hindi sense)
Neck    | 1 | maan, greeva           | 1 | gardan, galaa
Respect | 2 | maan, satkaar, sanmaan | 3 | izzat, aadar
Voice   | 3 | awaaz, swar            | 2 | galaa
EM for estimating sense distributions

E-step (estimate Hindi sense probabilities from Marathi counts; sense indices as in the table above):
P(S_1^{hin} | galaa) =
  [ P(S_1^{mar} | maan) \cdot #(maan) + P(S_1^{mar} | greeva) \cdot #(greeva) ]
  / [ P(S_1^{mar} | maan) \cdot #(maan) + P(S_1^{mar} | greeva) \cdot #(greeva) + P(S_3^{mar} | awaaz) \cdot #(awaaz) + P(S_3^{mar} | swar) \cdot #(swar) ]

M-step (estimate Marathi sense probabilities from Hindi counts):
P(S_1^{mar} | maan) =
  [ P(S_1^{hin} | gardan) \cdot #(gardan) + P(S_1^{hin} | galaa) \cdot #(galaa) ]
  / [ P(S_1^{hin} | gardan) \cdot #(gardan) + P(S_1^{hin} | galaa) \cdot #(galaa) + P(S_3^{hin} | aadar) \cdot #(aadar) + P(S_3^{hin} | izzat) \cdot #(izzat) ]
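A hedged sketch of the E-step above for the Hindi word 'galaa'. The Marathi corpus counts and the current Marathi sense probabilities are invented; sense indices follow the word-correspondence table (Marathi sense 1 = neck, Marathi sense 3 = voice).

```python
# Hedged sketch of one E-step update: P(S1_hin | galaa) estimated from
# the counts of cross-linked Marathi words, weighted by their current
# Marathi sense probabilities.
count = {'maan': 40, 'greeva': 10, 'awaaz': 25, 'swar': 15}    # #(word), made up
p_mar = {('S1', 'maan'): 0.5, ('S1', 'greeva'): 1.0,           # P(S_mar | word)
         ('S3', 'awaaz'): 0.8, ('S3', 'swar'): 0.9}

# Numerator: Marathi words cross-linked to the 'neck' sense of 'galaa'
num = (p_mar[('S1', 'maan')] * count['maan']
       + p_mar[('S1', 'greeva')] * count['greeva'])
# Denominator adds the words cross-linked to the 'voice' sense
den = num + (p_mar[('S3', 'awaaz')] * count['awaaz']
             + p_mar[('S3', 'swar')] * count['swar'])

print(num / den)   # estimated P(S1_hin | galaa)
```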
General Algo

E-step:
P(S_i^{L1} | u) = \sum_{v \in crosslinks_{L2}(u, S_i^{L1})} P(\pi(S_i^{L1}) | v) \cdot #(v)
  / \sum_{S_j^{L1} \in senses_{L1}(u)} \sum_{v \in crosslinks_{L2}(u, S_j^{L1})} P(\pi(S_j^{L1}) | v) \cdot #(v)      …(2)

M-step:
P(S_k^{L2} | w) = \sum_{y \in crosslinks_{L1}(w, S_k^{L2})} P(\pi(S_k^{L2}) | y) \cdot #(y)
  / \sum_{S_m^{L2} \in senses_{L2}(w)} \sum_{y \in crosslinks_{L1}(w, S_m^{L2})} P(\pi(S_m^{L2}) | y) \cdot #(y)      …(3)

where u is a word in language L1, w is a word in language L2, \pi maps a sense of one language to the cross-linked sense of the other, and #(.) denotes the corpus count of a word.
Results (Marathi)

Algorithm                                                      | P %   | R %   | F %
IWSD (training on self corpora; no parameter projection)       | 81.29 | 80.42 | 80.85
IWSD (training on Hindi and projecting parameters for Marathi) | 73.45 | 70.33 | 71.86
EM (no sense-tagged corpora in either Hindi or Marathi)        | 68.57 | 67.93 | 68.25
Wordnet baseline                                               | 58.07 | 58.07 | 58.07
Results & Discussions
Performance of projection using manual cross linkages (MCL) is within 7% of self-training.
Performance of projection using probabilistic cross linkages (PCL) is within 10-12% of self-training; remarkable, since no additional cost is incurred in the target language.
Both MCL and PCL give a 10-14% improvement over the Wordnet first-sense baseline.
It is not prudent to stick to knowledge-based and unsupervised approaches; they come nowhere close to MCL or PCL.
(Chart legend: manual cross linkages; probabilistic cross linkages; skyline where self-training data is available; Wordnet first-sense baseline; state-of-the-art knowledge-based approach; state-of-the-art unsupervised approach; our values.)
Convexity
Motivation: argmax computation
Statistical spell checking
Automatic speech recognition
Part-of-speech tagging
Probabilistic parsing
Statistical machine translation
Some general observations
A* = argmax_A [P(A|B)]
   = argmax_A [P(A) . P(B|A)]
Computing and using P(A) and P(B|A) both need:
(i) looking at the internal structures of A and B
(ii) making independence assumptions
(iii) putting together a computation from smaller parts
Problem 1: Spell checker: apply Bayes rule
W* = argmax_W [P(W|T)] = argmax_W [P(W) . P(T|W)]
where W = correct word, T = misspelt word.
Why apply Bayes rule? Finding P(W|T) vs. P(T|W)?
Assumptions:
T is obtained from W by a single error.
The words consist only of alphabetic characters. (Jurafsky and Martin, Speech and Language Processing, 2000)
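A minimal sketch of this noisy-channel argmax; the prior and error-model values below are invented purely for illustration.

```python
# Hedged sketch: W* = argmax_W P(W) * P(T | W) over a candidate list.
prior = {'their': 0.006, 'there': 0.009, 'the': 0.06}          # P(W), made up
error = {('thier', 'their'): 0.01, ('thier', 'there'): 0.001,
         ('thier', 'the'): 0.0001}                              # P(T|W), made up

def correct(t, candidates):
    return max(candidates, key=lambda w: prior[w] * error.get((t, w), 0.0))

print(correct('thier', ['their', 'there', 'the']))   # -> 'their'
```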
Problem 2: Isolated word recognition
Problem definition: given a sequence of speech signals, identify the words.
2 steps:
Segmentation (word boundary detection)
Identify the word
Isolated word recognition: identify W given the speech signal SS:
W^ = argmax_W P(W | SS)
Problem 3: Statistical MT
"Find the English translation e corresponding to a given foreign sentence f."
Thus, we seek e_best such that
e_best = argmax_e P(e|f) = argmax_e [P(e) * P(f|e)]
Language model: P(e); translation model: P(f|e)
Translations are produced on the basis of a statistical model.
Parameters are estimated using bilingual parallel corpora.
Convexity: utility
Jensen’s inequality
Kullback–Leibler distance/divergence
EM formulation
(Figure: a convex function f; at z = λx_1 + (1-λ)x_2, the chord value λf(x_1) + (1-λ)f(x_2) lies above the function value f(λx_1 + (1-λ)x_2).)
Criteria for convexity
A function f(x) is said to be convex in the interval [a,b] iff
f(λx_1 + (1-λ)x_2) <= λ f(x_1) + (1-λ) f(x_2)
for all x_1, x_2 \in [a,b] and all λ \in [0,1].
Jensen’s inequality
For any convex function f(x):
f( \sum_{i=1}^{n} λ_i x_i ) <= \sum_{i=1}^{n} λ_i f(x_i)
where \sum_{i=1}^{n} λ_i = 1 and 0 <= λ_i <= 1.
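A quick numeric sanity check of the inequality, using the convex function f(x) = -log x (which reappears in the KL-divergence proof later); the points and weights are arbitrary.

```python
# Check f(sum(l_i * x_i)) <= sum(l_i * f(x_i)) for convex f(x) = -log x,
# with l_i >= 0 summing to 1.
import math

xs = [0.5, 2.0, 4.0]          # arbitrary points
ls = [0.2, 0.5, 0.3]          # arbitrary convex weights
f = lambda x: -math.log(x)

lhs = f(sum(l * x for l, x in zip(ls, xs)))
rhs = sum(l * f(x) for l, x in zip(ls, xs))
print(lhs <= rhs)             # True
```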
Proof of Jensen's inequality
Method: by induction on N.
Base case (N = 1):
f(λ_1 x_1) = λ_1 f(x_1) where λ_1 = 1; trivially true.
Another base case: N = 2
f(λ_1 x_1 + λ_2 x_2) = f(λ_1 x_1 + (1 - λ_1) x_2)        since λ_1 + λ_2 = 1
                    <= λ_1 f(x_1) + (1 - λ_1) f(x_2)      since f(x) is convex
Hypothesis
Suppose the claim is true for n = k, i.e.,
f( \sum_{i=1}^{k} λ_i x_i ) <= \sum_{i=1}^{k} λ_i f(x_i)
Induction Step
Show that
f( \sum_{i=1}^{k+1} λ_i x_i ) <= \sum_{i=1}^{k+1} λ_i f(x_i)
given λ_1 + λ_2 + λ_3 + … + λ_{k+1} = 1.
Proof
f(λ_1 x_1 + λ_2 x_2 + λ_3 x_3 + … + λ_{k+1} x_{k+1})
  = f( (1 - λ_{k+1}) \sum_{i=1}^{k} ( λ_i / (1 - λ_{k+1}) ) x_i + λ_{k+1} x_{k+1} )
  <= (1 - λ_{k+1}) f( \sum_{i=1}^{k} μ_i x_i ) + λ_{k+1} f(x_{k+1})      by convexity,
where μ_i = λ_i / (1 - λ_{k+1}).
Continued...
Examine the μ_i:
\sum_{i=1}^{k} μ_i = ( λ_1 + λ_2 + λ_3 + … + λ_k ) / (1 - λ_{k+1})
                   = (1 - λ_{k+1}) / (1 - λ_{k+1}) = 1
because λ_1 + λ_2 + λ_3 + … + λ_{k+1} = 1.
Continued...
Therefore, finally, at the induction step:
(1 - λ_{k+1}) f( \sum_{i=1}^{k} μ_i x_i ) + λ_{k+1} f(x_{k+1})
  <= (1 - λ_{k+1}) \sum_{i=1}^{k} μ_i f(x_i) + λ_{k+1} f(x_{k+1})      by the hypothesis
  = \sum_{i=1}^{k} λ_i f(x_i) + λ_{k+1} f(x_{k+1})
  = \sum_{i=1}^{k+1} λ_i f(x_i)
Thus Jensen's inequality is proved.
KL divergence
We will use the discrete form of probability distributions.
Given two probability distributions P, Q on the random variable
X : x_1, x_2, x_3, …, x_N:
P : p_1 = p(x_1), p_2 = p(x_2), …, p_N = p(x_N)
Q : q_1 = q(x_1), q_2 = q(x_2), …, q_N = q(x_N)
KLD definition
D_{KL}(P, Q) = \sum_{i=1}^{N} p_i \log( p_i / q_i )
D_{KL} is asymmetric and >= 0. It can also be written as
D_{KL}(P, Q) = \sum_i p_i \log p_i - \sum_i p_i \log q_i = E_P(\log p) - E_P(\log q)
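A small sketch of the definition just given; the two distributions are arbitrary.

```python
# D_KL(P, Q) = sum_i p_i * log(p_i / q_i), discrete form.
import math

def kld(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kld(P, Q), kld(Q, P))   # asymmetric, both >= 0
```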
Proof: KLD >= 0
Claim: D_{KL}(P, Q) = \sum_{i=1}^{N} p_i \log( p_i / q_i ) >= 0
Proof:
\sum_{i=1}^{N} p_i \log( p_i / q_i ) = \sum_{i=1}^{N} p_i ( -\log( q_i / p_i ) )
and -\log x is convex in x.
Proof cntd.
Apply Jensen's inequality:
\sum_{i=1}^{N} p_i ( -\log( q_i / p_i ) ) >= -\log( \sum_{i=1}^{N} p_i ( q_i / p_i ) )
  = -\log( \sum_{i=1}^{N} q_i ) = -\log 1 = 0
So D_{KL}(P, Q) >= 0.
Convexity of -log x
To show: -\log( λ x_1 + (1-λ) x_2 ) <= -λ \log x_1 - (1-λ) \log x_2,
i.e., \log( λ x_1 + (1-λ) x_2 ) >= λ \log x_1 + (1-λ) \log x_2,
i.e., λ x_1 + (1-λ) x_2 >= x_1^{λ} x_2^{1-λ}.
Dividing by x_2 and setting y = x_1 / x_2, this becomes
λ y + (1-λ) >= y^{λ},
which holds for all y > 0, with equality at y = 1.
Interesting problem
Try to prove (the weighted AM-GM inequality):
( w_1 x_1 + w_2 x_2 ) / ( w_1 + w_2 ) >= ( x_1^{w_1} x_2^{w_2} )^{1/(w_1 + w_2)}
2nd definition of convexity
Theorem: If f(x) is twice differentiable in [a,b] and f''(x) >= 0 for x \in [a,b], then f(x) is convex in [a,b]. So -\log x is convex.
Lemma 1
If f''(x) >= 0 in [a,b], then f'(t) >= f'(s) for t >= s, with s, t \in [a,b].
(Figure: points ordered a, s, z, t, b on the interval.)
Mean Value Theorem
For any differentiable function f(x): f(n) - f(m) = (n - m) f'(p), where m <= p <= n.
In particular, f(z) - f(a) = (z - a) f'(s), with s \in (a, z).
Alternative form of z
z = λ x_1 + (1-λ) x_2
Add -λz to both sides:
(1-λ) z = λ (x_1 - z) + (1-λ) x_2,  i.e.,  λ (z - x_1) = (1-λ) (x_2 - z)
Alternative form of convexity
f( λ x_1 + (1-λ) x_2 ) <= λ f(x_1) + (1-λ) f(x_2)
Add -λ f(z) to both sides, where z = λ x_1 + (1-λ) x_2:
(1-λ) f(z) <= λ ( f(x_1) - f(z) ) + (1-λ) f(x_2)
i.e., λ ( f(z) - f(x_1) ) <= (1-λ) ( f(x_2) - f(z) )
Proof: second derivative >= 0 implies convexity (1/2)
We have that
z = λ x_1 + (1-λ) x_2
and we must show f(z) <= λ f(x_1) + (1-λ) f(x_2), i.e.,
λ [ f(z) - f(x_1) ] <= (1-λ) [ f(x_2) - f(z) ]      (1)
We also know, from the alternative form of z, that
λ (z - x_1) = (1-λ) (x_2 - z)      (2)
Second derivative >= 0 implies convexity (2/2)
By the mean value theorem,
f(z) - f(x_1) = (z - x_1) f'(s)  and  f(x_2) - f(z) = (x_2 - z) f'(t)
for some s and t where x_1 <= s <= z <= t <= x_2.
Now, since f''(x) >= 0, f'(t) >= f'(s).
Combining this with (2), inequality (1) follows, and the result is proved.
Why all this
In EM, we maximize the expectation of the log likelihood of the data.
Log is a concave function.
We have to take iterative steps to get to the maximum.
There are two unknown values: Z (unobserved data) and θ (parameters).
From θ, get a new value of Z (E-step); from Z, get a new value of θ (M-step).
Recap: a simple EM situation
Toss of two coins:
Parameters <P, P1, P2>:
P = probability of choosing the first coin
P1 = probability of a head from the first coin
P2 = probability of a head from the second coin
We do not know which coin each observation came from.
X : x_1, x_2, x_3, …, x_N
EM continued..
Z_1, Z_2, Z_3, …, Z_N is the hidden sequence running alongside X_1, X_2, X_3, …, X_N,
where Z_i = 1 if the i-th observation came from coin 1, and 0 otherwise.
Z : z_1, z_2, z_3, …, z_N;   θ = <P, P1, P2>
\Pr(X; θ) = \sum_Z \Pr(X, Z; θ)
Y : x_1 z_1, x_2 z_2, x_3 z_3, …, x_N z_N
Cntd.
We want to work with
\Pr(Y; θ) = P(X, Z; θ) = \prod_{i=1}^{N} ( P \cdot P1^{x_i} (1 - P1)^{1 - x_i} )^{z_i} \cdot ( (1 - P) \cdot P2^{x_i} (1 - P2)^{1 - x_i} )^{1 - z_i}
LL(X; θ) = \log( \sum_Z P(X, Z; θ) )
Invoke convexity/concavity and the expectation of Z_i, and work with \log(\Pr(Y; θ)).
Log Likelihood of the Data
LL(X; θ) = \sum_{i=1}^{N} [ E(z_i) ( \log P + x_i \log P1 + (1 - x_i) \log(1 - P1) )
           + (1 - E(z_i)) ( \log(1 - P) + x_i \log P2 + (1 - x_i) \log(1 - P2) ) ]
IMPORTANT POINTS TO NOTE
Log moves inside the product term.
The Σ over Z disappears, giving rise to E(Z_i) in place of Z_i.
Differentiate w.r.t. P, P1, P2, equate to 0, and get the results.
P, P1, P2
p = \sum_{i=1}^{N} E(z_i) / N
p1 = \sum_{i=1}^{N} E(z_i) x_i / \sum_{i=1}^{N} E(z_i)
p2 = ( M - \sum_{i=1}^{N} E(z_i) x_i ) / ( N - \sum_{i=1}^{N} E(z_i) )
where M = observed number of heads.
E(z_i) = P(z_i = 1 | x_i) = P(x_i | z_i = 1) \cdot P(z_i = 1) / P(x_i)
       = P \cdot P1^{x_i} (1 - P1)^{1 - x_i} / [ P \cdot P1^{x_i} (1 - P1)^{1 - x_i} + (1 - P) \cdot P2^{x_i} (1 - P2)^{1 - x_i} ]
How to find θ
How to choose the next θ? Take
θ_{n+1} = argmax_θ ( LL(X, Z : θ) - LL(X, Z : θ_n) )
where
X: observed data
Z: unobserved data
θ: parameters
LL(X, Z : θ_n): log likelihood of the complete data with parameter value at θ_n.
This is in lieu of, for example, gradient ascent. At every step LL(.) will increase, ultimately reaching a local/global maximum.
(Figure: successive values θ_n moving through the parameter space Θ.)
Why expectation of log likelihood? (1/3)
P(X : θ) is the observation likelihood.
We deal with P(X, Z : θ), marginalized over Z.
\log( \sum_Z P(X, Z : θ) ) is mathematically processed by multiplying and dividing by P(Z | X : θ_n), which for each Z is between 0 and 1 and sums to 1 over Z.
Why expectation of log likelihood? (2/3)
Multiply and divide by P(Z | X; θ_n), which is a probability distribution over Z at θ_n:
\log( \sum_Z P(X, Z; θ) ) = \log( \sum_Z P(Z | X; θ_n) \cdot P(X, Z; θ) / P(Z | X; θ_n) )
Then Jensen's inequality (log is concave) gives
\log( \sum_Z P(X, Z; θ) ) >= \sum_Z P(Z | X; θ_n) \log( P(X, Z; θ) / P(Z | X; θ_n) )
Why expectation of log likelihood? (3/3)
LL(X; θ) - LL(X; θ_n) = \log( P(X; θ) ) - \log( P(X; θ_n) )
  >= \sum_Z P(Z | X; θ_n) \log( P(X, Z; θ) / P(Z | X; θ_n) ) - \log( P(X; θ_n) )
  = \sum_Z P(Z | X; θ_n) \log( P(X, Z; θ) / ( P(Z | X; θ_n) \cdot P(X; θ_n) ) )      since \sum_Z P(Z | X; θ_n) = 1
So,
argmax_θ ( LL(X; θ) - LL(X; θ_n) ) = argmax_θ \sum_Z P(Z | X; θ_n) \log( P(X, Z; θ) ) = argmax_θ E_Z( \log( P(X, Z; θ) ) )
where E_Z(.) is the expectation of the log likelihood of the complete data w.r.t. Z.
Why expectation of Z?
If the log likelihood is a linear function of Z, then the expectation can be carried inside the log likelihood, and E(Z) is computed.
The above is true when the hidden variables form a mixture of distributions (e.g., in tosses of two coins), and
each distribution is an exponential-family distribution like the multinomial/normal/Poisson.
Application of EM: HMM Training
Baum-Welch, or the forward-backward algorithm
A problem scenario
Unsupervised POS tagging: convert the Brown corpus into a corpus with ONLY the following tags:
N (noun), V (verb), J (adjective), R (adverb), F (function words like prepositions and conjunctions), A (articles 'a', 'an', 'the') and O (others).
Assume a raw corpus and then create a POS tagger.
Key Intuition
Given: training sequence
Initialization: probability values
Compute: Pr(state seq | training seq), get the expected counts of transitions, and compute rule probabilities.
Approach: initialize the probabilities and recompute them iteratively (an EM-like approach).
(Figure: a two-state automaton over q and r, with a- and b-labeled arcs in both directions.)
Baum-Welch algorithm: counts
String = abb aaa bbb aaa
Sequence of states with respect to input symbols:
o/p seq:    a b b a a a b b b a a a
state seq: q r q q r q r q q q r q r
(Figure: a two-state automaton over q and r, all arcs labeled a, b.)
Calculating probabilities from the table
Table of counts (T = #states, A = #alphabet symbols):

Src | Dest | O/P | Count
 q  |  r   |  a  |   5
 q  |  q   |  b  |   3
 r  |  q   |  a  |   3
 r  |  q   |  b  |   2

P(q -b-> q) = 3/8     P(q -a-> r) = 5/8

In general:
P(s_i -w_k-> s_j) = c(s_i -w_k-> s_j) / \sum_{l=1}^{T} \sum_{m=1}^{A} c(s_i -w_m-> s_l)

Now, if transitions are non-deterministic, then multiple state sequences are possible for a given o/p sequence (see the previous slide's figure). Our aim is to find the expected counts in that case.
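A minimal sketch of this normalization: divide each arc's count by the total count of arcs leaving its source state.

```python
# Turn transition counts into probabilities by normalizing over all
# arcs that leave the same source state, as in the formula above.
from collections import defaultdict

counts = {('q', 'r', 'a'): 5, ('q', 'q', 'b'): 3,
          ('r', 'q', 'a'): 3, ('r', 'q', 'b'): 2}   # (src, dest, o/p): count

out_of = defaultdict(float)                          # mass leaving each state
for (src, dst, sym), c in counts.items():
    out_of[src] += c

prob = {k: c / out_of[k[0]] for k, c in counts.items()}
print(prob[('q', 'r', 'a')])   # 5/8 = 0.625
print(prob[('q', 'q', 'b')])   # 3/8 = 0.375
```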
Interplay Between Two Equations

P(s_i -W_k-> s_j) = C(s_i -W_k-> s_j) / \sum_{l=0}^{T} \sum_{m=0}^{A} C(s_i -W_m-> s_l)

C(s_i -W_k-> s_j) = \sum_{S_{0,n+1}} P(S_{0,n+1} | W_{0,n}) \cdot n(s_i -W_k-> s_j, S_{0,n+1}, W_{0,n})

where n(s_i -W_k-> s_j, S_{0,n+1}, W_{0,n}) is the number of times the transition s_i -> s_j (emitting W_k) occurs in the state sequence S_{0,n+1} for the string W_{0,n}.
Illustration
(Figure: two two-state HMMs over q and r.
Actual (desired) HMM: arcs with probabilities a:0.67, b:0.17, a:0.16, b:1.0.
Initial guess: arcs with probabilities a:0.48, a:0.04, b:0.48, b:1.0.)
One run of the Baum-Welch algorithm: string ababb

State sequence | P(path) | q -a-> r | r -b-> q | q -a-> q | q -b-> q
q r q r q q    | 0.00077 | 0.00154  | 0.00154  | 0        | 0.00077
q r q q q q    | 0.00442 | 0.00442  | 0.00442  | 0.00442  | 0.00884
q q q r q q    | 0.00442 | 0.00442  | 0.00442  | 0.00442  | 0.00884
q q q q q q    | 0.02548 | 0.0      | 0.0      | 0.05096  | 0.07644
Rounded total  | 0.035   | 0.01     | 0.01     | 0.06     | 0.095
New probabilities (P)    | 0.06     | 1.0      | 0.36     | 0.581

e.g., new P(q -a-> r) = 0.06 = 0.01 / (0.01 + 0.06 + 0.095), normalizing over all arcs out of q.
* ε is considered as the starting and ending symbol of the input sequence string.
Through multiple iterations the probability values will converge.
Computational part (1/2)

C(s_i -W_k-> s_j)
= \sum_{S_{0,n+1}} P(S_{0,n+1} | W_{0,n}) \cdot n(s_i -W_k-> s_j, S_{0,n+1}, W_{0,n})
= (1 / P(W_{0,n})) \sum_{S_{0,n+1}} P(S_{0,n+1}, W_{0,n}) \cdot n(s_i -W_k-> s_j, S_{0,n+1}, W_{0,n})
= (1 / P(W_{0,n})) \sum_{t=0}^{n} P(S_t = s_i, S_{t+1} = s_j, w_t = W_k, W_{0,n})

(Trellis: outputs w_0 w_1 w_2 … w_k … w_{n-1} w_n along states S_0 S_1 … S_i S_j … S_n S_{n+1}.)
Computational part (2/2)

P(S_t = s_i, S_{t+1} = s_j, w_t = W_k, W_{0,n})
= P(S_t = s_i, W_{0,t-1}) \cdot P(S_{t+1} = s_j, w_t = W_k | S_t = s_i) \cdot P(W_{t+1,n} | S_{t+1} = s_j)
= F(t-1, i) \cdot P(s_i -W_k-> s_j) \cdot B(t+1, j)

where F(t, i) = P(w_0 … w_t, S_{t+1} = s_i) is the forward probability and B(t, j) = P(w_t … w_n | S_t = s_j) is the backward probability.

(Trellis: outputs w_0 w_1 w_2 … w_k … w_{n-1} w_n along states S_0 S_1 … S_i S_j … S_n S_{n+1}.)
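A hedged sketch of the whole computation: forward probabilities F, backward probabilities B, and the expected count of one arc as (1/P(W)) Σ_t F(t-1, i) · P(arc) · B(t+1, j). The arc probabilities reuse the values of the "initial guess" HMM from the illustration above, but their assignment to particular arcs there is an assumption made for this example.

```python
# Forward-backward expected counts for one arc of a tiny two-state HMM.
# F[t][s] below is P(w_0..w_{t-1}, state s), i.e., it plays the role of
# F(t-1, .) in the slide's indexing.
states = ['q', 'r']
P = {('q', 'a', 'r'): 0.48, ('q', 'a', 'q'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}   # P(src -(sym)-> dst)
obs = ['a', 'b', 'a', 'b', 'b']
start = 'q'

# Forward pass: F[t][s] = P(first t symbols, state s reached)
F = [{s: (1.0 if s == start else 0.0) for s in states}]
for w in obs:
    F.append({s: sum(F[-1][r] * P.get((r, w, s), 0.0) for r in states)
              for s in states})

# Backward pass: B[t][s] = P(symbols from position t onward | state s)
B = [dict() for _ in range(len(obs) + 1)]
B[len(obs)] = {s: 1.0 for s in states}
for t in range(len(obs) - 1, -1, -1):
    B[t] = {s: sum(P.get((s, obs[t], r), 0.0) * B[t + 1][r] for r in states)
            for s in states}

total = sum(F[len(obs)][s] for s in states)          # P(W_{0,n})
# Expected count of the arc q -(a)-> r over all positions emitting 'a':
c = sum(F[t]['q'] * P[('q', 'a', 'r')] * B[t + 1]['r']
        for t in range(len(obs)) if obs[t] == 'a') / total
print(c)
```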
Discussions
1. Symmetry breaking: a symmetric initialization leads to no change in the values.
   (Figure: a 'desired' asymmetric automaton with arc probabilities b:1.0, b:0.5, a:0.5, a:1.0, vs. a symmetric 'initialized' automaton with arc probabilities a:0.5, b:0.5, a:0.25, a:0.5, b:0.5, a:0.25, b:0.25, b:0.5, which EM cannot pull apart.)
2. Getting stuck in local maxima.
3. Label bias problem: probabilities have to sum to 1, so values can rise only at the cost of a fall in the values of others.