Machine Learning & Data Mining CS/CNS/EE 155
Lecture 14: Hidden Markov Models

Sequence Prediction (POS Tagging)

• x = "Fish Sleep" → y = (N, V)
• x = "The Dog Ate My Homework" → y = (D, N, V, D, N)
• x = "The Fox Jumped Over The Fence" → y = (D, N, V, P, D, N)
Challenges
• Multivariable Output
 – Make multiple predictions simultaneously
• Variable-Length Input/Output
 – Sentence lengths are not fixed
Multivariate Outputs

• x = "Fish Sleep" → y = (N, V)
• Multiclass prediction:
• How many classes?

POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Replicate weights, one weight vector and bias per class:

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}$$

Score all classes:

$$f(x \mid w, b) = \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix}$$

Predict via the largest score (a small sketch of this follows below):

$$\operatorname{argmax}_k \; w_k^T x - b_k$$
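To make the replicated-weights picture concrete, here is a minimal sketch in Python/numpy; the weights, biases, and feature vector are made-up illustration values, not from the lecture:

```python
import numpy as np

# One weight vector w_k and bias b_k per class, stacked as rows (K x d).
# These numbers are hypothetical, chosen only to illustrate the mechanics.
W = np.array([[1.0, 0.0],    # w_1
              [0.0, 1.0],    # w_2
              [0.5, 0.5]])   # w_3
b = np.array([0.0, 0.1, -0.2])

x = np.array([0.3, 0.9])        # feature vector for one input

scores = W @ x - b              # f(x | w, b): one score per class
k_hat = int(np.argmax(scores))  # predict via the largest score
print(scores, k_hat)            # -> [0.3 0.8 0.8] 1
```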
Multiclass Prediction

• x = "Fish Sleep" → y = (N, V)
• Multiclass prediction:
 – Treat every possible length-M sequence as a different class
 – (D,D), (D,N), (D,V), (D,Adj), (D,Adv), (D,Pr), (N,D), (N,N), (N,V), (N,Adj), (N,Adv), …
• L^M classes!
 – Length 2: 6^2 = 36!

POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)

Exponential explosion in the number of classes! (Not tractable for sequence prediction.)
Why is Naïve Multiclass Intractable?

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

– (D,D,D), (D,D,N), (D,D,V), (D,D,Adj), (D,D,Adv), (D,D,Pr)
– (D,N,D), (D,N,N), (D,N,V), (D,N,Adj), (D,N,Adv), (D,N,Pr)
– (D,V,D), (D,V,N), (D,V,V), (D,V,Adj), (D,V,Adv), (D,V,Pr)
– …
– (N,D,D), (N,D,N), (N,D,V), (N,D,Adj), (N,D,Adv), (N,D,Pr)
– (N,N,D), (N,N,N), (N,N,V), (N,N,Adj), (N,N,Adv), (N,N,Pr)
– …

Treats every combination as a different class (learns a model for each combination). This is an exponentially large representation: exponential time to consider every class, and exponential storage.
Independent Classification

• Treat each word independently (assumption)
 – Independent multiclass prediction per word
 – Predict for x = "I" independently
 – Predict for x = "fish" independently
 – Predict for x = "often" independently
 – Concatenate the predictions.

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

#Classes = #POS Tags (6 in our example). Solvable using standard multiclass prediction.
Independent Classification

• Treat each word independently
 – Independent multiclass prediction per word

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

P(y|x)     x="I"   x="fish"   x="often"
y="Det"    0.0     0.0        0.0
y="Noun"   1.0     0.75       0.0
y="Verb"   0.0     0.25       0.0
y="Adj"    0.0     0.0        0.4
y="Adv"    0.0     0.0        0.6
y="Prep"   0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Why the mistake?
Context Between Words

• Independent predictions ignore word pairs
 – In isolation: "fish" is more likely to be a Noun
 – But conditioned on following a (pro)noun, "fish" is more likely to be a Verb!
 – "1st-order" dependence (model all pairs)
  • 2nd order considers all triplets
  • Arbitrary order = exponential size (naïve multiclass)

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)
1st Order Hidden Markov Model

• x = (x1, x2, x3, x4, …, xM): the sequence of words
• y = (y1, y2, y3, y4, …, yM): the sequence of POS tags

• P(xi|yi): probability of state yi generating xi
• P(yi+1|yi): probability of state yi transitioning to yi+1
• P(y1|y0): y0 is defined to be the Start state
• P(End|yM): prior probability of yM being the final state
 – Not always used
Graphical Model Representation

[Figure: chain Y0 → Y1 → Y2 → … → YM → YEnd, where each state Yi emits observation Xi; the End state is optional.]

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$
1st Order Hidden Markov Model

"Joint Distribution":

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

"Conditional Distribution of x given y":

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$$

Given a POS tag sequence y, we can compute each P(xi|yi) independently (xi is conditionally independent of the rest given yi).
1st Order Hidden Markov Model

• Models all state-state pairs (all POS tag-tag pairs): additional complexity of (#POS Tags)².
• Models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass.
Relationship to Naïve Bayes

[Figure: the graphical model chain from before, shown with and without the Y-to-Y transition edges.]

Reduces to a sequence of disjoint Naïve Bayes models (if we ignore the transition probabilities).
P(word | state/tag)

• Two-word language: "fish" and "sleep"
• Two-tag language: "Noun" and "Verb"

(Slides borrowed from Ralph Grishman.)

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Given a tag sequence y:
P("fish sleep" | (N,V)) = 0.8 * 0.5
P("fish fish" | (N,V)) = 0.8 * 0.5
P("sleep fish" | (V,V)) = 0.5 * 0.5
P("sleep sleep" | (N,N)) = 0.2 * 0.2
Sampling

• HMMs are "generative" models
 – They model the joint distribution P(x, y)
 – We can generate samples from this distribution
 – First consider the conditional distribution P(x|y)
 – What about sampling from P(x, y)?

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Given the tag sequence y = (N, V), sample each word independently:
Sample P(x1|N): (0.8 fish, 0.2 sleep)
Sample P(x2|V): (0.5 fish, 0.5 sleep)
Forward Sampling of P(y, x)

A Simple POS HMM (slides borrowed from Ralph Grishman), with transitions:

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.1      0.2
ynext=verb     0.2       0.8      0.1
ynext=end      -         0.1      0.7

and emissions:

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start and i = 0.
1. i = i + 1
2. Sample yi from P(yi | yi−1)
3. If yi == End: quit
4. Sample xi from P(xi | yi)
5. Go to Step 1

Exploits conditional independence. Requires P(End | yi). (A minimal sampler sketch follows below.)
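The loop above can be written directly in Python; here is a minimal sketch using the fish/sleep HMM from these slides (the `draw` helper and dictionary layout are ours, not from the lecture):

```python
import random

trans = {  # P(y_i | y_{i-1}), including the End state
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
    "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7},
}
emit = {  # P(x_i | y_i)
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def draw(dist):
    """Sample a key from a {key: probability} dictionary."""
    return random.choices(list(dist), weights=dist.values())[0]

def forward_sample():
    x, y, state = [], [], "Start"
    while True:
        state = draw(trans[state])   # sample y_i from P(y_i | y_{i-1})
        if state == "End":           # quit when the End state is drawn
            return x, y
        y.append(state)
        x.append(draw(emit[state]))  # sample x_i from P(x_i | y_i)

print(forward_sample())
```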
Forward Sampling of P(y, x | M)

Conditioning on the length M drops the End-state factor:

$$P(x, y \mid M) = \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start and i = 0.
1. i = i + 1
2. If i > M: quit
3. Sample yi from P(yi | yi−1)
4. Sample xi from P(xi | yi)
5. Go to Step 1

Exploits conditional independence. Assumes no P(End | yi).
A Simple POS HMM with no End state (slides borrowed from Ralph Grishman):

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.09     0.667
ynext=verb     0.2       0.91     0.333

1st Order Hidden Markov Model

$$P(x_{k+1:M},\, y_{k+1:M} \mid x_{1:k},\, y_{1:k}) = P(x_{k+1:M},\, y_{k+1:M} \mid y_k)$$

"Memory-less model": we only need yk to model the rest of the sequence.
Viterbi Algorithm

Most Common Prediction Problem

• Given an input sentence, predict the POS tag sequence:

$$\operatorname{argmax}_y P(y \mid x)$$

• Naïve approach:
 – Try all possible y's
 – Choose the one with the highest probability
 – Exponential time: L^M possible y's
Recall: Bayes's Rule

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$$

where

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i), \qquad P(y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1})$$
$$\operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} P(y_M \mid y_{M-1})\, P(x_M \mid y_M)\, P(y_{1:M-1},\, x_{1:M-1})$$

using

$$P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i), \qquad P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1}), \qquad P(y_{1:k},\, x_{1:k}) = P(x_{1:k} \mid y_{1:k})\, P(y_{1:k})$$

Exploit the memory-less property: the choice of yM only depends on y1:M−1 via P(yM | yM−1)!
Dynamic Programming

• Input: x = (x1, x2, x3, …, xM)
• Computed: the best length-k prefix ending in each tag. Examples:

$$\hat{Y}_k(V) = \left[ \operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus V,\; x_{1:k}) \right] \oplus V, \qquad \hat{Y}_k(N) = \left[ \operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus N,\; x_{1:k}) \right] \oplus N$$

where ⊕ denotes sequence concatenation.

• Claim (a recursive definition):

$$\hat{Y}_{k+1}(V) = \left[ \operatorname{argmax}_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k},\, x_{1:k})\, P(y_{k+1} = V \mid y_k)\, P(x_{k+1} \mid y_{k+1} = V) \right] \oplus V$$

The factor P(y1:k, x1:k) is pre-computed.
Solve:

$$\hat{Y}_2(V) = \left[ \operatorname{argmax}_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1,\, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V) \right] \oplus V$$

[Trellis figure: columns of nodes Ŷ1(V), Ŷ1(D), Ŷ1(N) and Ŷ2(V), Ŷ2(D), Ŷ2(N); the candidate edges y1 = V, y1 = D, y1 = N all feed into Ŷ2(V).]

Store each Ŷ1(Z) and P(Ŷ1(Z), x1). Note that Ŷ1(Z) is just Z. Example: Ŷ2(V) = (N, V).

Store each Ŷ2(Z) and P(Ŷ2(Z), x1:2), then solve:

$$\hat{Y}_3(V) = \left[ \operatorname{argmax}_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2},\, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V) \right] \oplus V$$

Claim: we only need to check the solutions Ŷ2(Z), Z = V, D, N.

Suppose instead that Ŷ3(V) = (V, V, V) while Ŷ2(V) = (N, V); then we can prove that (N, V, V) has higher probability, a contradiction. The proof depends on the 1st-order property:
• The probabilities of (V, V, V) and (N, V, V) differ in only three terms: P(y1|y0), P(x1|y1), and P(y2|y1).
• None of these depend on y3!

Continue until the end of the sequence (e.g., Ŷ3(V) = (D, N, V)), storing each Ŷ3(Z) and P(Ŷ3(Z), x1:3) along the way. The final step optionally includes the End-state term:

$$\hat{Y}_M(V) = \left[ \operatorname{argmax}_{y_{1:M-1} \in \{\hat{Y}_{M-1}(T)\}_T} P(y_{1:M-1},\, x_{1:M-1})\, P(y_M = V \mid y_{M-1})\, P(x_M \mid y_M = V)\, P(\text{End} \mid y_M = V) \right] \oplus V$$
Viterbi Algorithm

• Solve:

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$$

• For k = 1..M:
 – Iteratively solve for each Ŷk(Z)
  • Z looping over every POS tag.
• Predict the best ŶM(Z).
• Also known as Maximum A Posteriori (MAP) inference.
Numerical Example

x = (Fish, Sleep). (Slides borrowed from Ralph Grishman.)

A Simple POS HMM, with transitions:

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.1      0.2
ynext=verb     0.2       0.8      0.1
ynext=end      -         0.1      0.7

and emissions:

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Initialize the trellis:

        0   1   2   3
start   1
verb    0
noun    0
end     0

Token 1: fish.

        0   1            2   3
start   1   0
verb    0   .2*.5 = .1
noun    0   .8*.8 = .64
end     0   0

Token 2: sleep. If 'fish' is a verb: verb = .1*.1*.5 = .005, noun = .1*.2*.2 = .004. If 'fish' is a noun: verb = .64*.8*.5 = .256, noun = .64*.1*.2 = .0128. Take the maximum and set back pointers:

        0   1     2       3
start   1   0     0
verb    0   .1    .256
noun    0   .64   .0128
end     0   0     -

Token 3: end. Ending from a verb gives .256*.7; ending from a noun gives .0128*.1. Take the maximum and set back pointers:

        0   1     2       3
start   1   0     0       0
verb    0   .1    .256    -
noun    0   .64   .0128   -
end     0   0     -       .256*.7

Decode by following the back pointers: fish = noun, sleep = verb.
What might go wrong for long sequences? Underflow! Small numbers get repeatedly multiplied together and become exponentially small.
Viterbi Algorithm (with Log Probabilities)

• Solve:

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \left[ \log P(x \mid y) + \log P(y) \right]$$

• For k = 1..M:
 – Iteratively solve for each Ŷk(Z) and its log-probability
  • Z looping over every POS tag.
• Predict the best ŶM(Z).
 – The log-probability accumulates additively, not multiplicatively.
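Here is a minimal sketch of Viterbi in log space for the fish/sleep HMM above; the function name, dictionary layout, and back-pointer bookkeeping are ours, and the output matches the numerical example:

```python
import math

tags = ["Noun", "Verb"]
trans = {"Start": {"Noun": 0.8, "Verb": 0.2},            # P(y_i | y_{i-1})
         "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
         "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7}}
emit = {"Noun": {"fish": 0.8, "sleep": 0.2},             # P(x_i | y_i)
        "Verb": {"fish": 0.5, "sleep": 0.5}}

def viterbi(x):
    prev = {"Start": 0.0}   # log-prob of the best prefix ending in each tag
    back = []               # back pointers, one dict per position
    for word in x:
        cur, bp = {}, {}
        for z in tags:
            # best predecessor: log-probs accumulate additively
            a = max(prev, key=lambda t: prev[t] + math.log(trans[t][z]))
            cur[z] = prev[a] + math.log(trans[a][z]) + math.log(emit[z][word])
            bp[z] = a
        prev = cur
        back.append(bp)
    # fold in the End-state term, then follow the back pointers
    last = max(prev, key=lambda t: prev[t] + math.log(trans[t]["End"]))
    path = [last]
    for bp in reversed(back[1:]):
        path.insert(0, bp[path[0]])
    return path

print(viterbi(["fish", "sleep"]))   # -> ['Noun', 'Verb']
```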
Recap: Independent Classification

• Treat each word independently
 – Independent multiclass prediction per word

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

P(y|x)     x="I"   x="fish"   x="often"
y="Det"    0.0     0.0        0.0
y="Noun"   1.0     0.75       0.0
y="Verb"   0.0     0.25       0.0
y="Adj"    0.0     0.0        0.4
y="Adv"    0.0     0.0        0.6
y="Prep"   0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
The mistake is due to not modeling interactions between multiple words.
Recap: Viterbi

• Models pairwise transitions between states
 – Pairwise transitions between POS tags
 – "1st-order" model

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

x = "I fish often":
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)*
*Assuming we defined P(x, y) properly.
Training HMMs
Supervised Training

• Given: $S = \{(x_i, y_i)\}_{i=1}^{N}$, pairs of word sequences (sentences) and POS tag sequences.
• Goal: estimate P(x, y) using S, where

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

• Maximum likelihood!
Aside: Matrix Formulation

• Define the transition matrix A:
 – Aab = P(yi+1 = a | yi = b), or −log P(yi+1 = a | yi = b)
• Define the observation matrix O:
 – Owz = P(xi = w | yi = z), or −log P(xi = w | yi = z)

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

P(ynext|y)      y="Noun"   y="Verb"
ynext="Noun"    0.09       0.667
ynext="Verb"    0.91       0.333
Aside: Matrix Formulation

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End},\, y_M} \prod_{i=1}^{M} A_{y_i,\, y_{i-1}} \prod_{i=1}^{M} O_{x_i,\, y_i}$$

Log-probability formulation, where each entry of Ã is defined as −log A (and similarly for Õ):

$$-\log P(x, y) = \tilde{A}_{\text{End},\, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i,\, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i,\, y_i}$$
Maximum Likelihood

• Estimate each component separately:

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[(y_j^{i+1} = a) \wedge (y_j^i = b)\right]}}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[y_j^i = b\right]}}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[(x_j^i = w) \wedge (y_j^i = z)\right]}}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[y_j^i = z\right]}}$$

• (Derived via minimizing the negative log likelihood. A counting sketch follows below.)
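A minimal sketch of these counting estimates; the two training sentences below are made up for illustration, and the Start/End transitions are counted via explicit padding:

```python
from collections import Counter

# Hypothetical toy training set S = {(x_j, y_j)}: words paired with tags.
S = [(["fish", "sleep"], ["Noun", "Verb"]),
     (["fish", "fish"],  ["Noun", "Noun"])]

trans_num, trans_den = Counter(), Counter()
emit_num, emit_den = Counter(), Counter()

for words, y in S:
    padded = ["Start"] + y + ["End"]
    for b, a in zip(padded, padded[1:]):   # numerator/denominator counts for A_ab
        trans_num[(a, b)] += 1
        trans_den[b] += 1
    for w, z in zip(words, y):             # numerator/denominator counts for O_wz
        emit_num[(w, z)] += 1
        emit_den[z] += 1

A = {ab: n / trans_den[ab[1]] for ab, n in trans_num.items()}
O = {wz: n / emit_den[wz[1]] for wz, n in emit_num.items()}
print(A[("Verb", "Noun")])   # fraction of Noun states followed by Verb: 1/3
print(O[("fish", "Noun")])   # fraction of Noun states emitting "fish": 1.0
```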
Recap: Supervised Training

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

• Maximum likelihood training
 – Counting statistics
 – Super easy!
 – Why?
• What about the unsupervised case?
Conditional Independence Assumptions

• Everything decomposes into products of pairs
 – I.e., P(yi+1 = a | yi = b) doesn't depend on anything else
• Can just estimate frequencies:
 – How often yi+1 = a when yi = b over the training set
 – Note that P(yi+1 = a | yi = b) is a common model across all locations of all sequences.

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

#Parameters:
 – Transitions A: #Tags²
 – Observations O: #Words × #Tags
This avoids directly modeling word/word pairings (#Tags is in the 10s, while #Words is in the 10,000s).
Unsupervised Training

• What if we have no y's?
 – Just a training set of sentences: $S = \{x_i\}_{i=1}^{N}$ (word sequences)
• Still want to estimate P(x, y)
 – How?
 – Why?

$$\operatorname{argmax} \prod_i P(x_i) = \operatorname{argmax} \prod_i \sum_y P(x_i, y)$$
Why Unsupervised Training?

• Supervised data is hard to acquire
 – Requires annotating POS tags
• Unsupervised data is plentiful
 – Just grab some text!
• Might just work for POS tagging!
 – Learn y's that correspond to POS tags
• Can be used for other tasks
 – Detect outlier sentences (sentences with low probability)
 – Sample new sentences.
EM Algorithm (Baum-Welch)

• If we had y's → maximum likelihood.
• If we had (A, O) → predict y's.
• Chicken vs. egg!

1. Initialize A and O arbitrarily
2. Predict the probability of the y's for each training x ("Expectation Step")
3. Use the y's to estimate new (A, O) ("Maximization Step")
4. Repeat from Step 2 until convergence

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Expectation Step

• Given (A, O)
• For each training x = (x1, …, xM):
 – Predict P(yi) for each y = (y1, …, yM)

             x1    x2    …    xM
P(yi=Noun)   0.5   0.4   …    0.05
P(yi=Det)    0.4   0.6   …    0.25
P(yi=Verb)   0.1   0.0   …    0.7

 – This encodes the current model's beliefs about y
 – It is the "marginal distribution" of each yi
Recall: Matrix Formulation

• Transition matrix A: Aab = P(yi+1 = a | yi = b), or −log P(yi+1 = a | yi = b)
• Observation matrix O: Owz = P(xi = w | yi = z), or −log P(xi = w | yi = z)

(Same fish/sleep transition and emission tables as before.)
Maximization Step

• Maximize likelihood over the marginal distribution.

Supervised (indicator counts):

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[(y_j^{i+1} = a) \wedge (y_j^i = b)\right]}}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[y_j^i = b\right]}}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[(x_j^i = w) \wedge (y_j^i = z)\right]}}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[y_j^i = z\right]}}$$

Unsupervised (indicator counts replaced by marginals):

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^i = b,\, y_j^{i+1} = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^i = b)}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[x_j^i = w\right]}\, P(y_j^i = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y_j^i = z)}$$

A sketch of the unsupervised updates appears below.
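A minimal sketch of the unsupervised updates for a single length-2 sequence, assuming the per-position marginals `gamma` and pairwise marginals `xi` have already been produced by the E-step; the numbers are made up, and Start/End handling is omitted for brevity (the A denominator here sums only over positions with a successor):

```python
import numpy as np

# Tags indexed {0: Noun, 1: Verb}; words indexed {0: fish, 1: sleep}.
# Hypothetical E-step output for one sequence x = (fish, sleep):
gamma = np.array([[0.9, 0.1],      # gamma[i, z] = P(y_i = z | x)
                  [0.3, 0.7]])
xi = np.array([[[0.25, 0.05],      # xi[i, a, b] = P(y_{i+1} = a, y_i = b | x)
                [0.65, 0.05]]])
x = [0, 1]                         # word indices for (fish, sleep)

# A_ab = sum_i P(y_i = b, y_{i+1} = a) / sum_i P(y_i = b)
A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)

# O_wz = sum_i 1[x_i = w] * P(y_i = z) / sum_i P(y_i = z)
O = np.zeros((2, 2))
for i, w in enumerate(x):
    O[w] += gamma[i]
O /= gamma.sum(axis=0)

print(A)   # columns sum to 1: new transition estimates
print(O)   # columns sum to 1: new emission estimates
```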
Computing Marginals (Forward-Backward Algorithm)

• Solving the E-step requires computing marginals:

             x1    x2    …    xM
P(yi=Noun)   0.5   0.4   …    0.05
P(yi=Det)    0.4   0.6   …    0.25
P(yi=Verb)   0.1   0.0   …    0.7

• Can solve using dynamic programming!
 – Similar to Viterbi
Notation

$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$$: the probability of observing the prefix x1:i and having the i-th state be yi = Z.

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$$: the probability of observing the suffix xi+1:M given that the i-th state is yi = Z.

Computing marginals = combining the two terms:

$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Pairwise marginals combine the same two terms:

$$P(y_i = b,\, y_{i-1} = a \mid x) = \frac{\alpha_a(i-1)\, P(y_i = b \mid y_{i-1} = a)\, P(x_i \mid y_i = b)\, \beta_b(i)}{\sum_{a',\, b'} \alpha_{a'}(i-1)\, P(y_i = b' \mid y_{i-1} = a')\, P(x_i \mid y_i = b')\, \beta_{b'}(i)}$$
Forward (sub-)Algorithm

• Solve for every $\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$.
• Naively:

$$\alpha_z(i) = \sum_{y_{1:i-1}} P(x_{1:i},\, y_i = Z,\, y_{1:i-1} \mid A, O)$$

Exponential time!
• Can be computed recursively (like Viterbi):

$$\alpha_z(1) = P(y_1 = z \mid y_0)\, P(x_1 \mid y_1 = z) = O_{x_1,\, z}\, A_{z,\, \text{start}}$$

$$\alpha_z(i+1) = O_{x_{i+1},\, z} \sum_{j=1}^{L} \alpha_j(i)\, A_{z,\, j}$$

Viterbi effectively replaces the sum with a max. (A sketch follows below.)
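A minimal sketch of the forward recursion for the fish/sleep HMM in matrix form; the index conventions (tags {0: Noun, 1: Verb}, words {0: fish, 1: sleep}) and the separate `A_start` column are our choices:

```python
import numpy as np

A = np.array([[0.09, 0.667],    # A[a, b] = P(y_{i+1} = a | y_i = b)
              [0.91, 0.333]])
A_start = np.array([0.8, 0.2])  # P(y_1 = z | Start)
O = np.array([[0.8, 0.5],       # O[w, z] = P(x_i = w | y_i = z)
              [0.2, 0.5]])

def forward(x):
    alpha = np.zeros((len(x), 2))
    alpha[0] = O[x[0]] * A_start        # alpha_z(1) = O_{x1,z} * A_{z,start}
    for i in range(1, len(x)):
        # alpha_z(i+1) = O_{x_{i+1},z} * sum_j alpha_j(i) * A_{z,j}
        alpha[i] = O[x[i]] * (A @ alpha[i - 1])
    return alpha

print(forward([0, 1]))   # x = (fish, sleep); rows are positions, columns tags
```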
Backward (sub-)Algorithm

• Solve for every $\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$.
• Naively:

$$\beta_z(i) = \sum_{y_{i+1:M}} P(x_{i+1:M},\, y_{i+1:M} \mid y_i = Z,\, A, O)$$

Exponential time!
• Can be computed recursively (like Viterbi), starting from the base case $\beta_z(M) = 1$:

$$\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1)\, A_{j,\, z}\, O_{x_{i+1},\, j}$$
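A matching sketch of the backward recursion, reusing `A`, `O`, and the numpy import from the forward sketch above:

```python
def backward(x):
    beta = np.ones((len(x), 2))         # base case: beta_z(M) = 1
    for i in range(len(x) - 2, -1, -1):
        # beta_z(i) = sum_j beta_j(i+1) * A_{j,z} * O_{x_{i+1},j}
        beta[i] = A.T @ (beta[i + 1] * O[x[i + 1]])
    return beta

print(backward([0, 1]))
```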
Forward-Backward Algorithm

• Run forward: $\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$
• Run backward: $\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$
• For each training x = (x1, …, xM), compute each P(yi) for y = (y1, …, yM):

$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$
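Combining the two passes gives the marginals; a minimal sketch reusing `forward` and `backward` from above:

```python
def marginals(x):
    # P(y_i = z | x) = alpha_z(i) * beta_z(i) / sum_z' alpha_z'(i) * beta_z'(i)
    ab = forward(x) * backward(x)
    return ab / ab.sum(axis=1, keepdims=True)

print(marginals([0, 1]))   # row i holds [P(y_i = Noun | x), P(y_i = Verb | x)]
```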
Recap: Unsupervised Training

• Train using only word sequences: $S = \{x_i\}_{i=1}^{N}$ (sentences)
• The y's are "hidden states"
 – All pairwise transitions go through the y's
 – Hence: hidden Markov model
• Train using the EM algorithm
 – Converges to a local optimum
Initialization

• How to choose the number of hidden states?
 – By hand
 – Cross-validation
  • Evaluate P(x) on validation data
  • Can compute P(x) via the forward algorithm (see the sketch below):

$$P(x) = \sum_y P(x, y) = \sum_z \alpha_z(M)\, P(\text{End} \mid y_M = z)$$
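A minimal sketch reusing `forward` from above; note this A has no End state, so we simply sum the final alphas, whereas with an End state each term would also be weighted by P(End | yM = z) as in the formula:

```python
def sequence_prob(x):
    # P(x) = sum_z alpha_z(M); multiply in P(End | y_M = z) if the model has one
    return forward(x)[-1].sum()

print(sequence_prob([0, 1]))   # P("fish sleep") under the current (A, O)
```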
Recap: Sequence Prediction & HMMs

• Models pairwise dependences in sequences
• Compact: only models pairwise dependences between the y's
• Main limitation: lots of independence assumptions
 – Poor predictive accuracy

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)
Next Lectures

• Thursday: Hidden Markov Models
 – (Unstructured lecture)
• Next Tuesday: Deep Generative Models
 – Recent applications
• Recitation Thursday
 – Recap of Viterbi and Forward-Backward