Machine Learning & Data Mining CS/CNS/EE 155
Lecture 14: Hidden Markov Models

Sequence Prediction (POS Tagging)

• x = "Fish Sleep" → y = (N, V)
• x = "The Dog Ate My Homework" → y = (D, N, V, D, N)
• x = "The Fox Jumped Over The Fence" → y = (D, N, V, P, D, N)
Challenges
• Multivariable Output
 – Make multiple predictions simultaneously
• Variable-Length Input/Output
 – Sentence lengths are not fixed
Multivariate Outputs

• x = "Fish Sleep" → y = (N, V)
• Multiclass prediction:
• How many classes?

POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Replicate weights, one weight vector and bias per class:

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}$$

Score all classes:

$$f(x \mid w, b) = \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix}$$

Predict via the largest score (a small sketch of this follows below):

$$\operatorname{argmax}_k \; w_k^T x - b_k$$
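To make the replicated-weights picture concrete, here is a minimal sketch in Python/numpy; the weights, biases, and feature vector are made-up illustration values, not from the lecture:

```python
import numpy as np

# One weight vector w_k and bias b_k per class, stacked as rows (K x d).
# These numbers are hypothetical, chosen only to illustrate the mechanics.
W = np.array([[1.0, 0.0],    # w_1
              [0.0, 1.0],    # w_2
              [0.5, 0.5]])   # w_3
b = np.array([0.0, 0.1, -0.2])

x = np.array([0.3, 0.9])        # feature vector for one input

scores = W @ x - b              # f(x | w, b): one score per class
k_hat = int(np.argmax(scores))  # predict via the largest score
print(scores, k_hat)            # -> [0.3 0.8 0.8] 1
```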
Multiclass Prediction

• x = "Fish Sleep" → y = (N, V)
• Multiclass prediction:
 – Treat every possible length-M sequence as a different class
 – (D,D), (D,N), (D,V), (D,Adj), (D,Adv), (D,Pr), (N,D), (N,N), (N,V), (N,Adj), (N,Adv), …
• L^M classes!
 – Length 2: 6^2 = 36!

POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)

Exponential explosion in the number of classes! (Not tractable for sequence prediction.)
Why is Naïve Multiclass Intractable?

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

– (D,D,D), (D,D,N), (D,D,V), (D,D,Adj), (D,D,Adv), (D,D,Pr)
– (D,N,D), (D,N,N), (D,N,V), (D,N,Adj), (D,N,Adv), (D,N,Pr)
– (D,V,D), (D,V,N), (D,V,V), (D,V,Adj), (D,V,Adv), (D,V,Pr)
– …
– (N,D,D), (N,D,N), (N,D,V), (N,D,Adj), (N,D,Adv), (N,D,Pr)
– (N,N,D), (N,N,N), (N,N,V), (N,N,Adj), (N,N,Adv), (N,N,Pr)
– …

Treats every combination as a different class (learns a model for each combination). This is an exponentially large representation: exponential time to consider every class, and exponential storage.
Independent Classification

• Treat each word independently (assumption)
 – Independent multiclass prediction per word
 – Predict for x = "I" independently
 – Predict for x = "fish" independently
 – Predict for x = "often" independently
 – Concatenate the predictions.

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

#Classes = #POS Tags (6 in our example). Solvable using standard multiclass prediction.
Independent Classification

• Treat each word independently
 – Independent multiclass prediction per word

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

P(y|x)     x="I"   x="fish"   x="often"
y="Det"    0.0     0.0        0.0
y="Noun"   1.0     0.75       0.0
y="Verb"   0.0     0.25       0.0
y="Adj"    0.0     0.0        0.4
y="Adv"    0.0     0.0        0.6
y="Prep"   0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Why the mistake?
Context Between Words

• Independent predictions ignore word pairs
 – In isolation: "fish" is more likely to be a Noun
 – But conditioned on following a (pro)noun, "fish" is more likely to be a Verb!
 – "1st-order" dependence (model all pairs)
  • 2nd order considers all triplets
  • Arbitrary order = exponential size (naïve multiclass)

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)
1st Order Hidden Markov Model

• x = (x1, x2, x3, x4, …, xM): the sequence of words
• y = (y1, y2, y3, y4, …, yM): the sequence of POS tags

• P(xi|yi): probability of state yi generating xi
• P(yi+1|yi): probability of state yi transitioning to yi+1
• P(y1|y0): y0 is defined to be the Start state
• P(End|yM): prior probability of yM being the final state
 – Not always used
Graphical Model Representation

[Figure: chain Y0 → Y1 → Y2 → … → YM → YEnd, where each state Yi emits observation Xi; the End state is optional.]

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$
1st Order Hidden Markov Model

"Joint Distribution":

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

"Conditional Distribution of x given y":

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$$

Given a POS tag sequence y, we can compute each P(xi|yi) independently (xi is conditionally independent of the rest given yi).
1st Order Hidden Markov Model

• Models all state-state pairs (all POS tag-tag pairs): additional complexity of (#POS Tags)².
• Models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass.
Relationship to Naïve Bayes

[Figure: the graphical model chain from before, shown with and without the Y-to-Y transition edges.]

Reduces to a sequence of disjoint Naïve Bayes models (if we ignore the transition probabilities).
P(word | state/tag)

• Two-word language: "fish" and "sleep"
• Two-tag language: "Noun" and "Verb"

(Slides borrowed from Ralph Grishman.)

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Given a tag sequence y:
P("fish sleep" | (N,V)) = 0.8 * 0.5
P("fish fish" | (N,V)) = 0.8 * 0.5
P("sleep fish" | (V,V)) = 0.5 * 0.5
P("sleep sleep" | (N,N)) = 0.2 * 0.2
Sampling

• HMMs are "generative" models
 – They model the joint distribution P(x, y)
 – We can generate samples from this distribution
 – First consider the conditional distribution P(x|y)
 – What about sampling from P(x, y)?

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Given the tag sequence y = (N, V), sample each word independently:
Sample P(x1|N): (0.8 fish, 0.2 sleep)
Sample P(x2|V): (0.5 fish, 0.5 sleep)
Forward Sampling of P(y, x)

A Simple POS HMM (slides borrowed from Ralph Grishman), with transitions:

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.1      0.2
ynext=verb     0.2       0.8      0.1
ynext=end      -         0.1      0.7

and emissions:

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start and i = 0.
1. i = i + 1
2. Sample yi from P(yi | yi−1)
3. If yi == End: quit
4. Sample xi from P(xi | yi)
5. Go to Step 1

Exploits conditional independence. Requires P(End | yi). (A minimal sampler sketch follows below.)
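The loop above can be written directly in Python; here is a minimal sketch using the fish/sleep HMM from these slides (the `draw` helper and dictionary layout are ours, not from the lecture):

```python
import random

trans = {  # P(y_i | y_{i-1}), including the End state
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
    "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7},
}
emit = {  # P(x_i | y_i)
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def draw(dist):
    """Sample a key from a {key: probability} dictionary."""
    return random.choices(list(dist), weights=dist.values())[0]

def forward_sample():
    x, y, state = [], [], "Start"
    while True:
        state = draw(trans[state])   # sample y_i from P(y_i | y_{i-1})
        if state == "End":           # quit when the End state is drawn
            return x, y
        y.append(state)
        x.append(draw(emit[state]))  # sample x_i from P(x_i | y_i)

print(forward_sample())
```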
Forward Sampling of P(y, x | M)

Conditioning on the length M drops the End-state factor:

$$P(x, y \mid M) = \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start and i = 0.
1. i = i + 1
2. If i > M: quit
3. Sample yi from P(yi | yi−1)
4. Sample xi from P(xi | yi)
5. Go to Step 1

Exploits conditional independence. Assumes no P(End | yi).
A Simple POS HMM with no End state (slides borrowed from Ralph Grishman):

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.09     0.667
ynext=verb     0.2       0.91     0.333

1st Order Hidden Markov Model

$$P(x_{k+1:M},\, y_{k+1:M} \mid x_{1:k},\, y_{1:k}) = P(x_{k+1:M},\, y_{k+1:M} \mid y_k)$$

"Memory-less model": we only need yk to model the rest of the sequence.
Viterbi Algorithm

Most Common Prediction Problem

• Given an input sentence, predict the POS tag sequence:

$$\operatorname{argmax}_y P(y \mid x)$$

• Naïve approach:
 – Try all possible y's
 – Choose the one with the highest probability
 – Exponential time: L^M possible y's
Recall: Bayes's Rule

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$$

where

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i), \qquad P(y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1})$$
$$\operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} P(y_M \mid y_{M-1})\, P(x_M \mid y_M)\, P(y_{1:M-1},\, x_{1:M-1})$$

using

$$P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i), \qquad P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1}), \qquad P(y_{1:k},\, x_{1:k}) = P(x_{1:k} \mid y_{1:k})\, P(y_{1:k})$$

Exploit the memory-less property: the choice of yM only depends on y1:M−1 via P(yM | yM−1)!
Dynamic Programming

• Input: x = (x1, x2, x3, …, xM)
• Computed: the best length-k prefix ending in each tag. Examples:

$$\hat{Y}_k(V) = \left[ \operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus V,\; x_{1:k}) \right] \oplus V, \qquad \hat{Y}_k(N) = \left[ \operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus N,\; x_{1:k}) \right] \oplus N$$

where ⊕ denotes sequence concatenation.

• Claim (a recursive definition):

$$\hat{Y}_{k+1}(V) = \left[ \operatorname{argmax}_{y_{1:k} \in \{\hat{Y}_k(T)\}_T} P(y_{1:k},\, x_{1:k})\, P(y_{k+1} = V \mid y_k)\, P(x_{k+1} \mid y_{k+1} = V) \right] \oplus V$$

The factor P(y1:k, x1:k) is pre-computed.
Solve:

$$\hat{Y}_2(V) = \left[ \operatorname{argmax}_{y_1 \in \{\hat{Y}_1(T)\}_T} P(y_1,\, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V) \right] \oplus V$$

[Trellis figure: columns of nodes Ŷ1(V), Ŷ1(D), Ŷ1(N) and Ŷ2(V), Ŷ2(D), Ŷ2(N); the candidate edges y1 = V, y1 = D, y1 = N all feed into Ŷ2(V).]

Store each Ŷ1(Z) and P(Ŷ1(Z), x1). Note that Ŷ1(Z) is just Z. Example: Ŷ2(V) = (N, V).

Store each Ŷ2(Z) and P(Ŷ2(Z), x1:2), then solve:

$$\hat{Y}_3(V) = \left[ \operatorname{argmax}_{y_{1:2} \in \{\hat{Y}_2(T)\}_T} P(y_{1:2},\, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V) \right] \oplus V$$

Claim: we only need to check the solutions Ŷ2(Z), Z = V, D, N.

Suppose instead that Ŷ3(V) = (V, V, V) while Ŷ2(V) = (N, V); then we can prove that (N, V, V) has higher probability, a contradiction. The proof depends on the 1st-order property:
• The probabilities of (V, V, V) and (N, V, V) differ in only three terms: P(y1|y0), P(x1|y1), and P(y2|y1).
• None of these depend on y3!

Continue until the end of the sequence (e.g., Ŷ3(V) = (D, N, V)), storing each Ŷ3(Z) and P(Ŷ3(Z), x1:3) along the way. The final step optionally includes the End-state term:

$$\hat{Y}_M(V) = \left[ \operatorname{argmax}_{y_{1:M-1} \in \{\hat{Y}_{M-1}(T)\}_T} P(y_{1:M-1},\, x_{1:M-1})\, P(y_M = V \mid y_{M-1})\, P(x_M \mid y_M = V)\, P(\text{End} \mid y_M = V) \right] \oplus V$$
Viterbi Algorithm

• Solve:

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$$

• For k = 1..M:
 – Iteratively solve for each Ŷk(Z)
  • Z looping over every POS tag.
• Predict the best ŶM(Z).
• Also known as Maximum A Posteriori (MAP) inference.
Numerical Example

x = (Fish, Sleep). (Slides borrowed from Ralph Grishman.)

A Simple POS HMM, with transitions:

P(ynext|y)     y=start   y=noun   y=verb
ynext=noun     0.8       0.1      0.2
ynext=verb     0.2       0.8      0.1
ynext=end      -         0.1      0.7

and emissions:

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

Initialize the trellis:

        0   1   2   3
start   1
verb    0
noun    0
end     0

Token 1: fish.

        0   1            2   3
start   1   0
verb    0   .2*.5 = .1
noun    0   .8*.8 = .64
end     0   0

Token 2: sleep. If 'fish' is a verb: verb = .1*.1*.5 = .005, noun = .1*.2*.2 = .004. If 'fish' is a noun: verb = .64*.8*.5 = .256, noun = .64*.1*.2 = .0128. Take the maximum and set back pointers:

        0   1     2       3
start   1   0     0
verb    0   .1    .256
noun    0   .64   .0128
end     0   0     -

Token 3: end. Ending from a verb gives .256*.7; ending from a noun gives .0128*.1. Take the maximum and set back pointers:

        0   1     2       3
start   1   0     0       0
verb    0   .1    .256    -
noun    0   .64   .0128   -
end     0   0     -       .256*.7

Decode by following the back pointers: fish = noun, sleep = verb.
What might go wrong for long sequences? Underflow! Small numbers get repeatedly multiplied together and become exponentially small.
Viterbi Algorithm (with Log Probabilities)

• Solve:

$$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \left[ \log P(x \mid y) + \log P(y) \right]$$

• For k = 1..M:
 – Iteratively solve for each Ŷk(Z) and its log-probability
  • Z looping over every POS tag.
• Predict the best ŶM(Z).
 – The log-probability accumulates additively, not multiplicatively.
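Here is a minimal sketch of Viterbi in log space for the fish/sleep HMM above; the function name, dictionary layout, and back-pointer bookkeeping are ours, and the output matches the numerical example:

```python
import math

tags = ["Noun", "Verb"]
trans = {"Start": {"Noun": 0.8, "Verb": 0.2},            # P(y_i | y_{i-1})
         "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
         "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7}}
emit = {"Noun": {"fish": 0.8, "sleep": 0.2},             # P(x_i | y_i)
        "Verb": {"fish": 0.5, "sleep": 0.5}}

def viterbi(x):
    prev = {"Start": 0.0}   # log-prob of the best prefix ending in each tag
    back = []               # back pointers, one dict per position
    for word in x:
        cur, bp = {}, {}
        for z in tags:
            # best predecessor: log-probs accumulate additively
            a = max(prev, key=lambda t: prev[t] + math.log(trans[t][z]))
            cur[z] = prev[a] + math.log(trans[a][z]) + math.log(emit[z][word])
            bp[z] = a
        prev = cur
        back.append(bp)
    # fold in the End-state term, then follow the back pointers
    last = max(prev, key=lambda t: prev[t] + math.log(trans[t]["End"]))
    path = [last]
    for bp in reversed(back[1:]):
        path.insert(0, bp[path[0]])
    return path

print(viterbi(["fish", "sleep"]))   # -> ['Noun', 'Verb']
```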
Recap: Independent Classification

• Treat each word independently
 – Independent multiclass prediction per word

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (Assume pronouns are nouns for simplicity.)

P(y|x)     x="I"   x="fish"   x="often"
y="Det"    0.0     0.0        0.0
y="Noun"   1.0     0.75       0.0
y="Verb"   0.0     0.25       0.0
y="Adj"    0.0     0.0        0.4
y="Adv"    0.0     0.0        0.6
y="Prep"   0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
The mistake is due to not modeling interactions between multiple words.
Recap: Viterbi

• Models pairwise transitions between states
 – Pairwise transitions between POS tags
 – "1st-order" model

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

x = "I fish often":
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)*
*Assuming we defined P(x, y) properly.
Training HMMs
Supervised Training

• Given: $S = \{(x_i, y_i)\}_{i=1}^{N}$, pairs of word sequences (sentences) and POS tag sequences.
• Goal: estimate P(x, y) using S, where

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

• Maximum likelihood!
Aside: Matrix Formulation

• Define the transition matrix A:
 – Aab = P(yi+1 = a | yi = b), or −log P(yi+1 = a | yi = b)
• Define the observation matrix O:
 – Owz = P(xi = w | yi = z), or −log P(xi = w | yi = z)

P(x|y)      y="Noun"   y="Verb"
x="fish"    0.8        0.5
x="sleep"   0.2        0.5

P(ynext|y)      y="Noun"   y="Verb"
ynext="Noun"    0.09       0.667
ynext="Verb"    0.91       0.333
Aside: Matrix Formulation

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End},\, y_M} \prod_{i=1}^{M} A_{y_i,\, y_{i-1}} \prod_{i=1}^{M} O_{x_i,\, y_i}$$

Log-probability formulation, where each entry of Ã is defined as −log A (and similarly for Õ):

$$-\log P(x, y) = \tilde{A}_{\text{End},\, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i,\, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i,\, y_i}$$
Maximum Likelihood

• Estimate each component separately:

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[(y_j^{i+1} = a) \wedge (y_j^i = b)\right]}}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[y_j^i = b\right]}}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[(x_j^i = w) \wedge (y_j^i = z)\right]}}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[y_j^i = z\right]}}$$

• (Derived via minimizing the negative log likelihood. A counting sketch follows below.)
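A minimal sketch of these counting estimates; the two training sentences below are made up for illustration, and the Start/End transitions are counted via explicit padding:

```python
from collections import Counter

# Hypothetical toy training set S = {(x_j, y_j)}: words paired with tags.
S = [(["fish", "sleep"], ["Noun", "Verb"]),
     (["fish", "fish"],  ["Noun", "Noun"])]

trans_num, trans_den = Counter(), Counter()
emit_num, emit_den = Counter(), Counter()

for words, y in S:
    padded = ["Start"] + y + ["End"]
    for b, a in zip(padded, padded[1:]):   # numerator/denominator counts for A_ab
        trans_num[(a, b)] += 1
        trans_den[b] += 1
    for w, z in zip(words, y):             # numerator/denominator counts for O_wz
        emit_num[(w, z)] += 1
        emit_den[z] += 1

A = {ab: n / trans_den[ab[1]] for ab, n in trans_num.items()}
O = {wz: n / emit_den[wz[1]] for wz, n in emit_num.items()}
print(A[("Verb", "Noun")])   # fraction of Noun states followed by Verb: 1/3
print(O[("fish", "Noun")])   # fraction of Noun states emitting "fish": 1.0
```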
Recap: Supervised Training

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

• Maximum likelihood training
 – Counting statistics
 – Super easy!
 – Why?
• What about the unsupervised case?
Conditional Independence Assumptions

• Everything decomposes into products of pairs
 – I.e., P(yi+1 = a | yi = b) doesn't depend on anything else
• Can just estimate frequencies:
 – How often yi+1 = a when yi = b over the training set
 – Note that P(yi+1 = a | yi = b) is a common model across all locations of all sequences.

$$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

#Parameters:
 – Transitions A: #Tags²
 – Observations O: #Words × #Tags
This avoids directly modeling word/word pairings (#Tags is in the 10s, while #Words is in the 10,000s).
Unsupervised Training

• What if we have no y's?
 – Just a training set of sentences: $S = \{x_i\}_{i=1}^{N}$ (word sequences)
• Still want to estimate P(x, y)
 – How?
 – Why?

$$\operatorname{argmax} \prod_i P(x_i) = \operatorname{argmax} \prod_i \sum_y P(x_i, y)$$
Why Unsupervised Training?

• Supervised data is hard to acquire
 – Requires annotating POS tags
• Unsupervised data is plentiful
 – Just grab some text!
• Might just work for POS tagging!
 – Learn y's that correspond to POS tags
• Can be used for other tasks
 – Detect outlier sentences (sentences with low probability)
 – Sample new sentences.
EM Algorithm (Baum-Welch)

• If we had y's → maximum likelihood.
• If we had (A, O) → predict y's.
• Chicken vs. egg!

1. Initialize A and O arbitrarily
2. Predict the probability of the y's for each training x ("Expectation Step")
3. Use the y's to estimate new (A, O) ("Maximization Step")
4. Repeat from Step 2 until convergence

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Expectation Step

• Given (A, O)
• For each training x = (x1, …, xM):
 – Predict P(yi) for each y = (y1, …, yM)

             x1    x2    …    xM
P(yi=Noun)   0.5   0.4   …    0.05
P(yi=Det)    0.4   0.6   …    0.25
P(yi=Verb)   0.1   0.0   …    0.7

 – This encodes the current model's beliefs about y
 – It is the "marginal distribution" of each yi
Recall: Matrix Formulation

• Transition matrix A: Aab = P(yi+1 = a | yi = b), or −log P(yi+1 = a | yi = b)
• Observation matrix O: Owz = P(xi = w | yi = z), or −log P(xi = w | yi = z)

(Same fish/sleep transition and emission tables as before.)
Maximization Step

• Maximize likelihood over the marginal distribution.

Supervised (indicator counts):

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[(y_j^{i+1} = a) \wedge (y_j^i = b)\right]}}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1_{\left[y_j^i = b\right]}}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[(x_j^i = w) \wedge (y_j^i = z)\right]}}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[y_j^i = z\right]}}$$

Unsupervised (indicator counts replaced by marginals):

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^i = b,\, y_j^{i+1} = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^i = b)}, \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1_{\left[x_j^i = w\right]}\, P(y_j^i = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y_j^i = z)}$$

A sketch of the unsupervised updates appears below.
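A minimal sketch of the unsupervised updates for a single length-2 sequence, assuming the per-position marginals `gamma` and pairwise marginals `xi` have already been produced by the E-step; the numbers are made up, and Start/End handling is omitted for brevity (the A denominator here sums only over positions with a successor):

```python
import numpy as np

# Tags indexed {0: Noun, 1: Verb}; words indexed {0: fish, 1: sleep}.
# Hypothetical E-step output for one sequence x = (fish, sleep):
gamma = np.array([[0.9, 0.1],      # gamma[i, z] = P(y_i = z | x)
                  [0.3, 0.7]])
xi = np.array([[[0.25, 0.05],      # xi[i, a, b] = P(y_{i+1} = a, y_i = b | x)
                [0.65, 0.05]]])
x = [0, 1]                         # word indices for (fish, sleep)

# A_ab = sum_i P(y_i = b, y_{i+1} = a) / sum_i P(y_i = b)
A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)

# O_wz = sum_i 1[x_i = w] * P(y_i = z) / sum_i P(y_i = z)
O = np.zeros((2, 2))
for i, w in enumerate(x):
    O[w] += gamma[i]
O /= gamma.sum(axis=0)

print(A)   # columns sum to 1: new transition estimates
print(O)   # columns sum to 1: new emission estimates
```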
Computing Marginals (Forward-Backward Algorithm)

• Solving the E-step requires computing marginals:

             x1    x2    …    xM
P(yi=Noun)   0.5   0.4   …    0.05
P(yi=Det)    0.4   0.6   …    0.25
P(yi=Verb)   0.1   0.0   …    0.7

• Can solve using dynamic programming!
 – Similar to Viterbi
Notation

$$\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$$: the probability of observing the prefix x1:i and having the i-th state be yi = Z.

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$$: the probability of observing the suffix xi+1:M given that the i-th state is yi = Z.

Computing marginals = combining the two terms:

$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Pairwise marginals combine the same two terms:

$$P(y_i = b,\, y_{i-1} = a \mid x) = \frac{\alpha_a(i-1)\, P(y_i = b \mid y_{i-1} = a)\, P(x_i \mid y_i = b)\, \beta_b(i)}{\sum_{a',\, b'} \alpha_{a'}(i-1)\, P(y_i = b' \mid y_{i-1} = a')\, P(x_i \mid y_i = b')\, \beta_{b'}(i)}$$
Forward (sub-)Algorithm

• Solve for every $\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$.
• Naively:

$$\alpha_z(i) = \sum_{y_{1:i-1}} P(x_{1:i},\, y_i = Z,\, y_{1:i-1} \mid A, O)$$

Exponential time!
• Can be computed recursively (like Viterbi):

$$\alpha_z(1) = P(y_1 = z \mid y_0)\, P(x_1 \mid y_1 = z) = O_{x_1,\, z}\, A_{z,\, \text{start}}$$

$$\alpha_z(i+1) = O_{x_{i+1},\, z} \sum_{j=1}^{L} \alpha_j(i)\, A_{z,\, j}$$

Viterbi effectively replaces the sum with a max. (A sketch follows below.)
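A minimal sketch of the forward recursion for the fish/sleep HMM in matrix form; the index conventions (tags {0: Noun, 1: Verb}, words {0: fish, 1: sleep}) and the separate `A_start` column are our choices:

```python
import numpy as np

A = np.array([[0.09, 0.667],    # A[a, b] = P(y_{i+1} = a | y_i = b)
              [0.91, 0.333]])
A_start = np.array([0.8, 0.2])  # P(y_1 = z | Start)
O = np.array([[0.8, 0.5],       # O[w, z] = P(x_i = w | y_i = z)
              [0.2, 0.5]])

def forward(x):
    alpha = np.zeros((len(x), 2))
    alpha[0] = O[x[0]] * A_start        # alpha_z(1) = O_{x1,z} * A_{z,start}
    for i in range(1, len(x)):
        # alpha_z(i+1) = O_{x_{i+1},z} * sum_j alpha_j(i) * A_{z,j}
        alpha[i] = O[x[i]] * (A @ alpha[i - 1])
    return alpha

print(forward([0, 1]))   # x = (fish, sleep); rows are positions, columns tags
```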
Backward (sub-)Algorithm

• Solve for every $\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$.
• Naively:

$$\beta_z(i) = \sum_{y_{i+1:M}} P(x_{i+1:M},\, y_{i+1:M} \mid y_i = Z,\, A, O)$$

Exponential time!
• Can be computed recursively (like Viterbi), starting from the base case $\beta_z(M) = 1$:

$$\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1)\, A_{j,\, z}\, O_{x_{i+1},\, j}$$
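A matching sketch of the backward recursion, reusing `A`, `O`, and the numpy import from the forward sketch above:

```python
def backward(x):
    beta = np.ones((len(x), 2))         # base case: beta_z(M) = 1
    for i in range(len(x) - 2, -1, -1):
        # beta_z(i) = sum_j beta_j(i+1) * A_{j,z} * O_{x_{i+1},j}
        beta[i] = A.T @ (beta[i + 1] * O[x[i + 1]])
    return beta

print(backward([0, 1]))
```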
Forward-Backward Algorithm

• Run forward: $\alpha_z(i) = P(x_{1:i},\, y_i = Z \mid A, O)$
• Run backward: $\beta_z(i) = P(x_{i+1:M} \mid y_i = Z,\, A, O)$
• For each training x = (x1, …, xM), compute each P(yi) for y = (y1, …, yM):

$$P(y_i = z \mid x) = \frac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$$
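Combining the two passes gives the marginals; a minimal sketch reusing `forward` and `backward` from above:

```python
def marginals(x):
    # P(y_i = z | x) = alpha_z(i) * beta_z(i) / sum_z' alpha_z'(i) * beta_z'(i)
    ab = forward(x) * backward(x)
    return ab / ab.sum(axis=1, keepdims=True)

print(marginals([0, 1]))   # row i holds [P(y_i = Noun | x), P(y_i = Verb | x)]
```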
Recap: Unsupervised Training

• Train using only word sequences: $S = \{x_i\}_{i=1}^{N}$ (sentences)
• The y's are "hidden states"
 – All pairwise transitions go through the y's
 – Hence: hidden Markov model
• Train using the EM algorithm
 – Converges to a local optimum
Initialization

• How to choose the number of hidden states?
 – By hand
 – Cross-validation
  • Evaluate P(x) on validation data
  • Can compute P(x) via the forward algorithm (see the sketch below):

$$P(x) = \sum_y P(x, y) = \sum_z \alpha_z(M)\, P(\text{End} \mid y_M = z)$$
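A minimal sketch reusing `forward` from above; note this A has no End state, so we simply sum the final alphas, whereas with an End state each term would also be weighted by P(End | yM = z) as in the formula:

```python
def sequence_prob(x):
    # P(x) = sum_z alpha_z(M); multiply in P(End | y_M = z) if the model has one
    return forward(x)[-1].sum()

print(sequence_prob([0, 1]))   # P("fish sleep") under the current (A, O)
```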
Recap: Sequence Prediction & HMMs

• Models pairwise dependences in sequences
• Compact: only models pairwise dependences between the y's
• Main limitation: lots of independence assumptions
 – Poor predictive accuracy

x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)
Next Lectures

• Thursday: Hidden Markov Models
 – (Unstructured lecture)
• Next Tuesday: Deep Generative Models
 – Recent applications
• Recitation Thursday
 – Recap of Viterbi and Forward-Backward