Chap09: Hidden Markov Models


    Statistical NLP: Hidden Markov Models

    Updated 4/09


    Markov Models

    Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications.

    Their first use was in modeling the letter sequences in works of Russian literature. They were later developed as a general statistical tool.

    More specifically, they model a sequence (perhaps through time) of random variables that are not necessarily independent.

    They rely on two assumptions: Limited Horizon and Time Invariance.


    Markov Assumptions

    Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sn}, the state space. The Markov properties are:

    Limited Horizon: P(Xt = sk | X1, ..., Xt-1) = P(Xt = sk | Xt-1), i.e., the state at position t only depends on the previous state.

    Time Invariant: P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time.

    If X possesses these properties, then X is said to be a Markov Chain.


    Markov Model parameters

    A Markov chain is described by:

    A stochastic transition matrix A: aij = P(Xt+1 = sj | Xt = si)

    Probabilities of the initial state of the chain: πi = P(X1 = si)

    There are obvious normalization constraints on A and on π.

    Finally, a Markov Model is formally specified by a three-tuple (S, π, A), where S is the set of states, and π, A are the probabilities for the initial state and the state transitions.


    Probability of sequence of states

    By the Markov assumption:

    P(X1, X2, ..., XT) = P(X1) P(X2 | X1) ... P(XT | XT-1)
                       = πX1 ∏t=1..T-1 aXt,Xt+1


    Example of a Markov Chain

    [Figure: a Markov chain over the states t, e, h, a, p, i plus a Start state; the arcs carry the transition probabilities shown on the slide (1, .4, 1, .3, .3, .4, .6, .6, .4). The calculation below uses P(X1 = t) = 1.0, P(i | t) = 0.3, and P(p | i) = 0.6.]

    Probability of a sequence of states:

    P(t, i, p) = P(X1 = t) P(X2 = i | X1 = t) P(X3 = p | X2 = i) = 1.0 * 0.3 * 0.6 = 0.18
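
    The same calculation can be scripted directly from the chain parameters. This is a minimal sketch: only the start probability of t and the transitions t→i and i→p are actually given by the slide's calculation, so the dictionaries below contain just those entries.

      # Chain probability of a state sequence: P(X1..XT) = pi[X1] * prod of a[Xt, Xt+1].
      # Only the entries used in the slide's example are filled in.
      pi = {"t": 1.0}                              # start probability of state t (from the slide)
      a = {("t", "i"): 0.3, ("i", "p"): 0.6}       # transition probabilities used in the example

      def sequence_probability(states):
          """P(X1, ..., XT) = pi[X1] * product over t of a[Xt, Xt+1]."""
          p = pi.get(states[0], 0.0)
          for prev, nxt in zip(states, states[1:]):
              p *= a.get((prev, nxt), 0.0)
          return p

      print(sequence_probability(["t", "i", "p"]))   # 0.18, as in the slide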


    Are n-gram models Markov Models?

    Bi-gram models are obviously Markov Models.

    What about tri-gram, four-gram models, etc.?


    Stationary distribution of Markov Models

    What is the distribution of a Markov state p(Xn) for very large n? Obviously, the initial state is forgotten.

    Stationary distribution: p A = p, i.e., p is a left eigenvector of A with eigenvalue 1 (equivalently, the eigenvalue problem A'p' = 1*p' for the transpose).

    Example (the two-state chain with states C and I, starting in C):

        A = | 0.7  0.3 |
            | 0.5  0.5 |

    Solution of the eigenvalue problem: p = (5/8, 3/8)
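
    Numerically, the stationary distribution is the left eigenvector of A for eigenvalue 1, normalized to sum to 1. A small sketch using numpy, with the matrix values from the slide:

      import numpy as np

      # Transition matrix of the two-state chain from the slide (rows sum to 1).
      A = np.array([[0.7, 0.3],
                    [0.5, 0.5]])

      # Left eigenvector for eigenvalue 1: solve A^T v = v, then normalize v to sum to 1.
      eigvals, eigvecs = np.linalg.eig(A.T)
      v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
      p = v / v.sum()

      print(p)    # approximately [0.625 0.375], i.e., (5/8, 3/8)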


    Hidden Markov Models (HMM)

    Here the states are hidden. We have only indirect clues about the states, namely the symbols each state emits:

    P(Ot = k | Xt = si, Xt+1 = sj) = bijk


    The crazy soft drink machine

    [Figure: two states, Cola Preferred (CP) and Ice tea Preferred (IP); the machine starts in CP; transitions CP→CP 0.7, CP→IP 0.3, IP→IP 0.5, IP→CP 0.5.]

    Output probabilities:

              cola   ice_t   lem
        CP    0.6    0.1     0.3
        IP    0.1    0.7     0.2

    Comment: for this machine the output really depends only on si, namely bijk = bik.


    The crazy soft drink machine cont.

    What is the probability of observing {lem, ice_t}? We need to sum over all 4 possible paths that might be taken through the HMM:

    0.7*0.3*0.7*0.1 + 0.7*0.3*0.3*0.1 + 0.3*0.3*0.5*0.7 + 0.3*0.3*0.5*0.7 = 0.084

    [Figure: the four paths CP→{CP,IP}→{CP,IP} through the trellis, labelled with the transition probabilities 0.7/0.3 (out of CP) and 0.5/0.5 (out of IP), together with the output probability table from the previous slide.]
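
    The same number can be obtained by brute-force enumeration of the hidden paths, a useful check before moving to the forward algorithm. A sketch under the slide's assumptions (start in CP; arc emissions depend only on the state being left, i.e., bijk = bik):

      from itertools import product

      states = ["CP", "IP"]
      a = {"CP": {"CP": 0.7, "IP": 0.3},                      # transition probabilities
           "IP": {"CP": 0.5, "IP": 0.5}}
      b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},     # emission probabilities (bik)
           "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

      obs = ["lem", "ice_t"]                                  # observation sequence from the slide

      total = 0.0
      for path in product(states, repeat=len(obs)):           # choices for X2, X3 (X1 is fixed to CP)
          p, prev = 1.0, "CP"
          for o, nxt in zip(obs, path):
              p *= b[prev][o] * a[prev][nxt]                  # emit o from prev, then move to nxt
              prev = nxt
          total += p

      print(total)    # 0.084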


    Why Use Hidden Markov Models?

    HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: part-of-speech tagging, or speech.

    HMMs can be trained efficiently using the EM algorithm.

    Another example where HMMs are useful is in generating the parameters for linear interpolation of n-gram models.


    Interpolating parameters for n-gram models

    [Figure: an HMM for interpolating n-gram models. From state wa wb, ε-arcs weighted λ1ab, λ2ab, λ3ab lead to emissions of the next word w with probabilities P1(w), P2(w | wb), and P3(w | wa wb) respectively, ending in the corresponding state wb w (wb w1, wb w2, ..., wb wM).]


    Interpolating parameters for n-gram models: comments

    The HMM probability of observing the sequence (wn-2, wn-1, wn) is equivalent to

    Plin(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-1, wn-2)

    The λ-arcs are special (ε-)transitions that produce no output symbol.

    In the above model each word pair (a, b) has a different HMM. This is relaxed by using tied states.
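
    As a concrete illustration, the linear interpolation itself is just a weighted sum of the component model probabilities. A minimal sketch: the component models p1, p2, p3 and the λ weights below are made-up placeholders, not values from the slides.

      # Linearly interpolated trigram probability:
      # Plin(w | u, v) = lam1*P1(w) + lam2*P2(w | v) + lam3*P3(w | u, v)

      def p_lin(w, u, v, p1, p2, p3, lams):
          """p1, p2, p3 are callables for the uni-, bi-, and trigram models; lams sums to 1."""
          lam1, lam2, lam3 = lams
          return lam1 * p1(w) + lam2 * p2(w, v) + lam3 * p3(w, u, v)

      # Toy component models (placeholders for illustration only).
      p1 = lambda w: {"cat": 0.1, "dog": 0.05}.get(w, 0.01)
      p2 = lambda w, v: 0.2 if (v, w) == ("the", "cat") else 0.02
      p3 = lambda w, u, v: 0.4 if (u, v, w) == ("saw", "the", "cat") else 0.01

      print(p_lin("cat", "saw", "the", p1, p2, p3, (0.2, 0.3, 0.5)))   # 0.2*0.1 + 0.3*0.2 + 0.5*0.4 = 0.28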


    General Form of an HMM

    An HMM is specified by a five-tuple (S, K, π, A, B), where S and K are the set of states and the output alphabet, and π, A, B are the probabilities for the initial state, state transitions, and symbol emissions, respectively.

    When the sets S and K are obvious, an HMM will be represented by a three-tuple (π, A, B).

    Given a specification of an HMM, we can simulate the running of a Markov process and produce an output sequence using the algorithm shown on the next page.

    More interesting than a simulation, however, is assuming that some set of data was generated by an HMM, and then being able to calculate probabilities and probable underlying state sequences.


    A Program for a Markov Process modeled by an HMM

    t := 1
    Start in state si with probability πi (i.e., X1 = i)
    Forever do:
        Move from state si to state sj with probability aij (i.e., Xt+1 = j)
        Emit observation symbol ot = k with probability bijk
        t := t + 1
    End
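
    A runnable version of this simulation, as a sketch. The helper names are mine; the parameters are the crazy soft drink machine's, using the state-emission simplification bijk = bik so that the emission can be drawn from the state being left.

      import random

      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}                              # the machine always starts in CP
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},      # bik: emission from the state being left
           "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

      def draw(dist):
          """Sample a key from a {key: probability} dictionary."""
          return random.choices(list(dist), weights=dist.values())[0]

      def simulate(T):
          """Run the Markov process for T steps, returning (state sequence, observations)."""
          x = draw(pi)                       # X1
          xs, os = [x], []
          for _ in range(T):
              os.append(draw(b[x]))          # emit ot from the current state
              x = draw(a[x])                 # move to Xt+1
              xs.append(x)
          return xs, os

      print(simulate(5))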


    The Three Fundamental Questions for HMMs

    Given a model λ = (A, B, π), how do we efficiently compute how likely a certain observation is, that is, P(O | λ)?

    Given the observation sequence O and a model λ, how do we choose a state sequence (X1, ..., XT+1) that best explains the observations?

    Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, π), how do we find the model that best explains the observed data?


    Finding the probability of an observation I

    Given the observation sequence O = (o1, ..., oT) and a model λ = (A, B, π), we wish to know how to efficiently compute P(O | λ).

    For any state sequence X = (X1, ..., XT+1), we find:

    P(O, X | λ) = P(X | λ) P(O | X, λ) = πX1 ∏t=1..T aXt,Xt+1 bXt,Xt+1,ot    and    P(O | λ) = ΣX P(O, X | λ)

    This is simply the sum of the probability of the observation occurring according to each possible state sequence. Direct evaluation of this expression, however, is extremely inefficient.

    [Figure: a trellis of states S1, S2 unfolded over time t = 1, 2, 3, 4, 5, 6, 7, showing the possible paths through the HMM.]


    Finding the probability of an observation II

    In order to avoid this complexity, we can use dynamic programming or memoization techniques. In particular, we use trellis algorithms.

    We make a square array of states versus time and compute the probabilities of being at each state at each time in terms of the probabilities of being in each state at the preceding time.

    A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.


    Finding the probability of an observation III: The forward procedure

    A forward variable αi(t) = P(o1 o2 ... ot-1, Xt = i | λ) is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.

    Forward variables are calculated as follows:

    Initialization: αi(1) = πi,  1 ≤ i ≤ N

    Induction: αj(t+1) = Σi=1..N αi(t) aij bijot,  1 ≤ t ≤ T,  1 ≤ j ≤ N

    Total: P(O | λ) = Σi=1..N αi(T+1)

    This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1)).
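
    A direct transcription of the forward procedure as a Python sketch (the function name and data layout are mine), using the arc-emission convention of these slides with b indexed by the pair of states and the symbol. The example parameters are the crazy soft drink machine, with bijk taken to depend only on i.

      def forward(pi, a, b, obs, states):
          """alpha[t][i] = P(o1..ot, Xt+1 = i | model), indexed 0-based; returns (alpha, P(O | model))."""
          T = len(obs)
          alpha = [{i: 0.0 for i in states} for _ in range(T + 1)]
          for i in states:                                   # initialization: alpha_i(1) = pi_i
              alpha[0][i] = pi[i]
          for t in range(T):                                 # induction over o_t
              for j in states:
                  alpha[t + 1][j] = sum(alpha[t][i] * a[i][j] * b[i][j][obs[t]] for i in states)
          return alpha, sum(alpha[T][i] for i in states)     # total: P(O | model)

      # Crazy soft drink machine, with b[i][j][k] = b[i][k] (emission depends only on i).
      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      alpha, prob = forward(pi, a, b, ["lem", "ice_t"], states)
      print(prob)    # 0.084, matching the path enumeration earlier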


    Finding the probability of an observation IV: The backward procedure

    The backward procedure computes backward variables, which are the total probability of seeing the rest of the observation sequence given that we were in state si at time t.

    Backward variables are useful for the problem of parameter estimation.


    Finding the probability of an observation V: The backward procedure

    Let βi(t) = P(ot ... oT | Xt = i, λ) be the backward variables.

    Backward variables can be calculated working backward through the trellis as follows:

    Initialization: βi(T+1) = 1,  1 ≤ i ≤ N

    Induction: βi(t) = Σj=1..N aij bijot βj(t+1),  1 ≤ t ≤ T,  1 ≤ i ≤ N

    Total: P(O | λ) = Σi=1..N πi βi(1)
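
    A matching sketch of the backward procedure, with the same arc-emission convention and the same hypothetical parameter layout as the forward sketch above.

      def backward(pi, a, b, obs, states):
          """beta[t][i] = P(ot+1..oT | Xt+1 = i, model), indexed 0-based; returns (beta, P(O | model))."""
          T = len(obs)
          beta = [{i: 0.0 for i in states} for _ in range(T + 1)]
          for i in states:                                   # initialization: beta_i(T+1) = 1
              beta[T][i] = 1.0
          for t in range(T - 1, -1, -1):                     # induction, working backward over t
              for i in states:
                  beta[t][i] = sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j] for j in states)
          return beta, sum(pi[i] * beta[0][i] for i in states)   # total: P(O | model)

      # Same crazy-soft-drink-machine parameters as in the forward sketch.
      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      beta, prob = backward(pi, a, b, ["lem", "ice_t"], states)
      print(prob)    # 0.084 again, as expected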


    Finding the probability of an observation VI: Combining forward and backward

    P(O, Xt = i | λ)
      = P(o1 ... oT, Xt = i | λ)
      = P(o1 ... ot-1, Xt = i, ot, ..., oT | λ)
      = P(o1 ... ot-1, Xt = i | λ) P(ot, ..., oT | o1 ... ot-1, Xt = i, λ)
      = P(o1 ... ot-1, Xt = i | λ) P(ot, ..., oT | Xt = i, λ)
      = αi(t) βi(t)

    Total: P(O | λ) = Σi=1..N αi(t) βi(t),  1 ≤ t ≤ T+1
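
    The identity Σi αi(t) βi(t) = P(O | λ) holds at every t, which makes a convenient correctness check. A compact matrix-form sketch with numpy; the parameters are again the soft drink machine, written in state-emission form (b is an N×K matrix, bijk = bik).

      import numpy as np

      A = np.array([[0.7, 0.3],                    # transition matrix (rows: CP, IP)
                    [0.5, 0.5]])
      B = np.array([[0.6, 0.1, 0.3],               # emission matrix (columns: cola, ice_t, lem)
                    [0.1, 0.7, 0.2]])
      pi = np.array([1.0, 0.0])
      obs = [2, 1]                                  # the sequence lem, ice_t as column indices of B

      T, N = len(obs), len(pi)
      alpha = np.zeros((T + 1, N)); beta = np.ones((T + 1, N))
      alpha[0] = pi
      for t in range(T):                            # forward: alpha_j(t+1) = sum_i alpha_i(t) a_ij b_i,ot
          alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
      for t in range(T - 1, -1, -1):                # backward: beta_i(t) = b_i,ot * sum_j a_ij beta_j(t+1)
          beta[t] = B[:, obs[t]] * (A @ beta[t + 1])

      print([float((alpha[t] * beta[t]).sum()) for t in range(T + 1)])   # 0.084 at every t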


    Finding the Best State Sequence I

    One method consists of finding the states individually:

    For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, λ).

    Let γi(t) = P(Xt = i | O, λ) = P(Xt = i, O | λ) / P(O | λ) = αi(t) βi(t) / Σj=1..N αj(t) βj(t)

    The individually most likely state is X̂t = argmax1≤i≤N γi(t),  1 ≤ t ≤ T+1

    This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.


    Finding the Best State Sequence II: The Viterbi Algorithm

    The Viterbi algorithm efficiently computes the most likely state sequence.

    Commonly, we want to find the most likely complete path, that is: argmaxX P(X | O, λ)

    To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | λ)

    We define δj(t) = maxX1..Xt-1 P(X1 ... Xt-1, o1 ... ot-1, Xt = j | λ)

    ψj(t) records the node of the incoming arc that led to this most probable path.


    Finding the Best State Sequence III: The Viterbi Algorithm

    The Viterbi Algorithm works as follows:

    Initialization: δj(1) = πj,  1 ≤ j ≤ N

    Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot,  1 ≤ j ≤ N

    Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot,  1 ≤ j ≤ N

    Termination and path readout:

    X̂T+1 = argmax1≤j≤N δj(T+1)

    X̂t = ψX̂t+1(t+1)

    P(X̂) = max1≤j≤N δj(T+1)
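
    A Python sketch of the Viterbi recursion (the function name and data layout are mine), again using the arc-emission convention bijot, with the soft drink machine parameters from the earlier slides as a worked example.

      def viterbi(pi, a, b, obs, states):
          """Return the most likely state sequence X1..XT+1 and its probability."""
          T = len(obs)
          delta = [{j: 0.0 for j in states} for _ in range(T + 1)]
          psi = [{j: None for j in states} for _ in range(T + 1)]
          for j in states:                                   # initialization
              delta[0][j] = pi[j]
          for t in range(T):                                 # induction with backtrace
              for j in states:
                  scores = {i: delta[t][i] * a[i][j] * b[i][j][obs[t]] for i in states}
                  psi[t + 1][j] = max(scores, key=scores.get)
                  delta[t + 1][j] = scores[psi[t + 1][j]]
          last = max(delta[T], key=delta[T].get)             # termination
          path = [last]
          for t in range(T, 0, -1):                          # path readout via the backtrace
              path.append(psi[t][path[-1]])
          return list(reversed(path)), delta[T][last]

      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      print(viterbi(pi, a, b, ["lem", "ice_t"], states))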


    Finding the Best State Sequence IV: Visualization

    [Figure: a trellis of states S1-S4 unfolded over time t = 1, 2, 3, 4, 5, 6, 7, with the most probable path into state 2 at t = 6 marked.]

    δ2(t=6) = probability of reaching state 2 at t=6 by the most probable path (marked) through state 2 at t=6

    ψ2(t=6) = 3: the state preceding state 2 at t=6 on the most probable path through state 2 at t=6


    Parameter Estimation I

    Given a certain observation sequence, we want to find the values of the model parameters λ = (A, B, π) which best explain what we observed.

    Using Maximum Likelihood Estimation, we can find the values that maximize P(O | λ), i.e., argmaxλ P(Otraining | λ).

    There is no known analytic method to choose λ so as to maximize P(O | λ). However, we can locally maximize it with an iterative hill-climbing algorithm known as the Baum-Welch or Forward-Backward algorithm (a special case of the EM algorithm).


    Parameter Estimation II

    We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.

    Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.

    By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.


    Parameter Estimation III

    Define pt(i,j) as the probability of traversing a certain arc (from state i at time t to state j at time t+1), given the observation sequence; the expression for pt(i,j) is given on the next slide.


    Parameter Estimation IV

    The probability of traversing a certain arc at time t, given the observation sequence, can be expressed as:

    pt(i,j) = P(Xt = i, Xt+1 = j | O, λ) = αi(t) aij bijot βj(t+1) / Σm=1..N αm(t) βm(t)

    The probability of leaving state i at time t is then:

    γi(t) = Σj=1..N pt(i,j)


    Parameter Estimation V

    Re-estimation of the initial-state and transition probabilities:

    π̂i = γi(1)

    âij = Σt=1..T pt(i,j) / Σt=1..T γi(t)
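
    A sketch of one re-estimation step built on these formulas. It uses the state-emission matrix layout of the earlier numpy sketch (bijk = bik); the re-estimation of B is not written out on these slides, so only π̂ and â are updated here, and the observation sequence is an arbitrary example.

      import numpy as np

      # One Baum-Welch re-estimation step for pi and A (state-emission form, b_ijk = b_ik).
      A = np.array([[0.7, 0.3], [0.5, 0.5]])
      B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
      pi = np.array([1.0, 0.0])
      obs = [2, 1, 0]                         # e.g., lem, ice_t, cola
      T, N = len(obs), len(pi)

      # Forward and backward variables (same recurrences as in the earlier sketches).
      alpha = np.zeros((T + 1, N)); beta = np.ones((T + 1, N))
      alpha[0] = pi
      for t in range(T):
          alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
      for t in range(T - 1, -1, -1):
          beta[t] = B[:, obs[t]] * (A @ beta[t + 1])
      prob = float(alpha[0] @ beta[0])        # P(O | model)

      # p_t(i,j) = alpha_i(t) a_ij b_i,ot beta_j(t+1) / P(O | model); gamma_i(t) = sum_j p_t(i,j).
      p = np.array([np.outer(alpha[t] * B[:, obs[t]], beta[t + 1]) * A / prob for t in range(T)])
      gamma = p.sum(axis=2)                   # shape (T, N)

      pi_hat = gamma[0]                                        # pi^_i = gamma_i(1)
      a_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]       # a^_ij = sum_t p_t(i,j) / sum_t gamma_i(t)
      print(pi_hat, a_hat, sep="\n")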


    Parameter Estimation VI

    Namely, given an HMM model λ = (A, B, π), the Baum-Welch algorithm gives new values for the model parameters λ̂ = (Â, B̂, π̂).

    One may prove that P(O | λ̂) ≥ P(O | λ), which is a general property of EM.

    The process is repeated until convergence.

    It is prone to local maxima, another property of EM.