
Hidden Markov Models

CBB 231 / COMPSCI 261

What is an HMM?

An HMM is a stochastic machine M = (Q, Σ, Pt, Pe) consisting of the following:

• a finite set of states, Q = {q0, q1, ..., qm}
• a finite alphabet Σ = {s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → ℝ, i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → ℝ, i.e., Pe(sj | qi)

[Figure: a three-state HMM. Transitions: q0→q1 (100%); q1→q1 (80%), q1→q2 (15%), q1→q0 (5%); q2→q2 (70%), q2→q1 (30%). Emissions: q1 emits Y=100%, R=0%; q2 emits R=100%, Y=0%.]

An Example

M1 = ({q0, q1, q2}, {Y, R}, Pt, Pe)

Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}

Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}

Probability of a Sequence

[Figure: the same three-state Y/R machine as above.]

$$P(YRYRY \mid M_1) = a_{0\to1}\,b_{1,Y}\;a_{1\to2}\,b_{2,R}\;a_{2\to1}\,b_{1,Y}\;a_{1\to2}\,b_{2,R}\;a_{2\to1}\,b_{1,Y}\;a_{1\to0}$$
$$= 1 \times 1 \times 0.15 \times 1 \times 0.3 \times 1 \times 0.15 \times 1 \times 0.3 \times 1 \times 0.05 = 0.00010125$$
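This product can be checked mechanically. Below is a minimal Python sketch (the dictionary layout and function name are my own, not from the slides) that multiplies out the same transition and emission probabilities along the path q0→q1→q2→q1→q2→q1→q0:

    # Transition and emission tables for M1, as defined above.
    Pt = {('q0', 'q1'): 1.0, ('q1', 'q1'): 0.8, ('q1', 'q2'): 0.15,
          ('q1', 'q0'): 0.05, ('q2', 'q2'): 0.7, ('q2', 'q1'): 0.3}
    Pe = {('q1', 'Y'): 1.0, ('q1', 'R'): 0.0, ('q2', 'Y'): 0.0, ('q2', 'R'): 1.0}

    def path_probability(path, seq):
        """P(seq AND path), where path includes the silent start/end state
        q0 at both ends and seq[i] is emitted by path[i+1]."""
        p = 1.0
        for i, symbol in enumerate(seq):
            p *= Pt[(path[i], path[i + 1])]   # move to the next state
            p *= Pe[(path[i + 1], symbol)]    # emit the symbol there
        return p * Pt[(path[-2], path[-1])]   # final return to q0

    print(path_probability(['q0', 'q1', 'q2', 'q1', 'q2', 'q1', 'q0'], 'YRYRY'))
    # prints 0.00010125, matching the hand calculation above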

Another Example

M2 = (Q, Σ, Pt, Pe)
Q = {q0, q1, q2, q3, q4}
Σ = {A, C, G, T}

[Figure: a five-state DNA HMM. The emitting states carry the distributions A=35%/T=25%/C=15%/G=25%, A=27%/T=14%/C=22%/G=37%, A=10%/T=30%/C=40%/G=20%, and A=11%/T=17%/C=43%/G=29%, connected by transitions of 100%, 50%, 50%, 65%, 35%, 20%, 80%, and 100% among q0–q4.]

Finding the Most Probable Path

[Figure: the same five-state DNA HMM as above.]

The most probable path is:

  States:   122222224
  Sequence: CATTAATAG

resulting in this parse:

  States:   122222224
  Sequence: CATTAATAG

Finding the Most Probable Path (continued)

Example: C A T T A A T A G

[Figure: two candidate paths through the model; the top path has probability 7.0×10⁻⁷, the bottom path 2.8×10⁻⁹.]

The winning parse yields:
  feature 1: C
  feature 2: ATTAATA
  feature 3: G

Decoding with an HMM

$$\varphi_{\max} = \operatorname*{argmax}_{\varphi} P(\varphi \mid S) = \operatorname*{argmax}_{\varphi} \frac{P(S \wedge \varphi)}{P(S)} = \operatorname*{argmax}_{\varphi} P(S \wedge \varphi) = \operatorname*{argmax}_{\varphi} P(S \mid \varphi)\,P(\varphi)$$

$$P(\varphi) = \prod_{i=0}^{L} P_t(y_{i+1} \mid y_i) \qquad\qquad P(S \mid \varphi) = \prod_{i=0}^{L-1} P_e(x_i \mid y_{i+1})$$

$$\varphi_{\max} = \operatorname*{argmax}_{\varphi}\; P_t(q_0 \mid y_L) \prod_{i=0}^{L-1} \underbrace{P_e(x_i \mid y_{i+1})}_{\text{emission prob.}}\;\underbrace{P_t(y_{i+1} \mid y_i)}_{\text{transition prob.}}$$

The Best Partial Parse

$\varphi_{i,k}$ = the best partial parse ending in state $q_i$ at position $k$

$$\varphi_{i,k} = \begin{cases} \left(\operatorname*{argmax}_{\varphi_{j,k-1}} P(\varphi_{j,k-1},\,x_0...x_{k-1})\,P_t(q_i \mid q_j)\,P_e(x_k \mid q_i)\right) + q_i & \text{if } k > 0 \\ q_0\,q_i & \text{if } k = 0 \end{cases}$$

$$P(\varphi_{i,k},\,x_0...x_k) = \begin{cases} \max_j \left[P(\varphi_{j,k-1},\,x_0...x_{k-1})\,P_t(q_i \mid q_j)\,P_e(x_k \mid q_i)\right] & \text{if } k > 0 \\ P_t(q_i \mid q_0)\,P_e(x_0 \mid q_i) & \text{if } k = 0 \end{cases}$$

The Viterbi Algorithm

[Figure: the dynamic-programming matrix, with states as rows and sequence positions as columns; cell (i,k) is computed from column k−1.]

$$V(i,k) = \begin{cases} \max_j V(j,k-1)\,P_t(q_i \mid q_j)\,P_e(x_k \mid q_i) & \text{if } k > 0, \\ P_t(q_i \mid q_0)\,P_e(x_0 \mid q_i) & \text{if } k = 0. \end{cases}$$

$$\varphi_{\max} = \operatorname*{argmax}_{\varphi_{i,L-1}}\; V(i,\,L-1)\,P_t(q_0 \mid q_i)$$

Viterbi: Traceback

$$T(i,k) = \begin{cases} \operatorname*{argmax}_j V(j,k-1)\,P_t(q_i \mid q_j)\,P_e(x_k \mid q_i) & \text{if } k > 0, \\ 0 & \text{if } k = 0. \end{cases}$$

The full path is recovered by iterating the traceback pointers from the best final state:

T( T( T( ... T( T(i, L−1), L−2) ..., 2), 1), 0) = 0

Viterbi Algorithm in Pseudocode

trans[qi] = {qj | Pt(qi|qj) > 0}
emit[s] = {qi | Pe(s|qi) > 0}

[Pseudocode figure: initialization; fill out the main part of the DP matrix; choose the best state from the last column of the DP matrix; traceback.]
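The pseudocode itself was a figure on the original slide and is not reproduced here. The following is a minimal Python sketch of the same algorithm (function and variable names are mine), computed in log space to avoid underflow on long sequences:

    import math

    def viterbi(seq, states, Pt, Pe):
        """Most probable state path for seq, with silent start/end state 0.
        states lists the emitting states; Pt[(i,j)] and Pe[(i,s)] are dicts."""
        def lg(p):
            return math.log(p) if p > 0 else float('-inf')
        L = len(seq)
        V = [{} for _ in range(L)]   # V[k][i]: best log-prob ending in i at k
        T = [{} for _ in range(L)]   # T[k][i]: best predecessor of i at k
        for i in states:             # initialization: enter from q0, emit x0
            V[0][i] = lg(Pt.get((0, i), 0)) + lg(Pe.get((i, seq[0]), 0))
            T[0][i] = 0
        for k in range(1, L):        # fill out the main part of the DP matrix
            for i in states:
                best = max(states, key=lambda j: V[k-1][j] + lg(Pt.get((j, i), 0)))
                V[k][i] = (V[k-1][best] + lg(Pt.get((best, i), 0))
                           + lg(Pe.get((i, seq[k]), 0)))
                T[k][i] = best
        # choose the best state from the last column (it must return to q0)
        end = max(states, key=lambda i: V[L-1][i] + lg(Pt.get((i, 0), 0)))
        path = [end]                 # traceback
        for k in range(L - 1, 0, -1):
            path.append(T[k][path[-1]])
        return path[::-1]

    # The Y/R machine from earlier, with states numbered 0 (start), 1, 2:
    Pt = {(0, 1): 1.0, (1, 1): 0.8, (1, 2): 0.15, (1, 0): 0.05,
          (2, 2): 0.7, (2, 1): 0.3}
    Pe = {(1, 'Y'): 1.0, (2, 'R'): 1.0}
    print(viterbi('YRYRY', [1, 2], Pt, Pe))   # [1, 2, 1, 2, 1]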

The Forward Algorithm: Probability of a Sequence

F(i,k) represents the probability that the machine emits the subsequence x0...xk−1 by any path ending in state qi, i.e., so that symbol xk−1 is emitted by state qi.

$$F(i,k) = \begin{cases} 1 & \text{for } k = 0,\ i = 0 \\ 0 & \text{for } k > 0,\ i = 0 \\ 0 & \text{for } k = 0,\ i > 0 \\ \displaystyle\sum_{j=0}^{|Q|-1} F(j,k-1)\,P_t(q_i \mid q_j)\,P_e(x_{k-1} \mid q_i) & \text{for } 1 \le k \le |S|,\ 1 \le i < |Q| \end{cases}$$

$$P(S \mid M) = \sum_{i=0}^{|Q|-1} F(i,\,|S|)\,P_t(q_0 \mid q_i)$$

[Figure: the same dynamic-programming matrix, with cell (i,k) computed from column k−1.]

Viterbi takes the single most probable path: $\max_j V(j,k-1)\,P_t(q_i \mid q_j)\,P_e(x_k \mid q_i)$

Forward sums over all paths: $\sum_{\text{all paths }\varphi} P(S,\varphi)$

The Forward Algorithm in Pseudocode

[Pseudocode figure: fill out the DP matrix; sum over the final column to get P(S).]
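Again the pseudocode itself was an image on the slide. Below is a sketch of the recurrence (names mine); the columns are indexed by emitted symbol rather than carrying a separate column for the silent start state, but the same quantity is computed. Plain probabilities are used for brevity; a production version would rescale each column or work in log space:

    def forward(seq, states, Pt, Pe):
        """P(seq | M): sum over all paths, with silent start/end state 0."""
        L = len(seq)
        F = [{} for _ in range(L)]
        for i in states:             # first column: enter from q0, emit x0
            F[0][i] = Pt.get((0, i), 0.0) * Pe.get((i, seq[0]), 0.0)
        for k in range(1, L):        # fill out the DP matrix
            for i in states:
                F[k][i] = sum(F[k-1][j] * Pt.get((j, i), 0.0)
                              for j in states) * Pe.get((i, seq[k]), 0.0)
        # sum over the final column, weighted by the return to q0
        return sum(F[L-1][i] * Pt.get((i, 0), 0.0) for i in states)

    # For the Y/R machine the emissions are deterministic, so the only path
    # with nonzero probability is the Viterbi path and the two results agree:
    Pt = {(0, 1): 1.0, (1, 1): 0.8, (1, 2): 0.15, (1, 0): 0.05,
          (2, 2): 0.7, (2, 1): 0.3}
    Pe = {(1, 'Y'): 1.0, (2, 'R'): 1.0}
    print(forward('YRYRY', [1, 2], Pt, Pe))   # 0.00010125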

Training an HMM from Labeled Sequences

[Figure: a 40 bp training sequence, CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC, labeled with the state that emitted each base (0 = start/end state q0, 1 = q1, 2 = q2).]

Transition counts (with normalized probabilities):

            to 0       to 1       to 2
from 0     0 (0%)     1 (100%)    0 (0%)
from 1     1 (4%)    21 (84%)     3 (12%)
from 2     0 (0%)     3 (20%)    12 (80%)

Emission counts (with normalized probabilities):

              A        C        G        T
in state 1  6 (24%)  7 (28%)  5 (20%)  7 (28%)
in state 2  3 (20%)  3 (20%)  2 (13%)  7 (47%)

$$a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}} \qquad \text{(transitions)}$$

$$e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}} \qquad \text{(emissions)}$$

where $A_{i,j}$ is the count of transitions from $q_i$ to $q_j$, and $E_{i,k}$ is the count of emissions of symbol $s_k$ in state $q_i$.
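A sketch of these two estimators in Python (names mine; it assumes labels[i] gives the state number that emitted seq[i], with state 0 the silent start/end state):

    from collections import Counter

    def train_labeled(seq, labels):
        """Estimate transition and emission distributions by normalizing
        counts from a labeled sequence (state 0 = silent start/end)."""
        path = [0] + list(labels) + [0]        # flank the labels with q0
        A = Counter(zip(path, path[1:]))       # A[(i,j)]: count of i -> j
        E = Counter(zip(labels, seq))          # E[(i,s)]: s emitted in state i
        a_tot = Counter(i for i, _ in A.elements())  # transitions out of i
        e_tot = Counter(i for i, _ in E.elements())  # emissions from i
        Pt = {(i, j): n / a_tot[i] for (i, j), n in A.items()}
        Pe = {(i, s): n / e_tot[i] for (i, s), n in E.items()}
        return Pt, Pe

    # e.g. the labeled parse from the M2 example:
    Pt, Pe = train_labeled('CATTAATAG', [1, 2, 2, 2, 2, 2, 2, 2, 4])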

Recall: Eukaryotic Gene Structure

[Figure: eukaryotic gene structure. The coding segment runs from the start codon (ATG) to the stop codon (TGA); exons are separated by introns, each intron beginning at a donor site (GT) and ending at an acceptor site (AG). Splicing exons 1–3 together yields the complete mRNA.]

Using an HMM for Gene Prediction

[Figure: the Markov model (states: Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon, plus q0); the input sequence AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA; the most probable path through the model; and the resulting gene prediction.]

Higher Order Markovian Eukaryotic Recognizer (HOMER)

[Figure: the progression of HOMER versions: H3, H5, H17, H27, H77, H95.]

HOMER, version H3

N = intergenic state, E = exon state, I = intron state

Tested on 500 Arabidopsis genes:

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
baseline    100  28   44     0    0         0    0               0    0    0     0    0
H3          53   88   66     0    0         0    0               0    0    0     0    0

[Figure: the HOMER state diagram: Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon, plus q0.]

Recall: Sensitivity and Specificity

$$Sn = \frac{TP}{TP+FN} \qquad Sp = \frac{TP}{TP+FP} \qquad F = \frac{2 \times Sn \times Sp}{Sn + Sp}$$
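In code these are one-liners (a trivial sketch; the function name is mine):

    def sn_sp_f(TP, FP, FN):
        """Sensitivity, specificity (as defined above), and their F-score.
        Note: Sp = TP/(TP+FP) as used here is elsewhere called precision."""
        sn = TP / (TP + FN)
        sp = TP / (TP + FP)
        return sn, sp, 2 * sn * sp / (sn + sp)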

HOMER, version H5

Three exon states, one for each of the three codon positions.

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
H3          53   88   66     0    0         0    0               0    0    0     0    0
H5          65   91   76     1    3         3    3               0    0    0     0    0

HOMER, version H17

Adds dedicated states for the start codon, stop codon, donor site, and acceptor site.

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
H5          65   91   76     1    3         3    3               0    0    0     0    0
H17         81   93   87     34   48        43   37              19   24   21    7    35

[Figure: the H17 state diagram: Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon, plus q0.]

Maintaining Phase Across an Intron

[Figure: the sequence GTATGCGATAGTCAAGAGTGATCGCTAGACC (coordinates 0–30) annotated with the codon phase (0, 1, 2 repeating) of each exonic base, showing the phase carried unchanged across an intron.]

HOMER, version H27

Three separate intron models, one for each codon phase.

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
H17         81   93   87     34   48        43   37              19   24   21    7    35
H27         83   93   88     40   49        41   36              23   27   25    8    38

Recall: Weight Matrices

[Figure: weight matrices for ATG (start codons); TGA, TAA, TAG (stop codons); GT (donor splice sites); and AG (acceptor splice sites).]

HOMER, version H77

Weight matrices model the positional biases near splice sites.

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
H27         83   93   88     40   49        41   36              23   27   25    8    38
H77         88   96   92     66   67        51   46              47   46   46    13   65

HOMER, version H95

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
H77         88   96   92     66   67        51   46              47   46   46    13   65
H95         92   97   94     79   76        57   53              62   59   60    19   93

Summary of HOMER Results

            nucleotides      splice sites   start/stop codons   exons            genes
            Sn   Sp   F      Sn   Sp        Sn   Sp              Sn   Sp   F     Sn   #
baseline    100  28   44     0    0         0    0               0    0    0     0    0
H3          53   88   66     0    0         0    0               0    0    0     0    0
H5          65   91   76     1    3         3    3               0    0    0     0    0
H17         81   93   87     34   48        43   37              19   24   21    7    35
H27         83   93   88     40   49        41   36              23   27   25    8    38
H77         88   96   92     66   67        51   46              47   46   46    13   65
H95         92   97   94     79   76        57   53              62   59   60    19   93

Higher-order Markov Models

0th order: P(G)     (each base conditioned on no preceding bases)
1st order: P(G|C)   (each base conditioned on the one preceding base)
2nd order: P(G|AC)  (each base conditioned on the two preceding bases)

(illustrated on the sequence A C G C T A)

$$P_e(g_n \mid g_0...g_{n-1},\,q_j) \approx \frac{C(g_0...g_n,\,q_j)}{\sum_{s\in\Sigma} C(g_0...g_{n-1}s,\,q_j)}$$
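A sketch of this estimator (the counts table C, mapping (n-mer, state) pairs to counts, is a hypothetical structure of mine, as are the names):

    def emission_estimate(C, context, base, state, alphabet='ACGT'):
        """P_e(base | context, state), estimated from n-mer counts:
        C(context+base, state) / sum over s of C(context+s, state)."""
        total = sum(C.get((context + s, state), 0) for s in alphabet)
        if total == 0:
            return None   # context never seen in training; see back-off below
        return C.get((context + base, state), 0) / total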

Higher-order Markov Models

         nucleotides      splice sites   starts/stops   exons            genes
order    Sn   Sp   F      Sn   Sp        Sn   Sp        Sn   Sp   F      Sn   #
0        92   97   94     79   76        57   53        62   59   60     19   93
1        95   98   97     87   81        64   61        72   68   70     25   127
2        98   98   98     91   82        65   62        76   69   72     27   136
3        98   98   98     91   82        67   63        76   69   72     28   140
4        98   97   98     90   81        69   64        76   68   72     29   143
5        98   97   98     90   81        66   62        74   67   70     27   137

(all rows are HOMER version H95, varying the Markov order)

Variable-Order Markov Models

Back-off: use the full-length context only if it was observed often enough in training; otherwise recurse on a shorter context.

$$P_e^{\mathrm{backoff}}(g_n \mid g_0...g_{n-1},\,q_j) = \begin{cases} \dfrac{C(g_0...g_n,\,q_j)}{\sum_{s\in\Sigma} C(g_0...g_{n-1}s,\,q_j)} & \text{if } C(g_0...g_{n-1},\,q_j) \ge K \text{ or } n = 0 \\[2ex] P_e^{\mathrm{backoff}}(g_n \mid g_1...g_{n-1},\,q_j) & \text{otherwise} \end{cases}$$

Interpolation: blend the full-context estimate with the shorter-context estimate.

$$P_e^{\mathrm{IMM}}(s \mid g_0...g_{k-1}) = \begin{cases} \lambda_G\,P_e(s \mid g_0...g_{k-1}) + (1-\lambda_G)\,P_e^{\mathrm{IMM}}(s \mid g_1...g_{k-1}) & \text{if } k > 0 \\ P_e(s) & \text{if } k = 0 \end{cases}$$

$$\lambda_G = \begin{cases} 1 & \text{if } m \ge 400 \\ 0 & \text{if } m < 400 \text{ and } c < 0.5 \\ \dfrac{c}{400}\displaystyle\sum_{x\in\Sigma} C(g_0...g_{k-1}x) & \text{otherwise} \end{cases}$$

where m is the number of times the context g0...gk−1 was observed and c is a confidence score for the full-context estimate.
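The back-off rule translates directly into a recursive sketch (same hypothetical counts table C as above; K and the names are mine):

    def backoff_estimate(C, context, base, state, K=1, alphabet='ACGT'):
        """Use the full context only if it was seen at least K times (or is
        already empty); otherwise drop its oldest base and recurse."""
        total = sum(C.get((context + s, state), 0) for s in alphabet)
        if total >= K or context == '':
            return C.get((context + base, state), 0) / total if total else 0.0
        return backoff_estimate(C, context[1:], base, state, K, alphabet)

The interpolated (IMM) variant replaces this hard cutoff with the λ_G blend of the full-context and shortened-context estimates defined above.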

Interpolation Results

[Figure: performance of the interpolated model.]

Summary

• An HMM is a stochastic generative model which emits sequences.

• Parsing with an HMM can be accomplished by using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.

• Training of unambiguous HMMs can be accomplished using labeled sequence training.

• Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...).

• Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).
