What is a hidden Markov model?




PRIMER

Statistical models called hidden Markov models are a recurring theme in computational biology. What are hidden Markov models, and why are they so useful for so many different problems?

Often, a biological sequence analysis problem is just a matter of putting the right label on each residue. In gene identification, we want to label nucleotides as exons, introns, or intergenic sequence. In sequence alignment, we want to associate residues in a query sequence with homologous residues in a target. We can always write an ad hoc program for any given problem, but the same frustrating issues will always recur. One is that we want to incorporate heterogeneous sources of information. A genefinder, for instance, ought to combine splice-site consensus, codon bias, exon/intron length preferences and open reading frame analysis into one scoring system. How should these different kinds of information be weighted? A second issue is to interpret results probabilistically. Finding a best scoring answer is one thing, but what does the score mean, and how confident are we that the best scoring answer is correct? A third issue is extensibility. The moment we perfect our ad hoc genefinder, we wish we had also modeled translational initiation consensus, alternative splicing and a polyadenylation signal. Too often, piling more reality onto a fragile ad hoc program makes it collapse under its own weight.

Hidden Markov models (HMMs) are a formal foundation for making probabilistic models of linear sequence 'labeling' problems. They provide a conceptual toolkit for building complex models just by drawing an intuitive picture. They are at the heart of a diverse range of programs, including genefinding, profile searches, multiple sequence alignment and regulatory site identification. HMMs are the Legos of computational sequence analysis.

Sean R. Eddy is at the Howard Hughes Medical Institute & Department of Genetics, Washington University School of Medicine, Forest Park Blvd., Box 8510, Saint Louis, Missouri. e-mail: [email protected]

[Figure 1: A toy HMM for 5' splice site recognition. The three states E, 5 and I are shown with their emission probabilities (E: 0.25 each base; 5: A = 0.05, G = 0.95; I: A/T = 0.40 each, C/G = 0.10 each) and transition probabilities. Below, an example sequence with a state path (log P = -41.22), the six highest-scoring parsings (log P ranging from -43.94 to -41.22), and posterior decoding probabilities of 46% and 28% for the two best-scoring 5'SS positions. See text for explanation.]

A toy HMM: 5' splice site recognition
As a simple example, imagine the following caricature of a 5' splice-site recognition problem. Assume we are given a DNA sequence that begins in an exon, contains one 5' splice site and ends in an intron. The problem is to identify where the switch from exon to intron occurred: where the 5' splice site (5'SS) is.

For us to guess intelligently, the sequences of exons, splice sites and introns must have different statistical properties. Let's imagine some simple differences: say that exons have a uniform base composition on average (25% each base), introns are A/T rich (say, 40% each for A/T, 10% each for C/G), and the 5'SS consensus nucleotide is almost always a G (say, 95% G and 5% A).

Starting from this information, we can draw an HMM (Fig. 1). The HMM invokes three states, one for each of the three labels we might assign to a nucleotide: E (exon), 5 (5'SS) and I (intron). Each state has its own emission probabilities (shown above the states), which model the base composition of exons, introns and the consensus G at the 5'SS. Each state also has transition probabilities (arrows), the probabilities of moving from this state to a new state. The transition


probabilities describe the linear order in which we expect the states to occur: one or more Es, one 5, one or more Is.

So, what's hidden?
It's useful to imagine an HMM generating a sequence. When we visit a state, we emit a residue from the state's emission probability distribution. Then, we choose which state to visit next according to the state's transition probability distribution. The model thus generates two strings of information. One is the underlying state path (the labels), as we transition from state to state. The other is the observed sequence (the DNA), each residue being emitted from one state in the state path.
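Written out concretely, the toy model is just two small probability tables. In the Python sketch below, the dict layout and names are mine; the emission values are the ones given in the text, and the 0.9/0.1 transition values are an assumption, chosen because they reproduce the log probabilities quoted for Figure 1.

```python
# Toy 5' splice-site HMM, written as plain dicts.
# States: E (exon), 5 (5'SS), I (intron); "begin" and "end" are silent bookends.

# emit[k][x]: probability that state k emits residue x.
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uniform exon composition
    "5": {"A": 0.05, "G": 0.95},                        # consensus: almost always G
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},  # A/T-rich intron
}

# trans[k][j]: probability of moving from state k to state j.
trans = {
    "begin": {"E": 1.0},              # the sequence starts in an exon
    "E": {"E": 0.9, "5": 0.1},        # one or more Es, then the single 5
    "5": {"I": 1.0},                  # the 5'SS is always followed by intron
    "I": {"I": 0.9, "end": 0.1},      # one or more Is, then the end
}

# Sanity check: every distribution sums to one, as an HMM requires.
for table in (emit, trans):
    for state, dist in table.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```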

The state path is a Markov chain, meaning that what state we go to next depends only on what state we're in. Since we're only given the observed sequence, this underlying state path is hidden: these are the residue labels that we'd like to infer. The state path is a hidden Markov chain.
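The generative view is easy to simulate: start in E, emit a residue from the current state's emission distribution, then draw the next state from its transition distribution until the end state is reached. A sketch (function names are mine; the 0.9/0.1 transition values are assumed as above in the toy model):

```python
import random

# Toy model parameters (0.9/0.1 transition values assumed).
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
trans = {
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}

def draw(dist, rng):
    """Sample one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate(seed=0):
    """Emit (sequence, state path): one or more Es, one 5, one or more Is."""
    rng = random.Random(seed)
    state, seq, path = "E", [], []  # "begin" always enters E
    while state != "end":
        path.append(state)
        seq.append(draw(emit[state], rng))
        state = draw(trans[state], rng)
    return "".join(seq), "".join(path)

seq, path = generate()
```

Each run yields the two strings of information described above: the observed DNA and the hidden labels, one label per emitted residue.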

The probability P(S,π|HMM,θ) that an HMM with parameters θ generates a state path π and an observed sequence S is the product of all the emission probabilities and transition probabilities that were used. For example, consider the 26-nucleotide sequence and state path in the middle of Figure 1, where there are 27 transitions and 26 emissions to tote up. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you'll calculate log P(S,π|HMM,θ) = -41.22.
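That bookkeeping is mechanical enough to do in a few lines. The sketch below (names mine, assumed 0.9/0.1 transitions) scores the Figure 1 sequence against the best state path, with the 5'SS at the fifth G, and reproduces log P = -41.22 in natural log:

```python
from math import log

# Toy model parameters (0.9/0.1 transition values assumed).
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
trans = {
    "begin": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}

def log_joint(seq, path):
    """Sum the logs of every transition and emission the path uses:
    27 transitions (begin -> ... -> end) plus 26 emissions = 53 terms here."""
    lp = log(trans["begin"][path[0]])
    for i, (state, residue) in enumerate(zip(path, seq)):
        lp += log(emit[state][residue])
        nxt = path[i + 1] if i + 1 < len(path) else "end"
        lp += log(trans[state][nxt])
    return lp

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"  # the 26-nt sequence of Figure 1
path = "E" * 18 + "5" + "I" * 7      # 5'SS at the fifth G (position 19)

print(round(log_joint(seq, path), 2))  # -41.22
```

Scoring the alternative path with the 5'SS at the sixth G (position 23) the same way gives -41.71, matching the runner-up score quoted below.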

An HMM is a full probabilistic model: the model parameters and the overall sequence 'scores' are all probabilities. Therefore, we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores.

Finding the best state path
In an analysis problem, we're given a sequence, and we want to infer the hidden state path. There are potentially many state paths that could generate the same sequence. We want to find the one with the highest probability.

For example, if we were given the HMM and the 26-nucleotide sequence in Figure 1, there are 14 possible paths that have nonzero probability, since the 5'SS must fall on one of 14 internal As or Gs. Figure 1 enumerates the six highest-scoring paths (those

with G at the 5'SS). The best one has a log probability of -41.22, which infers that the most likely 5'SS position is at the fifth G.

For most problems, there are so many possible state sequences that we could not afford to enumerate them. The efficient Viterbi algorithm is guaranteed to find the most probable state path given a sequence and an HMM. The Viterbi algorithm is a dynamic programming algorithm quite similar to those used for standard sequence alignment.

Beyond best scoring alignments
Figure 1 shows that one alternative state path differs only slightly in score from putting the 5'SS at the fifth G (log probabilities of -41.71 versus -41.22). How confident are we that the fifth G is the right choice?
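For the toy model, the Viterbi recursion fits in a few lines. In the sketch below (variable names mine, assumed 0.9/0.1 transitions), v[k] holds the best log probability of any state path that accounts for the sequence so far and ends in state k; each position maximizes over predecessor states, and a traceback recovers the winning path.

```python
from math import log, inf

# Toy model parameters (0.9/0.1 transition values assumed).
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
trans = {
    "begin": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}
STATES = ("E", "5", "I")

def lp(x):
    """Log that tolerates impossible (zero-probability) events."""
    return log(x) if x > 0 else -inf

def viterbi(seq):
    """Return (best log probability, best state path) for seq."""
    # Initialize: enter from "begin" and emit the first residue.
    v = {k: lp(trans["begin"].get(k, 0)) + lp(emit[k].get(seq[0], 0)) for k in STATES}
    back = []  # back[i][k]: best predecessor of state k at position i+1
    for residue in seq[1:]:
        ptr, nv = {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: v[j] + lp(trans[j].get(k, 0)))
            ptr[k] = best
            nv[k] = v[best] + lp(trans[best].get(k, 0)) + lp(emit[k].get(residue, 0))
        v = nv
        back.append(ptr)
    # Terminate with the mandatory transition into the end state.
    last = max(STATES, key=lambda k: v[k] + lp(trans[k].get("end", 0)))
    score = v[last] + lp(trans[last].get("end", 0))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return score, "".join(reversed(path))

score, path = viterbi("CTTCATGTGAAAGCAGACGTAAGTCA")
print(round(score, 2))  # -41.22, the best path with the 5'SS at the fifth G
```

The recursion runs in time proportional to (sequence length) x (number of states)^2, which is what makes it affordable when enumeration is not.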

This is an example of an advantage of probabilistic modeling: we can calculate our confidence directly. The probability that residue i was emitted by state k is the sum of the probabilities of all the state paths that use state k to generate residue i (that is, π_i = k in the state path π), normalized by the sum over all possible state paths. In our toy model, this is just one state path in the numerator and a sum over 14 state paths in the denominator. We get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct (Fig. 1, bottom). This is called posterior decoding. For larger problems, posterior decoding uses two dynamic programming algorithms called Forward and Backward, which are essentially like Viterbi, but they sum over possible paths instead of choosing the best.

Making more realistic models
Making an HMM means specifying four things: (i) the symbol alphabet, K different symbols (e.g., ACGT, K = 4); (ii) the number of states in the model, M; (iii) emission probabilities e_i(x) for each state i, that sum to one over the K symbols x, Σ_x e_i(x) = 1; and (iv) transition probabilities t_i(j) for each state i going to any other state j (including itself), that sum to one over the M states j, Σ_j t_i(j) = 1. Any model that has these properties is an HMM.

This means that one can make a new HMM just by drawing a picture corresponding to the problem at hand, like Figure 1. This graphical simplicity lets one focus clearly on the biological definition of a problem.
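In the toy model, posterior decoding can be done by brute force rather than Forward-Backward: there is one candidate state path per internal A or G, and each path's probability is divided by the sum over all 14. The sketch below (names mine, assumed 0.9/0.1 transitions) reproduces the 46% and 28% figures:

```python
# Toy model parameters (0.9/0.1 transition values assumed).
emit = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
trans = {
    "begin": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}

def joint(seq, path):
    """P(seq, path): product of every transition and emission used."""
    p = trans["begin"][path[0]]
    for i, (state, residue) in enumerate(zip(path, seq)):
        p *= emit[state].get(residue, 0.0)
        nxt = path[i + 1] if i + 1 < len(path) else "end"
        p *= trans[state].get(nxt, 0.0)
    return p

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"  # the 26-nt sequence of Figure 1
n = len(seq)

# One candidate path per internal position i: exon up to i-1, 5'SS at i, intron after.
cand = {i: "E" * i + "5" + "I" * (n - i - 1) for i in range(1, n - 1)}
prob = {i: joint(seq, p) for i, p in cand.items()}
total = sum(prob.values())

# Posterior: one path's probability in the numerator, the sum over all in the denominator.
posterior = {i: p / total for i, p in prob.items() if p > 0}

print(len(posterior))           # 14 paths with nonzero probability
print(round(posterior[18], 2))  # 0.46 for the fifth G (position 19)
print(round(posterior[22], 2))  # 0.28 for the sixth G (position 23)
```

Forward and Backward compute these same sums without enumerating paths, which is what makes posterior decoding tractable for realistic models.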

For example, in our toy splice-site model, maybe we're not happy with our discrimination power; maybe we want to add a more realistic six-nucleotide consensus GTRAGT at the 5' splice site. We can put a row of six HMM states in place of the '5' state, to model a six-base ungapped consensus, parameterizing the emission probabilities on known 5' splice sites. And maybe we want to model a complete intron, including a 3' splice site; we just add a row of states for the 3'SS consensus, and add a 3' exon state to let the observed sequence end in an exon instead of an intron. Then maybe we want to build a complete gene model...whatever we add, it's just a matter of drawing what we want.

The catch
HMMs don't deal well with correlations between residues, because they assume that each residue depends only on one underlying state. An example where HMMs are usually inappropriate is RNA secondary structure analysis. Conserved RNA base pairs induce long-range pairwise correlations; one position might be any residue, but the base-paired partner must be complementary. An HMM state path has no way of 'remembering' what a distant state generated.

Sometimes, one can bend the rules of HMMs without breaking the algorithms. For instance, in genefinding, one wants to emit a correlated triplet codon instead of three independent residues; HMM algorithms can readily be extended to triplet-emitting states. However, the basic HMM toolkit can only be stretched so far. Beyond HMMs, there are more powerful (though less efficient) classes of probabilistic models for sequence analysis.

1. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-286 (1989).

2. Durbin, R., Eddy, S.R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, UK, 1998).

nature biotechnology

Wondering how some other mathematical technique really works? Send suggestions for future primers to [email protected].
