Hmm Report

Embed Size (px)

Citation preview

  • 7/26/2019 Hmm Report

    1/27

    CS 594 : Introduction to Computational Biology

    Applications of Hidden Markov Models (HMMs) to

    Computational Biology problems

    -A Report

    Group 4

    Shalini Venkataraman

    Vidya Gunaseelan

    0

  • 7/26/2019 Hmm Report

    2/27

    Thursday !pr "5

    This report e#amines the role o$ a po%er$ul statistical model called Hidden

    Markov Models&'(() in the area o$ computational *iology+ ,e %ill start %ith an

    o-er-ie% o$ '((s and some concepts in *iology+ .e#t %e %ill discuss the use o$

    '((s $or *iological se/uences and nally conclude %ith a discussion on the

    ad-antages and limitations o$ '((s and possi*le $uture %ork+

    1+ Biological Background+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

    1+1+ 2.! 3 2eo#yri*onucleic !cid++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

    1+"+ .!+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

    1++ 6roteins+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    1+4+ The Central 2ogma 7$ (olecular Biology+++++++++++++++++++++++++++++++++++++++++++++++++++++

    1+5+ 6rotein structure+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++4

    1+8+ (ultiple Se/uence !lignment++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++5

    "+ 'idden (arko- (odel &'(() !rchitecture++++++++++++++++++++++++++++++++++++++++++++++++++++++++8

    "+1+ (arko- Chains++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++8

    "+"+ 'idden (arko- (odels++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    "++ !n e#ample o$ a '(( $or 6rotein Se/uences+++++++++++++++++++++++++++++++++++++++++++++++9

    "+4+ Three 6ro*lems 7$ 'idden (arko- (odels++++++++++++++++++++++++++++++++++++++++++++++++10

    + !pplications o$ '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

    +1+ Gene nding and prediction+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

    +"+ 6rotein3 6role '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++15

    ++ 6rediction o$ protein secondary structure using '((s++++++++++++++++++++++++++++1

    4+ '(( implementation++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

    5+ !d-antages o$ '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

    1

  • 7/26/2019 Hmm Report

    3/27

    8+ ;imitations o$ '((s+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1ecti-e in sol-ing this pro*lem+

    8

  • 7/26/2019 Hmm Report

    8/27

    Sunny

    ain

    Cloudy

    #. Hidden Markov Model (HMM) Arcitecture

    #.1. Markov Cains

    ;et the three states o$ %eather *e Sunny Cloudy and ainy+ ,e cannot e#pect

    these three %eather states to $ollo% each other deterministically *ut %e might still

    hope to model the system that generates a %eather pattern+ 7ne %ay to do this is

    to assume that the state o$ the model depends only upon the pre-ious states o$ the

    model+ This is called the Markov assumptionand simplies pro*lems greatly+

    ,hen considering the %eather the (arko- assumption presumes that todays

    %eather can al%ays *e predicted solely gi-en kno%ledge o$ the %eather o$ the past

    $e% days+

    ! (arko- process is a process %hich

    mo-es $rom state to state depending &only)

    on the pre-ious n states+ The process is

    called an order n model %here n is the

    num*er o$ states a>ecting the choice o$

    ne#t state+ The simplest (arko- process is

    a rst order process %here the choice o$

    state is made purely on the *asis o$ the

    pre-ious state+ This gure sho%s all

    possi*le rst order transitions *et%een the

    states o$ the %eather e#ample+

    The state transition matri# *elo% sho%s possi*le transition pro*a*ilities $or the

    %eather e#ample?

  • 7/26/2019 Hmm Report

    9/27

    that is i$ it %as sunny yesterday there is a pro*a*ility o$ 0+5 that it %ill *e sunny

    today and 0+"5 that it %ill *e cloudy or rainy+

    To initiali@e such a system %e need to state %hat the %eather %as &or pro*a*ly

    %as) on the day a$ter creation? %e dene this in a -ector o$ initial pro*a*ilities

    called the -ector+

    So %e kno% it %as sunny on day 1+

    ,e ha-e no% dened a /rst order Markov processconsisting o$ :

    states: Three states 3 sunny cloudy rainy+

    vector: 2ening the pro*a*ility o$ the system *eing in each o$ the states

    at time 0+

    state transition matri": The pro*a*ility o$ the %eather gi-en the pre-ious

    days %eather+

    !ny system that can *e descri*ed in this manner is a (arko- process+

    #.#. Hidden Markov Models

    In some cases the patterns that %e %ish to nd are not descri*ed su>iciently *y a

    (arko- process+ eturning to the %eather e#ample a hermit $or instance may not

  • 7/26/2019 Hmm Report

    10/27

    ha-e access to direct %eather o*ser-ations *ut does ha-e a piece o$ sea%eed+

    =olklore tells us that the state o$ the sea%eed is pro*a*ilistically related to the

    state o$ the %eather 3 the %eather and sea%eed states are closely linked+ In this

    case %e ha-e t%o sets o$ states

    observablestates &the state o$ the sea%eed) and

    iddenstates &the state o$ the %eather)+

    ,e %ish to de-ise an algorithm $or the hermit to $orecast %eather $rom the

    sea%eed and the (arko- assumption %ithout actually e-er seeing the %eather+ The

    diagram *elo% sho%s the hidden and o*ser-a*le states in the %eather e#ample+ It

    is assumed that the hidden states &the true %eather) are modeled *y a simple rst

    order (arko- process and so they are all connected to each other+

    The connections *et%een the hidden

    states and the o*ser-a*le states

    represent the pro*a*ility o$ generating

    a particular o*ser-ed state gi-en that

    the (arko- process is in a particular

    hidden state+ It should thus *e clear

    that all pro*a*ilities Fentering an

    o*ser-a*le state %ill sum to 1 since in

    the a*o-e case it %ould *e the sum o$

    Pr&7*sSun) Pr&7*sCloud) and

    Pr&7*sain)+

    9

  • 7/26/2019 Hmm Report

    11/27

    In addition to the pro*a*ilities dening the (arko- process %e there$ore ha-e

    another matri# termed the output matri" %hich contains the pro*a*ilities o$ the

    o*ser-a*le states gi-en a particular hidden state+ =or the %eather e#ample the

    output matri# might *e?

    So this is a model containing three sets o$ pro*a*ilities in addition to the t%o sets

    o$ states

    vector: contains the pro*a*ility o$ the hidden model *eing in a particular

    hidden state at time tH 1+

    state transition matri": holding the pro*a*ility o$ a hidden state gi-en the

    pre-ious hidden state+

    output matri" : containing the pro*a*ility o$ o*ser-ing a particular

    o*ser-a*le state gi-en that the hidden model is in a particular hidden state+

    Thus a hidden (arko- model is a standard (arko- process augmented *y a set o$

    o*ser-a*le states and some pro*a*ilistic relations *et%een them and the hidden

    states+

    10

  • 7/26/2019 Hmm Report

    12/27

    #.%. An e"ample of a HMM for &rotein -e,uences

    This is a possi*le hidden (arko- model $or the protein !CC+ The protein is

    represented as a se/uence o$ pro*a*ilities+ The num*ers in the *o#es sho% the

    pro*a*ility that an amino acid occurs in a particular state and the num*ers ne#t to

    the directed arcs sho% pro*a*ilities %hich connect the states+ The pro*a*ility o$

    !CC is sho%n as a highlighted path through the model+ There are three kinds o$

    states represented *y three di>erent shapes+ The s/uares are called matc states

    and the amino acids emitted $rom them $orm the conser-ed primary structure o$ a

    protein+ These amino acids are the same as those in the common ancestor or i$

    not are the result o$ su*stitutions+ The diamond shapes are insert statesand emit

    amino acids that result $rom insertions+ The circles are special silent states kno%n

    as delete statesand model deletions+ These type o$ '((s are called 6rotein

    6role3'((s and %ill *e co-ered in more depth in the later sections+

    -coring a -e,uence 0it an HMM

    !ny se/uence can *e represented *y a path through the model+ The pro*a*ility o$

    any se/uence gi-en the model is computed *y multiplying the emission and

    11

    Transition

    Prob.

    Output Prob.

  • 7/26/2019 Hmm Report

    13/27

    transitionpro*a*ilities along the path+ ! path through the model represented *y

    !CC is highlighted+ =or e#ample the pro*a*ility o$ ! *eing emitted in position 1

    is 0+ and the pro*a*ility o$ C *eing emitted in position " is 0+8+ The pro*a*ility o$

    !CC along this path is

    .4*.3*.46*.6*.97*.5*.015*.73*.01*1 = 1.76x10-6.

    #.'. ree &roblems *f Hidden Markov Models

    1) -coring &roblem

    ,e %ant to nd the pro*a*ility o$ an o*ser-ed se/uence gi-en an '((+

    It can *e seen that one method o$ calculating the pro*a*ility o$ the o*ser-ed

    se/uence %ould *e to nd each possi*le se/uence o$ the hidden states and sum

    these pro*a*ilities+ ,e use the =or%ard !lgorithm $or this+

    Consider the '(( sho%n a*o-e+ In this gure se-eral paths e#ist $or the protein

    se/uence !CC+

    The or0ard algoritm employs a matri# sho%n *elo%+ The columns o$ the

    matri# are inde#ed *y the states in the model and the ro%s are inde#ed *y the

    se/uence+ The elements o$ the matri# are initiali@ed to @ero and then computed

    %ith these steps:

    1"

    M1

    M2M3

    I0

    I1

  • 7/26/2019 Hmm Report

    14/27

    1+ The pro*a*ility that the amino acid A%as generated *y state I0 is computed

    and entered as the rst element o$ the matri#+ This is +4J+ H +1"

    "+ The pro*a*ilities that Cis emitted in state (1 &multiplied *y the pro*a*ility o$

    the most likely transition to state (1$rom state I0) and in state I1 &multiplied

    *y the most likely transition to state I1 $rom state I0) are entered into the

    matri# element inde#ed *y C and I1K(1+

    + The sum o$ the t%o pro*a*ilities sum&I1 (1) is calculated+

    4+ ! pointer is set $rom the %inner *ack to state I0+

    5+ Steps "34 are repeated until the matri# is lled+

    The pro*a*ility o$ the se/uence is $ound *y summing the pro*a*ilities in the last

    column+

    Matrix for the Forward algorithm

    #) Alignment &roblem

    ,e o$ten %ish to take a particular '(( and determine $rom an o*ser-ation

    se/uence the most likely se/uence o$ underlying hidden states that might ha-e

    generated it+ This is the alignment pro*lem and the 2iterbi Algoritmis used to

    sol-e this pro*lem+

    1

  • 7/26/2019 Hmm Report

    15/27

    The Viter*i algorithm is similar to the $or%ard algorithm+ 'o%e-er in step

    ma#imum rather than a sum is calculated+ The most likely path through the model

    can no% *e $ound *y $ollo%ing the *ack3pointers+

    Matrix for the Viterbi algorithm

    7nce the most pro*a*le path through the model is kno%n the pro*a*ility o$ a

    se/uence gi-en the model can *e computed *y multiplying all pro*a*ilities along

    the path+

    %) raining &roblem

    !nother tricky pro*lem is ho% to create an '(( in the rst place gi-en a

    particular set o$ related training se/uences+ It is necessary to estimate the amino

    acid emission distri*utions in each state and all state3to3state transition

    pro*a*ilities $rom a set o$ related training se/uences+ This is done *y using the

    Baum3,elch !lgorithm or the =or%ard Back%ard !lgorithm+

    The algorithm proceeds *y making an initial guess o$ the parameters &%hich may

    %ell *e entirely %rong) and then rening it *y assessing its %orth and attempting

    14

  • 7/26/2019 Hmm Report

    16/27

    to reduce the errors it pro-okes %hen tted to the gi-en data+ In this sense it is

    per$orming a $orm o$ gradient descent looking $or a minimum o$ an error measure+

    15

  • 7/26/2019 Hmm Report

    17/27

    %. Applications of HMM3s

    In this section %e %ill del-e into greater depth at specic pro*lems in the area o$

    computational *iology and e#amine the role o$ '((s+

    %.1. 4ene /nding and prediction

    ,e introduce here the gene3prediction '((s that can *e used to predict the

    structure o$ the gene+ 7ur o*Eecti-e is to nd the coding and non3coding regions o$

    an unla*eled string o$ 2.! nucleotides+

    The moti-ation *ehind this is to

    assist in the annotation o$ genomic data produced *y genome se/uencing

    methods

    gain insight into the mechanisms in-ol-ed in transcription splicing and

    other processes

    18

  • 7/26/2019 Hmm Report

    18/27

    !s sho%n in the diagram a*o-e a string o$ 2.! nucleotides containing a gene %ill

    ha-e separate regions

    Introns L non3coding regions %ithin a gene

    M#ons L coding regions

    These regions are separated *y $unctional sites

    Start and stop codons

    Splice sites L acceptors and donors

    In the process o$ transcription only the e#ons are le$t to $orm the protein se/uence

    as depicted *elo%+

    1

  • 7/26/2019 Hmm Report

    19/27

    (any pro*lems in *iological se/uence analysis ha-e a grammatical structure

    + '((s are -ery use$ul in modeling grammar+ The input to such a '(( is the

    genomic 2.! se/uence and the output in the simplest case is a parse tree o$

    e#ons and introns on the 2.! se/uence+

    Sho%n *elo% is a simple model $or unspliced genes that recogni@es the start

    codon stop codon &only one o$ the three possi*le stop codons are sho%n) and the

    codingKnon3coding regions+ This model has *een trained %ith a test set o$ gene

    data+

    'a-ing such a model ho% can %e predict genes in a se/uence o$ anonymous

    2.! ,e simply use the Viter*i algorithm to nd the most pro*a*le path through the

    model

    1

  • 7/26/2019 Hmm Report

    20/2719

  • 7/26/2019 Hmm Report

    21/27

    %.#. &rotein! &ro/le HMMs

    !s %e ha-e seen earlier protein structural similarities make it possi*le to create a

    statistical model o$ a protein $amily %hich is called a pro/le+ The idea is gi-en a

    single amino acid target se/uence o$ unkno%n structure %e %ant to in$er the

    structure o$ the resulting protein+

    The prole '(( is *uilt *y analy@ing the distri*ution o$ amino3acids in a training

    set o$ related proteins+ This '(( in a natural %ay can model positional dependant

    gap penalties+

    The *asic topology o$ a prole '(( is sho%n a*o-e+ Mach position or module in

    the model has three states+ ! state sho%n as a rectangular *o# is a matcstate

    that models the distri*ution o$ letters in the corresponding column o$ an

    alignment+ ! state sho%n *y a diamond3shaped *o# models insertions o$ random

    letters *et%een t%o alignment positions and a state sho%n *y a circle models a

    deletion corresponding to a gap in an alignment+ States o$ neigh*oring positions

    are connected as sho%n *y lines+ =or each o$ these lines there is an associated

    Ftransition pro*a*ility %hich is the pro*a*ility o$ going $rom one state to the

    other+

    "0

    Matcing

    states 5nsertionstatesDeletionstates

  • 7/26/2019 Hmm Report

    22/27

    The match state represents a consensus amino acid $or this position in the protein

    $amily+ The delete state is a non3emitting state and represents skipping this

    consensus position in the multiple alignment+ =inally the insert state models the

    insertion o$ any num*er o$ residues a$ter this consensus position+

    ! repository o$ protein prole '((s can *e $ound in the 6=!( 2ata*ase

    &ttp677pfam.0ustl.edu).

    Building proles $rom a $amily o$ proteins&or 2.!) a prole '(( can *e made $or

    searching a data*ase $or other mem*ers o$ the $amily+ !s %e ha-e seen *e$ore in

    the section on '(( pro*lems prole '((s can also *e used $or the $ollo%ing

    -coring a se,uence

    ,e are calculating the pro*a*ility o$ a se/uence gi-en a prole *y simply

    multiplying emmision and transition pro*a*ilities along the path+

    Classifying se,uences in a database

    Gi-en a '(( $or a protein $amily and some unkno%n se/uences %e are trying to

    nd a path through the model %here the ne% se/uence ts in or %e are tying to

    Nalign the se/uence to the model+ !lignment to the model is an assignment o$

    states to each residue in the se/uence+ There are many such alignments and the

    Vitter*is algorithm is used to gi-e the pro*a*ility o$ the se/uence $or that

    alignment+

    "1

    http://pfam.wustl.edu/http://pfam.wustl.edu/
  • 7/26/2019 Hmm Report

    23/27

    Creating Multiple se,uence alignment

    '((s can *e used to automatically create a multiple alignment $rom a group o$

    unaligned se/uences+ By taking a close look at the alignment %e can see the

    history o$ e-olution+ 7ne great ad-antage o$ '((s is that they can *e estimated

    $rom se/uences %ithout ha-ing to align the se/uences rst+ The se/uences used to

    estimate or trainthe model are called the training se/uences and any reser-ed

    se/uences used to e-aluate the model are called the test se/uences+ The model

    estimation is done %ith the $or%ard3*ack%ard algorithm also kno%n as the Baum3

    ,elch algorithm+ It is an iterati-e algorithm that ma#imi@es the likelihood o$ the

    training se/uences+

    %.%. &rediction of protein secondary structure using HMM3s

    6rediction o$ secondary structures is need $or the prediction o$ protein $unction+ !s

    an alternati-e method to direct O3ray analysis a '(( is used to

    !naly@e the amino3acid se/uences o$ proteins

    ;earn secondary structures such as heli# sheet and turn

    6redict the secondary structures o$ se/uences

    The method is to train the $our '((s o$ secondary structure L heli# sheet turn

    and other L *y training se/uences+ The Baum3,elch method is used to train the

    '((s+ So the '(( o$ heli# is a*le to produce heli#3like se/uences %ith high

    pro*a*ilities+ .o% these '((s can *e used to predict the secondary structure o$

    the test se/uence+ The $or%ard3*ack%ard algorithm is used to compute the

    pro*a*ilities o$ these '((s outputting the test se/uence+ The se/uence has the

    secondary structure %hose '(( sho%ed the highest pro*a*ility to output the

    se/uence+

    ""

  • 7/26/2019 Hmm Report

    24/27

    '. HMM implementation

    These are the t%o pu*licly a-aila*le '(( implementation so$t%are+

    '((M 3 http:KKhmmer+%ustl+eduK

    S!( system 3 http:KK%%%+cse+ucsc+eduKresearchKcomp*ioKsam+html

    +. Advantages of HMMs

    '((s can accommodate -aria*le3length se/uence+

    Because most *iological data has -aria*le3length properties machine learning

    techni/ues %hich re/uire a #ed3length input such as neural net%orks or support

    -ector machines are less success$ul in *iological se/uence analysis

    !llo%s position dependant gap penalties+

    '((s treat insertions and deletions is a statistical manner that is dependant on

    position+

    . 8imitations of HMMs

    ;inear (odel

    So they are una*le to capture higher order correlations among amino3acids+

    (arko- Chain assumption o$ independent e-ents

    6ro*a*ilities o$ states are supposed to *e independent %hich is not true o$ *iology

    Mg

    6&y) must *e independent o$ 6) and -ice -ersa

    "

    P(x)

    .P(y)

    http://hmmer.wustl.edu/http://www.cse.ucsc.edu/research/compbio/sam.htmlhttp://hmmer.wustl.edu/http://www.cse.ucsc.edu/research/compbio/sam.html
  • 7/26/2019 Hmm Report

    25/27

    Standard (achine ;earning 6ro*lems

    In the training pro*lem %e need to %atch out $or local ma#ima and so model may

    not con-erge to a truly optimal parameter set $or a gi-en training set+ Secondly

    since the model is only as good as your training set this may lead to o-er3tting+

    9. *pen areas for researc in HMMs in biology

    Integration o$ structural in$ormation into prole '((s+

    2espite the almost o*-ious application o$ using structural in$ormation on a

    mem*er protein $amily %hen one e#ists to *etter the parameteri@ation o$ the

    '(( this has *een e#tremely hard to achie-e in practice+

    (odel architecture

    The architectures o$ '((s ha-e largely *een chosen to *e the simplest

    architectures that can t the o*ser-ed data+ Is this the *est architecture to use

    Can one use protein structure kno%ledge to make *etter architecture decisions or

    in limited regions to learn the architecture directly $rom the data ,ill these

    implied architectures ha-e implications $or our structural understanding

    Biological mechanism

    In gene prediction the '((s may *e getting close to replicating the same sort o$

    accuracy as the *iological machine &the '((s ha-e the additional task o$ nding

    the gene in the genomic 2.! conte#t %hich is not handled *y the *iological

    machine that processes the .!)+ ,hat constraints does our statistical model

    "4

  • 7/26/2019 Hmm Report

    26/27

    place on the *iological mechanismP in particular can %e consider a *iological

    mechanism that could use the same in$ormation as the '((

    :. $eferences

    1+ ;+ + a*iner and B+ '+ QuangAn Introduction to Hidden Markov Models IMMM

    !SS6 (aga@ine Qanuary 19

  • 7/26/2019 Hmm Report

    27/27

    6rocedings o$ the =ourth International Con$erence on Intelligent Systems $or

    (olecular Biology &(enlo 6ark C!) !!!I 6ress 1998+

    9+ + 'ughey and !+ Rrogh+ 'idden (arko- models $or se/uence analysis:

    e#tension and analysis o$ the *asic method+ "omputer Applications in the

    #iosciences 1":95310+1998

    http:KK%%%+cse+ucsc+eduKresearchKcomp*ioKhtml$ormatpapersKhughkrogh98Kca

    *ios+html

    10+'endersonQ+ Sal@*ergS+ and =asmanR+Finding genes in human +(A with a

    hidden Markov model+ Qournal o$ Computational Biology 199 4 1"33141+

    11+An Introduction to Hidden Markov Models for #iological 'e)uences *y !+

    Rrogh+ In S+ ;+ Sal@*erg et al+ eds+ Computational (ethods in (olecular

    Biology 4538+ Mlse-ier 199