7/26/2019 Hmm Report
CS 594 : Introduction to Computational Biology
Applications of Hidden Markov Models (HMMs) to
Computational Biology problems
-A Report
Group 4
Shalini Venkataraman
Vidya Gunaseelan
Thursday, Apr 25
This report examines the role of a powerful statistical model called Hidden
Markov Models (HMM) in the area of computational biology. We will start with an
overview of HMMs and some concepts in biology. Next we will discuss the use of
HMMs for biological sequences and finally conclude with a discussion on the
advantages and limitations of HMMs and possible future work.
1. Biological Background
1.1. DNA - Deoxyribonucleic Acid
1.2. RNA
1.3. Proteins
1.4. The Central Dogma Of Molecular Biology
1.5. Protein structure
1.6. Multiple Sequence Alignment
2. Hidden Markov Model (HMM) Architecture
2.1. Markov Chains
2.2. Hidden Markov Models
2.3. An example of a HMM for Protein Sequences
2.4. Three Problems Of Hidden Markov Models
3. Applications of HMMs
3.1. Gene finding and prediction
3.2. Protein Profile HMMs
3.3. Prediction of protein secondary structure using HMMs
4. HMM implementation
5. Advantages of HMMs
6. Limitations of HMMs
[...]ective in solving this problem.
2. Hidden Markov Model (HMM) Architecture
2.1. Markov Chains
[Figure: state diagram of the weather example with the states Sunny, Rainy and Cloudy]
Let the three states of weather be Sunny, Cloudy and Rainy. We cannot expect
these three weather states to follow each other deterministically, but we might still
hope to model the system that generates a weather pattern. One way to do this is
to assume that the state of the model depends only upon the previous states of the
model. This is called the Markov assumption and simplifies problems greatly.
When considering the weather, the Markov assumption presumes that today's
weather can always be predicted solely given knowledge of the weather of the past
few days.
A Markov process is a process which moves from state to state depending (only)
on the previous n states. The process is called an order n model, where n is the
number of states affecting the choice of next state. The simplest Markov process is
a first order process, where the choice of state is made purely on the basis of the
previous state. This figure shows all possible first order transitions between the
states of the weather example.
The state transition matrix below shows possible transition probabilities for the
weather example:
[Figure: state transition matrix for the weather example]
that is, if it was sunny yesterday, there is a probability of 0.5 that it will be sunny
today, and 0.25 each that it will be cloudy or rainy.
To initialize such a system, we need to state what the weather was (or probably
was) on the day after creation; we define this in a vector of initial probabilities,
called the π vector.
[Figure: the π vector for the weather example]
So, we know it was sunny on day 1.
We have now defined a first order Markov process consisting of:
states: Three states - sunny, cloudy, rainy.
π vector: Defining the probability of the system being in each of the states at time 0.
state transition matrix: The probability of the weather given the previous day's weather.
Any system that can be described in this manner is a Markov process.
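The three ingredients listed above can be written down directly. The following is a minimal Python sketch and not part of the original report: only the "sunny" row of the transition matrix (0.5, 0.25, 0.25) and the initial vector (sunny on day 1) come from the text, while the "cloudy" and "rainy" rows are assumed values chosen for illustration.

```python
# First-order Markov chain for the weather example.
STATES = ["sunny", "cloudy", "rainy"]

# Initial probabilities (the pi vector): it was sunny on day 1.
PI = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}

# TRANSITION[yesterday][today]. Only the "sunny" row is taken from the text;
# the other two rows are assumed values for illustration.
TRANSITION = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}


def sequence_probability(days):
    """Probability of observing the given sequence of weather states."""
    prob = PI[days[0]]
    for yesterday, today in zip(days, days[1:]):
        prob *= TRANSITION[yesterday][today]
    return prob


# Every row of the transition matrix must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in TRANSITION.values())

print(sequence_probability(["sunny", "sunny", "rainy"]))  # 1.0 * 0.5 * 0.25 = 0.125
```

Because the process is first order, the probability of a whole sequence is just the initial probability times one transition probability per day.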
2.2. Hidden Markov Models
In some cases the patterns that we wish to find are not described sufficiently by a
Markov process. Returning to the weather example, a hermit for instance may not
have access to direct weather observations, but does have a piece of seaweed.
Folklore tells us that the state of the seaweed is probabilistically related to the
state of the weather - the weather and seaweed states are closely linked. In this
case we have two sets of states:
observable states (the state of the seaweed) and
hidden states (the state of the weather).
We wish to devise an algorithm for the hermit to forecast weather from the
seaweed and the Markov assumption without actually ever seeing the weather. The
diagram below shows the hidden and observable states in the weather example. It
is assumed that the hidden states (the true weather) are modeled by a simple first
order Markov process, and so they are all connected to each other.
The connections between the hidden states and the observable states represent
the probability of generating a particular observed state given that the Markov
process is in a particular hidden state. It should thus be clear that the
probabilities on the connections leaving any one hidden state sum to 1: for
example, Pr(Obs|Sun) summed over all observable states Obs must equal 1. The
probabilities "entering" a single observable state, Pr(Obs|Sun), Pr(Obs|Cloud) and
Pr(Obs|Rain), need not sum to 1.
In addition to the probabilities defining the Markov process, we therefore have
another matrix, termed the output matrix, which contains the probabilities of the
observable states given a particular hidden state. For the weather example the
output matrix might be:
[Figure: output matrix for the weather example]
So this is a model containing three sets of probabilities in addition to the two sets
of states:
π vector: contains the probability of the hidden model being in a particular hidden state at time t = 1.
state transition matrix: holding the probability of a hidden state given the previous hidden state.
output matrix: containing the probability of observing a particular observable state given that the hidden model is in a particular hidden state.
Thus a hidden Markov model is a standard Markov process augmented by a set of
observable states, and some probabilistic relations between them and the hidden
states.
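To make the three sets of probabilities concrete, here is a small Python sketch of the weather/seaweed model that also draws a sample from it. It is an illustration under stated assumptions: the seaweed states ("dry", "damp", "soggy") and every number except the "sunny" transition row and the π vector are hypothetical, since the report's matrices were figures that are not reproduced here.

```python
import random

HIDDEN = ["sunny", "cloudy", "rainy"]
OBSERVABLE = ["dry", "damp", "soggy"]  # assumed seaweed states

# pi vector: probability of each hidden state at the first time step.
PI = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}

# State transition matrix A[previous][current] (cloudy/rainy rows assumed).
A = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}

# Output matrix B[hidden][observable] (all values assumed).
B = {
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.5, "soggy": 0.2},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}


def draw(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng):
    """Sample a hidden weather sequence and the seaweed states it emits."""
    hidden, observed = [], []
    state = draw(PI, rng)
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(B[state], rng))
        state = draw(A[state], rng)
    return hidden, observed


hidden, observed = generate(5, random.Random(0))
print(hidden)
print(observed)
```

The hermit only ever sees the second printed list; the algorithms in the following sections reason about the first list from the second.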
2.3. An example of a HMM for Protein Sequences
[Figure: a possible hidden Markov model for the protein ACCY]
This is a possible hidden Markov model for the protein ACCY. The protein is
represented as a sequence of probabilities. The numbers in the boxes show the
probability that an amino acid occurs in a particular state, and the numbers next to
the directed arcs show probabilities which connect the states. The probability of
ACCY is shown as a highlighted path through the model. There are three kinds of
states represented by three different shapes. The squares are called match states,
and the amino acids emitted from them form the conserved primary structure of a
protein. These amino acids are the same as those in the common ancestor or, if
not, are the result of substitutions. The diamond shapes are insert states and emit
amino acids that result from insertions. The circles are special, silent states known
as delete states and model deletions. These types of HMMs are called Protein
Profile-HMMs and will be covered in more depth in the later sections.
Scoring a Sequence with an HMM
Any sequence can be represented by a path through the model. The probability of
any sequence, given the model, is computed by multiplying the emission (output)
and transition probabilities along the path. A path through the model represented by
ACCY is highlighted. For example, the probability of A being emitted in position 1
is 0.3, and the probability of C being emitted in position 2 is 0.6. The probability of
ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6.
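The product above can be checked numerically. The short Python sketch below multiplies the ten factors exactly as they are listed in the text; the additional log-probability line is an assumption about common practice (sums of logarithms avoid numerical underflow on long sequences) and is not something the report itself uses.

```python
import math

# Factors along the highlighted path, in the order given in the text.
path_factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

# Direct product of emission and transition probabilities.
probability = math.prod(path_factors)

# Equivalent score in log space (sum of logs instead of product).
log_probability = sum(math.log(p) for p in path_factors)

print(f"{probability:.3g}")  # 1.76e-06
print(f"{log_probability:.2f}")
```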
2.4. Three Problems Of Hidden Markov Models
1) Scoring Problem
We want to find the probability of an observed sequence given an HMM.
It can be seen that one method of calculating the probability of the observed
sequence would be to find each possible sequence of the hidden states, and sum
these probabilities. We use the Forward Algorithm for this.
Consider the HMM shown above. In this figure several paths exist for the protein
sequence ACCY.
The Forward algorithm employs a matrix, shown below. The columns of the
matrix are indexed by the states in the model, and the rows are indexed by the
sequence. The elements of the matrix are initialized to zero and then computed
with these steps:
1. The probability that the amino acid A was generated by state I0 is computed
and entered as the first element of the matrix. This is .4 * .3 = .12
2. The probabilities that C is emitted in state M1 (multiplied by the probability of
the most likely transition to state M1 from state I0) and in state I1 (multiplied
by the most likely transition to state I1 from state I0) are entered into the
matrix element indexed by C and I1/M1.
3. The sum of the two probabilities, sum(I1, M1), is calculated.
4. A pointer is set from the winner back to state I0. (The back-pointers are used only
by the Viterbi algorithm; the Forward sum itself does not need them.)
5. Steps 2-4 are repeated until the matrix is filled.
The probability of the sequence is found by summing the probabilities in the last
column.
[Figure: Matrix for the Forward algorithm]
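The scoring problem can be sketched in a few lines of Python. The example below uses the weather/seaweed model rather than the protein model, because the protein model's full parameters are only available in the missing figure; the seaweed states and all numbers other than the "sunny" transition row and the π vector are assumed. It computes the sequence probability twice, once with the forward recursion and once by explicitly enumerating every hidden path, so the two can be compared.

```python
from itertools import product

HIDDEN = ["sunny", "cloudy", "rainy"]
OBSERVABLE = ["dry", "damp", "soggy"]  # assumed seaweed states
PI = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}
A = {  # A[previous][current]; cloudy/rainy rows are assumed
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}
B = {  # B[hidden][observable]; all values assumed
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.5, "soggy": 0.2},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}


def forward(observations):
    """P(observations | model), summing over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: PI[s] * B[s][observations[0]] for s in HIDDEN}
    for obs in observations[1:]:
        alpha = {
            s: B[s][obs] * sum(alpha[r] * A[r][s] for r in HIDDEN)
            for s in HIDDEN
        }
    return sum(alpha.values())


def brute_force(observations):
    """Same quantity, by enumerating every hidden path explicitly."""
    total = 0.0
    for path in product(HIDDEN, repeat=len(observations)):
        p = PI[path[0]] * B[path[0]][observations[0]]
        for prev, cur, obs in zip(path, path[1:], observations[1:]):
            p *= A[prev][cur] * B[cur][obs]
        total += p
    return total


seen = ["dry", "damp", "soggy"]
print(forward(seen), brute_force(seen))
```

The enumeration visits a number of paths that grows exponentially with the sequence length, whereas the forward recursion does a fixed amount of work per observation, which is why the matrix formulation is used in practice.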
2) Alignment Problem
We often wish to take a particular HMM, and determine from an observation
sequence the most likely sequence of underlying hidden states that might have
generated it. This is the alignment problem, and the Viterbi Algorithm is used to
solve this problem.
The Viterbi algorithm is similar to the forward algorithm. However, in step 3, a
maximum rather than a sum is calculated. The most likely path through the model
can now be found by following the back-pointers.
[Figure: Matrix for the Viterbi algorithm]
Once the most probable path through the model is known, the probability of a
sequence along that path can be computed by multiplying all probabilities along
the path.
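A minimal Python sketch of the Viterbi recursion with back-pointers follows. As before, it runs on the weather/seaweed model with assumed seaweed states and assumed numbers (only the "sunny" transition row and the π vector come from the text); it is meant to illustrate the max-and-pointer idea, not to reproduce the protein example.

```python
HIDDEN = ["sunny", "cloudy", "rainy"]
PI = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}
A = {  # A[previous][current]; cloudy/rainy rows are assumed
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}
B = {  # B[hidden][observable]; all values assumed
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.5, "soggy": 0.2},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}


def viterbi(observations):
    """Return (most likely hidden path, probability of that path)."""
    # best[s] = probability of the best path so far that ends in state s
    best = {s: PI[s] * B[s][observations[0]] for s in HIDDEN}
    pointers = []  # one {state: best predecessor} dictionary per later step
    for obs in observations[1:]:
        step, new_best = {}, {}
        for s in HIDDEN:
            # Maximum instead of the sum used by the forward algorithm.
            prev = max(HIDDEN, key=lambda r: best[r] * A[r][s])
            step[s] = prev
            new_best[s] = best[prev] * A[prev][s] * B[s][obs]
        pointers.append(step)
        best = new_best
    # Follow the back-pointers from the best final state.
    last = max(HIDDEN, key=lambda s: best[s])
    path = [last]
    for step in reversed(pointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


path, prob = viterbi(["dry", "damp", "soggy", "soggy"])
print(path, prob)
```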
3) Training Problem
Another tricky problem is how to create an HMM in the first place, given a
particular set of related training sequences. It is necessary to estimate the amino
acid emission distributions in each state and all state-to-state transition
probabilities from a set of related training sequences. This is done by using the
Baum-Welch Algorithm, also called the Forward-Backward Algorithm.
The algorithm proceeds by making an initial guess of the parameters (which may
well be entirely wrong) and then refining it by assessing its worth, and attempting
to reduce the errors it provokes when fitted to the given data. In this sense, it
behaves like a form of gradient descent, looking for a minimum of an error measure.
Strictly speaking it is an expectation-maximization procedure: each iteration
re-estimates the parameters so that the likelihood of the training data does not
decrease.
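One re-estimation step of Baum-Welch can be sketched as below. This is a bare-bones illustration under several assumptions that are not in the report: a single training sequence, integer-coded symbols, strictly positive initial parameters, no scaling against underflow, and made-up starting values. Real implementations train on many sequences and work with scaled or log probabilities.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(obs[0..t], state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(obs[t+1..] | state i at time t)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One re-estimation; returns (pi, A, B, likelihood before the update)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i] = P(state i at time t | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j] = P(state i at time t and state j at time t + 1 | obs)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical training data (symbols 0/1) and an arbitrary initial guess.
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.4], [0.3, 0.7]]

likelihoods = []
for _ in range(10):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    likelihoods.append(likelihood)
print(likelihoods[0], likelihoods[-1])
```

The recorded likelihoods should never decrease from one iteration to the next, but the procedure may still settle in a local maximum that depends on the initial guess.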
3. Applications of HMMs
In this section, we will delve into greater depth at specific problems in the area of
computational biology and examine the role of HMMs.
3.1. Gene finding and prediction
We introduce here the gene-prediction HMMs that can be used to predict the
structure of the gene. Our objective is to find the coding and non-coding regions of
an unlabeled string of DNA nucleotides.
The motivation behind this is to
assist in the annotation of genomic data produced by genome sequencing methods
gain insight into the mechanisms involved in transcription, splicing and other processes
As shown in the diagram above, a string of DNA nucleotides containing a gene will
have separate regions:
Introns - non-coding regions within a gene
Exons - coding regions
These regions are separated by functional sites:
Start and stop codons
Splice sites - acceptors and donors
After transcription, splicing removes the introns, so only the exons are left to form
the sequence that codes for the protein, as depicted below.
Many problems in biological sequence analysis have a grammatical structure.
HMMs are very useful in modeling grammar. The input to such a HMM is the
genomic DNA sequence and the output, in the simplest case, is a parse tree of
exons and introns on the DNA sequence.
Shown below is a simple model for unspliced genes that recognizes the start
codon, stop codon (only one of the three possible stop codons is shown) and the
coding/non-coding regions. This model has been trained with a set of gene data.
Having such a model, how can we predict genes in a sequence of anonymous
DNA? We simply use the Viterbi algorithm to find the most probable path through
the model.
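As a toy illustration of this idea, the Python sketch below labels each nucleotide as coding or non-coding with a two-state HMM and a log-space Viterbi search. The model is a drastic simplification and entirely hypothetical: it has no start codon, stop codon or splice-site states, and the transition and emission values (a GC-rich "coding" state, an AT-rich "non-coding" state) were chosen by hand rather than trained.

```python
import math

STATES = ["N", "C"]  # N = non-coding, C = coding (hypothetical two-state model)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def label_regions(dna):
    """Viterbi in log space; returns one state label per nucleotide."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][dna[0]]) for s in STATES}
    pointers = []
    for base in dna[1:]:
        step, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            step[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        pointers.append(step)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    labels = [state]
    for step in reversed(pointers):
        state = step[state]
        labels.append(state)
    return "".join(reversed(labels))


print(label_regions("ATATATGCGCGC"))  # NNNNNNCCCCCC
```

Working with sums of logarithms rather than products keeps the scores in a safe numerical range for sequences of genomic length.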
3.2. Protein Profile HMMs
As we have seen earlier, protein structural similarities make it possible to create a
statistical model of a protein family, which is called a profile. The idea is, given a
single amino acid target sequence of unknown structure, we want to infer the
structure of the resulting protein.
The profile HMM is built by analyzing the distribution of amino-acids in a training
set of related proteins. This HMM can, in a natural way, model position-dependent
gap penalties.
[Figure: basic topology of a profile HMM, with matching states, insertion states and deletion states]
The basic topology of a profile HMM is shown above. Each position, or module, in
the model has three states. A state shown as a rectangular box is a match state
that models the distribution of letters in the corresponding column of an
alignment. A state shown by a diamond-shaped box models insertions of random
letters between two alignment positions, and a state shown by a circle models a
deletion, corresponding to a gap in an alignment. States of neighboring positions
are connected, as shown by lines. For each of these lines there is an associated
"transition probability", which is the probability of going from one state to the
other.
The match state represents a consensus amino acid for this position in the protein
family. The delete state is a non-emitting state, and represents skipping this
consensus position in the multiple alignment. Finally, the insert state models the
insertion of any number of residues after this consensus position.
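The way match-state emission distributions can be read off a multiple alignment is sketched below in Python. The sketch makes several assumptions that the report does not state: every alignment column is treated as a match position, gap characters ("-") are simply skipped, a pseudocount of 1 is added to every amino acid so that unseen residues keep a non-zero probability, and the four aligned sequences are invented for the example. Insert and delete states and the transition probabilities are left out.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def match_emissions(alignment, pseudocount=1.0):
    """Per-column emission distributions for the match states."""
    columns = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gaps.
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        columns.append(
            {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        )
    return columns


# Hypothetical aligned family members (all rows have the same length).
alignment = ["ACCY", "ACDY", "A-CY", "TCCY"]

emissions = match_emissions(alignment)
for position, dist in enumerate(emissions, start=1):
    consensus = max(dist, key=dist.get)
    print(position, consensus, round(dist[consensus], 3))
```

The most probable residue in each column plays the role of the consensus amino acid that the match state represents.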
A repository of protein profile HMMs can be found in the PFAM Database
(http://pfam.wustl.edu).
By building a profile from a family of proteins (or DNA), a profile HMM can be made
for searching a database for other members of the family. As we have seen before in
the section on HMM problems, profile HMMs can also be used for the following:
Scoring a sequence
We calculate the probability of a sequence given a profile by simply
multiplying emission and transition probabilities along the path.
Classifying sequences in a database
Given a HMM for a protein family and some unknown sequences, we are trying to
find a path through the model where the new sequence fits in, or, in other words,
we are trying to "align" the sequence to the model. Alignment to the model is an
assignment of states to each residue in the sequence. There are many such
alignments, and the Viterbi algorithm is used to find the most probable one and to
give the probability of the sequence for that alignment.
Creating Multiple sequence alignment
HMMs can be used to automatically create a multiple alignment from a group of
unaligned sequences. By taking a close look at the alignment, we can see the
history of evolution. One great advantage of HMMs is that they can be estimated
from sequences, without having to align the sequences first. The sequences used to
estimate or train the model are called the training sequences, and any reserved
sequences used to evaluate the model are called the test sequences. The model
estimation is done with the forward-backward algorithm, also known as the
Baum-Welch algorithm. It is an iterative algorithm that maximizes the likelihood of
the training sequences.
3.3. Prediction of protein secondary structure using HMMs
Prediction of secondary structures is needed for the prediction of protein function.
As an alternative method to direct X-ray analysis, a HMM is used to
Analyze the amino-acid sequences of proteins
Learn secondary structures such as helix, sheet and turn
Predict the secondary structures of sequences
The method is to train four HMMs of secondary structure - helix, sheet, turn
and other - on training sequences. The Baum-Welch method is used to train the
HMMs. So, the HMM of helix is able to produce helix-like sequences with high
probabilities. Now, these HMMs can be used to predict the secondary structure of
the test sequence. The forward-backward algorithm is used to compute the
probabilities of these HMMs outputting the test sequence. The sequence is assigned
the secondary structure whose HMM showed the highest probability of outputting
the sequence.
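The "one HMM per structure class, pick the most likely" scheme can be sketched in Python as follows. Everything concrete in the sketch is hypothetical: only two of the four classes are shown, the alphabet is reduced to "h" (hydrophobic) and "p" (polar) residues, and the parameters are hand-picked stand-ins for models that would in practice be trained with Baum-Welch. Only the forward pass is needed to obtain the probability of the sequence under each model.

```python
def likelihood(model, sequence):
    """Forward algorithm: P(sequence | model), summed over all state paths."""
    pi, A, B = model["pi"], model["A"], model["B"]
    n = len(pi)
    alpha = [pi[i] * B[i][sequence[0]] for i in range(n)]
    for symbol in sequence[1:]:
        alpha = [B[j][symbol] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


# Hand-picked two-state models over a reduced alphabet (all values assumed).
MODELS = {
    "helix": {
        "pi": [0.5, 0.5],
        "A": [[0.8, 0.2], [0.2, 0.8]],
        "B": [{"h": 0.9, "p": 0.1}, {"h": 0.6, "p": 0.4}],
    },
    "sheet": {  # favors strictly alternating residues
        "pi": [0.5, 0.5],
        "A": [[0.1, 0.9], [0.9, 0.1]],
        "B": [{"h": 0.9, "p": 0.1}, {"h": 0.1, "p": 0.9}],
    },
}


def predict(sequence):
    """Name of the model that gives the sequence the highest probability."""
    return max(MODELS, key=lambda name: likelihood(MODELS[name], sequence))


print(predict("hhhhhh"))  # helix
print(predict("hphphp"))  # sheet
```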
4. HMM implementation
These are two publicly available HMM implementation software packages.
HMMER - http://hmmer.wustl.edu/
SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html
5. Advantages of HMMs
HMMs can accommodate variable-length sequences.
Because most biological data has variable-length properties, machine learning
techniques which require a fixed-length input, such as neural networks or support
vector machines, are less successful in biological sequence analysis.
Allows position-dependent gap penalties.
HMMs treat insertions and deletions in a statistical manner that is dependent on
position.
6. Limitations of HMMs
Linear Model
So, they are unable to capture higher order correlations among amino-acids.
Markov Chain assumption of independent events
Probabilities of states are supposed to be independent, which is not true of biology.
E.g. P(y) must be independent of P(x), and vice versa.
Standard Machine Learning Problems
In the training problem, we need to watch out for local maxima, and so the model may
not converge to a truly optimal parameter set for a given training set. Secondly,
since the model is only as good as the training set, this may lead to over-fitting.
7. Open areas for research in HMMs in biology
Integration of structural information into profile HMMs.
Despite the almost obvious application of using structural information on a
member of a protein family, when one exists, to better the parameterization of the
HMM, this has been extremely hard to achieve in practice.
Model architecture
The architectures of HMMs have largely been chosen to be the simplest
architectures that can fit the observed data. Is this the best architecture to use?
Can one use protein structure knowledge to make better architecture decisions, or,
in limited regions, to learn the architecture directly from the data? Will these
implied architectures have implications for our structural understanding?
Biological mechanism
In gene prediction, the HMMs may be getting close to replicating the same sort of
accuracy as the biological machine (the HMMs have the additional task of finding
the gene in the genomic DNA context, which is not handled by the biological
machine that processes the RNA). What constraints does our statistical model
place on the biological mechanism? In particular, can we consider a biological
mechanism that could use the same information as the HMM?