Hmm Report

7/26/2019 Hmm Report

1/27

CS 594 : Introduction to Computational Biology

Applications of Hidden Markov Models (HMMs) to

Computational Biology problems

-A Report

Group 4

Shalini Venkataraman

Vidya Gunaseelan

0


2/27

Thursday !pr "5

This report e#amines the role o$ a po%er$ul statistical model called Hidden

Markov Models&'(() in the area o$ computational *iology+ ,e %ill start %ith an

o-er-ie% o$ '((s and some concepts in *iology+ .e#t %e %ill discuss the use o$

'((s $or *iological se/uences and nally conclude %ith a discussion on the

ad-antages and limitations o$ '((s and possi*le $uture %ork+

1+ Biological Background+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

1+1+ 2.! 3 2eo#yri*onucleic !cid++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

1+"+ .!+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"

1++ 6roteins+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1+4+ The Central 2ogma 7$ (olecular Biology+++++++++++++++++++++++++++++++++++++++++++++++++++++

1+5+ 6rotein structure+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++4

1+8+ (ultiple Se/uence !lignment++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++5

"+ 'idden (arko- (odel &'(() !rchitecture++++++++++++++++++++++++++++++++++++++++++++++++++++++++8

"+1+ (arko- Chains++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++8

"+"+ 'idden (arko- (odels++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

"++ !n e#ample o$ a '(( $or 6rotein Se/uences+++++++++++++++++++++++++++++++++++++++++++++++9

"+4+ Three 6ro*lems 7$ 'idden (arko- (odels++++++++++++++++++++++++++++++++++++++++++++++++10

+ !pplications o$ '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

+1+ Gene nding and prediction+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

+"+ 6rotein3 6role '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++15

++ 6rediction o$ protein secondary structure using '((s++++++++++++++++++++++++++++1

4+ '(( implementation++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

5+ !d-antages o$ '((s++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1

1


3/27

8+ ;imitations o$ '((s+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1ecti-e in sol-ing this pro*lem+

8


8/27

Sunny

ain

Cloudy

#. Hidden Markov Model (HMM) Arcitecture

#.1. Markov Cains

;et the three states o$ %eather *e Sunny Cloudy and ainy+ ,e cannot e#pect

these three %eather states to $ollo% each other deterministically *ut %e might still

hope to model the system that generates a %eather pattern+ 7ne %ay to do this is

to assume that the state o$ the model depends only upon the pre-ious states o$ the

model+ This is called the Markov assumptionand simplies pro*lems greatly+

,hen considering the %eather the (arko- assumption presumes that todays

%eather can al%ays *e predicted solely gi-en kno%ledge o$ the %eather o$ the past

$e% days+

! (arko- process is a process %hich

mo-es $rom state to state depending &only)

on the pre-ious n states+ The process is

called an order n model %here n is the

num*er o$ states a>ecting the choice o$

ne#t state+ The simplest (arko- process is

a rst order process %here the choice o$

state is made purely on the *asis o$ the

pre-ious state+ This gure sho%s all

possi*le rst order transitions *et%een the

states o$ the %eather e#ample+

The state transition matri# *elo% sho%s possi*le transition pro*a*ilities $or the

%eather e#ample?


9/27

that is i$ it %as sunny yesterday there is a pro*a*ility o$ 0+5 that it %ill *e sunny

today and 0+"5 that it %ill *e cloudy or rainy+

To initiali@e such a system %e need to state %hat the %eather %as &or pro*a*ly

%as) on the day a$ter creation? %e dene this in a -ector o$ initial pro*a*ilities

called the -ector+

So %e kno% it %as sunny on day 1+

,e ha-e no% dened a /rst order Markov processconsisting o$ :

states: Three states 3 sunny cloudy rainy+

vector: 2ening the pro*a*ility o$ the system *eing in each o$ the states

at time 0+

state transition matri": The pro*a*ility o$ the %eather gi-en the pre-ious

days %eather+

!ny system that can *e descri*ed in this manner is a (arko- process+

#.#. Hidden Markov Models

In some cases the patterns that %e %ish to nd are not descri*ed su>iciently *y a

(arko- process+ eturning to the %eather e#ample a hermit $or instance may not


10/27

ha-e access to direct %eather o*ser-ations *ut does ha-e a piece o$ sea%eed+

=olklore tells us that the state o$ the sea%eed is pro*a*ilistically related to the

state o$ the %eather 3 the %eather and sea%eed states are closely linked+ In this

case %e ha-e t%o sets o$ states

observablestates &the state o$ the sea%eed) and

iddenstates &the state o$ the %eather)+

,e %ish to de-ise an algorithm $or the hermit to $orecast %eather $rom the

sea%eed and the (arko- assumption %ithout actually e-er seeing the %eather+ The

diagram *elo% sho%s the hidden and o*ser-a*le states in the %eather e#ample+ It

is assumed that the hidden states &the true %eather) are modeled *y a simple rst

order (arko- process and so they are all connected to each other+

The connections *et%een the hidden

states and the o*ser-a*le states

represent the pro*a*ility o$ generating

a particular o*ser-ed state gi-en that

the (arko- process is in a particular

hidden state+ It should thus *e clear

that all pro*a*ilities Fentering an

o*ser-a*le state %ill sum to 1 since in

the a*o-e case it %ould *e the sum o$

Pr&7*sSun) Pr&7*sCloud) and

Pr&7*sain)+

9


11/27

In addition to the pro*a*ilities dening the (arko- process %e there$ore ha-e

another matri# termed the output matri" %hich contains the pro*a*ilities o$ the

o*ser-a*le states gi-en a particular hidden state+ =or the %eather e#ample the

output matri# might *e?

So this is a model containing three sets o$ pro*a*ilities in addition to the t%o sets

o$ states

vector: contains the pro*a*ility o$ the hidden model *eing in a particular

hidden state at time tH 1+

state transition matri": holding the pro*a*ility o$ a hidden state gi-en the

pre-ious hidden state+

output matri" : containing the pro*a*ility o$ o*ser-ing a particular

o*ser-a*le state gi-en that the hidden model is in a particular hidden state+

Thus a hidden (arko- model is a standard (arko- process augmented *y a set o$

o*ser-a*le states and some pro*a*ilistic relations *et%een them and the hidden

states+

10


12/27

#.%. An e"ample of a HMM for &rotein -e,uences

This is a possi*le hidden (arko- model $or the protein !CC+ The protein is

represented as a se/uence o$ pro*a*ilities+ The num*ers in the *o#es sho% the

pro*a*ility that an amino acid occurs in a particular state and the num*ers ne#t to

the directed arcs sho% pro*a*ilities %hich connect the states+ The pro*a*ility o$

!CC is sho%n as a highlighted path through the model+ There are three kinds o$

states represented *y three di>erent shapes+ The s/uares are called matc states

and the amino acids emitted $rom them $orm the conser-ed primary structure o$ a

protein+ These amino acids are the same as those in the common ancestor or i$

not are the result o$ su*stitutions+ The diamond shapes are insert statesand emit

amino acids that result $rom insertions+ The circles are special silent states kno%n

as delete statesand model deletions+ These type o$ '((s are called 6rotein

6role3'((s and %ill *e co-ered in more depth in the later sections+

-coring a -e,uence 0it an HMM

!ny se/uence can *e represented *y a path through the model+ The pro*a*ility o$

any se/uence gi-en the model is computed *y multiplying the emission and

11

Transition

Prob.

Output Prob.


13/27

transitionpro*a*ilities along the path+ ! path through the model represented *y

!CC is highlighted+ =or e#ample the pro*a*ility o$ ! *eing emitted in position 1

is 0+ and the pro*a*ility o$ C *eing emitted in position " is 0+8+ The pro*a*ility o$

!CC along this path is

.4*.3*.46*.6*.97*.5*.015*.73*.01*1 = 1.76x10-6.

#.'. ree &roblems *f Hidden Markov Models

1) -coring &roblem

,e %ant to nd the pro*a*ility o$ an o*ser-ed se/uence gi-en an '((+

It can *e seen that one method o$ calculating the pro*a*ility o$ the o*ser-ed

se/uence %ould *e to nd each possi*le se/uence o$ the hidden states and sum

these pro*a*ilities+ ,e use the =or%ard !lgorithm $or this+

Consider the '(( sho%n a*o-e+ In this gure se-eral paths e#ist $or the protein

se/uence !CC+

The or0ard algoritm employs a matri# sho%n *elo%+ The columns o$ the

matri# are inde#ed *y the states in the model and the ro%s are inde#ed *y the

se/uence+ The elements o$ the matri# are initiali@ed to @ero and then computed

%ith these steps:

1"

M1

M2M3

I0

I1


14/27

1+ The pro*a*ility that the amino acid A%as generated *y state I0 is computed

and entered as the rst element o$ the matri#+ This is +4J+ H +1"

"+ The pro*a*ilities that Cis emitted in state (1 &multiplied *y the pro*a*ility o$

the most likely transition to state (1$rom state I0) and in state I1 &multiplied

*y the most likely transition to state I1 $rom state I0) are entered into the

matri# element inde#ed *y C and I1K(1+

+ The sum o$ the t%o pro*a*ilities sum&I1 (1) is calculated+

4+ ! pointer is set $rom the %inner *ack to state I0+

5+ Steps "34 are repeated until the matri# is lled+

The pro*a*ility o$ the se/uence is $ound *y summing the pro*a*ilities in the last

column+

Matrix for the Forward algorithm

#) Alignment &roblem

,e o$ten %ish to take a particular '(( and determine $rom an o*ser-ation

se/uence the most likely se/uence o$ underlying hidden states that might ha-e

generated it+ This is the alignment pro*lem and the 2iterbi Algoritmis used to

sol-e this pro*lem+

1


15/27

The Viter*i algorithm is similar to the $or%ard algorithm+ 'o%e-er in step

ma#imum rather than a sum is calculated+ The most likely path through the model

can no% *e $ound *y $ollo%ing the *ack3pointers+

Matrix for the Viterbi algorithm

7nce the most pro*a*le path through the model is kno%n the pro*a*ility o$ a

se/uence gi-en the model can *e computed *y multiplying all pro*a*ilities along

the path+

%) raining &roblem

!nother tricky pro*lem is ho% to create an '(( in the rst place gi-en a

particular set o$ related training se/uences+ It is necessary to estimate the amino

acid emission distri*utions in each state and all state3to3state transition

pro*a*ilities $rom a set o$ related training se/uences+ This is done *y using the

Baum3,elch !lgorithm or the =or%ard Back%ard !lgorithm+

The algorithm proceeds *y making an initial guess o$ the parameters &%hich may

%ell *e entirely %rong) and then rening it *y assessing its %orth and attempting

14


16/27

to reduce the errors it pro-okes %hen tted to the gi-en data+ In this sense it is

per$orming a $orm o$ gradient descent looking $or a minimum o$ an error measure+

15


17/27

%. Applications of HMM3s

In this section %e %ill del-e into greater depth at specic pro*lems in the area o$

computational *iology and e#amine the role o$ '((s+

%.1. 4ene /nding and prediction

,e introduce here the gene3prediction '((s that can *e used to predict the

structure o$ the gene+ 7ur o*Eecti-e is to nd the coding and non3coding regions o$

an unla*eled string o$ 2.! nucleotides+

The moti-ation *ehind this is to

assist in the annotation o$ genomic data produced *y genome se/uencing

methods

gain insight into the mechanisms in-ol-ed in transcription splicing and

other processes

18


18/27

!s sho%n in the diagram a*o-e a string o$ 2.! nucleotides containing a gene %ill

ha-e separate regions

Introns L non3coding regions %ithin a gene

M#ons L coding regions

These regions are separated *y $unctional sites

Start and stop codons

Splice sites L acceptors and donors

In the process o$ transcription only the e#ons are le$t to $orm the protein se/uence

as depicted *elo%+

1


19/27

(any pro*lems in *iological se/uence analysis ha-e a grammatical structure

+ '((s are -ery use$ul in modeling grammar+ The input to such a '(( is the

genomic 2.! se/uence and the output in the simplest case is a parse tree o$

e#ons and introns on the 2.! se/uence+

Sho%n *elo% is a simple model $or unspliced genes that recogni@es the start

codon stop codon &only one o$ the three possi*le stop codons are sho%n) and the

codingKnon3coding regions+ This model has *een trained %ith a test set o$ gene

data+

'a-ing such a model ho% can %e predict genes in a se/uence o$ anonymous

2.! ,e simply use the Viter*i algorithm to nd the most pro*a*le path through the

model

1


20/2719


21/27

%.#. &rotein! &ro/le HMMs

!s %e ha-e seen earlier protein structural similarities make it possi*le to create a

statistical model o$ a protein $amily %hich is called a pro/le+ The idea is gi-en a

single amino acid target se/uence o$ unkno%n structure %e %ant to in$er the

structure o$ the resulting protein+

The prole '(( is *uilt *y analy@ing the distri*ution o$ amino3acids in a training

set o$ related proteins+ This '(( in a natural %ay can model positional dependant

gap penalties+

The *asic topology o$ a prole '(( is sho%n a*o-e+ Mach position or module in

the model has three states+ ! state sho%n as a rectangular *o# is a matcstate

that models the distri*ution o$ letters in the corresponding column o$ an

alignment+ ! state sho%n *y a diamond3shaped *o# models insertions o$ random

letters *et%een t%o alignment positions and a state sho%n *y a circle models a

deletion corresponding to a gap in an alignment+ States o$ neigh*oring positions

are connected as sho%n *y lines+ =or each o$ these lines there is an associated

Ftransition pro*a*ility %hich is the pro*a*ility o$ going $rom one state to the

other+

"0

Matcing

states 5nsertionstatesDeletionstates


22/27

The match state represents a consensus amino acid $or this position in the protein

$amily+ The delete state is a non3emitting state and represents skipping this

consensus position in the multiple alignment+ =inally the insert state models the

insertion o$ any num*er o$ residues a$ter this consensus position+

! repository o$ protein prole '((s can *e $ound in the 6=!( 2ata*ase

&ttp677pfam.0ustl.edu).

Building proles $rom a $amily o$ proteins&or 2.!) a prole '(( can *e made $or

searching a data*ase $or other mem*ers o$ the $amily+ !s %e ha-e seen *e$ore in

the section on '(( pro*lems prole '((s can also *e used $or the $ollo%ing

-coring a se,uence

,e are calculating the pro*a*ility o$ a se/uence gi-en a prole *y simply

multiplying emmision and transition pro*a*ilities along the path+

Classifying se,uences in a database

Gi-en a '(( $or a protein $amily and some unkno%n se/uences %e are trying to

nd a path through the model %here the ne% se/uence ts in or %e are tying to

Nalign the se/uence to the model+ !lignment to the model is an assignment o$

states to each residue in the se/uence+ There are many such alignments and the

Vitter*is algorithm is used to gi-e the pro*a*ility o$ the se/uence $or that

alignment+

"1
http://pfam.wustl.edu/http://pfam.wustl.edu/


23/27

Creating Multiple se,uence alignment

'((s can *e used to automatically create a multiple alignment $rom a group o$

unaligned se/uences+ By taking a close look at the alignment %e can see the

history o$ e-olution+ 7ne great ad-antage o$ '((s is that they can *e estimated

$rom se/uences %ithout ha-ing to align the se/uences rst+ The se/uences used to

estimate or trainthe model are called the training se/uences and any reser-ed

se/uences used to e-aluate the model are called the test se/uences+ The model

estimation is done %ith the $or%ard3*ack%ard algorithm also kno%n as the Baum3

,elch algorithm+ It is an iterati-e algorithm that ma#imi@es the likelihood o$ the

training se/uences+

%.%. &rediction of protein secondary structure using HMM3s

6rediction o$ secondary structures is need $or the prediction o$ protein $unction+ !s

an alternati-e method to direct O3ray analysis a '(( is used to

!naly@e the amino3acid se/uences o$ proteins

;earn secondary structures such as heli# sheet and turn

6redict the secondary structures o$ se/uences

The method is to train the $our '((s o$ secondary structure L heli# sheet turn

and other L *y training se/uences+ The Baum3,elch method is used to train the

'((s+ So the '(( o$ heli# is a*le to produce heli#3like se/uences %ith high

pro*a*ilities+ .o% these '((s can *e used to predict the secondary structure o$

the test se/uence+ The $or%ard3*ack%ard algorithm is used to compute the

pro*a*ilities o$ these '((s outputting the test se/uence+ The se/uence has the

secondary structure %hose '(( sho%ed the highest pro*a*ility to output the

se/uence+

""


24/27

'. HMM implementation

These are the t%o pu*licly a-aila*le '(( implementation so$t%are+

'((M 3 http:KKhmmer+%ustl+eduK

S!( system 3 http:KK%%%+cse+ucsc+eduKresearchKcomp*ioKsam+html

+. Advantages of HMMs

'((s can accommodate -aria*le3length se/uence+

Because most *iological data has -aria*le3length properties machine learning

techni/ues %hich re/uire a #ed3length input such as neural net%orks or support

-ector machines are less success$ul in *iological se/uence analysis

!llo%s position dependant gap penalties+

'((s treat insertions and deletions is a statistical manner that is dependant on

position+

. 8imitations of HMMs

;inear (odel

So they are una*le to capture higher order correlations among amino3acids+

(arko- Chain assumption o$ independent e-ents

6ro*a*ilities o$ states are supposed to *e independent %hich is not true o$ *iology

Mg

6&y) must *e independent o$ 6) and -ice -ersa

"

P(x)

.P(y)
http://hmmer.wustl.edu/http://www.cse.ucsc.edu/research/compbio/sam.htmlhttp://hmmer.wustl.edu/http://www.cse.ucsc.edu/research/compbio/sam.html


25/27

Standard (achine ;earning 6ro*lems

In the training pro*lem %e need to %atch out $or local ma#ima and so model may

not con-erge to a truly optimal parameter set $or a gi-en training set+ Secondly

since the model is only as good as your training set this may lead to o-er3tting+

9. *pen areas for researc in HMMs in biology

Integration o$ structural in$ormation into prole '((s+

2espite the almost o*-ious application o$ using structural in$ormation on a

mem*er protein $amily %hen one e#ists to *etter the parameteri@ation o$ the

'(( this has *een e#tremely hard to achie-e in practice+

(odel architecture

The architectures o$ '((s ha-e largely *een chosen to *e the simplest

architectures that can t the o*ser-ed data+ Is this the *est architecture to use

Can one use protein structure kno%ledge to make *etter architecture decisions or

in limited regions to learn the architecture directly $rom the data ,ill these

implied architectures ha-e implications $or our structural understanding

Biological mechanism

In gene prediction the '((s may *e getting close to replicating the same sort o$

accuracy as the *iological machine &the '((s ha-e the additional task o$ nding

the gene in the genomic 2.! conte#t %hich is not handled *y the *iological

machine that processes the .!)+ ,hat constraints does our statistical model

"4


26/27

place on the *iological mechanismP in particular can %e consider a *iological

mechanism that could use the same in$ormation as the '((

:. $eferences

1+ ;+ + a*iner and B+ '+ QuangAn Introduction to Hidden Markov Models IMMM

!SS6 (aga@ine Qanuary 19


27/27

6rocedings o$ the =ourth International Con$erence on Intelligent Systems $or

(olecular Biology &(enlo 6ark C!) !!!I 6ress 1998+

9+ + 'ughey and !+ Rrogh+ 'idden (arko- models $or se/uence analysis:

e#tension and analysis o$ the *asic method+ "omputer Applications in the

#iosciences 1":95310+1998

http:KK%%%+cse+ucsc+eduKresearchKcomp*ioKhtml$ormatpapersKhughkrogh98Kca

*ios+html

10+'endersonQ+ Sal@*ergS+ and =asmanR+Finding genes in human +(A with a

hidden Markov model+ Qournal o$ Computational Biology 199 4 1"33141+

11+An Introduction to Hidden Markov Models for #iological 'e)uences *y !+

Rrogh+ In S+ ;+ Sal@*erg et al+ eds+ Computational (ethods in (olecular

Biology 4538+ Mlse-ier 199

Documents

Hmm Report