Markov models and applications Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 15th, 2013

Page 1: Markov models and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Oct 15 th, 2013

Markov models and applications

Sushmita Roy
BMI/CS 576

www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu

Oct 15th, 2013

Key concepts

• Markov chains
  – Computing the probability of a sequence
  – Estimating parameters of a Markov model

• Hidden Markov models
  – States
  – Emission and transition probabilities
  – Parameter estimation
  – Forward and backward algorithm
  – Viterbi algorithm

Why Markov models?

• So far we have assumed in our models that there is no dependency between consecutive base pair locations
• This assumption is rarely true in practice
• Markov models allow us to model the dependencies inherent in sequential data such as DNA or protein sequences

Applications of Markov models

• Genome annotation
  – Given a genome sequence, find functional units of the genome
    • Genes, CpG islands, promoters, ...
• Sequence classification
  – A hidden Markov model (profile HMM) can represent a family of proteins
• Sequence alignment

Markov models

• Provide a generative model for sequence data
• A Markov chain is the simplest type of Markov model
• Described by
  – A collection of states, each state representing a symbol observed in the sequence
  – At each state a symbol is emitted
  – At each state there is some probability of transitioning to another state

Markov chain model notation

• Let $X_t$ denote the state at time $t$

• $a_{ij} = P(X_t = j \mid X_{t-1} = i)$ denotes the transition probability from $i$ to $j$

A Markov chain model for DNA sequence

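[Figure: state diagram with a begin state and one state each for A, C, G and T. Arrows between states are transitions, each labeled with a transition probability; the labels visible in the source are .34, .16, .38 and .12.]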

Markov chain models

• Can also have an end state; this allows the model to represent
  – a distribution over sequences of different lengths
  – preferences for ending sequences with certain symbols

[Figure: state diagram with begin and end states and states A, C, G, T]

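Computing the probability of a sequence from a first-order Markov model

• Let $X$ be a sequence of random variables $X_1 \ldots X_L$ representing a biological sequence

• From the chain rule of probability

  $P(X) = P(X_L \mid X_{L-1}, \ldots, X_1)\, P(X_{L-1} \mid X_{L-2}, \ldots, X_1) \cdots P(X_2 \mid X_1)\, P(X_1)$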

Computing the probability of a sequence from a first-order Markov model

• Key property of a (1st order) Markov chain: the probability of each $X_t$ depends only on the value of $X_{t-1}$

  $P(X) = P(X_L \mid X_{L-1})\, P(X_{L-1} \mid X_{L-2}) \cdots P(X_2 \mid X_1)\, P(X_1) = P(X_1) \prod_{t=2}^{L} P(X_t \mid X_{t-1})$

• $P(X_1)$ can be written as an initial probability or as a transition probability from the "begin" state
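The factorization above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `sequence_probability` and the numbers in `initial` and `transition` are hypothetical, chosen only so that every row sums to 1.

```python
import math

# Hypothetical parameters for illustration only (not taken from the slides).
initial = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # P(X_1)
transition = {  # transition[i][j] = a_ij = P(X_t = j | X_{t-1} = i)
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def sequence_probability(seq, initial, transition):
    """P(X) = P(X_1) * prod_{t=2..L} P(X_t | X_{t-1})."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# 0.25 * a_CG * a_GG * a_GT = 0.25 * 0.2 * 0.4 * 0.2
p_cggt = sequence_probability("CGGT", initial, transition)
```

For long sequences the product of many small numbers underflows, so in practice one usually sums log probabilities instead of multiplying.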

Example of computing the probability from a Markov chain

[Figure: Markov chain with begin and end states and states A, C, G, T]

What is P(CGGT)?
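Assuming the chain in the figure has both a begin and an end state, the quantity asked for expands term by term as follows. The numeric value depends on the arrow labels in the figure, which are not reproduced here.

```latex
P(\mathrm{CGGT}) = P(C \mid \text{begin})\; P(G \mid C)\; P(G \mid G)\; P(T \mid G)\; P(\text{end} \mid T)
```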

Learning Markov model parameters

• Model parameters
  – Transition probabilities
• Estimated via maximum likelihood

Estimating model parameters

• Assume we have some training data that we know was generated from a first order Markov model
• Need to estimate transition probabilities
  – $P(X_t \mid X_{t-1})$

Estimating model parameters

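• Zeroth order: $P(X = a) = \frac{n_a}{\sum_i n_i}$, where $n_a$ is the number of times character $a$ occurs

• First order: $P(X_t = x \mid X_{t-1} = G) = \frac{n_{Gx}}{\sum_y n_{Gy}}$, where $n_{Gx}$ is the number of times $x$ follows G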

Estimating parameters example

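• Assume the following training data (three sequences of length 10; transitions are counted within each sequence)
  ACCGCGCTTA
  GCTTAGTGAC
  TAGCCGTTAC

• Fill in the zeroth and first order probability values

Zeroth order:

P(A) = 6/30
P(T) = 8/30
P(G) = 7/30
P(C) = 9/30

First order (row: previous base $X_{t-1}$; column: next base $X_t$; rows T and C are left to fill in):

       A     T     G     C
A     0/5   0/5   2/5   3/5
T
G     1/7   2/7   0/7   4/7
C

Do we really want to set the zero-count entries to 0?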
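The counting behind this example can be sketched as follows. One assumption is made explicit here: the 30 training characters are read as three sequences of length 10, which is the reading that reproduces the 0/5, 0/5, 2/5, 3/5 row for A. The function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumption: three training sequences of length 10 (transitions are not
# counted across sequence boundaries).
training = ["ACCGCGCTTA", "GCTTAGTGAC", "TAGCCGTTAC"]
alphabet = "ATGC"


def zeroth_order(seqs):
    """Maximum likelihood P(a) = n_a / (total number of characters)."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {a: Fraction(counts[a], total) for a in alphabet}


def first_order(seqs):
    """Maximum likelihood P(j | i) = n_ij / sum_k n_ik."""
    counts = defaultdict(Counter)
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1  # one more occurrence of cur following prev
    probs = {}
    for i in alphabet:
        row_total = sum(counts[i].values())
        probs[i] = {j: Fraction(counts[i][j], row_total) for j in alphabet}
    return probs


p0 = zeroth_order(training)
p1 = first_order(training)
```

`Fraction` keeps the estimates exact, although it prints them in lowest terms (6/30 appears as 1/5). Printing `p1["T"]` and `p1["C"]` gives the two rows the slide leaves as an exercise.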

Laplace estimates of parameters

• Instead of estimating parameters strictly from the data, we could start with some prior belief for each

• For example, we could use Laplace estimates

  $P(a) = \frac{n_a + 1}{\sum_i (n_i + 1)}$

• where $n_i$ represents the number of occurrences of character $i$, and the added 1 is a pseudocount

• Using Laplace estimates with the sequences
  GCCGCGCTTG
  GCTTGGTGGC
  TGGCCGTTGC
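As a rough sketch of the Laplace formula applied to these sequences (the function name and the `pseudocount` parameter are illustrative additions; the slide uses a fixed pseudocount of 1):

```python
from collections import Counter
from fractions import Fraction

sequences = ["GCCGCGCTTG", "GCTTGGTGGC", "TGGCCGTTGC"]
alphabet = "ACGT"


def laplace_zeroth_order(seqs, alphabet, pseudocount=1):
    """P(a) = (n_a + pseudocount) / sum_i (n_i + pseudocount)."""
    counts = Counter("".join(seqs))
    denom = sum(counts[a] + pseudocount for a in alphabet)
    return {a: Fraction(counts[a] + pseudocount, denom) for a in alphabet}


p = laplace_zeroth_order(sequences, alphabet)
```

A never appears in these sequences, yet its estimate is 1/34 rather than 0, which is the point of the pseudocount.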

Estimation for 1st order probabilities

• Using Laplace estimates for first order probabilities with the sequences
  GCCGCGCTTG
  GCTTGGTGGC
  TGGCCGTTGC

$P(A \mid G) = \frac{0 + 1}{12 + 4}$

$P(C \mid G) = \frac{7 + 1}{12 + 4}$

$P(G \mid G) = \frac{3 + 1}{12 + 4}$

$P(T \mid G) = \frac{2 + 1}{12 + 4}$

$P(A \mid C) = \frac{0 + 1}{7 + 4}$

$\vdots$
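The first order Laplace estimates can be checked with a short sketch. It assumes the data are three separate sequences of length 10, so that no transition is counted across a sequence boundary; that reading yields the 12 transitions out of G and 7 out of C used above. Names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumption: three sequences of length 10, counted separately.
sequences = ["GCCGCGCTTG", "GCTTGGTGGC", "TGGCCGTTGC"]
alphabet = "ACGT"


def laplace_first_order(seqs, alphabet, pseudocount=1):
    """P(j | i) = (n_ij + pseudocount) / sum_k (n_ik + pseudocount)."""
    counts = defaultdict(Counter)
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    probs = {}
    for i in alphabet:
        denom = sum(counts[i][j] + pseudocount for j in alphabet)
        probs[i] = {
            j: Fraction(counts[i][j] + pseudocount, denom) for j in alphabet
        }
    return probs


p = laplace_first_order(sequences, alphabet)
```

Because A never occurs, the row for A has no observed transitions and falls back to the uniform distribution (1/4 each) instead of being undefined.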

Using Markov chains to classify sequences as CpG islands or not

• CpG islands are isolated regions of the genome with high concentrations of CG dinucleotides
  – Range from 100 to 5000 bp in length

• For example, regions upstream of genes or gene promoters

• Given a short sequence of DNA, how can we classify it as a CpG island?

Build a Markov chain model for CpG islands

• Learn two Markov models
  – One from sequences that look like CpG islands (positive)
  – One from sequences that don't look like CpG islands (negative)
  – Parameters estimated from ~60,000 nucleotides

CpG (+) model (row: previous base; column: next base)

+     A     C     G     T
A    .18   .27   .43   .12
C    .17   .37   .27   .19
G    .16   .34   .38   .12
T    .08   .36   .38   .18

Not CpG (-) model (row: previous base; column: next base)

-     A     C     G     T
A    .30   .21   .28   .21
C    .32   .30   .08   .30
G    .25   .24   .30   .21
T    .18   .24   .29   .29

G is much more likely to follow A in the CpG positive model (.43 versus .28)

Using the Markov chains to classify a new sequence

• Let y denote a new sequence
• To use the Markov models to classify y we compute the log odds ratio

  $S(y) = \log \frac{P(y \mid \text{CpG model})}{P(y \mid \text{non-CpG model})}$

• The larger the value of S(y), the more likely y is a CpG island
  – Note: must also normalize by the length
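One way to unpack the score, assuming both models are first order chains and that the initial term is ignored (for example because both models use the same initial distribution), is as a sum of per-transition log ratios. The superscripts + and - on the transition probabilities are notation introduced here for the two models, and dividing by the length L is one reading of the normalization note.

```latex
S(y) = \log \frac{P(y \mid +)}{P(y \mid -)}
     = \sum_{t=2}^{L} \log \frac{a^{+}_{y_{t-1} y_t}}{a^{-}_{y_{t-1} y_t}},
\qquad
\text{length-normalized score} = \frac{S(y)}{L}
```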

Distribution of scores for CpG and non CpG examples

[Figure: score distributions, with the CpG examples labeled]

Applying the CpG Markov chain models

• Is CGCGA a CpG island?
  – Un-normalized score = $\log_2 \frac{.27 \times .34 \times .27 \times .16}{.08 \times .24 \times .08 \times .25} = 3.3684$

• Is CCTGG a CpG island?
  – Un-normalized score = 0.3746
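The two scores can be reproduced with a short sketch that uses the transition tables from the CpG and non-CpG models. Two choices here are assumptions rather than statements from the slides: the logarithm is base 2 and no initial-state term is included. They were chosen because they reproduce the quoted values. Dividing by the sequence length in `normalized_log_odds` is likewise just one reasonable reading of "normalize by the length".

```python
import math

BASES = "ACGT"


def make_table(rows):
    """Turn {prev: [p_A, p_C, p_G, p_T]} into table[prev][next]."""
    return {prev: dict(zip(BASES, probs)) for prev, probs in rows.items()}


# Transition probabilities copied from the two tables on the slide.
cpg = make_table({
    "A": [.18, .27, .43, .12],
    "C": [.17, .37, .27, .19],
    "G": [.16, .34, .38, .12],
    "T": [.08, .36, .38, .18],
})
background = make_table({
    "A": [.30, .21, .28, .21],
    "C": [.32, .30, .08, .30],
    "G": [.25, .24, .30, .21],
    "T": [.18, .24, .29, .29],
})


def log_odds(seq, plus, minus):
    """Un-normalized score: sum of log2 ratios over consecutive base pairs."""
    return sum(
        math.log2(plus[prev][cur] / minus[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )


def normalized_log_odds(seq, plus, minus):
    """Score divided by sequence length (an assumed normalization)."""
    return log_odds(seq, plus, minus) / len(seq)


score_cgcga = log_odds("CGCGA", cpg, background)
score_cctgg = log_odds("CCTGG", cpg, background)
```

Both scores are positive, but CGCGA scores far higher than CCTGG, mostly because the C to G transition is rare in the non-CpG model (.08).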

Extensions to Markov chains

• Hidden Markov models (Next lectures)

• Higher-order Markov models

• Inhomogeneous Markov models

Order of a Markov model

• Describes how much history we want to keep
• First order Markov models are most common: $P(X_t \mid X_{t-1})$

• Second order: $P(X_t \mid X_{t-1}, X_{t-2})$

• nth order: $P(X_t \mid X_{t-1}, \ldots, X_{t-n})$

Higher order Markov models

• An nth order Markov model over alphabet A is equivalent to a first order Markov model over the alphabet $A^n$
• $A^n$ corresponds to the alphabet of n-tuples
• For example
  – A 2nd order Markov model over A, T, G, C
  – And a 1st order Markov model over AA, AT, AG, AC, TA, TG, TC, TT, GA, GT, GC, GG, CA, CT, CC, CG
  – Are equivalent

• However, the higher the order, the more parameters we need to estimate
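The equivalence and the growth in parameters can be illustrated with a small sketch. All function names are hypothetical; the parameter count simply tallies one probability per (context, next symbol) pair.

```python
from itertools import product


def tuple_alphabet(alphabet, n):
    """All n-tuples over the alphabet: the states of the first order chain."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def as_tuple_states(seq, n):
    """Rewrite a sequence as overlapping n-tuples, e.g. ACGTA -> AC, CG, GT, TA."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def num_transition_probabilities(alphabet_size, order):
    """Number of P(next symbol | previous `order` symbols) entries."""
    return alphabet_size ** order * alphabet_size


states = tuple_alphabet("ATGC", 2)       # 16 two-letter states
path = as_tuple_states("ACGTA", 2)       # state path of the tuple chain
sizes = [num_transition_probabilities(4, n) for n in (1, 2, 3)]
```

In the tuple chain, consecutive states overlap by n-1 symbols, so only transitions of the form XY to YZ can have nonzero probability. For DNA the table grows from 16 entries (first order) to 64 and then 256, which is why higher orders need more training data.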

A first order Markov chain equivalent to a second order Markov chain over the alphabet {A, B}

[Figure: state diagram with states AA, AB, BA, BB]

Inhomogeneous Markov chains

• So far the Markov chain has the same transition probabilities for the entire sequence

• Sometimes it is useful to switch between Markov chains depending on where we are in the sequence

• For example, recall the genetic code that specifies which triplets of bases code for an amino acid
  – There are different levels of redundancy for each of the first, second and third codon positions
  – This could be modeled by three Markov chains

Inhomogeneous Markov chain

• Let 1, 2 and 3 denote the three Markov chains
  – So our transition probabilities look like
  – $a_{ij}^1$, $a_{ij}^2$, $a_{ij}^3$

• Let $x_1$ be in codon position 3

• Probability of $x_2, x_3, \ldots$ would be

  $a_{x_1 x_2}^1 \, a_{x_2 x_3}^2 \, a_{x_3 x_4}^3 \cdots$

• Exercise: write down the terms for $x_5$ and $x_6$
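A minimal sketch of how the chain could be selected by codon position follows. It rests on an assumed convention, namely that chain k is used for the transition into a base at codon position k. The three transition tables are hypothetical values built by `make_chain`, not parameters from the lecture.

```python
import math

ALPHABET = "ACGT"


def make_chain(self_prob):
    """Hypothetical table: `self_prob` for repeating the same base, with the
    remaining probability split evenly over the other three bases."""
    other = (1.0 - self_prob) / 3
    return {
        i: {j: (self_prob if i == j else other) for j in ALPHABET}
        for i in ALPHABET
    }


# Hypothetical chains 1, 2 and 3 (illustrative values only).
chains = {1: make_chain(0.25), 2: make_chain(0.40), 3: make_chain(0.10)}


def codon_position(t, first_position=3):
    """Codon position (1, 2 or 3) of x_t for 1-based t, given that of x_1."""
    return (first_position - 1 + (t - 1)) % 3 + 1


def conditional_probability(seq, chains, first_position=3):
    """P(x_2, ..., x_L | x_1), using chain k for the step into position k."""
    prob = 1.0
    for t in range(2, len(seq) + 1):
        k = codon_position(t, first_position)
        prob *= chains[k][seq[t - 2]][seq[t - 1]]
    return prob


positions = [codon_position(t) for t in range(1, 7)]
p = conditional_probability("ATGCGA", chains)
```

The list `positions` shows the cycle of codon positions for the first six bases when $x_1$ is in position 3, which is the bookkeeping the exercise asks for.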

Summary

• Markov models allow us to model dependencies in sequential data

• Markov models are described by
  – States, each state corresponding to a letter in our observed alphabet
  – Transition probabilities between states

• Parameter estimation can be done by counting the number of transitions between consecutive states (first order)
  – Laplace correction is often applied

• Often used for classifying sequences as CpG islands or not
• Extensions of Markov models
  – Order
  – Inhomogeneous Markov models