
Page 1: Markov Chains Lecture #5


Markov Chains Lecture #5

Background Readings: Durbin et al. Section 3.1, Polanski & Kimmel Section 2.8. Prepared by Shlomo Moran, based on Danny Geiger's and Nir Friedman's.

Page 2: Markov Chains Lecture #5

2

Dependencies along the genome

In previous classes we assumed that every letter in a sequence is sampled randomly from some distribution q(·) over the alphabet {A,C,T,G}.

This is too restrictive for true genomes.

1. There are special subsequences in the genome, like TATA within the regulatory area, upstream of a gene.

2. The pattern CG is less common than expected for random sampling.

We model such dependencies by Markov chains and hidden Markov models, which we define next.

Page 3: Markov Chains Lecture #5

3

Finite Markov Chain

An integer time stochastic process, consisting of a domain D of m>1 states {s1,…,sm} and

1. An m-dimensional initial distribution vector (p(s1),…,p(sm)).

2. An m×m transition probabilities matrix M = (msisj).

For example, D can be the letters {A, C, T, G}, p(A) the probability that A is the 1st letter in a sequence, and mAG the probability that G follows A in a sequence.

Page 4: Markov Chains Lecture #5

4

Markov Chain (cont.)

• For each integer n, a Markov Chain assigns a probability to every sequence (x1,…,xn) in D^n (i.e., xi ∈ D) as follows:

p(x1, x2, …, xn) = p(X1 = x1) ∏i=2..n p(Xi = xi | Xi-1 = xi-1) = p(x1) ∏i=2..n m_{xi-1 xi}

X1 → X2 → … → Xn-1 → Xn
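As an illustration (not part of the original slides), here is a short Python sketch of this computation; the initial distribution and transition values below are made up for the example.

```python
# Probability of a sequence under a first-order Markov chain:
# p(x1..xn) = p(x1) * prod_{i>=2} m[x_{i-1}][x_i].
# The numeric values are illustrative only.
init = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # p(s) for the first letter
trans = {                                             # m[s][t] = probability that t follows s
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def chain_probability(seq, init, trans):
    """Return p(seq) under the Markov chain (init, trans)."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(chain_probability("ACGT", init, trans))   # 0.25 * 0.2 * 0.1 * 0.2 = 0.001
```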

Page 5: Markov Chains Lecture #5

5

Markov Chain (cont.)

X1 → X2 → … → Xn-1 → Xn

Similarly, each Xi has a probability distribution over D, which is determined by the initial distribution (p1,…,pm) and the transition matrix M.

There is a rich theory which studies the properties of such “Markov sequences” (X1,…, Xi ,…). A bit of this theory is presented next.

Page 6: Markov Chains Lecture #5

6

Matrix Representation

The transition probabilities Matrix M = (mst):

        A      B      C      D
A      0.95   0      0.05   0
B      0.2    0.5    0      0.3
C      0      0.2    0      0.8
D      0      0      1      0

M is a stochastic Matrix: each row sums to one, ∑t mst = 1.

The initial distribution vector (u1,…,um) defines the distribution of X1: p(X1 = si) = ui.

After one move, the distribution is changed to X2 = X1M.

Page 7: Markov Chains Lecture #5

7

Matrix Representation

        A      B      C      D
A      0.95   0      0.05   0
B      0.2    0.5    0      0.3
C      0      0.2    0      0.8
D      0      0      1      0

The i-th distribution is Xi = X1M^(i-1).

Example: if X1=(0, 1, 0, 0) then X2=(0.2, 0.5, 0, 0.3)

And if X1=(0, 0, 0.5, 0.5) then X2=(0, 0.1, 0.5, 0.4).

Basic question: what can we say about the distributions Xi?
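The two examples above can be checked numerically (this check is not part of the original slides), with the states ordered (A, B, C, D):

```python
import numpy as np

M = np.array([[0.95, 0.0, 0.05, 0.0],     # transition matrix from the slide above
              [0.20, 0.5, 0.00, 0.3],
              [0.00, 0.2, 0.00, 0.8],
              [0.00, 0.0, 1.00, 0.0]])

print(np.array([0.0, 1.0, 0.0, 0.0]) @ M)      # X1 = (0,1,0,0)     -> (0.2, 0.5, 0, 0.3)
print(np.array([0.0, 0.0, 0.5, 0.5]) @ M)      # X1 = (0,0,0.5,0.5) -> (0, 0.1, 0.5, 0.4)

# The i-th distribution Xi = X1 * M^(i-1), e.g. for i = 10:
X1, i = np.array([0.0, 1.0, 0.0, 0.0]), 10
print(X1 @ np.linalg.matrix_power(M, i - 1))
```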

Page 8: Markov Chains Lecture #5

8

Representation of a Markov Chain as a Digraph

Each directed edge A→B is associated with the positive transition probability from A to B.

[Figure: digraph for the matrix above, with edges
A→A (0.95), A→C (0.05), B→A (0.2), B→B (0.5), B→D (0.3), C→B (0.2), C→D (0.8), D→C (1)]

        A      B      C      D
A      0.95   0      0.05   0
B      0.2    0.5    0      0.3
C      0      0.2    0      0.8
D      0      0      1      0

Page 9: Markov Chains Lecture #5

9

Properties of Markov Chain states

[Figure: example digraph over states A, B, C, D]

States of Markov chains are classified by the digraph representation (omitting the actual probability values)

A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph.

B is not recurrent – it is a transient state

Alternative definition: a state s is recurrent if it can be reached from every state that is reachable from s; otherwise it is transient.
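The alternative definition translates directly into code. The sketch below (not part of the original slides) classifies each state by checking reachability; the example graph at the end is a hypothetical one matching the description above (B transient; A, C, D recurrent).

```python
def reachable(graph, start):
    """Set of states reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def classify(graph):
    """state -> 'recurrent' or 'transient', using the alternative definition."""
    return {s: "recurrent" if all(s in reachable(graph, t) for t in reachable(graph, s))
               else "transient"
            for s in graph}

graph = {"A": ["A"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}   # hypothetical example
print(classify(graph))   # {'A': 'recurrent', 'B': 'transient', 'C': 'recurrent', 'D': 'recurrent'}
```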

Page 10: Markov Chains Lecture #5

10

Another example of Recurrent and Transient States

[Figure: example digraph over states A, B, C, D]

A and B are transient states, C and D are recurrent states.

Once the process moves from B to D, it will never come back.

Page 11: Markov Chains Lecture #5

11

Irreducible Markov Chains

A Markov Chain is irreducible if the corresponding graph is strongly connected (and thus all its states are recurrent).

[Figures: two example digraphs, one over states A, B, C, D and one over states A, B, C, D, E]

Page 12: Markov Chains Lecture #5

12

Periodic States

A state s has a period k if k is the GCD of the lengths of all the cycles that pass via s. (in the shown graph the period of A is 2).

[Figure: example digraph over states A, B, C, D, E]

A Markov Chain is periodic if all the states in it have a period k >1. It is aperiodic otherwise.

Exercise: All the states in the same strongly connected component have the same period
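As a sketch (not part of the original slides), the period of a state can be computed from this definition using a standard reformulation: restrict to the strongly connected component of s, take BFS distances d(·) from s, and take the GCD of d(u) + 1 - d(v) over all edges u→v inside the component. The example graph is a hypothetical 4-cycle.

```python
from math import gcd
from collections import deque

def reachable(graph, start):
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def period(graph, s):
    """GCD of the lengths of all cycles through s (0 if no cycle passes through s)."""
    scc = {t for t in reachable(graph, s) if s in reachable(graph, t)}
    dist, queue = {s: 0}, deque([s])                  # BFS distances from s inside the SCC
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v in scc and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    g = 0
    for u in scc:
        for v in graph[u]:
            if v in scc:
                g = gcd(g, abs(dist[u] + 1 - dist[v]))
    return g

graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}   # a 4-cycle: every state has period 4
print(period(graph, "A"))   # 4
```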

Page 13: Markov Chains Lecture #5

13

Ergodic Markov Chains

[Figure: example digraph over states A, B, C, D]

A Markov chain is ergodic if:

1. The corresponding graph is strongly connected.

2. It is not periodic.

Ergodic Markov Chains are important since they guarantee that the corresponding Markovian process converges to a unique distribution, in which each state has a strictly positive probability.

Page 14: Markov Chains Lecture #5

14

Stationary Distributions for Markov Chains

Let M be a Markov Chain of m states, and let V = (v1,…,vm) be a probability distribution over the m states.

V = (v1,…,vm) is a stationary distribution for M if VM = V (i.e., one step of the process does not change the distribution).

V is a stationary distribution ⇔ V is a non-negative left (row) Eigenvector of M with Eigenvalue 1.
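As a small numerical illustration (not part of the original slides), a stationary distribution of the 4-state example chain from the earlier slides can be found as a left Eigenvector of M with Eigenvalue 1:

```python
import numpy as np

M = np.array([[0.95, 0.0, 0.05, 0.0],     # example transition matrix from the earlier slides
              [0.20, 0.5, 0.00, 0.3],
              [0.00, 0.2, 0.00, 0.8],
              [0.00, 0.0, 1.00, 0.0]])

# Left eigenvectors of M are right eigenvectors of M transposed.
eigvals, eigvecs = np.linalg.eig(M.T)
k = np.argmin(np.abs(eigvals - 1.0))       # pick the eigenvalue closest to 1
V = np.real(eigvecs[:, k])
V = V / V.sum()                            # normalize to a probability vector

print(V)         # stationary distribution
print(V @ M)     # equals V up to rounding: VM = V
```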

Page 15: Markov Chains Lecture #5

15

Stationary Distributions for a Markov Chain M

Exercise: A stochastic matrix always has a real left Eigenvector with Eigenvalue 1 (hint: show that a stochastic matrix has a right Eigenvector with Eigenvalue 1; note that the left Eigenvalues of a Matrix are the same as the right Eigenvalues).

It can be shown that the above Eigenvector V can be chosen to be non-negative. Hence each Markov Chain has a stationary distribution.

Page 16: Markov Chains Lecture #5

16

“Good” Markov chains

A Markov Chain is good if the distributions Xi, as i→∞:

(1) Converge to a unique distribution, independent of the initial distribution.

(2) In that unique distribution, each state has a positive probability.

The Fundamental Theorem of Finite Markov Chains: A Markov Chain is good ⇔ the corresponding graph is ergodic.

We will prove the ⇒ direction, by showing that non-ergodic Markov Chains are not good.

Page 17: Markov Chains Lecture #5

17

Examples of “Bad” Markov Chains

A Markov chain is “bad” if either:
1. It does not converge to a unique distribution (independent of the initial distribution), or
2. It does converge to a unique distribution, but some states in this distribution have zero probability.

Page 18: Markov Chains Lecture #5

18

Bad case 1: Mutual Unreachability

[Figure: digraph over states A, B, C, D in which A and C are mutually unreachable]

Consider two initial distributions:
a) p(X1 = A) = 1 (p(X1 = x) = 0 for x ≠ A);
b) p(X1 = C) = 1.

In case a), the sequence will stay at A forever. In case b), it will stay in {C, D} forever.

Fact 1: If G has two states which are unreachable from each other, then {Xi} cannot converge to a distribution which is independent of the initial distribution.

Page 19: Markov Chains Lecture #5

19

Bad case 2: Transient States

[Figure: example digraph over states A, B, C, D]

Once the process moves from B to D, it will never come back.

Page 20: Markov Chains Lecture #5

20

Bad case 2: Transient States

[Figure: example digraph over states A, B, C, D]

Fact 2: For each initial distribution, with probability 1 a transient state will be visited only a finite number of times.

Proof: Let A be a transient state, and let X be the set of states from which A is unreachable. It is enough to show that, starting from any state, with probability 1 a state in X is reached after a finite number of steps (Exercise: complete the proof)


Page 21: Markov Chains Lecture #5

21

Corollary: A good Markov Chain is irreducible.

Page 22: Markov Chains Lecture #5

22

Bad case 3: Periodic Markov Chains

[Figure: digraph over states A, B, C, D, E with period 2]

Recall: A Markov Chain is periodic if all the states in it have a period k > 1. The above chain has period 2.

In the above chain, consider the initial distribution p(B) = 1. Then states {B, C} are visited (with positive probability) only in odd steps, and states {A, D, E} are visited only in even steps.

Page 23: Markov Chains Lecture #5

23

Bad case 3: Periodic States

[Figure: the same periodic digraph over states A, B, C, D, E]

Fact 3: In a periodic Markov Chain (of period k > 1) there are initial distributions under which the states are visited in a periodic manner. Under such initial distributions, Xi does not converge as i→∞.

Corollary: A good Markov Chain is not periodic.

Page 24: Markov Chains Lecture #5

24

The Fundamental Theorem of Finite Markov Chains:

If a Markov Chain is ergodic, then:

1. It has a unique stationary distribution vector V > 0, which is a left (row) Eigenvector of the transition matrix.

2. For any initial distribution, the distributions Xi converge to V as i→∞.

We have proved that non-ergodic Markov Chains are not good. A proof of the opposite direction (ergodic implies good) is based on the Perron-Frobenius theory of nonnegative matrices, and is beyond the scope of this course.
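A quick numerical illustration of the theorem (not part of the original slides), using the ergodic 4-state example chain from the earlier slides: starting from two different initial distributions, Xi = X1·M^(i-1) approaches the same vector.

```python
import numpy as np

M = np.array([[0.95, 0.0, 0.05, 0.0],
              [0.20, 0.5, 0.00, 0.3],
              [0.00, 0.2, 0.00, 0.8],
              [0.00, 0.0, 1.00, 0.0]])

Mk = np.linalg.matrix_power(M, 200)                 # M^200
print(np.array([1.0, 0.0, 0.0, 0.0]) @ Mk)          # start at A
print(np.array([0.0, 0.0, 0.0, 1.0]) @ Mk)          # start at D: (approximately) the same vector
```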

Page 25: Markov Chains Lecture #5

25

Use of Markov Chains in Genome search:

Modeling CpG Islands

In human genomes the pair CG often transforms to (methyl-C) G which often transforms to TG.

Hence the pair CG appears less often than would be expected from the independent frequencies of C and G alone.

For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes.

These areas are called CpG islands (-C-phosphate-G-).

Page 26: Markov Chains Lecture #5

26

Example: CpG Island (Cont.)

We consider two questions:

Question 1: Given a short stretch of genomic data, does it come from a CpG island ?

Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?

We “solve” the first question by modeling strings with and without CpG islands as Markov Chains over the same states {A,C,G,T} but different transition probabilities:

Page 27: Markov Chains Lecture #5

27

Example: CpG Island (Cont.)

The “+” model: use the transition matrix M+ = (m+st), where m+st = (the probability that t follows s in a CpG island).

The “-” model: use the transition matrix M- = (m-st), where m-st = (the probability that t follows s in a non-CpG island).

Page 28: Markov Chains Lecture #5

28

Example: CpG Island (Cont.)

With this model, to solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “-” model. This is done using the definition of a Markov Chain, with parameters estimated from known data, and the log odds-ratio test.

Page 29: Markov Chains Lecture #5

29

Question 1: Using two Markov chains

M+ (For CpG islands): we need to specify p+(xi | xi-1), where “+” stands for CpG island. From Durbin et al we have:

Xi-1 \ Xi      A        C          G          T
A            0.18     0.27       0.43       0.12
C            0.17     p+(C|C)    0.274      p+(T|C)
G            0.16     p+(C|G)    p+(G|G)    p+(T|G)
T            0.08     p+(C|T)    p+(G|T)    p+(T|T)

(Recall: rows must add up to one; columns need not.)

Page 30: Markov Chains Lecture #5

30

Question 1: Using two Markov chains

M- (For non-CpG islands): …and for p-(xi | xi-1) (where “-” stands for non-CpG island) we have:

Xi-1 \ Xi      A        C          G          T
A            0.3      0.2        0.29       0.21
C            0.32     p-(C|C)    0.078      p-(T|C)
G            0.25     p-(C|G)    p-(G|G)    p-(T|G)
T            0.18     p-(C|T)    p-(G|T)    p-(T|T)

Page 31: Markov Chains Lecture #5

31

Discriminating between the two models

Given a string x = (x1…xL), we compute the ratio

RATIO = p(x | + model) / p(x | - model) = [ ∏i=1..L p+(xi | xi-1) ] / [ ∏i=1..L p-(xi | xi-1) ]

If RATIO > 1, a CpG island is more likely. Actually, the log of this ratio is computed (see the next slide).

Note: p+(x1 | x0) is defined for convenience as p+(x1), and p-(x1 | x0) is defined for convenience as p-(x1).

Page 32: Markov Chains Lecture #5

32

Log Odds-Ratio test

Taking the logarithm yields

log Q = log [ p(x1…xL | +) / p(x1…xL | -) ] = ∑i log [ p+(xi | xi-1) / p-(xi | xi-1) ]

If log Q > 0, then + is more likely (CpG island).
If log Q < 0, then - is more likely (non-CpG island).
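A short sketch of the test (not part of the original slides). The function below computes log Q exactly as in the formula above; the two transition tables are made-up placeholders, since the full “+” and “-” tables from Durbin et al are only partially reproduced on these slides.

```python
import math

def log_odds(x, plus, minus, plus_init, minus_init):
    """log Q = sum_i log( p+(x_i | x_{i-1}) / p-(x_i | x_{i-1}) ),
    with p(x1 | x0) taken to be the initial probability p(x1)."""
    score = math.log(plus_init[x[0]] / minus_init[x[0]])
    for a, b in zip(x, x[1:]):
        score += math.log(plus[a][b] / minus[a][b])
    return score

# Placeholder parameters, for illustration only.
plus_init = minus_init = {s: 0.25 for s in "ACGT"}
plus  = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
minus = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
plus["C"]["G"],  plus["C"]["A"]  = 0.40, 0.10    # "+": C followed by G is enriched
minus["C"]["G"], minus["C"]["A"] = 0.05, 0.45    # "-": C followed by G is depleted

print(log_odds("GCGCGCAT", plus, minus, plus_init, minus_init))   # > 0: "+" (CpG island) more likely
```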

Page 33: Markov Chains Lecture #5

33

Where do the parameters (transition probabilities) come from?

Learning from complete data, namely, when the label is given and every xi is measured:

Source: A collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.

Input: Tuples of the form (x1, …, xL, h), where h is + or -

Output: Maximum Likelihood parameters (MLE)

Count all pairs (Xi = a, Xi-1 = b) with label +, and with label -; say the numbers are Nba,+ and Nba,-.

Page 34: Markov Chains Lecture #5

34

Maximum Likelihood Estimate (MLE) of the parameters (using labeled data)

The needed parameters are:

p+(x1), p+(xi | xi-1), p-(x1), p-(xi | xi-1)

The ML estimates are given by:

p+(X1 = a) = Na,+ / ∑x Nx,+

where Nx,+ is the number of times letter x appears in CpG islands in the dataset.

p+(Xi = a | Xi-1 = b) = Nba,+ / ∑x Nbx,+

where Nbx,+ is the number of times letter x appears after letter b in CpG islands in the dataset.

Using MLE is justified when we have a large sample. The numbers appearing in our tables (taken from Durbin et al p. 50) are based on 60,000 nucleotides.
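A sketch of this counting (not part of the original slides); estimate() is a hypothetical helper, and the two training strings are a toy example only.

```python
from collections import defaultdict

def estimate(sequences):
    """MLE of p(x1) and p(xi | xi-1) from one collection of labeled sequences
    (e.g. all the CpG-island sequences)."""
    letter = defaultdict(int)                        # N_x : occurrences of letter x
    pair = defaultdict(lambda: defaultdict(int))     # N_bx: occurrences of x right after b
    for seq in sequences:
        for x in seq:
            letter[x] += 1
        for b, x in zip(seq, seq[1:]):
            pair[b][x] += 1
    total = sum(letter.values())
    p_init = {x: n / total for x, n in letter.items()}
    p_trans = {b: {x: n / sum(nxt.values()) for x, n in nxt.items()}
               for b, nxt in pair.items()}
    return p_init, p_trans

plus_init, plus_trans = estimate(["ACGCG", "CGCGT"])   # toy "+" training data
print(plus_init)
print(plus_trans)
```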

Page 35: Markov Chains Lecture #5

36

Question 2: Finding CpG Islands

Given a long genomic string with possible CpG Islands, we define a Markov Chain over 8 states, all interconnected (hence it is ergodic):

A+  C+  G+  T+
A-  C-  G-  T-

The problem is that we don’t know the sequence of states which are traversed, but just the sequence of letters.

Therefore we use here a Hidden Markov Model, defined next.

Page 36: Markov Chains Lecture #5

37

Hidden Markov Model

A Markov chain (s1,…,sL):

p(s1,…,sL) = ∏i=1..L p(si | si-1)

and for each state s and a symbol x we have p(Xi = x | Si = s).

Application in communication: message sent is (s1,…,sm) but we receive (x1,…,xm). Compute what is the most likely message sent.

Application in speech recognition: word said is (s1,…,sm) but we recorded (x1,…,xm). Compute what is the most likely word said.

[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transition matrix M; each Si emits the observed symbol xi via emission matrix T]

Page 37: Markov Chains Lecture #5

38

Hidden Markov Model

Notations:
Markov Chain transition probabilities: p(Si+1 = t | Si = s) = ast
Emission probabilities: p(Xi = b | Si = s) = es(b)

[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transition matrix M; each Si emits the observed symbol xi via emission matrix T]

For Markov Chains we know:

p(s) = p(s1,…,sL) = ∏i=1..L p(si | si-1)

What is p(s,x) = p(s1,…,sL; x1,…,xL)?

Page 38: Markov Chains Lecture #5

39

Hidden Markov Model

p(Xi = b | Si = s) = es(b) means that the probability of xi depends only on the value of si. Formally, this is equivalent to the conditional independence assumption:

p(Xi = xi | x1,…,xi-1, xi+1,…,xL, s1,…,si,…,sL) = esi(xi)

[Figure: hidden chain S1 → S2 → … → SL-1 → SL with transition matrix M; each Si emits the observed symbol xi via emission matrix T]

Thus

p(s,x) = p(s1,…,sL; x1,…,xL) = ∏i=1..L p(si | si-1) · esi(xi)
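As a final sketch (not part of the original slides), the joint probability p(s,x) can be computed directly from this formula; the two-state parameters below ("+"/"-" hidden states emitting A, C, G, T) are made-up placeholders.

```python
init  = {"+": 0.5, "-": 0.5}                                    # p(s1)
trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}  # a_st
emit  = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},         # e_s(b)
         "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(states, symbols):
    """p(s, x) = prod_i p(s_i | s_{i-1}) * e_{s_i}(x_i), with p(s1 | s0) = p(s1)."""
    p = init[states[0]] * emit[states[0]][symbols[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][symbols[i]]
    return p

print(joint_probability("++--", "CGAT"))
```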