Statistics: Markov Chain & Processes

Markov Chains (Lecture #5)

Background Readings: Durbin et al., Section 3.1


Dependencies along the genome

In previous classes we assumed every letter in a sequence is sampled independently from some distribution q(·) over the alphabet {A,C,T,G}.

This model may suffice for alignment scoring, but it does not hold in real genomes:

1. There are special subsequences in the genome, like TATA within the regulatory area, upstream of a gene.

2. The pattern CG is less common than expected for random sampling.

We model such dependencies by Markov chains and hidden Markov models, which we define next.


Finite Markov Chain

An integer-time stochastic process, consisting of a domain D of m > 1 states {s1,…,sm} and:

1. An m-dimensional initial distribution vector (p(s1),…, p(sm)).

2. An m×m transition probabilities matrix M = (a_{si sj}).

For example, D can be the letters {A, C, T, G}, p(A) the probability of A to be the 1st letter in a sequence, and aAG the probability that G follows A in a sequence.


Markov Chain (cont.)

X1 → X2 → … → Xn-1 → Xn

• For each integer n, a Markov Chain assigns probability to sequences (x1…xn) over D (i.e., xi ∈ D) as follows:

p(x1, x2, …, xn) = p(X1 = x1) · ∏_{i=2}^{n} p(Xi = xi | Xi-1 = xi-1) = p(x1) · ∏_{i=2}^{n} a_{x_{i-1} x_i}

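The chain probability above can be sketched in a few lines of Python; the uniform initial distribution and transition values below are illustrative stand-ins, not from the lecture. A useful sanity check is that for every fixed length L, these probabilities sum to one over all length-L sequences:

```python
# A minimal sketch of p(x1..xn) = p(x1) * prod_{i>=2} a_{x_{i-1} x_i}.
# The uniform parameters are illustrative, not from the lecture.
import itertools

def chain_prob(seq, init, trans):
    """init: state -> p(X1 = state); trans: (s, t) -> a_st."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[(prev, cur)]
    return p

init = {s: 0.25 for s in "ACGT"}
trans = {(s, t): 0.25 for s in "ACGT" for t in "ACGT"}

print(chain_prob("ACGT", init, trans))  # 0.25**4 = 0.00390625

# Sanity check: for a fixed length L, the probabilities of all
# length-L sequences sum to one.
total = sum(chain_prob("".join(w), init, trans)
            for w in itertools.product("ACGT", repeat=3))
print(total)  # 1.0
```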


Markov Chain (cont.)

X1 → X2 → … → Xn-1 → Xn

Similarly, each Xi is a probability distribution over D, which is determined by the initial distribution (p1,…,pm) and the transition matrix M. There is a rich theory which studies the properties of such “Markov sequences” (X1,…, Xi ,…). A bit of this theory is presented next.


Matrix Representation

The transition probabilities matrix M = (a_st), with rows and columns ordered A, B, C, D:

       A     B     C     D
  A  0.95    0    0.05    0
  B  0.2    0.5    0     0.3
  C   0     0.2    0     0.8
  D   0      0     1      0

M is a stochastic Matrix: all entries are non-negative and each row sums to one.

The initial distribution vector (u1,…,um) defines the distribution of X1 (p(X1=si)=ui). Then after one move, the distribution is changed to X2 = X1M.


Matrix Representation

       A     B     C     D
  A  0.95    0    0.05    0
  B  0.2    0.5    0     0.3
  C   0     0.2    0     0.8
  D   0      0     1      0

The i-th distribution is Xi = X1·M^(i-1).

Example: if X1 = (0, 1, 0, 0) then X2 = (0.2, 0.5, 0, 0.3),

and if X1 = (0, 0, 0.5, 0.5) then X2 = (0, 0.1, 0.5, 0.4).
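Both examples can be checked numerically. The matrix below is the slide's matrix as reconstructed here (rows/columns ordered A, B, C, D), so treat the entries as a best-effort reading of the slide:

```python
# Checking the slide's examples: X2 = X1 M and, more generally, Xi = X1 M^(i-1).
import numpy as np

# Transition matrix as reconstructed from the slide (rows A, B, C, D).
M = np.array([
    [0.95, 0.0, 0.05, 0.0],   # A
    [0.2,  0.5, 0.0,  0.3],   # B
    [0.0,  0.2, 0.0,  0.8],   # C
    [0.0,  0.0, 1.0,  0.0],   # D
])

X1 = np.array([0.0, 1.0, 0.0, 0.0])   # all mass on state B
print(X1 @ M)                          # [0.2 0.5 0.  0.3]

X1 = np.array([0.0, 0.0, 0.5, 0.5])
print(X1 @ M)                          # [0.  0.1 0.5 0.4]

# The i-th distribution: Xi = X1 M^(i-1); it stays a probability vector.
Xi = X1 @ np.linalg.matrix_power(M, 4)
print(Xi.sum())                        # 1.0
```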


Representation of a Markov Chain as a Digraph

Each directed edge A→B is associated with the positive transition probability from A to B.

[digraph over states A, B, C, D with edge probabilities:
A→A 0.95, A→C 0.05; B→A 0.2, B→B 0.5, B→D 0.3; C→B 0.2, C→D 0.8; D→C 1]


Properties of Markov Chain states

[digraph over states A, B, C, D]

States of Markov chains are classified by the digraph representation (omitting the actual probability values)

A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph.

B is not recurrent – it is a transient state

Alternative definition: a state s is recurrent if it can be reached back from any state reachable from s; otherwise it is transient.
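The alternative definition translates directly into a reachability check. A minimal sketch; the toy digraph below is illustrative, chosen so that B is transient while A, C, D are recurrent, as in the slide:

```python
# Classify states using the definition above: s is recurrent iff s can be
# reached back from every state reachable from s.

def reachable(graph, s):
    """Set of states reachable from s (including s), by iterative DFS."""
    seen, stack = set(), [s]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, []))
    return seen

def recurrent_states(graph):
    reach = {s: reachable(graph, s) for s in graph}
    return {s for s in graph if all(s in reach[t] for t in reach[s])}

# Toy digraph: {A} and {C, D} are sink components; B leaks into them.
graph = {"A": ["A"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
print(recurrent_states(graph))  # {'A', 'C', 'D'}; B is transient
```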


Another example of Recurrent and Transient States

[digraph over states A, B, C, D]

A and B are transient states, C and D are recurrent states.

Once the process moves from B to D, it will never come back.


Irreducible Markov Chains

A Markov Chain is irreducible if the corresponding graph is strongly connected (and thus all its states are recurrent).

[two example digraphs: one over states A, B, C, D; one over states A, B, C, D, E]


Periodic States

A state s has period k if k is the GCD of the lengths of all the cycles that pass through s. (In the graph shown, the period of A is 2.)

[digraph over states A, B, C, D, E]

A Markov Chain is periodic if all the states in it have a period k >1. It is aperiodic otherwise.

Exercise: All the states in the same strongly connected component have the same period
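The period can be computed as the GCD of the lengths of closed walks through s, found here by stepping reachability sets forward up to a cutoff. The graph below is a hypothetical period-2 example, not the slide's figure:

```python
# Sketch: period of state s = gcd{ n >= 1 : there is a closed walk of
# length exactly n through s }, up to a cutoff walk length.
from math import gcd

def period(adj, s, max_len=64):
    """adj: state -> list of successors. Returns 0 if s lies on no cycle."""
    frontier = {s}   # states reachable from s in exactly n steps
    g = 0
    for n in range(1, max_len + 1):
        frontier = {t for v in frontier for t in adj[v]}
        if s in frontier:
            g = gcd(g, n)   # found a closed walk of length n through s
    return g

# Hypothetical graph with closed walks A->B->A (length 2)
# and A->B->C->D->A (length 4), so the period of A is gcd(2, 4) = 2.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}
print(period(adj, "A"))  # 2
```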


Ergodic Markov Chains

[digraph over states A, B, C, D]

A Markov chain is ergodic if:
1. the corresponding graph is strongly connected, and
2. it is not periodic.

Ergodic Markov Chains are important since they guarantee the corresponding Markovian process converges to a unique distribution, in which all states have strictly positive probability.


Stationary Distributions for Markov Chains

Let M be a Markov Chain of m states, and let V = (v1,…,vm) be a probability distribution over the m states

V = (v1,…,vm) is a stationary distribution for M if VM = V (i.e., if one step of the process does not change the distribution).

V is a stationary distribution ⟺ V is a left (row) Eigenvector of M with Eigenvalue 1.

Example of a stationary vector (on the board): V = (0.8, 0.2), where M is
  0.75  0.25
  1     0
(check: VM = (0.8·0.75 + 0.2·1, 0.8·0.25 + 0.2·0) = (0.8, 0.2) = V).
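The board example can be verified numerically, and the stationary vector recovered as a left eigenvector of M (computed here via NumPy's eig on the transpose of M):

```python
# Verify VM = V for the 2x2 board example, and recover V as a left
# eigenvector of M with eigenvalue 1.
import numpy as np

M = np.array([[0.75, 0.25],
              [1.0,  0.0 ]])
V = np.array([0.8, 0.2])
print(V @ M)   # [0.8 0.2] -> VM = V, so V is stationary

# Left eigenvectors of M are right eigenvectors of M^T.
vals, vecs = np.linalg.eig(M.T)
k = np.argmin(np.abs(vals - 1.0))      # pick the eigenvalue-1 eigenvector
v = np.real(vecs[:, k])
v = v / v.sum()                        # normalize to a probability vector
print(v)                               # [0.8 0.2]
```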

Stationary Distributions for a Markov Chain M

Exercise: A stochastic matrix always has a real left Eigenvector with Eigenvalue 1. (Hint: show that a stochastic matrix has a right Eigenvector with Eigenvalue 1, namely the all-ones vector. Note that the left Eigenvalues of a matrix are the same as its right Eigenvalues.)

[It can be shown that the above Eigenvector V can be chosen non-negative. Hence each Markov Chain has a stationary distribution.]


“Good” Markov chains

A Markov chain is good if the distributions Xi, as i→∞:

(1) converge to a unique distribution, independent of the initial distribution.

(2) In that unique distribution, each state has a positive probability.

The Fundamental Theorem of Finite Markov Chains: A Markov Chain is good ⟺ the corresponding chain is ergodic.

We will prove the ⟹ direction, by showing that non-ergodic Markov Chains are not good.


Examples of “Bad” Markov Chains

A Markov chain is not “good” if either:
1. it does not converge to a unique distribution; or
2. it does converge to a unique distribution, but some states in this distribution have zero probability.


Bad case 1: Mutual Unreachability

[digraph over states A, B, C, D]

Fact 1: If G has two states which are unreachable from each other, then {Xi} cannot converge to a distribution which is independent of the initial distribution.

Consider two initial distributions:
a) p(X1 = A) = 1 (p(X1 = x) = 0 for x ≠ A).
b) p(X1 = C) = 1.

In case a), the sequence will stay at A forever; in case b), it will stay in {C, D} forever.


Bad case 2: Transient States

[digraph over states A, B, C, D]

Once the process moves from B to D, it will never come back.


Bad case 2: Transient States

[digraph over states A, B, C, D, with X marking the states from which A is unreachable]

Fact 2: For each initial distribution, with probability 1 a transient state will be visited only a finite number of times.

Proof: Let A be a transient state, and let X be the set of states from which A is unreachable. It is enough to show that, starting from any state, with probability 1 a state in X is reached after a finite number of steps (Exercise: complete the proof)



Corollary: A good Markov Chain is irreducible


Bad case 3: Periodic Markov Chains

[digraph over states A, B, C, D, E]

Recall: A Markov Chain is periodic if all its states have a period k > 1. The above chain has period 2. In the above chain, consider the initial distribution p(B) = 1. Then states {B, C} are visited (with positive probability) only in odd steps, and states {A, D, E} only in even steps.


Bad case 3: Periodic States

[digraph over states A, B, C, D, E]

Fact 3: In a periodic Markov Chain (of period k > 1) there are initial distributions under which the states are visited in a periodic manner. Under such initial distributions, Xi does not converge as i→∞.

Corollary: A good Markov Chain is not periodic


The Fundamental Theorem of Finite Markov Chains:

If a Markov Chain is ergodic, then:
1. It has a unique stationary distribution vector V > 0, which is an Eigenvector of the transition matrix.
2. For any initial distribution, the distributions Xi converge to V as i→∞.

We have proved that non-ergodic Markov Chains are not good. A proof of the other part (based on Perron-Frobenius theory) is beyond the scope of this course.
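The theorem's claim can be illustrated on the same 2-state board example: powers of M converge to a matrix whose rows are all V, so X1·M^n approaches V for every initial distribution X1:

```python
# Convergence to the stationary distribution for an ergodic 2-state chain.
import numpy as np

M = np.array([[0.75, 0.25],
              [1.0,  0.0 ]])
Mn = np.linalg.matrix_power(M, 50)
print(Mn)   # both rows ~ [0.8 0.2]

# Any initial distribution converges to the same V = (0.8, 0.2).
for X1 in ([1.0, 0.0], [0.0, 1.0], [0.3, 0.7]):
    print(np.array(X1) @ Mn)   # ~ [0.8 0.2] each time
```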


Use of Markov Chains in Genome search: Modeling CpG Islands

In human genomes the pair CG often transforms to (methyl-C) G which often transforms to TG.

Hence the pair CG appears less often than expected from the independent frequencies of C and G alone.

Due to biological reasons, this process is sometimes suppressed in short stretches of genomes such as in the start regions of many genes.

These areas are called CpG islands (p denotes “pair”).


Example: CpG Island (Cont.)

We consider two questions (and some variants):

Question 1: Given a short stretch of genomic data, does it come from a CpG island?

Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?

We “solve” the first question by modeling strings with and without CpG islands as Markov Chains over the same states {A,C,G,T} but different transition probabilities:


Example: CpG Island (Cont.)

The “+” model: Use transition matrix A+ = (a+_st), where a+_st = (the probability that t follows s in a CpG island).

The “-” model: Use transition matrix A- = (a-_st), where a-_st = (the probability that t follows s in a non-CpG island).


Example: CpG Island (Cont.)

With this model, to solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “–” model. This is done by using the definitions of Markov Chain, in which the parameters are determined by known data and the log odds-ratio test.


Question 1: Using two Markov chains

A+ (For CpG islands):

We need to specify p+(xi | xi-1), where “+” stands for CpG island. From Durbin et al. we have:

Xi-1 \ Xi     A         C          G          T
   A         0.18      0.27       0.43       0.12
   C         0.17     p+(C|C)     0.274     p+(T|C)
   G         0.16     p+(C|G)    p+(G|G)    p+(T|G)
   T         0.08     p+(C|T)    p+(G|T)    p+(T|T)

(Recall: rows must add up to one; columns need not.)


Question 1: Using two Markov chains

A- (For non-CpG islands):

…and for p-(xi | xi-1) (where “-” stands for non-CpG island) we have:

Xi-1 \ Xi     A         C          G          T
   A         0.3       0.2        0.29       0.21
   C         0.32     p-(C|C)     0.078     p-(T|C)
   G         0.25     p-(C|G)    p-(G|G)    p-(T|G)
   T         0.18     p-(C|T)    p-(G|T)    p-(T|T)


Discriminating between the two models

Given a string x = (x1…xL), now compute the ratio

RATIO = p(x | + model) / p(x | - model)
      = [∏_{i=1}^{L} p+(xi | xi-1)] / [∏_{i=1}^{L} p-(xi | xi-1)]

If RATIO > 1, a CpG island is more likely. Actually, the log of this ratio is computed.

Note: p+(x1 | x0) is defined for convenience as p+(x1), and p-(x1 | x0) as p-(x1).


Log Odds-Ratio test

Taking the logarithm yields

log Q = log [ p+(x1…xL) / p-(x1…xL) ] = Σ_{i=1}^{L} log [ p+(xi | xi-1) / p-(xi | xi-1) ]

If log Q > 0, then + is more likely (CpG island).
If log Q < 0, then - is more likely (non-CpG island).
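The whole test can be sketched in Python. The slides give only the first column and a few entries of each matrix, so the remaining entries below are hypothetical stand-ins (rows made to sum to one); only the shape of the computation matters here:

```python
# Sketch of the log odds-ratio test for CpG islands.
# Matrices: published entries plus HYPOTHETICAL fillers (rows sum to 1).
import math

A_plus = {   # p+(t | s): transition probabilities inside CpG islands
    "A": {"A": 0.18, "C": 0.27, "G": 0.43,  "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.274, "T": 0.186},
    "G": {"A": 0.16, "C": 0.34, "G": 0.375, "T": 0.125},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38,  "T": 0.18},
}
A_minus = {  # p-(t | s): transition probabilities outside CpG islands
    "A": {"A": 0.30, "C": 0.20, "G": 0.29,  "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.078, "T": 0.302},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30,  "T": 0.20},
    "T": {"A": 0.18, "C": 0.24, "G": 0.29,  "T": 0.29},
}
# First-letter distributions p+-(x1), taken uniform for simplicity.
p1_plus = p1_minus = {s: 0.25 for s in "ACGT"}

def log_odds(x):
    """log Q = log p+(x1) - log p-(x1) + sum_i log [p+(xi|xi-1) / p-(xi|xi-1)]."""
    q = math.log(p1_plus[x[0]] / p1_minus[x[0]])
    for prev, cur in zip(x, x[1:]):
        q += math.log(A_plus[prev][cur] / A_minus[prev][cur])
    return q

print(log_odds("CGCG") > 0)  # True: a CG-rich string looks like a CpG island
print(log_odds("ATAT") > 0)  # False: an AT-rich string does not
```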


Where do the parameters (transition probabilities) come from?

Learning from complete data, namely, when the label is given and every xi is measured:

Source: A collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.

Input: Tuples of the form (x1, …, xL, h), where h is + or -

Output: Maximum Likelihood parameters (MLE)

Count all pairs (Xi=a, Xi-1=b) with label +, and with label -, say the numbers are Nba,+ and Nba,- .


Maximum Likelihood Estimate (MLE) of the parameters (using labeled data)

The needed parameters are:

p+(x1), p+(xi | xi-1), p-(x1), p-(xi | xi-1)

The ML estimates are given by:

Using MLE is justified when we have a large sample. The numbers appearing in the text book are based on 60,000 nucleotides. When only small samples are available, Bayesian learning is an attractive alternative.

X1 → X2 → … → XL-1 → XL

p(X1 = a) = N_{a,+} / Σ_x N_{x,+}, where N_{x,+} is the number of times letter x appears in CpG islands in the dataset.

p(Xi = a | Xi-1 = b) = N_{ba,+} / Σ_x N_{bx,+}, where N_{bx,+} is the number of times letter x appears after letter b in CpG islands in the dataset.

(The “-” parameters are estimated analogously from the non-CpG sequences.)
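The counting scheme above can be sketched as follows; the three training sequences are a toy stand-in for the real (60,000-nucleotide) dataset:

```python
# MLE from labeled data: count letters (N_{x,h}) and letter pairs (N_{ba,h})
# per label h in {+, -}, then normalize.
from collections import Counter

def mle(labeled_seqs):
    """labeled_seqs: iterable of (sequence, label). Returns (p1, ptrans)."""
    letters = {"+": Counter(), "-": Counter()}   # N_{x,h}
    pairs = {"+": Counter(), "-": Counter()}     # N_{ba,h}
    for seq, h in labeled_seqs:
        letters[h].update(seq)
        for b, a in zip(seq, seq[1:]):
            pairs[h][(b, a)] += 1
    p1, ptrans = {}, {}
    for h in "+-":
        n = sum(letters[h].values())
        p1[h] = {x: c / n for x, c in letters[h].items()}
        row = Counter()
        for (b, a), c in pairs[h].items():
            row[b] += c                           # sum_x N_{bx,h}
        ptrans[h] = {(b, a): c / row[b] for (b, a), c in pairs[h].items()}
    return p1, ptrans

data = [("CGCG", "+"), ("CGGC", "+"), ("ATAT", "-")]   # toy training set
p1, pt = mle(data)
print(p1["+"]["C"])          # 0.5: half the island letters are C
print(pt["+"][("C", "G")])   # 1.0: in this toy data, C is always followed by G
```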


Question 2: Finding CpG Islands

Given a long genomic string with possible CpG Islands, we define a Markov Chain over 8 states, all interconnected (hence it is ergodic):

States: A+, C+, G+, T+ (inside an island) and A-, C-, G-, T- (outside).

The problem is that we don’t know the sequence of states which are traversed, but just the sequence of letters.

Therefore we use here a Hidden Markov Model.


Hidden Markov Model

A Markov chain (s1,…,sL):

p(s1,…,sL) = ∏_{i=1}^{L} p(si | si-1)

and for each state s and each symbol x we have p(Xi = x | Si = s).

Application in communication: the message sent is (s1,…,sm) but we receive (x1,…,xm). Compute the most likely message sent.

Application in speech recognition: the word said is (s1,…,sm) but we recorded (x1,…,xm). Compute the most likely word said.

[HMM diagram: hidden states S1 → S2 → … → SL (transitions M); each Si emits xi (emissions T)]


Hidden Markov Model

Notation:
Markov Chain transition probabilities: p(Si+1 = t | Si = s) = a_st
Emission probabilities: p(Xi = b | Si = s) = e_s(b)

[HMM diagram: hidden states S1 → S2 → … → SL (transitions M); each Si emits xi (emissions T)]

For Markov Chains we know:

p(s) = p(s1,…,sL) = ∏_{i=1}^{L} p(si | si-1)

What is p(s, x) = p(s1,…,sL; x1,…,xL)?


Hidden Markov Model

p(Xi = b | Si = s) = e_s(b) means that the probability of xi depends only on si. Formally, this is equivalent to the conditional independence assumption:

p(Xi = xi | x1,…,xi-1, xi+1,…,xL, s1,…,sL) = e_{si}(xi)

[HMM diagram: hidden states S1 → S2 → … → SL (transitions M); each Si emits xi (emissions T)]

Thus:

p(s, x) = p(s1,…,sL; x1,…,xL) = ∏_{i=1}^{L} p(si | si-1) · e_{si}(xi)
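The joint probability p(s, x) can be computed directly from this definition; the two-state fair/loaded-coin parameters below are illustrative, not from the lecture, and p(s1 | s0) is taken to be the initial distribution, as above:

```python
# Sketch: p(s, x) = prod_i p(s_i | s_{i-1}) * e_{s_i}(x_i),
# with p(s_1 | s_0) taken as the initial state distribution.

def joint_prob(states, symbols, init, trans, emit):
    p = init[states[0]] * emit[states[0]][symbols[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][symbols[i]]
    return p

# A tiny hypothetical 2-state HMM (Fair / Loaded coin).
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

print(joint_prob("FL", "HH", init, trans, emit))  # 0.5 * 0.5 * 0.1 * 0.9 = 0.0225
```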