Markov Chains

Markov Chains. 2 Dependencies along the genome In previous classes we assumed every letter in a sequence is sampled randomly from some distribution


Dependencies along the genome

In previous classes we assumed that every letter in a sequence is sampled independently at random from some distribution q(·) over the alphabet {A, C, T, G}.

This model may suffice for alignment scoring, but it does not hold in real genomes:

1. There are special subsequences in the genome, such as TATA within the regulatory region upstream of a gene.

2. The pair C followed by G is less common than expected under random sampling.

We model such dependencies with Markov chains and hidden Markov models, which we define next.


Finite Markov Chain

A finite Markov chain is an integer-time stochastic process consisting of a domain D of m states $\{s_1, \ldots, s_m\}$ and

1. An m-dimensional initial distribution vector $(p(s_1), \ldots, p(s_m))$.

2. An m×m transition probability matrix $M = (a_{s_i s_j})$.

For example, D can be the letters {A, C, T, G}, p(A) the probability that A is the first letter in a sequence, and $a_{AG}$ the probability that G follows A in a sequence.


Simple Model - Markov Chains

• Markov Property: The state of the system at time t+1 depends only on the state of the system at time t:

$$P[X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_1 = x_1, X_0 = x_0] = P[X_{t+1} = x_{t+1} \mid X_t = x_t]$$

[Figure: chain of states X1 → X2 → X3 → X4 → X5]


Markov Chain (cont.)

• For each integer n, a Markov chain assigns a probability to sequences $(x_1, \ldots, x_n)$ over D (i.e., $x_i \in D$) as follows:

$$p((x_1, x_2, \ldots, x_n)) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$

[Figure: chain X1 → X2 → … → Xn-1 → Xn]

Similarly, $(X_1, \ldots, X_i, \ldots)$ is a sequence of probability distributions over D.
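As a minimal sketch of the product formula for the probability of a sequence, the Python below multiplies the initial probability of the first letter by the transition probabilities of each consecutive pair. The two-letter chain and all of its numbers are hypothetical and chosen only for illustration; the names `sequence_probability`, `initial` and `transition` are introduced here and do not come from the slides.

```python
from typing import Dict, Sequence


def sequence_probability(
    seq: Sequence[str],
    initial: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
) -> float:
    """Return p(x1..xn) = p(x1) * prod_{i=2..n} a[x_{i-1}][x_i]."""
    if not seq:
        return 1.0  # empty product
    prob = initial[seq[0]]
    # Walk over consecutive pairs (x_{i-1}, x_i).
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# Hypothetical chain over D = {"A", "G"} (illustrative numbers only).
initial = {"A": 0.5, "G": 0.5}
transition = {
    "A": {"A": 0.9, "G": 0.1},
    "G": {"A": 0.4, "G": 0.6},
}

p = sequence_probability("AAG", initial, transition)
print(p)  # 0.5 * 0.9 * 0.1
```

For a fixed length n, summing `sequence_probability` over all sequences of that length gives 1, which is a quick sanity check on any hand-written chain.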


Matrix Representation

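The initial distribution vector $(u_1, \ldots, u_m)$ defines the distribution of $X_1$: $p(X_1 = s_i) = u_i$.

The transition probability matrix $M = (a_{st})$, with rows and columns ordered A, B, C, D:

$$M = \begin{pmatrix} 0.95 & 0 & 0.05 & 0 \\ 0.2 & 0.5 & 0 & 0.3 \\ 0 & 0.2 & 0 & 0.8 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

M is a stochastic matrix: $\sum_t a_{st} = 1$ for every state s.

After one move, the distribution changes to $X_2 = X_1 M$. After i-1 moves the distribution is $X_i = X_1 M^{i-1}$.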


Simple Example

Weather:

– raining today → rain tomorrow: $p_{rr} = 0.4$

– raining today → no rain tomorrow: $p_{rn} = 0.6$

– not raining today → rain tomorrow: $p_{nr} = 0.2$

– not raining today → no rain tomorrow: $p_{nn} = 0.8$


Simple Example

Transition Matrix for Example

$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$

• Note that rows sum to 1

• Such a matrix is called a Stochastic Matrix

• If the rows and the columns of a matrix all sum to 1, we have a Doubly Stochastic Matrix
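The two definitions can be checked mechanically. The sketch below tests whether a matrix is stochastic or doubly stochastic, using the weather probabilities from the example (state order: rain, no rain). The function names and the tolerance parameter `tol` are assumptions introduced for this illustration.

```python
from typing import List

Matrix = List[List[float]]


def is_stochastic(m: Matrix, tol: float = 1e-9) -> bool:
    """True if all entries are non-negative and every row sums to 1."""
    return all(
        all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol
        for row in m
    )


def is_doubly_stochastic(m: Matrix, tol: float = 1e-9) -> bool:
    """True if m is stochastic and every column also sums to 1."""
    columns = [list(col) for col in zip(*m)]
    return is_stochastic(m, tol) and all(
        abs(sum(col) - 1.0) <= tol for col in columns
    )


# Weather chain from the example; state order is (rain, no rain).
P = [[0.4, 0.6],
     [0.2, 0.8]]

print(is_stochastic(P))         # rows sum to 1
print(is_doubly_stochastic(P))  # columns sum to 0.6 and 1.4, so this fails
```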


Gambler’s Example

– At each play we have the following:

• Gambler wins $1 with probability p

• Gambler loses $1 with probability 1-p

– Game ends when gambler goes broke, or gains a fortune of $100

– Both $0 and $100 are absorbing states

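[Figure: states 0, 1, 2, …, N-1, N arranged in a line. From each state other than 0 and N the chain moves one step right with probability p and one step left with probability 1-p. The gambler starts with $10.]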
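The gambler's chain can be written down as a transition matrix over the fortunes 0, 1, …, N. The sketch below builds that matrix with both end states absorbing. N = 100 comes from the example; the value p = 0.5 and the function name `gambler_matrix` are assumptions made only for illustration, since the slides leave p unspecified.

```python
from typing import List


def gambler_matrix(n: int, p: float) -> List[List[float]]:
    """Transition matrix over fortunes 0..n with absorbing states 0 and n."""
    m = [[0.0] * (n + 1) for _ in range(n + 1)]
    m[0][0] = 1.0  # broke: the game is over, stay at 0
    m[n][n] = 1.0  # target fortune reached: stay at n
    for k in range(1, n):
        m[k][k + 1] = p        # win $1
        m[k][k - 1] = 1.0 - p  # lose $1
    return m


# N = 100 as in the example; p = 0.5 is an assumed value.
M = gambler_matrix(100, 0.5)
print(M[10][9], M[10][11])  # the two moves available from a fortune of $10
```

Every row of the result sums to 1, so the matrix is stochastic, and rows 0 and N each contain a single 1 on the diagonal, which is what makes those states absorbing.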


Coke vs. Pepsi

Given that a person’s last cola purchase was Coke, there is a 90% chance that her next cola purchase will also be Coke.

If a person’s last cola purchase was Pepsi, there is an 80% chance that her next cola purchase will also be Pepsi.

[Figure: two states, coke and pepsi. coke → coke 0.9, coke → pepsi 0.1, pepsi → pepsi 0.8, pepsi → coke 0.2.]


Coke vs. Pepsi

Given that a person is currently a Pepsi purchaser, what is the probability that she will purchase Coke two purchases from now?

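The transition matrix (corresponding to one purchase ahead), with rows and columns ordered Coke, Pepsi, is:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

Two purchases ahead:

$$P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} = \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}$$

The requested probability is the (Pepsi, Coke) entry of $P^2$, which is 0.34.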


Coke vs. Pepsi

Given that a person is currently a Coke drinker, what is the probability that she will purchase Pepsi three purchases from now?

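$$P^3 = P \cdot P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix} = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$

The requested probability is the (Coke, Pepsi) entry of $P^3$, which is 0.219.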
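Both n-step questions (Pepsi to Coke in two purchases, Coke to Pepsi in three) reduce to reading one entry of a power of the transition matrix. The following sketch computes those powers by repeated multiplication in plain Python. The helper names `mat_mul` and `mat_pow` and the index constants `COKE` and `PEPSI` are introduced here for illustration.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a * b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(m: Matrix, n: int) -> Matrix:
    """m raised to a non-negative integer power n by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


COKE, PEPSI = 0, 1
P = [[0.9, 0.1],   # last purchase Coke
     [0.2, 0.8]]   # last purchase Pepsi

P2 = mat_pow(P, 2)
P3 = mat_pow(P, 3)
print(round(P2[PEPSI][COKE], 3))  # Pepsi now, Coke two purchases from now
print(round(P3[COKE][PEPSI], 3))  # Coke now, Pepsi three purchases from now
```

Repeated multiplication is enough for small exponents like these; for large n, exponentiation by squaring would need fewer products.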


Coke vs. Pepsi

Assume each person makes one cola purchase per week. Suppose 60% of all people now drink Coke, and 40% drink Pepsi.

What fraction of people will be drinking Coke three weeks from now?

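Let $(Q_0, Q_1) = (0.6, 0.4)$ be the initial probabilities.

We will regard Coke as 0 and Pepsi as 1.

We want to find $P(X_3 = 0)$, where

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

and $p^{(3)}_{ij}$ denotes the (i, j) entry of $P^3$:

$$P(X_3 = 0) = \sum_{i=0}^{1} Q_i \, p^{(3)}_{i0} = Q_0 \, p^{(3)}_{00} + Q_1 \, p^{(3)}_{10} = 0.6 \cdot 0.781 + 0.4 \cdot 0.438 = 0.6438$$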
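The same number can be obtained without forming $P^3$, by pushing the population vector through the chain one week at a time. This is a minimal sketch; the helper name `step` is an assumption introduced here.

```python
from typing import List


def step(dist: List[float], m: List[List[float]]) -> List[float]:
    """One move of the chain: the row vector dist multiplied by m."""
    return [
        sum(dist[i] * m[i][j] for i in range(len(dist)))
        for j in range(len(m[0]))
    ]


P = [[0.9, 0.1],   # Coke  -> (Coke, Pepsi)
     [0.2, 0.8]]   # Pepsi -> (Coke, Pepsi)

dist = [0.6, 0.4]  # (Q0, Q1): 60% Coke, 40% Pepsi today
for _week in range(3):
    dist = step(dist, P)

print(round(dist[0], 4))  # fraction drinking Coke three weeks from now
```

The intermediate vectors are (0.62, 0.38) after one week and (0.634, 0.366) after two, so the Coke share rises gradually towards its long-run value.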


“Good” Markov chains

For certain Markov chains, the distributions $X_i$, as i → ∞: (1) converge to a unique distribution, independent of the initial distribution; (2) in that unique distribution, each state has a positive probability. Call these Markov chains “good”.

We describe these “good” Markov chains by considering the graph representation of stochastic matrices.


Representation as a Digraph

Each directed edge A → B is associated with the positive transition probability from A to B.

[Figure: digraph on the states A, B, C, D with edge labels 0.95, 0.05, 0.2, 0.5, 0.3, 0.2, 0.8 and 1, shown next to the 4×4 transition matrix from the Matrix Representation slide.]

We now define properties of this graph which guarantee:

1. Convergence to a unique distribution.

2. In that distribution, each state has positive probability.


Examples of “Bad” Markov Chains

Markov chains are not “good” if either:

1. They do not converge to a unique distribution.

2. They do converge to a unique distribution, but some states in this distribution have zero probability.


Bad case 1: Mutual Unreachability

[Figure: digraph on the states A, B, C, D]

Consider two initial distributions:

a) p(X1 = A) = 1 (p(X1 = x) = 0 if x ≠ A).

b) p(X1 = C) = 1.

In case a), the sequence will stay at A forever. In case b), it will stay in {C, D} forever.

Fact 1: If the graph G has two states which are unreachable from each other, then {Xi} cannot converge to a distribution which is independent of the initial distribution.


Bad case 2: Transient States

[Figure: digraph on the states A, B, C, D]

A and B are transient states, C and D are recurrent states.

Once the process moves from B to D, it will never come back.

Def: A state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.


Bad case 2: Transient States

[Figure: digraph on the states A, B, C, D]

Fact 2: For each initial distribution, with probability 1 a transient state will be visited only a finite number of times.


Bad case 3: Periodic States

A state s has period k if k is the GCD of the lengths of all the cycles that pass through s.

[Figure: digraph on the states A, B, C, D, E]

A Markov chain is periodic if all the states in it have a period k > 1. It is aperiodic otherwise.

Example: Consider the initial distribution p(B) = 1. Then states {B, C} are visited (with positive probability) only in odd steps, and states {A, D, E} are visited only in even steps.
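The period can be computed from the graph alone. The sketch below assumes a strongly connected digraph and uses a standard breadth-first-search argument: with BFS levels measured from any root, the period equals the GCD of level[u] + 1 - level[v] over all edges u → v. The function name `chain_period` and the three small test graphs are hypothetical; they are not the five-state graph from the slide, whose edges are not recoverable here.

```python
from collections import deque
from math import gcd
from typing import Dict, List


def chain_period(graph: Dict[str, List[str]], root: str) -> int:
    """Period of a strongly connected digraph given as adjacency lists.

    Assumes every state is reachable from `root` and vice versa.
    """
    # BFS levels: shortest number of steps from root to each state.
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    # Every edge u -> v closes cycles whose lengths differ from a
    # multiple of the period by exactly level[u] + 1 - level[v].
    period = 0
    for u in level:
        for v in graph[u]:
            period = gcd(period, abs(level[u] + 1 - level[v]))
    return period


# Hypothetical graphs for illustration only.
two_cycle = {"A": ["B"], "B": ["A"]}
triangle = {"A": ["B"], "B": ["C"], "C": ["A"]}
with_self_loop = {"A": ["A", "B"], "B": ["A"]}

print(chain_period(two_cycle, "A"))       # states alternate: period 2
print(chain_period(triangle, "A"))        # single cycle of length 3
print(chain_period(with_self_loop, "A"))  # a self-loop forces period 1
```

A result of 1 means the chain is aperiodic; any self-loop in a strongly connected graph is enough to guarantee that.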


Bad case 3: Periodic States

[Figure: digraph on the states A, B, C, D, E]

Fact 3: In a periodic Markov chain (of period k > 1) there are initial distributions under which the states are visited in a periodic manner. Under such initial distributions $X_i$ does not converge as i → ∞.


Ergodic Markov Chains

The Fundamental Theorem of Finite Markov Chains: If a Markov chain is ergodic, then

1. It has a unique stationary distribution vector V > 0, which is an eigenvector of the transition matrix.

2. The distributions $X_i$, as i → ∞, converge to V.

[Figure: digraph on the states A, B, C, D with edge labels 0.95, 0.05, 0.2, 0.5, 0.3, 0.2, 0.8 and 1]

A Markov chain is ergodic if:

1. All states are recurrent (i.e., the graph is strongly connected).

2. It is not periodic.
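The theorem can be observed numerically. The sketch below repeatedly applies the transition matrix to an initial distribution until it stops changing, using the Coke vs. Pepsi chain (90% stay with Coke, 80% stay with Pepsi), which is strongly connected and aperiodic. The function names, the tolerance `tol` and the iteration cap `max_iter` are assumptions introduced for this illustration.

```python
from typing import List


def step(dist: List[float], m: List[List[float]]) -> List[float]:
    """One move of the chain: the row vector dist multiplied by m."""
    return [
        sum(dist[i] * m[i][j] for i in range(len(dist)))
        for j in range(len(m[0]))
    ]


def stationary(
    m: List[List[float]],
    start: List[float],
    tol: float = 1e-12,
    max_iter: int = 10_000,
) -> List[float]:
    """Iterate X_{i+1} = X_i * m until successive distributions agree."""
    dist = list(start)
    for _ in range(max_iter):
        nxt = step(dist, m)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("no convergence; the chain may not be ergodic")


P = [[0.9, 0.1],   # Coke  -> (Coke, Pepsi)
     [0.2, 0.8]]   # Pepsi -> (Coke, Pepsi)

v_from_coke = stationary(P, [1.0, 0.0])
v_from_pepsi = stationary(P, [0.0, 1.0])
print([round(x, 4) for x in v_from_coke])
print([round(x, 4) for x in v_from_pepsi])
```

Both starting points lead to the same vector, approximately (2/3, 1/3), with every entry positive, and applying `step` to it once more leaves it unchanged, which is the eigenvector property V = V M.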


Use of Markov Chains in Genome search: Modeling CpG Islands

In human genomes the pair CG often transforms to (methyl-C)G, which often transforms to TG.

Hence the pair CG appears less often than expected from the independent frequencies of C and G alone.

For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes.

These areas are called CpG islands (the p denotes the phosphate group that links the C and the G along the strand).


Example: CpG Island (Cont.)

We consider two questions (and some variants):

Question 1: Given a short stretch of genomic data, does it come from a CpG island?

Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?

We “solve” the first question by modeling strings with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities:


Example: CpG Island (Cont.)

The “+” model: Use transition matrix $A^+ = (a^+_{st})$, where $a^+_{st}$ = (the probability that t follows s in a CpG island).

The “-” model: Use transition matrix $A^- = (a^-_{st})$, where $a^-_{st}$ = (the probability that t follows s in a non-CpG island).


Example: CpG Island (Cont.)

With this model, to solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “–” model. This is done using the definition of a Markov chain.

[To solve Question 2 we need to decide which parts of a given long sequence of letters are more likely to come from the “+” model, and which parts are more likely to come from the “–” model. This is done using the Hidden Markov Model, to be defined later.]

We start with Question 1:


Question 1: Using two Markov chains

We need to specify $p^+(x_i \mid x_{i-1})$, where + stands for CpG island. From Durbin et al. we have:

A+ (for CpG islands); rows are $X_{i-1}$, columns are $X_i$:

      A       C          G          T
A   0.18    0.27       0.43       0.12
C   0.17    p+(C|C)    0.274      p+(T|C)
G   0.16    p+(C|G)    p+(G|G)    p+(T|G)
T   0.08    p+(C|T)    p+(G|T)    p+(T|T)

(Recall: rows must add up to one; columns need not.)


Question 1: Using two Markov chains

…and for $p^-(x_i \mid x_{i-1})$ (where “-” stands for non-CpG island) we have:

A- (for non-CpG islands); rows are $X_{i-1}$, columns are $X_i$:

      A       C          G          T
A   0.3     0.2        0.29       0.21
C   0.32    p-(C|C)    0.078      p-(T|C)
G   0.25    p-(C|G)    p-(G|G)    p-(T|G)
T   0.18    p-(C|T)    p-(G|T)    p-(T|T)


Discriminating between the two models

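Given a string $x = (x_1, \ldots, x_L)$, now compute the ratio

$$\text{RATIO} = \frac{p(x \mid +\text{ model})}{p(x \mid -\text{ model})} = \frac{\prod_{i=1}^{L} p^+(x_i \mid x_{i-1})}{\prod_{i=1}^{L} p^-(x_i \mid x_{i-1})}$$

[Figure: chain X1 → X2 → … → XL-1 → XL]

If RATIO > 1, a CpG island is more likely. In practice, the log of this ratio is computed.

Note: $p^+(x_1 \mid x_0)$ is defined for convenience as $p^+(x_1)$, and $p^-(x_1 \mid x_0)$ is defined for convenience as $p^-(x_1)$.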


Log Odds-Ratio test

Taking the logarithm yields

$$\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_{i=1}^{L} \log \frac{p^+(x_i \mid x_{i-1})}{p^-(x_i \mid x_{i-1})}$$

If log Q > 0, then + is more likely (CpG island). If log Q < 0, then - is more likely (non-CpG island).
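The log-odds test is a short loop over consecutive letters. The sketch below scores a sequence against a “+” and a “-” chain and adds up the log ratios, with the first letter scored by the initial distributions. To keep it small it uses a toy two-letter alphabet {C, G}; every probability in it is hypothetical and is not taken from Durbin et al., and the names `log_odds`, `plus`, `minus` and `uniform` are introduced here.

```python
from math import log
from typing import Dict

Transition = Dict[str, Dict[str, float]]


def log_odds(
    seq: str,
    plus: Transition,
    minus: Transition,
    init_plus: Dict[str, float],
    init_minus: Dict[str, float],
) -> float:
    """log Q for a non-empty sequence: sum of log p+(.)/p-(.) terms."""
    # First letter: p(x1 | x0) is defined as the initial probability p(x1).
    score = log(init_plus[seq[0]] / init_minus[seq[0]])
    # Remaining letters: one term per consecutive pair (x_{i-1}, x_i).
    for prev, cur in zip(seq, seq[1:]):
        score += log(plus[prev][cur] / minus[prev][cur])
    return score


# Hypothetical toy models over {C, G}: CG is rarer in the "-" model.
plus = {"C": {"C": 0.5, "G": 0.5}, "G": {"C": 0.5, "G": 0.5}}
minus = {"C": {"C": 0.8, "G": 0.2}, "G": {"C": 0.5, "G": 0.5}}
uniform = {"C": 0.5, "G": 0.5}

print(log_odds("CGCG", plus, minus, uniform, uniform) > 0)  # favors "+"
print(log_odds("CCCC", plus, minus, uniform, uniform) > 0)  # favors "-"
```

Summing logs instead of multiplying ratios avoids numerical underflow on long sequences, which is the practical reason the log form is preferred.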


Where do the parameters (transition probabilities) come from?

Learning from complete data, namely, when the label is given and every $x_i$ is measured:

Source: A collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.

Input: Tuples of the form $(x_1, \ldots, x_L, h)$, where h is + or -.

Output: Maximum likelihood estimates (MLE) of the parameters.

Count all pairs $(X_{i-1} = b, X_i = a)$ with label + and with label -; call the counts $N_{ba,+}$ and $N_{ba,-}$.


Maximum Likelihood Estimate (MLE) of the parameters (using labeled data)

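The needed parameters are:

$p^+(x_1)$, $p^+(x_i \mid x_{i-1})$, $p^-(x_1)$, $p^-(x_i \mid x_{i-1})$

The ML estimates are given by:

$$p^+(X_1 = a) = \frac{N_{a,+}}{\sum_{a'} N_{a',+}}$$

where $N_{a,+}$ is the number of times letter a appears in CpG islands in the dataset, and

$$p^+(X_i = a \mid X_{i-1} = b) = \frac{N_{ba,+}}{\sum_{a'} N_{ba',+}}$$

where $N_{ba,+}$ is the number of times letter a appears immediately after letter b in CpG islands in the dataset. The estimates for the “-” model are obtained in the same way from $N_{a,-}$ and $N_{ba,-}$.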
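The counting estimates translate directly into code. The sketch below counts letters and adjacent pairs in the sequences carrying one label and normalizes the counts. The four labeled training sequences are made up for illustration, and the function names `mle_initial` and `mle_transitions` are assumptions introduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def mle_initial(seqs: List[str]) -> Dict[str, float]:
    """p(X_1 = a) = N_a / sum_a' N_a', with N_a the count of letter a."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(seq)
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()}


def mle_transitions(seqs: List[str]) -> Dict[str, Dict[str, float]]:
    """p(X_i = a | X_{i-1} = b) = N_ba / sum_a' N_ba'."""
    counts: Dict[str, Counter] = defaultdict(Counter)  # counts[b][a] = N_ba
    for seq in seqs:
        for b, a in zip(seq, seq[1:]):
            counts[b][a] += 1
    return {
        b: {a: n / sum(row.values()) for a, n in row.items()}
        for b, row in counts.items()
    }


# Hypothetical labeled data: tuples (x_1..x_L, h) with h in {"+", "-"}.
data = [("CGCG", "+"), ("CGGC", "+"), ("ATCA", "-"), ("TTAC", "-")]
plus_seqs = [seq for seq, h in data if h == "+"]

init_plus = mle_initial(plus_seqs)
trans_plus = mle_transitions(plus_seqs)
print(init_plus)
print(trans_plus)
```

Pairs that never occur in the training data get no entry here, which corresponds to an estimated probability of zero; with small datasets that can make the log-odds score undefined for sequences containing such a pair.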