Markov Models and HMM in Genome Analysis – Bernard PRUM – La genopole – Evry – France – prum@genopole.cnrs.fr – Journée Statistique et Santé, Paris 5 – 5 May 2006

Markov Models and HMM in Genome Analysis

Bernard PRUM

La genopole – Evry – France

prum@genopole.cnrs.fr

Journée Statistique et Santé, Paris 5 – 5 May 2006

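Why Markov Models?

A biological sequence:

$X = (X_1, X_2, \dots, X_n)$

where $X_k \in \mathcal{A} = \{t, c, a, g\}$ (nucleotides) or $\mathcal{A} = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$ (amino acids).

A very common tool for analyzing these sequences is the Markov model (MM):

$P(X_k = v \mid X_j,\ j < k) = P(X_k = v \mid X_{k-1}), \quad v \in \mathcal{A},$

denoted by $\pi(u, v)$ if $X_{k-1} = u$, for $u, v \in \mathcal{A}$.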

Why MM? – 2

Example:

[Figure: E. coli, the RecBCD complex, viruses, and the bacterium's own genome with its chi sites]

A complex called RecBCD protects the cell against viruses.

To avoid the destruction of the cell's own genome, a password, gctggtgg (called chi), is present along the genome. When RecBCD bumps into a chi, it stops its destruction. For this protection to be efficient, chi has to occur often: its number of occurrences is much higher than the number predicted by a Markov model.
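A minimal sketch of how an observed word count can be compared with the number predicted by a first-order Markov model. The function names (`fit_m1`, `expected_count`, `observed_count`) and the toy sequence are illustrative assumptions, not material from the talk. The expected count uses the simple approximation (n − ℓ + 1) · µ(w1) · ∏ π(wi, wi+1), with µ estimated by the letter frequencies.

```python
from collections import Counter

def fit_m1(seq):
    """Estimate letter frequencies mu and first-order transitions pi by counting."""
    letters = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    starts = Counter(seq[:-1])          # how often each letter has a successor
    mu = {u: c / len(seq) for u, c in letters.items()}
    pi = {(u, v): c / starts[u] for (u, v), c in pairs.items()}
    return mu, pi

def expected_count(word, n, mu, pi):
    """Approximate expected number of occurrences of `word` in n letters under M1."""
    p = mu.get(word[0], 0.0)
    for u, v in zip(word, word[1:]):
        p *= pi.get((u, v), 0.0)
    return (n - len(word) + 1) * p

def observed_count(word, seq):
    """Number of (possibly overlapping) occurrences of `word` in `seq`."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

# Toy sequence (an assumption for illustration, not a real genome)
seq = "gctggtgg" * 4 + "acgtacgtaatt" * 10
mu, pi = fit_m1(seq)
word = "gctggtgg"
print("observed:", observed_count(word, seq))
print("expected under M1:", expected_count(word, len(seq), mu, pi))
```

A word is a candidate "exceptional" word when the observed count is far from the expected one; assessing significance requires the variance of the count as well, which this sketch does not compute.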

Why MM? – 3

The "exceptional character" of a word depends on

• the frequency of each letter: model M0 (Bernoulli)

Example: HIV1

t 2164   c 1773   a 3410   g 2370

Why MM? – 3

The "exceptional character" of a word depends on

• the frequency of each letter: model M0 (Bernoulli),

• the frequency of 2-words tt, tc, ta, tg, …, gg: model M1 (Markov)

Example: HIV1 (rows: first letter; columns: second letter; last column: row total)

         t      c      a      g    total
t      548    342    684    590     2164
c      470    413    795     95     1773
a      713    561   1112   1024     3410
g      432    457    820    661     2370
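As a hedged illustration, the HIV1 counts can be turned into M0 letter frequencies and M1 transition probabilities by row normalisation. The counts are copied from the table; the variable and function names are assumptions made for this sketch.

```python
# 2-word counts for HIV1, copied from the table
# (rows: first letter, columns: second letter, both in the order t, c, a, g).
LETTERS = "tcag"
COUNTS = {
    "t": [548, 342, 684, 590],
    "c": [470, 413, 795, 95],
    "a": [713, 561, 1112, 1024],
    "g": [432, 457, 820, 661],
}

def transition_matrix(counts):
    """Row-normalise the 2-word counts: pi(u, v) = N(uv) / N(u.)."""
    pi = {}
    for u, row in counts.items():
        total = sum(row)
        pi[u] = {v: n / total for v, n in zip(LETTERS, row)}
    return pi

def letter_frequencies(counts):
    """M0 letter frequencies obtained from the row totals."""
    grand_total = sum(sum(row) for row in counts.values())
    return {u: sum(row) / grand_total for u, row in counts.items()}

pi = transition_matrix(COUNTS)
mu = letter_frequencies(COUNTS)
print(f"M0: P(g)                = {mu['g']:.3f}")
print(f"M1: P(g | previous = c) = {pi['c']['g']:.3f}")
```

With these counts, g has an overall frequency of about 0.24 but follows a c only about 5% of the time, which is the kind of dependence that M1 captures and M0 cannot.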

Results MM

Hidden Markov Models

An important criticism of Markov modelling is its stationarity: a well-known theorem says that, under weak conditions,

$P(X_k = u) \to \mu(u)$ (when $k \to \infty$)

(and the rate of convergence is exponential).

But biological sequences are not homogeneous.

There are g+c rich segments and g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters differ between coding parts and non-coding parts.

Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity? ⇒ annotation
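The convergence statement can be checked numerically on a small chain. This is only a sketch: the 2-letter transition matrix `PI` below is a hypothetical example, not a matrix estimated from a genome.

```python
def step(dist, pi):
    """One step of the chain: new(v) = sum_u dist(u) * pi[u][v]."""
    n = len(dist)
    return [sum(dist[u] * pi[u][v] for u in range(n)) for v in range(n)]

# Hypothetical 2-letter chain (illustrative values only)
PI = [[0.9, 0.1],
      [0.3, 0.7]]

dist = [1.0, 0.0]            # start in letter 0 with probability 1
history = [dist]
for _ in range(50):
    dist = step(dist, PI)
    history.append(dist)

# For a 2-state chain the stationary law is mu = (q, p) / (p + q)
p, q = PI[0][1], PI[1][0]
mu = [q / (p + q), p / (p + q)]

for k in (1, 5, 10, 20):
    print(k, abs(history[k][0] - mu[0]))
```

In this example the distance to µ is multiplied by 1 − p − q = 0.6 at each step, which is the exponential rate mentioned in the slide.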

HMM – 2

Suppose that d states alternate along the sequence.

In each state we have a Markov chain (MC):

if $S_k = 1$, then $P(X_k = v \mid X_{k-1} = u) = \pi_1(u, v)$

if $S_k = 2$, then $P(X_k = v \mid X_{k-1} = u) = \pi_2(u, v)$

and (more technical than biological, see HSMM)

$P(S_k = y \mid S_{k-1} = x) = \pi_0(x, y)$

[Diagram: successive segments along the sequence with $S_k = 1$, $S_k = 2$, $S_k = 1$, $S_k = 2$, $S_k = 1$]

Our objectives:

• Estimate the parameters $\pi_1$, $\pi_2$, $\pi_0$

• Allocate a state in {1, 2} to each position
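To make the model concrete, here is a small simulation sketch of a two-state chain of this kind. All numerical values, the helper `constant_rows` and the function `simulate` are assumptions chosen for illustration; states are coded 0 and 1 in the code and printed as 1 and 2.

```python
import random

LETTERS = "tcag"

def constant_rows(weights):
    """Build pi_s(u, v) whose rows all equal the given (normalised) letter weights."""
    total = sum(weights.values())
    return {u: {v: weights[v] / total for v in LETTERS} for u in LETTERS}

# Hypothetical parameters: state 0 is balanced, state 1 is g+c rich
PIS = [
    constant_rows({"t": 1, "c": 1, "a": 1, "g": 1}),
    constant_rows({"t": 1, "c": 3, "a": 1, "g": 3}),
]
PI0 = [[0.98, 0.02],
       [0.05, 0.95]]

def simulate(n, pi0, pis, seed=0):
    """Draw hidden states S and letters X of length n from the model."""
    rng = random.Random(seed)
    states = [rng.randrange(len(pi0))]
    seq = [rng.choice(LETTERS)]
    for _ in range(1, n):
        s = rng.choices(range(len(pi0)), weights=pi0[states[-1]])[0]
        row = pis[s][seq[-1]]    # law of the next letter given state and previous letter
        x = rng.choices(LETTERS, weights=[row[v] for v in LETTERS])[0]
        states.append(s)
        seq.append(x)
    return states, "".join(seq)

states, seq = simulate(60, PI0, PIS)
print("".join(str(s + 1) for s in states))
print(seq)
```

In a real analysis only the second printed line is observed; the first line is what the HMM machinery tries to recover.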

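HMM – 3

Use the likelihood!

$L(\theta) = \sum_{S_1 S_2 \dots S_n} \mu_0(S_1)\, \mu_{S_1}(X_1) \prod_{k=2}^{n} \pi_0(S_{k-1}, S_k)\, \pi_{S_k}(X_{k-1}, X_k)$

The product has n terms (the length of the sequence). The sum runs over all possibilities $S_1 S_2 \dots S_n$; there are $s^n$ terms, where s is the number of states.

With s = 2 and n = 10 000: $2^{10\,000} \approx 10^{3\,000}$. Despair!!!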
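The sum over all state paths is hopeless when computed term by term, but it can be reorganised by summing out one state at a time. The sketch below compares the two computations on a tiny example; the parameter values and function names are assumptions, and letters and states are coded as integers.

```python
import itertools

def brute_force_likelihood(x, mu0, mu, pi0, pi):
    """Sum over all s**n state paths (only feasible for very small n)."""
    s, n = len(mu0), len(x)
    total = 0.0
    for path in itertools.product(range(s), repeat=n):
        p = mu0[path[0]] * mu[path[0]][x[0]]
        for k in range(1, n):
            p *= pi0[path[k - 1]][path[k]] * pi[path[k]][x[k - 1]][x[k]]
        total += p
    return total

def forward_likelihood(x, mu0, mu, pi0, pi):
    """Same quantity in O(n * s**2) operations, summing out one state at a time."""
    s = len(mu0)
    alpha = [mu0[a] * mu[a][x[0]] for a in range(s)]
    for k in range(1, len(x)):
        alpha = [
            sum(alpha[a] * pi0[a][b] for a in range(s)) * pi[b][x[k - 1]][x[k]]
            for b in range(s)
        ]
    return sum(alpha)

# Hypothetical toy model: 2 states, 2 letters
MU0 = [0.5, 0.5]
MU = [[0.5, 0.5], [0.2, 0.8]]
PI0 = [[0.9, 0.1], [0.2, 0.8]]
PI = [[[0.7, 0.3], [0.4, 0.6]],
      [[0.1, 0.9], [0.5, 0.5]]]
x = [0, 1, 1, 0, 1, 0, 0, 1]

print(brute_force_likelihood(x, MU0, MU, PI0, PI))
print(forward_likelihood(x, MU0, MU, PI0, PI))
print("digits in 2**10000:", len(str(2 ** 10000)))
```

For long sequences the unnormalised recursion underflows, which is one reason to work with normalised (conditional) probabilities at each position.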

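back to HMM

We do not know where $S_k = 1$ and where $S_k = 2$.

[Diagram: successive segments along the sequence with $S_k = 1$, $S_k = 2$, $S_k = 1$, $S_k = 2$, $S_k = 1$]

Step n° 1

We make an arbitrary allocation of these states.

"Knowing" the states, it is straightforward to compute the parameters,

e.g. $\pi_1(c, g)$ = % of the c's that are followed by a g in the green parts,

$\pi_0(x, y)$ = % of the positions in state x that are followed by a position in state y (the states are drawn as coloured dots on the slide).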
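Step n° 1 amounts to counting. A minimal sketch, assuming the sequence and the allocated states are given as a string and a list of the same length; the function name `estimate_parameters` and the toy data are illustrative. A transition from $X_{k-1}$ to $X_k$ is attributed to the state $S_k$, in line with the model definition.

```python
from collections import defaultdict

def estimate_parameters(seq, states):
    """Count-based estimates of pi_s(u, v) and pi0(x, y) for a known allocation."""
    emit = defaultdict(lambda: defaultdict(int))    # (state, u) -> v -> count
    trans = defaultdict(lambda: defaultdict(int))   # x -> y -> count
    for k in range(1, len(seq)):
        emit[(states[k], seq[k - 1])][seq[k]] += 1
        trans[states[k - 1]][states[k]] += 1

    def normalise(table):
        # Turn each row of counts into proportions
        return {key: {v: c / sum(row.values()) for v, c in row.items()}
                for key, row in table.items()}

    return normalise(emit), normalise(trans)

# Toy data (an assumption for illustration)
seq = "cgcgacac"
states = [1, 1, 1, 1, 2, 2, 2, 2]
pis, pi0 = estimate_parameters(seq, states)
print(pis[(1, "c")])   # law of the letter following a c inside state 1
print(pi0[1])          # law of the state following state 1
```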

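Baum & Churchill formula

Step n° 2

Define, with $X_1^k = (X_1, \dots, X_k)$,

the predictive probability $\phi_k(v) = P(S_k = v \mid X_1^{k-1})$

the filtered probability $\psi_k(v) = P(S_k = v \mid X_1^{k})$

the estimated probability $\gamma_k(v) = P(S_k = v \mid X_1^{n})$

Forward (k = 2 to n), by Bayes' formula:

$\phi_k(v) = \sum_u \psi_{k-1}(u)\, \pi_0(u, v)$

$\psi_{k-1}(u) = \dfrac{\phi_{k-1}(u)\, \pi_u(X_{k-2}, X_{k-1})}{\sum_w \phi_{k-1}(w)\, \pi_w(X_{k-2}, X_{k-1})}$

Backward (k = n to 2), starting from $\gamma_n = \psi_n$:

$\gamma_{k-1}(u) = \psi_{k-1}(u) \sum_v \pi_0(u, v)\, \dfrac{\gamma_k(v)}{\phi_k(v)}$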
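The forward and backward recursions can be sketched as follows. This is an illustrative implementation under assumptions: states and letters are coded as integers from 0, the first position uses an initial state law `mu0` and first-letter laws `mu`, and the names `forward_backward`, `pred`, `filt` and `smooth` (predictive, filtered and estimated probabilities) are chosen here.

```python
def forward_backward(x, mu0, mu, pi0, pi):
    """Return the predictive, filtered and estimated state probabilities.

    x   : list of letter indices
    mu0 : initial state law; mu[s][a] : law of the first letter in state s
    pi0 : state transitions; pi[s][a][b] : letter transitions in state s
    """
    s, n = len(mu0), len(x)

    def normalise(w):
        total = sum(w)
        return [wi / total for wi in w]

    pred = [list(mu0)]
    filt = [normalise([mu0[u] * mu[u][x[0]] for u in range(s)])]
    # Forward pass: predict with pi0, then correct with the observed letter transition
    for k in range(1, n):
        p = [sum(filt[k - 1][u] * pi0[u][v] for u in range(s)) for v in range(s)]
        pred.append(p)
        filt.append(normalise([p[v] * pi[v][x[k - 1]][x[k]] for v in range(s)]))
    # Backward pass: reweight the filtered law by the ratio estimated / predictive
    smooth = [None] * n
    smooth[n - 1] = filt[n - 1]
    for k in range(n - 1, 0, -1):
        smooth[k - 1] = [
            filt[k - 1][u]
            * sum(pi0[u][v] * smooth[k][v] / pred[k][v] for v in range(s))
            for u in range(s)
        ]
    return pred, filt, smooth

# Hypothetical toy model: 2 states, 2 letters
MU0 = [0.5, 0.5]
MU = [[0.5, 0.5], [0.2, 0.8]]
PI0 = [[0.9, 0.1], [0.2, 0.8]]
PI = [[[0.7, 0.3], [0.4, 0.6]],
      [[0.1, 0.9], [0.5, 0.5]]]
x = [0, 1, 1, 0, 1, 0, 0, 1]

pred, filt, smooth = forward_backward(x, MU0, MU, PI0, PI)
for k, g in enumerate(smooth, start=1):
    print(k, [round(v, 3) for v in g])
```

Because every step is normalised, the quantities stay of order one whatever the sequence length, unlike the raw likelihood terms.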

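and $S_k$?

Bad idea: $S_k = \arg\max_v \gamma_k(v)$, where $\gamma_k(v) = P(S_k = v \mid X_1^n)$ is the estimated probability (*)

First good idea: keep the distribution $\gamma_k(v)$ [EM]

Second good idea: draw $S_k$ according to $\gamma_k(v)$ [SEM]

(*) except, maybe, at the end of the algorithm (freezing)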
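The difference between the hard allocation and the random draw can be sketched in a few lines. The input `gamma` stands for the estimated probabilities P(S_k = v | X_1^n), one list per position; the function name `allocate` and the numbers are assumptions, and the draw is made position by position as stated on the slide.

```python
import random

def allocate(gamma, mode, rng=None):
    """Allocate one state per position from the per-position state laws `gamma`."""
    if mode == "argmax":     # hard decision, suggested only for the final "freezing"
        return [max(range(len(g)), key=g.__getitem__) for g in gamma]
    if mode == "sem":        # stochastic allocation: draw each state from its law
        rng = rng or random.Random(0)
        return [rng.choices(range(len(g)), weights=g)[0] for g in gamma]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical estimated probabilities for 5 positions and 2 states
gamma = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]

print("argmax:", allocate(gamma, "argmax"))
print("SEM   :", allocate(gamma, "sem"))
# EM keeps gamma itself and uses it as weights, e.g. expected time spent in each state:
print("EM occupancy:", [sum(g[v] for g in gamma) for v in range(2)])
```

The hard decision discards the uncertainty at positions such as the third one, where both states are equally likely; EM and SEM retain it, either as weights or through randomness.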

Annotation

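SHMM

New defect: the transition between states is described by a Markov model

$\pi_0 = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}$

Consequence: the lengths of the segments of 'a given colour' are r.v. with an exponential-type (geometric) law, e.g. $P(\text{length} = h) = p\,(1 - p)^{h-1}$ for the first state.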
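A quick simulation illustrates this consequence. It is a sketch with an assumed leaving probability p = 0.2; the function name `segment_lengths` is hypothetical.

```python
import random
from collections import Counter

def segment_lengths(p, n_segments, seed=0):
    """Lengths of visits to a state that is left with probability p at each step."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_segments):
        h = 1
        while rng.random() >= p:     # stay in the state with probability 1 - p
            h += 1
        lengths.append(h)
    return lengths

p = 0.2                               # assumed value, for illustration only
lengths = segment_lengths(p, 20000)
counts = Counter(lengths)
print("mean length:", sum(lengths) / len(lengths), "(theory: 1/p =", 1 / p, ")")
for h in range(1, 6):
    print(h, counts[h] / len(lengths), p * (1 - p) ** (h - 1))
```

The most frequent length is always 1 and the frequencies decrease geometrically, whatever the value of p.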

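SHMM

This does not correspond to reality!! Histograms of 'biological segments' (after smoothing) look more like

[Figure: density g(h) plotted against the segment length h]

It is easy to make the probability of leaving the state depend on h, the time already spent in it, so as to get the suitable law:

$p(h) = \dfrac{g(h)}{1 - G(h)}$

where G is the cumulative distribution function associated with g.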
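A minimal sketch of this construction for discrete lengths h = 1, 2, …. It assumes that 1 − G(h) is read as P(length ≥ h), so that p(h) is the probability of leaving at length h given that the segment has lasted that long; the example law `g` and the function names are hypothetical.

```python
def hazard(g):
    """g[h-1] = P(length = h) for h = 1..H. Returns p(h) = g(h) / P(length >= h)."""
    p, survival = [], 1.0
    for gh in g:
        p.append(gh / survival if survival > 0 else 1.0)
        survival -= gh
    return p

def law_from_hazard(p):
    """Rebuild the length law implied by the leaving probabilities p(h)."""
    g, stay = [], 1.0
    for ph in p:
        g.append(stay * ph)      # still in the state at length h, then leave
        stay *= 1.0 - ph
    return g

# Hypothetical bump-shaped length law on h = 1..5
g = [0.1, 0.2, 0.4, 0.2, 0.1]
p = hazard(g)
print("p(h):", [round(v, 3) for v in p])
print("rebuilt g:", [round(v, 3) for v in law_from_hazard(p)])
```

A constant p(h) gives back the geometric law of the plain HMM; any other shape of g is obtained by letting p vary with h.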

Searching nucleosomes

What are nucleosomes?

In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds round "beads", forming a necklace:

[Figure: chromatin fibre, scale bar 10 nm]

Each bead is called a nucleosome. Its core is a complex of 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is roughly equal to the weight of the DNA.

Nucleosomes

Curvature within curvature

Back to one nucleosome: the DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made of about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position ("in" vs "out"), the curvature satisfies different constraints.

Bendability

Following an idea of Baldi, Lavery, ..., we introduce an index of bendability; it depends on the succession of 2, 3, 4, ... nucleotides (di-, tri-, tetra-nucleotides, ...).

[Figure: DNA fragment with labelled bases a, a, g, c, t, t, shown twice]

PNUC table

There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...).

We used PNUC-3 (*):

                        2nd letter
1st letter      a       t       g       c      3rd letter
    a          0.0     2.8     3.3     5.2         a
    t          7.3     7.3    10.0    10.0         a
    g          3.0     6.4     6.2     7.5         a
    c          3.3     2.2     8.3     5.4         a

    a          0.7     0.7     5.8     5.8         t
    t          2.8     0.0     5.2     3.3         t
    g          5.3     3.7     5.4     7.5         t
    c          6.7     5.2     5.4     5.4         t

    a          5.2     6.7     5.4     5.4         g
    t          2.2     3.3     5.4     8.3         g
    g          5.4     6.5     6.0     7.5         g
    c          4.2     4.2     4.7     4.7         g

    a          3.7     5.3     7.5     5.4         c
    t          6.4     3.0     7.5     6.2         c
    g          5.6     5.6     8.2     8.2         c
    c          6.5     5.4     7.5     6.0         c

Examples: PNUC(cga) = 8.3 and PNUC(tcg) = 8.3 (a trinucleotide and its reverse complement have the same value).

(*) Goodsell, Dickerson, NAR 22 (1994)
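As an illustration of how such a table can be used, the sketch below stores the PNUC-3 values from the slide and computes a bendability profile along a sequence, one value per trinucleotide. The data layout (`ROWS`, keyed by first and third letter) and the function names are assumptions made for this example.

```python
# PNUC-3 values copied from the slide.
# Key: (1st letter, 3rd letter); values: 2nd letter in the order a, t, g, c.
SECOND = "atgc"
ROWS = {
    ("a", "a"): (0.0, 2.8, 3.3, 5.2), ("t", "a"): (7.3, 7.3, 10.0, 10.0),
    ("g", "a"): (3.0, 6.4, 6.2, 7.5), ("c", "a"): (3.3, 2.2, 8.3, 5.4),
    ("a", "t"): (0.7, 0.7, 5.8, 5.8), ("t", "t"): (2.8, 0.0, 5.2, 3.3),
    ("g", "t"): (5.3, 3.7, 5.4, 7.5), ("c", "t"): (6.7, 5.2, 5.4, 5.4),
    ("a", "g"): (5.2, 6.7, 5.4, 5.4), ("t", "g"): (2.2, 3.3, 5.4, 8.3),
    ("g", "g"): (5.4, 6.5, 6.0, 7.5), ("c", "g"): (4.2, 4.2, 4.7, 4.7),
    ("a", "c"): (3.7, 5.3, 7.5, 5.4), ("t", "c"): (6.4, 3.0, 7.5, 6.2),
    ("g", "c"): (5.6, 5.6, 8.2, 8.2), ("c", "c"): (6.5, 5.4, 7.5, 6.0),
}
PNUC = {
    first + second + third: value
    for (first, third), row in ROWS.items()
    for second, value in zip(SECOND, row)
}

def revcomp(word):
    """Reverse complement of a lower-case DNA word."""
    return word[::-1].translate(str.maketrans("acgt", "tgca"))

def bendability_profile(seq):
    """PNUC value of each overlapping trinucleotide along `seq`."""
    return [PNUC[seq[i:i + 3]] for i in range(len(seq) - 2)]

print(PNUC["cga"], PNUC["tcg"])
print(bendability_profile("aaatcgatgc"))
```

The profile has len(seq) − 2 values; in practice it would usually be smoothed over a window before being compared with nucleosome positions.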

HMM for nucleosomes

Better: consider a different state for "each" position in the nucleosome (say 146 states)

[Diagram: chain of states 1 → 2 → 3 → … → 145 → 146]

and the repetition of r (say 4) identical states for the between-nucleosome (b.n.) regions (= spacer). These brother states give the law of the length of the b.n. regions a Gamma form, which is not geometric!!
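The effect of the r brother states can be checked numerically: the time spent in r identical states visited in series is a sum of r geometric variables, whose law is the negative binomial, the discrete analogue of a Gamma law. The sketch below assumes each brother state is left with the same probability p; the values p = 0.3 and r = 4 and the function names are illustrative.

```python
from math import comb

def geometric_pmf(p, h_max):
    """pmf[h] = P(length = h) for one state left with probability p (pmf[0] = 0)."""
    return [0.0] + [p * (1 - p) ** (h - 1) for h in range(1, h_max + 1)]

def convolve(a, b, h_max):
    """Law of the sum of two independent lengths, truncated at h_max."""
    return [sum(a[i] * b[h - i] for i in range(h + 1)) for h in range(h_max + 1)]

def serial_states_pmf(p, r, h_max):
    """Law of the total time spent in r identical states visited one after another."""
    one = geometric_pmf(p, h_max)
    pmf = one
    for _ in range(r - 1):
        pmf = convolve(pmf, one, h_max)
    return pmf

def negative_binomial_pmf(p, r, h):
    """Closed form for the sum of r geometric lengths."""
    return comb(h - 1, r - 1) * p ** r * (1 - p) ** (h - r) if h >= r else 0.0

p, r, h_max = 0.3, 4, 40
pmf = serial_states_pmf(p, r, h_max)
for h in (1, 4, 8, 12, 20, 30):
    print(h, round(pmf[h], 5), round(negative_binomial_pmf(p, r, h), 5))
```

Unlike the geometric law, this distribution gives zero probability to lengths below r and has a bump, which is closer to the histograms of spacer lengths.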

A no-nuc state

[Diagram: states "no-nuc" and "spacer", followed by the nucleosome core states 1, 2, …, 70]

Trifonov (99) as well as Rando (05) underline that there are 'no' nucleosomes in the gene promoters (accessibility). Hence a "no-nucleosome" state is introduced "before" the nucleosome.

Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999)
Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)

Acknowledgements

Labo «Statistique et Génome»: Maurice BAUDRY, Etienne BIRMELE, Cécile COT, Mark HOEBEKE, Mickael GUEDJ, François KÉPÈS, Sophie LEBRE, Catherine MATIAS, Vincent MIELE, Florence MURI-MAJOUBE, Grégory NUEL, Hugues RICHARD, Anne-Sophie TOCQUET, Nicolas VERGNE

Sec: Michèle ILBERT

Labo MIG – INRA: Philippe BESSIÈRE, François RODOLPHE, Sophie SCHBATH, Élisabeth de TURCKHEIM

Labo AGRO: Jean-Noël BACRO, Jean-Jacques DAUDIN, Stéphane ROBIN

Lab' Rouen: Dominique CELLIER, Sabine MERCIER