Markov Models and HMM in Genome Analysis Bernard PRUM La genopole – Evry – France [email protected] Journée Statistique et Santé Paris 5 – 5 mai 2006


Why Markov Models ?

A biological sequence :

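$X = (X_1, X_2, \dots, X_n)$

where $X_k \in \mathcal{A}$, with $\mathcal{A} = \{t, c, a, g\}$ (nucleotides) or $\mathcal{A} = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$ (amino acids)

A very common tool for analyzing these sequences is the Markov Model (MM):

$P(X_k = v \mid X_j, j < k) = P(X_k = v \mid X_{k-1}) \quad \forall u, v \in \mathcal{A}$

denoted by $\pi(u, v)$ if $X_{k-1} = u$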
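As a minimal sketch of how $\pi(u, v)$ can be estimated in practice, the code below counts 2-words in a sequence and normalises each row. The function name `estimate_transitions` and the toy sequence are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def estimate_transitions(seq, alphabet="tcag"):
    """Estimate pi(u, v) = P(X_k = v | X_{k-1} = u) by counting 2-words uv."""
    pair_counts = Counter(zip(seq, seq[1:]))
    pi = {}
    for u in alphabet:
        row_total = sum(pair_counts[(u, v)] for v in alphabet)
        # A letter never seen as a predecessor gets an all-zero row.
        pi[u] = {v: (pair_counts[(u, v)] / row_total if row_total else 0.0)
                 for v in alphabet}
    return pi

toy = "tcagtcaagtccgatta"  # made-up sequence, for illustration only
pi = estimate_transitions(toy)
for u in "tcag":
    print(u, {v: round(p, 2) for v, p in pi[u].items()})
```

Each row of the estimated matrix sums to 1 as long as the letter occurs somewhere before the last position.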

Why MM ? – 2

Example: E. coli and RecBCD

[Figure: E. coli cell with the RecBCD complex, viruses, and the bacterium's own genome carrying chi sites]

A complex, called RecBCD, protects the cell against viruses by destroying their DNA.

To avoid the destruction of the cell's own genome, a password, gctggtgg (called chi), exists along the genome. When RecBCD bumps into a chi, it stops its destruction. For this protection to be efficient, the number of occurrences of chi is much higher than the number predicted by a Markov model.
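To make "the number predicted by a Markov model" concrete, here is a hedged sketch that compares the observed count of a word with its expected count under an order-1 model. It uses the usual plug-in estimate N(w1w2) N(w2w3) ... / (N(w2) N(w3) ...), which is an assumption here since the slides do not give the formula; the function names are hypothetical.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    size = len(word)
    return sum(1 for i in range(len(seq) - size + 1) if seq[i:i + size] == word)

def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of word under model M1.

    Product of the 2-word counts along the word, divided by the counts
    of the interior letters.
    """
    expected = float(count_overlapping(seq, word[:2]))
    for i in range(1, len(word) - 1):
        letter_count = count_overlapping(seq, word[i])
        if letter_count == 0:
            return 0.0
        expected *= count_overlapping(seq, word[i:i + 2]) / letter_count
    return expected

# Made-up sequence in which chi has been planted three times.
seq = "gctggtgg" * 3 + "acgt" * 5
chi = "gctggtgg"
print("observed:", count_overlapping(seq, chi))
print("expected under M1:", round(expected_count_m1(seq, chi), 3))
```

A word is a candidate for being "exceptional" when the observed count is far from the expected one; assessing how far is far requires the variance as well, which this sketch does not compute.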

Why MM ? – 3

The "exceptional character" of a word depends on

• the frequency of each letter: model M0 (Bernoulli)

Example: HIV1, letter counts

t   2164
c   1773
a   3410
g   2370


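• the frequency of 2-words tt, tc, ta, tg, …, gg: model M1 (Markov)

Example: HIV1, counts of 2-words (rows: first letter; columns: second letter; last column: row total)

        t      c      a      g    total
t     548    342    684    590     2164
c     470    413    795     95     1773
a     713    561   1112   1024     3410
g     432    457    820    661     2370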
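The HIV1 counts above are enough to see why M1 matters. The sketch below, written under the assumption that the row totals are used as letter counts, estimates $\pi(u, v)$ as N(uv) / N(u) and compares the observed count of cg with what independent letters (M0) would give, N(c) N(g) / n. Variable and function names are illustrative.

```python
letters = "tcag"
# 2-word counts N(uv) from the HIV1 table: key = first letter,
# list = counts for second letter t, c, a, g.
counts = {
    "t": [548, 342, 684, 590],
    "c": [470, 413, 795, 95],
    "a": [713, 561, 1112, 1024],
    "g": [432, 457, 820, 661],
}
row_totals = {u: sum(row) for u, row in counts.items()}
n = sum(row_totals.values())

# M1: transition estimates pi(u, v) = N(uv) / N(u)
pi = {u: {v: counts[u][j] / row_totals[u] for j, v in enumerate(letters)}
      for u in letters}

def expected_m0(u, v):
    """Expected count of the 2-word uv if letters were independent (M0)."""
    return row_totals[u] * row_totals[v] / n

print(f"pi(c, g) = {pi['c']['g']:.3f}")
print(f"cg: observed {counts['c'][3]}, expected under M0 {expected_m0('c', 'g'):.0f}")
```

The 2-word cg is observed 95 times, far fewer than the roughly 430 occurrences M0 would predict, so a word containing cg looks very different under M0 and under M1.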

Results MM

Hidden Markov Models

An important criticism against Markov modeling is its stationarity: a well-known theorem says that, under weak conditions,

$P(X_k = u) \to \mu(u)$ when $k \to \infty$

(and the rate of convergence is exponential).
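The convergence can be watched numerically. This is a small sketch on a made-up two-letter chain (the matrix is an assumption chosen so that the limit is (0.75, 0.25)); it iterates the law of $X_k$ and records the distance to the limit.

```python
def step(dist, pi):
    """One step of the chain: new(v) = sum_u dist(u) * pi[u][v]."""
    d = len(dist)
    return [sum(dist[u] * pi[u][v] for u in range(d)) for v in range(d)]

pi = [[0.9, 0.1],
      [0.3, 0.7]]      # made-up transition matrix
dist = [1.0, 0.0]      # the chain starts in the first letter
gaps = []
for k in range(30):
    dist = step(dist, pi)
    gaps.append(abs(dist[0] - 0.75))   # distance to the stationary value

print([round(x, 4) for x in dist])
print("ratio of successive gaps:", round(gaps[1] / gaps[0], 3))
```

For this matrix the gap is multiplied by 0.6 at every step, which is the exponential rate mentioned above.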

But biological sequences are not homogeneous.

There are g+c rich segments / g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts.

Is it possible to take advantage of this problem and develop a tool for the analysis of heterogeneity? ⇒ annotation

HMM – 2

Suppose that d states alternate along the sequence

And in each state we have a Markov chain (MC):

if $S_k = 1$, then $P(X_k = v \mid X_{k-1} = u) = \pi_1(u, v)$

if $S_k = 2$, then $P(X_k = v \mid X_{k-1} = u) = \pi_2(u, v)$

and (more technical than biological, see HSMM)

$P(S_k = y \mid S_{k-1} = x) = \pi_0(x, y)$

[Figure: a sequence cut into segments alternating between $S_k = 1$ and $S_k = 2$]

Our objectives

• Estimate the parameters $\pi_1$, $\pi_2$, $\pi_0$

• Allocate a state in {1, 2} to each position
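To fix ideas, the model can be simulated. The sketch below is an illustration under stated assumptions: states are numbered 0 and 1 instead of 1 and 2, the first state and letter are drawn uniformly, and all numerical parameters are made up (state 1 is g+c rich).

```python
import random

def simulate_hmm(n, pi0, pis, alphabet="tcag", seed=0):
    """Simulate hidden states S and letters X.

    pi0[x][y]     : probability of moving from state x to state y
    pis[s][u][v]  : P(X_k = v | X_{k-1} = u) when S_k = s (u, v as indices)
    """
    rng = random.Random(seed)
    symbols = list(alphabet)
    d = len(pi0)
    states = [rng.randrange(d)]          # assumption: uniform first state
    letters = [rng.choice(symbols)]      # assumption: uniform first letter
    for _ in range(1, n):
        s = rng.choices(range(d), weights=pi0[states[-1]])[0]
        prev = symbols.index(letters[-1])
        x = rng.choices(symbols, weights=pis[s][prev])[0]
        states.append(s)
        letters.append(x)
    return states, "".join(letters)

uniform = [0.25, 0.25, 0.25, 0.25]
gc_rich = [0.1, 0.4, 0.1, 0.4]           # order t, c, a, g
pi0 = [[0.95, 0.05], [0.10, 0.90]]
pis = [[uniform for _ in range(4)], [gc_rich for _ in range(4)]]

states, seq = simulate_hmm(60, pi0, pis)
print(seq)
print("".join(str(s) for s in states))
```

In real data only the first printed line is observed; the second one is what the annotation has to recover.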

HMM – 3

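Use the likelihood!

$L(\theta) = \sum_{S_1 S_2 \dots S_n} \mu_0(S_1)\, \mu_{S_1}(X_1) \prod_{k=2}^{n} \pi_0(S_{k-1}, S_k)\, \pi_{S_k}(X_{k-1}, X_k)$

Each product has n terms (the length of the sequence). The sum runs over all possibilities $S_1 S_2 \dots S_n$; with d states there are $d^n$ terms.

For 2 states and n = 10 000: $2^{10\,000} \approx 10^{3\,000}$. Despair!!!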
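The sum is hopeless as written but not in itself: summing state by state, position after position, gives the same number in about n d² operations. The sketch below checks this on a short made-up sequence with a two-letter alphabet. The forward recursion used here is the standard one and is an addition to the slides; all parameter values are assumptions.

```python
from itertools import product

def likelihood_brute_force(x, mu0, mu, pi0, pis):
    """Sum over all d**n state paths, exactly as in the formula."""
    d, n = len(mu0), len(x)
    total = 0.0
    for path in product(range(d), repeat=n):
        p = mu0[path[0]] * mu[path[0]][x[0]]
        for k in range(1, n):
            p *= pi0[path[k - 1]][path[k]] * pis[path[k]][x[k - 1]][x[k]]
        total += p
    return total

def likelihood_forward(x, mu0, mu, pi0, pis):
    """Same quantity, carrying f[v] = P(x_1..x_k, S_k = v) along the sequence."""
    d = len(mu0)
    f = [mu0[s] * mu[s][x[0]] for s in range(d)]
    for k in range(1, len(x)):
        f = [sum(f[u] * pi0[u][v] for u in range(d)) * pis[v][x[k - 1]][x[k]]
             for v in range(d)]
    return sum(f)

# Made-up parameters: 2 states, 2 letters coded 0 and 1.
mu0 = [0.5, 0.5]
mu = [[0.5, 0.5], [0.5, 0.5]]
pi0 = [[0.9, 0.1], [0.2, 0.8]]
pis = [[[0.8, 0.2], [0.3, 0.7]],
       [[0.4, 0.6], [0.5, 0.5]]]
x = [0, 1, 1, 0, 0, 1, 0, 1]

print(likelihood_brute_force(x, mu0, mu, pi0, pis))
print(likelihood_forward(x, mu0, mu, pi0, pis))
```

The brute-force version already needs 2⁸ = 256 paths for 8 letters; the recursive version scales linearly with the length of the sequence.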

back to HMM

We do not know where $S_k = 1$ and where $S_k = 2$.

[Figure: the sequence with its unknown segmentation into $S_k = 1$ and $S_k = 2$]

Step 1

We make an arbitrary allocation of these states.

"Knowing" the states, it is straightforward to compute the parameters.

Example: $\pi_1(c, g)$ = % of the c which are followed by a g in the green parts (state 1).

$\pi_0(x, y)$ = % of the positions in state x which are followed by a position in state y.
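Once the states are "known", the estimation of Step 1 is pure counting, as in this minimal sketch. The function name is hypothetical, and one convention is assumed: a transition from $X_{k-1}$ to $X_k$ is attributed to the state of position k, in line with the definition of $\pi_1$ and $\pi_2$.

```python
from collections import defaultdict

def estimate_from_allocation(x, s):
    """Count-based estimates of pi_state(u, v) and pi0(x, y) given the states s."""
    letter_counts = defaultdict(lambda: defaultdict(int))  # (state, u) -> v -> count
    state_counts = defaultdict(lambda: defaultdict(int))   # state -> next state -> count
    for k in range(1, len(x)):
        letter_counts[(s[k], x[k - 1])][x[k]] += 1
        state_counts[s[k - 1]][s[k]] += 1

    def normalise(table):
        # Turn each row of counts into proportions.
        return {key: {v: c / sum(row.values()) for v, c in row.items()}
                for key, row in table.items()}

    return normalise(letter_counts), normalise(state_counts)

x = "cgcgcaatta"                      # made-up sequence
s = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]    # arbitrary allocation of states
pis, pi0 = estimate_from_allocation(x, s)
print("pi_1(c, g) =", pis[(1, "c")]["g"])
print("pi_0(1, 2) =", pi0[1][2])
```

Transitions that never occur are simply absent from the returned dictionaries; a real implementation would probably add pseudo-counts to avoid zero probabilities.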

Baum & Churchill formula

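Step 2

Define, with $X_1^k = (X_1, \dots, X_k)$,

the predictive probability $\phi_k(v) = P(S_k = v \mid X_1^{k-1})$

the filtered probability $\psi_k(v) = P(S_k = v \mid X_1^{k})$

the estimated probability $\gamma_k(v) = P(S_k = v \mid X_1^{n})$

Bayes formula

Forward (k = 2 to n):

$\phi_k(v) = \sum_u \psi_{k-1}(u)\, \pi_0(u, v)$

$\psi_{k-1}(u) = \dfrac{\phi_{k-1}(u)\, \pi_u(X_{k-2}, X_{k-1})}{\sum_w \phi_{k-1}(w)\, \pi_w(X_{k-2}, X_{k-1})}$

Backward (k = n to 2):

$\gamma_{k-1}(u) = \psi_{k-1}(u) \sum_v \pi_0(u, v)\, \dfrac{\gamma_k(v)}{\phi_k(v)}$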
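The forward and backward passes can be sketched as follows. This is an illustration under assumptions, not the author's code: positions are 0-based, the first letter is handled with an initial letter law `mu` (the recursion itself needs some convention for $X_1$), all entries of `pi0` are assumed positive, and the toy parameters are made up.

```python
def forward_backward(x, mu0, mu, pi0, pis):
    """Return predictive, filtered and estimated (smoothed) state probabilities.

    pred[k][v]   = P(S_k = v | x_0..x_{k-1})
    filt[k][v]   = P(S_k = v | x_0..x_k)
    smooth[k][v] = P(S_k = v | x_0..x_{n-1})
    """
    d, n = len(mu0), len(x)
    pred, filt = [list(mu0)], []
    for k in range(n):                      # forward pass
        if k > 0:
            pred.append([sum(filt[k - 1][u] * pi0[u][v] for u in range(d))
                         for v in range(d)])
        emit = [mu[v][x[0]] if k == 0 else pis[v][x[k - 1]][x[k]]
                for v in range(d)]
        weights = [pred[k][v] * emit[v] for v in range(d)]
        total = sum(weights)
        filt.append([w / total for w in weights])
    smooth = [None] * n                     # backward pass
    smooth[n - 1] = filt[n - 1]
    for k in range(n - 1, 0, -1):
        smooth[k - 1] = [
            filt[k - 1][u] * sum(pi0[u][v] * smooth[k][v] / pred[k][v]
                                 for v in range(d))
            for u in range(d)]
    return pred, filt, smooth

# Made-up parameters: 2 states, 2 letters coded 0 and 1.
mu0 = [0.5, 0.5]
mu = [[0.5, 0.5], [0.5, 0.5]]
pi0 = [[0.9, 0.1], [0.2, 0.8]]
pis = [[[0.8, 0.2], [0.3, 0.7]],
       [[0.4, 0.6], [0.5, 0.5]]]
x = [0, 0, 0, 1, 1, 0, 1, 1]

pred, filt, smooth = forward_backward(x, mu0, mu, pi0, pis)
for k, probs in enumerate(smooth):
    print(k, [round(p, 3) for p in probs])
```

At the last position the estimated and filtered probabilities coincide, since both condition on the whole sequence.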

and $S_k$ ?

Bad idea: $S_k = \arg\max_v P(S_k = v \mid X_1^n)$ (*)

First good idea: keep the whole distribution $P(S_k = v \mid X_1^n)$ [EM]

Second good idea: draw $S_k$ according to $P(S_k = v \mid X_1^n)$ [SEM]

(*) except, maybe, at the end of the algorithm (freezing)
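The difference between the "bad idea" and the SEM draw fits in a few lines. In this sketch `smooth` stands for a list of estimated state distributions, one per position; the numbers are made up and the function name `allocate` is hypothetical.

```python
import random

def allocate(smooth, rng, mode="sem"):
    """Allocate a state to each position from its estimated distribution."""
    if mode == "argmax":
        # Hard decision: most probable state at each position.
        return [max(range(len(p)), key=lambda v: p[v]) for p in smooth]
    # SEM: draw each state at random according to its distribution.
    return [rng.choices(range(len(p)), weights=p)[0] for p in smooth]

smooth = [[0.9, 0.1], [0.4, 0.6], [1.0, 0.0]]   # made-up distributions
rng = random.Random(1)
print("argmax:", allocate(smooth, rng, mode="argmax"))
print("SEM draws:", [allocate(smooth, rng) for _ in range(3)])
```

The hard decision always returns the same allocation, whereas successive SEM draws keep exploring the positions where the distribution is not concentrated.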

Annotation

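SHMM

New defect: the transition between states is described using a Markov model

$\pi_0 = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}$

Consequence: the lengths of the segments of 'a given colour' are random variables with an exponential (geometric) law: a segment in state 1 has length h with probability $(1 - p)^{h-1} p$.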
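A quick numerical look at this consequence, as a sketch with a made-up value of p: when the state is left with the same probability p at every step, the most likely segment length is always 1 and the mean length is 1/p.

```python
def segment_length_pmf(p, h_max):
    """P(length = h), h = 1..h_max, when the state is left with probability p at each step."""
    return [(1 - p) ** (h - 1) * p for h in range(1, h_max + 1)]

p = 0.1                                   # made-up leaving probability
pmf = segment_length_pmf(p, 2000)
mean = sum(h * prob for h, prob in enumerate(pmf, start=1))
print("P(length = 1..5):", [round(x, 4) for x in pmf[:5]])
print("mean length:", round(mean, 3))
```

The probabilities decrease from h = 1 onwards, whatever the value of p.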

SHMM

This does not correspond to reality!! Histograms of 'biological segments' (after smoothing) look more like

[Figure: density g(h) plotted against the segment length h]

It is easy to make the probability of leaving the state depend on h, to get the suitable law:

$p(h) = \dfrac{g(h)}{1 - G(h)}$

where G is the cumulative distribution function of g.
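A sketch of this construction for integer lengths, under one stated assumption: in discrete time the denominator is taken as P(H ≥ h), that is 1 minus the probability of all strictly shorter lengths, so that p(h) is the probability of leaving after exactly h steps given that the segment has lasted h steps. The function names and the target law are illustrative.

```python
def leaving_probabilities(g):
    """g[h-1] = P(H = h) for h = 1..len(g). Returns p(h) = P(H = h | H >= h)."""
    p, survival = [], 1.0          # survival = P(H >= h)
    for gh in g:
        p.append(min(1.0, gh / survival) if survival > 0 else 1.0)
        survival -= gh
    return p

def length_law(p):
    """Inverse map: law of the length when leaving with probability p(h) after h steps."""
    g, stay = [], 1.0              # stay = probability of having stayed so far
    for ph in p:
        g.append(stay * ph)
        stay *= 1 - ph
    return g

g = [0.1, 0.2, 0.4, 0.2, 0.1]      # made-up bell-shaped length law
p = leaving_probabilities(g)
print("p(h):", [round(v, 3) for v in p])
print("recovered g(h):", [round(v, 3) for v in length_law(p)])
```

For a geometric law the same function returns a constant p(h), which is the homogeneous HMM of the previous slide.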

Searching nucleosomes


In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds round "beads", forming a necklace.

Each bead is called a nucleosome. Its core is a complex of 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). The DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is approximately equal to the weight of the DNA.

[Figure: chromatin, scale bar 10 nm]

What are nucleosomes ?

Nucleosomes

Curvature within curvature



Back to one nucleosome: the DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made of about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position ("in" vs "out"), the curvature satisfies different constraints.

Bendability

Following an idea (Baldi, Lavery, ...) we introduce an index of bendability; it depends on the succession of 2, 3, 4, ... nucleotides (di-, tri-, tetra-nucleotides).


PNUC table (rows: 1st letter; columns: 2nd letter; last column: 3rd letter)

1st          2nd letter             3rd
letter    a      t      g      c    letter
a        0.0    2.8    3.3    5.2   a
t        7.3    7.3   10.0   10.0   a
g        3.0    6.4    6.2    7.5   a
c        3.3    2.2    8.3    5.4   a

a        0.7    0.7    5.8    5.8   t
t        2.8    0.0    5.2    3.3   t
g        5.3    3.7    5.4    7.5   t
c        6.7    5.2    5.4    5.4   t

a        5.2    6.7    5.4    5.4   g
t        2.2    3.3    5.4    8.3   g
g        5.4    6.5    6.0    7.5   g
c        4.2    4.2    4.7    4.7   g

a        3.7    5.3    7.5    5.4   c
t        6.4    3.0    7.5    6.2   c
g        5.6    5.6    8.2    8.2   c
c        6.5    5.4    7.5    6.0   c

Example: PNUC(cga) = 8.3

There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...).

We used PNUC-3 (*):

(*) Goodsell, Dickerson, NAR 22 (1994)

PNUC(tcg) = 8.3
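As a sketch of how such a table is used, the code below stores the PNUC values and slides a window of 3 nucleotides along a sequence. The values are read from the table of these slides, assuming rows give the 1st letter, columns the 2nd letter (a, t, g, c) and blocks the 3rd letter; this reading reproduces PNUC(cga) = PNUC(tcg) = 8.3. The function name `bendability_profile` is hypothetical.

```python
LETTERS = "atgc"
# ROWS[third][first] = values for second letter a, t, g, c
ROWS = {
    "a": {"a": [0.0, 2.8, 3.3, 5.2], "t": [7.3, 7.3, 10.0, 10.0],
          "g": [3.0, 6.4, 6.2, 7.5], "c": [3.3, 2.2, 8.3, 5.4]},
    "t": {"a": [0.7, 0.7, 5.8, 5.8], "t": [2.8, 0.0, 5.2, 3.3],
          "g": [5.3, 3.7, 5.4, 7.5], "c": [6.7, 5.2, 5.4, 5.4]},
    "g": {"a": [5.2, 6.7, 5.4, 5.4], "t": [2.2, 3.3, 5.4, 8.3],
          "g": [5.4, 6.5, 6.0, 7.5], "c": [4.2, 4.2, 4.7, 4.7]},
    "c": {"a": [3.7, 5.3, 7.5, 5.4], "t": [6.4, 3.0, 7.5, 6.2],
          "g": [5.6, 5.6, 8.2, 8.2], "c": [6.5, 5.4, 7.5, 6.0]},
}
# Flatten into a dictionary indexed by the tri-nucleotide itself.
PNUC = {first + second + third: ROWS[third][first][j]
        for third in LETTERS for first in LETTERS
        for j, second in enumerate(LETTERS)}

def bendability_profile(seq):
    """PNUC value of every tri-nucleotide along seq (overlapping windows)."""
    return [PNUC[seq[i:i + 3]] for i in range(len(seq) - 2)]

print(PNUC["cga"], PNUC["tcg"])
print(bendability_profile("aaatcgat"))
```

A tri-nucleotide and its reverse complement receive the same value, as expected for a property of the double helix.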

HMM for nucleosomes

Better: consider a different state for "each" position in the nucleosome (say 146 states)

and the repetition of r (say 4) identical states for the between-nucleosome regions (= spacer). These brother states give the law of the length of the between-nucleosome regions a Gamma form, which is not geometric!!

[Figure: chain of nucleosome states 1, 2, 3, ..., 145, 146]
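The effect of the brother states can be checked numerically. In this sketch each of the r states is assumed to be left with the same probability p at every step, so the spacer length is a sum of r independent geometric durations (a negative binomial law, the discrete analogue of the Gamma law); the value p = 0.2 is made up.

```python
def convolve(a, b):
    """Law of the sum of two independent integer variables with laws a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def spacer_length_pmf(p, r, h_max):
    """pmf[h] = P(total length = h), h = 0..h_max, for r identical states in series."""
    geometric = [0.0] + [(1 - p) ** (h - 1) * p for h in range(1, h_max + 1)]
    pmf = [1.0]                      # length 0 before any state is visited
    for _ in range(r):
        pmf = convolve(pmf, geometric)[: h_max + 1]
    return pmf

one_state = spacer_length_pmf(0.2, 1, 200)
four_states = spacer_length_pmf(0.2, 4, 200)
print("mode with 1 state :", max(range(201), key=lambda h: one_state[h]))
print("mode with 4 states:", max(range(201), key=lambda h: four_states[h]))
```

With a single state the most likely length is 1; with four states the law becomes bell-shaped, with its mode well away from the minimal length, which is the qualitative shape wanted for the spacer.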

A no-nuc state

[Figure: state diagram with the labels no-nuc, spacer, nucleosome core and states 1, 2, ..., 70]

Trifonov (99) as well as Rando (05) underline that there are 'no' nucleosomes in gene promoters (accessibility). They introduce, "before" the nucleosome, a "no-nucleosome" state.

Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sc. 96 (1999)

Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)



Acknowledgements

Labo «Statistique et Génome»: Maurice BAUDRY, Etienne BIRMELE, Cécile COT, Mark HOEBEKE, Mickael GUEDJ, François KÉPÈS, Sophie LEBRE, Catherine MATIAS, Vincent MIELE, Florence MURI-MAJOUBE, Grégory NUEL, Hugues RICHARD, Anne-Sophie TOCQUET, Nicolas VERGNE

Labo MIG – INRA: Philippe BESSIÈRE, François RODOLPHE, Sophie SCHBATH, Élisabeth de TURCKHEIM

Labo AGRO: Jean-Noël BACRO, Jean-Jacques DAUDIN, Stéphane ROBIN

Lab' Rouen: Dominique CELLIER

Sec : Michèle ILBERT Sabine MERCIER
