1 Towards a model for -1 frameshift sites Alain Denise 1,2 , Michaël Bekaert 1 , Laure Bidou 1 , Guillemette Duchateau-Nguyen 1 , Jean-Paul Forest 2 , Christine Froidevaux 2 , Isabelle Hatin 1 , Jean-Pierre Rousset 1 , Michel Termier 1 1 IGM (Institut de Génétique et Microbiologie) 2 LRI (Laboratoire de Recherche en Informatique) Université Paris-Sud, Orsay

1

Towards a model for -1 frameshift sites

Alain Denise1,2, Michaël Bekaert1, Laure Bidou1, Guillemette Duchateau-Nguyen1,

Jean-Paul Forest2, Christine Froidevaux2,

Isabelle Hatin1, Jean-Pierre Rousset1, Michel Termier1

1 IGM (Institut de Génétique et Microbiologie)

2 LRI (Laboratoire de Recherche en Informatique)

Université Paris-Sud, Orsay

2

Translation

5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’

mRNA

3

Translation

CAU AUG GAU UAC AUG GUC UAA GAU

The ribosome reads bases by triplets (or codons) from a START codon

ribosome

5’ 3’

4

Translation

CAU AUG GAU UAC AUG GUC UAA GAU

The ribosome synthesizes one amino acid per codon

5’ 3’

5

Translation

5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’

9

Translation

CAU AUG GAU UAC AUG GUC UAA GAU

The synthesis goes on until a STOP codon is read

5’ 3’

1 mRNA gives 1 protein
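
The reading rule on these slides (start at a START codon, one amino acid per codon, stop at a STOP codon) can be sketched in a few lines of Python. This is only an illustration: the function name `translate` is hypothetical, and the codon table is the standard genetic code written in UCAG order.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame STOP codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no START codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # STOP codon read: synthesis ends
        protein.append(amino_acid)
    return "".join(protein)

# The mRNA shown on the slides, written without spaces.
protein = translate("CAUAUGGAUUACAUGGUCUAAGAU")
print(protein)  # MDYMV
```

The leading CAU is skipped because reading begins at the first AUG, and the trailing GAU is never read because UAA ends the synthesis.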

10

Experimental fact

• Some mRNAs encode two distinct proteins with the same 5’ end

11

Programmed -1 frameshifting

Non-deterministic event

ORF1a

START0 STOP0

0 phase

STOP-1

ORF1b -1 phase

usual translation

-1 frameshift

1 mRNA gives 2 distinct proteins in a precise ratio

12

Typical -1 frameshift site [Brierley, 1989]

NNX XXY YYZ

AUG P SP

S1

L1

S2

L2

L’1

Slippery sequence Secondary structure

5’

3’
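
As a rough sketch of how the slippery-sequence part of this model could be searched for, the code below scans an mRNA for the heptamer X XXY YYZ. The function name `find_slippery_sites` and the `frame` parameter are hypothetical. The sketch checks only the slippery heptamer and assumes that XXY and YYZ must be codons of the 0 phase, as in the NNX XXY YYZ layout. It does not look for the spacer or the secondary structure.

```python
def find_slippery_sites(mrna: str, frame: int = 0) -> list[tuple[int, str]]:
    """Return (position, heptamer) for each X XXY YYZ motif.

    `position` is the index of the first X. The motif is kept only if
    XXY starts a codon of the given reading frame (assumption).
    """
    sites = []
    for i in range(len(mrna) - 6):
        heptamer = mrna[i:i + 7]
        same_x = heptamer[0] == heptamer[1] == heptamer[2]
        same_y = heptamer[3] == heptamer[4] == heptamer[5]
        in_frame = (i + 1 - frame) % 3 == 0  # XXY begins at index i + 1
        if same_x and same_y and in_frame:
            sites.append((i, heptamer))
    return sites

# Start of the IBV site used later in the slides: UAU UUA AAC GGG UAC
sites = find_slippery_sites("UAUUUAAACGGGUAC")
print(sites)  # [(2, 'UUUAAAC')]
```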

13

IBV frameshift site

UAU UUA AAC

AUG

S1

S2

Slippery sequence Pseudoknot

5’

3’

GGGUAC

UGACGAUGGGG

GCUG AUACCCC

A G G C U C G

U C C G A G C

G

UUGC

GAAA

15

Translation with frameshift

UAU UUA AAC GGG UAC

AUG

5’

3’

UGACGAUGGGG

GCUG AUACCCC

A G G C U C G

U C C G A G C

G

UUGC

GAAA

17

Translation with frameshift

UAU UUA AAC GGG UAC

5’

3’

UGACGAUGGGG

GCUG AUACCCC

A G G C U C G

U C C G A G C

G

UUGC

GAAA

-1 shift

18

UA UUU AAA CGG GUA CGG GGU AGC AGU

Translation with frameshift

5’

3’
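
The change of reading phase can be reproduced with a short sketch. The sequence is the IBV fragment from the slides written without spaces. The helper names `codons` and `read_with_minus1_shift`, and the choice of index 9 (just after the slippery codons UUA AAC) as the shift point, are assumptions made for illustration.

```python
def codons(seq: str, start: int = 0) -> list[str]:
    """Split seq into complete triplets, beginning at index `start`."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

def read_with_minus1_shift(seq: str, shift_after: int) -> list[str]:
    """Read in the 0 phase up to index `shift_after` (exclusive),
    then step back one base and continue in the -1 phase."""
    before = codons(seq[:shift_after])
    after = codons(seq, shift_after - 1)
    return before + after

IBV = "UAUUUAAACGGGUACGGGGUAGCAGU"

zero_phase = codons(IBV)
shifted = read_with_minus1_shift(IBV, shift_after=9)

print(zero_phase)  # UAU UUA AAC GGG UAC GGG GUA GCA
print(shifted)     # UAU UUA AAC CGG GUA CGG GGU AGC AGU
```

After the shift, the last base of AAC is read a second time as the first base of CGG, which is why the downstream codons match the -1 phase reading shown on this slide.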

22

Goals

To improve the known model for viral frameshift sites

To identify new frameshift sites in viral and non-viral genomes

23

Our approach

Biological sequences

Formal models

Prediction tools

In silico and in vivo validation

Applications to other genomes

represent, explain, predict

24

IBV frameshift site: spacer

5’

3’

GGGUAC

25

Spacer consensus

HAST-1 UAC AAA

BEV UGU UG

EAV UGA GAG

HCV GAG UC

IBV GGG UAC

MHV GGG UU

TGEV GAG

RCNMV UAG GC

BWYV GGA GUG

PLRV GGG CAA

BLV UAA UAG A

FIV UGG AAG GC

HIV-1 GGG AAG AU

HTLV-2 UCC UUA A

JSR UGG GUG A

MMTV gag-pro UUG UAA A

MMTV pro-pol UGA U

RSV UAG GGA

SRV-1 GGA CUG A

Consensus UGG UAG AGAA GUA
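
One plausible way to derive such a consensus is to count bases position by position and keep the most frequent ones. The sketch below does this for the spacers listed above. It is an assumption that the consensus was built this way, and the names `SPACERS` and `position_counts` are hypothetical.

```python
from collections import Counter

# Spacer sequences from the table above, written without spaces.
SPACERS = {
    "HAST-1": "UACAAA", "BEV": "UGUUG", "EAV": "UGAGAG", "HCV": "GAGUC",
    "IBV": "GGGUAC", "MHV": "GGGUU", "TGEV": "GAG", "RCNMV": "UAGGC",
    "BWYV": "GGAGUG", "PLRV": "GGGCAA", "BLV": "UAAUAGA", "FIV": "UGGAAGGC",
    "HIV-1": "GGGAAGAU", "HTLV-2": "UCCUUAA", "JSR": "UGGGUGA",
    "MMTV gag-pro": "UUGUAAA", "MMTV pro-pol": "UGAU", "RSV": "UAGGGA",
    "SRV-1": "GGACUGA",
}

def position_counts(seqs: list[str]) -> list[Counter]:
    """Count bases at each position; shorter spacers simply stop contributing."""
    longest = max(len(s) for s in seqs)
    return [Counter(s[i] for s in seqs if len(s) > i) for i in range(longest)]

counts = position_counts(list(SPACERS.values()))
top_two = [c.most_common(2) for c in counts]

print(top_two[0])  # [('U', 11), ('G', 8)]
```

At the first position, U and G are the two most frequent bases, which agrees with the first column of the consensus on the slide.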

26

Lab experiments

lacZ luc

-1 phase

pSV40 lacZ luc

0 phase

pSV40 FS signal

FS signal N

Test construct

Control construct

Expression reporter FS reporter
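
The slides do not spell out how a frameshift rate is computed from the two constructs. A common reading of such a dual-reporter design, given here only as an assumption, is to normalise the FS reporter (luc) by the expression reporter (lacZ) in each construct and then divide the test ratio by the control ratio. The function name and all activity values below are hypothetical.

```python
def frameshift_rate(luc_test: float, lacz_test: float,
                    luc_control: float, lacz_control: float) -> float:
    """Frameshift rate in percent (assumed formula, not taken from the slides).

    luc is normalised by lacZ to cancel differences in expression level;
    the control construct (luc in the 0 phase) defines 100 %.
    """
    test_ratio = luc_test / lacz_test
    control_ratio = luc_control / lacz_control
    return 100.0 * test_ratio / control_ratio

# Hypothetical activity readings, for illustration only.
rate = frameshift_rate(luc_test=150.0, lacz_test=1000.0,
                       luc_control=1500.0, lacz_control=1000.0)
print(round(rate, 1))  # 10.0
```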

27

Spacer: lab experiments

Spacer                     relative FS rate

wild-type IBV   GGGUA      100
U mutant        UGGUA      100
A mutant        AGGUA      55
C mutant        CGGUA      32
CC mutant       CCGUA      70
CCU mutant      CCUUA      49

28

Refining the model: Machine learning

• To identify relevant properties that characterize FS sites

• Disjunctive learning: not all sequences frameshift for the same reasons [Giedroc et al., 2000]

29

Annotating data: spacer

5’

3’

GGGUAC

30

Example of data: SP

• SP = GGGUAC

– number of A = 1; C = 1; G = 3; U = 1;

– % of A = 17; C = 17; G = 50; U = 17;

– first = G;

– last = C;
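
A minimal sketch of how these spacer attributes could be computed is given below. The function name `spacer_features` and the attribute keys are hypothetical. Percentages are taken relative to the spacer length and rounded to the nearest integer.

```python
from collections import Counter

def spacer_features(sp: str) -> dict:
    """Describe a spacer by base counts, base percentages, first and last base."""
    counts = Counter(sp)
    features = {"length": len(sp), "first": sp[0], "last": sp[-1]}
    for base in "ACGU":
        features[f"n_{base}"] = counts[base]
        features[f"pct_{base}"] = round(100 * counts[base] / len(sp))
    return features

features = spacer_features("GGGUAC")
print(features["n_G"], features["pct_G"])   # 3 50
print(features["first"], features["last"])  # G C
```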

31

Annotating data: stem 1

UGACGAUGGGG

GCUG AUACCCC

5’

3’

32

Example of data: stem 1

• S1 =

– 5' side : GGGGUAGCAGU

– 3' side : CCCCAUAGUCG

– stability : -20.7 kcal/mol

33

Annotating data: full sequence

U UUA AAC

5’

3’

GGGUAC

UGACGAUGGGG

GCUG AUACCCC

A G G C U C G

U C C G A G C

G

UUGC

GAAA

34

Example of data : FS rate

FS rate = 22 %

35

GloBo

Disjunctive learning algorithm

Suited to small amounts of data

Won the PTE challenge on analogous data

36

Example of rules

If

SP length 5 and number of G in S1.5’ bottom half 3 and

number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70

then FS rate 5%

If

%G in S1.5' bottom half 80 and %C in L1 45

then FS rate 5%

If

SP length 5 and S1.3' length 6 and %C in S1.3' 45

then FS rate 5%

...
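
Rules of this kind form a disjunction of conjunctions: a site is described by the rule set as soon as every condition of at least one rule holds. The sketch below shows that structure only. The comparison operators and thresholds are placeholders and are NOT the ones on the slide, and the attribute names and the example site are hypothetical.

```python
from typing import Callable

Example = dict[str, float]
Condition = Callable[[Example], bool]

# Each rule is a conjunction of conditions. Operators and thresholds
# are placeholders chosen for illustration, not the learned ones.
RULES: list[list[Condition]] = [
    [lambda e: e["sp_length"] <= 5, lambda e: e["s1_3p_length"] >= 6],
    [lambda e: e["pct_G_s1_5p_bottom"] >= 80, lambda e: e["pct_C_l1"] <= 45],
]

def rule_fires(rule: list[Condition], example: Example) -> bool:
    """A rule fires only if all of its conditions hold (conjunction)."""
    return all(condition(example) for condition in rule)

def any_rule_fires(rules: list[list[Condition]], example: Example) -> bool:
    """The rule set applies if at least one rule fires (disjunction)."""
    return any(rule_fires(rule, example) for rule in rules)

# Hypothetical annotated site.
site = {"sp_length": 6, "s1_3p_length": 11,
        "pct_G_s1_5p_bottom": 83, "pct_C_l1": 40}
print(rule_fires(RULES[0], site), any_rule_fires(RULES, site))  # False True
```

The example site fails the first rule but satisfies the second, which illustrates why a disjunctive learner suits sequences that frameshift for different reasons.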

37

Covering and prediction

If

SP length 5 and number of G in S1.5’ bottom half 3 and

number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70

then FS rate 5%

Covering of examples : 70 %

Examples predicted in test set : 80 %
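
The two percentages can be read as follows, under assumptions that the slides do not confirm: covering is the share of examples on which the rule fires, and the prediction score is the share of test examples whose label agrees with the rule. The function names, the toy rule and the toy examples below are all hypothetical.

```python
from typing import Callable

def covering(rule: Callable[[dict], bool], examples: list[dict]) -> float:
    """Percentage of examples on which the rule fires."""
    fired = sum(1 for e in examples if rule(e))
    return 100.0 * fired / len(examples)

def predicted(rule: Callable[[dict], bool], test_set: list[dict]) -> float:
    """Percentage of test examples whose label matches the rule's verdict."""
    correct = sum(1 for e in test_set if rule(e) == e["high_fs"])
    return 100.0 * correct / len(test_set)

# Toy rule and toy data, for illustration only.
toy_rule = lambda e: e["sp_length"] <= 6
toy_examples = [
    {"sp_length": 5, "high_fs": True},
    {"sp_length": 6, "high_fs": True},
    {"sp_length": 7, "high_fs": False},
    {"sp_length": 4, "high_fs": True},
    {"sp_length": 8, "high_fs": True},
]

print(covering(toy_rule, toy_examples))   # 60.0
print(predicted(toy_rule, toy_examples))  # 80.0
```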

38

Is R1 relevant for frameshift?

                Stem 1 5’-side    relative FS rate   R1

wild-type IBV   GGGGU AUCAGU      100                yes
mutant 1        GGUCG AUCAGU      41                 yes
mutant 2        GGGGU UCUACA      55                 yes
mutant 3        GCUCG AUCAGU      36                 no
mutant 4        GCCCU AUCAGU      73                 no

39

Covering and prediction

If

SP length 5 and S1.3' length 6 and %C in S1.3' 45

then FS rate 5%

Covering of examples : 45 %

Examples predicted in test set : 40 %

40

Conclusion

• Spacer:

– correlation between primary sequence and FS rate has been established

– systematic experimentation going on

41

Conclusion

Biological sequences

Formal models

Prediction tools

In silico and in vivo validation

Applications to other genomes

58

Spacer

Virus       Sequence

HAST-I    : U A C A A A
BEV       : U G U U G
EAV       : U G A G A G
HCV       : G A G U C
IBV       : G G G U A C
MHV       : G G G U U
TGEV      : G A G
RCNMV     : U A G G C
BWYV      : G G A G U G
PLRV      : G G G C A A
BLV       : U A A U A G A
FIV       : U G G A A G G C
HIV-1     : G G G A A G A U
HTLV-II   : U C C U U A A
JSR       : U G G G U G A
MMTV      : U U G U A A A
MMTV      : U G A U
RSV       : U A G G G A
SRV-1     : G G A C U G A

Consensus : U G G U A G A
            G A A G U A