15-20 september WABI03 1 A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence Paola Bonizzoni Graziano Pesole* Raffaella Rizzi DISCo, University of Milan-Bicocca, Italy *Department of Physiology and Biochemistry, University of Milan, Italy Supported by FIRB Bioinformatics: Genomics and Proteomics


Page 1: Paola Bonizzoni       Graziano Pesole *        Raffaella Rizzi

15-20 September, WABI03

A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence

Paola Bonizzoni, Graziano Pesole*, Raffaella Rizzi

DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy

Supported by FIRB Bioinformatics: Genomics and Proteomics

Page 2

Outline
- Gene structure and alternative splicing (AS)
- Problem definition and algorithm
- ASPic program
- Experimental results and discussion

Page 3

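Mechanism of Splicing

[Figure: DNA (double strand, 5’ to 3’) is transcribed (TRANSCRIPTION) into pre-mRNA containing exon 1, exon 2 and exon 3; SPLICING by the spliceosome yields the splicing product, an mRNA joining exon 1, exon 2 and exon 3; an EST (Expressed Sequence Tag, cDNA) covers exon 1, exon 2 and exon 3.]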

Page 4

Modes of Alternative Splicing

[Figure: a genomic sequence with exons 1, 2, 3 separated by introns, and three splicing modes.]
- First splicing mode: 1 2 3
- Second splicing mode: 1 3
- Third splicing mode: 2 3

Page 5

Modes of Alternative Splicing

[Figure: a genomic sequence with exons 1, 2, 2b, 3.]
- Competing 5’–3’
- Exclusive exons: 1 3, 1 2b

Page 6

Why is AS important?
- AS occurs in 59% of human genes (Graveley, 2001)
- AS expands protein diversity (it generates multiple transcripts from a single gene)
- AS is tissue-specific (Graveley, 2001)
- AS is related to human diseases

Page 7

Motivations

Regulation of AS is still an open problem.

Tools are NEEDED to
- predict alternative splicing forms
- analyze such a mechanism by a representation of splicing forms

Page 8

What is available?

Fast programs that produce a single EST alignment to a genomic sequence:
- Spidey (Wheelan et al., 2001)
- Squall (Ogasawara & Morishita, 2002)

But predicting the exon-intron gene structure is a complicated goal because
- sequencing errors in ESTs make it difficult to locate splice sites by alignment
- duplications and repeated sequences may produce more than one possible EST alignment

Page 9

Open Problems
- Formal definition of the AS prediction problem
- Combined analysis of EST alignments related to the same gene, by agreeing ESTs to a common exon-intron gene structure
- Optimization criteria

Page 10

Formal Definitions

Def 1
Genomic sequence, $G = I_1 f_1 I_2 f_2 I_3 f_3 \dots I_n f_n I_{n+1}$, where $I_i$ ($i = 1, 2, \dots, n+1$) are introns and $f_i$ ($i = 1, 2, \dots, n$) are exons

Def 2
Exon factorization of $G$, $G_E = f_1 f_2 f_3 \dots f_n$

Def 3
An EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ s.t. there exist $1 \le i_1 < i_2 < \dots < i_k \le n$ with:
- $s_t = f_{i_t}$ for $t = 2, 3, \dots, k-1$
- $s_1$ is a suffix of $f_{i_1}$ and $s_k$ is a prefix of $f_{i_k}$

Splice variant: $s_t = \mathrm{suff}(f_{i_t})$ or $s_t = \mathrm{pref}(f_{i_t})$

Def 3 with errors
An EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ s.t. there exist $1 \le i_1 < i_2 < \dots < i_k \le n$ with:
- $\mathrm{edit}(s_t, f_{i_t}) \le \mathit{error}$ for $t = 2, 3, \dots, k-1$
- $\mathrm{edit}(s_1, \mathrm{suff}(f_{i_1})) \le \mathit{error}$ and $\mathrm{edit}(s_k, \mathrm{pref}(f_{i_k})) \le \mathit{error}$
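The error-tolerant form of Def 3 can be checked mechanically for one candidate factorization. The following is a minimal sketch, not the authors' implementation. It assumes that `error` is an absolute number of edits (the slides may intend a percentage), that suff and pref range over all non-empty suffixes and prefixes of the exon, and that exons are non-empty strings. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_suffix_distance(s: str, f: str) -> int:
    """Smallest edit distance between s and any non-empty suffix of f."""
    return min(edit_distance(s, f[j:]) for j in range(len(f)))


def best_prefix_distance(s: str, f: str) -> int:
    """Smallest edit distance between s and any non-empty prefix of f."""
    return min(edit_distance(s, f[:j]) for j in range(1, len(f) + 1))


def is_compatible(factors, exons, indices, error):
    """Check Def 3 with errors for one EST factorization.

    factors: the EST factors s_1..s_k
    exons:   the exon factorization f_1..f_n
    indices: 1-based exon indices i_1..i_k, one per factor
    error:   maximum number of edits allowed per factor (assumption)
    """
    k = len(factors)
    if k == 0 or len(indices) != k:
        return False
    # Require 1 <= i_1 < i_2 < ... < i_k <= n.
    if indices[0] < 1 or indices[-1] > len(exons):
        return False
    if any(a >= b for a, b in zip(indices, indices[1:])):
        return False
    for t, (s, i) in enumerate(zip(factors, indices), start=1):
        f = exons[i - 1]
        if t == 1 and best_suffix_distance(s, f) > error:
            return False  # s_1 must be close to a suffix of f_{i_1}
        if t == k and best_prefix_distance(s, f) > error:
            return False  # s_k must be close to a prefix of f_{i_k}
        if 1 < t < k and edit_distance(s, f) > error:
            return False  # internal factors must be close to the whole exon
    return True
```

With toy exons `["ACGTAC", "GGTT", "CATCAT"]`, the factorization `["TAC", "GGAT", "CAT"]` on indices `[1, 2, 3]` is rejected with `error = 0` and accepted with `error = 1`, because the middle factor differs from its exon by one substitution.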

Page 11

The Problem

Input
- A genomic sequence G
- A set of EST sequences S = {S1, S2, …, Sn}

Output
An exon factorization GE of G (GE = f1, f2, …, fn) and a set of EST factorizations compatible with GE

Objective: minimize n

Page 12

Example

Genomic sequence G
EST set S = {S1, S2, S3}

S1: A2 D1 C1
S2: A1A2 B D1
S3: A2 D1D2 C1C2

[Figure: two exon factorizations of G that explain the three ESTs.]
- 7 exons: A2, A1A2, B, D1, C1, D1D2, C1C2
- 4 exons: A1A2, B, D1D2, C1C2
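The drop from 7 to 4 exons can be reproduced on toy data. The sketch below reads the slide as three EST factorizations, S1 = A2 D1 C1, S2 = A1A2 B D1 and S3 = A2 D1D2 C1C2, which is an interpretation of the figure. The nucleotide strings assigned to the regions are hypothetical, and the absorption rule (drop a factor that is a proper prefix or suffix of a longer factor) only illustrates the objective; it is not the heuristic presented in the talk.

```python
# Hypothetical toy sequences for the regions named on the slide.
REGION = {"A1": "ACCT", "A2": "GGA", "B": "TTCAG",
          "D1": "CAT", "D2": "GCGC", "C1": "AAG", "C2": "TGTA"}


def seq(*names):
    """Concatenate the toy sequences of the named regions."""
    return "".join(REGION[n] for n in names)


# EST factorizations as read from the slide (an interpretation).
ESTS = {
    "S1": [seq("A2"), seq("D1"), seq("C1")],
    "S2": [seq("A1", "A2"), seq("B"), seq("D1")],
    "S3": [seq("A2"), seq("D1", "D2"), seq("C1", "C2")],
}


def distinct_factors(ests):
    """Every distinct factor, each treated as its own exon."""
    return {factor for factors in ests.values() for factor in factors}


def absorb_partial_factors(factors):
    """Drop each factor that is a proper prefix or suffix of another factor."""
    kept = set()
    for s in factors:
        absorbed = any(s != f and (f.startswith(s) or f.endswith(s))
                       for f in factors)
        if not absorbed:
            kept.add(s)
    return kept


naive = distinct_factors(ESTS)
merged = absorb_partial_factors(naive)
print(len(naive), len(merged))
```

On this toy data the naive set has 7 members and the merged set has 4, namely the sequences for A1A2, B, D1D2 and C1C2.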

Page 13

Results
- MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
- Heuristic algorithm:
  - iterate the process to factorize each EST
  - backtrack to recompute previous EST factors if they are not compatible with GE

Page 14

The algorithm

[Figure: EST Si with factors si1 … si j-1 sij aligned to G, where exons e1, e2 are already placed and em is the candidate exon for sij; the previous EST Si-1 with factors si-1 1 … si-1 j-1 si-1 j … si-1 n.]

Iterative jth step: partial EST factorization of Si (compute factor sij)

if (Compatible(em, exon_list)) then
    add em to exon_list;
otherwise try to place sij elsewhere;
If not possible then backtrack;

After placing all the factors sij for the set S, place the external factors;

Page 15

The algorithm (more details)

Compute factor sij

sij can be divided into n components ck (k = 1, 2, …, n). At least one of these components, for k from 1 to (n-1), is error-free and can be placed on G.

[Figure: Si with factors si1 … si j-1 sij above G; sij split into components c1 c2 c3 c4 c5.]

- The algorithm searches for a perfect match of c1 on G.
- Suppose that c1 has no perfect match on G. Then the algorithm searches for a perfect match of c2 on G.
- Suppose that c2 has a perfect match on G. Then the entire factor sij can be placed on G.
- Find the canonical ag pattern on the left.
- Find the rightmost gt pattern such that the edit distance between sijy and the genomic substring from ag to gt is bounded.

[Figure: the genomic substring between ag and gt is labeled as the exon.]
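The seed search described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not ASPic's code: components are taken to be contiguous and of near-equal length, only the first occurrence of a component on G is considered (repeats would yield further candidates), and the final step that picks the rightmost gt under an edit-distance bound is left out. All function names are hypothetical.

```python
def split_components(factor: str, n: int):
    """Split factor into n contiguous components of near-equal length.

    Returns a list of (offset_in_factor, component) pairs.
    """
    size, extra = divmod(len(factor), n)
    components, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        components.append((start, factor[start:end]))
        start = end
    return components


def find_seed(factor: str, genome: str, n: int):
    """Try c_1, c_2, ..., c_{n-1} in order and return the first perfect match.

    Returns (k, offset_in_factor, genome_position), or None if no component
    among the first n-1 occurs exactly in the genome.
    """
    candidates = split_components(factor, n)[: n - 1]
    for k, (offset, comp) in enumerate(candidates, start=1):
        if not comp:
            continue  # factor shorter than n: skip empty components
        pos = genome.find(comp)
        if pos != -1:
            return k, offset, pos
    return None


def left_acceptor(genome: str, implied_start: int) -> int:
    """Index of the closest 'ag' ending at or before implied_start, or -1."""
    return genome.rfind("ag", 0, max(implied_start, 0))


# Toy genome: intron tail "ccttag", exon "acgttgcaaggctc", intron head "gtaacc".
genome = "ccttagacgttgcaaggctcgtaacc"
factor = "atgttgcaaggctc"  # the exon with one sequencing error in c1
seed = find_seed(factor, genome, 5)
```

Here c1 = "atg" has no perfect match, c2 = "ttg" matches at genome position 9, so the factor is anchored with its implied start at position 9 - 3 = 6, and the closest ag to the left starts at position 4.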

Page 16

ASPic (Alternative Splicing PredICtion)

Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)

Output
- A text file for all EST alignments
- An HTML file for the exon factorization of the genomic sequence

Page 17

ASPic data validation

ASPic INPUT:
- Genomic sequences from ASAP database
- EST clusters of human chromosome 1 from UniGene database

Validation Database:
- ASAP (Lee et al., 2003)

Page 18

Experimental Results

[Table; rows not preserved. Columns: Genomic sequence (official gene name); Introns detected by ASAP; ASAP introns detected by ASPic; Novel introns detected by ASPic; Genomic shift detected by ASPic]

Page 19

Execution times

Pentium IV, 1600 MHz, 256 MB, running Linux

Page 20

An example of data (gene HNRPR)

ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences.
Positions are from 0 for ASPic and from 1 for ASAP.
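Because the two tools count positions from different origins, coordinates have to be shifted before an ASPic intron is compared with an ASAP one. A minimal sketch follows, assuming both tools report inclusive start and end positions (the slide states only the origins); the function names are hypothetical.

```python
def aspic_to_asap(start: int, end: int):
    """Shift 0-based ASPic positions to 1-based ASAP positions."""
    return start + 1, end + 1


def asap_to_aspic(start: int, end: int):
    """Shift 1-based ASAP positions to 0-based ASPic positions."""
    return start - 1, end - 1


# The novel HNRPR intron reported by ASPic, expressed in ASAP coordinates.
print(aspic_to_asap(2144, 5333))
```

Under this assumption, the ASPic intron 2144 to 5333 corresponds to positions 2145 to 5334 in ASAP numbering.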

Page 21

An example of data (gene HNRPR, intron 2144-5333)

[Figure: ASPic output for this intron, with labels for the EST ID, the left and right ends of the two exons, the EST exons and the genomic exons.]

Page 22

WEB site

Page 23

WEB site

Page 24

WEB site

Page 25

Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole

Software design lead: Raffaella Rizzi

Web site: Gabriele Ravanelli

Graphical representation: Francesco Perego, Anna Redondi

Data analysis: Francesca Rossin

Other contributions: Gianluca Dellavedova

Page 26

THANK YOU!