An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads

Serghei Mangul

Department of Computer Science

Georgia State University

Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

Advances in Next Generation Sequencing

http://www.economist.com/node/16349358

• Roche/454 FLX Titanium
  – 400-600 million reads/run
  – 400bp avg. length
• Illumina HiSeq 2000
  – Up to 6 billion PE reads/run
  – 35-100bp read length
• SOLiD 4/5500
  – 1.4-2.4 billion PE reads/run
  – 35-50bp read length
• Ion Proton Sequencer

RNA-Seq

[Figure: RNA-Seq workflow. Make cDNA and shatter it into fragments; sequence fragment ends; map reads. Downstream analyses: gene expression, transcriptome reconstruction, isoform expression.]

Transcriptome Reconstruction

• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Transcriptome Reconstruction Types

• Genome-independent reconstruction (de novo)
  – de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
  – Spliced read mapping
  – Exon identification
  – Splice graph
• Annotation-guided reconstruction
  – Use existing annotation (known transcripts)
  – Focus on discovering novel transcripts

Previous approaches

• Genome-independent reconstruction
  – Trinity (2011), Velvet (2008), TransABySS (2008)
• Genome-guided reconstruction
  – Scripture (2010)
    • Reports “all” transcripts
  – Cufflinks (2010), IsoLasso (2011), SLIDE (2012)
    • Minimizes set of transcripts explaining reads
• Annotation-guided reconstruction
  – RABT (2011), DRUT (2011)

Challenges and Solutions

• Read length is currently much shorter than transcript length
• Paired-end reads
• Fragment length distribution

[Figure: four candidate transcripts over exons 1-7. t1: 1 2 3 4 5 6 7; t2: 1 3 4 5 6 7; t3: 1 2 3 4 5 7; t4: 1 3 4 5 7.]

Exons 2 and 6 are “distant” exons: how to phase them?

TRIP: Transcriptome Reconstruction using Integer Programming

• Map the RNA-Seq reads to the genome
• Construct splice graph G(V,E)
  – V: exons
  – E: splicing events
• Candidate transcripts
  – depth-first search (DFS)
• Filter candidate transcripts
  – fragment length distribution (FLD)
  – integer programming
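The candidate-generation step can be illustrated with a minimal sketch of depth-first search over a splice graph. The graph, the function name `enumerate_transcripts`, and the choice of source and sink nodes are hypothetical and only serve the illustration; the slides do not specify how TRIP implements this step.

```python
from typing import Dict, List

def enumerate_transcripts(graph: Dict[int, List[int]],
                          sources: List[int],
                          sinks: List[int]) -> List[List[int]]:
    """Enumerate all source-to-sink paths of a splice graph (assumed acyclic) by DFS."""
    sink_set = set(sinks)
    paths: List[List[int]] = []

    def dfs(node: int, path: List[int]) -> None:
        path.append(node)
        if node in sink_set:
            paths.append(list(path))   # record a complete candidate transcript
        for nxt in graph.get(node, []):
            dfs(nxt, path)             # follow each splicing event (edge)
        path.pop()                     # backtrack

    for s in sources:
        dfs(s, [])
    return paths

# Hypothetical splice graph: exon 2 may be skipped.
graph = {1: [2, 3], 2: [3], 3: [4], 4: []}
candidates = enumerate_transcripts(graph, sources=[1], sinks=[4])
print(candidates)
```

Each returned path is one candidate transcript; the number of paths can grow quickly with the number of alternative splicing events, which is why a filtering step follows.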

Gene representation

• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events

• Gene - set of non-overlapping pseudo-exons

[Figure: a gene with exons e1-e6 appearing in three transcripts (Tr1, Tr2, Tr3), partitioned into pseudo-exons pse1-pse7. Spse_i and Epse_i mark the start and end of pseudo-exon i.]
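As a rough sketch of the pseudo-exon definition, the following code splits the exons of several transcripts at every exon boundary and keeps the segments covered by at least one exon. The half-open coordinate convention, the function name `pseudo_exons`, and the example coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)

def pseudo_exons(transcripts: List[List[Interval]]) -> List[Interval]:
    """Split exons at every exon start/end and keep the exonic segments."""
    exons = {exon for tr in transcripts for exon in tr}
    # Every exon boundary is a potential pseudo-exon boundary.
    bounds = sorted({b for start, end in exons for b in (start, end)})
    segments = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Keep the segment only if some exon fully contains it.
        if any(start <= lo and hi <= end for start, end in exons):
            segments.append((lo, hi))
    return segments

# Hypothetical gene: the first exon has two alternative 3' ends.
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 150), (200, 300)]
print(pseudo_exons([tr1, tr2]))
```

The resulting segments do not overlap, matching the description of a gene as a set of non-overlapping pseudo-exons.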

Splice Graph

[Figure: splice graph over pseudo-exons 1-9 laid out along the genome, with TSS and TES marked.]

How to select?

• Select the smallest set of candidate transcripts that yields a good statistical fit between
  – the fragment length distribution empirically determined during library preparation
  – fragment lengths implied by mapped read pairs

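[Figure: a read pair mapped to exons 1 and 3. On candidate transcript 1-2-3 (exons of 200 bp each) the implied fragment length is 500; on candidate transcript 1-3 it is 300. Fragment length distribution: mean 500, std. dev. 50.]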
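To make the idea of fragment lengths implied by mapped read pairs concrete, here is a small sketch that computes the fragment length a read pair would have on each candidate transcript and how many standard deviations it lies from the mean. The genomic coordinates, function names, and the read-pair span are hypothetical; only the exon lengths (200), mean (500), and standard deviation (50) come from the slide.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)

def implied_fragment_length(exons: List[Interval],
                            frag_start: int, frag_end: int) -> int:
    """Transcript bases covered between the outer ends of a mapped read pair."""
    return sum(max(0, min(end, frag_end) - max(start, frag_start))
               for start, end in exons)

def deviation_in_sd(length: float, mean: float = 500.0, sd: float = 50.0) -> float:
    """Distance of a fragment length from the mean, in standard deviations."""
    return abs(length - mean) / sd

# Three 200 bp exons separated by introns (coordinates are made up).
e1, e2, e3 = (0, 200), (1000, 1200), (2000, 2200)
t123 = [e1, e2, e3]   # candidate transcript 1-2-3
t13 = [e1, e3]        # candidate transcript 1-3

# Read pair whose outer ends fall in exon 1 and exon 3.
start, end = 50, 2150
len_123 = implied_fragment_length(t123, start, end)
len_13 = implied_fragment_length(t13, start, end)
print(len_123, deviation_in_sd(len_123))
print(len_13, deviation_in_sd(len_13))
```

Under these assumptions the read pair fits transcript 1-2-3 exactly at the mean, while on transcript 1-3 it lies four standard deviations away, so the pair is far better explained by 1-2-3.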

Simplified IP Formulation

• Objective

$$\min \sum_{t \in T} y(t)$$

• Constraints

$$(1)\quad \sum_{t \in T(p)} y(t) \ge x(p), \quad \forall p$$

$$(2)\quad \sum_{p} x(p) \ge N_{reads}$$

Constraint (1): for each pe read at least one transcript is selected.

– T(p): set of candidate transcripts on which paired-end read p can be mapped
– y(t): 1 if candidate transcript t is selected, 0 otherwise
– x(p): 1 if the pe read p is selected to be mapped, 0 otherwise
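The simplified formulation can be explored on toy inputs with a brute-force sketch. It assumes that every read pair must be explained (x(p) = 1 for all p), in which case the problem reduces to finding a smallest set of transcripts that hits every T(p). The exhaustive search, the function name, and the example data are illustrative only; a real instance would be handed to an IP solver.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set

def min_transcript_set(read_to_transcripts: Dict[str, Set[str]],
                       transcripts: List[str]) -> Optional[Set[str]]:
    """Smallest set of transcripts such that every read pair p has a selected t in T(p)."""
    for k in range(len(transcripts) + 1):            # try smaller sets first
        for chosen in combinations(transcripts, k):
            selected = set(chosen)
            # Constraint (1) with x(p) = 1: some transcript in T(p) is selected.
            if all(selected & t_p for t_p in read_to_transcripts.values()):
                return selected
    return None  # some read pair maps to no candidate transcript

# Hypothetical T(p) for three read pairs over four candidates.
T = ["t1", "t2", "t3", "t4"]
T_of_p = {"p1": {"t1", "t2"}, "p2": {"t1", "t3"}, "p3": {"t1", "t4"}}
solution = min_transcript_set(T_of_p, T)
print(solution)
```

The search is exponential in the number of candidates, so it is useful only for checking intuition on small examples.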

IP Formulation

• Objective

$$\min \sum_{t \in T} y(t)$$

• Constraints
  – for each pe read from every category of std. dev., at least one transcript is selected
  – restricts the number of pe reads mapped within different std. dev.
  – each pe read is mapped with no more than one category of std. dev.
  – every splice junction is covered

Comparison on Simulated Data

100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd

$$Sens = \frac{TP}{TP + FN}$$

$$PPV = \frac{TP}{TP + FP}$$

$$FScore = 2 \cdot \frac{PPV \cdot Sens}{PPV + Sens}$$
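The three accuracy measures can be computed from counts of true positives, false positives, and false negatives as in the sketch below. The counts in the example are made up for illustration and are not results from the comparison.

```python
from typing import Tuple

def accuracy_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, PPV, F-score) for reconstructed transcripts."""
    sens = tp / (tp + fn) if tp + fn else 0.0   # fraction of true transcripts found
    ppv = tp / (tp + fp) if tp + fp else 0.0    # fraction of reported transcripts that are true
    f_score = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0  # harmonic mean
    return sens, ppv, f_score

# Hypothetical counts: 8 correct transcripts, 2 spurious, 2 missed.
sens, ppv, f_score = accuracy_metrics(tp=8, fp=2, fn=2)
print(sens, ppv, f_score)
```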

Influence of Sequencing Parameters

100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd

$$Sens = \frac{TP}{TP + FN}$$

$$PPV = \frac{TP}{TP + FP}$$

$$FScore = 2 \cdot \frac{PPV \cdot Sens}{PPV + Sens}$$

TRIP-L: individual fragment length estimates

Results on Real RNA-Seq Data

• CD1 mouse retina RNA samples
• Specific gene that has 33 annotated transcripts in Ensembl
• TRIP: 5 out of 10 transcripts, confirmed using qPCR
• Cufflinks: 3 out of 10 transcripts

6906 alignments for 22346 read pairs with read length of 68

Conclusions

• We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads.

• Our method:
  – exploits distribution of fragment lengths
  – additional experimental data
    • TSS/TES (TRIP with TSS/TES)
    • individual fragment length estimates (TRIP-L)
