An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads
Serghei Mangul
Department of Computer Science
Georgia State University
Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky
Advances in Next Generation Sequencing
http://www.economist.com/node/16349358
• Roche/454 FLX Titanium
– 400-600 million reads/run
– 400bp avg. length
• Illumina HiSeq 2000
– Up to 6 billion PE reads/run
– 35-100bp read length
• SOLiD 4/5500
– 1.4-2.4 billion PE reads/run
– 35-50bp read length
• Ion Proton Sequencer
RNA-Seq
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Downstream analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression
[Figure: exons A B C D E; transcripts A B C, A C and D E]
Transcriptome Reconstruction
• Given partial or incomplete information about the transcripts (short reads sampled from them), use that information to make an informed guess about the missing or unknown data (the full-length transcripts).
Transcriptome Reconstruction Types
• Genome-independent reconstruction (de novo)
– de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
– Spliced read mapping
– Exon identification
– Splice graph
• Annotation-guided reconstruction
– Use existing annotation (known transcripts)
– Focus on discovering novel transcripts
Previous approaches
• Genome-independent reconstruction
– Trinity(2011), Velvet(2008), TransABySS(2008)
• Genome-guided reconstruction
– Scripture(2010): reports “all” transcripts
– Cufflinks(2010), IsoLasso(2011), SLIDE(2012): minimize the set of transcripts explaining the reads
• Annotation-guided reconstruction
– RABT(2011), DRUT(2011)
Challenges and Solutions
• Read length is currently much shorter than transcript length
• Paired-end reads
• Fragment length distribution
t1 : 1 2 3 4 5 6 7
t2 : 1 3 4 5 6 7
t3 : 1 2 3 4 5 7
t4 : 1 3 4 5 7
exons : 1 2 3 4 5 6 7
Exons 2 and 6 are “distant” exons: how to phase them?
TRIP: Transcriptome Reconstruction using Integer Programming
• Map the RNA-Seq reads to the genome
• Construct Splice Graph - G(V,E)
– V: exons
– E: splicing events
• Candidate transcripts
– depth-first-search (DFS)
• Filter candidate transcripts
– fragment length distribution (FLD)
– integer programming
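The splice graph and DFS steps can be illustrated with a minimal Python sketch. It assumes exons are numbered in genomic order (so the graph is acyclic) and that splicing events are given as (donor exon, acceptor exon) pairs. The function names and the junction list are hypothetical and are not taken from the TRIP implementation.

```python
from collections import defaultdict

def build_splice_graph(junctions):
    """Adjacency list: exon -> sorted exons reachable by one splicing event."""
    graph = defaultdict(list)
    for donor, acceptor in junctions:
        graph[donor].append(acceptor)
    return {exon: sorted(targets) for exon, targets in graph.items()}

def candidate_transcripts(graph, sources, sinks):
    """Enumerate every source-to-sink path with an iterative DFS.

    Assumes the graph is acyclic, which holds when edges always go
    from an upstream exon to a downstream one.
    """
    paths = []
    stack = [[s] for s in sorted(sources, reverse=True)]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last in sinks:
            paths.append(path)
        # Push in reverse so the smallest successor is explored first.
        for nxt in reversed(graph.get(last, [])):
            stack.append(path + [nxt])
    return paths

# Hypothetical gene with 7 exons where exons 2 and 6 are optional.
junctions = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
graph = build_splice_graph(junctions)
candidates = candidate_transcripts(graph, sources={1}, sinks={7})
for exon_chain in candidates:
    print(exon_chain)
```

On this toy gene the enumeration yields four candidates, one for each combination of including or skipping exons 2 and 6. The number of paths can grow exponentially with the number of alternative splicing events, which is why a filtering step is needed afterwards.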
Gene representation
• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events
• Gene - set of non-overlapping pseudo-exons
[Figure: transcripts Tr1, Tr2 and Tr3 built from exons e1-e6; the exon boundaries split the gene into pseudo-exons pse1-pse7, where Spse_i and Epse_i mark the start and end of pse_i]
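As a rough illustration of the pseudo-exon definition, the sketch below cuts a gene at every exon start and end and keeps the segments covered by at least one exon. The half-open (start, end) coordinates and the three toy transcripts are assumptions made for the example, not data from the slides.

```python
def pseudo_exons(transcripts):
    """Split a gene into pseudo-exons at every exon start and end.

    transcripts: list of transcripts, each a list of half-open
    (start, end) exon intervals in genomic coordinates.
    """
    exons = [exon for transcript in transcripts for exon in transcript]
    boundaries = sorted({pos for exon in exons for pos in exon})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep the segment only if some exon fully covers it;
        # uncovered segments are intronic in every transcript.
        if any(start <= left and right <= end for start, end in exons):
            result.append((left, right))
    return result

# Hypothetical transcripts sharing and extending each other's exons.
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 150), (200, 300)]
tr3 = [(50, 100), (250, 300)]
pses = pseudo_exons([tr1, tr2, tr3])
print(pses)
```

Each transcript can then be written as a chain of consecutive pseudo-exons, which is the representation used as the vertex set of the splice graph.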
Splice Graph
[Figure: genome with pseudo-exons 1-9; splice graph over the pseudo-exons with TSS and TES marked]
How to select?
• Select the smallest set of candidate transcripts that yields a good statistical fit between
– the fragment length distribution empirically determined during library preparation
– fragment lengths implied by mapped read pairs
[Figure: a read pair implies a fragment length of 500 on transcript 1-2-3 and 300 on transcript 1-3 (each exon is 200 long); fragment length distribution with Mean : 500; Std. dev. 50]
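The fragment length implied by a read pair depends on which candidate transcript it is placed on. The sketch below works through that arithmetic with three 200 bp exons and a distribution with mean 500 and std. dev. 50. The genomic coordinates and intron sizes are invented for the example, and both fragment ends are assumed to fall inside exons of the transcript.

```python
def fragment_length(exons, frag_start, frag_end):
    """Length of the genomic span [frag_start, frag_end) once the introns
    of this transcript are removed.

    exons: half-open (start, end) intervals of the transcript's exons.
    Assumes both fragment ends lie inside exons of the transcript.
    """
    return sum(
        max(0, min(end, frag_end) - max(start, frag_start))
        for start, end in exons
    )

def deviation_in_sd(length, mean, sd):
    """Distance of a fragment length from the mean, in standard deviations."""
    return abs(length - mean) / sd

# Hypothetical coordinates: three 200 bp exons separated by introns.
e1, e2, e3 = (0, 200), (1000, 1200), (2000, 2200)
pair_span = (50, 2150)  # outer genomic coordinates of the read pair

len_123 = fragment_length([e1, e2, e3], *pair_span)  # 150 + 200 + 150
len_13 = fragment_length([e1, e3], *pair_span)       # 150 + 150
print(len_123, deviation_in_sd(len_123, mean=500, sd=50))
print(len_13, deviation_in_sd(len_13, mean=500, sd=50))
```

The same read pair sits exactly at the mean on transcript 1-2-3 but four standard deviations away on transcript 1-3, so it is much better explained by the transcript that includes exon 2.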
Simplified IP Formulation
• Objective
$$\min \sum_{t \in T} y(t)$$
• Constraints
$$(1)\quad \sum_{t \in T(p)} y(t) \ge x(p), \quad \forall p$$
$$(2)\quad \sum_{p} x(p) \ge N_{reads}$$
– (1): for each pe read at least one transcript is selected
• Notation
– T(p) - set of candidate transcripts on which paired-end read p can be mapped
– y(t) - 1 if a candidate transcript t is selected, 0 otherwise
– x(p) - 1 if the pe read p is selected to be mapped
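To make the simplified formulation concrete, here is a small sketch that solves a toy instance by exhaustive search instead of an IP solver. It assumes that constraint (1) requires every selected read to have at least one selected transcript in T(p), and that constraint (2) requires at least N_reads reads to be selected. The function name and the toy data are hypothetical, and the search is only feasible for a handful of candidates.

```python
from itertools import combinations

def select_transcripts(candidates, reads_to_transcripts, n_reads_required):
    """Smallest set of transcripts explaining at least n_reads_required reads.

    reads_to_transcripts: dict mapping read p -> collection T(p).
    Subsets are tried in order of increasing size, so the first feasible
    subset minimizes sum_t y(t). Returns None if no subset is feasible.
    """
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # x(p) can be set to 1 exactly when T(p) meets the chosen set.
            explained = sum(
                1 for t_p in reads_to_transcripts.values() if chosen & set(t_p)
            )
            if explained >= n_reads_required:
                return chosen
    return None

# Hypothetical instance: four candidates, four paired-end reads.
candidates = ["t1", "t2", "t3", "t4"]
reads_to_transcripts = {
    "p1": {"t1", "t2"},
    "p2": {"t1", "t3"},
    "p3": {"t4"},
    "p4": {"t2", "t4"},
}
all_reads = select_transcripts(candidates, reads_to_transcripts, 4)
three_reads = select_transcripts(candidates, reads_to_transcripts, 3)
print(all_reads, three_reads)
```

A real instance would be handed to an integer programming solver. The brute-force loop only shows what the objective and the two constraints ask for.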
IP Formulation
• Objective
$$\min \sum_{t \in T} y(t)$$
• Constraints
– for each pe read from every category of std. dev., at least one transcript is selected
– the number of pe reads mapped within each std. dev. category is restricted
– each pe read is mapped within no more than one category of std. dev.
– every splice junction is covered
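One way to picture the std. dev. categories is sketched below. It assumes that category k holds read placements whose implied fragment length lies between k-1 and k standard deviations from the mean, and that fragment lengths are normally distributed. Both the binning rule and the function names are assumptions made for illustration, since the exact constraint definitions are not given here.

```python
import math

def sd_category(fragment_length, mean, sd, max_category=4):
    """Smallest k >= 1 with |length - mean| <= k * sd, or None if the
    length is more than max_category standard deviations away."""
    z = abs(fragment_length - mean) / sd
    k = max(1, math.ceil(z))
    return k if k <= max_category else None

def expected_fraction(k):
    """Share of fragments expected between k-1 and k standard deviations
    from the mean under a normal distribution."""
    def within(z):
        return math.erf(z / math.sqrt(2))  # P(|Z| <= z)
    return within(k) - within(k - 1)

print(sd_category(500, mean=500, sd=50))
print(sd_category(560, mean=500, sd=50))
print(sd_category(300, mean=500, sd=50))
print([round(expected_fraction(k), 4) for k in (1, 2, 3, 4)])
```

Under these assumptions, the expected fractions give a natural bound on how many reads may be assigned to each category, which is the role played by the constraint that restricts the number of reads per std. dev. category.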
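Comparison on Simulated Data
100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd
$$\mathrm{Sens} = \frac{TP}{TP + FN}$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
$$\mathrm{FScore} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}}$$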
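The three accuracy measures can be computed as in the following sketch. It assumes that a reconstructed transcript counts as a true positive only when its exon chain exactly matches an annotated transcript. That matching rule and the toy transcripts are assumptions for the example.

```python
def evaluate(predicted, annotated):
    """Return (sensitivity, PPV, F-score) for two collections of transcripts.

    Transcripts are hashable exon chains, e.g. tuples of exon ids.
    A prediction is a true positive only on an exact match.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    fscore = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return sens, ppv, fscore

# Hypothetical example: two of three annotated transcripts recovered,
# plus one prediction that matches nothing in the annotation.
annotated = [(1, 2, 3), (1, 3), (4, 5)]
predicted = [(1, 2, 3), (4, 5), (1, 2)]
sens, ppv, fscore = evaluate(predicted, annotated)
print(sens, ppv, fscore)
```

Here TP = 2, FP = 1 and FN = 1, so all three measures equal 2/3.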
Influence of Sequencing Parameters
100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd
TRIP-L : individual fragment lengths estimates
Results on Real RNA-Seq Data
• CD1 mouse retina RNA samples
• specific gene that has 33 annotated transcripts in Ensembl
• TRIP : 5 out of 10 transcripts, confirmed using qPCR.
• Cufflinks : 3 out of 10 transcripts
6906 alignments for 22346 read pairs with read length of 68
Conclusions
• We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads.
• Our method :
– exploits distribution of fragment lengths
– can use additional experimental data: TSS/TES (TRIP with TSS/TES) and individual fragment length estimates (TRIP-L)