Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Manuel Tardaguila, PhD
SQANTI: Classification, Curation and Quantification of a PacBio transcriptome
Ana Conesa Lab
PacBio in Oligodendrogenesis
Neural Stem Cells (neurospheres)
Oligodendrocyte Progenitor Cells
Immature Oligodendrocytes
Premyelinating mature Oligodendrocytes
Myelinated mature Oligodendrocytes
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAAAAAAA
AAAAA
AAAAA
AAAAAIndels
SQANTI
PacBio RoIs ToFU pipeline
ToFU transcripts
RefSeq
Illumina Short reads
cDNA
CuratedTranscriptome
QC features + Machine Learning
Global Transcriptome Expressed Transcriptomes
cDNA
Mapping
CorrectedTranscriptome
Tardaguila et al SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification Pre-print BioRxiv (2017)
1. CLASSIFICATION OF PACBIO TRANSCRIPTS
Splice-based classification (I): Known Transcripts
Full Splice Match (FSM)
Reference
PacBio
Reference
PacBio
Reference
PacBio
PacBio
Reference
Incomplete Splice Match (ISM)
Donor Acceptor
Splice Junction
n= 1
6,10
4
Splice-based classification (II): Novel Transcripts from known genes
n= 1
6,10
4
Novel SJs with known Donors and acceptors
Reference
PacBio
Reference
PacBio
RetainedIntron
New combination of Known SJs
Novel In Catalog (NIC)
Novel Not in Catalog (NNC)
Novel SJs from novel Donors and/or Acceptors
Reference
PacBio
Splice-based classification (III): Novel Transcripts from novel genes
n= 1
6,10
4
Intergenic
Fusion Transcripts
Gene A Gene B
PacBio
Genic Intron
PacBio
Exon A Exon BIntron A-B
Gene A Gene B
PacBio
Splice-based classification (IV): Antisense and Genic Genomic transcripts
n= 1
6,10
4Gene A
Ref 1Ref 2PacBio
Antisense
Antisense
Genic Genomic
PacBio
Exon A Exon BIntron A-B
CLASSIFICATION SUMMARY
• Splice based classification allows to separate known transcripts
(FSM and ISM) from novel transcripts arising from novel splice junctions
or new combinations of already known splice junctions (NIC and NNC)
• The predominant categories in our transcriptome are FSM,ISM, NIC
and NNC and they are mostly coding (oligodT library preparation)
2. CURATION OF PACBIO TRANSCRIPTS
40% novel isoforms in mouse... Are all of them real?
PacBio output needs curation
• NNC transcripts characterized by novel Donors and/or Acceptors of splicing concentrate traits of low quality transcripts
Using Short reads and Long reads to mine QC featuresB)
CorrectedTranscriptome
Features selected for Random Forest Classificationbite
test set
as.fa
ctor
(test
ing$
bite
)
NEG POS
FALS
ETR
UE
0.0
0.2
0.4
0.6
0.8
1.0
NEG POS
25
1020
5010
020
050
020
0050
00
FL
test set
NEG POS
5e+0
15e
+02
5e+0
35e
+04
5e+0
5
geneExp
test set
NEG POS
110
010
000
isoExp
test set
NEG POS
12
510
20
Min_sample_cov
test set
NEG POS1
100
1000
0
MinCov
test set
NEG POS
2050
100
200
500
1000
2000
5000
MinCovPos
test set
NEG POS
12
510
20
nIndels
test set
NEG POS
1.0
1.5
2.0
2.5
3.0
3.5
4.0
nIndelsJunc
test set
NEG POS
1.0
1.2
1.4
1.6
1.8
2.0
ratioExp
test set
NEG POS
110
100
1000
1000
0
sdCov
test set
FSM_class
test set
test
ing[
, i]
NEG POS
AB
C
0.0
0.2
0.4
0.6
0.8
1.0
NEG POS
510
2050
exons
test set
NEG POS
1000
2000
3000
4000
5000
7000
length
test set
PCR validation in an independent set of transcripts
PB.5557.1NNC isoform
Pgam5
463 pb*500 pb
+-
42
-+42
-+50
oligodTR.H.
Temp (°C)
* 895 pb
PB.4207.4NNC isoform
Trp53inp2
500 pb
1000 pb
+-
42
-+42
-+50
oligodTR.H.
Temp (°C)
PB.3658.3NNC isoform
Nolc1
729 pb500 pb
1000 pb
+-
42
-+42
-+50
oligodTR.H.
Temp (°C)
olig
o (d
T) P
CR
R.H
. PC
R
Positive
Negative
Total
Positive
Negative
Total
23 (3nc)
0
23
FSMNIC NNC
Fusion
7
2
9
8 (3 nc)
12 (8 nc)
20
1
2
3
5 (3nc)
0
5
7
0
7
2
6 (3nc)
8
1
0
1
Figure: First line: figure on the complete dataset. Second line: on the all mouse data; application of the classifier on all categories except FSM and ISM. Isoform distribution and quality control, threshold 0.5 . 1.2) Results on the PCR set:
Figure 3: ROC curve obtained on the PCR set. The red curve was obtained on the complete PCR set, the dark red curve on the PCR set reduced to the novel transcripts.
Figure 4: boxplot of the values of the random forest probabilities to be positive obtained on the PCR sets (red for the PCR negative transcripts, green for the PCR positive transcripts).
The area under roc obtained on the PCR sets is comprise between 96.66 (if all the type of sequences are evaluated) and 95.56 if the classifier is only applied to the novels.
Splice Junction independent curation: Intrapriming features
• oligodT can prime outside polyA regions in A rich regions inside transcripts• We looked for transcripts showing >= 80% Adenines in the 20 nts downstream (DNA) the end of thetranscript• If this transcripts lack a consensus poly Adenilation site we filter them out • 605 transcripts filtered out
FSM
% o
f in
trapr
imin
g Fi
ltere
d ou
t in
each
cat
egor
y
0
20
40
60
80
100
ISM NICNNC
Genic
Genom
ic
Antise
nse
Fusion
Genic
Intron
Interg
enic
Coding
Non Coding
n=35 n=81 n=135 n=17 n=108 n=17 n=210
SQANTI filter eliminates transcript with bad QC featuresn=
12,
908
SQANTI filter works across different PacBio datasets: Maize
• Before SQANTI
• AFTER SQANTI
SQANTI filter works across different PacBio datasets: MCF-7
• Before SQANTI
• AFTER SQANTI
CURATION SUMMARY
• PacBio output needs curation, specially in NNC novel acceptor/
donnor transcripts
• SQANTI mines features based in splice junctions, expression and
structure to feed a RF classifier that succeeds in filtering out transcripts
that fail to validate in an independent PCR set
• SQANTI works across different datasets
3. QUANTIFICATION OF PACBIO TRANSCRIPTOME
Quantification: PacBio curated vs RefSeq: Transcripts
21934 36748782
6408
239
6845
6,8456,408
Tran
scrip
t ex
pres
ion
(log(
TPM
))
0.0
2.5
5.0
7.5
10.0
Found by both PacBio Exclusive
OLP
repl
icat
e 1
R2 = 0.39
PacBio
R2 = 0.89
R2 = 0.99
OLP replicate 2
High Expn=6,357
Medium Expn=5,677
Low Expn=422
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.3R2 = 0.88
R2 = 0.99
Found by both RefSeq Exclusive
High Expn=7,440
Medium Expn=7,440
Low Expn=8,379
% o
f tot
al G
enes
0
20
40
60
80
100
12-34-5>= 6
Isoforms per gene
0
20
40
60
80
100
% o
f tot
al G
enes
ENSEMBL
NIC: n
ew co
mbinati
on of
know
n SJNIC
: mon
oexo
n by I
R
NIC: n
ovel
SJ of k
nown D
/ANNC
antis
ense
fusion
genic
geno
mic
genic
intro
n
interg
enic
% o
f nov
el tr
ansc
ripts
iPac
Bio
Excl
usiv
e
0
10
20
30
40
50
603,674 PacBio exclusive transcripts
12-34-5>= 6
Isoforms per gene
PacBio
RefSeq
239
0.0
2.5
5.0
7.5
10.0
Tran
scrip
t ex
pres
ion
(log(
TPM
))8,78221,934 3,674
Found by both PacBio Exclusive
High Expn=2,729
Medium Expn=4,181
Low Expn=174
Found by both RefSeq Exclusive
High Expn=3,330
Medium Expn=6,589
Low Expn=3,315
OLP
repl
icat
e 1
R2 = 0.92
PacBio
R2 = 0.99
R2 = 0.99
OLP replicate 2
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.93
R2 = 0.98
R2 = 0.99
• PacBio finds a set of highly expressed transcripts and reproducible transcripts in common with RefSeq• RefSeq detects lots of lowly expressed, difficult to reproduce transcripts. However it also captures exclusively 30% of highly expressed ones (PacBio 18%).
Most of the PacBio exclusive transcripts are new combinations of already known Splice Junction
21934 36748782
6408
239
6845
6,8456,408
Tran
scrip
t ex
pres
ion
(log(
TPM
))
0.0
2.5
5.0
7.5
10.0
Found by both PacBio Exclusive
OLP
repl
icat
e 1
R2 = 0.39
PacBio
R2 = 0.89
R2 = 0.99
OLP replicate 2
High Expn=6,357
Medium Expn=5,677
Low Expn=422
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.3R2 = 0.88
R2 = 0.99
Found by both RefSeq Exclusive
High Expn=7,440
Medium Expn=7,440
Low Expn=8,379
% o
f tot
al G
enes
0
20
40
60
80
100
12-34-5>= 6
Isoforms per gene
0
20
40
60
80
100
% o
f tot
al G
enes
ENSEMBL
NIC: n
ew co
mbinati
on of
know
n SJNIC
: mon
oexo
n by I
R
NIC: n
ovel
SJ of k
nown D
/ANNC
antis
ense
fusion
genic
geno
mic
genic
intro
n
interg
enic
% o
f nov
el tr
ansc
ripts
iPac
Bio
Excl
usiv
e
0
10
20
30
40
50
603,674 PacBio exclusive transcripts
12-34-5>= 6
Isoforms per gene
PacBio
RefSeq
239
0.0
2.5
5.0
7.5
10.0
Tran
scrip
t ex
pres
ion
(log(
TPM
))
8,78221,934 3,674
Found by both PacBio Exclusive
High Expn=2,729
Medium Expn=4,181
Low Expn=174
Found by both RefSeq Exclusive
High Expn=3,330
Medium Expn=6,589
Low Expn=3,315
OLP
repl
icat
e 1
R2 = 0.92
PacBio
R2 = 0.99
R2 = 0.99
OLP replicate 2
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.93
R2 = 0.98
R2 = 0.99
Quantification: PacBio curated vs RefSeq: Genes
21934 36748782
6408
239
6845
6,8456,408Tr
ansc
ript
expr
esio
n (lo
g(TP
M))
0.0
2.5
5.0
7.5
10.0
Found by both PacBio Exclusive
OLP
repl
icat
e 1
R2 = 0.39
PacBio
R2 = 0.89
R2 = 0.99
OLP replicate 2
High Expn=6,357
Medium Expn=5,677
Low Expn=422
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.3R2 = 0.88
R2 = 0.99
Found by both RefSeq Exclusive
High Expn=7,440
Medium Expn=7,440
Low Expn=8,379
% o
f tot
al G
enes
0
20
40
60
80
100
12-34-5>= 6
Isoforms per gene
0
20
40
60
80
100
% o
f tot
al G
enes
ENSEMBL
NIC: n
ew co
mbinati
on of
know
n SJNIC
: mon
oexo
n by I
R
NIC: n
ovel
SJ of k
nown D
/ANNC
antis
ense
fusion
genic
geno
mic
genic
intro
n
interg
enic
% o
f nov
el tr
ansc
ripts
iPac
Bio
Excl
usiv
e
0
10
20
30
40
50
603,674 PacBio exclusive transcripts
12-34-5>= 6
Isoforms per gene
PacBio
RefSeq
239
0.0
2.5
5.0
7.5
10.0
Tran
scrip
t ex
pres
ion
(log(
TPM
))
8,78221,934 3,674
Found by both PacBio Exclusive
High Expn=2,729
Medium Expn=4,181
Low Expn=174
Found by both RefSeq Exclusive
High Expn=3,330
Medium Expn=6,589
Low Expn=3,315
OLP
repl
icat
e 1
R2 = 0.92
PacBio
R2 = 0.99
R2 = 0.99
OLP replicate 2
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.93
R2 = 0.98
R2 = 0.99
• RefSeq detects lots of lowly expressed genes. However it captures an important fraction of highly expressed ones 19% (PacBio 1.5%)
Imposing a higher EXP threshold highlights advantages and disadvantages of PacBio quantification
5101 21117425
2110 59258385,8382,110
Tran
scrip
t ex
pres
ion
(log(
TPM
))
0.0
2.5
5.0
7.5
10.0
Found by both PacBio Exclusive
PacBio
OLP
repl
icat
e 1
R2 = -0.01R2 = 0.89
R2 = 0.99
OLP replicate 2
High Expn=6,350
Medium Expn=3,167
Low Expn=19
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.22R2 = 0.87
R2 = 0.99
Found by both RefSeq Exclusive
High Expn=7,339
Medium Expn=5,109
Low Expn=78
% o
f tot
al G
enes
12-34-5>= 6
Isoforms per gene
% o
f tot
al G
enes
12-34-5>= 6
Isoforms per gene
PacBio
RefSeq
592
0.0
2.5
5.0
7.5
10.0
Tran
scrip
t ex
pres
ion
(log(
TPM
))
7,4255,101 2,111
Found by both PacBio Exclusive
High Expn=2,698
Medium Expn=3,725
Low Expn=7
Found by both RefSeq Exclusive
High Expn=3,268
Medium Expn=4,488
Low Expn=186
OLP
repl
icat
e 1
R2 = 0.98
PacBio
R2 = 0.98
R2 = 0.99
OLP replicate 2
RefSeq
OLP
repl
icat
e 1
OLP replicate 2
R2 = 0.95
R2 = 0.98
R2 = 0.99
0
20
40
60
80
100
0
20
40
60
80
100
0
10
20
30
40
50
60
ENSEMBL
NIC: n
ew co
mbinati
on of
know
n SJNIC
: mon
oexo
n by I
R
NIC: n
ovel
SJ of k
nown D
/ANNC
antis
ense
fusion
genic
geno
mic
genic
intro
n
interg
enic
% o
f nov
el tr
ansc
ripts
iPac
Bio
Excl
usiv
e
2,111 PacBio exclusive transcripts
Analysis of Most Expressed Transcript (MET) reveal unaccounted 3’ end variability that is captured by PacBio
1000
500
0
500
1000
Dow
nstre
amU
pstre
am
Dis
tanc
e to
refe
renc
e T
TS (n
ts)
1000 pb1500 pb
1000 pb
1500 pb
1000 pb
1500 pb
Mean Transcript expresion (TPM)0 10 20 30 40 50
Read Coverage
Dhrs7b - Dehydrogenase / reductase (SDR family) member 7B
NM_145428 /PB.1035.1
NM_001172112 /PB.1035.2
XM_006532869
Same M
ET
Differe
nt MET
PB.1035.1
XM_006532869
NM_145428
PB.1035.2
NM_001172112
***
ExR
GlRLo
wes
t SJ
cove
rage
by
sho
rt re
ads
***ns
Same M
ET
Differe
nt MET
ExR
GlR
Same M
ET
Differe
nt MET
***ns
Low
est m
ean
cove
rage
of
an
exon
0
0
500
1000
1500
2000
0
500
1000
1500
2000
CUANTIFICATION SUMMARY
• PacBio captures a robust fraction of transcripts and genes being
expressed and at the same time allow for novel discovery.
• RefSeq captures more transcripts and genes, many lowly expressed
and hardly reproducible ones
• However, PacBio TOFU fails to capture many highly expressed
genes detected by RefSeq.
• PacBio transcriptome reveals unaccounted for 3’ end variability in
known transcripts that hamper RefSeq quantification.
Tardaguila et al SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification Pre-print BioRxiv (2017)
4. Functional outcome of alternative splicing: Transcript2GO
T2GO combines SQANTI classification with Genomic, transcript and protein annotation to maximize analytical possibilities
Transcript2GO: assessing the functional outcome of alternative splicingmanuscript in preparation
Diferential splicing linked to the appearance of sequence motifs on a transcriptome wide scale
NPC1 NPC2 OPC1 OPC2 MM
INPUT
NPC1 NPC2 OPC1 OPC2 M
Nuclear Fraction
NPC1 NPC2 OPC1OPC2
Cytosolic Fraction
GAPDH
H3-AcORF collapse
ns
***
***
Transcript2GO: assessing the functional outcome of alternative splicingmanuscript in preparation
Thanks!