SQANTI: Classification, Curation and Quantification of a ...€¦ · Disance o reerence TTS ns Upstream 1 b 1 b 1 b 1 b 1 b 1 b Mean Transcri eresion TPM 0 1 2 30 40 50 Read Coerage

Manuel Tardaguila, PhD

SQANTI: Classification, Curation and Quantification of a PacBio transcriptome

Ana Conesa Lab

PacBio in Oligodendrogenesis

Neural Stem Cells (neurospheres)

Oligodendrocyte Progenitor Cells

Immature Oligodendrocytes

Premyelinating mature Oligodendrocytes

Myelinated mature Oligodendrocytes

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAA

AAAAAAAAAA

AAAAA

AAAAA

AAAAAIndels

SQANTI

PacBio RoIs ToFU pipeline

ToFU transcripts

RefSeq

Illumina Short reads

cDNA

CuratedTranscriptome

QC features + Machine Learning

Global Transcriptome Expressed Transcriptomes

cDNA

Mapping

CorrectedTranscriptome

Tardaguila et al SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification Pre-print BioRxiv (2017)

1. CLASSIFICATION OF PACBIO TRANSCRIPTS

Splice-based classification (I): Known Transcripts

Full Splice Match (FSM)

Reference

PacBio

Reference

PacBio

Reference

PacBio

PacBio

Reference

Incomplete Splice Match (ISM)

Donor Acceptor

Splice Junction

n= 1

6,10

4

Splice-based classification (II): Novel Transcripts from known genes

n= 1

6,10

4

Novel SJs with known Donors and acceptors

Reference

PacBio

Reference

PacBio

RetainedIntron

New combination of Known SJs

Novel In Catalog (NIC)

Novel Not in Catalog (NNC)

Novel SJs from novel Donors and/or Acceptors

Reference

PacBio

Splice-based classification (III): Novel Transcripts from novel genes

n= 1

6,10

4

Intergenic

Fusion Transcripts

Gene A Gene B

PacBio

Genic Intron

PacBio

Exon A Exon BIntron A-B

Gene A Gene B

PacBio

Splice-based classification (IV): Antisense and Genic Genomic transcripts

n= 1

6,10

4Gene A

Ref 1Ref 2PacBio

Antisense

Antisense

Genic Genomic

PacBio

Exon A Exon BIntron A-B

CLASSIFICATION SUMMARY

• Splice based classification allows to separate known transcripts

(FSM and ISM) from novel transcripts arising from novel splice junctions

or new combinations of already known splice junctions (NIC and NNC)

• The predominant categories in our transcriptome are FSM,ISM, NIC

and NNC and they are mostly coding (oligodT library preparation)

2. CURATION OF PACBIO TRANSCRIPTS

40% novel isoforms in mouse... Are all of them real?

PacBio output needs curation

• NNC transcripts characterized by novel Donors and/or Acceptors of splicing concentrate traits of low quality transcripts

Using Short reads and Long reads to mine QC featuresB)

CorrectedTranscriptome

Features selected for Random Forest Classificationbite

test set

as.fa

ctor

(test

ing$

bite

)

NEG POS

FALS

ETR

UE

0.0

0.2

0.4

0.6

0.8

1.0

NEG POS

25

1020

5010

020

050

020

0050

00

FL

test set

NEG POS

5e+0

15e

+02

5e+0

35e

+04

5e+0

5

geneExp

test set

NEG POS

110

010

000

isoExp

test set

NEG POS

12

510

20

Min_sample_cov

test set

NEG POS1

100

1000

0

MinCov

test set

NEG POS

2050

100

200

500

1000

2000

5000

MinCovPos

test set

NEG POS

12

510

20

nIndels

test set

NEG POS

1.0

1.5

2.0

2.5

3.0

3.5

4.0

nIndelsJunc

test set

NEG POS

1.0

1.2

1.4

1.6

1.8

2.0

ratioExp

test set

NEG POS

110

100

1000

1000

0

sdCov

test set

FSM_class

test set

test

ing[

, i]

NEG POS

AB

C

0.0

0.2

0.4

0.6

0.8

1.0

NEG POS

510

2050

exons

test set

NEG POS

1000

2000

3000

4000

5000

7000

length

test set

PCR validation in an independent set of transcripts

PB.5557.1NNC isoform

Pgam5

463 pb*500 pb

+-

42

-+42

-+50

oligodTR.H.

Temp (°C)

* 895 pb


Trp53inp2

500 pb

1000 pb

+-

42

-+42

-+50

oligodTR.H.

Temp (°C)


Nolc1

729 pb500 pb

1000 pb

+-

42

-+42

-+50

oligodTR.H.

Temp (°C)

olig

o (d

T) P

CR

R.H

. PC

R

Positive

Negative

Total

Positive

Negative

Total

23 (3nc)

0

23

FSMNIC NNC

Fusion

7

2

9

8 (3 nc)

12 (8 nc)

20

1

2

3

5 (3nc)

0

5

7

0

7

2

6 (3nc)

8

1

0

1

Figure: First line: figure on the complete dataset. Second line: on the all mouse data; application of the classifier on all categories except FSM and ISM. Isoform distribution and quality control, threshold 0.5 . 1.2) Results on the PCR set:

Figure 3: ROC curve obtained on the PCR set. The red curve was obtained on the complete PCR set, the dark red curve on the PCR set reduced to the novel transcripts.

Figure 4: boxplot of the values of the random forest probabilities to be positive obtained on the PCR sets (red for the PCR negative transcripts, green for the PCR positive transcripts).

The area under roc obtained on the PCR sets is comprise between 96.66 (if all the type of sequences are evaluated) and 95.56 if the classifier is only applied to the novels.

Splice Junction independent curation: Intrapriming features

• oligodT can prime outside polyA regions in A rich regions inside transcripts• We looked for transcripts showing >= 80% Adenines in the 20 nts downstream (DNA) the end of thetranscript• If this transcripts lack a consensus poly Adenilation site we filter them out • 605 transcripts filtered out

FSM

% o

f in

trapr

imin

g Fi

ltere

d ou

t in

each

cat

egor

y

0

20

40

60

80

100

ISM NICNNC

Genic

Genom

ic

Antise

nse

Fusion

Genic

Intron

Interg

enic

Coding

Non Coding

n=35 n=81 n=135 n=17 n=108 n=17 n=210

SQANTI filter eliminates transcript with bad QC featuresn=

12,

908

SQANTI filter works across different PacBio datasets: Maize

• Before SQANTI

• AFTER SQANTI

SQANTI filter works across different PacBio datasets: MCF-7

• Before SQANTI

• AFTER SQANTI

CURATION SUMMARY

• PacBio output needs curation, specially in NNC novel acceptor/

donnor transcripts

• SQANTI mines features based in splice junctions, expression and

structure to feed a RF classifier that succeeds in filtering out transcripts

that fail to validate in an independent PCR set

• SQANTI works across different datasets

3. QUANTIFICATION OF PACBIO TRANSCRIPTOME

Quantification: PacBio curated vs RefSeq: Transcripts

21934 36748782

6408

239

6845

6,8456,408

Tran

scrip

t ex

pres

ion

(log(

TPM

))

0.0

2.5

5.0

7.5

10.0

Found by both PacBio Exclusive

OLP

repl

icat

e 1

R2 = 0.39

PacBio

R2 = 0.89

R2 = 0.99

OLP replicate 2

High Expn=6,357

Medium Expn=5,677

Low Expn=422

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.3R2 = 0.88

R2 = 0.99

Found by both RefSeq Exclusive

High Expn=7,440

Medium Expn=7,440

Low Expn=8,379

% o

f tot

al G

enes

0

20

40

60

80

100

12-34-5>= 6

Isoforms per gene

0

20

40

60

80

100

% o

f tot

al G

enes

ENSEMBL

NIC: n

ew co

mbinati

on of

know

n SJNIC

: mon

oexo

n by I

R

NIC: n

ovel

SJ of k

nown D

/ANNC

antis

ense

fusion

genic

geno

mic

genic

intro

n

interg

enic

% o

f nov

el tr

ansc

ripts

iPac

Bio

Excl

usiv

e

0

10

20

30

40

50

603,674 PacBio exclusive transcripts

12-34-5>= 6

Isoforms per gene

PacBio

RefSeq

239

0.0

2.5

5.0

7.5

10.0

Tran

scrip

t ex

pres

ion

(log(

TPM

))8,78221,934 3,674


High Expn=2,729

Medium Expn=4,181

Low Expn=174


High Expn=3,330

Medium Expn=6,589

Low Expn=3,315

OLP

repl

icat

e 1

R2 = 0.92

PacBio

R2 = 0.99

R2 = 0.99

OLP replicate 2

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.93

R2 = 0.98

R2 = 0.99

• PacBio finds a set of highly expressed transcripts and reproducible transcripts in common with RefSeq• RefSeq detects lots of lowly expressed, difficult to reproduce transcripts. However it also captures exclusively 30% of highly expressed ones (PacBio 18%).

Most of the PacBio exclusive transcripts are new combinations of already known Splice Junction

21934 36748782

6408

239

6845

6,8456,408

Tran

scrip

t ex

pres

ion

(log(

TPM

))

0.0

2.5

5.0

7.5

10.0


OLP

repl

icat

e 1

R2 = 0.39

PacBio

R2 = 0.89

R2 = 0.99

OLP replicate 2

High Expn=6,357

Medium Expn=5,677

Low Expn=422

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.3R2 = 0.88

R2 = 0.99


High Expn=7,440

Medium Expn=7,440

Low Expn=8,379

% o

f tot

al G

enes

0

20

40

60

80

100

12-34-5>= 6

Isoforms per gene

0

20

40

60

80

100

% o

f tot

al G

enes

ENSEMBL

NIC: n

ew co

mbinati

on of

know

n SJNIC

: mon

oexo

n by I

R

NIC: n

ovel

SJ of k

nown D

/ANNC

antis

ense

fusion

genic

geno

mic

genic

intro

n

interg

enic

% o

f nov

el tr

ansc

ripts

iPac

Bio

Excl

usiv

e

0

10

20

30

40

50


12-34-5>= 6

Isoforms per gene

PacBio

RefSeq

239

0.0

2.5

5.0

7.5

10.0

Tran

scrip

t ex

pres

ion

(log(

TPM

))

8,78221,934 3,674


High Expn=2,729

Medium Expn=4,181

Low Expn=174


High Expn=3,330

Medium Expn=6,589

Low Expn=3,315

OLP

repl

icat

e 1

R2 = 0.92

PacBio

R2 = 0.99

R2 = 0.99

OLP replicate 2

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.93

R2 = 0.98

R2 = 0.99

Quantification: PacBio curated vs RefSeq: Genes

21934 36748782

6408

239

6845

6,8456,408Tr

ansc

ript

expr

esio

n (lo

g(TP

M))

0.0

2.5

5.0

7.5

10.0


OLP

repl

icat

e 1

R2 = 0.39

PacBio

R2 = 0.89

R2 = 0.99

OLP replicate 2

High Expn=6,357

Medium Expn=5,677

Low Expn=422

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.3R2 = 0.88

R2 = 0.99


High Expn=7,440

Medium Expn=7,440

Low Expn=8,379

% o

f tot

al G

enes

0

20

40

60

80

100

12-34-5>= 6

Isoforms per gene

0

20

40

60

80

100

% o

f tot

al G

enes

ENSEMBL

NIC: n

ew co

mbinati

on of

know

n SJNIC

: mon

oexo

n by I

R

NIC: n

ovel

SJ of k

nown D

/ANNC

antis

ense

fusion

genic

geno

mic

genic

intro

n

interg

enic

% o

f nov

el tr

ansc

ripts

iPac

Bio

Excl

usiv

e

0

10

20

30

40

50


12-34-5>= 6

Isoforms per gene

PacBio

RefSeq

239

0.0

2.5

5.0

7.5

10.0

Tran

scrip

t ex

pres

ion

(log(

TPM

))

8,78221,934 3,674


High Expn=2,729

Medium Expn=4,181

Low Expn=174


High Expn=3,330

Medium Expn=6,589

Low Expn=3,315

OLP

repl

icat

e 1

R2 = 0.92

PacBio

R2 = 0.99

R2 = 0.99

OLP replicate 2

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.93

R2 = 0.98

R2 = 0.99

• RefSeq detects lots of lowly expressed genes. However it captures an important fraction of highly expressed ones 19% (PacBio 1.5%)

Imposing a higher EXP threshold highlights advantages and disadvantages of PacBio quantification

5101 21117425

2110 59258385,8382,110

Tran

scrip

t ex

pres

ion

(log(

TPM

))

0.0

2.5

5.0

7.5

10.0


PacBio

OLP

repl

icat

e 1

R2 = -0.01R2 = 0.89

R2 = 0.99

OLP replicate 2

High Expn=6,350

Medium Expn=3,167

Low Expn=19

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.22R2 = 0.87

R2 = 0.99


High Expn=7,339

Medium Expn=5,109

Low Expn=78

% o

f tot

al G

enes

12-34-5>= 6

Isoforms per gene

% o

f tot

al G

enes

12-34-5>= 6

Isoforms per gene

PacBio

RefSeq

592

0.0

2.5

5.0

7.5

10.0

Tran

scrip

t ex

pres

ion

(log(

TPM

))

7,4255,101 2,111


High Expn=2,698

Medium Expn=3,725

Low Expn=7


High Expn=3,268

Medium Expn=4,488

Low Expn=186

OLP

repl

icat

e 1

R2 = 0.98

PacBio

R2 = 0.98

R2 = 0.99

OLP replicate 2

RefSeq

OLP

repl

icat

e 1

OLP replicate 2

R2 = 0.95

R2 = 0.98

R2 = 0.99

0

20

40

60

80

100

0

20

40

60

80

100

0

10

20

30

40

50

60

ENSEMBL

NIC: n

ew co

mbinati

on of

know

n SJNIC

: mon

oexo

n by I

R

NIC: n

ovel

SJ of k

nown D

/ANNC

antis

ense

fusion

genic

geno

mic

genic

intro

n

interg

enic

% o

f nov

el tr

ansc

ripts

iPac

Bio

Excl

usiv

e


Analysis of Most Expressed Transcript (MET) reveal unaccounted 3’ end variability that is captured by PacBio

1000

500

0

500

1000

Dow

nstre

amU

pstre

am

Dis

tanc

e to

refe

renc

e T

TS (n

ts)

1000 pb1500 pb

1000 pb

1500 pb

1000 pb

1500 pb

Mean Transcript expresion (TPM)0 10 20 30 40 50

Read Coverage

Dhrs7b - Dehydrogenase / reductase (SDR family) member 7B

NM_145428 /PB.1035.1

NM_001172112 /PB.1035.2

XM_006532869

Same M

ET

Differe

nt MET

PB.1035.1

XM_006532869

NM_145428

PB.1035.2

NM_001172112

***

ExR

GlRLo

wes

t SJ

cove

rage

by

sho

rt re

ads

***ns

Same M

ET

Differe

nt MET

ExR

GlR

Same M

ET

Differe

nt MET

***ns

Low

est m

ean

cove

rage

of

an

exon

0

0

500

1000

1500

2000

0

500

1000

1500

2000

CUANTIFICATION SUMMARY

• PacBio captures a robust fraction of transcripts and genes being

expressed and at the same time allow for novel discovery.

• RefSeq captures more transcripts and genes, many lowly expressed

and hardly reproducible ones

• However, PacBio TOFU fails to capture many highly expressed

genes detected by RefSeq.

• PacBio transcriptome reveals unaccounted for 3’ end variability in

known transcripts that hamper RefSeq quantification.

Tardaguila et al SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification Pre-print BioRxiv (2017)

4. Functional outcome of alternative splicing: Transcript2GO

T2GO combines SQANTI classification with Genomic, transcript and protein annotation to maximize analytical possibilities

Transcript2GO: assessing the functional outcome of alternative splicingmanuscript in preparation

Diferential splicing linked to the appearance of sequence motifs on a transcriptome wide scale

NPC1 NPC2 OPC1 OPC2 MM

INPUT

NPC1 NPC2 OPC1 OPC2 M

Nuclear Fraction

NPC1 NPC2 OPC1OPC2

Cytosolic Fraction

GAPDH

H3-AcORF collapse

ns

***

***

Transcript2GO: assessing the functional outcome of alternative splicingmanuscript in preparation

Thanks!

Documents

SQANTI: Classification, Curation and Quantification of a ...€¦ · Disance o reerence TTS ns Upstream 1 b 1 b 1 b 1 b 1 b 1 b Mean Transcri eresion TPM 0 1 2 30 40 50 Read Coerage