Differential expression analysis of de novo assembled transcriptomes - Nadia Davidson

Murdoch Childrens Research Institute

WEHI Bioinformatics Seminar April 9th 2013

RNA-Seq on non-model organisms

• RNA-Seq is a powerful technology for studying the transcriptome:
  – Gene annotation, splice variants
  – Estimating gene abundance and differential gene expression

• In particular, these things can be done for non-model organisms
  – Without the need for a gene annotation
  – Without the need for a reference genome

• By de novo assembling the transcriptome
  – But it has its challenges

De novo assembly

Transcriptome Assemblers
– For genome assembly a k-mer length must be selected and optimised for the coverage level.
– But transcriptomes have a high dynamic range of coverage.
– Solution 1: Use a genome assembler, perform multiple assemblies with different k-mer values and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
– Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
– Many studies compare the different assemblers.
– Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled
  • Our aim

Our RNA-Seq dataset

• One HiSeq lane: 160 million 100 bp paired-end reads from chicken samples (4 female, 4 male)
  – Already had the data from another project
  – Model organism

• Assembled the data using Trinity and Oases.
  – Starting with these assemblies we investigated how to perform a differential expression analysis
  – 300k and 600k transcripts from Trinity and Oases respectively.

Q1. Why so many transcripts?

Transcripts grow with reads (Francis et al., BMC Genomics 2013)

– Sequencing errors
– Heterozygosity
– Different isoforms
– Paralogs

[Figure: assembly graph in which AGGTCTGA is followed by either ATTCGATG or ATTCCATG, and then by ACCTGAGA]

Reported Transcripts
AGGTCTGA ATTCGATG ACCTGAGA
AGGTCTGA ATTCCATG ACCTGAGA
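A minimal sketch of why a single-base difference leads to two reported transcripts. It assumes the three fragments in the figure are joined into one sequence per allele and uses an illustrative k-mer length of 7; neither choice comes from the slides, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one outgoing edge, i.e. where the graph forks."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two alleles that differ by one base (G vs C), built from the figure's fragments.
allele_1 = "AGGTCTGA" + "ATTCGATG" + "ACCTGAGA"
allele_2 = "AGGTCTGA" + "ATTCCATG" + "ACCTGAGA"

graph = de_bruijn_edges([allele_1, allele_2], k=7)
branches = branch_nodes(graph)
print(branches)                    # ['GAATTC']
print(sorted(graph[branches[0]]))  # ['AATTCC', 'AATTCG']
```

The graph forks once, at the 6-mer just before the differing base, so an assembler that follows both paths reports two transcripts for what is one gene.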

De Bruijn Graph Complexity

Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7

A simulation study of de novo transcriptome assemblies. 17K genes. 100 million, 100bp paired-end reads

“Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”

Simulation Study

Variation in coverage
• Across transcripts

[Figure: reported transcripts]

Different coverage could mean different contigs assembled for each k-mer

Other transcripts
• We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
• What are the other transcripts?

[Figure: transcripts classed as known genes, novel in genome, or novel not in genome, for the Trinity assembly and the Oases assembly]

Abundance of Gene Type from ENCODE, with > 100 million reads
S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233

[Figure: our novel genes, Trinity assembly]

Q2. Isoform or gene-level analysis?

The Cons

Isoforms
• List may be too long:
  – Difficult to interpret
  – Computationally expensive
  – Larger correction for multiple testing
• Not obvious how to assign ambiguously aligned reads
  – Can lead to double counting if ignored, or
  – Less power if reads are split between transcripts
• Not all transcripts represent different isoforms anyway

Genes
• Not sensitive to differential splicing
• Not obvious how transcripts should be clustered into genes

Q3. How to cluster transcripts into genes

Which clusters to use?

• This is not an obvious problem to solve:
  – Group transcripts which share sequence, i.e. only differ by splicing, SNPs or indels.
  – But place paralogs in a different cluster
  – This is complicated by the quality of the assembly, e.g.

[Figure: two incomplete sequences from the same gene (Gene A). Cluster ✓]

[Figure: Gene A and Gene B linked by a low coverage repeat sequence past the UTR. Do not cluster ✗]

Clustering Options

• What you can use:
  – The locus/component information from the assembler.
    • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
  – Sequence similarity clustering such as CD-HIT, Blastclust etc.
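A minimal sketch of using the assembler's own locus/component information as the clustering. The two name layouts in the regular expressions are assumptions about 2012-era Trinity and Oases output, not something stated in the slides, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed name layouts (check against your own assembly before relying on them):
#   Trinity: comp<N>_c<M>_seq<K>
#   Oases:   Locus_<N>_Transcript_<i>/<n>_Confidence_<x>_Length_<len>
TRINITY_NAME = re.compile(r"^(comp\d+_c\d+)_seq\d+")
OASES_NAME = re.compile(r"^(Locus_\d+)_Transcript_\d+")


def locus_of(name):
    """Return the locus/component part of a transcript name."""
    for pattern in (TRINITY_NAME, OASES_NAME):
        match = pattern.match(name)
        if match:
            return match.group(1)
    return name  # unrecognised names become singleton clusters


def cluster_by_locus(names):
    """Group transcript names by their locus/component prefix."""
    clusters = defaultdict(list)
    for name in names:
        clusters[locus_of(name)].append(name)
    return dict(clusters)


names = [
    "comp12_c0_seq1",
    "comp12_c0_seq2",
    "comp12_c1_seq1",
    "Locus_3_Transcript_1/2_Confidence_0.500_Length_840",
    "Locus_3_Transcript_2/2_Confidence_0.500_Length_910",
]
clusters = cluster_by_locus(names)
for locus, members in clusters.items():
    print(locus, len(members))
```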

We tested the accuracy of these clustering methods on our assemblies

“Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)

How we assess clustering
Scored based on correct/incorrect pairwise groupings (like for the Rand Index).

Example: true positives = 2, true negatives = 4, false positives = 2, false negatives = 2

TP = 2, TN = 6, FP = 0, FN = 2
False negatives indicate “under clustering”

TP = 4, TN = 0, FP = 6, FN = 0
False positives indicate “over clustering”
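The pairwise scoring can be sketched as below. The five-transcript truth clustering and the three candidate clusterings are hypothetical (the original figures are not preserved); they were chosen so that the counts match the three examples above.

```python
from itertools import combinations


def pairwise_scores(truth, predicted):
    """Count TP, TN, FP and FN over all pairs of transcripts.

    truth and predicted both map a transcript name to a cluster label.
    """
    scores = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            scores["TP"] += 1   # correctly share a cluster
        elif same_truth:
            scores["FN"] += 1   # incorrectly split ("under clustering")
        elif same_pred:
            scores["FP"] += 1   # incorrectly joined ("over clustering")
        else:
            scores["TN"] += 1   # correctly kept apart
    return scores


# Hypothetical truth: two genes, with three and two transcripts.
truth = {"t1": "A", "t2": "A", "t3": "A", "t4": "B", "t5": "B"}

mixed = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 2}
under = {"t1": 1, "t2": 1, "t3": 2, "t4": 3, "t5": 3}
over = {name: 1 for name in truth}

print(pairwise_scores(truth, mixed))  # {'TP': 2, 'TN': 4, 'FP': 2, 'FN': 2}
print(pairwise_scores(truth, under))  # {'TP': 2, 'TN': 6, 'FP': 0, 'FN': 2}
print(pairwise_scores(truth, over))   # {'TP': 4, 'TN': 0, 'FP': 6, 'FN': 0}
```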

Trinity Assembly Clustering

TP = true positives: number of pairs of transcripts which correctly share a cluster
FN = false negatives: number of pairs of transcripts which are incorrectly split

335,377 transcripts

[Figure: clustering results and number of clusters for the Trinity assembly, with a region marked “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Trinity]

Oases Assembly Clustering

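540,933 transcripts

[Figure: clustering results for the Oases assembly, with the number of clusters and the maximum number of transcripts in a cluster, and regions marked “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases]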

Can we do better?

• CD-HIT-EST uses only the sequence information, but we also have the reads
  – We could down-weight regions which are expressed at a low level
  – We could separate sequences which show different expression between sample groups
  – Using paired-end reads gives extra leverage to group transcripts

• We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster

The idea
– Multi-map reads to the assembly
– Separate transcripts into super-clusters
  • Transcripts are grouped if they share ANY reads with another transcript
– For each super-cluster
  • For each pair of transcripts, calculate the distance
  • To do: incorporate sample information too
– Hierarchically cluster the transcripts using the distance metric
– Stop when the distance between grouped transcripts is too large
  • This threshold is a parameter of the algorithm
– To do: output the counts for each cluster
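The super-cluster step can be sketched as finding connected components, where two transcripts are linked whenever at least one read multi-maps to both. This is a sketch under that reading, not the tool's actual implementation; the input layout (read ID mapped to a list of transcripts) and the function name are assumptions.

```python
from collections import defaultdict


def super_clusters(read_hits):
    """Group transcripts linked, directly or indirectly, by shared reads.

    read_hits maps a read ID to the list of transcripts it multi-maps to.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for transcripts in read_hits.values():
        if not transcripts:
            continue
        first = transcripts[0]
        find(first)  # register transcripts hit by a single read too
        for other in transcripts[1:]:
            union(first, other)

    groups = defaultdict(set)
    for transcript in list(parent):
        groups[find(transcript)].add(transcript)
    return sorted(sorted(group) for group in groups.values())


read_hits = {
    "read1": ["t1", "t2"],
    "read2": ["t2", "t3"],
    "read3": ["t4"],
    "read4": ["t4", "t5"],
}
print(super_clusters(read_hits))  # [['t1', 't2', 't3'], ['t4', 't5']]
```

Note that t1 and t3 share no read directly but land in the same super-cluster through t2; the later distance-based step is what may split them again.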

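Rab – Number of reads which map to both transcript a and transcript b

Step 1

[Figure not preserved]

Step 2 – make a distance matrix

[Figure: distance matrix with entries 0 (R=2), 0.5 (R=1), 0 (R=2), 0 (R=2), 0 (R=2)]

Distance = [formula not preserved], R = reads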

[Figure: updated distance matrix with entries 0 (R=2), 0.5 (R=1), 0 (R=2)]

Update: R2’2’ = R22 + R33 − R23, R12’ = max(R12, R13). Recalculate the distance.

[Figure: clustering tree with merges at Distance = 0, Distance = 0.5 and Distance = 1]

Cutting the tree at 0.5 or less would give the correct clustering
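A sketch of the hierarchical step, with two loud assumptions. First, the distance formula is not preserved in this transcript, so the code assumes distance = 1 − Rab / min(Raa, Rbb), which agrees with the matrix entries above (0 for R=2, 0.5 for R=1) if each transcript has two reads. Second, it assumes that after a merge the reads shared with any remaining transcript are the larger of the two merged members' shared counts. All names and example values are hypothetical.

```python
from itertools import combinations


def cluster_transcripts(totals, shared, max_distance):
    """Agglomerative clustering on shared-read counts.

    totals: transcript -> reads mapping to it (R_aa); every value must be > 0
    shared: (a, b) -> reads mapping to both a and b (R_ab); missing pairs are 0
    max_distance: merging stops once the closest pair is further apart than this
    """
    reads = {frozenset([t]): n for t, n in totals.items()}
    links = {frozenset([frozenset([a]), frozenset([b])]): n
             for (a, b), n in shared.items()}

    def dist(x, y):
        # ASSUMED formula, see the lead-in: 1 - R_xy / min(R_xx, R_yy)
        r_xy = links.get(frozenset([x, y]), 0)
        return 1.0 - r_xy / min(reads[x], reads[y])

    while len(reads) > 1:
        ordered = sorted(reads, key=sorted)  # deterministic tie-breaking
        x, y = min(combinations(ordered, 2), key=lambda pair: dist(*pair))
        if dist(x, y) > max_distance:
            break
        r_xy = links.pop(frozenset([x, y]), 0)
        merged_reads = reads.pop(x) + reads.pop(y) - r_xy
        merged = x | y
        for z in reads:
            # ASSUMED update: keep the larger of the two shared counts
            r = max(links.pop(frozenset([x, z]), 0),
                    links.pop(frozenset([y, z]), 0))
            if r:
                links[frozenset([merged, z])] = r
        reads[merged] = merged_reads
    return sorted(sorted(cluster) for cluster in reads)


totals = {"t1": 2, "t2": 2, "t3": 2, "t4": 2}
shared = {("t1", "t2"): 1, ("t2", "t3"): 2}

print(cluster_transcripts(totals, shared, max_distance=0.3))
# [['t1'], ['t2', 't3'], ['t4']]
print(cluster_transcripts(totals, shared, max_distance=0.5))
# [['t1', 't2', 't3'], ['t4']]
```

In this sketch a pair is merged when its distance is less than or equal to the threshold; whether the real tool treats the boundary that way is not stated in the slides.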

How do we do?

[Figure: clustering results for the Trinity assembly and the Oases assembly, each with a region marked “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]

Impact on differential expression (DE)
• To assess this we:
  – Mapped reads back to all transcripts (“best” mapping, bowtie)
  – Counted the reads which overlapped a transcript (samtools)
  – Added up all the counts for each cluster
  – Performed a DE analysis in edgeR for males vs. females
  – Compared against a “truth” DE list
    • Obtained from a genome based analysis on RefSeq genes (5 thousand genes)
    • True positives – false discovery rate < 0.05
    • RefSeq genes were identified in the de novo assembly
    • Non-identified clusters were excluded
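The "added up all the counts for each cluster" step can be sketched as follows. The data layout (one list of per-sample counts per transcript) and all names are assumptions for illustration; the resulting cluster-level table is what would then be passed to a DE tool such as edgeR.

```python
def cluster_counts(transcript_counts, cluster_of):
    """Sum per-sample read counts over the transcripts in each cluster.

    transcript_counts: transcript -> list of counts, one entry per sample
    cluster_of: transcript -> cluster label
    """
    totals = {}
    for transcript, per_sample in transcript_counts.items():
        cluster = cluster_of[transcript]
        row = totals.setdefault(cluster, [0] * len(per_sample))
        for i, count in enumerate(per_sample):
            row[i] += count
    return totals


# Hypothetical counts for two female and two male samples.
transcript_counts = {
    "t1": [10, 12, 3, 4],
    "t2": [5, 6, 1, 0],
    "t3": [7, 7, 8, 9],
}
cluster_of = {"t1": "geneA", "t2": "geneA", "t3": "geneB"}

gene_counts = cluster_counts(transcript_counts, cluster_of)
print(gene_counts)  # {'geneA': [15, 18, 4, 4], 'geneB': [7, 7, 8, 9]}
```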

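DE results - Oases

Conclusion: Better to “under” cluster than “over” cluster

[Figure: DE results for the Oases assembly, with a region marked “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]

DE results - Trinity

[Figure: DE results for the Trinity assembly, with regions marked “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]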

Q4. What is the best way to go from reads to counts?

Approaches
1. Do what we did before – add up counts for each cluster
2. Trinity and Oases suggest:
  – Multi-mapping reads to transcripts
  – Then use a program which can deal with ambiguously mapped reads
    • RSEM can take the clustering as input and return gene-level counts.
3. What people have actually done:
  – Select a set of representative transcripts (i.e. the longest one)
  – Map reads using their favorite mapper.
  – Count the number of reads which overlap the transcripts
  e.g. Sandmann et al., Genome Biology, 2011 12:R76

Gene–level counts

The alternatives (use the same clustering for all three approaches):
– Multi-map reads (Trinity script) → Get counts (RSEM) → Gene–level DE results
– Select representative transcript (longest) → Best-map reads (bowtie) → Count reads overlapping transcripts (samtools) → Gene–level DE results
– “Best” map reads (bowtie) → Count reads overlapping transcripts (samtools) → Add counts in a cluster (own script) → Gene–level DE results
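For the representative-transcript route, choosing the longest transcript in each cluster can be sketched as below. The input layout and names are hypothetical, and breaking ties by transcript name is an arbitrary choice made here so the result is reproducible.

```python
def longest_per_cluster(lengths, cluster_of):
    """Pick the longest transcript in each cluster as its representative.

    lengths: transcript -> length in bases
    cluster_of: transcript -> cluster label
    Ties are broken by taking the alphabetically first transcript name.
    """
    best = {}
    for transcript in sorted(lengths):
        cluster = cluster_of[transcript]
        if cluster not in best or lengths[transcript] > lengths[best[cluster]]:
            best[cluster] = transcript
    return best


lengths = {"t1": 840, "t2": 1310, "t3": 1310, "t4": 500}
cluster_of = {"t1": "geneA", "t2": "geneA", "t3": "geneA", "t4": "geneB"}

representatives = longest_per_cluster(lengths, cluster_of)
print(representatives)  # {'geneA': 't2', 'geneB': 't4'}
```

Reads would then be mapped to the representative sequences only, so reads from regions present only in the shorter transcripts are not counted in this approach.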

Results

[Figure: results for the Oases assembly and the Trinity assembly]

Difference between methods is small - could probably do any of them

Conclusions
• Q1. Why so many transcripts?
  – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
  – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
  – Sequencing to a high depth may give you more intergenic non-coding transcripts.

• Q2. Isoform or gene-level analysis?
  – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts

Conclusions cont.

• Q3. How to cluster transcripts into genes?
  – We found that Trinity’s clustering was good, but the clustering from Oases and CD-HIT-EST was poor
  – We are developing a tool for clustering which already works better than the alternatives, based on differential expression results

• Q4. What is the best way to go from reads to counts?
  – We compared three alternatives for mapping/abundance estimation
  – Results were similar for all three
  – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline

Future Work
• We have only looked at one RNA-Seq dataset.
  – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
    • Different species (all model organisms)
    • Different read depths

• Our clustering tool:
  – Would like to output the gene-level counts for each cluster.
    • Then compare to other abundance estimation approaches.
  – Would like to incorporate differences in expression between groups to improve the clustering

• More investigation into the pipeline methods
  – E.g. mapping

Acknowledgements

Chicken RNA-Seq Data from
Katie Ayers (MCRI)
Craig Smith (MCRI)

MCRI Bioinformatics
Alicia Oshlack
The Bioinformatics Group

VLSCI
AGRF

Red Jungle Fowl (credit: NHGRI)

Extra Slides

Trinity and Oases compared

[Figure: panels for Oases, Trinity (version from the start of 2012) and Trinity (version from the end of 2012)]

frac_match = length of the longest matching assembled transcript / “true” length of the transcript

Number of genes to transcripts (ordered by DE)

[Figure: panels for Chicken (Trinity) and Chicken (Oases)]
