Differential expression analysis of de novo assembled transcriptomes
Nadia Davidson
Murdoch Childrens Research Institute
WEHI Bioinformatics Seminar April 9th 2013
RNA-Seq on non-model organisms
• RNA-Seq is a powerful technology for studying the transcriptome:
  – Gene annotation, splice variants
  – Estimating gene abundance and differential gene expression
• In particular, these things can be done for non-model organisms
  – Without the need for a gene annotation
  – Without the need for a reference genome
• By de novo assembling the transcriptome
  – But it has its challenges
De novo assembly
Transcriptome Assemblers
– For genome assembly a k-mer length must be selected and optimised for the coverage level.
– But transcriptomes have a high dynamic range of coverage.
– Solution 1: Use a genome assembler, perform multiple assemblies with different k-mer values and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
– Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
– Many studies compare the different assemblers.
– Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled. This is our aim.
Our RNA-Seq dataset
• One HiSeq lane: 160 million 100 bp paired-end reads from 8 chicken samples (4 female, 4 male)
  – Already had the data from another project
  – Model organism
• Assembled the data using Trinity and Oases.
  – Starting with these assemblies we investigated how to perform a differential expression analysis
  – 300k and 600k transcripts from Trinity and Oases respectively.
Q1. Why so many transcripts?
Transcripts grow with reads (Francis et al., BMC Genomics 2013)
– Sequencing errors
– Heterozygosity
– Different isoforms
– Paralogs
[Figure: a bubble in the de Bruijn graph. AGGTCTGA is followed by either ATTCGATG or ATTCCATG, and both continue to ACCTGAGA.]
Reported Transcripts:
AGGTCTGA ATTCGATG ACCTGAGA
AGGTCTGA ATTCCATG ACCTGAGA
De Bruijn Graph Complexity
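A minimal sketch of why a single-base difference yields two reported transcripts, using the three sequence segments from the figure. It assumes graph nodes are k-mers joined when they overlap by k-1 bases, with k = 7 chosen only for illustration. The helper names (de_bruijn, all_paths, spell) are hypothetical and do not come from any assembler.

```python
from collections import defaultdict


def de_bruijn(sequences, k):
    """Map each k-mer to the set of k-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph


def all_paths(graph, node, end):
    """Yield every path from node to end (assumes the graph has no cycles)."""
    if node == end:
        yield [node]
        return
    for nxt in sorted(graph.get(node, ())):
        for rest in all_paths(graph, nxt, end):
            yield [node] + rest


def spell(path):
    """Turn a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


k = 7  # illustrative value only
allele_1 = "AGGTCTGA" + "ATTCGATG" + "ACCTGAGA"
allele_2 = "AGGTCTGA" + "ATTCCATG" + "ACCTGAGA"  # one base differs (G -> C)

graph = de_bruijn([allele_1, allele_2], k)
reported = sorted(spell(p) for p in all_paths(graph, allele_1[:k], allele_1[-k:]))
print(len(reported), "transcripts reported from one bubble")
```

Every additional bubble along the same gene multiplies the number of paths, which is one way the transcript count can grow far beyond the gene count.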
Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7
A simulation study of de novo transcriptome assemblies. 17K genes. 100 million, 100bp paired-end reads
“Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”
Simulation Study
Variation in coverage
• Across transcripts
[Figure: reported transcripts at different coverage levels]
Different coverage could mean different contigs assembled for each k-mer
Other transcripts
• We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
• What are the other transcripts?
[Figure: transcripts in the Trinity assembly and the Oases assembly, split into known genes, novel in genome, and novel not in genome]
Abundance of Gene Type from ENCODE, with > 100 million reads
S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233
Our novel genes
Trinity assembly
Q2. Isoform or gene-level analysis?
The Cons
Isoforms
• List may be too long:
  – Difficult to interpret
  – Computationally expensive
  – Larger correction for multiple testing
• Not obvious how to assign ambiguously aligned reads
  – Can lead to double counting if ignored, or
  – Less power if reads are split between transcripts
• Not all transcripts represent different isoforms anyway
Genes
• Not sensitive to differential splicing
• Not obvious how transcripts should be clustered into genes
Q3. How to cluster transcripts into genes
Which clusters to use?
• This is not an obvious problem to solve:
  – group transcripts which share sequence, i.e. only differ by splicing, SNPs or indels.
  – but place paralogs in a different cluster
  – This is complicated by the quality of the assembly, e.g.
    • Two incomplete sequences from the same gene (Gene A and Gene A): cluster ✓
    • Low coverage repeat sequence past the UTR (Gene A and Gene B): do not cluster ✗
Clustering Options
• What you can use:
  – The locus/component information from the assembler.
    • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
  – Sequence similarity clustering such as CD-HIT, Blastclust etc.
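A rough sketch of clustering by the locus part of the transcript name. The two name layouts below are assumptions based on how Trinity and Oases releases of that period typically labelled transcripts (for example comp12_c0_seq1 and Locus_12_Transcript_1/3_Confidence_0.500_Length_812). Check the headers in your own assembly and adjust the patterns. The function names are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMED name layouts; adjust these patterns to your assembler version.
LOCUS_PATTERNS = [
    re.compile(r"^(comp\d+_c\d+)_seq\d+"),       # Trinity-style name
    re.compile(r"^(Locus_\d+)_Transcript_\d+"),  # Oases-style name
]


def locus_of(name):
    """Return the locus/component prefix of a transcript name."""
    for pattern in LOCUS_PATTERNS:
        match = pattern.match(name)
        if match:
            return match.group(1)
    return name  # unknown layout: leave the transcript in its own cluster


def clusters_by_locus(names):
    """Group transcript names that share a locus/component prefix."""
    clusters = defaultdict(list)
    for name in names:
        clusters[locus_of(name)].append(name)
    return dict(clusters)


example = clusters_by_locus([
    "comp12_c0_seq1",
    "comp12_c0_seq2",
    "comp13_c0_seq1",
    "Locus_5_Transcript_1/2_Confidence_0.500_Length_812",
    "Locus_5_Transcript_2/2_Confidence_0.500_Length_640",
])
print(example)
```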
We tested the accuracy of these clustering methods on our assemblies
“Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)
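One possible reading of the "98% identity over 200 bases" rule, sketched as a filter over BLAT's tab-separated PSL output. The identity calculation here (matching bases divided by aligned bases, ignoring gaps) is an assumption, as is the function name truth_clusters. The exact criterion used for the slides is not stated.

```python
def truth_clusters(psl_lines, min_identity=0.98, min_bases=200):
    """Group transcripts (PSL query) by the RefSeq gene (PSL target) they match."""
    clusters = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip the optional PSL header lines
        matches = int(fields[0])
        mismatches = int(fields[1])
        rep_matches = int(fields[2])
        aligned = matches + mismatches + rep_matches
        if aligned < min_bases:
            continue  # alignment too short
        if (matches + rep_matches) / aligned < min_identity:
            continue  # alignment not similar enough
        query_name, target_name = fields[9], fields[13]
        clusters.setdefault(target_name, set()).add(query_name)
    return clusters
```

A transcript that passes the filter for two genes ends up in both sets, so ambiguous transcripts would need a further rule.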
[Figure: axes labelled “over clustered” and “under clustered”]
How we assess clustering
Scored based on correct/incorrect pairwise groupings (as for the Rand index).
Example 1: TP = 2, TN = 4, FP = 2, FN = 2
Example 2: TP = 2, TN = 6, FP = 0, FN = 2
False negatives indicate “under clustering”
Example 3: TP = 4, TN = 0, FP = 6, FN = 0
False positives indicate “over clustering”
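A short sketch of the pairwise scoring. The five-transcript truth and the three predicted clusterings are hypothetical, chosen so that the counts match the three sets of numbers above. The drawings on the original slide may have used different groupings.

```python
from itertools import combinations


def pairwise_counts(truth, predicted):
    """Count transcript pairs by whether truth and prediction group them together."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            counts["TP"] += 1   # correctly share a cluster
        elif same_truth:
            counts["FN"] += 1   # incorrectly split ("under clustering")
        elif same_pred:
            counts["FP"] += 1   # incorrectly merged ("over clustering")
        else:
            counts["TN"] += 1   # correctly kept apart
    return counts


# Hypothetical truth: transcripts a, b, c from one gene; d, e from another.
truth = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
mixed = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
under = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3}
over = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}

for name, predicted in [("mixed", mixed), ("under", under), ("over", over)]:
    print(name, pairwise_counts(truth, predicted))
```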
Trinity Assembly Clustering
TP = true positives: number of pairs of transcripts which correctly share a cluster
FN = false negatives: number of pairs of transcripts which are incorrectly split
335,377 transcripts
[Figure: clustering accuracy with the number of clusters for each method; axes labelled “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Trinity]
Oases Assembly Clustering
540,933 transcripts
[Figure: clustering accuracy with the number of clusters and the max transcripts in a cluster for each method; axes labelled “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases]
Can we do better?
• CD-HIT-EST uses only the sequence information, but we also have the reads
  – We could down-weight regions which are expressed at a low level
  – We could separate sequences which show different expression between sample groups
  – Using paired-end reads gives extra leverage to group transcripts
• We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster
The idea
– Multi-map reads to the assembly
– Separate transcripts into super-clusters
  • Transcripts are grouped if they share ANY reads with another transcript
– For each super-cluster
  • For each pair of transcripts, calculate the distance
  • To do: incorporate sample information too
– Hierarchically cluster the transcripts using the distance metric
– Stop when the distance between grouped transcripts is too large; this threshold is a parameter of the algorithm
– To do: output the counts for each cluster
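A minimal sketch of the super-cluster step, assuming the multi-mapping result is available as a dictionary from read name to the transcripts it hits. The union-find approach and the name super_clusters are illustrative choices, not the tool's actual implementation.

```python
from collections import defaultdict


def super_clusters(read_hits):
    """Group transcripts that are linked, directly or indirectly, by shared reads."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for transcripts in read_hits.values():
        transcripts = list(transcripts)
        if not transcripts:
            continue  # unmapped read
        root = find(transcripts[0])
        for other in transcripts[1:]:
            parent[find(other)] = root  # join every hit to the first one

    groups = defaultdict(set)
    for transcript in list(parent):
        groups[find(transcript)].add(transcript)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical multi-mapping: t1-t2 and t2-t3 share reads, t4-t5 share a read.
hits = {"r1": ["t1", "t2"], "r2": ["t2", "t3"], "r3": ["t4"], "r4": ["t5", "t4"]}
print(super_clusters(hits))
```

Splitting into super-clusters first means pairwise distances only need to be computed within each group, not across the whole assembly.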
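Rab – Number of reads which map to both transcript a and transcript b
Distance = [formula in figure], R = reads
Step 1
Step 2 – make a distance matrix
[Distance matrix for transcripts 1, 2, 3. Diagonal: 0 (R=2), 0 (R=2), 0 (R=2). Transcripts 1 and 2: 0.5 (R=1). Transcripts 1 and 3: 0.5 (R=1). Transcripts 2 and 3: 0 (R=2).]
Update after merging transcripts 2 and 3 into 2’: R2’2’ = R22 + R33 - R23, R12’ = max(R12, R13). Recalculate the distance.
[Updated distance matrix. Diagonal: 0 (R=2), 0 (R=2). Transcripts 1 and 2’: 0.5 (R=1).]
[Tree with merges at Distance = 0, Distance = 0.5 and Distance = 1]
Cutting the tree at 0.5 or less would give the correct clustering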
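A sketch of the merge loop on a three-transcript example with the read counts from the distance matrix. The distance formula is an assumption: 1 - Rab / min(Raa, Rbb) reproduces the values shown (0 when R=2, 0.5 when R=1, with two reads per transcript), but the slide's own formula did not survive, so other normalisations are possible. The update takes the larger of the two shared-read counts between the merged pair and each remaining transcript. Merging while the closest pair is no further apart than the threshold is also an assumption about how the cut-off is applied.

```python
from itertools import combinations


def distance(R, a, b):
    # ASSUMPTION: a guessed normalisation that matches the slide's example values.
    # Requires every transcript to have at least one read (R[a][a] > 0).
    return 1.0 - R[a][b] / min(R[a][a], R[b][b])


def cluster(R, threshold):
    """Agglomerative clustering on shared-read counts R[a][b]."""
    R = {a: dict(row) for a, row in R.items()}  # work on a copy
    clusters = {a: [a] for a in R}
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: distance(R, *pair))
        if distance(R, a, b) > threshold:
            break  # closest pair is too far apart: stop merging
        # Shared reads between the merged cluster and every other cluster.
        for c in clusters:
            if c != a and c != b:
                R[a][c] = R[c][a] = max(R[a][c], R[b][c])
        # Reads in the merged cluster: the union of the reads in a and b.
        R[a][a] = R[a][a] + R[b][b] - R[a][b]
        clusters[a].extend(clusters.pop(b))
    return sorted(sorted(members) for members in clusters.values())


# Read counts matching the matrix: two reads each, t2 and t3 share both.
R = {
    "t1": {"t1": 2, "t2": 1, "t3": 1},
    "t2": {"t1": 1, "t2": 2, "t3": 2},
    "t3": {"t1": 1, "t2": 2, "t3": 2},
}
print(cluster(R, threshold=0.3))
print(cluster(R, threshold=0.6))
```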
How do we do?
[Figure: clustering accuracy for the Trinity assembly and the Oases assembly; axes labelled “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]
Impact on differential expression (DE)
• To assess this we:
  – Mapped reads back to all transcripts (“best” mapping - bowtie)
  – Counted the reads which overlapped a transcript (samtools)
  – Added up all the counts for each cluster
  – Performed a DE analysis in edgeR for males vs. females
  – Compared against a “truth” DE list
    • Obtained from a genome-based analysis on RefSeq genes (5 thousand genes)
    • True positives: false discovery rate < 0.05
    • RefSeq genes were identified in the de novo assembly
    • Non-identified clusters were excluded
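The "added up all the counts for each cluster" step can be sketched as below. The input shapes (a dictionary of per-sample counts per transcript and a transcript-to-cluster map) and the name cluster_counts are assumptions for illustration. The resulting table is what would then be passed to edgeR.

```python
def cluster_counts(transcript_counts, cluster_of):
    """Sum per-sample read counts over all transcripts in each cluster."""
    totals = {}
    for transcript, counts in transcript_counts.items():
        cluster = cluster_of[transcript]
        previous = totals.get(cluster, [0] * len(counts))
        totals[cluster] = [p + c for p, c in zip(previous, counts)]
    return totals


# Hypothetical counts for two samples (one female, one male).
counts = {"t1": [10, 2], "t2": [5, 1], "t3": [0, 7]}
cluster_of = {"t1": "geneA", "t2": "geneA", "t3": "geneB"}
print(cluster_counts(counts, cluster_of))
```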
DE results - Oases
Conclusion: Better to “under” cluster than “over” cluster
[Figure: Oases DE results compared against the clustering plot with axes labelled “over clustered” and “under clustered”. Legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]
DE results - Trinity
[Figure: Trinity DE results with the same clustering plot and legend]
Q4. What is the best way to go from reads to counts?
Approaches
1. Do what we did before: add up counts for each cluster
2. Trinity and Oases suggest:
  – Multi-mapping reads to transcripts
  – Then use a program which can deal with ambiguously mapped reads
    • RSEM can take the clustering as input and return gene-level counts.
3. What people have actually done:
  – Select a set of representative transcripts (i.e. the longest one)
  – Map reads using their favorite mapper.
  – Count the number of reads which overlap the transcripts
  e.g. Sandmann et al., Genome Biology, 2011 12:R76
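For approach 3, picking the longest transcript per cluster could look like the sketch below. The input dictionaries and the name representatives are hypothetical, and ties are broken alphabetically purely so the result is deterministic.

```python
def representatives(lengths, cluster_of):
    """Return the longest transcript in each cluster."""
    best = {}
    for transcript in sorted(lengths):  # sorted so ties go to the first name
        cluster = cluster_of[transcript]
        if cluster not in best or lengths[transcript] > lengths[best[cluster]]:
            best[cluster] = transcript
    return best


# Hypothetical transcript lengths in bases.
lengths = {"t1": 1200, "t2": 3400, "t3": 800, "t4": 800}
cluster_of = {"t1": "geneA", "t2": "geneA", "t3": "geneB", "t4": "geneB"}
print(representatives(lengths, cluster_of))
```

Reads are then mapped only to these representatives, so reads from regions unique to the shorter transcripts are not counted.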
The alternatives
Use the same clustering for all three approaches:
1. “Best” map reads (bowtie), count reads overlapping transcripts (samtools), add counts in a cluster (own script)
2. Multi-map reads (Trinity script), get counts (RSEM)
3. Select representative transcript (longest), best-map reads (bowtie), count reads overlapping transcripts (samtools)
Each approach gives gene-level counts, which go into edgeR to give gene-level DE results.
Results
[Figure: results for the Oases assembly and the Trinity assembly]
Difference between methods is small - could probably do any of them
Conclusions
• Q1. Why so many transcripts?
  – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
  – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
  – Sequencing to a high depth may give you more intergenic non-coding transcripts.
• Q2. Isoform or gene-level analysis?
  – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts
Conclusions cont.
• Q3. How to cluster transcripts into genes?
  – We found that Trinity’s clustering was good, but the clustering from Oases and CD-HIT-EST was poor
  – We are developing a tool for clustering which already works better than the alternatives, based on differential expression results
• Q4. What is the best way to go from reads to counts?
  – We compared three alternatives for mapping/abundance estimation
  – Results were similar for all three
  – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline
Future Work
• We have only looked at one RNA-Seq dataset.
  – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
    • Different species (all model organisms)
    • Different read depths
• Our clustering tool:
  – Would like to output the gene-level counts for each cluster.
    • Then compare to other abundance estimation approaches.
  – Would like to incorporate differences in expression between groups to improve the clustering
• More investigation into the pipeline methods
  – E.g. mapping
Acknowledgements
Chicken RNA-Seq data from Katie Ayers (MCRI) and Craig Smith (MCRI)
MCRI Bioinformatics: Alicia Oshlack, The Bioinformatics Group
Red Jungle Fowl (credit: NHGRI)
VLSCI AGRF
Extra Slides
Trinity and Oases compared
[Figure panels: Oases; Trinity – version from the start of 2012; Trinity – version from the end of 2012]
frac_match = length of the longest matching assembled transcript / “true” length of the transcript
Number of genes to transcripts (ordered by DE)
[Figure panels: Yeast; Chicken (Trinity); Chicken (Oases)]