DESCRIPTION
With next generation sequencing it has become possible to analyse the transcriptome of non-model organisms by performing a de novo assembly of RNA-seq reads. In particular, differential expression analysis can be undertaken without the need for a reference genome or annotation. While a number of studies have compared the relative merits of different transcriptome assembly programs, less attention has been given to the methodology for performing a differential expression analysis after the transcriptome has been assembled. Differential expression analysis on a de novo assembly faces several challenges, including mapping reads to transcripts, clustering similar transcripts and producing a summary of read counts for statistical testing. In particular, we have found that transcriptome assembly produces a much larger number of transcripts than would generally be expected. I will discuss the reasons for this and will assess the different strategies for taking the de novo assembled transcripts and producing a list of differentially expressed genes. I demonstrate that clustering transcripts into loci improves the interpretability of results and increases statistical power, but that results are very dependent on the choice of clustering. Most clustering tools are not optimised for de novo assembled sequences, and to address this, we are developing a method which uses hierarchical clustering to group transcripts based on shared reads. We also explore possible choices for mapping and summarising read counts to gene clusters.
Differential expression analysis of de novo assembled transcriptomes
Nadia Davidson
Murdoch Childrens Research Institute
WEHI Bioinformatics Seminar April 9th 2013
RNA-Seq on non-model organisms
• RNA-Seq is a powerful technology for studying the transcriptome:
  – Gene annotation, splice variants
  – Estimating gene abundance and differential gene expression
• In particular, these things can be done for non-model organisms
  – Without the need for a gene annotation
  – Without the need for a reference genome
• By de novo assembling the transcriptome
  – But it has its challenges
De novo assembly
Transcriptome Assemblers
– For genome assembly a k-mer length must be selected and optimised for the coverage level.
– But transcriptomes have a high dynamic range of coverage
– Solution 1: Use a genome assembler, perform multiple assemblies with different k-mer values and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
– Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
– Many studies compare the different assemblers.
– Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled
  • Our aim
Our RNA-Seq dataset
• One HiSeq lane: 160 million 100 bp paired-end reads from chicken samples (4 female, 4 male)
  – Already had the data from another project
  – Model organism
• Assembled the data using Trinity and Oases.
  – Starting with these assemblies we investigated how to perform a differential expression analysis
  – 300k and 600k transcripts from Trinity and Oases respectively.
Q1. Why so many transcripts?
Transcripts grow with reads (Francis et al., BMC Genomics 2013)
– Sequencing errors
– Heterozygosity
– Different isoforms
– Paralogs
[Figure: de Bruijn graph with a bubble. AGGTCTGA is followed by either ATTCGATG or ATTCCATG, then ACCTGAGA.]
Reported Transcripts
AGGTCTGA ATTCGATG ACCTGAGA
AGGTCTGA ATTCCATG ACCTGAGA
De Bruijn Graph Complexity
Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7
A simulation study of de novo transcriptome assemblies. 17K genes. 100 million, 100bp paired-end reads
“Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”
Simulation Study
Variation in coverage
• Across transcripts
• Different coverage could mean different contigs assembled for each k-mer
[Figures: reported transcripts under varying coverage]
Other transcripts
• We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
• What are the other transcripts?
[Figure: transcripts classed as known genes, novel in genome, or novel not in genome, for the Trinity assembly and the Oases assembly]
S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233
with > 100 million reads
Abundance of Gene Type from ENCODE
Our novel genes
Trinity assembly
Q2. Isoform or gene-level analysis?
The Cons
Isoforms
• List may be too long:
  – Difficult to interpret
  – Computationally expensive
  – Larger correction for multiple testing
• Not obvious how to assign ambiguously aligned reads
  – Can lead to double counting if ignored, or
  – Less power if reads are split between transcripts
• Not all transcripts represent different isoforms anyway
Genes
• Not sensitive to differential splicing
• Not obvious how transcripts should be clustered into genes
Q3. How to cluster transcripts into genes
Which clusters to use?
• This is not an obvious problem to solve:
  – group transcripts which share sequence, i.e. only differ by splicing, SNPs or indels
  – but place paralogs in a different cluster
  – This is complicated by the quality of the assembly, e.g.
[Figure: Gene A and Gene A, two incomplete sequences from the same gene. Cluster ✓]
[Figure: Gene A and Gene B, joined by a low coverage repeat sequence past the UTR. Do not cluster ✗]
Clustering Options
• What you can use:
  – The locus/component information from the assembler.
    • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
  – Sequence similarity clustering such as CD-HIT, Blastclust etc.
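As a rough illustration of using the assembler's locus information, the sketch below pulls the locus part out of a transcript name. The two name patterns are assumptions about what Trinity and Oases names looked like at the time (for example `comp12_c0_seq3` and `Locus_7_Transcript_2/5_Confidence_0.600_Length_1234`); check them against your own assembly before relying on them. `locus_of` is a hypothetical helper, not part of either assembler.

```python
import re

# Assumed name layouts; adjust to match the headers in your FASTA file.
TRINITY_NAME = re.compile(r"^(comp\d+_c\d+)_seq\d+")
OASES_NAME = re.compile(r"^(Locus_\d+)_Transcript_\d+")


def locus_of(name):
    """Return the locus/component prefix of a transcript name."""
    for pattern in (TRINITY_NAME, OASES_NAME):
        match = pattern.match(name)
        if match:
            return match.group(1)
    raise ValueError("unrecognised transcript name: " + name)
```

Transcripts that return the same locus string would then be treated as one cluster.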
We tested the accuracy of these clustering methods on our assemblies
“Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)
[Plot axes: “over clustered” and “under clustered”]
How we assess clustering
Scored based on correct/incorrect pairwise groupings (as for the Rand index).
Example: true positives = 2, true negatives = 4, false positives = 2, false negatives = 2
TP = 2, TN = 6, FP = 0, FN = 2: false negatives indicate “under clustering”
TP = 4, TN = 0, FP = 6, FN = 0: false positives indicate “over clustering”
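The pairwise scoring can be sketched in a few lines of Python. The slide's diagrams did not survive extraction, so the five-transcript clusterings below are hypothetical, chosen only because they reproduce the three sets of counts quoted above; `pairwise_counts` is an illustrative name, not the script used for the talk.

```python
from itertools import combinations


def pairwise_counts(truth, predicted):
    """Count TP, TN, FP and FN over all unordered pairs of transcripts.

    truth and predicted each map a transcript name to a cluster label.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            tp += 1
        elif same_truth:
            fn += 1  # pair wrongly split: under clustering
        elif same_pred:
            fp += 1  # pair wrongly joined: over clustering
        else:
            tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical toy data: five transcripts from two true genes.
truth = {"t1": "A", "t2": "A", "t3": "A", "t4": "B", "t5": "B"}
mixed = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 2}
under = {"t1": 1, "t2": 1, "t3": 2, "t4": 3, "t5": 3}
over = {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 1}
```

With five transcripts there are 10 pairs, so the four counts always sum to 10, as they do in each example on the slide.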
Trinity Assembly Clustering
TP = true positives: number of pairs of transcripts which correctly share a cluster
FN = false negatives: number of pairs of transcripts which are incorrectly split
335,377 transcripts
[Plot: “over clustered” vs. “under clustered”, annotated with the number of clusters. Legend: ✕ Ideal, CD-HIT-EST, Trinity]
Oases Assembly Clustering
540,933 transcripts
[Plot: “over clustered” vs. “under clustered”, annotated with the number of clusters and the max transcripts in a cluster. Legend: ✕ Ideal, CD-HIT-EST, Oases]
Can we do better?
• CD-HIT-EST uses only the sequence information, but we also have the reads
  – We could down-weight regions which are expressed at a low level
  – We could separate sequences which show different expression between sample groups
  – Using paired-end reads gives extra leverage to group transcripts
• We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster
The idea
– Multi-map reads to the assembly
– Separate transcripts into super-clusters
  • Transcripts are grouped if they share ANY reads with another transcript
– For each super-cluster
  • For each pair of transcripts, calculate the distance
  • To do: incorporate sample information too
– Hierarchically cluster the transcripts using the distance metric
– Stop when the distance between grouped transcripts is too large; this threshold is a parameter of the algorithm
– To do: output the counts for each cluster
$R_{ab}$ – Number of reads which map to both transcript a and transcript b
Step 1
Step 2 – make a distance matrix
[Figure: distance matrix for the example transcripts, with entries 0 (R=2) and 0.5 (R=1)]
Distance = , R = reads
[Figure: Step 1 again, with entries 0 (R=2), 0.5 (R=1) and 0 (R=2)]
Update: $R_{2'2'} = R_{22} + R_{33} - R_{23}$, $R_{12'} = \max(R_{12}, R_{23})$. Recalculate the distance.
[Figure: tree with levels labelled Distance = 0, Distance = 0.5 and Distance = 1]
Cutting the tree at 0.5 or less would give the correct clustering
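A minimal sketch of the idea in Python, under stated assumptions. The distance formula on the slide was lost, so this uses 1 - R_ab / min(R_aa, R_bb), which is an assumption that merely agrees with the slide's values (0 when R=2, 0.5 when R=1, for transcripts with two reads each). Instead of the update shortcut shown on the slide, it keeps the set of reads for each cluster and recomputes shared reads after every merge; the merged set has R_aa + R_bb - R_ab reads, as in the update for the merged cluster. `cluster_by_shared_reads` and the toy reads are hypothetical, and this is not the tool under development.

```python
from itertools import combinations


def cluster_by_shared_reads(read_hits, threshold):
    """Greedy agglomerative clustering of transcripts on shared reads.

    read_hits maps a read id to the set of transcripts it maps to.
    Returns a list of frozensets of transcript names. Transcripts
    with no mapped reads never appear in the output.
    """
    # Reads supporting each cluster; every transcript starts alone.
    reads = {}
    for read, transcripts in read_hits.items():
        for transcript in transcripts:
            reads.setdefault(frozenset([transcript]), set()).add(read)

    def distance(a, b):
        shared = len(reads[a] & reads[b])
        # Assumed metric: 1 - R_ab / min(R_aa, R_bb).
        return 1.0 - shared / min(len(reads[a]), len(reads[b]))

    while len(reads) > 1:
        a, b = min(combinations(reads, 2), key=lambda pair: distance(*pair))
        if distance(a, b) > threshold:
            break  # closest pair is too far apart: stop merging
        # The merged cluster holds the union of both read sets.
        reads[a | b] = reads.pop(a) | reads.pop(b)
    return list(reads)


# Hypothetical toy data: a1 and a2 share most reads; b1 shares one
# repeat-derived read with a2.
read_hits = {
    "r1": {"a1", "a2"},
    "r2": {"a1", "a2"},
    "r3": {"a2", "b1"},
    "r4": {"b1"},
    "r5": {"b1"},
    "r6": {"b1"},
}
```

With a threshold of 0.3 the sketch groups a1 with a2 and leaves b1 alone; with 0.9 the single shared read is enough to pull all three together, which is the over-clustering case.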
How do we do?
[Plots for the Trinity assembly and the Oases assembly, axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist. = 0.1, 0.3, 0.5, 0.7, 0.9]
Impact on differential expression (DE)
• To assess this we:
  – Mapped reads back to all transcripts (“best” mapping, bowtie)
  – Counted the reads which overlapped a transcript (samtools)
  – Added up all the counts for each cluster
  – Performed a DE analysis in edgeR for males vs. females
  – Compared against a “truth” DE list
    • Obtained from a genome-based analysis on RefSeq genes (5 thousand genes)
    • True positives: false discovery rate < 0.05
    • RefSeq genes were identified in the de novo assembly
    • Non-identified clusters were excluded
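The "added up all the counts for each cluster" step might look like the sketch below. It assumes per-transcript counts are already available as one list per transcript with one entry per sample; the function name and toy numbers are hypothetical.

```python
def sum_counts_by_cluster(transcript_counts, cluster_of):
    """Add up per-sample read counts over the transcripts in each cluster.

    transcript_counts maps a transcript name to a list of counts,
    one per sample; cluster_of maps a transcript name to its cluster.
    """
    totals = {}
    for transcript, counts in transcript_counts.items():
        cluster = cluster_of[transcript]
        if cluster not in totals:
            totals[cluster] = [0] * len(counts)
        totals[cluster] = [x + y for x, y in zip(totals[cluster], counts)]
    return totals


# Hypothetical toy data: three transcripts, two samples.
toy_counts = {"t1": [5, 0], "t2": [3, 2], "t3": [1, 1]}
toy_clusters = {"t1": "g1", "t2": "g1", "t3": "g2"}
```

The resulting cluster-by-sample table is what would be handed to edgeR in place of a gene-level count matrix.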
DE results - Oases
[Plot with axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist. = 0.1, 0.3, 0.5, 0.7, 0.9]
Conclusion: Better to “under” cluster than “over” cluster
DE results - Trinity
[Plot with axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist. = 0.1, 0.3, 0.5, 0.7, 0.9]
Q4. What is the best way to go from reads to counts?
Approaches
1. Do what we did before: add up counts for each cluster
2. Trinity and Oases suggest:
  – Multi-mapping reads to transcripts
  – Then use a program which can deal with ambiguously mapped reads
    • RSEM can take the clustering as input and return gene-level counts.
3. What people have actually done:
  – Select a set of representative transcripts (i.e. the longest one)
  – Map reads using their favorite mapper.
  – Count the number of reads which overlap the transcripts
  e.g. Sandmann et al., Genome Biology, 2011 12:R76
[Flowchart: three routes to gene-level counts, each followed by edgeR to give gene-level DE results]
• Multi-map reads (Trinity script), then get counts (RSEM)
• Select representative transcript (longest), then best-map reads (bowtie), then count reads overlapping transcripts (samtools)
• The alternative: “best” map reads (bowtie), then count reads overlapping transcripts (samtools), then add counts in a cluster (own script)
Use the same clustering for all three approaches
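Picking the longest transcript as the representative for each cluster could be sketched as below. Transcript lengths are assumed to be known already (for example from the assembly FASTA); the function name and toy values are hypothetical, and ties are broken by keeping the first transcript seen.

```python
def longest_per_cluster(lengths, cluster_of):
    """Return the longest transcript in each cluster.

    lengths maps a transcript name to its length in bases;
    cluster_of maps a transcript name to its cluster.
    """
    best = {}
    for transcript, length in lengths.items():
        cluster = cluster_of[transcript]
        if cluster not in best or length > lengths[best[cluster]]:
            best[cluster] = transcript
    return best


# Hypothetical toy data.
toy_lengths = {"t1": 900, "t2": 1500, "t3": 400}
toy_membership = {"t1": "g1", "t2": "g1", "t3": "g2"}
```

Reads would then be mapped only to the selected transcripts, so each cluster is represented by a single sequence in the count table.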
Results
[Plots: Oases assembly and Trinity assembly]
Difference between methods is small - could probably do any of them
Conclusions
• Q1. Why so many transcripts?
  – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
  – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
  – Sequencing to a high depth may give you more intergenic non-coding transcripts.
• Q2. Isoform or gene-level analysis?
  – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts
Conclusions cont.
• Q3. How to cluster transcripts into genes?
  – We found that Trinity’s clustering was good, but the clustering from Oases and CD-HIT-EST was poor
  – We are developing a tool for clustering which already works better than the alternatives based on differential expression results
• Q4. What is the best way to go from reads to counts?
  – We compared three alternatives for mapping/abundance estimation
  – Results were similar for all three
  – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline
Future Work
• We have only looked at one RNA-Seq dataset.
  – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
    • Different species (all model organisms)
    • Different read depths
• Our clustering tool:
  – Would like to output the gene-level counts for each cluster.
    • Then compare to other abundance estimation approaches.
  – Would like to incorporate differences in expression between groups to improve the clustering
• More investigation into the pipeline methods
  – E.g. mapping
Acknowledgements
Chicken RNA-Seq Data from Katie Ayers (MCRI) Craig Smith (MCRI)
MCRI Bioinformatics Alicia Oshlack The Bioinformatics Group
Red Jungle Fowl (credit: NHGRI)
VLSCI AGRF
Extra Slides
Trinity and Oases compared Oases
Trinity – version from the start of 2012
Trinity – version from the end of 2012
frac_match = length of the longest matching assembled transcript / “true” length of the transcript
Number of genes to transcripts (ordered by DE)
Yeast
Chicken (Trinity)
Chicken (Oases)