Third part in the 'RNA-seq for DE analysis'. See http://www.bits.vib.be for more details.
Mapping to assign reads to genes
Joachim Jacob, 20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
Goal
Assign reads to genes.
The result of the mapping will be used to construct a summary of the counts: the count table.
GeneA: 12
GeneB: 5
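A minimal sketch of what a count table is, assuming each read has already been assigned to a gene (or to none). The read IDs and assignments are made up for illustration.

```python
from collections import Counter

def build_count_table(read_to_gene):
    """Count how many reads were assigned to each gene.

    read_to_gene maps a read ID to a gene name, or to None when the
    read could not be assigned to any gene.
    """
    return Counter(gene for gene in read_to_gene.values() if gene is not None)

# Hypothetical assignments, for illustration only.
assignments = {
    "read1": "GeneA",
    "read2": "GeneA",
    "read3": "GeneB",
    "read4": None,  # unassigned read, not counted
}
counts = build_count_table(assignments)
for gene, n in sorted(counts.items()):
    print(f"{gene}: {n}")
```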
2 scenarios
Reference genome sequence available
NO reference genome sequence available
● De novo assembly of the reads (Trinity) (transcriptome construction)
● Map the reads to the assembly (RSEM mapper)
● Extract count table
(Note: no removal of polyA is required. Computationally expensive!)
Reference genome sequence available
Preprocessed reads are mapped to the reference sequence:
1. The reference is a single haplotype, while the sample is a mixture of alleles: this leads to mismatches.
2. Reads contain sequencing errors.
3. Reads are derived from mRNA, while the genome is DNA.
35 → for 2 alleles together
If we compare samples within the same specimen, this effect is similar for all samples.
mRNA reads: some reads span introns
● Reads are derived from mRNA.
● Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.
[Figure: gene model with exons and introns, one isoform]
http://www.ensembl.org
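How a gapped alignment reveals introns can be sketched as follows. In the SAM format, a spliced alignment carries an `N` operation in its CIGAR string for the skipped reference region. The function name and the example values are hypothetical; positions are 1-based, as in SAM.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def introns_from_cigar(pos, cigar):
    """Return (start, end) of every skipped region (N) in an alignment.

    pos is the 1-based leftmost reference position of the alignment.
    Coordinates returned are 1-based and inclusive.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return introns

# A read starting at position 100: 30 aligned bases, a 200 bp gap, 20 aligned bases.
print(introns_from_cigar(100, "30M200N20M"))
```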
mRNA reads: multiple isoforms exist
● Isoforms are transcribed at different levels, contributing differently to the number of reads.
http://www.ensembl.org
Algorithm: gapped read mapping
● Exon-first approach: TopHat (popular)
TopHat: discovering splice junctions with RNA-Seq. Vol. 25 no. 9 2009, pages 1105–1111. doi:10.1093/bioinformatics/btp120
Junction database constructed to try to map unmapped reads.
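A toy sketch of the two-pass idea, not TopHat's actual algorithm: reads are first placed without gaps, and only the reads that remain unmapped are retried as two pieces separated by a gap. Exact string matching, the function name and the sequences are all simplifying assumptions.

```python
def map_exon_first(genome, reads, min_anchor=4):
    """Toy two-pass mapper.

    Pass 1 places reads without gaps. Pass 2 retries the unmapped reads
    by splitting them into two pieces that both match, with a gap between.
    Returns {read_name: list of (genome_start, length) blocks, or None}.
    Positions are 0-based.
    """
    placed = {}
    for name, seq in reads.items():
        pos = genome.find(seq)  # exact, ungapped match only
        placed[name] = [(pos, len(seq))] if pos != -1 else None

    for name, seq in reads.items():
        if placed[name] is not None:
            continue
        for split in range(min_anchor, len(seq) - min_anchor + 1):
            left_pos = genome.find(seq[:split])
            if left_pos == -1:
                continue
            # The right piece must start at least one base further downstream.
            right_pos = genome.find(seq[split:], left_pos + split + 1)
            if right_pos != -1:
                placed[name] = [(left_pos, split), (right_pos, len(seq) - split)]
                break
    return placed

# Hypothetical sequences, for illustration only.
exon1, intron, exon2 = "ATGGCCTTAT", "GTCCCCCCCCAG", "CGATTGACTA"
genome = exon1 + intron + exon2
reads = {"within_exon": "GGCCTT", "spans_intron": exon1[-5:] + exon2[:5]}
result = map_exon_first(genome, reads)
print(result)
```

The gap between the two blocks of the second read corresponds to the intron. Real mappers also tolerate mismatches and use splice site signals to place the junction.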
Principle of gapped read mapping
● STAR: fast and suited for longer reads
STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al., Bioinformatics
Checklist for mapping to reference genome
1. A reference genome sequence (fasta), to be indexed by the alignment software.
2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (optional, but highly recommended)
3. The cleaned (preprocessed) reads (fastq)
Getting your reference genome sequence
● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here
● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and follow the instructions of the indexing software.
● If still no luck, try a specialized species website, e.g.
Indexing a genome
● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is preprocessed by indexing. Indexing takes a lot of time, but it is what makes the mapping itself fast.
● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
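Why indexing makes mapping fast can be illustrated with a simple k-mer lookup table. This is only a sketch: the mappers mentioned here use more compact index structures (a Burrows-Wheeler based index for TopHat's underlying Bowtie, suffix arrays for STAR), and the function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Slow step, done once: record every position of every k-mer."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def lookup(index, genome, read, k):
    """Fast step, done per read: only check positions sharing the first k-mer.

    Assumes len(read) >= k. Returns 0-based positions of exact matches.
    """
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits

# Hypothetical sequence, for illustration only.
genome = "ACGTACGTTAGC"
index = build_kmer_index(genome, k=4)
print(lookup(index, genome, "ACGTT", k=4))
```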
Using genome annotation information
● Annotation info is stored in text files formatted as GTF or GFF3 files.
● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file for later use. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.
● We will use a GTF file from a respected genome database to assist the mapping of reads.
http://cufflinks.cbcb.umd.edu/
GTF example
TIP: You can look at an example in Galaxy!
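A GTF line has nine tab-separated columns, the last one holding `key "value";` attributes. A minimal parsing sketch follows; the example line, gene names and function name are made up for illustration.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_line(line):
    """Turn one GTF line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    # Repeated keys are collapsed to the last value in this sketch.
    record["attributes"] = dict(ATTRIBUTE.findall(fields[8]))
    return record

# Hypothetical annotation line, for illustration only.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";'
exon = parse_gtf_line(line)
print(exon["feature"], exon["start"], exon["end"], exon["attributes"]["gene_id"])
```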
Mapping in Galaxy
Basic settings
Mapping QC
● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.
● Set the mapping parameters fairly liberally: map as much as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').
● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
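Discarding multi reads can be sketched on plain SAM lines. This assumes the mapper writes the optional `NH` tag (number of reported alignments for the read); not every mapper does, so reads without the tag are treated conservatively here. The SAM lines are hypothetical.

```python
def is_uniquely_mapped(sam_line):
    """Return True if a SAM alignment line is mapped and has NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # bit 0x4 set: the read is unmapped
        return False
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # assumption: without an NH tag we cannot tell, so discard

# Hypothetical SAM records, for illustration only.
sam_lines = [
    "r1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
unique = [line.split("\t")[0] for line in sam_lines if is_uniquely_mapped(line)]
print(unique)
```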
Let's visualize
Check whether this visualization matches:
● paired end
● splice junctions
● strandedness
● ...
Practical tips
Add the GTF to the visualization
These are the reads, in 2 colours because of the sense and antisense strand. (Obviously this library was not stranded!)
Position on the reference genome sequence
Some reads span an intron
Mapping QC - RSeQC
After checking the mapping visually, determine more metrics with RSeQC.
http://rseqc.sourceforge.net/
Mapping QC - RSeQC
Duplication rate observed in the RNA-seq data.
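One simple way to think about a position-based duplication rate is shown below. This is an illustrative sketch, not RSeQC's implementation: reads that share chromosome, start position and strand are treated as duplicates of each other. The alignments are made up.

```python
from collections import Counter

def duplication_rate(alignments):
    """Fraction of reads that duplicate an already seen (chrom, start, strand)."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every read beyond the first at a position
    return duplicates / total

# Hypothetical alignments, for illustration only.
alignments = [("chr1", 100, "+")] * 3 + [("chr1", 200, "+"), ("chr1", 100, "-")]
rate = duplication_rate(alignments)
print(f"{rate:.2f}")
```

In RNA-seq, a high duplication rate is expected for highly expressed genes, so it has to be interpreted with care.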
Mapping QC - RSeQC
Read quality of aligned reads
Mapping QC - RSeQC
Sequence depth saturation
Early flattening points to saturation.
Q1 → Q4: from low count genes to high count genes
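The idea behind a saturation check can be sketched by subsampling: take increasing fractions of the reads and see whether the result still changes. This simplified sketch counts detected genes only and is not RSeQC's method; the gene assignments and function name are hypothetical.

```python
import random

def saturation_curve(read_genes, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Number of distinct genes detected at increasing sequencing depth.

    read_genes holds one gene name per assigned read. The reads are
    shuffled once and nested prefixes are taken, so the curve can
    never decrease.
    """
    reads = list(read_genes)
    random.Random(seed).shuffle(reads)
    curve = []
    for fraction in fractions:
        n = int(len(reads) * fraction)
        curve.append((fraction, len(set(reads[:n]))))
    return curve

# Hypothetical assignments, for illustration only.
read_genes = ["GeneA"] * 12 + ["GeneB"] * 5 + ["GeneC"]
curve = saturation_curve(read_genes)
for fraction, detected in curve:
    print(f"{fraction:.0%} of reads: {detected} genes detected")
```

If the number of detected genes stops increasing well before 100% of the reads, sequencing deeper is unlikely to add much.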
Mapping QC - RSeQC
[Figure: RSeQC metrics plot, with one result marked "Deviating!"]
Mapping QC - BamQC
Another useful tool is BamQC of the Qualimap Suite. Be aware however: also useful for DNA-seq!
http://qualimap.bioinfo.cipf.es/
Mapping QC - BamQC
Fraction of genome sequence not covered
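What this metric measures can be sketched by merging the aligned intervals and comparing the covered length with the genome length. This is an illustration, not Qualimap's code; intervals are assumed to be 0-based and half-open, and the numbers are made up.

```python
def fraction_not_covered(genome_length, intervals):
    """Fraction of the genome that no aligned interval overlaps.

    intervals is a list of (start, end) pairs, 0-based, end exclusive.
    Overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return (genome_length - covered) / genome_length

# Hypothetical alignments on a 100 bp genome, for illustration only.
print(fraction_not_covered(100, [(0, 10), (5, 20), (50, 60)]))
```

For RNA-seq a large uncovered fraction is expected, since only transcribed regions produce reads.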
Mapping QC: BamQC
Some examples to watch out for.
Keywords
haplotype
Gapped mapping
GTF
duplication
isoforms
strandedness
coverage
Write in your own words what the terms mean
Exercise
→ Mapping exercise
Break