RNA-seq: Mapping and quality control - part 3

Mapping to assign reads to genes

Joachim Jacob, 20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Assign reads to genes.

The result of the mapping will be used to construct a summary of the counts: the count table.

GeneA: 12
GeneB: 5
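As a minimal sketch of what "constructing the count table" means, the Python snippet below tallies reads per gene. The input format (pairs of read ID and gene ID) and all names in it are assumptions made for illustration; real counting tools work on alignment files and an annotation.

```python
from collections import Counter

def build_count_table(assignments):
    """Count how many reads were assigned to each gene.

    `assignments` is an iterable of (read_id, gene_id) pairs. Reads that
    could not be assigned to a gene carry gene_id None and are skipped.
    """
    counts = Counter()
    for _read_id, gene_id in assignments:
        if gene_id is not None:
            counts[gene_id] += 1
    return dict(counts)

# Hypothetical assignments, for illustration only.
example = [("r1", "GeneA"), ("r2", "GeneB"), ("r3", "GeneA"), ("r4", None)]
table = build_count_table(example)
for gene, n in sorted(table.items()):
    print(f"{gene}: {n}")
```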

2 scenarios

Reference genome sequence available

NO reference genome sequence available

● De novo assembly of the reads (Trinity) (transcriptome construction)

● Map the reads to the assembly (RSEM mapper)

● Extract count table

(Note: no removal of polyA is required. Computationally expensive!)

Reference genome sequence available

Preprocessed reads are mapped to the reference sequence:

1. The reference is a single haplotype, while the sample is a mixture of alleles: this leads to mismatches.

2. Reads contain sequencing errors

3. Reads derived from mRNA, genome is DNA.

→ 35 for 2 alleles together

If we compare samples within the same specimen, this effect is similar for all samples.

mRNA reads: some reads span introns

● Reads are derived from mRNA

Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions
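In SAM/BAM output, the gap of a spliced read is recorded in the CIGAR string as an N operation (skipped region on the reference). The sketch below, with hypothetical function names, shows how intron coordinates could be read back from the POS and CIGAR fields of one alignment.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_cigar(pos, cigar):
    """Return the reference intervals skipped by N operations.

    `pos` is the 1-based leftmost mapping position (SAM POS field).
    Intervals are returned as (start, end), 1-based and inclusive.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref += length
    return introns

# Hypothetical alignment: 30 matched bases, a 200 bp intron, 20 matched bases.
print(introns_from_cigar(100, "30M200N20M"))
```

Collecting these intervals over all reads gives the set of intron-exon junctions supported by the data.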

Figure (source: http://www.ensembl.org): gene model with the intron marked. One isoform!

mRNA reads: multiple isoforms exist

● Isoforms are transcribed at different levels, contributing differently to the number of reads.

http://www.ensembl.org

Algorithm: gapped read mapping

● Exon-first approach: TopHat (popular)

TopHat: discovering splice junctions with RNA-Seq. Bioinformatics Vol. 25 no. 9 2009, pages 1105–1111. doi:10.1093/bioinformatics/btp120

Junction database constructed to try to map unmapped reads.
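The two-pass idea can be sketched in a few lines of Python. This is a toy illustration, not TopHat's algorithm: it assumes the exon coordinates are already known and uses exact string matching only, whereas a real mapper infers candidate exons from the data and tolerates mismatches. All names are hypothetical.

```python
def exon_first_map(read, genome, exons):
    """Toy exon-first mapping with exact matches only.

    `exons` is a sorted list of (start, end) intervals on `genome`
    (0-based, end exclusive). Returns (kind, position, intron) or None.
    """
    exon_seqs = [genome[s:e] for s, e in exons]

    # Pass 1: try to place the read inside a single exon, without gaps.
    for (start, _end), seq in zip(exons, exon_seqs):
        offset = seq.find(read)
        if offset >= 0:
            return ("exon", start + offset, None)

    # Pass 2: junction database. Join each pair of consecutive exons and
    # accept only placements that really cross the exon boundary.
    for i in range(len(exons) - 1):
        left = exon_seqs[i]
        joined = left + exon_seqs[i + 1]
        offset = joined.find(read)
        if 0 <= offset < len(left) < offset + len(read):
            intron = (exons[i][1], exons[i + 1][0])
            return ("junction", exons[i][0] + offset, intron)

    return None

# Hypothetical 30 bp "genome": exon, intron, exon.
genome = "GATTACAGGC" + "GTAAAAAAAG" + "CCTTGGAACT"
exons = [(0, 10), (20, 30)]
print(exon_first_map("TTACAG", genome, exons))    # placed within one exon
print(exon_first_map("AGGCCCTT", genome, exons))  # placed across the junction
```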

Principle of gapped read mapping

● STAR: fast and suited for longer reads

STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al. Bioinformatics

Checklist for mapping to reference genome

1. A reference genome sequence (fasta), to be indexed by the alignment software.

2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (optional, but highly recommended)

3. The cleaned (preprocessed) reads (fastq)

Getting your reference genome sequence

● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here

● If your genome is not listed above, check http://ensembl.org or http://ensemblgenomes.org, and follow the instructions of your indexing software.

● If still no luck, try a specialized species website, e.g.

Indexing a genome

● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is preprocessed by indexing. Indexing takes a lot of time, but it only has to be done once per genome.

● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
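To illustrate why an index makes mapping fast, here is a deliberately simple k-mer lookup table in Python. It is only a sketch of the principle: real aligners use far more compact structures (Bowtie, used by TopHat, builds an FM index; STAR uses suffix arrays), and the function names here are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Slow step, done once: record every position of every k-mer."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)

def find_exact(read, reference, index, k):
    """Fast step, done per read: look up the first k-mer, then verify."""
    hits = []
    for start in index.get(read[:k], []):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits

# Hypothetical toy reference.
reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, 4)
print(find_exact("ACGTT", reference, index, 4))
```

The read is compared against only the few positions that share its first k-mer, instead of against every position in the genome.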

Using genome annotation information

● Annotation info is stored in text files formatted as GTF or GFF3 files.

● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GTF file that can be used in later steps. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.

● We will use a GTF file from a respected genome database to assist the mapping of reads.

http://cufflinks.cbcb.umd.edu/

Using genome annotation information

GTF example

TIP: You can look at an example in Galaxy!
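A GTF file has nine tab-separated columns: sequence name, source, feature, start, end, score, strand, frame and attributes. The sketch below parses one line into a dictionary. The example line is invented for illustration, and the attribute parsing is simplified (it assumes no semicolons inside quoted values).

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    # Attributes look like: gene_id "X"; transcript_id "Y";
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attrs,
    }

# Hypothetical record, not taken from a real annotation.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```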

Mapping in Galaxy

Basic settings

Mapping in Galaxy

Mapping QC

● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.

● Set the mapping parameters fairly liberally: map as much as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').

● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
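As a rough sketch of how uniquely mapped reads and multi reads can be told apart in SAM output, the snippet below uses the FLAG field (bit 0x4 marks an unmapped read) and the optional NH:i tag (number of reported alignments for the read). It assumes the mapper writes the NH tag, which TopHat and STAR do; the example records are invented. Note that it counts alignment records, so a multi read contributes one record per reported location.

```python
from collections import Counter

def classify_sam_line(line):
    """Classify one SAM alignment record as unmapped, unique, multi or unknown."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return "unmapped"
    # Optional tags start after the 11 mandatory columns.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return "unique" if int(tag[5:]) == 1 else "multi"
    return "unknown"  # mapped, but the mapper did not write an NH tag

def summarize(sam_lines):
    """Count record classes, skipping header lines (which start with '@')."""
    return Counter(classify_sam_line(l) for l in sam_lines if not l.startswith("@"))

# Hypothetical SAM records.
sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
counts = summarize(sam)
print(dict(counts))
```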

Let's visualize

Check whether this visualization matches:
- paired end
- splice junctions
- strandedness
- ...

Practical tips

Add the GTF to the viz

These are the reads, in 2 colours because of the sense and antisense strand. (Obviously this library was not stranded!)

Position on the reference genome sequence

Some reads span an intron
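The strand impression from the viewer can be backed by a quick count. In SAM, FLAG bit 0x10 means the read aligned to the reverse strand. The sketch below, a simplification for single-end reads with a hypothetical function name, computes the fraction of mapped reads on the forward strand within one gene. A value near 0.5 suggests an unstranded library; a value near 0 or 1 suggests a stranded one.

```python
def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand.

    `flags` is an iterable of SAM FLAG integers for reads overlapping
    one gene. Returns None when no read is mapped.
    """
    mapped = [f for f in flags if not f & 0x4]  # drop unmapped reads
    if not mapped:
        return None
    forward = sum(1 for f in mapped if not f & 0x10)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads.
print(forward_fraction([0, 16, 0, 16]))
```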

Mapping QC - RSeQC

After checking the mapping visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Mapping QC - RSeQC

Duplication rate observed in the RNA-seq data.
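A duplication rate can be sketched as follows. The key that defines "the same read" is a choice: the read sequence itself, or the mapping position (for example chromosome, start and strand). This is a simplified illustration with hypothetical names, not a reimplementation of RSeQC.

```python
from collections import Counter

def duplication_summary(keys):
    """Summarize duplication among reads identified by `keys`.

    Returns (rate, histogram): `rate` is the fraction of reads that are
    extra copies of an already seen key, and `histogram` maps an
    occurrence count to the number of distinct keys seen that often.
    """
    per_key = Counter(keys)
    total = sum(per_key.values())
    distinct = len(per_key)
    rate = 0.0 if total == 0 else 1 - distinct / total
    histogram = Counter(per_key.values())
    return rate, dict(histogram)

# Hypothetical keys: one read seen twice, one once, one three times.
rate, histogram = duplication_summary(["A", "A", "B", "C", "C", "C"])
print(rate, histogram)
```

In RNA-seq, highly expressed genes naturally produce many identical reads, so a high duplication rate is not by itself a sign of a bad library.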

Mapping QC - RSeQC

Read quality of aligned reads

Mapping QC - RSeQC

Sequence depth saturation

Early flattening points to saturation

Q1 → Q4: from low count genes to high count genes
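The idea behind a saturation analysis can be sketched by subsampling: take increasing fractions of the reads and see how many genes are detected each time. The snippet below is a simple proxy under stated assumptions (one gene label per read, detection defined by a hypothetical `min_count` threshold); RSeQC's own saturation analysis works on expression estimates rather than on detection counts.

```python
import random
from collections import Counter

def saturation_curve(read_genes, fractions, min_count=1, seed=0):
    """Number of genes detected at increasing sequencing depth.

    `read_genes` holds one gene label per assigned read. The reads are
    shuffled once and growing prefixes are used as subsamples, so the
    curve can never decrease.
    """
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = int(round(frac * len(shuffled)))
        counts = Counter(shuffled[:n])
        detected = sum(1 for c in counts.values() if c >= min_count)
        curve.append((frac, detected))
    return curve

# Hypothetical data: one high, one medium and one low count gene.
reads = ["A"] * 50 + ["B"] * 10 + ["C"] * 1
curve = saturation_curve(reads, [0.25, 0.5, 1.0])
print(curve)
```

If the curve flattens well before 100% of the reads, additional sequencing would add little for that class of genes.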

Mapping QC - RSeQC

After checking visually, determine more metrics with RSeQC.

Deviating!

Mapping QC - BamQC

Another useful tool is BamQC of the Qualimap Suite. Be aware however: also useful for DNA-seq!

http://qualimap.bioinfo.cipf.es/

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Fraction of genome sequence not covered
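How such a fraction can be obtained is sketched below: merge the aligned intervals and compare the covered length with the genome length. This is an illustration under simple assumptions (a single sequence, 0-based intervals with exclusive end, hypothetical names), not Qualimap's implementation.

```python
def uncovered_fraction(genome_length, intervals):
    """Fraction of the reference that no aligned interval covers.

    `intervals` is an iterable of (start, end) pairs, 0-based with an
    exclusive end. Overlapping intervals are counted only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_end = 0  # everything before this position is already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return (genome_length - covered) / genome_length

# Hypothetical alignments on a 100 bp reference.
print(uncovered_fraction(100, [(0, 10), (5, 20), (50, 60)]))
```

For RNA-seq a large uncovered fraction is expected, because only transcribed regions receive reads.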

Mapping QC: BamQC

Some examples to watch out for.

Keywords

haplotype

Gapped mapping

duplication

isoforms

strandedness

coverage

Write in your own words what the terms mean

Exercise

→ Mapping exercise
