RNA-Seq Software, Tools, and Workflows · using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data. 2. Paired-end reads (finding novel transcripts in a genome with

RNA-Seq Software, Tools, and Workflows

Monica Britton, Ph.D. Sr. Bioinformatics Analyst

June 2016 Workshop

Some mRNA-Seq Applications

• Differential gene expression analysis

• Transcriptional profiling

• Identification of splice variants

• Novel gene identification

• Transcriptome assembly

• SNP finding

• RNA editing

Assumption: Changes in transcription/mRNA levels correlate with phenotype (protein expression)

Experimental Design

• What biological question am I trying to answer?

• What types of samples (tissue, timepoints, etc.)?

• How much sequence do I need?

• Length of read?

• Platform?

• Single-end or paired-end?

• Barcoding?

• Pooling?

• Biological replicates: how many?

• Technical replicates: how many?

• Protocol considerations?

Strand-Specific (Directional) RNA-Seq

• Now the default for Illumina TruSeq kits

• Preserves orientation of RNA after reverse transcription to cDNA

• Informs alignments to genome – Determine which genomic DNA strand is transcribed

– Identify anti-sense transcription (e.g., lncRNAs)

– Quantify expression levels more precisely

– Demarcate coding sequences in microbes with overlapping genes

• Very useful in transcriptome assemblies – Allows precise construction of sense and anti-sense transcripts

--3’ 5’--

--3’ 5’--

Influence of the Organism

• Novel – little/no previous sequencing (may need assembly)

• Non-Model – some sequence available (draft genome or transcriptome assembly)

– Thousands of scaffolds, maybe tens of chromosomes

– Some annotation (ab initio, EST-based, etc.)

• Model – genome fully sequenced and annotated – Multiple genomes available for comparison

– Well-annotated transcriptome based on experimental evidence

– Genetic maps with markers available

– Basic research can be conducted to verify annotations (mutants available)

Differential Gene Expression Generalized Workflow

Read QC and Trimming

Scythe

Sickle

Stringency of trimming depends on read length and downstream processing.

Alignment can often tolerate some amount of contamination.

Various sizes of trimmed reads can introduce biases in alignments.

Transcriptome assembly works best with aggressive trimming.

QC and Alignment

You may want to align your reads to a few different datasets, including

• rRNA of your organism (or related organisms)

• Contaminant screening, if desirable

• Genome or transcriptome to generate counts for differential expression analysis

First we’ll go through alignment to a gene set/transcriptome, then we’ll look at aligning to a genome…

Alignment – Choosing an Aligner (Pt. 1 – Gene Set)

• Examples of gene sets:

– rRNA sequences

– potential contaminant sequences (such as PhiX),

– a transcriptome (eukaryotic or bacterial), where each sequence in the fasta represents a different transcript (mRNA)

– De novo transcriptome assembly

• A splicing-aware aligner is not needed

• bwa aln (<76 bp), bwa mem (>75 bp), or bowtie2

• Requiring strand-specific orientation can be more difficult (requires filtering the bam file).

Library QC – rRNA Contamination

• Total RNA contains >95% rRNA

• Poly A selected libraries should contain <5% rRNA

• rRNA-depleted libraries should contain <15% rRNA

• Align reads to rRNA sequences from organism or relatives

• Higher than expected or inconsistent rRNA content may indicate degraded RNA or problems in library prep.

• Generally, don’t need to remove rRNA reads

Total RNA sample

rRNA

mRNA

After PolyA isolation

rRNA

mRNA

After rRNA depletion

rRNA

mRNA

Insert Size (determined from gene set alignments)

The “insert” is the cDNA (or RNA) ligated between the adapters.

Typical insert size is 160-200 bases, but can be larger.

Insert size distribution depends on library prep method.

Adapter ~60 bases

Adapter ~60 bases

cDNA 100-800 bases

=Insert Size

Single-end read

Gapped paired-end reads

Overlapping paired-end reads

0 20 40 60 80 100 120 140 160 180 200

Paired-End Reads

0 20 40 60 80 100 120 140 160 180 200

Read 1

Read 2

0 20 40 60 80 100 120 140 160 180 200

Read 1

Read 2

INSERT SIZE (TLEN)

INNER DISTANCE (+ or -)

Generating Raw Counts from Gene Set Alignments

The sam/bam file can be filtered for just those pairs where both reads align in concordantly (proper pairs) to one transcript.

sam2counts or samtools idxstats can be used for counting.

Counts per transcript need to be added up to counts per gene before running stats software

Alignment to a Genome - Choosing a Reference

• Human/mouse: GENCODE (uses Ensembl IDs) (http://www.gencodegenes.org/), but may need some manipulation to work with certain software

• Ensembl genomes (http://ensemblgenomes.org/) and Biomart (http://www.ensembl.org/biomart/martview/)

• Illumina igenomes (http://support.illumina.com/sequencing/ sequencing_software/igenome.html) provides genome indexes for some software, and files with extra info for Tophat/cufflinks.

• NCBI genomes (http://www.ncbi.nlm.nih.gov/genome/)

• Many specialized databases (Phytozome, Patric, VectorBase, FlyBase, WormBase)

• “Do it yourself” genome assembly and gene-finding (don’t forget functional annotation)

http://www.gencodegenes.org/

http://ensemblgenomes.org/

http://www.ensembl.org/biomart/martview/

http://support.illumina.com/sequencing/sequencing_software/igenome.html




http://www.ncbi.nlm.nih.gov/genome/

Alignment – Choosing an Aligner (Pt. 2 – Genome)

Eukaryotic genes are separated into exons and introns, so if you are using a genome reference, you need an aligner that is either “splicing-aware” or won’t choke on spliced RNA reads.

A splicing-aware aligner will recognize the difference between a short insert and a read that aligns across exon-intron boundaries

Read Pairs (joined by solid line)

Spliced Reads (joined by dashed line)

(Some) Splicing-Aware or Splicing-Agnostic Aligners

• Tophat2 (bowtie2) – what we’ll be using in these exercises. (remember, igenomes); also integrates with cufflinks to find novel genes and transcripts

• HISAT/HISAT2 – successor to tophat2, and also works for alignment of populations (i.e., multiple human samples). I haven’t tried this yet. It’s on our Galaxy … try it out and report what happens.

• GSNAP – I haven’t used this; can handle (and produce) non-standard file formats.

• STAR – this is what I use. It’s very fast and doesn’t need a separate counting step. But it does need a lot of memory (>128G RAM to index a mammalian genome).

• bwa mem – a local aligner, is not splicing-aware, but can handle local alignments (doesn’t require the entire read to align).

Generating Raw Counts from Genome Alignments

Reads aligning to a chromosome or scaffold must be parsed to assign them to a gene.

We’ll need a “roadmap” to indicate where each gene (exon) is located in the genome

Generating Raw Counts from Genome Alignments

This roadmap is the GTF (Gene Transfer Format) file:

Let’s look at the GTF fields in more detail …

The right column includes attributes, including gene ID, etc.

The left columns list source, feature type, and genomic coordinates

Fields in the GTF File (left to right)

Sequence Name (i.e., chromosome, scaffold, etc.) chr12

Source (program that generated the gtf file or feature) unknown

Feature (i.e., gene, exon, CDS, start codon, stop codon) CDS

Start (starting location on sequence) 3677872

End (end position on sequence) 3678014

Score .

Strand (+ or -) +

Frame (0, 1, or 2: which is first base in codon, zero-based) 2

Attribute (“;”-delimited list of tags with additional info)

gene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933";

transcript_id "NM_019854"; tss_id "TSS4368";

An Unusual GTF File

Generating Gene Counts from Genome Alignments

• HTSeq-count – runs rather slowly; doesn’t work well on coordinate-sorted bam files.

• FeatureCounts – Runs faster, and will work on coordinate-sorted bam files. Has more parameters than HTSeq-count.

• STAR – Will generate counts equivalent to HTSeq-count with default parameters, in the same run as alignment.

• All of these can utilize strand-specific information in GTF file to assign reads from strand-specific libraries.

• But what about those reads that can’t be assigned unambiguously to a gene?

Multiple-Mapping Reads (“Multireads”)

• Some reads will align to more than one place in the reference, due to: – Common domains, gene families

– Paralogs, pseudogenes, polyploidy etc.

• This can distort counts, leading to misleading expression levels

• If a read can’t be uniquely mapped, how should it be counted?

• Should it be ignored (not counted at all)?

• Should it be randomly assigned to one location among all the locations to which it aligns equally well?

• This may depend on the question you’re asking…

• …and also depends on the alignment and counting software you use.

Reads That Align to Gene Features

This read unambiguously aligns to Gene A

This read also uniquely aligns to Gene A

This read aligns uniquely, but could be unspliced

This spliced read uniquely aligns to Gene A

Genes A and B overlap, but the read is uniquely within Gene A

Genes A and B overlap, the read is within Gene A, but overlaps Gene B

Genes A and B overlap, and the read ambiguously maps to both genes

From: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Read Count Definitions

• When using tools such as HTSeq-count, FeatureCounts, and STAR, the output table will include counts of reads that can be assigned to each gene and some additional lines for reads that can’t be assigned, including:

• Ambiguous – read overlaps more than one gene

• Multiple Mapping – read maps equally well to more than one place in the genome

• No Feature – read maps uniquely to genome, but in an intergenic (or intronic) region

• Unmapped – read cannot be mapped to genome (based on run parameters)

Examples From Two Recent Experiments

Genome Human (hg38) Mouse (mm10)

Annotation GENCODE GENCODE

Library Prep rRNA depletion polyA selection

rRNA 8.8% 5.6%

Aligned to Genome 98.6% 98.1%

Assigned to Genes (exons) 51.8% 84.1%

Ambiguous 2.2% 1.3%

Multimapping 12.9% 6.2%

No Feature 31.6% 6.5%

Unmapped 1.4% 1.9%

How Well Did Your Reads Align to the Reference?

Calculate percentage of reads mapped per sample

0

20

40

60

80

100

% R

ead

s M

app

ed

Incomplete Reference?

Good Great! Sample Contamination?

Checking Your Results

• Key genes that may confirm sample ID – Knock-out or knock-down genes

– Genes identified in previous research

• Specific genes of interest – Hypothesis testing

– Important pathways

• Experimental validation (e.g., qRT-PCR) – The best way to determine if your analysis protocol accurately models

your organism/experiment

– Ideally, validation should be conducted on a different set of samples


Bioinformatics analyses are in silico experiments

The tools and parameters you choose will be influenced by factors including: – Available reference/annotation

– Experimental design (e.g., pairwise vs. multi-factor)

The “right” tools are the ones that best inform on your experiment

Don’t just shop for methods that give you the answer you want


Fastq

Fasta (reference) gtf (gene features)

File Types

sam/bam

Text tables

Today’s Exercises – Differential Gene Expression

Today, we’ll analyze a few types of data you’ll typically encounter:

1. Single-end reads (“tag counting” with well-annotated genomes)

using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data.

2. Paired-end reads (finding novel transcripts in a genome with incomplete annotation)

3. Paired-end reads for differential gene expression when only a transcriptome is available (such as after a transcriptome assembly)

Today’s Exercises – Differential Gene Expression

And we’ll be using a few different software:

1. Tophat to align spliced reads to a genome

2. FeatureCounts to generate raw counts tables for…

3. edgeR, for differential expression of genes

4. Cufflinks to find novel genes and transcripts

5. Bowtie2 to align reads to a transcriptome reference

6. sam2counts to extract raw counts from bowtie2 alignments

Gene Construction

Alignment and Differential Expression

TopHat

edgeR/limma

Read set(s)

Existing annotation (GTF)

bam file(s)

Toptables, etc.

We will follow these steps in the first exercise.

FeatureCounts

But, do we have all the genes?

• For organisms with genomes, gene models are stored in gtf files

• Assumptions: – The gtf file contains annotation for ALL transcripts and genes

– All splice sites, start/stop codons, etc. are correct

• Are these assumptions correct for every sequenced organism?

• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation

• Method used depends on how much sequence information there is for the organism…

Gene Construction (Alignment) vs. Assembly

Haas and Zody (2010) Nat. Biotech. 28:421-3

Novel or Non-Model Organisms

Genome-Sequenced Organisms

Trinity software

Gene / Transcriptome Construction

• Annotation can be improved – even for well-annotated model organisms – Identify all expressed exons

– Combine expressed exons into genes

– Find all splice variants for a gene

– Discover novel transcripts

• For newly sequenced organisms – Validate ab initio annotation

– Comparison between different annotation sets

• Can assist in finding some types of contamination – Reconstruction of rRNA genes

– Genomic/mitochondrial DNA in RNA library preps.

Reference Annotation Based Transcript (RABT) Assembly

TopHat

Cufflinks

Cuffmerge

Cuffcompare

Read set(s)

Existing annotation

(GTF) [optional]

bam file(s)

Read-set specific GTF(s)

Merged GTF

Final assembly (GTF and stats)

Toptables, etc. edgeR/limma FeatureCounts

A Few Words About Bacterial RNA-Seq

Eukaryotic and Bacterial Gene Structures are Different

• Eukaryotes – Gene structure includes introns and exons

– Splicing, poly-adenylation

– Each mRNA is a discrete molecule when translated

• Bacteria / Prokaryotes – Individual genes and groups of genes in operons

– Generally, no splicing, no polyA

– One mRNA can contain coding sequences for multiple proteins

Bacterial RNA-Seq Considerations

• rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules

• Splicing-aware aligners (such as Tophat) may not be useful

• Reads from polycistronic mRNA may overlap two genes – How would HTSeq-Count or FeatureCounts handle this?

• Compare alignments to the genome to alignments to transcriptome. – Some aligners, such as bwa-mem, will report secondary alignments

– Transcriptome alignments can be used to generate counts table for edgeR

• Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/Rockhopper/)

http://cs.wellesley.edu/~btjaden/Rockhopper/

Documents

RNA-Seq Software, Tools, and Workflows · using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data. 2. Paired-end reads (finding novel transcripts in a genome with