RNA-Seq Software, Tools, and Workflows Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 2016 Workshop


RNA-Seq Software, Tools, and Workflows

Monica Britton, Ph.D. Sr. Bioinformatics Analyst

June 2016 Workshop


Some mRNA-Seq Applications

• Differential gene expression analysis

• Transcriptional profiling

• Identification of splice variants

• Novel gene identification

• Transcriptome assembly

• SNP finding

• RNA editing

Assumption: Changes in transcription/mRNA levels correlate with phenotype (protein expression)


Experimental Design

• What biological question am I trying to answer?

• What types of samples (tissue, timepoints, etc.)?

• How much sequence do I need?

• Length of read?

• Platform?

• Single-end or paired-end?

• Barcoding?

• Pooling?

• Biological replicates: how many?

• Technical replicates: how many?

• Protocol considerations?


Strand-Specific (Directional) RNA-Seq

• Now the default for Illumina TruSeq kits

• Preserves orientation of RNA after reverse transcription to cDNA

• Informs alignments to genome

– Determine which genomic DNA strand is transcribed

– Identify anti-sense transcription (e.g., lncRNAs)

– Quantify expression levels more precisely

– Demarcate coding sequences in microbes with overlapping genes

• Very useful in transcriptome assemblies

– Allows precise construction of sense and anti-sense transcripts


Influence of the Organism

• Novel – little/no previous sequencing (may need assembly)

• Non-Model – some sequence available (draft genome or transcriptome assembly)

– Thousands of scaffolds, maybe tens of chromosomes

– Some annotation (ab initio, EST-based, etc.)

• Model – genome fully sequenced and annotated

– Multiple genomes available for comparison

– Well-annotated transcriptome based on experimental evidence

– Genetic maps with markers available

– Basic research can be conducted to verify annotations (mutants available)


Differential Gene Expression Generalized Workflow


Read QC and Trimming

Scythe

Sickle

Stringency of trimming depends on read length and downstream processing.

Alignment can often tolerate some amount of contamination.

Various sizes of trimmed reads can introduce biases in alignments.

Transcriptome assembly works best with aggressive trimming.


QC and Alignment

You may want to align your reads to a few different datasets, including

• rRNA of your organism (or related organisms)

• Contaminant screening, if desirable

• Genome or transcriptome to generate counts for differential expression analysis

First we’ll go through alignment to a gene set/transcriptome, then we’ll look at aligning to a genome…


Alignment – Choosing an Aligner (Pt. 1 – Gene Set)

• Examples of gene sets:

– rRNA sequences

– potential contaminant sequences (such as PhiX),

– a transcriptome (eukaryotic or bacterial), where each sequence in the fasta represents a different transcript (mRNA)

– De novo transcriptome assembly

• A splicing-aware aligner is not needed

• bwa aln (<76 bp), bwa mem (>75 bp), or bowtie2

• Requiring strand-specific orientation can be more difficult (requires filtering the bam file).
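The strand filtering mentioned in the last bullet can be sketched with the SAM flag bits. This is a minimal illustration, not a workshop script: the flag values (0x10, 0x40, 0x80) come from the SAM format, but the function name and the default assumption that read 1 is antisense to the transcript (as in dUTP-style stranded kits) are ours, so check your own library protocol before relying on it.

```python
# SAM flag bits (defined by the SAM format)
FLAG_REVERSE = 0x10  # read aligned to the reverse strand of the reference
FLAG_READ1 = 0x40    # first read in the pair
FLAG_READ2 = 0x80    # second read in the pair


def matches_transcript_strand(flag, read1_is_antisense=True):
    """Return True if an alignment to a transcript (sense) sequence has the
    orientation expected for a stranded library.

    read1_is_antisense=True is an ASSUMPTION (dUTP-style library);
    set it to False for kits where read 1 is sense.
    """
    is_reverse = bool(flag & FLAG_REVERSE)
    if flag & FLAG_READ2:
        expected_reverse = not read1_is_antisense
    else:  # read 1, or a single-end read
        expected_reverse = read1_is_antisense
    return is_reverse == expected_reverse


# Keep only the alignments whose orientation fits the library type
flags = [83, 163, 99, 147]
kept = [f for f in flags if matches_transcript_strand(f)]
print(kept)
```

With the default setting, flags 83 (read 1, reverse) and 163 (read 2, forward) are kept, while 99 and 147 describe a pair in the opposite orientation and are dropped.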


Library QC – rRNA Contamination

• Total RNA contains >95% rRNA

• Poly A selected libraries should contain <5% rRNA

• rRNA-depleted libraries should contain <15% rRNA

• Align reads to rRNA sequences from organism or relatives

• Higher than expected or inconsistent rRNA content may indicate degraded RNA or problems in library prep.

• Generally, don’t need to remove rRNA reads

[Figure: three charts comparing rRNA and mRNA fractions: Total RNA sample; After PolyA isolation; After rRNA depletion]


Insert Size (determined from gene set alignments)

The “insert” is the cDNA (or RNA) ligated between the adapters.

Typical insert size is 160-200 bases, but can be larger.

Insert size distribution depends on library prep method.

[Figure: library fragment drawn as Adapter (~60 bases), cDNA (100-800 bases) = Insert Size, Adapter (~60 bases); below it, a single-end read, gapped paired-end reads, and overlapping paired-end reads on a 0 to 200 base axis]


Paired-End Reads

[Figure: two panels showing Read 1 and Read 2 on a 0 to 200 base axis, labeled INSERT SIZE (TLEN) and INNER DISTANCE (+ or -)]
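The relationship between the two labels can be written as a one-line calculation. The sketch below assumes the insert size is taken from the SAM TLEN field (which is signed, hence the absolute value) and that both read lengths are known; the function name is ours.

```python
def inner_distance(tlen, read1_len, read2_len):
    """Inner distance between mates: positive for a gap, negative for an overlap.

    tlen is the insert size (SAM TLEN); its sign depends on which mate is
    leftmost, so only the magnitude is used here.
    """
    return abs(tlen) - read1_len - read2_len


# Gapped pair: 200-base insert sequenced with two 75-base reads
print(inner_distance(200, 75, 75))   # 50

# Overlapping pair: 120-base insert sequenced with two 75-base reads
print(inner_distance(-120, 75, 75))  # -30
```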


Generating Raw Counts from Gene Set Alignments

The sam/bam file can be filtered for just those pairs where both reads align concordantly (proper pairs) to one transcript.

sam2counts or samtools idxstats can be used for counting.

Counts per transcript need to be added up to counts per gene before running stats software
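As a rough sketch of the last step, the function below reads `samtools idxstats` output (tab-separated: reference name, sequence length, mapped reads, unmapped reads) and sums transcript counts into gene counts. The transcript and gene names and the numbers are made up for illustration, and the transcript-to-gene table is something you would have to supply yourself. Note that idxstats counts every mapped read in the bam file, so any proper-pair filtering has to happen before this step.

```python
from collections import defaultdict


def gene_counts_from_idxstats(idxstats_text, transcript_to_gene):
    """Sum per-transcript mapped-read counts into per-gene counts."""
    counts = defaultdict(int)
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # idxstats summary line for reads without a reference
            continue
        # Transcripts missing from the table are kept under their own name
        gene = transcript_to_gene.get(name, name)
        counts[gene] += int(mapped)
    return dict(counts)


# Hypothetical idxstats output and transcript-to-gene table
example = "tx1\t1500\t120\t0\ntx2\t900\t30\t0\ntx3\t2000\t45\t0\n*\t0\t0\t17\n"
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(gene_counts_from_idxstats(example, mapping))
```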


Alignment to a Genome - Choosing a Reference

• Human/mouse: GENCODE (uses Ensembl IDs) (http://www.gencodegenes.org/), but may need some manipulation to work with certain software

• Ensembl genomes (http://ensemblgenomes.org/) and Biomart (http://www.ensembl.org/biomart/martview/)

• Illumina igenomes (http://support.illumina.com/sequencing/sequencing_software/igenome.html) provides genome indexes for some software, and files with extra info for Tophat/cufflinks.

• NCBI genomes (http://www.ncbi.nlm.nih.gov/genome/)

• Many specialized databases (Phytozome, Patric, VectorBase, FlyBase, WormBase)

• “Do it yourself” genome assembly and gene-finding (don’t forget functional annotation)


Alignment – Choosing an Aligner (Pt. 2 – Genome)

Eukaryotic genes are separated into exons and introns, so if you are using a genome reference, you need an aligner that is either “splicing-aware” or won’t choke on spliced RNA reads.

A splicing-aware aligner will recognize the difference between a short insert and a read that aligns across exon-intron boundaries

Read Pairs (joined by solid line)

Spliced Reads (joined by dashed line)


(Some) Splicing-Aware or Splicing-Agnostic Aligners

• Tophat2 (bowtie2) – what we’ll be using in these exercises. (remember, igenomes); also integrates with cufflinks to find novel genes and transcripts

• HISAT/HISAT2 – successor to tophat2, and also works for alignment of populations (i.e., multiple human samples). I haven’t tried this yet. It’s on our Galaxy … try it out and report what happens.

• GSNAP – I haven’t used this; can handle (and produce) non-standard file formats.

• STAR – this is what I use. It’s very fast and doesn’t need a separate counting step. But it does need a lot of memory (>128G RAM to index a mammalian genome).

• bwa mem – not splicing-aware, but as a local aligner it doesn’t require the entire read to align, so it can still place spliced reads.


Generating Raw Counts from Genome Alignments

Reads aligning to a chromosome or scaffold must be parsed to assign them to a gene.

We’ll need a “roadmap” to indicate where each gene (exon) is located in the genome


Generating Raw Counts from Genome Alignments

This roadmap is the GTF (Gene Transfer Format) file:

Let’s look at the GTF fields in more detail …

The right column includes attributes, including gene ID, etc.

The left columns list source, feature type, and genomic coordinates


Fields in the GTF File (left to right)

Sequence Name (i.e., chromosome, scaffold, etc.): chr12

Source (program that generated the gtf file or feature): unknown

Feature (i.e., gene, exon, CDS, start codon, stop codon): CDS

Start (starting location on sequence): 3677872

End (end position on sequence): 3678014

Score: .

Strand (+ or -): +

Frame (0, 1, or 2: which is first base in codon, zero-based): 2

Attribute (“;”-delimited list of tags with additional info): gene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933"; transcript_id "NM_019854"; tss_id "TSS4368";
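To make the field layout concrete, here is a small parsing sketch that splits one tab-delimited GTF line into the nine fields listed above and turns the attribute column into a dictionary. The example line is assembled from the values on this slide; the function and variable names are ours, and real GTF files may need extra handling (comment lines, unusual attribute formatting).

```python
GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Split one GTF line into named fields and parse the attribute tags."""
    record = dict(zip(GTF_FIELDS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ";"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


line = ('chr12\tunknown\tCDS\t3677872\t3678014\t.\t+\t2\t'
        'gene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933"; '
        'transcript_id "NM_019854"; tss_id "TSS4368";')
rec = parse_gtf_line(line)
print(rec["attribute"]["gene_id"])    # PRMT8
print(rec["end"] - rec["start"] + 1)  # 143 (GTF coordinates are 1-based, inclusive)
```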


An Unusual GTF File


Generating Gene Counts from Genome Alignments

• HTSeq-count – runs rather slowly; doesn’t work well on coordinate-sorted bam files.

• FeatureCounts – Runs faster, and will work on coordinate-sorted bam files. Has more parameters than HTSeq-count.

• STAR – Will generate counts equivalent to HTSeq-count with default parameters, in the same run as alignment.

• All of these can utilize strand-specific information in the GTF file to assign reads from strand-specific libraries.

• But what about those reads that can’t be assigned unambiguously to a gene?


Multiple-Mapping Reads (“Multireads”)

• Some reads will align to more than one place in the reference, due to:

– Common domains, gene families

– Paralogs, pseudogenes, polyploidy etc.

• This can distort counts, leading to misleading expression levels

• If a read can’t be uniquely mapped, how should it be counted?

• Should it be ignored (not counted at all)?

• Should it be randomly assigned to one location among all the locations to which it aligns equally well?

• This may depend on the question you’re asking…

• …and also depends on the alignment and counting software you use.
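The two options raised above (ignore multireads, or assign each one at random) can be compared with a toy counter. This is only an illustration of the two policies, not how any particular aligner or counting tool implements them; the read and gene names are invented.

```python
import random
from collections import Counter


def count_reads(alignments, policy="ignore", seed=0):
    """Count reads per gene.

    alignments maps each read name to the list of genes it aligns to
    equally well. policy is "ignore" (drop multireads) or "random"
    (assign each multiread to one of its genes at random).
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    counts = Counter()
    for genes in alignments.values():
        if not genes:
            continue  # unaligned read
        if len(genes) == 1:
            counts[genes[0]] += 1
        elif policy == "random":
            counts[rng.choice(genes)] += 1
        # policy == "ignore": multireads are not counted at all
    return counts


toy = {"r1": ["geneA"], "r2": ["geneA"], "r3": ["geneA", "geneB"], "r4": ["geneB"]}
print(count_reads(toy, policy="ignore"))
print(count_reads(toy, policy="random"))
```

Under "ignore" the total drops by one read; under "random" the total is preserved but the extra count lands on geneA or geneB by chance, which is why the choice can matter for gene families and paralogs.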


Reads That Align to Gene Features

This read unambiguously aligns to Gene A

This read also uniquely aligns to Gene A

This read aligns uniquely, but could be unspliced

This spliced read uniquely aligns to Gene A

Genes A and B overlap, but the read is uniquely within Gene A

Genes A and B overlap, the read is within Gene A, but overlaps Gene B

Genes A and B overlap, and the read ambiguously maps to both genes

From: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html


Read Count Definitions

• When using tools such as HTSeq-count, FeatureCounts, and STAR, the output table will include counts of reads that can be assigned to each gene and some additional lines for reads that can’t be assigned, including:

• Ambiguous – read overlaps more than one gene

• Multiple Mapping – read maps equally well to more than one place in the genome

• No Feature – read maps uniquely to genome, but in an intergenic (or intronic) region

• Unmapped – read cannot be mapped to genome (based on run parameters)
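As one concrete case, HTSeq-count appends its unassigned categories to the counts table as lines whose names start with two underscores (`__no_feature`, `__ambiguous`, `__too_low_aQual`, `__not_aligned`, `__alignment_not_unique`). The sketch below turns such a table into percentages; the gene names and counts are invented, and other tools (FeatureCounts, STAR) use different labels and file layouts, so treat this as a pattern rather than a general parser.

```python
def summarize_htseq(text):
    """Return the percentage of counted reads in each category."""
    assigned = 0
    special = {}
    for line in text.strip().splitlines():
        name, count = line.split("\t")
        if name.startswith("__"):   # HTSeq-count special counters
            special[name] = int(count)
        else:                       # ordinary gene line
            assigned += int(count)
    total = assigned + sum(special.values())
    summary = {"assigned": assigned, **special}
    return {key: round(100 * value / total, 1) for key, value in summary.items()}


# Invented example table
table = ("geneA\t600\ngeneB\t240\n"
         "__no_feature\t65\n__ambiguous\t13\n__too_low_aQual\t0\n"
         "__not_aligned\t20\n__alignment_not_unique\t62\n")
for category, percent in summarize_htseq(table).items():
    print(f"{category}\t{percent}%")
```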


Examples From Two Recent Experiments

Genome                      Human (hg38)     Mouse (mm10)
Annotation                  GENCODE          GENCODE
Library Prep                rRNA depletion   polyA selection
rRNA                        8.8%             5.6%
Aligned to Genome           98.6%            98.1%
Assigned to Genes (exons)   51.8%            84.1%
Ambiguous                   2.2%             1.3%
Multimapping                12.9%            6.2%
No Feature                  31.6%            6.5%
Unmapped                    1.4%             1.9%


How Well Did Your Reads Align to the Reference?

Calculate percentage of reads mapped per sample

[Figure: chart of % Reads Mapped (0 to 100) per sample, annotated “Incomplete Reference?”, “Sample Contamination?”, “Good”, and “Great!”]


Checking Your Results

• Key genes that may confirm sample ID

– Knock-out or knock-down genes

– Genes identified in previous research

• Specific genes of interest

– Hypothesis testing

– Important pathways

• Experimental validation (e.g., qRT-PCR)

– The best way to determine if your analysis protocol accurately models your organism/experiment

– Ideally, validation should be conducted on a different set of samples


Differential Gene Expression Generalized Workflow

Bioinformatics analyses are in silico experiments

The tools and parameters you choose will be influenced by factors including:

– Available reference/annotation

– Experimental design (e.g., pairwise vs. multi-factor)

The “right” tools are the ones that best inform on your experiment

Don’t just shop for methods that give you the answer you want


Differential Gene Expression Generalized Workflow

File Types (in workflow order): Fastq; Fasta (reference) and gtf (gene features); sam/bam; text tables


Today’s Exercises – Differential Gene Expression

Today, we’ll analyze a few types of data you’ll typically encounter:

1. Single-end reads (“tag counting” with well-annotated genomes), using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data.

2. Paired-end reads (finding novel transcripts in a genome with incomplete annotation)

3. Paired-end reads for differential gene expression when only a transcriptome is available (such as after a transcriptome assembly)


Today’s Exercises – Differential Gene Expression

And we’ll be using several software tools:

1. Tophat to align spliced reads to a genome

2. FeatureCounts to generate raw counts tables for…

3. edgeR, for differential expression of genes

4. Cufflinks to find novel genes and transcripts

5. Bowtie2 to align reads to a transcriptome reference

6. sam2counts to extract raw counts from bowtie2 alignments


Gene Construction


Alignment and Differential Expression

[Workflow diagram: Read set(s) + Existing annotation (GTF) → TopHat → bam file(s) → FeatureCounts → edgeR/limma → Toptables, etc.]

We will follow these steps in the first exercise.


But, do we have all the genes?

• For organisms with genomes, gene models are stored in gtf files

• Assumptions:

– The gtf file contains annotation for ALL transcripts and genes

– All splice sites, start/stop codons, etc. are correct

• Are these assumptions correct for every sequenced organism?

• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation

• Method used depends on how much sequence information there is for the organism…


Gene Construction (Alignment) vs. Assembly

Haas and Zody (2010) Nat. Biotech. 28:421-3

Novel or Non-Model Organisms

Genome-Sequenced Organisms

Trinity software


Gene / Transcriptome Construction

• Annotation can be improved – even for well-annotated model organisms

– Identify all expressed exons

– Combine expressed exons into genes

– Find all splice variants for a gene

– Discover novel transcripts

• For newly sequenced organisms

– Validate ab initio annotation

– Comparison between different annotation sets

• Can assist in finding some types of contamination

– Reconstruction of rRNA genes

– Genomic/mitochondrial DNA in RNA library preps.


Reference Annotation Based Transcript (RABT) Assembly

[Workflow diagram: Read set(s) → TopHat → bam file(s) → Cufflinks → Read-set specific GTF(s) → Cuffmerge → Merged GTF → Cuffcompare → Final assembly (GTF and stats) → FeatureCounts → edgeR/limma → Toptables, etc. Existing annotation (GTF) is an optional input.]


A Few Words About Bacterial RNA-Seq


Eukaryotic and Bacterial Gene Structures are Different

• Eukaryotes

– Gene structure includes introns and exons

– Splicing, poly-adenylation

– Each mRNA is a discrete molecule when translated

• Bacteria / Prokaryotes

– Individual genes and groups of genes in operons

– Generally, no splicing, no polyA

– One mRNA can contain coding sequences for multiple proteins


Bacterial RNA-Seq Considerations

• rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules

• Splicing-aware aligners (such as Tophat) may not be useful

• Reads from polycistronic mRNA may overlap two genes

– How would HTSeq-Count or FeatureCounts handle this?

• Compare alignments to the genome with alignments to the transcriptome.

– Some aligners, such as bwa-mem, will report secondary alignments

– Transcriptome alignments can be used to generate counts table for edgeR

• Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/Rockhopper/)