Next Generation Sequencing Introduction to RNA-sequencing
BMI7830
Gulcin Ozer, PhD [email protected]
Department of Biomedical Informatics The Ohio State University
November 5th, 2015
Background
What is RNA-sequencing?
§ Sequencing of cDNA reverse transcribed from purified mRNA
§ Operationally not much different from DNA sequencing, but the downstream processing is markedly different
Background
Benefits and opportunities of RNA-seq
§ Digital measure of gene expression
§ Differential expression
§ Annotation of new exons or transcribed regions, genes, or non-coding RNAs
§ Alternative splicing events
§ Allele-specific expression
§ Fusion genes in cancer
§ Variant information (with important caveats)
Background
Compared to cDNA microarrays …
cDNA Microarray             RNA-sequencing
Analog                      Digital
Fixed probeset              Any sequence, organism
No variant information      SNV, indels, fusions
Reasonable cost             Reasonable to outrageous
Large batch-to-batch var.   Little variation across runs
RNA-seq Library Preparation
§ Poly-A or total transcriptome
  § Poly-A
    § Complete transcripts for protein-coding genes and some non-protein-coding genes contain a 3’ poly(A) tail
    § Oligo-d(T) capture
  § Total transcriptome
    § No oligo-d(T) capture step
    § Typically requires ribosomal depletion
    § Captures more noncoding transcripts at the cost of lower coding coverage
§ Unstranded or stranded
RNA-seq Library Sequencing
Nadia Davidson
@HWI-EAS264:8:87:1418:8428#0
GGGGACGTCTGCGACACCGGGGACAGAGCAACTATGGATGAAGAGGGCTACATCTGGTTCCTGGGGAGGAGCCATG
+
CBCBCCCCCCCCACCCCCCBCCCCCCCCCCCCC9B8?A>?BCACCC;CC@C>@CBB@CBA@@BA@B1?;4@*A@@A
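A FASTQ record such as the one above consists of a read name, the base calls, a "+" separator, and one quality character per base. The sketch below shows one way to read such records in Python and turn the quality characters into Phred scores. The names `FastqRecord` and `read_fastq` are made up for illustration, the toy record is hypothetical, and the Phred+33 offset is an assumption: some older Illumina pipelines wrote Phred+64, so check the encoding of your own files.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed FASTQ stream with four lines per record."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.strip()
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumed encoding: Phred score = ASCII code minus the offset.
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Toy record (hypothetical, not taken from the lecture data).
toy = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(toy))
print(records[0].name, records[0].qualities)
```

For real data, which usually contains millions of gzip-compressed records, the same generator could be fed from `gzip.open(path, "rt")` instead of an in-memory string.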
Millions of
RNA-seq Library Sequencing
§ Sequenced as in DNA sequencing
§ Paired end (PE) or single end (SE)
§ For an analysis of alternative splicing, paired-end and longer reads may be helpful
RNA-seq Quality Control (QC)
FastQC
§ Base quality per position
§ Nucleotide per position
§ GC content
§ K-mer enrichment
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
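To make two of these metrics concrete, here is a minimal sketch of how a mean base quality per read position and a GC fraction could be computed. It is not FastQC's implementation; the function names are invented for this example, the inputs are toy values, and Phred+33 quality encoding is assumed.

```python
from typing import Iterable, List


def mean_quality_per_position(quality_strings: Iterable[str],
                              phred_offset: int = 33) -> List[float]:
    """Average Phred score at each read position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for pos, ch in enumerate(quality):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / called


# Toy inputs (hypothetical): two reads of different length.
per_position = mean_quality_per_position(["II", "5"])
print(per_position)
print(gc_fraction("GGCA"))
```

Reads of different lengths are handled by averaging only over the reads that actually cover a position, which is why per-position plots often get noisier toward the 3’ end.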
RNA-seq Quality Control (QC)
Base quality per position
RNA-seq Quality Control (QC)
In addition to FastQC, there are RNA-seq-specific QC tools:
• RNA-SeQC
  • http://www.broadinstitute.org/cancer/cga/rna-seqc
  • DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich M, Winckler W, Getz G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics (2012)
• RSeQC
  • http://rseqc.sourceforge.net/
  • Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics (2012) 28 (16): 2184-2185. doi: 10.1093/bioinformatics/bts356
RNA-seq Quality Control (QC)
RSeQC
• Python based
• Both table- and graph-based output
Figure panels (Wang, Bioinformatics 2012):
• Coverage uniformity over gene body (5’ – 3’ bias)
• Saturation analysis of expression for the 25% highest-expressed genes
• Saturation analysis of junction detection
• Annotation of detected splice junctions
RNA-seq Data Processing
FASTQ file(s) → Mapping (alignment) → BAM file → Expression Quantification → Expression Estimates → Downstream analysis
Quality Control
Downstream analysis
• Differential expression testing
• Functional annotation/clustering
• Pathway analysis
RNA-seq Mapping
1- Mapping to reference genome
Select reference genome
Spliced genome alignment
• TopHat/Bowtie
• STAR
Expression Quantification
• Cufflinks
• HTSeq (counts)

2- Mapping to transcriptome
Construct transcriptome
Mapping to transcriptome
• Bowtie, BWA
Expression Quantification
• eXpress
• RSEM
• Counts
RNA-seq Tuxedo Suite
Trapnell, Nature Protocols 2012
Trapnell, Nature Biotechnology 2012
Transcript Construction Transcript Abundance Estimation
RNA Abundance Estimation
Garber, Nature Methods 2011
Transcripts of different lengths with different read coverage levels
RNA Abundance Estimation
§ RPKM/FPKM
  § Reads (fragments) per kilobase of exon per 10^6 reads sequenced
  § RPKM: Reads per kilobase of exon per million mapped reads. Introduced by Mortazavi, 2008
  § FPKM: Fragments per kilobase of exon per million mapped reads. Introduced by Salzberg, Pachter, 2010
§ TPM
  § Transcripts per million. Introduced by Li, 2011
  § Normalized by total transcript count instead of read count, in addition to average read length
§ Count-based methods
  § Gene level
  § Transcript level
RNA Abundance Estimation
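Fragments per kilobase of exon per million mapped reads, where c_i is the number of fragments mapped to transcript i, \ell_i is its length in bases, and N is the total number of mapped fragments:

\[
\mathrm{FPKM}_i = \frac{c_i}{\left(\frac{\ell_i}{10^3}\right)\left(\frac{N}{10^6}\right)} = \frac{c_i}{\ell_i N} \times 10^3 \times 10^6 = \frac{c_i}{\ell_i N} \times 10^9
\]

\[
\mathrm{TPM}_i = \frac{c_i}{\ell_i} \times \frac{1}{Z} \times 10^6, \qquad Z = \sum_j \frac{c_j}{\ell_j}
\]

Z: Sum of all length-normalized transcript counts
If you were to sequence one million full-length transcripts, TPM is the number of transcripts you would have seen for transcript i.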
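As a worked example of the two measures, the following sketch computes FPKM and TPM from per-transcript counts and lengths. It rests on simplifying assumptions that are not from the slides: N is taken to be the sum of the listed counts, raw transcript length is used rather than an effective length, and the function names and numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def fpkm(counts: Sequence[float], lengths: Sequence[float]) -> List[float]:
    """c_i / (l_i * N) * 1e9, with N assumed to be the sum of all counts."""
    n_mapped = sum(counts)
    if n_mapped == 0:
        return [0.0 for _ in counts]
    return [c * 1e9 / (length * n_mapped) for c, length in zip(counts, lengths)]


def tpm(counts: Sequence[float], lengths: Sequence[float]) -> List[float]:
    """(c_i / l_i) / Z * 1e6, with Z the sum of length-normalized counts."""
    rates = [c / length for c, length in zip(counts, lengths)]
    z = sum(rates)
    if z == 0:
        return [0.0 for _ in counts]
    return [rate / z * 1e6 for rate in rates]


# Toy data (hypothetical): three transcripts of 1, 2 and 3 kb.
counts = [100, 100, 300]
lengths = [1000, 2000, 3000]
fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)

# TPM is FPKM rescaled so that the values sum to one million.
rescaled = [value / sum(fpkm_values) * 1e6 for value in fpkm_values]
print(all(math.isclose(a, b) for a, b in zip(rescaled, tpm_values)))
```

Because TPM values always sum to one million within a sample, they are proportions within that sample, whereas the sum of FPKM values can differ from sample to sample.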
RNA Abundance Estimation (Counts)
§ Estimate counts
Trapnell, Nature Biotechnology 2013
Isoform-level counts are not accurate
Transcriptome Creation
ENSG0000171862
ENST0000610634
ENST0000371953
ENST0000472832
ENST0000462694
Mapping to Transcriptome
Allow multiple mapping
Given isoform X and isoform Y, an RNA-seq read could map equally well to either. To which should it be assigned?
Solution: Assign probabilistically based on expectation-maximization (EM)
Expectation Maximization
A 3-step “rescue” method for read assignment:
1. Estimate abundances from uniquely mapping reads
2. For each ambiguous read, allocate it among the transcripts to which it maps, pro rata according to abundances from step 1
3. Re-compute abundances from updated counts
Up to how many valid alignments per sequence read?
How many EM rounds?
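The three steps above, repeated for a chosen number of rounds, can be sketched as follows. This is a deliberately simplified illustration and not the algorithm implemented in RSEM or eXpress: it ignores transcript length, alignment quality and fragment-size models, and the function name, the input representation (each read as a tuple of compatible transcript indices) and the toy data are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def em_rescue(read_hits: Sequence[Tuple[int, ...]],
              n_transcripts: int,
              n_rounds: int = 50) -> List[float]:
    """Return estimated read counts per transcript after n_rounds of reallocation."""
    # Step 1: initial abundances from uniquely mapping reads only.
    counts = [0.0] * n_transcripts
    for hits in read_hits:
        if len(hits) == 1:
            counts[hits[0]] += 1.0

    for _ in range(n_rounds):
        updated = [0.0] * n_transcripts
        for hits in read_hits:
            total = sum(counts[t] for t in hits)
            for t in hits:
                # Step 2: split the read pro rata by current abundance;
                # fall back to an even split if no transcript has support yet.
                share = counts[t] / total if total > 0 else 1.0 / len(hits)
                updated[t] += share
        # Step 3: recompute abundances from the updated fractional counts.
        counts = updated
    return counts


# Toy data (hypothetical): 3 reads unique to isoform 0, 1 unique to
# isoform 1, and 4 reads that map equally well to both.
reads = [(0,)] * 3 + [(1,)] + [(0, 1)] * 4
one_round = em_rescue(reads, n_transcripts=2, n_rounds=1)
many_rounds = em_rescue(reads, n_transcripts=2, n_rounds=50)
print(one_round, many_rounds)
```

In this toy case the ambiguous reads are split 3:1 in the first round, and further rounds leave the estimate unchanged because the allocation already matches the abundances. With real data the estimates typically keep shifting for several rounds, which is why the number of EM rounds and the cap on alignments per read are tuning questions.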
An illustrative example of abundance estimation for two transcripts with shared (blue) and unique (red, yellow) sequences. To estimate transcript abundances, RNA-seq reads (short bars) are first aligned to the transcript sequences (long bars, bottom). Unique regions of isoforms will capture uniquely mapping RNA-seq reads (red and yellow short bars), and shared sequences between isoforms will capture multiply-mapping reads (blue short bars). An expectation maximization algorithm, implemented in the RSEM software, estimates the most likely relative abundances of the transcripts and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms resulting from iterations of expectation maximization are illustrated as filled short bars (right), and eliminated assignments are shown as hollow bars.
Haas, Nature Protocols 2013
Tools for EM based gene abundance
• RSEM: RNA-seq by Expectation-Maximization
  • Li et al (Senior: Colin Dewey)
  • Older and in wider use
  • Better support
  • Nicer output and tools
• eXpress
  • Roberts et al (Senior: Pachter)
  • Faster
  • Easier to set up
  • Supports gapped alignments
• Both output isoform-level counts, FPKM, and TPM
RSEM/eXpress data tables
bundle_id  target_id           length  eff_length   tot_counts
1          ENST00000449446.1|  1768    0.000000     0
2          ENST00000414345.2|  264     0.000000     0
3          ENST00000539941.2|  1692    1401.381015  1169
3          ENST00000268717.5|  1641    1341.768590  1221
3          ENST00000578317.1|  1625    1425.445323  1158
3          ENST00000439936.2|  1394    1162.609069  889
3          ENST00000577210.1|  752     566.131134   301
3          ENST00000417352.1|  1009    749.337675   364
3          ENST00000584216.1|  876     598.044491   329

(columns continued, same row order)
uniq_counts  est_counts   eff_counts   ambig_distr_alpha  ambig_distr_beta
0            0.000000     0.000000     0.000000e+00       0.000000e+00
0            0.000000     0.000000     0.000000e+00       0.000000e+00
2            61.099987    73.770929    6.452375e+00       1.209575e+02
2            1057.886341  1293.808410  1.289255e+01       1.991645e+00
1            73.860586    84.200670    3.911023e+00       5.819462e+01
0            0.166363     0.199474     4.230245e-02       2.260106e+02
0            0.218552     0.290305     1.150804e-02       1.583792e+01
1            1.487002     2.002282     9.004745e-01       6.702923e+02
2            16.457838    24.107012    7.108123e-01       1.536598e+01

(columns continued, same row order)
fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
1.715095e+00  1.281888e+00   2.148302e+00    T         2.264485e+00
3.101450e+01  2.910003e+01   3.292896e+01    T         4.094926e+01
2.038288e+00  1.563572e+00   2.513004e+00    T         2.691205e+00
5.628942e-03  0.000000e+00   3.329140e-02    T         7.432041e-03
1.518591e-02  0.000000e+00   8.230336e-02    T         2.005035e-02
7.806159e-02  4.737860e-03   1.513853e-01    T         1.030668e-01
1.082537e+00  5.667161e-01   1.598357e+00    T         1.429302e+00
…
Annotation
§ Typically in GTF/GFF3 format
§ Ensembl (http://www.ensembl.org/)
  § BioMart
  § Coding and non-coding RNA species
§ RefSeq (http://www.ncbi.nlm.nih.gov/refseq/)
  § FTP or UCSC table browser
  § Coding RNA species
§ GENCODE (http://www.gencodegenes.org)
  § Coding and non-coding RNA species
  § Human and mouse only
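Since annotation usually arrives as GTF, it helps to see what one line of such a file holds: nine tab-separated columns, the last of which is a list of `key "value";` attributes. The sketch below parses a single line. The function name and the example line (including its identifiers) are hypothetical, and the attribute parsing is a simplification that would misread values containing semicolons and keeps only the last value of a repeated key.

```python
from typing import Dict, Union

GtfFeature = Dict[str, Union[str, int, Dict[str, str]]]


def parse_gtf_line(line: str) -> GtfFeature:
    """Split one GTF feature line into its nine columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, raw_attrs = fields

    attributes: Dict[str, str] = {}
    for item in raw_attrs.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Hypothetical example line, not taken from a real annotation release.
example = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
exon = parse_gtf_line(example)
exon_length = exon["end"] - exon["start"] + 1
print(exon["attributes"], exon_length)
```

Summing such exon lengths per `transcript_id` is one way to obtain the transcript lengths needed for length-normalized expression measures.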
Upload fastq files
Quality control (FastQC)
Quality trimming (FASTQ Quality Trimmer)
Quality control (FastQC)
Mapping (TopHat)
Visualization (Trackster)
Visualization (IGV)
Upload annotation GTF
Expression quantification (Cufflinks)
https://osu.box.com/exome-seq-example
https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
EXERCISE
https://osu.box.com/rna-seq-data
Annotation
http://www.acgt.me/blog/2015/4/27/the-dangers-of-default-parameters-in-bioinformatics-lessons-from-bowtie-and- t
Selen Yilmaz, MS