Next Generation Sequencing: Introduction to RNA-sequencing (BMI7830). Gulcin Ozer, PhD [email protected], Department of Biomedical Informatics, The Ohio State University. November 5th, 2015


Page 1: Next Generation Sequencing

Next Generation Sequencing: Introduction to RNA-sequencing

BMI7830

Gulcin Ozer, PhD [email protected]

Department of Biomedical Informatics The Ohio State University

November 5th, 2015

Page 2: Next Generation Sequencing

Background

What is RNA-sequencing?
§ Sequencing of reverse-transcribed cDNA from purified mRNA
§ Operationally not much different from DNA sequencing, but the downstream processing is markedly different

Page 3: Next Generation Sequencing

Background

Benefits and opportunities of RNA-seq
§ Digital measure of gene expression
§ Differential expression
§ Annotation of new exons or transcribed regions, genes, or non-coding RNAs
§ Alternative splicing events
§ Allele-specific expression
§ Fusion genes in cancer
§ Variant information (with important caveats)

Page 4: Next Generation Sequencing

Background

Compared to cDNA microarrays …

cDNA Microarray             RNA-sequencing
Analog                      Digital
Fixed probeset              Any sequence, organism
No variant information      SNV, indels, fusions
Reasonable cost             Reasonable to outrageous
Large batch-to-batch var.   Little variation across runs

Page 5: Next Generation Sequencing

RNA-seq Library Preparation

§ Poly-A or total transcriptome
  § Poly-A
    § Complete transcripts for protein-coding genes and some non-protein-coding genes contain a 3’ poly(A) tail
    § Oligo-d(T) capture
  § Total transcriptome
    § No oligo-d(T) capture step
    § Typically requires ribosomal depletion
    § Captures more noncoding transcripts at the cost of lower coding coverage
§ Unstranded or stranded

Page 6: Next Generation Sequencing

RNA-seq Library Sequencing

Nadia Davidson

@HWI-EAS264:8:87:1418:8428#0
GGGGACGTCTGCGACACCGGGGACAGAGCAACTATGGATGAAGAGGGCTACATCTGGTTCCTGGGGAGGAGCCATG
+
CBCBCCCCCCCCACCCCCCBCCCCCCCCCCCCC9B8?A>?BCACCC;CC@C>@CBB@CBA@@BA@B1?;4@*A@@A

Millions of
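The read shown on this slide is one FASTQ record: a header line starting with `@`, the base calls, a `+` separator, and one quality character per base. The following is a minimal sketch of reading such records in Python, not the parser of any particular tool. The names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical, the toy record is made up, and the Phred+33 offset is an assumption (the quality string on the slide contains characters such as `*` and `1`, which fall below ASCII 64, so a +64 offset would give negative scores).

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Made-up four-base record, used only to exercise the functions.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the `I` characters in the toy record (Q = 40) stand for roughly one error in 10,000 base calls.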

Page 7: Next Generation Sequencing

RNA-seq Library Sequencing

§ Sequenced as in DNA sequencing
§ Paired end (PE) or single end (SE)
§ For an analysis of alternative splicing, paired-end and longer reads may be helpful

Page 8: Next Generation Sequencing

RNA-seq Quality Control (QC)

FastQC
§ Base quality per position
§ Nucleotide per position
§ GC content
§ K-mer enrichment
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
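To make two of the metrics listed above concrete, here is a rough Python sketch of how mean base quality per position and per-read GC content could be computed. It illustrates the idea only and is not how FastQC itself is implemented. The function names and toy reads are hypothetical, and a Phred+33 quality encoding is assumed.

```python
from typing import List, Sequence, Tuple


def mean_quality_per_position(qualities: Sequence[str], offset: int = 33) -> List[float]:
    """Average Phred score at each read position, across all reads."""
    longest = max((len(q) for q in qualities), default=0)
    totals = [0] * longest
    counts = [0] * longest
    for q in qualities:
        for pos, ch in enumerate(q):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def gc_fraction(sequence: str) -> float:
    """Fraction of G/C among called (A, C, G, T) bases in one read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


# Made-up (sequence, quality) pairs of unequal length.
toy_reads: List[Tuple[str, str]] = [("ACGT", "IIII"), ("GGCC", "II5!"), ("AT", "5I")]
per_position = mean_quality_per_position([q for _, q in toy_reads])
gc_values = [gc_fraction(s) for s, _ in toy_reads]
```

A drop in `per_position` toward the 3' end of reads is the pattern that usually motivates quality trimming before mapping.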

Page 9: Next Generation Sequencing

RNA-seq Quality Control (QC)


Base quality per position

Page 10: Next Generation Sequencing

RNA-seq Quality Control (QC)

In addition to FastQC, there are RNA-seq specific QC tools
• RNA-SeQC
  • http://www.broadinstitute.org/cancer/cga/rna-seqc
  • DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich M, Winckler W, Getz G. (2012) RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics
• RSeQC
  • http://rseqc.sourceforge.net/
  • Wang L, Wang S, Li W. (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics 28(16): 2184-2185. doi: 10.1093/bioinformatics/bts356

Page 11: Next Generation Sequencing

RNA-seq Quality Control (QC)

RSeQC
• Python based
• Both table- and graph-based output

[Figure panels: coverage uniformity over gene body (5’ – 3’ bias); saturation analysis of expression for the 25% highest expressed genes; saturation analysis of junction detection; annotation of detected splice junctions. Wang, Bioinformatics 2012]
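Saturation analysis of junction detection asks whether sequencing deeper would still reveal new splice junctions. One rough way to sketch the idea in Python is to subsample the reads at increasing fractions and count the distinct junctions seen at each depth. This is an illustration under simplifying assumptions, not RSeQC's implementation: the input is assumed to be one junction label per spliced read, and `saturation_curve`, the fractions, and the toy data are all hypothetical.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def saturation_curve(
    junction_per_read: Sequence[Hashable],
    fractions: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
    seed: int = 0,
) -> List[Tuple[float, int]]:
    """Count distinct junctions seen in nested random subsamples of the reads."""
    shuffled = list(junction_per_read)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the sketch reproducible
    curve = []
    for fraction in fractions:
        n_reads = int(round(fraction * len(shuffled)))
        # Prefixes of one shuffle are nested, so the count can never decrease.
        curve.append((fraction, len(set(shuffled[:n_reads]))))
    return curve


# Made-up data: one common junction, one moderate, one rare.
toy_junctions = ["j1"] * 6 + ["j2"] * 3 + ["j3"]
curve = saturation_curve(toy_junctions)
```

If the curve flattens well before the full depth, additional sequencing is unlikely to uncover many more junctions. If it is still rising at 100%, the library has not been sequenced to saturation.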

Page 12: Next Generation Sequencing

RNA-seq Data Processing

FASTQ file(s) → Mapping (alignment) → BAM file → Expression Quantification → Expression Estimates → Downstream analysis

Downstream analysis
• Differential expression testing
• Functional annotation/clustering
• Pathway analysis

Quality Control

Page 13: Next Generation Sequencing

RNA-seq Mapping

1- Mapping to reference genome
Select reference genome
Spliced genome alignment
• TopHat/Bowtie
• STAR
Expression Quantification
• Cufflinks
• HTSeq (counts)

2- Mapping to transcriptome
Construct transcriptome
Mapping to transcriptome
• Bowtie, BWA
Expression Quantification
• eXpress
• RSEM
• Counts

Page 14: Next Generation Sequencing

RNA-seq Mapping

Tuxedo Suite: TopHat/Bowtie for spliced genome alignment, followed by Cufflinks for expression quantification (the reference-genome path of the mapping overview on Page 13).

Page 15: Next Generation Sequencing

RNA-seq Tuxedo Suite

Trapnell, Nature Protocols 2012

Page 16: Next Generation Sequencing

16

Page 17: Next Generation Sequencing

17

Page 18: Next Generation Sequencing

[Figure panels: Transcript Construction; Transcript Abundance Estimation]

Trapnell, Nature Biotechnology 2012

Page 19: Next Generation Sequencing

RNA Abundance Estimation

[Figure: Transcripts of different lengths with different read coverage levels. Garber, Nature Methods 2011]

Page 20: Next Generation Sequencing

RNA Abundance Estimation

§ RPKM/FPKM
  § Reads (fragments) per kilobase of exon per 10^6 reads sequenced
  § RPKM: Reads per kilobase of exon per million mapped reads. Introduced by Mortazavi, 2008
  § FPKM: Fragments per kilobase of exon per million mapped reads. Introduced by Salzberg, Pachter, 2010
§ TPM
  § Transcripts per million. Introduced by Li, 2011
  § Normalized by total transcript count instead of read count in addition to average read length.
§ Count-based methods
  § Gene level
  § Transcript level

Page 21: Next Generation Sequencing

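RNA Abundance Estimation

Fragments per kilobase of exon per million mapped reads:

$$\mathrm{FPKM}_i = \frac{c_i}{\frac{\ell_i}{10^3} \times \frac{N}{10^6}} = \frac{c_i}{\ell_i N} \times 10^3 \times 10^6 = \frac{c_i}{\ell_i N} \times 10^9$$

$$\mathrm{TPM}_i = \frac{c_i}{\ell_i} \times \frac{1}{Z} \times 10^6$$

Here $c_i$ is the number of fragments assigned to transcript $i$, $\ell_i$ is the length of transcript $i$ in bases, and $N$ is the total number of mapped fragments.

Z: Sum of all length normalized transcript counts, $Z = \sum_j c_j / \ell_j$

If you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen for transcript i.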
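The two definitions on this slide can be checked numerically. The sketch below computes FPKM and TPM from per-transcript fragment counts and lengths. The function names and the toy numbers are hypothetical, and N is assumed to be the sum of the counts passed in (that is, every mapped fragment is assigned to one of the listed transcripts).

```python
from typing import List, Sequence


def fpkm(counts: Sequence[float], lengths: Sequence[int]) -> List[float]:
    """FPKM_i = c_i / (l_i * N) * 1e9, with N taken as the sum of all counts."""
    total = sum(counts)  # assumption: all mapped fragments are in `counts`
    return [c / (length * total) * 1e9 for c, length in zip(counts, lengths)]


def tpm(counts: Sequence[float], lengths: Sequence[int]) -> List[float]:
    """TPM_i = (c_i / l_i) / Z * 1e6, with Z the sum of length-normalized counts."""
    rates = [c / length for c, length in zip(counts, lengths)]
    z = sum(rates)
    return [rate / z * 1e6 for rate in rates]


# Made-up example: three transcripts, 1000 fragments in total.
toy_counts = [100, 200, 700]
toy_lengths = [1000, 2000, 1000]  # transcript lengths in bases

fpkm_values = fpkm(toy_counts, toy_lengths)
tpm_values = tpm(toy_counts, toy_lengths)

# TPM is FPKM rescaled so that the values sum to one million.
rescaled = [value / sum(fpkm_values) * 1e6 for value in fpkm_values]
```

In this toy case the first two transcripts get the same FPKM and the same TPM even though one has twice the count, because it is also twice as long. TPM values always sum to 10^6 within a sample, whereas the FPKM total depends on the length distribution of what was expressed.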

Page 22: Next Generation Sequencing

RNA Abundance Estimation (Counts)

§ Estimate counts

[Figure: Isoform level counts are not accurate. Trapnell, Nature Biotechnology 2013]

Page 23: Next Generation Sequencing

RNA-seq Mapping


Page 24: Next Generation Sequencing

Transcriptome Creation

Gene ENSG00000171862 with transcripts:
ENST00000610634
ENST00000371953
ENST00000472832
ENST00000462694

Page 25: Next Generation Sequencing

Mapping to Transcriptome

Allow multiple mapping
Given isoform X and isoform Y, an RNA-seq read could map equally well to either. To which should it be assigned?
Solution: Assign probabilistically based on expectation-maximization (EM)

Expectation Maximization
A 3-step “rescue” method for read assignment:

1. Estimate abundances from uniquely mapping reads

2. For each ambiguous read, allocate it among the transcripts to which it maps, pro rata according to abundances from step 1

3. Re-compute abundances from updated counts

Up to how many valid alignments per sequence read?

How many EM rounds?
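The three steps on this slide can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm as implemented in RSEM or eXpress: it assumes all transcripts have the same effective length, ignores alignment scores, and treats each read as a tuple of the transcript IDs it aligns to. The function name `em_assign` and the toy reads are hypothetical; the `rounds` parameter corresponds to the "How many EM rounds?" question above.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def em_assign(read_hits: Sequence[Tuple[str, ...]], rounds: int = 20) -> Dict[str, float]:
    """Fractionally assign reads to transcripts by iterating the rescue steps."""
    unique = Counter(hits[0] for hits in read_hits if len(hits) == 1)
    ambiguous = [hits for hits in read_hits if len(hits) > 1]
    transcripts = sorted({t for hits in read_hits for t in hits})

    # Step 1: initial abundances from uniquely mapping reads only.
    counts = {t: float(unique.get(t, 0)) for t in transcripts}

    for _ in range(rounds):
        updated = {t: float(unique.get(t, 0)) for t in transcripts}
        for hits in ambiguous:
            weight = sum(counts[t] for t in hits)
            for t in hits:
                # Step 2: split the read pro rata by current abundance
                # (evenly if none of its transcripts has any support yet).
                share = counts[t] / weight if weight > 0 else 1.0 / len(hits)
                updated[t] += share
        # Step 3: re-compute abundances from the updated counts.
        counts = updated
    return counts


# Made-up data: 3 reads unique to X, 1 unique to Y, 4 shared by both.
toy_reads = [("X",)] * 3 + [("Y",)] + [("X", "Y")] * 4
estimates = em_assign(toy_reads)
```

In the toy data the unique reads favor X three to one, so each shared read is split 0.75 / 0.25 and the estimates settle at 6 reads for X and 2 for Y. The total number of reads is preserved at every round.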

Page 26: Next Generation Sequencing


An illustrative example of abundance estimation for two transcripts with shared (blue) and unique (red, yellow) sequences. To estimate transcript abundances, RNA-seq reads (short bars) are first aligned to the transcript sequences (long bars, bottom). Unique regions of isoforms will capture uniquely mapping RNA-seq reads (red and yellow short bars), and shared sequences between isoforms will capture multiply-mapping reads (blue short bars). An expectation maximization algorithm, implemented in the RSEM software, estimates the most likely relative abundances of the transcripts and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms resulting from iterations of expectation maximization are illustrated as filled short bars (right), and eliminated assignments are shown as hollow bars.

Haas, Nature Protocols 2013

Page 27: Next Generation Sequencing

Tools for EM based gene abundance

• RSEM: RNA-seq by Expectation-Maximization
  • Li et al (Senior: Colin Dewey)
  • Older and in wider use
  • Better support
  • Nicer output and tools
• eXpress
  • Roberts et al (Senior: Pachter)
  • Faster
  • Easier to set up
  • Supports gapped alignments
• Both output isoform-level counts, FPKM, and TPM

Page 28: Next Generation Sequencing

RSEM/eXpress data tables

bundle_id  target_id           length  eff_length   tot_counts
1          ENST00000449446.1|  1768    0.000000     0
2          ENST00000414345.2|  264     0.000000     0
3          ENST00000539941.2|  1692    1401.381015  1169
3          ENST00000268717.5|  1641    1341.768590  1221
3          ENST00000578317.1|  1625    1425.445323  1158
3          ENST00000439936.2|  1394    1162.609069  889
3          ENST00000577210.1|  752     566.131134   301
3          ENST00000417352.1|  1009    749.337675   364
3          ENST00000584216.1|  876     598.044491   329

uniq_counts  est_counts   eff_counts   ambig_distr_alpha  ambig_distr_beta
0            0.000000     0.000000     0.000000e+00       0.000000e+00
0            0.000000     0.000000     0.000000e+00       0.000000e+00
2            61.099987    73.770929    6.452375e+00       1.209575e+02
2            1057.886341  1293.808410  1.289255e+01       1.991645e+00
1            73.860586    84.200670    3.911023e+00       5.819462e+01
0            0.166363     0.199474     4.230245e-02       2.260106e+02
0            0.218552     0.290305     1.150804e-02       1.583792e+01
1            1.487002     2.002282     9.004745e-01       6.702923e+02
2            16.457838    24.107012    7.108123e-01       1.536598e+01

fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
1.715095e+00  1.281888e+00   2.148302e+00    T         2.264485e+00
3.101450e+01  2.910003e+01   3.292896e+01    T         4.094926e+01
2.038288e+00  1.563572e+00   2.513004e+00    T         2.691205e+00
5.628942e-03  0.000000e+00   3.329140e-02    T         7.432041e-03
1.518591e-02  0.000000e+00   8.230336e-02    T         2.005035e-02
7.806159e-02  4.737860e-03   1.513853e-01    T         1.030668e-01
1.082537e+00  5.667161e-01   1.598357e+00    T         1.429302e+00
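A table like this is usually consumed programmatically rather than read by eye. As a small sketch, the Python below loads a few rows copied from the first block of the table and totals `tot_counts` per `bundle_id`. Splitting on arbitrary whitespace is an assumption made for this excerpt (a real output file may use a specific delimiter), and `read_table` is a hypothetical helper, not part of either tool.

```python
from collections import defaultdict
from typing import Dict, List

# A few rows copied from the first block of the table on this slide.
TABLE = """\
bundle_id target_id length eff_length tot_counts
1 ENST00000449446.1| 1768 0.000000 0
3 ENST00000539941.2| 1692 1401.381015 1169
3 ENST00000268717.5| 1641 1341.768590 1221
3 ENST00000578317.1| 1625 1425.445323 1158
"""


def read_table(text: str) -> List[Dict[str, str]]:
    """Parse a whitespace-delimited table with a header row into dict rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


rows = read_table(TABLE)

# Total reads per bundle (a bundle groups transcripts that share reads).
counts_per_bundle: Dict[int, int] = defaultdict(int)
for row in rows:
    counts_per_bundle[int(row["bundle_id"])] += int(row["tot_counts"])

# The IDs in the table carry a trailing "|", stripped here for lookups.
transcript_ids = [row["target_id"].rstrip("|") for row in rows]
```

Note that `tot_counts` is much larger than `uniq_counts` for the bundle 3 transcripts, which is exactly the situation where the probabilistic `est_counts` column matters.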

Page 29: Next Generation Sequencing

Annotation

§ Typically in GTF/GFF3 format
§ Ensembl (http://www.ensembl.org/)
  § BioMart
  § Coding and non-coding RNA species
§ RefSeq (http://www.ncbi.nlm.nih.gov/refseq/)
  § FTP or UCSC table browser
  § Coding RNA species
§ GENCODE (http://www.gencodegenes.org)
  § Coding and non-coding RNA species
  § Human and mouse only
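A GTF annotation line has nine tab-separated fields; the last one holds `key "value";` attribute pairs such as `gene_id` and `transcript_id`. The sketch below parses one such line in Python. It is a simplified illustration: `GtfFeature` and `parse_gtf_line` are hypothetical names, the example line is made up, only quoted attribute values are handled, and a repeated attribute key would keep just its last value.

```python
import re
from typing import Dict, NamedTuple


class GtfFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


# Matches pairs like: gene_id "GENE1"
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line: str) -> GtfFeature:
    """Split one GTF line into its fixed fields and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, _score, strand, _frame, attr_text = fields
    attributes = dict(_ATTRIBUTE.findall(attr_text))
    return GtfFeature(seqname, source, feature, int(start), int(end), strand, attributes)


# Made-up exon record, for illustration only.
example = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
exon = parse_gtf_line(example)
exon_length = exon.end - exon.start + 1  # both coordinates are inclusive
```

Grouping exon records by `transcript_id` and `gene_id` in this way is what lets a quantification tool relate aligned reads back to transcripts and genes.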

Page 30: Next Generation Sequencing
Page 31: Next Generation Sequencing

• Upload fastq files
• Quality control (FastQC)
• Quality trimming (FASTQ Quality Trimmer)
• Quality control (FastQC)
• Mapping (TopHat)
• Visualization (Trackster)
• Visualization (IGV)
• Upload annotation GTF
• Expression quantification (Cufflinks)

https://osu.box.com/exome-seq-example

Page 32: Next Generation Sequencing

32

Page 33: Next Generation Sequencing
Page 34: Next Generation Sequencing

https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

EXERCISE

https://osu.box.com/rna-seq-data

Annotation

Page 35: Next Generation Sequencing
Page 36: Next Generation Sequencing
Page 37: Next Generation Sequencing
Page 38: Next Generation Sequencing
Page 39: Next Generation Sequencing
Page 40: Next Generation Sequencing
Page 41: Next Generation Sequencing
Page 42: Next Generation Sequencing
Page 43: Next Generation Sequencing
Page 44: Next Generation Sequencing
Page 45: Next Generation Sequencing
Page 46: Next Generation Sequencing

46

Page 47: Next Generation Sequencing
Page 48: Next Generation Sequencing
Page 49: Next Generation Sequencing
Page 50: Next Generation Sequencing
Page 51: Next Generation Sequencing

http://www.acgt.me/blog/2015/4/27/the-dangers-of-default-parameters-in-bioinformatics-lessons-from-bowtie-and- t

Page 52: Next Generation Sequencing

52

Selen Yilmaz, MS

Page 53: Next Generation Sequencing

53

Gulcin Ozer, PhD 340C Lincoln Tower 1800 Cannon Dr. [email protected]

340 Lincoln Tower