Next Generation Sequencing


Next Generation Sequencing: Introduction to RNA-sequencing

BMI7830


Gulcin Ozer, PhD [email protected]

Department of Biomedical Informatics The Ohio State University

November 5th, 2015

Background

What is RNA-sequencing?
§ Sequencing of reverse-transcribed cDNA from purified mRNA
§ Operationally not much different from DNA sequencing, but the downstream processing is markedly different

Background

Benefits and opportunities of RNA-seq
§ Digital measure of gene expression
§ Differential expression
§ Annotation of new exons or transcribed regions, genes, or non-coding RNAs
§ Alternative splicing events
§ Allele-specific expression
§ Fusion genes in cancer
§ Variant information (with important caveats)

Background

Compared to cDNA microarrays …

cDNA Microarray              RNA-sequencing
Analog                       Digital
Fixed probeset               Any sequence, organism
No variant information       SNV, indels, fusions
Reasonable cost              Reasonable to outrageous
Large batch-to-batch var.    Little variation across runs

RNA-seq Library Preparation

§ Poly-A or total transcriptome
  § Poly-A
    § Complete transcripts for protein-coding genes and some non-protein-coding genes contain a 3’ poly(A) tail
    § Oligo-d(T) capture
  § Total transcriptome
    § No oligo-d(T) capture step
    § Typically requires ribosomal RNA (rRNA) depletion
    § Captures more noncoding transcripts at the cost of lower coding coverage
§ Unstranded or stranded


RNA-seq Library Sequencing


Nadia Davidson

@HWI-EAS264:8:87:1418:8428#0
GGGGACGTCTGCGACACCGGGGACAGAGCAACTATGGATGAAGAGGGCTACATCTGGTTCCTGGGGAGGAGCCATG
+
CBCBCCCCCCCCACCCCCCBCCCCCCCCCCCCC9B8?A>?BCACCC;CC@C>@CBB@CBA@@BA@B1?;4@*A@@A

Millions of
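The FASTQ record shown on this slide has four parts: a header starting with "@", the read sequence, a "+" separator, and one quality character per base. The sketch below splits the record and decodes the quality string. The function name parse_fastq_record is hypothetical, and the Phred+33 offset is an assumption; it is at least consistent with this record, since characters such as "*" would give negative scores under a +64 offset.

```python
# The four lines of the FASTQ record from the slide.
record = """@HWI-EAS264:8:87:1418:8428#0
GGGGACGTCTGCGACACCGGGGACAGAGCAACTATGGATGAAGAGGGCTACATCTGGTTCCTGGGGAGGAGCCATG
+
CBCBCCCCCCCCACCCCCCBCCCCCCCCCCCCC9B8?A>?BCACCC;CC@C>@CBB@CBA@@BA@B1?;4@*A@@A
"""


def parse_fastq_record(text, offset=33):
    """Split one 4-line FASTQ record and decode its quality string.

    offset=33 (Phred+33) is an assumption, not something stated on the slide.
    """
    header, seq, plus, qual = text.strip().split("\n")
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes one Phred score: ASCII code minus offset.
    scores = [ord(ch) - offset for ch in qual]
    return header[1:], seq, scores


read_id, seq, scores = parse_fastq_record(record)
print(read_id, len(seq), min(scores), max(scores))
```

Under the Phred+33 assumption, the lowest-quality base in this read is the one encoded by "*" near the 3’ end.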

RNA-seq Library Sequencing

§ Sequenced as in DNA sequencing
§ Paired end (PE) or single end (SE)
§ For an analysis of alternative splicing, paired-end and longer reads may be helpful


RNA-seq Quality Control (QC)

FastQC
§ Base quality per position
§ Nucleotide per position
§ GC content
§ K-mer enrichment
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
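To make two of these metrics concrete, here is a minimal sketch of GC content and mean base quality per position. It is not how FastQC itself is implemented; the function names, the toy quality strings, and the Phred+33 offset are all assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings (hypothetical): 'I' = Q40, '5' = Q20, '+' = Q10.
toy_quals = ["IIII", "II5+", "5I"]
per_position = mean_quality_per_position(toy_quals)
print(per_position)
print(gc_content("GGCC"), gc_content("ACGT"), gc_content("AT"))
```

A drop in the per-position mean towards the 3’ end of the reads is the pattern the "Base quality per position" plot is meant to reveal.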




Base quality per position


In addition to FastQC, there are RNA-seq specific QC tools
• RNA-SeQC
  • http://www.broadinstitute.org/cancer/cga/rna-seqc
  • DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich M, Winckler W, Getz G. (2012) RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics
• RSeQC
  • http://rseqc.sourceforge.net/
  • Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics (2012) 28 (16): 2184-2185. doi: 10.1093/bioinformatics/bts356




RSeQC
• Python based
• Both table- and graph-based output

Wang, Bioinformatics 2012

Figure panels:
• Coverage uniformity over gene body (5’ – 3’ bias)
• Saturation analysis of expression for 25% highest expressed genes
• Saturation analysis of junction detection
• Annotation of detected splice junctions

RNA-seq Data Processing


FASTQ file(s) → Mapping (alignment) → BAM file → Expression Quantification → Expression Estimates → Downstream analysis

Downstream analysis
• Differential expression testing
• Functional annotation/clustering
• Pathway analysis

Quality Control

RNA-seq Mapping


1- Mapping to reference genome
Select reference genome
Spliced genome alignment
• TopHat/Bowtie
• STAR
Expression Quantification
• Cufflinks
• HTSeq (counts)

2- Mapping to transcriptome
Construct transcriptome
Mapping to transcriptome
• Bowtie, BWA
Expression Quantification
• eXpress
• RSEM
• Counts

RNA-seq Mapping

Tuxedo Suite

RNA-seq Tuxedo Suite


Trapnell, Nature Protocols 2012



Trapnell, Nature Biotechnology 2012

Transcript Construction
Transcript Abundance Estimation


RNA Abundance Estimation

Garber, Nature Methods 2011

Transcripts of different lengths with different read coverage levels



§ RPKM/FPKM
  § Reads (fragments) per kilobase of exon per 10^6 reads sequenced
  § RPKM: Reads per kilobase of exon per million mapped reads. Introduced by Mortazavi, 2008
  § FPKM: Fragments per kilobase of exon per million mapped reads. Introduced by Salzberg, Pachter, 2010
§ TPM
  § Transcripts per million. Introduced by Li, 2011
  § Normalized by total transcript count instead of read count in addition to average read length.
§ Count-based methods
  § Gene level
  § Transcript level



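Fragments per kilobase of exon per million mapped reads

\[
\mathrm{FPKM}(i) = \frac{c_i}{\dfrac{\ell_i}{10^3} \times \dfrac{N}{10^6}} = \frac{c_i}{\ell_i N} \times 10^3 \times 10^6 = \frac{c_i}{\ell_i N} \times 10^9
\]

c_i: number of fragments mapped to transcript i; \ell_i: length of transcript i in bases; N: total number of mapped reads

\[
\mathrm{TPM}(i) = \frac{c_i}{\ell_i} \times \frac{1}{Z} \times 10^6
\]

Z: Sum of all length normalized transcript counts, \( Z = \sum_j c_j / \ell_j \)

If you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen for transcript i.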
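A small numeric sketch of the two measures may help. It assumes c is the fragment count of a transcript, l its length in bases, N the total number of mapped fragments (taken here as the sum of the counts), and Z the sum of the length-normalized counts c/l. The counts and lengths are hypothetical, and the function names are for illustration only.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of exon per million mapped fragments."""
    n = sum(counts)  # assumption: every mapped fragment is in `counts`
    return [c * 1e9 / (l * n) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalized counts rescaled to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    z = sum(rates)  # Z: sum of all length-normalized counts
    return [r / z * 1e6 for r in rates]


# Hypothetical data: three transcripts of different lengths.
counts = [100, 300, 600]
lengths = [1000, 1500, 6000]

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)
```

In this toy case the third transcript has six times the fragments of the first but is six times longer, so both measures give the two transcripts the same abundance. TPM values always sum to one million, and TPM is FPKM divided by the sum of all FPKM values, times 10^6.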

RNA Abundance Estimation (Counts)

§  Estimate counts

Trapnell, Nature Biotechnology 2013

Isoform level counts are not accurate

RNA-seq Mapping

Transcriptome Creation


ENSG00000171862

ENST00000610634

ENST00000371953

ENST00000472832

ENST00000462694

Mapping to Transcriptome


Allow multiple mapping
Given isoform X and isoform Y, an RNA-seq read could map equally well to either. To which should it be assigned?
Solution: Assign probabilistically based on expectation-maximization (EM)

Expectation Maximization
A 3-step “rescue” method for read assignment:

1. Estimate abundances from uniquely mapping reads

2. For each ambiguous read, allocate it among the transcripts to which it maps, pro rata according to abundances from step 1

3. Re-compute abundances from updated counts

Up to how many valid alignments per sequence read?

How many EM rounds?
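The three rescue steps can be sketched in a few lines. This is a simplified illustration, not the RSEM or eXpress implementation: it ignores transcript length and alignment quality, which real tools model, and the function name em_rescue, the n_rounds parameter, and the toy reads are hypothetical.

```python
from collections import Counter


def em_rescue(read_hits, n_rounds=10):
    """Fractionally assign reads to transcripts by iterated 'rescue'.

    read_hits: one tuple per read, listing the transcripts it aligns to.
    Returns a dict of estimated (possibly fractional) counts per transcript.
    """
    transcripts = sorted({t for hits in read_hits for t in hits})
    # Step 1: initial abundances from uniquely mapping reads only.
    unique = Counter(hits[0] for hits in read_hits if len(hits) == 1)
    counts = {t: float(unique.get(t, 0)) for t in transcripts}
    for _ in range(n_rounds):
        new_counts = {t: float(unique.get(t, 0)) for t in transcripts}
        for hits in read_hits:
            if len(hits) == 1:
                continue
            total = sum(counts[t] for t in hits)
            for t in hits:
                # Step 2: split each ambiguous read pro rata; fall back to
                # an even split when no candidate has any support yet.
                share = counts[t] / total if total > 0 else 1.0 / len(hits)
                new_counts[t] += share
        # Step 3: the updated counts become the abundances for the next round.
        counts = new_counts
    return counts


# Toy data: 3 reads unique to X, 1 unique to Y, 4 reads that fit both.
reads = [("X",)] * 3 + [("Y",)] + [("X", "Y")] * 4
estimates = em_rescue(reads)
print(estimates)
```

With these toy reads the unique evidence favors X three to one, so each ambiguous read is split 0.75 to X and 0.25 to Y, and the estimates stop changing after the first round. The two questions on the slide correspond to how many transcripts may appear in each tuple and to the value of n_rounds.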


An illustrative example of abundance estimation for two transcripts with shared (blue) and unique (red, yellow) sequences. To estimate transcript abundances, RNA-seq reads (short bars) are first aligned to the transcript sequences (long bars, bottom). Unique regions of isoforms will capture uniquely mapping RNA-seq reads (red and yellow short bars), and shared sequences between isoforms will capture multiply-mapping reads (blue short bars). An expectation maximization algorithm, implemented in the RSEM software, estimates the most likely relative abundances of the transcripts and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms resulting from iterations of expectation maximization are illustrated as filled short bars (right), and eliminated assignments are shown as hollow bars.

Haas, Nature Protocols 2013

Tools for EM based gene abundance

• RSEM: RNA-seq by Expectation-Maximization
  • Li et al (Senior: Colin Dewey)
  • Older and in wider use
  • Better support
  • Nicer output and tools
• eXpress
  • Roberts et al (Senior: Pachter)
  • Faster
  • Easier to set up
  • Supports gapped alignments
• Both output isoform-level counts, FPKM, and TPM


RSEM/eXpress data tables

bundle_id  target_id           length  eff_length   tot_counts
1          ENST00000449446.1|  1768    0.000000     0
2          ENST00000414345.2|  264     0.000000     0
3          ENST00000539941.2|  1692    1401.381015  1169
3          ENST00000268717.5|  1641    1341.768590  1221
3          ENST00000578317.1|  1625    1425.445323  1158
3          ENST00000439936.2|  1394    1162.609069  889
3          ENST00000577210.1|  752     566.131134   301
3          ENST00000417352.1|  1009    749.337675   364
3          ENST00000584216.1|  876     598.044491   329

uniq_counts  est_counts   eff_counts   ambig_distr_alpha  ambig_distr_beta
0            0.000000     0.000000     0.000000e+00       0.000000e+00
0            0.000000     0.000000     0.000000e+00       0.000000e+00
2            61.099987    73.770929    6.452375e+00       1.209575e+02
2            1057.886341  1293.808410  1.289255e+01       1.991645e+00
1            73.860586    84.200670    3.911023e+00       5.819462e+01
0            0.166363     0.199474     4.230245e-02       2.260106e+02
0            0.218552     0.290305     1.150804e-02       1.583792e+01
1            1.487002     2.002282     9.004745e-01       6.702923e+02
2            16.457838    24.107012    7.108123e-01       1.536598e+01

fpkm          fpkm_conf_low  fpkm_conf_high  solvable  tpm
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
0.000000e+00  0.000000e+00   0.000000e+00    F         0.000000e+00
1.715095e+00  1.281888e+00   2.148302e+00    T         2.264485e+00
3.101450e+01  2.910003e+01   3.292896e+01    T         4.094926e+01
2.038288e+00  1.563572e+00   2.513004e+00    T         2.691205e+00
5.628942e-03  0.000000e+00   3.329140e-02    T         7.432041e-03
1.518591e-02  0.000000e+00   8.230336e-02    T         2.005035e-02
7.806159e-02  4.737860e-03   1.513853e-01    T         1.030668e-01
1.082537e+00  5.667161e-01   1.598357e+00    T         1.429302e+00

…
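One quick sanity check on a table like this: TPM is FPKM rescaled so that all values sum to one million, so the TPM/FPKM ratio should be the same for every transcript in a run. The sketch below checks this for three rows copied from the table above, assuming the three column blocks list the transcripts in the same order; the variable names are for illustration only.

```python
# (target_id, fpkm, tpm) for three rows of the table above.
rows = [
    ("ENST00000539941.2", 1.715095e+00, 2.264485e+00),
    ("ENST00000268717.5", 3.101450e+01, 4.094926e+01),
    ("ENST00000578317.1", 2.038288e+00, 2.691205e+00),
]

# If TPM = FPKM * 1e6 / sum(FPKM), this ratio is one shared constant.
ratios = [tpm_value / fpkm_value for _, fpkm_value, tpm_value in rows]
spread = (max(ratios) - min(ratios)) / min(ratios)

for (target_id, _, _), ratio in zip(rows, ratios):
    print(target_id, round(ratio, 4))
print("relative spread:", spread)
```

The three ratios agree to within the rounding of the printed values, which is what the rescaling relationship predicts.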

Annotation

§ Typically in GTF/GFF3 format
§ Ensembl (http://www.ensembl.org/)
  § BioMart
  § Coding and non-coding RNA species
§ RefSeq (http://www.ncbi.nlm.nih.gov/refseq/)
  § FTP or UCSC table browser
  § Coding RNA species
§ GENCODE (http://www.gencodegenes.org)
  § Coding and non-coding RNA species
  § Human and mouse only
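For orientation, a GTF line has nine tab-separated fields, the last of which holds key "value"; attribute pairs such as gene_id and transcript_id. The sketch below parses one such line. The example line, its coordinates, and the identifiers GENE1 and TX1 are hypothetical, and the parser handles GTF-style attributes only (GFF3 uses key=value pairs instead).

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Parse one GTF feature line into a dict; attributes become a sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: gene_id "GENE1"; transcript_id "TX1";
    record["attribute"] = dict(
        re.findall(r'(\S+) "([^"]*)"', record["attribute"])
    )
    return record


# Hypothetical exon line, for illustration only.
example = ('chrN\texample\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "TX1";')
exon = parse_gtf_line(example)

# GTF coordinates are 1-based and inclusive, so length = end - start + 1.
exon_length = exon["end"] - exon["start"] + 1
print(exon["feature"], exon["attribute"]["transcript_id"], exon_length)
```

Summing exon lengths per transcript_id in this way gives the transcript lengths that length-normalized measures such as FPKM and TPM rely on.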


Upload fastq files
Quality control (FastQC)
Quality trimming (FASTQ Quality Trimmer)
Quality control (FastQC)
Mapping (TopHat)
Visualization (Trackster)
Visualization (IGV)
Upload annotation GTF
Expression quantification (Cufflinks)

https://osu.box.com/exome-seq-example


https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

EXERCISE

https://osu.box.com/rna-seq-data

Annotation


http://www.acgt.me/blog/2015/4/27/the-dangers-of-default-parameters-in-bioinformatics-lessons-from-bowtie-and- t


Selen Yilmaz, MS


Gulcin Ozer, PhD 340C Lincoln Tower 1800 Cannon Dr. [email protected]

340 Lincoln Tower
