TOX680 Unveiling the Transcriptome using RNA-seq

Jinze Liu

Outline

• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome

Genome, Transcriptome, Proteome

• The transcriptome is all RNA molecules transcribed from DNA

[Figure: schematic illustration of a eukaryotic cell; labels: cell nucleus, Genome, Proteins, Proteome]

Dynamics of the Transcriptome

• Cells with the same genome may produce a different transcriptome … how?
• Two main mechanisms
  (1) differential gene expression
  (2) differential gene transcription

[Figure labels: pre-mRNA, mRNA transcripts, Proteins]

Alternate transcription

• multiple mRNA transcript “isoforms” within one gene
  – proteins with different functions may be produced
  – e.g. skipped exon in CYT-2 isoform of ERBB4 leads to increased cell proliferation
    Muraoka-Cook et al. (2009) Mol Cell Biol

[Figure: CYT-2 deletes 16 amino acids (WW domain binding motif)]

Forms of alternative splicing

Castle et al. (2008) Nature Genetics

Gene VEGFA combines multiple alternative splicing forms (not independently!) ….


How to measure the transcriptome?

• Ideally, given a sample of RNA
  – which transcripts are present?
  – how much of each?
• Given two samples of RNA
  – which transcripts are differentially expressed?

Microarrays

• Most common technique for measuring the transcriptome
  – hybridized probes detect the presence and abundance of specific known transcripts
• difficult to observe different transcript isoforms
• abundance has limited dynamic range

Differential gene expression

• Identify transcriptome differences between two samples


The RNA-seq protocol

• Protocol
  – mRNA is reverse transcribed to cDNA
  – cDNA is randomly fragmented
  – adapters are added to the fragments
  – fragments are sequenced using high-throughput (HT) sequencing technology
    • e.g. Illumina: up to a billion 100bp reads sequenced in a single run
• Each sequence is a randomly sampled fragment of the transcriptome
  – identity determined by alignment to a transcript library or to a reference genome
  – the number of alignments to a genomic locus is a measure of abundance

[Figure source: Nature Reviews | Genetics]
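The idea that the number of alignments to a locus measures abundance can be sketched in a few lines. This is a minimal illustration, not a real quantification tool: the locus names, coordinates, and alignment positions are hypothetical, and the per-kilobase length normalization is an added assumption that is not stated on the slide.

```python
from collections import Counter

# Hypothetical loci on one chromosome: name -> (start, end), end exclusive.
LOCI = {"geneA": (100, 500), "geneB": (800, 1000)}


def count_alignments(alignment_starts, loci):
    """Count how many read alignments start inside each locus."""
    counts = Counter({name: 0 for name in loci})
    for pos in alignment_starts:
        for name, (start, end) in loci.items():
            if start <= pos < end:
                counts[name] += 1
                break
    return dict(counts)


def reads_per_kilobase(counts, loci):
    """Divide raw counts by locus length (assumed normalization, in kb)."""
    return {
        name: counts[name] * 1000 / (loci[name][1] - loci[name][0])
        for name in loci
    }


# Hypothetical alignment start positions; the last one falls outside both loci.
starts = [120, 130, 450, 810, 820, 830, 2000]
counts = count_alignments(starts, LOCI)
rpk = reads_per_kilobase(counts, LOCI)
```

Both toy loci receive the same raw count, but the shorter locus has the higher per-kilobase value, which is why raw counts are usually normalized before loci are compared.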

RNA-seq view of transcriptome

• Issues
  – non-random fragmentation
  – sequencing bias
  – DNA or pre-mRNA contamination
• Spliced alignments
  – not a problem if aligning to a transcript library
  – challenging if aligning to the genome


Spliced alignment strategies

• Annotation based discovery
  – contiguous alignment of reads to existing EST/cDNA sequences with known splice junctions
  – contiguous alignment of reads to paired exons from database of known or suspected junctions (Mortazavi et al. 2008, Wang et al. 2008)
• Ab initio discovery by alignment to reference genome
  – QPalma (De Bona et al. 2008)
    • supervised splice site prediction and gapped alignment algorithm for aligning spliced reads
  – TopHat (Trapnell et al. 2009)
    • detect potential junctions based on structural features of introns, e.g. GT – AG dinucleotide sequences flanking the exons
    • test alignment of reads to candidate exon pairs
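The two-step idea described for TopHat (propose junctions from GT – AG intron boundaries, then test reads against the candidate exon pairs) can be illustrated on a toy sequence. This is only a sketch under simplifying assumptions, not TopHat's implementation: the function names, the length limits, and the toy genome are hypothetical, and matching is exact.

```python
def candidate_introns(genome, min_len=4, max_len=50):
    """Return (start, end) pairs, end exclusive, of substrings that begin
    with GT and end with AG, i.e. candidate canonical introns."""
    donors = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "AG"]
    pairs = []
    for d in donors:
        for a in acceptors:
            end = a + 2
            if min_len <= end - d <= max_len:
                pairs.append((d, end))
    return pairs


def read_supports(read, genome, intron, anchor):
    """Test whether the read equals `anchor` bases of the upstream exon
    followed directly by the downstream exon (intron removed)."""
    start, end = intron
    if start < anchor:
        return False
    spliced = genome[start - anchor:start] + genome[end:end + len(read) - anchor]
    return spliced == read


# Hypothetical toy genome: exon "ACCTCA", intron "GTCCCCAG", exon "TTCATC".
genome = "ACCTCA" + "GTCCCCAG" + "TTCATC"
introns = candidate_introns(genome)
spliced_read = "CTCATTCA"      # last 4 bases of exon 1 + first 4 of exon 2
unspliced_read = "CTCAGTCC"    # contiguous genomic sequence, no splice
```

Here `introns` holds the single candidate on the toy genome, and only `spliced_read` is consistent with it. On a real genome the GT/AG scan yields many candidates, which is why the read-based test is needed.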

Improved splice detection

• Issues
  – Cannot easily find non-canonical splices or long-range splices
  – Single long reads may include multiple splice junctions
  – Spurious alignment is a serious problem
• MapSplice: a second generation ab initio method
  – alignment of reads
    • does not depend on any structural features
    • finds multiple candidate alignments
  – splice inference
    • leverages the quality and diversity of read alignments to disambiguate true junctions from spurious junctions
  – efficient and scalable

Finding spliced alignments

[Figure: mRNA tag T, divided into segments t1, t2, t3, t4 of length k, aligned to a genome containing exon 1, exon 2, and exon 3; splice junctions j1 and j2 are marked]

• Example: 100 bp tag T is split into 25bp segments
  – segments are tested for (approximate) alignment to the genome
  – unaligned segments implicate splices
  – find splices by searching from neighboring aligned segments
• Theorem: if no exon is shorter than 2k, then at least one segment must align in every pair of consecutive length k segments.
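The segment-based search above can be tried on a toy example. The sketch below is a simplification under stated assumptions, not MapSplice's code: it uses exact matching and the first genome hit only, and the sequences, `K`, and function names are hypothetical. Every exon is 10 bp, so with k = 4 the condition of the theorem (no exon shorter than 2k) holds.

```python
def split_segments(read, k):
    """Split a read into consecutive, non-overlapping segments of length k."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]


def exonic_hits(segments, genome):
    """Genome position of the first exact match of each segment, or None."""
    hits = []
    for seg in segments:
        pos = genome.find(seg)
        hits.append(pos if pos >= 0 else None)
    return hits


def double_anchored_splice(seg, left_end, right_start, genome):
    """Split an unaligned segment so that its prefix continues the left
    neighbor and its suffix ends where the right neighbor starts.
    Returns (donor, acceptor) genome coordinates, or None."""
    k = len(seg)
    for cut in range(1, k):
        suffix_start = right_start - (k - cut)
        prefix_ok = genome[left_end:left_end + cut] == seg[:cut]
        suffix_ok = genome[suffix_start:right_start] == seg[cut:]
        if prefix_ok and suffix_ok:
            return (left_end + cut, suffix_start)
    return None


# Hypothetical toy genome: three 10 bp exons separated by two 10 bp introns.
EXON1, INTRON1, EXON2 = "AACCGGTTAC", "GTAAAAAAAG", "CATGCATTGA"
INTRON2, EXON3 = "GTTTTTTTAG", "TCGTTCCGTA"
GENOME = EXON1 + INTRON1 + EXON2 + INTRON2 + EXON3
MRNA = EXON1 + EXON2 + EXON3

K = 4
read = MRNA[4:20]                       # 16 bp tag crossing the first junction
segments = split_segments(read, K)
hits = exonic_hits(segments, GENOME)    # the second segment does not align
splice = double_anchored_splice(segments[1], hits[0] + K, hits[2], GENOME)
```

The unaligned second segment sits between two aligned neighbors, and the recovered `splice` coordinates delimit the first intron of the toy genome. A real aligner would also tolerate mismatches and handle segments that match in several places.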

MapSplice algorithm

INPUTS
• set of RNA-Seq reads
• Reference genome

(1) Segmentation of reads
  – each tag Ti is divided into segments t1, t2, …, tj, …, tn
(2) Segment exonic alignment
  – contiguous alignment of segments tj, tj+1 to the genome (5’ to 3’)
(3) Segment spliced alignment
  – missed alignment, double anchored: unaligned tj+1 lies between aligned tj and tj+2
  – missed alignment, single anchored: the unaligned segment has one aligned neighbor
(4) Segment assembly
(5) Junction inference: junctions are classified as high confidence or low confidence using
  1. Alignment quality
  2. Anchor significance
  3. Entropy
(6) Identify best alignment for tags

OUTPUTS
• Splices and splice coverage
• Read alignments
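Entropy appears among the junction inference criteria without a definition in these slides. One plausible reading, assumed here, is the Shannon entropy of the positions at which reads cross a junction: a junction supported by reads at many different offsets is more diverse than one supported by identical copies of a single read. The function name and the offset lists are hypothetical.

```python
import math
from collections import Counter


def junction_entropy(offsets):
    """Shannon entropy (in bits) of the distribution of read offsets
    relative to a junction. Assumed definition, for illustration only."""
    n = len(offsets)
    counts = Counter(offsets)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical offsets of junction-spanning reads.
stacked = [5, 5, 5, 5]     # four identical reads: no diversity
spread = [1, 2, 3, 4]      # four reads at distinct offsets
```

Under this assumed definition the stacked reads give an entropy of 0 bits and the spread reads give 2 bits, so the second junction would look better supported even though both have four reads.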

Validating the algorithm

• How can we tell if it is working well?
  – comparison against transcriptome library alignment
  – but how do we know that novel alignments are valid?
    • run on synthetic transcriptome for which we know ground truth!

[Figure: read alignments of MapSplice compared with BWA]
• aligned by both: 81.4% (identically aligned: 80.4%)
• BWA aligned only: 1.2%
• MapSplice aligned only: 5.0% / 6.8%
• unaligned: 10.2%

Synthetic Transcriptome

1. Sample each gene’s ABUNDANCE from Wang et al. (2008)
2. Choose a DISTRIBUTION across annotated transcript isoforms in RefSeq
3. Randomly pick the START position for each read (& introduce errors)
4. Align reads with MapSplice and analyze performance.
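Steps 1 to 3 of the synthetic transcriptome can be sketched as a small read simulator. This is an illustration under assumptions, not the simulator used for the results shown: the gene table, abundances, isoform proportions, and sequences are hypothetical toy values (the slides draw abundances from Wang et al. (2008) and isoforms from RefSeq), errors are uniform substitutions, and isoform length is ignored when sampling.

```python
import random

BASES = "ACGT"

# Hypothetical toy annotation: gene -> abundance and (sequence, proportion) pairs.
GENES = {
    "g1": {"abundance": 10.0,
           "isoforms": [("ACGTACGTACGTACGT", 0.7), ("ACGTACGTTTTT", 0.3)]},
    "g2": {"abundance": 1.0,
           "isoforms": [("GGGGCCCCAAAATTTT", 1.0)]},
}


def add_errors(seq, error_rate, rng):
    """Replace each base by a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(genes, n_reads, read_len, error_rate, seed=0):
    """Return (gene, isoform_index, start, read) tuples. The first three
    fields are the ground truth an aligner can later be scored against."""
    rng = random.Random(seed)
    names = list(genes)
    abundances = [genes[g]["abundance"] for g in names]
    reads = []
    for _ in range(n_reads):
        gene = rng.choices(names, weights=abundances)[0]          # step 1
        isoforms = genes[gene]["isoforms"]
        weights = [p for _, p in isoforms]
        iso = rng.choices(range(len(isoforms)), weights=weights)[0]  # step 2
        seq = isoforms[iso][0]
        start = rng.randint(0, len(seq) - read_len)               # step 3
        read = add_errors(seq[start:start + read_len], error_rate, rng)
        reads.append((gene, iso, start, read))
    return reads


clean_reads = simulate_reads(GENES, n_reads=50, read_len=8, error_rate=0.0)
noisy_reads = simulate_reads(GENES, n_reads=50, read_len=8, error_rate=0.02)
```

Because every read carries its true gene, isoform, and start position, step 4 reduces to comparing the aligner's output with those fields.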

MapSplice performance

Improved accuracy from multiple criteria in junction classification


• Transcriptome changes in response to time, disease, etc
• Characteristics of a transcriptome
  – Qualitatively, which transcripts are expressed
  – Quantitatively, what are their expression levels

[Figure: a gene with exons 1, 2, 3, 4 yields transcript α (exons 1, 2, 3, 4) and transcript β (exons 1, 3, 4), which encode Protein α and Protein β; labels: Splicing Ratio, Transcript Abundance, Protein Expression]

• Transcriptome changes in response to time, disease, etc
• Differential Splicing: alternative splicing events that exhibit significantly different splicing ratios between different samples

[Figure: Differential Splicing. Normal and Tumor samples both express transcript α (exons 1, 2, 3, 4) and transcript β (exons 1, 3, 4), encoding Protein α and Protein β, with different splicing ratios and protein expression]
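The difference between a change in expression and a change in splicing ratio can be made concrete with a small calculation. The abundances below are hypothetical, and the sketch only compares ratios; deciding whether a difference is significant requires a statistical test across samples, which is not shown.

```python
def splicing_ratios(abundance):
    """Fraction of the gene's total transcript abundance per isoform."""
    total = sum(abundance.values())
    return {isoform: value / total for isoform, value in abundance.items()}


# Hypothetical isoform abundances for one gene in three samples.
normal = {"alpha": 80.0, "beta": 20.0}
doubled = {"alpha": 160.0, "beta": 40.0}   # expression doubles, ratios unchanged
tumor = {"alpha": 30.0, "beta": 90.0}      # ratios shift toward transcript beta

r_normal = splicing_ratios(normal)
r_doubled = splicing_ratios(doubled)
r_tumor = splicing_ratios(tumor)
shift_alpha = r_tumor["alpha"] - r_normal["alpha"]
```

The `doubled` sample is differentially expressed but not differentially spliced, while the `tumor` sample shows a large shift in the ratio of transcript α.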

• Differential Splicing: why important?
  – Understanding of cell differentiation and development
  – Identification of disease biomarkers


DiffSplice – Unified Graph Representation

• Unify structural information (exons and junctions) from all samples
• Unified Expression-weighted Splice Graph (ESG): a weighted DAG (Directed Acyclic Graph)
  – Vertex: exonic segment
  – Edge: splice junction
  – Weight: expression level
• Differentiate samples by the weights

[Figure: RNA-seq read alignment to the reference genome (5’ to 3’), the observed read coverage, the splice structure with exons E1 to E5 and junctions (J1, J2, J5 labeled), and the resulting ESG running from TS through E1 to E5 to TE]
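A unified graph of this kind can be represented with plain dictionaries: one shared set of vertices and edges, each carrying one weight per sample. The sketch below is a hypothetical data layout for illustration, not DiffSplice's internal structure; the sample names, exon names, and weights are made up.

```python
from collections import defaultdict


def build_esg(samples):
    """Merge per-sample exons and junctions into one graph.

    samples: name -> {"exons": {exon: weight}, "junctions": {(u, v): weight}}
    Returns (vertices, edges), each mapping a feature to {sample: weight}.
    """
    vertices = defaultdict(dict)
    edges = defaultdict(dict)
    for name, data in samples.items():
        for exon, weight in data["exons"].items():
            vertices[exon][name] = weight
        for junction, weight in data["junctions"].items():
            edges[junction][name] = weight
    # A feature missing from a sample gets weight 0 there, so every sample
    # is described on the same unified structure.
    for table in (vertices, edges):
        for weights in table.values():
            for name in samples:
                weights.setdefault(name, 0.0)
    return dict(vertices), dict(edges)


# Hypothetical samples: A always includes E2, B mostly skips it.
SAMPLES = {
    "A": {"exons": {"E1": 50.0, "E2": 48.0, "E3": 51.0},
          "junctions": {("E1", "E2"): 47.0, ("E2", "E3"): 49.0}},
    "B": {"exons": {"E1": 40.0, "E2": 5.0, "E3": 42.0},
          "junctions": {("E1", "E2"): 4.0, ("E2", "E3"): 5.0,
                        ("E1", "E3"): 36.0}},
}

vertices, edges = build_esg(SAMPLES)
```

The structure is identical for both samples after merging, so the skipping junction from E1 to E3 distinguishes them only through its weights.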

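DiffSplice – Unified Graph Representation

[Figure: the ESG (TS, E1, E2, E3, E4, E5, TE) is decomposed into regions, each bounded by a source and a sink that form an immediate pre-dominator and immediate post-dominator pair: one region covers E1, E2, E3 (bounded by E1 and E3), the other covers E3, E4, E5, TE (bounded by E3 and TE)]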
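The dominator relations named in the figure can be computed with the standard iterative dominator algorithm. The edge set below is a hypothetical graph chosen to be consistent with the figure labels (E2 is optional, E4 and E5 are alternatives); the actual edges of the slide's graph are not recoverable from the text. Treating every branching vertex as a region source is also a simplifying assumption.

```python
def dominators(graph, root):
    """graph: node -> list of successors. Returns node -> set of dominators."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    preds = {n: [] for n in nodes}
    for u, succ in graph.items():
        for v in succ:
            preds[v].append(u)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == root or not preds[n]:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate(dom, node):
    """Closest strict dominator: the one with the largest dominator set."""
    strict = dom[node] - {node}
    return max(strict, key=lambda d: len(dom[d])) if strict else None


def reverse(graph):
    """Reverse all edges, so dominators of the result are post-dominators."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    rev = {n: [] for n in nodes}
    for u, succ in graph.items():
        for v in succ:
            rev[v].append(u)
    return rev


# Hypothetical ESG consistent with the labels in the figure.
ESG = {"TS": ["E1"], "E1": ["E2", "E3"], "E2": ["E3"],
       "E3": ["E4", "E5"], "E4": ["TE"], "E5": ["TE"], "TE": []}

pre = dominators(ESG, "TS")
post = dominators(reverse(ESG), "TE")
# Assumed rule: a branching vertex is a source, its immediate
# post-dominator is the matching sink.
regions = [(n, immediate(post, n)) for n in ESG if len(ESG[n]) > 1]
```

On this toy graph the two regions are bounded by E1 and E3 and by E3 and TE, matching the source and sink pairs labeled in the figure.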

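DiffSplice – Alternative Splicing Modules (ASMs)

[Figure: the ESG (TS, E1, E2, E3, E4, E5, TE) contains two ASMs, each with a source, a sink, and two alternative paths (path 1, path 2): ASM1 over E1, E2, E3 and ASM2 over E3, E4, E5, TE. The graph is shown at Level 0 and Level 1.]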

DiffSplice – Isoform Abundance Estimation

• Observed expression of ASM1 in sample A1: weights w(E1), w(E2), w(E3), w(J1), w(J2), w(J3)
• The weights are modeled with Poisson and Normal distributions
• Estimated expression of ASM1: the alternative path proportions are obtained as the arg max of the likelihood of the observed weights

[Figure: ASM1 (E1, E2, E3) in sample A1, observed and estimated expression. The ASM is labeled 95.2, with alternative path proportions of 96.7% for path 1 and 3.3% for path 2.]
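To show what a maximum-likelihood estimate of path proportions looks like, the sketch below uses a deliberately simplified model that is not the one in DiffSplice: it assumes a Poisson distribution only, uses only features unique to one path, and ignores the Normal component shown on the slide. If each count is Poisson with mean rate × length, setting the derivative of the log-likelihood to zero gives rate = Σ counts / Σ lengths. All counts and lengths are hypothetical.

```python
def poisson_mle_rate(features):
    """MLE of a per-base rate when count_i ~ Poisson(rate * length_i).

    features: list of (count, length) pairs. Maximizing
    sum(count_i * log(rate * length_i) - rate * length_i) over rate
    gives rate = sum(counts) / sum(lengths).
    """
    total_count = sum(count for count, _ in features)
    total_length = sum(length for _, length in features)
    return total_count / total_length


def path_proportions(path_features):
    """Share of each path's estimated rate in the sum over all paths."""
    rates = {path: poisson_mle_rate(f) for path, f in path_features.items()}
    total = sum(rates.values())
    return {path: rate / total for path, rate in rates.items()}


# Hypothetical (count, length) pairs for features unique to each path:
# path 1 includes the optional exon and its two junctions' exon-side
# coverage, path 2 is supported only by the skipping junction.
ASM = {
    "path 1": [(180, 60), (120, 40)],
    "path 2": [(50, 50)],
}

proportions = path_proportions(ASM)
```

With these toy numbers path 1 receives three quarters of the expression and path 2 one quarter. A full model would also use the shared exons E1 and E3 and account for the variance of the weights.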
