What is RNA-seq?
• RNA-seq is the high-throughput sequencing of cDNA obtained by reverse transcription of RNA
• It is used to measure RNA expression
• It is the NGS equivalent of microarray gene-expression profiling
RNA-seq applications
• Discovery
• new transcripts
• transcript boundaries
• splice junctions
• Comparison (between different samples)
• evaluate gene expression
• evaluate differences in splice patterns and isoform abundance
RNA families
RNA
  Coding
    PolyA mRNA
    Non-PolyA mRNA
  Non-coding
    Structural
      DNA associated
        Replisome
        DNA repair
        Telomeric
        DNA methylation (piRNA)
      RNA associated
        Ribosome associated (rRNA)
    Regulatory
      MicroRNA
      TSS associated
      Antisense
      Enhancer RNA
RNA-seq and poly-A
• RNA-seq preparation protocols usually include poly-A selection.
• this is an attempt to remove rRNA
• mRNA is not the only RNA that is poly-adenylated
• Always check which library preparation protocol was used (other approaches are possible, e.g. an rRNA depletion kit)
RNA-seq vs microarray
• Microarray
• Pros:
• Low cost, well-established methods, small data
• Cons:
• Hybridization bias, sequences must be known in advance
• RNA-seq
• Pros:
• Technically reproducible (technical replicates are rarely needed; biological replicates still are), real transcriptome
• Information rich (not limited to expression)
• Cons:
• Complexity (many steps are needed before results are obtained)
• Data size and computational power required
RNA-seq vs microarray
Marioni J C et al. Genome Res. 2008;18:1509-1517
Differentially expressed genes called by microarray and RNA-seq
Alignment methods
• Two different approaches are possible:
• Align against the transcriptome
• faster, easier
• Align against the whole genome
• retains the complete information
Alignment tools
• Common NGS alignment programs:
• BWA
• Bowtie (Bowtie2)
• Novoalign
• Splice-aware (take splice junctions into account):
• TopHat (typically paired with Cufflinks for transcript assembly)
Alternative splicing
Alternative splicing is a normal biological phenomenon. One gene can encode different proteins by changing the combination of exons included in the mature mRNA.
De novo Assembly
• Transcriptomic content is more variable than genomic DNA content
• isoforms, alternative splicing
• gene fusions
• Mapping reads to a reference genome cannot always cope with such structural alterations.
• De novo transcriptome assembly
De novo Assembly
• Underlying assumptions related to RNA expression
• sequence coverage is similar across reads of the same transcript
• strand specific (sense and antisense transcripts)
• Assemblers:
• Velvet (genomic and transcriptomic)
• Trinity (transcriptomic)
• Cufflinks (transcriptomic; assembles transcripts from pre-aligned reads to find alternative splicing and differential expression)
RNA-seq and “reads”
• Reads, counts, call them as you wish. The number of reads for a region reflects its expression level.
• Different ways to consider the reads
• each read = 1 count
• FPKM (fragments per kilobase of exon per million mapped fragments)
FPKM
• FPKM normalizes fragment counts by transcript length and sequencing depth. As computed by Cufflinks, it also deals with the fact that many reads map to several transcripts: each read influences the FPKM values of all these transcripts, but does not increase each count by one.
• FPKM is calculated by software like Cufflinks (http://cufflinks.cbcb.umd.edu/)
• It is not possible to convert FPKM back to read counts
• transcript length times FPKM != reads, because each read can match several transcripts
• Useful to compare the abundance of different transcripts within the same sample
• May help detect alternative splicing
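As a rough illustration of the normalization only, the textbook FPKM definition can be sketched as below. The function name and the example numbers are hypothetical, and this is not how Cufflinks works internally: Cufflinks estimates fractional fragment assignments across isoforms before normalizing, which is why the fragment count here is allowed to be non-integer.

```python
def fpkm(fragments: float, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    `fragments` may be fractional when a tool splits one fragment
    among several candidate transcripts.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


# Hypothetical sample with 20 million mapped fragments.
library_size = 20_000_000
short_tx = fpkm(500, 1_000, library_size)  # 500 fragments on a 1 kb transcript
long_tx = fpkm(500, 4_000, library_size)   # 500 fragments on a 4 kb transcript
print(short_tx, long_tx)
```

With the same number of fragments, the longer transcript gets the lower FPKM, which is what makes values comparable between transcripts of different length within one sample.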
reads-count
• When counting reads we need to be sure that the alignment is unambiguous.
• Software like HTSeq counts reads, discarding ambiguous or non-unique matches.
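The counting rule above can be sketched with a toy example. This is a simplified illustration, not HTSeq's implementation or API: the function name, the tuple layouts and the gene coordinates are all assumptions made for the example, and a real tool works on exons read from an annotation file rather than on whole-gene intervals.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count one read per gene, skipping non-unique and ambiguous reads.

    reads: iterable of (chrom, start, end, n_alignments)
    genes: dict of gene_id -> (chrom, start, end), inclusive coordinates
    """
    counts = Counter({gene: 0 for gene in genes})
    skipped = Counter()
    for chrom, start, end, n_alignments in reads:
        if n_alignments > 1:
            skipped["not_unique"] += 1  # read aligned to several places
            continue
        overlapping = [
            gene
            for gene, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif not overlapping:
            skipped["no_feature"] += 1
        else:
            skipped["ambiguous"] += 1  # read overlaps more than one gene
    return counts, skipped


# Hypothetical annotation with two overlapping genes.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170, 1),  # only geneA
    ("chr1", 460, 510, 1),  # overlaps geneA and geneB
    ("chr1", 600, 650, 1),  # only geneB
    ("chr1", 600, 650, 3),  # multi-mapped read
    ("chr2", 10, 60, 1),    # no annotated gene
]
counts, skipped = count_reads(reads, genes)
print(dict(counts), dict(skipped))
```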
Differential expression
• Reads obtained from different samples
• Compare reads for the same transcript/gene in the various samples
• Challenges:
• Annotation
• Statistics
Challenges
• Annotation
• Alignment to the transcriptome (transcript_id).
• Alignment to the reference genome
• use a GTF file to map the desired feature types onto the genome (HTSeq does this)
• Statistics
• R/Bioconductor (edgeR, DESeq and more)
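For the annotation step, a GTF file is a tab-separated table with nine columns (sequence name, source, feature type, start, end, score, strand, frame, attributes). A minimal sketch of pulling exon records out of it follows; the function name and the example line are made up for illustration, and real files are better handled by a dedicated parser.

```python
import re


def parse_gtf_exon(line):
    """Return (gene_id, chrom, start, end, strand) for an exon line, else None."""
    if line.startswith("#"):
        return None  # comment or header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "exon":
        return None  # malformed line or another feature type
    match = re.search(r'gene_id "([^"]+)"', fields[8])
    if match is None:
        return None
    return match.group(1), fields[0], int(fields[3]), int(fields[4]), fields[6]


# Hypothetical GTF record.
example = (
    "chr1\texample_source\texon\t100\t500\t.\t+\t.\t"
    'gene_id "geneA"; transcript_id "geneA.1";'
)
print(parse_gtf_exon(example))
```

Grouping the returned intervals by `gene_id` gives the feature-to-genome mapping that a read-counting step needs.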
Statistics
• A series of observations can be associated with a distribution function.
• For count data, the most appropriate function is generally a binomial or a Poisson distribution.
• The advantage of using a distribution is that, given a few parameters (e.g. size and mean), we can describe the whole data set
Statistics
• When RNA-seq data are fitted with a pure Poisson distribution, the observed variance is higher than expected (overdispersion)
• A negative binomial distribution is similar to a Poisson distribution, but with higher variance.
• The negative binomial is implemented in several R packages; it is a better fit for count models such as RNA-seq
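Overdispersion can be checked on replicate counts for a single gene. The sketch below assumes the common parameterization in which the negative binomial variance is mean + dispersion × mean², and uses a crude method-of-moments estimate; the function name and the counts are hypothetical, and packages such as edgeR and DESeq estimate dispersion with more robust methods that share information across genes.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion, assuming variance = mean + dispersion * mean**2.

    Returns 0.0 when the variance does not exceed the mean (Poisson-like data).
    """
    m = mean(counts)
    v = variance(counts)  # sample variance, n - 1 denominator
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / (m * m))


# Hypothetical counts for one gene across five replicates.
overdispersed = [80, 100, 120, 140, 60]
poisson_like = [98, 102, 100, 99, 101]

print(mean(overdispersed), variance(overdispersed), moment_dispersion(overdispersed))
print(mean(poisson_like), variance(poisson_like), moment_dispersion(poisson_like))
```

In the first set the variance is ten times the mean, so a Poisson model would understate the variability; in the second set the estimate falls back to zero.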
Statistics
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
Statistics
Poisson and negative binomial parameters (http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm)
The negative binomial has an additional parameter (overdispersion). When overdispersion = 0, the negative binomial reduces to the Poisson.
Statistics
• Normalization:
• Different samples have different numbers of total reads (library size)
• Normalization for library size (each package implements a different method)
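The simplest form of library-size normalization is scaling to counts per million, sketched below. The function name and the data are hypothetical, and this is only the most basic scheme; packages such as edgeR and DESeq estimate more refined normalization factors.

```python
def counts_per_million(counts_by_sample):
    """Scale raw counts so that each sample sums to one million.

    counts_by_sample: dict of sample -> dict of gene -> raw count
    """
    normalized = {}
    for sample, gene_counts in counts_by_sample.items():
        library_size = sum(gene_counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        normalized[sample] = {
            gene: count * 1_000_000 / library_size
            for gene, count in gene_counts.items()
        }
    return normalized


# Hypothetical samples: sample2 was sequenced twice as deeply as sample1.
raw = {
    "sample1": {"geneA": 100, "geneB": 900},
    "sample2": {"geneA": 200, "geneB": 1800},
}
cpm = counts_per_million(raw)
print(cpm["sample1"]["geneA"], cpm["sample2"]["geneA"])
```

The raw counts for geneA differ by a factor of two, but after scaling the two samples agree, so the difference was due to sequencing depth rather than expression.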
edgeR
• From the authors of limma (linear models for microarray data)
• Negative binomial distribution
• Normalizes for library size (normalization factors)