Hyun Seok Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Lecture 10. Microarray and RNA-seq

Lecture 10. Microarray and RNA-seq: Identification of Differentially Expressed Genes (DEG)

MES7594-01 Genome Informatics I (2015 Spring)

Gene expression

What is a microarray?

• Platforms
– Glass slides (cDNA array)
– Chips (Affymetrix)
– Glass beads (Illumina)

• Tens of thousands of oligonucleotide (or cDNA) probes are fixed on the surface of the platform.

• Microarrays can detect and quantify
– mRNA
– microRNA
– SNP
– LOH
– CNV
– …

[Figure: microarray platforms (cDNA, Affymetrix, Illumina)]

Questions of Interest

o Determine steady-state gene expression levels of a sample at whole-transcriptome scale.

o Identify differentially expressed genes between samples.

o Identify differentially regulated pathways or protein complexes.


Affymetrix GeneChip for mRNA quantification

About the Affy GeneChip platform
• Probes (25-mers) are synthesized on a chip using a photolithographic manufacturing process.
• At each x, y location of a GeneChip, millions of copies of one particular oligonucleotide are synthesized.
• Each gene is represented by a unique set of probe pairs: perfect match (PM) and mismatch (MM). The MM probe helps increase the specificity of the PM signal.


Affymetrix GeneChip for mRNA quantification

About the Affy workflow
• Isolate total RNA (biological replicates are needed)
• Sample amplification and labeling
• Sample injected into microarray
• Probe array hybridization, washing
• Probe array scanning and intensity quantification
• Intensity translated into nucleic acid abundance


Illumina BeadArray for mRNA quantification

Each bead carries one type of oligo, with thousands of copies of that oligo per bead.

Beads are deposited into wells on glass slides. The beads are then decoded in a separate step using proprietary technology.


Beadchip platform

Affy vs Illumina

Affymetrix GeneChip                      | Illumina BeadArray
25-mer                                   | Longer oligo
Probe synthesized on chips               | Bead technology
Multiple probes/probeset                 | Single probe
Multiple probes/transcript               | Multiple probes/transcript
.dat, .cel, .cdf, .chp file types        | Image file processed by Bead Studio
Normalization by MAS5, RMA, GC-RMA etc.  | Normalization by average, quantile, RSN etc.
TXT output for downstream analysis       | TXT output for downstream analysis
Annotations can be updated               | Annotations can be updated

Adapted from Dr. Chandran@pitt

RNA sequencing

1. Samples of interest: Condition 1 (normal colon) and Condition 2 (colon tumor)
2. Isolate RNAs
3. Generate cDNA, fragment, size select, add linkers
4. Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
5. Map to genome, transcriptome, and predicted exon junctions
6. Downstream analysis

Adapted from Canadian Bioinformatics Workshop

Pros and Cons of RNA-seq (versus microarray)

Pros
• More powerful in detecting lowly expressed genes
• Detects splicing variants and fusion transcripts
• Measures allele-specific expression
• Discovers mutations

Cons
• Biased toward highly expressed genes (e.g. ribosomal, mitochondrial genes)
• More complicated analysis workflow (mapping to a reference genome)
• More expensive, e.g. HiSeq 2500, 2 × 100 bp, 4 Gb: $700/sample (vs. < $500/array)

RNA-seq: Experimental Design

• Single-end read: one read sequenced from one end of each sample cDNA insert.

• Paired-end read: two reads (one from each end) sequenced from each sample cDNA insert. Paired-end reads map better over repetitive regions and help detect fusions and novel transcripts.

RNA-seq workflows

• Sequencing: obtain raw data (fastq format)

• Quality control (optional): FASTX

• Workflow 1: tophat2 (align) -> cufflinks (transcript assembly) -> cuffmerge (merge assemblies) -> cuffdiff (DEGs)

• Workflow 2: bowtie2 (align) -> HTSeq-count (count by gene) -> edgeR or DESeq (DEGs)

• Fusion detection (optional): “chimerascan” or “defuse”

Normalization method (old)

• RPKM: Reads per kilobase per million mapped reads (single end)

• RPKM = (10^9 * C) / (N * L)
  C = number of reads mapped to a gene
  N = total mapped reads in the experiment
  L = exon length in bp for a gene (the factor 10^9 = 10^3 * 10^6 converts bp to kb and reads to millions of reads)

• RPKM values are not consistent across samples.

• FPKM: Fragments per kilobase per million mapped fragments (paired end).

• RPKM- and FPKM-based DEG discovery is affected by gene length (no longer recommended).
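The RPKM formula can be checked with a short Python sketch. It assumes the exon length is given in base pairs, which is what the 10^9 factor implies; the function names and the example numbers are illustrative and do not come from any RNA-seq package.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads.

    Assumes exon_length_bp is in base pairs, as implied by the 10^9 factor.
    """
    return (1e9 * read_count) / (total_mapped_reads * exon_length_bp)


def rpkm_stepwise(read_count, total_mapped_reads, exon_length_bp):
    """Same quantity, written as 'reads per kilobase' per 'million mapped reads'."""
    reads_per_kb = read_count / (exon_length_bp / 1e3)
    return reads_per_kb / (total_mapped_reads / 1e6)


# Hypothetical numbers: 500 reads on a 2,000 bp gene, 20 million mapped reads.
example = rpkm(500, 20_000_000, 2_000)
print(example, rpkm_stepwise(500, 20_000_000, 2_000))
```

Both functions return the same value, which shows that the single 10^9 factor is just the per-kilobase and per-million scalings combined.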

[Figure: microarray vs. RNA-seq; Poisson vs. negative binomial. Adapted from EMBL]

Poisson vs. negative binomial

Under a Poisson model the variance equals the mean. RNA-seq counts do not behave this way: genes with high mean counts (either because they are long or highly expressed) tend to show more variance between samples than a Poisson model allows. Count data with this extra variance fit a negative binomial distribution, so edgeR and DESeq identify DEGs based on the negative binomial distribution. Microarray intensities are continuous values rather than counts and are usually modeled as approximately normal after log transformation.
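To make the Poisson versus negative binomial contrast concrete, here is a small Python sketch. It assumes the common parameterization in which the negative binomial variance is mean + dispersion * mean^2; the dispersion value of 0.1 is hypothetical and chosen only for illustration.

```python
def poisson_variance(mean_count):
    """Poisson: the variance equals the mean."""
    return mean_count


def negative_binomial_variance(mean_count, dispersion):
    """Negative binomial: variance = mean + dispersion * mean^2."""
    return mean_count + dispersion * mean_count ** 2


# Hypothetical dispersion of 0.1, chosen only for illustration.
for mean_count in (10, 100, 1000):
    ratio = negative_binomial_variance(mean_count, 0.1) / poisson_variance(mean_count)
    print(mean_count, ratio)
```

The ratio grows with the mean, so the gap between the two models is widest for genes with high counts.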

A case study for practice session

• GEO accession number: GSE41588


GEO entry of GSE41588

Two platforms: Affy and HiSeq

Raw files: CEL, count files

Matrix file: processed data

Data Pre-processing

• Affy produces CEL-format files as raw data.

• A CEL file contains the feature quantifications.

• In a CEL file, the probes are still spread over the chip.

• Values still need to be summarized to the probe set level; for example, 90525_at = 250 units.

• Probe set: a collection of probes designed to interrogate a given sequence.


CEL file to TXT file

• In going from a .CEL file to a .TXT file to generate signal values, the multiple probes within a probe set are “averaged” to produce a single value for that gene/transcript.

• The CEL files must first be normalized to account for technical variation between the arrays.

Robust Multi-array Average (RMA)

1. Background adjust PM values from .CEL files.

2. Take the base-2 log of each background-adjusted PM intensity.

3. Quantile normalize values from step 2 across all GeneChips.

4. Perform median polish separately for each probe set, with rows indexed by GeneChip and columns indexed by probe.

5. For each row, find the average of the fitted values from step 4 to use as the probe-set-specific expression measure for each GeneChip. -> .TXT files

Log Transformation

Reasons for working with log-transformed intensities
• Spreads features more evenly across the intensity range
• Makes variability more constant across the intensity range
• Brings the distribution of intensities and errors closer to normal
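One practical consequence of the log scale is that up- and down-regulation become symmetric: a doubling and a halving are equally far from zero. A minimal Python sketch, with made-up intensities and a hypothetical helper name:

```python
import math


def log2_fold_change(treated, control):
    """Log2 ratio of two raw intensities."""
    return math.log2(treated / control)


# Hypothetical raw intensities: a 2-fold increase and a 2-fold decrease.
up = log2_fold_change(200.0, 100.0)
down = log2_fold_change(50.0, 100.0)
print(up, down)
```

On the raw scale the same two changes are ratios of 2 and 0.5, which are not symmetric around 1.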

How to normalize?

• Many methods

– Median scaling: the median intensity should be the same for all chips

– Known genes, housekeeping genes, invariant genes

– Quantile normalization: RMA (Robust Multiarray Averaging), GC-RMA

– The normalization method may differ depending on the array platform

– (Reading materials) GC-RMA: Wu et al. (2004), JASA, 99, 909-917. RMA: Irizarry et al. (2003), Nuc Acids Res, 31, e15.
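Median scaling, the simplest method in the list, can be sketched in a few lines of Python. This version rescales every chip so that its median matches the mean of all chip medians; that choice of target is an assumption made here, and the intensities are made up.

```python
import statistics


def median_scale(chips):
    """Rescale each chip so that all chips share the same median intensity.

    chips: list of lists of raw intensities, one inner list per chip.
    The common target (mean of the chip medians) is an assumption of this sketch.
    """
    medians = [statistics.median(chip) for chip in chips]
    target = statistics.mean(medians)
    return [[value * target / m for value in chip] for chip, m in zip(chips, medians)]


# Hypothetical intensities: the second chip is uniformly twice as bright.
scaled = median_scale([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(scaled)
```

After scaling, both chips have the same median, but unlike quantile normalization the shapes of their distributions are left unchanged.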


[Figure: raw data vs. after normalization]

RMA: Quantile Normalization

1. After background adjustment, find the smallest log2(PM) on each chip.

2. Average the values from step 1.

3. Replace each value in step 1 with the average computed in step 2.

4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.
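The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by any RMA package: it assumes every chip has the same number of values and breaks ties arbitrarily by position.

```python
def quantile_normalize(chips):
    """Quantile normalize equal-length lists of log2(PM) values, one list per chip."""
    n = len(chips[0])
    # For each chip, the positions of its values in ascending order.
    orders = [sorted(range(n), key=lambda i: chip[i]) for chip in chips]
    normalized = [[0.0] * n for _ in chips]
    for rank in range(n):
        # Steps 1-2: take the rank-th smallest value on each chip and average them.
        mean_at_rank = sum(chip[order[rank]] for chip, order in zip(chips, orders)) / len(chips)
        # Step 3: write the average back at the position each value came from.
        for out, order in zip(normalized, orders):
            out[order[rank]] = mean_at_rank
    return normalized


# Hypothetical log2(PM) values for two chips with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)
```

After normalization every chip contains exactly the same set of values; only the probe that holds each value differs from chip to chip.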


RMA: Median Polish

• For a given probe set with J probe pairs, let y_ij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.

• Assume y_ij = μ_i + α_j + e_ij, where α_1 + α_2 + ... + α_J = 0.
– μ_i: gene expression of the probe set on GeneChip i
– α_j: probe affinity effect for the jth probe in the probe set
– e_ij: residual for the jth probe on the ith GeneChip

• Perform Tukey’s Median Polish on the matrix of y_ij values, with y_ij in the ith row and jth column.

• μ_i from median polish is the probe-set-specific measure of expression for GeneChip i after correcting for array effect and probe effect.
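A minimal Python sketch of Tukey's median polish for one probe set is given below. It assumes a fixed number of sweeps instead of a convergence check, and the 2 x 3 input matrix is made up. Median polish centers the probe effects at a median of zero rather than a sum of zero, so this is an illustration of the idea, not a reproduction of any RMA package.

```python
import statistics


def median_polish(matrix, n_iter=10):
    """Tukey's median polish on a chips-by-probes matrix of log2 values.

    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep each row's median out of the residuals into the row effect.
        for i in range(n_rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Move the median of the column effects into the overall term.
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep each column's median out of the residuals into the column effect.
        for j in range(n_cols):
            m = statistics.median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Move the median of the row effects into the overall term.
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Hypothetical probe set: 2 GeneChips (rows) x 3 probes (columns).
y = [[7.0, 8.0, 9.0],
     [9.0, 10.0, 11.0]]
overall, row_eff, col_eff, resid = median_polish(y)
expression = [overall + r for r in row_eff]  # one expression value per GeneChip
print(expression, col_eff)
```

Here the overall term plus the row effect plays the role of the chip-level expression measure, and the column effects play the role of the probe affinity effects.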

Differentially Expressed Genes (DEG)

• Criteria for DEG discovery
– Amount of difference: fold change, signal-to-noise ratio
– Statistical significance: p-value, false discovery rate (FDR), odds ratio

• Statistical methods
– Parametric: t-test
– Non-parametric: Wilcoxon rank-sum test
– Significance Analysis of Microarrays (SAM; permutation based)
– Empirical Bayesian (Linear Models for Microarray Data, LIMMA; Affy data)
– ANOVA (multiple factors; e.g. two different strains +/- drug)

• Multiplicity of testing: p-value adjustments
– Methods: FDR, Bonferroni, etc.
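As an illustration of p-value adjustment, the Python sketch below implements the Bonferroni correction and the Benjamini-Hochberg procedure, which is one common way of controlling the FDR. The slide does not say which FDR method is meant, so that choice is an assumption, and the p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the more conservative of the two, so it tends to report fewer DEGs from the same set of raw p-values.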


Limma & Empirical Bayesian

• Limma is an R package to find DEGs.

• It uses linear models
– Fitted to normalized intensities for each gene given a series of arrays
– Design matrix: indicates which RNA samples have been applied to each array
– Contrast matrix: specifies which comparisons you would like to make between the RNA samples
– Can be used to compare two or more groups

• Assumption: normal distribution

• Uses empirical Bayesian analysis to improve power in small sample sizes
– Borrowing information across genes

• Output: p-values (adjusted for multiple testing)

Moderated/Bayesian t-test

• Ordinary t-test is testing for differences in means between two groups given the variability within each group

• Moderated/Bayesian t-test: rather than estimating within-group variability over and over again for each gene, pool the information from many similar genes.

• Advantage: reduces the occurrence of spuriously large t-statistics that arise when the within-group variance of a gene happens to be estimated as very small.
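The effect of moderation can be sketched in Python for a single gene and a two-group comparison. The shrinkage formula, moderated variance = (d0 * s0^2 + d * s^2) / (d0 + d), follows the empirical Bayes model of Smyth (2004). The prior degrees of freedom d0 and prior variance s0^2 are set by hand here as hypothetical values, whereas limma estimates them from all genes; the function name and data are illustrative only.

```python
import math
import statistics


def ordinary_and_moderated_t(group_a, group_b, prior_df, prior_var):
    """Two-group t-statistics for one gene, before and after variance shrinkage.

    prior_df and prior_var are supplied by hand here (hypothetical values);
    limma estimates them from all genes on the arrays.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Pooled within-group variance for this gene alone.
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
    gene_var = ss / df
    # Shrink the gene's variance toward the prior variance shared across genes.
    moderated_var = (prior_df * prior_var + df * gene_var) / (prior_df + df)
    scale = math.sqrt(1.0 / n_a + 1.0 / n_b)
    diff = mean_a - mean_b
    return diff / (math.sqrt(gene_var) * scale), diff / (math.sqrt(moderated_var) * scale)


# Hypothetical log2 intensities with an accidentally tiny within-group variance.
control = [8.00, 8.01, 8.02]
treated = [8.10, 8.11, 8.12]
t_ordinary, t_moderated = ordinary_and_moderated_t(control, treated, prior_df=4, prior_var=0.04)
print(t_ordinary, t_moderated)
```

With these numbers the ordinary t-statistic is large only because the gene's own variance estimate is tiny; the moderated statistic is much smaller in magnitude because the variance is pulled toward the shared prior.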

Further reading

• RNA-seq normalization: Dillies M-A et al. Briefings in Bioinformatics, 2012, 14, 671-683.

• Limma & eBayes: Smyth GK. Statistical Applications in Genetics and Molecular Biology, 2004, 3 (1), article 3.

Course homepage: http://wiki.tgilab.org/MES7594




