Gene expression: Microarray data analysiscschweikert/cisc4020/MicroArray.pdf · “There is no...

Preview:

Citation preview

Gene expression:Microarray data analysis

Compare gene expression

in this cell type…

…after viral infection

…in samplesfrom patients

…relative to a knockout

…after drug treatment

…at a later developmental time

…in a different body region

• by region (e.g. brain versus kidney)

• in development (e.g. fetal versus adult tissue)

• in dynamic response to environmental signals

Gene expression is context-dependent,and is regulated in several basic ways

(e.g. immediate-early response genes)

• in disease states

• by gene activity

DNA RNA protein DNA RNA protein

cDNA cDNA

UniGene

SAGE

microarray

next-generation sequencing!!!

UniGene: unique genes via ESTs

• Find UniGene at NCBI:www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries.Thus, when you look up a gene in UniGeneyou get information on its abundanceand its regional distribution.

Microarrays: tools for gene expression

A microarray is a solid support (such as a membraneor glass microscope slide) on which DNA of knownsequence is deposited in a grid-like array.

Microarrays: tools for gene expression

The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levelsof thousands of genes.

Fast Data on >20,000 transcripts in ~2 weeks

Comprehensive Entire yeast or mouse genome on a chip

Flexible Custom arrays can be made to represent genes of interest

Advantages of microarray experiments

to represent genes of interest

Easy Submit RNA samples to a core facility

Cheap! Chip representing 20,000 genes for $300

Cost ■ Some researchers can’t afford to doappropriate numbers of controls, replicates (x100)

RNA ■ The final product of gene expression is proteinsignificance ■ “Pervasive transcription” of the genome is

poorly understood (ENCODE project)■ There are many noncoding RNAs not yet

Disadvantages of microarray experiments

■ There are many noncoding RNAs not yetrepresented on microarrays

Quality ■ Impossible to assess elements on array surfacecontrol ■ Artifacts with image analysis

■ Artifacts with data analysis■ Not enough attention to experimental design■ Not enough collaboration with statisticians

Sampleacquisition

Dataacquisition

Biological insight

Data analysis

Data confirmation

Stage 1: Experimental design

Stage 3: Hybridization to DNA arrays

Stage 2: RNA and probe preparation

Stage 4: Image analysis Stage 4: Image analysis

Stage 5: Microarray data analysis

Stage 6: Biological confirmation

Stage 7: Microarray databases

Stage 1: Experimental design

[1] Biological samples: technical and biological replicates:determine the data analysis approach at the outset

[2] RNA extraction, conversion, labeling, hybridization:except for RNA isolation, routinely performed at core facilities

[3] Arrangement of array elements on a surface:randomization can reduce spatially-based artifacts

Sample 1 Sample 2 Sample 3

One sample per array(e.g. Affymetrix or radioactivity-based platforms)

Samples 1,2 Samples 1,3 Samples 2,3

Two samples per array (competitive hybridization)

Sample 1, pool Sample 2, poolSamples 2,1:switch dyes

Stage 2: RNA preparation

For Affymetrix chips, need total RNA (about 5 ug)

Confirm purity by running agarose gelConfirm purity by running agarose gel

Measure a260/a280 to confirm purity, quantity

One of the greatest sources of error in microarrayexperiments is artifacts associated with RNA isolation;be sure to create an appropriately balanced,randomized experimental design.

Stage 3: Hybridization to DNA arrays

The array consists of cDNA or oligonucleotides

Oligonucleotides can be deposited by photolithographyOligonucleotides can be deposited by photolithography

The sample is converted to cRNA or cDNA

(Note that the terms “probe” and “target” may refer to theelement immobilized on the surface of the microarray, orto the labeled biological sample; for clarity, it may be simplest to avoid both terms.)

Microarrays: array surface

Southern et al. (1999) Nature Genetics, microarray supplement

Stage 4: Image analysis

RNA transcript levels are quantitated

Fluorescence intensity is measured with a scanner,Fluorescence intensity is measured with a scanner,or radioactivity with a phosphorimager

Control

Differential Gene Expression on a cDNA Microarray

Rett

αααα B Crystallin is over-expressed in Rett Syndrome

Stage 5: Microarray data analysis

Hypothesis testing• How can arrays be compared? • Which RNA transcripts (genes) are regulated?• Are differences authentic?• What are the criteria for statistical significance?

Clustering• Are there meaningful patterns in the data (e.g. groups)?

Classification• Do RNA transcripts predict predefined groups, such as disease subtypes?

Stage 6: Biological confirmation

Microarray experiments can be thought of as

“hypothesis-generating” experiments.

The differential up- or down-regulation of specific RNAThe differential up- or down-regulation of specific RNA

transcripts can be measured using independent assays

such as

-- Northern blots

-- polymerase chain reaction (RT-PCR)

-- in situ hybridization

Stage 7: Microarray databases

There are two main repositories:

Gene expression omnibus (GEO) at NCBIGene expression omnibus (GEO) at NCBI

ArrayExpress at the European Bioinformatics Institute

(EBI)

Array Express at the European Bioinformatics Institutehttp://www.ebi.ac.uk/arrayexpress/

MIAME

In an effort to standardize microarray data presentationand analysis, Alvis Brazma and colleagues at 17institutions introduced Minimum Information About aMicroarray Experiment (MIAME). The MIAME framework standardizes six areas of information:

►experimental design►microarray design►sample preparation►hybridization procedures►image analysis►controls for normalization

Visit http://www.mged.org

Microarray data analysis

• begin with a data matrix (gene expression values

versus samples)

genes(RNA

transcriptlevels)

Microarray data analysis

• begin with a data matrix (gene expression values

versus samples)

Typically, there areTypically, there aremany genes(>> 20,000) and few samples (~ 10)

Microarray data analysis

• begin with a data matrix (gene expression values

versus samples)

PreprocessingPreprocessing

Inferential statistics Descriptive statistics

Microarray data analysis: preprocessing

Observed differences in gene expression could be

due to transcriptional changes, or they could be

caused by artifacts such as:

• different labeling efficiencies of Cy3, Cy5

• uneven spotting of DNA onto an array surface

• variations in RNA purity or quantity

• variations in washing efficiency

• variations in scanning efficiency

Microarray data analysis: preprocessing

The main goal of data preprocessing is to remove

the systematic bias in the data as completely as

possible, while preserving the variation in gene

expression that occurs because of biologically

relevant changes in transcription.relevant changes in transcription.

A basic assumption of most normalization procedures

is that the average gene expression level does not

change in an experiment.

Data analysis: global normalization

Global normalization is used to correct two or more

data sets. In one common scenario, samples are

labeled with Cy3 (green dye) or Cy5 (red dye) and

hybridized to DNA elements on a microrarray. After

washing, probes are excited with a laser and detected

with a scanning confocal microscope.

Data analysis: global normalization

Global normalization is used to correct two or more

data sets

Example: total fluorescence in

Cy3 channel = 4 million units

Cy5 channel = 2 million unitsCy5 channel = 2 million units

Then the uncorrected ratio for a gene could show

2,000 units versus 1,000 units. This would artifactually

appear to show 2-fold regulation.

Data analysis: global normalization

Global normalization procedure

Step 1: subtract background intensity values

(use a blank region of the array)

Step 2: globally normalize so that the average ratio = 1

(apply this to 1-channel or 2-channel data sets)

Scatter plots

Useful to represent gene expression values from

two microarray experiments (e.g. control, experimental)

Each dot corresponds to a gene expression value

Most dots fall along a line

Outliers represent up-regulated or down-regulated genes

Brain Fibroblast

Differential Gene Expressionin Different Tissue and Cell Types

Astrocyte Astrocyte

high

Ex

pre

ss

ion

le

ve

l (s

am

ple

2)

low

Expression level (sample 1)

Ex

pre

ss

ion

le

ve

l (s

am

ple

2)

Log-log transformation

Scatter plots

Typically, data are plotted on log-log coordinates

Visually, this spreads out the data and offers symmetry

raw ratio log2 ratio

time behavior value valuetime behavior value value

t=0 basal 1.0 0.0

t=1h no change 1.0 0.0

t=2h 2-fold up 2.0 1.0

t=3h 2-fold down 0.5 -1.0

expression level

highlow

up

Lo

g r

ati

o

down

Mean log intensity

Lo

g r

ati

o

You can make these plots in Excel…

…but for many bioinformatics applications use R.Visit http://www.r-project.org to download it.

Visit http://www.r-project.org to download it. See chapter 9 (2nd edition) for a tutorial on microarray data analysis.

M M

A A

After RMA (a normalization procedure), the median is near zero, and skewing is corrected.

Scatterplots display the effects of normalization.

SNOMAD converts array data to scatter plotshttp://www.snomad.org

-2 -1 0 1 2-2

-1

0

1

2

0 10 20 30 40

0

10

20

30

40

EX

P

CON

EX

P

CON

Linear-linear

plotLog-log

plot

-1 0 1-1.0

-0.5

0.0

0.5

1.0

2-fold

2-fold

Lo

g10

(Rati

o )

Mean ( Log10 ( Intensity ) )

CON CON

EX

P >

CO

NE

XP

< C

ON

SNOMAD corrects local variance artifacts

0.5

1.0

0.5

1.0

2-fold

( R

ati

o )

robust localregression fit residual

EX

P >

CO

N

( R

ati

o )

[r

esid

uals

]-1 0 1

-1.0

-0.5

0.0

-1 0 1-1.0

-0.5

0.0

2-fold

Lo

g10

( R

ati

o )

Mean ( Log10 ( Intensity ) )

EX

P <

CO

N

Co

rrecte

d L

og

10

[resid

uals

]

Mean ( Log10 ( Intensity ) )

SNOMAD describes regulated genes in Z-scores

0

1

2

Co

rrecte

d L

og

10

( R

ati

o )

2-fold

Locally estimated standarddeviation of positive ratios

Z= 1

-5

0

5

10

Lo

cal L

og

10

( R

ati

o )

Z-S

co

re

Z= 5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5-2

-1

0

Co

rrecte

d L

og

Mean ( Log10 ( Intensity ) )

2-foldZ= -1

Locally estimated standarddeviation of negative ratios

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5-10

Mean ( Log10 ( Intensity ) )

Z= -5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5-2

-1

0

1

2

Co

rrecte

d L

og

10

( R

ati

o )

Mean ( Log10 ( Intensity ) )

2-fold

2-fold

Z= 2

Z= 1

Z= -1

Z= -2

Z= 5

Z= -5

Robust multi-array analysis (RMA)

• Developed by Rafael Irizarry, Terry Speed, and others• Available at www.bioconductor.org as an R package• Also available in various software packages (including

Partek, www.partek.com and Iobion Gene Traffic)• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4

There are three steps:

[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)

[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)

[3] Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect

12

14

81

06

1 2 3 4 5 6 7 8 9 10 11 12 13 14

array

log s

ignal in

tensity

14array

46

81

01

2

1 2 3 4 5 6 7 8 9 10 11 12 13 14

array

log s

ignal in

tensity

Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.

RMA adjusts for the effect of GC content

log

in

ten

sit

y

GC content

log

in

ten

sit

y

precision accuracyprecision with

accuracy

Good performance:reproducibility ofthe result

Good quality of the result(relative to agold standard)

Robust multi-array analysis (RMA)

RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).

precision

log

exp

ressio

n S

D

average log expression

log

exp

ressio

n S

D

RMA

MAS 5.0

Robust multi-array analysis (RMA)

RMA offers comparable accuracy to MAS 5.0.

ob

serv

ed

lo

g e

xp

ressio

n accuracy

log nominal concentration

ob

serv

ed

lo

g e

xp

ressio

n

Inferential statistics

Inferential statistics are used to make inferences

about a population from a sample.

Hypothesis testing is a common form of inferential

statistics. A null hypothesis is stated, such as:

“There is no difference in signal intensity for the gene“There is no difference in signal intensity for the gene

expression measurements in normal and diseased

samples.” The alternative hypothesis is that there

is a difference.

We use a test statistic to decide whether to accept or

reject the null hypothesis. For many applications,

we set the significance level α to p < 0.05.

Inferential statistics

A t-test is a commonly used test statistic to assess

the difference in mean values between two groups.

t = = x1 – x2

SE

difference between mean values

variability (standard errorof the difference)

Questions:

Is the sample size (n) adequate?

Are the data normally distributed?

Is the variance of the data known?

Is the variance the same in the two groups?

Is it appropriate to set the significance level to p < 0.05?

Inferential statistics

A t-test is a commonly used test statistic to assess

the difference in mean values between two groups.

t = =

Notes

x1 – x2

SE

difference between mean values

variability (standard errorof the difference)

Notes

• t is a ratio (it thus has no units)

• We assume the two populations are Gaussian

• The two groups may be of different sizes

• Obtain a P value from t using a table

• For a two-sample t test, the degrees of freedom is N -2.

For any value of t, P gets smaller as df gets larger

Analyzing expression data in ExcelQuestion: for each of my 20,000 transcripts, decide whether

it is significantly regulated in some disease.

control disease

[1] Obtain a matrix of genes (rows) and expression values columns. Here there are 20,000 rows of genes of which the first six are shown. There are three control samples and three disease samples. You can also calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.

Analyzing expression data in Excel

[2] You can calculate the ratios of control versus disease.

Note that you can use the formula =E5/I5 in this case to divide the mean control and disease values.

Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest such as two-fold.

Analyzing expression data in Excel

[3] Perform a t-test. When you enter =TTEST into the function box above, a dialog box appears. Enter the range of values for controls and for disease samples, and specify a 1-or 2-tailed test.

Analyzing expression data in Excel

[3] Perform a t-test (continued). For a one-tailed test, your prior hypothesis is that the transcript in the disease group is up (or down) relative to controls; the change is unidirectional. For example, in Down syndrome samples you might hypothesize that chromosome 21 transcripts are significantly up-regulated because of the extra copy of chromosome 21.

Analyzing expression data in Excel

[3] Perform a t-test (continued). For a two-tailed test, you hypothesize that the two groups are different, but you do not know in which direction.

Analyzing expression data in Excel

[3] Note the results: you can have…

a small p value (<0.05) with a big ratio differencea small p value (<0.05) with a trivial ratio differencea large p value (>0.05) with a big ratio differencea large p value (>0.05) with a trivial ratio difference

Only the first group is worth reporting! Why?

disease

vs normal

t-test to determine statistical significance

Error

difference between mean of disease and normalt statistic =

variation due to error

Inferential statistics

Paradigm Parametric test Nonparametric

Compare two

unpaired groups Unpaired t-test Mann-Whitney test

Compare two

paired groups Paired t-test Wilcoxon test

Compare 3 or ANOVA

more groups

Error

ANOVA partitions total data variability

Before partitioning After partitioning

Subject

disease

vs normal

disease

vs normal

Error

Error

Tissue type

variation between DS and normalF ratio =

variation due to error

Descriptive statistics

Microarray data are highly dimensional: there are

many thousands of measurements made from a small

number of samples.

Descriptive (exploratory) statistics help you to find

meaningful patterns in the data.meaningful patterns in the data.

A first step is to arrange the data in a matrix.

Next, use a distance metric to define the relatedness

of the different data points. Two commonly used

distance metrics are:

-- Euclidean distance

-- Pearson coefficient of correlation

What is a cluster?

A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

Data matrix(20 genes and

3 time pointsgenes

samples (time points)

3 time pointsfrom Chu et al., 1998)

Software: S-PLUS package

genes

t=2.0

3D plot (using S-PLUS software)

t=0t=0.5

Descriptive statistics: clustering

Clustering algorithms offer useful visual descriptions

of microarray data.

Genes may be clustered, or samples, or both.

We will next describe hierarchical clustering.We will next describe hierarchical clustering.

This may be agglomerative (building up the branches

of a tree, beginning with the two most closely related

objects) or divisive (building the tree by finding the

most dissimilar objects first).

In each case, we end up with a tree having branches

and nodes.

Agglomerative clustering

a

b

c

a,b

43210

c

d

e

Adapted from Kaufman and Rousseeuw (1990)

a

b

c

a,b

43210

Agglomerative clustering

c

d

ed,e

a

b

c

a,b

43210

Agglomerative clustering

c

d

ed,e

c,d,e

a

b

c

a,b

a,b,c,d,e

43210

Agglomerative clustering

c

d

ed,e

c,d,e

…tree is constructed

Divisive clustering

a,b,c,d,e

4 3 2 1 0

Divisive clustering

a,b,c,d,e

c,d,e

4 3 2 1 0

Divisive clustering

a,b,c,d,e

d,e

c,d,e

4 3 2 1 0

Divisive clustering

a,b

a,b,c,d,e

d,e

c,d,e

4 3 2 1 0

Divisive clusteringa

b

c

a,b

a,b,c,d,e

c

d

ed,e

c,d,e

4 3 2 1 0

…tree is constructed

agglomerative

a

b

c

a,b

a,b,c,d,e

43210

divisive

c

d

ed,e

c,d,e

4 3 2 1 0

Adapted from Kaufman and Rousseeuw (1990)

1

12

1

12

Agglomerative and divisive clustering sometimes give conflictingresults, as shown here

Cluster and TreeView

clustering PCASOMK means

Cluster and TreeView

Two-way clusteringof genes (y-axis)

and cell lines(x-axis)(Alizadeh et al.,2000)

An exploratory technique used to reduce thedimensionality of the data set to 2D or 3D

For a matrix of m genes x n samples, create a new

Principal components analysis (PCA)

For a matrix of m genes x n samples, create a newcovariance matrix of size n x n

Thus transform some large number of variables intoa smaller number of uncorrelated variables calledprincipal components (PCs).

Principal components analysis (PCA): objectives

• to reduce dimensionality

• to determine the linear combination of variables

• to choose the most useful variables (features)• to choose the most useful variables (features)

• to visualize multidimensional data

• to identify groups of objects (e.g. genes/samples)

• to identify outliers

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

Copyright notice

Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomics

by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

Recommended