27
Analysis of Microarray Data By H. Bjørn Nielsen

Analysis of Microarray Data - DTU Bioinformatics · Buy Chip/Array Statistical Analysis Fit to Model ... Image analysis The DNA Array Analysis Pipeline. ... For gene expression analysis

Embed Size (px)

Citation preview

Analysis ofMicroarray Data

ByH. Bjørn Nielsen

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

The DNA Array Analysis Pipeline

What's the question?

Typically we want to identify differentially expressed genes

without alcohol with alcohol

alcohol dehydrogenase

Example:alcohol dehydrogenase is expressed at a higher levelwhen alcohol is added to the media

Intensities are not just mRNAconcentrations

Sample preparation

• RNA degradation• RNA purification• Reverse transcription• Amplification efficiency• Dye effect (cy3/cy5)

Array / hybridization

• Spotting• DNA-support binding• Spatial effects (washing etc.)• Image segmentation

Gene-specific variationGlobal variation

• Spotting efficiency,– Spot size– Spot shape

• Cross-/unspecifichybridization

• Biological variation– Effect– Noise

•Amount of RNA in the biopsy

•Efficiencies of:– RNA extraction– Reverse transcription– amplification– Labeling– Photodetection

Systematic

Two kinds of variation

Stochastic

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

The DNA Array Analysis Pipeline

Calibration = Normalization =Scaling

Nonlinear normalization

Lowess Normalization

• One of the most commonly utilized normalizationtechniques is the LOcally Weighted ScatterplotSmoothing (LOWESS) algorithm.

M

A

* * * *** *

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

The DNA Array Analysis Pipeline

Expression index value• Some microarrays have multiple probes

addressing the expression of the same gene

– Affymetrix chips have 11-20 probe pairs pr. Gene

- Perfect Match (PM)- MisMatch (MM)

PM: CGATCAATTGCACTATGTCATTTCTMM: CGATCAATTGCAGTATGTCATTTCT

Why use gene expression index?

For most downstream analysis only onevalue pr. gene are used.

Therefore we collapse the intensities frommany probes into one value:a gene expression index value

Expression index calculation• Simplest method?

Median

But more sophisticated methods exists:dChip, RMA and MAS 5 (from Affymetrix)

All of this is implemented in…

• R

• In the BioConductor packages ‘affy’

(Gautier et al., 2003).

StochasticSystematic

• Too random to be explicitly accounted for

• “noise”

• Similar effect on manymeasurements

• Corrections can beestimated from data

Normalization Statistical testing

Sources of variation

Global variation Gene-specific variation

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

The DNA Array Analysis Pipeline

He’s going to say it

However, the measurements containstochastic noise

There is no way around it

Statistics

Noisymeasurements p-value

statistics

You can choose to think ofstatistics as a black box

But, you still need to understand how tointerpret the results

P-valueThe chance of rejecting the nullhypothesis by coincidence----------------------------For gene expression analysis wecan say:the chance that a gene iscategorized as differentiallyexpressed by coincidence

The output of the statistics

The statistics gives us a p-valuefor each gene

We can rank the genes according to the p-value

But, we can’t really trust the p-value in a strictstatistical way!

Why not!

For two reasons:

1. We are rarely fulfilling all theassumptions of the statistical test

2. We have to take multi-testing intoaccount

The t-test Assumptions

1. The observations in the two categoriesmust be independent

2. The observations should benormal distributed

3. The sample size must be ‘large’(>30 replicates)

log2 fold change (M)

P-value

Volcano Plot

What's inside the black box ‘statistics’

t-test or ANOVA

The t-test

Calculate T

Lookup T in a table

The t-test tests for difference in means (µ)

Intensityof gene x

Density µwt wt µmut mutant

The t-test II

Conclusion

• Array data contains stochastic noise– Therefore statistics is needed to conclude on differential

expression• We can’t really trust the p-value• But the statistics can rank genes• The capacity/needs of downstream processes can be

used to set cutoff• FDR can be estimated• t-test is used for two category tests• ANOVA is used for multiple categories