Artifacts and Effects in Gene Expression Data Carlo Colantuoni April 12, 2006

Experimental Artifacts

~200 microarrays, ~100 samples

Nylon, P33

9600 MGC elements

NIA cDNA Microarray Core Facility

Uncorrected Intensities: MDS Colored by Batch

Removing The Batch Effect

We Will Use These Dimensions for Additional Corrective Transformations

Much Like Red:Green Analysis

Uncorrected Intensities: MDS Colored by Batch

Batch Subtracted Measures: MDS Colored by Batch
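
The slides do not state which correction produced the batch-subtracted measures. A minimal sketch, assuming the simplest version (subtracting each gene's within-batch mean from its log intensities), might look like this; `subtract_batch_means` and the toy values are hypothetical.

```python
from collections import defaultdict

def subtract_batch_means(values, batches):
    """Center one gene's log intensities within each batch.

    values  -- log intensities for one gene, one per array
    batches -- batch label for each array (same length as values)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, b in zip(values, batches):
        sums[b] += v
        counts[b] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [v - means[b] for v, b in zip(values, batches)]

# Toy example: batch "B" reads 3 log units higher than batch "A".
gene = [5.0, 7.0, 8.0, 10.0]
batch = ["A", "A", "B", "B"]
corrected = subtract_batch_means(gene, batch)
print(corrected)  # [-1.0, 1.0, -1.0, 1.0]
```

Note that this kind of centering also removes any biological difference that happens to be confounded with batch.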

MDS of All Array Experiments: Subject Replicates

Hybridization Artifacts

A “Simple” Pilot:

2 subjects in replicate = 4 arrays

Differing amounts of dye

2-color (reference)

~48,000 probes

4 arrays: Raw Log Intensities

4 arrays: Raw Linear Intensities

1 array: Ratio v. Intensity

1 array: Ratio v. Intensity
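
For a 2-color array, a ratio v. intensity plot is commonly built from the log ratio of the two channels and their mean log intensity. The sketch below assumes base-2 logs and background-free positive intensities; the function name and example values are hypothetical and not taken from the slides.

```python
import math

def ratio_and_intensity(red, green):
    """Return (log ratio, mean log intensity) for one probe.

    red, green -- positive linear intensities from the two channels
    """
    log_red = math.log2(red)
    log_green = math.log2(green)
    log_ratio = log_red - log_green
    mean_log_intensity = 0.5 * (log_red + log_green)
    return log_ratio, mean_log_intensity

# A probe that is 4-fold brighter in the red channel.
m, a = ratio_and_intensity(1024.0, 256.0)
print(m, a)  # 2.0 9.0
```

A dye or hybridization artifact typically shows up as a log ratio that drifts with intensity instead of scattering around zero.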

Biological Effects

… or are they?

Big Effects:

Tissue Types and Growth Factor Treatments

Illumina 24K

Smaller Effects:

Correlation of Gene Expression with Biological Indices

pH

PMI

age

Nylon P33, 10K

Illumina custom, 700

More Subtle Effects:

Differential Gene Expression by Genotype

COMT Val158Met SNP Affects Cognition and Risk for Schizophrenia

COMT enzyme activity

Genetics

Cognition & Disease

Risk for Schizophrenia

Working Memory Performance

Patterns of Cortical Activation

Amphetamine & Tolcapone Response

VV, VM, MM

p<0.00002

Over-Expression of HSP70 in VV Homozygotes

Effect of COMT V158M on Gene Expression: VV-VM (Nylon P33, 10K)

Effect of COMT V158M on Gene Expression: MM-VM (Nylon P33, 10K)

Effect of COMT V158M on Gene Expression: VV-MM (Nylon P33, 10K)

[Figure: VV-VM T-stat vs. MM-VM T-stat]
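
The genotype contrasts above are summarized as T statistics, one per gene. The slides do not say which t-test was used; the sketch below assumes Welch's unequal-variance form, and the function name and expression values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means (a minus b)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical log-ratio values for one gene in two genotype groups.
vv = [1.0, 2.0, 3.0]
vm = [0.0, 1.0, 2.0]
t_vv_vm = welch_t(vv, vm)
print(round(t_vv_vm, 4))  # 1.2247
```

Running this for every gene and for two contrasts (for example VV-VM and MM-VM) gives the pairs of T statistics that a scatter plot of one contrast against the other would show.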

Looking Across Multiple Effects: Age and Genotype

N=15 genes across 80 subjects

p<7.34e-13

Alternative Approaches

COMT Activity as a Function of COMT Genotype

Correlation of COMT Activity with Expression

[Figure: Distribution of Observed (black) and Permuted (blue) Correlations (r); x-axis: Correlation (r), y-axis: Density; N=64]

r=0.45, p<0.000089
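
A permuted distribution of correlations can be generated by shuffling one variable relative to the other and recomputing r each time. The following is a minimal single-gene sketch under that assumption; the slides pool correlations across many genes, and the function names, permutation count, and data here are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p(x, y, n_perm=2000, seed=0):
    """Observed r and a two-sided permutation p-value."""
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson_r(x, shuffled)) >= abs(observed):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

activity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
expression = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.2]
r, p = permutation_p(activity, expression)
```

Because the null distribution comes from the data themselves, this approach avoids assuming normality of the expression values.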

Acknowledgements

Clinical Brain Disorders Branch, NIMH, NIH: Daniel Weinberger

Section on Neuropathology: Joel Kleinman, Thomas Hyde

Tissue Resources: Mary Herman, Amy Deep-Soboslay, Colleen Lynch

Genotyping: Richard Straub, Bhaskar Kolachana

COMT Activity: Jingshan Chen, Samer Helem

RNA Resources: Johanna Creswell, Claudia Aguirre, Robert Fatula

Jeet Bahra, Isha Khan

Debora Rothmond, Barbara Lipska

Nick Be, Mariam Khan

National Institute on Drug Abuse, NIH, DHHS: William Freed, Elin Lehrmann

National Institute on Aging, NIH, DHHS: Kevin Becker, William Wood, Diane Teichberg

Johns Hopkins School of Public Health, Department of Biostatistics: Scott Zeger, Zhianqan Tan, Rafael Irizarry, Giovanni Parmigiani, Elizabeth Johnson

NHGRI Microarray Facility: Abdel Elkahloun

Iddil Berkov, CBDB

Beyond Individual Genes: Functional Gene Groups

• Borrow statistical power across entire dataset

• Beyond threshold enrichment

• Systematic patterns throughout the dataset

Correlation of Age with Gene Expression

[Figure: Distribution of Observed (black) and Permuted (red+blue) Correlations (r); x-axis: Correlation (r), y-axis: Density]

Over-Expression of HSP70’s in VV Homozygotes

p<7.42e-08

T statistic

3 Statistical Tests:

χ²

Kolmogorov-Smirnov

“Information”

Is THIS …

… Different from THIS?

[Figure: histograms of All Genes and Subset of Interest, with expected (E) and observed (O) counts marked in each histogram bin]

$\chi^2$ is the sum of D values over the histogram bins, where:

$D = \frac{(O - E)^2}{E}$
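
The per-bin D values and their sum can be sketched as follows. The slides do not spell out how E is obtained; this sketch assumes the expected counts come from scaling the all-genes histogram down to the size of the subset of interest. The function name and counts are hypothetical.

```python
def chi_square_from_bins(observed, expected):
    """Return per-bin D = (O - E)^2 / E and their sum (chi-square)."""
    d_values = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return d_values, sum(d_values)

# Hypothetical histogram counts over the same five bins.
all_genes = [100, 200, 400, 200, 100]  # all genes on the array
subset = [2, 4, 6, 5, 3]               # gene subset of interest

# Assumption: E is the all-genes count rescaled to the subset size.
expected = [c * sum(subset) / sum(all_genes) for c in all_genes]
d_values, chi2 = chi_square_from_bins(subset, expected)
print(d_values, chi2)  # [0.0, 0.0, 0.5, 0.25, 0.5] 1.25
```

With small subsets many bins have low expected counts, so a permutation-based reference distribution may be safer than the textbook chi-square table.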

Kolmogorov-Smirnov

[Figure: All Genes vs. Subset of Interest]
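
The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distributions of the two groups. A minimal sketch of that statistic (without the p-value) is given below; the function name and sample values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The gap can only change at observed data points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical expression scores: the subset is shifted to the right.
all_genes = [-2.0, -1.0, 0.0, 1.0, 2.0]
subset = [1.0, 2.0, 3.0]
ks = ks_statistic(all_genes, subset)
print(round(ks, 3))  # 0.6
```

Unlike the histogram-based test, this statistic needs no choice of bins.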

Product of Individual Probabilities

[Figure: histogram bins with E and O marked]

$\chi^2$ is the sum of D values, where:

$D = \frac{(O - E)^2}{E}$

$D_{PCA} = \frac{O - E}{E^{0.5}}$

[Figure: PCA plot, Dimension #1 vs. Dimension #2]

[Figure: gene groups labeled by number and colored by p value (scale 0.0 to >0.1); for each group, Proportion of Genes vs. Log10 Ratio Z-Score compared with All Genes]

Pent. Phos. (#30): N = 20, p<0.001
Fruc. Mann. (#51): N = 25, p<0.032
Lipo-Polysacch. (#540): N = 3, p<0.079
Aln. Asp. (#252): N = 24, p<0.079
Sphingo-Glycolip. (#600): N = 94, p<0.097
Lys. Biosyn. (#300): N = 4, p<0.107
IP3 (#562): N = 96, p<0.110
C by folate (#670): N = 7, p<0.133
Pyrimid. Metabo. (#240): N = 44, p<0.996
Ribo-flavin (#740): N = 17, p<0.999

N = 89 Gene Subsets

All Genes

The distribution of gene expression values for each gene group is passed to PCA as D^0.5 values and then plotted as a single point in low-dimensional space.

Distance from center indicates deviation from the distribution of all gene expression values in the microarray experiment.

Proximity indicates similarity in the shape of distributions.
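
One way to see why distance reflects deviation: if each gene group is represented by its signed per-bin values (O - E) / E^0.5, the squared length of that vector equals the group's χ² value. The sketch below only checks this identity on hypothetical counts and does not perform the PCA itself; the function name is an assumption.

```python
import math

def d_pca_vector(observed, expected):
    """Signed per-bin values (O - E) / sqrt(E) for one gene group."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Hypothetical counts for one gene group over five histogram bins.
expected = [2.0, 4.0, 8.0, 4.0, 2.0]
observed = [2, 4, 6, 5, 3]

vector = d_pca_vector(observed, expected)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
squared_length = sum(v * v for v in vector)
print(round(chi2, 6), round(squared_length, 6))  # 1.25 1.25
```

Keeping the sign of O - E means that two groups shifted in opposite directions land far apart, while groups with similarly shaped deviations land close together.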

$D = \frac{(O - E)^2}{E}$

$D_{PCA} = \frac{O - E}{E^{0.5}}$

Analysis of Gene Networks

No Effect of Other COMT SNPs: P3224

[Figure: Observed vs. Permuted T statistic; 1/1-1/2, N=21 and N=30]

Distribution of p-values

[Figure: Distribution of p-values from Observed (black) and Permuted data; x-axis: p-value, y-axis: Density; N=90]

Correlation of Age with Gene Expression

[Figure: Distribution of Observed (black) and Permuted (red+blue) Correlations (r); x-axis: Correlation (r), y-axis: Density; N=90]

Correlation of Age with Gene Expression

[Figure: left tail (r from -0.45 to -0.30) of the Distribution of Observed (black) and Permuted (red+blue) Correlations (r); x-axis: Correlation (r), y-axis: Density]

$\mathrm{FDR} = \frac{\text{False Pos.}}{\text{Total Pos.}} = \frac{\text{Permuted}}{\text{Observed}}$
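
The ratio of permuted to observed counts beyond a threshold gives a simple FDR estimate. A minimal sketch for the negative tail follows, assuming the permuted correlations are pooled over a known number of permutations and averaged; the function name, threshold, and values are hypothetical.

```python
def permutation_fdr(observed, permuted, threshold, n_permutations):
    """Estimate FDR for calling genes with r <= threshold.

    observed       -- one correlation per gene from the real data
    permuted       -- correlations pooled over all permutations
    n_permutations -- number of permutations pooled in `permuted`
    """
    total_pos = sum(1 for r in observed if r <= threshold)
    if total_pos == 0:
        return None  # nothing called, FDR undefined
    false_pos = sum(1 for r in permuted if r <= threshold) / n_permutations
    return false_pos / total_pos

# Hypothetical correlations for 8 genes and 2 pooled permutations.
observed_r = [-0.45, -0.41, -0.38, -0.35, -0.10, 0.05, 0.20, 0.30]
permuted_r = [-0.36, -0.20, -0.10, 0.00, 0.10, 0.20, -0.05, 0.15,
              -0.31, -0.12, 0.02, 0.07, 0.11, -0.02, 0.25, -0.22]
fdr = permutation_fdr(observed_r, permuted_r, threshold=-0.35,
                      n_permutations=2)
print(fdr)  # 0.125
```

Moving the threshold further into the tail lowers the estimated FDR at the cost of calling fewer genes.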

Correlation of GFAP Expression with Age

r=0.47

p<0.000002

[Figure: Expression: Log(Ratio), in SD Units from Mean, vs. Age (yr)]

(p<0.02)

2 arrays (4 channels)