Test of significance for small samples Javier Cabrera


Test of significance for small samples

Javier Cabrera, Director, Biostatistics Institute, Rutgers University

Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development


Outline

• Microarray Experiments and Differential expression
• Small sample size issues
• Conditional t approach
• Comparison with other methods
• Extensions

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004.

Software: DNAMR and DNAMRweb

http://www.rci.rutgers.edu/~cabrera/DNAMR


A gene is expressed via the process:

DNA → (transcription) → mRNA → (translation) → protein

DNA itself is copied by replication.

The central dogma of molecular biology

Genes: A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.

AKAP6: CATCATGCAGCAGGTCAAACAAGGCATCTCCTAGTATTGCATCCTACA……

Microarray experiment

• Microarray: cDNA or oligonucleotide preparation; print or synthesize onto a glass slide. 5k-50k genes arrayed in rectangular grid; one spot per gene.
• Sample: biological sample → mRNA → reverse transcribe and label.
• Sample + Microarray: hybridize, wash and scan → image → quantify spot intensities → gene expression data.



Differential gene expression

An organism’s genome is the complete set of genes in each of its cells. Given an organism, every one of its cells has a copy of the exact same genome, but

different cells express different genes

different genes express under different conditions

differential gene expression leads to altered cell states

Differential Expression for small samples

        C1      C2      C3      T1      T2      T3
G1     4.67    4.44    4.42    4.73    4.85    4.69
G2     3.13    2.54    1.96    0.97    2.38    3.36
G3     6.22    6.77    5.32    6.40    6.94    6.87
G4    10.74   10.81   10.69   10.75   10.68   10.68
G5     3.76    4.16    5.27    3.05    3.20    2.85
G6     6.95    6.78    6.33    6.81    6.95    7.01
G7     4.98    4.61    4.56    4.57    4.90    4.44
G8     2.72    3.30    3.24    3.22    3.42    3.22
G9     5.29    4.79    5.13    3.31    4.67    5.27
G10    5.12    4.85    3.79    4.13    3.12    4.79
G11    4.67    3.50    4.77    4.09    3.86    2.88
G12    6.22    6.42    5.02    6.38    6.54    6.80
G13    2.88    3.76    2.78    2.98    4.81    4.15
...

1. Preprocessed data.
2. Perform a t-test for each gene.
3. Select the most significant subset.

The pooled variances T-test

The t test statistic for testing for a mean effect is:

$$T_g = \frac{\bar X_{g2} - \bar X_{g1}}{s_g \,(1/n_1 + 1/n_2)^{1/2}}$$

where $s_g$, the pooled standard error, is the positive square root of:

$$s_g^2 = \frac{(n_1 - 1)\, s_{g1}^2 + (n_2 - 1)\, s_{g2}^2}{n_1 + n_2 - 2}$$

If there is no mean effect,

$$T_g \sim t_{n_1 + n_2 - 2}$$

(Student / Fisher)
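As a minimal sketch of the pooled-variance statistic, the following computes T_g and s_g row by row with NumPy for the first three genes of the example data (three controls, three treated). The function name `pooled_t` is an illustrative choice, not part of DNAMR.

```python
import numpy as np

def pooled_t(x1, x2):
    """Pooled-variance t statistic and pooled SD for each row (gene)."""
    n1, n2 = x1.shape[1], x2.shape[1]
    v1 = x1.var(axis=1, ddof=1)  # per-gene sample variance, group 1
    v2 = x2.var(axis=1, ddof=1)  # per-gene sample variance, group 2
    s = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (x2.mean(axis=1) - x1.mean(axis=1)) / (s * np.sqrt(1 / n1 + 1 / n2))
    return t, s

# First three genes of the example data (columns C1-C3, then T1-T3).
data = np.array([
    [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],  # G1
    [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],  # G2
    [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],  # G3
])
t, s = pooled_t(data[:, :3], data[:, 3:])
for name, tg, sg in zip(["G1", "G2", "G3"], t, s):
    print(f"{name}: t = {tg:.3f}, s = {sg:.3f}")
```

With 3 + 3 arrays each statistic would be referred to a t distribution with 4 degrees of freedom.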

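Plot t vs sp; Distribution of sp

[Figure: t plotted against sp, with the distribution of sp, for the data and for random data. Counts shown on the plot: 300 and 21983.]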

Differentially expressed genes have smaller sp.

Is this effect Statistical or Biological?

500 simulations: 1000 genes, 4 controls + 4 treated, iid Normal(0, $\sigma^2$)

100 genes are differentially expressed with mean diff = +1 or -1

$\sigma^2 = 1$ CONSTANT
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29

$\sigma^2$ from Chi-square(df=3)
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
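A simulation of this kind can be sketched as follows. This is a single run, not the 500 replicates behind the table, and several details are assumptions not stated on the slide: each gene is tested two-sided at the 0.05 level, and the z-test uses one variance shared by all genes (the average pooled variance). Counts will therefore not match the table exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n, n_de = 1000, 4, 100
T_CRIT = 2.447  # 0.975 quantile of t with 6 df
Z_CRIT = 1.960  # 0.975 quantile of N(0, 1)

def simulate(sigma2):
    """One simulated experiment; returns {test: (false, true) discoveries}."""
    sd = np.sqrt(sigma2)[:, None]
    ctrl = rng.normal(size=(G, n)) * sd
    trt = rng.normal(size=(G, n)) * sd
    shift = rng.choice([-1.0, 1.0], size=n_de)
    trt[:n_de] += shift[:, None]  # the first 100 genes are differentially expressed
    diff = trt.mean(axis=1) - ctrl.mean(axis=1)
    s2 = (ctrl.var(axis=1, ddof=1) + trt.var(axis=1, ddof=1)) / 2
    t = diff / np.sqrt(s2 * 2 / n)
    z = diff / np.sqrt(s2.mean() * 2 / n)  # assumption: common variance for all genes
    is_de = np.arange(G) < n_de
    out = {}
    for name, stat, crit in [("t-test", t, T_CRIT), ("z-test", z, Z_CRIT)]:
        sel = np.abs(stat) > crit
        out[name] = (int((sel & ~is_de).sum()), int((sel & is_de).sum()))
    return out

const = simulate(np.ones(G))                  # sigma^2 = 1 for every gene
varying = simulate(rng.chisquare(3, size=G))  # sigma^2 drawn from chi-square(3)
print("constant variance:", const)
print("chi-square(3) variance:", varying)
```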

The effect of small sample size

Often the sample size per group is small. This leads to:

• unreliable variances (inferences)
• dependence between the test statistics (tg) and the standard error estimates (sg)

Possible remedies:

• borrow strength across genes (LPE/EB)
• regularize the test statistics (SAM)
• work with tg|sg (Conditional t).

Analysis results

Top 10 genes (sorted by t-test p-value)

Gene      Fold   Dir   p          p(Bonf)
G6546     2.36   D     0.000004   0.0964
G19945    3.25   U     0.000005   0.1102
G21586    1.64   U     0.000008   0.1765
G18970    2.52   U     0.000019   0.4220
G7432     3.70   D     0.000033   0.7248
G19057    1.85   U     0.000046   1.0000
G17361    4.34   D     0.000067   1.0000
G8525     5.57   D     0.000067   1.0000
G425     18.11   D     0.000078   1.0000
G8524     4.74   D     0.000109   1.0000
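The p(Bonf) column appears consistent with the usual Bonferroni adjustment, min(1, p × G), where G is the number of genes tested. A minimal sketch follows; the value G = 22283 is an assumption (it is the sum of the two counts shown on the t vs sp plot), and because the tabulated p-values are rounded, the adjusted values will not reproduce the table exactly.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

G = 22283  # hypothetical number of genes tested (assumption, see lead-in)
p = [0.000004, 0.000046, 0.000109]
adj = bonferroni(p, G)
for raw, a in zip(p, adj):
    print(f"p = {raw:.6f} -> p(Bonf) = {a:.4f}")
```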

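SAM: Determining c

$$T_g(\alpha) = \frac{r_g}{s_g + s_\alpha}$$

where $r_g$ is the numerator of the t statistic (the difference in group means) and $s_\alpha$ is the $\alpha$ percentile of the $s_g$.

Group the genes by $s_g$. Within each group $j = 1, \dots, 7$: $v_j(\alpha) = \mathrm{mad}\{T_g(\alpha)\}$.

For each $\alpha$, compute $cv(\alpha)$, the coefficient of variation of $v_1(\alpha), \dots, v_7(\alpha)$:

$cv(\alpha_1) \leftrightarrow s_{\alpha_1}$, $cv(\alpha_2) \leftrightarrow s_{\alpha_2}$, …, $cv(\alpha_7) \leftrightarrow s_{\alpha_7}$

Min: choose $c = s_{\hat\alpha}$, where $\hat\alpha$ minimizes $cv(\alpha)$.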
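The choice of c can be sketched as below. This is an illustration of the idea on simulated inputs, not the SAM software: the grid of candidate percentiles, the use of 7 equal-count groups, and the names `choose_c`, `r` and `s` are assumptions made here.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median."""
    return np.median(np.abs(x - np.median(x)))

def choose_c(r, s, percentiles=np.arange(0, 101, 5), n_groups=7):
    """Pick c among percentiles of s to minimize the CV of the group-wise MADs."""
    # Assign each gene to one of n_groups groups by quantiles of s.
    edges = np.quantile(s, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, n_groups - 1)
    best_c, best_cv = None, np.inf
    for c in np.percentile(s, percentiles):
        d = r / (s + c)  # regularized statistic for this candidate c
        v = np.array([mad(d[group == j]) for j in range(n_groups)])
        cv = v.std(ddof=1) / v.mean()
        if cv < best_cv:
            best_c, best_cv = c, cv
    return best_c, best_cv

# Hypothetical inputs: mean differences r and pooled SDs s for 2000 null genes.
rng = np.random.default_rng(0)
sigma2 = rng.chisquare(3, size=2000)
s = np.sqrt(sigma2 * rng.chisquare(6, size=2000) / 6)
r = rng.normal(size=2000) * np.sqrt(sigma2 / 2)
c, cv = choose_c(r, s)
print(f"chosen c = {c:.3f}, cv = {cv:.3f}")
```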

SAM: Gene selection

$\bar T_{(g)}(\hat c)$ = expected value of $T_{(g)}(\hat c)$ under permutations.

[Figure: the ordered statistics $T_{(g)}(\hat c)$ (axis d) plotted against $\bar T_{(g)}(\hat c)$ (axis db); and P(|SAM| > t) plotted against Pooled Sd.]

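Conditional t: Basic Model

Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.

Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$

Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$

Error model: $\varepsilon_{gij} \sim F_\varepsilon$ (location = 0, scale = 1)

Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$

with marginals: $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$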

Possible approaches

Parametric: Assume functional forms for $F_{\mu,\sigma}$ and $F_\varepsilon$ and apply either a Bayes or Empirical Bayes procedure.

Nonparametric:
1. Estimate $F_{\mu,\sigma}$: edf, $\hat F_{\mu,\sigma}$, of $\{(\bar X_{g1}, s_g^2)\}$.
   Estimate $F_\varepsilon$: edf, $\hat F_\varepsilon$, of $\{(X_{gij} - \bar X_{gj})/s_g\}$.
   For small samples $\hat F_\sigma$ is not a good estimator of $F_\sigma$. Use method of moments = Target estimation.
2. Proceed via resampling and estimate the distribution: $t \mid s_p$ (Conditional t).

Procedure

(1) Draw a gene, g, at random from {1, …, G}. Call it g*. $(\bar X_{g^*1}, s_{g^*}^2) \sim \hat F_{\mu,\sigma}$.

(2) Take a random sample (with replacement) of size $n_1 + n_2$ from $\hat F_\varepsilon$: $r^*_{ij} \sim \hat F_\varepsilon$.

(3) Combine these to form pseudo-data: $X^*_{ij} = \bar X_{g^*1} + s_{g^*}\, r^*_{ij}$.

(4) Calculate the pooled standard error s* and t test statistic t* for the pseudo-data $\{X^*_{ij}\}$.

(5) Repeat steps (1)-(4) a large number (10,000) of times.

(6) Given $\alpha$, estimate the "critical envelope", $t_\alpha(s_g)$, as the $(\alpha/2)$ and $(1-\alpha/2)$ quantile curves in the $t_g$ vs $s_g$ relationship.

(7) Genes that fall outside the critical envelope defined by $t_\alpha(s_g)$ are deemed significant at level $\alpha$. (Overall unconditional Type I error rate = $\alpha$)
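Steps (1)-(7) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the DNAMR implementation: the quantile curves are approximated by binning the resampled s* into equal-count bins and taking quantiles of t* within each bin (the slides do not say how the curves are smoothed), and the names `conditional_t`, `n_bins` and `seed` are hypothetical.

```python
import numpy as np

def pooled_t(x1, x2):
    """Row-wise pooled-variance t statistic and pooled SD."""
    n1, n2 = x1.shape[1], x2.shape[1]
    v = ((n1 - 1) * x1.var(axis=1, ddof=1)
         + (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(v)
    t = (x2.mean(axis=1) - x1.mean(axis=1)) / (s * np.sqrt(1 / n1 + 1 / n2))
    return t, s

def conditional_t(x1, x2, alpha=0.05, n_boot=10_000, n_bins=10, seed=0):
    rng = np.random.default_rng(seed)
    G, n1 = x1.shape
    n2 = x2.shape[1]
    t, s = pooled_t(x1, x2)
    m1, m2 = x1.mean(axis=1), x2.mean(axis=1)
    # edf of the standardized residuals (X_gij - mean_gj) / s_g, pooled over genes
    res = (np.hstack([x1 - m1[:, None], x2 - m2[:, None]]) / s[:, None]).ravel()

    g_star = rng.integers(0, G, size=n_boot)                    # step (1)
    r_star = rng.choice(res, size=(n_boot, n1 + n2))            # step (2)
    x_star = m1[g_star][:, None] + s[g_star][:, None] * r_star  # step (3)
    t_star, s_star = pooled_t(x_star[:, :n1], x_star[:, n1:])   # steps (4)-(5)

    # Step (6): binned approximation to the quantile curves (assumption).
    edges = np.quantile(s_star, np.linspace(0, 1, n_bins + 1))

    def bin_of(v):
        return np.clip(np.searchsorted(edges, v, side="right") - 1, 0, n_bins - 1)

    b_star = bin_of(s_star)
    lo = np.array([np.quantile(t_star[b_star == b], alpha / 2)
                   for b in range(n_bins)])
    hi = np.array([np.quantile(t_star[b_star == b], 1 - alpha / 2)
                   for b in range(n_bins)])

    b = bin_of(s)
    significant = (t < lo[b]) | (t > hi[b])                     # step (7)
    return significant, lo, hi

# Hypothetical null data: 500 genes, 4 controls and 4 treated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=(500, 4))
x2 = rng.normal(size=(500, 4))
significant, lo, hi = conditional_t(x1, x2)
print("genes outside the envelope:", int(significant.sum()))
```

On null data like this, the fraction of genes outside the envelope should be roughly alpha. The sketch uses the raw edf of the variances and so omits any correction of the variance distribution.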

Roadblock

$\hat F_\sigma(t)$ is not a good estimator of $F_\sigma(t)$

Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$

and let the variance obtained from the $\{X_{ij}\}$ be $s^2$

Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$

For example, if we assume that $F_\sigma = \chi^2_3$, n = 4 and

$\varepsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.

Fix by target estimation: Method of moments.

Shrink towards the center
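The inequality follows from the law of total variance. As a short worked step, assuming normal errors and an unbiased variance estimate $s^2$ on $\nu$ degrees of freedom (so that $s^2 \mid \sigma^2 \sim \sigma^2 \chi^2_\nu / \nu$; the symbol $\nu$ is introduced here for illustration):

```latex
\begin{aligned}
\mathrm{Var}(s^2)
  &= \mathrm{Var}\bigl(E[s^2 \mid \sigma^2]\bigr)
   + E\bigl[\mathrm{Var}(s^2 \mid \sigma^2)\bigr] \\
  &= \mathrm{Var}(\sigma^2) + \frac{2}{\nu}\, E[\sigma^4]
  \;>\; \mathrm{Var}(\sigma^2).
\end{aligned}
```

The extra term shrinks as $\nu$ grows, which is why the edf of the $s_g^2$ is too spread out mainly when the samples are small.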

Example: Checking for the distribution of $\sigma_g$ (Mice Data, E7)

Compare the distr. of $s_g$ vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_2$, 3. $\sigma^2 \sim \chi^2_6$

[Figure: histograms of Sp (Frequency) for the E7 data and for simulated Cases 1-3 (Df = 0.5, 2, 6); and plots of E7 against the simulated S1, S2, S3.]

Another Example (Tox Data)

Compare the distr. of $s_g$ vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_3$, 3. $\sigma^2 \sim \chi^2_6$

[Figure: histograms of Sp (Frequency) for the Tox data and for simulated Cases 1-3 (Df = 0.5, 3, 6); plots of Tox against the simulated S1, S2, S3; and a plot of mean diff vs Sp.]

Fixing the variance distribution

The idea is to estimate the function $h: [0,1] \to [0,1]$ defined by $h(F_\sigma(x)) = \hat F_\sigma(x)$. Since h is strictly monotonic, it can be inverted in order to obtain an estimate of $F_\sigma(x)$. Procedure:

(1) Assume that $\hat F_\sigma(x)$ is the true distribution of $\sigma^2$ and draw a random sample, $s^{*2}$, from $\hat F_\sigma$.

(2) Take a random sample (with replacement) of size N from $\hat F_\varepsilon$: $r^*_{ij} \sim \hat F_\varepsilon$ for $i = 1, \dots, n_j$, $j = 1, 2$.

(3) Combine these to form pseudo-data: $X^*_{ij} = s^* r^*_{ij}$

(4) Calculate the pooled standard error $s^{**}$ for the pseudo-data $\{X^*_{ij}\}$.

(5) Repeat steps (1)-(4) a large number (say 100,000) of times and record, for each iteration, the pair of values $\{(s^{*2}, s^{**2})\}$.

(6) Let $\hat F^*_\sigma(x)$ be the empirical distribution of the $s^{**2}$'s. Then the estimator of h is obtained by mapping the empirical distribution $\hat F_\sigma$ into $\hat F^*_\sigma$. More precisely

$$\hat h(y) = \hat F^*_\sigma\bigl(\hat F_\sigma^{-1}(y)\bigr) \quad\text{and}\quad \hat h^{-1}(y) = \hat F_\sigma\bigl(\hat F_\sigma^{*-1}(y)\bigr).$$

Hence the bias-corrected estimator of $F_\sigma$ is:

$$\tilde F_\sigma(x) = \hat F_\sigma\bigl(\hat F_\sigma^{*-1}(\hat F_\sigma(x))\bigr).$$

Proceed as before …
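The correction above can be sketched with empirical distribution and quantile functions. This is a hedged illustration, not the authors' code: the helper names are hypothetical, N is taken to be n1 + n2, and the residuals are rescaled to unit variance before resampling, which is an assumption not stated on the slides.

```python
import numpy as np

def edf(sample, x):
    """Empirical distribution function of `sample`, evaluated at x."""
    srt = np.sort(sample)
    return np.searchsorted(srt, x, side="right") / srt.size

def corrected_cdf(s2, res, n1, n2, n_iter=100_000, seed=0):
    """Return a function evaluating F_tilde(x) = F_hat(F_hat_star^{-1}(F_hat(x)))."""
    rng = np.random.default_rng(seed)
    s2_star = rng.choice(s2, size=n_iter)             # step (1)
    r_star = rng.choice(res, size=(n_iter, n1 + n2))  # step (2)
    x_star = np.sqrt(s2_star)[:, None] * r_star       # step (3)
    v1 = x_star[:, :n1].var(axis=1, ddof=1)
    v2 = x_star[:, n1:].var(axis=1, ddof=1)
    s2_2star = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # steps (4)-(5)

    def f_tilde(x):
        u = edf(s2, x)                # F_hat(x)
        q = np.quantile(s2_2star, u)  # F_hat_star^{-1}(u)
        return edf(s2, q)             # F_hat(q)

    return f_tilde

# Hypothetical data: 2000 genes, 4 + 4 arrays, sigma^2 drawn from chi-square(3).
rng = np.random.default_rng(0)
G, n1, n2 = 2000, 4, 4
sigma2 = rng.chisquare(3, size=G)
x = rng.normal(size=(G, n1 + n2)) * np.sqrt(sigma2)[:, None]
x1, x2 = x[:, :n1], x[:, n1:]
s2 = (x1.var(axis=1, ddof=1) + x2.var(axis=1, ddof=1)) / 2
res = np.hstack([x1 - x1.mean(axis=1, keepdims=True),
                 x2 - x2.mean(axis=1, keepdims=True)]) / np.sqrt(s2)[:, None]
res = res.ravel() / res.std()  # assumption: rescale residuals to unit variance

f_tilde = corrected_cdf(s2, res, n1, n2)
grid = np.quantile(s2, [0.1, 0.5, 0.9])
vals = f_tilde(grid)
print(vals)
```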

Plot t vs sp

[Figure: t plotted against sp. Counts shown on the plot: 191 and 22092.]

Differentially expressed genes may have large sp

500 simulations: 1000 genes, 4 controls + 4 treated, iid Normal(0, $\sigma^2$)

100 genes are differentially expressed with mean diff = +1 or -1

$\sigma^2 = 1$ CONSTANT
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29
C-t              45                  30

$\sigma^2$ from Chi-square(df=3)
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
C-t              42                  38

Using 8 iid samples from the Khan data, we make changes to 50 genes to make them differentially expressed for high level.

[Figure: comparison of the T-test, SAM and Ct.]

Generating p-values

To generate p-values, recall that the Ct procedure generates curves, $c_\alpha(s)$. Start with a set of curves, $c_{\alpha_1}(s_g), \dots, c_{\alpha_k}(s_g)$, for a set of prespecified values, $\alpha_1 > \dots > \alpha_k$.

Now consider the relationship between $v_i = \log(-\log(\alpha_i))$ and $u_i = \log(c_{\alpha_i}(s_g))$.

To assign an approximate p-value to the gth gene, if $|t_g| \le c_{\alpha_k}(s_g)$, interpolate the relationship between the $\{u_i\}$ and the $\{v_i\}$.
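The interpolation step can be sketched as below. The critical values used here are standard normal two-sided quantiles standing in for the curves evaluated at one gene's sg; they are placeholders, not output of the Ct procedure. Linear interpolation via `np.interp` is also an assumption, since the slide does not specify the interpolation rule.

```python
import numpy as np

def approx_p_value(t_abs, alphas, crit):
    """Interpolate v = log(-log(alpha)) against u = log(critical value) at log|t|."""
    alphas = np.asarray(alphas, dtype=float)
    crit = np.asarray(crit, dtype=float)
    v = np.log(-np.log(alphas))
    u = np.log(crit)
    order = np.argsort(u)  # np.interp needs increasing x values
    # Outside the range of the curves np.interp clamps to the end values.
    v_t = np.interp(np.log(t_abs), u[order], v[order])
    return float(np.exp(-np.exp(v_t)))

alphas = [0.10, 0.05, 0.01, 0.001]
crit = [1.645, 1.960, 2.576, 3.291]  # placeholder curves evaluated at one s_g
p = approx_p_value(2.2, alphas, crit)
print(f"approximate p-value for |t| = 2.2: {p:.4f}")
```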

Extensions

F test:
- Condition on the sqrt(MSE)

Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE)

Gene Ontology:
- Test for the significance of groups.
- Use Hypergeometric Statistic, mean t, mean p-value, or other.
- Condition on log of the number of genes per group
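For the hypergeometric statistic mentioned above, a minimal sketch of the upper-tail probability is given here. The function name and the small numbers in the usage example are hypothetical, and this shows only the unconditional statistic, without the conditioning on log of the number of genes per group.

```python
from math import comb

def hypergeom_tail(n_genes, n_in_group, n_selected, n_overlap):
    """P(X >= n_overlap), where X is the number of group genes among the selected."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_group, n_selected)
    hits = sum(comb(n_in_group, i) * comb(n_genes - n_in_group, n_selected - i)
               for i in range(n_overlap, upper + 1))
    return hits / total

# Hypothetical example: 20 genes, 5 in the group, 4 selected, 2 of them in the group.
p_group = hypergeom_tail(20, 5, 4, 2)
print(f"P(X >= 2) = {p_group:.4f}")
```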

Conditional F

[Figure: Sqrt(F) plotted against Sqrt(MSE).]

GO Ontology: Conditioning on log(n)

[Figure: |T| plotted against Sd; and Abs(T) plotted against Log(n).]

The Details:

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004.

Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR


Target Estimation:

Cabrera, Fernholz (1999)

- Bias Reduction.

- MSE reduction.

Recent Applications:

- Ellipse Estimation (Multivariate Target).

- Logistic Regression:

• Cabrera, Fernholz, Devas (2003)

• Patel (2003) Target Conditional MLE (TCMLE)

Implementation in StatXact (CYTEL) and logXact Procs in SAS (by CYTEL).

Target Estimation

Target Estimation

$T(x_1, x_2, \dots, x_n)$

$E_\theta(T) = g(\theta)$

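Target Estimation:

Suppose we have an estimator $\hat\theta = T(x_1, \dots, x_n)$ of a parameter $\theta$.

Target estimator $\tilde\theta$: Solve $E_{\tilde\theta}(T) = \hat\theta$.

If $h(\theta) = E_\theta(T)$ then $\tilde\theta = h^{-1}(\hat\theta)$.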

Algorithms: - Stochastic approximation.

- Simulation and iteration.

- Exact algorithm for TCMLE
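The "simulation and iteration" idea can be illustrated on a toy case. Here T is the variance estimate with divisor n, whose expectation is (n-1)θ/n, so the target estimator should come out close to nT/(n-1). The expectation is estimated by Monte Carlo with common random numbers and the equation is solved by bisection. This is a sketch under these assumptions, not the algorithm of the cited papers; the data and function names are hypothetical.

```python
import numpy as np

def target_estimate(t_obs, expected_t, lo, hi, tol=1e-8):
    """Solve expected_t(theta) = t_obs by bisection (expected_t increasing on [lo, hi])."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_t(mid) < t_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n = 4
z = rng.normal(size=(20_000, n))  # common random numbers reused for every theta

def expected_t(theta):
    """Monte Carlo estimate of E_theta(T), T = variance with divisor n."""
    return np.mean(np.var(np.sqrt(theta) * z, axis=1))

x = np.array([1.2, -0.4, 0.9, 2.1])  # hypothetical sample
t_obs = np.var(x)                    # biased estimate (divisor n)
theta_tilde = target_estimate(t_obs, expected_t, 1e-6, 100.0)
print(f"T = {t_obs:.4f}, target estimate = {theta_tilde:.4f}")
```

Reusing the same random numbers for every θ keeps the simulated expectation monotone in θ, so bisection is well defined.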
