
Test of significance for small samples

Javier Cabrera, Director, Biostatistics Institute, Rutgers University

Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development


Outline

• Microarray Experiments and Differential expression

• Small sample size issues

• Conditional t approach

• Comparison with other methods

• Extensions

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004.

Software: DNAMR and DNAMRweb

http://www.rci.rutgers.edu/~cabrera/DNAMR


The central dogma of molecular biology

A gene is expressed via the process:

DNA --(transcription)--> mRNA --(translation)--> protein

DNA is itself copied by replication.

Genes: A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.

AKAP6: CATCATGCAGCAGGTCAAACAAGGCATCTCCTAGTATTGCATCCTACA……


Microarray experiment

Microarray: cDNA or oligonucleotide preparation, then print or synthesize onto a glass slide.

Sample: biological sample, then mRNA, then reverse transcribe and label.

Sample + microarray: hybridize, wash and scan, then image, then quantify spot intensities, giving the gene expression data.

5k-50k genes arrayed in a rectangular grid; one spot per gene.


Differential gene expression

An organism's genome is the complete set of genes in each of its cells. Every cell of a given organism has a copy of the exact same genome, but

• different cells express different genes

• different genes are expressed under different conditions

• differential gene expression leads to altered cell states


Differential Expression for small samples

        C1      C2      C3      T1      T2      T3
G1     4.67    4.44    4.42    4.73    4.85    4.69
G2     3.13    2.54    1.96    0.97    2.38    3.36
G3     6.22    6.77    5.32    6.40    6.94    6.87
G4    10.74   10.81   10.69   10.75   10.68   10.68
G5     3.76    4.16    5.27    3.05    3.20    2.85
G6     6.95    6.78    6.33    6.81    6.95    7.01
G7     4.98    4.61    4.56    4.57    4.90    4.44
G8     2.72    3.30    3.24    3.22    3.42    3.22
G9     5.29    4.79    5.13    3.31    4.67    5.27
G10    5.12    4.85    3.79    4.13    3.12    4.79
G11    4.67    3.50    4.77    4.09    3.86    2.88
G12    6.22    6.42    5.02    6.38    6.54    6.80
G13    2.88    3.76    2.78    2.98    4.81    4.15
.......

1. Preprocess the data.
2. Perform a t-test for each gene.
3. Select the most significant subset.
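The three steps can be sketched on the 13 genes shown in the table. This is only an illustration: it assumes the values are already preprocessed, uses the pooled-variance t statistic, and ranks by |t| instead of by p-value. The names `DATA`, `pooled_t`, `t_stats` and `ranked` are mine, not part of DNAMR.

```python
import math
import statistics

# Rows copied from the table: three controls (C1-C3), then three treatments (T1-T3).
DATA = {
    "G1": [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],
    "G2": [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],
    "G3": [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],
    "G4": [10.74, 10.81, 10.69, 10.75, 10.68, 10.68],
    "G5": [3.76, 4.16, 5.27, 3.05, 3.20, 2.85],
    "G6": [6.95, 6.78, 6.33, 6.81, 6.95, 7.01],
    "G7": [4.98, 4.61, 4.56, 4.57, 4.90, 4.44],
    "G8": [2.72, 3.30, 3.24, 3.22, 3.42, 3.22],
    "G9": [5.29, 4.79, 5.13, 3.31, 4.67, 5.27],
    "G10": [5.12, 4.85, 3.79, 4.13, 3.12, 4.79],
    "G11": [4.67, 3.50, 4.77, 4.09, 3.86, 2.88],
    "G12": [6.22, 6.42, 5.02, 6.38, 6.54, 6.80],
    "G13": [2.88, 3.76, 2.78, 2.98, 4.81, 4.15],
}


def pooled_t(controls, treatments):
    """Two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(controls), len(treatments)
    pooled_var = ((n1 - 1) * statistics.variance(controls)
                  + (n2 - 1) * statistics.variance(treatments)) / (n1 + n2 - 2)
    diff = statistics.mean(treatments) - statistics.mean(controls)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Step 2: one t statistic per gene.
t_stats = {gene: pooled_t(row[:3], row[3:]) for gene, row in DATA.items()}

# Step 3: order genes from most to least extreme statistic.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
for gene in ranked[:3]:
    print(gene, round(t_stats[gene], 2))
```

With only three arrays per group, each gene's variance estimate rests on 4 degrees of freedom.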


The pooled variances T-test (Student / Fisher)

The t test statistic for testing for a mean effect is:

$$T_g = \frac{\bar{X}_{g2} - \bar{X}_{g1}}{s_g \, (1/n_1 + 1/n_2)^{1/2}}$$

where $s_g$, the pooled standard error, is the positive square root of:

$$s_g^2 = \frac{(n_1 - 1)\, s_{g1}^2 + (n_2 - 1)\, s_{g2}^2}{n_1 + n_2 - 2}$$

If there is no mean effect,

$$T_g \sim t_{n_1 + n_2 - 2}$$
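As a worked instance of these formulas, take gene G5 from the small data table shown earlier (controls 3.76, 4.16, 5.27; treatments 3.05, 3.20, 2.85), so $n_1 = n_2 = 3$. The arithmetic and rounding below are mine, not from the slides.

```latex
\begin{align*}
\bar{X}_{g2} - \bar{X}_{g1} &= \tfrac{9.10}{3} - \tfrac{13.19}{3} \approx -1.363 \\
s_{g1}^2 &\approx 0.612, \qquad s_{g2}^2 \approx 0.031 \\
s_g^2 &= \frac{2 \times 0.612 + 2 \times 0.031}{4} \approx 0.321,
  \qquad s_g \approx 0.567 \\
T_g &\approx \frac{-1.363}{0.567 \times (2/3)^{1/2}}
  \approx \frac{-1.363}{0.463} \approx -2.95
\end{align*}
```

Under the no-effect hypothesis this value would be referred to a t distribution with $n_1 + n_2 - 2 = 4$ degrees of freedom.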


[Figure: plot of t vs sp and distribution of sp, with a random-data comparison. Counts shown on the plot: 300 and 21983.]

Differentially expressed genes have smaller sp.

Is this effect Statistical or Biological?


500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$)

100 genes are differentially expressed, with mean difference = +1 or -1

$\sigma^2 = 1$ (constant)

           False Discoveries   True Discoveries
T-test            44                  22
z-test            43                  29

$\sigma^2$ from Chi-square(df=3)

           False Discoveries   True Discoveries
T-test            43                  28
z-test            53                  13
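A single run of a simulation of this kind can be sketched as follows. Several choices here are assumptions of mine and are not stated on the slide: the "z-test" is taken to mean dividing the mean difference by a known standard deviation of 1, genes are selected by taking the `top_k` largest |statistic| values (`top_k = 70` is hypothetical), and only one replicate is run, so the counts will not reproduce the table.

```python
import math
import random
import statistics


def simulate(n_genes=1000, n_de=100, n=4, top_k=70, chi2_var=False, seed=1):
    """Return {"t": (false, true), "z": (false, true)} discovery counts."""
    rng = random.Random(seed)
    rows = []
    for g in range(n_genes):
        # sigma^2 is 1, or a chi-square(3) draw (sum of three squared N(0,1)).
        var = sum(rng.gauss(0, 1) ** 2 for _ in range(3)) if chi2_var else 1.0
        sd = math.sqrt(var)
        is_de = g < n_de
        shift = rng.choice([-1.0, 1.0]) if is_de else 0.0
        ctrl = [rng.gauss(0, sd) for _ in range(n)]
        trt = [rng.gauss(shift, sd) for _ in range(n)]
        diff = statistics.mean(trt) - statistics.mean(ctrl)
        # Pooled variance with equal group sizes is the average of the two.
        s2 = (statistics.variance(ctrl) + statistics.variance(trt)) / 2
        t = diff / math.sqrt(s2 * 2 / n)
        z = diff / math.sqrt(2 / n)  # assumes sigma = 1 is known
        rows.append((is_de, abs(t), abs(z)))

    counts = {}
    for name, col in (("t", 1), ("z", 2)):
        top = sorted(rows, key=lambda r: r[col], reverse=True)[:top_k]
        true_disc = sum(1 for r in top if r[0])
        counts[name] = (top_k - true_disc, true_disc)
    return counts


constant = simulate(chi2_var=False)
varying = simulate(chi2_var=True)
print("sigma^2 = 1:", constant)
print("sigma^2 ~ chi-square(3):", varying)
```

Averaging such counts over many seeds would be the natural way to compare the methods.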


The effect of small sample size

Often the sample size per group is small. Consequences:

• unreliable variances (inferences)

• dependence between the test statistics (tg) and the standard error estimates (sg)

Possible remedies:

• borrow strength across genes (LPE/EB)

• regularize the test statistics (SAM)

• work with tg|sg (Conditional t).


Analysis results

Top 10 genes (sorted by t-test p-value)

Gene      Fold    Dir   p          p(Bonf)
G6546      2.36   D     0.000004   0.0964
G19945     3.25   U     0.000005   0.1102
G21586     1.64   U     0.000008   0.1765
G18970     2.52   U     0.000019   0.4220
G7432      3.70   D     0.000033   0.7248
G19057     1.85   U     0.000046   1.0000
G17361     4.34   D     0.000067   1.0000
G8525      5.57   D     0.000067   1.0000
G425      18.11   D     0.000078   1.0000
G8524      4.74   D     0.000109   1.0000
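The p(Bonf) column is consistent with the usual Bonferroni adjustment, min(1, m * p), where m is the number of genes tested. The sketch below assumes m = 22283, which is my guess from the two counts (300 and 21983) on the earlier t vs sp plot; the slides do not state m. Because the p column is rounded to six decimals, the products will only roughly match the printed p(Bonf) values.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


N_GENES = 22283  # assumption: not stated in the slides

# Two rows from the table, using the rounded p-values as printed.
for gene, p in [("G6546", 0.000004), ("G19057", 0.000046)]:
    print(gene, round(bonferroni(p, N_GENES), 4))
```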


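SAM: Determining c

The regularized statistic is

$$T_g(\alpha) = \frac{r_g}{s_g + s_\alpha}$$

[Diagram: plot of $T_g$ against $s_g$, split into seven bins along the $s_g$ axis. In bin j, $v_j(\alpha) = \mathrm{mad}\{T_g(\alpha)\}$, for j = 1, ..., 7, and $cv(\alpha)$ is the coefficient of variation of $v_1(\alpha), \ldots, v_7(\alpha)$.]

For each $\alpha_1, \ldots, \alpha_7$, compute $cv(\alpha_k)$ with offset $s_{\alpha_k}$. The value c is the $s_\alpha$ that minimizes $cv(\alpha)$.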
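The search for c can be sketched as below. This follows my reading of the slide together with the published SAM recipe (offset taken as a quantile of the sg values, MAD of the statistic within bins of sg, coefficient of variation across bins); the function names, the number of bins and the toy data are assumptions, not the SAM software.

```python
import random
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def choose_c(r, s, alphas, n_bins=7):
    """Pick the offset s_alpha whose statistic r/(s + s_alpha) has the most
    stable spread (smallest coefficient of variation) across bins of s."""
    n = len(s)
    order = sorted(range(n), key=lambda g: s[g])
    bins = [order[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]
    s_sorted = sorted(s)
    best_cv, best_c = None, None
    for alpha in alphas:
        s_alpha = s_sorted[min(n - 1, int(alpha * n))]  # alpha-quantile of s
        v = [mad(r[g] / (s[g] + s_alpha) for g in b) for b in bins]
        cv = statistics.stdev(v) / statistics.mean(v)
        if best_cv is None or cv < best_cv:
            best_cv, best_c = cv, s_alpha
    return best_c


# Toy data (hypothetical): 700 genes with numerators r_g and scales s_g.
rng = random.Random(0)
s = [rng.uniform(0.05, 1.0) for _ in range(700)]
r = [rng.gauss(0, sg + 0.2) for sg in s]
alphas = [k / 20 for k in range(21)]
c = choose_c(r, s, alphas)
print("chosen c:", round(c, 3))
```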


SAM: Gene selection

[Figure: left panel, d against db; right panel, P(|SAM| > t) against pooled Sd.]

$\bar{T}_{(g)}(\hat{c})$ = expected value of $T_{(g)}(\hat{c})$ under permutations


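Conditional t: Basic Model

Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.

Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$

Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$

Error model: $\varepsilon_{gij} \sim F(\text{location}=0, \text{scale}=1)$

Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$, with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$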


Possible approaches

Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.

Nonparametric:

1. Estimate $F_{\mu,\sigma}$: edf, $\hat{F}_{\mu,\sigma}$, of $\{(\bar{X}_{g1}, s_g^2)\}$.

   Estimate $F$: edf, $\hat{F}$, of $\{(X_{gij} - \bar{X}_{gj})/s_g\}$.

   For small samples $\hat{F}_\sigma$ is not a good estimator of $F_\sigma$. Use method of moments = target estimation.

2. Proceed via resampling and estimate the distribution of t | sp (Conditional t).


Procedure

(1) Draw a gene, g, at random from {1, …, G}. Call it g*. Then $(\bar{X}_{g^*1}, s_{g^*}^2) \sim \hat{F}_{\mu,\sigma}$.

(2) Take a random sample (with replacement) of size $n_1 + n_2$ from $\hat{F}$: $r^*_{ij} \sim \hat{F}$.

(3) Combine these to form pseudo-data: $X^*_{ij} = \bar{X}_{g^*1} + s_{g^*} r^*_{ij}$

(4) Calculate the pooled standard error s* and t test statistic t* for the pseudo-data $\{X^*_{ij}\}$.


Procedure (cont.)

(5) Repeat steps (1)-(4) a large number (10,000) of times.

(6) Given $\alpha$, estimate the “critical envelope”, $t_\alpha(s_g)$, as the $(\alpha/2)$ and $(1-\alpha/2)$ quantile curves in the $t_g$ vs $s_g$ relationship.

(7) Genes that fall outside the critical envelope defined by $t_\alpha(s_g)$ are deemed significant at level $\alpha$. (Overall unconditional Type I error rate = $\alpha$)
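Steps (1) to (7) can be sketched as below. The slides do not say how the quantile curves are estimated, so this sketch makes an assumption: it splits the simulated s* values into equal-count bins and takes empirical quantiles of t* within each bin. All names and the toy data are illustrative and are not the DNAMR implementation.

```python
import math
import random
import statistics


def pooled_t(x1, x2):
    """Return (t, s): pooled-variance t statistic and pooled standard error."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * statistics.variance(x1)
          + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    s = math.sqrt(s2)
    if s == 0:
        return 0.0, 0.0
    diff = statistics.mean(x2) - statistics.mean(x1)
    return diff / (s * math.sqrt(1 / n1 + 1 / n2)), s


def simulate_null(data, n1, n_iter=10000, seed=0):
    """Steps (1)-(5): build pseudo-data and collect (s*, t*) pairs."""
    rng = random.Random(seed)
    n2 = len(data[0]) - n1
    gene_stats, residuals = [], []
    for row in data:
        x1, x2 = row[:n1], row[n1:]
        _, s = pooled_t(x1, x2)
        if s == 0:
            continue
        m1, m2 = statistics.mean(x1), statistics.mean(x2)
        gene_stats.append((m1, s))
        residuals.extend((x - m1) / s for x in x1)
        residuals.extend((x - m2) / s for x in x2)
    null = []
    while len(null) < n_iter:
        m1, s = rng.choice(gene_stats)                        # step (1)
        r = [rng.choice(residuals) for _ in range(n1 + n2)]   # step (2)
        pseudo = [m1 + s * e for e in r]                      # step (3)
        t_star, s_star = pooled_t(pseudo[:n1], pseudo[n1:])   # step (4)
        if s_star > 0:
            null.append((s_star, t_star))
    return null


def critical_envelope(null, alpha=0.05, n_bins=10):
    """Step (6): per-bin (upper s edge, lower t quantile, upper t quantile)."""
    null = sorted(null)
    size = len(null) // n_bins
    curves = []
    for b in range(n_bins):
        chunk = null[b * size:] if b == n_bins - 1 else null[b * size:(b + 1) * size]
        ts = sorted(t for _, t in chunk)
        lo = ts[int(alpha / 2 * (len(ts) - 1))]
        hi = ts[int((1 - alpha / 2) * (len(ts) - 1))]
        curves.append((chunk[-1][0], lo, hi))
    return curves


def outside_envelope(t, s, curves):
    """Step (7): True if (s, t) falls outside the envelope at its bin."""
    for s_upper, lo, hi in curves:
        if s <= s_upper:
            return t < lo or t > hi
    return t < curves[-1][1] or t > curves[-1][2]


# Toy data (hypothetical): 300 null genes, 4 + 4 arrays, gene-specific variances.
rng = random.Random(42)
data = []
for _ in range(300):
    sd = math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(3)))
    data.append([rng.gauss(0, sd) for _ in range(8)])

null = simulate_null(data, n1=4, n_iter=5000)
curves = critical_envelope(null, alpha=0.05)
flagged = sum(outside_envelope(*pooled_t(row[:4], row[4:]), curves) for row in data)
print(flagged, "of", len(data), "null genes flagged")
```

A smoother estimate of the two quantile curves (for example a running quantile) could replace the binning without changing the rest of the sketch.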


Roadblock

$\hat{F}_\sigma(t)$ is not a good estimator of $F_\sigma(t)$.

Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$ and let the variance obtained from the $\{X_{ij}\}$ be $s^2$.

Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.

For example, if we assume that $F_\sigma = \chi^2_3$, n = 4 and $\varepsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.

Fix by target estimation: method of moments. Shrink towards the center.


Example: Checking for the distribution of $\sigma_g$

Mice Data

Compare the distribution of $s_g$ vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_2$, 3. $\sigma^2 \sim \chi^2_6$.

[Figure: histograms of Sp (frequency) for the E7 data and for simulated Cases 1-3 (Df=0.5, Df=2, Df=6); plots of S1, S2 and S3 against E7 for Cases 1-3.]


Another Example

Tox Data

Compare the distribution of $s_g$ vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_3$, 3. $\sigma^2 \sim \chi^2_6$.

[Figure: histograms of Sp (frequency) for the Tox data and for simulated Cases 1-3 (Df=0.5, Df=3, Df=6); plots of S1, S2 and S3 against Tox for Cases 1-3; plot of mean diff vs Sp.]


Fixing the variance distribution

The idea is to estimate the function $h: [0,1] \to [0,1]$ defined by

$$h(F_\sigma(x)) = F_s(x),$$

where $F_s$ denotes the distribution of the estimated variances $s^2$. Since h is strictly monotonic, it can be inverted in order to obtain an estimate of $F_\sigma(x)$. Procedure:

(1) Assume that $\hat{F}_\sigma(x)$ is the true distribution of $\sigma^2$ and draw a random sample, $s^{*2}$, from $\hat{F}_\sigma$.

(2) Take a random sample (with replacement) of size N from $\hat{F}$: $r^*_{ij} \sim \hat{F}$ for $i = 1, \ldots, n_j$, $j = 1, 2$.

(3) Combine these to form pseudo-data: $X^*_{ij} = s^* r^*_{ij}$


Fixing the variance distribution (contd)

(4) Calculate the pooled standard error $s^{**}$ for the pseudo-data $\{X^*_{ij}\}$.

(5) Repeat steps (1)-(4) a large number (say 100,000) of times and record, for each iteration, the pair of values $\{(s^{*2}, s^{**2})\}$.

(6) Let $\hat{F}_*(x)$ be the empirical distribution of the $s^{**2}$'s. Then the estimator of h is obtained by mapping the empirical distribution $\hat{F}_\sigma$ into $\hat{F}_*$. More precisely, with $y = \hat{F}_\sigma(x)$,

$$\hat{h}(y) = \hat{F}_*(x) = \hat{F}_*(\hat{F}_\sigma^{-1}(y))$$

and

$$\hat{h}^{-1}(y) = \hat{F}_\sigma(\hat{F}_*^{-1}(y)).$$

Hence the bias-corrected estimator of $F_\sigma$ is:

$$\tilde{F}_\sigma(x) = \hat{F}_\sigma(\hat{F}_*^{-1}(\hat{F}_\sigma(x))).$$

Proceed as before …
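The correction can be sketched numerically as follows, under my reading of the garbled formulas: the corrected distribution function at x is the edf of the observed variances, evaluated at the matching quantile of the simulated variances. The helper names, the toy data (true variance 1 for every gene) and the residual pool are assumptions for illustration.

```python
import bisect
import math
import random


def sample_var(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)


def pooled_var(x, n1, n2):
    return ((n1 - 1) * sample_var(x[:n1])
            + (n2 - 1) * sample_var(x[n1:])) / (n1 + n2 - 2)


def edf(sorted_vals, x):
    """Empirical distribution function at x."""
    return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)


def edf_inverse(sorted_vals, p):
    """Empirical quantile: smallest value whose edf is at least p."""
    idx = max(0, math.ceil(p * len(sorted_vals)) - 1)
    return sorted_vals[min(idx, len(sorted_vals) - 1)]


def corrected_edf(s2_obs, residuals, n1, n2, n_iter=20000, seed=0):
    """Return x -> F_sigma_hat(F_star_hat^{-1}(F_sigma_hat(x)))."""
    rng = random.Random(seed)
    f_sigma = sorted(s2_obs)                                   # edf of observed variances
    boot = []
    for _ in range(n_iter):
        s_star = math.sqrt(rng.choice(f_sigma))                # step (1)
        r = [rng.choice(residuals) for _ in range(n1 + n2)]    # step (2)
        pseudo = [s_star * e for e in r]                       # step (3)
        boot.append(pooled_var(pseudo, n1, n2))                # steps (4)-(5)
    f_star = sorted(boot)                                      # step (6)
    return lambda x: edf(f_sigma, edf_inverse(f_star, edf(f_sigma, x)))


# Toy data (hypothetical): 500 genes, 4 + 4 arrays, true variance 1 for all genes.
rng = random.Random(7)
s2_obs = [pooled_var([rng.gauss(0, 1) for _ in range(8)], 4, 4) for _ in range(500)]
residuals = [rng.gauss(0, 1) for _ in range(2000)]

f_tilde = corrected_edf(s2_obs, residuals, n1=4, n2=4)
s2_sorted = sorted(s2_obs)
q25 = s2_sorted[len(s2_sorted) // 4]
print("edf at lower quartile:", round(edf(s2_sorted, q25), 3))
print("corrected value there:", round(f_tilde(q25), 3))
```

In this toy case the corrected distribution puts less mass in the lower tail than the raw edf, which is the "shrink towards the center" effect.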


[Figure: plot of t vs sp. Counts shown on the plot: 191 and 22092.]

Differentially expressed genes may have large sp


500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$)

100 genes are differentially expressed, with mean difference = +1 or -1

$\sigma^2 = 1$ (constant)

           False Discoveries   True Discoveries
T-test            44                  22
z-test            43                  29
C-t               45                  30

$\sigma^2$ from Chi-square(df=3)

           False Discoveries   True Discoveries
T-test            43                  28
z-test            53                  13
C-t               42                  38


Using 8 iid samples from Khan Data, we make changes to 50 genes to make them differentially expressed for high level.

[Figure: comparison of T-test, SAM and Ct.]


Generating p-values

To generate p-values, recall that the Ct procedure generates curves, $c_\alpha(s)$. Start with a set of curves, $c_{\alpha_1}(s_g), \ldots, c_{\alpha_k}(s_g)$, for a set of prespecified values, $\alpha_1, \ldots, \alpha_k$.

Now consider the relationship between $v_i = \log(-\log(\alpha_i))$ and $u_i = \log(c_{\alpha_i}(s_g))$.

To assign an approximate p-value to the gth gene, if $|t_g| \le c_{\alpha_k}(s_g)$, interpolate the relationship between the $\{u_i\}$ and the $\{v_i\}$.
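The interpolation can be sketched as below, assuming linear interpolation of v against u and clamping when |t| lies outside the range of the available curves (the slide does not say what is done in that case). The cutoff values are hypothetical stand-ins for the curves evaluated at one gene's sg.

```python
import math


def approx_p_value(t, alphas, cutoffs):
    """Approximate p-value for |t| from critical values cutoffs[i] at level alphas[i].

    Interpolates v = log(-log(alpha)) linearly in u = log(cutoff) and maps
    back with alpha = exp(-exp(v)).
    """
    pts = sorted((math.log(c), math.log(-math.log(a)))
                 for a, c in zip(alphas, cutoffs))
    if abs(t) <= min(cutoffs):
        v = pts[0][1]            # clamp: p is at least the largest alpha
    elif abs(t) >= max(cutoffs):
        v = pts[-1][1]           # clamp: p is at most the smallest alpha
    else:
        x = math.log(abs(t))
        v = pts[-1][1]
        for (u0, v0), (u1, v1) in zip(pts, pts[1:]):
            if u0 <= x <= u1:
                v = v0 + (v1 - v0) * (x - u0) / (u1 - u0)
                break
    return math.exp(-math.exp(v))


# Hypothetical envelope values at one gene's s_g.
alphas = [0.1, 0.05, 0.01, 0.001]
cutoffs = [1.9, 2.4, 3.7, 5.9]
print(approx_p_value(3.0, alphas, cutoffs))
```

By construction, a statistic that sits exactly on one of the curves gets that curve's level back as its p-value.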


Extensions

F test:

- Condition on the sqrt(MSE)

Multiple comparisons:

- Tukey, Dunnett, Bump.

- Condition on the sqrt(MSE)

Gene Ontology:

- Test for the significance of groups.

- Use Hypergeometric Statistic, mean t, mean p-value, or other.

- Condition on log of the number of genes per group
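For the Gene Ontology extension, the hypergeometric statistic for one group of genes can be sketched as an upper-tail probability. This is the standard over-representation calculation, not necessarily the exact form used in DNAMR, and the counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_significant, group_size, k):
    """P(X >= k): chance that a group of `group_size` genes, drawn without
    replacement from `n_total` genes of which `n_significant` are significant,
    contains at least k significant genes."""
    top = min(group_size, n_significant)
    numerator = sum(
        comb(n_significant, x) * comb(n_total - n_significant, group_size - x)
        for x in range(k, top + 1)
    )
    return numerator / comb(n_total, group_size)


# Hypothetical example: 22283 genes, 300 significant, a group of 40 with 5 significant.
print(hypergeom_upper_tail(22283, 300, 40, 5))
```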


Conditional F

[Figure: Sqrt(F) against Sqrt(MSE).]


GO Ontology: Conditioning on log(n)

[Figure: Abs(T) against Log(n); axis labels Sd and |T| also appear.]


The Details

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004.


Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR


Target Estimation

Cabrera, Fernholz (1999)

- Bias reduction.

- MSE reduction.

Recent applications:

- Ellipse estimation (multivariate target).

- Logistic regression:

• Cabrera, Fernholz, Devas (2003)

• Patel (2003) Target Conditional MLE (TCMLE)

Implementation in StatXact (CYTEL) and LogXact procs in SAS (by CYTEL).


Target Estimation

[Diagram: the estimator $T(x_1, x_2, \ldots, x_n)$ and its expectation, $E_\theta(T) = g(\theta)$.]


Target Estimation

Suppose we have an estimator $\hat{\theta} = T(x_1, \ldots, x_n)$ of a parameter $\theta$.

Target estimator: solve $E_\theta(T) = \hat{\theta}$ for $\theta$. If $h(\theta) = E_\theta(T)$, then $\tilde{\theta} = h^{-1}(\hat{\theta})$.

Algorithms:

- Stochastic approximation.

- Simulation and iteration.

- Exact algorithm for TCMLE
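The "simulation and iteration" route can be sketched on a toy case of my choosing: the estimator T is the variance estimate that divides by n, whose expectation under normal data is ((n-1)/n) times the true variance. The expectation is approximated by Monte Carlo with a fixed set of standard normal draws, and the equation is solved by bisection. The function names and settings are illustrative assumptions.

```python
import random


def biased_variance(x):
    """T(x1, ..., xn): variance estimate that divides by n (biased downward)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def expected_T(theta, noise):
    """Monte Carlo estimate of E_theta(T), reusing the same standard normal
    draws for every theta so that the estimate is monotone in theta."""
    sd = theta ** 0.5
    return sum(biased_variance([sd * z for z in zs]) for zs in noise) / len(noise)


def target_estimate(theta_hat, noise, tol=1e-6):
    """Solve expected_T(theta) = theta_hat for theta by bisection."""
    lo, hi = 1e-9, 10 * theta_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_T(mid, noise) < theta_hat:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n = 4
rng = random.Random(0)
noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(10000)]

theta_hat = 3.0  # hypothetical observed value of T
theta_tilde = target_estimate(theta_hat, noise)
print(round(theta_tilde, 3))
```

In this toy case the answer is known in closed form (theta_hat times n/(n-1), here 4), which makes it a convenient check on the simulation-based solver.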