Limits to Statistical Theory Bootstrap analysis ESM 206 11 April 2006


Page 1: Limits to Statistical Theory Bootstrap analysis

Limits to Statistical Theory: Bootstrap analysis

ESM 206

11 April 2006

Page 2: Limits to Statistical Theory Bootstrap analysis

Assumption of t-test

• Sample mean, standardized by its estimated standard error, is a t-distributed random variable

– Guaranteed if observations are normally distributed random variables; holds approximately if sample size is very large

– In practice, OK if observations are not too skewed and sample size is reasonably large

• This assumption also applies when using standard formula for 95% CI of mean
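For reference, the standard formula mentioned above is usually written as below. The notation is an assumption, since the slides do not spell it out: \bar{x} is the sample mean, s the sample standard deviation, n the sample size, and t_{0.975, n-1} the 97.5th percentile of the t distribution with n - 1 degrees of freedom.

```latex
\bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
```

The interval is symmetric about the sample mean by construction, which is why it can extend below zero for strongly skewed, strictly positive data.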

Page 3: Limits to Statistical Theory Bootstrap analysis

Resampling for a confidence interval of the mean

IN AN IDEAL WORLD

• Take sample

• Calculate sample mean

• Take new sample

• Calculate new mean

• Repeat many times

• Look at the distribution of sample means

• 95% CI ranges from 2.5 percentile to 97.5 percentile

IN THE REAL WORLD

• Find some way to simulate taking a sample

• Calculate the sample mean

• Repeat many times

• Look at the distribution of sample means

• 95% CI ranges from 2.5 percentile to 97.5 percentile

Page 4: Limits to Statistical Theory Bootstrap analysis

Bootstrap resampling

PARAMETRIC BOOTSTRAP

• Assume data are random variables from a particular distribution

– E.g., log-normal

• Use data to estimate parameters of the distribution

– E.g., mean, variance

• Use random number generator to create sample

– Same size as original

– Calculate sample mean

• Allows us to ask: What if data were a random sample from specified distribution with specified parameters?

NONPARAMETRIC BOOTSTRAP

• Assume underlying distribution from which data come is unknown

– Best estimate of this distribution is the data themselves: the empirical distribution function

• Create a new dataset by sampling with replacement from the data

– Same size as original

– Calculate sample mean
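The nonparametric recipe can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the function name `bootstrap_mean_ci`, the default of 999 replicates, and the small sample are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_boot=999, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate: draw n values with replacement, then take the mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    # With 999 replicates these are ranks 25 and 975 of the sorted means.
    lo_rank = round((n_boot + 1) * alpha / 2)
    hi_rank = round((n_boot + 1) * (1 - alpha / 2))
    return means[lo_rank - 1], means[hi_rank - 1]

# Hypothetical skewed sample, for illustration only (not the TcCB data).
sample = [0.09, 0.17, 0.23, 0.31, 0.43, 0.58, 0.77, 1.15, 2.70, 9.40]
low, high = bootstrap_mean_ci(sample, seed=1)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

Because every replicate is built only from observed values, the interval can never extend beyond the range of the data.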

WHICH IS BETTER?

• If underlying distribution is correctly chosen, parametric has more precision

• If underlying distribution incorrectly chosen, parametric has more bias

Page 5: Limits to Statistical Theory Bootstrap analysis

TcCB in the cleanup site

• Parametric bootstrap

– If Y is log-normal, it is specified in terms of mean and standard deviation of X = log(Y)

– Mean = -0.547

– SD = 1.360

– Use “Monte Carlo Simulation” to generate 999 replicate simulated datasets from log-normal distribution

– Calculate mean of each replicate and sort means

– 25th value is lower end of 95% CI

– 975th value is upper end of 95% CI
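The steps above might look like this in Python. It is a sketch under stated assumptions: the slides do not say which software ran the simulation, the seed is arbitrary, and the sample size of 77 is taken from the N reported in the cleanup-site summary output. Because the draws are random, the endpoints will vary somewhat from run to run and from the values reported on the slides.

```python
import random
import statistics

MU_LOG, SD_LOG = -0.547, 1.360   # mean and SD of X = log(Y), from the slide
N_OBS, N_REPS = 77, 999          # observations per dataset, number of replicates

rng = random.Random(206)         # arbitrary seed (an assumption), for repeatability

# Simulate 999 datasets from the fitted log-normal and keep each dataset's mean.
means = sorted(
    statistics.fmean([rng.lognormvariate(MU_LOG, SD_LOG) for _ in range(N_OBS)])
    for _ in range(N_REPS)
)

# 25th and 975th sorted values (indices 24 and 974 with zero-based indexing).
ci_low, ci_high = means[24], means[974]
print(f"parametric bootstrap 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Using 999 rather than 1000 replicates makes the 2.5% and 97.5% points fall on whole ranks, since (999 + 1) x 0.025 = 25.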

[Figure: distribution plot of Cleanup, axis from 0 to 150]

Distributions: Cleanup

Quantiles

100.0%   maximum    168.64
 99.5%              168.64
 97.5%               57.80
 90.0%                2.70
 75.0%   quartile     1.15
 50.0%   median       0.43
 25.0%   quartile     0.23
 10.0%                0.17
  2.5%                0.09
  0.5%                0.09
  0.0%   minimum      0.09

Moments

Mean               3.9151948
Std Dev           20.0156
Std Err Mean       2.2809894
upper 95% Mean     8.4581788
lower 95% Mean    -0.627789
N                 77

95% CI: [-0.628, 8.458]

Page 6: Limits to Statistical Theory Bootstrap analysis

Parametric bootstrap: results

• 95% CI: [0.917, 2.293]

[Figure: Distribution of sample means. Histogram; x-axis: Bin (label shows upper limit), from about 0.84 to 4.07; y-axis: Frequency, 0 to 180]

[Figure: distribution plot of log(cleanup), axis from -3 to 6]

Distributions: log(cleanup)

Moments

Mean              -0.547426
Std Dev            1.3604488
Std Err Mean       0.1550375
upper 95% Mean    -0.238642
lower 95% Mean    -0.85621
N                 77

Page 7: Limits to Statistical Theory Bootstrap analysis

Normal QQ Plot

• Sort data

• Index the values (i = 1, 2, …, n)

• Calculate q = i / (n + 1)

– This is the quantile

• Plot quantiles against data values

– This is the empirical cumulative distribution function (CDF)

• Construct CDF of standard normal using same quantiles

• Compare the distributions at the same quantiles
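The construction above can be sketched as follows. This is an illustrative Python version, not the software used for the slides; the function name `normal_qq_points` and the sample values are assumptions. It pairs each sorted data value with the standard normal value at the same quantile q = i / (n + 1).

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (standard normal value, data value) pairs for a normal QQ plot."""
    values = sorted(data)
    n = len(values)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(values, start=1):
        q = i / (n + 1)                # quantile assigned to the i-th sorted value
        z = std_normal.inv_cdf(q)      # standard normal value at that quantile
        points.append((z, value))
    return points

# Hypothetical log-scale values, for illustration only.
log_values = [-2.4, -1.8, -1.5, -0.8, -0.6, -0.1, 0.1, 1.0, 2.3]
points = normal_qq_points(log_values)
for z, value in points:
    print(f"{z:7.3f}  {value:6.2f}")
```

If the data are close to normal, the pairs fall near a straight line; dividing by n + 1 rather than n keeps every quantile strictly between 0 and 1, so the normal value is always finite.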

[Figure: Normal Quantile Plot of log(cleanup); data axis from -3 to 6; normal quantile axis from -3 to 3, with probability scale from .01 to .99]

Page 8: Limits to Statistical Theory Bootstrap analysis

Nonparametric bootstrap: results

• 95% CI: [0.851, 9.248]

[Figure: histogram of bootstrap means; x-axis: Bootstrap mean, bins from 0.5 to 14.5; y-axis: Frequency, 0 to 140]

Page 9: Limits to Statistical Theory Bootstrap analysis

Bootstrap and hypothesis tests

• One-sample t-test

– Calculate bootstrap CI of mean

– Does it contain the test value?

• Paired t-test

– Calculate differences: D_i = x_i - y_i

– Find bootstrap CI of mean difference

– Does it contain zero?

• Two-sample t-test

– Want to create simulated data where H0 is true (same mean) but allow variance and shape of distribution to differ between populations

– Easiest with nonparametric:

    • Subtract mean from each sample. Now both samples have mean zero

    • Resample these residuals, creating simulated group A from residuals of group A and simulated group B from residuals of group B

– Generate distribution of t values

– P is fraction of simulated t’s that exceed t calculated from data
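The two-sample procedure can be sketched as below. Several details are assumptions, because the slides do not specify them: the unequal-variance (Welch) form of the t statistic, the function names, the 999 replicates, and the two small illustrative groups.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Two-sample t statistic allowing unequal variances (assumed form)."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0:  # degenerate resample: both groups constant
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def bootstrap_t_test(a, b, n_boot=999, seed=None):
    """One-sided bootstrap P value for H0: the two means are equal."""
    rng = random.Random(seed)
    t_obs = welch_t(a, b)
    # Subtract each group's own mean so that H0 holds in the simulated data,
    # while each group keeps its own spread and shape.
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    res_a = [x - mean_a for x in a]
    res_b = [x - mean_b for x in b]
    exceed = 0
    for _ in range(n_boot):
        sim_a = rng.choices(res_a, k=len(a))  # group A from A's residuals
        sim_b = rng.choices(res_b, k=len(b))  # group B from B's residuals
        if welch_t(sim_a, sim_b) > t_obs:
            exceed += 1
    return t_obs, exceed / n_boot

# Hypothetical groups, for illustration only (not the TcCB data).
group_a = [0.2, 0.4, 0.5, 0.9, 1.3, 2.8, 6.1, 12.4]
group_b = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1.1, 1.6]
t_obs, p_value = bootstrap_t_test(group_a, group_b, seed=1)
print(f"t = {t_obs:.2f}, bootstrap P = {p_value:.3f}")
```

Resampling each group only from its own residuals is what lets the simulated populations differ in variance and shape while sharing a mean of zero.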

Page 10: Limits to Statistical Theory Bootstrap analysis

TcCB: H0: cleanup mean = reference mean

• t = 1.45

• Bootstrapped ‘t’ values do not follow a t distribution!

• P = 0.02

[Figure: histogram of bootstrapped t values; x-axis: Bin (label shows upper limit), from about -37.7 to 2.4; y-axis: Frequency, 0 to 600]