
Page 1: Intro Bootstrap 341

Introduction to the Bootstrap

Machelle D. Wilson

Page 2

Outline

Why the Bootstrap? Limitations of traditional statistics

How does it work? The empirical distribution function and the plug-in principle

Accuracy of an estimate: bootstrap standard error and confidence intervals

Examples

How good is the bootstrap?

Page 3

Limitations of Traditional Statistics: Problems with distributional assumptions

Often data cannot safely be assumed to come from an identifiable distribution.

Sometimes the distribution of the statistic is mathematically intractable, even when distributional assumptions can be made.

Hence, the bootstrap often provides a superior alternative to parametric statistics.

Page 4

An example data set

[Figure: histogram of 1000 bootstrapped means with mean concentration and dose rate fixed; x-axis: mean dose. Red lines = bootstrap CI; black lines = normal CI.]

Page 5

An Example Data Set

[Figure: histogram of 1000 bootstrapped means with mean concentration and dose rate random; x-axis: mean dose. Red lines = bootstrap CI; black lines = normal CI.]

Page 6

Statistics in the Computer Age

Efron and Tibshirani, 1991 in Science: “Most of our familiar statistical methods, such as hypothesis testing, linear regression, analysis of variance, and maximum likelihood estimation, were designed to be implemented on mechanical calculators. Modern electronic computation has encouraged a host of new statistical methods that require fewer distributional assumptions than their predecessors and can be applied to more complicated statistical estimators…without the usual concerns for mathematical tractability.”

Page 7

The Bootstrap Solution

With the advent of cheap, high-powered computing, it has become relatively easy to use resampling techniques, such as the bootstrap, to estimate the distribution of sample statistics empirically rather than making distributional assumptions.

The bootstrap resamples the data with equal probability and with replacement and calculates the statistic of interest at each resampling. The resulting histogram, mean, quantiles and variance of the bootstrapped statistics provide an estimate of its distribution.
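The resampling scheme just described can be sketched in a few lines of standard-library Python (a hypothetical illustration; the data values and the choice of the mean as the statistic are made up for this example):

```python
import random
import statistics

random.seed(0)
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up sample
B = 1000  # number of bootstrap resamplings

# Resample with replacement and equal probability; recompute the statistic each time.
boot_means = [statistics.mean(random.choices(data, k=len(data))) for _ in range(B)]

# The collection of bootstrapped statistics estimates the sampling distribution:
est_mean = statistics.mean(boot_means)
est_se = statistics.stdev(boot_means)
print(round(est_mean, 3), round(est_se, 3))
```

The histogram, quantiles, and standard deviation of `boot_means` then stand in for the unknown sampling distribution of the mean.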

Page 8

Example

Take the data set 1,2,3. There are 10 possible resamplings, where re-orderings are considered the same sampling.

{1,2,3}  {1,1,2}  {1,1,3}  {2,2,1}  {2,2,3}  {3,3,1}  {3,3,2}  {1,1,1}  {2,2,2}  {3,3,3}

[Figure: histogram of the means of the 10 possible resamplings, ranging from 1 to 3.]
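The ten resamplings above can be enumerated directly: since re-orderings count as the same sample, they are exactly the multisets of size 3 drawn from {1, 2, 3}.

```python
from itertools import combinations_with_replacement
from statistics import mean

# All distinct resamplings of {1, 2, 3}, treating re-orderings as the same sample.
samples = list(combinations_with_replacement([1, 2, 3], 3))
print(len(samples))  # 10 distinct multisets

# The bootstrapped statistic (here, the mean) for each resampling:
for s in samples:
    print(s, mean(s))
```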

Page 9

The Bootstrap Solution

In general, the number of possible distinct bootstrap samples is

C_n = C(2n-1, n) = (2n-1)! / [n! (n-1)!].

Table of possible distinct bootstrap re-samplings by sample size:

n     5     10      12         15         20          25          30
C_n   126   92,378  1.35x10^6  7.76x10^7  6.89x10^10  6.32x10^13  5.91x10^16
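The formula and the table entries can be checked directly with a one-line counting function:

```python
from math import comb

# Number of distinct bootstrap resamplings of a sample of size n:
# choose n items from n values with repetition, order ignored.
def c_n(n):
    return comb(2 * n - 1, n)

for n in [3, 5, 10, 12, 15, 20, 25, 30]:
    print(n, c_n(n))
```

For n = 3 this recovers the 10 resamplings of the earlier example, and for n = 5 the table's value of 126.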

Page 10

The Empirical Distribution Function

Having observed a random sample x_1, x_2, ..., x_n of size n from a probability distribution F, the empirical distribution function (edf), F-hat, assigns to a set A in the sample space of x its empirical probability

F-hat(A) = P-hat{A} = #{x_i in A} / n.

Page 11

Example

A random sample of 100 throws of a die yields 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, and 27 sixes. Hence the edf is

F-hat(1) = 0.13    F-hat(4) = 0.17
F-hat(2) = 0.19    F-hat(5) = 0.14
F-hat(3) = 0.10    F-hat(6) = 0.27
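The die example is just each count divided by n, which a short snippet makes explicit:

```python
# Observed counts from 100 throws of a die (the example's data).
counts = {1: 13, 2: 19, 3: 10, 4: 17, 5: 14, 6: 27}
n = sum(counts.values())

# The edf assigns each face its empirical probability #{x_i = k} / n.
edf = {face: c / n for face, c in counts.items()}
print(edf)
```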

Page 12

The Plug-in Principle

It can be shown that F-hat is a sufficient statistic for F.

That is, all the information about F contained in x is also contained in F-hat.

The plug-in principle estimates theta = T(F) by theta-hat = T(F-hat).

Page 13

The Plug-in Principle

If the only information about F comes from the sample x, then theta-hat = T(F-hat) is a minimum variance unbiased estimator of theta = T(F).

The bootstrap draws B samples from the empirical distribution and computes B realizations of the statistic of interest, theta-hat*.

Hence, the bootstrap is both sampling from an edf (of the original sample) and generating an edf (of the statistic).

Page 14

Graphical Representation of the Bootstrap

x = {x_1, x_2, ..., x_n}

x*_1, x*_2, x*_3, ..., x*_B

T(x*_1), T(x*_2), T(x*_3), ..., T(x*_B)

t-bar = (1/B) sum_{b=1}^{B} T(x*_b)

se-hat(T(x)) = sqrt( sum_{b=1}^{B} [T(x*_b) - t-bar]^2 / (B - 1) )
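The two formulas above translate directly into code (a sketch; the data values and the choice of the mean as T are made up for illustration):

```python
import random
from math import sqrt

random.seed(1)
data = [4.0, 7.5, 3.2, 6.1, 5.8, 4.9, 8.0, 5.5]  # made-up sample
T = lambda x: sum(x) / len(x)  # statistic of interest (here, the mean)
B = 2000

# T(x*_b) for each bootstrap resampling b = 1, ..., B.
stats = [T(random.choices(data, k=len(data))) for _ in range(B)]

# t-bar = (1/B) * sum_b T(x*_b)
t_bar = sum(stats) / B
# se-hat = sqrt( sum_b [T(x*_b) - t-bar]^2 / (B - 1) )
se_hat = sqrt(sum((t - t_bar) ** 2 for t in stats) / (B - 1))
print(round(t_bar, 3), round(se_hat, 3))
```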

Page 15

Bootstrap Standard Error and Confidence Intervals

The bootstrap estimate of the mean is just the empirical average of the statistic over all bootstrap samples.

The bootstrap estimate of standard error is just the empirical standard deviation of the bootstrap statistic over all bootstrap samples.

Page 16

Bootstrap Confidence Intervals

The percentile interval: the bootstrap confidence interval for any statistic is simply the alpha/2 and 1 - alpha/2 quantiles of the bootstrapped statistics.

For example, if B = 1000 and alpha = 0.05, then to construct the BS confidence interval we rank the statistics and take the 25th and the 975th values.

There are other BS CIs, but this is the easiest and makes the fewest assumptions.

Page 17

Example: Bootstrap of the Median

Go to Splus.
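The original demo was run in S-PLUS; an equivalent sketch in Python, combining the median bootstrap with the percentile interval from the previous slide (the data values here are made up):

```python
import random
import statistics

random.seed(2)
data = [12, 15, 9, 22, 17, 14, 30, 11, 19, 16]  # made-up sample
B = 1000

# Bootstrap the median and sort the B resampled statistics.
boot_medians = sorted(
    statistics.median(random.choices(data, k=len(data))) for _ in range(B)
)

# Percentile interval: with B = 1000 and alpha = 0.05, take the
# 25th and 975th ranked values (indices 24 and 974).
lo, hi = boot_medians[24], boot_medians[974]
print("95% percentile CI for the median:", (lo, hi))
```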

Page 18

How Good is the Bootstrap?

The bootstrap, in most cases, is as good as the empirical distribution function.

The bootstrap is not optimal when there is good information about F that did not come from the data, i.e., prior information or strong, valid distributional assumptions.

The bootstrap does not work well for extreme values and needs somewhat difficult modifications for autocorrelated data such as time series.

When all our information comes from the sample itself, we cannot do better than the bootstrap.