
BOOTSTRAPPING LINEAR MODELS

Page 1: BOOTSTRAPPING LINEAR MODELS

Stat 6601 Presentation

Presented by:

Xiao Li (Winnie)
Wenlai Wang

Ke Xu

Nov. 17, 2004

V & R 6.6

Page 2: BOOTSTRAPPING LINEAR MODELS

Preview of the Presentation

11/17/2004

Bootstrapping Linear Models

Introduction to Bootstrap
Data and Modeling
Methods on Bootstrapping LM
Results
Issues and Discussion
Summary

Page 3: BOOTSTRAPPING LINEAR MODELS

What is Bootstrapping?

Invented by Bradley Efron, and further developed by Efron and Tibshirani

A method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample

A method to determine the trustworthiness of a statistic (generalization of the standard deviation)
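The resampling idea can be sketched in a few lines. Below is a minimal, illustrative Python sketch (the deck's own code is in R); the function name `bootstrap_se` and the toy data are ours, not from the slides:

```python
import random

def bootstrap_se(sample, statistic, R=999, seed=0):
    """Estimate the standard error of `statistic` by drawing R
    resamples of the data with replacement and looking at the
    spread of the statistic across them."""
    rng = random.Random(seed)
    n = len(sample)
    reps = []
    for _ in range(R):
        # Resample with replacement from the original sample
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(resample))
    mean = sum(reps) / R
    return (sum((t - mean) ** 2 for t in reps) / (R - 1)) ** 0.5

# Toy data: bootstrap standard error of the sample mean
data = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8, 3.7, 4.4]
se_mean = bootstrap_se(data, lambda s: sum(s) / len(s))
```

The same function works for any statistic (median, slope, ...), which is the point of the method: no closed-form standard-error formula is needed.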

Page 4: BOOTSTRAPPING LINEAR MODELS

Why Use Bootstrapping?

Start with two questions:
- What estimator should be used?
- Having chosen an estimator, how accurate is it?

Linear model with normal random errors having constant variance: least squares
Non-normal errors and non-constant variance: ???

Page 5: BOOTSTRAPPING LINEAR MODELS

The Mammals Data

A data frame with average brain and body weights for 62 species of land mammals.

"body": body weight in kg
"brain": brain weight in g
"name": common name of species

Page 6: BOOTSTRAPPING LINEAR MODELS

Data and Model

[Figure: scatter plots of the original data (brain weight vs. body weight) and the log-transformed data (log brain weight vs. log body weight)]

Linear Regression Model:

y_j = β0 + β1 x_j + ε_j,  j = 1, …, n,

where the error ε_j is considered random,

y = log(brain weight)

x = log(body weight)

Page 7: BOOTSTRAPPING LINEAR MODELS

Summary of Original Fit

Residuals:

Min 1Q Median 3Q Max

-1.71550 -0.49228 -0.06162 0.43597 1.94829

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.13479 0.09604 22.23 <2e-16 ***

log(body) 0.75169 0.02846 26.41 <2e-16 ***

Residual standard error: 0.6943 on 60 DF

Multiple R-Squared: 0.9208

Adjusted R-squared: 0.9195

F-statistic: 697.4 on 1 and 60 DF

p-value: < 2.2e-16

Page 8: BOOTSTRAPPING LINEAR MODELS

R Code for Original Modeling

library(MASS)

library(boot)

op <- par(mfrow=c(1,2))

data(mammals)  # load the mammals data set from MASS

plot(mammals$body, mammals$brain, main='Original Data', xlab='body weight', ylab='brain weight', col='brown') # plot of data

plot(log(mammals$body), log(mammals$brain), main='Log-Transformed Data', xlab='log body weight', ylab='log brain weight', col='brown') # plot of log-transformed data

mammal <- data.frame(log(mammals$body), log(mammals$brain))

dimnames(mammal) <- list((1:62), c("body", "brain"))

attach(mammal)

log.fit <- lm(brain~body, data=mammal)

summary(log.fit)

Page 9: BOOTSTRAPPING LINEAR MODELS

Two Methods

Case-based Resampling: randomly sample pairs (Xi, Yi) with replacement

- No assumption about variance homogeneity
- Design fixes the information content of a sample

Model-based Resampling: resample the residuals

- Assumes the model is correct with homoscedastic errors
- Resampling model has the same "design" as the data

Page 10: BOOTSTRAPPING LINEAR MODELS

Case-Based Resample Algorithm

For r = 1, …, R:

1. Sample j*_1, …, j*_n randomly with replacement from {1, 2, …, n}

2. For j = 1, …, n, set x*_j = x_{j*_j} and y*_j = y_{j*_j}

3. Fit the least-squares regression to (x*_1, y*_1), …, (x*_n, y*_n), giving estimates β0*, β1*, s*
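The three steps above can be sketched in Python (illustrative only; the deck itself uses R's boot package, and the helper names `ols` and `case_boot` are ours):

```python
import random

def ols(xs, ys):
    """Least-squares intercept and slope for simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def case_boot(xs, ys, R=999, seed=0):
    """Case-based resampling: draw whole (x, y) pairs with
    replacement and refit least squares on each resample."""
    rng = random.Random(seed)
    n = len(xs)
    reps = []
    for _ in range(R):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1
        xstar = [xs[j] for j in idx]                 # step 2
        ystar = [ys[j] for j in idx]
        reps.append(ols(xstar, ystar))               # step 3
    return reps
```

Because whole pairs are resampled, no assumption about the error variance is made, matching the earlier slide.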

Page 11: BOOTSTRAPPING LINEAR MODELS

Model-Based Resample Algorithm

For r = 1, …, R:

1. For j = 1, …, n,
   a) Set x*_j = x_j
   b) Randomly sample ε*_j from the centered residuals e_1, …, e_n
   c) Set y*_j = β0 + β1 x_j + ε*_j, using the original fitted coefficients

2. Fit the least-squares regression to (x*_1, y*_1), …, (x*_n, y*_n), giving estimates β0*, β1*, s*
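A matching Python sketch of residual resampling (again illustrative, not the deck's R code; `ols` and `model_boot` are our names):

```python
import random

def ols(xs, ys):
    """Least-squares intercept and slope for simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def model_boot(xs, ys, R=999, seed=0):
    """Model-based resampling: keep the design x fixed, resample
    the residuals, and refit least squares on each resample."""
    rng = random.Random(seed)
    b0, b1 = ols(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    res = [y - f for y, f in zip(ys, fitted)]
    reps = []
    for _ in range(R):
        # y*_j = fitted_j + a residual drawn with replacement
        ystar = [f + res[rng.randrange(len(res))] for f in fitted]
        reps.append(ols(xs, ystar))
    return reps
```

Note the contrast with case-based resampling: the x values never change, so every bootstrap sample has the same design as the data.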

Page 12: BOOTSTRAPPING LINEAR MODELS

Case-Based Bootstrap

ORDINARY NONPARAMETRIC BOOTSTRAP

Bootstrap Statistics :

original bias std. error

t1* 2.134789 -0.0022155790 0.08708311

t2* 0.751686 0.0001295280 0.02277497

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Intervals :

Level Normal Percentile BCa

95% ( 1.966, 2.308 ) ( 1.963, 2.310 ) ( 1.974, 2.318 )

95% ( 0.7069, 0.7962 ) ( 0.7082, 0.7954 ) ( 0.7080, 0.7953 )

Calculations and Intervals on Original Scale

Page 13: BOOTSTRAPPING LINEAR MODELS

Case-Based Bootstrap

Bootstrap Distribution Plots for Intercept and Slope

[Figure: histograms and normal Q-Q plots of the bootstrap replicates t* for intercept and slope]

Page 14: BOOTSTRAPPING LINEAR MODELS

Case-Based Bootstrap

Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope

[Figure: jackknife-after-bootstrap plots; percentiles of (T* - t) against standardized jackknife values, with case numbers marked]

Page 15: BOOTSTRAPPING LINEAR MODELS

R Code for Case-Based Resampling

# Case-Based Resampling

fit.case <- function(data) coef(lm(log(data$brain)~log(data$body)))

mam.case <- function(data, i) fit.case(data[i, ])

mam.case.boot <- boot(mammals, mam.case, R = 999)

mam.case.boot

boot.ci(mam.case.boot, type=c("norm", "perc", "bca"))

boot.ci(mam.case.boot, index=2, type=c("norm", "perc", "bca"))

plot(mam.case.boot)

plot(mam.case.boot, index=2)

jack.after.boot(mam.case.boot)

jack.after.boot(mam.case.boot, index=2)

Page 16: BOOTSTRAPPING LINEAR MODELS

Model-Based Bootstrap

ORDINARY NONPARAMETRIC BOOTSTRAP

Bootstrap Statistics :

original bias std. error

t1* 2.134789 0.0049756072 0.09424796

t2* 0.751686 -0.0006573983 0.02719809

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Intervals :

Level Normal Percentile BCa

95% ( 1.945, 2.315 ) ( 1.948, 2.322 ) ( 1.941, 2.316 )

95% ( 0.6990, 0.8057 ) ( 0.6982, 0.8062 ) ( 0.6987, 0.8077 )

Calculations and Intervals on Original Scale

Page 17: BOOTSTRAPPING LINEAR MODELS

Model-Based Bootstrap

Bootstrap Distribution Plots for Intercept and Slope

[Figure: histograms and normal Q-Q plots of the bootstrap replicates t* for intercept and slope]

Page 18: BOOTSTRAPPING LINEAR MODELS

Model-Based Bootstrap

Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope

[Figure: jackknife-after-bootstrap plots; percentiles of (T* - t) against standardized jackknife values, with case numbers marked]

Page 19: BOOTSTRAPPING LINEAR MODELS

R Code for Model-Based Resampling

# Model-Based Resampling (Resample Residuals)

fit.res <- lm(brain ~ body, data=mammal)

mam.res.data <- data.frame(mammal, res=resid(fit.res), fitted=fitted(fit.res))

mam.res <- function(data, i){

d <- data

d$brain <- d$fitted + d$res[i]

coef(update(fit.res, data=d))

}

fit.res.boot <- boot(mam.res.data, mam.res, R = 999)

fit.res.boot

boot.ci(fit.res.boot, type=c("norm", "perc", "bca"))

boot.ci(fit.res.boot, index=2, type=c("norm", "perc", "bca"))

plot(fit.res.boot)

plot(fit.res.boot, index=2)


jack.after.boot(fit.res.boot)

jack.after.boot(fit.res.boot, index=2)

Page 20: BOOTSTRAPPING LINEAR MODELS

Comparisons and Discussion

Comparing:

Field              Original Model   Case-Based (random design)   Model-Based (fixed design)
Intercept (t1*)    2.13479          2.134789                     2.134789
  Std. Error       0.09604          0.08708311                   0.09424796
Slope (t2*)        0.75169          0.751686                     0.751686
  Std. Error       0.02846          0.02277497                   0.02719809

Page 21: BOOTSTRAPPING LINEAR MODELS

Case-Based vs. Model-Based

Model-based resampling enforces the assumption that errors are randomly distributed by resampling the residuals from a common distribution

If the model is not specified correctly – i.e., unmodeled nonlinearity, non-constant error variance, or outliers – these attributes do not carry over to the bootstrap samples

The effect of outliers is clear in the case-based resampling, but not in the model-based.

Page 22: BOOTSTRAPPING LINEAR MODELS

When Might Bootstrapping Fail?

Incomplete Data
- Assume that missing data are not problematic
- The bootstrap can still apply if multiple imputation is used beforehand

Dependent Data
- Resampling treats the Yj as mutually independent, so the bootstrap misstates their joint distribution

Outliers and Influential Cases
- Remove/correct obvious outliers
- Avoid letting the simulations depend on particular observations

Page 23: BOOTSTRAPPING LINEAR MODELS

Review & More Resampling

Resampling techniques are powerful tools:
-- for estimating SD from small samples
-- when the statistic does not have an easily determined SD

Bootstrapping involves:
-- taking 'new' random samples with replacement from the original data
-- calculating the bootstrap SD and statistical tests from the statistic's values over the bootstrap samples

More resampling techniques:
-- Jackknife resampling
-- Cross-validation

Page 24: BOOTSTRAPPING LINEAR MODELS

SUMMARY

Introduction to Bootstrap
Data and Modeling
Methods on Bootstrapping LM
Results and Comparisons
Issues and Discussion

Page 25: BOOTSTRAPPING LINEAR MODELS

References

Anderson, B. "Resampling and Regression." McMaster University. http://socserv.mcmaster.ca/anderson

Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and Their Application, pp. 256-273. Cambridge University Press.

Efron, B. and Gong, G. (February 1983). "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation." The American Statistician.

Holmes, S. "Introduction to the Bootstrap." Stanford University. http://wwwstat.stanford.edu/~susan/courses/s208/

Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S, 4th ed., pp. 163-165. Springer.


Page 27: BOOTSTRAPPING LINEAR MODELS

Extra Stuff…

Jackknife Resampling
- Takes new samples of the data by omitting each case individually and recalculating the statistic each time
- Each resample leaves a single observation out
- The number of jackknife samples equals the number of cases in the original sample
- Works well for robust estimators of location, but not for SD
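The leave-one-out scheme fits in a few lines of Python (an illustrative sketch, not the deck's R code; `jackknife_se` is our name):

```python
def jackknife_se(sample, statistic):
    """Leave-one-out jackknife standard error of `statistic`:
    recompute the statistic n times, each time omitting one case."""
    n = len(sample)
    reps = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean = sum(reps) / n
    # Jackknife variance uses the (n - 1)/n inflation factor
    return ((n - 1) / n * sum((t - mean) ** 2 for t in reps)) ** 0.5

# For the sample mean, the jackknife SE equals s / sqrt(n) exactly
se = jackknife_se([1, 2, 3, 4, 5], lambda s: sum(s) / len(s))
```

There is exactly one resample per case, so the method is deterministic, unlike the bootstrap.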

Cross-Validation
- Randomly splits the sample into two groups, comparing the model results from one subset to the results from the other
- The 1st subset is used to estimate a statistical model (screening/training sample)
- The findings are then tested on the 2nd subset (confirmatory/test sample)
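The split-half idea can be sketched as follows (illustrative Python, not the deck's R code; the function and the toy fit/score used in the example are ours):

```python
import random

def split_half_validate(xs, ys, fit, score, seed=0):
    """Randomly split the sample into a training half and a test half;
    fit the model on the first, score it on the second."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    return score(model, [xs[i] for i in test], [ys[i] for i in test])

# Toy example: the 'model' is just the training mean of y,
# and the score is the mean squared error on the test half
fit = lambda xs, ys: sum(ys) / len(ys)
mse = split_half_validate(
    list(range(10)), [2.0 * x for x in range(10)],
    fit, lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys))
```

Scoring on data the model never saw is what makes the second half a genuine confirmatory/test sample.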