41
Lecture 4 Logistic Regression + Residual Analysis Colin Rundel 1/29/2018 1

Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Lecture 4Logistic Regression + Residual Analysis

Colin Rundel1/29/2018

1

Page 2: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Background

Today we’ll be looking at data on the presence and absence of theshort-finned eel (Anguilla australis) at a number of sites in New Zealand.

These data come from

• Leathwick, J. R., Elith, J., Chadderton, W. L., Rowe, D. and Hastie, T. (2008),Dispersal, disturbance and the contrasting biogeographies of NewZealand’s diadromous and non-diadromous fish species. Journal ofBiogeography, 35: 1481–1497.

2

Page 3: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Species Distribution

3

Page 4: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Codebook:

• presence - presence (1) or absence (0) of Anguilla australis at thesampling location

• SegSumT - Summer air temperature (degrees C)• DSDist - Distance to coast (km)• DSMaxSlope - Maximum downstream slope (degrees)• USRainDays - days per month with rain greater than 25 mm• USSlope - average slope in the upstream catchment (degrees)• USNative - area with indigenous forest (proportion)• DSDam - Presence of known downstream obstructions, mostly dams• Method - fishing method (electric, net, spot, trap, or mixture)• LocSed - weighted average of proportional cover of bed sediment

1. mud2. sand3. fine gravel4. coarse gravel5. cobble6. boulder7. bedrock 4

Page 5: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Data

anguilla## # A tibble: 617 x 10## presence SegSumT DSDist DSMaxSlope USRainDays USSlope USNative DSDam## * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>## 1 1 18.7 133 1.15 1.15 8.30 0.340 0## 2 0 18.3 107 0.570 0.847 0.400 0 0## 3 0 16.7 167 1.72 0.210 0.400 0.220 1## 4 0 15.1 11.2 1.72 3.30 25.7 1.00 0## 5 0 12.7 42.4 2.86 0.430 9.60 0.0900 0## 6 1 18.2 94.4 3.43 0.847 20.5 0.920 0## 7 1 18.3 91.9 1.72 0.861 6.70 0.580 1## 8 1 17.1 6.80 0.520 0.620 0.700 0 0## 9 0 13.4 190 3.43 0.770 20.1 0.990 0## 10 0 13.1 224 6.84 0.290 9.80 0.980 0## # ... with 607 more rows, and 2 more variables: Method <fct>, LocSed <dbl>

5

Page 6: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

EDA

Corr:

−0.325

Corr:

−0.383

Corr:0.304

Corr:

0.0878

Corr:−0.297

Corr:

−0.0717

Corr:

−0.192

Corr:0.105

Corr:

0.341

Corr:

0.0933

Corr:

−0.318

Corr:0.133

Corr:

0.346

Corr:

0.223

Corr:0.637

Corr:

−0.319

Corr:0.0896

Corr:

0.248

Corr:

0.112

Corr:0.379

Corr:

0.396

presence SegSumT DSDist DSMaxSlopeUSRainDays USSlope USNative DSDam Method LocSed presenceSegSum

TDS

DistD

SM

axSlope

US

RainD

aysU

SS

lopeUS

NativeD

SD

amM

ethodLocS

ed

02040600204060 12.515.017.520.00100200300 0 5 101520 1 2 3 0 1020300.000.250.500.751.000204060020406002040600204060020406002040600204060 2 4 6

0100200300400500

12.515.017.520.0

0100200300400

05

101520

123

0102030

0.000.250.500.751.00

0100200300400

0100200300400

01002003004000100200300400010020030040001002003004000100200300400

246

6

Page 7: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

EDA (part 1)

Corr:

−0.325

Corr:

−0.383

Corr:

0.304

Corr:

0.0878

Corr:

−0.297

Corr:−0.0717

presence SegSumT DSDist DSMaxSlope USRainDayspresence

SegS

umT

DS

Dist

DS

MaxS

lopeU

SR

ainDays

0 204060 0 204060 12.5 15.0 17.5 20.00 100 200 300 0 5 10 15 20 1 2 3

0100200300400500

12.5

15.0

17.5

20.0

0

100

200

300

400

05

101520

1

2

3

7

Page 8: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

EDA (part 2)

Corr:

0.637

Corr:

0.379

Corr:

0.396

presence USSlope USNative DSDam Method LocSedpresence

US

Slope

US

Native

DS

Dam

Method

LocSed

0 2040600 204060 0 10 20 30 0.000.250.500.751.000 2040600 20406002040600204060020406002040600204060 2 4 6

0100200300400500

0102030

0.000.250.500.751.00

0100200300400

0100200300400

0100200300400

0100200300400

0100200300400

0100200300400

0100200300400

2

4

6

8

Page 9: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

EDA (part 1)

Corr:

−0.325

Corr:

−0.383

Corr:

0.304

Corr:

0.0878

Corr:

−0.297

Corr:−0.0717

presence SegSumT DSDist DSMaxSlope USRainDayspresence

SegS

umT

DS

Dist

DS

MaxS

lopeU

SR

ainDays

0 204060 0 204060 12.5 15.0 17.5 20.00 100 200 300 0 5 10 15 20 1 2 3

0100200300400500

12.5

15.0

17.5

20.0

0

100

200

300

400

05

101520

1

2

3

9

Page 10: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

EDA (part 3)

presence SegSumTpresence

SegS

umT

0 10 20 30 40 50 0 10 20 30 40 50 12.5 15.0 17.5 20.0

0

100

200

300

400

500

12.5

15.0

17.5

20.0

10

Page 11: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Simple Model

11

Page 12: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Model

inv_logit = function(x) 1/(1+exp(-x))

g = glm(presence~SegSumT, family=binomial, data=anguilla)summary(g)#### Call:## glm(formula = presence ~ SegSumT, family = binomial, data = anguilla)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.5755 -0.6260 -0.3452 -0.1299 3.0039#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) -16.74184 1.65897 -10.092 <2e-16 ***## SegSumT 0.90009 0.09413 9.562 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 621.91 on 616 degrees of freedom## Residual deviance: 479.39 on 615 degrees of freedom## AIC: 483.39#### Number of Fisher Scoring iterations: 6

12

Page 13: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Fit

d_g = anguilla %>%mutate(p_hat = predict(g, anguilla, type=”response”))

d_g_pred = data.frame(SegSumT = seq(11,25,by=0.1)) %>%modelr::add_predictions(g,”p_hat”) %>%mutate(p_hat = inv_logit(p_hat))

0.0

0.4

0.8

10 15 20 25

SegSumT

pres

ence

13

Page 14: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Separation

ggplot(d_g, aes(x=p_hat, y=presence, color=as.factor(presence))) +geom_jitter(height=0.1, alpha=0.5) +labs(color=”presence”)

0.0

0.4

0.8

0.0 0.2 0.4 0.6

p_hat

pres

ence presence

0

1

14

Page 15: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Residuals

d_g = d_g %>% mutate(resid = presence - p_hat)

ggplot(d_g, aes(x=SegSumT, y=resid)) +geom_point(alpha=0.1)

−0.5

0.0

0.5

1.0

12.5 15.0 17.5 20.0

SegSumT

resi

d

15

Page 16: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Binned Residuals

d_g %>%mutate(bin = p_hat - (p_hat %% 0.05)) %>%group_by(bin) %>%summarize(resid_mean = mean(resid)) %>%ggplot(aes(y=resid_mean, x=bin)) +

geom_point()

−0.4

−0.3

−0.2

−0.1

0.0

0.1

0.0 0.2 0.4 0.6

bin

resi

d_m

ean

16

Page 17: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Pearson Residuals

𝑟𝑖 = 𝑌𝑖 − 𝐸(𝑌𝑖)𝑉 𝑎𝑟(𝑌𝑖)

= 𝑌𝑖 − ̂𝑝𝑖̂𝑝𝑖(1 − ̂𝑝𝑖)

d_g = d_g %>% mutate(pearson = (presence - p_hat) / (p_hat * (1-p_hat)))

0

25

50

75

0.0 0.2 0.4 0.6

p_hat

pear

son

17

Page 18: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Binned Pearson Residuals

−1.5

−1.0

−0.5

0.0

0.5

0.0 0.2 0.4 0.6

bin

pear

son_

mea

n

18

Page 19: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Deviance Residuals

𝑑𝑖 = sign(𝑌𝑖 − ̂𝑝𝑖)√−2 (𝑌𝑖 log ̂𝑝𝑖 + (1 − 𝑌𝑖) log(1 − ̂𝑝𝑖))d_g = d_g %>%

mutate(deviance = sign(presence - p_hat) *sqrt(-2 * (presence*log(p_hat) + (1-presence)*log(1 - p_hat) )))

−1

0

1

2

3

0.0 0.2 0.4 0.6

p_hat

devi

ance

19

Page 20: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Binned Deviance Residuals

−0.75

−0.50

−0.25

0.00

0.25

0.0 0.2 0.4 0.6

bin

devi

ance

_mea

n

20

Page 21: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Checking Deviance

sum(d_g$deviance^2)## [1] 479.3914

glm(presence~SegSumT, family=binomial, data=anguilla)#### Call: glm(formula = presence ~ SegSumT, family = binomial, data = anguilla)#### Coefficients:## (Intercept) SegSumT## -16.7418 0.9001#### Degrees of Freedom: 616 Total (i.e. Null); 615 Residual## Null Deviance: 621.9## Residual Deviance: 479.4 AIC: 483.4

21

Page 22: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

All together

deviance

pearson

resid

0.0 0.2 0.4 0.6

−0.4

−0.3

−0.2

−0.1

0.0

0.1

−1.5

−1.0

−0.5

0.0

0.5

−0.75

−0.50

−0.25

0.00

0.25

bin

mea

n

type

resid

pearson

deviance

22

Page 23: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Full Model

23

Page 24: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Model

f = glm(presence~., family=binomial, data=anguilla)summary(f)#### Call:## glm(formula = presence ~ ., family = binomial, data = anguilla)#### Deviance Residuals:## Min 1Q Median 3Q Max## -2.10254 -0.53092 -0.27156 -0.08821 3.12463#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) -11.554287 1.872102 -6.172 6.75e-10 ***## SegSumT 0.765864 0.103173 7.423 1.14e-13 ***## DSDist -0.002551 0.002103 -1.213 0.22523## DSMaxSlope -0.062525 0.063093 -0.991 0.32169## USRainDays -0.619025 0.227316 -2.723 0.00647 **## USSlope -0.041399 0.024657 -1.679 0.09315 .## USNative -0.607045 0.475456 -1.277 0.20169## DSDam -0.922073 0.483492 -1.907 0.05651 .## Methodmixture -0.231175 0.498189 -0.464 0.64263## Methodnet -1.229762 0.534845 -2.299 0.02149 *## Methodspo -1.493876 0.733468 -2.037 0.04168 *## Methodtrap -2.476408 0.628486 -3.940 8.14e-05 ***## LocSed -0.175944 0.098204 -1.792 0.07319 .## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 621.91 on 616 degrees of freedom## Residual deviance: 420.18 on 604 degrees of freedom## AIC: 446.18#### Number of Fisher Scoring iterations: 6

24

Page 25: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Separation

0.0

0.4

0.8

0.0 0.2 0.4 0.6

p_hat

pres

ence presence

0

1

SegSumT Model

0.0

0.4

0.8

0.00 0.25 0.50 0.75

p_hat

pres

ence presence

0

1

Full Model

25

Page 26: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Residuals vs fitted

resid

pearson

deviance

0.00 0.25 0.50 0.75

−1.0

−0.5

0.0

0.5

−3

−2

−1

0

1

−0.4

−0.2

0.0

0.2

bin

mea

n

type

deviance

pearson

resid

26

Page 27: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Model Performance

27

Page 28: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Confusion Tables

0.0

0.4

0.8

0.00 0.25 0.50 0.75

p_hat

pres

ence presence

0

1

Full Model

28

Page 29: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Predictive Performance (ROC / AUC)

AUC = c(0.878, 0.824)

0.00

0.10

0.25

0.50

0.75

0.90

1.00

0.00 0.10 0.25 0.50 0.75 0.90 1.00

False positive fraction

True

pos

itive

frac

tion

model

Full Model

SegSumT Model

29

Page 30: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Out of sample predictive performance

AUC = c(0.824, 0.748, 0.878, 0.828)

0.00

0.10

0.25

0.50

0.75

0.90

1.00

0.00 0.10 0.25 0.50 0.75 0.90 1.00

False positive fraction

True

pos

itive

frac

tion

type

Train

Test

model

SegSumT Model

Full Model

30

Page 31: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

What about something non-parametric?

31

Page 32: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Gradient Boosting Model

y = anguilla$presence %>% as.integer()x = model.matrix(presence~.-1, data=anguilla)x_test = model.matrix(presence~.-1, data=anguilla_test)

xg = xgboost::xgboost(data=x, label=y, nthead=4, nround=25,objective=”binary:logistic”, verbose = FALSE)

MethodspoMethodnetDSDamMethodmixtureMethodtrapMethodelectricDSMaxSlopeUSSlopeUSNativeUSRainDaysLocSedDSDistSegSumT

Relative importance

0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0

32

Page 33: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Residuals?

resid pearson deviance

0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75

−1.0

−0.5

0.0

0.5

1.0

−1

0

1

−0.25

0.00

0.25

bin

mea

n

type

resid

pearson

deviance

33

Page 34: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Separation?

0.0

0.4

0.8

0.00 0.25 0.50 0.75 1.00

p_hat

pres

ence presence

0

1

34

Page 35: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Effect of nround - Training Data

0.0

0.4

0.8

0.2 0.4 0.6 0.8

p_hat

pres

ence presence

0

1

XGBoost − 5 rounds − Training Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75

p_hat

pres

ence presence

0

1

XGBoost − 10 rounds − Training Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75 1.00

p_hat

pres

ence presence

0

1

XGBoost − 15 rounds − Training Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75 1.00

p_hat

pres

ence presence

0

1

XGBoost − 25 rounds − Training Data

35

Page 36: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Effect of nround - Test Data

0.0

0.4

0.8

0.2 0.4 0.6 0.8

p_hat

pres

ence presence

0

1

XGBoost − 5 rounds − Test Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75

p_hat

pres

ence presence

0

1

XGBoost − 10 rounds − Test Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75

p_hat

pres

ence presence

0

1

XGBoost − 15 rounds − Test Data

0.0

0.4

0.8

0.00 0.25 0.50 0.75 1.00

p_hat

pres

ence presence

0

1

XGBoost − 25 rounds − Test Data

36

Page 37: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

ROC Curves

0.00

0.10

0.25

0.50

0.75

0.90

1.00

0.00 0.10 0.25 0.50 0.75 0.90 1.00

False positive fraction

True

pos

itive

frac

tion

type

Train

Test

model

SegSumT Model

Full Model

XGBoost − 5 rounds

XGBoost − 10 rounds

XGBoost − 15 rounds

XGBoost − 25 rounds

37

Page 38: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Aside: Species Distribution Modeling

38

Page 39: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Model Choice

We have been fitting a model that looks like the following,

𝑦𝑖 ∼ Bern(𝑝𝑖)logit(𝑝𝑖) = 𝐗𝑖⋅𝜷

Interpretation of 𝑦𝑖 and 𝑝𝑖?

39

Page 40: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Absence of evidence …

If we observe a species at a particular location what does that tell us?

If we don’t observe a species at a particular location what does that tell us?

40

Page 41: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0

Revised Model

If we allow for crypsis, then

𝑦𝑖 ∼ Bern(𝑞𝑖 𝑧𝑖)𝑧𝑖 ∼ Bern(𝑝𝑖)

logit(𝑞𝑖) = 𝐗𝑖⋅𝜸logit(𝑝𝑖) = 𝐗𝑖⋅𝜷

Interpretation of 𝑦𝑖, 𝑧𝑖, 𝑝𝑖, and 𝑞𝑖?

41