Assignment #8 - Department of Zoology, UBCmfscott/lectures/17_Regression.pdfAssignment #8 Chapter...

Preview:

Citation preview

Assignment #8

Chapter 14: 26 Chapter 15: 18, 27 Due next Friday Nov. 27th by 2pm in your TA’s homework box

Assignment #9

Chapter 16: 20 Chapter 17: 33 Not Due! Just for practice. Answers will be posted on Friday Dec. 4th

Reading

For Today: Chapter 17 For Thursday: Chapter 17

Lab Report •  Posted on web-site •  Dates

–  Rough draft due to TAs homework box Monday Nov. 16th –  Rough draft returned in your registered lab section this week –  Final draft due at start of your registered lab section next week

à MUST HAND IN ROUGH DRAFT WITH FINAL DRAFT (penalty -10 points)

•  10% of course grade –  Rough Draft - 5% –  Final draft - 5% –  If you’re happy with your rough draft mark, you can tell your TA to use it for

the final draft à YOU MUST TELL YOUR TA

•  Read the “Writing a Lab Report” section of your lab notebook for guidance!!

Lab Report •  Posted on web-site •  Dates

–  Rough draft due to TAs homework box Monday Nov. 16th –  Rough draft returned in your registered lab section this week –  Final draft due at start of your registered lab section next week

à MUST HAND IN ROUGH DRAFT WITH FINAL DRAFT (penalty -10 points)

•  10% of course grade –  Rough Draft - 5% –  Final draft - 5% –  If you’re happy with your rough draft mark, you can tell your TA to use it for

the final draft à YOU MUST TELL YOUR TA

•  Read the “Writing a Lab Report” section of your lab notebook for guidance!!

Lab Report •  Posted on web-site •  Dates

–  Rough draft due to TAs homework box Monday Nov. 16th –  Rough draft returned in your registered lab section this week –  Final draft due at start of your registered lab section next week

à MUST HAND IN ROUGH DRAFT WITH FINAL DRAFT (penalty -10 points)

•  10% of course grade –  Rough Draft - 5% –  Final draft - 5% –  If you’re happy with your rough draft mark, you can tell your TA to use it for

the final draft à YOU MUST TELL YOUR TA

•  Read the “Writing a Lab Report” section of your lab notebook for guidance!!

Chapter 16 Review

Correlation: r

•  r is called the “correlation coefficient”

•  Describes the relationship between two numerical variables

•  Parameter: ρ (rho) Estimate: r

•  -1 < ρ < 1 -1 < r < 1

Estimating the correlation coefficient

r =

Xi − X ( )∑ Yi − Y ( )

Xi − X ( )2∑ Yi − Y ( )2∑

“Sum of products”

“Sum of squares”

Standard error of r

SEr =1− r 2

n − 2

If ρ = 0,...

t =rSEr

r is normally distributed with mean 0

Therefore, we test a null hypothesis of no correlation using:

with df = n -2

Hypotheses

H0: X and Y are not correlated (ρ = 0). HA: X and Y are correlated (ρ ≠ 0).

Correlation assumes...

• Random sample

• X is normally distributed with equal variance for all values of Y

• Y is normally distributed with equal variance for all values of X

Bivariate Normal Distribution

•  The relationship between X and Y is linear

•  The cloud of points in a scatter plot of X and Y has a circular or elliptical shape •  The frequency distribution of X and Y separately are normal

Most Frequent departures from bivariate normal distribution

Chapter 16 Continued: Correlation between numerical variables

Spearman's rank correlation

•  An alternative to correlation that does not make so many assumptions

Example: Spearman's rs VERSIONS: 1. Boy climbs up rope, climbs down again 2. Boy climbs up rope, seems to vanish, re-appears at top, climbs down again 3. Boy climbs up rope, seems to vanish at top 4. Boy climbs up rope, vanishes at top, reappears somewhere the audience was not looking 5. Boy climbs up rope, vanishes at top, reappears in a place which has been in full view

Example: Spearman's rs

Hypotheses H0: The difficulty of the described trick is not correlated with the time elapsed since it was observed. HA: The difficulty of the described trick is correlated with the time elapsed since it was observed.

Years Elapsed Rank Years Impressiveness Score Rank Impressiveness

2 1 1 2

5 3.5 1 2

5 3.5 1 2

4 2 2 5

17 5.5 2 5

17 5.5 2 5

31 13 3 7

20 7 4 12.5

22 8 4 12.5

25 9 4 12.5

28 10.5 4 12.5

29 12 4 12.5

34 14.5 4 12.5

43 17 4 12.5

44 18 4 12.5

46 19 4 12.5

34 14.5 4 12.5

28 10.5 5 19.5

39 16 5 19.5

50 20.5 5 19.5

50 20.5 5 19.5

Finding rs

Ri − R( ) Si − S( )i=1

n

∑ = RiSi∑#

$%

&

'(−

Ri Si∑∑n

= 566

Ri − R( )2

i=1

n

∑ = Ri2( )∑ −

Ri∑#

$%

&

'(

2

n= 767.5

Si − S( )2

i=1

n

∑ = Si2( )∑ −

Si∑#

$%

&

'(

2

n= 678.5

rS =566

767.5( ) 678.5( )= 0.784

rS(0.05,21)=0.434 rS(0.01,21)=0.550 Since rS=0.784 is greater than 0.550, P<0.01 We reject the null hypothesis There is a positive correlation between the impressiveness score and number of years elapsed

Spearman’s rank correlation for n >100

SE[rS ]=1− rS

2

n− 2

t = rSSE[rS ]

df = n− 2

Attenuation: The estimated correlation will be lower

if X or Y are estimated with error

Real correlation

Y estimated with measurement

error

X and Y estimated with measurement

error

Correlation depends on range

Chapter 17: Regression

Regression

•  Predicts Y from X

•  Linear regression assumes that the relationship between X and Y can be described by a line

Correlation vs. regression

Regression assumes... •  Random sample

•  Y is normally distributed with equal variance for all values of X

The least squares regression line is the line for which the sum of all the squared

deviations in Y is smallest

The parameters of linear regression

Y = α + β X

Intercept Slope

Positive β

Negative β

β = 0

Higher α

Lower α

Estimating a regression line

Y = a + b X

Nomenclature

Residual:

Yi − ˆ Y i

Predicted Value:

Yi

Data Point:

Xi,Yi

Finding the "least squares" regression line

SSresidual = Yi − ˆ Y i( )2

i =1

n

∑Minimize:

Best estimate of the slope

b =

Xi − X ( ) Yi − Y ( )i =1

n

Xi − X ( )2i =1

n

(= "Sum of products" over "Sum of squares of X")

Remember the shortcuts:

Xi − X ( ) Yi − Y ( )i =1

n

∑ = XiYi∑$

% & &

'

( ) ) −

Xi Yi∑∑

n

Xi − X ( )2i =1

n

∑ = Xi2( )∑ −

Xi∑$

% & &

'

( ) )

2

n

Finding a

Y = a + bX So..

a = Y − bX

Example: Predicting age based on radioactivity in teeth

Many above ground nuclear bomb tests in the ‘50s and ‘60s may have left a radioactive signal in developing teeth. Is it possible to predict a person’s age based on dental 14C?

Data from 1965 to present from Spalding et al. 2005. Forensics: age written in teeth by nuclear tests. Nature 437: 333–334.

Teeth data:

Δ14C Date of Birth

622 1963.5

262 1971.7

471 1963.7

112 1990.5

285 1975

439 1970.2

363 1972.6

391 1971.8

Δ14C Date of Birth

89 1985.5

109 1983.5

91 1990.5

127 1987.5

99 1990.5

110 1984.5

123 1983.5

105 1989.5

Teeth data:

X = 3798, Y∑∑ = 31674

X 2 =1340776, XY( )∑∑ = 7495223

Y 2∑ = 62704042

n =16

X = 237.375 Y =1979.63

Let X be the Δ14C, and Y be the year of birth.

Xi − X( ) Yi −Y( )i=1

n

∑ = XiYi∑#

$%

&

'(−

Xi Yi∑∑n

= 7495223−3798( ) 31674( )

16= −23393

Xi − X( )2

i=1

n

∑ = Xi2( )∑ −

Xi∑#

$%

&

'(

2

n

=1340776−3798( )2

16= 439226

b = −23393439226

= −0.053

Calculating a

a =Y − bX=1979.63− −0.053( )237.375=1992.2

Predicted Values The predicted value of Y from a regression line (Y hat)

estimates the the mean value of Y for all individuals having a given value of X

YX1

YX2

YX3

YX4

Y =1992.2− 0.053X

Predicting Y from X

Y =1992.2− 0.053X=1992.2− 0.053 200( )=1981.6

If a cadaver has a tooth with Δ14C content equal to 200, what does the regression line predict its year of birth to be?

Testing hypotheses about regression

H0: β = 0 HA: β ≠ 0

b has a t distribution

Confidence interval for a slope:

b ± tα[2],df SEb

Hypothesis tests can use t:

t =b − β0SEb

Standard error of a slope

SEb =MSresidual

Xi − X( )2∑

MSresidual = SSresidual / dfresidual

Sums of squares for regression

SSTotal = Yi2∑ −

Yi∑$

% &

'

( )

2

n

SSregression = b Xi − X ( )∑ Yi −Y ( )

SSresidual + SSregression = SSTotal

With n - 2 degrees of freedom for the residual

Radioactive teeth: Sums of squares

SSTotal = Yi2∑ −

Yi∑#

$%

&

'(

2

n

= 62704042−31674( )2

16=1339.75

SSregression = b Xi − X( )∑ Yi −Y( )

= −0.053( ) −23393( ) =1239.8

Teeth: Sums of squares

SSresidual = SSTotal − SSregression =1339.75−1239.8 = 99.9dfresidual =16− 2 =14

Calculating residual mean squares

MSresidual = SSresidual / dfresidual

MSresidual =99.914

= 7.1

Standard error of a slope

SEb =MSresidual

Xi − X( )2∑

= 7.1439226

= 0.004

b has a t distribution

Confidence interval for a slope:

b ± tα[2],df SEb

Hypothesis tests can use t:

t =b − β0SEb

Example: 95% confidence interval for slope with teeth

example

b± tα[2],df SEb = b± t0.05[2],14SEb

= −0.053± 2.14 0.004( )= −0.053± 0.0018

Confidence bands: confidence intervals for predictions of

mean Y for given X

Prediction intervals: confidence intervals for predictions of

individual Y for given X

Hypothesis tests on slopes

H0: β = 0 HA: β ≠ 0

t =b − β0SEb

t = −0.053− 00.004

=13.25

t0.0001(2),14= ±5.36

So we can reject H0, P<0.0001

r2 predicts the amount of variance in Y explained by the

regression line

r2 is the “coefficient of determination: it is the square of the correlation coefficient r

Recommended