MATH 423/533 - ASSIGNMENT 2 SOLUTIONS
The following data give the average public teacher annual
salary in dollars, recorded in the data frame salary as the
variable SALARY, and spending (SPENDING) per pupil (in thousands of
dollars) on public schools in 1985 in the 50 US states and the
District of Columbia.
The objective of the analysis is to understand whether there is a
relationship between teacher pay, y, and per-pupil spending, x. An
analysis in R is presented below: some of the output has been
deleted and replaced by XXXXX.
 1 > salary<-read.csv('salary.csv',header=TRUE)
 2 > x1<-salary$SPENDING/1000
 3 > y<-salary$SALARY
 4 > fit.Salary<-lm(y~x1);summary(fit.Salary)
 5
 6 Call:
 7 lm(formula = y ~ x1)
 8
 9 Residuals:
10     Min      1Q  Median      3Q     Max
11 -3848.0 -1844.6  -217.5  1660.0  5529.3
12
13 Coefficients:
14             Estimate Std. Error t value Pr(>|t|)
15 (Intercept)  12129.4      XXXXX   10.13 1.31e-13 ***
16 x1            3307.6      311.7   10.61 2.71e-14 ***
17 ---
18 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
19
20 Residual standard error: XXXXX on 49 degrees of freedom
21 Multiple R-squared: 0.6968, Adjusted R-squared: 0.6906
22 F-statistic: XXXXX on 1 and 49 DF,  p-value: 2.707e-14
In answering the following questions, you may not use the lm
function or its result on these data (or the functions coef(),
residuals() etc.), but instead should use vector and matrix
calculations.
(a) Write R code to verify the calculation of the entries in the
Estimate column, and show that your code produces the correct
results. 2 Marks
(b) Write R code to compute the value of the omitted entry for the
Residual standard error on line 20. 2 Marks
(c) Compute the value of the entry in the Std. Error column on line
15 first using entries already given in the table, and then using
the data directly. 2 Marks
(d) The entry for Multiple R-squared on line 21 is computed using
the formula

    R^2 = SS_R / SS_T

where SS_R is the ‘regression sum-of-squares’ and SS_T is the ‘total
sum of squares’ as defined in lectures. Write R code to verify the
calculation of R^2.
2 Marks
(e) Prove for a simple linear regression that, in the notation from
lectures,

    SS_R = β̂_1 S_xy

and show this result holds numerically for the salary data.
2 Marks
MATH 423/533 ASSIGNMENT 2 Solutions Page 1 of 15
(f) The F-statistic on line 22 is computed using the
sums-of-squares decomposition

    SS_T = SS_Res + SS_R,    F = (SS_R/(p − 1)) / (SS_Res/(n − p))

where here p = 2 for simple linear regression. Write R code to
compute the omitted value for F.
2 Marks
(g) In the notation from lectures, we have that the sums-of-squares
decomposition can be written

    y^T(I_n − H_1)y = y^T(I_n − H)y + y^T(H − H_1)y.

Show, mathematically and numerically, that

    trace(I_n − H_1) = n − 1,    trace(H − H_1) = p − 1

for this example, where p = 2 for simple linear regression.
2 Marks
(h) Using residual plots, assess the validity of the assumptions
underlying the least squares analysis. Verify numerically the
orthogonality results concerning the residuals, that is, in vector
form

    1_n^T e = 0,    X^T e = 0_p,    ŷ^T e = 0.

2 Marks
(i) Using the fitted model, predict what the average public teacher
annual salary would be in a state where the spending per pupil is
$4800.
1 Mark
(j) The prediction at an arbitrary new x value, x_1^new, can be
written in terms of the estimates β̂ as

    ŷ^new = x^new β̂ = [1  x_1^new] β̂ = β̂_0 + β̂_1 x_1^new

with β̂ the least squares estimate. Compute the estimated standard
prediction error for ŷ^new, that is, the square root of the
estimated variance of the corresponding random variable

    Y^new = x^new β̂ = [1  x_1^new] β̂

now with β̂ the least squares estimator, if x_1^new is $4800.
3 Marks
EXTRA QUESTION FOR STUDENTS IN MATH 533

The figure below plots the percent differences on the log scale
between successive recorded quarterly Gross Domestic Product (GDP)
values in the US between the first quarter of 1947 and the first
quarter of 2016 (277 data points).

[Figure: US GDP Growth: log scale differences, 1947-2016; percent
difference plotted against Time.]
The data may be read in from the file US-GDP.txt as follows. For
regression purposes, we define the predictor x1 by considering time
(in quarters) since Q1, 1947.
y0<-scan('US-GDP.txt')
y<-100*log(y0[-1]/y0[-278])
x1<-c(1:277)
Is there any statistical evidence that there is a ‘changepoint’ in
the GDP series at the year 1980 (when x1 = 133), that is, that the
relationship between y and x1 prior to Q1 1980 is different from
the relationship after that time? Investigate this possibility
using straight line regression modelling (not a single simple
linear regression), and report the result of an appropriate
hypothesis test.
5 Marks
SOLUTION
The results to verify are given in the output below:
salary<-read.csv('salary.csv',header=TRUE)
x1<-salary$SPENDING/1000
y<-salary$SALARY
fit.Salary<-lm(y~x1);summary(fit.Salary)
+ Call:
+ lm(formula = y ~ x1)
+
+ Residuals:
+     Min      1Q  Median      3Q     Max
+ -3848.0 -1844.6  -217.5  1660.0  5529.3
+
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)
+ (Intercept)  12129.4     1197.4   10.13 1.31e-13 ***
+ x1            3307.6      311.7   10.61 2.71e-14 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 2325 on 49 degrees of freedom
+ Multiple R-squared: 0.6968, Adjusted R-squared: 0.6906
+ F-statistic: 112.6 on 1 and 49 DF,  p-value: 2.707e-14
The values here are rounded quite aggressively; to show more digits:
print(summary(fit.Salary),digits=8)
+ Call:
+ lm(formula = y ~ x1)
+
+ Residuals:
+         Min          1Q      Median          3Q         Max
+ -3847.97573 -1844.55654  -217.51923  1659.97327  5529.34250
+
+ Coefficients:
+                Estimate  Std. Error  t value   Pr(>|t|)
+ (Intercept) 12129.37102  1197.35080 10.13017 1.3081e-13 ***
+ x1           3307.58500   311.70427 10.61129 2.7069e-14 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 2324.779 on 49 degrees of freedom
+ Multiple R-squared: 0.69678128, Adjusted R-squared: 0.69059314
+ F-statistic: 112.59952 on 1 and 49 DF,  p-value: 2.7068708e-14
In the following calculations, as much precision is carried forward
as possible, with no rounding of values in the calculation
(although there is some rounding in printing).
(a) To compute the estimate, use the formula for the least squares
solution
X<-cbind(1,x1)
XTX<-t(X)%*%X
beta.hat<-solve(XTX,t(X)%*%y)
print(beta.hat)

+         [,1]
+    12129.371
+ x1  3307.585
The estimates are therefore (12129.371, 3307.585) for β̂_0 and β̂_1
respectively, matching the coefficients reported in the summary
output.
2 Marks
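As a cross-check on the normal-equations arithmetic, the closed-form estimates β̂_1 = S_xy/S_xx and β̂_0 = ȳ − β̂_1 x̄ can be sketched in pure Python. The x and y values below are synthetic placeholders (salary.csv is not reproduced in this document), so only the method, not the numbers, matches the R output above.

```python
# Least squares estimates for simple linear regression from first
# principles: beta1_hat = S_xy / S_xx, beta0_hat = ybar - beta1_hat * xbar.
# Synthetic data for illustration only (not the salary data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

beta1_hat = S_xy / S_xx          # slope estimate
beta0_hat = ybar - beta1_hat * xbar  # intercept estimate
print(beta0_hat, beta1_hat)
```

The same two numbers are what `solve(XTX, t(X) %*% y)` returns in the R solution, since for p = 2 the normal equations reduce to these closed forms.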
(b) The residual standard error is σ̂ = sqrt(SS_Res/(n − p)) with
p = 2; the code below also stores the residual vector, which is
reused in later parts.

residual.vec<-y-X %*% beta.hat
sigma.hat<-sqrt(sum(residual.vec^2)/(length(y)-2))
print(sigma.hat)

+ [1] 2324.779

Therefore we have σ̂ = 2324.7789, which agrees with the value
2324.779 reported in the output. 2 Marks
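The same σ̂ = sqrt(SS_Res/(n − p)) calculation can be sketched from first principles in pure Python (again on illustrative synthetic data, not the salary data):

```python
import math

# Residual standard error sigma_hat = sqrt(SS_Res / (n - p)), p = 2.
# Synthetic data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 2
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

# Residuals and their sum of squares.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
SS_Res = sum(e ** 2 for e in resid)
sigma_hat = math.sqrt(SS_Res / (n - p))
print(sigma_hat)
```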
(c) From the table, we have simply that t_0 = β̂_0/e.s.e.(β̂_0), so
that

    e.s.e.(β̂_0) = β̂_0/t_0 = 1.2129371 × 10^4 / 10.1301732 = 1197.3508043
that is, that the missing standard error value is 1197.3508. From
first principles, we have
(estimated.covariance<-sigma.hat^2 * solve(XTX))
(ese.vals<-sqrt(diag(estimated.covariance)))

+                   x1
+ 1197.3508  311.7043
that is, that the missing standard error value is 1197.3508 as
before. 2 Marks
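The first-principles route σ̂²(XᵀX)⁻¹ only needs a 2×2 inverse here, which has a simple closed form; a pure-Python sketch (synthetic data, as before):

```python
import math

# Estimated standard errors from sigma_hat^2 * (X^T X)^{-1}, using the
# closed-form (adjugate / determinant) inverse of the 2x2 matrix X^T X.
# Synthetic data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 2
sx, sxx = sum(x), sum(xi * xi for xi in x)

# X^T X = [[n, sx], [sx, sxx]]
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

# Fit and residual variance as in parts (a)-(b).
xbar, ybar = sx / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / S_xx
b0 = ybar - b1 * xbar
SS_Res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = SS_Res / (n - p)

# Standard errors are the square roots of the diagonal entries.
ese = [math.sqrt(sigma2_hat * inv[0][0]), math.sqrt(sigma2_hat * inv[1][1])]
print(ese)
```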
(d) Here we have

SS.T<-sum((y-mean(y))^2)
SS.Res<-sum(residual.vec^2)
SS.R<-SS.T-SS.Res
SS.R/SS.T

+ [1] 0.6967813

and R^2 is confirmed as 0.69678128. Here we have that

    SS_T = 873380264.627,  SS_Res = 264825249.995,  SS_R = 608555014.633.
To confirm the calculation using the formula

    y^T(I_n − H_1)y = y^T(I_n − H)y + y^T(H − H_1)y,

we have the alternate code

H<-X %*% (solve(XTX) %*% t(X))
One<-cbind(rep(1,length(y)))
H1<-(One %*% t(One))/length(y)
SSQ.T<-t(y) %*% (diag(1,length(y))-H1) %*% y
SSQ.Res<-t(y) %*% (diag(1,length(y))-H) %*% y
SSQ.R<-t(y) %*% (H-H1) %*% y
print(c(SSQ.T,SSQ.Res,SSQ.R),digits=12)

+ [1] 873380264.627 264825249.995 608555014.633
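The decomposition SS_T = SS_Res + SS_R and the ratio R² = SS_R/SS_T can be checked from first principles without any matrix algebra; a pure-Python sketch (synthetic data, not the salary data):

```python
# R^2 = SS_R / SS_T, and the check SS_T = SS_R + SS_Res.
# Synthetic data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
SS_T = sum((yi - ybar) ** 2 for yi in y)        # total sum of squares
SS_Res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual SS
SS_R = sum((fi - ybar) ** 2 for fi in fitted)   # regression SS
R2 = SS_R / SS_T
print(R2)
```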
(e) By definition,

    SS_R = Σ_{i=1}^n (ŷ_i − ȳ)² = Σ_{i=1}^n (β̂_0 + β̂_1 x_i1 − ȳ)²

but, from an earlier result, we know that the fitted straight line
passes through the point (x̄_1, ȳ), that is, we know that

    ȳ = β̂_0 + β̂_1 x̄_1.

Therefore

    SS_R = Σ_{i=1}^n (β̂_1 x_i1 − β̂_1 x̄_1)² = β̂_1² Σ_{i=1}^n (x_i1 − x̄_1)² = β̂_1² S_xx.

But also from a previous result, β̂_1 = S_xy/S_xx, so therefore

    SS_R = β̂_1 (β̂_1 S_xx) = β̂_1 S_xy.

Numerically,

(S.xy<-sum((y-mean(y))*(x1-mean(x1))))

+ [1] 183987.7

and β̂_1 S_xy = 3307.585 × 183987.7 ≈ 6.0855 × 10^8, agreeing with
SS_R = 608555014.633 up to rounding of the printed values.
2 Marks
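The identity SS_R = β̂_1 S_xy proved in (e) holds for any data set; a pure-Python numerical check on synthetic data:

```python
# Numerical check of the part (e) identity SS_R = beta1_hat * S_xy.
# Synthetic data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
SS_R = sum((fi - ybar) ** 2 for fi in fitted)
print(SS_R, b1 * S_xy)  # the two numbers agree to rounding error
```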
(f) From the previous computation, the F statistic is confirmed as
112.5995:

Fstat<-(SS.R/(2-1))/(SS.Res/(length(y)-2))
Fstat

+ [1] 112.5995
2 Marks
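The trace identities in part (g) follow because H and H_1 are projection matrices: trace(H) equals the rank p of X, and trace(H_1) = 1, so trace(I_n − H_1) = n − 1 and trace(H − H_1) = p − 1. A pure-Python sketch of the numerical check, using the fact that the diagonal entries of H are the leverages h_ii = [1, x_i](XᵀX)⁻¹[1, x_i]ᵀ (synthetic x values; any 2-column design works):

```python
# Trace check: trace(I_n - H1) = n - 1 and trace(H - H1) = p - 1,
# computed without forming the full n x n matrices.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n, p = len(x), 2
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det], [-sx / det, n / det]]  # (X^T X)^{-1}

# Leverages h_ii = [1, x_i] inv [1, x_i]^T: the diagonal of H.
h_diag = [inv[0][0] + 2 * inv[0][1] * xi + inv[1][1] * xi * xi for xi in x]

trace_I_minus_H1 = n - sum(1.0 / n for _ in range(n))  # n - trace(H1) = n - 1
trace_H_minus_H1 = sum(h_diag) - 1.0                   # trace(H) - trace(H1) = p - 1
print(trace_I_minus_H1, trace_H_minus_H1)
```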
(h) Here is a plot of the fitted line
par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75);abline(coef(fit.Salary),col='red')
[Figure: scatterplot of y against x1 with the fitted line overlaid.]
which seems to do a good job in reflecting the relationship. For
the residual plots versus the predictor and versus the fitted
values:
par(mar=c(4,4,0,0))
plot(x1,residual.vec,pch=19,cex=0.75,ylim=range(-6000,6000));abline(h=0,lty=2)
[Figures: residuals plotted against x1 and against the fitted
values; no systematic pattern is apparent in either plot.]
and we can conclude that there is no evidence to suggest that the
assumptions concerning the model and residual errors are invalid.
To demonstrate the orthogonality, there are different ways to
compute:
#Result 1
sum(residual.vec) #Summation

+ [1] 1.599318e-09

+ [1] 3.29758e-06

+      [,1]
+ [1,] 3.311783e-06
The last result is several orders of magnitude away from the
others, although still very close to the required zero value. This
is in part due to the fitted and residual values being stored in an
earlier calculation. For example, recomputing using differently
stored objects, we have
#Result 3 recomputed
sum((cbind(1,x1) %*% coef(fit.Salary))*(y-(cbind(1,x1) %*% coef(fit.Salary))))

+ [1] 1.30618e-07
2 Marks
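The three orthogonality results can also be checked from first principles in pure Python (synthetic data; e denotes the residual vector):

```python
# Orthogonality of residuals: 1^T e = 0, X^T e = 0, yhat^T e = 0.
# Synthetic data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]

dot_one = sum(e)                                      # 1^T e
dot_x = sum(xi * ei for xi, ei in zip(x, e))          # second entry of X^T e
dot_fit = sum(fi * ei for fi, ei in zip(fitted, e))   # yhat^T e
print(dot_one, dot_x, dot_fit)  # all zero up to floating-point error
```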
(i) Recalling that in the original data, the predictors are
recorded in thousands of dollars, we have that the prediction is
computed using the fitted straight line as

format(as.numeric(c(1,4.8) %*% coef(fit.Salary)),digits=8)

+ [1] "28005.779"

that is, the predicted average salary is $28005.78.
(j) As we may write the predicted value Y^new as a linear
combination of the estimator vector elements,

    Y^new = x^new β̂ = [1  x_1^new] β̂,

we can use standard results from lectures to deduce that

    Var[Y^new | X, x^new] = x^new Var[β̂ | X] {x^new}^T = x^new {σ²(X^T X)^{-1}} {x^new}^T

which would then be estimated by replacing σ² by its estimate. That
is, for the problem at hand
x1new<-matrix(c(1,4.8),nrow=1)
pred.var<-sigma.hat^2*x1new %*% solve(XTX) %*% t(x1new)
pred.var

+          [,1]
+ [1,] 224261.7

sqrt(pred.var)

+          [,1]
+ [1,] 473.5628
so that the required estimated standard prediction error is
473.5628. 3 Marks
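For p = 2 the quadratic form x^new σ̂²(XᵀX)⁻¹{x^new}ᵀ reduces to the familiar closed form σ̂²(1/n + (x_new − x̄)²/S_xx); a pure-Python sketch (synthetic data and a hypothetical x_new, not the $4800 value used above):

```python
import math

# Standard error of the fitted value at x_new via the closed form
# Var = sigma_hat^2 * (1/n + (x_new - xbar)^2 / S_xx), p = 2.
# Synthetic data and x_new for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 2
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / S_xx
b0 = ybar - b1 * xbar
sigma2_hat = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - p)

x_new = 3.5  # hypothetical new predictor value
se_fit = math.sqrt(sigma2_hat * (1.0 / n + (x_new - xbar) ** 2 / S_xx))
print(se_fit)
```

The closed form and the matrix form agree because [1, x_new](XᵀX)⁻¹[1, x_new]ᵀ expands to exactly 1/n + (x_new − x̄)²/S_xx.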
To check the results for (i) and (j) using predict, we would
write
predict(fit.Salary,newdata=data.frame(x1=4.8),se.fit=TRUE)
+ $fit
+        1
+ 28005.78
+
+ $se.fit
+ [1] 473.5628
+
+ $df
+ [1] 49
+
+ $residual.scale
+ [1] 2324.779
EXTRA QUESTION FOR STUDENTS IN MATH 533

We can proceed in (at least) three different ways:

1. Fit using a ‘flexible’ model as in one of the earlier Extra Hour
classes, assuming that there are two discontinuous straight lines
either side of x1 = 133, so that the modelled expected value is

    E[Y_i | x_i] = I_(0,133](x_i1)(β_10 + β_11 x_i1) + I_(133,277](x_i1)(β_20 + β_21 x_i1)
                 = β_10 + β_11 x_i1   for 0 ≤ x_i1 ≤ 133
                 = β_20 + β_21 x_i1   for 133 < x_i1 ≤ 277
However, an alternative parameterization in terms of contrasts more
readily permits the assessment of interest: that is, we write

    E[Y_i | x_i] = β_0 + β_1 x_i1 + I_(133,277](x_i1)(δ_0 + δ_1 x_i1)
                 = β_0 + β_1 x_i1                    for 0 ≤ x_i1 ≤ 133
                 = (β_0 + δ_0) + (β_1 + δ_1) x_i1    for 133 < x_i1 ≤ 277

so that δ_0 measures the change in intercept and δ_1 measures the
change in slope.
y0<-scan('US-GDP.txt')
y<-100*log(y0[-1]/y0[-278])
x1<-c(1:277)
fit.c1<-lm(y~(x1>133)*x1)
summary(fit.c1)
+ Call:
+ lm(formula = y ~ (x1 > 133) * x1)
+
+ Residuals:
+     Min      1Q  Median      3Q     Max
+ -3.2428 -0.4601  0.0010  0.4981  4.5554
+
+ Coefficients:
+                  Estimate Std. Error t value Pr(>|t|)
+ (Intercept)      1.268748   0.176850   7.174 6.86e-12 ***
+ x1 > 133TRUE     2.027377   0.461446   4.394 1.60e-05 ***
+ x1               0.008489   0.002290   3.707 0.000254 ***
+ x1 > 133TRUE:x1 -0.018159   0.003062  -5.930 9.14e-09 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 1.014 on 273 degrees of freedom
+ Multiple R-squared: 0.1681, Adjusted R-squared: 0.1589
+ F-statistic: 18.39 on 3 and 273 DF,  p-value: 6.806e-11
confint(fit.c1)
+                        2.5 %      97.5 %
+ (Intercept)      0.920584549  1.61691226
+ x1 > 133TRUE     1.118931885  2.93582160
+ x1               0.003980031  0.01299743
+ x1 > 133TRUE:x1 -0.024187618 -0.01213028
The second and fourth lines of the coefficients table correspond to
the estimates of δ_0 and δ_1 respectively; we have δ̂_0 = 2.027377
and δ̂_1 = −0.01815895. Both values are significantly different from
zero, so there is significant evidence of a changepoint.
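The contrast parameterization corresponds to a four-column design matrix; a minimal pure-Python sketch of the matrix that lm(y ~ (x1 > 133)*x1) expands to under the usual treatment coding (changepoint c = 133 as in the text):

```python
# Design matrix for the discontinuous changepoint model in contrast
# form: columns [1, I(x > c), x, I(x > c)*x].  Sketch only.
c = 133

def design_row(xi):
    ind = 1.0 if xi > c else 0.0
    return [1.0, ind, xi, ind * xi]

X = [design_row(float(xi)) for xi in range(1, 278)]  # x1 = 1, ..., 277
print(X[0], X[132], X[133])  # rows for x1 = 1, 133, 134
```

The indicator column switches on only strictly after x1 = 133, matching the piecewise definition of the expected value above.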
2. We can repeat the above, but make the expectation continuous at
the changepoint. The easiest way to do this is to form the model

    E[Y_i | x_i] = β_0 + β_1 x_i1 + I_(133,277](x_i1) δ_1 (x_i1 − 133)
                 = β_0 + β_1 x_i1                        for 0 ≤ x_i1 ≤ 133
                 = (β_0 − 133 δ_1) + (β_1 + δ_1) x_i1    for 133 < x_i1 ≤ 277

Note that the continuity assumption restricts us to a
three-parameter model.
fit.c2<-lm(y~x1+I((x1>133)*(x1-133)))
print(summary(fit.c2,digits=4))
+ Call:
+ lm(formula = y ~ x1 + I((x1 > 133) * (x1 - 133)))
+
+ Residuals:
+     Min      1Q  Median      3Q     Max
+ -3.3257 -0.4900 -0.0313  0.5300  4.4862
+
+ Coefficients:
+                              Estimate Std. Error t value Pr(>|t|)
+ (Intercept)                  1.369618   0.165547   8.273 5.74e-15 ***
+ x1                           0.006230   0.001802   3.458 0.000632 ***
+ I((x1 > 133) * (x1 - 133))  -0.017855   0.003065  -5.826 1.59e-08 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 1.017 on 274 degrees of freedom
+ Multiple R-squared: 0.1604, Adjusted R-squared: 0.1542
+ F-statistic: 26.17 on 2 and 274 DF,  p-value: 3.982e-11
confint(fit.c2)
+                                  2.5 %      97.5 %
+ (Intercept)                1.043711323  1.69552456
+ x1                         0.002683043  0.00977787
+ I((x1 > 133) * (x1 - 133)) -0.023888774 -0.01182157
Now we have δ̂_1 = −0.01785517, and again a significant change in
slope.
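The continuous model's design can equivalently be built with a hinge ("broken stick") basis function, which is exactly what the I((x1 > 133)*(x1 - 133)) term constructs; a minimal Python sketch (changepoint c = 133 as in the text):

```python
# Continuous changepoint ("broken stick") design: columns
# [1, x, max(x - c, 0)].  Continuity at x = c holds because the
# hinge term is zero at the changepoint.  Sketch only.
c = 133.0

def row(xi):
    return [1.0, xi, max(xi - c, 0.0)]

print(row(1.0), row(133.0), row(200.0))
```

Because the third column vanishes at x = c, the two line segments necessarily meet there, which is why this model has only three free parameters.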
3. We could perform two separate straight-line fits on the first
half and second half of the data: we achieve this using the subset
argument in the lm call.
fit.c3<-lm(y~x1,subset=(x1<=133)) #First half
print(summary(fit.c3,digits=4))
+ Call:
+ lm(formula = y ~ x1, subset = (x1 <= 133))
+
+ Residuals:
+     Min      1Q  Median      3Q     Max
+ -3.2428 -0.6401 -0.0697  0.6753  4.5554
+
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)
+ (Intercept) 1.268748   0.219464   5.781 5.16e-08 ***
+ x1          0.008489   0.002842   2.987  0.00337 **
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 1.258 on 131 degrees of freedom
+ Multiple R-squared: 0.06376, Adjusted R-squared: 0.05661
+ F-statistic: 8.921 on 1 and 131 DF,  p-value: 0.003366
confint(fit.c3)
fit.c4<-lm(y~x1,subset=(x1>133)) #Second half
print(summary(fit.c4,digits=4))
+ Call:
+ lm(formula = y ~ x1, subset = (x1 > 133))
+
+ Residuals:
+      Min       1Q   Median       3Q      Max
+ -2.90201 -0.33119  0.01855  0.38653  2.57152
+
+ Coefficients:
+             Estimate Std. Error t value Pr(>|t|)
+ (Intercept)  3.29612    0.30194  10.917  < 2e-16 ***
+ x1          -0.00967    0.00144  -6.715 4.19e-10 ***
+ ---
+ Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+ Residual standard error: 0.7183 on 142 degrees of freedom
+ Multiple R-squared: 0.241, Adjusted R-squared: 0.2357
+ F-statistic: 45.09 on 1 and 142 DF,  p-value: 4.186e-10
confint(fit.c4)
From the reported confidence intervals for the parameters, we can
conclude that the two lines are significantly different; for
example, the confidence intervals for the slope parameters do not
overlap.
par(mar=c(4,4,3,0))
plot(x1,y,ylab='Percent GDP growth',type='l')
points(x1,y,pch=19,cex=0.75)
title('US GDP Growth: log scale differences, 1947-2016')
x1v<-seq(0,277,by=0.01)
y1v<-predict(fit.c1,newdata=data.frame(x1=x1v))
lines(x1v,y1v,col='red')
y2v<-predict(fit.c2,newdata=data.frame(x1=x1v))
lines(x1v,y2v,col='blue')
legend(150,6,c('Discontinuous','Continuous'),col=c('red','blue'),lty=1)
[Figure: GDP growth series with the discontinuous (red) and
continuous (blue) fitted changepoint models overlaid.]
The residual plots from the first two fits do not offer conclusive
evidence that the assumptions are incorrect; however, it appears
that the residuals in the first half have a larger variance than
the residuals in the second half. This is backed up by the fact
that the two estimates of σ are different: 1.258361 and 0.7183496
for the first and second halves respectively.
par(mar=c(4,4,3,0),mfrow=c(2,1))
plot(x1,residuals(fit.c1),pch=19,ylim=range(-4,4),main='Discontinuous model: residuals')
abline(v=133,lty=3);abline(h=0,lty=2)
plot(x1,residuals(fit.c2),pch=19,ylim=range(-4,4),main='Continuous model: residuals')
abline(v=133,lty=3);abline(h=0,lty=2)