
Part II

Linear Regression

As of Nov 2, 2020

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Seppo Pynnönen, Applied Multivariate Statistical Analysis


1 Simple Regression

2 Multiple Regression

Statistical significance of coefficients

Selecting important variables

Model fit

Prediction

Other considerations

Qualitative predictors

Potential problems

Non-parametric regressions

1 Simple Regression

Simple (linear) regression is defined as

y = β0 + β1x + ε, (1)

where y and x are observed values, β0 and β1, called parameters, are the intercept (constant) term and the slope coefficient, respectively, and ε is an unobserved random error term.

The parameters β0 and β1 are unknown, and are estimated from training data.

Given estimates β̂0 and β̂1, one can compute

ŷ = β̂0 + β̂1x (2)

to predict y on the basis of a given x-value, where ŷ denotes the prediction.


Regression: Estimation

The unknown coefficients β0 and β1 are most often estimated from the training data (sample) with observations (x1, y1), . . . , (xn, yn) by ordinary least squares (OLS), i.e.,

(β̂0, β̂1) = arg min over β0, β1 of ∑_{i=1}^{n} (yi − (β0 + β1xi))². (3)

Here the solutions are

β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)², (4)

β̂0 = ȳ − β̂1x̄, (5)

where x̄ and ȳ are the sample means of x and y.
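A minimal R sketch of (3)-(5), assuming vectors x and y hold the training data; lm() computes the same OLS solution:

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # eq. (4)
b0 <- mean(y) - b1 * mean(x)                                     # eq. (5)
c(b0, b1)
coef(lm(y ~ x))  # lm() reproduces the same estimates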


Remark 1

A crucial assumption for successful estimation is

E(ε|x) = 0, (6)

which implies that the explanatory variable x and the error term ε are uncorrelated.


Example 1

In the Advertising data, regressing Sales on TV budget gives:

lm(formula = sales ~ TV, data = adv)

Residuals:

Min 1Q Median 3Q Max

-8.3860 -1.9545 -0.1913 2.0671 7.2124

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.032594 0.457843 15.36 <2e-16 ***

TV 0.047537 0.002691 17.67 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom

Multiple R-squared: 0.6119,Adjusted R-squared: 0.6099

F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

[Figure: scatter plot of Sales (1,000 units) against TV budget (1,000 USD) with the fitted regression line.]


The deviations from the regression line, ei = yi − ŷi, are called residuals; they are estimates of the unobserved errors (εi).

Note that yi = ŷi + ei = β̂0 + β̂1xi + ei.

The estimates² are random variables in the sense that if another sample is selected they assume different values.

The distribution of an estimate (estimator) is called the sampling distribution: the distribution of the estimate values if one sampled n observations over and over again from the population and computed the estimates (see the example below).

² Statistical literature usually makes a distinction between estimator and estimate, so that the estimator refers to the function and the estimate to the value of the function.


Example 2

Below, the left panel shows the scatter plot, the population regression line, and the OLS-estimated line from a sample of n = 100 observations; the middle panel shows OLS-estimated lines from 10 samples of size n = 100; and the right panel shows a histogram of the slope-coefficient estimates β̂1 from 1,000 different samples of size n = 100. The population model is

y = 2 + 3x + ε, (7)

i.e. β0 = 2 and β1 = 3. For this simulated example x and ε are generated from the N(0, 1) and N(0, 4) distributions, respectively.

[Figure: left panel, the initial sample with the population and OLS lines (β̂0 = 1.99, β̂1 = 3.36); middle panel, OLS lines from 10 samples; right panel, histogram of 1,000 β̂1 estimates.]
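A minimal R sketch of the simulation described above (the seed and plotting details are illustrative, not from the original slides):

set.seed(123)  # illustrative seed, not from the original
b1.hat <- replicate(1000, {  # 1,000 samples of size n = 100
  x <- rnorm(100)            # x ~ N(0, 1)
  e <- rnorm(100, sd = 2)    # eps ~ N(0, 4), i.e. sd = 2
  y <- 2 + 3 * x + e         # population model (7)
  coef(lm(y ~ x))[2]         # OLS slope estimate
})
hist(b1.hat, freq = FALSE)   # cf. the right-panel histogram
sd(b1.hat)                   # roughly 0.4, cf. se(beta1) below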


Regression: Accuracy of coefficients

As stated generally in equation (1.1), the true relationship between x and y is y = f(x) + ε.

Here f(x) = β0 + β1x, resulting in the population regression y = β0 + β1x + ε as given in equation (1).

As demonstrated by Example 2, the estimates β̂0 and β̂1 deviate more or less from the underlying population parameters β0 and β1.

However, it can be shown that on average the estimates equal the underlying parameter values; mathematically, E[β̂j] = βj, j = 0, 1. In such a case we say that the estimates are unbiased.

So, in summary, it can be shown that the OLS estimators are unbiased, i.e., they do not systematically over- or underestimate the underlying parameters.


The accuracy of the estimates can be evaluated in terms of the standard errors of the estimates,

se(β̂1) = σ / √( ∑_{i=1}^{n} (xi − x̄)² ), (8)

se(β̂0) = σ √( 1/n + x̄² / ∑_{i=1}^{n} (xi − x̄)² ), (9)

where σ is the standard deviation of the error term ε.


These are routinely produced by every regression package.

In the above example, the initial sample produces:

lm(formula = y ~ x)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.9905 0.3663 5.434 4.03e-07 ***

x 3.3588 0.3852 8.720 7.22e-14 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.663 on 98 degrees of freedom

Multiple R-squared: 0.4369,Adjusted R-squared: 0.4311

F-statistic: 76.03 on 1 and 98 DF, p-value: 7.219e-14


Thus se(β̂1) = 0.3852, which estimates the standard deviation we would see if we repeated the sampling over and over again, computed β̂1 from each sample, and calculated the standard deviation of those estimates.

We did this 1,000 times for the right-panel histogram in Example 2.

The standard deviation of these estimates is 0.4141, which is close to se(β̂1) = 0.3852 (the difference is about .029, or 7%).


The standard errors can be used to compute e.g. confidence intervals (CIs) for the coefficients.

CIs are of the form

β̂ ± cα/2 · se(β̂), (10)

or

[β̂ − cα/2 · se(β̂), β̂ + cα/2 · se(β̂)], (11)

where cα/2 is the 1 − α/2 percentile of the t-distribution (or normal distribution).

α is the significance level, with typical values .05 or .01, in which cases the confidence intervals are 95% and 99%, respectively. For example, for the 95% confidence interval c.025 ≈ 2.
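In R, intervals of form (11) are given by confint(); a sketch assuming x and y hold the initial sample of Example 2:

fit <- lm(y ~ x)            # the initial-sample fit
confint(fit, level = 0.95)  # 95% CIs for beta0 and beta1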


In Example 2 the 95% confidence interval for β1 is

β̂1 ± 2 × se(β̂1) = 3.36 ± 2 × 0.385 = 3.36 ± 0.770, or [2.59, 4.13].

We observe that in this case the population β1 = 3 belongs to the interval.

For a 95% confidence interval there is a 5% chance of getting a sample in which the estimate is so far off that the confidence interval does not cover the population parameter.


Standard errors can also be used in hypothesis testing.

The most common hypothesis testing involves testing the null hypothesis

H0 : There is no relationship between x and y (12)

versus the alternative hypothesis

H1 : There is some relationship between x and y. (13)

More formally, in terms of the regression model in equation (1) this corresponds to testing

H0 : β1 = 0 (14)

versus

H1 : β1 ≠ 0. (15)

If the null hypothesis holds then y = β0 + ε, so that x is not associated with y.


For testing the more general null hypothesis H0 : β1 = β1*, where β1* is some given value, the test statistic is

t = (β̂1 − β1*) / se(β̂1), (16)

which for testing hypothesis (14) with β1* = 0 reduces to

t = β̂1 / se(β̂1). (17)

The null distribution (i.e., the distribution when the null hypothesis H0 is true) of t is the Student t distribution with n − 2 degrees of freedom.

With n > 30 the t-distribution is close to the normal distribution.
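A sketch of (17) in R, computing the t-ratio and its two-sided p-value by hand for the fit above; the numbers should match the summary() output:

ct   <- coef(summary(fit))                            # coefficient table
tval <- ct["x", "Estimate"] / ct["x", "Std. Error"]   # eq. (17)
2 * pt(-abs(tval), df = df.residual(fit))             # two-sided p-value from t(n - 2)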


For a large absolute value of t the null hypothesis is rejected.

By a 'large' value we mean that if the probability of obtaining such a large value is smaller than some specified threshold value α, we reject the null hypothesis.

Typical values of α are 0.05 or 0.01, i.e., 5% or 1%.

The computer produces p-values that indicate the probability P(|t| > |tobs| | H0), i.e., the probability of getting as large (or larger) values than the one observed, tobs, if the null hypothesis holds.

If the probability is too low, we infer that, rather than our having such an extreme sample, the underlying parameter value is something other than that of the null hypothesis, and we therefore reject H0.


Typical threshold values are 0.05 (statistically significant at the 5% level) and 0.01 (statistically significant at the 1% level), i.e., if the p-value goes below these values, we reject the null hypothesis at the associated level of significance.

Example 3

In the advertising example the p-value < .0001 (in fact the first 15 decimals are zeros), so the data strongly suggest rejecting the null hypothesis that TV advertising does not affect Sales (the sign of the coefficient shows that the association is positive, as could be expected).


Regression: Accuracy of the model

The quality of a linear regression fit is typically assessed by the residual standard error (RSE) and the coefficient of determination R² (R-square), of which the R-square is the more popular:

RSE = √( (1/(n − 2)) ∑_{i=1}^{n} (yi − ŷi)² ) (18)

and

R² = (TSS − RSS) / TSS = 1 − RSS/TSS, (19)

where TSS = ∑_{i=1}^{n} (yi − ȳ)² is the total sum of squares and RSS = ∑_{i=1}^{n} (yi − ŷi)² is the residual sum of squares.

We observe that RSE = √( RSS/(n − 2) ).

R-square is a goodness-of-fit measure with 0 ≤ R² ≤ 1 (R² = 0, no association; R² = 1, perfect fit), while RSE measures lack of fit.
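Equations (18) and (19) computed directly in R for the fit above (x, y as before); the values agree with the 'Residual standard error' and 'Multiple R-squared' lines of summary(fit):

rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
c(RSE = sqrt(rss / df.residual(fit)),  # eq. (18), df = n - 2 here
  R2  = 1 - rss / tss)                 # eq. (19)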


Both of these are routinely produced by regression packages.

In the advertising example (rounded to two decimals) RSE = 3.26 and R² = 0.61.

RSE is in the same units as the dependent variable y.

James et al. interpret RSE as the amount by which the prediction is on average off from the true value of the dependent variable.

Accordingly, RSE = 3.26 would indicate that any prediction of sales on the basis of TV would be off on average by about 3.26 thousand units.

Thus, as the average sales over all markets is approximately 14 thousand units, the error is 3.26/14 ≈ 23%.

2 Multiple Regression

Adding explanatory variables (x-variables) to the model gives multiple regression,

y = β0 + β1x1 + · · · + βpxp + ε, (20)

where xj is the jth predictor (explanatory variable) and βj quantifies the marginal effect of, or association between, y and xj.

That is, βj indicates the change in y per unit change in xj, holding all other predictors fixed.

The coefficients are again estimated by finding the β̂0, β̂1, . . . , β̂p that minimize the sum of squares ∑_{i=1}^{n} (yi − ŷi)², where ŷi = β̂0 + β̂1xi1 + · · · + β̂pxip.


Example 4

In the advertising example let us enhance the model as

sales = β0 + β1 TV + β2 radio + β3 newspaper + ε (21)

lm(formula = sales ~ TV + radio + newspaper, data = adv)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.938889 0.311908 9.422 <2e-16 ***

TV 0.045765 0.001395 32.809 <2e-16 ***

radio 0.188530 0.008611 21.893 <2e-16 ***

newspaper -0.001037 0.005871 -0.177 0.86

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom

Multiple R-squared: 0.8972,Adjusted R-squared: 0.8956

F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16


The results indicate that newspapers do not contribute to sales, while an additional thousand spent on TV advertising predicts an average increase in sales of about 46 units (holding the radio budget unchanged).

Similarly, an additional thousand in radio advertising predicts an increase in sales of about 189 units (holding the TV budget intact).

However, checking the residuals reveals that the specification is not satisfactory.

[Figure: Residuals vs Fitted plot for lm(sales ~ TV + radio + newspaper); observations 131, 6 and 179 are flagged.]

The graph indicates non-linearity.


After dropping the non-significant newspaper, we add squared terms of the explanatory variables to account for the obvious non-linearity:

sales = β0 + β1 TV + β2 radio + β11 (TV)² + β22 (radio)² + ε (22)

(Note: the βs and ε are generic notations.)

lm(formula = sales ~ TV + radio + I(TV^2) + I(radio^2), data = adv)

Residuals:

Min 1Q Median 3Q Max

-7.3987 -0.8509 0.0376 0.9781 3.3727

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.535e+00 4.093e-01 3.750 0.000233 ***

TV 7.852e-02 4.978e-03 15.774 < 2e-16 ***

radio 1.588e-01 2.830e-02 5.613 6.78e-08 ***

I(TV^2) -1.138e-04 1.674e-05 -6.799 1.26e-10 ***

I(radio^2) 7.135e-04 5.709e-04 1.250 0.212862

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.515 on 195 degrees of freedom

Multiple R-squared: 0.9174,Adjusted R-squared: 0.9157

F-statistic: 541.2 on 4 and 195 DF, p-value: < 2.2e-16

[Figure: diagnostic plots for model (22): Residuals vs Fitted (obs 131, 6 and 92 flagged), Normal Q-Q, Residuals vs TV advertising, Residuals vs Radio advertising.]


The residual plot (top left) still indicates non-linearity.

Let us enhance the model further by adding the interaction term TV × radio of the explanatory variables and estimate the regression

sales = β0 + β1 TV + β2 radio + β11 (TV)² + β22 (radio)² + β12 (TV × radio) + ε (23)

lm(formula = sales ~ TV + radio + I(TV^2) + I(radio^2) + TV:radio,

data = adv)

Residuals:

Min 1Q Median 3Q Max

-5.0027 -0.2859 -0.0062 0.3829 1.2100

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.194e+00 2.061e-01 25.202 <2e-16 ***

TV 5.099e-02 2.236e-03 22.801 <2e-16 ***

radio 2.654e-02 1.242e-02 2.136 0.0339 *

I(TV^2) -1.098e-04 6.901e-06 -15.914 <2e-16 ***

I(radio^2) 1.861e-04 2.359e-04 0.789 0.4311

TV:radio 1.075e-03 3.479e-05 30.892 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6244 on 194 degrees of freedom

Multiple R-squared: 0.986,Adjusted R-squared: 0.9857

F-statistic: 2740 on 5 and 194 DF, p-value: < 2.2e-16

[Figure: diagnostic plots for model (23): Residuals vs Fitted (obs 131, 156 and 79 flagged), Normal Q-Q, Residuals vs TV advertising, Residuals vs Radio advertising.]

Except for two potential outliers (obs 131 and 156), the residual plots are more satisfactory (recall that the error term should be purely random, and thereby should not show systematic patterns in any context). Some indication of a third-order effect of TV may be present.


The interpretation of the coefficients is now a bit more tricky.

For example, the TV coefficient 0.051 indicates the TV effect at zero radio budget (an increase of $1,000 in TV advertising can be expected to increase sales by 51 units if radio advertising is zero), while generally the marginal effect depends on the current levels of TV and radio advertising, being of the form β̂1 + 2β̂11 TV + β̂12 radio.
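As a worked illustration using the estimated coefficients above (TV = 150 and radio = 25 are arbitrary example levels), the estimated marginal effect of TV is

0.05099 + 2 × (−0.0001098) × 150 + 0.001075 × 25 ≈ 0.045,

i.e. at those budget levels an additional $1,000 of TV advertising predicts about 45 extra units of sales.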

Finally, it may be surprising that newspaper advertising does not contribute to sales in the model, because alone it is significant in a simple regression.

Dependent variable: sales

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 12.35141 0.62142 19.88 < 2e-16 ***

newspaper 0.05469 0.01658 3.30 0.00115 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlations

sales TV radio newspaper

sales 1.000 0.782 0.576 0.228

TV 0.782 1.000 0.055 0.057

radio 0.576 0.055 1.000 0.354

newspaper 0.228 0.057 0.354 1.000

The reason is that radio and newspaper are correlated.

So, newspaper alone in a regression reflects radio advertising (due to the correlation) even though newspaper advertising actually does not contribute to sales!


A 3D plot to illustrate graphically the relationships.

Statistical significance of coefficients

The t-ratios

t = β̂j / se(β̂j)

and the associated p-values indicate the significance of individual coefficients separately.

Testing the joint significance of all (or a subset of) the coefficients, i.e., the null hypothesis

H0 : β1 = · · · = βp = 0 (24)

versus

H1 : at least one βj ≠ 0, (25)

can be performed with the F-statistic

F = ((TSS − RSS)/p) / (RSS/(n − p − 1)), (26)

which has the F-distribution with p and n − p − 1 degrees of freedom if the null hypothesis H0 is true.
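In R the overall F-statistic of (26) is printed by summary(); the F-test for a subset of coefficients can be sketched with anova() on nested fits, assuming the adv data frame of Example 4:

fit0 <- lm(sales ~ TV, data = adv)                      # restricted model
fit1 <- lm(sales ~ TV + radio + newspaper, data = adv)  # full model
anova(fit0, fit1)  # F-test of H0: the radio and newspaper coefficients are zero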


Example 5

In Example 4, F = 570.3 with 3 and 196 degrees of freedom and the p-value is zero to 15 decimal places, implying strong rejection of the null hypothesis that advertising in the three media does not affect sales (i.e., the null hypothesis H0 : β1 = β2 = β3 = 0).

The F-test of hypothesis (24) can be considered the first step.

If the null hypothesis is not rejected, we conclude that no explanatory variable is associated with y and the model is of the form y = β0 + ε, i.e., y purely varies around its mean.

If the null hypothesis is rejected, the interest then is in which variables are associated with y, i.e., which explanatory variables are important.

Selecting important variables

From a set of p explanatory variables, not all are necessarily associated with y, or their importance may be marginal.

Variable selection or model selection in regression analysis refers to the problem of choosing the best subset of variables from the (possibly large number of) available candidates.

Criterion functions: e.g., the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Select the subset for which the criterion function assumes its minimum.

Stepwise selection:

Forward selection: start from the null model with no explanatory variables and enhance the model one variable at a time, adding the variable with the smallest p-value, until the smallest p-value among the non-selected variables is no longer significant at the chosen level (e.g. the 5% level).

Backward selection: start with all explanatory variables in the model and remove, one by one, the variable with the largest non-significant p-value. Stop when all remaining variables are significant.

Forward-backward selection: a combination of forward and backward selection, starting with forward selection and applying backward selection at each step to remove non-significant variables from the current model. The procedure stops when no more variables are selected and no variables are removed.


Remark 2

The stepwise selections can also be performed using criterion functions like AIC.

In R, for example, the base function step() performs AIC-based stepwise selection (stepAIC() in the MASS package is similar).
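As a minimal sketch (assuming the boston data frame of Example 6 below):

full <- lm(medv ~ ., data = boston)               # model with all predictors
red  <- step(full, direction = "both", trace = 0) # AIC forward-backward search
formula(red)                                      # the selected model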


Example 6

AIC stepwise variable selection. Full model:

lm(formula = medv ~ ., data = boston)

## note, the dot in medv ~ . includes all variables

Residuals:

Min 1Q Median 3Q Max

-15.594 -2.730 -0.518 1.777 26.199

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.45949 5.10346 7.1 3e-12 ***

crim -0.10801 0.03286 -3.3 0.001 **

zn 0.04642 0.01373 3.4 8e-04 ***

indus 0.02056 0.06150 0.3 0.738

chas 2.68673 0.86158 3.1 0.002 **

nox -17.76661 3.81974 -4.7 4e-06 ***

rm 3.80987 0.41793 9.1 <2e-16 ***

age 0.00069 0.01321 0.1 0.958

dis -1.47557 0.19945 -7.4 6e-13 ***

rad 0.30605 0.06635 4.6 5e-06 ***

tax -0.01233 0.00376 -3.3 0.001 **

ptratio -0.95275 0.13083 -7.3 1e-12 ***

black 0.00931 0.00269 3.5 6e-04 ***

lstat -0.52476 0.05072 -10.3 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’

Residual standard error: 4.75 on 492 degrees of freedom

Multiple R-squared: 0.741,Adjusted R-squared: 0.734

F-statistic: 108 on 13 and 492 DF, p-value: <2e-16

lm(formula = medv ~ crim + zn + chas + nox + rm + dis +

rad + tax + ptratio + black + lstat, data = boston)

Residuals:

Min 1Q Median 3Q Max

-15.5984 -2.7386 -0.5046 1.7273 26.2373

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.3411 5.0675 7.2 3e-12 ***

crim -0.1084 0.0328 -3.3 0.001 **

zn 0.0458 0.0135 3.4 8e-04 ***

chas 2.7187 0.8542 3.2 0.002 **

nox -17.3760 3.5352 -4.9 1e-06 ***

rm 3.8016 0.4063 9.4 <2e-16 ***

dis -1.4927 0.1857 -8.0 7e-15 ***

rad 0.2996 0.0634 4.7 3e-06 ***

tax -0.0118 0.0034 -3.5 5e-04 ***

ptratio -0.9465 0.1291 -7.3 9e-13 ***

black 0.0093 0.0027 3.5 6e-04 ***

lstat -0.5226 0.0474 -11.0 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’

Residual standard error: 4.736 on 494 degrees of freedom

Multiple R-squared: 0.7406,Adjusted R-squared: 0.7348

F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

indus and age are removed.

Model fit

As in simple regression, R² and the residual standard error (RSE) are two of the most common model fit measures.

In Example 4: R² = 0.986, i.e., the model explains 98.6% of the total variation in sales.

Another R-squared is

R̄² = 1 − (RSS/(n − p − 1)) / (TSS/(n − 1)) = 1 − (1 − R²)(n − 1)/(n − p − 1), (27)

called the adjusted R-squared, which penalizes R² for the inclusion of additional explanatory variables.

In Example 6: R² = 0.741 for the full model and 0.7406 for the stepwise-selected model (indus and age removed), while the adjusted R̄² = 0.734 for the full model and R̄² = 0.735 for the reduced model.

Thus R-squared (slightly) decreases when removing explanatory variables, while in this case the adjusted R-squared slightly increases.
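As a check of (27) with the reduced model of Example 6 (the 494 residual degrees of freedom with p = 11 imply n = 506):

R̄² = 1 − (1 − 0.7406) × (506 − 1)/(506 − 11 − 1) = 1 − 0.2594 × 505/494 ≈ 0.7348,

which matches the printed adjusted R-squared.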

Prediction

The estimated model

ŷ = β̂0 + β̂1x1 + · · · + β̂pxp (28)

estimates the population regression

f(x) = β0 + β1x1 + · · · + βpxp. (29)

The inaccuracy in the estimated coefficients in (28) is related to the reducible error.

Confidence interval for the regression: a confidence interval for the population regression in (29) is

ŷ ± cα/2 se(ŷ), (30)

where

se(ŷ) = σ √( x′(X′X)⁻¹x ) (31)

is the standard error of the regression line (more precisely, hyperplane).

Confidence interval for prediction: a confidence interval for a realized value y related to given observed x-values is

ŷ ± cα/2 se(pred y), (32)

where

se(pred y) = σ √( 1 + x′(X′X)⁻¹x ). (33)

The one in the standard error of prediction is due to the irreducible error.
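In R, intervals (30) and (32) are produced by predict() on an lm fit; a sketch for the quadratic model of Example 7 below, assuming the adv data frame:

fit <- lm(sales ~ TV + I(TV^2), data = adv)  # quadratic model of Example 7
new <- data.frame(TV = c(50, 150, 250))      # hypothetical TV budgets
predict(fit, newdata = new, interval = "confidence")  # interval (30)
predict(fit, newdata = new, interval = "prediction")  # interval (32)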


Example 7

In the Advertising data set, consider the regression sales = β0 + β1 TV + β11 (TV)² + u.

The figure depicts 95% confidence intervals for the regression line (grey) and for predictions (light blue).

[Figure: sales against TV with the fitted quadratic regression line, 95% confidence band (grey) and 95% prediction band (light blue).]

Other considerations

Qualitative predictors

Qualitative information that indicates only classification (e.g. gender, ethnic group, etc.) can be introduced into the regression using indicator or dummy variables.

In a regression with a qualitative explanatory variable with q classes, one class is selected as the reference class and the other q − 1 classes are indicated by q − 1 dummy variables.

The coefficients of the dummy variables indicate deviations from the reference class.

In R, category variables can be defined as factor variables, for which R generates the needed dummy variables in the regression.
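A small sketch of dummy coding in R (mstatus is a hypothetical factor echoing Example 8 below):

mstatus <- factor(c("single male", "single female",
                    "married male", "married female"))
mstatus <- relevel(mstatus, ref = "single male")  # choose the reference class
model.matrix(~ mstatus)  # intercept plus q - 1 = 3 dummy columns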


Example 8

Using the wage data available on www.econometrics.com,

log(wage) = β0 + δ1 singlefem + δ2 marrmale + δ3 marrfem + β2 educ + β3 tenure + β4 exper + β5 (tenure)² + β6 (exper)² + ε. (34)

Thus, single male is the reference group.

lm(formula = log(wage) ~ mstatus + educ + tenure + exper + I(tenure^2) +

I(exper^2), data = wdf)

Residuals:

Min 1Q Median 3Q Max

-1.89697 -0.24060 -0.02689 0.23144 1.09197

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.3213781 0.1000090 3.213 0.001393 **

mstatussingle female -0.1103502 0.0557421 -1.980 0.048272 *

mstatusmarried male 0.2126757 0.0553572 3.842 0.000137 ***

mstatusmarried female -0.1982676 0.0578355 -3.428 0.000656 ***

educ 0.0789103 0.0066945 11.787 < 2e-16 ***

tenure 0.0290875 0.0067620 4.302 2.03e-05 ***

exper 0.0268006 0.0052428 5.112 4.50e-07 ***

I(tenure^2) -0.0005331 0.0002312 -2.306 0.021531 *

I(exper^2) -0.0005352 0.0001104 -4.847 1.66e-06 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3933 on 517 degrees of freedom

Multiple R-squared: 0.4609,Adjusted R-squared: 0.4525

F-statistic: 55.25 on 8 and 517 DF, p-value: < 2.2e-16

Potential problems

In fitting a linear regression, many problems may occur.

Non-linearity.

Correlation of error terms.

Heteroskedasticity.

Outliers.

High-leverage points.

Collinearity.

Graphical tools are often useful for checking the presence of these problems (scatter plots of residuals against predicted values and explanatory variables, as we have done in some of the Advertising examples).

Non-parametric regressions

Parametric regressions assume a well-defined functional form for f(x).

Non-parametric approaches do not set assumptions on f(x).

These methods rely on the data and apply different algorithms to find relationships between the response and the explanatory variables.

One is K-nearest neighbors regression (KNN regression), which is closely related to the KNN classifier.

Given a value of K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0.

f(x0) is then estimated by the average of the training responses in N0, i.e.,

f̂(x0) = (1/K) ∑_{xi ∈ N0} yi. (35)
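A base-R sketch of (35) for a single predictor (the function name and the training vectors x and y are illustrative; packaged versions such as FNN::knn.reg exist):

knn.reg1 <- function(x0, x, y, K = 5) {
  N0 <- order(abs(x - x0))[1:K]  # indices of the K nearest neighbours
  mean(y[N0])                    # average of the K nearest responses, eq. (35)
}
knn.reg1(0.5, x, y, K = 5)  # predicted f(0.5) from training data x, y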

We will return to this later.
