
15-1

Objectives of Multiple Regression

• Establish the linear equation that best predicts values of a dependent variable Y using more than one explanatory variable from a large set of potential predictors {x1, x2, ... xk}.

• Find that subset of all possible predictor variables that explains a significant and appreciable proportion of the variance of Y, trading off adequacy of prediction against the cost of measuring more predictor variables.


15-2

Expanding Simple Linear Regression

• Quadratic model:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$

• General polynomial model:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \cdots + \beta_k x_1^k + \varepsilon$

Any independent variable, $x_i$, which appears in the polynomial regression model as $x_i^k$ is called a kth-degree term.

Polynomial models are built by adding one or more polynomial terms to the simple linear regression model.
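As a concrete illustration, here is a minimal sketch of fitting a quadratic model by least squares with NumPy; the data are simulated for the example, not taken from the slides.

```python
import numpy as np

# Simulated data with a gentle curve (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(10, 40, 30)
y = 1.0 + 0.12 * x - 0.002 * x**2 + rng.normal(0, 0.1, size=x.size)

# Design matrix for Y = b0 + b1*x + b2*x^2: one column per term.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares estimates of (b0, b1, b2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```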


15-3

Polynomial model shapes.

[Figure: two scatterplots of y versus x with fitted curves, one linear and one quadratic.]

Adding one or more terms to the model can significantly improve the model fit.


15-4

Incorporating Additional Predictors

Simple additive multiple regression model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \varepsilon$

Additive (Effect) Assumption - The expected change in y per unit increment in $x_j$ is constant and does not depend on the value of any other predictor. This change in y is equal to $\beta_j$.
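A tiny sketch of what the additive assumption means in practice, using hypothetical coefficient values chosen for the example:

```python
# Hypothetical coefficients for y = b0 + b1*x1 + b2*x2 (no error term).
b0, b1, b2 = 2.0, 0.5, -1.2

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# The change in y per unit step in x1 is b1 regardless of x2.
print(predict(4, 0) - predict(3, 0))    # 0.5
print(predict(4, 10) - predict(3, 10))  # 0.5
```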


15-5

Additive regression models: For two independent variables, the response is modeled as a surface.


15-6

Interpreting Parameter Values (Model Coefficients)

• $\beta_0$, the “intercept” - the value of y when all predictors are 0.

• $\beta_1, \ldots, \beta_k$, the “partial slopes” - $\beta_j$ describes the expected change in y per unit increment in $x_j$ when all other predictors in the model are held at a constant value.


15-7

Graphical depiction of $\beta_j$:

$\beta_1$ - slope in direction of $x_1$.

$\beta_2$ - slope in direction of $x_2$.


15-8

Multiple Regression with Interaction Terms

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \cdots + \beta_{1k} x_1 x_k + \cdots + \beta_{k-1,k} x_{k-1} x_k + \varepsilon$

The cross-product terms quantify the interaction among predictors.

Interactive (Effect) Assumption: The effect of one predictor, xi, on the response, y, will depend on the value of one or more of the other predictors.


15-9

Interpreting Interaction

Interaction Model:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2} + \varepsilon_i$

or, defining $x_{i3} = x_{i1} x_{i2}$:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i3} + \varepsilon_i$

There is no difference between the two forms.

• $\beta_1$ - no longer the expected change in Y per unit increment in $x_1$!

• $\beta_{12}$ - no easy interpretation! The effect on y of a unit increment in $x_1$ now depends on $x_2$.
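A minimal sketch of fitting the interaction model on simulated data: append a cross-product column $x_3 = x_1 x_2$ to the design matrix and fit exactly as in the additive case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.2 * x2 + 0.8 * x1 * x2 + rng.normal(0, 0.3, size=n)

# Define x3 = x1*x2 and fit by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b12 = np.linalg.lstsq(X, y, rcond=None)[0]

# The effect of a unit step in x1 is b1 + b12*x2 - not constant.
for x2_val in (0.0, 1.0, 2.0):
    print(f"effect of x1 at x2 = {x2_val}: {b1 + b12 * x2_val:.2f}")
```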


15-10

[Figure: y versus x1 at x2 = 0, 1, 2. In the no-interaction panel the lines are parallel with common slope $\beta_1$ and intercepts $\beta_0$, $\beta_0 + \beta_2$, $\beta_0 + 2\beta_2$. In the interaction panel the slopes change with x2, e.g. $\beta_1 + 2\beta_{12}$ at x2 = 2.]


15-11

Multiple Regression models with interaction: the fitted lines may move apart, or come together.


15-12

Effect of the Interaction Term in Multiple Regression

Surface is twisted.


15-13

A Protocol for Multiple Regression

Identify all possible predictors.

Establish a method for estimating model parameters and their standard errors.

Develop tests to determine if a parameter is equal to zero (i.e. no evidence of association).

Reduce number of predictors appropriately.

Develop predictions and associated standard errors.


15-14

Estimating Model Parameters: Least Squares Estimation

Assume a random sample of n observations $(y_i, x_{i1}, x_{i2}, \ldots, x_{ik})$, $i = 1, 2, \ldots, n$. The estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ of the parameters in the best predicting equation

$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik}$

are found by choosing the values that minimize the expression:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik}) \right)^2$
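A minimal sketch of computing these estimates numerically on simulated data; NumPy's lstsq minimizes exactly this SSE:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
x = rng.normal(size=(n, k))
beta_true = np.array([1.0, 0.5, -1.2, 2.0])  # b0, b1, b2, b3
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + rng.normal(0, 0.3, size=n)

# lstsq returns the coefficients minimizing SSE = sum((y - X@b)**2);
# the second return value holds that SSE when X has full column rank.
beta_hat, sse, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, sse)
```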


15-15

Normal Equations

Take the partial derivatives of the SSE function with respect to $\beta_0, \beta_1, \ldots, \beta_k$ and set each derivative equal to 0. Solve this system of k+1 equations in k+1 unknowns to obtain the equations for the parameter estimates.

$\sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik}$

$\sum_{i=1}^{n} x_{i1} y_i = \hat\beta_0 \sum_{i=1}^{n} x_{i1} + \hat\beta_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{i1} x_{ik}$

$\vdots$

$\sum_{i=1}^{n} x_{ik} y_i = \hat\beta_0 \sum_{i=1}^{n} x_{ik} + \hat\beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik}^2$
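In matrix form the normal equations are $(X^\top X)\hat\beta = X^\top y$; a quick sketch of solving them directly on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])  # intercept column first
y = X @ np.array([1.0, 0.5, -1.2, 2.0]) + rng.normal(0, 0.3, size=n)

# Solve the k+1 normal equations (X'X) beta_hat = X'y in one step.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```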


15-16

An Overall Measure of How Well the Full Model Performs

Coefficient of Multiple Determination

• Denoted as $R^2$.

• Defined as the proportion of the variability in the dependent variable y that is accounted for by the independent variables, $x_1, x_2, \ldots, x_k$, through the regression model.

• With only one independent variable (k = 1), $R^2 = r^2$, the square of the simple correlation coefficient.


15-17

Computing the Coefficient of Determination

$R^2 = \frac{SSR}{S_{yy}} = 1 - \frac{SSE}{S_{yy}}, \qquad 0 \le R^2 \le 1$

where

$S_{yy} = TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \hat\beta_0 \sum_{i=1}^{n} y_i - \hat\beta_1 \sum_{i=1}^{n} x_{i1} y_i - \cdots - \hat\beta_k \sum_{i=1}^{n} x_{ik} y_i$
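A short sketch of the computation on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(0, 0.4, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 = 1 - SSE/Syy: the share of the variability in y explained.
sse = np.sum((y - X @ beta_hat) ** 2)
syy = np.sum((y - y.mean()) ** 2)
print(1 - sse / syy)
```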


15-18

Multicollinearity

A further assumption in multiple regression (absent in SLR) is that the predictors $(x_1, x_2, \ldots, x_k)$ are statistically uncorrelated; that is, the predictors do not co-vary. When the predictors are strongly correlated (correlation greater than about 0.6), the multiple regression model is said to suffer from problems of multicollinearity.

[Figure: three scatterplots of x2 versus x1 illustrating predictor correlations of r = 0, r = 0.6, and r = 0.8.]


15-19

Effect of Multicollinearity on the Fitted Surface

[Figure: observations in (x1, x2, y) space illustrating extreme collinearity: the points fall nearly along a single line in the (x1, x2) plane, leaving the fitted surface poorly determined.]


15-20

Multicollinearity leads to

• Numerical instability in the estimates of the regression parameters – wild fluctuations in these estimates if a few observations are added or removed.

• No longer have simple interpretations for the regression coefficients in the additive model.

Ways to detect multicollinearity

• Scatterplots of the predictor variables.

• Correlation matrix for the predictor variables – the higher these correlations the worse the problem.

• Variance Inflation Factors (VIFs) reported by software packages. Values larger than 10 usually signal a substantial amount of collinearity (see the sketch after this list).
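A minimal sketch of computing VIFs with statsmodels on simulated, deliberately collinear predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)

# VIF_j = 1 / (1 - R^2_j), where R^2_j regresses x_j on the others.
X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):  # skip the intercept column
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.1f}")
```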

What can be done about multicollinearity

• Regression estimates are still OK, but the resulting confidence/prediction intervals are very wide.

• Choose explanatory variables wisely! (E.g. consider omitting one of two highly correlated variables.)

• More advanced solutions: principal components analysis; ridge regression.


15-21

Testing in Multiple Regression

• Testing individual parameters in the model.

• Computing predicted values and associated standard errors.

Model: $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2)$

Overall AOV F-test

H0: None of the explanatory variables is a significant predictor of Y $(\beta_1 = \beta_2 = \cdots = \beta_k = 0)$.

$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-(k+1))}$

Reject H0 if: $F \ge F_{\alpha,\, k,\, n-(k+1)}$
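A sketch of the overall F-test on simulated data, using scipy for the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 40, 2
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(0, 0.4, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
ssr = np.sum((X @ beta_hat - y.mean()) ** 2)

# F = (SSR/k) / (SSE/(n-(k+1))); large F rejects H0.
F = (ssr / k) / (sse / (n - k - 1))
print(F, stats.f.sf(F, k, n - k - 1))  # statistic and p-value
```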


15-22

Standard Error for Partial Slope Estimate

The estimated standard error for $\hat\beta_j$:

$s_{\hat\beta_j} = \hat\sigma_\varepsilon \sqrt{\frac{1}{S_{x_j x_j} \left( 1 - R^2_{x_j \mid x_1 \cdots x_{j-1} x_{j+1} \cdots x_k} \right)}}$

where

$\hat\sigma_\varepsilon = \sqrt{\frac{SSE}{n-(k+1)}}, \qquad S_{x_j x_j} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$

and $R^2_{x_j \mid x_1 \cdots x_{j-1} x_{j+1} \cdots x_k}$ is the coefficient of determination for the model with $x_j$ as the dependent variable and all other x variables as predictors.

What happens if all the predictors are truly independent of each other? Then $R^2_{x_j \mid \cdots} = 0$ and $s_{\hat\beta_j} = \hat\sigma_\varepsilon \sqrt{1 / S_{x_j x_j}}$, as in simple linear regression.

If there is high dependency? $R^2_{x_j \mid \cdots}$ approaches 1 and $s_{\hat\beta_j}$ is inflated.
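Equivalently, the standard errors are the square roots of the diagonal of $\hat\sigma_\varepsilon^2 (X^\top X)^{-1}$; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 2
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(0, 0.4, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# sigma_hat^2 = SSE / (n - (k+1)); standard errors come from the
# diagonal of sigma_hat^2 * (X'X)^{-1}.
sse = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = sse / (n - k - 1)
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
print(se)
```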


15-23

Confidence Interval

100(1-α)% Confidence Interval for $\beta_j$:

$\hat\beta_j \pm t_{\alpha/2,\, n-(k+1)}\, s_{\hat\beta_j}$

The degrees of freedom, n-(k+1), are the df for SSE: the number of data points minus the number of parameters that have to be estimated.
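In practice a package computes these directly; a sketch with statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 60
x = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(0, 0.4, size=n)

# conf_int() returns beta_hat_j +/- t_{alpha/2, n-(k+1)} * s_j
# for every coefficient, including the intercept.
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.conf_int(alpha=0.05))
```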


15-24

Testing whether a partial slope coefficient is equal to zero.

$H_0: \beta_j = 0$

$H_a: \beta_j \ne 0, \quad \beta_j > 0, \quad \beta_j < 0$ (two-sided and one-sided alternatives)

Test Statistic:

$t = \frac{\hat\beta_j}{s_{\hat\beta_j}}$

Rejection Region (matching the alternatives above):

$|t| \ge t_{\alpha/2,\, n-(k+1)}, \qquad t \ge t_{\alpha,\, n-(k+1)}, \qquad t \le -t_{\alpha,\, n-(k+1)}$
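A sketch of the partial-slope t-tests with statsmodels, on simulated data where x2 truly has no effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 60
x = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * x[:, 0] + rng.normal(0, 0.4, size=n)  # no x2 effect

# tvalues are beta_hat_j / s_j; pvalues are two-sided tests of
# H0: beta_j = 0 on n-(k+1) degrees of freedom.
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.tvalues)
print(model.pvalues)
```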


15-25

Predicting Y

• We use the least squares fitted value, $\hat{y}$, as our predictor of a single value of y at a particular value of the explanatory variables $(x_1, x_2, \ldots, x_k)$.

• The corresponding interval about the predicted value of y is called a prediction interval.

• The least squares fitted value also provides the best predictor of E(y), the mean value of y, at a particular value of (x1, x2, ..., xk). The corresponding interval for the mean prediction is called a confidence interval.

• Formulas for these intervals are much more complicated than in the case of SLR; they cannot be calculated by hand (see the book).

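Packages compute both intervals; a sketch with statsmodels' get_prediction on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 60
x = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(0, 0.4, size=n)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals at a new point (x1, x2) = (0.5, -1.0).
new = sm.add_constant(np.array([[0.5, -1.0]]), has_constant="add")
frame = model.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])  # CI for E(y)
print(frame[["obs_ci_lower", "obs_ci_upper"]])    # prediction interval
```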


15-26

Minimum $R^2$ for a “Significant” Regression

Since we have formulas for $R^2$ and F in terms of n, k, SSE, and TSS, we can relate these two quantities.

We can then ask: what is the minimum $R^2$ that will ensure the regression model is declared significant, as measured by the appropriate quantile from the F distribution?

The answer (below) shows that this depends on n, k, and SSE/TSS:

$R^2_{\min} = \frac{k}{n-(k+1)} \cdot \frac{SSE}{TSS} \cdot F_{\alpha,\, k,\, n-(k+1)}$
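A small helper implementing this relation (the function name and the illustrative numbers are ours, not from the slides):

```python
from scipy import stats

def min_r_squared(n: int, k: int, sse_over_tss: float, alpha: float = 0.05) -> float:
    """Smallest R^2 that the overall F-test declares significant."""
    f_crit = stats.f.ppf(1 - alpha, k, n - (k + 1))
    return k / (n - (k + 1)) * sse_over_tss * f_crit

# Example: n = 30 observations, k = 3 predictors, SSE/TSS = 0.7.
print(min_r_squared(30, 3, sse_over_tss=0.7))
```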


15-27

Minimum $R^2$ for Simple Linear Regression (k = 1)