REGRESSION MODEL ASSUMPTIONS

The Regression Model

• We have hypothesized that:

$y = \beta_0 + \beta_1 x + \varepsilon$

where $\beta_0 + \beta_1 x$ is the regression part and $\varepsilon$ is the error part.

• So far we focused on the regression part – getting the best estimates for the $\beta$'s

• Here we focus on the error term, $\varepsilon$

THE RANDOM VARIABLE, $\varepsilon$

• The error term, $\varepsilon$, is a random variable that describes how the observed values, $y_i$, vary around the regression line.

• For any value of x, $\varepsilon$ has a distribution with a mean, $\mu$, and a standard deviation, $\sigma$

• At any x value $x_i$, the observed value of the error term is called its residual, given by:

$e_i = y_i - \hat{y}_i$

STEP 3: 4 ASSUMPTIONS ABOUT $\varepsilon$

The remainder of our discussion about linear regression assumes the following about $\varepsilon$:

• (1) DISTRIBUTION: $\varepsilon$ is normally distributed

• (2) MEAN: The errors average out to 0, i.e. $E(\varepsilon) = \mu = 0$

• (3) STANDARD DEVIATION: $\sigma$ is the same at all values of x

• (4) INDEPENDENCE: The errors are independent of each other

What Do These Assumptions Imply About y?

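• $y = \beta_0 + \beta_1 x + \varepsilon$. Here $\beta_0 + \beta_1 x$ is a constant for a given value of x, and $\varepsilon$ is normally distributed with mean 0 and standard deviation $\sigma$.

• Thus y is normally distributed with standard deviation $\sigma$ and mean $E(y)$:

$E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = E(\beta_0 + \beta_1 x) + E(\varepsilon) = \beta_0 + \beta_1 x + 0 = \beta_0 + \beta_1 x$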

BEST ESTIMATE FOR $\sigma$

• The true value of $\sigma$ is unknown.

• It can be estimated by s as follows:

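$s^2 = \dfrac{SSE}{\text{Degrees of Freedom}} = \dfrac{\sum (y_i - \hat{y}_i)^2}{\text{Degrees of Freedom}}$

• Degrees of freedom = n - (# quantities being estimated).

• Here we are estimating two quantities: $\beta_0$ and $\beta_1$.

• Thus degrees of freedom = n - 2, and

$s^2 = \dfrac{SSE}{n-2} = \dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}$ and $s = \sqrt{s^2}$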

Hand Calculation of SSE

Recall that $\hat{y}_i = 46486.49 + 52.56757\,x_i$.

i      $x_i$    $y_i$      $\hat{y}_i$     $(y_i - \hat{y}_i)^2$
1      1200     101000     109567.57       73403214.02
2      800      92000      88540.54        11967859.75
3      1000     110000     99054.05        119813732.7
4      1300     120000     114824.32       26787618.7
5      700      90000      83283.78        45107560.26
6      800      82000      88540.54        42778670.56
7      1000     93000      99054.05        36651570.49
8      600      75000      78027.03        9162892.622
9      900      91000      93797.30        7824872.169
10     1100     105000     104310.81       474981.7385
SUM                                        373972972.97 = SSE

$s^2 = \dfrac{SSE}{n-2} = \dfrac{373972972.97}{8} = 46746621.62$

$s = \sqrt{46746621.62} = 6837.15$

[Figure: Excel regression output, with labels marking s and, in the Residual (error) row, SSE and SSE/(n-2) = $s^2$]
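The hand calculation can be cross-checked with a short script. This is a minimal sketch in plain Python, assuming the ten (x, y) pairs in the table above are the full Dollar Only data set; the variable names are illustrative and do not come from the slides.

```python
import math

# Dollar Only data as listed in the hand-calculation table.
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
n = len(x)

# Least-squares estimates b0 and b1 of beta_0 and beta_1.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - y_hat_i, then SSE, s^2 and s.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s_squared = sse / (n - 2)  # two quantities estimated, so n - 2 degrees of freedom
s = math.sqrt(s_squared)

print(f"b0 = {b0:.2f}, b1 = {b1:.5f}")
print(f"SSE = {sse:.2f}, s^2 = {s_squared:.2f}, s = {s:.2f}")
```

Run as written, this should reproduce the fitted line and the values SSE = 373972972.97 and s = 6837.15 from the table, up to rounding.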

Checking the Assumptions

• Many times it is just assumed that the assumptions hold.

• We now show how to check the assumptions.

Residuals

• The assumptions for $\varepsilon$ can be checked using RESIDUAL ANALYSIS.

• A residual, $e_i$, is the observation of $\varepsilon$ at an observed value of x, $x_i$.

• For example, in the Dollar Only example, $y_1 = 101{,}000$ when $x_1 = 1200$:

$\hat{y}_1 = 46486.49 + 52.56757(1200) = 109{,}567.57$

$e_1 = 101{,}000 - 109{,}567.57 = -8{,}567.57$

Standardized Residuals

• Is a residual of -8,567.57 large?

– It depends on the size of a standard error, s.

• Standardized residual = $e_i$ / (standard error of $e_i$ for $x_i$).

• Standardized residuals are easier to use to test the assumptions.

• Two typical ways of calculating the standardized residual, which differ in the standard error used for $e_i$ at a particular $x_i$ value, are:

$\dfrac{e_i}{s}$ and $\dfrac{e_i}{s\sqrt{1 - h_i}}$, where $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$

• Both approaches yield substantially the same results.
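To see how close the two standardizations are, both can be computed for the Dollar Only data. The following is a sketch only, assuming the ten data points from the hand-calculation table; the names `z_simple` and `z_leverage` are illustrative and not taken from the slides.

```python
import math

# Dollar Only data (assumed to be the full data set).
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
n = len(x)

# Least-squares fit and residuals.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Method 1: divide every residual by s.
z_simple = [e / s for e in residuals]

# Method 2: divide by s * sqrt(1 - h_i), with h_i = 1/n + (x_i - x_bar)^2 / S_xx.
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
z_leverage = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(residuals, h)]

for i, (a, b) in enumerate(zip(z_simple, z_leverage), start=1):
    print(f"i = {i:2d}: e_i/s = {a:7.3f}, e_i/(s*sqrt(1-h_i)) = {b:7.3f}")
```

Because $\sqrt{1 - h_i}$ is less than 1, the second version is never smaller in absolute value than the first, and the gap is largest for x values far from $\bar{x}$.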

Standardized Residuals in Excel

• Excel uses the following formula:

Standardized residual $= \dfrac{e_i}{s\sqrt{\dfrac{n-2}{n-1}}}$

• This still gives approximately the same values as the other methods. We will use the ones generated by Excel to check the assumptions.
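Since $s^2 = SSE/(n-2)$, the denominator $s\sqrt{(n-2)/(n-1)}$ equals $\sqrt{SSE/(n-1)}$, which is the ordinary sample standard deviation of the residuals. The sketch below checks this numerically on the Dollar Only data (assumed to be the ten points in the hand-calculation table); it uses Python's `statistics.stdev` and does not call Excel.

```python
import math
import statistics

# Dollar Only data (assumed to be the full data set).
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
n = len(x)

# Least-squares fit and residuals.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# The formula from the slide: e_i / (s * sqrt((n - 2) / (n - 1))).
z_excel = [e / (s * math.sqrt((n - 2) / (n - 1))) for e in residuals]

# Same quantity via the sample standard deviation (n - 1 divisor) of the residuals.
sd_resid = statistics.stdev(residuals)
z_check = [e / sd_resid for e in residuals]

print([round(z, 3) for z in z_excel])
```

The two lists agree to floating-point precision because least-squares residuals average to zero.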

Checking to See if Errors (Residuals) Appear to Come From a Normal Distribution

TWO WAYS TO CHECK

• Construct a plot of standardized residuals and see if they look normal

– Could use Histogram from Data Analysis

– A “quick check”: standardized residuals are like z-values. Check to see if about 68% are between ±1, 95% between ±2, and virtually all between ±3.

• Look at a normal probability plot. These are statistical plots to check for “normality”. A “perfect” normal distribution would be a straight line on such a plot.
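The “quick check” amounts to counting how many standardized residuals fall inside each band. A minimal sketch follows; the function name `share_within` and the demo values are hypothetical and chosen only to illustrate the counting.

```python
def share_within(z_values, k):
    """Fraction of standardized residuals whose absolute value is at most k."""
    return sum(1 for z in z_values if abs(z) <= k) / len(z_values)


# Hypothetical standardized residuals, for illustration only (not from the slides).
z_demo = [-2.5, -0.5, 0.2, 1.5]

# Compare observed shares with the approximate normal benchmarks.
for k, benchmark in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    observed = share_within(z_demo, k)
    print(f"within ±{k}: observed {observed:.0%}, normal benchmark about {benchmark:.1%}")
```

With only a handful of residuals the observed shares move in large steps, so modest departures from 68% and 95% are not by themselves evidence against normality.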

Checking to See if $\sigma$ Is Constant

• Look at the residual plot to see if the points seem more spread out at some x’s than at others – in the Dollar Only example, it did not appear so on the Excel residual plot.

• Constant $\sigma$ is called homoscedasticity!

• If the points had looked like the following plot, then for lower values of x there would be less variation than at higher values, and the constant variation assumption would have been violated. This is called heteroscedasticity!

[Figure: residual plot of e against x, titled “Heteroscedasticity – Nonconstant Variance”]
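The slides check constant variance by eye on the residual plot. As a rough numeric companion, and not a method from the slides or a formal test, one can compare the typical size of the residuals in the lower and upper halves of x. The sketch below uses simulated data; the helper names, the seed, and the simulation settings are all assumptions made for illustration.

```python
import math
import random


def fit_residuals(x, y):
    """Residuals e_i = y_i - y_hat_i from a simple least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def spread_ratio(x, residuals):
    """Root-mean-square residual in the upper half of x divided by the lower half."""
    ordered = [e for _, e in sorted(zip(x, residuals))]
    half = len(ordered) // 2
    rms = lambda values: math.sqrt(sum(e * e for e in values) / len(values))
    return rms(ordered[half:]) / rms(ordered[:half])


rng = random.Random(0)  # fixed seed so the illustration is repeatable
x = [1 + 9 * i / 399 for i in range(400)]

# Constant sigma (homoscedastic) versus sigma growing with x (heteroscedastic).
y_constant = [2 + 3 * xi + rng.gauss(0, 2.0) for xi in x]
y_growing = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

ratio_constant = spread_ratio(x, fit_residuals(x, y_constant))
ratio_growing = spread_ratio(x, fit_residuals(x, y_growing))
print(f"constant sigma: ratio = {ratio_constant:.2f}")
print(f"growing sigma:  ratio = {ratio_growing:.2f}")
```

A ratio near 1 is consistent with constant $\sigma$, while a ratio well above 1 matches the fan shape described above. The residual plot remains the primary check.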

Checking Independence

• This is mainly for time series data (i.e. the x-axis is time) used in forecasting

• But basically, if the data look like the following plot, the errors are not independent

– In this case whether you have a positive or negative error (residual) depends on the x-value.

– This is called autocorrelation.

[Figure: Y plotted against X = time, titled “Example of Autocorrelation (Errors are Dependent on x)”]
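One simple way to put a number on this pattern is the lag-1 correlation of the residuals taken in time order. This statistic is not named in the slides; it is an assumption introduced here for illustration. The data below are a made-up wave around a straight line, so neighbouring residuals tend to share the same sign.

```python
import math


def lag1_autocorrelation(residuals):
    """Sum of e_t * e_(t+1) divided by the sum of e_t^2 (residual mean taken as 0)."""
    numerator = sum(a * b for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical time series: a linear trend plus a slow wave (illustration only).
t = list(range(40))
y = [2 + 0.5 * ti + 3 * math.sin(2 * math.pi * ti / 20) for ti in t]

# Least-squares line through (t, y) and its residuals in time order.
n = len(t)
t_bar, y_bar = sum(t) / n, sum(y) / n
s_tt = sum((ti - t_bar) ** 2 for ti in t)
b1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / s_tt
b0 = y_bar - b1 * t_bar
residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]

r1 = lag1_autocorrelation(residuals)
print(f"lag-1 autocorrelation of residuals: {r1:.2f}")
```

Values near 0 are what independent errors would tend to produce. A value close to 1, as for this wave, means each residual largely predicts the next one.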

Residual Analysis in Excel

CHECK:

• Residuals

• Standardized Residuals

• Residual Plots

• Normal Probability Plots

Standardized Residuals

• 70% are between ±1

• 100% are between ±2

• “Close” to expected normal values

• Residual values appear to average out to 0 everywhere.

• There is no discernible pattern for the errors.

Normal Probability Plot

• The following is the normal probability plot generated by Excel. Again Excel does it “slightly wrong”, but it should give us a good idea.

• Looks close to a straight line – normality assumption appears valid.

[Figure: Normal Probability Plot; x-axis: Sample Percentile (0 to 100), y-axis: Sales (0 to 150,000)]
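A more conventional version of this plot pairs the sorted standardized residuals with standard normal quantiles, and the "straight line" check can be summarized by their correlation. The sketch below is an illustration under stated assumptions: it uses the Dollar Only data from the hand-calculation table, the simple standardization $e_i/s$, and plotting positions $(i - 0.5)/n$, which is one common convention rather than anything specified in the slides.

```python
import math
from statistics import NormalDist

# Dollar Only data (assumed to be the full data set).
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
n = len(x)

# Least-squares fit, residuals, and s.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Plot coordinates: normal quantiles (horizontal) vs sorted standardized residuals.
z_sorted = sorted(e / s for e in residuals)
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


r = correlation(quantiles, z_sorted)
print(f"correlation between normal quantiles and sorted residuals: {r:.3f}")
```

A correlation close to 1 means the points lie near a straight line, which agrees with the visual conclusion that the normality assumption appears valid.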

Review

• 4 assumptions about $\varepsilon$

1. $\varepsilon$ is normal.

2. $\mu = E(\varepsilon) = 0$.

3. $\sigma$ is the same for all values of x.

4. Errors are independent.

• Checking the Assumptions

– Check residual plot to see if variation changes for different values of x.

– Check normality assumption by a normal probability plot or by creating a histogram of standardized residuals.

• Does it appear normal and centered around 0?

• Are about 68% between ±1, 95% between ±2, almost all between ±3?
