REGRESSION MODEL ASSUMPTIONS

The Regression Model

• We have hypothesized that:

$$y = \underbrace{\beta_0 + \beta_1 x}_{\text{Regression}} + \underbrace{\varepsilon}_{\text{Error}}$$

• So far we focused on the regression part – getting the best estimates for the $\beta$'s

• Here we focus on the error term, $\varepsilon$

THE RANDOM VARIABLE, $\varepsilon$

• The error term, $\varepsilon$, is a random variable that describes how the observed values, $y_i$, vary around the regression line.

• For any value of x, $\varepsilon$ has a distribution with a mean and a standard deviation

• At any x value $x_i$, the observed value of the error term is called its residual, given by:

$$e_i = y_i - \hat{y}_i$$

STEP 3: 4 ASSUMPTIONS ABOUT $\varepsilon$

The remainder of our discussion about linear regression assumes the following about $\varepsilon$:

• (1) DISTRIBUTION: $\varepsilon$ is distributed normally

• (2) MEAN: The errors average out to 0, i.e. $E(\varepsilon) = \mu_\varepsilon = 0$

• (3) STANDARD DEVIATION: $\sigma$ is the same at all values of x

• (4) INDEPENDENCE: The errors are independent of each other

What Do These Assumptions Imply About y?

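• $y = \beta_0 + \beta_1 x + \varepsilon$

– $\beta_0 + \beta_1 x$ is a constant for a given value of x

– $\varepsilon$ is normally distributed with mean 0 and standard deviation $\sigma$

• Thus y is normally distributed with standard deviation $\sigma$ and mean E(y):

$$E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = E(\beta_0 + \beta_1 x) + E(\varepsilon) = \beta_0 + \beta_1 x + 0 = \beta_0 + \beta_1 x$$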
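The statement that y has the same standard deviation as the error term follows from the same argument applied to the variance. A short worked step, assuming (as on this slide) that $\beta_0 + \beta_1 x$ is a constant for a given x and writing $\sigma$ for the standard deviation of the error term $\varepsilon$:

$$\mathrm{Var}(y) = \mathrm{Var}(\beta_0 + \beta_1 x + \varepsilon) = \mathrm{Var}(\varepsilon) = \sigma^2 \quad\Longrightarrow\quad \mathrm{SD}(y) = \sigma$$

Adding a constant shifts a distribution without changing its spread or its shape, which is why y stays normal with standard deviation $\sigma$.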

BEST ESTIMATE FOR $\sigma$

• The true value of $\sigma$ is unknown.

• It can be estimated by s as follows:

Degrees of Freedom

• Degrees of freedom = n - (# quantities being estimated).

• Here we are estimating two quantities: $\beta_0$ and $\beta_1$.

• Thus degrees of freedom = n - 2, and

$$s^2 = \frac{SSE}{n-2} \quad \text{and} \quad s = \sqrt{s^2}$$

Hand Calculation of SSE

Recall that $\hat{y}_i = 46486.49 + 52.56757\,x_i$

 i    $x_i$   $y_i$     $\hat{y}_i$   $(y_i - \hat{y}_i)^2$
 1    1200    101000    109567.57     73403214.02
 2     800     92000     88540.54     11967859.75
 3    1000    110000     99054.05     119813732.7
 4    1300    120000    114824.32     26787618.7
 5     700     90000     83283.78     45107560.26
 6     800     82000     88540.54     42778670.56
 7    1000     93000     99054.05     36651570.49
 8     600     75000     78027.03     9162892.622
 9     900     91000     93797.30     7824872.169
10    1100    105000    104310.81     474981.7385
SUM                                   373972972.97

SSE = $\sum_i (y_i - \hat{y}_i)^2$ = 373972972.97

$$s^2 = \frac{SSE}{n-2} = \frac{373972972.97}{8} = 46746621.62$$

$$s = \sqrt{46746621.62} = 6837.15$$

Residual Error: $SSE/(n-2) = s^2$
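The hand calculation can be reproduced in a few lines. The following is a minimal sketch in plain Python, using the ten (x, y) pairs from the table; the helper name `fit_line` is hypothetical and simply applies the usual least-squares formulas for the slope and intercept.

```python
import math

# (x_i, y_i) pairs from the hand-calculation table.
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]

# SSE is the sum of squared residuals; two parameters were estimated,
# so the degrees of freedom are n - 2.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
s_squared = sse / (len(x) - 2)
s = math.sqrt(s_squared)

print(f"b0 = {b0:.2f}, b1 = {b1:.5f}")
print(f"SSE = {sse:.2f}, s^2 = {s_squared:.2f}, s = {s:.2f}")
```

Fitting the line from the raw data, rather than typing in rounded coefficients, avoids rounding drift in the squared residuals.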

Checking the Assumptions

• Many times it is just assumed that the assumptions hold.

• We now show how to check the assumptions.

Residuals

• The assumptions for $\varepsilon$ can be checked using RESIDUAL ANALYSIS.

• A residual, $e_i$, is the observation of $\varepsilon$ at an observed value of x, $x_i$.

• For example, in the Dollar Only example: $y_1 = 101{,}000$ when $x_1 = 1200$

$$\hat{y}_1 = 46486.49 + 52.56757(1200) = 109{,}567.57$$

$$e_1 = 101{,}000 - 109{,}567.57 = -8{,}567.57$$

Standardized Residuals

• Is a residual of -8,567.57 large?

– It depends on the size of a standard error, s.

• Standardized residual = $e_i$/(standard error of $e_i$ for $x_i$).

• Standardized residuals are easier to use to test the assumptions.

• Two typical ways for calculating the standard error of $e_i$ for a particular $x_i$ value are:

[Formulas not recoverable from the source; one of them involves a term h.]

• Both approaches yield substantially the same results.
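One common textbook form for the standard error of a residual is $s\sqrt{1 - h_i}$, with leverage $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. Treat that form as an assumption here, since it is not confirmed to be either of the two formulas the slide refers to. A minimal sketch on the Dollar Only data, with a hypothetical function name:

```python
import math

# Data and fitted coefficients from the Dollar Only example.
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
b0, b1 = 46486.49, 52.56757


def standardized_residuals(x, y, b0, b1):
    """Residuals divided by s * sqrt(1 - h_i), where h_i is the leverage of x_i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    result = []
    for xi, e in zip(x, residuals):
        # Leverage for simple linear regression (assumed textbook form).
        h = 1 / n + (xi - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - h)))
    return result


std_res = standardized_residuals(x, y, b0, b1)
for xi, r in zip(x, std_res):
    print(f"x = {xi:5d}  standardized residual = {r:6.2f}")
```

Dividing by s alone, without the leverage factor, is a simpler approximation that gives similar values when no x is far from the mean.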

Standardized Residuals in Excel

• Excel uses the following formula:

This still gives approximately the same values as the other methods. We will use the ones generated by Excel to check the assumptions.
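As a point of comparison, Excel's Regression tool is commonly described as dividing each residual by the sample standard deviation of the residuals, $\sqrt{SSE/(n-1)}$. That description is an assumption in this sketch, not something confirmed by the slide, and the function name is hypothetical:

```python
import math

# Data and fitted coefficients from the Dollar Only example.
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]
b0, b1 = 46486.49, 52.56757


def excel_style_standard_residuals(x, y, b0, b1):
    """Residuals divided by sqrt(SSE / (n - 1)) (assumed Excel convention)."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    n = len(residuals)
    # Note the n - 1 divisor, unlike s = sqrt(SSE / (n - 2)).
    sd = math.sqrt(sum(e ** 2 for e in residuals) / (n - 1))
    return [e / sd for e in residuals]


excel_res = excel_style_standard_residuals(x, y, b0, b1)
for xi, r in zip(x, excel_res):
    print(f"x = {xi:5d}  standard residual = {r:6.2f}")
```

Because the divisor is the same for every observation, these values differ from residuals divided by s only by a constant factor.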

Checking to See if Errors (Residuals) Appear to Come From a Normal Distribution

TWO WAYS TO CHECK

• Construct a plot of standardized residuals and see if they look normal

– Could use Histogram from Data Analysis

– A “quick check”: standardized residuals are like z-values. Check to see if about 68% are between ±1, 95% between ±2, and virtually all between ±3.

• Look at a normal probability plot. These are statistical plots to check for “normality”. A “perfect” normal distribution would be a straight line on such a plot.
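The “quick check” amounts to counting how many standardized residuals fall within ±1, ±2 and ±3. A minimal sketch, where the list `z` holds hypothetical values for illustration only and is not data from the example:

```python
def fraction_within(z_values, k):
    """Share of standardized residuals whose absolute value is at most k."""
    return sum(1 for z in z_values if abs(z) <= k) / len(z_values)


# Hypothetical standardized residuals, for illustration only.
z = [-1.5, -0.5, 0.2, 0.9, 2.1]

# Approximate shares expected under a normal distribution.
for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    share = fraction_within(z, k)
    print(f"within ±{k}: {share:.0%} (normal: about {expected:.1%})")
```

With only a handful of observations the observed shares can differ noticeably from 68% and 95% even when the errors are normal, so this is a rough screen rather than a formal test.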

Checking to See if $\sigma$ Is Constant

• Look at the residual plot to see if the points seem more spread out at some x’s than at others – in the Dollar Only example, it did not appear so on the Excel residual plot.

• Constant $\sigma$ is called homoscedasticity!

• If the points had looked like those on the next page, there would be less variation at lower values of x than at higher values, and the constant variation assumption would have been violated. This is called heteroscedasticity!

[Figure: Heteroscedasticity – Nonconstant Variance]
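A rough numerical counterpart to eyeballing the residual plot is to compare the spread of the residuals at low x with the spread at high x. The sketch below is one simple way to do that; the function names and the data are hypothetical, and the slides themselves rely only on the plot.

```python
import math


def rms(values):
    """Root-mean-square size of a list of residuals."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))


def spread_by_half(x, residuals):
    """RMS residual for the lower half and the upper half of the x values."""
    pairs = sorted(zip(x, residuals))
    mid = len(pairs) // 2
    low = [e for _, e in pairs[:mid]]
    high = [e for _, e in pairs[mid:]]
    return rms(low), rms(high)


# Hypothetical residuals whose spread grows with x (heteroscedastic pattern).
x = [1, 2, 3, 4, 5, 6]
residuals = [0.5, -0.5, 1.0, -2.0, 3.0, -3.0]

low_spread, high_spread = spread_by_half(x, residuals)
print(f"RMS residual, low x:  {low_spread:.2f}")
print(f"RMS residual, high x: {high_spread:.2f}")
```

Similar spreads in both halves are consistent with constant $\sigma$; a much larger spread in one half points toward heteroscedasticity.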

Checking Independence

• This is mainly for time series data (i.e. the x-axis is time) used in forecasting

• But basically if the data looks like the next slide, the errors are not independent

– In this case whether you have a positive or negative error (residual) depends on the x-value.

– This is called autocorrelation.

[Figure: Example of Autocorrelation (Errors are Dependent on x); x-axis: time]
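One widely used numerical summary for this situation is the Durbin-Watson statistic, $d = \sum_{t \ge 2} (e_t - e_{t-1})^2 / \sum_t e_t^2$, computed on residuals in time order. It is not part of these slides and is shown only as an illustration; the residual lists below are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: a run of positives followed by a run of negatives,
# the kind of pattern in which the sign of the error depends on time.
trending = [1, 1, 1, -1, -1, -1]
# Hypothetical residuals that flip sign at every step.
alternating = [1, -1, 1, -1]

print(f"trending:    d = {durbin_watson(trending):.2f}")
print(f"alternating: d = {durbin_watson(alternating):.2f}")
```

Values near 2 are consistent with no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.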

Residual Analysis in Excel

CHECK:

Residuals

Standardized Residuals

Residual Plots

Normal Probability Plots

Standardized Residuals

• 70% are between ±1

• 100% are between ±2

• “Close” to expected normal values

• Residual values appear to average out to 0 everywhere.

• There is no discernible pattern for the errors.

Normal Probability Plot

• The following is the normal probability plot generated by Excel. Again Excel does it “slightly wrong”, but it should give us a good idea.

• Looks close to a straight line – normality assumption appears valid.

[Figure: Normal Probability Plot; x-axis: Sample Percentile]
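For comparison with the Excel chart, a normal probability plot can also be built by pairing the sorted values with standard normal quantiles. The sketch below assumes the plotting position (i - 0.5)/n, which is one common convention and not necessarily what Excel uses; the function names and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each sorted value with the standard normal quantile of its rank."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting position (i - 0.5) / n is an assumed convention.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Pearson correlation of the plotted points; near 1 means nearly straight."""
    n = len(points)
    qs = [q for q, _ in points]
    vs = [v for _, v in points]
    q_bar = sum(qs) / n
    v_bar = sum(vs) / n
    sqv = sum((q - q_bar) * (v - v_bar) for q, v in points)
    sqq = sum((q - q_bar) ** 2 for q in qs)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return sqv / math.sqrt(sqq * svv)


# Hypothetical standardized residuals, for illustration only.
z = [-1.3, 0.5, 1.7, 0.8, 1.0, -1.0, -0.9, -0.5, -0.4, 0.1]
points = normal_plot_points(z)
print(f"correlation of plotted points: {straightness(points):.3f}")
```

A correlation close to 1 corresponds to the "close to a straight line" judgement made by eye on the plot.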

Review

• 4 assumptions about $\varepsilon$

1. $\varepsilon$ is normal.

2. $\mu_\varepsilon = E(\varepsilon) = 0$.

3. $\sigma$ is the same for all values of x.

4. Errors are independent.

• Checking the Assumptions

– Check residual plot to see if variation changes for different values of x.

– Check normality assumption by a normal probability plot or by creating a histogram of standardized residuals.

• Does it appear normal and centered around 0?

• Are about 68% between ±1, 95% between ±2, almost all between ±3?
