Chapter 13

Simple Linear Regression

13.1: Types of Regression Models

Lots of Terms:

• Dependent or response variable

• Independent or predictor or explanatory variable

• Simple linear regression (SLR)

• Linear relationship

Formula 13.1, page 515

• Simple Linear Regression Model

• Beta0 Intercept

• Beta1 Slope

• Epsilon: noise, random error, stochastic error

• Figure 13.2, page 515 (linearity)
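For reference, the model named in the bullets above is conventionally written as shown here. The observation subscript i is a notational assumption, and the textbook's Formula 13.1 may use slightly different symbols.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
```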

Relationship to Previous Work

• Where’s the mean?

• Where are the hypotheses?

13.2: Determining the Simple Linear Regression Equation

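• We want the best fitting line.

• We use the method of “Least-Squares.” It gives the “best fitting” line in the sense that it minimizes the sum of squared differences between the actual and predicted values of Y.

• Estimate Beta0 with b0.

• Estimate Beta1 with b1.

• The “b” values are called “coefficients.”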
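A minimal sketch of how b0 and b1 can be computed by least squares, using the standard closed-form results. The function name `least_squares` and the five data points are hypothetical, invented for illustration; in practice the slides expect software output to supply these values.

```python
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares of X about their means.
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ssxy / ssx            # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

For this made-up data the fitted line is roughly y-hat = 2.2 + 0.6x.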

Least Squares Results

• The slope is b1.

• Interpretation: for each 1 unit increase in X, the average or predicted value of Y changes by b1 units.

• Underlined terms should be defined in the context of the problem.

Least Squares Results

• The intercept is b0.

• Interpretation: when X is 0 units, the average or predicted value of Y is b0.

• Very often, X = 0 is outside the range of the data (i.e., the “relevant range”).

• Interpretations of b0 should be made cautiously. Beware of extrapolation.

13.3: Measures of Variation

• “Usefulness” is an important concept.

• Two types of usefulness: statistical and practical.

• Statistical usefulness almost always requires managing the “sums of squares.”

• Practical usefulness can be assessed by managing the sums of squares or by obtaining an informed opinion.

Calculating Sums of Squares

The three most important sums of squares are shown on page 526:

• SST = SSR + SSE

• SST results from summing up squared deviations between actual values of Y and Y-bar.

• SSR = variation in Y accounted for by the regression.

• SSE = variation in Y NOT accounted for by the regression.
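The decomposition above can be checked numerically. This is a small sketch on hypothetical data (invented for illustration); the function name `sums_of_squares` is likewise an assumption, not something from the text.

```python
from statistics import mean

def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line fitted to x and y."""
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by regression
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained
    return sst, ssr, sse

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
```

Up to floating-point rounding, SSR and SSE add up to SST.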

Using Sums of Squares

• Raw sums of squares are not that helpful.

• The Coefficient of Determination (r²) can be calculated with formula 13.7.

• Interpretation: (r² × 100) percent of the variation in variable Y is explained by the variation in variable X in this data set.

• 0 ≤ r² ≤ +1
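In terms of the sums of squares, the coefficient of determination is usually written as below. This is the standard definition and is presumably what the cited textbook formula states; the second equality follows from SST = SSR + SSE.

```latex
r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad 0 \le r^2 \le 1
```

For example, SSR = 3.6 and SST = 6 would give r² = 0.60, so 60 percent of the variation in Y is explained.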

Using Sums of Squares

• Standard Error of the Estimate

• Measures the variability of the actual values of Y around the predicted values of Y.

• Formula 13.13, page 530.

• Interpretation: The general variability of Y around the fitted line is “standard error of the estimate” units.
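A short sketch of the calculation, assuming the usual definition: the square root of SSE divided by n - 2. The function name `standard_error_of_estimate` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def standard_error_of_estimate(x, y):
    """Return S_YX = sqrt(SSE / (n - 2)) for the least-squares line."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sqrt(sse / (n - 2))   # n - 2 because two coefficients are estimated

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_yx = standard_error_of_estimate(x, y)
print(f"S_YX = {s_yx:.4f}")
```

The result is in the same units as Y, which is what makes the interpretation above possible.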

13.4: Assumptions

13.5: Residual Analysis

There are 4 assumptions of Regression

• Normality of Error

• Homoscedasticity

• Independence of Errors

• Linearity.

Errors

• What are the “errors”?

• In Equation 13.1, errors are the epsilons.

• We do not know the epsilons; they live in the population.

• We approximate the epsilons with sample data: residual = y - yhat. See formula 13.14.

• If the residuals meet the assumptions, then we feel better about the usefulness of our analysis.

• If the residuals do not meet the assumptions, then we need a new analysis technique.
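As a sketch, the residuals can be computed directly from a fitted line. The data and the function name `residuals_from_fit` are hypothetical, invented for illustration.

```python
from statistics import mean

def residuals_from_fit(x, y):
    """Return the list of residuals y - yhat for the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
residuals = residuals_from_fit(x, y)
print([round(e, 2) for e in residuals])
```

Least-squares residuals always sum to zero (up to rounding) when the model includes an intercept, so the plots and tests that follow look at their pattern and spread, not their total.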

Linearity

• Examine the plot of X versus the residuals.

• Example: see Figure 13.12.

• There should not be a pattern.

• A pattern means that the linear regression was not effective at explaining the variation in Y, i.e., the SST.

Notes on Residual Analysis

• There are a LOT of techniques that can be used to examine residuals.

• You are trying to assess the validity of assumptions.

• Each observation produces a residual.

• The process of calculating Studentized Residuals allows you to look at each observation to see if it produced a “strange” residual.
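One common way to compute studentized residuals is sketched below. It assumes the "internally studentized" definition, e_i / (S_YX * sqrt(1 - h_i)), where h_i is the leverage of observation i; the text's exact formula is not quoted in these notes, so treat this as an assumption. The data and function name are hypothetical.

```python
from math import sqrt
from statistics import mean

def studentized_residuals(x, y):
    """Return e_i / (S_YX * sqrt(1 - h_i)) for each observation."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_yx = sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    # Leverage: observations far from x-bar get larger h_i.
    h = [1 / n + (xi - x_bar) ** 2 / ssx for xi in x]
    return [ei / (s_yx * sqrt(1 - hi)) for ei, hi in zip(e, h)]

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sr = studentized_residuals(x, y)
print([round(v, 3) for v in sr])
```

Values far from zero flag observations worth a closer look; any specific cutoff is a judgment call and is not given in these notes.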

Assumption of Normality

• For any value of X, the errors (and residuals) conform to a normal distribution.

• At this point, we subjectively assess with graphical means.

• Histogram of residuals.

• Normal probability plot.

Homoscedasticity

• For all X, the errors and residuals should have constant, or same, variance.

• Assess subjectively by looking at the graph of residuals versus predicted values (or studentized residuals versus X).

• Assess numerically by performing a test of equal variances (divide the set of residuals in half and test).

• Figure 13.16 shows a problem.

Independence

• Previous residuals are not correlated with current and future residuals.

• Assess by plotting the residuals in order of observation.

• A formal procedure called the “Durbin-Watson test” exists.

• Usually only a problem with time-series data—data observed over time.
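The Durbin-Watson statistic itself is simple to compute from residuals listed in time order. This sketch uses the standard definition; the residual values are hypothetical, and judging significance still requires the Durbin-Watson tables or software.

```python
def durbin_watson(residuals):
    """Return D = sum((e_i - e_{i-1})^2) / sum(e_i^2), residuals in time order."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals in order of observation, invented for illustration.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
d = durbin_watson(residuals)
print(f"D = {d:.3f}")
```

D ranges from 0 to 4. Values near 2 suggest little first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.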

13.7: Inferences about the slope and correlation coefficient.

• There is a “t test” for the slope.

• There is an “F test” for the overall explanatory value of the regression line.

• There is a confidence interval estimate for the slope (skip).

• There is a “t test” for the significance of the correlation coefficient (skip).

“t-test for Slope”

• Test follows the standard hypothesis test pattern.

• Like all “t-tests,” the test statistic follows the usual format, shown in Equation 13.16, page 542.

• Like all analyses on large data sets, we look to the computer output for answers.
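To make the test statistic concrete, here is a sketch for the null hypothesis that the population slope is zero. It assumes the usual form t = b1 / S_b1 with S_b1 = S_YX / sqrt(SSX) and n - 2 degrees of freedom; the data and the function name `slope_t_statistic` are hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """Return (t, df) for testing H0: Beta1 = 0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s_yx / sqrt(ssx)      # standard error of the slope
    return b1 / s_b1, n - 2

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, df = slope_t_statistic(x, y)
print(f"t = {t_stat:.4f} with {df} degrees of freedom")
```

The statistic is then compared with a critical value from the t table, or software reports the p-value directly.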

“F Test”

• The formula for the F Test is shown in Equation 13.17, page 544.

• This test should look familiar to you: it was developed in the ANOVA section.

• Even though the text discusses this test in terms of the slope, the more general form is the more useful.
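A sketch of the F statistic for simple linear regression, assuming the ANOVA form F = MSR / MSE with 1 and n - 2 degrees of freedom. The data and function name are hypothetical.

```python
from statistics import mean

def regression_f_statistic(x, y):
    """Return F = MSR / MSE for a regression with one X variable."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    msr = ssr / 1          # one independent variable
    mse = sse / (n - 2)
    return msr / mse

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
f_stat = regression_f_statistic(x, y)
print(f"F = {f_stat:.2f}")
```

With a single X variable, F equals the square of the t statistic for the slope, so the two tests agree. The general form matters once there is more than one slope.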

13.8: Estimation of Mean Values and Prediction of Individual Values

• CI for a mean value of Y (13.20)

• PI for a value of Y (13.21)

• Won’t have to calculate by hand but might need to interpret.
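To help with interpretation, the sketch below computes both half-widths at a chosen X value. It assumes the usual forms: t * S_YX * sqrt(h) for the mean of Y and t * S_YX * sqrt(1 + h) for an individual Y, with h = 1/n + (x0 - x-bar)² / SSX. The data, the function name `intervals`, and the choice x0 = 3 are hypothetical; the critical value is read from a t table and passed in.

```python
from math import sqrt
from statistics import mean

def intervals(x, y, x0, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at X = x0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = sqrt(sse / (n - 2))
    h = 1 / n + (x0 - x_bar) ** 2 / ssx
    y_hat = b0 + b1 * x0
    ci_half = t_crit * s_yx * sqrt(h)       # for the mean of Y at x0
    pi_half = t_crit * s_yx * sqrt(1 + h)   # for one individual Y at x0
    return y_hat, ci_half, pi_half

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T_CRIT = 3.1824  # two-tailed 95% t critical value for n - 2 = 3 df, from a t table
y_hat, ci_half, pi_half = intervals(x, y, x0=3, t_crit=T_CRIT)
print(f"CI: {y_hat:.2f} +/- {ci_half:.2f}   PI: {y_hat:.2f} +/- {pi_half:.2f}")
```

The prediction interval is always wider than the confidence interval at the same X, because an individual value varies around the mean in addition to the uncertainty in the estimated mean.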

13.9: Pitfalls in Regression and Ethical Issues

Page 554 lists some difficulties.

• Lack of awareness of assumptions.

• Unable to evaluate assumptions.

• Unable to proceed if assumptions are violated (what are the alternatives?).

• No knowledge of subject area.

Moral of the Story

• Examine assumptions.

• ALWAYS plot the data.

– The 4 sets of data in Table 13.7 all have the same regression results, BUT they look vastly different.

Some Objectives

• Find the regression coefficients using software or software output.

• Interpret the slope, r², and the standard error of the estimate.

• Perform a test of hypothesis on the slope.

• Perform a test of hypothesis on “all of the slopes.”

• Write confidence intervals and prediction intervals for Y given X.

• Evaluate assumptions when given computer output.

• Suggest the type of output necessary to evaluate assumptions.

• Name some difficulties of using simple regression.
