
Statr session 23 and 24


Praxis Analytics Weekend Course


Page 1: Statr session 23 and 24

Simple Regression Analysis

• Bivariate (two-variable) linear regression is the most elementary regression model.
  – The dependent variable, the variable to be predicted, is usually called y.
  – The independent variable, the predictor or explanatory variable, is usually called x.
  – Usually the first step in this analysis is to construct a scatter plot of the data.
• Nonlinear relationships and regression models with more than one independent variable can be explored by using multiple regression models.

Page 2: Statr session 23 and 24

Linear Regression Models

• Deterministic Regression Model, which produces an exact output:

ŷ = β₀ + β₁x

• Probabilistic Regression Model, which includes an error term:

y = β₀ + β₁x + ε

• β₀ and β₁ are population parameters

• β₀ and β₁ are estimated by the sample statistics b₀ and b₁

Page 3: Statr session 23 and 24

Equation of the Simple Regression Line
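The slide presents the standard least-squares formulas for the slope and intercept:

b₁ = SSxy / SSxx = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

b₀ = ȳ − b₁x̄

so the equation of the simple regression line is ŷ = b₀ + b₁x.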

Page 4: Statr session 23 and 24

A typical regression line

[Figure: a regression line plotted in the (x, y) plane, crossing the y-axis at the y-intercept b₀ and rising with slope b₁ = Δy/Δx = tan θ.]

Page 5: Statr session 23 and 24

Hypothesis Tests for the Slope of the Regression Model

• A hypothesis test can be conducted on the sample slope of the regression model to determine whether the population slope is significantly different from zero.

• Using the nonregression model (the ȳ model) as a worst case, the researcher can analyze the regression line to determine whether it adds a more significant amount of predictability of y than does the ȳ model.

Page 6: Statr session 23 and 24

Hypothesis Tests for the Slope of the Regression Model

• As the slope of the regression line diverges from zero, the regression model adds predictability that the ȳ line does not generate.

• Testing the slope of the regression line to determine whether the slope is different from zero is important.

• If the slope is not different from zero, the regression line is doing nothing more than the average line, ȳ, does in predicting y.

Page 7: Statr session 23 and 24

Hypothesis Tests for the Slope of the Regression Model
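The test (applied to the airline example on the following pages) is a t test of the sample slope, in its standard form:

H₀: β₁ = 0        Hₐ: β₁ ≠ 0

t = (b₁ − β₁) / s_b,   where s_b = s_e / √SSxx,   s_e = √(SSE / (n − 2)),   df = n − 2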

Page 8: Statr session 23 and 24

Solving for b₀ and b₁ of the Regression Line: Airline Cost Data

The airline cost data include the costs and associated numbers of passengers for twelve 500-mile commercial airline flights using Boeing 737s during the same season of the year.

Number of Passengers   Cost ($1,000)
61                     4.280
63                     4.080
67                     4.420
69                     4.170
70                     4.480
74                     4.300
76                     4.820
81                     4.700
86                     5.110
91                     5.130
95                     5.640
97                     5.560
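A quick check of the slope and intercept for these data; this is a minimal sketch (not from the slides), using numpy:

# Least-squares fit of cost on passengers for the airline data above.
import numpy as np

x = np.array([61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97])   # passengers
y = np.array([4.280, 4.080, 4.420, 4.170, 4.480, 4.300,
              4.820, 4.700, 5.110, 5.130, 5.640, 5.560])          # cost ($1,000)

n = len(x)
ss_xy = (x * y).sum() - x.sum() * y.sum() / n    # SSxy
ss_xx = (x ** 2).sum() - x.sum() ** 2 / n        # SSxx
b1 = ss_xy / ss_xx                               # slope, about 0.0407
b0 = y.mean() - b1 * x.mean()                    # intercept, about 1.57
print(f"cost-hat = {b0:.2f} + {b1:.4f} * passengers")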

Page 9: Statr session 23 and 24

Hypothesis Test: Airline Cost Example

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

α = .05,   df = n − 2 = 12 − 2 = 10,   t.025,10 = 2.228

If |t| > 2.228, reject H₀.
If −2.228 ≤ t ≤ 2.228, do not reject H₀.

Page 10: Statr session 23 and 24

Hypothesis Test: Airline Cost Example

|t| = 9.44 > 2.228, so reject H₀.

Note: P-value = 0.000

Page 11: Statr session 23 and 24

Hypothesis Test: Airline Cost Example

• The t value calculated from the sample slope falls in the rejection region and the p-value is .00000014.

• The null hypothesis that the population slope is zero is rejected.

• This linear regression model is adding significantly more predictive information than the ȳ model (no regression).

Page 12: Statr session 23 and 24

Comparison of F and t values

• ANOVA can be used to test hypotheses about the difference in two means.
• Analysis of data from two samples by both a t test and ANOVA shows that

observed F = (observed t)² for df_C = 1

• The t test for two independent samples is a special case of one-way ANOVA when there are two treatment levels (df_C = 1).

Page 13: Statr session 23 and 24

Testing the Overall Model

• It is common in regression analysis to compute an F test to determine the overall significance of the model.
• In multiple regression, this test determines whether at least one of the regression coefficients (from multiple predictors) is different from zero.
• Simple regression provides only one predictor and only one regression coefficient to test.
• Because the regression coefficient is the slope of the regression line, the F test for overall significance tests the same thing as the t test in simple regression.

Page 14: Statr session 23 and 24

Testing the Overall Model
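The statistic behind this test is the usual regression F ratio,

F = MSreg / MSerr = (SSreg / k) / (SSerr / (n − k − 1))

where k is the number of independent variables and the critical value is F(α, k, n − k − 1). In simple regression k = 1, so this F equals the square of the slope's t statistic (apart from rounding).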

Page 15: Statr session 23 and 24

Testing the Overall Model

F = 89.09 > 4.96, so reject H₀.

Note: P-value = 0.000

Page 16: Statr session 23 and 24

Testing the Overall Model

• The difference between the F value (89.09) and the value obtained by squaring the t statistic (88.92) is due to rounding error.

• The probability of obtaining an F value this large or larger by chance if there is no regression prediction in this model is .000 according to the ANOVA output (the p-value).

Page 17: Statr session 23 and 24

Estimation

• One of the main uses of regression analysis is as a prediction tool.

• If the regression function is a good model, the researcher can use the regression equation to determine values of the dependent variable from various values of the independent variable.

• In simple regression analysis, a point estimate prediction of y can be made by substituting the associated value of x into the regression equation and solving for y.

Page 18: Statr session 23 and 24

Point Estimation for the Airline Cost Example
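With the slope and intercept computed from the Page 8 data, the fitted line is ŷ = 1.57 + 0.0407x (cost in $1,000, x = number of passengers). As an illustration, take a flight with x = 73 passengers (an assumed value):

ŷ = 1.57 + 0.0407(73) = 4.5411

a predicted cost of about $4,541.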

Page 19: Statr session 23 and 24

Confidence Interval of Estimate of the Conditional Mean of y

• The regression line is determined by a sample set of points. For different samples, the regression equations will be different, yielding different point estimates.
• Hence a confidence interval (CI) of estimation is often useful, because for any value of the independent variable (x) there can be many values of the dependent variable (y).
• One type of CI is an estimate of the average value of y for a given value of x, designated E(y_x).

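In its standard form, this interval for E(y_x) at a given value x₀ is

ŷ ± t(α/2, n−2) · s_e · √(1/n + (x₀ − x̄)² / SSxx)

where s_e is the standard error of the estimate and SSxx = Σx² − (Σx)²/n.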

Page 21: Statr session 23 and 24

Prediction Interval of Estimate of a Single Value of y

• The second type of interval in regression estimation is used to estimate a single value of y for a given value of x.
• The prediction interval (PI) is wider than the CI.
• The PI takes into account all the y values for a given x.
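Its standard form differs from the confidence interval only by the extra 1 under the radical, which accounts for the scatter of individual y values around the conditional mean:

ŷ ± t(α/2, n−2) · s_e · √(1 + 1/n + (x₀ − x̄)² / SSxx)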

Page 22: Statr session 23 and 24

Intervals for Estimation: Airline Cost Example
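A minimal sketch of these intervals for the airline example, continuing from the earlier fitting snippet (x₀ = 73 is an assumed illustrative value):

# 95% CI for E(y|x0) and 95% PI for a single y at x0, reusing
# x, y, n, b0, b1, ss_xx from the airline fit above.
import numpy as np
from scipy import stats

x0 = 73
y_hat = b0 + b1 * x0
resid = y - (b0 + b1 * x)
se = np.sqrt((resid ** 2).sum() / (n - 2))   # standard error of the estimate
t = stats.t.ppf(0.975, df=n - 2)             # t.025,10 = 2.228
core = 1 / n + (x0 - x.mean()) ** 2 / ss_xx
ci = t * se * np.sqrt(core)                  # CI half-width, about 0.122
pi = t * se * np.sqrt(1 + core)              # PI half-width, about 0.413
print(f"CI: {y_hat:.4f} ± {ci:.3f}    PI: {y_hat:.4f} ± {pi:.3f}")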

Page 23: Statr session 23 and 24

Multiple Regression Models

Regression analysis with two or more independent variables or with at least one nonlinear predictor is called multiple regression analysis.

Page 24: Statr session 23 and 24

Regression Models

Probabilistic Multiple Regression Model:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + β_k x_k + ε

y = the value of the dependent (response) variable
β₀ = the regression constant
β₁ = the partial regression coefficient of independent variable 1
β₂ = the partial regression coefficient of independent variable 2
β_k = the partial regression coefficient of independent variable k
k = the number of independent variables
ε = the error of prediction

Page 25: Statr session 23 and 24

Regression Models

• In multiple regression analysis, the dependent variable y is sometimes referred to as the response variable.

• The partial regression coefficient of an independent variable βi represents the increase that will occur in the value of y from a one-unit increase in that independent variable if all other variables are held constant.

• The partial regression coefficients occur because more than one predictor is included in a model.

Page 26: Statr session 23 and 24

Estimated Regression Models
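The estimated counterpart of the probabilistic model, with sample statistics in place of the population parameters, is

ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + … + b_k x_k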

Page 27: Statr session 23 and 24

Multiple Regression Model with 2 Independent Variables (First-Order)

• The simplest multiple regression model is one constructed with two independent variables, where the highest power of either variable is 1 (a first-order regression model).

• In multiple regression analysis, the resulting model produces a response surface.

Page 28: Statr session 23 and 24

Multiple Regression Model with 2 Independent Variables (First-Order)

Population model:

y = β₀ + β₁x₁ + β₂x₂ + ε

where: β₀ = the regression constant
       β₁ = the partial regression coefficient for independent variable 1
       β₂ = the partial regression coefficient for independent variable 2
       ε = the error of prediction

Estimated model:

ŷ = b₀ + b₁x₁ + b₂x₂

where: ŷ = the predicted value of y
       b₀ = the estimate of the regression constant
       b₁ = the estimate of regression coefficient 1
       b₂ = the estimate of regression coefficient 2

Page 29: Statr session 23 and 24

Response Plane for First-Order Two-Predictor Multiple Regression Model

• In multiple regression analysis, the resulting model produces a response surface.
• In the multiple regression model shown on the next slide with two independent first-order variables, the response surface is a response plane.
• The response plane for such a model is fit in a three-dimensional space (x₁, x₂, y).

Page 30: Statr session 23 and 24

Response Plane for First-Order Two-Predictor Multiple Regression Model

Page 31: Statr session 23 and 24

Determining the Multiple Regression Equation

• The simple regression equations for determining the sample slope and intercept given in earlier material are the result of using methods of calculus to minimize the sum of squares of error for the regression model.

• The formulas are established to meet an objective of minimizing the sum of squares of error for the model.

• The regression analysis shown here is referred to as least squares analysis. Methods of calculus are applied, resulting in k + 1 equations with k + 1 unknowns for multiple regression analyses with k independent variables.

Page 32: Statr session 23 and 24

Least Squares Equations for k = 2
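For k = 2, the least squares equations take the standard form of three normal equations in the three unknowns b₀, b₁, b₂:

b₀n + b₁Σx₁ + b₂Σx₂ = Σy
b₀Σx₁ + b₁Σx₁² + b₂Σx₁x₂ = Σx₁y
b₀Σx₂ + b₁Σx₁x₂ + b₂Σx₂² = Σx₂y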

Page 33: Statr session 23 and 24

Multiple Regression Model

• A real estate study was conducted in a small Louisiana city to determine what variables, if any, are related to the market price of a home.

• Suppose the researcher wants to develop a regression model to predict the market price of a home by two variables, “total number of square feet in the house” and “the age of the house.”

Page 34: Statr session 23 and 24

Real Estate Data

Y = Market Price ($1,000), X1 = Square Feet, X2 = Age (Years)

Obs    Y      X1     X2   |  Obs    Y      X1     X2
 1    63.0   1,605   35   |  13    79.7   2,121   14
 2    65.1   2,489   45   |  14    84.5   2,485    9
 3    69.9   1,553   20   |  15    96.0   2,300   19
 4    76.8   2,404   32   |  16   109.5   2,714    4
 5    73.9   1,884   25   |  17   102.5   2,463    5
 6    77.9   1,558   14   |  18   121.0   3,076    7
 7    74.9   1,748    8   |  19   104.9   3,048    3
 8    78.0   3,105   10   |  20   128.0   3,267    6
 9    79.0   1,682   28   |  21   129.0   3,069   10
10    83.4   2,470   30   |  22   117.9   4,765   11
11    79.5   1,820    2   |  23   140.0   4,540    8
12    83.9   2,143    6   |

Page 35: Statr session 23 and 24

Package Output for the Real Estate Example

The regression equation is
Price = 57.4 + 0.0177 Sq.Feet - 0.666 Age

Predictor   Coef       StDev      T       P
Constant    57.35      10.01       5.73   0.000
Sq.Feet     0.017718   0.003146    5.63   0.000
Age         -0.6663    0.2280     -2.92   0.008

S = 11.96    R-Sq = 74.1%    R-Sq(adj) = 71.5%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression        2    8189.7   4094.9   28.63   0.000
Residual Error   20    2861.0    143.1
Total            22   11050.7
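This output can be reproduced from the Page 34 data; a minimal sketch (not from the slides) using statsmodels:

# Fitting Price on Sq.Feet and Age for the real estate data.
import numpy as np
import statsmodels.api as sm

price = [63.0, 65.1, 69.9, 76.8, 73.9, 77.9, 74.9, 78.0, 79.0, 83.4, 79.5,
         83.9, 79.7, 84.5, 96.0, 109.5, 102.5, 121.0, 104.9, 128.0, 129.0,
         117.9, 140.0]
sqft = [1605, 2489, 1553, 2404, 1884, 1558, 1748, 3105, 1682, 2470, 1820,
        2143, 2121, 2485, 2300, 2714, 2463, 3076, 3048, 3267, 3069, 4765,
        4540]
age = [35, 45, 20, 32, 25, 14, 8, 10, 28, 30, 2, 6, 14, 9, 19, 4, 5, 7, 3,
       6, 10, 11, 8]

X = sm.add_constant(np.column_stack([sqft, age]))   # intercept + 2 predictors
model = sm.OLS(price, X).fit()
print(model.summary())   # coefficients ~ (57.35, 0.0177, -0.666), R-Sq ~ 74.1%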

Page 36: Statr session 23 and 24

Predicting the Price of a Home
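Using the fitted equation, a prediction is made by substituting values of the predictors. As an illustration (the input values are assumed, not from the slide), for a 2,500-square-foot home that is 12 years old:

ŷ = 57.35 + 0.017718(2,500) − 0.6663(12) = 57.35 + 44.295 − 7.996 = 93.649

a predicted market price of about $93,649.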

Page 37: Statr session 23 and 24

Evaluating the Multiple Regression Model

Testing the Overall Model:

H₀: β₁ = β₂ = β₃ = … = β_k = 0
Hₐ: At least one of the regression coefficients is ≠ 0

Significance Tests for Individual Regression Coefficients:

H₀: β₁ = 0    Hₐ: β₁ ≠ 0
H₀: β₂ = 0    Hₐ: β₂ ≠ 0
H₀: β₃ = 0    Hₐ: β₃ ≠ 0
⋮
H₀: β_k = 0   Hₐ: β_k ≠ 0

Page 38: Statr session 23 and 24

Testing the Overall Model for the Real Estate Example

• It is important to test the model to determine whether it fits the data well and the assumptions underlying regression analysis are met.

• With simple regression, a t test of the slope of the regression line is used to determine whether the population slope of the regression line is different from zero.

• Failing to reject the null hypothesis indicates that the regression model has no significant predictability for the dependent variable.

Page 39: Statr session 23 and 24

Testing the Overall Model for the Real Estate Example

• A rejection of the null hypothesis indicates that at least one of the independent variables is adding significant predictability for y.

• The F value is 28.63; because p = 0.000, the F value is significant at α = 0.001.

• The null hypothesis is rejected, and there is at least one significant predictor of house price in this analysis.

Page 40: Statr session 23 and 24

Testing the Overall Model for the Real Estate Example

ANOVA
Source             df   SS         MS        F       p
Regression          2    8189.723  4094.86   28.63   .000
Residual (Error)   20    2861.017   143.1
Total              22   11050.74

Page 41: Statr session 23 and 24

Significance Test: Regression Coefficients for the Real Estate Example

t.025,20 = 2.086

              Coefficient   Std Dev    t Stat   p
x1 (Sq.Feet)   0.0177       0.003146    5.63    .000
x2 (Age)      -0.666        0.2280     -2.92    .008

For x1: t = 5.63 > 2.086, so reject H₀. (For x2, |t| = 2.92 > 2.086, so H₀ is rejected as well.)

Page 42: Statr session 23 and 24

Residuals

• The residual, or error, of the regression model is the difference between the actual y value and its predicted value: y − ŷ.

• The residuals for a multiple regression model are solved for in the same manner as they are with simple regression.

• First, a predicted value ŷ is determined by entering the value of each independent variable for a given set of observations into the multiple regression equation.

Page 43: Statr session 23 and 24

Residuals

• Residuals are also helpful in locating outliers.
• Outliers are data points that are apart, or far, from the mainstream of the other data.
• They are sometimes data points that were mistakenly recorded or measured.
• Because every data point influences the regression model, outliers can exert an overly important influence on the model based on their distance from other points.

Page 44: Statr session 23 and 24

Sum of Squares Error

• In an effort to compute a single statistic that can represent the error in a regression analysis, the zero-sum property of the residuals can be overcome by squaring the residuals and then summing the squares.

• Such an operation produces the sum of squares of error, SSE = Σ(y − ŷ)².

Page 45: Statr session 23 and 24

Residuals and Sum of Squares Error for the Real Estate Example

Obs    y       ŷ        y−ŷ       (y−ŷ)²   |  Obs    y        ŷ         y−ŷ       (y−ŷ)²
 1    63.0    62.466    0.534      0.285   |  13    79.7     85.602    -5.902     34.832
 2    65.1    71.465   -6.365     40.517   |  14    84.5     95.383   -10.883    118.438
 3    69.9    71.540   -1.640      2.689   |  15    96.0     85.442    10.558    111.479
 4    76.8    78.622   -1.822      3.319   |  16   109.5    102.772     6.728     45.265
 5    73.9    74.073   -0.173      0.030   |  17   102.5     97.659     4.841     23.440
 6    77.9    75.627    2.273      5.168   |  18   121.0    107.187    13.813    190.799
 7    74.9    82.991   -8.091     65.466   |  19   104.9    109.356    -4.456     19.858
 8    78.0   105.702  -27.702    767.388   |  20   128.0    111.237    16.763    280.982
 9    79.0    68.495   10.505    110.360   |  21   129.0    105.064    23.936    572.936
10    83.4    81.124    2.276      5.181   |  22   117.9    134.447   -16.547    273.815
11    79.5    88.265   -8.765     76.823   |  23   140.0    132.460     7.540     56.854
12    83.9    91.322   -7.422     55.092   |

SSE = Σ(y − ŷ)² = 2,861.017

Page 46: Statr session 23 and 24

General Linear Regression Model

Regression models presented thus far are based on the general linear regression model, which has the form

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + β_k x_k + ε

y = the value of the dependent (response) variable
β₀ = the regression constant
β₁ = the partial regression coefficient of independent variable 1
β₂ = the partial regression coefficient of independent variable 2
β_k = the partial regression coefficient of independent variable k
k = the number of independent variables
ε = the error of prediction

Page 47: Statr session 23 and 24

General Linear Regression Model

• In the general linear model, the parameters, βᵢ, are linear.

• However, the dependent variable, y, is not necessarily linearly related to the predictor variables.

• Multiple regression response surfaces are not restricted to linear surfaces and may be curvilinear.

• Regression models can be developed for more than two predictors.

Page 48: Statr session 23 and 24

Polynomial Regression

• Regression models in which the highest power of any predictor variable is 1 and in which there are no interaction terms are referred to as first-order models.

• If a second independent variable is added, the model is referred to as a first-order model with two independent variables.

• Polynomial regression models are regression models that are second- or higher-order models; they contain squared, cubed, or higher powers of the predictor variable(s).

Page 49: Statr session 23 and 24

Nonlinear Models: Mathematical Transformation

First-order with two independent variables:
y = β₀ + β₁x₁ + β₂x₂ + ε

Second-order with one independent variable:
y = β₀ + β₁x₁ + β₂x₁² + ε

Second-order with an interaction term:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε

Second-order with two independent variables:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂ + ε

Page 50: Statr session 23 and 24

Sales Data and Scatter Plot for 13 Manufacturing Companies

• Consider the table in the next slide.
• The table contains sales for 13 manufacturing companies along with the number of manufacturer's representatives associated with each firm.
• A simple regression analysis to predict sales by the number of manufacturer's representatives results in the Excel output that follows.

Page 51: Statr session 23 and 24

Sales Data and Scatter Plot for 13 Manufacturing Companies

[Scatter plot: Sales against Number of Representatives for the 13 companies.]

Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
 1               2.1                 2
 2               3.6                 1
 3               6.2                 2
 4              10.4                 3
 5              22.8                 4
 6              35.6                 4
 7              57.1                 5
 8              83.5                 5
 9             109.4                 6
10             128.6                 7
11             196.8                 8
12             280.0                10
13             462.3                11

Page 52: Statr session 23 and 24

Excel Simple Linear Regression Output for the Manufacturing Example

Regression Statistics
Multiple R           0.933
R Square             0.870
Adjusted R Square    0.858
Standard Error      51.10
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept    -107.03        28.737           -3.72    0.003
numbers        41.026        4.779            8.58    0.000

ANOVA
             df   SS       MS       F       Significance F
Regression    1   192395   192395   73.69   0.000
Residual     11    28721     2611
Total        12   221117

Page 53: Statr session 23 and 24

Sales Data and Scatter Plot for 13 Manufacturing Companies

• The researcher creates a second predictor variable, (number of manufacturer's representatives)², to use in the regression analysis to predict sales along with the number of manufacturer's representatives.

• This variable can be created to explore second-order parabolic relationships by squaring the data from the independent variable of the linear model and entering it into the analysis.

• With the new data, a multiple regression model can be developed.

Page 54: Statr session 23 and 24

Manufacturing Data with Newly Created Variable

Manufacturer   Sales ($1,000,000)   Number of Mfgr Reps, X₁   (No. Mfgr Reps)², X₂ = X₁²
 1               2.1                  2                          4
 2               3.6                  1                          1
 3               6.2                  2                          4
 4              10.4                  3                          9
 5              22.8                  4                         16
 6              35.6                  4                         16
 7              57.1                  5                         25
 8              83.5                  5                         25
 9             109.4                  6                         36
10             128.6                  7                         49
11             196.8                  8                         64
12             280.0                 10                        100
13             462.3                 11                        121

Page 55: Statr session 23 and 24

Package output for Quadratic Model to Predict Sales

Regression Statistics
Multiple R           0.986
R Square             0.973
Adjusted R Square    0.967
Standard Error      24.593
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept     18.067        24.673            0.73    0.481
MfgrRp       -15.723         9.5450          -1.65    0.131
MfgrRpSq       4.750         0.776            6.12    0.000

ANOVA
             df   SS       MS       F        Significance F
Regression    2   215069   107534   177.79   0.000
Residual     10     6048      605
Total        12   221117
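A minimal sketch of the same quadratic fit (assumed setup, using statsmodels as in the earlier real estate snippet):

# Second-order model for the manufacturing data, created by
# squaring the single predictor and fitting both terms.
import numpy as np
import statsmodels.api as sm

sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5, 109.4,
                  128.6, 196.8, 280.0, 462.3])
reps = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11])

X = sm.add_constant(np.column_stack([reps, reps ** 2]))   # x1 and x1^2
quad = sm.OLS(sales, X).fit()
print(quad.params)     # ~ (18.07, -15.72, 4.75)
print(quad.rsquared)   # ~ 0.973, up from 0.870 for the linear model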

Page 56: Statr session 23 and 24

Tukey's Ladder of Transformations

• Tukey's ladder of expressions can be used to straighten out a plot of x and y.
• Tukey used a four-quadrant approach to show which expressions on the ladder are more appropriate for a given situation.
• If the scatter plot of x and y indicates a shape like that shown in the upper left quadrant, recoding should move "down the ladder" for the x variable (toward √x, log x, −1/x, …) or "up the ladder" for the y variable (toward y², y³, …).
• If the scatter plot of x and y indicates a shape like that of the lower right quadrant, the recoding should move "up the ladder" for the x variable (toward x², x³, …) or "down the ladder" for the y variable (toward √y, log y, −1/y, …).

Page 57: Statr session 23 and 24

Tukey’s Four Quadrant Approach

Page 58: Statr session 23 and 24

Regression Models with Interaction

• When two different independent variables are used in a regression analysis, an interaction may occur between the two variables.

• Interaction can be examined as a separate independent variable

• An interaction predictor variable can be designed by multiplying the data values of one variable by the values of another variable, thereby creating a new variable

Page 59: Statr session 23 and 24

Example – Three Stocks

Suppose the data in the following table represent the closing stock prices for three corporations over a period of 15 months. An investment firm wants to use the prices for stocks 2 and 3 to develop a regression model to predict the price of stock 1.

Page 60: Statr session 23 and 24

Prices of Three Stocks over a 15-Month Period

Stock 1   Stock 2   Stock 3
  41        36        35
  39        36        35
  38        38        32
  45        51        41
  41        52        39
  43        55        55
  47        57        52
  49        58        54
  41        62        65
  35        70        77
  36        72        75
  39        74        74
  33        83        81
  28       101        92
  31       107        91

Page 61: Statr session 23 and 24

Regression Models for the Three Stocks

First-order with two independent variables

Second-order with an interaction term
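The standard forms of these two candidate models are:

First-order with two independent variables:
ŷ = b₀ + b₁x₁ + b₂x₂

Second-order with an interaction term:
ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₁x₂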

Page 62: Statr session 23 and 24

Regression for Three Stocks: First-Order, Two Independent Variables

The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3

Predictor   Coef      StDev    T       P
Constant    50.855    3.791    13.41   0.000
Stock 2     -0.1190   0.1931   -0.62   0.549
Stock 3     -0.0708   0.1990   -0.36   0.728

S = 4.570    R-Sq = 47.2%    R-Sq(adj) = 38.4%

Analysis of Variance
Source       DF   SS       MS       F      P
Regression    2   224.29   112.15   5.37   0.022
Error        12   250.64    20.89
Total        14   474.93

Page 63: Statr session 23 and 24

Regression for Three Stocks: Second-Order with an Interaction Term

The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter

Predictor   Coef        StDev      T       P
Constant    12.046      9.312      1.29    0.222
Stock 2     0.8788      0.2619     3.36    0.006
Stock 3     0.2205      0.1435     1.54    0.153
Inter       -0.009985   0.002314   -4.31   0.001

S = 2.909    R-Sq = 80.4%    R-Sq(adj) = 75.1%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression    3   381.85   127.28   15.04   0.000
Error        11    93.09     8.46
Total        14   474.93

Page 64: Statr session 23 and 24

Regression for Three Stocks: Comparison of Two Models

• The introduction of the interaction term caused R² to increase from 47.2% to 80.4%.

• The standard error of the estimate decreased from 4.570 in the first model to 2.909 in the second model.

• The t ratios of the Stock 2 term and the interaction term are statistically significant in the second model.

• Inclusion of the interaction term helped the model account for a substantially greater amount of the variation in the dependent variable.

Page 65: Statr session 23 and 24

Nonlinear Regression Models: Model Transformation
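A common transformable model of this kind, and the one consistent with the log-transformed data on the next slide, is the exponential model

y = β₀ β₁ˣ ε

Taking the logarithm of both sides yields a model that is linear in the parameters,

log y = log β₀ + (log β₁)x + log ε

so a simple regression of log y on x estimates log β₀ and log β₁.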

Page 66: Statr session 23 and 24

Data Set for Model Transformation Example

ORIGINAL DATA                    TRANSFORMED DATA
Company   Y        X             Company   log Y      X
1          2,580   1.2           1         3.411620   1.2
2         11,942   2.6           2         4.077077   2.6
3          9,845   2.2           3         3.993216   2.2
4         27,800   3.2           4         4.444045   3.2
5         18,926   2.9           5         4.277059   2.9
6          4,800   1.5           6         3.681241   1.5
7         14,550   2.7           7         4.162863   2.7

Y = Sales ($ million/year), X = Advertising ($ million/year)

Page 67: Statr session 23 and 24

Regression Output for Model Transformation Example

Regression Statistics
Multiple R           0.990
R Square             0.980
Adjusted R Square    0.977
Standard Error       0.054
Observations         7

            Coefficients   Standard Error   t Stat   P-value
Intercept    2.9003         0.0729           39.80    0.000
X            0.4751         0.0300           15.82    0.000

ANOVA
             df   SS       MS       F        Significance F
Regression    1   0.7392   0.7392   250.36   0.000
Residual      5   0.0148   0.0030
Total         6   0.7540

Page 68: Statr session 23 and 24

Prediction with the Transformed Model
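The fitted transformed model is log ŷ = 2.9003 + 0.4751x. As an illustration (x = 2.0 is an assumed value, not from the slide), for $2.0 million of advertising:

log ŷ = 2.9003 + 0.4751(2.0) = 3.8505
ŷ = 10^3.8505 ≈ 7,089

predicted sales of about $7,089 million per year on the original scale.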

Page 69: Statr session 23 and 24

Indicator (Dummy) Variables

• Some variables are referred to as qualitative variables.
  – Qualitative variables do not yield quantifiable outcomes.
  – Qualitative variables yield nominal- or ordinal-level information; they are used more to categorize items.

• In regression, qualitative variables are represented by indicator, or dummy, variables.

• If a qualitative variable has c categories, then c − 1 dummy variables must be created.
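A minimal sketch of this coding (the column names here are assumed; pandas):

# Creating c - 1 dummy variables for a qualitative column;
# 'gender' has c = 2 categories, so one dummy column is created.
import pandas as pd

df = pd.DataFrame({"age": [27, 31, 45], "gender": ["F", "M", "F"]})
dummies = pd.get_dummies(df["gender"], drop_first=True, dtype=int)
df = pd.concat([df.drop(columns="gender"), dummies], axis=1)
print(df)   # 'M' column: 0 = female, 1 = male, matching the slide's coding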

Page 70: Statr session 23 and 24

Monthly Salary Example

As an example, consider the issue of sex discrimination in the salary earnings of workers in some industries. In examining this issue, suppose a random sample of 15 workers is drawn from a pool of employed laborers in a particular industry and the workers’ average monthly salaries are determined, along with their age and gender. The data are shown in the following table. As sex can be only male or female, this variable is coded as a dummy variable with 0 = female, 1 = male.

Page 71: Statr session 23 and 24

Data for the Monthly Salary Example

Page 72: Statr session 23 and 24

Regression Output for the Monthly Salary Example

The regression equation is
Salary = 1.732 + 0.111 Age + 0.459 Gender

Predictor   Coef      StDev     T      P
Constant    1.7321    0.2356    7.35   0.000
Age         0.11122   0.07208   1.54   0.149
Gender      0.45868   0.05346   8.58   0.000

S = 0.09679    R-Sq = 89.0%    R-Sq(adj) = 87.2%

Analysis of Variance
Source       DF   SS        MS        F       P
Regression    2   0.90949   0.45474   48.54   0.000
Error        12   0.11242   0.00937
Total        14   1.02191

Page 73: Statr session 23 and 24

Regression Output for the Monthly Salary Example

Page 74: Statr session 23 and 24

MODEL-BUILDING

Suppose a researcher wants to develop a multiple regression model to predict the world production of crude oil. The researcher decides to use as predictors the following five independent variables.
• U.S. energy consumption
• Gross U.S. nuclear electricity generation
• U.S. coal production
• Total U.S. dry gas (natural gas) production
• Fuel rate of U.S.-owned automobiles

Page 75: Statr session 23 and 24

Data for Multiple Regression to Predict Crude Oil Production

Y  = World Crude Oil Production
X1 = U.S. Energy Consumption
X2 = U.S. Nuclear Generation
X3 = U.S. Coal Production
X4 = U.S. Dry Gas Production
X5 = U.S. Fuel Rate for Autos

Y      X1     X2      X3       X4     X5
55.7   74.3    83.5    598.6   21.7   13.30
55.7   72.5   114.0    610.0   20.7   13.42
52.8   70.5   172.5    654.6   19.2   13.52
57.3   74.4   191.1    684.9   19.1   13.53
59.7   76.3   250.9    697.2   19.2   13.80
60.2   78.1   276.4    670.2   19.1   14.04
62.7   78.9   255.2    781.1   19.7   14.41
59.6   76.0   251.1    829.7   19.4   15.46
56.1   74.0   272.7    823.8   19.2   15.94
53.5   70.8   282.8    838.1   17.8   16.65
53.3   70.5   293.7    782.1   16.1   17.14
54.5   74.1   327.6    895.9   17.5   17.83
54.0   74.0   383.7    883.6   16.5   18.20
56.2   74.3   414.0    890.3   16.1   18.27
56.7   76.9   455.3    918.8   16.6   19.20
58.7   80.2   527.0    950.3   17.1   19.87
59.9   81.3   529.4    980.7   17.3   20.31
60.6   81.3   576.9   1029.1   17.8   21.02
60.2   81.1   612.6    996.0   17.7   21.69
60.2   82.1   618.8    997.5   17.8   21.68
60.6   83.9   610.3    945.4   18.2   21.04
60.9   85.6   640.4   1033.5   18.9   21.48

Page 76: Statr session 23 and 24

Regression Analysis for Crude Oil Production

Page 77: Statr session 23 and 24

MODEL-BUILDING : Objectives

• To develop a regression model that accounts for the most variation of the dependent variable

• To make the model simple and economical at the same time

Page 78: Statr session 23 and 24

All Possible Regressions with Five Independent Variables

Single predictor:   X1;  X2;  X3;  X4;  X5

Two predictors:     X1,X2;  X1,X3;  X1,X4;  X1,X5;  X2,X3;  X2,X4;  X2,X5;  X3,X4;  X3,X5;  X4,X5

Three predictors:   X1,X2,X3;  X1,X2,X4;  X1,X2,X5;  X1,X3,X4;  X1,X3,X5;  X1,X4,X5;  X2,X3,X4;  X2,X3,X5;  X2,X4,X5;  X3,X4,X5

Four predictors:    X1,X2,X3,X4;  X1,X2,X3,X5;  X1,X2,X4,X5;  X1,X3,X4,X5;  X2,X3,X4,X5

Five predictors:    X1,X2,X3,X4,X5

Page 79: Statr session 23 and 24

MODEL-BUILDING : Search Procedures

Search procedures are processes whereby more than one multiple regression model is developed for a given database, and the models are compared and sorted by different criteria, depending on the given procedure:
• All Possible Regressions
• Stepwise Regression
• Forward Selection
• Backward Elimination

Page 80: Statr session 23 and 24

MODEL-BUILDING : Stepwise Regression

• Stepwise regression is a step-by-step process that begins by developing a regression model with a single predictor variable and adds and deletes predictors one step at a time, as sketched in the code after this list.
• Perform k simple regressions and select the best as the initial model.
• Evaluate each variable not in the model:
  – If none meets the criterion, stop.
  – Add the best variable to the model; reevaluate previous variables, and drop any that are not significant.
  – Return to the previous step.
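A minimal sketch of this loop (assumptions: a pandas DataFrame df holding the predictors and a target column, and a p-value criterion of .05 for entering and staying), not a production implementation:

# Stepwise selection: add the best predictor by p-value, then drop any
# previously entered predictor that is no longer significant.
import pandas as pd
import statsmodels.api as sm

def stepwise(df: pd.DataFrame, target: str, alpha: float = 0.05) -> list:
    candidates = [c for c in df.columns if c != target]
    chosen = []
    while True:
        # Try each remaining candidate; keep the one with the smallest p-value.
        best_p, best_var = alpha, None
        for var in candidates:
            X = sm.add_constant(df[chosen + [var]])
            p = sm.OLS(df[target], X).fit().pvalues[var]
            if p < best_p:
                best_p, best_var = p, var
        if best_var is None:          # no candidate meets the criterion: stop
            return chosen
        chosen.append(best_var)
        candidates.remove(best_var)
        # Reevaluate earlier entries; drop any that lost significance.
        X = sm.add_constant(df[chosen])
        pvals = sm.OLS(df[target], X).fit().pvalues
        for var in [v for v in chosen if pvals[v] > alpha]:
            chosen.remove(var)
            candidates.append(var)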

Page 81: Statr session 23 and 24

Stepwise: Step 1 - Simple Regression Results for Each Independent Variable

Dependent Variable   Independent Variable   t-Ratio   R²
Y                    X1                     11.77     85.2%
Y                    X2                      4.43     45.0%
Y                    X3                      3.91     38.9%
Y                    X4                      1.08      4.6%
Y                    X5                      3.54     34.2%

Page 82: Statr session 23 and 24

Stepwise Regression

Step 2: Two Predictors

Step 3: Three Predictors

Page 83: Statr session 23 and 24

MODEL-BUILDING : Forward Selection

• Forward selection is like stepwise regression, but once a variable is entered into the process, it is never dropped out.

• Forward selection begins by finding the independent variable that will produce the largest absolute value of t (and largest R2) in predicting y.

Page 84: Statr session 23 and 24

MODEL-BUILDING : Backward Elimination

• Start with the "full model" (all k predictors).
• If all predictors are significant, stop.
• Otherwise, eliminate the most non-significant predictor and return to the previous step.

Page 85: Statr session 23 and 24

MODEL-BUILDING : Backward Elimination

Step 1:

Step 2:

Page 86: Statr session 23 and 24

MODEL-BUILDING : Backward Elimination

Step 3:

Step 4: