Week 6 in Class Lecture

8/10/2019 Week 6 in Class Lecture

1/70

Week 6

Simple Linear Regression


2/70

Models

Representation of some phenomenon

A mathematical model is a mathematical expression of some phenomenon

Models often describe relationships between variables

Types:

Deterministic models

Probabilistic models


3/70

Deterministic Models

Hypothesize exact relationships

Suitable when prediction error is negligible

Example: force is exactly mass times acceleration

\(F = ma\)


4/70

Probabilistic Models

Hypothesize two components

Deterministic

Random error

Example: sales volume (y) is 10 times advertising spending (x) plus random error

\(y = 10x + \varepsilon\)

Random error may be due to factors other than advertising


5/70

General Form of Probabilistic Models

y = Deterministic component + Random error

where y is the variable of interest.

We always assume that the mean value of the random error equals 0, so that

E(y) = Deterministic component
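The statement E(y) = deterministic component can be checked with a small simulation. The sketch below reuses the sales model y = 10x + random error from the earlier slide; the normal error distribution, its standard deviation `sigma`, the seed, and the function name `simulate_sales` are assumptions made for illustration only.

```python
import random


def simulate_sales(x, rng, sigma=2.0):
    """One observation from y = 10x + error, where the error has mean 0.

    The normal error with standard deviation `sigma` is an assumed choice.
    """
    return 10 * x + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed so the run is repeatable
x = 5.0
ys = [simulate_sales(x, rng) for _ in range(10_000)]
mean_y = sum(ys) / len(ys)

# The average of many simulated y values should sit close to 10 * x.
print(f"deterministic component: {10 * x:.2f}, simulated mean of y: {mean_y:.2f}")
```

Individual values of y scatter around 50, but their average settles near the deterministic component because the errors average out to roughly 0.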


6/70

A First-Order (Straight Line) Probabilistic Model

\(y = \beta_0 + \beta_1 x + \varepsilon\)

where

y = Dependent or response variable (variable to be modeled)

x = Independent or predictor variable (variable used as a predictor of y)

\(E(y) = \beta_0 + \beta_1 x\) = Deterministic component

\(\varepsilon\) (epsilon) = Random error component


7/70


A First-Order (Straight Line) Probabilistic Model

\(y = \beta_0 + \beta_1 x + \varepsilon\)

\(\beta_0\) (beta zero) = y-intercept of the line, that is, the point at which the line intercepts or cuts through the y-axis

\(\beta_1\) (beta one) = slope of the line, that is, the change (amount of increase or decrease) in the deterministic component of y for every 1-unit increase in x


8/70


A First-Order (Straight Line) Probabilistic Model

A positive slope implies that E(y) increases by the amount \(\beta_1\) for each unit increase in x.

A negative slope implies that E(y) decreases by the amount \(|\beta_1|\) for each unit increase in x.


9/70

Five-Step Procedure

Step 1: Hypothesize the deterministic component of the model that relates the mean, E(y), to the independent variable x.

Step 2: Use the sample data to estimate unknown parameters in the model.

Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.

Step 4: Statistically evaluate the usefulness of the model.

Step 5: When satisfied that the model is useful, use it for prediction, estimation, and other purposes.


10/70

Scattergram

1. Plot of all \((x_i, y_i)\) pairs

2. Suggests how well model will fit

[Figure: scattergram of y versus x, both axes running from 0 to 60]


11/70

Thinking Challenge

[Figure: scattergram of y versus x, both axes running from 0 to 60]

How would you draw a line through the points?

How do you determine which line fits best?


12/70

Least Squares Line

The least squares line is the one that has the following two properties:

1. The sum of the errors equals 0, i.e., mean error = 0.

2. The sum of squared errors (SSE) is smaller than for any other straight-line model, i.e., the error variance is minimum.
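Both properties can be checked numerically. The sketch below fits a line with the usual closed-form least squares formulas (slope = SSxy / SSxx, intercept = y-bar minus slope times x-bar) and compares its SSE with a few nearby lines. The data are the five advertising and sales pairs used later in this lecture; the helper names `fit_line`, `errors`, and `sse` and the particular perturbations are illustrative choices.

```python
def fit_line(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


def errors(xs, ys, b0, b1):
    """Observed y minus the value predicted by the line b0 + b1 * x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sse(xs, ys, b0, b1):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum(e ** 2 for e in errors(xs, ys, b0, b1))


xs = [1, 2, 3, 4, 5]  # ad expenditure (100$)
ys = [1, 1, 2, 2, 4]  # sales (units)
b0, b1 = fit_line(xs, ys)

# Property 1: the errors of the least squares line sum to (numerically) zero.
sum_of_errors = sum(errors(xs, ys, b0, b1))

# Property 2: shifting the intercept or slope can only increase the SSE.
best_sse = sse(xs, ys, b0, b1)
other_sses = [
    sse(xs, ys, b0 + d0, b1 + d1)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.2, 0.0, 0.2)
    if (d0, d1) != (0.0, 0.0)
]

print(f"sum of errors = {sum_of_errors:.1e}, SSE = {best_sse:.2f}")
print(f"smallest SSE among the other lines = {min(other_sses):.2f}")
```

Checking a handful of neighbouring lines is only a spot check, not a proof, but it makes the second property concrete.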


13/70

Interpreting the Estimates of \(\beta_0\) and \(\beta_1\) in Simple Linear Regression

y-intercept: \(\hat{\beta}_0\) represents the predicted value of y when x = 0. (Caution: This value will not be meaningful if the value x = 0 is nonsensical or outside the range of the sample data.)

slope: \(\hat{\beta}_1\) represents the increase (or decrease) in y for every 1-unit increase in x. (Caution: This interpretation is valid only for x-values within the range of the sample data.)


14/70

[Minitab screenshot; visible labels: Dependent, Independent]


15/70

Least Squares Example

You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)   Yield (lb.)
 4                 3.0
 6                 5.5
10                 6.5
12                 9.0

Find the least squares line relating crop yield and fertilizer.


16/70

Scattergram

Crop Yield vs. Fertilizer

Minitab: Stat -> Regression -> Fitted Line Plot


17/70

Coefficient Interpretation Solution

\(\hat{y} = .8 + .65x\)

1. Slope (\(\hat{\beta}_1\)): Crop Yield (y) is expected to increase by .65 lb. for each 1 lb. increase in Fertilizer (x).

2. y-Intercept (\(\hat{\beta}_0\)): Since 0 is outside of the range of the sampled values of x, the y-intercept has no meaningful interpretation.
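The fitted line for the fertilizer data can be reproduced by hand with the standard least squares formulas. The sketch below is one way to do that arithmetic in Python; the variable names are illustrative, and the lecture itself obtains the line from Minitab.

```python
fertilizer = [4, 6, 10, 12]         # x, in lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]   # y, in lb.

n = len(fertilizer)
x_bar = sum(fertilizer) / n   # 8
y_bar = sum(crop_yield) / n   # 6

# Sums of cross-products and squares of the deviations from the means.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fertilizer, crop_yield))  # 26
ss_xx = sum((x - x_bar) ** 2 for x in fertilizer)                               # 40

slope = ss_xy / ss_xx               # 26 / 40 = 0.65
intercept = y_bar - slope * x_bar   # 6 - 0.65 * 8 = 0.8

print(f"y-hat = {intercept:.2f} + {slope:.2f}x")
```

The result agrees with the slide: a slope of .65 lb. of yield per lb. of fertilizer and an intercept of .8.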


18/70

Five-Step Procedure

Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.




19/70

Basic Assumptions of the

Probability Distribution


20/70

Five-Step Procedure

Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.




21/70

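A Test of Model Utility: Simple Linear Regression

One-Tailed Test

\(H_0\): \(\beta_1 = 0\)

\(H_a\): \(\beta_1 < 0\) (or \(H_a\): \(\beta_1 > 0\))

Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\)

Rejection region: \(t < -t_\alpha\) (or \(t > t_\alpha\) when \(H_a\): \(\beta_1 > 0\))

where \(t_\alpha\) is based on \((n - 2)\) degrees of freedom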


22/70

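A Test of Model Utility: Simple Linear Regression

Two-Tailed Test

\(H_0\): \(\beta_1 = 0\)

\(H_a\): \(\beta_1 \neq 0\)

Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\)

Rejection region: \(|t| > t_{\alpha/2}\)

where \(t_{\alpha/2}\) is based on \((n - 2)\) degrees of freedom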


23/70

Interpreting p-Values for Coefficients in Regression

Almost all statistical computer software packages report a two-tailed p-value for each of the parameters in the regression model. For example, in simple linear regression, the p-value for the two-tailed test \(H_0\): \(\beta_1 = 0\) versus \(H_a\): \(\beta_1 \neq 0\) is given on the printout. If you want to conduct a one-tailed test of hypothesis, you will need to adjust the p-value accordingly.
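The adjustment is usually made as follows: when the estimated slope lies on the side claimed by the alternative hypothesis, the one-tailed p-value is half the reported two-tailed p-value; otherwise it is one minus that half. The sketch below encodes this rule. The function name `one_tailed_p` and the string labels for the alternative are hypothetical, and the example inputs (p = 0.035, slope = 0.7) come from the advertising printout later in this lecture.

```python
def one_tailed_p(two_tailed_p, estimate, alternative):
    """Convert a two-tailed p-value for H0: beta1 = 0 into a one-tailed one.

    alternative: "greater" for Ha: beta1 > 0, "less" for Ha: beta1 < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2
    # Does the sign of the estimate agree with the direction claimed by Ha?
    agrees = estimate > 0 if alternative == "greater" else estimate < 0
    return half if agrees else 1 - half


p_upper = one_tailed_p(0.035, 0.7, "greater")  # estimate supports Ha
p_lower = one_tailed_p(0.035, 0.7, "less")     # estimate contradicts Ha
print(f"Ha: beta1 > 0 gives p = {p_upper:.4f}; Ha: beta1 < 0 gives p = {p_lower:.4f}")
```

Halving the printed p-value is therefore only appropriate when the sign of the estimate matches the alternative being tested.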


24/70

Test of Slope Coefficient

Example

You're a marketing analyst for Hasbro Toys. You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\), and s = .6055.

Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4

Is the relationship significant at the .05 level of significance?


25/70


Solution

\(H_0\): \(\beta_1 = 0\) (if the slope \(\beta_1\) is zero, then there is no linear relationship)

\(H_a\): \(\beta_1 \neq 0\) (this is the claim: there is a relationship because the slope is not zero)


26/70


Solution

\(H_0\): \(\beta_1 = 0\)

\(H_a\): \(\beta_1 \neq 0\)

\(\alpha = .05\)

\(df = 5 - 2 = 3\)

Critical Value(s): \(t = \pm 3.182\) (area .025 in each tail; reject \(H_0\) if \(t < -3.182\) or \(t > 3.182\))

Minitab, Inverse Cumulative Distribution Function, Student's t distribution with 3 DF: Probability Distributions -> t


27/70


Computer Output

Minitab: Stat -> Regression -> General Regression

General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)

Regression Equation

Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

Coefficients

Term                    Coef   SE Coef         T      P
Constant                -0.1  0.635085  -0.15746  0.885
Ad Expenditure (100$)    0.7  0.191485   3.65563  0.035

Summary of Model

S = 0.605530   R-Sq = 81.67%   R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%

Analysis of Variance

Source                   DF  Seq SS  Adj SS   Adj MS        F          P
Regression                1     4.9     4.9  4.90000  13.3636  0.0353528
  Ad Expenditure (100$)   1     4.9     4.9  4.90000  13.3636  0.0353528

Callouts on the printout: \(\hat{\beta}_0\) = -0.1 (Constant), \(\hat{\beta}_1\) = 0.7 (Ad Expenditure), and the t statistic and P-value for the slope.


28/70


Solution

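\(H_0\): \(\beta_1 = 0\)

\(H_a\): \(\beta_1 \neq 0\)

\(\alpha = .05\)

\(df = 5 - 2 = 3\)

Critical Value(s): \(t = \pm 3.182\) (area .025 in each tail)

Test Statistic: \(t \approx 3.656\)

P-Value = 0.035

Decision: Reject \(H_0\) at \(\alpha = .05\) because \(t > t_{\alpha/2} = 3.182\) (equivalently, because the P-value is smaller than \(\alpha\)).

Conclusion: There is evidence of a relationship.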
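The test statistic on the printout can be reproduced from the raw data. The sketch below assumes the standard formulas s = sqrt(SSE / (n - 2)) and standard error of the slope = s / sqrt(SSxx), and takes the critical value 3.182 from the slide rather than computing it; the variable names are illustrative.

```python
import math

ad_spend = [1, 2, 3, 4, 5]   # x, in units of 100$
sales = [1, 1, 2, 2, 4]      # y, in units

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))

b1 = ss_xy / ss_xx        # estimated slope
b0 = y_bar - b1 * x_bar   # estimated intercept

# Estimated standard deviation of the random error, with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))

se_b1 = s / math.sqrt(ss_xx)   # standard error of the slope
t_stat = b1 / se_b1

T_CRIT = 3.182   # t(.025) with 3 df, taken from the slide
reject_h0 = abs(t_stat) > T_CRIT

print(f"s = {s:.4f}, SE(b1) = {se_b1:.6f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

The computed values match the Minitab columns S, SE Coef, and T, and the two-tailed decision is to reject the null hypothesis.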


29/70

Correlation Models

Answer the question "How strong is the linear relationship between two variables?"

Coefficient of correlation: the sample correlation coefficient is denoted r; the population correlation coefficient is denoted \(\rho\)

Values range from -1 to +1


30/70

Coefficient of Correlation


31/70



32/70



33/70


34/70


Solution

r = 0.9036961

\(r \approx 0.904\): strong positive linear relationship between x and y

Minitab: Stat -> Regression -> Fitted Line Plot


35/70


Example

You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)   Yield (lb.)
 4                 3.0
 6                 5.5
10                 6.5
12                 9.0

Find the coefficient of correlation.


36/70


Solution

\(r \approx 0.956\)

Minitab: Stat -> Regression -> Fitted Line Plot
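Both correlation coefficients quoted in this lecture can be recomputed from the raw data. The sketch below assumes the usual formula r = SSxy / sqrt(SSxx * SSyy), which the extracted slides no longer show; the function name `correlation` is an illustrative choice.

```python
import math


def correlation(xs, ys):
    """Sample coefficient of correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Fertilizer (lb.) versus crop yield (lb.)
r_fertilizer = correlation([4, 6, 10, 12], [3.0, 5.5, 6.5, 9.0])

# Ad expenditure (100$) versus sales (units)
r_advertising = correlation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])

print(f"fertilizer example:  r = {r_fertilizer:.3f}")
print(f"advertising example: r = {r_advertising:.3f}")
```

Both values are close to +1, which is consistent with the strong positive linear relationships described on the slides.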


37/70

Coefficient of Determination

It represents the proportion of the total sample variability around \(\bar{y}\) that is explained by the linear relationship between y and x.

\(r^2\) = (coefficient of correlation)\(^2\)

\(0 \le r^2 \le 1\)


38/70

Coefficient of Determination Example

You're a marketing analyst for Hasbro Toys. You know r = .904.

Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4

Calculate and interpret the coefficient of determination.


39/70

Coefficient of Determination Solution

\(r^2\) = (coefficient of correlation)\(^2\)

\(r^2 = (.904)^2\)

\(r^2 = .817\)

Interpretation: About 81.7% of the sample variation in Sales (y) can be explained by using Ad $ (x) to predict Sales (y) in the linear model. The remaining 18.3% is due to other factors.
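The "proportion of variability explained" reading can be made concrete by computing the coefficient of determination a second way, as 1 - SSE / SSyy, and comparing it with the squared correlation. The sketch below does this for the advertising data; the formula is the standard equivalent definition rather than something shown on the slides, and the variable names are illustrative.

```python
ad_spend = [1, 2, 3, 4, 5]   # x, in units of 100$
sales = [1, 1, 2, 2, 4]      # y, in units

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_yy = sum((y - y_bar) ** 2 for y in sales)   # total variability around y-bar
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))  # unexplained part

r_squared = 1 - sse / ss_yy                       # explained share of the variability
r_squared_from_r = ss_xy ** 2 / (ss_xx * ss_yy)   # square of the correlation

print(f"1 - SSE/SSyy = {r_squared:.4f}, r squared = {r_squared_from_r:.4f}")
```

Both routes give about .817, matching R-Sq = 81.67% on the Minitab printout.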


40/70

Regression Modeling Steps

1. Hypothesize deterministic component

2. Estimate unknown model parameters

3. Specify probability distribution of random error term

   Estimate standard deviation of error

4. Evaluate model

5. Use model for prediction and estimation


41/70

Prediction With Regression Models

Types of predictions

  Point estimates

  Interval estimates

What is predicted

  Population mean value of y, E(y), for given x (confidence interval)

  Individual response (\(y_i\)) for given x (prediction interval)


42/70

Confidence Interval Estimate Example

You're a marketing analyst for Hasbro Toys. You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\), and s = .6055.

Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4

Find a 95% confidence interval for the mean sales when advertising is $400 (x = 4).


43/70

Prediction Interval Solution



44/70

Interval Estimate Computer Output

Regression Equation

Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

Coefficients

Term                    Coef   SE Coef         T      P
Constant                -0.1  0.635085  -0.15746  0.885
Ad Expenditure (100$)    0.7  0.191485   3.65563  0.035

Summary of Model

S = 0.605530   R-Sq = 81.67%   R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%

Analysis of Variance

Source                   DF  Seq SS  Adj SS   Adj MS        F          P
Regression                1     4.9     4.9  4.90000  13.3636  0.0353528
  Ad Expenditure (100$)   1     4.9     4.9  4.90000  13.3636  0.0353528
Error                     3     1.1     1.1  0.36667
Total                     4     6.0

Fits and Diagnostics for Unusual Observations

No unusual observations

Predicted Values for New Observations

New Obs  Fit    SE Fit        95% CI               95% PI
1        2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations

New Obs  Ad Expenditure (100$)
1        4


45/70





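Interval Estimate Computer Output

Predicted Values for New Observations

New Obs  Fit    SE Fit        95% CI               95% PI
1        2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations

New Obs  Ad Expenditure (100$)
1        4

Callouts on the printout: Fit is the predicted \(\hat{y}\) when x = 4, SE Fit is \(s_{\hat{y}}\), and 95% CI is the confidence interval.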
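The SE Fit and 95% CI values on the printout can be approximated by hand. The sketch below assumes the standard textbook interval for the mean of y at x = x_p, namely y-hat plus or minus t(alpha/2) * s * sqrt(1/n + (x_p - x-bar)^2 / SSxx), a formula the extracted slides no longer show. The estimates and s come from the example slide, and the critical value 3.182 from the earlier hypothesis-test slide.

```python
import math

ad_spend = [1, 2, 3, 4, 5]       # x, in units of 100$
B0, B1, S = -0.1, 0.7, 0.6055    # estimates and s quoted on the example slide
T_CRIT = 3.182                   # t(.025) with n - 2 = 3 df, from the test slide
x_p = 4                          # advertising of $400

n = len(ad_spend)
x_bar = sum(ad_spend) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)

y_hat = B0 + B1 * x_p   # point estimate of mean sales, 2.7

# Standard error of the estimated MEAN of y at x_p (Minitab's "SE Fit").
se_mean = S * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)

ci = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
print(f"fit = {y_hat:.1f}, SE fit = {se_mean:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because s and the critical value are rounded here, the limits agree with Minitab's (1.64450, 3.75550) only to about three decimal places.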


46/70

Prediction Interval Example

You're a marketing analyst for Hasbro Toys. You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\), and s = .6055.

Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4

Predict the sales when advertising is $400 (x = 4). Use a 95% prediction interval.


47/70

Prediction Interval Solution



48/70

Interval Estimate Computer Output

Regression Equation

Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

Coefficients

Term                    Coef   SE Coef         T      P
Constant                -0.1  0.635085  -0.15746  0.885
Ad Expenditure (100$)    0.7  0.191485   3.65563  0.035

Summary of Model

S = 0.605530   R-Sq = 81.67%   R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%

Analysis of Variance

Source                   DF  Seq SS  Adj SS   Adj MS        F          P
Regression                1     4.9     4.9  4.90000  13.3636  0.0353528
  Ad Expenditure (100$)   1     4.9     4.9  4.90000  13.3636  0.0353528
Error                     3     1.1     1.1  0.36667
Total                     4     6.0

Predicted Values for New Observations

New Obs  Fit    SE Fit        95% CI               95% PI
1        2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations

New Obs  Ad Expenditure (100$)
1        4


49/70





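Interval Estimate Computer Output

Predicted Values for New Observations

New Obs  Fit    SE Fit        95% CI               95% PI
1        2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations

New Obs  Ad Expenditure (100$)
1        4

Callouts on the printout: Fit is the predicted \(\hat{y}\) when x = 4, SE Fit is \(s_{\hat{y}}\), and 95% PI is the prediction interval.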


50/70

Confidence Intervals v. Prediction Intervals

The prediction interval is always wider than the corresponding confidence interval

Added uncertainty involved in predicting a single response versus the mean response
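The width difference comes from one extra term under the square root. The sketch below assumes the standard textbook formulas: the confidence interval for the mean uses sqrt(1/n + (x_p - x-bar)^2 / SSxx), while the prediction interval for a single response uses sqrt(1 + 1/n + (x_p - x-bar)^2 / SSxx). The numbers come from the Hasbro example (s = .6055, critical value 3.182, x_p = 4); the variable names are illustrative.

```python
import math

ad_spend = [1, 2, 3, 4, 5]       # x, in units of 100$
B0, B1, S = -0.1, 0.7, 0.6055    # estimates and s quoted on the example slides
T_CRIT = 3.182                   # t(.025) with 3 df, from the test slide
x_p = 4

n = len(ad_spend)
x_bar = sum(ad_spend) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
y_hat = B0 + B1 * x_p

# Quantity shared by both intervals: 1/n + (x_p - x_bar)^2 / SSxx.
leverage = 1 / n + (x_p - x_bar) ** 2 / ss_xx

ci_half_width = T_CRIT * S * math.sqrt(leverage)       # mean response
pi_half_width = T_CRIT * S * math.sqrt(1 + leverage)   # single new response

pi = (y_hat - pi_half_width, y_hat + pi_half_width)
print(f"CI half-width = {ci_half_width:.3f}, PI half-width = {pi_half_width:.3f}")
print(f"95% PI = ({pi[0]:.3f}, {pi[1]:.3f})")
```

The added 1 under the square root represents the variability of an individual response around its mean, so the prediction interval is wider at every value of x. With these rounded inputs the limits agree with Minitab's (0.502806, 4.89719) to about three decimal places.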


51/70


52/70

Example

Suppose a fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage, y, and the distance between the fire and the nearest fire station, x, are recorded for each fire.


53/70

Example


54/70

Example

Step 1: First, we hypothesize a model to relate fire damage, y, to the distance from the nearest fire station, x. We hypothesize a straight-line probabilistic model:

\(y = \beta_0 + \beta_1 x + \varepsilon\)


55/70

Example

Step 2: Use a statistical software package to estimate the unknown parameters in the deterministic component of the hypothesized model. The least squares estimates of the slope \(\beta_1\) and intercept \(\beta_0\), highlighted on the printout, are

\(\hat{\beta}_1 = 4.9193\)

\(\hat{\beta}_0 = 10.2779\)


56/70

Example

General Regression Analysis: DAMAGE versus DISTANCE

Regression Equation (Least Squares Equation)

DAMAGE = 10.2779 + 4.91933 DISTANCE

Coefficients

Term         Coef  SE Coef        T      P
Constant  10.2779  1.42028   7.2366  0.000
DISTANCE   4.9193  0.39275  12.5254  0.000

Summary of Model

S = 2.31635   R-Sq = 92.35%   R-Sq(adj) = 91.76%
PRESS = 93.2117   R-Sq(pred) = 89.77%

Analysis of Variance

Source      DF   Seq SS   Adj SS   Adj MS        F          P
Regression   1  841.766  841.766  841.766  156.886  0.0000000
  DISTANCE   1  841.766  841.766  841.766  156.886  0.0000000
Error       13   69.751   69.751    5.365
Total       14  911.517


57/70

Example

This prediction equation is graphed in the Minitab Fitted Line Plot.


58/70

Example

The least squares estimate of the slope, \(\hat{\beta}_1 = 4.919\), implies that the estimated mean damage increases by $4,919 for each additional mile from the fire station. This interpretation is valid over the range of x, or from .7 to 6.1 miles from the station. The estimated y-intercept, \(\hat{\beta}_0 = 10.278\), has no practical interpretation because x = 0 is outside the sampled range.
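The caution about the sampled range can be built directly into a prediction helper. The sketch below uses the least squares equation from the Minitab printout and refuses to extrapolate outside .7 to 6.1 miles. The function name `predict_damage` is hypothetical, and the unit of thousands of dollars is inferred from the slide's reading of the slope 4.919 as $4,919 per mile.

```python
B0 = 10.2779               # estimated intercept from the printout
B1 = 4.91933               # estimated slope from the printout
X_MIN, X_MAX = 0.7, 6.1    # sampled range of distances, in miles


def predict_damage(distance_miles):
    """Estimated mean fire damage (assumed to be in thousands of dollars).

    Raises ValueError outside the sampled range, where the fitted line
    has no support from the data.
    """
    if not X_MIN <= distance_miles <= X_MAX:
        raise ValueError(
            f"distance {distance_miles} is outside the sampled range "
            f"[{X_MIN}, {X_MAX}] miles"
        )
    return B0 + B1 * distance_miles


estimate = predict_damage(3.5)
print(f"estimated mean damage at 3.5 miles: {estimate:.3f} (thousands of dollars)")
```

Calling `predict_damage(0)` raises an error instead of returning the intercept, which mirrors the slide's point that the intercept has no practical interpretation here.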


59/70


60/70

Example

Step 4: First, test the null hypothesis that the slope \(\beta_1\) is 0, that is, that there is no linear relationship between fire damage and the distance from the nearest fire station, against the alternative hypothesis that fire damage increases as the distance increases. We test

\(H_0\): \(\beta_1 = 0\)

\(H_a\): \(\beta_1 > 0\)

The two-tailed observed significance level for testing is approximately 0. Dividing by 2, the p-value

is also approximately 0. (P
