8/10/2019 Week 6 in Class Lecture
1/70
Week 6
Simple Linear Regression
Models
Representation of some phenomenon
Mathematical model is a mathematical
expression of some phenomenon
Often describe relationships between
variables
Types:
  Deterministic models
  Probabilistic models
Deterministic Models
Hypothesize exact relationships
Suitable when prediction error is negligible
Example: force is exactly mass times acceleration
\(F = ma\)
Probabilistic Models
Hypothesize two components
Deterministic
Random error
Example: sales volume (y) is 10 times
advertising spending (x) + random error
\(y = 10x + \varepsilon\)
Random error may be due to factors
other than advertising
General Form of Probabilistic Models
y = Deterministic component + Random error
where y is the variable of interest.
We always assume that the mean value of the random error equals 0, which implies:
E(y) = Deterministic component
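As a rough illustration of why E(y) equals the deterministic component, the sketch below simulates the sales example, y = 10x + random error. The normal error distribution, its standard deviation (`error_sd`), and the function names are assumptions made only for this illustration; the slides specify none of them.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def deterministic_sales(x):
    """Deterministic component from the sales example: 10 times advertising."""
    return 10 * x

def observed_sales(x, error_sd=2.0):
    """One simulated observation: deterministic part plus a mean-zero error.

    error_sd is a hypothetical value chosen only for illustration.
    """
    return deterministic_sales(x) + random.gauss(0, error_sd)

x = 5
samples = [observed_sales(x) for _ in range(10_000)]
mean_sales = sum(samples) / len(samples)

# The average of many noisy observations settles near the deterministic part.
print(deterministic_sales(x), round(mean_sales, 2))
```

Individual observations scatter around 50, but their average stays close to it because the error has mean 0.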
A First-Order (Straight Line)
Probabilistic Model
\(y = \beta_0 + \beta_1 x + \varepsilon\)
where
y = Dependent or response variable (variable to be modeled)
x = Independent or predictor variable (variable used as a predictor of y)
\(E(y) = \beta_0 + \beta_1 x\) = Deterministic component
\(\varepsilon\) (epsilon) = Random error component
A First-Order (Straight Line)
Probabilistic Model
\(y = \beta_0 + \beta_1 x + \varepsilon\)
\(\beta_0\) (beta zero) = y-intercept of the line, that is, the point at which the line intercepts or cuts through the y-axis
\(\beta_1\) (beta one) = slope of the line, that is, the change (amount of increase or decrease) in the deterministic component of y for every 1-unit increase in x
A First-Order (Straight Line)
Probabilistic Model
A positive slope implies that E(y) increases by the amount \(\beta_1\) for each unit increase in x.
A negative slope implies that E(y) decreases by the amount \(|\beta_1|\) for each unit increase in x.
Five-Step Procedure
Step 1: Hypothesize the deterministic component of the model that relates the mean, E(y), to the independent variable x.
Step 2: Use the sample data to estimate unknown parameters in the model.
Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.
Step 4: Statistically evaluate the usefulness of the model.
Step 5: When satisfied that the model is useful, use it for prediction, estimation, and other purposes.
Scattergram
1. Plot of all \((x_i, y_i)\) pairs
2. Suggests how well model will fit
[Figure: scattergram of y versus x, both axes running from 0 to 60]
Thinking Challenge
[Figure: the same scattergram of y versus x]
How would you draw a line through the points?
How do you determine which line fits best?
Least Squares Line
The least squares line is one that has
the following two properties:
1. The sum of the errors equals 0, i.e., mean error = 0.
2. The sum of squared errors (SSE) is smaller than for any other straight-line model, i.e., the error variance is minimum.
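The slides do not show how the line with these two properties is computed. As a sketch, the standard closed-form least squares estimates for a straight-line model are given below; they come from general regression theory rather than from the lecture, and the notation \(SS_{xy}\), \(SS_{xx}\) is an assumption of this note.

```latex
\begin{align*}
SS_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), &
SS_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\hat{\beta}_1 &= \frac{SS_{xy}}{SS_{xx}}, &
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\
\text{SSE} &= \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2, &
\hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_i
\end{align*}
```

Because \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), the fitted line passes through \((\bar{x}, \bar{y})\), which is what forces the errors to sum to 0.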
Interpreting the Estimates of \(\beta_0\) and \(\beta_1\) in Simple Linear Regression
y-intercept: \(\hat{\beta}_0\) represents the predicted value of y when x = 0 (Caution: This value will not be meaningful if the value x = 0 is nonsensical or outside the range of the sample data.)
slope: \(\hat{\beta}_1\) represents the increase (or decrease) in y for every 1-unit increase in x (Caution: This interpretation is valid only for x-values within the range of the sample data.)
[Figure: Minitab regression dialog, with the dependent and independent variables labeled]
Least Squares Example
You're an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)   Yield (lb.)
 4                 3.0
 6                 5.5
10                 6.5
12                 9.0
Find the least squares line relating crop yield and fertilizer.
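The lecture fits this line in Minitab. As a cross-check, here is a minimal plain-Python sketch of the same fit using the usual closed-form estimates (slope = SSxy / SSxx, intercept = mean of y minus slope times mean of x). The function and variable names are illustrative assumptions, not part of the lecture.

```python
# Least squares fit for the fertilizer/yield data above (plain Python).
fertilizer = [4, 6, 10, 12]         # x, lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]   # y, lb.

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = least_squares(fertilizer, crop_yield)

# Errors (residuals) of the fitted line and their sum of squares.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(fertilizer, crop_yield)]
sse = sum(e ** 2 for e in residuals)

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print(f"SSE = {sse:.2f}")
```

The residuals sum to zero up to floating-point rounding, which is the first property of the least squares line.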
Scattergram
Crop Yield vs. Fertilizer
* Stat -> Regression -> Fitted Line Plot
Coefficient Interpretation
Solution
\(\hat{y} = .8 + .65x\)
1. Slope (\(\hat{\beta}_1\)): Crop Yield (y) is expected to increase by .65 lb. for each 1 lb. increase in Fertilizer (x).
2. y-Intercept (\(\hat{\beta}_0\)): Since 0 is outside of the range of the sampled values of x, the y-intercept has no meaningful interpretation.
Basic Assumptions of the
Probability Distribution
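A Test of Model Utility: Simple Linear Regression
One-Tailed Test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 < 0\) (or \(H_a: \beta_1 > 0\))
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}\)
Rejection region: \(t < -t_\alpha\) (or \(t > t_\alpha\) when \(H_a: \beta_1 > 0\))
where \(t_\alpha\) is based on (n - 2) degrees of freedom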
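A Test of Model Utility: Simple Linear Regression
Two-Tailed Test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}\)
Rejection region: \(|t| > t_{\alpha/2}\)
where \(t_{\alpha/2}\) is based on (n - 2) degrees of freedom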
Interpreting p-Values for
Coefficients in Regression
Almost all statistical computer software packages report a two-tailed p-value for each of the parameters in the regression model. For example, in simple linear regression, the p-value for the two-tailed test \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) is given on the printout. If you want to conduct a one-tailed test of hypothesis, you will need to adjust the p-value accordingly.
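The adjustment can be sketched as follows: halve the two-tailed p-value when the sign of the test statistic agrees with the alternative hypothesis, and use one minus that half otherwise. The function name and its arguments are hypothetical; the sample inputs (p = 0.035, t = 3.656) are the slope results that appear in the Hasbro printout later in this lecture.

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a reported two-tailed p-value into a one-tailed p-value.

    alternative: "greater" for Ha: beta1 > 0, "less" for Ha: beta1 < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2
    # Does the observed t point in the direction claimed by Ha?
    agrees = t_stat > 0 if alternative == "greater" else t_stat < 0
    return half if agrees else 1 - half

p_upper = one_tailed_p(0.035, 3.656, "greater")  # Ha: beta1 > 0
p_lower = one_tailed_p(0.035, 3.656, "less")     # Ha: beta1 < 0
print(p_upper, p_lower)
```

A positive t statistic supports an upper-tailed alternative, so only that direction gets the small p-value.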
Test of Slope Coefficient
Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Is the relationship significant at the .05 level of significance?
Test of Slope Coefficient
Solution
\(H_0: \beta_1 = 0\) (if the slope \(\beta_1\) is zero, then there is no relationship)
\(H_a: \beta_1 \neq 0\) (this is the claim: there is a relationship because the slope is not zero)
Test of Slope Coefficient
Solution
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): \(t = \pm 3.182\) (area .025 in each tail; reject \(H_0\) if \(t < -3.182\) or \(t > 3.182\))
Inverse Cumulative Distribution Function
Student's t distribution with 3 DF
* Probability Distributions -> t
Test of Slope Coefficient
Computer Output
General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term                    Coef   SE Coef    T          P
Constant                -0.1   0.635085   -0.15746   0.885
Ad Expenditure (100$)    0.7   0.191485    3.65563   0.035
Summary of Model
S = 0.605530   R-Sq = 81.67%   R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%
Analysis of Variance
Source                  DF   Seq SS   Adj SS   Adj MS    F         P
Regression               1   4.9      4.9      4.90000   13.3636   0.0353528
Ad Expenditure (100$)    1   4.9      4.9      4.90000   13.3636   0.0353528
In the Coefficients table, the Coef column gives \(\hat{\beta}_0\) (Constant) and \(\hat{\beta}_1\) (Ad Expenditure); T and P in the Ad Expenditure row are the t statistic and P-value for the slope.
* Stat -> Regression -> General Regression
Test of Slope Coefficient
Solution
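\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): \(t = \pm 3.182\) (area .025 in each tail; reject \(H_0\) if \(t < -3.182\) or \(t > 3.182\))
Test Statistic: \(t \approx 3.656\)
P-Value = 0.035
Decision: Reject \(H_0\) at \(\alpha = .05\) because \(t > t_{\alpha/2}\) (3.656 > 3.182) or, equivalently, because the P-value is smaller than \(\alpha\).
Conclusion: There is evidence of a relationship.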
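The test statistic can be reproduced by hand from the five data points. The sketch below uses plain Python and the standard formulas (slope = SSxy / SSxx, s = sqrt(SSE / (n - 2)), standard error of the slope = s / sqrt(SSxx)); the critical value 3.182 is copied from the slide rather than computed, and the variable names are illustrative assumptions.

```python
import math

ad_spend = [1, 2, 3, 4, 5]   # x, in units of $100
sales = [1, 1, 2, 2, 4]      # y, units sold

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))

b1 = ss_xy / ss_xx            # estimated slope
b0 = y_bar - b1 * x_bar       # estimated intercept
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))  # estimated standard deviation of the error
se_b1 = s / math.sqrt(ss_xx)  # standard error of the slope ("SE Coef")
t_stat = b1 / se_b1           # test statistic for H0: beta1 = 0

T_CRIT = 3.182                # t_.025 with 3 df, taken from the slide
reject_h0 = abs(t_stat) > T_CRIT

print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, s = {s:.4f}")
print(f"SE(b1) = {se_b1:.6f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

The values agree with the Coef, SE Coef, and T columns of the Minitab printout.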
Correlation Models
Answers: How strong is the linear relationship between two variables?
Coefficient of correlation
  Sample correlation coefficient denoted r
  Population correlation coefficient denoted \(\rho\) (rho)
  Values range from -1 to +1
30/70
Coefficient of Correlation
8/10/2019 Week 6 in Class Lecture
31/70
Coefficient of Correlation
8/10/2019 Week 6 in Class Lecture
32/70
Coefficient of Correlation
8/10/2019 Week 6 in Class Lecture
33/70
8/10/2019 Week 6 in Class Lecture
34/70
Coefficient of Correlation
Solution
r = 0.9038805
\(r \approx 0.904\)
* Stat -> Regression -> Fitted Line Plot
\(r \approx 0.904\): strong positive relationship between x and y
Coefficient of Correlation
Example
You're an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)   Yield (lb.)
 4                 3.0
 6                 5.5
10                 6.5
12                 9.0
Find the coefficient of correlation.
Coefficient of Correlation
Solution
\(r \approx 0.956\)
* Stat -> Regression -> Fitted Line Plot
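The formula on these solution slides did not survive extraction. A common way to compute the sample correlation is r = SSxy / sqrt(SSxx * SSyy); the sketch below applies it to both data sets used in this lecture. The function name is an illustrative assumption, and this may not be the exact route the slides took.

```python
import math

def correlation(x, y):
    """Sample coefficient of correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hasbro example: ad expenditure (100$) vs. sales (units)
r_ads = correlation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
# County cooperative example: fertilizer (lb.) vs. yield (lb.)
r_fert = correlation([4, 6, 10, 12], [3.0, 5.5, 6.5, 9.0])

print(round(r_ads, 3), round(r_fert, 3))
```

Computed directly from the data, the Hasbro value is about 0.9037, slightly below the 0.9038805 shown on the earlier solution slide; both round to .904.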
Coefficient of Determination
It represents the proportion of the total sample variability around \(\bar{y}\) that is explained by the linear relationship between y and x.
\(0 \le r^2 \le 1\)
\(r^2 = (\text{coefficient of correlation})^2\)
Coefficient of
Determination Example
You're a marketing analyst for Hasbro Toys.
You know r = .904.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Calculate and interpret the coefficient of determination.
Coefficient of
Determination Solution
\(r^2 = (\text{coefficient of correlation})^2\)
\(r^2 = (.904)^2\)
\(r^2 = .817\)
Interpretation: About 81.7% of the sample variation in Sales (y) can be explained by using Ad $ (x) to predict Sales (y) in the linear model. The remaining 18.3% is due to other factors.
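The same number can be reached from the sums of squares, since r-squared is the share of the total variation in y that the fitted line accounts for: 1 - SSE / SSyy. This is a minimal plain-Python sketch on the Hasbro data; the variable names are illustrative assumptions.

```python
ad_spend = [1, 2, 3, 4, 5]   # x, in units of $100
sales = [1, 1, 2, 2, 4]      # y, units sold

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
ss_yy = sum((y - y_bar) ** 2 for y in sales)   # total variation around y-bar

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))  # unexplained

r_squared = 1 - sse / ss_yy
print(f"SSyy = {ss_yy:.1f}, SSE = {sse:.1f}, r^2 = {r_squared:.4f}")
```

SSyy and SSE here match the Total (6.0) and Error (1.1) rows of the Analysis of Variance table, and the result matches R-Sq = 81.67% on the printout.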
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term
   Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Prediction With Regression Models
Types of predictions
  Point estimates
  Interval estimates
What is predicted
  Population mean value of y, E(y), for given x (confidence interval)
  Individual response (\(y_i\)) for given x (prediction interval)
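The interval formulas themselves are not legible in this copy of the slides. As a sketch, the standard forms for simple linear regression are shown below, where \(x_p\) (the x-value at which the estimate is made) is notation assumed for this note and \(t_{\alpha/2}\) is based on n - 2 degrees of freedom.

```latex
\begin{align*}
\text{Confidence interval for } E(y) \text{ at } x = x_p:\quad
  & \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \\
\text{Prediction interval for an individual } y \text{ at } x = x_p:\quad
  & \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
\end{align*}
```

The extra 1 under the square root accounts for the random error of a single new observation, which is why the prediction interval is the wider of the two.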
Confidence Interval
Estimate Example
You're a marketing analyst for Hasbro Toys. You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Find a 95% confidence interval for the mean sales when advertising is $400.
Prediction Interval Solution
Interval Estimate
Computer Output
General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term Coef SE Coef T P
Constant -0.1 0.635085 -0.15746 0.885
Ad Expenditure (100$) 0.7 0.191485 3.65563 0.035
Summary of Model
S = 0.605530 R-Sq = 81.67% R-Sq(adj) = 75.56%
PRESS = 4.43367 R-Sq(pred) = 26.11%
Analysis of Variance
Source DF Seq SS Adj SS Adj MS F P
Regression 1 4.9 4.9 4.90000 13.3636 0.0353528
Ad Expenditure (100$) 1 4.9 4.9 4.90000 13.3636 0.0353528
Error 3 1.1 1.1 0.36667
Total 4 6.0
Fits and Diagnostics for Unusual Observations
No unusual observations
Predicted Values for New Observations
New Obs Fit SE Fit 95% CI 95% PI
1 2.7 0.331662 (1.64450, 3.75550) (0.502806, 4.89719)
Values of Predictors for New Observations
Ad Expenditure
New Obs (100$)
1 4
Interval Estimate
Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations
Predicted Values for New Observations
New Obs   Fit   SE Fit     95% CI               95% PI
1         2.7   0.331662   (1.64450, 3.75550)   (0.502806, 4.89719)
Values of Predictors for New Observations
New Obs   Ad Expenditure (100$)
1         4
Fit = predicted y when x = 4; SE Fit = \(s_{\hat{y}}\); 95% CI = confidence interval for the mean sales
Prediction Interval
Example
You're a marketing analyst for Hasbro Toys. You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Predict the sales when advertising is $400. Use a 95% prediction interval.
Prediction Interval Solution
Interval Estimate
Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations
Predicted Values for New Observations
New Obs   Fit   SE Fit     95% CI               95% PI
1         2.7   0.331662   (1.64450, 3.75550)   (0.502806, 4.89719)
Values of Predictors for New Observations
New Obs   Ad Expenditure (100$)
1         4
Fit = predicted y when x = 4; SE Fit = \(s_{\hat{y}}\); 95% PI = prediction interval for an individual sales value
Confidence Intervals v. Prediction Intervals
The prediction interval is always wider than the corresponding confidence interval
  Added uncertainty involved in predicting a single response versus the mean response
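Both intervals on the Minitab printout can be reproduced by hand for the Hasbro data at x = 4. The sketch below uses plain Python and the standard interval formulas; the critical value 3.182 (t with 3 df, .025 in each tail) is copied from the earlier slide rather than computed, and the function and variable names are illustrative assumptions.

```python
import math

ad_spend = [1, 2, 3, 4, 5]   # x, in units of $100
sales = [1, 1, 2, 2, 4]      # y, units sold
T_CRIT = 3.182               # t_.025 with n - 2 = 3 df, from the slide

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))

def interval_estimates(x_new):
    """Return (fit, confidence interval, prediction interval) at x_new."""
    fit = b0 + b1 * x_new
    spread = 1 / n + (x_new - x_bar) ** 2 / ss_xx
    ci_half = T_CRIT * s * math.sqrt(spread)       # mean response
    pi_half = T_CRIT * s * math.sqrt(1 + spread)   # single new response
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)

fit, ci, pi = interval_estimates(4)
print(f"fit = {fit:.1f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI = ({pi[0]:.3f}, {pi[1]:.3f})")
```

The results agree with the printout to about three decimals; the small difference comes from using the rounded critical value 3.182.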
Example
Suppose a fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage, y, and the distance between the fire and the nearest fire station, x, are recorded for each fire.
Example
Step 1: First, we hypothesize a model to relate fire damage, y, to the distance from the nearest fire station, x. We hypothesize a straight-line probabilistic model:
\(y = \beta_0 + \beta_1 x + \varepsilon\)
Example
Step 2: Use a statistical software package to estimate the unknown parameters in the deterministic component of the hypothesized model. The least squares estimates of the slope \(\beta_1\) and intercept \(\beta_0\), highlighted on the printout, are
\(\hat{\beta}_1 = 4.9193\)
\(\hat{\beta}_0 = 10.2779\)
Example
General Regression Analysis: DAMAGE versus DISTANCE
Regression Equation
DAMAGE = 10.2779 + 4.91933 DISTANCE
Coefficients
Term       Coef      SE Coef   T         P
Constant   10.2779   1.42028    7.2366   0.000
DISTANCE    4.9193   0.39275   12.5254   0.000
Summary of Model
S = 2.31635   R-Sq = 92.35%   R-Sq(adj) = 91.76%
PRESS = 93.2117   R-Sq(pred) = 89.77%
Analysis of Variance
Source       DF   Seq SS    Adj SS    Adj MS    F         P
Regression    1   841.766   841.766   841.766   156.886   0.0000000
DISTANCE      1   841.766   841.766   841.766   156.886   0.0000000
Error        13    69.751    69.751     5.365
Total        14   911.517
Fits and Diagnostics for Unusual Observations
No unusual observations
Least squares equation (highlighted on the printout): DAMAGE = 10.2779 + 4.91933 DISTANCE
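The raw fire data are not reproduced in this copy of the slides, but the entries of the printout are tied together by a few identities that are easy to check: T = Coef / SE Coef, F = T squared (with one predictor), R-Sq = regression SS / total SS, and S = sqrt(error mean square). The sketch below verifies them with the numbers copied from the printout; the variable names are illustrative assumptions.

```python
import math

# Values copied from the DAMAGE versus DISTANCE printout.
coef, se_coef = 4.9193, 0.39275        # slope estimate and its standard error
ss_regression, ss_total = 841.766, 911.517
ms_error = 5.365                       # Adj MS in the Error row

t_stat = coef / se_coef                # T column for DISTANCE
f_stat = t_stat ** 2                   # F equals T squared with one predictor
r_squared = ss_regression / ss_total   # R-Sq
s = math.sqrt(ms_error)                # S

print(f"T = {t_stat:.3f}, F = {f_stat:.2f}")
print(f"R-Sq = {100 * r_squared:.2f}%, S = {s:.3f}")
```

Small differences from the printed values come from rounding in the printed inputs.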
Example
This prediction equation is graphed in the Minitab Fitted Line Plot.
Example
The least squares estimate of the slope, \(\hat{\beta}_1 = 4.919\), implies that the estimated mean damage increases by $4,919 for each additional mile from the fire station. This interpretation is valid over the range of x, or from .7 to 6.1 miles from the station. The estimated y-intercept, \(\hat{\beta}_0 = 10.278\), has no practical interpretation because x = 0 is outside the sampled range.
Example
Step 4: First, test the null hypothesis that the slope \(\beta_1\) is 0, that is, that there is no linear relationship between fire damage and the distance from the nearest fire station, against the alternative hypothesis that fire damage increases as the distance increases. We test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 > 0\)
The two-tailed observed significance level for testing is approximately 0. Dividing by 2, the one-tailed p-value
is also approximately 0. (P