View
221
Download
0
Embed Size (px)
Citation preview
Note 13 of 5E
Statistics with Economics and Statistics with Economics and Business ApplicationsBusiness Applications
Chapter 11 Linear Regression and Correlation
Correlation Coefficient, Least Squares, ANOVA Table, Test and Confidence Intervals,
Estimation and Prediction
Note 13 of 5E
ReviewReview
I. I. What’s in last two lectures?What’s in last two lectures?Hypotheses tests for means and proportions Chapter 8
II. II. What's in the next three lectures?What's in the next three lectures?Correlation coefficient and linear regression Read Chapter 11
Note 13 of 5E
IntroductionIntroduction
So far we have done statistics on one variable at a time. We now interested in relationships between two variables and how to use one variable to predict another variable.
• Does weight depend on height?
• Does blood pressure level predict life expectancy?
• Do SAT scores predict college performance?
• Does taking PSTAT 5E make you a better person?
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
The following data was collected in a study of age and fatness in humans.
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839
One of the questions was, “What is the relationship between age and fatness?”
Age 23 23 27 27 39 41 45 49 50% Fat 9 .5 27 .9 7 .8 17 .8 31 .4 25 .9 27 .4 25 .2 31 .1
Age 53 53 54 56 57 58 58 60 61% Fat 34.7 42 29.1 32 .5 30 .3 33 33.8 41 .1 34 .5
Note 13 of 5E
Example: Age and FatnessExample: Age and FatnessThe following scatterplot shows that % fat in general tend to increase with age. The relationship is close, but not exactly, linear.
Note 13 of 5E
Example: Advertising and SaleExample: Advertising and Sale The following table contains sales (y) and advertising
expenditures (x) for 10 branches of a retail store.
x ($100) 18 7 14 31 21 5 11 16 26 29
y ($1000) 55 17 36 85 62 18 33 41 63 87
Note 13 of 5E
• What patternpattern or formform do you see?
• Straight line upward or downward
• Curve or no pattern at all
• How strongstrong is the pattern?
• Strong or weak
• Are there any unusual observationsunusual observations?
• Clusters or outliers
Describing the ScatterplotDescribing the Scatterplot
Note 13 of 5E
Positive linear - strong Negative linear -weak
Curvilinear No relationship
ExamplesExamples
Note 13 of 5E
Investigation of RelationshipInvestigation of Relationship
There are two approaches to investigate linear relationship
– Correlation coefficient: a numerical measure of the strengthstrength and directiondirection of the linear relationship between x and y.
– Linear regression: a linear equation expresses the relationship between x and y. It provides a formform of the relationship.
Note 13 of 5E
The Correlation CoefficientThe Correlation Coefficient The strength and direction of the relationship
between x and y are measured using the correlation coefficient (Pearson product correlation coefficient (Pearson product moment coefficient of correlation), moment coefficient of correlation), rr..
where
sx = standard deviation of the x’s
sy = standard deviation of the y’s
sx = standard deviation of the x’s
sy = standard deviation of the y’s
yx
xy
ss
sr
yx
xy
ss
sr
1
))((
nn
yxyx
s
iiii
xy 1
))((
nn
yxyx
s
iiii
xy
Note 13 of 5E
ExampleExample
The table shows the heights and weights ofn = 10 randomly selected college football players.Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175
Use your calculator to find the sums and sums of squares.
Use your calculator to find the sums and sums of squares.
8261.)2610)(4.60(
328
26104.60328
r
SSS yyxxxy
8261.)2610)(4.60(
328
26104.60328
r
SSS yyxxxy
Note 13 of 5E
Football PlayersFootball Players
Height
Weig
ht
75747372717069686766
210
200
190
180
170
160
150
Scatterplot of Weight vs Height
r = .8261
Strong positive correlation
As the player’s height increases, so
does his weight.
r = .8261
Strong positive correlation
As the player’s height increases, so
does his weight.
Note 13 of 5E
Interpreting Interpreting rr
All points fall exactly on a straight line.
Strong relationship; either positive or negative
No relationship; random scatter of points
• -1 r 1
• r 0
• r 1 or –1
• r = 1 or –1
Sign of r indicates direction of the linear relationship.
Note 13 of 5E
Example: Advertising Example: Advertising and Saleand Sale
Often we want to investigate the form of the relationship for the purpose of description, control and prediction. For the advertising and sale example, sale increases as advertising expenditure increase. The relationship is almost, but not exact linear.
• Description: how sales depends on advertising expenditure
• Control: how much to spend on advertising to reach certain goals on sales
• Prediction: how much sales do we expect if we spend certain amount of money on advertising
Note 13 of 5E
Linear Deterministic ModelLinear Deterministic Model• Denote x as independent variables and y as
dependent variable. For example, x=age and y=%fat in the first example; x=advertising expenditure, y=sales in the second example
• We want to find how y depends on x, or how to predict y using x
• One of the simplest deterministic mathematical relationship between two variables x and y is a linear relationship y = α + βx
00
= intercept
α: intercept
β: slope
Note 13 of 5E
RealityReality • In the real world, things are never so clean!
– Age influences fatness. But it is not the sole influence. There are other factors such as sex, body type and random variation (e.g. measurement error)
– Other factors such as time of year, state of economy and size of inventory, besides the advertising expenditure, can influence the sale
• Observations of (x, y) do not fall on a straight line
As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality. Albert Einstein
Note 13 of 5E
Probabilistic ModelProbabilistic Model • Probabilistic model: y = deterministic model + random error
• Random error represents random fluctuation from the deterministic model
• The probabilistic model is assumed for the population
• Simple linear regression model: y = α + βx + ε
• Without the random deviation ε, all observed points (x, y) points would fall exactly on the deterministic line. The inclusion of ε in the model equation allows points to deviate from the line by random amounts.
Note 13 of 5E
Simple Linear Regression ModelSimple Linear Regression Model
0
0x = x1 x = x2
ε1
Observation when x = x1
(positive deviation)
ε2
Observation when x = x2
(negativetive deviation)
= vertical intercept
Population regression line(Slope )
Note 13 of 5E
Basic Assumptions of the Simple Linear Basic Assumptions of the Simple Linear Regression ModelRegression Model
1. The distribution of ε at any particular x value has mean value 0.
2. The standard deviation of ε is the same for any particular value of x. This standard deviation is denoted by .
3. The distribution of ε at any particular x value is normal.
4. The random errors are independent of one another.
Note 13 of 5E
Interpretation of TermsInterpretation of Terms1. The line α + βx describes average value of y for any
fixed value of x.
2. The slope of the population regression line is the average change in y associated with a 1-unit increase in x.
3. The intercept is the height of the population line when x = 0.
4. The size of determines the extent to which (x, y) observations deviate from the population line.
Small Large
Note 13 of 5E
The Distribution of yThe Distribution of y
For any fixed x, y has normal distribution with mean xx and standard deviation and standard deviation .
Note 13 of 5E
DataData1. So far we have described the population
probabilistic model.2. Usually three population parameters, α, β and σ,
are unknown. We need to estimate them from data.
3. Data: n pairs of observations of independent and dependent variables
(x1,y1), (x2,y2), …, (xn,yn)
4. Probabilistic model yi = + xi + i, i=1,…,n i are independent normal with mean 0 and
standard deviation .
Note 13 of 5E
Steps in Regression AnalysisSteps in Regression Analysis When you perform simple regression analysis, use a step-by step approach:
1. Fit the model to data – estimate parameters.
2. Use the analysis of variance F test (or t test) and r2 to determine how well the model fits the data.
3. Use diagnostic plots to check for violation of the regression assumptions.
4. Proceed to estimate or predict the quantity of interest
We now discuss statistical methods for each step and use the fatness example to illustrate each step
Note 13 of 5E
The Method of Least SquaresThe Method of Least Squares• The equation of the best-fitting line is
calculated using n pairs of data (xi, yi). • We choose our estimates and to estimate and so that the vertical distances of the points from the line,are minimized.
n
1i
2ii
n
1i
2ii
)xβα(y
)y(ySSE
minimize toβ and α Choose
xβαy :line fittingBest
Note 13 of 5E
Least Squares EstimatorsLeast Squares Estimators
xy
Sn
yxyxS
n
yyS
n
xxS
n
yy
n
xx
xx
xy
iiiixy
iiyy
iixx
ii
ˆ - ofestimator point ˆ
S ofestimator point ˆ
Then .) )( (
,) (
,) (
, , Compute
22
22
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
The following data was collected in a study of age and fatness in humans.
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839
One of the questions was, “What is the relationship between age and fatness?”
Age 23 23 27 27 39 41 45 49 50% Fat 9 .5 27 .9 7 .8 17 .8 31 .4 25 .9 27 .4 25 .2 31 .1
Age 53 53 54 56 57 58 58 60 61% Fat 34.7 42 29.1 32 .5 30 .3 33 33.8 41 .1 34 .5
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
2
n 18
X 834
y 515
X 41612
XY 25489.2
Age (x) % Fat y x2 xy23 9.5 529 218.523 27.9 529 641.727 7.8 729 210.627 17.8 729 480.639 31.4 1521 1224.641 25.9 1681 1061.945 27.4 2025 123349 25.2 2401 1234.850 31.1 2500 155553 34.7 2809 1839.153 42 2809 222654 29.1 2916 1571.456 32.5 3136 182057 30.3 3249 1727.158 33 3364 191458 33.8 3364 1960.460 41.1 3600 246661 34.5 3721 2104.5
834 515 41612 25489.2
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
2
n 18, x 834, y 515
x 41612, xy 25489.2
2
2xx
2
xS x
n
83441612 2970
18
xy
x yS xy
n834 515
25489.2 1627.5318
xy
xy
S
xx
xy
55.22.3ˆ
22.318
83455.
18
515
ˆ - ˆ
55.2970
53.1627
S ˆ
Note 13 of 5E
The Analysis of VarianceThe Analysis of Variance• The total variation in the experiment is measured by
the total sum of squarestotal sum of squares:2)yyS yy ( SS Total
2)yyS yy ( SS Total
• The Total SSTotal SS is divided into two parts:
SSRSSR (sum of squares for regression): measures the variation explained by including the independent variable x in the model.
SSESSE (sum of squares for error): measures the leftover variation not explained by x.
xx
xy
S
S 2)(SSR
xx
xy
S
S 2)(SSR SSR - SS TotalSSE SSR - SS TotalSSE
Note 13 of 5E
The ANOVA TableThe ANOVA Table
Total df = Mean SquaresRegression df = Error df =
n -1
1
n –1 – 1 = n - 2
MSR = SSR/1
MSE = SSE/(n-2)
Source df SS MS F
Regression 1 SSR MSR MSR/MSE
Error n - 2 SSE MSE
Total n - 1 Total SS
Note 13 of 5E
The F TestThe F Test
We can test the overall usefulness of the linear model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.
0 :H vs0 :H a0
df. 2-n and 1 with FF if HReject
MSE
MSRF :StatisticTest
α0
Note 13 of 5E
Coefficient of DeterminationCoefficient of Determination
• r2 is the square of correlation coefficient• r2 is a number between zero and one and a value close to zero suggests a poor model.• It gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.• A very high value of r² can arise even though the relationship between the two variables is non-linear. The fit of a model should never simply be judged from the r² value alone.
SS Total
SSE-1
SS Total
SSRr
as defined ision determinat oft coefficien The
2
Note 13 of 5E
Estimate of Estimate of σσ
An estimator of the variance 2 is
MSE 2-n
SSEˆ 2
Thus, an estimator of the standard deviation is
MSE2-n
SSEˆ
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
5.75 33.11 ˆ
33.11 2-18
529.71
2-n
SSE ˆ
.627 1421.58
529.71 -1
SS Total
SSE-1 r
529.71 891.27-1421.58 SSR - SS Total SSE
27.8912970
53.1627)( SSR
58.142118
5153.16156
n
)( SS Total
2
2
22
222
xx
xy
S
S
yy
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
Source df SS MS F
Regression 1 891.27 891.27 26.94
Error 16 529.71 33.11
Total 17 1421.58
ANOVA Table
• With r2=0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age.• The magnitude of a typical sample deviation from the least squares line is about 5.75(%) which is reasonably large compared to the y values themselves. • This would suggest that the model is only useful in the sense of provide gross ballpark estimates for %Fat for humans based on age.
Note 13 of 5E
Inference Concerning the Slope Inference Concerning the Slope ββ
• Do the data present sufficient evidence to indicate that y increases (or decreases) linearly as x increases?
• Is the independent variable x useful in predicting y?
• A no answer to above questions means that y does not change, regardless of the value of x. This implies that the slope of the line, , is zero.
0:0:0 aH versusH 0:0:0 aH versusH
Note 13 of 5E
Sampling Distribution Sampling Distribution When the four basic assumptions of the simple linear
regression model are satisfied, the following are true:
1. The mean value of is . That is, is unbiased
2. The standard deviation of the statistic is
3. has a normal distribution (a consequence of the error being normally distributed)
4. The probability distribution of the standardized variable
has the t distribution with df=n-2
xxS
xxSt
/ˆ
ˆ
Note 13 of 5E
Confidence Interval for Confidence Interval for When then four basic assumptions of the simple linear regression model are satisfied, a (1-α)100% confidence interval for is
where the t critical value is based on df = n - 2.
xxSt /ˆ ˆ2/
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
.77) (.33,or
.22 .55 29705.75/ 2.12 .55 /ˆ ˆ
is for interval confidence 95%A
2/ xxSt
Based on sample data, the %Fat increases .55% on average with one year of age, and we are 95% confident that the true increase per year is between 0.33% and 0.77%.
Note 13 of 5E
Hypothesis Tests Concerning Hypothesis Tests Concerning Step 1: Specify the null and alternative hypothesis
– H0: β = βversus Ha: β βtwo-sided test)– H0: β = βversus Ha: β > β(one-sided test)– H0: β = βversus Ha: β < βone-sided test)
Step 2: Test statistic
Step 3: When four basic assumptions of the simple linear regression model are satisfied, under H0 , the sampling distribution of t has a Student’s t distribution with n-2 degrees of freedom
xxSt
/ˆ
ˆ0
Note 13 of 5E
Hypothesis Tests Concerning Hypothesis Tests Concerning Step 3: Find p-value. Compute
sample statistic
– Ha: two-sided test)
– Ha: > (one-sided test)
– Ha: < (one-sided test)
|)*|(2value-p ttP |)*|(2value-p ttP
*)(value-p ttP *)(value-p ttP
*)(value-p ttP *)(value-p ttP
P(t>|t*|), P(t>t*) and P(t<t*) can be found from the t tableP(t>|t*|), P(t>t*) and P(t<t*) can be found from the t table
xxSt
/ˆ
ˆ* 0
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
fatness. and agebetween
iprelationshlinear t significan a is There .5
Hreject 4.
.005value-p 3.
162-ndf
21.52970/75.5
55.
/ˆ
0ˆ t*.2
0:H 0, :H 1.
0
a0
xxS
fatness. and agebetween
iprelationshlinear t significan a is There .5
Hreject 4.
.005value-p 3.
162-ndf
21.52970/75.5
55.
/ˆ
0ˆ t*.2
0:H 0, :H 1.
0
a0
xxS
Note 13 of 5E
Interpreting a Significant RegressionInterpreting a Significant Regression
• Even if you do not reject the null hypothesis
that the slope of the line equals 0, it does not
necessarily mean that y and x are unrelated.
• It may happen that y and x are perfectly
related in a nonlinear way.
• Causality—Do not conclude that x causes y.
There may be an unknown variable at work!
Note 13 of 5E
Checking the Checking the Regression AssumptionsRegression Assumptions
1. The relationship between x and y is linear, given by y = + x +
2. The random error terms are independent and, for any value of x, have a normal distribution with mean 0 and variance 2.
1. The relationship between x and y is linear, given by y = + x +
2. The random error terms are independent and, for any value of x, have a normal distribution with mean 0 and variance 2.
Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.
Note 13 of 5E
ResidualsResiduals• The residual corresponding to (xi,yi) is • The residual residual is the “leftover” variation in each data point after the variation explained by the regression model has been removed. • Residuals reflect random errors, thus we can use them to check violations in the assumptions about random errors.
iii xye ˆˆ iii xye ˆˆ
00 x = x1 x = x2
e1 e2
fitted line
Note 13 of 5E
Diagnostic Tools – Residual PlotsDiagnostic Tools – Residual Plots
1. Normal probability plot of residuals
2. Plot of residuals versus fit or residuals versus variables
1. Normal probability plot of residuals
2. Plot of residuals versus fit or residuals versus variables
We can check the normality and equal variance assumptions using
Note 13 of 5E
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.
Normal Probability PlotNormal Probability Plot
Note 13 of 5E
If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
If not, you will see a pattern in the residuals.
If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
If not, you will see a pattern in the residuals.
Residuals versus FitsResiduals versus Fits
Note 13 of 5E
Estimation and Prediction Estimation and Prediction • Once we have
determined that the regression line is useful used the diagnostic plots to check for
violation of the regression assumptions.
• We are ready to use the regression line to Estimate the average value of y for a
given value of x Predict a particular value of y for a
given value of x.
Note 13 of 5E
Estimation and Prediction Estimation and Prediction
Estimating the average value of y when x = x0
Predicting a particular value of y when x = x0
Note 13 of 5E
Estimation and Prediction Estimation and Prediction • The best estimate of the average value of y and
prediction of y for a given value x = x0 is
• The prediction of y is more difficult, requiring a wider range of values in the prediction interval.
0 ˆˆˆ xy 0 ˆˆˆ xy
Note 13 of 5E
Estimation and Prediction Estimation and Prediction
xx
xx
S
xx
nty
xxy
S
xx
nty
xxy
20
2/
0
20
2/
0
)(11ˆˆ
: when of valueparticular apredict To
)(1ˆˆ
: when of valueaverage theestimate To
xx
xx
S
xx
nty
xxy
S
xx
nty
xxy
20
2/
0
20
2/
0
)(11ˆˆ
: when of valueparticular apredict To
)(1ˆˆ
: when of valueaverage theestimate To
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
The fitted line is
If x0=45 is put into the equation for x, we have both an estimated average %Fat for 45 year old humans and a predicted %Fat for a 45 year old human
The two interpretations are quite different.
y 3.22 0.548x
3.22+0.548(45)=27.9%
Note 13 of 5E
Example: Age and FatnessExample: Age and Fatness
.much wider is interval prediction The 40.43). (15.37,or
53.129.272970
)18/83445(
18
1175.512.2ˆ
is 45at xy of prediction for the interval confidence 95%
30.78). (25.2,or
88.29.272970
)18/83445(
18
175.512.2ˆ
is 45at xy of average for the interval confidence 95%
2
0
2
0
y
y
.much wider is interval prediction The 40.43). (15.37,or
53.129.272970
)18/83445(
18
1175.512.2ˆ
is 45at xy of prediction for the interval confidence 95%
30.78). (25.2,or
88.29.272970
)18/83445(
18
175.512.2ˆ
is 45at xy of average for the interval confidence 95%
2
0
2
0
y
y
Note 13 of 5E
Steps in Regression AnalysisSteps in Regression Analysis When you perform simple regression analysis, use a step-by step approach:
1. Fit the model to data – estimate parameters.
2. Use the analysis of variance F test (or t test) and r2 to determine how well the model fits the data.
3. Use diagnostic plots to check for violation of the regression assumptions.
4. Proceed to estimate or predict the quantity of interest
Note 13 of 5E
Regression Analysis: y versus xThe regression equation is y = 40.8 + 0.766 xPredictor Coef SE Coef T PConstant 40.784 8.507 4.79 0.001x 0.7656 0.1750 4.38 0.002
S = 8.704 R-Sq = 70.5% R-Sq(adj) = 66.8%
Analysis of VarianceSource DF SS MS F PRegression 1 1450.0 1450.0 19.14 0.002Residual Error 8 606.0 75.8Total 9 2056.0
Regression coefficients, a and b
Minitab OutputMinitab Output
MSE
0: 0H test ToLeast squares regression line
Ft 2MSE
Note 13 of 5E
Key ConceptsKey ConceptsI. Scatter plotI. Scatter plot
Pattern and unusual observationsPattern and unusual observations
II. Correlation coefficient II. Correlation coefficient a numerical measure of the strengthstrength and direction.direction.
Interpretation of rInterpretation of r
Note 13 of 5E
Key ConceptsKey ConceptsIII.III. A Linear Probabilistic ModelA Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is y x .
2. The random error has a normal distribution with mean 0 and variance 2.
IV.IV. Method of Least SquaresMethod of Least Squares1. Estimates and , for and , are chosen to
minimize SSE, the sum of the squared deviations about the regression line,
2. The least squares estimates are and
. ˆˆˆ xy
xy ˆˆ xxxy SS /ˆ
Note 13 of 5E
Key ConceptsKey ConceptsV.V. Analysis of VarianceAnalysis of Variance
1. Total SS SSR SSE, where Total SS Syy and SSR (Sxy)2 Sxx.
2. The best estimate of 2 is MSE SSE (n 2).
VI.VI. Testing, Estimation, and PredictionTesting, Estimation, and Prediction
1. A test for the significance of the linear regression
—H0 : 0 can be implemented using statistic
/ˆ
ˆ
xxSt
Note 13 of 5E
Key ConceptsKey Concepts2. The strength of the relationship between x and y can be
measured using
which gets closer to 1 as the relationship gets stronger.3. Use residual plots to check for nonnormality, inequality of
variances, and an incorrectly fit model.4. Confidence intervals can be constructed to estimate the
slope of the regression line and to estimate the average value of y, for a given value of x.
5. Prediction intervals can be constructed to predict a particular observation, y, for a given value of x. For a given x, prediction intervals are always wider than confidence intervals.
SS Total
SSE12 r