Statistics for Business and Economics Dr. TANG Yu Department of Mathematics Soochow University May 28, 2007


Page 1

Statistics for Business and Economics

Dr. TANG Yu

Department of Mathematics, Soochow University

May 28, 2007

Page 2

Types of Correlation

Positive correlation

Slope is positive

Negative correlation

Slope is negative

No correlation

Slope is zero

Page 3

Hypothesis Test

For the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$

If x and y are linearly related, we must have $\beta_1 \neq 0$

We will use the sample data to test the following hypotheses about the parameter $\beta_1$

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

Page 4

Sampling Distribution

• Just as the sampling distribution of the sample mean, X-bar, depends on the mean, standard deviation and shape of the X population, the sampling distributions of the β0-hat and β1-hat least squares estimators depend on the properties of the {Yj} sub-populations (j = 1, …, n).

Given xj, the properties of the {Yj} sub-population are determined by the error (random) variable εj.

$$y_j = \beta_0 + \beta_1 x_j + \varepsilon_j$$

Page 5

Model Assumption

As regards the probability distributions of εj (j = 1, …, n), it is assumed that:

i. Each εj is normally distributed, so Yj is also normal;

ii. Each εj has zero mean, so $E(Y_j) = \beta_0 + \beta_1 x_j$;

iii. Each εj has the same variance $\sigma_\varepsilon^2$, so $Var(Y_j) = \sigma_\varepsilon^2$ is also constant;

iv. The errors are independent of each other, so {Yi} and {Yj}, i ≠ j, are also independent;

v. The error does not depend on the independent variable(s), so the effects of X and ε on Y can be separated from each other.

Page 6

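Graph Show

(Figure: the line $E(Y) = \beta_0 + \beta_1 X$, with E(Y) on the vertical axis and X on the horizontal axis; $x_i$ and $x_j$ are marked on the X axis.)

$Y_i \sim N(\beta_0 + \beta_1 x_i;\ \sigma)$

$Y_j \sim N(\beta_0 + \beta_1 x_j;\ \sigma)$

The y distributions have the same shape at each x value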

Page 7

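Sum of Squares

Sum of squares due to error (SSE)

$$SSE = \sum (Y_i - \hat{Y}_i)^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2 \quad \text{(for the four points illustrated)}$$

Sum of squares due to regression (SSR)

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$

Total sum of squares (SST)

$$SST = \sum (Y_i - \bar{Y})^2 = SSE + SSR$$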

Page 8

ANOVA Table

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR/1 | MSR/MSE |
| Error | SSE | n-2 | MSE = SSE/(n-2) | |
| Total | SST | n-1 | | |

Page 9

Example

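$$\bar{x} = \frac{30.33}{7} = 4.333 \qquad \bar{y} = \frac{350.61}{7} = 50.087$$

$$\hat{\beta}_1 = \frac{-202.4872}{22.4749} = -9.01$$

$$\hat{\beta}_0 = 50.09 - (-9.01)(4.33) = 89.10$$

$$\hat{y} = 89.10 - 9.01x$$

| Score (y) | LSD Conc (x) | x-xbar | y-ybar | Sxx: (x-xbar)^2 | Sxy: (x-xbar)(y-ybar) | Syy: (y-ybar)^2 |
|---|---|---|---|---|---|---|
| 78.93 | 1.17 | -3.163 | 28.843 | 10.004569 | -91.230409 | 831.918649 |
| 58.20 | 2.97 | -1.363 | 8.113 | 1.857769 | -11.058019 | 65.820769 |
| 67.47 | 3.26 | -1.073 | 17.383 | 1.151329 | -18.651959 | 302.168689 |
| 37.47 | 4.69 | 0.357 | -12.617 | 0.127449 | -4.504269 | 159.188689 |
| 45.65 | 5.83 | 1.497 | -4.437 | 2.241009 | -6.642189 | 19.686969 |
| 32.92 | 6.00 | 1.667 | -17.167 | 2.778889 | -28.617389 | 294.705889 |
| 29.97 | 6.41 | 2.077 | -20.117 | 4.313929 | -41.783009 | 404.693689 |
| Total: 350.61 | 30.33 | -0.001 | 0.001 | 22.474943 | -202.487243 | 2078.183343 |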
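The least squares estimates in this example can be checked with a few lines of Python. This is a minimal sketch using only the seven data pairs from the table; the variable names are illustrative and not part of the slides. At full precision the intercept comes out near 89.12; the slides' 89.10 appears to come from rounding the slope and the means before substituting.

```python
# Data from the example: LSD concentration (x) and test score (y).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx          # slope estimate (beta1-hat)
b0 = y_bar - b1 * x_bar   # intercept estimate (beta0-hat)

print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}")
print(f"S_xx = {s_xx:.3f}, S_xy = {s_xy:.3f}, S_yy = {s_yy:.3f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
```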

Page 10

SSE

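$$\hat{y} = 89.10 - 9.01x$$

| $Y_i$ | $\hat{Y}_i$ | $Y_i - \hat{Y}_i$ | $(Y_i - \hat{Y}_i)^2$ |
|---|---|---|---|
| 78.93 | 78.5583 | 0.3717 | 0.138161 |
| 58.20 | 62.3403 | -4.1403 | 17.14208 |
| 67.47 | 59.7274 | 7.7426 | 59.94785 |
| 37.47 | 46.8431 | -9.3731 | 87.855 |
| 45.65 | 36.5717 | 9.0783 | 82.41553 |
| 32.92 | 35.04 | -2.12 | 4.4944 |
| 29.97 | 31.3459 | -1.3759 | 1.893101 |
| | | SSE = | 253.886 |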

Page 11

SST and SSR

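$$S_{xx} = 22.475 \qquad S_{xy} = -202.487 \qquad S_{yy} = 2078.183$$

$$\hat{y} = 89.10 - 9.01x$$

$$SST = S_{YY} = 2078.183 \qquad SSE = 253.89$$

$$SSR = SST - SSE = 2078.183 - 253.89 = 1824.3$$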

Page 12

ANOVA Table

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Regression | 1824.3 | 1 | MSR = 1824.3 | 35.93 |
| Error | 253.9 | 5 | MSE = 50.78 | |
| Total | 2078.2 | 6 | | |

As F = 35.93 > 6.61, where 6.61 is the critical value of the F-distribution with 1 and 5 degrees of freedom at the .05 significance level, we reject H0 and conclude that the relationship between x and y is significant.
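The ANOVA quantities above can be reproduced directly from the data. The sketch below is illustrative only: it refits the line at full precision (so SSE differs from the slide's 253.886 in the third decimal) and takes the critical value 6.61 from the slide rather than computing it.

```python
# Data from the example (x: LSD concentration, y: score).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Decomposition of the total variation: SST = SSE + SSR.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sst = sum((yi - y_bar) ** 2 for yi in y)

msr = ssr / 1          # regression has 1 degree of freedom
mse = sse / (n - 2)    # error has n - 2 degrees of freedom
f_stat = msr / mse

F_CRITICAL = 6.61      # F(1, 5) at the .05 level, as quoted on the slide
reject_h0 = f_stat > F_CRITICAL

print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
print(f"MSE = {mse:.2f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```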

Page 13

Hypothesis Test

For the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$

If x and y are linearly related, we must have $\beta_1 \neq 0$

We will use the sample data to test the following hypotheses about the parameter $\beta_1$

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

Page 14

Standard Errors

Standard error of estimate: the sample standard deviation of ε.

$$s_\varepsilon = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}$$

Replacing $\sigma_\varepsilon$ with its estimate, $s_\varepsilon$, the estimated standard error of β1-hat is

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{S_{xx}}} = \frac{s_\varepsilon}{\sqrt{\sum (x_i - \bar{x})^2}}$$

Page 15

t-test

Hypothesis

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

Test statistic

$$t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$

where t follows a t-distribution with n-2 degrees of freedom

Page 16

Rejection Rule

Hypothesis

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

This is a two-tailed test

p-value approach: Reject $H_0$ if p-value $\le \alpha$

Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$

Page 17

Example

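$$\bar{x} = \frac{30.33}{7} = 4.333 \qquad \bar{y} = \frac{350.61}{7} = 50.087$$

$$\hat{\beta}_1 = \frac{-202.4872}{22.4749} = -9.01$$

$$\hat{\beta}_0 = 50.09 - (-9.01)(4.33) = 89.10$$

$$\hat{y} = 89.10 - 9.01x$$

| Score (y) | LSD Conc (x) | x-xbar | y-ybar | Sxx: (x-xbar)^2 | Sxy: (x-xbar)(y-ybar) | Syy: (y-ybar)^2 |
|---|---|---|---|---|---|---|
| 78.93 | 1.17 | -3.163 | 28.843 | 10.004569 | -91.230409 | 831.918649 |
| 58.20 | 2.97 | -1.363 | 8.113 | 1.857769 | -11.058019 | 65.820769 |
| 67.47 | 3.26 | -1.073 | 17.383 | 1.151329 | -18.651959 | 302.168689 |
| 37.47 | 4.69 | 0.357 | -12.617 | 0.127449 | -4.504269 | 159.188689 |
| 45.65 | 5.83 | 1.497 | -4.437 | 2.241009 | -6.642189 | 19.686969 |
| 32.92 | 6.00 | 1.667 | -17.167 | 2.778889 | -28.617389 | 294.705889 |
| 29.97 | 6.41 | 2.077 | -20.117 | 4.313929 | -41.783009 | 404.693689 |
| Total: 350.61 | 30.33 | -0.001 | 0.001 | 22.474943 | -202.487243 | 2078.183343 |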

Page 18

SSE

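$$\hat{y} = 89.10 - 9.01x$$

| $Y_i$ | $\hat{Y}_i$ | $Y_i - \hat{Y}_i$ | $(Y_i - \hat{Y}_i)^2$ |
|---|---|---|---|
| 78.93 | 78.5583 | 0.3717 | 0.138161 |
| 58.20 | 62.3403 | -4.1403 | 17.14208 |
| 67.47 | 59.7274 | 7.7426 | 59.94785 |
| 37.47 | 46.8431 | -9.3731 | 87.855 |
| 45.65 | 36.5717 | 9.0783 | 82.41553 |
| 32.92 | 35.04 | -2.12 | 4.4944 |
| 29.97 | 31.3459 | -1.3759 | 1.893101 |
| | | SSE = | 253.886 |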

Page 19

Calculation

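$$s_\varepsilon = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{253.89}{7-2}} = 7.1258$$

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{S_{xx}}} = \frac{s_\varepsilon}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{7.1258}{\sqrt{22.475}} = 1.5031$$

$$t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{-9.01}{1.5031} = -5.9943 < -2.571$$

where 2.571 is the critical value $t_{.025}$ of the t-distribution with 5 degrees of freedom (two-tailed test at the .05 significance level, so each tail takes .025), so we reject H0 and conclude that the relationship between x and y is significant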
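The same calculation can be sketched in Python. The critical value 2.571 is taken from the slide, not computed; the other names are illustrative. Working at full precision gives t of about -5.994, and its square matches the F statistic of the ANOVA table, as expected in simple linear regression.

```python
import math

# Data from the example (x: LSD concentration, y: score).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of estimate: s = sqrt(SSE / (n - 2)).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Estimated standard error of the slope and the t statistic for H0: beta1 = 0.
se_b1 = s / math.sqrt(s_xx)
t_stat = b1 / se_b1

T_CRITICAL = 2.571  # t(.025, 5 df), as quoted on the slide
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"s = {s:.4f}, se(b1) = {se_b1:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```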

Page 20

Confidence Interval

β1-hat is an estimator of β1

The estimated standard error of β1-hat is

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{S_{xx}}} = \frac{s_\varepsilon}{\sqrt{\sum (x_i - \bar{x})^2}}$$

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$$

follows a t-distribution with n-2 degrees of freedom

So the C% confidence interval estimator of β1 is

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\; s_{\hat{\beta}_1}$$

Page 21

Example

The 95% confidence interval estimator of β1 in the previous example is

$$-9.01 \pm 2.571 \times 1.5031 = -9.01 \pm 3.86$$

i.e., from -12.87 to -5.15, which does not contain 0
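As a quick check of the arithmetic, the interval can be computed from the three numbers quoted on the slides. The function name below is a hypothetical helper, not something from the lecture.

```python
def slope_confidence_interval(b1, se_b1, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * se_b1."""
    margin = t_crit * se_b1
    return b1 - margin, b1 + margin

# Values quoted on the slides: slope, its standard error, and t(.025, 5 df).
lower, upper = slope_confidence_interval(b1=-9.01, se_b1=1.5031, t_crit=2.571)

print(f"95% CI for beta1: ({lower:.2f}, {upper:.2f})")
print("interval contains 0:", lower <= 0 <= upper)
```

Because the interval excludes 0, it leads to the same conclusion as the two-tailed t-test at the .05 level.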

Page 22

Regression Equation

It is believed that the longer one studies, the better one's grade is. The final mark (Y) on study time (X) is supposed to follow the regression equation:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877x$$

If the fit of the sample regression equation is satisfactory, it can be used to estimate its mean value or to predict the dependent variable.

Page 23

Estimate and Predict

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877x$$

Predict: for a particular element of a Y sub-population.

E.g.: What is the final mark of Tom, who spent 30 hours studying? I.e., given x = 30, how large is y?

Estimate: for the expected value of a Y sub-population.

E.g.: What is the mean final mark of all those students who spent 30 hours studying? I.e., given x = 30, how large is E(y)?

Page 24

What Is the Same?

For a given X value, the point forecast (prediction) of Y and the point estimator of the mean of the {Y} sub-population are the same:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Ex.1 Estimate the mean final mark of students who spent 30 hours on study.

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877 \times 30 = 77.9$$

Ex.2 Predict the final mark of Tom, when his study time is 30 hours.
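Both exercises reduce to evaluating the fitted line at x = 30. A minimal sketch, with a hypothetical helper name and the coefficients from the slide:

```python
def fitted_value(x, b0=21.590, b1=1.877):
    """Evaluate the sample regression line y-hat = b0 + b1 * x."""
    return b0 + b1 * x

# Ex.1 (mean mark of all students with 30 study hours) and
# Ex.2 (Tom's mark with 30 study hours) share the same point value.
mean_mark_estimate = fitted_value(30)
tom_mark_prediction = fitted_value(30)

print(f"point estimate = {mean_mark_estimate:.1f}, point prediction = {tom_mark_prediction:.1f}")
```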

Page 25

What Is the Difference?

The interval prediction of Y and the interval estimation of the mean of the {Y} sub-population are different:

The prediction

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The estimation

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The prediction interval is wider than the confidence interval

Page 26

Example

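$$\bar{x} = \frac{30.33}{7} = 4.333 \qquad \bar{y} = \frac{350.61}{7} = 50.087$$

$$\hat{\beta}_1 = \frac{-202.4872}{22.4749} = -9.01$$

$$\hat{\beta}_0 = 50.09 - (-9.01)(4.33) = 89.10$$

$$\hat{y} = 89.10 - 9.01x$$

| Score (y) | LSD Conc (x) | x-xbar | y-ybar | Sxx: (x-xbar)^2 | Sxy: (x-xbar)(y-ybar) | Syy: (y-ybar)^2 |
|---|---|---|---|---|---|---|
| 78.93 | 1.17 | -3.163 | 28.843 | 10.004569 | -91.230409 | 831.918649 |
| 58.20 | 2.97 | -1.363 | 8.113 | 1.857769 | -11.058019 | 65.820769 |
| 67.47 | 3.26 | -1.073 | 17.383 | 1.151329 | -18.651959 | 302.168689 |
| 37.47 | 4.69 | 0.357 | -12.617 | 0.127449 | -4.504269 | 159.188689 |
| 45.65 | 5.83 | 1.497 | -4.437 | 2.241009 | -6.642189 | 19.686969 |
| 32.92 | 6.00 | 1.667 | -17.167 | 2.778889 | -28.617389 | 294.705889 |
| 29.97 | 6.41 | 2.077 | -20.117 | 4.313929 | -41.783009 | 404.693689 |
| Total: 350.61 | 30.33 | -0.001 | 0.001 | 22.474943 | -202.487243 | 2078.183343 |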

Page 27

SSE

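$$\hat{y} = 89.10 - 9.01x$$

| $Y_i$ | $\hat{Y}_i$ | $Y_i - \hat{Y}_i$ | $(Y_i - \hat{Y}_i)^2$ |
|---|---|---|---|
| 78.93 | 78.5583 | 0.3717 | 0.138161 |
| 58.20 | 62.3403 | -4.1403 | 17.14208 |
| 67.47 | 59.7274 | 7.7426 | 59.94785 |
| 37.47 | 46.8431 | -9.3731 | 87.855 |
| 45.65 | 36.5717 | 9.0783 | 82.41553 |
| 32.92 | 35.04 | -2.12 | 4.4944 |
| 29.97 | 31.3459 | -1.3759 | 1.893101 |
| | | SSE = | 253.886 |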

Page 28

Estimation and Prediction

$\hat{y} = 89.10 - 9.01x$. For $x_g = 5.0$:

The point forecast (prediction) of Y and the point estimator of the mean of the {Y} sub-population are the same:

$$\hat{y} = 89.10 - 9.01 \times 5.0 = 44.05$$

Page 29

Estimation and Prediction

$\hat{y} = 89.10 - 9.01x$. For $x_g = 5.0$:

But the interval estimation and the interval prediction are different:

Page 30

Data Needed

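$$s_\varepsilon = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{253.89}{7-2}} = 7.1258$$

$$\sum (x_i - \bar{x})^2 = S_{xx} = 22.475$$

$$t_{.025} = 2.571$$

For $x_g = 5.0$

The prediction

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The estimation

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$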

Page 31

Calculation

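Estimation

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 44.05 \pm 2.571 \times 7.1258 \times \sqrt{\frac{1}{7} + \frac{(5.0 - 4.333)^2}{22.475}} = 44.05 \pm 7.3887$$

Prediction

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 44.05 \pm 2.571 \times 7.1258 \times \sqrt{1 + \frac{1}{7} + \frac{(5.0 - 4.333)^2}{22.475}} = 44.05 \pm 19.7543$$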
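The two half-widths can be reproduced from the summary values on the preceding slides. The sketch below is illustrative: the function and its keyword arguments are hypothetical names, and the inputs are the rounded figures quoted in the deck.

```python
import math

def interval_half_width(x_g, *, n, x_bar, s_xx, s, t_crit, prediction=False):
    """Half-width of the interval at x_g.

    prediction=False: confidence interval for the mean of Y at x_g.
    prediction=True:  prediction interval for a single Y at x_g (adds 1 under the root).
    """
    extra = 1.0 if prediction else 0.0
    return t_crit * s * math.sqrt(extra + 1.0 / n + (x_g - x_bar) ** 2 / s_xx)

# Summary values quoted on the slides.
summary = dict(n=7, x_bar=4.333, s_xx=22.475, s=7.1258, t_crit=2.571)

x_g = 5.0
y_hat = 89.10 - 9.01 * x_g
ci_half = interval_half_width(x_g, **summary)
pi_half = interval_half_width(x_g, prediction=True, **summary)

print(f"estimation: {y_hat:.2f} +/- {ci_half:.4f}")
print(f"prediction: {y_hat:.2f} +/- {pi_half:.4f}")
```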

Page 32

Moving Rule

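$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

As $x_g$ moves away from $\bar{x}$ the interval becomes longer. That is, the shortest interval is found at $\bar{x}$.

The confidence interval when $x_g = \bar{x}$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n}}$$

The confidence interval when $x_g = \bar{x} \pm 1$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{1^2}{\sum (x_i - \bar{x})^2}}$$

The confidence interval when $x_g = \bar{x} \pm 2$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{2^2}{\sum (x_i - \bar{x})^2}}$$

(Figure: the x axis is marked at $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$, $\bar{x} + 2$.)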

Page 33

Moving Rule

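$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

As $x_g$ moves away from $\bar{x}$ the interval becomes longer. That is, the shortest interval is found at $\bar{x}$.

The prediction interval when $x_g = \bar{x}$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n}}$$

The prediction interval when $x_g = \bar{x} \pm 1$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{1^2}{\sum (x_i - \bar{x})^2}}$$

The prediction interval when $x_g = \bar{x} \pm 2$:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{2^2}{\sum (x_i - \bar{x})^2}}$$

(Figure: the x axis is marked at $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$, $\bar{x} + 2$.)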
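The moving rule can be illustrated numerically. This sketch borrows the summary values of the LSD example (n = 7, S_xx = 22.475, s = 7.1258, t = 2.571) purely as sample inputs; the helper name is hypothetical. It tabulates both half-widths at distances 0, 1 and 2 from the mean of x.

```python
import math

def half_width(distance, *, n, s_xx, s, t_crit, prediction=False):
    """Interval half-width when x_g lies `distance` away from x-bar."""
    extra = 1.0 if prediction else 0.0
    return t_crit * s * math.sqrt(extra + 1.0 / n + distance ** 2 / s_xx)

# Sample inputs taken from the LSD example on the earlier slides.
params = dict(n=7, s_xx=22.475, s=7.1258, t_crit=2.571)

distances = [0, 1, 2]
ci_widths = [half_width(d, **params) for d in distances]
pi_widths = [half_width(d, prediction=True, **params) for d in distances]

for d, ci, pi in zip(distances, ci_widths, pi_widths):
    print(f"|x_g - x_bar| = {d}: estimation +/- {ci:.2f}, prediction +/- {pi:.2f}")
```

Both sequences grow with the distance from the mean, and the prediction half-width exceeds the estimation half-width at every point.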

Page 34

Interval Estimation

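(Figure: the estimation and prediction interval bands, labelled "Estimation" and "Prediction", with the x axis marked at $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$, $\bar{x} + 2$.)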

Page 35

Residual Analysis

Regression Residual: the difference between an observed y value and its corresponding predicted value

$$r = y - \hat{y}$$

Properties of Regression Residual

The mean of the residuals equals zero

The standard deviation of the residuals is equal to the standard deviation of the fitted regression model

Page 36

Example

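$$\hat{y} = 89.10 - 9.01x$$

| Score (y) | LSD Conc (x) | y-hat | residual (r) |
|---|---|---|---|
| 78.93 | 1.17 | 78.558 | 0.3717 |
| 58.20 | 2.97 | 62.34 | -4.1403 |
| 67.47 | 3.26 | 59.727 | 7.7426 |
| 37.47 | 4.69 | 46.843 | -9.3731 |
| 45.65 | 5.83 | 36.572 | 9.0783 |
| 32.92 | 6.00 | 35.04 | -2.12 |
| 29.97 | 6.41 | 31.346 | -1.3759 |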
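The zero-mean property of the residuals can be checked numerically. In this sketch the line is refitted at full precision, because the residuals in the table use the rounded coefficients 89.10 and -9.01 and therefore sum to about 0.18 rather than exactly zero. Variable names are illustrative.

```python
import math

# Data from the example (x: LSD concentration, y: score).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual r = y - y_hat for each observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

residual_mean = sum(residuals) / n
# Residual standard deviation with n - 2 degrees of freedom,
# i.e. the standard error of estimate s = sqrt(SSE / (n - 2)).
residual_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

print(f"mean of residuals = {residual_mean:.2e}")
print(f"standard deviation of residuals = {residual_sd:.4f}")
```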

Page 37

Residual Plot Against x

(Plot: residual r on the vertical axis against x on the horizontal axis.)

Page 38

Residual Plot Against y-hat

(Plot: residual r on the vertical axis against y-hat on the horizontal axis.)

Page 39

Three Situations

Good Pattern

Non-constant Variance

Model form not adequate

Page 40

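Standardized Residual

Standard deviation of the ith residual

$$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$$

where

$s_{y_i - \hat{y}_i}$ = the standard deviation of residual i

$s$ = the standard error of the estimate ($s_\varepsilon$)

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_j - \bar{x})^2}$$

Standardized residual for observation i

$$z_i = \frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$$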
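The formulas above can be applied to the LSD example as follows. This is a minimal sketch with illustrative names; the slides do not list the standardized residuals for this data set, so the values are computed here rather than quoted.

```python
import math

# Data from the LSD example (x: concentration, y: score).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))  # standard error of estimate

# Leverage h_i, standard deviation of each residual, and standardized residual z_i.
leverages = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
resid_sd = [s * math.sqrt(1 - h) for h in leverages]
z = [r / sd for r, sd in zip(residuals, resid_sd)]

for xi, zi in zip(x, z):
    print(f"x = {xi:.2f}: z = {zi:+.3f}")
```

For this data set every standardized residual falls between -2 and +2.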

Page 41

Standardized Residual Plot

(Plot: standardized residual z on the vertical axis against x on the horizontal axis.)

Page 42

Standardized Residual

The standardized residual plot can provide insight about the assumption that the error term has a normal distribution

It is expected to see approximately 95% of the standardized residuals between -2 and +2

If the assumption is satisfied, the distribution of the standardized residuals should appear to come from a standard normal probability distribution

Page 43

Detecting Outlier

Outlier

Page 44

Influential Observation

Outlier

Page 45

Influential Observation

Influential observation

Page 46

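High Leverage Points

Leverage of observation

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_j - \bar{x})^2}$$

For example, with the x values

10, 10, 15, 20, 20, 25, 70

$$\bar{x} = 24.2857$$

the leverage of the observation with $x_i = 70$ is

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_j - \bar{x})^2} = \frac{1}{7} + \frac{(70 - 24.2857)^2}{\sum (x_j - 24.2857)^2} = .94 > \frac{6}{n} = \frac{6}{7} = .86$$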
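The leverage example can be verified with a short script. The seven x values are read from the slide as 10, 10, 15, 20, 20, 25, 70, which is consistent with the stated mean of 24.2857; the 6/n cut-off is the rule used on the slide, and the variable names are illustrative.

```python
# x values from the leverage example on the slide.
x = [10, 10, 15, 20, 20, 25, 70]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each observation: h_i = 1/n + (x_i - x_bar)^2 / S_xx.
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

threshold = 6 / n  # rule of thumb used on the slide
high_leverage_x = [xi for xi, h in zip(x, leverage) if h > threshold]

for xi, h in zip(x, leverage):
    print(f"x = {xi:>2}: h = {h:.2f}")
print(f"threshold 6/n = {threshold:.2f}, high leverage at x = {high_leverage_x}")
```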

Page 47

Contact Information

Tang Yu ( 唐煜 ) [email protected] http://math.suda.edu.cn/homepage/tangy