Inference for Simple Regression

1

Social Research Methods 2109 & 6507

Spring 2006

March 15, 16, 2006

2

Regression Equation

Equation of a regression line:

y_hat = α + βx
y = α + βx + ε

y = dependent variable
x = independent variable
β = slope = predicted change in y with a one-unit change in x
α = intercept = predicted value of y when x is 0
y_hat = predicted value of the dependent variable
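As a minimal sketch of how the intercept and slope of such a line can be computed by least squares, the Python below fits a line to a small made-up data set. The data and the helper name `fit_line` are illustrative assumptions, not part of the course material.

```python
# Minimal least-squares fit of y_hat = a + b*x.
# The data are invented for illustration; they are not from the slides.

def fit_line(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope: predicted change in y per one-unit change in x
    a = y_bar - b * x_bar      # intercept: predicted y when x is 0
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 1 + 2x, so every residual is 0

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b)
```

With real data the residuals are not zero; they play the role of the error term ε around the fitted line.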

3

Supplement: Proportional Reduction of Error (PRE)

• PRE measures compare the prediction errors made under two different prediction rules, contrasting a naïve rule with a more sophisticated one

• R² is a PRE measure
• Naïve rule = predict y_bar
• Sophisticated rule = predict y_hat
• R² measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y
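The PRE reading of R² can be checked against the sums of squares reported in the ANOVA table of the infant mortality example used in this deck. The sketch below assumes the usual definition PRE = (E1 - E2) / E1, where E1 is the squared error under the naïve rule and E2 the squared error under the regression rule; the variable names are illustrative.

```python
# Sums of squares as reported in the SPSS ANOVA table for the
# infant mortality vs. female literacy regression.
ss_total = 123172.513      # E1: squared error when predicting y_bar for every case
ss_residual = 35554.673    # E2: squared error when predicting y_hat from the regression

# Proportional reduction of error
pre = (ss_total - ss_residual) / ss_total
print(round(pre, 3))
```

The result agrees with the R Square value of .711 in the model summary for the same regression.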

4

Example: SPSS Regression Procedures and Output

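• To get a scatterplot:

Graphs (G) → Scatter (S) → Simple → Define (select x and y)

• To get a correlation coefficient:

Analyze (A) → Correlate (C) → Bivariate

• To perform simple regression:

Analyze (A) → Regression (R) → Linear (L) (select x and y; you can also choose to save the predicted values and residuals)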

5

SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data

[Figure: scatterplot "Infant Mortality vs. Female Literacy", 109 countries, 1995 UN Data. x-axis: Females who read (%), ticks 0 to 120. y-axis: Infant mortality (deaths per 1000 live births), ticks 0 to 200.]

6

Example: correlation between infant mortality and female literacy

Correlations

                                   BABYMORT   LIT_FEMA
BABYMORT   Pearson correlation     1          -.843**
           Sig. (2-tailed)         .          .000
           N                       109        85
LIT_FEMA   Pearson correlation     -.843**    1.000
           Sig. (2-tailed)         .000       .
           N                       85         85

BABYMORT = Infant mortality (deaths per 1000 live births); LIT_FEMA = Females who read (%)

**. Correlation is significant at the 0.01 level (2-tailed).

7

Regression: infant mortality vs. female literacy, 1995 UN Data

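Model Summary (b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .843 a   .711       .708                20.6971

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Coefficients (a)

Model 1                         B         Std. Error   Beta    t         Sig.   95% CI for B (lower, upper)
(Constant)                      127.203   5.764                22.067    .000   115.738, 138.668
LIT_FEMA Females who read (%)   -1.129    .079         -.843   -14.302   .000   -1.286, -.972

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)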
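To show how the coefficients in this output are used, the sketch below plugs the reported intercept (127.203) and slope (-1.129) into y_hat = a + bx. The function name and the literacy value of 50% are illustrative assumptions.

```python
# Coefficients as reported in the SPSS output above.
a = 127.203    # (Constant): predicted infant mortality when female literacy is 0%
b = -1.129     # LIT_FEMA: predicted change in infant mortality per 1-point rise in literacy

def predict_infant_mortality(female_literacy_pct):
    """Predicted deaths per 1000 live births for a given female literacy rate (%)."""
    return a + b * female_literacy_pct

# Hypothetical country with 50% female literacy
print(round(predict_infant_mortality(50), 3))
```

Each additional percentage point of female literacy lowers the predicted infant mortality by about 1.13 deaths per 1000 live births. Predictions are only meaningful within the range of literacy values observed in the data.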

8

Diagnosis: a residual plot

[Figure: residual plot "Regression Residuals vs. Female Literacy", 109 countries, 1995 UN Data. x-axis: Females who read (%), ticks 0 to 120. y-axis: Unstandardized Residual, ticks -80 to 60.]

9

Global test (F test): tests whether the regression equation has any explanatory power (H0: β = 0)

10

11

The regression model

• Note: the slope and intercept of the fitted regression line are statistics (i.e., computed from the sample data).

• To do inference, we have to think of the sample intercept a and slope b as estimates of the unknown population parameters α and β.

12

Regression as conditional means

• Ways to think about regression:
1. Straight-line description of association
2. Prediction
3. Conditional means

Conditional mean: a mean computed conditional on the value of another variable

The regression line predicts the conditional mean of y given x
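The conditional-means view can be illustrated with a tiny made-up data set in which several y values are observed at each x. In this sketch (data and variable names are assumptions chosen so that the conditional means fall exactly on a line), the least-squares line reproduces the mean of y at each value of x.

```python
from collections import defaultdict

# Invented data: two observations of y at each value of x.
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 4, 6, 6, 8]

# Conditional mean of y for each distinct x
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
cond_means = {x: sum(vals) / len(vals) for x, vals in groups.items()}

# Least-squares line fitted to all six points
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

for x in sorted(cond_means):
    print(x, cond_means[x], a + b * x)   # conditional mean vs. regression prediction
```

In real data the conditional means rarely fall exactly on a line; the regression line is then a straight-line summary of them.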

13

Assumptions for regression inference

Think of there being a population or “true” regression line.

Assumptions:

• For any fixed value of x, the response (y) varies according to a normal distribution. Repeated responses y are independent of each other.

• μy = α + βx (the means of y conditional on x fall on a straight line)

• The standard deviation of y (call it σ) is the same for every value of x. The value of σ is unknown.

14

“True” regression line

15

Inference for regression

• Population regression line:

μy = α +βx

estimated from sample:

(y_hat) = a + bx

b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α

16

Sampling distribution of a (intercept) and b (slope)

• Mean of the sampling distribution of a is α

• Mean of the sampling distribution of b is β

• The standard errors of a and b are related to the amount of spread about the regression line (σ)

• The sampling distributions are normal; when σ is estimated from the sample, use the t distribution for inference
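A small simulation can make the idea of a sampling distribution concrete. The sketch below assumes a hypothetical "true" line (α = 2, β = 0.5, σ = 1) and fixed x values, draws many samples under the regression assumptions, and refits the line each time. All names and parameter values are illustrative.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0     # hypothetical "true" parameters
xs = [0.5 * i for i in range(21)]      # fixed x values: 0, 0.5, ..., 10

def fit_line(xs, ys):
    """Least-squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

intercepts, slopes = [], []
for _ in range(2000):
    # y is normal around the true line with the same sigma at every x
    ys = [ALPHA + BETA * x + random.gauss(0, SIGMA) for x in xs]
    a, b = fit_line(xs, ys)
    intercepts.append(a)
    slopes.append(b)

mean_a = statistics.mean(intercepts)
mean_b = statistics.mean(slopes)
sd_b = statistics.stdev(slopes)

# Standard error of b when sigma is known: sigma / sqrt(sum of (x - x_bar)^2)
x_bar = sum(xs) / len(xs)
theoretical_se_b = SIGMA / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

print(f"mean of a over samples: {mean_a:.3f} (alpha = {ALPHA})")
print(f"mean of b over samples: {mean_b:.3f} (beta = {BETA})")
print(f"sd of b over samples:   {sd_b:.4f} (theory: {theoretical_se_b:.4f})")
```

The averages of a and b land close to α and β, and the spread of the simulated slopes is close to the theoretical standard error, which grows with σ.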

18

The standard error of the least-squares line

• Estimate σ (the spread about the regression line) using the residuals from the regression

• Recall that residual = y - y_hat

• Estimate the population standard deviation about the regression line (σ) using the sample estimates

19

Estimate σ from sample data
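For reference, the usual estimate of σ in simple regression divides the sum of squared residuals by n - 2 and takes the square root. The symbol s for this estimate is assumed from the later slides.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n - 2}}
```

The divisor is n - 2 because two parameters (the intercept and the slope) are estimated before the residuals are computed. SPSS reports this quantity as the "Std. Error of the Estimate", which is 20.6971 in the infant mortality example.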

20

Standard Error of Slope (b)

• The standard error of the slope b is given by:

SEb = s / √Σ(x − x_bar)² = s / (Sx √(n − 1))

• A small standard error of b means that b is a precise estimate of β

• SEb is directly related to s and inversely related to the sample size (n) and to Sx, the standard deviation of x
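The relationship between SEb, s, n and Sx can be sketched numerically. The code below uses invented data and computes the standard error of the slope in two equivalent ways: s divided by the square root of Σ(x − x_bar)², and s divided by Sx√(n − 1). Variable names are illustrative.

```python
import math
import statistics

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# s: estimated standard deviation about the regression line (n - 2 degrees of freedom)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope, two equivalent forms
se_b = s / math.sqrt(sxx)
s_x = statistics.stdev(xs)                 # sample standard deviation of x
se_b_alt = s / (s_x * math.sqrt(n - 1))

print(se_b, se_b_alt)
```

The second form shows directly why more observations or a wider spread of x values give a more precise slope.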

21

Confidence Interval for regression slope

A level C confidence interval for the slope β of the “true” regression line is

b ± t* SEb

where t* is the upper (1 − C)/2 critical value of the t distribution with n − 2 degrees of freedom.

To test the hypothesis H0: β = 0, compute the t statistic:

t = b / SEb

Under H0, this statistic has the t distribution with n − 2 degrees of freedom, which is used to find the P-value.
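As a worked check, the interval and test statistic can be recomputed from the slope and standard error reported for the infant mortality regression in this deck (b = -1.129, SEb = .079, 85 countries with both variables). The critical value t* = 1.989 for 83 degrees of freedom is an approximate table value and should be treated as an assumption.

```python
# Values from the SPSS coefficients table for LIT_FEMA.
b = -1.129
se_b = 0.079
n = 85
df = n - 2            # 83 degrees of freedom

# Approximate upper 0.025 critical value of the t distribution with 83 df
# (looked up from a t table; an assumption, not computed here).
t_star = 1.989

lower = b - t_star * se_b
upper = b + t_star * se_b
t_stat = b / se_b

print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
print(f"t = {t_stat:.2f} with {df} df")
```

The interval matches the (-1.286, -.972) reported by SPSS. The t statistic differs slightly from the reported -14.302 because the inputs here are rounded to three decimals.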

22

Significance Tests for the slope

Test hypotheses about the slope β. Usually:

H0: β = 0 (no linear relationship between the independent and dependent variables)

Alternatives:

HA: β > 0 or HA: β < 0

or HA: β ≠ 0

23

24

Statistical inference for intercept

We could also do statistical inference for the regression intercept, α

Possible hypotheses:

H0: α = 0

HA: α ≠ 0

t-test based on a, very similar to the t-tests we have done before

For most substantive applications, we are interested in the slope (β), and not usually in α

25

Regression: infant mortality vs. female literacy, 1995 UN Data

Model Summary (b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .843 a   .711       .708                20.6971

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Coefficients (a)

Model 1                         B         Std. Error   Beta    t         Sig.   95% CI for B (lower, upper)
(Constant)                      127.203   5.764                22.067    .000   115.738, 138.668
LIT_FEMA Females who read (%)   -1.129    .079         -.843   -14.302   .000   -1.286, -.972

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

ANOVA (b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   87617.840        1    87617.840     204.538   .000 a
Residual     35554.673        83   428.370
Total        123172.513       84

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)
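The pieces of this output fit together arithmetically. The sketch below recomputes the F statistic and the standard error of the estimate from the ANOVA sums of squares, and compares F with the square of the slope's t statistic, since in simple regression the two tests of β = 0 are equivalent. Variable names are illustrative.

```python
import math

# Values from the SPSS ANOVA table.
ss_regression, df_regression = 87617.840, 1
ss_residual, df_residual = 35554.673, 83

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual      # mean square error

f_stat = ms_regression / ms_residual         # global F test of H0: beta = 0
s = math.sqrt(ms_residual)                   # "Std. Error of the Estimate"

t_slope = -14.302                            # t for LIT_FEMA from the coefficients table

print(f"F = {f_stat:.3f}")
print(f"s = {s:.4f}")
print(f"t squared = {t_slope ** 2:.3f}")
```

F reproduces the reported 204.538, s reproduces 20.6971, and t squared agrees with F up to rounding in the reported t.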

26

Hypothesis test example

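Da-Hua is analyzing generational differences in educational attainment. He has collected data on the educational levels of 117 father-son pairs. The father's education is the independent variable and the son's education is the dependent variable. His regression equation is: y_hat = 0.2915x + 10.25

The standard error of the regression slope is 0.10.

1. At α = 0.05, can Da-Hua conclude that the educational levels of fathers and sons are related?

2. For all boys whose fathers are college graduates, what is the predicted mean educational level of these boys?

3. A boy's father is a college graduate. What is the predicted future educational level of this boy?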
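One way to approach the exercise is sketched below. It makes two assumptions that the exercise does not state: education is measured in years, so that a college graduate corresponds to x = 16, and the two-sided 5% critical value of t with 115 degrees of freedom is taken as roughly 1.98 from a t table.

```python
# Given in the exercise
a = 10.25        # intercept
b = 0.2915       # slope
se_b = 0.10      # standard error of the slope
n = 117          # father-son pairs
df = n - 2       # 115 degrees of freedom

# Question 1: test H0: beta = 0 against HA: beta != 0 at alpha = 0.05
t_stat = b / se_b
t_crit = 1.98    # approximate table value for 115 df (assumption)
reject_h0 = abs(t_stat) > t_crit

# Questions 2 and 3: prediction at x = father's education
father_years = 16                 # assumption: college graduate = 16 years of schooling
y_hat = a + b * father_years

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
print(f"predicted son's education: {y_hat:.3f} years")
```

Under these assumptions the slope is significantly different from zero, so fathers' and sons' education appear to be related. Questions 2 and 3 share the same point prediction; they differ in uncertainty, because an interval for one boy's education is wider than an interval for the mean of all such boys, and neither interval can be computed from the information given.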
