Proposition 1.1 De Morgan's Laws - Tongji University

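Regression modeling for a practical problem (flowchart)

Specific problem → set the indicator variables → collect and organize the data → construct the theoretical model → estimate the model parameters → check the model

Model check passed (Y): apply the model (factor analysis, variable control, decision making and prediction)

Model check failed (N): revise the model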

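I. Set the indicator variables according to the purpose of the study

Choose the dependent variable y for the problem under study, then select variables that are statistically related to y as independent variables. The dependent variable and the independent variables should also have a causal relationship.

Causal relationship:

Effect: y (the dependent variable, also called the explained variable)

Cause: $x_1, x_2, \ldots, x_p$ (the independent variables, also called explanatory variables)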

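Choosing the explanatory variables

1. Because our understanding of the problem under study is limited, it is often hard to tell which variables are important and which are secondary. We should therefore work with experts in the relevant field to help decide which variables go into the model.

2. More explanatory variables are not always better. In many economic problems the variables carry heavily overlapping information, which leads to collinearity. Explanatory variables should therefore be chosen on the principle of "few but well chosen".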

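II. Collect and organize the statistical data

• Descriptive statistical analysis

• Exploratory statistical analysis

• Correlation analysis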

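Descriptive statistical analysis

Descriptive analysis examines the features of the data in order to describe the characteristics of the measured sample and of the population it represents. It covers many items; common ones include the mean, standard deviation, median, frequency distribution, and the degree of normality or skewness. These analyses are the basis for more complex statistical analysis.

Items covered by descriptive statistical analysis:

• Mean, standard error

• Median, mode, range

• Standard deviation, variance

• Quartiles, deciles, percentiles

• Frequency distribution, kurtosis, skewness

• Standard scores and their linear transformations

• Exploratory analysis

• Cross-tabulation (contingency table) analysis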

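Exploratory statistical analysis

• Analyze the overall distributional features of one or more groups of data;

• check for outliers and for extremely large or small values;

• check whether each group of data, or all the data together, is normally or approximately normally distributed;

• explore whether the variances of several groups are homogeneous, to decide whether a given statistical technique can be used to test the data, and so on.

1. Use a histogram to show the shape of the distribution;

2. use a box plot to show the central tendency and the outliers;

3. use the Levene test to check whether the variances are homogeneous across groups;

4. use a Q-Q probability plot to check whether the data are normally or approximately normally distributed.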

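Correlation analysis

• Pearson correlation coefficient: the correlation between continuous variables or variables measured on an interval scale

• Kendall rank correlation coefficient: the rank correlation between ordered categorical variables

• Spearman rank correlation coefficient: the Pearson correlation computed on the ranks of the data

• When the data are not normally distributed or the population distribution is unknown, or when the raw data are recorded as ranks, the latter two coefficients are preferable.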
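To make the difference between the Pearson and the Spearman coefficient concrete, here is a minimal Python sketch. It uses only the five observations of the Toluca data that are printed in this document (not the full 25 cases), so the numbers are illustrative only. The function names `pearson`, `ranks`, and `spearman` are hypothetical helpers written for this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# First five printed Toluca observations (lot size, work hours).
lotsize = [80, 30, 50, 90, 70]
workhrs = [399, 121, 221, 376, 361]

r_pearson = pearson(lotsize, workhrs)
r_spearman = spearman(lotsize, workhrs)
print(round(r_pearson, 4), round(r_spearman, 4))
```

The Spearman value depends only on the ordering of the observations, which is why it remains usable when the data are recorded as ranks.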

Distribution of X

data toluca;
  infile 'H:\LEC01TA01.DAT';
  input lotsize workhrs;
  seq=_n_;  * read the data file and add a column of sequence numbers;
proc print data=toluca;
run;

Obs  lotsize  workhrs  seq
  1       80      399    1
  2       30      121    2
  3       50      221    3
  4       90      376    4
  5       70      361    5
  ⁞        ⁞        ⁞    ⁞

Distribution of X: Descriptive (1)

proc univariate data=toluca plot;
  var lotsize workhrs;
run;

Moments

N                 25            Sum Weights        25
Mean              70            Sum Observations   1750
Std Deviation     28.7228132    Variance           825
Skewness          -0.1032081    Kurtosis           -1.0794107
Uncorrected SS    142300        Corrected SS       19800
Coeff Variation   41.0325903    Std Error Mean     5.74456265

Notes: Variance = lxx/(n-1), where lxx is the corrected sum of squares. Skewness is the standardized third moment and kurtosis the standardized fourth moment. Uncorrected SS is the sum of x squared. Coeff Variation = 100 * Std Deviation / Mean.
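The entries of the Moments table are linked by simple arithmetic. The following Python sketch recomputes them from three printed values only (N, Sum Observations, Uncorrected SS); the variable names are chosen for this illustration and are not SAS names.

```python
import math

n = 25                      # N
sum_x = 1750.0              # Sum Observations
uncorrected_ss = 142300.0   # Uncorrected SS: sum of x squared

mean = sum_x / n
corrected_ss = uncorrected_ss - n * mean ** 2   # lxx
variance = corrected_ss / (n - 1)               # lxx / (n - 1)
std_dev = math.sqrt(variance)
coeff_variation = 100 * std_dev / mean
std_error_mean = std_dev / math.sqrt(n)

print(mean, corrected_ss, variance)
print(round(std_dev, 7), round(coeff_variation, 7), round(std_error_mean, 8))
```

The results agree with the printed Mean, Corrected SS, Variance, Std Deviation, Coeff Variation, and Std Error Mean for lotsize.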

Distribution of X: Descriptive (2)

Basic Statistical Measures

Location               Variability
Mean     70.00000      Std Deviation                    28.72281
Median   70.00000      Variance                        825.00000
Mode     90.00000      Range (Max - Min)               100.00000
                       Interquartile Range (Q3 - Q1)    40.00000

Quantiles (Definition 5)

Quantile      Estimate
100% Max      120
99%           120
95%           110
90%           110
75% Q3        90
50% Median    70
25% Q1        50
10%           30
5%            30
1%            20
0% Min        20

Distribution of X: Descriptive (3)

Extreme Observations

Lowest           Highest
Value   Obs      Value   Obs
   20    14        100     9
   30    21        100    16
   30    17        110    15
   30     2        110    20
   40    23        120     7

Distribution of X: Descriptive (4)

Stem Leaf   #   Boxplot
  12 0      1      |       Max
  11 00     2      |
  10 00     2      |
   9 0000   4   +-----+    Q3
   8 000    3   |     |
   7 000    3   *--+--*    Med
   6 0      1   |     |
   5 000    3   +-----+    Q1
   4 00     2      |
   3 000    3      |
   2 0      1      |       Min

Distribution of X: Sequence plot

title1 h=3 'Sequence plot for X with smooth curve';
symbol1 v=circle i=sm70;
axis1 label=(h=2);
axis2 label=(h=2 angle=90);
proc gplot data=toluca;
  plot lotsize*seq / haxis=axis1 vaxis=axis2;
run;

Distribution of X: QQPlot

title1 'QQPlot (normal probability plot)';
proc univariate data=toluca noprint;
  qqplot lotsize workhrs / normal (L=1 mu=est sigma=est);
run;

CS Example

Yi: GPA after 3 semesters

X1: High school math grades (HSM)

X2: High school science grades (HSS)

X3: High school English grades (HSE)

X4: SAT Math (SATM)

X5: SAT Verbal (SATV)

Gender: (1 = male, 2 = female)

n = 224

CS Example: Input

data cs;

infile 'I:\My Documents\2018spring\csdata.dat';

input id gpa hsm hss hse satm satv genderm1;

proc print data=cs; run;

proc reg data=cs;

model gpa=hsm hss hse satm satv;

run;

CS Example: Descriptive Statistics proc means

proc means data=cs maxdec=2;

var gpa hsm hss hse satm satv;

run;

Variable N Mean Std Dev Minimum Maximum

gpa 224 2.64 0.78 0.12 4.00

hsm 224 8.32 1.64 2.00 10.00

hss 224 8.09 1.70 3.00 10.00

hse 224 8.09 1.51 3.00 10.00

satm 224 595.29 86.40 300.00 800.00

satv 224 504.55 92.61 285.00 760.00

CS Example: Correlation

proc corr data=cs;
  var hsm hss hse;
run;

Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

        hsm       hss       hse
hsm     1.00000   0.57569   0.44689
                  <.0001    <.0001
hss     0.57569   1.00000   0.57937
        <.0001              <.0001
hse     0.44689   0.57937   1.00000
        <.0001    <.0001

CS Example: Correlation (cont)

proc corr data=cs noprob;
  var satm satv;
proc corr data=cs noprob;
  var hsm hss hse;
  with satm satv;
run;

        satm      satv
satm    1.00000   0.46394
satv    0.46394   1.00000

        hsm       hss       hse
satm    0.45351   0.24048   0.10828
satv    0.22112   0.26170   0.24371

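Classification of sample data

• Time series data (statistical data arranged in time order)

• Cross-sectional data (statistical data collected at the same point in time)

• With time series data, pay attention to the comparability of the data and to related statistical issues. Time series data easily produce serial correlation in the random error term of the model, so some computation and adjustment of the data is needed to remove the serial correlation.

• With cross-sectional samples, heteroscedasticity arises easily. A regression model usually involves many explanatory variables; if the effect of one or more factors on the explained variable changes with the observed values of the explanatory variables, heteroscedasticity results. When modeling with heteroscedastic data, data preparation should aim to remove the heteroscedasticity.

• Data preparation involves not only transforming, differencing, or even standardizing the data; sometimes "outliers" must be removed, or missing values filled in by interpolation.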

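III. Determine the mathematical form of the theoretical regression model

To determine the mathematical form of the regression model, first draw a scatter plot of the collected sample data, $y_i$ against $x_i$ $(i = 1, 2, \ldots, n)$. The rough shape of the scatter plot suggests a preliminary form for the model.

If the n sample points lie roughly around a straight line, consider fitting a linear regression model. If the n sample points lie roughly around an exponential curve, choose a theoretical regression model of exponential form. Sometimes the available information does not determine the form of the model; in that case, fit several different forms by computer and take the one with the better results as the theoretical model.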

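Multiple linear regression

If n sets of observations $(x_{i1}, x_{i2}, \ldots, x_{ip}; y_i)$, $i = 1, 2, \ldots, n$, are obtained in a practical problem, the regression model is

$$
\begin{cases}
y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} + \varepsilon_1 \\
y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} + \varepsilon_2 \\
\qquad \vdots \\
y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} + \varepsilon_n
\end{cases}
$$

In matrix form:

$$Y = X\beta + \varepsilon$$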

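Assumption 1 (zero mean): $E(\varepsilon_i) = 0$, $i = 1, 2, \ldots, n$, that is, $E(\varepsilon) = 0$.

Assumption 2 (equal variance and no autocorrelation):

$$
\operatorname{Cov}(\varepsilon_i, \varepsilon_j) =
\begin{cases}
\sigma^2, & i = j \\
0, & i \neq j
\end{cases}
\qquad i, j = 1, 2, \ldots, n
$$

that is, the covariance matrix of $\varepsilon$ is

$$D(\varepsilon) = \operatorname{Cov}(\varepsilon, \varepsilon) = \sigma^2 I_n$$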

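Assumption 3 (normality): $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$, with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ mutually independent, that is, $\varepsilon \sim N(0, \sigma^2 I_n)$.

Assumption 4 (the regression design matrix X has full rank):

$$\operatorname{rank}(X) = \operatorname{rank}(X'X) = p + 1 < n$$

From these assumptions it follows that

$$E(Y) = X\beta, \qquad D(Y) = \operatorname{Var}(Y) = \sigma^2 I_n, \qquad Y \sim N(X\beta, \sigma^2 I_n)$$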

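$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

Holding $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p$ fixed and taking the partial derivative of $E(y)$ with respect to $x_i$ gives

$$\frac{\partial E(y)}{\partial x_i} = \beta_i, \qquad i = 1, 2, \ldots, p$$

$\beta_i$ is the average effect on y of a one-unit change in $x_i$ when the other variables are held fixed.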

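IV. Estimate the model parameters

Estimating the unknown parameters of the model is a central part of regression analysis. The most commonly used methods are ordinary least squares and Bayesian methods. For some special regression models, or for regression problems that do not satisfy the basic assumptions, maximum likelihood estimation, weighted least squares, ridge regression, and other estimation methods can also be used.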

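Regression Coefficients

Let the random error vector be $\varepsilon = Y - X\beta$ and define the sum of squared deviations

$$Q(\beta) = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$

The aim is to find $\hat\beta$ such that $Q(\hat\beta) = \min_{\beta} Q(\beta)$.

When $X'X$ is invertible, solving the normal equations $X'X\hat\beta = X'Y$ gives the least squares estimate of $\beta$:

$$\hat\beta = (X'X)^{-1}X'Y$$

where $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)'$.

$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_p x_p$ is called the empirical regression equation.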
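The least squares estimate solves the normal equations $X'X\hat\beta = X'Y$. A minimal Python sketch of that computation follows. It is an illustration under stated assumptions, not the SAS algorithm: the helpers `solve` and `ols` are hypothetical names, the solver is plain Gauss-Jordan elimination, and the data are only the five Toluca observations printed in this document, so the estimates differ from those of the full data set.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least squares estimate: build X'X and X'Y (intercept column added) and solve."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(k)]
    return solve(xtx, xty)

# First five printed Toluca observations (lot size, work hours).
lotsize = [80, 30, 50, 90, 70]
workhrs = [399, 121, 221, 376, 361]

beta = ols([[v] for v in lotsize], workhrs)   # [intercept, slope]
print([round(b, 4) for b in beta])
```

With one predictor the result can be checked against the textbook formulas: the slope equals $S_{xy}/S_{xx}$ and the intercept equals $\bar y - \hat\beta_1 \bar x$.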

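V. Check and revise the model

• Once the model is built, whether it truly captures the relationship between the explained variable and the explanatory variables can only be established by checking the model.

• Model checking consists of two kinds of checks: statistical tests and checks of practical meaning.

Statistical tests

Tests of the basic assumptions:

• test the explanatory variables for multicollinearity

• test the random error term for heteroscedasticity

• test the random error term for serial correlation

Significance tests:

• goodness-of-fit test

• significance test of the regression coefficients

• significance test of the regression equation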

Questions addressed by residuals

• Is the relationship linear?

• Does the variance depend on X?

• Are there outliers?

• Are error terms not independent?

• Are the errors normal?

• Can other predictors be helpful?


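• Checking whether the regression model makes practical sense mainly means checking whether the signs of the explanatory variables' coefficients agree with their practical meaning. If some x should be positively correlated with y but the corresponding coefficient β is negative, this cannot be explained in the practical problem, and the model loses its meaning.

If a model fails some statistical test, or passes the tests but has no reasonable practical meaning, it must be revised. Revision can be considered from the following angles:

1. whether the variables are set up reasonably and whether important variables have been omitted;

2. whether the explanatory variables are strongly correlated with each other;

3. whether the sample size is too small;

4. whether the form of the theoretical model is reasonable, and so on.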


Body Fat Example

For 20 healthy female subjects aged 25 to 30

Y = amount of body fat (fat)

X1 = triceps skinfold thickness (skinfold)

X2 = thigh circumference (thigh)

X3 = midarm circumference (midarm)

Body Fat Example: Regression (input)

data bodyfat;

infile 'I:\My Documents\2018SPRING\CH07TA01.DAT';

input skinfold thigh midarm fat;

proc print data=bodyfat;

run;

proc reg data=bodyfat;

model fat=skinfold thigh midarm;

run;

Body Fat Example: Correlation

proc corr data=bodyfat noprob;run;


skinfold thigh midarm fat

skinfold 1.00000 0.92384 0.45778 0.84327

thigh 0.92384 1.00000 0.08467 0.87809

midarm 0.45778 0.08467 1.00000 0.14244

fat 0.84327 0.87809 0.14244 1.00000

Body Fat Example: Scatter plot

Body Fat Example: Regression (output)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3        396.98461     132.32820     21.52   <.0001
Error             16         98.40489       6.15031
Corrected Total   19        495.38950

Root MSE           2.47998   R-Square   0.8014
Dependent Mean    20.19500   Adj R-Sq   0.7641
Coeff Var         12.28017

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            117.08469         99.78240      1.17     0.2578
skinfold     1              4.33409          3.01551      1.44     0.1699
thigh        1             -2.85685          2.58202     -1.11     0.2849
midarm       1             -2.18606          1.59550     -1.37     0.1896

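Taken together, the set of variables is helpful (F = 21.52, p < .0001).

But no individual parameter is significant when the other two variables are in the model (every t test has p > 0.15).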
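One common way to see why a jointly significant set of predictors can have no individually significant coefficient is the variance inflation factor (VIF). The slides do not compute VIFs; the Python sketch below is an added illustration. It takes the predictor correlations from the Body Fat correlation output, inverts that 3x3 matrix, and reads the VIFs off the diagonal of the inverse. Because the printed correlations are rounded and the matrix is nearly singular, the results are approximate. `inverse3` is a hypothetical helper name.

```python
def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]

names = ["skinfold", "thigh", "midarm"]
# Correlations among the predictors, copied from the proc corr output.
r_xx = [
    [1.00000, 0.92384, 0.45778],
    [0.92384, 1.00000, 0.08467],
    [0.45778, 0.08467, 1.00000],
]

inv = inverse3(r_xx)
# VIF of predictor k is the k-th diagonal element of the inverse correlation matrix.
vif = {name: inv[k][k] for k, name in enumerate(names)}
for name, value in vif.items():
    print(name, round(value, 1))
```

All three values come out in the hundreds, far above the usual rule-of-thumb limit of 10, which is consistent with the large standard errors in the Parameter Estimates table.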

Body Fat Example: Diagnostics (output)


Body Fat Example: Single Xi's (input)

proc reg data=bodyfat;
  model fat = skinfold;
  model fat = thigh;
  model fat = midarm;
run;

Body Fat Example: Single Xi's (output)

model fat = skinfold

Root MSE   2.81977   R-Square   0.7111   Adj R-Sq   0.6950

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             -1.49610          3.31923     -0.45     0.6576
skinfold     1              0.85719          0.12878      6.66     <.0001

model fat = thigh

Root MSE   2.51024   R-Square   0.7710   Adj R-Sq   0.7583

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -23.63449          5.65741     -4.18     0.0006
thigh        1              0.85655          0.11002      7.79     <.0001

model fat = midarm

Root MSE   5.19261   R-Square   0.0203   Adj R-Sq   -0.0341

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             14.68678          9.09593      1.61     0.1238
midarm       1              0.19943          0.32663      0.61     0.5491

Body Fat Example: Single Xi's

proc reg data=bodyfat;
  model fat = thigh;
run;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -23.63449          5.65741     -4.18     0.0006
thigh        1              0.85655          0.11002      7.79     <.0001

Root MSE   2.51024   R-Square   0.7710   Adj R-Sq   0.7583

Body Fat Example: General Linear Test

proc reg data=bodyfat;
  model fat = skinfold thigh midarm;
  skinmid: test skinfold, midarm;
  * test H0: beta1 = beta3 = 0;
run;

Test skinmid Results for Dependent Variable fat

Source        DF   Mean Square   F Value   Pr > F
Numerator      2       7.50940      1.22   0.3210
Denominator   16       6.15031
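The general linear test compares the error sum of squares of the reduced model (here thigh only, since H0 sets beta1 = beta3 = 0) with that of the full model. The Python sketch below redoes the arithmetic from values printed in this document; the reduced-model SSE is recovered from its Root MSE, so small rounding differences are expected. The variable names are chosen for this illustration.

```python
sse_full = 98.40489          # Error SS of the full model (skinfold, thigh, midarm)
df_full = 16                 # n - 4 = 20 - 4
root_mse_reduced = 2.51024   # Root MSE of the model with thigh only
df_reduced = 18              # n - 2 = 20 - 2

# SSE of the reduced model: Root MSE squared times its error degrees of freedom.
sse_reduced = root_mse_reduced ** 2 * df_reduced

ms_numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
ms_denominator = sse_full / df_full          # MSE of the full model
f_value = ms_numerator / ms_denominator

print(round(ms_numerator, 4), round(ms_denominator, 5), round(f_value, 2))
```

The numerator mean square, denominator mean square, and F value agree with the "Test skinmid" output up to rounding.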

Body Fat Example: Single Xi's (input)

proc reg data=bodyfat;
  model fat = skinfold;
run;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             -1.49610          3.31923     -0.45     0.6576
skinfold     1              0.85719          0.12878      6.66     <.0001

Root MSE   2.81977   R-Square   0.7111   Adj R-Sq   0.6950

Body Fat Example: General Linear Test

Does this variable alone do the job?

Perform a general linear test.

proc reg data=bodyfat;
  model fat = skinfold thigh midarm;
  thighmid: test thigh, midarm;
  * test H0: beta2 = beta3 = 0;
run;

Test thighmid Results for Dependent Variable fat

Source        DF   Mean Square   F Value   Pr > F
Numerator      2      22.35741      3.64   0.0500

There appears to be additional information in thigh and midarm beyond skinfold (p = 0.05). Perhaps adding one more variable would be helpful.

Body Fat Example: General Linear Test (input)

proc reg data=bodyfat;
  model fat = skinfold thigh midarm;
  thigh: test thigh;
  * test H0: beta2 = 0;
run;

Test thigh Results for Dependent Variable fat

Source        DF   Mean Square   F Value   Pr > F
Numerator      1       7.52928      1.22   0.2849


Body Fat Example: Partial Correlation

proc reg data=bodyfat;
  model fat=skinfold thigh midarm / pcorr1 pcorr2;
run;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Squared Partial Corr Type I   Squared Partial Corr Type II
Intercept    1            117.08469         99.78240      1.17     0.2578   .                             .
skinfold     1              4.33409          3.01551      1.44     0.1699   0.71110                       0.11435
thigh        1             -2.85685          2.58202     -1.11     0.2849   0.23176                       0.07108
midarm       1             -2.18606          1.59550     -1.37     0.1896   0.10501                       0.10501


Variable        Ratio of sums of squares        Squared Partial Corr Type I
Skinfold (X1)   SSR(X1) / SSTO                  0.71110
Thigh (X2)      SSR(X2|X1) / SSE(X1)            0.23176
Midarm (X3)     SSR(X3|X1,X2) / SSE(X1,X2)      0.10501

Variable        Ratio of sums of squares        Squared Partial Corr Type II
Skinfold (X1)   SSR(X1|X2,X3) / SSE(X2,X3)      0.11435
Thigh (X2)      SSR(X2|X1,X3) / SSE(X1,X3)      0.07108
Midarm (X3)     SSR(X3|X1,X2) / SSE(X1,X2)      0.10501

Body Fat Example: Effects of Correlation

Variables in model   $\hat\beta_1$   $\hat\beta_2$   $s\{\hat\beta_1\}$   $s\{\hat\beta_2\}$
X1                   0.8572                          0.1288
X2                                   0.8565                               0.1100
X1, X2               0.2224          0.6594          0.3034               0.2912
X1, X2, X3           4.334           -2.857          3.016                2.582

A regression coefficient does not reflect an inherent effect of the particular predictor variable on the response variable but only a marginal or partial effect.

Example of Unequal Error Variances

Remedial Measures: Weighted Least Squares

Blood Pressure Example

Studying the relationship between diastolic blood pressure and age in healthy women aged 20 to 60.

Y = diastolic blood pressure (diast)

X = age

n = 54

Blood Pressure: input

data pressure;
  infile 'H:\My Documents\2018SPRING\CH11TA01.DAT';
  input age diast;
proc print data=pressure; run;

title1 h=3 'Blood Pressure';
title2 h=2 'Scatter plot';
symbol1 v=circle i=sm70 c=purple;
axis1 label=(h=2);
axis2 label=(h=2 angle=90);
proc sort data=pressure;
  by age;
proc gplot data=pressure;
  plot diast*age;
run;

Blood Pressure: Scatterplot

Blood Pressure: regression (unweighted)

proc reg data=pressure;
  model diast=age / clb;
  output out=diag r=resid;
run;

Analysis of Variance

Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1       2374.96833    2374.96833     35.79   <.0001
Error    52       3450.36501      66.35317

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             56.15693          3.99367     14.06     <.0001   48.14304   64.17082
age          1              0.58003          0.09695      5.98     <.0001    0.38548    0.77458

Blood Pressure: Residual Plots

* try to find a pattern with age;
data diag;
  set diag;
  absr=abs(resid);
  sqrr=resid*resid;
title2 h=2 'residual abs(resid) squared residual plots vs. age';
proc gplot data=diag;
  plot (resid absr sqrr)*age / haxis=axis1 vaxis=axis2;
run;

Blood Pressure: Residual Plots (cont)

Blood Pressure: computing weights if using abs-resid

* Model the absolute residual as a linear function of age;
* Fit the model and predict the standard deviation for each observation;
proc reg data=diag;
  model absr=age;
  output out=findweights p=shat;
* shat is the predicted value from regressing the absolute residual (as Y) on X;

* Compute weights;
data findweights;
  set findweights;
  wt=1/(shat*shat);
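The weights feed a weighted least squares fit, which minimizes the weighted sum of squared residuals. The Python sketch below shows the closed-form solution for a single predictor. It is an illustration only: the data and the fitted standard deviations are hypothetical values invented for this sketch (not the blood pressure data), and `weighted_least_squares` is a hypothetical helper name.

```python
def weighted_least_squares(x, y, w):
    """Weighted least squares for y = b0 + b1 * x, minimizing sum(w * residual**2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Hypothetical illustration data, NOT the blood pressure data set.
age = [25.0, 35.0, 45.0, 55.0]
diast = [68.0, 79.0, 80.0, 92.0]
shat = [4.0, 6.0, 8.0, 10.0]            # hypothetical fitted standard deviations
wt = [1 / (s * s) for s in shat]        # same rule as wt=1/(shat*shat)

b0, b1 = weighted_least_squares(age, diast, wt)
print(round(b0, 3), round(b1, 3))
```

Observations with a larger estimated standard deviation receive a smaller weight, so they pull the fitted line less than the more precise observations do.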

Blood Pressure: weighted regression (abs-resid)

proc reg data=findweights;
  model diast=age / clb p;
  weight wt;
  output out = weighted r = resid p = predict;
run;

Analysis of Variance

Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1         83.34082      83.34082     56.64   <.0001
Error    52         76.51351       1.47141

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             55.56577          2.52092     22.04     <.0001   50.50718   60.62436
age          1              0.59634          0.07924      7.53     <.0001    0.43734    0.75534

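Blood Pressure: Comparison

• Normal (unweighted) regression

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             56.15693          3.99367     14.06     <.0001   48.14304   64.17082
age          1              0.58003          0.09695      5.98     <.0001    0.38548    0.77458

• Weighted regression

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             55.56577          2.52092     22.04     <.0001   50.50718   60.62436
age          1              0.59634          0.07924      7.53     <.0001    0.43734    0.75534

The estimates change little, but the weighted fit has smaller standard errors and narrower confidence limits.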

Blood Pressure: new residuals

data graphtest;

set weighted;

resid1 = sqrt(wt)*resid;

title2 h=2 'Weighted data - residual plot';

symbol1 v=circle i=none color=red;

proc gplot data=graphtest;

plot resid1*predict/vref=0 haxis=axis1 vaxis=axis2;

run;

Blood Pressure: new residuals

Example of Outlying Observations

Life Insurance Example

Y = the amount of life insurance for the 18 managers (in $1000)

X1 = average annual income (in $1000)

X2 = risk aversion score (0 – 10)

Life Insurance: Input, diagnostics

title1 h=3 'Insurance';
data insurance;
  infile 'I:\My Documents\2018SPRING\CH10TA01.DAT';
  input income risk amount;
run;
proc print data=insurance; run;

* diagnostics: r requests residuals, p requests predicted values;
title2 h=2 'residual plots';
symbol1 v=circle c=black;
proc reg data=insurance;
  model amount = income risk / r p;
  plot r.*(p. income risk);
run;

Life Insurance: Initial Regression

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2           173919         86960    542.33   <.0001
Error             15       2405.14763     160.34318
Corrected Total   17           176324

Coeff Var   9.41851

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           -205.71866         11.39268    -18.06     <.0001
income       1              6.28803          0.20415     30.80     <.0001
risk         1              4.73760          1.37808      3.44     0.0037

Life Insurance: output, diagnostics

Life Insurance: Studentized Deleted Residuals

proc reg data=quad;

model amount=income risk incomesq/r influence;

output out = diag1 r=resid rstudent=rstudent;

run;

proc print data=diag1; run;

The studentized deleted residual RSTUDENT differs slightly from STUDENT because the error variance is estimated without the ith observation, by $\hat\sigma_{e(i)}$ rather than by $\hat\sigma_e$:

$$\text{rstudent}_i = \frac{e_i}{\hat\sigma_{e(i)}} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_{ii})}} \sim t(n - 1 - p - 1)$$

Studentized Deleted Residuals (cont)

Obs   Dependent Variable   Predicted Value   RStudent   Hat Diag H

1 91.0000 97.8164 -5.3155 0.0962

2 162.0000 160.1201 0.8848 0.1711

3 11.0000 11.5901 -0.3333 0.4524

4 240.0000 240.6278 -0.2822 0.1373

5 73.0000 71.5019 0.6618 0.0826

6 311.0000 309.6777 0.7153 0.3848

7 316.0000 315.6359 0.3063 0.7535

8 154.0000 153.3645 0.2931 0.1802

9 164.0000 162.4847 0.6866 0.1258

10 54.0000 52.4068 0.7127 0.1006

11 53.0000 52.8060 0.0866 0.1297

12 326.0000 327.6975 -0.9308 0.3856

13 55.0000 54.4957 0.2210 0.0951

14 130.0000 131.0179 -0.5120 0.3018

15 112.0000 109.6080 1.1138 0.1249

16 91.0000 93.0992 -0.9653 0.1222

17 14.0000 13.8135 0.0909 0.2705

18 63.0000 62.2363 0.3338 0.0856

Identifying Influential Cases

Various Criteria

• DFFITS

• Cook’s Distance

• DFBETAS Measures

Influential observations are those that, according to various criteria, appear to have a large influence on the parameter estimates.

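Influence on Single Fitted Value: DFFITS

$$DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\sqrt{\hat\sigma_{(i)}^2 h_{ii}}} = e_i \left[\frac{n - p - 2}{SSE(1 - h_{ii}) - e_i^2}\right]^{1/2} \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}$$

• $|DFFITS| > 1$: the case is influential, for small to medium data sets;

• $|DFFITS| > 2\sqrt{p/n}$: the case is influential, for large data sets.

Influence on All Fitted Values: Cook's Distance

$$D_i = \frac{\sum_{j=1}^{n} (\hat Y_j - \hat Y_{j(i)})^2}{(p + 1)\hat\sigma^2} = \frac{e_i^2}{(p + 1)\hat\sigma^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

Cook's distance reflects the combined effect of the leverage $h_{ii}$ and the size of the residual $e_i$. The usual rule of thumb is:

when $D_i < 0.5$, the case is not considered an outlier;

when $D_i > 1$, the case is considered an outlier.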

Influence on Regression Coefficients: DFBETAS

$$DFBETAS_{k(i)} = \frac{\hat\beta_k - \hat\beta_{k(i)}}{\sqrt{\hat\sigma_{(i)}^2 c_{kk}}}$$

where $c_{kk}$ is the kth diagonal element of $(X'X)^{-1}$.

• $|DFBETAS| > 1$: the ith case has a large impact on the kth regression coefficient, for small to medium data sets;

• $|DFBETAS| > 2/\sqrt{n}$: the ith case has a large impact on the kth regression coefficient, for large data sets.

Hat Matrix Diagnosis, DFFITS

Obs Residual RStudent Hat Diag H Cov Ratio DFFITS

1 -6.8164 -5.3155 0.0962 0.0147 -1.7339

2 1.8799 0.8848 0.1711 1.2842 0.4020

3 -0.5901 -0.3333 0.4524 2.3742 -0.3029

4 -0.6278 -0.2822 0.1373 1.5215 -0.1126

5 1.4981 0.6618 0.0826 1.2842 0.1986

6 1.3223 0.7153 0.3848 1.8735 0.5656

7 0.3641 0.3063 0.7535 5.3027 0.5356

8 0.6355 0.2931 0.1802 1.5981 0.1374

9 1.5153 0.6866 0.1258 1.3342 0.2604

10 1.5932 0.7127 0.1006 1.2830 0.2384

11 0.1940 0.0866 0.1297 1.5420 0.0334

12 -1.6975 -0.9308 0.3856 1.6912 -0.7373

13 0.5043 0.2210 0.0951 1.4643 0.0717

14 -1.0179 -0.5120 0.3018 1.7786 -0.3366

15 2.3920 1.1138 0.1249 1.0675 0.4209

16 -2.0992 -0.9653 0.1222 1.1616 -0.3601

17 0.1865 0.0909 0.2705 1.8390 0.0553

18 0.7637 0.3338 0.0856 1.4216 0.1022
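DFFITS can be recovered from two other columns of this table through the standard identity DFFITS = RStudent * sqrt(h / (1 - h)). The Python sketch below checks three rows and applies the two cutoffs. Taking p = 4 (intercept, income, risk, incomesq) in the large-sample cutoff is an assumption about which convention for p the slide intends. `dffits` is a hypothetical helper name.

```python
import math

def dffits(rstudent, h):
    """DFFITS from the studentized deleted residual and the leverage h_ii."""
    return rstudent * math.sqrt(h / (1 - h))

n = 18   # number of cases
p = 4    # assumed: number of regression parameters (intercept, income, risk, incomesq)

# (RStudent, Hat Diag H) for observations 1, 2 and 7, copied from the table.
cases = {1: (-5.3155, 0.0962), 2: (0.8848, 0.1711), 7: (0.3063, 0.7535)}
values = {obs: dffits(t, h) for obs, (t, h) in cases.items()}

large_cutoff = 2 * math.sqrt(p / n)                       # cutoff for large data sets
flagged = [obs for obs, v in values.items() if abs(v) > 1]   # small-sample cutoff of 1

for obs, v in values.items():
    print(obs, round(v, 4))
print("flagged:", flagged, "large-sample cutoff:", round(large_cutoff, 3))
```

Of the three rows checked, only observation 1 exceeds the small-sample cutoff of 1. Observation 7 has by far the highest leverage but a small residual, so its DFFITS stays moderate.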

Cook's Distance, DFBetas, Cov Ratio

Columns: Obs, Cook's D, Cov Ratio, DFFITS, then DFBETAS for Intercept, income, risk, incomesq

Obs Cook's D Cov Ratio DFFITS Intercept income risk incomesq

1 0.255 0.0147 -1.7339 -0.4126 0.0662 -0.3686 0.9168

2 0.041 1.2842 0.4020 0.0110 0.2513 -0.2064 -0.2579

3 0.025 2.3742 -0.3029 -0.1839 0.2513 -0.0525 -0.2312

4 0.003 1.5215 -0.1126 0.0642 -0.0692 -0.0299 0.0230

5 0.010 1.2842 0.1986 0.1216 -0.0566 -0.0108 -0.0580

6 0.083 1.8735 0.5656 -0.3627 0.1183 0.3901 0.1704

7 0.077 5.3027 0.5356 -0.0249 0.2235 -0.3381 0.2233

8 0.005 1.5981 0.1374 -0.0372 0.0245 0.0788 -0.0712

9 0.018 1.3342 0.2604 -0.0462 0.1333 0.0084 -0.1799

10 0.015 1.2830 0.2384 0.1978 -0.0988 -0.0773 -0.0084

11 0.000 1.5420 0.0334 0.0195 -0.0244 0.0126 0.0091

12 0.137 1.6912 -0.7373 0.4425 -0.1728 -0.3821 -0.3486

13 0.001 1.4643 0.0717 0.0535 -0.0427 0.0030 0.0063

14 0.030 1.7786 -0.3366 -0.0807 -0.1746 0.2583 0.1861

15 0.044 1.0675 0.4209 0.0160 -0.0195 0.2003 -0.2036

16 0.033 1.1616 -0.3601 -0.1515 -0.0774 0.1654 0.2177

17 0.001 1.8390 0.0553 0.0462 -0.0383 -0.0150 0.0317

18 0.003 1.4216 0.1022 0.0714 -0.0471 -0.0003 -0.0097

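Example of Autocorrelation

Applied Linear Statistical Models (5th edition), 12.1-12.4

We study per capita consumption in China using annual time series data for 1980 to 1998. The dependent variable y is per capita consumption and the independent variable x is per capita national income.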

Table of per capita national income and consumption

Year   Per capita national income x (yuan)   Per capita consumption y (yuan)
1980    460    234.75
1981    489    259.26
1982    525    280.58
1983    580    305.97
1984    692    347.15
1985    853    433.53
1986    956    481.36
1987   1104    545.4
1988   1355    687.51
1989   1512    756.27
1990   1634    797.08
1991   1879    890.66
1992   2287   1063.39
1993   2939   1323.22
1994   3923   1736.32
1995   4854   2224.59
1996   5576   2627.06
1997   6053   2819.36
1998   6392   2958.18

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .999 (a)   .999       .999                31.74833                     .873

a. Predictors: (Constant), per capita national income (yuan)
b. Dependent Variable: per capita consumption (yuan)

ANOVA (b)

Model 1      Sum of Squares   df   Mean Square    F           Sig.
Regression   15326406          1   15326406.15    15306.535   .000 (a)
Residual     17022.070        17   1001.298
Total        15343428         18

a. Predictors: (Constant), per capita national income (yuan)
b. Dependent Variable: per capita consumption (yuan)

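With $\hat\sigma = 31.74833$ and $R^2 = 0.999$, the F test and the t test both show a significant linear relationship between x and y. The computed DW value, however, is 0.873; at $\alpha = 0.05$ the critical values are $d_L = 1.18$ and $d_U = 1.40$, so the DW value indicates positive autocorrelation in the residual series.

Decision regions for the DW statistic:

• 0 to $d_L$: positive autocorrelation (DW = 0.873 falls here)

• $d_L$ to $d_U$: inconclusive

• $d_U$ to $4 - d_U$: no autocorrelation

• $4 - d_U$ to $4 - d_L$: inconclusive

• $4 - d_L$ to 4: negative autocorrelation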
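The DW statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The Python sketch below computes it from the residuals $e_t$ that this document tabulates alongside the first differences (rounded to two decimals) and applies the decision rule with $d_L = 1.18$ and $d_U = 1.40$. The variable names are chosen for this illustration.

```python
# Residuals e_t of the regression of y on x, 1980 to 1998, as tabulated in this document.
residuals = [-12.11, -0.8, 4.13, 4.48, -5.33, 7.75, 8.69, 5.35, 33.19, 30.47,
             15.74, -2.22, -15.24, -52.24, -87.12, -22.7, 51.07, 26.21, 10.7]

numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
denominator = sum(e * e for e in residuals)
dw = numerator / denominator

d_l, d_u = 1.18, 1.40   # critical values at alpha = 0.05 quoted in the text
if dw < d_l:
    verdict = "positive autocorrelation"
elif dw <= d_u:
    verdict = "inconclusive"
elif dw < 4 - d_u:
    verdict = "no autocorrelation"
elif dw <= 4 - d_l:
    verdict = "inconclusive"
else:
    verdict = "negative autocorrelation"

print(round(dw, 3), verdict)
```

Long runs of residuals with the same sign keep successive differences small relative to the residuals themselves, which is what drives DW well below 2 here.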

First Differences Procedure

First compute the differences

$$\Delta y_t = y_t - y_{t-1}, \qquad \Delta x_t = x_t - x_{t-1}$$

The results are shown in the table below.

Year   No.   x_t    y_t       e_t      Δx_t   Δy_t
1980    1     460    234.75   -12.11
1981    2     489    259.26    -0.8      29    24.51
1982    3     525    280.58     4.13     36    21.32
1983    4     580    305.97     4.48     55    25.39
1984    5     692    347.15    -5.33    112    41.18
1985    6     853    433.53     7.75    161    86.38
1986    7     956    481.36     8.69    103    47.83
1987    8    1104    545.4      5.35    148    64.04
1988    9    1355    687.51    33.19    251   142.11
1989   10    1512    756.27    30.47    157    68.76
1990   11    1634    797.08    15.74    122    40.81
1991   12    1879    890.66    -2.22    245    93.58
1992   13    2287   1063.39   -15.24    408   172.73
1993   14    2939   1323.22   -52.24    652   259.83
1994   15    3923   1736.32   -87.12    984   413.1
1995   16    4854   2224.59   -22.7     931   488.27
1996   17    5576   2627.06    51.07    722   402.47
1997   18    6053   2819.36    26.21    477   192.3
1998   19    6392   2958.18    10.7     339   138.82
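The differencing step and the regression through the origin can be reproduced in a few lines. The Python sketch below computes the first differences from the x and y columns and then the no-intercept least squares slope. The R Square formula is the one for regression through the origin (variation measured about zero). The variable names are chosen for this illustration.

```python
import math

# Per capita national income x and per capita consumption y, 1980 to 1998.
x = [460, 489, 525, 580, 692, 853, 956, 1104, 1355, 1512,
     1634, 1879, 2287, 2939, 3923, 4854, 5576, 6053, 6392]
y = [234.75, 259.26, 280.58, 305.97, 347.15, 433.53, 481.36, 545.4, 687.51, 756.27,
     797.08, 890.66, 1063.39, 1323.22, 1736.32, 2224.59, 2627.06, 2819.36, 2958.18]

# First differences: one fewer observation than the original series.
dx = [x[t] - x[t - 1] for t in range(1, len(x))]
dy = [round(y[t] - y[t - 1], 2) for t in range(1, len(y))]

# Regression of dy on dx through the origin (no intercept).
sxy = sum(a * b for a, b in zip(dx, dy))
sxx = sum(a * a for a in dx)
syy = sum(b * b for b in dy)
slope = sxy / sxx
r_square_origin = sxy ** 2 / (sxx * syy)     # variation about the origin
std_error = math.sqrt((syy - slope * sxy) / (len(dx) - 1))

print(round(slope, 4), round(r_square_origin, 3), round(std_error, 2))
```

The differenced series has 18 observations, and only one parameter is estimated, so the residual degrees of freedom are 17.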

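Regress $\Delta y_t$ on $\Delta x_t$ through the origin. The results are shown in the table below.

Model Summary (c, d)

Model   R          R Square (a)   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .991 (b)   .981           .980                29.34036                     1.569

a. For regression through the origin (the no-intercept model), R Square measures the proportion of the variability in the dependent variable about the origin explained by regression. This CANNOT be compared to R Square for models which include an intercept.
b. Predictors: detaxt
c. Dependent Variable: detayt
d. Linear Regression through the Origin

The computed value is DW = 1.569. From the table (n = 18, k = 2, $\alpha = 0.05$), $d_L = 1.16$ and $d_U = 1.39$, so DW falls in the no-autocorrelation region, and both the F test and the t test show that the regression is significant.

Decision regions for the DW statistic:

• 0 to $d_L$: positive autocorrelation

• $d_L$ to $d_U$: inconclusive

• $d_U$ to $4 - d_U$: no autocorrelation (DW = 1.569 falls here)

• $4 - d_U$ to $4 - d_L$: inconclusive

• $4 - d_L$ to 4: negative autocorrelation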

Example of Model Selection

Surgical Example

A surgical unit wants to predict survival in patients undergoing a specific liver operation.

n = 54

Y = post-operation survival time

Explanatory variables:

X1: blood clotting score (blood)

X2: prognostic index (prog)

X3: enzyme function test score (enz)

X4: liver function test score (liver)

X5: age, in years (age)

X6: indicator variable for gender (M: 0, F: 1; gender)

X7 and X8: indicator variables for history of alcohol use, 3 levels (Alcohol Moderate, Alcohol Severe)

Surgical Example: input

data surgical;
  infile 'I:\My Documents\2018SPRING\CH09TA01.txt' delimiter='09'x;
  * lsurv is the log of the survival time, under the name used in the later steps;
  input blood prog enz liver age gender alcmod alcheavy surv lsurv;
run;
proc print data=surgical;
run;

title1 h=3 'Original model';
title2 h=2 'Matrix Scatterplot';
proc sgscatter data=surgical;
  matrix surv blood prog enz liver;
run;

Surgical Example: Scatterplot

Surgical Example: Correlation

proc corr data=surgical noprob;

var lsurv blood prog enz liver;

run;


lsurv blood prog enz liver

lsurv 1.00000 0.24633 0.47015 0.65365 0.64920

blood 0.24633 1.00000 0.09012 -0.14963 0.50242

prog 0.47015 0.09012 1.00000 -0.02361 0.36903

enz 0.65365 -0.14963 -0.02361 1.00000 0.41642

liver 0.64920 0.50242 0.36903 0.41642 1.00000

Variable Selection

Variable Selection Criteria

– $R^2$, adjusted $R^2$, MSE

– $C_p$

– PRESS

– AIC, SBC (aka BIC)

Automatic Search Procedures

– Forward Selection

– Backward Elimination

– Stepwise Search

– All Subset Selection

Surgical Example: Model Selection –data for the current model

proc reg data=surgical outtest=mparam;

model lsurv=blood prog enz liver/

rsquare adjrsq cp press aic sbc;

run;

proc print data=mparam; run;

Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ _PRESS_

1 MODEL1 PARMS lsurv 0.25088 4.06875

Obs _IN_ _P_ _EDF_ _RSQ_ _ADJRSQ_ _CP_ _AIC_ _SBC_

1 4 5 49 0.75914 0.73948 5 -144.587 -134.642

Obs Intercept blood prog enz liver lsurv

1 3.85193 0.083739 0.012671 0.015627 0.032056 -1
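Several of the criteria in this output can be recomputed from a few printed values. The Python sketch below uses the definitions adjusted $R^2 = 1 - \frac{n-1}{n-p}(1 - R^2)$, AIC $= n\ln(SSE/n) + 2p$, and SBC $= n\ln(SSE/n) + p\ln n$, with SSE recovered from _RMSE_ and _EDF_. The variable names are chosen for this illustration.

```python
import math

n = 54              # number of cases
p = 5               # _P_: parameters in the model, including the intercept
edf = 49            # _EDF_: error degrees of freedom, n - p
rmse = 0.25088      # _RMSE_
r_square = 0.75914  # _RSQ_

sse = rmse ** 2 * edf                                  # error sum of squares
adj_r_square = 1 - (n - 1) / (n - p) * (1 - r_square)
aic = n * math.log(sse / n) + 2 * p
sbc = n * math.log(sse / n) + p * math.log(n)

print(round(adj_r_square, 5), round(aic, 3), round(sbc, 3))
```

The results agree with _ADJRSQ_, _AIC_, and _SBC_ up to rounding. SBC penalizes each parameter by ln(54), about 3.99, rather than 2, so it favors smaller models than AIC does.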

Surgical Example: Model Selection -automatic

proc reg data=surgical;

model lsurv=blood prog enz liver / selection=stepwise;

run;

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

Step   Variable Entered   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      enz                                   1                0.4273             0.4273           66.5181   38.79     <.0001
2      prog                                  2                0.2359             0.6632           20.5228   35.72     <.0001
3      blood                                 3                0.0941             0.7572            3.3879   19.37     <.0001

Surgical Example: Model Selection –backward elimination

Bounds on condition number: 1.0308, 9.1864

All variables left in the model are significant at the 0.1000 level.

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              3.76644          0.22676     17.15229    275.89   <.0001
blood                  0.09547          0.02169      1.20436     19.37   <.0001
prog                   0.01334          0.00203      2.67403     43.01   <.0001
enz                    0.01644          0.00163      6.32862    101.80   <.0001

Summary of Backward Elimination

Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      liver              3                0.0019             0.7572           3.3879   0.39      0.5363

VI. Apply the regression model

1. Factor analysis

2. Control

3. Prediction
