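Regression Modeling for Practical Problems
Modeling workflow:
Specific problem → set the indicator variables → collect and organize the data → construct the theoretical model → estimate the model parameters → test the model
If the model passes the tests (Y): apply the model (factor analysis, variable control, decision-making and forecasting)
If it does not (N): revise the model and repeat the earlier steps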
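I. Set the indicator variables according to the purpose of the study
Choose the dependent variable y for the problem under study, then select variables that have a statistical relationship with y as the independent variables. The dependent variable and the independent variables should be causally related.
Causal relationship:
Effect: y (the dependent variable, also called the explained variable)
Cause: $x_1, x_2, \ldots, x_p$ (the independent variables, also called explanatory variables)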
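Choosing the explanatory variables
1. Because our understanding of the problem under study is limited, it is often hard to tell which variables are important and which are secondary. It helps to work with experts in the relevant field when deciding which variables enter the model.
2. More explanatory variables are not always better. In many economic problems the variables carry heavily overlapping information, which leads to collinearity. Explanatory variables should therefore be chosen on a "few but well-chosen" principle.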
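II. Collect and organize the statistical data
• Descriptive statistical analysis
• Exploratory statistical analysis
• Correlation analysis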
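Descriptive statistical analysis
Descriptive analysis summarizes the features of the data so as to describe the measured sample and the population it represents. Many quantities can be reported; common ones include the mean, standard deviation, median, frequency distribution, and the degree of normality or skewness. These analyses are the basis for more complex statistical analysis.
Typical items:
• Mean, standard error; median, mode, range
• Standard deviation, variance; quartiles, deciles, percentiles
• Frequency distribution, kurtosis, skewness; standard scores and their linear transformations
• Exploratory analysis; cross-tabulation (contingency table) analysis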
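Exploratory statistical analysis
• Analyze the overall distribution of one or more groups of data.
• Check for outliers and for extremely large or small values.
• Check whether each group, or the data as a whole, is normally or approximately normally distributed.
• Check whether the variances of several groups are homogeneous, to decide whether a given statistical technique can be applied to the data.
Common tools:
1. A histogram shows the shape of the distribution.
2. A box plot shows central tendency and outliers.
3. Levene's test checks homogeneity of variance across groups.
4. A Q-Q probability plot checks whether the data are normal or approximately normal.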
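Correlation analysis
• Pearson correlation coefficient: for continuous variables or variables measured on an interval scale.
• Kendall rank correlation coefficient: a rank correlation for ordered categorical variables.
• Spearman rank correlation coefficient: a rank correlation computed from the ranks of the observations.
• When the data are not normally distributed, the population distribution is unknown, or the raw data are recorded as ranks, the latter two (rank-based) coefficients are preferable.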
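As a rough illustration of how the Pearson and Spearman coefficients differ, the sketch below computes both in plain Python. The helper names are hypothetical, and the five (lotsize, workhrs) pairs are only the first five Toluca observations printed in these slides, so the results are not the full-data correlations.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# First five Toluca observations from the slides (illustration only).
lotsize = [80, 30, 50, 90, 70]
workhrs = [399, 121, 221, 376, 361]

r_pearson = pearson(lotsize, workhrs)
r_spearman = spearman(lotsize, workhrs)
print(round(r_pearson, 4), round(r_spearman, 4))
```

Spearman only uses the ordering of the values, which is why it is the safer choice when normality is doubtful or the data are already ranks.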
Distribution of X
data toluca;
infile 'H:\LEC01TA01.DAT';
input lotsize workhrs;
seq=_n_; * read the data file and add a sequence-number column;
proc print data=toluca;
run;
Obs lotsize workhrs seq
1 80 399 1
2 30 121 2
3 50 221 3
4 90 376 4
5 70 361 5
⁞ ⁞ ⁞ ⁞
Distribution of X: Descriptive (1)
proc univariate data=toluca plot;
var lotsize workhrs;
run;
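Moments
N                 25            Sum Weights        25
Mean              70            Sum Observations   1750
Std Deviation     28.7228132    Variance           825
Skewness          -0.1032081    Kurtosis           -1.0794107
Uncorrected SS    142300        Corrected SS       19800
Coeff Variation   41.0325903    Std Error Mean     5.74456265
Notes: Variance = Lxx/(n-1), where Lxx is the corrected sum of squares. Skewness and kurtosis are based on the standardized third and fourth moments. Uncorrected SS is the sum of x squared. Coeff Variation = 100*Std Deviation/Mean.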
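The entries of the Moments table are tied together by simple identities. A minimal sketch, assuming only N, the mean, and the uncorrected sum of squares are taken from the table (the variable names are illustrative):

```python
import math

# Values copied from the PROC UNIVARIATE Moments table for lotsize.
n = 25
mean = 70.0
uncorrected_ss = 142300.0

# Corrected SS (Lxx) = sum of x^2 minus n times the squared mean.
corrected_ss = uncorrected_ss - n * mean ** 2
variance = corrected_ss / (n - 1)
std_dev = math.sqrt(variance)
std_err_mean = std_dev / math.sqrt(n)
coeff_variation = 100 * std_dev / mean

print(corrected_ss, variance)
print(round(std_dev, 7), round(std_err_mean, 8), round(coeff_variation, 7))
```

The computed values agree with the Corrected SS, Variance, Std Deviation, Std Error Mean, and Coeff Variation entries in the table.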
Distribution of X: Descriptive (2)
Basic Statistical Measures
Location              Variability
Mean     70.00000     Std Deviation                  28.72281
Median   70.00000     Variance                      825.00000
Mode     90.00000     Range (Max-Min)               100.00000
                      Interquartile Range (Q3-Q1)    40.00000
Quantiles (Definition 5)
Quantile Estimate Quantile Estimate
100% Max 120 5% 30
99% 120 1% 20
95% 110 0% Min 20
90% 110
75% Q3 90
50% Median 70
25% Q1 50
10% 30
Distribution of X: Descriptive (3)
Extreme Observations
Lowest Highest
Value Obs Value Obs
20 14 100 9
30 21 100 16
30 17 110 15
30 2 110 20
40 23 120 7
Distribution of X: Descriptive (4)
Stem Leaf # Boxplot
12 0 1 | Max
11 00 2 |
10 00 2 |
9 0000 4 +-----+ Q3
8 000 3 | |
7 000 3 *--+--* Med
6 0 1 | |
5 000 3 +-----+ Q1
4 00 2 |
3 000 3 |
2 0 1 | Min
Distribution of X: Sequence plot
title1 h=3 'Sequence plot for X with smooth curve';
symbol1 v=circle i=sm70;
axis1 label=(h=2);
axis2 label=(h=2 angle=90);
proc gplot data=toluca;
plot lotsize*seq/haxis=axis1 vaxis=axis2;
run;
Distribution of X: QQPlot
title1 'QQPlot (normal probability plot)';
proc univariate data=toluca noprint;
qqplot lotsize workhrs / normal (L=1 mu=est sigma=est);
run;
CS Example:
Y: GPA after 3 semesters
X1: High school math grades (HSM)
X2: High school science grades (HSS)
X3: High school English grades (HSE)
X4: SAT Math (SATM)
X5: SAT Verbal (SATV)
Gender: (1 = male, 2 = female)
n = 224
CS Example: Input
data cs;
infile 'I:\My Documents\2018spring\csdata.dat';
input id gpa hsm hss hse satm satv genderm1;
proc print data=cs; run;
proc reg data=cs;
model gpa=hsm hss hse satm satv;
run;
CS Example: Descriptive Statistics proc means
proc means data=cs maxdec=2;
var gpa hsm hss hse satm satv;
run;
Variable N Mean Std Dev Minimum Maximum
gpa 224 2.64 0.78 0.12 4.00
hsm 224 8.32 1.64 2.00 10.00
hss 224 8.09 1.70 3.00 10.00
hse 224 8.09 1.51 3.00 10.00
satm 224 595.29 86.40 300.00 800.00
satv 224 504.55 92.61 285.00 760.00
CS Example: Correlation
proc corr data=cs;
var hsm hss hse;
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0
hsm hss hse
hsm 1.00000 0.57569 0.44689
<.0001 <.0001
hss 0.57569 1.00000 0.57937
<.0001 <.0001
hse 0.44689 0.57937 1.00000
<.0001 <.0001
CS Example: Correlation (cont)
proc corr data=cs noprob;
var satm satv;
run;
proc corr data=cs noprob;
var hsm hss hse;
with satm satv;
run;
Pearson Correlation Coefficients, N = 224
satm satv
satm 1.00000 0.46394
satv 0.46394 1.00000
Pearson Correlation Coefficients, N = 224
hsm hss hse
satm 0.45351 0.24048 0.10828
satv 0.22112 0.26170 0.24371
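Classification of sample data
• Time series data: statistical data arranged in time order
• Cross-sectional data: statistical data observed at the same point in time
• With time series data, pay attention to comparability and other statistical issues. Time series data easily produce serial correlation in the model's random error term, so the data may need to be transformed to remove it.
• With cross-sectional samples, heteroscedasticity arises easily. A regression model often involves many explanatory variables; if one or more factors affect the explained variable differently as the observed values of the explanatory variables change, heteroscedasticity results. When modeling such data, the data preparation should aim to remove the heteroscedasticity.
• Organizing data involves more than transforming, differencing, or standardizing it. Sometimes "outliers" must be removed, or missing values filled in by interpolation.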
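III. Determine the mathematical form of the theoretical regression model
To determine the mathematical form of the regression model, first draw a scatter plot of the sample data, $y_i$ against $x_i$ ($i = 1, 2, \ldots, n$). The rough shape of the scatter plot suggests an initial model form.
If the n sample points lie roughly around a straight line, a linear regression model is a natural candidate. If they lie roughly around an exponential curve, an exponential model can be chosen instead. When the available information does not settle the form, several forms can be fitted by computer and the one that performs best taken as the theoretical model.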
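If $n$ sets of observations $(x_{i1}, x_{i2}, \ldots, x_{ip}; y_i)$, $i = 1, 2, \ldots, n$, are obtained in a practical problem, the regression model is:
$$
\begin{cases}
y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} + \varepsilon_1 \\
y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} + \varepsilon_2 \\
\quad \vdots \\
y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} + \varepsilon_n
\end{cases}
$$
In matrix form:
$$Y = X\beta + \varepsilon$$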
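Multiple linear regression
1. Zero-mean assumption: $E(\varepsilon_i) = 0$, $i = 1, 2, \ldots, n$, that is, $E[\varepsilon] = 0$.
2. Homoscedasticity and no autocorrelation:
$$\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = \begin{cases} \sigma^2, & i = j \\ 0, & i \neq j \end{cases} \qquad i, j = 1, 2, \ldots, n$$
That is, the covariance matrix of $\varepsilon$ is
$$D(\varepsilon) = \operatorname{Cov}(\varepsilon, \varepsilon) = \sigma^2 I_n$$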
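3. Normality assumption: $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$, with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ mutually independent, that is, $\varepsilon \sim N(0, \sigma^2 I_n)$.
4. Full-rank assumption on the regression design matrix $X$:
$$\operatorname{rank}(X) = \operatorname{rank}(X'X) = p + 1 < n$$
From these assumptions it follows that
$$E(Y) = X\beta, \qquad D(Y) = \operatorname{Var}(Y) = \sigma^2 I_n, \qquad Y \sim N(X\beta, \sigma^2 I_n)$$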
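$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
Holding $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p$ fixed and taking the partial derivative of $E(y)$ with respect to $x_i$ gives
$$\frac{\partial E(y)}{\partial x_i} = \beta_i, \qquad i = 1, 2, \ldots, p$$
$\beta_i$ is the average change in $y$ for a one-unit change in $x_i$ when the other variables are held fixed.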
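IV. Estimate the model parameters
Estimating the unknown parameters is a central part of regression analysis. The most commonly used methods are ordinary least squares and Bayesian methods. For some special regression models, or for problems that violate the basic assumptions, other methods such as maximum likelihood, weighted least squares, or ridge regression can be used.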
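Regression Coefficients
Let the random error vector be $\varepsilon = Y - X\beta$ and define the sum of squared deviations
$$Q(\beta) = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$
The goal is to find $\hat\beta$ such that $Q(\hat\beta) = \min_{\beta} Q(\beta)$.
When $X'X$ is invertible, solving the normal equations $X'X\hat\beta = X'Y$ gives the least squares estimate of $\beta$:
$$\hat\beta = (X'X)^{-1}X'Y$$
where $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)'$.
$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_p x_p$ is called the empirical regression equation.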
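The least squares estimate from the normal equations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the linear system is solved by Gaussian elimination rather than an explicit inverse, and the data are just the first five Toluca rows shown in these slides, so the estimates are not those of the full data set.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: the design matrix is not of full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def ols(predictor_rows, y):
    """Return beta_hat solving the normal equations X'X beta = X'Y."""
    design = [[1.0] + list(row) for row in predictor_rows]  # add intercept column
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y_i for row, y_i in zip(design, y)) for a in range(k)]
    return solve_linear_system(xtx, xty)

# First five Toluca observations (lotsize, workhrs); illustration only.
lotsize = [[80], [30], [50], [90], [70]]
workhrs = [399, 121, 221, 376, 361]
beta_hat = ols(lotsize, workhrs)
print([round(b, 4) for b in beta_hat])  # [intercept, slope]
```

Solving the system directly avoids forming $(X'X)^{-1}$, and the singularity check corresponds to the full-rank assumption on the design matrix.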
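V. Test and revise the model
• Once the model has been built, we need to know whether it really captures the relationship between the explained variable and the explanatory variables. Only testing the model can establish this.
• Model tests fall into two kinds: statistical tests and tests of practical meaning.
Statistical tests
• Tests of the basic assumptions
– Test for multicollinearity among the explanatory variables
– Test for heteroscedasticity of the random error term
– Test for serial correlation of the random error term
• Significance tests
– Goodness-of-fit test
– Significance test of the regression coefficients
– Significance test of the regression equation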
Questions addressed by residuals
• Is the relationship linear?
• Does the variance depend on X?
• Are there outliers?
• Are error terms not independent?
• Are the errors normal?
• Can other predictors be helpful?
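• Testing practical meaning mainly means checking whether the signs of the explanatory variables' coefficients agree with what the problem implies. If some x should be positively related to y but its coefficient β is negative, the result cannot be explained in the real problem, and the model loses its meaning.
If a model fails a statistical test, or passes the tests but has no reasonable practical interpretation, it must be revised. Revision can consider:
1. Whether the variables are set up reasonably, and whether important variables were omitted;
2. Whether the explanatory variables are strongly correlated with one another;
3. Whether the sample size is too small;
4. Whether the form of the theoretical model is reasonable.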
Body Fat Example
For 20 healthy female subjects aged 25 to 30
Y = amount of body fat (fat)
X1 = triceps skinfold thickness (skinfold)
X2 = thigh circumference (thigh)
X3 = midarm circumference (midarm)
Body Fat Example: Regression (input)
data bodyfat;
infile 'I:\My Documents\2018SPRING\CH07TA01.DAT';
input skinfold thigh midarm fat;
proc print data=bodyfat;
run;
proc reg data=bodyfat;
model fat=skinfold thigh midarm;
run;
Body Fat Example: Correlation
proc corr data=bodyfat noprob;run;
Pearson Correlation Coefficients, N = 20
skinfold thigh midarm fat
skinfold 1.00000 0.92384 0.45778 0.84327
thigh 0.92384 1.00000 0.08467 0.87809
midarm 0.45778 0.08467 1.00000 0.14244
fat 0.84327 0.87809 0.14244 1.00000
Body Fat Example: Scatter plot
Body Fat Example: Regression (output)
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3        396.98461     132.32820     21.52   <.0001
Error             16         98.40489       6.15031
Corrected Total   19        495.38950
Root MSE          2.47998    R-Square   0.8014
Dependent Mean   20.19500    Adj R-Sq   0.7641
Coeff Var        12.28017
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            117.08469         99.78240      1.17     0.2578
skinfold     1              4.33409          3.01551      1.44     0.1699
thigh        1             -2.85685          2.58202     -1.11     0.2849
midarm       1             -2.18606          1.59550     -1.37     0.1896
The set of variables is jointly helpful (overall F test, p < .0001).
But no individual coefficient is significant given the other two (every t test has p > 0.16).
Body Fat Example: Diagnostics (output)
Body Fat Example: Diagnostics (output)
Body Fat Example: Single Xi’s (input)
proc reg data=bodyfat;
model fat = skinfold;
model fat = thigh;
model fat = midarm;
run;
Body Fat Example: Single Xi’s (output)
Model: fat = skinfold
Root MSE 2.81977    R-Square 0.7111    Adj R-Sq 0.6950
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             -1.49610          3.31923     -0.45     0.6576
skinfold     1              0.85719          0.12878      6.66     <.0001
Model: fat = thigh
Root MSE 2.51024    R-Square 0.7710    Adj R-Sq 0.7583
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -23.63449          5.65741     -4.18     0.0006
thigh        1              0.85655          0.11002      7.79     <.0001
Model: fat = midarm
Root MSE 5.19261    R-Square 0.0203    Adj R-Sq -0.0341
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             14.68678          9.09593      1.61     0.1238
midarm       1              0.19943          0.32663      0.61     0.5491
Body Fat Example: Single Xi’s
proc reg data=bodyfat;
model fat = thigh;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept 1 -23.63449 5.65741 -4.18 0.0006
thigh 1 0.85655 0.11002 7.79 <.0001
Root MSE 2.51024
R-Square 0.7710
Adj R-Sq 0.7583
Body Fat Example: General Linear Test
proc reg data=bodyfat;
model fat=skinfold thigh midarm;
skinmid: test skinfold, midarm;
* test H0: beta1 = beta3 = 0;
run;
Test skinmid Results for Dependent Variable fat
Source        DF   Mean Square   F Value   Pr > F
Numerator 2 7.50940 1.22 0.3210
Denominator 16 6.15031
Body Fat Example: Single Xi’s (input)
proc reg data=bodyfat;
model fat = skinfold;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept 1 -1.49610 3.31923 -0.45 0.6576
skinfold 1 0.85719 0.12878 6.66 <.0001
Root MSE 2.81977
R-Square 0.7111
Adj R-Sq 0.6950
Body Fat Example: General Linear Test
Does this variable alone do the job?
Perform general linear test
proc reg data=bodyfat;
model fat=skinfold thigh midarm;
thighmid: test thigh, midarm;
* test H0: beta2 = beta3 = 0;
run;
Test thighmid Results for Dependent Variable fat
Source        DF   Mean Square   F Value   Pr > F
Numerator 2 22.35741 3.64 0.0500
Denominator 16 6.15031
With F = 3.64 and p = 0.0500 the test is borderline: thigh and midarm appear to carry some additional information beyond skinfold. Perhaps the addition of one more variable would be helpful.
Body Fat Example: General Linear Test (input)
proc reg data=bodyfat;
model fat=skinfold thigh midarm;
thigh: test thigh;
* test H0: beta2 = 0;
run;
Test thigh Results for Dependent Variable fat
Source        DF   Mean Square   F Value   Pr > F
Numerator 1 7.52928 1.22 0.2849
Denominator 16 6.15031
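The F statistics reported by the TEST statements can be reproduced from the error sums of squares of the full and reduced models. A minimal sketch, assuming the reduced-model SSE is recovered as the full-model SSE plus the numerator sum of squares (numerator DF times numerator mean square); the function name is hypothetical:

```python
def general_linear_test_f(sse_reduced, df_reduced, sse_full, df_full):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    numerator_ms = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator_ms = sse_full / df_full
    return numerator_ms / denominator_ms

# Full model fat = skinfold thigh midarm (from the ANOVA table).
sse_full, df_full = 98.40489, 16

# H0: beta2 = 0 (drop thigh): numerator DF 1, mean square 7.52928.
f_thigh = general_linear_test_f(sse_full + 1 * 7.52928, df_full + 1, sse_full, df_full)

# H0: beta2 = beta3 = 0 (drop thigh and midarm): numerator DF 2, mean square 22.35741.
f_thighmid = general_linear_test_f(sse_full + 2 * 22.35741, df_full + 2, sse_full, df_full)

print(round(f_thigh, 2), round(f_thighmid, 2))
```

For a single coefficient the test is equivalent to the t test in the Parameter Estimates table: the F value 1.22 is the square of the thigh t value (about -1.11) up to rounding, and both give p = 0.2849.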
Body Fat Example: Partial Correlation
proc reg data=bodyfat;
model fat=skinfold thigh midarm / pcorr1 pcorr2;
run;
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Squared Partial Corr Type I   Squared Partial Corr Type II
Intercept    1            117.08469         99.78240      1.17     0.2578   .                             .
skinfold     1              4.33409          3.01551      1.44     0.1699   0.71110                       0.11435
thigh        1             -2.85685          2.58202     -1.11     0.2849   0.23176                       0.07108
midarm       1             -2.18606          1.59550     -1.37     0.1896   0.10501                       0.10501
Body Fat Example: Partial Correlation
Type I (sequential) squared partial correlations
Variable        Ratio of sums of squares      Squared Partial Corr Type I
Skinfold (X1)   SSR(X1)/SSTO                  0.71110
Thigh (X2)      SSR(X2|X1)/SSE(X1)            0.23176
Midarm (X3)     SSR(X3|X1,X2)/SSE(X1,X2)      0.10501
Type II (each variable given all the others) squared partial correlations
Variable        Ratio of sums of squares      Squared Partial Corr Type II
Skinfold (X1)   SSR(X1|X2,X3)/SSE(X2,X3)      0.11435
Thigh (X2)      SSR(X2|X1,X3)/SSE(X1,X3)      0.07108
Midarm (X3)     SSR(X3|X1,X2)/SSE(X1,X2)      0.10501
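The Type II squared partial correlations can be cross-checked against the t statistics of the full model, because for a single coefficient the extra-sum-of-squares ratio equals t²/(t² + error DF). A small sketch, assuming the estimates and standard errors are copied from the Parameter Estimates table (the function name is illustrative):

```python
def squared_partial_corr(estimate, std_error, df_error):
    """Type II squared partial correlation from a coefficient's t statistic."""
    t = estimate / std_error
    return t * t / (t * t + df_error)

df_error = 16  # error DF of the model fat = skinfold thigh midarm

# (estimate, standard error) pairs from the Parameter Estimates table.
coefficients = {
    "skinfold": (4.33409, 3.01551),
    "thigh": (-2.85685, 2.58202),
    "midarm": (-2.18606, 1.59550),
}

type2 = {name: squared_partial_corr(b, se, df_error)
         for name, (b, se) in coefficients.items()}
for name, value in type2.items():
    print(name, round(value, 5))
```

The results agree with the Squared Partial Corr Type II column, which shows that each variable explains only a small share of what the other two leave unexplained.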
Body Fat Example: Effects of Correlation
Variables in model   b1       b2       s{b1}    s{b2}
X1                   0.8572   -        0.1288   -
X2                   -        0.8565   -        0.1100
X1, X2               0.2224   0.6594   0.3034   0.2912
X1, X2, X3           4.334    -2.857   3.016    2.582
A regression coefficient does not reflect an inherent effect of the particular predictor variable on the response variable but only a marginal or partial effect.
Example of Unequal Error Variances
Remedial Measures - Weighted Least Squares
Blood Pressure Example
Researching the relationship between age and diastolic blood pressure in healthy women aged 20 to 60.
Y = diastolic blood pressure (diast)
X = age
n = 54
Blood Pressure: input
data pressure;
infile 'H:\My Documents\2018SPRING\CH11TA01.DAT';
input age diast;
proc print data=pressure; run;
title1 h=3 'Blood Pressure';
title2 h=2 'Scatter plot';
symbol1 v=circle i=sm70 c=purple;
axis1 label=(h=2);
axis2 label=(h=2 angle=90);
proc sort data=pressure;
by age;
proc gplot data=pressure;
plot diast*age;
run;
Blood Pressure: Scatterplot
Blood Pressure: regression (unweighted)
proc reg data=pressure;
model diast=age / clb;
output out=diag r=resid;
run;
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       2374.96833    2374.96833     35.79   <.0001
Error             52       3450.36501      66.35317
Corrected Total   53       5825.33333
Root MSE          8.14575    R-Square   0.4077
Dependent Mean   79.11111    Adj R-Sq   0.3963
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             56.15693          3.99367     14.06     <.0001   48.14304   64.17082
age          1              0.58003          0.09695      5.98     <.0001    0.38548    0.77458
Blood Pressure: Residual Plots
data diag; * try to find a pattern with age;
set diag;
absr=abs(resid);
sqrr=resid*resid;
title2 h=2 'residual, abs(resid), squared residual plots vs. age';
proc gplot data=diag;
plot (resid absr sqrr)*age/haxis=axis1
vaxis=axis2;
run;
Blood Pressure: Residual Plots
(cont)
Blood Pressure: computing weights if using abs-resid
* Model the absolute residual as a linear function of age;
* Fit the model and predict the standard deviation for each observation;
proc reg data=diag;
model absr=age;
output out=findweights p=shat;
run;
* shat is the predicted value from regressing the absolute residual on age;
* Compute weights;
data findweights;
set findweights;
wt=1/(shat*shat);
run;
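The weighting procedure (fit ordinary least squares, regress the absolute residuals on age, use the reciprocal of the squared predicted value as the weight, then refit) can be sketched outside SAS as follows. The function names are hypothetical and the eight (age, diast) pairs are made-up values chosen only so that the spread grows with age; they are not the blood pressure data.

```python
def fit_line(x, y, w=None):
    """Weighted least squares fit of y = b0 + b1*x; w=None means equal weights."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def abs_residual_weights(x, y):
    """Weights 1/s_hat^2, with s_hat predicted from |residual| regressed on x."""
    b0, b1 = fit_line(x, y)
    abs_resid = [abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]
    c0, c1 = fit_line(x, abs_resid)
    s_hat = [c0 + c1 * xi for xi in x]
    if min(s_hat) <= 0:
        raise ValueError("a predicted standard deviation is not positive")
    return [1.0 / (s * s) for s in s_hat]

# Hypothetical data for illustration only.
age = [20, 25, 30, 35, 40, 45, 50, 55]
diast = [65, 70, 68, 78, 72, 85, 74, 92]

weights = abs_residual_weights(age, diast)
b0_ols, b1_ols = fit_line(age, diast)
b0_wls, b1_wls = fit_line(age, diast, weights)
print(round(b1_ols, 4), round(b1_wls, 4))
```

Observations with a larger predicted standard deviation receive smaller weights, so the noisier high-age cases pull less on the fitted line.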
Blood Pressure: weighted regression (abs-resid)
proc reg data=findweights;
model diast=age / clb p;
weight wt;
output out = weighted r = resid p = predict;
run;
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         83.34082      83.34082     56.64   <.0001
Error             52         76.51351       1.47141
Corrected Total   53        159.85432
Root MSE          1.21302    R-Square   0.5214
Dependent Mean   73.55134    Adj R-Sq   0.5122
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             55.56577          2.52092     22.04     <.0001   50.50718   60.62436
age          1              0.59634          0.07924      7.53     <.0001    0.43734    0.75534
Blood pressure: Comparison
• Weighted Regression
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             55.56577          2.52092     22.04     <.0001   50.50718   60.62436
age          1              0.59634          0.07924      7.53     <.0001    0.43734    0.75534
• Normal (unweighted) Regression
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             56.15693          3.99367     14.06     <.0001   48.14304   64.17082
age          1              0.58003          0.09695      5.98     <.0001    0.38548    0.77458
Blood Pressure: new residuals
data graphtest;
set weighted;
resid1 = sqrt(wt)*resid;
title2 h=2 'Weighted data - residual plot';
symbol1 v=circle i=none color=red;
proc gplot data=graphtest;
plot resid1*predict/vref=0 haxis=axis1 vaxis=axis2;
run;
Blood Pressure: new residuals
Example of Outlying Observations
Life Insurance Example
Y = the amount of life insurance for the 18 managers (in $1000)
X1 = average annual income (in $1000)
X2 = risk aversion score (0 – 10)
Life Insurance: Input, diagnostics
title1 h=3 'Insurance';
data insurance;
infile 'I:\My Documents\2018SPRING\CH10TA01.DAT';
input income risk amount;
run;
proc print data=insurance; run;
*diagnostics: r requests residuals, p requests predicted values;
title2 h=2 'residual plots';
symbol1 v=circle c=black;
proc reg data=insurance;
model amount = income risk/r p;
plot r.*(p. income risk);
run;
Life Insurance: Initial Regression
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2           173919         86960    542.33   <.0001
Error             15       2405.14763     160.34318
Corrected Total   17           176324
Root MSE          12.66267    R-Square   0.9864
Dependent Mean   134.44444    Adj R-Sq   0.9845
Coeff Var          9.41851
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           -205.71866         11.39268    -18.06     <.0001
income       1              6.28803          0.20415     30.80     <.0001
risk         1              4.73760          1.37808      3.44     0.0037
Life Insurance: output, diagnostics
Life Insurance: Studentized Deleted Residuals
proc reg data=quad;
model amount=income risk incomesq/r influence;
output out = diag1 r=resid rstudent=rstudent;
run;
proc print data=diag1; run;
Note: the studentized deleted residual RSTUDENT differs slightly from STUDENT because the error variance is estimated without the ith observation, by $MSE_{(i)}$ rather than by MSE:
$$\text{rstudent}_i = \frac{e_i}{\hat\sigma_{(i)}(e_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_{ii})}} \sim t(n - 1 - p - 1)$$
Studentized Deleted Residuals (cont)
Obs   Dependent Variable   Predicted Value   RStudent   Hat Diag H
1 91.0000 97.8164 -5.3155 0.0962
2 162.0000 160.1201 0.8848 0.1711
3 11.0000 11.5901 -0.3333 0.4524
4 240.0000 240.6278 -0.2822 0.1373
5 73.0000 71.5019 0.6618 0.0826
6 311.0000 309.6777 0.7153 0.3848
7 316.0000 315.6359 0.3063 0.7535
8 154.0000 153.3645 0.2931 0.1802
9 164.0000 162.4847 0.6866 0.1258
10 54.0000 52.4068 0.7127 0.1006
11 53.0000 52.8060 0.0866 0.1297
12 326.0000 327.6975 -0.9308 0.3856
13 55.0000 54.4957 0.2210 0.0951
14 130.0000 131.0179 -0.5120 0.3018
15 112.0000 109.6080 1.1138 0.1249
16 91.0000 93.0992 -0.9653 0.1222
17 14.0000 13.8135 0.0909 0.2705
18 63.0000 62.2363 0.3338 0.0856
Identifying Influential Cases
Various Criteria
• DFFITS
• Cook’s Distance
• DFBETAS Measures
Influential observations are those that, according to various criteria, appear to have a large influence on the parameter estimates.
Influence on Single Fitted Value - DFFITS
$$DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\sqrt{MSE_{(i)} h_{ii}}} = e_i \left[\frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}\right]^{1/2} \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}$$
• $|DFFITS| > 1$: the case is influential, for small to medium data sets;
• $|DFFITS| > 2\sqrt{p/n}$: the case is influential, for large data sets.
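DFFITS can also be obtained directly from the studentized deleted residual and the leverage, since DFFITS equals RSTUDENT times the square root of h/(1 - h). The sketch below applies this identity to three observations copied from the SAS output in these slides; the function name and the selection of observations are illustrative.

```python
import math

def dffits(rstudent, leverage):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii))."""
    return rstudent * math.sqrt(leverage / (1.0 - leverage))

# (RStudent, Hat Diag H) for observations 1, 7 and 12 of the life insurance output.
cases = {1: (-5.3155, 0.0962), 7: (0.3063, 0.7535), 12: (-0.9308, 0.3856)}

values = {obs: dffits(t, h) for obs, (t, h) in cases.items()}

# n = 18 is a small data set, so the cutoff |DFFITS| > 1 applies.
influential = [obs for obs, value in values.items() if abs(value) > 1]

for obs, value in values.items():
    print(obs, round(value, 4))
print("influential:", influential)
```

Observation 7 has by far the largest leverage of the three, yet its DFFITS stays below the cutoff because its deleted residual is small; observation 1 is flagged because of its very large deleted residual.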
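Influence on All Fitted Values - Cook’s Distance
$$D_i = \frac{\sum_{j=1}^{n} (\hat Y_j - \hat Y_{j(i)})^2}{(p + 1)\hat\sigma^2} = \frac{e_i^2}{(p + 1)\hat\sigma^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$
Cook’s distance reflects the combined effect of the leverage $h_{ii}$ and the size of the residual $e_i$. The usual rule of thumb is:
when $D_i < 0.5$, the point is not regarded as an outlier;
when $D_i > 1$, the point is regarded as an outlier.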
Influence on Regression Coefficients - DFBETAS
$$DFBETAS_{k(i)} = \frac{\hat\beta_k - \hat\beta_{k(i)}}{\sqrt{\hat\sigma_{(i)}^2 c_{kk}}}$$
where $c_{kk}$ is the kth diagonal element of $(X'X)^{-1}$.
• $|DFBETAS| > 1$: the ith case has a large impact on the kth regression coefficient, for small to medium data sets;
• $|DFBETAS| > 2/\sqrt{n}$: the ith case has a large impact on the kth regression coefficient, for large data sets.
Hat Matrix Diagnosis, DFFITS
Obs   Residual   RStudent   Hat Diag H   Cov Ratio   DFFITS
1 -6.8164 -5.3155 0.0962 0.0147 -1.7339
2 1.8799 0.8848 0.1711 1.2842 0.4020
3 -0.5901 -0.3333 0.4524 2.3742 -0.3029
4 -0.6278 -0.2822 0.1373 1.5215 -0.1126
5 1.4981 0.6618 0.0826 1.2842 0.1986
6 1.3223 0.7153 0.3848 1.8735 0.5656
7 0.3641 0.3063 0.7535 5.3027 0.5356
8 0.6355 0.2931 0.1802 1.5981 0.1374
9 1.5153 0.6866 0.1258 1.3342 0.2604
10 1.5932 0.7127 0.1006 1.2830 0.2384
11 0.1940 0.0866 0.1297 1.5420 0.0334
12 -1.6975 -0.9308 0.3856 1.6912 -0.7373
13 0.5043 0.2210 0.0951 1.4643 0.0717
14 -1.0179 -0.5120 0.3018 1.7786 -0.3366
15 2.3920 1.1138 0.1249 1.0675 0.4209
16 -2.0992 -0.9653 0.1222 1.1616 -0.3601
17 0.1865 0.0909 0.2705 1.8390 0.0553
18 0.7637 0.3338 0.0856 1.4216 0.1022
Cook’s Distance, DFBetas, Cov Ratio
Obs   Cook's D   Cov Ratio   DFFITS   DFBETAS: Intercept   income   risk   incomesq
1 0.255 0.0147 -1.7339 -0.4126 0.0662 -0.3686 0.9168
2 0.041 1.2842 0.4020 0.0110 0.2513 -0.2064 -0.2579
3 0.025 2.3742 -0.3029 -0.1839 0.2513 -0.0525 -0.2312
4 0.003 1.5215 -0.1126 0.0642 -0.0692 -0.0299 0.0230
5 0.010 1.2842 0.1986 0.1216 -0.0566 -0.0108 -0.0580
6 0.083 1.8735 0.5656 -0.3627 0.1183 0.3901 0.1704
7 0.077 5.3027 0.5356 -0.0249 0.2235 -0.3381 0.2233
8 0.005 1.5981 0.1374 -0.0372 0.0245 0.0788 -0.0712
9 0.018 1.3342 0.2604 -0.0462 0.1333 0.0084 -0.1799
10 0.015 1.2830 0.2384 0.1978 -0.0988 -0.0773 -0.0084
11 0.000 1.5420 0.0334 0.0195 -0.0244 0.0126 0.0091
12 0.137 1.6912 -0.7373 0.4425 -0.1728 -0.3821 -0.3486
13 0.001 1.4643 0.0717 0.0535 -0.0427 0.0030 0.0063
14 0.030 1.7786 -0.3366 -0.0807 -0.1746 0.2583 0.1861
15 0.044 1.0675 0.4209 0.0160 -0.0195 0.2003 -0.2036
16 0.033 1.1616 -0.3601 -0.1515 -0.0774 0.1654 0.2177
17 0.001 1.8390 0.0553 0.0462 -0.0383 -0.0150 0.0317
18 0.003 1.4216 0.1022 0.0714 -0.0471 -0.0003 -0.0097
Example of Autocorrelation
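Applied Linear Statistical Models (5th edition), 12.1-12.4
This example studies per capita consumption in China using annual time series data for 1980 to 1998. The dependent variable y is per capita consumption and the independent variable x is per capita national income.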
Table: per capita national income x (yuan) and per capita consumption y (yuan)
Year   x (yuan)   y (yuan)   |   Year   x (yuan)   y (yuan)
1980   460        234.75     |   1990   1634       797.08
1981   489        259.26     |   1991   1879       890.66
1982   525        280.58     |   1992   2287       1063.39
1983   580        305.97     |   1993   2939       1323.22
1984   692        347.15     |   1994   3923       1736.32
1985   853        433.53     |   1995   4854       2224.59
1986   956        481.36     |   1996   5576       2627.06
1987   1104       545.4      |   1997   6053       2819.36
1988   1355       687.51     |   1998   6392       2958.18
1989   1512       756.27     |
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .999   .999       .999                31.74833                     .873
Predictors: (Constant), per capita national income (yuan)
Dependent Variable: per capita consumption (yuan)
ANOVA
Model 1      Sum of Squares   df   Mean Square   F           Sig.
Regression   15326406          1   15326406.15   15306.535   .000
Residual     17022.070        17   1001.298
Total        15343428         18
Predictors: (Constant), per capita national income (yuan)
Dependent Variable: per capita consumption (yuan)
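$\hat\sigma = 31.74833$ and $R^2 = 0.999$; the F test and the t test both show a significant linear relationship between x and y. However, the computed DW value is 0.873. At $\alpha = 0.05$, this DW value indicates positive autocorrelation in the residual series.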
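[Figure: Durbin-Watson decision regions on the DW axis from 0 to 4, with $d_L = 1.18$ and $d_U = 1.40$: positive autocorrelation (0 to $d_L$), inconclusive ($d_L$ to $d_U$), no autocorrelation ($d_U$ to $4 - d_U$), inconclusive ($4 - d_U$ to $4 - d_L$), negative autocorrelation ($4 - d_L$ to 4). The observed DW = 0.873 falls in the positive autocorrelation region.]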
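The DW value can be recomputed from the residuals. A minimal sketch, assuming the residuals e_t listed in the first-differences table of these slides; the function names are hypothetical and the bounds passed to the classifier are the tabulated d_L and d_U quoted in the slides.

```python
def durbin_watson(residuals):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def classify_dw(dw, d_lower, d_upper):
    """Map a DW value to the usual decision regions."""
    if dw < d_lower:
        return "positive autocorrelation"
    if dw <= d_upper:
        return "inconclusive"
    if dw < 4 - d_upper:
        return "no autocorrelation"
    if dw <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"

# Residuals e_t for 1980-1998, copied from the table in the slides.
residuals = [-12.11, -0.8, 4.13, 4.48, -5.33, 7.75, 8.69, 5.35, 33.19, 30.47,
             15.74, -2.22, -15.24, -52.24, -87.12, -22.7, 51.07, 26.21, 10.7]

dw = durbin_watson(residuals)
print(round(dw, 3), classify_dw(dw, 1.18, 1.40))
```

A DW value near 2 indicates no first-order autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.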
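First Differences Procedure
First compute the differences
$$\Delta y_t = y_t - y_{t-1}, \qquad \Delta x_t = x_t - x_{t-1}$$
The results are shown in the table below.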
Year   No.   x_t    y_t       e_t      Δx_t   Δy_t
1980   1     460    234.75    -12.11
1981   2     489    259.26    -0.8     29     24.51
1982   3     525    280.58    4.13     36     21.32
1983   4     580    305.97    4.48     55     25.39
1984   5     692    347.15    -5.33    112    41.18
1985   6     853    433.53    7.75     161    86.38
1986   7     956    481.36    8.69     103    47.83
1987   8     1104   545.4     5.35     148    64.04
1988   9     1355   687.51    33.19    251    142.11
1989   10    1512   756.27    30.47    157    68.76
1990   11    1634   797.08    15.74    122    40.81
1991   12    1879   890.66    -2.22    245    93.58
1992   13    2287   1063.39   -15.24   408    172.73
1993   14    2939   1323.22   -52.24   652    259.83
1994   15    3923   1736.32   -87.12   984    413.1
1995   16    4854   2224.59   -22.7    931    488.27
1996   17    5576   2627.06   51.07    722    402.47
1997   18    6053   2819.36   26.21    477    192.3
1998   19    6392   2958.18   10.7     339    138.82
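The differencing step and the regression through the origin can be sketched as follows. The data are the x_t and y_t columns from the slides; the function names are hypothetical. The slope is computed with the standard no-intercept least squares formula and is shown without claiming a specific value, since the coefficient table is not reproduced in these slides.

```python
def first_differences(series):
    """Return [z_t - z_{t-1}] for t = 2, ..., n."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def slope_through_origin(x, y):
    """Least squares slope of y = b*x with no intercept: sum(x*y) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Per capita national income x_t and consumption y_t, 1980-1998.
x = [460, 489, 525, 580, 692, 853, 956, 1104, 1355, 1512,
     1634, 1879, 2287, 2939, 3923, 4854, 5576, 6053, 6392]
y = [234.75, 259.26, 280.58, 305.97, 347.15, 433.53, 481.36, 545.4, 687.51, 756.27,
     797.08, 890.66, 1063.39, 1323.22, 1736.32, 2224.59, 2627.06, 2819.36, 2958.18]

dx = first_differences(x)
dy = first_differences(y)
slope = slope_through_origin(dx, dy)
print(len(dx), dx[:3], [round(v, 2) for v in dy[:3]], round(slope, 4))
```

Differencing costs one observation (n drops from 19 to 18), which is why the Durbin-Watson bounds used after differencing are looked up for n = 18.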
Model Summary (linear regression through the origin)
Model   R      R Square (a)   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .991   .981           .980                29.34036                     1.569
a. For regression through the origin (the no-intercept model), R Square measures the proportion of the variability in the dependent variable about the origin explained by regression. This CANNOT be compared to R Square for models which include an intercept.
Predictors: detaxt ($\Delta x_t$)
Dependent Variable: detayt ($\Delta y_t$)
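Regress $\Delta y_t$ on $\Delta x_t$ through the origin; the results are given in the Model Summary for the through-origin fit.
The computed DW is 1.569. From the table (n = 18, k = 2, $\alpha = 0.05$), $d_L = 1.16$ and $d_U = 1.39$, so DW falls in the no-autocorrelation region, and the F test and the t test both show that the regression is significant.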
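[Figure: Durbin-Watson decision regions on the DW axis from 0 to 4, with $d_L = 1.16$ and $d_U = 1.39$: positive autocorrelation, inconclusive, no autocorrelation, inconclusive, negative autocorrelation. The observed DW = 1.569 falls in the no-autocorrelation region.]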
Example of Model Selection
Surgical Example
A surgical unit wants to predict survival in patients undergoing a specific liver operation.
n = 54
Y = post-operation survival time
Explanatory variables:
X1: blood clotting score (blood)
X2: prognostic index (prog)
X3: enzyme function test score (enz)
X4: liver function test score (liver)
X5: age, in years (age)
X6: indicator variable for gender (0 = male, 1 = female; gender)
X7 and X8: indicator variables for history of alcohol use, 3 levels (moderate alcohol use, severe alcohol use)
Surgical Example: input
data surgical;
infile 'I:\My Documents\2018SPRING\CH09TA01.txt'
delimiter='09'x;
* the log survival time is named lsurv to match the later proc corr and proc reg steps;
input blood prog enz liver age gender alcmod alcheavy surv lsurv;
run;
proc print data=surgical;
run;
title1 h=3 'Original model';
title2 h=2 'Matrix Scatterplot';
proc sgscatter data=surgical;
matrix surv blood prog enz liver;
run;
Surgical Example: Scatterplot
Surgical Example: Correlation
proc corr data=surgical noprob;
var lsurv blood prog enz liver;
run;
Pearson Correlation Coefficients, N = 54
lsurv blood prog enz liver
lsurv 1.00000 0.24633 0.47015 0.65365 0.64920
blood 0.24633 1.00000 0.09012 -0.14963 0.50242
prog 0.47015 0.09012 1.00000 -0.02361 0.36903
enz 0.65365 -0.14963 -0.02361 1.00000 0.41642
liver 0.64920 0.50242 0.36903 0.41642 1.00000
Variable Selection
Variable Selection Criteria
– $R^2$, adjusted $R^2$, MSE
– $C_p$
– PRESS
– AIC, SBC (aka BIC)
Automatic Search Procedures
– Forward Selection
– Backward Elimination
– Stepwise Search
– All Subset Selection
Surgical Example: Model Selection –data for the current model
proc reg data=surgical outtest=mparam;
model lsurv=blood prog enz liver/
rsquare adjrsq cp press aic sbc;
run;
proc print data=mparam; run;
Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ _PRESS_
1 MODEL1 PARMS lsurv 0.25088 4.06875
Obs _IN_ _P_ _EDF_ _RSQ_ _ADJRSQ_ _CP_ _AIC_ _SBC_
1 4 5 49 0.75914 0.73948 5 -144.587 -134.642
Obs Intercept blood prog enz liver lsurv
1 3.85193 0.083739 0.012671 0.015627 0.032056 -1
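Several of the criteria in this output can be recomputed from _RSQ_, _RMSE_, n and the number of parameters p. The sketch below assumes the usual formulas, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p), AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), with SSE recovered as RMSE² times the error degrees of freedom; the function name is illustrative.

```python
import math

def selection_criteria(r_square, rmse, n, p):
    """Return (adjusted R^2, AIC, SBC) for a model with p parameters."""
    df_error = n - p
    sse = rmse ** 2 * df_error
    adj_r_square = 1 - (1 - r_square) * (n - 1) / df_error
    aic = n * math.log(sse / n) + 2 * p
    sbc = n * math.log(sse / n) + p * math.log(n)
    return adj_r_square, aic, sbc

# Values from the outtest data set: _RSQ_, _RMSE_, n = 54, _P_ = 5.
adj_r_square, aic, sbc = selection_criteria(0.75914, 0.25088, 54, 5)
print(round(adj_r_square, 5), round(aic, 3), round(sbc, 3))
```

The results agree with _ADJRSQ_, _AIC_ and _SBC_ up to the rounding of the inputs. SBC penalizes each parameter by ln(n), which exceeds the AIC penalty of 2 once n is larger than about 7, so SBC tends to favor smaller models.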
Surgical Example: Model Selection -automatic
proc reg data=surgical;
model lsurv=blood prog enz liver / selection=stepwise;
run;
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
Summary of Stepwise Selection
Step   Variable Entered   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)   F Value   Pr > F
1 enz 1 0.4273 0.4273 66.5181 38.79 <.0001
2 prog 2 0.2359 0.6632 20.5228 35.72 <.0001
3 blood 3 0.0941 0.7572 3.3879 19.37 <.0001
Surgical Example: Model Selection –backward elimination
Bounds on condition number: 1.0308, 9.1864
All variables left in the model are significant at the 0.1000 level.
Variable   Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept 3.76644 0.22676 17.15229 275.89 <.0001
blood 0.09547 0.02169 1.20436 19.37 <.0001
prog 0.01334 0.00203 2.67403 43.01 <.0001
enz 0.01644 0.00163 6.32862 101.80 <.0001
Summary of Backward Elimination
Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)   F Value   Pr > F
1 liver 3 0.0019 0.7572 3.3879 0.39 0.5363
VI. Apply the regression model
1. Factor analysis
2. Control
3. Prediction