© Thierry Warin
STATA STEP-BY-STEP
Thierry Warin
Table of Contents
SETTING UP STATA
SETTING UP A PANEL
HOW TO GENERATE VARIABLES
  GENERATING VARIABLES
  GENERATING DATES
HOW TO GENERATE DUMMIES
  GENERATING GENERAL DUMMIES
  GENERATING TIME DUMMIES
TIME-SERIES ANALYSES
  1. ASSUMPTIONS OF THE OLS ESTIMATOR
  2. CHECK THE INTERNAL AND EXTERNAL VALIDITY
    A. THREATS TO INTERNAL VALIDITY
    B. THREATS TO EXTERNAL VALIDITY
  3. THE LINEAR REGRESSION MODEL
  4. LINEAR REGRESSION WITH MULTIPLE REGRESSORS
    ASSUMPTIONS OF THE OLS ESTIMATOR
  5. NONLINEAR REGRESSION FUNCTIONS
    A. EXAMPLES OF NONLINEAR REGRESSIONS
      1) POLYNOMIAL REGRESSION MODEL OF A SINGLE INDEPENDENT VARIABLE
      2) LOGARITHMS
    B. INTERACTIONS BETWEEN TWO BINARY VARIABLES
    C. INTERACTIONS BETWEEN A CONTINUOUS AND A BINARY VARIABLE
    D. INTERACTIONS BETWEEN TWO CONTINUOUS VARIABLES
RUNNING TIME-SERIES ANALYSES IN STATA
  A TIME SERIES REGRESSION
  REGRESSION DIAGNOSTICS: NON NORMALITY
  REGRESSION DIAGNOSTICS: NON-LINEARITY
  REGRESSION DIAGNOSTICS: HETEROSCEDASTICITY
  REGRESSION DIAGNOSTICS: OUTLIERS
  REGRESSION DIAGNOSTICS: MULTICOLLINEARITY
  REGRESSION DIAGNOSTICS: NON-INDEPENDENCE
TIME-SERIES CROSS-SECTION ANALYSES (TSCS) OR PANEL DATA MODELS
  A. THE FIXED EFFECTS REGRESSION MODEL
  B. REGRESSION WITH TIME FIXED EFFECTS
RUNNING POOLED OLS REGRESSIONS IN STATA
  THE FIXED AND RANDOM EFFECTS MODELS
  CHOICE OF ESTIMATOR
  TESTING PANEL MODELS
  ROBUST STANDARD ERRORS
RUNNING PANEL REGRESSIONS IN STATA
  IT IS ABSOLUTELY FUNDAMENTAL THAT THE ERROR TERM IS NOT CORRELATED WITH THE INDEPENDENT VARIABLES.
  CHOOSING BETWEEN FIXED EFFECTS AND RANDOM EFFECTS? THE HAUSMAN TEST
  IF YOU QUALIFY FOR A FIXED EFFECTS MODEL, SHOULD YOU INCLUDE TIME EFFECTS?
  FIXED EFFECTS OR RANDOM EFFECTS WHEN TIME DUMMIES ARE INVOLVED: A TEST
DYNAMIC PANELS AND GMM ESTIMATIONS
  HOW DOES IT WORK?
  TESTS
  IN NEED FOR A CAUSALITY TEST?
MAXIMUM LIKELIHOOD ESTIMATION
  1. PROBIT AND LOGIT REGRESSIONS
    PROBIT REGRESSION
    LOGIT REGRESSION
    LINEAR PROBABILITY MODEL
EXAMPLES
  HEALTH CARE
APPENDIX 1
Setting up Stata
We are going to allocate 10 megabytes to the dataset. You do not want to allocate too much memory to the dataset, because the more memory you allocate to the dataset, the less memory is available to perform the commands. You could slow Stata down or even crash it.
set mem 10m
Stata 12 and later manage memory automatically, so this command is only needed in older versions.
We can also decide whether or not to have the "more" separation line on the screen when the software displays results:
set more on
set more off
Setting up a panel
Now we have to tell Stata that we have a panel dataset. We do it with the command tsset, or with iis and tis:
iis idcode
tis year
or
tsset idcode year
In the previous command, idcode is the variable that identifies individuals in our dataset and year is the variable that identifies time periods. The panel variable always comes first, followed by the time variable. Current versions of Stata also provide xtset, which takes the same two arguments and replaces iis and tis:
xtset idcode year
The commands referring to panel data in Stata almost always start with the prefix xt. You can check for these commands by calling the help file for xt.
help xt
You should describe and summarize the dataset as usual before you perform estimations. Stata has specific commands for describing and summarizing panel datasets.
xtdes
xtsum
xtdes lets you observe the pattern of the data, such as the number of individuals with each pattern of observations across time periods. In our case, we have an unbalanced panel because not all individuals have observations for all years.
The xtsum command gives you general descriptive statistics of the variables in the dataset, considering the overall, the between, and the within variation. Overall refers to the whole dataset.
Between refers to the variation of the individual means (each individual's mean across time periods). Within refers to the variation of each observation's deviation from the respective individual's mean.
You may be interested in applying the panel data tabulate command to a variable, for instance to the variable south, in order to obtain a one-way table.
xttab south
As in the previous commands, Stata will report the tabulation for the overall, the between, and the within variation.
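The between and within components that xtsum reports can be rebuilt by hand, which makes the definitions above concrete. The following do-file sketch assumes the tutorial's data resemble Stata's nlswork example dataset (an assumption, the source does not name its file); the variable names lw_between and lw_within are introduced here for illustration only.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* Between component: each individual's mean of ln_wage
bysort idcode: egen double lw_between = mean(ln_wage)

* Within component as xtsum defines it: x_it - xbar_i + overall mean
quietly summarize ln_wage
generate double lw_within = ln_wage - lw_between + r(mean)

* Compare the hand-built within series with the "within" row of xtsum
xtsum ln_wage
summarize lw_within
```

The standard deviation of lw_within should be close to the within standard deviation that xtsum prints.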
How to generate variables
Generating variables
gen age2=age^2
gen ttl_exp2=ttl_exp^2
gen tenure2=tenure^2
Now, let's compute the average wage for each individual (across time periods).
bysort idcode: egen meanw=mean(ln_wage)
Here we did not run the sort command first and then the by prefix. We could have done so, but bysort combines both steps in a single command.
The command egen is an extension of the gen command for generating new variables. As a general rule, use egen when the new variable is created with one of Stata's built-in egen functions.
In our case, we used the function mean.
You can apply the command list to list the first 10 observations of the new variable meanw.
list meanw in 1/10
And then apply the xtsum command to summarize the new variable.
xtsum meanw
You may want to obtain the average of the logarithm of wages for each year in the panel.
bysort year: egen meanw1=mean(ln_wage)
And then you can apply the xttab command.
xttab meanw1
Generating dates
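Let's generate dates. The date() function converts a string variable into a Stata date. Stata commands are case sensitive (gen, not Gen), and current versions expect the mask in uppercase:
gen varname2 = date(varname1, "DMY")
And format:
format varname2 %td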
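To see what the conversion does, here is a small self-contained do-file sketch. The variable names datestr and d and the two sample dates are made up for illustration. Stata stores dates as the number of days since 01jan1960, so the first date below is stored as 15000.

```stata
clear
input str11 datestr
"25jan2001"
"03feb1999"
end

* Convert day-month-year strings to Stata daily dates
generate d = date(datestr, "DMY")

* Before formatting, d shows as a plain number of days since 01jan1960
list

* %td makes the same numbers display as readable dates
format d %td
list
```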
How to generate dummies
Generating general dummies
Let's generate the dummy variable black, which is not in our dataset.
gen black=1 if race==2
replace black=0 if black==.
Note that the second command also sets black to 0 for observations where race itself is missing. If race has missing values, gen black=(race==2) if race<. keeps them missing.
Suppose you want to generate a new variable called tenure1 that is equal to the variable tenure lagged one period. Then you would use the time-series lag operator (l).
First, you would need to sort the dataset according to idcode and year, and then generate the new variable with the "by" prefix on the variable idcode. The data must already be declared as a panel with tsset, as above.
sort idcode year
by idcode: gen tenure1=l.tenure
If you were interested in generating a new variable tenure3 equal to the first difference of the variable tenure, you would use the time-series d operator.
by idcode: gen tenure3=d.tenure
If you would like to generate a new variable tenure4 equal to the second lag of the variable tenure, you would type:
by idcode: gen tenure4=l2.tenure
The same principle would apply to the operator d.
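Once the data are declared as a panel, the time-series operators already work within each individual, so the sort and by steps are optional. A minimal do-file sketch, again assuming the nlswork example dataset stands in for the tutorial's data; the new variable names are illustrative.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* Operators respect the panel structure declared by xtset
generate tenure_lag1  = L.tenure
generate tenure_diff1 = D.tenure
generate tenure_lag2  = L2.tenure

list idcode year tenure tenure_lag1 tenure_diff1 in 1/10
```

Because the operators work on the time variable and not on row position, a lag is missing whenever the previous year is not observed for that individual.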
Let's just save our data file with the changes that we made to it.
save, replace
Another way would be to use the xi command. It takes the items (strings of letters, for instance) of a designated variable (category, for instance) and creates a dummy variable for each item. By default xi omits the first category; to omit the most prevalent one instead, set the omit characteristic first:
char _dta[omit] prevalent
xi i.category
tabulate category
Generating time dummies
Let's now generate our time dummies. We use the "tabulate" command with the option "gen" in order to generate a time dummy for each year of our dataset.
We will name the time dummies "y":
• we will get a first time dummy called "y1", which takes the value 1 if year=1980 and 0 otherwise,
• a second time dummy "y2", which takes the value 1 if year=1982 and 0 otherwise,
• and similarly for the remaining years.
You could give any other name to your time dummies.
tab year, g(y)
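In Stata 11 and later, factor-variable notation gives the same time dummies without creating new variables. The sketch below assumes the nlswork example dataset (where years are coded 68 to 88) rather than the tutorial's own file, and the regression itself is only an illustration.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear

* i.year expands into one indicator per year, omitting the first year as base
regress ln_wage tenure i.year

* Joint test that all year effects are zero
testparm i.year

* ib#. selects another base year, here the year coded 88
regress ln_wage tenure ib88.year
```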
Time-series analyses
The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X:
$$\sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \qquad (1)$$
With $b_0$ and $b_1$ being estimators of $\beta_0$ and $\beta_1$.
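Minimizing the sum of squares in (1) has a closed-form solution. Setting the two partial derivatives to zero gives the usual formulas, where $\bar X$ and $\bar Y$ denote the sample means (notation introduced here, not in the original text):

```latex
\begin{aligned}
\frac{\partial}{\partial b_0}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2
  &= -2\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i) = 0,\\
\frac{\partial}{\partial b_1}\sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2
  &= -2\sum_{i=1}^{n}X_i\,(Y_i - b_0 - b_1 X_i) = 0,\\
b_1 &= \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^{n}(X_i-\bar X)^2},
\qquad b_0 = \bar Y - b_1\bar X .
\end{aligned}
```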
1. Assumptions of the OLS estimator
1. The conditional distribution of $u_i$ given $X_i$ has a mean of zero. This means that the other factors captured in the error term are unrelated to $X_i$. The correlation between $X_i$ and $u_i$ should be nil: $\mathrm{corr}(X_i, u_i) = 0$. This is the most important assumption in practice. If this assumption does not hold, it is likely because of omitted variable bias. One should test for omitted variables using Ramsey's (1969) RESET test.
2. Related to the first assumption: if the variance of this conditional distribution of $u_i$ does not depend on $X_i$, then the errors are said to be homoskedastic. The error term $u_i$ is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$ is constant for $i = 1, \ldots, n$ and in particular does not depend on $X_i$. Otherwise, the error term is heteroskedastic.
a. Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent, and asymptotically normal.
b. If the errors are heteroskedastic, one should use heteroskedasticity-robust standard errors. To test for heteroskedasticity, we use the (Breusch and Pagan, 1979) test.
3. $(X_i, Y_i),\ i = 1, \ldots, n$ are independently and identically distributed. This is to be sure that there is no selection bias in the sample. This assumption holds in many cross-sectional data sets, but it is inappropriate for time series data.
4. $X_i$ and $u_i$ have four moments. The fourth assumption is that the fourth moments of $X_i$ and $u_i$ are nonzero and finite: $0 < E(X_i^4) < \infty$ and $0 < E(u_i^4) < \infty$.
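The two tests named in the assumptions above, and the robust standard errors, are available as postestimation commands in current Stata. The do-file sketch below uses the auto dataset shipped with Stata and an arbitrary regression, both chosen only for illustration.

```stata
* Illustrative data and model, not from the tutorial
sysuse auto, clear
regress price weight length

* Ramsey RESET test, H0: the model has no omitted variables
estat ovtest

* Breusch-Pagan / Cook-Weisberg test, H0: constant error variance
estat hettest

* Same coefficients, heteroskedasticity-robust standard errors
regress price weight length, vce(robust)
```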
2. Check the internal and external validity
A statistical analysis is internally valid if the statistical inferences about causal
effects are valid for the population being studied. The analysis is externally
valid if its inferences and conclusions can be generalized from the population
and setting studied to other populations and settings.
Internal and external validity distinguish between the population and setting studied and the population and setting to which the results are generalized.
A. Threats to internal validity
Internal validity has two components:
1. The estimator of the causal effect should be unbiased and consistent.
Causal effects are estimated using the estimated regression function.
2. Hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level. Hypothesis tests are performed using the estimated regression coefficients and their standard errors.
Studies based on regression analysis are internally valid if the estimated regression coefficients are unbiased and consistent, and if their standard errors yield confidence intervals with the desired confidence level. Threats to the internal validity of a multiple regression are sevenfold: omitted variables, misspecification of the functional form of the regression function, imprecise measurement of the independent variables, sample selection, simultaneous causality, heteroskedasticity, and correlation of the error term across observations (sample not i.i.d.). The first five bias the OLS estimator because the regressor is correlated with the error term, violating the first least squares assumption. The last two leave the coefficients unbiased, as noted above, but make the usual standard errors incorrect.
B. Threats to external validity
External validity must be judged using specific knowledge of the populations
and settings studied and those of interest. Important differences between the
two will cast doubt on the external validity of the study. Sometimes, there are
two or more studies on different but related populations. If so, the external
validity of both studies can be checked by comparing their results.
3. The linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + u_i \qquad (2)$$
This need not be a time series analysis; an example is a cross-section of test scores and class sizes in 1998 in 420 California school districts.
4. Linear regression with multiple regressors
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \qquad (3)$$
Assumptions of the OLS estimator
1. The conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \ldots, X_{ki}$ has a mean of zero. This means that the other factors captured in the error term are unrelated to $X_{1i}, X_{2i}, \ldots, X_{ki}$. The correlation between $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ should be nil. This is the most important assumption in practice. If this assumption does not hold, it is likely because of omitted variable bias. One should test for omitted variables using Ramsey's (1969) RESET test.
2. Related to the first assumption: if the variance of this conditional distribution of $u_i$ does not depend on $X_{1i}, X_{2i}, \ldots, X_{ki}$, then the errors are said to be homoskedastic. The error term $u_i$ is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \ldots, X_{ki}$ is constant for $i = 1, \ldots, n$ and in particular does not depend on $X_{1i}, X_{2i}, \ldots, X_{ki}$. Otherwise, the error term is heteroskedastic.
a. Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent, and asymptotically normal.
b. If the errors are heteroskedastic, one should use heteroskedasticity-robust standard errors. To test for heteroskedasticity, we use the (Breusch and Pagan, 1979) test.
3. $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i),\ i = 1, \ldots, n$ are independently and identically distributed. This is to be sure that there is no selection bias in the sample. This assumption holds in many cross-sectional data sets, but it is inappropriate for time series data.
4. $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ have four moments. The fourth assumption is that the fourth moments of $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ are nonzero and finite.
5. No perfect multicollinearity. In case of perfect multicollinearity, it is impossible to compute the OLS estimator. The regressors are said to be perfectly multicollinear if one of the regressors is a perfect linear function of the other regressors.
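Perfect multicollinearity is easy to provoke on purpose, which shows how Stata reacts to it. In this do-file sketch the auto dataset and the variable weight2 are illustrative choices, not part of the tutorial.

```stata
* Illustrative data, not from the tutorial
sysuse auto, clear

* weight2 is an exact linear function of weight
generate double weight2 = 2 * weight

* Stata cannot estimate both coefficients: it omits one of the two
* collinear regressors and prints a note saying so
regress price weight weight2 length
```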
5. Nonlinear regression functions
$$Y_i = f(X_{1i}, X_{2i}, \ldots, X_{ki}) + u_i, \quad i = 1, \ldots, n \qquad (4)$$
A. Examples of nonlinear regressions
1) Polynomial regression model of a single independent variable
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_r X_i^r + u_i \qquad (5)$$
2) Logarithms
1. Lin-log model
$$Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i \qquad (6)$$
A 1% change in X is associated with a change in Y of $0.01\beta_1$.
2. Log-lin model
$$\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i \qquad (7)$$
A change in X by 1 unit is associated with a $100\beta_1$% change in Y.
3. Log-log model. Logarithms convert changes in variables into percentage
changes. In the economic analysis of consumer demand, it is often
assumed that a 1% increase in price leads to a certain percentage decrease
in the quantity demanded. This percentage change in demand is called the
price elasticity. The regressor coefficients will then measure the elasticity
in a log-log model.
$$\ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i \qquad (8)$$
A 1% change in X is associated with a $\beta_1$% change in Y, so $\beta_1$ is the elasticity of Y with respect to X.
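The three logarithmic specifications differ only in which side of the equation is logged. A minimal do-file sketch, using the auto dataset shipped with Stata as a stand-in (price and weight are illustrative variables, not the tutorial's):

```stata
* Illustrative data, not from the tutorial
sysuse auto, clear
generate ln_price  = ln(price)
generate ln_weight = ln(weight)

* Lin-log (6): a 1% change in weight goes with a 0.01*b1 change in price
regress price ln_weight

* Log-lin (7): one more unit of weight goes with a 100*b1 % change in price
regress ln_price weight

* Log-log (8): b1 is the elasticity of price with respect to weight
regress ln_price ln_weight
```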
B. Interactions between two binary variables
$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i}) + u_i \qquad (9)$$
C. Interactions between a continuous and a binary variable
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i \qquad (10)$$
D. Interactions between two continuous variables
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i \qquad (11)$$
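In Stata 11 and later, the ## operator builds the main effects and the product term of models (9) to (11) in one step. The do-file sketch below uses the auto dataset and a made-up binary variable heavy purely for illustration.

```stata
* Illustrative data, not from the tutorial
sysuse auto, clear

* Hypothetical second binary regressor
generate byte heavy = weight > 3000 if !missing(weight)

* (9) two binary variables: both dummies plus their product
regress price i.foreign##i.heavy

* (10) a continuous and a binary variable
regress price c.weight##i.foreign

* (11) two continuous variables
regress price c.weight##c.length
```

The c. prefix marks a variable as continuous and i. marks it as categorical, so Stata knows how to form each product.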
Running time-series analyses in Stata
A time series regression
use http://www.ats.ucla.edu/stat/stata/modules/reg/ok, clear
First, let's look at a scatterplot matrix of all variables. There are a few observations that could be outliers, but there is nothing seriously wrong in this scatterplot.
. graph matrix y x1 x2 x3
Let's use the regress command to run a regression predicting y from x1 x2 and
x3.
. regress y x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   21.69
   Model |  5936.21931     3  1978.73977               Prob > F      =  0.0000
Residual |  8758.78069    96  91.2372989               R-squared     =  0.4040
---------+------------------------------               Adj R-squared =  0.3853
   Total |    14695.00    99  148.434343               Root MSE      =  9.5518

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |    .298443   .1187692      2.513   0.014       .0626879    .5341981
      x2 |   .4527284   .1230534      3.679   0.000       .2084695    .6969874
      x3 |   .3466306   .0838481      4.134   0.000       .1801934    .5130679
   _cons |   31.50512   .9587921     32.859   0.000       29.60194    33.40831
------------------------------------------------------------------------------
We can use the rvfplot command to display the residuals by the fitted (predicted) values. This can be useful in checking for outliers, non-normality, non-linearity, and heteroscedasticity. The distribution of points seems fairly even and random, so no serious problems seem evident from this plot.
. rvfplot
We can use the vif command to check for multicollinearity. As a rule of thumb,
VIF values in excess of 20, or 1/VIF values (tolerances) lower than 0.05 may
merit further investigation. These values all seem to be fine.
. vif
Variable |      VIF       1/VIF
---------+----------------------
      x1 |     1.23    0.812781
      x2 |     1.16    0.864654
      x3 |     1.12    0.892234
---------+----------------------
Mean VIF |     1.17
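The VIF of a regressor is 1/(1 - R²), where R² comes from regressing that regressor on all the others, so the table can be checked by hand. A do-file sketch on the auto dataset (illustrative data and variables; estat vif is the current name of the vif command):

```stata
* Illustrative data, not from the tutorial
sysuse auto, clear
regress price weight length displacement
estat vif

* VIF for weight by hand: regress it on the other regressors
quietly regress weight length displacement
display "VIF(weight) = " 1/(1 - e(r2))
```

The displayed value should match the weight row of the estat vif table up to rounding.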
We can use the predict command (with the rstudent option) to make studentized residuals, and then use the summarize command to check the distribution of the residuals. The residuals do not seem to be seriously skewed, and their kurtosis (3.02) is close to the value of 3 expected under normality. The largest studentized residual (in absolute value) is -2.88, which is somewhat large but not extremely large.
. predict rstud, rstudent
. summarize rstud, detail
                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%      -2.5294      -2.885066
 5%     -1.52447      -2.173733
10%    -1.217658      -2.043349       Obs                 100
25%     -.694324      -1.763776       Sum of Wgt.         100

50%    -.0789393                      Mean           .0016893
                        Largest       Std. Dev.      1.015969
75%     .6562946       2.013153
90%     1.409712       2.019451       Variance       1.032194
95%      1.96464       2.096213       Skewness       .0950425
99%     2.253903       2.411592       Kurtosis       3.020196
We can use the kdensity command (with the normal option) to show the
distribution of the residuals (in yellow), and a normal overlay (in red). The results look pretty close to normal.
. kdensity rstud, normal
Below we show a boxplot of the residuals. The largest residual (-2.88) is plotted as a separate point, calling that to our attention.
. graph box rstud
We can use the pnorm command to make a normal probability plot. A perfect
normal distribution would be an exact diagonal line (as shown in red). The actual
data is plotted in yellow and is fairly close to the diagonal. While not perfectly
normal, this is not seriously non-normal.
. pnorm rstud
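The graphical checks above can be complemented with formal tests of normality. This do-file sketch applies two of Stata's built-in tests to studentized residuals; the auto dataset and the regression are illustrative stand-ins for the tutorial's data.

```stata
* Illustrative data and model, not from the tutorial
sysuse auto, clear
quietly regress price weight length
predict rstud, rstudent

* Skewness and kurtosis test, H0: the residuals are normal
sktest rstud

* Shapiro-Wilk test, H0: the residuals are normal
swilk rstud
```

A small p-value in either test points to non-normal residuals, which is what the plots are meant to reveal visually.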
Regression Diagnostics: Non normality
use http://www.ats.ucla.edu/stat/stata/modules/reg/nonnorm, clear
Let's start by using the summarize command to look at the distribution of x1 x2 x3 and y, looking for evidence of non-normality. The skewness for y (1.46) suggests that y might be skewed. The skewness values for x1 x2 and x3 all look fine.
. summarize y x1 x2 x3, detail
                              y
-------------------------------------------------------------
      Percentiles      Smallest
 1%         44.5             25
 5%          173             64
10%        306.5            121       Obs                 100
25%          576            121       Sum of Wgt.         100

50%          961                      Mean            1151.84
                        Largest       Std. Dev.      845.8479
75%         1600           3136
90%       2162.5           3364       Variance       715458.7
95%       2864.5           4096       Skewness        1.45566
99%         4226           4356       Kurtosis       5.374975

                             x1
-------------------------------------------------------------
      Percentiles      Smallest
 1%        -22.5            -28
 5%        -15.5            -17
10%          -11            -16       Obs                 100
25%           -5            -16       Sum of Wgt.         100

50%          1.5                      Mean                .68
                        Largest       Std. Dev.      8.965568
75%            7             18
90%         10.5             19       Variance       80.38141
95%         16.5             19       Skewness      -.2514963
99%         19.5             20       Kurtosis        3.17527

                             x2
-------------------------------------------------------------
      Percentiles      Smallest
 1%          -18            -18
 5%          -13            -18
10%        -11.5            -16       Obs                 100
25%         -4.5            -15       Sum of Wgt.         100

50%            0                      Mean                .12
                        Largest       Std. Dev.      8.389845
75%            5             17
90%           10             17       Variance       70.38949
95%         15.5             19       Skewness       .0378385
99%         19.5             20       Kurtosis       2.711806

                             x3
-------------------------------------------------------------
      Percentiles      Smallest
 1%        -30.5            -38
 5%        -18.5            -23
10%        -15.5            -22       Obs                 100
25%           -7            -20       Sum of Wgt.         100

50%           -1                      Mean               -.18
                        Largest       Std. Dev.      12.12092
75%            8             22
90%           18             22       Variance       146.9168
95%         19.5             22       Skewness       .0358513
99%         26.5             31       Kurtosis       3.085636
We use the kdensity command below to show the distribution of y (in yellow) and a normal overlay (in red). We can see that y is positively skewed, i.e., it has
a long tail to the right.
. kdensity y, normal
We can make a normal probability plot of y using the pnorm command. If y
were normal, the yellow points would be a diagonal line (right atop the red line).
We can see that the observed values of y depart from this line.
. pnorm y
Even though y is skewed, let's run a regression predicting y from x1 x2 and x3
and look at some of the regression diagnostics.
. regress y x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   21.29
   Model |  28298992.7     3  9432997.57               Prob > F      =  0.0000
Residual |  42531420.7    96  443035.633               R-squared     =  0.3995
---------+------------------------------               Adj R-squared =  0.3808
   Total |  70830413.4    99  715458.722               Root MSE      =  665.61

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   19.81458   8.276317      2.394   0.019        3.38622    36.24294
      x2 |   29.56946   8.574851      3.448   0.001       12.54852    46.59041
      x3 |   25.54624   5.842875      4.372   0.000       13.94823    37.14426
   _cons |   1139.416   66.81248     17.054   0.000       1006.794    1272.038
------------------------------------------------------------------------------
The rvfplot command gives us a graph of the residual value by the fitted
(predicted) value. We are looking for a nice even distribution of the residuals
across the levels of the fitted value. We see that the points are more densely
packed together at the bottom part of the graph (for the negative residuals),
indicating that there could be a problem of non-normally distributed residuals.
. rvfplot
Below we use the avplots command to produce added variable plots. These plots show the relationship between each predictor and the dependent variable after adjusting for all other predictors. For example, the plot in the top left shows the relationship between x1 and y after adjusting for x2 and x3. The plot in the top right shows x2 on the bottom axis, and the plot in the bottom left shows x3 on the bottom axis. We would expect the points to be normally distributed around these regression lines. These plots show that the data points seem to be more densely packed below the regression line, another possible indicator that the residuals are not normally distributed.
. avplots
Below we create studentized residuals using the predict command, creating a variable called rstud containing the studentized residuals. Stata knew we wanted
studentized residuals because we used the rstudent option after the comma. We
then use the summarize command to examine the residuals for normality. We see that the residuals are positively skewed, and that the 5 smallest values go as
low as -1.78, while the five highest values go from 2.44 to 3.42, another indicator
of the positive skew.
. predict rstud, rstudent
. summarize rstud, detail
                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.764866      -1.789388
 5%     -1.44973      -1.740343
10%    -1.115274      -1.587549       Obs                 100
25%    -.6660804      -1.515767       Sum of Wgt.         100

50%    -.1976601                      Mean           .0068091
                        Largest       Std. Dev.      1.022305
75%     .5167466       2.446876
90%     1.363443        2.59326       Variance       1.045107
95%     2.106763       2.753908       Skewness       .9225476
99%     3.089911       3.425914       Kurtosis       3.933529
The kdensity command is used below to display the distribution of the residuals (in yellow) as compared to a normal distribution (in red). We can see the skew in the residuals below.
. kdensity rstud, normal
A boxplot, see below, also shows the skew in the residuals.
. graph box rstud
Finally, a normal probability plot can be used to examine the normality of the
residuals.
. pnorm rstud
Let us try using a square root transformation on y creating sqy, and examine the distribution of sqy. The transformation has considerably reduced (but not totally
eliminated) the skewness.
. generate sqy = sqrt(y)
. summarize sqy, detail
                             sqy
-------------------------------------------------------------
      Percentiles      Smallest
 1%          6.5              5
 5%           13              8
10%         17.5             11       Obs                 100
25%           24             11       Sum of Wgt.         100

50%           31                      Mean               31.8
                        Largest       Std. Dev.      11.91722
75%           40             56
90%         46.5             58       Variance       142.0202
95%         53.5             64       Skewness       .4513345
99%           65             66       Kurtosis       3.219272
Looking at the distribution using kdensity we can see that although the
distribution of sqy is not completely normal, it is much improved.
. kdensity sqy, normal
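The square root is only one candidate transformation. Stata's ladder-of-powers commands compare several at once, which helps when the right transformation is not obvious. A do-file sketch on the auto dataset (the variable price is an illustrative stand-in for y):

```stata
* Illustrative data, not from the tutorial
sysuse auto, clear

* Normality test for each power transformation of price
ladder price

* Histogram of each transformation with a normal overlay
gladder price
```

The transformation with the smallest chi-square statistic in the ladder table is the one whose distribution looks closest to normal.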
We run the regression again using the transformed value, sqy below.
. regress sqy x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =   21.98
   Model |   5724.4318     3  1908.14393               Prob > F      =  0.0000
Residual |   8335.5682    96  86.8288354               R-squared     =  0.4071
---------+------------------------------               Adj R-squared =  0.3886
   Total |    14060.00    99  142.020202               Root MSE      =  9.3182

------------------------------------------------------------------------------
     sqy |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .2786301   .1158643      2.405   0.018       .0486412     .508619
      x2 |   .4465366   .1200437      3.720   0.000       .2082518    .6848214
      x3 |   .3487791   .0817973      4.264   0.000       .1864127    .5111456
   _cons |   31.61973   .9353415     33.806   0.000       29.76309    33.47637
------------------------------------------------------------------------------
The distribution of the residuals in the rvfplot below looks much better. The residuals are more evenly distributed.
. rvfplot
The avplots below also look improved (not perfect, but much improved).
. avplots
We create studentized residuals (called rstud2) and look at their distribution using summarize. The skewness is much better (.27) and the 5 smallest and 5
largest values are nearly symmetric.
. predict rstud2, rstudent
. summarize rstud2, detail
                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.180267      -2.243119
 5%    -1.550404      -2.117415
10%    -1.262745      -1.825144       Obs                 100
25%    -.7082379      -1.789484       Sum of Wgt.         100

50%    -.0782854                      Mean           .0026017
                        Largest       Std. Dev.      1.014223
75%     .6425427       2.033796
90%     1.450073       2.085326       Variance       1.028647
95%     2.010917       2.180998       Skewness       .2682509
99%     2.316003       2.451008       Kurtosis       2.718882
The distribution of the residuals below looks nearly normal.
. kdensity rstud2, normal
The boxplot of the residuals looks symmetrical, and there are no outliers in the plot.
. graph box rstud2
In this case, a square root transformation of the dependent variable addressed both the skewness and the outliers in the residuals. Had we tried to address these problems by dealing with the outliers, the skewness of the residuals would have remained. When there are outliers in the residuals, it can be useful to assess whether the residuals are skewed. If so, addressing the skewness may also resolve the outliers at the same time.
Regression diagnostics: Non-linearity
use http://www.ats.ucla.edu/stat/stata/modules/reg/nonlin, clear
. regress y x1 x2 x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  3,    96) =    2.21
   Model |  5649.25003     3  1883.08334               Prob > F      =  0.0915
Residual |    81668.75    96  850.716146               R-squared     =  0.0647
---------+------------------------------               Adj R-squared =  0.0355
   Total |    87318.00    99      882.00               Root MSE      =  29.167

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .1134093   .3626687      0.313   0.755      -.6064824     .833301
      x2 |  -.0089643   .3757505     -0.024   0.981      -.7548232    .7368946
      x3 |   .5932696   .2560351      2.317   0.023       .0850439    1.101495
   _cons |   20.09967   11.61974      1.730   0.087      -2.965335    43.16468
------------------------------------------------------------------------------
The ovtest command with the rhs option tests whether higher order trend effects (e.g. squared, cubed) are present but omitted from the regression model. The null hypothesis, as shown below, is that there are no omitted variables (no significant higher order trends). Because the test is significant, this suggests there are higher order trends in the data that we have overlooked.
. ovtest, rhs
Ramsey RESET test using powers of the independent variables
       Ho:  model has no omitted variables
                  F(9, 87) =     67.86
                  Prob > F =      0.0000
A scatterplot matrix is used below to look for higher order trends. We can see that there is a very clear curvilinear relationship between x2 and y.
. graph matrix y x1 x2 x3
We can likewise use avplots to look for non-linear trend patterns. Consistent with the scatterplot, the avplot for x2 (top right) exhibits a distinct curved pattern.
. avplots
Below we create x2sq and add it to the regression equation to account for the
curvilinear relationship between x2 and y.
. generate x2sq = x2*x2
. regress y x1 x2 x2sq x3
  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  4,    95) =  171.00
   Model |  76669.3763     4  19167.3441               Prob > F      =  0.0000
Residual |  10648.6237    95  112.090776               R-squared     =  0.8780
---------+------------------------------               Adj R-squared =  0.8729
   Total |    87318.00    99      882.00               Root MSE      =  10.587

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .1706277   .1316641      1.296   0.198      -.0907586    .4320141
      x2 |  -17.82489   .7208091    -24.729   0.000      -19.25588   -16.39391
    x2sq |   .2954615    .011738     25.171   0.000       .2721586    .3187645
      x3 |   .2584843   .0938846      2.753   0.007       .0720997    .4448688
   _cons |   267.9799   10.71298     25.015   0.000        246.712    289.2478
------------------------------------------------------------------------------
We use the ovtest again below; however, the results are misleading. Stata gave us a note saying that x2sq was dropped due to collinearity. In testing for higher order trends, Stata created x2^2, which duplicated x2sq, and then x2sq was discarded since it was the same as the term Stata created. The resulting ovtest misleads us into thinking there may be higher order trends, but it has discarded the higher order trend we just included.
. ovtest, rhs
(note: x2sq dropped due to collinearity)
(note: x2sq^2 dropped due to collinearity)
Ramsey RESET test using powers of the independent variables
       Ho:  model has no omitted variables
                 F(11, 85) =     54.57
                  Prob > F =      0.0000
There is another minor problem. We use the vif command below to look for
problems of multicollinearity. A general rule of thumb is that a VIF in excess of 20 (or a 1/VIF or tolerance of less than 0.05) may merit further investigation. We
see that the VIF for x2 and x2sq are over 32. The reason for this is that x2 and
x2sq are very highly correlated.
. vif
Variable |      VIF       1/VIF
---------+----------------------
    x2sq |    32.43    0.030834
      x2 |    32.30    0.030959
      x1 |     1.23    0.812539
      x3 |     1.14    0.874328
---------+----------------------
Mean VIF |    16.78
We can solve both of these problems with one solution. If we "center" x2 (i.e. subtract its mean) before squaring it, the results of the ovtest will no longer be
misleading, and the VIF values for x2 and x2sq will get much better. Below, we
center x2 (called x2cent) and then square that value (creating x2centsq).
. egen x2mean = mean(x2)
. generate x2cent = x2 - x2mean
. generate x2centsq = x2cent^2
We now run the regression using x2cent and x2centsq in the equation.
. regress y x1 x2cent x2centsq x3
Source | SS df MS Number of obs = 100
---------+------------------------------ F( 4, 95) = 171.00
Model | 76669.3763 4 19167.3441 Prob > F = 0.0000 Residual | 10648.6237 95 112.090775 R-squared = 0.8780
---------+------------------------------ Adj R-squared = 0.8729
Total | 87318.00 99 882.00 Root MSE = 10.587
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+--------------------------------------------------------------------
x1 | .1706278 .1316641 1.296 0.198 -.0907586 .4320141
x2cent | -.0262898 .1363948 -0.193 0.848 -.2970677 .244488 x2centsq | .2954615 .011738 25.171 0.000 .2721586 .3187645
x3 | .2584843 .0938846 2.753 0.007 .0720997 .4448688
_cons | -.8589132 1.3437 -0.639 0.524 -3.526496 1.808669
------------------------------------------------------------------------------ As we expected, the VIF values are much better.
. vif Variable | VIF 1/VIF ---------+----------------------
x1 | 1.23 0.812539
x2cent | 1.16 0.864632 x3 | 1.14 0.874328
x2centsq | 1.02 0.978676
---------+---------------------- Mean VIF | 1.14
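One way to see why centering helps is to compare the correlation between a variable and its square before and after centering. The sketch below is only an illustration under stated assumptions: it uses Stata's built-in auto dataset rather than the data above, mpg stands in for x2, and the names mpgsq, mpgcent and mpgcentsq are made up for the example. It also assumes Stata 9 or later, where the vif command is run as estat vif.

```stata
* Illustrative stand-in data shipped with Stata (not the tutorial's dataset)
sysuse auto, clear

* Raw variable and its square: typically very highly correlated
generate mpgsq = mpg^2
correlate mpg mpgsq

* Center at the sample mean, then square the centered variable
summarize mpg, meanonly
generate mpgcent = mpg - r(mean)
generate mpgcentsq = mpgcent^2
correlate mpgcent mpgcentsq

* Compare the VIFs under the two parameterizations
regress price mpg mpgsq weight
estat vif
regress price mpgcent mpgcentsq weight
estat vif
```

Both regressions fit the same curve, so the fitted values and the coefficient on the squared term do not change; only the linear term and the constant are re-expressed, as in the x2cent output above.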
We try the ovtest again, and this time it does not drop any of the terms that we
placed into the regression model. This time, the results indicate that there are no
more significant higher order terms that have been omitted from the model.
. ovtest, rhs
(note: x2cent^2 dropped due to collinearity)
(note: x2cent^3 dropped due to collinearity)
Ramsey RESET test using powers of the independent variables
Ho: model has no omitted variables
F(10, 85) = 0.45
Prob > F = 0.9178
We create avplots below and no longer see any substantial non-linear trends in
the data.
. avplots
Note that if we examine the avplot for x2cent, it shows no curvilinear trend. This
is because the avplot adjusts for all other terms in the model, so after adjusting for the other terms (including x2centsq) there is no longer any curved trend between
x2cent and the adjusted value of y.
. avplot x2cent
Had we simply run the regression and reported the initial results, we would have
ignored the significant curvilinear component between x2 and y.
Regression Diagnostics: Heteroscedasticity
. use http://www.ats.ucla.edu/stat/stata/modules/reg/hetsc, clear
We try running a regression predicting y from x1, x2, and x3.
. regress y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   65.68
   Model |  8933.72373     3  2977.90791               Prob > F      =  0.0000
Residual |  4352.46627    96  45.3381903               R-squared     =  0.6724
---------+------------------------------               Adj R-squared =  0.6622
   Total |    13286.19    99  134.203939               Root MSE      =  6.7334
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .2158539    .083724      2.578   0.011       .0496631    .3820447
      x2 |   .7559357    .086744      8.715   0.000       .5837503    .9281211
      x3 |   .3732164   .0591071      6.314   0.000       .2558898     .490543
   _cons |   33.23969   .6758811     49.180   0.000       31.89807     34.5813
------------------------------------------------------------------------------
We can use the hettest command to test for heteroscedasticity. The test indicates
that the regression results are indeed heteroscedastic, so we need to further
understand this problem and try to address it.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of y
Ho: Constant variance
chi2(1) = 21.30
Prob > chi2 = 0.0000
Looking at the rvfplot below that shows the residual by fitted (predicted) value,
we can clearly see evidence for heteroscedasticity. The variability of the residuals at the left side of the graph is much smaller than the variability of the residuals at
the right side of the graph.
. rvfplot
We will try to stabilize the variance by using a square root transformation, and then run the regression again.
. generate sqy = y^.5
. regress sqy x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.37
   Model |  66.0040132     3  22.0013377               Prob > F      =  0.0000
Residual |  30.4489829    96  .317176905               R-squared     =  0.6843
---------+------------------------------               Adj R-squared =  0.6744
   Total |  96.4529961    99  .974272688               Root MSE      =  .56318
------------------------------------------------------------------------------
     sqy |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0170293   .0070027      2.432   0.017        .003129    .0309297
      x2 |   .0652379   .0072553      8.992   0.000       .0508362    .0796397
      x3 |   .0328274   .0049438      6.640   0.000       .0230141    .0426407
   _cons |   5.682593   .0565313    100.521   0.000       5.570379    5.794807
------------------------------------------------------------------------------
Using the hettest again, the chi-square value is somewhat reduced, but the test for
heteroscedasticity is still quite significant. The square root transformation was
not successful.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of sqy
Ho: Constant variance
chi2(1) = 13.06
Prob > chi2 = 0.0003
The rvfplot below indeed shows that the results are still heteroscedastic.
. rvfplot
We next try a natural log transformation, and run the regression.
. generate lny = ln(y)
. regress lny x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.85
   Model |  8.17710164     3  2.72570055               Prob > F      =  0.0000
Residual |  3.74606877    96   .03902155               R-squared     =  0.6858
---------+------------------------------               Adj R-squared =  0.6760
   Total |  11.9231704    99  .120436065               Root MSE      =  .19754
------------------------------------------------------------------------------
     lny |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0054677   .0024562      2.226   0.028       .0005921    .0103432
      x2 |   .0230303   .0025448      9.050   0.000       .0179788    .0280817
      x3 |   .0118223    .001734      6.818   0.000       .0083803    .0152643
   _cons |   3.445503   .0198285    173.765   0.000       3.406144    3.484862
------------------------------------------------------------------------------
We again try the hettest and the results are much improved, but the test is still significant.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of lny
Ho: Constant variance
chi2(1) = 5.60
Prob > chi2 = 0.0179
Below we see that the rvfplot does not look perfect, but it is much improved.
. rvfplot
Perhaps you might want to try a log (to the base 10) transformation. We show
that below.
. generate log10y = log10(y)
. regress log10y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   69.85
   Model |  1.54229722     3  .514099074               Prob > F      =  0.0000
Residual |  .706552237    96  .007359919               R-squared     =  0.6858
---------+------------------------------               Adj R-squared =  0.6760
   Total |  2.24884946    99  .022715651               Root MSE      =  .08579
------------------------------------------------------------------------------
  log10y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0023746   .0010667      2.226   0.028       .0002571     .004492
      x2 |   .0100019   .0011052      9.050   0.000       .0078081    .0121957
      x3 |   .0051344   .0007531      6.818   0.000       .0036395    .0066292
   _cons |   1.496363   .0086114    173.765   0.000       1.479269    1.513456
------------------------------------------------------------------------------
The results for the hettest are the same as before. Whether we chose a log to the
base e or a log to the base 10, the effect in reducing heteroscedasticity (as
measured by hettest) was the same.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of log10y
Ho: Constant variance
chi2(1) = 5.60
Prob > chi2 = 0.0179
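The identical t statistics, R-squared and hettest results for the two log transformations are not a coincidence. A base-10 log is just a rescaled natural log, so every coefficient and standard error is multiplied by the same constant. The short derivation below checks this against the x1 coefficients reported above; the superscript labels are notation introduced here for illustration.

```latex
\log_{10} y = \frac{\ln y}{\ln 10} \approx 0.4343 \,\ln y
\quad\Longrightarrow\quad
\hat\beta_j^{(\log_{10})} = 0.4343 \,\hat\beta_j^{(\ln)},
\qquad
\mathrm{SE}\big(\hat\beta_j^{(\log_{10})}\big) = 0.4343 \,\mathrm{SE}\big(\hat\beta_j^{(\ln)}\big)

% Check with x1: 0.4343 \times 0.0054677 \approx 0.0023746
% The ratio coefficient / standard error is unchanged, so t = 2.226 in both tables.
```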
While these results are not perfect, we will be content for now that this has
substantially reduced the heteroscedasticity as compared to the original data.
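If no transformation fully removes the heteroscedasticity, a common alternative is to keep y in its original units and report heteroscedasticity-robust standard errors. The sketch below is a minimal illustration, not the author's procedure. It assumes Stata 9 or later (where hettest is run as estat hettest) and uses the built-in auto dataset as a stand-in for the data above.

```stata
* Illustrative stand-in data shipped with Stata (not the tutorial's dataset)
sysuse auto, clear

* Ordinary regression, then the Breusch-Pagan / Cook-Weisberg test
regress price mpg weight
estat hettest

* Residual-versus-fitted plot with a reference line at zero
rvfplot, yline(0)

* Same model with heteroscedasticity-robust (Huber-White) standard errors
regress price mpg weight, vce(robust)
```

The coefficients from the two regress commands are identical; only the standard errors, t statistics and confidence intervals change.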
Regression Diagnostics: Outliers
. use http://www.ats.ucla.edu/stat/stata/modules/reg/outlier.dta, clear
Below we run an ordinary least squares (OLS) regression predicting y from x1, x2,
and x3. The results suggest that x2 and x3 are significant, but x1 is not
significant.
. regress y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   14.12
   Model |  6358.64512     3  2119.54837               Prob > F      =  0.0000
Residual |  14406.3149    96   150.06578               R-squared     =  0.3062
---------+------------------------------               Adj R-squared =  0.2845
   Total |    20764.96    99  209.747071               Root MSE      =   12.25
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .1986327   .1523206      1.304   0.195      -.1037212    .5009867
      x2 |    .576853   .1578149      3.655   0.000       .2635928    .8901132
      x3 |   .3533915   .1075346      3.286   0.001       .1399371    .5668459
   _cons |   32.33932   1.229643     26.300   0.000        29.8985    34.78014
------------------------------------------------------------------------------
Let's start an examination for outliers by looking at a scatterplot matrix showing
scatterplots among y, x1, x2, and x3. Although we cannot see a great deal of
detail in these plots, we can see that there is a single point that stands out from
the rest. This looks like a potential outlier.
. graph matrix y x1 x2 x3
We repeat the scatterplot matrix below, using the symbol([case]) option, which
tells Stata to use the value of the variable case as the plotting symbol. The
variable case is the case id of the observation, ranging from 1 to 100. It is
difficult to see below, but the case for the outlier is 100. If you run this in
Stata yourself, the case numbers will be much easier to see.
. graph matrix y x1 x2 x3, symbol([case])
We can use the lvr2plot command to plot leverage against the normalized
residual squared. The most problematic outliers would be in the top right of
the plot, indicating both high leverage and a large residual. This plot shows us
that case 100 has a very large residual (compared to the others) but does not have
exceptionally high leverage.
. lvr2plot, symbol([case])
The rvfplot command shows us residuals by fitted (predicted) values, and also indicates that case 100 has the largest residual.
. rvfplot, symbol([case])
The avplots command gives added variable plots (sometimes called partial regression plots). The plot in the top left shows x1 on the horizontal axis, and
the residual value of y after using x2 and x3 as predictors. Likewise, the top right
plot shows x2 on the horizontal axis, and the residual value of y after using x1 and x3 as predictors, and the bottom plot shows x3 on the horizontal axis, and the
residual value of y after using x1 and x2 as predictors.
Returning to the top left plot, this shows us the relationship between x1 and y, after adjusting y for x2 and x3. The line plotted has the slope of the coefficient
for x1 and is the least squares regression line for the data in the scatterplot. In
short, these plots allow you to view each of the scatterplots much like you would look at a scatterplot from a simple regression analysis with one predictor. In
looking at these plots, we see that case 100 appears to be an outlier in each plot.
Beyond noting that case 100 is an outlier, we can see the type of influence that it
has on each of the regression lines. For x1, the outlier seems to be tugging the
line up at the left, giving the line a smaller slope. By contrast, for x2 the outlier
seems to be tugging the line up at the right, giving the line a greater slope.
Finally, for x3 the outlier is right in the center and seems to have no influence on
the slope (but would pull the entire line up, influencing the intercept).
. avplots, symbol([case])
Below we repeat the avplot just for the variable x2, showing that you can obtain
an avplot for a single variable at a time. Also, we can better see the influence of observation 100 tugging the regression line up at the right, possibly increasing the
overall slope for x2.
. avplot x2, symbol([case])
We use the predict command below to create a variable containing the
studentized residuals called rstu. Stata knew we wanted studentized residuals because we used the rstudent option after the comma. We can then use the
graph command to make a boxplot looking at the studentized residuals, looking for outliers. As we would expect, observation 100 stands out as an outlier.
. predict rstu, rstudent
. graph box rstu, symbol([case])
Below we use the predict command to create a variable called l that will contain
the leverage for each observation. Stata knew we wanted leverages because we
used the leverage option after the comma. The boxplot shows some observations that might be outliers based on their leverage. Note that observation 100 is not
among them. This is consistent with the lvr2plot (see above) that showed us that
observation 100 had a high residual, but not exceptionally high leverage.
. predict l, leverage
. graph box l, symbol([case])
Below we use the predict command to compute Cook's D for each observation.
We make a boxplot of that below, and observation 100 has the highest value of
Cook's D.
. predict d, cooksd
. graph box d, symbol([case])
We can make a plot that shows us the studentized residual, leverage, and Cook's D
all in one plot. The graph command below puts the studentized residual (rstu) on
the vertical axis, leverage (l) on the horizontal axis, and the size of the bubble
reflects the size of Cook's D (d). The [w=d] tells Stata to weight the size of the
symbol by the variable d so the higher the value of Cook's D, the larger the
symbol will be. As we would expect, the plot below shows an observation that
has a very large residual, a very large value of Cook's D, but does not have a very
large leverage.
. graph rstu l [w=d]
We repeat the graph above, except using the symbol([case]) option to show us
the variable case as the symbol, which shows us that the observation we identified
above is case 100.
. graph rstu l, symbol([case])
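The graph commands in this section appear to use the syntax of older Stata releases. In Stata 8 and later, plots of this kind are usually drawn with twoway scatter, with the mlabel() option playing the role of symbol([case]). The sketch below is an illustration under stated assumptions: it runs on the built-in auto dataset, make stands in for case, lev is a name chosen here for the leverage variable, and the cutoffs are commonly quoted rules of thumb rather than formal tests.

```stata
* Illustrative stand-in data shipped with Stata (not the tutorial's dataset)
sysuse auto, clear
regress price mpg weight

* Sample size and number of predictors, used in the cutoffs below
local n = e(N)
local k = e(df_m)

* Studentized residuals, leverage and Cook's D
predict rstu, rstudent
predict lev, leverage
predict d, cooksd

* List observations flagged by each rule of thumb
list make rstu lev d if abs(rstu) > 2 & !missing(rstu)
list make rstu lev d if lev > (2*`k' + 2)/`n' & !missing(lev)
list make rstu lev d if d > 4/`n' & !missing(d)

* Bubble plot: marker size proportional to Cook's D
twoway scatter rstu lev [aweight = d], msymbol(Oh)

* Same plot with each point labeled by its identifier
twoway scatter rstu lev, mlabel(make)
```

An observation that is flagged by all three lists deserves the closest look; one with a large residual but ordinary leverage, like case 100 above, mainly affects the fit through its y value.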
The leverage gives us an overall idea of how influential an observation is. From
our examination of the avplots above, it appeared that the outlier for case 100
influences x1 and x2 much more than it influences x3. We can use the dfbeta
command to generate dfbeta values for each observation and each predictor. The
dfbeta value shows the degree to which the coefficient will change when that
single observation is omitted. This allows you to see, for a given predictor, how
influential a single observation can be. The output below shows that three
variables were created, DFx1, DFx2, and DFx3.
. dfbeta
DFx1: DFbeta(x1)
DFx2: DFbeta(x2)
DFx3: DFbeta(x3)
Below we make a graph of the studentized residual by DFx1. We see that
observation 100 has a very high residual value, and that it has a large negative
DFBeta. This indicates that the presence of observation 100 decreases the value
of the coefficient for x1, and if it were removed, the coefficient for x1 would get
larger.
. graph rstu DFx1, symbol([case])
Below we make a graph of the studentized residual by the value of DFx2. As
above, we see that observation 100 has a very large residual, but this time DFx2 is
a large positive value. This suggests that the presence of observation 100
enhances the coefficient for x2 and its removal would lower the coefficient for x2.
. graph rstu DFx2, symbol([case])
Finally, we make a plot showing the studentized residual and DFx3. This shows
that observation 100 has a large residual, but it has a small DFBeta (small DFx3).
This suggests that the exclusion of observation 100 would have little impact on
the coefficient for x3.
. graph rstu DFx3, symbol([case])
The results of looking at the DFbeta values are consistent with our observations
from the avplots. It looks like observation 100 diminishes the coefficient for
x1, enhances the coefficient for x2, and has little impact on the coefficient for x3.
We can see that the information provided by the avplots and the values provided
by dfbeta are related. Instead of looking at this information separately, we could
look at the DFbeta values right in the added variable plots.
Below we take the DFbeta value for x2 (DFx2) and round it to 2 decimal places,
creating rDFx2. We then include rDFx2 as a symbol in the added variable plot
below. We can see that the outlier at the top right has the largest DFbeta value
and that this observation enhances the coefficient (.577); if it were omitted, that
coefficient would get smaller. In fact, the value of the DFbeta tells us roughly how
much smaller: it indicates that the coefficient will be about .98 standard errors
smaller, or .98 * .1578 = .155. Removing this observation should make the
coefficient for x2 go from about .577 to about .422. As a rule of thumb, a DFbeta
value of 1 or larger is considered worthy of attention.
. generate rDFx2 = round(DFx2,0.01)
. avplot x2, symbol([rDFx2])
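The arithmetic behind the DFbeta interpretation for x2 can be written out as follows. This is a sketch that follows the text in scaling by the full-sample standard error; DFBETA is usually defined with a standard error computed after deleting the observation, so the result should be read as an approximation. The superscript (-100) is notation introduced here for "with observation 100 removed".

```latex
\Delta b_{x2} \;\approx\; \mathrm{DFBETA}_{x2} \times \mathrm{SE}(b_{x2})
             \;=\; 0.98 \times 0.1578 \;\approx\; 0.155

b_{x2}^{(-100)} \;\approx\; 0.577 - 0.155 \;=\; 0.422
```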
The plot below is the same as above, but shows us the case numbers (the variable
case) as the symbol, allowing us to see that observation 100 is the outlying case.
. avplot x2, symbol([case])
Below, we look at the data for observation 100 and see that it has a value of 110
for y. We checked the original data and found that this was a data entry error.
The value really should have been 11.
. list in 100
Observation 100
       case        100        x1         -5        x2          8
         x3          0         y        110      rstu   7.829672
          l   .0298236         d   .2893599      DFx1  -.8180526
       DFx2   .9819155      DFx3   .0784904     rDFx2        .98
We change the value of y to be 11, the correct value.
. replace y = 11 if (case == 100)
(1 real change made)
Having fixed the value of y, we run the regression again. The coefficients change
much as the regression diagnostics indicated. The coefficient for x1 increased
(from .199 to .325), and the coefficient for x2 decreased by about the amount the
DFbeta indicated (from .577 to .419). Note that the coefficient for x1 went from
being non-significant to being significant. As we expected, the coefficient for x3
changed very little (from .35 to .34).
. regress y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   20.27
   Model |  5863.70256     3  1954.56752               Prob > F      =  0.0000
Residual |  9255.28744    96  96.4092442               R-squared     =  0.3878
---------+------------------------------               Adj R-squared =  0.3687
   Total |    15118.99    99  152.717071               Root MSE      =  9.8188
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |    .325315   .1220892      2.665   0.009         .08297    .5676601
      x2 |   .4193103    .126493      3.315   0.001       .1682236     .670397
      x3 |   .3448104   .0861919      4.000   0.000       .1737208       .5159
   _cons |   31.28053   .9855929     31.738   0.000       29.32415    33.23692
------------------------------------------------------------------------------
We repeat the regression diagnostic plots below. We need to look carefully at the scale of these plots since the axes will be rescaled with the omission of the large
outlier. The lvr2plot shows a point with a larger residual than most but low
leverage, and 3 points with larger leverage than most, but a small residual.
. lvr2plot
The rvfplot below shows a fairly even distribution of residuals without any points that dramatically stand out from the crowd.
. rvfplot
The avplots show a couple of points here and there that stand out from the
crowd. Looking at an avplot with the DFbeta value in each plot would be a
useful followup to assess the influence of these points on the regression coefficients. For now, we will leave this up to the reader.
. avplots
We create values of leverage, studentized residuals, and Cook's D. (We first drop
the variables l rstu and d because the predict command will not replace existing values).
. drop l rstu d
. predict l, leverage
. predict rstu, rstudent
. predict d, cooksd
We then make the plot that shows the residual, leverage and Cook's D all in one
graph. None of the points really jump out as having an exceptionally high
residual, leverage and Cook's D value.
. graph rstu l [w=d]
Although we could scrutinize the data a bit more closely, we can tentatively state
that these revised results are good. Had we skipped checking the residuals, we
would have used the original results which would have underestimated the impact of x1 and overstated the impact of x2.
Regression Diagnostics: Multicollinearity
. use http://www.ats.ucla.edu/stat/stata/modules/reg/multico, clear
Below we run a regression predicting y from x1, x2, x3, and x4. If we were to
report these results without any further checking, we would conclude that none of
these predictors is a significant predictor of the dependent variable, y. If we look
more carefully, we note that the joint test of all four predictors is significant
(F = 16.37, p = 0.0000) and that these predictors account for about 40% of the
variance in y (R-squared = 0.408). It seems like a contradiction that the
combination of these 4 predictors should be so strongly related to y, yet none of
them is significant. Let us investigate further.
. regress y x1 x2 x3 x4
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  4,    95) =   16.37
   Model |  5995.66253     4  1498.91563               Prob > F      =  0.0000
Residual |  8699.33747    95  91.5719733               R-squared     =  0.4080
---------+------------------------------               Adj R-squared =  0.3831
   Total |    14695.00    99  148.434343               Root MSE      =  9.5693
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   1.118277   1.024484      1.092   0.278      -.9155806    3.152135
      x2 |   1.286694   1.042406      1.234   0.220      -.7827429    3.356131
      x3 |   1.191635    1.05215      1.133   0.260      -.8971469    3.280417
      x4 |  -.8370988   1.038979     -0.806   0.422      -2.899733    1.225535
   _cons |   31.61912   .9709127     32.566   0.000       29.69161    33.54662
------------------------------------------------------------------------------
We use the vif command to examine the VIF values (and 1/VIF values, also
called tolerances). A general rule of thumb is that a VIF in excess of 20, or a tolerance of 0.05 or less may be worthy of further investigation. A tolerance
(1/VIF) can be described in this way, using x1 as an example. Use x1 as a
dependent variable, and use x2 x3 and x4 as predictors and compute the R-squared (the proportion of variance that x2 x3 and x4 explain in x1) and then take
1-Rsquared. In this example, 1-Rsquared equals 0.010964 (the value of 1/VIF for
x1). This means that only about 1% of the variance in x1 is not explained by the
other predictors. If we look at x4, we see that less than .2% of the variance in x4 is not explained by x1 x2 and x3. You can see that these results indicate that there
is a problem of multicollinearity.
. vif
Variable |       VIF       1/VIF
---------+----------------------
      x4 |    534.97    0.001869
      x3 |    175.83    0.005687
      x1 |     91.21    0.010964
      x2 |     82.69    0.012093
---------+----------------------
Mean VIF |    221.17
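The description of tolerance as 1 minus the R-squared from an auxiliary regression can be checked by hand. The sketch below is an illustration under stated assumptions: it uses the built-in auto dataset instead of the data above, weight plays the role of x1, and it assumes Stata 9 or later, where vif is run as estat vif.

```stata
* Illustrative stand-in data shipped with Stata (not the tutorial's dataset)
sysuse auto, clear

* Main regression and the VIFs Stata reports for it
regress price mpg weight length
estat vif

* Auxiliary regression: one predictor regressed on the other predictors
regress weight mpg length

* Tolerance is 1 - R-squared of the auxiliary regression; VIF is its reciprocal
display "tolerance (1/VIF) for weight = " 1 - e(r2)
display "VIF for weight               = " 1/(1 - e(r2))
```

The two displayed numbers should match the 1/VIF and VIF entries that estat vif reports for weight.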
If we examine the correlations among the variables, it seems that x4 is most
strongly related to the other predictors.
. corr x1 x2 x3 x4
(obs=100)
         |       x1       x2       x3       x4
---------+------------------------------------
      x1 |   1.0000
      x2 |   0.3553   1.0000
      x3 |   0.3136   0.2021   1.0000
      x4 |   0.7281   0.6516   0.7790   1.0000
We might conclude that x4 is redundant, and is really not needed in the model, so
we try removing x4 from the regression equation. Note that the variance
explained is about the same (still about 40%), but the predictors x1, x2, and x3
are now significant. If we compare the standard errors in the table above with
the standard errors below, we see that the standard errors in the table above were
much larger. This makes sense, because when a variable has a low tolerance, its
standard error will be increased.
. regress y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =   21.69
   Model |  5936.21931     3  1978.73977               Prob > F      =  0.0000
Residual |  8758.78069    96  91.2372989               R-squared     =  0.4040
---------+------------------------------               Adj R-squared =  0.3853
   Total |    14695.00    99  148.434343               Root MSE      =  9.5518
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |    .298443   .1187692      2.513   0.014       .0626879    .5341981
      x2 |   .4527284   .1230534      3.679   0.000       .2084695    .6969874
      x3 |   .3466306   .0838481      4.134   0.000       .1801934    .5130679
   _cons |   31.50512   .9587921     32.859   0.000       29.60194    33.40831
------------------------------------------------------------------------------
Below we look at the VIF and tolerances and see that they are very good, and
much better than the prior results. With these improved tolerances, the standard
errors in the table above are reduced.
. vif
Variable |       VIF       1/VIF
---------+----------------------
      x1 |      1.23    0.812781
      x2 |      1.16    0.864654
      x3 |      1.12    0.892234
---------+----------------------
Mean VIF |      1.17
We should emphasize that dropping variables is not the only solution to problems of multicollinearity. The solutions are often driven by the nature of your study
and the nature of your variables. You may decide to combine variables that are very highly correlated because you realize that the measures are really tapping the
exact same thing. You might decide to use principal component analysis or factor
analysis to study the structure of your variables, and decide how you might combine the variables. Or, you might choose to generate factor scores from a
principal component analysis or factor analysis, and use the factor scores as
predictors.
Had we not investigated further, we might have concluded that none of these
predictors were related to the dependent variable. After dropping x4, the results
were dramatically different showing x1 x2 and x3 all significantly related to the
dependent variable.
Regression Diagnostics: Non-Independence
. use http://www.ats.ucla.edu/stat/stata/modules/reg/nonind, clear
Below we run a regression predicting y from x1, x2, and x3. These results suggest
that none of the predictors are related to y.
. regress y x1 x2 x3
  Source |          SS    df          MS               Number of obs =     100
---------+------------------------------               F(  3,    96) =    0.41
   Model |  11.2753664     3  3.75845547               Prob > F      =  0.7431
Residual |  870.834634    96   9.0711941               R-squared     =  0.0128
---------+------------------------------               Adj R-squared = -0.0181
   Total |      882.11    99  8.91020202               Root MSE      =  3.0118
------------------------------------------------------------------------------
       y |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |   .0356469   .0374498      0.952   0.344      -.0386904    .1099843
      x2 |   .0030059   .0388007      0.077   0.938      -.0740128    .0800247
      x3 |  -.0192823   .0264387     -0.729   0.468      -.0717626    .0331981
   _cons |   1.301929   .3023225      4.306   0.000       .7018232    1.902034
------------------------------------------------------------------------------
Let's create and examine the residuals for this analysis, showing the residuals over
time. Below we see the residuals are clearly not distributed evenly across time,
suggesting the results are not independent over time.
. predict rstud, rstud
. graph rstud time
We can use the dwstat command to test whether the results are independent over
time. We first need to tell Stata the name of the time variable using the tsset
command. Stata replies that the time values range from 1 to 100.
. tsset time
time variable: time, 1 to 100
The dwstat command gives a value of "d", but not a test of the significance of
"d". If there were no autocorrelation, the value of "d" would be close to 2, and
the closer the value is to 0, the stronger the positive autocorrelation. In this
instance, the value of .14 is sufficiently close to 0 to indicate a strong
autocorrelation (see Chatterjee, Hadi and Price, section 8.3 for more information).
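The link between "d" and the strength of the autocorrelation can be made explicit with the standard large-sample approximation below. Here e_t denotes the regression residual at time t and rho-hat the estimated first-order autocorrelation of the residuals; both symbols are introduced for this illustration.

```latex
d \;=\; \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
  \;\approx\; 2\,(1 - \hat\rho)

% With d \approx 0.14:
\hat\rho \;\approx\; 1 - \frac{d}{2} \;=\; 1 - \frac{0.14}{2} \;=\; 0.93
```

So a value of d near 2 corresponds to rho-hat near 0, while the value of .14 here implies residuals that are correlated at roughly 0.93 from one period to the next.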
. dwstat
Durbin-Watson d-statistic( 4, 100) = .1389823
Let's run the analysis using the arima command, with a first order
autocorrelation. This model would permit correlations between observations
across time.
. arima y x1 x2 x3, ar(1)
(setting optimization to BHHH)
Iteration 0: log likelihood = -151.67536
Iteration 1: log likelihood = -148.94194
Iteration 2: log likelihood = -148.19301
Iteration 3: log likelihood = -148.16849
Iteration 4: log likelihood = -148.14569
(switching optimization to BFGS)
Iteration 5: log likelihood = -148.04916
Iteration 6: log likelihood = -148.04445
Iteration 7: log likelihood = -148.04442
Iteration 8: log likelihood = -148.04442
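The whole sequence (detect serial correlation, then re-estimate allowing for it) can be tried on simulated data. The sketch below is an illustration under stated assumptions: the data are generated with AR(1) errors and a coefficient of 0.9 chosen here, the variable names are made up, and it uses the commands of Stata 9 and later (estat dwatson in place of dwstat). The prais command is shown as one commonly used alternative to arima for AR(1) errors; it is not the command used above.

```stata
* Simulate 100 periods with AR(1) errors (all names and values are illustrative)
clear
set seed 2024
set obs 100
generate time = _n
tsset time
generate x1 = rnormal()

* Build the error recursively: replace works observation by observation,
* so e[_n-1] already holds the updated value
generate e = rnormal()
replace e = 0.9*e[_n-1] + rnormal() in 2/l
generate y = 1 + 0.5*x1 + e

* OLS, then the Durbin-Watson statistic and a Breusch-Godfrey test
regress y x1
estat dwatson
estat bgodfrey, lags(1)

* Re-estimate allowing for first-order autocorrelation in the errors
prais y x1
arima y x1, ar(1)
```

Unlike the Durbin-Watson statistic, estat bgodfrey reports a p-value, which addresses the point made above that "d" comes without a significance test.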
Time-Series Cross-Section Analyses (TSCS) or Panel data models
A balanced panel has all its observations; that is, the variables are observed for
each entity and each time period. A panel that has some missing data for at least
one time period for at least one entity is called an unbalanced panel.
A. The fixed effects regression model
$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it} \qquad (12)$$
where $Z_i$ is an unobserved variable that varies from one state to the next but does
not change over time. We can rewrite equation (12) as:
$$Y_{it} = \beta_1 X_{it} + \alpha_i + u_{it} \qquad (13)$$
where $\alpha_i = \beta_0 + \beta_2 Z_i$.
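One way to see why equation (13) can be estimated even though the entity effect is unobserved is to subtract entity averages. The derivation below is a sketch; the bars denote averages over time for entity i, notation introduced here for illustration.

```latex
% Average equation (13) over time for each entity i:
\bar{Y}_i = \beta_1 \bar{X}_i + \alpha_i + \bar{u}_i

% Subtract this from equation (13); the entity effect \alpha_i cancels:
Y_{it} - \bar{Y}_i = \beta_1 \,(X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)
```

OLS on the demeaned variables then gives the fixed effects estimate of the slope without ever estimating the entity effects themselves.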
Assumptions of the OLS estimator
1. The conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \ldots, X_{ki}$ has a
mean of zero. This means that the other factors captured in the error term are
unrelated to $X_{1i}, X_{2i}, \ldots, X_{ki}$. The correlation between
$X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ should be nil. This is the most important
assumption in practice. If this assumption does not hold, then it is likely because
there is an omitted variable bias. One should test for omitted variables using
(Ramsey and Braithwaite, 1931)’s test.
2. Related to the first assumption: if the variance of this conditional
distribution of $u_i$ does not depend on $X_{1i}, X_{2i}, \ldots, X_{ki}$, then the
errors are said to be homoskedastic. That is, the error term $u_i$ is homoskedastic
if the variance of the conditional distribution of $u_i$ given
$X_{1i}, X_{2i}, \ldots, X_{ki}$ is constant for $i = 1, \ldots, n$ and in particular
does not depend on $X_{1i}, X_{2i}, \ldots, X_{ki}$. Otherwise, the error term is
heteroskedastic.
c. Whether the errors are homoskedastic or heteroskedastic, the OLS
estimator is unbiased, consistent, and asymptotically normal.
d. If the errors are heteroskedastic, one should use
heteroskedasticity-robust standard errors. To test for
heteroskedasticity, we use (Breusch and Pagan, 1979)’s test.
3. $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i),\ i = 1, \ldots, n$ are independently and
identically distributed. This is to be sure that there is no selection bias in the
sample. This assumption holds in many cross-sectional data sets, but it is
inappropriate for time series data.
4. $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ have four moments. The fourth
assumption is that the fourth moments of $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$
are nonzero and finite.
5. No perfect multicollinearity. In case of perfect multicollinearity, it is
impossible to compute the OLS estimator. The regressors are said to be
perfectly multicollinear if one of the regressors is a perfect linear function of
the other regressors.
6. In cross-sectional data, the errors are uncorrelated across entities,
conditional on the regressors; here the errors are uncorrelated across time as
well as across entities, conditional on the regressors.
B. Regression with time fixed effects
Just as fixed effects for each entity can control for variables that are constant over
time but differ across entities, so can time fixed effects control for variables that
are constant across entities but evolve over time.
$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + \beta_3 S_t + u_{it} \qquad (14)$$
where $S_t$ is unobserved and the subscript “t” emphasizes that the variable S
changes over time but is constant across states.
Running Pooled OLS regressions in Stata
The simplest estimator for panel data is pooled OLS. In most cases this is unlikely
to be adequate.
The fixed and random effects models
The fixed and random effects models have in common that they decompose the
unitary pooled error term, $u_{it}$. In the fixed effects model, we decompose
$u_{it}$ into a unit-specific and time-invariant component, $\alpha_i$, and an
observation-specific error, $\varepsilon_{it}$, so that
$u_{it} = \alpha_i + \varepsilon_{it}$. The $\alpha_i$ are then treated as
fixed parameters (in effect, unit-specific y-intercepts), which are to be
estimated. This can be done by including a dummy variable for each
cross-sectional unit (and suppressing the global constant). This is sometimes
called the Least Squares Dummy Variables (LSDV) method; subtracting the unit
means from each variable (the “de-meaned” variables method) gives the same
slope estimates.
In the random effects model, the decomposition is
$u_{it} = v_i + \varepsilon_{it}$. In contrast to the fixed effects model, the
$v_i$ are not treated as fixed parameters, but as random drawings from a given
probability distribution.
The celebrated Gauss–Markov theorem, according to which OLS is the best linear
unbiased estimator (BLUE), depends on the assumption that the error term is
independently and identically distributed (IID). If these assumptions are not met — and they are unlikely to be met in the context of panel data — OLS is not the
most efficient estimator. Greater efficiency may be gained using generalized least
squares (GLS), taking into account the covariance structure of the error term.
However, GLS estimation is equivalent to OLS using “quasi-demeaned”
variables; that is, variables from which we subtract a fraction of their average. This means that if all the variance is attributable to the individual effects, then the
fixed effects estimator is optimal; if, on the other hand, individual effects are
negligible, then pooled OLS turns out, unsurprisingly, to be the optimal estimator.
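The "fraction of their average" can be made explicit. The following is a sketch for a balanced panel with $T$ observations per unit; the symbols $\theta$, $\sigma_v^2$ (variance of the individual effect) and $\sigma_\varepsilon^2$ (variance of the observation-specific error) are notation assumed here, not taken from the text above.

```latex
\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\,\sigma_v^2}},
\qquad
y_{it} - \theta\,\bar{y}_i
  = (1-\theta)\,\beta_0 + (X_{it} - \theta\,\bar{X}_i)\,\beta + (u_{it} - \theta\,\bar{u}_i)
```

When $\sigma_v^2$ is large relative to $\sigma_\varepsilon^2$, $\theta$ approaches 1 and the transformation becomes full de-meaning (fixed effects); when $\sigma_v^2 = 0$, $\theta = 0$ and the regression is pooled OLS.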
Choice of estimator
Which panel method should one use, fixed effects or random effects? One way of answering this question is in relation to the nature of the data set. If
the panel comprises observations on a fixed and relatively small set of units of
interest (say, the member states of the European Union), there is a presumption in favor of fixed effects. If it comprises observations on a large number of randomly
selected individuals (as in many epidemiological and other longitudinal studies),
there is a presumption in favor of random effects.
Besides this general heuristic, however, various statistical issues must be taken
into account.
1. Some panel data sets contain variables whose values are specific to the cross-sectional unit but which do not vary over time. If you want to include such
variables in the model, the fixed effects option is simply not available. When the
fixed effects approach is implemented using dummy variables, the problem is that the time-invariant variables are perfectly collinear with the per-unit dummies.
When using the approach of subtracting the group means, the issue is that after
de-meaning these variables are nothing but zeros.
2. A somewhat analogous prohibition applies to the random effects estimator.
This estimator is in effect a matrix-weighted average of pooled OLS and the
“between” estimator. Suppose we have observations on n units or individuals and
there are k independent variables of interest. If k > n, the “between”
estimator is undefined, since we have only n effective observations, and hence
so is the random effects estimator.
If one does not fall foul of one or other of the prohibitions mentioned above,
the choice between fixed effects and random effects may be expressed in terms
of the two econometric desiderata, efficiency and consistency.
From a purely statistical viewpoint, we could say that there is a tradeoff
between robustness and efficiency. In the fixed effects approach, we do not
make any hypotheses on the “group effects” (that is, the time-invariant
differences in mean between the groups) beyond the fact that they exist, and
that can be tested; see below. As a consequence, once these effects are swept
out by taking deviations from the group means, the remaining parameters can be
estimated.
On the other hand, the random effects approach attempts to model the group effects as drawings from a probability distribution instead of removing them. This
requires that individual effects are representable as a legitimate part of the
disturbance term, that is, zero-mean random variables, uncorrelated with the regressors.
As a consequence, the fixed-effects estimator “always works”, but at the cost of not being able to estimate the effect of time-invariant regressors. The richer
hypothesis set of the random-effects estimator ensures that parameters for time-
invariant regressors can be estimated, and that estimation of the parameters for time-varying regressors is carried out more efficiently. These advantages, though,
are tied to the validity of the additional hypotheses. If, for example, there is
reason to think that individual effects may be correlated with some of the
explanatory variables, then the random-effects estimator would be inconsistent,
while fixed-effects estimates would still be valid.
It is precisely on this principle that the Hausman test is built: if the fixed- and
random effects estimates agree, to within the usual statistical margin of error, there is no reason to think the additional hypotheses invalid, and as a
consequence, no reason not to use the more efficient RE estimator.
Testing panel models
Panel models carry certain complications that make it difficult to implement all of
the tests one expects to see for models estimated on straight time-series or cross-sectional data.
When you estimate a model using fixed effects (xtreg with the fe option), Stata
automatically reports an F-test for the null hypothesis that the
cross-sectional units all have a common intercept. After a random effects
estimation, the Breusch–Pagan test is obtained with the xttest0 command and the
Hausman test with the hausman command.
The Breusch–Pagan test is the counterpart to the F-test mentioned above. The
null hypothesis is that the variance of $v_i$ equals zero; if this hypothesis
is not rejected, then again we conclude that the simple pooled model is
adequate.
The Hausman test probes the consistency of the GLS estimates. The null
hypothesis is that these estimates are consistent, that is, that the
requirement of orthogonality of the $v_i$ and the $X_i$ is satisfied. The test
is based on a measure, H, of the “distance” between the fixed-effects and
random-effects estimates, constructed such that under the null it follows the
$\chi^2$ distribution with degrees of freedom equal to the number of
time-varying regressors in the matrix X. If the value of H is “large”, this
suggests that the random effects estimator is not consistent and the
fixed-effects model is preferable.
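For reference, the distance measure H is usually written as a quadratic form. This is a sketch in assumed notation: $\hat\beta_{FE}$ and $\hat\beta_{RE}$ denote the fixed- and random-effects estimates of the coefficients on the $k$ time-varying regressors.

```latex
H = (\hat\beta_{FE} - \hat\beta_{RE})'
    \left[ \widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE}) \right]^{-1}
    (\hat\beta_{FE} - \hat\beta_{RE})
    \;\sim\; \chi^2_k \quad \text{under } H_0
```

The difference of the two covariance matrices appears because, under the null, the random effects estimator is efficient, so the variance of the difference equals the difference of the variances.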
Robust standard errors
For most estimators, Stata offers the option of computing an estimate of the
covariance matrix that is robust with respect to heteroskedasticity and/or
autocorrelation (and hence also robust standard errors). In the case of panel
data, robust covariance matrix estimators are available for the pooled model
and, through the vce(robust) and vce(cluster clustvar) options of xtreg, for
the fixed and random effects models.
Let's now turn to estimation commands for panel data.
The first type of regression that you may run is a pooled OLS regression, which is
simply an OLS regression applied to the whole dataset. This regression ignores
the fact that you observe the same individuals over several time periods, and
so it does not take the panel nature of the dataset into account.
reg ln_wage grade age ttl_exp tenure black not_smsa south
If several variables share a common prefix (say age1 and age2), you do not need
to type each of them. You just need to type age* (the prefix followed by an
asterisk). When you do this, you are instructing Stata to include in the
regression all the variables whose names start with the expression age. Typing
age alone, as in the previous command, refers to the variable age itself.
Suppose you want to observe the internal results saved in Stata associated with
the last estimation. This is valid for any regression that you perform. In order to
observe them, you would type:
ereturn list
If you want to control for some categories:
xi: reg dependent ind1 ind2 i.category1 i.category2 i.time
Let's perform a regression where only the variation of the means across
individuals is considered.
This is the between regression.
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be
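The two commands above can be tried end to end with the sketch below. It assumes the NLS women panel from Stata's online example data (webuse nlswork), which contains the variable names used in this section; the dummy black is not in that file and is generated here as an assumption about how it was built.

```stata
* Assumption: example panel from Stata's online datasets
webuse nlswork, clear
generate byte black = (race == 2)
* Declare the panel (individual identifier) and time variables
xtset idcode year

* Pooled OLS, with standard errors clustered on the individual
regress ln_wage grade age ttl_exp tenure black not_smsa south, vce(cluster idcode)

* Between estimator: regression on the individual means
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be
```

Clustering on idcode is a design choice added here: it leaves the pooled coefficients unchanged but allows for correlation between observations on the same person.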
Running Panel regressions in Stata
In empirical work with panel data, you always have to choose between two
alternative regressions: fixed effects (or within, or least squares dummy
variables, LSDV) estimation and random effects (or feasible generalized least
squares, FGLS) estimation.
In panel data, in the two-way model, the error term is the sum of three
components:
1. a specific individual effect,
2. a specific time effect,
3. and an additional idiosyncratic term.
In the one-way model, the error term is the sum of two components:
1. a specific individual effect,
2. and an additional idiosyncratic term.
It is absolutely fundamental that the error term is not correlated with the independent variables.
• If there is no such correlation, then the random effects model should be used
because it is a weighted average of between and within estimations.
• But, if there is correlation between the individual and/or time effects and
the independent variables, then the individual and time effects (fixed
effects model) must be estimated as dummy variables in order to solve
the endogeneity problem.
The fixed effects (or within regression) is an OLS regression of the form:
(yit - yi. - y.t + y..) = (xit - xi. - x.t + x..)B + (vit - vi. - v.t + v..)
where yi., xi. and vi. are the means of the respective variables (and the error)
within the individual across time, y.t, x.t and v.t are the means of the
respective variables (and the error) within each time period across
individuals, and y.., x.. and v.. are the overall means of the respective
variables (and the error).
Choosing between Fixed effects and Random effects? The Hausman test
The generally accepted way of choosing between fixed and random effects is
running a Hausman test.
Statistically, fixed effects are always a reasonable thing to do with panel data
(they always give consistent results) but they may not be the most efficient
model to run. Random effects is a more efficient estimator and therefore gives
smaller standard errors, so you should run random effects if it is
statistically justifiable to do so.
The Hausman test checks a more efficient model against a less efficient but
consistent model to make sure that the more efficient model also gives
consistent results.
To run a Hausman test comparing fixed with random effects in Stata, you need to
first estimate the fixed effects model, save the coefficients so that you can
compare them with the results of the next model, estimate the random effects
model, and then do the comparison.
1. xtreg dependentvar independentvar1 independentvar2... , fe
2. estimates store fixed
3. xtreg dependentvar independentvar1 independentvar2... , re
4. estimates store random
5. hausman fixed random
The Hausman test tests the null hypothesis that the coefficients estimated by
the efficient random effects estimator are the same as the ones estimated by
the consistent fixed effects estimator. If the difference is insignificant
(P-value, Prob>chi2, larger than .05), then it is safe to use random effects.
If you get a significant P-value, however, you should use fixed effects.
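The five steps can be sketched as a runnable example. It assumes the NLS panel from Stata's online example data (webuse nlswork); only time-varying regressors are kept so that both models estimate the same set of coefficients.

```stata
* Assumption: example panel from Stata's online datasets
webuse nlswork, clear
xtset idcode year

* Consistent but less efficient estimator
xtreg ln_wage age ttl_exp tenure not_smsa south, fe
estimates store fixed

* Efficient estimator under the null hypothesis
xtreg ln_wage age ttl_exp tenure not_smsa south, re
estimates store random

* Consistent model first, efficient model second
hausman fixed random
display "chi2 = " r(chi2) "   df = " r(df) "   Prob>chi2 = " r(p)
```

The order of the arguments matters: hausman expects the always-consistent estimator first and the efficient one second.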
If you want a fixed effects model with robust standard errors, you can use the
following command:
areg ln_wage grade age ttl_exp tenure black not_smsa south, absorb(idcode) robust
You may be interested in running a maximum likelihood estimation of the random
effects model in panel data. You would type:
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, mle
If you qualify for a fixed effects model, should you include time effects?
Another important question when you are doing empirical work with panel data is
whether or not to include time effects (time dummies) in your fixed effects
model.
In order to perform the test for the inclusion of time dummies in our fixed effects regression,
1. First, we run fixed effects including the time dummies. In the next fixed
effects regression, the time dummies are abbreviated to "y" (see “Generating
time dummies”), but you could type them all if you prefer. If Stata reports an
ambiguous abbreviation, use the wildcard y* instead.
xtreg ln_wage grade age ttl_exp tenure black not_smsa south y, fe
2. Second, we apply the "testparm" command. It tests the null hypothesis that
the coefficients on the time dummies are jointly equal to zero.
testparm y
3. We reject the null hypothesis if the p-value is smaller than 10%. In that
case the time dummies are jointly significant, and our fixed effects
regression should include time effects.
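An alternative that avoids creating the dummies by hand is factor-variable notation. The sketch below assumes Stata 11 or later and the NLS example panel (webuse nlswork); i.year stands for one dummy per year.

```stata
* Assumption: example panel from Stata's online datasets
webuse nlswork, clear
xtset idcode year

* Fixed effects with a full set of year dummies
xtreg ln_wage age ttl_exp tenure not_smsa south i.year, fe

* Joint test that all year-dummy coefficients are zero
testparm i.year
display "F = " r(F) "   Prob > F = " r(p)
```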
Fixed effects or random effects when time dummies are involved: a test
What if the inclusion of time dummies in our regression would permit us to use a
random effects model for the individual effects?
(This question is not usually considered in typical empirical work; the purpose
here is to show you an additional test for random effects in panel data.)
1. First, we run a random effects regression including our time dummies,
xtreg ln_wage grade age ttl_exp tenure black not_smsa south y, re
2. and then we apply the "xttest0" command, the Breusch and Pagan Lagrange
multiplier test for random effects. Its null hypothesis is that the variance of
the individual effects is zero, that is, that there are no random effects.
xttest0
3. If the p-value is smaller than 10%, the null hypothesis is rejected:
individual effects are present and pooled OLS is not adequate. The choice
between random effects and fixed effects with time effects then rests on the
Hausman test described above.
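A minimal sketch of this test, assuming the NLS example panel (webuse nlswork) and factor-variable year dummies in place of hand-made ones:

```stata
* Assumption: example panel from Stata's online datasets
webuse nlswork, clear
xtset idcode year

* Random effects with year dummies
xtreg ln_wage age ttl_exp tenure not_smsa south i.year, re

* Breusch-Pagan LM test; H0: variance of the individual effect is zero
xttest0
display "Prob > chibar2 = " r(p)
```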
Dynamic panels and GMM estimations
Special problems arise when a lag of the dependent variable is included among
the regressors in a panel model.
First, if the error $u_{it}$ includes a group effect, $v_i$, then $y_{i,t-1}$
is bound to be correlated with the error, since the value of $v_i$ affects
$y_i$ at all t. That means that OLS will be inconsistent as well as
inefficient. The fixed-effects model sweeps out the group effects and so
overcomes this particular problem, but a subtler issue remains, which applies
to both fixed and random effects estimation: after de-meaning, the lagged
dependent variable is still correlated with the error through the group mean.
Estimators which ignore this correlation will be consistent only as
$T \to \infty$ (in which case the marginal effect of $\varepsilon_{it}$ on the
group mean of y tends to vanish).
One strategy for handling this problem, and producing consistent estimates of
the model's coefficients, was proposed by Anderson and Hsiao (1981). Instead of
de-meaning the data, they suggest taking the first difference, an alternative
tactic for sweeping out the group effects.
Although the Anderson–Hsiao estimator is consistent, it is not most efficient:
it does not make the fullest use of the available instruments, nor does it take
into account the differenced structure of the error term. It is improved upon
by the methods of Arellano and Bond (1991) and Blundell and Bond (1998).
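The first-difference idea can be written out as follows. This is a sketch in assumed notation: $\rho$ is the coefficient on the lagged dependent variable, $\beta$ the coefficients on the other regressors, $v_i$ the group effect and $\varepsilon_{it}$ the observation-specific error.

```latex
\begin{aligned}
y_{it} &= X_{it}\beta + \rho\, y_{i,t-1} + v_i + \varepsilon_{it} \\
\Delta y_{it} &= \Delta X_{it}\beta + \rho\, \Delta y_{i,t-1} + \Delta\varepsilon_{it} \\
E\left[\, y_{i,t-2}\, \Delta\varepsilon_{it} \,\right] &= 0
\end{aligned}
```

Differencing removes $v_i$, but $\Delta y_{i,t-1}$ contains $\varepsilon_{i,t-1}$ and is therefore correlated with $\Delta\varepsilon_{it}$. The third line is the moment condition that makes $y_{i,t-2}$ a valid instrument, provided the $\varepsilon_{it}$ are serially uncorrelated.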
Stata natively implements the Arellano–Bond estimator. The rationale behind it
is, strictly speaking, that of a GMM estimator. This procedure has the double
effect of handling heteroskedasticity and/or serial correlation, plus producing
estimators that are asymptotically efficient.
One-step estimators have sometimes been preferred on the grounds that they are
more robust. Moreover, computing the covariance matrix of the 2-step estimator
via the standard GMM formulae has been shown to produce grossly biased results
in finite samples. However, implementing the finite-sample correction devised
by Windmeijer (2005) leads to standard errors for the 2-step estimator that can
be considered relatively accurate.
Two additional commands that are very useful in empirical work are the Arellano
and Bond estimator (GMM estimator) and the Arellano and Bover estimator
(system GMM).
Both commands permit you to deal with dynamic panels (where you want to use lags
of the dependent variable as independent variables) as well as with problems of
endogeneity.
You may want to have a look at them. The commands are respectively "xtabond"
and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check
how it works, just type:
help xtabond
"xtabond2" is not a built-in command in Stata. If you want to look at it, you
must first get it from the net (this is another feature of Stata: you can
always get additional commands from the net). You type the following:
findit xtabond2
The next steps to install the command should be obvious.
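For the built-in command, a minimal sketch is given below. It assumes the Arellano–Bond employment panel from Stata's online example data (webuse abdata), in which n, w and k are log employment, log wage and log capital; the specification is illustrative only.

```stata
* Assumption: example panel from Stata's online datasets (already xtset)
webuse abdata, clear

* One-step difference GMM with one lag of the dependent variable
xtabond n w k, lags(1) vce(robust)

* Arellano-Bond test for serial correlation in the first-differenced errors
estat abond
```

With serially uncorrelated errors in levels, first-order autocorrelation in the differenced errors is expected; it is second-order autocorrelation that would cast doubt on the instruments.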
How does it work?
The xtabond2 command estimates dynamic models with either the difference GMM
estimator or the system GMM estimator.
xtabond2 dep_variable ind_variables (if, in), noleveleq gmm(list1, options1) iv(list2, options2) two robust small
1. When noleveleq is specified, the difference GMM estimator is used.
Otherwise, if noleveleq is not specified, the system GMM estimator is used.
2. gmm(list1, options1):
• list1 is the list of the non-exogenous independent variables
• options1 may take the following values: lag(a b), eq(diff), eq(level), eq(both) and collapse
o lag(a b) means that for the equation in difference, the lagged variables (in
level) of each variable from list1, dated from t-a to t-b, will be used as
instruments; whereas for the equation in level, the first differences dated
t-a+1 will be used as instruments. If b is a period (.), it means b is
infinite. By default, a=1 and b is infinite.
Example: gmm(x y, lag(2 .)) ⇒ all the lagged variables of x and y, lagged by
at least two periods, will be used as instruments.
Example 2: gmm(x, lag(1 2)) gmm(y, lag(2 3)) ⇒ for variable x, the lagged
values of one period and two periods will be used as instruments, whereas for
variable y, the lagged values of two and three periods will be used as
instruments.
o Options eq(diff), eq(level) or eq(both) mean that the instruments must be
used respectively for the equation in first difference, the equation in level,
or for both. By default, the option is eq(both).
o Option collapse reduces the size of the instrument matrix and helps to
prevent the bias that arises in small samples when the number of instruments is
close to the number of observations. But it reduces the statistical efficiency
of the estimator in large samples.
3. iv(list2, options2):
• list2 is the list of variables that are strictly exogenous, and options2 may
take the following values: eq(diff), eq(level), eq(both), pass and mz.
o eq(diff), eq(level), and eq(both): see above.
o By default, the exogenous variables are differenced to serve as instruments
in the equations in first difference, and are used undifferenced to serve as
instruments in the equations in level. The pass option prevents the exogenous
variables from being differenced when they serve as instruments in the
equations in first difference. Example: gmm(z, eq(level)) gmm(x, eq(diff) pass)
allows variable x in level to be used as an instrument in the equation in
level as well as in the equation in difference.
o Option mz replaces the missing values of the exogenous variables by zero,
thus allowing the regression to include the observations whose data on
exogenous variables are missing. This option impacts the coefficients only if
the variables are exogenous.
4. Option two:
• This option specifies two-step GMM estimation. Although two-step estimation
is asymptotically more efficient, it leads to biased results in finite samples.
To fix this issue, the xtabond2 command applies a finite-sample correction to
the covariance matrix. So far, there is no test to know whether the one-step
GMM estimator or the two-step GMM estimator should be used.
5. Option robust:
• This option corrects the t-tests for heteroscedasticity.
6. Option small:
• This option replaces the z-statistics by t-statistics.
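The options above can be put together in one call. The sketch below assumes an internet connection (to install the user-written xtabond2 from SSC) and the Arellano–Bond example panel (webuse abdata); the choice of instruments is illustrative, not a recommended specification.

```stata
* xtabond2 is user-written; install it once from the SSC archive
ssc install xtabond2, replace

* Assumption: example panel from Stata's online datasets (already xtset)
webuse abdata, clear

* Difference GMM (noleveleq), two-step, corrected standard errors, t-statistics.
* L.n is instrumented with its own lags; w and k are treated as exogenous.
xtabond2 n L.n w k, gmm(L.n, lag(1 .)) iv(w k) noleveleq twostep robust small
```

Dropping noleveleq from the same call switches to the system GMM estimator.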
TESTS
In need of a causality test?
The first thing to do is to use the command summarize, detail or other functions
presented in the previous tutorials to obtain a description of the data. Once
again, you are required to state explicitly the NULL and ALTERNATIVE hypotheses
of this test, and the regression equations you are going to run. The results in
Table 1 of Thurman and Fisher (1988) can be easily replicated using OLS
regressions and the time series commands introduced in the previous tutorials.
A simple example in Stata:
*Causality direction A: Do chickens Granger-cause eggs? For example, using one
lag, you proceed as follows:
regress egg L.egg L.chic
  Source |       SS       df       MS                  Number of obs =      53
---------+------------------------------               F(  2,    50) =  645.24
   Model |  38021977.8     2  19010988.9               Prob > F      =  0.0000
Residual |  1473179.16    50  29463.5832               R-squared     =  0.9627
---------+------------------------------               Adj R-squared =  0.9612
   Total |  39495157.0    52   759522.25               Root MSE      =  171.65
------------------------------------------------------------------------------
     egg |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     egg |
      L1 |   .9613121    .027241     35.289   0.000        .906597    1.016027
    chic |
      L1 |  -.0001136   .0005237     -0.217   0.829      -.0011655    .0009383
   _cons |   279.3413   279.6937      0.999   0.323        -282.44    841.1226
------------------------------------------------------------------------------
And you can test if chickens Granger-cause eggs using an F-test:
test L.chic
( 1) L.chic = 0.0
F( 1, 50) = 0.05
Prob > F = 0.8292
**Causality direction B: Do eggs Granger-cause chickens? This involves the same techniques, but here you need to regress chickens against the lags of
chickens and the lags of eggs. For example, using one lag you have:
regress chic L.egg L.chic
  Source |       SS       df       MS                  Number of obs =      53
---------+------------------------------               F(  2,    50) =   65.92
   Model |  8.0984e+10     2  4.0492e+10               Prob > F      =  0.0000
Residual |  3.0712e+10    50   614248751               R-squared     =  0.7250
---------+------------------------------               Adj R-squared =  0.7140
   Total |  1.1170e+11    52  2.1480e+09               Root MSE      =   24784
------------------------------------------------------------------------------
    chic |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     egg |
      L1 |   -4.32139   3.933252     -1.099   0.277      -12.22156     3.57878
    chic |
      L1 |   .8349305    .075617     11.042   0.000       .6830493    .9868117
   _cons |   88951.72   40384.25      2.203   0.032       7837.569    170065.9
------------------------------------------------------------------------------
test L.egg
( 1) L.egg = 0.0
F( 1, 50) = 1.21
Prob > F = 0.2772
Do that for lags 1, 2, 3, and 4. Please provide a table in the same format as
Thurman and Fisher's (1988), containing your results, plus a graphical analysis.
Causality in further lags: To test Granger causality in further lags, the procedures
are the same. Just remember to test the joint hypothesis of non-significance of the "causality" terms.
Example: Do eggs Granger cause chickens (in four lags)?
regress chic L.egg L2.egg L3.egg L4.egg L.chic L2.chic L3.chic L4.chic
  Source |       SS       df       MS                  Number of obs =      50
---------+------------------------------               F(  8,    41) =   22.75
   Model |  8.9451e+10     8  1.1181e+10               Prob > F      =  0.0000
Residual |  2.0154e+10    41   491569158               R-squared     =  0.8161
---------+------------------------------               Adj R-squared =  0.7802
   Total |  1.0961e+11    49  2.2369e+09               Root MSE      =   22171
------------------------------------------------------------------------------
    chic |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     egg |
      L1 |   87.38472   26.87471      3.252   0.002       33.11014    141.6593
      L2 |  -62.49408   41.76817     -1.496   0.142      -146.8466    21.85845
      L3 |  -8.214513   44.09684     -0.186   0.853      -97.26989    80.84086
      L4 |  -22.63552   30.59828     -0.740   0.464         -84.43    39.15897
    chic |
      L1 |   .2332566   .1934323      1.206   0.235      -.1573878     .623901
      L2 |     .45797   .2039095      2.246   0.030       .0461663    .8697736
      L3 |  -.0184877   .2059394     -0.090   0.929      -.4343907    .3974153
      L4 |   .0256691   .1779262      0.144   0.886      -.3336602    .3849984
   _cons |   147330.3   46385.32      3.176   0.003        53653.2    241007.3
------------------------------------------------------------------------------
and then test the joint significance of all lags of eggs
test L.egg L2.egg L3.egg L4.egg
( 1) L.egg = 0.0
( 2) L2.egg = 0.0
( 3) L3.egg = 0.0
( 4) L4.egg = 0.0
F( 4, 41) = 4.26
Prob > F = 0.0057
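Repeating the regression and the joint test for lags 1 to 4 can be automated with a loop. The sketch below uses simulated placeholder series so that it runs on its own; they are NOT the Thurman and Fisher data, and the variable year is an assumed name for the time index. With the real data in memory, keep only the tsset line and the loop.

```stata
* Illustration only: simulated placeholder series, not the Thurman-Fisher data
clear
set seed 12345
set obs 54
generate year = 1929 + _n
generate egg  = rnormal()
generate chic = rnormal()
tsset year

* For each lag length p: regress chic on p lags of egg and of chic,
* then test that the p egg lags are jointly zero (eggs do not Granger-cause chickens)
forvalues p = 1/4 {
    quietly regress chic L(1/`p').egg L(1/`p').chic
    quietly testparm L(1/`p').egg
    display "lags = `p'   F = " %6.2f r(F) "   Prob > F = " %6.4f r(p)
}
```

Swapping the roles of egg and chic in the loop gives the test for the other causality direction.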
Maximum likelihood estimation
1. Probit and logit regressions
Probit and logit regressions are models designed for binary dependent variables. Because a regression with a binary dependent variable Y models the probability
that Y=1, it makes sense to adopt a nonlinear formulation that forces the predicted values to be between zero and one.
Probit regression uses the standard normal cumulative probability distribution
function. Logit regression uses the logistic cumulative probability distribution function.
Probit regression
$$\Pr(Y = 1 \mid X_1, \ldots, X_k) = \Phi(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k) \qquad (15)$$
where $\Phi$ is the cumulative standard normal distribution function.
Logit regression
$$\Pr(Y = 1 \mid X_1, \ldots, X_k) = F(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}} \qquad (16)$$
Logit regression is similar to probit regression except that the cumulative distribution function is different.
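In Stata the two models are estimated with the probit and logit commands. The sketch below assumes the low birth weight example data from Stata's online datasets (webuse lbw), where low is a binary outcome; the regressors are chosen for illustration only.

```stata
* Assumption: example data from Stata's online datasets
webuse lbw, clear

* Probit: standard normal cumulative distribution function
probit low age lwt smoke
* Average marginal effects on Pr(low = 1)
margins, dydx(*)

* Logit: logistic cumulative distribution function
logit low age lwt smoke
margins, dydx(*)
```

The raw coefficients of the two models are on different scales, so the marginal effects reported by margins are the more natural quantities to compare.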
Linear probability model
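As a brief reminder, the linear probability model is simply OLS applied to a binary dependent variable, so that the fitted value is read as Pr(Y=1) given the regressors. The sketch below is an assumption-laden illustration using Stata's low birth weight example data (webuse lbw); robust standard errors are requested because the errors of this model are heteroskedastic by construction.

```stata
* Assumption: example data from Stata's online datasets
webuse lbw, clear

* Linear probability model: OLS on a 0/1 outcome, robust standard errors
regress low age lwt smoke, vce(robust)

* Fitted probabilities are not constrained to lie between 0 and 1
predict phat, xb
count if phat < 0 | phat > 1
```

The count in the last line shows how many fitted values fall outside the unit interval, which is the main motivation for the probit and logit formulations.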
EXAMPLES
Health Care
use http://www.ats.ucla.edu/stat/stata/modules/reg/health, clear
Let's start by checking the univariate distribution of these variables. We see
that timedrs, phyheal and stress show considerable skewness.
. summarize timedrs phyheal menheal stress, detail
              No. visits physical/mental health prof
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            1              0       Obs                 465
25%            2              0       Sum of Wgt.         465
50%            4                      Mean           7.901075
                        Largest       Std. Dev.      10.94849
75%           10             60
90%           18             60       Variance       119.8695
95%           27             75       Skewness        3.23763
99%           58             81       Kurtosis        15.9472
               No. of physical health problems
-------------------------------------------------------------
      Percentiles      Smallest
 1%            2              2
 5%            2              2
10%            2              2       Obs                 465
25%            3              2       Sum of Wgt.         465
50%            5                      Mean           4.972043
                        Largest       Std. Dev.      2.388296
75%            6             13
90%            8             13       Variance       5.703958
95%            9             14       Skewness       1.028006
99%           12             15       Kurtosis       4.098588
                No. of mental health problems
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            1              0       Obs                 465
25%            3              0       Sum of Wgt.         465
50%            6                      Mean           6.122581
                        Largest       Std. Dev.      4.193594
75%            9             17
90%           12             18       Variance       17.58623
95%           14             18       Skewness       .6005144
99%           17             18       Kurtosis       2.698121
                      Life Change Units
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%           25              0
10%           59              0       Obs                 465
25%           98              0       Sum of Wgt.         465
50%          178                      Mean           204.2172
                        Largest       Std. Dev.      135.7927
75%          278            597
90%          389            643       Variance       18439.66
95%          441            731       Skewness       1.039773
99%          594            920       Kurtosis       4.768424
Let's graph the distribution of the variables.
. kdensity timedrs, normal
. kdensity phyheal, normal
. kdensity menheal, normal
. kdensity stress, normal
From the graphs above, timedrs and phyheal seem the most skewed, while stress
is somewhat less skewed.
Below we create scatterplot matrices, and they clearly show problems that need
to be addressed.
. graph matrix timedrs phyheal menheal stress, msymbol(p)
Even though we know there are problems with these variables, let's try running a
regression and examine the diagnostics.
. regress timedrs phyheal menheal stress
  Source |       SS       df       MS                  Number of obs =     465
---------+------------------------------               F(  3,   461) =   43.03
   Model |  12168.3154     3  4056.10512               Prob > F      =  0.0000
Residual |  43451.1341   461   94.254087               R-squared     =  0.2188
---------+------------------------------               Adj R-squared =  0.2137
   Total |  55619.4495   464  119.869503               Root MSE      =  9.7085
------------------------------------------------------------------------------
 timedrs |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 phyheal |   1.786948   .2210735      8.083   0.000       1.352511    2.221385
 menheal |  -.0096656   .1290286     -0.075   0.940      -.2632227    .2438915
  stress |   .0136145   .0036121      3.769   0.000       .0065162    .0207128
   _cons |  -3.704848   1.124195     -3.296   0.001      -5.914029   -1.495666
------------------------------------------------------------------------------
The rvfplot shows a real fan spread pattern where the variability of the residuals
grows across the fitted values.
. rvfplot
The hettest command confirms there is a problem of heteroscedasticity.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of timedrs
Ho: Constant variance
chi2(1) = 148.83
Prob > chi2 = 0.0000
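In current versions of Stata these diagnostics are run as postestimation commands with the estat prefix. The sketch below uses the auto data that ships with Stata (sysuse auto) purely as a stand-in so that it runs without the health file; the regression itself is arbitrary.

```stata
* Assumption: stand-in dataset shipped with Stata, not the health data
sysuse auto, clear
regress price mpg weight

* Residuals versus fitted values, with a reference line at zero
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest

* Ramsey RESET test using powers of the fitted values, then of the regressors
estat ovtest
estat ovtest, rhs
```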
Let's address the problems of non-normality and heteroscedasticity. Tabachnick
and Fidell recommend a log (to the base 10) transformation for timedrs and
phyheal and a square root transformation for stress. We make these
transformations below.
. generate ltimedrs = log10(timedrs+1)
. generate lphyheal = log10(phyheal+1)
. generate sstress = sqrt(stress)
Let's examine the distributions of these new variables. These transformations
have almost completely removed the skewness.
. summarize ltimedrs lphyheal sstress, detail
                          ltimedrs
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%       .30103              0       Obs                 465
25%     .4771213              0       Sum of Wgt.         465
50%       .69897                      Mean            .741285
                        Largest       Std. Dev.      .4152538
75%     1.041393        1.78533
90%     1.278754        1.78533       Variance       .1724357
95%     1.447158       1.880814       Skewness       .2277155
99%     1.770852       1.913814       Kurtosis       2.811711
                          lphyheal
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .4771213       .4771213
 5%     .4771213       .4771213
10%     .4771213       .4771213       Obs                 465
25%       .60206       .4771213       Sum of Wgt.         465
50%     .7781513                      Mean           .7437625
                        Largest       Std. Dev.      .1668434
75%      .845098       1.146128
90%     .9542425       1.146128       Variance       .0278367
95%            1       1.176091       Skewness       .1555756
99%     1.113943        1.20412       Kurtosis       2.354632
                           sstress
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            5              0
10%     7.681146              0       Obs                 465
25%     9.899495              0       Sum of Wgt.         465
50%     13.34166                      Mean           13.39955
                        Largest       Std. Dev.      4.972175
75%     16.67333       24.43358
90%     19.72308       25.35744       Variance       24.72252
95%           21       27.03701       Skewness      -.0908912
99%     24.37212        30.3315       Kurtosis       3.102605
Let's use kdensity to look at the distribution of these new variables. The distributions look pretty good.
. kdensity ltimedrs, normal
. kdensity lphyheal, normal
. kdensity sstress, normal
The scatterplots for the transformed variables look better.
. graph matrix ltimedrs lphyheal menheal sstress, msymbol(p)
Now let's try running a regression and diagnostics with these transformed variables.
. regress ltimedrs lphyheal menheal sstress
  Source |       SS       df       MS                  Number of obs =     465
---------+------------------------------               F(  3,   461) =   93.70
   Model |  30.3070861     3   10.102362               Prob > F      =  0.0000
Residual |  49.7030783   461  .107815788               R-squared     =  0.3788
---------+------------------------------               Adj R-squared =  0.3747
   Total |  80.0101644   464  .172435699               Root MSE      =  .32835
------------------------------------------------------------------------------
ltimedrs |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
lphyheal |   1.293965   .1077396     12.010   0.000       1.082244    1.505687
 menheal |   .0016188   .0043995      0.368   0.713      -.0070268    .0102645
 sstress |   .0156626   .0033582      4.664   0.000       .0090632    .0222619
   _cons |  -.4409002   .0755985     -5.832   0.000      -.5894606   -.2923398
------------------------------------------------------------------------------
The distribution of the residuals looks better. There still is a flat portion in
the bottom left of the plot, and there is a residual in the top left.
. rvfplot
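To see what rvfplot draws, the same plot can be built by hand from the fitted values and residuals. This is a minimal sketch, assuming it is run directly after the regress command above. The variable names yhat and e are illustrative and do not appear in the original text.

```stata
* Run directly after the regress command above.
* yhat and e are illustrative variable names, not from the original text.
predict yhat, xb
predict e, residuals

* Residuals against fitted values, with a reference line at zero.
twoway scatter e yhat, yline(0)

* rvfplot accepts the same reference line in current Stata releases.
rvfplot, yline(0)
```

The reference line at zero makes it easier to judge whether the spread of the residuals changes across the range of fitted values.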
The test reported by hettest is no longer significant, so constant variance cannot be
rejected; the residuals appear homoscedastic.
. hettest
Cook-Weisberg test for heteroscedasticity using fitted values of ltimedrs
     Ho: Constant variance
         chi2(1)      =     0.86
         Prob > chi2  =   0.3529
We use the ovtest command to test for omitted variables from the equation. The
results suggest no omitted variables.
. ovtest
Ramsey RESET test using powers of the fitted values of ltimedrs
       Ho:  model has no omitted variables
                 F(3, 458) =      0.60
                  Prob > F =      0.6134
We use the ovtest with the rhs option to test for omitted higher order trends (e.g.
quadratic, cubic trends). The results suggest there are no omitted higher order trends.
. ovtest, rhs
Ramsey RESET test using powers of the independent variables
       Ho:  model has no omitted variables
                 F(9, 452) =      0.87
                  Prob > F =      0.5525
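In Stata 9 and later the documented way to call these tests is through estat; hettest and ovtest are the older command names. The lines below are a minimal sketch of the same three tests in the newer form, assuming the transformed variables are in memory.

```stata
* Refit the model so that the postestimation commands have results to work on.
regress ltimedrs lphyheal menheal sstress

* Breusch-Pagan / Cook-Weisberg test; Ho: constant variance.
estat hettest

* Ramsey RESET using powers of the fitted values.
estat ovtest

* RESET using powers of the right-hand-side variables instead.
estat ovtest, rhs
```

In each case a large p-value, such as the 0.3529, 0.6134 and 0.5525 reported above, means the null hypothesis is not rejected.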
Examination of the added-variable plots below shows no dramatic problems.
. avplots
Let's create leverage, studentized residuals, and Cook's D, and plot these. These
results look mostly OK. There is one observation in the middle top-right section that
has a large Cook's D (large bubble) and a fairly large residual, but not very large
leverage.
. predict l, leverage
. predict rstud, rstudent
. predict d, cooksd
. graph rstu l [w=d]
Below we show the same plot showing the subject number, and see that
observation 548 is the observation we identified in the plot above.
. graph rstu l, symbol([subjno])
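The two graph commands above use the older graph syntax and refer to rstud by the abbreviation rstu. Below is a minimal sketch of what is assumed to be the equivalent in Stata 8 and later, assuming l, rstud and d were created with the predict commands above. The cutoff of 4/n for Cook's D is a common rule of thumb and is not taken from the original text.

```stata
* Assumes l, rstud and d were created with the predict commands above.
* Marker area is proportional to Cook's D.
twoway scatter rstud l [aweight=d], msymbol(circle_hollow)

* Same plot with each point replaced by its subject number.
twoway scatter rstud l, msymbol(none) mlabel(subjno) mlabposition(0)

* List observations whose Cook's D exceeds 4/n (n = 465 here).
list subjno rstud l d if d > 4/465 & !missing(d)
```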
The residuals look like they are OK. Let's try running the regression using robust
standard errors and see if we get the same results. Indeed, the results below (using robust standard errors) are virtually the same as the prior results.
. regress ltimedrs lphyheal menheal sstress, robust

Regression with robust standard errors                 Number of obs =     465
                                                       F(  3,   461) =  114.66
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3788
                                                       Root MSE      =  .32835

------------------------------------------------------------------------------
         |               Robust
ltimedrs |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
lphyheal |   1.293965   .1084569     11.931   0.000       1.080834    1.507096
 menheal |   .0016188   .0044805      0.361   0.718      -.0071859    .0104235
 sstress |   .0156626   .0034264      4.571   0.000       .0089293    .0223958
   _cons |  -.4409002   .0699104     -6.307   0.000      -.5782828   -.3035176
------------------------------------------------------------------------------
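One way to make "virtually the same" concrete is to divide each robust standard error by its conventional counterpart. The coefficients are identical by construction, so only the standard errors can differ. The sketch below uses numbers copied from the two tables; the vce(robust) spelling in the last line is the form used in current Stata releases and is an addition to the original text.

```stata
* Ratio of robust to conventional standard errors, copied from the two tables.
display "lphyheal: " .1084569/.1077396
display "menheal:  " .0044805/.0043995
display "sstress:  " .0034264/.0033582
display "_cons:    " .0699104/.0755985

* In current Stata releases the same estimator is usually requested this way.
regress ltimedrs lphyheal menheal sstress, vce(robust)
```

The ratios for the three slopes are within about 2 percent of 1; the robust standard error of the constant is roughly 7 to 8 percent smaller than the conventional one.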
Let's try robust regression and again check to see if the results change. Again, the
results are nearly identical to the original results.
. rreg ltimedrs lphyheal menheal sstress
   Huber iteration 1:  maximum difference in weights = .66878052
   Huber iteration 2:  maximum difference in weights = .05608508
   Huber iteration 3:  maximum difference in weights = .01226236
Biweight iteration 4:  maximum difference in weights = .27285455
Biweight iteration 5:  maximum difference in weights = .01514261
Biweight iteration 6:  maximum difference in weights = .00324634

Robust regression estimates                            Number of obs =     465
                                                       F(  3,   461) =  105.94
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
ltimedrs |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
lphyheal |   1.363605   .1019097     13.381   0.000        1.16334     1.56387
 menheal |   .0013055   .0041615      0.314   0.754      -.0068723    .0094834
 sstress |   .0124211   .0031765      3.910   0.000       .0061789    .0186633
   _cons |  -.4590465   .0715078     -6.420   0.000      -.5995681   -.3185249
------------------------------------------------------------------------------
Since the dependent variable was a count variable, we could have tried analyzing
the data using poisson regression. We try analyzing the original variables using
poisson regression.
. poisson timedrs phyheal menheal stress

Iteration 0:   log likelihood = -2399.3092
Iteration 1:   log likelihood =  -2398.772
Iteration 2:   log likelihood =  -2398.772

Poisson regression                                Number of obs   =        465
                                                  LR chi2(3)      =    1307.64
                                                  Prob > chi2     =     0.0000
Log likelihood =  -2398.772                       Pseudo R2       =     0.2142

------------------------------------------------------------------------------
 timedrs |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 phyheal |   .1698972   .0065397     25.980   0.000       .1570797    .1827147
 menheal |   .0061833   .0044165      1.400   0.162      -.0024729    .0148395
  stress |    .001421    .000114     12.466   0.000       .0011976    .0016444
   _cons |   .7399455   .0428896     17.252   0.000       .6558835    .8240076
------------------------------------------------------------------------------
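Poisson coefficients are on the log scale, so they are easier to read once exponentiated into incidence-rate ratios. The sketch below does this by hand with the coefficients copied from the table, then asks Stata for the same quantities with the irr option. The interpretation as rate ratios is standard for Poisson regression but is not spelled out in the original text.

```stata
* Incidence-rate ratios are exp(coefficient); values copied from the table above.
display "phyheal IRR: " exp(.1698972)
display "menheal IRR: " exp(.0061833)
display "stress IRR:  " exp(.001421)

* Stata can report the same quantities directly.
poisson timedrs phyheal menheal stress, irr
```

For phyheal the ratio is about 1.19, so each additional point of phyheal goes with roughly 19 percent more expected visits, holding the other variables fixed.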
We can check to see if there is overdispersion in the poisson regression, by trying negative binomial regression.
. nbreg timedrs phyheal menheal stress

Fitting comparison Poisson model:
Iteration 0:   log likelihood = -2399.3092
Iteration 1:   log likelihood =  -2398.772
Iteration 2:   log likelihood =  -2398.772

Fitting constant-only model:
Iteration 0:   log likelihood = -1454.4125
Iteration 1:   log likelihood = -1453.0168
Iteration 2:   log likelihood = -1453.0165
Iteration 3:   log likelihood = -1453.0165

Fitting full model:
Iteration 0:   log likelihood = -1380.3758
Iteration 1:   log likelihood = -1362.5593
Iteration 2:   log likelihood = -1360.8911
Iteration 3:   log likelihood = -1360.8849
Iteration 4:   log likelihood = -1360.8849

Negative binomial regression                      Number of obs   =        465
                                                  LR chi2(3)      =     184.26
                                                  Prob > chi2     =     0.0000
Log likelihood = -1360.8849                       Pseudo R2       =     0.0634

------------------------------------------------------------------------------
 timedrs |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 phyheal |   .2253456   .0220181     10.235   0.000       .1821909    .2685003
 menheal |   .0113085   .0124366      0.909   0.363      -.0130667    .0356838
  stress |   .0017756   .0003299      5.382   0.000        .001129    .0024222
   _cons |   .3078912   .1249824      2.463   0.014       .0629302    .5528521
---------+--------------------------------------------------------------------
/lnalpha |  -.3159535   .0788172     -4.009   0.000      -.4704323   -.1614747
---------+--------------------------------------------------------------------
   alpha |   .7290933   .0574651                         .6247321     .850888
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chi2(1) = 2075.77   Prob > chi2 = 0.0000
The test of overdispersion (test of alpha=0) is significant, indicating that the
negative binomial model would be preferred over the poisson model.
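The test statistics in the nbreg output can be recomputed from the log likelihoods printed in the iteration logs. The sketch below does the arithmetic with display; the scalar names are illustrative. The results should reproduce the reported 2075.77, 184.26, 0.0634 and .7291 up to rounding.

```stata
* Log likelihoods copied from the Poisson and negative binomial output above.
scalar ll_pois = -2398.772
scalar ll_nb   = -1360.8849
scalar ll_0    = -1453.0165

* LR test of alpha = 0: twice the gain in log likelihood over the Poisson model.
display "LR chi2(1) for alpha=0: " 2*(ll_nb - ll_pois)

* Model LR chi2(3) and pseudo R2, both relative to the constant-only model.
display "LR chi2(3):             " 2*(ll_nb - ll_0)
display "Pseudo R2:              " 1 - ll_nb/ll_0

* alpha is the exponential of the estimated /lnalpha.
display "alpha:                  " exp(-.3159535)
```

The gain of more than a thousand log-likelihood points from one extra parameter is what drives the very large chi-square for the overdispersion test.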
This module illustrated some of the diagnostic techniques and remedies that can be used in regression analysis. The main problems shown here were problems of
non-normality and heteroscedasticity that could be mended using log and square
root transformations.
Appendix 1
The Crosstabs procedure forms two-way and multiway tables and provides a
variety of tests and measures of association for two-way tables. The structure of the table and whether categories are ordered determine what test or measure to
use.
Crosstabs’ statistics and measures of association are computed for two-way tables
only. If you specify a row, a column, and a layer factor (control variable), the
Crosstabs procedure forms one panel of associated statistics and measures for
each value of the layer factor (or a combination of values for two or more control
variables). For example, if GENDER is a layer factor for a table of MARRIED (yes, no) against LIFE (is life exciting, routine, or dull), the results for a two-way
table for the females are computed separately from those for the males and printed
as panels following one another.
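Crosstabs is not a Stata command name; the closest built-in analogue in Stata is tabulate. The lines below are a minimal sketch of the layered table described above, assuming a dataset with variables named gender, married and life, which are hypothetical here and taken from the example in the text.

```stata
* gender, married and life are the hypothetical variables from the example above.
* Two-way table with Pearson and likelihood-ratio chi-square, Cramer's V,
* gamma and Kendall's tau-b.
tabulate married life, chi2 lrchi2 V gamma taub

* One table and set of statistics per value of the layer variable.
bysort gender: tabulate married life, chi2 lrchi2 V

* Fisher's exact test is requested explicitly.
tabulate married life, exact
```

Unlike the behaviour described in this appendix, tabulate computes only the statistics that are named as options.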
Example. Are customers from small companies more likely to be profitable in
sales of services (for example, training and consulting) than those from larger companies? From a crosstabulation, you might learn that the majority of small
companies (fewer than 500 employees) yield high service profits, while the
majority of large companies (more than 2500 employees) yield low service
profits.
Statistics and measures of association. Pearson chi-square, likelihood-ratio
chi-square, linear-by-linear association test, Fisher’s exact test, Yates’ corrected
chi-square, Pearson’s r, Spearman’s rho, contingency coefficient, phi, Cramér’s V,
symmetric and asymmetric lambdas, Goodman and Kruskal’s tau, uncertainty
coefficient, gamma, Somers’ d, Kendall’s tau-b, Kendall’s tau-c, eta coefficient,
Cohen’s kappa, relative risk estimate, odds ratio, McNemar test, Cochran's and
Mantel-Haenszel statistics.
Chi-square. For tables with two rows and two columns, select Chi-square to calculate the Pearson chi-square, the likelihood-ratio chi-square, Fisher’s exact
test, and Yates’ corrected chi-square (continuity correction). For 2 × 2 tables,
Fisher’s exact test is computed when a table that does not result from missing
rows or columns in a larger table has a cell with an expected frequency of less than 5. Yates’ corrected chi-square is computed for all other 2 × 2 tables. For
tables with any number of rows and columns, select Chi-square to calculate the
Pearson chi-square and the likelihood-ratio chi-square. When both table variables are quantitative, Chi-square yields the linear-by-linear association test.
Correlations. For tables in which both rows and columns contain ordered values, Correlations yields Spearman’s correlation coefficient, rho (numeric data only).
Spearman’s rho is a measure of association between rank orders. When both table
variables (factors) are quantitative, Correlations yields the Pearson correlation coefficient, r, a measure of linear association between the variables.
Nominal. For nominal data (no intrinsic order, such as Catholic, Protestant, and
Jewish), you can select Phi (coefficient) and Cramér’s V, Contingency
coefficient, Lambda (symmetric and asymmetric lambdas and Goodman and
Kruskal’s tau), and Uncertainty coefficient.
Ordinal. For tables in which both rows and columns contain ordered values, select Gamma (zero-order for 2-way tables and conditional for 3-way to 10-way tables),
Kendall’s tau-b, and Kendall’s tau-c. For predicting column categories from row
categories, select Somers’ d.
Nominal by Interval. When one variable is categorical and the other is
quantitative, select Eta. The categorical variable must be coded numerically.
Kappa. For tables that have the same categories in the columns as in the rows (for
example, measuring agreement between two raters), select Cohen’s Kappa.
Risk. For tables with two rows and two columns, select Risk for relative risk estimates and the odds ratio.
McNemar. The McNemar test is a nonparametric test for two related dichotomous
variables. It tests for changes in responses using the chi-square distribution. It is useful for detecting changes in responses due to experimental intervention in
"before and after" designs.
Cochran’s and Mantel-Haenszel. Cochran’s and Mantel-Haenszel statistics can
be used to test for independence between a dichotomous factor variable and a
dichotomous response variable, conditional upon covariate patterns defined by one or more layer (control) variables. The Mantel-Haenszel common odds ratio is
also computed, along with Breslow-Day and Tarone's statistics for testing the
homogeneity of the common odds ratio.
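Stata offers separate commands for several of the measures listed in this appendix. The lines below are a minimal sketch of the ones that map most directly; every variable name is a hypothetical placeholder, and cs, cc and mcc expect variables coded 0/1.

```stata
* All variable names below are hypothetical placeholders.
* Cohen's kappa for two raters using the same categories.
kap rater1 rater2

* Risk ratio (cohort data) and odds ratio (case-control data) for 2 x 2 tables.
cs disease exposed
cc disease exposed

* McNemar's test for matched pairs (exposure of case, exposure of control).
mcc exposed_case exposed_control

* Mantel-Haenszel combined odds ratio across strata, with Breslow-Day and
* Tarone tests of homogeneity.
cc disease exposed, by(stratum) bd tarone
```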
References