Stata Step by Step

© Thierry Warin, 2006-2007

Setting up Stata

We are going to allocate 10 megabytes to the dataset. You do not want to allocate too much memory to the dataset, because the more memory you allocate to it, the less memory is available to perform the commands. Stata could slow down or even crash. (In Stata 12 and later, memory is managed automatically and this command is no longer needed.)

set mem 10m

We can also decide whether or not to have the “more” separation line on the screen when the software displays results:

set more on

set more off

Setting up a panel

Now we have to tell Stata that we have a panel dataset. We do it with the command tsset, or with iis and tis:

iis idcode

tis year

or

tsset idcode year

In the previous command, idcode is the variable that identifies individuals in our dataset, and year is the variable that identifies time periods. The individual identifier always comes first, followed by the time variable.

The commands referring to panel data in Stata almost always start with the prefix xt. You can check for these commands by calling the help file for xt.

help xt
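In Stata 10 and later, the same declaration can be made with xtset. The sketch below assumes the tutorial's data are close to the nlswork example dataset that ships with Stata (it has the idcode, year and ln_wage variables used here); substitute your own file if that assumption does not hold.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear

* Declare the panel: individual identifier first, time variable second
xtset idcode year

* Typing xtset alone redisplays the current panel settings
xtset
```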


You should describe and summarize the dataset as usual before you perform estimations. Stata has specific commands for describing and summarizing panel datasets.

xtdes

xtsum

xtdes lets you observe the pattern of the data, such as the number of individuals with each pattern of observations across time periods. In our case, we have an unbalanced panel because not all individuals have observations for all years.

The xtsum command gives you general descriptive statistics of the variables in the dataset, broken down into overall, between and within variation. Overall refers to the whole dataset.

Between refers to the variation of the individual means (each individual's mean across time periods). Within refers to the variation of each observation's deviation from the mean of the individual it belongs to.
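To make the between and within components concrete, here is a minimal sketch that rebuilds them by hand for ln_wage. It assumes the nlswork example dataset; grand_mean, ind_mean, within_dev and one_per_id are illustrative names, not part of the tutorial. The standard deviations it reports should be close to the between and within rows of xtsum.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* Overall mean of the variable
quietly summarize ln_wage
scalar grand_mean = r(mean)

* Between component: one mean per individual
bysort idcode: egen double ind_mean = mean(ln_wage)

* Within component: deviation from the individual mean,
* recentred on the overall mean
generate double within_dev = ln_wage - ind_mean + grand_mean

* Keep one observation per individual when summarizing the means
egen byte one_per_id = tag(idcode)
summarize ind_mean if one_per_id
summarize within_dev

* Compare with the built-in decomposition
xtsum ln_wage
```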

You may be interested in applying the panel data tabulate command to a variable, for instance to the variable south, in order to obtain a one-way table.

xttab south

As in the previous commands, Stata will report the tabulation for the overall

variation, the within and the between variation.

How to generate variables

Generating variables

gen age2=age^2

gen ttl_exp2=ttl_exp^2

gen tenure2=tenure^2


Now, let's compute the average wage for each individual (across time periods).

bysort idcode: egen meanw=mean(ln_wage)

In this case, we did not first apply the sort command and then the by prefix. We could have done so, but bysort combines both steps in a single command.

The command egen is an extension of the gen command to generate new variables. As a general rule, use egen when the new variable is created with one of the functions that Stata provides for egen.

In our case, we used the function mean.

You can apply the command list to list the first 10 observations of the new variable meanw.

list meanw in 1/10

And then apply the xtsum command to summarize the new variable.

xtsum meanw

You may want to obtain the average of the logarithm of wages for each year in the panel.

bysort year: egen meanw1=mean(ln_wage)

And then you can apply the xttab command.

xttab meanw1

Generating dates

Let's generate a date variable from a string variable (recent versions of Stata expect the mask in uppercase):

gen varname2 = date(varname1, "DMY")


And format:

format varname2 %d

How to generate dummies

Generating general dummies

Let's generate the dummy variable black, which is not in our dataset.

gen black=1 if race==2

replace black=0 if black==.
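The two steps above can also be written as a single logical expression. This is a sketch only: black_alt is a hypothetical name chosen so that it does not clash with black, and unlike the two-step version it leaves the dummy missing when race itself is missing instead of setting it to 0.

```stata
* A logical expression evaluates to 1 when true and 0 when false
* (black_alt is an illustrative name)
generate byte black_alt = (race == 2) if !missing(race)

* Check the result against the original variable
tabulate race black_alt, missing
```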

Suppose you want to generate a new variable called tenure1 that is equal to the variable tenure lagged one period. Then you would use the time series lag operator (l). First, you would need to sort the dataset according to idcode and year, and then generate the new variable with the "by" prefix on the variable idcode.

sort idcode year

by idcode: gen tenure1=l.tenure

If you were interested in generating a new variable tenure3 equal to the first difference of the variable tenure, you would use the time series d operator.

by idcode: gen tenure3=d.tenure

If you would like to generate a new variable tenure4 equal to the second lag of the variable tenure, you would type:

by idcode: gen tenure4=l2.tenure

The same principle would apply to the operator d.
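Once the panel has been declared, the time series operators already work within each individual, so the lags and differences can be generated without the by prefix. A minimal sketch, assuming the panel variables idcode and year; the tenure_l1, tenure_d1 and tenure_l2 names are illustrative:

```stata
* Declare the panel so that L. and D. respect individuals and time gaps
xtset idcode year

generate tenure_l1 = L.tenure     // first lag
generate tenure_d1 = D.tenure     // first difference
generate tenure_l2 = L2.tenure    // second lag

* Values are missing where the required earlier period is not observed
list idcode year tenure tenure_l1 tenure_d1 tenure_l2 in 1/10
```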

Let's just save our data file with the changes that we made to it.


save, replace

Another way would be to use the xi command. It takes the values (strings of letters, for instance) of a designated variable (category, for instance) and creates a dummy variable for each of them. By default xi omits the dummy for the first category; the following setting makes it omit the most prevalent category instead:

char _dta[omit] prevalent

xi i.category

tabulate category

Generating time dummies

Let's first generate our time dummies. We use the "tabulate" command with the option "gen" in order to generate a time dummy for each year of our dataset.

We will name the time dummies with the stem "y":

• we will get a first time dummy called "y1", which takes the value 1 if year=1980 and 0 otherwise,

• a second time dummy "y2", which takes the value 1 if year=1982 and 0 otherwise, and similarly for the remaining years.

You could give any other name to your time dummies.

tab year, g(y)


Running OLS regressions

Let's now turn to estimation commands for panel data.

The first type of regression that you may run is a pooled OLS regression, which is simply an OLS regression applied to the whole dataset. This regression ignores the fact that you observe different individuals across time periods, and so it does not account for the panel nature of the dataset.

reg ln_wage grade age ttl_exp tenure black not_smsa south

In the previous command, typing age refers only to the variable age; the squared term age2 generated earlier is not included. To include every variable whose name starts with age, type age* instead.

Suppose you want to observe the internal results saved in Stata associated with

the last estimation. This is valid for any regression that you perform. In order to

observe them, you would type:

ereturn list

If you want to control for some categorical variables, prefix the command with xi: and mark each of them with i.:

xi: reg dependent ind1 ind2 i.category1 i.category2 i.time

Let's perform a regression where only the variation of the means across

individuals is considered.

This is the between regression.

xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be


Running Panel regressions

In empirical work with panel data, you usually have to choose between two alternative regressions: fixed effects (or within, or least squares dummy variables, LSDV) estimation and random effects (or feasible generalized least squares, FGLS) estimation.

In panel data, in the two-way model, the error term is the sum of three components:

1. an individual-specific effect,

2. a time-specific effect,

3. and an additional idiosyncratic term.

In the one-way model, the error term is the sum of two components:

1. an individual-specific effect,

2. and an idiosyncratic term.

It is absolutely fundamental that the error term is not correlated with the independent variables.

• If there is no such correlation, then the random effects model should be used because it is more efficient: it is a weighted average of the between and within estimations.

• But if there is correlation between the individual and/or time effects and the independent variables, then the individual and time effects must be estimated as dummy variables (fixed effects model) in order to solve the endogeneity problem.

The fixed effects (or within) regression is an OLS regression of the form:

(yit - yi. - y.t + y..) = (xit - xi. - x.t + x..)B + (vit - vi. - v.t + v..)


where yi., xi. and vi. are the means of the respective variables (and the error) within each individual across time, y.t, x.t and v.t are the means of the respective variables (and the error) within each time period across individuals, and y.., x.. and v.. are the overall means of the respective variables (and the error).
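As an illustration of this transformation, the sketch below demeans the variables by hand and compares the result with xtreg, fe. It is a simplification under stated assumptions: it uses the one-way version (individual means only, no time means), the nlswork example dataset, a shortened list of regressors, and illustrative mean_ and dm_ variable names. The coefficients of the two regressions should coincide; the standard errors will not, because regress does not correct the degrees of freedom for the estimated individual means.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* Use the same observations for the means and for the regression
keep if !missing(ln_wage, age, ttl_exp, tenure)

* Subtract each individual's mean from every variable
foreach v of varlist ln_wage age ttl_exp tenure {
    bysort idcode: egen double mean_`v' = mean(`v')
    generate double dm_`v' = `v' - mean_`v'
}

* OLS on the demeaned data (no constant: it has been demeaned away)
regress dm_ln_wage dm_age dm_ttl_exp dm_tenure, noconstant

* Built-in within estimator for comparison
xtreg ln_wage age ttl_exp tenure, fe
```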

Choosing between Fixed effects and Random effects? The Hausman test

The generally accepted way of choosing between fixed and random effects is

running a Hausman test.

Statistically, fixed effects are always a reasonable thing to do with panel data (they give consistent results whether or not the individual effects are correlated with the independent variables), but they may not be the most efficient model to run. Random effects are a more efficient estimator and will give you smaller standard errors, so you should run random effects if it is statistically justifiable to do so.


The Hausman test checks a more efficient model against a less efficient but

consistent model to make sure that the more efficient model also gives consistent

results.
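For reference, the statistic behind this comparison is usually written as follows. The notation is introduced here for illustration and does not come from the original text: the hatted betas are the fixed effects and random effects coefficient vectors on the k regressors that both models estimate, and the V terms are their estimated covariance matrices.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \hat{V}_{FE} - \hat{V}_{RE} \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\sim\; \chi^2_k \quad \text{under } H_0
```

A large value of H means the two sets of coefficients differ by more than sampling error can explain, which casts doubt on the consistency of the random effects estimator.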

To run a Hausman test comparing fixed with random effects in Stata, you need to

first estimate the fixed effects model, save the coefficients so that you can

compare them with the results of the next model, estimate the random effects

model, and then do the comparison.

xtreg dependentvar independentvar1 independentvar2... , fe

estimates store fixed

xtreg dependentvar independentvar1 independentvar2... , re

estimates store random

hausman fixed random

The Hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If the difference is insignificant (P-value, Prob>chi2, larger than .05), then it is safe to use random effects. If you get a significant P-value, however, you should use fixed effects.
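Applied to concrete data, the sequence looks like the sketch below. It assumes the nlswork example dataset and a shortened list of regressors; the names fixed and random are just labels for the stored estimates.

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* Consistent estimator first
xtreg ln_wage age ttl_exp tenure not_smsa south, fe
estimates store fixed

* Efficient estimator second
xtreg ln_wage age ttl_exp tenure not_smsa south, re
estimates store random

* Compare the two sets of coefficients
hausman fixed random

* The test statistic, degrees of freedom and p-value are saved in r()
display "chi2 = " r(chi2) "   df = " r(df) "   p = " r(p)
```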

If you want a fixed effects model with robust standard errors, you can use the

following command:

areg ln_wage grade age ttl_exp tenure black not_smsa south, absorb(idcode) robust

You may be interested in running a maximum likelihood estimation of the random effects model. You would type:

xtreg ln_wage grade age ttl_exp tenure black not_smsa south, mle

If you qualify for a fixed effects model, should you include time effects?


Another important question when you are doing empirical work with panel data is whether or not to include time effects (time dummies) in your fixed effects model.

To test for the inclusion of time dummies in our fixed effects regression:

1. First, we run fixed effects including the time dummies. In the next fixed effects regression, the time dummies are written with the wildcard y* (see “Generating time dummies”), since a bare y is an ambiguous abbreviation; you could also type them all if you prefer. The wildcard also matches year, which is collinear with the full set of dummies, so Stata drops the redundant terms automatically.

xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, fe

2. Second, we apply the "testparm" command. It tests the null hypothesis that the coefficients on the time dummies are jointly equal to zero.

testparm y*

3. We reject the null hypothesis if the p-value is smaller than 10%; the time dummies are then jointly significant, and our fixed effects regression should include time effects.
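In Stata 11 and later, the same test can be run without creating the dummies by hand, using factor-variable notation. A minimal sketch, assuming the nlswork example dataset and a shortened list of regressors:

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear
xtset idcode year

* i.year adds one indicator per year (minus a base year) on the fly
xtreg ln_wage age ttl_exp tenure not_smsa south i.year, fe

* Joint test that all year indicators have zero coefficients
testparm i.year
```

This avoids any clash between the dummy names and the variable year.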

Fixed effects or random effects when time dummies are involved: a test

What if the inclusion of time dummies in our regression permitted us to use a random effects model for the individual effects?

(This question is not usually considered in typical empirical work; the purpose here is to show you an additional test for random effects in panel data.)

1. First, we will run a random effects regression including our time dummies (written with the wildcard y*),

xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, re


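2. and then we will apply the "xttest0" command, the Breusch and Pagan Lagrange multiplier test for random effects. Its null hypothesis is that the variance of the individual effects is zero, that is, that there are no random effects.

xttest0

3. If the p-value is smaller than 10%, we reject the null hypothesis: individual effects are present and pooled OLS is not appropriate. The choice between random and fixed effects is then made with the Hausman test.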


GMM estimations

Two additional commands that are very useful in empirical work are the Arellano and Bond estimator (GMM estimator) and the Arellano and Bover estimator (system GMM).

Both commands permit you to deal with dynamic panels (where you want to use lags of the dependent variable as independent variables) as well as with problems of endogeneity.

You may want to have a look at them. The commands are respectively "xtabond" and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check how it works, just type:

help xtabond

"xtabond2" is not a built-in command in Stata. If you want to look at it, you must first get it from the net (this is another feature of Stata: you can always get additional commands from the net). You type the following:

findit xtabond2

Then follow the links in the search results to install the command.

How does it work?

The xtabond2 command allows you to estimate dynamic models with either the difference GMM estimator or the system GMM estimator.

xtabond2 dep_variable ind_variables [if] [in], noleveleq gmm(list1, options1) iv(list2, options2) two robust small

1. When noleveleq is specified, the difference GMM estimator is used. Otherwise, if noleveleq is not specified, the system GMM estimator is used.


2. gmm(list1, options1):

• list1 is the list of the non-exogenous independent variables

• options1 may take the following values: lag(a b), eq(diff), eq(level), eq(both) and collapse

o lag(a b) means that for the equation in difference, the lagged values (in level) of each variable from list1, dated from t-a to t-b, will be used as instruments; whereas for the equation in level, the first differences dated t-a+1 will be used as instruments. If b is a period (.), it means b is infinite. By default, a=1 and b is infinite. Example: gmm(x y, lag(2 .)) ⇒ all the lags of x and y of at least two periods will be used as instruments. Example 2: gmm(x, lag(1 2)) gmm(y, lag(2 3)) ⇒ for variable x, the values lagged one and two periods will be used as instruments, whereas for variable y, the values lagged two and three periods will be used as instruments.

o Options eq(diff), eq(level) or eq(both) mean that the instruments must be used respectively for the equation in first difference, the equation in level, or for both. By default, the option is eq(both).

o Option collapse reduces the size of the instrument matrix and helps limit the bias that arises in small samples when the number of instruments is close to the number of observations. But it reduces the statistical efficiency of the estimator in large samples.

3. iv(list2, options2):

• list2 is the list of variables that are strictly exogenous, and options2 may take the following values: eq(diff), eq(level), eq(both), pass and mz.

o eq(diff), eq(level), and eq(both): see above

o By default, the exogenous variables are differenced to serve as instruments in the equations in first difference, and are used undifferenced to serve as instruments in the equations in level. The pass option prevents the exogenous variables from being differenced when they serve as instruments in the equations in first difference. Example: gmm(z, eq(level)) gmm(x, eq(diff) pass) allows you to use variable x in level as an instrument in the equation in level as well as in the equation in difference.

o Option mz replaces the missing values of the exogenous variables by zero, which makes it possible to include in the regression the observations whose data on exogenous variables are missing. This option impacts the coefficients only if the variables are exogenous.


4. Option two:

• This option specifies the use of the two-step GMM estimation. Although this two-step estimation is asymptotically more efficient, its standard errors tend to be biased in finite samples. To fix this issue, the xtabond2 command applies a finite-sample correction to the covariance matrix. So far, there is no test to know whether the one-step GMM estimator or the two-step GMM estimator should be used.

5. Option robust:

• This option makes the standard errors, and therefore the t-tests, robust to heteroskedasticity.

6. Option small:

• This option reports t-statistics instead of z-statistics.
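Putting the options together, a call might look like the sketch below. Everything specific in it is an assumption made for illustration: it uses the abdata example dataset available through webuse (variables n, w, k and the yr* year dummies), treats the lagged dependent variable as the only non-exogenous regressor, and installs xtabond2 from the SSC archive. It is not the specification used in the text.

```stata
* xtabond2 is user-written; install it once with:
* ssc install xtabond2

* Assumption: abdata is used purely as an example panel
webuse abdata, clear
xtset id year

* Difference GMM: noleveleq drops the level equation
xtabond2 n L.n w k yr*, gmm(L.n, lag(1 .)) iv(w k yr*) ///
    noleveleq twostep robust small

* System GMM: no noleveleq, with collapsed instruments
xtabond2 n L.n w k yr*, gmm(L.n, lag(1 .) collapse) iv(w k yr*) ///
    twostep robust small
```

Comparing the instrument counts reported by the two calls shows how collapse shrinks the instrument matrix.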
