DUMMY VARIABLES – INDEPENDENT & DEPENDENT DUMMY VARIABLES
Dummy variables are independent variables which take the value of either 0 or 1. Just as a "dummy"
is a stand-in for a real person, in quantitative analysis, a dummy variable is a numeric stand-in for a
qualitative fact or a logical proposition. For example, a model to estimate demand for electricity in a
geographical area might include the average temperature, the average number of daylight hours, the
the total square footage of structures, the number of businesses, the number of residences, and so
forth. It might be more useful, however, if the model could produce appropriate results for each
month or each season. Using the number of the month, such as 12 for December, would be silly,
because it implies that the demand for electricity differs greatly between December and January,
which is month 1. It also implies that winter occurs during the same months everywhere, which
would preclude the use of the model in the opposite hemisphere. Thus, another way to
represent qualitative concepts such as season, male or female, smoker or non-smoker, etc., is
required for many models to make sense.
In a regression model, a dummy variable with a value of 0 will cause its coefficient to disappear from
the equation. Conversely, the value of 1 causes the coefficient to function as a supplemental
intercept, because of the identity property of multiplication by 1. This type of specification in a linear
regression model is useful to define subsets of observations that have different intercepts and/or
slopes without the creation of separate models. In logistic regression models, encoding all of the
independent variables as dummy variables allows easy interpretation and calculation of the odds
ratios, and increases the stability and significance of the coefficients. Examples of these results are in
Section 3. In addition to the direct benefits to statistical analysis, representing information in the
form of dummy variables makes it easier to turn the model into a decision tool. Consider a risk
manager who needs to assign credit limits to businesses. The age of the business is almost always
significant in assessing risk. If the risk manager has to assign a different credit limit for each year in
business, it becomes extremely complicated and difficult to use because some businesses are several
hundred years old. Bivariate analysis of the relationship between age of business and default usually
yields a small number of groups that are far more statistically significant than each year evaluated
separately.
In regression analysis, a dummy variable (also known as indicator variable or just dummy) is one
that takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may
be expected to shift the outcome. For example, in econometric time series analysis, dummy variables
may be used to indicate the occurrence of wars, or major strikes. It could thus be thought of as a
truth value represented as a numerical value 0 or 1 (as is sometimes done in computer
programming).
The addition of dummy variables always increases model fit (coefficient of determination), but at a
cost of fewer degrees of freedom and loss of generality of the model. Too many dummy variables
result in a model that does not provide any general conclusions.
Dummy variables may be extended to more complex cases. For example, seasonal effects may be
captured by creating dummy variables for each of the seasons: D1=1 if the observation is for
summer, and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and
only if winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the
panel data fixed effects estimator dummies are created for each of the units in cross-sectional data
(e.g. firms or countries) or periods in a pooled time series. However, in such regressions either the
constant term must be removed, or one of the dummies must be removed, making it the base category
against which the others are assessed, for the following reason:
If dummy variables for all categories were included, their sum would equal 1 for all observations,
which is identical to and hence perfectly correlated with the vector-of-ones variable whose
coefficient is the constant term; if the vector-of-ones variable were also present, this would result in
perfect multicollinearity, so that the matrix inversion in the estimation algorithm would be
impossible. This is referred to as the dummy variable trap.
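The trap can be checked numerically. In this sketch (hypothetical quarterly data, numpy only), a design matrix containing the constant plus all four seasonal dummies is rank deficient, while dropping one dummy restores full column rank:

```python
import numpy as np

# Hypothetical data: a season index (0-3) for each of eight observations.
n = 8
seasons = np.array([0, 1, 2, 3, 0, 1, 2, 3])
D = np.eye(4)[seasons]                     # four seasonal dummies (one-hot rows)
const = np.ones((n, 1))                    # the "vector of ones"

X_trap = np.hstack([const, D])             # constant + all four dummies
X_ok = np.hstack([const, D[:, 1:]])        # constant + three dummies (season 0 as base)

# The full set is rank deficient because the four dummies sum to the constant column.
print(np.linalg.matrix_rank(X_trap))       # 4, not 5 -> perfect multicollinearity
print(np.linalg.matrix_rank(X_ok))         # 4 -> full column rank
```

Because `X_trap` has five columns but rank four, the normal equations cannot be inverted, which is exactly the dummy variable trap described above.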
Describing qualitative data
Not all of the data of interest to econometricians are quantitative. For instance, the gender of
individuals, whether they are married, the industry of firms, and countries or regions are all
considered qualitative. To include them in a regression, we use dummy variables. In
many cases, the information can be described as being true or false or the character present
or absent. In those cases, it is easy to set up a binary variable or dummy variable taking
values 0 and 1.
For instance, a male dummy is usually set to 1 when the individual is male and 0 when female,
while if we instead define a female dummy we would do the opposite. Either is clearer than an
ambiguous gender variable. The choice does not matter to the results, but it does matter to their interpretation!
Describing categories or ranges
Dummy variables are also useful to describe categories. Indeed, even if the variable is not
binary, if it takes a finite number of values then it can be described by a complete set of
dummy variables. For instance, if eye colour can be brown, blue, green or red, we can have
four dummy variables, one for each of these colours, taking the value 1 whenever an individual
has eyes of that colour. (The case is more complex for David Bowie, whose eyes appeared to be
two different colours.) Notice that summing all variables in a complete set
should give you 1 for all observations. This technique can also be useful for quantitative data
which you do not believe should be considered as one continuous variable. Dummy variables
are 'discrete' and 'qualitative' (e.g., male or female, in the labour force or not, working under
a collective or individual employment contract, renting or owning your home). Units of
measurement are ‘meaningless’. Normally 1 is assigned to the presence of some
characteristic or attribute; 0 for the absence of that characteristic or attribute.
A dummy variable for several ranges allows you to distinguish the effects of what you might
see as “thresholds”.
Example 1: in the Mincer equation, we often use dummy variables for high school dropouts,
high school graduates, etc.
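This bucketing can be sketched in Python (hypothetical cut-offs: dropout below 12 years of schooling, high school graduate from 12 to 15, college graduate 16 or more):

```python
import numpy as np

# Hypothetical years-of-schooling data, bucketed at the assumed thresholds.
years = np.array([9, 12, 16, 11, 14, 18])
group = np.digitize(years, bins=[12, 16])   # 0 = dropout, 1 = HS graduate, 2 = college

dropout = (group == 0).astype(int)          # one dummy per threshold group
hs_grad = (group == 1).astype(int)
college = (group == 2).astype(int)
print(list(dropout))   # [1, 0, 0, 1, 0, 0]
```

Each dummy then enters the Mincer regression in place of a single continuous schooling variable, so the coefficient on each group is the estimated threshold effect.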
Example 2: A regression model of labour market discrimination by gender:

Yi = β0 + β1Si + β2Gi + εi

where Yi = annual earnings
Si = years of education
Gi = 1 if the ith person is male
     0 if the ith person is female

There are no special estimation issues as long as the regression meets all the classical assumptions.
Only the nature of the independent variables has changed.

The expected salary of a female is: E(Yi | Si, Gi = 0) = β0 + β1Si

The expected salary of a male is: E(Yi | Si, Gi = 1) = (β0 + β2) + β1Si

since E(εi | Si, Gi) = 0. Testing for discrimination (i.e., H0: β2 = 0) is a test for a difference in the intercept terms.
Intercept shift
Dummy Variable Trap: Suppose we estimate the following:

Yi = β0 + β1Si + β2Fi + β3Mi + εi

where Fi = 1 if the ith person is female
     0 if the ith person is male
Mi = 1 if the ith person is male
     0 if the ith person is female

This is known as the 'Dummy Variable Trap'. We are including redundant information in the
regression. Suppose the sample looks like this:
[Figure: intercept shift — men: wage = (β0 + β2) + β1Si; women: wage = β0 + β1Si. The two lines share the slope β1 and differ by β2 in intercept.]
Constant  Fi  Mi
1         1   0
1         0   1
1         1   0
1         0   1
1         1   0
1         1   0
1         0   1
The problem is that the two dummies are a linear function of the constant (i.e., Fi+Mi = 1).
Perfect multicollinearity. Violates Assumption (6). Estimated coefficients and their standard
errors can’t be computed.
The solution is simple -- drop a dummy variable or the constant term.
Rule of Thumb: If you have 'm' categories, then use 'm-1' dummies.
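The rule of thumb can be sketched in plain Python (hypothetical contract data; the first category serves as the omitted baseline):

```python
# m = 3 categories -> m - 1 = 2 dummies, with the omitted category as baseline.
contracts = ["collective", "individual", "none", "collective", "none"]

categories = ["collective", "individual", "none"]
baseline = categories[0]                      # omitted category
dummies = {
    c: [1 if obs == c else 0 for obs in contracts]
    for c in categories if c != baseline      # only m - 1 dummies are created
}
print(dummies["individual"])  # [0, 1, 0, 0, 0]
print(dummies["none"])        # [0, 0, 1, 0, 1]
```

Observations with neither dummy equal to 1 belong to the baseline category, whose effect is absorbed by the constant term.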
Slope dummy variables: We could allow for differences in the returns to education by adding an
'interacted' variable:

Yi = β0 + β1Si + β2Gi + β3(Si × Gi) + εi

This is a more 'flexible' specification.

The expected salary of a female is: E(Yi | Si, Gi = 0) = β0 + β1Si

The expected salary of a male is: E(Yi | Si, Gi = 1) = (β0 + β2) + (β1 + β3)Si

We now have both a 'composite' intercept term and a 'composite' slope coefficient for males.
If β2 > 0, then the male regression line has a higher intercept.
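A minimal numerical sketch of this specification (synthetic data with hypothetical parameter values, numpy only): generate earnings with a gender-specific intercept and slope, then recover both by OLS on the interacted design.

```python
import numpy as np

# Synthetic earnings data; the "true" coefficients below are assumptions for illustration.
rng = np.random.default_rng(0)
n = 200
S = rng.uniform(8, 20, n)            # years of education
G = rng.integers(0, 2, n)            # 1 = male, 0 = female
Y = 10 + 1.5 * S + 4 * G + 0.5 * S * G + rng.normal(0, 1, n)

# OLS on constant, S, G and the interacted variable S*G.
X = np.column_stack([np.ones(n), S, G, S * G])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
# Female slope: b[1]; male slope: b[1] + b[3]; male intercept shift: b[2].
```

The estimates should land near the generating values (10, 1.5, 4, 0.5), so the male line has both a higher intercept and a steeper slope, as the 'flexible' specification allows.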
Using a set of dummy variables
What happens if we use a complete set of dummy variables?
The four dummies sum to one, hence we have perfect collinearity. The regression will not be
able to identify the coefficients properly. It is as if we had a single variable always equal to
one (like the intercept). One possible way out is to drop the intercept. Each dummy
coefficient will then be interpreted as the intercept for that specific group.
Another (more common) possibility is to drop one variable in the set. This will be the
baseline and the other dummy coefficients will read directly as the difference from this
baseline.
Example from Alesina, Algan, Cahuc and Giuliano (2009)
Dummy variables in R
By default, R will automatically remove the last dummy variable if you provide a complete
set. However, you are well-advised to do it yourself as this will help with the interpretation,
and also because other software may not be as kind. There are many methods to create
dummy variables from qualitative data.
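In Python, a rough analogue (assuming pandas is available) is `pandas.get_dummies`, where `drop_first=True` omits the first level as the baseline and so avoids the trap:

```python
import pandas as pd

# Hypothetical qualitative data.
df = pd.DataFrame({"region": ["north", "south", "east", "south"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies.columns.tolist())  # ['region_north', 'region_south'] ('east' is the baseline)
```

As with R, it is clearer to choose the omitted category yourself than to rely on the software's default ordering (here, alphabetical).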
Fixed effects
Dummy variables are also frequently used as fixed effects. Typically, we might add time fixed
effects to our regression to capture structural changes underlying it. For instance, this could
be a dummy variable for each year or each period (minus one).
In many cases, it is also useful to define a set of individual fixed effects to capture all
unobserved time-invariant individual characteristics. This can lead to a potentially large number of
dummy variables, which is usually not a problem for modern computers. However, you
must have several observations for each individual, or you will run out of degrees of freedom!
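Building both sets of fixed effects can be sketched as follows (a hypothetical balanced panel of 3 firms over 4 years, numpy only; one firm and one year are dropped as baselines):

```python
import numpy as np

# Hypothetical panel: 3 firms, each observed in 4 years (12 rows in total).
firms = np.repeat([0, 1, 2], 4)                  # firm id per row
years = np.tile([2001, 2002, 2003, 2004], 3)     # year per row

firm_fe = np.eye(3)[firms][:, 1:]                # firm dummies, first firm as baseline
year_fe = (years[:, None] == np.unique(years)[1:]).astype(int)  # year dummies minus one

X = np.hstack([np.ones((12, 1)), firm_fe, year_fe])
print(X.shape)                    # (12, 6): constant + 2 firm + 3 year dummies
print(np.linalg.matrix_rank(X))   # 6 -> no dummy variable trap
```

With only one observation per firm, the firm dummies alone would exhaust the degrees of freedom; the repeated observations are what make the fixed effects estimable.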
Dummy Dependent Variables Models
In this chapter we introduce models that are designed to deal with situations in which our
dependent variable is a dummy variable. That is, it assumes either the value 0 or the value 1.
Such models are very useful in that they allow us to address questions for which there is a
“yes or no” answer.
1. Linear Probability Model
In the case of a dummy dependent variable model we have:

Yi = β0 + β1Xi + εi

where Yi = 0 or 1 and E(εi) = 0.
What would happen if we simply estimated the slope coefficients of this model using
OLS? What would the coefficients mean? Would they be unbiased? Are they efficient?
A regression model in the situation where the dependent variable takes on the two
values 0 or 1 is called a linear probability model. To see its properties note the following.
a) Since the mean error is zero, we know that E(Yi | Xi) = β0 + β1Xi.
b) Now, if we define Pi = Pr(Yi = 1) and 1 − Pi = Pr(Yi = 0), then
E(Yi) = 1 · Pi + 0 · (1 − Pi) = Pi. Therefore, our model is Pr(Yi = 1) = β0 + β1Xi and the estimated
slope coefficients would tell us the impact of a unit change in that explanatory
variable on the probability that Yi = 1.
c) The predicted values from the regression model would provide
predictions, based on some chosen values for the explanatory variables, for the
probability that Yi = 1. There is, however, nothing in the estimation strategy that
would constrain the resulting predictions from being negative or larger than 1 —
clearly an unfortunate characteristic of the approach.
d) Since E(εi) = 0 and εi is uncorrelated with the explanatory variables (by assumption), it is
easy to show that the OLS estimators are unbiased. The errors, however, are
heteroscedastic. A simple way to see this is to consider an example. Suppose that the
dependent variable takes the value 1 if the individual buys a Rolex watch and 0 otherwise.
Also, suppose the explanatory variable is income. For low levels of income it is
likely that all of the observations are zeros. In this case, there would be no scatter
around the line. For higher levels of income there would be some zeros and some
ones. That is, there would be some scatter around the line. Thus, the errors would be
heteroscedastic. This suggests two empirical strategies. First, we know that the OLS
estimators are unbiased but that the usual standard errors are incorrect; we might
simply use OLS and then apply the White correction to produce correct standard
errors. Second, we might model the heteroscedasticity explicitly and estimate by weighted
least squares.
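The out-of-range problem in (c) is easy to demonstrate with a small sketch (hypothetical buy/income data, numpy only):

```python
import numpy as np

# Hypothetical data: buy (1/0) driven by income, fitted by OLS (the LPM).
income = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
buy = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(income), income])
b, *_ = np.linalg.lstsq(X, buy, rcond=None)

p_hat = X @ b                      # fitted "probabilities"
print(p_hat.min(), p_hat.max())    # note values below 0 and above 1
```

The fitted line predicts a negative "probability" at the lowest income and one above 1 at the highest, the unfortunate characteristic noted in (c).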
2. Logit and Probit Models
One potential criticism of the linear probability model (beyond those mentioned
above) is that the model assumes that the probability that Yi = 1 is linearly related to the
explanatory variable(s). We might, however, expect the relation to be nonlinear. For
example, increasing the income of the very poor or the very rich will probably have little
effect on whether they buy an automobile. It could, however, have a nonzero effect on other
income groups.
Two models that are nonlinear, yet provide predicted probabilities between 0 and 1,
are the logit and probit models. The difference between the linear probability model and the
nonlinear logit and probit models can be explained using an example. To motivate these
models, suppose that our underlying dummy dependent variable depends on an unobserved
("latent") utility index Yi*. For example, if the variable Yi is discrete, taking on the value 1 if
someone buys a car and 0 otherwise, then we can imagine a continuous variable Yi* that reflects a
person's desire to buy the car. It seems reasonable that Yi* would vary continuously with
some explanatory variable like income. More formally, suppose

Yi* = β0 + β1Xi + εi

and

Yi = 1 if Yi* > 0 (i.e., the utility index is "high enough")
Yi = 0 if Yi* ≤ 0 (i.e., the utility index is not "high enough")

Then:

Pr(Yi = 1) = Pr(Yi* > 0) = Pr(εi > −(β0 + β1Xi)) = F(β0 + β1Xi)

where the last step uses the symmetry of the error distribution.
Given this, our basic problem is selecting F — the cumulative distribution function of the error
term. It is here that the logit and probit models differ. As a practical matter, we are likely
interested in estimating the β's in the model. This is typically done using a Maximum
Likelihood Estimator (MLE). To outline the MLE in this context, recognize that each
outcome Yi has the density function f(Yi) = Pi^Yi (1 − Pi)^(1 − Yi). That is, each Yi takes on
the value 1 with probability Pi and the value 0 with probability 1 − Pi. Then the likelihood function
is:

L = ∏i Pi^Yi (1 − Pi)^(1 − Yi)

and

ln L = Σi [Yi ln Pi + (1 − Yi) ln(1 − Pi)]

which, given Pi = F(β0 + β1Xi), becomes

ln L = Σi [Yi ln F(β0 + β1Xi) + (1 − Yi) ln(1 − F(β0 + β1Xi))]
Analytically, the next step would be to take the partial derivatives of the log-likelihood function
with respect to the β's, set them equal to zero, and solve for the MLEs. This could be a very
messy calculation depending on the functional form of F. In practice, the computer will
solve this problem for us.
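As an illustration of this machinery, here is a sketch on synthetic data (numpy only, hypothetical true coefficients) that maximises the logit log likelihood by simple gradient ascent rather than the Newton-type iterations most packages use:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data generated from a known (assumed) logit model.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
true_b = np.array([-0.5, 1.2])
y = (rng.uniform(size=n) < sigmoid(X @ true_b)).astype(float)

# Maximise the log likelihood by gradient ascent.
b = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ b)
    b += 0.01 * (X.T @ (y - p)) / n   # gradient of the mean log likelihood
# b should now be close to true_b = (-0.5, 1.2)
```

The gradient X'(y − p) is exactly the first-order condition described above; setting it to zero by iteration recovers the MLEs.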
2.1. Logit Model
For the logit model we specify

Pr(Yi = 1) = F(β0 + β1Xi) = 1 / (1 + e^−(β0 + β1Xi))

It can be seen that as β0 + β1Xi → ∞, Pr(Yi = 1) → 1. Similarly, as β0 + β1Xi → −∞,
Pr(Yi = 1) → 0. Thus, unlike the linear probability model,
probabilities from the logit will be between 0 and 1. A complication arises in interpreting the
estimated β's. In a linear probability model, a coefficient measures the ceteris paribus
effect of a change in the explanatory variable on the probability that Y equals 1. In the logit model,
differentiating shows that

∂Pr(Yi = 1)/∂Xi = β1 F(β0 + β1Xi)[1 − F(β0 + β1Xi)]
Notice that the derivative is nonlinear and depends on the value of x. It is common to
evaluate the derivative at the mean of x so that a single derivative can be presented.
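For instance, with assumed (hypothetical) estimates b0 = −1.0 and b1 = 0.8 and a mean of x of 2.0, the derivative at the mean works out as:

```python
import math

# Hypothetical logit estimates and sample mean.
b0, b1 = -1.0, 0.8
x_mean = 2.0

# Marginal effect at the mean: dP/dx = b1 * F * (1 - F), with F the logistic c.d.f.
p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x_mean)))
marginal_effect = b1 * p * (1.0 - p)
print(round(marginal_effect, 3))   # ≈ 0.183
```

Note that the coefficient itself (0.8) overstates the effect on the probability; the marginal effect shrinks as the fitted probability moves toward 0 or 1.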
Odds Ratio
For ease of exposition, we write the above equation as Pi = 1 / (1 + e^−Zi), where Zi = β0 + β1Xi. To
avoid the possibility that the predicted values might be outside the probability interval of 0
to 1, we model the ratio Pi / (1 − Pi). This ratio is the likelihood, or odds, of obtaining a successful
outcome (the ratio of the probability that a family will own a car to the probability that it
will not own a car).
If we take the natural log of this ratio, we obtain

Li = ln[Pi / (1 − Pi)] = Zi = β0 + β1Xi

that is, L, the log of the odds ratio, is not only linear in X, but also linear in the parameters. L
is called the logit, and hence the name logit model.
The logit model cannot be estimated using OLS. Instead, we use the MLE discussed in the
previous section, an iterative estimation technique that is especially useful for equations
that are nonlinear in the coefficients. MLE is inherently different from least squares in that it
chooses the coefficient estimates that maximize the likelihood of the sample data set being
observed. Interestingly, OLS and MLE are not necessarily different; for a linear equation that
meets the classical assumptions (including the normality assumption), the MLEs are identical to
the OLS estimates.
Once the logit has been estimated, hypothesis testing and econometric analysis can
be undertaken in much the same way as for linear equations. When interpreting coefficients,
however, be careful to recall that they represent the impact of a one-unit increase in the
independent variable in question, holding the other explanatory variables constant, on the
log of the odds of a given choice, not on the probability itself. But we can always compute
the probability at a given level of the variable in question.
2.2. Probit Model
In the case of the probit model, we assume that εi ~ N(0, 1). That is, we assume the
error in the utility index model is normally distributed. In this case,

Pr(Yi = 1) = F(β0 + β1Xi)

where F is the standard normal cumulative distribution function:

F(z) = ∫_−∞^z (1/√(2π)) e^(−t²/2) dt

In practice, the c.d.f.s of the logit and the probit look quite similar to one another. Once
again, calculating the derivative is moderately complicated. In this case,

∂Pr(Yi = 1)/∂Xi = β1 f(β0 + β1Xi)

where f is the density function of the standard normal distribution. As in the logit case, the derivative
is nonlinear and is often evaluated at the mean of the explanatory variables. In the case of
dummy explanatory variables, it is common to estimate the derivative as the probability
when the dummy variable is 1 (other variables set to their mean) minus the probability
when the dummy variable is 0 (other variables set to their mean). That is, you simply
calculate how the predicted probability changes when the dummy variable of interest
switches from 0 to 1.
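A sketch of that discrete-change calculation for a dummy regressor (hypothetical coefficient values; the standard normal c.d.f. built from `math.erf`):

```python
import math

def normal_cdf(z):
    # Standard normal c.d.f. via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical probit: index = b0 + b1*x + b2*d, with a dummy regressor d.
b0, b1, b2 = -1.0, 0.05, 0.6
x_mean = 10.0

# Discrete-change effect of the dummy: P(d = 1) - P(d = 0), with x at its mean.
effect = normal_cdf(b0 + b1 * x_mean + b2) - normal_cdf(b0 + b1 * x_mean)
print(round(effect, 3))   # ≈ 0.231
```

The derivative-based formula would evaluate f at a single point instead; for a variable that only takes the values 0 and 1, the discrete change is the more natural quantity.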
Which Is Better? Logit or Probit
Fortunately, from an empirical standpoint, logits and probits typically yield very
similar estimates of the relevant derivatives. This is because the cumulative distribution
functions for the logit and probit are similar, differing slightly only in the tails of their
respective distributions. Thus, the derivatives are different only if there are enough
observations in the tail of the distribution. While the derivatives are usually similar, it is
important to remember that the parameter estimates associated with logit and probit models are
not. A simple approximation suggests that multiplying the logit estimates by 0.625 makes
the logit estimates comparable to the probit estimates.
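The 0.625 approximation can be eyeballed numerically: if probit coefficients are about 0.625 times logit coefficients, then the logistic c.d.f. Λ(z) should track Φ(0.625z) across a range of index values.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Compare Λ(z) with Φ(0.625 z) over a range of index values.
for z in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(z, round(logistic_cdf(z), 3), round(normal_cdf(0.625 * z), 3))
```

The two columns stay within a couple of hundredths of each other over this range; the gap only widens in the far tails, which is exactly where the text says the two models can differ.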
Example: We estimate the relationship between the openness of a country Y and a country’s
per capita income in dollars X in 1992. We hypothesize that higher per capita income should
be associated with free trade, and test this at the 5% significance level. The variable Y takes
the value of 1 for free trade, 0 otherwise.
Since the dependent variable is a binary variable, we set up the index function

Yi* = β0 + β1Xi + εi

with Yi = 1 if Yi* > 0 (open); Yi = 0 if Yi* ≤ 0 (not open).
Probit estimation gives the following results:
Dependent Variable: Y
Method: ML - Binary Probit (Quadratic hill climbing)
Date: 05/27/04 Time: 13:54
Sample(adjusted): 1 20
Included observations: 20 after adjusting endpoints
Convergence achieved after 7 iterations
Covariance matrix computed using second derivatives
Variable               Coefficient   Std. Error   z-Statistic   Prob.
C                      -1.994184     0.824708     -2.418048     0.0156
X                       0.001003     0.000471      2.129488     0.0332

Mean dependent var      0.500000     S.D. dependent var       0.512989
S.E. of regression      0.337280     Akaike info criterion    0.886471
Sum squared resid       2.047636     Schwarz criterion        0.986045
Log likelihood         -6.864713     Hannan-Quinn criter.     0.905909
Restr. log likelihood  -13.86294     Avg. log likelihood     -0.343236
LR statistic (1 df)     13.99646     McFadden R-squared       0.504816
Probability(LR stat)    0.000183
The slope coefficient is significant at the 5% level.
The interpretation in a probit model: β1 is the effect of X on the latent index Yi*. The
marginal effect of X on Pr(Yi = 1) is easier to interpret and is given by β1 f(β0 + β1X),
where f is the standard normal density.
To test the fit of the model (analogous to R-squared), the maximized log-likelihood value
(ln L) can be compared to the maximized log likelihood ln L0 of a model with only a constant
in the likelihood ratio index

LRI = 1 − ln L / ln L0
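Plugging the two log likelihoods reported in the probit output above into this index reproduces the McFadden R-squared EViews prints:

```python
# Likelihood ratio index computed from the probit output above.
lnL_full = -6.864713      # "Log likelihood"
lnL_restr = -13.86294     # "Restr. log likelihood" (constant-only model)
pseudo_r2 = 1.0 - lnL_full / lnL_restr
print(round(pseudo_r2, 4))   # ≈ 0.5048, the reported McFadden R-squared
```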
Logit estimation gives the following results:
Dependent Variable: Y
Method: ML - Binary Logit (Quadratic hill climbing)
Date: 05/27/04 Time: 14:12
Sample(adjusted): 1 20
Included observations: 20 after adjusting endpoints
Convergence achieved after 7 iterations
Covariance matrix computed using second derivatives
Variable               Coefficient   Std. Error   z-Statistic   Prob.
C                      -3.604995     1.681068     -2.144467     0.0320
X                       0.001796     0.000900      1.995415     0.0460

Mean dependent var      0.500000     S.D. dependent var       0.512989
S.E. of regression      0.333745     Akaike info criterion    0.876647
Sum squared resid       2.004939     Schwarz criterion        0.976220
Log likelihood         -6.766465     Hannan-Quinn criter.     0.896084
Restr. log likelihood  -13.86294     Avg. log likelihood     -0.338323
LR statistic (1 df)     14.19296     McFadden R-squared       0.511903
Probability(LR stat)    0.000165
As you can see from the output, the slope coefficient is significant at the 5% level.
The coefficients are proportionally higher in absolute value than in the probit model, but the
marginal effects and significance should be similar.
The marginal effect, evaluated at the mean of X, can be interpreted as the effect of GDP per capita on the expected value of Y.