
  Econometric pills

    Causality and correlation

    The notion of ceteris paribus, that is, holding all other (relevant) factors fixed, is crucial to

    establish a causal relationship. Simply finding that two variables are correlated is rarely enough to

    conclude that a change in one variable causes a change in another. This result is due to the nature of

    economic data: rarely can we run a controlled experiment that allows a simple correlation analysis

    to uncover causality. Instead, we can use econometric methods to effectively hold other factors

    fixed.

    Because economic variables are properly interpreted as random variables, we should use ideas from

    probability to formalize the sense in which a change in w causes a change in y. If we focus on the

    average, or expected, response, a ceteris paribus analysis entails estimating E(y | w, c), the

    expected value of y conditional on w and c. The vector c denotes a set of control variables that we

    would like to explicitly hold fixed when studying the effect of w on the expected value of y. The

    reason we control for these variables is that we think w is correlated with other factors that also

    influence y. If w is continuous, interest centers

    on ∂E(y | w, c)/∂w, which is usually called the partial effect of w on E(y | w, c). If w is discrete,

    we are interested in E(y | w, c) evaluated at different values of w, with the elements of c fixed at

    the same specified values.
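As a toy numeric illustration of the partial effect: suppose the conditional mean happens to be linear, E(y | w, c) = b0 + b1·w + b2·c (the coefficient values below are invented purely for the sketch). Then ∂E(y | w, c)/∂w is just b1, which a finite difference with c held fixed recovers:

```python
# Hypothetical linear conditional mean E(y | w, c) = b0 + b1*w + b2*c.
# The coefficient values are made up for illustration only.
b0, b1, b2 = 2.0, 0.5, -1.0

def cond_mean(w, c):
    """Expected value of y given w and the control c."""
    return b0 + b1 * w + b2 * c

# Partial effect of w on E(y | w, c): a finite difference with c held fixed.
c_fixed = 3.0
h = 1e-6
partial = (cond_mean(1.0 + h, c_fixed) - cond_mean(1.0, c_fixed)) / h
print(round(partial, 4))  # 0.5, i.e. exactly b1 in a linear model
```

In a linear model the partial effect is constant; in a nonlinear one the same finite difference would depend on the point (w, c) at which it is evaluated.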

    The linear regression model

    To develop an econometric model, let's start with a deterministic model: C = α + βY, where C is consumption, Y is income, and α and β are two parameters to be estimated. We know that the

    relationship we have just written is not deterministic in the real world, but some fluctuations can

    occur that disturb it. We then add a stochastic component to take into account the uncertainty

    incorporated in many economic variables: C = α + βY + ε, where ε is a random disturbance. When we want to investigate such a theoretical relation, we estimate a model of the form:

    y_i = α + β x_i + ε_i, where y is the dependent variable, x is the independent or explanatory variable, and i = 1, …, n is

    the index of the n observations in our sample.

    To complete our model, on top of the linear hypothesis, we add some assumptions:

    Zero mean of the disturbances: E[ε_i] = 0 for all i.

    Homoskedasticity: Var[ε_i] = σ², constant for all i.

    No autocorrelation: Cov[ε_i, ε_j] = 0 if i ≠ j.

    No correlation between the regressor and the disturbance: Cov[x_i, ε_j] = 0 for all i and j.

    Normality of the residuals: ε_i ~ N(0, σ²).

    In graphic terms, when we run a regression we get something very close to the following picture:

    Figure 1

    So, in our regression model, the parameter α captures the intercept of the function that represents

    the relationship between the dependent variable and the regressor, while β is the linear coefficient.

    So, when x increases by one, y increases by β in the linear regression model. β captures the marginal

    effect of x on y. The same concept holds true when we turn to a linear regression with multiple

    explanatory variables:

    y_i = α + β x_i + γ z_i + δ w_i + ε_i. Here the regressors are x, z and w. The parameters β, γ and δ capture the partial effects of each of

    these regressors on y, holding the others constant.
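The simple regression case can be sketched numerically. The snippet below simulates a consumption-income style relation with invented parameters (α = 2, β = 0.8) and recovers them with the textbook OLS formulas b = S_xy / S_xx and a = ȳ − b·x̄:

```python
import random

random.seed(0)

# Simulate y = alpha + beta*x + eps; alpha = 2 and beta = 0.8 are
# illustrative values, not taken from the text.
alpha, beta, n = 2.0, 0.8, 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [alpha + beta * xi + random.gauss(0, 1) for xi in x]

# OLS estimates: slope b = S_xy / S_xx, intercept a = ybar - b * xbar.
xbar = sum(x) / n
ybar = sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = ybar - b * xbar

print(round(a, 2), round(b, 2))  # close to the true 2 and 0.8
```

With 200 observations and unit-variance noise, the estimates land near the true parameters; shrinking the sample or inflating the noise widens the gap, which is exactly what the testing section below quantifies.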

    Testing and significance

    Coefficients can be estimated, but, given the presence of disturbances, we cannot be sure that the

    underlying real parameters are of the same magnitude and sign as we have hypothesized in our

    model. Think of Figure 1: the estimated coefficient may be close to one, but can we be sure that the

    true β is indeed 1?

    To test this, we employ statistical inference and the tools of hypothesis testing.

    Start with β, the true coefficient of the relation we are studying. After estimation, we obtain b, the

    estimated coefficient, and s², the estimated variance of the error terms ε. We can also get the

    sample variance of x: it is defined as S_xx = Σ_i (x_i − x̄)², where x̄ is the sample mean of x. We can then obtain the estimate of the sample variance of b as Var[b] = s² / S_xx. Taking the square root of the estimated variance of b, we get s_b, the standard error of the estimate b. So s_b = s / √S_xx.

    It can be shown that (b − β) / s_b ~ t(n − 2): the ratio of the difference between the estimated parameter

    and its true value to the estimated standard error is distributed as a Student-t distribution with (n − 2)

    degrees of freedom.

    In practice, suppose we have a sample of n = 10 and estimate b = 1.96 and s_b = 0.384. If we

    hypothesize that the true β is, say, equal to 1, we will compute (1.96 − 1) / 0.384 = 2.5 to test H0: β = 1

    against H1: β > 1. If H0 holds, the ratio just calculated is distributed as a t-distribution with 8 degrees of freedom. Suppose that, before discarding our hypothesized level, we want to be

    confident at the 95% level that our estimated b is different from 1. We then want to check whether

    the value 2.5 exceeds the 95th percentile of the Student's t distribution. There are tables that show

    the values taken by the t-distribution at different confidence levels. For 8 degrees of freedom, the

    critical value at the 95% confidence level is 1.860. Therefore, we reject the null hypothesis

    H0: β = 1 and accept H1: β > 1. Our baseline hypothesis probably has to be revised. In standard regression analysis, this procedure is employed to test whether the estimated

    coefficients are different from zero. The relevant test statistic is therefore t = b / s_b. Since we are

    testing the null of β equal to zero against the two-sided hypothesis that it is either greater or smaller,

    we take the ratio in absolute value and look at the (1 − α/2)-th percentile in the tables. We reject the null

    H0: β = 0 if |b / s_b| > t(α/2) and say that the coefficient is statistically significant. In

    general, in large samples, the value of 1.96, which applies at the 95 percent confidence level,

    is used as a benchmark value when tables are not available.

    Finally, note that you can infer the t-test value even if you are given only standard errors as

    post-estimation outputs in papers: simply apply the formula and divide the estimated coefficient by the

    displayed standard errors to get the t and check the significance of each variable.
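The worked example above can be reproduced in a few lines; the one-sided critical value for t(8) is taken from standard tables (≈1.86):

```python
# Reproduce the worked test from the text: b = 1.96, s_b = 0.384,
# H0: beta = 1 against H1: beta > 1, with n = 10 (so 8 degrees of freedom).
b, s_b, beta_null = 1.96, 0.384, 1.0

t_stat = (b - beta_null) / s_b   # (1.96 - 1) / 0.384 = 2.5
crit_95_df8 = 1.860              # one-sided 95% critical value for t(8), from tables

reject = t_stat > crit_95_df8
print(round(t_stat, 2), reject)  # 2.5 True -> reject H0: beta = 1

# The routine zero-null test uses |b / s_b| against roughly 1.96 in large samples:
t_zero = abs(b / s_b)
print(round(t_zero, 2), t_zero > 1.96)  # 5.1 True -> b is significant
```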

    Endogeneity and Instrumental Variables Estimation

    Suppose we have a model of the form: y_i = α + β1 x1_i + β2 x2_i + β3 x3_i + ε_i, but one of the assumptions of the linear regression model is violated for one regressor. Namely, one regressor is correlated with

    the disturbances: Cov[x3, ε] ≠ 0. We can then say that, while x1 and x2 are exogenous, x3 is potentially endogenous in our model.

    The violation of the assumption of no correlation with the residual for one regressor disrupts the

    validity of the whole model: in other words, OLS cannot consistently estimate any parameter in our

    model if there is correlation between one regressor and the disturbances.

    The correlation itself may be due to the presence in the real world of an omitted variable that has

    not been included in our model. This variable may be correlated with x3, so that the coefficient of

    the latter captures both the effects of x3 on the dependent variable and the effects of the omitted

    variable on x3. Note also that the omitted variable in question may also be the dependent variable in

    our model: there are often cases of reverse causation, where one regressor is actually influenced by

    the values taken by the dependent variable.

    The method of the Instrumental Variables is widely used to solve the problems caused by the

    endogeneity of one regressor. In practice, we must find an observable variable z that is correlated

    with x3, but uncorrelated with the disturbances in our model (Cov[z, ε] = 0). The correlation with x3 must, on the other hand, be conditional on the exogenous regressors: it

    must not be a simple correlation, but a significant partial effect of z on x3 once the effects of the

    other covariates have been netted out!

    More formally, the condition requires that the linear projection of x3 on all the exogenous variables

    has the coefficient of z significantly different from zero:

    x3 = δ0 + δ1 x1 + δ2 x2 + θ z + r, with θ ≠ 0. Here, r is a random disturbance that fulfils the linear regression assumptions.

    If the two conditions outlined above are respected, we can say that z is a valid instrument for x3. It is

    important to stress, though, that the full list of instrumental variables is the list of all the exogenous

    regressors in the base equation plus the instrument z.

    Once we have identified z, consistent OLS estimates for our model can be retrieved, substituting the

    linear projection of x3 into the original equation.

    Closely related to the general instrumental variable technique (and the one mostly used in practice)

    is two-stage least squares (2SLS) estimation. The procedure consists in first running a

    regression of x3 on the full set of instrumental variables. We obtain the estimates of δ0, δ1, δ2 and θ. The estimated coefficients must then be multiplied by the observed variables' values to obtain the

    projected value of x3:

    x̂3 = δ̂0 + δ̂1 x1 + δ̂2 x2 + θ̂ z

    This projected value (which is a function of the instrumental variables only) can be plugged into the

    original equation, substituting the original x3.

    y_i = α + β1 x1_i + β2 x2_i + β3 (δ0 + δ1 x1_i + δ2 x2_i + θ z_i) + u_i

    where u_i = ε_i + β3 r_i. Rearranging the terms of the equation above, we get:

    y_i = (α + β3 δ0) + (β1 + β3 δ1) x1_i + (β2 + β3 δ2) x2_i + (β3 θ) z_i + u_i, that is, y_i = λ0 + λ1 x1_i + λ2 x2_i + λ3 z_i + u_i

    This last equation can be consistently estimated by OLS. The important thing is that it can be shown

    that not only can we get the λ's, but also the original coefficients of interest: α, β1, β2 and β3. In other

    words, the assumptions made in the instrumental variables approach solve the identification

    problem of the original coefficient.
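A minimal numerical sketch of the two stages, simplified to a single endogenous regressor x and one instrument z (all parameter values invented): OLS on the raw x is inconsistent because an unobserved confounder u drives both x and y, while regressing y on the first-stage projection x̂ recovers the true β:

```python
import random

random.seed(1)

# Illustrative setup: true beta = 2, instrument z, confounder u.
n, beta = 5000, 2.0
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]  # unobserved, omitted from the model
x = [0.8 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]   # x correlated with u
y = [beta * xi + ui + random.gauss(0, 1) for xi, ui in zip(x, u)]  # u also drives y

def ols_slope(xs, ys):
    """Slope of a simple OLS regression of ys on xs (with intercept)."""
    xb = sum(xs) / len(xs)
    yb = sum(ys) / len(ys)
    sxy = sum((a - xb) * (c - yb) for a, c in zip(xs, ys))
    sxx = sum((a - xb) ** 2 for a in xs)
    return sxy / sxx

ols = ols_slope(x, y)  # biased upward: x is endogenous

# Stage 1: project x on the instrument z; Stage 2: regress y on the projection.
delta = ols_slope(z, x)
xbar = sum(x) / n
zbar = sum(z) / n
x_hat = [xbar + delta * (zi - zbar) for zi in z]
iv = ols_slope(x_hat, y)  # consistent for beta

print(round(ols, 2), round(iv, 2))  # OLS lands above 2, 2SLS near 2
```

In the one-regressor case this two-step procedure collapses to the classic IV ratio Cov(z, y) / Cov(z, x); with exogenous regressors x1 and x2 present, both stages would simply include them as additional covariates.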

    Duration analysis and the Cox hazard model

    Sometimes we are interested in the duration of particular events, so we try to study the time

    elapsed until a certain event occurs (or a certain state is abandoned). Thus, we are interested in how

    some characteristics affect survival times. Many studies focus on the probability of exiting the

    initial state within a short interval, conditional on having survived up to the starting time of the

    interval. An approximation of this probability is given by the hazard function.

    Suppose T is the time at which a person leaves the initial state. For example, if the initial state is

    unemployment, T would be the time until a person becomes employed.

    The cumulative distribution function (cdf) of T is defined as

    F(t) = P(T ≤ t), t ≥ 0. The survivor function is defined as S(t) = P(T > t) = 1 − F(t) and is the probability of surviving (not exiting the state) past time t.

    For h > 0, the probability of leaving the initial state in the interval [t, t + h) given survival up to time t is P(t ≤ T < t + h | T ≥ t).

    The hazard function for T is defined as:

    λ(t) = lim(h→0) P(t ≤ T < t + h | T ≥ t) / h

    So, for each t, λ(t) is the instantaneous rate of leaving per unit of time. An important class of hazard functions is that of proportional hazard models. A proportional

    hazard model takes the form:

    λ(t; x) = k(x) λ0(t), where k(x) > 0 is a nonnegative function of x and λ0(t) > 0 is the baseline hazard. The baseline hazard is common to all individuals; so each individual proportionally differs from the others

    according to the different characteristics captured by the term k(x). If we impose k(x) = exp(xβ), where β is a vector of parameters, we can transform the equation of the hazard model into:

    log λ(t; x) = xβ + log λ0(t). This model is named the Cox hazard model, since Cox (1972) first designed the procedure to

    correctly estimate its β. Its results are based on the fact that, in most cases, we are not interested

    in the baseline hazard, but want to focus on the effects of the covariates x on the hazard function.
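Two defining properties of the proportional hazard form can be checked numerically with an invented Weibull-type baseline λ0(t) and a scalar k(x) = exp(xβ) (the baseline and β below are made up for the sketch): the hazard ratio between two individuals does not depend on t, and the log hazard is linear in x:

```python
import math

# Toy proportional hazard: baseline lambda0(t) = 1.5 * sqrt(t) (Weibull-type)
# and k(x) = exp(x * beta) with an illustrative scalar beta = 0.7.
beta = 0.7

def baseline(t):
    """Baseline hazard, common to all individuals."""
    return 1.5 * math.sqrt(t)

def hazard(t, x):
    """Proportional hazard lambda(t; x) = k(x) * lambda0(t)."""
    return math.exp(x * beta) * baseline(t)

# The hazard ratio between two individuals is constant over time ...
ratio_early = hazard(0.5, x=2.0) / hazard(0.5, x=1.0)
ratio_late = hazard(4.0, x=2.0) / hazard(4.0, x=1.0)
print(abs(ratio_early - ratio_late) < 1e-9)  # True: both equal exp(beta)

# ... and the log hazard is linear in x, shifted by the log baseline:
lhs = math.log(hazard(3.0, x=1.5))
rhs = 1.5 * beta + math.log(baseline(3.0))
print(abs(lhs - rhs) < 1e-9)  # True
```

This constancy of hazard ratios is what lets Cox's partial likelihood estimate β without ever specifying λ0(t).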

    References:

    Greene, 1991, Econometric Analysis, Maxwell Macmillan, ch. 5.

    Wooldridge, 2001, Econometric Analysis of Cross Section and Panel Data, MIT Press, ch. 1.