CLSRN Summer School 2011 Professor Jeffrey Smith Department … · 2011-07-12 · CLSRN Summer School 2011 . Professor Jeffrey Smith . Department of Economics . University of Michigan

CLSRN Summer School 2011

Professor Jeffrey Smith

Department of Economics University of Michigan [email protected]

Lectures 2: Regression, Matching and Weighting Estimators

Montreal June 28, 2011


Outline
Selection on observed variables
Four ways to think about / implement selection on observed variables
Parametric linear regression
Exact and inexact matching
Matching estimators
Non-parametric regression estimators
Weighting estimators

Outline (continued)
Balance
Bandwidth selection
Variance estimation
Common support
Sensitivity analysis
Dynamic treatment effects
Variations
Conclusions

Basic assumption
In the parametric linear regression context, if

$$Y = \beta_0 + \beta_X X + \beta_D D + U$$

then assume that $E(U \mid X, D) = 0$.

In the context of matching, weighting or non-parametric regression, assume that

$$Y_0 \perp D \mid X$$

These assumptions go by many names:
Economics: selection on observed variables, conditional independence
Statistics: unconfoundedness, ignorability

What the basic assumption means
The conditional independence assumption (CIA) states that treatment status is random conditional on some set of observed X variables.
In this sense, selection on observables is analogous to an experiment, in which treatment status is unconditionally random.
The X are “exogenous” in a particular sense; more on this later.
The CIA will be satisfied if X includes all of the variables that affect both (not either, but both) participation and outcomes.
This is a very strong assumption, and one that demands a lot of the data! More on this in Lecture 3.

Four ways to implement / think about selection on observed variables
1. Parametric linear regression
2. Matching in the strict sense
3. Local linear regression
4. Weighting
The recent literature shows how these relate and compares their finite sample and asymptotic behavior.
Economics has tended to favor 3 and 4 while statistics favors 2.
All four types can actually be written as weighting estimators in the literal sense.

Parametric linear regression estimand
What parameter does a parametric linear regression of the outcome on X and D estimate?
Key point: it is not the ATET!
Angrist and Pischke (2009) work out the case where there are discrete X, so that the parametric model is:

$$Y_i = \sum_x d_{ix} \alpha_x + \delta_R D_i + \varepsilon_i$$

One way to think about the issue is as an omitted variables problem. This model omits the interactions between D and X.

Parametric linear regression estimand (continued)
Angrist and Pischke (2009) show that the estimand in this case is

$$\delta_R = \frac{E\left(\sigma_D^2(X)\,\delta_X\right)}{E\left(\sigma_D^2(X)\right)}$$

This is a variance (of the treatment indicator) weighted average of the treatment effects for subgroups defined by the discrete X.
Note that the variance of the treatment indicator is maximized when $\Pr(D = 1 \mid X) = 0.5$.
In contrast, the ATET puts the highest weight on the subgroup mean impacts for subgroups with the highest conditional probabilities of treatment.
In a common effect world, this is all moot.
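The contrast between the regression estimand and the ATET can be checked numerically. The sketch below is only an illustration under assumed data: three cells with made-up sizes, treatment shares and cell-specific impacts, and hypothetical function names. With a full set of cell dummies, the OLS coefficient on D reproduces the variance-weighted average of the cell impacts, while the ATET weights each cell by its number of treated units.

```python
import numpy as np

def ols_treatment_coef(y, d, cell):
    """Coefficient on d from OLS of y on a full set of cell dummies plus d."""
    cells = np.unique(cell)
    dummies = (cell[:, None] == cells[None, :]).astype(float)
    design = np.column_stack([dummies, d])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[-1]

def weighted_cell_effects(y, d, cell):
    """Return (variance-weighted, treated-weighted) averages of cell impacts."""
    num_var = den_var = num_att = den_att = 0.0
    for c in np.unique(cell):
        in_cell = cell == c
        n = in_cell.sum()
        p = d[in_cell].mean()                      # share treated in the cell
        delta = y[in_cell & (d == 1)].mean() - y[in_cell & (d == 0)].mean()
        num_var += n * p * (1 - p) * delta         # weight: n * var(D | X)
        den_var += n * p * (1 - p)
        num_att += n * p * delta                   # weight: number of treated
        den_att += n * p
    return num_var / den_var, num_att / den_att

# Hypothetical data: 3 cells of 10 units with 2, 5 and 8 treated units
cell = np.repeat([0, 1, 2], 10)
d = np.concatenate([np.r_[np.ones(k), np.zeros(10 - k)] for k in (2, 5, 8)])
effect = np.array([1.0, 2.0, 4.0])[cell]           # cell-specific impacts
y = 0.5 * cell + effect * d                        # no noise, for clarity

delta_r = ols_treatment_coef(y, d, cell)
var_weighted, atet = weighted_cell_effects(y, d, cell)
print(delta_r, var_weighted, atet)
```

In this made-up example the cell with the largest impact also has the highest treatment share, so the ATET exceeds the regression coefficient.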

Exact / cell matching
The simplest method of matching compares persons with exactly the same values of the observed variables X. For obvious reasons, this method works only with discrete X.
Suppose that X takes on 10 discrete values, $k = 1, \ldots, 10$. Let $n_{1k}$ be the number of treated persons with each value (in each cell) and $n_{0k}$ be the number of untreated persons in each cell.

The impact estimator is then

$$\sum_{k=1}^{10} \frac{n_{1k}}{\sum_{k} n_{1k}} \left[ \sum_{i \in k \cap \{D_i = 1\}} \frac{Y_i}{n_{1k}} - \sum_{j \in k \cap \{D_j = 0\}} \frac{Y_j}{n_{0k}} \right]$$

In words, calculate a mean difference in each cell and take a weighted average of the mean differences using the fraction of treated observations in each cell as the weights.
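The cell matching calculation can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and made-up data, not a production implementation; it raises an error when a cell containing treated units has no untreated units, which is the support problem in its simplest form.

```python
import numpy as np

def cell_matching_atet(y, d, cell):
    """Weighted average of within-cell mean differences.

    Weights are each cell's share of all treated observations.
    """
    total_treated = (d == 1).sum()
    estimate = 0.0
    for c in np.unique(cell[d == 1]):              # only cells with treated units
        treated = (cell == c) & (d == 1)
        untreated = (cell == c) & (d == 0)
        if not untreated.any():
            raise ValueError(f"cell {c} has treated but no untreated units")
        diff = y[treated].mean() - y[untreated].mean()
        estimate += treated.sum() / total_treated * diff
    return estimate

# Hypothetical data: two cells, three treated units in total
y = np.array([3.0, 1.0, 5.0, 5.0, 2.0])
d = np.array([1, 0, 1, 1, 0])
cell = np.array([0, 0, 1, 1, 1])
print(cell_matching_atet(y, d, cell))
```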

Curse of dimensionality
If there are several discrete X, each with several values (which may include discretized continuous variables), the number of cells may become large, and many cells will have no untreated observations corresponding to each treated observation.
For example, if you have five variables each with three values, you have $3^5 = 243$ cells.
This is the matching version of the “curse of dimensionality.” It is a version of the common support problem that is discussed in more detail later.

Inexact matching estimators
When the curse of dimensionality strikes, matching estimators that do not require exact matching pose a solution.
Inexact matching procedures reduce the dimension of the problem by defining a distance metric on X and then matching using the distance rather than the X.
For example, you can construct the Mahalanobis distance between each treated observation and each untreated observation and match on that by defining $w(i,j)$ in terms of the distance.
Asymptotically, all inexact matching schemes are equivalent since with an arbitrarily large sample, you have only exact matches. However, they can yield very different answers in finite samples.
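One way to define the weights in terms of the Mahalanobis distance is single nearest neighbor matching with replacement. The sketch below assumes that choice, uses the pooled sample covariance matrix of X (other choices are possible), and uses a hypothetical function name and made-up data.

```python
import numpy as np

def mahalanobis_nn_weights(x_treated, x_untreated):
    """Weights w[i, j] = 1 if untreated j is the nearest neighbor of treated i.

    Distance is Mahalanobis distance based on the pooled covariance of X.
    Matching is with replacement, so an untreated unit may be used repeatedly.
    """
    pooled = np.vstack([x_treated, x_untreated])
    cov = np.atleast_2d(np.cov(pooled, rowvar=False))
    cov_inv = np.linalg.inv(cov)
    diff = x_treated[:, None, :] - x_untreated[None, :, :]
    dist2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)  # squared distances
    w = np.zeros_like(dist2)
    w[np.arange(len(x_treated)), dist2.argmin(axis=1)] = 1.0
    return w

# Hypothetical data: each treated unit has an exact twin among the untreated
x_treated = np.array([[10.0, 3.0], [0.0, 0.0]])
x_untreated = np.array([[0.0, 0.0], [10.0, 3.0], [4.0, 9.0]])
w = mahalanobis_nn_weights(x_treated, x_untreated)
print(w)
```

Each row of the returned matrix sums to one, so it can serve directly as the weights in a weighted counterfactual mean.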

Propensity score matching
Rosenbaum and Rubin (1983) show that if you can match on X then you can also match on $P(X) = \Pr(D = 1 \mid X)$, the so-called propensity score. That is,

$$Y_0 \perp D \mid X \;\Rightarrow\; Y_0 \perp D \mid P(X)$$

In words, if the data justify matching on X then they also justify matching on $P(X)$.

Propensity score matching (continued)
The intuition is that two groups with the same probability of participation will show up in the treated and untreated samples in equal proportions. Thus, they can be combined for purposes of comparison.
Another way to think of this is that if $Y_0$ is independent of D given X, then it should also be independent of D conditional on $P(X)$, which summarizes the information in X relative to D.

Estimating the propensity score
Non-parametric estimation brings back the curse of dimensionality.
In practice, the propensity score is almost always estimated using a parametric model such as a logit or a probit.
One justification given for this practice is that Monte Carlo evidence and evidence from sensitivity analyses suggest that estimating the propensity score parametrically makes very little difference.
A related justification is that the approximation error inherent in a parametric model may be smaller in the case of a binary dependent variable.
It is also important to adopt a flexible specification for the parametric propensity score model. We will talk about how balancing tests help to do that later on.
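A logit propensity score can be estimated with a few lines of Newton-Raphson. The sketch below is an assumption-laden illustration: the function names and data are hypothetical, there is no safeguard against perfect separation, and in applied work one would normally use a packaged logit or probit routine instead. A more flexible specification can be obtained by passing additional columns (for example powers or interactions of X).

```python
import numpy as np

def fit_logit(x, d, max_iter=50, tol=1e-10):
    """Maximum likelihood logit of d on a constant and x via Newton-Raphson."""
    design = np.column_stack([np.ones(len(d)), x])
    beta = np.zeros(design.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (d - p)                       # score vector
        hessian = design.T @ (design * (p * (1 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def propensity_score(x, beta):
    """Fitted Pr(D = 1 | X) from the logit coefficients."""
    design = np.column_stack([np.ones(x.shape[0]), x])
    return 1.0 / (1.0 + np.exp(-design @ beta))

# Hypothetical data with overlap between treated and untreated values of x
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
d = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1], dtype=float)
beta = fit_logit(x, d)
p_hat = propensity_score(x, beta)
print(beta, p_hat)
```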

Matching estimators
Single nearest neighbor matching (a.k.a. pair matching) without replacement
Optimal single nearest neighbor matching without replacement
Single nearest neighbor matching with replacement
Optimal full matching, as in Hansen (2004) JASA

With replacement means that a given untreated unit can form the match for more than one treated unit.
There is a bias versus variance tradeoff implicit in the choice of whether or not to match with replacement.
Matching without replacement can make a lot of sense in cases where additional data will be collected, as in the Labor Market Development Agreement (LMDA) evaluations in Canada.

Matching estimators (continued)

All matching estimators are weighting estimators, as can be seen by the following general formula. Let $n_1$ be the number of treated persons and $n_0$ be the number of untreated persons. Then all matching estimators for the treatment on the treated can be written in the form

$$\Delta^M = \frac{1}{n_1} \sum_{i \in \{D_i = 1\}} \left[ Y_{1i} - \sum_{j \in \{D_j = 0\}} w(i,j)\, Y_{0j} \right],$$

where $w(i,j)$ is the weight placed on the jth untreated observation in constructing the counterfactual for the ith treated observation.

Matching estimators – general form (continued)
The weights satisfy

$$\sum_j w(i,j) = 1 \text{ for all } i.$$

Different matching estimators differ in how they construct the weights $w(i,j)$.
A similar but more complex formula holds for matching estimators of the average treatment effect.
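The general form can be made concrete by building the weight matrix explicitly. The sketch below, with hypothetical function names and made-up data, constructs weights for single nearest neighbor matching with replacement and for Gaussian kernel matching on the propensity score, and then applies the same estimator formula to either set of weights. The Gaussian kernel is an assumption chosen for brevity.

```python
import numpy as np

def nn_weights(p_treated, p_untreated):
    """Single nearest neighbor (with replacement) weights on the propensity score."""
    dist = np.abs(p_treated[:, None] - p_untreated[None, :])
    w = np.zeros_like(dist)
    w[np.arange(len(p_treated)), dist.argmin(axis=1)] = 1.0
    return w

def kernel_weights(p_treated, p_untreated, bandwidth):
    """Gaussian kernel weights, normalized so each row sums to one."""
    u = (p_treated[:, None] - p_untreated[None, :]) / bandwidth
    k = np.exp(-0.5 * u ** 2)
    return k / k.sum(axis=1, keepdims=True)

def matching_atet(y_treated, y_untreated, w):
    """Mean over treated units of own outcome minus weighted counterfactual."""
    return np.mean(y_treated - w @ y_untreated)

# Hypothetical propensity scores and outcomes
p_treated = np.array([0.30, 0.70])
p_untreated = np.array([0.25, 0.50, 0.72])
y_treated = np.array([5.0, 8.0])
y_untreated = np.array([4.0, 6.0, 7.0])

w_nn = nn_weights(p_treated, p_untreated)
w_kernel = kernel_weights(p_treated, p_untreated, bandwidth=0.1)
print(matching_atet(y_treated, y_untreated, w_nn))
print(matching_atet(y_treated, y_untreated, w_kernel))
```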

Estimators based on non-parametric regression
Another set of estimators, which some define as matching and some do not, estimate the counterfactual via a non-parametric regression of the untreated outcome on X or on the propensity score: $E(Y_0 \mid X)$ or $E(Y_0 \mid P(X))$.
Predicted values from this non-parametric regression form the estimated expected counterfactual for each treated unit.
This way of thinking about matching makes it clear that matching is not as new or novel as it seems.
This way of thinking about matching also demonstrates the applicability of the broader literature on non-parametric regression for this problem.
Draw the picture.

Estimators based on non-parametric regression (continued)
Any method for estimating a non-parametric regression can be used:
Stratification (e.g. a regressogram)
Flexible parametric regression
Kernel regression
Local linear regression

These smoothers can be applied directly to X but it is more common to apply them to $P(X)$.
Check: everyone familiar with basic idea of non-parametric regression?

Estimators based on non-parametric regression (continued) Just as matching estimators can be framed as weighting estimators, they can also be framed as non-parametric regression estimators Also, all of the estimators based on non-parametric regression can be expressed as weighting estimators

Inverse probability weighting
Basic formula for the average treatment effect:

$$\Delta = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{Y_i D_i}{P(X_i)} - \frac{Y_i (1 - D_i)}{1 - P(X_i)} \right]$$

This is related to the estimator of Horvitz and Thompson (1952).
The formula for the ATET is similar.
Important to force weights to sum to one!
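The point about forcing the weights to sum to one can be illustrated directly. In the sketch below (hypothetical function names, made-up data, propensity scores taken as given), the unnormalized estimator follows the basic formula, while the normalized version divides each weighted sum by the sum of its weights. Adding a constant to every outcome leaves the normalized estimate unchanged but shifts the unnormalized one whenever the two sets of weights do not sum to the same value.

```python
import numpy as np

def ipw_ate_unnormalized(y, d, p):
    """Basic IPW formula for the average treatment effect."""
    return np.mean(d * y / p - (1 - d) * y / (1 - p))

def ipw_ate_normalized(y, d, p):
    """IPW with the treated and untreated weights each forced to sum to one."""
    w1 = d / p
    w0 = (1 - d) / (1 - p)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Hypothetical outcomes, treatment indicators and propensity scores
y = np.array([2.0, 4.0, 1.0, 3.0])
d = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.8, 0.5, 0.5, 0.5])

shift = 10.0  # add a constant to every outcome
print(ipw_ate_normalized(y, d, p), ipw_ate_normalized(y + shift, d, p))
print(ipw_ate_unnormalized(y, d, p), ipw_ate_unnormalized(y + shift, d, p))
```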

Inverse probability weighting (continued)

IPW requires real conditional probabilities
Note that in IPW the probabilities must actually be that; they are not just balancing scores.
This has implications for, e.g., choice-based samples.
Matching on the square root of the propensity score works fine; weighting using the square root of the propensity score does not.

When IPW has trouble
IPW has trouble when there are probabilities very close to zero or one, as this leads to division by very small numbers.

Inverse probability weighting (continued) Recall the argument earlier that matching estimators and estimators based on non-parametric regression can be expressed as weighting estimators, then ponder this quotation: “Matching is an attempt to approximate what reweighting is doing directly”

- Justin McCrary

Bandwidth choice
Matching and non-parametric regression estimators require a bandwidth choice.
Parametric linear regression and weighting do not require a bandwidth choice.
Depending on the method used, choosing a bandwidth can be computationally intensive and so costly in both researcher time and clock time.
Not having to choose a bandwidth is a big advantage of IPW.

Bandwidth choice (continued) Examples of bandwidth choice How many neighbors? How many strata? How wide a bandwidth in kernel matching? Bandwidth choices embody a bias-variance tradeoff:

Larger bandwidth implies lower variance and higher bias Smaller bandwidth implies higher variance and lower bias

The literature offers methods to guide the bandwidth choice: A priori (e.g. just do pair matching)

Plug-in bandwidths Cross-validation

Bandwidth choice (continued)

Let $\hat{E}_{-j}(Y_{0,j} \mid P(X_j), bw_k)$ denote the predicted value at $X_j$ from a non-parametric regression of $Y_0$ on $P(X)$ using all of the untreated observations other than “j” and bandwidth $bw_k$.

Cross-validation chooses bw to minimize:

$$EMSE(bw_k) = \frac{1}{n_0} \sum_{j \in \{D_j = 0\}} \left( Y_{0,j} - \hat{E}_{-j}(Y_{0,j} \mid P(X_j), bw_k) \right)^2$$

This estimates the mean squared error of the regression function, implicitly weighted by the density of the propensity score among the untreated units.
Cross-validation has a glacial convergence rate.
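Leave-one-out cross-validation over a grid of candidate bandwidths can be sketched as follows. The example assumes a Gaussian kernel (Nadaraya-Watson) regression of the untreated outcome on the propensity score; the function names, the grid and the data are hypothetical. Very small bandwidths can make every leave-one-out kernel weight underflow to zero, so the grid should be chosen with that in mind.

```python
import numpy as np

def loo_cv_mse(p, y0, bandwidth):
    """Leave-one-out estimated MSE for kernel regression of y0 on p.

    p and y0 hold the propensity scores and outcomes of untreated units only.
    """
    u = (p[:, None] - p[None, :]) / bandwidth
    k = np.exp(-0.5 * u ** 2)          # Gaussian kernel weights
    np.fill_diagonal(k, 0.0)           # drop observation j from its own fit
    pred = (k @ y0) / k.sum(axis=1)    # leave-one-out predicted values
    return np.mean((y0 - pred) ** 2)

def choose_bandwidth(p, y0, grid):
    """Return the bandwidth in grid with the smallest leave-one-out MSE."""
    scores = [loo_cv_mse(p, y0, bw) for bw in grid]
    return grid[int(np.argmin(scores))], scores

# Hypothetical untreated sample in which the outcome is linear in the score
p_untreated = np.linspace(0.1, 0.9, 9)
y_untreated = p_untreated.copy()
grid = [0.05, 10.0]
best_bw, scores = choose_bandwidth(p_untreated, y_untreated, grid)
print(best_bw, scores)
```

Note that the criterion averages over untreated units only, which is the implicit weighting by the density of the propensity score among the untreated.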

Bandwidth choice (continued)
As Frölich (2005) Statistics and Computing points out, the standard cross-validation criterion minimizes the MSE of the estimated regression function, weighted by the location of the untreated units, rather than the MSE of the estimated treatment effect.
Two issues:
Minimizing the MSE for the wrong estimand
Minimizing the MSE where the untreated units are
Draw the picture.
Galdo, Black and Smith (2008) RES propose a weighted version of cross-validation that aims to get the bandwidth right in the region heaviest in treated units.

Common support / overlap
In addition to selection on observed variables (SOV), matching, non-parametric regression (NPR) and weighting require common support.
This comes in two flavors, the standard common support assumption

$$0 < \Pr(D = 1 \mid X) < 1$$

and the strict common support assumption

$$0 < c < \Pr(D = 1 \mid X) < 1 - c < 1$$

The latter assumption helps with the asymptotics; see Khan and Tamer (2010).

Common support / overlap (continued) The substance of the common support assumption is that there must be both treated and untreated units with each value of X. When estimating the ATET, all that is required is untreated units for each value of X corresponding to at least one treated unit. The common support assumption can be thought of as applying to either the sample or the population. The literature has been (too) loose about this until recently.

Common support / overlap (continued) The parametric linear regression model does not require the common support condition The functional form fills in for the data that are missing when the common support assumption fails to hold Draw a picture to show this This is both part of the charm and part of the horror of the standard parametric approach

Common support / overlap strategies

Be sure to examine this empirically …
… by examining the data
… by thinking about the institutions generating treatment choice
… even if you ultimately go with a parametric linear model

Methods to impose the support condition:
Min-max as in Dehejia and Wahba (1999, 2002)
Kernel choice in the case of kernel or local linear regression
Density method in Heckman, Ichimura and Todd (1997) ReStud
Trimming method in Huber, Lechner and Wunsch (2010) for IPW

Looking where the light is:
Black and Smith (2004) Journal of Econometrics
Crump, Hotz, Imbens and Mitnik (2009) Biometrika
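A min-max style rule is easy to sketch. The version below is one simple reading of the idea, not a reproduction of any particular paper's procedure: it keeps observations whose estimated propensity score lies between the larger of the two group minima and the smaller of the two group maxima. The function name and data are hypothetical.

```python
import numpy as np

def min_max_support(p, d):
    """Boolean mask for observations inside the overlap of the two score ranges."""
    lower = max(p[d == 1].min(), p[d == 0].min())
    upper = min(p[d == 1].max(), p[d == 0].max())
    return (p >= lower) & (p <= upper)

# Hypothetical estimated propensity scores and treatment indicators
p = np.array([0.1, 0.3, 0.5, 0.6, 0.9])
d = np.array([0, 0, 1, 0, 1])
keep = min_max_support(p, d)
print(keep)
```

A rule of this kind only looks at the extremes of the two score distributions, so it does not detect gaps in the interior of the range.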

Balance
Balance property from Rosenbaum and Rubin (1983):

$$E(D \mid X, P(X)) = E(D \mid P(X))$$

for any X if $P(X)$ is correctly specified.
In words, $P(X)$ contains all the information in X about D.
Balance implies that

$$f(X \mid P(X), D = 1) = f(X \mid P(X), D = 0)$$

In words, conditioning on the propensity score balances the distribution of conditioning variables between the treated and untreated units.
Note the analogue to experiments here, where random assignment balances both observed and unobserved characteristics of the treated and untreated units.

Balance (continued)
Balance is unrelated to the CIA.
Balance is informative about how flexible to make the propensity score specification but not informative about what variables to put in the propensity score.
If balance fails, then adopt a more flexible propensity score specification.

Balance (continued)
Analyses based on propensity scores should report evidence on balance, just as experimental analyses report evidence on balance.
A common measure, proposed by Rosenbaum and Rubin (1985), consists of standardized differences:

$$SDIFF(X) = 100 \cdot \frac{\dfrac{1}{n_1} \sum_{i \in \{D_i = 1\}} \left[ X_i - \sum_{j \in \{D_j = 0\}} w(i,j)\, X_j \right]}{\sqrt{\dfrac{\mathrm{var}_{i \in \{D_i = 1\}}(X_i) + \mathrm{var}_{j \in \{D_j = 0\}}(X_j)}{2}}}$$

The numerator is the treatment effect estimated using an X as the dependent variable while the denominator is a scale factor based on the raw variances of the X among the treated and untreated units.
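The standardized difference can be computed from the same weight matrix used for the impact estimate. The sketch below uses a hypothetical function name and made-up data, takes the weights as given, and assumes sample variances with the usual degrees-of-freedom correction and a square root in the scale factor.

```python
import numpy as np

def standardized_difference(x_treated, x_untreated, w):
    """Standardized difference (in percent) for one covariate.

    w[i, j] is the weight on untreated unit j for treated unit i;
    the variances are the raw (unweighted) sample variances.
    """
    numerator = np.mean(x_treated - w @ x_untreated)
    scale = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_untreated, ddof=1)) / 2)
    return 100 * numerator / scale

# Hypothetical covariate values and pair-matching weights
x_treated = np.array([1.0, 3.0])
x_untreated = np.array([0.0, 2.0, 4.0])
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(standardized_difference(x_treated, x_untreated, w))
```

Setting every row of the weight matrix to $1/n_0$ gives the standardized difference in the raw, unmatched samples, which is a useful benchmark for the matched value.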

Balance (continued)
The literature contains a variety of balancing tests whose properties are not very well understood.
Wang-Sheng Lee (forthcoming) Empirical Economics is a good recent summary.
Gary King and co-authors take the tests more seriously than they should.
The regression-based test in Smith and Todd (2005b) Journal of Econometrics may be “too strong”.
Don’t make a fetish of it but don’t ignore it either.

Balance (continued)
Some recent work uses the balance property to define the estimator.

Genmatch
Uses a user-supplied balance metric and chooses matches to maximize balance
See Diamond and Sekhon (2010) unpublished (and related papers)

Inverse probability tilting
Like IPW, but with $P(X)$ estimated using method of moments, with moment conditions defined in terms of balance
See Graham, Pinto and Egel (2010) unpublished

These estimators are not yet as well studied as others we have discussed.

Variance estimation
SEB = semi-parametric efficiency bound
Like the “B” in BLUE for parametric estimators: minimum asymptotic variance

Parametric regression
Use the standard OLS variance estimator (or type “comma robust”)
Imposing a parametric linear model when it is true yields a payoff in statistical efficiency

Variance estimation (continued)

Matching
Oddly, no asymptotic theory until recently for either the case of matching on X or the case of matching on the propensity score.
The bootstrap is not valid in either case due to the non-smoothness of the problem.
Matching does not attain the semi-parametric efficiency bound.
See the series of papers by Abadie and Imbens on this.

Variance estimation (continued)

Non-parametric regression
Need to take account of variance component from estimating the propensity scores.
The bootstrap is valid for these estimators so long as they are smooth.
Some of these estimators attain the semi-parametric efficiency bound.

Weighting
The bootstrap is valid for IPW.

Comparing alternative estimators Read the following papers: Frölich (2004) Review of Economics and Statistics Diamond and Sekhon (2010) Busso, DiNardo and McCrary (2009) Busso, DiNardo and McCrary (2011) Huber, Lechner and Wunsch (2011) Each of these provides a Monte Carlo analysis of the finite sample performance of various subsets of the estimators described above.

Comparing alternative estimators (continued)
What you will learn from these papers about doing Monte Carlos:
1. Don’t just rely on one Monte Carlo paper; people get things wrong
Ex: Frölich (2004) and normalizing the weights in IPW
Ex: Statistics paper followed in D & S that matches on instruments
2. Monte Carlo analysis is harder than you think
3. Using (econometric and economic) theory to guide and interpret Monte Carlo analyses is very powerful.
4. Basing the DGP in a Monte Carlo analysis on a real-world DGP can yield important insights

Comparing alternative estimators (continued)
1. Propensity score stratification is a bad idea
2. IPW does very well as long as the support condition holds, even in pretty small samples
3. Pair matching has low bias but relatively high variance; it is more robust to a weak support condition than IPW
4. Kernel and ridge matching do pretty well
5. Bandwidth selection (as opposed to a priori selection) can make things worse in small samples

Sensitivity analysis
Sensitivity analysis tries to examine the effects of violations of the CIA on the obtained estimates.
Easiest to see in the context of the bivariate normal selection model:

Outcomes: $Y = \beta_0 + \beta_X X + \beta_D D + \varepsilon$
Participation: $D^* = \gamma_0 + \gamma_X X + \upsilon$; $D = 1$ iff $D^* > 0$ and $D = 0$ otherwise.

Assume $(\varepsilon, \upsilon) \sim N(0, 0, \rho)$.
Estimate the model by maximum likelihood with $\rho$ fixed at various values.
See Altonji, Elder and Taber (2005) JPE.

Sensitivity analysis (continued)
Analogues to this exist for matching; see Ichino, Mealli and Nannicini (2007) Journal of Applied Econometrics for an example and pointers to the literature.
All of these schemes share the common concern of what one’s prior should be about the importance of particular amounts of failure of the CIA.
At the same time, more of this sort of analysis would be useful in the applied literature.

Dynamic treatment assignment
In the real world, treatment is not offered only in period k.
Think about the typical setup in most European countries. Individuals become unemployed and are then at risk of either finding a job or participating in active labor market programs in each subsequent period.
Now think about the traditional evaluation practice of defining a participation window, say six months, and comparing individuals who do and do not participate in the window.
If some people do not participate because they have found a job, then the traditional setup implicitly conditions on outcomes. We would expect the traditional procedure to lead to a downward bias.

Dynamic treatment assignment (continued) Sianesi (2004) ReStat proposed an alternative setup in which individuals at risk of participating in a program in each period of their unemployment spell are compared. For example, estimate the impact of participating versus waiting in the first period using all of the newly unemployed. Then estimate the impact of participating versus waiting in the second period for those still unemployed as of the start of the second period, etc. This procedure is not informative about optimal timing. This procedure is complicated to integrate into a cost-benefit framework. See also Fredriksson and Johansson (2008) JBES and Dolton and Smith (2011)

Variations on matching Bias-corrected matching (= matching plus regression = double robust)

Abadie and Imbens (2011) JBES Robins et al. (2007) Statistical Science

Multi-treatment matching
Lots of pairwise comparisons
More parameters than you can shake a stick at

Variations on matching (continued)

Matching for continuous treatments (i.e. dose-response)
Just discretize the treatment

Difference-in-differences matching
Like regular diff-in-diff but controlling semi-parametrically for X
See Heckman, Ichimura, Smith and Todd (1998) Econometrica

Conclusions
There are many estimators available when assuming selection on observed variables.
We have learned a great deal about the links between the various estimators.
We have learned a great deal about the finite sample performance of the various estimators.
The details of the matching procedure can and do matter.
Empirical practice often lags behind, sometimes way behind, our applied econometric knowledge.
Matching is not a magic bullet.
