Propensity Score Matching and Variations on the Balancing Test

Wang-Sheng Lee

Melbourne Institute of Applied Economic and Social Research, The University of Melbourne

October 27, 2006

Definition of the Problem

“The most obvious limitation at present is that multiple versions of the balancing test exist in the literature, with little known about the statistical properties of each one, or how they compare to one another given particular types of data.”

(Smith and Todd, 2005)

Preview of Main Findings

• There is a difference between a ‘before matching balancing test’ and an ‘after matching balancing test.’

• Current balancing tests as implemented in the literature have poor size properties.

• Improved balancing tests using non-parametric tests are suggested.

Propensity Score Matching Methodology

Step 1: Estimate the probability of receiving treatment, prob(D = 1 | X) = p(X), using a logit or probit model.

Step 2: Choose matching algorithm (e.g., stratification, nearest neighbour, kernel matching, caliper matching etc.) and match on p(X).

Step 3: Perform matching diagnostics (like the balancing test).

Step 4: Compare mean outcomes to get the Average Treatment Effect on the Treated (ATT).
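The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the studies cited in these slides: the data are simulated with a single covariate, the logit is fitted by plain gradient ascent, matching is single nearest neighbour with replacement, and every function name (logistic, fit_logit, att_nearest_neighbour, simulate_and_estimate) is hypothetical. Step 3 (diagnostics) is left out of this sketch.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(xs, ds, steps=300, lr=1.0):
    """Step 1: fit prob(D = 1 | x) = logistic(a + b * x) by gradient ascent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, d in zip(xs, ds):
            resid = d - logistic(a + b * x)
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def att_nearest_neighbour(ps, ds, ys):
    """Steps 2 and 4: match each treated unit to the control with the closest
    p(X) (with replacement) and average the outcome differences."""
    controls = [(p, y) for p, d, y in zip(ps, ds, ys) if d == 0]
    diffs = []
    for p, d, y in zip(ps, ds, ys):
        if d == 1:
            _, y_match = min(controls, key=lambda c: abs(c[0] - p))
            diffs.append(y - y_match)
    return sum(diffs) / len(diffs)


def simulate_and_estimate(n=500, true_effect=2.0, seed=1):
    """Simulated data (an assumption of this sketch): one covariate x,
    treatment assigned by a logit in x, outcome = effect * D + x + noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ds = [1 if rng.random() < logistic(-1.0 + x) else 0 for x in xs]
    ys = [true_effect * d + x + rng.gauss(0.0, 0.5) for x, d in zip(xs, ds)]
    a, b = fit_logit(xs, ds)
    ps = [logistic(a + b * x) for x in xs]
    return att_nearest_neighbour(ps, ds, ys)


if __name__ == "__main__":
    print(round(simulate_and_estimate(), 2))
```

Because the simulated effect is constant across units, the ATT equals true_effect here, so the printed estimate should land near 2.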

A Matching Diagnostic: Balance

• A balancing test checks if the two groups ‘look the same’ in terms of the Xs after matching on p(X).

• The balancing property of propensity scores (Theorem 2, Rosenbaum and Rubin, 1983):

X ⊥ D | p(X)

Given information on p(X), information on X is unnecessary for information on D.

The test does not use the outcome variable, so running it does not bias the estimated treatment effect.

Balance does not mean we have the correct Xs in the model (i.e., balance is not the same as the conditional independence assumption, CIA).

No convenient tests for conditional independence exist.

Varieties of Balancing Tests

Test 1: Test for equality of each covariate mean between groups, within strata of p(X) (t-test). (Done after step 1: estimating p(X) on full sample)

Test 2: Standardised test of differences (of normalised covariates) between groups. (Done after step 2: matching on p(X))

Test 3: Test for equality of each covariate mean between groups (t-test). (Done after step 2: matching on p(X))

Test 4: Test for joint equality of all covariate means between groups (F-test or Hotelling test). (Done after step 2: matching on p(X))
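The statistics behind tests 2 and 3 can be sketched for a single covariate as follows. The formulas are assumptions of this illustration rather than definitions taken from the slides: the standardised difference is scaled by the square root of the average of the two group variances, both computed from the samples passed in, and the t-statistic is the unequal-variance (Welch) form. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def standardised_difference(treated, control):
    """Difference in means as a percentage of the root average group variance."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return 100.0 * (mean(treated) - mean(control)) / pooled_sd


def welch_t(treated, control):
    """Two-sample t-statistic that does not assume equal variances."""
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return (mean(treated) - mean(control)) / se


if __name__ == "__main__":
    x_treated = [1, 2, 3, 4, 5]
    x_control = [2, 3, 4, 5, 6]
    print(standardised_difference(x_treated, x_control))
    print(welch_t(x_treated, x_control))
```

In an after matching test, the two lists would hold the covariate values of the matched treated and matched comparison units; weights from matching with replacement or kernel matching are ignored in this sketch.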

• QQ plots. Austin and Mamdani (2006); Imai, King and Stuart (2006).

• Box plots. Austin and Mamdani (2006).

• Binary response plots (Rubin-Cook scatter plots). Lee (2006a).

• Undirected graphical models. Lee (2006b).

Some Other Alternative Before Matching Balancing Tests

• Regression test. Smith and Todd (2005).

• Pseudo R2. Sianesi (2004).

Some Other Alternative After Matching Balancing Tests

Motivating Example: NSW Data

• This experimental data set was used in several studies to perform a ‘recovery exercise.’

See, for example, Dehejia and Wahba (1999, 2002) and Smith and Todd (2005).

• Dehejia and Wahba (1999) conducted test 1, performed stratification and nearest neighbour matching, and obtained estimates similar to the experimental estimates.

Concluded that balancing test 1 is useful.

• But Dehejia and Wahba (1999) did not conduct tests 2 to 4. What would have happened if they had?

After estimating p(X), balance is obtained using test 1.

After performing kernel matching using the same specification of p(X), balance is obtained if we use tests 2 to 4.

However, after performing nearest neighbour matching using the same specification of p(X), imbalance is obtained if we use tests 2 to 4.

• In summary, Dehejia and Wahba’s (1999) results from nearest neighbour matching that replicated the experimental benchmark came from a matched sample with imbalanced covariates.

• Which balancing test should be used in practice?

• Is the within strata t-test (Test 1) useful as a specification test for p(X)?

Approach of Dehejia and Wahba (using test 1 together with nearest neighbour matching) still used as recently as Diaz and Handa (2006).

• Issue of multiple testing (e.g., Westfall and Young 1993).

Monte Carlo Simulations

• Generating balanced data:

If the error term in the treatment assignment equation is independent of X, then given X and β:

D ⊥ X | Xβ

It follows that D ⊥ X | logit(Xβ) or D ⊥ X | p(X)
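A data-generating process of this kind can be sketched as follows. The covariate distribution (independent standard normals), the coefficient values and the function names are assumptions made for illustration; they are not the settings used in the simulations reported in these slides. Drawing D as a Bernoulli variable with probability logistic(Xβ) is equivalent to a latent index model whose logistic error is independent of X.

```python
import math
import random
from statistics import mean


def generate_balanced_sample(n, beta, intercept, seed=0):
    """Return rows (x, d, p) in which D depends on X only through the index."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        index = intercept + sum(b * xi for b, xi in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-index))
        # Same as D = 1{index + e > 0} with a logistic error e independent of X.
        d = 1 if rng.random() < p else 0
        rows.append((x, d, p))
    return rows


def mean_gap_within_band(rows, lo, hi, k=0):
    """Treated-minus-control mean of covariate k among units with lo <= p < hi."""
    treated = [x[k] for x, d, p in rows if d == 1 and lo <= p < hi]
    control = [x[k] for x, d, p in rows if d == 0 and lo <= p < hi]
    return mean(treated) - mean(control)


if __name__ == "__main__":
    rows = generate_balanced_sample(5000, beta=(0.5, 0.5), intercept=-1.5)
    print("raw gap:", mean_gap_within_band(rows, 0.0, 1.0))
    print("gap for 0.15 <= p(X) < 0.25:", mean_gap_within_band(rows, 0.15, 0.25))
```

The raw gap in the first covariate should be clearly positive, while the gap inside a narrow band of p(X) should be close to zero. A small residual gap remains because a band only approximates conditioning on an exact value of p(X).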

Monte Carlo Simulations using Generated Data

• The simulations:

assume a T-C ratio of 20-80.

assume we know which Xs to use to estimate the true propensity score (CIA).

vary the number and distribution of covariates and the sample size.

• Test 1 performs terribly in terms of test size. But it seems to work well with a Bonferroni correction.

• Tests 2 to 4 appear to have poor test sizes when there are more than 2 covariates.

In current practice, researchers often look at mean or median values (e.g. mean standardised difference) instead of using a “one unbalanced covariate and you’re out” rule.
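The multiple testing problem and the Bonferroni correction mentioned above can be illustrated with a short sketch. It assumes the individual tests are independent, which is a simplification made here and not a claim from the slides; the function names and the example of 8 covariates in 5 strata are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false rejection across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def reject_balance(p_values, alpha=0.05, bonferroni=True):
    """'One unbalanced covariate and you're out' rule, with or without
    a Bonferroni-adjusted threshold of alpha / number of tests."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return any(p < threshold for p in p_values)


if __name__ == "__main__":
    n_tests = 8 * 5  # hypothetical: 8 covariates tested in each of 5 strata
    print(familywise_error_rate(0.05, n_tests))
    print(familywise_error_rate(0.05 / n_tests, n_tests))
    print(reject_balance([0.03, 0.2, 0.5, 0.8], bonferroni=False))
    print(reject_balance([0.03, 0.2, 0.5, 0.8], bonferroni=True))
```

Under the independence assumption, the unadjusted rule rejects balance far more often than the nominal 5% even when the data are balanced, while the adjusted threshold keeps the overall rate just below 5%.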

Monte Carlo Simulations using NSW Data

• None of the balancing tests appear to work well. For example, test 1 rejects balance about 20% of the time when α = 5%.

• Considered the issue of outliers but dropping these observations did not change the results.

• The only way to make things work appears to be dropping difficult to balance covariates.

But this is not a satisfactory solution!

Permutation Tests

• Instead of using the t-distribution for tests 1 and 3, or the Hotelling distribution for test 4, we use permutation distributions.

• A similar approach is used in Abadie (2002), in a context where the Kolmogorov-Smirnov statistic performs poorly in the presence of point masses.

The basic idea is to rearrange the labels on the observations, compute the test statistic and repeat many times to obtain the permutation distribution of the test statistic.

Permutation resampling is done without replacement.

• Monte Carlo simulations using the NSW data show balancing tests attain approximately the correct test sizes.
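The relabelling idea can be sketched for a single covariate. This is a simplified illustration and not the implementation behind the results above: it uses the absolute difference in means as the test statistic, ignores matching weights, and draws random permutations instead of enumerating all of them. The function name and the add-one p-value adjustment are choices made for this sketch.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in covariate means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_t = len(treated)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling relabels the observations without replacement.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_t]) - mean(pooled[n_t:]))
        if stat >= observed:
            exceed += 1
    # The add-one adjustment keeps the estimated p-value above zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    balanced = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
    unbalanced = permutation_p_value(list(range(10, 18)), list(range(8)))
    print(balanced, unbalanced)
```

Balance would be rejected for a covariate when its p-value falls below the chosen size. The same loop can wrap any statistic, for example a t-statistic or a Hotelling statistic, by replacing the two lines that compute observed and stat.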

Power of the Tests

• What happens when there is an omitted variable in estimating p(X)? Using the NSW data, consider three DGPs.

1. p(X) contains RE74 and Y contains RE74.

2. p(X) contains RE74 and Y does not contain RE74.

3. p(X) does not contain RE74 and Y contains RE74.

• Estimate the propensity score using a set of variables that excludes RE74 (i.e., omitted variable).

Under all three DGPs, the tests reject balance at approximately the chosen size, so the balancing tests could not detect the misspecification in p(X).

Bias on ATT largest for DGP1. Smaller biases on ATT for DGP2 and DGP3.

• When the CIA is not fulfilled, balancing tests with low type 1 error rates are of limited use (i.e., Balance ≠ CIA).

Conclusions

• p(X) is a relative measure and not a permanent ID tag or permanent summary index score associated with each observation.

Matching creates weights that effectively change the composition of the sample.

When the sample changes, it changes the nature of the balancing hypothesis X ⊥ D | p(X).

• Important to distinguish between before matching and after matching balancing tests.

Test 1 is a before matching test and most appropriately used with matching by stratification (i.e., ATT is computed using the exact same strata as test 1).

Tests 2 to 4 are after matching balancing tests and most appropriately used with matching algorithms that match on p(X).

• The DW test as described in Dehejia and Wahba (1999, 2002) has a poor test size when used as a before matching test.

• Conventional t-tests and Hotelling-tests do not appear to work well as tests for after matching balance.

Related to the problem of computing standard errors for matching estimators, which is still an open problem (i.e., no analytic solution).

• Balancing tests based on permutation tests appear to provide good test sizes.

But without fulfilling the CIA, their role as a diagnostic is limited.

The End
