
To Hold Out or Not

Frank Schorfheide and Ken Wolpin

University of Pennsylvania

April 4, 2011

Introduction

Randomized controlled trials (RCTs) to evaluate policies, e.g., cash transfers for school attendance, have become a prominent methodology in applied economics.

Limitation: one cannot extrapolate outside of the treatment variation in the particular experiment.

Given their cost, RCTs cannot be used to perform ex ante policy evaluation over a wide range of policy alternatives.

Extrapolation to new treatments requires developing models that embed behavioral and statistical assumptions.

It is thus important to have methods for assessing the relative credibility of models.

Schorfheide and Wolpin To Hold Out or Not

Introduction

In practice researchers often hold out data from estimation to use for external validation, e.g., Wise (1985), Todd and Wolpin (2006), Duflo, Hanna and Ryan (2009).

Although having intuitive appeal, the use of holdout samples is puzzling from a Bayesian perspective, which prescribes using the entire sample to form posteriors.

Our contributions:

1. Provide a formal, albeit stylized, framework in which data mining poses an impediment to the implementation of the ideal Bayesian analysis.

2. Provide a numerical illustration of potential costs of data mining and potential benefits of holdout samples designed to discourage data mining. We measure losses relative to the ideal Bayesian solution.

(Structural) Data Mining: Process by which a modeler tries to improve the fit of a structural model during estimation, e.g., change functional forms, allow for unobserved heterogeneity, add latent state variables.


To Fix Ideas... A Working Example

Evaluate impact of monetary subsidy to low-income households based on school attendance of their children.

No direct tuition cost of schooling.

Structural models $M_i$: household solves ($a = 1$ means attend school):

$$\max_{a \in \{0,1\}} U_i(c, a; x, \epsilon, \vartheta_i) \quad \text{s.t.} \quad c = y + w(1 - a)$$

Decision rule: $a^* = \varphi_i(y, w; x, \epsilon, \vartheta_i)$.

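Attendance subsidy $s$ modifies the budget constraint:

$$c = y + w(1 - a) + sa = \underbrace{(y + s)}_{\tilde{y}} + \underbrace{(w - s)}_{\tilde{w}} (1 - a).$$

Optimal attendance choice in the presence of the subsidy is

$$a^* = \varphi_i(\tilde{y}, \tilde{w}; x, \epsilon, \vartheta_i).$$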
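The rewriting of the budget constraint rests on the identity $sa = s - s(1 - a)$. A short derivation is sketched here, writing $\tilde{y} = y + s$ and $\tilde{w} = w - s$ for the subsidy-adjusted income and wage (labels assumed for the two bracketed terms):

```latex
\begin{align*}
c &= y + w(1 - a) + sa \\
  &= y + w(1 - a) + s - s(1 - a) \\
  &= (y + s) + (w - s)(1 - a) \\
  &= \tilde{y} + \tilde{w}(1 - a).
\end{align*}
```

The subsidy therefore acts like a rise in income combined with an equal fall in the child's wage, which is why a model estimated without subsidy variation can still predict its effect.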


Example – Continued

Social experiment:

a randomly selected treatment sample has been offered a subsidy, $s = \bar{s}$;

a randomly selected control sample has not been offered the subsidy, $s = 0$.

Policy maker would like to have an estimate of how sensitive the outcome is to varying the subsidy level.

It is too costly to vary the subsidy in the experiment.


Example – Continued

Change of notation: $Y$ is outcome; $S$ subsidy; $X_i$, $i = 1, 2$, are characteristics such as income and wage.

Assumptions:

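$n$ observations, 50% control and 50% treatment sample.

Let $X = [X_1 \; X_2]$. Then:

$$\frac{1}{n} X'X \xrightarrow{p} \Gamma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}.$$

The treatment is determined independently of the covariates:

$$\frac{1}{n} X'S \xrightarrow{p} 0.$$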


Two Modelers

Policy maker engages 2 modelers in this endeavor: $M_i$, $i = 1, 2$.

Structural models embody restrictions that allow the extrapolation of policy effects even though no variation in the policy instrument has been observed (“extrapolation by theory”).

Approximation/simplification of the attendance function $\varphi_i(\cdot)$:

Write model as linear regression:

$$M_i: \; Y = X_i \beta_i + S\theta + U, \quad U \mid (X, S) \sim N(0, I), \quad i = 1, 2.$$

(Structural) model restriction: $\theta = \beta_i$.

Cross-coefficient restriction rules out need of treatment sample for identification.

Prior: $\theta \sim N\left(0, \frac{1}{n\lambda^2}\right)$.


Policy Maker

Goal: predicting the effects of a counterfactual subsidy level $s^* \neq \bar{s}$, where $\bar{s}$ is the subsidy level offered in the experiment.

Assumption: no counterfactual policy predictions with reduced form model.

The policy maker can estimate a simple reduced form model:

$$M_{pol}: \; Y = S\theta + V$$

$\hat{\theta}(Y, S) \mid M_{pol}$ provides a consistent estimate of the treatment effect.

But $M_{pol}$ is unable to answer the question of interest.

For model selection/averaging the particular counterfactual policy $s^*$ is irrelevant: the policy maker weights models either based on fit or on their ability to predict the effect of the actual subsidy level $\bar{s}$.


Ideal Case: Full Bayesian Analysis

Policy maker assigns prior probabilities $\pi_{i,0}$ to $M_i$.

From the policy maker’s perspective the overall posterior distribution of the treatment effect is given by the mixture

$$p(\theta \mid Y, X, S) = \sum_{i=1,2} \pi_{i,n} \, p(\theta \mid Y, X, S, M_i), \qquad \frac{\pi_{1,n}}{\pi_{2,n}} = \frac{\pi_{1,0}}{\pi_{2,0}} \, \frac{p(Y \mid X, S, M_1)}{p(Y \mid X, S, M_2)}.$$

Model weights are based on the marginal likelihood

$$p(Y \mid X, S, M_i) = \int_{\theta \in \Theta} p(Y \mid \theta, X, S, M_i) \, p(\theta \mid M_i) \, d\theta.$$

Treatment effect estimates conditional on full sample:

$$p(\theta \mid Y, X, S, M_i) = \frac{p(Y \mid X, S, \theta, M_i) \, p(\theta \mid M_i)}{p(Y \mid X, S, M_i)}.$$
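For the linear Gaussian models introduced earlier, the restriction θ = β_i collapses each model to a single regressor X_i + S, and the integral defining the marginal likelihood has a closed form. The following is a minimal sketch of how the posterior model weights could then be computed. The function names, the simulated data, and the choice of θ at five prior standard deviations are illustrative assumptions, not taken from the slides.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def log_marginal_likelihood(y, x_tilde, lam):
    """Log p(Y | X, S, M_i) for Y = x_tilde * theta + U, U ~ N(0, I),
    with scalar theta and prior theta ~ N(0, 1 / (n * lam^2))."""
    n = len(y)
    sxx = dot(x_tilde, x_tilde)
    sxy = dot(x_tilde, y)
    # Quadratic form Y'(I - x (x'x + n lam^2)^{-1} x')Y for a single regressor.
    quad = dot(y, y) - sxy ** 2 / (sxx + n * lam ** 2)
    return (-0.5 * n * math.log(2 * math.pi) + math.log(lam)
            - 0.5 * math.log(sxx / n + lam ** 2) - 0.5 * quad)

def posterior_model_probs(log_mls, prior_probs):
    """Posterior model probabilities from log marginal likelihoods and priors."""
    log_post = [lm + math.log(p) for lm, p in zip(log_mls, prior_probs)]
    top = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative data generated from M_1 (assumed design: half treated, half control).
rng = random.Random(0)
n, sigma2, rho, s_bar, lam = 1000, 2.0, 0.2, 2.0, 1.0
theta = 5.0 / (math.sqrt(n) * lam)  # assumption: five prior standard deviations
x1, x2, s, y = [], [], [], []
for j in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    a = math.sqrt(sigma2) * z1
    b = math.sqrt(sigma2) * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    sj = s_bar if j % 2 == 0 else 0.0
    x1.append(a)
    x2.append(b)
    s.append(sj)
    y.append((a + sj) * theta + rng.gauss(0, 1))

log_mls = [log_marginal_likelihood(y, [xi + si for xi, si in zip(x, s)], lam)
           for x in (x1, x2)]
probs = posterior_model_probs(log_mls, [0.5, 0.5])
print("posterior probabilities of M_1 and M_2:", probs)
```

Working in logs and subtracting the maximum before exponentiating keeps the weights numerically stable when the two marginal likelihoods differ by many orders of magnitude.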


Remark: Full Bayesian Analysis

The assumption that $\theta = \beta_i \sim N\left(0, 1/(n\lambda^2)\right)$ implies that the models remain asymptotically difficult to distinguish: the log posterior odds of $M_1$ and $M_2$ do not diverge as $n \to \infty$.

In reality policy makers are confronted with multiple models that are potentially consistent with the observed data.


Impediments to Full Bayesian Analysis

The policy maker is concerned that the modelers engage in data mining and do not report the marginal data densities $p(Y \mid X, S, M_i)$ associated with their models truthfully.

The policy maker has the option of providing modelers with only a subset of the outcome data:

- Partition $Y' = [Y_r', Y_p']$; here “r” stands for regression and “p” stands for prediction (holdout sample).
- We assume researchers have access to the full data vectors $X_1$, $X_2$, $S$.

and to request

- a predictive density for the holdout sample, $p(Y_p \mid Y_r, X, S, M_i)$;
- a predictive distribution for the PM’s estimate of the treatment effect $\hat{\theta}(Y, S) \mid M_{pol}$: $p(\hat{\theta}([Y_r', Y_p']', S) \mid Y_r, X, S, M_i)$.

Next step: characterize behavior of modelers if they have access to

- the full sample $Y$ (Case 1);
- the sub-sample $Y_r$ (Case 2).


Case 1: Modeler Has Access to Full Sample Y

Our stylized representation of data mining = data-based modification of prior distribution:

1. break link between $\beta$ and $\theta$ by introducing an additional parameter $\psi$ such that $\theta = \beta_i + \psi$;

2. center prior at maximum likelihood estimate.

Step 1:

Write the model as

$$Y = X_i(\theta - \psi) + S\theta + U = \tilde{X}_i \theta - X_i \psi + U,$$

where $\tilde{X}_i = X_i + S$. Let $M_{\tilde{X}_i} = I - \tilde{X}_i (\tilde{X}_i' \tilde{X}_i)^{-1} \tilde{X}_i'$. Then:

$$\hat{\psi} = -(X_i' M_{\tilde{X}_i} X_i)^{-1} X_i' M_{\tilde{X}_i} Y.$$

Data-miner subsequently imposes the relationship $\theta = \beta_i + \hat{\psi}$.



Step 2:

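Modified model:

$$\tilde{M}_i: \; \tilde{Y}_i = \tilde{X}_i \theta + U, \quad \text{with } \tilde{Y}_i = Y + X_i \hat{\psi}.$$

Maximum likelihood estimator:

$$\hat{\theta}_i = (\tilde{X}_i' \tilde{X}_i)^{-1} \tilde{X}_i' \tilde{Y}_i.$$

Data-mining prior:

$$\theta \mid \tilde{M}_i \sim N\left(\hat{\theta}_i, (\kappa \tilde{X}_i' \tilde{X}_i)^{-1}\right).$$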



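Modeler is able to raise the “marginal likelihood”

from:

$$p(Y \mid X, S, M_i) = (2\pi)^{-n/2} \lambda \left| \tilde{X}_i'\tilde{X}_i / n + \lambda^2 \right|^{-1/2} \exp\left\{ -\frac{1}{2} \left[ Y' \left( I - \tilde{X}_i (\tilde{X}_i'\tilde{X}_i + n\lambda^2)^{-1} \tilde{X}_i' \right) Y \right] \right\};$$

to:

$$p(Y \mid X, S, \tilde{M}_i) = (2\pi)^{-n/2} \left( \frac{\kappa}{\kappa + 1} \right)^{1/2} \exp\left\{ -\frac{1}{2} \left[ \tilde{Y}_i' \left( I - \tilde{X}_i (\tilde{X}_i'\tilde{X}_i)^{-1} \tilde{X}_i' \right) \tilde{Y}_i \right] \right\}.$$

Penalty term is eliminated.

In-sample-fit term $\tilde{Y}_i' \left( I - \tilde{X}_i (\tilde{X}_i'\tilde{X}_i)^{-1} \tilde{X}_i' \right) \tilde{Y}_i$ corresponds to the “unrestricted” regression: $Y = X_i\beta_i + S\theta + U$.

Policy maker ends up computing distorted model posteriors based on $p(Y \mid X, S, \tilde{M}_i)$.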
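The two data-mining steps can be traced numerically. The sketch below is illustrative only: it assumes a single scalar regressor per model, simulated data, and hypothetical function names. It computes the estimate of ψ, the re-centered estimate of θ, and the in-sample-fit term, which by construction equals the residual sum of squares of the unrestricted regression of Y on X_i and S.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def project_out(v, x):
    """Return M_x v: the residual of v after regressing it on the vector x."""
    coef = dot(x, v) / dot(x, x)
    return [vi - coef * xi for vi, xi in zip(v, x)]

def honest_log_ml(y, x, s, lam):
    """Log marginal likelihood under theta = beta_i, prior N(0, 1/(n lam^2))."""
    n = len(y)
    x_t = [xi + si for xi, si in zip(x, s)]
    sxx, sxy = dot(x_t, x_t), dot(x_t, y)
    quad = dot(y, y) - sxy ** 2 / (sxx + n * lam ** 2)
    return (-0.5 * n * math.log(2 * math.pi) + math.log(lam)
            - 0.5 * math.log(sxx / n + lam ** 2) - 0.5 * quad)

def data_mine(y, x, s, kappa):
    """Steps 1 and 2: estimate psi, shift Y, and re-center the prior at the MLE."""
    n = len(y)
    x_t = [xi + si for xi, si in zip(x, s)]            # X~_i = X_i + S
    mx = project_out(x, x_t)                           # M_{X~} X_i
    psi_hat = -dot(mx, y) / dot(mx, x)                 # -(X'MX)^{-1} X'MY
    y_t = [yi + xi * psi_hat for yi, xi in zip(y, x)]  # Y~_i = Y + X_i psi_hat
    theta_hat = dot(x_t, y_t) / dot(x_t, x_t)          # MLE in the modified model
    rss = dot(project_out(y_t, x_t), y_t)              # in-sample-fit term
    log_ml = (-0.5 * n * math.log(2 * math.pi)
              + 0.5 * math.log(kappa / (kappa + 1.0)) - 0.5 * rss)
    return psi_hat, theta_hat, rss, log_ml

# Illustrative data (assumed values; the restriction theta = beta holds here).
rng = random.Random(1)
n, theta_true, beta_true = 200, 0.3, 0.3
x = [rng.gauss(0, math.sqrt(2)) for _ in range(n)]
s = [2.0 if j % 2 == 0 else 0.0 for j in range(n)]
y = [beta_true * xi + theta_true * si + rng.gauss(0, 1) for xi, si in zip(x, s)]

psi_hat, theta_hat, rss, mined = data_mine(y, x, s, kappa=1.0)
print("honest log marginal likelihood:", honest_log_ml(y, x, s, lam=1.0))
print("mined  log marginal likelihood:", mined)
```

The mined estimates coincide with ordinary least squares on the unrestricted regression, so the reported fit no longer reflects the structural restriction at all.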


Case 2: Modeler Only Has Access to Subsample Yr

Modeler is asked to report a predictive density for $Y_p$.

Modeler contemplates reporting $p(Y_p \mid Y_r, X, S, \tilde{M}_i)$ instead of $p(Y_p \mid Y_r, X, S, M_i)$.

By Jensen’s inequality, the expected log ratio of the predictive likelihoods is non-positive:

$$\int \ln\left[ \frac{p(Y_p \mid Y_r, X, S, \tilde{M}_i)}{p(Y_p \mid Y_r, X, S, M_i)} \right] p(Y_p \mid Y_r, X, S, M_i) \, dY_p \leq 0.$$

Deduce: the use of predictive densities for a holdout sample makes it optimal for the modeler to reveal $p(Y_p \mid Y_r, X, S, M_i)$.
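The inequality says that the expected log ratio is the negative of a Kullback-Leibler divergence, so any reported density other than the true predictive density lowers the modeler's expected log score. The sketch below checks this in a deliberately simple stand-in setting: both densities are assumed to be scalar normals, and the parameter values and function names are hypothetical.

```python
import math
import random

def normal_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2

def expected_log_ratio(true_mean, true_sd, rep_mean, rep_sd):
    """E_true[ln(reported / true)] for two normal densities, i.e. -KL(true || reported)."""
    kl = (math.log(rep_sd / true_sd)
          + (true_sd ** 2 + (true_mean - rep_mean) ** 2) / (2 * rep_sd ** 2) - 0.5)
    return -kl

def monte_carlo_log_ratio(true_mean, true_sd, rep_mean, rep_sd, draws=20000, seed=0):
    """Simulation estimate of the same expectation, drawing from the true density."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        y = rng.gauss(true_mean, true_sd)
        total += normal_logpdf(y, rep_mean, rep_sd) - normal_logpdf(y, true_mean, true_sd)
    return total / draws

# True predictive density N(0.5, 1); three candidate reports (assumed values).
truthful = expected_log_ratio(0.5, 1.0, 0.5, 1.0)
shifted = expected_log_ratio(0.5, 1.0, 0.9, 1.0)
overconfident = expected_log_ratio(0.5, 1.0, 0.5, 0.6)
simulated = monte_carlo_log_ratio(0.5, 1.0, 0.9, 1.0)
print(truthful, shifted, overconfident, simulated)
```

Only the truthful report attains the upper bound of zero; both the shifted and the overconfident report give a strictly negative expected log ratio.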


Case 2: Modeler Only Has Access to Subsample Yr

However, we allow the modeler to consider a reference model $M_{i0}$ that takes the form (similar to above)

$$M_{i0}: \; Y = \beta_i X_i + \theta S + U, \quad \beta_i \sim N(0, n\lambda^2), \quad \theta \sim N(0, n\lambda^2).$$

Modeler computes posterior probabilities for $M_i$ and $M_{i0}$.

Predictive distribution for hold-out sample:

$$p_i(Y_p \mid Y_r, X, S) = \pi_{i0,r} \, p(Y_p \mid Y_r, X, S, M_{i0}) + \pi_{i,r} \, p(Y_p \mid Y_r, X, S, M_i)$$

Behavioral implication (approximately):

If the modeler finds $M_i$ “rejected” against $M_{i0}$ ($\pi_{i0,r} \approx 1$), he reports $p(Y_p \mid Y_r, X, S, M_{i0})$: “data mining” on predictive density.

Otherwise, he reports $p(Y_p \mid Y_r, X, S, M_i)$.


So far: From the Policy Maker’s Perspective...

If modelers are provided with the entire sample $Y$, they data-mine and report results from the data-mined model $\tilde{M}_i$.

If modelers are provided with a subsample $Y_r$, they can potentially assess their restrictions $\theta = \beta_i$ and either report results from their actual model $M_i$ or the reference model $M_{i0}$, depending on the relative fit.

If $Y_r$ contains no information from the treatment sample, then modelers have no evidence against $\theta = \beta_i$ and always reveal $M_i$.

In the case of a holdout sample, the policy maker could use predictive distributions for either $Y_p$ or $\hat{\theta}(Y_p, \cdot)$ to weight competing models.


When you come to a fork in the road take it. (Yogi Berra)

| Post-model-averaging estimation | Model weights based on $Y_p \mid Y_r$ pred. density | Model weights based on $\hat{\theta} \mid Y_r$ pred. density |
|---|---|---|
| based only on $Y_r$ sample | $r = 0$ implements† Bayesian model weights | (clearly dominated) |
| based on full $Y$ sample | $r = 0$ implements† Bayesian solution | (see illustration) |

Model building without data?

Reporting high-dimensional predictive densities for $Y_p$?

Current practice in treatment effect literature comes closest to choosing model weights based on the $\hat{\theta}$-predictive density.

† $\lim_{r \to 0} p(Y_{1-r} \mid Y_r, X, S, M_i) = p(Y \mid X, S, M_i)$.

Numerical Illustration

First, we present results conditional on $M_i$ and/or $(\theta, \beta_i)$.

Second, we present results under the marginal distribution of the data

$$p(Y, X, S) = \frac{1}{2} p(Y, X, S \mid M_1) + \frac{1}{2} p(Y, X, S \mid M_2),$$

where

$$p(Y, X, S \mid M_i) = p(X) \, p(S) \int p(Y \mid \theta, \beta, X, S, M_i) \, p(\beta, \theta \mid M_i) \, d(\beta, \theta).$$

$\tau$ is the fraction of observations from the treatment group in the regression sample $Y_r$, which contains a fraction $r$ of the $n$ observations. We consider:

- $\tau = \tau_{\min}$, where $\tau_{\min} = 0$ for $r \leq 0.5$ and then converges to 0.5 as $r \uparrow 1$;
- $\tau = 0.5$.

Rather than conducting model averaging, we consider degenerate model weights that are either 0 or 1 (model selection).


Parameterization

Observable characteristics $X$: $\sigma_1^2 = \sigma_2^2 = \sigma^2 = 2$, $\rho = 0.2$;

Treatment: $s = 2$;

Sample size: $n = 1,000$ (we have a well-defined limit distribution).

Policy maker: prior probability 0.5 for $M_1$ and $M_2$.

Modelers: prior probability of 0.52 for $M_i$ and 0.48 for $M_{i0}$; $\lambda_1 = \lambda_2 = 1$;

Implication of experimental design: probability of the highest posterior probability model being the “true” model:

Integrated: 0.68

Conditional on $\theta = 5 \times$ prior std. dev.: 1.00

Conditional on $\theta = 0.2 \times$ prior std. dev.: 0.51


Policy Experiment and Loss Function

Raise subsidy from s = 2 to s∗ = 4.

Predict outcome for an individual whose relevant characteristic is $x_i = \sigma$ and whose irrelevant characteristic is $x_{-i} = \rho\sigma$.

Loss function is quadratic: $L(y, \hat{y}) = (y - \hat{y})^2$.

Optimal predictor is posterior mean; we consider posterior mean conditional on highest posterior probability model:

$$\hat{y}_{bayes} = \hat{\beta} X_i + \hat{\theta} S = \hat{\theta}_{bayes}(\sigma + s^*)$$

We report the expected value of $(\hat{y} - \hat{y}_{bayes})^2$ under the marginal density of $Y$ (integrated risk differential).
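To make the predictor concrete, the sketch below computes the conjugate posterior mean of θ under the restricted model and the implied prediction at the counterfactual subsidy. It is an illustration under stated assumptions: the posterior-mean value of 0.1 is hypothetical, the function names are invented, and the "wrong x" comparison holds the coefficient fixed purely to isolate the effect of using the irrelevant characteristic.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def posterior_mean_theta(y, x_tilde, lam):
    """Posterior mean of theta in Y = x_tilde * theta + U, U ~ N(0, I),
    with prior theta ~ N(0, 1 / (n * lam^2))."""
    n = len(y)
    return dot(x_tilde, y) / (dot(x_tilde, x_tilde) + n * lam ** 2)

def predict_outcome(theta_mean, x_value, s_star):
    """Prediction theta * (x + s*) implied by the restriction theta = beta."""
    return theta_mean * (x_value + s_star)

# Values from the parameterization: sigma^2 = 2, rho = 0.2, s* = 4.
sigma = math.sqrt(2.0)
rho = 0.2
s_star = 4.0

theta_mean = 0.1  # hypothetical posterior mean, for illustration only
y_relevant = predict_outcome(theta_mean, sigma, s_star)          # uses x_i = sigma
y_irrelevant = predict_outcome(theta_mean, rho * sigma, s_star)  # uses x_{-i} = rho * sigma
squared_gap = (y_irrelevant - y_relevant) ** 2
print(y_relevant, y_irrelevant, squared_gap)
```

In the slides the competing model also carries its own estimate of θ, so the actual risk differential combines this characteristic effect with estimation differences.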


Policy Experiment and Loss Function

Suppose $M_1$ is the highest posterior probability model. The following outcomes are possible.

Full sample data mining if modelers have access to full sample $Y$: modelers introduce $\psi$; estimates of $\beta$ and $\theta$ deviate from $\hat{\theta}_{bayes}$.

Data mining on predictive density:

1. Modeler 1 is “honest” and $M_1$ is selected: $\hat{y} = \hat{y}_{bayes}$.

2. Modeler 1 is not “honest” and policy maker ends up selecting $M_{1,0}$. Misses restriction $\theta = \beta_i$.

3. Modeler 2 is “honest” and $M_2$ is selected. Uses wrong $x$.

4. Modeler 2 is not “honest” and policy maker ends up selecting $M_{2,0}$. Uses wrong $x$, misses restriction.

Example: $r = 0.5$, $\tau = 0.5$

| | Case 1 | Case 2 | Case 3 | Case 4 |
|---|---|---|---|---|
| Probability | 0.62 | 0.14 | 0.18 | 0.06 |
| Cond. $E[(\hat{y}_j - \hat{y}_{bayes})^2]$ | 0.00 | 5.02 | 9.42 | 8.24 |
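The four cases are mutually exclusive and their probabilities sum to one, so the law of total expectation combines the table into one expected squared deviation for this design (r = 0.5, τ = 0.5). The short calculation below uses only the numbers in the table; the variable names are illustrative.

```python
# Probabilities and conditional expected squared deviations for Cases 1 to 4.
probabilities = [0.62, 0.14, 0.18, 0.06]
conditional_losses = [0.00, 5.02, 9.42, 8.24]

# Law of total expectation: weight each conditional loss by its case probability.
contributions = [p * loss for p, loss in zip(probabilities, conditional_losses)]
expected_loss = sum(contributions)

print("total probability:", sum(probabilities))
print("contribution by case:", contributions)
print("expected squared deviation:", round(expected_loss, 4))  # 2.8928
```

Case 3, where the wrong model is selected honestly, contributes the most (about 1.70 of the 2.89 total), because it is both fairly likely and costly.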


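Composition of Estimation Sample $Y_r$, $n = 1,000$

| | $\tau = \tau_{\min}$: Control | $\tau = \tau_{\min}$: Treatment | $\tau = 0.5$: Control | $\tau = 0.5$: Treatment |
|---|---|---|---|---|
| $r = 0.2$ | 200 | 0 | 100 | 100 |
| $r = 0.5$ | 500 | 0 | 250 | 250 |
| $r = 0.8$ | 500 | 300 | 400 | 400 |
| $r = 1.0$ | 500 | 500 | 500 | 500 |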
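The table is consistent with a simple allocation rule. This is an inference from the numbers, not something the slides state. Under τ = τmin the estimation sample is filled with control observations first and treatment observations are added only once the controls are exhausted; under τ = 0.5 it is split evenly. The sketch below reproduces the table under that assumption; the function name and design labels are hypothetical.

```python
def estimation_sample_composition(n, r, design):
    """Return (control, treatment) counts in the estimation sample Y_r.

    Assumes n is split 50/50 between control and treatment, and that
    design is either "tau_min" (controls first) or "balanced" (tau = 0.5).
    """
    n_r = round(r * n)   # size of the estimation sample
    n_control = n // 2   # controls available in the full sample
    if design == "tau_min":
        control = min(n_r, n_control)
        treatment = n_r - control
    elif design == "balanced":
        treatment = n_r // 2
        control = n_r - treatment
    else:
        raise ValueError("design must be 'tau_min' or 'balanced'")
    return control, treatment

for r in (0.2, 0.5, 0.8, 1.0):
    c_min, t_min = estimation_sample_composition(1000, r, "tau_min")
    c_bal, t_bal = estimation_sample_composition(1000, r, "balanced")
    tau_min = t_min / (c_min + t_min)  # implied treatment share
    print(r, (c_min, t_min), (c_bal, t_bal), round(tau_min, 3))
```

Under this rule the implied treatment share is τmin = max(0, (r − 0.5)/r), which is 0 for r ≤ 0.5 and rises to 0.5 as r approaches 1, in line with the description of τmin.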


[Figure: Probability that Modelers 1 and 2 are “honest”, conditional on $M_1$. $M_1$ is solid, $M_2$ is dashed; $\tau = \tau_{\min}$ is blue, $\tau = 0.5$ is red.]

Probability that Modelers are “Honest”

$r \leq 0.5$, $\tau_{\min} = 0$: the modelers have no information that allows them to test the restriction of their model. In turn, they are honest with probability 1.

$\tau = 0.5$ and $\theta$ is small: even for small values of $r$ the modelers find their restrictions rejected with some probability.

For large values of $\theta$ modeler $M_1$ does not find his restrictions rejected, whereas modeler $M_2$ does, with probability 0.6 for $r = 0.99$.

For small values of $\theta$ both modelers find their restrictions rejected with approximately equal probability.


[Figure: Probability that the PM finds the best model, conditional on $M_1$ and $\theta$, under $\hat{\theta}$-density-based selection. $\tau = \tau_{\min}$ is blue, $\tau = 0.5$ is red.]

Prob. PM Finds Best Model Cond. on M1 and θ

The figure confounds the probability that the modelers are honest and the probability that the predictive-density-based selection finds the highest posterior probability model.

Large value of $\theta$:

- $\tau = \tau_{\min}$ dominates $\tau = 0.5$.
- Inverted U-shape. For $r \approx 0.5$ the policy maker finds the highest probability model almost with certainty.
- Conjecture: small $r$ suffers from an imprecise estimate of $\theta$; large $r$ from a short evaluation sample $Y_p$.

Small value of $\theta$:

- Policy maker finds the highest posterior probability model with probability at most 1/2.
- For $\tau = 0.5$ and $r < 0.5$ there is a visible effect of predictive data mining, i.e., the use of $M_{i0}$ instead of $M_i$.


[Figure: Risk conditional on $M_1$ and $\theta$. Data mining on full sample is green.]

Risk Cond. on M1 and θ

Results mirror the probability of the PM finding the highest posterior probability model.

For large values of $\theta$ the policy maker can, with $r = 0.5$ and $\tau_{\min} = 0$, obtain a risk differential that is essentially zero.

The risk associated with full sample data mining is large for both small and large values of $\theta$.

[Figure: Integrated probability that modelers are “honest”. $M_1$ is solid, $M_2$ is dashed; $\tau = \tau_{\min}$ is blue, $\tau = 0.5$ is red.]

Integrated Probability that Modelers are “Honest”

Blue vs. red lines: if $r \leq 0.5$, then $\tau_{\min} = 0$. Thus, the modelers have no information that allows them to test the restriction of their model. In turn, they are honest with probability 1.

Blue vs. red lines: if $\tau = 0.5$, then even for small values of $r$ the modelers find their restrictions rejected with some probability.

Blue vs. red lines: for large values of $r$ the difference between $\tau = 0.5$ and $\tau = \tau_{\min}$ vanishes as $\tau_{\min} \to 0.5$.

Solid versus dashed lines: conditional on $M_1$, the probability that modeler $M_2$ finds his model rejected is higher than that of modeler $M_1$, and vice versa.


[Figure: Integrated probability that the PM finds the best model, and integrated risk. Data mining on full sample is green.]

Relationship to Existing Literature

Stone (1976): Cross-validation – emphasizes that model validation on pseudo-holdout samples can generate a measure of fit that penalizes model complexity.

Leamer (1984): Effect of specification searches on inference in non-experimental setting.

Data Snooping: Lo and MacKinlay (1990): correcting tests of asset pricing theories based on data-snooped portfolios. White (2000): correcting standard errors for tests of no predictive superiority for specification searches.

Discussion:

- In our framework the researcher has no access to $Y_p$ before the policy maker weights the models.
- Cross validation does not rule out our kind of data mining.
- In the context of structural modeling it is not feasible to mimic the data mining / specification search on samples that could have been observed.


Extensions

Model misspecification: include a third model, $M_*$, such that the policy maker entertains the possibility that neither $M_1$ nor $M_2$ is correct.

Specification search versus data mining: modelers could discover that restrictions hold conditional on additional regressors.

Non-random hold-out samples.


Conclusion

We develop a framework that allows us to characterize potential costs of data mining and potential benefits of holdout samples designed to discourage data mining.

In our numerical illustration we find that a model weighting based on a predictive density for the subsidy effect estimate that the policy maker can generate on the full sample is preferable to a selection based on full sample marginal likelihoods that are contaminated by data mining.

In our setup the best results are obtained if the estimation sample consists purely of observations from the control group, so that the treatment observations are held out.


Literature: Examples of Random Holdout Samples

Wise (1985) - housing rent subsidy experiment

Todd and Wolpin (2006) - student attendance subsidy experiment

Duflo, Hanna and Ryan (2009) - teacher attendance subsidy experiment



Literature: Examples of Non-random Holdout Samples

Lumsdaine, Stock, and Wise (1992) - effect of introducing a pension window on retirement:

- estimation sample - pre-window period
- holdout sample - post-window period

Kaboski and Townsend (2007) - effect of Thai Million Baht Program, a transfer to 80,000 villages to start village banks, on village investment:

- estimation sample - pre-program period
- holdout sample - post-program period

Keane and Wolpin (2007) - effect of welfare on female schooling, labor supply, marriage, fertility and take-up:

- estimation sample - individuals in five states: California, Michigan, New York, North Carolina, Ohio
- holdout sample - Texas (very low welfare state)


