The University of California Lecture, 2018
Susan Athey (Stanford): Machine learning for heterogeneous treatment effects and personalized policy estimation
Chaired by A. Colin Cameron (UC Davis)
The Tammy and Jay Levine Foundation
Machine Learning and Causal Inference
Susan Athey – Stanford University
Athey and Imbens (Recursive Partitioning for Heterogeneous Treatment Effects, PNAS, 2016)
Wager and Athey (Estimation and Inference of Causal Effects with Random Forests, JASA, 2018)
Athey, Tibshirani, and Wager (Generalized Random Forests, 2016)
Friedberg, Athey, Tibshirani, and Wager (Local Linear Forests, 2018)
Athey and Wager (Efficient Policy Learning, 2016)
Zhou, Athey, and Wager (Multi-Arm Policy Estimation, 2018)
Dimakopoulou, Athey, and Imbens (Estimation Considerations for Contextual Bandits, 2017)
See also: Athey, Imbens, and Wager (Residual Balancing, forthcoming, JRSS-B); Athey, Bayati, Doudchenko, Imbens, and Khosravi (Matrix Completion Methods for Causal Panel Data Models, 2017); Athey, Blei, and Ruiz (Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements, 2017); Athey et al. (Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data, AEA P&P, 2018)
Machine Learning and Econometrics for Causal Inference
Machine Learning Themes
▶ Regularization: penalization, model averaging, subsampling
▶ Goal: goodness of fit in a held-out test set drawn from the same distribution
▶ Methods have similar goals to semi-parametric estimation, with better practical performance. Theory?
Contributions to Causal Inference
▶ Control for confounders (e.g., double LASSO (Belloni, Chernozhukov, and Hansen); residual balancing (Athey, Imbens, and Wager); double ML (Chernozhukov et al.))
▶ Select from many instruments (Chernozhukov et al.)
▶ Panel data/DID/matrix factorization (Athey et al.; Athey, Blei, and Ruiz)
▶ Today: heterogeneous parameter estimation
See Athey, "The Impact of ML on Economics," for a survey
Treatment Effect Heterogeneity Goals
▶ Insight about mechanisms
▶ Personalized policies
▶ Identifying subgroups (Athey and Imbens, 2016) or other low-dimensional parameter estimates
▶ Testing for heterogeneity across all covariates (List, Shaikh, and Xu, 2016)
▶ Robustness to model specification (Athey and Imbens, 2015)
▶ Personalized estimates with theoretical guarantees (Wager and Athey, 2018; Athey, Tibshirani, and Wager, forthcoming)
▶ Identifying individuals with the highest estimated treatment effects (Chernozhukov et al., 2018)
▶ Estimating optimal policies (Athey and Wager, 2016)
ML Methods for Causal Inference: Treatment Effect Heterogeneity
▶ ML methods perform well in practice, but many do not have well-established statistical properties
▶ Unlike prediction, ground truth for causal parameters is not directly observed
▶ Valid confidence intervals are needed for many applications (A/B testing, drug trials); challenges include adaptive model selection and multiple testing
Some themes of the ML/CI research agenda:
▶ Either decompose the problem into prediction and causal components, or build novel methods inspired by ML
▶ Sample splitting/cross-fitting to avoid spurious findings and to get consistency/asymptotic normality
▶ Build on insights from semi-parametric theory
▶ Use orthogonal moments to build in greater tolerance for slow convergence in estimation of nuisance parameters
▶ Insight: use ML to build data-driven neighborhood functions
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i) comprising:
▶ A feature vector X_i ∈ R^p,
▶ A response Y_i ∈ R, and
▶ A treatment assignment W_i ∈ {0, 1}.
Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Y_i(0) and Y_i(1).
▶ These correspond to the response we would have measured given that the i-th subject received the treatment (W_i = 1) or did not (W_i = 0).
The potential outcomes framework
The goal is to estimate the conditional average treatment effect
τ(x) = E[Y(1) − Y(0) | X = x].
NB: In experiments, we only get to see Y_i = Y_i(W_i).
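The framework above can be made concrete with a small simulation (an illustrative Python sketch, not from the lecture; all names are invented here). Each subject carries both potential outcomes, but only Y_i = Y_i(W_i) is recorded; under random assignment, a difference in means recovers the average effect.

```python
import random

random.seed(0)

def simulate(n=10_000, tau=2.0):
    """Draw (X_i, Y_i, W_i): both potential outcomes exist, one is observed."""
    rows = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)         # feature X_i
        y0 = x                                # potential outcome Y_i(0)
        y1 = x + tau                          # potential outcome Y_i(1)
        w = random.randint(0, 1)              # random treatment assignment W_i
        rows.append((x, y1 if w else y0, w))  # we see only Y_i = Y_i(W_i)
    return rows

def difference_in_means(rows):
    treated = [y for _, y, w in rows if w]
    control = [y for _, y, w in rows if not w]
    return sum(treated) / len(treated) - sum(control) / len(control)

ate_hat = difference_in_means(simulate())
print(round(ate_hat, 1))  # recovers tau = 2.0 up to sampling noise
```

Note that the simulation can store both y0 and y1 only because it is a simulation; in real data the unobserved counterfactual is exactly what makes causal inference hard.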
The potential outcomes framework
If we make no further assumptions, estimating τ(x) is not possible.
▶ The literature often assumes unconfoundedness (Rosenbaum and Rubin, 1983):
{Y_i(0), Y_i(1)} ⊥ W_i | X_i.
▶ When this assumption holds, methods based on matching or propensity score estimation are usually consistent.
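A hedged sketch of why propensity-score methods help under unconfoundedness (illustrative Python; the confounded design and the known propensity function e(x) are assumptions made up for this example): with confounding, the naive difference in means is biased, while inverse propensity weighting recovers the true effect.

```python
import random

random.seed(7)

def e(x):
    """Propensity score: treatment probability depends on x (confounding)."""
    return 0.25 if x < 0 else 0.75

rows = []
for _ in range(20_000):
    x = random.uniform(-1, 1)
    w = 1 if random.random() < e(x) else 0
    y = x + 1.0 * w + random.gauss(0, 0.2)   # true effect tau = 1; baseline rises in x
    rows.append((x, y, w))

# Naive difference in means: biased, because treated units have higher x.
naive = (sum(y for x, y, w in rows if w) / sum(w for _, _, w in rows)
         - sum(y for x, y, w in rows if not w) / sum(1 - w for _, _, w in rows))

# Inverse propensity weighting with the (here, known) propensity score.
ipw = sum(w * y / e(x) - (1 - w) * y / (1 - e(x)) for x, y, w in rows) / len(rows)
```

In practice e(x) must itself be estimated, which is one of the prediction sub-problems the ML/CI agenda above hands to machine learning.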
Regression Trees: Titanic Example
Regression Trees: The Tree as a Partition
Causal Trees
Divide the population into subgroups to minimize the MSE in treatment effects.
▶ Goal: report heterogeneity without a pre-analysis plan but with valid confidence intervals
▶ Moving the goalposts: the method defines the estimand (treatment effects for subgroups) and generates the estimates
▶ Solve the over-fitting problem with sample splitting: choose subgroups in half the sample and estimate on the other half
Challenges
The individual treatment effect τ_i is never observed, so the infeasible criterion Σ_i (τ_i − τ̂(X_i))² cannot be computed directly.
▶ Need to estimate the objective in order to optimize it, rather than take a simple average of squared errors.
Three samples: a model-selection/tree-construction sample S^tr, an estimation sample for leaf effects S^est, and a (hypothetical) test sample S^te.
Given a partition Π, τ̂(X_i; S^est, Π) is the sample average treatment effect in sample S^est for the leaf ℓ(X_i; Π) containing X_i.
Criterion for evaluating a partition Π, anticipating re-estimation of leaf effects using sample splitting:
MSE_τ(S^est, S^te) = Σ_{i ∈ S^te} (τ_i − τ̂(X_i; S^est, Π))²,
E[MSE_τ(S^est, S^te)] = E[τ̂(X_i; S^est, Π)²] − 2 E[τ_i · τ̂(X_i; S^est, Π)] + E[τ_i²]
                      = −E[τ̂(X_i; S^est, Π)²] + E[τ_i²].
The last equality makes use of the fact that the estimates are unbiased in an independent test sample. We can construct empirical estimates of each of these quantities except the last, E[τ_i²], which does not depend on Π and thus does not affect partition selection.
Causal Tree Algorithm
▶ Divide the data into tree-building (S^tr) and estimation (S^est) samples
▶ Use a greedy algorithm to recursively partition the covariate space X into a deep partition
▶ At each node, the split is selected as the one that minimizes our estimate of the EMSE over all possible binary splits
▶ Preserve a minimum number of treated and control units in each child leaf
▶ Use cross-validation to select the depth d of the partition that minimizes an estimate of the MSE of treatment effects, using left-out folds as proxies for the test set
▶ Select the partition by pruning to depth d, pruning the leaves that provide the smallest improvement in goodness of fit
▶ Estimate the treatment effects in each leaf of Π using the estimation sample S^est
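A minimal sketch of the first greedy step, assuming a single covariate and a one-level tree (illustrative Python, not the causalTree implementation): split honestly into S^tr and S^est, pick the cut on S^tr by the estimable part of the criterion, −(1/N) Σ_i τ̂(X_i)², subject to minimum treated/control counts per child, then re-estimate leaf effects on S^est.

```python
import random

random.seed(3)
MIN_PER_ARM = 25  # minimum treated and control units per child leaf

def leaf_effect(rows):
    """Sample average treatment effect within a leaf."""
    treated = [y for _, y, w in rows if w]
    control = [y for _, y, w in rows if not w]
    return sum(treated) / len(treated) - sum(control) / len(control)

def arms_ok(rows):
    n_treated = sum(w for _, _, w in rows)
    return n_treated >= MIN_PER_ARM and len(rows) - n_treated >= MIN_PER_ARM

def best_split(rows, candidates):
    """Greedy step: minimize -(1/N) sum_i tau_hat(X_i)^2 over valid binary splits."""
    best_cut, best_score = None, float("inf")
    for c in candidates:
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        if not (arms_ok(left) and arms_ok(right)):
            continue
        score = -(len(left) * leaf_effect(left) ** 2 +
                  len(right) * leaf_effect(right) ** 2) / len(rows)
        if score < best_score:
            best_cut, best_score = c, score
    return best_cut

# Toy data: the treatment effect jumps from 0 to 2 at x = 0.
data = []
for _ in range(2000):
    x = random.uniform(-1, 1)
    w = random.randint(0, 1)
    data.append((x, (2.0 if x > 0 else 0.0) * w + random.gauss(0, 0.5), w))

s_tr, s_est = data[:1000], data[1000:]          # honest sample split
cut = best_split(s_tr, [i / 10 for i in range(-9, 10)])
tau_low = leaf_effect([r for r in s_est if r[0] <= cut])
tau_high = leaf_effect([r for r in s_est if r[0] > cut])
```

The full algorithm applies this step recursively, adds a variance correction to the split criterion, and prunes by cross-validation; none of that is shown here.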
Causal Trees: Search Demotion Example
Causal Trees: Adaptive versus Honest Estimates
Crucial to use sample splitting!
Low-Dimensional Representations vs. Fully Nonparametric Estimation
Causal Trees
▶ Easy to interpret, easy to mis-interpret
▶ There can be many trees
▶ Leaves differ in many ways if covariates are correlated; describe leaves by the means of all covariates
Causal Forests
▶ Can estimate partial effects
▶ In high dimensions, can still have omitted variable issues
▶ Confidence intervals lose coverage in high dimensions (bias)
Baseline method: k-NN matching
τ̂(x) = (1/k) Σ_{i ∈ S_1(x)} Y_i − (1/k) Σ_{i ∈ S_0(x)} Y_i,
where S_1(x)/S_0(x) is the set of the k nearest cases/controls to x. This is consistent given unconfoundedness and regularity conditions.
▶ Pro: Transparent asymptotics and good, robust performance when p is small.
▶ Con: Acute curse of dimensionality, even when p = 20 and n = 20k.
NB: Kernels have similar qualitative issues as k-NN.
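The estimator above is short enough to write out in full for p = 1 (illustrative Python; knn_match is an invented name): average the k nearest treated outcomes minus the k nearest control outcomes.

```python
import random

def knn_match(x, rows, k=25):
    """tau_hat(x): mean of k nearest treated outcomes minus k nearest controls."""
    treated = sorted((abs(xi - x), yi) for xi, yi, wi in rows if wi)
    control = sorted((abs(xi - x), yi) for xi, yi, wi in rows if not wi)
    mean = lambda pairs: sum(y for _, y in pairs) / len(pairs)
    return mean(treated[:k]) - mean(control[:k])

random.seed(4)
rows = []
for _ in range(5000):
    x = random.uniform(-1, 1)
    w = random.randint(0, 1)
    rows.append((x, x + 2.0 * x * w + random.gauss(0, 0.1), w))  # tau(x) = 2x

est = knn_match(0.5, rows)   # roughly tau(0.5) = 1.0
```

With one feature the neighborhoods are tight and the estimate tracks τ(x) well; the curse of dimensionality noted above is precisely that these same neighborhoods become huge when p grows.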
Adaptive nearest neighbor matching
Random forests are a popular heuristic for adaptive nearest neighbors estimation, introduced by Breiman (2001).
▶ Pro: Excellent empirical track record.
▶ Con: Often used as a black box, without statistical discussion.
There has been considerable interest in using forest-like methods for treatment effect estimation, but without formal theory.
▶ Green and Kern (2012) and Hill (2011) have considered using Bayesian forest algorithms (BART; Chipman et al., 2010).
▶ Several authors have also studied related tree-based methods: Athey and Imbens (2016), Su et al. (2009), Taddy et al. (2014), Wang and Rudin (2015), Zeileis et al. (2008), ...
Wager and Athey (2015) provide the first formal results allowing random forests to be used for provably valid asymptotic inference.
Making k-NN matching adaptive
Athey and Imbens (2016) introduce the causal tree: it defines neighborhoods for matching based on recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984), and advocates sample splitting (with a modified splitting rule) to get assumption-free confidence intervals for the treatment effects in each leaf.
[Figures: a Euclidean neighborhood, for k-NN matching, versus a tree-based neighborhood.]
Suppose we have a training set {(X_i, Y_i, W_i)}_{i=1}^n and a tree predictor
τ̂(x) = T(x; {(X_i, Y_i, W_i)}_{i=1}^n).
Random forest idea: build and average many different trees T*:
τ̂(x) = (1/B) Σ_{b=1}^B T*_b(x; {(X_i, Y_i, W_i)}_{i=1}^n).
We turn T into T* by:
▶ Bagging/subsampling the training set (Breiman, 1996); this helps smooth over discontinuities (Buhlmann and Yu, 2002).
▶ Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman, 1997).
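A toy version of this recipe (illustrative Python; with a single feature, the random-split-variable step has nothing to randomize over and is omitted): each "tree" is a depth-1 stump fit on a random subsample, and the forest averages their predictions, smoothing the fitted step.

```python
import random

random.seed(5)

def fit_stump(sample, cuts):
    """A depth-1 regression tree: choose the cut minimizing squared error."""
    best, best_sse = None, float("inf")
    for c in cuts:
        left = [y for x, y in sample if x <= c]
        right = [y for x, y in sample if x > c]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for x, y in sample if x <= c) +
               sum((y - mr) ** 2 for x, y in sample if x > c))
        if sse < best_sse:
            best, best_sse = (c, ml, mr), sse
    c, ml, mr = best
    return lambda x: ml if x <= c else mr

def forest_predict(x, data, B=200, s=100):
    """Average B trees, each grown on a random subsample of size s."""
    cuts = [i / 10 for i in range(-9, 10)]
    preds = [fit_stump(random.sample(data, s), cuts)(x) for _ in range(B)]
    return sum(preds) / B

# Step-function signal plus noise; subsample-and-average smooths the estimate.
data = [(x, (1.0 if x > 0 else 0.0) + random.gauss(0, 0.2))
        for x in (random.uniform(-1, 1) for _ in range(1000))]
pred = forest_predict(0.8, data)
```

Production forests grow deep trees and draw the candidate split variables at random; the subsample-then-average structure shown here is the part the asymptotic theory below leans on.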
Statistical inference with regression forests
Honest trees do not use the same data to select the partition (splits) and make predictions. Examples: split-sample trees, propensity trees.
Theorem. (Wager and Athey, JASA, 2018) Regression forests are asymptotically Gaussian and centered,
(μ̂_n(x) − μ(x)) / σ_n(x) ⇒ N(0, 1),
given the following assumptions (+ technical conditions):
1. Honesty. Individual trees are honest.
2. Subsampling. Individual trees are built on random subsamples of size s = n^β, where β_min < β < 1.
3. Continuous features. The features X_i have a density that is bounded away from 0 and ∞.
4. Lipschitz response. The conditional mean function μ(x) = E[Y | X = x] is Lipschitz continuous.
Proof idea
Let Z_i = (X_i, Y_i), and write the forest estimate as μ̂ := μ̂(Z_1, ..., Z_n). Its Hajek projection is
μ̊ = E[μ̂] + Σ_{i=1}^n (E[μ̂ | Z_i] − E[μ̂]).
Classical results give Var[μ̊] ≤ Var[μ̂], and that lim_{n→∞} Var[μ̊]/Var[μ̂] = 1 implies asymptotic normality.
Now, let μ̂*_b(x) denote the estimate for μ(x) given by a single regression tree, and let μ̊*_b be its Hajek projection.
▶ Using the adaptive nearest neighbors framework of Lin and Jeon (2006), we show that Var[μ̊*_b] ≳ Var[μ̂*_b] / log^p(s).
▶ As a consequence of the ANOVA decomposition of Efron and Stein (1981), the full forest gets Var[μ̊]/Var[μ̂] → 1, thus yielding the asymptotic normality result for s = n^β for any 0 < β < 1.
▶ For centering, we bound the bias by requiring β > β_min.
Variance estimation for regression forests
We estimate the variance of the regression forest using the infinitesimal jackknife for random forests (Wager, Hastie, and Efron, 2014). For each of the b = 1, ..., B trees comprising the forest, define:
▶ the estimated response μ̂*_b(x), and
▶ N*_{bi}, the number of times observation i appears in the subsample used to build the b-th tree.
Then, defining Cov_b as the covariance taken with respect to all the trees comprising the forest, we set
σ̂²_n(x) = ((n − 1)/n) (n/(n − s))² Σ_{i=1}^n Cov_b[μ̂*_b(x), N*_{bi}]².
Theorem. (Wager and Athey, 2018) Given the same conditions as used for asymptotic normality, the infinitesimal jackknife for regression forests is consistent: σ̂²_n(x)/σ²_n(x) →_p 1.
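The infinitesimal jackknife formula can be exercised on a toy "forest" in which each tree's prediction is simply the mean of its subsample (illustrative Python; note that with finite B the raw estimate carries Monte Carlo noise, for which Wager, Hastie, and Efron (2014) discuss a correction not shown here).

```python
import random

random.seed(6)
n, B, s = 200, 1000, 50
data = [random.gauss(0.0, 1.0) for _ in range(n)]

# Each "tree" is just its subsample's mean: a stand-in for mu*_b(x) at fixed x.
preds, counts = [], []
for _ in range(B):
    idx = set(random.sample(range(n), s))            # subsample without replacement
    preds.append(sum(data[i] for i in idx) / s)
    counts.append([1 if i in idx else 0 for i in range(n)])   # N*_{bi}

mean_pred = sum(preds) / B
var_hat = 0.0
for i in range(n):
    mean_n = sum(c[i] for c in counts) / B
    cov = sum((c[i] - mean_n) * (p - mean_pred)
              for c, p in zip(counts, preds)) / B    # Cov_b[mu*_b(x), N*_{bi}]
    var_hat += cov ** 2
var_hat *= (n - 1) / n * (n / (n - s)) ** 2          # the scaling from the slide

print(var_hat > 0.0)  # a positive variance estimate, on the order of 1/n here
```

The idea is that observations whose inclusion count co-varies strongly with the tree predictions drive the forest's sampling variability; squaring and summing those covariances estimates it.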
Causal forest example
We have n = 20k observations whose features are distributed as X ~ U([−1, 1]^p) with p = 6; treatment assignment is random. All the signal is concentrated along two features.
The plots below depict τ̂(x) for 10k random test examples, projected into the 2 signal dimensions.
[Figure: true effect τ(x), causal forest estimate, and k-NN estimate over (x1, x2).]
Software: causalTree for R (Athey, Kong, and Wager, 2015), available on GitHub: susanathey/causalTree
Causal forest example
We have n = 20k observations whose features are distributed as X ~ U([−1, 1]^p) with p = 20; treatment assignment is random. All the signal is concentrated along two features.
The plots below depict τ̂(x) for 10k random test examples, projected into the 2 signal dimensions.
[Figure: true effect τ(x), causal forest estimate, and k-NN estimate.]