The University of California Lecture, 2018
Susan Athey (Stanford): Machine learning for heterogeneous treatment effects and personalized policy estimation
Chaired by A. Colin Cameron (UC Davis)
The Tammy and Jay Levine Foundation
Machine Learning and Causal Inference
Susan Athey – Stanford University
Athey and Imbens (Recursive Partitioning for Heterogeneous Treatment Effects, PNAS, 2016)
Wager and Athey (Estimation and Inference of Causal Effects with Random Forests, JASA, 2018)
Athey, Tibshirani, and Wager (Generalized Random Forests, 2016)
Friedberg, Athey, Tibshirani, and Wager (Local Linear Forests, 2018)
Athey and Wager (Efficient Policy Learning, 2016)
Zhou, Athey, and Wager (Multi-Arm Policy Estimation, 2018)
Dimakopoulou, Athey, and Imbens (Estimation Considerations for Contextual Bandits, 2017)
See also: Athey, Imbens, and Wager (Residual Balancing, forthcoming, JRSS-B); Athey, Bayati, Doudchenko, Imbens, and Khosravi (Matrix Completion Methods for Causal Panel Data Models, 2017); Athey, Blei, and Ruiz (Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements, 2017); Athey et al. (Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data, AEA P&P, 2018)
Machine Learning and Econometrics for Causal Inference
Machine Learning Themes
▶ Regularization: penalization, model averaging, subsampling
▶ Goal: goodness of fit in a held-out test set drawn from the same distribution
▶ Methods have similar goals to semi-parametric estimation, with better practical performance. Theory?
Contributions to Causal Inference
▶ Control for confounders (e.g., double LASSO (Belloni, Chernozhukov, and Hansen); residual balancing (Athey, Imbens, and Wager); double ML (Chernozhukov et al.))
▶ Select from many instruments (Chernozhukov et al.)
▶ Panel data/DID/matrix factorization (Athey et al.; Athey, Blei, and Ruiz)
▶ Today: heterogeneous parameter estimation
See Athey, "The Impact of ML on Economics," for a survey
Treatment Effect Heterogeneity Goals
▶ Insight about mechanisms
▶ Personalized policies
▶ Identifying subgroups (Athey and Imbens, 2016) or other low-dimensional parameter estimates
▶ Testing for heterogeneity across all covariates (List, Shaikh, and Xu, 2016)
▶ Robustness to model specification (Athey and Imbens, 2015)
▶ Personalized estimates with theoretical guarantees (Wager and Athey, 2018; Athey, Tibshirani, and Wager, forthcoming)
▶ Identifying individuals with the highest estimated treatment effects (Chernozhukov et al., 2018)
▶ Estimating optimal policies (Athey and Wager, 2016)
ML Methods for Causal Inference: Treatment Effect Heterogeneity
▶ ML methods perform well in practice, but many do not have well-established statistical properties
▶ Unlike prediction, ground truth for causal parameters is not directly observed
▶ Valid confidence intervals are needed for many applications (A/B testing, drug trials); challenges include adaptive model selection and multiple testing
Some themes of the ML/CI research agenda:
▶ Either decompose the problem into prediction and causal components, or build novel methods inspired by ML
▶ Sample splitting/cross-fitting to avoid spurious findings and to get consistency/asymptotic normality
▶ Build on insights from semi-parametric theory
▶ Use orthogonal moments to build in greater tolerance for slow convergence in estimation of nuisance parameters
▶ Insight: use ML to build data-driven neighborhood functions
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i) comprising:
▶ A feature vector X_i ∈ R^p,
▶ A response Y_i ∈ R, and
▶ A treatment assignment W_i ∈ {0, 1}.
Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Y_i(0) and Y_i(1).
▶ These correspond to the response we would have measured given that the i-th subject received the treatment (W_i = 1) or did not (W_i = 0).
The potential outcomes framework
The goal is to estimate the conditional average treatment effect
τ(x) = E[Y(1) − Y(0) | X = x].
NB: In experiments, we only get to see Y_i = Y_i(W_i).
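The framework above can be made concrete with a small simulation (an illustrative Python sketch, not from the lecture; all names are invented here). Each subject carries both potential outcomes, but only Y_i = Y_i(W_i) is recorded; under random assignment, a difference in means recovers the average effect.

```python
import random

random.seed(0)

def simulate(n=10_000, tau=2.0):
    """Draw (X_i, Y_i, W_i): both potential outcomes exist, one is observed."""
    rows = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)         # feature X_i
        y0 = x                                # potential outcome Y_i(0)
        y1 = x + tau                          # potential outcome Y_i(1)
        w = random.randint(0, 1)              # random treatment assignment W_i
        rows.append((x, y1 if w else y0, w))  # we see only Y_i = Y_i(W_i)
    return rows

def difference_in_means(rows):
    treated = [y for _, y, w in rows if w]
    control = [y for _, y, w in rows if not w]
    return sum(treated) / len(treated) - sum(control) / len(control)

ate_hat = difference_in_means(simulate())
print(round(ate_hat, 1))  # recovers tau = 2.0 up to sampling noise
```

Note that the simulation can store both y0 and y1 only because it is a simulation; in real data the unobserved counterfactual is exactly what makes causal inference hard.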
The potential outcomes framework
If we make no further assumptions, estimating τ(x) is not possible.
▶ The literature often assumes unconfoundedness (Rosenbaum and Rubin, 1983):
{Y_i(0), Y_i(1)} ⊥ W_i | X_i.
▶ When this assumption holds, methods based on matching or propensity score estimation are usually consistent.
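A hedged sketch of why propensity-score methods help under unconfoundedness (illustrative Python; the confounded design and the known propensity function e(x) are assumptions made up for this example): with confounding, the naive difference in means is biased, while inverse propensity weighting recovers the true effect.

```python
import random

random.seed(7)

def e(x):
    """Propensity score: treatment probability depends on x (confounding)."""
    return 0.25 if x < 0 else 0.75

rows = []
for _ in range(20_000):
    x = random.uniform(-1, 1)
    w = 1 if random.random() < e(x) else 0
    y = x + 1.0 * w + random.gauss(0, 0.2)   # true effect tau = 1; baseline rises in x
    rows.append((x, y, w))

# Naive difference in means: biased, because treated units have higher x.
naive = (sum(y for x, y, w in rows if w) / sum(w for _, _, w in rows)
         - sum(y for x, y, w in rows if not w) / sum(1 - w for _, _, w in rows))

# Inverse propensity weighting with the (here, known) propensity score.
ipw = sum(w * y / e(x) - (1 - w) * y / (1 - e(x)) for x, y, w in rows) / len(rows)
```

In practice e(x) must itself be estimated, which is one of the prediction sub-problems the ML/CI agenda above hands to machine learning.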
Regression Trees: Titanic Example
Regression Trees: The Tree as a Partition
Causal Trees
Divide the population into subgroups to minimize the MSE in treatment effects.
▶ Goal: report heterogeneity without a pre-analysis plan but with valid confidence intervals
▶ Moving the goalposts: the method defines the estimand (treatment effects for subgroups) and generates the estimates
▶ Solve the over-fitting problem with sample splitting: choose subgroups in half the sample and estimate on the other half
Challenges
The individual treatment effect τ_i is never observed, so the infeasible criterion Σ_i (τ_i − τ̂(X_i))² cannot be computed directly.
▶ Need to estimate the objective in order to optimize it, rather than take a simple average of squared errors.
Three samples: a model-selection/tree-construction sample S^tr, an estimation sample for leaf effects S^est, and a (hypothetical) test sample S^te.
Given a partition Π, τ̂(X_i; S^est, Π) is the sample average treatment effect in sample S^est for the leaf ℓ(X_i; Π) containing X_i.
Criterion for evaluating a partition Π, anticipating re-estimation of leaf effects using sample splitting:
MSE_τ(S^est, S^te) = Σ_{i ∈ S^te} (τ_i − τ̂(X_i; S^est, Π))²,
E[MSE_τ(S^est, S^te)] = E[τ̂(X_i; S^est, Π)²] − 2 E[τ_i · τ̂(X_i; S^est, Π)] + E[τ_i²]
                      = −E[τ̂(X_i; S^est, Π)²] + E[τ_i²].
The last equality makes use of the fact that the estimates are unbiased in an independent test sample. We can construct empirical estimates of each of these quantities except the last, E[τ_i²], which does not depend on Π and thus does not affect partition selection.
Causal Tree Algorithm
▶ Divide the data into tree-building (S^tr) and estimation (S^est) samples
▶ Use a greedy algorithm to recursively partition the covariate space X into a deep partition
▶ At each node, the split is selected as the one that minimizes our estimate of the EMSE over all possible binary splits
▶ Preserve a minimum number of treated and control units in each child leaf
▶ Use cross-validation to select the depth d of the partition that minimizes an estimate of the MSE of treatment effects, using left-out folds as proxies for the test set
▶ Select the partition by pruning to depth d, pruning the leaves that provide the smallest improvement in goodness of fit
▶ Estimate the treatment effects in each leaf of Π using the estimation sample S^est
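A minimal sketch of the first greedy step, assuming a single covariate and a one-level tree (illustrative Python, not the causalTree implementation): split honestly into S^tr and S^est, pick the cut on S^tr by the estimable part of the criterion, −(1/N) Σ_i τ̂(X_i)², subject to minimum treated/control counts per child, then re-estimate leaf effects on S^est.

```python
import random

random.seed(3)
MIN_PER_ARM = 25  # minimum treated and control units per child leaf

def leaf_effect(rows):
    """Sample average treatment effect within a leaf."""
    treated = [y for _, y, w in rows if w]
    control = [y for _, y, w in rows if not w]
    return sum(treated) / len(treated) - sum(control) / len(control)

def arms_ok(rows):
    n_treated = sum(w for _, _, w in rows)
    return n_treated >= MIN_PER_ARM and len(rows) - n_treated >= MIN_PER_ARM

def best_split(rows, candidates):
    """Greedy step: minimize -(1/N) sum_i tau_hat(X_i)^2 over valid binary splits."""
    best_cut, best_score = None, float("inf")
    for c in candidates:
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        if not (arms_ok(left) and arms_ok(right)):
            continue
        score = -(len(left) * leaf_effect(left) ** 2 +
                  len(right) * leaf_effect(right) ** 2) / len(rows)
        if score < best_score:
            best_cut, best_score = c, score
    return best_cut

# Toy data: the treatment effect jumps from 0 to 2 at x = 0.
data = []
for _ in range(2000):
    x = random.uniform(-1, 1)
    w = random.randint(0, 1)
    data.append((x, (2.0 if x > 0 else 0.0) * w + random.gauss(0, 0.5), w))

s_tr, s_est = data[:1000], data[1000:]          # honest sample split
cut = best_split(s_tr, [i / 10 for i in range(-9, 10)])
tau_low = leaf_effect([r for r in s_est if r[0] <= cut])
tau_high = leaf_effect([r for r in s_est if r[0] > cut])
```

The full algorithm applies this step recursively, adds a variance correction to the split criterion, and prunes by cross-validation; none of that is shown here.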
Causal Trees: Search Demotion Example
Causal Trees: Adaptive versus Honest Estimates
Crucial to use sample splitting!
Low-Dimensional Representations vs. Fully Nonparametric Estimation
Causal Trees
▶ Easy to interpret, easy to mis-interpret
▶ There can be many trees
▶ Leaves differ in many ways if covariates are correlated; describe leaves by the means of all covariates
Causal Forests
▶ Can estimate partial effects
▶ In high dimensions, can still have omitted variable issues
▶ Confidence intervals lose coverage in high dimensions (bias)
Baseline method: k-NN matching
τ̂(x) = (1/k) Σ_{i ∈ S_1(x)} Y_i − (1/k) Σ_{i ∈ S_0(x)} Y_i,
where S_1(x)/S_0(x) is the set of the k nearest cases/controls to x. This is consistent given unconfoundedness and regularity conditions.
▶ Pro: Transparent asymptotics and good, robust performance when p is small.
▶ Con: Acute curse of dimensionality, even when p = 20 and n = 20k.
NB: Kernels have similar qualitative issues as k-NN.
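The estimator above is short enough to write out in full for p = 1 (illustrative Python; knn_match is an invented name): average the k nearest treated outcomes minus the k nearest control outcomes.

```python
import random

def knn_match(x, rows, k=25):
    """tau_hat(x): mean of k nearest treated outcomes minus k nearest controls."""
    treated = sorted((abs(xi - x), yi) for xi, yi, wi in rows if wi)
    control = sorted((abs(xi - x), yi) for xi, yi, wi in rows if not wi)
    mean = lambda pairs: sum(y for _, y in pairs) / len(pairs)
    return mean(treated[:k]) - mean(control[:k])

random.seed(4)
rows = []
for _ in range(5000):
    x = random.uniform(-1, 1)
    w = random.randint(0, 1)
    rows.append((x, x + 2.0 * x * w + random.gauss(0, 0.1), w))  # tau(x) = 2x

est = knn_match(0.5, rows)   # roughly tau(0.5) = 1.0
```

With one feature the neighborhoods are tight and the estimate tracks τ(x) well; the curse of dimensionality noted above is precisely that these same neighborhoods become huge when p grows.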
Adaptive nearest neighbor matching
Random forests are a popular heuristic for adaptive nearest neighbors estimation, introduced by Breiman (2001).
▶ Pro: Excellent empirical track record.
▶ Con: Often used as a black box, without statistical discussion.
There has been considerable interest in using forest-like methods for treatment effect estimation, but without formal theory.
▶ Green and Kern (2012) and Hill (2011) have considered using Bayesian forest algorithms (BART; Chipman et al., 2010).
▶ Several authors have also studied related tree-based methods: Athey and Imbens (2016), Su et al. (2009), Taddy et al. (2014), Wang and Rudin (2015), Zeileis et al. (2008), ...
Wager and Athey (2015) provide the first formal results allowing random forests to be used for provably valid asymptotic inference.
Making k-NN matching adaptive
Athey and Imbens (2016) introduce the causal tree: it defines neighborhoods for matching based on recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984), and advocates sample splitting (with a modified splitting rule) to get assumption-free confidence intervals for the treatment effects in each leaf.
[Figures: a Euclidean neighborhood, for k-NN matching, versus a tree-based neighborhood.]
Suppose we have a training set {(X_i, Y_i, W_i)}_{i=1}^n and a tree predictor
τ̂(x) = T(x; {(X_i, Y_i, W_i)}_{i=1}^n).
Random forest idea: build and average many different trees T*:
τ̂(x) = (1/B) Σ_{b=1}^B T*_b(x; {(X_i, Y_i, W_i)}_{i=1}^n).
We turn T into T* by:
▶ Bagging/subsampling the training set (Breiman, 1996); this helps smooth over discontinuities (Buhlmann and Yu, 2002).
▶ Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman, 1997).
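A toy version of this recipe (illustrative Python; with a single feature, the random-split-variable step has nothing to randomize over and is omitted): each "tree" is a depth-1 stump fit on a random subsample, and the forest averages their predictions, smoothing the fitted step.

```python
import random

random.seed(5)

def fit_stump(sample, cuts):
    """A depth-1 regression tree: choose the cut minimizing squared error."""
    best, best_sse = None, float("inf")
    for c in cuts:
        left = [y for x, y in sample if x <= c]
        right = [y for x, y in sample if x > c]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for x, y in sample if x <= c) +
               sum((y - mr) ** 2 for x, y in sample if x > c))
        if sse < best_sse:
            best, best_sse = (c, ml, mr), sse
    c, ml, mr = best
    return lambda x: ml if x <= c else mr

def forest_predict(x, data, B=200, s=100):
    """Average B trees, each grown on a random subsample of size s."""
    cuts = [i / 10 for i in range(-9, 10)]
    preds = [fit_stump(random.sample(data, s), cuts)(x) for _ in range(B)]
    return sum(preds) / B

# Step-function signal plus noise; subsample-and-average smooths the estimate.
data = [(x, (1.0 if x > 0 else 0.0) + random.gauss(0, 0.2))
        for x in (random.uniform(-1, 1) for _ in range(1000))]
pred = forest_predict(0.8, data)
```

Production forests grow deep trees and draw the candidate split variables at random; the subsample-then-average structure shown here is the part the asymptotic theory below leans on.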
Statistical inference with regression forests
Honest trees do not use the same data to select the partition (splits) and make predictions. Examples: split-sample trees, propensity trees.
Theorem. (Wager and Athey, JASA, 2018) Regression forests are asymptotically Gaussian and centered,
(μ̂_n(x) − μ(x)) / σ_n(x) ⇒ N(0, 1),
given the following assumptions (+ technical conditions):
1. Honesty. Individual trees are honest.
2. Subsampling. Individual trees are built on random subsamples of size s = n^β, where β_min < β < 1.
3. Continuous features. The features X_i have a density that is bounded away from 0 and ∞.
4. Lipschitz response. The conditional mean function μ(x) = E[Y | X = x] is Lipschitz continuous.
Proof idea
Let Z_i = (X_i, Y_i), and write the forest estimate as μ̂ := μ̂(Z_1, ..., Z_n). Its Hajek projection is
μ̊ = E[μ̂] + Σ_{i=1}^n (E[μ̂ | Z_i] − E[μ̂]).
Classical results give Var[μ̊] ≤ Var[μ̂], and that lim_{n→∞} Var[μ̊]/Var[μ̂] = 1 implies asymptotic normality.
Now, let μ̂*_b(x) denote the estimate for μ(x) given by a single regression tree, and let μ̊*_b be its Hajek projection.
▶ Using the adaptive nearest neighbors framework of Lin and Jeon (2006), we show that Var[μ̊*_b] ≳ Var[μ̂*_b] / log^p(s).
▶ As a consequence of the ANOVA decomposition of Efron and Stein (1981), the full forest gets Var[μ̊]/Var[μ̂] → 1, thus yielding the asymptotic normality result for s = n^β for any 0 < β < 1.
▶ For centering, we bound the bias by requiring β > β_min.
Variance estimation for regression forests
We estimate the variance of the regression forest using the infinitesimal jackknife for random forests (Wager, Hastie, and Efron, 2014). For each of the b = 1, ..., B trees comprising the forest, define:
▶ the estimated response μ̂*_b(x), and
▶ N*_{bi}, the number of times observation i appears in the subsample used to build the b-th tree.
Then, defining Cov_b as the covariance taken with respect to all the trees comprising the forest, we set
σ̂²_n(x) = ((n − 1)/n) (n/(n − s))² Σ_{i=1}^n Cov_b[μ̂*_b(x), N*_{bi}]².
Theorem. (Wager and Athey, 2018) Given the same conditions as used for asymptotic normality, the infinitesimal jackknife for regression forests is consistent: σ̂²_n(x)/σ²_n(x) →_p 1.
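The infinitesimal jackknife formula can be exercised on a toy "forest" in which each tree's prediction is simply the mean of its subsample (illustrative Python; note that with finite B the raw estimate carries Monte Carlo noise, for which Wager, Hastie, and Efron (2014) discuss a correction not shown here).

```python
import random

random.seed(6)
n, B, s = 200, 1000, 50
data = [random.gauss(0.0, 1.0) for _ in range(n)]

# Each "tree" is just its subsample's mean: a stand-in for mu*_b(x) at fixed x.
preds, counts = [], []
for _ in range(B):
    idx = set(random.sample(range(n), s))            # subsample without replacement
    preds.append(sum(data[i] for i in idx) / s)
    counts.append([1 if i in idx else 0 for i in range(n)])   # N*_{bi}

mean_pred = sum(preds) / B
var_hat = 0.0
for i in range(n):
    mean_n = sum(c[i] for c in counts) / B
    cov = sum((c[i] - mean_n) * (p - mean_pred)
              for c, p in zip(counts, preds)) / B    # Cov_b[mu*_b(x), N*_{bi}]
    var_hat += cov ** 2
var_hat *= (n - 1) / n * (n / (n - s)) ** 2          # the scaling from the slide

print(var_hat > 0.0)  # a positive variance estimate, on the order of 1/n here
```

The idea is that observations whose inclusion count co-varies strongly with the tree predictions drive the forest's sampling variability; squaring and summing those covariances estimates it.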
Causal forest example
We have n = 20k observations whose features are distributed as X ~ U([−1, 1]^p) with p = 6; treatment assignment is random. All the signal is concentrated along two features.
The plots below depict τ̂(x) for 10k random test examples, projected into the 2 signal dimensions.
[Figure: true effect τ(x), causal forest estimate, and k-NN estimate over (x1, x2).]
Software: causalTree for R (Athey, Kong, and Wager, 2015), available on GitHub: susanathey/causalTree
Causal forest example
We have n = 20k observations whose features are distributed as X ~ U([−1, 1]^p) with p = 20; treatment assignment is random. All the signal is concentrated along two features.
The plots below depict τ̂(x) for 10k random test examples, projected into the 2 signal dimensions.
[Figure: true effect τ(x), causal forest estimate, and k-NN estimate.]