arXiv:1807.09462v1 [stat.ML] 25 Jul 2018

PROPENSITY SCORE ESTIMATION USING CLASSIFICATION AND REGRESSION TREES IN THE PRESENCE OF MISSING COVARIATE DATA

Bas B.L. Penning de Vries*†, Maarten van Smeden*, Rolf H.H. Groenwold*

July 2018

Abstract

Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue that the automatic handling of missing data by CART may, however, not be appropriate. Using a series of simulation experiments, we examined the performance of different approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.

1 Introduction

Propensity score analysis has gained increasing popularity as a means to adjust for measured confounding (Rosenbaum and Rubin, 1983; Stürmer et al., 2006). Inference typically proceeds by stratification on the propensity score, propensity score adjustment in a regression model, inverse probability weighting (IPW), or matching based on propensity scores given measured covariates (Rosenbaum and Rubin, 1983; Austin, 2011a). It is standard practice to obtain estimates of the propensity score by a parametric (logistic) regression of the exposure on measured covariates. However, parametric models rely on assumptions about the distribution of variables in relation to one another, including the functional form and the presence or absence of interactions. If any of these are violated, covariate balance may not be attained, potentially leading to bias in making causal inferences about the exposure-outcome relation of interest (Drake, 1993).

* Clinical Epidemiology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands

† Corresponding author: Clinical Epidemiology, room C7-107, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, the Netherlands. B.B.L.Penning_de_Vries@lumc.nl

It has been suggested that machine learning and data mining methods, such as classification and regression tree analysis (CART), be used to estimate the relationship between the exposure and measured covariates. These methods avoid making the assumptions regarding functional form and interaction as in a standard logistic regression. The utility of data mining methods to estimate propensity scores in complete data settings has been studied previously (Setoguchi et al., 2008; Lee et al., 2010; Westreich et al., 2010; Wyss et al., 2014). However, in practice, researchers are often faced with missing values on the measured variables. Whereas incomplete data preclude logistic regression on all subjects, some CART algorithms allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. The ability of CART to accommodate missing values has been described as advantageous (Lee et al., 2010; McCaffrey et al., 2004; Moisen, 2008; Rai et al., 2017). However, the precise impact of missing data on the performance of CART-based propensity score estimators has received little attention. The objective of this study was therefore to examine the performance of various CART-based propensity score estimation procedures in the presence of missing data. Throughout, particular emphasis is placed on the causal odds ratio for the marginal effect among the exposed (or Average Effect among the 'Treated', ATT) as the effect measure of interest.

The remainder of this article is structured as follows. In Section 2, we briefly review pertinent theory. Based on analytical work, we identify caveats in the handling of missing data by CART. Section 3 describes a series of Monte Carlo simulations that were used to evaluate the performance of various approaches to handling missing data, including (i) subjecting incomplete data directly to the CART algorithm, (ii) complete case analysis, and (iii) multiple imputation. In Section 4, we apply and compare the approaches in a case study on the effect of influenza vaccination on mortality. We conclude with a summary and discussion of our findings in the context of the existing literature.

2 Theory

2.1 Propensity score analysis of complete data

Counterfactual outcomes and estimating causal effects

We adopt a perspective of potential or counterfactual outcomes, formal accounts of which are given, for example, by Neyman et al. (1935), Rubin (1974), Holland (1986), Holland (1988) and Pearl (2009).

Consider a sequence $S = (X_1, X_2, \ldots, X_n)$ of variables and let $F = (f_{X_1}, f_{X_2}, \ldots, f_{X_n})$ be a collection of functions $f_{X_j}$ that deterministically map a realisation of the predecessors $(X_i : i < j)$ of $X_j$ and of exogenous variable $\varepsilon_{X_j}$ into a realisation of $X_j$. We may write the random variable $X_j$ as follows:

$$X_j = f_{X_j}\bigl(f_{X_1}(\varepsilon_{X_1}),\, f_{X_2}(X_1, \varepsilon_{X_2}),\, \ldots,\, f_{X_{j-1}}(X_1, X_2, \ldots, X_{j-2}, \varepsilon_{X_{j-1}}),\, \varepsilon_{X_j}\bigr). \tag{1}$$

For any intervention setting $X_t = x_t$ for $t$ in a subset $T$ of $\{1, \ldots, n\} \setminus \{j\}$, the counterfactual version of $X_j$ is obtained by evaluating the right-hand side of (1) with $X_t$ replaced by $x_t$ for all $t \in T$.

Specifically, let $S = (W, A, Y, R)$, so that $W = f_W(\varepsilon_W)$, $A = f_A(W, \varepsilon_A)$, $Y = f_Y(W, A, \varepsilon_Y)$ and $R = f_R(W, A, Y, \varepsilon_R)$. $W$ may be thought of as a (random vector of) baseline or pre-exposure variable(s), $A$ denotes the binary exposure of interest, $Y$ the outcome and $R$ a missing indicator vector of $W$. A subject's counterfactual outcomes $Y_0$ and $Y_1$, obtained if exposure $A$ were set, possibly contrary to fact, to 0 and 1, respectively, are defined such that $Y_0 = f_Y(W, 0, \varepsilon_Y)$ and $Y_1 = f_Y(W, 1, \varepsilon_Y)$.

Causal effects are readily defined in terms of counterfactual outcomes. In this article, the focus is on the causal odds ratio (OR) for the marginal effect of exposure $A$ on binary outcome $Y$ among the exposed ($A = 1$), that is,

$$\mathrm{OR} = \frac{E[Y_1 \mid A = 1] / (1 - E[Y_1 \mid A = 1])}{E[Y_0 \mid A = 1] / (1 - E[Y_0 \mid A = 1])}.$$

Under consistency, as defined by Cole and Frangakis (2009), $Y_1$ is equal to the observed outcome $Y$ for subjects in the exposure group. $Y_0$, on the other hand, is not observed for exposed subjects. We may, however, validly estimate the causal OR under a set of conditions, which includes no interference between subjects (or Stable Unit Treatment Value Assumption; Tchetgen and VanderWeele, 2012), consistency, positivity and conditional exchangeability (Lesko et al., 2017). To simplify arguments and notation, we shall assume that all of these conditions hold, with the exception of conditional exchangeability, unless otherwise indicated. If there exists a (set of) variable(s) $Z$ such that the potential outcomes are conditionally independent of exposure status given $Z$, we may write

$$P(Y_0 \mid A = 1) = E[P(Y_0 \mid A = 1, Z) \mid A = 1] = E[P(Y_0 \mid A = 0, Z) \mid A = 1] = E[P(Y \mid A = 0, Z) \mid A = 1],$$

so that the causal OR may be expressed entirely in terms of observables:

$$\mathrm{OR} = \frac{E[Y \mid A = 1] / (1 - E[Y \mid A = 1])}{E\{E[Y \mid A = 0, Z] \mid A = 1\} / (1 - E\{E[Y \mid A = 0, Z] \mid A = 1\})}.$$

$W$ satisfies the definition of $Z$ whenever $\varepsilon_Y \perp\!\!\!\perp \varepsilon_A \mid W$. In practice, validly estimating $E[Y \mid A = 0, Z]$ may be difficult when $Z$ is multidimensional and $Y$ is rare (Albert and Anderson, 1984). In this case, it may be desirable to summarise $Z$ in a single balancing score (Rosenbaum and Rubin, 1983).

The propensity score

The propensity score $e(W)$, defined as the conditional probability of exposure given covariates $W$, satisfies a number of balancing properties. First, covariate(s) $W$ and exposure $A$ are conditionally independent given the propensity score, and conditional exchangeability given covariate(s) $W$ implies conditional exchangeability given $e(W)$ (Rosenbaum and Rubin, 1983, Theorems 1 and 3). Thus, the causal OR becomes

$$\mathrm{OR} = \frac{E[Y \mid A = 1] / (1 - E[Y \mid A = 1])}{E\{E[Y \mid A = 0, e(W)] \mid A = 1\} / (1 - E\{E[Y \mid A = 0, e(W)] \mid A = 1\})}.$$

This formulation has motivated the propensity score matching approach as discussed by Rosenbaum and Rubin (1983).

Balance may also be attained by inverse probability weighting (Appendix A). To simplify arguments and notation, we assume that $W$ and $Y$ take a discrete joint distribution; however, the results extend to continuous or mixed discrete/continuous distributions. To obtain an IPW estimator of the ATT, let

$$\phi(w, a) = \frac{\phi^*(w, a)}{E[\phi^*(W, A) \mid A = a]}, \qquad \phi^*(w, a) = I(a = 1) + I(a = 0)\frac{e(w)}{1 - e(w)},$$

for realisations $w$ of $W$ and $a$ of $A$, where $I$ denotes the indicator function, taking the value 1 if the argument is true and 0 otherwise. Weighting by $\phi$ yields independence between covariate(s) $W$ and $A$, that is, for all $w$,

$$\phi(w, 0)\Pr(W = w \mid A = 0) = \phi(w, 1)\Pr(W = w \mid A = 1).$$

Also, conditional exchangeability given $W$ implies exchangeability following weighting by $\phi$, that is, if $(Y_0, Y_1) \perp\!\!\!\perp A \mid W = w$ for all $w$, then

$$\sum_w \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) = \sum_w \phi(w, 1)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 1)$$

for all $y_0, y_1$. Thus, the causal OR becomes

$$\mathrm{OR} = \frac{\sum_w \phi(w, 1)\Pr(Y = 1, W = w \mid A = 1) \,/\, \bigl(1 - \sum_w \phi(w, 1)\Pr(Y = 1, W = w \mid A = 1)\bigr)}{\sum_w \phi(w, 0)\Pr(Y = 1, W = w \mid A = 0) \,/\, \bigl(1 - \sum_w \phi(w, 0)\Pr(Y = 1, W = w \mid A = 0)\bigr)}.$$

In words, this means that the causal odds ratio is equal to the crude odds ratio of the ATT in the (pseudo-)population that is obtained by weighting each observation by $\phi$.
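The weighting identity above can be checked numerically on a small discrete example. The following Python sketch is illustrative only: the function names, the helper `expand` and the toy counts are assumptions made here, not part of the original analysis. It computes the normalised ATT weights and the weighted odds ratio for one binary covariate with known propensity scores.

```python
def att_weights(records, e):
    """Normalised ATT weights phi(w, a) for records (w, a, y); e maps w to e(w)."""
    # Unnormalised weights: 1 for the exposed, e(w) / (1 - e(w)) for the unexposed.
    raw = [1.0 if a == 1 else e[w] / (1.0 - e[w]) for (w, a, _y) in records]
    # Divide by the mean unnormalised weight within each exposure group.
    mean_raw = {}
    for group in (0, 1):
        in_group = [r for r, (_w, a, _y) in zip(raw, records) if a == group]
        mean_raw[group] = sum(in_group) / len(in_group)
    return [r / mean_raw[a] for r, (_w, a, _y) in zip(raw, records)]


def weighted_odds_ratio(records, weights):
    """Odds ratio comparing the weighted outcome risks of exposed and unexposed."""
    risk = {}
    for group in (0, 1):
        pairs = [(wt, y) for wt, (_w, a, y) in zip(weights, records) if a == group]
        risk[group] = sum(wt * y for wt, y in pairs) / sum(wt for wt, _y in pairs)
    return (risk[1] / (1 - risk[1])) / (risk[0] / (1 - risk[0]))


def expand(w, a, n_events, n_total):
    """Hypothetical helper: n_total records in stratum (w, a), n_events with y = 1."""
    return [(w, a, 1)] * n_events + [(w, a, 0)] * (n_total - n_events)


# Illustrative data (not from the paper): one binary covariate, e(0) = 0.2, e(1) = 0.5.
e = {0: 0.2, 1: 0.5}
records = (expand(0, 1, 10, 20) + expand(0, 0, 20, 80)
           + expand(1, 1, 30, 50) + expand(1, 0, 25, 50))
odds_ratio = weighted_odds_ratio(records, att_weights(records, e))
```

In this toy example the weighted risk among the unexposed equals the covariate-standardised risk among the exposed (30/70), so the weighted odds ratio coincides with the standardised one, 16/9. The normalising constants cancel within each exposure group, which is why the unnormalised weights would give the same ratio.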

Ensemble CART methods in the absence of missing data

We will now briefly describe how CART can be applied to estimate the propensity score. Detailed information can be found elsewhere (McCaffrey et al., 2004; Breiman, 1996; Ridgeway, 1999; Breiman, 2001; Elith et al., 2008; Hastie et al., 2009). CART is a type of supervised learning task that entails finding a set of rules, subject to constraints, that partition the data into regions based on the input data (covariates), such that within regions target values (e.g., exposure levels) meet a desirable level of homogeneity. Typically, a tree is built in a recursive manner by splitting the dataset into increasingly homogeneous subsets and choosing the splitting rule at each step, or node, that best splits the data further, with 'best' referring to the greatest improvement in terms of some homogeneity metric, such as the Gini index (Therneau and Atkinson, 2017). Ensemble techniques, by definition, fit more than one tree to the data and combine them to form a single predictor of 'the outcome' (in the case of the propensity score, the assigned exposure). The aim of ensemble techniques is to enhance performance and reduce issues of overfitting by a single tree (Elith et al., 2008; Moisen, 2008; Hastie et al., 2009). We focus here on two popular CART ensemble methods, namely bootstrap aggregated (bagged) CART and boosted CART.

Bootstrap aggregated CART

Bagged CART involves drawing bootstrap samples from the original study sample (Breiman, 1996). A CART tree is formed in each bootstrap sample, yielding multiple predictors of the target variable. For each subject, the final prediction is formed by the average or majority vote across all predictors. In the context of propensity scores, the prediction of a single tree for any given subject may be defined as the proportion of exposed subjects among those individuals that are assigned to the same region by the given tree. The final propensity score is the average of the predictions across all bootstrap samples. Propensity score matching may then be thought of as matching exposed subjects to unexposed subjects from the same or a 'nearby' region.
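To make the averaging step concrete, the sketch below bags deliberately trivial 'trees' whose regions are simply the levels of one binary covariate. This is an assumption made for brevity, not the recursive partitioning used by CART or by the ipred package, and all names are hypothetical. Each tree predicts the proportion exposed in a subject's region, and the propensity score is the average over bootstrap samples.

```python
import random


def region_tree(sample):
    """Toy 'tree': regions are the levels of x; prediction = proportion exposed."""
    counts = {}
    for x, a in sample:
        n, k = counts.get(x, (0, 0))
        counts[x] = (n + 1, k + a)
    return {x: k / n for x, (n, k) in counts.items()}


def bagged_propensity(data, n_boot=100, seed=1):
    """Average the per-region exposure proportions over bootstrap samples.

    data: list of (x, a) pairs with binary exposure a.
    """
    rng = random.Random(seed)
    overall = sum(a for _x, a in data) / len(data)
    trees = []
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        trees.append(region_tree(boot))
    # If a region is absent from a bootstrap sample, fall back on the overall
    # exposure prevalence (an arbitrary choice made for this sketch).
    return [sum(t.get(x, overall) for t in trees) / n_boot for x, _a in data]
```

A real implementation would grow a full tree in each bootstrap sample; only the resample, predict and average structure is meant to carry over.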

Boosted CART

Boosted CART is related to bagged CART in the sense that it is an ensemble method: multiple trees are fit and merged to form a single predictor. With boosted CART, trees are fit in a forward stagewise procedure. In boosting, trees are fit iteratively to the data such that those observations whose observed exposure levels are poorly predicted by the predictor of the previous iteration receive greater weight at the current iteration (Ridgeway, 1999; Elith et al., 2008). Some implementations construct trees using data splits aimed not at achieving homogeneity of the exposure values themselves, but at achieving homogeneity of the prediction error of the estimator obtained in the previous step (McCaffrey et al., 2004; Elith et al., 2008). With each iteration, a new predictor is formed by making adjustments to the predictor obtained in the previous step. The final predictor is constructed with contributions from all trees.

2.2 Ignorable missing data and generalised propensity scores

In this section, we briefly review the concept of ignorable missing data and discuss a generalisation of the propensity score which allows for missing data, as well as strategies to incorporate missing data directly in the CART fitting. For certain CART algorithms (in our case, boosted CART), the inherent missing data strategy yields estimates of the generalised propensity score.

Ignorable missing data

Suppose $W = (W_1, W_2, \ldots, W_p)$ and $R = (R_1, R_2, \ldots, R_p)$ are random vectors of size $p$ such that, for $j = 1, 2, \ldots, p$, $R_j = 0$ if $W_j$ is missing and $R_j = 1$ if $W_j$ is observed. Following Rubin (1976), define the extended random vector $V = (V_1, V_2, \ldots, V_p)$ with range extended to include the special value $*$ to indicate a missing datum: $V_j = W_j$ if $R_j = 1$ and $V_j = *$ if $R_j = 0$. Let $v$ be a particular sample realisation of $V$, so that each $v_j$ is either a known quantity or $*$. These values imply a realisation for the random variable $R$, denoted $r$. For notational convenience, we write $W = (W^{\mathrm{obs}}, W^{\mathrm{mis}})$ and $V = (V^{\mathrm{obs}}, V^{\mathrm{mis}})$ to indicate that each may be partitioned into two vectors corresponding to all $j$ such that $r_j = 1$ for observed data and $r_j = 0$ for missing data. It is important to note that these partitions are defined with respect to $r$, the observed pattern of missing data. Given a realisation $r$ of $R$, and provided that $A, Y$ are observed, covariate data are said to be missing at random (MAR) if $\Pr(R = r \mid W^{\mathrm{obs}}, W^{\mathrm{mis}} = u, A, Y)$ and $\Pr(R = r \mid W^{\mathrm{obs}}, W^{\mathrm{mis}} = u', A, Y)$ are the same for all $u, u'$ and at each possible value of the parameter vector $\varphi$ that fully characterises the missing data mechanism (Rubin, 1976). If, in addition to MAR, the parameter $\varphi$ is distinct, in the sense of Rubin (1976), from the vector $\theta$ that parameterises the distribution of the data on which we would have based inference had there been no missingness, then missing data are said to be ignorable and it is not necessary to consider the missing data or the missing data mechanism in making inferences about $\theta$ (Rubin, 1976, 1987; Schafer, 1997). Thus, if the missing data mechanism is ignorable, one may validly model the complete data to create imputations for the missing data (Rubin, 1987; Van Buuren, 2012).

The generalised propensity score

The generalised propensity score $e^*(V)$ is defined as the conditional exposure probability given the extended covariate vector $V$ (D'Agostino Jr. and Rubin, 2000). That is,

$$e^*(V) = \Pr(A = 1 \mid W^{\mathrm{obs}}, R) = \sum_w \Pr(A = 1 \mid W^{\mathrm{obs}}, W^{\mathrm{mis}} = w, R)\Pr(W^{\mathrm{mis}} = w \mid W^{\mathrm{obs}}, R).$$

Using the same argumentation that establishes the balancing properties of the usual propensity score, it can be shown that the generalised propensity score has the same balancing properties with respect to $V$ as the usual propensity score has with respect to $W$. Thus, the observed covariate data and missingness information, and exposure $A$, are conditionally independent given the generalised propensity score, and conditional exchangeability given the extended covariate(s) $V$ implies conditional exchangeability given the generalised propensity score $e^*(V)$.
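As an illustration of the extended covariate vector and of what the generalised propensity score conditions on, the following sketch builds $V$ from $W$ and $R$ and estimates $e^*(v)$ by the proportion of exposed records sharing the same $v$. The names `MISSING`, `extend` and `generalised_propensity` are hypothetical, and the simple stratified proportion is an assumption suited only to small discrete examples.

```python
MISSING = "*"  # special value marking a missing datum


def extend(w, r):
    """Map covariates w and indicators r (1 = observed) to v: w_j or '*'."""
    return tuple(wj if rj == 1 else MISSING for wj, rj in zip(w, r))


def generalised_propensity(W, R, A):
    """Empirical e*(v): proportion exposed among records with the same v.

    Values of w_j with r_j = 0 are never used: the estimate depends only on
    the observed values and the missingness pattern.
    """
    V = [extend(w, r) for w, r in zip(W, R)]
    counts = {}
    for v, a in zip(V, A):
        n, k = counts.get(v, (0, 0))
        counts[v] = (n + 1, k + a)
    return V, {v: k / n for v, (n, k) in counts.items()}
```

Two records that differ only in a missing component receive the same value of `v`, and hence the same estimated score, whatever their unobserved values are.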

To obtain an IPW estimator of the ATT, let

$$\gamma(v, a) = \frac{\gamma^*(v, a)}{E[\gamma^*(V, A) \mid A = a]}, \qquad \gamma^*(v, a) = I(a = 1) + I(a = 0)\frac{e^*(v)}{1 - e^*(v)},$$

for realisations $v$ of $V$ and $a$ of $A$. Then, weighting by $\gamma$ renders $V$ independent of $A$, that is, for all $v$,

$$\gamma(v, 0)\Pr(V = v \mid A = 0) = \gamma(v, 1)\Pr(V = v \mid A = 1).$$

Also, conditional exchangeability given $V$ implies conditional exchangeability following weighting by $\gamma$, that is, if $(Y_0, Y_1) \perp\!\!\!\perp A \mid V$, then

$$\sum_v \gamma(v, 0)\Pr(Y_0 = y_0, Y_1 = y_1, V = v \mid A = 0) = \sum_v \gamma(v, 1)\Pr(Y_0 = y_0, Y_1 = y_1, V = v \mid A = 1)$$

for all $y_0, y_1$.

Importantly, the propensity score $e(W)$ need not equal the generalised propensity score $e^*(V)$. That is, given the observed covariate data, the unobserved covariate data need not provide the same information about exposure allocation as does the missing data pattern. In addition, neither covariate balance given the propensity score ($W \perp\!\!\!\perp A \mid e(W)$) nor balance of the observed data and missingness information given the generalised propensity score ($V \perp\!\!\!\perp A \mid e^*(V)$) generally implies covariate balance given the generalised propensity score ($W \perp\!\!\!\perp A \mid e^*(V)$).

More crucially, perhaps, conditional exchangeability given the generalised propensity score is not guaranteed, even if conditional exchangeability given the usual propensity score holds or the generalised propensity score balances both observed and unobserved covariate data (i.e., neither $(Y_0, Y_1) \perp\!\!\!\perp A \mid e(W)$ nor $W \perp\!\!\!\perp A \mid e^*(V)$, nor both, imply that $(Y_0, Y_1) \perp\!\!\!\perp A \mid e^*(V)$; see Appendix B for an example).

This suggests that it is not generally desirable to distribute across exposure groups both the observed data and the missingness information by adjusting for the generalised propensity score. However, there are conceivable situations in which it is appropriate to base inference on the generalised rather than the usual propensity score. Until now, we have assumed an ordering of the variables in which the outcome $Y$ precedes $R$, the missingness pattern of $W$. Consequently, $Y$ was defined as a function $f_Y$ of $W$, $A$ and exogenous variable $\varepsilon_Y$, and not of $R$. Consider now a setting where $S = (W, R, A, Y)$, so that $R$ forms a predecessor of $A$ and $Y$ (and therefore a potential common cause of $A$ and $Y$). Then, if exchangeability can be attained by conditioning on $e^*(V)$, conditional exchangeability given $e(W)$ need not hold (see Appendix C for an example).

Thus, the choice between adjustment for the generalised versus the usual propensity score should ideally rest on the relative extent to which conditional exchangeability holds given the generalised versus the usual propensity score. In practice, it is not possible to estimate directly the true propensity score when covariate data are missing (Rosenbaum and Rubin, 1983; D'Agostino Jr. and Rubin, 2000; Cham and West, 2016). However, under ignorability of missing data, one may 'recover' the unobserved data, e.g., via multiple imputation (Rubin, 1987; Van Buuren, 2012), prior to estimating propensity scores. Henceforth, we assume that exchangeability can be attained by conditioning on the complete covariate data (or, therefore, the usual propensity score) if data were not missing. We also assume that missing data are ignorable.

Applying CART to incomplete data

Bootstrap aggregated CART

In this study, we used bagged CART as implemented in the R package ipred (Peters and Hothorn, 2017; version 0.9-6). This implementation allows for missing data by first evaluating homogeneity at a given node among only those observations whose candidate splitting variable is observed. Once the splitting variable and split point have been decided, the algorithm uses a surrogate splits approach to classify records whose splitting variable is missing based on the other variables included in the tree fitting (Therneau and Atkinson, 2017).

The bagged CART algorithm replaces missing confounder values without regard to the outcome or exposure status. As a result, any two subjects whose covariate data are identical, except possibly for the missing covariate, would be allocated to the same covariate region by any given tree. However, subjects within a given region need not be exchangeable. In fact, systematic differences in the outcome of the causal model ($Y$) between exposed and unexposed subjects may be in part attributable to the missing covariate (confounder). As such, even under completely at random missingness (MCAR), we would expect propensity score matching or IPW based on bagged CART to yield bias in the direction of confounding by the missing covariate.

Boosted CART

An implementation of boosted CART to estimate propensity scores is available in the R package twang (Ridgeway et al., 2017; version 1.5). This implementation allows for incomplete records to be incorporated in the tree fitting by regarding missingness as a special covariate level and assigning to a given (non-terminal) node three child nodes: one to which any individual is allocated whose splitting variable is missing, one for observed values that exceed some threshold, and one for the remainder. That is, rather than modelling the relationship between exposure and covariates, an attempt is made to model the association between exposure on the one hand and observed covariate data and missingness information on the other hand, and therefore to construct scores that balance the missingness across the matched or weighted exposure groups. In other words, the algorithm represents an estimator of the generalised propensity score.
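The three-child split can be pictured with a short routine. This is a conceptual sketch with hypothetical names (`child_node`, `three_way_split`), with missing values coded as `None`; it does not reproduce the twang or gbm code, and it shows a single node rather than a boosted ensemble.

```python
def child_node(value, threshold):
    """Route one observation to one of the three child nodes."""
    if value is None:  # splitting variable missing
        return "missing"
    return "above" if value > threshold else "remainder"


def three_way_split(values, exposure, threshold):
    """Proportion exposed within each non-empty child node."""
    counts = {}
    for value, a in zip(values, exposure):
        child = child_node(value, threshold)
        n, k = counts.get(child, (0, 0))
        counts[child] = (n + 1, k + a)
    return {child: k / n for child, (n, k) in counts.items()}
```

Because the missing records form their own child node, the fitted score can differ between records with and without a missing value, which is how the missingness pattern itself enters the estimated score.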

While boosted CART may be successful at distributing missingness rates across exposure groups, it makes no attempt at distributing the unobserved values. If the partially observed covariate represents a confounder, systematic differences across exposure groups may persist after propensity score matching or IPW based on the generalised propensity score. As such, under MCAR, we would expect boosted CART to yield a propensity score matching or IPW estimator that is biased in the direction of confounding by the partially observed covariate. When missingness is MAR dependent on the outcome, boosted CART tends to render exposure groups more comparable in terms of the outcome and therefore attenuate the apparent exposure-outcome effect.

Bias when applying CART to incomplete data

In summary, we argued above that using either boosted CART or bagged CART to estimate propensity scores may yield a biased estimator of the causal ATT when applying the CART algorithm directly to the (partially) incomplete data. In bagged CART, missing confounder values are replaced, yet this procedure may not be appropriate, since exposure and outcome status are ignored in this process. Boosted CART, on the other hand, balances observed covariate values as well as missing indicator values. Since the latter may depend on the outcome (under the assumption of ignorability), boosted CART potentially balances outcome values too, yielding a biased estimator of the causal effect.

3 Monte Carlo simulations

We now describe a simulation study in which we evaluated the performance of CART-based propensity score matching and IPW in the presence of missing confounder data.

3.1 Methods

Simulation structure

We performed a series of Monte Carlo simulation experiments based on the simulation structure described in Setoguchi et al. (2008), with modifications so as to allow for missing data. For $n = 2000$ subjects, we independently generated 10 covariates $W_i$ (four confounders, three predictors of the exposure only, and three predictors of the outcome only), a binary exposure variable $A$ and a binary outcome $Y$ (Figure 0.1). Missing data were introduced into one or two covariates. A number of CART-based approaches were used to estimate propensity scores before and after the introduction of missing data and, in turn, the log odds ratio for the exposure-outcome effect among the treated. For comparison, we also estimated propensity scores in imputed datasets using a correctly specified propensity score model and using a logistic model with main effects only. The process was repeated 5000 times for each of eight simulation scenarios that varied primarily by missing data mechanism. All simulations were conducted with R-3.2.2 on a Windows 7 (64-bit) platform (R Core Team, 2016).

Data generation

Data were generated by sequentially going through the following steps. First, the covariates were generated by sampling from a multivariate normal distribution with zero means and unit variances; correlations were set to zero, except for the correlations between $W_1$ and $W_5$, $W_2$ and $W_6$, $W_3$ and $W_8$, and $W_4$ and $W_9$, which were set to 0.2, 0.9, 0.2 and 0.9, respectively. Second, covariates $W_1$, $W_3$, $W_5$, $W_6$, $W_8$ and $W_9$ were dichotomised, setting any value to 1 if greater than 0 and to 0 otherwise.
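The two steps above can be sketched in Python using only the standard library. The simulations themselves were run in R, so this is an illustrative translation rather than the authors' code; the names `CORRELATED`, `DICHOTOMISED` and `generate_covariates` are hypothetical. Because only four disjoint pairs are correlated, each pair can be built from two independent standard normals.

```python
import math
import random

# Latent-normal correlations stated in the text:
# (W1, W5) = 0.2, (W2, W6) = 0.9, (W3, W8) = 0.2, (W4, W9) = 0.9.
CORRELATED = {5: (1, 0.2), 6: (2, 0.9), 8: (3, 0.2), 9: (4, 0.9)}
DICHOTOMISED = {1, 3, 5, 6, 8, 9}


def generate_covariates(rng):
    """Return a dict {j: W_j}, j = 1..10, for one subject."""
    z = {j: rng.gauss(0.0, 1.0) for j in range(1, 11)}
    w = dict(z)
    for j, (partner, rho) in CORRELATED.items():
        # Unit-variance normal with correlation rho to its partner.
        w[j] = rho * z[partner] + math.sqrt(1.0 - rho ** 2) * z[j]
    # Dichotomise after inducing the correlations: 1 if greater than 0, else 0.
    return {j: (1.0 if w[j] > 0 else 0.0) if j in DICHOTOMISED else w[j]
            for j in range(1, 11)}
```

For example, `rows = [generate_covariates(random.Random(2018)) ...]` is not what one would write; a single `rng = random.Random(seed)` should be created once and passed to every call so that subjects are independent draws.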

Following Setoguchi et al. (2008), the binary exposure variable $A$ was related to the covariate vector following the propensity score model

$$\Pr(A = 1 \mid W) = \operatorname{expit}\{0.8W_1 - 0.25W_2 + 0.6W_3 - 0.4W_4 - 0.8W_5 - 0.5W_6 + 0.7W_7 - 0.25W_2^2 - 0.4W_4^2 + 0.7W_7^2 + 0.4W_1W_3 - 0.175W_2W_4 + 0.3W_3W_5 - 0.28W_4W_6 - 0.4W_5W_7 + 0.4W_1W_6 - 0.175W_2W_3 + 0.3W_3W_4 - 0.2W_4W_5 - 0.4W_5W_6\}.$$

Realisations $a$ for $A$ were generated by drawing a pseudo-value from the uniform(0,1) distribution and setting $a$ to 1 if this number was less than the true propensity score and to 0 otherwise. Consequently, $A$ can be thought of as an exposure that is generated by a non-linear and non-additive propensity score model. This model assigns approximately half of subjects to the exposure group.

Figure 0.1: Complete data structure for simulation experiments. Dashed arcs without arrowheads connecting variables indicate non-zero entries for the corresponding variables in the covariance matrix of the joint distribution of all $W_i$, $i = 1, 2, \ldots, 10$.
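A direct transcription of the exposure model into Python may help in checking the coefficients. The sketch assumes covariates are supplied as a dict keyed by index 1 to 7 (a convention chosen here), and the function names are hypothetical; the coefficients are those of the propensity score model stated in the text.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def true_propensity(w):
    """True Pr(A = 1 | W) for covariates w = {1: W1, ..., 7: W7}."""
    linear = (0.8 * w[1] - 0.25 * w[2] + 0.6 * w[3] - 0.4 * w[4]
              - 0.8 * w[5] - 0.5 * w[6] + 0.7 * w[7])
    quadratic = -0.25 * w[2] ** 2 - 0.4 * w[4] ** 2 + 0.7 * w[7] ** 2
    interactions = (0.4 * w[1] * w[3] - 0.175 * w[2] * w[4] + 0.3 * w[3] * w[5]
                    - 0.28 * w[4] * w[6] - 0.4 * w[5] * w[7] + 0.4 * w[1] * w[6]
                    - 0.175 * w[2] * w[3] + 0.3 * w[3] * w[4] - 0.2 * w[4] * w[5]
                    - 0.4 * w[5] * w[6])
    return expit(linear + quadratic + interactions)


def draw_exposure(w, rng):
    """Set a = 1 if a uniform(0,1) draw is below the true propensity score."""
    return 1 if rng.random() < true_propensity(w) else 0
```

With all covariates at zero the linear predictor is zero, so the propensity score is 0.5; a logistic model with main effects only omits the quadratic and interaction terms and is therefore misspecified for this mechanism.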

Outcomes were generated following the mechanism described by Setoguchi et al. (2008), with slight modifications to increase the outcome fraction (from approximately 2% to 20% or 40%). Specifically, the binary outcome $Y$ was modelled as a Bernoulli random variable given $A$ and $W$: an independent random number $\varepsilon_Y$ was drawn from the uniform distribution, and $Y$ was set to 1 if this number was less than the inverse logit (expit) of a linear transformation $\eta(A, W) = -1 + 0.3W_1 - 0.36W_2 - 0.73W_3 - 0.2W_4 + 0.71W_8 - 0.19W_9 + 0.26W_{10} + \gamma A$ of $A$ and $W$, and to 0 otherwise. The true conditional log odds ratio $\gamma$ for the exposure-outcome effect was set to 1 or $-1$, depending on the scenario. The outcome incidence was roughly 40% for scenarios with $\gamma = 1$ and 20% for scenarios with $\gamma = -1$. The counterfactual outcomes $Y_0$ and $Y_1$ for any subject with realisations $w$ of $W$ and $u$ of $\varepsilon_Y$ are found by computing $I(u < \operatorname{expit}\{\eta(0, w)\})$ and $I(u < \operatorname{expit}\{\eta(1, w)\})$, $I$ denoting the indicator function. With knowledge of the counterfactual outcomes, it can be inferred that, with $\gamma = 1$, the marginal log odds ratio for the true exposure-outcome effect among the exposed (or treated; ATT) is approximately 0.906; with $\gamma = -1$, the marginal log odds ratio is approximately $-0.926$ (Hernán and Robins, 2017). Note that these are different from the conditional causal odds ratios as a result of the non-collapsibility property of the odds ratio.
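The construction of counterfactual outcomes from a shared uniform draw, and the resulting marginal log odds ratio among the exposed, can be sketched as follows. The function names and the dict-based covariate layout are assumptions made for this example; the sketch does not attempt to reproduce the reported values of 0.906 and $-0.926$, which depend on the full covariate and exposure distribution.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def eta(a, w, gamma):
    """Linear predictor eta(a, w) of the outcome model; w = {1: W1, ..., 10: W10}."""
    return (-1.0 + 0.3 * w[1] - 0.36 * w[2] - 0.73 * w[3] - 0.2 * w[4]
            + 0.71 * w[8] - 0.19 * w[9] + 0.26 * w[10] + gamma * a)


def potential_outcomes(w, u, gamma):
    """(Y0, Y1) obtained from the same uniform draw u of epsilon_Y."""
    y0 = int(u < expit(eta(0, w, gamma)))
    y1 = int(u < expit(eta(1, w, gamma)))
    return y0, y1


def att_log_odds_ratio(rows):
    """Marginal log odds ratio among the exposed from rows (a, y0, y1)."""
    exposed = [(y0, y1) for a, y0, y1 in rows if a == 1]
    p0 = sum(y0 for y0, _y1 in exposed) / len(exposed)
    p1 = sum(y1 for _y0, y1 in exposed) / len(exposed)
    return math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))
```

The observed outcome is then $Y_1$ for exposed and $Y_0$ for unexposed subjects. Applying `att_log_odds_ratio` to a very large simulated population is one way the true marginal effect could be approximated.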

We considered ignorable missing data mechanisms for introducing missing data.

MCAR missingness. For all subjects, irrespective of complete data, values of $W_3$ were set to missing with probability $p$, characterising the MCAR mechanism. The missingness probability of the other variables was set to zero.

MAR missingness. Let $M_3$ be a missing indicator variable that takes the value of one if and only if the value of $W_3$ is missing. Similarly, define $M_4$ to be the missing indicator variable pertaining to $W_4$. Given the full data, $W_3$ and $W_4$ were set to missing independently of one another and with probability equal to $\Pr(M_3 = 1 \mid W, A, Y) = p$ and $\Pr(M_4 = 1 \mid W, A, Y) = \operatorname{expit}\{\alpha_0 + \alpha_1 W_1 + \alpha_2 A + \alpha_3 Y\}$. The missingness probability of the other variables was set to zero.
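Both mechanisms can be expressed with one small routine. In this sketch the function names are hypothetical, `alpha=None` is a convention chosen here to mean that $W_4$ is never set to missing (as in the MCAR mechanism), and missing values are coded as `None`.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_missingness(w1, a, y, p, alpha, rng):
    """Return (m3, m4), the missing indicators for W3 and W4.

    p: probability that W3 is missing (independent of all data).
    alpha: (alpha0, alpha1, alpha2, alpha3) for the W4 mechanism, or None
           if W4 is never missing.
    """
    m3 = int(rng.random() < p)
    if alpha is None:
        m4 = 0
    else:
        a0, a1, a2, a3 = alpha
        m4 = int(rng.random() < expit(a0 + a1 * w1 + a2 * a + a3 * y))
    return m3, m4


def apply_missingness(w, m3, m4):
    """Copy of the covariate dict w with W3 and/or W4 replaced by None."""
    masked = dict(w)
    if m3:
        masked[3] = None
    if m4:
        masked[4] = None
    return masked
```

The two indicators are drawn independently given the full data. The probability for $M_4$ depends only on $W_1$, $A$ and $Y$, which remain fully observed, consistent with a MAR mechanism.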

Scenarios

We evaluated the performance of various CART-based methods in eight scenarios (Table 1). The intercepts $\alpha_0$ in scenarios five through eight were chosen so as to yield roughly the same average proportion of missing data points per generated dataset of 24,000 data points (2000 records on 10 covariates, one exposure and one outcome variable), namely 3%. The average proportion of missing data points and the fraction of incomplete records were largest in scenario 2 (5% and 60%, respectively; see Table 1). In all of the scenarios considered, data are 'missing at random' and it is assumed that there is conditional exchangeability given measured covariates (i.e., $(Y_0, Y_1) \perp\!\!\!\perp A \mid W$).

Note that in scenarios 3 through 8, conditioning on $M$ may break the independence between $A$ and unobserved outcome predictor $\varepsilon_Y$ through what is known as collider stratification (cf. Pearl, 2009). One might therefore expect that discarding incomplete records in these scenarios would result in bias. In scenarios 3 and 4, however, covariate missingness $M$ is conditionally independent of exposure status and covariate data given the outcome (i.e., $M \perp\!\!\!\perp (A, W) \mid Y$). As a result, in these scenarios, the conditional OR for the effect of $A$ on $Y$ given $W$ is equal to the conditional OR given $W$ among the complete cases (Westreich, 2012). Bias of complete case estimators in scenarios 3 and 4 therefore cannot be attributed to collider stratification, despite the presence of an unobserved outcome predictor. Instead, it could result from the non-collapsibility of the odds ratio and changes in the covariate distribution brought about by narrowing the focus of inference to the complete cases (Hernán and Robins, 2017).

| Scenario | γ | MCAR/MAR | p | α0 | α1 | α2 | α3 | PMP | PIR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | MCAR | 0.3 | – | – | – | – | 0.03 | 0.30 |
| 2 | 1 | MCAR | 0.6 | – | – | – | – | 0.05 | 0.60 |
| 3 | 1 | MAR | 0.0 | −0.7 | 0.0 | 0.0 | 1.5 | 0.04 | 0.48 |
| 4 | −1 | MAR | 0.0 | −1.0 | 0.0 | 0.0 | 1.5 | 0.03 | 0.35 |
| 5 | 1 | MAR | 0.1 | −1.6 | 0.5 | 0.5 | 0.5 | 0.03 | 0.37 |
| 6 | 1 | MAR | 0.1 | −2.1 | 0.5 | 0.5 | 1.5 | 0.03 | 0.37 |
| 7 | 1 | MAR | 0.1 | −2.3 | 0.5 | 1.5 | 0.5 | 0.03 | 0.36 |
| 8 | 1 | MAR | 0.1 | −2.2 | 1.5 | 0.5 | 0.5 | 0.03 | 0.37 |

Table 1: Description of scenarios. γ equals the conditional log odds ratio for the effect of $A$ on $Y$ given $W$. Given the full data, variables $W_3$ and $W_4$ were set to missing independently of one another and with probabilities $p$ and $\operatorname{expit}\{\alpha_0 + \alpha_1 W_1 + \alpha_2 A + \alpha_3 Y\}$, respectively. Abbreviations: MCAR, missing completely at random; MAR, missing at random; PMP, average proportion of missing data points; PIR, average proportion of incomplete records.

Estimators

Bagged CART was based on 100 bootstrap replicates (Lee et al., 2010). We imposed complexity constraints on the tree fitting algorithm using the rpart package default control settings. For boosted CART, we used 20,000 iterations, a shrinkage parameter of 0.0005 and an iteration stopping rule based on the mean Kolmogorov-Smirnov test statistic (Lee et al., 2010; McCaffrey et al., 2004).

The CART methods were combined with several common approaches to handling missing data: leaving missingness information as is (i.e., subjecting incomplete data directly to the CART algorithm), complete case analysis (CCA) and multiple imputation (MI). MI was implemented with the mice package (version 2.46.0) using the logreg and norm options to impute missing binary and continuous variables, respectively, and otherwise default settings (Van Buuren and Groothuis-Oudshoorn, 2011). Imputation models included, apart from the variable to be imputed, all other variables, including the outcome, as untransformed main effects only. Propensity score analysis was performed within imputed datasets using the respective sets of estimated propensity scores (Penning de Vries and Groenwold, 2016) and results were combined using Rubin's (1987) rules.
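For reference, the point estimate and total variance under Rubin's rules can be computed as in the sketch below. The function name is hypothetical, and the sketch stops at the pooled standard error; the t-based degrees of freedom needed for a confidence interval are omitted here for brevity.

```python
import math


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error) using
    T = U_bar + (1 + 1/m) * B, with U_bar the mean within-imputation variance
    and B the between-imputation variance of the estimates.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    u_bar = sum(variances) / m
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, math.sqrt(total)
```

Here `estimates` would hold the log odds ratio estimated in each imputed dataset and `variances` the corresponding squared (robust) standard errors. At least two imputations are needed for the between-imputation variance to be defined.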

In addition to using CART as stated, we also estimated propensity scores in imputed datasets using a correctly specified propensity score model (LRc) and using a logistic model with main effects only (LRm).

Within each (multiply imputed) dataset, the ATT was estimated from a logistic model with robust variance estimation using the survey package (Lumley, 2014; version 3.31). We used both propensity score matching and inverse probability weighting. Matching was performed using a greedy 1:1 nearest neighbour algorithm, matching exposed ($A = 1$) to unexposed individuals ($A = 0$) (Austin, 2011a). For any given (imputed) dataset, matching was performed on the logit of the propensity score using a calliper distance of 20% of the standard deviation of the logit propensity score estimates (Austin, 2011b). With the ATT as the estimand, IPW weights were defined as 1 for exposed subjects and $\mathrm{PS}/(1 - \mathrm{PS})$ for unexposed subjects (PS denoting the estimated propensity score). To avoid undefined weights ($1/0$) or logit propensity scores ($\operatorname{logit}(0)$), we placed bounds on the estimated propensity scores, truncating all estimates less than 0.001 to 0.001 and setting estimates greater than 0.999 to 0.999. MI-based estimates were pooled using Rubin's rules to yield, for each original dataset, a single effect estimate, standard error estimate and 90% confidence interval (90%CI).
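The truncation, weighting and matching steps can be sketched as follows. Several details are assumptions made for this example rather than choices stated in the text: exposed subjects are processed in input order, matching is without replacement, ties are broken by index, and the sample standard deviation is used for the calliper. All function names are hypothetical.

```python
import math
import statistics


def truncate(ps, lower=0.001, upper=0.999):
    """Bound an estimated propensity score away from 0 and 1."""
    return min(max(ps, lower), upper)


def att_weight(a, ps):
    """ATT weight: 1 if exposed, PS / (1 - PS) if unexposed."""
    ps = truncate(ps)
    return 1.0 if a == 1 else ps / (1.0 - ps)


def logit(p):
    return math.log(p / (1.0 - p))


def greedy_match(ps, exposure, calliper_sd=0.2):
    """Greedy 1:1 nearest neighbour matching on the logit propensity score.

    Returns a list of (exposed index, unexposed index) pairs.
    """
    lps = [logit(truncate(p)) for p in ps]
    calliper = calliper_sd * statistics.stdev(lps)
    available = {i for i, a in enumerate(exposure) if a == 0}
    pairs = []
    for i, a in enumerate(exposure):
        if a != 1 or not available:
            continue
        # Nearest unexposed subject still available (lowest index on ties).
        j = min(sorted(available), key=lambda k: abs(lps[k] - lps[i]))
        if abs(lps[j] - lps[i]) <= calliper:
            pairs.append((i, j))
            available.remove(j)  # match without replacement
    return pairs
```

Exposed subjects with no unexposed subject within the calliper remain unmatched, so the matched sample may represent only part of the exposed group.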

Performance metrics

We evaluated the performance of the various methods through several measures: bias, estimated by the mean deviation of the estimated from the true marginal exposure-outcome effect on the log scale; empirical standard error; mean estimated standard error; mean squared error (MSE); and 90%CI coverage, estimated by the percentage of the 5000 data sets in which the 90%CI included the true exposure-outcome effect. Based on 5000 simulation runs, the Monte Carlo standard error for the true coverage probability of 0.90 is $\sqrt{0.90(1 - 0.90)/5000} \approx 0.0042$, implying that the estimated coverage probability is expected to lie with 95% probability between 0.893 and 0.907. Empirical coverage rates outside this interval provide evidence against the true coverage probabilities being equivalent to the nominal level of 0.90. The primary interest, however, was to gauge the effect of missing data on the various effect estimators. Therefore, we also compared, for each scenario, the effect estimates before and after the introduction of missing data.
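As a rough sketch of how these measures could be computed from a set of simulation results, consider the following. The function names and the toy inputs are hypothetical, and coverage is evaluated with a normal-approximation interval, which is an assumption; the intervals in the study were produced by the R analysis.

```python
import math
from statistics import NormalDist, mean, stdev

def performance(estimates, std_errors, truth, level=0.90):
    """Bias, empirical SE, mean estimated SE, MSE and interval coverage."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    covered = [abs(est - truth) <= z * se for est, se in zip(estimates, std_errors)]
    return {
        "bias": mean(estimates) - truth,
        "empirical_se": stdev(estimates),
        "mean_se": mean(std_errors),
        "mse": mean([(est - truth) ** 2 for est in estimates]),
        "coverage": sum(covered) / len(covered),
    }

def coverage_mcse(p, n_sim):
    """Monte Carlo standard error of an estimated coverage probability."""
    return math.sqrt(p * (1 - p) / n_sim)

# Toy example with three simulated estimates of a true log odds ratio of 1.0
res = performance([0.9, 1.1, 1.0], [0.1, 0.1, 0.1], truth=1.0)
mcse = coverage_mcse(0.90, 5000)
```

Rounded to four decimals, `mcse` reproduces the value 0.0042 given above.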

3.2 Results

In this section, we present (Table 2) and describe the results for IPW-based estimators only. Trends for estimators based on propensity score matching are similar, and the results are presented in full in the Supplementary Material.

Before the introduction of missing data, baCART and bCART showed small to no bias (with absolute values ranging from 0.000 to 0.011 on the log odds ratio scale). MI+LRc performed generally well and to a similar extent as MI+baCART and MI+bCART. MI+LRm consistently underestimated the true effect when inference was based on IPW; this trend was weaker for inference based on propensity score matching (Supplementary Table 1). Among all CART-based missing data approaches considered, multiple imputation yielded the least biased estimators overall (with a maximum absolute value of bias of 0.026, versus 0.221 and 0.138 for CART-only and CCA estimators, respectively), whereas bCART deviated, on average, the most from the true effect after the introduction of missingness.

As expected, baCART and bCART were biased (with −0.029 and −0.039, respectively, for scenario 1, and −0.064 and −0.072 for scenario 2) under MCAR in the direction of confounding by W3, whereas CCA and MI produced exposure-outcome effect estimates that were, on average, very close to the true effect. In scenarios 3 and 4, where missingness was outcome-dependent, bCART was biased toward the null after the introduction of missingness (with bias estimates of −0.117 and 0.064 for scenarios 3 and 4, respectively, where the causal log odds ratios were approximately 0.906 and −0.926); baCART was downwardly biased in both scenarios (with bias estimates of −0.088 and −0.112). Estimators based on CCA or MI with CART were considerably less biased. In scenarios 5 through 8, CCA estimators systematically underestimated the true effect, particularly when the effect of the exposure or the outcome on the missingness probability was large (scenarios 6 and 7, where bias estimates ranged from −0.116 to −0.138). In these scenarios (5 through 8), baCART produced estimates that were, on average, close to the true effect, except in scenario 6, where the effect was clearly underestimated (estimated bias −0.050). bCART resulted in estimates that deviated in the same direction and to a similar or greater extent from the true effect as compared with CCA estimators. Again, MI with CART resulted in estimates that were, on average, close to the true effect. Increasing the effect of covariate W1 on the missingness probability (scenario 8 versus 5) had no evident impact on the results of any of the estimators.

Other performance

As expected, discarding incomplete records (CCA) resulted in relatively large empirical standard errors. Interestingly, MI+LRc had the largest empirical standard error in most scenarios, probably as a consequence of the complexity of the fitted propensity score models. In comparing empirical and mean estimated standard errors, note that multiple imputation produced generally conservative estimates of the standard error. This is consistent with previous observations (Van Buuren, 2012). Among the CART-based estimators, the MSE was largest for CCA in nearly all scenarios. MI estimators had consistently small MSE. Overall, the best performance in terms of MSE was attained by MI estimators, followed by baCART and bCART. Multiple imputation with CART resulted in empirical coverage rates close to or slightly higher than the nominal 90% and those of the other estimators.

3.3 Additional simulation experiment

To investigate the estimator performances in a simpler setting, we repeated the simulation experiment of scenario 2 with the squared and interaction terms removed from the exposure allocation model of the data generating mechanism. The results, presented in Supplementary Table 2, indicate generally the same trends as previously noted. Of note, in the absence of missing data, inverse weighting based on CART showed noticeably more bias than in scenarios 1 through 8. This is probably related to CART's inherent limited ability to model smooth functions. Multiply imputing missing data followed by CART yielded approximately the same extent of bias. However, this bias appears to be partially cancelled out by the bias introduced by CART's automatic handling of missing data, to the extent that CART alone performed better with than without missing data. Nonetheless, relative to the extent of bias of the respective CART algorithm before the introduction of missing data, multiple imputation with CART outperformed both CCA with CART and CART applied directly to incomplete data in terms of bias.

| Metric | Missing data | Method | Scenario 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| Bias | Without | baCART | 0.009 | 0.011 | 0.009 | 0.011 | 0.007 | 0.007 | 0.008 | 0.010 |
| | | bCART | −0.001 | 0.002 | −0.000 | 0.004 | −0.001 | −0.000 | −0.001 | 0.001 |
| | With | baCART | −0.029 | −0.064 | −0.088 | −0.112 | −0.011 | −0.050 | 0.010 | −0.004 |
| | | bCART | −0.037 | −0.072 | −0.117 | 0.064 | −0.057 | −0.221 | −0.123 | −0.053 |
| | | CCA+baCART | 0.001 | 0.001 | 0.016 | −0.023 | −0.040 | −0.138 | −0.116 | −0.032 |
| | | CCA+bCART | 0.000 | 0.006 | 0.022 | −0.010 | −0.046 | −0.136 | −0.129 | −0.035 |
| | | MI+baCART | 0.001 | 0.002 | 0.025 | 0.018 | 0.020 | 0.016 | 0.022 | 0.026 |
| | | MI+bCART | −0.008 | −0.011 | −0.024 | −0.025 | −0.010 | −0.023 | −0.008 | −0.007 |
| | | MI+LRc | 0.002 | −0.007 | −0.028 | −0.029 | −0.007 | −0.025 | −0.004 | −0.002 |
| | | MI+LRm | −0.099 | −0.094 | −0.072 | −0.075 | −0.087 | −0.083 | −0.089 | −0.082 |
| Empirical SE | Without | baCART | 0.116 | 0.114 | 0.116 | 0.129 | 0.115 | 0.115 | 0.114 | 0.116 |
| | | bCART | 0.134 | 0.133 | 0.136 | 0.147 | 0.135 | 0.133 | 0.133 | 0.135 |
| | With | baCART | 0.118 | 0.121 | 0.126 | 0.136 | 0.119 | 0.121 | 0.119 | 0.119 |
| | | bCART | 0.132 | 0.128 | 0.125 | 0.137 | 0.131 | 0.127 | 0.149 | 0.131 |
| | | CCA+baCART | 0.141 | 0.186 | 0.189 | 0.199 | 0.147 | 0.156 | 0.143 | 0.148 |
| | | CCA+bCART | 0.158 | 0.202 | 0.211 | 0.221 | 0.165 | 0.172 | 0.154 | 0.163 |
| | | MI+baCART | 0.116 | 0.116 | 0.115 | 0.129 | 0.114 | 0.115 | 0.113 | 0.116 |
| | | MI+bCART | 0.132 | 0.130 | 0.129 | 0.140 | 0.128 | 0.126 | 0.126 | 0.128 |
| | | MI+LRc | 0.216 | 0.205 | 0.202 | 0.218 | 0.210 | 0.211 | 0.203 | 0.214 |
| | | MI+LRm | 0.125 | 0.123 | 0.125 | 0.138 | 0.122 | 0.121 | 0.124 | 0.123 |
| Mean SE | Without | baCART | 0.114 | 0.114 | 0.114 | 0.128 | 0.114 | 0.114 | 0.114 | 0.114 |
| | | bCART | 0.136 | 0.136 | 0.136 | 0.149 | 0.136 | 0.136 | 0.136 | 0.136 |
| | With | baCART | 0.115 | 0.119 | 0.118 | 0.129 | 0.117 | 0.117 | 0.119 | 0.117 |
| | | bCART | 0.134 | 0.132 | 0.131 | 0.144 | 0.134 | 0.135 | 0.153 | 0.135 |
| | | CCA+baCART | 0.139 | 0.189 | 0.190 | 0.199 | 0.148 | 0.156 | 0.145 | 0.148 |
| | | CCA+bCART | 0.160 | 0.204 | 0.211 | 0.221 | 0.166 | 0.174 | 0.160 | 0.163 |
| | | MI+baCART | 0.116 | 0.120 | 0.115 | 0.130 | 0.116 | 0.116 | 0.115 | 0.116 |
| | | MI+bCART | 0.140 | 0.143 | 0.137 | 0.150 | 0.138 | 0.138 | 0.137 | 0.138 |
| | | MI+LRc | 0.196 | 0.198 | 0.187 | 0.201 | 0.189 | 0.191 | 0.185 | 0.191 |
| | | MI+LRm | 0.131 | 0.135 | 0.128 | 0.142 | 0.128 | 0.128 | 0.128 | 0.128 |
| MSE | Without | baCART | 0.014 | 0.013 | 0.013 | 0.017 | 0.013 | 0.013 | 0.013 | 0.014 |
| | | bCART | 0.018 | 0.018 | 0.019 | 0.022 | 0.018 | 0.018 | 0.018 | 0.018 |
| Empirical 90%CI coverage | Without | baCART | 0.896 | 0.901 | 0.895 | 0.897 | 0.892 | 0.899 | 0.898 | 0.890 |
| | | bCART | 0.907 | 0.909 | 0.903 | 0.904 | 0.904 | 0.909 | 0.909 | 0.907 |
| | With | baCART | 0.881 | 0.847 | 0.784 | 0.753 | 0.890 | 0.854 | 0.898 | 0.893 |
| | | bCART | 0.897 | 0.861 | 0.772 | 0.886 | 0.878 | 0.497 | 0.797 | 0.880 |
| | | CCA+baCART | 0.893 | 0.905 | 0.894 | 0.901 | 0.888 | 0.760 | 0.792 | 0.884 |
| | | CCA+bCART | 0.906 | 0.905 | 0.902 | 0.904 | 0.890 | 0.798 | 0.804 | 0.886 |
| | | MI+baCART | 0.904 | 0.914 | 0.894 | 0.897 | 0.895 | 0.900 | 0.903 | 0.896 |
| | | MI+bCART | 0.922 | 0.931 | 0.915 | 0.919 | 0.918 | 0.928 | 0.926 | 0.923 |
| | | MI+LRc | 0.911 | 0.920 | 0.909 | 0.906 | 0.908 | 0.921 | 0.919 | 0.906 |
| | | MI+LRm | 0.815 | 0.841 | 0.858 | 0.865 | 0.831 | 0.848 | 0.827 | 0.841 |

Table 2: Performance metrics of inverse probability weighting estimators in 5000 simulated datasets with and without missing data. Abbreviations: SE, standard error; MSE, mean squared error; 90%CI, 90% confidence interval; CART, classification and regression trees; baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRc, logistic regression with correctly specified model; LRm, logistic regression with main effects only.

4 Case study

In this section, we illustrate the application of the CART-based estimators to an empirical dataset constructed to assess the association between annual influenza vaccination and mortality risk among elderly (Groenwold et al., 2009). The dataset comprises 44,418 complete records on vaccination status, mortality during the influenza epidemic period, and potential confounders (age, sex, health status, and prior health care and medication use). Among the 32,388 vaccinated individuals, 266 died, whereas 113 out of 12,030 nonvaccinated individuals died (crude odds ratio 0.87, 90%CI 0.73–1.05). To control for measured confounders, propensity scores were estimated via bCART and a pseudopopulation was constructed using IPW such as to preserve the covariate distribution of the vaccination group. This resulted in an odds ratio of 0.60 (90%CI 0.49–0.73) for the marginal effect of vaccination on mortality risk among the vaccinated. Substituting bCART with baCART yielded an odds ratio of 0.65 (90%CI 0.53–0.81). As expected, introducing MCAR missingness into a confounder, by setting a random 50% of subjects' number of prior general practitioner (GP) visits to missing, resulted in odds ratio estimates that were closer to the crude effect. Setting the number of GP visits to missing with probability 0.5 for all subjects who died and zero otherwise resulted in estimates substantially closer to the null for bCART and away from the null for baCART. Thus, as in our simulations, outcome-dependent MAR missingness resulted in apparent attenuation of the exposure-outcome effect as estimated by bCART. Table 3 shows the results also for the complete case and multiple imputation equivalents of baCART and bCART, as well as for IPW based on propensity score estimation using main effects logistic regression and with weights truncated to the interval from the 0th to the 97.5th percentile. To better handle potential violations of standard imputation model assumptions, we used a nonparametric multiple imputation strategy (option cart rather than norm in the mice package) to estimate the effect of vaccination on mortality risk (Doove et al., 2014). Interestingly, in the MAR setting, bCART and baCART yielded the two most extreme estimates for the effect of vaccination on mortality risk among the elderly.
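The crude odds ratio quoted above can be recomputed from the reported counts. The sketch below uses a Wald interval on the log odds ratio scale, which is an assumption: the text does not state how the crude interval was obtained. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def crude_odds_ratio(deaths_exposed, n_exposed, deaths_unexposed, n_unexposed, level=0.90):
    """Crude odds ratio with a Wald confidence interval on the log scale."""
    a, b = deaths_exposed, n_exposed - deaths_exposed
    c, d = deaths_unexposed, n_unexposed - deaths_unexposed
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Counts reported in the case study: 266 of 32,388 vaccinated and 113 of 12,030 nonvaccinated died
odds_ratio, lower, upper = crude_odds_ratio(266, 32388, 113, 12030)
```

Rounded to two decimals, this gives 0.87 with limits 0.73 and 1.05, in agreement with the crude estimate in the text.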

5 Discussion

In this paper, we examined the workings of CART-based propensity score estimators in scenarios with missing covariate data. Although CART has been described as a promising approach to automatically handle missing covariate data when developing a propensity score (Setoguchi et al., 2008; Lee et al., 2010), there has been little discussion on the performance of these methods. Through analysis and simulations, we showed that the application of CART for propensity score estimation can yield serious bias in estimates of exposure-outcome relations. We showed that this problem pertains not only to the situation of MAR but, critically, also to situations with MCAR, which are often considered harmless where bias is concerned, resulting only in larger variance of the estimator of exposure-outcome relations.

An attractive property of CART-based methods relative to standard logistic regression procedures is perhaps not having to discard incomplete records. Indeed, in our simulations, discarding incomplete records resulted in the largest empirical standard errors. Alternatively, multiple imputation may be used to replace missing values under MCAR or MAR prior to propensity score estimation. This approach was shown to work well in our simulations. One criticism of multiple imputation in its parametric form is that it makes possibly erroneous distributional assumptions. In particular, the standard multiple imputation algorithms do not properly capture nonlinear relations like interaction effects (Cham and West, 2016). Multiple imputation algorithms that use nonparametric methods have been developed. For example, Doove et al. (2014), following Burgette and Reiter (2010), proposed CART to be incorporated as imputation method in the multiple imputation by chained equations framework. As with parametric multiple imputation, the algorithm is designed to account for the inherent variability in the data. However, while the approach of Doove et al. (2014) seems promising, there is still room for improvement. Particularly, the algorithm does not explicitly account for uncertainty about the (implicit) CART trees' model parameters. To address this, Shah et al. (2014) proposed a promising algorithm in which random forest CART is embedded in the multiple imputation by chained equations framework and imputation models are fitted to bootstrap samples. An implementation is available via the R package CALIBERrfimpute (Shah, 2014).

| Method | No missingness: OR (90%CI) | MCAR: OR (90%CI) | MAR: OR (90%CI) |
|---|---|---|---|
| baCART | 0.65 (0.53–0.81) | 0.69 (0.56–0.85) | 0.53 (0.44–0.66) |
| bCART | 0.60 (0.49–0.73) | 0.63 (0.51–0.77) | 0.79 (0.63–0.98) |
| CCA+baCART | – | 0.55 (0.41–0.73) | 0.62 (0.46–0.84) |
| CCA+bCART | – | 0.50 (0.37–0.66) | 0.56 (0.42–0.75) |
| MI+baCART | – | 0.60 (0.47–0.75) | 0.70 (0.55–0.89) |
| MI+bCART | – | 0.58 (0.47–0.72) | 0.63 (0.51–0.78) |
| LRm† | 0.59 (0.49–0.71) | 0.62 (0.51–0.76) | 0.70 (0.57–0.86) |

Table 3: Estimated effects of vaccination on mortality risk among the elderly in dataset with no missing data, MCAR missingness, or outcome-dependent MAR missingness. Estimates are adjusted for age, sex, health status, and prior health care and medication use. Abbreviations: MCAR, missing completely at random; MAR, missing at random; OR, odds ratio; 90%CI, 90% confidence interval; CART, classification and regression trees; baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRm, main effects logistic regression. †In case of (MCAR or MAR) missingness, MI was implemented before LRm.

In interpreting our findings, it is important to note that we considered only a small number of scenarios. We assumed throughout that data were MCAR or MAR and that there was no unmeasured confounding (conditional exchangeability given measured confounders). As noted, there are situations conceivable in which it is not problematic to estimate the generalised propensity score. If the missingness information conveys information about a strong unmeasured confounder, estimating the generalised propensity score may allow for partial control of unmeasured confounding. On the other hand, adjusting for missingness information, e.g., through the generalised propensity score (estimated by some CART algorithms), may be problematic, particularly when it is a strong proxy for the outcome, an intermediate, or a common effect of the exposure and outcome.

Our arguments for caution when using CART to estimate propensity scores in the presence of missing data are in line with the recommendation to incorporate information on the outcome in imputing missing covariate data (Penning de Vries and Groenwold, 2016; Leyrat et al., 2017; Moons et al., 2006). Since propensity score estimation is typically done without any information on the outcome (Rubin et al., 2008), any missing data imputation (e.g., with a surrogate) that is inherent to the propensity score estimation procedure will likely fail. An important feature of the propensity score matching or weighting methodology is that, in the absence of missing data, it need not make distributional assumptions about the outcome in relation to the exposure and covariates in constructing a matched or weighted dataset. In the presence of missing covariate data, omitting information on the outcome in imputing missing covariate data, however, imposes a structure on the data that likely contrasts with the true data distribution and the analysis model. This is similar to the idea of models being "uncongenial" in the sense of Meng (1994). The current study also relates to the literature on the missing indicator method, given its resemblance with the approach to handling missing data taken by the boosted CART algorithm. Like the automatic handling of missing data by the boosted CART algorithm, the missing indicator method typically results in bias (Groenwold et al., 2012).

It has been suggested to perform balance diagnostics on the matched or weighted study sample at hand (Austin, 2011a). If systematic differences persist between exposure groups following matching or weighting, this may be an indication that the propensity score estimation algorithm requires modification (Austin and Stuart, 2015). In the context of CART, one may assign greater weight to subjects at a certain covariate level in evaluating exposure homogeneity at any given node. We did not adopt an iterative approach to propensity score estimation and balance diagnostics in our simulation studies for several reasons. First, doing so would increase the computational burden of the simulations. Second, whereas CART facilitates the estimation of propensity scores that balance the entire covariate joint distribution across exposure groups, standard balance diagnostics procedures typically ignore the complex relationship between exposure and covariates. For example, when using the standardised mean difference, it is typically assumed that all variables that need to be balanced with respect to the mean are identified and included in the set over which a summary (e.g., weighted mean or maximum) standardised mean difference is calculated. The utility of the metric may be poor if important variables (e.g., higher order moments) are omitted. Other balance metrics, such as the Kolmogorov-Smirnov metric, Lévy distance, and overlapping coefficient (Belitser et al., 2011; Franklin et al., 2014; Ali et al., 2015), often fail to reflect the extent of imbalance with respect to the entire covariate joint distribution. In addition, what constitutes good balance ultimately depends on the outcome model too. Substantial imbalance may be acceptable for covariates that are weakly predictive of the outcome, while small departures from perfect balance may be problematic for covariates that are strongly predictive of the outcome.
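For concreteness, a weighted standardised mean difference for a single covariate might be computed as sketched below. The particular weighted-variance formula (weights treated as frequency weights) and all names and data are assumptions made for illustration; several variants exist and the text does not prescribe one.

```python
import math

def weighted_mean(x, w):
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)

def weighted_var(x, w):
    """Weighted variance, treating the weights as frequency weights (an assumption)."""
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)

def standardised_mean_difference(x, exposed, w):
    """Difference in weighted means divided by the pooled weighted standard deviation."""
    x1 = [xi for xi, a in zip(x, exposed) if a == 1]
    w1 = [wi for wi, a in zip(w, exposed) if a == 1]
    x0 = [xi for xi, a in zip(x, exposed) if a == 0]
    w0 = [wi for wi, a in zip(w, exposed) if a == 0]
    pooled_sd = math.sqrt((weighted_var(x1, w1) + weighted_var(x0, w0)) / 2)
    return (weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd

# Hypothetical covariate values for two exposed and two unexposed subjects
x = [1.0, 3.0, 0.0, 4.0]
exposed = [1, 1, 0, 0]
smd_unweighted = standardised_mean_difference(x, exposed, [1, 1, 1, 1])
smd_weighted = standardised_mean_difference(x, exposed, [1, 1, 3, 1])
```

In this toy example the group means coincide although the spreads differ, so the unweighted value is zero: a small illustration of how a mean-based metric can miss imbalance in higher order moments.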

We emphasise that our simulations were not designed to compare CART versus logistic regression as means to estimate propensity scores. Main effects logistic regression, here and in previous studies, demonstrated a robust performance against model misspecification in terms of bias when inference was based on propensity score matching (Setoguchi et al., 2008). This is likely attributable to the set-up of the simulations. The outcome model included homogeneous exposure-outcome effects and main effects only. Since between-exposure-group imbalances with respect to interaction terms or higher order moments of covariates need not accompany systematic differences in outcomes, it is not surprising that propensity score matching based on main effects logistic regression may perform roughly the same in terms of bias as propensity score matching based on logistic regression with correct model specification. Further studies comparing CART versus main effects logistic regression may well demonstrate more clearly the advantageous properties of CART in settings with both complex propensity score and complex outcome models.

In summary, we compared various approaches to handling missing data in estimating propensity scores via CART. While the use of machine learning in estimating propensity scores seems promising for handling complex full data structures, it is unlikely to represent a suitable substitute for well-established methods such as multiple imputation to deal with missing data.

Appendices

Appendix A

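For realisations $w$ of $W$ and $a$ of $A$, let

$$
\phi^*(w, a) = a + (1 - a)\,\frac{e(w)}{1 - e(w)}, \qquad \phi(w, a) = \frac{\phi^*(w, a)}{E[\phi^*(W, A) \mid A = a]}.
$$

We show in this subsection that weighting by $\phi$ yields independence between covariate(s) $W$ and $A$, that is, for all $w$,

$$
\phi(w, 0)\Pr(W = w \mid A = 0) = \phi(w, 1)\Pr(W = w \mid A = 1),
$$

and

$$
\sum_w \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) = \sum_w \phi(w, 1)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 1)
$$

for all $y_0, y_1$.

We begin by considering $E[\phi^*(W, A) \mid A = a]$. It is easy to see that $E[\phi^*(W, A) \mid A = 1] = 1$. For $a = 0$, we have

$$
\begin{aligned}
E[\phi^*(W, A) \mid A = 0]
&= E\left[\frac{e(W)}{1 - e(W)} \,\middle|\, A = 0\right] \\
&= E\left[\frac{\Pr(A = 1 \mid W)}{\Pr(A = 0 \mid W)} \,\middle|\, A = 0\right] \\
&= \sum_w \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \Pr(W = w \mid A = 0) \\
&= \sum_w \frac{\Pr(W = w \mid A = 1)\Pr(A = 1)/\Pr(W = w)}{\Pr(W = w \mid A = 0)\Pr(A = 0)/\Pr(W = w)} \Pr(W = w \mid A = 0) \\
&= \frac{1}{\Pr(A = 0)} \sum_w \Pr(W = w \mid A = 1)\Pr(A = 1) \\
&= \frac{\Pr(A = 1)}{\Pr(A = 0)}.
\end{aligned}
$$

Since $\phi(w, 1) = 1$, to prove the first statement it suffices to show that $\phi(w, 0)\Pr(W = w \mid A = 0) = \Pr(W = w \mid A = 1)$ for all $w$. Now,

$$
\begin{aligned}
\phi(w, 0)\Pr(W = w \mid A = 0)
&= \frac{e(w)}{1 - e(w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(W = w \mid A = 0) \\
&= \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(W = w \mid A = 0) \\
&= \frac{\Pr(W = w \mid A = 1)\Pr(A = 1)}{\Pr(W = w \mid A = 0)\Pr(A = 0)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(W = w \mid A = 0) \\
&= \Pr(W = w \mid A = 1)
\end{aligned}
$$

for all $w$, as desired.

To complete this proof, observe that

$$
\begin{aligned}
\sum_w \phi(w, 0)&\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) \\
&= \sum_w \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) \\
&= \sum_w \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \\
&\qquad \times \Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 0)\Pr(W = w \mid A = 0) \\
&= \sum_w \Pr(W = w \mid A = 1)\Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 0)
\end{aligned}
$$

for all $y_0, y_1$. Under conditional exchangeability given $W$, i.e., $(Y_0, Y_1) \perp\!\!\!\perp A \mid W$, we have $\Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 0) = \Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 1)$ for all $w$. Hence, $\sum_w \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0)$ becomes $\sum_w \Pr(W = w \mid A = 1)\Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 1)$, which is equal to $\Pr(Y_0 = y_0, Y_1 = y_1 \mid A = 1)$. Since $\phi(w, 1) = 1$, we also have that $\sum_w \phi(w, 1)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 1) = \Pr(Y_0 = y_0, Y_1 = y_1 \mid A = 1)$ for all $y_0, y_1$, which completes this proof.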
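The first statement can be checked numerically on a small discrete example. The sketch below takes the unnormalised weight for the unexposed to be e(w)/(1 − e(w)), with e(w) = Pr(A = 1 | W = w), and normalises it by its mean among the unexposed; the three-level covariate and its probabilities are hypothetical. Exact fractions are used so that the equalities hold without rounding error.

```python
from fractions import Fraction as F

p_w = {0: F(1, 2), 1: F(3, 10), 2: F(1, 5)}  # hypothetical Pr(W = w)
e = {0: F(1, 5), 1: F(1, 2), 2: F(7, 10)}    # hypothetical e(w) = Pr(A = 1 | W = w)

p_a1 = sum(p_w[w] * e[w] for w in p_w)       # Pr(A = 1)
p_a0 = 1 - p_a1                              # Pr(A = 0)
p_w_a1 = {w: p_w[w] * e[w] / p_a1 for w in p_w}        # Pr(W = w | A = 1)
p_w_a0 = {w: p_w[w] * (1 - e[w]) / p_a0 for w in p_w}  # Pr(W = w | A = 0)

# Unnormalised weight for the unexposed and its mean among the unexposed
phi_star0 = {w: e[w] / (1 - e[w]) for w in p_w}
norm0 = sum(phi_star0[w] * p_w_a0[w] for w in p_w)

# Normalised weight and the reweighted covariate distribution among the unexposed
phi0 = {w: phi_star0[w] / norm0 for w in p_w}
reweighted_a0 = {w: phi0[w] * p_w_a0[w] for w in p_w}
```

Here `norm0` equals Pr(A = 1)/Pr(A = 0), and `reweighted_a0` coincides with the covariate distribution among the exposed, `p_w_a1`.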

Appendix B

In this subsection, we give an example of a simple setting where $(Y_0, Y_1) \perp\!\!\!\perp A \mid e(W)$ and $W \perp\!\!\!\perp A \mid e^*(V)$ hold, yet $(Y_0, Y_1) \not\perp\!\!\!\perp A \mid e^*(V)$.

Let $W$, $A$ and $Y$ be binary, mutually independent random variables and suppose that covariate missingness is MAR dependent on $Y$. Specifically, let $\Pr(W = 1) = 0.5$ and $\Pr(A = 1 \mid W = w) = \Pr(A = 1) = 0.5$ for all $w$. Further, define $Y = I(\varepsilon_Y < (1 + A)/10)$, where $\varepsilon_Y \sim U(0, 1)$ such that $\varepsilon_Y \perp\!\!\!\perp (A, W)$. Thus, there is conditional exchangeability given $W$, so that $\Pr(Y = 1 \mid A = a, W = w) = \Pr(Y_a = 1 \mid A = a, W = w) = (1 + a)/10$ for all $a, w$. Rosenbaum and Rubin (1983, Theorems 1 and 3) and Appendix A establish conditional exchangeability given $e(W)$ and exchangeability following inverse probability weighting with weights defined on the basis of $e(W)$. Now, let $\Pr(R = 0 \mid W, A, Y, \varepsilon_Y) = 0.1 + 0.5Y$. It is easily verified that $W \perp\!\!\!\perp A \mid e^*(V)$. However, $e^*(V) = 4/7$ if and only if $V = *$ or, equivalently, $R = 0$. Since $R \perp\!\!\!\perp \varepsilon_Y \mid (A, Y)$ and $R \perp\!\!\!\perp A \mid Y$, for any $u \in (0, 1)$ we therefore have

$$
\begin{aligned}
\Pr(\varepsilon_Y \le u \mid A = a, e^*(V) = 4/7)
&= \Pr(\varepsilon_Y \le u \mid A = a, R = 0) \\
&= \sum_y \Pr(\varepsilon_Y \le u \mid A = a, Y = y)\Pr(Y = y \mid A = a, R = 0) \\
&= \sum_y \Pr(\varepsilon_Y \le u \mid A = a, Y = y) \\
&\qquad \times \frac{\Pr(R = 0 \mid Y = y)\Pr(Y = y \mid A = a)}{\sum_{y'} \Pr(R = 0 \mid Y = y')\Pr(Y = y' \mid A = a)} \\
&= \sum_y \Pr(\varepsilon_Y \le u \mid A = a, Y = y)\,\frac{(1 + 5y)[(1 + a)y + (9 - a)(1 - y)]}{15 + 5a},
\end{aligned}
$$

where

$$
\begin{aligned}
\Pr(\varepsilon_Y \le u \mid A = a, Y = y)
&= \frac{\Pr(Y = y \mid A = a, \varepsilon_Y \le u)\Pr(\varepsilon_Y \le u)}{\Pr(Y = y \mid A = a)} \\
&= \frac{q(y, u, a)\,u}{(1 + a)y/10 + (9 - a)(1 - y)/10}
\end{aligned}
$$

with $q(y, u, a) = 1 - y + (-1)^{1 - y}\min\{(1 + a)/10, u\}/u$. In particular, $\Pr(\varepsilon_Y \le 0.5 \mid A = a, e^*(V) = 4/7)$ equals $2/3$ if $a = 0$ and $3/4$ if $a = 1$. Hence, $\varepsilon_Y \not\perp\!\!\!\perp A \mid e^*(V)$ and, given the definitions of $Y$, $Y_0$ and $Y_1$, we have $(Y_0, Y_1) \not\perp\!\!\!\perp A \mid e^*(V)$.
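The three numerical values in this example (4/7, 2/3 and 3/4) can be verified by direct enumeration. The sketch below follows the set-up of the example, with Pr(A = 1) = 0.5, Y = I(ε < (1 + A)/10) for uniform ε, and Pr(R = 0 | Y) = 0.1 + 0.5Y; the function names are hypothetical.

```python
from fractions import Fraction as F

def p_y_given_a(y, a):
    """Pr(Y = y | A = a), where Y = I(eps < (1 + a) / 10) and eps ~ U(0, 1)."""
    p1 = F(1 + a, 10)
    return p1 if y == 1 else 1 - p1

def p_r0_given_y(y):
    """Pr(R = 0 | Y = y) = 0.1 + 0.5 y."""
    return F(1, 10) + F(1, 2) * y

def p_eps_le_u_given_ay(u, a, y):
    """Pr(eps <= u | A = a, Y = y) for uniform eps."""
    cut = F(1 + a, 10)
    if y == 1:
        return min(u, cut) / cut
    return max(u - cut, 0) / (1 - cut)

def p_r0_given_a(a):
    return sum(p_r0_given_y(y) * p_y_given_a(y, a) for y in (0, 1))

def p_a1_given_r0():
    """Pr(A = 1 | R = 0), using Pr(A = 1) = Pr(A = 0) = 1/2."""
    return p_r0_given_a(1) / (p_r0_given_a(0) + p_r0_given_a(1))

def p_eps_le_u_given_a_r0(u, a):
    """Pr(eps <= u | A = a, R = 0), using that R is independent of eps given (A, Y)."""
    total = sum(
        p_eps_le_u_given_ay(u, a, y) * p_r0_given_y(y) * p_y_given_a(y, a)
        for y in (0, 1)
    )
    return total / p_r0_given_a(a)

half = F(1, 2)
values = (p_a1_given_r0(), p_eps_le_u_given_a_r0(half, 0), p_eps_le_u_given_a_r0(half, 1))
```

The tuple `values` evaluates to (4/7, 2/3, 3/4), matching the quantities derived above.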

Appendix C

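This subsection details an example where $(Y_0, Y_1) \perp\!\!\!\perp A \mid e^*(V)$, yet $(Y_0, Y_1) \not\perp\!\!\!\perp A \mid e(W)$.

Suppose that $W$, $A$ and $Y$ are all binary random variables. Further, let $(A, R)$ be marginally independent of $W$, let $A$ conditionally depend on $R$ given $W$, and let $Y$ conditionally depend on $A$ and $R$ given $W$. Specifically, let $\Pr(W = 1) = 0.5$, $\Pr(R = 0 \mid W) = 0.1$, $\Pr(A = 1 \mid R, W) = 2(1 + R)/10$, and $Y = I(\varepsilon_Y < 2(1 + 2R)/20)$, where $\varepsilon_Y \perp\!\!\!\perp (W, R, A)$. To see that $(Y_0, Y_1) \perp\!\!\!\perp A \mid e^*(V)$, first note that

$$
\begin{aligned}
e(w) &= \Pr(A = 1 \mid W = w) \\
&= \Pr(A = 1 \mid W = w, R = 0)\Pr(R = 0 \mid W = w) \\
&\qquad + \Pr(A = 1 \mid W = w, R = 1)\Pr(R = 1 \mid W = w) \\
&= 0.38
\end{aligned}
$$

for $w = 0, 1$, and that $e^*(v)$ equals $\Pr(A = 1 \mid R = 0) = 0.20$ if $v = *$ and $\Pr(A = 1 \mid R = 1) = 0.40$ otherwise. Now,

$$
\Pr(Y_0 = 1 \mid A = a, e^*(V) = 0.20) = \Pr(Y_0 = 1 \mid A = a, R = 0) = 0.10
$$

for $a = 0, 1$. Also, $\Pr(Y_0 = 1 \mid A, e^*(V) = 0.40) = 0.30$. Thus, $Y_0 \perp\!\!\!\perp A \mid e^*(V)$. Similarly, it can be shown that $(Y_0, Y_1) \perp\!\!\!\perp A \mid e^*(V)$. Next, observe that

$$
\begin{aligned}
\Pr(Y_0 = 1 \mid A = a, e(W) = 0.38)
&= \Pr(Y_0 = 1 \mid A = a) \\
&= \Pr(Y_0 = 1 \mid A = a, R = 0)\Pr(R = 0 \mid A = a) \\
&\qquad + \Pr(Y_0 = 1 \mid A = a, R = 1)\Pr(R = 1 \mid A = a) \\
&= 0.10\,\frac{\Pr(A = a \mid R = 0)\Pr(R = 0)}{\Pr(A = a)} + 0.30\,\frac{\Pr(A = a \mid R = 1)\Pr(R = 1)}{\Pr(A = a)} \\
&= \frac{0.20^{a}\,0.80^{1 - a} \times 0.01 + 0.40^{a}\,0.60^{1 - a} \times 0.27}{0.38^{a}\,0.62^{1 - a}},
\end{aligned}
$$

which is not invariant to changes in $a = 0, 1$. Hence, $(Y_0, Y_1) \not\perp\!\!\!\perp A \mid e(W)$.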
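The final expression can be evaluated exactly for both exposure levels. The sketch below assumes Pr(R = 0) = 0.1, Pr(A = 1 | R = r) = 2(1 + r)/10 and Pr(Y0 = 1 | R = r) = 2(1 + 2r)/20, read off the definitions in this example; the function names are hypothetical.

```python
from fractions import Fraction as F

p_r = {0: F(1, 10), 1: F(9, 10)}                      # Pr(R = r), independent of W
p_a1_r = {r: F(2 * (1 + r), 10) for r in (0, 1)}      # Pr(A = 1 | R = r)
p_y0_r = {r: F(2 * (1 + 2 * r), 20) for r in (0, 1)}  # Pr(Y0 = 1 | R = r), assumed as above

def p_a_given_r(a, r):
    return p_a1_r[r] if a == 1 else 1 - p_a1_r[r]

def p_a(a):
    """Pr(A = a); for a = 1 this is e(w) = 0.38."""
    return sum(p_a_given_r(a, r) * p_r[r] for r in (0, 1))

def p_y0_given_a(a):
    """Pr(Y0 = 1 | A = a), averaging over R given A = a."""
    return sum(p_y0_r[r] * p_a_given_r(a, r) * p_r[r] for r in (0, 1)) / p_a(a)

e_w = p_a(1)
risk_unexposed, risk_exposed = p_y0_given_a(0), p_y0_given_a(1)
```

This yields 17/62 (about 0.274) for a = 0 and 11/38 (about 0.289) for a = 1, so the probability indeed varies with a.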

6 Bibliography

Albert, A. and J. Anderson (1984). "On the existence of maximum likelihood estimates in logistic regression models." Biometrika 71, 1–10.

Ali, M., R. Groenwold, S. Belitser, W. Pestman, A. Hoes, K. Roes, A. de Boer and O. Klungel (2015). "Reporting of covariate selection and balance assessment in propensity score analysis is suboptimal: a systematic review." Journal of Clinical Epidemiology 68, 122–131.

Austin, P. (2011a). "An introduction to propensity score methods for reducing the effects of confounding in observational studies." Multivariate Behavioral Research 46, 399–424.

Austin, P. (2011b). "Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies." Pharmaceutical Statistics 10, 150–161.

Austin, P. and E. Stuart (2015). "Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies." Statistics in Medicine 34, 3661–3679.

Belitser, S., E. Martens, W. Pestman, R. Groenwold, A. Boer and O. Klungel (2011). "Measuring balance and model selection in propensity score methods." Pharmacoepidemiology and Drug Safety 20, 1115–1129.

Breiman, L. (1996). "Bagging predictors." Machine Learning 24, 123–140.

Breiman, L. (2001). "Random forests." Machine Learning 45, 5–32.

Burgette, L. and J. Reiter (2010). "Multiple imputation for missing data via sequential regression trees." American Journal of Epidemiology 172, 1070–1076.

Cham, H. and S. West (2016). "Propensity score analysis with missing data." Psychological Methods 21, 427–445.

Cole, S. and C. Frangakis (2009). "The consistency statement in causal inference: a definition or an assumption?" Epidemiology 20, 3–5.

D'Agostino Jr., R. and D. Rubin (2000). "Estimating and using propensity scores with partially missing data." Journal of the American Statistical Association 95, 749–759.

Doove, L., S. van Buuren and E. Dusseldorp (2014). "Recursive partitioning for missing data imputation in the presence of interaction effects." Computational Statistics & Data Analysis 72, 92–104.

Drake, C. (1993). "Effects of misspecification of the propensity score on estimators of treatment effect." Biometrics 49, 1231–1236.

Elith, J., J. Leathwick and T. Hastie (2008). "A working guide to boosted regression trees." Journal of Animal Ecology 77, 802–813.

Franklin, J., J. Rassen, D. Ackermann, D. Bartels and S. Schneeweiss (2014). "Metrics for covariate balance in cohort studies of causal effects." Statistics in Medicine 33, 1685–1699.

Groenwold, R., D. Nelson, K. Nichol, A. Hoes and E. Hak (2009). "Sensitivity analyses to estimate the potential impact of unmeasured confounding in causal research." International Journal of Epidemiology 39, 107–117.

Groenwold, R. H., I. R. White, A. R. T. Donders, J. R. Carpenter, D. G. Altman and K. G. Moons (2012). "Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis." Canadian Medical Association Journal 184, 1265–1269.

Hastie, T., R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, second edition.

Hernan, M. and J. Robins (2017). "Fine point 4.3: Collapsibility of the odds ratio." In M. Hernan and J. Robins, eds., Causal Inference. Boca Raton: Chapman & Hall/CRC. URL https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/, forthcoming.

Holland, P. (1986). "Statistics in causal inference." Journal of the American Statistical Association 81, 945–960.

Holland, P. (1988). "Causal inference, path analysis, and recursive structural equations models." Sociological Methodology 18, 449–484.

Lee, B., J. Lessler and E. Stuart (2010). "Improving propensity score weighting using machine learning." Statistics in Medicine 29, 337–346.

Lesko, C., A. Buchanan, D. Westreich, J. Edwards, M. Hudgens and S. Cole (2017). "Generalizing study results: a potential outcomes perspective." Epidemiology 28, 553–561.

Leyrat, C., S. R. Seaman, I. R. White, I. Douglas, L. Smeeth, J. Kim, M. Resche-Rigon, J. R. Carpenter and E. J. Williamson (2017). "Propensity score analysis with partially observed covariates: How should multiple imputation be used?" Statistical Methods in Medical Research, 0962280217713032.

Lumley, T. (2014). survey: Analysis of complex survey samples (R package version 3.31). Comprehensive R Archive Network, Vienna, Austria. URL http://cran.r-project.org/web/packages/survey/index.html.

McCaffrey, D., G. Ridgeway and A. Morral (2004). "Propensity score estimation with boosted regression for evaluating adolescent substance abuse treatment." Psychological Methods 9, 403–425.

Meng, X.-L. (1994). "Multiple-imputation inferences with uncongenial sources of input." Statistical Science, 538–558.

Moisen, G. (2008). "Classification and regression trees." In S. Jorgensen and B. Fath, eds., Encyclopedia of Ecology, volume 1. Oxford: Elsevier.

Moons, K. G., R. A. Donders, T. Stijnen and F. E. Harrell Jr. (2006). "Using the outcome for imputation of missing predictor values was preferred." Journal of Clinical Epidemiology 59, 1092–1101.

Neyman, J., K. Iwaszkiewicz and St. Kolodziejczyk (1935). "Statistical problems in agricultural experimentation." Supplement to the Journal of the Royal Statistical Society 2, 107–180.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. New York: Cambridge University Press.

Penning de Vries, B. and R. Groenwold (2016). "Comments on propensity score matching following multiple imputation." Statistical Methods in Medical Research 25, 3066–3068.

Peters, A. and T. Hothorn (2017). ipred: Improved Predictors (R package version 0.9-6). Comprehensive R Archive Network, Vienna, Austria. URL http://cran.r-project.org/web/packages/ipred/index.html.

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rai, D., B. Lee, C. Dalman, C. Newschaffer, G. Lewis and C. Magnusson (2017). "Antidepressants during pregnancy and autism in offspring: population based cohort study." BMJ 385, j2811.

Ridgeway, G. (1999). "The state of boosting." Computing Science and Statistics 31, 172–181.

Ridgeway, G., D. McCaffrey, A. Morral, B. Griffin and L. Burgette (2017). twang: Toolkit for Weighting and Analysis of Nonequivalent Groups (R package version 1.5). Comprehensive R Archive Network, Vienna, Austria. URL http://cran.r-project.org/web/packages/twang/index.html.

Rosenbaum, P. and D. Rubin (1983). "The central role of the propensity score in observational studies for causal effects." Biometrika 70, 41–55.

Rubin, D. (1974). "Estimating causal effects of treatments in randomized and nonrandomized studies." Journal of Educational Psychology 66, 688–701.

Rubin, D. (1976). "Inference and missing data." Biometrika 63, 581–592.

Rubin, D. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Rubin, D. B. et al. (2008). "For objective causal inference, design trumps analysis." The Annals of Applied Statistics 2, 808–840.

Schafer, J. (1997). Analysis of incomplete multivariate data. Boca Raton: CRC Press.

Setoguchi, S., S. Schneeweiss, M. B. MA, R. Glynn and E. Cook (2008). "Evaluating uses of data mining techniques in propensity score estimation: a simulation study." Pharmacoepidemiology and Drug Safety 17, 546–555.

Shah, A. (2014). CALIBERrfimpute: Imputation in MICE using Random Forest (R package version 0.1-2). Comprehensive R Archive Network, Vienna, Austria. URL http://cran.r-project.org/web/packages/CALIBERrfimpute/index.html.

Shah, A., J. Bartlett, J. Carpenter, O. Nicholas and H. Hemingway (2014). "Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study." American Journal of Epidemiology 179, 764–774.

Sturmer, T., M. Joshi, R. Glynn, J. Avorn, K. Rothman and S. Schneeweiss (2006). "A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods." Journal of Clinical Epidemiology 59, 437–e1.

Tchetgen, E. T. and T. VanderWeele (2012). "On causal inference in the presence of interference." Statistical Methods in Medical Research 21, 55–75.

Therneau, T. and E. Atkinson (2017). "An introduction to recursive partitioning using the RPART routines." Rochester: Mayo Foundation.

Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton: CRC Press.

Van Buuren, S. and K. Groothuis-Oudshoorn (2011). "mice: Multivariate imputation by chained equations in R." Journal of Statistical Software 45, 1–67.

Westreich, D. (2012). "Berkson's bias, selection bias, and missing data." Epidemiology 23, 159–164.

Westreich, D., J. Lessler and M. Jonsson Funk (2010). "Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression." Journal of Clinical Epidemiology 63, 826–833.

Wyss, R., A. Ellis, M. Brookhart, C. Girman, M. Jonsson Funk, R. LoCasale and T. Sturmer (2014). "The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing propensity score." American Journal of Epidemiology 180, 645–655.

Supplementary Material

Bias
Without  baCART      −0.050  −0.048  −0.052  −0.046  −0.052  −0.052  −0.053  −0.051
Without  bCART       −0.054  −0.053  −0.056  −0.047  −0.054  −0.055  −0.055  −0.054
With     baCART      −0.094  −0.114  −0.061  −0.116  −0.054  −0.056  −0.031  −0.046
With     bCART       −0.115  −0.170  −0.165   0.110  −0.135  −0.365  −0.228  −0.129
With     CCA+baCART  −0.072  −0.110  −0.068  −0.127  −0.109  −0.191  −0.175  −0.093
With     CCA+bCART   −0.063  −0.070  −0.040  −0.104  −0.102  −0.169  −0.171  −0.079
With     MI+baCART   −0.056  −0.054  −0.026  −0.031  −0.035  −0.033  −0.032  −0.030
With     MI+bCART    −0.062  −0.065  −0.063  −0.061  −0.056  −0.059  −0.056  −0.054
With     MI+LRc      −0.010  −0.014  −0.006   0.000  −0.006  −0.005  −0.004  −0.004
With     MI+LRm      −0.006  −0.004  −0.012  −0.017  −0.008  −0.010  −0.008  −0.005

With     baCART       0.128   0.126   0.127   0.153   0.128   0.127   0.129   0.127

         bCART        0.795   0.685   0.707   0.840   0.773   0.188   0.569   0.775
         CCA+baCART   0.851   0.856   0.880   0.874   0.833   0.702   0.719   0.846
         CCA+bCART    0.888   0.892   0.900   0.894   0.858   0.790   0.773   0.875
         MI+baCART    0.901   0.912   0.928   0.926   0.926   0.921   0.921   0.921
         MI+bCART     0.920   0.927   0.935   0.942   0.934   0.935   0.937   0.934
         MI+LRc       0.939   0.947   0.944   0.942   0.948   0.943   0.942   0.946
         MI+LRm       0.933   0.945   0.941   0.932   0.941   0.940   0.935   0.938

Supplementary Table 1: Performance metrics of propensity score matching estimators in 5000 simulated datasets with and without missing data. Abbreviations: SE, standard error; MSE, mean squared error; 90CI, 90% confidence interval; CART, classification and regression trees; baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRc, logistic regression with correctly specified model; LRm, logistic regression with main effects only.

Missing data  Method       Bias    Bias dif  Emp SE  Mean SE  MSE    Coverage

Inverse probability weighting
Without       baCART        0.061  ref       0.104   0.106    0.015  0.851
Without       bCART         0.028  ref       0.119   0.123    0.015  0.902
With          baCART        0.008  −0.053    0.113   0.113    0.013  0.902
With          bCART        −0.019  −0.048    0.118   0.121    0.014  0.909
With          CCA+baCART    0.039  −0.023    0.171   0.178    0.031  0.906
With          CCA+bCART     0.041   0.013    0.185   0.192    0.036  0.903
With          MI+baCART     0.068   0.007    0.105   0.109    0.016  0.846
With          MI+bCART      0.027  −0.001    0.119   0.128    0.015  0.919
With          MI+LRc        0.004            0.132   0.141    0.017  0.927
With          MI+LRm        0.005            0.130   0.138    0.017  0.920

Matching
Without       baCART       −0.003  ref       0.121   0.121    0.015  0.901
Without       bCART        −0.061  ref       0.133   0.138    0.021  0.879
With          baCART       −0.045  −0.042    0.124   0.121    0.017  0.869
With          bCART        −0.137  −0.075    0.134   0.137    0.037  0.731
With          CCA+baCART   −0.097  −0.094    0.220   0.222    0.058  0.871
With          CCA+bCART    −0.092  −0.031    0.239   0.245    0.066  0.880
With          MI+baCART     0.007   0.010    0.117   0.134    0.014  0.938
With          MI+bCART     −0.063  −0.001    0.129   0.155    0.021  0.923
With          MI+LRc        0.007            0.112   0.131    0.013  0.946
With          MI+LRm        0.007            0.112   0.131    0.013  0.944

Supplementary Table 2: Performance metrics of inverse probability weighting and matching estimators in 5000 simulated datasets of the additional simulation experiment. Abbreviations: Bias dif, estimated bias after minus estimated bias before introduction of missing data; Emp SE, empirical standard error; Mean SE, mean estimated standard error; MSE, mean squared error; CART, classification and regression trees; baCART, bootstrap aggregated CART; bCART, boosted CART; CCA, complete case analysis; MI, multiple imputation; LRc, logistic regression with correctly specified model; LRm, logistic regression with main effects only.
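The columns of Supplementary Table 2 are linked by simple arithmetic, which the following sketch checks for a few inverse probability weighting rows. It assumes that the stripped decimal points are restored as 0.xxx, that MSE is computed as squared bias plus squared empirical standard error, and that the reference for MI+baCART is the baCART row without missing data. These are readings of the table, not statements taken from the text.

```python
# Selected IPW rows: (bias, empirical SE, reported MSE).
# Values are read from Supplementary Table 2 with decimal points restored.
rows = {
    ("Without", "baCART"): (0.061, 0.104, 0.015),
    ("With", "baCART"): (0.008, 0.113, 0.013),
    ("With", "MI+baCART"): (0.068, 0.105, 0.016),
}

# Reported "Bias dif" entries for the rows with missing data.
reported_bias_dif = {
    ("With", "baCART"): -0.053,
    ("With", "MI+baCART"): 0.007,
}


def mse(bias, se):
    """Mean squared error as squared bias plus variance (assumed definition)."""
    return bias ** 2 + se ** 2


# Gap between the recomputed (rounded) MSE and the reported MSE.
mse_gap = {
    key: abs(round(mse(bias, se), 3) - reported)
    for key, (bias, se, reported) in rows.items()
}

# Bias dif = bias after minus bias before the introduction of missing data.
reference_bias = rows[("Without", "baCART")][0]
bias_dif_gap = {
    key: abs(round(rows[key][0] - reference_bias, 3) - reported)
    for key, reported in reported_bias_dif.items()
}

print(mse_gap)
print(bias_dif_gap)
```

Under these assumptions the recomputed values agree with the reported ones to three decimals for the rows shown. Other rows can differ in the last digit because the table entries are themselves rounded.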


propensity score by a parametric (logistic) regression of the exposure on measured covariates. However, parametric models rely on assumptions about the distribution of variables in relation to one another, including the functional form and the presence or absence of interactions. If any of these are violated, covariate balance may not be attained, potentially leading to bias in making causal inferences about the exposure-outcome relation of interest (Drake, 1993).

It has been suggested that machine learning and data mining methods, such as classification and regression tree analysis (CART), be used to estimate the relationship between the exposure and measured covariates. These methods avoid making the assumptions regarding functional form and interaction as in a standard logistic regression. The utility of data mining methods to estimate propensity scores in complete data settings has been studied previously (Setoguchi et al., 2008; Lee et al., 2010; Westreich et al., 2010; Wyss et al., 2014). However, in practice, researchers are often faced with missing values on the measured variables. Whereas incomplete data preclude logistic regression on all subjects, some CART algorithms allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. The ability of CART to accommodate missing values has been described as advantageous (Lee et al., 2010; McCaffrey et al., 2004; Moisen, 2008; Rai et al., 2017). However, the precise impact of missing data on the performance of CART-based propensity score estimators has received little attention. The objective of this study was therefore to examine the performance of various CART-based propensity score estimation procedures in the presence of missing data. Throughout, particular emphasis is placed on the causal odds ratio for the marginal effect among the exposed (or Average Effect among the 'Treated', ATT) as the effect measure of interest.

The remainder of this article is structured as follows. In Section 2, we briefly review pertinent theory. Based on analytical work, we identify caveats in the handling of missing data by CART. Section 3 describes a series of Monte Carlo simulations that were used to evaluate the performance of various approaches to handling missing data, including (i) subjecting incomplete data directly to the CART algorithm, (ii) complete case analysis, and (iii) multiple imputation. In Section 4, we apply and compare the approaches in a case study on the effect of influenza vaccination on mortality. We conclude with a summary and discussion of our findings in the context of the existing literature.

2 Theory

2.1 Propensity score analysis of complete data

Counterfactual outcomes and estimating causal effects

We adopt a perspective of potential or counterfactual outcomes, formal accounts of which are given, for example, by Neyman et al. (1935), Rubin (1974), Holland (1986), Holland (1988), and Pearl (2009).

Consider a sequence $S = (X_1, X_2, \ldots, X_n)$ of variables and let $F = (f_{X_1}, f_{X_2}, \ldots, f_{X_n})$ be a collection of functions $f_{X_j}$ that deterministically map a realisation of the predecessors $(X_i : i < j)$ of $X_j$ and of exogenous variable $\varepsilon_{X_j}$ into a realisation of $X_j$. We may write the random variable $X_j$ as follows:

$$X_j = f_{X_j}(X_1, \ldots, X_{j-1}, \varepsilon_{X_j})$$
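To make this notation concrete, the following minimal sketch evaluates a three-variable system in which each variable is computed from its predecessors and its own exogenous term, and in which the counterfactual outcomes are obtained by fixing the exposure while reusing the same exogenous draws. The variables W, A and Y, the functions `f_w`, `f_a`, `f_y` and all coefficients are hypothetical and chosen only for illustration; they are not the data-generating mechanism of the simulation study.

```python
import random


def f_w(eps_w):
    """Covariate W depends only on its exogenous term."""
    return 1 if eps_w < 0.5 else 0


def f_a(w, eps_a):
    """Exposure A depends on its predecessor W and its exogenous term."""
    return 1 if eps_a < (0.6 if w == 1 else 0.2) else 0


def f_y(w, a, eps_y):
    """Outcome Y depends on its predecessors W and A and its exogenous term."""
    return 1 if eps_y < 0.1 + 0.2 * w + 0.1 * a else 0


def simulate_unit(rng):
    """Evaluate the structural functions in order for a single unit."""
    eps_w, eps_a, eps_y = rng.random(), rng.random(), rng.random()
    w = f_w(eps_w)
    a = f_a(w, eps_a)
    y = f_y(w, a, eps_y)
    # Counterfactual outcomes: same exogenous draw, exposure set by intervention.
    y0 = f_y(w, 0, eps_y)
    y1 = f_y(w, 1, eps_y)
    return {"W": w, "A": a, "Y": y, "Y0": y0, "Y1": y1}


rng = random.Random(2018)
units = [simulate_unit(rng) for _ in range(1000)]

# The observed outcome equals the counterfactual outcome under the
# exposure level that was actually received.
consistent = all(u["Y"] == (u["Y1"] if u["A"] == 1 else u["Y0"]) for u in units)
print(consistent)
```

Because the observed outcome and the counterfactual outcomes share the same exogenous term, the observed outcome coincides by construction with the counterfactual under the received exposure.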

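$$\mathrm{OR} = \frac{E[Y \mid A = 1] \,/\, (1 - E[Y \mid A = 1])}{E[Y_0 \mid A = 1] \,/\, (1 - E[Y_0 \mid A = 1])}$$

$$\phi(w, 0) \propto \frac{e(w)}{1 - e(w)}$$

$$\phi(w, 0)\Pr(W = w \mid A = 0) = \phi(w, 1)\Pr(W = w \mid A = 1)$$

$$\sum_{w} \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) = \sum_{w} \phi(w, 1)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 1)$$

$$\gamma(v, 0) \propto \frac{e(v)}{1 - e(v)}$$

$$\gamma(v, 0)\Pr(V = v \mid A = 0) = \gamma(v, 1)\Pr(V = v \mid A = 1)$$

$$\sum_{v} \gamma(v, 0)\Pr(Y_0 = y_0, Y_1 = y_1, V = v \mid A = 0) = \sum_{v} \gamma(v, 1)\Pr(Y_0 = y_0, Y_1 = y_1, V = v \mid A = 1)$$

for all $y_0, y_1$.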
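The causal odds ratio among the exposed can be illustrated with a small worked example. The sketch below estimates the propensity score by the exposed fraction within each level of a single binary covariate (much as a fitted tree assigns one estimate per terminal node), weights unexposed subjects by the propensity odds e(w)/(1 − e(w)), and forms the odds ratio. The counts and the helper names `propensity_by_stratum` and `att_odds_ratio` are hypothetical; this is an assumed, simplified estimator and not the implementation used in the study.

```python
from collections import defaultdict

# Hypothetical counts: (w, a, y) -> number of subjects.
counts = {
    (0, 1, 1): 4, (0, 1, 0): 16,
    (0, 0, 1): 8, (0, 0, 0): 72,
    (1, 1, 1): 24, (1, 1, 0): 36,
    (1, 0, 1): 8, (1, 0, 0): 32,
}


def propensity_by_stratum(counts):
    """Estimate e(w) = Pr(A = 1 | W = w) by the exposed fraction per stratum."""
    total = defaultdict(int)
    exposed = defaultdict(int)
    for (w, a, _y), k in counts.items():
        total[w] += k
        exposed[w] += k * a
    return {w: exposed[w] / total[w] for w in total}


def att_odds_ratio(counts):
    """Odds ratio among the exposed, weighting the unexposed by e/(1 - e)."""
    e = propensity_by_stratum(counts)
    weighted_events = {0: 0.0, 1: 0.0}
    weighted_total = {0: 0.0, 1: 0.0}
    for (w, a, y), k in counts.items():
        weight = 1.0 if a == 1 else e[w] / (1.0 - e[w])
        weighted_events[a] += weight * k * y
        weighted_total[a] += weight * k
    risk_exposed = weighted_events[1] / weighted_total[1]
    risk_reference = weighted_events[0] / weighted_total[0]
    odds_exposed = risk_exposed / (1.0 - risk_exposed)
    odds_reference = risk_reference / (1.0 - risk_reference)
    return odds_exposed / odds_reference


print(att_odds_ratio(counts))
```

The weighting makes the covariate distribution of the unexposed resemble that of the exposed, so the weighted risk among the unexposed stands in for the risk the exposed would have had without exposure, provided the covariate suffices for confounding adjustment.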

3.1 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

3.2 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

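$$\phi(w, 0)\Pr(W = w \mid A = 0) = \phi(w, 1)\Pr(W = w \mid A = 1)$$

$$\sum_{w} \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) = \sum_{w} \phi(w, 1)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 1)$$

For $a = 0$ we have

$$\begin{aligned}
E\left[\frac{e(W)}{1 - e(W)} \,\middle|\, A = 0\right]
&= E\left[\frac{\Pr(A = 1 \mid W)}{\Pr(A = 0 \mid W)} \,\middle|\, A = 0\right] \\
&= \sum_{w} \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \Pr(W = w \mid A = 0) \\
&= \sum_{w} \frac{\Pr(W = w \mid A = 1)\Pr(A = 1)}{\Pr(W = w \mid A = 0)\Pr(A = 0)} \Pr(W = w \mid A = 0) \\
&= \frac{\Pr(A = 1)}{\Pr(A = 0)}
\end{aligned}$$

$$\begin{aligned}
\phi(w, 0)\Pr(W = w \mid A = 0)
&= \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(W = w \mid A = 0) \\
&= \frac{\Pr(W = w \mid A = 1)\Pr(A = 1)}{\Pr(W = w \mid A = 0)\Pr(A = 0)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(W = w \mid A = 0) \\
&= \Pr(W = w \mid A = 1)
\end{aligned}$$

$$\begin{aligned}
\sum_{w} \phi(w, 0)\Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0)
&= \sum_{w} \frac{\Pr(A = 1 \mid W = w)}{\Pr(A = 0 \mid W = w)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) \\
&= \sum_{w} \frac{\Pr(W = w \mid A = 1)\Pr(A = 1)}{\Pr(W = w \mid A = 0)\Pr(A = 0)} \frac{\Pr(A = 0)}{\Pr(A = 1)} \Pr(Y_0 = y_0, Y_1 = y_1, W = w \mid A = 0) \\
&= \sum_{w} \Pr(W = w \mid A = 1)\Pr(Y_0 = y_0, Y_1 = y_1 \mid W = w, A = 0)
\end{aligned}$$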
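The balancing identity derived in this appendix can be checked numerically. The sketch below uses a hypothetical three-level covariate with made-up values for Pr(W = w) and e(w), and assumes that the weight for the unexposed is the propensity odds divided by its mean among the unexposed, which is the form implied by the factor Pr(A = 0)/Pr(A = 1) in the derivation.

```python
# Hypothetical covariate distribution and propensity scores.
p_w = {0: 0.3, 1: 0.5, 2: 0.2}   # Pr(W = w)
e = {0: 0.2, 1: 0.4, 2: 0.7}     # e(w) = Pr(A = 1 | W = w)

# Marginal exposure probabilities.
p_a1 = sum(p_w[w] * e[w] for w in p_w)
p_a0 = 1.0 - p_a1

# Covariate distribution within each exposure group (Bayes' rule).
p_w_given_a1 = {w: p_w[w] * e[w] / p_a1 for w in p_w}
p_w_given_a0 = {w: p_w[w] * (1.0 - e[w]) / p_a0 for w in p_w}

# Mean propensity odds among the unexposed, E[e(W) / (1 - e(W)) | A = 0].
mean_odds = sum(e[w] / (1.0 - e[w]) * p_w_given_a0[w] for w in p_w)

# Assumed normalised weight for the unexposed.
phi0 = {w: (e[w] / (1.0 - e[w])) / mean_odds for w in p_w}

# Largest deviation from phi(w, 0) Pr(W = w | A = 0) = Pr(W = w | A = 1).
max_gap = max(abs(phi0[w] * p_w_given_a0[w] - p_w_given_a1[w]) for w in p_w)

print(mean_odds, p_a1 / p_a0, max_gap)
```

Both results of the derivation show up in the output: the mean odds among the unexposed equal Pr(A = 1)/Pr(A = 0), and the weighted covariate distribution of the unexposed matches that of the exposed up to floating-point error.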

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

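$$\Pr(A = 1 \mid W = w) = \Pr(A = 1 \mid W = w, R = 0)\Pr(R = 0 \mid W = w) + \Pr(A = 1 \mid W = w, R = 1)\Pr(R = 1 \mid W = w) = 0.38$$

$$\begin{aligned}
\Pr(Y_0 = 1 \mid A = a)
&= \Pr(Y_0 = 1 \mid A = a, R = 0)\Pr(R = 0 \mid A = a) + \Pr(Y_0 = 1 \mid A = a, R = 1)\Pr(R = 1 \mid A = a) \\
&= 0.10\,\frac{\Pr(A = a \mid R = 0)\Pr(R = 0)}{\Pr(A = a)} + 0.30\,\frac{\Pr(A = a \mid R = 1)\Pr(R = 1)}{\Pr(A = a)} \\
&= \frac{0.20^{a}\,0.80^{1-a} \cdot 0.01 + 0.40^{a}\,0.60^{1-a} \cdot 0.27}{0.38^{a}\,0.62^{1-a}}
\end{aligned}$$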
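The closing expression of this appendix can be evaluated for both exposure levels. The sketch below assumes the garbled digits read as 0.20, 0.80, 0.01, 0.40, 0.60, 0.27, 0.38 and 0.62, with a and 1 − a as exponents; that reading is a reconstruction of the extracted text and should be checked against the published appendix.

```python
def pr_y0_given_a(a):
    """Pr(Y0 = 1 | A = a) from the closing expression, for a in {0, 1}."""
    numerator = (
        0.20 ** a * 0.80 ** (1 - a) * 0.01
        + 0.40 ** a * 0.60 ** (1 - a) * 0.27
    )
    denominator = 0.38 ** a * 0.62 ** (1 - a)
    return numerator / denominator


p_exposed = pr_y0_given_a(1)     # exposed, a = 1
p_unexposed = pr_y0_given_a(0)   # unexposed, a = 0

print(round(p_exposed, 4), round(p_unexposed, 4))
```

Under this reading the two probabilities differ (roughly 0.29 among the exposed against 0.27 among the unexposed), so in this example the distribution of the counterfactual outcome Y0 is not the same in the two exposure groups.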

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

as follows

X j = fX j

OR =E[Y |A = 1](1minusE[Y |A = 1])

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

= sumw

1minus e(v)

γ(v0)Pr(V = v|A = 0) = γ(v1)Pr(V = v|A = 1)

γ(v0)Pr(Y0 = y0Y1 = y1V = v|A = 0)

= sumv

γ(v1)Pr(Y0 = y0Y1 = y1V = v|A = 1)

for all y0y1

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

OR =E[Y |A = 1](1minusE[Y |A = 1])

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

= sumw

1minus e(v)

γ(v0)Pr(V = v|A = 0) = γ(v1)Pr(V = v|A = 1)

γ(v0)Pr(Y0 = y0Y1 = y1V = v|A = 0)

= sumv

γ(v1)Pr(Y0 = y0Y1 = y1V = v|A = 1)

for all y0y1

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

= sumw

1minus e(v)

γ(v0)Pr(V = v|A = 0) = γ(v1)Pr(V = v|A = 1)

γ(v0)Pr(Y0 = y0Y1 = y1V = v|A = 0)

= sumv

γ(v1)Pr(Y0 = y0Y1 = y1V = v|A = 1)

for all y0y1

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

= sumw

1minus e(v)

γ(v0)Pr(V = v|A = 0) = γ(v1)Pr(V = v|A = 1)

γ(v0)Pr(Y0 = y0Y1 = y1V = v|A = 0)

= sumv

γ(v1)Pr(Y0 = y0Y1 = y1V = v|A = 1)

for all y0y1

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

31 Methods

Data generation

2 minus04W 24 +07W 2

W1W2 W3 W4

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Scenarios

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Estimators

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Performance metrics

32 Results

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Other performance

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

4 Case study

5 Discussion

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

1 Introduction

2 Theory

4 Case study

5 Discussion

6 Bibliography

Appendices

Appendix A

1minus e(w)

ϕ(w0)Pr(W = w|A = 0) = ϕ(w1)Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

ϕ(w1)Pr(Y0 = y0Y1 = y1W = w|A = 1)

For a = 0 we have

1minus e(W )

∣∣∣A = 0]

Pr(A = 1|W )

Pr(A = 0|W )

∣∣∣A = 0]

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(W = w|A = 0)

= sumw

Pr(W = w|A = 0)

Pr(A = 0)sumw

Pr(W = w|A = 1)Pr(A = 1)

=Pr(A = 1)Pr(A = 0)

ϕ(w0)Pr(W = w|A = 0)

Pr(W = w|A = 0)

=Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

Pr(A = 0)Pr(A = 1)

Pr(W = w|A = 0)

= Pr(W = w|A = 1)

ϕ(w0)Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 1|W = w)Pr(A = 0|W = w)

Pr(A = 0)Pr(A = 1)

Pr(Y0 = y0Y1 = y1W = w|A = 0)

= sumw

Pr(A = 0)Pr(A = 1)

= sumw

Pr(W = w|A = 1)Pr(Y0 = y0Y1 = y1|W = wA = 0)

Appendix B

= sumy

Pr(Y = y|A = a)

=q(yua)u

Appendix C

+Pr(A = 1|W = wR = 1)Pr(R = 1|W = w)= 038

+Pr(Y0 = 1|A = aR = 1)Pr(R = 1|A = a)

= 010Pr(A = a|R = 0)Pr(R = 0)

Pr(A = a)

+030Pr(A = a|R = 1)Pr(R = 1)

Pr(A = a)

=020a0801minusa001+040a0601minusa027

038a0621minusa

With baCART 0128 0126 0127 0153 0128 0127 0129 0127

