More Robust Doubly Robust Estimators
Marie Davidian and Anastasios A. Tsiatis
Department of Statistics
North Carolina State University
Weihua Cao
Center for Devices and Radiological Health
US Food and Drug Administration
http://www.stat.ncsu.edu/~davidian
More Robust Doubly Robust Estimators 1
Outline
• The simplest setting
• Estimators and double robustness
• More robust doubly robust estimators
• Extension to longitudinal data
More Robust Doubly Robust Estimators 2
The problem
The simplest setting:
• Response (outcome) Y
• Parameter of interest: µ = E(Y)
• Full, ideal data: (Yi, Xi), i = 1, . . . , n, independent and identically
distributed (iid)
• Xi is a vector of covariates for individual i
• With no additional assumptions (nonparametric model): Usual
estimator for µ
µ̂ = n^{-1} ∑_{i=1}^{n} Yi   (the X's are not needed)
More Robust Doubly Robust Estimators 3
The problem
Complication: Response is missing for some individuals
• Ri = 1 if Y is observed for individual i and Ri = 0 otherwise
• Observed data: iid (Ri, RiYi, Xi), i = 1, . . . , n
• Unless responses are missing completely at random (MCAR),
i.e., Ri ⊥⊥ Yi, the complete-case estimator
( ∑_{i=1}^{n} Ri )^{-1} ∑_{i=1}^{n} Ri Yi
is a biased (inconsistent) estimator for µ
• Suppose we are willing to assume that the data are missing at
random (MAR), i.e.,
Ri ⊥⊥ Yi | Xi
• Under MAR, several estimators for µ have been advocated. . .
More Robust Doubly Robust Estimators 4
Outcome regression estimator
Under MAR: R ⊥⊥ Y | X
µ = E(Y) = E{E(Y|X)} = E{E(Y|R = 1, X)}   (1)
• Suppose the true E(Y|X) = m0(X), and posit a (parametric)
regression model m(X, β) for m0(X)
• If the model is correct, µ = E(Y) = E{m0(X)} = E{m(X, β0)}, suggesting
µ̂OR = n^{-1} ∑_{i=1}^{n} m(Xi, β̂)   for some estimator β̂
• By (1), β can be estimated using the complete cases
{(Yi, Xi) : Ri = 1}; e.g., by least squares, solving
∑_{i=1}^{n} Ri mβ(Xi, β){Yi − m(Xi, β)} = 0,   where mβ(X, β) = ∂m(X, β)/∂β
• Whether or not µ̂OR is a consistent estimator for µ depends on
whether or not the posited model m(X, β) is correct
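To make this concrete, here is a minimal Python sketch of µ̂OR under an illustrative linear specification m(X, β) = (1, Xᵀ)β fit by least squares on the complete cases; the function name and the linear form are assumptions for illustration, not part of the original development.

```python
import numpy as np

def mu_hat_or(y, r, x):
    """Outcome regression estimator: fit m(X, beta) by least squares on the
    complete cases (R_i = 1), then average the fitted values over ALL i."""
    x1 = np.column_stack([np.ones(len(r)), x])   # intercept + covariates
    obs = r == 1                                 # complete cases
    beta_hat, *_ = np.linalg.lstsq(x1[obs], y[obs], rcond=None)
    return float(np.mean(x1 @ beta_hat))         # n^{-1} sum_i m(X_i, beta_hat)
```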
More Robust Doubly Robust Estimators 5
Inverse propensity score weighted estimator
Under MAR: R ⊥⊥ Y | X
pr(R = 1|Y, X) = E(R|Y, X) = E(R|X) = pr(R = 1|X)
• Propensity score pr(R = 1|X); π0(X) = true propensity score
• If π0(X) were known, estimate µ by
n^{-1} ∑_{i=1}^{n} Ri Yi / π0(Xi)   or   { n^{-1} ∑_{i=1}^{n} Ri / π0(Xi) }^{-1} n^{-1} ∑_{i=1}^{n} Ri Yi / π0(Xi)
• I.e., weight the observed responses (Yi : Ri = 1) so that these
individuals represent themselves and “similar” others whose
responses are missing
More Robust Doubly Robust Estimators 6
Inverse propensity score weighted estimator
Under MAR: R ⊥⊥ Y | X
pr(R = 1|Y, X) = E(R|Y, X) = E(R|X) = pr(R = 1|X)   (2)
n^{-1} ∑_{i=1}^{n} Ri Yi / π0(Xi)  →p  E{ RY / π0(X) }
= E[ E{ RY / π0(X) | Y, X } ]
= E{ Y E(R|Y, X) / π0(X) }
= E{ Y π0(X) / π0(X) }   using MAR (2)
= E(Y) = µ,   as long as π0(X) > 0 almost surely
More Robust Doubly Robust Estimators 7
Inverse propensity score weighted estimator
Unless missingness is by design: π0(X) is not known
• Posit a model π(X, γ) for pr(R = 1|X); e.g., logistic regression
• Estimate γ; e.g., by maximum likelihood (ML), maximizing
∏_{i=1}^{n} {π(Xi, γ)}^{Ri} {1 − π(Xi, γ)}^{1−Ri},
and form the estimator
µ̂IPW = n^{-1} ∑_{i=1}^{n} Ri Yi / π(Xi, γ̂)
• Interestingly: if the model is correct, i.e., π(X, γ0) = π0(X), then
n^{-1} ∑_{i=1}^{n} Ri Yi / π(Xi, γ̂)   is more efficient than   n^{-1} ∑_{i=1}^{n} Ri Yi / π0(Xi)
• If the propensity score model is misspecified, then µ̂IPW may be
biased (inconsistent)
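A matching sketch of µ̂IPW, assuming (for illustration) a logistic propensity model fit by ML via statsmodels; the function name is hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def mu_hat_ipw(y, r, x):
    """IPW estimator: fit pi(X, gamma) by ML logistic regression, then
    inverse-weight the observed responses."""
    x1 = sm.add_constant(x)                            # logistic linear predictor
    pi_hat = sm.Logit(r, x1).fit(disp=0).predict(x1)   # fitted pi(X_i, gamma_hat)
    yy = np.where(r == 1, y, 0.0)                      # R_i Y_i, missing Y zeroed
    return float(np.mean(r * yy / pi_hat))
```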
More Robust Doubly Robust Estimators 8
Semiparametric theory
Robins, Rotnitzky, and Zhao (1994): If the propensity model is
correct, with no additional assumptions on the distribution of the data:
• All consistent and asymptotically normal estimators are
asymptotically equivalent to estimators of the form
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi, γ̂) + {Ri − π(Xi, γ̂)} / π(Xi, γ̂) · h(Xi) ]   for some function h(X)
• The optimal h(X), leading to the smallest (asymptotic) variance, is
h(X) = −E(Y|X)
• This suggests modeling E(Y|X) by m(X, β), estimating β, and
estimating µ by
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi, γ̂) − {Ri − π(Xi, γ̂)} / π(Xi, γ̂) · m(Xi, β̂) ]
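Combining the two fits gives the “usual” doubly robust estimator; a minimal sketch under the same illustrative linear-outcome and logistic-propensity assumptions as above.

```python
import numpy as np
import statsmodels.api as sm

def mu_hat_dr(y, r, x):
    """'Usual' doubly robust estimator: the IPW term plus an augmentation
    built from the outcome regression fit m(X, beta_hat)."""
    x1 = sm.add_constant(x)
    pi_hat = sm.Logit(r, x1).fit(disp=0).predict(x1)             # ML propensity fit
    obs = r == 1
    beta_hat, *_ = np.linalg.lstsq(x1[obs], y[obs], rcond=None)  # OLS, complete cases
    m_hat = x1 @ beta_hat                                        # m(X_i, beta_hat)
    yy = np.where(obs, y, 0.0)                                   # missing Y zeroed
    return float(np.mean(r * yy / pi_hat - (r - pi_hat) / pi_hat * m_hat))
```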
More Robust Doubly Robust Estimators 9
Semiparametric theory
• If γ̂ →p γ* and β̂ →p β*, this estimator converges in probability to
E[ RY / π(X, γ*) − {R − π(X, γ*)} / π(X, γ*) · m(X, β*) ]
= E[ Y + {R − π(X, γ*)} / π(X, γ*) · {Y − m(X, β*)} ]
= µ + E[ {R − π(X, γ*)} / π(X, γ*) · {Y − m(X, β*)} ]
• Because R ⊥⊥ Y | X, conditioning on X gives
E[ {R − π(X, γ*)} / π(X, γ*) · {Y − m(X, β*)} ] = E[ {π0(X) − π(X, γ*)} / π(X, γ*) · {m0(X) − m(X, β*)} ],
which is zero, so that the estimator is consistent, if either
– π(X, γ*) = π(X, γ0) = π0(X) (propensity model correct), or
– m(X, β*) = m(X, β0) = m0(X) (outcome regression correct)
• Double robustness
More Robust Doubly Robust Estimators 10
Not so fast. . .
Kang and Schafer (2007): Simulation studies
• The “usual” doubly robust estimator can be severely biased when
both models are even just “slightly” misspecified
• And it can exhibit non-negligible bias if the estimated propensity score
is close to zero for some observations, even if the model is correct
• The outcome regression estimator µ̂OR performed much better,
even under misspecification
• “Two wrong models are not necessarily better than one”
More Robust Doubly Robust Estimators 11
Simulation scenario
• Zi = (Zi1, . . . , Zi4)ᵀ ∼ standard multivariate normal, n = 1000
• Xi = (Xi1, . . . , Xi4)ᵀ, where
Xi1 = exp(Zi1/2),   Xi2 = Zi2/{1 + exp(Zi1)} + 10,
Xi3 = (Zi1Zi3/25 + 0.6)³,   Xi4 = (Zi3 + Zi4 + 20)²
• True outcome regression model: Y|X ∼ N{m0(X), 1},
m0(X) = 210 + 27.4Z1 + 13.7Z2 + 13.7Z3 + 13.7Z4
• True propensity score model:
π0(X) = expit(−Z1 + 0.5Z2 − 0.25Z3 − 0.1Z4)
• Misspecified models used the X's instead of the Z's, fitted by least
squares and ML, respectively (“usual” doubly robust estimator µ̂DR)
• True µ = 210
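For reference, a sketch of this data-generating mechanism in Python (the function name and seed handling are incidental choices); correctly specified analyses regress on the Z's, misspecified ones on the X's.

```python
import numpy as np

def kang_schafer_data(n=1000, seed=0):
    """Generate one dataset from the Kang-Schafer (2007) scenario above."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 4))
    x = np.column_stack([
        np.exp(z[:, 0] / 2),
        z[:, 1] / (1 + np.exp(z[:, 0])) + 10,
        (z[:, 0] * z[:, 2] / 25 + 0.6) ** 3,
        (z[:, 2] + z[:, 3] + 20) ** 2,
    ])
    m0 = 210 + 27.4 * z[:, 0] + 13.7 * (z[:, 1] + z[:, 2] + z[:, 3])
    y = m0 + rng.standard_normal(n)        # Y | X ~ N(m0(X), 1)
    pi0 = 1 / (1 + np.exp(z[:, 0] - 0.5 * z[:, 1] + 0.25 * z[:, 2] + 0.1 * z[:, 3]))
    r = rng.binomial(1, pi0)               # MAR missingness, true propensity pi0
    return y, r, x, z
```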
More Robust Doubly Robust Estimators 12
Simulation results
Bias RMSE MCSD AveSE Cov
(MCSD = Monte Carlo standard deviation; AveSE = average of estimated standard errors; Cov = Monte Carlo coverage of nominal 95% confidence intervals)
OR correct, PS correct
µ̂OR −0.03 1.13 1.13 1.15 0.95
µ̂DR −0.03 1.13 1.13 1.15 0.95
OR correct, PS incorrect
µ̂OR −0.03 1.13 1.13 1.15 0.95
µ̂DR −0.03 2.07 2.07 1.32 0.95
OR incorrect, PS correct
µ̂OR 2.31 2.72 1.43 1.41 0.63
µ̂DR 0.18 1.84 1.84 1.64 0.93
OR incorrect, PS incorrect
µ̂OR 2.31 2.72 1.43 1.41 0.63
µ̂DR −17.89 179.88 178.98 22.60 0.94
More Robust Doubly Robust Estimators 13
New perspective
Can we identify alternative doubly robust estimators?
• For now: the propensity model π(X) is fixed (no unknown parameters)
• Consider the class of estimators
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi) − {Ri − π(Xi)} / π(Xi) · m(Xi, β) ],
indexed by the parameter β
• Assume the propensity model is correct; i.e., π(X) = π0(X)
• The outcome regression model may or may not be correct
• Then each estimator in the class is consistent for µ and can be
shown to have asymptotic variance
var(Y) + E[ {1 − π0(X)} / π0(X) · {Y − m(X, β)}² ]
More Robust Doubly Robust Estimators 14
New perspective
var(Y) + E[ {1 − π0(X)} / π0(X) · {Y − m(X, β)}² ]   (3)
• The best (efficient) estimator in this class is the one with the smallest
asymptotic variance =⇒ minimize (3) in β
• Choose β = βopt satisfying
E[ {1 − π0(X)} / π0(X) · mβ(X, βopt) {Y − m(X, βopt)} ] = 0
or, equivalently,
E[ {1 − π0(X)} / π0(X) · mβ(X, βopt) {m0(X) − m(X, βopt)} ] = 0
More Robust Doubly Robust Estimators 15
New perspective
Key idea:
• Because the estimator
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi) − {Ri − π(Xi)} / π(Xi) · m(Xi, β̂) ]
is asymptotically equivalent to the estimator
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi) − {Ri − π(Xi)} / π(Xi) · m(Xi, β*) ],
where β̂ →p β*, it would be desirable to find β̂ such that
1. β̂ →p βopt when the propensity score model is correct, whether
or not the regression model is misspecified
2. m(X, β̂) is a consistent estimator for m0(X), i.e., β̂ →p β0,
when the outcome regression model is correct, even if the
propensity score model is misspecified
More Robust Doubly Robust Estimators 16
New perspective
• The least squares estimator for β is the solution to
∑_{i=1}^{n} Ri {Yi − m(Xi, β)} mβ(Xi, β) = 0
and hence satisfies condition 2.
• However, this estimator →p β*, the solution to
E[ R mβ(X, β*) {Y − m(X, β*)} ] = E[ π0(X) mβ(X, β*) {Y − m(X, β*)} ]
= E[ π0(X) mβ(X, β*) {m0(X) − m(X, β*)} ] = 0
• Thus, β* is not the same as βopt, the solution to
E[ {1 − π0(X)} / π0(X) · mβ(X, βopt) {m0(X) − m(X, βopt)} ] = 0,
unless m(X, β*) = m0(X) (in which case β* = β0 = βopt)
• So the least squares estimator does not satisfy condition 1.
More Robust Doubly Robust Estimators 17
New perspective
E[ {1 − π0(X)} / π0(X) · mβ(X, βopt) {m0(X) − m(X, βopt)} ] = 0   (4)
Implication:
• The “usual” doubly robust estimator, using the least squares
estimator for β, is doubly robust but does not achieve minimum
variance when m(X, β) is misspecified
• Motivated by (4), consider estimating β instead by the solution to
∑_{i=1}^{n} Ri [ {1 − π(Xi)} / π²(Xi) ] mβ(Xi, β) {Yi − m(Xi, β)} = 0
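For a linear model m(X, β) = (1, Xᵀ)β this equation is just weighted least squares on the complete cases with weights Ri{1 − π(Xi)}/π²(Xi); a minimal sketch, with the helper name assumed for illustration:

```python
import numpy as np

def beta_hat_weighted(y, r, x1, pi_hat):
    """Solve the weighted estimating equation above for a linear outcome model:
    weighted least squares with weights w_i = R_i (1 - pi_i) / pi_i^2."""
    w = r * (1 - pi_hat) / pi_hat ** 2
    yy = np.where(r == 1, y, 0.0)            # missing Y never enter (their w_i = 0)
    xtwx = x1.T @ (x1 * w[:, None])          # sum_i w_i x_i x_i'
    xtwy = x1.T @ (w * yy)                   # sum_i w_i x_i y_i
    return np.linalg.solve(xtwx, xtwy)
```

Plugging m(Xi, β̂) with this β̂ into the doubly robust estimator then yields the proposed estimator.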
More Robust Doubly Robust Estimators 18
New perspective
• For arbitrary π(X), this estimator converges in probability to the solution to
E[ R {1 − π(X)} / π²(X) · mβ(X, β) {Y − m(X, β)} ]
= E[ π0(X) {1 − π(X)} / π²(X) · mβ(X, β) {Y − m(X, β)} ]
= E[ π0(X) {1 − π(X)} / π²(X) · mβ(X, β) {m0(X) − m(X, β)} ] = 0   (5)
• Thus, if the propensity model is correct, i.e., π(X) = π0(X), (5)
reduces to (4), and the estimator →p βopt
• And, more generally, the estimator satisfies condition 2.
More Robust Doubly Robust Estimators 19
Result
• The best estimator for β in the regression model
E(Y|X) = m(X, β) (under var(Y|X) = constant) does not
necessarily lead to the best estimator for µ
• Weighting the summand in the least squares estimating equation by
{1 − π(Xi)} / π²(Xi) may be preferable
• This holds more generally; e.g., if m(X, β) is a generalized linear model
More Robust Doubly Robust Estimators 20
Projection estimator
• The foregoing took the propensity score model to be fixed, with no
unknown parameters γ
• When γ is estimated via ML, a slight modification is necessary
• The optimal estimator for β is found by solving jointly in (β, c)
∑_{i=1}^{n} {Ri / π(Xi, γ̂)} ( [{1 − π(Xi, γ̂)} / π(Xi, γ̂)] mβ(Xi, β) ; πγ(Xi, γ̂) / {1 − π(Xi, γ̂)} ) × [ Yi − m(Xi, β) − cᵀ πγ(Xi, γ̂) / {1 − π(Xi, γ̂)} ] = 0,
where ( a ; b ) denotes the stacked vector of equations for β and c, and πγ(X, γ) = ∂π(X, γ)/∂γ
• Denote the resulting doubly robust estimator by µ̂PROJ
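A rough sketch of how this system might be solved numerically, assuming a linear outcome model and passing in pi_grad with rows πγ(Xi, γ̂); the function name, inputs, and the generic root finder are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import root

def beta_c_projection(y, r, x1, pi_hat, pi_grad):
    """Jointly solve the (beta, c) estimating equation above for a linear
    outcome model m(X, beta) = x1 @ beta; pi_grad[i] = pi_gamma(X_i, gamma_hat)."""
    p, q = x1.shape[1], pi_grad.shape[1]
    yy = np.where(r == 1, y, 0.0)                   # missing Y get zero weight anyway

    def estimating_eq(theta):
        beta, c = theta[:p], theta[p:]
        resid = yy - x1 @ beta - (pi_grad @ c) / (1 - pi_hat)
        w_beta = r * (1 - pi_hat) / pi_hat ** 2     # (R/pi) * (1 - pi)/pi
        w_c = r / (pi_hat * (1 - pi_hat))           # (R/pi) / (1 - pi)
        return np.concatenate([x1.T @ (w_beta * resid), pi_grad.T @ (w_c * resid)])

    theta_hat = root(estimating_eq, np.zeros(p + q)).x
    return theta_hat[:p], theta_hat[p:]             # (beta_hat, c_hat)
```

µ̂PROJ then plugs m(Xi, β̂) into the doubly robust estimator with π(Xi, γ̂).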
More Robust Doubly Robust Estimators 21
Simulation results revisited
Bias RMSE MCSD AveSE Cov
OR correct, PS correct
µ̂OR −0.03 1.13 1.13 1.15 0.95
µ̂DR −0.03 1.13 1.13 1.15 0.95
µ̂PROJ −0.03 1.13 1.13 1.15 0.95
OR correct, PS incorrect
µ̂OR −0.03 1.13 1.13 1.15 0.95
µ̂DR −0.03 2.07 2.07 1.32 0.95
µ̂PROJ −0.03 1.13 1.13 1.15 0.95
More Robust Doubly Robust Estimators 22
Simulation results revisited
Bias RMSE MCSD AveSE Cov
OR incorrect, PS correct
µ̂OR 2.31 2.72 1.43 1.41 0.63
µ̂DR 0.18 1.84 1.84 1.64 0.93
µ̂PROJ 0.19 1.18 1.16 1.17 0.95
OR incorrect, PS incorrect
µ̂OR 2.31 2.72 1.43 1.41 0.63
µ̂DR −17.89 179.88 178.98 22.60 0.94
µ̂PROJ 0.13 1.25 1.25 1.20 0.94
More Robust Doubly Robust Estimators 23
Remarks
Take home message:
• How one estimates β in the regression model can have substantial
impact on the performance of the corresponding doubly robust
estimator for µ, especially under model misspecification
• Details: Cao, Tsiatis, and Davidian (2009)
• A similar idea was proposed in the survey sampling literature by
Montanari (1987)
• Tan (2006) suggested a related strategy: substitute α0 + α1 m(X, β)
for m(X, β), estimate β by least squares, and then estimate (α0, α1)
by a weighted regression
• All of this generalizes to estimation of more complex quantities
More Robust Doubly Robust Estimators 24
Remarks
Causal inference on treatment effect in an observational study:
Supplementary material to Cao et al. (2009)
• Potential outcomes (Y0, Y1) for treatments (0, 1); inference on
∆ = E(Y1) − E(Y0)
• Observed data: iid (Ri, Yi, Xi), i = 1, . . . , n; treatment indicator
Ri = 0 or 1; assume Y = (1 − R)Y0 + RY1
• No unmeasured confounders: (Y0, Y1) ⊥⊥ R | X
• Propensity score model: π(X) = pr(R = 1|X)
• Estimators for ∆: for some h(X, β),
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi) − (1 − Ri) Yi / {1 − π(Xi)} − {Ri − π(Xi)} h(Xi, β̂) ]
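A minimal sketch of this class of estimators for ∆, using the optimal h(X, β) given on the next slide with arm-specific linear outcome fits and a logistic propensity model; the model forms and the function name are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

def delta_hat_dr(y, t, x):
    """Doubly robust estimator of Delta = E(Y1) - E(Y0), with
    h(X, beta) = m0(X, a0)/(1 - pi) + m1(X, a1)/pi from arm-specific OLS fits."""
    x1 = sm.add_constant(x)
    pi_hat = sm.Logit(t, x1).fit(disp=0).predict(x1)   # pr(R = 1 | X)
    a1, *_ = np.linalg.lstsq(x1[t == 1], y[t == 1], rcond=None)
    a0, *_ = np.linalg.lstsq(x1[t == 0], y[t == 0], rcond=None)
    h = (x1 @ a0) / (1 - pi_hat) + (x1 @ a1) / pi_hat  # optimal h(X, beta_hat)
    return float(np.mean(t * y / pi_hat - (1 - t) * y / (1 - pi_hat)
                         - (t - pi_hat) * h))
```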
More Robust Doubly Robust Estimators 25
Remarks
n^{-1} ∑_{i=1}^{n} [ Ri Yi / π(Xi) − (1 − Ri) Yi / {1 − π(Xi)} − {Ri − π(Xi)} h(Xi, β̂) ]
• Let E(Yk|X) = m0^{(k)}(X); posit models mk(X, αk), k = 0, 1, and let
β = (α0ᵀ, α1ᵀ)ᵀ
• Optimal choice: if the models are correct,
h(X, β) = m0(X, α0) / {1 − π(X)} + m1(X, α1) / π(X)
• Can find βopt and β̂ to achieve the “best” doubly robust estimator
More Robust Doubly Robust Estimators 26
Remarks
What if we start instead from the premise that the regression model is correct?
• That is, assume m(X, β) is correct, so m(X, β0) = m0(X)
• The propensity model may or may not be correct
• The “best” estimator for µ is then µ̂OR
• However: see Tsiatis and Davidian (2007), discussion of Kang and
Schafer (2007), for speculation on choices of π(X) that yield
alternative doubly robust estimators for µ with good robustness
properties
More Robust Doubly Robust Estimators 27
Longitudinal data
Extension: Monotonely coarsened data and general parameters in a
statistical model (Tsiatis, Davidian, and Cao, 2011)
• Special case: longitudinal study with dropout
• Collect data Lj at time tj, j = 1, . . . , M + 1
• Full, ideal data: L = L̄M+1 = (L1, . . . , LM+1)
• Dropout: if a subject is last seen at time tj, the dropout indicator D = j,
and we observe only L̄j = (L1, . . . , Lj)
• Observed data: iid (Di, L̄Di), i = 1, . . . , n
• Interest: parameter µ in a semiparametric model for the full data
• Full-data estimator for µ: solve
∑_{i=1}^{n} ϕ(Li, µ) = 0,   where E{ϕ(L, µ)} = 0
More Robust Doubly Robust Estimators 28
Missing at random
MAR assumption: the probability of dropout depends on the full data only
through the data observed before dropout
• pr(D = j|L) depends only on L̄j, j = 1, . . . , M + 1
• Must have pr(D = M + 1|L) > 0 a.s.
• Fully specified dropout model: pr(D = j|L) = π(j, L̄j),
j = 1, . . . , M + 1
• Write π(M + 1, L) = π(L)
More Robust Doubly Robust Estimators 29
Estimators
Robins et al. (1994), Tsiatis (2006): If the dropout model is
correctly specified:
• All consistent and asymptotically normal estimators for µ solve
∑_{i=1}^{n} [ I(Di = M + 1) ϕ(Li, µ) / π(Li) + ∑_{j=1}^{M} {dMj(L̄j,i) / Kj(L̄j,i)} Lj(L̄j,i) ] = 0
• Arbitrary functions Lj(L̄j), j = 1, . . . , M
• λ1(L̄1) = π(1, L̄1);   λj(L̄j) = π(j, L̄j) / {1 − ∑_{k=1}^{j−1} π(k, L̄k)}, j = 2, . . . , M
• dMj(L̄j) = I(D = j) − λj(L̄j) I(D ≥ j), j = 1, . . . , M
• Kj(L̄j) = 1 − ∑_{k=1}^{j} π(k, L̄k), j = 1, . . . , M
• Optimal choice: Lj(L̄j) = E{ϕ(L, µ)|L̄j}
More Robust Doubly Robust Estimators 30
Estimators
Simplest case: M = 1, L1 = X, L2 = Y
• D = 1: observe L̄1 = X; D = 2: observe L = (X, Y) (full data)
• Estimate µ = E(Y) =⇒ ϕ(L, µ) = Y − µ
• π(1, L̄1) = pr(D = 1|X) = 1 − pr(D = 2|X);
π(2, L̄2) = π(L) = pr(D = 2|X, Y) = pr(D = 2|X) = π(X), say,
so that π(1, L̄1) = 1 − π(X)
• 1st term = I(D = 2)(Y − µ) / π(X)
• 2nd term = [ I(D = 1) − {1 − π(X)} I(D ≥ 1) ] / [ 1 − {1 − π(X)} ] · L1(X)
= −[ {I(D = 2) − π(X)} / π(X) ] L1(X)
• Optimal L1(X) = E(Y − µ|X) = E(Y|X) − µ
• I(D = 2) ⇐⇒ I(R = 1) = R
• Algebra =⇒ the same estimator for µ as before (spelled out below)
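Spelling out the algebra in the last bullet (with R = I(D = 2) and the optimal L1(X) = E(Y|X) − µ):

```latex
% Summand of the estimating equation:
\frac{R(Y-\mu)}{\pi(X)} - \frac{R-\pi(X)}{\pi(X)}\,\{E(Y\mid X)-\mu\}
  = \frac{RY}{\pi(X)} - \frac{R-\pi(X)}{\pi(X)}\,E(Y\mid X) - \mu ,
% so summing over i, setting the result to zero, and solving for mu gives
\hat{\mu} = n^{-1}\sum_{i=1}^{n}\left[\frac{R_i Y_i}{\pi(X_i)}
  - \frac{R_i-\pi(X_i)}{\pi(X_i)}\,E(Y\mid X_i)\right],
% the doubly robust estimator seen earlier, with E(Y|X) modeled by m(X, beta-hat)
```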
More Robust Doubly Robust Estimators 31
Estimators
Models for E{ϕ(L, µ)|L̄j}:
• Specify Lj(L̄j, β), j = 1, . . . , M
• Feasibility: must satisfy E{Lj+1(L̄j+1, β)|L̄j} = Lj(L̄j, β)
• Want an estimator for β that satisfies conditions analogous to
conditions 1. and 2. from before
Estimator for β: Solve
∑_{i=1}^{n} ∑_{j=1}^{M} I(Di > j) qj(L̄j,i, β) {Lj+1(L̄j+1,i, β) − Lj(L̄j,i, β)} = 0,
where LM+1(L̄M+1, β) = ϕ(L, µ)
• Appropriate weighting:
qj(L̄j, β) = −{Kj(L̄j)}^{-1} ∑_{k=1}^{j} {λk(L̄k) / Kk(L̄k)} Lk,β(L̄k, β),   where Lj,β(L̄j, β) = ∂Lj(L̄j, β)/∂β
More Robust Doubly Robust Estimators 32
Estimators
• All of this extends by analogy to the projection estimator when the
π(j, L̄j) are modeled and fitted
Simplest case: With these weights, the estimating equation becomes
∑_{i=1}^{n} I(Di = 2) [ {1 − π(Xi)} / π²(Xi) ] L1,β(Xi, β) [ Yi − {L1(Xi, β) + µ} ] = 0,
where L1(X, β) + µ plays the role of the model for E(Y|X)
• With I(D = 2) ⇐⇒ R, this is the same equation as on slide 18
• The reasoning extends to M > 1
Simulation studies: Tsiatis et al. (2011)
• The proposed estimator outperformed a competing estimator of
Bang and Robins (2005) that does not attempt to “optimize” the
method of fitting the models Lj(L̄j, β)
More Robust Doubly Robust Estimators 33
Discussion
• Not all doubly robust estimators are the same
• Key: how one estimates β in the regression model matters
• The proposed estimators are designed to equal or exceed the
efficiency of the “usual” doubly robust estimators when the missingness
mechanism is correctly modeled
• Empirical evidence suggests that using an estimator for β that
minimizes the variance in this case endows the estimator for µ with
robustness to model misspecification
• . . . and serves to counteract instability of estimation due to very
small estimated π(X, γ̂)
• This idea can be adapted to other contexts
More Robust Doubly Robust Estimators 34
References
Bang, H. and Robins, J.M. (2005). Doubly robust estimation in missing data and
causal inference models. Biometrics 61, 962–972.
Cao, W., Tsiatis, A.A. and Davidian, M. (2009). Improving efficiency and
robustness of the doubly robust estimator for a population mean with
incomplete data. Biometrika 96, 723–734.
Kang, J.D.Y. and Schafer, J.L. (2007). Demystifying double robustness: a
comparison of alternative strategies for estimating a population mean from
incomplete data (with discussion and rejoinder). Statistical Science 22, 523–580.
Montanari, G.E. (1987). Post-sampling efficient QR-prediction in large-sample
surveys. International Statistical Review 55, 191–202.
Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1994). Estimation of regression
coefficients when some regressors are not always observed. Journal of the
American Statistical Association 89, 846–866.
Tan, Z. (2006). A distributional approach for causal inference using propensity
scores. Journal of the American Statistical Association 101, 1619–1637.
Tan, Z. (2007). Understanding OR, PS, and DR. Statistical Science 22, 560–568.
More Robust Doubly Robust Estimators 35
References
Tan, Z. (2008). Comment: Improved local efficiency and double-robustness. The
International Journal of Biostatistics 4, Issue 1, Article 10.
Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data. New York:
Springer.
Tsiatis, A.A., Davidian, M. and Cao, W. (2011). Improved doubly robust estimation
when the data are monotonely coarsened, with application to longitudinal
studies with dropout. Biometrics 67, 536–545.
More Robust Doubly Robust Estimators 36