A REVIEW By Chi-Ming Kam Surajit Ray April 23, 2001


Imputation Techniques Implemented in SOLAS 3.0

SINGLE IMPUTATION

  Hot Decking

  Predicted Mean Imputation

  Last Value Carried Forward

MULTIPLE IMPUTATIONS

  Propensity Score Based Imputation

  Predictive Model Based Imputation

Method 1: Propensity Score Based Imputation

This was the only method in Version 1.

The method is similar to that of Lavori, Dawson, and Shera (1995), "A multiple imputation strategy for clinical trials with truncation of patient data".

GOAL: To impute missing values with minimal distributional assumptions.

How it Works

Let R be the indicator for the missingness pattern (R = 0 or 1).

[Figure: data matrix with columns X1, X2, ..., XP, Y and R. Some entries of Y are missing (shown as "?"), and the R column holds 1s and 0s.]

Model R from X1, X2, ..., XP using logistic regression.

Compute p_i = Prob(R = 1 | X1, X2, ..., XP) for each case, yielding N values p_1, ..., p_N.
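The propensity-score step can be sketched in a few lines of Python. This is only an illustration, not the SOLAS implementation: the function names, the gradient-ascent fitting routine, the toy data and the coefficient 1.5 are all hypothetical choices made for this example.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, r, steps=300, lr=1.0):
    """Fit P(R = 1 | x) = sigmoid(b0 + b1*x1 + ... + bP*xP) by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, ri in zip(X, r):
            resid = ri - sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        # Step along the gradient of the mean log-likelihood
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """Return one fitted probability p_i per case."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi))) for xi in X]

# Hypothetical toy data: R depends on the first covariate only
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
r = [1 if rng.random() < sigmoid(1.5 * x1) else 0 for x1, _ in X]

beta = fit_logistic(X, r)
p = propensity_scores(X, beta)  # N values, one per case
```

In practice a statistics package would fit the logistic model; the hand-written fit is used here only to keep the sketch self-contained.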

How it works... (Approximate Bayesian bootstrap, Rubin, 1987)

Group the units (the grouping is user specified) by the quintiles of p.

Suppose that within a particular group there are n1 observed and n0 missing values.

Sample n1 + n0 units with replacement from the observed values.

From the sampled pool, subsample n0 units with replacement.

Use these n0 units as the imputed values for the n0 missing values.

Repeat the procedure m times to get m imputations.

[Figure: n1 observed values -> pool of n0 + n1 values (sampled with replacement) -> n0 imputed values (sampled with replacement).]
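A minimal Python sketch of the grouping and two-stage resampling described above, under stated assumptions. The function names, the use of None to mark a missing Y, the rank-based split into five groups and the toy data are all hypothetical; the pool size n1 + n0 follows the slide.

```python
import random

def quintile_groups(p):
    """Assign each case to one of five groups by the rank of its propensity score."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    group = [0] * len(p)
    for rank, i in enumerate(order):
        group[i] = rank * 5 // len(p)
    return group

def abb_impute(y, group, rng):
    """One imputation: y uses None for missing values; returns a completed copy."""
    completed = list(y)
    for g in set(group):
        observed = [y[i] for i in range(len(y)) if group[i] == g and y[i] is not None]
        missing = [i for i in range(len(y)) if group[i] == g and y[i] is None]
        if not observed or not missing:
            continue  # nothing to impute, or no donors available in this group
        # Stage 1: draw a pool of n1 + n0 values with replacement from the observed values
        pool = rng.choices(observed, k=len(observed) + len(missing))
        # Stage 2: draw n0 values with replacement from the pool
        for i, value in zip(missing, rng.choices(pool, k=len(missing))):
            completed[i] = value
    return completed

def multiple_abb(y, p, m=5, seed=0):
    """Repeat the procedure m times to get m completed copies of y."""
    rng = random.Random(seed)
    group = quintile_groups(p)
    return [abb_impute(y, group, rng) for _ in range(m)]

# Hypothetical toy data: 50 cases, every third Y missing
p_demo = [(i * 7 % 50) / 50 for i in range(50)]
y_demo = [None if i % 3 == 0 else float(i) for i in range(50)]
imputations = multiple_abb(y_demo, p_demo, m=5)
```

A group with missing values but no observed donors is left unimputed in this sketch; a real implementation would need a rule for that case.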

Theoretical Justification

It produces an imputed distribution of Y that has been corrected for biases due to missingness related to X.

It is similar in spirit to reweighting, but here we have a multiple imputation version of it.

The method produces unbiased estimates for the marginal distribution of Y.

Problems/Drawbacks

The method does not preserve the association between Y and the individual Xi's.

Reasoning: The only aspect of the Xi's that is used here is the linear predictor (β0 + β1 X1 + β2 X2 + ... + βp Xp) in the logistic model. This is the function that predicts the missingness of Y (R), not Y itself.

Problems/Drawbacks (Continued)

Suppose X1 is highly correlated with Y but is unrelated to P(R=1). X1 will drop out of the logistic model and will not be used in the imputation. As a result, the imputations will misrepresent the correlation of X1 and Y.

Also, by not using X1 in the imputation, we are failing to impute Y efficiently.

Simulation Results Using SOLAS 1.1

Data Generation Mechanism:

Y = X + Z + ε, where ε ~ N(0, 1)

Source: Paul D. Allison, "Multiple Imputation for Missing Data: A Cautionary Tale"
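A data-generating step of this kind might be sketched as follows. Several details are assumptions not stated on the slide: X and Z are drawn as independent standard normals, Z is taken to be the variable with missing values (suggested by the "dependent on Z" case being labelled nonignorable), and missingness follows a logistic function of the driving variable. The function name and mechanism labels are hypothetical.

```python
import math
import random

def simulate(n, mechanism, seed=0):
    """Generate (x, z, y) with y = x + z + e and set z to None under the given mechanism.

    mechanism: "mcar", "x", "y" or "z" (the variable that drives missingness).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, z, e = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        y = x + z + e
        driver = {"mcar": 0.0, "x": x, "y": y, "z": z}[mechanism]
        # Assumed logistic missingness: probability 0.5 when the driver is 0
        p_missing = 1.0 / (1.0 + math.exp(-driver))
        z_obs = None if rng.random() < p_missing else z
        rows.append((x, z_obs, y))
    return rows

rows_mcar = simulate(2000, "mcar", seed=1)
rows_x = simulate(2000, "x", seed=1)
```

Each completed or deleted dataset would then be analysed by regressing Y on X and Z, which is what the four columns of the table compare.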

Coefficient estimates (standard errors in parentheses):

| Missing data mechanism | Coefficient | (1) Ordinary least squares on original data | (2) Listwise deletion | (3) Multiple imputation with SOLAS | (4) Multiple imputation with data augmentation |
|---|---|---|---|---|---|
| Missing completely at random | X | 0.979 (.011) | 0.969 (.016) | 1.141 (.016) | 0.976 (.012) |
| | Z | 1.014 (.012) | 1.029 (.017) | 0.667 (.020) | 1.028 (.016) |
| Missing at random (dependent on X) | X | 1.012 (.012) | 0.986 (.025) | 1.470 (.013) | 1.005 (.025) |
| | Z | 1.007 (.012) | 1.011 (.017) | 0.448 (.015) | 0.997 (.016) |
| Missing at random (dependent on Y) | X | 0.993 (.012) | 0.695 (.015) | 1.350 (.013) | 0.985 (.021) |
| | Z | 1.001 (.012) | 0.708 (.015) | 0.746 (.023) | 0.997 (.013) |
| Nonignorable (dependent on Z) | X | 1.003 (.012) | 0.995 (.016) | 1.250 (.013) | 1.154 (.015) |
| | Z | 1.002 (.012) | 1.007 (.024) | 1.215 (.027) | 1.245 (.020) |

Some Comments About the Propensity Score Based Method

The method can provide valid but possibly inefficient inferences about Y (marginal).

The method can lead to very misleading inferences about the relationships between Y and other variables.

Method 2: Predictive Model Based Multiple Imputation

This method is implemented in SOLAS 2.0 and 3.0.

HOW IT WORKS:

Regress Y on X1, X2, ..., Xp.

Get the estimates of β0, β1, β2, ..., βp and σ.

Draw β0*, β1*, β2*, ..., βp*, σ* from an approximate posterior distribution.

Impute Y* = β0* + β1* X1 + β2* X2 + ... + βp* Xp + ε*, where ε* ~ Normal(0, σ*).

Repeat m times to get the m imputed datasets.
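The steps above can be sketched in Python for the special case of a single predictor. This is a sketch under stated assumptions, not the SOLAS code: it treats σ* as the residual standard deviation, uses the standard normal-theory posterior under a flat prior (σ*² drawn as RSS divided by a chi-square variate, then the coefficients drawn around their least-squares estimates), and all function names and toy data are hypothetical.

```python
import math
import random

def fit_ols(x, y):
    """Simple linear regression of y on x; returns (b0, b1, rss, sxx, xbar)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss, sxx, xbar

def impute_once(x, y, rng):
    """y uses None for missing values; returns one completed copy of y."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xo, yo = [a for a, _ in obs], [b for _, b in obs]
    n = len(obs)
    b0, b1, rss, sxx, xbar = fit_ols(xo, yo)
    # sigma*^2 = RSS / chi-square(n - 2); Gamma(k/2, scale 2) is chi-square(k)
    sigma_star = math.sqrt(rss / rng.gammavariate((n - 2) / 2.0, 2.0))
    # Draw the slope and the fitted mean at xbar; these are independent given sigma*
    b1_star = b1 + sigma_star * rng.gauss(0, 1) / math.sqrt(sxx)
    mean_star = (b0 + b1 * xbar) + sigma_star * rng.gauss(0, 1) / math.sqrt(n)
    b0_star = mean_star - b1_star * xbar
    # Impute Y* = b0* + b1* x + e*, with e* drawn using standard deviation sigma*
    return [yi if yi is not None else b0_star + b1_star * xi + rng.gauss(0, sigma_star)
            for xi, yi in zip(x, y)]

def multiple_impute(x, y, m=5, seed=0):
    """Repeat m times to get m completed copies of y."""
    rng = random.Random(seed)
    return [impute_once(x, y, rng) for _ in range(m)]

# Hypothetical toy data: y = 2 + 3x + noise, every fourth y missing
rng_demo = random.Random(2)
x_demo = [rng_demo.gauss(0, 1) for _ in range(200)]
y_full = [2.0 + 3.0 * xi + rng_demo.gauss(0, 0.5) for xi in x_demo]
y_demo = [None if i % 4 == 0 else yi for i, yi in enumerate(y_full)]
completed = multiple_impute(x_demo, y_demo, m=5)
```

Drawing new parameters for every imputation, rather than reusing the least-squares estimates, is what lets the m datasets reflect uncertainty about the regression itself.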

Good points

The method provides correct model-based MI under the regression model and MAR.

It also preserves the correlation between the Xi's and Y.

What is the difference from NORM?

NORM does the same thing with MCMC.

Under the multivariate normal model, both methods give the same results.

Which Software is More General?

"I work for arbitrary missingness patterns."

"I work for non-linear relations of Y on X."

"But that's probably very similar to NORM with rounding."

Concluding Remarks

SOLAS is the first commercial missing data software.

It has a good graphical interface.

It offers easy data import from and export to other software.

It performs well under monotone missingness patterns.

Estimates are not always unbiased.
