Multiple Imputation for Missing Datafiles.meetup.com/4325882/CooperB_MI_MeetupTalk1505.pdf · Imputation: Multiple Complete Copies of the Dataset Y X1 X2 X3 44.61 11.37 178 1 54.3

Multiple Imputation for Missing Data

Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Washington University in St. Louis

publichealth.wustl.edu

The Institute for Public Health @ Washington University

Outline

•  Missing data mechanisms
•  What is Multiple Imputation?
•  Software Options
–  SAS, Stata, IVEware, R, SPSS
–  Compare/Contrast software options
•  Working example
•  Imputation issues and problems

Missing data mechanisms

•  Missing Completely At Random (MCAR)
–  The probability of missingness does not depend on any values, observed or unobserved.
•  Missing At Random (MAR)
–  The probability of missingness does not depend on the unobserved value of the missing variable, but it can depend on any of the other variables in your dataset.
•  Not Missing At Random (NMAR)
–  The probability of missingness depends on the unobserved value of the missing variable itself.
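The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: the variables (age, income), the missingness probabilities, and the cut-offs are all hypothetical and chosen purely for illustration.

```python
import random
import statistics

def simulate_missingness(n=5000, seed=1505):
    """Return the mean of the observed incomes under MCAR, MAR and NMAR."""
    rng = random.Random(seed)
    # Hypothetical data: age is always observed, income may go missing.
    ages = [rng.uniform(20, 70) for _ in range(n)]
    incomes = [20 + 0.8 * age + rng.gauss(0, 10) for age in ages]

    def observed_mean(p_missing):
        # Keep a value when a uniform draw is at least its missingness probability.
        kept = [inc for age, inc in zip(ages, incomes)
                if rng.random() >= p_missing(age, inc)]
        return statistics.mean(kept)

    return {
        "true": statistics.mean(incomes),
        # MCAR: the same probability for every record.
        "mcar": observed_mean(lambda age, inc: 0.3),
        # MAR: depends only on an observed variable (age).
        "mar": observed_mean(lambda age, inc: 0.6 if age > 45 else 0.1),
        # NMAR: depends on the unobserved income value itself.
        "nmar": observed_mean(lambda age, inc: 0.6 if inc > 56 else 0.1),
    }

means = simulate_missingness()
for name, value in means.items():
    print(f"{name:>5}: {value:.2f}")
```

In this setup the mean of the observed incomes stays close to the true mean only under MCAR. Under MAR the shift can in principle be corrected using age, because age is observed for everyone; under NMAR the observed data alone do not carry enough information to correct it.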

What is Imputation?

•  Quite simply, the process of replacing missing data with substituted values

Source http://en.wikipedia.org/wiki/Imputation_(statistics)

•  Other methods include hot/cold deck replacement and mean imputation

•  These methods were rather crude and often limited by computer processing power

•  Most common ‘method’ is deletion

Missing Data?! Just get rid of ‘em!

•  Even today, the most common approach to dealing with missing data is deletion

Source http://en.wikipedia.org/wiki/Imputation_(statistics)

•  Researchers may often be unaware of the impact of listwise deletion on sample size and bias

•  Removing cases always decreases statistical power, and unless data are missing completely at random it also introduces bias

Impact of Listwise/Pairwise Deletion

Consider this example using regression analysis. What predicts cancer in this dataset?

Cancer = CancerHx                                          n = 500
Cancer = CancerHx + Race                                   n = 483
Cancer = CancerHx + Race + Gender                          n = 437
Cancer = CancerHx + Race + Gender + Education + Income     n = 405
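The shrinking n comes from listwise deletion: a record is dropped as soon as any variable in the model is missing. A minimal sketch with five made-up records (the variable names and values are hypothetical, not the dataset from the talk):

```python
# Hypothetical records; None marks a missing value.
records = [
    {"cancer": 1, "cancer_hx": 1, "race": "B", "gender": "F", "education": 12},
    {"cancer": 0, "cancer_hx": 0, "race": None, "gender": "M", "education": 16},
    {"cancer": 0, "cancer_hx": 1, "race": "W", "gender": None, "education": 14},
    {"cancer": 1, "cancer_hx": 0, "race": "W", "gender": "F", "education": None},
    {"cancer": 0, "cancer_hx": 0, "race": "B", "gender": "M", "education": 12},
]

def complete_cases(rows, variables):
    """Keep rows with no missing value in any of the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

models = [
    ["cancer", "cancer_hx"],
    ["cancer", "cancer_hx", "race"],
    ["cancer", "cancer_hx", "race", "gender"],
    ["cancer", "cancer_hx", "race", "gender", "education"],
]
sizes = [len(complete_cases(records, m)) for m in models]
for m, n in zip(models, sizes):
    print(f"cancer = {' + '.join(m[1:]):<45} n = {n}")
```

Each added predictor can only keep or reduce the number of usable records, which is why larger models end up fitted on smaller samples.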

Why Impute?

BIAS: Confounding, Selection, Recall, Missing Data, Attrition, Observer

What is Multiple Imputation?

•  Multiple imputation uses common statistical techniques to generate multiple imputed (complete) datasets

•  Rubin (1987) laid out this approach and a method for combining point and variance estimates

•  A naïve or poorly executed imputation can introduce more bias than no MI at all

Multiple Imputation on the rise

Source: www.multiple-imputation.com

Imputation: Multiple Complete Copies of the Dataset

Original Data

Y        X1       X2       X3
44.61    11.37    178      1
54.3     8.65     156      0
49.87    9.22     .        .
.        11.95    176      1
39.44    13.08    174      1
50.54    .        .        1
44.75    11.12    176      0
51.86    10.33    166      0
40.84    10.95    168      .
46.77    10.25    .        .

Implicate 1

_I_   Y        X1       X2       X3
1     44.61    11.37    178      1
1     54.3     8.65     156      0
1     49.87    9.22     181.2    0.23
1     39.97    11.95    176      1
1     39.44    13.08    174      1
1     50.54    9.117    168.2    1
1     44.75    11.12    176      0
1     51.86    10.33    166      0
1     40.84    10.95    168      0.756
1     46.77    10.25    185.9    0.632

Implicate 2

_I_   Y        X1       X2       X3
2     44.609   11.37    178      1
2     54.297   8.65     156      0
2     49.874   9.22     137.47   0.0666
2     39.849   11.95    176      1
2     39.442   13.08    174      1
2     50.541   9.9192   162.67   1
2     44.754   11.12    176      0
2     51.855   10.33    166      0
2     40.836   10.95    168      0.2288
2     46.774   10.25    184.83   0.0998

Three basic steps

1.  Imputation
•  Make M = 2 to 50 copies (implicates) of the original data set, filling in missing values with conditionally random values
2.  Analyses
•  Analyze each data set separately
3.  Pooling
•  Point estimates: average across the M analyses
•  Standard errors & confidence intervals: combine variances
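The pooling step can be sketched with the standard combining rules attributed to Rubin (1987): the pooled estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name and the coefficient values below are hypothetical, and the degrees of freedom needed for a confidence interval are left out to keep the sketch short.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # pooled point estimate
    w = statistics.mean(variances)          # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (M - 1 divisor)
    t = w + (1 + 1 / m) * b                 # total variance
    return q_bar, math.sqrt(t)

# Hypothetical coefficient and standard error from M = 5 implicates.
coefs = [0.52, 0.47, 0.55, 0.50, 0.46]
ses = [0.10, 0.11, 0.10, 0.12, 0.11]
estimate, se = pool_rubin(coefs, [s ** 2 for s in ses])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {se:.3f}")
```

The pooled standard error is larger than the typical within-imputation standard error whenever the estimates differ across implicates; that extra term is how the uncertainty due to the missing data enters the result.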

Before you begin…

Know your data! Check for skip patterns and other issues that could allow data to be imputed that shouldn’t exist in the first place

Ensure all missing data is <null> or represented by a period. Alpha missing value codes may not get imputed.

If working with multiple discrete groups of observations, consider imputing each group separately and combining them afterward.

Create some variables before imputation. Example: mutually exclusive binary variables for one construct (race).
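Two of these preparation steps can be sketched in a few lines: mapping alpha missing-value codes to a true null, and building mutually exclusive binary variables for one construct. The codes in `MISSING_CODES`, the race categories, and both function names are assumptions made for illustration; the actual codes should come from the codebook of the dataset in use.

```python
# Assumed codes that should count as missing; adjust to the codebook in use.
MISSING_CODES = {"", ".", "NA", "DK", "REFUSED"}

def clean_value(value):
    """Map alpha missing-value codes to None so the imputation step can see them."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.upper() in MISSING_CODES else text

def race_indicators(race, categories=("black", "white", "other")):
    """Mutually exclusive 0/1 variables for one construct; all None if race is missing."""
    if race is None:
        return {f"race_{c}": None for c in categories}
    return {f"race_{c}": int(race == c) for c in categories}

raw = ["white", "DK", "black", " na ", "other"]
cleaned = [clean_value(v) for v in raw]
dummies = [race_indicators(v) for v in cleaned]
print(cleaned)
print(dummies[0])
```

Keeping the indicators missing when the source value is missing avoids silently coding an unknown race as "none of the above".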

MI software comparison

•  Stata
–  based on each conditional density
–  chained equations
•  SAS
–  joint distribution of all the variables
–  assumed multivariate normal distribution
•  IVEware (SAS-callable or standalone)
–  same as Stata, with more options for complex survey data
•  R
–  Multiple packages (mi, Amelia, MICE, etc.)
•  SPSS (ver. 17 or greater)
–  offers MI, but only through the add-on Missing Values module

A bit more on IVEware…

•  Why yet another software package when your existing software may offer MI?

First, some background…

A bit more on IVEware…

•  MI in IVEware works by using a sequential regression technique to ‘predict’ missing values

•  To date, SAS, Stata & R require the user to specify the regression model and distribution type…

Example: SAS vs IVEware

•  To date, SAS requires the user to specify the regression model and distribution type for each variable to be imputed.

•  But how does one know which variables should be used to predict missing values?

•  IVEware picks the best predictors for each variable

IVEware: IMPUTE command

IMPUTE – Multiple options; straightforward setup

IMPUTE produces imputed values on a variable-by-variable basis for each individual in the data set conditional on all the values observed for that individual.

Support for five data types, with a specific regression model for each:
(1) Continuous (linear)
(2) Binary (logistic)
(3) Categorical (polytomous with more than two categories)
(4) Count (Poisson)
(5) Mixed (a continuous variable with a non-zero probability mass at zero, generalized logit or mixed logistic/linear)

Imputations are created through a sequence of multiple regressions, varying the type of regression model by the type of variable being imputed.

IVEware: IMPUTE command (continued)

The sequence of imputing missing values can be continued in a cyclical manner, each time overwriting previously drawn values, building interdependence among imputed values and exploiting the correlational structure among covariates.

Covariates include all other variables observed or imputed for that individual.
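The cyclical idea can be sketched for the simplest case of two continuous variables. This is not IVEware's implementation: it handles only the linear model, uses a single predictor, and adds residual noise to the fitted value without also drawing the regression parameters, which a proper MI procedure would do. The function names are hypothetical; the example rows are the X1 and X2 columns of the original data shown earlier.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def sequential_impute(rows, cycles=10, seed=0):
    """rows: list of [x1, x2] with None for missing. Returns one completed copy."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    # Start from column means so every regression has complete predictors.
    col_means = [statistics.mean(r[j] for r in rows if r[j] is not None)
                 for j in range(2)]
    data = [[col_means[j] if row[j] is None else row[j] for j in range(2)]
            for row in rows]
    for _ in range(cycles):
        for target in range(2):                 # impute variable by variable
            other = 1 - target
            obs = [r for r, m in zip(data, missing) if not m[target]]
            a, b, sd = fit_line([r[other] for r in obs],
                                [r[target] for r in obs])
            for r, m in zip(data, missing):
                if m[target]:
                    # Overwrite the previous draw: prediction plus random noise.
                    r[target] = a + b * r[other] + rng.gauss(0, sd)
    return data

# X1 and X2 from the original data; None marks a missing value.
rows = [[11.37, 178], [8.65, 156], [9.22, None], [11.95, 176], [13.08, 174],
        [None, None], [11.12, 176], [10.33, 166], [10.95, 168], [10.25, None]]
implicates = [sequential_impute(rows, seed=s) for s in (1, 2)]
for i, completed in enumerate(implicates, start=1):
    print(i, [round(v, 2) for v in completed[2]])
```

Observed values are never changed; only the cells that were missing are redrawn on each cycle, and running the procedure with different seeds gives different implicates.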

Sample imputation syntax

Sample imputation output

NUMCIG      Observed    Imputed     Combined
Number      454         444         898
Minimum     0           0           0
Maximum     98          65.0899     98
Mean        22.7004     4.89177     13.8953
Std Dev     14.3406     11.1498     15.6404

Diabetes    Observed          Imputed           Combined
Code        Freq    Per       Freq    Per       Freq    Per
0           831     92.95     4       100.00    835     92.98
1           63      7.05      0       0.00      63      7.02
Total       894     100.00    4       100.00    898     100.00

SAS IVEware: 4 Components

1. IMPUTE: nice options.

2. DESCRIBE estimates the population means, proportions, subgroup differences, contrasts and linear combinations of means and proportions. A Taylor Series approach is used to obtain variance estimates appropriate for a user-specified complex sample design.

3. REGRESS fits linear, logistic, polytomous, Poisson, Tobit and proportional hazard regression models for data resulting from a complex sample design.

4. SASMOD allows users to take into account complex sample design features when analyzing data with several SAS procedures. The SAS PROCs that can be called are CALIS, CATMOD, GENMOD, LIFEREG, MIXED, NLIN, PHREG, and PROBIT.

A Few Issues

•  Can I impute the dependent variable?
•  Is there an upper limit to the amount of missing data to be imputed?
•  How many implicates do I need?
•  Can I impute in one language and analyze in another?
•  How do I get summary statistics such as R squared?

References

General
http://sites.stat.psu.edu/~jls/mifaq.html

Stata (Windows & MacOS)
http://www.stata.com/capabilities/multiple-imputation/
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm

SAS (Windows)
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_mi_sect038.htm

IVEware (Windows / MacOS / Linux)
http://www.isr.umich.edu/src/smp/ive/

R (Windows / MacOS / Linux)
http://www.stat.ucla.edu/~yajima/Publication/mipaper.rev04.pdf
http://www.stat.columbia.edu/~gelman/arm/missing.pdf
