Multiple Imputation for Missing Data
Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Washington University in St. Louis
publichealth.wustl.edu
The Institute for Public Health @ Washington University
Outline
• Missing data mechanisms
• What is Multiple Imputation?
• Software Options
  – SAS, Stata, IVEware, R, SPSS
  – Compare/Contrast software options
• Working example
• Imputation issues and problems
Missing data mechanisms
• Missing Completely At Random (MCAR)
  – The probability of missingness doesn't depend on anything.
• Missing At Random (MAR)
  – The probability of missingness does not depend on the unobserved value of the missing variable, but it can depend on any of the other variables in your dataset
• Not Missing At Random (NMAR)
  – The probability of missingness depends on the unobserved value of the missing variable itself
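The three mechanisms can be illustrated with a small simulation. This is only a sketch under invented assumptions: the variables (age, income), the coefficients, and the missingness probabilities are hypothetical and do not come from the presentation. It compares the mean income among observed cases with the true mean under each mechanism.

```python
import math
import random
import statistics


def observed_means(n=10_000, seed=0):
    """Mean of the observed incomes under MCAR, MAR and NMAR (toy data)."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 80) for _ in range(n)]
    # Hypothetical relationship: income rises with age, plus noise.
    incomes = [20 + 0.5 * age + rng.gauss(0, 10) for age in ages]

    def logistic(z):
        return 1 / (1 + math.exp(-z))

    # True in a mask means "income is missing for this person".
    masks = {
        # MCAR: a constant 30% chance, unrelated to anything.
        "MCAR": [rng.random() < 0.3 for _ in range(n)],
        # MAR: depends only on age, which is always observed.
        "MAR": [rng.random() < logistic((age - 50) / 10) for age in ages],
        # NMAR: depends on the income value that goes missing.
        "NMAR": [rng.random() < logistic((inc - 45) / 10) for inc in incomes],
    }
    means = {"full": statistics.fmean(incomes)}
    for name, mask in masks.items():
        means[name] = statistics.fmean(
            inc for inc, miss in zip(incomes, mask) if not miss
        )
    return means


means = observed_means()
for name, value in means.items():
    print(f"{name:5s} mean income among observed cases: {value:.2f}")
```

In this toy setup the MCAR mean should sit close to the full-data mean, while the MAR and NMAR means are pulled downward. Under MAR the distortion can in principle be corrected by conditioning on age; under NMAR it cannot be corrected from the observed data alone.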
What is Imputation?
• Quite simply, the process of replacing missing data with substituted values
Source http://en.wikipedia.org/wiki/Imputation_(statistics)
• Other methods include hot/cold deck replacement and mean imputation
• These methods were rather crude and often limited by computer processing power
• Most common ‘method’ is deletion
Missing Data?! Just get rid of ‘em!
• Even today, the most common approach to dealing with missing data is deletion
Source http://en.wikipedia.org/wiki/Imputation_(statistics)
• Researchers may often be unaware of the impact of listwise deletion on sample size and bias
• Removing cases always decreases statistical power and, unless data are missing completely at random, it also introduces bias
Impact of Listwise/Pairwise Deletion
Consider this example using regression analysis. What predicts cancer in this dataset?
Cancer = CancerHx                                      n = 500
Cancer = CancerHx + Race                               n = 483
Cancer = CancerHx + Race + Gender                      n = 437
Cancer = CancerHx + Race + Gender + Education + Income n = 405
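The shrinking n above is what listwise deletion does as predictors are added: a case is dropped as soon as any variable in the model is missing. A minimal sketch, using a made-up five-row dataset (the rows and the helper name `complete_case_count` are hypothetical, not the presenter's data):

```python
def complete_case_count(rows, variables):
    """Count rows that have a non-missing value for every listed variable."""
    return sum(all(row.get(v) is not None for v in variables) for row in rows)


# Hypothetical toy records; None marks a missing value.
rows = [
    {"cancer": 1, "cancer_hx": 1, "race": "A", "gender": "F"},
    {"cancer": 0, "cancer_hx": 0, "race": None, "gender": "M"},
    {"cancer": 0, "cancer_hx": 1, "race": "B", "gender": None},
    {"cancer": 1, "cancer_hx": 0, "race": "A", "gender": "M"},
    {"cancer": 0, "cancer_hx": 0, "race": None, "gender": None},
]

model = ["cancer", "cancer_hx"]
counts = [complete_case_count(rows, model)]
for extra in ("race", "gender"):
    model.append(extra)
    counts.append(complete_case_count(rows, model))

print(counts)  # usable n for each successively larger model
```

Running a check like this before fitting each model makes the loss of sample size visible rather than leaving it buried in the regression output.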
Why Impute?
[Figure: sources of bias: confounding, selection, recall, missing data, attrition, observer]
What is Multiple Imputation?
• Multiple imputation uses common statistical techniques to generate multiple imputed (complete) datasets
• Rubin (1987) laid out this approach and a method for combining point and variance estimates
• A naïve or poorly executed imputation can introduce more bias than no MI at all
Multiple Imputation on the rise
Source: www.multiple-imputation.com
Imputation: Multiple Complete Copies of the Dataset
Original Data
Y      X1     X2   X3
44.61  11.37  178  1
54.3   8.65   156  0
49.87  9.22   .    .
.      11.95  176  1
39.44  13.08  174  1
50.54  .      .    1
44.75  11.12  176  0
51.86  10.33  166  0
40.84  10.95  168  .
46.77  10.25  .    .

Implicate 1
_I_  Y      X1     X2     X3
1    44.61  11.37  178    1
1    54.3   8.65   156    0
1    49.87  9.22   181.2  0.23
1    39.97  11.95  176    1
1    39.44  13.08  174    1
1    50.54  9.117  168.2  1
1    44.75  11.12  176    0
1    51.86  10.33  166    0
1    40.84  10.95  168    0.756
1    46.77  10.25  185.9  0.632

Implicate 2
_I_  Y       X1      X2      X3
2    44.609  11.37   178     1
2    54.297  8.65    156     0
2    49.874  9.22    137.47  0.0666
2    39.849  11.95   176     1
2    39.442  13.08   174     1
2    50.541  9.9192  162.67  1
2    44.754  11.12   176     0
2    51.855  10.33   166     0
2    40.836  10.95   168     0.2288
2    46.774  10.25   184.83  0.0998
Three basic steps
1. Imputation
   • Make M=2 to 50 copies (implicates) of original data set filling in with conditionally random values
2. Analyses
   • Of each data set separately
3. Pooling
   • Point estimates. Average across M analyses
   • Standard errors & Confidence Intervals. Combine variances.
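The pooling step can be sketched with the standard combining rules attributed to Rubin (1987): the pooled estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `pool_rubin` and the example numbers are illustrative assumptions, not output from any package.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors.

    estimates: one point estimate per implicate
    variances: the squared standard error of each estimate
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between       # total variance
    return q_bar, total ** 0.5


# Hypothetical regression coefficient estimated on M = 3 implicates.
pooled_estimate, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled_estimate, pooled_se)
```

Confidence intervals additionally require a t reference distribution with adjusted degrees of freedom, which this sketch leaves out.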
Before you begin…
Know your data! Check for skip patterns and other issues that could allow data to be imputed that shouldn’t exist in the first place
Ensure all missing data is <null> or represented by a period. Alpha missing value codes may not get imputed.
If working with multiple discrete groups of observations, consider imputing separately and combine afterward.
Create some variables before imputation, for example mutually exclusive binary variables for a single construct (race).
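One way to make sure every missing value is represented consistently before imputing is to map all known missing-value codes to a single marker. The sketch below is hypothetical: the list of codes is an assumption and should be replaced with whatever the dataset's codebook actually uses.

```python
# Hypothetical missing-value codes; take the real ones from the codebook.
MISSING_CODES = {"", ".", "NA", "N/A", "<NULL>", "REFUSED", "DK"}


def normalize_missing(value):
    """Return None for any recognised missing-value code, else the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().upper() in MISSING_CODES:
        return None
    return value


raw = ["12.5", " na ", ".", 0, "Refused", None]
cleaned = [normalize_missing(v) for v in raw]
print(cleaned)
```

Note that a legitimate zero is kept as a value; only the listed codes are treated as missing.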
MI software comparison
• Stata – based on each conditional density – chained equations
• SAS – joint distribution of all the variables – assumed multivariate normal distribution
• IVEware (SAS-callable or standalone) – same as Stata, more options for complex survey data.
• R – Multiple packages (mi, Amelia, MICE, etc.)
• SPSS (ver. 17 or greater) – offers MI, but only through the add-on Missing Values module
A bit more on IVEware…
• Why yet another software package when your existing software may offer MI?
First, some background…
A bit more on IVEware…
• MI in IVEware works by using a sequential regression technique to ‘predict’ missing values
• To date, SAS, Stata & R require the user to specify the regression model and distribution type…
Example: SAS vs IVEware
• To date, in SAS, must specify the regression model and distribution type for each variable to be imputed.
• But how does one know which variables should be used to predict missing values?
• IVEware picks the best predictors for each variable
IVEware: IMPUTE command
IMPUTE – Multiple options; straightforward setup
IMPUTE produces imputed values on a variable-by-variable basis for each individual in the data set conditional on all the values observed for that individual.
Support for five data types with specific regression models for each:
(1) Continuous (linear)
(2) Binary (logistic)
(3) Categorical (polytomous with more than two categories)
(4) Count (Poisson)
(5) Mixed (a continuous variable with a non-zero probability mass at zero, generalized logit or mixed logistic/linear)
Imputations are created through a sequence of multiple regressions, varying the type of regression model by the type of variable being imputed.
IVEware: IMPUTE command (continued)
The sequence of imputing missing values can be continued in a cyclical manner, each time overwriting previously drawn values, building interdependence among imputed values and exploiting the correlational structure among covariates.
Covariates include all other variables observed or imputed for that individual.
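The cyclical, variable-by-variable idea can be sketched in a few lines. This is not IVEware's implementation: it is a simplified illustration that assumes two continuous variables, uses only simple linear regression, and adds random residual noise without also drawing the regression parameters, which a full sequential regression procedure would do. The helper names are hypothetical; the example values are the X1 and X2 columns of the original data shown earlier.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / max(len(xs) - 2, 1)) ** 0.5


def chained_impute(rows, cycles=5, seed=0):
    """Return a completed copy of rows ([a, b] pairs, None = missing)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Starting values: fill each gap with the observed column mean.
    for j in range(2):
        col_mean = statistics.fmean(r[j] for r in rows if r[j] is not None)
        for r in data:
            if r[j] is None:
                r[j] = col_mean
    for _ in range(cycles):
        for j in range(2):          # variable being imputed
            k = 1 - j               # the other variable acts as predictor
            # Fit on cases where variable j was observed, using the current
            # (observed or previously imputed) values of the predictor.
            train = [(r[k], r[j]) for r, m in zip(data, missing) if not m[j]]
            a, b, sd = fit_line([t[0] for t in train], [t[1] for t in train])
            # Overwrite earlier draws with new prediction-plus-noise draws.
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data


x1_x2 = [[11.37, 178], [8.65, 156], [9.22, None], [11.95, 176], [13.08, 174],
         [None, None], [11.12, 176], [10.33, 166], [10.95, 168], [10.25, None]]
implicates = [chained_impute(x1_x2, seed=s) for s in (1, 2)]
print(implicates[0][2], implicates[1][2])
```

Calling the function with different seeds produces different implicates: observed values stay fixed, and only the imputed cells vary from copy to copy.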
Sample imputation syntax
Sample imputation output
NUMCIG    Observed   Imputed   Combined
Number    454        444       898
Minimum   0          0         0
Maximum   98         65.0899   98
Mean      22.7004    4.89177   13.8953
Std Dev   14.3406    11.1498   15.6404

Diabetes  Observed        Imputed         Combined
Code      Freq   Per      Freq   Per      Freq   Per
0         831    92.95    4      100.00   835    92.98
1         63     7.05     0      0.00     63     7.02
Total     894    100.00   4      100.00   898    100.00
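A quick sanity check on output like this is to confirm that the combined mean is the count-weighted average of the observed and imputed means. The sketch below uses the NUMCIG figures from the output above; the function name is hypothetical.

```python
def combined_mean(n_observed, mean_observed, n_imputed, mean_imputed):
    """Count-weighted mean of the observed and imputed groups."""
    total = n_observed * mean_observed + n_imputed * mean_imputed
    return total / (n_observed + n_imputed)


numcig_combined = combined_mean(454, 22.7004, 444, 4.89177)
print(round(numcig_combined, 4))
```

The result agrees with the reported combined mean of 13.8953 to within rounding. A large gap between the observed and imputed means, as seen here, is worth reviewing against skip patterns in the questionnaire before the imputations are used.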
SAS IVEware: 4 Components
1. IMPUTE -- nice options.
2. DESCRIBE estimates the population means, proportions, subgroup differences, contrasts and linear combinations of means and proportions. A Taylor Series approach is used to obtain variance estimates appropriate for a user specified complex sample design.
3. REGRESS fits linear, logistic, polytomous, Poisson, Tobit and proportional hazard regression models for data resulting from a complex sample design.
4. SASMOD allows users to take into account complex sample design features when analyzing data with several SAS procedures. SAS PROCS can be called: CALIS, CATMOD, GENMOD, LIFEREG, MIXED, NLIN, PHREG, and PROBIT.
A Few Issues
• Can I impute the dependent variable?
• Is there an upper limit to the amount of missing data to be imputed?
• How many implicates do I need?
• Can I impute in one language and analyze in another?
• How do I get summary statistics such as R squared?
References
General
http://sites.stat.psu.edu/~jls/mifaq.html
Stata (Windows & MacOS)
http://www.stata.com/capabilities/multiple-imputation/
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm
SAS (Windows)
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_mi_sect038.htm
IVEware (Windows / MacOS / Linux)
http://www.isr.umich.edu/src/smp/ive/
R (Windows / MacOS / Linux)
http://www.stat.ucla.edu/~yajima/Publication/mipaper.rev04.pdf
http://www.stat.columbia.edu/~gelman/arm/missing.pdf