
4 Missing Data

Paul D. Allison

INTRODUCTION

Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all) cases. If data are missing on a variable for all cases, then that variable is said to be latent or unobserved. On the other hand, if data are missing on all variables for some cases, we have what is known as unit non-response, as opposed to item non-response, which is another name for the subject of this chapter. I will not deal with methods for latent variables or unit non-response here, although some of the methods we will consider can be adapted to those situations.

Why are missing data a problem? Because conventional statistical methods and software presume that all variables in a specified model are measured for all cases. The default method for virtually all statistical software is simply to delete cases with any missing data on the variables of interest, a method known as listwise deletion or complete case analysis. The most obvious drawback of listwise deletion is that it often deletes a large fraction of the sample, leading to a severe loss of statistical power. Researchers are understandably reluctant to discard data that they have spent a great deal of time, money and effort in collecting. And so, many methods for ‘salvaging’ the cases with missing data have become popular.

For a very long time, however, missing data could be described as the ‘dirty little secret’ of statistics. Although nearly everyone had missing data and needed to deal with the problem in some way, there was almost nothing in the textbook literature to provide either theoretical or practical guidance. Although the reasons for this reticence are unclear, I suspect it was because none of the popular methods had any solid mathematical foundation. As it turns out, all of the most commonly used methods for handling missing data have serious deficiencies.

Fortunately, the situation has changed dramatically in recent years. There are now two broad approaches to missing data that have excellent statistical properties if the specified assumptions are met: maximum likelihood and multiple imputation. Both of these methods have been around in one form or another for at least 30 years, but it is only in the last decade that they have become fully developed and incorporated into widely-available and easily-used software. A third method, inverse probability weighting (Robins and Rotnitzky, 1995; Robins et al., 1995; Scharfstein et al., 1999), also shows much promise for handling missing data but has not yet reached the maturity of the other two methods. In this chapter, I review the strengths and weaknesses of conventional missing data methods but focus the bulk of my attention on maximum likelihood and multiple imputation.

MISSING COMPLETELY AT RANDOM

Before beginning an examination of specific methods, it is essential to spend some time discussing possible assumptions. No method for handling missing data can be expected to perform well unless there are some restrictions on how the data came to be missing. Unfortunately, the assumptions that are necessary to justify a missing-data method are typically rather strong and often untestable. The strongest assumption that is commonly made is that the data are missing completely at random (MCAR). This assumption is most easily explained for the situation in which there is only a single variable with missing data, which we will denote by Z. Suppose we have another set of variables (represented by the vector X) which is always observed. Let R_Z be an indicator (dummy) variable having a value of 1 if Z is missing and 0 if Z is observed. The MCAR assumption can then be expressed by the statement:

Pr(R_Z = 1 | X, Z) = Pr(R_Z = 1)

That is, the probability that Z is missing depends neither on the observed variables X nor on the possibly missing values of Z itself.

A common question is: what variables have to be in X in order for the MCAR assumption to be satisfied? All variables in the data set? All possible variables, whether in the data set or not? The answer is: only the variables in the model to be estimated. If you are estimating a multiple regression and Z is one of the predictors, then the vector X must include all the other variables in the model. But missingness on Z could depend on some other variable (whether in the data set or not), and it would not be a violation of MCAR. On the other hand, if you are merely estimating the mean of Z, then all that is necessary for MCAR is Pr(R_Z = 1 | Z) = Pr(R_Z = 1). That is, missingness on Z does not depend on Z itself.

When more than one variable in the model of interest has missing data, the statement of MCAR is a bit more technical and will not be given here (see Rubin, 1976). But the basic idea is the same: the probability that any variable is missing cannot depend on any other variable in the model of interest, or on the potentially missing values themselves. It is important to note, however, that the probability that one variable is missing can depend on whether or not another variable is missing, without violating MCAR. In the extreme, two or more variables may always be missing together or observed together. This is actually quite common. It typically occurs when data sets are pieced together from multiple sources, for example, administrative records and personal interviews. If the subject declines to be interviewed, all the responses in that interview will be jointly missing.

For most data sets, the MCAR assumption is unlikely to be precisely satisfied. One situation in which the assumption is likely to be satisfied is when data are missing by design (Graham et al., 1996). For example, a researcher may decide that a brain scan is just too costly to administer to everyone in her study. Instead, she does the scan for only a 25% random subsample. For the remaining 75%, the brain-scan data are MCAR.

Can the MCAR assumption be tested? Well, it is easy to test whether missingness on Z depends on X. The simplest approach is to test for differences in the means of the X variables between those who responded to Z and those who did not, a strategy that has been popular for years. A more comprehensive approach is to do a logistic regression of R_Z on X. Significant coefficients, either singly or jointly, would indicate a violation of MCAR. On the other hand, it is not possible to test whether missingness on Z depends on Z itself (conditional on X). That would require knowledge of the missing values.
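
To make the logistic-regression check concrete, here is a minimal Python sketch (the chapter's own examples use SAS and Stata; the data frame and variable names here are hypothetical, and the missingness is generated completely at random, so the test should not reject):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: fully observed x1 and x2, plus a variable z
# whose values are deleted completely at random.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["z"] = rng.normal(size=500)
df.loc[rng.random(500) < 0.3, "z"] = np.nan

# Indicator R_Z: 1 if z is missing, 0 if z is observed.
r_z = df["z"].isna().astype(int)

# Logistic regression of R_Z on X; significant coefficients, singly
# or jointly, would indicate a violation of MCAR.
X = sm.add_constant(df[["x1", "x2"]])
fit = sm.Logit(r_z, X).fit(disp=0)
print(fit.summary())
print("joint LR test p-value:", fit.llr_pvalue)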


MISSING AT RANDOM

A much weaker (but still strong) assumption is that the data are missing at random (MAR). Again, let us consider the special case in which only a single variable Z has missing data, and there is a vector of variables X that is always observed. The MAR assumption may be stated as:

Pr(R_Z = 1 | X, Z) = Pr(R_Z = 1 | X)

This equation says that missingness on Z may depend on X, but it does not depend on Z itself (after adjusting for X). For example, suppose that missingness on a response variable depends on whether a subject is assigned to the treatment or the control group, with a higher fraction missing in the treatment group. But within each group, suppose that missingness does not depend on the value of the response variable. Then, the MAR assumption is satisfied. Note that MCAR is a special case of MAR. That is, if the data are MCAR, they are also MAR.
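
A small illustration of such a mechanism (hypothetical Python code, not from the chapter) generates data that are MAR but not MCAR in exactly this way:

import numpy as np

rng = np.random.default_rng(1)
n = 1000
treat = rng.integers(0, 2, size=n)          # 1 = treatment, 0 = control
y = 2.0 + 1.0 * treat + rng.normal(size=n)  # response variable

# Missingness depends only on the observed group indicator:
# 40% missing in the treatment group, 10% in the control group.
p_miss = np.where(treat == 1, 0.40, 0.10)
y_obs = np.where(rng.random(n) < p_miss, np.nan, y)
# Within each group, Pr(missing) does not depend on y itself,
# so y is MAR but not MCAR.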

As with MCAR, the extension to more than one variable with missing data requires more technical care in stating the assumption (Rubin, 1976), but the basic idea is the same: the probability that a variable is missing may depend on anything that is observed; it just cannot depend on any of the unobserved values of the variables with missing data (after adjusting for observed values). Nevertheless, missingness on one variable is allowed to depend on missingness on other variables.

Unfortunately, the MAR assumption is not testable. You may have reasons to suspect that the probability of missingness depends on the values that are missing; for example, people with high incomes may be less likely to report their incomes. But nothing in the data will tell you whether this is the case or not. Fortunately, there is a way to make the assumption more plausible. The MAR assumption says that missingness on Z does not depend on Z, after adjusting for the variables in X. And like MCAR, the set of variables in X depends on the model to be estimated. If you put many variables in X, especially those that are highly correlated with Z, you may be able to reduce or eliminate the residual dependence of missingness on Z itself. In the case of income, for example, putting such variables as age, sex, occupation and education into the X vector can make the MAR assumption much more reasonable. Later on, we shall discuss strategies for doing this.

The missing-data mechanism (the process generating the missingness) is said to be ignorable if the data are MAR and an additional, somewhat technical, condition is satisfied. Specifically, the parameters governing the missing-data mechanism must be distinct from the parameters in the model to be estimated. Since this condition is unlikely to be violated in real-world situations, it is commonplace to use the terms MAR and ignorability interchangeably. As the name suggests, if the missing-data mechanism is ignorable, then it is possible to get valid, optimal estimates of parameters without directly modeling the missing-data mechanism.

NOT MISSING AT RANDOM

If the MAR assumption is violated, the data are said to be not missing at random (NMAR). In that case, the missing-data mechanism is not ignorable, and valid estimation requires that the missing-data mechanism be modeled as part of the estimation process. A well-known method for handling one kind of NMAR is Heckman’s (1979) method for selection bias. In Heckman’s method, the goal is to estimate a linear model with NMAR missing data on the dependent variable Y. The missing-data mechanism may be represented by a probit model in which missingness on Y depends on both Y and X. Using maximum likelihood, the linear model and the probit model are estimated simultaneously to produce consistent and efficient estimates of the coefficients.

As there are many situations in which the MAR assumption is implausible, it is tempting to turn to missing-data methods that do not require this assumption. Unfortunately, these methods are fraught with difficulty. Because every NMAR situation is different, the model for the missing-data mechanism must be carefully tailored to each situation. Furthermore, there is no information in the data that would help you choose an appropriate model, and no statistic that will tell you how well a chosen model fits the data. Worse still, the results are often exquisitely sensitive to the choice of the model (Little and Rubin, 2002).

It is no accident, then, that most commercial software for handling missing data is based on the assumption of ignorability. If you decide to go the NMAR route, you should do so with great caution and care. It is probably a good idea to enlist the help and advice of someone who has real expertise in this area. It is also recommended that you try different models for the missing-data mechanism to get an idea of how sensitive the results are to model choice. The remainder of this chapter will assume ignorability, although it is important to keep in mind that both maximum likelihood and multiple imputation can produce valid estimates in the NMAR case if you have a correctly specified model for the missing-data mechanism.

CONVENTIONAL METHODS

This section is a brief review of conventional methods for handling missing data, with an emphasis on what is good and bad about each method. To do that, we need some criteria for evaluating a missing-data method. There is general agreement that a good method should do the following:

1. Minimize bias. Although it is well-known that missing data can introduce bias into parameter estimates, a good method should make that bias as small as possible.

2. Maximize the use of available information. We want to avoid discarding any data, and we want to use the available data to produce parameter estimates that are efficient (i.e., have minimum sampling variability).

3. Yield good estimates of uncertainty. We want accurate estimates of standard errors, confidence intervals and p-values.

In addition, it would be nice if the missing-data method could accomplish these goals without making unnecessarily restrictive assumptions about the missing-data mechanism. As we shall see, maximum likelihood and multiple imputation do very well at satisfying these criteria. But conventional methods are all deficient on one or more of these goals.

Listwise deletion

How does listwise deletion fare on these criteria? The short answer is: good on 3 (above), terrible on 2 and so-so on 1. Let us first consider bias. If the data are MCAR, listwise deletion will not introduce any bias into parameter estimates. We know that because, under MCAR, the subsample of cases with complete data is equivalent to a simple random sample from the original target sample. It is also well-known that simple random sampling does not cause bias. If the data are MAR but not MCAR, listwise deletion may introduce bias. Here is a simple example. Suppose the goal is to estimate mean income for some population. In the sample, 85% of women report their income but only 60% of men (a violation of MCAR), but within each gender missingness on income does not depend on income (MAR). Assuming that men, on average, make more than women, listwise deletion would produce a downwardly biased estimate of mean income for the whole population.
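
A simulation along the lines of this example shows the bias directly (hypothetical Python code; the income figures and response rates are invented for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
female = rng.integers(0, 2, size=n)
# Suppose men earn more than women on average.
income = np.where(female == 1, 40_000, 55_000) + rng.normal(0, 8_000, size=n)

# MAR but not MCAR: 85% of women and 60% of men report income,
# but within gender, reporting does not depend on income.
respond = rng.random(n) < np.where(female == 1, 0.85, 0.60)

print("true mean:          ", round(income.mean()))
print("complete-case mean: ", round(income[respond].mean()))  # biased downward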

Somewhat surprisingly, listwise deletion is very robust to violations of MCAR (or even MAR) for predictor variables in a regression analysis. Specifically, so long as missingness on the predictors does not depend on the dependent variable, listwise deletion will yield approximately unbiased estimates of regression coefficients (Little, 1992). And this holds for virtually any kind of regression – linear, logistic, Poisson, Cox, etc.


The obvious downside of listwise deletion is that it often discards a great deal of potentially usable data. On the one hand, this loss of data leads to larger standard errors, wider confidence intervals, and a loss of power in testing hypotheses. On the other hand, the estimated standard errors produced by listwise deletion are usually accurate estimates of the true standard errors. In this sense, listwise deletion is an ‘honest’ method for handling missing data, unlike some other conventional methods.

Pairwise deletion

For linear models, a popular alternative to listwise deletion is pairwise deletion, also known as available case analysis. For many linear models (e.g., linear regression, factor analysis, structural equation models), the parameters of interest can be expressed as functions of the population means, variances and covariances (or, equivalently, correlations). In pairwise deletion, each of these ‘moments’ is estimated using all available data for each variable or each pair of variables. Then, these sample moments are substituted into the formulas for the population parameters. In this way, all data are used and nothing is discarded.

If the data are MCAR, pairwise deletion produces consistent (and, hence, approximately unbiased) estimates of the parameters (Glasser, 1964). Like listwise deletion, however, if the data are MAR but not MCAR, pairwise deletion may yield biased estimates. Intuitively, pairwise deletion ought to be more efficient than listwise deletion because more data are utilized in producing the estimates. This is usually the case, although simulation results suggest that in certain situations pairwise deletion may actually be less efficient than listwise.

Occasionally, pairwise deletion breaks down completely because the estimated correlation matrix is not positive definite and cannot be inverted to calculate the parameters. The more common problem, however, is the difficulty in getting accurate estimates of the standard errors. That is because each covariance (or correlation) may be based on a different sample size, depending on the missing-data pattern. Although methods have been proposed for getting accurate standard error estimates (Van Praag et al., 1985), they are complex and have not been incorporated into any commercial software.
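
The mechanics are easy to see in pandas, whose cov() and corr() methods use pairwise-complete observations by default (a hypothetical Python sketch with simulated data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
df = pd.DataFrame(rng.multivariate_normal([0, 0, 0], cov, size=500),
                  columns=["y", "x1", "x2"])
df.loc[rng.random(500) < 0.2, "x1"] = np.nan   # 20% missing, MCAR
df.loc[rng.random(500) < 0.2, "x2"] = np.nan

pairwise = df.cov()            # each entry uses all pairwise-complete cases
listwise = df.dropna().cov()   # complete cases only
print(pairwise.round(3))
print(listwise.round(3))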

Dummy-variable adjustment

In their 1985 textbook, Cohen and Cohen popularized a method for dealing with missing data on predictors in a regression analysis. For each predictor with missing data, a dummy variable is created to indicate whether or not data are missing on that predictor. All such dummy variables are included as predictors in the regression. Cases with missing data on a predictor are coded as having some constant value (usually the mean for non-missing cases) on that predictor.

The rationale for this method is that it incorporates all the available information into the regression. Unfortunately, Jones (1996) proved that this method typically produces biased estimates of the regression coefficients, even if the data are MCAR. He also proved the same result for a closely-related method for categorical predictors, whereby an extra category is created to hold the cases with missing data. Although these methods probably produce reasonably accurate standard error estimates, the bias makes them unacceptable.
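
For concreteness, the construction is as follows (a hypothetical Python sketch, shown only to make the method explicit; given Jones’s result, it is not recommended):

import numpy as np
import pandas as pd

# Hypothetical predictor x with missing values.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan]})

df["x_missing"] = df["x"].isna().astype(int)     # missingness indicator
df["x_filled"] = df["x"].fillna(df["x"].mean())  # constant fill (here, the mean)
# Both x_filled and x_missing would then enter the regression as
# predictors, a construction that typically yields biased coefficients.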

Imputation

A wide variety of methods falls under the general heading of imputation. This class includes any method in which some guess or estimate is substituted for each missing value, after which the analysis is done using conventional software. A simple but popular approach is to substitute means for missing values, but this is well-known to produce biased estimates (Haitovsky, 1968). Imputations based on linear regression are much better, although still problematic. One problem, suffered by most conventional, deterministic methods, is that they produce biased estimates of some parameters. In particular, variances for the variables with missing data tend to be underestimated, and this bias is propagated to any parameters that depend on variances (e.g., regression coefficients).

Even more serious is the tendency for imputation to produce underestimates of standard errors, which leads in turn to inflated test statistics and p-values that are too low. That is because conventional software has no way of distinguishing real data from imputed data and cannot take into account the inherent uncertainty of the imputations. The larger the fraction of missing data, the more severe this problem will be. In this sense, all conventional imputation methods are ‘dishonest’ and should be viewed with some skepticism.
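
The variance bias of the simplest deterministic method, mean imputation, is easy to demonstrate (hypothetical Python code):

import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(10, 2, size=10_000)         # true variance is about 4
z_miss = z.copy()
z_miss[rng.random(10_000) < 0.4] = np.nan  # 40% missing completely at random

filled = np.where(np.isnan(z_miss), np.nanmean(z_miss), z_miss)
print("variance of complete data:", round(z.var(), 2))
print("variance after mean fill: ", round(filled.var(), 2))  # far too small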

MAXIMUM LIKELIHOOD

Maximum likelihood has proven to be an excellent method for handling missing data in a wide variety of situations. If the assumptions are met, maximum likelihood for missing data produces estimates that have the desirable properties normally associated with maximum likelihood: consistency, asymptotic efficiency and asymptotic normality. Consistency implies that estimates will be approximately unbiased in large samples. Asymptotic efficiency means that the estimates are close to being fully efficient (i.e., having minimal standard errors). Asymptotic normality is important because it means we can use a normal approximation to calculate confidence intervals and p-values. Furthermore, maximum likelihood can produce accurate estimates of standard errors that fully account for the fact that some of the data are missing.

In sum, maximum likelihood satisfies all three criteria stated earlier for a good missing-data method. Even better is the fact that it can accomplish these goals under weaker assumptions than those required for many conventional methods. In particular, it does well when data are MAR but not MCAR. It also does well when the data are NMAR – if one has a correct model for the missing-data mechanism.

Of course there are some downsides. Specialized software is typically required. Also required is a parametric model for the joint distribution of all the variables with missing data. Such a model is not always easy to devise, and results may be somewhat sensitive to model choice. Finally, the good properties of maximum likelihood estimates are all ‘large sample’ approximations, and those approximations may be poor in small samples.

Most software for maximum likelihood with missing data assumes ignorability (and, hence, MAR). Under that assumption, the method is fairly easy to describe. As usual, to do maximum likelihood we first need a likelihood function, which expresses the probability of the data as a function of the unknown parameters. Suppose we have two discrete variables X and Z, with a joint probability function denoted by p(x, z | θ), where θ is a vector of parameters. That is, p(x, z | θ) gives the probability that X = x and Z = z. If there are no missing data and observations are independent, the likelihood function is given by:

L(θ) = ∏_{i=1}^{n} p(x_i, z_i | θ)

To get maximum likelihood estimates, we find the value of θ that makes this function as large as possible.

Now suppose that data are MAR on Z for the first r cases, and MAR on X for the next s cases. Let:

g(x | θ) = ∑_z p(x, z | θ)

be the marginal distribution of X (summing over Z) and let:

h(z | θ) = ∑_x p(x, z | θ)


be the marginal distribution of Z (summing over X). The likelihood function is then:

L(θ) = ∏_{i=1}^{r} g(x_i | θ) · ∏_{i=r+1}^{r+s} h(z_i | θ) · ∏_{i=r+s+1}^{n} p(x_i, z_i | θ)

That is, the likelihood function is factored into parts corresponding to different missing-data patterns. For each pattern, the likelihood is found by summing the joint distribution over all possible values of the variable(s) with missing data. If the variables are continuous rather than discrete, the summation signs are replaced with integral signs. The extension to more than two variables is straightforward.

To implement maximum likelihood for missing data, one needs a model for the joint distribution of all the relevant variables and a numerical method for maximizing the likelihood. If all the variables are categorical, an appropriate model might be the unrestricted multinomial model, or a log-linear model that imposes some restrictions on the data. The latter is necessary when there are many variables with many categories. Otherwise, without restrictions (e.g., all three-way and higher interactions are 0), there would be too many parameters to estimate. An excellent freeware package for maximizing the likelihood for any log-linear model with missing data is LEM (available at http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html). LEM can also estimate logistic-regression models (in the special case when all predictors are discrete) and latent-class models.

When all variables are continuous, it is typical to assume a multivariate-normal model. This implies that each variable is normally distributed and can be expressed as a linear function of the other variables (or any subset of them), with errors that are homoscedastic and have a mean of 0. While this is a strong assumption, it is commonly used as the basis for multivariate analysis and linear-structural equation modeling.

Under the multivariate-normal model, the likelihood can be maximized using either the expectation-maximization (EM) algorithm or direct maximum likelihood. Direct maximum likelihood is strongly preferred because it gives accurate standard error estimates and is more appropriate for ‘overidentified’ models. However, because the EM method is readily available in many commercial software packages, it is worth taking a closer look at it.

EM is a numerical algorithm that can be used to maximize the likelihood under a wide variety of missing-data models (Dempster et al., 1977). It is an iterative algorithm that repeatedly cycles through two steps. In the expectation step, the expected value of the log-likelihood is taken over the variables with missing data, using the current values of the parameter estimates to compute the expectation. In the maximization step, the expected log-likelihood is maximized to get new values of the parameter estimates. These two steps are repeated over and over until convergence, i.e., until the parameter estimates do not change from one iteration to the next.

Under the multivariate-normal model, the parameters that are estimated by the EM algorithm are the means, variances and covariances. In this case, the algorithm reduces to something that can be described as iterated linear regression imputation. The steps are as follows (a minimal numerical sketch in Python appears after the list):

1. Get starting values for the means, variances and covariances. These can be obtained by listwise or pairwise deletion.

2. For each missing-data pattern, construct regression equations for predicting the missing variables based on the observed variables. The regression parameters are calculated directly from the current estimates of the means, variances and covariances.

3. Use these regression equations to generate predicted values for all the variables and cases with missing data.

4. Using all the real and imputed data, recalculate the means, variances and covariances. For means, the standard formula works fine. For variances (and sometimes covariances) a correction factor must be applied to compensate for the downward bias that results from using imputed values.

5. Go back to step 2 and repeat until convergence.
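
The following is a minimal numerical sketch of this algorithm for the multivariate-normal case (Python code written for this illustration; the function name, starting values and convergence rule are my own choices, not the chapter's):

import numpy as np

def em_mvnorm(Y, n_iter=500, tol=1e-8):
    """EM estimates of the mean vector and covariance matrix from an
    n x p array Y with missing values coded as np.nan."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)
    mu = np.nanmean(Y, axis=0)                # step 1: crude starting values
    sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        T1 = np.zeros(p)                      # running sum of completed rows
        T2 = np.zeros((p, p))                 # running sum of cross-products
        for i in range(n):
            m, o = miss[i], ~miss[i]
            yi = Y[i].copy()
            C = np.zeros((p, p))              # conditional covariance of missing part
            if m.any():
                # steps 2-3: regression of the missing variables on the
                # observed ones, derived from the current mu and sigma
                B = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
                yi[m] = mu[m] + B.T @ (yi[o] - mu[o])
                C[np.ix_(m, m)] = sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ B
            T1 += yi
            T2 += np.outer(yi, yi) + C        # step 4: C is the correction factor
        mu_new = T1 / n
        sigma_new = T2 / n - np.outer(mu_new, mu_new)
        done = np.max(np.abs(mu_new - mu)) < tol
        mu, sigma = mu_new, sigma_new
        if done:                              # step 5: stop at convergence
            break
    return mu, sigma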

The principal output from this algorithm is the set of maximum likelihood estimates of the means, variances and covariances. Although imputed values are generated as part of the estimation process, it is not recommended that these values be used in any other analysis. They are not designed for that purpose, and they will yield biased estimates of many parameters. One drawback of the EM method is that, although it produces the correct parameter estimates, it does not produce standard error estimates.

EXAMPLE

To illustrate the EM algorithm (as well as other methods to be considered later), we will use a data set that has records for 581 children who were interviewed in 1990 as part of the National Longitudinal Survey of Youth (NLSY). A text file containing these data is available at http://www.ssc.upenn.edu/∼allison. Here are the variables:

ANTI      antisocial behavior, measured with a scale ranging from 0 to 6
SELF      self-esteem, measured with a scale ranging from 6 to 24
POV       poverty status of family, coded 1 for in poverty, otherwise 0
BLACK     1 if child is black, otherwise 0
HISPANIC  1 if child is Hispanic, otherwise 0
DIVORCE   1 if mother was divorced in 1990, otherwise 0
GENDER    1 if female, 0 if male
MOMWORK   1 if mother was employed in 1990, otherwise 0

BLACK and HISPANIC are two categories of a three-category variable, the reference category being non-Hispanic white. The ultimate goal is to estimate a linear-regression model with ANTI as the dependent variable and all the others as predictors.

The original data set had no missing data. I deliberately produced missing data on several of the variables, using a method that satisfied the MAR assumption. The variables with missing data and their percentage missing are: SELF (25%), POV (26%), BLACK and HISPANIC (19%) and MOMWORK (15%). Listwise deletion on this set of variables leaves only 225 cases, less than half the original sample.

Application of the multivariate-normal EM algorithm to these data produced the maximum likelihood estimates of the means, variances and covariances in Table 4.1. It may be objected that this method is not appropriate for the variables POV, BLACK, HISPANIC and MOMWORK because, as dummy variables, they cannot possibly have a normal distribution. Despite the apparent validity of this objection, a good deal of simulation evidence and practical experience suggests that the method does a reasonably good job, even when the variables with missing data are dichotomous (Schafer, 1997). We will have more to say about this issue later on.

What can be done with these estimates? Because covariances are hard to interpret, it is usually desirable to convert the covariance matrix into a correlation matrix, something that is easily accomplished in many software packages. One of the nice things about maximum likelihood estimates is that any function of those estimates will also be a maximum likelihood estimate of the corresponding function in the population. Thus, if s_i is the maximum likelihood estimate of the standard deviation of x_i, and s_ij is the maximum likelihood estimate of the covariance between x_i and x_j, then r = s_ij/(s_i s_j) is the maximum likelihood estimate of their correlation. Table 4.2 displays the maximum likelihood estimates of the correlations.
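
For example, the conversion is a one-line computation (a hypothetical Python helper implementing r = s_ij/(s_i s_j)):

import numpy as np

def cov_to_corr(sigma):
    """Convert a covariance matrix to a correlation matrix."""
    s = np.sqrt(np.diag(sigma))    # standard deviations s_i
    return sigma / np.outer(s, s)  # r_ij = s_ij / (s_i * s_j)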

Next, we can use the EM estimates as input to a linear regression program to estimate the regression of ANTI on the other variables. Many regression programs allow a covariance or correlation matrix as input. If maximum likelihood estimates for the means and covariances are used as the input, the


Table 4.1 Expectation-maximization (EM) estimates of means and covariance matrix

              ANTI      SELF      POV       BLACK     HISPANIC  DIVORCE   GENDER    MOMWORK
Means         1.56799   20.1371   0.34142   0.35957   0.24208   0.23580   0.50430   0.33546
Covariance matrix
ANTI          2.15932   −0.6402   0.15602   0.08158  −0.04847   0.01925  −0.12637   0.07415
SELF         −0.64015    9.7150  −0.10947  −0.09724  −0.13188  −0.14569  −0.03381   0.00750
POV           0.15602   −0.1095   0.22456   0.06044  −0.00061   0.05259   0.00770   0.05446
BLACK         0.08158   −0.0972   0.06044   0.22992  −0.08716   0.00354   0.00859  −0.01662
HISPANIC     −0.04847   −0.1319  −0.00061  −0.08716   0.18411   0.00734  −0.01500   0.01657
DIVORCE       0.01925   −0.1457   0.05259   0.00354   0.00734   0.18020  −0.00015  −0.00964
GENDER       −0.12637   −0.0338   0.00770   0.00859  −0.01500  −0.00015   0.24998   0.00407
MOMWORK       0.07415    0.0075   0.05446  −0.01662   0.01657  −0.00964   0.00407   0.22311

Table 4.2 Expectation-maximization (EM) estimates of correlation matrix

              ANTI      SELF      POV       BLACK     HISPANIC  DIVORCE   GENDER    MOMWORK
ANTI          1.0000   −0.1398    0.2241    0.1158   −0.0769    0.0309   −0.1720    0.1068
SELF         −0.1398    1.0000   −0.0741   −0.0651   −0.0986   −0.1101   −0.0217    0.0051
POV           0.2241   −0.0741    1.0000    0.2660   −0.0030    0.2614    0.0325    0.2433
BLACK         0.1158   −0.0651    0.2660    1.0000   −0.4236    0.0174    0.0358   −0.0734
HISPANIC     −0.0769   −0.0986   −0.0030   −0.4236    1.0000    0.0403   −0.0699    0.0817
DIVORCE       0.0309   −0.1101    0.2614    0.0174    0.0403    1.0000   −0.0007   −0.0481
GENDER       −0.1720   −0.0217    0.0325    0.0358   −0.0699   −0.0007    1.0000    0.0172
MOMWORK       0.1068    0.0051    0.2433   −0.0734    0.0817   −0.0481    0.0172    1.0000

Table 4.3 Regression of ANTI on other variables

            No missing data     Listwise deletion    Maximum likelihood                Multiple imputation
Variable    Coeff.     SE       Coeff.     SE        Coeff.    Two-step SE  Direct SE  Coeff.     SE
SELF       −0.054*    0.018    −0.045     0.031     −0.066*   0.022        0.022      −0.069*    0.021
POV         0.565*    0.137     0.727*    0.234      0.635*   0.161        0.162       0.625*    0.168
BLACK       0.090     0.140     0.053     0.247      0.071    0.164        0.160       0.073     0.155
HISPANIC   −0.346     0.153    −0.353     0.253     −0.336    0.176        0.170      −0.332     0.168
DIVORCE     0.068     0.144     0.085     0.243     −0.109    0.166        0.146      −0.107     0.147
GENDER     −0.537*    0.117    −0.334     0.197     −0.560*   0.135        0.117      −0.556*    0.118
MOMWORK     0.184     0.129     0.259     0.216      0.215    0.150        0.142       0.242     0.143

Coefficients (Coeff.) marked * are statistically significant at the .01 level.
SE, standard error.

resulting regression coefficient estimates will also be maximum likelihood estimates. The problem with this two-step approach is that it is not easy to get accurate standard error estimates. As with pairwise deletion, one must specify a sample size to get conventional regression software to produce standard error estimates. But there is no single number that will yield the right standard errors for all the parameters. I generally get good results using the number of non-missing cases for the variable with the most missing data (in this example, 431 cases on POV). But this method may not work well under all conditions.
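
For readers who want to see the two-step computation itself, the regression coefficients can be obtained directly from the estimated means and covariances (a hypothetical Python sketch; it reproduces only the point estimates, not the troublesome standard errors):

import numpy as np

def regress_from_moments(mu, sigma, names, depvar):
    """Regression coefficients computed from a mean vector and covariance
    matrix, such as the EM estimates in Table 4.1."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    idx = {v: i for i, v in enumerate(names)}
    xs = [v for v in names if v != depvar]
    xi = [idx[v] for v in xs]
    yi = idx[depvar]
    slopes = np.linalg.solve(sigma[np.ix_(xi, xi)], sigma[xi, yi])
    intercept = mu[yi] - slopes @ mu[xi]
    return intercept, dict(zip(xs, slopes))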

Results are shown in Table 4.3. The first set of regression estimates is based on the original data set with no missing data. Three variables have p-values below .01: SELF, POV and GENDER. Higher levels of antisocial behavior are associated with lower levels of self-esteem, being in poverty and being male. The negative coefficient for Hispanic is also marginally significant. The next set of estimates was obtained with listwise deletion. Although the coefficients are reasonably close to those in the original data set, the standard errors are much larger, reflecting the fact that more than half the cases are lost. As a result, only the coefficient for POV is statistically significant.

Maximum likelihood estimates are shown in the third panel of Table 4.3. The coefficients are generally closer to the original values than those from listwise deletion. More importantly, the estimated standard errors (using 431 as the sample size in the two-step method) are much lower than those from listwise deletion, with the result that POV, SELF and GENDER all have p-values below .01. The standard errors are still larger than those from the original data set, but that is to be expected because a substantial fraction of the data is now missing.

DIRECT MAXIMUM LIKELIHOOD

As noted, the problem with the two-step method is that we do not get dependable standard error estimates. This problem can be solved by using direct maximum likelihood, also known as ‘raw’ maximum likelihood (because one must use the raw data as input rather than a covariance matrix) or ‘full information’ maximum likelihood (Arbuckle, 1996; Allison, 2003). In this approach, the linear model of interest is specified, and the likelihood function is directly maximized with respect to the parameters of the model. Standard errors may be calculated by conventional maximum likelihood methods (such as computing the negative inverse of the information matrix). The presumption is still that the data follow a multivariate normal distribution, but the means and covariance matrix are expressed as functions of the parameters in the specified linear model.

Direct maximum likelihood is now widely available in most stand-alone programs for estimating linear-structural equation models, including LISREL, AMOS, EQS, M-PLUS and MX. For the NLSY data, the maximum likelihood panel in Table 4.3 shows the standard error estimates reported by AMOS. (The coefficients are identical to those obtained from the two-step method.) With one exception (POV), the maximum likelihood standard errors are all somewhat lower than those obtained with the two-step method (with a specified sample size of 431).

MULTIPLE IMPUTATION

Although maximum likelihood is an excellent method for handling missing data, it does have limitations. The principal limitation is that one must specify a joint probability distribution for all the variables, and such models are not always easy to come by. Consequently, although models and software are readily available in the linear and log-linear cases, there is no commercial software for maximum likelihood with missing data for logistic regression, Poisson regression or Cox regression.

An excellent alternative is multiple imputation (Rubin, 1987), which has statistical properties that are nearly as good as maximum likelihood. Like maximum likelihood, multiple imputation estimates are consistent and asymptotically normal. They are close to being asymptotically efficient. (In fact, you can get as close as you like by having a sufficient number of imputations.) Like maximum likelihood, multiple imputation has these desirable properties under either the MAR assumption or a correctly specified model for the missing-data mechanism. However, most software assumes MAR.

Compared with maximum likelihood, multiple imputation has two big advantages. First, it can be applied to virtually any kind of data or model. Second, the analysis can be done using conventional software rather than having to use a special package like LEM or AMOS. The major downside of multiple imputation is that it produces different results every time you use it. That is because the imputed values are random draws rather than deterministic quantities. A second downside is that there are many different ways to do multiple imputation, possibly leading to uncertainty and confusion.

The most widely-used method for multiple imputation is the Markov Chain Monte Carlo (MCMC) algorithm based on linear regression. This method was first implemented in the stand-alone package NORM (Schafer, 1997), but is now available in SAS and S-PLUS. The approach is quite similar to the multivariate normal EM algorithm which, as we saw earlier, is equivalent to iterated linear regression imputation. There is one major difference, however. After generating predicted values based on the linear regressions, random draws are made from the (simulated) error distribution for each regression equation. These random ‘errors’ are added to the predicted values for each individual to produce the imputed values. The addition of this random variation compensates for the downward bias in variance estimates that usually results from deterministic imputation methods.

If you apply conventional analysis software to a single data set produced by this random imputation method, you get parameter estimates that are approximately unbiased. However, standard errors will still be underestimated because, as noted earlier, the software cannot distinguish real values from imputed values, and the imputed values contain much less information. The parameter estimates will also be inefficient because random variation in the imputed values induces additional sampling variability.

The solution to both of these problems is to do the imputation more than once. Specifically, create several data sets, each with different, randomly drawn, imputed values. If we then apply conventional software to each data set, we get several sets of alternative estimates. These may be combined into a single set of parameter estimates and standard errors using two simple rules (Rubin, 1987). For parameter estimates, one simply takes the mean of the estimates over the several data sets. Combining the standard errors is a little more complicated. First, take the average of the squared standard errors across the several data sets. This is the ‘within’ variance. The ‘between’ variance is just the sample variance of the parameter estimates across the several data sets. Add the within and between variances (applying a small correction factor to the latter) and take the square root. The formula is as follows:

sqrt[ (1/M) ∑_{k=1}^{M} s_k² + (1 + 1/M)(1/(M − 1)) ∑_{k=1}^{M} (b_k − b̄)² ]

Here M is the number of data sets, s_k is the standard error in the kth data set, and b_k is the parameter estimate in the kth data set (with b̄ the mean of the b_k).
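
In code, the combining rules amount to only a few lines (a hypothetical Python sketch; the estimates and standard errors in the example call are invented):

import numpy as np

def combine_mi(estimates, std_errors):
    """Rubin's rules: combine estimates and standard errors from M data sets."""
    b = np.asarray(estimates, dtype=float)
    s = np.asarray(std_errors, dtype=float)
    M = len(b)
    within = np.mean(s ** 2)       # average squared standard error
    between = np.var(b, ddof=1)    # sample variance of the estimates
    total = within + (1 + 1 / M) * between
    return b.mean(), np.sqrt(total)

# Example: one coefficient estimated in each of five imputed data sets.
est, se = combine_mi([0.62, 0.66, 0.59, 0.64, 0.61],
                     [0.16, 0.17, 0.16, 0.16, 0.17])
print(round(est, 3), round(se, 3))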

There is one further complication to the method. For this standard error formula to be accurate, the regression parameters used to generate the predicted values must themselves be random draws from their ‘posterior’ distribution, one random draw for each data set. Otherwise, there will be insufficient variation across data sets. For details, see Schafer (1997).

How many data sets are necessary for multiple imputation? With moderate amounts of missing data, five data sets (the default in SAS) are usually sufficient to get parameter estimates that are close to being fully efficient. Somewhat more may be necessary to get sufficiently stable p-values and confidence intervals. More data sets may also be needed if the fraction of missing data is large.

For the NLSY example, I used PROC MI in SAS to generate 15 ‘completed’ data sets. For each data set, I used PROC REG to estimate the linear regression of ANTI on the other variables. Finally, I used PROC MIANALYZE to combine the results into a single set of parameter estimates, standard errors, confidence intervals and p-values. While this may seem like a lot of work, the programming is really quite simple. Here is the SAS program that accomplished all of these tasks:

proc mi data=nlsy out=miout nimpute=15;
   var anti self pov black hispanic
       divorce gender momwork;
proc reg data=miout outest=a covout;
   model anti=self pov black hispanic
         divorce gender momwork;
   by _imputation_;
proc mianalyze data=a;
   var intercept self pov black
       hispanic divorce gender momwork;
run;

PROC MI reads the NLSY data set and produces a new data set called MIOUT. This data set actually consists of 15 stacked data sets, with a variable named _IMPUTATION_ having values 1 through 15 to distinguish the different data sets. The VAR statement specifies the variables that go into the imputation process. Each variable with missing data is imputed using a linear regression of that variable on all the other variables.

PROC REG estimates the desired regression model using the MIOUT data set. The BY statement requests that separate regressions be estimated for each value of _IMPUTATION_. The OUTEST option writes the coefficient estimates to a data set called A, and the COVOUT option includes the estimated covariance matrix in that data set. This data set is passed to PROC MIANALYZE, which then applies the combining rules to each of the coefficients specified in the VAR statement. Clearly, there is a major advantage in being able to do the imputation, the analysis and the combination within a single software package. With a stand-alone imputation program, moving the necessary data sets back and forth between packages can get very tedious.

Results are shown in the last two columns of Table 4.3. As expected, both coefficients and standard errors are very similar to those produced by direct maximum likelihood. Keep in mind that this is just one set of possible estimates produced by multiple imputation. The first panel of Table 4.4 contains another set of estimates produced by the same SAS program.

Table 4.4 Regression of ANTI using two multiple imputation methods

            Multivariate normal MCMC    Sequential generalized regression
            Coeff.      SE              Coeff.      SE
SELF       −0.065*     0.021           −0.067*     0.021
POV         0.635*     0.180            0.700*     0.161
BLACK       0.082      0.160            0.042      0.160
HISPANIC   −0.321      0.173           −0.334      0.173
DIVORCE    −0.112      0.147           −0.129      0.148
GENDER     −0.553*     0.118           −0.559*     0.118
MOMWORK     0.235      0.135            0.217      0.157

Coefficients (Coeff.) marked * are statistically significant at the .01 level.
MCMC, Markov Chain Monte Carlo; SE, standard error.

COMPLICATIONS

Space is not sufficient for a thorough treatment of the various complications that may arise in the application of multiple imputation. However, it is certainly worth mentioning some of the more important issues that frequently arise.

Auxiliary variables

An auxiliary variable is one that is used in the imputation process but does not appear in the model to be estimated. The most desirable auxiliary variables are those that are moderately to highly correlated with the variables having missing data. Such variables can be very helpful in getting more accurate imputations, thereby increasing the efficiency of the parameter estimates. If auxiliary variables are also associated with the probability that other variables are missing, their inclusion can also reduce bias. In fact, including such variables can go a long way toward making the MAR assumption more plausible.

The dependent variable

If the goal is to estimate some kind of regression model, two questions arise regarding the dependent variable. First, should the dependent variable be included among the variables used to impute missing values on the independent variables? In conventional, deterministic imputation, the answer is no. Using the dependent variable to impute independent variables can lead to overestimates of the magnitudes of the coefficients. With multiple imputation, however, the answer is definitely yes, because the random component avoids any bias. In fact, leaving out the dependent variable will yield regression coefficients that are attenuated toward zero (Landerman et al., 1997).

Second, should the dependent variable itself be imputed? If the data are MAR and there are no auxiliary variables, the answer is no. Imputation of the dependent variable merely increases sampling variability (Little, 1992). So the preferred procedure is to delete cases with missing data on the dependent variable before doing the imputation. If there are auxiliary variables that are strongly correlated with the dependent variable, imputation of the dependent variable can be helpful in increasing efficiency and, in some cases, reducing bias. Often, one of the best auxiliary variables is the same variable measured at a different point in time.

Combining test statistics

With multiple imputation, any parameter estimates can simply be averaged over the multiple data sets. But test statistics should never be averaged. That goes for t-statistics, z-statistics, chi-square statistics and F-statistics. Special procedures are required for combining hypothesis tests from multiple data sets. These procedures can be based on Wald tests, likelihood ratio tests, or a simple method for combining chi-square statistics. For details, see Schafer (1997) or Allison (2001).

Model congeniality

Any multiple imputation method must be based on some model for the data (the imputation model), and that model is not necessarily (or even usually) the same model that one desires to estimate (the analysis model). That raises the question of how similar the imputation model and the analysis model must be in order to get good results. Although they do not have to be identical, the two models should be ‘congenial’ in the sense that the imputation model should be able to reproduce the major features of the data that are the object of the analysis model (Rubin, 1987; Meng, 1994). Trouble is most likely to occur if the imputation model is simpler than the analysis model. Two examples:

1. The analysis model treats a variable as categorical, but the imputation model treats it as quantitative.

2. The analysis model includes interactions and nonlinearities, but the imputation model is strictly linear.

If the fraction of missing data is small, this lack of congeniality may be unproblematic. But if the fraction of missing data is large, results may be misleading.

One implication of the congeniality principle is that imputation models should be relatively ‘rich’, so that they may be congenial with lots of different models that could be of interest. However, there are serious practical limitations to the complexity of the imputation model. And if the imputer and analyst are different people, it may be quite difficult for the imputer to anticipate the kinds of models that will be estimated with the data. Consequently, it may often be necessary (or at least desirable) to produce different imputed data sets for different analysis models.

Longitudinal data

Longitudinal studies are particularly prone to missing data because subjects often drop out, die, or cannot be located. While there are many kinds of longitudinal data, I focus here on the most common kind, often referred to as panel data. In panel data, one or more variables are measured repeatedly, and the measurements are taken at the same times for all subjects.

Missing data in panel studies can be readily handled by the methods of maximum likelihood and multiple imputation that we have already discussed. For multiple imputation, the critical consideration is that the imputation must be done in such a way that it reproduces the correlations over time. This is most easily accomplished if the data are formatted so that there is only one record per subject, rather than separate records for each observation time point. The imputation model should be formulated so that each variable with missing data may be imputed based on any of the variables at any of the time points (including the variable itself at a different time point).
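
For example, reshaping a long-format panel to one record per subject before imputation can be done as follows (a hypothetical Python/pandas sketch with invented data):

import pandas as pd

# Long format: one record per subject per wave.
long = pd.DataFrame({"id":   [1, 1, 1, 2, 2, 2],
                     "wave": [1, 2, 3, 1, 2, 3],
                     "anti": [2.0, None, 3.0, 1.0, 1.0, None]})

# Wide format: one record per subject, one column per wave, so each
# wave can be imputed from the same variable at the other time points.
wide = long.pivot(index="id", columns="wave", values="anti")
wide.columns = [f"anti_{w}" for w in wide.columns]
print(wide)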

Categorical variables

The MCMC method based on the multivariate-normal model is the most popular approach to multiple imputation, for good reasons. It can handle virtually any pattern of missing data, and it is extremely efficient computationally. Its biggest disadvantage, however, is that it presumes that every variable with missing data is normally distributed, and that is clearly not the case for categorical variables. I ignored this problem for the NLSY example, treating each categorical variable as a set of dummy variables and imputing the dummies just like any other variables.

Of course, the resulting imputed values for the dummy variables can be any real numbers and, not infrequently, are greater than 1 or less than 0. Many authorities (including me in my 2001 book) recommend rounding the imputed values to 0 and 1 before estimating the analysis model (Schafer, 1997). However, recent analytical and simulation results suggest that this nearly always makes things worse (Horton et al., 2003; Allison, 2006). If the dummy variables are to be used as predictor variables in some kind of regression analysis, you are better off just leaving the imputed values as they are. For categorical variables with more than two categories, there is no need to attempt to impose consistency on the imputed values for the multiple dummy variables. Unless the fraction of cases in any one category is very small, this approach usually produces good results.

Alternative methods may be necessary if the fraction of cases in a category is very small (say, 5% or less), or if the analysis method requires that the imputed variable be truly categorical (e.g., the imputed variable is the dependent variable in a logistic regression). In the next section, we will consider some methods more appropriate for the imputation of categorical variables.

OTHER IMPUTATION METHODS

There are numerous alternative models and computational methods that are available for doing multiple imputation. One class of methods uses the MCMC algorithm but applies it to models other than the multivariate normal model. For example, Schafer (http://www.stat.psu.edu/∼jls) has developed a freeware package called CAT (available only as an S-Plus library), which is designed for data in which all the variables are categorical. It uses the MCMC algorithm under a multinomial model or a restricted log-linear model.

Schafer has another package called MIX (also available only for S-Plus) that is suitable for data sets and models that include both categorical and quantitative variables. The model is a multinomial (or restricted log-linear) model for the categorical data. Within each cell of the contingency table formed by the categorical variables, the quantitative variables are assumed to follow a multivariate normal distribution with means that may vary across cells but a covariance matrix that is constant across cells. While this method might seem to be ideal for many situations, the model is rather complex and requires considerable thought and care in its implementation.

It is also possible to do imputation under the multivariate normal model but with an algorithm other than MCMC to produce the imputed values. AMELIA, for example, is a stand-alone package that uses the SIR (sampling/importance resampling) algorithm. This is a perfectly respectable approach. Indeed, the authors claim that it is more computationally efficient than the MCMC algorithm (King et al., 1999).

Perhaps the most promising alternative method for multiple imputation is an approach that is described as either ‘sequential generalized regression’ or ‘multiple imputation by chained equations’ (MICE). Instead of assuming a single multivariate model for all the data, one specifies a separate regression model that is used to impute each variable with missing data. Typically, this is a linear regression model for quantitative variables, a logistic regression model (either binomial or multinomial) for categorical variables, or a Poisson regression model for count variables (Brand, 1999; Raghunathan et al., 2001).

These models are estimated sequentially using available data, starting with the variable that has the least missing data and proceeding to the variable with the most missing data. After each model is estimated, it is used to generate imputed values for the missing data. For example, in the case of logistic regression, the model is applied to generate predicted probabilities of falling into each category for each case with missing data. These probabilities are then used as the basis for making random draws from the possible values of the categorical variable.
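
As an illustration, here is a minimal Stata sketch of that draw step for one binary variable from the NLSY example below. It shows only the flavor of the computation: proper multiple imputation also requires drawing new regression coefficients from their estimated sampling distribution before computing the probabilities, a step omitted here but handled internally by ICE.

* Fit a logistic model using the cases where MOMWORK is observed:
logit momwork anti self pov black hispanic divorce gender
* Predicted probability that momwork==1, computed for every case:
predict phat
* Impute by drawing a 1 with probability phat (runiform() is
* uniform() in older versions of Stata):
replace momwork = (runiform() < phat) if missing(momwork) & !missing(phat)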

Once imputed values have been generated for all the missing data, the sequential imputation process is repeated, except now the imputed values of the previous round are used as predictors for imputing other variables. This is one thing that distinguishes sequential generalized regression from the MCMC algorithm – in the latter, values imputed for one variable are never used as predictors to impute other variables. The sequential process is repeated for many rounds, with a data set selected at periodic intervals, say, every tenth round.

As noted, the main attraction of sequential generalized regression methods (compared with MCMC methods) is that it is unnecessary to specify a comprehensive model for the joint distribution of all the variables. Potentially, then, one can tailor the imputation model to be optimally suited for each variable that has missing data. A major disadvantage is that, unlike MCMC, there is no theory that guarantees that the sequential method will converge to the correct distribution for the missing values. Recent simulation studies suggest that the method works well, but such studies have only examined a limited range of circumstances (Van Buuren et al., 2006). The sequential method may also require a good deal more computing time, simply because estimation of logistic and Poisson models is more intensive than estimation of linear models.

User-contributed add-ons for sequential generalized regression are currently available for SAS (Raghunathan et al., 2000), S-Plus (Van Buuren and Oudshoorn, 2000), and Stata (Royston, 2004). In the remainder of this section, I apply the ICE command for Stata to the NLSY data set. Here are the Stata commands:

use "d:\nlsy.dta"gen race=1+black+2*hispanicice anti self pov race divorce

gender momwork, dryrunice anti self pov race black

hispanic divorce gender momworkusing nlsyimp,m(15) passive(black:race==2\hispanic:race==3)substitute(race:black hispanic)

use nlsyimp, clearmicombine regress anti self pov

black hispanic divorce gendermomwork

A bit of explanation is needed here. The GEN command creates a new variable RACE that has values of 1, 2 or 3, corresponding to white/non-Hispanic, black and Hispanic (e.g., a black respondent has BLACK = 1 and HISPANIC = 0, giving RACE = 1 + 1 + 0 = 2). It is better to impute this variable rather than the individual dummies for black and Hispanic because that ensures that each person with missing race data will be assigned to one and only one category.

The first ICE command is a ‘dryrun’. It scans the data set, identifies the variables with missing data, and proposes an imputation model for each one. In this case, ICE proposed a linear model for imputing SELF, binary-logit models for imputing POV and MOMWORK, and a multinomial-logit model for imputing RACE. The second ICE command actually does the imputation, using the default methods that were proposed, and writes the imputed data sets into the single Stata data set, NLSYIMP. The M(15) option requests 15 data sets, distinguished by the variable _j, which has values of 1 through 15. The PASSIVE option says that the dummy variables BLACK and HISPANIC are imputed ‘passively’ based on the imputed values of RACE. The SUBSTITUTE option tells ICE to use the dummy variables BLACK and HISPANIC as predictors when imputing other variables, rather than the 3-category variable RACE. Without this option, RACE would be treated as a quantitative predictor, which would clearly be inappropriate.
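
For intuition, here is what PASSIVE amounts to within each imputed data set once RACE has been imputed (a hand derivation shown purely for illustration; ICE performs this internally):

* Recreate the dummies from the imputed 3-category RACE variable:
replace black = (race == 2)
replace hispanic = (race == 3)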

The USE command switches from the original data set to the newly-imputed data set. The MICOMBINE command (along with the REGRESS command) estimates the regression model for the 15 imputed data sets and then combines the results into a single set of estimates and test statistics. Results are shown in the second panel of Table 4.4. The first panel of this table is simply a replication of the MCMC multivariate normal method that produced the results in the last panel of Table 4.3. It is included here for comparison with the sequential generalized regression results, but also to illustrate the degree to which results may vary from one replication of multiple imputation to another. Although the coefficients vary slightly across the replications and methods, they all tell essentially the same story. And the differences between the MCMC results and the sequential results are no greater than the differences between one run of MCMC and another.
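
For reference, the combining rules applied by MICOMBINE are Rubin's (1987). With $m$ imputed data sets yielding estimates $\hat{Q}_1, \ldots, \hat{Q}_m$ of a parameter, each with squared standard error $\hat{U}_j$, the combined point estimate and its total variance are

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right)B,$$

where $\bar{U}$ is the average of the $\hat{U}_j$ and $B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat{Q}_j - \bar{Q})^2$ is the between-imputation variance. The reported standard error, $\sqrt{T}$, thus reflects both within- and between-imputation variability.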

SUMMARY AND CONCLUSION

Conventional methods for handling missing data are seriously flawed. Even under the best of conditions, they typically yield biased parameter estimates, biased standard error estimates, or both. Despite the often substantial loss of power, listwise deletion is probably the safest method because it is not prone to Type I errors. On the other hand, conventional imputation methods may be the most dangerous because they often lead to serious underestimates of standard errors and p-values.

By contrast, maximum likelihood and multiple imputation have nearly optimal statistical properties, and they possess these properties under assumptions that are typically weaker than those used to justify conventional methods. Specifically, maximum likelihood and multiple imputation perform well under the assumption that the data are MAR, rather than the more stringent requirement of MCAR; and if the data are not MAR, these two methods do well under a correctly specified model for missingness (something that is not so easy to come by).

Of the two methods, I prefer maximum likelihood because it yields a unique set of estimates, while multiple imputation produces different results every time you use it. Software for maximum likelihood estimation of linear models with missing data is readily available in most stand-alone packages for linear structural equation modeling, including LISREL, AMOS, EQS and M-PLUS. For log-linear modeling of categorical data, there is the freeware package LEM.

Multiple imputation is an attractive alternative when estimating models for which maximum likelihood is not currently available, including logistic regression and Cox regression. It also has the advantage of not requiring the user to master an unfamiliar software package to do the analysis. The downside, of course, is that it does not produce a determinate result. And there are lots of different ways to do multiple imputation, so some care must go into choosing the most suitable method for a particular application.

Both maximum likelihood and multiple imputation usually require more time and effort than conventional methods for handling missing data. With improvements in software, however, both methods have become much easier to implement, and further improvements can be expected. And sometimes you just have to do more to get things right. Nowadays, there is no good excuse for avoiding these clearly superior methods.

REFERENCES

Allison, P.D. (2001) Missing Data. Thousand Oaks, CA: Sage.

Allison, P.D. (2003) ‘Missing data techniques for structural equation models’, Journal of Abnormal Psychology 112: 545–557.

Allison, P.D. (2006) ‘Multiple imputation of categorical variables under the multivariate normal model.’ Paper presented at the annual meeting of the American Sociological Association, Montreal Convention Center, Montreal, Quebec, Canada, Aug 11, 2006.

Arbuckle, J.L. (1996) ‘Full information estimation in the presence of incomplete data’, in Marcoulides, G.A. and Schumacker, R.E. (eds.), Advanced Structural Equation Modeling: Issues and Techniques. Mahwah, NJ: Lawrence Erlbaum Associates.

Brand, J.P.L. (1999) ‘Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets.’ Dissertation, Erasmus University Rotterdam.

Cohen, J. and Cohen, P. (1985) Applied Multiple Regression and Correlation Analysis for the Behavioral Sciences (2nd edn.). Mahwah, NJ: Lawrence Erlbaum Associates.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) ‘Maximum likelihood estimation from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society, Series B 39: 1–38.

Glasser, M. (1964) ‘Linear regression analysis with missing observations among the independent variables’, Journal of the American Statistical Association 59: 834–844.

Graham, J.W., Hofer, S.M. and MacKinnon, D.P. (1996) ‘Maximizing the usefulness of data obtained with planned missing value patterns: an application of maximum likelihood procedures’, Multivariate Behavioral Research 31: 197–218.

Haitovsky, Y. (1968) ‘Missing data in regression analysis’, Journal of the Royal Statistical Society, Series B 30: 67–82.

Heckman, J.J. (1979) ‘Sample selection bias as a specification error’, Econometrica 47: 153–161.

Horton, N.J., Lipsitz, S.R. and Parzen, M. (2003) ‘A potential for bias when rounding in multiple imputation’, The American Statistician 57: 229–232.

Jones, M.P. (1996) ‘Indicator and stratification methods for missing explanatory variables in multiple linear regression’, Journal of the American Statistical Association 91: 222–230.

King, G., Honaker, J., Joseph, A., Scheve, K. and Singh, N. (1999) ‘AMELIA: a program for missing data.’ Unpublished program manual. Online. Available: http://gking.harvard.edu/stats.shtml.

Landerman, L.R., Land, K.C. and Pieper, C.F. (1997) ‘An empirical evaluation of the predictive mean matching method for imputing missing values’, Sociological Methods and Research 26: 3–33.

Little, R.J.A. (1992) ‘Regression with missing X’s: a review’, Journal of the American Statistical Association 87: 1227–1237.

Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data (2nd edn.). New York: Wiley.

Meng, X.-L. (1994) ‘Multiple-imputation inferences with uncongenial sources of input’, Statistical Science 9 (4): 538–558.

Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J. and Solenberger, P. (2001) ‘A multivariate technique for multiply imputing missing values using a sequence of regression models’, Survey Methodology 27: 85–95.

Raghunathan, T.E., Solenberger, P. and Van Hoewyk, J. (2000) ‘IVEware: imputation and variance estimation software: installation instructions and user guide.’ Survey Research Center, Institute for Social Research, University of Michigan. Online. Available: http://www.isr.umich.edu/src/smp/ive/

Robins, J.M. and Rotnitzky, A. (1995) ‘Semiparametric efficiency in multivariate regression models with missing data’, Journal of the American Statistical Association 90: 122–129.

Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1995) ‘Analysis of semiparametric regression models for repeated outcomes in the presence of missing data’, Journal of the American Statistical Association 90: 106–129.

Royston, P. (2004) ‘Multiple imputation of missing values’, The Stata Journal 4: 227–241.

Rubin, D.B. (1976) ‘Inference and missing data’, Biometrika 63: 581–592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman and Hall.

Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999) ‘Adjusting for nonignorable drop-out using semiparametric nonresponse models (with comments)’, Journal of the American Statistical Association 94: 1096–1146.

Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M. and Rubin, D.B. (2006) ‘Fully conditional specification in multivariate imputation’, Journal of Statistical Computation and Simulation 76: 1046–1064.

Van Buuren, S. and Oudshoorn, C.G.M. (2000) ‘Multivariate imputation by chained equations: MICE V1.0 user’s manual.’ Report PG/VGZ/00.038. Leiden: TNO Preventie en Gezondheid.

Van Praag, B.M.S., Dijkstra, T.K. and Van Velzen, J. (1985) ‘Least-squares theory based on general distributional assumptions with an application to the incomplete observations problem’, Psychometrika 50: 25–36.