Running head: SELECTION OF AUXILIARY VARIABLES 1
Selection of auxiliary variables in missing data problems: Not all auxiliary variables are
created equal
Felix Thoemmes
Cornell University
Norman Rose
University of Tuebingen
Author Note
The authors would like to thank the participants of the colloquium of the Methodology
Center at Pennsylvania State University. Inquiries to this article should be addressed to the
first author, Felix Thoemmes, MVR G62A, Cornell University, Ithaca, NY 14853,
felix.thoemmes@cornell.edu.
Abstract
The treatment of missing data in the social sciences has changed tremendously during the
last decade. Modern missing data techniques such as multiple imputation and
full-information maximum likelihood are used much more frequently. These methods assume
that data are missing at random. One very common approach to increase the likelihood that
missing at random is achieved consists of including many covariates as so-called auxiliary
variables. These variables are either included based on data considerations or in an inclusive
fashion, i.e., taking all available auxiliary variables. However, neither approach accounts for
the fact that under a wide range of circumstances there is a class of variables that, when
used as auxiliary variables, will always increase bias in the estimation of parameters from
data with missing values. In this paper we show that this bias exists, quantify it in a
simulation study, and discuss ways to avoid selecting bias-inducing
covariates as auxiliary variables.
Keywords: missing data, auxiliary variables, multiple imputation, full information
maximum likelihood
Selection of auxiliary variables in missing data problems: Not all auxiliary variables are
created equal
Introduction
The presence of missing data is a prevalent problem in social science research (?, ?).
Given that a large portion of social science studies are conducted outside the confines of a
laboratory, the threat of suffering missing data due to non-compliance or attrition is even
more pronounced. The pervasiveness of this problem has triggered much research during the
last 30 years. ? (?) laid the foundation of modern missing data theory which has culminated
in sophisticated methods to deal with missing values, specifically the use of full-information
maximum likelihood (FIML) and multiple imputation (MI). For an overview see e.g., ? (?).
Both of these so-called modern missing data techniques are expected to yield unbiased
estimates of parameters in the presence of missing data, given that certain assumptions
about missingness hold. It should be noted that especially MI, while conceptually
straightforward (?, ?), can be conducted with various different techniques, see e.g., ? (?), ?
(?), ? (?), or ? (?). However, despite computational differences, all techniques, whether they
may be FIML or variants of MI, rely on the same, untestable assumptions, notably, the
missing at random (MAR) assumption (?, ?), which we will define more formally later in the
manuscript. The goal of this paper is to critically examine current recommendations to
increase the plausibility of MAR, especially with regard to the selection of auxiliary variables.
We argue that the current recommendations are incomplete and simply ignore the possibility
of complex relationships between substantive analysis variables and variables that are solely
used to improve the missing data estimation, so-called auxiliary variables. Further, we
believe that the complexities of the assumptions are not widely appreciated among social
science researchers and many quantitative scientists alike, who have long believed that
inclusion of as many auxiliary variables as possible is a safe strategy to asymptotically
achieve or approximate unbiasedness. We will show in a small example and a larger
simulation study that this strategy is not guaranteed to yield unbiased results and that
biases due to missing data and the use of auxiliary variables are much more complex than
previously thought. As a result, the use of modern missing data techniques, while laudable,
often does not guarantee that bias in studies with missing data has been adequately dealt
with.
We will first review classic missingness mechanisms and discuss which conditional
independencies these conditions imply and how these independencies can be encoded in a
graph. Further, we demonstrate that there are situations and classes of variables that should
not be used as auxiliary variables in FIML or MI as they tend to increase bias. We will
quantify the bias in our simulation studies, and suggest possible ways to avoid it. Finally, we
will discuss implications for applied research and offer an alternative framework to think
about and communicate assumptions of missing data problems.
Missing data mechanism
We begin by reviewing the classic mechanisms defined by ? (?): missing completely at
random (MCAR), missing at random (MAR), and missing not at random (MNAR). In our
overview we use a slightly modified version of the notation employed by ? (?). In addition,
we also express missing mechanisms using conditional independence statements. In
conjunction with the conditional independence statements, we present graphical displays to
illustrate the mechanisms. Using graphs to illustrate how missingness relates to other
variables in a model is not a novel approach and has in fact been used in popular texts and
articles to aid understanding of the mechanisms (?, ?, ?). In this paper however, we do not
use graphs simply as illustrations, but also use formal graph theory (?, ?) to derive certain
results.
MCAR
Following the notation of ? (?), we consider an N × K data matrix Y. The rows of Y
represent the cases n = 1, . . . , N of the sample and the columns represent the variables
i = 1, . . . , K. Y can be partitioned into an observed part, labeled Yobs, and a missing part
Ymis, which yields Y = (Yobs, Ymis). Further, we denote an indicator matrix of missingness, R,
whose elements take on values of 0 or 1 for observed or missing values of Y, respectively.
Accordingly, R is also an N × K matrix. Each variable in Y can therefore have both
observed and unobserved values.
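To make the notation concrete, the following minimal sketch builds the indicator matrix R for a small, made-up data matrix. The values in Y and the use of None as the missing-value marker are illustrative assumptions and are not taken from any dataset discussed in this paper.

```python
# Hypothetical 4 x 3 data matrix Y; None marks a value that was not observed.
Y = [
    [3.1, 0.7, 12.0],
    [2.4, None, 9.5],
    [None, 1.2, 11.0],
    [4.0, 0.9, None],
]

# Indicator matrix R with the coding used in the text:
# 0 where Y is observed, 1 where Y is missing.
R = [[1 if value is None else 0 for value in row] for row in Y]

# Observed part of Y, stored by (row, column) position.
Y_obs = {
    (n, k): value
    for n, row in enumerate(Y)
    for k, value in enumerate(row)
    if value is not None
}

# Positions that belong to the missing part of Y.
missing_cells = [
    (n, k) for n, row in enumerate(R) for k, flag in enumerate(row) if flag == 1
]

print(R)
print(missing_cells)
```

R has the same dimensions as Y, and every column here contains both observed and missing entries, which is the general situation described above.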
Missing completely at random (MCAR) is the most restrictive assumption, but, when
fulfilled, the least problematic. It states that the unconditional distribution of missingness
P (R) is equal to the conditional distribution of missingness given Yobs and Ymis, or simply Y .
P (R | Y ) = P (R | Yobs, Ymis) = P (R) (1)
These equalities of probabilities imply (can be expressed as) conditional independence
statements, here in particular
R ⊥ (Yobs, Ymis). (2)
The MCAR condition is therefore fulfilled when the missingness has no relationship with
either the observed or the unobserved part of Y. In an applied research context we could
imagine MCAR being fulfilled if the missing data arose from a purely “accidental” (random)
process, like dropping a single sheet from a questionnaire. In other words, the probability of
missingness is related only to factors that are completely unrelated to any other variable in
the model. MCAR is rare in applied research and usually does not hold, unless it has been
planned by the researcher in so-called missingness by design studies (?, ?). When MCAR
holds, even simple techniques, like listwise deletion will yield unbiased estimates (?, ?), even
though it might still not be advisable to use these simple methods due to loss in statistical
power. As ? (?) described and ? (?) more formally showed, MCAR cannot be tested
empirically, and homogeneity of means, variances, or more generally distributions, of
observed variables across missing data patterns constitutes only necessary, but not sufficient
evidence for MCAR. The inability to directly test MCAR can also be seen by the fact that it
posits independence assumptions about quantities that are by definition unobserved, here in
particular Ymis.
Before we proceed further, it is necessary to address the graphical displays that we will
be using. First, they are constructed as so-called directed acyclic graphs (?, ?), which we will
abbreviate as DAGs. DAGs are widely used in epidemiology (?, ?, ?, ?, ?), medicine (?, ?, ?),
computer science (?, ?, ?, ?, ?) and other fields. They also have been used to examine
missing data situations (?, ?). Researchers who are familiar with structural equation models
(SEM) will also feel familiar with DAGs, however there are some differences (for a complete
overview of differences refer to ? (?)). Briefly explained, in a DAG we use the ε terms, the
so-called disturbance terms, to denote all unmeasured variables that may have an effect on
the variable that is endowed with this ε term. Note that these disturbance terms are not
identical to regression residuals that are by definition uncorrelated with variables that were
used to predict the variable with the ε term. Further, the DAG is completely non-parametric
and encodes conditional independencies among the variables displayed. Precisely because of
this ability to encode conditional independencies, DAGs are well suited to express missing
data mechanisms (which can be expressed as such conditional independencies, as we have
shown earlier in the example of MCAR). We will use DAGs to express conditional
independencies that are prescribed by different missingness mechanisms and in doing so,
show how novel insights about missingness problems can be gathered.
In Figure 1 we present a graphical display of MCAR for the simple case in which a
single variable X has an effect on a unidimensional variable Y . In this simple case, X is
completely observed and only Y suffers from missingness. Whether data on Y is missing is
encoded by the indicator RY in the graph. We use an additional subscript for R here to
denote that this missingness indicator pertains only to variable Y . Note that we could have
visually partitioned Y in the graph into Yobs and Ymis, but for clarity simply denote it as Y .
In this example equation 2, which expresses the condition that needs to hold for MCAR, can
be written as RY ⊥ (X, Y ).
Independence relations in DAGs are expressed as so-called d-separation statements.
d-separation is a graphical criterion that can be applied to DAGs to infer independence
relations among variables. In short, if two variables are said to be d-separated, there exists
no traceable, unblocked path in the diagram between the variables. Conversely, if two
variables are d-connected, there exists a traceable and unblocked path between the variables.
A traceable path is defined as any path that connects two variables in a graph. It is not of
importance for the definition of a path whether the segments of the path have arrows
pointing in one or the other direction. To examine d-separation one examines whether all
paths are open or blocked. A path is said to be blocked if one conditions on a variable in the
path that acts as a mediator, i.e., takes on the form ← X ← or → X →, or is an
arrow-emanating variable, i.e., takes on the form ← X →. Further, a path is blocked if one
does not condition on a variable that has two arrows pointing into it, i.e., takes on the form
→ X ←. Such a variable is usually called a collider variable (?, ?). If two variables are said
to be d-connected there exists at least one traceable path between them that has not been
blocked. Being d-connected implies that the two variables are stochastically dependent on
each other. ? (?) has provided a proof that variables in a graph that are d-separated are
stochastically independent from each other, regardless of the functional form of the
relationships among the variables in the graph. For a more thorough introduction to
d-separation for social scientists, consult ? (?) or the original text by ? (?).
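The blocking rules above can be sketched in a few lines of code. The following is a minimal, path-enumerating check that is only meant for small graphs; the function names and the three toy graphs are our own illustrative assumptions, not part of any cited software. The sketch also applies the standard extension of the collider rule, under which conditioning on a descendant of a collider opens the path as well.

```python
def descendants(edges, node):
    """Return every node reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def undirected_paths(edges, start, goal):
    """Yield every path from start to goal, ignoring arrow direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def path_is_blocked(edges, path, given):
    """Apply the blocking rules to each interior variable of one path."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if is_collider:
            # A collider blocks the path unless it, or one of its
            # descendants, is conditioned on.
            if mid not in given and not (descendants(edges, mid) & given):
                return True
        elif mid in given:
            # Conditioning on a mediator or common cause blocks the path.
            return True
    return False


def d_separated(edges, a, b, given=()):
    """True if every path between a and b is blocked given `given`."""
    given = set(given)
    return all(path_is_blocked(edges, p, given) for p in undirected_paths(edges, a, b))


# Three hypothetical three-variable graphs; each edge is (cause, effect).
chain = [("U", "M"), ("M", "V")]       # U -> M -> V
fork = [("M", "U"), ("M", "V")]        # U <- M -> V
collider = [("U", "M"), ("V", "M")]    # U -> M <- V

for name, graph in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, d_separated(graph, "U", "V"), d_separated(graph, "U", "V", given={"M"}))
```

In the chain and the fork, U and V are d-connected until M is conditioned on; in the collider graph the pattern reverses, and conditioning on M is what connects them.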
In the graph in Figure 1 we can see that there is only a single arrow pointing to RY
from the disturbance term εR, meaning that missingness arises only due to unobserved
factors. Further, these unobserved factors have no association with any other variable or
disturbance term in the model, as can be seen by the fact that εR is unassociated with other
parts of the model. In this graph, there is no traceable path between Y and RY (or X and
RY ) and they are said to be d-separated without having to condition on any other variables,
implying unconditional stochastic independence between the variables Y and RY (as defined
in equation 1), and therefore the missing data mechanism is MCAR. So far we have used
the expression “to condition on”; in the context of missing data problems, this relates to
observing and using a variable in a FIML or multiple imputation model.
MAR
A somewhat less restrictive condition is missing at random (MAR). MAR states that
the conditional distribution of missingness, given the observed part Yobs is equal to the
probability of missingness, given the observed and the unobserved part (Yobs, Ymis).
P (R | Y ) = P (R | Yobs, Ymis) = P (R | Yobs). (3)
These equalities of probabilities again imply (can be expressed as) conditional independence
statements, here in particular
R ⊥ Ymis | Yobs. (4)
In words, MAR states that the missingness is stochastically independent of the unobserved
variables, whereas dependencies between observed variables and missingness are allowed. In
an applied research context, we could imagine that missingness is caused by certain observed
variables that may also have an effect on important analysis variables. For example,
missingness on an achievement measure could be caused by motivation (or lack thereof).
Further, we can assume that motivation also has an effect on achievement. MAR is an
important condition, because when it holds, modern estimation techniques (MI and FIML)
yield unbiased results. Just like MCAR, MAR cannot be tested empirically, as it also posits
conditional stochastic independence assumptions among quantities that are by definition
unobserved, specifically, Ymis. Returning to the example with variable X, the unidimensional
variable Y and the respective missing indicator RY , the MAR condition (see equation 4)
implies the conditional stochastic independence RY ⊥ Y | X. In Figure 2 (a) we show the
simple situation in which MAR holds. In this figure, Y and RY are d-connected, via the
path Y ← X → RY . However, if one conditions on X, this path becomes blocked and Y and
RY are now d-separated, implying conditional stochastic independence RY ⊥ Y | X, as
similarly defined in equation 4, and therefore MAR holds, as long as one has observed X and
uses it in the estimation of Y in a FIML framework, or uses it as a predictor variable in an
MI framework. Often, researchers use variables to predict missingness that may not be of
substantive interest. Such variables are usually called “auxiliary” variables, because they are
not of theoretical interest to the applied researcher but aid in the estimation of the missing
data. In the second graphical example in Figure 2 (b), we explicitly describe an auxiliary variable
and how it can help to create conditional independence between the missingness and the
variable with missing values, thereby implying MAR. We use the same set of variables as in
Figure 1 (a), but introduce a new variable A, which in this example must be used as an
auxiliary variable for an unbiased estimate of the relationship of interest between X and Y
in the presence of missing data on Y .
In Figure 2 (b), Y and RY are d-connected, via the path Y ← A→ RY and via the
path Y ← X → RY . However, if one conditions on A, the first path becomes blocked, and if
one conditions on X, the second path becomes blocked and Y and RY are now d-separated,
implying conditional stochastic independence RY ⊥ Y | (A,X), and therefore MAR holds.
Note that A in the graph could be a multidimensional set of variables that all exhibit the
same structure.
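A small simulation can illustrate why conditioning on X matters in the simple MAR structure Y ← X → RY. The sketch below is not the simulation reported in this paper: the coefficients, the sample size, the seed, and the missingness rule are assumptions chosen only for illustration, and a complete-case regression of Y on X, averaged over all cases, serves as a simple stand-in for FIML or MI.

```python
import random
from statistics import fmean

random.seed(2024)
n = 50_000

# Assumed data-generating model mirroring Y <- X -> RY; the true mean of Y is 0.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 0.7) for xi in x]

# Missingness depends on X (plus independent noise) but not directly on Y.
missing = [xi + random.gauss(0, 1) > 0 for xi in x]

x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Listwise deletion: mean of the observed Y values only.
listwise_mean = fmean(y_obs)

# Regression of Y on X fitted on complete cases, then evaluated at the
# mean of X over all cases (X is fully observed).
mx, my = fmean(x_obs), fmean(y_obs)
slope = sum((a - mx) * (b - my) for a, b in zip(x_obs, y_obs)) / sum(
    (a - mx) ** 2 for a in x_obs
)
adjusted_mean = my + slope * (fmean(x) - mx)

print(f"listwise: {listwise_mean:.3f}  adjusted for X: {adjusted_mean:.3f}")
```

Because cases with high X are more likely to be missing, the listwise mean is pulled below zero, whereas the estimate that uses X should land close to the true mean.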
MNAR
Finally, missing not at random (MNAR) is the least stringent assumption, but also the
most problematic, as even FIML and MI will typically, though not always for all parameter
estimates, yield biased results. MNAR is characterized by the probability of missingness
being dependent on both the observed part, Yobs, and the unobserved part, Ymis. That is,
P (R | Yobs, Ymis) ≠ P (R | Yobs). (5)
No conditional independencies are implied by equation 5. In an applied research context, we
could consider different ways that MNAR could arise. One situation would be if missingness
was caused by the variable with missing data itself, e.g., participants with a very high income
are more likely to not report their income. This situation is depicted in Figure 3 (a), in which
Y and RY are directly connected by a path. Y and RY are said to be d-connected through
the direct path Y → RY. Two adjacent variables in a graph can never be
d-separated. Hence, no conditional stochastic independence can arise, and MNAR is present.
A similar MNAR situation would arise when an unobserved variable has an effect on
both the missingness RY and Y . In an applied research context, this could happen whenever
a variable that influences missingness also has an effect on analysis variables, but the
variable has not been measured and is therefore omitted. This omitted variable can be
displayed as a latent, unobserved variable in the graph, or simply as correlated disturbance
terms. Figure 3 (b) displays such a situation in which an omitted variable influences both Y
and RY . Here, Y and RY are d-connected via the path Y ← L1 → RY . This path cannot be
blocked via conditioning, because no observed variables reside in the middle of the path.
Again, no stochastic conditional independence can be achieved through conditioning and
MNAR holds. Note that the variable L1 in the graph should not be confused with a
modeled, latent variable in a SEM, but rather is a simple depiction of an unobserved
variable. To make this clear, we deviate slightly from the regular symbolic language of DAGs
and SEM graphs and use a dashed outline for the unobserved variable.
Equivalence of missing data mechanisms and graphs
In the previous section we showed how the classic missingness mechanisms can be
expressed via graphs that encode conditional independencies and applied the graph-theoretic
concept of d-separation. In summary, when a variable Y and its associated missing indicator
RY are d-connected, MNAR holds and bias will typically emerge. If Y and RY can be
d-separated using any set of other observed variables, then MAR holds, and parameters
related to Y can be estimated without bias, when using methods that rely on MAR
(FIML, MI) and using those variables that are needed to d-separate Y and RY in the
imputation or analysis model, respectively for MI and FIML. A special case arises when Y
and RY are d-separated given no other variables (unconditionally independent), which maps
on to the classic MCAR condition. As we shall see, relying on the graph-theoretic concept of
d-separation will allow us to further determine whether any given auxiliary variable is
needed to achieve d-separation of Y and RY or whether a variable would in fact make these
two variables d-connected and induce conditional dependencies. We believe that herein lies
an important advantage of using graphical models as we can easily spot auxiliary variables
that may be bias-reducing or, as we will show, bias-inducing, something that is not
apparent when relying on the classic conditional independence notation that has been used
to describe the missing data mechanisms.
Current approaches
While all assumptions of the missing mechanisms are important, insofar as they
prescribe which methods will yield biased or unbiased estimates, MAR is an assumption that
is necessary for the two missing data approaches that are considered state-of-the-art, FIML
and MI. A pertinent question is therefore how a researcher can achieve MAR or at least
make MAR plausible in his or her study. As seen in equations 3 and 4 and in the
accompanying graphs it is necessary to include all variables in the imputation or FIML
model that make Y and RY independent of each other. In other words, researchers need to
capture all variables that they believe have a direct or indirect effect on the probability of
being missing and at the same time a direct or indirect effect on the variable with missing
data. Some of these variables might already be part of the analytic model, others might not
be part of the analytic model, but might be needed to satisfy the MAR assumption, i.e.,
auxiliary variables. We now describe current approaches that aim to achieve MAR and
present an example that illustrates potential problems with these approaches.
Inclusive approach
The so-called inclusive approach (?, ?) to achieve MAR directs researchers to include
many auxiliary variables in their imputation model (or in their FIML estimation, following
guidelines by ? (?)). The reasoning behind the inclusive strategy is as follows: if many
variables are included it becomes less likely that variables that are both causes of the
missingness and the analytic variables with missing data are omitted. Such omission would
be harmful as it would destroy the conditional independence posited in MAR and induce
bias. ? (?) showed that bias in means, variances, and regression estimates can be substantial
if this kind of variable is omitted. A second rationale for adopting an inclusive strategy is
that the inclusion of variables that may not be causes of the missingness or causes of the
analytic variables with missing data, was shown to be “far from being harmful[,]...at worst
neutral, and at best extremely beneficial” (?, ?, p. 349). In particular ? (?) examined the
influence of including variables that are completely uncorrelated to missingness or analytic
variables with missing data (so-called “trash variables”), or only related to analytic variables
with missing data but not with the missingness itself. Completely uncorrelated variables did
not have any impact on bias, and variables that were only correlated with Y , were shown to
be able to attenuate bias in MNAR situations and reduce standard errors.
Data-driven approach
Even if one fully acknowledges the benefits of an inclusive strategy, such a strategy can
reach its limits, especially when applied to large-scale datasets, which may contain hundreds
of variables. If analytic models include many variables and many auxiliary variables are
added, both MI and FIML will likely encounter problems in the convergence of models. To
mitigate this problem it has been suggested to examine data for the inclusion of variables as
auxiliaries. ? (?) suggest that variables make good candidates for auxiliary variables if they
are related to the missingness or the analytic variable that exhibits missingness. The
rationale behind this advice is straightforward: a variable that is completely uncorrelated
with the probability of missing, cannot induce any dependencies between RY and Y .
Likewise, a variable that is completely uncorrelated with the analysis variable with missing
values can also not induce any dependencies between RY and Y . As a demonstration of this
principle, consider Figure 4 in which three auxiliary variables A1, A2, and A3 are added to a
model in which X d-connects Y and RY via Y ← X → RY and A1 d-connects Y and RY via
Y ← A1 → RY. The two variables A2 and A3 do not d-connect Y and RY, and conditioning
on them is therefore not needed to render Y and RY conditionally independent and hence
to fulfill the MAR condition. Simply using X and A1 is sufficient in this example. 1
The data-driven approach advises us to screen our set of potential auxiliary variables
as to whether they are related (usually examined using correlations) to any of the analysis
variables, or any of the missing value indicator variables. Variables that are related to either
or both should be included as auxiliary variables, while variables that fall below a certain
correlation threshold for both should not be used. Specific guidelines on the inclusion and
exclusion of auxiliary variables were formulated by ? (?), who recommend including a
variable if its correlation with either missingness or the variable with missing data
exceeds ± .1 (or any other chosen threshold; e.g., ? (?) suggests correlations with the
analysis variables greater than ± .4). The implicit assumption is that variables whose
correlations fall below the chosen threshold have little power to induce any
dependencies, whereas variables with correlations above it are assumed to induce biases in
the estimation of parameters in the presence of missing data.
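The screening rule just described can be sketched as follows. The function name screen_auxiliaries, the default threshold of .1, the toy data, and the variable names are illustrative assumptions; the correlation with the analysis variable is computed on the cases for which that variable is observed. Note that such a screen looks only at the size of correlations and carries no information about the causal structure that produced them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_auxiliaries(candidates, y, threshold=0.1):
    """Keep candidates correlated with missingness on y or with observed y."""
    r = [1 if value is None else 0 for value in y]          # 1 = missing
    observed = [i for i, value in enumerate(y) if value is not None]
    y_obs = [y[i] for i in observed]
    selected = []
    for name, values in candidates.items():
        r_missingness = pearson(values, r)
        r_analysis = pearson([values[i] for i in observed], y_obs)
        if abs(r_missingness) > threshold or abs(r_analysis) > threshold:
            selected.append(name)
    return selected


# Toy data: y has three missing values; both candidates are fully observed.
y = [1.0, 2.0, None, 4.0, None, 6.0, 7.0, None]
candidates = {
    "motivation": [2, 1, 5, 4, 6, 5, 8, 7],    # tracks y closely
    "unrelated": [1, -1, 1, 0, -1, -1, 1, 0],  # built to correlate with neither
}

print(screen_auxiliaries(candidates, y))
```

In this toy example only the first candidate passes the screen, because the second was constructed to be uncorrelated with both the observed values of y and the missingness indicator.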
Generally, the advice to include auxiliary variables in missing data problems is sound
and has, in both simulation studies (?, ?) and theoretical work (?, ?), been shown to be
useful. However, both the inclusive strategy and the data-driven approach ignore the
possibility that there are certain instances and classes of variables that should not be used as
auxiliary variables, because they induce bias in the estimation of parameters in the presence
of missing data, by destroying the conditional independence between Y and RY , hence
violating MAR. We now turn to these situations and variables and show, using illustrative
examples and simulations, that this bias can become potentially large, if ignored.
1 Note that if the disturbance terms of A2 and A3 were correlated (e.g., due to an unobserved variable
that has a relationship to both of these variables), an active path Y ← A2 ← εA2 ↔ εA3 → A3 → RY would
be present, which could be blocked by conditioning on A2, A3, or both. Hence at least one of these
variables would need to be included in a FIML or imputation model.
Bias-enhancing auxiliary variables
Consider first a simplified illustrative example of a single variable Y with missing data,
a missing data indicator RY , and two potential auxiliary variables A1 and A2 that are at the
disposal of the applied researcher. In addition, two unobserved variables L1 and L2 are part
of the true data-generating model. The full model is displayed in Figure 5.
An initial reaction to this model might be that the unobserved variables L1 and L2
make this an MNAR situation and that some bias would be expected and is not surprising.
However, the situation is more subtle. Variable A1 indeed induces conditional dependencies
between Y and RY via the path Y ← A1 → RY and therefore biases the estimates of Y in
the presence of missing data. If one uses A1 as an auxiliary variable, however, bias due to
A1 will be eliminated, as the biasing path is blocked. Variable A2, on the other hand, even
though spuriously correlated with Y and RY, does not induce conditional dependencies via
the path Y ← L1 → A2 ← L2 → RY and therefore cannot bias the estimates of Y no matter
what values the constituent path coefficients take on. This is because A2 is a collider
variable on this path: as long as one does not condition on it, the path remains closed and
does not induce any dependencies between Y and RY. What happens, however, when A2 is also used as an
auxiliary variable, along with A1? The inclusion of A2 will actually destroy the conditional
independence that was achieved earlier with the inclusion of A1 and induce an MNAR
situation. The path Y ← L1 → A2 ← L2 → RY that was initially blocked becomes open
when A2 is conditioned on (used as an auxiliary variable).
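The opening of the collider path can be made visible with a small simulation. The sketch below is a deliberately simplified illustration and not the simulation reported in this paper: it contains only the collider path Y ← L1 → A2 ← L2 → RY (A1 is left out), the coefficients, sample size, and seed are assumptions, and "conditioning on A2" is approximated by restricting the data to a narrow band of A2 values rather than by FIML or MI.

```python
import random
from statistics import fmean

random.seed(7)
n = 100_000


def mean_gap(rows):
    """Mean of Y among observed cases minus mean of Y among missing cases."""
    observed = [y for y, _, is_missing in rows if not is_missing]
    unobserved = [y for y, _, is_missing in rows if is_missing]
    return fmean(observed) - fmean(unobserved)


rows = []
for _ in range(n):
    l1 = random.gauss(0, 1)                   # unobserved cause of Y and A2
    l2 = random.gauss(0, 1)                   # unobserved cause of A2 and RY
    y = l1 + random.gauss(0, 0.5)
    a2 = l1 + l2 + random.gauss(0, 0.5)       # collider: L1 -> A2 <- L2
    is_missing = l2 + random.gauss(0, 1) > 0  # missingness depends on L2 only
    rows.append((y, a2, is_missing))

# Without conditioning on A2, Y and RY are independent.
gap_marginal = mean_gap(rows)

# Within a narrow stratum of A2, Y and RY become dependent.
gap_given_a2 = mean_gap([row for row in rows if abs(row[1]) < 0.25])

print(f"gap overall: {gap_marginal:.3f}  gap within A2 stratum: {gap_given_a2:.3f}")
```

In the full simulated data, which is available only because the values were generated, cases with and without missing Y have essentially the same mean of Y. Within the A2 stratum the two groups differ clearly, which is the conditional dependence that d-separation predicts once the collider is conditioned on.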
To illustrate this point further using data, we simulated a single dataset based on the
model in Figure 5. The data generation is fully described in the first simulation study below.
Briefly described, we chose a large sample size of n = 1000. All continuous variables were
multivariate normally distributed with mean of 0 and variance of 1. Path coefficients in the
model were completely standardized and the size of the path coefficients was chosen so that
the total R² (or the respective McKelvey-Zavoina pseudo-R² (?, ?)) of every single
dependent variable in the model (Y, A2, RY) was equal to 50%. We chose the sign of the
path coefficients so that the bias due to the omission of A1 and the bias due to
the inclusion of A2 were in the same direction and did not incidentally offset each other. The
amount of missing data was set to 50%. We estimated the mean and standard deviation of
the variable Y using a listwise deletion approach, FIML estimation in Mplus (?, ?) and
lavaan (?, ?) using only A1 as the auxiliary variable, using only A2 as the auxiliary variable,
or using both A1 and A2 as auxiliary variables. Auxiliary variables in the FIML estimation
were included using the Mplus auxiliary command, which automatically fits a model
suggested by ? (?). We also used mice (?, ?) to generate 5 multiple imputations whose
results were pooled following standard recommendations (?, ?). As expected, and previously
reported by ? (?), results of FIML and MI did not differ substantially when the same set of
auxiliary variables was used. We only report results of the FIML estimation in Table 1.
In the single simulated dataset the completely observed data of Y had a mean of .03, and a
standard deviation (SD) of 1.00. When using listwise deletion, the mean of Y was .19, and
the SD was .98. Not surprisingly, we observed bias in the means, as would be expected under
an MAR situation in which missingness was induced through a linear function of other
variables. Using A1 as an auxiliary variable and estimating the mean of Y with FIML
estimation yielded a mean of .06. Using A1 does a very good job of reducing bias. The
relative percent reduction of bias compared to the listwise model was 100 × (.19 − .06)/.19 ≈ 68%.
Using A2 as an auxiliary variable, on the other hand, actually increases bias! The estimated
mean of Y was now .30, with a resulting percent bias amplification of 100 × |.19 − .30|/.19 ≈ 58%
compared to the listwise results. Finally, when using both A1 and A2 as auxiliary variables,
the mean of Y was estimated to be .14, resulting in a bias reduction of a mere
100 × |.14 − .19|/.19 ≈ 26%. We observed that using both variables as auxiliary variables was worse
than using A1 alone. This result may not be obvious when considering the formulas for
MAR or MNAR, and in fact it goes counter to the advice that an auxiliary variable can be
at worst neutral. Clearly, this auxiliary variable was not neutral, but highly bias-inducing.
When one uses a graph to encode the structural relationships between the auxiliary variables
and missingness and analysis variables, respectively, this result however is expected and can
be directly seen by the fact that conditioning on A2 d-connects Y and RY by opening a
previously blocked path.
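The percentages reported for the single simulated dataset can be retraced with a short calculation. The helper name percent_bias_change is our own; the sketch assumes, as the reported figures imply, that each estimated mean is compared against the listwise estimate of .19 and that the population mean of Y is 0, so that the estimate itself is treated as the bias.

```python
def percent_bias_change(listwise_estimate, auxiliary_estimate):
    """Relative change versus listwise deletion, in percent.

    Positive values indicate a reduction in bias, negative values an
    amplification. Assumes the true parameter value is 0.
    """
    return 100 * (listwise_estimate - auxiliary_estimate) / listwise_estimate


listwise = 0.19
estimates = {"A1 only": 0.06, "A2 only": 0.30, "A1 and A2": 0.14}

for label, estimate in estimates.items():
    print(f"{label}: {percent_bias_change(listwise, estimate):.0f}%")
```

The sign distinguishes the cases: using A1 alone and using both variables reduce the bias relative to listwise deletion (by about 68% and 26%), whereas using A2 alone yields a negative value of about −58%, that is, an amplification.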
A single simulated dataset is seldom a convincing argument; however, it can serve as a
point of departure for a more developed argument. First, it shows that an auxiliary variable
can increase bias in the estimation of parameters in the presence of missing data. Second, a
bias-inducing variable cannot be distinguished from a helpful auxiliary variable by examining
correlations with analysis variables and missingness indicators. In fact, in this example, the
variable A2 posed as a perfectly innocent and potentially very helpful auxiliary variable. In
the complete dataset A2 was both significantly correlated with the analysis variable Y
(r = .26, p < .001) and the missing data indicator RY (point-biserial correlation
rpb = .25, p < .001). Using inclusion criteria that rely solely on correlations would incorrectly
lead to the inclusion of A2 in the set of auxiliary variables.
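The point that correlations alone cannot flag a bias-inducing variable can be checked with a small simulation. The sketch below is not the data-generating code used for the illustrative dataset: the path values, the threshold rule for missingness, and the use of regression-based mean estimation as a stand-in for FIML are all assumptions made for illustration.

```python
import numpy as np

def regression_mean(y, predictors, observed):
    """Fit OLS on the observed cases, then average predictions over all cases."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    return float((X @ beta).mean())

def collider_demo(n=200_000, seed=2024):
    rng = np.random.default_rng(seed)
    L1, L2, A1 = rng.standard_normal((3, n))   # L1, L2 unobserved; A1 observed
    noise = rng.standard_normal((3, n))
    # Hypothetical path values; residual SDs keep every variance at 1.
    Y = 0.5 * A1 + 0.6 * L1 + np.sqrt(0.39) * noise[0]
    A2 = 0.65 * L1 - 0.65 * L2 + np.sqrt(0.155) * noise[1]   # collider of L1, L2
    # Assumed response propensity: the 30% lowest values lose their Y score.
    propensity = 0.5 * A1 + 0.6 * L2 + np.sqrt(0.39) * noise[2]
    observed = propensity > np.quantile(propensity, 0.30)
    missing = (~observed).astype(float)
    return {
        "corr(A2, Y)": float(np.corrcoef(A2, Y)[0, 1]),
        "corr(A2, missing)": float(np.corrcoef(A2, missing)[0, 1]),
        "listwise": float(Y[observed].mean()),
        "A1": regression_mean(Y, [A1], observed),
        "A2": regression_mean(Y, [A2], observed),
        "A1 + A2": regression_mean(Y, [A1, A2], observed),
    }

# The true mean of Y is 0, so each estimate below is also its bias.
results = collider_demo()
for name, value in results.items():
    print(f"{name:>18}: {value: .3f}")
```

Under these assumed values, A2 correlates with both Y and the missingness indicator, so a correlation-based screen would retain it, yet only the model that uses A1 alone should recover a mean close to zero.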
In addition, a simple example like this one helps to link what could simply be a
mathematical curiosity to an applied context. To make this illustrative example more
concrete, consider that Y , the variable with missing data, is a measure of mathematical
ability with a missingness indicator RY . For this example, we assume that MAR holds and
that there is no direct path from Y to RY . Variable A1 is a measure of motivation of the
participant that has been observed and is used in the analysis as a potential auxiliary
variable. Specifically, more motivated participants score higher on the math achievement
test, and are less likely to have missing data. Consider further that A2 is the income of the
participant, another variable that was assessed as part of the study. The two unobserved
variables L1 and L2 are IQ and gender of the participant, respectively. Note that we are
assuming in this model that IQ and gender are in fact uncorrelated (which seems like a
tenable assumption). The model further expresses that participants with higher IQ scores
also score higher on math achievement, and that participant’s gender has an influence on
missingness (maybe one gender group was more likely to skip certain items). While this
example is admittedly somewhat artificial due to its constrained nature, we believe that it is
not entirely implausible and suggests that auxiliary variables of the same type as A2 in our
example could in fact be lurking among seemingly benign potential auxiliary variables.
Henceforth, we will refer to these variables as collider auxiliary variables.
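The graphical argument for this example can be verified mechanically. The following sketch implements the standard moralized ancestral graph test for d-separation in plain Python. The edge list is our reading of the verbal description (motivation A1 affects Y and RY, IQ L1 affects Y and income A2, gender L2 affects RY and income A2) and should be treated as an assumption, as should all function names.

```python
from itertools import combinations

# Assumed edges (parent, child) for the illustrative example.
EDGES = [("A1", "Y"), ("A1", "RY"),
         ("L1", "Y"), ("L1", "A2"),
         ("L2", "RY"), ("L2", "A2")]

def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def d_separated(edges, x, y, given):
    """True if x and y are d-separated by the set `given` in the DAG."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    keep = ancestral_set(parents, {x, y} | set(given))
    # Moralize: link each child to its parents and marry co-parents.
    adjacent = {node: set() for node in keep}
    for child in keep:
        pars = sorted(parents.get(child, ()))
        for p in pars:
            adjacent[p].add(child)
            adjacent[child].add(p)
        for p, q in combinations(pars, 2):
            adjacent[p].add(q)
            adjacent[q].add(p)
    # Search for a path from x to y that avoids the conditioning set.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nb in adjacent[node]:
            if nb not in seen and nb not in given:
                seen.add(nb)
                stack.append(nb)
    return True

for aux in [set(), {"A1"}, {"A2"}, {"A1", "A2"}]:
    print(sorted(aux), d_separated(EDGES, "Y", "RY", aux))
```

With this edge list, only the set containing A1 alone d-separates Y and RY; adding A2 reconnects them through the married parents L1 and L2.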
Research questions
Having established in a single example that auxiliary variables can induce bias, we set
out to answer several research questions.
1. First, we are interested in the absolute magnitude of bias that can be induced when
using collider auxiliary variables as a function of the magnitude of the constituent paths that
connect a collider auxiliary variable to missingness and analysis variables. In addition, we
want to put this magnitude into context and contrast it with bias that is induced due to the
omission of a helpful auxiliary variable. This latter form of bias has been examined before
and we only include it to provide a benchmark for the bias that we expect to observe with
the inclusion of a collider auxiliary variable. Earlier research by ? (?) in the area of
confounding in causal inference suggests that the magnitude of bias due to conditioning on a
collider, especially of the kind that we presented in our example, is usually smaller than
omitting a confounder. We therefore suspect that bias due to including a collider variable as
an auxiliary variable will be noticeable, but smaller in magnitude than omitting a true
confounding auxiliary variable (i.e., a variable that is directly or indirectly causing both
missingness and analysis variables with missing data).
2. The second research question examines behavior of auxiliary variables in data
situations that are inherently MNAR. In the MAR cases considered in the first simulation
study, the conditional independence between missingness and analysis variables with missing
data can always be created by using some observed variables. Hence there is an expectation
that including the collider auxiliary variable will necessarily increase bias by disturbing the
conditional independence. In the MNAR case collider auxiliary variables are expected to
behave differently, insofar as the relationship that they induce between the missingness and
variables with missing data can either enhance or reduce the already existing relationship
between missingness and analysis variables with missing data. In a similar fashion, we will
also explore the behavior of auxiliary variables that are directly related to both missingness
and analysis variables.
Simulation studies
Simulation study 1.1
Our first simulation study explores the absolute magnitude of bias that can be induced
when using a collider auxiliary variable in a MAR situation. The simulation study roughly
followed ? (?), in terms of data-generation and evaluation criteria. Generally speaking, data
are first generated under a specific model, then missing data are imposed based on a
described mechanism, then parameters are estimated using listwise deletion and FIML with
auxiliary variables. Lastly, results of replications are pooled within condition and
performance criteria assessed. While it is possible to examine bias in many different
parameters of interest (means, variances, skew, regression coefficients, factor loadings, etc.),
we only focus on estimates of the population mean. The reason behind this choice was that
mean responses (potentially across different groups) are still one of the most widely used
measures to describe research phenomena in the social sciences. The examination of
regression coefficients is left to future studies and is briefly mentioned in the discussion.
Data generation and analysis. The data-generating model for simulation 1.1 is
shown in Figure 6. In the model, a single independent variable Y is generated with missing
data, indicated by RY. Auxiliary variable A1 is spuriously correlated with the probability of
missingness and the outcome Y, via two unobserved, uncorrelated variables L1 and L2. In the
model Y and RY are d-separated but become d-connected as soon as A1, the collider
auxiliary variable, is used in MI or FIML. All continuous variables were multivariate
normally distributed and completely standardized by fixing the total variance of each
variable to 1 and setting means to 0. We did not vary sample size, but chose a single
constant sample of 500. This single sample size was also chosen by other authors in similar
simulations (?, ?, ?), as a somewhat large, but still reasonable sample size to consider.
Furthermore, changes in sample size usually yield predictable results when other factors are
held constant, namely that standard errors decrease with increased sample size. We also did
not vary the amount of missing data, but fixed it at a relatively high value of 30%, which
was in-between the two values chosen by ? (?). Varying the amount of missing data is often
not very interesting as results of such variation have previously been shown to yield expected
results (bias gets worse as missing data increases). All path coefficients in the
data-generating model, labeled α, were chosen so that the uniquely explained variance in the
outcome variable that these paths were connected to was set to a particular value. Path
coefficients were set at 0, .224, .387, .500, .592, and .671. This corresponds to uniquely
explained variance of 0%, 5%, 15%, 25%, 35%, and 45%, respectively. See the Appendix for
details on how missingness was generated and how explained variance in RY was defined.
Finally, we varied the sign of the coefficient labeled α? (positive or negative). This sign
change of a single path of the constituent paths of the collider auxiliary variable does not
alter the magnitude of the bias that is induced, but alters the direction. Note that it is not
of importance which of the four paths α is varied in sign, because the direction of bias is
determined by the product of all four constituent paths (?, ?). Finally note that conditions
in which all paths were set to 0 correspond to a pure MCAR condition. In this simulation
design we varied all paths labeled α simultaneously. Our primary interest was to observe
overall bias and not bias due to differential changes in constituent paths. This simulation
design thus yielded 5 conditions with a positive sign, 5 conditions with a negative sign, and
one condition in which all paths were set to 0, for a total of 11 conditions. We replicated
each condition 1000 times. All simulations were conducted using R (?, ?) and the following
packages: lavaan (?, ?), MASS (?, ?), mice (?, ?), MplusAutomation (?, ?), and plyr (?, ?).
For the generation of graphs we used ggplot2 (?, ?) and tikzdevice (?, ?).
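Because all variables are standardized, each path coefficient is the square root of the variance it uniquely explains. The short sketch below reproduces the stated coefficients from the explained-variance levels and enumerates the 11 conditions; the variable names are ours and purely illustrative.

```python
import math

# Uniquely explained variance per path in the non-zero conditions.
explained_variance = [0.05, 0.15, 0.25, 0.35, 0.45]

# With standardized variables, coefficient = sqrt(explained variance).
path_coefficients = [round(math.sqrt(r2), 3) for r2 in explained_variance]

# One all-zero (MCAR) condition plus five levels for each sign of the
# single path whose sign is varied.
conditions = [{"sign": 0, "alpha": 0.0}]
for sign in (+1, -1):
    for alpha in path_coefficients:
        conditions.append({"sign": sign, "alpha": alpha})

print(path_coefficients)
print(len(conditions))
```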
Performance measures. In order to analyze the results of our simulation study, we
assess a range of standard criteria commonly employed in simulation studies.
1. We assessed standardized bias in the estimates (mean, variance) of variables with
missing data, defined identically to ? (?) as raw bias (average parameter estimate across
replications minus true parameter value) divided by the standard error, defined as the
standard deviation across all replication estimates. ? (?) gives a rule of thumb that absolute
values of .4 or higher are worrisome on the standardized bias metric.
2. We recorded the precision of the estimates defined as the average standard error
across all replications. In general it is desirable to have estimates with smaller standard
errors, and hence narrower confidence intervals and more precise estimates.
3. We computed the root mean squared error (RMSE) defined as the square root of the
average squared difference between a parameter estimate and the true value of the parameter.
4. Lastly, we observed coverage rates, defined as the percentage of replications whose
95% confidence interval included the true parameter value. Ideally, one observes 95%
coverage rates, as this would indicate that the confidence intervals of the estimator are in the
long run accurately capturing the true parameter and have the nominal α error rate. Again,
relying on rules of thumb by ? (?), we regard coverage rates below 90% as worrisome.
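The four criteria above can be computed from the per-replication estimates and standard errors as in the following sketch. It is a minimal illustration, not the code used in the study: the function name, the use of the sample standard deviation (ddof = 1), and the normal-theory interval of ±1.96 standard errors are assumptions.

```python
import numpy as np

def performance(estimates, standard_errors, true_value, z=1.96):
    """Summarize one simulation condition across its replications."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    raw_bias = est.mean() - true_value
    lower, upper = est - z * se, est + z * se
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        # Raw bias divided by the SD of the estimates across replications.
        "standardized_bias": raw_bias / est.std(ddof=1),
        # Precision: average standard error across replications.
        "mean_se": se.mean(),
        # Root mean squared error around the true value.
        "rmse": np.sqrt(np.mean((est - true_value) ** 2)),
        # Percentage of intervals that contain the true value.
        "coverage_pct": 100 * covered.mean(),
    }

# Toy input with four replications, for illustration only.
demo = performance([0.1, -0.1, 0.3, 0.1], [0.1, 0.1, 0.1, 0.1], true_value=0.0)
print(demo)
```

Under the rules of thumb cited above, a condition would be flagged when the absolute standardized bias reaches .4 or coverage falls below 90%.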
Results of simulation study 1.1. The complete results are shown in Table B1 in
the Appendix. In order to communicate the most important findings, we display the
amount of standardized bias in the means in Figure 7, and coverage values in Figure 8. Both
figures show that the listwise model is unbiased and has perfect coverage across all
conditions. The inclusion of A1 as an auxiliary variable in the FIML estimation induced bias
in the mean, as would be expected based on missingness patterns that are imposed in a
linear fashion. Bias emerges in all conditions that used FIML, except the one in which all
paths labeled α are set to 0 (the MCAR condition). Note that this is true even though
variable A1 is related to both Y and RY and would be included as an auxiliary variable
under all current recommendations to achieve MAR. The general pattern as seen in Figures 7
and 8 is that increases in the amount of explained variance yield monotonic increases in bias.
Little to no bias is observed in conditions with weak path coefficients, and stronger biases are
observed in more extreme conditions. The standardized bias (and other performance
measures) reach a critical threshold, based on the rules of thumb by (?, ?), when path
coefficients are so strong that they explain slightly less than 25% of the variance. Bias in
conditions with even stronger effects is so large that confidence intervals approach 40%
coverage. Also, not surprisingly, the direction of bias changes when the coefficient
α? changes its sign. In conditions in which the sign is negative, positive bias is induced due
to the inclusion of the collider auxiliary variable, and negative bias is induced when the path
coefficient has a positive sign, respectively. The results of this simulation clearly show that
an auxiliary variable, even though it exhibits strong correlations with missingness and
analysis variables, can increase bias. This somewhat surprising result is evident from the
graphical model, in which we can see that A1 is a collider auxiliary variable which will
induce a bias in the path from Y to RY .
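The sign pattern described above can be reproduced in a compact Monte Carlo sketch of the collider design. This is not the simulation code of the study: the missingness rule (the 30% lowest values of an assumed continuous response propensity are set to missing) and the regression-based estimate of the mean used in place of FIML are assumptions, and all names are illustrative.

```python
import numpy as np

def one_replication(rng, alpha, sign, n=500, prop_missing=0.30):
    """Return (listwise mean, mean using the collider A1) for one dataset."""
    L1, L2 = rng.standard_normal((2, n))            # unobserved, uncorrelated
    e = rng.standard_normal((3, n))
    resid_one = np.sqrt(1 - alpha**2)               # variable with one incoming path
    resid_two = np.sqrt(1 - 2 * alpha**2)           # collider with two incoming paths
    Y = alpha * L1 + resid_one * e[0]
    A1 = alpha * L1 + sign * alpha * L2 + resid_two * e[1]
    propensity = alpha * L2 + resid_one * e[2]      # assumed response propensity
    observed = propensity > np.quantile(propensity, prop_missing)
    listwise = Y[observed].mean()
    # Regression stand-in for FIML: fit on observed cases, predict for all.
    X = np.column_stack([np.ones(n), A1])
    beta, *_ = np.linalg.lstsq(X[observed], Y[observed], rcond=None)
    return listwise, (X @ beta).mean()

def standardized_bias(alpha, sign, reps=400, seed=7):
    """Raw bias over the SD of estimates; the true mean of Y is 0."""
    rng = np.random.default_rng(seed)
    est = np.array([one_replication(rng, alpha, sign) for _ in range(reps)])
    return est.mean(axis=0) / est.std(axis=0, ddof=1)

# alpha = .5 corresponds to 25% uniquely explained variance per path.
bias_by_sign = {sign: standardized_bias(0.5, sign) for sign in (+1, -1)}
for sign, (listwise, with_collider) in bias_by_sign.items():
    print(f"sign {sign:+d}: listwise {listwise: .2f}, with A1 {with_collider: .2f}")
```

Under these assumptions, the listwise estimate should hover around zero, while the estimate that uses A1 should be biased downward for a positive sign and upward for a negative sign.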
Simulation study 1.2
To put the results of the first simulation study into a broader context, we performed a
second simulation study that was essentially a replication of earlier findings that an omitted
variable that has an effect on both missingness and analysis variables with missing data can
bias estimates. While this simulation study by itself does not give us any new insights, we
performed this study to address the second part of our first research question, aimed at exploring whether the
magnitude of bias due to inclusion of a bias-inducing collider auxiliary variable is similar in
strength to that due to omission of a potentially more helpful auxiliary variable. We replicated the first
simulation study using the exact same values of explained variance in our data-generating
model, but changed the role of the collider auxiliary variable to an auxiliary variable that
has direct influences on both missingness and analysis variables.
Data generation and analysis. The data-generating model for simulation 1.2 is
shown in Figure 9. In this model, a single independent variable Y is generated with missing
data, indicated by RY . This time, an auxiliary variable A2 is directly affecting both Y and
RY, thus d-connecting the two variables. The graphical criterion therefore tells us that A2 is
a bias-reducing variable that should be used in the FIML estimation. The generation of all
variables was identical to simulation study 1.1. The unique explained variance of each effect
labeled β was also identical to the previous simulation and set to 0%, 5%, 15%, 25%, 35%, and
45%. Again, we varied the sign of the path labeled β?, for a total of 11 simulation conditions.
Results of simulation study 1.2. Table B2 in the Appendix lists the complete
results of the second simulation study. To visualize our main findings we present
standardized bias in the means and coverage rates of means in Figure 10 and Figure 11 for
all conditions. In this simulation we observe a slightly different pattern than in the
previous simulation. Not surprisingly, and as shown previously by other researchers, the listwise
model is biased in the parameter estimates of the means, and in the more extreme cases even
in the variance of Y (not shown in the figures, but reported in Table B2). The FIML model that included A2
is virtually unbiased in all conditions and has perfect coverage, because the true
data-generating mechanism of the missingness is captured. Several important observations
can be made. First, the bias that is induced through the omission of a helpful auxiliary
variable is larger in magnitude in comparison with the inclusion of a bias-inducing collider
auxiliary variable. This can also be observed when examining coverage rates that drop much
more dramatically than in the case of an included collider auxiliary variable. For example, in
the condition with 25% explained variance, the standardized bias in the previous simulation
was .61, whereas in this simulation with an omitted and helpful auxiliary variable, the bias is
2.45. A second observation is that the direction of bias is flipped compared to the results of
the previous study. A negative sign of the path coefficient labeled β? yielded negative
bias, and likewise a positive path coefficient yielded positive bias.
Intermediate summary of results of simulation study 1
We have shown that in cases that are not MNAR, bias can be induced through the
inclusion of auxiliary variables in a FIML estimation framework. The fact that an auxiliary
variable can actually make bias worse in parameter estimates in the presence of missing data
is a novel point that is not addressed by the currently practiced approaches of including
auxiliary variables. It also provides a counter-argument to a claim that is sometimes brought forth in
defense of including many variables, namely that as soon as the explained variance in the
missingness or the outcome variable gets very large, there is no more room for any potential
biasing influences. This is clearly wrong, as our simulation examined cases in which
explained variance through the inclusion of a collider auxiliary variable was very large and
yet bias increased.
In our simulation studies this bias seemed to become problematic (as assessed through
rules of thumb for standardized bias and coverage) as soon as the explained variance of the
unobserved variables associated with the collider auxiliary variable crossed a threshold of
slightly less than 25%. On a correlation metric we therefore would have to observe
correlations in the magnitude of approximately .4 to .5. While this may seem very high, it is
important to remember that in our simulation studies there was only a single collider
auxiliary variable with only 2 unobserved variables, while in reality there could be a
multitude of both colliders and unobserved variables, especially if one is considering
psychological constructs that are often multiply caused. Those taken together might be able
to explain more variance and potentially make the inclusion of collider auxiliary variables
more problematic. However, the second simulation study also demonstrated that the bias
that is observed due to the inclusion of a collider auxiliary variable is much smaller than the
bias observed due to the omission of an auxiliary variable that has directional effects on both
missingness and analysis variables with missing data. In our simulation setup we observed
troublesome levels of bias, as soon as the omitted auxiliary variable explained slightly less
than 15% of the variance in the related variables, which translates to correlations of
approximately .3 to .4.
These intermediate results should not give the impression that listwise deletion is
generally preferable to MI or FIML models with auxiliary variables, as may erroneously be
believed based on the result of the first simulation study. Rather, they show that inclusion of
auxiliary variables does not always mitigate bias, but can enhance it, and that researchers
should take care to pick good auxiliary variables. We discuss some strategies later in the
discussion.
Simulation study 2.1
In our second set of simulation studies we explore how collider and other auxiliary
variables behave in the presence of data that are inherently MNAR. Simulations that assume
that data are MNAR are probably more realistic, because in real applications some degree of
MNAR, even though it might be small, is often likely.
Data generation and analysis. Our data-generating model for the first simulation
study in the second set, depicted in Figure 12, had the identical sample size, and number of
replications as simulation study 1.1. The notable difference was that data were simulated
under an MNAR scheme (indicated through the direct paths labeled γ from an unobserved
variable U1 to both Y and RY ). The direct effects γ were always positive in sign and held
constant at 20% explained variance, thus indicating a moderately strong degree of MNAR
missingness, in comparison with the range of explained variance for the auxiliary variables.
The strength of the paths labeled α was varied over the same levels as in simulation 1.1,
including the changing of signs of the path labeled α?. The total number of conditions in the
simulation was again 11.
Results of simulation study 2.1. The complete results of simulation study 2.1 are
shown in the Appendix in Table B3. We report the most important findings on the
standardized bias of the means and coverage of means in Figure 13 and Figure 14. The
listwise model displayed a relatively constant, and high amount of bias across all conditions
with no particular relationship to the strength or sign of α. The pure MNAR bias due to the
unobserved variable U1 is at around 2 on the standardized bias metric and corresponds to
coverage levels of approximately 50%. Bias was mostly induced in the estimate of the mean,
as was expected under a linear MNAR situation.
The FIML model that included the auxiliary collider variable showed a more
interesting pattern. The overall shape of results for the standardized bias in Figure 13 looked
almost identical to previous results, as if only shifted along the y-axis. However, there is an
important difference that becomes obvious in Figure 14, which displays coverage rates. Because
there is a constant MNAR bias, the inclusion of variable A1 in the FIML model now either
reduces or increases bias. In particular, if the sign of α? (and therefore the product of the
constituent paths) was positive, bias in the means was attenuated due to the inclusion of A1.
On the other hand, if α? was negative in sign, the inclusion of A1 as an auxiliary variable
increased bias of parameter estimates, making it even worse than the listwise model. The
figure that displays coverage rates makes this differential effect easily visible. While
coverage stays relatively constant for the listwise model, it now increases monotonically for
the FIML model in all conditions with a positive sign, and decreases monotonically in all
conditions with a negative sign.
Bias in the estimates of the standard deviation was also observed, but much smaller in
absolute magnitude than the bias observed in means. For standard deviations, we observed
that bias in the listwise models was constant across all conditions and in the magnitude of .3
on the standardized bias metric, with corresponding coverage at around 90%. With weak
constituent paths, the bias that was observed in the FIML model was identical to the
listwise model. With very strong constituent paths, the FIML model eliminated the small
bias in standard deviations and recovered the true parameter values.
Simulation study 2.2
In simulation study 2.2 we examined the performance of inclusion and exclusion of an
auxiliary variable that has direct effects on both missingness and analysis variables, when
the missing mechanism is MNAR.
Data generation and analysis. The data-generating model is shown in Figure 15
and follows the same general pattern as simulation study 1.2 with the difference that the
direct paths γ were added to induce an MNAR situation. The amount of explained variance
for paths β and γ was identical to that in simulation study 1.2. Therefore, while being
structurally different, this simulation mimicked previous simulations in regards to strength of
pure MNAR bias and explained variance of auxiliary variables.
Results of simulation study 2.2. The results of simulation study 2.2 are given in
full in Table B4 in the Appendix. We again present the main results in Figure 16 and
Figure 17, displaying standardized bias of the estimate of the mean and coverage rates,
respectively. We observe that the listwise model showed a pattern that consisted of regions
of extreme positive bias in the means, no bias, and some negative bias, depending on the sign
and magnitude of the path coefficients of the auxiliary variable A2. In conditions in which
the explained variance of the auxiliary variable was 0%, standardized bias was around 1.95,
which is similar to the amount of bias that was observed in the first MNAR simulation.
When the strength of the relationship of A2 to missingness and outcomes was increased, the
amount of bias changed, although again dependent on the sign of the coefficient. If the sign
of the coefficient was positive (and therefore the product of constituent paths was positive)
bias increased to very high levels (standardized bias larger than 6 in the most extreme
conditions). If on the other hand the sign of the path coefficient of the auxiliary variable was
negative, bias decreased, going towards zero, and then increasing again but in the opposite
direction. This pattern is also visualized in the figure that presents coverage rates. We
observe that the listwise model that excluded A2 had coverage of around 50% in the
condition in which path coefficients were set to 0. Increases in the magnitude of path
coefficients with a positive sign deteriorated coverage quickly, all the way to 0%.
Increasing path coefficients with a negative sign first decreased bias, and
coverage rates approached the unbiased ideal of 95% at about 25% of explained variance.
After this region where the bias due to the unobserved variable U1 and the omitted auxiliary
variable A2 canceled each other out, the bias from omitting A2 dominated and bias in the
opposite direction was observed and coverage levels dropped again.
The FIML model showed a somewhat stable amount of bias and coverage, however
with the interesting observation that bias increased in more extreme regions of explained
variance. This is visualized in Figure 16, in which the line for bias under FIML slopes
slightly upward at both ends. In Figure 17 we see this pattern even more clearly, as coverage rates
drop from 50% at the center of the graph to around 30% at the extreme regions. This
phenomenon of residual bias-amplification has been described previously in the context of
instrumental variable models (?, ?, ?, ?) and it is clearly visible here in the context of
missing data problems as well. What has been shown in the context of instrumental
variables is that any bias of a relationship between two variables (in our case Y and RY ) is
amplified as soon as variables are introduced that explain some of the variance in the
explanatory variable of the two-variable relationship. ? (?) showed that bias amplification is
equal to a factor of 1/(1 − R²), where R² is the explained variance of Y in our case. In our
example, the inclusion of A2 explains variance in Y and therefore any bias that is due to U1
gets amplified monotonically, as the explained variance in Y due to A2 increases. Note that
in simulation 2.1 this phenomenon was also observed, but so attenuated as to be virtually
unnoticeable. This is due to the fact that the explained variance in Y due to the inclusion of
the collider auxiliary variable A1 is much smaller, because the explained variance of A1 in Y
is itself only based on the induced relationship between A1 and Y through L1.
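To give a sense of scale, the amplification factor 1/(1 − R²) can be tabulated for the explained-variance levels used in the simulations. The sketch below takes the formula at face value and assumes that R² equals the variance in Y explained by the auxiliary variable; it does not reproduce the simulated bias values themselves.

```python
def amplification_factor(r_squared):
    """Bias amplification factor 1 / (1 - R^2) for explained variance R^2."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# Explained-variance levels taken from the simulation design.
levels = (0.0, 0.05, 0.15, 0.25, 0.35, 0.45)
factors = {r2: amplification_factor(r2) for r2 in levels}
for r2, factor in factors.items():
    print(f"R^2 = {r2:.2f} -> factor {factor:.3f}")
```

The factor grows slowly at first and more steeply at higher values of R², which is in line with the upward slope of the FIML bias at both ends of the design.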
Finally, we also observed biases in standard deviations of Y that were meaningfully
larger than in previous simulations. These results are not central to our work, but are
described in more detail in the Appendix.
Intermediate results summary of simulation study 2
Simulation study 2 provided evidence that auxiliary variables in the presence of MNAR
can reduce or increase bias, depending on the sign of the constituent paths of the auxiliary
variable. Given certain constellations of relationships and signs of coefficients, it is beneficial
to exclude auxiliary variables to reduce bias, while in others it is highly beneficial to include them. In
particular, if an MNAR situation exists that is believed to induce a positive relationship
between missingness and variables with missing data (i.e., participants with high values on a
variable are also more likely to be missing on this variable), it is beneficial to include an
auxiliary collider variable if the product of the constituent paths is also positive (as this
induces negative and therefore offsetting bias). However, this is only true, as long as the
induced bias is not so strong that it becomes larger than the original bias that is due to
unobserved variables. Bias is increased with the inclusion of a collider auxiliary variable if
the sign of the product of constituent paths is opposite to the sign of the induced
relationship due to unobserved variables. In these cases, it is better to exclude this auxiliary
collider variable.
A similar pattern emerged in the presence of MNAR and an auxiliary variable that has
direct effects on missingness and outcomes. If the existing MNAR bias due to unobserved
variables is believed to be positive, it is beneficial to include an auxiliary variable, if the
product of the constituent paths is also positive. Exclusion of such a variable will always
compound existing bias. If the sign of the product of the auxiliary variable is in the opposite
direction than the bias due to unobserved variables it can be beneficial to exclude this
variable, namely whenever one can assume that the biases cancel each other out. However, it
can also happen that bias due to omission is so strong in the opposite direction that it
becomes larger than the bias due to the unobserved variables.
The results of this last set of simulations add an important piece of information,
namely that bias can be increased in the presence of MNAR, even if an auxiliary variable is
added that is directly related to missingness and outcomes. The special collider structure
that we discussed as bias-inducing in our first set of simulations is not even necessary to
induce biases due to inclusion of auxiliary variables. As soon as MNAR is present, the
bias-reducing or increasing properties of auxiliary variables are dependent on the sign of the
constituent paths of auxiliary variables, which makes it exceedingly hard for an applied
researcher to know exactly whether any given variable may help or hurt in the mitigation of
missing data bias. As we explain in more detail in the discussion section, matters become
even more complicated when we allow correlations among observed and unobserved variables.
Finally, the results of the second study should not be misinterpreted to mean that a positive
sign of the product of constituent paths is inherently better than a negative sign, or vice
versa. The reason why in some conditions the positive sign reduced bias was only due to the
fact that the paths labeled γ were set to a positive value; we could easily rerun all
simulations, change the sign of γ, and observe a reversal of the sign of standardized biases.
Again, the results should under no circumstances be misinterpreted to mean that a listwise deletion
approach is inherently superior to FIML or MI. In fact, in many applied circumstances, an
applied researcher might have good reasons to believe that the auxiliary variables at hand
exhibit direct effects on missingness and variables with missing data, that MAR holds, and
that therefore variables should be included. However, as we have shown, there are sets of
plausible situations in which it is indeed better to not include an auxiliary variable, contrary
to common suggestions.
Discussion
The overarching picture that emerged from our study is that the effects of auxiliary
variables cannot be easily described in a single statement, and even less so in a simple and
universally applicable rule or recommendation of inclusion or exclusion of auxiliary variables.
We have demonstrated through several examples that auxiliary variables can increase biases
both in the presence of MAR or MNAR, in some conditions substantially so. Specifically,
when MAR is believed to hold, auxiliary variables that have a collider structure increase bias.
Under MNAR, any variable can theoretically increase or decrease bias, depending on
the sign and magnitude of the effects of both observed and unobserved variables that are related to
missingness and variables with missing data. What, however, does this imply for an applied
researcher who is faced with a missing data problem?
Recommendations
The orthodox recommendation for the selection of auxiliary variables in MI or FIML is
to either take all available covariates or select them based on their observed correlation with
the missingness or the outcome variables. Our study has shown that neither approach
guarantees that only bias-reducing variables are selected. Further, neither approach
guarantees that following the recommendation ultimately leads to the best possible estimate
of parameters in the presence of missing data. Using all available variables as auxiliary
variables may include bias-inducing variables, and relying on correlational evidence is not
sufficient to distinguish between bias-inducing and bias-reducing variables.
In theory one could always identify the best possible set of auxiliary variables by
examining a graphical model and, in the case of MAR, selecting those variables that
d-separate the missingness indicator and the variables with missing data, thus fulfilling the
conditional independence that needs to hold. In MNAR situations one could quantify the
amount of induced covariance due to observed and unobserved variables using path tracing
rules and knowledge about sign and magnitude of path coefficients and then select those
variables that minimize bias. However, it is highly unlikely that applied researchers have
good qualitative knowledge about relationships among auxiliary variables, let alone
quantitative knowledge about the magnitude of such relations. This would suggest the rather
pessimistic perspective that reducing bias due to missing data is impossible in practice. We
argue that selecting auxiliary variables is indeed non-trivial, but hope that some of our
results can aid in the process. Two results are especially useful in this regard: the fact
that bias-induction can be assumed under certain conditions, and the fact that the magnitude of bias due
to omission tends on average to be larger than bias due to inclusion.
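The d-separation check described above can be sketched in a few lines of code. The following is a minimal illustration, not the procedure used in our studies. It assumes a structure in the spirit of Figure 5: A1 affects both Y and RY, the unobserved L1 affects A2 and Y, and the unobserved L2 affects A2 and RY. The edge list, the function names, and the choice of which unobserved variable points to Y are assumptions made for this sketch.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` (moralized ancestral graph test)."""
    given = set(given)
    keep = ancestors(parents, {x, y} | given)
    adj = {node: set() for node in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for p in pars:                      # parent-child links, made undirected
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pars, 2):  # link parents that share a child
            adj[p].add(q)
            adj[q].add(p)
    stack, seen = [x], {x}
    while stack:                            # look for a path that avoids `given`
        node = stack.pop()
        if node == y:
            return False
        for nxt in adj[node] - seen - given:
            seen.add(nxt)
            stack.append(nxt)
    return True

# Assumed structure (child -> parents); hypothetical, chosen for illustration only.
DAG = {
    "A2": {"L1", "L2"},   # collider: both unobserved variables affect A2
    "Y": {"A1", "L1"},
    "RY": {"A1", "L2"},
}

def valid_auxiliary_sets(dag, candidates=("A1", "A2")):
    """List the subsets of observed candidates that d-separate Y and RY."""
    subsets = [s for r in range(len(candidates) + 1)
               for s in combinations(candidates, r)]
    return [set(s) for s in subsets if d_separated(dag, "Y", "RY", s)]
```

Under the assumed structure, `valid_auxiliary_sets(DAG)` keeps only the set containing A1: adding the collider A2 to the conditioning set connects Y and RY through the unobserved L1 and L2.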
We argue that if researchers happen to have very specific knowledge about their
auxiliary variables, this knowledge should be used. For example, if a researcher has good
reasons to believe that MAR holds and assumes that an auxiliary variable is only related to
missingness and analysis variables due to spurious relations, and is itself not related to any
other bias-inducing variables, then it would be best to exclude this variable. On the other
hand, if the researcher believes that direct effects are more plausible, then inclusion of this
auxiliary variable is the best course of action. These decisions presuppose that researchers in
fact think carefully about their auxiliary variables, which might not always be easy due to a
lack of theoretical knowledge. However, this might still be preferable to a weakly argued appeal
to MAR combined with blindly adding auxiliary variables to one’s imputation model. In general it
would be very desirable if stronger arguments for the plausibility of MAR were brought
forth, be it in the form of written arguments about relations among auxiliary variables
or through graphical models. Tacitly assuming that MAR holds (possibly with claims of
large explained variance) should not be considered a defensible strategy.
The second result that may prove useful for applied researchers is that (especially in
the case of MAR) the bias due to omission of a useful auxiliary variable and the bias due to
inclusion of a collider auxiliary variable are not symmetrical. As our studies suggest, the
former seems to outweigh the latter. This means that if an applied researcher who observes
a correlation between an auxiliary variable and missingness is unsure whether this variable
is bias-inducing or bias-reducing, it may more often be beneficial to include it. This may
mean that one ends up with an inclusive strategy in which all potential auxiliary variables are
used. An important difference, though, is that one arrives at this solution through careful
consideration of auxiliary variables and thus presumably can provide a stronger theoretical
argument in favor of MAR.
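The asymmetry between the two kinds of bias can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reproduction of our simulation studies: a regression-adjusted mean stands in for FIML with a single auxiliary variable, all variables are standard normal, missingness is created by cutting a latent propensity at its 30th percentile (cases below the cut are treated as missing), and each path explains 45% of variance. The function names and the two structure labels are introduced here for illustration.

```python
import numpy as np

def missing_mask(propensity, pct=30):
    """Mark as missing the cases whose latent propensity is below the pct-th percentile.
    The direction of the cut is an assumption of this sketch."""
    return propensity < np.percentile(propensity, pct)

def adjusted_mean(y, aux, missing):
    """Regression-adjusted mean of y; a simple stand-in for FIML with one auxiliary variable."""
    obs = ~missing
    slope = np.cov(y[obs], aux[obs])[0, 1] / np.var(aux[obs], ddof=1)
    return y[obs].mean() + slope * (aux.mean() - aux[obs].mean())

def simulate(structure, r2=0.45, n=200_000, seed=1):
    """Return (listwise mean, adjusted mean) of Y; the true mean of Y is 0."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(r2)

    def noise(var):
        return np.sqrt(var) * rng.standard_normal(n)

    if structure == "collider":
        # Unobserved L1 -> aux, Y and unobserved L2 -> aux, missingness.
        l1, l2 = rng.standard_normal(n), rng.standard_normal(n)
        aux = a * l1 + a * l2 + noise(1 - 2 * r2)
        y = a * l1 + noise(1 - r2)
        propensity = a * l2 + noise(1 - r2)
    elif structure == "direct":
        # The auxiliary variable directly affects both Y and missingness.
        aux = rng.standard_normal(n)
        y = a * aux + noise(1 - r2)
        propensity = a * aux + noise(1 - r2)
    else:
        raise ValueError(f"unknown structure: {structure}")
    missing = missing_mask(propensity)
    return y[~missing].mean(), adjusted_mean(y, aux, missing)

listwise_collider, adjusted_collider = simulate("collider")
listwise_direct, adjusted_direct = simulate("direct")
```

In the collider structure the listwise mean should be close to zero while the adjusted mean is biased; in the direct structure the pattern reverses. Under these assumptions, the bias from omitting the useful variable should be larger in absolute value than the bias from including the collider.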
Objections and limitations
Several objections might be raised to the graphical models we presented in general and
the existence of bias-inducing variables in particular. First, one might ask whether the
unobserved variables that we posited in our simulation studies might in fact be
observable, so that they could be used as auxiliary variables in an inclusive fashion.
If the unobserved variables L1 and L2 in our examples were in fact observed, then it is true
that bias due to the collider auxiliary variable would vanish. Unfortunately, it is not trivial
to establish whether all such unobserved variables have truly been captured, or whether
additional unobserved variables of this kind may exist.
Second, one might question whether two unobserved variables would be uncorrelated as
in our example. One might argue that it is more realistic that there are other variables
(potentially unobserved as well) that induce correlations between L1 and L2. As we can see
using the d-separation criterion, making the unobserved variables related to each other
would not change the fact that conditioning on A2 would induce a covariance between Y and
LY.
Third, it might be argued that it seems implausible that a collider variable like A2 does
not have any direct effect on Y or RY. If one supposes that A2 has a direct effect on either
Y or RY, then conditioning on A2 would close a bias-inducing path, but at the same time
open another one. This would be very similar to the situations that we describe in simulation
studies 2.1 and 2.2, in which it is impossible to tell whether bias is reduced or induced
without specific knowledge of the magnitude of path coefficients.
Fourth, one could argue that the examples and the whole concept of collider bias are too
artificial and that such structures simply do not occur in real datasets. The question of whether variables like
A2 in Figure 5 can exist in real data settings has already been widely discussed. ? (?) argues
that such data situations are rare or virtually impossible, whereas other authors (e.g., ?, ?, ?,
?, ?) seem to suggest that such structures can in fact emerge. We believe that while it may
be rare to find such a simple structure as we have displayed, it does not seem completely
implausible to find unobserved variables that happen to have an effect on a potential
auxiliary variable and also on missingness or analysis variables, respectively. It seems in fact
especially plausible if the variables that are being considered are psychological constructs,
which are often caused by many other variables that may or may not have been observed.
Furthermore, as we have shown in our second set of simulations, it is not even necessary to
invoke the concept of a collider to observe bias-inducing properties of auxiliary variables.
Besides these objections the study has other limitations. First, we only examined
biases in means and standard deviations using a MAR-linear pattern. Clearly, the
simulations could be extended to regression coefficients, or various other parameters, under
more complicated missingness patterns. Moreover, we could have simulated data with
correlated auxiliary variables, a larger number of auxiliary variables, and more variables with
missing data. In particular, correlations among auxiliary variables could have
made the bias-inducing and bias-reducing properties even more complicated. We acknowledge
that all these aspects could have been investigated, and we hope to
do so in the future. For this particular study we purposefully kept the complexity of models
and missingness to a minimum to show that under very simple models, bias-induction due to
auxiliary variables can occur.
Future directions
The limitations section above points in the direction of future research. First, it will be
interesting to consider more variables with missing data. We argue that the underlying
mechanisms of bias-reduction and bias-induction would be similar if more than one variable
is considered; however, we concede that it might be more difficult to graphically display
models in which many missing data indicators need to be considered. This would also complicate
auxiliary variable selection, as some auxiliary variables may be bias-reducing for some variables and
bias-inducing for others.
Second, it would be fruitful to examine models that include several auxiliary variables
that are correlated with each other. This adds immense complexity, as the
inclusion or exclusion of any given variable has far-reaching consequences for the potential
of other auxiliary variables to induce or reduce bias. As an example, it might be beneficial to
include a collider auxiliary variable, even though it is known to be bias-inducing, just for the
reason that it is correlated with a variable that is also bias-inducing, but unobserved. Given
complex patterns of relationships, it can prove very challenging, even with a
graphical model, to disentangle the effects at work.
Third, it would be interesting to directly compare the performance of the inclusive
approach, the data-driven approach, and an approach that relies on theoretical assumptions
of structural relationships, in their ability to reduce bias. In our studies we focused on small
examples and bias behavior, but have not examined differential behavior of different
approaches in a comprehensive fashion.
Fourth, we have only examined a subset of possible relations of auxiliary variables and
missingness. It would be possible to extend our results to other scenarios, e.g., an auxiliary
variable having an impact on missingness but being spuriously related to the variable with
missing values due to an unobserved variable, or auxiliary variables having both spurious
and direct effects on missingness. In summary, we believe that there is still a lot to be
learned about the selection of auxiliary variables in missing data.
Table 1
Results of illustrative example of bias-inducing auxiliary variable.

                  M (SD)        Bias reduction compared to listwise
Complete data     .03 (1.00)
Listwise          .19 (.98)
FIML with A1      .06 (.96)      68%
FIML with A2      .30 (.98)     −58%
FIML with both    .14 (.98)      26%
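The percentages in Table 1 can be retraced with a short calculation. The sketch below assumes that bias is measured against a population mean of zero and that bias reduction is the relative change in absolute bias compared to listwise deletion. Both are assumptions on our part for this illustration: they reproduce the reported percentages, whereas measuring bias against the complete-data mean of .03 does not. The function name is hypothetical.

```python
TRUE_MEAN = 0.0  # assumed population mean of Y (not stated in the table)

# Estimated means taken from Table 1.
means = {
    "Listwise": 0.19,
    "FIML with A1": 0.06,
    "FIML with A2": 0.30,
    "FIML with both": 0.14,
}

def bias_reduction(estimate, reference, truth=TRUE_MEAN):
    """Percentage reduction in absolute bias relative to the reference estimate."""
    return 100 * (1 - abs(estimate - truth) / abs(reference - truth))

reductions = {
    name: round(bias_reduction(m, means["Listwise"]))
    for name, m in means.items()
    if name != "Listwise"
}
```

A negative value, as for the collider A2, means that the estimate is further from the assumed true mean than the listwise estimate.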
Figure 1. A simple MCAR model.
[Path diagram not reproduced. Nodes: X, Y, RY; error terms: εX, εY, εR.]
Figure 2. A simple MAR model without auxiliary variables (a) and with auxiliary variables
(b).
[Path diagrams not reproduced. Panel (a) nodes: X, Y, RY; error terms: εX, εY, εR. Panel (b) nodes: X, Y, A, RY; error terms: εX, εY, εR, εA.]
Figure 3. A simple MNAR model with direct path between missingness and variable with
missing data (a) and unobserved variable related to both Y and RY (b).
[Path diagrams not reproduced. Panel (a) nodes: X, Y, RY; error terms: εX, εY, εR. Panel (b) nodes: X, Y, RY, L1; error terms: εX, εY, εR, εL1.]
Figure 4. A model with several auxiliary variables. Not all of the auxiliary variables are
needed for an unbiased estimate.
[Path diagram not reproduced. Nodes: A1, A2, A3, X, Y, RY; error terms: εX, εY, εR, εA1, εA2, εA3.]
Figure 5. Simple structure of two auxiliary variables and a single variable Y exhibiting
missing data.
[Path diagram not reproduced. Nodes: A1, A2, L1, L2, Y, RY; error terms: εY, εR, εA1, εA2, εL1, εL2.]
Figure 6. Data generating model for Simulation 1.1.
[Path diagram not reproduced. Nodes: L1, L2, A1, Y, RY; error terms: εY, εR, εA1, εL1, εL2; paths labeled α, one of them marked with a ?.]
Figure 7. Partial results of simulation study 1.1. Standardized bias in the estimate of the
mean across all conditions for both listwise and FIML. Arrows at the bottom of the graph
display the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: standardized bias (-4 to 4); lines: listwise, FIML.]
Figure 8. Partial results of simulation study 1.1. Coverage in the estimate of the mean
across all conditions for both listwise and FIML. Arrows at the bottom of the graph display
the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: coverage (0.0 to 1.0); lines: listwise, FIML.]
Figure 9. Data generating model for Simulation 1.2.
[Path diagram not reproduced. Nodes: A2, Y, RY; error terms: εY, εR, εA2; paths labeled β, one of them marked with a ?.]
Figure 10. Partial results of simulation study 1.2. Standardized bias in the estimate of the
mean across all conditions for both listwise and FIML. Arrows at the bottom of the graph
display the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: standardized bias (-4 to 4); lines: listwise, FIML.]
Figure 11. Partial results of simulation study 1.2. Coverage in the estimate of the mean
across all conditions for both listwise and FIML. Arrows at the bottom of the graph display
the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: coverage (0.0 to 1.0); lines: listwise, FIML.]
Figure 12. Data generating model for Simulation 2.1.
[Path diagram not reproduced. Nodes: L1, L2, A1, U1, Y, RY; error terms: εY, εR, εA1, εL1, εL2, εU1; paths labeled α (one of them marked with a ?) and γ.]
Figure 13. Partial results of simulation study 2.1. Standardized bias in the estimate of the
mean across all conditions for both listwise and FIML. Arrows at the bottom of the graph
display the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: standardized bias (-4 to 4); lines: listwise, FIML.]
Figure 14. Partial results of simulation study 2.1. Coverage in the estimate of the mean
across all conditions for both listwise and FIML. Arrows at the bottom of the graph display
the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: coverage (0.0 to 1.0); lines: listwise, FIML.]
Figure 15. Data generating model for Simulation 2.2.
[Path diagram not reproduced. Nodes: A2, U1, Y, RY; error terms: εY, εR, εA2, εU1; paths labeled β (one of them marked with a ?) and γ.]
Figure 16. Partial results of simulation study 2.2. Standardized bias in the estimate of the
mean across all conditions for both listwise and FIML. Arrows at the bottom of the graph
display the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: standardized bias (-4 to 6); lines: listwise, FIML.]
Figure 17. Partial results of simulation study 2.2. Coverage in the estimate of the mean
across all conditions for both listwise and FIML. Arrows at the bottom of the graph display
the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: coverage (0.0 to 1.0); lines: listwise, FIML.]
Appendix A
Generation of missing values and explained variance in RY
Note that the missingness indicator is a binary outcome variable, and should ideally be
modeled using a logistic or probit model. We modeled the relationship between predictor
variables and missingness by modeling a latent, continuous variable that expresses the
likelihood of being missing, given values on variables that predict missingness. This allowed
us to use the same path coefficients and model the same amount of explained variance. This
latent variable is not displayed in our graphs to make the visualization of the underlying
missingness mechanism clearer. Paths going into the latent, continuous variable had the
same magnitude and explanatory power as paths from variables going into the variable with
missing data, hence they are also displayed with the same letter α in our graphs. To
generate missing data, we created a binary indicator based on the latent missingness
propensity, by performing a cut at the 30th percentile of the underlying continuous variable.
We fully acknowledge that this dichotomization results in amounts of explained variance that
are nominally lower than the ones that were specified with regard to the latent continuous
variable. We examined this attenuation and found, in line with previous research (?, ?, ?),
that the attenuation factor is constant as long as the dichotomization always occurs at the
same percentile. We also reran this simulation and modeled the binary missingness indicator
directly, choosing logistic regression coefficients that map onto the exact same values of the
McKelvey-Zavoina pseudo-R². Results from these studies showed a very similar pattern,
with biases across all conditions being slightly higher, due to the absence of any attenuation.
The only reason why we did not employ the approach of directly modeling the binary
response was that it becomes exceedingly hard to obtain the exact desired pseudo-R² in models
with several, potentially correlated predictors. This lesser-known point about logistic
regression is explained in more detail by ? (?).
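The attenuation due to dichotomization can be illustrated with a small numerical sketch. It assumes a single standard normal predictor, a latent propensity with a chosen share of explained variance, and a cut at the 30th percentile with cases below the cut treated as missing. Explained variance in the binary indicator is measured here by the squared correlation with the predictor, which is a simplification for this sketch and not the McKelvey-Zavoina measure. All names are introduced for illustration.

```python
import numpy as np

def explained_variance(alpha_sq, pct=30, n=400_000, seed=0):
    """Return R² of the predictor for the latent propensity and for the binary indicator."""
    rng = np.random.default_rng(seed)
    predictor = rng.standard_normal(n)
    latent = (np.sqrt(alpha_sq) * predictor
              + np.sqrt(1 - alpha_sq) * rng.standard_normal(n))
    # Assumed direction of the cut: the lowest 30% of propensities are missing.
    missing = latent < np.percentile(latent, pct)
    r2_latent = np.corrcoef(predictor, latent)[0, 1] ** 2
    r2_binary = np.corrcoef(predictor, missing.astype(float))[0, 1] ** 2
    return r2_latent, r2_binary

def attenuation(alpha_sq, **kwargs):
    """Ratio of explained variance after and before dichotomization."""
    r2_latent, r2_binary = explained_variance(alpha_sq, **kwargs)
    return r2_binary / r2_latent

factor_15 = attenuation(0.15)
factor_45 = attenuation(0.45)
```

With the cut held at the same percentile, the two factors should agree up to sampling error, which is the constancy referred to above.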
Appendix B
Appendix tables
Table B1
Results of simulation study 1.1. Table presents results, broken up by estimation strategy
(listwise, FIML), parameter estimate (mean or variance), type of performance measure
(standardized bias, standard error, RMSE, coverage), and across columns, the sign and
magnitude of the relationships on the R2 metric.

Unique explained variance in each path α, by sign of coefficient
                              |----------- negative ------------|       |----------- positive ------------|
Strategy      Par. Measure        45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
listwise      µy   Std. bias     0.03  -0.08  -0.05   0.03   0.02   0.04   0.03  -0.03   0.00  -0.01   0.02
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.05   0.05   0.05   0.05   0.05   0.05   0.06   0.05   0.05   0.05   0.05
                   Coverage      0.96   0.95   0.95   0.96   0.95   0.95   0.94   0.95   0.95   0.95   0.96
              σy   Std. bias    -0.02  -0.07  -0.03  -0.03  -0.02  -0.04  -0.13  -0.02  -0.04   0.02  -0.10
                   Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                   RMSE          0.08   0.07   0.07   0.08   0.07   0.08   0.08   0.07   0.08   0.08   0.07
                   Coverage      0.95   0.96   0.95   0.94   0.95   0.93   0.94   0.96   0.95   0.95   0.95
FIML with A1  µy   Std. bias     2.21   1.16   0.54   0.24   0.04   0.03   0.01  -0.24  -0.61  -1.24  -2.17
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.12   0.08   0.06   0.05   0.05   0.05   0.06   0.06   0.06   0.08   0.12
                   Coverage      0.43   0.80   0.91   0.95   0.95   0.95   0.95   0.94   0.92   0.79   0.44
              σy   Std. bias     0.31   0.05   0.00  -0.02  -0.02  -0.04  -0.13  -0.02  -0.01   0.13   0.24
                   Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                   RMSE          0.08   0.07   0.07   0.08   0.07   0.08   0.08   0.07   0.08   0.08   0.08
                   Coverage      0.95   0.96   0.95   0.94   0.95   0.93   0.94   0.95   0.95   0.96   0.95
Table B2
Results of simulation study 1.2. Table presents results, broken up by estimation strategy
(listwise, FIML), parameter estimate (mean or variance), type of performance measure
(standardized bias, standard error, RMSE, coverage), and across columns, the sign and
magnitude of the relationships on the R2 metric.

Unique explained variance in each path α, by sign of coefficient
                              |----------- negative ------------|       |----------- positive ------------|
Strategy      Par. Measure        45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
listwise      µy   Std. bias    -4.46  -3.18  -2.24  -1.41  -0.50   0.09   0.51   1.36   2.36   3.24   3.90
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.23   0.18   0.14   0.09   0.06   0.05   0.06   0.09   0.13   0.18   0.21
                   Coverage      0.00   0.08   0.36   0.70   0.93   0.96   0.92   0.69   0.35   0.09   0.03
              σy   Std. bias    -1.52  -0.90  -0.46  -0.22  -0.04  -0.06  -0.03  -0.17  -0.47  -0.96  -1.27
                   Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                   RMSE          0.12   0.10   0.08   0.08   0.08   0.08   0.07   0.07   0.08   0.10   0.11
                   Coverage      0.65   0.82   0.92   0.94   0.94   0.94   0.96   0.94   0.89   0.79   0.72
FIML with A2  µy   Std. bias     0.01   0.01   0.02  -0.03  -0.03   0.09   0.04   0.00  -0.01   0.01  -0.01
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   Coverage      0.96   0.95   0.94   0.94   0.96   0.96   0.96   0.95   0.95   0.95   0.96
              σy   Std. bias     0.00  -0.03   0.00  -0.06  -0.02  -0.06  -0.02  -0.01  -0.04  -0.05   0.01
                   Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                   RMSE          0.08   0.08   0.07   0.08   0.08   0.08   0.07   0.07   0.08   0.08   0.08
                   Coverage      0.96   0.94   0.96   0.95   0.94   0.94   0.96   0.95   0.94   0.95   0.94
Table B3
Results of simulation study 2.1. Table presents results, broken up by estimation strategy
(listwise, FIML), parameter estimate (mean or variance), type of performance measure
(standardized bias, standard error, RMSE, coverage), and across columns, the sign and
magnitude of the relationships on the R2 metric.

Unique explained variance in each path α, by sign of coefficient
                              |----------- negative ------------|       |----------- positive ------------|
Strategy      Par. Measure        45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
listwise      µy   Std. bias     1.89   1.84   1.99   1.92   1.90   1.83   1.77   1.89   1.88   1.81   1.90
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.11   0.11   0.11   0.11   0.12   0.11   0.11   0.11   0.11   0.11   0.11
                   Coverage      0.54   0.54   0.51   0.58   0.53   0.53   0.55   0.55   0.55   0.55   0.54
              σy   Std. bias    -0.33  -0.24  -0.37  -0.31  -0.29  -0.38  -0.30  -0.27  -0.32  -0.28  -0.31
                   Std. Error    0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07
                   RMSE          0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08
                   Coverage      0.91   0.93   0.92   0.92   0.92   0.93   0.93   0.92   0.92   0.92   0.92
FIML with A1  µy   Std. bias     4.38   3.23   2.72   2.17   1.93   1.83   1.75   1.70   1.34   0.71   0.01
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.23   0.18   0.15   0.12   0.12   0.11   0.11   0.10   0.09   0.07   0.05
                   Coverage      0.00   0.10   0.26   0.47   0.52   0.53   0.55   0.61   0.74   0.88   0.95
              σy   Std. bias     0.08  -0.09  -0.33  -0.30  -0.29  -0.38  -0.30  -0.27  -0.30  -0.18  -0.02
                   Std. Error    0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.08
                   RMSE          0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08
                   Coverage      0.95   0.94   0.92   0.92   0.91   0.93   0.93   0.93   0.92   0.93   0.95
Table B4
Results of simulation study 2.2. Table presents results, broken up by estimation strategy
(listwise, FIML), parameter estimate (mean or variance), type of performance measure
(standardized bias, standard error, RMSE, coverage), and across columns, the sign and
magnitude of the relationships on the R2 metric.

Unique explained variance in each path α, by sign of coefficient
                              |----------- negative ------------|       |----------- positive ------------|
Strategy      Par. Measure        45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
listwise      µy   Std. bias    -2.29  -1.46  -0.48   0.47   1.49   1.95   2.42   3.23   4.13   5.33   6.36
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.14   0.09   0.06   0.06   0.09   0.11   0.14   0.18   0.23   0.28   0.33
                   Coverage      0.34   0.71   0.93   0.92   0.69   0.53   0.33   0.09   0.01   0.00   0.00
              σy   Std. bias    -0.47  -0.16  -0.05  -0.02  -0.17  -0.26  -0.49  -0.94  -1.49  -2.34  -3.68
                   Std. Error    0.07   0.07   0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.06   0.06
                   RMSE          0.08   0.07   0.08   0.08   0.08   0.08   0.08   0.10   0.12   0.17   0.23
                   Coverage      0.90   0.94   0.94   0.95   0.94   0.93   0.90   0.81   0.64   0.33   0.08
FIML with A1  µy   Std. bias     2.44   2.35   2.21   2.07   2.02   1.94   2.00   2.03   2.08   2.38   2.53
                   Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                   RMSE          0.14   0.13   0.12   0.12   0.12   0.11   0.12   0.12   0.12   0.14   0.14
                   Coverage      0.33   0.39   0.44   0.48   0.51   0.53   0.51   0.45   0.43   0.33   0.26
              σy   Std. bias     1.14   0.86   0.43   0.16  -0.15  -0.27  -0.47  -0.79  -1.07  -1.49  -2.02
                   Std. Error    0.09   0.08   0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.07   0.07
                   RMSE          0.13   0.11   0.09   0.08   0.08   0.08   0.08   0.09   0.11   0.13   0.16
                   Coverage      0.84   0.91   0.93   0.95   0.94   0.93   0.90   0.85   0.76   0.62   0.44
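The four performance measures in Tables B1 to B4 could be computed from the replicate estimates of a simulation roughly as follows. This is a sketch under assumptions: standardized bias is taken to be the raw bias divided by the empirical standard error of the estimates, and coverage is based on normal-theory 95% intervals. Neither definition is spelled out in this section, and the function and argument names are hypothetical.

```python
import numpy as np

def performance(estimates, std_errors, truth, z=1.96):
    """Summarize replicate estimates of one parameter.

    estimates:  point estimates across replications
    std_errors: estimated standard errors across replications (scalar or array)
    truth:      population value of the parameter
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    raw_bias = est.mean() - truth
    empirical_se = est.std(ddof=1)
    return {
        "std_bias": raw_bias / empirical_se,            # assumed definition
        "std_error": empirical_se,
        "rmse": np.sqrt(np.mean((est - truth) ** 2)),
        "coverage": np.mean(np.abs(est - truth) <= z * se),
    }
```

Under the assumed definition, the tabled values are approximately consistent with each other. For example, a standardized bias of -4.46 with a standard error of 0.05 implies a raw bias of about -0.22, and the square root of 0.22² + 0.05² is about 0.23, the RMSE reported for that condition in Table B2.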
Appendix C
Bias in standard deviations in simulation study 2.2.
We present an additional graph to highlight the pattern of biases in standard deviations.
Figure C1. Partial results of simulation study 2.2. Standardized bias in the estimate of the
standard deviation across all conditions for both listwise and FIML. Arrows at the bottom of
the graph display the sign of the path labeled with a ?.
[Plot not reproduced. x-axis: percentage of explained variance (45 to 0 to 45, split by negative and positive sign); y-axis: standardized bias (-4 to 4); lines: listwise, FIML.]
Figure C1 shows that the listwise model had a moderate amount of bias in situations in
which only U1 biased the effects. With increasing influence of the omitted variable A2, bias
increased, but only in situations in which the sign of the path coefficient was positive.
When the effect of A2 was negative, bias due to U1 in the standard deviation was attenuated
up to a point, and then became slightly larger again when the influence of A2 was very large and
negative in sign. In all conditions, the listwise estimate of the standard deviation
was negatively biased, i.e., too small. The FIML model behaved differently from the
listwise model. At small values of explained variance the two models yielded similar amounts
of bias. With positive signs of the relationship between A2 and Y, and residual
confounding through U1, the standard deviation was consistently underestimated. The
underestimation increased monotonically with the strength of the relationship of A2. In
conditions with a positive sign of A2, the FIML model generally outperformed the listwise
model. In conditions with a negative sign, we observed that bias increased monotonically,
eventually leading to an overestimation of the variability in the data. This bias became stronger
than the negative bias that was observed in the listwise model.