Running head: SELECTION OF AUXILIARY VARIABLES 1

Selection of auxiliary variables in missing data problems: Not all auxiliary variables are

created equal

Felix Thoemmes

Cornell University

Norman Rose

University of Tuebingen

Author Note

The authors would like to thank the participants of the colloquium of the Methodology

Center at Pennsylvania State University. Inquiries to this article should be addressed to the

first author, Felix Thoemmes, MVR G62A, Cornell University, Ithaca, NY 14853,

[email protected].

Abstract

The treatment of missing data in the social sciences has changed tremendously during the

last decade. Modern missing data techniques such as multiple imputation and

full-information maximum likelihood are used much more frequently. These methods assume

that data are missing at random. One very common approach to increase the likelihood that

missing at random is achieved consists of including many covariates as so-called auxiliary

variables. These variables are either included based on data considerations or in an inclusive

fashion, i.e., taking all available auxiliary variables. However, neither approach accounts for

the fact that under a wide range of circumstances there is a class of variables that, when

used as auxiliary variables, will always increase bias in the estimation of parameters from

data with missing values. In this paper we show that this bias exists, quantify it in a

simulation study, and discuss possible ways how one can avoid selecting bias-inducing

covariates as auxiliary variables.

Keywords: missing data, auxiliary variables, multiple imputation, full information

maximum likelihood


Selection of auxiliary variables in missing data problems: Not all auxiliary variables are

created equal

Introduction

The presence of missing data is a prevalent problem in social science research (?, ?).

Given that a large portion of social science studies are conducted outside the confines of a

laboratory, the threat of suffering missing data due to non-compliance or attrition is even

more pronounced. The pervasiveness of this problem has triggered much research during the

last 30 years. ? (?) laid the foundation of modern missing data theory which has culminated

in sophisticated methods to deal with missing values, specifically the use of full-information

maximum likelihood (FIML) and multiple imputation (MI). For an overview see e.g., ? (?).

Both of these so-called modern missing data techniques are expected to yield unbiased

estimates of parameters in the presence of missing data, given that certain assumptions

about missingness hold. It should be noted that especially MI, while conceptually

straightforward (?, ?), can be conducted with various different techniques, see e.g., ? (?), ?

(?), ? (?), or ? (?). However, despite computational differences, all techniques, whether they

may be FIML or variants of MI, rely on the same, untestable assumptions, notably, the

missing at random (MAR) assumption (?, ?), which we will define more formally later in the

manuscript. The goal of this paper is to critically examine current recommendations to

increase the plausibility of MAR, especially with regard to the selection of auxiliary variables.

We argue that the current recommendations are incomplete and simply ignore the possibility

of complex relationships between substantive analysis variables and variables that are solely

used to improve the missing data estimation, so-called auxiliary variables. Further, we

believe that the complexities of the assumptions are not widely appreciated among social

science researchers and many quantitative scientists alike, who have long believed that

inclusion of as many auxiliary variables as possible is a safe strategy to asymptotically

achieve or approximate unbiasedness. We will show in a small example and a larger

simulation study that this strategy is not guaranteed to yield unbiased results and that


biases due to missing data and the use of auxiliary variables are much more complex than

previously thought. As a result, the use of modern missing data techniques, while laudable,

often does not guarantee that bias in studies with missing data has been adequately dealt

with.

We will first review classic missingness mechanisms and discuss which conditional

independencies these conditions imply and how these independencies can be encoded in a

graph. Further, we demonstrate that there are situations and classes of variables that should

not be used as auxiliary variables in FIML or MI as they tend to increase bias. We will

quantify the bias in our simulation studies, and suggest possible ways to avoid it. Finally, we

will discuss implications for applied research and offer an alternative framework to think

about and communicate assumptions of missing data problems.

Missing data mechanism

We begin by reviewing the classic mechanisms defined by ? (?): missing completely at

random (MCAR), missing at random (MAR), and missing not at random (MNAR). In our

overview we use a slightly modified version of the notation employed by ? (?). In addition,

we also express missing mechanisms using conditional independence statements. In

conjunction with the conditional independence statements, we present graphical displays to

illustrate the mechanisms. Using graphs to illustrate how missingness relates to other

variables in a model is not a novel approach and has in fact been used in popular texts and

articles to aid understanding of the mechanisms (?, ?, ?). In this paper however, we do not

use graphs simply as illustrations, but also use formal graph theory (?, ?) to derive certain

results.

MCAR

Following the notation of ? (?), we denote by Y an N × K data matrix. The rows of Y

represent the cases n = 1, . . . , N of the sample and the columns represent the variables

i = 1, . . . , K. Y can be partitioned into an observed part, labeled Yobs, and a missing part


Ymis, which yields Y = (Yobs, Ymis). Further, we denote an indicator matrix of missingness, R,

whose elements take on values of 0 or 1, for observed or missing values of Y , respectively.

Accordingly, R is also an N × K matrix. Each variable in Y can therefore have both

observed and unobserved values.

Missing completely at random (MCAR) is the most restrictive assumption, but, when

fulfilled, the least problematic. It states that the unconditional distribution of missingness

P (R) is equal to the conditional distribution of missingness given Yobs and Ymis, or simply Y .

P (R | Y ) = P (R | Yobs, Ymis) = P (R) (1)

These equalities of probabilities imply (can be expressed as) conditional independence

statements, here in particular

R ⊥ (Yobs, Ymis). (2)

The MCAR condition is therefore fulfilled when the missingness has no relationship with

either the observed or the unobserved part of Y . In an applied research context we could

imagine MCAR being fulfilled if the missing data arose from a purely “accidental” (random)

process, like dropping a single sheet from a questionnaire. In other words, the probability of

missingness is related only to factors that are completely unrelated to any other variable in

the model. MCAR is rare in applied research and usually does not hold, unless it has been

planned by the researcher in so-called missingness by design studies (?, ?). When MCAR

holds, even simple techniques, like listwise deletion, will yield unbiased estimates (?, ?), even

though it might still not be advisable to use these simple methods due to loss in statistical

power. As ? (?) described and ? (?) more formally showed, MCAR cannot be tested

empirically, and homogeneity of means, variances, or more generally distributions, of

observed variables across missing data patterns constitutes only necessary, but not sufficient

evidence for MCAR. The inability to directly test MCAR can also be seen by the fact that it

posits independence assumptions about quantities that are by definition unobserved, here in

particular Ymis.


Before we proceed further, it is necessary to address the graphical displays that we will

be using. First, they are constructed as so-called directed acyclic graphs (?, ?), which we will

abbreviate as DAGs. DAGs are widely used in epidemiology (?, ?, ?, ?, ?), medicine (?, ?, ?),

computer science (?, ?, ?, ?, ?) and other fields. They also have been used to examine

missing data situations (?, ?). Researchers who are familiar with structural equation models

(SEM) will also feel familiar with DAGs, however there are some differences (for a complete

overview of differences refer to ? (?)). Briefly explained, in a DAG we use the ε terms, the

so-called disturbance terms, to denote all unmeasured variables that may have an effect on

the variable that is endowed with this ε term. Note that these disturbance terms are not

identical to regression residuals that are by definition uncorrelated with variables that were

used to predict the variable with the ε term. Further, the DAG is completely non-parametric

and encodes conditional independencies among the variables displayed. Precisely because of

this ability to encode conditional independencies, DAGs are well suited to express missing

data mechanisms (which can be expressed as such conditional independencies, as we have

shown earlier in the example of MCAR). We will use DAGs to express conditional

independencies that are prescribed by different missingness mechanisms and in doing so,

show how novel insights about missingness problems can be gathered.

In Figure 1 we present a graphical display of MCAR for the simple case in which a

single variable X has an effect on a unidimensional variable Y . In this simple case, X is

completely observed and only Y suffers from missingness. Whether data on Y is missing is

encoded by the indicator RY in the graph. We use an additional subscript for R here to

denote that this missingness indicator pertains only to variable Y . Note that we could have

visually partitioned Y in the graph into Yobs and Ymis, but for clarity simply denote it as Y .

In this example equation 2, which expresses the condition that needs to hold for MCAR, can

be written as RY ⊥ (X, Y ).

Independence relations in DAGs are expressed as so-called d-separation statements.

d-separation is a graphical criterion that can be applied to DAGs to infer independence


relations among variables. In short, if two variables are said to be d-separated, there exists

no traceable, unblocked path in the diagram between the variables. Conversely, if two

variables are d-connected, there exists a traceable and unblocked path between the variables.

A traceable path is defined as any path that connects two variables in a graph. It is not of

importance for the definition of a path whether the segments of the path have arrows

pointing in one or the other direction. To examine d-separation one examines whether all

paths are open or blocked. A path is said to be blocked if one conditions on a variable in the

path that acts as a mediator, i.e., takes on the form ← X ← or → X →, or is an

arrow-emanating variable, i.e., takes on the form, ← X →. Further, a path is blocked if one

does not condition on a variable that has two arrows pointing into it, i.e., takes on the form

→ X ←. Such a variable is usually called a collider variable (?, ?). If two variables are said

to be d-connected there exists at least one traceable path between them that has not been

blocked. Being d-connected implies that the two variables are stochastically dependent on

each other. ? (?) has provided a proof that variables in a graph that are d-separated are

stochastically independent from each other, regardless of the functional form of the

relationships among the variables in the graph. For a more thorough introduction to

d-separation for social scientists, consult ? (?) or the original text by ? (?).
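
The blocking rules described above can be turned into a small checker. The following Python sketch is not taken from the paper. It uses the ancestral moral graph construction, a standard equivalent of tracing paths by hand, which also accounts for the fact that conditioning on a descendant of a collider opens a path. The function name `d_separated`, the edge-list representation, and the node names are assumptions made for illustration, and disturbance terms are left out of the example graphs.

```python
from itertools import combinations

def d_separated(edges, x, y, given=()):
    """Return True if x and y are d-separated by `given` in a DAG.

    `edges` is a list of (cause, effect) pairs; node names are hypothetical.
    """
    given = set(given)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    # 1. Keep only x, y, the conditioning set, and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link every node to its parents and link co-parents to each other.
    neighbours = {node: set() for node in keep}
    for child in keep:
        pars = parents.get(child, set())
        for p in pars:
            neighbours[p].add(child)
            neighbours[child].add(p)
        for p, q in combinations(pars, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # 3. Remove the conditioned nodes and look for any remaining x-y connection.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True

fork = [("X", "Y"), ("X", "RY")]        # Y <- X -> RY
chain = [("Y", "M"), ("M", "RY")]       # Y -> M -> RY
collider = [("Y", "C"), ("RY", "C")]    # Y -> C <- RY
isolated = [("X", "Y")]                 # RY has no measured cause

checks = {
    "fork, unconditional": d_separated(fork, "Y", "RY"),
    "fork, given X": d_separated(fork, "Y", "RY", ["X"]),
    "chain, given M": d_separated(chain, "Y", "RY", ["M"]),
    "collider, unconditional": d_separated(collider, "Y", "RY"),
    "collider, given C": d_separated(collider, "Y", "RY", ["C"]),
    "isolated RY": d_separated(isolated, "Y", "RY"),
}
for label, separated in checks.items():
    print(label, "->", "d-separated" if separated else "d-connected")
```

In this sketch, conditioning blocks the fork and the chain but opens the collider, which is the asymmetry the rules above describe.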

In the graph in Figure 1 we can see that there is only a single arrow pointing to RY

from the disturbance term εR, meaning that missingness arises only due to unobserved

factors. Further, these unobserved factors have no association with any other variable or

disturbance term in the model, as can be seen by the fact that εR is unassociated with other

parts of the model. In this graph, there is no traceable path between Y and RY (or X and

RY ) and they are said to be d-separated without having to condition on any other variables,

implying unconditional stochastic independence between the variables Y and RY (as defined

in equation 1), and therefore the missing data mechanism is MCAR. So far we have used the

expression “to condition on”; in the context of missing data problems this relates to

observing and using a variable in a FIML or multiple imputation model.


MAR

A somewhat less restrictive condition is missing at random (MAR). MAR states that

the conditional distribution of missingness, given the observed part Yobs is equal to the

probability of missingness, given the observed and the unobserved part (Yobs, Ymis).

P (R | Y ) = P (R | Yobs, Ymis) = P (R | Yobs). (3)

These equalities of probabilities again imply (can be expressed as) conditional independence

statements, here in particular

R ⊥ Ymis | Yobs. (4)

In words, MAR states that the missingness is stochastically independent of the unobserved

variables, whereas dependencies between observed variables and missingness are allowed. In

an applied research context, we could imagine that missingness is caused by certain observed

variables that may also have an effect on important analysis variables. For example,

missingness on an achievement measure could be caused by motivation (or lack thereof).

Further we can assume that motivation also has an effect on achievement. MAR is an

important condition, because when it holds, modern estimation techniques (MI and FIML)

yield unbiased results. Just like MCAR, MAR cannot be tested empirically, as it also posits

conditional stochastic independence assumptions among quantities that are by definition

unobserved, specifically, Ymis. Returning to the example with variable X, the unidimensional

variable Y and the respective missing indicator RY , the MAR condition (see equation 4)

implies the conditional stochastic independence RY ⊥ Y | X. In Figure 2 (a) we show the

simple situation in which MAR holds. In this figure, Y and RY are d-connected, via the

path Y ← X → RY . However, if one conditions on X, this path becomes blocked and Y and

RY are now d-separated, implying conditional stochastic independence RY ⊥ Y | X, as

similarly defined in equation 4, and therefore MAR holds, as long as one has observed X and

uses it in the estimation of Y in a FIML framework, or uses it as a predictor variable in an

MI framework. Often, researchers use variables to predict missingness that may not be of


substantive interest. Such variables are usually called “auxiliary” variables, because they are

not of theoretical interest to the applied researcher but aid in the estimation of the missing

data. In the second graphical example in Figure 2 (b), we explicitly describe an auxiliary variable

and how it can help to create conditional independence between the missingness and the

variable with missing values, thereby implying MAR. We use the same set of variables as in

Figure 1 (a), but introduce a new variable A, which in this example must be used as an

auxiliary variable for an unbiased estimate of the relationship of interest between X and Y

in the presence of missing data on Y .

In Figure 2 (b), Y and RY are d-connected, via the path Y ← A→ RY and via the

path Y ← X → RY . However, if one conditions on A, the first path becomes blocked, and if

one conditions on X, the second path becomes blocked and Y and RY are now d-separated,

implying conditional stochastic independence RY ⊥ Y | (A,X), and therefore MAR holds.

Note that A in the graph could be a multidimensional set of variables that all exhibit the

same structure.

MNAR

Finally, missing not at random (MNAR) is the least stringent assumption, but also the

most problematic, as even FIML and MI will typically, though not always for all parameter

estimates, yield biased results. MNAR is characterized by the probability of missingness

being dependent on both the observed part, Yobs, and the unobserved part, Ymis. That is,

P (R | Yobs, Ymis) ≠ P (R | Yobs). (5)

No conditional independencies are implied by equation 5. In an applied research context, we

could consider different ways that MNAR could arise. One situation would be if missingness

was caused by the variable with missing data itself, e.g., participants with a very high income

are more likely to not report their income. This situation is depicted in Figure 3 (a), in which

Y and RY are directly connected by a path. Y and RY are said to be d-connected through


the direct path Y → RY . Two adjacent, connected variables in a graph, can never be

d-separated. Hence, no conditional stochastic independence can arise, and MNAR is present.

A similar MNAR situation would arise when an unobserved variable has an effect on

both the missingness RY and Y . In an applied research context, this could happen whenever

a variable that influences missingness also has an effect on analysis variables, but the

variable has not been measured and is therefore omitted. This omitted variable can be

displayed as a latent, unobserved variable in the graph, or simply as correlated disturbance

terms. Figure 3 (b) displays such a situation in which an omitted variable influences both Y

and RY . Here, Y and RY are d-connected via the path Y ← L1 → RY . This path cannot be

blocked via conditioning, because no observed variables reside in the middle of the path.

Again, no stochastic conditional independence can be achieved through conditioning and

MNAR holds. Note that the variable L1 in the graph should not be confused with a

modeled, latent variable in a SEM, but rather is a simple depiction of an unobserved

variable. To make this clear, we deviate slightly from regular symbolic language of DAGs

and SEM graphs and use a dashed outline for the unobserved variable.
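
The practical difference between the three mechanisms can be seen in a small simulation. The Python sketch below is an illustration under assumed values and is not part of the paper's simulation study: the coefficient 0.7, the logistic missingness models, the sample size, and all function names are assumptions. As a simple stand-in for FIML or MI, it fits a regression of Y on X to the complete cases and averages the predictions over all cases; this is a swapped-in technique, chosen only because it conditions on X in the same spirit.

```python
import math
import random

def simulate(mechanism, n=40000, seed=1):
    """Draw (x, y, observed) triples; Y = 0.7*X + noise has population mean 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.7 * x + math.sqrt(0.51) * rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.5                              # unrelated to X and Y
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-1.5 * x))     # depends on observed X only
        else:  # "MNAR"
            p_missing = 1 / (1 + math.exp(-1.5 * y))     # depends on Y itself
        rows.append((x, y, rng.random() >= p_missing))
    return rows

def listwise_mean(rows):
    """Mean of Y among the cases whose Y value is observed."""
    ys = [y for _, y, observed in rows if observed]
    return sum(ys) / len(ys)

def regression_adjusted_mean(rows):
    """Fit Y ~ X on complete cases, then average the predictions over all cases."""
    obs = [(x, y) for x, y, observed in rows if observed]
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum(intercept + slope * x for x, _, _ in rows) / len(rows)

estimates = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mechanism)
    estimates[mechanism] = (listwise_mean(rows), regression_adjusted_mean(rows))
    print(mechanism, [round(value, 2) for value in estimates[mechanism]])
```

Under these assumptions, listwise deletion should be close to the true mean of 0 only under MCAR, conditioning on X should repair the estimate under MAR, and neither estimate should recover the mean under MNAR, because no observed variable separates Y from its missingness indicator.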

Equivalence of missing data mechanisms and graphs

In the previous section we showed how the classic missingness mechanisms can be

expressed via graphs that encode conditional independencies and applied the graph-theoretic

concept of d-separation. In summary, when a variable Y and its associated missing indicator

RY are d-connected, MNAR holds and bias will typically emerge. If Y and RY can be

d-separated using any set of other observed variables, then MAR holds, and parameters

related to Y can be estimated without bias, when using methods that rely on MAR

(FIML, MI) and using those variables that are needed to d-separate Y and RY in the

imputation or analysis model, respectively for MI and FIML. A special case arises when Y

and RY are d-separated given no other variables (unconditionally independent), which maps

on to the classic MCAR condition. As we shall see, relying on the graph-theoretic concept of


d-separation will allow us to further determine whether any given auxiliary variable is

needed to achieve d-separation of Y and RY or whether a variable would in fact make these

two variables d-connected and induce conditional dependencies. We believe that herein lies

an important advantage of using graphical models as we can easily spot auxiliary variables

that may be bias-reducing or - as we will show - bias-inducing, something that is not

apparent when relying on the classic conditional independence notation that has been used

to describe the missing data mechanisms.

Current approaches

While all assumptions of the missing mechanisms are important, insofar as they

prescribe which methods will yield biased or unbiased estimates, MAR is an assumption that

is necessary for the two missing data approaches that are considered state-of-the-art, FIML

and MI. A pertinent question is therefore how a researcher can achieve MAR or at least

make MAR plausible in his or her study. As seen in equations 3 and 4 and in the

accompanying graphs it is necessary to include all variables in the imputation or FIML

model that make Y and RY independent of each other. In other words, researchers need to

capture all variables that they believe have a direct or indirect effect on the probability of

being missing and at the same time a direct or indirect effect on the variable with missing

data. Some of these variables might already be part of the analytic model, others might not

be part of the analytic model, but might be needed to satisfy the MAR assumption, i.e.,

auxiliary variables. We now describe current approaches that aim to achieve MAR and

present an example that illustrates potential problems with these approaches.

Inclusive approach

The so-called inclusive approach (?, ?) to achieve MAR directs researchers to include

many auxiliary variables in their imputation model (or in their FIML estimation, following

guidelines by ? (?)). The reasoning behind the inclusive strategy is as follows: if many

variables are included it becomes less likely that variables that are both causes of the


missingness and the analytic variables with missing data are omitted. Such omission would

be harmful as it would destroy the conditional independence posited in MAR and induce

bias. ? (?) showed that bias in means, variances, and regression estimates can be substantial

if this kind of variable is omitted. A second rationale for adopting an inclusive strategy is

that the inclusion of variables that may not be causes of the missingness or causes of the

analytic variables with missing data, was shown to be “far from being harmful[,]...at worst

neutral, and at best extremely beneficial” (?, ?, p. 349). In particular ? (?) examined the

influence of including variables that are completely uncorrelated to missingness or analytic

variables with missing data (so called “trash variables”), or only related to analytic variables

with missing data but not with the missingness itself. Completely uncorrelated variables did

not have any impact on bias, and variables that were only correlated with Y , were shown to

be able to attenuate bias in MNAR situations and reduce standard errors.

Data-driven approach

Even if one fully acknowledges the benefits of an inclusive strategy, such a strategy can

reach its limits, especially when applied to large-scale datasets, which may contain hundreds

of variables. If analytic models include many variables and many auxiliary variables are

added, both MI and FIML will likely encounter problems in the convergence of models. To

mitigate this problem it has been suggested to examine data for the inclusion of variables as

auxiliaries. ? (?) suggest that variables make good candidates for auxiliary variables if they

are related to the missingness or the analytic variable that exhibits missingness. The

rationale behind this advice is straight-forward: a variable that is completely uncorrelated

with the probability of missing, cannot induce any dependencies between RY and Y .

Likewise, a variable that is completely uncorrelated with the analysis variable with missing

values can also not induce any dependencies between RY and Y . As a demonstration of this

principle, consider Figure 4 in which three auxiliary variables A1, A2, and A3 are added to a

model in which X d-connects Y and RY via Y ← X → RY and A1 d-connects Y and RY via


Y ← A1 → RY . The two variables A2 and A3 do not d-connect Y and RY and conditioning

on them is therefore not needed to render Y and RY conditionally independent and hence to

fulfill the MAR condition. Simply using X and A1 is sufficient in this example. 1

The data-driven approach advises us to screen our set of potential auxiliary variables

as to whether they are related (usually examined using correlations) with any of the analysis

variables, or any of the missing value indicator variables. Variables that are related to either

or both should be included as auxiliary variables, while variables that fall below a certain

correlation threshold to either, should not be used. Particular guidelines on the inclusion and

exclusion of auxiliary variables were formulated by ? (?), who recommend including a

variable if the correlation of it with either missingness or the variable with missing data

exceeds ± .1 (or any other chosen threshold, e.g., ? (?) suggests correlations with the

analysis variables greater than ± .4). The implicit assumption is that variables whose

correlations fall below the chosen threshold have little power to induce any

dependencies, whereas variables with higher correlations are assumed to induce biases in

the estimation of parameters in the presence of missing data.
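
The screening rule described above can be written down in a few lines. The following Python sketch is only an illustration: the function name `screen_auxiliaries`, the use of `None` to mark missing values, and the toy data are assumptions. Only the ± .1 threshold comes from the text. The correlation with the missingness indicator is an ordinary Pearson correlation with a 0/1 variable, that is, a point-biserial correlation.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equally long numeric lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def screen_auxiliaries(y, candidates, threshold=0.1):
    """Keep candidates correlated with Y (complete cases) or with its missingness."""
    missing = [1.0 if value is None else 0.0 for value in y]
    obs = [i for i, value in enumerate(y) if value is not None]
    selected = []
    for name, values in candidates.items():
        r_y = pearson([values[i] for i in obs], [y[i] for i in obs])
        r_missing = pearson(values, missing)
        if abs(r_y) > threshold or abs(r_missing) > threshold:
            selected.append(name)
    return selected

# Hypothetical toy data: None marks a missing value of Y.
y = [1.0, 2.0, None, 4.0, None, 6.0, 7.0, None]
candidates = {
    "related_to_y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "related_to_missingness": [0.1, -0.1, 1.0, 0.0, 1.0, 0.1, -0.1, 1.0],
    "unrelated": [1.0, 0.0, 1.0, 3.0, 1.0, 0.0, 1.0, 1.0],
}
selected = screen_auxiliaries(y, candidates)
print(selected)
```

Note that a rule of this kind looks only at the size of the correlations. It carries no information about why a candidate is correlated with Y or with the missingness indicator.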

Generally, the advice to include auxiliary variables in missing data problems is sound

and has, in both simulation studies (?, ?) and theoretical work (?, ?), been shown to be

useful. However, both the inclusive strategy and the data-driven approach ignore the

possibility that there are certain instances and classes of variables that should not be used as

auxiliary variables, because they induce bias in the estimation of parameters in the presence

of missing data, by destroying the conditional independence between Y and RY , hence

violating MAR. We now turn to these situations and variables and show, using illustrative

examples and simulations, that this bias can become potentially large, if ignored.

1 Note that if the disturbance terms of A2 and A3 were correlated (e.g., due to an unobserved variable

that has a relationship to both of these variables), an active path Y ← A2 ← εA2 ↔ εA3 → A3 → RY would

be present, which could be blocked by either conditioning on A2, A3, or both. Hence at least one of these

variables would need to be included in a FIML or imputation model.


Bias-enhancing auxiliary variables

Consider first a simplified illustrative example of a single variable Y with missing data,

a missing data indicator RY , and two potential auxiliary variables A1 and A2 that are at the

disposal of the applied researcher. In addition, two unobserved variables L1 and L2 are part

of the true data-generating model. The full model is displayed in Figure 5.

An initial reaction to this model might be that the unobserved variables L1 and L2

make this an MNAR situation and that some bias would be expected and is not surprising.

However, the situation is more subtle. Variable A1 indeed induces conditional dependencies

between Y and RY via the path Y ← A1 → RY and therefore biases the estimates of Y , in

the presence of missing data. Therefore, if one uses A1 as an auxiliary variable, bias due to

A1 will be eliminated, as the biasing path is blocked. Variable A2 on the other hand, even

though spuriously correlated with Y and RY , does not induce conditional dependencies via

the path Y ← L1 → A2 ← L2 → RY and therefore cannot bias the estimates of Y no matter

what values the constituent path coefficients would take on. This is because A2 is a collider

variable on this path, and not conditioning on it leaves this path closed and does not induce any

dependencies between Y and RY . What happens, however, when A2 is also used as an

auxiliary variable, along with A1? The inclusion of A2 will actually destroy the conditional

independence that was achieved earlier with the inclusion of A1 and induce an MNAR

situation. The path Y ← L1 → A2 ← L2 → RY that was initially blocked becomes open

when A2 is conditioned on (used as an auxiliary variable).

To illustrate this point further using data, we simulated a single dataset based on the

model in Figure 5. The data generation is fully described in the first simulation study below.

Briefly described, we chose a large sample size of n = 1000. All continuous variables were

multivariate normally distributed with mean of 0 and variance of 1. Path coefficients in the

model were completely standardized and the size of the path coefficients was chosen so that

the total R2 (or the respective McKelvey-Zavoina pseudo-R2 (?, ?)) of every single

dependent variable in the model (Y,A2, RY ) was identical to 50%. We chose the sign of the


path coefficients so that the direction of bias due to the omission of A1 and the bias due to

the inclusion of A2 was in the same direction and not incidentally offsetting each other. The

amount of missing data was set to 50%. We estimated the mean and standard deviation of

the variable Y using a listwise deletion approach, FIML estimation in Mplus (?, ?) and

lavaan (?, ?) using only A1 as the auxiliary variable, using only A2 as the auxiliary variable,

or using both A1 and A2 as auxiliary variables. Auxiliary variables in the FIML estimation

were included using the Mplus auxiliary command, which automatically fits a model

suggested by ? (?). We also used mice (?, ?) to generate 5 multiple imputations whose

results were pooled following standard recommendations (?, ?). As expected, and previously

reported by ? (?), results of FIML and MI did not differ substantially when the same set of

auxiliary variables were used. We only report results of the FIML estimation in Table 1.

In the single simulated dataset the completely observed data of Y had a mean of .03, and a

standard deviation (SD) of 1.00. When using listwise deletion, the mean of Y was .19, and

the SD was .98. Not surprisingly we observed bias in the means, as would be expected under

a MAR situation in which missingness was induced through a linear function of other

variables. Using A1 as an auxiliary variable and estimating the mean of Y with FIML

estimation yielded a mean of .06. Using A1 does a very good job of reducing bias. The

relative percent reduction of bias compared to the listwise model was 100 × (.19 − .06)/.19 ≈ 68%.

Using A2 as an auxiliary variable on the other hand actually increases bias! The estimated

mean of Y was now .30, with a resulting percent bias amplification of 100 × |.19 − .30|/.19 ≈ 58%

compared to the listwise results. Finally, when using both A1 and A2 as auxiliary variables,

the mean of Y was estimated to be .14, resulting in a bias reduction of a mere

100 × |.14 − .19|/.19 ≈ 26%. We observed that using both variables as auxiliary variables was worse

than using A1 alone. This result may not be obvious when considering the formulas for

MAR or MNAR, and in fact it goes counter to the advice that an auxiliary variable can be

at worst neutral. Clearly, this auxiliary variable was not neutral, but highly bias-inducing.

When one uses a graph to encode the structural relationships between the auxiliary variables


and missingness and analysis variables, respectively, this result however is expected and can

be directly seen by the fact that conditioning on A2 d-connects Y and RY by opening a

previously blocked path.
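
The percentages reported for the single dataset can be retraced from the estimated means. The short Python sketch below assumes that each percentage compares an estimate with the listwise mean of .19 and treats the population mean of 0 as the target. This reading reproduces the reported 68%, 58%, and 26%; measured against the complete-data sample mean of .03, the figures would differ. The function name is hypothetical.

```python
def percent_change_vs_listwise(listwise_mean, estimate):
    """Percent change of an estimated mean relative to the listwise estimate.

    Positive values indicate a smaller estimate than listwise deletion,
    negative values a larger one (population mean assumed to be 0).
    """
    return 100 * (listwise_mean - estimate) / listwise_mean

listwise = 0.19
fiml_means = {"A1": 0.06, "A2": 0.30, "A1 and A2": 0.14}   # values reported in the text
changes = {name: percent_change_vs_listwise(listwise, mean)
           for name, mean in fiml_means.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
```

The negative value for A2 corresponds to the bias amplification described above.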

A single simulated dataset is seldom a convincing argument; however, it can serve as a

starting point for a more developed argument. First, it shows that an auxiliary variable

can increase bias in the estimation of parameters in the presence of missing data. Second, a

bias-inducing variable cannot be distinguished from a helpful auxiliary variable by examining

correlations with analysis variables and missingness indicators. In fact, in this example, the

variable A2 posed as a perfectly innocent and potentially very helpful auxiliary variable. In

the complete dataset A2 was both significantly correlated with the analysis variable Y

(r = .26, p < .001) and the missing data indicator RY (point-biserial correlation

rpb = .25, p < .001). Using inclusion criteria that rely solely on correlations would incorrectly

lead to the inclusion of A2 in the set of auxiliary variables.

In addition, a simple example like this one helps to link what could simply be a

mathematical curiosity to an applied context. To make this illustrative example more

concrete, consider that Y , the variable with missing data, is a measure of mathematical

ability with a missingness indicator RY . For this example, we assume that MAR holds and

that there is no direct path from Y to RY . Variable A1 is a measure of motivation of the

participant that has been observed and is used in the analysis as a potential auxiliary

variable. Specifically, more motivated participants score higher on the math achievement

test, and are less likely to have missing data. Consider further that A2 is the income of the

participant, another variable that was assessed as part of the study. The two unobserved

variables L1 and L2 are IQ and gender of the participant, respectively. Note that we are

assuming in this model that IQ and gender are in fact uncorrelated (which seems like a

tenable assumption). The model further expresses that participants with higher IQ scores

also score higher on math achievement, and that participant’s gender has an influence on

missingness (maybe one gender group was more likely to skip certain items). While this


example is admittedly somewhat artificial due to its constrained nature, we believe that it is

not entirely implausible and suggests that auxiliary variables of the same type as A2 in our

example could in fact be lurking among seemingly benign potential auxiliary variables.

Henceforth, we will refer to these variables as collider auxiliary variables.
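
The graphical argument in this example can be checked mechanically. The following Python sketch encodes one reading of the example graph (A1 affects Y and RY directly; A2 is a common effect of the unobserved L1 and L2) and applies the d-separation rules to every path between Y and RY. The edge list and all function names are assumptions made for illustration; the sketch enumerates paths by brute force and is only meant for small graphs.

```python
# Hypothetical encoding of the example graph (parent -> children).
# A1 = motivation, A2 = income, L1 = IQ, L2 = gender, RY = missingness.
EDGES = {
    "A1": ["Y", "RY"],
    "L1": ["Y", "A2"],
    "L2": ["A2", "RY"],
}


def parents(node):
    return {p for p, kids in EDGES.items() if node in kids}


def children(node):
    return set(EDGES.get(node, []))


def descendants(node):
    found, stack = set(), [node]
    while stack:
        for child in children(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, goal, path=None):
    """Yield all simple paths between two nodes, ignoring edge direction."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in parents(start) | children(start):
        if nxt not in path:
            yield from simple_paths(nxt, goal, path)


def path_is_open(path, given):
    """Apply the d-separation rules to every interior node of one path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = prev in parents(node) and nxt in parents(node)
        if is_collider:
            # A collider blocks unless it or a descendant is conditioned on.
            if not ({node} | descendants(node)) & given:
                return False
        elif node in given:
            # A chain or fork node blocks when it is conditioned on.
            return False
    return True


def d_separated(x, y, given):
    return not any(path_is_open(p, set(given)) for p in simple_paths(x, y))


print(d_separated("Y", "RY", {"A1"}))        # True: A1 blocks the only open path
print(d_separated("Y", "RY", {"A1", "A2"}))  # False: A2 opens the path via L1 and L2
```

With this encoding, conditioning on A1 alone d-separates Y and RY, whereas adding A2 to the conditioning set d-connects them again.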

Research questions

Having established in a single example that auxiliary variables can induce bias, we set

out to answer several research questions.

1. First, we are interested in the absolute magnitude of bias that can be induced when

using collider auxiliary variables as a function of the magnitude of the constituent paths that

connect a collider auxiliary variable to missingness and analysis variables. In addition, we

want to put this magnitude into context and contrast it with bias that is induced due to the

omission of a helpful auxiliary variable. This latter form of bias has been examined before

and we only include it to provide a benchmark for the bias that we expect to observe with

the inclusion of a collider auxiliary variable. Earlier research by ? (?) in the area of

confounding in causal inference suggests that the magnitude of bias due to conditioning on a

collider, especially of the kind that we presented in our example, is usually smaller than

omitting a confounder. We therefore suspect that bias due to including a collider variable as

an auxiliary variable will be noticeable, but smaller in magnitude than omitting a true

confounding auxiliary variable (i.e., a variable that is directly or indirectly causing both

missingness and analysis variables with missing data).

2. The second research question examines behavior of auxiliary variables in data

situations that are inherently MNAR. In the MAR cases considered in the first simulation

study, the conditional independence between missingness and analysis variables with missing

data can always be created by using some observed variables. Hence there is an expectation

that including the collider auxiliary variable will necessarily increase bias by disturbing the

conditional independence. In the MNAR case collider auxiliary variables are expected to


behave differently, insofar as the relationship that they induce between the missingness and

variables with missing data can either enhance or reduce the already existing relationship

between missingness and analysis variables with missing data. In a similar fashion, we will

also explore the behavior of auxiliary variables that are directly related to both missingness

and analysis variables.

Simulation studies

Simulation study 1.1

Our first simulation study explores the absolute magnitude of bias that can be induced

when using a collider auxiliary variable in a MAR situation. The simulation study roughly

followed ? (?), in terms of data-generation and evaluation criteria. Generally speaking, data

are first generated under a specific model, then missing data are imposed based on a

described mechanism, then parameters are estimated using listwise deletion and FIML with

auxiliary variables. Lastly, results of replications are pooled within condition and

performance criteria assessed. While it is possible to examine bias in many different

parameters of interest (means, variances, skew, regression coefficients, factor loadings, etc.),

we only focus on estimates of the population mean. The reason behind this choice was that

mean responses (potentially across different groups) are still one of the most widely used

measures to describe research phenomena in the social sciences. The examination of

regression coefficients is left to future studies and is briefly mentioned in the discussion.

Data generation and analysis. The data-generating model for simulation 1.1 is

shown in Figure 6. In the model, a single independent variable Y is generated with missing

data, indicated by RY . Auxiliary variable A1 is spuriously correlated with the probability of

missing and the outcome Y , via two unobserved, uncorrelated variables L1 and L2. In the

model Y and RY are d-separated but become d-connected as soon as A1, the collider

auxiliary variable, is used in MI or FIML. All continuous variables were multivariate

normally distributed and completely standardized by fixing the total variance of each


variable to 1 and setting means to 0. We did not vary sample size, but chose a single

constant sample of 500. This single sample size was also chosen by other authors in similar

simulations (?, ?, ?), as a somewhat large, but still reasonable sample size to consider.

Furthermore, changes in sample size usually yield predictable results when other factors are

held constant, namely that standard errors decrease with increased sample size. We also did

not vary the amount of missing data, but fixed it at a relatively high value of 30%, which

was in-between the two values chosen by ? (?). Varying the amount of missing data is often

not very informative, as such variation has previously been shown to yield predictable

results (bias gets worse as the amount of missing data increases). All path coefficients in the

data-generating model, labeled α, were chosen so that the uniquely explained variance in the

outcome variable that these paths were connected to was set to a particular value. Path

coefficients were set at 0, .224, .387, .500, .592, and .671. This corresponds to uniquely

explained variance of 0%, 5%, 15%, 25%, 35%, and 45%, respectively. See the Appendix for

details on how missingness was generated and how explained variance in RY was defined.

Finally, we varied the sign of the coefficient labeled α? (positive or negative). This sign

change of a single path of the constituent paths of the collider auxiliary variable does not

alter the magnitude of the bias that is induced, but alters the direction. Note that it is not

of importance which of the four paths α is varied in sign, because the direction of bias is

determined by the product of all four constituent paths (?, ?). Note also that conditions

in which all paths were set to 0 correspond to a pure MCAR condition. In this simulation

design we varied all paths labeled α simultaneously. Our primary interest was to observe

overall bias and not bias due to differential changes in constituent paths. This simulation

design thus yielded 5 conditions with a positive sign, 5 conditions with a negative sign, and

one condition in which all paths were set to 0, for a total of 11 conditions. We replicated

each condition 1000 times. All simulations were conducted using R (?, ?) and the following

packages: lavaan (?, ?), MASS (?, ?), mice (?, ?), MplusAutomation (?, ?), and plyr (?, ?).

For the generation of graphs we used ggplot2 (?, ?) and tikzdevice (?, ?).
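
To make the data-generating steps above concrete, the following Python sketch draws one large sample from a collider design of this kind. It is not the authors' code (the study itself was run in R), and several details are assumptions: the names are hypothetical, missingness is imposed by thresholding a latent normal propensity at its 70th percentile rather than by the mechanism described in the Appendix, and FIML with one auxiliary variable is replaced by the regression estimator, which is the maximum likelihood estimate of the mean under a bivariate normal model when the auxiliary variable is completely observed. A single large sample is used instead of 1000 replications of n = 500, so that the bias is visible without averaging.

```python
import numpy as np
from statistics import NormalDist

# Explained variance levels from the text and the implied path coefficients.
EXPLAINED = [0.0, 0.05, 0.15, 0.25, 0.35, 0.45]
PATHS = [round(v ** 0.5, 3) for v in EXPLAINED]  # 0, .224, .387, .5, .592, .671


def simulate_collider(alpha, n=200_000, prop_missing=0.30, seed=1):
    """One large sample from a collider design (all details assumed)."""
    rng = np.random.default_rng(seed)
    l1, l2 = rng.standard_normal(n), rng.standard_normal(n)
    # Residual variances are chosen so that every variable has variance 1.
    y = alpha * l1 + np.sqrt(1 - alpha**2) * rng.standard_normal(n)
    a1 = alpha * (l1 + l2) + np.sqrt(1 - 2 * alpha**2) * rng.standard_normal(n)
    r_latent = alpha * l2 + np.sqrt(1 - alpha**2) * rng.standard_normal(n)
    # Assumption: Y is missing when the latent propensity is in the top 30%.
    observed = r_latent < NormalDist().inv_cdf(1 - prop_missing)

    listwise_mean = y[observed].mean()
    # Regression estimator: ML estimate of the mean of Y with A1 as the
    # auxiliary variable under a bivariate normal model.
    slope = np.cov(a1[observed], y[observed])[0, 1] / a1[observed].var(ddof=1)
    ml_mean = listwise_mean + slope * (a1.mean() - a1[observed].mean())
    return listwise_mean, ml_mean


listwise, with_collider = simulate_collider(alpha=0.5)
print(f"listwise: {listwise:.3f}, with collider A1: {with_collider:.3f}")
```

In this sketch the listwise mean should stay close to the true value of 0, while the estimate that uses the collider is shifted away from it. The direction of the shift depends on how missingness is coded here and need not match the sign conventions of the figures.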


Performance measures. In order to analyze the results of our simulation study, we

assess a range of standard criteria commonly employed in simulation studies.

1. We assessed standardized bias in the estimates (mean, variance) of variables with

missing data, defined identically to ? (?) as raw bias (average parameter estimate across

replications minus true parameter value) divided by the standard error, defined as the

standard deviation across all replication estimates. ? (?) gives a rule of thumb that absolute

values of .4 or higher are worrisome on the standardized bias metric.

2. We recorded the precision of the estimates defined as the average standard error

across all replications. In general it is desirable to have estimates with smaller standard

errors, and hence narrower confidence intervals and more precise estimates.

3. We computed the root mean squared error (RMSE) defined as the square root of the

average squared difference between a parameter estimate and the true value of the parameter.

4. Lastly, we observed coverage rates, defined as the percentage of replications whose

95% confidence interval included the true parameter value. Ideally, one observes 95%

coverage rates, as this would indicate that the confidence intervals of the estimator are in the

long run accurately capturing the true parameter and have the nominal α error rate. Again,

relying on rules of thumb by ? (?), we regard coverage rates below 90% as worrisome.
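
The four criteria listed above can be computed from the replication results of one condition as in the following Python sketch. The function name `performance`, the use of the sample standard deviation (ddof = 1), and the Wald-type interval with z = 1.96 are assumptions for illustration; the original study may have differed in these details.

```python
import numpy as np


def performance(estimates, std_errors, truth, z=1.96):
    """Summarize replication results for one simulation condition."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    raw_bias = estimates.mean() - truth
    # Standardized bias: raw bias over the empirical SD of the estimates.
    standardized_bias = raw_bias / estimates.std(ddof=1)
    # Precision: average estimated standard error across replications.
    mean_se = std_errors.mean()
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    # Coverage: share of 95% Wald intervals that contain the true value.
    lower, upper = estimates - z * std_errors, estimates + z * std_errors
    coverage = np.mean((lower <= truth) & (truth <= upper))
    return {
        "standardized_bias": standardized_bias,
        "mean_se": mean_se,
        "rmse": rmse,
        "coverage": coverage,
    }


# Toy input: four replications of a mean whose true value is 0.
print(performance([0.1, -0.1, 0.3, 0.1], [0.1, 0.1, 0.1, 0.1], truth=0.0))
```

For the toy input, three of the four intervals contain the true value, so coverage is .75, which would count as worrisome under the 90% rule of thumb.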

Results of simulation study 1.1. The complete results are shown in Table B1 in

the Appendix. In order to communicate the most important findings, we display the

amount of standardized bias in the means in Figure 7, and coverage values in Figure 8. Both

figures show that the listwise model is unbiased and has perfect coverage across all

conditions. The inclusion of A1 as an auxiliary variable in the FIML estimation induced bias

in the mean, as would be expected based on missingness patterns that are imposed in a

linear fashion. Bias emerges in all conditions that used FIML, except the one in which all

paths labeled α are set to 0 (the MCAR condition). Note that this is true even though

variable A1 is related to both Y and RY and would be included as an auxiliary variable

under all current recommendations to achieve MAR. The general pattern as seen in Figures 7

and 8 is that increases in the amount of explained variance yield monotonic increases in bias.

Little to no bias is observed in conditions of weak path coefficients and stronger biases are

observed in more extreme conditions. The standardized bias (and other performance

measures) reaches a critical threshold, based on the rules of thumb by (?, ?), when path

coefficients are so strong that they explain slightly less than 25% of the variance. Bias in

conditions with even stronger effects is so large that confidence intervals approach 40%

coverage. Also, not surprisingly, the direction of bias changes when the coefficient

α? changes its sign. In conditions in which the sign is negative, positive bias is induced due

to the inclusion of the collider auxiliary variable, and negative bias is induced when the path

coefficient has a positive sign. The results of this simulation clearly show that

an auxiliary variable, even though it exhibits strong correlations with missingness and

analysis variables, can increase bias. This somewhat surprising result is evident from the

graphical model, in which we can see that A1 is a collider auxiliary variable which will

induce a bias in the path from Y to RY .


Simulation study 1.2

To put the results of the first simulation study into a broader context, we performed a

second simulation study that was essentially a replication of earlier findings that an omitted

variable that has an effect on both missingness and analysis variables with missing data can

bias estimates. While this simulation study by itself does not give us any new insights, we

performed this study to address the second part of our research question 1, aimed at exploring whether the

magnitude of bias due to inclusion of a bias-inducing collider auxiliary variable is similar in

strength to the bias due to omission of a potentially more helpful auxiliary variable. We replicated the first

simulation study using the exact same values of explained variance in our data-generating

model, but changed the role of the collider auxiliary variable to an auxiliary variable that

has direct influences on both missingness and analysis variables.


Data generation and analysis. The data-generating model for simulation 1.2 is

shown in Figure 9. In this model, a single independent variable Y is generated with missing

data, indicated by RY . This time, an auxiliary variable A2 is directly affecting both Y and

RY , thus d-connecting the two variables. The graphical criterion therefore tells us that A2 is

a bias-reducing variable that should be used in the FIML estimation. The generation of all

variables was identical to simulation study 1.1. The unique explained variance of each effect

labeled β was also identical to the previous simulation and set to 0%, 5%, 15%, 25%, 35%, and

45%. Again, we varied the sign of the path labeled β?, for a total of 11 simulation conditions.

Results of simulation study 1.2. Table B2 in the Appendix lists the complete

results of the second simulation study. To visualize our main findings we present

standardized bias in the means and coverage rates of means in Figure 10 and Figure 11 for

all conditions. In this simulation we observe a slightly different pattern than in the

previous simulation. Not surprisingly and shown previously by other researchers, the listwise

model is biased in the parameter estimates of the means, and in the more extreme cases even

in the variance of Y (not shown in the figures, but in Table B2). The FIML model that included A2

is virtually unbiased in all conditions and has perfect coverage, because the true

data-generating mechanism of the missingness is captured. Several important observations

can be made. First, the bias that is induced through the omission of a helpful auxiliary

variable is larger in magnitude in comparison with the inclusion of a bias-inducing collider

auxiliary variable. This can also be observed when examining coverage rates that drop much

more dramatically than in the case of an included collider auxiliary variable. For example, in

the condition with 25% explained variance, the standardized bias in the previous simulation

was .61, whereas in this simulation with an omitted and helpful auxiliary variable, the bias is

2.45. A second observation is that the direction of bias is flipped compared to the results of

the previous study. A negative sign of the path coefficient labeled with a ? yielded negative

bias, and likewise a positive path coefficient yielded positive bias.


Intermediate summary of results of simulation study 1

We have shown that in cases that are not MNAR, bias can be induced through the

inclusion of auxiliary variables in a FIML estimation framework. The fact that an auxiliary

variable can actually make bias worse in parameter estimates in the presence of missing data

is a novel point that is not addressed by the currently practiced approaches of including

auxiliary variables. It also provides a counter-argument to a claim that is sometimes brought forth in

defense of including many variables, namely that as soon as the explained variance in the

missingness or the outcome variable gets very large, there is no more room for any potential

biasing influences. This is clearly wrong, as our simulation examined cases in which

explained variance through the inclusion of a collider auxiliary variable was very large and

yet bias increased.

In our simulation studies this bias seemed to become problematic (as assessed through

rules of thumb of standardized bias and coverage) as soon as the explained variance of the

unobserved variables associated with the collider auxiliary variable crossed a threshold of

slightly less than 25%. On a correlation metric we therefore would have to observe

correlations in the magnitude of approximately .4 to .5. While this may seem very high, it is

important to remember that in our simulation studies there was only a single collider

auxiliary variable with only 2 unobserved variables, while in reality there could be a

multitude of both colliders and unobserved variables, especially if one is considering

psychological constructs that are often multiply caused. Those taken together might be able

to explain more variance and potentially make the inclusion of collider auxiliary variables

more problematic. However, the second simulation study also demonstrated that the bias

that is observed due to the inclusion of a collider auxiliary variable is much smaller than the

bias observed due to the omission of an auxiliary variable that has directional effects on both

missingness and analysis variables with missing data. In our simulation setup we observed

troublesome levels of bias, as soon as the omitted auxiliary variable explained slightly less

than 15% of the variance in the related variables, which translates to correlations of


approximately .3 to .4.

These intermediate results should not give the impression that listwise deletion is

generally preferable over MI or FIML models with auxiliary variables, as may erroneously be

believed based on the result of the first simulation study. However, they show that inclusion of

auxiliary variables does not always mitigate bias, but can enhance it and that researchers

should be aware of picking good auxiliary variables. We discuss some strategies later in the

discussion.


Simulation study 2.1

In our second set of simulation studies we explore how collider and other auxiliary

variables behave in the presence of data that are inherently MNAR. Simulations that assume

that data are MNAR are probably more realistic, because in real applications some degree of

MNAR, even though it might be small, is often likely.

Data generation and analysis. Our data-generating model for the first simulation

study in the second set, depicted in Figure 12, had the same sample size and number of

replications as simulation study 1.1. The notable difference was that data were simulated

under an MNAR scheme (indicated through the direct paths labeled γ from an unobserved

variable U1 to both Y and RY ). The direct effects γ were always positive in sign and held

constant at 20% explained variance, thus indicating a moderately strong degree of MNAR

missingness, in comparison with the range of explained variance for the auxiliary variables.

The strength of the paths labeled α was varied over the same levels as in simulation 1.1,

including the changing of signs of the path labeled α?. The total number of conditions in the

simulation was again 11.

Results of simulation study 2.1. The complete results of simulation study 2.1 are

shown in the Appendix in Table B3. We report the most important findings on the

standardized bias of the means and coverage of means in Figure 13 and Figure 14. The

listwise model displayed a relatively constant, and high amount of bias across all conditions


with no particular relationship to the strength or sign of α. The pure MNAR bias due to the

unobserved variable U1 is at around 2 on the standardized bias metric and corresponds to

coverage levels of approximately 50%. Bias was mostly induced in the estimate of the mean,

as was expected under a linear MNAR situation.

The FIML model that included the auxiliary collider variable showed a more

interesting pattern. The overall shape of results for the standardized bias in Figure 13 looked

almost identical to previous results, as if only shifted along the y-axis. However, there is an

important difference that becomes obvious in Figure 14 that displays coverage rates. Because

there is a constant MNAR bias, the inclusion of variable A1 in the FIML model now either

reduces or increases bias. In particular, if the sign of α? (and therefore the product of the

constituent paths) was positive, bias in the means was attenuated due to the inclusion of A1.

On the other hand, if α? was negative in sign, the inclusion of A1 as an auxiliary variable

increased bias of parameter estimates, making it even worse than the listwise model. The

Figure that displays coverage rates makes this differential effect easily visible. While

coverage stays relatively constant for the listwise model, it now increases monotonically for

the FIML model in all conditions with a positive sign, and decreases monotonically in all

conditions with a negative sign.
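
The sign dependence described above can be illustrated with a small Python sketch that adds an unobserved common cause U1 of Y and the missingness propensity to a collider design. As before, this is a hedged illustration and not the authors' implementation: names are hypothetical, missingness is imposed by thresholding a latent normal propensity, and the regression estimator (the maximum likelihood estimate of the mean under a bivariate normal model with a completely observed auxiliary variable) stands in for FIML.

```python
import numpy as np
from statistics import NormalDist


def simulate_mnar_collider(sign, alpha=0.5, gamma=0.2**0.5, n=200_000, seed=1):
    """Collider auxiliary variable plus an unobserved cause U1 of Y and RY.

    `sign` (+1 or -1) flips one constituent path of the collider. All
    distributional details are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    u1, l1, l2 = (rng.standard_normal(n) for _ in range(3))
    resid = np.sqrt(1 - alpha**2 - gamma**2)
    y = gamma * u1 + alpha * l1 + resid * rng.standard_normal(n)
    r_latent = gamma * u1 + alpha * l2 + resid * rng.standard_normal(n)
    a1 = sign * alpha * l1 + alpha * l2
    a1 = a1 + np.sqrt(1 - 2 * alpha**2) * rng.standard_normal(n)
    # Assumption: the 30% of cases with the highest propensity lose Y.
    observed = r_latent < NormalDist().inv_cdf(0.70)

    listwise_mean = y[observed].mean()
    slope = np.cov(a1[observed], y[observed])[0, 1] / a1[observed].var(ddof=1)
    ml_mean = listwise_mean + slope * (a1.mean() - a1[observed].mean())
    return listwise_mean, ml_mean


for sign in (+1, -1):
    lw, ml = simulate_mnar_collider(sign)
    print(f"sign {sign:+d}: listwise {lw:.3f}, with collider {ml:.3f}")
```

Under these assumptions the listwise estimate is biased in both runs because of U1. Including the collider moves the estimate toward the true mean of 0 when the flipped path is positive and further away from it when the path is negative.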

Bias in the estimates of the standard deviation was also observed, but much smaller in

absolute magnitude than the bias observed in means. For standard deviations, we observed

that bias in the listwise models was constant across all conditions and in the magnitude of .3

on the standardized bias metric, with corresponding coverage at around 90%. With weak

constituent paths, the bias that was observed in the FIML model was identical to the

listwise model. With very strong constituent paths, the FIML model eliminated the small

bias in standard deviations and recovered the true parameter values.



Simulation study 2.2

In simulation study 2.2 we examined the performance of inclusion and exclusion of an

auxiliary variable that has direct effects on both missingness and analysis variables, when

the missing mechanism is MNAR.

Data generation and analysis. The data-generating model is shown in Figure 15

and follows the same general pattern as simulation study 1.2 with the difference that the

direct paths γ were added to induce an MNAR situation. The amount of explained variance

for paths β and γ was identical to that in simulation studies 1.2 and 2.1, respectively. Therefore, while being

structurally different, this simulation mimicked previous simulations with regard to strength of

pure MNAR bias and explained variance of auxiliary variables.

Results of simulation study 2.2. The results of simulation study 2.2 are given in

full in Table B4 in the Appendix. We again present the main results in Figure 16 and

Figure 17, displaying standardized bias of the estimate of the mean and coverage rates,

respectively. We observe that the listwise model showed a pattern that consisted of regions

of extreme positive bias in the means, no bias, and some negative bias, depending on the sign

and magnitude of the path coefficients of the auxiliary variable A2. In conditions in which

the explained variance of the auxiliary variable was 0%, standardized bias was around 1.95,

which is similar to the amount of bias that was observed in the first MNAR simulation.

When the strength of the relationship of A2 to missingness and outcomes was increased, the

amount of bias changed, again depending on the sign of the coefficient. If the sign

of the coefficient was positive (and therefore the product of constituent paths was positive)

bias increased to very high levels (standardized bias larger than 6 in the most extreme

conditions). If on the other hand the sign of the path coefficient of the auxiliary variable was

negative, bias decreased, going towards zero, and then increasing again but in the opposite

direction. This pattern is also visualized in the Figure that presents coverage rates. We

observe that the listwise model that excluded A2 had coverage of around 50% in the

condition in which path coefficients were set to 0. Increases in the magnitude of path

coefficients with a positive sign quickly deteriorated coverage, all the way to 0%.

Increasing path coefficients in the presence of a negative sign first decreased bias, and

coverage rates approached the unbiased ideal of 95%, at about 25% of explained variance.

After this region where the bias due to the unobserved variable U1 and the omitted auxiliary

variable A2 canceled each other out, the bias from omitting A2 dominated and bias in the

opposite direction was observed and coverage levels dropped again.

The FIML model showed a somewhat stable amount of bias and coverage, however

with the interesting observation that bias increased in more extreme regions of explained

variance. This is visualized in Figure 16, in which the line for bias under FIML slopes

slightly upward at both ends. In Figure 17 we see this pattern even more clearly, as coverage rates

drop from 50% at the center of the graph to around 30% at the extreme regions. This

phenomenon of residual bias-amplification has been described previously in the context of

instrumental variable models (?, ?, ?, ?) and it is clearly visible here in the context of

missing data problems as well. What has been shown in the context of instrumental

variables is that any bias of a relationship between two variables (in our case Y and RY ) is

amplified as soon as variables are introduced that explain some of the variance in the

explanatory variable of the two-variable relationship. ? (?) showed that bias amplification is

equal to a factor of 1/(1 − R²), where R² is the explained variance of Y in our case. In our

example, the inclusion of A2 explains variance in Y and therefore any bias that is due to U1

gets amplified monotonically, as the explained variance in Y due to A2 increases. Note that

in simulation 2.1 this phenomenon was also observed, but so attenuated as to be virtually

unnoticeable. This is due to the fact that the explained variance in Y due to the inclusion of

the collider auxiliary variable A1 is much smaller, because the explained variance of A1 in Y

is itself only based on the induced relationship between A1 and Y through L1.
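
To give a sense of scale for the amplification factor 1/(1 − R²) mentioned above, the following Python sketch evaluates it at the explained-variance levels used for the auxiliary variable in the simulations. The function name is hypothetical, and the sketch only evaluates the stated formula; it does not reproduce the simulation results.

```python
def amplification_factor(r_squared):
    """Bias amplification factor 1 / (1 - R^2) as stated in the text."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1 / (1 - r_squared)


# Explained-variance levels used for the auxiliary variable in the simulations.
for r2 in (0.0, 0.05, 0.15, 0.25, 0.35, 0.45):
    print(f"R^2 = {r2:.2f} -> factor {amplification_factor(r2):.2f}")
```

The factor grows from 1.00 at R² = 0 to about 1.33 at R² = .25 and about 1.82 at R² = .45, which is in line with the bias under FIML sloping upward at both ends.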

Finally, we also observed biases in standard deviations of Y that were meaningfully

larger than in previous simulations. These results are not central to our work, but are

described in more detail in the Appendix.


Intermediate results summary of simulation study 2

Simulation study 2 provided evidence that auxiliary variables in the presence of MNAR

can reduce or increase bias, depending on the sign of the constituent paths of the auxiliary

variable. Given certain constellations of relationships and signs of coefficients, it is beneficial

to exclude auxiliary variables to reduce bias, while in others it is highly beneficial to include them. In

particular, if an MNAR situation exists that is believed to induce a positive relationship

between missingness and variables with missing data (i.e., participants with high values on a

variable are also more likely to be missing on this variable), it is beneficial to include an

auxiliary collider variable if the product of the constituent paths is also positive (as this

induces negative and therefore offsetting bias). However, this is only true, as long as the

induced bias is not so strong that it becomes larger than the original bias that is due to

unobserved variables. Bias is increased with the inclusion of a collider auxiliary variable if

the sign of the product of constituent paths is opposite to the sign of the induced

relationship due to unobserved variables. In these cases, it is better to exclude this auxiliary

collider variable.

A similar pattern emerged in the presence of MNAR and an auxiliary variable that has

direct effects on missingness and outcomes. If the existing MNAR bias due to unobserved

variables is believed to be positive, it is beneficial to include an auxiliary variable, if the

product of the constituent paths is also positive. Exclusion of such a variable will always

compound existing bias. If the sign of the product of the auxiliary variable is in the opposite

direction than the bias due to unobserved variables it can be beneficial to exclude this

variable, namely whenever one can assume that the biases cancel each other out. However, it

can also happen that bias due to omission is so strong in the opposite direction that it

becomes larger than the bias due to the unobserved variables.

The results of this last set of simulations add an important piece of information,

namely that bias can be increased in the presence of MNAR, even if an auxiliary variable is

added that is directly related to missingness and outcomes. The special collider structure


that we discussed as bias-inducing in our first set of simulations is not even necessary to

induce biases due to inclusion of auxiliary variables. As soon as MNAR is present, the

bias-reducing or bias-increasing properties of auxiliary variables are dependent on the sign of the

constituent paths of auxiliary variables, which makes it exceedingly hard for an applied

researcher to exactly know whether any given variable may help or hurt in the mitigation of

missing data bias. As we explain in more detail in the discussion section, matters become

even more complicated when we allow correlations among observed and unobserved variables.

Finally, the results of the second study should not be misinterpreted to mean that a positive

sign of the product of constituent paths is inherently better than a negative sign, or vice

versa. The reason why in some conditions the positive sign reduced bias was only due to the

fact that the paths labeled γ were set to a positive value; we could easily rerun all

simulations and change the sign of γ, and observe reversal of the sign of standardized biases.

Again, the results should under no circumstances be misinterpreted to mean that a listwise deletion

approach is inherently superior to FIML or MI. In fact, in many applied circumstances, an

applied researcher might have good reasons to believe that the auxiliary variables at hand

exhibit direct effects on missingness and variables with missing data, that MAR holds, and

that therefore variables should be included. However, as we have shown, there are sets of

plausible situations in which it is indeed better to not include an auxiliary variable, contrary

to common suggestions.

Discussion

The overarching picture that emerged from our study is that the effects of auxiliary

variables cannot be easily described in a single statement, and even less so in a simple and

universally applicable rule or recommendation of inclusion or exclusion of auxiliary variables.

We have demonstrated through several examples that auxiliary variables can increase biases

both in the presence of MAR or MNAR, in some conditions substantially so. Specifically,

when MAR is believed to hold, auxiliary variables that have a collider structure increase bias.


Under MNAR, any variable can theoretically increase or decrease bias, depending on

sign and magnitude of the effects of both observed and unobserved variables that are related to

missingness and variables with missing data. What, however, does this imply for an applied

researcher who is faced with a missing data problem?

Recommendations

The orthodox recommendation for the selection of auxiliary variables in MI or FIML is

to either take all available covariates or select them based on their observed correlation with

the missingness or the outcome variables. Our study has shown that neither approach

guarantees that only bias-reducing variables are selected. Further, neither approach

guarantees that following the recommendation ultimately leads to the best possible estimate

of parameters in the presence of missing data. Using all available variables as auxiliary

variables may include bias-inducing variables, and relying on correlational evidence is not

sufficient to distinguish between bias-inducing and bias-reducing variables.

In theory one could always identify the best possible set of auxiliary variables, by

examining a graphical model and, in the case of MAR, selecting those variables that

d-separate the missingness indicator and the variables with missing data, thus fulfilling the

conditional independence that needs to hold. In MNAR situations one could quantify the

amount of induced covariance due to observed and unobserved variables using path tracing

rules and knowledge about sign and magnitude of path coefficients and then select those

variables that minimize bias. However, it is highly unlikely that applied researchers have

good qualitative knowledge about relationships among auxiliary variables, let alone

quantitative knowledge about the magnitude of such relations. This would suggest the rather

pessimistic perspective that reducing bias due to missing data is impossible in practice. We

argue that it is indeed non-trivial to select auxiliary variables, but hope that some of our

results can aid in the process. Two results are especially useful in this regard: the fact

that bias-induction can be assumed under certain conditions, and that the magnitude of bias due

to omission tends on average to be larger than bias due to inclusion.
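As an illustration of the d-separation check described above, the following minimal Python sketch tests whether Y and RY are d-separated given a candidate set of auxiliary variables. It is not taken from the study: the function names (`ancestors`, `d_separated`) are hypothetical, the test uses the standard moralized-ancestral-graph criterion, and the graph is an assumed encoding of the structure in Figure 5 (A1 affecting both Y and RY; A2 as a collider of the unobserved L1 and L2, with L1 affecting Y and L2 affecting RY).

```python
from collections import deque

# Assumed encoding of Figure 5: each key is a node, each value lists its parents.
PARENTS = {
    "Y": ["A1", "L1"],
    "RY": ["A1", "L2"],
    "A2": ["L1", "L2"],   # A2 is a collider of the unobserved L1 and L2
}


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` (moralized ancestral graph test)."""
    given = set(given)
    keep = ancestors(parents, {x, y} | given)
    # Build the undirected moral graph restricted to the ancestral set.
    adj = {v: set() for v in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:                      # link each parent to its child
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pa):        # "marry" parents of a common child
            for q in pa[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Search from x without passing through conditioned nodes.
    seen, queue = {x}, deque([x])
    while queue:
        for nb in adj[queue.popleft()]:
            if nb == y:
                return False
            if nb not in seen and nb not in given:
                seen.add(nb)
                queue.append(nb)
    return True


for aux in ([], ["A1"], ["A2"], ["A1", "A2"]):
    print(aux, d_separated(PARENTS, "Y", "RY", aux))
```

Under this assumed graph, conditioning on A1 alone separates Y from RY, whereas adding the collider A2 reconnects them through L1 and L2.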

We argue that if researchers happen to have very specific knowledge about their

auxiliary variables, this knowledge should be used. For example, if a researcher has good

reasons to believe that MAR holds and assumes that an auxiliary variable is only related to

missingness and analysis variables due to spurious relations, and is itself not related to any

other bias-inducing variables, then it would be best to exclude this variable. On the other

hand, if the researcher believes that direct effects are more plausible, then inclusion of this

auxiliary variable is the best course of action. These decisions presuppose that researchers in

fact think carefully about their auxiliary variables, which might not always be easy due to

lack of theoretical knowledge. However, this might still be preferable to a weakly argued appeal

to MAR and blindly adding auxiliary variables to one’s imputation model. In general, it

would be very desirable if stronger arguments for the plausibility of MAR were brought

forth, be it in the form of written arguments about relations among auxiliary variables

or through graphical models. Tacitly assuming that MAR holds (possibly with claims of

large explained variance) should not be considered a defensible strategy.

The second result that may prove useful for applied researchers is that (especially in

the case of MAR) the bias due to omission of a useful auxiliary variable and the bias due to

inclusion of a collider auxiliary variable are not symmetrical. As our studies suggest, the

former seems to outweigh the latter. This means that if an applied researcher who observes

a correlation between an auxiliary variable and missingness is unsure whether this variable

may be bias-inducing or bias-reducing, it may more often be beneficial to include it. This may

mean that one ends up at an inclusive strategy in which all potential auxiliary variables are

used. An important difference though is that one arrives at this solution through careful

consideration of auxiliary variables and thus presumably can provide a stronger theoretical

argument in favor of MAR.


Objections and limitations

Several objections might be raised to the graphical models we presented in general and

the existence of bias-inducing variables in particular. First, one might question whether the

unobserved variables that we posited in our simulation studies could in any situation be

potentially observable and in fact could be used as auxiliary variables in an inclusive fashion.

If the unobserved variables L1 and L2 in our examples were in fact observed, then it is true

that bias due to the collider auxiliary variable would vanish. Unfortunately, it is not trivial

to establish whether those unobserved variables are truly captured, or to rule out that

additional such unobserved variables exist.
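The first objection can be illustrated with a small simulation. The sketch below is not the simulation code of the study: the sample size, the path coefficient `a`, the helper `imputed_mean`, and the use of single regression imputation as a simple stand-in for FIML or MI are all assumptions made for illustration. It generates data from an assumed collider structure (L1 affects Y and A2, L2 affects A2 and missingness) and compares the estimated mean of Y (true value 0) under listwise deletion, imputation from the collider A2 alone, and imputation from A2 together with L1 and L2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, a = 200_000, 0.6   # hypothetical sample size and path coefficient

# Assumed collider structure: L1 -> Y, L1 -> A2 <- L2, L2 -> missingness.
L1 = rng.standard_normal(n)
L2 = rng.standard_normal(n)
A2 = a * L1 + a * L2 + np.sqrt(1 - 2 * a**2) * rng.standard_normal(n)
Y = a * L1 + np.sqrt(1 - a**2) * rng.standard_normal(n)   # true mean of Y is 0
propensity = a * L2 + np.sqrt(1 - a**2) * rng.standard_normal(n)
# Assumption: the lowest 30% of the latent propensity are flagged as missing.
missing = propensity <= np.quantile(propensity, 0.30)


def imputed_mean(y, missing, predictors):
    """Mean of y after regression imputation fitted on the complete cases."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)
    return np.where(missing, X @ beta, y).mean()


listwise = Y[~missing].mean()
with_collider = imputed_mean(Y, missing, [A2])
with_all = imputed_mean(Y, missing, [A2, L1, L2])

print(f"listwise mean:              {listwise: .3f}")
print(f"imputed with A2 only:       {with_collider: .3f}")
print(f"imputed with A2, L1 and L2: {with_all: .3f}")
```

In this setup the listwise estimate should stay close to zero, the estimate that conditions only on the collider should be shifted away from zero, and the shift should disappear once L1 and L2 are available as predictors.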

Second, one might question whether two unobserved variables would be uncorrelated as

in our example. One might argue that it is more realistic that there are other variables

(potentially unobserved as well) that induce correlations between L1 and L2. As we can see

using the d-separation criterion, making the unobserved variables related to each other

would not change the fact that conditioning on A2 would induce a covariance between Y and

RY.

Third, it might be argued that it seems implausible that a collider variable like A2 does

not have any direct effect on Y or RY. If one supposes that A2 has a direct effect on either

Y or RY, then conditioning on A2 would close a bias-inducing path, but at the same time

open another one. This would be very similar to situations that we describe in simulation

studies 2.1 and 2.2 in which it is impossible to tell whether bias is reduced or induced

without specific knowledge of magnitude of path coefficients.

Fourth, one could argue that the examples and the whole concept of collider bias are too

artificial and simply do not occur in real datasets. The question whether variables like

A2 in Figure 5 can exist in real data settings has already been widely discussed. ? (?) argues

that such data situations are rare or virtually impossible, whereas other authors (e.g., ?, ?, ?,

?, ?) seem to suggest that such structures can in fact emerge. We believe that while it may

be rare to find such a simple structure as we have displayed, it does not seem completely


implausible to find unobserved variables that happen to have an effect on a potential

auxiliary variable and also on missingness or analysis variables, respectively. It seems in fact

especially plausible if the variables being considered are psychological constructs,

which are often caused by many other variables that may or may not have been observed.

Furthermore, as we have shown in our second set of simulations, it is not even necessary to

invoke the concept of a collider to observe bias-inducing properties of auxiliary variables.

Besides these objections the study has other limitations. First, we only examined

biases in means and standard deviations using a MAR-linear pattern. Clearly, the

simulations could be extended to regression coefficients, or various other parameters, under

more complicated missingness patterns. Moreover, we could have simulated data with

correlated auxiliary variables, a larger number of auxiliary variables, and more variables with

missing data. Especially the correlations among auxiliary variables could have potentially

made the bias-inducing and reducing properties even more complicated. We acknowledge

that all these aspects could have been investigated, and hopefully we will have a chance to

do so in the future. For this particular study we purposefully kept the complexity of models

and missingness to a minimum to show that under very simple models, bias-induction due to

auxiliary variables can occur.

Future directions

The limitations section above points in the direction of future research. First, it will be

interesting to consider more variables with missing data. We argue that the underlying

mechanisms of bias-reduction and bias-induction would be similar if more than one variable

is considered. However, we concede that it might be more difficult to graphically display

models in which many missing data indicators need to be considered, and that this would complicate

auxiliary variable selection, as some auxiliary variables may be bias-reducing for some variables and

bias-inducing for others.

Second, it would be fruitful to examine models that include several auxiliary variables

that are correlated with each other. This adds immense complexity as the

inclusion or exclusion of any given variable has far-reaching consequences for the potential to

induce or reduce bias of other auxiliary variables. As an example, it might be beneficial to

include a collider auxiliary variable, even though it is known to be bias-inducing, just for the

reason that it is correlated with a variable that is also bias-inducing, but unobserved. Given

complex patterns of relationships it can prove potentially very challenging even in a

graphical model to disentangle the effects at work.

Third, it would be interesting to directly compare the performance of the inclusive

approach, the data-driven approach, and an approach that relies on theoretical assumptions

of structural relationships, in their ability to reduce bias. In our studies we focused on small

examples and bias behavior, but have not examined differential behavior of different

approaches in a comprehensive fashion.

Fourth, we have only examined a subset of possible relations of auxiliary variables and

missingness. It would be possible to extend our results to other scenarios, e.g., an auxiliary

variable having an impact on missingness but being spuriously related to the variable with

missing values due to an unobserved variable, or auxiliary variables having both spurious

and direct effects to missingness. In summary, we believe that there is still a lot to be

learned about the selection of auxiliary variables in missing data.


Table 1

Results of illustrative example of bias-inducing auxiliary variable.

                  M (SD)        Bias reduction compared to listwise
Complete data     .03 (1.00)
Listwise          .19 (.98)
FIML with A1      .06 (.96)      68%
FIML with A2      .30 (.98)     −58%
FIML with both    .14 (.98)      26%
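The percentages in the last column of Table 1 can be reproduced with a short calculation. The sketch below is an illustration rather than the authors' code. It assumes that the population mean of Y is 0, so that each reported mean equals the bias, and that bias reduction is defined as one minus the ratio of a method's absolute bias to the listwise absolute bias; the function name `bias_reduction` is hypothetical.

```python
# Means of Y as reported in Table 1.
means = {
    "Listwise": 0.19,
    "FIML with A1": 0.06,
    "FIML with A2": 0.30,
    "FIML with both": 0.14,
}
TRUE_MEAN = 0.0  # assumption: the population mean of Y is zero


def bias_reduction(estimate, reference, truth=TRUE_MEAN):
    """Percentage reduction in absolute bias relative to a reference estimate."""
    return 100 * (1 - abs(estimate - truth) / abs(reference - truth))


for label, m in means.items():
    print(f"{label:15s} {bias_reduction(m, means['Listwise']):6.0f}%")
```

Under these assumptions the calculation gives 68%, −58%, and 26% for the three FIML models. A negative value means that the model increased bias relative to listwise deletion.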


Figure 1. A simple MCAR model.

[Path diagram. Nodes: X, Y, RY. Error terms: εX, εY, εR.]

Figure 2. A simple MAR model without auxiliary variables (a) and with auxiliary variables

(b).

[Path diagrams. Panel (a): nodes X, Y, RY; error terms εX, εY, εR. Panel (b): nodes X, Y, A, RY; error terms εX, εY, εR, εA.]


Figure 3. A simple MNAR model with direct path between missingness and variable with

missing data (a) and unobserved variable related to both Y and RY (b).

[Path diagrams. Panel (a): nodes X, Y, RY; error terms εX, εY, εR. Panel (b): nodes X, Y, RY, L1; error terms εX, εY, εR, εL1.]


Figure 4. A model with several auxiliary variables. Not all of the auxiliary variables are

needed for an unbiased estimate.

[Path diagram. Nodes: A1, A2, A3, X, Y, RY. Error terms: εX, εY, εR, εA1, εA2, εA3.]


Figure 5. Simple structure of two auxiliary variables and a single variable Y exhibiting

missing data.

[Path diagram. Nodes: A1, A2, L1, L2, Y, RY. Error terms: εY, εR, εA1, εA2, εL1, εL2.]


Figure 6. Data generating model for Simulation 1.2.

[Path diagram. Nodes: L1, L2, A1, Y, RY. Path labels: α, with one path marked "?". Error terms: εY, εR, εA1, εL1, εL2.]


Figure 7. Partial results of simulation study 1.1. Standardized bias in the estimate of the

mean across all conditions for both listwise and FIML. Arrows at the bottom of the graph

display the sign of the path labeled with a ?.

[Plot. x-axis: percentage of explained variance (ticks 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45, grouped by negative and positive sign); y-axis: standardized bias (−4 to 4); models: listwise, FIML.]


Figure 8. Partial results of simulation study 1.1. Coverage in the estimate of the mean

across all conditions for both listwise and FIML. Arrows at the bottom of the graph display

the sign of the path labeled with a ?.

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: coverage (0.0 to 1.0); models: listwise, FIML.]



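[Path diagram. Nodes: A2, Y, RY. Path labels: β, with one path marked "?". Error terms: εY, εR, εA2.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: standardized bias (−4 to 4); models: listwise, FIML.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: coverage (0.0 to 1.0); models: listwise, FIML.]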



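[Path diagram. Nodes: L1, L2, A1, U1, Y, RY. Path labels: α and γ, with one α path marked "?". Error terms: εY, εR, εA1, εL1, εL2, εU1.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: standardized bias (−4 to 4); models: listwise, FIML.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: coverage (0.0 to 1.0); models: listwise, FIML.]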



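[Path diagram. Nodes: A2, U1, Y, RY. Path labels: β and γ, with one β path marked "?". Error terms: εY, εR, εA2, εU1.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: standardized bias (−4 to 6); models: listwise, FIML.]

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: coverage (0.0 to 1.0); models: listwise, FIML.]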


Appendix A

Generation of missing values and explained variance in RY

Note that the missingness indicator is a binary outcome variable, and should ideally be

modeled using a logistic or probit model. We modeled the relationship between predictor

variables and missingness by modeling a latent, continuous variable that expresses the

likelihood of being missing, given values on variables that predict missingness. This allowed

us to use the same path coefficients and model the same amount of explained variance. This

latent variable is not displayed in our graphs to make the visualization of the underlying

missingness mechanism clearer. Paths going into the latent, continuous variable had the

same magnitude and explanatory power as paths from variables going into the variable with

missing data, hence they are also displayed with the same letter α in our graphs. To

generate missing data, we created a binary indicator based on the latent missingness

propensity, by performing a cut at the 30th percentile of the underlying continuous variable.

We fully acknowledge that this dichotomization results in amounts of explained variance that

are nominally lower than the ones that were specified with regard to the latent continuous

variable. We examined this attenuation and found, in line with previous research (?, ?, ?),

that the attenuation factor is constant, as long as the dichotomization always occurs at the

same percentile. We also reran this simulation and modeled the binary missingness indicator

directly, choosing logistic regression coefficients that map onto the exact same values of the

McKelvey-Zavoina pseudo-R². Results from these studies showed a very similar pattern,

with biases across all conditions being slightly higher, due to the absence of any attenuation.

The only reason why we did not employ the approach of directly modeling the binary

response was that it becomes exceedingly hard to obtain the exact desired pseudo-R² in models

with several, potentially correlated predictors. This lesser-known point about logistic

regression is explained in more detail by ? (?).
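The generation procedure described in this appendix can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name `make_missing`, the sample size, and the choice to flag cases at or below the 30th percentile (rather than above it) as missing are all hypothetical. The binary explained variance is computed here as a squared point-biserial correlation, which is a simplification and not the McKelvey-Zavoina pseudo-R².

```python
import numpy as np


def make_missing(x, r2, cut=0.30, rng=None):
    """Return (latent propensity, boolean missingness flag) for predictor x."""
    rng = np.random.default_rng() if rng is None else rng
    z = (x - x.mean()) / x.std()
    # Latent continuous propensity with (approximately) r2 explained by x.
    latent = np.sqrt(r2) * z + np.sqrt(1.0 - r2) * rng.standard_normal(len(x))
    # Assumption: cases at or below the cut percentile are flagged as missing.
    missing = latent <= np.quantile(latent, cut)
    return latent, missing


rng = np.random.default_rng(2024)
x = rng.standard_normal(200_000)
ratios = {}
for r2 in (0.15, 0.45):
    latent, missing = make_missing(x, r2, rng=rng)
    r2_latent = np.corrcoef(x, latent)[0, 1] ** 2
    r2_binary = np.corrcoef(x, missing.astype(float))[0, 1] ** 2
    ratios[r2] = r2_binary / r2_latent
    print(f"specified R2={r2:.2f}  latent R2={r2_latent:.3f}  "
          f"binary R2={r2_binary:.3f}  ratio={ratios[r2]:.3f}  "
          f"share missing={missing.mean():.3f}")
```

With normally distributed variables, the ratio of binary to latent explained variance should come out roughly the same in both conditions, which is the sense in which the attenuation factor is constant when the cut is always made at the same percentile.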


Appendix B

Appendix tables

Table B1

Results of simulation study 1.1. Table presents results, broken up by estimation strategy

(listwise, FIML), parameter estimate (mean or variance), type of performance measure

(standardized bias, standard error, RMSE, coverage), and across columns, the sign and

magnitude of the relationships on the R² metric.

Sign of coefficient: negative (first five columns), positive (last five columns)
Unique explained variance in each path α:

                                 45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
listwise      µy  Std. bias     0.03  -0.08  -0.05   0.03   0.02   0.04   0.03  -0.03   0.00  -0.01   0.02
                  Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                  RMSE          0.05   0.05   0.05   0.05   0.05   0.05   0.06   0.05   0.05   0.05   0.05
                  Coverage      0.96   0.95   0.95   0.96   0.95   0.95   0.94   0.95   0.95   0.95   0.96
              σy  Std. bias    -0.02  -0.07  -0.03  -0.03  -0.02  -0.04  -0.13  -0.02  -0.04   0.02  -0.10
                  Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                  RMSE          0.08   0.07   0.07   0.08   0.07   0.08   0.08   0.07   0.08   0.08   0.07
                  Coverage      0.95   0.96   0.95   0.94   0.95   0.93   0.94   0.96   0.95   0.95   0.95
FIML with A1  µy  Std. bias     2.21   1.16   0.54   0.24   0.04   0.03   0.01  -0.24  -0.61  -1.24  -2.17
                  Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
                  RMSE          0.12   0.08   0.06   0.05   0.05   0.05   0.06   0.06   0.06   0.08   0.12
                  Coverage      0.43   0.80   0.91   0.95   0.95   0.95   0.95   0.94   0.92   0.79   0.44
              σy  Std. bias     0.31   0.05   0.00  -0.02  -0.02  -0.04  -0.13  -0.02  -0.01   0.13   0.24
                  Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
                  RMSE          0.08   0.07   0.07   0.08   0.07   0.08   0.08   0.07   0.08   0.08   0.08
                  Coverage      0.95   0.96   0.95   0.94   0.95   0.93   0.94   0.95   0.95   0.96   0.95


Table B2

                   45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
µy  Std. bias    -4.46  -3.18  -2.24  -1.41  -0.50   0.09   0.51   1.36   2.36   3.24   3.90
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.23   0.18   0.14   0.09   0.06   0.05   0.06   0.09   0.13   0.18   0.21
σy  Std. bias    -1.52  -0.90  -0.46  -0.22  -0.04  -0.06  -0.03  -0.17  -0.47  -0.96  -1.27
    Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
    RMSE          0.12   0.10   0.08   0.08   0.08   0.08   0.07   0.07   0.08   0.10   0.11
    Coverage      0.65   0.82   0.92   0.94   0.94   0.94   0.96   0.94   0.89   0.79   0.72

µy  Std. bias     0.01   0.01   0.02  -0.03  -0.03   0.09   0.04   0.00  -0.01   0.01  -0.01
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
σy  Std. bias     0.00  -0.03   0.00  -0.06  -0.02  -0.06  -0.02  -0.01  -0.04  -0.05   0.01
    Std. Error    0.08   0.08   0.08   0.08   0.08   0.08   0.07   0.08   0.08   0.08   0.08
    RMSE          0.08   0.08   0.07   0.08   0.08   0.08   0.07   0.07   0.08   0.08   0.08
    Coverage      0.96   0.94   0.96   0.95   0.94   0.94   0.96   0.95   0.94   0.95   0.94


Table B3

                   45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
µy  Std. bias     1.89   1.84   1.99   1.92   1.90   1.83   1.77   1.89   1.88   1.81   1.90
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.11   0.11   0.11   0.11   0.12   0.11   0.11   0.11   0.11   0.11   0.11
σy  Std. bias    -0.33  -0.24  -0.37  -0.31  -0.29  -0.38  -0.30  -0.27  -0.32  -0.28  -0.31
    Std. Error    0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07
    RMSE          0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08
    Coverage      0.91   0.93   0.92   0.92   0.92   0.93   0.93   0.92   0.92   0.92   0.92

µy  Std. bias     4.38   3.23   2.72   2.17   1.93   1.83   1.75   1.70   1.34   0.71   0.01
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.23   0.18   0.15   0.12   0.12   0.11   0.11   0.10   0.09   0.07   0.05
σy  Std. bias     0.08  -0.09  -0.33  -0.30  -0.29  -0.38  -0.30  -0.27  -0.30  -0.18  -0.02
    Std. Error    0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.07   0.08
    RMSE          0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08
    Coverage      0.95   0.94   0.92   0.92   0.91   0.93   0.93   0.93   0.92   0.93   0.95


Table B4

                   45%    35%    25%    15%     5%     0%     5%    15%    25%    35%    45%
µy  Std. bias    -2.29  -1.46  -0.48   0.47   1.49   1.95   2.42   3.23   4.13   5.33   6.36
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.14   0.09   0.06   0.06   0.09   0.11   0.14   0.18   0.23   0.28   0.33
σy  Std. bias    -0.47  -0.16  -0.05  -0.02  -0.17  -0.26  -0.49  -0.94  -1.49  -2.34  -3.68
    Std. Error    0.07   0.07   0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.06   0.06
    RMSE          0.08   0.07   0.08   0.08   0.08   0.08   0.08   0.10   0.12   0.17   0.23
    Coverage      0.90   0.94   0.94   0.95   0.94   0.93   0.90   0.81   0.64   0.33   0.08

µy  Std. bias     2.44   2.35   2.21   2.07   2.02   1.94   2.00   2.03   2.08   2.38   2.53
    Std. Error    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05
    RMSE          0.14   0.13   0.12   0.12   0.12   0.11   0.12   0.12   0.12   0.14   0.14
σy  Std. bias     1.14   0.86   0.43   0.16  -0.15  -0.27  -0.47  -0.79  -1.07  -1.49  -2.02
    Std. Error    0.09   0.08   0.08   0.08   0.07   0.07   0.07   0.07   0.07   0.07   0.07
    RMSE          0.13   0.11   0.09   0.08   0.08   0.08   0.08   0.09   0.11   0.13   0.16
    Coverage      0.84   0.91   0.93   0.95   0.94   0.93   0.90   0.85   0.76   0.62   0.44


Appendix C

Bias in standard deviations in simulation study 2.2.

We present an additional graph to highlight the pattern of biases in standard deviations.

Figure C1. Partial results of simulation study 2.2. Standardized bias in the estimate of the

standard deviation across all conditions for both listwise and FIML. Arrows at the bottom of

the graph display the sign of the path labeled with a ?.

[Plot. x-axis ticks: 45, 35, 25, 15, 5, 0, 5, 15, 25, 35, 45; y-axis: standardized bias (−4 to 4); models: listwise, FIML.]

Figure C1 shows that the listwise model had a moderate amount of bias in situations in

which only U1 biased the effects. With increased influence of the omitted variable A2, bias

increased, but only in situations in which the sign of the path coefficient was positive.

When the effect of A2 was negative, bias due to U1 in the standard deviation was attenuated

up to a point, and then became slightly larger again when the influence of A2 was very large and

negative in sign. In all conditions, the listwise estimate of the standard deviation

was negatively biased, i.e., too small. The FIML model behaved differently from the

listwise model. At small values of explained variance the two models yielded similar amounts

of bias. With positive signs of the relationship between A2 and Y, and residual

confounding through U1, the standard deviation was consistently underestimated. The

underestimation increased monotonically with the strength of the relationship of A2. In

conditions with a positive sign of A2, the FIML model generally outperformed the listwise

model. In conditions with a negative sign, we observed that bias increased monotonically,

eventually overestimating the variability in the data. This bias became stronger

than the negative bias that was observed in the listwise model.
