Confounding, mediation, and some general considerations in regression modeling
Multivariable regression and its variations are currently the most frequently used type of statistical
technique in behavioral medicine research. A typical example of a multivariable¹ model in our field
might be a regression model that attempts to evaluate a set of psychosocial predictors of disease,
such as heart failure. The measure indicating disease could be operationalized in a number of
measurement forms: as a continuous variable, such as left ventricular ejection fraction; as two or
more categories, such as the ordinal values of a symptom severity score; or even as the time elapsed
between some defined occasion and the diagnosis of heart failure. The term multivariable indicates
that the model contains a single response (dependent) variable, in this case, the marker of heart
failure, and at least two predictor variables in the model. The ostensible aim is to understand the
‘independent’ association of each predictor variable with the response variable. In the most
commonly used type of model, the regression coefficient or parameter estimate for a given predictor
represents the association between that predictor and the response, adjusting for all other predictors
in the model. We also might have more than one response variable, perhaps two separate indicators
of heart failure, such as LVEF and a symptom severity score. We might simply conduct a separate
regression analysis for each outcome, but also might elect to use a model that contains the two
response variables, referred to as a multivariate model. Regardless of which model we choose, a
number of important decisions must be made in developing the model. Foremost is the selection of
the form of probability model that best suits the response variable(s) under study. Next, and often
the most difficult part of the process, is to decide which predictors should be included in the model.
For the vast majority of work we do in behavioral medicine, an important part of the variable
selection process involves our presumed causal model. If we conduct a regression model to examine the association between, say, tobacco use and heart failure, we are more often than not proposing that tobacco use is a cause of heart failure. Upon proposing this model, we must immediately set about identifying potential confounders, that is, other variables that may threaten our causal conclusion. We also may be interested in variables that carry information about mechanisms that operate between the act of smoking and the outcome of heart failure. Finally, we also may be concerned, or even believe a priori, that the association between a given predictor and the response may differ depending on the level of another variable. For example, tobacco use may be related to heart failure only for persons with a certain genotype. Of course, there are many additional considerations in conducting a multivariable regression analysis, including testing assumptions, proper scaling or standardization of the predictors, and perhaps centering, rescaling, or orthogonalizing predictors, to name a few. In the present chapter, our focus will be relatively narrow. After a few preliminaries, we will discuss 1) considerations in selecting predictor variables for a model; 2) modern approaches to mediation; 3) testing for moderation; and finally 4) the role of sample size in estimating regression models.

¹ The terms "multivariable" and "multivariate" are often confused. "Multivariable" indicates that the model contains a single response (dependent) variable, in this case, the marker of heart failure, and at least two predictor variables. "Multivariate," in contrast, indicates the presence of at least two response variables.
Preliminaries: What is a Model?
What is a model and why use one? The statistical models we use in behavioral medicine typically
take the general form of one of more ‘predictor’ variables and one outcome, or response variable,
such as y = b1x1 + b2x2 + b3x3 + … where y is the response variable, the x's are the predictor variables,
and the b’s are regression weights. In the vast majority of modern modeling algorithms, the
predictor variables can be of any form, including continuous, categorical, and ordinal (and as we
will note again later, there is no normality requirement for variables on the predictor side of an
equation). A few words about nomenclature are appropriate here. Techniques such as Analysis of
Variance (ANOVA) and Analysis of Covariance (ANCOVA), and multivariable (often referred to
as “multiple”) regression have been almost entirely displaced by more general models (e.g., the
general linear model, and the generalized linear model). This transition has created a somewhat
confusing amalgam of terminology from these older techniques. Variables on the x-side of the
equation are referred to interchangeably as independent variables, predictors, covariates,
covariables, or, for variables measured as categories, factors. Variables on the y-side are referred
to as the response, outcome, or dependent variable.
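As a concrete, purely illustrative sketch, the general form y = b1x1 + b2x2 + … can be estimated by ordinary least squares on simulated data; the variable names and effect sizes here are our own inventions, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Predictors of mixed measurement form: one continuous, one binary
x1 = rng.normal(size=n)                        # e.g., a continuous symptom score
x2 = rng.integers(0, 2, size=n).astype(float)  # e.g., a two-category factor

# Simulated response with known weights: y = 2*x1 - 1*x2 + noise
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Design matrix with an intercept column; lstsq returns the regression weights
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [0, 2, -1]
```

Note that, as discussed above, no normality assumption is needed on the predictor side; the predictors here are deliberately of different measurement forms.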
Models per se tend to be preferred over traditional tests (e.g., t-tests, chi-square tests) nowadays for
several reasons. First, models provide not only the same information as more conventional testing
approaches, that is, whether the effect of interest is “statistically significant,” but also yield
information about the size of the effect of interest, along with information about the uncertainty of
the effect estimate, usually in the form of a confidence interval. For example, a clinical trial comparing a new blood pressure-lowering drug to a standard drug could legitimately be evaluated using a simple t-test that compares the treatment groups on mean blood pressure at the end of the trial. However, the t-test would provide no information on how big the difference was. A key
advantage of multivariable models is that we can include so-called adjustment variables in addition
to the primary variable or variable of interest. These adjustment variables can serve a variety of
purposes in a multivariable model, and these purposes are at the heart of the remainder of this
chapter.
In modern practice most of the earlier techniques, such as t-tests, chi-square tests, ANOVA,
ANCOVA, multiple regression, etc., have been subsumed under a few more general algorithms.
The generalized linear model (1), for example, can contain one or more variables of virtually any
measurement form on the predictor side, and the probability distribution of the dependent variable
can take a variety of forms beyond normal. These include the binomial, negative binomial, and
gamma distributions. Hence, multiple regression, logistic regression, Poisson regression, and many other conventional models are often still estimated using dedicated routines, but also can be estimated using the generalized linear model. For time-to-
event data, the Cox regression model (2) is probably the most commonly used approach today,
although parametric techniques also appear with relative frequency. In addition to the generalized linear model, structural equation models (SEM) (3) have now been extended sufficiently such that they
also can perform virtually all of the above functions. SEM also has the advantage of allowing so-
called indirect relations to be estimated and tested, which we will discuss further in the section on
mediation below.
Cause. Except for the relatively rare case where a regression model is used for completely blind
empirical prediction, researchers typically use regression models to help understand something
substantive about the phenomena under study. Regardless of whether we care to admit it or not,
researchers are largely interested in using regression models to understand cause. Why would we
measure and model, for example, risk factors as predictors of cardiac disease if we were not
interested in those risk factors as causes? If understanding causation underlies our models, most
would agree that a useful model will include as many of the causally relevant variables in the
system as possible. What makes a variable “relevant?” This question has been of great interest and
debate for many decades in the statistics literature, and is often cast in terms of the problem of
“variable selection.” We argue that relevance depends on the causal model underlying the analysis.
Confounding. In the context of causal hypotheses, confounders represent a highly relevant type of
variable. For a variety of reasons, we know that we should never be fooled into believing that an
association between two variables is sufficient evidence for causation. It may be the case, for
example, that the putative cause is confounded with another variable. The term confounding
derives from the Latin confundere, to pour together, or to mix (4). At root, confounding is the
mixing of the role of two predictor variables. Imagine you are at the bottom of a steep ravine
looking up at a train trestle. Suddenly a very small boy goes running across the train trestle,
followed shortly by a much larger boy, who is shouting at the small boy. You conclude that the
large boy is chasing the small boy, that is, causing the small boy to run. However, shortly after the
boys cross the trestle, a train comes barreling across the trestle behind them. In fact, the larger boy
was not chasing the small boy at all; the train was causing both of them to run across the trestle
quickly. The causal role of the large boy and the train were mixed up, or confounded. The
presence of the large boy was really just a red herring; he just happened to be running from the
train, too. In conducting research we study one or just a few variables that are of particular interest in order to understand something about the causal relation between those variables and some outcome variable.
A simple example in a research context is presented in a didactic paper by Rubin (5). Rubin presents several large epidemiological studies that all seem to show that smoking tobacco in pipe or cigar form is associated with a higher rate of cancer deaths than smoking tobacco in cigarette form. This result is, of course, contrary to our understanding of the relative dangers of these types of tobacco delivery. So, we need to ask whether tobacco type (pipe/cigar vs. cigarette) might be confounded with some other variable.
More formally, we consider the general criteria for confounding, which are as follows: 1) the confounding variable is presumed to be causally related to the predictor under study; 2) the confounding variable is presumed to be causally related to the outcome; 3) the confounding variable is either a common cause of, or a proxy for a common cause of, the predictor and the outcome, and is not in the causal chain between them. In our tobacco example, tobacco type is the predictor of
interest and cancer death is the outcome. What variable might be associated with the tobacco type
and cancer death but is not in the causal chain between the two? One obvious candidate is age.
Older people are more likely than younger people to smoke cigars or pipes and are also more likely
to die of cancer. Although chronological age is clearly causally related to cancer death, age cannot
be caused by the type of tobacco we smoke. In Rubin’s examples, in each of the samples it was
clear that pipe/cigar smokers were much older on average than cigarette smokers, and that the death
rate also was higher among older individuals. When age was properly accounted for in the analysis,
the death rate among pipe/cigar smokers was no longer higher than among cigarette smokers—in
fact, it became lower. Thus the ‘effect’ of tobacco type was confounded with age.
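Rubin's age-confounding pattern can be reproduced with a toy simulation (entirely hypothetical numbers; here, tobacco type has no true effect on risk by construction, so any unadjusted difference is pure confounding by age):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Age is a common cause: older people are more likely to smoke pipe/cigar
age = rng.uniform(20, 80, size=n)
pipe_cigar = (rng.uniform(size=n) < 0.8 * (age - 20) / 60).astype(float)

# Mortality risk depends on age only; tobacco type has no effect here
risk = 0.01 * age + rng.normal(scale=0.1, size=n)

# Unadjusted comparison: pipe/cigar smokers appear to fare worse
unadjusted = risk[pipe_cigar == 1].mean() - risk[pipe_cigar == 0].mean()

# Including age in the regression removes the spurious 'effect'
X = np.column_stack([np.ones(n), pipe_cigar, age])
b, *_ = np.linalg.lstsq(X, risk, rcond=None)
adjusted = b[1]

print(unadjusted)  # clearly positive (confounded by age)
print(adjusted)    # near zero (age adjusted)
```

The unadjusted group difference is an artifact of pipe/cigar smokers being older on average; conditioning on age reveals that tobacco type carries no independent information here.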
Causal graphs. Causal models can be easier to comprehend if presented in graphic form. Often the
graphs are used as an informal heuristic tool and sometimes they are employed in a more formal
way as (causal) Directed Acyclic Graphs (DAGs) or to represent a Structural Equation Model
(SEM). Providing an introduction to graph theory is beyond the scope of this text, but a non-
technical introduction to causal DAGs can be found in Glymour and Greenland (6) and an overview
of causal analysis in the context of mediation is provided by VanderWeele and Vansteeland (7).
We will say much more about DAGs in the section on mediation below, but for now, we’ll
introduce just one basic element of DAG notation. Causal DAGs use single-headed arrows to
represent the hypothesized causal direction between variables. Double-headed arrows, in contrast,
represent associations with no specified causal direction. In the first DAG below, tobacco type and
cancer mortality are associated, but with no causal direction. In the second DAG, tobacco type is
posited as the cause of cancer mortality.
The arrow in each of the figures is, at this point, a black box, representing a potential host of
processes, not all of which are necessarily causal. Put differently, a raw or zero-order correlation,
or an unadjusted regression coefficient between two variables can be a function of a variety of
different processes, some possibly causal, but many others that have nothing to do with cause. In
this case, the association between tobacco type and cancer was actually generated by the presence
of a “third variable,” age, which was a common cause of both tobacco type (in that age captures the
cultural cohort) and cancer mortality. This confounding is depicted below. When we use
regression models to study the effect of one or a few putative causes of an outcome, we strive to
identify and include other variables in the model that might confound the relations under study. A
critical step in planning a study of virtually any design is considering carefully what variables might
confound the relations under study, and then being sure to measure those variables. This is
particularly important when the design is observational where there is no randomization to control
for confounding. By including confounding variables in the analysis of observational data, we may
be at least a bit closer to being able to understand cause. Considering potential confounders is also
important in randomized experiments. Except in extremely large studies, perfect baseline balance is
rarely achieved across randomized arms. When there is baseline imbalance in a randomized
experiment, the treatment effect under study may be confounded with the variable that is not
balanced. Unless the arms are substantially unbalanced, including potential confounding variables
as adjustment variables in a model will effectively reduce the threat of confounding when
interpreting the treatment effect.
Including variables to increase precision. Variables other than confounders may be relevant to the
regression model. We also want our model to include predictors that are associated with the
outcome, even if they are not associated with other predictors. In a linear model, such as multiple regression, including additional predictors in the model (within the limits of sample size, which we will discuss below) improves the precision of the parameter estimates² and the power of the tests of the regression weights. Intuitively, power is improved because
additional predictors explain variance in the response, and therefore reduce the magnitude of the
error term by which the individual regression estimates are evaluated. For nonlinear models, such
as logistic regression and Cox survival models, the picture is a bit more complicated. Adding
additional variables will increase the standard errors for the parameter estimates, resulting in less
power. However, the estimates themselves will also generally be larger. Simulation studies have shown that the
benefit of the increased magnitude of the estimates outweighs the problem of larger standard errors
(8). Thus, when the sample size is large enough, in the most frequently used models in behavioral
medicine, including additional predictors is generally desirable.
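For linear models, the precision argument is easy to demonstrate by simulation (hypothetical data; z is a covariate unrelated to the focal predictor x but predictive of the outcome):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

x = rng.normal(size=n)   # focal predictor
z = rng.normal(size=n)   # prognostic covariate, independent of x
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

def ols(X, y):
    """Ordinary least squares: coefficients and their standard errors."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

_, se_without = ols(np.column_stack([np.ones(n), x]), y)
_, se_with = ols(np.column_stack([np.ones(n), x, z]), y)

# Adding z explains outcome variance, shrinking the error term and hence
# the standard error of the coefficient on x
print(se_without[1], se_with[1])  # the second value is markedly smaller
```

Here z explains a large share of the outcome variance, so the standard error of the coefficient on x drops substantially once z is included.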
Mediation
In addition to addressing confounding and increasing precision, we also might include additional
predictors in a model to study the possibility of mediation. Since the early paper on mediation by
Baron and Kenny (9), analyses of mediation have become increasingly prevalent in the literature. Its
importance has grown so much, in fact, that we have elected to devote a substantial section of this
chapter to it. The notion of mediation is used to describe a scenario where a variable affects another variable through one or more intermediary variables. In the following sections, we will review some conceptual issues involved in mediation and discuss methods that can be used to statistically model mediation.

² When we use the term "parameter estimates," we are referring to the weights or coefficients generated by the regression algorithm for each predictor variable.
Total, direct and indirect effects. We begin with a little orientation to the nomenclature of modern
mediation analysis. Recall our graphic representation of a proposed causal association between two
variables:
In this graph the arrow pointing from X to Y indicates that the variable X affects the variable Y. We
will refer to this as the total effect of X on Y. The total effect of X on Y depicted in Figure 1 may
come about through any number of intermediary variables, but these can be left out when the
objective is to describe the total effect. If there are intermediary variables between X and Y, as we
noted earlier, the arrow from X to Y in the above graph is a black box: We know the input (X) and
the output (Y), but not the mechanisms responsible for creating the association.
In this graph, the variable X affects the variable Y and the variable M. Also, we can see that the
variable M affects the variable Y. As a consequence we can distinguish between two different kinds
of effects of X on Y: A direct effect (X → Y) and an indirect effect through the variable M (X → M
→ Y). The second graph suggests that there is both a direct and an indirect effect of X on Y. In
other words, Figure 2 suggests that the variable M mediates some of the total effect of X on Y, but
it also suggests that there is an effect of X on Y that does not involve M. It is important to note that the direct effect may in fact involve intermediary variables, just not the intermediary variable M; the direct effect might thus more appropriately be termed the non-M-mediated effect, as it can be thought of as the sum of all pathways from X to Y that do not involve the mediator M.
Establishing the relative importance of the direct and indirect effect is often a primary concern in
mediation analysis. Figure 2 also illustrates the difference between confounding and mediation: M
is a mediator between X and Y because it lies on the pathway from X to Y. X is a confounder of the
association between M and Y because X affects M and Y.
Why mediation? Before elaborating further on the technique of mediation, it may prove fruitful to
examine the motivation for looking at mediation in the first place: Why is mediation important to
begin with? A recent paper by Hafeman & Schwartz listed three reasons: To support the evidence of
the main effect hypothesis, to examine the importance of path-specific mechanisms, and to provide
targets for intervention (10).
In 2005, a paper reported that women with a high level of perceived stress had a decreased risk of
breast cancer (11). This finding was quite surprising to many, as high levels of stress had previously been shown to have detrimental effects on various health outcomes. Could it be that the findings
were due to bias and confounding rather than a causal effect of perceived stress on the risk of breast
cancer? In the discussion the authors argue that the effect of stress was due to the fact that stress
hormones suppress estrogen secretion, which lowers the risk of developing breast cancer. This
pathway acts as a mediator between perceived stress and breast cancer. No information on estrogen levels was available in this study, but an analysis of the mediating role of estrogen would have
improved the argument for a causal role of perceived stress in the development of breast cancer
because it would have served to open the black box of how the exposure and outcome were
connected. In fact, another research group had previously used this strategy to show that the
association between BMI and breast cancer was mediated by serum estrogen levels (12).
Another use for mediation is to examine path-specific hypotheses. An association between low
parental socioeconomic position and low offspring birth weight has been observed in many
different populations and across different measures of socioeconomic position. A study by
Mortensen et al. examined the role of two possible mediators of the relationship between maternal
educational attainment and offspring birth weight in a cohort of women followed throughout
pregnancy (13). The two mediators were prepregnant Body Mass Index (BMI) and smoking in the
third trimester. Smoking in pregnancy and high BMI are more prevalent among mothers with short
education, but these two factors have different effects on birth weight: A high BMI increases birth
weight, while smoking decreases it. This means that these two pathways have opposite
contributions to the total effect: if all mothers had the BMI of the highest educated mothers, the
educational differences would be larger because the higher prevalence of obesity among women
with short educations increases their children's birth weights. If all mothers smoked like the highest
educated mothers, mothers with a shorter education would in fact give birth to the heaviest babies
because of the high prevalence of overweight and obesity among this group. The total effect of
education (short education is associated with a lower birth weight) reflects that the birth weight
reducing influence of the smoking-pathway is stronger than the birth weight increasing BMI-
pathway. The example of Mortensen et al. shows that the examination of different pathways can
increase our understanding of the total effects. For example, it suggests that the educational gradient
in birth weight that has been observed in numerous studies might reverse once smoking among
pregnant women is eliminated. It also underscores that mediation might be worth looking at, even in
the absence of a total effect. This is because a lack of association between the exposure and the
outcome might occur when different pathways that pull the total effect in opposite directions
balance each other out. This is sometimes referred to as a suppressor effect. In this case an analysis
of the relevant mediators would help the investigator retrieve the pathway-specific effects of the
exposure on the outcome.
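The suppressor scenario can be sketched with invented effect sizes (loosely patterned on the BMI/smoking example, not the Mortensen et al. data): two pathways of opposite sign cancel in the total effect, while conditioning on one mediator reveals the other pathway.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# Hypothetical structure: low education raises BMI (which raises birth
# weight) and raises smoking (which lowers it) by equal amounts
low_edu = rng.integers(0, 2, size=n).astype(float)
bmi = 1.0 * low_edu + rng.normal(size=n)
smoking = 1.0 * low_edu + rng.normal(size=n)
birth_weight = 0.5 * bmi - 0.5 * smoking + rng.normal(size=n)

def slope(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# Total effect of education: the two pathways cancel, so it is near zero
total = slope(np.column_stack([np.ones(n), low_edu]), birth_weight)[1]

# Conditioning on smoking blocks that pathway; the BMI pathway emerges
direct_given_smoking = slope(
    np.column_stack([np.ones(n), low_edu, smoking]), birth_weight)[1]

print(total)                 # near zero: suppression
print(direct_given_smoking)  # near 0.5: the BMI-mediated contribution
```

A naive analyst looking only at the total effect would conclude that education is irrelevant to birth weight, when in fact two substantial pathways are hidden behind the null result.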
A third use of mediation is to improve and evaluate interventions. Mediation is in a certain sense an
integrated part of the setup in all randomized controlled trials: The effect of randomization to
treatment on the outcome is mediated by the treatment received.
The intention-to-treat analysis is a measure of the effect of randomization to intervention, regardless
of the intervention actually received. In mediation terms, this corresponds to the total effect of
randomization. The motivation for the intention-to-treat analysis is that the results, because of the
random assignment to intervention or control, are unconfounded by factors that affect the
intervention received and the outcome, e.g. compliance to assigned treatment. However, the effect
of the intervention on the outcome is often the quantity of substantive interest, not the effect of
randomization to intervention. If this is indeed the case the intention to treat analysis can be
supplemented with analyses of mediation (14).
A similar use of mediation can be found in studies that use naturally occurring experiments rather
than experiments under the investigator’s control. Mendelian randomization is a strategy for causal
inference that uses genetic variants as proxies for potentially modifiable factors, obesity for
example (15). In Mendelian randomization the effect of the gene on the outcome is mediated by the
modifiable factors. There are special statistical methods (instrumental variable methods) that can be
used to recover the effect of the modifiable factors in a way that potentially avoids many of the
biases in observational studies.
Another use of the concept of mediation in intervention studies is that of surrogate endpoints in
randomized controlled trials, where the aim typically is to examine if an intervention has an effect
on one or more clinical disease endpoints such as cancer or cardiovascular disease. In order to
detect effects, clinical endpoint trials often require that a large number of participants be followed for a long time. Because of this, surrogate endpoints are often used (16). Surrogate endpoints are
biomarkers for disease progression and are as such mediators between the intervention and the
clinical endpoints. For example, CD4 cell count can be used as a surrogate endpoint in HIV
treatment trials and serum cholesterol levels as a surrogate of coronary heart disease.
Because mediation allows the investigator to peek into the black box it can also provide insight into
why interventions might work or fail and thus guide future interventions. The paper by Mortensen
et al. suggests that interventions that target smoking will likely reduce the educational gradient in birth
weight, particularly if the intervention is successful among mothers with a short education. Such
analyses of randomized trials might also provide clues as to what the ‘active ingredient’ in a given
intervention might be. Analyses of mediation are, however, not a free lunch: they come at the cost
of a number of added assumptions.
Causal knowledge as a prerequisite for mediation. The attentive reader will have noticed that we
used the term ‘affect’ to describe the relationship between variables. This is because, as was the
case for confounding, the notion of mediation makes little sense unless we have a causal model in
mind. In the case of mediation, the variables involved must be known, or at least proposed, to be causally related in a way that is at least partly known to the investigator. For example, Boyle et al.
reported that the association between hostility and mortality was partly mediated by a pattern of
episodic excessive alcohol use (binge drinking) among hostile men (17). If high hostility is the cause of binge drinking (i.e. hostility → binge drinking), then the investigators' conclusion is correct. Let us assume instead that (unknown to the investigators) binge drinking over time increases hostility. If binge drinking is the cause of hostility (i.e. binge drinking → hostility) then alcohol use
is not a mediator between hostility and mortality, but rather a common cause of these two variables.
If this was the case, binge drinking would act as a confounder of the association between hostility
and mortality, not as a mediator.
In order for analyses of mediation to make sense, assumptions about the nature of the relationships
between variables are needed. This may at first seem like a rather strong requirement because it
appears to force the investigator to make conclusions in advance about the relationships that are
under investigation. However, causational direction of relationship cannot be extracted from data
alone (18). Investigators will usually get around this by relying to existing knowledge. In the
example of Boyle et al. the prospective design will ensure that the outcome (mortality) occurs after
the exposures are recorded. But the relationship between hostility and binge drinking is cross-sectional, so there is nothing in the design of the study to help the investigator decide about the
direction of the relationship. Most studies carefully consider whether the exposure in fact causes the
outcome. It is probably fair to say that in general less caution is exercised when it comes to making
assumptions about the causal relationship between exposure and mediator. Nevertheless, the analysis is conducted, and the findings will most often be interpreted as if the mediator is caused by
the exposure. To this end, graphs are a helpful tool because they encode the investigator’s
assumptions about the possible causal relationships between variables.
Bearing this in mind, it may be fruitful to think of mediation in terms of (hypothetical)
interventions: If we could somehow intervene and change the subjects’ hostility levels in a certain
way, would we expect their alcohol use to decline? Would the association between hostility and
mortality change if the investigators had forced everyone not to drink alcohol or forced everyone
to binge drink once a week? Thinking of mediation in terms of possible interventions has the added
advantage of providing a non-technical interpretation of the outcome of the analysis (given that the
analysis is conducted accordingly). Starting off with a vague question (“does alcohol mediate the
association between hostility and mortality”) may make it difficult to interpret the results. Just as
important, it will also serve to make apparent the often highly hypothetical nature of the mediation analysis (19, 20).
How to analyze mediation. There are numerous ways to statistically model mediation (for a review, see (21)). In a much-cited 1986 paper, Baron and Kenny stated that the objective of such an analysis was to "test for mediation" (9). This led them to devise a method that was based on a
significance test. However, it can be argued that the question of interest is not to determine if a
given mediator is a statistically significant mediator, but rather to quantify how important the
mediator is. This follows the general arguments against relying only on tests of statistical significance in medical research (22, 23). In the applied literature, one of two somewhat different
modeling approaches to mediation is often used. One approach is to use a Structural Equation
Model (SEM) and the other is to run a series of regressions to obtain and compare the total and
direct (non-mediated) effect of the exposure on the outcome. This latter approach, which is a
simplified version of the method of Baron and Kenny, involves controlling for the mediator to
estimate the direct effect (24). In some cases these two approaches will yield similar results, in
other cases the results will be different.
The SEM approach has the advantage that the statistical model corresponds to the graphs typically
used to conceptualize mediation, so that every arrow in the graph is estimated as a parameter from
one single model. SEMs are primarily used in the social sciences, whereas in the health sciences
SEMs appear to be the less popular choice. This is perhaps because SEMs are somewhat limited in
the sense that they are an extension of linear regression, which is not always well suited for the
kinds of data encountered in medicine. However, modern SEM theory (and modern SEM software)
is relatively flexible with regard to finding models that fit most problems that involve mediation. A
perhaps more important reason for the lack of popularity of SEMs for mediation analyses in the
medical sciences is that most investigators and scientific journals in the health sciences will be
familiar with multiple regression, but may not have experience with SEMs. In the following we will
concentrate on the mediator adjustment approach. For an example of an applied paper that uses both
approaches, see Batty et al. (25).
The mediator adjustment approach involves estimating the total effect and direct effect in two
separate regressions. To estimate the total effect, we need to take account of confounders of the
exposure-outcome association. In the simple situation where there is no confounding, the total effect
is simply the outcome regressed on the exposure. The direct effect is typically estimated as the
association between exposure and outcome when conditioning on the mediator: once we condition
on the mediator, the exposure-outcome association is the controlled direct effect. It is called a
controlled effect because it corresponds to evaluating the association between exposure and
outcome in a population where the mediator is forced by intervention to a certain level. In order
for this to make sense, some conditions need to be met. We will discuss these conditions using the
example of Boyle et al.
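The two-regression recipe can be sketched in a small simulation. The linear system below is an arbitrary assumption for illustration: 0.5 for the exposure → mediator path, 0.4 for the mediator → outcome path, and 0.3 for the direct path, so the total effect is 0.3 + 0.5 × 0.4 = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical linear system (all coefficients are assumptions):
x = rng.normal(size=n)                       # exposure (e.g. hostility score)
m = 0.5 * x + rng.normal(size=n)             # mediator (e.g. binge drinking)
y = 0.3 * x + 0.4 * m + rng.normal(size=n)   # outcome

def ols(y, *cols):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

total = ols(y, x)[1]      # y regressed on x alone -> total effect
direct = ols(y, x, m)[1]  # conditioning on m -> controlled direct effect

print(f"total  ~ {total:.2f}")   # close to 0.3 + 0.5*0.4 = 0.50
print(f"direct ~ {direct:.2f}")  # close to 0.30
```

With no confounding and a linear system, the two coefficients recover the total and controlled direct effects directly; the conditions discussed next are what make this interpretation valid.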
Adjustment for path-specific confounding. In addition to adjustment for confounders of the
association between exposure and outcome, all confounders of the association between the mediator
and the outcome have to be controlled for. Consider again the example of Boyle et al.
In this graph, Early life socioeconomic position (SEP) confounds the relationship between hostility
and mortality because it affects both. It also confounds the association between binge drinking and
mortality. The graph thus suggests that we should adjust for early life SEP when estimating the
total effect and when estimating the direct effect. Suppose that unemployment affects binge
drinking and mortality (loss of a job → increased binge drinking, mortality), but that this variable is
not affected by hostility (the dotted arrow does not exist). Then unemployment is not a confounder
of the total effect, but acts as a confounder of the association between binge drinking and mortality.
In this situation the investigator needs to adjust for unemployment even though unemployment does
not confound the total effect. If we fail to adjust for unemployment when estimating the direct
effect of hostility on mortality the results will generally be biased (26). This is because we need to
condition on binge drinking to estimate the non-binge drinking mediated effect of hostility on
mortality: Among those who binge drink, unemployment will be more frequent and mortality will
be increased as a consequence. Suppose that highly hostile men tend to fight with colleagues and
management and that they consequently are more likely to become unemployed (indicated by the
dotted arrow). In this case unemployment confounds the association between binge drinking and
mortality. This suggests that we should condition on it when estimating the direct effect. However,
if we control for unemployment we eliminate the contribution of the hostility → unemployment →
mortality pathway to the indirect effect. The problem arises because the dotted arrow contributes
both to the direct effect (non-binge drinking mediated) and the indirect (binge drinking mediated)
effect of hostility on mortality. This problem can be solved by resorting to a SEM or by applying
special methods (27).
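A hypothetical simulation of the first scenario (unemployment affects the mediator and the outcome but is not caused by hostility; all coefficients are arbitrary assumptions) shows the bias that arises when the mediator-outcome confounder is omitted from the direct-effect model:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Assumed linear system: u affects m and y but is independent of x
# (the dotted hostility -> unemployment arrow is absent).
x = rng.normal(size=n)                                 # hostility
u = rng.normal(size=n)                                 # unemployment
m = 0.5 * x + 0.8 * u + rng.normal(size=n)             # binge drinking
y = 0.3 * x + 0.4 * m + 0.6 * u + rng.normal(size=n)   # true direct effect: 0.3

def coef_x(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

total = coef_x(y, x)           # unbiased: u does not confound the x -> y total effect
biased = coef_x(y, x, m)       # conditions on m but omits u -> biased direct effect
adjusted = coef_x(y, x, m, u)  # adjusts for the mediator-outcome confounder

print(f"total {total:.2f}, naive direct {biased:.2f}, adjusted direct {adjusted:.2f}")
```

Conditioning on the mediator opens a path through unemployment, so the naive direct effect is pulled well away from the true value of 0.3 even though the total effect is estimated without bias.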
Measurement error. The mediator has to be measured without error. While mismeasurement is
generally something that should be avoided, studies that aim to examine mediation should pay
particular attention to measurement error. This is because even random error in the measurement of
the mediator will bias both the direct effect and the indirect effect, but in different directions. The
actual direction and strength of this bias depends on the pattern of mismeasurement. For example,
suppose that instead of measuring binge drinking in the study by Boyle et al. the investigators
tossed a coin for each participant to determine whether he was a binge drinker. In this case the
direct effect would most likely be overestimated to the point that it would equal the total effect.
SEM software usually has built-in features for handling measurement error, whereas some work is
needed to take account of this in multiple regression (for solutions, see e.g. (28)).
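The coin-toss example can be mimicked in a small simulation: replacing the true mediator with an independent noise variable makes the estimated "direct" effect drift toward the total effect. The generative model below is an arbitrary assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000

x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + rng.normal(size=n)   # true direct effect 0.3, total 0.5

def coef_x(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

m_coin = rng.normal(size=n)              # "mediator" replaced by pure noise
direct_true_m = coef_x(y, x, m)          # close to the true direct effect, 0.30
direct_noisy_m = coef_x(y, x, m_coin)    # drifts up to the total effect, 0.50
print(f"{direct_true_m:.2f} vs {direct_noisy_m:.2f}")
```

Less extreme measurement error produces a less extreme version of the same bias: the direct effect absorbs part of the mediated pathway that the mismeasured mediator fails to capture.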
No interaction between exposure and mediator. The direct effect of the exposure on the outcome
must not depend on at which particular value of the mediator variable it is assessed. In statistical
terms this can be viewed as an assumption of no statistical interaction between the exposure and the
mediator. For example, if the effect of hostility on mortality is stronger among those who binge
drink than among those who do not, we can estimate two different controlled direct effects: one for
binge drinkers and one for non-binge drinkers. Unless there is a strong argument for the value of
the mediator at which the association between exposure and outcome should be evaluated, the controlled
direct effect does not make sense in the presence of statistical interactions. This also relates to the
difference between mediators and moderators. As discussed above, a mediator is a variable that lies
in a causal pathway (e.g. hostility → binge drinking → mortality). Moderation has to do with how
two (or more) variables alone and in combination affect a third variable. In statistical terms, this is
an interaction. It is important to note that statistical interaction depends on the choice of scale: Two
variables that do not interact on a multiplicative scale (e.g. in a logistic regression) will generally
interact on an additive scale (linear regression), and vice versa. Because interaction depends on the choice of effect
measure, statistical interaction is often denoted effect measure modification in epidemiology (29).
The concepts of mediation and moderation are fundamentally different and not mutually exclusive,
so that a given variable can act as a mediator or as a moderator or as both. A discussion of
interaction and moderation is given in Hernan &amp; Robins (19).
Decomposition of total effects and indirect effects. We have now examined how total effects and
the controlled direct effect can be estimated, but what about the indirect (mediated) effect?
In an SEM context, the total effect can readily be decomposed into a direct and indirect effect, but
this is more difficult when using the mediator adjustment approach. Intuitively it seems reasonable
to assume that the total is the sum of the parts, so that the indirect effect can be calculated by
subtracting the direct effect from the total effect. This is the case in some situations, but in many
situations it is not. If linear regression is used to estimate the total and direct effect, this strategy
works well although the standard error of the indirect effect is not directly estimated. But often
various kinds of non-linear regression models are used. Studies that use logistic regression will
often report the percent reduction in the Odds Ratio after adjustment for the mediator(s).
Unfortunately this strategy will generally not work (24, 30, 31). The problem is that this approach
assumes that the change in Odds Ratio from one logistic regression to another has a very specific
interpretation. This trick works in linear models because a mixture of two linear regressions is a
linear regression, but this is not generally the case in logistic regression for example. It is worth
noting that total and direct effects can be estimated from non-linear regression, but the indirect
effect cannot consistently be calculated by contrasting the total and direct effects.
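One way to see why contrasting odds ratios fails is the non-collapsibility of the odds ratio: even when a covariate is independent of the exposure (so it is neither a confounder nor a mediator, and no mediation exists), adjusting for it changes the odds ratio. A small exact calculation with assumed coefficients:

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

b_x, b_m = 1.0, 2.0   # conditional log-odds ratios (assumed values)
p_m = 0.5             # M ~ Bernoulli(0.5), independent of X by construction

def p_y(x):
    """Marginal P(Y=1 | X=x), mixing over the distribution of M."""
    return (1 - p_m) * expit(b_x * x) + p_m * expit(b_x * x + b_m)

def odds(p):
    return p / (1 - p)

marginal_or = odds(p_y(1)) / odds(p_y(0))
print(round(math.exp(b_x), 2))  # conditional OR: 2.72
print(round(marginal_or, 2))    # marginal OR: 2.39, smaller despite no confounding
```

The "percent reduction in the OR" after adjustment therefore mixes genuine mediation with this purely arithmetic collapsibility effect, which is why the subtraction strategy is unsafe in logistic models.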
There are also other (non-technical) reasons for resorting to a linear model. For example, it was
long believed that traditional risk factors did not explain the social gradient in cardiovascular
disease. This finding was predominantly supported by studies that used the mediator adjustment
approach in multiplicative, non-linear models. This was something of a paradox in so far that
research on the traditional risk factors suggested that these explained 90% of the cases. A landmark
paper by Lynch et al. from 2003 showed that this apparent paradox was explained by the choice of
relative measures of association. After adjustment for traditional cardiovascular risk factors (the
mediators) the relative differences decreased by about 25%. The absolute risk differences, however,
were reduced by about 75% (32). This highlights an inherent problem in calculating a relative
change in a relative measure.
Mediation and interaction. As noted above the problem of assessing mediation in the presence of
statistical interactions is exacerbated by the fact that statistical interactions are dependent on the
choice of scale. This means that the choices of statistical model and measure of association will in
part determine whether mediation is a tractable problem, which is both impractical and conceptually
unsatisfying. One solution to this problem is to take a close look at the relationship between the
exposure and the mediator. Recall that the controlled direct effect is estimated by fixing the
mediator at some value, for example eradicating all binge drinking. For many real life problems it is
difficult to imagine scenarios where forcing the mediator to attain a particular value is possible. But
this is quite different from what we would expect the exposure to do to the mediator. If we could
somehow manipulate the exposure by intervention a reasonable expectation would be that the
distribution of the mediator would shift from the distribution it had under no exposure to the
distribution it has when the exposure is present. Consider the pathway perceived stress → estrogen
→ breast cancer from Nielsen et al. If we somehow intervened and eliminated all perceived stress,
we would expect the subjects’ levels of serum estrogen to increase. This would result in a shift in
the distribution of estrogen. So, instead of fixing the mediator at a certain level, we can calculate
the direct effect when the mediator has a certain distribution. If we wanted to estimate the direct
(non-estrogen mediated) effect of perceived stress on breast cancer, we could use this information.
For example, we could evaluate the association between perceived stress and breast cancer under the
distribution that estrogen has in the absence of perceived stress, instead of evaluating it in an analysis where we
fixed everyone’s estrogen levels to attain the exact same value. This leads to the estimation of a
natural direct effect. This is done using simple standardization techniques such as those used to
calculate standardized rates. It is important to note that using natural direct effects will yield results
that are identical to the controlled direct effect unless there is statistical interaction between exposure
and mediator, but even in this case the concept does add value: Not only does the concept of natural
effects provide a definition of direct effects in the presence of interaction, they also lead to a
definition of indirect effects. A natural indirect effect can be defined as the change in outcome
when the exposure is fixed and the distribution of the mediator is changed. The reader is referred
elsewhere for a comprehensive review of natural effects (7, 27, 33).
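Under linearity and no exposure-mediator interaction, this standardization can be sketched directly; the data-generating coefficients below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical set-up: binary exposure, continuous mediator whose distribution
# is shifted by the exposure, linear outcome with no x*m interaction.
x = rng.integers(0, 2, size=n).astype(float)
m = 1.0 - 0.6 * x + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + rng.normal(size=n)

# Fit the outcome model y ~ x + m by least squares.
X = np.column_stack([np.ones(n), x, m])
b0, bx, bm = np.linalg.lstsq(X, y, rcond=None)[0]

# Standardization: predict under x=1 and x=0, but in BOTH cases give the
# mediator the distribution it has among the unexposed (natural direct effect).
m0 = m[x == 0]
nde = (b0 + bx * 1 + bm * m0).mean() - (b0 + bx * 0 + bm * m0).mean()
print(f"natural direct effect ~ {nde:.2f}")   # about 0.30 here (no interaction)
```

Because there is no interaction in this toy model, the natural and controlled direct effects coincide; with an x*m interaction term in the model, the same standardization machinery still produces a well-defined natural direct effect.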
Interactions and moderation
The assumption of homogeneity. An additional assumption that all regression models make is
that the effect of a given variable is homogeneous across all levels of all other variables,
irrespective of whether those variables are measured and included in the model or not. For
example, if we estimate a model in which depression is a predictor of cardiac disease, we implicitly
make the assumption that the association between depression and disease is the same (within
sampling error) for men and women, the old and the young, across ethnicities, genotypes, etc. Even
if these other variables were measured and included in the model as adjustment covariables, such a
model would not yield any information about this possible heterogeneity. There are several ways
we might tackle this question. One intuitively appealing method would be to divide the sample into
subgroups and evaluate the regression coefficient within each of the groups. For example, we
might divide our sample based on gender, and estimate the relation between depression and cardiac
disease separately for men and women. Subgroup tests, however, are highly controversial and
generally discouraged by statisticians for a number of reasons (34). Among these objections to
subgroup testing, the most important are the inflated error rate, the differential power of the
tests, and the increased imprecision of the parameter estimates due to the smaller sample sizes.
Conducting many tests of any kind inflates the Type I error rate. In the case of subgroup tests, of
course, all the parameters in a given model are re-estimated within each subgroup, creating a whole
host of new opportunities for capitalizing on the idiosyncrasies of the sample, with the added
disadvantage of conducting those tests on fewer data points! Correction for multiple testing in these
cases can be of some help, but unless the study was designed specifically for the subgroup test, the
power can and usually will be quite different for different subgroups. Hence, some subgroup tests
will have more power than others, making it virtually impossible to manage the error rate
coherently. If subgroup tests are of interest, the sampling plan must take them into account before
the study is carried out to ensure adequate and consistent power across them. The inferences from
pre-planned subgroup analyses are, of course, more robust than those arising from post hoc
analyses. If the design did not take these tests into account, subgroup analyses should either not be
conducted at all, or should be interpreted as highly preliminary.
Finally, if we are interested in studying heterogeneity of associations, the preferred approach is to
test the corresponding interaction term rather than to examine subgroups separately (35, 36). (There are
also Bayesian methods, which may overcome some of the problems with conventional subgroup
analyses [see (37, 38)]). For example, if one is interested in whether a treatment is more effective
in one ethnic group than another, the proper test is a treatment group by ethnicity interaction term.
In a multivariable model setting, when more than one interaction term is of interest, the error rate
can be minimized by entering all the interaction terms of interest in the model as a block
simultaneously and testing the change in model fit associated with the block (39, 40). If the test of
the entire block is not significant, then the individual interaction terms are interpreted as
inconclusive or noise. I add a reminder here that in most statistical models nowadays all lower-
order component terms must be included in the model along with higher-order terms such as interactions.
For example, if we are testing a treatment group by ethnicity interaction, we also must include the
treatment group and ethnicity main effects—otherwise the interaction term is not really
interpretable as an interaction in the conventional sense of the concept.
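The block test can be sketched as a change-in-fit F test comparing the main-effects model to the model with the interaction block added (here a single treatment-by-ethnicity term; all data-generating coefficients are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical trial: binary treatment and (for simplicity) binary ethnicity,
# with a true treatment-by-ethnicity interaction of 0.3 (assumed).
treat = rng.integers(0, 2, n).astype(float)
ethn = rng.integers(0, 2, n).astype(float)
y = 0.5 * treat + 0.2 * ethn + 0.3 * treat * ethn + rng.normal(size=n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum())

ones = np.ones(n)
X_main = np.column_stack([ones, treat, ethn])                # main effects only
X_full = np.column_stack([ones, treat, ethn, treat * ethn])  # + interaction block

# F statistic for the change in fit when the interaction block is added.
q = X_full.shape[1] - X_main.shape[1]   # number of terms in the block
f = ((rss(X_main, y) - rss(X_full, y)) / q) / (rss(X_full, y) / (n - X_full.shape[1]))
print(f"block F statistic: {f:.1f}")    # large -> the block improves fit
```

With several interaction terms, the same F test covers the whole block at once, so only one test is spent on the entire set; note that both main effects stay in the model throughout.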
Preserve measurement information wherever possible. On a final note, when testing interactions,
one might be tempted to create dichotomies or groups out of continuously measured variables.
Researchers also make artificial categories for other reasons, such as ease of interpretation,
evaluation of nonlinearity, correspondence to clinical cutpoints, or even the belief that the grouping
somehow improves measurement precision. Indeed, creating groups out of continuous variables has
a long history in psychology, medicine, and epidemiology. What many modern researchers fail to
realize, however, is that this tradition arose strictly out of necessity. In the early days of modern
statistical practice, it was apparently well understood that the practice of grouping was less than
ideal, but there was little choice given the lack of computational power. With the availability of
ample computational power, modern authorities in methodology have repeatedly discouraged
researchers from adopting this practice (41-44). Compared to the categorized version of a variable,
using the continuous form yields substantially greater statistical power (43), is less likely to produce
spurious significance (45), and, from a measurement perspective, is a more reliable instantiation of
the variable under study (44). The much preferred alternative to categorizing is to model the
continuous variable as measured. If nonlinearity is a concern, techniques such as splines (40) or
fractional polynomials (46) will allow for a nonlinear association without discarding information or
making arbitrary cutpoints. Despite the overwhelming evidence of the inadequacy of the
categorization approach, a quick glance at many scientific journals suggests that the force of
tradition is apparently quite strong. We once again appeal to readers to avoid this fundamental error
in data analysis.
Some additional considerations on regression models
Sample size in multivariable models. We now turn to a last few concepts that bear directly on the
above material in terms of producing replicable models. Earlier, we alluded to the idea that
although it is a good idea to include potential confounders and additional predictors of the response
in a model, the number we can include in a model and still obtain reproducible results is determined
by the sample size we have to work with. Before the advent of simulation studies, statisticians
often offered rules of thumb based on their experience. One well-known rule of thumb for linear
regression models is that there should be at least 10, preferably 15, cases for every degree of
freedom used in estimating the equation. Typically, each predictor uses one degree of freedom.
For example, if we want to study 10 predictors with no interactions or curvilinear terms, we should
have at least 100 observations in our sample. Perhaps it is not a surprise, but modern simulation
studies have tended to support this rule of thumb, demonstrating empirically that following this
guideline will result in a regression model that is more likely to replicate in new samples. There are
also rules of thumb that have been empirically tested for logistic regression models and also
survival models such as Cox regression. The rules of thumb for logistic and time to event models
are similar to that for linear regression, about 10-15 observations per predictor. However, there is
an important difference in how the number of observations is counted in the logistic and time to
event models. In these models, the number of observations is based on something called the
effective sample size. The effective sample size for a time to event regression model is simply the
number of events. So, if there are 1000 participants in a study, and only 10 of them sustain the
event being studied, the effective sample size is 10. For logistic regression models, in which the
outcome is a binary variable, the effective sample size is the count of events or nonevents,
whichever is the smaller number of the two. For example, if there are 200 individuals in the
sample, and 20 had an event, the effective sample size is 20, not 200, and at best 2 variables can be
studied with reasonable confidence. If there were 180 events rather than 20, the effective sample
size would still be 20. In more technical parlance, the effective sample size in a logistic regression
model with a binary response is min(q, n-q), where min represents “the minimum of the following
quantities”, q is the number of events, and n is the total sample size. Finally, for ordinal logistic
regression models, that is, models with more than two ordered category as the response, the
effective sample size is given by

n − (1/n²) ∑ᵢ₌₁ᵏ nᵢ³

where n is the sample size, k is the number of response categories, and nᵢ is the number of
observations in category i (40).
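These counting rules can be captured in two small helper functions (the function names are ours, not standard):

```python
def ess_binary(n_events, n_total):
    """Effective sample size for a binary (logistic) outcome: min(q, n - q)."""
    return min(n_events, n_total - n_events)

def ess_ordinal(counts):
    """Effective sample size for an ordinal outcome with category counts n_i:
    n - (1/n^2) * sum(n_i^3)."""
    n = sum(counts)
    return n - sum(c ** 3 for c in counts) / n ** 2

print(ess_binary(20, 200))    # 20 -> at best ~2 predictors
print(ess_binary(180, 200))   # also 20: min(180, 20)
print(ess_ordinal([70, 70, 60]))   # close to the full n when categories are balanced
```

Note that the ordinal formula approaches the full sample size when the categories are evenly filled and collapses toward the binary rule when one category dominates.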
What are the consequences of studying more variables than the guidelines suggest? Perhaps the
most serious consequence of trying to squeeze too many variables in a model is overfitting.
Overfitting is a condition in which the idiosyncrasies of the sample lead to an overly optimistic
overall fit of the model. Intuitively, we might say that there is simply not enough information (in
terms of observations) to distinguish noise from true signal. The fewer observations per degree of
freedom in a model, the more likely the model will be overfit. Overfitting is discussed in greater
detail in Babyak (47) and Steyerberg (8). Figure 1 displays the results of a series of simulations
carried out by Babyak (47). The plot shows the distribution of model r-square values for various
levels of predictors/observations for a model with 10 predictors whose values are merely randomly
generated, i.e. are pure noise. Because the predictor values are randomly generated, the 'true' model
should have an r-square value of zero, with any non-zero r-square arising simply due to random
sampling fluctuation. The plot demonstrates that when there are relatively many observations per
predictor, the vast majority of r-square values are zero or very close to zero. However, as the
predictor/observation ratio becomes smaller, the typical r-square values become larger and more
varied, with some even reflecting a fairly large amount of variance explained. In addition to
generating overly optimistic model fit, having too few observations per predictor also results in bias
in the estimates for the individual parameters. Peduzzi et al. (48) showed in a series of simulations
that an inadequate predictors/observations ratio also leads to serious bias in the estimates of the
regression coefficients in logistic regression and time-to-event models. Some have argued that in
the case of models in which we are interested in a single predictor and are merely concerned about
ruling out confounding, fewer events per predictor may be required. Vittinghoff et al. (49) have
argued that in this circumstance perhaps as few as 5 events per predictor may be sufficient, but
the authors also show that under some circumstances even more than 15 per predictor may not be
enough. Perhaps the most prudent advice is that more is always better when it comes to sample size,
and that when there are relatively fewer cases than the guidelines suggest, results should be
interpreted with great caution.
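The overfitting phenomenon is easy to reproduce: regress pure noise on pure noise and watch the apparent R² grow as the observations-per-predictor ratio shrinks. A minimal sketch (sample sizes and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def noise_r2(n_obs, n_pred, reps=200):
    """Mean apparent R^2 when ALL predictors are pure noise (true R^2 = 0)."""
    r2s = []
    for _ in range(reps):
        X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_pred))])
        y = rng.normal(size=n_obs)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2s.append(1 - resid.var() / y.var())
    return float(np.mean(r2s))

print(f"10 predictors, 150 obs: mean R^2 = {noise_r2(150, 10):.2f}")  # near zero
print(f"10 predictors,  20 obs: mean R^2 = {noise_r2(20, 10):.2f}")   # around 0.5
```

At 15 observations per predictor the noise model explains almost nothing, but at 2 observations per predictor roughly half the outcome variance is "explained" by pure noise, mirroring the pattern in Figure 1.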
Reducing the degrees of freedom in a model. If you are confronted with a situation in which you
wish to study more variables than the sample size allows, what are the alternatives? A popular
approach in the past has been to use automated 'stepwise' methods. There are actually a variety of
these techniques, but they are typically characterized by sequentially entering and removing
variables based on the correlations and partial correlations between the predictors and response
variable until some arbitrary criterion is met. For example, in forward stepwise selection, the
algorithm scans the correlations between the predictors and response variable and selects the
predictor with the largest correlation with the response. In the next step, the correlations between
the remaining candidate predictors and the response are partialled for the effect of the first variable
that was chosen, and the algorithm selects the largest of these partialled correlations. The process
continues until some predetermined measure of fit is achieved. Unfortunately, these algorithms
have been subsequently shown to be significantly flawed in terms of inference. They do generate
models that will fit the sample data well, but when used in the way that most of us have used them,
they are almost certain to not produce a replicable model. That is, when we compare the fit of the
model and the parameter estimates from the stepwise model to a model based on a new sample, not
much will be the same. Intuitively, the overly optimistic fit can be understood as a function of the
fact that we have tested many variables, and that by chance alone (i.e. random sampling
fluctuation), we are bound to find at least a few, and sometimes even many, predictor variables that
display a non-trivial association with the response variable. On the other side of the same coin, the
automated algorithm will also miss potentially important variables, again due to sampling error,
yielding a model with parameter estimates that may not be appropriately adjusted, i.e., a
misspecified model. Further problems arise with automated algorithms when there are
correlations among the candidate predictor variables. In these instances, the choice to select one or
the other by the algorithm can be quite arbitrary. Not surprisingly, in recent years, the use of
automated stepwise methods has been almost uniformly discouraged by statisticians. Several
journals, in fact, will not accept papers that are based on conventional stepwise analyses (50, 51).
A commonly used alternative to stepwise selection is univariate prescreening of variables. In this
approach, the researcher evaluates the univariate relation between each predictor and the response
variable and selects those which are statistically significant for entry in a final regression model.
Unfortunately, this technique suffers from essentially the same shortcomings as those seen in the
automated stepwise algorithms, and at times worse. The fit is again biased toward being too good,
because we are selecting the predictors whose parameter estimates are of the largest magnitude,
without accounting for the possibility that those magnitudes are also influenced by random
sampling error. Steyerberg (8) calls selection based on p-values “testimation
bias.” As a more general principle, using the sample data to determine what to include in a model
will produce fit that may be too good and parameters that are too large. A further difficulty with
univariate prescreening is that variables behave differently in a univariate setting compared to a
multivariable model. It is entirely possible, for example, for a potential predictor to look quite
uninteresting in a univariate setting and then come to life when partialled for other variables.
Arguably the best alternative to automated techniques and prescreening is to specify the model in its
entirety before even collecting the data. A prespecified model is preferable for a number of
reasons. First and foremost, it requires a thoughtful consideration of the phenomenon under study
before collecting the data. Second, it is transparent. There is no doubt as to whether other variables
were considered but just not reported. Finally, the p-values for the fit of the model and for the
parameters will be 'honest.' By contrast, once predictors are tested either during pretesting or
some other selection process and discarded, the tests of the model with the remaining variables, as
well as the test of model fit will be too optimistic (for a simulation study demonstrating this
principle, see (52)).
Sometimes, of course, it is not possible or even desirable to have a single prespecified model. We
simply may not know quite enough about the entire system of variables we are studying, or perhaps
collecting some of the data is expensive and we want to cull as many of the non-important variables
out of the equation. There are a variety of approaches that will either allow us to include more
variables than the rules of thumb suggest, or that will remove extraneous variables with the correct
adjustment. The simplest technique for reducing degrees of freedom is to combine predictors in
some rational way. Combining is useful for variables that act solely as nuisance or adjustment
variables, whose individual regression coefficients are not of particular interest but whose
information we still want included in the model. We can simply
create a composite score from two or more variables, by summing their ranks or converting the
variables to standardized scores and summing them. Alternatively, we can use a clustering
technique such as principal components or common factor analysis to develop a composite that
captures the information in the variables. The resulting composite that we create is then used
instead of the individual variables in the model. More details on these approaches are available in
Harrell (40).
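A minimal sketch of the standardize-and-sum composite; the three nuisance covariates and their scales are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000

# Three hypothetical adjustment covariates measured on very different scales.
age = rng.normal(55, 10, n)
bmi = rng.normal(27, 4, n)
sbp = rng.normal(130, 15, n)

def zscore(v):
    """Center and scale so each variable contributes on equal footing."""
    return (v - v.mean()) / v.std()

# One composite replaces three variables: 1 degree of freedom instead of 3.
composite = zscore(age) + zscore(bmi) + zscore(sbp)
print(composite.mean(), composite.std())   # mean is (numerically) zero
```

The composite then enters the regression as a single adjustment term; a principal-components score would serve the same purpose while weighting the variables by their shared variance.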
More sophisticated methods for automated model selection have been developed recently and are
now becoming more widely available in popular software packages. The techniques include the
lasso and least angle regression approaches developed by Tibshirani (53), Bayesian model
averaging (54), and the use of penalization (55) or random effects (56). The details of these
techniques are far beyond the scope of this chapter, but they do show some promise in terms of
allowing an algorithm to make reasonable selections of variables while accounting for uncertainty.
Because these approaches properly correct for capitalizing on the idiosyncrasies of the sample,
however, many researchers may be quite displeased with the failure to find ‘significant’ results.
Nevertheless, these approaches generate far more realistic appraisals of the extent to which our
results will replicate in a new sample.
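As an illustration of the penalization idea, a bare-bones coordinate-descent lasso (a sketch, not a production implementation; the data and penalty value are arbitrary) shrinks the coefficients of pure-noise predictors to zero while retaining the real ones:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 10
X = rng.normal(size=(n, p))
# Only the first two predictors truly matter (assumed coefficients).
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)

def lasso_cd(X, y, lam, iters=200):
    """Minimal coordinate-descent lasso via soft thresholding
    (assumes roughly standardized columns and centered y)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual for column j
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0) / (X[:, j] @ X[:, j] / n)
    return beta

beta = lasso_cd(X, y, lam=0.2)
print(np.round(beta, 2))   # the eight noise coefficients are shrunk to (near) zero
```

The shrinkage that discards the noise predictors also pulls the retained coefficients slightly toward zero; that deliberate bias is the price paid for a model that is far more likely to replicate.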
Summary. This paper has reviewed some of the issues involved in the estimation of regression
models in terms of variable selection and underlying causal models. Specifically, regression
models that attempt to illuminate causal understanding are most useful when we try to account for
potential confounders, include additional variables that enhance precision, and test for mediators.
For mediation, SEMs are currently the best choice for the applied researcher because they are
linear, provide consistent decomposition of the total effect into direct and indirect contributions and
allow the investigator to take measurement error into account. If interactions among two or more
variables are suspected, care must be taken to design the study in such a way that these potential
interactions can be adequately studied. When testing mediation, if there are strong interactions
between the exposure and the mediator, methods beyond simple SEMs are needed. Finally, in order
to increase the likelihood that our models will replicate, and hence be generalizable, attention
should be paid to the number of parameters we seek to estimate in the context of sample size.
Figure Caption
Results of a simulation of automated stepwise regression with 15 candidate predictor variables. In the
true model, predictors were randomly generated and therefore unrelated to the response variable,
meaning that the true r-square was zero. The ratio of predictors to sample size was then
manipulated by altering the sample size. The frequency of falsely high r-squares increases as the
sample size to predictors ratio decreases.
31
References
1. McCullagh P, Nelder J. Generalized Linear Models. London: Chapman and Hall; 1989.
2. Cox DR, Oakes D. Analysis of Survival Data. London: Chapman & Hall; 1984.
3. Muthen LK, Muthen B. Mplus User's Guide. 3rd ed. Los Angeles, CA: Muthen and Muthen; 2004.
4. Glare PGW. Oxford Latin Dictionary. Oxford University Press; 1982.
5. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997;127:757-63.
6. Glymour MM, Greenland S, Rothman KJ, Lash TL. Causal diagrams. In: Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008, p. 183-212.
7. VanderWeele TJ, Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface 2009;2:457-68.
8. Steyerberg EW. Clinical Prediction Models. New York: Springer; 2009.
9. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. J Pers Soc Psychol 1986;51:1173-82.
10. Hafeman DM, Schwartz S. Opening the Black Box: a motivation for the assessment of mediation. Int J Epidemiol 2009;38:838-45.
11. Nielsen NR, Zhang ZF, Kristensen TS, Netterstrom B, Schnohr P, Gronbaek M. Self reported stress and risk of breast cancer: prospective cohort study. BMJ 2005;331:548.
12. Key TJ, Appleby PN, Reeves GK, Roddam A, Dorgan JF, Longcope C, Stanczyk FZ, Stephenson HE Jr, Falk RT, Miller R, Schatzkin A, Allen DS, Fentiman IS, Wang DY, Dowsett M, Thomas HV, Hankinson SE, Toniolo P, Akhmedkhanov A, Koenig K, Shore RE, Zeleniuch-Jacquotte A, Berrino F, Muti P, Micheli A, Krogh V, Sieri S, Pala V, Venturelli E, Secreto G, Barrett-Connor E, Laughlin GA, Kabuto M, Akiba S, Stevens RG, Neriishi K, Land CE, Cauley JA, Kuller LH, Cummings SR, Helzlsouer KJ, Alberg AJ, Bush TL, Comstock GW, Gordon GB, Miller SR. Body mass index, serum sex hormones, and breast cancer risk in postmenopausal women. J Natl Cancer Inst 2003;95:1218-26.
13. Mortensen LH, Diderichsen F, Smith GD, Andersen AM. The social gradient in birthweight at term: quantification of the mediating role of maternal smoking and body mass index. Hum Reprod 2009;24:2629-35.
14. Kraemer HC, Wilson GT, Fairburn CG, Agras WS. Mediators and moderators of treatment effects in randomized clinical trials. Arch Gen Psychiatry 2002;59:877-83.
15. Smith GD, Ebrahim S. Mendelian randomization: prospects, potentials, and limitations. Int J Epidemiol 2004;33:30-42.
16. Cohn JN. Introduction to surrogate markers. Circulation 2004;109:IV20-IV1.
17. Boyle SH, Mortensen L, Gronbaek M, Barefoot JC. Hostility, drinking pattern and mortality. Addiction 2008;103:54-9.
18. Hernan MA, Hernandez-Diaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol 2002;155:176-84.
19. Hernan MA, Robins JM. Causal Inference. Chapman & Hall/CRC.
20. Dawid AP. Causal Inference Without Counterfactuals. J Am Stat Assoc 2000;95:407-24.
21. Mackinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods 2002;7:83-104.
22. Sterne JA, Davey SG. Sifting the evidence-what's wrong with significance tests? BMJ 2001;322:226-31.
23. Rothman KJ. Significance questing. Ann Intern Med 1986;105:445-7.
24. Kaufman JS, MacLehose RF, Kaufman S. A further critique of the analytic strategy of adjusting for covariates to identify biologic mediation. Epidemiol Perspect Innov 2004;1:4.
25. Batty GD, Gale CR, Mortensen LH, Langenberg C, Shipley MJ, Deary IJ. Pre-morbid intelligence, the metabolic syndrome and mortality: the Vietnam Experience Study. Diabetologia 2008;51:436-43.
26. Cole SR, Hernan MA. Fallibility in estimating direct effects. Int J Epidemiol 2002;31:163-5.
27. Petersen ML, Sinisi SE, van der Laan MJ. Estimation of direct causal effects. Epidemiology 2006;17:276-84.
28. Gustafson P. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. Boca Raton, FL: CRC Press; 2003.
29. Rothman KJ. Measuring Interactions. In: Epidemiology: An Introduction. 1st ed. New York: Oxford University Press; 2002, p. 168-80.
30. Ditlevsen S, Christensen U, Lynch J, Damsgaard MT, Keiding N. The mediation proportion: a structural equation approach for estimating the proportion of exposure effect on outcome explained by an intermediate variable. Epidemiology 2005;16:114-20.
31. Mackinnon DP, Lockwood CM, Brown CH, Wang W, Hoffman JM. The intermediate endpoint effect in logistic and probit regression. Clin Trials 2007;4:499-513.
32. Lynch J, Davey SG, Harper S, Bainbridge K. Explaining the social gradient in coronary heart disease: comparing relative and absolute risk approaches. J Epidemiol Community Health 2006;60:436-41.
33. Pearl J. Direct and indirect effects: Technical report R-273. Proceedings of the American Statistical Association. Minneapolis, MN; 2005, p. 1572-81.
34. Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 2000;355:1064-9.
35. Altman DG, Bland JM. Statistics Notes: Interaction revisited: the difference between two estimates. BMJ 2003;326:219.
36. Altman DG, Matthews JNS. Statistics Notes: Interaction 1: heterogeneity of effects. BMJ 1996;313:486.
37. Dixon DO, Simon R. Bayesian subset analysis in a colorectal cancer clinical trial. Stat Med 1992;11:13-22.
38. Simon R. Bayesian subset analysis: application to studying treatment-by-gender interactions. Stat Med 2002;21:2909-16.
39. Cohen J, West SG, Aiken L, Cohen P. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. London: Taylor and Francis; 2002.
40. Harrell FE. Regression Modeling Strategies: With applications to linear modeling, logistic regression, and survival analysis. New York: Springer; 2001.
41. MacCallum RC, Zhang S, Preacher K, Rucker D. On the Practice of Dichotomization of Quantitative Variables. Psychological Methods 2002;7:19-40.
42. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
43. Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.
44. Harrell FE. Problems Caused by Categorizing Continuous Variables. 2008.
45. Maxwell SE, Delaney HD. Bivariate Median Splits and Spurious Statistical Significance. Psychol Bull 1993;113:20.
46. Royston P, Altman DG. Regression Using Fractional Polynomials of Continuous Covariates: Parsimonious Parametric Modelling. Journal of the Royal Statistical Society Series C (Applied Statistics) 1994;43:429-67.
47. Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 2004;66:411-21.
48. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-9.
49. Vittinghoff E, McCulloch CE. Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression. Am J Epidemiol 2007;165:710-8.
50. Freedland KE, Babyak MA, McMahon RJ, Jennings JR, Golden RN, Sheps DS. Statistical Guidelines for Psychosomatic Medicine. Psychosom Med 2005;67:167.
51. Thompson B. Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Ed Psychol Meas 1995;55:525-34.
52. Budtz-Jørgensen E, Keiding N, Grandjean P, Weihe P. Confounder Selection in Environmental Epidemiology: Assessment of Health Effects of Prenatal Mercury Exposure. Ann Epidemiol 2007;17:27-35.
53. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med 1997;16:385-95.
54. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statistical Science 1999;14:382-417.
55. Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol 2004;57:1262-70.
56. Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000;56:915-21.