Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Hierarchical Model Building, Fitting, and Checking: A Behi nd-the-Scenes
Look at a Bayesian Analysis of Arsenic Exposure Pathways
Peter F. Craigmile1, Catherine A. Calder, Hongfei Li, Rajib Paul, and Noel Cressie
Department of Statistics, The Ohio State University, Columbus, OH, USA
Department of Statistics Preprint No. 806
The Ohio State University
3 December 2007
Abstract In this article, we present a behind-the-scenes look at a Bayesian hierarchical analysis of path-
ways of exposure to arsenic, a toxic heavy metal. Our analysis combines individual-level personal exposure
measurements (biomarker and environmental media) with water, soil, and air observations from the ambient
environment. We include details of our model-building exercise that involved a combination of exploratory
data analysis and substantive knowledge in exposure science. Then we present our strategies for model fit-
ting, which involved piecing together components of the hierarchical model in a systematic fashion to assess
issues including parameter identifiability, Bayesian learning, and model fit. We also discuss practical issues
of data management and algorithm debugging. We hope that ourpresentation of these behind-the-scenes
details will be of use to other researchers who build large Bayesian hierarchical models.
Keywords: Arizona, Bayesian learning, data management, environmental health, Markov chain Monte Carlo
(MCMC) algorithm, model validation, National Human Exposure Assessment Survey (NHEXAS)
1contact email: [email protected]
1
1 Introduction
Bayesian hierarchical modeling has been increasingly recognized as a powerful approach for analyzing com-
plex phenomena. This framework has moved from simply the Bayesian analogue of classical multilevel
modeling for nested data, to being recognized as an essential tool for performing modern statistical science.
Bayesian hierarchical models are now commonly used both within and outside the statistics literature and
are widely lauded for their capacity to synthesize data fromdifferent sources, to accommodate complicated
dependence structures, to handle irregular features of data such as missingness and censoring, and to in-
corporate scientifically based process information. However, despite the inherent elegance of the approach,
effective derivation of inferences from these models is by no means a trivial task. The aim of this paper is to
provide a behind-the-scenes look into a Bayesian hierarchical analysis. We show the reader how we carried
out our model building, fitting, and checking, which were complicated by issues related to the quality and
quantity of available data. Strategies will be presented for compartmentalizing model components and inputs
in order to allow efficient assessment of the impact of modeling assumptions on inferences. While the reader
may not agree with all of our modeling decisions, and there certainly are aspects of our analysis that warrant
further study, our goal is to start a dialog on practical hierarchical-modeling strategies that allow defensible
statistical learning from complex Bayesian analyses.
This article is organized as follows. In Section2, we introduce the environmental-exposure application
that underlies our discussion of Bayesian hierarchical modeling. Model building through a combination
of exploratory data analysis and prior knowledge in exposure science is the focus on Section3. Then, in
Section4, we discuss our model-fitting strategies, including data-management and algorithm (Markov chain
Monte Carlo, MCMC)-checking (debugging) procedures. We take the reader through the fitting of various
components of our large hierarchical model in order to examine issues such as parameter identifiability,
Bayesian learning, and model fit. We conclude with a general discussion of Bayesian hierarchical modeling
2
in Section5.
2 Environmental exposure application
2.1 Overview
Our motivating application is a Bayesian hierarchical analysis of human exposure pathways. Gen-
erally, a pathways analysis refers to the study of the relationship between levels of toxicants in en-
vironmental media (e.g., air, water, dust, food) and levelsof personal exposure. In order to pro-
vide data to study pathways of exposure to a variety of toxic substances, the National Human Ex-
posure Assessment Survey (NHEXAS) pilot study was carried out in the 1990s by the Office of Re-
search and Development (ORD) of the U.S. Environmental Protection Agency (USEPA), along with
the U.S. Centers for Disease Control and Prevention (CDC) and the U.S. Food and Drug Admin-
istration (FDA) (NERL and National Center for Environmental Assessment 2000). This pilot study was
conducted in three areas: Arizona, EPA Region 5 (six midwestern states), and Maryland. In this
paper, we limit our pathways analysis to the Arizona (AZ) NHEXAS data consisting of environ-
mental media measurements and biomarker measurements (i.e., indicators of personal exposure) for
178 individuals in 15 counties of AZ (Robertson, Lebowitz, O’Rourke, Gordon, and Moschandreas 1999;
O’Rourke, Van de Water, Jin, Rogan, Weiss, Gordon, Moschandreas, and Lebowitz 1999). AZ was chosen
because of the relatively easy access to water, soil, and airobservations from the ambient environment. Us-
ing these data, we focus on modeling pathways of exposure to arsenic (As), a toxic heavy metal. Acute
exposure to As is associated with irritation of the gastrointestinal and respiratory tracts, while chronic expo-
sure has been shown to be related to melanosis, hyperpigmentation, depigmentation, hyperkeratosis, and skin
cancer, and may also affect the nervous, cardiovascular, and haematopoietic systems (WHO 1981). Further
details on the NHEXAS data used in our analysis can be found Section 2.2.
3
Hierarchical modeling is natural in pathways analyses due to the direct and indirect relationships between
environmental media and personal exposure. For example, the concentration of a toxicant in water may be
related to personal exposure directly through ingestion (drinking) and dermal (bathing) pathways. How-
ever, part of the ingestion pathway may also be indirect if water is used for cooking purposes. Rec-
ognizing the complexity of these relationships,Clayton, Pellizzari, and Quackenboss (2002)used a struc-
tural equations modeling approach to explore pathways of exposure to As and lead (Pb) using NHEXAS
Region 5 data. Also using the As NHEXAS Region 5 data,McMillan, Morara, and Young (2006)and
Cressie, Buxton, Calder, Craigmile, Dong, McMillan, Morara, Santner, Wang, Young, and Zhang (2007)de-
veloped Bayesian hierarchical pathways models. UnlikeClayton et al. (2002)’s approach, these models are
able to accommodate measurement error in the environmentalmedia and biomarker measurements, as well
as missing and censored (below a specified minimum detectionlevel (MDL)) observations, in a coherent
manner. We rely heavily on these models in the development ofour Bayesian hierarchical pathways model
described in Section3.
Recognizing that one of the primary aims of NHEXAS was to identify geographical variation in expo-
sure to hazardous chemicals (Pellizzari et al. 1995), Cressie et al. (2007)proposed supplementing the spa-
tially sparse NHEXAS Region 5 data with information on background levels of As in global environmen-
tal media. Specifically, background concentrations of As insoil, which additionally included both top-
soil and stream sediment concentration measurements, wereincluded in the pathways analysis. This ap-
proach to synthesizing topsoil and stream sediment geochemical measurements was subsequently refined by
Calder, Craigmile, and Zhang (2007). In addition to incorporating background levels of As in soil across
AZ, in this paper we also incorporate background levels of Asin water and air into our pathways analysis.
Rather than using this supplemental, orglobal environmental information to build informative priors forthe
county-level mean for eachlocal environmental medium (as was done byCressie et al., 2007), we use mix-
ture models that account for the uncertainty due to the spatial misalignment of the supplemental data and the
4
NHEXAS individuals.
In the remainder of this section, we provide overviews of thevarious data sources that are used in our
Bayesian hierarchical pathways model.
2.2 NHEXAS data
The NHEXAS AZ data were collected using a three-stage, population-based sampling design. In Stage I,
1200 households were approached to complete descriptive questionnaires. Of the 954 that responded, a
subset of 505 completed baseline questionnaires and had soil and dust analyses collected (Stage II). Our
analysis is based on the final Stage III sample, which consisted of a subset of 179 individuals for whom
data were collected over a seven-day period. Table2 lists the number of NHEXAS individuals per county in
AZ, along with the population of each county (in thousands) based on the 1990 Census. As expected, there
are more sampled individuals in counties with larger populations. Further details on the the NHEXAS AZ
sampling design can be found inRobertson et al. (1999).
The NHEXAS AZ data were obtained from the Human Exposure Database System (HEDS) at the EPA2.
We extracted from the database the As measurements in the environmental media, as well as in the Urine
biomarker, for 178 individuals (one individual could not bematched to a household). All As levels in our
analysis are concentrations in units ofµg/l; the natural log concentration measurements and MDL values
are modeled in our analysis. The variables of interest are summarized in Table1. For each variable, we
include the extent of missingness and left-censoring due tovalues being below the MDL. From this table, it
is apparent that most media considered by NHEXAS, except Water, have missing values and values below
the MDL. Personal Air and Soil have the greatest number of missing values, while Indoor Air, Outdoor Air,
and Urine have the most observations below the MDL. The implications of the amount of missingness and
2http://oaspub.epa.gov/heds/study dir frame?st id=23159
5
censoring for each medium will be discussed in Section4.
The data documentation from NHEXAS AZ3 lists the relative standard deviation (RSD), associated with
each medium. The RSD is the absolute value of the coefficient of variation and is often expressed as a
percentage. FollowingSantner, Craigmile, Calder, and Paul (2007, Appendix), we converted these values
into measurement-error precisions (provided in the rightmost column of Table1) for use in our Bayesian
hierarchical pathways model.
2.3 Global water data
Background levels of the As concentration in water were obtained from the Water Quality Division of the
Arizona Department of Environmental Quality. This division manages millions of records of water-quality
data collected from surface and ground-water monitoring programs statewide, as well as from drinking-
water systems. We were directly provided with both As concentration measurements taken from treated
water samples in public-water systems (PWS) and with AZ community water-use information.
The treated water samples were collected at water-treatment facilities from 1993 to 2006. In our analysis, we
only consider the subset of these observations for which ground or surface water is the primary source, the
water-treatment facility is a community public water system (PWS) that serves 15 or more service connec-
tions used by residents year-round or serves 50 or more residents year-round, and the water system was active
at the time when the sample was collected. This subset of the data consisted of 10,688 As concentration mea-
surements (in units ofmg/l) from 1,161 PWSs. Of these measurements, 3,732 are below themeasurement
detection limits (MDLs). We model the PWS As concentrationson the natural log scale.
Due to security concerns, we were not provided with the locations of the PWSs. Instead, we were only given
a community-water-use dataset that lists the number of individuals in each of the 15 AZ counties served by
3http://oaspub.epa.gov/eims/xmlreport.display?deid=22956&z chk=3399
6
each PWS. We note that each PWS may serve more than one county and each county can be served by more
than one PWS. The smallest number of people served by a PWS is 51 in Coconino Navajo County, the largest
number of people served by a single PWS is 1,200,000 in Maricopa county, and the median number of people
served by a PWS is 300.
2.4 Global soil data
Supplemental information about the levels of As in soils across AZ consists of point-referenced topsoil
measurements from the U.S. Geological Survey’s (USGS’s) USSoils database4 and stream sediment mea-
surements from the USGS’s National Geochemical Survey (NGS) database5. The topsoil data consists of
the concentration of As in the surface layer (the A horizon corresponding to a depth of around 20cm) from
samples of soil, sand, silt, and alluvial deposits collected in undisturbed regions. Each of the topsoil measure-
ments are associated with latitude/longitude coordinates(see left-hand panel of Figure1). While the stream
sediment data are also point referenced, we followCalder et al. (2007)and associate them with watersheds,
or hydrologically similar regions (Seaber, Kapinos, and Knapp 1987). Each of these hydrologic units, or wa-
tersheds, has a unique eight-digit code, termed the HUC8 code. Our analysis includes all stream sediment
data that were taken in the 84 watersheds that contain at least part of the state of AZ (right-hand panel of
Figure1). The units of both the topsoil and stream sediment data areµg/m3, or parts per million (ppm), and
both types of data are modeled on the natural log scale.
4http://tin.er.usgs.gov/ussoils/5http://tin.er.usgs.gov/geochem/
7
2.5 Global air data
Data providing information about the background levels of As in the ambient air across AZ were obtained
from the Interagency Monitoring of Protected Visual Environments (IMPROVE) Network6. This network
was established in 1985 and was designed to provide long-term air-quality records for U.S. national parks and
wilderness areas. We extracted all available As concentration readings (in units ofµg/m3 local conditions7
from the 14 AZ IMPROVE monitoring stations where As levels were monitored during the period March
1988 to August 2003. Each of these monitors is associated with latitude/longitude coordinates (see Figure2)
and, despite IMPROVE’s emphasis on non-urban areas, we notethat several monitoring stations are located
in or near major cities (e.g., Phoenix and Tucson). Readingsare collected on an every 3-4 day schedule,
and certain observations are flagged as being below a minimumdetection limit (MDL), the value of which is
provided. The data are modeled on the natural log scale.
3 Exploratory data analysis and model building
Bayesian hierarchical models have been around for over thirty years. An important starting point was
Lindley and Smith (1972), in which a Bayesian model was defined via a set of hierarchical linear regres-
sion equations. More recently, a standard structure is defined based on the data model, process model, and
parameter model decomposition (e.g.,Berliner, 2003). In our analysis of As pathways in AZ, we consider a
data model and a process model that are each made up of many parts. Rather than fitting the complete model
consisting of all the parts right from the outset, we incorporate the different parts piece-by-piece, validating
and assessing the model fit as we go along. We shall demonstrate that, in terms of the hierarchy defined
6http://vista.cira.colostate.edu/improve/7Temperature and pressure at the time of the reading as opposed to standard conditions of 25 degrees C and 760 mm Hg that are
sometimes used.
8
above, model building should be a combination of exploratory data analysis (EDA) and scientific knowledge.
3.1 The local environment to biomarker (LEB) model
In our study of pathways of exposure to As, the backbone of ourstatistical analysis is a Bayesian hierarchi-
cal model linking local environmental As levels to individual-specific levels of As in Urine, the biomarker
of As exposure. This local environment to biomarker (LEB) model is introduced and discussed in de-
tail in McMillan et al. (2006)and Cressie et al. (2007), so we do not include a lengthy development of it
here. In this subsection, we provide a brief description of the LEB model, for the purpose of completeness.
However in Section3.2, we focus on the global to local environment (GLE) model extensions inspired by
Cressie et al. (2007).
LEB data model The LEB model is driven by the individual-specific local environment and biomarker
measurements from NHEXAS; we generically refer to these measurements as the observations from theNM
different media. Not all of theN I individuals who participated in NHEXAS have an observationassociated
with each medium; certain observations were below a minimumdetection limit (MDL) and others are simply
unobserved (i.e., are “missing”). For each mediumj = 1, . . . ,NM , let Y Mj = {Y M
ji : i = 1, . . . ,N I}
denote the set of log As observations on that medium for each of theN I individuals. LetZMj = {ZM
ji : i =
1, . . . , N I} denote the observation status, so that for individuali, ZMji = 0 if the jth observed value,Y M
ji ,
is below the MDL value ofMji, ZMji = 1 if the value is above the MDL, andZM
ji = 2 if the observation
is missing. LetMMj = {MM
ji : i = 1, . . . ,N I} be the set of MDLs for mediumj. Defining XMj =
{XMji : i = 1, . . . , N I} to be the collection of the latent As concentrations in medium j, and lettingY M =
{Y M1 , . . . ,Y M
NM }, XM = {XX1 , . . . ,XM
NM}, ZM = {ZM1 , . . . ,ZM
NM}, MM = {MM1 , . . . ,MM
NM}, and
ωM = {ωMj : j = 1, . . . , NM}, we assume that the elements of our complete set of data are conditionally
9
independent given{XMij } and media-specific precision parameters,{ωM
j }. That is, we assume
[
Y M |XM ,ZM ,MM ,ωM]
=
NM∏
j=1
[
Y Mj |XM
j ,ZMj ,MM
j ,ωMj
]
=d
NM∏
j=1
INCM(XMj , 1/ωM
j ;ZMj ,MM
j ).
where the INCM (independent normal with possible censoringand missingness) distribution is defined in
SectionA of the Appendices.
LEB Process Model We then link the elements ofXM by choosing subsets of theNM media,{SMj }, that
are defined by an acyclic directed graph (e.g.,Lauritzen, 1996). These subsets define the joint distribution of
XM as follows:
[
XM |βM ,µM , τM]
=
NI∏
i=1
NM∏
j=1
[
XMij |{XM
ik }k∈SMj
, µMj , τM
j
]
(1)
=d
NI∏
i=1
NM∏
j=1
N
µM
j +∑
k∈SMj
βMjk XM
ik , 1/τMj
. (2)
In our case, the acyclic directed graph was defined by considering partitionsMk of the set ofNM media and
defining selector sets{SMj } that satisfy
SMj = ∅ , if j ∈ M1;
SMj ⊆
ℓ−1⋃
k=1
Mk , if j ∈ Mℓ, whenℓ > 1 .
In the model (2), we assume for each media that the baseline mean is constantacross the state (aregional
mean model). The choice of pathways{SMj : j = 1, . . . ,NM} that we use come from the baseline model
of Clayton et al. (2002), which was used to model As concentrations in EPA Region 5. A graphical view of
this structure is shown in Figure4.
To assess whether or not Clayton et al.’s structure is reasonable for the log As concentrations observed in
the NHEXAS AZ study, we looked at a number of graphical summaries of the data. In thisexploratory data
10
analysis (EDA), we replaced any values below the MDL, by halfthe MDL value and ignored any missing
values. One summary that gives some credence to Clayton et al.’s structure is shown in the dependence table
in Figure3. The numbers in the table are the sample correlations between pairs of log As concentrations in
each medium. Marginally, ignoring any multiplicity considerations, we conducted a test to see whether or
not the population correlation coefficient for each pair of media is zero – dark gray shading denotes a P-value
less than 0.05, light gray shading denotes a P-value between0.05 and 0.1, and no shading denotes a P-value
greater than 0.1. Rectangles around the correlation valuesindicate those covariates that appear in Clayton et
al.’s model for each response. For example, Beverage, Food,Surface Dust, Indoor Air, and Personal Air all
appear in Clayton et al.’s regression model for Urine. Thereis good agreement with Clayton et al.’s model
and the pattern of significant pairwise correlations between the response variable and each covariate, except
for media where we have a large number of missing values or values below the MDL (e.g., Soil, Personal
Air, Indoor Air, and Outdoor Air).
LEB model prior distributions We assume independence between the prior parameters and take the
media-specific mean terms{µMj } and slope parameters that link the different media in the LEBmodel,
{βMjk }, to have normal prior distributions. The media-specific process precisions{τM
j } are given gamma
priors. As stated previously, the measurement-error precisions,{ωMj }, are fixed using the method described
in Santner et al. (2007, Appendix).
3.2 The global to local environment (GLE) models
We now consider alternative versions of the model for{XMij } (Equations1 and2) for mediumj that draw on
supplemental data sources and knowledge about the background spatial processes driving the variation in As
levels in that medium. In Section4, we consider the impacts on inferences for As exposure pathways when
these global to local environment (GLE) models are used.
11
3.2.1 Global water model
We start with an exploratory data analysis of the global water data provided to us. The left-hand panels of
Figure5 show the histograms of log As concentrations of NHEXAS and PWS water measurements, both
given on thelog µg/l scale. For the PWS data, the observations below the MDL are denoted by check marks
along the x-axis of the histograms. Both of the distributions are slightly right-skewed. They are remarkably
similar, except for the fact that the PWS concentrations have a slightly longer right tail. This gives some
credence to the linking model that we specify below.
Global water data model Not all of the NW public water systems (PWS) in AZ have an observation
available, while some water systems have multiple observations. As is the case with the NHEXAS data,
certain observations were below an MDL and others are missing. For each PWSj = 1, . . . ,NW , let Y Wj =
{Y Wjk : k = 1, . . . , NW
j } denote the set ofNWj observed log As concentrations. Accordingly, we let
ZWj = {ZW
jk : k = 1, . . . , NWj } denote the set of observation-status variables, withZjk = 0 if the kth
measurement at PWSi is below the MDL value ofMWjk , Zjk = 1 if the value is above the MDL, orZjk = 2
if the observation is missing. LetMWj = {MW
jk : k = 1, . . . ,NWk } be the set of MDLs for PWSj. Taking
XWj = {XW
jk : k = 1, . . . , NW} to be the collection of the latent log As concentrations in PWS j, and
letting Y W = {Y W1 , . . . ,Y W
NW }, XW = {XX1 , . . . ,XW
NW }, ZW = {ZW1 , . . . ,ZW
NW }, andMW =
{MW1 , . . . ,MW
NW }, we assume that
[
Y W |XW ,ZW ,MW , ωW]
=
NW∏
j=1
[
Y Wj |XW
j ,ZWj ,MW
j , ωW]
=d
NW∏
j=1
INCM(XWj , 1/ωW ;ZW
j ,MWj ),
whereωW is the measurement-error precision.
12
Global water process model We take the global water process defined at each PWS to be conditionally
independent given PWS-specific unknown mean parametersµW = {µWj : j = 1, . . . ,NW } and an unknown
precision parameterτW . That is, we assume that
[
XW |µW , τW]
=
NW∏
j=1
NWj
∏
k=1
[
XWjk |µ
Wj , τW
]
=d
NW∏
j=1
NWj
∏
k=1
N(
µWj , 1/τW
)
.
Global water to LEB model We link the global water process to the Water process in the LEB model.
We definem(W ) to be the index corresponding to the Water medium. LetXMm(W ) be the collection ofN I
individual log As concentrations in Water andXMi,m(W ) be individuali’s latent Water log As level. Then we
assume that
[
XMm(W ), ζ
W |µW , τMm(W )
]
=
NI∏
j=1
[
XMi,m(W ), ζ
Wi |µW , τM
m(W )
]
. (3)
That is, we model the joint distribution of Water’s log As concentration and an unknown collection of
parametersζW that indicate whether a PWS serves individuals (i.e.,ζWij = 1 implies that individuali
is served by PWSj and ζWij = 0 implies that individuali is not served by water systemj). Letting
ζWi = (ζW
i1 , . . . , ζWiNW )′, we decompose the factors in the right-hand side of (3) by assuming that
[
XMi,m(W ), ζ
Wi |µW , τM
m(W )
]
=[
XMi,m(W )|ζ
Wi ,µW , τM
m(W )
]
[ζWi ]
=
NW∏
j=1
([
XMi,m(W )|µ
Wj , τM
m(W )
])ζWij
× Mult(
1, (λWc(i),1, . . . , λ
Wc(i),NW )
)
,
where Mult(n,p) denotes the multinomial distribution with parametersn andp. For eachi = 1, . . . ,N I , we
takeλWc(i) =
(
λWc(i),1, . . . , λ
Wc(i),NW
)
, whereλWc(i),j is equal to the proportion of the population of countyc(i)
served by PWSj. Finally, we assume that
[
XMi,m(W )|µ
Wj , τM
m(W )
]
=d N(
µWj , 1/τM
m(W )
)
.
13
Global water prior distributions Across the PWSs we assume that
[
µW |αW , CW]
=NW∏
j=1
[
µWj |αW , CW
]
=d
NW∏
j=1
N(αW , 1/CW ),
whereαW has a normal prior distribution andCW has an gamma prior distribution. The priors forωW and
τW are also gamma distributions. We assume all global water prior distributions are mutually independent.
3.2.2 Global soil model
Looking towards modeling these concentrations jointly with the NHEXAS soil data, Figure6 shows a sum-
mary of the NHEXAS, topsoil, and stream sediment log As concentrations. The histograms in the figure
show that the three sets of concentration measurements havesimilar distributions, each having a slight right-
skewness (the median NHEXAS log concentration value is 1.89log µg/m3, versus the median topsoil log
concentration value of 1.63log µg/m3 and the median log stream sediment concentration of 1.83log µg/m3).
Given the aridness of AZ, it is not surprising that the distributions are so similar. The top right-hand figure
shows the number ofobserved (i.e., not missing) NHEXAS soil measurements for each county of AZ, high-
lighting that the information content is sparse in many counties. Compared with the locations of the topsoil
measurements (left-hand panel of Figure1), we can see that the topsoil measurements are mostly uniformly
spread around the region. On the other hand, the map of the HUC8 regions with the stream sediment mea-
surements denoted by crosses (right-hand panel of Figure1) shows that the stream sediment measurements
appear in clumps across the study region.
The data and process models shown below for the log As concentrations in topsoil and stream sediment are
taken fromCalder et al. (2007). We complete our model by linking these media to the NHEXAS soil-related
media processes in the LEB model using a mixture model (not part of Calder et al., 2007). This mixture
model captures the uncertainty in locations of the NHEXAS individuals across AZ and the uncertainty due to
our not having precise information for the NHEXAS individuals (NHEXAS only gives county of residence).
14
Our approach accommodates the spatially mismatched sampling schemes of NHEXAS, USSoils, and NGS.
Global soil data model The global soil data model draws on both As concentration measurements in
topsoil and stream sediment. We letY T (si) denote the log As concentration observation at locationsi and
defineY T = {Y T (si) : i = 1, . . . , NT } to be the collection of theNT topsoil observations. We assume that
[
Y T |XT (·), ωT]
=
NT∏
i=1
[
Y T (si)|XT (si), ω
T]
=d
NT∏
i=1
N(
XT (si), 1/ωT)
,
whereXT (·) is a continuously indexed spatial process defined on the domain D ⊂ R2 representing AZ, and
ωT is the measurement-error precision.
For the stream sediment observations, we letY Hjk be thekth log As concentration measurement taken in the
jth HUC8 region (watershed) and denote the collection of stream sediment observationsY H = {Y Hjk : k =
1, . . . , NHj , j = 1, . . . , NH}, whereNH
j is the number of observations in thejth HUC8 region andNH is
the number of HUC8s. We assume that
[
Y H |XH , ωH]
=
NH∏
j=1
NHk
∏
k=1
[
Y Hjk |X
Hj , ωH
]
=d
NH∏
j=1
NHk
∏
k=1
N(
XHj , 1/ωH
)
,
whereXH is a latent process defined on theNH watersheds (a random vector of lengthNH) andωH is the
measurement-error precision.
Global soil process model We assume that, conditional on the stream sediment processXH , the topsoil
process{XT (s)} is independent across locationss ∈ D. At each location, we assume that the mean of the
latent log topsoil process is linearly related to the streamsediment process, via
[
XT (s)|βT0 , βT
1 , τT]
=d N(
βT0 + βT
1 XHh(s), 1/τ
T)
. (4)
Hereh(s) denotes the HUC8 region that contains locations. The spatial dependence in the soil model is
introduced through a Gaussian prior on the latent stream sediment process, defined over watersheds. We use
15
a conditional autoregressive (CAR) model:
[
XH |µH , τH , γH]
=d NNH
(
µH1,
[
τH(I − γHA)]−1
)
,
where Nl denotes al-dimensional normal distribution,A is a spatial neighborhood matrix,γH is an unknown
spatial-dependence parameter, andτH is the process precision parameter.
Global soil to LEB model LettingXMm(S) = {XM
i,m(S) : i = 1, . . . ,N I} be the collection ofN I individual
latent log As concentrations in soil, we assume that
[
XMm(S), ζ
T |XT (·), τMm(S)
]
=NI∏
j=1
[
XMi,m(S), ζ
Ti |X
T (·), τMm(S)
]
. (5)
The difference between this linking model and that for wateris that the unknown collection of parameters,
ζT , captures the uncertainty in the spatial location of NHEXASindividuals. From (4), we see that the spatial
distribution of XT (s) only depends on the spatial location throughh(s), the HUC8 region containings.
Thus, rather than assigning NHEXAS individuals to watersheds, we define an unknown parameterζTij such
that ζTij = 1 if NHEXAS individual i is located in HUC8 regionj, and zero otherwise. Then, for each
i = 1, . . . , N I , we defineλTc(i) =
(
λTc(i),1, . . . , λ
Tc(i),NH
)
, whereλTc(i),j is equal to the proportion of the area
of the countyc(i) that contains HUC8 regionj (which could be zero if a HUC8 region is not contained in
countyc(i)). Assuming NHEXAS individuals are uniformly distributed across a county, we assume that
[
XMi,m(S), ζ
Ti |X
T (·), τMm(S)
]
=NH∏
j=1
([
XMi,m(S)|X
T (well-defined location in HUC8 regionj), τMm(S)
])ζTij
×
Mult(
1, (λTc(i),1, . . . , λ
Tc(i),NH )
)
=d
NH∏
j=1
(
N(
XT (well-defined location in HUC8 regionj), 1/τMm(S)
))ζTij
×
Mult(
1, (λTc(i),1, . . . , λ
Tc(i),NH )
)
.
16
Global soil prior distributions We assume independent gamma priors on the topsoil precision, ωT , and
the stream sediment precision,ωH . In the analysis (see Section4 for more details), we fixωT . The priors
for the intercept and slope in the model relating topsoil andstream sediments,βT0 andβT
1 , are independent
normal distributions. The stream sediment mean,µH , is also given a normal prior, and we assume gamma
priors for the process precisions,τT andτH . We assume a uniform distribution on the spatial-dependence
parameter,γH , in the stream sediment model. The bounds on the uniform distribution are determined by the
reciprocal of the smallest eigenvalue (lower bound) and thereciprocal of the largest eigenvalue (upper bound)
of the spatial proximity matrixA, ensuring thatI − γHA is positive-definite and hence that the distribution
of XH is proper (e.g.,Banerjee, Carlin, and Gelfand, 2004, p.80).
3.2.3 Global air model
Figure 7 show time series plots of the observed IMPROVE air monitor readings of the ambient log As
concentrations across AZ, at the fourteen locations displayed in Figure2. The vertical gray lines indicate the
time period during which the NHEXAS study was conducted for the county containing the monitor. As can
be seen, there is a clear lack of information about the As level during the NHEXAS time period for 10 out of
the 14 monitors. Due to the censoring below the MDL and the trending that is obvious in some of the series
(e.g., sites #1, #5, and #8), we are hesitant to fit any statistical model to these data.
Given these data-coverage problems for air, we decided onlyto provide an informative prior for the intercept
in the equation for the Outdoor Air log As concentrations process defined by (2). Let j(A) be the index
corresponding to Outdoor Air in the NHEXAS model (2). We assume that
µj(A) ∼ N(αA, 1/CA),
where in this caseαA andCA are fixed. From the information contained in Figure2, we setαA = −6 on
the log ppm scale, andCA = 1/4, so that the standard deviation of the intercept parameter is 2 on the log
17
ppm scale. This prior has a larger standard deviation than issuggested by the data, to capture the change
of support of moving from the IMPROVE monitors to the monitors used outside the home in the NHEXAS
study.
4 Model fitting and checking
In this section, we discuss the procedure we went through to fit the Bayesian hierarchical model. Rather
than just fitting one large model that links the global environmental soil, water, and air models to the LEB
model, we put these pieces together in a stepwise fashion, adding in each model component one-at-a-time.
Throughout this stepwise procedure, we validated whether or not our models fitted the data well by comparing
the posterior distributions, with and without each model component, to the observed data.
Before we consider our stepwise model fitting, we discuss some of the decisions made in implementing and
fitting the Bayesian hierarchical model using the numerous datasets involved in the analysis. These details
are often unreported in many Bayesian analyses but are worthwhile setting out for readers unfamiliar with
applied Bayesian statistics.
4.1 Data management
In our study, datasets were stored in a hierarchy of directories by data type. As a security feature, access
to theoriginal datasets were limited. Instead, modelers had access to thederived data. The scripts used to
create these derived dataset are stored along with the derived datasets. Data management and processing was
carried out using ArcGIS (for spatial data management), SAS, and R (R Development Core Team 2007). The
maptoolsR library 8 was also used for further spatial-data management. In termsof data format, we found
that comma delimited (CSV) text files, although large in size, tended to be the easiest to use for exploratory
8http://cran.r-project.org/src/contrib/Descriptions/maptools.html
18
data analysis and modeling, especially when using many different software packages.
4.2 Implementing the Markov chain Monte Carlo algorithm
In this analysis, the posterior distributions for the modelparameters are not available in closed form. Instead,
we sample from the posterior density using a Gibbs sampling algorithm that sequentially samples from
the full conditional distributions of the model parameters(Gelfand and Smith 1990). Since some of these
full conditional distributions are also not available in closed form, we employ independence or symmetric-
random-walk Metropolis steps (Tierney 1994). Although there are other versions of the Metropolis-Hasting
algorithm that may have better convergence properties, these methods are easy to implement and straightfor-
ward to tune to achieve appropriate mixing. In practice, thechoice of more complicated candidate distribu-
tions and algorithms always requires careful implementation and tuning.
To implement the MCMC algorithm in practice, one needs to check:
1. that the full conditional distributions are analytically correct, as is the computer code to sample from
them;
2. whether the algorithm converges and mixes well (i.e., whether the full parameter space is being ex-
plored);
3. the sensitivity of the inferences to the choice of prior distributions;
4. the sensitivity of the inferences to aspects of the data.
All of these issues are important in fitting our Bayesian hierarchical pathways model, due to the large num-
ber of pathways (processes) and different data sources. Each dataset can have issues with, for example,
missingness and censoring (MDLs). The data have different (initially incoherent) spatial scales. For some
19
data types, there can be a lack of spatial coverage or resolution. It is still very easy to make mistakes in
implementation of the model-fitting algorithm in computer code, even using dedicated software such as Win-
BUGS/OpenBUGS (Spiegelhalter, Thomas, Best, and Lunn 2004).
Our strategy starts far away from the computer. A model document was created before the MCMC algorithm
was implemented in computer code. The document defines all the notation and the full conditional distribu-
tions, and multiple people (i.e., more than one!) checked the document for errors. For an abbreviated version
of this document that contains all full conditional distributions, see the Supplemental Material document. At
this point, we decided what parameters to augment in the MCMCalgorithm (e.g., adding parameters to make
parts of the model conditionally conjugate, or choosing to sample missing data). We tried to keep multi-
ple versions of the documents and computer code, so that the user could go back to a previous version if a
problem was found with the current version. (We experimented with the software packageSubversion 9,
which can be used to keep track of multiple versions of files, but we did not use it much by the end of the
project. It was easier to make copies of files ourselves.)
Before we implemented the actual MCMC algorithm, we first wrote utility computer code that was needed
(e.g., a C function dynamically called from R for sampling truncated normals). To reduce the chance of
wrongly implemented utility functions corrupting the MCMCalgorithm, this utility code was checked in-
dependently and extensively before being used in practice.Finally, before implementing and running the
MCMC algorithm, the routines to read in the data were coded and checked.
We coded up our MCMC algorithm in the R statistical software language; the interpreted language R does
not run the algorithm as fast as compiled programming languages such as C, C++, or Fortran, but it was fast
enough, and it allowed us the flexibility to use existing functions and complex data types.
In coding up the steps of the MCMC algorithm, we successivelyadded parameters into the model. Param-
9http://subversion.tigris.org/
20
eters that were unsampled in the current stage of the MCMC algorithm, were fixed temporarily, and were
not updated during the simulation. We checked that the output was correct with simulated data, before im-
plementing the MCMC algorithm on actual data. These same rules applied when we made changes to the
model. We compared the output from the previous and current versions of the computer code. Whenever
possible, we also implemented the MCMC algorithm using OpenBUGS, via the R libraryBRugs 10. This
provided another check for the R implementation.
There were a number of strategies that helped us in implementing the MCMC algorithm. Depending on the
robustness required for sampling, we used a Cholesky decomposition or eigenvalue decomposition to sample
the normally distributed slope parameters of our regression models. (The eigenvalue decomposition is more
robust, but much slower.) We never try to invert matrices; instead we solve linear equations, because inner
products are more efficient than matrix calculations. For censored data, we sampled any values that were
below the MDL using a truncated normal distribution. This means that we can omit a Metropolis-Hastings
step to update the parameters relating the data,Y , and the process,X. Since the standard inverse-CDF
method is very inefficient for sampling truncated normal random variables, especially for MDLs far in the
lower tail of the normal distribution, we used a rejection-sampling algorithm proposed byGeweke (1991).
This algorithm uses either normal rejection sampling or exponential rejection sampling based on the value of
the MDL. We also sampled missing data within our MCMC algorithm. Although this is not necessary, this
strategy simplified the computer code and cut down on errors in implementation.
While MCMC computer code needs to be efficient to work quicklyin practice, it also needs to be straight-
forward to read and debug, especially if one needs to amend the code in the future. At odds with having
easy-to-read (commented) code, we found that making the computer code overly general can lead to slow
execution times. Compromises needed to be made throughout the implementation.
10http://cran.r-project.org/src/contrib/Descriptions/BRugs.html
21
4.3 Stepwise fitting of the model
4.3.1 Full LEB model
We began by fitting the full Local Environment to Biomarker (LEB) model containing all of the pathways
shown in Figure4. As is traditionally done in any MCMC algorithm, we ran multiple chains from multiple
starting locations. After a burn-in of 5,0000 iterations, we ran each chain for a further 100,000 iterations.
Although not entirely necessary, we reduced the autocorrelation in the chain by thinning and storing every
10th iteration. Hence, our inferences are based on 10,000 samples. Trace plots and autocorrelation plots
were used to diagnose convergence of the chains. In this case, the mean parameters{µMj }, slope parameters
{βMj,k}, and precision parameters,{τM
j }, corresponding to Personal Air, Indoor Air, and Outdoor Airnever
converged; running the chains longer did not remedy this situation. The essential problem is that there is
little information (i.e, few observed values above the MDL)contained in the NHEXAS air measurements,
especially in Personal Air (Table1). This lack of data is most likely due to the burdensome nature of personal-
air exposure monitors (Samet and Jaakkola 1999). To see if the problem was mediated by removing Personal
Air, we re-ran the model without it, now linking the Indoor Air and Outdoor Air processes directly to the
Urine biomarker (essentially, we removed the Personal Air node in Figure4). This helped a little, but still the
lack of observations on Indoor Air and Outdoor Air caused convergence problems. We only achieved what
we deemed to be convergence by removing any links in the LEB model that involved Indoor Air, Outdoor
Air, and Personal Air. Henceforth, the model with just the air-related links removed from the LEB model
will be referred to as thefull LEB model.
Before we summarize the resulting full LEB model, we give a description of how we diagnosed its fit.
We started by producing standardized residual plots for thelog As concentration process for each of the
media in the NHEXAS study,XMj , j = 1, . . . ,NM . That is, for each each mediumj, and each individual
22
i = 1, . . . , N I , we defined
rMji {p} =
XMji {p} −
(
µMj {p} +
∑
k∈SMj
βMjk {p}X
Mik {p}
)
√
τMj {p}
,
where θ{p} denotes the pth sample of the parameterθ from the joint posterior distribu-
tion (Gelman, Carlin, Stern, and Rubin 2004, p.170). We found that taking a random realization (or a few
realizations) of the posterior parameters,{XMji }, {µM
j }, {βMjk }, and{τM
j }, and then examining plots of
the standardized residuals, was sufficient for diagnosing model fit. As an example, Figure8 demonstrates
residual plots for the Urine pathway. The histogram and normal quantile-quantile plots of the standardized
residuals (top panels) demonstrate that the normal approximation is reasonable. The plots of the residuals by
observation number in Figure8 (second row, left-hand panel) show two observations outside the limits of -3
and 3 for the standardized residuals. If indeed the residuals were normal and independent across individuals,
we would expect roughly 0.5 of the 178 residuals to be outsidethese limits. The two observations are de-
noted by the gray circles in Figure8, corresponding to case numbers 75 and 157 occurring in Maricopa and
Santa Cruz counties. We can see that the plots of the residuals versus the process variables in the pathway
for Urine (Sill Dust, Soil, Food, and Beverage) are scattered evenly about zero, although there are a few
extreme points (including cases 75 and 157) that are worth further investigation. A plot of the standardized
residuals versus the county in Figure8 (bottom trow, right-hand panel) shows that while the residuals are not
perfectly centered around zero for each county, the assumption of a single intercept for the Urine pathway is
reasonable here. Standardized residuals based on a single random draw from the posterior distribution were
useful in this case, however there is certainly some question to how applicable this idea is to other models
and datasets.
Figure9 displays a summary of the parameters{βMjk } in the LEB model, after removing the “air” pathways.
Consider the Urine, Food, and Beverage media and their (linear) dependence on other media. We display
the posterior mean (black circle) and a 95% credible interval (solid black line) for the slope parameters
23
linking the particular media to other media. For example, Urine is related linearly to Sill Dust, Soil, Food,
and Beverage. (We define and compare the gray circles and dashed lines in Section4.3.4.) Except for the
link between Beverage and Sill Dust, the strengths of all thelinear relationships appear to be non-zero, as
evidenced by the fact that the 95% credible intervals do not contain zero. Log Urine As concentrations are
positively associated with Sill Dust, Food, and Beverage, and negatively associated with Soil. There is weak
positive dependence between Food and Sill Dust and between Food and Water. In the Beverage pathway, an
increase in the log As concentration in Water is associated with an increase in the log As concentration in
Beverage.
The black symbols and lines in Figure10 show posterior summaries (the circles denote the posteriormeans,
the lines are 95% credible intervals) for the process variances,{1/τMj }. We can see that Urine and Sill Dust
process variances are the largest, while the Soil process variance is the smallest.
4.3.2 LEB model + global water
Having established a satisfactory LEB model, we then linkedthe global water model to it (i.e., the models
of Sections3.1 and 3.2.1). After diagnosing model fit (in a similar fashion to the discussion above), we
compared the posterior distributions of the parameters in:
1. the full LEB model;
2. the full LEB model + global water; and
3. an LEB model in which we only include the NHEXAS water log As concentrations + global water.
By making comparisons of the three models we can understand the effect of adding the global water model
component to the LEB model component, and vice versa. As an example of the comparisons we made, the
posterior summaries of the parameters involved in the global water model as defined in Section3.2.1are
24
shown in Table3. With respect to these parameters, there is no difference inthe summaries when the rest of
the LEB model is linked to the global water model. (We will leave comparisons of the LEB slope parameters
and process variances until Section4.3.4.) For the moment, we note that there was some Bayesian learning
from the mixture model that links the Water medium in the LEB model with the global water process. This
can be observed by comparing the posterior distribution of the parameterζWij that links individuali to PWS
j, with the proportion of the countyc(i) served by PWSj, λWc(i),j . Figure11 shows the comparison for four
randomly selected individuals. There is evidence that the model puts substantial weight on certain PWSs
for each individual. Even though the NHEXAS Water log As concentrations process,XMm(W ) learns from
the PWS process, the posterior distribution ofXMm(W ) did not change upon the addition of the global water
model.
4.3.3 LEB model + global soil
Similar to what we did with water, we compared posterior distributions of the parameters in:
1. the full LEB model;
2. the full LEB model + global soil; and
3. an LEB model in which we only include the NHEXAS soil log As concentrations + global soil.
As an illustration, Table4 compares the posterior summaries of the global soil parameters in 2. and 3. above.
There is little difference between the parameters in the global stream model (1/ωH , µH , 1/τH , γH). The
process precision for the topsoil process,1/τT , decreases as we add more pathways above Soil in the LEB
component of the model (2. versus 3.), while the credible interval slope parameter,βT1 , relating the global
topsoil process to the global stream process is wider for thefull LEB + global soil model. Our analysis agrees
with our exploratory data analysis, where we found similar log As concentrations in the NHEXAS, USSoils
25
(topsoil), and NGS (stream sediment) studies. As with water, there was substantial Bayesian learning from
the mixture model that assigns watersheds to individuals (not shown here). With the addition of the global
soil model, we found that the posterior distributions of theindividual NHEXAS log As Soil concentration,
XMi,m(S) (i = 1, . . . , N I ), did not change for those individuals with observed valuesequal or above the MDL.
For those individuals with values below the MDL or that were missing, the distribution of the NHEXAS log
As in Soil shifted lower with the addition of the global soil model.
4.3.4 Full LEB model + global water + global soil
Finally, we fitted the full LEB model including the global water and global soil models. After checking the
model fit (similar to the manner described above), we are in a position to summarize the resulting model. The
gray circles and dashed lines display, respectively, the posterior means and 95% posterior credible intervals
for the posterior parameters{βMjk } (Figure9) and process variances{1/τM
j } (Figure10) in this model. We
compare these parameters of the full LEB model + global water+ global soil, with the parameters of the full
LEB model (i.e., without global models, corresponding to the black circles and lines in the figures). In terms
of the posterior distributions of{βMjk }, adding the global modelsdoes not greatly change the relationships we
observe between the environmental media processes in the LEB model. The one exception is in the posterior
credible interval for Soil in the Urine pathway. For the fullLEB model + global water + global soil, the
interval moves slightly to contain less negative values. This casts doubt on the negative association between
the arsenic levels in Urine and Soil, which the science wouldsuggest should be positive. In terms of the
process variances,{1/τMj }, including global models greatly reduces the process variances in the Soil and
Water pathways in the LEB model, but it does not affect the process variances for the other pathways. In the
interest of space, we do not include summaries of the posterior parameters in the global water and global soil
models – they are almost identical to the summaries given in the right-hand panel of Tables3 and4.
26
4.3.5 Further sensitivity analyses
Given that the data sources vary in quality and scope, we re-ran the full LEB model to investigate the sen-
sitivity to subsets of the data (because of the time it takes to carry out each MCMC, we did not include the
global water and global soil models with the full LEB model).As an example, we left out all individuals in
each county in turn, and refitted the full LEB model. Our interest was in the influence of any one county’s
NHEXAS data upon our results. Figure12 shows summaries of the posterior distributions of the parameters
{βMjk } in the Urine pathway for 14 runs (the number of counties we have data for) of the MCMC algorithm,
removing the data for each county in turn. This figure indicates, with perhaps the exception of Pima county,
that the posterior distributions are robust to the data, which provides us some confidence in the results we
have obtained.
4.4 Summary of scientific results
We have demonstrated that the structure proposed byClayton et al. (2002)to explain human As exposure
pathways in NHEXAS Region 5 is applicable to the analysis of As exposure pathways in NHEXAS AZ,
once the air pathways are removed. This removal of air pathways was necessitated by the lack of information
in the corresponding data (few observations were above the MDL) and associated model-fitting-algorithm
convergence problems. Except for the link between Soil and Urine, all associations between variables were
positive. The incorporation of GLE models for global water and global soil did little to affect these asso-
ciations except to cast a small doubt on the negative association between Urine and Soil, and to sharpen
estimates of the posterior distributions of the Water and Soil log As processes. Therefore, the part of our
analysis that includes GLE models, does not significantly refine our understanding of pathways in the full
LEB model of individuals’ exposure to As. However the incorporation of GLE models, could be of use in
assessing spatial variation in exposure across AZ, since the global water and soil data have better spatial
27
coverage than the NHEXAS AZ data.
5 Discussion
It is our contention that large Bayesian hierarchical models need to be implemented carefully. Following
standard ideas of software development, the Bayesian modelshould be coded up and tested in bite-size
pieces. Since it can be hard to get the chains in the MCMC algorithm to mix, the choice of priors and
parameters updates are critical. Datasets need to be managed, in a documented fashion. Sensitivity of the
results to the data is often overlooked but is very importantto check. For example, in the exposure pathways
model that relates the As levels measured in Urine to local and global environmental measurements, we found
that no county’s data heavily influences the model fit. Fast, reliable MCMC computation with an extendable,
modifiable code is crucial for success.
We close with a final note about prior distributions. Throughout, we have probably relied too heavily on the
use of normal and gamma priors. Certainly, the use of normal priors is standard for modeling log concentra-
tions. However, the use of gamma priors for precisions is more open to debate (e.g.,Gelman, 2006). As we
stepped through our models, we checked that our results werenot sensitive to the choice of the parameters in
the gamma prior distributions. In a few smaller versions of the model, we replaced the gamma prior distribu-
tions by other commonly used, more informative prior distributions, such as uniform distributions. (The full
conditional distributions are then not available in closedform, and require Metropolis-Hastings steps.) With
these replacements, the results were unaffected.
28
Acknowledgments
This research was supported by the American Chemistry Council’s (ACC) Long-Range Research Initiative
under an agreement between the ACC and the Ohio State University Research Foundation. The agreement
was a result of a submission to the EPA’s FY2003 STAR Grant program, in response to an RFA titled,
“Environmental Statistics Research: Novel Analyses of Human Exposure Related Data,” which was jointly
funded by the EPA’s National Center for Environmental Research (NCER) and the ACC. The authors would
like to thank Wayne Hood, the Supervisory Hydrologist in Data Management and Analysis of Water Quality
Division of the Arizona Department of Environmental Quality for providing the public-water-system datasets.
Appendix
A The independent normal model with possible censoring and missingness
(INCM)
Let N(µ, σ2) denote a normal distribution with meanµ and varianceσ2, andCN(µ, σ2; c) denote a left-
censored normal distribution with meanµ and varianceσ2 > 0 that is censored at levelc. For each obser-
vationYi from the data{Y1, . . . , Yn}, we let the observation-status variableZi be0 if the observed value is
below the minimum detection limit (MDL) value ofMi, or 1 if the value is equal to or above the MDL, or2
if the observation is missing. Now letY = (Y1, . . . , Yn)T , Z = (Z1, . . . , Zn)T , andM = (M1, . . . ,Mn)T .
Then the random vectorY has anindependent normal distribution with possible censoring and missingness
(INCM distribution), or Y ∼ INCM(µ, σ2;Z;M ), if
[
Y |X, σ2,Z,M]
=∏
{i : Zi=0 or 1}
[
Yi|µi, σ2]
=
∏
{i : Zi=0}
CN(
µi, σ2;Mi
)
∏
{i : Zi=1}
N(
µi, σ2)
,
whereE(Y ) = (µ1, . . . , µn)T ≡ µ is the mean vector andvar(Yi) = σ2 > 0 (i = 1, . . . , n) is the variance.
29
References
Banerjee, S., B. Carlin, and A. Gelfand (2004).Hierarchical Modeling and Analysis for Spatial Data.
Boca Raton, FL: Chapman & Hall/CRC.
Berliner, L. M. (2003). Physical-statistical modeling in geophysics.Journal of Geophysical Research
(Atmospheres) 108 (D24), 8776, doi: 10.1029/2002JD002865.
Calder, C. A., P. F. Craigmile, and J. Zhang (2007). Regionalspatial modeling of topsoil geochemistry.
Technical Report 796, Department of Statistics, The Ohio State University. Submitted for review.
Clayton, C. A., E. D. Pellizzari, and J. J. Quackenboss (2002). National Human Exposure Assessment
Survey: Analysis of exposure pathways and routes for arsenic and lead in EPA Region 5.Journal of
Exposure Analysis and Environmental Epidemiology 12, 29–43.
Cressie, N., B. E. Buxton, C. A. Calder, P. F. Craigmile, C. Dong, N. J. McMillan, M. Morara, T. J. Santner,
K. Wang, G. Young, and J. Zhang (2007). From sources to biomarkers: A hierarchical Bayesian
approach for human exposure modeling.Journal of Statistical Planning and Inference 137, 3361–
3379.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling-based approaches to calculating marginal densities.
Journal of the American Statistical Association 85, 398–409.
Gelman, A. (2006). Comment on “A comparison of Bayesian and likehood-based methods for fitting
mutilevel models” by W. J. Browne and D. Draper.Bayesian Analysis 1, 515–534.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004).Bayesian Data Analysis, 2nd edn. Boca
Raton, FL: Chapman & Hall/CRC.
Geweke, J. (1991). Efficient simulation from the multivariate normal and Student-t distributions subject
to linear constraints. In E. M. Keramidas (Ed.),Computing Science and Statistics: Proceedings of the
Twenty-Third Symposium on the Interface, Fairfax, VA: Interface Foundation of North America, pp.
30
571–568.
Lauritzen, S. L. (1996).Graphical Models. Oxford, UK: Oxford University Press.
Lindley, D. V. and A. F. M. Smith (1972). Bayes estimates for the linear model (with discussion).Journal
of the Royal Statistical Society, Series B 34, 1–41.
McMillan, N. J., M. Morara, and G. S. Young (2006). Hierarchical Bayesian modeling of human exposure
pathways and routes. Alexandria, VA: American StatisticalAssociation, pp. 2492–2503.
NERL and National Center for Environmental Assessment (2000). Strategic Plan For The Analysis Of The
National Human Exposure Assessment Survey (NHEXAS) Pilot Study Data. ORD, U.S. EPA: NERL
and National Center for Environmental Assessment.
O’Rourke, M. K., P. K. Van de Water, S. Jin, S. P. Rogan, A. D. Weiss, S. Gordon, D. Moschandreas,
and M. Lebowitz (1999). Evaluations of primary metals from NHEXAS Arizona: Distributions and
preliminary exposures.Journal of Exposure Analysis and Environmental Epidemiology 9, 435–445.
Pellizzari, E., P. Lioy, J. Quackenboss, R. Whitmore, A. Clayton, N. Freeman, J. Waldman, K. Thomas,
C. Rodes, and T. Wilcosky (1995). Population-based exposure measurements in EPA Region 5: A
Phase I field study in support of the National Human Exposure Assessment Survey.Journal of Expo-
sure Analysis and Environmental Epidemiology 5, 327–358.
R Development Core Team (2007).R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Robertson, G., M. Lebowitz, M. O’Rourke, S. Gordon, and D. Moschandreas (1999). The National Human
Exposure Assessment Survey (NHEXAS) study in Arizona – introduction and preliminary results.
Journal of Exposure Analysis and Environmental Epidemiology 9, 427–434.
Samet, J. and J. Jaakkola (1999). The epidemiologic approach to investigating outdoor air pollution. In
S. Holgate, J. Samet, H. Koren, and R. Maynard (Eds.),Air Pollution and Health. London, UK: Aca-
31
demic Press.
Santner, T. J., P. F. Craigmile, C. A. Calder, and R. Paul (2007). Effect and pathways modifiers in a
Bayesian pathways analysis of the National Human Exposure Assessment Survey for arsenic in EPA
Region 5. Technical Report 801, Department of Statistics, The Ohio State University. Submitted for
review.
Seaber, P., F. Kapinos, and G. Knapp (1987).Hydrologic Unit Maps. Denver, CO: U.S. Geological Survey.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and D. Lunn (2004, January).WinBUGS Version 1.4 User
Manual. Cambridge, UK: MRC Biostatistics Unit. Available athttp://www.mrc-bsu.cam.ac.uk/bugs.
Tierney, L. (1994). Markov chains for exploring posterior distributions.Annals of Statistics 22, 1701–
1728.
WHO (1981).Arsenic. Geneva: World Health Organization. Environmental HealthCriteria 18.
32
Media Description #Missing #<MDL ωj
Urine Morning void urine samples were collected on the third
and seventh days of the study.
12 55 393.99
Personal Air Air samples were collected for a subset of the partici-
pants using monitors worn by the participants for a pe-
riod of 144 hours.
168 4 893.99
Indoor Air Integrated six-day air samples were collected inthe pri-
mary living area of the participants’ homes.
51 90 893.99
Outdoor Air Integrated six-day samples were collected at a location
outside the participants’ homes.
58 82 893.99
Soil Soil samples were collected by a sweeping method from
an area with high traffic at the primary entrance to the
home and by a ring collection method from exposed soil
in the primary outdoor activity areas.
110 0 893.99
Surface Dust Dust samples were collected from a windowsill in the
participant’s primary living area.
20 12 893.99
Food NHEXAS participants were required to prepare or obt- 3 0 893.99
Beverage ain duplicate samples of the food and beverages they
consumed for four consecutive days. Composites were
separated intobeverage intake andfood intake.
10 11 893.99
Water Tap-water samples were collected from participants’
homes after flushing the pipes for a period of 3 minutes.
0 0 3594.00
Table 1: A summary of the environmental media and biomarker measurements from the NHEXAS AZ study.
The right-most columns tabulate for each medium,j the number of individuals (out of 178) that have missing
values or values below the MDL, and list the measurement-error precision,ωj .
33
County Apache Cochise Coconino Gila Graham Greenlea La Paz Maricopa
#Individuals 4 8 8 13 4 4 0 49
Population 77 138 136 62 40 9 26 4022
County Mohave Navajo Pima Pinal Santa Cruz Yavapai Yuma
#Individuals 5 4 44 10 6 11 8
Population 217 117 1021 243 47 227 213
Table 2: The number of NHEXAS individuals in each county of AZ, along with the population (in thousands)
according to the 1990 Census.
Parameters water only-LEB + global water Full LEB + global water
1/τW 0.475 ( 0.459 , 0.492 ) 0.478 ( 0.461 , 0.495 )
αW -4.798 ( -8.446 , 0.675 ) -4.780 ( -8.394 , 0.636 )
1/CW 7.522 ( 1.562 , 41.777 ) 7.537 ( 1.518 , 41.953 )
Table 3: Posterior means and 95% credible intervals for parameters in the global water model.
Parameters soil only-LEB + global soil Full LEB + global soil
βT0 0.075 ( -1.610 , 1.543 ) 0.176 ( -1.521 , 1.768 )
βT1 1.067 ( 0.234 , 2.057 ) 1.021 ( 0.090 , 2.003 )
1/τT 0.692 ( 0.574 , 0.818 ) 0.487 ( 0.334 , 0.678 )
1/ωH 0.740 ( 0.686 , 0.801 ) 0.741 ( 0.686 , 0.803 )
µH 1.678 ( 1.560 , 1.796 ) 1.675 ( 1.559 , 1.792 )
1/τH 0.296 ( 0.196 , 0.419 ) 0.294 ( 0.191 , 0.417 )
γH 0.088 ( 0.048 , 0.104 ) 0.089 ( 0.036 , 0.104 )
Table 4: Posterior means and 95% credible intervals for parameters in the global soil model.
34
−115 −113 −111 −109
3132
3334
3536
37Topsoil locations
Longitude
Latit
ude
−115 −113 −111 −109
3132
3334
3536
37
Stream sediment locations
LongitudeLa
titud
e
Figure 1: The left-hand panel shows the locations of the observed topsoil As concentrations superimposed
on the counties of AZ. The right-hand panel shows the locations of the observed stream sediment As concen-
trations over the 84 hydrologic units that are contained partially within the state of AZ.
−115 −113 −111 −109
3132
3334
3536
37
Arizona IMPROVE Monitors
Longitude
Latit
ude
*
*
Phoenix
Tucson
Figure 2: A map of the counties of AZ showing the locations of the 14 IMPROVE air monitors.
35
−0.3 −0.1 0.2 0.2 −0.3 −0.6 −0.1 0
−0.2 −0.1 0.3 0.3 0.1 0 0.1 −0.1
−0.1 0.2 0.3 −0.2 0 −0.1 0 0
−0.2 0.2 0.3 −0.2 0.1 −0.1 0.2 0.3
0.2 −0.3 0.1 0 0.1 0.1 0.1 0.2
0.3 −0.6 0 −0.1 −0.1 0.1 0.1 0.1
0.2 −0.1 0.1 0 0.2 0.1 0.1 0.5
0.2 0 −0.1 0 0.3 0.2 0.1 0.5
Urin
e
Per
sona
l Air
Indo
or A
ir
Out
door
Air
Soi
l
Sur
face
Dus
t
Foo
d
Bev
erag
e
Wat
er
Personal Air
Indoor Air
Outdoor Air
Soil
Surface Dust
Food
Beverage
Water
response
cova
riate
Figure 3: The numbers in the table represent the pairwise correlations between the log As concentrations
observed in each medium. To calculate each correlation, we removed the missing values and used half the
MDL when censored. The dark gray shading indicates that a pairwise correlation test has a marginal P-
value less than 0.05; the light gray indicates a marginal P-value between 0.05 and 0.1. A rectangle around
a correlation value indicates for the response (columns), which covariates (rows) are used in Clayton et al.’s
model.
36
OUTDOOR
Outdoor Air
Soil
INDOOR
Water
Indoor Air
Surface Dust
EXPOSURE
Personal Air
Food
Beverage
BIOMARKER
Urine
Figure 4: Graphical representation of Clayton et al.’s (2002) structure linking the various environmental
media and the biomarker. Arrows connecting the different nodes in the model indicate the choice of pathways,
SMj , j = 1, . . . , NM .
37
NHEXAS Water log As concentrations
Den
sity
−2 0 2 4 6 8 10 12
0.0
0.1
0.2
0.3
0.4
0.5
−115 −113 −111 −10931
3233
3435
3637
NHEXAS Water
Longitude
Latit
ude 4
8
8
13
4 04
49
54
44
10
6
11
8
PWS log As concentrations
Den
sity
−2 0 2 4 6 8 10 12
0.0
0.1
0.2
0.3
0.4
0.5
Figure 5: On the left-hand side, histograms of the observed NHEXAS and PWS water log As concentrations.
For the histogram of PWS measurements, the check marks denote the minimum detection limits (MDLs).
The map on the top right-hand side displays the number of NHEXAS Water observations per county of AZ.
38
NHEXAS Soil log As concentrations
Den
sity
0 1 2 3 4 5 6
0.0
0.2
0.4
0.6
0.8
−115 −113 −111 −10931
3233
3435
3637
NHEXAS Soil
Longitude
Latit
ude 2
0
0
6
4 04
0
00
42
10
0
0
0
Topsoil log As concentrations
Den
sity
0 1 2 3 4 5 6
0.0
0.2
0.4
0.6
0.8
Stream sediment log As concs.
Den
sity
0 1 2 3 4 5 6
0.0
0.2
0.4
0.6
0.8
Figure 6: The top panels show histograms and a map of the number of observed NHEXAS Soil log As
concentrations. The lower panels display histograms of theUSGS topsoil and NGS stream sediment log As
concentrations, respectively
39
−10
−6
log
As
conc
.site #1 (COCHISE)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #2 (COCONINO)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #3 (YAVAPAI)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #4 (YAVAPAI)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #5 (COCONINO)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #6 (APACHE)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #7 (PIMA)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #8 (APACHE)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #9 (MARICOPA)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #10 (PINAL)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #11 (PIMA)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #12 (PIMA)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #13 (GILA)
Jan 1988 Jan 1994 Jan 2000
−10
−6
log
As
conc
.
site #14 (GILA)
Jan 1988 Jan 1994 Jan 2000
Figure 7: Time series plots of the log As concentrations for all the observed values at the 14 IMPROVE
air-monitor locations. The vertical gray lines delimit thestudy periods of NHEXAS in the county where the
monitor is located. 40
standardized residuals for Urine
Den
sity
−4 −2 0 2 4
0.00
0.15
0.30
−2 −1 0 1 2
−4
−2
02
4
standard normal quantiles
quan
tiles
of r
esid
uals
0 50 100 150
−4
−2
02
4
Observation number
stan
dard
ized
res
idua
l
0 1 2 3 4 5−
4−
20
24
SillDust
stan
dard
ized
res
idua
l
1 2 3 4 5
−4
−2
02
4
Soil
stan
dard
ized
res
idua
l
0 2 4 6 8
−4
−2
02
4
Food
stan
dard
ized
res
idua
l
−3 −2 −1 0 1 2 3
−4
−2
02
4
Beverage
stan
dard
ized
res
idua
l
−4
−2
02
4
stan
dard
ized
res
idua
l
AP
AC
HE
CO
CH
ISE
CO
CO
NIN
O
GIL
A
GR
AH
AM
GR
EE
NLE
A
MA
RIC
OP
A
MO
HA
VE
NA
VA
JO
PIM
A
PIN
AL
SA
NT
A C
RU
Z
YA
VA
PA
I
YU
MA
Figure 8: Standardized residuals plots for the Urine pathway in the full LEB model (does not include any
pathways for air).
41
SillDust
Soil
Food
Beverage
SillDust
Water
SillDust
Water
Urine
Food
Beverage
−1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00
Figure 9: The circles (black: full LEB model; gray: full LEB model + global water + global soil) denote the
posterior means of the parameters{βMjk } in the LEB model pathways leading from the right-hand side media
(Urine, Food, and Beverage) to the left-hand side media. Thehorizontal lines around each circle denote the
corresponding 95% credible intervals.
42
Urine
Food
Beverage
Soil
SillDust
Water
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75
Figure 10: The circles compare the posterior means of the process variance parameters,1/τMj (black circles:
full LEB model model; gray circles: full LEB model + global water + global soil). The horizontal lines
around each circle denote the corresponding 95% credible intervals.
43
250 300 350
0.0
0.2
0.4
individual # 44
PWS identifier
prob
abili
ty
250 300 350
0.0
0.2
0.4
individual # 73
PWS identifier
prob
abili
ty
430 440 450 460 470
0.00
0.10
0.20
individual # 96
PWS identifier
prob
abili
ty
500 600 700 800 900
0.0
0.2
0.4
0.6
0.8
individual # 112
PWS identifier
prob
abili
ty
Figure 11: For four randomly selected individuals, a comparison of the prior probability that the individual
is served by each PWS (gray plus sign), with the posterior probability (black circles).
44
−1.
0−
0.5
0.0
0.5
1.0
Urin
e an
d S
illD
ust
AP
AC
HE
CO
CH
ISE
CO
CO
NIN
O
GIL
A
GR
AH
AM
GR
EE
NLE
A
MA
RIC
OP
A
MO
HA
VE
NA
VA
JO
PIM
A
PIN
AL
SA
NT
A C
RU
Z
YA
VA
PA
I
YU
MA
−1.
0−
0.5
0.0
0.5
1.0
Urin
e an
d S
oil
AP
AC
HE
CO
CH
ISE
CO
CO
NIN
O
GIL
A
GR
AH
AM
GR
EE
NLE
A
MA
RIC
OP
A
MO
HA
VE
NA
VA
JO
PIM
A
PIN
AL
SA
NT
A C
RU
Z
YA
VA
PA
I
YU
MA
−1.
0−
0.5
0.0
0.5
1.0
Urin
e an
d F
ood
AP
AC
HE
CO
CH
ISE
CO
CO
NIN
O
GIL
A
GR
AH
AM
GR
EE
NLE
A
MA
RIC
OP
A
MO
HA
VE
NA
VA
JO
PIM
A
PIN
AL
SA
NT
A C
RU
Z
YA
VA
PA
I
YU
MA
−1.
0−
0.5
0.0
0.5
1.0
Urin
e an
d B
ever
age
AP
AC
HE
CO
CH
ISE
CO
CO
NIN
O
GIL
A
GR
AH
AM
GR
EE
NLE
A
MA
RIC
OP
A
MO
HA
VE
NA
VA
JO
PIM
A
PIN
AL
SA
NT
A C
RU
Z
YA
VA
PA
I
YU
MA
Figure 12: The circles denote the posterior means of parameters{βMjk } in the Urine pathway, removing the
data from each county in turn. The vertical lines around eachcircle denote the associated 95% credible
intervals.
45
Supplemental Material for “Hierarchical Model Building, Fitting,
and Checking: A Behind-the-Scenes Look at a Bayesian Analysis of
Arsenic Exposure Pathways”
Full Conditional Distributions
This document provides the full conditional distributions for the models introduced in the paper.
Letting θ = (µ,β, τ ,ω), the posterior distribution is
[θ,X|Y ] ∝ [Y |X,θ] [X |θ] [θ]
= [Y |X,ω] [X|β, τ ] [β] [ω] [τ ],
assuming independence of the four components within θ. The distribution is not available in closed
form and so we use a Markov chain Monte Carlo algorithm (hybrid Gibbs sampler with Metropolis
steps within Gibbs) to sample from it.
Before providing the full conditional distributions, we first note that for a generic parameter ψ, the
quantities aψ and bψ denote the hyperprior parameters for gamma prior distributions and mψ and
V ψ are the hyperprior parameters for normal prior distributions. These hyperparameter parameters
were not introduced in the manuscript for the sake of brevity.
1 The local environment to biomarker (LEB) model
1. For each j and i, we draw augmented data, YM∗ji , as follows:
YM∗ji
= YMji , ZMji = 0;
∼ Truncated N(XMji , 1/ω
Mj ) on (−∞,MM
ji ], ZMji = 1;
∼ N(XMji , 1/ω
Mj ), ZMji = 2.
(1)
2. For the process variable XM , we sample each XMj , j = 1 . . . ,NM , conditional on the other
process variables {XMk : k 6= j} and the other parameters in the model. We have that
[
XMj |{XM
k : k 6= j},µM ,βM , τM ,Y M∗,ωM]
∝[
Y ∗j |X
Mj ,ω
M] [
XMj |{XM
k : k 6= j},βMj , τMj
]
∏
l∈χj
[
XMl |{XM
k : k 6= l},βMl , τMl
]
, (2)
1
where χj denotes the columns of the process variable XM that depend on XMj in the speci-
fication of the conditional mean ηMji (i.e., those l that belong to SMj ). This conditional mean
ηMji is defined as:
ηMji = µMj +
NMj∑
k=1
βMjkXMsjki
, (3)
where sMj = (sMj1 , . . . , sMjNM
j
) is a vector of length NMj (the cardinality of SMj ) containing the
columns of the process variable matrix XM that affect the mean of the process variable XMj .
The expression (2) above is equal to
NI∏
i=1
[
YM∗ji |XM
ji , ωMj
] [
XMji |{X
Mki : k 6= j},βMj , τ
Mj
]
∏
l∈χj
[
XMli |{X
Mki : k 6= l},βMl , τ
Ml
]
,
from which we can see that each XMji , i = 1, . . . ,N I , can be sampled independently. Now
[
YM∗ji |XM
ji , ωMj
]
∝ exp
(
ωMj2
(Y ∗Mji −XM
ji )2
)
,
[
XMji |{X
Mki : k 6= j},βMj , τ
Mj
]
∝ exp
(
τMj2
(XMji − ηMji )2
)
,
and for each l ∈ χj,
[
XMli |{X
Mki : k 6= l},βMl
]
∝ exp
(
τMl2
(XMli − δli(j) − ǫli(j))
2
)
. (4)
In (4),
δji(n) ≡ µMj +∑
{k:sMjk6=n}
βMjkXMsMjki,
is the conditional mean of XMji minus the regression terms that depend on XM
ni , and
ǫji(n) ≡∑
{k:sMjk
=n}
βMjkXsMjki, (5)
is the regression terms that depend on XMni . Always, δji(n) − ǫji(n) = ηMji .
Rearranging the terms of XMji , the distribution of XM
ji conditional on all other parameters is
N(q/p, 1/p), where
p = ωMj + τMj +∑
l∈χj
τMl
∑
{k:sMjk
=n}
βMjk
2
,
2
and
q = ωMj YM∗ji + τMj ηMji +
∑
l∈χj
τMl(
XMli − δli(j)
)
∑
{k:sMjk
=n}
βMjk
.
3. For the regression parameters βM , we sample each βMj conditional on the other parameters
in the model. Letting
Aj =
1
{
XMsMjk
1
}
1
{
XMsMjk
2
}
......
1
{
XMsMjkNI
}
,
the full conditional distribution of βMj , is
Nrj
(
R−1j qj ,R
−1j
)
,
where Rj = τMj ATj Aj + Σ
−1j and qj = τMj AT
j XMj + Σ
−1j mj.
4. For each j, the conditional distribution of the process precision τMj , given the other parameters,
is the gamma distribution,
Ga
aτM
j +N I/2, bτM
j +NI∑
i=1
(XMji − ηMji )2/2
,
where ηMji is defined in (3) above.
2 The global to local environment (GLE) model
In this section, we provide the full conditional distributions for the parameters in the GLE models.
Linking global environmental-media variables to local environmental-media variables will change the
full conditional distributions for some of the latent processes and the associated regression coefficients
that were given in the previous section. Specifically, now the full conditional distributions of XMm(W )
depends on XW , and XMm(S) depends on XT and XH . The full conditional distributions for other
media in the LEB model remain unchanged.
3
2.1 Global water model
1. With the addition of the global water model, we need to change some of the full conditional
distributions in the LEB model for water. For the Water medium we replace equation (1)
with:
YM∗i,m(W )
= YMi,m(W ), if ZM
i,m(W ) = 0;
∼ Truncated N(XMi,m(W ), 1/ω
Mm(W )) on (−∞,MM
i,m(W )], if ZMi,m(W ) = 1;
∼ N(XMi,m(W ), 1/ω
Mm(W )), if ZM
i,m(W ) = 2.
To sample from the conditional distribution of the water process, {XMi,m(W ) : i = 1, · · · ,N I},
we let µMi,m(W ) = I(ζWij = 1)µWj . Then XM
i,m(W ) is N(qi/pi, 1/pi), where
pi = ωMm(W ) + τMm(W ) and qi = ωWY ∗i,m(W ) + τMm(W )µ
Mi,m(W ),
for each individual i.
With the addition of the global water model, the full conditional distribution of the NHEXAS
Water model precision, τMm(W ), is now
Ga
N I
2+ a
τMm(W ) ,
1
2
NI∑
i=1
(
XMi,m(W ) − µMi,m(W )
)2+ b
τMm(W )
.
2. The full conditional distributions of the remaining parameters in the global water model de-
pends on the augmented data Y W∗jk , defined as follows:
Y W∗jk
= Y Wjk , if ZWjk = 0;
∼ Truncated N(XWj , 1/ωW ) on (−∞,MW
jk ], if ZWjk ,= 1;
∼ N(XWj , 1/ωW ), if ZWjk = 2.
The full conditional distribution of the PWS measurement-error precision, ωW , is then
Ga
∑NW
j=1 NWj
2+ aω
W
,1
2
NW∑
j=1
NWj∑
k=1
(Y W∗jk −XW
jk )2 + bωW
.
3. For the global water model, the full conditional distribution of XWjk for the kth observation
(k = 1, · · · , NWj ) of jth PWS (j = 1, · · · ,NW ) is N(qjk/pjk, 1/pjk), where
pjk = ωW + τW and qjk = ωWY W∗jk + τWµWj .
4
Next, define XWj =
{
XMi,m(W ) : ζij = 1
}
and lj to be the number of elements in XWj . Then,
for each PWS j = 1, ..., NW , the full conditional distribution of the PWS-specific mean µWj is
N(qj/pj, 1/pj), with
pj = NWj τW +CW + ljτ
Mm(W )
and
qj = τWNW
j∑
k=1
XWjk + CWαW + τMm(W )1lk
′XWj .
The full conditional distribution of the process precision, τW , is
Ga
∑NW
j=1 NWj
2+ aτ
W
,1
2
NW∑
j=1
NWj∑
k=1
(
XWjk − µWj
)2+ bτ
W
.
4. For the parameters in the global water-LEB linking model, recall that ζWij = 1 implies that indi-
vidual i is served by PWS j and ζWij = 0 implies that individual i is not served by water-system
j. For each individual i = 1, . . . , N I , the full conditional distribution of ζWi =(
ζWi1 , · · · , ζWiNW
)
is Multinomial(1, (λ∗c(i),1, · · · , λ∗c(i),NW )) where the probability that individual i in country
c(i) is served by PWS j, is
λ∗c(i),j =λc(i),j n
(
XMi,m(W );µ
Wj , τ
Mm(W )
)
∑NW
r=1 λc(i),r n(
XMi,m(W );µ
Wr , τ
Mm(W )
) .
Here, n(
XMi,m(W );µ
Wj , τ
Mm(W )
)
is a normal pdf (with mean µWj and variance 1/τMm(W )) evaluated
at the value of the water process, XMi,m(W ), for each individual i.
5. In the parameters in the global water prior distributions, αW conditional on all the other
parameters is N(q/p, p), where
p = NWCW + V αW
and q =
CWNW∑
j=1
µWj + V αW
mαW
.
The full conditional distribution of CW is
Ga
NW
2+ aC
W
,1
2
NW∑
j=1
(
µWj −mαW)2
+ bCW
.
5
2.2 Global soil model
1. With the addition of the global topsoil/stream-sediment model, we need to change some of the
conditional distributions in the LEB model for soil. For the soil medium, we replace equation
(1) with:
YM∗i,m(S)
= YMi,m(S), if ZM
i,m(S) = 0;
∼ Truncated N(XMi,m(S), 1/ω
Mm(S)) on (−∞,MM
i,m(S)], if ZMi,m(S) = 1;
∼ N(XMi,m(S), 1/ω
Mm(S)), if ZM
i,m(S) = 2.
2. For parameters appearing in the global topsoil/stream-sediment model, for the topsoil obser-
vation at the sth location:
Y T∗(s)
= Y T (s), if ZT (s) = 0;
∼ N(XT (s), 1/ωT ), if ZT (s) = 2
(there is no censored topsoil data).
The full conditional distribution of the topsoil measurement-error precision, ωT , is
Ga
NT
2+ aω
T
,1
2
NT∑
i=1
(Y T∗(si) −XT (si))2 + bω
T
.
3. For parameters appearing in the stream-sediment data model, for the kth observation in jth
HUC8:
Y H∗jk
= Y Hjk , if ZHjk = 0;
∼ Truncated N(XHj , 1/ω
H ) on (−∞,MHjk ], if ZHjk = 1;
∼ N(XHj , 1/ω
H ), if ZHjk = 2.
The full conditional distribution of the stream-sediment measurement-error precision, ωH , is
Ga
NH∑
j=1
NHj
2+ aω
H
,1
2
NH∑
j=1
NHj∑
k=1
(Y Hjk −XH
j )2 + bωH
.
4. The full conditional distribution of the topsoil process XT (s) is N(qs/ps, 1/ps), where
ps = ωT + τT and qs = τT (βT0 + βT1 XHh(s)) + ωTY T∗(s),
for s ∈ D.
6
The full conditional distribution of the topsoil model precision, τT , is
Ga
(
|D|
2+ aτ
T
,1
2
∑
s∈D
(XT (s) − βT0 − βT1 XHh(s))
2 + bτT
)
.
The parameters that appear in the mean function of the topsoil model are βT = (βT0 , βT1 )′.
Letting U = [1|D|,GDXH ] and W = [1NI , ζTXH ] we sample βT from N2(P−1Q,P−1), where
P = τTU ′U + τMm(S)W′W + V β
T
and
Q = τTU ′XT + τMm(S)W′XM
m(S) + V βT
MβT
.
5. The full conditional distribution of the mean of the stream-sediment model, µH , is N(q/p, 1/p),
where
p = τH1′NH (INH − γHA)1NH + V µH
,
and
q = τH1′NH (INH − γHA)XH + V µH
mµH
.
6. Recall that for the unknown parameter ζTij , ζTij = 1 if NHEXAS individual i is located in HUC8
region j, and 0 otherwise. These ζTij ’s are updated during our MCMC algorithm using the
following scheme:
Sample ζTi =(
ζTi1, · · · , ζTiNH
)
∼ Multinomial(
1, (λ∗c(i),1, · · · , λ∗c(i),NH )′
)
, for i = 1, · · · ,N I ,
where
λ∗c(i),j =λc(i),j n(XM
i,m(S);βT0 + βT1 X
Hj , τ
Mm(S))
∑NH
r=1 λc(i),r n(XMi,m(S);β
T0 + βT1 X
Hr , τ
Mm(S))
,
and n(XMi,m(S);β
T0 + βT1 X
Hj , τ
Mm(S)) is a normal pdf (with mean βT0 + βT1 X
Hj and variance
1/τMm(S)) evaluated at XM
i,m(S).
Now define the |D|×NH matrix GD such that the (i,j)th entry GD(i, j) is 1 if topsoil location
si falls in jth HUC8, and 0 otherwise. Sample XH from N(P−1Q,P−1), where
P = τH(INH − γHA) + τTG′DGD + τMm(S)ζ
T ′
ζT ,
and
Q = τH(INH − γHA)]µH1NH + τTG′D(XT − βT0 1|D|) + τMm(S)ζ
T ′
(XMm(S) − βT0 1NI ).
7
7. The full conditional distribution of the stream-sediment precision, τH , is
Ga
(
NH
2+ aτ
H
,1
2(XH − µH1NH )′(INH − γHA)(XH − µH1NH ) + bτ
H
)
.
The full conditional distribution of the mean of the stream-sediment model, µH , is N(q/p, 1/p),
where
p = τH1′NH (INH − γHA)1NH + V µH
,
and
q = τH1′NH (INH − γHA)XH + V µH
mµH
.
The full conditional distribution of the spatial-dependence parameter in the stream-sediment
model, γH , is proportional to
|INH − γHA|1/2 exp{
−0.5τH(XH − µH1NH )′(INH − γHA)(XH − µH1NH )}
×
1
|λmax − λmin|I(
γH ∈ (λmin, λmax))
,
where λmin, λmax are the smallest and largest eigenvalues of A.
8. With the addition of the global-topsoil/stream-sediment model, the full conditional distribu-
tion of the LEB Soil log As process XMm(S), is modified. Now, this distribution isNNI (P−1Q,P−1),
where
P = ωMm(S)INI + τMm(S)INI ,
and
Q = ωMm(S)YM∗m(S) + τMm(S)(β
T0 1NI + βT1 ζT
′
XH).
In addition, the full conditional distribution of the local soil process precision, τMm(S), is now
Ga
(
N I
2+ a
τMm(W ) ,
1
2(XM
m(S) − WβT )′(XMm(S) − WβT ) + b
τMm(S)
)
.
8