Hierarchical Model Building, Fitting, and Checking: A ... · Hierarchical Model Building, Fitting, and Checking: A Behind-the-Scenes Look at a Bayesian Analysis of Arsenic Exposure

Hierarchical Model Building, Fitting, and Checking: A Behi nd-the-Scenes

Look at a Bayesian Analysis of Arsenic Exposure Pathways

Peter F. Craigmile1, Catherine A. Calder, Hongfei Li, Rajib Paul, and Noel Cressie

Department of Statistics, The Ohio State University, Columbus, OH, USA

Department of Statistics Preprint No. 806

The Ohio State University

3 December 2007

Abstract In this article, we present a behind-the-scenes look at a Bayesian hierarchical analysis of path-

ways of exposure to arsenic, a toxic heavy metal. Our analysis combines individual-level personal exposure

measurements (biomarker and environmental media) with water, soil, and air observations from the ambient

environment. We include details of our model-building exercise that involved a combination of exploratory

data analysis and substantive knowledge in exposure science. Then we present our strategies for model fit-

ting, which involved piecing together components of the hierarchical model in a systematic fashion to assess

issues including parameter identifiability, Bayesian learning, and model fit. We also discuss practical issues

of data management and algorithm debugging. We hope that ourpresentation of these behind-the-scenes

details will be of use to other researchers who build large Bayesian hierarchical models.

Keywords: Arizona, Bayesian learning, data management, environmental health, Markov chain Monte Carlo

(MCMC) algorithm, model validation, National Human Exposure Assessment Survey (NHEXAS)

1contact email: [email protected]

1

1 Introduction

Bayesian hierarchical modeling has been increasingly recognized as a powerful approach for analyzing com-

plex phenomena. This framework has moved from simply the Bayesian analogue of classical multilevel

modeling for nested data, to being recognized as an essential tool for performing modern statistical science.

Bayesian hierarchical models are now commonly used both within and outside the statistics literature and

are widely lauded for their capacity to synthesize data fromdifferent sources, to accommodate complicated

dependence structures, to handle irregular features of data such as missingness and censoring, and to in-

corporate scientifically based process information. However, despite the inherent elegance of the approach,

effective derivation of inferences from these models is by no means a trivial task. The aim of this paper is to

provide a behind-the-scenes look into a Bayesian hierarchical analysis. We show the reader how we carried

out our model building, fitting, and checking, which were complicated by issues related to the quality and

quantity of available data. Strategies will be presented for compartmentalizing model components and inputs

in order to allow efficient assessment of the impact of modeling assumptions on inferences. While the reader

may not agree with all of our modeling decisions, and there certainly are aspects of our analysis that warrant

further study, our goal is to start a dialog on practical hierarchical-modeling strategies that allow defensible

statistical learning from complex Bayesian analyses.

This article is organized as follows. In Section2, we introduce the environmental-exposure application

that underlies our discussion of Bayesian hierarchical modeling. Model building through a combination

of exploratory data analysis and prior knowledge in exposure science is the focus on Section3. Then, in

Section4, we discuss our model-fitting strategies, including data-management and algorithm (Markov chain

Monte Carlo, MCMC)-checking (debugging) procedures. We take the reader through the fitting of various

components of our large hierarchical model in order to examine issues such as parameter identifiability,

Bayesian learning, and model fit. We conclude with a general discussion of Bayesian hierarchical modeling

2

in Section5.

2 Environmental exposure application

2.1 Overview

Our motivating application is a Bayesian hierarchical analysis of human exposure pathways. Gen-

erally, a pathways analysis refers to the study of the relationship between levels of toxicants in en-

vironmental media (e.g., air, water, dust, food) and levelsof personal exposure. In order to pro-

vide data to study pathways of exposure to a variety of toxic substances, the National Human Ex-

posure Assessment Survey (NHEXAS) pilot study was carried out in the 1990s by the Office of Re-

search and Development (ORD) of the U.S. Environmental Protection Agency (USEPA), along with

the U.S. Centers for Disease Control and Prevention (CDC) and the U.S. Food and Drug Admin-

istration (FDA) (NERL and National Center for Environmental Assessment 2000). This pilot study was

conducted in three areas: Arizona, EPA Region 5 (six midwestern states), and Maryland. In this

paper, we limit our pathways analysis to the Arizona (AZ) NHEXAS data consisting of environ-

mental media measurements and biomarker measurements (i.e., indicators of personal exposure) for

178 individuals in 15 counties of AZ (Robertson, Lebowitz, O’Rourke, Gordon, and Moschandreas 1999;

O’Rourke, Van de Water, Jin, Rogan, Weiss, Gordon, Moschandreas, and Lebowitz 1999). AZ was chosen

because of the relatively easy access to water, soil, and airobservations from the ambient environment. Us-

ing these data, we focus on modeling pathways of exposure to arsenic (As), a toxic heavy metal. Acute

exposure to As is associated with irritation of the gastrointestinal and respiratory tracts, while chronic expo-

sure has been shown to be related to melanosis, hyperpigmentation, depigmentation, hyperkeratosis, and skin

cancer, and may also affect the nervous, cardiovascular, and haematopoietic systems (WHO 1981). Further

details on the NHEXAS data used in our analysis can be found Section 2.2.

3

Hierarchical modeling is natural in pathways analyses due to the direct and indirect relationships between

environmental media and personal exposure. For example, the concentration of a toxicant in water may be

related to personal exposure directly through ingestion (drinking) and dermal (bathing) pathways. How-

ever, part of the ingestion pathway may also be indirect if water is used for cooking purposes. Rec-

ognizing the complexity of these relationships,Clayton, Pellizzari, and Quackenboss (2002)used a struc-

tural equations modeling approach to explore pathways of exposure to As and lead (Pb) using NHEXAS

Region 5 data. Also using the As NHEXAS Region 5 data,McMillan, Morara, and Young (2006)and

Cressie, Buxton, Calder, Craigmile, Dong, McMillan, Morara, Santner, Wang, Young, and Zhang (2007)de-

veloped Bayesian hierarchical pathways models. UnlikeClayton et al. (2002)’s approach, these models are

able to accommodate measurement error in the environmentalmedia and biomarker measurements, as well

as missing and censored (below a specified minimum detectionlevel (MDL)) observations, in a coherent

manner. We rely heavily on these models in the development ofour Bayesian hierarchical pathways model

described in Section3.

Recognizing that one of the primary aims of NHEXAS was to identify geographical variation in expo-

sure to hazardous chemicals (Pellizzari et al. 1995), Cressie et al. (2007)proposed supplementing the spa-

tially sparse NHEXAS Region 5 data with information on background levels of As in global environmen-

tal media. Specifically, background concentrations of As insoil, which additionally included both top-

soil and stream sediment concentration measurements, wereincluded in the pathways analysis. This ap-

proach to synthesizing topsoil and stream sediment geochemical measurements was subsequently refined by

Calder, Craigmile, and Zhang (2007). In addition to incorporating background levels of As in soil across

AZ, in this paper we also incorporate background levels of Asin water and air into our pathways analysis.

Rather than using this supplemental, orglobal environmental information to build informative priors forthe

county-level mean for eachlocal environmental medium (as was done byCressie et al., 2007), we use mix-

ture models that account for the uncertainty due to the spatial misalignment of the supplemental data and the

4

NHEXAS individuals.

In the remainder of this section, we provide overviews of thevarious data sources that are used in our

Bayesian hierarchical pathways model.

2.2 NHEXAS data

The NHEXAS AZ data were collected using a three-stage, population-based sampling design. In Stage I,

1200 households were approached to complete descriptive questionnaires. Of the 954 that responded, a

subset of 505 completed baseline questionnaires and had soil and dust analyses collected (Stage II). Our

analysis is based on the final Stage III sample, which consisted of a subset of 179 individuals for whom

data were collected over a seven-day period. Table2 lists the number of NHEXAS individuals per county in

AZ, along with the population of each county (in thousands) based on the 1990 Census. As expected, there

are more sampled individuals in counties with larger populations. Further details on the the NHEXAS AZ

sampling design can be found inRobertson et al. (1999).

The NHEXAS AZ data were obtained from the Human Exposure Database System (HEDS) at the EPA2.

We extracted from the database the As measurements in the environmental media, as well as in the Urine

biomarker, for 178 individuals (one individual could not bematched to a household). All As levels in our

analysis are concentrations in units ofµg/l; the natural log concentration measurements and MDL values

are modeled in our analysis. The variables of interest are summarized in Table1. For each variable, we

include the extent of missingness and left-censoring due tovalues being below the MDL. From this table, it

is apparent that most media considered by NHEXAS, except Water, have missing values and values below

the MDL. Personal Air and Soil have the greatest number of missing values, while Indoor Air, Outdoor Air,

and Urine have the most observations below the MDL. The implications of the amount of missingness and

2http://oaspub.epa.gov/heds/study dir frame?st id=23159

5

http://oaspub.epa.gov/heds/study_dir_frame?st_id=23159

censoring for each medium will be discussed in Section4.

The data documentation from NHEXAS AZ3 lists the relative standard deviation (RSD), associated with

each medium. The RSD is the absolute value of the coefficient of variation and is often expressed as a

percentage. FollowingSantner, Craigmile, Calder, and Paul (2007, Appendix), we converted these values

into measurement-error precisions (provided in the rightmost column of Table1) for use in our Bayesian

hierarchical pathways model.

2.3 Global water data

Background levels of the As concentration in water were obtained from the Water Quality Division of the

Arizona Department of Environmental Quality. This division manages millions of records of water-quality

data collected from surface and ground-water monitoring programs statewide, as well as from drinking-

water systems. We were directly provided with both As concentration measurements taken from treated

water samples in public-water systems (PWS) and with AZ community water-use information.

The treated water samples were collected at water-treatment facilities from 1993 to 2006. In our analysis, we

only consider the subset of these observations for which ground or surface water is the primary source, the

water-treatment facility is a community public water system (PWS) that serves 15 or more service connec-

tions used by residents year-round or serves 50 or more residents year-round, and the water system was active

at the time when the sample was collected. This subset of the data consisted of 10,688 As concentration mea-

surements (in units ofmg/l) from 1,161 PWSs. Of these measurements, 3,732 are below themeasurement

detection limits (MDLs). We model the PWS As concentrationson the natural log scale.

Due to security concerns, we were not provided with the locations of the PWSs. Instead, we were only given

a community-water-use dataset that lists the number of individuals in each of the 15 AZ counties served by

3http://oaspub.epa.gov/eims/xmlreport.display?deid=22956&z chk=3399

6

http://oaspub.epa.gov/eims/xmlreport.display?deid=22956&z_chk=3399

each PWS. We note that each PWS may serve more than one county and each county can be served by more

than one PWS. The smallest number of people served by a PWS is 51 in Coconino Navajo County, the largest

number of people served by a single PWS is 1,200,000 in Maricopa county, and the median number of people

served by a PWS is 300.

2.4 Global soil data

Supplemental information about the levels of As in soils across AZ consists of point-referenced topsoil

measurements from the U.S. Geological Survey’s (USGS’s) USSoils database4 and stream sediment mea-

surements from the USGS’s National Geochemical Survey (NGS) database5. The topsoil data consists of

the concentration of As in the surface layer (the A horizon corresponding to a depth of around 20cm) from

samples of soil, sand, silt, and alluvial deposits collected in undisturbed regions. Each of the topsoil measure-

ments are associated with latitude/longitude coordinates(see left-hand panel of Figure1). While the stream

sediment data are also point referenced, we followCalder et al. (2007)and associate them with watersheds,

or hydrologically similar regions (Seaber, Kapinos, and Knapp 1987). Each of these hydrologic units, or wa-

tersheds, has a unique eight-digit code, termed the HUC8 code. Our analysis includes all stream sediment

data that were taken in the 84 watersheds that contain at least part of the state of AZ (right-hand panel of

Figure1). The units of both the topsoil and stream sediment data areµg/m3, or parts per million (ppm), and

both types of data are modeled on the natural log scale.

4http://tin.er.usgs.gov/ussoils/5http://tin.er.usgs.gov/geochem/

7

http://tin.er.usgs.gov/ussoils/

http://tin.er.usgs.gov/geochem/

2.5 Global air data

Data providing information about the background levels of As in the ambient air across AZ were obtained

from the Interagency Monitoring of Protected Visual Environments (IMPROVE) Network6. This network

was established in 1985 and was designed to provide long-term air-quality records for U.S. national parks and

wilderness areas. We extracted all available As concentration readings (in units ofµg/m3 local conditions7

from the 14 AZ IMPROVE monitoring stations where As levels were monitored during the period March

1988 to August 2003. Each of these monitors is associated with latitude/longitude coordinates (see Figure2)

and, despite IMPROVE’s emphasis on non-urban areas, we notethat several monitoring stations are located

in or near major cities (e.g., Phoenix and Tucson). Readingsare collected on an every 3-4 day schedule,

and certain observations are flagged as being below a minimumdetection limit (MDL), the value of which is

provided. The data are modeled on the natural log scale.

3 Exploratory data analysis and model building

Bayesian hierarchical models have been around for over thirty years. An important starting point was

Lindley and Smith (1972), in which a Bayesian model was defined via a set of hierarchical linear regres-

sion equations. More recently, a standard structure is defined based on the data model, process model, and

parameter model decomposition (e.g.,Berliner, 2003). In our analysis of As pathways in AZ, we consider a

data model and a process model that are each made up of many parts. Rather than fitting the complete model

consisting of all the parts right from the outset, we incorporate the different parts piece-by-piece, validating

and assessing the model fit as we go along. We shall demonstrate that, in terms of the hierarchy defined

6http://vista.cira.colostate.edu/improve/7Temperature and pressure at the time of the reading as opposed to standard conditions of 25 degrees C and 760 mm Hg that are

sometimes used.

8

http://vista.cira.colostate.edu/improve/

above, model building should be a combination of exploratory data analysis (EDA) and scientific knowledge.

3.1 The local environment to biomarker (LEB) model

In our study of pathways of exposure to As, the backbone of ourstatistical analysis is a Bayesian hierarchi-

cal model linking local environmental As levels to individual-specific levels of As in Urine, the biomarker

of As exposure. This local environment to biomarker (LEB) model is introduced and discussed in de-

tail in McMillan et al. (2006)and Cressie et al. (2007), so we do not include a lengthy development of it

here. In this subsection, we provide a brief description of the LEB model, for the purpose of completeness.

However in Section3.2, we focus on the global to local environment (GLE) model extensions inspired by

Cressie et al. (2007).

LEB data model The LEB model is driven by the individual-specific local environment and biomarker

measurements from NHEXAS; we generically refer to these measurements as the observations from theNM

different media. Not all of theN I individuals who participated in NHEXAS have an observationassociated

with each medium; certain observations were below a minimumdetection limit (MDL) and others are simply

unobserved (i.e., are “missing”). For each mediumj = 1, . . . ,NM , let Y Mj = {Y M

ji : i = 1, . . . ,N I}

denote the set of log As observations on that medium for each of theN I individuals. LetZMj = {ZM

ji : i =

1, . . . , N I} denote the observation status, so that for individuali, ZMji = 0 if the jth observed value,Y M

ji ,

is below the MDL value ofMji, ZMji = 1 if the value is above the MDL, andZM

ji = 2 if the observation

is missing. LetMMj = {MM

ji : i = 1, . . . ,N I} be the set of MDLs for mediumj. Defining XMj =

{XMji : i = 1, . . . , N I} to be the collection of the latent As concentrations in medium j, and lettingY M =

{Y M1 , . . . ,Y M

NM }, XM = {XX1 , . . . ,XM

NM}, ZM = {ZM1 , . . . ,ZM

NM}, MM = {MM1 , . . . ,MM

NM}, and

ωM = {ωMj : j = 1, . . . , NM}, we assume that the elements of our complete set of data are conditionally

9

independent given{XMij } and media-specific precision parameters,{ωM

j }. That is, we assume

[

Y M |XM ,ZM ,MM ,ωM]

=

NM∏

j=1

[

Y Mj |XM

j ,ZMj ,MM

j ,ωMj

]

=d

NM∏

j=1

INCM(XMj , 1/ωM

j ;ZMj ,MM

j ).

where the INCM (independent normal with possible censoringand missingness) distribution is defined in

SectionA of the Appendices.

LEB Process Model We then link the elements ofXM by choosing subsets of theNM media,{SMj }, that

are defined by an acyclic directed graph (e.g.,Lauritzen, 1996). These subsets define the joint distribution of

XM as follows:

[

XM |βM ,µM , τM]

=

NI∏

i=1

NM∏

j=1

[

XMij |{XM

ik }k∈SMj

, µMj , τM

j

]

(1)

=d

NI∏

i=1

NM∏

j=1

N

µM

j +∑

k∈SMj

βMjk XM

ik , 1/τMj

. (2)

In our case, the acyclic directed graph was defined by considering partitionsMk of the set ofNM media and

defining selector sets{SMj } that satisfy

SMj = ∅ , if j ∈ M1;

SMj ⊆

ℓ−1⋃

k=1

Mk , if j ∈ Mℓ, whenℓ > 1 .

In the model (2), we assume for each media that the baseline mean is constantacross the state (aregional

mean model). The choice of pathways{SMj : j = 1, . . . ,NM} that we use come from the baseline model

of Clayton et al. (2002), which was used to model As concentrations in EPA Region 5. A graphical view of

this structure is shown in Figure4.

To assess whether or not Clayton et al.’s structure is reasonable for the log As concentrations observed in

the NHEXAS AZ study, we looked at a number of graphical summaries of the data. In thisexploratory data

10

analysis (EDA), we replaced any values below the MDL, by halfthe MDL value and ignored any missing

values. One summary that gives some credence to Clayton et al.’s structure is shown in the dependence table

in Figure3. The numbers in the table are the sample correlations between pairs of log As concentrations in

each medium. Marginally, ignoring any multiplicity considerations, we conducted a test to see whether or

not the population correlation coefficient for each pair of media is zero – dark gray shading denotes a P-value

less than 0.05, light gray shading denotes a P-value between0.05 and 0.1, and no shading denotes a P-value

greater than 0.1. Rectangles around the correlation valuesindicate those covariates that appear in Clayton et

al.’s model for each response. For example, Beverage, Food,Surface Dust, Indoor Air, and Personal Air all

appear in Clayton et al.’s regression model for Urine. Thereis good agreement with Clayton et al.’s model

and the pattern of significant pairwise correlations between the response variable and each covariate, except

for media where we have a large number of missing values or values below the MDL (e.g., Soil, Personal

Air, Indoor Air, and Outdoor Air).

LEB model prior distributions We assume independence between the prior parameters and take the

media-specific mean terms{µMj } and slope parameters that link the different media in the LEBmodel,

{βMjk }, to have normal prior distributions. The media-specific process precisions{τM

j } are given gamma

priors. As stated previously, the measurement-error precisions,{ωMj }, are fixed using the method described

in Santner et al. (2007, Appendix).

3.2 The global to local environment (GLE) models

We now consider alternative versions of the model for{XMij } (Equations1 and2) for mediumj that draw on

supplemental data sources and knowledge about the background spatial processes driving the variation in As

levels in that medium. In Section4, we consider the impacts on inferences for As exposure pathways when

these global to local environment (GLE) models are used.

11

3.2.1 Global water model

We start with an exploratory data analysis of the global water data provided to us. The left-hand panels of

Figure5 show the histograms of log As concentrations of NHEXAS and PWS water measurements, both

given on thelog µg/l scale. For the PWS data, the observations below the MDL are denoted by check marks

along the x-axis of the histograms. Both of the distributions are slightly right-skewed. They are remarkably

similar, except for the fact that the PWS concentrations have a slightly longer right tail. This gives some

credence to the linking model that we specify below.

Global water data model Not all of the NW public water systems (PWS) in AZ have an observation

available, while some water systems have multiple observations. As is the case with the NHEXAS data,

certain observations were below an MDL and others are missing. For each PWSj = 1, . . . ,NW , let Y Wj =

{Y Wjk : k = 1, . . . , NW

j } denote the set ofNWj observed log As concentrations. Accordingly, we let

ZWj = {ZW

jk : k = 1, . . . , NWj } denote the set of observation-status variables, withZjk = 0 if the kth

measurement at PWSi is below the MDL value ofMWjk , Zjk = 1 if the value is above the MDL, orZjk = 2

if the observation is missing. LetMWj = {MW

jk : k = 1, . . . ,NWk } be the set of MDLs for PWSj. Taking

XWj = {XW

jk : k = 1, . . . , NW} to be the collection of the latent log As concentrations in PWS j, and

letting Y W = {Y W1 , . . . ,Y W

NW }, XW = {XX1 , . . . ,XW

NW }, ZW = {ZW1 , . . . ,ZW

NW }, andMW =

{MW1 , . . . ,MW

NW }, we assume that

[

Y W |XW ,ZW ,MW , ωW]

=

NW∏

j=1

[

Y Wj |XW

j ,ZWj ,MW

j , ωW]

=d

NW∏

j=1

INCM(XWj , 1/ωW ;ZW

j ,MWj ),

whereωW is the measurement-error precision.

12

Global water process model We take the global water process defined at each PWS to be conditionally

independent given PWS-specific unknown mean parametersµW = {µWj : j = 1, . . . ,NW } and an unknown

precision parameterτW . That is, we assume that

[

XW |µW , τW]

=

NW∏

j=1

NWj

∏

k=1

[

XWjk |µ

Wj , τW

]

=d

NW∏

j=1

NWj

∏

k=1

N(

µWj , 1/τW

)

.

Global water to LEB model We link the global water process to the Water process in the LEB model.

We definem(W ) to be the index corresponding to the Water medium. LetXMm(W ) be the collection ofN I

individual log As concentrations in Water andXMi,m(W ) be individuali’s latent Water log As level. Then we

assume that

[

XMm(W ), ζ

W |µW , τMm(W )

]

=

NI∏

j=1

[

XMi,m(W ), ζ

Wi |µW , τM

m(W )

]

. (3)

That is, we model the joint distribution of Water’s log As concentration and an unknown collection of

parametersζW that indicate whether a PWS serves individuals (i.e.,ζWij = 1 implies that individuali

is served by PWSj and ζWij = 0 implies that individuali is not served by water systemj). Letting

ζWi = (ζW

i1 , . . . , ζWiNW )′, we decompose the factors in the right-hand side of (3) by assuming that

[

XMi,m(W ), ζ

Wi |µW , τM

m(W )

]

=[

XMi,m(W )|ζ

Wi ,µW , τM

m(W )

]

[ζWi ]

=

NW∏

j=1

([

XMi,m(W )|µ

Wj , τM

m(W )

])ζWij

× Mult(

1, (λWc(i),1, . . . , λ

Wc(i),NW )

)

,

where Mult(n,p) denotes the multinomial distribution with parametersn andp. For eachi = 1, . . . ,N I , we

takeλWc(i) =

(

λWc(i),1, . . . , λ

Wc(i),NW

)

, whereλWc(i),j is equal to the proportion of the population of countyc(i)

served by PWSj. Finally, we assume that

[

XMi,m(W )|µ

Wj , τM

m(W )

]

=d N(

µWj , 1/τM

m(W )

)

.

13

Global water prior distributions Across the PWSs we assume that

[

µW |αW , CW]

=NW∏

j=1

[

µWj |αW , CW

]

=d

NW∏

j=1

N(αW , 1/CW ),

whereαW has a normal prior distribution andCW has an gamma prior distribution. The priors forωW and

τW are also gamma distributions. We assume all global water prior distributions are mutually independent.

3.2.2 Global soil model

Looking towards modeling these concentrations jointly with the NHEXAS soil data, Figure6 shows a sum-

mary of the NHEXAS, topsoil, and stream sediment log As concentrations. The histograms in the figure

show that the three sets of concentration measurements havesimilar distributions, each having a slight right-

skewness (the median NHEXAS log concentration value is 1.89log µg/m3, versus the median topsoil log

concentration value of 1.63log µg/m3 and the median log stream sediment concentration of 1.83log µg/m3).

Given the aridness of AZ, it is not surprising that the distributions are so similar. The top right-hand figure

shows the number ofobserved (i.e., not missing) NHEXAS soil measurements for each county of AZ, high-

lighting that the information content is sparse in many counties. Compared with the locations of the topsoil

measurements (left-hand panel of Figure1), we can see that the topsoil measurements are mostly uniformly

spread around the region. On the other hand, the map of the HUC8 regions with the stream sediment mea-

surements denoted by crosses (right-hand panel of Figure1) shows that the stream sediment measurements

appear in clumps across the study region.

The data and process models shown below for the log As concentrations in topsoil and stream sediment are

taken fromCalder et al. (2007). We complete our model by linking these media to the NHEXAS soil-related

media processes in the LEB model using a mixture model (not part of Calder et al., 2007). This mixture

model captures the uncertainty in locations of the NHEXAS individuals across AZ and the uncertainty due to

our not having precise information for the NHEXAS individuals (NHEXAS only gives county of residence).

14

Our approach accommodates the spatially mismatched sampling schemes of NHEXAS, USSoils, and NGS.

Global soil data model The global soil data model draws on both As concentration measurements in

topsoil and stream sediment. We letY T (si) denote the log As concentration observation at locationsi and

defineY T = {Y T (si) : i = 1, . . . , NT } to be the collection of theNT topsoil observations. We assume that

[

Y T |XT (·), ωT]

=

NT∏

i=1

[

Y T (si)|XT (si), ω

T]

=d

NT∏

i=1

N(

XT (si), 1/ωT)

,

whereXT (·) is a continuously indexed spatial process defined on the domain D ⊂ R2 representing AZ, and

ωT is the measurement-error precision.

For the stream sediment observations, we letY Hjk be thekth log As concentration measurement taken in the

jth HUC8 region (watershed) and denote the collection of stream sediment observationsY H = {Y Hjk : k =

1, . . . , NHj , j = 1, . . . , NH}, whereNH

j is the number of observations in thejth HUC8 region andNH is

the number of HUC8s. We assume that

[

Y H |XH , ωH]

=

NH∏

j=1

NHk

∏

k=1

[

Y Hjk |X

Hj , ωH

]

=d

NH∏

j=1

NHk

∏

k=1

N(

XHj , 1/ωH

)

,

whereXH is a latent process defined on theNH watersheds (a random vector of lengthNH) andωH is the

measurement-error precision.

Global soil process model We assume that, conditional on the stream sediment processXH , the topsoil

process{XT (s)} is independent across locationss ∈ D. At each location, we assume that the mean of the

latent log topsoil process is linearly related to the streamsediment process, via

[

XT (s)|βT0 , βT

1 , τT]

=d N(

βT0 + βT

1 XHh(s), 1/τ

T)

. (4)

Hereh(s) denotes the HUC8 region that contains locations. The spatial dependence in the soil model is

introduced through a Gaussian prior on the latent stream sediment process, defined over watersheds. We use

15

a conditional autoregressive (CAR) model:

[

XH |µH , τH , γH]

=d NNH

(

µH1,

[

τH(I − γHA)]−1

)

,

where Nl denotes al-dimensional normal distribution,A is a spatial neighborhood matrix,γH is an unknown

spatial-dependence parameter, andτH is the process precision parameter.

Global soil to LEB model LettingXMm(S) = {XM

i,m(S) : i = 1, . . . ,N I} be the collection ofN I individual

latent log As concentrations in soil, we assume that

[

XMm(S), ζ

T |XT (·), τMm(S)

]

=NI∏

j=1

[

XMi,m(S), ζ

Ti |X

T (·), τMm(S)

]

. (5)

The difference between this linking model and that for wateris that the unknown collection of parameters,

ζT , captures the uncertainty in the spatial location of NHEXASindividuals. From (4), we see that the spatial

distribution of XT (s) only depends on the spatial location throughh(s), the HUC8 region containings.

Thus, rather than assigning NHEXAS individuals to watersheds, we define an unknown parameterζTij such

that ζTij = 1 if NHEXAS individual i is located in HUC8 regionj, and zero otherwise. Then, for each

i = 1, . . . , N I , we defineλTc(i) =

(

λTc(i),1, . . . , λ

Tc(i),NH

)

, whereλTc(i),j is equal to the proportion of the area

of the countyc(i) that contains HUC8 regionj (which could be zero if a HUC8 region is not contained in

countyc(i)). Assuming NHEXAS individuals are uniformly distributed across a county, we assume that

[

XMi,m(S), ζ

Ti |X

T (·), τMm(S)

]

=NH∏

j=1

([

XMi,m(S)|X

T (well-defined location in HUC8 regionj), τMm(S)

])ζTij

×

Mult(

1, (λTc(i),1, . . . , λ

Tc(i),NH )

)

=d

NH∏

j=1

(

N(

XT (well-defined location in HUC8 regionj), 1/τMm(S)

))ζTij

×

Mult(

1, (λTc(i),1, . . . , λ

Tc(i),NH )

)

.

16

Global soil prior distributions We assume independent gamma priors on the topsoil precision, ωT , and

the stream sediment precision,ωH . In the analysis (see Section4 for more details), we fixωT . The priors

for the intercept and slope in the model relating topsoil andstream sediments,βT0 andβT

1 , are independent

normal distributions. The stream sediment mean,µH , is also given a normal prior, and we assume gamma

priors for the process precisions,τT andτH . We assume a uniform distribution on the spatial-dependence

parameter,γH , in the stream sediment model. The bounds on the uniform distribution are determined by the

reciprocal of the smallest eigenvalue (lower bound) and thereciprocal of the largest eigenvalue (upper bound)

of the spatial proximity matrixA, ensuring thatI − γHA is positive-definite and hence that the distribution

of XH is proper (e.g.,Banerjee, Carlin, and Gelfand, 2004, p.80).

3.2.3 Global air model

Figure 7 show time series plots of the observed IMPROVE air monitor readings of the ambient log As

concentrations across AZ, at the fourteen locations displayed in Figure2. The vertical gray lines indicate the

time period during which the NHEXAS study was conducted for the county containing the monitor. As can

be seen, there is a clear lack of information about the As level during the NHEXAS time period for 10 out of

the 14 monitors. Due to the censoring below the MDL and the trending that is obvious in some of the series

(e.g., sites #1, #5, and #8), we are hesitant to fit any statistical model to these data.

Given these data-coverage problems for air, we decided onlyto provide an informative prior for the intercept

in the equation for the Outdoor Air log As concentrations process defined by (2). Let j(A) be the index

corresponding to Outdoor Air in the NHEXAS model (2). We assume that

µj(A) ∼ N(αA, 1/CA),

where in this caseαA andCA are fixed. From the information contained in Figure2, we setαA = −6 on

the log ppm scale, andCA = 1/4, so that the standard deviation of the intercept parameter is 2 on the log

17

ppm scale. This prior has a larger standard deviation than issuggested by the data, to capture the change

of support of moving from the IMPROVE monitors to the monitors used outside the home in the NHEXAS

study.

4 Model fitting and checking

In this section, we discuss the procedure we went through to fit the Bayesian hierarchical model. Rather

than just fitting one large model that links the global environmental soil, water, and air models to the LEB

model, we put these pieces together in a stepwise fashion, adding in each model component one-at-a-time.

Throughout this stepwise procedure, we validated whether or not our models fitted the data well by comparing

the posterior distributions, with and without each model component, to the observed data.

Before we consider our stepwise model fitting, we discuss some of the decisions made in implementing and

fitting the Bayesian hierarchical model using the numerous datasets involved in the analysis. These details

are often unreported in many Bayesian analyses but are worthwhile setting out for readers unfamiliar with

applied Bayesian statistics.

4.1 Data management

In our study, datasets were stored in a hierarchy of directories by data type. As a security feature, access

to theoriginal datasets were limited. Instead, modelers had access to thederived data. The scripts used to

create these derived dataset are stored along with the derived datasets. Data management and processing was

carried out using ArcGIS (for spatial data management), SAS, and R (R Development Core Team 2007). The

maptoolsR library 8 was also used for further spatial-data management. In termsof data format, we found

that comma delimited (CSV) text files, although large in size, tended to be the easiest to use for exploratory

8http://cran.r-project.org/src/contrib/Descriptions/maptools.html

18

http://cran.r-project.org/src/contrib/Descriptions/maptools.html

data analysis and modeling, especially when using many different software packages.

4.2 Implementing the Markov chain Monte Carlo algorithm

In this analysis, the posterior distributions for the modelparameters are not available in closed form. Instead,

we sample from the posterior density using a Gibbs sampling algorithm that sequentially samples from

the full conditional distributions of the model parameters(Gelfand and Smith 1990). Since some of these

full conditional distributions are also not available in closed form, we employ independence or symmetric-

random-walk Metropolis steps (Tierney 1994). Although there are other versions of the Metropolis-Hasting

algorithm that may have better convergence properties, these methods are easy to implement and straightfor-

ward to tune to achieve appropriate mixing. In practice, thechoice of more complicated candidate distribu-

tions and algorithms always requires careful implementation and tuning.

To implement the MCMC algorithm in practice, one needs to check:

1. that the full conditional distributions are analytically correct, as is the computer code to sample from

them;

2. whether the algorithm converges and mixes well (i.e., whether the full parameter space is being ex-

plored);

3. the sensitivity of the inferences to the choice of prior distributions;

4. the sensitivity of the inferences to aspects of the data.

All of these issues are important in fitting our Bayesian hierarchical pathways model, due to the large num-

ber of pathways (processes) and different data sources. Each dataset can have issues with, for example,

missingness and censoring (MDLs). The data have different (initially incoherent) spatial scales. For some

19

data types, there can be a lack of spatial coverage or resolution. It is still very easy to make mistakes in

implementation of the model-fitting algorithm in computer code, even using dedicated software such as Win-

BUGS/OpenBUGS (Spiegelhalter, Thomas, Best, and Lunn 2004).

Our strategy starts far away from the computer. A model document was created before the MCMC algorithm

was implemented in computer code. The document defines all the notation and the full conditional distribu-

tions, and multiple people (i.e., more than one!) checked the document for errors. For an abbreviated version

of this document that contains all full conditional distributions, see the Supplemental Material document. At

this point, we decided what parameters to augment in the MCMCalgorithm (e.g., adding parameters to make

parts of the model conditionally conjugate, or choosing to sample missing data). We tried to keep multi-

ple versions of the documents and computer code, so that the user could go back to a previous version if a

problem was found with the current version. (We experimented with the software packageSubversion 9,

which can be used to keep track of multiple versions of files, but we did not use it much by the end of the

project. It was easier to make copies of files ourselves.)

Before we implemented the actual MCMC algorithm, we first wrote utility computer code that was needed

(e.g., a C function dynamically called from R for sampling truncated normals). To reduce the chance of

wrongly implemented utility functions corrupting the MCMCalgorithm, this utility code was checked in-

dependently and extensively before being used in practice.Finally, before implementing and running the

MCMC algorithm, the routines to read in the data were coded and checked.

We coded up our MCMC algorithm in the R statistical software language; the interpreted language R does

not run the algorithm as fast as compiled programming languages such as C, C++, or Fortran, but it was fast

enough, and it allowed us the flexibility to use existing functions and complex data types.

In coding up the steps of the MCMC algorithm, we successivelyadded parameters into the model. Param-

9http://subversion.tigris.org/

20

http://subversion.tigris.org/

eters that were unsampled in the current stage of the MCMC algorithm, were fixed temporarily, and were

not updated during the simulation. We checked that the output was correct with simulated data, before im-

plementing the MCMC algorithm on actual data. These same rules applied when we made changes to the

model. We compared the output from the previous and current versions of the computer code. Whenever

possible, we also implemented the MCMC algorithm using OpenBUGS, via the R libraryBRugs 10. This

provided another check for the R implementation.

There were a number of strategies that helped us in implementing the MCMC algorithm. Depending on the

robustness required for sampling, we used a Cholesky decomposition or eigenvalue decomposition to sample

the normally distributed slope parameters of our regression models. (The eigenvalue decomposition is more

robust, but much slower.) We never try to invert matrices; instead we solve linear equations, because inner

products are more efficient than matrix calculations. For censored data, we sampled any values that were

below the MDL using a truncated normal distribution. This means that we can omit a Metropolis-Hastings

step to update the parameters relating the data,Y , and the process,X. Since the standard inverse-CDF

method is very inefficient for sampling truncated normal random variables, especially for MDLs far in the

lower tail of the normal distribution, we used a rejection-sampling algorithm proposed byGeweke (1991).

This algorithm uses either normal rejection sampling or exponential rejection sampling based on the value of

the MDL. We also sampled missing data within our MCMC algorithm. Although this is not necessary, this

strategy simplified the computer code and cut down on errors in implementation.

While MCMC computer code needs to be efficient to work quicklyin practice, it also needs to be straight-

forward to read and debug, especially if one needs to amend the code in the future. At odds with having

easy-to-read (commented) code, we found that making the computer code overly general can lead to slow

execution times. Compromises needed to be made throughout the implementation.

10http://cran.r-project.org/src/contrib/Descriptions/BRugs.html

21

http://cran.r-project.org/src/contrib/Descriptions/BRugs.html

4.3 Stepwise fitting of the model

4.3.1 Full LEB model

We began by fitting the full Local Environment to Biomarker (LEB) model containing all of the pathways

shown in Figure4. As is traditionally done in any MCMC algorithm, we ran multiple chains from multiple

starting locations. After a burn-in of 5,0000 iterations, we ran each chain for a further 100,000 iterations.

Although not entirely necessary, we reduced the autocorrelation in the chain by thinning and storing every

10th iteration. Hence, our inferences are based on 10,000 samples. Trace plots and autocorrelation plots

were used to diagnose convergence of the chains. In this case, the mean parameters{µMj }, slope parameters

{βMj,k}, and precision parameters,{τM

j }, corresponding to Personal Air, Indoor Air, and Outdoor Airnever

converged; running the chains longer did not remedy this situation. The essential problem is that there is

little information (i.e, few observed values above the MDL)contained in the NHEXAS air measurements,

especially in Personal Air (Table1). This lack of data is most likely due to the burdensome nature of personal-

air exposure monitors (Samet and Jaakkola 1999). To see if the problem was mediated by removing Personal

Air, we re-ran the model without it, now linking the Indoor Air and Outdoor Air processes directly to the

Urine biomarker (essentially, we removed the Personal Air node in Figure4). This helped a little, but still the

lack of observations on Indoor Air and Outdoor Air caused convergence problems. We only achieved what

we deemed to be convergence by removing any links in the LEB model that involved Indoor Air, Outdoor

Air, and Personal Air. Henceforth, the model with just the air-related links removed from the LEB model

will be referred to as thefull LEB model.

Before we summarize the resulting full LEB model, we give a description of how we diagnosed its fit.

We started by producing standardized residual plots for thelog As concentration process for each of the

media in the NHEXAS study,XMj , j = 1, . . . ,NM . That is, for each each mediumj, and each individual

22

i = 1, . . . , N I , we defined

rMji {p} =

XMji {p} −

(

µMj {p} +

∑

k∈SMj

βMjk {p}X

Mik {p}

)

√

τMj {p}

,

where θ{p} denotes the pth sample of the parameterθ from the joint posterior distribu-

tion (Gelman, Carlin, Stern, and Rubin 2004, p.170). We found that taking a random realization (or a few

realizations) of the posterior parameters,{XMji }, {µM

j }, {βMjk }, and{τM

j }, and then examining plots of

the standardized residuals, was sufficient for diagnosing model fit. As an example, Figure8 demonstrates

residual plots for the Urine pathway. The histogram and normal quantile-quantile plots of the standardized

residuals (top panels) demonstrate that the normal approximation is reasonable. The plots of the residuals by

observation number in Figure8 (second row, left-hand panel) show two observations outside the limits of -3

and 3 for the standardized residuals. If indeed the residuals were normal and independent across individuals,

we would expect roughly 0.5 of the 178 residuals to be outsidethese limits. The two observations are de-

noted by the gray circles in Figure8, corresponding to case numbers 75 and 157 occurring in Maricopa and

Santa Cruz counties. We can see that the plots of the residuals versus the process variables in the pathway

for Urine (Sill Dust, Soil, Food, and Beverage) are scattered evenly about zero, although there are a few

extreme points (including cases 75 and 157) that are worth further investigation. A plot of the standardized

residuals versus the county in Figure8 (bottom trow, right-hand panel) shows that while the residuals are not

perfectly centered around zero for each county, the assumption of a single intercept for the Urine pathway is

reasonable here. Standardized residuals based on a single random draw from the posterior distribution were

useful in this case, however there is certainly some question to how applicable this idea is to other models

and datasets.

Figure9 displays a summary of the parameters{βMjk } in the LEB model, after removing the “air” pathways.

Consider the Urine, Food, and Beverage media and their (linear) dependence on other media. We display

the posterior mean (black circle) and a 95% credible interval (solid black line) for the slope parameters

23

linking the particular media to other media. For example, Urine is related linearly to Sill Dust, Soil, Food,

and Beverage. (We define and compare the gray circles and dashed lines in Section4.3.4.) Except for the

link between Beverage and Sill Dust, the strengths of all thelinear relationships appear to be non-zero, as

evidenced by the fact that the 95% credible intervals do not contain zero. Log Urine As concentrations are

positively associated with Sill Dust, Food, and Beverage, and negatively associated with Soil. There is weak

positive dependence between Food and Sill Dust and between Food and Water. In the Beverage pathway, an

increase in the log As concentration in Water is associated with an increase in the log As concentration in

Beverage.

The black symbols and lines in Figure10 show posterior summaries (the circles denote the posteriormeans,

the lines are 95% credible intervals) for the process variances,{1/τMj }. We can see that Urine and Sill Dust

process variances are the largest, while the Soil process variance is the smallest.

4.3.2 LEB model + global water

Having established a satisfactory LEB model, we then linkedthe global water model to it (i.e., the models

of Sections3.1 and 3.2.1). After diagnosing model fit (in a similar fashion to the discussion above), we

compared the posterior distributions of the parameters in:

1. the full LEB model;

2. the full LEB model + global water; and

3. an LEB model in which we only include the NHEXAS water log As concentrations + global water.

By making comparisons of the three models we can understand the effect of adding the global water model

component to the LEB model component, and vice versa. As an example of the comparisons we made, the

posterior summaries of the parameters involved in the global water model as defined in Section3.2.1are

24

shown in Table3. With respect to these parameters, there is no difference inthe summaries when the rest of

the LEB model is linked to the global water model. (We will leave comparisons of the LEB slope parameters

and process variances until Section4.3.4.) For the moment, we note that there was some Bayesian learning

from the mixture model that links the Water medium in the LEB model with the global water process. This

can be observed by comparing the posterior distribution of the parameterζWij that links individuali to PWS

j, with the proportion of the countyc(i) served by PWSj, λWc(i),j . Figure11 shows the comparison for four

randomly selected individuals. There is evidence that the model puts substantial weight on certain PWSs

for each individual. Even though the NHEXAS Water log As concentrations process,XMm(W ) learns from

the PWS process, the posterior distribution ofXMm(W ) did not change upon the addition of the global water

model.

4.3.3 LEB model + global soil

Similar to what we did with water, we compared posterior distributions of the parameters in:

1. the full LEB model;

2. the full LEB model + global soil; and

3. an LEB model in which we only include the NHEXAS soil log As concentrations + global soil.

As an illustration, Table4 compares the posterior summaries of the global soil parameters in 2. and 3. above.

There is little difference between the parameters in the global stream model (1/ωH , µH , 1/τH , γH). The

process precision for the topsoil process,1/τT , decreases as we add more pathways above Soil in the LEB

component of the model (2. versus 3.), while the credible interval slope parameter,βT1 , relating the global

topsoil process to the global stream process is wider for thefull LEB + global soil model. Our analysis agrees

with our exploratory data analysis, where we found similar log As concentrations in the NHEXAS, USSoils

25

(topsoil), and NGS (stream sediment) studies. As with water, there was substantial Bayesian learning from

the mixture model that assigns watersheds to individuals (not shown here). With the addition of the global

soil model, we found that the posterior distributions of theindividual NHEXAS log As Soil concentration,

XMi,m(S) (i = 1, . . . , N I ), did not change for those individuals with observed valuesequal or above the MDL.

For those individuals with values below the MDL or that were missing, the distribution of the NHEXAS log

As in Soil shifted lower with the addition of the global soil model.

4.3.4 Full LEB model + global water + global soil

Finally, we fitted the full LEB model including the global water and global soil models. After checking the

model fit (similar to the manner described above), we are in a position to summarize the resulting model. The

gray circles and dashed lines display, respectively, the posterior means and 95% posterior credible intervals

for the posterior parameters{βMjk } (Figure9) and process variances{1/τM

j } (Figure10) in this model. We

compare these parameters of the full LEB model + global water+ global soil, with the parameters of the full

LEB model (i.e., without global models, corresponding to the black circles and lines in the figures). In terms

of the posterior distributions of{βMjk }, adding the global modelsdoes not greatly change the relationships we

observe between the environmental media processes in the LEB model. The one exception is in the posterior

credible interval for Soil in the Urine pathway. For the fullLEB model + global water + global soil, the

interval moves slightly to contain less negative values. This casts doubt on the negative association between

the arsenic levels in Urine and Soil, which the science wouldsuggest should be positive. In terms of the

process variances,{1/τMj }, including global models greatly reduces the process variances in the Soil and

Water pathways in the LEB model, but it does not affect the process variances for the other pathways. In the

interest of space, we do not include summaries of the posterior parameters in the global water and global soil

models – they are almost identical to the summaries given in the right-hand panel of Tables3 and4.

26

4.3.5 Further sensitivity analyses

Given that the data sources vary in quality and scope, we re-ran the full LEB model to investigate the sen-

sitivity to subsets of the data (because of the time it takes to carry out each MCMC, we did not include the

global water and global soil models with the full LEB model).As an example, we left out all individuals in

each county in turn, and refitted the full LEB model. Our interest was in the influence of any one county’s

NHEXAS data upon our results. Figure12 shows summaries of the posterior distributions of the parameters

{βMjk } in the Urine pathway for 14 runs (the number of counties we have data for) of the MCMC algorithm,

removing the data for each county in turn. This figure indicates, with perhaps the exception of Pima county,

that the posterior distributions are robust to the data, which provides us some confidence in the results we

have obtained.

4.4 Summary of scientific results

We have demonstrated that the structure proposed byClayton et al. (2002)to explain human As exposure

pathways in NHEXAS Region 5 is applicable to the analysis of As exposure pathways in NHEXAS AZ,

once the air pathways are removed. This removal of air pathways was necessitated by the lack of information

in the corresponding data (few observations were above the MDL) and associated model-fitting-algorithm

convergence problems. Except for the link between Soil and Urine, all associations between variables were

positive. The incorporation of GLE models for global water and global soil did little to affect these asso-

ciations except to cast a small doubt on the negative association between Urine and Soil, and to sharpen

estimates of the posterior distributions of the Water and Soil log As processes. Therefore, the part of our

analysis that includes GLE models, does not significantly refine our understanding of pathways in the full

LEB model of individuals’ exposure to As. However the incorporation of GLE models, could be of use in

assessing spatial variation in exposure across AZ, since the global water and soil data have better spatial

27

coverage than the NHEXAS AZ data.

5 Discussion

It is our contention that large Bayesian hierarchical models need to be implemented carefully. Following

standard ideas of software development, the Bayesian modelshould be coded up and tested in bite-size

pieces. Since it can be hard to get the chains in the MCMC algorithm to mix, the choice of priors and

parameters updates are critical. Datasets need to be managed, in a documented fashion. Sensitivity of the

results to the data is often overlooked but is very importantto check. For example, in the exposure pathways

model that relates the As levels measured in Urine to local and global environmental measurements, we found

that no county’s data heavily influences the model fit. Fast, reliable MCMC computation with an extendable,

modifiable code is crucial for success.

We close with a final note about prior distributions. Throughout, we have probably relied too heavily on the

use of normal and gamma priors. Certainly, the use of normal priors is standard for modeling log concentra-

tions. However, the use of gamma priors for precisions is more open to debate (e.g.,Gelman, 2006). As we

stepped through our models, we checked that our results werenot sensitive to the choice of the parameters in

the gamma prior distributions. In a few smaller versions of the model, we replaced the gamma prior distribu-

tions by other commonly used, more informative prior distributions, such as uniform distributions. (The full

conditional distributions are then not available in closedform, and require Metropolis-Hastings steps.) With

these replacements, the results were unaffected.

28

Acknowledgments

This research was supported by the American Chemistry Council’s (ACC) Long-Range Research Initiative

under an agreement between the ACC and the Ohio State University Research Foundation. The agreement

was a result of a submission to the EPA’s FY2003 STAR Grant program, in response to an RFA titled,

“Environmental Statistics Research: Novel Analyses of Human Exposure Related Data,” which was jointly

funded by the EPA’s National Center for Environmental Research (NCER) and the ACC. The authors would

like to thank Wayne Hood, the Supervisory Hydrologist in Data Management and Analysis of Water Quality

Division of the Arizona Department of Environmental Quality for providing the public-water-system datasets.

Appendix

A The independent normal model with possible censoring and missingness

(INCM)

Let N(µ, σ2) denote a normal distribution with meanµ and varianceσ2, andCN(µ, σ2; c) denote a left-

censored normal distribution with meanµ and varianceσ2 > 0 that is censored at levelc. For each obser-

vationYi from the data{Y1, . . . , Yn}, we let the observation-status variableZi be0 if the observed value is

below the minimum detection limit (MDL) value ofMi, or 1 if the value is equal to or above the MDL, or2

if the observation is missing. Now letY = (Y1, . . . , Yn)T , Z = (Z1, . . . , Zn)T , andM = (M1, . . . ,Mn)T .

Then the random vectorY has anindependent normal distribution with possible censoring and missingness

(INCM distribution), or Y ∼ INCM(µ, σ2;Z;M ), if

[

Y |X, σ2,Z,M]

=∏

{i : Zi=0 or 1}

[

Yi|µi, σ2]

=

∏

{i : Zi=0}

CN(

µi, σ2;Mi

)

∏

{i : Zi=1}

N(

µi, σ2)

,

whereE(Y ) = (µ1, . . . , µn)T ≡ µ is the mean vector andvar(Yi) = σ2 > 0 (i = 1, . . . , n) is the variance.

29

References

Banerjee, S., B. Carlin, and A. Gelfand (2004).Hierarchical Modeling and Analysis for Spatial Data.

Boca Raton, FL: Chapman & Hall/CRC.

Berliner, L. M. (2003). Physical-statistical modeling in geophysics.Journal of Geophysical Research

(Atmospheres) 108 (D24), 8776, doi: 10.1029/2002JD002865.

Calder, C. A., P. F. Craigmile, and J. Zhang (2007). Regionalspatial modeling of topsoil geochemistry.

Technical Report 796, Department of Statistics, The Ohio State University. Submitted for review.

Clayton, C. A., E. D. Pellizzari, and J. J. Quackenboss (2002). National Human Exposure Assessment

Survey: Analysis of exposure pathways and routes for arsenic and lead in EPA Region 5.Journal of

Exposure Analysis and Environmental Epidemiology 12, 29–43.

Cressie, N., B. E. Buxton, C. A. Calder, P. F. Craigmile, C. Dong, N. J. McMillan, M. Morara, T. J. Santner,

K. Wang, G. Young, and J. Zhang (2007). From sources to biomarkers: A hierarchical Bayesian

approach for human exposure modeling.Journal of Statistical Planning and Inference 137, 3361–

3379.

Gelfand, A. E. and A. F. M. Smith (1990). Sampling-based approaches to calculating marginal densities.

Journal of the American Statistical Association 85, 398–409.

Gelman, A. (2006). Comment on “A comparison of Bayesian and likehood-based methods for fitting

mutilevel models” by W. J. Browne and D. Draper.Bayesian Analysis 1, 515–534.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004).Bayesian Data Analysis, 2nd edn. Boca

Raton, FL: Chapman & Hall/CRC.

Geweke, J. (1991). Efficient simulation from the multivariate normal and Student-t distributions subject

to linear constraints. In E. M. Keramidas (Ed.),Computing Science and Statistics: Proceedings of the

Twenty-Third Symposium on the Interface, Fairfax, VA: Interface Foundation of North America, pp.

30

571–568.

Lauritzen, S. L. (1996).Graphical Models. Oxford, UK: Oxford University Press.

Lindley, D. V. and A. F. M. Smith (1972). Bayes estimates for the linear model (with discussion).Journal

of the Royal Statistical Society, Series B 34, 1–41.

McMillan, N. J., M. Morara, and G. S. Young (2006). Hierarchical Bayesian modeling of human exposure

pathways and routes. Alexandria, VA: American StatisticalAssociation, pp. 2492–2503.

NERL and National Center for Environmental Assessment (2000). Strategic Plan For The Analysis Of The

National Human Exposure Assessment Survey (NHEXAS) Pilot Study Data. ORD, U.S. EPA: NERL

and National Center for Environmental Assessment.

O’Rourke, M. K., P. K. Van de Water, S. Jin, S. P. Rogan, A. D. Weiss, S. Gordon, D. Moschandreas,

and M. Lebowitz (1999). Evaluations of primary metals from NHEXAS Arizona: Distributions and

preliminary exposures.Journal of Exposure Analysis and Environmental Epidemiology 9, 435–445.

Pellizzari, E., P. Lioy, J. Quackenboss, R. Whitmore, A. Clayton, N. Freeman, J. Waldman, K. Thomas,

C. Rodes, and T. Wilcosky (1995). Population-based exposure measurements in EPA Region 5: A

Phase I field study in support of the National Human Exposure Assessment Survey.Journal of Expo-

sure Analysis and Environmental Epidemiology 5, 327–358.

R Development Core Team (2007).R: A Language and Environment for Statistical Computing. Vienna,

Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

Robertson, G., M. Lebowitz, M. O’Rourke, S. Gordon, and D. Moschandreas (1999). The National Human

Exposure Assessment Survey (NHEXAS) study in Arizona – introduction and preliminary results.

Journal of Exposure Analysis and Environmental Epidemiology 9, 427–434.

Samet, J. and J. Jaakkola (1999). The epidemiologic approach to investigating outdoor air pollution. In

S. Holgate, J. Samet, H. Koren, and R. Maynard (Eds.),Air Pollution and Health. London, UK: Aca-

31

demic Press.

Santner, T. J., P. F. Craigmile, C. A. Calder, and R. Paul (2007). Effect and pathways modifiers in a

Bayesian pathways analysis of the National Human Exposure Assessment Survey for arsenic in EPA

Region 5. Technical Report 801, Department of Statistics, The Ohio State University. Submitted for

review.

Seaber, P., F. Kapinos, and G. Knapp (1987).Hydrologic Unit Maps. Denver, CO: U.S. Geological Survey.

Spiegelhalter, D. J., A. Thomas, N. G. Best, and D. Lunn (2004, January).WinBUGS Version 1.4 User

Manual. Cambridge, UK: MRC Biostatistics Unit. Available athttp://www.mrc-bsu.cam.ac.uk/bugs.

Tierney, L. (1994). Markov chains for exploring posterior distributions.Annals of Statistics 22, 1701–

1728.

WHO (1981).Arsenic. Geneva: World Health Organization. Environmental HealthCriteria 18.

32

Media Description #Missing #<MDL ωj

Urine Morning void urine samples were collected on the third

and seventh days of the study.

12 55 393.99

Personal Air Air samples were collected for a subset of the partici-

pants using monitors worn by the participants for a pe-

riod of 144 hours.

168 4 893.99

Indoor Air Integrated six-day air samples were collected inthe pri-

mary living area of the participants’ homes.

51 90 893.99

Outdoor Air Integrated six-day samples were collected at a location

outside the participants’ homes.

58 82 893.99

Soil Soil samples were collected by a sweeping method from

an area with high traffic at the primary entrance to the

home and by a ring collection method from exposed soil

in the primary outdoor activity areas.

110 0 893.99

Surface Dust Dust samples were collected from a windowsill in the

participant’s primary living area.

20 12 893.99

Food NHEXAS participants were required to prepare or obt- 3 0 893.99

Beverage ain duplicate samples of the food and beverages they

consumed for four consecutive days. Composites were

separated intobeverage intake andfood intake.

10 11 893.99

Water Tap-water samples were collected from participants’

homes after flushing the pipes for a period of 3 minutes.

0 0 3594.00

Table 1: A summary of the environmental media and biomarker measurements from the NHEXAS AZ study.

The right-most columns tabulate for each medium,j the number of individuals (out of 178) that have missing

values or values below the MDL, and list the measurement-error precision,ωj .

33

County Apache Cochise Coconino Gila Graham Greenlea La Paz Maricopa

#Individuals 4 8 8 13 4 4 0 49

Population 77 138 136 62 40 9 26 4022

County Mohave Navajo Pima Pinal Santa Cruz Yavapai Yuma

#Individuals 5 4 44 10 6 11 8

Population 217 117 1021 243 47 227 213

Table 2: The number of NHEXAS individuals in each county of AZ, along with the population (in thousands)

according to the 1990 Census.

Parameters water only-LEB + global water Full LEB + global water

1/τW 0.475 ( 0.459 , 0.492 ) 0.478 ( 0.461 , 0.495 )

αW -4.798 ( -8.446 , 0.675 ) -4.780 ( -8.394 , 0.636 )

1/CW 7.522 ( 1.562 , 41.777 ) 7.537 ( 1.518 , 41.953 )

Table 3: Posterior means and 95% credible intervals for parameters in the global water model.

Parameters soil only-LEB + global soil Full LEB + global soil

βT0 0.075 ( -1.610 , 1.543 ) 0.176 ( -1.521 , 1.768 )

βT1 1.067 ( 0.234 , 2.057 ) 1.021 ( 0.090 , 2.003 )

1/τT 0.692 ( 0.574 , 0.818 ) 0.487 ( 0.334 , 0.678 )

1/ωH 0.740 ( 0.686 , 0.801 ) 0.741 ( 0.686 , 0.803 )

µH 1.678 ( 1.560 , 1.796 ) 1.675 ( 1.559 , 1.792 )

1/τH 0.296 ( 0.196 , 0.419 ) 0.294 ( 0.191 , 0.417 )

γH 0.088 ( 0.048 , 0.104 ) 0.089 ( 0.036 , 0.104 )

Table 4: Posterior means and 95% credible intervals for parameters in the global soil model.

34

−115 −113 −111 −109

3132

3334

3536

37Topsoil locations

Longitude

Latit

ude

−115 −113 −111 −109

3132

3334

3536

37

Stream sediment locations

LongitudeLa

titud

e

Figure 1: The left-hand panel shows the locations of the observed topsoil As concentrations superimposed

on the counties of AZ. The right-hand panel shows the locations of the observed stream sediment As concen-

trations over the 84 hydrologic units that are contained partially within the state of AZ.

−115 −113 −111 −109

3132

3334

3536

37

Arizona IMPROVE Monitors

Longitude

Latit

ude

*

*

Phoenix

Tucson

Figure 2: A map of the counties of AZ showing the locations of the 14 IMPROVE air monitors.

35

−0.3 −0.1 0.2 0.2 −0.3 −0.6 −0.1 0

−0.2 −0.1 0.3 0.3 0.1 0 0.1 −0.1

−0.1 0.2 0.3 −0.2 0 −0.1 0 0

−0.2 0.2 0.3 −0.2 0.1 −0.1 0.2 0.3

0.2 −0.3 0.1 0 0.1 0.1 0.1 0.2

0.3 −0.6 0 −0.1 −0.1 0.1 0.1 0.1

0.2 −0.1 0.1 0 0.2 0.1 0.1 0.5

0.2 0 −0.1 0 0.3 0.2 0.1 0.5

Urin

e

Per

sona

l Air

Indo

or A

ir

Out

door

Air

Soi

l

Sur

face

Dus

t

Foo

d

Bev

erag

e

Wat

er

Personal Air

Indoor Air

Outdoor Air

Soil

Surface Dust

Food

Beverage

Water

response

cova

riate

Figure 3: The numbers in the table represent the pairwise correlations between the log As concentrations

observed in each medium. To calculate each correlation, we removed the missing values and used half the

MDL when censored. The dark gray shading indicates that a pairwise correlation test has a marginal P-

value less than 0.05; the light gray indicates a marginal P-value between 0.05 and 0.1. A rectangle around

a correlation value indicates for the response (columns), which covariates (rows) are used in Clayton et al.’s

model.

36

OUTDOOR

Outdoor Air

Soil

INDOOR

Water

Indoor Air

Surface Dust

EXPOSURE

Personal Air

Food

Beverage

BIOMARKER

Urine

Figure 4: Graphical representation of Clayton et al.’s (2002) structure linking the various environmental

media and the biomarker. Arrows connecting the different nodes in the model indicate the choice of pathways,

SMj , j = 1, . . . , NM .

37

NHEXAS Water log As concentrations

Den

sity

−2 0 2 4 6 8 10 12

0.0

0.1

0.2

0.3

0.4

0.5

−115 −113 −111 −10931

3233

3435

3637

NHEXAS Water

Longitude

Latit

ude 4

8

8

13

4 04

49

54

44

10

6

11

8

PWS log As concentrations

Den

sity

−2 0 2 4 6 8 10 12

0.0

0.1

0.2

0.3

0.4

0.5

Figure 5: On the left-hand side, histograms of the observed NHEXAS and PWS water log As concentrations.

For the histogram of PWS measurements, the check marks denote the minimum detection limits (MDLs).

The map on the top right-hand side displays the number of NHEXAS Water observations per county of AZ.

38

NHEXAS Soil log As concentrations

Den

sity

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

−115 −113 −111 −10931

3233

3435

3637

NHEXAS Soil

Longitude

Latit

ude 2

0

0

6

4 04

0

00

42

10

0

0

0

Topsoil log As concentrations

Den

sity

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

Stream sediment log As concs.

Den

sity

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

Figure 6: The top panels show histograms and a map of the number of observed NHEXAS Soil log As

concentrations. The lower panels display histograms of theUSGS topsoil and NGS stream sediment log As

concentrations, respectively

39

−10

−6

log

As

conc

.site #1 (COCHISE)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #2 (COCONINO)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #3 (YAVAPAI)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #4 (YAVAPAI)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #5 (COCONINO)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #6 (APACHE)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #7 (PIMA)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #8 (APACHE)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #9 (MARICOPA)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #10 (PINAL)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #11 (PIMA)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #12 (PIMA)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #13 (GILA)

Jan 1988 Jan 1994 Jan 2000

−10

−6

log

As

conc

.

site #14 (GILA)

Jan 1988 Jan 1994 Jan 2000

Figure 7: Time series plots of the log As concentrations for all the observed values at the 14 IMPROVE

air-monitor locations. The vertical gray lines delimit thestudy periods of NHEXAS in the county where the

monitor is located. 40

standardized residuals for Urine

Den

sity

−4 −2 0 2 4

0.00

0.15

0.30

−2 −1 0 1 2

−4

−2

02

4

standard normal quantiles

quan

tiles

of r

esid

uals

0 50 100 150

−4

−2

02

4

Observation number

stan

dard

ized

res

idua

l

0 1 2 3 4 5−

4−

20

24

SillDust

stan

dard

ized

res

idua

l

1 2 3 4 5

−4

−2

02

4

Soil

stan

dard

ized

res

idua

l

0 2 4 6 8

−4

−2

02

4

Food

stan

dard

ized

res

idua

l

−3 −2 −1 0 1 2 3

−4

−2

02

4

Beverage

stan

dard

ized

res

idua

l

−4

−2

02

4

stan

dard

ized

res

idua

l

AP

AC

HE

CO

CH

ISE

CO

CO

NIN

O

GIL

A

GR

AH

AM

GR

EE

NLE

A

MA

RIC

OP

A

MO

HA

VE

NA

VA

JO

PIM

A

PIN

AL

SA

NT

A C

RU

Z

YA

VA

PA

I

YU

MA

Figure 8: Standardized residuals plots for the Urine pathway in the full LEB model (does not include any

pathways for air).

41

SillDust

Soil

Food

Beverage

SillDust

Water

SillDust

Water

Urine

Food

Beverage

−1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00

Figure 9: The circles (black: full LEB model; gray: full LEB model + global water + global soil) denote the

posterior means of the parameters{βMjk } in the LEB model pathways leading from the right-hand side media

(Urine, Food, and Beverage) to the left-hand side media. Thehorizontal lines around each circle denote the

corresponding 95% credible intervals.

42

Urine

Food

Beverage

Soil

SillDust

Water

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75

Figure 10: The circles compare the posterior means of the process variance parameters,1/τMj (black circles:

full LEB model model; gray circles: full LEB model + global water + global soil). The horizontal lines

around each circle denote the corresponding 95% credible intervals.

43

250 300 350

0.0

0.2

0.4

individual # 44

PWS identifier

prob

abili

ty

250 300 350

0.0

0.2

0.4

individual # 73

PWS identifier

prob

abili

ty

430 440 450 460 470

0.00

0.10

0.20

individual # 96

PWS identifier

prob

abili

ty

500 600 700 800 900

0.0

0.2

0.4

0.6

0.8

individual # 112

PWS identifier

prob

abili

ty

Figure 11: For four randomly selected individuals, a comparison of the prior probability that the individual

is served by each PWS (gray plus sign), with the posterior probability (black circles).

44

−1.

0−

0.5

0.0

0.5

1.0

Urin

e an

d S

illD

ust

AP

AC

HE

CO

CH

ISE

CO

CO

NIN

O

GIL

A

GR

AH

AM

GR

EE

NLE

A

MA

RIC

OP

A

MO

HA

VE

NA

VA

JO

PIM

A

PIN

AL

SA

NT

A C

RU

Z

YA

VA

PA

I

YU

MA

−1.

0−

0.5

0.0

0.5

1.0

Urin

e an

d S

oil

AP

AC

HE

CO

CH

ISE

CO

CO

NIN

O

GIL

A

GR

AH

AM

GR

EE

NLE

A

MA

RIC

OP

A

MO

HA

VE

NA

VA

JO

PIM

A

PIN

AL

SA

NT

A C

RU

Z

YA

VA

PA

I

YU

MA

−1.

0−

0.5

0.0

0.5

1.0

Urin

e an

d F

ood

AP

AC

HE

CO

CH

ISE

CO

CO

NIN

O

GIL

A

GR

AH

AM

GR

EE

NLE

A

MA

RIC

OP

A

MO

HA

VE

NA

VA

JO

PIM

A

PIN

AL

SA

NT

A C

RU

Z

YA

VA

PA

I

YU

MA

−1.

0−

0.5

0.0

0.5

1.0

Urin

e an

d B

ever

age

AP

AC

HE

CO

CH

ISE

CO

CO

NIN

O

GIL

A

GR

AH

AM

GR

EE

NLE

A

MA

RIC

OP

A

MO

HA

VE

NA

VA

JO

PIM

A

PIN

AL

SA

NT

A C

RU

Z

YA

VA

PA

I

YU

MA

Figure 12: The circles denote the posterior means of parameters{βMjk } in the Urine pathway, removing the

data from each county in turn. The vertical lines around eachcircle denote the associated 95% credible

intervals.

45

Supplemental Material for “Hierarchical Model Building, Fitting,

and Checking: A Behind-the-Scenes Look at a Bayesian Analysis of

Arsenic Exposure Pathways”

Full Conditional Distributions

This document provides the full conditional distributions for the models introduced in the paper.

Letting θ = (µ,β, τ ,ω), the posterior distribution is

[θ,X|Y ] ∝ [Y |X,θ] [X |θ] [θ]

= [Y |X,ω] [X|β, τ ] [β] [ω] [τ ],

assuming independence of the four components within θ. The distribution is not available in closed

form and so we use a Markov chain Monte Carlo algorithm (hybrid Gibbs sampler with Metropolis

steps within Gibbs) to sample from it.

Before providing the full conditional distributions, we first note that for a generic parameter ψ, the

quantities aψ and bψ denote the hyperprior parameters for gamma prior distributions and mψ and

V ψ are the hyperprior parameters for normal prior distributions. These hyperparameter parameters

were not introduced in the manuscript for the sake of brevity.

1 The local environment to biomarker (LEB) model

1. For each j and i, we draw augmented data, YM∗ji , as follows:

YM∗ji

= YMji , ZMji = 0;

∼ Truncated N(XMji , 1/ω

Mj ) on (−∞,MM

ji ], ZMji = 1;

∼ N(XMji , 1/ω

Mj ), ZMji = 2.

(1)

2. For the process variable XM , we sample each XMj , j = 1 . . . ,NM , conditional on the other

process variables {XMk : k 6= j} and the other parameters in the model. We have that

[

XMj |{XM

k : k 6= j},µM ,βM , τM ,Y M∗,ωM]

∝[

Y ∗j |X

Mj ,ω

M] [

XMj |{XM

k : k 6= j},βMj , τMj

]

∏

l∈χj

[

XMl |{XM

k : k 6= l},βMl , τMl

]

, (2)

1

where χj denotes the columns of the process variable XM that depend on XMj in the speci-

fication of the conditional mean ηMji (i.e., those l that belong to SMj ). This conditional mean

ηMji is defined as:

ηMji = µMj +

NMj∑

k=1

βMjkXMsjki

, (3)

where sMj = (sMj1 , . . . , sMjNM

j

) is a vector of length NMj (the cardinality of SMj ) containing the

columns of the process variable matrix XM that affect the mean of the process variable XMj .

The expression (2) above is equal to

NI∏

i=1

[

YM∗ji |XM

ji , ωMj

] [

XMji |{X

Mki : k 6= j},βMj , τ

Mj

]

∏

l∈χj

[

XMli |{X

Mki : k 6= l},βMl , τ

Ml

]

,

from which we can see that each XMji , i = 1, . . . ,N I , can be sampled independently. Now

[

YM∗ji |XM

ji , ωMj

]

∝ exp

(

ωMj2

(Y ∗Mji −XM

ji )2

)

,

[

XMji |{X

Mki : k 6= j},βMj , τ

Mj

]

∝ exp

(

τMj2

(XMji − ηMji )2

)

,

and for each l ∈ χj,

[

XMli |{X

Mki : k 6= l},βMl

]

∝ exp

(

τMl2

(XMli − δli(j) − ǫli(j))

2

)

. (4)

In (4),

δji(n) ≡ µMj +∑

{k:sMjk6=n}

βMjkXMsMjki,

is the conditional mean of XMji minus the regression terms that depend on XM

ni , and

ǫji(n) ≡∑

{k:sMjk

=n}

βMjkXsMjki, (5)

is the regression terms that depend on XMni . Always, δji(n) − ǫji(n) = ηMji .

Rearranging the terms of XMji , the distribution of XM

ji conditional on all other parameters is

N(q/p, 1/p), where

p = ωMj + τMj +∑

l∈χj

τMl

∑

{k:sMjk

=n}

βMjk

2

,

2

and

q = ωMj YM∗ji + τMj ηMji +

∑

l∈χj

τMl(

XMli − δli(j)

)

∑

{k:sMjk

=n}

βMjk

.

3. For the regression parameters βM , we sample each βMj conditional on the other parameters

in the model. Letting

Aj =

1

{

XMsMjk

1

}

1

{

XMsMjk

2

}

......

1

{

XMsMjkNI

}

,

the full conditional distribution of βMj , is

Nrj

(

R−1j qj ,R

−1j

)

,

where Rj = τMj ATj Aj + Σ

−1j and qj = τMj AT

j XMj + Σ

−1j mj.

4. For each j, the conditional distribution of the process precision τMj , given the other parameters,

is the gamma distribution,

Ga

aτM

j +N I/2, bτM

j +NI∑

i=1

(XMji − ηMji )2/2

,

where ηMji is defined in (3) above.

2 The global to local environment (GLE) model

In this section, we provide the full conditional distributions for the parameters in the GLE models.

Linking global environmental-media variables to local environmental-media variables will change the

full conditional distributions for some of the latent processes and the associated regression coefficients

that were given in the previous section. Specifically, now the full conditional distributions of XMm(W )

depends on XW , and XMm(S) depends on XT and XH . The full conditional distributions for other

media in the LEB model remain unchanged.

3

2.1 Global water model

1. With the addition of the global water model, we need to change some of the full conditional

distributions in the LEB model for water. For the Water medium we replace equation (1)

with:

YM∗i,m(W )

= YMi,m(W ), if ZM

i,m(W ) = 0;

∼ Truncated N(XMi,m(W ), 1/ω

Mm(W )) on (−∞,MM

i,m(W )], if ZMi,m(W ) = 1;

∼ N(XMi,m(W ), 1/ω

Mm(W )), if ZM

i,m(W ) = 2.

To sample from the conditional distribution of the water process, {XMi,m(W ) : i = 1, · · · ,N I},

we let µMi,m(W ) = I(ζWij = 1)µWj . Then XM

i,m(W ) is N(qi/pi, 1/pi), where

pi = ωMm(W ) + τMm(W ) and qi = ωWY ∗i,m(W ) + τMm(W )µ

Mi,m(W ),

for each individual i.

With the addition of the global water model, the full conditional distribution of the NHEXAS

Water model precision, τMm(W ), is now

Ga

N I

2+ a

τMm(W ) ,

1

2

NI∑

i=1

(

XMi,m(W ) − µMi,m(W )

)2+ b

τMm(W )

.

2. The full conditional distributions of the remaining parameters in the global water model de-

pends on the augmented data Y W∗jk , defined as follows:

Y W∗jk

= Y Wjk , if ZWjk = 0;

∼ Truncated N(XWj , 1/ωW ) on (−∞,MW

jk ], if ZWjk ,= 1;

∼ N(XWj , 1/ωW ), if ZWjk = 2.

The full conditional distribution of the PWS measurement-error precision, ωW , is then

Ga

∑NW

j=1 NWj

2+ aω

W

,1

2

NW∑

j=1

NWj∑

k=1

(Y W∗jk −XW

jk )2 + bωW

.

3. For the global water model, the full conditional distribution of XWjk for the kth observation

(k = 1, · · · , NWj ) of jth PWS (j = 1, · · · ,NW ) is N(qjk/pjk, 1/pjk), where

pjk = ωW + τW and qjk = ωWY W∗jk + τWµWj .

4

Next, define XWj =

{

XMi,m(W ) : ζij = 1

}

and lj to be the number of elements in XWj . Then,

for each PWS j = 1, ..., NW , the full conditional distribution of the PWS-specific mean µWj is

N(qj/pj, 1/pj), with

pj = NWj τW +CW + ljτ

Mm(W )

and

qj = τWNW

j∑

k=1

XWjk + CWαW + τMm(W )1lk

′XWj .

The full conditional distribution of the process precision, τW , is

Ga

∑NW

j=1 NWj

2+ aτ

W

,1

2

NW∑

j=1

NWj∑

k=1

(

XWjk − µWj

)2+ bτ

W

.

4. For the parameters in the global water-LEB linking model, recall that ζWij = 1 implies that indi-

vidual i is served by PWS j and ζWij = 0 implies that individual i is not served by water-system

j. For each individual i = 1, . . . , N I , the full conditional distribution of ζWi =(

ζWi1 , · · · , ζWiNW

)

is Multinomial(1, (λ∗c(i),1, · · · , λ∗c(i),NW )) where the probability that individual i in country

c(i) is served by PWS j, is

λ∗c(i),j =λc(i),j n

(

XMi,m(W );µ

Wj , τ

Mm(W )

)

∑NW

r=1 λc(i),r n(

XMi,m(W );µ

Wr , τ

Mm(W )

) .

Here, n(

XMi,m(W );µ

Wj , τ

Mm(W )

)

is a normal pdf (with mean µWj and variance 1/τMm(W )) evaluated

at the value of the water process, XMi,m(W ), for each individual i.

5. In the parameters in the global water prior distributions, αW conditional on all the other

parameters is N(q/p, p), where

p = NWCW + V αW

and q =

CWNW∑

j=1

µWj + V αW

mαW

.

The full conditional distribution of CW is

Ga

NW

2+ aC

W

,1

2

NW∑

j=1

(

µWj −mαW)2

+ bCW

.

5

2.2 Global soil model

1. With the addition of the global topsoil/stream-sediment model, we need to change some of the

conditional distributions in the LEB model for soil. For the soil medium, we replace equation

(1) with:

YM∗i,m(S)

= YMi,m(S), if ZM

i,m(S) = 0;

∼ Truncated N(XMi,m(S), 1/ω

Mm(S)) on (−∞,MM

i,m(S)], if ZMi,m(S) = 1;

∼ N(XMi,m(S), 1/ω

Mm(S)), if ZM

i,m(S) = 2.

2. For parameters appearing in the global topsoil/stream-sediment model, for the topsoil obser-

vation at the sth location:

Y T∗(s)

= Y T (s), if ZT (s) = 0;

∼ N(XT (s), 1/ωT ), if ZT (s) = 2

(there is no censored topsoil data).

The full conditional distribution of the topsoil measurement-error precision, ωT , is

Ga

NT

2+ aω

T

,1

2

NT∑

i=1

(Y T∗(si) −XT (si))2 + bω

T

.

3. For parameters appearing in the stream-sediment data model, for the kth observation in jth

HUC8:

Y H∗jk

= Y Hjk , if ZHjk = 0;

∼ Truncated N(XHj , 1/ω

H ) on (−∞,MHjk ], if ZHjk = 1;

∼ N(XHj , 1/ω

H ), if ZHjk = 2.

The full conditional distribution of the stream-sediment measurement-error precision, ωH , is

Ga

NH∑

j=1

NHj

2+ aω

H

,1

2

NH∑

j=1

NHj∑

k=1

(Y Hjk −XH

j )2 + bωH

.

4. The full conditional distribution of the topsoil process XT (s) is N(qs/ps, 1/ps), where

ps = ωT + τT and qs = τT (βT0 + βT1 XHh(s)) + ωTY T∗(s),

for s ∈ D.

6

The full conditional distribution of the topsoil model precision, τT , is

Ga

(

|D|

2+ aτ

T

,1

2

∑

s∈D

(XT (s) − βT0 − βT1 XHh(s))

2 + bτT

)

.

The parameters that appear in the mean function of the topsoil model are βT = (βT0 , βT1 )′.

Letting U = [1|D|,GDXH ] and W = [1NI , ζTXH ] we sample βT from N2(P−1Q,P−1), where

P = τTU ′U + τMm(S)W′W + V β

T

and

Q = τTU ′XT + τMm(S)W′XM

m(S) + V βT

MβT

.

5. The full conditional distribution of the mean of the stream-sediment model, µH , is N(q/p, 1/p),

where

p = τH1′NH (INH − γHA)1NH + V µH

,

and

q = τH1′NH (INH − γHA)XH + V µH

mµH

.

6. Recall that for the unknown parameter ζTij , ζTij = 1 if NHEXAS individual i is located in HUC8

region j, and 0 otherwise. These ζTij ’s are updated during our MCMC algorithm using the

following scheme:

Sample ζTi =(

ζTi1, · · · , ζTiNH

)

∼ Multinomial(

1, (λ∗c(i),1, · · · , λ∗c(i),NH )′

)

, for i = 1, · · · ,N I ,

where

λ∗c(i),j =λc(i),j n(XM

i,m(S);βT0 + βT1 X

Hj , τ

Mm(S))

∑NH

r=1 λc(i),r n(XMi,m(S);β

T0 + βT1 X

Hr , τ

Mm(S))

,

and n(XMi,m(S);β

T0 + βT1 X

Hj , τ

Mm(S)) is a normal pdf (with mean βT0 + βT1 X

Hj and variance

1/τMm(S)) evaluated at XM

i,m(S).

Now define the |D|×NH matrix GD such that the (i,j)th entry GD(i, j) is 1 if topsoil location

si falls in jth HUC8, and 0 otherwise. Sample XH from N(P−1Q,P−1), where

P = τH(INH − γHA) + τTG′DGD + τMm(S)ζ

T ′

ζT ,

and

Q = τH(INH − γHA)]µH1NH + τTG′D(XT − βT0 1|D|) + τMm(S)ζ

T ′

(XMm(S) − βT0 1NI ).

7

7. The full conditional distribution of the stream-sediment precision, τH , is

Ga

(

NH

2+ aτ

H

,1

2(XH − µH1NH )′(INH − γHA)(XH − µH1NH ) + bτ

H

)

.

The full conditional distribution of the mean of the stream-sediment model, µH , is N(q/p, 1/p),

where

p = τH1′NH (INH − γHA)1NH + V µH

,

and

q = τH1′NH (INH − γHA)XH + V µH

mµH

.

The full conditional distribution of the spatial-dependence parameter in the stream-sediment

model, γH , is proportional to

|INH − γHA|1/2 exp{

−0.5τH(XH − µH1NH )′(INH − γHA)(XH − µH1NH )}

×

1

|λmax − λmin|I(

γH ∈ (λmin, λmax))

,

where λmin, λmax are the smallest and largest eigenvalues of A.

8. With the addition of the global-topsoil/stream-sediment model, the full conditional distribu-

tion of the LEB Soil log As process XMm(S), is modified. Now, this distribution isNNI (P−1Q,P−1),

where

P = ωMm(S)INI + τMm(S)INI ,

and

Q = ωMm(S)YM∗m(S) + τMm(S)(β

T0 1NI + βT1 ζT

′

XH).

In addition, the full conditional distribution of the local soil process precision, τMm(S), is now

Ga

(

N I

2+ a

τMm(W ) ,

1

2(XM

m(S) − WβT )′(XMm(S) − WβT ) + b

τMm(S)

)

.

8

Documents

Hierarchical Model Building, Fitting, and Checking: A ... · Hierarchical Model Building, Fitting, and Checking: A Behind-the-Scenes Look at a Bayesian Analysis of Arsenic Exposure