MODELS AND METHODS FOR SPATIAL DATA:
APPLICATIONS IN EPIDEMIOLOGICAL,
ENVIRONMENTAL AND ECOLOGICAL STUDIES
by
Cindy Xin Feng
M.Sc. (Statistics), Simon Fraser University, 2006
B.Sc. (Applied Mathematics), Beijing University of Technology, 2003
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in the Department of
Statistics and Actuarial Science
© Cindy Xin Feng 2011
SIMON FRASER UNIVERSITY
Summer 2011
All rights reserved. However, in accordance with the Copyright Act of
Canada, this work may be reproduced without authorization under the
conditions for Fair Dealing. Therefore, limited reproduction of this
work for the purposes of private study, research, criticism, review and
news reporting is likely to be in accordance with the law, particularly
if cited appropriately.
APPROVAL
Name: Cindy Xin Feng
Degree: Doctor of Philosophy
Title of Thesis: Models and Methods for Spatial Data: Applications in
Epidemiological, Environmental and Ecological Studies
Examining Committee: Dr. Rick Routledge
Chair
Dr. Charmaine Dean, Senior Supervisor
Dr. Jiguo Cao, Supervisor
Dr. Yi Lu, Supervisor
Dr. Paramjit Gill, Internal External Examiner
Dr. Patrick Brown, External Examiner,
University of Toronto
Date Approved:
Abstract
This thesis develops new methodologies for applied problems using smoothing
techniques for spatial or spatial-temporal data. We investigate Bayesian ranking
methods for identifying high-risk areas in disease mapping, assessing these
particularly with regard to their performance in isolating emerging unusual and
extreme risks in small areas. We build on information obtained through mapping
multivariate outcomes by developing models which investigate whether the
multivariate spatial outcomes share the same underlying spatial structure. We
develop a general framework for joint modeling of multivariate spatial outcomes
for count and zero-inflated count data using a common spatial factor model.
We also study spatial exposure measures, motivated by an analysis of Comandra
blister rust infection on lodgepole pine trees from British Columbia. We contrast
nearest distance with other, more general, exposure measures and consider the
impact of misspecification of exposure measures in a semiparametric generalized
additive modeling framework that includes a spatial residual term modeled as a
thin-plate regression spline. An appealing feature of the new spatial exposure
measures considered is that they can be easily adapted to other problems, such as
investigating the association between asthma incidence and traffic exposures. A
common theme in the thesis is the use of functional data analysis, and we
specifically adapt such methods for assessing spatial and temporal variation of
cadmium concentration in Pacific oysters from British Columbia.
The methodologies developed in these projects widen the toolbox for spatial
analysis in applications in epidemiology, and in environmental and ecological
studies.
Acknowledgments
I am deeply indebted to my senior supervisor Dr. Charmaine Dean for her guidance
and support in countless ways. Without her enlightening instruction, great kindness
and patience, I could not have completed my thesis. Her support and encouragement
were very helpful to me through some very difficult times in my life. I also want
to extend my gratitude to my examining committee members, Dr. Rick Routledge,
Dr. Jiguo Cao, Dr. Yi Lu, Dr. Paramjit Gill and Dr. Patrick Brown for all their
careful reviewing and insightful comments. Their detailed reviews and constructive
comments greatly improved the thesis.
Many thanks to the faculty and staff of the Department of Statistics and Actuarial
Science of Simon Fraser University for providing me with a wonderful environment
for graduate studies. In particular, I would like to thank Dr. Derek Bingham, Dr.
Boxin Tang, Dr. Richard Lockhart, Dr. Tim Swartz, Dr. Leilei Zeng, Dr. Joan Hu
and Mr. Ian Bercovitz for their support, and Sadika, Kelly and Charlene for their
constant help. Thank you also to my fellow graduate students for their company
and for growing together with me during my graduate studies.
Finally, and most importantly, I would like to thank my family. I would not have
been able to go this far without their care and encouragement.
Contents
Approval
Abstract
Acknowledgments
Contents
1 Introduction
1.1 Overview
1.2 Disease Mapping
1.2.1 Conditional Autoregressive Priors
1.3 Thin-Plate Splines
1.4 Outline of Thesis
1.4.1 Chapter 2
1.4.2 Chapter 3
1.4.3 Chapter 4
1.4.4 Chapter 5
1.4.5 Chapter 6
2 Bayesian Ranking Methods for the Detection of Isolated Hotspots in Disease Mapping
2.1 Introduction
2.2 Bayesian Disease-Mapping Model
2.3 Ranking Methods
2.3.1 Squared error loss function for the isolation measures
2.3.2 Squared error loss function for the ranks of the isolation measures
2.3.3 Weighted rank squared error loss function
2.3.4 Misclassification rates of regions in the top $100\gamma\%$ group
2.4 Comparison of Rank Estimators of Isolation
2.5 Summary
3 Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
3.1 Introduction
3.2 Models for Joint Count Outcomes
3.2.1 Common Spatial Factor Model for Counts
3.2.2 Common Spatial Factor Model for Zero Heavy Counts
3.2.3 Model Assessment and Comparison
3.3 Applications
3.3.1 Ontario Lung Cancer
3.3.2 Comandra Blister Rust Tree Infection
3.4 Power of the Test for Common Spatial Structure
3.5 Precision Gains Through Joint Outcome Modeling
3.6 Summary and Concluding Remarks
4 Impact of Misspecifying Spatial Exposures
4.1 Introduction
4.2 Comandra Blister Rust Study
4.3 Flexible Smooth Models
4.4 Comparison of Exposure Measures for CBR Infection
4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
4.6 Summary
5 Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia
5.1 Introduction
5.1.1 The Motivating Datasets
5.1.2 Statistical Methods
5.2 Methodology
5.2.1 Spline Smoothing
5.2.2 Monotone Spline Smoothing
5.2.3 Functional Principal Component Analysis
5.2.4 Semi-Parametric Additive Model
5.3 Results
5.3.1 Data Representation
5.3.2 Spatial Variability
5.3.3 The Semi-Parametric Additive Model
5.4 Concluding Remarks
6 Future Work
6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
6.2 Spatial Modeling for Infectious Disease
6.3 Curve Clustering
Bibliography
A Appendix for Chapter 3
B Appendix for Chapter 4
C Appendix for Chapter 5
Chapter 1
Introduction
1.1 Overview
In recent years, there has been considerable interest in the development and
application of spatial models and methods for the analysis of spatially correlated
data, which are often geographically referenced, temporally correlated or highly
multivariate. For
example, a motivating dataset considers the analysis of lung cancer for males and
females by local health unit in Ontario. The key idea throughout the approaches
considered is to take advantage of the correlation structure among observations to
perform estimation, prediction, hypothesis testing and other statistical procedures.
We begin with a review of some important concepts which form the building blocks
of the methods and models developed in later chapters. This is followed by an outline
of the material presented in each of the chapters of the thesis.
CHAPTER 1. INTRODUCTION 2
1.2 Disease Mapping
Mapping of disease incidence and mortality rates is of primary importance in many
epidemiological studies. The use of crude rates to estimate rare disease risks in
small areas, such as health units, census areas or administrative zones, is
problematic since it accounts neither for the high variability of population sizes
across regions nor for the spatial patterns of the regions under study. Because of
this, interpretation of the spatial distribution of disease based on crude
estimates is often misleading. Alternatively, Bayesian inference is widely used to
produce stabilized risk maps by borrowing information from neighborhoods across
the map. Early developments of disease mapping methodology included the use of
empirical Bayes (EB) techniques (Manton et al., 1989; Marshall, 1991; Dean and
MacNab, 2001; Breslow and Clayton, 1993) to estimate parameters, with a plug-in
approximation of these for posterior inference, which yielded unbiased estimates
of the relative risks. However, the variance of these estimates was
underestimated, since the EB approach does not account for the uncertainty arising
from estimating the hyperparameters. In recent years, fully Bayesian (FB)
approaches have gained prominence, with inference based on Markov chain Monte
Carlo (MCMC) algorithms (Besag et al., 1991; Bernardinelli and Montomoli, 1991;
MacNab et al., 2004; Congdon, 2006). Interval estimates of relative risks based on
posterior distributions account for the uncertainty associated with the estimates
through the hyperprior specifications. Bayesian methods for disease mapping are
often termed hierarchical spatial modeling. The first level of the hierarchy
specifies the distribution of the data; the second level introduces the spatial
dependence through random effects which account for heterogeneity in the risks;
the third and lowest level specifies the distribution of the hyperparameters.
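The instability of crude rates in small areas can be made concrete with a short worked calculation. This is a sketch under an assumption not stated in this section, namely that the observed count $y_i$ in region $i$ is Poisson with mean $\theta_i E_i$, where $E_i$ is the expected count and $\theta_i$ the relative risk; the hat notation is introduced here for illustration only.

```latex
\[
\widehat{\theta}_i^{\,\mathrm{crude}} = \frac{y_i}{E_i}, \qquad
\mathrm{E}\bigl(\widehat{\theta}_i^{\,\mathrm{crude}}\bigr) = \theta_i, \qquad
\mathrm{Var}\bigl(\widehat{\theta}_i^{\,\mathrm{crude}}\bigr)
  = \frac{\mathrm{Var}(y_i)}{E_i^{2}} = \frac{\theta_i}{E_i}.
\]
```

Under this assumption the variance of the crude estimate is inversely proportional to $E_i$, so regions with small populations (small $E_i$) produce the most variable estimates, which is the behaviour that smoothing through a hierarchical model is intended to temper.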
1.2.1 Conditional Autoregressive Priors
One of the most popular choices for the distribution of the random effects in
hierarchical spatial modeling is the intrinsic conditional autoregressive (CAR)
model (Besag et al., 1991). Let $W = (w_{ij})$, $i = 1, \ldots, n$ and
$j = 1, \ldots, n$, denote the so-called spatial proximity matrix for $n$ regions,
where $w_{ii} = 0$, and $w_{ij} = 1$ if the $i$th and the $j$th areas are
neighbours (denoted $j \sim i$) and 0 otherwise. The conditional expectation and
variance are
\[
E(b_i \mid b_{j \neq i}) = \frac{1}{w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \tag{1.1}
\]
where $b_{j \neq i}$ denotes the random effects for all regions other than region
$i$, and $b = (b_1, \ldots, b_n)$ has joint distribution
\[
b \sim MVN(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - W)^{-1}, \tag{1.2}
\]
where $D = \mathrm{diag}(w_{1+}, \ldots, w_{n+})$ and $w_{i+} = \sum_j w_{ij}$.
The forms (1.1) and (1.2) define the intrinsic CAR (Besag et al., 1991) uniquely.
With this model, local smoothing can be achieved, as $E(b_i \mid b_{j \neq i})$ is
the local risk average over the neighborhood of region $i$ and
$\mathrm{Var}(b_i \mid b_{j \neq i})$ is scaled by the inverse of the number of
neighbors, so that the greater the number of neighbors, the smaller the variance.
However, the intrinsic CAR prior is improper, since the matrix $(D - W)$ is
singular. This impropriety can be remedied by enforcing a constraint such as
$\sum_{i=1}^{n} b_i = 0$, which can be implemented numerically at each iteration
of an MCMC algorithm used for model fitting. Alternatively, the so-called proper
CAR model may be used; this model incorporates an additional parameter $\rho$, so
that the full conditionals are
\[
E(b_i \mid b_{j \neq i}) = \frac{\rho}{w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \qquad
\rho \in (0, 1), \tag{1.3}
\]
leading to the unique joint distribution
\[
b \sim MVN(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - \rho W)^{-1}, \tag{1.4}
\]
so that the matrix $(D - \rho W)$ is non-singular.
Alternatively, Leroux et al. (1999) proposed a CAR model defining the full
conditionals as
\[
E(b_i \mid b_{j \neq i}) = \frac{\lambda}{1 - \lambda + \lambda w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{1 - \lambda + \lambda w_{i+}}, \qquad
\lambda \in (0, 1), \tag{1.5}
\]
leading to the unique joint distribution
\[
b \sim MVN(0, \Sigma), \qquad
\Sigma = \sigma_b^2 \{\lambda (D - W) + (1 - \lambda) I\}^{-1}, \tag{1.6}
\]
where $\lambda$ weights the contributions of the spatially correlated effect,
modeled as an intrinsic CAR, and of the independent random noise term, modeled as
independent normal.
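The three CAR specifications above differ only in the matrix that is inverted to give the covariance. The following is a minimal sketch, not the thesis's implementation, that builds each of these matrices (up to the variance factor) for a small map and checks which are singular. The four-region adjacency matrix is hypothetical, and the argument names `rho` and `lam` are assumed names for the extra parameter in (1.3) and (1.4) and the weighting parameter in (1.5) and (1.6).

```python
import numpy as np

def car_precisions(W, rho=0.9, lam=0.5):
    """Matrices whose inverses appear in (1.2), (1.4) and (1.6), without the
    variance factor, for a symmetric 0/1 adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))                        # D = diag(w_1+, ..., w_n+)
    intrinsic = D - W                                 # singular: every row sums to zero
    proper = D - rho * W                              # non-singular for rho in (0, 1)
    leroux = lam * (D - W) + (1.0 - lam) * np.eye(n)  # mixture with the identity
    return intrinsic, proper, leroux

# Hypothetical map: four regions arranged on a line, 1 - 2 - 3 - 4.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

intrinsic, proper, leroux = car_precisions(W)
rank_intrinsic = np.linalg.matrix_rank(intrinsic)   # n - 1 for a connected map
min_eig_proper = np.linalg.eigvalsh(proper).min()   # positive => invertible
min_eig_leroux = np.linalg.eigvalsh(leroux).min()   # at least 1 - lam
print(rank_intrinsic, min_eig_proper > 0, min_eig_leroux > 0)
```

The rank deficiency of the intrinsic matrix is the impropriety discussed in the text; the other two matrices are positive definite, so their inverses are valid covariance matrices.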
For point-referenced data, geostatistical models (Cressie, 1993) are often used;
these directly specify the covariance matrix based on the distances between the
spatial sites. For example, the correlation between two spatial sites may decay
exponentially with distance. CAR models, in contrast, are specified based on the
adjacency structure among the spatial units, and can be used for either
point-referenced data or lattice data. In addition, inference for geostatistical
models usually requires inversion of covariance matrices at each MCMC iteration,
whereas CAR models are specified directly through the inverse of the covariance
matrix; CAR models are therefore computationally more efficient than
geostatistical models.
1.3 Thin-Plate Splines
Thin-plate splines (Duchon, 1977) offer a very elegant approach for estimating a
smooth function of multiple predictor variables. The following provides a concise
introduction to thin-plate splines. For a more detailed description, see Duchon
(1977), Meinguet (1979), Green and Silverman (1994) and Wood (2004, 2006).
Suppose the response $y_i$, $i = 1, \ldots, n$, is modeled as a smooth function of
covariates $x_i$ such that
\[
y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1.7}
\]
where $f$ is an unknown function on a fixed domain $\mathcal{D} \subset R^d$,
$\epsilon_i$ is a random error term, and $x_i \in \mathcal{D}$ are fixed values of
the covariates.
Thin-plate spline smoothing estimates $f$ by finding the function which minimizes
the penalized sum of squares
\[
\frac{1}{n} \sum_{i=1}^{n} w_i \{y_i - f(x_i)\}^2 + \lambda J_m(f), \tag{1.8}
\]
where $w_i$, $i = 1, 2, \ldots, n$, are fixed constants; $J_m(f)$ is a penalty
function measuring the non-smoothness or so-called ‘wiggliness’ of $f$; and
$\lambda$ is the smoothing parameter, which controls the tradeoff between fitting
the data closely and the smoothness of $f$. The penalty term is defined as
\[
J_m(f) = \int \cdots \int_{R^d} \sum_{\nu_1 + \cdots + \nu_d = m}
\frac{m!}{\nu_1! \cdots \nu_d!}
\left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2
dx_1 \cdots dx_d . \tag{1.9}
\]
The sum in the integral is taken over all vectors of non-negative integers
$\nu = (\nu_1, \ldots, \nu_d)^T$ such that $\nu_1 + \cdots + \nu_d = m$, where $d$
denotes the number of covariates, so $d = 2$ for spatial longitude and latitude
coordinate data, and the order $m$ of differentiation in the penalty can be any
integer satisfying $2m > d$. Matheron (1973) and Duchon (1977) showed that the
function minimizing (1.8) has the form
\[
f(x) = \sum_{j=1}^{k} \alpha_j \phi_j(x) + \sum_{i=1}^{n} \delta_i \eta_i(x), \tag{1.10}
\]
where $(\phi_1, \ldots, \phi_k)$ are linearly independent polynomials spanning the
space of all $d$-dimensional polynomials of degree less than $m$, and $\alpha_j$,
$j = 1, \ldots, k$, and $\delta_i$, $i = 1, \ldots, n$, are coefficients to be
estimated. For example, when $d = 2$, $m = 2$, $k = 3$ and $x = (x_1, x_2)$, we
have $\phi_1(x) = 1$, $\phi_2(x) = x_1$ and $\phi_3(x) = x_2$. For $d = 2$,
$m = 3$, $k = 6$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$, $\phi_3(x) = x_2$,
$\phi_4(x) = x_1 x_2$, $\phi_5(x) = x_1^2$ and $\phi_6(x) = x_2^2$. The functions
$(\eta_1, \ldots, \eta_n)$ are a set of $n$ radial basis functions, defined as
\[
\eta_i(r) =
\begin{cases}
a_{md} \, \|r\|^{2m-d} \log \|r\|, & d \text{ even}, \\
b_{md} \, \|r\|^{2m-d}, & d \text{ odd},
\end{cases}
\]
where $a_{md}$ and $b_{md}$ are constants and $\eta_i$ is evaluated at
$r = x - x_i$, the displacement of $x$ from the $i$th observed covariate value.
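The representation (1.10) can be turned into a small numerical example. The sketch below is an illustration under stated assumptions rather than the fitting method used in the thesis: it takes d = 2 and m = 2, unit weights in (1.8), the standard constant 1/(8π) for the radial basis, and fits the full-rank spline by solving the usual linear system with the side condition that the radial coefficients are orthogonal to the polynomial terms. All function names and the simulated data are hypothetical.

```python
import numpy as np

def tps_kernel(r):
    """Radial basis r^2 log(r) / (8 pi) for d = 2, m = 2, set to 0 at r = 0."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos]) / (8.0 * np.pi)
    return out

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def tps_fit(X, y, lam=0.0):
    """Solve (K + n*lam*I) radial + T poly = y subject to T' radial = 0."""
    n = X.shape[0]
    K = tps_kernel(pairwise_dist(X, X))
    T = np.column_stack([np.ones(n), X])      # polynomial terms 1, x1, x2
    A = np.block([[K + n * lam * np.eye(n), T],
                  [T.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    return sol[:n], sol[n:]                   # radial and polynomial coefficients

def tps_predict(X_new, X, radial, poly):
    K_new = tps_kernel(pairwise_dist(X_new, X))
    T_new = np.column_stack([np.ones(X_new.shape[0]), X_new])
    return K_new @ radial + T_new @ poly

rng = np.random.default_rng(0)
X = rng.uniform(size=(25, 2))                 # hypothetical site coordinates

# A plane is unpenalized when m = 2, so it is recovered for any lam.
y_lin = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
radial_lin, poly_lin = tps_fit(X, y_lin, lam=0.1)

# With lam = 0 the spline interpolates the observations.
y_curved = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2
radial_c, poly_c = tps_fit(X, y_curved, lam=0.0)
fitted_c = tps_predict(X, X, radial_c, poly_c)
```

The two checks mirror the role of the penalty: functions in the span of the polynomial terms incur no penalty and are reproduced exactly, while larger values of `lam` pull a curved fit towards a plane. Solving an n by n system in this way is what motivates the reduced-rank approximations cited in this section.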
For modeling spatial effects, thin-plate regression splines can be viewed as a
Gaussian process with generalized covariance (Cressie, 1993), characterized in
terms of distance $\tau$. The form of the covariance in two dimensions is
$C(\tau) \propto \tau^{2m-2} \log(\tau)$, where $m$ is the order of the spline
(commonly two). Paciorek (2007) provided a comparison of a variety of approaches
for modeling spatial surfaces. Wood (2000, 2003, 2004) proposed the use of
iteratively weighted fitting of reduced-rank thin-plate splines for computational
efficiency.
1.4 Outline of Thesis
This thesis develops models and methods for the analysis of spatial or
spatial-temporal data arising from epidemiological, environmental and ecological
studies. Specific problems considered include identification of high-risk
isolated areas in Chapter 2; joint modeling of multivariate spatially correlated
outcomes using common spatial factor models in Chapter 3; misspecification of
spatial exposure measures in Chapter 4; and investigation of functional data
analysis approaches for modeling spatially and temporally correlated data in
Chapter 5. Each of Chapters 2, 3, 4 and 5 constitutes a submitted paper. As a
result, some introductory material, as well as the descriptions of motivating
data sets, may be repeated across these chapters.
1.4.1 Chapter 2
In disease mapping studies, there is often interest in identifying high-risk areas
in order to investigate causes of mortality for surveillance purposes, or perhaps
for efficient allocation of health funding. Here, we focus on identification of
locally isolated high-risk regions, termed ‘local hotspots’ or ‘emerging
hotspots’, defined as regions with elevated risks relative to their neighbors.
Identification of such hotspots before they become extreme is crucial for disease
surveillance. We develop methods of ranking the difference between area risks or
ranks and corresponding values for neighbours, based on (1) the standardized
mortality ratio (SMR), (2) minimizing mean squared errors of estimation for
relative risks, (3) minimizing mean squared errors of estimation for ranks of
risks, (4) minimizing a weighted squared error loss function for ranks and (5)
maximizing the sensitivity in the upper and lower $100\gamma\%$ of relative risks
at a prespecified $\gamma$. We evaluate our methods through a simulation study in
a scenario which reflects the Scottish lip cancer data used in several mapping
studies. Our simulation results show that ranking the difference between posterior
ranks of emerging hotspots and corresponding values for neighbours, based on
minimizing mean squared errors of estimation for ranks, is superior to the other
methods for identifying emerging hotspots.
1.4.2 Chapter 3
This chapter discusses joint outcome modeling of multivariate spatial data, where
outcomes include count as well as zero-inflated count data. The framework utilized for
the joint spatial count outcome analysis reflects that which is now commonly employed
for the joint analysis of longitudinal and survival data, termed shared frailty models,
in which the outcomes are linked through a shared latent spatial random risk term.
We discuss these types of joint mapping models and consider the benefits achieved
through such joint modeling in the disease mapping context. We also consider the
power of tests for common spatial structure and develop recommendations on the
sort of power achievable in some contexts, as well as overall recommendations on the
utility of joint mapping. We illustrate the approaches in an analysis of lung cancer
mortality as well as an ecological study of Comandra blister rust infection of lodgepole
pine trees.
1.4.3 Chapter 4
In environmental and epidemiological studies, the nearest distance between the
susceptible subject and the exposure source is a commonly used exposure measure,
principally because this measure is easy to collect. However, the density of the
exposure in the neighborhood of the subject may play an important role in the
response to exposure. Misspecification of exposure measures may result in
inaccurate determinations of the link between exposure and the response of
interest. Such considerations are motivated by the study of the disease dynamics
of Comandra blister rust (Cronartium comandrae) on lodgepole pine. This disease
spreads to pine trees through alternate host plants near the trees. We aim to
understand the relationship between the presence of the alternate host plant and
the disease, as well as effects relating to genetic variation in the trees. We
contrast the use of nearest distance to the alternate host plant with host plant
densities at different orders of neighborhood as exposure measures, in the
framework of a flexible semiparametric generalized additive model, while adjusting
for a spatially smooth surface. We demonstrate that if exposure is inaccurately
modeled, bias in the estimated genetic effects may arise. Our study also provides
information on the added benefit of collecting more detailed information on
exposure beyond the simple nearest distance measure.
1.4.4 Chapter 5
Oysters from the Pacific Northwest coast of British Columbia, Canada, contain high
levels of cadmium, in some cases exceeding some international food safety guidelines.
A primary goal of this chapter is the investigation of the spatial and temporal variation
in cadmium concentrations for oysters sampled from coastal British Columbia. Such
information is important so that recommendations can be made as to where and when
oysters can be cultured such that accumulation of cadmium within these oysters
is minimized. Some modern statistical methods are applied to achieve this goal,
including monotone spline smoothing, functional principal component analysis and
semi-parametric additive modelling. Oyster growth rates are estimated as the first
derivatives of the monotone smoothing growth curves. Some important patterns in
cadmium accumulation by oysters are observed. For example, most inland regions
tend to have a higher level of cadmium concentration than most coastal regions, so
more caution needs to be taken for shellfish aquaculture practices occurring in the
inland regions. The semi-parametric additive model shows that oyster cadmium
concentration decreases with oyster length, and that oysters sampled at 7 m have a
higher average cadmium concentration than those sampled at 1 m.
1.4.5 Chapter 6
The thesis closes with a discussion of future research topics.
Chapter 2
Bayesian Ranking Methods for the
Detection of Isolated Hotspots in
Disease Mapping
2.1 Introduction
In disease mapping, early capture of emerging hotspots, that is, regions with
elevated risks which are surrounded by areas with much lower risks, before they
become extreme, is crucial in decision-making related to health surveillance.
Such decision-making processes may refer to optimal allocation of resources for
health prevention, or to decisions reflecting mobility of a society or other
environmental controls. A typical approach for detection of disease hotspots
through a hypothesis testing framework utilizes the scan statistic (see Kulldorff
and Nagarwalla, 1995; Kulldorff et al., 1998), which aims at detecting the
location and size of hotspots without any preconceived assumptions about these
values. Our focus here is quite different, as we seek to estimate and rank various
local elevations in risk across a map. Model-based spatial
CHAPTER 2. ISOLATED HOTSPOT DETECTION 11
methods are used here to estimate such ranks. For rare diseases, the observed
disease counts may exhibit extra-Poisson variation. Hence, the standardized
mortality ratios (SMRs), a basic investigative tool for epidemiologists, may be
highly variable. Consequently, in maps of SMRs, the most variable values, arising
typically from low population areas, tend to be highlighted, masking the true
underlying pattern of disease risk. To address the issue of such overdispersion,
the field of disease mapping has flourished in the last decade with a variety of
estimation methods and spatial models for latent levels of the model hierarchy. In
particular, there have been many developments related to Bayesian hierarchical
models, which allow the risk in an area to borrow strength from neighboring areas
where the disease risks are similar. These models have indeed become standard
tools for mapping rates (see, for example, Besag et al., 1991; Clayton and
Bernardinelli, 1992; Clayton et al., 1993; Lawson et al., 2000; MacNab et al.,
2004; Best et al., 2005) in order to identify global hotspots and trends in the
risk surface across the map.
Identification of local or emerging hotspots has received less attention. It is
unclear whether, and which, smoothing techniques offer advantages over basic
estimates such as raw rates for identifying isolated hotspots. Here, we maintain
the focus on Bayesian hierarchical conditional autoregressive (CAR) models,
developed by Besag et al. (1991), Clayton and Bernardinelli (1992) and Clayton et
al. (1993). This model and its extensions have become commonplace in
epidemiological studies and have been shown to be flexible and robust (Lawson et
al., 2000). Best et al. (2005) demonstrate the merits of the CAR model when
compared to other contemporary models, including a multivariate normal
geostatistical model with exponential covariance, a spatial mixture model, a
partition model and a gamma moving average model. While CAR models were not
designed to detect isolated hotspots or clusters of isolated hotspots, they have
nevertheless been used broadly for identifying extreme risks.
The most natural measure of isolation is the difference between the risk or rank
of a potential hotspot and the corresponding quantity for its neighbors. Ranking
methods play a valuable role in drawing attention to elevated regions. This
chapter considers methods for ranking isolation measures with the goal of using
these to identify local or emerging hotspots. We note that Laird and Louis (1989)
showed that ranking of empirical Bayes estimators can be more accurate than that
of conventional maximum likelihood estimators. Shen and Louis (1998) investigated
ranking procedures using squared error loss functions operating on the difference
between the estimated and true ranks. We note also that in many applications,
interest focuses principally on identifying the locations with relatively high
(e.g. in the upper 10%) or low risks. With such an emphasis, Lin et al. (2006)
discussed various loss functions for Bayesian optimal ranking, as well as decision
rules for identifying the regions with the top $100\gamma\%$ risk values. Wright
et al. (2003) developed a weighted rank squared error loss function targeted at
the most likely high-risk locations. We contrast these methods for identifying the
highest and lowest isolation measures across a map and develop recommendations
based on adaptations of these procedures. Though we focus on disease mapping, we
note that methods for ranking isolation measures may be broadly useful in many
other contexts, particularly sociological, for ranking political or racial
isolation, or ecological, for diversity studies.
In Section 2.2, we review the Bayesian hierarchical models commonly used for
analyzing disease incidence and mortality data. Section 2.3 discusses the ranking
methods considered, focusing on identifying regions associated with high risks which
are isolated and building upon Bayesian hierarchical models. Section 2.4 evaluates the
methods using the spatial distribution of lip cancer from Scotland where local hotspots
are artificially generated. Section 2.5 closes with a discussion and recommendations.
2.2 Bayesian Disease-Mapping Model
It is well known that Bayesian hierarchical models for disease mapping provide a
trade-off between bias and variance reduction of estimates, and are particularly
helpful in cases where the disease is rare. The variance reduction is achieved
through borrowing information from neighboring regions to produce a more stable
estimate of the risk surface, with estimated risks shrunk toward the overall mean
risk, or some function of this mean. Marshall (1991) reviews empirical Bayes and
some early Bayesian methods for disease mapping; Lawson et al. (2000) compare
disease mapping models using various goodness-of-fit criteria; Best et al. (2005)
provide a comprehensive review of recent developments in Bayesian disease mapping
and compare models through simulation studies; Richardson et al. (2004) conduct a
comprehensive evaluation designed to highlight the amount of smoothing of risk
which occurs and the effects on identifying global hotspots in a variety of
settings. Our aim here is to evaluate various ranking methods for risk estimators
obtained from fitting Bayesian disease mapping models. We focus on the basic
spatial model described by Besag et al. (1991).
Let the area under study be divided into n contiguous regions labeled i = 1, ⋅ ⋅ ⋅ , n,
and let y = (y1, ⋅ ⋅ ⋅ , yn)T be the observed, and E = (E1, ⋅ ⋅ ⋅ , En)T be the expected,
disease counts. Denote by � = (�1, ⋅ ⋅ ⋅ , �n)T , i = 1, ⋅ ⋅ ⋅ , n the underlying random
region-specific disease risks. The response variables, conditional on �i, i = 1, ⋅ ⋅ ⋅ , n,
are assumed independent and Poisson distributed: yi∣�i∼Poisson(�i), �i = �iEi. The
conditional log linear model (Besag et al., 1991) specifies
log(�i) = � + log(Ei) + �i, �i = � + bi + ℎi ,
where � denotes the overall mean risk, while �i is decomposed into a spatially cor-
related random error term bi, and a uncorrelated error ℎi. The spatially correlated
CHAPTER 2. ISOLATED HOTSPOT DETECTION 14
random effects, b = (b1, ⋅ ⋅ ⋅ , bn)T are conveniently interpreted conditionally, as
bi∣bj ∕=i ∼ N
(∑j∼iwijbj∑j∼iwij
,�2b∑
j∼iwij
),
where $j \sim i$ indicates that region $j$ belongs to the neighbourhood of region $i$,
$i = 1, \dots, n$. Neighbourhoods define the scope of the conditional influence and may
be constructed in different ways depending on the context of the analysis. In our
application, we define the neighbourhood of the $i$th region as the regions which are
contiguous with it in space, sharing a common boundary. The weights, $w_{ij} \geq 0$, $w_{ii} = 0$,
$i, j = 1, \dots, n$, may be based on adjacency indicators for a lattice, or on a distance
measure between regions $i$ and $j$. Where the weights are based on adjacency indicators,
the joint distribution of the random effects, $\mathbf{b}$, is described as the intrinsic
conditional autoregressive model (Besag, 1974; Sun et al., 1999): $\mathbf{b} \sim \text{MVN}(\mathbf{0}, \sigma_b^2 Q^{-1})$,
where $Q$ has $i$th diagonal element equal to the number of neighbours of the $i$th region,
while for $i \neq j$, $Q_{ij} = -1$ if $i$ and $j$ are neighbours, and 0 otherwise. The vector of
random risks, $\boldsymbol{\psi}$, accommodates extra variation through a white noise error vector,
$\mathbf{h} = (h_1, \dots, h_n)^T \sim \text{MVN}(\mathbf{0}, \sigma_h^2 I)$, where $I$ is an identity matrix of dimension
$n$. By combining the independent and spatially correlated sources of random errors,
we obtain the convolution conditional autoregressive model for defining the distribution
of the risks $\psi_i$, as defined by Besag et al. (1991): $\mathbf{h} + \mathbf{b} \sim \text{MVN}(\mathbf{0}, \Sigma)$, where
$\Sigma = \sigma_h^2 I + \sigma_b^2 Q^{-1}$. The values of $\sigma_b^2$ and $\sigma_h^2$ give a sense of the contributions of
the spatial and non-spatial components, respectively, in explaining the variability in
the map of risks.
Bayesian analysis requires the specification of prior distributions for the parameters.
We put a diffuse prior on the intercept $\alpha$. For the variance parameters $(\sigma_b^2, \sigma_h^2)$ of
the random effects $(\mathbf{b}, \mathbf{h})$, we assign their square roots, $\sigma_b$ and $\sigma_h$, a noninformative
uniform prior density between 0 and 100 (Gelman, 2006).
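As a minimal sketch of the adjacency-based structure described above, the following
Python code builds the matrix $Q$ from a 0/1 adjacency matrix and evaluates the
conditional mean and variance of $b_i$ given its neighbours. The four-region map, the
function names and the numerical values are hypothetical and serve only as an
illustration. Each row of $Q$ built this way sums to zero, so $Q$ is singular and $Q^{-1}$ in
the joint distribution is best read as a generalized inverse.

```python
import numpy as np


def car_structure_matrix(adjacency):
    """Q with Q_ii = number of neighbours of i and Q_ij = -1 if i ~ j."""
    A = np.asarray(adjacency, dtype=float)
    if A.shape[0] != A.shape[1] or not np.array_equal(A, A.T):
        raise ValueError("adjacency must be a square symmetric matrix")
    return np.diag(A.sum(axis=1)) - A


def car_conditional(b, i, adjacency, sigma2_b):
    """Mean and variance of b_i given the rest, with adjacency weights w_ij."""
    w = np.asarray(adjacency, dtype=float)[i]
    return w @ b / w.sum(), sigma2_b / w.sum()


# Hypothetical map: four regions in a row, 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = car_structure_matrix(A)

b = np.array([1.0, 2.0, 3.0, 4.0])
mean_1, var_1 = car_conditional(b, 1, A, sigma2_b=1.0)
print(Q)
print(mean_1, var_1)
```

With adjacency weights, the conditional mean is simply the average of the neighbouring
effects and the conditional variance shrinks as the number of neighbours grows.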
In the Bayesian approach to disease mapping, inference on the relative risks is
based on the posterior distribution of the risks given the data. Markov chain
Monte Carlo (MCMC) methods based on Gibbs sampling (Geman and Geman, 1984;
Gelfand and Smith, 1990) are readily implemented in the WinBUGS software package
(Spiegelhalter et al., 2003), which allows estimation of the posterior distribution
of the relative risks. The R package R2WinBUGS (Sturtz et al., 2005) may be used to
export results for additional analyses using R.
2.3 Ranking Methods
To estimate isolation, we propose to rank the difference between the rank or risk
estimates of the region under consideration and the corresponding mean value from its
neighbours. We expect this to provide a useful mechanism for identifying areas with
emerging or unusually elevated risk, and hence for prioritizing public health
investigations. We discuss ranking approaches from both (i) traditional perspectives,
which use estimates based upon the SMR, and (ii) perspectives based on smoothing methods,
with a focus on obtaining a general impression of trends over space as well as utilizing
these to provide more precise identification of isolated high risk areas.
Let $\mathbf{d} = (d_1, \dots, d_n)^T$ be a vector representing the isolation measure, defined as
the true difference in relative risks between the region and the mean value of the risk
for its neighbourhood,

$$d_i = \psi_i - \frac{1}{N_i} \sum_{j \sim i} \psi_j, \tag{2.1}$$

where $N_i$ denotes the number of neighbours for region $i$, $i = 1, \dots, n$. Define the
corresponding rank of $d_i$ as

$$\text{rank}(d_i) = R_i = \sum_{j=1}^{n} I\{d_i \leq d_j\}, \tag{2.2}$$

where $I\{A\}$ is the indicator function for event $A$. The smallest difference has rank
$n$ and the largest has rank 1. The ranking methods considered are obtained by
minimizing the following loss functions.
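The isolation measure (2.1) and its rank (2.2) can be sketched in a few lines of
Python. This is an illustration only: the four-region adjacency matrix and the risk
values are hypothetical, and the function names are not taken from the original
analysis.

```python
import numpy as np


def isolation_measure(risk, adjacency):
    """d_i = risk_i minus the mean risk over the neighbours of region i, as in (2.1)."""
    A = np.asarray(adjacency, dtype=float)
    risk = np.asarray(risk, dtype=float)
    return risk - (A @ risk) / A.sum(axis=1)


def descending_rank(d):
    """R_i = number of regions j with d_i <= d_j, as in (2.2); largest d gets rank 1."""
    d = np.asarray(d, dtype=float)
    return (d[:, None] <= d[None, :]).sum(axis=1)


# Hypothetical map: four regions in a row, 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
risk = np.array([1.0, 3.0, 1.0, 1.0])   # region 1 is elevated above its neighbours

d = isolation_measure(risk, A)
ranks = descending_rank(d)
print(d)      # [-2.  2. -1.  0.]
print(ranks)  # [4 1 3 2]
```

Region 1 receives rank 1 because its risk exceeds the mean of its neighbours by the
largest margin, while its neighbours are pushed to the bottom of the ranking.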
2.3.1 Squared error loss function for the isolation measures
It is well known that the posterior mean minimizes the Bayes risk with respect
to the squared-error loss (SEL) function (Berger, 1985). For example, the posterior
mean, $E(\psi_i \mid \mathbf{y})$, is the optimal Bayes estimate obtained by minimizing the posterior
expectation of the sum of squared error loss function $L(\boldsymbol{\psi}, \hat{\boldsymbol{\psi}}) = \sum_{i=1}^{n} (\psi_i - \hat{\psi}_i)^2 / n$
(Carlin and Louis, 1996). In our case, we rank the posterior mean of the isolation
value, $E(d_i \mid \mathbf{y})$, which minimizes the posterior expectation of $L(\mathbf{d}, \hat{\mathbf{d}}) = \sum_{i=1}^{n} (d_i - \hat{d}_i)^2 / n$.
The corresponding estimated ranks are denoted as PM.
2.3.2 Squared error loss function for the ranks of the isolation measures
Laird and Louis (1989), Shen and Louis (1998) and Louis and Shen (1999) showed
that if ranks of parameters are of interest, using a rank estimator directly is more
appropriate than using the parameter estimator to obtain ranks. The posterior expected
rank is obtained by minimizing the sum of squared error loss function of the ranks,
$L(\mathbf{R}, \hat{\mathbf{R}}) = \sum_{i=1}^{n} (R_i - \hat{R}_i)^2 / n$. The estimated ranks, which are non-integer quantities,
are

$$\bar{R}_i = E(R_i \mid \mathbf{y}) = \sum_{j=1}^{n} P(d_i \leq d_j \mid \mathbf{y}), \tag{2.3}$$

and tend to be shrunk towards the mid-rank $(n + 1)/2$. Hence, we rank the posterior
means in (2.3) as described below, and denote the corresponding estimated ranks,
$\hat{R}_i = \text{rank}(\bar{R}_i)$, as PRANK. Lin et al. (2006) show that the estimator is also optimal
under weighted squared error loss of ranks, $(1/n) \sum_{i=1}^{n} w_i (R_i - \hat{R}_i)^2$, for any values of
$w_i$, $i = 1, \dots, n$. Calculation of PRANK can be easily implemented in the Bayesian
context. Let $\boldsymbol{\psi}^{(r)} = (\psi_1^{(r)}, \dots, \psi_n^{(r)})^T$ be a random draw of $\boldsymbol{\psi}$ from $p(\boldsymbol{\psi} \mid \mathbf{y})$; rank the
isolation measures $d_i^{(r)} = \psi_i^{(r)} - (1/N_i) \sum_{j \sim i} \psi_j^{(r)}$, $i = 1, \dots, n$, $j = 1, \dots, N_i$, and
subsequently rank the average rank of $d_i^{(r)}$ over the MCMC iterations, $r = 1, \dots, R$,
to obtain the optimal rank based on (2.3).
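The calculation just described, together with the PM ranking of Subsection 2.3.1, can
be sketched as follows. The sketch assumes that posterior draws of the risks are
available as an array with one row per MCMC iteration and one column per region; the
toy draws, the adjacency matrix and the function names are hypothetical.

```python
import numpy as np


def isolation_draws(risk_draws, adjacency):
    """Apply (2.1) to every posterior draw; risk_draws has shape (iterations, n)."""
    A = np.asarray(adjacency, dtype=float)
    draws = np.asarray(risk_draws, dtype=float)
    return draws - (draws @ A.T) / A.sum(axis=1)


def rank_desc(values):
    """Ranks along the last axis, with rank 1 for the largest value, as in (2.2)."""
    v = np.asarray(values, dtype=float)
    return (v[..., :, None] <= v[..., None, :]).sum(axis=-1)


def pm_ranks(d_draws):
    """PM: rank the posterior means of the isolation measures."""
    return rank_desc(d_draws.mean(axis=0))


def prank(d_draws):
    """PRANK: rank within each draw, average over draws (2.3), then rank the averages."""
    expected_rank = rank_desc(d_draws).mean(axis=0)
    # The smallest expected rank should receive final rank 1, hence the sign change.
    return rank_desc(-expected_rank), expected_rank


# Hypothetical map (four regions in a row) and two toy posterior draws of the risks.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
risk_draws = np.array([[1.0, 3.0, 1.0, 1.0],
                       [1.0, 1.0, 3.0, 1.0]])

d_draws = isolation_draws(risk_draws, A)
final_rank, expected_rank = prank(d_draws)
print(expected_rank)      # [3. 2. 2. 3.]
print(final_rank)         # [4 2 2 4]
print(pm_ranks(d_draws))  # [4 2 2 4]
```

Tied values share the same rank under (2.2); with continuous posterior draws from a
real MCMC run, ties within a draw would not be expected.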
Ranking methods described in Subsections 2.3.1 and 2.3.2 may be reasonable
choices when accurate ranking of all regions is of interest. In contrast, the methods
described in Subsections 2.3.3 and 2.3.4 focus on high risk areas.
2.3.3 Weighted rank squared error loss function
The posterior means are less variable than a typical draw from the posterior
distribution (Louis, 1984). Therefore, high risks tend to be underestimated, while low
risks tend to be overestimated. Wright et al. (2003) introduce weighted rank squared
error loss functions in a hierarchical setting for estimating extrema (hotspots) of
parameters. In an exploratory approach, we adapt this method to focus on identifying
local isolated hotspots.
Let $(d_{(1)}, \dots, d_{(n)})$ be the ordered vector of $\mathbf{d}$, $d_{(1)} < \dots < d_{(n)}$, assuming no ties.
To identify the most isolated hotspot, we consider the following loss function:

$$J(\mathbf{d}, \hat{\mathbf{d}}, \mathbf{c}) = \sum_{k=1}^{n} \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 = \sum_{k=1}^{n} c_{r(k)} (d_k - \hat{d}_k)^2, \tag{2.4}$$

where $r(k) \equiv \{j : d_k = d_{(j)}\}$, $c_{r(k)} = \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\}$ and $\mathbf{c} = (c_1, \dots, c_n)^T$ is the
vector of weights for $\mathbf{d}$. The optimal Bayes estimator of $d_k$ is obtained by minimizing
the conditional expectation of the $k$th element in (2.4),

$$E\{J_k(\mathbf{d}, \hat{d}_k, \mathbf{c} \mid \mathbf{y})\} = \int \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2\, p(\mathbf{d} \mid \mathbf{y})\, d\mathbf{d}, \tag{2.5}$$

which yields

$$\hat{d}_k = \frac{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\}\, d_k\, p(\mathbf{d} \mid \mathbf{y})\, d\mathbf{d}}{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\}\, p(\mathbf{d} \mid \mathbf{y})\, d\mathbf{d}} = \frac{\sum_{j=1}^{n} E(d_k \mid d_k = d_{(j)}, \mathbf{y})\, c_j\, p(d_k = d_{(j)} \mid \mathbf{y})}{\sum_{j=1}^{n} c_j\, p(d_k = d_{(j)} \mid \mathbf{y})}. \tag{2.6}$$

The estimate $\hat{d}_k$ is a weighted average of conditional posterior means of $d_k$, with the
weight being $c_j$ multiplied by the posterior probability that $d_k$ has rank $j$. The
corresponding estimated ranks are denoted as WRSEL. For identifying extreme risks, we
use the suggestion in Wright et al. (2003) to consider a sharply increasing weighting
vector, with $c_i = \exp[\{(n + 1) - i\}/s]$ as the weight for rank $i$, $i = 1, \dots, n$. We
let WRSEL(a) denote the estimated ranks when $s = 2$, so that the weighting function
puts large weight on highly isolated risks and almost 0 weight otherwise, and
WRSEL(b) denote the estimated ranks when $s = 10$, so that the weight function
declines less steeply as risks become less isolated. Figure 2.1 displays the weight vectors
$\mathbf{c}$ for WRSEL(a) and WRSEL(b) when $n = 56$.
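To make the weighting concrete, the sketch below computes the weight vectors for
$s = 2$ and $s = 10$ with $n = 56$, scaled to a maximum of 1, and gives a Monte Carlo
approximation to the estimator (2.6) from posterior draws. It is an illustration under
stated assumptions rather than the implementation used for the results: the weight is
indexed by the rank in (2.2), so that rank 1 (the most isolated region) carries the
largest weight, the expectation in (2.6) is replaced by an average over draws, and the
toy draws and function names are hypothetical.

```python
import numpy as np


def wrsel_weights(n, s):
    """c_i = exp[{(n + 1) - i} / s] for rank i = 1, ..., n, scaled to maximum 1."""
    i = np.arange(1, n + 1)
    c = np.exp((n + 1 - i) / s)
    return c / c.max()


def wrsel_estimate(d_draws, s):
    """Monte Carlo version of (2.6): weighted mean of the draws of d_k, each draw
    weighted by c evaluated at the rank of d_k within that draw (assumption:
    rank 1 denotes the largest d, as in (2.2))."""
    d = np.asarray(d_draws, dtype=float)                  # shape (iterations, n)
    c = wrsel_weights(d.shape[1], s)
    ranks = (d[:, :, None] <= d[:, None, :]).sum(axis=2)  # rank of each d_k per draw
    w = c[ranks - 1]
    return (w * d).sum(axis=0) / w.sum(axis=0)


c_a = wrsel_weights(56, s=2)    # WRSEL(a): weight concentrated on the top ranks
c_b = wrsel_weights(56, s=10)   # WRSEL(b): weight declines less steeply

# Two toy draws of the isolation measures for two regions.
d_draws = np.array([[3.0, 1.0],
                    [1.0, 2.0]])
d_hat = wrsel_estimate(d_draws, s=1)
print(c_a[:3], c_b[:3])
print(d_hat)
```

Because a draw counts more when the region ranks highly in it, the estimate for each
region is pulled above its plain posterior mean, which is the intended behaviour when
the focus is on extremes.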
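2.3.4 Misclassification rates of regions in the top $100\gamma\%$ group
Lin et al. (2006) considered specific loss functions tailored for estimating extreme
ranks. They recommended ranking the posterior probability that a region's rank is
in the top $100\gamma\%$ of ranks based on the rank-based misclassification loss function:

$$L_{0|1}(\gamma, \mathbf{R}, \hat{\mathbf{R}}) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \text{FP}(\gamma, R_i, \hat{R}_i) + \text{FN}(\gamma, R_i, \hat{R}_i) \right\}, \tag{2.7}$$

where

$$\text{FP}(\gamma, R_i, \hat{R}_i) = I\{R_i > \gamma(n + 1),\ \hat{R}_i \leq \gamma(n + 1)\};$$
$$\text{FN}(\gamma, R_i, \hat{R}_i) = I\{R_i \leq \gamma(n + 1),\ \hat{R}_i > \gamma(n + 1)\}, \tag{2.8}$$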
Figure 2.1: Plot of the weight function c against rank for WRSEL(a) and WRSEL(b). In this
plot, the weight functions are scaled to have maximum value of 1.
where FP (false positive) and FN (false negative) indicate the two possible
misclassification errors.
Lin et al. (2006) show that the loss function (2.7) is minimized by ranking the following
posterior probabilities:

$$P(R_i \leq \gamma(n + 1) \mid \mathbf{y}), \tag{2.9}$$

which are based on the posterior distribution of $R_i$. Ranking these probabilities minimizes
errors in classifying regions above or below a percentile threshold. The corresponding
estimated ranks are denoted as PPR.
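The posterior probabilities in (2.9) can be approximated from MCMC output by the
proportion of draws in which a region's rank falls at or below the threshold. The
following sketch assumes draws of the isolation measures are available as an array
with one row per iteration; the toy draws, the parameter name `gamma` for the
threshold fraction and the function name are hypothetical.

```python
import numpy as np


def ppr(d_draws, gamma):
    """Estimate P(R_i <= gamma * (n + 1) | y) from draws and rank these probabilities."""
    d = np.asarray(d_draws, dtype=float)                  # shape (iterations, n)
    n = d.shape[1]
    ranks = (d[:, :, None] <= d[:, None, :]).sum(axis=2)  # rank 1 = largest d
    # A small tolerance guards against floating-point error in gamma * (n + 1).
    in_top = ranks <= gamma * (n + 1) + 1e-9
    prob_top = in_top.mean(axis=0)
    # The region with the highest probability of being in the top group gets rank 1.
    est_rank = (prob_top[:, None] <= prob_top[None, :]).sum(axis=1)
    return prob_top, est_rank


# Three toy draws of the isolation measures for three regions.
d_draws = np.array([[3.0, 2.0, 1.0],
                    [3.0, 1.0, 2.0],
                    [1.0, 3.0, 2.0]])
prob_top, est_rank = ppr(d_draws, gamma=1 / 4)   # gamma * (n + 1) = 1: top rank only
print(prob_top)   # region 0 is top in 2 of 3 draws, region 1 in 1, region 2 in none
print(est_rank)   # [1 2 3]
```

Regions whose estimated probabilities are tied, for instance several regions that are
never in the top group, receive the same rank in this sketch, so the ordering below
the threshold carries little information.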
2.4 Comparison of Rank Estimators of Isolation
In an effort to understand how these ranking methods perform and how well they
capture isolated hotspots when these are only modestly elevated, we consider hotspots
from regions with low expected counts, where the elevation in risk ranges from
moderate to large. We consider a single isolated hotspot, a small cluster of contiguous
hotspots, and, for comparison, several non-contiguous isolated hotspots.
In the investigations, the background relative risks are spatially correlated while an
independent discrete random effect inflates the risks in the target regions. Specifically,
counts were generated from a multinomial distribution

$$\mathbf{y} \sim \text{Multinomial}\left( \sum_{i=1}^{n} E_i,\ \left\{ \frac{E_i \psi_i}{\sum_{j=1}^{n} E_j \psi_j} \right\}_{i=1}^{n} \right), \qquad \psi_i = \exp(\alpha + b_i + \log \delta_i), \tag{2.10}$$

where $\alpha$ is the overall mean rate over the map; $b_i$ denotes a spatially correlated
random effect; $\delta_i = 1$ if the region is not a hotspot, and a constant $t$ otherwise, $t$ being
the inflation factor. To accommodate sampling variability, each simulation scenario is
replicated 500 times. Two MCMC chains were run for a total of 20,000 iterations,
keeping every 10th draw, after a 10,000-iteration burn-in period. Brooks-Gelman-Rubin
diagnostics (Brooks and Gelman, 1998), as well as graphical checks of chains and
their autocorrelations, were performed to assess convergence. The distribution of the
spatially correlated random effects, the expected disease counts and the neighbourhood
structure mimic the fitted distribution from an initial analysis of the Scottish
lip cancer data (see Breslow and Clayton, 1993, for example). The data comprise
observed and expected counts of lip cancer cases during the period 1975-1980 over 56
Scottish counties. Table 2.1 summarizes observed and expected counts for these data.
The lip cancer data are known for exhibiting severe extra-Poisson variation (Clayton
and Kaldor, 1987). Breslow and Clayton (1993) and others have found that a
conditional Poisson model with spatially correlated CAR random effects provides a fair
fit to these data. We use the estimated model parameters from such an analysis to
define the background spatial pattern. Additionally, emerging isolated hotspots are
generated as
∙ Scenario I: A single region is considered as an emerging hotspot. Three candidates
are considered, with expected counts of 1.8, 6 and 14.6, corresponding
approximately to the 10th, 50th and 90th percentiles of the expected counts,
respectively. Note that we choose the hotspot with low expected count (10th
percentile of the expected counts) such that it is surrounded by neighbours
with fairly high expected counts, one of which has expected count 50.7. In
this case, the neighbours may have substantial smoothing effects on the target
region under the CAR model.
∙ Scenario II: A group of three contiguous regions is considered as an isolated cluster.
Two cases are considered: (i) areas with low expected counts of 3.3, 4.8 and
2.9; (ii) areas with high expected counts of 9.3, 14.6 and 88.7. When contiguous
regions are proposed as hotspots, $d_i$ in (2.1) is calculated by excluding target
hotspots from $N_i$. This mimics a hypothesis testing scenario where a specific
cluster is being tested. Note that the expected counts from the neighbours of
case (i) are fairly low, with mean expected counts of about 7.5.
∙ Scenario III: A group of three non-contiguous regions is considered as an isolated
group of regions of higher risk. Two cases are considered: (i) areas with low
expected counts of 2.5, 3.3 and 3.6; (ii) areas with high expected counts of 10.1,
50.7 and 8.2. The expected counts of the neighbours for two of the isolated
hotspots in case (i) are fairly low, while the third hotspot has a neighbour with
the highest expected count over the map.
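The data-generating step in (2.10) can be sketched as follows. This is a minimal
illustration, not the simulation code behind the results: the four expected counts are
taken from values quoted in the scenarios but are combined arbitrarily, the background
log risks stand in for the fitted values of the intercept plus the CAR effect, the
inflation factor is arbitrary, and the multinomial total is rounded to an integer,
which is an assumption made so that the sampler accepts it.

```python
import numpy as np


def simulate_counts(expected, log_background_risk, hotspot_idx, t, rng):
    """Draw one data set from (2.10) with risks inflated by t in the hotspot regions."""
    E = np.asarray(expected, dtype=float)
    delta = np.ones_like(E)
    delta[hotspot_idx] = t                        # inflation factor for hotspots only
    psi = np.exp(np.asarray(log_background_risk, dtype=float) + np.log(delta))
    p = E * psi / np.sum(E * psi)                 # multinomial cell probabilities
    total = int(round(E.sum()))                   # assumption: total rounded to an integer
    return rng.multinomial(total, p), psi


rng = np.random.default_rng(2011)
E = np.array([1.8, 6.0, 14.6, 50.7])              # illustrative expected counts
log_bg = np.array([0.1, -0.2, 0.0, 0.3])          # hypothetical background log risks
counts, psi = simulate_counts(E, log_bg, hotspot_idx=[0], t=3.0, rng=rng)
print(counts, counts.sum())
print(psi)
```

Because the total count is fixed, inflating one region's risk slightly lowers the
expected share of every other region, which is a feature of the multinomial design.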
Note that estimated risks for the isolated hotspots which have moderate or high
expected counts are less likely to be influenced by disease counts for their neighbours.
In the simulation studies, the risks of the elevated regions are inflated to be sharply
different from their neighbours and (i) not overly high (rank about 10th place),
(ii) moderately high (rank about 3rd place) and (iii) high (rank about 1st place). The
magnitudes of the inflations for Scenarios I, II and III are reflected in Figures 2.2,
2.3 and 2.4, respectively. The corresponding geographical locations for the isolated
hotspots are shown in Figure 2.5 (Scenario I) and Figure 2.6 (Scenarios II and III). We
also consider scaling the expected counts by a factor u = 1, 4 and 8 for all scenarios.
The threshold $\gamma$ in (2.9) corresponds to $1/(n + 1)$ for Scenario I and $3/(n + 1)$ for
Scenarios II and III. The simulated data are analyzed using model (2.1).

Table 2.1: Scottish lip cancer data: summary statistics.

                      Minimum  First quartile  Median  Third quartile  Maximum
Observed count (y)       0.00            4.75    8.00           11.00    39.00
Expected count (E)       1.10            4.05    6.30           10.12    88.70
SMR (y/E)                0               0.49    1.11            2.24     6.43
To assess the accuracy of the proposed ranking methods for identification of
isolated hotspots, we consider the root mean squared error of $\hat{R}_i$, for an isolated hotspot
at site $i$, given by

$$\text{RMSE}(\hat{R}_i) = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{R}_i^{(m)} - R_i \right)^2 \right\}^{1/2}, \tag{2.11}$$

where $R_i$ is the true rank (2.2) and $\hat{R}_i^{(m)}$ is the estimated value based on the $m$th
simulated dataset, $m = 1, \dots, M$. For cases where clusters are considered (Scenarios
II and III), we calculate the average RMSE for the hotspots.
We also evaluate the ranking methods based on the correct positive and false
positive rates

$$\text{CP} = P(\hat{R}_i < \tau \mid R_i < \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i < \tau\};$$
$$\text{FP} = P(\hat{R}_i < \tau \mid R_i > \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i > \tau\}, \tag{2.12}$$

where $\tau$ in (2.12) denotes the threshold defining high ranks, $\tau = 2$ for Scenario I and
4 for Scenarios II and III.

Figure 2.2: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top, middle and bottom rows correspond
to Scenario I with the target region having low, moderate and high expected incidence
count, respectively. The isolated hotspot, shown as black dots, are inflated to about
the 10th (column 1), 3rd (column 2) and 1st (column 3) places. The symbol +
identifies neighboring regions of the isolated hotspot.

Figure 2.3: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top and bottom rows correspond to an
isolated cluster of three contiguous hotspots with low and high expected incidence
count, respectively. The isolated hotspots, shown as black dots, are inflated to about
the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the
isolated hotspots.

Figure 2.4: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top and bottom rows correspond to an
isolated cluster of three non-contiguous hotspots with low and high expected incidence
count, respectively. The isolated hotspots, shown as black dots, are inflated to about
the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the
isolated hotspots.

Figure 2.5: The panels display $d_i$ for Scenario I. The single isolated hotspot with low,
moderate and high expected count is identified by the red circle in the 1st, 2nd and
3rd panels, respectively. (Panels: low E, moderate E, high E; map legend classes: under
−1.5, −1.5 to 0, 0 to 1.5, over 1.5.)

Figure 2.6: The top and bottom panels display $d_i$ for Scenarios II and III, respectively.
The cluster of three contiguous isolated hotspots with low and high expected counts
for simulation Scenario II are identified by the red circles in the left and right top
panels, respectively; the cluster of three non-contiguous hotspots with low and high
expected counts for simulation Scenario III are identified by the red circles in the left
and right bottom panels, respectively. (Panels: low E, high E; map legend classes: under
−1.5, −1.5 to 0, 0 to 1.5, over 1.5.)
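The three performance measures can be computed from a matrix of estimated ranks, with
one row per simulated data set and one column per region. The sketch below is a
hedged illustration: the toy ranks, the parameter name `tau` for the threshold in
(2.12) and the function name are hypothetical, CP and FP are averaged over all regions
in the relevant group as well as over replicates, and RMSE is averaged over the
hotspots as described for (2.11).

```python
import numpy as np


def rank_performance(est_ranks, true_ranks, tau):
    """Average RMSE over hotspots, plus CP and FP rates as in (2.11) and (2.12)."""
    est = np.asarray(est_ranks, dtype=float)     # shape (M replicates, n regions)
    true = np.asarray(true_ranks, dtype=float)
    hot = true < tau                             # regions that truly rank above the threshold
    cold = true > tau                            # strict inequality, as written in (2.12)
    rmse = np.sqrt(((est[:, hot] - true[hot]) ** 2).mean(axis=0)).mean()
    cp = (est[:, hot] < tau).mean()              # hotspots correctly placed in the top group
    fp = (est[:, cold] < tau).mean()             # other regions wrongly placed in the top group
    return rmse, cp, fp


# Toy example: three regions, two simulated data sets, threshold tau = 2.
true_ranks = np.array([1, 2, 3])
est_ranks = np.array([[1, 2, 3],
                      [3, 2, 1]])
rmse, cp, fp = rank_performance(est_ranks, true_ranks, tau=2)
print(rmse, cp, fp)
```

With the strict inequalities as written in (2.12), a region whose true rank equals the
threshold belongs to neither group; whether it should count toward FP is a design
choice that this sketch leaves as in the formula.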
Table 2.2 displays RMSE, CP and FP for all the ranking methods evaluated here
for Scenario I for the case where the hotspot is associated with a low expected count
surrounded by neighbours with high expected values. It is not surprising that SMR
performs better in this case, as the CAR model pools information from the neighbours
to produce an estimate for the target region; therefore, the risk estimate for this
isolated hotspot tends to be smoothed under the CAR model. In contrast, for the
case of an isolated hotspot with moderate or large expected count, as shown in Tables
2.3 and 2.4, PRANK outperforms SMR. The gains of using PRANK are substantial
when the expected incidence count for the emerging hotspot is large. For example,
in Table 2.4, when the isolated hotspot is in the 10th place, CP is about 71.2% while
FP is about 0.5% for PRANK, yielding a performance which is far superior to the
other ranking methods. In general, WRSEL(a), WRSEL(b) and PPR perform less
well. The WRSEL function tends to inflate the point estimates of the high risks;
because their weights are low, inaccuracies in point estimates of the other regions
with low isolation measures are relatively unimportant. WRSEL does not provide
precise estimates of all the risks and this may make it unsuitable for ranking purposes
(ranking requires good estimates over the whole map). Our empirical evaluation of
PPR over a sequence of values of the threshold $\gamma$ (not shown here) suggests that
the performance of this estimator in terms of RMSE, CP and FP is influenced by $\gamma$,
especially when the expected disease counts are low for the isolated hotspots. For
all the ranking methods, RMSE decreases, CP increases and FP decreases when the
emerging hotspots are gradually elevated above the whole surface, and when the
expected incidence counts for all the regions are inflated. These findings apply also
to the cases where the isolated hotspots are a cluster of three contiguous regions (see
Tables 2.5 and 2.6) and also where the isolated hotspots are three non-contiguous
regions (see Tables 2.7 and 2.8). It is also interesting to note that, in contrast to
Scenario I, for Scenario II, where three contiguous regions with low expected counts
are inflated as a cluster of hotspots, PRANK is superior to SMR, as the CAR model
has less of a smoothing effect on these isolated hotspots.
2.5 Summary
In this study, we focus on developing and evaluating rank estimators for disease map-
ping for the identification of emerging isolated hotspots. To determine the magnitude
of elevation of the hotspots relative to their neighbours, we developed an isolation
measure, the difference of risks or their rank estimators for the emerging high risk
regions and their neighbours. In summary, we note that though the CAR model
provides a smoothed risk surface, the estimates for PRANK or PM based on this
model perform reasonably well in detecting the emerging isolated hotspots. Simula-
tion studies show that gains of using PRANK may be substantial compared to other
ranking methods considered, especially when the disease is rare and the high risk area
is not yet a global outlier. The research has adopted the widely used CAR model.
Rank estimators based on other models may yield different results on identification of
isolated hotspots. The isolation measure developed here depends on the definition of
the neighborhood structure. The performance of the isolation measure may depend
on the distribution of the number of neighbours; hence the development of methods
which account for the number of neighbours may be useful.
In addition, in comparison to the classical scan statistic, we expect that the rank-
ing methods based on the spatial model may have lower false positive rates for identify-
ing isolated hotspots, since the classical scan statistic is very sensitive to the violation
of the assumption of spatial independence, detecting clusters at the 5% level much
more often than 5% of the time when spatially correlated data are simulated (Loh
and Zhu, 2007). It would be useful to compare the use of the scan statistic to our
ranking methods through simulation studies when no isolated hotspots exist.
Table 2.2: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with LOW expected disease counts, whose risk was inflated
to about the 10th, 3rd and 1st place; the expected disease counts for all the regions
are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         8.988  0.262  0.013      4.211  0.542  0.008      1.842  0.706  0.005
       PM         11.924  0.014  0.018      6.991  0.098  0.016      4.631  0.212  0.014
       WRSEL(a)   14.479  0.004  0.018     10.173  0.064  0.017      7.689  0.132  0.016
       WRSEL(b)   15.673  0.006  0.018     10.304  0.074  0.017      7.542  0.160  0.015
       PPR        21.577  0.008  0.018     13.456  0.088  0.017      9.316  0.162  0.015
       PRANK      11.643  0.024  0.018      6.555  0.150  0.015      4.103  0.278  0.013
u = 4  SMR         2.109  0.428  0.010      0.417  0.886  0.002      0.253  0.948  0.001
       PM          4.331  0.070  0.017      1.291  0.646  0.006      0.629  0.834  0.003
       WRSEL(a)    6.937  0.036  0.018      1.865  0.560  0.008      0.913  0.784  0.004
       WRSEL(b)    5.753  0.050  0.017      1.449  0.612  0.007      0.700  0.828  0.003
       PPR        10.509  0.054  0.017      1.785  0.626  0.007      0.739  0.832  0.003
       PRANK       3.935  0.096  0.016      1.154  0.676  0.006      0.576  0.862  0.003
u = 8  SMR         1.305  0.472  0.010      0.205  0.958  0.001      0.118  0.986  0.000
       PM          2.510  0.164  0.015      0.397  0.872  0.002      0.161  0.974  0.000
       WRSEL(a)    3.409  0.132  0.016      0.443  0.834  0.003      0.179  0.968  0.001
       WRSEL(b)    2.781  0.164  0.015      0.422  0.858  0.003      0.161  0.974  0.000
       PPR         5.720  0.160  0.015      0.422  0.864  0.002      0.155  0.976  0.000
       PRANK       2.373  0.188  0.015      0.374  0.890  0.002      0.141  0.980  0.000
Table 2.3: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with MODERATE expected disease counts, whose risk was
inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the
regions are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         3.159  0.208  0.014      1.152  0.580  0.008      0.605  0.802  0.004
       PM          3.151  0.200  0.015      1.275  0.542  0.008      0.560  0.832  0.003
       WRSEL(a)    9.498  0.028  0.018      4.629  0.206  0.014      1.997  0.602  0.007
       WRSEL(b)    6.426  0.082  0.017      2.577  0.420  0.011      0.980  0.750  0.005
       PPR         8.818  0.112  0.016      2.705  0.474  0.010      0.995  0.784  0.004
       PRANK       2.164  0.378  0.011      0.729  0.764  0.004      0.319  0.922  0.001
u = 4  SMR         0.931  0.604  0.007      0.341  0.884  0.002      0.110  0.988  0.000
       PM          0.963  0.588  0.007      0.241  0.942  0.001      0.089  0.992  0.000
       WRSEL(a)    1.957  0.326  0.012      0.392  0.864  0.002      0.110  0.988  0.000
       WRSEL(b)    1.138  0.512  0.009      0.300  0.922  0.001      0.089  0.992  0.000
       PPR         1.483  0.534  0.008      0.272  0.938  0.001      0.089  0.992  0.000
       PRANK       0.769  0.690  0.006      0.195  0.962  0.001      0.077  0.994  0.000
u = 8  SMR         0.597  0.710  0.005      0.200  0.960  0.001      0.000  1.000  0.000
       PM          0.642  0.706  0.005      0.179  0.968  0.001      0.000  1.000  0.000
       WRSEL(a)    0.872  0.552  0.008      0.205  0.958  0.001      0.000  1.000  0.000
       WRSEL(b)    0.672  0.678  0.006      0.179  0.968  0.001      0.000  1.000  0.000
       PPR         0.722  0.700  0.005      0.179  0.968  0.001      0.000  1.000  0.000
       PRANK       0.546  0.790  0.004      0.161  0.974  0.000      0.000  1.000  0.000
Table 2.4: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with HIGH expected disease counts, whose risk was inflated
to about the 10th, 3rd and 1st place; the expected disease counts for all the regions
are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         1.834  0.250  0.014      0.799  0.630  0.007      0.417  0.860  0.003
       PM          1.321  0.428  0.010      0.494  0.804  0.004      0.249  0.950  0.001
       WRSEL(a)    7.695  0.030  0.018      2.490  0.358  0.012      0.832  0.782  0.004
       WRSEL(b)    3.025  0.188  0.015      0.906  0.674  0.006      0.319  0.920  0.001
       PPR         5.272  0.250  0.014      0.926  0.744  0.005      0.382  0.936  0.001
       PRANK       0.696  0.712  0.005      0.257  0.940  0.001      0.118  0.986  0.000
u = 4  SMR         0.651  0.684  0.006      0.283  0.920  0.001      0.077  0.994  0.000
       PM          0.562  0.766  0.004      0.200  0.960  0.001      0.063  0.996  0.000
       WRSEL(a)    1.049  0.504  0.009      0.268  0.934  0.001      0.077  0.994  0.000
       WRSEL(b)    0.660  0.716  0.005      0.205  0.958  0.001      0.063  0.996  0.000
       PPR         0.720  0.734  0.005      0.195  0.962  0.001      0.063  0.996  0.000
       PRANK       0.415  0.870  0.002      0.167  0.972  0.001      0.045  0.998  0.000
u = 8  SMR         0.537  0.734  0.005      0.118  0.986  0.000      0.000  1.000  0.000
       PM          0.454  0.816  0.003      0.110  0.988  0.000      0.000  1.000  0.000
       WRSEL(a)    0.610  0.686  0.006      0.110  0.988  0.000      0.000  1.000  0.000
       WRSEL(b)    0.486  0.792  0.004      0.110  0.988  0.000      0.000  1.000  0.000
       PPR         0.475  0.808  0.003      0.100  0.990  0.000      0.000  1.000  0.000
       PRANK       0.369  0.880  0.002      0.089  0.992  0.000      0.000  1.000  0.000
Table 2.5: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated cluster of three contiguous hotspots with LOW expected disease counts,
risk of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         5.264  0.455  0.031      3.530  0.578  0.024      1.712  0.808  0.011
       PM          3.904  0.412  0.033      2.762  0.568  0.024      1.611  0.821  0.010
       WRSEL(a)    9.437  0.111  0.050      7.373  0.276  0.041      3.445  0.619  0.022
       WRSEL(b)    7.071  0.249  0.043      4.987  0.425  0.033      2.207  0.740  0.015
       PPR         6.695  0.351  0.042      4.601  0.517  0.032      2.278  0.800  0.016
       PRANK       3.268  0.549  0.026      2.214  0.689  0.018      1.385  0.879  0.007
u = 4  SMR         2.172  0.656  0.019      1.508  0.823  0.010      1.204  0.977  0.001
       PM          2.148  0.635  0.021      1.485  0.819  0.010      1.193  0.987  0.001
       WRSEL(a)    3.767  0.442  0.032      2.081  0.706  0.017      1.228  0.961  0.002
       WRSEL(b)    2.465  0.583  0.024      1.586  0.793  0.012      1.198  0.983  0.001
       PPR         2.523  0.623  0.027      1.584  0.813  0.015      1.180  0.990  0.003
       PRANK       2.011  0.683  0.018      1.422  0.845  0.009      1.185  0.991  0.001
u = 8  SMR         1.701  0.740  0.015      1.279  0.908  0.005      1.179  0.997  0.000
       PM          1.697  0.745  0.014      1.268  0.914  0.005      1.182  0.999  0.000
       WRSEL(a)    2.043  0.646  0.020      1.389  0.853  0.008      1.191  0.996  0.000
       WRSEL(b)    1.755  0.721  0.016      1.301  0.901  0.006      1.183  0.999  0.000
       PPR         1.781  0.741  0.019      1.287  0.915  0.010      1.179  0.999  0.001
       PRANK       1.655  0.764  0.013      1.253  0.922  0.004      1.181  0.999  0.000
Table 2.6: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated cluster of three contiguous hotspots with HIGH expected disease counts,
risk of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         4.105  0.347  0.037      2.511  0.599  0.023      1.767  0.778  0.013
       PM          3.514  0.509  0.028      2.121  0.727  0.015      1.504  0.865  0.008
       WRSEL(a)   14.481  0.014  0.056      9.552  0.119  0.050      4.741  0.432  0.032
       WRSEL(b)    7.166  0.209  0.045      4.073  0.522  0.027      2.161  0.777  0.013
       PPR         7.550  0.413  0.046      4.196  0.672  0.032      2.466  0.843  0.019
       PRANK       2.628  0.713  0.016      1.572  0.835  0.009      1.328  0.934  0.004
u = 4  SMR         2.269  0.646  0.020      1.445  0.854  0.008      1.204  0.972  0.002
       PM          2.184  0.677  0.018      1.378  0.883  0.007      1.184  0.983  0.001
       WRSEL(a)    5.851  0.294  0.040      2.518  0.673  0.019      1.262  0.929  0.004
       WRSEL(b)    2.722  0.601  0.023      1.504  0.849  0.009      1.190  0.978  0.001
       PPR         4.034  0.648  0.032      1.583  0.867  0.023      1.184  0.984  0.010
       PRANK       1.966  0.729  0.015      1.304  0.917  0.005      1.179  0.989  0.001
u = 8  SMR         1.965  0.705  0.017      1.291  0.909  0.005      1.163  0.994  0.000
       PM          1.967  0.718  0.016      1.268  0.923  0.004      1.160  0.997  0.000
       WRSEL(a)    3.107  0.551  0.025      1.437  0.838  0.009      1.167  0.989  0.001
       WRSEL(b)    2.091  0.691  0.018      1.290  0.911  0.005      1.161  0.995  0.000
       PPR         3.275  0.705  0.030      1.271  0.927  0.021      1.167  0.998  0.011
       PRANK       1.893  0.753  0.014      1.244  0.941  0.003      1.162  0.995  0.000
Table 2.7: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an
isolated cluster of three non-contiguous hotspots with LOW expected disease counts,
risk of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
u      Method       RMSE     CP     FP       RMSE     CP     FP       RMSE     CP     FP
u = 1  SMR         9.478  0.412  0.033      4.000  0.630  0.021      2.561  0.816  0.010
       PM          8.487  0.286  0.040      3.993  0.543  0.026      2.544  0.777  0.013
       WRSEL(a)    9.800  0.081  0.052      6.782  0.301  0.040      3.668  0.613  0.022
       WRSEL(b)    9.080  0.147  0.048      5.522  0.427  0.032      2.872  0.717  0.016
       PPR         8.361  0.227  0.047      5.112  0.497  0.032      2.600  0.758  0.017
       PRANK       8.642  0.391  0.034      3.926  0.631  0.021      2.482  0.823  0.010
u = 4  SMR         3.484  0.577  0.024      1.458  0.851  0.008      1.191  0.971  0.002
       PM          3.335  0.521  0.027      1.487  0.835  0.009      1.180  0.976  0.001
       WRSEL(a)    4.550  0.367  0.036      1.871  0.750  0.014      1.208  0.954  0.003
       WRSEL(b)    3.618  0.476  0.030      1.570  0.818  0.010      1.186  0.972  0.002
       PPR         4.030  0.505  0.032      1.536  0.833  0.013      1.171  0.979  0.003
       PRANK       3.337  0.569  0.024      1.437  0.856  0.008      1.167  0.983  0.001
u = 8  SMR         2.495  0.641  0.020      1.308  0.904  0.005      1.177  0.998  0.000
       PM          2.509  0.600  0.023      1.314  0.905  0.005      1.176  0.998  0.000
       WRSEL(a)    2.878  0.540  0.026      1.384  0.861  0.008      1.184  0.995  0.000
       WRSEL(b)    2.591  0.585  0.024      1.326  0.897  0.006      1.178  0.998  0.000
       PPR         3.277  0.603  0.025      1.302  0.915  0.009      1.174  0.999  0.001
       PRANK       2.513  0.612  0.022      1.303  0.911  0.005      1.178  0.998  0.000
Table 2.8: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with HIGH expected disease counts, risk of which were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                          10th                     3rd                      1st
        Method      RMSE    CP     FP       RMSE    CP     FP       RMSE    CP     FP
u = 1   SMR         3.787  0.379  0.035     2.308  0.593  0.023     1.552  0.809  0.011
        PM          3.438  0.465  0.030     2.056  0.699  0.017     1.396  0.882  0.007
        WRSEL(a)   13.639  0.015  0.056     8.508  0.124  0.050     4.016  0.481  0.029
        WRSEL(b)    7.210  0.183  0.046     3.736  0.494  0.029     1.794  0.801  0.011
        PPR         7.308  0.375  0.047     3.646  0.647  0.032     1.678  0.871  0.018
        PRANK       2.550  0.665  0.019     1.547  0.819  0.010     1.230  0.941  0.003
u = 4   SMR         2.032  0.649  0.020     1.341  0.885  0.007     1.170  0.973  0.002
        PM          1.991  0.668  0.019     1.297  0.918  0.005     1.154  0.985  0.001
        WRSEL(a)    4.936  0.218  0.044     1.827  0.717  0.016     1.194  0.954  0.003
        WRSEL(b)    2.349  0.569  0.024     1.356  0.885  0.007     1.159  0.981  0.001
        PPR         2.618  0.647  0.032     1.290  0.922  0.016     1.142  0.990  0.007
        PRANK       1.852  0.738  0.015     1.252  0.944  0.003     1.150  0.987  0.001
u = 8   SMR         1.785  0.749  0.014     1.246  0.935  0.004     1.152  0.996  0.000
        PM          1.770  0.768  0.013     1.215  0.955  0.003     1.145  0.997  0.000
        WRSEL(a)    2.456  0.539  0.026     1.321  0.872  0.007     1.154  0.994  0.000
        WRSEL(b)    1.865  0.718  0.016     1.224  0.946  0.003     1.148  0.996  0.000
        PPR         1.930  0.766  0.024     1.216  0.964  0.013     1.144  0.998  0.008
        PRANK       1.756  0.799  0.011     1.212  0.959  0.002     1.145  0.996  0.000
Chapter 3
Joint Analysis of Multivariate
Spatial Count and Zero-Heavy
Count Outcomes
3.1 Introduction
In public health, environmental and ecological studies, variables measured at the same
spatial locations may be correlated so that the spatial structures of such variables
across the region under consideration are very similar, indicating that they may be
characterized by a common spatial risk surface. Employing such a commonality in
risks may be useful for gaining precision of local area risk estimates, especially for
rare diseases.
Shared component spatial models have been studied in a variety of applied con-
texts. Knorr-Held and Best (2001) proposed a shared-component model which mim-
ics an ecological regression on the unobserved shared component. The two diseases
considered in that application share a common spatial structure and, as well, support
disease-specific spatially uncorrelated random errors. Fitting the model requires
strong prior assumptions on the spatial and uncorrelated random errors, typically
because of challenges related to identifiability of the latent spatial fields. Wang
and Wall (2003) proposed a common spatial factor model to study multivariate indi-
cators of cancer risk across counties in Minnesota. To avoid identifiability issues, the
model includes the common spatial structure term but no excess heterogeneity and,
as well, the variance of the shared spatially correlated random effect is considered
as fixed. Hogan and Tchernis (2004) proposed a common factor model for spatial
multivariate count data with constraints imposed on the variance structure of the
conditional autoregressive model they employ. Congdon (2006) set out a modeling
framework for modeling multiple health outcomes over area, age, and time dimensions
that takes account of spatial correlation as well as interactions between dimensions.
Tzala and Best (2006) proposed a Bayesian latent variable model for cancer mor-
tality data, which linked spatial effects. As well, other joint modeling approaches
for multivariate spatial data have been proposed including the multivariate version
of the conditional autoregressive model (MVCAR) (Gelfand and Vounatsou, 2003),
which assumes the spatial structure is the same across the multivariate outcomes.
Such modelling allows for the pooling of information across spatial units as well as
across multiple outcomes within units. In contrast, the common spatial factor model
may stratify the spatial variation into two components: the shared component and
outcome-specific components. Such a modeling approach permits a simple analysis
of which spatial term dominates as well as an identification of the common spatial
structure. Though testing for a common spatial structure is quite relevant in certain
studies, there has been very little discussion of the power of such tests. We consider
this in the context of the analysis of count data and also examine the utility of joint
modeling in terms of gains in the efficiency of estimating relative risks.
In environmental and ecological studies, count data are often characterized by an
excess of zeros and spatial dependence (Clarke and Green, 1988; Welsh et al., 1996;
Martin et al., 2005). When studying the abundance of species in ecological studies,
having a large proportion of zero counts may indicate the habitat is unsuitable in certain
areas, for example. In such cases, standard distributions such as Poisson, binomial
and negative-binomial may fail to provide an adequate fit. A class of distributions
for such data is defined as zero-inflated distributions (Lambert, 1992).
For handling zero-inflation, the use of mixture models and conditional models are
two common approaches within the context of ecological and health studies. The
well-known zero-inflated Poisson (ZIP) model (Lambert, 1992) is a mixture of a de-
generate zero mass and a Poisson distribution. On the other hand, Welsh et al. (1996)
formulate a two-component conditional model where the presence/absence of counts
is modeled with a binomial distribution and the abundance at active sites is mod-
eled using a truncated Poisson or truncated negative binomial distribution. These
two models have different interpretations. Structural zeros and random zeros are not
distinguished under the conditional specification, whereas the mixture model permits
an examination of the different sources of error (Kuhnert et al., 2005). For more
discussion of zero-inflated models from a Bayesian perspective see Angers and Biswas
(2003) and Ainsworth (2007).
In many applications, zero-inflated count data are spatially correlated. Rathbun
and Fei (2006) introduced a zero-inflated Poisson model, in which the component
modeling the excess zeros is governed by a hidden spatial probit model; a threshold,
defining large probabilities in the probit layer, governs the proportion of zeros. Agar-
wal et al. (2002) also proposed a zero inflated model for spatial count data using a
mixture model approach and incorporating spatial random errors into either or both
of the model components. With multivariate zero-inflated count data corresponding
to several related spatial outcomes, there is also the possibility of linking model com-
ponents across the various outcomes using a shared latent spatial structure. This
would be relevant, for example, if the underlying, hidden mechanisms resulting in the
structural zeros, or the abundance of counts, are related across the outcomes.
The methods developed in this chapter for joint outcome analysis of spatial count
and zero-heavy count data focus on the use of shared latent spatial frailty models.
We discuss such joint mapping models and evaluate what benefits may be achieved
through joint modeling. The rest of the chapter is structured as follows. Section
3.2 describes a general modeling framework for common spatial factor models for
count data and zero-inflated count data. Section 3.3 presents two motivating appli-
cations, applying the common spatial factor model to Ontario lung cancer data and
zero-inflated forestry infection data related to a study of Comandra blister rust on
lodgepole pine trees. Section 3.4 examines hypothesis testing of whether two spatial
maps share the same underlying spatial structure for count data. A power study is
performed based on the situational context of the Ontario lung cancer data. Section
3.5 compares joint and separate modeling in terms of accuracy and efficiency of es-
timating relative risks through simulation investigations. Some closing remarks are
provided in Section 3.6.
3.2 Models for Joint Count Outcomes
We present here a general modeling framework for the common spatial factor model
for joint modeling of count data and zero-inflated count data. In disease mapping,
the typical response is a rate (both in health and forest epidemiology), hence the
focus on the analysis of counts herein. However, generalization of the model to other
non-normal data is straightforward.
3.2.1 Common Spatial Factor Model for Counts
Let $y_{ij} \mid \mu_{ij} \sim \text{Poisson}(\mu_{ij})$ for region $i = 1, \ldots, n$ and outcome $j = 1, \ldots, J$, where
$y_{ij}$ denotes the response and $\mu_{ij}$ denotes the expected mean count for outcome $j$ in
region $i$. The common spatial factor model can be written as:
$$\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + \gamma_j b_i + h_{ij}, \tag{3.1}$$
where $\alpha_j$ denotes the overall mean rate for the $j$th outcome and $E_{ij}$ is the expected
number of disease counts in region $i$ for the $j$th outcome based on some standardized
rates; $b_i$, $i = 1, \ldots, n$, is the spatial random effect assumed here to follow a
conditional autoregressive distribution (Besag, 1974) to account for the spatially
structured correlation in the outcomes; $\gamma_j$ is the factor loading for the shared spatial
component on outcome $j$, with $\gamma_1 = 1$, and $h_{ij}$ represents uncorrelated error terms.
The spatial component $\boldsymbol{b} = (b_1, \ldots, b_n)^T \sim \text{MVN}(\boldsymbol{0}, \Sigma_b)$, $\Sigma_b$ having generalized
inverse $\Sigma_b^{-1} = \sigma_b^{-2}(D - W)$; $W = (W_{rs})$ is often called the neighborhood matrix
and $D = \text{diag}(W_{1\cdot}, W_{2\cdot}, \ldots, W_{n\cdot})$ with $W_{r\cdot} = \sum_{s=1}^{n} W_{rs}$. The neighborhood matrix
defines the spatial structure across the region; a simple adjacency model sets $W_{rs}$
to 1 if regions $r$ and $s$ share a boundary, and 0 otherwise. The spatially uncorrelated
random effect $\boldsymbol{h}_j \sim N(\boldsymbol{0}, \sigma_{hj}^2 I)$, where $\boldsymbol{h}_j = (h_{1j}, \ldots, h_{nj})^T$, $j = 1, \ldots, J$, and
$I$ is the identity matrix. The relative risk for outcome $j$ within region $i$ is then
$\theta_{ij} = \exp(\alpha_j + \gamma_j b_i + h_{ij})$.
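As a minimal sketch of the quantities just defined, the following builds the CAR precision matrix $\sigma_b^{-2}(D - W)$ from a binary adjacency matrix and evaluates the relative risks $\exp(\alpha_j + \gamma_j b_i + h_{ij})$. The function names and the four-region map are hypothetical and used only for illustration; they are not taken from the thesis.

```python
import numpy as np

def car_precision(W, sigma_b):
    """Precision matrix (D - W) / sigma_b^2 of the intrinsic CAR prior."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))          # D = diag(W_1., ..., W_n.)
    return (D - W) / sigma_b**2

def relative_risk(alpha, gamma, b, h):
    """theta[i, j] = exp(alpha[j] + gamma[j] * b[i] + h[i, j])."""
    alpha = np.asarray(alpha, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.exp(alpha[None, :] + np.outer(b, gamma) + np.asarray(h, dtype=float))

# Hypothetical map: four regions in a row, 1-2-3-4 (assumption for illustration).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = car_precision(W, sigma_b=0.5)

# Two outcomes with gamma_1 fixed at 1, as in the model.
theta = relative_risk(alpha=[0.0, 0.1], gamma=[1.0, 1.2],
                      b=[0.3, 0.1, -0.1, -0.3], h=np.zeros((4, 2)))
```

Each row of `Q` sums to zero, which is why only a generalized inverse of the precision matrix exists.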
The common spatial factor model may be implemented in a Bayesian framework
using Markov chain Monte Carlo (MCMC) procedures. The joint posterior distribu-
tion is expressed as
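$$p(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma}, \sigma_b^2, \sigma_{h1}^2, \ldots, \sigma_{hJ}^2 \mid Y) \propto L(Y \mid \boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma})\, p(\boldsymbol{b} \mid \sigma_b^2)\, p(\boldsymbol{h} \mid \sigma_h^2)\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\gamma})\, p(\sigma_b^2)\, p(\sigma_{h1}^2) \cdots p(\sigma_{hJ}^2), \tag{3.2}$$
with $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_J)^T$, $\boldsymbol{b} = (b_1, \ldots, b_n)^T$, $\boldsymbol{h} = (\boldsymbol{h}_1, \ldots, \boldsymbol{h}_J)^T$ where $\boldsymbol{h}_j = (h_{1j}, \ldots, h_{nj})^T$
and $\boldsymbol{\gamma} = (\gamma_2, \ldots, \gamma_J)^T$. The first term on the right hand side of (3.2) is the conditional
likelihood,
$$L(Y \mid \boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma}) \propto \exp\left[-\sum_{i=1}^{n} \sum_{j=1}^{J} E_{ij} \exp(\alpha_j + \gamma_j b_i + h_{ij})\right] \times \prod_{i=1}^{n} \prod_{j=1}^{J} \left[E_{ij} \exp(\alpha_j + \gamma_j b_i + h_{ij})\right]^{y_{ij}}. \tag{3.3}$$
The second and third terms on the right hand side of (3.2) are the distributions
of $\boldsymbol{b}$ and $\boldsymbol{h}$ respectively and the remaining terms are the prior distributions on
$(\boldsymbol{\alpha}, \boldsymbol{\gamma}, \sigma_b^2, \sigma_{h1}^2, \ldots, \sigma_{hJ}^2)$. Flat priors are assigned to $\boldsymbol{\alpha}$, while $\boldsymbol{\gamma} = (\gamma_2, \ldots, \gamma_J)^T$ are
assigned $N(0, \sigma_\gamma^2)$ priors with $\sigma_\gamma^2 = 1000$; the priors assigned to the standard
deviations of the spatially structured and unstructured random effects $(\sigma_b, \sigma_{h1}, \ldots, \sigma_{hJ})$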
are uniformly distributed between 0 and 100, due to the robust properties of this
prior (Gelman, 2006). The Gibbs sampler (Gelfand and Smith, 1990; Carlin and
Louis, 2000) is a natural choice for updating the parameters in this setting, since
it takes advantage of the conditional specification of the joint distribution. Each of
the full conditional distributions required by the Gibbs sampler can be successively
updated.
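For illustration, the logarithm of the conditional likelihood kernel in (3.3) can be evaluated as below. This is only a sketch of the quantity an MCMC update would call repeatedly; it is not the sampler used in the thesis, and the function name and array layout (regions in rows, outcomes in columns) are assumptions.

```python
import numpy as np

def poisson_loglik_kernel(y, E, alpha, gamma, b, h):
    """Log of the kernel in (3.3): sum_ij [ y_ij * log(mu_ij) - mu_ij ],
    where mu_ij = E_ij * exp(alpha_j + gamma_j * b_i + h_ij).
    The y_ij! term is dropped because it does not involve the parameters."""
    y = np.asarray(y, dtype=float)          # shape (n, J)
    E = np.asarray(E, dtype=float)          # shape (n, J)
    alpha = np.asarray(alpha, dtype=float)  # shape (J,)
    gamma = np.asarray(gamma, dtype=float)  # shape (J,), gamma[0] = 1
    b = np.asarray(b, dtype=float)          # shape (n,)
    eta = alpha[None, :] + np.outer(b, gamma) + np.asarray(h, dtype=float)
    mu = E * np.exp(eta)
    return float(np.sum(y * np.log(mu) - mu))

# Toy check with n = 3 regions and J = 2 outcomes (hypothetical values).
value = poisson_loglik_kernel(y=np.ones((3, 2)), E=np.ones((3, 2)),
                              alpha=[0.0, 0.0], gamma=[1.0, 1.0],
                              b=np.zeros(3), h=np.zeros((3, 2)))
```

With all effects at zero and unit expected counts every mean equals one, so the kernel reduces to minus the number of cells.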
3.2.2 Common Spatial Factor Model for Zero Heavy Counts
In many applications, a large number of zero counts may be observed for spatial
outcomes. We refer to situations where the number of zeros is unacceptably large
given the spatial structure of means over the map, posing challenges for analysis
in terms of goodness of fit when using a simple spatial count model based on a
conditional Poisson distribution. Here, we extend the common spatial factor model
to handle such zero-inflated count data. Conditional on spatial random effects bi and
$d_i$ as described below, suppose that the response variable $Y_{ij}$ is distributed as
$$Y_{ij} \mid z_{ij} \sim \begin{cases} 0 & \text{if } z_{ij} = 1, \\ \text{Poisson}(\mu_{ij}) & \text{if } z_{ij} = 0, \end{cases} \tag{3.4}$$
where $i$ indexes region, $i = 1, \ldots, n$, and $j$ indexes the multivariate outcomes, for
example, diseases, $j = 1, \ldots, J$. Here, $Y_{ij}$ denotes the observed $j$th outcome at the $i$th
spatial location. The variable $z_{ij}$ is a latent Bernoulli indicator, a trigger for the excess
zeros, with mean function $\pi_{ij}$, while $\text{Poisson}(\mu_{ij})$ denotes a conditionally independent
Poisson random variable with mean $\mu_{ij}$, conditional on $b_i$ and $d_i$. Before specifying $b_i$
and $d_i$, we note immediately that the corresponding probability distribution functions
are
$$P(Y_{ij} = y_{ij}) = \begin{cases} \pi_{ij} + (1 - \pi_{ij}) e^{-\mu_{ij}} & \text{if } y_{ij} = 0, \\ (1 - \pi_{ij}) \dfrac{e^{-\mu_{ij}} \mu_{ij}^{y_{ij}}}{y_{ij}!} & \text{if } y_{ij} > 0. \end{cases} \tag{3.5}$$
The parameters $\mu_{ij}$ and $\pi_{ij}$ depend on random effects $b_i$, $d_i$, so that
$$\log(\mu_{ij}) = \alpha_j + \gamma_j b_i, \qquad \text{logit}(\pi_{ij}) = \beta_j + \omega_j d_i, \tag{3.6}$$
where $\alpha_j$ and $\beta_j$ denote the overall mean rates for the Poisson and excess zero
probability components for the $j$th outcome; $\boldsymbol{b} = (b_1, \ldots, b_n)^T \sim \text{MVN}(\boldsymbol{0}, \sigma_b^2 (D - W)^{-1})$
is the spatially correlated random effect for the Poisson count component and
$\boldsymbol{d} = (d_1, \ldots, d_n)^T \sim \text{MVN}(\boldsymbol{0}, \sigma_d^2 (D - W)^{-1})$ is the spatially correlated random effect for
the excess zero probability component. The mean and the variance of the conditional
distributions are $E(Y_{ij}) = (1 - \pi_{ij})\mu_{ij}$ and $\text{var}(Y_{ij}) = (1 - \pi_{ij})\mu_{ij}(1 + \pi_{ij}\mu_{ij})$. The
factor loading parameter $\gamma_j$ reflects the influence of the latent common spatial factor
$\boldsymbol{b}$ on the Poisson component of the $j$th outcome and correspondingly $\omega_j$ measures the
impact of the shared component $\boldsymbol{d}$ on the component generating excess zeros for the
$j$th outcome; we set $\gamma_1 = \omega_1 \equiv 1$. Similar shared random effect structures could be
derived for other link functions beyond the log and logit. The details of the likelihood
and prior below reflect an assumption that b and d are independent; however more
complicated forms may be considered.
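The zero-inflated Poisson probabilities in (3.5) and the stated mean and variance can be checked numerically. The sketch below uses illustrative values for the excess-zero probability and Poisson mean (0.3 and 2.0, both assumptions) and sums over a truncated support; the function names are hypothetical.

```python
import math

def zip_pmf(y, pi, mu):
    """P(Y = y) for a zero-inflated Poisson with excess-zero probability pi
    and Poisson mean mu, following (3.5)."""
    if y == 0:
        return pi + (1.0 - pi) * math.exp(-mu)
    return (1.0 - pi) * math.exp(-mu) * mu**y / math.factorial(y)

def zip_mean_var(pi, mu):
    """Closed-form mean (1 - pi) * mu and variance (1 - pi) * mu * (1 + pi * mu)."""
    mean = (1.0 - pi) * mu
    var = (1.0 - pi) * mu * (1.0 + pi * mu)
    return mean, var

pi, mu = 0.3, 2.0                      # illustrative values only
support = range(0, 60)                 # tail mass beyond 59 is negligible here
total = sum(zip_pmf(y, pi, mu) for y in support)
mean_num = sum(y * zip_pmf(y, pi, mu) for y in support)
var_num = sum(y**2 * zip_pmf(y, pi, mu) for y in support) - mean_num**2
mean_cf, var_cf = zip_mean_var(pi, mu)
```

The variance exceeds the mean whenever the excess-zero probability is positive, which is the overdispersion that a plain Poisson model cannot reproduce.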
The zero inflated common spatial factor model (3.6) may be implemented in a
Bayesian framework using MCMC methods. The joint posterior distribution of the
parameters is:
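$$p(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega}, \sigma_b^2, \sigma_d^2 \mid Y) \propto L(Y \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega})\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\beta})\, p(\boldsymbol{\gamma})\, p(\boldsymbol{\omega})\, p(\boldsymbol{b} \mid \sigma_b^2)\, p(\boldsymbol{d} \mid \sigma_d^2)\, p(\sigma_b^2)\, p(\sigma_d^2), \tag{3.7}$$
where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_J)^T$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_J)^T$, $\boldsymbol{\gamma} = (\gamma_2, \ldots, \gamma_J)^T$, $\boldsymbol{\omega} = (\omega_2, \ldots, \omega_J)^T$.
The first term on the right hand side of (3.7) is the likelihood
$$L(Y \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega}) \propto \prod_{i=1}^{n} \prod_{j=1}^{J} \left[ I(y_{ij} = 0) \left\{ \pi_{ij} + (1 - \pi_{ij}) e^{-\mu_{ij}} \right\} + I(y_{ij} > 0) \left\{ (1 - \pi_{ij}) \frac{e^{-\mu_{ij}} \mu_{ij}^{y_{ij}}}{y_{ij}!} \right\} \right], \tag{3.8}$$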
where I(A) is the indicator function. The forms of the conditional distributions of the
model parameters are proportional to the joint posterior density as in (3.7); these can
be obtained by retaining quantities involving relevant parameters from the joint pos-
terior. Samples from the posterior distribution in (3.7) so obtained by MCMC allow
computation of summary measures and credible intervals of any arbitrary functional
of the parameters. We assign normal priors on $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ and $\boldsymbol{\omega}$ with a moderately
large variance to avoid computational instability. The parameters $(\sigma_b, \sigma_d)$ can be
assigned uniform distributions with again only moderately large variance to avoid the
situation of exceedingly large random effects arising in a sample. Diffuse priors may
be employed to determine sensitivity to these choices.
3.2.3 Model Assessment and Comparison
To compare various models we employ the deviance information criterion (DIC), as
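$\text{DIC} = \bar{D}(\boldsymbol{\theta}) + p_D$, where $\bar{D}(\boldsymbol{\theta})$ is the posterior mean of the deviance
$D(\boldsymbol{\theta}) = -2 \log L(Y \mid \boldsymbol{\theta})$, and $\boldsymbol{\theta}$ denotes the collection of parameters in the model (Spiegelhalter
et al., 2002). The penalty term $p_D$ is the effective number of model parameters,
defined by $p_D = \bar{D}(\boldsymbol{\theta}) - D(\bar{\boldsymbol{\theta}})$, where $\bar{\boldsymbol{\theta}} = E[\boldsymbol{\theta} \mid Y]$ is the posterior mean of $\boldsymbol{\theta}$. Models
with lower DIC scores are preferred as they achieve a better combination of fit
and parsimony.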
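A minimal sketch of the DIC calculation from posterior draws follows. It assumes a plain Poisson likelihood and, as a simplification, evaluates the plug-in deviance at the posterior mean of the Poisson means rather than at the posterior mean of the underlying parameters; the simulated draws are a stand-in for real MCMC output, and all names are hypothetical.

```python
import numpy as np
from math import lgamma

def poisson_deviance(y, mu):
    """D = -2 log L(y | mu) for independent Poisson counts."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_fact = np.array([lgamma(v + 1.0) for v in y])   # log(y!)
    return float(-2.0 * np.sum(y * np.log(mu) - mu - log_fact))

def dic(y, mu_draws):
    """Return (DIC, pD) from draws of the Poisson means, shape (S, n)."""
    mu_draws = np.asarray(mu_draws, dtype=float)
    d_bar = np.mean([poisson_deviance(y, m) for m in mu_draws])  # posterior mean deviance
    d_hat = poisson_deviance(y, mu_draws.mean(axis=0))           # deviance at plug-in means
    p_d = d_bar - d_hat
    return float(d_bar + p_d), float(p_d)

rng = np.random.default_rng(0)
y_obs = np.array([3, 0, 5, 2])
mu_draws = rng.gamma(shape=5.0, scale=0.5, size=(200, 4))  # stand-in for MCMC output
dic_value, p_d = dic(y_obs, mu_draws)
```

Because the Poisson deviance is convex in the mean, `p_d` is non-negative under this plug-in choice.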
A common and effective tool used as a diagnostic method in Bayesian analyses is
the comparison of the posterior predictive distribution of replicated data under the
model with the observed data (Rubin, 1984; Gelman et al., 1996, 2004). If the model
accurately represents the process that generated the data, then replicated data gen-
erated under the model should look similar to the observed data. Posterior predictive
comparisons are usually implemented by drawing simulated values from the posterior
predictive distribution and comparing these samples to the observed data using test
quantities that characterize important features of the data. Let $L(y \mid \boldsymbol{\theta})$ denote the
likelihood for the model, where $y$ denotes the data and $\boldsymbol{\theta}$ denotes all the parameters
in the model, including the hyperparameters. The posterior distribution of $\boldsymbol{\theta}$ is
$f(\boldsymbol{\theta} \mid y) \propto L(y \mid \boldsymbol{\theta}) f(\boldsymbol{\theta})$, where $f(\boldsymbol{\theta})$ denotes the prior distribution of the parameters.
The posterior predictive distribution of replicated data $y^{rep}$ is then defined as:
$$f(y^{rep} \mid y) = \int L(y^{rep} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta} \mid y)\, d\boldsymbol{\theta}, \tag{3.9}$$
which is the likelihood of the future data averaged over the posterior distribution
$f(\boldsymbol{\theta} \mid y)$. This distribution is termed the posterior predictive distribution.
The predictive data $y^{rep}$ reflect the observations expected under the model given the
observed data $y$. If the model is adequate, the values of $y$ and $y^{rep}$ should be close.
An evaluation of closeness can be carried out using some summary function of distance
$D(y, \boldsymbol{\theta})$, with assessment of the overall goodness of fit using the posterior predictive
p-value (Meng, 1994) given by
$$P\left(D(y^{rep}, \boldsymbol{\theta}) > D(y, \boldsymbol{\theta}) \mid y\right).$$
In our application, we considered
$$D(y, \boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{j=1}^{J} \frac{(y_{ij} - \mu_{ij})^2}{\mu_{ij}}. \tag{3.10}$$
We comment that Meng (1994) and Carlin and Louis (2000) argue that posterior
predictive checks provide a measure of discrepancy between the model and the data,
and are not directly informative for model comparison and inference. A posterior
p-value around 0.5 indicates that the distributions of the replicated and actual data
are close, while a value close to zero or one indicates strong differences between them
(Gelman et al., 1996).
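The posterior predictive p-value with the chi-square discrepancy (3.10) can be approximated from MCMC output roughly as follows. This sketch assumes Poisson replicates, one replicate per posterior draw, and a flattened vector of region-outcome cells; the function names and the simulated draws are hypothetical.

```python
import numpy as np

def chisq_discrepancy(y, mu):
    """D(y, theta) = sum (y - mu)^2 / mu, summed over the last axis."""
    return np.sum((y - mu) ** 2 / mu, axis=-1)

def posterior_predictive_pvalue(y, mu_draws, rng):
    """Estimate P(D(y_rep, theta) > D(y, theta) | y).
    mu_draws has shape (S, n): one row of Poisson means per posterior draw."""
    mu_draws = np.asarray(mu_draws, dtype=float)
    y_rep = rng.poisson(mu_draws)                 # one replicated data set per draw
    d_rep = chisq_discrepancy(y_rep, mu_draws)
    d_obs = chisq_discrepancy(np.asarray(y, dtype=float), mu_draws)
    return float(np.mean(d_rep > d_obs))

rng = np.random.default_rng(1)
mu_draws = np.full((500, 6), 5.0)                 # stand-in for posterior draws
y_consistent = rng.poisson(5.0, size=6)           # data generated under the model
p_value = posterior_predictive_pvalue(y_consistent, mu_draws, rng)
```

Data that sit far from every posterior draw give a p-value near zero, matching the interpretation given in the text.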
3.3 Applications
In this section, we will present two applications which utilize common spatial factor
models: an analysis of Ontario lung cancer for males and females, and an analysis of
zero-inflated forestry data which relate to Comandra blister rust infection of lodgepole
pine trees.
3.3.1 Ontario Lung Cancer
Lung cancer incidence and expected counts in 37 public health units (PHU) over the
period 1995-2002 in Ontario are considered here in a joint analysis of counts for males
(yim observed; Eim expected) and for females (yif observed; Eif expected). Figure 3.1
displays the standardized mortality rates (SMR) for Ontario lung cancer for males and
Figure 3.1: SMR of Ontario lung cancer for males and females.
females, respectively, with some similarity of spatial structure observed over gender.
The common spatial factor model for joint modeling of lung cancer mortality counts
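for males and females is expressed as:
$$\begin{cases} \log(\mu_{im}) = \alpha_m + \log(E_{im}) + b_i + h_{im} \\ \log(\mu_{if}) = \alpha_f + \log(E_{if}) + \gamma \cdot b_i + h_{if} \end{cases}, \tag{3.11}$$
where the common spatial structure $\boldsymbol{b} = (b_1, \ldots, b_n)^T \sim \text{MVN}(\boldsymbol{0}, \sigma_b^2 (D - W)^{-1})$,
$\boldsymbol{h}_m = (h_{1m}, \ldots, h_{nm})^T \sim \text{MVN}(\boldsymbol{0}, \sigma_{hm}^2 I)$, $\boldsymbol{h}_f = (h_{1f}, \ldots, h_{nf})^T \sim \text{MVN}(\boldsymbol{0}, \sigma_{hf}^2 I)$
and $\gamma$ is the factor loading for the shared spatial random effect. Two chains with
dispersed starting values for all parameters were run. Each chain was run for an initial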
10,000 burn-in iterations followed by an additional 10,000 iterations thinned at 10,
resulting in a total of 2000 iterations to be used for posterior inference. Sensitivity
with respect to prior distributions was assessed by comparisons from repeating the
analysis with other weakly informative prior specifications. This comparison indicated
results to be fairly robust over the forms of prior considered. Table 3.1 presents the
posterior summary statistics for the parameter estimates obtained from employing
Table 3.1: Posterior summaries for the parameters of M1 in the analysis of Ontario lung cancer incidence.

Parameter          Mean      95% CI
$\alpha_m$         0.066     (0.039, 0.090)
$\alpha_f$         0.086     (0.055, 0.115)
$\gamma$           1.145     (0.866, 1.466)
$\sigma_b^2$       0.059     (0.032, 0.116)
$\sigma_{hm}^2$    0.0038    (0.0018, 0.0094)
$\sigma_{hf}^2$    0.0050    (0.0023, 0.0129)
M1. That much of the variability is accounted for by the common factor component
is reflected in the posterior mean (95% credible interval) of the variance parameter
of the shared random effect, 0.059(0.032, 0.116), when contrasted with the random
noise terms estimated as 0.0038(0.0018, 0.0094) for males and 0.0050(0.0023, 0.0129)
for females. The spatial variation dominates the spatially uncorrelated errors. As
well, it is important to note that $\sigma_b^2$ does not reflect the full variance term as it
is simply the scale factor operating on the variance-covariance matrix of the spatial
effects. The factor loading parameter $\gamma$ is estimated as 1.145 (0.866, 1.466), indicating
that males and females share a common and possibly identical spatial structure, with
$\gamma$ substantially above zero and with $\gamma = 1$ not rejected.
The posterior median estimates along with the 95% credible intervals of the com-
mon latent spatially structured effect and the male- and female-specific unstructured
components are displayed in Figure 3.2. The common spatial component resembles
the SMR maps for male and female lung cancer data (Figure 3.1) with an urban
versus rural effect apparent, while the residual terms appear to be more or less flat,
confirming the dominance of a strong underlying spatial structure shared between
males and females.
Figure 3.2: The panels in the first row display the spatial maps for the posterior mean estimates for $h_m$, $b$ and $h_f$; the panels in the second row display the 95% credible intervals for $h_m$, $b$ and $h_f$ in the analysis of the Ontario lung cancer data.
To examine whether any residual spatial structure has been left for each of the sub-
models, we used Moran’s I (Cliff and Ord, 1981; Cressie, 1993) to compute residual
spatial autocorrelation. Let $\hat{y}_{im}$ and $\hat{y}_{if}$ denote the predicted disease counts for males
and females. Moran's I statistic is defined as
$$I_m = \frac{e_m' W_m e_m}{e_m' e_m}; \qquad I_f = \frac{e_f' W_f e_f}{e_f' e_f},$$
where $e_m' = (e_{1m}, \ldots, e_{nm})$ with $e_{im} = (y_{im} - \hat{y}_{im})/\sqrt{\text{var}(y_{im})}$; $e_f' = (e_{1f}, \ldots, e_{nf})$
with $e_{if} = (y_{if} - \hat{y}_{if})/\sqrt{\text{var}(y_{if})}$ and $W_f = W_m$ being the $n \times n$ adjacency matrix for
the regions with elements $w_{ij}$. The posterior mean (95% credible interval) of $I_m$ and
$I_f$ are -0.053 (-0.305, 0.156) and -0.057 (-0.289, 0.153) respectively, suggesting that the
residuals have no significant spatial correlation.
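The residual Moran's I above can be computed directly from its definition. The sketch below follows the formula as written in the text; the optional row standardization of the adjacency matrix is an assumption added for illustration (it keeps the statistic on a scale comparable to a correlation) and is not stated in the thesis. The four-region map is hypothetical.

```python
import numpy as np

def morans_i(y, y_hat, var_y, W, row_standardize=False):
    """I = e' W e / e' e with standardized residuals e = (y - y_hat) / sqrt(var_y)."""
    e = (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) / np.sqrt(var_y)
    W = np.asarray(W, dtype=float)
    if row_standardize:
        W = W / W.sum(axis=1, keepdims=True)   # each row of weights sums to one
    return float(e @ W @ e / (e @ e))

# Hypothetical map: four regions in a row, with residuals alternating in sign.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
resid = np.array([1.0, -1.0, 1.0, -1.0])
i_raw = morans_i(resid, np.zeros(4), np.ones(4), W)
i_std = morans_i(resid, np.zeros(4), np.ones(4), W, row_standardize=True)
```

In a Bayesian analysis the statistic would be evaluated at every MCMC iteration, giving the posterior mean and credible interval reported in the text.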
In addition, we also considered several competing models including the separate
counterpart models corresponding to each gender and models with simplified random
effect structures. Table 3.2 lists pD and DIC values for the competing models con-
sidered. Overall, model M1 which includes the shared common spatial random error
as well as the random noise terms for each sub-model is considered the best
according to various measures of fit, though the difference between M1 and either of
M2 or M3 is not substantial. The separate counterpart models yield poorer fits com-
pared to the joint models. Examination of the posterior predictive p-value indicates
a substantially poorer fit for M4.
In the spatial analyses above, we decompose the variability associated with multi-
ple outcome domains into a variety of sources. In the separate models, each outcome
domain is decomposed into spatially structured and unstructured variability. With
the spatial common factor model, the variability of all outcome domains is decom-
posed into one common spatially structured error term and, additionally, domain-
specific unstructured variability. It is therefore of interest to evaluate what proportion
of total variability in each outcome is explained by each of the terms in the different
Table 3.2: pD and DIC for competing models in the analysis of Ontario lung cancer incidence.

Type       Model   Specification                                                              pD      DIC
Shared     M1      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i + h_{im}$                   56.5    725.9
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i + h_{if}$
           M2      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i$                            56.5    727.5
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i + h_{if}$
           M3      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i + h_{im}$                   56.6    728.3
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i$
           M4      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i$                            38.8    764.1
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i$
Separate   M5      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im} + h_{im}$                68.3    738.2
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if} + h_{if}$
           M6      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im}$                         68.8    740.1
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if} + h_{if}$
           M7      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im} + h_{im}$                69.4    741.6
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if}$
           M8      $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im}$                         70.1    743.9
                   $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if}$
models. For the separate analyses of the male (j = 1) and female (j = 2) outcomes,
at each iteration of the MCMC simulation we calculate the empirical variances $s_{bj}^2$
and $s_{hj}^2$ of the spatially structured and unstructured random effects, respectively.
The proportion of variability explained by the structured random effects for the $j$th
outcome domain is calculated as $s_{bj}^2/(s_{bj}^2 + s_{hj}^2)$. For the common spatial factor model,
the empirical variances for the common and domain specific components, $s_b^2$ and $s_{hj}^2$,
respectively, are also calculated for each run of the MCMC samples. The fraction of
the variability explained by the common factor for the $j$th outcome domain is calculated
as the ratio $s_b^2/(s_b^2 + s_{hj}^2)$. The results show that for the separate analysis, the spatially
structured component accounts for 53% of the variability for males and 42% of the
variability for females. In contrast, for the common spatial factor model, the shared
component accounts for 94% of the variability for males and 88% of the variability
for females; most of the variability was absorbed by the shared spatially structured
factor, with little unstructured residual variation left.
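The variance decomposition just described can be sketched as below. The arrays hold one row of random effects per MCMC iteration; summarizing the per-iteration fractions by their posterior mean is an assumption, since the text does not say which summary was reported, and the simulated draws and function name are hypothetical.

```python
import numpy as np

def shared_variance_fraction(b_draws, h_draws):
    """Posterior mean of s_b^2 / (s_b^2 + s_h^2), where the empirical variances
    are taken across regions within each MCMC iteration.
    Both inputs have shape (S, n): S iterations, n regions."""
    s2_b = np.var(b_draws, axis=1, ddof=1)   # structured component, per iteration
    s2_h = np.var(h_draws, axis=1, ddof=1)   # unstructured component, per iteration
    return float(np.mean(s2_b / (s2_b + s2_h)))

rng = np.random.default_rng(2)
b_draws = rng.normal(scale=0.25, size=(50, 37))   # stand-in draws for 37 regions
h_draws = rng.normal(scale=0.06, size=(50, 37))
fraction = shared_variance_fraction(b_draws, h_draws)
```

A fraction close to one corresponds to the situation reported for the common spatial factor model, where the shared component absorbs most of the variability.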
3.3.2 Comandra Blister Rust Tree Infection
In a study of Comandra blister rust infection of lodgepole pine trees in British
Columbia, Canada, we model the joint outcomes of the counts of lesions on trees
arising from infection and the counts of an alternate host plant (bastard toad flax)
surrounding the trees, which promotes infection. The study is part of a larger investi-
gation conducted by the British Columbia Ministry of Forests and Range of infection
on lodgepole pine trees. The lodgepole pine trees are located in each cell of a 124×64
grid, with each grid cell being 1.5m square. There are buffer trees on the plantation
and responses are collected for a random sample of 500 trees on yiL, the number of
lesions on the tree, marking infection areas, and yiH , the number of alternate host
plants within the 1.5m square grid containing the ith tree. The alternate host plant
Figure 3.3: The left panel displays the lesion counts and the right panel displays the counts of alternate host plants over the experimental field in the Comandra blister rust analysis. (Panel titles: lesion counts; counts for host plants. Axes: Easting, Northing.)
serves as a host for the fungus causing the infection on the trees. Preliminary data
analyses indicate that lesion counts are positively associated with the host plants,
driving the hypothesis that a shared frailty model might be useful for analyzing the
common spatial pattern of these responses. Figure 3.3 displays the spatial structures
of the lesion counts and counts of alternate host plants for the 500 sampled trees.
One of the most widely used approaches for modeling spatial correlation for point
referenced data is kriging, which commonly uses the exponential decay function to
model spatial correlation between points $r$ and $s$: $\rho_{rs} = \exp(-\phi d_{rs})$, where $\phi > 0$
controls the rate of decay of correlation with distance, larger values indicating more rapid
decay, and $d_{rs}$ is the distance between the two spatial points. For shared frailty
modeling, this method is computationally slow and cumbersome to implement due
to inversion of the n × n covariance matrix at each MCMC iteration. Here, we
use the conditional autoregressive model, as an approximation, for specifying the
Figure 3.4: Empirical semivariogram for counts of lesions and alternate host plants. (Panel titles: lesion counts; counts for host plants. Axes: distance, semivariance.)
shared spatially correlated random effect. To obtain neighborhood definitions for this
model, we note that the empirical semivariograms of counts of lesions and alternate
host plants, displayed in Figure 3.4 suggest spatial correlations in both outcomes
up to about 20 meters. Therefore, spatial correlation in the shared frailty term is
accommodated by setting Wrs = I {d(r, s) ≤ 20m} where d(r, s) denotes the distance
between trees r and s.
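The distance-based neighborhood definition can be sketched as follows. The coordinates are hypothetical; setting the diagonal to zero is an assumption that follows the usual CAR convention that a site is not its own neighbor, which the indicator in the text leaves implicit.

```python
import numpy as np

def distance_neighbourhood(coords, threshold=20.0):
    """W[r, s] = 1 if the Euclidean distance d(r, s) <= threshold, else 0.
    coords has shape (n, 2), in the same units as the threshold (metres here)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    W = (dist <= threshold).astype(int)
    np.fill_diagonal(W, 0)        # assumption: a tree is not its own neighbour
    return W

# Three hypothetical trees on a line, 10 m and 40 m from the first.
coords = [(0.0, 0.0), (10.0, 0.0), (40.0, 0.0)]
W = distance_neighbourhood(coords, threshold=20.0)
```

The resulting matrix is symmetric, so it can be used in place of the boundary-sharing adjacency matrix when forming D - W.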
Both counts of lesions (65% zeros) and alternate host plants (84% zeros) exhibit
excessive zeros, as displayed in Figure 3.5. In the current context, the extra-Poisson
variation induced by the excess zeros cannot be accommodated by a simple Poisson-
type distribution. Additionally, note that Vuong’s non-nested hypothesis test (Vuong,
1989) for a comparison of the predicted probabilities of Poisson regression and ZIP
regression indicates that the observed frequency of zeros far exceeds that expected
Figure 3.5: Histograms of counts of lesions and alternate host plants. The subplot in the right panel displays the distribution for counts of disease host plants excluding zeros. (Panel titles: lesion counts, 65.6% zeros; counts of disease host plants, 84.8% zeros. Vertical axis: Frequency.)
under the Poisson assumption. We utilize a zero-inflated distribution instead.
In our application, our goal is to examine whether spatial similarity of the random
processes exists across the spatial maps of the two responses through a latent random
effect and also to examine whether the zero mass components of the two distributions
are also correlated through a shared underlying latent process that is spatially varying.
The zero inflated common spatial factor model (3.4), (3.5), (3.6) is implemented
with $j = L$ for the response of lesion counts and $j = H$ for alternate host plants. We
assign weakly informative $N(0, 10)$ priors to $\alpha_L$, $\alpha_H$, $\beta_L$, $\beta_H$, $\gamma$ and $\omega$. Note that we
restrict the variance components to equal one to allow identification in this application.
In other applications, careful examination of prior specifications that impose enough
constraint to allow for identification, without restricting variance components to be
equal to one, may be necessary. Table 3.3 provides summaries of the posterior
distributions of the parameters. The posterior mean (95% credible interval) for $\gamma$ is
CHAPTER 3. COMMON SPATIAL FACTOR MODEL 57
Table 3.3: Posterior summaries for the parameters of the zero-inflated common spatialfactor model in the analysis of Comandra blister rust infection.
Parameter Mean 95% CI�L 0.066 (-0.102, 0.223)�H 1.034 (0.456, 1.450)�L -0.095 (-0.459, 0.213)�H 2.025 (1.465, 2.696) 5.972 (4.251, 8.454)! 6.419 (2.906, 10.491)
5.972(4.251, 8.454) and for ! is 6.419(2.906, 10.491). Both of the credible intervals
do not cover zero, so there is evidence in commonality of spatial structure for lesion
counts and counts of host plants, both for the Poisson distributed component, and
the excess zero component. We also considered other competing models including the
counterpart separate models as listed in Table 3.4. The zero inflated common spatial
factor model offers a better fit with lower DIC score. As well, the shared compo-
nent models are superior to their counterpart separate models (see another example
illustrating this in Table 3.4–contrast M2 and M4).
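For readers less familiar with the DIC comparison, the following minimal sketch shows
how pD and DIC can be computed from posterior draws for a plain Poisson likelihood. It
is an illustration under stated assumptions, not the computation used for Table 3.4:
the function names are hypothetical, the plug-in deviance is evaluated at the
posterior mean of the Poisson means, and the "posterior draws" are simulated from a
gamma distribution purely for demonstration.

```python
import numpy as np
from math import lgamma

def poisson_deviance(y, mu):
    """Deviance D = -2 * log-likelihood of counts y under Poisson means mu."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_factorial = np.array([lgamma(v + 1.0) for v in y])
    return -2.0 * np.sum(y * np.log(mu) - mu - log_factorial)

def dic(y, mu_draws):
    """Return (pD, DIC) from an (S, n) array of posterior draws of the Poisson means."""
    d_bar = np.mean([poisson_deviance(y, mu) for mu in mu_draws])  # posterior mean deviance
    d_hat = poisson_deviance(y, mu_draws.mean(axis=0))             # deviance at posterior mean
    p_d = d_bar - d_hat                                            # effective number of parameters
    return p_d, d_bar + p_d

rng = np.random.default_rng(0)
y = np.array([0, 2, 1, 4])
# Stand-in posterior draws (assumption): independent Gamma(y + 1, 1) for each mean.
mu_draws = rng.gamma(shape=y + 1.0, scale=1.0, size=(2000, 4))
p_d, dic_value = dic(y, mu_draws)
print(f"pD = {p_d:.2f}, DIC = {dic_value:.2f}")
```

A model with a lower DIC is preferred; pD penalizes the additional flexibility that
shared or separate random effects introduce.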
Figure 3.6 displays the posterior medians of the common spatial factor for the
Poisson distributed component, b, and the excess zero component, d. Strong spatial
structures are manifested in both shared spatial components, with higher values for b
in the south-west and south-east quadrants and almost the opposite pattern observed
for d. Our model exhibits the feature that regions with higher probability of observing
excess zeros for both components have lower probability of observing large counts.
More sophisticated models accounting for the correlation of the two components may
be useful.
Table 3.4: pD and DIC for competing models in the analysis of Comandra blister rust
infection.

Type      Model  Poisson distribution                               Excess zero                                                  pD     DIC
Shared    M1     $\log(\mu_{iL}) = \alpha_L + b_i$                  $\mathrm{logit}(\pi_{iL}) = \beta_L + d_i$                   119.2  1784.9
                 $\log(\mu_{iH}) = \alpha_H + \gamma_b \cdot b_i$   $\mathrm{logit}(\pi_{iH}) = \beta_H + \gamma_d \cdot d_i$
          M2     $\log(\mu_{iL}) = \alpha_L + b_i$                  $\mathrm{logit}(\pi_{iL}) = \beta_L$                         58.0   1831.9
                 $\log(\mu_{iH}) = \alpha_H + \gamma_b \cdot b_i$   $\mathrm{logit}(\pi_{iH}) = \beta_H$
Separate  M3     $\log(\mu_{iL}) = \alpha_L + b_{iL}$               $\mathrm{logit}(\pi_{iL}) = \beta_L + d_{iL}$                32.8   1982.9
                 $\log(\mu_{iH}) = \alpha_H + b_{iH}$               $\mathrm{logit}(\pi_{iH}) = \beta_H + d_{iH}$
          M4     $\log(\mu_{iL}) = \alpha_L + b_{iL}$               $\mathrm{logit}(\pi_{iL}) = \beta_L$                         27.7   1989.3
                 $\log(\mu_{iH}) = \alpha_H + b_{iH}$               $\mathrm{logit}(\pi_{iH}) = \beta_H$
Figure 3.6: The left panel displays the posterior median of the shared random effect
b for the Poisson count component. The right panel displays the posterior median of
the shared random effect d for the excess zero probability component. (Axes: East,
North.)
3.4 Power of the Test for Common Spatial Structure
Using a shared frailty model such as (3.1) to evaluate and test for common spatial
structure across outcomes seems an intuitively appealing and easily interpretable
approach. Additionally, it would be useful to have a sense of whether such a framework
provides a reasonably powerful test under scenarios such as those seen, for example,
in the analysis of the lung cancer data. More broadly, routinely evaluating the power
of testing procedures employed in an application is an important and helpful
consideration. To examine the strength of the evidence in the data for a test of
common spatial structure across outcomes, we conduct a simulation study which
evaluates the power of the test

$$H_0 : \gamma = 0 \quad \text{versus} \quad H_1 : \gamma \neq 0. \qquad (3.12)$$

In this case, $H_0$ corresponds to the case that there is no common spatial structure
for the multiple spatial outcomes. We note that the power of a test of
$H_0 : \gamma = 1$ versus $H_1 : \gamma \neq 1$ may also be evaluated in a similar
manner. Such a hypothesis may be particularly applicable to the lung cancer analysis
of males and females, and to other similar gender comparisons.
The Bayesian approach to hypothesis testing is based on the credible interval
derived from the posterior distribution of the parameters. The power investigation
utilizes the generation of bootstrap samples based on the posterior estimates from
fitting the model to the data under consideration. The procedure is described in
terms of the lung cancer analysis as follows. Let $\alpha_m$ and $\alpha_f$ denote the
posterior mean estimates of the intercept terms for males (j = m) and females (j = f),
respectively, from fitting the model M1 (3.11) to the Ontario lung cancer data; we use
these as true values in the study, with $\gamma$ and the variance components varying
around estimates obtained from that analysis as described below. Let $\sigma_b^2$,
$\sigma_{h_m}^2$ and $\sigma_{h_f}^2$ denote the true values of the variance
parameters for $b$, $h_m$ and $h_f$. Calculate the covariance matrix for $b$ as
$\Sigma_b = \sigma_b^2 (D - W)^{-1}$, where $(D - W)$ is the neighborhood structure
based on Ontario lung cancer mortality data. Set r = 1.
(a) At the rth replicate, generate
$b^{(r)} = (b_1^{(r)}, \cdots, b_n^{(r)})^T \sim \mathrm{MVN}(0, \Sigma_b)$,
$h_m^{(r)} = (h_{1m}^{(r)}, \cdots, h_{nm}^{(r)})^T \sim \mathrm{MVN}(0, \sigma_{h_m}^2 I)$ and
$h_f^{(r)} = (h_{1f}^{(r)}, \cdots, h_{nf}^{(r)})^T \sim \mathrm{MVN}(0, \sigma_{h_f}^2 I)$.

(b) Calculate the relative risks for males and females:

$$\theta_{im}^{(r)} = \exp\left(\alpha_m + b_i^{(r)} + h_{im}^{(r)}\right); \quad
\theta_{if}^{(r)} = \exp\left(\alpha_f + \gamma \cdot b_i^{(r)} + h_{if}^{(r)}\right), \quad i = 1, \cdots, n.$$

Then, generate counts from

$$y_{im}^{(r)} \sim \mathrm{Multinomial}\left(\sum_{i=1}^{n} E_{im},\;
\frac{E_{im}\theta_{im}^{(r)}}{\sum_{i=1}^{n} E_{im}\theta_{im}^{(r)}}\right);$$

$$y_{if}^{(r)} \sim \mathrm{Multinomial}\left(\sum_{i=1}^{n} E_{if},\;
\frac{E_{if}\theta_{if}^{(r)}}{\sum_{i=1}^{n} E_{if}\theta_{if}^{(r)}}\right).$$

(c) Fit the common spatial factor model using $y_{im}^{(r)}$ and $y_{if}^{(r)}$,
$i = 1, \cdots, n$, to obtain the posterior estimate $\gamma^{(r)}$. Let
$\gamma_L^{(r)}$ and $\gamma_U^{(r)}$ denote the 2.5% and 97.5% quantiles of the MCMC
distribution of $\gamma^{(r)}$.

(d) Set r to r + 1. Repeat (a)-(c) for $r = 2, \cdots, R$ replicates, for example
R = 1000.

The power is then calculated as

$$\mathrm{Power} = 1 - \sum_{r=1}^{R} I\left(\gamma_L^{(r)} < 0 < \gamma_U^{(r)}\right) / R,$$

where I(A) is the indicator function for the event A.
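Steps (a), (b) and the final power formula can be sketched as follows. This is a
minimal illustration under stated assumptions, not the code used in the thesis: the
lattice map, the expected counts and all parameter values are hypothetical, the
pseudo-inverse is used for $(D - W)^{-1}$ because $D - W$ is singular for an intrinsic
CAR structure, and the MCMC fit of step (c) is not implemented (the interval limits
passed to `power_from_intervals` are made-up numbers).

```python
import numpy as np

def lattice_adjacency(nrow, ncol):
    """Rook-neighbour adjacency matrix W for a small nrow x ncol lattice (hypothetical map)."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < nrow:
                W[i, i + ncol] = W[i + ncol, i] = 1.0
    return W

def simulate_replicate(rng, W, E_m, E_f, alpha_m, alpha_f, gamma, sigma2_b, sigma2_h):
    """Steps (a)-(b): draw random effects, form relative risks, draw multinomial counts."""
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    # Assumption: D - W is singular, so a pseudo-inverse stands in for (D - W)^{-1}.
    Sigma_b = sigma2_b * np.linalg.pinv(D - W)
    Sigma_b = (Sigma_b + Sigma_b.T) / 2.0                      # enforce exact symmetry
    b = rng.multivariate_normal(np.zeros(n), Sigma_b, check_valid="ignore")
    h_m = rng.normal(0.0, np.sqrt(sigma2_h), size=n)
    h_f = rng.normal(0.0, np.sqrt(sigma2_h), size=n)
    theta_m = np.exp(alpha_m + b + h_m)                        # relative risks, males
    theta_f = np.exp(alpha_f + gamma * b + h_f)                # relative risks, females
    p_m = E_m * theta_m / np.sum(E_m * theta_m)
    p_f = E_f * theta_f / np.sum(E_f * theta_f)
    y_m = rng.multinomial(int(round(E_m.sum())), p_m)
    y_f = rng.multinomial(int(round(E_f.sum())), p_f)
    return y_m, y_f

def power_from_intervals(lower, upper):
    """Power = 1 - proportion of 95% credible intervals for gamma that cover zero."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    covers_zero = (lower < 0.0) & (0.0 < upper)
    return 1.0 - covers_zero.mean()

rng = np.random.default_rng(1)
W = lattice_adjacency(4, 5)
E = np.full(W.shape[0], 50.0)                                  # hypothetical expected counts
y_m, y_f = simulate_replicate(rng, W, E, E, 0.0, 0.0, 1.0, 0.1, 0.01)

# Made-up interval limits standing in for the output of step (c) over four replicates.
example_power = power_from_intervals([-0.2, 0.1, 0.3, 0.4], [0.5, 0.9, 1.1, 1.2])
```

In a full study, step (c) would replace the made-up limits with the 2.5% and 97.5%
posterior quantiles of $\gamma$ obtained from fitting the model to each simulated
data set.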
Intuitively, the greater the difference between the true value of $\gamma$ and zero,
the greater the power should be. As well, the power should increase with the relative
dominance of the common factor b in explaining heterogeneity of outcomes, relative to
the outcome-specific random noise h. Hence, we study the power of the test as $\gamma$
and $\sigma_b^2/\sigma_h^2$ vary in such a manner as to reflect the relative
importance of the shared component, where, for simplicity, we set
$\sigma_h^2 = \sigma_{h_m}^2 = \sigma_{h_f}^2$. We investigate scenarios with spatial
structures identical to those in the Ontario lung cancer incidence study, including
the same estimate of $\sigma_h^2$ as well as the same expected counts, noting for
completeness that power would increase as the expected counts increase. We consider
values of $\gamma$ varying between 0 and 2 in increments of 0.1 and let
$\sigma_b^2/\sigma_h^2$ take values of $0, 0.5, 1, 1.5, 2, 3, 4, \cdots, 8$.
Figure 3.7 displays the power as a function of $\gamma$ for three values of the ratio
$\sigma_b^2/\sigma_h^2$, while Figure 3.8 displays the power as a function of
$\sigma_b^2/\sigma_h^2$ for three values of $\gamma$. When $\sigma_b^2/\sigma_h^2$ is
10 or higher, the power attains the target value of one quickly, but the curve is far
less steep for $\sigma_b^2/\sigma_h^2 = 5$. Figure 3.8 shows that the power can be
quite low if $\sigma_b^2 = \sigma_h^2$ and $\gamma = 1$, for example. For high power
in this spatial framework, we require a clear dominance of $\sigma_b^2$ when
$\gamma = 1$. We note that this is achieved in the lung cancer analysis. When data are
more sparse than for this analysis, i.e. for the cases when the expected counts are
E/50 or E/100, the power is considerably lower. For example, when
$\sigma_b^2/\sigma_h^2 = 10$, $\gamma = 1$, and the expected counts are scaled by
1/50, the power drops to about 0.2; contrast this with the value of about 1 when the
expected counts and parameter settings are as seen in the lung cancer data.

Figure 3.7: Power curves for testing spatial homogeneity of the risk maps for males
and females in the Ontario lung cancer analysis, over varying values of the link
parameter $\gamma$ ranging between 0 and 2 in increments of 0.1, for three values of
the ratio $\sigma_b^2/\sigma_h^2$ (20, 10 and 5).
Figure 3.8: Power curves for testing spatial homogeneity of the risk maps for males
and females in the Ontario lung cancer analysis, over varying values of
$\sigma_b^2/\sigma_h^2$ of $0, 0.5, 1, 1.5, 2, 3, \cdots, 8$ at three values of
$\gamma$ (2, 1 and 0.5).
3.5 Precision Gains Through Joint Outcome Modeling
The main emphasis in this chapter is a discussion of models and methods for the
exploration of common spatial structure across outcomes through the use of common
spatial factor models. As an aside, we also note that such models, if they are appli-
cable, may also provide more precise estimates of small area risks. We evaluate here
the gains in precision of risk estimates with particular emphasis on scenarios where
the power of testing for common spatial structure is high, as in the setting of the lung
cancer analysis.
In particular, we consider the spatial structure of the map as for the Ontario lung
cancer analysis; $\sigma_b^2/\sigma_h^2$ is assigned to be 10, 50 or 100, with
$\sigma_h^2 = 0.001$, 0.005 and 0.01; $\gamma = 1$; the expected disease counts are
scaled by the inverse of $\kappa$, with $\kappa = 1$, 50 and 100; all other parameter
values are set to the estimates from the analysis of the lung cancer incidence data.
Note that Appendix A provides details on how the variance of the response depends on
the magnitude of the variance components as well as on the expected counts. Data are
generated for each combination of the parameter settings, and analyzed using both the
common spatial factor model and separate analyses of each outcome in a parallel
structure to the joint outcome analysis, including spatial and uncorrelated random
effects.
The performance of the common spatial factor model and the separate models for the
outcomes was assessed in terms of the average relative bias and average root mean
squared error for the relative risk estimators across regions and outcome groups,
along with the average standard deviation and exceedance probabilities for those
areas with true relative risks above one, where averaging is taken over a large
number, R, of simulation runs, R = 1000, for each of the settings of parameters. We
define these evaluation criteria precisely below for estimators
$\hat{\theta}_{ij}^{(r)}$ denoting the posterior mean estimate of the true relative
risk $\theta_{ij}$ for local area i and outcome j from the rth simulated data set
using the joint outcome analysis. Similar definitions would apply for estimators
obtained from the separate analyses of the outcomes. We compute the average absolute
relative bias (ABIAS), standard deviation (ASE) and root mean squared error (ARMSE)
as

$$\mathrm{ABIAS}(\theta) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n}
\left| \frac{\sum_{r=1}^{R} \hat{\theta}_{ij}^{(r)}/R - \theta_{ij}}{\theta_{ij}} \right|,$$

$$\mathrm{ASE}(\theta) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n}
\left[ \sum_{r=1}^{R} \left( \hat{\theta}_{ij}^{(r)} - \sum_{r=1}^{R} \hat{\theta}_{ij}^{(r)}/R \right)^2 / R \right]^{1/2},$$

$$\mathrm{ARMSE}(\theta) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n}
\left[ \sum_{r=1}^{R} \left( \hat{\theta}_{ij}^{(r)} - \theta_{ij} \right)^2 / R \right]^{1/2},$$

respectively. Since identification of high risk areas is often of prime interest in
disease mapping, we compute the average exceedance probability for those areas with
true relative risks greater than unity:

$$\mathrm{APREX}(\theta) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n}
\left[ \sum_{r=1}^{R} p\left( \theta_{ij}^{(r)} > 1 \mid j : \theta_{ij} > 1 \right) \right].$$
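The four criteria above can be computed from arrays of simulation output as in the
following minimal sketch. The array layout, the function name and the toy inputs are
hypothetical. For APREX the sketch makes an explicit assumption about the
normalization: exceedance probabilities are averaged over replicates and over only
those area-outcome pairs whose true relative risk exceeds one.

```python
import numpy as np

def evaluation_criteria(theta_hat, theta_true, exceed_prob):
    """
    theta_hat:   (R, n, J) posterior mean estimates of relative risk, one per replicate
    theta_true:  (n, J) true relative risks
    exceed_prob: (R, n, J) posterior probabilities P(theta_ij > 1 | data), one per replicate
    """
    mean_hat = theta_hat.mean(axis=0)                                   # average over replicates
    abias = np.mean(np.abs((mean_hat - theta_true) / theta_true))
    ase = np.mean(np.sqrt(np.mean((theta_hat - mean_hat) ** 2, axis=0)))
    armse = np.mean(np.sqrt(np.mean((theta_hat - theta_true) ** 2, axis=0)))
    high_risk = theta_true > 1.0
    # Assumption: average over replicates, then over the high-risk area-outcome pairs only.
    aprex = exceed_prob.mean(axis=0)[high_risk].mean()
    return {"ABIAS": abias, "ASE": ase, "ARMSE": armse, "APREX": aprex}

# Toy input: n = 2 areas, J = 2 outcomes, R = 2 replicates (illustrative numbers only).
theta_true = np.array([[0.8, 1.2], [1.5, 0.9]])
theta_hat = np.stack([1.1 * theta_true, 0.9 * theta_true])
exceed_prob = np.stack([np.full((2, 2), 0.7), np.full((2, 2), 0.9)])
criteria = evaluation_criteria(theta_hat, theta_true, exceed_prob)
```

Because the toy estimates are symmetric about the truth, the bias vanishes and the ASE
and ARMSE coincide; in a real simulation the two differ by the bias contribution.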
Table 3.5 contrasts the ABIAS, ASE, ARMSE and APREX for the joint and separate
analyses for studies with $\sigma_h^2 = 0.001$ and $\gamma = 1$. The table shows that
the gains through using the joint model are not overly substantial, even when the
population distribution is sparse. However, as we increase $\sigma_h^2$ to 0.005 and
0.01 (see Tables 3.6 and 3.7), it is evident that the joint model results in smaller
standard deviations, and therefore smaller root mean squared errors, particularly as
the shared component becomes more dominant and the expected counts become sparser.
For example, Table 3.7 shows that when $\sigma_b^2 = 1$ and the expected counts are
scaled by 1/100, the ARMSE for the joint model (0.4622) is about 10% smaller than that
for the separate model (0.5128). Plots of the ASE by region (not shown here) show that
gains in precision are sharper for those regions with smaller expected counts, for all
simulation settings. In terms of APREX, the joint model has only very slightly higher
predictive ability to detect areas of high risk. The common factor model is
potentially beneficial if there is reasonably strong spatial correlation for the
multiple spatial outcomes, and the disease is relatively rare.

Table 3.5: The average absolute relative bias (ABIAS), average standard deviation
(ASE) and average root mean squared error (ARMSE), along with average exceedance
probability (APREX) for regions with true relative risks greater than one. The
expected disease counts are scaled by the inverse of $\kappa$. The true value of
$\sigma_h^2$ is 0.001.

                         $\sigma_b^2 = 0.01$    $\sigma_b^2 = 0.05$    $\sigma_b^2 = 0.1$
                         Joint    Separate      Joint    Separate      Joint    Separate
$\kappa = 1$     ABIAS   0.0520   0.0518        0.0590   0.0581        0.0674   0.0665
                 ASE     0.0340   0.0363        0.0388   0.0438        0.0397   0.0461
                 ARMSE   0.0675   0.0688        0.0769   0.0802        0.0861   0.0900
                 APREX   0.6071   0.6036        0.7088   0.6976        0.7480   0.7372
$\kappa = 50$    ABIAS   0.0604   0.0602        0.0938   0.0971        0.1138   0.1233
                 ASE     0.0833   0.0888        0.1029   0.1063        0.1224   0.1260
                 ARMSE   0.1116   0.1158        0.1517   0.1566        0.1819   0.1916
                 APREX   0.4930   0.4925        0.5296   0.5255        0.5540   0.5452
$\kappa = 100$   ABIAS   0.0621   0.0620        0.1028   0.1041        0.1308   0.1360
                 ASE     0.1012   0.1067        0.1167   0.1204        0.1338   0.1360
                 ARMSE   0.1274   0.1318        0.1691   0.1729        0.2039   0.2102
                 APREX   0.4769   0.4765        0.5005   0.4983        0.5198   0.5151
Table 3.6: The average absolute relative bias (ABIAS), average standard deviation
(ASE) and average root mean squared error (ARMSE), along with average exceedance
probability (APREX) for regions with true relative risks greater than one. The
expected disease counts are scaled by the inverse of $\kappa$. The true value of
$\sigma_h^2$ is 0.005.

                         $\sigma_b^2 = 0.05$    $\sigma_b^2 = 0.25$    $\sigma_b^2 = 0.5$
                         Joint    Separate      Joint    Separate      Joint    Separate
$\kappa = 1$     ABIAS   0.0615   0.0615        0.0917   0.0911        0.1327   0.1319
                 ASE     0.0415   0.0447        0.0424   0.0478        0.0420   0.0479
                 ARMSE   0.0832   0.0851        0.1158   0.1186        0.1683   0.1706
                 APREX   0.7055   0.6972        0.7927   0.7824        0.8159   0.8112
$\kappa = 50$    ABIAS   0.0943   0.0982        0.1388   0.1594        0.1522   0.1856
                 ASE     0.1067   0.1094        0.1583   0.1716        0.1889   0.2139
                 ARMSE   0.1599   0.1645        0.2399   0.2638        0.2908   0.3274
                 APREX   0.5320   0.5275        0.6037   0.5894        0.6520   0.6285
$\kappa = 100$   ABIAS   0.1037   0.1057        0.1692   0.1874        0.2013   0.2350
                 ASE     0.1206   0.1236        0.1797   0.1849        0.2187   0.2394
                 ARMSE   0.1778   0.1813        0.2791   0.2976        0.3445   0.3816
                 APREX   0.5010   0.4994        0.5581   0.5474        0.6014   0.5833

Table 3.7: The average absolute relative bias (ABIAS), average standard deviation
(ASE) and average root mean squared error (ARMSE), along with average exceedance
probability (APREX) for regions with true relative risks greater than one. The
expected disease counts are scaled by the inverse of $\kappa$. The true value of
$\sigma_h^2$ is 0.01.

                         $\sigma_b^2 = 0.1$     $\sigma_b^2 = 0.5$     $\sigma_b^2 = 1$
                         Joint    Separate      Joint    Separate      Joint    Separate
$\kappa = 1$     ABIAS   0.0733   0.0731        0.1353   0.1348        0.2151   0.2145
                 ASE     0.0439   0.0469        0.0438   0.0482        0.0428   0.0473
                 ARMSE   0.0976   0.0991        0.1736   0.1754        0.3014   0.3028
                 APREX   0.7672   0.7575        0.7897   0.7826        0.7793   0.7756
$\kappa = 50$    ABIAS   0.1127   0.1216        0.1549   0.1850        0.1902   0.2225
                 ASE     0.1319   0.1365        0.1899   0.2150        0.2123   0.2517
                 ARMSE   0.1967   0.2054        0.2964   0.3309        0.3987   0.4416
                 APREX   0.5674   0.5596        0.6336   0.6142        0.6684   0.6438
$\kappa = 100$   ABIAS   0.1300   0.1364        0.1986   0.2301        0.2384   0.2883
                 ASE     0.1431   0.1450        0.2214   0.2419        0.2602   0.2971
                 ARMSE   0.2187   0.2246        0.3489   0.3846        0.4622   0.5128
                 APREX   0.5283   0.5235        0.5901   0.5746        0.6150   0.5923

3.6 Summary and Concluding Remarks
In this chapter, we present a general modeling framework for joint modeling of count
data and zero-inflated count data. We examine important aspects of the approach,
including the power of testing the common spatial structure and the gains in
efficiency in estimating relative risks. In summary, the power of identifying common
spatial structure increases as the factor loading parameter increases, as well as when
the shared spatially correlated random effect becomes more dominant. As well, the
common spatial factor model offers some improvement in the efficiency of the relative
risk estimator when the common spatial term becomes more dominant as well as when the
disease is relatively rare. Importantly, the shared frailty model can be used to
identify the common spatial structures across outcomes, where such exist.
More sophisticated spatiotemporal models are also possible. For example, let
$y_{ijt}$ denote the count of disease for region i, outcome j and time t, and let
$E_{ijt}$ denote the expected disease count, $i = 1, \cdots, n$, $j = 1, \cdots, J$
and $t = 1, \cdots, T$. A common spatio-temporal factor model may be expressed as

$$\log(\mu_{ijt}) = \alpha_{jt} + \log(E_{ijt}) + \gamma_j b_i + \delta_j g_t + \epsilon_{ijt}, \qquad (3.13)$$

where $\alpha_{jt}$ is the overall mean rate for the jth component at time t;
$\mu_{ijt}$ is the expected mean count of disease for region i, outcome j and time t;
the spatial random effect is $b = (b_1, \cdots, b_n)^T \sim N(0, \Sigma_b)$ with
$\Sigma_b = \sigma_b^2 (D - W)^{-1}$; and a simple AR(1) model may be used for the
temporal random effect: $g_t \mid g_{t-1} \sim N(\rho g_{t-1}, \sigma_g^2)$, where
$\sigma_g^2$ is the temporal dispersion parameter and $\rho$ is the temporal
autocorrelation, with $|\rho| < 1$ (Waller et al., 1997). The interaction effect of
space and time over multiple outcomes is accommodated through
$\epsilon = (\epsilon_{111}, \cdots, \epsilon_{nJT})^T \sim N(0, \sigma_\epsilon^2 I_\epsilon^{-1})$,
where $\sigma_\epsilon^2$ measures the residual dispersion of outcomes over space and
time. This model is intuitively appealing because it allows multiple outcomes to be
linked through both the shared spatially correlated component and the shared
temporally correlated component. Such a model may also be extended to incorporate
zero-inflation. The major challenge for all these extensions is the determination of
computationally feasible and efficient algorithms for inference.
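To make the structure of (3.13) concrete, the following minimal sketch simulates
counts from a model of this form. It is an illustration under stated assumptions, not
an inference algorithm: all dimensions and parameter values are hypothetical, the
AR(1) process is started from its stationary distribution (which requires an
autocorrelation strictly inside the unit interval), and independent normal draws stand
in for the spatially correlated effect b.

```python
import numpy as np

def simulate_ar1(rng, T, rho, sigma2_g):
    """Draw g_1..g_T with g_t | g_{t-1} ~ N(rho * g_{t-1}, sigma2_g); g_1 is stationary."""
    g = np.empty(T)
    g[0] = rng.normal(0.0, np.sqrt(sigma2_g / (1.0 - rho ** 2)))
    for t in range(1, T):
        g[t] = rng.normal(rho * g[t - 1], np.sqrt(sigma2_g))
    return g

def log_mean_counts(alpha, E, gamma, b, delta, g, eps):
    """log(mu_ijt) = alpha_jt + log(E_ijt) + gamma_j * b_i + delta_j * g_t + eps_ijt."""
    return (alpha[None, :, :] + np.log(E)
            + gamma[None, :, None] * b[:, None, None]     # shared spatial factor
            + delta[None, :, None] * g[None, None, :]     # shared temporal factor
            + eps)                                        # space-time-outcome interaction

rng = np.random.default_rng(2)
n, J, T = 6, 2, 5                                  # regions, outcomes, time points
alpha = np.zeros((J, T))                           # hypothetical intercepts
E = np.full((n, J, T), 20.0)                       # hypothetical expected counts
gamma = np.array([1.0, 0.8])                       # spatial loadings (illustrative)
delta = np.array([1.0, 1.2])                       # temporal loadings (illustrative)
b = rng.normal(0.0, 0.3, size=n)                   # stand-in for the CAR spatial effect
g = simulate_ar1(rng, T, rho=0.7, sigma2_g=0.05)
eps = rng.normal(0.0, 0.05, size=(n, J, T))

log_mu = log_mean_counts(alpha, E, gamma, b, delta, g, eps)
y = rng.poisson(np.exp(log_mu))                    # simulated counts y_ijt
```

Replacing the stand-in for b with a draw from the spatial prior, and adding a
zero-inflation layer, would follow the same pattern.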
We also note that disease mapping techniques are often used to gain precision
over the highly variable standardized mortality ratio estimate of small area risks. In
situations where it is not clear what sort of spatial or neighborhood structure may
be appropriate, it may be that sufficient gains in precision of estimates of small-
area risks may be attained through joint outcome modeling using only independent
error terms, where small-area effects are clearly linked across a variety of outcomes
(e.g. gender, age-groups). In this case, the shared random effect would be assumed
uncorrelated. This may provide a robust and efficient alternative to assumptions of
spatial correlation across a map. Shared frailty models for clustering regions might
also be useful in this regard.
Chapter 4
Impact of Misspecifying Spatial
Exposures
4.1 Introduction
Comandra blister rust (CBR) is a disease of hard pines that is caused by a fungus
growing in the inner bark. The fungus (Cronartium comandrae) has a complex life
cycle. Though it infects hard pines, it needs an alternate host (here termed AHP),
an unrelated plant, to spread from one pine to another. On hard pines, the fungus
causes growth reduction, stem deformity, and mortality. In addition, pines with stem
cankers produce significantly fewer cones and seeds than healthy trees.
Losses to Comandra blister rust on lodgepole pine are large in British Columbia,
Canada; mortality in young stands sometimes exceeds 85%. Understanding the effect
of spatial proximity of susceptible trees to the alternate host plant is likely critical
to understanding disease dynamics in such natural systems, where the spatial patterns
of trees and alternate host plants are highly variable. However, few studies which
contrast and evaluate appropriate measures of host plant spatial proximity have been
performed. In forestry, the most common measure recorded and utilized is the
distance between the tree and the nearest AHP (Jacobi et al., 1993). Though nearest
distance is an important measure of proximity, one might also postulate that the den-
sity of the AHP in various neighborhoods of the tree may have an important effect.
This chapter aims to understand the dynamics of the disease, especially the relation-
ship with the host plant through which the disease spreads. In particular we aim to
elucidate the effect of misspecifying measures of proximity in spatial studies of ex-
posure even as we utilize flexible models for investigating relationships. This type of
investigation is possible in our study due to the availability of detailed data on AHP,
providing a unique and rich framework for assessing the effect of misspecification of
measures of proximity.
Here we contrast the use of (i) nearest distance to alternate host plant with (ii)
alternate host plant density at various neighborhoods of the tree, where both
measures are incorporated in a flexible generalized additive modeling framework. The
complex nature of the phenomenon under study, as well as the interest in not
introducing misspecification through the use of linear models, suggests the adoption
of such a semi-parametric generalized additive model. Of primary focus is the effect
of misspecifying the measure by which the alternate host plant affects the risk of
infection with respect to a treatment effect. If the covariate is improperly modeled,
bias in estimating treatment effects may arise. Through simulation studies, we
investigate the practical implication of the bias in the treatment effect for
scenarios related to those seen in the application. We complement this investigation
with an analysis of Comandra blister rust infection which incorporates both nonlinear
effects of the measures of spatial proximity of the host plants as well as spatial
random effects which account for spatial correlation.
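The two proximity measures contrasted above can be sketched as follows. This is a
minimal illustration with made-up coordinates; the function name, the radius and the
use of a circular neighborhood for the density are assumptions for the example, not
the definitions adopted later in the chapter.

```python
import numpy as np

def proximity_measures(tree_xy, ahp_xy, radius):
    """
    tree_xy: (n_trees, 2) tree coordinates in metres
    ahp_xy:  (n_ahp, 2) alternate host plant (AHP) coordinates in metres
    Returns (i) the distance from each tree to its nearest AHP and
            (ii) the AHP count per unit area within `radius` of each tree.
    """
    diff = tree_xy[:, None, :] - ahp_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # (n_trees, n_ahp) pairwise distances
    nearest = dist.min(axis=1)                         # measure (i)
    density = (dist <= radius).sum(axis=1) / (np.pi * radius ** 2)   # measure (ii)
    return nearest, density

# Made-up positions on a grid with 1.5 m spacing (illustrative only).
tree_xy = np.array([[0.0, 0.0], [3.0, 0.0], [6.0, 4.5]])
ahp_xy = np.array([[1.5, 0.0], [3.0, 1.5], [9.0, 4.5]])
nearest, density = proximity_measures(tree_xy, ahp_xy, radius=3.0)
```

In this toy layout the first two trees share the same nearest distance but differ in
local AHP density, which is exactly the kind of distinction that a single proximity
measure can miss.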
The rest of the chapter is organized as follows: in Section 4.2 we discuss the
motivating data in detail. Section 4.3 describes the semi-parametric additive model
and outlines inference for this model. Section 4.4 contains an analysis of the data,
including model selection for a series of competing models. In Section 4.5, we
investigate model misspecification with regard to the use of different measures of
spatial proximity of alternate host plants. Concluding remarks and recommendations
are provided in Section 4.6.
4.2 Comandra Blister Rust Study
The motivating study is part of a larger investigation, conducted by the British
Columbia (BC) Ministry of Forests and Range, of the response of lodgepole pine trees
to Comandra blister rust. Here, we consider responses of 7055 lodgepole pine trees,
which were planted over a 124×64 grid in 2004 and subsequently examined for infection
in 2006, with each grid cell being 1.5m × 1.5m. Except for about 2000 buffer trees,
the remaining trees belong to 110 genetically different seedlots, with about 50 trees
randomly allocated across the plot from each seedlot. There is evidence of genetic
variation in resistance of lodgepole pine to Comandra blister rust; though not
considered in this chapter in detail, identifying such genetic resistance was a prime
focus of the investigation. Data are recorded at the grid cell level.
Comandra blister rust spreads by airborne spores from trees to alternate host
plants, and alternate host plants pass the spores to other trees. It takes about
two years for the infection to appear on a tree. Rust susceptibility is assessed as a
presence/absence trait in 2006, as shown in Figure 4.1. The map reveals some areas
(such as the south-west quadrant) with a high density of infected trees. Figure B.1
in Appendix B provides summary statistics related to infection counts and the
counts of the alternate host plant over various subsections of the plot. Preliminary
studies show that Comandra rust infections are severe on sites where the alternate
host is abundant.
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
20 40 60 80 100
100
150
200
250
East
Nor
th
Figure 4.1: Locations of the trees; a triangle indicates infection with Comandra blisterrust.
CHAPTER 4. MISSPECIFICATION OF SPATIAL EXPOSURE MEASURES 75
4.3 Flexible Smooth Models
To use as flexible a model as possible to describe the effect of a covariate, we consider a
semiparametric generalized additive model incorporating flexible non-linear functions
of exposure measures as well as the spatial surface. Let $y_i$ denote the observed
response indicating the infection status of tree $i$ at spatial easting and northing
site coordinates $s_i = (e_i, n_i)$, such that $y_i = 1$ if tree $i$ is infected with
CBR and $y_i = 0$ otherwise, $i = 1, \dots, N$. We have
$y_i \sim \mathrm{Bernoulli}(\pi_i)$, where $\pi_i = p(y_i = 1)$ represents
the probability of infection.
We now consider how disease risk may be modeled as a function of spatial location
in relation to the alternate host plant. We first note that the ideal exposure measure is
the cumulative effect due to the exposure received at each site in the study region.
In addition, the effect of exposure at each grid cell may differ due to topographical
variables (e.g. elevation, slope, aspect) and meteorology (e.g. wind direction,
temperature, humidity). A very general form for the effect of the covariate over the
whole region can be expressed as
\[ \int_s f_{i,s}\{\xi(s)\}\, ds , \]
where $\xi(s)$ is the value of the exposure at location $s$, and $f_{i,s}$ is a
non-parametric or parametric function determining the effect of exposure $\xi(s)$ at
site $i$. The function $f_{i,s}$ may be identical whenever $\|s_i - s\| = \rho$, so
that the functional form governing the effect of the exposure is the same on
concentric circles radiating from any tree $i$; in this case $f_{i,s}$ may be
expressed as $f_\rho$. Alternatively, $f_{i,s}$ may be sharply increasing with
$\xi(s)$ for sites $s$ to the right of any tree $i$, and zero for sites directly to
the left; this might be the case if the outcome depends on wind direction. We
approximate such a general expression here by defining $f_{i,s}$ to be the same over
wide sets of locations $s$, termed here neighborhoods, and by defining the effect as
zero for locations reasonably distant from site $i$, as described in detail below.
Figure 4.2: Illustration of the definition of the first ($A_{i1}$), second ($A_{i2}$) and third ($A_{i3}$) order neighborhoods of tree $i$. Note that neighborhoods are non-overlapping.
Define the first order neighborhood ($A_{i1}$) of the $i$th tree as the grid cell in
which the tree is located, and the $h$th order neighborhood, $A_{ih}$, as the
$(2h-1) \times (2h-1)$ grid cells centered at the $i$th tree, excluding $A_{ih'}$,
$h' < h$, for $h = 2, \dots, H$. Figure 4.2 displays $A_{i1}$, $A_{i2}$ and $A_{i3}$.
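The neighborhood definition can be made concrete with a short sketch. The Python snippet below is an illustration only: the function name `ring_cells` and the representation of cells as integer offsets from the tree's own cell are assumptions introduced here, not part of the original analysis. It lists the cells belonging to the $h$th order neighborhood and shows that the first three neighborhoods contain 1, 8 and 16 cells.

```python
def ring_cells(h):
    """Offsets (dx, dy) of the grid cells in the h-th order neighborhood.

    The (2h-1) x (2h-1) block centered on the tree's cell, with all
    lower-order neighborhoods removed, is the square ring of cells whose
    Chebyshev distance from the center cell equals h - 1.
    """
    if h < 1:
        raise ValueError("h must be a positive integer")
    r = h - 1
    return [(dx, dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if max(abs(dx), abs(dy)) == r]


# Number of cells in the first, second and third order neighborhoods.
sizes = [len(ring_cells(h)) for h in (1, 2, 3)]
print(sizes)  # [1, 8, 16]
```

Because each ring excludes the block of the previous order, the rings are disjoint, which matches the non-overlapping property noted in the caption of Figure 4.2.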
The usual measure of proximity of AHP to the susceptible tree is the distance
from the tree to the nearest AHP. Here, this is measured as
$L_i = \min\{h : z_{ih} > 0\}$, where $z_{ih}$ is the total alternate host plant
count in $A_{ih}$. Hence, $L_i$ reflects how large $h$ has to be in order for the
union of neighborhoods $\bigcup_{h' \le h} A_{ih'}$ to contain at least one alternate
host plant.
Few attempts have been made in the forestry, environmental or medical literature
to assess measures for characterizing spatial proximity in disease exposure
studies beyond nearest distance. Here, we consider the natural alternative measure
of the density of the alternate host plant; we also allow density effects to differ
across neighborhoods. We approximate the density effects in the different orders
of neighborhoods by the same functional form, with density effects being zero beyond
the third order neighborhood. To be more specific, define the density of AHP in the
$h$th order neighborhood of tree $i$ as $D_i(h) = z_{ih}/|A_{ih}|$, where $|A_{ih}|$
represents the number of cells in the $h$th order neighborhood centered at tree $i$.
The actual density in AHP counts per m$^2$ is therefore $D_i(h)/2.25$. Figure 4.3
displays smoothed plots of $L_i$, $D_i(1)$, $D_i(2)$ and $D_i(3)$ over the study
area, which indicate that trees in the north-east corner of the plot are far away
from the alternate host plants, which are abundant in the south-west and south-east
quadrants.
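As a hedged illustration of how the two exposure measures relate, the sketch below computes the nearest-distance order $L_i$ and the ring densities $D_i(h)$ for one tree from a grid of AHP counts. The function name `exposure_measures`, the toy grid, and the treatment of cells that fall outside the grid (they simply contribute no counts, while the nominal ring size is kept in the denominator) are assumptions made for this example; the source does not state how plot boundaries were handled.

```python
import numpy as np


def exposure_measures(counts, row, col, H=3):
    """Return (L, D) for the tree located in cell (row, col).

    counts : 2-D array of alternate host plant counts per grid cell
    L      : smallest order h whose ring holds at least one AHP (None if no AHP)
    D      : ring densities z_h / |A_h| for h = 1..H, using the nominal ring size
    """
    counts = np.asarray(counts)
    rr, cc = np.indices(counts.shape)
    # Chebyshev distance (in cells) from the tree's own cell.
    cheb = np.maximum(np.abs(rr - row), np.abs(cc - col))

    def ring_total(h):
        return int(counts[cheb == h - 1].sum())

    def ring_size(h):
        return 1 if h == 1 else (2 * h - 1) ** 2 - (2 * h - 3) ** 2

    D = [ring_total(h) / ring_size(h) for h in range(1, H + 1)]
    max_order = max(counts.shape)
    L = next((h for h in range(1, max_order + 1) if ring_total(h) > 0), None)
    return L, D


# Toy 5 x 5 grid: 4 plants next to the tree, 8 plants in a far corner.
grid = np.zeros((5, 5), dtype=int)
grid[2, 3] = 4
grid[0, 0] = 8
L_i, D_i = exposure_measures(grid, row=2, col=2)
print(L_i, D_i)  # 2 [0.0, 0.5, 0.5]
```

In this toy grid the nearest plants sit in the second order ring, so $L_i = 2$, while the second and third order densities happen to coincide; the two measures therefore carry different information about the same layout.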
Previous studies show evidence of genetic variation in resistance of lodgepole pine
to CBR, and we incorporate this variable in our model by a simple linear term on the
logit scale. Let $Z_i = 1$ if the genetic family of the tree is hypothesized to be
resistant and $Z_i = 0$ otherwise. The model is then expressed as:
\[ \log\{\pi_i/(1 - \pi_i)\} = \alpha + \beta_Z Z_i + f_L(L_i)
   + \int_{h=1}^{\infty} f_{D(h)}\{D_i(h)\}\, dh + f_s(e_i, n_i) , \]
where $\alpha$ is the intercept (the overall mean risk on the logit scale) and
$\beta_Z$ represents the genetic effect; more generally, $\beta_Z$ may represent a
treatment effect of interest in other environmental applications. $f_L$ and
$f_{D(h)}$ are univariate cubic penalized splines for nearest distance and
densities. The term $f_s(e_i, n_i)$ is a two dimensional thin-plate regression spline
accounting for spatial autocorrelation. This structure might be determined by
residual factors beyond AHP which are unobserved.
Figure 4.3: Plots of the distributions of the covariates $L$, $D(1)$, $D(2)$ and $D(3)$ over the study area (one panel per covariate; axes: East, North). Darker shades of red indicate higher values and lighter shades of yellow indicate lower values.

To flexibly model AHP distance and density effects, we use cubic regression
splines; a polynomial spline approximation of a smooth function is a flexible choice
in geospatial modeling. For each covariate $x$, the B-spline formulation of $f$ can
be expressed as a linear combination of $k + l$ basis functions:
\[ f(x_i) = \sum_{m=1}^{k+l} \beta_m \psi_m(x_i) = \boldsymbol{\psi}^T \boldsymbol{\beta} . \tag{4.1} \]
This approach assumes that the true non-linear forms by which these terms affect
the logit of the probability of infection can be approximated by a polynomial
spline of degree $l$ with $k$ inner knots. The columns of the design matrix
$\boldsymbol{\psi}^T$ are given by the B-spline basis functions evaluated at the
observations $x_i$, that is
$\boldsymbol{\psi}^T = (\psi_1(x_i), \dots, \psi_{k+l}(x_i))^T$, while
$\boldsymbol{\beta} = (\beta_1, \dots, \beta_{k+l})^T$ represents the corresponding
spline coefficients.
For the spatial effect $f_s$, we employ a thin-plate regression spline (Green and
Silverman, 1994; Wood, 2004), which is a higher dimensional extension of smoothing
splines with the useful property of not requiring a specification of knot locations;
in addition, thin-plate splines are reasonably computationally efficient and are
isotropic. Thin-plate regression splines can be seen as a Gaussian process with
generalized covariance (Cressie, 1993), characterized in terms of distance $\rho$.
The form in two dimensions is $C(\rho) \propto \rho^{2m-2}\log(\rho)$, where $m$ is
the order of the spline (commonly two). Paciorek (2007) provides a useful comparison
of a variety of models for spatial surfaces. We adopt a thin-plate regression spline
primarily for its computational efficiency with large datasets. Let
$\mathbf{e} = (e_1, \dots, e_N)'$ and $\mathbf{n} = (n_1, \dots, n_N)'$. A
thin-plate spline over two dimensional space indexed by $s = (e, n)$ takes the form
\[ f(s) = \sum_i \delta_i\, \eta(\|s - s_i\|) + (a_1 + a_2 e + a_3 n) , \tag{4.2} \]
where $\delta_i$, $i = 1, \dots, n$, and $a_1$, $a_2$ and $a_3$ are constants,
$\|\cdot\|$ denotes the Euclidean norm, $\eta(r) = \frac{1}{8\pi} r^2 \log(r^2)$
for $r > 0$ and 0 otherwise, subject to the identifiability constraints that
$\sum_i \delta_i = \sum_i \delta_i e_i = \sum_i \delta_i n_i = 0$.
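To estimate the parameters
$\boldsymbol{\beta}^T = (\boldsymbol{\beta}_L^T, \boldsymbol{\beta}_{D(1)}^T,
\boldsymbol{\beta}_{D(2)}^T, \boldsymbol{\beta}_{D(3)}^T, \beta_Z, \delta_1, \dots,
\delta_n, a_1, a_2, a_3)$, we maximize the penalized log-likelihood
\[ l_p(\boldsymbol{\beta}) = l(\boldsymbol{\beta})
   - \frac{1}{2} \sum_j \lambda_j J^{[1]}(f_j)
   - \frac{1}{2} \lambda_s J^{[2]}(f_s) , \tag{4.3} \]
conditional on smoothing parameters $\lambda_j$, operating on the penalty terms
$J^{[1]}$ and $J^{[2]}$ as discussed below. Here $l(\boldsymbol{\beta})$ is the
log-likelihood associated with the Bernoulli response:
\[ l(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log\{p(y_i = 1 \mid \boldsymbol{\beta})\}
   + (1 - y_i) \log\{1 - p(y_i = 1 \mid \boldsymbol{\beta})\} \right] . \tag{4.4} \]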
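Given values of $\lambda_j$, $l_p$ is maximized to find $\hat{\boldsymbol{\beta}}$.
The smoothing parameters $\lambda_j$ control the tradeoff between goodness of fit of
the model and model smoothness, with the second and third terms of equation (4.3)
penalizing sharp changes of the splines. For the one-dimensional spline of (4.1) and
the two-dimensional thin-plate spline (Green and Silverman, 1994) of (4.2), the
penalty terms in equation (4.3) are, respectively,
\[ J^{[1]}(f_j) = \int_{\Re} \left\{ \frac{d^2 f_j(t)}{dt^2} \right\}^2 dt , \quad j = 1, \dots, p , \tag{4.5} \]
\[ J^{[2]}(f_s) = \int\!\!\int_{\Re^2} \left\{ \left( \frac{\partial^2 f_s}{\partial s_x^2} \right)^2
   + 2 \left( \frac{\partial^2 f_s}{\partial s_x \partial s_y} \right)^2
   + \left( \frac{\partial^2 f_s}{\partial s_y^2} \right)^2 \right\} ds_x\, ds_y . \tag{4.6} \]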
The penalized likelihood may be maximized by a penalized iteratively reweighted
least squares (P-IRLS) algorithm. Given the smoothing parameters, at the $k$th
P-IRLS iteration, the following penalized sum of squares is minimized with
respect to $g = X\boldsymbol{\beta}$ to find the $(k+1)$th estimate
$g^{[k+1]} = X\boldsymbol{\beta}^{[k+1]}$:
\[ \sum_{i=1}^{n} \left\{ w_i^{[k]} \left( z_i^{[k]} - g_i \right) \right\}^2
   + \sum_{j=1}^{p} \lambda_j J^{[1]}(f_j) + \lambda_s J^{[2]}(f_s) , \tag{4.7} \]
where $z^{[k]}$ is a vector of pseudodata
$z_i^{[k]} = g_i^{[k]} + g'(\mu_i^{[k]})(y_i - \mu_i^{[k]})$, $W^{[k]}$ is a
diagonal matrix with elements
$w_i^{[k]} = 1/\sqrt{V(\mu_i^{[k]})\, g'(\mu_i^{[k]})^2}$, and $V(\mu_i^{[k]})$ is
proportional to the variance of $Y_i$ according to the current estimate
$\mu_i^{[k]}$. Here $g'(\cdot)$ denotes the derivative of the link function. This
mimics penalized quasi-likelihood in generalized linear models (Breslow and Clayton,
1993), in main part because of the connections with linear representations as in
(4.1) and (4.2), and yields (at least conceptually) straightforward implementation.
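To make the P-IRLS iteration concrete, the following is a minimal NumPy sketch for a Bernoulli response with logit link. It is not the mgcv implementation: the function name `pirls_logistic`, the simulated data, and the simple diagonal penalty matrix `S` (standing in for the smoothing-parameter-weighted spline penalties) are all assumptions made for illustration. For the logit link, $V(\mu) = \mu(1-\mu)$ and $g'(\mu) = 1/\{\mu(1-\mu)\}$, so the squared weights reduce to $\mu(1-\mu)$.

```python
import numpy as np


def pirls_logistic(X, y, S, n_iter=50, tol=1e-10):
    """Penalized IRLS for a Bernoulli response with logit link.

    Maximizes l(beta) - 0.5 * beta' S beta for a fixed penalty matrix S
    (smoothing parameters are assumed to be already absorbed into S).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                               # current linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))              # fitted probabilities
        v = np.clip(mu * (1.0 - mu), 1e-10, None)    # V(mu); also 1 / g'(mu)
        z = eta + (y - mu) / v                       # pseudodata
        XtW = X.T * v                                # X' W with W = diag(1 / (V g'^2))
        beta_new = np.linalg.solve(XtW @ X + S, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Hypothetical data: intercept plus one covariate, ridge-type penalty on the slope.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
prob = 1.0 / (1.0 + np.exp(-(-0.5 + x)))
y = (rng.uniform(size=x.size) < prob).astype(float)
S = np.diag([0.0, 1.0])

beta_hat = pirls_logistic(X, y, S)
print(beta_hat)
```

At convergence the estimate satisfies the penalized score equation $X^T(y - \mu) - S\boldsymbol{\beta} = 0$, which is a convenient check that the weighted least squares updates are maximizing the penalized log-likelihood of (4.3).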
The smoothing parameters $\lambda_j$ are estimated by minimizing the generalized
cross validation score (Craven and Wahba, 1979; Wahba, 1985) for each working
penalized linear model of the P-IRLS iteration; the score has the following form:
\[ \mathrm{GCV} = \frac{n \left\| \sqrt{W} (z - X\boldsymbol{\beta}) \right\|^2}{[n - \mathrm{trace}(A)]^2} , \tag{4.8} \]
where $A = X(X^T W X + S)^{-1} X^T W$ is the influence matrix (Hastie and
Tibshirani, 1990) and $S$ measures the roughness of the smooth functions. In our
context, GCV is used to estimate the smoothing parameters as well as for variable
selection, and may be directly implemented in the penalized iteratively re-weighted
least squares algorithm (Wood, 2004). In Section 4, we discuss its use for model
selection. We note that model fitting may be conveniently implemented in R (R
Development Core Team, 2010) using the mgcv package (Wood, 2006). In the following
subsection, we discuss methods for model assessment.
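The GCV score of (4.8) can be sketched for a single working penalized linear model as follows. This is an illustrative NumPy version under stated assumptions, not the mgcv routine: the function name `gcv_score`, the cubic polynomial design matrix, the unit weights and the penalty matrix `P` are all hypothetical choices for the example.

```python
import numpy as np


def gcv_score(X, z, w, S):
    """GCV score and effective degrees of freedom for a working model.

    X : design matrix, z : pseudodata, w : diagonal of W, S : penalty matrix.
    Returns (n * ||sqrt(W)(z - X beta)||^2 / (n - trace(A))^2, trace(A)).
    """
    n = X.shape[0]
    XtW = X.T * w
    M = np.linalg.solve(XtW @ X + S, XtW)   # (X'WX + S)^{-1} X'W
    beta = M @ z
    edf = float(np.trace(X @ M))            # trace of the influence matrix A
    rss = float(np.sum(w * (z - X @ beta) ** 2))
    return n * rss / (n - edf) ** 2, edf


# Hypothetical working model: cubic polynomial basis, penalty on the
# quadratic and cubic coefficients only.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
X = np.vander(x, 4, increasing=True)
z = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
w = np.ones_like(x)
P = np.diag([0.0, 0.0, 1.0, 1.0])

score_unpenalized, edf_unpenalized = gcv_score(X, z, w, 0.0 * P)
score_penalized, edf_penalized = gcv_score(X, z, w, 5.0 * P)
print(edf_unpenalized, edf_penalized)
```

With no penalty, trace(A) equals the number of columns of the design matrix; adding a penalty lowers the effective degrees of freedom, and the smoothing parameter would be chosen to minimize the resulting score over a range of candidate values.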
Predictive Accuracy: Finding the presence of the alternate host plants within the
landscape may increase the likelihood of finding disease infected trees and may
serve as a key component in determining the severity of CBR hazard in a given area.
Hence, the models are evaluated on their abilities to reproduce the observed
presence/absence of tree infection using the covariate information at each site. The
estimated probability of disease presence, $\hat\pi_i$, may be obtained based on the
fitted values for any postulated model. We may then compare model performance using
receiver operating characteristic (ROC) curves, which display the true-positive rate
(TPR) versus the false-positive rate (FPR) for different thresholds of prediction
probabilities used to create a disease status classification. Here, we define
\[ \mathrm{TPR}(c) = \frac{\sum_{i=1}^{n} I(\hat\pi_i > c \mid y_i = 1)}{\sum_{i=1}^{n} I(y_i = 1)} ; \quad
   \mathrm{FPR}(c) = \frac{\sum_{i=1}^{n} I(\hat\pi_i > c \mid y_i = 0)}{\sum_{i=1}^{n} I(y_i = 0)} , \tag{4.9} \]
where $0 < c < 1$ and $I$ is an indicator function. Thus, TPR is equal to the number
of trees for which the estimated probability of disease presence is greater than $c$,
among trees which are truly infected, divided by the total number of trees which are
truly infected. FPR is interpreted similarly for trees which are free of infection.
The ROC curve plots TPR versus FPR for a grid of selected values of $c$. The greater
the area under the ROC curve (AUC), the better the method discriminates between
true presence and absence. An area of 1 corresponds to perfect discrimination, where
for some specific $c$, trees that are truly diseased ($\hat\pi > c$) or not diseased
($\hat\pi < c$) are so identified.
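The ROC construction can be sketched in a few lines of NumPy. The function names `roc_points` and `auc_trapezoid`, the toy probabilities and the grid of thresholds are assumptions for this example only; the area is obtained with the trapezoidal rule after adding the end points (0, 0) and (1, 1).

```python
import numpy as np


def roc_points(prob, y, thresholds):
    """TPR and FPR at each threshold c, counting predictions with prob > c."""
    prob = np.asarray(prob, dtype=float)
    y = np.asarray(y)
    tpr = np.array([np.mean(prob[y == 1] > c) for c in thresholds])
    fpr = np.array([np.mean(prob[y == 0] > c) for c in thresholds])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    order = np.lexsort((tpr, fpr))               # sort by FPR, then TPR
    f = np.concatenate(([0.0], fpr[order], [1.0]))
    t = np.concatenate(([0.0], tpr[order], [1.0]))
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))


# Toy example: six trees, three infected.
prob = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
y = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(prob, y, np.linspace(0.0, 1.0, 101))
auc = auc_trapezoid(fpr, tpr)
print(round(auc, 4))  # 0.8889
```

In the toy data, eight of the nine (infected, uninfected) pairs are ranked correctly, and the area of 8/9 agrees with that proportion, which is the usual interpretation of the AUC.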
Generalized Akaike's Information Score: A natural way to evaluate models is to
compare generalized Akaike's information criterion (AIC) scores (Wood, 2008), which
are expressed as
$\mathrm{AIC} = D(\hat{\boldsymbol{\beta}}) + 2\,\mathrm{trace}(A)\,\phi$, where
$D(\hat{\boldsymbol{\beta}})$, the model deviance, is defined as
$2\phi(l_{sat} - l)$, $l$ is the log-likelihood of the fitted model and $l_{sat}$
the maximum value for the (saturated) log-likelihood of the model with one parameter
per datum. The dispersion parameter $\phi$ can be estimated by the Pearson estimator
(Wood, 2006).
Generalized Cross Validation Score: The theoretical justification for the use of the
generalized cross validation (GCV) score for model assessment is that the GCV score
is asymptotically a predictive mean square error criterion. This means that for large
$n$, the value of $\lambda$ that minimizes the GCV score will yield a spline
estimate that minimizes the mean square error between the estimate and the true,
unknown model function. Here, GCV is used as a criterion to assess the predictive
accuracy of the models.
4.4 Comparison of Exposure Measures for CBR Infection
To evaluate the impact of the use of different exposure measures, we consider the
following candidate models including either or both density and distance or neither
of these terms:
\[ M1: \; g_i = \alpha + \beta_Z Z_i + f_L(L_i) + \sum_{h=1}^{H} f_{D(h)}\{D_i(h)\} + f_s(e_i, n_i) , \]
\[ M2: \; g_i = \alpha + \beta_Z Z_i + \sum_{h=1}^{H} f_{D(h)}\{D_i(h)\} + f_s(e_i, n_i) , \]
\[ M3: \; g_i = \alpha + \beta_Z Z_i + f_L(L_i) + f_s(e_i, n_i) , \]
\[ M4: \; g_i = \alpha + \beta_Z Z_i + f_s(e_i, n_i) . \]
Figure 4.4 displays the fitted partial spatial effect for the four models, with the
fit from M4 having the highest spatial roughness while M1 seems the smoothest.
The south-west and south-east quadrants have higher values of the underlying spatial
effect, while this is much lower in the north-east and north-west. Figure B.2 in
Appendix B displays the semivariogram estimates (Ribeiro Jr and Diggle, 2001)
obtained from raw values of infection status and the residuals from the fit of M1,
along with Monte Carlo envelopes obtained by repeated random permutation of the
data values on the spatial locations. Positive spatial dependence is evident in the
raw data, exhibiting a correlation length of approximately 20m, while the use of M1
eliminates such spatial correlation. Figure 4.5 displays the covariate effects,
showing that the log odds of infection decreases as nearest distance increases,
whereas the log odds of infection increases as densities increase. The effect of
nearest distance seems to be insignificant when distance is beyond about 5 (on the
order of 3.75m), while the effect of density in the first order neighborhood
achieves a maximum value when density is about 40 AHPs per cell. The effect of AHP
densities in the second and third order neighborhoods also generally increases to a
threshold.

Table 4.1: AIC, GCV and AUC (area under ROC curve) scores and the estimates of
the genetic effect for four competing models.

Model   AIC       GCV     AUC     $\hat\beta_Z$ (SE($\hat\beta_Z$))
M1      7848.51   0.127   0.703   -1.018 (0.141)
M2      7859.66   0.128   0.702   -1.013 (0.141)
M3      7954.91   0.143   0.697   -0.982 (0.138)
M4      8284.34   0.185   0.674   -0.934 (0.135)
Table 4.1 reports the estimates of the genetic effect ($\beta_Z$) for the four
competing models, all of which are significant. The estimated log odds ratio of
infection for (hypothesized) genetically resistant versus non-resistant trees for M1
is -1.018. Across models, the difference in this effect relative to M1 is negligible
for M2, about 0.036 higher for M3, and about 0.084 higher for M4.
To assess which of these models best fits the data, we consider evaluating them
through their predictive accuracy and other measures of goodness of fit. Examination
of all the goodness of fit criteria, shown in Table 4.1, suggests that M1 and M2
are preferable to the other competing models. Hence, alternate host plant densities
appear to provide somewhat more information than the nearest distance as a measure
of host plant proximity. In other words, considering densities as the exposure measure
both enhances model fit and improves prediction accuracy. The worst fit arises from
M4, so some measure of AHP exposure is helpful in predicting disease; here preferably
both distance and density.
In the next section, we assess whether biases in treatment effect may be mani-
fested, and large predictive error induced, by misspecification of the exposure mea-
sures.
Figure 4.4: Estimated spatial terms for models M1, M2, M3 and M4 in an analysis of infection over a plantation (one contour panel per model; axes: East, North). Darker shades of red indicate higher risk, while lighter shades of yellow indicate lower risk values.
Figure 4.5: Estimated partial covariate effects with 95% confidence intervals (dashed lines) from fitting M1 (first row), M2 (second row), M3 (third row), respectively. Panels show the partial effects of L, D(1), D(2) and D(3).
4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
In this simulation experiment, we seek to study the impact of misspecified exposure
measures on estimation of a treatment effect in a similar generalized additive modeling
framework as adopted in the analysis in Section 4.4. We investigate the magnitude
of biases in such treatment estimates under a variety of scenarios.
It is well known that if treatment is independent of the covariate measures, such
as the situation in a randomized experiment where covariates are balanced across
treatment groups, omission of covariates will not induce bias in the estimate of the
treatment effect in a linear model, as these effects are orthogonal. However, Gail
et al. (1984) have shown that this is not necessarily the case with nonlinear
regression. These authors show that when treatment is binary, the omission of a
balanced covariate in logistic regression may lead to bias in the estimate of the
treatment effect, with the bias being towards the null hypothesis of no effect. This
result emphasizes the fact that for logistic models, randomization will not guarantee
unbiased estimates of treatment effects when important covariates are omitted. Here,
we investigate the impact of misspecification of the exposure measures with (1) the
treatment group consisting of the trees from resistant seedlots at sites exactly as
in the CBR study, which were planted randomly across the experimental plot; (2) the
treatment group being randomly selected as half of the trees in the plot. We consider
the second case because the spatial distribution of AHP would have evolved over the
two years of the study since initial planting. The randomization of trees to
different treatments ensures that on average there should be no systematic
differences in observed or unobserved covariates between units assigned to the
different treatments.
The simulation study requires the generation of bootstrap samples, based on the
fitted model under consideration; we also include inflation or deflation of the
partial covariate effects and the site specific structure. The procedure for
generating simulated data is as follows:

(a) Let the partial covariate effects from the fit of M1 to the CBR study be denoted
$\hat f_L(L_i)$, $\hat f_{D(h)}\{D_i(h)\}$, $h = 1, \dots, H$, and
$\hat f_s(e_i, n_i)$; let $\hat\alpha$ denote the fitted intercept. We modify these
effects based on forestry scientific considerations so that they become constant
after certain thresholds. For example, our true partial nearest distance effect,
$f^*_L(L_i)$, decreases until 5, with $f^*_L(L_i) = \hat f_L(L_i)$ for $L_i < 5$
and $f^*_L(L_i) = \hat f_L(5)$ for $L_i \ge 5$, with a smooth transition to the
constant value. Similarly, the true covariate effect for $D(1)$,
$f^*_{D(1)}\{D_i(1)\}$, mimics $\hat f_{D(1)}\{D_i(1)\}$ up to the threshold of 40;
threshold values for $D(2)$ and $D(3)$ are at $\hat f_{D(2)}(15)$ and
$\hat f_{D(3)}(5)$, respectively. These true effects from which data are generated
are displayed in Figure 4.6. The true spatial effect $f^*_s(e_i, n_i)$ is set to
$\hat f_s(e_i, n_i)$, and $\alpha^* = \hat\alpha$.

(b) Calculate the true site specific log odds of infection,
$\log\{\pi^*_i/(1 - \pi^*_i)\} = g^*_i$; if M2 is the true model,
\[ g^*_i = \alpha^* + \beta^*_Z Z_i + \kappa \sum_{h=1}^{H} f^*_{D(h)}\{D_i(h)\} + f^*_s(e_i, n_i) , \tag{4.10} \]
and if M3 is the true model,
\[ g^*_i = \alpha^* + \beta^*_Z Z_i + \kappa f^*_L(L_i) + f^*_s(e_i, n_i) , \tag{4.11} \]
for $i = 1, \dots, n$. Here $\kappa$ represents an inflation factor which modifies
the magnitude of the partial covariate effects. Simulation scenarios S1, S2 and S3
correspond to $\kappa = 1$, 1.5 and 2, respectively. Figure 4.6 displays the
inflated partial covariate effects for these scenarios. The true treatment effect
$\beta^*_Z$ takes values $-2, -1, 0, 1, 2$.

(c) For each combination of the allocation of treatment trees and parameter
settings, generate $y_i^{(b)}$ from a Bernoulli distribution with mean
$1/\{1 + \exp(-g^*_i)\}$, $i = 1, \dots, n$, for $b = 1, \dots, B$ replicates,
$B = 1000$.

(d) Fit models M1-M4 using $y_i^{(b)}$ as the responses to obtain estimates of the
treatment effect, $\hat\beta_Z^{(b)}$, $b = 1, \dots, B$.
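Steps (a) to (c) can be sketched as follows. Everything numeric in this example is hypothetical: the linear stand-in `f_hat_L` for the fitted distance effect, the intercept, the treatment effect, the inflation value `kappa` and the five sites are made up for illustration, and the spatial term is omitted for brevity. The helper `flatten_after` imposes a hard cap at the threshold, whereas the text describes a smooth transition to the constant value.

```python
import numpy as np


def flatten_after(f_hat, threshold):
    """Wrap an effect function so that it stays constant beyond `threshold`."""
    return lambda x: f_hat(np.minimum(x, threshold))


def simulate_replicates(g_star, B, rng):
    """Draw B Bernoulli response vectors with mean 1 / (1 + exp(-g_star))."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(g_star, dtype=float)))
    return (rng.uniform(size=(B, p.size)) < p).astype(int)


# Hypothetical ingredients for an M3-type truth (distance effect only).
rng = np.random.default_rng(2011)
L = np.array([1, 3, 5, 8, 12])                 # nearest-distance orders at five sites
Z = np.array([1, 0, 1, 0, 1])                  # treatment indicators
f_hat_L = lambda v: -0.1 * np.asarray(v, dtype=float)   # stand-in fitted effect
f_star_L = flatten_after(f_hat_L, 5)           # constant once L reaches 5

alpha_star, beta_Z_star, kappa = -1.0, -2.0, 1.5
g_star = alpha_star + beta_Z_star * Z + kappa * f_star_L(L)
y_rep = simulate_replicates(g_star, B=1000, rng=rng)
print(y_rep.shape)  # (1000, 5)
```

Each row of `y_rep` plays the role of one replicate $y^{(b)}$; in step (d) the candidate models would be refitted to every row, which is not shown here because it requires a GAM fitting routine.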
A natural overall measure to consider when assessing the predictive abilities of the
fitted models at each location is the change in the infection probability caused by
treatment versus no treatment over each of the sites in the plot,
\[ \Delta^*_i = 1/[1 + \exp\{-g^*_i(Z_i = 1)\}] - 1/[1 + \exp\{-g^*_i(Z_i = 0)\}] , \tag{4.12} \]
where $g^*_i(Z_i = k)$, a function of $Z_i$, represents the log odds of infection as
in (4.10) or (4.11), using site covariate values except $Z_i$ and with $Z_i$ set to
$k = 0, 1$. The estimated site specific treatment effect is denoted as
$\hat\Delta_i^{(b)} = 1/[1 + \exp\{-\hat g_i^{(b)}(Z_i = 1)\}]
 - 1/[1 + \exp\{-\hat g_i^{(b)}(Z_i = 0)\}]$ in the $b$-th replicate,
$b = 1, \dots, B$. We use the sum of the absolute bias, standard error and root mean
squared error across all the sites to evaluate the performance of the estimate of the
site specific treatment effect under the different models. These are defined as
\[ \mathrm{SBIAS}(\hat\Delta) = \sum_{i=1}^{N} \left| \frac{1}{B} \sum_{b=1}^{B} \hat\Delta_i^{(b)} - \Delta^*_i \right| ; \]
\[ \mathrm{SSE}(\hat\Delta) = \sum_{i=1}^{N} \left\{ \frac{1}{B} \sum_{b=1}^{B}
   \left( \hat\Delta_i^{(b)} - \frac{1}{B} \sum_{b=1}^{B} \hat\Delta_i^{(b)} \right)^2 \right\} ; \]
\[ \mathrm{SRMSE}(\hat\Delta) = \sum_{i=1}^{N} \left\{ \frac{1}{B} \sum_{b=1}^{B}
   \left( \hat\Delta_i^{(b)} - \Delta^*_i \right)^2 \right\} , \tag{4.13} \]
where $\hat\Delta = \{\hat\Delta_i^{(b)}\}$, $i = 1, \dots, N$ and $b = 1, \dots, B$.
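The three summary measures can be computed from a $B \times N$ array of replicate estimates, as in the sketch below. The function name `summary_measures` and the tiny example are assumptions. One further assumption is flagged explicitly: the names "standard error" and "root mean squared error" suggest a square root of each per-site term, so the function applies one by default, and setting `root=False` gives the sums of the unrooted per-site terms instead.

```python
import numpy as np


def summary_measures(est, truth, root=True):
    """Return (SBIAS, SSE, SRMSE) summed over sites.

    est   : (B, N) array, replicate estimates for each of N sites
    truth : length-N array of true site-specific values
    root  : if True, take square roots of the per-site variance and MSE
            (an assumption based on the names of the measures)
    """
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mean_est = est.mean(axis=0)                       # per-site mean over replicates
    sbias = float(np.sum(np.abs(mean_est - truth)))
    spread = np.mean((est - mean_est) ** 2, axis=0)   # per-site variance
    mse = np.mean((est - truth) ** 2, axis=0)         # per-site mean squared error
    if root:
        spread, mse = np.sqrt(spread), np.sqrt(mse)
    return sbias, float(np.sum(spread)), float(np.sum(mse))


# Two replicates (rows) at two sites (columns); values are made up.
est = np.array([[0.1, 0.2],
                [0.3, 0.2]])
truth = np.array([0.2, 0.1])
sbias, sse, srmse = summary_measures(est, truth)
print(round(sbias, 3), round(sse, 3), round(srmse, 3))  # 0.1 0.1 0.2
```

The same function applies unchanged to the predicted infection probabilities, since only the array of estimates and the vector of true values differ.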
To compare predictive accuracy for different models under different settings, define
$\pi^*_i = 1/[1 + \exp\{-g^*_i(Z_i = 0)\}]$ as the true infection probability for the
$i$th tree and $\hat\pi_i^{(b)} = 1/[1 + \exp\{-\hat g_i^{(b)}(Z_i = 0)\}]$ as the
estimated infection probability for the $i$th tree, $i = 1, \dots, N$. We compare
the models in terms of the sum of the absolute bias, the sum of the standard error
and the sum of the root mean squared error of the infection probabilities for all
trees, defined similarly to (4.13).

Figure 4.6: True partial covariate effects for simulating data in scenarios S1, S2 and S3 (panels: partial effects of L, D(1), D(2) and D(3)). The treatment group consists of trees from resistant seedlots. The vertical grey line indicates the threshold of the covariate after which the effect asymptotes.
Figure 4.7: Estimated treatment effect ($\hat\beta_Z$) from fitting M1, M2, M3 and M4 when the true model is M2. Simulation scenarios S1, S2 and S3 correspond to the inflation factor $\kappa$ = 1, 1.5 and 2, respectively. The treatment group consists of trees from resistant seedlots. The horizontal dashed line represents the true treatment effect.
As an illustration, Figure 4.7 displays estimated treatment effects obtained from
fitting M1, M2, M3 and M4 when the true model is M2 and the true treatment effect
is -2, with the sites for the treatment group being identical here to those trees
from resistant seedlots in the CBR study. The results for all the simulation
scenarios are displayed in Figure B.3 in Appendix B. Similarly, Figure B.4 in
Appendix B displays results when the treatment group is randomly selected as half of
the trees. In general, if the true model is M2, bias in estimating the treatment
effect may arise with M3 and M4, as shown in the left-hand panels of Figures B.3 and
B.4. The bias of the treatment effect increases with inflation of the covariate
effects, and for larger values of the inflation factor $\kappa$ can be substantial.
The bias from the use of M3 and M4 also increases as the magnitude of the treatment
effect increases. An important point is that the bias of the estimated treatment
effect under misspecified models tends to be close to zero when the true treatment
effect is zero, but becomes positive when the true treatment effect is negative,
and conversely, negative when the true treatment effect is positive. Hence, when the
true treatment effect is not zero, the bias attenuates the magnitude of the
estimated effect towards zero. Note that, as may be expected, M1 yields unbiased
estimates in all simulation scenarios. The right-hand panels of Figures B.3 and B.4
in Appendix B show that the bias arising from misspecified models when the true
model is M3 is almost negligible in all settings, even with rather large values of
$\kappa$.
Table 4.2 presents SBIAS, SSE and SRMSE for the site specific treatment effects
when the treatment group consists of trees from resistant seedlots as in the CBR
study, while Table B.1 in the Appendix B presents corresponding results for the case where
treatment trees are randomly chosen in this simulation experiment. In general, SBIAS
in the correctly specified model is considerably smaller than for the misspecified mod-
els and close to that for model M1, while SBIAS from the fit of M4 is substantial in
all simulation scenarios. The SSE for M4 is smaller than that for the other models,
since no smoothers are included in M4; also, it estimates fewer parameters than the
other models. In all simulation scenarios, the correctly specified model has the lowest
SRMSE; however, note that the difference between SRMSE values for M2 and M3
when M3 is true is less than the difference when M2 is true, under the same simulation
scenario. The increase in SRMSE observed under misspecified models, over that for
the true model, can be very substantial when the inflation factor and the treatment
effect are large. Note also that when $\beta^*_Z = 0$, i.e. there is no treatment effect, SRMSE
values for the true and misspecified models are comparable.
Table 4.3 shows SBIAS, SSE and SRMSE for the predicted site specific infection
probability when the treatment group consists of trees from resistant seedlots. The
results obtained when the treatment group is randomly selected as half of the trees in
the plot are presented in Table B.2 in the Appendix B. The findings are quite similar
to those for the predicted treatment effects over all sites, except that the errors introduced
are more pronounced, because errors associated with the incorrect covariate are not removed
through the calculation of differences at sites associated with Z = 1 versus Z = 0.
4.6 Summary
The study contrasted the use of nearest distance to alternate host plants versus al-
ternate host plant density as a spatial covariate with additive effects in a study of
the dynamics of CBR on lodgepole pine. Importantly, we utilized the detailed data
set available to evaluate the bias in estimating treatment effects and predictive errors
induced by misspecification of exposure measures. Our results suggest that bias in
estimating the treatment effect may be quite large under model misspecification. As
the true partial covariate effects increase, the bias becomes substantial. In addition,
the bias of the difference in site specific estimated probabilities of infection with and
without treatment over the whole plot tends to be close to zero when the true treatment
effect is zero; when the true treatment effect is not zero, this estimated overall
effect on infection probability tends to be attenuated toward zero under model
misspecification. We recommend further investigation of this phenomenon in a wide range of
studies.
We also note that a model including both distance and density terms is usually
less efficient than the correctly specified model due to the inflated variance induced
by the inclusion of redundant smooth terms; however, this efficiency loss is minimal.

Table 4.2: Rows in which the fitted model equals the true model (shaded in the original, marked * here) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site specific treatment effect over the whole plot when the treatment group consists of the trees from resistant seedlots. The other rows provide the difference between SBIAS, SSE and SRMSE of the site specific treatment effect based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

                          S1 (factor 1)         S2 (factor 1.5)           S3 (factor 2)
β*Z   True  Fit   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
 -2   M2    M1     2.01    4.03    4.65    2.16    4.32    4.92    1.86    4.60    5.02
 -2   M2    M2*   48.68  167.05  177.14   49.78  170.09  180.76   51.86  174.67  185.88
 -2   M2    M3   124.48    0.84   82.14  170.88   -5.18  113.87  223.01  -11.63  154.35
 -2   M2    M4   207.56   -7.69  144.38  268.30  -13.42  193.14  312.45  -21.76  228.28
 -2   M3    M1     0.19   10.31   10.01    0.40   10.17    9.92    0.32   10.73   10.41
 -2   M3    M2    24.19    8.95   20.67   47.11   11.63   35.97   69.38   14.38   52.74
 -2   M3    M3*   51.91  166.26  177.66   50.70  166.84  177.80   50.71  165.08  176.25
 -2   M3    M4    79.00   -6.35   39.09  136.83   -5.44   83.05  193.00   -3.97  131.36
 -1   M2    M1     1.56    1.64    2.05    1.54    2.28    2.66    1.47    2.38    2.70
 -1   M2    M2*   29.28  156.43  160.54   29.60  154.49  158.91   29.89  153.19  157.71
 -1   M2    M3    59.65    2.85   31.07   98.50    0.55   55.92  139.96   -1.48   86.86
 -1   M2    M4    85.44   -2.41   44.51  119.69   -5.68   67.36  153.47   -8.78   93.85
 -1   M3    M1     0.11    2.96    2.94    0.33    2.83    2.84    0.40    2.97    2.97
 -1   M3    M2    10.80    1.45    5.70   24.12    1.88   11.42   36.12    2.15   17.66
 -1   M3    M3*   30.63  155.03  159.36   28.96  155.03  159.09   28.89  154.63  158.79
 -1   M3    M4    39.70   -3.16   13.12   70.72   -3.22   30.49   99.42   -3.09   50.16
  0   M2    M1     2.61    0.17    0.28    0.91   -0.03   -0.02    0.33    0.13    0.12
  0   M2    M2*    5.37  161.95  161.89    1.80  154.35  154.22    2.03  144.73  144.62
  0   M2    M3     7.89    0.99    1.44   14.58    0.76    1.61   25.31    0.80    3.33
  0   M2    M4     7.63    0.17    0.60   14.23    0.00    0.82   24.48   -0.41    1.98
  0   M3    M1    -0.95    0.17    0.15   -0.73   -0.07   -0.09   -0.50    0.02   -0.01
  0   M3    M2     3.26   -0.21   -0.11    2.38   -0.91   -0.80   -2.09   -1.37   -1.50
  0   M3    M3*    3.64  162.71  162.59    6.49  162.74  162.72   11.16  156.17  156.42
  0   M3    M4    -0.25   -0.74   -0.75   -4.22   -1.05   -1.17   -9.73   -1.21   -1.61
  1   M2    M1     1.63    1.83    2.12    2.14    1.99    2.36    1.80    1.78    2.16
  1   M2    M2*   20.34  186.24  188.33   23.57  190.76  193.38   24.79  185.25  188.13
  1   M2    M3    54.28    0.12   28.55   95.35    0.28   57.71  133.36    0.88   86.85
  1   M2    M4    68.04   -6.41   30.15  117.44   -8.95   64.58  163.67  -10.95  100.03
  1   M3    M1     0.18    0.84    0.88    0.12    1.10    1.15    0.14    1.59    1.63
  1   M3    M2     7.96   -0.76    0.22   14.32   -1.40    1.11   21.24   -1.10    3.68
  1   M3    M3*   15.43  188.13  189.24   16.51  186.53  187.84   18.14  183.74  185.34
  1   M3    M4    17.48   -2.54    0.62   30.57   -3.95    3.66   44.51   -4.76    9.10
  2   M2    M1     1.28    2.52    2.79    1.68    3.24    3.59    1.81    3.39    3.80
  2   M2    M2*   24.49  179.58  182.74   27.05  185.80  189.43   29.47  195.42  199.34
  2   M2    M3   106.32   -2.11   71.49  169.35    2.25  126.52  223.55    6.37  173.54
  2   M2    M4   175.61  -12.84  111.44  277.51  -13.40  196.88  363.53  -13.95  270.52
  2   M3    M1     0.79    5.93    5.99    0.66    6.63    6.63    0.56    8.32    8.22
  2   M3    M2    14.01    5.22    8.86   26.36    6.83   15.05   37.13    9.52   22.90
  2   M3    M3*   22.05  168.28  170.85   23.92  171.38  174.21   25.68  174.73  178.08
  2   M3    M4    36.46   -3.95   10.48   65.29   -5.57   25.79   94.64   -7.02   44.46

Table 4.3: Rows in which the fitted model equals the true model (shaded in the original, marked * here) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site specific infection probability over the whole plot when the treatment group consists of the trees from resistant seedlots. The other rows provide the difference between SBIAS, SSE and SRMSE of the site specific infection probability based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

                          S1 (factor 1)         S2 (factor 1.5)           S3 (factor 2)
β*Z   True  Fit   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
 -2   M2    M1     2.44    5.56    6.35    2.69    6.10    6.89    2.36    6.14    6.74
 -2   M2    M2*   65.70  191.98  207.01   63.39  186.45  201.00   61.50  179.97  194.18
 -2   M2    M3   216.97   -3.40  162.96  323.38    0.23  259.32  404.02    3.74  335.96
 -2   M2    M4   355.24  -12.21  277.21  536.31   -4.36  451.24  689.50    1.35  601.89
 -2   M3    M1     0.16   18.52   17.67    0.27   19.28   18.40    0.00   21.24   20.14
 -2   M3    M2    31.85   18.07   34.85   63.22   24.80   59.84   94.07   31.55   86.74
 -2   M3    M3*   68.53  185.30  202.54   67.04  185.34  201.99   67.52  184.48  201.54
 -2   M3    M4   107.19   -8.66   60.43  187.95   -6.56  126.61  267.52   -3.34  198.54
 -1   M2    M1     2.61    4.89    5.77    2.72    6.37    7.15    2.33    6.34    6.90
 -1   M2    M2*   66.21  194.02  209.07   63.60  189.11  203.56   61.67  182.44  196.52
 -1   M2    M3   221.91   -2.13  167.77  330.22    0.10  264.65  410.46    3.53  341.12
 -1   M2    M4   363.74  -10.99  285.12  548.43   -4.43  461.57  702.63    1.20  613.23
 -1   M3    M1     0.14   18.95   18.09    0.34   18.96   18.17   -0.04   21.37   20.28
 -1   M3    M2    33.08   18.17   35.49   65.18   25.00   60.93   96.58   31.96   88.46
 -1   M3    M3*   68.67  188.81  205.79   67.12  188.96  205.34   67.71  187.84  204.69
 -1   M3    M4   110.36   -8.70   62.09  192.74   -6.61  129.57  273.84   -3.45  203.03
  0   M2    M1     2.37    6.30    6.97    2.75    6.23    7.00    2.45    6.58    7.14
  0   M2    M2*   67.36  197.06  212.40   64.81  191.32  206.10   62.64  184.28  198.66
  0   M2    M3   224.78   -3.21  169.34  332.30   -0.09  266.51  411.51    3.31  342.03
  0   M2    M4   369.63  -12.09  289.28  554.14   -4.67  466.49  707.41    0.86  617.30
  0   M3    M1    -0.08   19.36   18.41    0.52   19.56   18.83   -0.05   20.79   19.70
  0   M3    M2    33.37   18.91   36.47   67.29   24.88   62.09   99.05   31.82   89.93
  0   M3    M3*   69.65  191.56  208.70   67.93  191.64  208.15   68.58  190.04  207.08
  0   M3    M4   113.15   -8.89   63.94  197.42   -6.96  132.96  279.93   -3.46  208.23
  1   M2    M1     2.05    6.40    6.96    2.39    6.75    7.37    2.32    5.80    6.37
  1   M2    M2*   67.26  198.28  213.49   64.53  193.36  207.88   63.07  186.22  200.64
  1   M2    M3   223.49   -3.17  168.14  330.13   -0.03  264.05  407.82    3.02  338.28
  1   M2    M4   367.63  -11.95  287.15  550.13   -4.61  461.78  700.76    0.77  610.36
  1   M3    M1    -0.14   19.04   18.05    0.37   18.42   17.64   -0.01   20.67   19.59
  1   M3    M2    34.56   18.54   36.54   67.37   25.20   62.34   99.62   31.75   90.04
  1   M3    M3*   69.28  192.27  209.29   67.72  192.65  209.06   68.22  191.54  208.35
  1   M3    M4   114.45   -8.50   64.84  198.69   -6.65  133.85  281.55   -3.36  209.00
  2   M2    M1     2.44    6.30    7.09    2.70    6.37    7.19    2.37    6.13    6.75
  2   M2    M2*   66.59  195.99  211.10   64.48  190.60  205.29   62.64  185.31  199.61
  2   M2    M3   219.26   -3.05  164.90  323.50    0.56  259.42  400.46    3.26  331.95
  2   M2    M4   359.49  -11.97  280.26  537.98   -4.25  451.74  686.48    0.59  596.89
  2   M3    M1     0.27   18.56   17.68    0.11   18.89   17.94   -0.20   20.74   19.54
  2   M3    M2    32.67   18.43   35.44   64.74   24.65   60.38   96.31   31.46   87.86
  2   M3    M3*   69.58  190.11  207.52   68.20  190.38  207.23   68.91  189.27  206.63
  2   M3    M4   110.97   -8.63   62.57  193.65   -6.54  130.31  275.00   -3.19  204.03

Our findings for the Comandra blister rust infection indicate that the alternate
host plant densities at different orders of neighborhood are important exposure mea-
sures. In our developments, we adopted a discrete approach to account for the effect
of densities at various neighborhoods. More precision may be gained if we model the
densities in a continuous way with density effects decreasing as we move away from
the site.
As well, it would be interesting to compare our model to a traditional infectious
disease model, in which the transition probabilities for infection from the alternate
host plants to the pine trees may be modeled when more information about infec-
tiousness of the alternate host plants is available.
Chapter 5
Exploring Spatial and Temporal
Variations of Cadmium
Concentrations in Pacific Oysters
from British Columbia
5.1 Introduction
Pacific oysters (Crassostrea gigas) are cultivated along the northwest coast of North
America from Washington to Alaska and accumulate levels of cadmium that exceed
some international tolerances. Health guidelines for the European Community set the
tolerance of cadmium concentration at 6.3 µg Cd/g dry weight basis, and Asian export
markets set the tolerance at 13.5 µg Cd/g dry weight basis (Kruzynski, 2004). In
1999, several shipments of oysters cultured in the province of British Columbia (BC),
Canada, were rejected by the Hong Kong Food and Environmental Hygiene Department
for exceeding the 13.5 µg Cd/g dry weight basis import limit. A subsequent
shellfish survey by the Canadian Food Inspection Agency (CFIA) confirmed these
shipments were not unusual and reported a mean cadmium value of 17.7 µg Cd/g
dry weight basis for BC oysters cultured over the broad geographic area (Schallie,
2001). In 2000, Fisheries and Oceans Canada identified possible sources of the
cadmium and concluded that the cadmium in BC oysters is
mainly due to the geology of the area (Kruzynski, 2000), but the source of cadmium
for these oysters is still uncertain. Consequently, there has recently been an enormous
investment in studying the issue.
A primary interest of our analysis is to study how oyster cadmium concentrations
vary over space and time. A second objective is to investigate how cadmium con-
centrations depend on oyster growth over time. We illustrate how spline smoothing
techniques can be employed to address both of these concerns.
5.1.1 The Motivating Datasets
In July 2000, in collaboration with the British Columbia Ministry of Agriculture,
Food and Fisheries (BCMAFF), Simon Fraser University initiated a grow-out study
whereby Pacific oysters from the same seed source (Coast Seafoods, Washington State,
USA) of the same age were deployed to existing oyster culture locations along the
western coast of BC. The deployment dates of all oysters are the same within each
site. Representative locations in both the east (mainland) and outer west (oceanic)
were included. Deployment occurred along lines that were approximately 8 m long
with seeded shells inserted at 30 cm intervals from the surface. Oysters were sampled
approximately bimonthly from December 2002 (D2) to February 2004 (F4), from
shallow (about 1 m depth) to deep (about 7 m depth) positions along the longline.
Oyster shell length (at the maximum length) was recorded in the field at the time
of sampling. The sampled oysters were then transported on ice to the laboratory,
where they were killed and frozen whole until cadmium concentration analysis.
In this chapter, we consider thirteen sites identified in Figure 5.1. In the remainder
of the chapter, the location of each site is denoted as the two-letter abbreviation of
its name with the number in brackets indicating which region the site is from. The
five southernmost sites, located in the region of Barkley Sound (BS) on the west
side of Vancouver Island, are (1)Poett Nook (PN), (1)Useless Inlet-3 (BM), (1)Useless
Inlet-4 (JF), (1)Useless Inlet-5 (PC) and (1)Webster Island/Effingham Inlet (WI).
Moving northward, the site (2)Kendrick (KI) is located in the region of Nootka Sound
(NS), which lies to the north of Barkley Sound. Six sites from the region of Desolation
Sound (DS), located on the west coast of the British Columbia mainland and on
the east side of Vancouver Island, are (3)Orchard Bay (OB), (3)Redonda Bay
(RB), (3)Teakerne Arm (TA), (3)Gorge Harbour (GH), (3)Thors Cove (TC) and
(3)Trevenen Bay (TB). The northernmost site considered is (4)Hecate Cove (HC),
located in the region of Quatsino Sound (QS). Note that the regions of Nootka Sound
(NS) and Quatsino Sound (QS) each include only one site due to sampling
difficulty in the specific geographical location.
5.1.2 Statistical Methods
In this study, we seek to explain the variation in observed cadmium concentrations
and note that since measurements are taken in this study at unequally-spaced time
points, traditional techniques from time series analysis are difficult to implement.
Bendell and Feng (2009) demonstrated the presence of certain temporal variation and
clustering patterns for oyster cadmium concentrations contained in BC oysters via
preliminary visual and simple statistical approaches. The average oyster cadmium
concentrations and oyster length at the same depth at each site are measured at
some discrete time points. Smoothing splines are used to estimate the average oyster
cadmium concentration as a smooth nonparametric function at each site. The average
oyster length will be non-decreasing over time, so a monotone smoothing technique
(Ramsay, 1988) is used to estimate the average oyster length as a monotone function
of time.

[Figure 5.1: map with longitude (−129 to −123) on the x-axis and latitude (48.5 to 51.0) on the y-axis, showing the sampling sites Hecate Cove (HC), Kendrick Inlet (KI), Poett Nook (PN), Useless Inlet (BM/JF/PC), Webster Island (WI), Orchard Bay (OB), Redonda Bay (RB), Teakerne Arm (TA), Gorge Harbour (GH), Thors Cove (TC) and Trevenen Bay (TB), grouped into the regions Quatsino Sound (QS), Nootka Sound (NS), Barkley Sound (BS) and Desolation Sound (DS).]
Figure 5.1: The geographical range of samples of cultured Pacific oysters along the west coast of British Columbia analyzed for oyster cadmium concentrations.
In addition to identifying the temporal trends in cadmium accumulation by oys-
ters, we also sought to assess spatial influences on measured cadmium concentrations.
Specifically, we aim to detect those regions where the oyster cadmium concentration
over the entire deployment time is the highest, and hence provide advice to shellfish
farmers to avoid these areas as possible farming sites. Classical multivariate prin-
cipal components analysis (PCA) has often been adopted to identify these features.
Here we employ a functional alternative, functional principal component analysis
(FPCA), which incorporates smoothing techniques into PCA (Deville, 1974;
Besse and Ramsay, 1986; Hall et al., 2006). Ramsay and Silverman (2005) give an
accessible introduction to the methodology and applications of FPCA.
Oyster cadmium concentration may also depend on explanatory variables that change
over time (i.e., oyster length and growth rate). Bendell and Feng (2009) showed
that cadmium concentrations in oysters were linked to a number of factors, such as
region, depth, growth rate and oyster length, by using a standard multiple linear
regression model. All of these factors played important roles in determining final
tissue concentrations and ultimately the amounts of cadmium transferred to higher
trophic levels. However, the linear relationship between the cadmium concentration
and these covariates may not be a valid assumption. We relax this strict linear
relationship constraint by using a semi-parametric additive model, which allows for
flexible dependence structures (Ferraty and Vieu, 2000; Malfait and Ramsay, 2003;
Chiou et al., 2004; Antoniadis and Sapatinas, 2007; Crambes et al., 2009).
Oysters’ growth rate appears to be associated with temperature and food availability,
which in turn might be linked to the amount of cadmium contained in oysters.
Bendell and Feng (2009) examined the role of growth in influencing oyster
cadmium concentrations by calculating the growth rate as the change in oyster
length divided by the total Julian days (the time interval between time 0, when the
oysters were first deployed, and the sampling date) to adjust for different deployment
times across the sites. The main drawback of their method is that it yields only a
global growth rate, averaged over the whole period between deployment and sampling.
Here, we present an alternative that uses
local (instantaneous) growth rates, which are calculated as the first derivatives of the
monotone smoothing curves for the oyster lengths, alleviating the limitation of the
previous approach.
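The distinction between the global and the local growth rate can be illustrated with a short sketch. The length curve below is purely hypothetical (a saturating curve chosen for illustration, not fitted to the oyster data), and the local rate is approximated by a central difference rather than by differentiating a monotone spline.

```python
import math

def length_curve(t):
    """Hypothetical smooth, increasing shell length (mm) t days after deployment."""
    return 40.0 + 60.0 * (1.0 - math.exp(-t / 200.0))

def global_growth_rate(length, t, t0=0.0):
    """Average rate since deployment: change in length divided by elapsed days."""
    return (length(t) - length(t0)) / (t - t0)

def local_growth_rate(length, t, h=1e-3):
    """Instantaneous rate: central-difference approximation to dl/dt at time t."""
    return (length(t + h) - length(t - h)) / (2.0 * h)

# Compare the two rates at a few sampling days (illustrative values only).
for day in (100, 300, 500):
    print(day,
          round(global_growth_rate(length_curve, day), 4),
          round(local_growth_rate(length_curve, day), 4))
```

For a curve whose growth slows over time, the global rate stays well above the local rate at later sampling dates, which is why the two measures can lead to different conclusions about growth at the time of sampling.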
The rest of the chapter is organized as follows. In Section 5.2 we provide an
overview of some smoothing techniques used in this analysis, including penalized
smoothing, monotone smoothing, functional principal component analysis and semi-
parametric additive modelling. These methods are then illustrated in Section 5.3
using our motivating oyster data where the capacity of the smoothing approaches to
handle several features of this dataset is demonstrated. Important results on spatial
and temporal variation of cadmium concentrations are discussed. Concluding remarks
are provided in Section 5.4.
5.2 Methodology
This section reviews the methods used in this application. The average measurements
of oysters sampled at the same time points are modeled as a function of time at each
site. The growth rates of oysters at each site are estimated as the first derivatives of
the monotone smoothing spline estimator for oyster lengths. We also use functional
principal component analysis to detect the spatial variation of the average oyster
cadmium concentrations. The effects of a number of factors on the oyster cadmium
concentration functions are examined by semi-parametric additive modelling.
5.2.1 Spline Smoothing
Let $y_{sgi}$ represent the measurement of cadmium concentration on the $i$th oyster
sampled at the $g$th time point from site $s$, $s = 1, \ldots, N$, $g = 1, \ldots, G$, $i = 1, \ldots, n_{sg}$,
where $n_{sg}$ denotes the number of oysters sampled from site $s$ at time $g$. Let $x_s(t)$
denote the mean cadmium concentration curve for each site, which is represented as
a linear combination of basis functions
\[ x_s(t) = \sum_{k=1}^{b+d+1} c_{sk}\,\phi_k(t) = \boldsymbol{\phi}(t)^{T}\mathbf{c}_s , \tag{5.1} \]
where $\mathbf{c}_s$ is the vector of B-spline coefficients $c_{sk}$, $k = 1, \ldots, K$, corresponding to the
$k$th spline effect at site $s$, $\boldsymbol{\phi}(t)$ is the vector of cubic B-spline basis functions $\phi_k(t)$, $b$
is the number of break-points or knots, and $d$ is the degree of the polynomial within
each segment; cubic splines ($d = 3$) are often used. There are many equivalent
bases for the spline space but the most popular is the so-called B-spline basis due to
its numerical stability and computational efficiency (de Boor, 1978). To implement
spline smoothing, the basis coefficient vector cs needs to be estimated, for example,
using least squares. The fitted curve is determined by the number and location of
the knots. Ramsay (1988) and Zhou and Shen (2001) discuss how to choose the
number and location of the knots. The drawback of the dependence of splines on
suitable knot placement has been discussed in the literature (Hastie et al., 2001;
Durban et al., 2004). Our study uses a penalized fitting strategy to reduce the
sensitivity of the fit to knot locations (Wood, 2000): one knot is placed at each distinct
time point with measurements, and a roughness penalty term controls the
smoothness of the fitted function. This eliminates the need to choose knot locations
and makes the estimated curves more stable at the cost of some increase in bias. The
basis coefficient vector $\mathbf{c}_s$ is estimated by minimizing the penalized sum of squared
error (PENSSE) loss function,
\[ \mathrm{PENSSE}(\mathbf{c}_s) = \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} \left[y_{sgi} - x_s(t_g)\right]^2 + \lambda_s \int_{t_1}^{t_G} \left[\frac{d^2}{dt^2}x_s(t)\right]^2 dt , \tag{5.2} \]
where $t_g$ represents the actual time at the $g$th time point. The second term penalizes
the roughness of the fitted function. The smoothing parameter $\lambda_s$ for site $s$ determines
the trade-off between the fit to the data and the smoothness of the fitted function.
Ramsay and Dalzell (1991) suggest that $\lambda_s$ can often be chosen by inspection of the curve
smoothness or through an automated procedure such as generalized cross-validation
(GCV) (Craven and Wahba, 1979).
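The role of the smoothing parameter can be illustrated with a minimal sketch. It is a simplified discrete analogue and not the B-spline fit used in this chapter: it assumes equally spaced time points, replaces the integrated squared second derivative by a sum of squared second differences, and uses hypothetical concentration values. All function names are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def penalized_smooth(y, lam):
    """Minimize sum_j (y_j - x_j)^2 + lam * sum_j (x_{j-1} - 2 x_j + x_{j+1})^2.

    The minimizer solves (I + lam * D^T D) x = y, where D is the
    second-difference matrix."""
    n = len(y)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in range(n - 2):
        d = {r: 1.0, r + 1: -2.0, r + 2: 1.0}   # non-zero entries of row r of D
        for i, di in d.items():
            for j, dj in d.items():
                A[i][j] += lam * di * dj
    return solve(A, list(y))

def roughness(x):
    """Sum of squared second differences (discrete roughness penalty)."""
    return sum((x[j - 1] - 2.0 * x[j] + x[j + 1]) ** 2 for j in range(1, len(x) - 1))

# Hypothetical mean concentrations at eight equally spaced sampling times.
y = [2.9, 3.4, 2.2, 1.8, 1.5, 2.1, 2.8, 3.3]
for lam in (0.0, 1.0, 100.0):
    fit = penalized_smooth(y, lam)
    print(lam, round(roughness(fit), 4))
```

With a smoothing parameter of zero the fit reproduces the data, and as the parameter grows the roughness of the fit shrinks toward that of a straight line, which mirrors the trade-off described for (5.2).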
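Taking the derivative of (5.2) with respect to the parameter vector $\mathbf{c}_s$ and solving
for $\mathbf{c}_s$ yields
\[ \hat{\mathbf{c}}_s = \left\{ \sum_{g=1}^{G} n_{sg}\left[\boldsymbol{\phi}(t_g)\boldsymbol{\phi}(t_g)^{T}\right] + \lambda_s \int_{t_1}^{t_G} \frac{d^2}{dt^2}\boldsymbol{\phi}(t)\,\frac{d^2}{dt^2}\boldsymbol{\phi}(t)^{T}\,dt \right\}^{-1} \left[ \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} y_{sgi}\,\boldsymbol{\phi}(t_g) \right] . \]
The estimate for the smooth function is then
\[ \hat{x}_s(t) = \boldsymbol{\phi}(t)^{T}\hat{\mathbf{c}}_s . \tag{5.3} \]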
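The intermediate steps behind this closed form can be written out as follows. Here $\boldsymbol{\phi}(t)$ denotes the basis vector, $\lambda_s$ the smoothing parameter, and $R$ is shorthand introduced only for this derivation (it is not notation used elsewhere in the chapter).

```latex
\begin{align*}
\mathrm{PENSSE}(\mathbf{c}_s)
  &= \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}}
     \bigl[y_{sgi} - \boldsymbol{\phi}(t_g)^{T}\mathbf{c}_s\bigr]^2
     + \lambda_s\,\mathbf{c}_s^{T} R\,\mathbf{c}_s,
  \qquad
  R = \int_{t_1}^{t_G} \frac{d^2\boldsymbol{\phi}(t)}{dt^2}\,
      \frac{d^2\boldsymbol{\phi}(t)^{T}}{dt^2}\,dt, \\
\frac{\partial\,\mathrm{PENSSE}}{\partial \mathbf{c}_s}
  &= -2\sum_{g=1}^{G}\sum_{i=1}^{n_{sg}}
     \boldsymbol{\phi}(t_g)\bigl[y_{sgi} - \boldsymbol{\phi}(t_g)^{T}\mathbf{c}_s\bigr]
     + 2\lambda_s R\,\mathbf{c}_s = \mathbf{0}, \\
\Bigl\{\sum_{g=1}^{G} n_{sg}\,\boldsymbol{\phi}(t_g)\boldsymbol{\phi}(t_g)^{T}
     + \lambda_s R\Bigr\}\,\mathbf{c}_s
  &= \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} y_{sgi}\,\boldsymbol{\phi}(t_g).
\end{align*}
```

The factor $n_{sg}$ appears because the basis vector $\boldsymbol{\phi}(t_g)$ is the same for all $n_{sg}$ oysters sampled at time $t_g$; inverting the matrix in braces gives the estimator stated above.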
5.2.2 Monotone Spline Smoothing
In principle, the average oyster length should not decrease over time, even when the
noise inherent to any data set may suggest otherwise. To account for this, we employ
a monotone smoothing technique (Ramsay, 1988) to model oyster length over time.
A strictly increasing smooth function has a strictly positive first derivative. Let $l_s(t)$
represent the average oyster length function at site $s$. The growth rate $dl_s(t)/dt$ must
be positive, so it is expressed as the exponential of an unconstrained function $W_s(t)$:
$dl_s(t)/dt = \exp[W_s(t)]$. By integrating both sides of this equation, $l_s(t)$ can be written
as
\[ l_s(t) = \beta_{s0} + \int_{t_1}^{t} e^{W_s(u)}\,du . \]
By using this representation, $W_s(u)$ can be flexibly estimated as a linear combination
of basis functions, $W_s(u) = \sum_k c_{sk}\phi_k(u)$, defined similarly to (5.1). Here, we need to
estimate $\beta_{s0}$ and $c_{s1}, \cdots, c_{sK}$. We estimate these parameters by minimizing
\[ \mathrm{PENSSE}(\beta_{s0}, c_{s1}, \cdots, c_{sK}) = \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} \left[\ell_{sgi} - \beta_{s0} - \int_{t_1}^{t_g} e^{W_s(t)}\,dt\right]^2 + \lambda_s \int_{t_1}^{t_G} \left[\frac{d^2 W_s(t)}{dt^2}\right]^2 dt , \]
where $\ell_{sgi}$ represents the length of the $i$th oyster sampled from site $s$ at the $g$th time
point. In this situation, we cannot obtain closed forms for the estimates of $\beta_{s0}$ and
the spline coefficients $c_{s1}, \cdots, c_{sK}$. The Newton-Raphson iteration method is used
to obtain the coefficient estimates. It is easily implemented and converges quickly.
To avoid converging to local minima, one can try different starting values for the basis
coefficients.
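The following sketch illustrates why this representation guarantees monotonicity. It evaluates the length curve as an intercept plus the integral of the exponential of an unconstrained function, using the trapezoidal rule; the function `W` and the intercept of 20 are hypothetical choices for illustration, and no parameters are estimated here.

```python
import math

def monotone_curve(beta0, W, t_grid):
    """Evaluate l(t) = beta0 + integral from t_grid[0] to t of exp(W(u)) du
    at every point of t_grid, using the trapezoidal rule."""
    values = [beta0]
    for a, b in zip(t_grid, t_grid[1:]):
        area = 0.5 * (math.exp(W(a)) + math.exp(W(b))) * (b - a)
        values.append(values[-1] + area)   # area > 0, so the curve can only rise
    return values

def W(u):
    # Hypothetical unconstrained function: it changes sign and is not monotone.
    return math.sin(u) - 0.5 * u

t_grid = [0.05 * j for j in range(201)]    # 0, 0.05, ..., 10
lengths = monotone_curve(20.0, W, t_grid)

# The growth rate at any time is exp(W(t)), which is positive whatever W is.
growth_rates = [math.exp(W(t)) for t in t_grid]
print(lengths[0], round(lengths[-1], 4), round(min(growth_rates), 6))
```

Because the exponential is positive for any real argument, every increment added to the curve is positive, so the fitted length increases regardless of the shape of the unconstrained function.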
5.2.3 Functional Principal Component Analysis
Here we outline the statistical methodology of FPCA, which we use in the following
section to examine the oyster data set. FPCA is a multivariate technique that can
partition variability among the measurements into components of decreasing ‘impor-
tance’. In this application, we treat the distribution of mean curves for the cadmium
concentration, defined in (5.3), as the ‘response’. We subtract the mean curve
$\bar{x}(t) = \sum_{s=1}^{N} \hat{x}_s(t)/N$ from each curve and use $z_s(t) = \hat{x}_s(t) - \bar{x}(t)$ to implement
FPCA, as our interest is primarily in characterizing the deviations of the $\hat{x}_s(t)$ from
the average curve. The first functional principal component weight function $\xi_1(t)$ is
estimated by maximizing the sum of squared functional principal component (FPC) scores
$\sum_s f_{s1}^2$, where
\[ f_{s1} = \int_{t_1}^{t_G} \xi_1(t)\,z_s(t)\,dt, \quad s = 1, \cdots, N , \]
subject to
\[ \|\xi_1\|^2 = \int_{t_1}^{t_G} \xi_1^2(t)\,dt = 1 . \tag{5.4} \]
The second functional principal component weight function $\xi_2(t)$ is estimated by
maximizing the sum of squared FPC scores, subject to the constraint $\|\xi_2\|^2 = 1$ and the
additional constraint
\[ \int_{t_1}^{t_G} \xi_1(t)\,\xi_2(t)\,dt = 0 . \tag{5.5} \]
Other functional principal component weight functions can be estimated in the same
way.
Searching for the mutually orthogonal and normalized weight functions is equivalent
to the eigenanalysis of the variance-covariance function or operator,
defined by
\[ v(t, t') = N^{-1}\sum_{s=1}^{N} z_s(t)\,z_s(t') . \]
Any eigenfunction $\xi_p(t)$, $p = 1, \cdots, P$, satisfies the functional eigenequation
\[ \int_{t_1}^{t_G} v(t, t')\,\xi_p(t')\,dt' = \rho_p\,\xi_p(t) , \]
for an appropriate eigenvalue $\rho_p$. The proportion of the total variation among the $N$
curves accounted for by eigenfunction $\xi_p(t)$ is calculated as $\rho_p/\sum_{p=1}^{P}\rho_p$. In practice,
the first $P_L$ eigenfunctions are chosen such that $\sum_{p=1}^{P_L}\rho_p/\sum_{p=1}^{P}\rho_p$ is greater than some
threshold, because they account for most of the total variation. The first two components
in the oyster data set accounted for much of the variation, providing enough
information regarding the principal sources of variation between mean concentration
curves.
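A discretized version of this eigenanalysis can be sketched as follows. The sketch assumes that all curves are evaluated on a common, equally spaced grid, approximates integrals by sums with grid spacing `h`, and finds only the leading eigenpair by power iteration, which is a substitute numerical method chosen for brevity and not necessarily the one used for the analysis in this chapter. The five curves are hypothetical.

```python
import math

def first_fpc(curves, h, n_iter=200):
    """Leading eigenfunction, eigenvalue, scores and proportion of variance
    for curves sampled on a common grid with spacing h."""
    n, m = len(curves), len(curves[0])
    mean = [sum(c[j] for c in curves) / n for j in range(m)]
    z = [[c[j] - mean[j] for j in range(m)] for c in curves]   # centred curves

    def inner(u, w):
        # Discrete approximation to the integral of u(t) w(t) dt.
        return h * sum(a * b for a, b in zip(u, w))

    def apply_cov(xi):
        # (V xi)(t_j) = integral of v(t_j, t') xi(t') dt', v = N^{-1} sum_s z_s z_s.
        scores = [inner(zs, xi) for zs in z]
        return [sum(scores[s] * z[s][j] for s in range(n)) / n for j in range(m)]

    xi = [1.0 + 0.01 * j for j in range(m)]        # arbitrary non-zero start
    for _ in range(n_iter):
        v = apply_cov(xi)
        norm = math.sqrt(inner(v, v))
        xi = [a / norm for a in v]                 # keep the norm equal to 1
    rho = inner(apply_cov(xi), xi)                 # leading eigenvalue
    scores = [inner(zs, xi) for zs in z]           # FPC scores f_{s1}
    total = sum(inner(zs, zs) for zs in z) / n     # sum of all eigenvalues
    return xi, rho, scores, rho / total

# Hypothetical curves: a dominant and a minor mode of variation about a level of 2.
m, h = 21, 0.05
grid = [h * j for j in range(m)]
a = [-2.0, -1.0, 0.0, 1.0, 2.0]
b = [0.3, -0.3, 0.0, 0.3, -0.3]
curves = [[2.0 + a_s * math.sin(math.pi * t) + b_s * math.cos(2.0 * math.pi * t)
           for t in grid] for a_s, b_s in zip(a, b)]
xi, rho, scores, prop = first_fpc(curves, h)
print(round(rho, 4), round(prop, 4))
```

In this discretization the leading eigenvalue equals the mean of the squared scores, which links the variance-maximization view and the eigenequation view described above.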
To control the smoothness of eigenfunctions, Ramsay and Silverman (2005) introduce
a smoothed PCA approach. The first eigenfunction $\xi_1(t)$ is estimated by
maximizing the penalized sample variance
\[ \mathrm{PCAPSV}(\xi) = \frac{\mathrm{var}\int_{t_1}^{t_G} \xi(t)\,z_s(t)\,dt}{\|\xi\|^2 + \lambda \int_{t_1}^{t_G} \left[\frac{d^2\xi(t)}{dt^2}\right]^2 dt} , \]
subject to $\|\xi_1\|^2 = 1$. The smoothing parameter $\lambda$ controls the trade-off between
the maximization of the sample variance and the roughness of the first eigenfunction.
Each subsequent eigenfunction, $\xi_j(t)$, $j = 2, 3, \cdots$, is estimated by maximizing
the penalized variance $\mathrm{PCAPSV}(\xi)$ subject to two constraints, $\|\xi_j\|^2 = 1$ and the
modified form of orthogonality
\[ \int_{t_1}^{t_G} \xi_j(t)\,\xi_k(t)\,dt + \lambda \int_{t_1}^{t_G} \left[\frac{d^2\xi_j(t)}{dt^2}\right]\left[\frac{d^2\xi_k(t)}{dt^2}\right] dt = 0 \quad \text{for } k = 1, \cdots, j - 1 . \]
Ramsay and Silverman (2005) explain in detail how to find these eigenfunctions by
solving a single eigenvalue problem in Section 9.4 of their book. Silverman (1996)
shows the theoretical advantages of this approach.
5.2.4 Semi-Parametric Additive Model
Oyster cadmium concentration trends for the sampling sites can be explained by
variables such as the depth, region, oyster length and oyster growth rate. A semi-
parametric additive model is proposed to investigate which regions tend to have
greater cadmium concentrations, and which sizes of oysters tend to have high con-
centrations within any region, after adjusting for depth and growth rate effects.
Let $y_{sgi}$ denote the measurement of cadmium concentration on the $i$th oyster
sampled at the $g$th time point from site $s$, $s = 1, \ldots, N$, $g = 1, \ldots, G$, $i = 1, \ldots, n_{sg}$,
where $n_{sg}$ denotes the number of oysters sampled from site $s$ at time $g$. We investigate
the semi-parametric additive model:
\[ \mathrm{M1}: \log(y_{sgi}) = \mu + s_0(t_g) + s_1(l_{si}(t_g)) + s_2(r_{si}(t_g)) + \beta_d I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h I(\mathrm{region}_{si} = h) + \varepsilon_{si}(t_g) , \]
where $\mu$ represents the overall mean, $t_g$ is the actual time at the $g$th time point, and
the nonparametric smooth function $s_0(\cdot)$ represents the overall mean trend. The
growth rate $r_{si}(t)$ is estimated as the first derivative of the monotone smoothing
spline estimator of oyster lengths $l_{si}(t)$; $s_1(\cdot)$ and $s_2(\cdot)$ are nonparametric smooth
functions of observed oyster length and estimated oyster growth rate, respectively.
Here $s_0(\cdot)$, $s_1(\cdot)$ and $s_2(\cdot)$ are not constrained to be of any pre-specified parametric
form. Instead, we model these terms as linear combinations of cubic B-splines:
$s_k(\cdot) = \sum_{j=1}^{p_k} c_{kj}\phi_{kj}(\cdot)$, $k = 0, 1$ and 2, where $c_{kj}$ are coefficients of the smooth. The linear
coefficients, $\beta_d$ and $\beta_h$, are discrete effects for the depth and region in our study,
with depth 1m and region BS set as the reference levels for these factor variables.
We use $\varepsilon_{si}(t)$ to denote
independent errors with mean zero and common variance.
To avoid overfitting, M1 is estimated by penalized maximum likelihood estimation
(Hastie and Tibshirani, 1990). The semi-parametric additive model is estimated by
minimizing the penalized sum of squared error loss function:
\[ \mathrm{PENSSE} = \sum_{s,i,g} \Bigl\{ \log(y_{sgi}) - \Bigl[ \mu + s_0(t_g) + s_1\{l_{si}(t_g)\} + s_2\{r_{si}(t_g)\} + \beta_d I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h I(\mathrm{region}_{si} = h) \Bigr] \Bigr\}^2 + \lambda_0 \int \left[\frac{d^2 s_0(t)}{dt^2}\right]^2 dt + \lambda_1 \int \left[\frac{d^2 s_1(l)}{dl^2}\right]^2 dl + \lambda_2 \int \left[\frac{d^2 s_2(r)}{dr^2}\right]^2 dr , \]
where the smoothing parameters $\lambda_0$, $\lambda_1$, $\lambda_2$ determine the amount of smoothing for
each of the smooth terms. The smoothing parameters are estimated with a computationally
efficient method by applying GCV in generalized ridge regression problems
(Wood, 2004). The above model is fitted using the R package mgcv (Wood, 2004).
Note that although oyster growth rate is estimated as the first derivative of the
monotone smoothing spline estimator of the oyster length, it is only weakly correlated
with the oyster length: the correlation coefficient is -0.17 (a scatter plot is shown
in Figure C.4 in the Appendix C). It is not unusual for the first derivative of a variable
to be nearly uncorrelated with the variable itself. For example, although the velocity of a
moving car is the first derivative of its position function, the velocity need not be
related to the position of the car.
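A small numerical illustration of this point, using a hypothetical periodic curve and not the oyster data: over one full period, a sine curve and its exact derivative have a sample correlation of essentially zero. Note that zero correlation is weaker than independence, since in this example the value of the curve still determines the magnitude of its derivative.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

m = 200
t = [2.0 * math.pi * j / m for j in range(m)]   # one full period, equally spaced
position = [math.sin(u) for u in t]             # hypothetical curve
velocity = [math.cos(u) for u in t]             # its exact first derivative
r = pearson(position, velocity)
print(abs(r) < 1e-9)
```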
Bendell and Feng (2009) estimate linear effects of oyster growth rate and oyster
length and effects of depth and region using a standard multiple linear regression
model
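\[ \mathrm{M2}: \log(y_{sgi}) = \mu + \beta_l\,l_{si}(t_g) + \beta_r\,r_{si}(t_g) + \beta_d I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h I(\mathrm{region}_{si} = h) + \varepsilon_{si}(t_g) , \]
where $\beta_l$ and $\beta_r$ are linear coefficients for oyster length and oyster growth rate,
respectively.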
To compare the two models, we employ the Akaike information criterion (AIC)
(Akaike, 1974), which penalizes the complexity of the model for using a large number
of parameters. The standard multiple linear regression model M2 is less complex
than the semi-parametric additive model M1 and easier to interpret. However, M1
is appealing in terms of flexibility in the trends for the covariate effects and avoids
restricting the trend to a linear form.
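The comparison by AIC can be sketched as below for Gaussian errors with the variance profiled out. The residual sums of squares, sample size and parameter counts are hypothetical numbers chosen for illustration; for a penalized additive model the parameter count would typically be replaced by the effective degrees of freedom, which is an assumption of this sketch and is not stated in the text.

```python
import math

def gaussian_aic(rss, n, n_params):
    """AIC = 2 * n_params - 2 * maximized log-likelihood for a Gaussian model.

    rss      : residual sum of squares on the log(concentration) scale
    n        : number of observations
    n_params : number of estimated parameters (or effective degrees of
               freedom), including one for the error variance"""
    loglik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    return 2.0 * n_params - 2.0 * loglik

# Hypothetical fits to the same n = 500 responses.
aic_linear = gaussian_aic(rss=60.0, n=500, n_params=8)       # analogue of M2
aic_additive = gaussian_aic(rss=50.0, n=500, n_params=20.0)  # analogue of M1
print(round(aic_linear - aic_additive, 2))
```

In this hypothetical case the reduction in residual sum of squares outweighs the penalty for the extra parameters, so the more flexible model has the smaller AIC; with a smaller improvement in fit the ordering would reverse.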
5.3 Results
5.3.1 Data Representation
In this subsection, all plots are provided in the Appendix C. Figure C.1 in the
Appendix C displays smoothed functions of average oyster cadmium concentrations
over time by penalized spline smoothing at each of the experimental sites. During
the initial sampling time in winter 2002 and 2003, the oysters appear to exhibit high
cadmium concentrations, which decrease over the summer of 2003, and subsequently
increase towards winter 2003, though the patterns vary from site to site. Also, the
oyster cadmium concentrations tend to be higher for oysters sampled from 7m depth
than those sampled at 1m.
The curves for the average oyster length are estimated using monotone spline
smoothing, as the average oyster length should not decrease over time. As an il-
lustration, Figure 5.2 displays the fitted curves of the lengths and growth rates for
oysters sampled from site WI at 1m and 7m, respectively. The fitted curves for all
the experimental sites are displayed in Figure C.2 in the Appendix C, which shows
that the oysters sampled at 7m tend to be smaller than those sampled at 1m possibly
due to food availability at different depths. Figure C.3 in the Appendix C shows that
the trends and variations of the growth rate are not aligned across the sites. Note
that the estimated growth rate at the first measurement time from site PN at 1m
exceeds 15, far above the other estimated growth rates, which are all below
about 8. This may be caused by the boundary effect of spline
smoothing, which often yields unreliable estimates at the boundaries. Therefore, this
extreme estimate is removed when exploring the effect of growth rate on cadmium
concentration in the semi-parametric additive model M1 and the standard multiple
linear regression model M2.
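The idea of a non-decreasing length curve whose derivative is read as a growth rate can be sketched as follows. This is not the monotone spline smoother used in this chapter: it swaps in the much simpler pool-adjacent-violators algorithm (isotonic regression) followed by finite differences, and the measurements are hypothetical. Note that the derivative at the two boundary points relies on a one-sided difference, which is one reason boundary estimates tend to be less reliable.

```python
def pava(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block holds [sum of pooled values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool the last two blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def growth_rate(times, lengths):
    """Central differences inside the range, one-sided at the two ends."""
    n = len(times)
    rates = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        rates.append((lengths[hi] - lengths[lo]) / (times[hi] - times[lo]))
    return rates


# Hypothetical average lengths (mm) at six monthly visits; two of them
# decrease because of sampling noise.
months = [0, 1, 2, 3, 4, 5]
raw_lengths = [40, 52, 50, 61, 70, 69]

fitted = pava(raw_lengths)
rates = growth_rate(months, fitted)
print(fitted)
print(rates)
```

Because the fitted curve never decreases, every estimated growth rate is non-negative, which is the property the monotone constraint is meant to guarantee.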
[Figure 5.2: two panels; x-axis: month (D2 to F4, with seasons W2 to W3); left y-axis: length (mm); right y-axis: growth rate; legend: 1m, 7m.]
Figure 5.2: The left panel shows the monotone spline smoothing curves of oyster
lengths for oysters sampled from site WI at 1m (solid curves) and 7m (dashed curves),
respectively. The circles and triangles are the measured oyster lengths sampled at
1m and 7m depth, respectively. The right panel shows the estimated growth rates
for the oysters sampled from site WI at 1m (solid curves) and 7m (dashed curves),
respectively. The growth rates are estimated as the first derivatives of the monotone
smoothing functions of the oyster lengths. The labels above the x-axis represent the
seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the
x-axis represent the months, ranging from December 2002 (D2) to February 2004
(F4).
5.3.2 Spatial Variability
To visualize FPCA results, we examine plots of the overall mean function and the
functions obtained by adding and subtracting a suitable multiple of the eigenfunctions
(Silverman, 1995).
The multiple is $c\sqrt{\lambda_p}$, where $c$ is a correcting factor that scales the
effect of the eigenfunction $\xi_p(t)$ relative to the square root of its eigenvalue
$\lambda_p$. The correcting factor can be set subjectively; here we choose $c = 0.2$
to produce a clear visual impression of the
effect of the principal components on the overall mean function. Figure 5.3 displays the
overall mean curve and the effect of adding (+) and subtracting (−) a multiple of
each of the first two weighting functions for 1m and 7m, respectively. The first FPC
displayed in the upper left panel of Figure 5.3 accounts for about 89% of the variation
of the average cadmium concentration for the oysters sampled at 1m among the thir-
teen experimental sites. The effect of the first eigenfunction is approximately to add
or subtract a constant to cadmium concentration throughout the time period. This
indicates that about 89% of variability between sites is accounted for by the average
cadmium concentration differences. The second eigenfunction explains about 10% of
the variation after accounting for the variability of the first eigenfunction, indicating
that about 10% of the variation among sites is due to the change of cadmium concentration
between winter 2002 and winter 2003. Similar patterns are observed for the variation
of the cadmium concentration for the oysters sampled at 7m, except that the second
eigenfunction represents the change of cadmium concentration after August 2003.
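The construction of the perturbed mean curves can be sketched on a discrete grid. The code below is an illustration under simplifying assumptions: the curves are hypothetical, they are treated as plain vectors (no smoothing, no quadrature weights), and the leading eigenpair of the sample covariance is found by power iteration. The names `leading_component` and `c` are illustrative, with `c` playing the role of the correcting factor chosen as 0.2 in the text.

```python
import math


def leading_component(curves, iters=200):
    """Mean curve, leading eigenvalue, eigenvector and explained share of the
    sample covariance of curves observed on a common grid."""
    n, m = len(curves), len(curves[0])
    mean = [sum(c[k] for c in curves) / n for k in range(m)]
    centred = [[c[k] - mean[k] for k in range(m)] for c in curves]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [float(k + 1) for k in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iters):  # power iteration towards the leading eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * cov[a][b] * v[b] for a in range(m) for b in range(m))
    share = lam / sum(cov[a][a] for a in range(m))
    return mean, lam, v, share


# Hypothetical site curves that differ only by a constant shift in level.
base = [11.0, 10.0, 9.5, 10.0, 11.0]
shifts = [-1.0, -0.5, 0.0, 0.5, 1.0]
curves = [[b + s for b in base] for s in shifts]

mean, lam, v, share = leading_component(curves)
c = 0.2  # correcting factor, as chosen in the text
upper = [mu + c * math.sqrt(lam) * x for mu, x in zip(mean, v)]  # "+" curve
lower = [mu - c * math.sqrt(lam) * x for mu, x in zip(mean, v)]  # "-" curve
print(f"share of variation explained by the first component: {share:.1%}")
```

In this toy case the first component is flat, so adding or subtracting it shifts the mean curve by a constant, which mirrors the interpretation given above for the first eigenfunction.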
[Figure 5.3: four panels; x-axis: month (D2 to F4); y-axis: Cadmium Concentration (µg Cd/g); panel titles: PC 1 (89.3%) and PC 2 (9.9%) for 1m, PC 1 (75.8%) and PC 2 (15%) for 7m.]
Figure 5.3: The top two panels and the bottom two panels display the mean oyster
cadmium concentration curve and the effects of adding (+) and subtracting (-) a small
multiple of each eigenfunction for oysters sampled at 1m and 7m depth, respectively.
The percentages indicate the amount of total spatial variation accounted for by the
functional principal components. The labels of the x-axis represent the months, ranging
from December 2002 (D2) to February 2004 (F4).
One of the important features of FPCA is the ability to examine the scores of
each curve on each eigenfunction (Ramsay and Dalzell, 1996; Ramsay and Silverman,
2002). The bottom two panels in Figure 5.4 display the first FPC score against the
second FPC score for 1m and 7m, respectively. There appear to be some regional
groupings, although there is some overlap between Barkley Sound, Nootka Sound and
Quatsino Sound. One referee suggested using a hierarchical clustering tree for group
classification by considering the $n \times 2$ matrix with row entries taken as the principal
component scores associated with eigenvalues $\lambda_1$ and $\lambda_2$. The top two panels in Figure
5.4 show the clustering trees for the samples at 1m and 7m, respectively. For samples
at 1m, three groups are obtained by cutting the tree at height 25: group 1=TB,
OB, GH and TC, which score highly on the first PC; group 2= RB, TA, JF, BM
and WI, which score moderately on the first PC; group 3=PC, HC, PN and KI,
which score low on the first PC. For samples at 7m, three groups are obtained by
cutting the tree at height 25: group 1=OB, WI, TA and TC, which score highly on
the first PC; group 2= GH, TB, BM and JF, which score moderately on the first
PC; group 3=PN, KI, PC, RB and HC, which score low on the first PC. Therefore,
oysters sampled from the inland sites may have
higher cadmium concentrations on average than those from the coastal sites at 1m depth. In
fact, the form of the groups is completely guided by the first principal component
coordinates because this first axis accounts for about 90% of the entire variability.
This axis can then serve as a pollution index because it sorts the observations by mean
cadmium concentration. Similar grouping patterns are also found for the oysters
sampled at 7m depth, except for site WI, exhibiting high cadmium concentration
on average. It is interesting to note that the oysters sampled from WI are more
influenced by oceanic processes rather than direct anthropogenic influences. Possible
sources at this one site could also be related to forestry practices within this region
e.g. forest canopy removal with resulting erosion of soils naturally high in cadmium.
Cadmium contaminated fertilizer applied during reforestation could also contribute
to observed oyster cadmium concentrations at this site.
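Cutting a clustering tree built on the two score columns can be sketched as follows. The linkage rule is not stated in the text, so complete linkage is assumed here; the scores, the cut height of 6 and the function name `cut_complete_linkage` are hypothetical and chosen only for illustration.

```python
import math


def cut_complete_linkage(points, height):
    """Agglomerative clustering with complete linkage. Merging stops once the
    closest pair of clusters is farther apart than `height`, which matches
    cutting the dendrogram at that height."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        # Complete linkage: largest pairwise distance between two clusters.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > height:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (PC1, PC2) scores for six sites, one row per site.
scores = [(-12.0, 1.0), (-10.0, -2.0), (0.0, 3.0),
          (2.0, -1.0), (11.0, 0.0), (13.0, 2.0)]

groups = cut_complete_linkage(scores, height=6.0)
print(sorted(sorted(g) for g in groups))
```

With these made-up scores the groups separate along the first coordinate, which is the behaviour described above when the first principal component carries most of the variability.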
[Figure 5.4: four panels; top: clustering trees for 1m and 7m (y-axis: Height); bottom: PC1 (89.3%) against PC2 (9.9%) at 1m and PC1 (75.8%) against PC2 (15%) at 7m.]
Figure 5.4: The upper two panels show the hierarchical clustering trees for the $n \times 2$
matrix with row entries taken as the principal component scores associated with
eigenvalues $\lambda_1$ and $\lambda_2$ for 1m and 7m depth, respectively. The bottom two panels show
the first two functional principal component scores at 1m and 7m depth, respectively.
The location of each site is shown by the two-letter abbreviation of its name with
the number in the bracket indicating which region the site is from. The sites from
Barkley Sound are symbolized as the solid squares; the site from Nootka Sound is
symbolized as the solid triangle; the sites from Desolation Sound are symbolized as
the solid circles; and the site from Quatsino Sound is symbolized as the square with
triangle inside. The convex hulls are added to the clusters with the cluster centroid
symbolized as the red circle for each of the clusters, provided the trees are cut at
height 25.
5.3.3 The Semi-Parametric Additive Model
The semi-parametric additive model M1 and the standard multiple linear regression
model M2 are compared in terms of AIC. The AIC is 856.68 for M1 and 936.41 for M2.
A model with a lower AIC score is preferred as it achieves a better balance
of fit and parsimony. As a result, AIC favors M1 over M2. In comparison to M2, M1
uses nonparametric functions of length and growth rate to explain the variation left
after accounting for the effects of depth and region.
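The size of the gap between the two reported AIC values can be made concrete with a short calculation. The sketch below uses only the two AIC values quoted above; the relative likelihood exp(-delta/2) is a common supplementary summary of AIC differences and is added here as an illustration, not as part of the original analysis.

```python
import math

# AIC values reported in the text for the two candidate models.
aic = {"M1": 856.68, "M2": 936.41}

best = min(aic, key=aic.get)
delta = {name: value - aic[best] for name, value in aic.items()}
# Relative likelihood of each model with respect to the best-scoring one.
rel_lik = {name: math.exp(-d / 2.0) for name, d in delta.items()}

for name in aic:
    print(f"{name}: AIC = {aic[name]:.2f}, delta = {delta[name]:.2f}, "
          f"relative likelihood = {rel_lik[name]:.3g}")
```

A difference of roughly 80 AIC units is very large on this scale, so the preference for M1 does not hinge on small changes in how the parameters are counted.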
Figure 5.5 displays the estimated model terms with 95% confidence intervals. The
top left panel shows that the oyster cadmium concentration averaged over thirteen
sites has the lowest value in summer 2003 and relatively higher values in winter 2002
and 2003. A longer series of data would be needed to test if this pattern is consistent
over years. The other two top panels show that the oyster length and growth rate
tend to have nonlinear relationships with oyster cadmium concentration. The aver-
age cadmium concentration decreases with the oyster length, indicating that smaller
oysters have higher cadmium concentration than their longer counterparts. The third
panel shows that the partial growth rate effect appears to decrease with the growth
rate up to about 2.5 mm per month and then becomes statistically nonsignificant.
Note that the confidence interval for the partial effect of growth rate gets wider as
growth rate gets larger, as there are fewer oysters with higher values of growth rate.
Figure 5.5 also displays the comparison for oyster cadmium concentrations be-
tween 1m and 7m and among all the regions after adjusting for the smooth terms
of length and growth rate effects in model M1. On average, oyster cadmium con-
centrations are higher for the oysters sampled at 7m than those from 1m and higher
for those from region DS than those from the other regions with regions BS and QS
having significantly lower oyster cadmium concentrations than the other locations,
confirming the results of functional principal component analysis.
The results in Table 5.1 for model M1 indicate that the functional effects of oyster
length and growth rate on the cadmium concentration are significant and that the average
oyster cadmium concentration at the greater depth of 7m is significantly higher than
that at a depth of 1m by about 0.23 µg Cd/g. Also, the average oyster cadmium
concentration in oysters from region DS is significantly higher than in the other regions
by about 0.47 µg Cd/g. It is worth noting that the model terms that are
common to both models have remarkably similar coefficients, since depth and region
are independent of oyster length and oyster growth rate. To be more precise, the
nonparametric functions of length and growth rate explain the remaining variation of
cadmium concentration conditional on the depth and region variables.
5.4 Concluding Remarks
In this chapter, we investigate the nature of the spatial and temporal distributions
of oyster cadmium concentration within the regions of our study. We illustrate some
statistical methodologies to provide a route for statistical analysis directed at enhanc-
ing biological insight. Those methodologies can readily be applied to a wide variety
of marine ecological data characterized by being irregularly spaced and noisy, while
allowing spatial clustering and potentially interesting and important factors to be
identified.
To handle missing and irregularly spaced temporal measurements, we have
investigated the use of the penalized spline smoothing technique to estimate the mean
curve of oyster cadmium concentration. We also adopt the monotone spline smoothing
method to impose non-decreasing constraints on oyster length curve estimation.
Oyster growth rate is characterized as the first derivative of the estimated curve for
oyster length. To the best of our knowledge, few attempts have been made so far in
the marine ecological literature to impose shape restrictions on the growth curve. The
prime advantage of using these spline techniques is to relax the parametric assumption
on the curve shapes commonly seen in ecological and biological literature.

[Figure 5.5: five panels of partial effects; top row x-axes: month (D2 to F4), length, growth rate; bottom row x-axes: depth (1m, 7m), region (BS, DS, NS, QS).]
Figure 5.5: The estimated partial effects of covariates on cadmium concentration in
oysters in model M1. The top left panel displays the estimated partial effect over
months $s_0(t)$; the top middle panel displays the estimated partial effect of oyster
length $s_1(l_{si}(t))$; and the top right panel displays the estimated partial effect of
oyster growth rate $s_2(r_{si}(t))$. The x-axis tick labels in the top left panel represent
the months ranging from December 2002 (D2) to February 2004 (F4). The growth rate is
calculated as the first derivative of the monotone smoothing curve for the oyster
length. The bottom two panels show the estimated partial effects for each level of
depth and region, respectively. The effects of being at the depth 1m and region BS are
set as baseline due to the default contrasts having been used. In all the panels, the
dashed lines indicate the 95% confidence intervals for the partial effects.

Table 5.1: Results for the semi-parametric additive model (M1) and standard linear
regression model (M2). The effects of being at the depth 1m and region BS are set
as baseline due to the default contrasts having been used. Note "edf" represents the
effective degrees of freedom of the functional parameters.

Semi-parametric Additive Model (M1)
Parametric coefficients:
                    Estimate    Std.Error    p-value
(Intercept)           2.01        0.03       < 0.001
depth 7m              0.23        0.03       < 0.001
region DS             0.48        0.03       < 0.001
region NS             0.24        0.06       < 0.001
region QS            -0.16        0.06         0.01
Approximate significance of smooth terms:
                      edf                    p-value
s0(t)                 3.59                   < 0.001
s1(length)            1.56                   < 0.001
s2(growth rate)       3.36                     0.008

Standard Linear Regression Model (M2)
                    Estimate    Std.Error    p-value
(Intercept)           2.37        0.07       < 0.001
length               -0.003       0.01       < 0.001
growth rate          -0.017       0.001        0.10
depth 7m              0.25        0.03       < 0.001
region DS             0.47        0.03       < 0.001
region NS             0.23        0.06       < 0.001
region QS            -0.17        0.06         0.01
The functional principal component analysis technique is investigated to identify
the spatial variation of the average oyster cadmium concentration over the experimen-
tal sites, which provides a good indication of which sites are similar and might assist
future allocation of sampling efforts. There appears to be some regional grouping,
although there is some overlap between Barkley Sound, Nootka Sound and Quatsino
Sound. Possible cadmium sources (Kruzynski, 2000, 2004) from different regions are
quite different, though. For the oysters sampled from sites located in Desolation
Sound, given their close proximity to terrestrial influences, possible cadmium sources
could include cadmium contaminated phosphate fertilizers and local septic tanks.
Therefore, the spatial clustering pattern suggests an upland continental source ver-
sus a marine source in the coastal area. The oysters sampled from Webster Island,
however, are more influenced by oceanic processes rather than direct anthropogenic
influences. Possible cadmium sources at this site may also be related to forestry
practices within this region e.g. forest canopy removal with resulting erosion of soils
naturally high in cadmium. Cadmium contaminated fertilizer applied during refor-
estation may also contribute to observed oyster cadmium concentrations at this site.
Further investigation with more sampling sites and experiments of longer duration
is needed to test our hypothesis.
In this study, we have seen that the semi-parametric additive model ensures a bet-
ter fit than the standard multiple linear regression model. More importantly, it has
the ability to examine the nonlinear relationships between the cadmium concentra-
tion and a set of covariates when there is no prior knowledge that these relationships
should be linear. The nonparametric term of the overall mean trend for oyster cad-
mium concentrations in the model implies that oysters may have greater cadmium
concentration during the colder winter months than the warmer summer months.
This may be due to phytoplankton blooms in early spring. However, a longer time
series of data is needed to verify this hypothesis properly.
Our model also reveals that oyster cadmium concentrations differ significantly
between the two depths on average, being notably higher at a depth of 7 meters. This
is possibly due to dilution: oysters at 1 meter are heavier than those at 7 meters,
so the amount of metal relative to tissue is greater at 7 meters than at 1 meter,
even when the amount of metal is the same. It may therefore be advisable for
shellfish farmers to avoid harvesting oysters at greater depths. The model also shows
that oyster cadmium concentration decreases as oyster length increases, which may
be because longer oysters have grown more tissue relative to the amount of
metal accumulated.
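The dilution argument is simple arithmetic, sketched below with hypothetical numbers: the cadmium burden and tissue weights are invented for illustration and are not measurements from this study.

```python
def concentration(burden_ug, tissue_g):
    """Cadmium concentration (µg Cd/g) as burden divided by tissue weight."""
    return burden_ug / tissue_g


burden = 20.0  # hypothetical cadmium burden per oyster (µg), same at both depths
shallow = concentration(burden, tissue_g=2.5)  # heavier oyster at 1 m
deep = concentration(burden, tissue_g=2.0)     # lighter oyster at 7 m

print(f"1 m: {shallow:.1f} µg Cd/g, 7 m: {deep:.1f} µg Cd/g")
```

With the same burden, the lighter oyster shows the higher concentration, which is the pattern the dilution explanation predicts.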
If the interest is in the influence of environmental factors (i.e. temperature,
salinity, turbidity, chlorophyll) on the rates at which organisms assimilate and
utilize energy for maintenance, growth and reproduction, we may consider models
that describe the processes involved in oyster growth, such as those constructed with
the dynamic energy budget (DEB) theory (Bourles et al., 2009). Such models are based on
ecophysiological modeling that details the physiological processes and energetics of
an organism in response to environmental fluctuations.
Chapter 6
Future Work
Future work will consider both extensions of methods developed as well as new inter-
esting areas of research.
6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
In the spatial statistics literature, there has been great interest in joint modeling
of spatially and temporally correlated data, to enable simultaneous investigation of
space-time variation. Handcock and Wallis (1994) considered spatiotemporal mod-
elling of winter temperature data but their approach utilized separate spatial analyses
by year using a Gaussian random field model. Guttorp et al. (1994) modelled the
spatial covariances of hourly ozone levels allowing the parameters of the model to
vary as a function of time of day. Carroll et al. (1997) modelled ozone exposure
in Texas, U.S.A. by combining trend terms incorporating temperature, hourly and
monthly effects, with the correlation in the residuals being a non-linear function of
time and space. Brown et al. (2001) considered spatiotemporal modelling of rainfall
data, using a non-separable model in which the spatial field at a specific time was
obtained by ‘blurring’ the field (Brown et al., 2000) at the previous time point. Such
a framework can efficiently model relatively complicated dynamical processes. Waller
et al. (1997) propose spatio-temporal interaction models where the spatial effects are
nested within time, enabling an examination of the evolution of heterogeneity and
spatial patterns over time. MacNab and Dean (2001) propose a generalized additive
mixed model (GAMM), where B-spline smoothing over the temporal dimension pro-
vides a flexible means of accommodating overall time effects as well as region-specific
time effects.
For spatially and temporally correlated multivariate spatial outcomes, the iden-
tification of spatial patterns of disease risk that evolve over time may provide more
insight on true risk variation than single cross-sectional analyses at specific points in
space or time. If multiple diseases are studied simultaneously, incorporating similar
spatio-temporal trends of risk may provide a means to strengthen the evidence for
common sources of influence that reflect underlying shared risk factors. A multivari-
ate spatio-temporal analysis may also lead to improved precision for the estimation of
the underlying disease risks, by borrowing strength from other diseases as well as from
neighboring areas or time points. Richardson et al. (2006) proposed an extension of
Knorr-Held and Best (2001)’s shared component model for space-time modeling of
lung cancer incidence for males and females in Yorkshire, UK. Tzala and Best (2006)
propose an extension of Wang and Wall (2003)’s common spatial factor model for the
analysis of area-level mortality data on six diet-related cancers for Greece, over the
20-year period from 1980 to 1999.
Here, we may extend our common spatial factor model as follows. Let $y_{ijt}$ denote
the count of disease for region $i$, outcome $j$ and time $t$, and let $E_{ijt}$ denote the
expected disease counts, $i = 1, \cdots, n$, $j = 1, \cdots, J$ and $t = 1, \cdots, T$. The
spatial-temporal common factor model can be expressed as
$$\log(\mu_{ijt}) = \alpha_{jt} + \log(E_{ijt}) + \gamma_j b_i + \delta_j g_t + \epsilon_{ijt}, \tag{6.1}$$
where $\alpha_{jt}$ is the overall mean rate for the $j$th component at time $t$; $\mu_{ijt}$ is the
mean disease count for region $i$, outcome $j$ and time $t$; the spatial random effect
$b = (b_1, \cdots, b_n)^T \sim N(0, \Sigma_b)$, $\Sigma_b = \sigma_b^2 (D - W)^{-1}$, and a simple AR(1) model may
describe the temporal random effect, $g_t \mid g_{t-1} \sim N(\rho g_{t-1}, \sigma_g^2)$, where $\sigma_g^2$ is the
temporal dispersion parameter and $\rho$ is the temporal autocorrelation with $|\rho| < 1$ (Waller
et al., 1997); the interaction effect of space and time over multiple outcomes can be
accommodated through $\epsilon = (\epsilon_{111}, \cdots, \epsilon_{nJT})^T \sim N(0, \sigma_\epsilon^2 I)$, where $\sigma_\epsilon^2$ measures the
dispersion over space and time.
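The building blocks of the spatial-temporal common factor model can be sketched in a few lines. Everything below is illustrative and rests on stated assumptions: D is taken as the diagonal matrix of neighbour counts and W as the 0/1 adjacency matrix (their definitions lie outside this section), the AR(1) process is assumed stationary with an autocorrelation strictly inside (-1, 1), and all numerical values and function names are hypothetical.

```python
import math
import random


def car_precision(neighbours):
    """D - W, assuming D holds neighbour counts and W is 0/1 adjacency."""
    n = len(neighbours)
    q = [[0.0] * n for _ in range(n)]
    for i, adjacent in enumerate(neighbours):
        q[i][i] = float(len(adjacent))
        for j in adjacent:
            q[i][j] = -1.0
    return q


def simulate_ar1(T, rho, sigma_g, rng):
    """Draw g_1..g_T with g_t | g_{t-1} ~ N(rho * g_{t-1}, sigma_g^2)."""
    if abs(rho) >= 1:
        raise ValueError("a stationary start requires |rho| < 1")
    g = [rng.gauss(0.0, sigma_g / math.sqrt(1.0 - rho ** 2))]
    for _ in range(T - 1):
        g.append(rng.gauss(rho * g[-1], sigma_g))
    return g


def poisson_mean(E, alpha, gamma_j, b_i, delta_j, g_t, eps=0.0):
    """mu = E * exp(alpha + gamma_j * b_i + delta_j * g_t + eps)."""
    return E * math.exp(alpha + gamma_j * b_i + delta_j * g_t + eps)


neighbours = [[1], [0, 2], [1, 3], [2]]  # four hypothetical regions in a row
Q = car_precision(neighbours)
g = simulate_ar1(T=6, rho=0.7, sigma_g=0.3, rng=random.Random(1))
b = [0.4, 0.1, -0.1, -0.4]  # hypothetical spatial effects, one per region

# Mean counts for one outcome j, indexed as mu[region][time].
mu = [[poisson_mean(E=20.0, alpha=-0.1, gamma_j=1.0, b_i=b_i,
                    delta_j=0.5, g_t=g_t) for g_t in g] for b_i in b]
```

The rows of `Q` sum to zero, so under these assumptions the matrix is singular and the spatial prior is improper unless a constraint is added; this is a property of the sketch's choice of D and W, not a claim about the model as originally specified.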
For zero-heavy data, the model may take the form
$$\log(\mu_{ijt}) = \alpha_{jt} + \gamma_j b_i + \delta_j g_t, \qquad \mathrm{logit}(\pi_{ijt}) = \beta_{jt} + \omega_j d_i + \psi_j q_t, \tag{6.2}$$
where $\alpha_{jt}$ and $\beta_{jt}$ are the overall mean rates for the $j$th component at time $t$ for the
Poisson and excess zero components, respectively. The CAR specifications may be
employed for $b = (b_1, \cdots, b_n)^T$ and $d = (d_1, \cdots, d_n)^T$, while auto-regressive time
series models may describe $g = (g_1, \cdots, g_T)^T$ and $q = (q_1, \cdots, q_T)^T$.
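How the two linear predictors combine into a distribution for the counts can be sketched with a zero-inflated Poisson mixture. This assumes that the zero-heavy model is a zero-inflated Poisson in which the logit predictor gives the probability of an excess zero; the function names and the predictor values 0.5 and -1.0 are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def zip_pmf(y, mu, pi):
    """P(Y = y) when Y is an excess zero with probability pi and
    Poisson(mu) otherwise."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    return (pi if y == 0 else 0.0) + (1.0 - pi) * poisson


mu = math.exp(0.5)    # Poisson mean from a hypothetical log-linear predictor
pi = inv_logit(-1.0)  # excess-zero probability from a hypothetical logit predictor

probs = [zip_pmf(y, mu, pi) for y in range(40)]
mean = sum(y * p for y, p in enumerate(probs))
print(f"P(Y = 0) = {probs[0]:.3f}, E(Y) = {mean:.3f}")
```

Under this mixture the expected count is (1 - pi) * mu, so the excess-zero component lowers the mean while inflating the probability of a zero.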
This model assigns separable spatial and temporal effects; more sophisticated
forms which include spatio-temporal interaction effects may be considered.
6.2 Spatial Modeling for Infectious Disease
In the last decade, there has been a tremendous interest by scientists, profession-
als, politicians and the general public in understanding the transmission of infec-
tious disease to control future outbreaks. This has been motivated, in part, by
severe events. For example, Severe Acute Respiratory Syndrome (SARS) rapidly
spread over 30 countries and regions during a period of less than half a year from
the beginning of 2003, leading to over 8000 infected people and over 700 deaths
(http://www.cdc.gov/ncidod/sars/). The West Nile virus, originating from Uganda,
was found in New York in 1999, and had spread to over 44 states by 2002; in
2003 and 2004, the West Nile virus had infected over 12000 people, killing 350
(http://www.cdc.gov/ncidod/dvbid/westnile/). Another example is the pandemic
H1N1 (swine) influenza, which has killed more than 18,000 people since it appeared
in April 2009. The AIDS (Acquired Immune Deficiency Syndrome) epidemic caused
by HIV (Human Immunodeficiency Virus) has killed more than 25 million people from
1981 to 2006, with approximately 260,000 children having died of AIDS in 2009. All
these diseases have exhibited spatial and temporal patterns. However, many studies
of infectious disease transmission are limited to empirical analyses that are static
in the time dimension and explore relationships between a few environmental variables and
epidemiological data.
Ordinary differential equations (ODEs) are often used to describe the dynam-
ics of infectious diseases by relating a process to its rate of change. For example,
ODEs have been used in HIV studies to model the viral dynamic system for bet-
ter understanding of the pathogenesis of HIV infection and to evaluate antiretroviral
therapy (see Wu, 2005, for a comprehensive review of statistical methods in modeling
HIV viral dynamics). With the availability of residency information, investigators can
incorporate spatial information in the ODE model, as environmental and
epidemiological variables are commonly spatially correlated (e.g. socioeconomic status,
source of income, access to public health services, etc.). Modeling those covariates as having
space-time varying effects may be helpful in the spatial control of disease
transmission. This topic will be the main focus of my research starting in September,
2011.
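As an illustration of how an ODE relates a disease process to its rate of change, the sketch below integrates the classical SIR compartmental model with a fourth-order Runge-Kutta step. The SIR model is a textbook example chosen for brevity, not the HIV viral dynamic system mentioned above, and the rates `beta` and `gamma` and the initial state are hypothetical.

```python
def sir_rhs(state, beta, gamma):
    """Right-hand side of dS/dt, dI/dt, dR/dt for population proportions."""
    s, i, r = state
    infection = beta * s * i
    recovery = gamma * i
    return (-infection, infection - recovery, recovery)


def rk4_step(rhs, state, h, *params):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(k, scale):
        return tuple(x + scale * dx for x, dx in zip(state, k))

    k1 = rhs(state, *params)
    k2 = rhs(shifted(k1, h / 2), *params)
    k3 = rhs(shifted(k2, h / 2), *params)
    k4 = rhs(shifted(k3, h), *params)
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))


beta, gamma = 0.5, 0.2      # hypothetical transmission and recovery rates
state = (0.99, 0.01, 0.0)   # initial susceptible, infectious, recovered shares
h = 0.1                     # step size in days

trajectory = [state]
for _ in range(1000):       # integrate over 100 days
    state = rk4_step(sir_rhs, state, h, beta, gamma)
    trajectory.append(state)

print("final (S, I, R):", tuple(round(x, 3) for x in state))
```

A spatial extension of the kind discussed above would let rates such as `beta` vary by region or depend on spatially correlated covariates; that step is not attempted here.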
6.3 Curve Clustering
In recent years, there has been great interest in clustering spatially correlated func-
tional data. It would be helpful to develop a Bayesian hierarchical mixture model
for clustering spatially correlated functional data. Penalized splines may be used to
model the functional data, where roughness penalties are employed to regularize the
spline fit. Spatial correlation may be introduced into the model by modeling the
classification probabilities as a Markov random field.
The model-free clustering methods, such as k-means and hierarchical clustering
approaches, are easy to use and useful as exploratory tools; however, it is difficult to
make formal inference because of their lack of distributional structure (for example,
the ordering and spacing of the sampling times), and no spatial correlation can be
directly incorporated into those approaches.
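The point that model-free methods ignore the ordering of the sampling times can be demonstrated directly. The sketch below is a minimal k-means on hypothetical curves with a deterministic start (the first k curves serve as initial centres); shuffling the time points of every curve in the same way leaves the clustering unchanged.

```python
def kmeans(curves, k, iters=20):
    """Plain k-means on curves treated as vectors; returns cluster labels."""
    centres = [list(c) for c in curves[:k]]  # deterministic start
    labels = []
    for _ in range(iters):
        # Assign every curve to the nearest centre in squared distance.
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(c, centres[j])))
                  for c in curves]
        # Move each centre to the pointwise mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical curves observed at four time points: two rising, two falling.
curves = [[1.0, 2.0, 3.0, 4.0], [9.0, 8.0, 7.0, 6.0],
          [1.2, 2.1, 2.9, 4.2], [8.8, 8.1, 7.2, 5.9]]

labels = kmeans(curves, k=2)

# Apply the same permutation of the time points to every curve.
perm = [2, 0, 3, 1]
shuffled = [[c[p] for p in perm] for c in curves]
labels_shuffled = kmeans(shuffled, k=2)

print(labels, labels_shuffled)
```

Because squared Euclidean distance is invariant to a common reordering of coordinates, the method cannot distinguish a smooth trend from the same values in scrambled order, which motivates the spline-based models discussed next.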
The model based clustering approach of Yeung et al. (2001) assumes the data
arise from a mixture of multivariate normal distributions but does not acknowledge
time ordering of the data. James and Sugar (2003) and Luan and Li (2003) develop
a mixed effects model for time course gene expression data using B-splines, treating
gene expression levels as continuous functions of time. Similar approaches have been
developed independently by Bar-Joseph et al. (2002) with cubic splines and Gaffney
and Smyth (2003), with random effects regression mixtures. Zhou and Wakefield
(2006) developed a random walk prior to account for time ordering, missing data and
imbalance in the design, with clustering achieved via low dimensional representations
of the functional curves.
Fernandez and Green (2002) proposed a spatial mixture model for clustering;
Broet and Richardson (2006) extended this model to analyze comparative genomic
hybridization (CGH) data by introducing gene specific prior probabilities and allowing
those prior probabilities to be correlated among neighboring genes on a chromosome to
gain more efficiency to identify gene copy number changes. Future work will develop
methodology to simultaneously take the time ordering and spatial correlation into
account. We propose the use of penalized B-splines to account for time order, and a
conditional autoregressive model for classification membership, in order to incorporate
spatial correlations among the temporal curves. The model permits modeling the
variability in the curve data at different levels and provides a probabilistic framework
for clustering.
Bibliography
Agarwal, D. K., Gelfand, A. E., and Citron-Pousty, S. (2002). Zero-inflated models with application to spatial count data. Environmental and Ecological Statistics 9, 341–355.
Ainsworth, L. (2007). Models and Methods for Spatial Data: Detecting Outliers and Handling Zero-Inflated Counts. PhD thesis, Simon Fraser University.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.
Angers, J. and Biswas, A. (2003). A Bayesian analysis of zero-inflated generalized Poisson model. Computational Statistics and Data Analysis 42, 37–46.
Antoniadis, A. and Sapatinas, T. (2007). Estimation and inference in functional mixed-effects models. Computational Statistics and Data Analysis 51, 4793–4813.
Augustin, N., Lang, S., Musio, M., and von Wilpert, K. (2007). A spatial model for the needle losses of pine trees in the forests of Baden-Wurttemberg: An application of Bayesian structured additive regression. Applied Statistics 56(1), 29–50.
Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T., and Simon, I. (2002). A new approach to analyzing gene expression time series data. In Proceedings of the 6th Annual Int'l Conference on Research in Computational Molecular Biology. Washington, D.C., Apr 18-21. pages 39–48.
Bendell, L. I. and Feng, C. X. (2009). Spatial and temporal variations in cadmium concentrations and burdens in the Pacific oyster (Crassostrea gigas) sampled from the Pacific north-west. Marine Pollution Bulletin 58(8), 1137–1143.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer.
Bernardinelli, L. and Montomoli, C. (1991). Empirical Bayes versus fully Bayesian analysis of geographical variation in disease risk. Statistics in Medicine 11, 983–1007.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 192–236.
Besag, J., York, J., and Mollie, A. (1991). Bayesian image-restoration, with two applications in spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics 43, 1–59.
Besse, P. and Ramsay, J. O. (1986). Principal components analysis of sampled functions. Psychometrika 51, 285–311.
Best, N. G., Richardson, S., and Thomson, A. (2005). Bayesian spatial models for disease mapping. Statistical Methods in Medical Research 14, 35–59.
Bourles, Y., Alunno-Bruscia, M., and Pouvreau, S. (2009). Modelling growth and reproduction of the Pacific oyster Crassostrea gigas: Advances in the oyster-DEB model through application to a coastal pond. Journal of Sea Research 62, 62–71.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear models. Journal of the American Statistical Association 88, 9–25.
Broet, P. and Richardson, S. (2006). Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics 22, 911–918.
Brooks, S. P. and Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7, 434–455.
Brown, P. E., Diggle, P. J., Lord, M. E., and Young, P. C. (2001). Space-time calibration of radar rainfall data. Journal of the Royal Statistical Society, Series C: Applied Statistics 50(2), 221–241.
Brown, P. E., Karesen, K. F., Roberts, G. O., and Tonellato, S. (2000). Blur-generated non-separable space-time models. Journal of Royal Statistical Society, Series B: Statistical Methodology 62(4), 847–860.
Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, London.
Carlin, B. P. and Louis, T. A. (2000). Bayes and empirical Bayesian methods for data analysis. London, U.K.: Chapman & Hall/CRC.
Carroll, R., Chen, R., Li, T., Newton, H., Schmiediche, H., Wang, H., and George, E. (1997). Modeling ozone exposure in Harris County, Texas. Journal of American Statistical Association 92, 392–413.
Chiou, J. M., Muller, H. G., and Wang, J. L. (2004). Functional response models. Statistica Sinica 14, 675–693.
Clarke, K. and Green, R. (1988). Statistical design and analysis for a “biological effects” study. Marine Ecology Progress Series 46, 213–226.
Clayton, D. G. and Bernardinelli, L. (1992). Bayesian methods for mapping disease risk. Geographical and Environmental Epidemiology: Methods for Small-area Studies, Elliott P, Cuzick J, English D, Stern R (eds). Oxford University Press: Oxford, pages 205–220.
Clayton, D. G., Bernardinelli, L., and Montomoli, C. (1993). Spatial correlation in ecological analysis. International Journal of Epidemiology 22, 1193–1202.
Clayton, D. G. and Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671–682.
Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. London: Pion.
Congdon, P. (2006). A model framework for mortality and health data classified by age, area and time. Biometrics 61, 269–278.
Crambes, C., Kneip, A., and Sarda, P. (2009). Smoothing splines estimators for functional linear regression. Annals of Statistics 37, 35–72.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377–403.
Cressie, N. (1993). Statistics for Spatial Data. revised ed. Wiley-Interscience, New York.
de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York.
Dean, C. B. and MacNab, Y. C. (2001). Modelling of rates over a hierarchical health administrative structure. The Canadian Journal of Statistics 29, 405–419.
Deville, J. C. (1974). Methodes statistiques et numeriques de l’analyse harmonique. Annales de l’INSEE 15, 7–97.
Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in Sobolev spaces. Springer-Verlag, Berlin, pp. 85–100.
Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine 00, 1–24.
Fernandez, C. and Green, P. (2002). Modeling spatially correlated data via mixtures: a Bayesian approach. Journal of Royal Statistical Society, Series B 64, 805–826.
Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la regression dans des espaces vectoriels semi-normes. C. R. Acad. Sci. Paris Ser. I Math. 330, 403–406.
Gaffney, S. J. and Smyth, P. (2003). Curve clustering with random effects regression mixtures. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Key West, FL.
Gail, M. H., Wieand, S., and Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71(3), 431–444.
Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409.
Gelfand, A. E. (2000). ‘Gibbs Sampling’. Journal of American Statistical Association 95, 1300–1304.
Gelfand, A. E. and Vounatsou, P. (2003). Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics 4, 11–25.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3), 515–533.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Boca Raton: Chapman & Hall/CRC.
Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica 6, 733–807.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
Green, J. and Richardson, S. (2002). Hidden Markov models and disease mapping. Journal of the American Statistical Association 97, 1055–1070.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalised Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
Guttorp, P., Meiring, W., and Sampson, P. (1994). A space-time analysis of ground-level ozone data. Environmetrics 5, 241–254.
Hall, P., Muller, H. G., and Wang, J. L. (2006). Properties of principal components methods for functional and longitudinal data analysis. Annals of Statistics 34, 1493–1517.
Handcock, M. S. and Wallis, J. R. (1994). An approach to statistical spatial-temporal modelling of meteorological fields (with discussion). Journal of American Statistical Association 89, 368–390.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, New York.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.
Hogan, J. W. and Tchernis, R. (2004). Bayesian factor analysis for spatially correlated data, with application to summarizing area-level material deprivation from census data. Journal of the American Statistical Association 99, 314–324.
Jacobi, W. R., Geils, B. W., Taylor, J. E., and Zentz, W. R. (1993). Predicting the incidence of Comandra blister rust on lodgepole pine: site, stand, and alternate-host influences. Phytopathology 83, 630–637.
James, G. and Sugar, C. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association 17, 397–408.
Knorr-Held, L. and Best, N. G. (2001). A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society, Series A 164, 73–85.
Knorr-Held, L. and Rasser, G. (2000). Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56, 13–21.
Kruzynski, G. M. (2000). Cadmium in BC farmed oysters: a review of available data, potential sources, research needs and possible mitigation strategies. Canadian Stock Assessment Secretariat Research Document. Fisheries and Oceans Science.
Kruzynski, G. M. (2004). Cadmium in oysters and scallops: The BC experience. Toxicology Letters 148, 159–169.
Kuhnert, P. M., Martin, T. G., Mengersen, K., and Possingham, H. P. (2005). Assessing the impacts of grazing levels on bird density in woodland habitat: a Bayesian approach using expert opinion. Environmetrics 16, 717–747.
Kulldorff, M., Athas, W., Feuer, E., Miller, B., and Key, C. (1998). Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. American Journal of Public Health 88, 1377–1380.
Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in Medicine 14, 799–810.
Laird, N. M. and Louis, T. A. (1989). Empirical Bayes ranking methods. Journal of Educational Statistics 14, 29–46.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
Lawson, A. B., Biggeri, A. B., Boehning, D., Lesaffre, E., Viel, J. F., Clark, A., Schlattmann, P., and Divino, F. (2000). Disease mapping models: an empirical evaluation. Statistics in Medicine 19, 2217–2241.
Leroux, B. G., Lei, X., and Breslow, N. (1999). Estimation of disease rates in small areas: A new mixed model for spatial dependence. Statistical Models in Epidemiology, the Environment and Clinical Trials, pages 135–178.
Lin, R., Louis, T. A., Paddock, S. M., and Ridgeway, G. (2006). Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis 1, 915–946.
Loh, J. and Zhu, Z. (2007). Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 2, 560–584.
Louis, T. A. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association 79, 393–398.
Louis, T. A. and Shen, W. (1999). Innovations in Bayes and empirical Bayes methods: estimating parameters, populations and ranks. Statistics in Medicine 18, 2493–2505.
Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19(4), 474–482.
MacNab, Y. C. and Dean, C. B. (2000). Parametric bootstrap and penalized quasi-likelihood inference in conditional autoregressive models. Statistics in Medicine 19, 2421–2436.
MacNab, Y. C. and Dean, C. B. (2001). Autoregressive spatial smoothing and temporal spline smoothing for mapping rates. Biometrics 57, 949–956.
MacNab, Y. C., Farrel, P. J., Gustafson, P., and Wen, S. (2004). Estimation in Bayesian disease mapping. Biometrics 60, 865–873.
Malfait, N. and Ramsay, J. O. (2003). The historical functional linear model. The Canadian Journal of Statistics 31(2), 115–128.
Manton, K. G., Woodbury, M. A., Stallard, E., Riggan, W. B., Creason, J. B., and Pellom, A. C. (1989). Empirical Bayes procedures for stabilizing maps of US cancer mortality rates. Journal of the American Statistical Association 84, 637–650.
Marshall, R. J. (1991). Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics 40, 283–294.
Martin, T., Wintle, B., Rhodes, J., Kuhnert, P., Field, S., Low-Choy, S., Tyre, A., and Possingham, H. (2005). Zero tolerance ecology: improving ecological inference by modelling the source of zero observations. Ecology Letters 8, 1235–1246.
Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability 5, 439–468.
Meinguet, J. (1979). Multivariate interpolation of arbitrary points made simple. Journal of Applied Mathematical Physics 30, 292–304.
Meng, X.-L. (1994). Posterior predictive p-values. Annals of Statistics 22, 1142–1160.
Paciorek, C. J. (2007). Computational techniques for spatial logistic regression with large data sets. Computational Statistics and Data Analysis 51, 3631–3653.
R Development Core Team (2010). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science 3, 425–461.
Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B 53, 539–572.
Ramsay, J. O. and Dalzell, C. J. (1996). Functional data analyses of lip motion. Journal of the Acoustical Society of America 99, 3718–3727.
Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York.
Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis, 2nd ed. Springer, New York.
Rathbun, S. L. and Fei, S. (2006). A spatial zero-inflated Poisson regression model for oak regeneration. Environmental and Ecological Statistics 13, 409–426.
Ribeiro Jr, P. J. and Diggle, P. J. (2001). geoR: A package for geostatistical analysis. R-NEWS 1(2), ISSN 1609–3631.
Richardson, S., Abellan, J., and Best, N. (2006). Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risk in Yorkshire (UK). Statistical Methods in Medical Research 15, 385–407.
Richardson, S., Thomson, A., Best, N., and Elliot, P. (2004). Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives 112, 1016–1025.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics 12, 1151–1172.
Schallie, K. (2001). Results of the 2000 survey of cadmium in B.C. oysters. Proceedings of a Workshop on Possible Pathways of Cadmium into The Pacific oyster Crassostrea Gigas as Cultured on the Coast of British Columbia 65, 31–32.
Shen, W. and Louis, T. A. (1998). Triple-goal estimates in two-stage hierarchical models. Journal of the Royal Statistical Society, Series B 60, 455–471.
Silverman, B. W. (1995). Incorporating parametric effects into functional principal components analysis. Journal of the Royal Statistical Society, Series B 57, 673–689.
Silverman, B. W. (1996). Smoothed functional principal components analysis by choice of norm. Annals of Statistics 24, 1–24.
Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS User Manual Version 1.4. Medical Research Council Biostatistics Unit: Cambridge, UK.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64, 583–640.
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software 12(3), 1–16.
Sun, D., Tsutakawa, R. K., and Speckman, P. L. (1999). Posterior distribution of hierarchical models using CAR(1) distributions. Biometrika 86, 341–350.
Tzala, E. and Best, N. (2006). Bayesian latent variable modelling of multivariate spatio-temporal variation in cancer mortality. Biometrics 61, 269–278.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–333.
Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics 13, 1378–1402.
Waller, L. A., Carlin, B. P., Xia, H., and Gelfand, A. E. (1997). Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association 92, 607–617.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
Wang, F. and Wall, M. (2003). Generalized common spatial factor model. Biostatistics 4, 569–582.
Welsh, A. H., Cunningham, R. B., Donnelly, C. F., and Lindenmayer, D. B. (1996). Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling 88, 297–308.
Wood, S. N. (2000). Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society, Series B 62, 413–428.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society, Series B 65, 95–114.
Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association 99, 673–686.
Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman and Hall, London.
Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society, Series B 70, 495–518.
Wright, D. L., Stern, H. S., and Cressie, N. (2003). Loss functions for estimation of extrema with an application to disease mapping. The Canadian Journal of Statistics 31, 251–266.
Wu, H. (2005). Statistical methods for HIV dynamic studies in AIDS clinical trials. Statistical Methods in Medical Research 14, 171–192.
Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001). Validating clustering for gene expression data. Bioinformatics 17, 309–318.
Zhou, C. and Wakefield, J. (2006). A Bayesian mixture model for partitioning gene expression data. Biometrics 62(2), 515–525.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247–259.
Appendix A
Appendix for Chapter 3
Overdispersion of the Multiple Spatial Outcomes in Relationship to Random Effects.
Let $u_j = \gamma_j b + h_j$ denote the random effect vector, where $u_j = (u_{1j}, \cdots, u_{nj})^T$ has elements $u_{ij}$ for the $i$th region and $j$th outcome, $i = 1, \cdots, n$ and $j = 1, \cdots, J$. Note that in the simulation study, we let $\gamma_j = 1$, $j = 1, \cdots, J$, when generating simulated data. The common spatial factor model written as a generalized linear mixed model is $\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + z_{ij}^t u_j$, where $z_{ij}$ is the $i$th column of the random effects design matrix $z_j$. The variance of the response can be partitioned as
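\[
\begin{aligned}
\mathrm{var}(Y_{ij}) &= \mathrm{var}\left[ E(Y_{ij} \mid u_j) \right] + E\left[ \mathrm{var}(Y_{ij} \mid u_j) \right] \\
&= \mathrm{var}\left[ \exp\left\{ \alpha_j + \log(E_{ij}) + z_{ij}^t u_j \right\} \right] + E\left[ \exp\left\{ \alpha_j + \log(E_{ij}) + z_{ij}^t u_j \right\} \right]
\end{aligned}
\tag{A.1}
\]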
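The first term on the right-hand side of (A.1) can be written as:
\[
\mathrm{var}\left[ \exp\left\{ \alpha_j + \log(E_{ij}) + z_{ij}^t u_j \right\} \right] = E\left[ \exp\left\{ 2\left( \alpha_j + \log(E_{ij}) + z_{ij}^t u_j \right) \right\} \right] - \left[ E\left( \exp\left\{ \alpha_j + \log(E_{ij}) + z_{ij}^t u_j \right\} \right) \right]^2
\tag{A.2}
\]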
Consider the moment generating function of $u_j$, evaluated at $z_{ij}$: $M_{u_j}(z_{ij}) = E[\exp(z_{ij}^t u_j)]$.
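The variance of the response is:
\[
\mathrm{var}(Y_{ij}) = \exp\left[ 2\left\{ \alpha_j + \log(E_{ij}) \right\} \right] \left[ M_{u_j}(2 z_{ij}) - \left\{ M_{u_j}(z_{ij}) \right\}^2 \right] + \exp\left[ \left\{ \alpha_j + \log(E_{ij}) \right\} \right] M_{u_j}(z_{ij})
\tag{A.3}
\]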
Note that $u_j \sim \mathrm{MVN}(0, \Sigma(\psi_j))$, where $\psi_j = (\sigma_b^2, \sigma_{h_j}^2)$ and $\Sigma(\psi_j) = \sigma_b^2 (D - W)^{-1} + \sigma_{h_j}^2 I$. Therefore, according to the moment generating function property for the multivariate normal distribution, we have
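\[
\mathrm{var}(Y_{ij}) = \left( \exp\left[ \left\{ \alpha_j + \log(E_{ij}) \right\} \right] E\left[ \exp(z_{ij}^t u_j) \right] \right) \times \left( \exp\left[ \left\{ \alpha_j + \log(E_{ij}) \right\} \right] \left[ \exp\left\{ 1.5\, z_{ij}^t \Sigma(\psi_j) z_{ij} \right\} - \exp\left\{ 0.5\, z_{ij}^t \Sigma(\psi_j) z_{ij} \right\} \right] + 1 \right)
\tag{A.4}
\]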
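The step from (A.3) to (A.4) can be spelled out as follows. This is a sketch under the stated multivariate normal assumption; the shorthand $a$ (the fixed part of the log-mean, that is, the intercept plus the log expected count) and $q$ (the quadratic form of the random effects covariance matrix in $z_{ij}$) is introduced here only for illustration and is not notation from the thesis.

```latex
% Shorthand (illustrative): a = intercept + log(E_ij),  q = z_ij^t Sigma z_ij >= 0.
% For u ~ MVN(0, Sigma) the moment generating function is M_u(t) = exp(t^t Sigma t / 2), so
\[
M_{u}(z_{ij}) = e^{q/2}, \qquad M_{u}(2 z_{ij}) = e^{2q}.
\]
% Substituting into (A.3) and factoring out e^{a} e^{q/2} = E(Y_ij):
\[
\mathrm{var}(Y_{ij})
  = e^{2a}\left( e^{2q} - e^{q} \right) + e^{a} e^{q/2}
  = e^{a} e^{q/2} \left\{ e^{a}\left( e^{3q/2} - e^{q/2} \right) + 1 \right\}.
\]
% Because q > 0 whenever Sigma is positive definite and z_ij is nonzero,
% e^{3q/2} > e^{q/2}, so the factor in braces exceeds one.
```

The coefficients 1.5 and 0.5 in (A.4) are therefore $2 - 0.5$ and $1 - 0.5$, the exponents left after dividing $e^{2q}$ and $e^{q}$ by $e^{q/2}$.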
Note that the first term, $\exp[\{\alpha_j + \log(E_{ij})\}] E[\exp(z_{ij}^t u_j)]$, equals $E(Y_{ij})$, and the second term is greater than one. Therefore, the variance of the response depends on the values of the variance components $\sigma_b^2$ and $\sigma_{h_j}^2$, and the variance of the relative risks is of order $O(\log(E)/E)$.
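As a numerical sanity check on the algebra, the two expressions for the variance, (A.3) and (A.4), can be compared for arbitrary inputs. The sketch below is illustrative only: the function name, the argument names and the numerical values are hypothetical, and the quadratic form of the covariance matrix is passed in directly as a scalar `q` rather than being built from a neighbourhood structure.

```python
import math


def overdispersion_check(alpha, expected, q):
    """Return (mean, variance via (A.3), variance via (A.4)).

    alpha    : intercept on the log scale (illustrative value)
    expected : expected count, must be positive
    q        : quadratic form z' Sigma z, assumed non-negative
    """
    a = alpha + math.log(expected)   # fixed part of the log-mean
    m1 = math.exp(0.5 * q)           # multivariate normal MGF evaluated at z
    m2 = math.exp(2.0 * q)           # multivariate normal MGF evaluated at 2z

    # (A.3): variance written in terms of the moment generating function
    var_a3 = math.exp(2.0 * a) * (m2 - m1 ** 2) + math.exp(a) * m1

    # (A.4): marginal mean times a factor that is at least one
    mean = math.exp(a) * m1
    factor = math.exp(a) * (math.exp(1.5 * q) - math.exp(0.5 * q)) + 1.0
    var_a4 = mean * factor
    return mean, var_a3, var_a4


mean, v3, v4 = overdispersion_check(alpha=-0.2, expected=5.0, q=0.3)
print(mean, v3, v4)
```

With `q = 0` (no random effect variation) both expressions reduce to the mean, which is the Poisson case; any positive `q` gives a variance larger than the mean.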
Appendix B
Appendix for Chapter 4
[Figure B.1 image: map of the plot with axes East and North. Subsection labels, in extraction order: 83 (638), 130 (619), 135 (1017), 80 (215), 129 (604), 109 (842), 188 (1389), 133 (899), 115 (940), 141 (1031), 76 (232), 184 (1504), 128 (387), 119 (521), 134 (781), 31 (0), 56 (22), 109 (234), 112 (369), 89 (403).]
Figure B.1: Summary statistics relating to the number of trees infected and the sum of counts of the alternate host plant per cell (in parentheses) over various subsections of the plot.
[Figure B.2 image: two panels plotting semivariance (y-axis) against distance (x-axis, 0 to 50).]
Figure B.2: The panel on the left displays the semivariogram of the observed infection status for the trees. The panel on the right displays the semivariogram of the residuals from fitting M1.
[Figure B.3 image: grid of panels with y-axis $\beta_Z$ and x-axis fitted model (M1, M2, M3, M4), grouped by simulation scenario (S1, S2, S3).]
Figure B.3: Estimated treatment effects ($\beta_Z$) from fitting M1, M2, M3 and M4 when the true model is M2 (first column) or M3 (second column). Simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively. The treatment group consists of trees from resistant seedlots. The horizontal dashed line represents the true treatment effect.
[Figure B.4 image: grid of panels with y-axis $\beta_Z$ and x-axis fitted model (M1, M2, M3, M4), grouped by simulation scenario (S1, S2, S3).]
Figure B.4: Estimated treatment effects ($\beta_Z$) from fitting M1, M2, M3 and M4 when the true model is M2 (first column) or M3 (second column). Simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively. The treatment group consists of a randomly selected half of the trees. The horizontal dashed line represents the true treatment effect. The scale on the y-axis is the same as that for the corresponding sub-plot in Figure B.3.
Table B.1: Rows in which the fitted model is the true model (marked *) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site-specific treatment effect over the whole plot when the treatment group consists of a randomly selected half of the trees. The other rows provide the difference between the SBIAS, SSE and SRMSE of the predicted site-specific treatment effect based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

                    S1 (factor = 1)       S2 (factor = 1.5)         S3 (factor = 2)
  True  Fit   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
β*_Z = −2
  M2    M1     2.93    5.57    6.43    3.04    5.52    6.38    3.08    5.16    6.03
  M2    M2*   61.57  161.97  177.50   61.92  160.69  176.61   62.27  160.06  176.29
  M2    M3   121.70    4.48   90.28  172.20   -1.22  128.09  230.61   -8.73  175.82
  M2    M4   200.91   -2.08  151.30  266.29   -4.96  208.04  318.97  -14.27  251.74
  M3    M1    -0.08   14.24   13.23    0.77   14.95   14.31    0.31   13.65   12.93
  M3    M2    18.86   12.22   22.72   42.00   16.02   39.89   66.04   19.14   57.66
  M3    M3*   69.00  166.27  184.93   66.89  165.50  183.37   64.51  165.12  182.04
  M3    M4    71.05   -6.18   38.82  130.29   -3.52   86.19  190.31   -0.87  137.38
β*_Z = −1
  M2    M1     1.18    2.59    2.94    1.04    2.73    3.03    1.46    2.68    3.11
  M2    M2*   31.86  107.50  114.18   33.23  108.55  115.66   33.29  107.90  115.08
  M2    M3    61.06    2.05   39.77   98.49   -1.89   66.83  138.57   -4.21  100.35
  M2    M4    92.55   -2.90   59.59  126.59   -7.09   85.83  162.13  -11.33  115.16
  M3    M1     0.23    5.26    5.10    0.43    4.95    4.86    0.28    4.16    4.05
  M3    M2    12.37    3.22    9.29   25.62    4.09   17.05   39.05    4.77   25.27
  M3    M3*   34.15  109.89  117.26   32.63  109.50  116.43   31.49  109.91  116.53
  M3    M4    41.77   -3.51   19.31   74.32   -2.83   42.69  105.71   -1.87   67.68
β*_Z = 0
  M2    M1    -0.79   -0.09   -0.10   -0.61   -0.06   -0.07    1.26   -0.03   -0.01
  M2    M2*    1.02   76.94   76.87    0.94   74.11   74.05    0.29   71.87   71.80
  M2    M3     5.57   -0.14    0.14    8.53   -0.09    0.51   13.15   -0.18    1.07
  M2    M4    12.97   -0.13    1.13   18.83   -0.06    2.53   25.82   -0.18    4.43
  M3    M1    -0.13   -0.29   -0.29   -0.03   -0.07   -0.07   -0.10   -0.11   -0.11
  M3    M2     1.97   -0.29   -0.26   -1.04   -0.12   -0.14    1.64   -0.07   -0.02
  M3    M3*    0.40   79.94   79.86    2.25   78.71   78.66    1.67   81.53   81.46
  M3    M4     3.15   -0.03    0.05    5.80   -0.02    0.36    8.25   -0.02    0.57
β*_Z = 1
  M2    M1     0.82    1.91    2.18    0.95    2.10    2.38    1.08    2.36    2.67
  M2    M2*   19.62   97.57  100.66   22.17  102.55  106.22   23.65  103.47  107.46
  M2    M3    57.41   -1.54   33.94  101.54   -2.37   68.37  141.48   -2.52  102.43
  M2    M4    68.59   -8.42   37.07  120.74  -11.62   78.32  169.39  -14.36  119.94
  M3    M1     0.59    1.68    1.84    0.41    2.11    2.22    0.28    2.38    2.45
  M3    M2     7.35   -0.18    2.09   14.27    0.22    5.29   21.92    0.59    9.19
  M3    M3*   16.02   96.64   98.98   16.50   95.82   98.27   17.34   98.43  101.09
  M3    M4    19.80   -3.13    4.00   35.19   -4.12   11.52   51.12   -5.25   20.40
β*_Z = 2
  M2    M1     1.74    2.40    2.81    1.97    2.82    3.33    2.04    3.51    4.08
  M2    M2*   25.70  115.27  119.48   28.57  121.47  126.26   30.94  124.99  130.34
  M2    M3   111.75   -6.07   74.33  182.65   -3.68  136.16  243.27   -0.31  191.70
  M2    M4   171.02  -16.54  115.60  275.99  -18.66  209.13  365.20  -19.56  292.31
  M3    M1     0.57    7.50    7.56    0.60    8.62    8.63    0.38    9.54    9.41
  M3    M2     8.24    6.53   10.14   17.98    8.37   16.62   28.72   10.87   25.07
  M3    M3*   22.69   99.44  103.23   25.12  104.98  109.27   27.22  109.49  114.37
  M3    M4    30.41   -4.79   10.96   56.10   -6.92   26.68   83.59   -9.43   46.00
Table B.2: Rows in which the fitted model is the true model (marked *) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site-specific infection probability over the whole plot when the treatment group consists of a randomly selected half of the trees. The other rows provide the difference between the SBIAS, SSE and SRMSE of the predicted site-specific infection probability based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

                    S1 (factor = 1)       S2 (factor = 1.5)         S3 (factor = 2)
  True  Fit   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
β*_Z = −2
  M2    M1     2.44    5.56    6.35    2.72    4.66    5.40    2.65    4.27    5.02
  M2    M2*   65.70  191.98  207.01   55.65  148.11  162.18   54.79  144.74  158.73
  M2    M3   216.97   -3.40  162.96  243.34    0.03  196.50  331.61    4.20  279.90
  M2    M4   355.24  -12.21  277.21  398.15   -1.23  335.67  549.35    4.28  482.82
  M3    M1    -0.16   15.05   14.00    0.62   16.39   15.65    0.25   16.05   15.18
  M3    M2    16.85   14.72   23.90   37.75   19.78   40.93   59.55   25.00   59.45
  M3    M3*   58.31  139.35  155.28   57.18  140.52  155.92   55.98  141.26  156.10
  M3    M4    65.09   -5.14   36.79  119.53   -2.87   80.69  175.95   -0.50  129.56
β*_Z = −1
  M2    M1     2.66    4.69    5.40    2.36    4.83    5.42    2.84    4.70    5.50
  M2    M2*   59.48  171.48  185.51   59.65  169.23  183.46   57.91  162.91  176.83
  M2    M3   188.84   -3.51  140.99  295.43   -1.97  236.81  382.25    3.49  320.37
  M2    M4   307.18   -9.53  239.34  483.71   -3.64  408.03  641.55    3.14  563.54
  M3    M1     0.08   17.44   16.52    0.49   17.99   17.24    0.12   17.53   16.63
  M3    M2    25.54   16.42   29.90   52.48   22.42   51.44   79.77   28.65   74.89
  M3    M3*   62.29  163.28  179.16   60.93  163.70  178.99   60.29  163.46  178.58
  M3    M4    88.16   -6.53   49.81  157.13   -4.31  105.94  226.99   -1.09  168.30
β*_Z = 0
  M2    M1     2.48    6.46    7.14    2.51    6.38    7.01    2.55    5.57    6.26
  M2    M2*   66.17  194.07  209.08   64.18  189.53  204.12   62.60  182.54  196.95
  M2    M3   217.14   -2.98  163.14  323.89   -0.80  259.04  403.73    2.95  335.42
  M2    M4   354.06  -11.62  275.97  534.48   -4.84  448.51  686.45    1.00  598.23
  M3    M1    -0.19   19.53   18.48    0.65   19.91   19.20   -0.04   19.06   18.03
  M3    M2    33.48   18.55   36.47   67.98   25.60   63.81  101.13   31.66   91.22
  M3    M3*   69.52  188.35  205.57   68.27  187.90  204.72   67.39  188.02  204.55
  M3    M4   114.46   -8.46   65.88  199.38   -5.88  136.63  284.38   -2.52  212.98
β*_Z = 1
  M2    M1     2.67    6.31    7.32    2.64    6.54    7.45    2.61    6.31    7.15
  M2    M2*   68.38  202.29  217.64   67.43  197.61  212.88   65.51  191.82  206.75
  M2    M3   208.63   -3.13  155.48  306.07   -0.99  242.59  381.38    1.98  312.82
  M2    M4   340.86  -13.32  261.54  505.68   -6.92  419.21  646.47   -2.19  555.98
  M3    M1     0.96   20.31   19.67    0.30   21.07   20.14   -0.06   20.28   19.22
  M3    M2    36.47   19.86   39.42   72.34   26.15   66.55  107.70   31.19   94.57
  M3    M3*   73.23  196.24  214.59   70.94  197.84  215.22   69.54  197.80  214.69
  M3    M4   122.22   -9.22   70.80  211.30   -6.47  144.60  299.97   -3.47  223.81
β*_Z = 2
  M2    M1     2.77    5.78    6.72    2.93    5.77    6.71    2.62    6.19    7.01
  M2    M2*   66.17  187.00  202.62   65.20  184.64  200.07   64.12  181.81  196.98
  M2    M3   170.59   -2.69  126.42  254.82   -0.72  200.40  322.11    1.87  262.17
  M2    M4   278.29  -12.61  210.23  418.11   -7.33  341.65  540.15   -2.75  458.85
  M3    M1     0.66   18.06   17.17    0.77   19.01   18.15    0.17   18.08   17.07
  M3    M2    27.19   17.69   32.48   55.70   22.64   54.36   85.47   27.71   78.20
  M3    M3*   70.75  179.89  198.40   69.84  181.91  200.06   69.24  181.95  199.96
  M3    M4    98.67   -7.94   56.66  173.98   -5.97  117.96  249.48   -3.57  184.60
Appendix C
Appendix for Chapter 5
[Figure C.1 image: thirteen panels, titled (4)HC, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (2)KI, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI, each plotting cadmium concentration (µg Cd/g, 0 to 40) against month (D2 to F4), with season labels W2, S3, S3, F3, W3 and a legend for 1m and 7m depth.]
Figure C.1: The solid curves and dashed curves correspond to penalized spline smoothing curves of cadmium concentrations for oysters from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The circles and triangles are the measured cadmium concentrations for the oysters sampled at 1m and 7m depth, respectively. The horizontal dashed lines represent the European Community guideline (6.3 µg Cd/g on dry weight basis) and Asian guideline (13.5 µg Cd/g on dry weight basis). The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).
[Figure C.2: thirteen panels, titled (4)HC, (2)KI, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI. y-axis: length (mm), with ticks at 0, 50, 100 and 150. x-axis: month, labelled D2 to F4 below the axis and W2, S3, S3, F3, W3 above it. Legend: 1m, 7m.]
Figure C.2: The solid curves and dashed curves correspond to monotone spline smoothing curves of oyster lengths for oysters from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The circles and triangles are the measured oyster lengths sampled at 1m and 7m depth, respectively. The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).
[Figure C.3: thirteen panels, titled (4)HC, (2)KI, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI. y-axis: growth rate, with ticks at 0, 5, 10 and 15. x-axis: month, labelled D2 to F4 below the axis and W2, S3, S3, F3, W3 above it. Legend: 1m, 7m.]
Figure C.3: The solid curves and dashed curves correspond to the estimated growth rates for the oysters sampled from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The growth rates are estimated as the first derivatives of the monotone smoothing functions of the oyster lengths. The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).
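The relation between a monotone length curve and its growth rate can be illustrated with a minimal sketch. It assumes, purely for illustration, that the monotone smooth has the form f(t) = beta0 + beta1 * integral from 0 to t of exp(w(u)) du, a common construction for monotone smoothing; the caption does not state the form actually fitted. Under that assumption the growth rate is f'(t) = beta1 * exp(w(t)), which is positive whenever beta1 > 0. The function w and all coefficient values below are hypothetical and are not estimates from the oyster data.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values).
BETA0 = 40.0   # assumed length (mm) at the first time point
BETA1 = 1.0    # assumed positive scale, so the curve is increasing
C0, C1 = 2.0, -0.15  # assumed coefficients of a linear w(t)


def w(t):
    """Unconstrained function w(t); taken to be linear here for simplicity."""
    return C0 + C1 * t


def growth_rate(t):
    """First derivative of the monotone curve: f'(t) = BETA1 * exp(w(t))."""
    return BETA1 * math.exp(w(t))


def length(t, steps=2000):
    """Monotone curve f(t) = BETA0 + BETA1 * integral_0^t exp(w(u)) du.

    The integral is approximated with the composite trapezoid rule.
    """
    if t == 0:
        return BETA0
    h = t / steps
    total = 0.5 * (math.exp(w(0.0)) + math.exp(w(t)))
    for i in range(1, steps):
        total += math.exp(w(i * h))
    return BETA0 + BETA1 * h * total


# Fifteen monthly time points, matching the D2 to F4 axis of the figures.
months = list(range(15))
lengths = [length(t) for t in months]
rates = [growth_rate(t) for t in months]

for t, y, r in zip(months, lengths, rates):
    print(f"month {t:2d}: length {y:7.2f} mm, growth rate {r:5.2f} mm/month")
```

Because exp(w(t)) is always positive, the growth rate in this sketch can never be negative, whatever shape w takes. That is why differentiating a monotone smooth gives a usable growth-rate curve, whereas an unconstrained smooth could imply shrinking oysters.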
[Figure C.4: single scatter plot. x-axis: length, with ticks at 50, 100 and 150. y-axis: growth rate, with ticks from 0 to 7.]
Figure C.4: The scatter plot of oyster growth rates versus oyster lengths.