MODELS AND METHODS FOR SPATIAL DATA: APPLICATIONS IN EPIDEMIOLOGICAL, ENVIRONMENTAL AND ECOLOGICAL STUDIES

by

Cindy Xin Feng

M.Sc. (Statistics), Simon Fraser University, 2006
B.Sc. (Applied Mathematics), Beijing University of Technology, 2003

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistics and Actuarial Science

© Cindy Xin Feng 2011
SIMON FRASER UNIVERSITY
Summer 2011

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.



APPROVAL

Name: Cindy Xin Feng

Degree: Doctor of Philosophy

Title of Thesis: Models and Methods for Spatial Data: Applications in Epidemiological, Environmental and Ecological Studies

Examining Committee: Dr. Rick Routledge

Chair

Dr. Charmaine Dean, Senior Supervisor

Dr. Jiguo Cao, Supervisor

Dr. Yi Lu, Supervisor

Dr. Paramjit Gill, Internal External Examiner

Dr. Patrick Brown, External Examiner,

University of Toronto

Date Approved:


Abstract

This thesis develops new methodologies for applied problems using smoothing techniques for spatial or spatial-temporal data. We investigate Bayesian ranking methods for identifying high-risk areas in disease mapping, assessing these particularly with regard to their performance in isolating emerging unusual and extreme risks in small areas. We build on information obtained through mapping multivariate outcomes by developing models which investigate whether the multivariate spatial outcomes share the same underlying spatial structure. We develop a general framework for joint modeling of multivariate spatial outcomes for count and zero-inflated count data using a common spatial factor model.

We also study spatial exposure measures, motivated by an analysis of Comandra blister rust infection on lodgepole pine trees from British Columbia. We contrast nearest distance with other, more general, exposure measures and consider the impact of misspecification of exposure measures in a semiparametric generalized additive modeling framework that includes a spatial residual term modeled as a thin-plate regression spline. An appealing feature of the new spatial exposure measures considered is that they can be easily adapted to other problems, such as investigation of the association of asthma incidence with traffic exposures. A common theme in the thesis is the use of functional data analysis, and we specifically adapt such methods for assessing spatial and temporal variation of cadmium concentration in Pacific oysters from British Columbia.

The methodologies developed in these projects widen the toolbox for spatial analysis in applications in epidemiology, and in environmental and ecological studies.


Acknowledgments

I am deeply indebted to my senior supervisor, Dr. Charmaine Dean, for her guidance and support in countless ways. Without her enlightening instruction, great kindness and patience, I could not have completed my thesis. Her support and encouragement were very helpful to me through some very difficult times in my life. I also want to extend my gratitude to my examining committee members, Dr. Rick Routledge, Dr. Jiguo Cao, Dr. Yi Lu, Dr. Paramjit Gill and Dr. Patrick Brown, for their careful reviewing and insightful comments. Their detailed reviews and constructive comments greatly improved the thesis.

Many thanks to the faculty and staff of the Department of Statistics and Actuarial Science at Simon Fraser University for providing me with a wonderful environment for graduate studies. In particular, I would like to thank Dr. Derek Bingham, Dr. Boxin Tang, Dr. Richard Lockhart, Dr. Tim Swartz, Dr. Leilei Zeng, Dr. Joan Hu and Mr. Ian Bercovitz for their support, and Sadika, Kelly and Charlene for their constant help. Thank you also to my fellow graduate students for their company and for growing together with me during my graduate studies.

Finally, and most importantly, I would like to thank my family; I would not have been able to come this far without their care and encouragement.


Contents

Approval
Abstract
Acknowledgments
Contents

1 Introduction
1.1 Overview
1.2 Disease Mapping
1.2.1 Conditional Autoregressive Priors
1.3 Thin-Plate Splines
1.4 Outline of Thesis
1.4.1 Chapter 2
1.4.2 Chapter 3
1.4.3 Chapter 4
1.4.4 Chapter 5
1.4.5 Chapter 6

2 Bayesian Ranking Methods for the Detection of Isolated Hotspots in Disease Mapping
2.1 Introduction
2.2 Bayesian Disease-Mapping Model
2.3 Ranking Methods
2.3.1 Squared error loss function for the isolation measures
2.3.2 Squared error loss function for the ranks of the isolation measures
2.3.3 Weighted rank squared error loss function
2.3.4 Misclassification rates of regions in the top 100γ% group
2.4 Comparison of Rank Estimators of Isolation
2.5 Summary

3 Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
3.1 Introduction
3.2 Models for Joint Count Outcomes
3.2.1 Common Spatial Factor Model for Counts
3.2.2 Common Spatial Factor Model for Zero-Heavy Counts
3.2.3 Model Assessment and Comparison
3.3 Applications
3.3.1 Ontario Lung Cancer
3.3.2 Comandra Blister Rust Tree Infection
3.4 Power of the Test for Common Spatial Structure
3.5 Precision Gains Through Joint Outcome Modeling
3.6 Summary and Concluding Remarks

4 Impact of Misspecifying Spatial Exposures
4.1 Introduction
4.2 Comandra Blister Rust Study
4.3 Flexible Smooth Models
4.4 Comparison of Exposure Measures for CBR Infection
4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
4.6 Summary

5 Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia
5.1 Introduction
5.1.1 The Motivating Datasets
5.1.2 Statistical Methods
5.2 Methodology
5.2.1 Spline Smoothing
5.2.2 Monotone Spline Smoothing
5.2.3 Functional Principal Component Analysis
5.2.4 Semi-Parametric Additive Model
5.3 Results
5.3.1 Data Representation
5.3.2 Spatial Variability
5.3.3 The Semi-Parametric Additive Model
5.4 Concluding Remarks

6 Future Work
6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
6.2 Spatial Modeling for Infectious Disease
6.3 Curve Clustering

Bibliography

A Appendix for Chapter 3
B Appendix for Chapter 4
C Appendix for Chapter 5

Chapter 1

Introduction

1.1 Overview

In recent years, there has been considerable interest in the development and application of spatial models and methods for the analysis of spatially correlated data, which are often geographically referenced, temporally correlated or highly multivariate. For example, one motivating dataset concerns lung cancer in males and females by local health unit in Ontario. The key idea throughout the approaches considered is to take advantage of the correlation structure among observations to perform estimation, prediction, hypothesis testing and other statistical procedures.

We begin with a review of some important concepts which form the building blocks of the methods and models developed in later chapters. This is followed by an outline of the material presented in each of the chapters of the thesis.


1.2 Disease Mapping

Mapping of disease incidence and mortality rates is of primary importance in many epidemiological studies. The use of crude rates to estimate rare disease risks in small areas, such as health units, census areas or administrative zones, is problematic since it accounts neither for the high variability of population sizes across the different regions nor for the spatial patterns of the regions under study. Because of this, interpretation of the spatial distribution of disease based on crude estimates is often misleading. Alternatively, Bayesian inference is widely used to produce stabilized risk maps by borrowing information from neighboring areas across the map. Early developments of disease mapping methodology included the use of empirical Bayes (EB) techniques (Manton et al., 1989; Marshall, 1991; Dean and MacNab, 2001; Breslow and Clayton, 1993) to estimate parameters, and a plug-in approximation of these for posterior inference, which yielded unbiased estimates of the relative risks. However, the variance of these estimates was underestimated, since the EB approach does not account for the uncertainty arising from estimating hyperparameters. In recent years, fully Bayesian (FB) approaches have gained prominence, with inference based on Markov chain Monte Carlo (MCMC) algorithms (Besag et al., 1991; Bernardinelli and Montomoli, 1991; MacNab et al., 2004; Congdon, 2006). Interval estimates of relative risks based on posterior distributions account for the uncertainty associated with the estimates through the hyperprior specifications. Bayesian methods for disease mapping are often termed hierarchical spatial modeling. The first level of the hierarchy describes the distribution of the data; the second level introduces the spatial dependence through random effects which account for heterogeneity in the risks; the lowest level specifies the distribution of the hyperparameters.
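The instability of crude rates in small areas can be illustrated with a short numerical sketch. It assumes, purely for illustration, that the observed count in a region is Poisson with mean equal to the expected count times the relative risk, so that the crude estimate (observed over expected) has standard deviation equal to the square root of the relative risk divided by the expected count. The function name and the expected counts are hypothetical and are not taken from the thesis.

```python
import math

def smr_sd(expected, relative_risk=1.0):
    """Standard deviation of the crude ratio O/E when O ~ Poisson(expected * relative_risk)."""
    # Var(O / E) = expected * relative_risk / expected**2 = relative_risk / expected
    return math.sqrt(relative_risk / expected)

# Hypothetical expected counts, from a very small region to a large one.
for expected in (0.5, 5.0, 50.0, 500.0):
    print(f"E = {expected:6.1f}   sd of crude ratio = {smr_sd(expected):.3f}")
```

Under these assumptions the crude estimate for a region with a small expected count is far noisier than that for a large region, which is the motivation for borrowing information across neighboring areas.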


1.2.1 Conditional Autoregressive Priors

One of the most popular choices for the distribution of the random effects in hierarchical spatial modeling is the intrinsic conditional autoregressive (CAR) model (Besag et al., 1991). Let $W = (w_{ij})$, $i = 1, \cdots, n$ and $j = 1, \cdots, n$, denote the so-called spatial proximity matrix for $n$ regions, where $w_{ii} = 0$, and $w_{ij} = 1$ if the $i$th and the $j$th areas are neighbours (denoted $j \sim i$) and 0 otherwise. The conditional expectation and variance are

$$E(b_i \mid b_{j \neq i}) = \frac{1}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \tag{1.1}$$

where $b_{j \neq i}$ represents the random effects for all regions other than region $i$, and $b = (b_1, \cdots, b_n)$ has joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - W)^{-1}, \tag{1.2}$$

where $D = \mathrm{diag}(w_{1+}, \cdots, w_{n+})$ and $w_{i+} = \sum_j w_{ij}$. The forms (1.1) and (1.2) define the intrinsic CAR (Besag et al., 1991) uniquely. With this model, local smoothing can be achieved, as $E(b_i \mid b_{j \neq i})$ is the local risk average over the neighborhood of region $i$ and $\mathrm{Var}(b_i \mid b_{j \neq i})$ is scaled by the inverse of the number of neighbors, so that the greater the number of neighbors, the smaller the variance. However, the intrinsic CAR prior is improper, since the matrix $(D - W)$ is singular. This impropriety can be remedied by enforcing a constraint such as $\sum_{i=1}^{n} b_i = 0$, which can be implemented numerically at each iteration of an MCMC algorithm used for model fitting. Alternatively, the so-called proper CAR model may be used; this model incorporates an additional parameter $\rho$, so that the full conditionals are

$$E(b_i \mid b_{j \neq i}) = \frac{\rho}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \qquad \rho \in (0, 1), \tag{1.3}$$

leading to the unique joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - \rho W)^{-1}, \tag{1.4}$$

so that the matrix $(D - \rho W)$ is non-singular and $\Sigma$ is a valid covariance matrix.
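The contrast between the intrinsic and proper CAR specifications can be checked numerically. The following is a minimal sketch, assuming a hypothetical map of four regions arranged in a line and assumed values for the variance parameter (`sigma2_b`) and for the additional parameter of the proper CAR model (`rho`). It builds the two matrices from (1.2) and (1.4), confirms that the intrinsic one is rank deficient, and recovers the conditional moments in (1.3) from standard Gaussian precision-matrix identities.

```python
import numpy as np

# Hypothetical map: four regions on a line, 1-2-3-4 (adjacent regions are neighbours).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))      # D = diag(w_{1+}, ..., w_{n+})
sigma2_b = 1.0                  # assumed variance parameter
rho = 0.9                       # assumed value in (0, 1)

Q_intrinsic = (D - W) / sigma2_b       # precision matrix behind (1.1)-(1.2)
Q_proper = (D - rho * W) / sigma2_b    # precision matrix behind (1.3)-(1.4)

# Rows of D - W sum to zero, so the intrinsic precision matrix is singular.
rank_intrinsic = np.linalg.matrix_rank(Q_intrinsic)
# D - rho * W is strictly diagonally dominant for rho in (0, 1), hence invertible.
Sigma_proper = np.linalg.inv(Q_proper)

def conditional_moments(Q, b, i):
    """Mean and variance of b_i given the rest, for a zero-mean Gaussian with precision Q."""
    others = [j for j in range(len(b)) if j != i]
    var = 1.0 / Q[i, i]
    mean = -var * (Q[i, others] @ b[others])
    return mean, var

b = np.array([0.2, -0.1, 0.4, 0.3])    # arbitrary illustrative values
mean_2, var_2 = conditional_moments(Q_proper, b, 1)
print(rank_intrinsic, mean_2, var_2)
```

For the second region, which has two neighbours, the conditional mean equals `rho` times the average of its neighbours' values and the conditional variance equals `sigma2_b / 2`, in agreement with (1.3).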

Alternatively, Leroux et al. (1999) proposed a CAR model defining the full conditionals as

$$E(b_i \mid b_{j \neq i}) = \frac{\lambda}{1 - \lambda + \lambda w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{1 - \lambda + \lambda w_{i+}}, \qquad \lambda \in (0, 1), \tag{1.5}$$

leading to the unique joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 \{\lambda (D - W) + (1 - \lambda) I\}^{-1}, \tag{1.6}$$

where $\lambda$ is a weighting parameter which weights the contributions from the spatially correlated effect, modeled as intrinsic CAR, and the independent random noise term, an independent normal distribution.
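As a consistency check on (1.5) and (1.6), the sketch below compares the conditional moments written as in (1.5) with those implied by the precision matrix in (1.6). The four-region adjacency matrix, the vector `b` and the value of the weighting parameter (called `lam` here) are hypothetical choices for illustration.

```python
import numpy as np

def leroux_precision(W, lam, sigma2_b=1.0):
    """Precision matrix {lam * (D - W) + (1 - lam) * I} / sigma2_b, as in (1.6)."""
    D = np.diag(W.sum(axis=1))
    return (lam * (D - W) + (1.0 - lam) * np.eye(W.shape[0])) / sigma2_b

def leroux_conditional(W, b, i, lam, sigma2_b=1.0):
    """Conditional mean and variance of b_i written as in (1.5)."""
    denom = 1.0 - lam + lam * W[i].sum()
    # W[i] @ b sums b_j over the neighbours of i because w_ii = 0.
    return lam * (W[i] @ b) / denom, sigma2_b / denom

# Hypothetical four-region map on a line and arbitrary illustrative values.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = np.array([0.2, -0.1, 0.4, 0.3])
lam, i = 0.6, 2

mean_formula, var_formula = leroux_conditional(W, b, i, lam)

# The same quantities from the precision matrix of a zero-mean Gaussian.
Q = leroux_precision(W, lam)
var_from_Q = 1.0 / Q[i, i]
mean_from_Q = -var_from_Q * (Q[i] @ b - Q[i, i] * b[i])
print(mean_formula, mean_from_Q, var_formula, var_from_Q)
```

Setting `lam` to 0 in `leroux_precision` returns the identity matrix (independent effects), while values close to 1 approach the intrinsic CAR precision `D - W`.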

For point-referenced data, geostatistical models (Cressie, 1993) are often used; these directly specify the covariance matrix based on the distances between the spatial sites. For example, the correlation between two spatial sites may decay exponentially with distance. CAR models, in contrast, are specified based on the adjacency structure among the spatial units, and can be used for either point-referenced data or lattice data. In addition, inference for geostatistical models usually requires inversion of covariance matrices at each MCMC iteration, whereas CAR models are specified directly through their full conditionals; CAR models are therefore computationally more efficient than geostatistical models.

1.3 Thin-Plate Splines

Thin-plate splines (Duchon, 1977) offer a very elegant approach for estimating a smooth function of multiple predictor variables. The following provides a concise introduction to thin-plate splines. For a more detailed description, see Duchon (1977), Meinguet (1979), Green and Silverman (1994) and Wood (2004, 2006).

Suppose the response $y_i$, $i = 1, \cdots, n$, is modeled as a smooth function of covariates $x_i$ such that

$$y_i = f(x_i) + \epsilon_i, \qquad i = 1, \cdots, n, \tag{1.7}$$

where $f$ is an unknown function on a fixed domain $D \subset \mathbb{R}^d$, $\epsilon_i$ is a random error term, and $x_i \in D$ are fixed values for covariates.

Thin-plate spline smoothing estimates $f$ by finding the function $f$ which minimizes the penalized sum of squares

$$\frac{1}{n} \sum_{i=1}^{n} w_i \{y_i - f(x_i)\}^2 + \lambda J_m(f), \tag{1.8}$$

where $w_i$, $i = 1, 2, \cdots, n$, are some fixed constants; $J_m(f)$ is a penalty function measuring the non-smoothness or so-called 'wiggliness' of $f$; and $\lambda$ is the smoothing parameter, which controls the tradeoff between $f$ fitting the data precisely and smoothness of $f$. The penalty term is defined as

$$J_m(f) = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \cdots + \nu_d = m} \frac{m!}{\nu_1! \cdots \nu_d!} \left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2 dx_1 \cdots dx_d. \tag{1.9}$$

The sum in the integral is taken over all vectors of non-negative integers $\nu = (\nu_1, \cdots, \nu_d)^T$ such that $\nu_1 + \cdots + \nu_d = m$, where $d$ denotes the number of covariates, so $d = 2$ for spatial longitude and latitude coordinate data, and the order $m$ of differentiation in the penalty can be any integer satisfying $2m > d$. Matheron (1973) and Duchon (1977) showed that the function minimizing (1.8) has the form

$$f(x) = \sum_{j=1}^{k} \alpha_j \phi_j(x) + \sum_{i=1}^{n} \delta_i \eta_i(x), \tag{1.10}$$

where $(\phi_1, \cdots, \phi_k)$ are linearly independent polynomials spanning the space of all $d$-dimensional polynomials of degree less than $m$, and $\alpha_j$, $j = 1, \cdots, k$, and $\delta_i$, $i = 1, \cdots, n$, are coefficients to be estimated. For example, when $d = 2$, $m = 2$, $k = 3$ and $x = (x_1, x_2)$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$ and $\phi_3(x) = x_2$. For $d = 2$, $m = 3$, $k = 6$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$, $\phi_3(x) = x_2$, $\phi_4(x) = x_1 x_2$, $\phi_5(x) = x_1^2$ and $\phi_6(x) = x_2^2$. The functions $(\eta_1, \cdots, \eta_n)$ are a set of $n$ radial basis functions, defined as

$$\eta_i(r) = \begin{cases} a_{md} \|r\|^{2m-d} \log \|r\|, & d \text{ even}, \\ b_{md} \|r\|^{2m-d}, & d \text{ odd}, \end{cases}$$

where $a_{md}$ and $b_{md}$ are constants.
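The two building blocks of (1.10) can be sketched in a few lines. The code below enumerates the monomials of total degree less than the penalty order, which span the polynomial part of the solution, and evaluates the radial basis function at a given distance. The function names are hypothetical, the constants default to 1 purely for illustration, and the count `comb(m + d - 1, d)` is the standard number of monomials of degree less than `m` in `d` variables.

```python
from itertools import product
from math import comb, log

def null_space_exponents(d, m):
    """Exponent vectors of all monomials in d variables with total degree below m."""
    return [nu for nu in product(range(m), repeat=d) if sum(nu) < m]

def radial_basis(r, d, m, a_md=1.0, b_md=1.0):
    """Radial basis value at distance r >= 0; constants default to 1 for illustration only."""
    if r == 0.0:
        return 0.0  # limit as r -> 0, valid because 2m > d
    power = r ** (2 * m - d)
    return a_md * power * log(r) if d % 2 == 0 else b_md * power

for d, m in ((2, 2), (2, 3)):
    exponents = null_space_exponents(d, m)
    print(d, m, len(exponents), comb(m + d - 1, d), exponents)
```

For two covariates the enumeration returns three monomials when the order is 2 and six when it is 3, matching the examples given in the text.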

For modeling spatial effects, thin-plate regression splines can be viewed as a Gaussian process with generalized covariance (Cressie, 1993), characterized in terms of distance $\tau$. The form of the covariance in two dimensions is $C(\tau) \propto \tau^{2m-2} \log(\tau)$, where $m$ is the order of the spline (commonly two). Paciorek (2007) provided a useful comparison of a variety of approaches for modeling spatial surfaces. Wood (2000, 2003, 2004) proposed the use of iterative weighted fitting of reduced-rank thin-plate splines for computational efficiency.
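A full-rank thin-plate spline fit for two covariates and penalty order 2 can be sketched directly from the form of the minimizer. The code below is an illustration under stated assumptions rather than the reduced-rank method of Wood: it takes unit weights, omits the constant multiplying the radial basis (so the smoothing parameter `lam` matches the one in (1.8) only up to a constant factor), and solves the usual linear system in which the radial coefficients are constrained to be orthogonal to the polynomial basis. All function names and the simulated data are hypothetical.

```python
import numpy as np

def tps_kernel(r):
    """r^2 log r, with the value at r = 0 set to its limit 0 (d = 2, m = 2, constant omitted)."""
    out = np.zeros_like(r)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos])
    return out

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def fit_tps(x, y, lam):
    """Solve (E + n*lam*I) delta + T alpha = y subject to T' delta = 0."""
    n = x.shape[0]
    E = tps_kernel(pairwise_dist(x, x))
    T = np.column_stack([np.ones(n), x])            # polynomial basis 1, x1, x2
    A = np.block([[E + n * lam * np.eye(n), T],
                  [T.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    return sol[:n], sol[n:]                          # delta, alpha

def predict_tps(x_new, x, delta, alpha):
    """Evaluate the fitted surface at the rows of x_new."""
    T_new = np.column_stack([np.ones(x_new.shape[0]), x_new])
    return tps_kernel(pairwise_dist(x_new, x)) @ delta + T_new @ alpha

rng = np.random.default_rng(0)
x = rng.uniform(size=(30, 2))                        # simulated site coordinates
y_plane = 1.0 + 2.0 * x[:, 0] - x[:, 1]              # a surface with zero wiggliness
delta, alpha = fit_tps(x, y_plane, lam=0.1)
print(np.round(alpha, 6), np.abs(delta).max())
```

Because a plane has zero penalty, the fit reproduces it for any value of `lam`: the radial coefficients vanish and the polynomial coefficients equal those of the plane. With `lam` set to 0 the same system yields an interpolating surface.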

1.4 Outline of Thesis

This thesis develops models and methods for the analysis of spatial or spatial-temporal data arising from epidemiological, environmental and ecological studies. Specific problems considered include identification of high-risk isolated areas in Chapter 2; joint modeling of multivariate spatially correlated outcomes using common spatial factor models in Chapter 3; misspecification of spatial exposure measures in Chapter 4; and investigation of functional data analysis approaches for modeling spatially and temporally correlated data in Chapter 5. Each of Chapters 2, 3, 4 and 5 constitutes a submitted paper. As a result, some introductory material, as well as the descriptions of motivating data sets, may be repeated across these chapters.

1.4.1 Chapter 2

In disease mapping studies, there is often interest in identifying high-risk areas in order to investigate causes of mortality for surveillance purposes, or perhaps for efficient allocation of health funding. Here, we focus on identification of locally isolated high-risk regions, termed 'local hotspots' or 'emerging hotspots', defined as regions with elevated risks with respect to their neighbors. Identification of such hotspots before they become extreme is crucial for disease surveillance. We develop methods of ranking the difference between area risks or ranks and the corresponding values for neighbours, based on (1) the standardized mortality ratio (SMR), (2) minimizing mean squared errors of estimation for relative risks, (3) minimizing mean squared errors of estimation for ranks of risks, (4) minimizing a weighted squared error loss function for ranks and (5) maximizing the sensitivity in the upper and lower 100γ% of relative risks at a prespecified γ. We evaluate our methods through a simulation investigation in a scenario which reflects the Scottish lip cancer data used in several mapping studies. Our simulation results show that ranking the difference between posterior ranks of emerging hotspots and the corresponding values for neighbours, based on minimizing mean squared errors of estimation for ranks, is superior to the other methods for identifying emerging hotspots.

1.4.2 Chapter 3

This chapter discusses joint outcome modeling of multivariate spatial data, where

outcomes include count as well as zero-inflated count data. The framework utilized for

the joint spatial count outcome analysis reflects that which is now commonly employed

for the joint analysis of longitudinal and survival data, termed shared frailty models,

in which the outcomes are linked through a shared latent spatial random risk term.

We discuss these types of joint mapping models and consider the benefits achieved


through such joint modeling in the disease mapping context. We also consider the

power of tests for common spatial structure and develop recommendations on the

sort of power achievable in some contexts, as well as overall recommendations on the

utility of joint mapping. We illustrate the approaches in an analysis of lung cancer

mortality as well as an ecological study of Comandra blister rust infection of lodgepole

pine trees.

1.4.3 Chapter 4

In environmental and epidemiological studies, the nearest distance between the susceptible subject and the exposure source is a commonly used exposure measure, principally because this measure is easy to collect. However, the density of the exposure in the neighborhood of the subject may play an important role in the response to exposure. Misspecification of exposure measures may result in inaccurate determinations of the link between exposure and the response of interest. Such considerations are motivated by the study of the disease dynamics of Comandra blister rust (Cronartium comandrae) on lodgepole pine. This disease spreads to pine trees through alternate host plants near the trees. We aim to understand the relationship between the presence of the alternate host plant and the disease, as well as effects relating to genetic variation in the trees. We contrast the use of nearest distance to the alternate host plant with host plant densities at different orders of neighborhood as exposure measures, in the framework of a flexible semiparametric generalized additive model, while adjusting for a spatially smooth surface. We demonstrate that if exposure is inaccurately modeled, bias in estimating genetic effects may arise. Our study also provides information on the added benefit of collecting more detailed information on exposure beyond the simple nearest distance measure.
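The difference between the two kinds of exposure measure can be seen in a toy example. The sketch below assumes planar coordinates and defines density as the number of host plants within a fixed radius divided by the area of that disc; this is one simple stand-in for the neighborhood-based densities studied in Chapter 4, not necessarily the definition used there. The function names, coordinates and radius are hypothetical.

```python
import math

def nearest_distance(tree, hosts):
    """Distance from a tree to its closest host plant."""
    return min(math.dist(tree, h) for h in hosts)

def host_density(tree, hosts, radius):
    """Host plants within `radius` of the tree, per unit area of the disc."""
    count = sum(1 for h in hosts if math.dist(tree, h) <= radius)
    return count / (math.pi * radius ** 2)

# Hypothetical layout: tree_a has one host plant nearby, tree_b is surrounded by four.
tree_a = (0.0, 0.0)
tree_b = (10.0, 0.0)
hosts = [(1.0, 0.0), (11.0, 0.0), (10.0, 1.0), (9.0, 0.0), (10.0, -1.0)]

for name, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
    print(name, nearest_distance(tree, hosts), round(host_density(tree, hosts, 2.0), 4))
```

Both trees have the same nearest distance, yet the host plant density around the second tree is four times that around the first, so a model using nearest distance alone would treat the two exposures as identical.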


1.4.4 Chapter 5

Oysters from the Pacific Northwest coast of British Columbia, Canada, contain high levels of cadmium, in some cases exceeding some international food safety guidelines. A primary goal of this chapter is the investigation of the spatial and temporal variation in cadmium concentrations for oysters sampled from coastal British Columbia. Such information is important so that recommendations can be made as to where and when oysters can be cultured such that accumulation of cadmium within these oysters is minimized. Some modern statistical methods are applied to achieve this goal, including monotone spline smoothing, functional principal component analysis and semi-parametric additive modelling. Oyster growth rates are estimated as the first derivatives of the monotone smoothed growth curves. Some important patterns in cadmium accumulation by oysters are observed. For example, most inland regions tend to have a higher level of cadmium concentration than most coastal regions, so more caution needs to be taken for shellfish aquaculture practices occurring in the inland regions. The semi-parametric additive modelling shows that oyster cadmium concentration decreases with oyster length, and that oysters sampled at 7 m have higher average cadmium concentration than those sampled at 1 m.

1.4.5 Chapter 6

The thesis closes with a discussion of future research topics.


Chapter 2

Bayesian Ranking Methods for the

Detection of Isolated Hotspots in

Disease Mapping

2.1 Introduction

In disease mapping, early capture of emerging hotspots, that is, regions with elevated risks which are surrounded by areas with much lower risks, before they become extreme, is crucial in decision-making related to health surveillance. Such decision-making processes may concern optimal allocation of resources for health prevention, or decisions reflecting mobility of a society or other environmental controls. A typical approach for detection of disease hotspots through a hypothesis testing framework utilizes the scan statistic (see Kulldorff and Nagarwalla, 1995; Kulldorff et al., 1998), which aims at detecting the location and size of hotspots without any preconceived assumptions about these values. Our focus here is quite different, as we seek to estimate and rank various local elevations in risk across a map. Model-based spatial methods are used here to estimate such ranks. For rare diseases, the observed disease count may exhibit extra-Poisson variation. Hence, the standardized mortality ratios (SMRs), a basic investigative tool for epidemiologists, may be highly variable. Consequently, in maps of SMRs, the most variable values, arising typically from low population areas, tend to be highlighted, masking the true underlying pattern of disease risk. To address the issue of such overdispersion, the field of disease mapping has flourished in the last decade with a variety of estimation methods and spatial models for latent levels of the model hierarchy. In particular, there have been many developments related to Bayesian hierarchical models, which allow the risk in an area to borrow strength from neighboring areas where the disease risks are similar. These models have indeed become standard tools for mapping rates (see, for example, Besag et al., 1991; Clayton and Bernardinelli, 1992; Clayton et al., 1993; Lawson et al., 2000; MacNab et al., 2004; Best et al., 2005) in order to identify global hotspots and trends in the risk surface across the map.

Identification of local or emerging hotspots has received less attention. It is unclear whether, and what sorts of, smoothing techniques offer advantages for identifying isolated hotspots over basic estimates such as raw rates. Here, we maintain the focus on Bayesian hierarchical conditional autoregressive (CAR) models, developed by Besag et al. (1991), Clayton and Bernardinelli (1992) and Clayton et al. (1993). This model and its extensions have become commonplace in epidemiological studies and have been shown to be flexible and robust (Lawson et al., 2000). Best et al. (2005) demonstrate the merits of the CAR model when compared to other contemporary models, including a multivariate normal geostatistical model with exponential covariance, a spatial mixture model, a partition model and a gamma moving average model. While CAR models were not designed to detect isolated hotspots or clusters of isolated hotspots, they have nevertheless been used broadly for identifying extreme risks.


The most natural measure of isolation is the difference between the risk or rank of a

potential hotspot and the corresponding quantity for its neighbors. Ranking methods

play a valuable role in drawing attention to elevated regions. This chapter considers

methods for ranking isolation measures with the goal of using these to identify local

or emerging hotspots. We note that Laird and Louis (1989) showed that ranking of

empirical Bayes estimators can be more accurate than that of conventional maximum

likelihood estimators. Shen and Louis (1998) investigated ranking procedures using

squared error loss functions operating on the difference between the estimated and

true ranks. We note also that in many applications, interest focuses principally on

identifying the locations with relatively high (e.g. in the upper 10 %) or low risks.

With such an emphasis, Lin et al. (2006) discussed various loss functions for Bayesian

optimal ranking, as well as decision rules for identifying the regions with the top

$100\gamma\%$ risk values. Wright et al. (2003) developed a weighted rank squared error loss

function targeted at the most likely high-risk locations. We contrast these methods

for identifying the highest and lowest isolation measures across a map and develop

recommendations based on adaptations of these procedures. Though we focus on

disease mapping, we note that methods for ranking isolation measures may be broadly

useful in many other contexts, particularly sociological, for ranking political or racial

isolation, or ecological, for diversity studies.

In Section 2.2, we review the Bayesian hierarchical models commonly used for

analyzing disease incidence and mortality data. Section 2.3 discusses the ranking

methods considered, focusing on identifying regions associated with high risks which

are isolated and building upon Bayesian hierarchical models. Section 2.4 evaluates the

methods using the spatial distribution of lip cancer from Scotland where local hotspots

are artificially generated. Section 2.5 closes with a discussion and recommendations.

2.2 Bayesian Disease-Mapping Model

It is well known that Bayesian hierarchical models for disease mapping provide a trade-off

between bias and variance reduction of estimates, and are particularly helpful in

cases where the disease is rare. The variance reduction is achieved through borrowing

information from the neighboring regions to produce a more stable estimate of the risk

surface with estimated risks shrunk toward the overall mean risk, or some function of

this mean. Marshall (1991) reviews empirical Bayes and some early Bayesian methods

for disease mapping; Lawson et al. (2000) compares disease mapping models using

various goodness of fit criteria; Best et al. (2005) provides a comprehensive review of

the recent development in Bayesian disease mapping and compares models through

simulation studies; Richardson et al. (2004) conducts a comprehensive evaluation

designed to highlight the amount of smoothing of risk which occurs and the effects on

identifying global hotspots in a variety of settings. Our aim here is to evaluate various

ranking methods for risk estimators obtained from fitting Bayesian disease mapping

models. We focus on the basic spatial model described by Besag et al. (1991).

Let the area under study be divided into $n$ contiguous regions labeled $i = 1, \ldots, n$, and let $\mathbf{y} = (y_1, \ldots, y_n)^T$ be the observed, and $\mathbf{E} = (E_1, \ldots, E_n)^T$ be the expected, disease counts. Denote by $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)^T$ the underlying random region-specific disease risks. The response variables, conditional on $\theta_i$, $i = 1, \ldots, n$, are assumed independent and Poisson distributed: $y_i \mid \theta_i \sim \text{Poisson}(\mu_i)$, $\mu_i = \theta_i E_i$. The conditional log linear model (Besag et al., 1991) specifies

$$\log(\mu_i) = \alpha + \log(E_i) + \eta_i, \qquad \eta_i = b_i + h_i,$$

where $\alpha$ denotes the overall mean risk, while $\eta_i$ is decomposed into a spatially correlated random error term $b_i$, and an uncorrelated error $h_i$. The spatially correlated random effects, $\mathbf{b} = (b_1, \ldots, b_n)^T$, are conveniently interpreted conditionally, as

$$b_i \mid b_{j \neq i} \sim N\left( \frac{\sum_{j \sim i} w_{ij} b_j}{\sum_{j \sim i} w_{ij}},\ \frac{\sigma_b^2}{\sum_{j \sim i} w_{ij}} \right),$$

where $j \sim i$ indicates that region $j$ belongs to the neighbourhood of region $i$, $i = 1, \ldots, n$. Neighborhoods define the scope of the conditional influence and may be constructed in different ways depending on the context of the analysis. In our application, we define regions which are contiguous in space with the $i$th region, sharing a common boundary, as its neighborhood. The weights, $w_{ij} \geq 0$, $w_{ii} = 0$, $i, j = 1, \ldots, n$, may be based on adjacency indicators for a lattice, or on a distance measure between regions $i$ and $j$. Where the weights are based on adjacency indicators, the joint distribution of random effects, $\mathbf{b}$, is described as the intrinsic conditional autoregressive model (Besag, 1974; Sun et al., 1999): $\mathbf{b} \sim \text{MVN}(\mathbf{0}, \sigma_b^2 Q^{-1})$, where $Q$ has $i$th diagonal element equal to the number of neighbors of the $i$th region while for $i \neq j$, $Q_{ij} = -1$ if $i$ and $j$ are neighbors, and 0 otherwise. The vector of random risks, $\boldsymbol{\theta}$, accommodates extra variation by a white noise error vector, $\mathbf{h} = (h_1, \ldots, h_n)^T \sim \text{MVN}(\mathbf{0}, \sigma_h^2 I)$, where $I$ is an identity matrix of dimension $n$. By combining the independent and spatially correlated sources of random errors, we obtain the convolution conditional autoregressive model for defining the distribution of the risks $\theta_i$, as defined by Besag et al. (1991): $\mathbf{h} + \mathbf{b} \sim \text{MVN}(\mathbf{0}, \Sigma)$, where $\Sigma = \sigma_h^2 I + \sigma_b^2 Q^{-1}$. The values of $\sigma_h^2$ and $\sigma_b^2$ give a sense of the contributions of spatial and non-spatial components in explaining the variability in the map of risks.

Bayesian analysis requires the specification of prior distributions for the parameters. We put a diffuse prior on the intercept $\alpha$. For the variance parameters $(\sigma_b^2, \sigma_h^2)$ of the random effects $(\mathbf{b}, \mathbf{h})$, we assign the square roots (the standard deviations) a noninformative uniform prior density between 0 and 100 (Gelman, 2006).
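As a rough illustration of the intrinsic CAR structure described above, the following Python sketch builds the matrix Q from adjacency lists and evaluates the conditional mean and variance of one spatial effect given the others, using adjacency weights equal to 1. The four-region map, the values of b and the variance 0.5 are hypothetical and serve only to make the example runnable; none of them come from the lip cancer analysis.

```python
# Hypothetical 4-region map given as adjacency lists (0-based region indices).
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def icar_precision(neighbours):
    """Q[i][i] = number of neighbours of i; Q[i][j] = -1 if i ~ j, else 0."""
    n = len(neighbours)
    Q = [[0] * n for _ in range(n)]
    for i, nbrs in neighbours.items():
        Q[i][i] = len(nbrs)
        for j in nbrs:
            Q[i][j] = -1
    return Q


def conditional_moments(i, b, neighbours, sigma2_b):
    """Mean and variance of b_i given the other effects, with weights w_ij = 1."""
    nbrs = neighbours[i]
    mean = sum(b[j] for j in nbrs) / len(nbrs)
    var = sigma2_b / len(nbrs)
    return mean, var


Q = icar_precision(neighbours)
b = [0.2, -0.1, 0.4, 0.0]            # hypothetical spatial effects
mean_2, var_2 = conditional_moments(2, b, neighbours, sigma2_b=0.5)
row_sums = [sum(row) for row in Q]   # each row of Q sums to zero in this map
```

With adjacency weights, the conditional mean is simply the average of the neighbouring effects, and the conditional variance shrinks as the number of neighbours grows.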

In the Bayesian approach to disease mapping, inference on the relative risks is

based on the posterior distribution of the risks given the data. The use of Markov

chain Monte Carlo (MCMC) methods based on Gibbs sampling (Geman and Ge-

man, 1984; Gelfand and Smith, 1990) yields easy implementation in the WinBUGS

software package (Spiegelhalter et al., 2003), allowing for estimation of the posterior

distribution of the relative risks. The R package R2WinBUGS (Sturtz et al., 2005)

may be used to export results for additional analyses using R.

2.3 Ranking Methods

To estimate isolation, we propose to rank the difference between the rank or risk es-

timates of the region under consideration and the corresponding mean value from its

neighbours. We expect this to provide a useful mechanism for identifying areas with

emerging or unusually elevated risk, and hence for prioritizing public health investigations.

Our discussion of ranking approaches draws on both (i) traditional perspectives

which use estimates based upon the SMR and (ii) those based on smoothing methods

with a focus on obtaining a general impression of trends over space as well as utilizing

these to provide more precise identification of isolated high risk areas.

Let $\mathbf{d} = (d_1, \ldots, d_n)^T$ be a vector representing the isolation measure defined as the true difference in relative risks between the region and the mean value of the risk for its neighborhood,

$$d_i = \theta_i - \frac{1}{N_i} \sum_{j \sim i} \theta_j, \tag{2.1}$$

where $N_i$ denotes the number of neighbours for region $i$, $i = 1, \ldots, n$. Define the corresponding rank of $d_i$ as

$$\text{rank}(d_i) = R_i = \sum_{j=1}^{n} I\{d_i \leq d_j\}, \tag{2.2}$$

where $I\{A\}$ is the indicator function for event $A$. The smallest difference has rank $n$ and the largest has rank 1. The ranking methods considered are obtained by minimizing the following loss functions.
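The isolation measure (2.1) and its rank (2.2) can be sketched in a few lines of Python. The relative risks and the four-region adjacency structure below are hypothetical values chosen so that one region stands out from its neighbours.

```python
def isolation(theta, neighbours):
    """d_i = theta_i minus the mean relative risk of the neighbours of region i."""
    return [
        theta[i] - sum(theta[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(theta))
    ]


def ranks(values):
    """R_i = sum_j I{v_i <= v_j}: the largest value gets rank 1 (no ties assumed)."""
    return [sum(1 for other in values if v <= other) for v in values]


# Hypothetical relative risks and adjacency lists (0-based region indices).
theta = [1.0, 1.2, 3.0, 0.8]
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

d = isolation(theta, neighbours)
R = ranks(d)
```

Here region 2 has risk 3.0 against a neighbourhood mean of 1.0, so it receives the largest isolation measure and rank 1, while region 3, whose only neighbour is region 2, receives rank 4.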

2.3.1 Squared error loss function for the isolation measures

It is well known that the posterior mean minimizes the Bayesian risk with respect

to the squared-error loss (SEL) function (Berger, 1985). For example, the posterior

mean, $E(\theta_i \mid \mathbf{y})$, is the optimal Bayes estimate obtained by minimizing the posterior expectation of the sum of squared error loss function $L(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta}) = \sum_{i=1}^{n} (\hat{\theta}_i - \theta_i)^2 / n$ (Carlin and Louis, 1996). In our case, we rank the posterior mean of the isolation value, $E(d_i \mid \mathbf{y})$, which minimizes the posterior expectation of $L(\hat{\mathbf{d}}, \mathbf{d}) = \sum_{i=1}^{n} (\hat{d}_i - d_i)^2 / n$. The corresponding estimated ranks are denoted as PM.

2.3.2 Squared error loss function for the ranks of the isolation measures

Laird and Louis (1989), Shen and Louis (1998) and Louis and Shen (1999) showed

that if ranks of parameters are of interest, using a rank estimator directly is more ap-

propriate than using the parameter estimator to obtain ranks. The posterior expected

rank is obtained by minimizing the sum of squared error loss function of the ranks $L(\hat{\mathbf{R}}, \mathbf{R}) = \sum_{i=1}^{n} (\hat{R}_i - R_i)^2 / n$. The estimated ranks, which are non-integer quantities, are

$$\bar{R}_i = E(R_i \mid \mathbf{y}) = \sum_{j=1}^{n} P(d_i \leq d_j \mid \mathbf{y}), \tag{2.3}$$

and tend to be shrunk towards the mid-rank $(n + 1)/2$. Hence, we rank the posterior means of (2.3) as described below, and denote the corresponding estimated ranks, $\hat{R}_i = \text{rank}(\bar{R}_i)$, as PRANK. Lin et al. (2006) shows that the estimator is also optimal under weighted squared error loss of ranks, $\frac{1}{n} \sum_{i=1}^{n} w_i (\hat{R}_i - R_i)^2$, for any values of $w_i$, $i = 1, \ldots, n$. Calculation of PRANK can be easily implemented in the Bayesian context. Let $\boldsymbol{\theta}^{(r)} = (\theta_1^{(r)}, \ldots, \theta_n^{(r)})^T$ be a random draw of $\boldsymbol{\theta}$ from $p(\boldsymbol{\theta} \mid \mathbf{y})$; rank the isolation measures $d_i^{(r)} = \theta_i^{(r)} - \frac{1}{N_i} \sum_{j \sim i} \theta_j^{(r)}$, $i = 1, \ldots, n$, and subsequently rank the average rank of $d_i^{(r)}$ over the MCMC iterations, $r = 1, \ldots, R$, to obtain the optimal rank based on (2.3).
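The PM and PRANK calculations described above might be sketched as follows, assuming posterior draws of the relative risks are available as a list of vectors. The draws here are simulated from a hypothetical lognormal distribution purely as a stand-in for MCMC output, and the function name `pm_and_prank` is an illustrative choice rather than anything defined in the text.

```python
import math
import random


def isolation(theta, neighbours):
    """Isolation measure: risk of region i minus the mean risk of its neighbours."""
    return [
        theta[i] - sum(theta[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(theta))
    ]


def ranks(values):
    """Largest value gets rank 1, smallest gets rank n (no ties assumed)."""
    return [sum(1 for other in values if v <= other) for v in values]


def pm_and_prank(draws, neighbours):
    """Return PM ranks, PRANK ranks and the posterior expected ranks."""
    n, n_draws = len(draws[0]), len(draws)
    sum_d, sum_rank = [0.0] * n, [0.0] * n
    for theta in draws:
        d = isolation(theta, neighbours)
        r = ranks(d)
        for i in range(n):
            sum_d[i] += d[i]
            sum_rank[i] += r[i]
    mean_d = [s / n_draws for s in sum_d]
    expected_rank = [s / n_draws for s in sum_rank]
    pm = ranks(mean_d)                             # rank the posterior mean of d_i
    prank = ranks([-r for r in expected_rank])     # smallest expected rank -> 1
    return pm, prank, expected_rank


# Stand-in for MCMC output: hypothetical lognormal draws around fixed risks.
rng = random.Random(1)
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
log_means = [math.log(v) for v in (1.0, 1.2, 3.0, 0.8)]
draws = [[math.exp(rng.gauss(m, 0.1)) for m in log_means] for _ in range(200)]

pm, prank, expected_rank = pm_and_prank(draws, neighbours)
```

The expected ranks are generally non-integer and pulled towards the middle of the range, which is why they are ranked once more to produce PRANK.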

Ranking methods described in Subsections 2.3.1 and 2.3.2 may be reasonable

choices when accurate ranking of all regions is of interest. In contrast, the methods

described in Subsections 2.3.3 and 2.3.4 focus on high risk areas.

2.3.3 Weighted rank squared error loss function

The posterior means are less variable than a typical draw from the posterior distribu-

tion (Louis, 1984). Therefore, high risks tend to be underestimated, while low risks

tend to be overestimated. Wright et al. (2003) introduces weighted rank squared error

loss functions in a hierarchical setting for estimating extrema (hotspot) of parame-

ters. In an exploratory approach, we adapt this method to be aligned with a focus

on identifying local isolated hotspots.

Let $(d_{(1)}, \ldots, d_{(n)})$ be the ordered vector of $\mathbf{d}$, $d_{(1)} < \cdots < d_{(n)}$, assuming no ties. To identify the most isolated hotspot, we consider the following loss function:

$$J(\mathbf{d}, \hat{\mathbf{d}}, \mathbf{c}) = \sum_{k=1}^{n} \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 = \sum_{k=1}^{n} c_{r(k)} (d_k - \hat{d}_k)^2, \tag{2.4}$$

where $r(k) \equiv \{j : d_k = d_{(j)}\}$, $c_{r(k)} = \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\}$ and $\mathbf{c} = (c_1, \ldots, c_n)^T$ is the vector of weights for $\mathbf{d}$. The optimal Bayes estimator of $d_k$ is obtained by minimizing the conditional expectation of the $k$th element in (2.4),

$$E\{J_k(\mathbf{d}, \hat{d}_k, \mathbf{c}) \mid \mathbf{y}\} = \int \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d}, \tag{2.5}$$

which yields

$$\hat{d}_k = \frac{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, d_k \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d}}{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d}} = \frac{\sum_{j=1}^{n} E(d_k \mid d_k = d_{(j)}, \mathbf{y}) \, c_j \, p(d_k = d_{(j)} \mid \mathbf{y})}{\sum_{j=1}^{n} c_j \, p(d_k = d_{(j)} \mid \mathbf{y})}. \tag{2.6}$$

The estimate $\hat{d}_k$ is a weighted average of conditional posterior means of $d_k$, with the weight being $c_j$ multiplied by the posterior probability that $d_k$ has rank $j$. The corresponding estimated ranks are denoted as WRSEL. For identifying extreme risks, we use the suggestion in Wright et al. (2003) to consider a sharply increasing weighting vector, with $c_i = \exp[\{(n + 1) - i\}/s]$ as the weight for rank $i$, $i = 1, \ldots, n$. We let WRSEL(a) denote the estimated ranks when $s = 2$, so that the weighting function puts large weight on highly isolated risks and almost 0 weight otherwise, and WRSEL(b) denote the estimated ranks when $s = 10$, so that the weight function declines less steeply as risks become less isolated. Figure 2.1 displays the weight vectors $\mathbf{c}$ for WRSEL(a) and WRSEL(b) when $n = 56$.
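A minimal Monte Carlo sketch of the WRSEL estimate in (2.6) is given below. It assumes that the weight for rank i attaches to ranks as defined in (2.2), so that rank 1 (the most isolated region) receives the largest weight, and it rescales the weights to have maximum 1, which leaves the ratio in (2.6) unchanged. The two toy draws of the isolation vector and the function names are hypothetical.

```python
import math


def ranks(values):
    """Rank as in (2.2): the largest value gets rank 1 (no ties assumed)."""
    return [sum(1 for other in values if v <= other) for v in values]


def wrsel_weights(n, s):
    """c_i = exp[{(n + 1) - i} / s] for rank i, rescaled so that c_1 = 1."""
    return [math.exp((1 - i) / s) for i in range(1, n + 1)]


def wrsel_estimates(d_draws, c):
    """Weighted average of the draws of d_k, weighting each draw by c at its rank."""
    n = len(d_draws[0])
    num, den = [0.0] * n, [0.0] * n
    for d in d_draws:
        r = ranks(d)
        for k in range(n):
            w = c[r[k] - 1]       # weight attached to the rank of d_k in this draw
            num[k] += w * d[k]
            den[k] += w
    return [num[k] / den[k] for k in range(n)]


c_a = wrsel_weights(56, 2)    # steep decline, as for WRSEL(a)
c_b = wrsel_weights(56, 10)   # gentler decline, as for WRSEL(b)

# Two hypothetical posterior draws of the isolation vector for three regions.
d_draws = [[2.0, 0.0, -1.0], [1.0, 3.0, -2.0]]
d_hat = wrsel_estimates(d_draws, wrsel_weights(3, 2))
wrsel_rank = ranks(d_hat)
```

In this toy example the estimate for the first region is pulled above its plain average of 1.5, because the draw in which it ranks first carries more weight.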

2.3.4 Misclassification rates of regions in the top $100\gamma\%$ group

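Lin et al. (2006) considered specific loss functions tailored for estimating extreme ranks. They recommended ranking the posterior probability that a region's rank is in the top $100\gamma\%$ of ranks based on the rank-based misclassification loss function:

$$L_{0|1}(\gamma, \mathbf{R}, \hat{\mathbf{R}}) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \text{FP}(\gamma, R_i, \hat{R}_i) + \text{FN}(\gamma, R_i, \hat{R}_i) \right\}, \tag{2.7}$$

where

$$\text{FP}(\gamma, R_i, \hat{R}_i) = I\{R_i > \gamma(n + 1),\ \hat{R}_i \leq \gamma(n + 1)\}; \qquad \text{FN}(\gamma, R_i, \hat{R}_i) = I\{R_i \leq \gamma(n + 1),\ \hat{R}_i > \gamma(n + 1)\}, \tag{2.8}$$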

Figure 2.1: Plot of the weight function c (vertical axis) against rank (horizontal axis) for WRSEL(a) and WRSEL(b). In this plot, the weight functions are scaled to have maximum value of 1.

where FP (false positive) and FN (false negative) indicate the two possible misclassifications.

Lin et al. (2006) shows that the loss function (2.7) is minimized by ranking the following posterior probabilities, based on the posterior distribution of $R_i$:

$$P(R_i \leq \gamma(n + 1) \mid \mathbf{y}), \tag{2.9}$$

which minimizes errors in classifying regions above or below a percentile threshold. The corresponding estimated ranks are denoted as PPR.
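The posterior probabilities in (2.9) can be approximated by the fraction of posterior draws in which a region's rank falls within the top group. The Python sketch below assumes draws of the isolation vector are available; the three toy draws are hypothetical, and the parameter called `gamma` stands for the top-fraction threshold in (2.9). A small tolerance guards against floating-point error when the cutoff is an exact integer.

```python
def ranks(values):
    """Largest value gets rank 1; tied values share the larger rank number."""
    return [sum(1 for other in values if v <= other) for v in values]


def ppr(d_draws, gamma):
    """Estimate P(R_i <= gamma * (n + 1) | y) from draws, then rank the probabilities."""
    n = len(d_draws[0])
    cutoff = gamma * (n + 1) + 1e-9   # tolerance for floating-point rounding
    hits = [0] * n
    for d in d_draws:
        for i, r in enumerate(ranks(d)):
            if r <= cutoff:
                hits[i] += 1
    probs = [h / len(d_draws) for h in hits]
    return probs, ranks(probs)        # highest probability gets PPR rank 1


# Three hypothetical posterior draws of the isolation vector for three regions.
d_draws = [[2.0, 0.0, -1.0], [1.0, 3.0, -2.0], [2.5, 1.0, 0.0]]
probs, ppr_rank = ppr(d_draws, gamma=1 / (3 + 1))   # cutoff at rank 1
```

Regions that never enter the top group all have probability zero and therefore tie, which is one reason this criterion is informative mainly for the upper end of the ranking.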

2.4 Comparison of Rank Estimators of Isolation

In an effort to understand how these ranking methods perform and how well they cap-

ture isolated hotspots when these are only modestly elevated, we consider hotspots

from regions with low expected counts, where the elevation in risk ranges from mod-

erate to large. We consider a single isolated hotspot, a small cluster of contiguous

hotspots, and, for comparison, several non-contiguous isolated hotspots.

In the investigations, the background relative risks are spatially correlated while an

independent discrete random effect inflates the risks in the target regions. Specifically,

counts were generated from a multinomial distribution

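$$y_i \sim \text{Multinomial}\left( \sum_{i=1}^{n} E_i,\ \frac{E_i \theta_i}{\sum_{i=1}^{n} E_i \theta_i} \right), \qquad \theta_i = \exp(\alpha + b_i + \log \delta_i), \tag{2.10}$$

where $\alpha$ is the overall mean rate over the map; $b_i$ denotes a spatially correlated random effect; $\delta_i = 1$ if the region is not a hotspot, and constant $t$ otherwise, $t$ being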

the inflation factor. To accommodate sampling variability, each simulation scenario is

replicated 500 times. Two MCMC chains have been run for a total of 20,000 iterations,

keeping every 10th, after a 10,000 iteration burn-in period. Brooks-Gelman-Rubin

diagnostics (Brooks and Gelman, 1998), as well as graphical checks of chains and

their autocorrelations were performed to assess convergence. The distribution of the

spatially correlated random effects, the expected disease counts and the neighbor-

hood structure mimic the fitted distribution from an initial analysis of the Scottish

lip cancer data (see Breslow and Clayton, 1993, for example). The data comprise

observed and expected counts of lip cancer cases during the period 1975-1980 over 56

Scottish counties. Table 2.1 summarizes observed and expected counts for these data.

The lip cancer data are known for exhibiting severe extra-Poisson variation (Clayton

and Kaldor, 1987). Breslow and Clayton (1993) and others have found that a con-

ditional Poisson model with spatially correlated CAR random effects provides a fair

fit to these data. We use the estimated model parameters from such an analysis to

define the background spatial pattern. Additionally, emerging isolated hotspots are

generated as

∙ Scenario I: A single region is considered as an emerging hotspot. Three candidates

are considered with expected counts of 1.8, 6 and 14.6, corresponding

approximately to the 10th, 50th and 90th percentiles of the expected counts,

respectively. Note that we choose the hotspot with low expected count (10th

percentile of the expected counts), such that it is surrounded by neighbours

with fairly high expected counts, one of which has expected count 50.7. In

this case, the neighbours may have substantial smoothing effects on the target

region under the CAR model.

∙ Scenario II: A group of three contiguous regions is considered as an isolated clus-

ter. Two cases are considered: (i) areas with low expected counts 3.3, 4.8 and

2.9; (ii) areas with high expected counts of 9.3, 14.6 and 88.7. When contigu-

ous regions are proposed as hotspots, di (2.1) is calculated by excluding target

hotspots from Ni. This mimics a hypothesis testing scenario where a specific

cluster is being tested. Note that the expected counts from the neighbours of

case (i) are fairly low, with mean expected counts about 7.5.

∙ Scenario III: A group of three non-contiguous regions is considered as an isolated

group of regions of higher risk. Two cases are considered (i) areas with low

expected counts of 2.5, 3.3 and 3.6; (ii) areas with high expected counts of 10.1,

50.7 and 8.2. The expected counts of the neighbours for two of the isolated

hotspots in case (i) are fairly low, while the third hotspot has a neighbor with

the highest expected count over the map.

Note that estimated risks for the isolated hotspots which have moderate or high

expected counts are less likely to be influenced by disease counts for their neighbours.

In the simulation studies, the risks of the elevated regions are inflated to be sharply different from their neighbors and to be (i) not overly high (rank about 10th place), (ii) moderately high (rank about 3rd place) or (iii) high (rank about 1st place). The magnitudes of the inflations for Scenarios I, II and III are reflected in Figures 2.2, 2.3 and 2.4, respectively. The corresponding geographical locations for the isolated hotspots are shown in Figure 2.5 for Scenario I and in Figure 2.6 for Scenarios II and III. We also consider scaling the expected counts by a factor u = 1, 4 and 8 for all scenarios. The threshold $\gamma$ in (2.9) corresponds to $1/(n + 1)$ for Scenario I and $3/(n + 1)$ for Scenarios II and III. The simulated data are analyzed using the model described in Section 2.2.

Table 2.1: Scottish lip cancer data: summary statistics.

                     Minimum  First quartile  Median  Third quartile  Maximum
Observed count (y)      0.00            4.75    8.00           11.00    39.00
Expected count (E)      1.10            4.05    6.30           10.12    88.70
SMR (y/E)               0               0.49    1.11            2.24     6.43
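The data-generating step in (2.10) might be sketched as below: risks in hotspot regions are multiplied by an inflation factor, and a fixed total number of cases is allocated to regions with probabilities proportional to the expected count times the relative risk. The expected counts, spatial effects, intercept and inflation factor are hypothetical, and rounding the total to an integer is an assumption made so that the draw is well defined.

```python
import math
import random
from collections import Counter


def simulate_counts(E, alpha, b, hotspots, t, rng):
    """Draw region counts from the multinomial in (2.10); returns (counts, theta)."""
    n = len(E)
    delta = [t if i in hotspots else 1.0 for i in range(n)]   # inflation indicator
    theta = [math.exp(alpha + b[i] + math.log(delta[i])) for i in range(n)]
    total = round(sum(E))          # assumption: total cases rounded to an integer
    # Probabilities are proportional to E_i * theta_i; exp(alpha) cancels out.
    weights = [E[i] * theta[i] for i in range(n)]
    allocation = Counter(rng.choices(range(n), weights=weights, k=total))
    return [allocation.get(i, 0) for i in range(n)], theta


rng = random.Random(2011)
E = [2.0, 6.0, 15.0, 50.0]       # hypothetical expected counts
b = [0.1, -0.2, 0.0, 0.3]        # hypothetical spatially correlated effects
y, theta = simulate_counts(E, alpha=0.0, b=b, hotspots={0}, t=3.0, rng=rng)
```

Because the total is fixed, inflating one region slightly lowers the allocation probabilities of all the others, so the hotspot is elevated relative to its surroundings rather than in absolute terms.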

To assess the accuracy of the proposed ranking methods for identification of isolated hotspots, we consider the root mean squared error of $\hat{R}_i$, for an isolated hotspot at site $i$, given by

$$\text{RMSE}(\hat{R}_i) = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{R}_i^{(m)} - R_i \right)^2 \right\}^{1/2}, \tag{2.11}$$

where $R_i$ is the true rank (2.2) and $\hat{R}_i^{(m)}$ is the estimated value based on the $m$th simulated dataset, $m = 1, \ldots, M$. For cases where clusters are considered (Scenarios II and III), we calculate the average RMSE for the hotspots.

Figure 2.2: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data (horizontal axis: true rank; vertical axis: true relative risk). The panels in the top, middle and bottom rows correspond to Scenario I with the target region having low, moderate and high expected incidence count, respectively. The isolated hotspot, shown as black dots, is inflated to about the 10th (column 1), 3rd (column 2) and 1st (column 3) places. The symbol + identifies neighboring regions of the isolated hotspot.

Figure 2.3: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data (horizontal axis: true rank; vertical axis: true relative risk). The panels in the top and bottom rows correspond to an isolated cluster of three contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots.

Figure 2.4: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data (horizontal axis: true rank; vertical axis: true relative risk). The panels in the top and bottom rows correspond to an isolated cluster of three non-contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots.

Figure 2.5: The panels display $d_i$ for Scenario I (panel titles: low E, moderate E, high E; legend classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5). The single isolated hotspot with low, moderate and high expected count is identified by the red circle in the 1st, 2nd and 3rd panels, respectively.

Figure 2.6: The top and bottom panels display $d_i$ for Scenarios II and III, respectively (panel titles: low E, high E; legend classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5). The cluster of three contiguous isolated hotspots with low and high expected counts for simulation Scenario II are identified by the red circles in the left and right top panels, respectively; the cluster of three non-contiguous hotspots with low and high expected counts for simulation Scenario III are identified by the red circles in the left and right bottom panels, respectively.

We also evaluate the ranking methods based on the correct positive and false positive rates

$$\text{CP} = P(\hat{R}_i < \tau \mid R_i < \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i < \tau\};$$

$$\text{FP} = P(\hat{R}_i < \tau \mid R_i > \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i > \tau\}, \tag{2.12}$$

where $\tau$ in (2.12) denotes the threshold defining high ranks, $\tau = 2$ for Scenario I and 4 for Scenarios II and III.
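The accuracy measures in (2.11) and (2.12) reduce to simple summaries of the estimated ranks across simulated datasets. The sketch below computes them for a single region at a time; the estimated ranks are hypothetical, and the function name `rate_below` is an illustrative choice. Applied to a region whose true rank is below the threshold, it gives the correct positive rate; applied to a region whose true rank is above the threshold, it gives the false positive rate.

```python
import math


def rmse(est_ranks, true_rank):
    """Root mean squared error of estimated ranks across simulated datasets."""
    m = len(est_ranks)
    return math.sqrt(sum((r - true_rank) ** 2 for r in est_ranks) / m)


def rate_below(est_ranks, threshold):
    """Fraction of simulated datasets in which the estimated rank is below the threshold."""
    return sum(1 for r in est_ranks if r < threshold) / len(est_ranks)


# Hypothetical estimated ranks over M = 5 simulated datasets.
hotspot_est = [1, 1, 3, 2, 1]       # region whose true rank is 1
background_est = [5, 1, 4, 6, 3]    # region whose true rank exceeds the threshold
threshold = 2                       # value used for a single hotspot

hotspot_rmse = rmse(hotspot_est, true_rank=1)
cp = rate_below(hotspot_est, threshold)
fp = rate_below(background_est, threshold)
```

For clusters of hotspots, the same summaries would be averaged over the regions concerned.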

Table 2.2 displays RMSE, CP and FP for all the ranking methods evaluated here

for Scenario I for the case where the hotspot is associated with a low expected count

surrounded by neighbours with high expected values. It is not surprising that SMR

performs better in this case, as the CAR model pools information from the neighbours

to produce an estimate for the target region; therefore, the risk estimate for this

isolated hotspot tends to be smoothed under the CAR model. In contrast, for the

case of an isolated hotspot with moderate or large expected count, as shown in Tables

2.3 and 2.4, PRANK outperforms SMR. The gains of using PRANK are substantial

when the expected incidence count for the emerging hotspot is large. For example,

in Table 2.4, when the isolated hotspot is in the 10th place, CP is about 71.2% while

FP is about 0.5% for PRANK, yielding a performance which is far superior to the

other ranking methods. In general, WRSEL(a), WRSEL(b) and PPR perform less

well. The WRSEL function tends to inflate the point estimates of the high risks;

because their weights are low, inaccuracies in point estimates of the other regions

with low isolation measures are relatively unimportant. WRSEL does not provide

precise estimates of all the risks and this may make it unsuitable for ranking purposes

(ranking requires good estimates over the whole map). Our empirical evaluation of

PPR over a sequence of values of the threshold $\gamma$ (not shown here) suggests that

the performance of this estimator in terms of RMSE, CP and FP is influenced by $\gamma$,

especially when the expected disease counts are low for the isolated hotspots. For

all the ranking methods, RMSE decreases, CP increases and FP decreases when the

emerging hotspots are gradually elevated above the whole surface, and when the

expected incidence counts for all the regions are inflated. These findings apply also

to the cases where the isolated hotspots are a cluster of three contiguous regions (see

Tables 2.5 and 2.6) and also where the isolated hotspots are three non-contiguous

regions (see Tables 2.7 and 2.8). It is also interesting to note that, in contrast to

Scenario I, for Scenario II, where three contiguous regions with low expected counts

are inflated as a cluster of hotspots, PRANK is superior to SMR, as the CAR model

has less of a smoothing effect on these isolated hotspots.

2.5 Summary

In this study, we focus on developing and evaluating rank estimators for disease map-

ping for the identification of emerging isolated hotspots. To determine the magnitude

of elevation of the hotspots relative to their neighbours, we developed an isolation

measure, the difference of risks or their rank estimators for the emerging high risk

regions and their neighbours. In summary, we note that though the CAR model

provides a smoothed risk surface, the estimates for PRANK or PM based on this

model perform reasonably well in detecting the emerging isolated hotspots. Simula-

tion studies show that gains of using PRANK may be substantial compared to other

ranking methods considered, especially when the disease is rare and the high risk area

is not yet a global outlier. The research has adopted the widely used CAR model.

Rank estimators based on other models may yield different results on identification of

isolated hotspots. The isolation measure developed here depends on the definition of

the neighborhood structure. The performance of the isolation measure may depend

on the distribution of the number of neighbours; hence the development of methods

which account for the number of neighbours may be useful.

In addition, in comparison to the classical scan statistic, we expect that the rank-

ing methods based on the spatial model may have lower false positive rates for identify-

ing isolated hotspots, since the classical scan statistic is very sensitive to the violation

of the assumption of spatial independence, detecting clusters at the 5% level much

more often than 5% of the time when spatially correlated data are simulated (Loh

and Zhu, 2007). It would be useful to compare the use of the scan statistic to our

ranking methods through simulation studies when no isolated hotspots exist.

Table 2.2: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with LOW expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        8.988  0.262  0.013     4.211  0.542  0.008     1.842  0.706  0.005
u = 1  PM        11.924  0.014  0.018     6.991  0.098  0.016     4.631  0.212  0.014
u = 1  WRSEL(a)  14.479  0.004  0.018    10.173  0.064  0.017     7.689  0.132  0.016
u = 1  WRSEL(b)  15.673  0.006  0.018    10.304  0.074  0.017     7.542  0.160  0.015
u = 1  PPR       21.577  0.008  0.018    13.456  0.088  0.017     9.316  0.162  0.015
u = 1  PRANK     11.643  0.024  0.018     6.555  0.150  0.015     4.103  0.278  0.013
u = 4  SMR        2.109  0.428  0.010     0.417  0.886  0.002     0.253  0.948  0.001
u = 4  PM         4.331  0.070  0.017     1.291  0.646  0.006     0.629  0.834  0.003
u = 4  WRSEL(a)   6.937  0.036  0.018     1.865  0.560  0.008     0.913  0.784  0.004
u = 4  WRSEL(b)   5.753  0.050  0.017     1.449  0.612  0.007     0.700  0.828  0.003
u = 4  PPR       10.509  0.054  0.017     1.785  0.626  0.007     0.739  0.832  0.003
u = 4  PRANK      3.935  0.096  0.016     1.154  0.676  0.006     0.576  0.862  0.003
u = 8  SMR        1.305  0.472  0.010     0.205  0.958  0.001     0.118  0.986  0.000
u = 8  PM         2.510  0.164  0.015     0.397  0.872  0.002     0.161  0.974  0.000
u = 8  WRSEL(a)   3.409  0.132  0.016     0.443  0.834  0.003     0.179  0.968  0.001
u = 8  WRSEL(b)   2.781  0.164  0.015     0.422  0.858  0.003     0.161  0.974  0.000
u = 8  PPR        5.720  0.160  0.015     0.422  0.864  0.002     0.155  0.976  0.000
u = 8  PRANK      2.373  0.188  0.015     0.374  0.890  0.002     0.141  0.980  0.000

Table 2.3: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with MODERATE expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        3.159  0.208  0.014     1.152  0.580  0.008     0.605  0.802  0.004
u = 1  PM         3.151  0.200  0.015     1.275  0.542  0.008     0.560  0.832  0.003
u = 1  WRSEL(a)   9.498  0.028  0.018     4.629  0.206  0.014     1.997  0.602  0.007
u = 1  WRSEL(b)   6.426  0.082  0.017     2.577  0.420  0.011     0.980  0.750  0.005
u = 1  PPR        8.818  0.112  0.016     2.705  0.474  0.010     0.995  0.784  0.004
u = 1  PRANK      2.164  0.378  0.011     0.729  0.764  0.004     0.319  0.922  0.001
u = 4  SMR        0.931  0.604  0.007     0.341  0.884  0.002     0.110  0.988  0.000
u = 4  PM         0.963  0.588  0.007     0.241  0.942  0.001     0.089  0.992  0.000
u = 4  WRSEL(a)   1.957  0.326  0.012     0.392  0.864  0.002     0.110  0.988  0.000
u = 4  WRSEL(b)   1.138  0.512  0.009     0.300  0.922  0.001     0.089  0.992  0.000
u = 4  PPR        1.483  0.534  0.008     0.272  0.938  0.001     0.089  0.992  0.000
u = 4  PRANK      0.769  0.690  0.006     0.195  0.962  0.001     0.077  0.994  0.000
u = 8  SMR        0.597  0.710  0.005     0.200  0.960  0.001     0.000  1.000  0.000
u = 8  PM         0.642  0.706  0.005     0.179  0.968  0.001     0.000  1.000  0.000
u = 8  WRSEL(a)   0.872  0.552  0.008     0.205  0.958  0.001     0.000  1.000  0.000
u = 8  WRSEL(b)   0.672  0.678  0.006     0.179  0.968  0.001     0.000  1.000  0.000
u = 8  PPR        0.722  0.700  0.005     0.179  0.968  0.001     0.000  1.000  0.000
u = 8  PRANK      0.546  0.790  0.004     0.161  0.974  0.000     0.000  1.000  0.000

Table 2.4: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with HIGH expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        1.834  0.250  0.014     0.799  0.630  0.007     0.417  0.860  0.003
u = 1  PM         1.321  0.428  0.010     0.494  0.804  0.004     0.249  0.950  0.001
u = 1  WRSEL(a)   7.695  0.030  0.018     2.490  0.358  0.012     0.832  0.782  0.004
u = 1  WRSEL(b)   3.025  0.188  0.015     0.906  0.674  0.006     0.319  0.920  0.001
u = 1  PPR        5.272  0.250  0.014     0.926  0.744  0.005     0.382  0.936  0.001
u = 1  PRANK      0.696  0.712  0.005     0.257  0.940  0.001     0.118  0.986  0.000
u = 4  SMR        0.651  0.684  0.006     0.283  0.920  0.001     0.077  0.994  0.000
u = 4  PM         0.562  0.766  0.004     0.200  0.960  0.001     0.063  0.996  0.000
u = 4  WRSEL(a)   1.049  0.504  0.009     0.268  0.934  0.001     0.077  0.994  0.000
u = 4  WRSEL(b)   0.660  0.716  0.005     0.205  0.958  0.001     0.063  0.996  0.000
u = 4  PPR        0.720  0.734  0.005     0.195  0.962  0.001     0.063  0.996  0.000
u = 4  PRANK      0.415  0.870  0.002     0.167  0.972  0.001     0.045  0.998  0.000
u = 8  SMR        0.537  0.734  0.005     0.118  0.986  0.000     0.000  1.000  0.000
u = 8  PM         0.454  0.816  0.003     0.110  0.988  0.000     0.000  1.000  0.000
u = 8  WRSEL(a)   0.610  0.686  0.006     0.110  0.988  0.000     0.000  1.000  0.000
u = 8  WRSEL(b)   0.486  0.792  0.004     0.110  0.988  0.000     0.000  1.000  0.000
u = 8  PPR        0.475  0.808  0.003     0.100  0.990  0.000     0.000  1.000  0.000
u = 8  PRANK      0.369  0.880  0.002     0.089  0.992  0.000     0.000  1.000  0.000

Table 2.5: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with LOW expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        5.264  0.455  0.031     3.530  0.578  0.024     1.712  0.808  0.011
u = 1  PM         3.904  0.412  0.033     2.762  0.568  0.024     1.611  0.821  0.010
u = 1  WRSEL(a)   9.437  0.111  0.050     7.373  0.276  0.041     3.445  0.619  0.022
u = 1  WRSEL(b)   7.071  0.249  0.043     4.987  0.425  0.033     2.207  0.740  0.015
u = 1  PPR        6.695  0.351  0.042     4.601  0.517  0.032     2.278  0.800  0.016
u = 1  PRANK      3.268  0.549  0.026     2.214  0.689  0.018     1.385  0.879  0.007
u = 4  SMR        2.172  0.656  0.019     1.508  0.823  0.010     1.204  0.977  0.001
u = 4  PM         2.148  0.635  0.021     1.485  0.819  0.010     1.193  0.987  0.001
u = 4  WRSEL(a)   3.767  0.442  0.032     2.081  0.706  0.017     1.228  0.961  0.002
u = 4  WRSEL(b)   2.465  0.583  0.024     1.586  0.793  0.012     1.198  0.983  0.001
u = 4  PPR        2.523  0.623  0.027     1.584  0.813  0.015     1.180  0.990  0.003
u = 4  PRANK      2.011  0.683  0.018     1.422  0.845  0.009     1.185  0.991  0.001
u = 8  SMR        1.701  0.740  0.015     1.279  0.908  0.005     1.179  0.997  0.000
u = 8  PM         1.697  0.745  0.014     1.268  0.914  0.005     1.182  0.999  0.000
u = 8  WRSEL(a)   2.043  0.646  0.020     1.389  0.853  0.008     1.191  0.996  0.000
u = 8  WRSEL(b)   1.755  0.721  0.016     1.301  0.901  0.006     1.183  0.999  0.000
u = 8  PPR        1.781  0.741  0.019     1.287  0.915  0.010     1.179  0.999  0.001
u = 8  PRANK      1.655  0.764  0.013     1.253  0.922  0.004     1.181  0.999  0.000

Table 2.6: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with HIGH expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        4.105  0.347  0.037     2.511  0.599  0.023     1.767  0.778  0.013
u = 1  PM         3.514  0.509  0.028     2.121  0.727  0.015     1.504  0.865  0.008
u = 1  WRSEL(a)  14.481  0.014  0.056     9.552  0.119  0.050     4.741  0.432  0.032
u = 1  WRSEL(b)   7.166  0.209  0.045     4.073  0.522  0.027     2.161  0.777  0.013
u = 1  PPR        7.550  0.413  0.046     4.196  0.672  0.032     2.466  0.843  0.019
u = 1  PRANK      2.628  0.713  0.016     1.572  0.835  0.009     1.328  0.934  0.004
u = 4  SMR        2.269  0.646  0.020     1.445  0.854  0.008     1.204  0.972  0.002
u = 4  PM         2.184  0.677  0.018     1.378  0.883  0.007     1.184  0.983  0.001
u = 4  WRSEL(a)   5.851  0.294  0.040     2.518  0.673  0.019     1.262  0.929  0.004
u = 4  WRSEL(b)   2.722  0.601  0.023     1.504  0.849  0.009     1.190  0.978  0.001
u = 4  PPR        4.034  0.648  0.032     1.583  0.867  0.023     1.184  0.984  0.010
u = 4  PRANK      1.966  0.729  0.015     1.304  0.917  0.005     1.179  0.989  0.001
u = 8  SMR        1.965  0.705  0.017     1.291  0.909  0.005     1.163  0.994  0.000
u = 8  PM         1.967  0.718  0.016     1.268  0.923  0.004     1.160  0.997  0.000
u = 8  WRSEL(a)   3.107  0.551  0.025     1.437  0.838  0.009     1.167  0.989  0.001
u = 8  WRSEL(b)   2.091  0.691  0.018     1.290  0.911  0.005     1.161  0.995  0.000
u = 8  PPR        3.275  0.705  0.030     1.271  0.927  0.021     1.167  0.998  0.011
u = 8  PRANK      1.893  0.753  0.014     1.244  0.941  0.003     1.162  0.995  0.000

Table 2.7: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with LOW expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        9.478  0.412  0.033     4.000  0.630  0.021     2.561  0.816  0.010
u = 1  PM         8.487  0.286  0.040     3.993  0.543  0.026     2.544  0.777  0.013
u = 1  WRSEL(a)   9.800  0.081  0.052     6.782  0.301  0.040     3.668  0.613  0.022
u = 1  WRSEL(b)   9.080  0.147  0.048     5.522  0.427  0.032     2.872  0.717  0.016
u = 1  PPR        8.361  0.227  0.047     5.112  0.497  0.032     2.600  0.758  0.017
u = 1  PRANK      8.642  0.391  0.034     3.926  0.631  0.021     2.482  0.823  0.010
u = 4  SMR        3.484  0.577  0.024     1.458  0.851  0.008     1.191  0.971  0.002
u = 4  PM         3.335  0.521  0.027     1.487  0.835  0.009     1.180  0.976  0.001
u = 4  WRSEL(a)   4.550  0.367  0.036     1.871  0.750  0.014     1.208  0.954  0.003
u = 4  WRSEL(b)   3.618  0.476  0.030     1.570  0.818  0.010     1.186  0.972  0.002
u = 4  PPR        4.030  0.505  0.032     1.536  0.833  0.013     1.171  0.979  0.003
u = 4  PRANK      3.337  0.569  0.024     1.437  0.856  0.008     1.167  0.983  0.001
u = 8  SMR        2.495  0.641  0.020     1.308  0.904  0.005     1.177  0.998  0.000
u = 8  PM         2.509  0.600  0.023     1.314  0.905  0.005     1.176  0.998  0.000
u = 8  WRSEL(a)   2.878  0.540  0.026     1.384  0.861  0.008     1.184  0.995  0.000
u = 8  WRSEL(b)   2.591  0.585  0.024     1.326  0.897  0.006     1.178  0.998  0.000
u = 8  PPR        3.277  0.603  0.025     1.302  0.915  0.009     1.174  0.999  0.001
u = 8  PRANK      2.513  0.612  0.022     1.303  0.911  0.005     1.178  0.998  0.000

Table 2.8: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with HIGH expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Inflated to              10th                     3rd                     1st
u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR        3.787  0.379  0.035     2.308  0.593  0.023     1.552  0.809  0.011
u = 1  PM         3.438  0.465  0.030     2.056  0.699  0.017     1.396  0.882  0.007
u = 1  WRSEL(a)  13.639  0.015  0.056     8.508  0.124  0.050     4.016  0.481  0.029
u = 1  WRSEL(b)   7.210  0.183  0.046     3.736  0.494  0.029     1.794  0.801  0.011
u = 1  PPR        7.308  0.375  0.047     3.646  0.647  0.032     1.678  0.871  0.018
u = 1  PRANK      2.550  0.665  0.019     1.547  0.819  0.010     1.230  0.941  0.003
u = 4  SMR        2.032  0.649  0.020     1.341  0.885  0.007     1.170  0.973  0.002
u = 4  PM         1.991  0.668  0.019     1.297  0.918  0.005     1.154  0.985  0.001
u = 4  WRSEL(a)   4.936  0.218  0.044     1.827  0.717  0.016     1.194  0.954  0.003
u = 4  WRSEL(b)   2.349  0.569  0.024     1.356  0.885  0.007     1.159  0.981  0.001
u = 4  PPR        2.618  0.647  0.032     1.290  0.922  0.016     1.142  0.990  0.007
u = 4  PRANK      1.852  0.738  0.015     1.252  0.944  0.003     1.150  0.987  0.001
u = 8  SMR        1.785  0.749  0.014     1.246  0.935  0.004     1.152  0.996  0.000
u = 8  PM         1.770  0.768  0.013     1.215  0.955  0.003     1.145  0.997  0.000
u = 8  WRSEL(a)   2.456  0.539  0.026     1.321  0.872  0.007     1.154  0.994  0.000
u = 8  WRSEL(b)   1.865  0.718  0.016     1.224  0.946  0.003     1.148  0.996  0.000
u = 8  PPR        1.930  0.766  0.024     1.216  0.964  0.013     1.144  0.998  0.008
u = 8  PRANK      1.756  0.799  0.011     1.212  0.959  0.002     1.145  0.996  0.000

Chapter 3

Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes

3.1 Introduction

In public health, environmental and ecological studies, variables measured at the same

spatial locations may be correlated so that the spatial structures of such variables

across the region under consideration are very similar, indicating that they may be

characterized by a common spatial risk surface. Employing such a commonality in

risks may be useful for gaining precision of local area risk estimates, especially for

rare diseases.

Shared component spatial models have been studied in a variety of applied con-

texts. Knorr-Held and Best (2001) proposed a shared-component model which mim-

ics an ecological regression on the unobserved shared component. The two diseases

considered in that application share a common spatial structure and, as well, support disease-specific spatially uncorrelated random errors. Fitting the model requires strong prior assumptions on the random spatial and uncorrelated errors, typically because of challenges related to the identifiability of the latent spatial fields. Wang

and Wall (2003) proposed a common spatial factor model to study multivariate indi-

cators of cancer risk across counties in Minnesota. To avoid identifiability issues, the

model includes the common spatial structure term but no excess heterogeneity and,

as well, the variance of the shared spatially correlated random effect is considered

as fixed. Hogan and Tchernis (2004) proposed a common factor model for spatial

multivariate count data with constraints imposed on the variance structure of the

conditional autoregressive model they employ. Congdon (2006) set out a modeling

framework for modeling multiple health outcomes over area, age, and time dimensions

that takes account of spatial correlation as well as interactions between dimensions.

Tzala and Best (2006) proposed a Bayesian latent variable model for cancer mor-

tality data, which linked spatial effects. As well, other joint modeling approaches

for multivariate spatial data have been proposed including the multivariate version

of the conditional autoregressive model (MVCAR) (Gelfand and Vounatsou, 2003),

which assumes the spatial structure is the same across the multivariate outcomes.

Such modelling allows for the pooling of information across spatial units as well as

across multiple outcomes within units. In contrast, the common spatial factor model

may stratify the spatial variation into two components: the shared component and

outcome-specific components. Such a modeling approach permits a simple analysis

of which spatial term dominates as well as an identification of the common spatial

structure. Though testing for a common spatial structure is quite relevant in certain

studies, there has been very little discussion of the power of such tests. We consider

this in the context of the analysis of count data and also examine the utility of joint

modeling in terms of gains in the efficiency of estimating relative risks.

In environmental and ecological studies, count data are often characterized by an excess of zeros and spatial dependence (Clarke and Green, 1988; Welsh et al., 1996; Martin et al., 2005). In studies of species abundance, for example, a large proportion of zero counts may indicate that the habitat is unsuitable in certain areas. In such cases, standard distributions such as the Poisson, binomial and negative binomial may fail to provide an adequate fit. Zero-inflated distributions (Lambert, 1992) form a class of distributions for such data.

For handling zero-inflation, mixture models and conditional models are two common approaches within the context of ecological and health studies. The well-known zero-inflated Poisson (ZIP) model (Lambert, 1992) is a mixture of a degenerate zero mass and a Poisson distribution. On the other hand, Welsh et al. (1996) formulate a two-component conditional model where the presence/absence of counts is modeled with a binomial distribution and the abundance at active sites is modeled using a truncated Poisson or truncated negative binomial distribution. These two models have different interpretations. Structural zeros and random zeros are not distinguished under the conditional specification, whereas the mixture model permits an examination of the different sources of error (Kuhnert et al., 2005). For more discussion of zero-inflated models from a Bayesian perspective see Angers and Biswas (2003) and Ainsworth (2007).
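The difference in how the two formulations treat zeros can be illustrated with a small numerical sketch. The function names and the parameter values below are assumptions made only for illustration; they are not taken from the cited papers. Under the mixture formulation the probability of a zero splits into a structural part and a part arising from the Poisson component, whereas the two-part conditional formulation has a single source of zeros.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability mass at count y with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)


def zip_zero_sources(p_structural, mu):
    """Mixture (ZIP) model: split P(Y = 0) into its two sources."""
    structural = p_structural                              # degenerate zero mass
    sampling = (1.0 - p_structural) * poisson_pmf(0, mu)   # zeros from the Poisson part
    return structural, sampling


def hurdle_pmf(y, p_present, mu):
    """Two-part conditional model: presence/absence, then zero-truncated Poisson."""
    if y == 0:
        return 1.0 - p_present                             # one undivided source of zeros
    return p_present * poisson_pmf(y, mu) / (1.0 - poisson_pmf(0, mu))


# Hypothetical values, chosen only for illustration.
structural, sampling = zip_zero_sources(p_structural=0.3, mu=2.0)
zip_zero_prob = structural + sampling
hurdle_total = sum(hurdle_pmf(y, p_present=0.6, mu=2.0) for y in range(60))

print(f"ZIP zeros: structural {structural:.3f} + sampling {sampling:.3f}")
print(f"Two-part model probabilities sum to {hurdle_total:.6f}")
```

In the two-part model the zero probability is simply one minus the presence probability, so no decomposition of the zeros is available.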

In many applications, zero-inflated count data are spatially correlated. Rathbun

and Fei (2006) introduced a zero-inflated Poisson model, in which the component

modeling the excess zeros is governed by a hidden spatial probit model; a threshold,

defining large probabilities in the probit layer, governs the proportion of zeros. Agar-

wal et al. (2002) also proposed a zero inflated model for spatial count data using a

mixture model approach and incorporating spatial random errors into either or both

of the model components. With multivariate zero-inflated count data corresponding

to several related spatial outcomes, there is also the possibility of linking model com-

ponents across the various outcomes using a shared latent spatial structure. This

would be relevant, for example, if the underlying, hidden mechanisms resulting in the

structural zeros, or the abundance of counts, are related across the outcomes.

The methods developed in this chapter for joint outcome analysis of spatial count

and zero-heavy count data focus on the use of shared latent spatial frailty models.

We discuss such joint mapping models and evaluate what benefits may be achieved

through joint modeling. The rest of the chapter is structured as follows. Section

3.2 describes a general modeling framework for common spatial factor models for

count data and zero-inflated count data. Section 3.3 presents two motivating appli-

cations, applying the common spatial factor model to Ontario lung cancer data and

zero-inflated forestry infection data related to a study of Comandra blister rust on

lodgepole pine trees. Section 3.4 examines hypothesis testing of whether two spatial

maps share the same underlying spatial structure for count data. A power study is

performed based on the situational context of the Ontario lung cancer data. Section

3.5 compares joint and separate modeling in terms of accuracy and efficiency of es-

timating relative risks through simulation investigations. Some closing remarks are

provided in Section 3.6.

3.2 Models for Joint Count Outcomes

We present here a general modeling framework for the common spatial factor model

for joint modeling of count data and zero-inflated count data. In disease mapping,

the typical response is a rate (both in health and forest epidemiology), hence the

focus on the analysis of counts herein. However, generalization of the model to other

non-normal data is straightforward.

3.2.1 Common Spatial Factor Model for Counts

Let $y_{ij} \mid \mu_{ij} \sim \mathrm{Poisson}(\mu_{ij})$ for region $i = 1, \dots, n$ and outcome $j = 1, \dots, J$, where $y_{ij}$ denotes the response and $\mu_{ij}$ denotes the expected mean count for outcome $j$ in region $i$. The common spatial factor model can be written as:

$$\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + \gamma_j b_i + h_{ij}, \tag{3.1}$$

where $\alpha_j$ denotes the overall mean rate for the $j$th outcome and $E_{ij}$ is the expected number of disease counts in region $i$ for the $j$th outcome based on some standardized rates; $b_i$, $i = 1, \dots, n$, is the spatial random effect, assumed here to follow a conditional autoregressive distribution (Besag, 1974) to account for the spatially structured correlation in the outcomes; $\gamma_j$ is the factor loading for the shared spatial component on outcome $j$, with $\gamma_1 = 1$; and $h_{ij}$ represents uncorrelated error terms. The spatial component $\boldsymbol{b} = (b_1, \dots, b_n)^T \sim \mathrm{MVN}(\boldsymbol{0}, \Sigma_b)$, with $\Sigma_b$ having generalized inverse $\Sigma_b^{-1} = \sigma_b^{-2}(D - W)$; $W = (W_{rs})$ is often called the neighborhood matrix and $D = \mathrm{diag}(W_{1\cdot}, W_{2\cdot}, \dots, W_{n\cdot})$ with $W_{r\cdot} = \sum_{s=1}^{n} W_{rs}$. The neighborhood matrix defines the spatial structure across the region; a simple adjacency model sets $W_{rs}$ to 1 if regions $r$ and $s$ share a boundary, and 0 otherwise. The spatially uncorrelated random effect $\boldsymbol{h}_j \sim N(\boldsymbol{0}, \sigma_{hj}^2 I)$, where $\boldsymbol{h}_j = (h_{1j}, \dots, h_{nj})^T$, $j = 1, \dots, J$, and $I$ is the identity matrix. The relative risk for outcome $j$ within region $i$ is then $\theta_{ij} = \exp(\alpha_j + \gamma_j b_i + h_{ij})$.
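As a minimal sketch of the spatial structure just described, the following Python code builds the precision matrix of the conditional autoregressive effect from an adjacency matrix and evaluates the relative risk. The four-region map, the value of the spatial standard deviation and the names `car_precision`, `relative_risk`, `alpha_j` and `gamma_j` are hypothetical choices for illustration only.

```python
import numpy as np


def car_precision(W, sigma_b):
    """Precision matrix (D - W) / sigma_b^2 of the intrinsic CAR effect.

    W is a symmetric 0/1 adjacency matrix; D holds its row sums.
    """
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    return (D - W) / sigma_b**2


def relative_risk(alpha_j, gamma_j, b, h_j):
    """Relative risk exp(alpha_j + gamma_j * b_i + h_ij) for every region i."""
    return np.exp(alpha_j + gamma_j * np.asarray(b) + np.asarray(h_j))


# Hypothetical map: four regions in a row, each bordering its neighbours.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = car_precision(W, sigma_b=0.5)

row_sums = Q.sum(axis=1)            # all zero, so Q is singular
rank = np.linalg.matrix_rank(Q)     # n - 1 for a connected map

rr = relative_risk(alpha_j=0.0, gamma_j=1.0, b=[0.2, 0.1, -0.1, -0.2], h_j=np.zeros(4))
print(row_sums, rank, rr)
```

Because every row of D - W sums to zero, the precision matrix has no ordinary inverse, which is why the covariance is described through a generalized inverse.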

The common spatial factor model may be implemented in a Bayesian framework using Markov chain Monte Carlo (MCMC) procedures. The joint posterior distribution is expressed as

$$p(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma}, \sigma_b^2, \sigma_{h1}^2, \dots, \sigma_{hJ}^2 \mid Y) \propto L(Y \mid \boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma})\, p(\boldsymbol{b} \mid \sigma_b^2)\, p(\boldsymbol{h} \mid \sigma_h^2)\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\gamma})\, p(\sigma_b^2)\, p(\sigma_{h1}^2) \cdots p(\sigma_{hJ}^2), \tag{3.2}$$

with $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_J)^T$, $\boldsymbol{b} = (b_1, \dots, b_n)^T$, $\boldsymbol{h} = (\boldsymbol{h}_1, \dots, \boldsymbol{h}_J)^T$ where $\boldsymbol{h}_j = (h_{1j}, \dots, h_{nj})^T$, and $\boldsymbol{\gamma} = (\gamma_2, \dots, \gamma_J)^T$. The first term on the right hand side of (3.2) is the conditional likelihood,

$$L(Y \mid \boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{h}, \boldsymbol{\gamma}) \propto \exp\left[-\sum_{i=1}^{n}\sum_{j=1}^{J} E_{ij}\exp(\alpha_j + \gamma_j b_i + h_{ij})\right] \times \prod_{i=1}^{n}\prod_{j=1}^{J}\left[E_{ij}\exp(\alpha_j + \gamma_j b_i + h_{ij})\right]^{y_{ij}}. \tag{3.3}$$

The second and third terms on the right hand side of (3.2) are the distributions of $\boldsymbol{b}$ and $\boldsymbol{h}$ respectively, and the remaining terms are the prior distributions on $(\boldsymbol{\alpha}, \boldsymbol{\gamma}, \sigma_b^2, \sigma_{h1}^2, \dots, \sigma_{hJ}^2)$. Flat priors are assigned to $\boldsymbol{\alpha}$, while $\boldsymbol{\gamma} = (\gamma_2, \dots, \gamma_J)^T$ are assigned $N(0, \sigma_\gamma^2)$ priors with $\sigma_\gamma^2 = 1000$; the priors assigned to the standard deviations of the spatially structured and unstructured random effects $(\sigma_b, \sigma_{h1}, \dots, \sigma_{hJ})$ are uniformly distributed between 0 and 100, due to the robust properties of this prior (Gelman, 2006). The Gibbs sampler (Gelfand and Smith, 1990; Carlin and Louis, 2000) is a natural choice for updating the parameters in this setting, since it takes advantage of the conditional specification of the joint distribution. Each of the full conditional distributions required by the Gibbs sampler can be successively updated.
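The conditional likelihood in (3.3) is the quantity that any sampler for this model has to evaluate repeatedly. The sketch below computes its logarithm, up to an additive constant, for array-valued inputs; it is an illustration under assumed array shapes and toy values, not the implementation used for the analyses reported here, and the names `log_likelihood`, `alpha` and `gamma` are hypothetical.

```python
import numpy as np


def log_likelihood(y, E, alpha, gamma, b, h):
    """Log of the conditional likelihood (3.3), up to an additive constant.

    y, E, h      : arrays of shape (n, J), regions by outcomes
    alpha, gamma : arrays of shape (J,), with gamma[0] fixed at 1
    b            : array of shape (n,), the shared spatial effect
    """
    log_rr = alpha[None, :] + gamma[None, :] * b[:, None] + h   # log relative risk
    mu = E * np.exp(log_rr)                                     # Poisson means
    return float(np.sum(y * np.log(mu) - mu))


# Hypothetical toy data: n = 3 regions, J = 2 outcomes.
y = np.array([[4, 2], [7, 5], [1, 0]])
E = np.array([[3.0, 2.5], [6.0, 4.0], [1.5, 1.0]])
alpha = np.array([0.1, -0.2])
gamma = np.array([1.0, 0.7])
b = np.array([0.3, 0.0, -0.3])
h = np.zeros((3, 2))

value = log_likelihood(y, E, alpha, gamma, b, h)
print(value)
```

The term involving the factorials of the counts is omitted because it does not depend on the parameters.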

3.2.2 Common Spatial Factor Model for Zero Heavy Counts

In many applications, a large number of zero counts may be observed for spatial outcomes. We refer to situations where the number of zeros is unacceptably large given the spatial structure of means over the map, posing challenges for analysis in terms of goodness of fit when using a simple spatial count model based on a conditional Poisson distribution. Here, we extend the common spatial factor model to handle such zero-inflated count data. Conditional on spatial random effects $b_i$ and $d_i$ as described below, suppose that the response variable $Y_{ij}$ is distributed as

$$Y_{ij} \mid z_{ij} \sim \begin{cases} 0 & \text{if } z_{ij} = 1, \\ \mathrm{Poisson}(\mu_{ij}) & \text{if } z_{ij} = 0, \end{cases} \tag{3.4}$$

where $i$ indexes region, $i = 1, \dots, n$, and $j$ indexes the multivariate outcomes, for example, diseases, $j = 1, \dots, J$. Here, $Y_{ij}$ denotes the observed $j$th outcome at the $i$th spatial location. The variable $z_{ij}$ is a latent Bernoulli indicator, a trigger for the excess zeros, with mean function $\pi_{ij}$, while $\mathrm{Poisson}(\mu_{ij})$ denotes a conditionally independent Poisson random variable with mean $\mu_{ij}$, conditional on $b_i$ and $d_i$. Before specifying $b_i$ and $d_i$, we note immediately that the corresponding probability distribution function is

$$P(Y_{ij} = y_{ij}) = \begin{cases} \pi_{ij} + (1 - \pi_{ij})e^{-\mu_{ij}} & \text{if } y_{ij} = 0, \\ (1 - \pi_{ij})\dfrac{e^{-\mu_{ij}}\mu_{ij}^{y_{ij}}}{y_{ij}!} & \text{if } y_{ij} > 0. \end{cases} \tag{3.5}$$

The parameters $\mu_{ij}$ and $\pi_{ij}$ depend on random effects $b_i$, $d_i$, so that

$$\log(\mu_{ij}) = \alpha_j + \gamma_j b_i, \qquad \mathrm{logit}(\pi_{ij}) = \beta_j + \omega_j d_i, \tag{3.6}$$

where $\alpha_j$ and $\beta_j$ denote the overall mean rates for the Poisson and excess zero probability components for the $j$th outcome; $\boldsymbol{b} = (b_1, \dots, b_n)^T \sim \mathrm{MVN}(\boldsymbol{0}, \sigma_b^2(D - W)^{-1})$ is the spatially correlated random effect for the Poisson count component and $\boldsymbol{d} = (d_1, \dots, d_n)^T \sim \mathrm{MVN}(\boldsymbol{0}, \sigma_d^2(D - W)^{-1})$ is the spatially correlated random effect for the excess zero probability component. The mean and the variance of the conditional distributions are $E(Y_{ij}) = (1 - \pi_{ij})\mu_{ij}$ and $\mathrm{var}(Y_{ij}) = (1 - \pi_{ij})\mu_{ij}(1 + \pi_{ij}\mu_{ij})$. The factor loading parameter $\gamma_j$ reflects the influence of the latent common spatial factor $\boldsymbol{b}$ on the Poisson component of the $j$th outcome and correspondingly $\omega_j$ measures the impact of the shared component $\boldsymbol{d}$ on the component generating excess zeros for the $j$th outcome; we set $\gamma_1 = \omega_1 \equiv 1$. Similar shared random effect structures could be derived for other link functions beyond the log and logit. The details of the likelihood and prior below reflect an assumption that $\boldsymbol{b}$ and $\boldsymbol{d}$ are independent; however, more complicated forms may be considered.
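The stated mean and variance of the zero-inflated Poisson distribution can be checked numerically. The sketch below evaluates the probability function (3.5) with the log and logit links of (3.6) for a single region and outcome, then compares moments obtained by direct summation with the closed-form expressions. All numerical values and the names `alpha_j`, `gamma_j`, `beta_j` and `omega_j` are hypothetical and serve only as an illustration.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def zip_pmf(y, pi, mu):
    """Zero-inflated Poisson probability of count y, as in (3.5)."""
    poisson = math.exp(-mu) * mu**y / math.factorial(y)
    return pi * (y == 0) + (1.0 - pi) * poisson


# Hypothetical parameter and random-effect values for one region and outcome.
alpha_j, gamma_j, b_i = 0.5, 0.8, 0.3
beta_j, omega_j, d_i = -1.0, 1.2, 0.4
mu = math.exp(alpha_j + gamma_j * b_i)        # Poisson mean via the log link
pi = inv_logit(beta_j + omega_j * d_i)        # excess-zero probability via the logit link

support = range(100)                          # the tail beyond 100 is negligible here
total = sum(zip_pmf(y, pi, mu) for y in support)
mean_numeric = sum(y * zip_pmf(y, pi, mu) for y in support)
var_numeric = sum((y - mean_numeric) ** 2 * zip_pmf(y, pi, mu) for y in support)

mean_formula = (1.0 - pi) * mu
var_formula = (1.0 - pi) * mu * (1.0 + pi * mu)
print(total, mean_numeric, mean_formula, var_numeric, var_formula)
```

The variance exceeds the mean whenever the excess-zero probability is positive, which is the overdispersion induced by zero inflation.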

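The zero-inflated common spatial factor model (3.6) may be implemented in a Bayesian framework using MCMC methods. The joint posterior distribution of the parameters is:

$$p(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega}, \sigma_b^2, \sigma_d^2 \mid Y) \propto L(Y \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega})\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\beta})\, p(\boldsymbol{\gamma})\, p(\boldsymbol{\omega})\, p(\boldsymbol{b} \mid \sigma_b^2)\, p(\boldsymbol{d} \mid \sigma_d^2)\, p(\sigma_b^2)\, p(\sigma_d^2), \tag{3.7}$$

where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_J)^T$, $\boldsymbol{\beta} = (\beta_1, \dots, \beta_J)^T$, $\boldsymbol{\gamma} = (\gamma_2, \dots, \gamma_J)^T$, $\boldsymbol{\omega} = (\omega_2, \dots, \omega_J)^T$. The first term on the right hand side of (3.7) is the likelihood

$$L(Y \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{b}, \boldsymbol{d}, \boldsymbol{\gamma}, \boldsymbol{\omega}) \propto \prod_{i=1}^{n}\prod_{j=1}^{J}\left[ I(y_{ij} = 0)\left\{\pi_{ij} + (1 - \pi_{ij})e^{-\mu_{ij}}\right\} + I(y_{ij} > 0)\left\{(1 - \pi_{ij})\frac{e^{-\mu_{ij}}\mu_{ij}^{y_{ij}}}{y_{ij}!}\right\}\right], \tag{3.8}$$

where $I(A)$ is the indicator function, so that each factor agrees with the probability function in (3.5). The forms of the conditional distributions of the model parameters are proportional to the joint posterior density as in (3.7); these can be obtained by retaining quantities involving relevant parameters from the joint posterior. Samples from the posterior distribution in (3.7) so obtained by MCMC allow computation of summary measures and credible intervals of any arbitrary functional of the parameters. We assign normal priors on $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ and $\boldsymbol{\omega}$ with a moderately large variance to avoid computational instability. The parameters $(\sigma_b, \sigma_d)$ can be assigned uniform distributions, again with only moderately large variance, to avoid the situation of exceedingly large random effects arising in a sample. Diffuse priors may be employed to determine sensitivity to these choices.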

3.2.3 Model Assessment and Comparison

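To compare various models we employ the deviance information criterion (DIC), defined as $\mathrm{DIC} = \overline{D(\boldsymbol{\theta})} + p_D$, where $\overline{D(\boldsymbol{\theta})}$ is the posterior mean of the deviance $D(\boldsymbol{\theta}) = -2\log L(Y \mid \boldsymbol{\theta})$, and $\boldsymbol{\theta}$ denotes the collection of parameters in the model (Spiegelhalter et al., 2002). The penalty term $p_D$ is the effective number of model parameters, defined by $p_D = \overline{D(\boldsymbol{\theta})} - D(\bar{\boldsymbol{\theta}})$, where $\bar{\boldsymbol{\theta}} = E[\boldsymbol{\theta} \mid Y]$ is the posterior mean of $\boldsymbol{\theta}$. Models with lower DIC scores are preferred as they achieve a better balance of fit and parsimony.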
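The DIC calculation can be sketched for a deliberately simple case. The example below assumes independent Poisson counts with a single common mean and uses draws from the corresponding gamma posterior under a flat prior as a stand-in for MCMC output; the data, the function names and the one-parameter model are hypothetical and far simpler than the spatial models of this chapter.

```python
import math

import numpy as np


def poisson_deviance(y, theta):
    """D(theta) = -2 log L(y | theta) for i.i.d. Poisson counts with common mean theta."""
    log_lik = sum(v * math.log(theta) - theta - math.lgamma(v + 1) for v in y)
    return -2.0 * log_lik


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws and D at the posterior mean."""
    d_bar = float(np.mean(deviance_draws))        # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean      # effective number of parameters
    return d_bar + p_d, p_d


# Hypothetical counts; the posterior under a flat prior is Gamma(sum(y) + 1, rate = n).
y = [3, 5, 4, 6, 2]
rng = np.random.default_rng(1)
theta_draws = rng.gamma(shape=sum(y) + 1, scale=1.0 / len(y), size=5000)

deviance_draws = [poisson_deviance(y, t) for t in theta_draws]
theta_bar = float(np.mean(theta_draws))
dic_value, p_d = dic(deviance_draws, poisson_deviance(y, theta_bar))
print(dic_value, p_d)
```

With one free parameter, the estimated effective number of parameters should be close to one, which gives a quick sanity check on the calculation.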

A common and effective diagnostic tool in Bayesian analyses is the comparison of the posterior predictive distribution of replicated data under the model with the observed data (Rubin, 1984; Gelman et al., 1996, 2004). If the model accurately represents the process that generated the data, then replicated data generated under the model should look similar to the observed data. Posterior predictive comparisons are usually implemented by drawing simulated values from the posterior predictive distribution and comparing these samples to the observed data using test quantities that characterize important features of the data. Let $L(y \mid \boldsymbol{\theta})$ denote the likelihood for the model, where $y$ denotes the data and $\boldsymbol{\theta}$ denotes all the parameters in the model, including the hyperparameters. The posterior distribution of $\boldsymbol{\theta}$ is $f(\boldsymbol{\theta} \mid y) \propto L(y \mid \boldsymbol{\theta})f(\boldsymbol{\theta})$, where $f(\boldsymbol{\theta})$ denotes the prior distribution of the parameters. The posterior predictive distribution of replicated data $y^{rep}$ is then defined as:

$$f(y^{rep} \mid y) = \int L(y^{rep} \mid \boldsymbol{\theta})\, f(\boldsymbol{\theta} \mid y)\, d\boldsymbol{\theta}, \tag{3.9}$$

which is the likelihood of the future data averaged over the posterior distribution $f(\boldsymbol{\theta} \mid y)$. This distribution is termed the posterior predictive distribution.

The predictive data $y^{rep}$ reflect the observations expected under the model given the observed data $y$. If the model is adequate, the values of $y$ and $y^{rep}$ should be close. An evaluation of closeness can be carried out using some summary function of distance $D(y, \boldsymbol{\theta})$, with assessment of the overall goodness of fit using the posterior predictive p-value (Meng, 1994) given by

$$P\left(D(y^{rep}, \boldsymbol{\theta}) > D(y, \boldsymbol{\theta}) \mid y\right).$$

In our application, we considered

$$D(y, \boldsymbol{\theta}) = \sum_{i=1}^{n}\sum_{j=1}^{J}\frac{(y_{ij} - \mu_{ij})^2}{\mu_{ij}}. \tag{3.10}$$

We comment that Meng (1994) and Carlin and Louis (2000) argue that posterior predictive checks provide a measure of discrepancy between the model and the data, and are not directly informative for model comparison and inference. A posterior predictive p-value around 0.5 indicates that the distributions of the replicated and actual data are close, while a value close to zero or one indicates strong differences between them (Gelman et al., 1996).
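A posterior predictive p-value based on the discrepancy in (3.10) can be estimated from posterior draws of the Poisson means by simulating one replicated data set per draw. The sketch below assumes such draws are available as an array; here they are replaced by artificial gamma draws purely so that the code runs, and the names `chi_square_discrepancy` and `posterior_predictive_pvalue`, the array shapes and all numerical values are hypothetical.

```python
import numpy as np


def chi_square_discrepancy(y, mu):
    """Sum over regions and outcomes of (y - mu)^2 / mu, one value per posterior draw.

    mu has shape (S, n, J); y has shape (n, J) or (S, n, J).
    """
    return np.sum((y - mu) ** 2 / mu, axis=(1, 2))


def posterior_predictive_pvalue(y, mu_draws, rng):
    """Estimate P(D(y_rep, theta) > D(y, theta) | y) from S posterior draws of the means."""
    y_rep = rng.poisson(mu_draws)                     # one replicate per posterior draw
    d_obs = chi_square_discrepancy(y, mu_draws)
    d_rep = chi_square_discrepancy(y_rep, mu_draws)
    return float(np.mean(d_rep > d_obs))


rng = np.random.default_rng(0)
# Hypothetical observed counts for 37 regions and 2 outcomes.
y = rng.poisson(5.0, size=(37, 2))
# Stand-in for MCMC output: artificial draws of the means centred near 5.
mu_draws = rng.gamma(shape=50.0, scale=0.1, size=(1000, 37, 2))

p_value = posterior_predictive_pvalue(y, mu_draws, rng)
print(p_value)
```

In a real analysis `mu_draws` would hold the sampled Poisson means from the fitted model, so that each replicate reflects posterior uncertainty in the parameters.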

3.3 Applications

In this section, we will present two applications which utilize common spatial factor

models: an analysis of Ontario lung cancer for males and females, and an analysis of

zero-inflated forestry data which relate to Comandra blister rust infection of lodgepole

pine trees.

3.3.1 Ontario Lung Cancer

Lung cancer incidence and expected counts in 37 public health units (PHUs) over the period 1995-2002 in Ontario are considered here in a joint analysis of counts for males ($y_{im}$ observed; $E_{im}$ expected) and for females ($y_{if}$ observed; $E_{if}$ expected). Figure 3.1 displays the standardized mortality rates (SMR) for Ontario lung cancer for males and females, respectively, with some similarity of spatial structure observed across genders.

Figure 3.1: SMR of Ontario lung cancer for males and females.

The common spatial factor model for joint modeling of lung cancer mortality counts for males and females is expressed as

$$\begin{cases} \log(\mu_{im}) = \alpha_m + \log(E_{im}) + b_i + h_{im}, \\ \log(\mu_{if}) = \alpha_f + \log(E_{if}) + \gamma \cdot b_i + h_{if}, \end{cases} \qquad (3.11)$$

where the common spatial structure $b = (b_1, \cdots, b_n)^T \sim MVN(0, \sigma_b^2 (D - W)^{-1})$, $h_m = (h_{1m}, \cdots, h_{nm})^T \sim MVN(0, \sigma_{hm}^2 I)$, $h_f = (h_{1f}, \cdots, h_{nf})^T \sim MVN(0, \sigma_{hf}^2 I)$ and $\gamma$ is the factor loading for the shared spatial random effect. Two chains with dispersed starting values for all parameters were run. Each chain was run for an initial 10,000 burn-in iterations followed by an additional 10,000 iterations thinned at 10, resulting in a total of 2000 iterations to be used for posterior inference. Sensitivity with respect to prior distributions was assessed by repeating the analysis with other weakly informative prior specifications. This comparison indicated that the results are fairly robust over the forms of prior considered. Table 3.1 presents the posterior summary statistics for the parameter estimates obtained from employing M1.

Table 3.1: Posterior summaries for the parameters of M1 in the analysis of Ontario lung cancer incidence.

Parameter          Mean     95% CI
$\alpha_m$         0.066    (0.039, 0.090)
$\alpha_f$         0.086    (0.055, 0.115)
$\gamma$           1.145    (0.866, 1.466)
$\sigma_b^2$       0.059    (0.032, 0.116)
$\sigma_{hm}^2$    0.0038   (0.0018, 0.0094)
$\sigma_{hf}^2$    0.0050   (0.0023, 0.0129)

That much of the variability is accounted for by the common factor component is reflected in the posterior mean (95% credible interval) of the variance parameter of the shared random effect, 0.059 (0.032, 0.116), when contrasted with the random noise terms estimated as 0.0038 (0.0018, 0.0094) for males and 0.0050 (0.0023, 0.0129) for females. The spatial variation dominates the spatially uncorrelated errors. As well, it is important to note that $\sigma_b^2$ does not reflect the full variance term as it is simply the scale factor operating on the variance-covariance matrix of the spatial effects. The factor loading parameter $\gamma$ is estimated as 1.145 (0.866, 1.466), indicating that males and females share a common and possibly identical spatial structure, with $\gamma$ substantially above zero and with $\gamma = 1$ not rejected.

The posterior median estimates along with the 95% credible intervals of the common latent spatially structured effect and the male- and female-specific unstructured components are displayed in Figure 3.2. The common spatial component resembles the SMR maps for the male and female lung cancer data (Figure 3.1) with an urban versus rural effect apparent, while the residual terms appear to be more or less flat, confirming the dominance of a strong underlying spatial structure shared between males and females.

Figure 3.2: The panels in the first row display the spatial maps for the posterior mean estimates for $h_m$, $b$ and $h_f$; the panels in the second row display the 95% credible intervals for $h_m$, $b$ and $h_f$ in the analysis of the Ontario lung cancer data.

To examine whether any residual spatial structure has been left for each of the sub-models, we used Moran's I (Cliff and Ord, 1981; Cressie, 1993) to compute residual spatial autocorrelation. Let $\hat{y}_{im}$ and $\hat{y}_{if}$ denote the predicted disease counts for males and females. Moran's I statistic is defined as

$$I_m = \frac{e_m' W_m e_m}{e_m' e_m}; \qquad I_f = \frac{e_f' W_f e_f}{e_f' e_f},$$

where $e_m' = (e_{1m}, \cdots, e_{nm})$ with $e_{im} = (y_{im} - \hat{y}_{im})/\sqrt{\mathrm{var}(y_{im})}$; $e_f' = (e_{1f}, \cdots, e_{nf})$ with $e_{if} = (y_{if} - \hat{y}_{if})/\sqrt{\mathrm{var}(y_{if})}$; and $W_f = W_m$ is the $n \times n$ adjacency matrix for the regions with elements $w_{ij}$. The posterior means (95% credible intervals) of $I_m$ and $I_f$ are -0.053 (-0.305, 0.156) and -0.057 (-0.289, 0.153), respectively, suggesting that the residuals have no significant spatial correlation.
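As a small sketch of the statistic defined above, the following Python function evaluates the ratio $e'We/e'e$ for one set of fitted values. The function name and the toy inputs are hypothetical, and the variance is taken equal to the fitted mean, which is a Poisson assumption made here for illustration. To obtain a posterior distribution of the statistic, one would apply the function to each MCMC draw of the fitted values.

```python
import numpy as np

def morans_i(y, y_hat, var_y, W):
    """Moran's I in the form e'We / e'e, with standardized residuals e."""
    e = (y - y_hat) / np.sqrt(var_y)
    return float(e @ W @ e) / float(e @ e)

# Toy example (assumed): four regions on a line, neighbours share a border.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([12.0, 8.0, 11.0, 9.0])
y_hat = np.array([10.0, 10.0, 10.0, 10.0])

# Poisson assumption: variance equal to the fitted mean.
i_stat = morans_i(y, y_hat, var_y=y_hat, W=W)
```

The alternating signs of the toy residuals give a negative value, which is the pattern expected when neighbouring residuals tend to differ.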

In addition, we considered several competing models, including the separate counterpart models corresponding to each gender and models with simplified random effect structures. Table 3.2 lists pD and DIC values for the competing models considered. Overall, model M1, which includes the shared common spatial random effect as well as the random noise terms for each sub-model, is preferred according to the various measures of fit, though the difference between M1 and either of M2 or M3 is not substantial. The separate counterpart models yield poorer fits compared to the joint models. Examination of the posterior predictive p-value indicates a substantially poorer fit for M4.

Table 3.2: pD and DIC for competing models in the analysis of Ontario lung cancer incidence.

Type      Model  Sub-models                                                      pD     DIC
Shared    M1     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i + h_{im}$        56.5   725.9
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i + h_{if}$
          M2     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i$                 56.5   727.5
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i + h_{if}$
          M3     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i + h_{im}$        56.6   728.3
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i$
          M4     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_i$                 38.8   764.1
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + \gamma \cdot b_i$
Separate  M5     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im} + h_{im}$     68.3   738.2
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if} + h_{if}$
          M6     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im}$              68.8   740.1
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if} + h_{if}$
          M7     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im} + h_{im}$     69.4   741.6
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if}$
          M8     $\log(\mu_{im}) = \alpha_m + \log E_{im} + b_{im}$              70.1   743.9
                 $\log(\mu_{if}) = \alpha_f + \log E_{if} + b_{if}$

In the spatial analyses above, we decompose the variability associated with multiple outcome domains into a variety of sources. In the separate models, each outcome domain is decomposed into spatially structured and unstructured variability. With the spatial common factor model, the variability of all outcome domains is decomposed into one common spatially structured error term and, additionally, domain-specific unstructured variability. It is therefore of interest to evaluate what proportion of total variability in each outcome is explained by each of the terms in the different models. For the separate analyses of the male ($j = 1$) and female ($j = 2$) outcomes, at each iteration of the MCMC simulation we calculate the empirical variances $s_{bj}^2$ and $s_{hj}^2$ of the spatially structured and unstructured random effects, respectively. The proportion of variability explained by the structured random effects for the $j$th outcome domain is calculated as $s_{bj}^2/(s_{bj}^2 + s_{hj}^2)$. For the common spatial factor model, the empirical variances for the common and domain-specific components, $s_b^2$ and $s_{hj}^2$, respectively, are also calculated at each iteration of the MCMC simulation. The fraction of the variability explained by the common factor for the $j$th outcome domain is calculated as the ratio $s_b^2/(s_b^2 + s_{hj}^2)$. The results show that for the separate analysis, the spatially structured component accounts for 53% of the variability for males and 42% of the variability for females. In contrast, for the common spatial factor model, the shared component accounts for 94% of the variability for males and 88% of the variability for females; most of the variability was absorbed by the shared spatially structured factor, with little unstructured residual variation left.
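The per-iteration calculation of the variance proportions can be sketched as follows. This is an illustrative Python fragment, assuming the MCMC draws of the random effects are stored with one row per iteration and one column per region. The function name and the simulated stand-in draws are hypothetical.

```python
import numpy as np

def proportion_structured(b_draws, h_draws):
    """Share s_b^2 / (s_b^2 + s_h^2), computed separately at each MCMC iteration.

    b_draws, h_draws : arrays of shape (S, n), one row per iteration.
    """
    s2_b = np.var(b_draws, axis=1, ddof=1)  # empirical variance across regions
    s2_h = np.var(h_draws, axis=1, ddof=1)
    return s2_b / (s2_b + s2_h)

# Stand-in draws (assumed): 2000 iterations, 37 regions.
rng = np.random.default_rng(1)
b_draws = rng.normal(0.0, 0.25, size=(2000, 37))
h_draws = rng.normal(0.0, 0.06, size=(2000, 37))

share = proportion_structured(b_draws, h_draws)
posterior_mean_share = share.mean()
```

Summarizing `share` by its mean or quantiles gives a posterior summary of the proportion of variability attributed to the structured component.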

3.3.2 Comandra Blister Rust Tree Infection

In a study of Comandra blister rust infection of lodgepole pine trees in British Columbia, Canada, we model the joint outcomes of the counts of lesions on trees arising from infection and the counts of an alternate host plant (bastard toad flax) surrounding the trees, which promotes infection. The study is part of a larger investigation of infection on lodgepole pine trees conducted by the British Columbia Ministry of Forests and Range. The lodgepole pine trees are located in each cell of a 124×64 grid, with each grid cell being 1.5 m square. There are buffer trees on the plantation, and responses are collected for a random sample of 500 trees on $y_{iL}$, the number of lesions on the tree, marking infection areas, and $y_{iH}$, the number of alternate host plants within the 1.5 m square grid cell containing the $i$th tree. The alternate host plant serves as a host for the fungus causing the infection on the trees. Preliminary data analysis indicates that lesion counts are positively associated with the host plants, motivating the hypothesis that a shared frailty model might be useful for analyzing the common spatial pattern of these responses. Figure 3.3 displays the spatial structures of the lesion counts and counts of alternate host plants for the 500 sampled trees.

Figure 3.3: The left panel displays the lesion counts and the right panel displays the counts of alternate host plants over the experimental field in the Comandra blister rust analysis. (Axes: Easting, Northing.)

One of the most widely used approaches for modeling spatial correlation for point-referenced data is kriging, which commonly uses the exponential decay function to model spatial correlation between points $r$ and $s$: $\rho_{rs} = \exp(-\phi d_{rs})$, where $\phi > 0$ controls the rate of decay of correlation with distance, with larger values indicating more rapid decay, and $d_{rs}$ is the distance between the two spatial points. For shared frailty modeling, this method is computationally slow and cumbersome to implement due to inversion of the $n \times n$ covariance matrix at each MCMC iteration. Here, we use the conditional autoregressive model, as an approximation, for specifying the shared spatially correlated random effect. To obtain neighborhood definitions for this model, we note that the empirical semivariograms of counts of lesions and alternate host plants, displayed in Figure 3.4, suggest spatial correlations in both outcomes up to about 20 meters. Therefore, spatial correlation in the shared frailty term is accommodated by setting $W_{rs} = I\{d(r, s) \le 20\,\mathrm{m}\}$, where $d(r, s)$ denotes the distance between trees $r$ and $s$.

Figure 3.4: Empirical semivariogram for counts of lesions and alternate host plants. (Axes: distance, semivariance.)
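The two ways of encoding spatial dependence discussed above can be contrasted in a short Python sketch: a dense exponential correlation matrix, and a binary neighbourhood matrix obtained by thresholding distances at 20 m. The function names and toy coordinates are hypothetical. Setting the diagonal of $W$ to zero, so that a tree is not its own neighbour, is an assumption made here following the usual convention for conditional autoregressive models.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distances between all pairs of points; coords has shape (n, 2)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def exponential_correlation(coords, phi):
    """Kriging-style correlation rho_rs = exp(-phi * d_rs)."""
    return np.exp(-phi * pairwise_distances(coords))

def threshold_adjacency(coords, radius=20.0):
    """W_rs = 1 if d(r, s) <= radius and r != s, else 0."""
    W = (pairwise_distances(coords) <= radius).astype(float)
    np.fill_diagonal(W, 0.0)  # assumption: no self-neighbours
    return W

# Toy coordinates in metres (assumed for illustration).
coords = np.array([[0.0, 0.0], [15.0, 0.0], [30.0, 0.0]])
W = threshold_adjacency(coords)
D = np.diag(W.sum(axis=1))  # number of neighbours of each point
R = exponential_correlation(coords, phi=0.1)
```

The matrix `W` is sparse for a large field, which is what makes the conditional autoregressive specification cheaper than inverting the dense matrix `R` at every iteration.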

Both counts of lesions (65% zeros) and counts of alternate host plants (84% zeros) exhibit excess zeros, as displayed in Figure 3.5. In the current context, the extra-Poisson variation induced by the excess zeros cannot be accommodated by a simple Poisson-type distribution. Additionally, Vuong's non-nested hypothesis test (Vuong, 1989) for a comparison of the predicted probabilities of Poisson regression and ZIP regression indicates that the observed frequency of zeros far exceeds that expected under the Poisson assumption. We utilize a zero-inflated distribution instead.

Figure 3.5: Histograms of counts of lesions and alternate host plants. The subplot in the right panel displays the distribution for counts of disease host plants excluding zeros. (Panel annotations: 65.6% zeros for lesion counts; 84.8% zeros for counts of disease host plants.)
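To illustrate why a zero-inflated distribution accommodates more zeros than a Poisson distribution with the same count mean, the following Python sketch evaluates both probability mass functions. The parameter names `mu` and `p_zero` and their values are hypothetical. The mixture form used here, a point mass at zero combined with a Poisson count, is the standard zero-inflated Poisson construction and is assumed to match the model referred to in the text.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing the count k."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def zip_pmf(k, mu, p_zero):
    """Zero-inflated Poisson: extra mass p_zero at zero, Poisson(mu) otherwise."""
    base = (1.0 - p_zero) * poisson_pmf(k, mu)
    return p_zero + base if k == 0 else base

# Illustrative values (assumed).
mu, p_zero = 1.0, 0.5
prob_zero_poisson = poisson_pmf(0, mu)
prob_zero_zip = zip_pmf(0, mu, p_zero)
total_mass = sum(zip_pmf(k, mu, p_zero) for k in range(60))
```

With these values the zero-inflated model places substantially more probability on zero than the plain Poisson model, while still summing to one over the counts.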

In our application, our goal is to examine whether spatial similarity of the random processes exists across the spatial maps of the two responses through a latent random effect, and also to examine whether the zero mass components of the two distributions are correlated through a shared underlying latent process that is spatially varying.

The zero-inflated common spatial factor model (3.4), (3.5), (3.6) is implemented with $j = L$ for the response of lesion counts and $j = H$ for alternate host plants. We assign weakly informative $N(0, 10)$ priors to $\alpha_L$, $\alpha_H$, $\beta_L$, $\beta_H$, $\gamma$ and $\omega$. Note that we restrict the variance components to equal one to allow identification in this application. In other applications, careful examination of prior specifications that impose enough constraint to allow for identification, without restricting the variance components to equal one, may be necessary. Table 3.3 provides summaries of the posterior distributions of the parameters. The posterior mean (95% credible interval) for $\gamma$ is 5.972 (4.251, 8.454) and for $\omega$ is 6.419 (2.906, 10.491). Neither credible interval covers zero, so there is evidence of commonality of spatial structure for lesion counts and counts of host plants, both for the Poisson distributed component and for the excess zero component. We also considered other competing models, including the counterpart separate models, as listed in Table 3.4. The zero-inflated common spatial factor model offers a better fit with a lower DIC score. As well, the shared component models are superior to their counterpart separate models (see another example illustrating this in Table 3.4: contrast M2 and M4).

Table 3.3: Posterior summaries for the parameters of the zero-inflated common spatial factor model in the analysis of Comandra blister rust infection.

Parameter     Mean      95% CI
$\alpha_L$    0.066     (-0.102, 0.223)
$\alpha_H$    1.034     (0.456, 1.450)
$\beta_L$     -0.095    (-0.459, 0.213)
$\beta_H$     2.025     (1.465, 2.696)
$\gamma$      5.972     (4.251, 8.454)
$\omega$      6.419     (2.906, 10.491)

Figure 3.6 displays the posterior medians of the common spatial factor for the

Poisson distributed component, b, and the excess zero component, d. Strong spatial

structures are manifested in both shared spatial components, with higher values for b

in the south-west and south-east quadrants and almost the opposite pattern observed

for d. Our model exhibits the feature that regions with higher probability of observing

excess zeros for both components have lower probability of observing large counts.

More sophisticated models accounting for the correlation of the two components may

be useful.

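Table 3.4: pD and DIC for competing models in the analysis of Comandra blister rust infection.

Type      Model  Poisson distribution                               Excess Zero                                             pD      DIC
Shared    M1     $\log(\mu_{iL}) = \alpha_L + b_i$                  $\mathrm{logit}(\pi_{iL}) = \beta_L + d_i$              119.2   1784.9
                 $\log(\mu_{iH}) = \alpha_H + \gamma_b \cdot b_i$   $\mathrm{logit}(\pi_{iH}) = \beta_H + \gamma_d \cdot d_i$
          M2     $\log(\mu_{iL}) = \alpha_L + b_i$                  $\mathrm{logit}(\pi_{iL}) = \beta_L$                    58.0    1831.9
                 $\log(\mu_{iH}) = \alpha_H + \gamma_b \cdot b_i$   $\mathrm{logit}(\pi_{iH}) = \beta_H$
Separate  M3     $\log(\mu_{iL}) = \alpha_L + b_{iL}$               $\mathrm{logit}(\pi_{iL}) = \beta_L + d_{iL}$           32.8    1982.9
                 $\log(\mu_{iH}) = \alpha_H + b_{iH}$               $\mathrm{logit}(\pi_{iH}) = \beta_H + d_{iH}$
          M4     $\log(\mu_{iL}) = \alpha_L + b_{iL}$               $\mathrm{logit}(\pi_{iL}) = \beta_L$                    27.7    1989.3
                 $\log(\mu_{iH}) = \alpha_H + b_{iH}$               $\mathrm{logit}(\pi_{iH}) = \beta_H$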

Figure 3.6: The left panel displays the posterior median of the shared random effect b for the Poisson count component. The right panel displays the posterior median of the shared random effect d for the excess zero probability component. (Axes: East, North.)

3.4 Power of the Test for Common Spatial Structure

Using a shared frailty model such as (3.1) to evaluate and test for common spatial structure across outcomes seems an intuitively appealing and easily interpretable approach. Additionally, it would be useful to have a sense of whether such a framework provides a reasonably powerful test under scenarios such as that seen, for example, in the analysis of the lung cancer data. More broadly, routinely evaluating the power of testing procedures employed in an application is an important and helpful consideration. To examine the strength of the evidence in the data for a test of common spatial structure across outcomes, we conduct a simulation study which evaluates the power of the test

$$H_0 : \gamma = 0 \quad \text{versus} \quad H_1 : \gamma \neq 0 . \qquad (3.12)$$

In this case, $H_0$ corresponds to the case that there is no common spatial structure for the multiple spatial outcomes. We note that the power of a test of $H_0 : \gamma = 1$ versus $H_1 : \gamma \neq 1$ may also be evaluated in a similar manner. Such a hypothesis may be particularly applicable to the lung cancer analysis of males and females, and to other similar gender comparisons.

The Bayesian approach to hypothesis testing is based on the credible interval derived from the posterior distribution of the parameters. The power investigation utilizes the generation of bootstrap samples based on the posterior estimates from fitting the model to the data under consideration. The procedure is described in terms of the lung cancer analysis as follows. Let $\alpha_m$ and $\alpha_f$ denote the posterior mean estimates of the intercept terms for males ($j = m$) and females ($j = f$), respectively, from fitting the model M1 (3.11) to the Ontario lung cancer data; we use these as true values in the study, with $\gamma$ and the variance components varying around estimates obtained from that analysis as described below. Let $\sigma_b^2$, $\sigma_{hm}^2$ and $\sigma_{hf}^2$ denote the true values of the variance parameters for $b$, $h_m$ and $h_f$. Calculate the covariance matrix for $b$ as $\Sigma_b = \sigma_b^2 (D - W)^{-1}$, where $(D - W)$ is the neighborhood structure based on the Ontario lung cancer mortality data. Set $r = 1$.

(a) At the $r$th replicate, generate $b^{(r)} = (b_1^{(r)}, \cdots, b_n^{(r)})^T \sim MVN(0, \Sigma_b)$, $h_m^{(r)} = (h_{1m}^{(r)}, \cdots, h_{nm}^{(r)})^T \sim MVN(0, \sigma_{hm}^2 I)$ and $h_f^{(r)} = (h_{1f}^{(r)}, \cdots, h_{nf}^{(r)})^T \sim MVN(0, \sigma_{hf}^2 I)$.

(b) Calculate the relative risks for males and females:

$$\lambda_{im}^{(r)} = \exp(\alpha_m + b_i^{(r)} + h_{im}^{(r)}); \qquad \lambda_{if}^{(r)} = \exp(\alpha_f + \gamma \cdot b_i^{(r)} + h_{if}^{(r)}), \quad i = 1, \cdots, n.$$

Then, generate counts from

$$y_{im}^{(r)} \sim \text{Multinomial}\left(\sum_{i=1}^{n} E_{im},\ \frac{E_{im}\lambda_{im}^{(r)}}{\sum_{i=1}^{n} E_{im}\lambda_{im}^{(r)}}\right); \qquad y_{if}^{(r)} \sim \text{Multinomial}\left(\sum_{i=1}^{n} E_{if},\ \frac{E_{if}\lambda_{if}^{(r)}}{\sum_{i=1}^{n} E_{if}\lambda_{if}^{(r)}}\right).$$

(c) Fit the common spatial factor model using $y_{im}^{(r)}$ and $y_{if}^{(r)}$, $i = 1, \cdots, n$, to obtain the posterior estimate $\gamma^{(r)}$. Let $\gamma_L^{(r)}$ and $\gamma_U^{(r)}$ denote the 2.5% and 97.5% quantiles of the MCMC distribution of $\gamma^{(r)}$.

(d) Set $r$ to $r + 1$. Repeat (a)-(c) for $r = 2, \cdots, R$ replicates, for example $R = 1000$.

The power is then calculated as

$$\text{Power} = 1 - \sum_{r=1}^{R} I\left(\gamma_L^{(r)} < 0 < \gamma_U^{(r)}\right) / R ,$$

where $I(A)$ is the indicator function for the event $A$.
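The data-generation steps (a)-(b) and the final power calculation can be sketched in Python as below. This is an illustration under stated assumptions, not the code used for the study: the function names and the four-region toy map are hypothetical, and a generalized inverse of $D - W$ is used because that matrix is singular when $D$ holds the neighbour counts. Step (c), fitting the model by MCMC, is not reproduced; the interval bounds passed to `power_from_intervals` are made-up placeholders.

```python
import numpy as np

def sample_car(W, sigma2_b, rng, tol=1e-10):
    """Draw b ~ MVN(0, sigma2_b * (D - W)^-), via the non-null eigenvectors of D - W."""
    Q = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(Q)
    keep = vals > tol  # drop the null space (constant vector on a connected map)
    z = rng.standard_normal(int(keep.sum()))
    return np.sqrt(sigma2_b) * (vecs[:, keep] @ (z / np.sqrt(vals[keep])))

def simulate_counts(E_m, E_f, alpha_m, alpha_f, gamma, sigma2_b, sigma2_h, W, rng):
    """Steps (a)-(b): one replicate of male and female counts."""
    n = len(E_m)
    b = sample_car(W, sigma2_b, rng)
    h_m = rng.normal(0.0, np.sqrt(sigma2_h), n)
    h_f = rng.normal(0.0, np.sqrt(sigma2_h), n)
    rr_m = np.exp(alpha_m + b + h_m)          # relative risks, males
    rr_f = np.exp(alpha_f + gamma * b + h_f)  # relative risks, females
    p_m = E_m * rr_m / np.sum(E_m * rr_m)
    p_f = E_f * rr_f / np.sum(E_f * rr_f)
    y_m = rng.multinomial(int(round(E_m.sum())), p_m)
    y_f = rng.multinomial(int(round(E_f.sum())), p_f)
    return y_m, y_f

def power_from_intervals(lower, upper):
    """Power = 1 - share of replicates whose 95% interval for gamma covers zero."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return 1.0 - np.mean((lower < 0) & (0 < upper))

# Toy map (assumed): four regions on a line, with made-up expected counts.
rng = np.random.default_rng(2)
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
E = np.array([100.0, 150.0, 200.0, 50.0])
y_m, y_f = simulate_counts(E, E, 0.066, 0.086, 1.0, 0.059, 0.0038, W, rng)

# Placeholder interval bounds standing in for the output of step (c).
power = power_from_intervals([0.2, -0.1, 0.5, 0.3], [1.1, 0.9, 1.6, 1.2])
```

Conditioning on the total, as the multinomial step does, keeps the simulated counts on the same overall scale as the expected counts in every replicate.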

Intuitively, the greater the difference between the true value of $\gamma$ and zero, the greater the power should be. As well, the power should increase with the relative dominance of the common factor $b$, in explaining heterogeneity of outcomes, relative to the outcome-specific random noise $h$. Hence, we study the power of the test as $\gamma$ and $\sigma_b^2/\sigma_h^2$ vary in such a manner as to reflect the relative importance of the shared component, where, for simplicity, we have $\sigma_h^2 = \sigma_{hm}^2 = \sigma_{hf}^2$. We investigate scenarios with identical spatial structures as in the Ontario lung cancer incidence study, including the same estimate of $\sigma_h^2$ as well as the same expected counts, noting for completeness that power would increase as the expected counts increase. We consider values of $\gamma$ varying between 0 and 2 in increments of 0.1 and let $\sigma_b^2/\sigma_h^2$ take values of 0, 0.5, 1, 1.5, 2, 3, 4, $\cdots$, 8. Figure 3.7 displays the power as a function of $\gamma$ for three values of the ratio $\sigma_b^2/\sigma_h^2$, while Figure 3.8 displays the power as a function of $\sigma_b^2/\sigma_h^2$ for three values of $\gamma$. When $\sigma_b^2/\sigma_h^2$ is 10 or higher, the power attains the target value of one quickly, but the curve is far less steep for $\sigma_b^2/\sigma_h^2 = 5$. Figure 3.8 shows that the power can be quite low if $\sigma_b^2 = \sigma_h^2$ and $\gamma = 1$, for example. For high power in this spatial framework, we require a clear dominance of $\sigma_b^2$ when $\gamma = 1$. We note that this is achieved in the lung cancer analysis. When data are more sparse than for this analysis, i.e. for the cases when the expected counts are $E/50$ or $E/100$, the power is considerably lower. For example, when $\sigma_b^2/\sigma_h^2 = 10$, $\gamma = 1$, and the expected counts are scaled by 1/50, the power drops to about 0.2; contrast this with the value of about 1 when the expected counts and parameter settings are as seen in the lung cancer data.

Figure 3.7: Power curves for testing spatial homogeneity of the risk maps for males and females in the Ontario lung cancer analysis, over varying values of the link parameter $\gamma$ ranging from 0 to 2 in increments of 0.1, for three values of the ratio $\sigma_b^2/\sigma_h^2$ (20, 10 and 5).

Figure 3.8: Power curves for testing spatial homogeneity of the risk maps for males and females in the Ontario lung cancer analysis, over varying values of $\sigma_b^2/\sigma_h^2$ of 0, 0.5, 1, 1.5, 2, 3, $\cdots$, 8 at three values of $\gamma$ (2, 1 and 0.5).

3.5 Precision Gains Through Joint Outcome Modeling

The main emphasis in this chapter is a discussion of models and methods for the exploration of common spatial structure across outcomes through the use of common spatial factor models. As an aside, we also note that such models, if they are applicable, may also provide more precise estimates of small area risks. We evaluate here the gains in precision of risk estimates with particular emphasis on scenarios where the power of testing for common spatial structure is high, as in the setting of the lung cancer analysis.

In particular, we consider the spatial structure of the map as for the Ontario lung cancer analysis; $\sigma_b^2/\sigma_h^2$ is assigned to be 10, 50 or 100, with $\sigma_h^2 = 0.001$, 0.005 and 0.01; $\gamma = 1$; the expected disease counts are scaled by the inverse of $\kappa$, with $\kappa = 1$, 50 and 100; all other parameter values are set to the estimates from the analysis of the lung cancer incidence data. Note that Appendix A provides details on how the variance of the response depends on the magnitude of the variance components as well as on the expected counts. Data are generated for each combination of the parameter settings, and analyzed using both the common spatial factor model and separate analyses of each outcome in a parallel structure to the joint outcome analysis, including spatial and uncorrelated random effects.

The performance of the common spatial factor model and the separate models for the outcomes was assessed in terms of the average relative bias and average root mean squared error for the relative risk estimators across regions and outcome groups, along with the average standard deviation and exceedance probabilities for those areas with true relative risks above one, where averaging is taken over a large number, $R$, of simulation runs, $R = 1000$, for each of the settings of parameters. We define these evaluation criteria precisely below for estimators $\hat{\lambda}_{ij}^{(r)}$ denoting the posterior mean estimate of the true relative risk $\lambda_{ij}$ for local area $i$ and outcome $j$ from the $r$th simulated data set using the joint outcome analysis. Similar definitions would apply for estimators obtained from the separate analyses of the outcomes. We compute the average absolute relative bias (ABIAS), standard deviation (ASE) and root mean squared error (ARMSE) as

$$\text{ABIAS}(\hat{\lambda}) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n} \left| \frac{\sum_{r=1}^{R} \hat{\lambda}_{ij}^{(r)}/R - \lambda_{ij}}{\lambda_{ij}} \right| ,$$

$$\text{ASE}(\hat{\lambda}) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n} \left[ \sum_{r=1}^{R} \left( \hat{\lambda}_{ij}^{(r)} - \sum_{r=1}^{R} \hat{\lambda}_{ij}^{(r)}/R \right)^2 / R \right]^{1/2} ,$$

$$\text{ARMSE}(\hat{\lambda}) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n} \left[ \sum_{r=1}^{R} \left( \hat{\lambda}_{ij}^{(r)} - \lambda_{ij} \right)^2 / R \right]^{1/2} ,$$

respectively. Since identification of high risk areas is often of prime interest in disease mapping, we compute the average exceedance probability for those areas with true relative risks greater than unity:

$$\text{APREX}(\hat{\lambda}) = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n} \left[ \sum_{r=1}^{R} p\left( \lambda_{ij}^{(r)} > 1 \mid j : \lambda_{ij} > 1 \right) \right] .$$
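The first three criteria above can be computed directly from an array of estimates, as in the following Python sketch. The function name, the array layout (replicate, area, outcome) and the toy numbers are hypothetical and are chosen only so that the results can be checked by hand.

```python
import numpy as np

def simulation_summaries(est, truth):
    """ABIAS, ASE and ARMSE for relative risk estimates.

    est   : posterior mean estimates, shape (R, n, J), one slice per simulated data set
    truth : true relative risks, shape (n, J)
    """
    mean_est = est.mean(axis=0)  # average estimate over the R replicates
    abias = np.mean(np.abs((mean_est - truth) / truth))
    ase = np.mean(np.sqrt(np.mean((est - mean_est) ** 2, axis=0)))
    armse = np.mean(np.sqrt(np.mean((est - truth) ** 2, axis=0)))
    return abias, ase, armse

# Toy example (assumed): R = 2 replicates, n = 1 area, J = 2 outcomes.
truth = np.array([[1.0, 2.0]])
est = np.array([[[1.0, 2.0]],
                [[3.0, 2.0]]])
abias, ase, armse = simulation_summaries(est, truth)
```

In the toy example the first outcome is estimated as 1 and 3 against a true value of 1, and the second outcome is estimated exactly, so only the first cell contributes to each average.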

Table 3.5 contrasts the ABIAS, ASE, ARMSE and APREX for the joint and separate analyses for studies with $\sigma_h^2 = 0.001$ and $\gamma = 1$. The table shows that the gains through using the joint model are not substantial, even when the population distribution is sparse. However, as we increase $\sigma_h^2$ to 0.005 and 0.01 (see Tables 3.6 and 3.7), it is evident that the joint model results in smaller standard deviations, and therefore smaller root mean squared errors, particularly as the shared component becomes more dominant and the expected counts become sparser. For example, Table 3.7 shows that when $\sigma_b^2 = 1$ and the expected counts are scaled by 1/100, the ARMSE for the joint model is 10% smaller than that for the separate model. Plots of the ASE by region (not shown here) show that gains in precision are sharper for those regions with smaller expected counts, for all simulation settings. In terms of APREX, the joint model has only very slightly higher predictive ability to detect areas of high risk. The common factor model is potentially beneficial if there is reasonably strong spatial correlation for the multiple spatial outcomes, and the disease is relatively rare.

Table 3.5: The average absolute relative bias (ABIAS), average standard deviation (ASE) and average root mean squared error (ARMSE), along with average exceedance probability (APREX) for regions with true relative risks greater than one. The expected disease counts are scaled by the inverse of $\kappa$. The true value of $\sigma_h^2$ is 0.001.

                         $\sigma_b^2 = 0.01$     $\sigma_b^2 = 0.05$     $\sigma_b^2 = 0.1$
                         Joint     Separate      Joint     Separate      Joint     Separate
$\kappa = 1$     ABIAS   0.0520    0.0518        0.0590    0.0581        0.0674    0.0665
                 ASE     0.0340    0.0363        0.0388    0.0438        0.0397    0.0461
                 ARMSE   0.0675    0.0688        0.0769    0.0802        0.0861    0.0900
                 APREX   0.6071    0.6036        0.7088    0.6976        0.7480    0.7372
$\kappa = 50$    ABIAS   0.0604    0.0602        0.0938    0.0971        0.1138    0.1233
                 ASE     0.0833    0.0888        0.1029    0.1063        0.1224    0.1260
                 ARMSE   0.1116    0.1158        0.1517    0.1566        0.1819    0.1916
                 APREX   0.4930    0.4925        0.5296    0.5255        0.5540    0.5452
$\kappa = 100$   ABIAS   0.0621    0.0620        0.1028    0.1041        0.1308    0.1360
                 ASE     0.1012    0.1067        0.1167    0.1204        0.1338    0.1360
                 ARMSE   0.1274    0.1318        0.1691    0.1729        0.2039    0.2102
                 APREX   0.4769    0.4765        0.5005    0.4983        0.5198    0.5151

3.6 Summary and Concluding Remarks

In this chapter, we present a general modeling framework for joint modeling of count

data and zero-inflated count data. We examine important aspects of the approach

Page 76: MODELS AND METHODS FOR SPATIAL DATA: … · APPLICATIONS IN EPIDEMIOLOGICAL, ENVIRONMENTAL AND ... Data: Applications in Epidemiological, Environmental ... spatial models and methods

CHAPTER 3. COMMON SPATIAL FACTOR MODEL 67

Table 3.6: The average absolute relative bias (ABIAS), average standard deviation(ASE) and average root mean squared error (ARMSE), along with average exceedanceprobability (APREX) for regions with true relative risks greater than one. The ex-pected disease counts are scaled by the inverse of �. The true value of �2

ℎ is 0.005.

�2b = 0.05 �2

b = 0.25 �2b = 0.5

Joint Separate Joint Separate Joint Separate� = 1 ABIAS 0.0615 0.0615 0.0917 0.0911 0.1327 0.1319

ASE 0.0415 0.0447 0.0424 0.0478 0.0420 0.0479ARMSE 0.0832 0.0851 0.1158 0.1186 0.1683 0.1706APREX 0.7055 0.6972 0.7927 0.7824 0.8159 0.8112

� = 50 ABIAS 0.0943 0.0982 0.1388 0.1594 0.1522 0.1856ASE 0.1067 0.1094 0.1583 0.1716 0.1889 0.2139ARMSE 0.1599 0.1645 0.2399 0.2638 0.2908 0.3274APREX 0.5320 0.5275 0.6037 0.5894 0.6520 0.6285

� = 100 ABIAS 0.1037 0.1057 0.1692 0.1874 0.2013 0.2350ASE 0.1206 0.1236 0.1797 0.1849 0.2187 0.2394ARMSE 0.1778 0.1813 0.2791 0.2976 0.3445 0.3816APREX 0.5010 0.4994 0.5581 0.5474 0.6014 0.5833


Table 3.7: The average absolute relative bias (ABIAS), average standard deviation (ASE) and average root mean squared error (ARMSE), along with average exceedance probability (APREX) for regions with true relative risks greater than one. The expected disease counts are scaled by the inverse of the scaling factor given in the first column. The true value of $\sigma_h^2$ is 0.01.

                          $\sigma_b^2 = 0.1$    $\sigma_b^2 = 0.5$    $\sigma_b^2 = 1$
Scaling factor  Measure   Joint    Separate     Joint    Separate     Joint    Separate
1               ABIAS     0.0733   0.0731       0.1353   0.1348       0.2151   0.2145
                ASE       0.0439   0.0469       0.0438   0.0482       0.0428   0.0473
                ARMSE     0.0976   0.0991       0.1736   0.1754       0.3014   0.3028
                APREX     0.7672   0.7575       0.7897   0.7826       0.7793   0.7756
50              ABIAS     0.1127   0.1216       0.1549   0.1850       0.1902   0.2225
                ASE       0.1319   0.1365       0.1899   0.2150       0.2123   0.2517
                ARMSE     0.1967   0.2054       0.2964   0.3309       0.3987   0.4416
                APREX     0.5674   0.5596       0.6336   0.6142       0.6684   0.6438
100             ABIAS     0.1300   0.1364       0.1986   0.2301       0.2384   0.2883
                ASE       0.1431   0.1450       0.2214   0.2419       0.2602   0.2971
                ARMSE     0.2187   0.2246       0.3489   0.3846       0.4622   0.5128
                APREX     0.5283   0.5235       0.5901   0.5746       0.6150   0.5923


including the power of testing the common spatial structure and the gains in efficiency in estimating relative risks. In summary, the power of identifying common spatial structure increases as the factor loading parameter increases, and also as the shared spatially correlated random effect becomes more dominant. In addition, the common spatial factor model offers some improvement in the efficiency of the relative risk estimator when the common spatial term becomes more dominant and when the disease is relatively rare. Importantly, the shared frailty model can be used to identify common spatial structures across outcomes, where such structures exist.

More sophisticated spatiotemporal models are also possible. For example, let $y_{ijt}$ denote the count of disease for region $i$, outcome $j$ and time $t$, and let $E_{ijt}$ denote the expected disease count, $i = 1, \dots, n$, $j = 1, \dots, J$ and $t = 1, \dots, T$. A common spatio-temporal factor model may be expressed as

$$\log(\mu_{ijt}) = \alpha_{jt} + \log(E_{ijt}) + \gamma_j b_i + \delta_j g_t + \epsilon_{ijt}, \tag{3.13}$$

where $\alpha_{jt}$ is the overall mean rate for the $j$th component at time $t$; $\mu_{ijt}$ is the expected mean count of disease for region $i$, outcome $j$ and time $t$; the spatial random effect is $b = (b_1, \dots, b_n)^T \sim N(0, \Sigma_b)$, with $\Sigma_b = \sigma_b^2 (D - W)^{-1}$; and a simple AR(1) model may be used for the temporal random effect: $g_t \mid g_{t-1} \sim N(\rho g_{t-1}, \sigma_g^2)$, where $\sigma_g^2$ is the temporal dispersion parameter and $\rho$ is the temporal autocorrelation, with $|\rho| < 1$ (Waller et al., 1997). The interaction effect of space and time over multiple outcomes is accommodated through $\epsilon = (\epsilon_{111}, \dots, \epsilon_{nJT})^T \sim N(0, \sigma_\epsilon^2 I_\epsilon^{-1})$, where $\sigma_\epsilon^2$ measures the residual dispersion of outcomes over space and time. This model is intuitively appealing because it allows multiple outcomes to be linked through both the shared spatially correlated component and the shared temporally correlated component. Such a model may also be extended to incorporate zero-inflation. The major challenge for all these extensions is the determination of computationally feasible and efficient algorithms for inference.
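As a minimal sketch of the temporal component of (3.13), the following Python fragment simulates an AR(1) random effect and evaluates the log mean for a single region, outcome and time. The function names, the argument names `loading_b` and `loading_g` (standing for the coefficients multiplying $b_i$ and $g_t$), the parameter values and the starting value $g_0 = 0$ are all illustrative assumptions rather than part of the model specification.

```python
import math
import random


def simulate_ar1(rho, sigma_g, T, seed=0):
    """Draw g_1..g_T with g_t | g_{t-1} ~ N(rho * g_{t-1}, sigma_g^2).

    Starting from g_0 = 0 is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    g, prev = [], 0.0
    for _ in range(T):
        prev = rng.gauss(rho * prev, sigma_g)
        g.append(prev)
    return g


def log_mean(alpha_jt, E_ijt, loading_b, b_i, loading_g, g_t, eps_ijt):
    """Right-hand side of (3.13) for one (i, j, t) cell."""
    return (alpha_jt + math.log(E_ijt)
            + loading_b * b_i + loading_g * g_t + eps_ijt)


# Hypothetical values, chosen only to exercise the functions.
g = simulate_ar1(rho=0.5, sigma_g=0.1, T=5)
mu = math.exp(log_mean(0.0, 10.0, 1.0, 0.2, 1.0, g[0], 0.0))
```

With all random effects set to zero and $\alpha_{jt} = 0$, the mean reduces to the expected count $E_{ijt}$, which is a quick sanity check on the offset term.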


We also note that disease mapping techniques are often used to gain precision

over the highly variable standardized mortality ratio estimate of small area risks. In

situations where it is not clear what sort of spatial or neighborhood structure may

be appropriate, it may be that sufficient gains in precision of estimates of small-

area risks may be attained through joint outcome modeling using only independent

error terms, where small-area effects are clearly linked across a variety of outcomes

(e.g. gender, age-groups). In this case, the shared random effect would be assumed

uncorrelated. This may provide a robust and efficient alternative to assumptions of

spatial correlation across a map. Shared frailty models for clustering regions might

also be useful in this regard.


Chapter 4

Impact of Misspecifying Spatial Exposures

4.1 Introduction

Comandra blister rust (CBR) is a disease of hard pines that is caused by a fungus

growing in the inner bark. The fungus (Cronartium comandrae) has a complex life

cycle. Though it infects hard pines, it needs an alternate host (here termed AHP),

an unrelated plant, to spread from one pine to another. On hard pines, the fungus

causes growth reduction, stem deformity, and mortality. In addition, pines with stem

cankers produce significantly fewer cones and seeds than healthy trees.

Losses to Comandra blister rust on lodgepole pine are large in British Columbia,

Canada; mortality in young stands sometimes exceeds 85%. Understanding the effect

of spatial proximity of susceptible trees to the alternate host plant is likely critical

to understanding disease dynamics in such natural systems where the spatial pattern

of trees and alternate host plants are highly variable. However, few studies which

contrast and evaluate appropriate measures of host plant spatial proximity have been


performed. In forestry, the most common measure recorded and utilized is the distance between the tree and the nearest AHP (Jacobi et al., 1993). Though nearest distance is an important measure of proximity, one might also postulate that the density of the AHP in various neighborhoods of the tree may have an important effect. This chapter aims to understand the dynamics of the disease, especially the relationship with the host plant through which the disease spreads. In particular, we aim to elucidate the effect of misspecifying measures of proximity in spatial studies of exposure even as we utilize flexible models for investigating relationships. This type of investigation is possible in our study due to the availability of detailed data on AHP, providing a unique and rich framework for assessing the effect of misspecification of measures of proximity.

Here we contrast the use of (i) nearest distance to the alternate host plant with (ii) alternate host plant density in various neighborhoods of the tree, where both measures are incorporated in a flexible generalized additive modeling framework. The complex nature of the phenomenon under study, together with the desire to avoid introducing misspecification through the use of linear models, suggests the adoption of such a semi-parametric generalized additive model. Of primary focus is how misspecifying the measure by which the alternate host plant affects the risk of infection influences the estimate of a treatment effect. If the covariate is improperly modeled, bias in the estimated treatment effect may arise. Through simulation studies, we investigate the practical implications of this bias for scenarios related to those seen in the application. We complement this investigation with an analysis of Comandra blister rust infection which incorporates both nonlinear effects of the measures of spatial proximity of the host plants and spatial random effects which account for spatial correlation.

The rest of the chapter is organized as follows: in Section 4.2 we discuss the motivating data in detail. Section 4.3 describes the semi-parametric additive model and outlines inference for this model. Section 4.4 contains an analysis of the data, including model selection for a series of competing models. In Section 4.5, we investigate model misspecification with regard to the use of different measures of spatial proximity of alternate host plants. Concluding remarks and recommendations are provided in Section 4.6.

4.2 Comandra Blister Rust Study

The motivating study is part of a larger investigation, conducted by the British Columbia (BC) Ministry of Forests and Range, of the response of lodgepole pine trees to Comandra blister rust. Here, we consider responses of 7055 lodgepole pine trees, which were planted over a 124×64 grid in 2004 and subsequently examined for infection in 2006, with each grid cell being 1.5m × 1.5m. Apart from about 2000 buffer trees, the trees belong to 110 genetically different seedlots, with about 50 trees from each seedlot randomly allocated across the plot. There is evidence of genetic variation in resistance of lodgepole pine to Comandra blister rust; though not considered in detail in this chapter, identifying such genetic resistance was a prime focus of the investigation. Data are recorded at the grid cell level.

Comandra blister rust spreads by airborne spores from trees to alternate host plants, and alternate host plants pass the spores to other trees. It takes about two years for the infection to appear on a tree. Rust susceptibility was assessed as a presence/absence trait in 2006, as shown in Figure 4.1. The map reveals some areas (such as the south-west quadrant) with a high density of infected trees. Figure B.1 in Appendix B provides summary statistics related to infection counts and the counts of the alternate host plant over various subsections of the plot. Preliminary studies show that Comandra rust infections are severe on sites where the alternate host is abundant.


[Figure 4.1 appears here: map of tree locations, with axes East (horizontal) and North (vertical).]

Figure 4.1: Locations of the trees; a triangle indicates infection with Comandra blister rust.


4.3 Flexible Smooth Models

To use as flexible a model as possible to describe the effect of a covariate, we consider a semiparametric generalized additive model incorporating flexible non-linear functions of exposure measures as well as the spatial surface. Let $y_i$ denote the observed response indicating the infection status of tree $i$ at spatial easting and northing site coordinates $s_i = (e_i, n_i)$, such that $y_i = 1$ if tree $i$ is infected with CBR and $y_i = 0$ otherwise, $i = 1, \dots, N$. We have $y_i \sim \mathrm{Bernoulli}(\mu_i)$, where $\mu_i = p(y_i = 1)$ represents the probability of infection.

We now consider how disease risk may be modeled as a function of spatial location in relation to the alternate host plant. We first note that the ideal exposure measure is the cumulative effect due to the exposure received at each site in the study region. As well, the effect of exposure at each grid cell may well differ due to topographical variables (e.g. elevation, slope, aspect) and meteorology (e.g. wind direction, temperature, humidity). A very general form for the effect of the covariate over the whole region can be expressed as

$$\int_s f_{i,s}\{\xi(s)\}\, ds,$$

where $\xi(s)$ is the value of the exposure at location $s$, and $f_{i,s}$ is a non-parametric or parametric function determining the effect of exposure $\xi(s)$ at site $i$. The function $f_{i,s}$ may be identical whenever $\|s_i - s\| = \tau$, so that the functional form governing the effect of the exposure is the same at concentric circles radiating from any tree $i$; in this case $f_{i,s}$ may be expressed as $f_\tau$. Alternatively, $f_{i,s}$ may be sharply increasing with $\xi(s)$ for sites $s$ to the right of any tree $i$, and zero for sites directly to the left; this might be the case if the outcome depends on wind direction. We approximate such a general expression here by defining $f_{i,s}$ to be the same over wide sets of locations $s$, termed here neighborhoods, and by defining the effect as zero for locations reasonably distant from site $i$, as described in detail below.


[Figure 4.2 appears here: grid diagram with the neighborhoods $A_{i1}$, $A_{i2}$ and $A_{i3}$ marked.]

Figure 4.2: Illustration of the definition of the first ($A_{i1}$), second ($A_{i2}$) and third ($A_{i3}$) order neighborhoods of tree $i$. Note that neighborhoods are non-overlapping.

Define the first order neighborhood ($A_{i1}$) of the $i$th tree as the grid cell in which the tree is located, and the $h$th order neighborhood, $A_{ih}$, as the $(2h-1) \times (2h-1)$ grid cells centered at the $i$th tree, excluding $A_{ih'}$, $h' < h$, for $h = 2, \dots, H$. Figure 4.2 displays $A_{i1}$, $A_{i2}$ and $A_{i3}$.

The usual measure of proximity of AHP to the susceptible tree is the distance from the tree to the nearest AHP. Here, this is measured as $L_i = \min\{h : z_{ih} > 0\}$, where $z_{ih}$ is the total alternate host plant count in $A_{ih}$. Hence, $L_i$ reflects how large $h$ has to be in order that the union of neighborhoods $\cup_{h' \le h} A_{ih'}$ contains at least one alternate host plant.
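The neighborhood counts and the nearest-distance measure can be sketched in a few lines of Python. This is an illustration under stated assumptions: the grid is stored as a list of rows of AHP counts, cells falling outside the grid are simply ignored, and the function names and toy data are hypothetical.

```python
def ring_count(ahp, row, col, h):
    """Total AHP count z_ih in the h-th order neighborhood of cell (row, col).

    The h-th order neighborhood is the (2h-1) x (2h-1) block centred on the
    cell minus all lower-order neighborhoods, i.e. the cells at Chebyshev
    distance exactly h - 1. Cells outside the grid are ignored, which is an
    assumption of this sketch.
    """
    d = h - 1
    total = 0
    for r in range(row - d, row + d + 1):
        for c in range(col - d, col + d + 1):
            on_ring = max(abs(r - row), abs(c - col)) == d
            inside = 0 <= r < len(ahp) and 0 <= c < len(ahp[0])
            if on_ring and inside:
                total += ahp[r][c]
    return total


def nearest_order(ahp, row, col, H):
    """L_i: smallest order h <= H with z_ih > 0, or None if there is none."""
    for h in range(1, H + 1):
        if ring_count(ahp, row, col, h) > 0:
            return h
    return None


# Toy 5 x 5 grid of AHP counts (hypothetical data).
ahp = [
    [0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0],
]
L_center = nearest_order(ahp, 2, 2, H=3)
```

For the tree in the centre cell, the first order neighborhood is empty and the second contains one AHP, so the nearest-distance measure is 2.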

Few attempts have been made in the forestry and environmental as well as medical

literature to assess measures for characterizing spatial proximity in disease exposure


studies beyond nearest distance. Here, we consider the natural alternative measure of the density of the alternate host plant; we also allow density effects to differ across neighborhoods. We approximate the density effects in neighborhoods of different orders by the same form, with density effects being zero beyond the third order neighborhood. To be more specific, define the density of AHP in the $h$th order neighborhood of tree $i$ as $D_i(h) = z_{ih}/|A_{ih}|$, where $|A_{ih}|$ represents the number of cells in the $h$th order neighborhood centered at tree $i$. The actual density in AHP counts per m$^2$ is therefore $D_i(h)/2.25$. Figure 4.3 displays smoothed plots of $L_i$, $D_i(1)$, $D_i(2)$ and $D_i(3)$ over the study area, which indicate that trees in the north-east corner of the plot are far away from the alternate host plants, which are abundant in the south-west and south-east quadrants.
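The density measure and its conversion to counts per square metre amount to simple arithmetic, sketched below. The sketch assumes a complete (interior) neighborhood, so that the number of cells depends only on the order $h$; the function names are hypothetical.

```python
CELL_AREA_M2 = 1.5 * 1.5  # each grid cell is 1.5 m x 1.5 m, i.e. 2.25 m^2


def ring_size(h):
    """Number of cells |A_ih| in a complete h-th order neighborhood."""
    if h == 1:
        return 1
    return (2 * h - 1) ** 2 - (2 * h - 3) ** 2  # equals 8 * (h - 1)


def density(z_ih, h):
    """D_i(h) = z_ih / |A_ih|, in AHP per grid cell."""
    return z_ih / ring_size(h)


def density_per_m2(z_ih, h):
    """Convert D_i(h) to AHP per square metre by dividing by 2.25."""
    return density(z_ih, h) / CELL_AREA_M2
```

For example, 8 AHP in a complete second order neighborhood give a density of 1 AHP per cell, or about 0.44 AHP per m$^2$.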

Previous studies show evidence of genetic variation in resistance of lodgepole pine to CBR, and we incorporate this variable in our model through a simple linear term on the logit scale. Let $Z_i = 1$ if the genetic family of the tree is hypothesized to be resistant and $Z_i = 0$ otherwise. The model is then expressed as

$$\log\{\mu_i/(1-\mu_i)\} = \alpha + \beta_Z Z_i + f_L(L_i) + \int_{h=1}^{\infty} f_{D(h)}\{D_i(h)\}\, dh + f_s(e_i, n_i),$$

where $\alpha$ is the overall mean risk on the logit scale and $\beta_Z$ represents the genetic effect; more generally, $\beta_Z$ may represent a treatment effect of interest in other environmental applications. The functions $f_L$ and $f_{D(h)}$ are univariate cubic penalized splines for nearest distance and densities, respectively. The term $f_s(e_i, n_i)$ is a two-dimensional thin-plate regression spline accounting for spatial autocorrelation. This structure might be determined by residual factors beyond AHP which are unobserved.

To model AHP distance and density effects flexibly, we use cubic regression splines; such polynomial spline approximations of a smooth function are a common choice in geospatial modeling. For each covariate $x$, the B-spline formulation of $f$ can be expressed as a


[Figure 4.3 appears here: four map panels (L, D(1), D(2) and D(3)), each with axes East (horizontal) and North (vertical).]

Figure 4.3: Plots of the distributions of the covariates L, D(1), D(2) and D(3) over the study area. Darker shades of red indicate higher values and lighter shades of yellow indicate lower values.


linear combination of $k + l$ basis functions:

$$f(x_i) = \sum_{m=1}^{k+l} \beta_m B_m(x_i) = B^T \beta. \tag{4.1}$$

This approach assumes that the true non-linear forms by which these terms affect the logit of the probability of infection can be approximated by a polynomial spline of degree $l$ with $k$ inner knots. The columns of the design matrix are given by the B-spline basis functions evaluated at the observations $x_i$, that is, $B = (B_1(x_i), \dots, B_{k+l}(x_i))^T$, while $\beta = (\beta_1, \dots, \beta_{k+l})^T$ represents the corresponding spline coefficients.
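To make the basis expansion in (4.1) concrete, here is a minimal sketch that evaluates B-spline basis functions with the Cox-de Boor recursion and forms the linear combination. The knot vector, the evaluation point and the function names are illustrative assumptions. Note that the unconstrained basis built here has $k + l + 1$ functions (seven in the example); we assume the count $k + l$ quoted in the text reflects a convention in which one function is removed, for instance for identifiability against the intercept.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x.

    `knots` is the full non-decreasing knot vector, including repeated
    boundary knots. Half-open knot intervals are used, so x must lie in
    [knots[degree], knots[-degree - 1]).
    """
    # Degree-0 functions are indicators of the knot intervals.
    basis = [1.0 if knots[m] <= x < knots[m + 1] else 0.0
             for m in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new = []
        for m in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[m + d] > knots[m]:
                left = (x - knots[m]) / (knots[m + d] - knots[m]) * basis[m]
            if knots[m + d + 1] > knots[m + 1]:
                right = ((knots[m + d + 1] - x)
                         / (knots[m + d + 1] - knots[m + 1]) * basis[m + 1])
            new.append(left + right)
        basis = new
    return basis


def spline_value(x, knots, degree, coef):
    """f(x) = sum_m coef[m] * B_m(x), the linear combination in (4.1)."""
    return sum(c * b for c, b in zip(coef, bspline_basis(x, knots, degree)))


# Cubic spline (degree 3) on [0, 1] with inner knots 0.25, 0.5, 0.75.
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
values = bspline_basis(0.4, knots, 3)
```

The basis functions are non-negative and sum to one at any point in the interior of the range, so setting all coefficients to one reproduces the constant function.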

For the spatial effect $f_s$, we employ a thin-plate regression spline (Green and Silverman, 1994; Wood, 2004), which is a higher-dimensional extension of smoothing splines with the useful property of not requiring a specification of knot locations; as well, thin-plate splines are reasonably computationally efficient and are isotropic. Thin-plate regression splines can be seen as a Gaussian process with generalized covariance (Cressie, 1993), characterized in terms of distance $\tau$. The form in two dimensions is $C(\tau) \propto \tau^{2m-2}\log(\tau)$, where $m$ is the order of the spline (commonly two). Paciorek (2007) provides a useful comparison of a variety of models for spatial surfaces. We adopt a thin-plate regression spline primarily for its computational efficiency with large datasets. Let $e = (e_1, \dots, e_N)'$ and $n = (n_1, \dots, n_N)'$. A thin-plate spline over two-dimensional space indexed by $s = (e, n)$ takes the form

$$f(s) = \sum_i \delta_i\, \eta(\|s - s_i\|) + (a_1 + a_2 e + a_3 n), \tag{4.2}$$

where $\delta_i$, $i = 1, \dots, n$, and $a_1$, $a_2$ and $a_3$ are constants, $\|\cdot\|$ denotes the Euclidean norm, and $\eta(r) = \frac{1}{8\pi} r^2 \log(r^2)$ for $r > 0$ and 0 otherwise, subject to the identifiability constraints $\sum_i \delta_i = \sum_i \delta_i e_i = \sum_i \delta_i n_i = 0$.
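The thin-plate form in (4.2) can be evaluated directly once the radial coefficients and the linear part are given. The sketch below is illustrative only: the site coordinates and coefficient values are hypothetical, and the function names are ours.

```python
import math


def eta(r):
    """Radial basis eta(r) = r^2 log(r^2) / (8 pi) for r > 0, else 0."""
    if r <= 0.0:
        return 0.0
    return r * r * math.log(r * r) / (8.0 * math.pi)


def thin_plate(s, sites, delta, a):
    """Evaluate f(s) in (4.2) at s = (e, n).

    sites: list of (e_i, n_i); delta: radial coefficients; a: (a1, a2, a3).
    """
    e, n = s
    radial = sum(d * eta(math.hypot(e - ei, n - ni))
                 for d, (ei, ni) in zip(delta, sites))
    return radial + a[0] + a[1] * e + a[2] * n


def satisfies_constraints(sites, delta, tol=1e-9):
    """Check sum(delta) = sum(delta * e) = sum(delta * n) = 0."""
    s0 = sum(delta)
    s1 = sum(d * e for d, (e, _) in zip(delta, sites))
    s2 = sum(d * n for d, (_, n) in zip(delta, sites))
    return all(abs(v) < tol for v in (s0, s1, s2))


# Hypothetical sites on the corners of a unit square.
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
delta = [1.0, -1.0, -1.0, 1.0]
ok = satisfies_constraints(sites, delta)
```

With all radial coefficients set to zero the surface reduces to the plane $a_1 + a_2 e + a_3 n$, which is the part of the spline left unpenalized.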

To estimate the parameters $\theta^T = (\beta_L^T, \beta_{D(1)}^T, \beta_{D(2)}^T, \beta_{D(3)}^T, \beta_Z, \delta_1, \dots, \delta_n, a_1, a_2, a_3)$,


we maximize the penalized log-likelihood

$$l_p(\theta) = l(\theta) - \frac{1}{2}\sum_j \lambda_j J^{[1]}(f_j) - \frac{1}{2}\lambda_s J^{[2]}(f_s), \tag{4.3}$$

conditional on smoothing parameters $\lambda_j$, operating on the penalty terms $J^{[1]}$ and $J^{[2]}$ as discussed below. Here $l(\theta)$ is the log-likelihood associated with the Bernoulli response:

$$l(\theta) = \sum_{i=1}^{n}\left[y_i \log\{p(y_i = 1 \mid \theta)\} + (1 - y_i)\log\{1 - p(y_i = 1 \mid \theta)\}\right]. \tag{4.4}$$
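A direct transcription of (4.3) and (4.4) into Python may help fix ideas. In this sketch the fitted probabilities and the roughness penalties are passed in as precomputed numbers; the function and argument names are hypothetical.

```python
import math


def bernoulli_loglik(y, p):
    """l = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ], as in (4.4)."""
    return sum(y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
               for y_i, p_i in zip(y, p))


def penalized_loglik(y, p, lambdas, penalties, lambda_s, penalty_s):
    """l_p = l - 0.5 * sum_j lambda_j J1(f_j) - 0.5 * lambda_s J2(f_s).

    `penalties` holds the values J1(f_j) and `penalty_s` the value J2(f_s);
    computing them from the spline coefficients is outside this sketch.
    """
    rough = sum(lam * pen for lam, pen in zip(lambdas, penalties))
    return bernoulli_loglik(y, p) - 0.5 * rough - 0.5 * lambda_s * penalty_s
```

Larger smoothing parameters subtract more from the log-likelihood for a given roughness, which is how the penalty trades fit against smoothness.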

Given values of $\lambda_j$, $l_p$ is maximized to estimate $\theta$. The smoothing parameters $\lambda_j$ control the tradeoff between goodness of fit of the model and model smoothness, with the second and third terms of equation (4.3) penalizing sharp changes of the splines. For the one-dimensional spline of (4.1) and the two-dimensional thin-plate spline (Green and Silverman, 1994) of (4.2), the penalty terms in equation (4.3) are, respectively,

$$J^{[1]}(f_j) = \int_{\Re} \left\{\frac{d^2 f_j(t)}{dt^2}\right\}^2 dt, \quad j = 1, \dots, p, \tag{4.5}$$

$$J^{[2]}(f_s) = \iint_{\Re^2} \left\{ \left(\frac{\partial^2 f_s}{\partial s_x^2}\right)^2 + 2\left(\frac{\partial^2 f_s}{\partial s_x \partial s_y}\right)^2 + \left(\frac{\partial^2 f_s}{\partial s_y^2}\right)^2 \right\} ds_x\, ds_y. \tag{4.6}$$

The penalized likelihood may be maximized by a penalized iteratively reweighted least squares (P-IRLS) algorithm. Given the smoothing parameters, at the $k$th P-IRLS iteration the following penalized sum of squares is minimized with respect to $g = X\theta$ to find the $(k+1)$th estimate $g^{[k+1]} = X\theta^{[k+1]}$:

$$\sum_{i=1}^{n}\left\{w_i^{[k]}(z_i^{[k]} - g_i)\right\}^2 + \sum_{j=1}^{p}\lambda_j J^{[1]}(f_j) + \lambda_s J^{[2]}(f_s), \tag{4.7}$$

where $z^{[k]}$ is a vector of pseudodata with elements $z_i^{[k]} = g_i^{[k]} + g'(\mu_i^{[k]})(y_i - \mu_i^{[k]})$, with $g'$ denoting the derivative of the link function; $W^{[k]}$ is a diagonal matrix with elements $w_i^{[k]} = 1/\sqrt{V(\mu_i^{[k]})\, g'(\mu_i^{[k]})^2}$; and $V(\mu_i^{[k]})$ is proportional to the variance of $Y_i$ according to the current estimate $\mu_i^{[k]}$. This mimics penalized


quasi-likelihood in generalized linear mixed models (Breslow and Clayton, 1993), in main part because of the connections with linear representations as in (4.1) and (4.2), and yields (at least conceptually) a straightforward implementation.
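The P-IRLS iteration can be sketched for a logit link with a fixed quadratic penalty. This is a simplified illustration, not the mgcv implementation: the penalty is supplied as a matrix `S` so that the roughness terms equal `beta' S beta`, the smoothing parameters are held fixed, and the data and names are hypothetical. For the logit link, $g'(\mu) = 1/\{\mu(1-\mu)\}$ and $V(\mu) = \mu(1-\mu)$, so the squared working weight is $\mu(1-\mu)$.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def pirls_logistic(X, y, S, n_iter=25):
    """Penalized IRLS for a logit link with fixed penalty matrix S.

    Each step solves (X'WX + S) beta = X'Wz, where W holds the squared
    working weights mu (1 - mu) and z is the pseudodata.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        lin = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-v)) for v in lin]
        w = [m * (1.0 - m) for m in mu]
        z = [v + (yi - m) / wi for v, yi, m, wi in zip(lin, y, mu, w)]
        A = [[sum(wi * row[a] * row[b] for wi, row in zip(w, X)) + S[a][b]
              for b in range(p)] for a in range(p)]
        rhs = [sum(wi * row[a] * zi for wi, row, zi in zip(w, X, z))
               for a in range(p)]
        beta = solve(A, rhs)
    return beta


# Hypothetical data: intercept plus one covariate, binary response.
X = [[1.0, x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
y = [0, 0, 1, 0, 1, 1]
beta_mle = pirls_logistic(X, y, [[0.0, 0.0], [0.0, 0.0]])
beta_pen = pirls_logistic(X, y, [[0.0, 0.0], [0.0, 5.0]])
```

With a zero penalty the iteration reduces to ordinary IRLS for logistic regression; penalizing the slope shrinks it towards zero.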

The smoothing parameters $\lambda_j$ are estimated by minimizing the generalized cross validation score (Craven and Wahba, 1979; Wahba, 1985) for each working penalized linear model of the P-IRLS iteration; the score has the following form:

$$\mathrm{GCV} = \frac{n\,\|\sqrt{W}(z - X\theta)\|^2}{[n - \mathrm{trace}(A)]^2}, \tag{4.8}$$

where $A = X(X^T W X + S)^{-1} X^T W$ is the influence matrix (Hastie and Tibshirani, 1990), while $S$ measures the roughness of the smooth functions. In our context, GCV is used to estimate the smoothing parameters as well as for variable selection, and may be directly implemented in the penalized iteratively re-weighted least squares algorithm (Wood, 2004). In Section 4.4, we discuss its use for model selection. We note that model fitting may be conveniently implemented in R (R Development Core Team, 2010) using the mgcv package (Wood, 2006). In the remainder of this section, we discuss methods for model assessment.
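The GCV score in (4.8) can be illustrated in the simplest possible setting, a working model with a single coefficient and a scalar penalty, where the trace of the influence matrix has a closed form. This reduction, the toy data and the function name are assumptions made for illustration; a real fit would involve the full design and penalty matrices.

```python
def gcv_one_coef(x, z, w, s):
    """GCV score (4.8) for a working model with one coefficient.

    x: design column, z: pseudodata, w: diagonal of W, s: scalar penalty S.
    With a single column, A = x (x'Wx + s)^{-1} x'W, so that
    trace(A) = x'Wx / (x'Wx + s).
    """
    n = len(x)
    xwx = sum(wi * xi * xi for wi, xi in zip(w, x))
    xwz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    coef = xwz / (xwx + s)
    rss = sum(wi * (zi - xi * coef) ** 2 for wi, xi, zi in zip(w, x, z))
    edf = xwx / (xwx + s)
    return n * rss / (n - edf) ** 2


# Hypothetical working data; the penalty is chosen from a small grid.
x = [1.0, 2.0, 3.0]
z = [2.0, 4.0, 6.0]
w = [1.0, 1.0, 1.0]
grid = [0.0, 1.0, 14.0]
best_s = min(grid, key=lambda s: gcv_one_coef(x, z, w, s))
```

Because the toy pseudodata lie exactly on a line through the origin, the unpenalized fit has zero residual and is selected; with noisy data a positive penalty can achieve the smaller score.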

Predictive Accuracy: Finding the presence of the alternate host plants within the landscape may increase the likelihood of finding disease infected trees and may serve as a key component in determining the severity of CBR hazard in a given area. Hence, the models are evaluated on their abilities to reproduce the observed presence/absence of tree infection using the covariate information at each site. The estimated probability of disease presence, $\hat\mu_i$, may be obtained based on the fitted values for any postulated model. We may then compare model performance using receiver operating characteristic (ROC) curves, which display the true-positive rate (TPR) versus the false-positive rate (FPR) for different thresholds of prediction probabilities used


to create a disease status classification. Here, we define

$$\mathrm{TPR}(c) = \frac{\sum_{i=1}^{n} I(\hat\mu_i > c \mid y_i = 1)}{\sum_{i=1}^{n} I(y_i = 1)}; \qquad \mathrm{FPR}(c) = \frac{\sum_{i=1}^{n} I(\hat\mu_i > c \mid y_i = 0)}{\sum_{i=1}^{n} I(y_i = 0)}, \tag{4.9}$$

where $0 < c < 1$ and $I$ is an indicator function. Thus, TPR is equal to the number of truly infected trees for which the estimated probability of disease presence is greater than $c$, divided by the total number of truly infected trees. FPR is interpreted similarly for trees which are free of infection. The ROC curve plots TPR versus FPR for a grid of selected values of $c$. The greater the area under the ROC curve (AUC), the better the method discriminates between true presence and absence. An area of 1 indicates perfect discrimination, where for some specific $c$, trees that are truly diseased ($\hat\mu > c$) or not diseased ($\hat\mu < c$) are so identified.
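The rates in (4.9) and the area under the ROC curve can be computed with a short Python sketch. Here `p_hat` stands for the fitted infection probabilities; the choice of thresholds at the distinct fitted values and the trapezoidal rule are our assumptions, and the function names are hypothetical.

```python
def tpr_fpr(y, p_hat, c):
    """TPR(c) and FPR(c) as in (4.9)."""
    pos = sum(1 for yi in y if yi == 1)
    neg = sum(1 for yi in y if yi == 0)
    tp = sum(1 for yi, pi in zip(y, p_hat) if yi == 1 and pi > c)
    fp = sum(1 for yi, pi in zip(y, p_hat) if yi == 0 and pi > c)
    return tp / pos, fp / neg


def auc(y, p_hat):
    """Area under the ROC curve by the trapezoidal rule.

    Thresholds are taken at the distinct fitted probabilities, plus one
    below the minimum so that the curve runs from (0, 0) to (1, 1).
    """
    cuts = sorted(set(p_hat), reverse=True) + [-1.0]
    pts = [(0.0, 0.0)]
    for c in cuts:
        tpr, fpr = tpr_fpr(y, p_hat, c)
        pts.append((fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

When every infected tree has a higher fitted probability than every uninfected tree, some threshold separates the two groups exactly and the area equals 1.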

Generalized Akaike’s Information Score: A natural way to evaluate models is to compare generalized Akaike’s information criterion (AIC) scores (Wood, 2008), expressed as $\mathrm{AIC} = D(\hat\theta) + 2\,\mathrm{trace}(A)\,\phi$, where $D(\hat\theta)$, the model deviance, is defined as $2\phi(l_{sat} - l)$, $l$ is the log-likelihood of the fitted model and $l_{sat}$ is the maximum value of the (saturated) log-likelihood of the model with one parameter per datum. The dispersion parameter $\phi$ can be estimated by the Pearson estimator (Wood, 2006).
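The generalized AIC is simple to compute once the deviance and the effective degrees of freedom, trace(A), are available. The sketch below uses the fact that for binary responses the saturated log-likelihood is zero; the function names and the default dispersion of 1 are assumptions made for illustration.

```python
import math


def bernoulli_deviance(y, p_hat, phi=1.0):
    """D = 2 phi (l_sat - l), with l_sat = 0 for binary responses."""
    l = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
            for yi, pi in zip(y, p_hat))
    return 2.0 * phi * (0.0 - l)


def generalized_aic(deviance, edf, phi=1.0):
    """AIC = D + 2 * trace(A) * phi, where edf = trace(A)."""
    return deviance + 2.0 * edf * phi
```

A smoother fit has smaller effective degrees of freedom, so it is favoured by this score unless the rougher fit reduces the deviance by more than twice the difference in degrees of freedom (for a dispersion of 1).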

Generalized Cross Validation Score: The theoretical justification for the use of the generalized cross validation (GCV) score for model assessment is that the GCV score is asymptotically a predictive mean square error criterion. This means that for large $n$, the value of $\lambda$ that minimizes the GCV score will yield a spline estimate that minimizes the mean square error between the estimate and the true, unknown model function. Here, GCV is used as a criterion to assess the predictive accuracy of the models.


4.4 Comparison of Exposure Measures for CBR Infection

To evaluate the impact of the use of different exposure measures, we consider the following candidate models, which include either or both of density and distance, or neither of these terms:

M1: $g_i = \alpha + \beta_Z Z_i + f_L(L_i) + \sum_{h=1}^{H} f_{D(h)}\{D_i(h)\} + f_s(e_i, n_i)$,

M2: $g_i = \alpha + \beta_Z Z_i + \sum_{h=1}^{H} f_{D(h)}\{D_i(h)\} + f_s(e_i, n_i)$,

M3: $g_i = \alpha + \beta_Z Z_i + f_L(L_i) + f_s(e_i, n_i)$,

M4: $g_i = \alpha + \beta_Z Z_i + f_s(e_i, n_i)$.

Figure 4.4 displays the fitted partial spatial effect for the four models, with the fit from M4 having the highest spatial roughness while that from M1 seems the smoothest. The south-west and south-east quadrants have higher values of the underlying spatial effect, while this is much lower in the north-east and north-west. Figure B.2 in Appendix B displays the semivariogram estimates (Ribeiro Jr and Diggle, 2001) obtained from raw values of infection status and the residuals from the fit of M1, along with Monte Carlo envelopes obtained by repeated random permutation of the data values on the spatial locations. Positive spatial dependence is evident in the raw data, exhibiting a correlation length of approximately 20m, while the use of M1 eliminates such spatial correlation. Figure 4.5 displays the covariate effects, showing that the log odds of infection decrease as nearest distance increases, whereas the log odds of infection increase as densities increase. The effect of nearest distance seems to be insignificant when distance is beyond about 5 (on the order of 3.75m), while the effect of density in the first order neighborhood achieves a maximum value when


Table 4.1: AIC, GCV and AUC (area under ROC curve) scores and the estimates of the genetic effect for four competing models.

Model   AIC       GCV     AUC     $\hat\beta_Z$ (SE($\hat\beta_Z$))
M1      7848.51   0.127   0.703   -1.018 (0.141)
M2      7859.66   0.128   0.702   -1.013 (0.141)
M3      7954.91   0.143   0.697   -0.982 (0.138)
M4      8284.34   0.185   0.674   -0.934 (0.135)

density is about 40 AHPs per cell. The effects of AHP densities in the second and third order neighborhoods also generally increase up to a threshold.

Table 4.1 reports the estimates of the genetic effect ($\beta_Z$) for the four competing models, all of which are significant. The estimated log odds ratio of infection for (hypothesized) genetically resistant versus non-resistant trees under M1 is -1.018. Across models, the difference in this effect relative to M1 is negligible for M2, about 0.036 higher for M3, and about 0.084 higher for M4.
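The comparisons quoted above follow from simple arithmetic on the estimates in Table 4.1, as the following sketch shows. The conversion to odds ratios by exponentiation is standard for a logit model; the variable names are hypothetical.

```python
import math

# Estimates of the genetic effect from Table 4.1 (logit scale).
beta_z = {"M1": -1.018, "M2": -1.013, "M3": -0.982, "M4": -0.934}

# Differences relative to M1, as quoted in the text.
diff = {m: round(b - beta_z["M1"], 3) for m, b in beta_z.items()}

# Corresponding odds ratios, resistant versus non-resistant trees.
odds_ratio = {m: math.exp(b) for m, b in beta_z.items()}
```

Under M1 the odds ratio is roughly 0.36, so the odds of infection for hypothesized resistant trees are a little over a third of those for non-resistant trees; under M4 the corresponding figure is about 0.39.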

To assess which of these models best fits the data, we consider evaluating them

through their predictive accuracy and other measures of goodness of fit. Examination

of all the goodness of fit criteria, shown in Table 4.1, suggests that M1 and M2

are preferable to the other competing models. Hence, alternate host plant densities

appear to provide somewhat more information than the nearest distance as a measure

of host plant proximity. In other words, considering densities as the exposure measure

both enhances model fit and improves prediction accuracy. The worst fit arises from

M4, so some measure of AHP exposure is helpful in predicting disease; here preferably

both distance and density.

In the next section, we assess whether biases in the treatment effect may be manifested, and large predictive error induced, by misspecification of the exposure measures.


[Figure 4.4 appears here: four contour-map panels (M1, M2, M3 and M4), each with axes East (horizontal) and North (vertical).]

Figure 4.4: Estimated spatial terms for models M1, M2, M3 and M4 in an analysis of infection over a plantation. Darker shades of red indicate higher risk, while lighter shades of yellow indicate lower risk values.


[Figure 4.5 appears here: panels showing the partial effects of L, D(1), D(2) and D(3).]

Figure 4.5: Estimated partial covariate effects with 95% confidence intervals (dashed lines) from fitting M1 (first row), M2 (second row), M3 (third row), respectively.

Page 96: MODELS AND METHODS FOR SPATIAL DATA: … · APPLICATIONS IN EPIDEMIOLOGICAL, ENVIRONMENTAL AND ... Data: Applications in Epidemiological, Environmental ... spatial models and methods

CHAPTER 4. MISSPECIFICATION OF SPATIAL EXPOSURE MEASURES 87

4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures

In this simulation experiment, we study the impact of misspecified exposure
measures on estimation of a treatment effect within a generalized additive modeling
framework similar to that adopted in the analysis in Section 4.4. We investigate the
magnitude of biases in such treatment estimates under a variety of scenarios.

It is well known that if treatment is independent of the covariates, as in a
randomized experiment where covariates are balanced across treatment groups,
omitting covariates does not bias the estimated treatment effect in a linear model,
because the effects are orthogonal. However, Gail et al. (1984) have shown that this
is not necessarily the case in nonlinear regression. These authors show that when
treatment is binary, the omission of a balanced covariate in logistic regression may
bias the estimate of the treatment effect, with the bias being towards the null
hypothesis of no effect. This result emphasizes that, for logistic models,
randomization does not guarantee unbiased estimates of treatment effects when
important covariates are omitted. Here, we investigate the impact of
misspecification of the exposure measures with (1) the treatment group consisting of
the trees from resistant seedlots at sites exactly as in the CBR study, which were
planted randomly across the experimental plot; (2) the treatment group being
randomly selected as half of the trees in the plot. We consider the second case
because the spatial distribution of AHP would have evolved over the two years of the
study following initial planting. The randomization of trees to different treatments
ensures that, on average, there are no systematic differences in observed or
unobserved covariates between units assigned to the different treatments.
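The attenuation described by Gail et al. (1984) can be checked numerically in a small case. The following Python sketch is an illustration only and is not part of the original analysis: it assumes a logistic model with a binary treatment Z and a single balanced covariate X taking the values -1 and +1 with equal probability, independently of Z. The function names and parameter values are hypothetical. It computes the marginal log odds ratio for Z that results when X is averaged out.

```python
import math


def expit(x):
    """Inverse logit: probability corresponding to log odds x."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log odds corresponding to probability p."""
    return math.log(p / (1.0 - p))


def marginal_log_odds_ratio(alpha, beta_z, c):
    """Log odds ratio for Z after averaging over an omitted covariate X.

    Assumed conditional model (hypothetical): logit P(Y=1 | Z, X) = alpha + beta_z*Z + c*X,
    with X in {-1, +1}, each with probability 1/2, independent of Z.
    """
    def marginal_prob(z):
        return 0.5 * (expit(alpha + beta_z * z + c) + expit(alpha + beta_z * z - c))

    return logit(marginal_prob(1)) - logit(marginal_prob(0))


# Conditional treatment effect of -2 with increasingly strong omitted covariate effects.
for c in (0.0, 1.0, 2.0):
    print(c, round(marginal_log_odds_ratio(alpha=-1.0, beta_z=-2.0, c=c), 3))
```

Under these assumptions the marginal log odds ratio equals the conditional value when the covariate has no effect (c = 0) and moves toward zero as the covariate effect grows, which is the direction of bias discussed above.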

The simulation study requires the generation of bootstrap samples, based on the


fitted model under consideration and we also include inflation or deflation of the

partial covariate effects and the site specific structure. The procedure for generating

simulated data is as follows:

(a) Let the partial covariate effects from the fit of M1 to the CBR study be denoted
$f_L(L_i)$, $f_{D(h)}\{D_i(h)\}$, $h = 1, \dots, H$, and $f_s(e_i, n_i)$; let $\alpha$ denote the
fitted intercept. We modify these effects based on forestry scientific considerations
so that they become constant after certain thresholds. For example, our true
partial nearest distance effect, $f^*_L(L_i)$, decreases until 5, with $f^*_L(L_i) = f_L(L_i)$
for $L_i < 5$ and $f^*_L(L_i) = f_L(5)$ for $L_i \ge 5$, with a smooth transition to the
constant value. Similarly, the true covariate effect for $D(1)$, $f^*_{D(1)}\{D_i(1)\}$, mimics
$f_{D(1)}\{D_i(1)\}$ up to the threshold of 40; for $D(2)$ and $D(3)$ the effects are held
constant at $f_{D(2)}(15)$ and $f_{D(3)}(5)$ beyond thresholds of 15 and 5, respectively.
These true effects from which data are generated are displayed in Figure 4.6. The
true spatial effect $f^*_s(e_i, n_i)$ is set to $f_s(e_i, n_i)$, and $\alpha^* = \alpha$.

(b) Calculate the true site specific log odds of infection, $\log\{\pi^*_i/(1 - \pi^*_i)\} = g^*_i$:

if M2 is the true model, $g^*_i = \alpha^* + \beta^*_Z Z_i + \gamma \sum_{h=1}^{H} f^*_{D(h)}\{D_i(h)\} + f^*_s(e_i, n_i)$, (4.10)

if M3 is the true model, $g^*_i = \alpha^* + \beta^*_Z Z_i + \gamma f^*_L(L_i) + f^*_s(e_i, n_i)$, (4.11)

for $i = 1, \dots, n$. Here $\gamma$ represents an inflation factor which modifies the
magnitude of the partial covariate effects. Simulation scenarios S1, S2 and S3
correspond to $\gamma = 1$, 1.5 and 2, respectively. Figure 4.6 displays the inflated partial
covariate effects for these scenarios. The true treatment effect $\beta^*_Z$ takes values
$-2, -1, 0, 1, 2$.


(c) For each combination of the allocation of treatment trees and parameter
settings, generate $y^{(b)}_i$ from a Bernoulli distribution with mean $1/\{1 + \exp(-g^*_i)\}$,
$i = 1, \dots, n$, for $b = 1, \dots, B$ replicates, $B = 1000$.

(d) Fit models M1-M4 using $y^{(b)}_i$ as the responses to obtain estimates of the
treatment effect, $\beta^{(b)}_Z$, $b = 1, \dots, B$.
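Steps (a) to (c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the toy effect `f_L`, the distances and the parameter values are all hypothetical; the plateau is imposed with a hard cut-off rather than the smooth transition described in step (a); and the inflation factor (here `gamma`) is assumed to multiply the partial covariate effect only. Step (d), fitting the additive models, is omitted because it depends on the fitting software.

```python
import math
import random


def plateau(f, threshold):
    """Return f* with f*(x) = f(x) below the threshold and f(threshold) at or above it.

    Assumption: hard cut-off; the smooth transition of step (a) is not reproduced.
    """
    return lambda x: f(min(x, threshold))


def log_odds(alpha, beta_z, z, covariate_effect, spatial_effect, gamma=1.0):
    """True log odds of infection; gamma inflates the partial covariate effect (assumed)."""
    return alpha + beta_z * z + gamma * covariate_effect + spatial_effect


def simulate_replicates(g_true, n_rep, rng):
    """Draw n_rep Bernoulli response vectors with success probability 1/(1 + exp(-g))."""
    probs = [1.0 / (1.0 + math.exp(-g)) for g in g_true]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(n_rep)]


def f_L(x):
    """Hypothetical fitted nearest-distance effect (not taken from the study)."""
    return 0.8 - 0.1 * x


f_L_star = plateau(f_L, 5.0)          # constant beyond a distance of 5
distances = [1.0, 3.0, 5.0, 12.0]     # hypothetical nearest distances for four trees
treated = [1, 0, 1, 0]                # hypothetical treatment indicators Z_i

# Scenario with inflation factor 1.5 and true treatment effect -2; spatial effect set to 0.
g_true = [
    log_odds(alpha=-1.0, beta_z=-2.0, z=z, covariate_effect=f_L_star(d),
             spatial_effect=0.0, gamma=1.5)
    for z, d in zip(treated, distances)
]

rng = random.Random(2011)             # fixed seed so the sketch is reproducible
y = simulate_replicates(g_true, n_rep=1000, rng=rng)
```

Each element of `y` is one simulated response vector, which would then be passed to the model-fitting step to obtain one replicate estimate of the treatment effect.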

A natural overall measure to consider when assessing the predictive abilities of the
fitted models at each location is the change in the infection probability caused by
treatment versus no treatment over each of the sites in the plot,

$$\delta^*_i = \frac{1}{1 + \exp\{-g^*_i(Z_i = 1)\}} - \frac{1}{1 + \exp\{-g^*_i(Z_i = 0)\}}, \qquad (4.12)$$

where $g^*_i(Z_i = k)$, a function of $Z_i$, represents the predicted log odds of infection as in
(4.10) or (4.11) using site covariate values except $Z_i$ and with $Z_i$ set to $k = 0, 1$. The
estimated site specific treatment effect is denoted as
$\delta^{(b)}_i = 1/[1 + \exp\{-g^{(b)}_i(Z_i = 1)\}] - 1/[1 + \exp\{-g^{(b)}_i(Z_i = 0)\}]$
in the $b$-th replicate, $b = 1, \dots, B$. We use the sum of the absolute
bias, standard error and root mean squared error across all the sites to evaluate the
performance of the estimate of the site specific treatment effect under the different
models. These are defined as

$$\mathrm{SBIAS}(\delta) = \sum_{i=1}^{N} \left| \frac{1}{B} \sum_{b=1}^{B} \delta^{(b)}_i - \delta^*_i \right| ;$$

$$\mathrm{SSE}(\delta) = \sum_{i=1}^{N} \left\{ \frac{1}{B} \sum_{b=1}^{B} \left( \delta^{(b)}_i - \frac{1}{B} \sum_{b=1}^{B} \delta^{(b)}_i \right)^2 \right\} ;$$

$$\mathrm{SRMSE}(\delta) = \sum_{i=1}^{N} \left\{ \frac{1}{B} \sum_{b=1}^{B} \left( \delta^{(b)}_i - \delta^*_i \right)^2 \right\}, \qquad (4.13)$$

where $\delta = \{\delta^{(b)}_i\}$, $i = 1, \dots, N$ and $b = 1, \dots, B$.
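The summary measures in (4.13) are straightforward to compute once the replicate estimates are stored site by site. The Python sketch below is illustrative only: the function names and the toy numbers are hypothetical, the infection probability is taken as 1/{1 + exp(-g)} as in step (c), and the three sums follow the expressions as displayed in (4.13).

```python
import math


def infection_probability(g):
    """Probability of infection corresponding to log odds g."""
    return 1.0 / (1.0 + math.exp(-g))


def treatment_difference(g_treated, g_untreated):
    """Change in infection probability at a site: treatment versus no treatment."""
    return infection_probability(g_treated) - infection_probability(g_untreated)


def summary_measures(estimates, truth):
    """Sums over sites of absolute bias, squared deviation about the replicate mean,
    and squared deviation about the truth, following (4.13) as displayed.

    estimates[b][i] is the estimate at site i in replicate b; truth[i] is the true value.
    """
    n_rep = len(estimates)
    sbias = sse = srmse = 0.0
    for i, true_value in enumerate(truth):
        values = [estimates[b][i] for b in range(n_rep)]
        mean_i = sum(values) / n_rep
        sbias += abs(mean_i - true_value)
        sse += sum((v - mean_i) ** 2 for v in values) / n_rep
        srmse += sum((v - true_value) ** 2 for v in values) / n_rep
    return sbias, sse, srmse


# Toy example: two replicates, two sites (hypothetical values).
toy_estimates = [[0.1, 0.5], [0.3, 0.5]]
toy_truth = [0.2, 0.4]
print(summary_measures(toy_estimates, toy_truth))
```

The same function applies unchanged to the predicted infection probabilities, since only the stored quantity differs.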

To compare predictive accuracy for different models under different settings, define
$\theta^*_i = 1/[1 + \exp\{-g^*_i(Z_i = 0)\}]$ as the true infection probability for the $i$th tree and
$\theta^{(b)}_i = 1/[1 + \exp\{-g^{(b)}_i(Z_i = 0)\}]$ as the estimated infection probability for the $i$th
tree, $i = 1, \dots, N$. We compare the models in terms of the sum of the absolute bias,
the sum of the standard error and the sum of the root mean squared error of the
infection probabilities for all trees, defined similarly to (4.13).

Figure 4.6: True partial covariate effects for simulating data in scenarios S1, S2 and S3. The treatment group consists of trees from resistant seedlots. The vertical grey line indicates the threshold of the covariate after which the effect asymptotes.

Figure 4.7: Estimated treatment effect from fitting M1, M2, M3 and M4 when the true model is M2. Simulation scenarios S1, S2 and S3 correspond to the inflation factor $\gamma$ = 1, 1.5 and 2, respectively. The treatment group consists of trees from resistant seedlots. The horizontal dashed line represents the true treatment effect.

As an illustration, Figure 4.7 displays estimated treatment effects obtained from
fitting M1, M2, M3 and M4 when the true model is M2 and the true treatment effect
is -2, with the sites for the treatment group being identical here to those of the
trees from resistant seedlots in the CBR study. The results for all the simulation
scenarios are displayed in Figure B.3 in Appendix B. Similarly, Figure B.4 in
Appendix B displays results when the treatment group is randomly selected as half of
the trees. In general, if the true model is M2, bias in estimating the treatment
effect may arise with M3 and M4, as shown in the left-hand panels in Figures B.3 and
B.4. The bias of the treatment effect increases with inflation of the covariate
effects, and for larger values of the inflation factor $\gamma$ can be substantial. The
bias from the use of M3 and M4 also increases as the magnitude of the treatment
effect increases. An important point is that the bias of the estimated treatment
effect under misspecified models tends to be close to zero when the true treatment
effect is zero, but becomes positive when the true treatment effect is negative, and
conversely, negative when the true treatment effect is positive. Hence, when the true
treatment effect is not zero, the bias attenuates the magnitude of the estimated
effect toward zero. Note that, as may be expected, M1 yields unbiased estimates in
all simulation scenarios. The right-hand panels in Figures B.3 and B.4 in Appendix B
show that the bias arising from misspecified models when the true model is M3 is
almost negligible in all settings, even with rather large values of $\gamma$.

Table 4.2 presents SBIAS, SSE and SRMSE for the site specific treatment effects
when the treatment group consists of trees from resistant seedlots as in the CBR
study, while Table B.1 in Appendix B presents corresponding results for the case
where treatment trees are randomly chosen in this simulation experiment. In general,
SBIAS in the correctly specified model is considerably smaller than for the
misspecified models and close to that for model M1, while SBIAS from the fit of M4
is substantial in all simulation scenarios. The SSE for M4 is smaller than that for
the other models, since no smoothers are included in M4; also, it estimates fewer
parameters than the other models. In all simulation scenarios, the correctly
specified model has the lowest SRMSE; however, note that the difference between
SRMSE values for M2 and M3 when M3 is true is less than the difference when M2 is
true, under the same simulation scenario. The increase in SRMSE observed under
misspecified models, over that for the true model, can be very substantial when the
inflation factor, $\gamma$, and the treatment effect are large. Note also that when
$\beta^*_Z = 0$, i.e. there is no treatment effect, SRMSE values for the true and
misspecified models are comparable.

Table 4.3 shows SBIAS, SSE and SRMSE for the predicted site specific infection
probability when the treatment group consists of trees from resistant seedlots. The
results obtained when the treatment group is randomly selected as half of the trees in
the plot are presented in Table B.2 in Appendix B. The findings are quite similar
to those for predicting treatment effects over all sites, except that the errors
introduced are more pronounced, because errors associated with the incorrect
covariate are not removed through the calculation of differences at sites associated
with Z = 1 versus Z = 0.

4.6 Summary

The study contrasted the use of nearest distance to alternate host plants versus
alternate host plant density as a spatial covariate with additive effects in a study of
the dynamics of CBR on lodgepole pine. Importantly, we utilized the detailed data
set available to evaluate the bias in estimating treatment effects and the predictive
errors induced by misspecification of exposure measures. Our results suggest that
bias in estimating the treatment effect may be quite large under model
misspecification. As the true partial covariate effects increase, the bias becomes
substantial. In addition, the bias of the difference in site specific estimated
probabilities of infection with and without treatment over the whole plot tends to be
close to zero when the true treatment effect is zero; when the true treatment effect
is not zero, this estimated overall effect on infection probability tends to be
attenuated toward zero under model misspecification. We recommend further
investigation of this phenomenon in a wide range of studies.

We also note that a model including both distance and density terms is usually
less efficient than the correctly specified model due to the inflated variance induced
by the inclusion of redundant smooth terms; however, this efficiency loss is minimal.

Table 4.2: Rows marked with an asterisk (fitted model equal to the true model) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site specific treatment effect ($\delta_i$) over the whole plot when the treatment group consists of the trees from resistant seedlots. The other rows provide the difference between SBIAS, SSE and SRMSE of $\delta_i$ based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factor $\gamma$ = 1, 1.5 and 2, respectively.

                S1 ($\gamma = 1$)        S2 ($\gamma = 1.5$)      S3 ($\gamma = 2$)
True  Fit     SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
$\beta^*_Z = -2$
M2    M1       2.01    4.03    4.65    2.16    4.32    4.92    1.86    4.60    5.02
      M2*     48.68  167.05  177.14   49.78  170.09  180.76   51.86  174.67  185.88
      M3     124.48    0.84   82.14  170.88   -5.18  113.87  223.01  -11.63  154.35
      M4     207.56   -7.69  144.38  268.30  -13.42  193.14  312.45  -21.76  228.28
M3    M1       0.19   10.31   10.01    0.40   10.17    9.92    0.32   10.73   10.41
      M2      24.19    8.95   20.67   47.11   11.63   35.97   69.38   14.38   52.74
      M3*     51.91  166.26  177.66   50.70  166.84  177.80   50.71  165.08  176.25
      M4      79.00   -6.35   39.09  136.83   -5.44   83.05  193.00   -3.97  131.36
$\beta^*_Z = -1$
M2    M1       1.56    1.64    2.05    1.54    2.28    2.66    1.47    2.38    2.70
      M2*     29.28  156.43  160.54   29.60  154.49  158.91   29.89  153.19  157.71
      M3      59.65    2.85   31.07   98.50    0.55   55.92  139.96   -1.48   86.86
      M4      85.44   -2.41   44.51  119.69   -5.68   67.36  153.47   -8.78   93.85
M3    M1       0.11    2.96    2.94    0.33    2.83    2.84    0.40    2.97    2.97
      M2      10.80    1.45    5.70   24.12    1.88   11.42   36.12    2.15   17.66
      M3*     30.63  155.03  159.36   28.96  155.03  159.09   28.89  154.63  158.79
      M4      39.70   -3.16   13.12   70.72   -3.22   30.49   99.42   -3.09   50.16
$\beta^*_Z = 0$
M2    M1       2.61    0.17    0.28    0.91   -0.03   -0.02    0.33    0.13    0.12
      M2*      5.37  161.95  161.89    1.80  154.35  154.22    2.03  144.73  144.62
      M3       7.89    0.99    1.44   14.58    0.76    1.61   25.31    0.80    3.33
      M4       7.63    0.17    0.60   14.23    0.00    0.82   24.48   -0.41    1.98
M3    M1      -0.95    0.17    0.15   -0.73   -0.07   -0.09   -0.50    0.02   -0.01
      M2       3.26   -0.21   -0.11    2.38   -0.91   -0.80   -2.09   -1.37   -1.50
      M3*      3.64  162.71  162.59    6.49  162.74  162.72   11.16  156.17  156.42
      M4      -0.25   -0.74   -0.75   -4.22   -1.05   -1.17   -9.73   -1.21   -1.61
$\beta^*_Z = 1$
M2    M1       1.63    1.83    2.12    2.14    1.99    2.36    1.80    1.78    2.16
      M2*     20.34  186.24  188.33   23.57  190.76  193.38   24.79  185.25  188.13
      M3      54.28    0.12   28.55   95.35    0.28   57.71  133.36    0.88   86.85
      M4      68.04   -6.41   30.15  117.44   -8.95   64.58  163.67  -10.95  100.03
M3    M1       0.18    0.84    0.88    0.12    1.10    1.15    0.14    1.59    1.63
      M2       7.96   -0.76    0.22   14.32   -1.40    1.11   21.24   -1.10    3.68
      M3*     15.43  188.13  189.24   16.51  186.53  187.84   18.14  183.74  185.34
      M4      17.48   -2.54    0.62   30.57   -3.95    3.66   44.51   -4.76    9.10
$\beta^*_Z = 2$
M2    M1       1.28    2.52    2.79    1.68    3.24    3.59    1.81    3.39    3.80
      M2*     24.49  179.58  182.74   27.05  185.80  189.43   29.47  195.42  199.34
      M3     106.32   -2.11   71.49  169.35    2.25  126.52  223.55    6.37  173.54
      M4     175.61  -12.84  111.44  277.51  -13.40  196.88  363.53  -13.95  270.52
M3    M1       0.79    5.93    5.99    0.66    6.63    6.63    0.56    8.32    8.22
      M2      14.01    5.22    8.86   26.36    6.83   15.05   37.13    9.52   22.90
      M3*     22.05  168.28  170.85   23.92  171.38  174.21   25.68  174.73  178.08
      M4      36.46   -3.95   10.48   65.29   -5.57   25.79   94.64   -7.02   44.46

Table 4.3: Rows marked with an asterisk (fitted model equal to the true model) provide SBIAS, SSE and SRMSE based on the correct model for the predicted site specific infection probability ($\theta_i$) over the whole plot when the treatment group consists of the trees from resistant seedlots. The other rows provide the difference between SBIAS, SSE and SRMSE of $\theta_i$ based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$ and simulation scenarios S1, S2 and S3 correspond to inflation factor $\gamma$ = 1, 1.5 and 2, respectively.

                S1 ($\gamma = 1$)        S2 ($\gamma = 1.5$)      S3 ($\gamma = 2$)
True  Fit     SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE   SBIAS     SSE   SRMSE
$\beta^*_Z = -2$
M2    M1       2.44    5.56    6.35    2.69    6.10    6.89    2.36    6.14    6.74
      M2*     65.70  191.98  207.01   63.39  186.45  201.00   61.50  179.97  194.18
      M3     216.97   -3.40  162.96  323.38    0.23  259.32  404.02    3.74  335.96
      M4     355.24  -12.21  277.21  536.31   -4.36  451.24  689.50    1.35  601.89
M3    M1       0.16   18.52   17.67    0.27   19.28   18.40    0.00   21.24   20.14
      M2      31.85   18.07   34.85   63.22   24.80   59.84   94.07   31.55   86.74
      M3*     68.53  185.30  202.54   67.04  185.34  201.99   67.52  184.48  201.54
      M4     107.19   -8.66   60.43  187.95   -6.56  126.61  267.52   -3.34  198.54
$\beta^*_Z = -1$
M2    M1       2.61    4.89    5.77    2.72    6.37    7.15    2.33    6.34    6.90
      M2*     66.21  194.02  209.07   63.60  189.11  203.56   61.67  182.44  196.52
      M3     221.91   -2.13  167.77  330.22    0.10  264.65  410.46    3.53  341.12
      M4     363.74  -10.99  285.12  548.43   -4.43  461.57  702.63    1.20  613.23
M3    M1       0.14   18.95   18.09    0.34   18.96   18.17   -0.04   21.37   20.28
      M2      33.08   18.17   35.49   65.18   25.00   60.93   96.58   31.96   88.46
      M3*     68.67  188.81  205.79   67.12  188.96  205.34   67.71  187.84  204.69
      M4     110.36   -8.70   62.09  192.74   -6.61  129.57  273.84   -3.45  203.03
$\beta^*_Z = 0$
M2    M1       2.37    6.30    6.97    2.75    6.23    7.00    2.45    6.58    7.14
      M2*     67.36  197.06  212.40   64.81  191.32  206.10   62.64  184.28  198.66
      M3     224.78   -3.21  169.34  332.30   -0.09  266.51  411.51    3.31  342.03
      M4     369.63  -12.09  289.28  554.14   -4.67  466.49  707.41    0.86  617.30
M3    M1      -0.08   19.36   18.41    0.52   19.56   18.83   -0.05   20.79   19.70
      M2      33.37   18.91   36.47   67.29   24.88   62.09   99.05   31.82   89.93
      M3*     69.65  191.56  208.70   67.93  191.64  208.15   68.58  190.04  207.08
      M4     113.15   -8.89   63.94  197.42   -6.96  132.96  279.93   -3.46  208.23
$\beta^*_Z = 1$
M2    M1       2.05    6.40    6.96    2.39    6.75    7.37    2.32    5.80    6.37
      M2*     67.26  198.28  213.49   64.53  193.36  207.88   63.07  186.22  200.64
      M3     223.49   -3.17  168.14  330.13   -0.03  264.05  407.82    3.02  338.28
      M4     367.63  -11.95  287.15  550.13   -4.61  461.78  700.76    0.77  610.36
M3    M1      -0.14   19.04   18.05    0.37   18.42   17.64   -0.01   20.67   19.59
      M2      34.56   18.54   36.54   67.37   25.20   62.34   99.62   31.75   90.04
      M3*     69.28  192.27  209.29   67.72  192.65  209.06   68.22  191.54  208.35
      M4     114.45   -8.50   64.84  198.69   -6.65  133.85  281.55   -3.36  209.00
$\beta^*_Z = 2$
M2    M1       2.44    6.30    7.09    2.70    6.37    7.19    2.37    6.13    6.75
      M2*     66.59  195.99  211.10   64.48  190.60  205.29   62.64  185.31  199.61
      M3     219.26   -3.05  164.90  323.50    0.56  259.42  400.46    3.26  331.95
      M4     359.49  -11.97  280.26  537.98   -4.25  451.74  686.48    0.59  596.89
M3    M1       0.27   18.56   17.68    0.11   18.89   17.94   -0.20   20.74   19.54
      M2      32.67   18.43   35.44   64.74   24.65   60.38   96.31   31.46   87.86
      M3*     69.58  190.11  207.52   68.20  190.38  207.23   68.91  189.27  206.63
      M4     110.97   -8.63   62.57  193.65   -6.54  130.31  275.00   -3.19  204.03

Our findings for the Comandra blister rust infection indicate that the alternate

host plant densities at different orders of neighborhood are important exposure mea-

sures. In our developments, we adopted a discrete approach to account for the effect

of densities at various neighborhoods. More precision may be gained if we model the

densities in a continuous way with density effects decreasing as we move away from

the site.

As well, it would be interesting to compare our model to a traditional infectious

disease model, in which the transition probabilities for infection from the alternate

host plants to the pine trees may be modeled when more information about infec-

tiousness of the alternate host plants is available.


Chapter 5

Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia

5.1 Introduction

Pacific oysters (Crassostrea gigas) are cultivated along the northwest coast of North
America from Washington to Alaska and accumulate levels of cadmium that exceed
some international tolerances. Health guidelines for the European Community set the
tolerance of cadmium concentration at 6.3 µg Cd/g dry weight basis, and Asian
export markets set the tolerance at 13.5 µg Cd/g dry weight basis (Kruzynski, 2004). In
1999, several shipments of oysters cultured in the province of British Columbia (BC),
Canada, were rejected by the Hong Kong Food and Environmental Hygiene
Department for exceeding the 13.5 µg Cd/g dry weight basis import limit. A subsequent
shellfish survey by the Canadian Food Inspection Agency (CFIA) confirmed these
shipments were not unusual and reported a mean cadmium value of 17.7 µg Cd/g
dry weight basis for BC oysters cultured over the broad geographic area (Schallie,
2001). In 2000, Fisheries and Oceans Canada identified possible sources from which
the cadmium might originate. They concluded that the cadmium in BC oysters is
mainly due to the geology of the area (Kruzynski, 2000), but the source of cadmium
for these oysters is still uncertain. Consequently, there has recently been an enormous
investment in studying the issue.

A primary interest of our analysis is to study how oyster cadmium concentrations

vary over space and time. A second objective is to investigate how cadmium con-

centrations depend on oyster growth over time. We illustrate how spline smoothing

techniques can be employed to address both of these concerns.

5.1.1 The Motivating Datasets

In July 2000, in collaboration with the British Columbia Ministry of Agriculture,
Food and Fisheries (BCMAFF), Simon Fraser University initiated a grow-out study
whereby Pacific oysters from the same seed source (Coast Seafoods, Washington State,
USA) of the same age were deployed to existing oyster culture locations along the
western coast of BC. The deployment dates of all oysters are the same within each
site. Representative locations in both the east (mainland) and outer west (oceanic)
were included. Deployment occurred along lines that were approximately 8 m long
with seeded shells inserted at 30 cm intervals from the surface. Oysters were sampled
approximately bimonthly from December 2002 (D2) to February 2004 (F4), from
shallow (about 1 m depth) to deep (about 7 m depth) positions along the long-line.
Oyster shell length (at the maximum length) was recorded in the field at the time
of sampling. The sampled oysters were then transported on ice to the laboratory,
where they were killed and frozen whole until cadmium concentration analysis.

In this chapter, we consider thirteen sites identified in Figure 5.1. In the remainder
of the chapter, the location of each site is denoted by the two-letter abbreviation of
its name, with the number in brackets indicating which region the site is from. The
five southernmost sites, located in the region of Barkley Sound (BS) on the west
side of Vancouver Island, are (1)Poett Nook (PN), (1)Useless Inlet-3 (BM), (1)Useless
Inlet-4 (JF), (1)Useless Inlet-5 (PC) and (1)Webster Island/Effingham Inlet (WI).
Moving northward, the site (2)Kendrick (KI) is located in the region of Nootka Sound
(NS), to the north of Barkley Sound. Six sites from the region of Desolation
Sound (DS), located on the west coast of the British Columbia mainland and on
the east side of Vancouver Island, are (3)Orchard Bay (OB), (3)Redonda Bay
(RB), (3)Teakerne Arm (TA), (3)Gorge Harbour (GH), (3)Thors Cove (TC) and
(3)Trevenen Bay (TB). The northernmost site considered is (4)Hecate Cove (HC),
located in the region of Quatsino Sound (QS). Note that the regions of Nootka Sound
(NS) and Quatsino Sound (QS) each include only one site due to sampling
difficulty in the specific geographical location.

5.1.2 Statistical Methods

In this study, we seek to explain the variation in observed cadmium concentrations.
Because measurements are taken at unequally spaced time points, traditional
techniques from time series analysis are difficult to implement. Bendell and Feng
(2009) demonstrated the presence of certain temporal variation and clustering
patterns for cadmium concentrations in BC oysters via preliminary visual and simple
statistical approaches. The average oyster cadmium concentrations and oyster length
at the same depth at each site are measured at some discrete time points. Smoothing
splines are used to estimate the average oyster cadmium concentration as a smooth
nonparametric function at each site. The average oyster length will be non-decreasing
over time, so a monotone smoothing technique (Ramsay, 1988) is used to estimate the
average oyster length as a monotone function of time.

Figure 5.1: The geographical range of samples of cultured Pacific oysters along the west coast of British Columbia analyzed for oyster cadmium concentrations.

In addition to identifying the temporal trends in cadmium accumulation by oysters,
we also sought to assess spatial influences on measured cadmium concentrations.
Specifically, we aim to detect those regions where the oyster cadmium concentration
over the entire deployment time is the highest, and hence advise shellfish farmers
to avoid these areas as possible farming sites. Classical multivariate principal
components analysis (PCA) has often been adopted to identify these features. Here
we employ a functional alternative, namely functional principal component analysis
(FPCA), which incorporates smoothing techniques into PCA (Deville, 1974; Besse and
Ramsay, 1986; Hall et al., 2006). Ramsay and Silverman (2005) give an accessible
introduction to the methodologies and applications of FPCA.

Oyster cadmium concentration may also depend on explanatory variables that change
over time (i.e., oyster length and growth rate). Bendell and Feng (2009) used a
standard multivariate linear regression model to show that cadmium concentrations in
oysters were linked to a number of factors, such as region, depth, growth rate and
oyster length. All of these factors played important roles in determining final
tissue concentrations and ultimately the amounts of cadmium transferred to higher
trophic levels. However, a linear relationship between the cadmium concentration
and these covariates may not be a valid assumption. We relax this strict linearity
constraint by using a semi-parametric additive model, which allows for
flexible dependence structures (Ferraty and Vieu, 2000; Malfait and Ramsay, 2003;
Chiou et al., 2004; Antoniadis and Sapatinas, 2007; Crambes et al., 2009).

Oysters’ growth rate appears to be associated with temperature and food
availability, which in turn might be linked to the amount of cadmium contained in
oysters. Bendell and Feng (2009) examined the role of growth in influencing oyster
cadmium concentrations by calculating the growth rate as the change in oyster
length divided by the total Julian days (the time interval from time 0, when the
oysters were first deployed, to the sampling date) to adjust for different deployment
times across the sites. The main drawback of their method is that it yields only a
global growth rate, averaged over the whole period between deployment and sampling.
Here, we present an alternative based on local (instantaneous) growth rates, which
are calculated as the first derivatives of the monotone smoothing curves for the
oyster lengths, alleviating the limitation of the previous approach.
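To illustrate the distinction, the Python sketch below compares the two definitions on a hypothetical saturating length curve. The curve, its parameters (lengths in mm, time in days) and the function names are assumptions made purely for illustration. In the analysis itself the local rate is the first derivative of the fitted monotone smoothing curve; a central finite difference stands in for that derivative here.

```python
import math


def length_curve(t, l0=10.0, l_inf=90.0, k=0.004):
    """Hypothetical monotone length curve: growth slows as length approaches l_inf."""
    return l0 + (l_inf - l0) * (1.0 - math.exp(-k * t))


def global_growth_rate(curve, t):
    """Change in length since deployment (time 0) divided by elapsed time."""
    return (curve(t) - curve(0.0)) / t


def local_growth_rate(curve, t, h=1e-3):
    """Instantaneous growth rate at time t, approximated by a central difference."""
    return (curve(t + h) - curve(t - h)) / (2.0 * h)


# Compare the two rates at an early and a late sampling date (days since deployment).
for day in (30.0, 600.0):
    print(day, global_growth_rate(length_curve, day), local_growth_rate(length_curve, day))
```

For a curve whose growth slows over time, the global rate at a late sampling date still reflects the early rapid growth, whereas the local rate reflects only the growth occurring at that date.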

The rest of the chapter is organized as follows. In Section 5.2 we provide an

overview of some smoothing techniques used in this analysis, including penalized

smoothing, monotone smoothing, functional principal component analysis and semi-

parametric additive modelling. These methods are then illustrated in Section 5.3

using our motivating oyster data where the capacity of the smoothing approaches to

handle several features of this dataset is demonstrated. Important results on spatial

and temporal variation of cadmium concentrations are discussed. Concluding remarks

are provided in Section 5.4.

5.2 Methodology

This section reviews the methods used in this application. The average measurements
of oysters sampled at the same time points are modeled as a function of time at each
site. The growth rates of oysters at each site are estimated as the first derivatives of
the monotone smoothing spline estimator for oyster lengths. We also use functional
principal component analysis to detect the spatial variation of the average oyster
cadmium concentrations. The effects of a number of factors on the oyster cadmium
concentration functions are examined by semi-parametric additive modelling.

5.2.1 Spline Smoothing

Let $y_{sgi}$ represent the measurement of cadmium concentration on the $i$th oyster sampled at the $g$th time point from site $s$, $s = 1, \ldots, N$, $g = 1, \ldots, G$, $i = 1, \ldots, n_{sg}$, where $n_{sg}$ denotes the number of oysters sampled from site $s$ at time $g$. Let $x_s(t)$ denote the mean cadmium concentration curve for site $s$, which is represented as a linear combination of basis functions

$$x_s(t) = \sum_{k=1}^{b+d+1} c_{sk}\,\phi_k(t) = \boldsymbol{\phi}(t)^{T}\mathbf{c}_s , \tag{5.1}$$

where $\mathbf{c}_s$ is the vector of B-spline coefficients $c_{sk}$, $k = 1, \ldots, K$, with $K = b + d + 1$, corresponding to the $k$th spline effect at site $s$; $\boldsymbol{\phi}(t)$ is the vector of cubic B-spline basis functions $\phi_k(t)$; $b$ is the number of break-points or knots; and $d$ is the degree of the polynomial within each segment (cubic splines, $d = 3$, are often used). There are many equivalent bases for the spline space, but the most popular is the B-spline basis because of its numerical stability and computational efficiency (de Boor, 1978). To implement spline smoothing, the basis coefficient vector $\mathbf{c}_s$ needs to be estimated, for example by least squares. The fitted curve is determined by the number and location of the knots; Ramsay (1988) and Zhou and Shen (2001) discuss how to choose them. The sensitivity of splines to knot placement is a known drawback (Hastie et al., 2001; Durban et al., 2004). Our study uses a penalized fitting strategy to reduce the importance of knot locations (Wood, 2000): one knot is placed at each distinct time point with measurements, and a roughness penalty term controls the smoothness of the fitted function. This eliminates the need to choose knot locations and makes the estimated curves more stable, at the cost of some increase in bias. The basis coefficient vector $\mathbf{c}_s$ is estimated by minimizing the penalized sum of squared error (PENSSE) loss function,

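$$\mathrm{PENSSE}(\mathbf{c}_s) = \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} \left[y_{sgi} - x_s(t_g)\right]^2 + \lambda_s \int_{t_1}^{t_G} \left[\frac{d^2}{dt^2}x_s(t)\right]^2 dt , \tag{5.2}$$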

where $t_g$ represents the actual time at the $g$th time point. The second term penalizes the roughness of the fitted function. The smoothing parameter $\lambda_s$ for site $s$ determines the trade-off between fidelity to the data and the smoothness of the fitted function. Ramsay and Dalzell (1991) suggest that $\lambda_s$ can often be chosen by inspection of the curve smoothness or through an automated procedure such as generalized cross-validation (GCV) (Craven and Wahba, 1979).

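Taking the derivative of (5.2) with respect to the parameter vector $\mathbf{c}_s$, setting it to zero and solving for $\mathbf{c}_s$ yields

$$\hat{\mathbf{c}}_s = \left\{\sum_{g=1}^{G} n_{sg}\,\boldsymbol{\phi}(t_g)\boldsymbol{\phi}(t_g)^{T} + \lambda_s \int_{t_1}^{t_G} \frac{d^2}{dt^2}\boldsymbol{\phi}(t)\,\frac{d^2}{dt^2}\boldsymbol{\phi}(t)^{T}\,dt\right\}^{-1} \left[\sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} y_{sgi}\,\boldsymbol{\phi}(t_g)\right] .$$

The estimate for the smooth function is then

$$\hat{x}_s(t) = \boldsymbol{\phi}(t)^{T}\hat{\mathbf{c}}_s . \tag{5.3}$$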
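The closed-form penalized estimate can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: it swaps the B-spline basis for a cubic truncated power basis (which spans the same spline space but is less stable numerically), approximates the roughness integral by the trapezoid rule, and uses simulated data. The function names, the knot layout and the smoothing parameter values are all hypothetical. The sketch also evaluates the usual GCV score $n\,\mathrm{RSS}/(n - \mathrm{tr}\,S)^2$ for the smoother matrix $S$.

```python
import numpy as np

def basis(t, knots):
    """Cubic truncated power basis: 1, t, t^2, t^3, (t - k)_+^3 for each knot."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def basis_d2(t, knots):
    """Second derivative of each basis function above."""
    t = np.asarray(t, dtype=float)
    cols = [np.zeros_like(t), np.zeros_like(t), 2.0 * np.ones_like(t), 6.0 * t]
    cols += [6.0 * np.clip(t - k, 0.0, None) for k in knots]
    return np.column_stack(cols)

def roughness_matrix(knots, t_lo, t_hi, n_grid=2001):
    """Trapezoid-rule approximation of the integral of phi''(t) phi''(t)^T."""
    grid = np.linspace(t_lo, t_hi, n_grid)
    d2 = basis_d2(grid, knots)
    w = np.full(n_grid, grid[1] - grid[0])
    w[[0, -1]] *= 0.5
    return (d2 * w[:, None]).T @ d2

def fit_penalized(t_obs, y_obs, knots, lam):
    """Closed-form minimizer of the penalized sum of squared errors.

    t_obs and y_obs hold one entry per oyster, so repeated time points
    reproduce the n_sg weighting of the normal equations automatically.
    """
    Phi = basis(t_obs, knots)
    R = roughness_matrix(knots, np.min(t_obs), np.max(t_obs))
    A = Phi.T @ Phi + lam * R
    c_hat = np.linalg.solve(A, Phi.T @ y_obs)
    hat = Phi @ np.linalg.solve(A, Phi.T)      # smoother ("hat") matrix S
    return c_hat, hat

def gcv(y_obs, hat):
    """Generalized cross-validation score n * RSS / (n - trace(S))^2."""
    n = len(y_obs)
    rss = float(np.sum((y_obs - hat @ y_obs) ** 2))
    return n * rss / (n - np.trace(hat)) ** 2

# Simulated data: 8 sampling times, 5 oysters per time (hypothetical).
rng = np.random.default_rng(0)
times = np.repeat(np.linspace(0.0, 1.0, 8), 5)
y = 2.0 + np.sin(2 * np.pi * times) + rng.normal(0.0, 0.2, times.size)
knots = np.linspace(0.0, 1.0, 8)[1:-1]          # interior time points as knots

for lam in (1e-4, 1e-2, 1.0):
    c_hat, hat = fit_penalized(times, y, knots, lam)
    print(f"lambda={lam:g}  effective df={np.trace(hat):.2f}  GCV={gcv(y, hat):.4f}")
```

With a knot at every time point the unpenalized normal equations are rank deficient, so the penalty term is what makes the system solvable; larger values of the smoothing parameter reduce the effective degrees of freedom towards those of a straight line.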

5.2.2 Monotone Spline Smoothing

In principle, the average oyster length should not decrease over time, even when the noise inherent in any data set may suggest otherwise. To account for this, we employ a monotone smoothing technique (Ramsay, 1988) to model oyster length over time. A strictly monotone smooth function has a strictly positive first derivative. Let $l_s(t)$ represent the average oyster length function at site $s$. The growth rate $dl_s(t)/dt$ must be positive, so it is expressed as the exponential of an unconstrained function $W_s(t)$: $dl_s(t)/dt = \exp[W_s(t)]$. Integrating both sides of this equation, $l_s(t)$ can be written as

$$l_s(t) = \beta_{s0} + \int_{t_1}^{t} e^{W_s(u)}\,du .$$

With this representation, $W_s(u)$ can be flexibly estimated as a linear combination of basis functions, $W_s(u) = \sum_{k} c_{sk}\phi_k(u)$, defined similarly to (5.1). Here, we need to estimate $\beta_{s0}$ and $c_{s1}, \cdots, c_{sK}$. We estimate these parameters by minimizing

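$$\mathrm{PENSSE}(\beta_{s0}, c_{s1}, \cdots, c_{sK}) = \sum_{g=1}^{G}\sum_{i=1}^{n_{sg}} \left[\ell_{sgi} - \beta_{s0} - \int_{t_1}^{t_g} e^{W_s(t)}\,dt\right]^2 + \lambda_s \int_{t_1}^{t_G} \left[\frac{d^2 W_s(t)}{dt^2}\right]^2 dt ,$$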

where $\ell_{sgi}$ represents the length of the $i$th oyster sampled from site $s$ at the $g$th time point. In this situation, closed-form estimates of $\beta_{s0}$ and the spline coefficients $c_{s1}, \cdots, c_{sK}$ are not available, so the Newton-Raphson iteration method is used to obtain the coefficient estimates. It is easily implemented and converges quickly. To avoid converging to a local minimum, one can try different starting values for the basis coefficients.
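The following sketch illustrates why this representation guarantees a non-decreasing length curve whatever the coefficients are. It does not perform the Newton-Raphson fit. The quadratic polynomial used for $W$ is an assumption made for brevity (the chapter uses B-splines), and the coefficient values and starting length are hypothetical.

```python
import math

def W(u, coef):
    """Unconstrained function W(u); a quadratic polynomial is assumed here
    purely for illustration."""
    return coef[0] + coef[1] * u + coef[2] * u * u

def monotone_curve(t_grid, beta0, coef):
    """Return (length, growth_rate) evaluated on t_grid, where
    growth_rate(t) = exp(W(t)) > 0 and
    length(t) = beta0 + integral from t_grid[0] to t of exp(W(u)) du,
    with the integral approximated by the trapezoid rule."""
    rate = [math.exp(W(t, coef)) for t in t_grid]
    length = [beta0]
    for j in range(1, len(t_grid)):
        step = t_grid[j] - t_grid[j - 1]
        length.append(length[-1] + 0.5 * step * (rate[j - 1] + rate[j]))
    return length, rate

# Hypothetical values: months 0 to 14 in quarter-month steps, starting length 40 mm.
t_grid = [0.25 * j for j in range(57)]
length, rate = monotone_curve(t_grid, beta0=40.0, coef=[0.5, 0.4, -0.03])
print("first and last length:", round(length[0], 2), round(length[-1], 2))
print("min and max growth rate:", round(min(rate), 3), round(max(rate), 3))
```

Because the growth rate is an exponential, it is positive for any coefficient vector, so the fitted length can flatten out but never decrease; the growth rate itself is free to rise and fall.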

5.2.3 Functional Principal Component Analysis

Here we outline the statistical methodology of functional principal component analysis (FPCA), which we use in the following section to examine the oyster data set. FPCA is a multivariate technique that can partition variability among the measurements into components of decreasing 'importance'. In this application, we treat the mean curves for the cadmium concentration, defined in (5.3), as the 'response'. We subtract the mean curve $\bar{x}(t) = \sum_{s=1}^{N} x_s(t)/N$ from each curve and use $z_s(t) = x_s(t) - \bar{x}(t)$ to implement FPCA, as our interest is primarily in characterizing the deviations of the $x_s(t)$ from the average curve. The first functional principal component weight function $\xi_1(t)$ is estimated by maximizing the sum of squared functional principal component (FPC) scores $\sum_s f_{s1}^2$, where

$$f_{s1} = \int_{t_1}^{t_G} \xi_1(t)\,z_s(t)\,dt, \quad s = 1, \cdots, N ,$$

subject to

$$\|\xi_1\|^2 = \int_{t_1}^{t_G} \xi_1^2(t)\,dt = 1 . \tag{5.4}$$

The second functional principal component weight function $\xi_2(t)$ is estimated by maximizing the corresponding sum of squared FPC scores, subject to the constraint $\|\xi_2\|^2 = 1$ and the additional constraint

$$\int_{t_1}^{t_G} \xi_1(t)\,\xi_2(t)\,dt = 0 . \tag{5.5}$$

Other functional principal component weight functions can be estimated in the same way.

Searching for the mutually orthogonal and normalized weight functions is equivalent to an eigenanalysis of the variance-covariance function, or operator, defined by

$$v(t, t') = N^{-1}\sum_{s=1}^{N} z_s(t)\,z_s(t') .$$

Any eigenfunction $\xi_p(t)$, $p = 1, \cdots, P$, then satisfies the functional eigenequation

$$\int_{t_1}^{t_G} v(t, t')\,\xi_p(t')\,dt' = \rho_p\,\xi_p(t) ,$$

for an appropriate eigenvalue $\rho_p$. The proportion of the total variation among the $N$ curves accounted for by the eigenfunction $\xi_p(t)$ is calculated as $\rho_p / \sum_{q=1}^{P}\rho_q$. In practice, the first $P_L$ eigenfunctions are chosen such that $\sum_{p=1}^{P_L}\rho_p / \sum_{q=1}^{P}\rho_q$ is greater than some threshold, because they account for most of the total variation. In the oyster data set, the first two components together accounted for more than 90% of the variation at both depths (see Figure 5.3), providing enough information regarding the principal sources of variation between mean concentration curves.
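A discretized version of this eigenanalysis can be sketched as follows, assuming the curves have been evaluated on a common time grid and the integrals are replaced by trapezoid-rule sums. This is an illustration only, without the smoothing of eigenfunctions discussed next; the function name `fpca` and the simulated curves are hypothetical.

```python
import numpy as np

def fpca(curves, t_grid):
    """Discretized FPCA.

    curves : (N, T) array, row s holds x_s(t) evaluated on t_grid
    Returns eigenvalues rho (descending), weight functions xi as rows of a
    (T, T) array, scores (N, T) and the quadrature weights w.
    """
    curves = np.asarray(curves, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    n_curves = curves.shape[0]

    # Trapezoid weights, so that sum(w * f) approximates the integral of f.
    w = np.empty_like(t_grid)
    w[1:-1] = 0.5 * (t_grid[2:] - t_grid[:-2])
    w[0] = 0.5 * (t_grid[1] - t_grid[0])
    w[-1] = 0.5 * (t_grid[-1] - t_grid[-2])

    z = curves - curves.mean(axis=0)           # z_s(t) = x_s(t) - mean curve
    v = z.T @ z / n_curves                     # v(t, t') on the grid

    # Symmetrized form of: integral of v(t, t') xi(t') dt' = rho * xi(t).
    sqrt_w = np.sqrt(w)
    rho, u = np.linalg.eigh(sqrt_w[:, None] * v * sqrt_w[None, :])
    order = np.argsort(rho)[::-1]
    rho = np.clip(rho[order], 0.0, None)
    xi = (u[:, order] / sqrt_w[:, None]).T     # each row has unit L2 norm
    scores = (z * w) @ xi.T                    # f_sp = integral of xi_p * z_s
    return rho, xi, scores, w

# Hypothetical curves for 13 sites: a site-specific level plus a seasonal term.
rng = np.random.default_rng(1)
t_grid = np.linspace(0.0, 14.0, 57)
level = rng.normal(0.0, 2.0, size=13)
tilt = rng.normal(0.0, 0.5, size=13)
curves = 10.0 + level[:, None] + tilt[:, None] * np.cos(2 * np.pi * t_grid / 12.0)

rho, xi, scores, w = fpca(curves, t_grid)
share = rho / rho.sum()
print("share of variation, first two components:", np.round(share[:2], 3))
```

The rows of `scores` for the first two components are the kind of coordinates that can be plotted against each other, or passed to a clustering routine, to look for groups of similar sites.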

To control the smoothness of eigenfunctions, Ramsay and Silverman (2005) introduce a smoothed PCA approach. The first eigenfunction $\xi_1(t)$ is estimated by maximizing the penalized sample variance

$$\mathrm{PCAPSV}(\xi) = \frac{\operatorname{var}\left[\int_{t_1}^{t_G} \xi(t)\,z_s(t)\,dt\right]}{\|\xi\|^2 + \lambda \int_{t_1}^{t_G} \left[\frac{d^2\xi(t)}{dt^2}\right]^2 dt} ,$$

subject to $\|\xi_1\|^2 = 1$. The smoothing parameter $\lambda$ controls the trade-off between maximizing the sample variance and penalizing the roughness of the first eigenfunction. Each subsequent eigenfunction $\xi_j(t)$, $j = 2, 3, \cdots$, is estimated by maximizing the penalized variance $\mathrm{PCAPSV}(\xi)$ subject to two constraints: $\|\xi_j\|^2 = 1$ and the modified form of orthogonality

$$\int_{t_1}^{t_G} \xi_j(t)\,\xi_k(t)\,dt + \int_{t_1}^{t_G} \left[\frac{d^2\xi_j(t)}{dt^2}\right]\left[\frac{d^2\xi_k(t)}{dt^2}\right] dt = 0 \quad \text{for } k = 1, \cdots, j - 1 .$$

Ramsay and Silverman (2005) explain in detail, in Section 9.4 of their book, how to find these eigenfunctions by solving a single eigenvalue problem. Silverman (1996) shows the theoretical advantages of this approach.

5.2.4 Semi-Parametric Additive Model

Oyster cadmium concentration trends at the sampling sites may be explained by variables such as depth, region, oyster length and oyster growth rate. A semi-parametric additive model is proposed to investigate which regions tend to have greater cadmium concentrations, and which sizes of oysters tend to have high concentrations within any region, after adjusting for depth and growth rate effects.

Let $y_{sgi}$ denote the measurement of cadmium concentration on the $i$th oyster sampled at the $g$th time point from site $s$, $s = 1, \ldots, N$, $g = 1, \ldots, G$, $i = 1, \ldots, n_{sg}$, where $n_{sg}$ denotes the number of oysters sampled from site $s$ at time $g$. We investigate the semi-parametric additive model

$$M_1: \ \log(y_{sgi}) = \mu + s_0(t_g) + s_1(l_{si}(t_g)) + s_2(r_{si}(t_g)) + \beta_d\,I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h\,I(\mathrm{region}_{si} = h) + \varepsilon_{si}(t_g) ,$$

where $\mu$ represents the overall mean, $t_g$ is the actual time at the $g$th time point, and the nonparametric smooth function $s_0(\cdot)$ represents the overall mean trend. The growth rate $r_{si}(t)$ is estimated as the first derivative of the monotone smoothing spline estimator of oyster lengths $l_{si}(t)$; $s_1(\cdot)$ and $s_2(\cdot)$ are nonparametric smooth functions of observed oyster length and estimated oyster growth rate, respectively. Here $s_0(\cdot)$, $s_1(\cdot)$ and $s_2(\cdot)$ are not constrained to be of any pre-specified parametric form. Instead, we model these terms as linear combinations of cubic B-splines: $s_k(\cdot) = \sum_{j=1}^{p_k} c_{kj}\phi_{kj}(\cdot)$, $k = 0, 1, 2$, where the $c_{kj}$ are the coefficients of the smooth. The linear coefficients $\beta_d$ and $\beta_h$ are the discrete effects of depth and region, with depth 1m and region BS serving as the reference levels of these factor variables. We use $\varepsilon_{si}(t)$ to denote independent errors with mean zero and common variance.

To avoid overfitting, $M_1$ is estimated by penalized maximum likelihood estimation (Hastie and Tibshirani, 1990). Specifically, the semi-parametric additive model is estimated by minimizing the penalized sum of squared error loss function

$$\mathrm{PENSSE} = \sum_{s,i,g} \Big\{ \log(y_{sgi}) - \Big[ \mu + s_0(t_g) + s_1\{l_{si}(t_g)\} + s_2\{r_{si}(t_g)\} + \beta_d\,I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h\,I(\mathrm{region}_{si} = h) \Big] \Big\}^2 + \lambda_0 \int \left[\frac{d^2 s_0(t)}{dt^2}\right]^2 dt + \lambda_1 \int \left[\frac{d^2 s_1(l)}{dl^2}\right]^2 dl + \lambda_2 \int \left[\frac{d^2 s_2(r)}{dr^2}\right]^2 dr ,$$

where the smoothing parameters $\lambda_0$, $\lambda_1$, $\lambda_2$ determine the amount of smoothing for each of the smooth terms. The smoothing parameters are estimated with a computationally efficient method that applies GCV to generalized ridge regression problems (Wood, 2004). The model is fitted using the R package mgcv (Wood, 2004).

Note that although oyster growth rate is estimated as the first derivative of the monotone smoothing spline estimator of the oyster length, it is only weakly correlated with oyster length: the correlation coefficient is -0.17 (a scatter plot is shown in Figure C.4 in Appendix C). It is not unusual for the first derivative of a variable to be nearly unrelated to the variable itself. For example, although the velocity of a moving car is the first derivative of its position, the velocity need not depend on the position of the car.

Bendell and Feng (2009) estimate linear effects of oyster growth rate and oyster length, together with the effects of depth and region, using a standard multiple linear regression model

$$M_2: \ \log(y_{sgi}) = \mu + \beta_l\,l_{si}(t_g) + \beta_r\,r_{si}(t_g) + \beta_d\,I(\mathrm{depth}_{si} = 7\mathrm{m}) + \sum_{h=1}^{H} \beta_h\,I(\mathrm{region}_{si} = h) + \varepsilon_{si}(t_g) ,$$

where $\beta_l$ and $\beta_r$ are linear coefficients for oyster length and oyster growth rate, respectively.

To compare the two models, we employ the Akaike information criterion (AIC) (Akaike, 1974), which rewards goodness of fit but penalizes model complexity, as measured by the number of parameters. The standard multiple linear regression model $M_2$ is less complex than the semi-parametric additive model $M_1$ and easier to interpret. However, $M_1$ is appealing because it allows flexible trends for the covariate effects and avoids restricting them to a linear form.
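The trade-off that AIC formalizes can be illustrated with a small sketch. It compares a straight-line fit with a more flexible quadratic fit on simulated data with a curved trend, using the Gaussian form of AIC with plain parameter counts. This is an assumption-laden toy example, not a reproduction of the $M_1$ versus $M_2$ comparison: software for penalized models typically replaces the parameter count by effective degrees of freedom, and the helper `gaussian_aic` is hypothetical.

```python
import math
import numpy as np

def gaussian_aic(residuals, n_mean_params):
    """AIC = 2k - 2 log L for a Gaussian model whose error variance is
    estimated by maximum likelihood; k counts the mean parameters plus
    the variance."""
    n = len(residuals)
    rss = sum(float(e) ** 2 for e in residuals)
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    return 2.0 * (n_mean_params + 1) - 2.0 * log_lik

# Simulated response with a curved trend (hypothetical values).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 - 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.05, x.size)

aic = {}
for degree in (1, 2):                       # 1 = linear fit, 2 = more flexible fit
    coef = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coef, x)
    aic[degree] = gaussian_aic(residuals, degree + 1)
    print(f"degree {degree}: AIC = {aic[degree]:.2f}")
```

When the true trend is curved, the gain in fit from the flexible model outweighs the penalty for its extra parameter, so it obtains the lower AIC; when the trend is close to linear, the penalty dominates instead.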

5.3 Results

5.3.1 Data Representation

In this subsection, all plots except Figure 5.2 are provided in Appendix C. Figure C.1 displays the smoothed functions of average oyster cadmium concentrations over time, obtained by penalized spline smoothing, at each of the experimental sites. During the initial sampling time in winter 2002 and 2003, the oysters appear to exhibit high cadmium concentrations, which decrease over the summer of 2003 and subsequently increase towards winter 2003, though the patterns vary from site to site. Also, the oyster cadmium concentrations tend to be higher for oysters sampled at 7m depth than for those sampled at 1m.

The curves for the average oyster length are estimated using monotone spline smoothing, as the average oyster length should not decrease over time. As an illustration, Figure 5.2 displays the fitted curves of the lengths and growth rates for oysters sampled from site WI at 1m and 7m. The fitted curves for all the experimental sites are displayed in Figure C.2 in Appendix C, which shows that the oysters sampled at 7m tend to be smaller than those sampled at 1m, possibly because of differences in food availability between depths. Figure C.3 in Appendix C shows that the trends and variations of the growth rate are not aligned across the sites. Note that the estimated growth rate at the first measurement time at site PN at 1m exceeds 15, much higher than the other estimated growth rates, which are below about 8. This may be caused by the boundary effect of spline smoothing, which often yields unreliable estimates at the boundaries. Therefore, this extreme estimate is removed when exploring the effect of growth rate on cadmium concentration in the semi-parametric additive model $M_1$ and the standard multiple linear regression model $M_2$.

[Figure 5.2: two panels. Left: length (mm) against month. Right: growth rate against month. Both panels show curves for 1m and 7m, with month labels D2 to F4 and season labels W2 to W3.]

Figure 5.2: The left panel shows the monotone spline smoothing curves of oyster lengths for oysters sampled from site WI at 1m (solid curves) and 7m (dashed curves), respectively. The circles and triangles are the measured oyster lengths sampled at 1m and 7m depth, respectively. The right panel shows the estimated growth rates for the oysters sampled from site WI at 1m (solid curves) and 7m (dashed curves), respectively. The growth rates are estimated as the first derivatives of the monotone smoothing functions of the oyster lengths. The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).

5.3.2 Spatial Variability

To visualize the FPCA results, we examine plots of the overall mean function and the functions obtained by adding and subtracting a suitable multiple of the eigenfunctions (Silverman, 1995).

The multiple is $\kappa\sqrt{\rho_p}$, where $\kappa$ is a correcting factor that adjusts the magnitude of the effect of $\xi_p(t)$ relative to the square root of the eigenvalue $\rho_p$. The factor $\kappa$ can be set subjectively; here we choose $\kappa = 0.2$ to produce a clear visual impression of the effect of the principal components on the overall mean function. Figure 5.3 displays the overall mean curve and the effect of adding (+) and subtracting (−) a multiple of each of the first two weight functions for 1m and 7m, respectively. The first FPC, displayed in the upper left panel of Figure 5.3, accounts for about 89% of the variation of the average cadmium concentration for the oysters sampled at 1m among the thirteen experimental sites. The effect of the first eigenfunction is approximately to add or subtract a constant to the cadmium concentration throughout the time period. This indicates that about 89% of the variability between sites is accounted for by differences in average cadmium concentration. The second eigenfunction explains about 10% of the variation after accounting for the variability of the first eigenfunction, indicating that about 10% of the variation among sites reflects the change in cadmium concentration between winter 2002 and winter 2003. Similar patterns are observed for the oysters sampled at 7m, where the first two components account for about 76% and 15% of the variation, except that the second eigenfunction represents the change of cadmium concentration after August 2003.

[Figure 5.3: four panels of cadmium concentration (µg Cd/g) against month (D2 to F4). Top row, 1m: PC 1 (89.3%) and PC 2 (9.9%). Bottom row, 7m: PC 1 (75.8%) and PC 2 (15%). Each panel shows the mean curve with + and − perturbations.]

Figure 5.3: The top two panels and the bottom two panels display the mean oyster cadmium concentration curve and the effects of adding (+) and subtracting (−) a small multiple of each eigenfunction for oysters sampled at 1m and 7m depth, respectively. The percentages indicate the amount of total spatial variation accounted for by the functional principal components. The labels of the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).

One of the important features of FPCA is the ability to examine the scores of each curve on each eigenfunction (Ramsay and Dalzell, 1996; Ramsay and Silverman, 2002). The bottom two panels in Figure 5.4 display the first FPC score against the second FPC score for 1m and 7m, respectively. There appear to be some regional groupings, although there is some overlap between Barkley Sound, Nootka Sound and Quatsino Sound. One referee suggested using a hierarchical clustering tree for group classification, based on the $n \times 2$ matrix whose rows are the principal component scores associated with eigenvalues $\rho_1$ and $\rho_2$. The top two panels in Figure 5.4 show the clustering trees for the samples at 1m and 7m, respectively. For samples at 1m, three groups are obtained by cutting the tree at height 25: group 1 = TB, OB, GH and TC, which score highly on the first PC; group 2 = RB, TA, JF, BM and WI, which score moderately on the first PC; group 3 = PC, HC, PN and KI, which score low on the first PC. For samples at 7m, three groups are obtained by cutting the tree at height 25: group 1 = OB, WI, TA and TC, which score highly on the first PC; group 2 = GH, TB, BM and JF, which score moderately on the first PC; group 3 = PN, KI, PC, RB and HC, which score low on the first PC. Therefore, oysters sampled from the inland sites may have higher cadmium concentrations on average than those from the coastal sites at 1m depth. In fact, the composition of the groups is essentially determined by the first principal component coordinates, because this first axis accounts for about 90% of the entire variability. This axis can then serve as a pollution index because it sorts the observations by mean cadmium concentration. Similar grouping patterns are also found for the oysters sampled at 7m depth, except for site WI, which exhibits high cadmium concentration on average. It is interesting to note that the oysters sampled from WI are more influenced by oceanic processes than by direct anthropogenic influences. Possible sources at this one site could also be related to forestry practices within this region, e.g. forest canopy removal with resulting erosion of soils naturally high in cadmium. Cadmium-contaminated fertilizer applied during reforestation could also contribute to observed oyster cadmium concentrations at this site.

[Figure 5.4: four panels. Top row: clustering trees (vertical axis: Height) for 1m and 7m. Bottom row: PC2 against PC1 scores for 1m (PC1 89.3%, PC2 9.9%) and 7m (PC1 75.8%, PC2 15%). Site labels with region numbers: (1) PC, PN, JF, BM, WI; (2) KI; (3) TB, OB, GH, TC, RB, TA; (4) HC.]

Figure 5.4: The upper two panels show the hierarchical clustering trees for the $n \times 2$ matrix with row entries taken as the principal component scores associated with eigenvalues $\rho_1$ and $\rho_2$ for 1m and 7m depth, respectively. The bottom two panels show the first two functional principal component scores at 1m and 7m depth, respectively. The location of each site is shown by the two-letter abbreviation of its name, with the number in brackets indicating which region the site is from. The sites from Barkley Sound are shown as solid squares; the site from Nootka Sound as a solid triangle; the sites from Desolation Sound as solid circles; and the site from Quatsino Sound as a square with a triangle inside. Convex hulls are added to the clusters, with each cluster centroid shown as a red circle, for the trees cut at height 25.

5.3.3 The Semi-Parametric Additive Model

The semi-parametric additive model $M_1$ and the standard multiple linear regression model $M_2$ are compared in terms of AIC. The AIC is 856.68 for $M_1$ and 936.41 for $M_2$, a difference of 79.73. A model with a lower AIC is preferred, as it achieves a better balance of fit and parsimony, so AIC favors $M_1$ over $M_2$. In contrast to $M_2$, $M_1$ uses nonparametric functions of length and growth rate to explain the variation left after accounting for the effects of depth and region.

Figure 5.5 displays the estimated model terms with 95% confidence intervals. The top left panel shows that the oyster cadmium concentration averaged over the thirteen sites has its lowest value in summer 2003 and relatively higher values in winter 2002 and 2003. A longer series of data would be needed to test whether this pattern is consistent over years. The other two top panels show that oyster length and growth rate tend to have nonlinear relationships with oyster cadmium concentration. The average cadmium concentration decreases with oyster length, indicating that smaller oysters have higher cadmium concentrations than their longer counterparts. The top right panel shows that the partial effect of growth rate decreases with growth rate up to about 2.5 mm per month, beyond which the effect is not statistically significant. Note that the confidence interval for the partial effect of growth rate widens as growth rate increases, as there are fewer oysters with high growth rates.

Figure 5.5 also displays the comparison of oyster cadmium concentrations between 1m and 7m and among the regions, after adjusting for the smooth terms of length and growth rate in model $M_1$. On average, oyster cadmium concentrations are higher for the oysters sampled at 7m than for those from 1m, and higher for those from region DS than for those from the other regions, with regions BS and QS having significantly lower oyster cadmium concentrations than the other locations, confirming the results of the functional principal component analysis.

The results in Table 5.1 for model $M_1$ indicate that the functional effects of oyster length and growth rate on the cadmium concentration are significant. Because $M_1$ is fitted to $\log(y_{sgi})$, the parametric coefficients are differences on the log scale: the average log cadmium concentration at the greater depth of 7m is significantly higher than that at 1m, by about 0.23, and the average log cadmium concentration in oysters from region DS is significantly higher than that in the baseline region BS, by about 0.48. It is worth noting that the terms common to both models have remarkably similar coefficients, since depth and region are independent of oyster length and oyster growth rate. More precisely, the nonparametric functions of length and growth rate explain the remaining variation of cadmium concentration conditional on the depth and region variables.
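As a worked conversion, and assuming that the logarithm in $M_1$ is the natural logarithm (the text does not state the base), the parametric estimates for $M_1$ in Table 5.1 can be read as multiplicative effects on the original concentration scale:

$$e^{0.23} \approx 1.26, \qquad e^{0.48} \approx 1.62, \qquad e^{-0.16} \approx 0.85 .$$

Under that assumption, and holding the other terms fixed, cadmium concentrations are roughly 26% higher at 7m than at 1m, roughly 62% higher in region DS than in the baseline region BS, and roughly 15% lower in region QS than in BS.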

5.4 Concluding Remarks

In this chapter, we investigate the nature of the spatial and temporal distributions of oyster cadmium concentration within the regions of our study. We illustrate several statistical methodologies that provide a route for statistical analysis directed at enhancing biological insight. These methodologies can readily be applied to a wide variety of marine ecological data that are irregularly spaced and noisy, while allowing spatial clustering and potentially interesting and important factors to be identified.

To handle missing and irregularly spaced temporal measurements, we have used the penalized spline smoothing technique to estimate the mean curve of oyster cadmium concentration. We also adopt the monotone spline smoothing method to impose non-decreasing constraints on oyster length curve estimation. Oyster growth rate is characterized as the first derivative of the estimated curve for oyster length. To the best of our knowledge, few attempts have been made so far in the marine ecological literature to impose shape restrictions on the growth curve. The prime advantage of using these spline techniques is that they relax the parametric assumptions on curve shapes commonly made in the ecological and biological literature.

[Figure 5.5: five panels of partial effects with 95% confidence intervals. Top row: partial effect for month (D2 to F4), partial effect for length, partial effect for growth rate. Bottom row: partial effect for depth (1m, 7m) and partial effect for region (BS, DS, NS, QS).]

Figure 5.5: The estimated partial effects of covariates on cadmium concentration in oysters in model $M_1$. The top left panel displays the estimated partial effect over months, $s_0(t)$; the top middle panel displays the estimated partial effect of oyster length, $s_1(l_{si}(t))$; and the top right panel displays the estimated partial effect of oyster growth rate, $s_2(r_{si}(t))$. The x-axis tick labels in the top left panel represent the months, ranging from December 2002 (D2) to February 2004 (F4). The growth rate is calculated as the first derivative of the monotone smoothing curve for the oyster length. The bottom two panels show the estimated partial effects for each level of depth and region, respectively. The effects of being at depth 1m and in region BS are set as baseline because the default contrasts are used. In all the panels, the dashed lines indicate the 95% confidence intervals for the partial effects.

Table 5.1: Results for the semi-parametric additive model ($M_1$) and the standard linear regression model ($M_2$). The effects of being at depth 1m and in region BS are set as baseline because the default contrasts are used. Note that "edf" represents the effective degrees of freedom of the functional parameters.

Semi-parametric Additive Model (M1)

Parametric coefficients:
                    Estimate    Std. Error    p-value
(Intercept)           2.01        0.03        < 0.001
depth 7m              0.23        0.03        < 0.001
region DS             0.48        0.03        < 0.001
region NS             0.24        0.06        < 0.001
region QS            -0.16        0.06          0.01

Approximate significance of smooth terms:
                      edf       p-value
s0(t)                 3.59      < 0.001
s1(length)            1.56      < 0.001
s2(growth rate)       3.36        0.008

Standard Linear Regression Model (M2)

                    Estimate    Std. Error    p-value
(Intercept)           2.37        0.07        < 0.001
length               -0.003       0.01        < 0.001
growth rate          -0.017       0.001         0.10
depth 7m              0.25        0.03        < 0.001
region DS             0.47        0.03        < 0.001
region NS             0.23        0.06        < 0.001
region QS            -0.17        0.06          0.01

The functional principal component analysis technique is investigated to identify

the spatial variation of the average oyster cadmium concentration over the experimen-

tal sites, which provides a good indication of which sites are similar and might assist

future allocation of sampling efforts. There appears to be some regional grouping,

although there is some overlap between Barkley Sound, Nootka Sound and Quatsino

Sound. Possible cadmium sources (Kruzynski, 2000, 2004) differ considerably between

regions, however. For the oysters sampled from sites located in Desolation

Sound, given their close proximity to terrestrial influences, possible cadmium sources

could include cadmium contaminated phosphate fertilizers and local septic tanks.

Therefore, the spatial clustering pattern suggests an upland continental source ver-

sus a marine source in the coastal area. The oysters sampled from Webster Island,

however, are influenced more by oceanic processes than by direct anthropogenic

influences. Possible cadmium sources at this site may also be related to forestry

practices within this region, e.g., forest canopy removal with resulting erosion of soils

naturally high in cadmium. Cadmium contaminated fertilizer applied during refor-

estation may also contribute to observed oyster cadmium concentrations at this site.

Further investigation, with more sampling sites and experiments of longer duration,

is needed to test our hypothesis.

In this study, we have seen that the semi-parametric additive model provides a better

fit than the standard multiple linear regression model. More importantly, it has

the ability to examine the nonlinear relationships between the cadmium concentra-

tion and a set of covariates when there is no prior knowledge that these relationships

should be linear. The nonparametric term of the overall mean trend for oyster cad-

mium concentrations in the model implies that oysters may have greater cadmium

concentration during the colder winter months than the warmer summer months.


This may be due to phytoplankton blooms in early spring. However, a longer time

series of data is needed to verify this hypothesis properly.

Our model also reveals that oyster cadmium concentrations differ significantly

between the two depths on average, being notably higher at a depth of 7 meters. This

is possibly due to dilution: oysters at 1 meter are heavier than those at 7 meters,

so the ratio of metal to tissue is greater at 7 meters than at 1 meter,

even though the amount of metal is the same. Therefore, it may be advisable for

shellfish farmers to avoid harvesting oysters at lower depths. The model also shows

that oyster cadmium concentration decreases as oyster length increases, which may

be attributed to the fact that the oyster has grown more tissue relative to the amount of

metal accumulated.
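The dilution argument can be made concrete with a small worked example. The numbers below are hypothetical and chosen only for illustration; they are not measurements from this study. Writing $m$ for the amount of cadmium in an oyster and $w$ for its tissue weight, the concentration is $c = m / w$, so that for equal $m$ the lighter oyster has the higher concentration:

```latex
\[
  c = \frac{m}{w}, \qquad
  c_{1\mathrm{m}} = \frac{m}{w_{1\mathrm{m}}} \;<\; \frac{m}{w_{7\mathrm{m}}} = c_{7\mathrm{m}}
  \quad \text{whenever } w_{1\mathrm{m}} > w_{7\mathrm{m}} .
\]
% Hypothetical illustration: m = 10 micrograms in both oysters,
% w_{1m} = 5 g and w_{7m} = 4 g.
\[
  c_{1\mathrm{m}} = \frac{10}{5} = 2.0\ \mu\mathrm{g/g}, \qquad
  c_{7\mathrm{m}} = \frac{10}{4} = 2.5\ \mu\mathrm{g/g}.
\]
```

The same reasoning applies to the length effect: if tissue weight grows faster than the accumulated metal, the ratio $m / w$ falls as the oyster grows.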

If the interest is to investigate the influence of environmental factors (e.g. temperature,

salinity, turbidity, chlorophyll) on the rates at which organisms assimilate and

utilize energy for maintenance, growth and reproduction, we may consider models

that describe the processes involved in oyster growth, such as those constructed with

the dynamic energy budget (DEB) theory (Bourles et al., 2009). Such models are based on

ecophysiological modeling that details the physiological processes and energetics of

an organism in response to environmental fluctuations.


Chapter 6

Future Work

Future work will consider both extensions of the methods developed and new

areas of research.

6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes

In the spatial statistics literature, there has been great interest in joint modeling

of spatially and temporally correlated data, to enable simultaneous investigation of

space-time variation. Handcock and Wallis (1994) considered spatiotemporal mod-

elling of winter temperature data but their approach utilized separate spatial analyses

by year using a Gaussian random field model. Guttorp et al. (1994) modelled the

spatial covariances of hourly ozone levels allowing the parameters of the model to

vary as a function of time of day. Carroll et al. (1997) modelled ozone exposure

in Texas, U.S.A. by combining trend terms incorporating temperature, hourly and

monthly effects, with the correlation in the residuals being a non-linear function of

time and space. Brown et al. (2001) considered spatiotemporal modelling of rainfall


data, using a non-separable model in which the spatial field at a specific time was

obtained by ‘blurring’ the field (Brown et al., 2000) at the previous time point. Such

a framework can efficiently model relatively complicated dynamical processes. Waller

et al. (1997) propose spatio-temporal interaction models where the spatial effects are

nested within time, enabling an examination of the evolution of heterogeneity and

spatial patterns over time. MacNab and Dean (2001) propose a generalized additive

mixed model (GAMM), where B-spline smoothing over the temporal dimension pro-

vides a flexible means of accommodating overall time effects as well as region-specific

time effects.

For spatially and temporally correlated multivariate spatial outcomes, the iden-

tification of spatial patterns of disease risk that evolve over time may provide more

insight on true risk variation than single cross-sectional analyses at specific points in

space or time. If multiple diseases are studied simultaneously, incorporating similar

spatio-temporal trends of risk may provide a means to strengthen the evidence for

common sources of influence that reflect underlying shared risk factors. A multivari-

ate spatio-temporal analysis may also lead to improved precision for the estimation of

the underlying disease risks, by borrowing strength from other diseases as well as from

neighboring areas or time points. Richardson et al. (2006) proposed an extension of

the shared component model of Knorr-Held and Best (2001) for space-time modeling of

lung cancer incidence for males and females in Yorkshire, UK. Tzala and Best (2006)

propose an extension of the common spatial factor model of Wang and Wall (2003) for the

analysis of area-level mortality data on six diet-related cancers for Greece, over the

20-year period from 1980 to 1999.

Here, we may extend our common spatial factor model as follows. Let $y_{ijt}$ denote the count of disease for region $i$, outcome $j$ and time $t$, and let $E_{ijt}$ denote the expected disease counts, $i = 1, \cdots, n$, $j = 1, \cdots, J$ and $t = 1, \cdots, T$. The spatial-temporal common factor model can be expressed as

$$\log(\mu_{ijt}) = \alpha_{jt} + \log(E_{ijt}) + \delta_j b_i + \gamma_j g_t + \psi_{ijt}, \qquad (6.1)$$

where $\alpha_{jt}$ is the overall mean rate for the $j$th component at time $t$; $\mu_{ijt}$ is the mean disease count for region $i$, outcome $j$ and time $t$; the spatial random effect $b = (b_1, \cdots, b_n)^T \sim N(0, \Sigma_b)$, with $\Sigma_b = \sigma_b^2 (D - W)^{-1}$; and a simple AR(1) model may describe the temporal random effect, $g_t \mid g_{t-1} \sim N(\rho g_{t-1}, \sigma_g^2)$, where $\sigma_g^2$ is the temporal dispersion parameter and $\rho$ is the temporal autocorrelation with $|\rho| < 1$ (Waller et al., 1997). The interaction effect of space and time over multiple outcomes can be accommodated through $\psi = (\psi_{111}, \cdots, \psi_{nJT})^T \sim N(0, \sigma_\psi^2 I_\psi^{-1})$, where $\sigma_\psi^2$ measures the dispersion over space and time.

For zero-heavy data, the model may take the form

$$\log(\mu_{ijt}) = \alpha_{jt} + \delta_j b_i + \gamma_j g_t, \qquad \mathrm{logit}(\pi_{ijt}) = \beta_{jt} + \omega_j d_i + \kappa_j q_t, \qquad (6.2)$$

where $\alpha_{jt}$ and $\beta_{jt}$ are the overall mean rates for the $j$th component at time $t$ for the Poisson and excess zero components, respectively. The CAR specifications may be employed for $b = (b_1, \cdots, b_n)^T$ and $d = (d_1, \cdots, d_n)^T$, while auto-regressive time series models may describe $g = (g_1, \cdots, g_T)^T$ and $q = (q_1, \cdots, q_T)^T$.

This model assigns separable spatial and temporal effects; more sophisticated

forms which include spatio-temporal interaction effects may be considered.
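To make the structure of the proposed model more tangible, the following is a minimal Python sketch of its building blocks: an AR(1) temporal random effect, the log-linear predictor for the mean count, and the mean of a zero-inflated Poisson outcome. All function names, argument names and numerical values are hypothetical and chosen for illustration only. Starting the AR(1) series from its stationary distribution, and rejecting autocorrelations with absolute value of one or more, are assumptions of this sketch rather than part of the model specification.

```python
import math
import random


def simulate_ar1(n_times, rho, sigma_g, seed=0):
    """Draw g_1..g_T with g_t | g_{t-1} ~ N(rho * g_{t-1}, sigma_g^2)."""
    if abs(rho) >= 1:
        # Assumption of this sketch: only the stationary case is handled.
        raise ValueError("this sketch assumes |rho| < 1")
    rng = random.Random(seed)
    # Assumption: the first value is drawn from the stationary distribution
    # N(0, sigma_g^2 / (1 - rho^2)).
    g = [rng.gauss(0.0, sigma_g / math.sqrt(1.0 - rho ** 2))]
    for _ in range(1, n_times):
        g.append(rng.gauss(rho * g[-1], sigma_g))
    return g


def poisson_mean(alpha, expected, spatial_loading, b_i,
                 temporal_loading, g_t, interaction=0.0):
    """Mean count implied by the log-linear predictor with offset log(expected)."""
    log_mu = (alpha + math.log(expected)
              + spatial_loading * b_i
              + temporal_loading * g_t
              + interaction)
    return math.exp(log_mu)


def zero_inflated_mean(mu, logit_p):
    """Mean of a zero-inflated Poisson: (1 - p) * mu, p = excess-zero probability."""
    p = 1.0 / (1.0 + math.exp(-logit_p))
    return (1.0 - p) * mu


# Hypothetical values for one region and one outcome over five time points.
g = simulate_ar1(n_times=5, rho=0.6, sigma_g=0.3, seed=1)
means = [poisson_mean(alpha=0.1, expected=20.0, spatial_loading=1.0, b_i=0.2,
                      temporal_loading=0.5, g_t=g_t) for g_t in g]
print([round(m, 2) for m in means])
```

In a full analysis the spatial effects would be drawn from a CAR prior and all quantities would be estimated jointly, for example by MCMC; the sketch only shows how the pieces of the linear predictor combine.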

6.2 Spatial Modeling for Infectious Disease

In the last decade, there has been tremendous interest among scientists, professionals,

politicians and the general public in understanding the transmission of infectious

disease in order to control future outbreaks. This has been motivated, in part, by


severe events. For example, Severe Acute Respiratory Syndrome (SARS) rapidly

spread over 30 countries and regions during a period of less than half a year from

the beginning of 2003, leading to over 8000 infected people and over 700 deaths

(http://www.cdc.gov/ncidod/sars/). The West Nile virus, originating from Uganda,

was found in New York in 1999, and had spread to over 44 states by 2002; in

2003 and 2004, the West Nile virus infected over 12,000 people, killing 350

(http://www.cdc.gov/ncidod/dvbid/westnile/). Another example is the pandemic

H1N1 (swine) influenza, which has killed more than 18,000 people since it appeared

in April 2009. The AIDS (Acquired Immune Deficiency Syndrome) epidemic caused

by HIV (Human Immunodeficiency Virus) has killed more than 25 million people from

1981 to 2006, with approximately 260,000 children having died of AIDS in 2009. All

these diseases have exhibited spatial and temporal patterns. However, many studies

of infectious disease transmission are limited to empirical and static analyses in the

time dimension that explore relationships between a few environmental variables and

epidemiological data.

Ordinary differential equations (ODEs) are often used to describe the dynam-

ics of infectious diseases by relating a process to its rate of change. For example,

ODEs have been used in HIV studies to model the viral dynamic system for bet-

ter understanding of the pathogenesis of HIV infection and to evaluate antiretroviral

therapy (see Wu, 2005, for a comprehensive review of statistical methods in modeling

HIV viral dynamics). With the availability of residency information, investigators can

incorporate spatial information in the ODE model, as environmental and epidemiological

variables are commonly spatially correlated (e.g. socio-economic status, source

of income, access to public health services, etc.). Modeling those covariates as having

space-time varying effects may be helpful in the spatial control of disease

transmission. This topic will be the main focus of my research starting in September,

2011.
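As an illustration of how an ODE system relates an epidemic process to its rate of change, the following minimal Python sketch integrates the classical susceptible-infected-recovered (SIR) model with a fourth-order Runge-Kutta scheme. The SIR model is used here only as a familiar textbook example; it is not the HIV viral dynamic model reviewed by Wu (2005), it contains no spatial component, and the transmission rate, recovery rate and initial state are hypothetical values.

```python
def sir_rhs(state, beta, gamma):
    """Rates of change (dS/dt, dI/dt, dR/dt) for population fractions S, I, R."""
    s, i, r = state
    new_infections = beta * s * i
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def rk4_step(state, beta, gamma, dt):
    """Advance the state by one time step with classical Runge-Kutta (RK4)."""
    def shifted(k, scale):
        return tuple(x + scale * dt * kx for x, kx in zip(state, k))

    k1 = sir_rhs(state, beta, gamma)
    k2 = sir_rhs(shifted(k1, 0.5), beta, gamma)
    k3 = sir_rhs(shifted(k2, 0.5), beta, gamma)
    k4 = sir_rhs(shifted(k3, 1.0), beta, gamma)
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(initial_state, beta, gamma, dt, n_steps):
    """Return the list of states visited, including the initial state."""
    path = [tuple(initial_state)]
    for _ in range(n_steps):
        path.append(rk4_step(path[-1], beta, gamma, dt))
    return path


# Hypothetical parameters: 1% initially infected, beta / gamma = 5.
path = simulate_sir((0.99, 0.01, 0.0), beta=0.5, gamma=0.1, dt=0.1, n_steps=1000)
peak_infected = max(i for _, i, _ in path)
print(f"peak infected fraction: {peak_infected:.3f}")
```

A spatial extension along the lines discussed above might let the transmission rate depend on region-level covariates or spatially correlated random effects; that step is left open here.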


6.3 Curve Clustering

In recent years, there has been great interest in clustering spatially correlated func-

tional data. It would be helpful to develop a Bayesian hierarchical mixture model

for clustering spatially correlated functional data. Penalized splines may be used to

model the functional data, where roughness penalties are employed to regularize the

spline fit. Spatial correlation may be introduced into the model by modeling the

classification probabilities as a Markov random field.

The model-free clustering methods, such as k-means and hierarchical clustering

approaches, are easy to use and useful as exploratory tools; however, it is difficult to

make formal inference because they lack distributional structure and ignore features of the design (for example,

the ordering and spacing of the sampling times), and no spatial correlation can be

directly incorporated into those approaches.
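The point about ordering can be illustrated with a small experiment. The sketch below, written in plain Python with a hand-rolled version of Lloyd's k-means algorithm, clusters a set of simulated curves and then clusters the same curves after the time points have been shuffled; the cluster labels come out the same, which shows that k-means treats each curve as an unordered vector. The simulated data, the choice of the first k curves as initial centres, and all names are hypothetical choices made for this illustration.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two curves observed at the same times."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(curves, k, n_iter=50):
    """Plain Lloyd's algorithm; the first k curves serve as initial centres."""
    centres = [list(c) for c in curves[:k]]
    labels = [0] * len(curves)
    for _ in range(n_iter):
        # Assignment step: each curve joins its nearest centre.
        labels = [
            min(range(k), key=lambda j: squared_distance(c, centres[j]))
            for c in curves
        ]
        # Update step: each centre becomes the pointwise mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centres[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels


# Hypothetical data: four rising and four falling curves with small noise,
# interleaved so that the two initial centres come from different groups.
rng = random.Random(42)
times = [0, 1, 2, 3, 4, 5]
rising = [[0.5 * t + rng.gauss(0.0, 0.1) for t in times] for _ in range(4)]
falling = [[2.5 - 0.5 * t + rng.gauss(0.0, 0.1) for t in times] for _ in range(4)]
curves = [c for pair in zip(rising, falling) for c in pair]

labels = kmeans(curves, k=2)

# Apply the same arbitrary permutation of the time points to every curve.
perm = [3, 0, 5, 1, 4, 2]
shuffled = [[c[p] for p in perm] for c in curves]
labels_shuffled = kmeans(shuffled, k=2)

print(labels == labels_shuffled)
```

A method that models each curve as a smooth function of time, for example through a spline basis with a roughness penalty, would not be invariant to such a permutation.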

The model-based clustering approach of Yeung et al. (2001) assumes the data

arise from a mixture of multivariate normal distributions but does not acknowledge

time ordering of the data. James and Sugar (2003) and Luan and Li (2003) develop

a mixed effects model for time course gene expression data using B-splines, treating

gene expression levels as continuous functions of time. Similar approaches have been

developed independently by Bar-Joseph et al. (2002), with cubic splines, and by Gaffney

and Smyth (2003), with random effects regression mixtures. Zhou and Wakefield

(2006) developed a random walk prior to account for time ordering, missing data and

imbalance in the design, with clustering achieved via low dimensional representations

of the functional curves.

Fernandez and Green (2002) proposed a spatial mixture model for clustering;

Broet and Richardson (2006) extended this model to analyze comparative genomic

hybridization (CGH) data by introducing gene specific prior probabilities and allowing

those prior probabilities to be correlated among neighboring genes on a chromosome to


gain efficiency in identifying gene copy number changes. Future work will develop

methodology to simultaneously take the time ordering and spatial correlation into

account. We propose the use of penalized B-splines to account for time order, and a

conditional autoregressive model for classification membership, in order to incorporate

spatial correlations among the temporal curves. The model permits modeling the

variability in the curve data at different levels and provides a probabilistic framework

for clustering.


Bibliography

Agarwal, D. K., Gelfand, A. E., and Citron-Pousty, S. (2002). Zero-inflated models with application to spatial count data. Environmental and Ecological Statistics 9, 341–355.

Ainsworth, L. (2007). Models and Methods for Spatial Data: Detecting Outliers and Handling Zero-Inflated Counts. PhD thesis, Simon Fraser University.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.

Angers, J. and Biswas, A. (2003). A Bayesian analysis of zero-inflated generalized Poisson model. Computational Statistics and Data Analysis 42, 37–46.

Antoniadis, A. and Sapatinas, T. (2007). Estimation and inference in functional mixed-effects models. Computational Statistics and Data Analysis 51, 4793–4813.

Augustin, N., Lang, S., Musio, M., and von Wilpert, K. (2007). A spatial model for the needle losses of pine trees in the forests of Baden-Württemberg: An application of Bayesian structured additive regression. Applied Statistics 56(1), 29–50.

Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T., and Simon, I. (2002). A new approach to analyzing gene expression time series data. In Proceedings of the 6th Annual Int’l Conference on Research in Computational Molecular Biology. Washington, D.C., Apr 18-21, pages 39–48.

Bendell, L. I. and Feng, C. X. (2009). Spatial and temporal variations in cadmium concentrations and burdens in the Pacific oyster (Crassostrea gigas) sampled from the Pacific north-west. Marine Pollution Bulletin 58(8), 1137–1143.

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer.


Bernardinelli, L. and Montomoli, C. (1991). Empirical Bayes versus fully Bayesian analysis of geographical variation in disease risk. Statistics in Medicine 11, 983–1007.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 192–236.

Besag, J., York, J., and Mollie, A. (1991). Bayesian image-restoration, with two applications in spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics 43, 1–59.

Besse, P. and Ramsay, J. O. (1986). Principal components analysis of sampled functions. Psychometrika 51, 285–311.

Best, N. G., Richardson, S., and Thomson, A. (2005). Bayesian spatial models for disease mapping. Statistical Methods in Medical Research 14, 35–59.

Bourles, Y., Alunno-Bruscia, M., and Pouvreau, S. (2009). Modelling growth and reproduction of the Pacific oyster Crassostrea gigas: Advances in the oyster-DEB model through application to a coastal pond. Journal of Sea Research 62, 62–71.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

Broet, P. and Richardson, S. (2006). Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics 22, 911–918.

Brooks, S. P. and Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7, 434–455.

Brown, P. E., Diggle, P. J., Lord, M. E., and Young, P. C. (2001). Space-time calibration of radar rainfall data. Journal of the Royal Statistical Society, Series C: Applied Statistics 50(2), 221–241.

Brown, P. E., Karesen, K. F., Roberts, G. O., and Tonellato, S. (2000). Blur-generated non-separable space-time models. Journal of the Royal Statistical Society, Series B: Statistical Methodology 62(4), 847–860.

Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, London.


Carlin, B. P. and Louis, T. A. (2000). Bayes and empirical Bayesian methods for data analysis. London, U.K.: Chapman & Hall/CRC.

Carroll, R., Chen, R., Li, T., Newton, H., Schmiediche, H., Wang, H., and George, E. (1997). Modeling ozone exposure in Harris County, Texas. Journal of the American Statistical Association 92, 392–413.

Chiou, J. M., Muller, H. G., and Wang, J. L. (2004). Functional response models. Statistica Sinica 14, 675–693.

Clarke, K. and Green, R. (1988). Statistical design and analysis for a “biological effects” study. Marine Ecology Progress Series 46, 213–226.

Clayton, D. G. and Bernardinelli, L. (1992). Bayesian methods for mapping disease risk. Geographical and Environmental Epidemiology: Methods for Small-area Studies, Elliott P, Cuzick J, English D, Stern R (eds). Oxford University Press: Oxford, pages 205–220.

Clayton, D. G., Bernardinelli, L., and Montomoli, C. (1993). Spatial correlation in ecological analysis. International Journal of Epidemiology 22, 1193–1202.

Clayton, D. G. and Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671–682.

Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models and Applications. London: Pion.

Congdon, P. (2006). A model framework for mortality and health data classified by age, area and time. Biometrics 61, 269–278.

Crambes, C., Kneip, A., and Sarda, P. (2009). Smoothing splines estimators for functional linear regression. Annals of Statistics 37, 35–72.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377–403.

Cressie, N. (1993). Statistics for Spatial Data, revised ed. Wiley-Interscience, New York.

de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York.

Dean, C. B. and MacNab, Y. C. (2001). Modelling of rates over a hierarchical health administrative structure. The Canadian Journal of Statistics 29, 405–419.


Deville, J. C. (1974). Methodes statistiques et numeriques de l’analyse harmonique. Annales de l’INSEE 15, 7–97.

Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in Sobolev spaces. Springer-Verlag, Berlin, pp. 85–100.

Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine 00, 1–24.

Fernandez, C. and Green, P. (2002). Modeling spatially correlated data via mixtures: a Bayesian approach. Journal of the Royal Statistical Society, Series B 64, 805–826.

Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la regression dans des espaces vectoriels semi-normes. C. R. Acad. Sci. Paris Ser. I Math. 330, 403–406.

Gaffney, S. J. and Smyth, P. (2003). Curve clustering with random effects regression mixtures. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Key West, FL.

Gail, M. H., Wieand, S., and Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71(3), 431–444.

Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409.

Gelfand, A. E. (2000). ‘Gibbs Sampling’. Journal of the American Statistical Association 95, 1300–1304.

Gelfand, A. E. and Vounatsou, P. (2003). Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics 4, 11–25.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3), 515–533.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Boca Raton: Chapman & Hall/CRC.

Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica 6, 733–807.


Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Green, J. and Richardson, S. (2002). Hidden Markov models and disease mapping. Journal of the American Statistical Association 97, 1055–1070.

Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalised Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.

Guttorp, P., Meiring, W., and Sampson, P. (1994). A space-time analysis of ground-level ozone data. Environmetrics 5, 241–254.

Hall, P., Muller, H. G., and Wang, J. L. (2006). Properties of principal components methods for functional and longitudinal data analysis. Annals of Statistics 34, 1493–1517.

Handcock, M. S. and Wallis, J. R. (1994). An approach to statistical spatial-temporal modelling of meteorological fields (with discussion). Journal of the American Statistical Association 89, 368–390.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, New York.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.

Hogan, J. W. and Tchernis, R. (2004). Bayesian factor analysis for spatially correlated data, with application to summarizing area-level material deprivation from census data. Journal of the American Statistical Association 99, 314–324.

Jacobi, W. R., Geils, B. W., Taylor, J. E., and Zentz, W. R. (1993). Predicting the incidence of Comandra blister rust on lodgepole pine: site, stand, and alternate-host influences. Phytopathology 83, 630–637.

James, G. and Sugar, C. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association 17, 397–408.

Knorr-Held, L. and Best, N. G. (2001). A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society, Series A 164, 73–85.

Knorr-Held, L. and Rasser, G. (2000). Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56, 13–21.


Kruzynski, G. M. (2000). Cadmium in BC farmed oysters: a review of available data, potential sources, research needs and possible mitigation strategies. Canadian Stock Assessment Secretariat Research Document. Fisheries and Oceans Science.

Kruzynski, G. M. (2004). Cadmium in oysters and scallops: The BC experience. Toxicology Letters 148, 159–169.

Kuhnert, P. M., Martin, T. G., Mengersen, K., and Possingham, H. P. (2005). Assessing the impacts of grazing levels on bird density in woodland habitat: a Bayesian approach using expert opinion. Environmetrics 16, 717–747.

Kulldorff, M., Athas, W., Feuer, E., Miller, B., and Key, C. (1998). Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. American Journal of Public Health 88, 1377–1380.

Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in Medicine 14, 799–810.

Laird, N. M. and Louis, T. A. (1989). Empirical Bayes ranking methods. Journal of Educational Statistics 14, 29–46.

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.

Lawson, A. B., Biggeri, A. B., Boehning, D., Lesaffre, E., Viel, J. F., Clark, A., Schlattmann, P., and Divino, F. (2000). Disease mapping models: an empirical evaluation. Statistics in Medicine 19, 2217–2241.

Leroux, B. G., Lei, X., and Breslow, N. (1999). Estimation of disease rates in small areas: A new mixed model for spatial dependence. Statistical Models in Epidemiology, the Environment and Clinical Trials, pages 135–178.

Lin, R., Louis, T. A., Paddock, S. M., and Ridgeway, G. (2006). Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis 1, 915–946.

Loh, J. and Zhu, Z. (2007). Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 2, 560–584.

Louis, T. A. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association 79, 393–398.


Louis, T. A. and Shen, W. (1999). Innovations in Bayes and empirical Bayes methods: estimating parameters, populations and ranks. Statistics in Medicine 18, 2493–2505.

Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19(4), 474–482.

MacNab, Y. C. and Dean, C. B. (2000). Parametric bootstrap and penalized quasi-likelihood inference in conditional autoregressive models. Statistics in Medicine 19, 2421–2436.

MacNab, Y. C. and Dean, C. B. (2001). Autoregressive spatial smoothing and temporal spline smoothing for mapping rates. Biometrics 57, 949–956.

MacNab, Y. C., Farrel, P. J., Gustafson, P., and Wen, S. (2004). Estimation in Bayesian disease mapping. Biometrics 60, 865–873.

Malfait, N. and Ramsay, J. O. (2003). The historical functional linear model. The Canadian Journal of Statistics 31(2), 115–128.

Manton, K. G., Woodbury, M. A., Stallard, E., Riggan, W. B., Creason, J. B., and Pellom, A. C. (1989). Empirical Bayes procedures for stabilizing maps of US cancer mortality rates. Journal of the American Statistical Association 84, 637–650.

Marshall, R. J. (1991). Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics 40, 283–294.

Martin, T., Wintle, B., Rhodes, J., Kuhnert, P., Field, S., Low-Choy, S., Tyre, A., and Possingham, H. (2005). Zero tolerance ecology: improving ecological inference by modelling the source of zero observations. Ecology Letters 8, 1235–1246.

Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability 5, 439–468.

Meinguet, J. (1979). Multivariate interpolation of arbitrary points made simple. Journal of Applied Mathematical Physics 30, 292–304.

Meng, X.-L. (1994). Posterior predictive p-values. Annals of Statistics 22, 1142–1160.

Paciorek, C. J. (2007). Computational techniques for spatial logistic regression with large data sets. Computational Statistics and Data Analysis 51, 3631–3653.

R Development Core Team (2010). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.


Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science 3, 425–461.

Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B 53, 539–572.

Ramsay, J. O. and Dalzell, C. J. (1996). Functional data analyses of lip motion. Journal of the Acoustical Society of America 99, 3718–3727.

Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York.

Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis, 2nd ed. Springer, New York.

Rathbun, S. L. and Fei, S. (2006). A spatial zero-inflated Poisson regression model for oak regeneration. Environmental and Ecological Statistics 13, 409–426.

Ribeiro Jr, P. J. and Diggle, P. J. (2001). geoR: A package for geostatistical analysis. R-NEWS 1(2), ISSN 1609–3631.

Richardson, S., Abellan, J., and Best, N. (2006). Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risk in Yorkshire (UK). Statistical Methods in Medical Research 15, 385–407.

Richardson, S., Thomson, A., Best, N., and Elliot, P. (2004). Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives 112, 1016–1025.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics 12, 1151–1172.

Schallie, K. (2001). Results of the 2000 survey of cadmium in B.C. oysters. Proceedings of a Workshop on Possible Pathways of Cadmium into The Pacific oyster Crassostrea Gigas as Cultured on the Coast of British Columbia 65, 31–32.

Shen, W. and Louis, T. A. (1998). Triple-goal estimates in two-stage hierarchical models. Journal of the Royal Statistical Society, Series B 60, 455–471.

Silverman, B. W. (1995). Incorporating parametric effects into functional principal components analysis. Journal of the Royal Statistical Society, Series B 57, 673–689.

Silverman, B. W. (1996). Smoothed functional principal components analysis by choice of norm. Annals of Statistics 24, 1–24.


Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS User Manual Version 1.4. Medical Research Council Biostatistics Unit: Cambridge, UK.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64, 583–640.

Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software 12(3), 1–16.

Sun, D., Tsutakawa, R. K., and Speckman, P. L. (1999). Posterior distribution of hierarchical models using CAR(1) distributions. Biometrika 86, 341–350.

Tzala, E. and Best, N. (2006). Bayesian latent variable modelling of multivariate spatio-temporal variation in cancer mortality. Biometrics 61, 269–278.

Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–333.

Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics 13, 1378–1402.

Waller, L. A., Carlin, B. P., Xia, H., and Gelfand, A. E. (1997). Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association 92, 607–617.

Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.

Wang, F. and Wall, M. (2003). Generalized common spatial factor model. Biostatistics 4, 569–582.

Welsh, A. H., Cunningham, R. B., Donnelly, C. F., and Lindenmayer, D. B. (1996). Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling 88, 297–308.

Wood, S. N. (2000). Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society, Series B 62, 413–428.

Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society, Series B 65, 95–114.


Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association 99, 673–686.

Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman and Hall, London.

Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society, Series B 70, 495–518.

Wright, D. L., Stern, H. S., and Cressie, N. (2003). Loss functions for estimation of extrema with an application to disease mapping. The Canadian Journal of Statistics 31, 251–266.

Wu, H. (2005). Statistical methods for HIV dynamic studies in AIDS clinical trials. Statistical Methods in Medical Research 14, 171–192.

Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001). Validating clustering for gene expression data. Bioinformatics 17, 309–318.

Zhou, C. and Wakefield, J. (2006). A Bayesian mixture model for partitioning gene expression data. Biometrics 62(2), 515–525.

Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247–259.


Appendix A

Appendix for Chapter 3

Overdispersion of the Multiple Spatial Outcomes in Relationship to Random Effects.

Let $u_j = \delta_j b + h_j$ denote the random effect vector for the $j$th outcome, where $u_j = (u_{1j}, \cdots, u_{nj})^T$, with $i = 1, \cdots, n$ indexing regions and $j = 1, \cdots, J$ indexing outcomes. Note that in the simulation study, we let $\delta_j = 1$, $j = 1, \cdots, J$, when generating simulated data. The common spatial factor model written as a generalized linear mixed model is $\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + z_{ij}^t u_j$, where $z_{ij}$ is the $i$th column of the random effects design matrix $z_j$. The variance of the response can be partitioned as

\[
\begin{aligned}
\mathrm{var}(Y_{ij}) &= \mathrm{var}\left[E(Y_{ij} \mid u_j)\right] + E\left[\mathrm{var}(Y_{ij} \mid u_j)\right] \\
&= \mathrm{var}\left[\exp\left\{\alpha_j + \log(E_{ij}) + z_{ij}^t u_j\right\}\right] + E\left[\exp\left\{\alpha_j + \log(E_{ij}) + z_{ij}^t u_j\right\}\right].
\end{aligned}
\tag{A.1}
\]

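The first term on the right-hand side of (A.1) can be written as:

\[
\mathrm{var}\left[\exp\left\{\alpha_j + \log(E_{ij}) + z_{ij}^t u_j\right\}\right] = E\left[\exp\left\{2\left(\alpha_j + \log(E_{ij}) + z_{ij}^t u_j\right)\right\}\right] - \left[E\left(\exp\left\{\alpha_j + \log(E_{ij}) + z_{ij}^t u_j\right\}\right)\right]^2.
\tag{A.2}
\]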

Consider the moment generating function of $u_j$ evaluated at $z_{ij}$, $M_{u_j}(z_{ij}) = E[\exp(z_{ij}^t u_j)]$. The variance of the response is:

\[
\mathrm{var}(Y_{ij}) = \exp\left[2\left\{\alpha_j + \log(E_{ij})\right\}\right]\left[M_{u_j}(2 z_{ij}) - \left\{M_{u_j}(z_{ij})\right\}^2\right] + \exp\left[\left\{\alpha_j + \log(E_{ij})\right\}\right] M_{u_j}(z_{ij}).
\tag{A.3}
\]

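Note that $u_j \sim \mathrm{MVN}(0, \Sigma(\theta_j))$, where $\theta_j = (\sigma_b^2, \sigma_{h_j}^2)$ and $\Sigma(\theta_j) = \sigma_b^2 (D - W)^{-1} + \sigma_{h_j}^2 I$. The moment generating function of a zero-mean multivariate normal vector is $M_{u_j}(t) = \exp\{0.5\, t^t \Sigma(\theta_j) t\}$, so that $M_{u_j}(z_{ij}) = \exp\{0.5\, z_{ij}^t \Sigma(\theta_j) z_{ij}\}$ and $M_{u_j}(2 z_{ij}) = \exp\{2\, z_{ij}^t \Sigma(\theta_j) z_{ij}\}$. Therefore, according to the moment generating function property for the multivariate normal distribution, we have

\[
\mathrm{var}(Y_{ij}) = \left(\exp\left[\left\{\alpha_j + \log(E_{ij})\right\}\right] E\left[\exp(z_{ij}^t u_j)\right]\right) \times \left(\exp\left[\left\{\alpha_j + \log(E_{ij})\right\}\right]\left[\exp\left\{1.5\, z_{ij}^t \Sigma(\theta_j) z_{ij}\right\} - \exp\left\{0.5\, z_{ij}^t \Sigma(\theta_j) z_{ij}\right\}\right] + 1\right).
\tag{A.4}
\]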

Note that the first factor, $\exp[\{\alpha_j + \log(E_{ij})\}]\, E[\exp(z_{ij}^t u_j)]$, equals $E(Y_{ij})$, and the second factor is greater than one. Therefore, the variance of the response depends on the values of the variance components $\sigma_b^2$ and $\sigma_{h_j}^2$, and the variance of the relative risks is of order $O(\log(E)/E)$.
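The algebra leading from (A.3) to (A.4) can be checked numerically. The following is a minimal sketch, not the code used in the thesis: the covariance matrix `SIGMA`, the intercept `ALPHA`, the expected count `EXPECTED` and the unit vector `Z` (which picks out one region's random effect) are all hypothetical values chosen only for illustration, and the function names are made up for this example.

```python
import math


def quad_form(z, sigma):
    """Return z^T Sigma z for a vector z and a square matrix Sigma (nested lists)."""
    n = len(z)
    return sum(z[r] * sigma[r][c] * z[c] for r in range(n) for c in range(n))


def mvn_mgf(t, sigma):
    """MGF of a zero-mean multivariate normal vector: M(t) = exp(0.5 * t^T Sigma t)."""
    return math.exp(0.5 * quad_form(t, sigma))


def variance_a3(alpha, expected, z, sigma):
    """Variance of the response written in terms of the MGF, as in (A.3)."""
    a = alpha + math.log(expected)
    m1 = mvn_mgf(z, sigma)
    m2 = mvn_mgf([2.0 * v for v in z], sigma)
    return math.exp(2.0 * a) * (m2 - m1 ** 2) + math.exp(a) * m1


def variance_a4(alpha, expected, z, sigma):
    """Factored form as in (A.4): returns (mean, inflation factor, variance)."""
    a = alpha + math.log(expected)
    q = quad_form(z, sigma)
    mean = math.exp(a) * math.exp(0.5 * q)
    inflation = math.exp(a) * (math.exp(1.5 * q) - math.exp(0.5 * q)) + 1.0
    return mean, inflation, mean * inflation


# Hypothetical inputs (assumptions for illustration, not values from the thesis).
SIGMA = [[0.6, 0.2, 0.1],
         [0.2, 0.5, 0.2],
         [0.1, 0.2, 0.7]]
Z = [0.0, 1.0, 0.0]          # unit vector selecting the second region
ALPHA, EXPECTED = 0.1, 20.0  # intercept and expected count

v3 = variance_a3(ALPHA, EXPECTED, Z, SIGMA)
mean, inflation, v4 = variance_a4(ALPHA, EXPECTED, Z, SIGMA)
print(f"(A.3): {v3:.4f}  (A.4): {v4:.4f}  mean: {mean:.4f}  inflation: {inflation:.4f}")
```

Under these assumed inputs the two forms agree up to floating-point error, and the inflation factor exceeds one, so the variance exceeds the mean. Larger diagonal entries of the covariance matrix make the inflation factor grow.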


Appendix B

Appendix for Chapter 4


[Figure: map of the plot divided into cells; x-axis: East (20 to 100); y-axis: North (100 to 250). Cell labels in extraction order, given as infected trees (alternate host count): 83 (638), 130 (619), 135 (1017), 80 (215), 129 (604), 109 (842), 188 (1389), 133 (899), 115 (940), 141 (1031), 76 (232), 184 (1504), 128 (387), 119 (521), 134 (781), 31 (0), 56 (22), 109 (234), 112 (369), 89 (403).]

Figure B.1: Summary statistics relating to the number of trees infected and the sum of counts of the alternate host plant per cell (in parentheses) over various subsections of the plot.


[Figure: two semivariogram panels; x-axis: distance (0 to 50); y-axis: semivariance (0.0 to 1.2 in the left panel, 0.0 to 1.0 in the right panel).]

Figure B.2: The panel on the left displays the semivariogram of the observed infection status for the trees. The panel on the right displays the semivariogram of the residuals from fitting M1.


[Figure: panels of estimated $\beta_Z$ by fitted model (M1 to M4) within scenarios S1, S2 and S3; y-axis ranges across panel rows: −3.0 to −1.0, −1.6 to −0.4, −0.4 to 0.4, 0.4 to 1.6 and 1.4 to 2.2.]

Figure B.3: Estimated treatment effects ($\beta_Z$) from fitting M1, M2, M3 and M4 when the true model is M2 (first column) or M3 (second column). Simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively. The treatment group consists of trees from resistant seedlots. The horizontal dashed line represents the true treatment effect.


[Figure: panels of estimated $\beta_Z$ by fitted model (M1 to M4) within scenarios S1, S2 and S3; y-axis ranges across panel rows: −3.0 to −1.0, −1.6 to −0.4, −0.4 to 0.4, 0.4 to 1.6 and 1.4 to 2.2.]

Figure B.4: Estimated treatment effects ($\beta_Z$) from fitting M1, M2, M3 and M4 when the true model is M2 (first column) or M3 (second column). Simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively. The treatment group consists of a randomly selected half of the trees. The horizontal dashed line represents the true treatment effect. The scale on the y-axis is the same as that for the corresponding sub-plot in Figure B.3.


Table B.1: SBIAS, SSE and SRMSE for the predicted site-specific treatment effect over the whole plot when the treatment group consists of a randomly selected half of the trees. Rows in which the fitted model is the true model (shaded in the original table) provide SBIAS, SSE and SRMSE based on the correct model. The other rows provide the difference between the SBIAS, SSE and SRMSE based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$, and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

| $\beta^*_Z$ | True | Fit | S1 SBIAS | S1 SSE | S1 SRMSE | S2 SBIAS | S2 SSE | S2 SRMSE | S3 SBIAS | S3 SSE | S3 SRMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -2 | M2 | M1 | 2.93 | 5.57 | 6.43 | 3.04 | 5.52 | 6.38 | 3.08 | 5.16 | 6.03 |
| -2 | M2 | M2 | 61.57 | 161.97 | 177.50 | 61.92 | 160.69 | 176.61 | 62.27 | 160.06 | 176.29 |
| -2 | M2 | M3 | 121.70 | 4.48 | 90.28 | 172.20 | -1.22 | 128.09 | 230.61 | -8.73 | 175.82 |
| -2 | M2 | M4 | 200.91 | -2.08 | 151.30 | 266.29 | -4.96 | 208.04 | 318.97 | -14.27 | 251.74 |
| -2 | M3 | M1 | -0.08 | 14.24 | 13.23 | 0.77 | 14.95 | 14.31 | 0.31 | 13.65 | 12.93 |
| -2 | M3 | M2 | 18.86 | 12.22 | 22.72 | 42.00 | 16.02 | 39.89 | 66.04 | 19.14 | 57.66 |
| -2 | M3 | M3 | 69.00 | 166.27 | 184.93 | 66.89 | 165.50 | 183.37 | 64.51 | 165.12 | 182.04 |
| -2 | M3 | M4 | 71.05 | -6.18 | 38.82 | 130.29 | -3.52 | 86.19 | 190.31 | -0.87 | 137.38 |
| -1 | M2 | M1 | 1.18 | 2.59 | 2.94 | 1.04 | 2.73 | 3.03 | 1.46 | 2.68 | 3.11 |
| -1 | M2 | M2 | 31.86 | 107.50 | 114.18 | 33.23 | 108.55 | 115.66 | 33.29 | 107.90 | 115.08 |
| -1 | M2 | M3 | 61.06 | 2.05 | 39.77 | 98.49 | -1.89 | 66.83 | 138.57 | -4.21 | 100.35 |
| -1 | M2 | M4 | 92.55 | -2.90 | 59.59 | 126.59 | -7.09 | 85.83 | 162.13 | -11.33 | 115.16 |
| -1 | M3 | M1 | 0.23 | 5.26 | 5.10 | 0.43 | 4.95 | 4.86 | 0.28 | 4.16 | 4.05 |
| -1 | M3 | M2 | 12.37 | 3.22 | 9.29 | 25.62 | 4.09 | 17.05 | 39.05 | 4.77 | 25.27 |
| -1 | M3 | M3 | 34.15 | 109.89 | 117.26 | 32.63 | 109.50 | 116.43 | 31.49 | 109.91 | 116.53 |
| -1 | M3 | M4 | 41.77 | -3.51 | 19.31 | 74.32 | -2.83 | 42.69 | 105.71 | -1.87 | 67.68 |
| 0 | M2 | M1 | -0.79 | -0.09 | -0.10 | -0.61 | -0.06 | -0.07 | 1.26 | -0.03 | -0.01 |
| 0 | M2 | M2 | 1.02 | 76.94 | 76.87 | 0.94 | 74.11 | 74.05 | 0.29 | 71.87 | 71.80 |
| 0 | M2 | M3 | 5.57 | -0.14 | 0.14 | 8.53 | -0.09 | 0.51 | 13.15 | -0.18 | 1.07 |
| 0 | M2 | M4 | 12.97 | -0.13 | 1.13 | 18.83 | -0.06 | 2.53 | 25.82 | -0.18 | 4.43 |
| 0 | M3 | M1 | -0.13 | -0.29 | -0.29 | -0.03 | -0.07 | -0.07 | -0.10 | -0.11 | -0.11 |
| 0 | M3 | M2 | 1.97 | -0.29 | -0.26 | -1.04 | -0.12 | -0.14 | 1.64 | -0.07 | -0.02 |
| 0 | M3 | M3 | 0.40 | 79.94 | 79.86 | 2.25 | 78.71 | 78.66 | 1.67 | 81.53 | 81.46 |
| 0 | M3 | M4 | 3.15 | -0.03 | 0.05 | 5.80 | -0.02 | 0.36 | 8.25 | -0.02 | 0.57 |
| 1 | M2 | M1 | 0.82 | 1.91 | 2.18 | 0.95 | 2.10 | 2.38 | 1.08 | 2.36 | 2.67 |
| 1 | M2 | M2 | 19.62 | 97.57 | 100.66 | 22.17 | 102.55 | 106.22 | 23.65 | 103.47 | 107.46 |
| 1 | M2 | M3 | 57.41 | -1.54 | 33.94 | 101.54 | -2.37 | 68.37 | 141.48 | -2.52 | 102.43 |
| 1 | M2 | M4 | 68.59 | -8.42 | 37.07 | 120.74 | -11.62 | 78.32 | 169.39 | -14.36 | 119.94 |
| 1 | M3 | M1 | 0.59 | 1.68 | 1.84 | 0.41 | 2.11 | 2.22 | 0.28 | 2.38 | 2.45 |
| 1 | M3 | M2 | 7.35 | -0.18 | 2.09 | 14.27 | 0.22 | 5.29 | 21.92 | 0.59 | 9.19 |
| 1 | M3 | M3 | 16.02 | 96.64 | 98.98 | 16.50 | 95.82 | 98.27 | 17.34 | 98.43 | 101.09 |
| 1 | M3 | M4 | 19.80 | -3.13 | 4.00 | 35.19 | -4.12 | 11.52 | 51.12 | -5.25 | 20.40 |
| 2 | M2 | M1 | 1.74 | 2.40 | 2.81 | 1.97 | 2.82 | 3.33 | 2.04 | 3.51 | 4.08 |
| 2 | M2 | M2 | 25.70 | 115.27 | 119.48 | 28.57 | 121.47 | 126.26 | 30.94 | 124.99 | 130.34 |
| 2 | M2 | M3 | 111.75 | -6.07 | 74.33 | 182.65 | -3.68 | 136.16 | 243.27 | -0.31 | 191.70 |
| 2 | M2 | M4 | 171.02 | -16.54 | 115.60 | 275.99 | -18.66 | 209.13 | 365.20 | -19.56 | 292.31 |
| 2 | M3 | M1 | 0.57 | 7.50 | 7.56 | 0.60 | 8.62 | 8.63 | 0.38 | 9.54 | 9.41 |
| 2 | M3 | M2 | 8.24 | 6.53 | 10.14 | 17.98 | 8.37 | 16.62 | 28.72 | 10.87 | 25.07 |
| 2 | M3 | M3 | 22.69 | 99.44 | 103.23 | 25.12 | 104.98 | 109.27 | 27.22 | 109.49 | 114.37 |
| 2 | M3 | M4 | 30.41 | -4.79 | 10.96 | 56.10 | -6.92 | 26.68 | 83.59 | -9.43 | 46.00 |


Table B.2: SBIAS, SSE and SRMSE for the predicted site-specific infection probability over the whole plot when the treatment group consists of a randomly selected half of the trees. Rows in which the fitted model is the true model (shaded in the original table) provide SBIAS, SSE and SRMSE based on the correct model. The other rows provide the difference between the SBIAS, SSE and SRMSE based on the misspecified models and the corresponding quantity for the true model. The true treatment effect is $\beta^*_Z = -2, -1, 0, 1, 2$, and simulation scenarios S1, S2 and S3 correspond to inflation factors of 1, 1.5 and 2, respectively.

| $\beta^*_Z$ | True | Fit | S1 SBIAS | S1 SSE | S1 SRMSE | S2 SBIAS | S2 SSE | S2 SRMSE | S3 SBIAS | S3 SSE | S3 SRMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -2 | M2 | M1 | 2.44 | 5.56 | 6.35 | 2.72 | 4.66 | 5.40 | 2.65 | 4.27 | 5.02 |
| -2 | M2 | M2 | 65.70 | 191.98 | 207.01 | 55.65 | 148.11 | 162.18 | 54.79 | 144.74 | 158.73 |
| -2 | M2 | M3 | 216.97 | -3.40 | 162.96 | 243.34 | 0.03 | 196.50 | 331.61 | 4.20 | 279.90 |
| -2 | M2 | M4 | 355.24 | -12.21 | 277.21 | 398.15 | -1.23 | 335.67 | 549.35 | 4.28 | 482.82 |
| -2 | M3 | M1 | -0.16 | 15.05 | 14.00 | 0.62 | 16.39 | 15.65 | 0.25 | 16.05 | 15.18 |
| -2 | M3 | M2 | 16.85 | 14.72 | 23.90 | 37.75 | 19.78 | 40.93 | 59.55 | 25.00 | 59.45 |
| -2 | M3 | M3 | 58.31 | 139.35 | 155.28 | 57.18 | 140.52 | 155.92 | 55.98 | 141.26 | 156.10 |
| -2 | M3 | M4 | 65.09 | -5.14 | 36.79 | 119.53 | -2.87 | 80.69 | 175.95 | -0.50 | 129.56 |
| -1 | M2 | M1 | 2.66 | 4.69 | 5.40 | 2.36 | 4.83 | 5.42 | 2.84 | 4.70 | 5.50 |
| -1 | M2 | M2 | 59.48 | 171.48 | 185.51 | 59.65 | 169.23 | 183.46 | 57.91 | 162.91 | 176.83 |
| -1 | M2 | M3 | 188.84 | -3.51 | 140.99 | 295.43 | -1.97 | 236.81 | 382.25 | 3.49 | 320.37 |
| -1 | M2 | M4 | 307.18 | -9.53 | 239.34 | 483.71 | -3.64 | 408.03 | 641.55 | 3.14 | 563.54 |
| -1 | M3 | M1 | 0.08 | 17.44 | 16.52 | 0.49 | 17.99 | 17.24 | 0.12 | 17.53 | 16.63 |
| -1 | M3 | M2 | 25.54 | 16.42 | 29.90 | 52.48 | 22.42 | 51.44 | 79.77 | 28.65 | 74.89 |
| -1 | M3 | M3 | 62.29 | 163.28 | 179.16 | 60.93 | 163.70 | 178.99 | 60.29 | 163.46 | 178.58 |
| -1 | M3 | M4 | 88.16 | -6.53 | 49.81 | 157.13 | -4.31 | 105.94 | 226.99 | -1.09 | 168.30 |
| 0 | M2 | M1 | 2.48 | 6.46 | 7.14 | 2.51 | 6.38 | 7.01 | 2.55 | 5.57 | 6.26 |
| 0 | M2 | M2 | 66.17 | 194.07 | 209.08 | 64.18 | 189.53 | 204.12 | 62.60 | 182.54 | 196.95 |
| 0 | M2 | M3 | 217.14 | -2.98 | 163.14 | 323.89 | -0.80 | 259.04 | 403.73 | 2.95 | 335.42 |
| 0 | M2 | M4 | 354.06 | -11.62 | 275.97 | 534.48 | -4.84 | 448.51 | 686.45 | 1.00 | 598.23 |
| 0 | M3 | M1 | -0.19 | 19.53 | 18.48 | 0.65 | 19.91 | 19.20 | -0.04 | 19.06 | 18.03 |
| 0 | M3 | M2 | 33.48 | 18.55 | 36.47 | 67.98 | 25.60 | 63.81 | 101.13 | 31.66 | 91.22 |
| 0 | M3 | M3 | 69.52 | 188.35 | 205.57 | 68.27 | 187.90 | 204.72 | 67.39 | 188.02 | 204.55 |
| 0 | M3 | M4 | 114.46 | -8.46 | 65.88 | 199.38 | -5.88 | 136.63 | 284.38 | -2.52 | 212.98 |
| 1 | M2 | M1 | 2.67 | 6.31 | 7.32 | 2.64 | 6.54 | 7.45 | 2.61 | 6.31 | 7.15 |
| 1 | M2 | M2 | 68.38 | 202.29 | 217.64 | 67.43 | 197.61 | 212.88 | 65.51 | 191.82 | 206.75 |
| 1 | M2 | M3 | 208.63 | -3.13 | 155.48 | 306.07 | -0.99 | 242.59 | 381.38 | 1.98 | 312.82 |
| 1 | M2 | M4 | 340.86 | -13.32 | 261.54 | 505.68 | -6.92 | 419.21 | 646.47 | -2.19 | 555.98 |
| 1 | M3 | M1 | 0.96 | 20.31 | 19.67 | 0.30 | 21.07 | 20.14 | -0.06 | 20.28 | 19.22 |
| 1 | M3 | M2 | 36.47 | 19.86 | 39.42 | 72.34 | 26.15 | 66.55 | 107.70 | 31.19 | 94.57 |
| 1 | M3 | M3 | 73.23 | 196.24 | 214.59 | 70.94 | 197.84 | 215.22 | 69.54 | 197.80 | 214.69 |
| 1 | M3 | M4 | 122.22 | -9.22 | 70.80 | 211.30 | -6.47 | 144.60 | 299.97 | -3.47 | 223.81 |
| 2 | M2 | M1 | 2.77 | 5.78 | 6.72 | 2.93 | 5.77 | 6.71 | 2.62 | 6.19 | 7.01 |
| 2 | M2 | M2 | 66.17 | 187.00 | 202.62 | 65.20 | 184.64 | 200.07 | 64.12 | 181.81 | 196.98 |
| 2 | M2 | M3 | 170.59 | -2.69 | 126.42 | 254.82 | -0.72 | 200.40 | 322.11 | 1.87 | 262.17 |
| 2 | M2 | M4 | 278.29 | -12.61 | 210.23 | 418.11 | -7.33 | 341.65 | 540.15 | -2.75 | 458.85 |
| 2 | M3 | M1 | 0.66 | 18.06 | 17.17 | 0.77 | 19.01 | 18.15 | 0.17 | 18.08 | 17.07 |
| 2 | M3 | M2 | 27.19 | 17.69 | 32.48 | 55.70 | 22.64 | 54.36 | 85.47 | 27.71 | 78.20 |
| 2 | M3 | M3 | 70.75 | 179.89 | 198.40 | 69.84 | 181.91 | 200.06 | 69.24 | 181.95 | 199.96 |
| 2 | M3 | M4 | 98.67 | -7.94 | 56.66 | 173.98 | -5.97 | 117.96 | 249.48 | -3.57 | 184.60 |


Appendix C

Appendix for Chapter 5


[Figure: thirteen panels, one per site, titled (4)HC, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (2)KI, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI; x-axis: month (D2 to F4), with season labels W2, S3, S3, F3, W3 above the axis; y-axis: cadmium concentration (µg Cd/g), 0 to 40; legend: 1m, 7m.]

Figure C.1: The solid curves and dashed curves correspond to penalized spline smoothing curves of cadmium concentrations for oysters from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The circles and triangles are the measured cadmium concentrations for the oysters sampled at 1m and 7m depth, respectively. The horizontal dashed lines represent the European Community guideline (6.3 µg Cd/g on dry weight basis) and Asian guideline (13.5 µg Cd/g on dry weight basis). The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).


[Figure: thirteen panels, one per site, titled (4)HC, (2)KI, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI; x-axis: month (D2 to F4), with season labels W2, S3, S3, F3, W3 above the axis; y-axis: length (mm), 0 to 150; legend: 1m, 7m.]

Figure C.2: The solid curves and dashed curves correspond to monotone spline smoothing curves of oyster lengths for oysters from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The circles and triangles are the measured oyster lengths sampled at 1m and 7m depth, respectively. The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).


[Figure: thirteen panels, one per site, titled (4)HC, (2)KI, (3)OB, (3)RB, (3)TA, (3)GH, (3)TC, (3)TB, (1)PN, (1)BM, (1)JF, (1)PC and (1)WI; x-axis: month (D2 to F4), with season labels W2, S3, S3, F3, W3 above the axis; y-axis: growth rate, 0 to 15; legend: 1m, 7m.]

Figure C.3: The solid curves and dashed curves correspond to the estimated growth rates for the oysters sampled from 1m and 7m depth, respectively, at thirteen experimental sites from north to south row-wise. The growth rates are estimated as the first derivatives of the monotone smoothing functions of the oyster lengths. The labels above the x-axis represent the seasons, ranging from winter 2002 (W2) to winter 2003 (W3). The labels below the x-axis represent the months, ranging from December 2002 (D2) to February 2004 (F4).
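The idea of reading a growth rate off a fitted monotone length curve can be illustrated with a small sketch. This is not the monotone spline fit used for the figures: `fitted_length` is a hypothetical increasing logistic curve standing in for a fitted monotone smoothing function, its constants are invented for illustration, and the derivative is approximated by central differences rather than computed from spline basis functions.

```python
import math


def fitted_length(t):
    """Hypothetical monotone length curve (mm) at month t; a stand-in for a monotone spline fit."""
    return 40.0 + 100.0 / (1.0 + math.exp(-0.5 * (t - 6.0)))


def growth_rate(f, t, h=1e-4):
    """Central-difference approximation to the first derivative of f at t."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


# Fifteen monthly time points, matching the D2 to F4 span of the x-axis.
months = list(range(15))
lengths = [fitted_length(t) for t in months]
rates = [growth_rate(fitted_length, t) for t in months]

for t, length, rate in zip(months, lengths, rates):
    print(f"month {t:2d}: length {length:7.2f} mm, growth rate {rate:6.3f} mm/month")
```

Because the assumed curve is increasing, every estimated growth rate is positive. Pairing `lengths` with `rates` gives the kind of growth rate versus length relationship that a scatter plot of the two would display.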


[Figure: scatter plot; x-axis: length (50 to 150); y-axis: growth rate (0 to 7).]

Figure C.4: The scatter plot of oyster growth rates versus oyster lengths.