Identification of Local Clusters for Count Data: A
Model-Based Moran’s I Test
Tonglin Zhang ∗and Ge Lin †
Purdue University and West Virginia University
February 14, 2007
∗Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066. Email: [email protected]
†Department of Geology and Geography, West Virginia University, Morgantown, WV 26506-6800. Email:
Abstract
We set out IDR as a loglinear model-based Moran’s I test for Poisson count data that resembles
the Moran’s I residual test for Gaussian data. We evaluate its type I and type II error probabilities
via simulations, and demonstrate its utility via a case study. When population sizes are hetero-
geneous, IDR is effective in detecting local clusters by local association terms with an acceptable
type I error probability. When used in conjunction with local spatial association terms in loglinear
models, IDR can also indicate the existence of a first-order global cluster that can hardly be removed
by local spatial association terms. In this situation, IDR should not be directly applied for local
cluster detection. In the case study of St. Louis homicides, we bridge loglinear model methods
for parameter estimation to exploratory data analysis, so that a uniform association term can be
defined with spatially varied contributions among spatial neighbors. The method makes use of
exploratory tools such as Moran’s I scatter plots and residual plots to evaluate the magnitude of
deviance residuals, and it is effective in modeling the shape, the elevation and the magnitude of a
local cluster in the model-based test.
Keywords: Cluster and clustering; deviance residual; Moran’s I; permutation test; spatial
autocorrelation; type I error probability.
1 Introduction
Count and cross-tabulated frequency data are common in geographical analyses. Many spatial
phenomena, such as births, deaths, crimes and species richness, can be counted by a spatial unit,
either as a raw count or as a ratio over some exposure. Prior to the 1970s, count data were often
converted to rates for statistical analyses because of limited computational power in categorical
statistics. In the late 1970s, computationally expensive methods, such as loglinear models for
cross-tabulated data, were introduced into the social sciences and geography [15, 38], and they were
quickly included in many statistical packages. In spatial statistical analyses, however, counts
are still frequently converted to rates so that a testing method for continuous variables, such as
Moran’s I [26, 27] or Getis-Ord’s G [20], can be directly applied. However, when population sizes
are heterogeneous across spatial units, converting counts to rates often leads to variance inflation
and biased type I error probabilities. Some propose to incorporate a population weight into the test
statistics [29, 35], but the heterogeneity problem still remains [5]. Since a loglinear model can take
account of population sizes in its likelihood ratio test, it is natural to extend the spatial statistics
under the loglinear model framework.
In this article, we set out a loglinear model-based test statistic for Poisson count data that
corresponds to Moran’s I for continuous data. We chose Moran’s I because of its popularity
and its ease of implementation. There have been hundreds of applications and extensions of
the statistic since Moran’s I was first published in 1948 [26]. Currently, most researchers focus
on estimation methods [13, 22, 37], spatial distribution properties [7, 19], and adjustment of
heterogeneous population sizes for count data [5, 29, 35]. A concurrent theme is focused on local
spatial statistics or indicators [16]. It is pointed out that the extent of spatial correlation may
vary locally due to omitted, misspecified, and deficient measurements for a stationary spatial
relationship [17]. A significant Moran’s I test may be caused by either a global trend of spatial
autocorrelation, or a few local spatial associations. Attempts have been made to partition space
for spatially varied parameterization [10], and to decompose a global autocorrelation measure, such
as Moran’s I, into a local indicator of spatial association (LISA) [3]. With auxiliary information,
LISA is able to locate spatial associations, such as hot spots and cool spots [33]. Our model-based
test should complement LISA, because it is not only able to explicitly indicate high-value and
low-value clusters, but it is also able to account for heterogeneous population sizes.
As its name suggests, a model-based test depends on a particular statistical model. In a linear
regression model, a dependent variable is often associated with a set of explanatory variables. After
a final model is derived, a residual Moran’s I test for spatial autocorrelation can also be performed
to detect spatial clustering for unexplained variations ([14], p. 197). When a regression model
does not include any explanatory variables, the residual Moran’s I test is identical to the Moran’s
I test of the dependent variable. If we can bridge this test for spatial autocorrelation to a loglinear
model, it would likely narrow the apparent knowledge gap between Moran’s I for continuous data
and other autocorrelation tests for count data ([38], p. 307).
There are some recent advances in incorporating count data in spatial statistics. Griffith [21]
introduced a spatial filter specification of the auto-regressive logistic model that is able to remove
the global clustering effect. The model is likely to provide unbiased parameter estimates for auto-
regressive logistic regression, but due to its focus on model correction, the method may not be
able to detect a local association. Several test statistics, such as $I_{pop}$ and $I^{*}_{pop}$ [29] and modified
I [35] by Empirical Bayes Index (EBI), CG [34] or the spatial $\chi^2$ test [31], are able to account for
heterogeneous population sizes and to detect a local cluster, but none of them can account for
ecological or geographic covariates. Lin’s [23] spatial logit association model is able to include
ecological covariates and spatial associations, but the significance of a logit association term is
not a direct measure of spatial clustering. Although Apanasovich and his coauthors [6] used
the Pearson residuals to test for spatial autocorrelation in their autoregressive model, the test
was not formally specified and evaluated for wider applications. In this paper, we demonstrate
that Moran’s I based on loglinear residuals can be used not only as a global indicator of spatial
autocorrelation, but also as a tool for modeling the location, the shape, the elevation and the size
of a local spatial cluster.
In the remaining sections of the paper, we briefly review the permutation test of Moran’s I by
using regression residuals and reformulate it in the context of Poisson data by using the deviance
residuals of a loglinear model. We then use simulations to evaluate its statistical properties under
the null and alternative hypotheses of spatial independence. In section 4, we apply the deviance
residual Moran’s I test to the St. Louis crime data. Finally, we provide some concluding remarks.
2 A Model-Based Moran’s I
Consider a study area that has m regions indexed by i. Let Xi be the variable of interest in region
i. Moran’s I [26, 27] is expressed as:
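$$I = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}(X_i-\bar{X})(X_j-\bar{X})}{\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]\left[\sum_{i=1}^{m}(X_i-\bar{X})^2/m\right]}, \qquad (1)$$
where $\bar{X}=\sum_{i=1}^{m}X_i/m$, and $w_{ij}$ with $w_{ii}=0$ is the $(i,j)$-th element of a spatial weight matrix $W$.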
Commonly wij is defined by the adjacency of spatial units: wij = 1 if regions i and j are adjacent
(neighbors) and wij = 0 otherwise. A significant and positive value of Moran’s I indicates the
existence of a positive autocorrelation, or the existence of high-value or low-value clustering. A
significant and negative value of Moran’s I indicates a negative autocorrelation, or a tendency
toward the juxtaposition of high values next to low values.
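Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `morans_i` and the four-region line example are hypothetical and chosen only so that the values can be checked by hand.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I as in equation (1).

    x : length-m vector of values; w : m x m weight matrix with zero diagonal.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = x.size
    dev = x - x.mean()                        # X_i - X-bar
    numerator = dev @ w @ dev                 # sum_i sum_j w_ij dev_i dev_j
    denominator = w.sum() * (dev @ dev) / m   # [sum w_ij] * [sum dev^2 / m]
    return numerator / denominator

# Hypothetical example: four regions arranged on a line, binary adjacency.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)

i_trend = morans_i([1, 2, 3, 4], w_line)     # smooth trend: positive I
i_alt = morans_i([1, -1, 1, -1], w_line)     # alternating values: negative I
```

In this toy example the smooth trend gives I = 1/3 and the alternating pattern gives I = −1, matching the interpretation of positive and negative autocorrelation given above.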
The null hypothesis of Moran’s I is usually based on the assumption that the distributions
of Xi are homogeneous. The p-value of the significance of Moran’s I is computed from a z-test
based on its z-value given by $Z(I) = [I - E(I)]/\sqrt{V(I)}$, where E(I) and V(I) are the theoretical
mean and variance respectively under the null hypothesis. Under the null hypothesis of no spatial
autocorrelation, Z(I) is assumed to be asymptotically distributed as N(0, 1) as m → ∞. The
theoretical values of E(I) and V(I) are usually computed under the random permutation test
scheme:
$$E(I) = -\frac{1}{m-1} \qquad (2)$$
and
$$V(I) = \frac{m[(m^2-3m+3)S_1 - mS_2 + 3S_0^2] - b_2[(m^2-m)S_1 - 2mS_2 + 6S_0^2]}{(m-1)(m-2)(m-3)S_0^2} - E^2(I), \qquad (3)$$
where $S_0=\sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m}(w_{ij}+w_{ji})/2$, $S_1=\sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m}(w_{ij}+w_{ji})^2/2$, and $S_2=\sum_{i=1}^{m}(w_{i\cdot}+w_{\cdot i})^2$
with $w_{i\cdot}=\sum_{j=1,j\neq i}^{m}w_{ij}$ and $w_{\cdot i}=\sum_{j=1,j\neq i}^{m}w_{ji}$, and $b_2 = m\sum_{i=1}^{m}(z_i-\bar{z})^4/[\sum_{i=1}^{m}(z_i-\bar{z})^2]^2$ ([14], p. 21).
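The permutation moments (2) and (3) and the resulting z-test can be sketched as follows. This is a hedged illustration under the assumption that the weight matrix has a zero diagonal; the function names and the four-region example are hypothetical, and the two-sided normal p-value is one common convention rather than something the text prescribes.

```python
import math
import numpy as np

def morans_i(x, w):
    """Moran's I as in equation (1)."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return (dev @ w @ dev) / (w.sum() * (dev @ dev) / x.size)

def permutation_moments(x, w):
    """E(I) and V(I) under the random permutation scheme, equations (2)-(3)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)       # assumed to have w_ii = 0
    m = x.size
    ws = w + w.T                         # w_ij + w_ji
    s0 = ws.sum() / 2.0
    s1 = (ws ** 2).sum() / 2.0
    s2 = ((w.sum(axis=1) + w.sum(axis=0)) ** 2).sum()
    dev = x - x.mean()
    b2 = m * (dev ** 4).sum() / (dev ** 2).sum() ** 2
    e_i = -1.0 / (m - 1)
    num = (m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
           - b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2))
    v_i = num / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - e_i ** 2
    return e_i, v_i

def moran_z_test(x, w):
    """Return I, Z(I) and a two-sided normal p-value."""
    i_obs = morans_i(x, w)
    e_i, v_i = permutation_moments(x, w)
    z = (i_obs - e_i) / math.sqrt(v_i)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return i_obs, z, p_two_sided

# Hypothetical example: four regions on a line with values 1, 2, 3, 4.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
e_i, v_i = permutation_moments([1, 2, 3, 4], w_line)
i_obs, z, p = moran_z_test([1, 2, 3, 4], w_line)
```

For this toy example the moments can be verified by enumerating all 24 permutations by hand: E(I) = −1/3 and V(I) = 8/45. With only four regions the normal approximation is of course not meaningful; the example only checks the arithmetic.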
When observations are counts, such as crimes, Xi in (1) often takes the form of a case rate
Xi = ni/ξi, where ni is the number of cases and ξi is the at-risk population size in region
i. However, the homogeneity assumption under this specification may not be valid [11]. Since
loglinear models can relax this assumption, we can specify a loglinear model and use its deviance
residuals to test for spatial autocorrelation.
Suppose that the random count Ni with an observed count ni, i = 1, · · · , m, follows a Poisson
distribution and assume that the counts N1, · · · , Nm are independent. Suppose that a set of geographical
covariates is observed together with the count ni. Then, a loglinear model can be set out by
taking the observed geographical covariates as explanatory variables and the logarithm of the
at-risk population size (i.e., log(ξi)) as an offset term. When the parameters are estimated by
maximum likelihood estimation, the estimated value n̂i of the expected count E(Ni) can be derived
and the conventional deviance residual ([1], p. 588) for region i is
$$r_{i,d} = \sqrt{2}\,\mathrm{sign}(n_i-\hat{n}_i)\,[n_i\log(n_i/\hat{n}_i) - n_i + \hat{n}_i]^{1/2}, \qquad (4)$$
where sign(·) is the sign function: sign(a) is 1 if a > 0, 0 if a = 0, and −1 if a < 0.
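The deviance residual can be computed as in the sketch below, which uses the standard Poisson form sign(n − n̂)·sqrt(2[n log(n/n̂) − n + n̂]). The function name and the sample values are hypothetical, and the treatment of zero counts (taking n log(n/n̂) as 0 when n = 0) is the usual limiting convention, stated here as an assumption.

```python
import numpy as np

def poisson_deviance_residuals(n, n_hat):
    """Poisson deviance residuals for observed counts n and fitted counts n_hat.

    Assumes every fitted count n_hat is strictly positive.
    """
    n = np.asarray(n, dtype=float)
    n_hat = np.asarray(n_hat, dtype=float)
    # n * log(n / n_hat) is taken as 0 when n == 0 (its limiting value).
    ratio = np.where(n > 0, n / n_hat, 1.0)
    term = n * np.log(ratio) - n + n_hat
    # Guard against tiny negative values caused by floating-point rounding.
    return np.sign(n - n_hat) * np.sqrt(2.0 * np.maximum(term, 0.0))

# Hypothetical observed and fitted counts for three regions.
r = poisson_deviance_residuals([0, 5, 10], [2, 5, 5])
```

The squared residuals sum to the model deviance, which is why large absolute residuals point to regions that the fitted model explains poorly.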
The concepts and statistical properties of deviance residuals in loglinear models are well established.
We can readily extend these concepts to spatial statistics. Note that the numerator of
Moran’s I is a martingale if the Xi are independent with mean 0. When Xi = ri,d with ri,d given
in (4) and n̂i is replaced by the expected count E(Ni), X1, · · · , Xm are independent and E(Xi)
is almost 0 if E(Ni) is large (e.g. E(Ni) > 5). When E(Ni) is estimated by n̂i from a loglinear
model, then under the model assumption, n̂i is a consistent estimator of E(Ni), and the joint distribution
of (r1,d, · · · , rm,d) is approximately normal with mean 0 and a variance-covariance matrix
that is an orthogonal projection matrix [1, 30], denoted by Pm. For a fixed number of covariates when m
is large, the orthogonal projection matrix Pm is almost equivalent to the identity matrix since the
dimension of the kernel space of the projection matrix is equal to the number of covariates. When m
goes to infinity, r1,d, · · · , rm,d are approximately independent and the asymptotic normality of Z(I)
can be proven by a martingale approximation of the numerator of Moran’s I with an application
of the Martingale Central Limit Theorem ([9], p. 475). In addition, one must also consider the
convergence of the permutation mean and variance of Moran’s I in this scenario [32].
This particular asymptotic formulation of the deviance residuals is analogous to that of re-
gression residuals ([14], p. 198). It is noted that deviance residuals are very flexible in loglinear
models, and they reflect categorical structure (in this case spatial structure) while controlling for
potentially heterogeneous population sizes ([1], p. 495). We can similarly test deviance residu-
als for spatial autocorrelation by specifying a loglinear model. Since a loglinear model, such as
the log-rate model, can incorporate geographic (or ecological) covariates, we can test its residuals for
spatial autocorrelation in the presence or absence of ecological covariates. A natural approach is to
apply the random permutation test so that Moran’s I based on the deviance residual of a loglinear
model is analogous to Moran’s I based on residuals of a regression model [6].
Given that deviance residuals are approximately multivariate normal, we can test spatial au-
tocorrelation of the residuals by replacing Xi in (1) with ri,d in equation (4), and we label it IDR.
The mean and variance of IDR can be identically derived from the random permutation scheme
of the conventional Moran’s I as given by equations (2) and (3) respectively. To implement IDR,
we can simply estimate the expected counts under the null model with the intercept only, which
gives $\hat{n}_i = \xi_i(n/\xi)$ with $n=\sum_{i=1}^{m} n_i$ and $\xi=\sum_{i=1}^{m}\xi_i$. In this case, the i-th deviance
residual ri,d can be derived by inserting n̂i in (4). If IDR is positive and significant, it suggests
spatial clustering, which can arise from either a first-order clustering trend or a few local
clusters. We can detect clustering contributions by applying spatial association models [23, 24].
First, a number of spatial association terms are added to the null model. Then, the parameter
estimates together with residuals are derived in the model fitting process. The existence of spatial
autocorrelation is tested again via IDR for the model residuals. If IDR is significant in the null
model but not significant in the model with local association terms, the significance found in the
null model is likely to be accounted for by the local association terms. If a few spatial association
terms cannot reduce the significance of IDR from the null model, it suggests the existence of a
first-order global clustering tendency.
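The null-model version of IDR described above can be sketched as follows. The expected counts use n̂i = ξi(n/ξ), the residuals use the standard Poisson deviance form, and significance is assessed here by Monte Carlo permutation of the residuals, which is a simulation substitute for the analytical moments (2) and (3) rather than the procedure used in the paper. All names, the seed, and the four-region data are hypothetical.

```python
import numpy as np

def deviance_residuals(n, n_hat):
    """Standard Poisson deviance residuals (n_hat assumed positive)."""
    ratio = np.where(n > 0, n / n_hat, 1.0)
    term = n * np.log(ratio) - n + n_hat
    return np.sign(n - n_hat) * np.sqrt(2.0 * np.maximum(term, 0.0))

def morans_i(x, w):
    """Moran's I as in equation (1)."""
    dev = x - x.mean()
    return (dev @ w @ dev) / (w.sum() * (dev @ dev) / x.size)

def idr_null_model(counts, pop, w, n_perm=999, seed=0):
    """I_DR under the intercept-only model with a Monte Carlo permutation p-value."""
    counts = np.asarray(counts, dtype=float)
    pop = np.asarray(pop, dtype=float)
    n_hat = pop * counts.sum() / pop.sum()     # expected counts xi_i * (n / xi)
    resid = deviance_residuals(counts, n_hat)
    i_obs = morans_i(resid, w)
    rng = np.random.default_rng(seed)
    perm = np.array([morans_i(rng.permutation(resid), w) for _ in range(n_perm)])
    # Upper-tail p-value: share of permuted statistics at least as large as observed.
    p_upper = (1 + np.sum(perm >= i_obs)) / (n_perm + 1)
    return n_hat, resid, i_obs, p_upper

# Hypothetical data: four regions on a line, equal populations, a high-count end.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
n_hat, resid, i_dr, p_val = idr_null_model([10, 12, 30, 28], [100] * 4, w_line)
```

Refitting the model with association terms and recomputing the statistic on the new residuals follows the same pattern: only `n_hat` changes.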
Under the assumption that there is a local cluster in the study area, a loglinear model with a
spatial association term is:
log(λi) = log(ξi) + β0 + β1di(j) (5)
where λi = E(Ni), β0 is the grand mean and β1 is the unknown parameter for the spatial association
term defined by di(j) in which di(j) is 1 if location i is believed to be in a cluster centered at unit
j, and otherwise di(j) is 0. We test if β1 significantly differs from 0. The significance of the local
association term is determined by its p-value via the likelihood ratio test over the null model
without the spatial association term. Likewise, the contribution of the spatial association term to
IDR is determined by comparing the p-value of IDR with and without the term. If the estimate
of β1 is significantly positive, then the local cluster is a hot spot. If the estimate is significantly
negative, then the local cluster is a cool spot.
Besides the above likelihood ratio test, we can gauge the contribution of β1 for the local cluster
by comparing IDR results with and without the β1 term in model (5). If IDR is not significant when
the spatial association term is included, then the clustering effect in the null model is sufficiently
removed by the association term. If the inclusion of the spatial association term in model (5) does not
change the significance level of IDR in the null model, then the clustering effect remains. To further
improve model fit and to identify the unexplained clustering effect, one can either refine the spatial
association term already in the model or include another spatial association term. Finally, if the
existence of a local cluster is accompanied by a first-order global clustering trend, the likelihood
ratio test for the inclusion of the local association term may still be significant, but the term is unlikely
to reduce the p-value of IDR from a significant level to a non-significant level.
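When di(j) is a 0/1 indicator, model (5) has a closed-form maximum likelihood fit: the fitted rate is the pooled rate inside the cluster and the pooled rate outside it. The sketch below uses that fact to obtain β0, β1 and the likelihood ratio statistic against the null model. It is an illustration only; the function name and the data are hypothetical, both groups are assumed to have positive total counts, and graded association scores (such as those used in the case study) would require an iterative Poisson regression fit instead.

```python
import math
import numpy as np

def fit_cluster_indicator_model(counts, pop, d):
    """Closed-form fit of log(lambda_i) = log(xi_i) + beta0 + beta1 * d_i for binary d."""
    counts = np.asarray(counts, dtype=float)
    pop = np.asarray(pop, dtype=float)
    d = np.asarray(d).astype(bool)
    rate_out = counts[~d].sum() / pop[~d].sum()   # pooled rate outside the cluster
    rate_in = counts[d].sum() / pop[d].sum()      # pooled rate inside the cluster
    beta0 = math.log(rate_out)
    beta1 = math.log(rate_in / rate_out)
    n_hat_null = pop * counts.sum() / pop.sum()
    n_hat_alt = pop * np.where(d, rate_in, rate_out)
    # Likelihood ratio statistic; both fits reproduce the total count, so the
    # linear terms cancel and only the log-ratio terms remain.
    pos = counts > 0
    g2 = 2.0 * np.sum(counts[pos] * np.log(n_hat_alt[pos] / n_hat_null[pos]))
    # Chi-square tail probability with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g2, 0.0) / 2.0))
    return beta0, beta1, g2, p_value

# Hypothetical data: the last two regions form the suspected cluster.
beta0, beta1, g2, p_lr = fit_cluster_indicator_model(
    [10, 10, 30, 30], [100, 100, 100, 100], [0, 0, 1, 1])
```

A positive `beta1` corresponds to a hot spot and a negative one to a cool spot; the deviance residuals of the fitted alternative model can then be passed to the IDR test to see whether any clustering remains.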
3 Simulation Assessment of IDR
We designed Monte Carlo simulation experiments to assess the effectiveness of the model-based test
IDR under population heterogeneity. Type I error rates were evaluated under the null hypothesis
of homogeneous rates with heterogeneous population sizes, while spatial cluster modeling was
evaluated in the presence and absence of a first-order global clustering trend. All the simulation
experiments were based on a 10 × 10 lattice with wij being defined by the rook rule of spatial
adjacency. We set the significance level at α = 0.05 to assess the rejection rates of IDR in each set
of simulations. In the presence of a local cluster, a residual plot was also furnished to facilitate
the evaluation process.
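The rook-rule weight matrix for the lattice can be built as in the following sketch. The row-major numbering of lattice points and the function name are assumptions made for illustration; the text does not specify an ordering.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary rook adjacency: cells sharing an edge are neighbors (w_ij = 1)."""
    m = n_rows * n_cols
    w = np.zeros((m, m))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c                 # row-major index (an assumption)
            if r + 1 < n_rows:                 # neighbor below
                j = (r + 1) * n_cols + c
                w[i, j] = w[j, i] = 1.0
            if c + 1 < n_cols:                 # neighbor to the right
                j = r * n_cols + c + 1
                w[i, j] = w[j, i] = 1.0
    return w

w_lattice = rook_weights(10, 10)
```

On a 10 × 10 lattice this gives 180 neighbor pairs, so the weights sum to 360; corner cells have two neighbors and interior cells have four.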
In addition to IDR, we included the original Moran’s I by converting counts to rates, and
denoted it by Ir, which is defined by letting Xi = ri = ni/ξi in (1). Previous studies have demonstrated
that Ir is sensitive to heterogeneous populations, and the inclusion of Ir was to serve as
a baseline for comparison. We also included the Empirical Bayes Index (EBI) denoted by
IEBI, a population-adjusted Moran’s I proposed by Assuncao and Reis [5]. IEBI is found to be
effective for adjusting population sizes in the presence of population heterogeneity, and it has
been included in GeoDa, a popular spatial exploratory data analysis freeware [4]. However, IEBI
is not a model-based test and it cannot include ecological covariates. This can be seen from the
definition, in which $z_i = r_{i,EBI} = (p_i-b)/\sqrt{\nu_i}$, where $p_i = n_i/\xi_i$, $b = n/\xi$, $\nu_i = a + b/\xi_i$, $a = s^2 - b/(\xi/m)$
and $s^2 = \sum_{i=1}^{m}\xi_i(p_i-b)^2/\xi$. Hence,
$$I_{EBI} = \frac{\sum_{i=1}^{m}\sum_{j\neq i} w_{ij}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)\left(\frac{p_j-b}{\sqrt{\nu_j}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m} w_{ij}\right]}. \qquad (6)$$
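The EBI standardization that enters equation (6) can be sketched as follows. The function name and data are hypothetical. The text does not say how a non-positive νi should be handled, so this sketch simply raises an error in that case rather than guessing a convention.

```python
import numpy as np

def ebi_standardize(counts, pop):
    """Empirical Bayes standardized rates z_i = (p_i - b) / sqrt(nu_i)."""
    n_i = np.asarray(counts, dtype=float)
    xi = np.asarray(pop, dtype=float)
    m = n_i.size
    p = n_i / xi                                  # observed rates p_i
    b = n_i.sum() / xi.sum()                      # overall rate b = n / xi
    s2 = (xi * (p - b) ** 2).sum() / xi.sum()     # population-weighted variance
    a = s2 - b / (xi.sum() / m)
    nu = a + b / xi
    if np.any(nu <= 0):
        # Handling of this case is not specified in the text (assumption: fail loudly).
        raise ValueError("non-positive variance estimate nu_i")
    return (p - b) / np.sqrt(nu)

# Hypothetical data: equal populations, two low-rate and two high-rate regions.
z_ebi = ebi_standardize([10, 10, 30, 30], [100, 100, 100, 100])
```

Passing `z_ebi` as the variable in equation (1) gives IEBI. In this example b = 0.2, a = 0.008 and every νi = 0.01, so the standardized values are −1, −1, 1, 1.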
Assess IDR for population heterogeneity. Following the simulation studies of Walter [36]
and Assuncao and Reis [5], we compared the type I error probabilities of IDR, IEBI and Ir based
on Monte Carlo simulations. Walter [36] reported that densely populated areas with a pocket of
sparsely populated area could cause an excessive type I error probability for Ir. To represent this
pattern, we generated a relatively low population of $10^6(1-\eta)^2$ for lattice points within a 2-unit
circle centered at (3, 3), and $10^6$ for the others. The η value indexes population heterogeneity from 0
to 0.8 in increments of 0.04. When η = 0, all the populations were homogeneous.
As η approaches 1, the populations become increasingly heterogeneous.
Based on the above population patterns, we generated independent Poisson random variables
with the mean value being $10^{-4}$ times the population size for each lattice point. Since identical
rates were expected across all lattice points, there should be no spatial clustering. The rejection
rate, therefore, should reflect the type I error probability of the spatial autocorrelation test. For
each η value, we calculated type I error probabilities based on 10,000 simulations and the resultant z
values. The results (Figure 1) show that both IDR and IEBI were able to account for population
heterogeneity with an almost identical type I error probability around 0.05 for all η values. The
type I error probability of Ir, however, was only acceptable when η is small with little variation
in population sizes. As η increased, the type I error rates also increased. When population sizes
varied substantially (η = 0.8), the rejection rate was as high as 25%, a result consistent with
Walter’s simulations.
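The population pattern and the null counts of this experiment can be generated as in the sketch below. Two details are assumptions not stated in the text: lattice coordinates run from 1 to 10, and "within a 2-unit circle" is taken to include points at distance exactly 2. Function names and the seed are hypothetical.

```python
import numpy as np

def lattice_population(eta, side=10, center=(3, 3), radius=2.0, base=1e6):
    """Population of base*(1-eta)^2 inside the circle and base elsewhere."""
    coords = np.arange(1, side + 1)               # coordinates 1..side (assumption)
    xx, yy = np.meshgrid(coords, coords, indexing="ij")
    inside = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2
    pop = np.full((side, side), base)
    pop[inside] = base * (1.0 - eta) ** 2
    return pop, inside

def simulate_null_counts(pop, rate=1e-4, seed=0):
    """Independent Poisson counts with a common rate, i.e. no spatial clustering."""
    rng = np.random.default_rng(seed)
    return rng.poisson(rate * pop)

pop_grid, in_circle = lattice_population(eta=0.5)
null_counts = simulate_null_counts(pop_grid)
```

Repeating the count generation many times for each η and recording how often each test rejects at α = 0.05 gives the empirical type I error curves described above.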
Assess IDR for local cluster detection. Based on the previous simulation result, we devised
a fixed heterogeneous population pattern: the population was $10^5$ if a point on the lattice was
within the circle and the population was $10^6$ otherwise. We generated independent Poisson random
variables with the mean equal to 0.0001 times the population size for each lattice point. We then
inserted a 2-unit circle for a cluster effect centered at (7, 7), and set the mean within it equal to 0.0001(1+δ)
times the population. The δ value represented the strength and direction of the cluster effect, and
it increased from −0.8 to 0.8 in steps of 0.04. If δ < 0, the circle represented a low-value
cluster; if δ > 0, it represented a high-value cluster.
Again, based on 10,000 simulations for each δ value under population heterogeneity, the rejection
rates of IDR with and without the spatial association term from model (5) are shown in
Figure 2. The rejection rate without the spatial association term indicates the statistical power of
IDR, while the rate with the spatial association term indicates the effectiveness of the model-based
test for a high-value or low-value cluster. If the model-based test is effective, the test statistic
should no longer be significant when the spatial association term that covers the exact circle is
included. The results show that IDR under the null model had a reasonable power (Figure 2).
When δ values were around 0, the rejection rate was around 0.05. When the absolute δ values
were greater than 0.25, the rejection rates were about 15%. When the δ value reached −0.8 (a
cool spot), the rejection rate was almost 100%. When the δ value reached 0.8, the rejection rate was
about 85%. Both results suggest that IDR under the null model is likely to be significant when
there is a strong local cluster.
However, when the cluster tendency was accounted for by the spatial association term, the
rejection rates were consistently around 0.05, suggesting that IDR was unlikely to be significant
when a spatial association term absorbed the cluster effect. Since the relative risks within the
cluster were all similarly higher or lower than the rest of the area in our simulations, once its
effect was removed by the spatial association term, the study area became spatially independent,
a result consistent with previous simulations in the spatial logit association model [23].
The effect of the spatial association term can be illustrated by the residual and QQ-normal
plots from a single simulation. The upper panel of Figure 3 displays the results under the null
model. The IDR test had a p-value of 0.0001 primarily due to a number of extremely high deviance
residuals from the clustered area. Likewise, the QQ-plot shows that a number of high values are
concentrated in the upper tail, suggesting the existence of extreme values. The lower panel in
Figure 3 shows that once the spatial association term was added to the model, the effect of the extremely
large residual values in the null model disappeared, and the p-value of IDR increased to
0.12 with evenly distributed residuals. This result is also corroborated by the QQ-plot with all
the values along a straight line.
Assess IDR for a local cluster in the presence of first-order global clustering. It is
known that a local cluster and a first-order clustering trend can operate simultaneously. In the
presence of global clustering, it is often necessary to first de-trend before fitting a spatial regression
model [2, 12]. We intend to evaluate the performance of IDR in this situation by generating the
global spatial structure from a log-normal distribution, and by inserting a local cluster from the
previous simulation with δ = 2.0 in the simulation. If the local test is insensitive to the first-order
clustering tendency, then it indicates the existence of global clustering.
In the simulation process, we first generated 100 independent and identically distributed (iid)
N(0, 1) random variables, denoted by ε = (ε1, ε2, · · · , ε100). Next, we calculated a vector u by
letting $u = (I - \rho W)^{-1}\varepsilon$ with ρ increasing from 0 to 0.2 in increments of 0.01 such that u
satisfied
$$u = \rho W u + \varepsilon,$$
where ρ is the coefficient of the global spatial association [2, 5]. Third, we let $\lambda = 0.0001(1+2d)e^{u}$,
where λ = (λ1, · · · , λ100) was the vector of Poisson intensities for generating counts. We generated a
conditionally independent Poisson random variable Ni with parameter λi times the i-th population
size. When ρ = 0, there was only a local cluster in the simulated pattern, and when ρ ≠ 0,
there were both local and global clustering tendencies in the simulated pattern. We assessed the
effectiveness of IDR by comparing the rejection rates of IDR with and without the spatial association
term.
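The three generation steps above can be sketched as follows. The sketch assumes a binary (not row-standardized) rook weight matrix, lattice coordinates from 1 to 10, and a cluster indicator d equal to 1 within 2 units of (7, 7); none of these details is stated explicitly in the text, and all names and the seed are hypothetical.

```python
import numpy as np

def rook_weights_kron(side):
    """Binary rook adjacency of a side x side lattice via Kronecker products."""
    path = np.diag(np.ones(side - 1), k=1)
    path = path + path.T                      # adjacency of a path of length `side`
    eye = np.eye(side)
    return np.kron(eye, path) + np.kron(path, eye)

def simulate_intensity(rho, d, w, seed=0):
    """Draw eps ~ N(0, I), solve u = rho*W*u + eps, return lambda = 1e-4*(1+2d)*exp(u)."""
    rng = np.random.default_rng(seed)
    m = w.shape[0]
    eps = rng.standard_normal(m)
    u = np.linalg.solve(np.eye(m) - rho * w, eps)   # u = (I - rho W)^{-1} eps
    lam = 1e-4 * (1.0 + 2.0 * d) * np.exp(u)
    return eps, u, lam

side = 10
w_grid = rook_weights_kron(side)
coords = np.arange(1, side + 1)
xx, yy = np.meshgrid(coords, coords, indexing="ij")
d_cluster = (((xx - 7) ** 2 + (yy - 7) ** 2) <= 4).astype(float).ravel()

eps, u, lam = simulate_intensity(rho=0.15, d=d_cluster, w=w_grid)
# Counts are conditionally independent Poisson with mean lambda_i * population_i.
pop = np.full(side * side, 1e6)
counts = np.random.default_rng(1).poisson(lam * pop)
```

Solving the linear system is preferred to forming the inverse explicitly; for the binary rook matrix on this lattice, ρ ≤ 0.2 keeps I − ρW invertible.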
Based on 10,000 simulations for each ρ value, the results (Figure 4) showed that the spatial
association term was unable to reduce the clustering effect except when the global clustering trend
was very weak. For instance, when ρ = 0, the rejection rate for IDR in the null model was about
28%, and it suggested spatial clustering. When the spatial association term was included in this
case, the local cluster tendency was reduced, similar to the previous simulation. As the global
clustering trend ρ increased, the rejection rates of IDR also increased, and IDR both with and
without the association term was likely to be significant for even a modest increase in ρ. The
inclusion of the spatial association term had little effect on removing a local clustering effect in the
presence of the global clustering tendency. It further suggests that even when an association term
might be significant in terms of the likelihood ratio test, the local effect might not be trustworthy,
because the global effect overshadowed the local effect.
Figure 5 displays the residual and QQ-plots of the deviance residuals with and without the
spatial association term from a single simulation run (ρ = 0.15). Evidently, there were only a
few large deviance residuals in absolute value, and they were not clumped together. This pattern
is in sharp contrast with the one in Figure 3. In addition, the p-values of IDR with and without
the spatial association term were very close: 0.0003 with the spatial association term and
0.0001 without. These results suggest that the inclusion of a local association term is unlikely to
reduce the significance of IDR because of the overall global clustering effect.
In summary, IDR is effective in reducing type I error probabilities of the traditional Moran’s
I due to heterogeneous population sizes, and its performance is comparable to that of IEBI . An
advantage of IDR over IEBI is its ability to include ecological or other spatial covariates. When a
significant IDR is contributed mainly by a local cluster, we can devise a spatial association term to
remove the cluster effect, so that the spatial autocorrelation observed in the null model would not
be significant anymore. The exact form of the association term can be determined either by a stepwise
regression method [23] or by an exploratory method, such as deviance residual plots. Since IDR
is sensitive to the existence of local clusters but not sensitive to the presence of the global trend,
the inclusion of a spatial association term in the IDR test can indicate whether a first-order global
clustering trend exists or not.
4 St. Louis Homicide Data Analysis
In this section, we apply IDR to analyzing homicides in the St. Louis region. The data set was
originally analyzed by Messner et al. [25], and it is also included as part of the exercises in GeoDa
[4], a simple spatial analysis package developed by Anselin and his associates. In the original
paper, homicide rates for 1984-1988 and 1988-1993 periods were analyzed at the county level, and
a number of local clusters including one centered at St. Louis City were identified by LISA. Here,
we can use the model-based IDR to detect spatial clustering based on homicide incidents and the at-risk
population. Analogous to LISA, we also included a local version of the deviance residual Moran’s
I or deviance residual LISA denoted by IDR,i. The result of IDR,i was compared with the results
of the local versions of Ir and IEBI , denoted by Ir,i and IEBI,i respectively, where IDR,i, Ir,i and
IEBI,i can be defined according to the formula given by Anselin [3] as
$$I_i = \frac{\sum_{j=1,j\neq i}^{m} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{l=1}^{m}(x_l-\bar{x})^2/m} \qquad (7)$$
by letting xi = ri,d, xi = ri and xi = ri,EBI respectively. All IDR,i, Ir,i and IEBI,i are able to provide
additional ways of exploratory spatial analysis for count data, such as cluster maps. However, only
IDR,i is able to provide an additional clustering analysis when a covariate variable is accounted
for.
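The local statistic in equation (7) can be sketched as below. The function name and the four-region example are hypothetical; passing deviance residuals, raw rates, or EBI-standardized rates as `x` gives the three local versions compared in this section.

```python
import numpy as np

def local_morans_i(x, w):
    """Local Moran's I_i as in equation (7), one value per region."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    dev = x - x.mean()
    m2 = (dev ** 2).sum() / x.size     # denominator: mean squared deviation
    return dev * (w @ dev) / m2        # dev_i * sum_j w_ij dev_j / m2

# Hypothetical example: four regions on a line with values 1, 2, 3, 4.
w_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
local_i = local_morans_i([1, 2, 3, 4], w_line)
```

Comparing with equation (1), the local values sum to the global Moran’s I multiplied by the total of the weights, which is how the global statistic decomposes into local contributions. In this example the local values are 0.6, 0.4, 0.4 and 0.6, summing to 2 = 6 × (1/3).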
In the preliminary analysis, we found that IDR for the 1984-1988 period was 0.0319 with an
insignificant p-value of 0.2034, and IDR for the 1988-1993 period was 0.1137 with a significant
p-value of 0.0066. We, therefore, focused on the latter period.
Between 1988 and 1993, there were 2,650 homicides, and the average homicide rate was about
10 per 100,000. County populations in the study area vary substantially: St. Louis County was
the largest with more than one million residents, and five other counties (St. Louis
City, St. Clair, Boone, Sangamon and Macon) had at least 50,000 residents. To detect spatial
clustering for homicides, we first fitted the null model. The results from IDR indicated a significant
clustering tendency with a z-value of 2.72 and a p-value of 0.0066. When we plotted the deviance
residuals by five equal intervals (Figure 6), St. Louis City was in the first interval with a
deviance residual of 40.22, St. Clair county was in the third interval (18.06), and there was no county
in the second interval. This indicated that St. Louis City was the only county that indicated
a high-value cluster surrounded by St. Clair, St. Louis and Madison counties. In addition, we
further plotted the deviance residual LISA by using GeoDa, and found that the standardized values
of IDR,i, Ir,i and IEBI,i were 17.5, 25.45 and 25.64 respectively when i indicated the St. Louis
county, and the values were 12.27, 13.93 and 14.01 respectively when i indicated the St. Clair
county. The values of the remaining counties were much lower than the values of those two counties.
The LISA plot also indicated that St. Louis and Madison counties were next to high-valued
counties, presumably the two very high-valued counties, St. Louis City and St. Clair.
Based on the above information, we decided not to adopt the spatial association term that assigns
equal contribution to the clustered effect. We refined the shape of the cluster by examining each
individual residual within the adjacent counties, and devised a spatially varied association term
to capture the magnitude of residual variation within a cluster.
Based on the principle of the uniform association model [1], a large residual value should
correspond to a large di(j) value, and a relatively small residual value should correspond to a small
di(j) value. When a neighboring county has a negligible absolute residual value, it can be dropped
from the spatial specification. From the five equal interval classification, St. Louis City was in
the first interval, St. Clair in the third, and Madison and St. Louis in the fourth interval. We
assigned, accordingly, 4 to St. Louis City, 2 to St. Clair, and 1 to Madison and St. Louis counties,
and this assignment could be achieved automatically in our search algorithm because standard
intervals were used. The results show that the model with the spatial association term was highly
significant, contributing a reduction in deviance of about 2,610, from 2,944 in the null model
to 334 in the alternative model. In the meantime, the coefficient of the spatial association term
0.6145 indicates a high-value cluster, and its inclusion changed the p-values of IDR from 0.0066 in
the null model to 0.9094 in the alternative model. It suggests that the spatial association term
can remove the effect of the local cluster, and there was no global clustering trend. In addition,
if we inspect the deviance residuals individually for the 4 counties, we could see 40, 20, and −5.
Based on this information, we further experimented with assigning 4 to St. Louis City, 2.5 to
St. Clair, and 1 to Madison and St. Louis counties (Model II), and this assignment could further
reduce the deviance to 177 with a z-value of 0.3920 for IDR. In both cases, the values of IDR,i
significantly decreased to a very low level for St. Louis and St. Clair counties which was almost not
14
significant throughout the region at the 0.05 probability level when we adjusted for the multiple
testing problem of 78 units by the Bonferroni’s method (see [28], p 153).
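The score assignment described above can be sketched as follows. This is only an illustration under stated assumptions: the class limits are taken from the legend of Figure 6 (five equal intervals from −10.3 to 40.22), a county's score is assumed to be the index of its interval counted from the lowest class (which reproduces the 4, 2, 1 assignment of Model I), and the function name and the example residual values are hypothetical rather than part of the search algorithm itself.

```python
def equal_interval_score(residual, lowest=-10.3, highest=40.22, n_classes=5):
    """Return 0 for the lowest equal-width class, up to n_classes - 1 for the highest."""
    width = (highest - lowest) / n_classes
    index = int((residual - lowest) // width)
    # Clamp so that values on or beyond the outer limits fall in the end classes.
    return min(max(index, 0), n_classes - 1)


# Hypothetical residuals chosen to fall in the first, third and fourth
# intervals counted from the top, as described in the text.
example_residuals = {
    "first interval": 40.0,
    "third interval": 19.9,
    "fourth interval": 5.0,
}
scores = {name: equal_interval_score(r) for name, r in example_residuals.items()}
print(scores)
```

Under this reading, a county in the lowest class receives a score of 0 and is effectively dropped from the spatial specification.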
It is worth noting that odds ratios can be used to describe the shape of a cluster. For instance,
the odds ratio of e^{0.6145} = 1.849 in Model I between St. Louis county and other counties indicates
that St. Louis county was 1.849 times as likely as other counties to have a homicide.
Similarly, St. Louis City would be e^{4×0.6145} = 11.68 times and St. Clair would be e^{2×0.6145} = 3.4178
times as likely.
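The arithmetic behind these odds ratios can be checked with a short sketch. The coefficient and the scores are those reported for Model I; the variable names are illustrative only.

```python
import math

beta_1 = 0.6145  # coefficient of the spatial association term in Model I (Table 1)

# Scores assigned in Model I; each odds ratio is exp(score * beta_1).
scores = {
    "St. Louis City": 4,
    "St. Clair": 2,
    "Madison and St. Louis counties": 1,
}
odds_ratios = {area: math.exp(score * beta_1) for area, score in scores.items()}

for area, ratio in odds_ratios.items():
    print(f"{area}: {ratio:.4f}")
```

Because the score enters the exponent linearly, doubling a county's score squares its odds ratio relative to counties outside the cluster.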
Alternatively, we can use geographical covariates to explain the detected clustering tendency.
For instance, it is known that St. Louis City had a high concentration of Blacks. We obtained the
percentage of Blacks from the 1990 census for all the 78 counties and used it as an ecological
covariate in place of the spatial association term. The results (Table 1, last row) show that the percentage
of Blacks was positively associated with the likelihood of homicides in the study area. The ecological
model performed slightly better than the spatial association model in terms of the likelihood
ratio test, i.e., it had a smaller deviance with the same number of degrees of freedom. In addition, when the
ecological variable was included, the p-value of 0.4902 for IDR was not significant, suggesting that
no spatial autocorrelation remained. This result implies that the St. Louis City cluster
detected by the association term can be explained by the percentage of Blacks in the case study.
Either an ecological variable or a spatial association term can yield useful information
to describe and quantify a detected cluster.
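The deviance comparisons in Table 1 can be reproduced with a small sketch. It assumes the usual likelihood ratio reasoning, in which the reduction in deviance from the null model is referred to a chi-square distribution with degrees of freedom equal to the number of added parameters (one here); the helper name is hypothetical, and the deviances and degrees of freedom are copied from Table 1.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


null_deviance, null_df = 2944.0, 77
models = {
    "Spatial association I": (334.2, 76),
    "Spatial association II": (177.2, 76),
    "Ecological covariate": (301.9, 76),
}

reductions = {}
for name, (deviance, df) in models.items():
    # Each alternative model adds exactly one parameter to the null model.
    assert null_df - df == 1
    reductions[name] = null_deviance - deviance
    print(name, round(reductions[name], 1), chi2_sf_1df(reductions[name]))
```

With reductions of this size the tail probability is numerically zero, which is consistent with the p-values of 0 reported for the coefficients in Table 1.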
5 Concluding Remarks
In this paper, we have specified and evaluated IDR as a loglinear model-based Moran’s I test
for Poisson count data that resembles the Moran’s I residuals test for Gaussian data. Based on
previous studies, we pointed out that loglinear residuals are not only asymptotically normal, but
also applicable to the permutation test of Moran’s I for a correctly specified model. We evaluated
type I and type II error rates via simulations, and found that IDR was effective in accounting for
heterogeneous population sizes and in detecting a local cluster in the absence of a global trend. In
the presence of a global trend, the power to detect a local cluster was very weak, a problem
that also exists for a continuous dependent variable in a linear regression model [12].
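The idea of a permutation test of Moran’s I applied to loglinear residuals can be illustrated with a generic sketch. It is not the exact definition of IDR used in this paper: it assumes Poisson deviance residuals, the textbook form of Moran’s I computed on mean-centered values, binary adjacency weights, and a one-sided permutation p-value. The function names and the toy data are hypothetical.

```python
import math
import random


def poisson_deviance_residuals(counts, fitted):
    """Signed square root of each unit's contribution to the Poisson deviance."""
    residuals = []
    for y, mu in zip(counts, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        contribution = 2.0 * (log_term - (y - mu))
        residuals.append(math.copysign(math.sqrt(max(contribution, 0.0)), y - mu))
    return residuals


def morans_i(values, weights):
    """Moran's I of mean-centered values for a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(v * v for v in z)


def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided p-value: share of random relabelings with I at least as large."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy example: six units on a line, each adjacent to its immediate neighbors,
# with a constant fitted value so that counts rise steadily along the line.
n_units = 6
line_weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n_units)]
                for i in range(n_units)]
toy_counts = [1, 2, 4, 9, 12, 14]
toy_fitted = [7.0] * n_units
toy_residuals = poisson_deviance_residuals(toy_counts, toy_fitted)
observed_i, p_value = permutation_p_value(toy_residuals, line_weights)
print(observed_i, p_value)
```

If a spatial association term or covariate removes the cluster, the residuals from the refitted model should no longer yield a small permutation p-value, which is the pattern reported in Table 1.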
In the case study, we extended Lin’s [23] spatial association model that emphasizes equal
contributions among spatial neighbors to an ordered or uniform spatial association model that
captures spatially varied contributions among spatial neighbors within a cluster. This model has
several advantages. First, it makes use of exploratory tools such as Moran’s I scatter plots and
residual plots to evaluate the magnitude of deviance residuals. Second, cluster shape can be
determined in terms of its geographic coverage and its slope via odds ratios. In other words,
a 3-dimensional cluster that spatially varies in terms of its magnitude can be derived by the
spatially varied association term. Third, this analysis can be extended to probit, logit [6] and
other limited dependent variables under the loglinear framework. Finally, our model-based IDR
test is complementary to recent developments in residual-based spatial statistical approaches [8].
Future research should extend IDR to other test statistics, such as Getis-Ord’s G [20] and
Geary’s c [18], and assess their effectiveness for various spatial problems. Likewise, there are
many conventional methods for modeling categorical associations, and we should examine their
effectiveness for constructing a spatially varied association term, and for specifying various forms
of loglinear models in the context of spatial analysis. The current study does not offer any
de-trend methods in the presence of a global trend, and how to de-trend while locating and explaining
local clusters remains a challenging issue. Finally, as with other model-based tests, when a model
is mis-specified, the result of a model-based test such as IDR can be misleading; criteria for a
correctly specified model should therefore be established for spatial loglinear models.
Acknowledgements: The authors would like to thank a reviewer for the detailed comments
and suggestions, which have substantially improved the quality of the paper.
References
[1] Agresti, A. (2002). Categorical Data Analysis. Wiley, New York.
[2] Anselin, L. (1990). Spatial dependence and spatial structural instability in applied regression
analysis. Journal of Regional Science, 30, 185-207.
[3] Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27,
93-115.
[4] Anselin, L., Syabri, I. and Kho, Y. (2006). GeoDa: An introduction to spatial data analysis.
Geographical Analysis, 38, 5-22.
[5] Assuncao, R. and Reis, E. (1999). A new proposal to adjust Moran’s I for population density.
Statistics in Medicine, 18, 2147-2162.
[6] Apanasovich, T. V., Sheather, S., Lupton, J. R., Popovic, N., Turner, N. D., Chapkin, R. S.,
Braby, L. A. and Carroll, R. J. (2003). Testing for spatial correlation in nonstationary binary
data, with application to aberrant crypt foci in colon carcinogenesis. Biometrics, 50, 752-761.
[7] Bennett, R. J. and Haining, R. P. (1985). Spatial structure and spatial interaction modeling
approaches to the statistical analysis of geographic data. Journal of Royal Statistical Society
A, 48, 1-36.
[8] Baddeley, A., Turner, R. and Hazelton, M. (2005). Residual analysis for spatial point processes.
Journal of Royal Statistical Society B, 67, 617-666.
[9] Billingsley, P. (1995). Probability and Measure, Wiley, New York.
[10] Brunsdon, C., Aitkin, M., Fotheringham, S. and Charlton, M. (1999). A comparison of ran-
dom coefficient modeling and geographically weighted regression for spatial non-stationary
regression problems. Geographical and Environmental Modelling, 3, 47-62.
[11] Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases. Journal of Royal
Statistical Society A, 154, 143-55.
[12] Cressie, N. (1993). Statistics for spatial data, Wiley, New York.
[13] Cliff, A. D. and Ord, J. K. (1972). Test for spatial autocorrelation among regression residuals.
Geographical Analysis, 4, 267-284.
[14] Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models And Applications, Pion, London.
[15] Fingleton, B. (1983b). Loglinear models with dependent spatial data. Environment and
Planning A, 15, 801-13.
[16] Fotheringham, S. (1997). Trends in quantitative geography: I: stressing the local. Progress in
Human Geography, 21, 88-96.
[17] Fotheringham, S. (1999). Guest editorial: local modeling. Geographical and Environmental
Modelling, 3, 5-7.
[18] Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statisti-
cian, 5, 115-145.
[19] Getis, A. and Aldstadt, J. (2004). Constructing the spatial weights matrix using a local
statistic. Geographical Analysis, 36, 90-104.
[20] Getis, A. and Ord, J. (1992). The analysis of spatial association by use of distance statistics.
Geographical Analysis, 24, 189-206.
[21] Griffith, D. (2002). A spatial filtering specification for the auto-Poisson model. Statistics and
Probability Letters, 58, 245-251.
[22] Lee, S. I. (2004). A generalized significance testing method for global measures of spatial
association: an extension of the Mantel test. Environment And Planning A, 36, 1687-1703.
[23] Lin, G. (2003). A spatial logit association model for cluster detection. Geographical Analysis,
35, 329-340.
[24] Lin, G. and Zhang, T. (2005). Loglinear residual tests of Moran’s I autocorrelation and their
applications to Kentucky Breast Cancer Data. Geographical Analysis, to appear.
[25] Messner, S., Anselin, L., Baller, R., Hawkins, D., Deane, G. and Tolnay, S. (1999). The
spatial patterning of county homicide rates: an application of exploratory spatial data analysis.
Journal of Quantitative Criminology, 15, 423-450.
[26] Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical
Society Series B, 10, 243-251.
[27] Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37, 17-23.
[28] Neter, J., Kutner, M. H., Nachtsheim, C. and Wasserman, W. (1996). Applied Linear Statis-
tical Models, 4th Edition, McGraw Hill, New York.
[29] Oden, N. (1995). Adjusting Moran’s I for population density. Statistics in Medicine, 14, 17-26.
[30] Pierce, D. and Schafer, D. (1986). Residuals in Generalized linear models. Journal of American
Statistical Association, 81, 977-986.
[31] Rogerson, P. A. (1999). The detection of clusters using a spatial version of the chi-square
goodness-of-fit statistics. Geographical Analysis, 31, 130-147.
[32] Sen, A. (1976). Large sample-size distribution of statistics used in testing for spatial
correlation. Geographical Analysis, 9, 175-184.
[33] Sokal, P. R., Oden, N. L. and Thomson, B. A. (1998). Local spatial autocorrelation in a
biological model. Geographical Analysis, 30, 411-432.
[34] Tango, T. (1995). A class of tests for detecting general and focused clustering of rare diseases.
Statistics in Medicine, 14, 2323-2334.
[35] Waldhor, T. (1996). The spatial autocorrelation coefficient Moran’s I under heteroscedasticity.
Statistics in Medicine, 15, 887-92.
[36] Walter, S. D. (1992). The analysis of regional patterns in health data. American Journal of
Epidemiology, 136, 730-741.
[37] Whittemore, A., Friend, N., Brown, B. and Holly, E. (1987). A test to detect clusters of
disease. Biometrika, 74, 631-635.
[38] Wrigley, N. (1985). Categorical Data Analysis for Geographers and Environmental Scientists.
Longman, New York.
[Figure 1 plot: rejection rate (vertical axis, 0.00 to 0.25) against η (horizontal axis, 0.0 to 0.8), with curves for Ir, IEBI and IDR.]
Figure 1: Type I error rates of Ir, IDR and IEBI under heterogeneity (α = 0.05).
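[Figure 2 plot, titled "Local Cluster": rejection rate (vertical axis, 0.0 to 1.0) against δ (horizontal axis, −0.5 to 0.5), with curves for IDR without and IDR with the spatial association term.]
Figure 2: Rejection rate of IDR with and without the spatial association term (α = 0.05).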
[Figure 3 plots, four panels: "Residual Plot: Without" and "Residual Plot: With" (deviance residuals against index), and "QQ-plot: Without" and "QQ-plot: With" (sample quantiles against theoretical quantiles).]
Figure 3: Residual plots and QQ-plots in the presence of a local cluster (δ = 0.5).
[Figure 4 plot, titled "Global and Local Trend": rejection rate (vertical axis, 0.0 to 1.0) against δ (horizontal axis, 0.00 to 0.20), with curves for IDR without and IDR with the spatial association term.]
Figure 4: Power functions of IDR with and without the spatial association term.
[Figure 5 plots, four panels: "Residual Plot: Without" and "Residual Plot: With" (deviance residuals against index), and "QQplot: Without" and "QQplot: With" (sample quantiles against theoretical quantiles).]
Figure 5: Residual plots and QQ-plots in the presence of local and global clustering structures.
[Figure 6 map: labeled counties are Madison, St. Clair and St. Louis. Deviance classes: −10.3 to −0.196, −0.196 to 9.908, 9.908 to 20.012, 20.012 to 30.116, 30.116 to 40.22. Scale bar in miles.]
Figure 6: Deviance residuals of the null model for St. Louis homicides.
Table 1: Loglinear model estimates and IDR results for St. Louis homicides: 1988-1993.
Models | β̂1 | p-value (β̂1) | G2 | d.f. | IDR | p-value (IDR)
Null | | | 2944.0 | 77 | 0.1137 | 0.0066
Spatial association I (St. Louis) | 0.6145 | 0 | 334.2 | 76 | −0.0352 | 0.9094
Spatial association II (St. Louis)* | 0.6240 | 0 | 177.2 | 76 | 0.0132 | 0.7283
Ecological covariate (% of Blacks) | 0.0554 | 0 | 301.9 | 76 | 0.0306 | 0.4902
Note: variables captured by β̂1 are in parentheses. Model I assigns 4 to St. Louis county, 2 to St. Clair
county, and 1 to the other adjacent counties; Model II differs by assigning 2.5 to St. Clair county.