13
Research Article Received 8 January 2011, Accepted 13 August 2011 Published online 3 November 2011 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.4404 Spatial scan statistics with overdispersion Tonglin Zhang, a * Zuoyi Zhang a and Ge Lin b The spatial scan statistic has been widely used in spatial disease surveillance and spatial cluster detection for more than a decade. However, overdispersion often presents in real-world data, causing not only violation of the Poisson assumption but also excessive type I errors or false alarms. In order to account for overdispersion, we extend the Poisson-based spatial scan test to a quasi-Poisson-based test. The simulation shows that the proposed method can substantially reduce type I error probabilities in the presence of overdispersion. In a case study of infant mortality in Jiangxi, China, both tests detect a cluster; however, a secondary cluster is identified by only the Poisson-based test. It is recommended that a cluster detected by the Poisson-based scan test should be inter- preted with caution when it is not confirmed by the quasi-Poisson-based test. Copyright © 2011 John Wiley & Sons, Ltd. Keywords: cluster detection; overdispersion; quasi-Poisson model; spatial scan statistics; type I error probabilities 1. Introduction The use of spatial statistics for disease cluster has matured enough that spatial surveillance for dis- ease clusters has become routine in many public health agencies in the USA. Daily cluster detection or surveillance practices often find a cluster or clusters that, after follow-up with a local hospital or health department, prove to be false alarms. Frequent false alarms can cause alarm fatigue, which may lead to missing real cluster signals. Although social and behavior scientists most often use Anselin’s local indicators of spatial associa- tion (LISA) for exploratory spatial analyses [1], Kulldorff’s [2] spatial scan statistic is perhaps the most frequently used for disease surveillance, which is the focus of the current study. Like many other spa- tial statistics, the spatial scan statistic can neither directly incorporate ecological covariates nor account for overdispersion. Although different ways of incorporating ecological covariates have been proposed [3–5], few have proposed to account for overdispersion. Given that the type I error probability can be significantly inflated [6], accounting for overdispersion in the spatial scan test can be extremely helpful in the analysis and surveillance of real-world data. Because spatially varied cases and populations can cause overdispersion, accounting for these effects can, at least, remove them from spatial correlated data. A common way to account for overdispersion is to extend the Poisson model to the negative bino- mial model [7]. However, this method usually implies a specific mechanism for which the relationship between the mean and variance functions is explicit. For spatially autocorrelated data, overdispersion often has many unknown causes, and spatial random effect is one way to account for them. Loh and Zhu offered an MCMC method that adjusts spatial autocorrelation in the computation of the p-value of the spatial scan statistic [6], which is similar to accounting for overall effects by using random effects. However, this method was mainly concerned with the adjustment of the p-value in limited situations. Our approach is to modify the test statistic by a quasi-Poisson model [8], so that overdispersion can be a Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066, USA b Department of Health Services Research and Administration, College of Public Health, University of Nebraska Medical Center, Omaha, NE 68198-4350, USA *Correspondence to: Tonglin Zhang, Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066, USA. E-mail: [email protected] 762 Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Spatial scan statistics with overdispersion

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Spatial scan statistics with overdispersion

Research Article

Received 8 January 2011, Accepted 13 August 2011 Published online 3 November 2011 in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.4404

Spatial scan statistics withoverdispersionTonglin Zhang,a*† Zuoyi Zhanga and Ge Linb

The spatial scan statistic has been widely used in spatial disease surveillance and spatial cluster detection formore than a decade. However, overdispersion often presents in real-world data, causing not only violation of thePoisson assumption but also excessive type I errors or false alarms. In order to account for overdispersion, weextend the Poisson-based spatial scan test to a quasi-Poisson-based test. The simulation shows that the proposedmethod can substantially reduce type I error probabilities in the presence of overdispersion. In a case study ofinfant mortality in Jiangxi, China, both tests detect a cluster; however, a secondary cluster is identified by onlythe Poisson-based test. It is recommended that a cluster detected by the Poisson-based scan test should be inter-preted with caution when it is not confirmed by the quasi-Poisson-based test. Copyright © 2011 John Wiley &Sons, Ltd.

Keywords: cluster detection; overdispersion; quasi-Poisson model; spatial scan statistics; type I errorprobabilities

1. Introduction

The use of spatial statistics for disease cluster has matured enough that spatial surveillance for dis-ease clusters has become routine in many public health agencies in the USA. Daily cluster detection orsurveillance practices often find a cluster or clusters that, after follow-up with a local hospital or healthdepartment, prove to be false alarms. Frequent false alarms can cause alarm fatigue, which may lead tomissing real cluster signals.

Although social and behavior scientists most often use Anselin’s local indicators of spatial associa-tion (LISA) for exploratory spatial analyses [1], Kulldorff’s [2] spatial scan statistic is perhaps the mostfrequently used for disease surveillance, which is the focus of the current study. Like many other spa-tial statistics, the spatial scan statistic can neither directly incorporate ecological covariates nor accountfor overdispersion. Although different ways of incorporating ecological covariates have been proposed[3–5], few have proposed to account for overdispersion. Given that the type I error probability can besignificantly inflated [6], accounting for overdispersion in the spatial scan test can be extremely helpfulin the analysis and surveillance of real-world data. Because spatially varied cases and populations cancause overdispersion, accounting for these effects can, at least, remove them from spatial correlated data.

A common way to account for overdispersion is to extend the Poisson model to the negative bino-mial model [7]. However, this method usually implies a specific mechanism for which the relationshipbetween the mean and variance functions is explicit. For spatially autocorrelated data, overdispersionoften has many unknown causes, and spatial random effect is one way to account for them. Loh andZhu offered an MCMC method that adjusts spatial autocorrelation in the computation of the p-value ofthe spatial scan statistic [6], which is similar to accounting for overall effects by using random effects.However, this method was mainly concerned with the adjustment of the p-value in limited situations.Our approach is to modify the test statistic by a quasi-Poisson model [8], so that overdispersion can be

aDepartment of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066, USAbDepartment of Health Services Research and Administration, College of Public Health, University of Nebraska MedicalCenter, Omaha, NE 68198-4350, USA

*Correspondence to: Tonglin Zhang, Department of Statistics, Purdue University, 250 North University Street, West Lafayette,IN 47907-2066, USA.

†E-mail: [email protected]

762

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 2: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

directly incorporated into the test statistic. We first propose our modifications based on Kulldorff’s spa-tial scan statistic and then assess them numerically via simulation and case studies. Finally, we providesome concluding remarks.

2. Methods

2.1. Kulldorff’s spatial scan test

The spatial scan test uses a moving window (e.g., circle, rectangle, or ellipse) of variable size to identifya set of regions or points that maximizes the likelihood ratio test statistic for the null hypothesis of apurely random Poisson or Bernoulli process [2, 9]. The statistic requires the following: (1) a study areahas a set ofm locations or subregions and (2) each location has a number of cases and the correspondingat-risk population. Let yi be the number of cases and ni be the at-risk population at location i . Assumethat the counts are independent Poisson random variables with the expected value ni�i , where �i is theunit-specific relative risk. The spatial scan statistic is used to detect any spatial cluster of regions in whichthe expected rates within the cluster are significantly higher than the expected rates in the rest of the area.Specifically, let C be the cluster. The test compares the total number of disease counts yc D

Pi2C yi

within C and the total number of counts y Nc DPi2 NC yi outside of C with the corresponding at-risk

populations inside and outside of C , denoted by nc DPi2C ni and n Nc D

Pi2 NC ni , respectively.

Assume that counts are independently Poisson distributed. Let �i D �c if i 2 C and �i D �0 if i 62 C .Then, the likelihood function is

LC .�0; �c/D

mYiD1

nyii

yi Š

! Xi2C

�yic e��cni

!0@Xi 62C

�yi0 e��0ni

1A : (1)

The likelihood ratio statistic is to test the null hypothesis H0.�c D �0) against the alternativehypothesis of H1.�c > �0/,

ƒC Dmax�c>�0 LC .�0; �c/

max�cD�0 LC .�0; �c/D

�yc=nc

y=n

�yc �y Nc=n Ncy=n

�y Nc; (2)

if yc=nc > y Nc=n Nc and ƒC D 1 otherwise, where y DPmiD1 yi and nD

PmiD1 ni . Because C is

unknown, Kulldorff treated it as a member of a collection of candidates of possible spatial clustersC, which leads to the Poisson-based spatial scan statistic being

ƒDmaxC2C

ƒC : (3)

Because the null distribution of ƒ is not analytically tractable, the p-value of ƒ is derived by MonteCarlo simulations.

2.2. Overdispersion and the quasi-Poisson model

Overdispersion occurs when disease count variability exceeds that of the predicted value from thePoisson model [10]. So that overdispersion can be accounted for, the Poisson assumption is oftenmodified by the quasi-Poisson model with the marginal mean and variance relationship, in whichoverdispersion occurs when � > 1:

V.yi /D �E.yi /; i D 1; � � � ; m: (4)

A random effect Poisson mixture model is a flexible way to account for overdispersion [7, 10]. In aPoisson mixture model, the response variable is still Poisson given the mean value. However, the meanitself varies according to some distribution because some relevant explanatory variables are missing inthe dataset. Suppose under a specific distribution of �i , we haveE.�i /D �i and V.�i /D �2i . Conditionalon �i , we have V.yi j�i /DE.yi j�i /D ni�i and E.yi /DEŒE.yi j�i /�D ni�i . Marginally, there is

V.yi /DEŒV.yi j�i /�C V ŒE.yi j�i /�DE.ni�i /C V.ni�i /D ni�i C n2i �2i :

A quasi-Poisson model is derived if

�2i D.� � 1/�i

ni(5)

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

763

Page 3: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

because in this case we have

V.yi /D ni�i C .� � 1/ni�i D ��i D �E.yi /:

Even though the distribution of �i is not specified, a Poisson–gamma mixture model is the natural choice[7] in which �i has the density of

gi .�i I�i ; �/D1

��ni�i��1

� � ni

� � 1

�ni�i��1

�ni�i��1 �1

i e�ni �i��1 : (6)

In addition to the computational advantage, the Poisson–gamma mixture model is easy to interpretbecause its marginal distribution belongs to the exponential family. It has been used in the longitudinalstudy [11], spatial random field model [12], and motor vehicle crash analysis [13].

In general, the density given by Equation (6) is the special case of the general �.˛i ; ˇi / distributionbecause gi .�i I�i ; �/ is derived if we choose ˛i D ni�i=.� � 1/ and ˇi D ni=.� � 1/. In this case, thetrue PMF of yi is

p.yi I�i ; �/D�.yi C

ni�i��1

/

yi Š�.ni�i��1

/��yi�

ni�i��1 .� � 1/yi : (7)

Marginally, yi follows a negative binomial distribution with the dispersion parameter equal to ni=.��1/.As �! 1C, the quasi-Poisson model reduces to the pure Poisson model as �! 1C.

2.3. Derivation of the test statistic

Following the cluster model by Kulldorff [2], we assume that E.yi=ni /D �0 if i 2 NC and E.yi=ni /D�c if i 2 C in the quasi-Poisson model. We consider the hypothesis testing problem of the null hypoth-esis H0 W �c D �0 against the alternative hypothesis H1 W �c > �0. For a given C , the likelihoodfunction is

LC .�0; �c ; �/

DYi 62C

"�.yi C

ni�0��1

/

yi Š�.ni�0��1

/��yi�

ni�0��1 .� � 1/yi

#Yi2C

"�.yi C

ni�c��1

/

yi Š�.ni�c��1

/��yi�

ni�c��1 .� � 1/yi

#

D

(mYiD1

1

yi Š

).� � 1/y��y�

�0n NcC�cnc��1

8<:Yi 62C

�.yi Cni�0��1

/

�.ni�0��1

/

9=;(Yi2C

�.yi Cni�c��1

/

�.ni�c��1

/

):

Under H0 W �c D �0, LC .�0; �c ; �/ becomes

LC .�0; �0; �/D

(mYiD1

1

yi Š

).� � 1/y��y�

n�0��1

(mYiD1

�.yi Cni�0��1

/

�.ni�0��1

/

):

Let O�10;C , O�1c;C , and O�1C be the conditional maximum likelihood estimates (MLEs) of �0, �c , and �

given C under �c > �0. Let O�00 and O�0 be the conditional MLE of �0 and � under �c D �0. Then, thelikelihood ratio statistic is

ƒC;o Dmax�c>�0 LC .�0; �c ; �/

max�cD�0 LC .�0; �c ; �/DLC . O�

10;C ; O�

1c;C ;O�1C /

LC . O�00; O�

00;O�0/

;

and the quasi-Poisson-based spatial scan statistic is

ƒo DmaxC2C

ƒC;o; (8)

where C is still an unknown member of the collection of candidates C in Equation (3). The spatial clusteris significant if the value of ƒo is large.

764

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 4: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

The MLEs O�10;C , O�1c;C , and O�1C can be derived by maximizing LC .�0; �c ; �/ under �c > �0. The

MLEs O�00 and O�0 can be derived by maximizing LC .�0; �0; �/, which is independent of C . Therefore,we need to compute only O�00 and O�00 under H0. However, the MLEs O�10;C , O�1c;C , and O�1C and the MLEs

O�00 and O�0 are not analytically tractable. From a list of numerical fitting procedures [14], we chose thecomputationally efficient Newton–Raphson method. Because it is necessary to compute the first-orderand second-order partial derivatives of the conditional log-likelihood function for a given C , we presentthe details of the Newton–Raphson algorithm for the computation of the MLEs in Appendix A.

As an alternative to the numerical method for MLEs, we propose to approximate ƒo by using themoment estimates (MEs) of unknown parameters, which can lead to closed-form formulae. The formulaecan be derived straightforwardly, and the results are

Q�10;C Dy Nc

n Nc; Q�1c;C D

yc

nc; Q�1C Dmax

8<:1; 1

m� 2

24Xi2 NC

.yi � ni Q�10;C /

2

ni Q�10;C

CXi2C

.yi � ni Q�10;C /

2

ni Q�10;C

359=;

Q�00 Dy

n; Q�0 Dmax

(1;

1

m� 1

mXiD1

.yi � ni Q�00/2

ni Q�00

) (9)

if y Nc=n Nc < yc=nc and

Q�10;C D Q�1c;C D Q�

00 D

y

n; Q�1C D

Q�0 Dmax

(1;

1

m� 1

XiD1

.yi � ni Q�00/2

ni Q�00

)(10)

if y Nc=n Nc > yc=nc .

When the MEs given by Equations (9) and (10) are used as the estimates of the parameters, spatialscan statistic ƒo becomes

Qƒo Dmaxc2CQƒC;o; (11)

where

QƒC;o DLC . Q�

10;C ; Q�

1c;C ;Q�1C /

LC . Q�00; Q�

00;Q�0/

:

As an alternative to ƒo, Qƒo can increase computational efficiency.From these computations,ƒo and Qƒo are expected to be able to detect a cluster when overdispersion is

present. As �! 1C, the likelihood function LC .�0; �c ; �/ in Equation (A.1) converges to LC .�0; �c/in Equation (1), which implies that ƒ and ƒo are asymptotically equivalent as � ! 1C. This can beshown easily by using the Sheffé theorem [15].

2.4. Bootstrap for p-values

We follow Kulldorff’s approach and compute the p-value via the bootstrap method. The null hypothesisis rejected if ƒo or Qƒo is large. We calculate the bootstrap p-value conditionally on the total numberof counts. However, we cannot directly apply Kulldorff’s approach, as the distribution of .y1; � � � ; ym/conditional on y has not been analytically solved. We, therefore, apply the following bootstrap algorithm.

Under H0 W �c D �0, the distributions of �i are independently distributed, and the densities are

gi .�i I�0; �/D1

�.ni�0��1

/

�ni

� � 1

�ni�0��1

�ni�0��1 �1

i e�ni �i��1 ; (12)

for i D 1; � � � ; m. If �1; � � � ; �m are given, then

.y1; � � � ; ym/jy � Multinom�y;�

n1�1PmiD1 ni�i

; � � � ; nm�mPmiD1 ni�i

��: (13)

As in ƒo or Qƒo, �1; � � � ; �m are treated as random variables; we calculate its p-values on the basis ofthe randomly generated �1; � � � ; �m conditional on y. In the following, we only describe the bootstrapalgorithm for ƒo because the bootstrap algorithm for Qƒo is similar.

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

765

Page 5: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

The bootstrap algorithm for ƒo:

(1) On the basis of real observations y1; � � � ; ym, compute O�00 and O�00 . Derive the observed value ofƒo and denote it by ƒ�o .

(2) Generate Q�1; � � � ; Q�m independently from the density given by Equation (12) by letting �0 D O�00and � D O�00 . Then, generate counts Qy1; � � � ; Qym from the conditional distribution displayed byEquation (13).

(3) On the basis of generated observations Qy1; � � � ; Qym, calculate the simulated value ƒo.(4) Repeat steps 2 and 3 J times. Denote ƒo;j as the j th simulated value of ƒo derived in step 3.

Then, the p-value of ƒo equals #fƒ�o > ƒo;j I j D 1; � � � ; J C 1g=.J C 1/, where ƒo;JC1 D ƒ�oand #A represents the number of elements in set A.

All ofƒ,ƒo, and Qƒo have their roles in spatial cluster detection. ƒ is the conventional method that iswidely available and used. ƒo is equivalent to ƒ in the absence of overdispersion. Although ƒo and Qƒoboth account for overdispersion, Qƒo is computationally more efficient thanƒo because of its closed-formformulae for the conditional MLEs.

Note that the proposed method is conditioned on the total observations. An alternative is to directlyand unconditionally generate data from the estimated null distribution. Our preliminary analysisshowed that the behavior of the unconditional method was almost identical to the conditional method.Because our intent was to compare the proposed method with Kulldorff’s method, we only present theconditional method.

3. Simulation and case study

In both simulation and case studies, we used real-world infant mortality data for Jiangxi Province, China.In 2000, Jiangxi Province had the highest infant mortality rate (IMR) in eastern China. In addition, thefemale IMR of 6:02% was 1:33 times higher than the male IMR. The sex ratio at birth (males/females)in Jiangxi Province was also the highest in China at 138=100 in 2000 [16]. The cause and distribution ofhigh IMR were of great concern in China. The Chinese government sent numerous investigative teamsto the area and funded numerous studies on these issues.

In the simulation, we used the number of births within each county as the at-risk population and redis-tributed infant deaths among 99 counties according to our designs. In the case study, we used real birthsand deaths in each county to detect infant mortality clusters, with particular attention to female infants.

3.1. Simulation

We evaluated and compared type I error rates and power functions between Kulldorff’s ƒ and the pro-posed ƒo and Qƒ0 on the basis of 1000 simulation runs. We kept the number of births in each county(ranging from 1,138 to 19,270) in the simulations and redistributed the total number of deaths at theprovincial level to form various clusters. We fixed the shape and the size of the cluster at the center of theprovince and generated clustering effects from no cluster to one strong cluster. Specifically, we inserteda cluster C centered at Xingan with its six adjacent counties (Yushui, Xiajiang, Yongfeng, Fengcheng,Zhangshu, and Le’an), and the total number of infant births within the cluster was 55,354. On the basisof the latitude and longitude, we defined C as the nearest k-neighbors for all counties with k varyingfrom 3 to 30 because of the known cluster size.

We assumed that

E.yi /D �ini D �0.1C ıi /ni D 0:001.1C ıi /ni D

(0:001ni ; if i 62 C;

0:001.1C ı/ni ; if i 2 C;(14)

where ıi D 0 if i 62 C and ıi D ı if i 2 C . We used ı to measure the strength of the cluster from 0 to2:0, with ı D 0 indicating no spatial cluster effect and ı D 2:0 indicating strong spatial cluster effect.

We used � to measure the strength of overdispersion. When � D 1, we generated data from the purePoisson model with expected value given in Equation (14). For a selected � > 1, we generated thequasi-Poisson model in Equation (4) according to the following two steps:

766

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 6: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

(1) Generating the conditional expected value �i from the distribution described by Equation (6)with �i .�i I ı; �/ being the PDF of �.0:001ni=.� � 1/; ni=.� � 1// if i 62 C and the PDF of�.0:001.1C ı/ni=.� � 1/; ni=.� � 1// if i 2 C ;

(2) Generating independent Poisson random variables yi with conditional expected value �i .

It is clear that the random counts yi generated from the above steps satisfy the quasi-Poisson model inEquation (14). There is no overdispersion if � D 1. As � becomes large, the effect of overdispersion isalso large.

In the simulation, we compared type I error probabilities and power functions of ƒ, ƒo, and Qƒo foreach selected ı (Figure 1). We generated random quasi-Poisson counts and calculated bootstrap p-valuesfor ƒ, ƒo, and Qƒo from 999 random replications of the null hypothesis (ı D 1) and alternative hypothe-sis (ı > 0) with respect to whether overdispersion was presented (� > 1) or not (� D 1). In addition, forall detected clusters with significant p-values, we assessed their location specificity by the percentage oftimes that they were identical to the designed cluster in terms of the center and members.

When there was no overdispersion (� D 1), the type I error probabilities and power functions of ƒ,ƒo, and Qƒo were close to each other. All of them had about 5% rejection rates when there was no cluster-ing effect: 5:3% for ƒ, 3:4% for ƒo, and 4:3% for Qƒo, respectively. As ı increased, the power functioncurves increased rapidly in the same manner. They were all able to detect the existence of clusters whenı was greater than 0:35. When ı reached 2:0, the power functions of all three test statistics were equalto 100%, indicating the existence of a strong cluster.

In the selected overdispersion effects (� > 1), bothƒo and Qƒo performed identically, and their behav-iors diverged further from that ofƒ as the overdispersion effects increased. The type I error probabilitiesof Kulldorff’s ƒ were consistently higher than 5% even when the clustering effect was weak or non-existent. When ı D 0, we found that the rejection rates of ƒ were 15:1% when � D 1:25, 24:4% when� D 1:5, and 55:1% when � D 2:0. In contrast, the type I error probabilities of ƒo and Qƒo were alwaysaround 5% when ı D 0. Regardless of � values, when � D 1:25, the rejection rates of ƒo and Qƒo were5:4% and 4:5%, respectively; when � D 1:5, the rejection rates were 4:7% and 4:8%, respectively; when� D 2:0, the rejection rates were 4:0% and 4:6%, respectively. These results suggested that the type Ierror probabilities of ƒo and Qƒo were not significantly affected by overdispersion effects.

It should be noted that even though the powers of ƒ, ƒo, and Qƒo were comparable in the absence ofoverdispersion, the powers of the latter two were lower than that of ƒ in the presence of overdispersion.

0.0 0.5 1.0 1.5

0.0

0.2

0.4

0.6

0.8

1.0

φ=1

δ

Pow

er F

unct

ion

ΛΛo

Λ~o

0.0 0.5 1.0 1.5

0.2

0.4

0.6

0.8

1.0

φ=1.25

δ

Pow

er F

unct

ion

ΛΛo

Λ~o

0.0 0.5 1.0 1.5

0.2

0.4

0.6

0.8

1.0

φ=1.5

δ

Pow

er F

unct

ion

ΛΛo

Λ~o

0.0 0.5 1.0 1.5

0.2

0.4

0.6

0.8

1.0

φ=2

δ

Pow

er F

unct

ion

ΛΛo

Λ~o

Figure 1. Simulations of power functions as functions of ı for selected �.

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

767

Page 7: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

This can be explained by the fact that ƒ does not account for overdispersion but ƒo or Qƒo does. Fur-thermore, as overdispersion effects became stronger, the powers ofƒo and Qƒo became weaker primarilybecause of extra noise generated by overdispersion effects that cannot be fully accounted for.

In assessing location specificities among the three test statistics (Table I), we found that the percent-ages of correct specifications were comparable. However, because the rejection rate ofƒ was high when� was large and ı was small, a false alarm rate of over 50% (e.g., � D 2) was associated with 99%mis-identified cluster centers.

In summary, when there was no overdispersion, the type I error probabilities and power functions ofthe three scan tests were comparable. When there was an overdispersion, the behaviors of ƒo and Qƒowere almost identical, and they effectively removed the relatively high type I error rate of ƒ. Althoughthe location specificities among the three test statistics were also comparable, high type I error rates wereassociated with extremely highly mis-specified cluster locations.

3.2. Case study

As mentioned earlier, Jiangxi province had the highest IMR, especially for female infants, which hascaused great concern among academics and the government. In this case study, we attempt to locatethe geographic source of elevated risks by county or county-level city within the province. Jiangxi has99 county-level units, with a mix of mountains, hills, and plains. The total population of the provincewas 40:4 million in 2000, with 99% being Han ethnic majority. The numbers of infant births and deathsfrom November 1, 1999, to October 31, 2000, were 26,494 and 632,717, respectively. The provincial-level IMR was 41:8 per 1000. The county-level IMR varied from 4:5 per 1000 to over 100 per 1000(Figure 2).

We analyzed data separately for male infants, female infants, and male and female infants combinedand checked for consistencies among the three test statistics of ƒ, ƒo, and Qƒ0. Following Kulldorff, wedefined C as the nearest k-neighbors for all counties and devised a procedure to let k vary from 2 to 49so that C contains at least two counties but no more than 50% of counties. The procedure was to detectall possible clusters in the dataset, and it kept running until the detected cluster was not significant. Werestricted the clusters from overlapping. When scanned for the next cluster, we updated the expectedvalue of disease count for each county according to its estimated disease rate at each stage of the clus-ter detection. The parameter estimates, values of test statistics, and their bootstrap p-values for the firstcluster, second cluster, etc., were computed sequentially. If a cluster was detected by the three tests, itwas a confirmed cluster. If a cluster was detected by only the Poisson-based scan test, then the existenceof the cluster should be interpreted with caution.

The results from ƒ for overall infant mortality indicated the existence of two clusters, with somediscrepancies among the three tests (Table II). The first cluster was fairly consistent among ƒ, ƒo, andQƒo. Each indicated a cluster centered in Taihe County, which is the center of an impoverished hillyregion that has had severe soil erosion. The result from ƒ pointed to a cluster of 11 counties (Figure 3),whereas the results from ƒo and Qƒo covered additional counties (Gan, XiaJiang, and AnFu). The totalinfant deaths within the 11-county cluster were 6,344, with an IMR of 95:5 per 1000 births, which was2:28 times the provincial rate. The second cluster detected by ƒ pointed to a 13-county region. Thetotal infant deaths within the cluster were 4,892, with an IMR of 53:8 per 1000 births, which was 1:29times the provincial rate. It should be noted that the average IMR outside of the two clusters was 3:21%,which would increase the ratio of IMRs from the inside to the outside of the clusters. Such an effect canalso be seen from the values of O�c and O�0 for the second clusters. However, because of the presence of

Table I. Percentage of correctly identified cluster centers given significant p-values.

� D 1 � D 1:25 � D 1:5 � D 2:0

ı ƒ ƒo Qƒo ƒ ƒo Qƒo ƒ ƒo Qƒo ƒ ƒo Qƒo

0 1:1 1:3 0:7 1:4 1:2 1:1 0:7 1:0 1:0 0:8 0:7 0:7

0:5 68:6 62:3 66:9 54:5 53:4 53:6 48:0 46:1 45:9 34:7 37:7 37:4

1:0 96:7 97:7 97:3 94:6 95:4 95:7 93:4 91:8 91:8 83:9 85:6 85:4

1:5 99:7 99:8 100 99:8 99:3 99:3 99:3 99:2 99:2 97:3 98:0 98:0

2:0 100:0 100:0 100:0 100:0 100:0 100:0 99:7 100:0 100:0 99:6 99:4 99:4

768

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 8: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

Figure 2. Infant mortality rate of Jiangxi province in China, 2000.

Table II. Spatial clusters of infant mortality in Jiangxi: total (T), female (F) and male (M).

First cluster Second cluster

Value P -value . O�c ; O�0; O�/ Value P -value . O�c ; O�0; O�/

2 logƒ 3897:86 0:001 .0:088; 0:035; 1/ 896:05 0:001 .0:063; 0:039; 1/

T 2 logƒo 30:49 0:001 .0:090; 0:036; 79:0/ 11:64 0:141 .0:064; 0:039; 66:9/

2 log Qƒo 32:52 0:001 .0:095; 0:035; 75:2/ 11:59 0:142 .0:063; 0:039; 66:4/

2 logƒ 2880:02 0:001 .0:141; 0:051; 1/ 697:45 0:001 .0:093; 0:057; 1/

F 2 logƒo 26:92 0:001 .0:131; 0:052; 65:5/ 12:78 0:076 .0:096; 0:056; 57:1/

2 log Qƒo 27:91 0:001 .0:141; 0:051; 60:0/ 12:94 0:066 .0:093; 0:056; 53:9/

2 logƒ 983:30 0:001 .0:055; 0:023; 1/ 194:74 0:001 .0:036; 0:024; 1/

M 2 logƒo 24:79 0:001 .0:052; 0:023; 27:2/ 4:83 0:942 .0:034; 0:025; 25:7/

2 log Qƒo 28:56 0:001 .0:055; 0:023; 32:2/ 4:99 0:929 .0:036; 0:024; 29:5/

overdispersion, neither ƒo nor Qƒo indicated the existence of the cluster. For this reason, the suggestedsecond cluster should be interpreted cautiously.

The estimates of � for both the first and second clusters were very large, indicating significant overdis-persion. Even after both clusters were accounted for, the estimate was still O� > 60. Even though the thirdcluster was detected by ƒ, the non-significant results from both the second and third clusters from ƒoand Qƒo convinced us that we should not include it. Therefore, we colored the first cluster red, as it wasconfirmed by all methods, and we colored the second cluster orange to indicate caution.

Finally, there was little difference between male and female samples in terms of clustered patterns.The first clusters detected byƒ for the male and female samples were identical, not only to each other butalso to the first cluster identified by the combined sample above. The second male cluster was also iden-tical to the one from the combined sample, whereas the female cluster had two fewer counties. Becausethe second cluster was not significant because of overdispersion, we gave it less credence. We there-fore concluded that there was a spatial cluster of infant mortality in Jiangxi centered in Taihe County.

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

769

Page 9: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

Figure 3. Cluster of infant mortality of Jiangxi province in China, 2000.

This cluster was proportionally affected by both male and female samples. High female infant mortalitywas common among all counties. When female infant mortality was high in clustered counties, maleIMR was also high.

4. Concluding remarks

In this article, we have extended Kulldorff’s spatial scan test to the quasi-Poisson-based spatial scantest to account for overdispersion. We used both ƒo and Qƒo. The former used the MLE that mimicsKulldorff’s method with substantially increased computational intensity. The latter used the ME that hasthe same computational efficiency as Kulldorff’s method, but it is an approximation to the MLE method.In the context of spatial scan tests for simulated data and a real-world example, we found that both ƒoand Qƒo were effective. In the absence of overdispersion, the performance of bothƒo and Qƒo were almostequivalent to the performance of Kulldorff’sƒ. In the presence of overdispersion, bothƒo and Qƒo wereable to significantly reduce type I error probabilities. In almost all situations in the numerical study, thebehavior of Qƒo was almost identical to that of ƒo.

In the deployment and the operation of real-world disease surveillance, both speed and false alarm rateshould be taken into consideration. Because Kulldorff’s ƒ was not designed to account for overdisper-sion and the computation of MLE ofƒo was time consuming, we recommend using Qƒo as a complementto ƒ. In our evaluations, Qƒo has the computational efficiency of ƒ, and the statistical performance ofƒo. Although the proposed method was for spatial data, it can also be applied to spatial econometric [17]and ecological models [18]. It is particularly relevant for the bootstrap method, when many real-worldproblems cannot be readily approximated by the asymptotical distribution [19].

However, our finding does not mean that Qƒo should always accompany Kulldorff’s spatial scan statis-tic. In many retrospective spatial cluster analyses, the existence of a cluster is often suspected. When oneor two clusters are identified by the spatial scan statistic, they tend to be accurate in terms of both thestatistical power and spatial specificity. An additional weaker cluster, even though it might cause a falsealarm, may still be worth investigating. Only in the context of surveillance as often as daily, weekly,or monthly in public health operations do frequent false alarms due to overdispersion cause problems.

770

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 10: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

As shown in the simulation, the presence of moderate overdispersion was associated with more than50% type I error probabilities and 99% of mis-identified cluster centers. If a public health surveillancespecialist has to follow-up every lead that suggests the existence of a cluster, 50% of calls will be falsealarms. In such situations, Qƒo is a nice complement to Kulldorff’s spatial scan statistic.

Overdispersion can be caused by spatially correlated disease risks [20], which usually represents aglobal clustering tendency. When spatial correlation between disease risks is considered, a correlatedrandom effect term can be used by fitting the underline statistical model with diagnostic methods ofmodel residuals [21]. A range of diagnostic methods to assess the true global effect can be found inthe literature [22]. In general, both type I error probability and the power function of the spatial scantest can be significantly influenced by the presence of a global effect, and accounting for the globaltrend in the spatial scan test could partially reduce the type I error probability [6]. Our simulationstudies suggested that the power function would be significantly reduced if the type I error probabilitywere well controlled.

APPENDIX A. Numerical computation

Taking the logarithm of Equation (A.1), we derive the conditional log-likelihood function of �0, �c , and� as

`C .�0; �c ; �/D log

mYiD1

1

yi Š

!C y log

�� � 1

�CXi 62C

�log�

�yi C

ni�0

� � 1

�� log�

�ni�0

� � 1

��

CXi2C

�log�

�yi C

ni�c

� � 1

�� log�

�ni�c

� � 1

���.�0n Nc C�cnc/ log.�/

� � 1:

(A.1)Because its partial derivatives can be analytically derived, we simply display the results below and skipthe steps of the derivation:

@`C .�0; �c ; �/

@�0DXi 62C

ni

� � 1

�yi C

ni�0

� � 1

��

�ni�0

� � 1

���n Nc log.�/

� � 1;

@`C .�0; �c ; �/

@�cDXi2C

ni

� � 1

�yi C

ni�c

� � 1

��

�ni�c

� � 1

���nc log.�/

� � 1;

@`C .�0; �c ; �/

@�D

y

�.� � 1/�Xi 62C

ni�0

.� � 1/2

�yi C

ni�0

� � 1

��

�ni�0

� � 1

��

�Xi2C

ni�c

.� � 1/2

�yiC

ni�c

� � 1

��

�ni�c

� � 1

���.n Nc�0C nc�c/Œ� � 1� � log.�/�

�.� � 1/2;

(A.2)

@2`C .�0; �c ; �/

@�20DXi 62C

n2i.� � 1/2

� 1.yi C

ni�0

� � 1/� 1.

ni�0

� � 1/

�;

@2`C .�0; �c ; �/

@�0@�cD0;

@2`C .�0; �c ; �/

@�0@�D�

Xi 62C

ni

.� � 1/2Œ .yi C

ni�0

� � 1/� .

ni�0

� � 1/�

�Xi 62C

n2i �0

.� � 1/3Œ 1.yi C

ni�0

� � 1/� 1.

ni�0

�/��

n Nc Œ� � 1� � log.�/�

�.� � 1/2;

(A.3)

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

771

Page 11: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

@2`C .�0; �c ; �/

@�2cDXi2C

n2i.� � 1/2

Œ 1.yi Cni�c

� � 1/� 1.

ni�c

� � 1/�;

@2`C .�0; �c ; �/

@�c@�D�

Xi2C

ni

.� � 1/2Œ .yi C

ni�c

� � 1/� .

ni�c

� � 1/�

�Xi2C

n2i �c

.� � 1/3Œ 1.yi C

ni�c

� � 1/� 1.

ni�c

�/��

nc Œ� � 1� � log.�/�

�.� � 1/2;

(A.4)

@2`C .�0; �c ; �/

@�2Dy.1� 2�/

�2.� � 1/2CXi 62C

2ni�0

.� � 1/3Œ .yi C

ni�0

� � 1/� .

ni�0

� � 1/�C

Xi 62C

n2i �20

.� � 1/4

� Œ 1.yi Cni�0

� � 1/� 1.

ni�0

� � 1/�

CXi2C

2ni�c

.� � 1/3Œ .yi C

ni�c

� � 1/� .

ni�c

� � 1/�

CXi2C

n2i �2c

.� � 1/4Œ 1.yi C

ni�c

� � 1/� 1.

ni�c

� � 1/�

�.n Nc�0C nc�c/Œ2�

2� log.�/� .� � 1/.3� � 1/�

�2.� � 1/3;

(A.5)

where .t/ D d log�.t/=dt is the digamma function and 1.t/ D d2 log�.t/=dt2 D d .t/=dt is thetrigamma function [23].

To implement the Newton–Raphson algorithm, we need the conditional gradient and the Hessianmatrix of the log-likelihood function. Under �c > �0, they are

r`C;1.�0; �c ; �/D

0BBBBB@

@`C .�0;�c ;�/@�0

@`C .�0;�c ;�/@�c

@`C .�0;�c ;�/@�

1CCCCCA

and

HC;1.�0; �c ; �/D

0BBBBBB@

@2`C .�0;�c ;�/

@�20

@2`C .�0;�c ;�/@�0@�c

@2`C .�0;�c ;�/@�0@�

@2`C .�0;�c ;�/@�c@�0

@2`C .�0;�c ;�/

@�2c

@2`C .�0;�c ;�/@�c@�

@2`C .�0;�c ;�/@�@�0

@`C .�0;�c ;�/@�c

@`C .�0;�c ;�/@�

1CCCCCCA;

respectively. Under �c D �0, they are

r`C;0.�0; �/D Ar`C;1.�0; �c ; �/

and

H0.�0; �/D AHC;1.�0; �0; �/A0;

respectively, where AD

0@ 1 1 0

0 0 1

1A and A0 is the transpose of A.

772

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

Page 12: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

Under �c D �0, the conditional MLE �10;C and �0C maximizes `C .�0; �0; �/, or one of the roots ofr`C;0.�0; �/D 0. Under �c > �0, the conditional MLE �10;C , �1c;C , and �1C maximizes `C .�0; �c ; �/,which is another root of r`C;1.�0; �c ; �/D 0 that satisfies the constraint �c > �0. We, therefore, havethe following numerical algorithm.

The Newton–Raphson algorithm

(1) At step 0, provide initial guess values for �1;.0/0;C , �1;.0/c;C , and �1;.0/C for O�10;C , O�1c;C , and O�1C ,

respectively, and initial guess values for �0;.0/0 and �0;.0/ for O�00 and O�0, respectively.(2) At step kC 1, update new values for O�10;C , O�1c;C , and O�1C by

0BBBBB@

�1;.kC1/0;C

�1;.kC1/c;C

�1;.kC1/C

1CCCCCAD

0BBBBB@

�1;.k/0;C

�1;.k/c;C

�1;.k/C

1CCCCCA�H

�1C;1.�

1;.k/0;C ; �

1;.k/c;C ; �

1;.k/C /r`C;1.�

1;.k/0;C ; �

1;.k/C /

and new values for O�00;C and O�00;C by

0@ �

0;.kC1/0

�0;.kC1/

1AD

0@ �

0;.k/0

�0;.k/

1A�H�10 .�

0;.k/0 ; �0;.k//r`0.�

0;.k/0 ; �

0;.k/0 /:

Iterate this step until convergence.

(3) Denote �1;.1/0;C , �1;.1/c;C , and �1;.1/c as the final results for O�10;C , O�1c;C , and O�1C , respectively. Denote

�0;.1/0 and �0;1 as the final results for O�00 and O�0, respectively.

(a) Under �c D �0, the algorithm reports O�00 D �0;.1/0 and O�0 D �0;.1/ as the MLE of �0 and �.

(b) Under �c > �0, the algorithm reports the following: (i) O�10;C D �1;.1/0;C , O�1c;C D �

1;.1/c;C , and

O�1C D �1;.1/C as the MLE of �0, �c , and � if �1;.1/c;C > �1;.1/0;C or (ii) O�10;C D O�

1c;C D �

0;.1/0;C

and O�1C D �0;.1/C as the MLE of �0, �c , and � otherwise.

Because `C .�0; �c ; �/ and `C .�0; �0; �/ may have more than one local maximum, the initial guessvalues are important in the algorithm. We used the MEs as the initial guess values. It implies that�1;.0/0;C D Q�10;C , �1;.0/c;C D Q�1c;C , �1;.0/C D Q�1C , �0;.0/0;C D Q�00;C and �0;.0/C D Q�0C . We showed in our

numerical evaluations that these initial guess values were always able to attain the global maximumof the likelihood function and that the Newton-Raphson algorithm converged to the true MLE of theunknown parameters.

Acknowledgements

We appreciate the helpful comments from the anonymous referees and the associate editor, which significantlyimproved the quality of this paper. This research was funded by the US National Science Foundation GrantsSES-07-52657 (Zhang) and SES-07-52019 (Lin).

References1. Anselin L. Local indicators of spatial association-LISA. Geographic Analysis 1995; 27:93–115.2. Kulldorff M. A spatial scan statistic. Communications in Statistics, Theory and Methods 1997; 26:1481–1496.3. Jung I. A generalized linear models approach to spatial scan statistics for covariate adjustment. Statistics in Medicine

2009; 28:1131–1143.4. Kulldorff M. Scan statistics for geographical disease surveillance: an overview. In Spatial and Syndromic Surveillance for

Public Health, Lawson AB, Kleinman K (eds). Wiley: New York, 2005; 113–131.5. Zhang T, Lin G. Scan statistics in loglinear models. Computational Statistics and Data Analysis 2009; 53:2851–2858.6. Loh J, Zhu Z. Accounting for spatial correlation in the scan statistic. Annals of Applied Statistics 2007; 2:560–584.

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774

773

Page 13: Spatial scan statistics with overdispersion

T. ZHANG, Z. ZHANG AND G. LIN

7. Agresti A. Categorical Data Analysis, 2nd edn. Wiley: New York, 2002; 559–560.8. Gelfand AE, Dalal SR. A note on overdispersed exponential families. Biometrika 1990; 77:55–64.9. Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Statistics in Medicine 2006; 25:3929–3943.

10. McCullagh P. Quasi-likelihood functions. Annals of Statistics 1983; 11:59–67.11. Stukel TA. Comparison of methods for the analysis of longitudinal interval count data. Statistics in Medicine 1993;

12:1339–1351.12. Best NG, Ickstadt K, Wolpert RL. Spatial Poisson regression for health and exposure data measured at disparate

resolutions. Journal of the American Statistical Association 2000; 95:1076–1088.13. Lord D, Part PY. Investigating the effects of the fixed and varying dispersion parameters of Poisson–gamma models on

empirical Bayes estimates. Accident Analysis and Prevention 2008; 40:1441–1457.14. Lawless JF. Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics 1987; 3:209–225.15. Billingsley P. Probability and Measure. Wiley: New York, 1995; 215–216.16. Lai D. Sex ratio at birth and infant mortality rate in China: an empirical study. Social Indicators Research 2005;

30:313–326.17. Crepon B, Duguet E. Research and development, competition and innovation pseudo-maximum likelihood and simu-

lated maximum likelihood methods applied to count data models with heterogeneity. Journal of Econometrics 1997;79:355–378.

18. Alexander N, Moyeed R, Stander J. Spatial modeling of individual-level parasite counts using the negative binomialdistribution. Biostatistics 2000; 1:453–463.

19. Potts J, Elith J. Comparing species abundance models. Ecological Modelling 2006; 199:153–163.20. Green PJ, Richardson S. Hidden Markov models and disease mapping. Journal of the American Statistical Association

2002; 97:1055–1070.21. Lawson AB. Disease cluster detection: a critique and a Bayesian proposal. Statistics in Medicine 2006; 25:897–916.22. Lawson AB, Clark A. Spatial mixture relative risk models applied to disease mapping. Statistics in Medicine 2002;

21:359–370.23. Abramowitz M, Stegun IA. Handbook of Mathematical Functions. Dover Publications Inc.: New York, 1964; 253–194.

774

Copyright © 2011 John Wiley & Sons, Ltd. Statist. Med. 2012, 31 762–774