23
In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes. New Zealand Veterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference, Christchurch, 23-28 June 1996; 83-105. Issues related to handling of spatial data D.U. PFEIFFER Department of Veterinary Clinical Sciences Massey University, Palmerston North, New Zealand Introduction Epidemiological analyses are mainly conducted using data which does not include or take account of spatial relationships between the observations studied. More recently, the need for spatial data analysis has been pointed out as it may provide additional insight when attempting to reveal epidemiological cause-effect relationships (Rothman 1990). While the theory of spatial analysis has been an area of research interest for many years, only the advent of personal computers made the techniques more easily accessible to epidemiologists. Still, even recent text books on medical or veterinary epidemiology do not provide more than basic introductions to the subject area of spatial data analysis. This is could seem surprising, as place has always been seen as part of classic epidemiological triad of time, person, place. Spatial Data The distinction between spatial and non-spatial data can easily become the subject of extensive discussions. In general, observations for which absolute location and/or relative positioning (spatial arrangement) is taken into account can be referred to as spatial data (Anselin 1992). It can be subdivided into two major categories representing discrete and continuous phenomena. Based on the former classification, which has also been called entity view, spatial phenomena are described using zero dimensional objects such as points, one dimensional objects such as lines or two dimensional objects such as areas. If space is described using continuous phenomena, such as in the case of temperature or topography, this has also been described as field view. In practice, the latter is usually measured based on sampling discrete entities such as locations in space. The entity view allows spatial objects to have attributes. Spatial analysis is typically aimed at the spatial arrangement of the observational units, but can also take into account attribute information. An analysis conducted only on the basis of the attributes of the observational units ignoring the spatial relationships is not considered a spatial data analysis. Spatial Data Analysis The methods used in spatial data analysis can be broadly categorized in those concerned with visualizing data, those for exploratory data analysis and methods for development of statistical models (Bailey and Gatrell 1995). During most analyses, a combination of techniques will be used with the data first being displayed visually, followed by exploration of possible patterns and possibly modeling. Data visualization One of the first steps in any data analysis should be an inspection of the data. Visual displays of information using plots or maps will provide the epidemiologist with the basis for generating hypotheses and, if required, an assessment of the fit or predictive ability of models. Over the last couple of years interactive computer packages have been developed which allow dynamic

Issues Related to Spatial Data

  • Upload
    paula

  • View
    11

  • Download
    1

Embed Size (px)

DESCRIPTION

Estatistica Espacial

Citation preview

Page 1: Issues Related to Spatial Data

In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes. New ZealandVeterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference,Christchurch, 23-28 June 1996; 83-105.

Issues related to handling of spatial data

D.U. PFEIFFER

Department of Veterinary Clinical SciencesMassey University, Palmerston North, New Zealand

Introduction

Epidemiological analyses are mainly conducted using data which does not include or takeaccount of spatial relationships between the observations studied. More recently, the need forspatial data analysis has been pointed out as it may provide additional insight when attemptingto reveal epidemiological cause-effect relationships (Rothman 1990). While the theory ofspatial analysis has been an area of research interest for many years, only the advent ofpersonal computers made the techniques more easily accessible to epidemiologists. Still, evenrecent text books on medical or veterinary epidemiology do not provide more than basicintroductions to the subject area of spatial data analysis. This is could seem surprising, as placehas always been seen as part of classic epidemiological triad of time, person, place.

Spatial Data

The distinction between spatial and non-spatial data can easily become the subject of extensivediscussions. In general, observations for which absolute location and/or relative positioning(spatial arrangement) is taken into account can be referred to as spatial data (Anselin 1992). Itcan be subdivided into two major categories representing discrete and continuous phenomena.Based on the former classification, which has also been called entity view, spatial phenomenaare described using zero dimensional objects such as points, one dimensional objects such aslines or two dimensional objects such as areas. If space is described using continuousphenomena, such as in the case of temperature or topography, this has also been described asfield view. In practice, the latter is usually measured based on sampling discrete entities such aslocations in space.

The entity view allows spatial objects to have attributes. Spatial analysis is typically aimedat the spatial arrangement of the observational units, but can also take into account attributeinformation. An analysis conducted only on the basis of the attributes of the observational unitsignoring the spatial relationships is not considered a spatial data analysis.

Spatial Data Analysis

The methods used in spatial data analysis can be broadly categorized in those concerned withvisualizing data, those for exploratory data analysis and methods for development of statisticalmodels (Bailey and Gatrell 1995). During most analyses, a combination of techniques will beused with the data first being displayed visually, followed by exploration of possible patternsand possibly modeling.

Data visualization

One of the first steps in any data analysis should be an inspection of the data. Visual displays ofinformation using plots or maps will provide the epidemiologist with the basis for generatinghypotheses and, if required, an assessment of the fit or predictive ability of models. Over thelast couple of years interactive computer packages have been developed which allow dynamic

Page 2: Issues Related to Spatial Data

84

displays of the data. Geographic information systems can be used to produce maps and theyallow the exploration of spatial patterns in an interactive fashion.

Exploratory data analysis

Data exploration is aimed at developing hypotheses and makes extensive use of graphicalviews of the data such as maps or scatter plots. Exploratory data analysis makes fewassumptions about the data and should be robust to extreme data values. Simple analyticalmodels can also be used in this analysis phase.

Models of spatial data

For this type of spatial data analysis specific hypotheses are formally tested or predictions aremade using statistical models of the data. Modeling of spatial phenomena has to incorporatethe possibility of spatial dependence in order to provide a true representation of the existingeffects. Such spatial effects can be either large scale trends or local effects. The first is alsocalled a first order effect and it describes overall variation in the mean value of a parametersuch as rainfall. The second which is named a second order effect is produced by spatialdependence and represents the tendency of neighboring values to follow each other in terms oftheir deviation from the mean. This can for example be the case with the incidence of aninfectious animal disease affecting animals on farm properties. First order effects can be readilymodeled by standard regression models. The presence of second order effects violates theindependence assumption of standard statistical analysis techniques, and appropriate analysistechniques will have to take account of the covariance structure in the data giving rise to theselocal effects.

Often spatial data are modeled as stationary spatial processes which assumes that whilethere may be dependence between neighboring observations, it is independent from absolutelocation. A spatial process is isotropic, if in a stationary process covariance betweenobservations at different locations depends only on the distance but not on direction. Non-stationary data is almost impossible to model as most locations will require different parametersets. Therefore, most spatial modeling procedures begin with first identifying a trend in meanvalue and then modeling the residuals from this trend as a stationary process.

With any of these models it has to be kept in mind that they are abstractions of reality, andfirst or second order effects are artifacts of the modeler. Bailey and Gatrell (1995) concludethat models can be at best 'not wrong', rather than 'right'. They add that the analyst shouldalways involve judgment and intuition in statistical modeling.

Problems in Spatial Data Analysis

A major factor influencing spatial data analysis is the geographical scale at which the data isbeing analyzed. It may be possible to identify specific non-random patterns at a local levelwhich when looked at from a national level turn into random variations. Another problem canbe that many spatial data sets are based on irregularly shaped area units or there may bedirectional effects. Proximity or neighborhood also may be more difficult to clearly define thanfor example in time-series analysis. Any type of spatial analysis will be subject to some degreeof edge effect where area units on the map boundary do have neighbors only in one direction.Many data analyses have to be conducted with observations based on information summarizedat a particular spatial aggregation level such as at the veterinary district. Inferences from suchanalyses may only be correct if used at the same level of aggregation. This situation has alsobeen called the modifiable areal unit problem.

Page 3: Issues Related to Spatial Data

85

Methods of Spatial Data Analysis

Methods used in spatial data analysis can be divided according to the three main categories ofdata to be analyzed. They are point patterns, spatially continuous and area data.

Point patterns

Spatial point patterns are based on the coordinates of events such as the locations of outbreaksof a disease. It is also possible that they include attribute information such as the time ofoutbreak occurrence. Data on point patterns can be based on a complete map of all pointevents or a sampled point pattern. The basic interest of a spatial point pattern analysis will beto detect whether it is distributed at random or represents a clustered or regular pattern. It isimportant to recognize that the stochastic process studied relates to the locations where eventsare occurring. A spatial point pattern can be quantified in terms of the intensity of the processusing its first order properties, measured as the mean number of events per unit area. Secondorder properties or spatial dependency are analyzed on the basis of the relationship betweenpairs of points or areas. The latter is typically interpreted as analysis for clustering.

Visualization of spatial point patterns

The method used most frequently to present spatial point patterns is a dot map. It is generallydifficult to assess randomness of a pattern from visual inspection of such a map. It becomesimportant to take account of the population at risk when for example inspecting a dot map ofdisease outbreaks. One method for representing this difference in population at risk is to use acartogram, where the size of the areas is geometrically transformed proportional to thecorresponding population value.

In a case-control study of tuberculosis breakdown in cattle herds from the Waikato regionof New Zealand all cattle herds which had broken down with tuberculosis infection werecompared with a random sample of cattle herds free from infection. Figure 1 presents a seriesof dot maps showing the locations of cases and random controls. Inspection of the map withlocations of the case herds could give the observer the impression that they are clustered.Without inspection of the distribution of random control herds it becomes difficult todifferentiate whether clustering only occurred in case herds or if the distribution of all cattleherds in this area is inherently non-random. In this situation it clearly is the case that cattleherds are not randomly distributed throughout the study region. But there is also someclustering of tuberculosis breakdowns.

Random Control Herds Case Herds Case and Control Herds

Figure 1: Dot maps of the locations of herds from a case-control study of tuberculosis breakdown in New Zealand cattleherds

Page 4: Issues Related to Spatial Data

86

Exploratory analysis of spatial point patterns

Techniques for exploratory spatial analysis of point patterns are aimed at deriving summarystatistics or plots of the observed distribution to investigate specific hypotheses. The methodsused are examining first or second order effects.

First order effects for point patterns can be examined with two techniques - quadratcounts and kernel estimation. The quadrat methods involve dividing the area into sub-regionsof equal size - quadrats and produce a summary statistic on the basis of the number of countsper quadrat. The counts are then divided by the size of the area. These techniques give anindication of the variation of the intensity of the underlying process in space. The disadvantageof the techniques is that they aggregate the information into area type data which can result inloss of information. Kernel estimation is a technique which uses the original point locations toproduce a smooth bivariate histogram of intensity. It has been used for example for homerange estimation in wildlife ecology (Izenman 1991).

Second order properties of point patterns can be investigated using the distances betweenthe points - particularly nearest-neighbor distances. The latter can be estimated using twotechniques - either the distance between a randomly selected event and the nearest neighboringevent or between a randomly selected location in space and the nearest event. Spatialdependence can be investigated by visual examination of the probability distributions of theobserved nearest -neighbor distances. Clustered events would show a steep part of thedistribution function with lower values, whereas regularity would be indicated by steepness ofthe curve with higher values. The k - function will allow taking into account not just thenearest events. It depends on the assumption of an underlying isotropic process and isproblematic to use in the presence of significant first order effects.

Modeling of spatial point patterns

Spatial point modeling techniques are aimed at explaining an observed point pattern, andtypically involve comparison with the model of complete spatial randomness (CSR). A pointpattern generated by a random spatial process should follow a homogeneous Poisson process.This implies that every event has an equal probability of occurring at any position in the studyarea and occurrence is independent of the location of any other event, hence the absence offirst order and second order effects. It is against this basic model that the analysis will assesswhether the point process is regular, clustered or random. There are a range of methodsavailable to test for CSR. Some are based on quadrat counts such as the index of dispersiontests, others use nearest-neighbor distances such as the Clark-Evans test or the K function.Comparison of an observed pattern with CSR has its limitations in epidemiology as it does notallow definition of the type of point process other than whether it is completely random inspace or not. It also cannot take account of issues such as a clustered underlying population atrisk. Alternative models which could be used include the heterogeneous Poisson process, theCox process, the Poisson cluster process or Markov point process (Bailey and Gatrell 1995).Getis and Ord (1992) describe the use of a distance statistic G which can be used to assessspatial autocorrelation for point patterns as well as for area data. It can be used to detect localpockets of dependence which might not show up when using a global statistic.

The analysis of point patterns is important in veterinary epidemiology as it allowsinferences on the occurrence of spatial clustering. The presence of clustering would suggestinfectiousness or the presence of specific environmental risk factors. Second order effects in aspatial process can be the result of disease clustering. Disease clustering can be assessed usinga number of methods and they can be categorized into general and focused tests (Waller and

Page 5: Issues Related to Spatial Data

87

Lawson 1995). The latter tests relate to the clustering of events around fixed point locationssuch for example a nuclear power plant. Wartenberg and Greenberg 1990 describe techniquesfor detection of hot spot clusters and clinal clusters.

A tool which can be effectively used for the analysis of clustering effects is the K function(Kingham, Gatrell, and Rowlingson 1995). In this context, two classes of point processes suchas cases of disease and random controls without the disease are compared. The principle is thatboth point processes are pooled and then the point process describing the cases is comparedwith the pooled process. Cuzick and Edwards (1990) developed a method which is also basedon nearest-neighbor distances. The test statistic simply compares the number of case-case pairsfor a given number of nearest neighbors. Applying this technique to the case-control datamentioned above it appears that there is significant clustering of cases compared with thecontrol population (see Figure 2). The software Stat! (BioMedware, Ann Arbor, Michigan,U.S.A.) was used to run the analysis and produce the graph.

Figure 2: Cuzick and Edwards’ method applied to tuberculosis breakdown case control study data (+ = cases, • =controls, arrows identify nearest neighbors)

Kingham, Gatrell, and Rowlingson (1995) describe a method combining the Diggle andChetwynd method based on bivariate K functions with results of a logistic regression analysisallowing them to make use of information about additional covariates to test for clustering.

Methods aimed at the detection of hot spots or small areas which might represent clustersof disease include the Geographical Analysis Machine developed by Openshaw (1990). Thistechnique is based on comparing the observed intensity of cases in circles of varying radius.The result of the analysis is a map with circles indicating the areas where case incidence washigher than expected under the assumption of spatial randomness. Alexander and Cuzick(1992) reviewed methods for assessment of disease clusters. Wartenberg and Greenberg(1993) discuss problems associated with detection of disease clusters.

Page 6: Issues Related to Spatial Data

88

Other analyses using point pattern information

Point events in space may have time of occurrence as one particular attribute. Specific methodsare available to investigate clustering in space and time. Interaction is present if pairs of casesare near in space as well as in time. Contagious diseases requiring direct contact will producespace-time clustering between cases. The techniques described below are not considered usefulfor non-contagious diseases. There are three main methods available which allow assessmentof a space-time relationship between cases: Knox’s method, Mantel’s method and the K-nearest neighbor method. All three techniques require production of distance matrices of thespatial as well as the temporal relationship between cases.

Data from a longitudinal study of Mycobacterium bovis infection in a wild possumpopulation in New Zealand will be used to demonstrate the usage of these techniques (Pfeiffer1994). As a first step a histogram of the geographical nearest neighbor distances has to beproduced as shown in Figure 3. The distribution of nearest neighbor distances expected underspatial randomness is shown in the histogram. It assumes uniform population density across thestudy area. The excess of shorter distances compared with the Poisson probability densityfunction suggests that there may be spatial clustering in this data set. The map in the samefigure indicates the locations of the 26 cases used in the analysis and their nearest neighbors areindicated by the arrows. Figure 4 presents the temporal distance distribution and a mapconnecting the cases according to their sequence of occurrence. The map does suggest thedisease has shifted its spatial focus over time and that there is some degree of temporalclustering.

Figure 3: Map and histogram of geographical distances between cases of tuberculosis infection in wild possums

Page 7: Issues Related to Spatial Data

89

Figure 4: Temporal distance map and histogram for cases of tuberculosis infection in wild possums

As a next step a formal statistical test has to be conducted to assess the statistical significanceof a potential space-time interaction process. When using Knox’s method, a critical distance intime as well as in space defining closeness has to be set and pairs of cases are tabulated into a2*2 contingency table with spatial and temporal closeness/farness defining the rows andcolumns (Knox 1964). Knox saw the critical distance as defining latency period. In mostsituations determining the critical distance requires a subjective decision. Approximaterandomization permutation techniques are used to construct a Null distribution for Knox’s teststatistic. Figure 5 shows the results of Knox’s test applied to the tuberculous possum datausing a critical distance of 100m in space and 3 months in time. The result of 30 for Knox’stest statistic X which is significant at a p-value of 0.02 suggests that given the selected criticaldistances time-space interaction is present in this data set.

Page 8: Issues Related to Spatial Data

90

Figure 5: Results of Knox’s method applied to cases of tuberculosis infection in wild possums

Another approach to investigate time-space interaction could be the use of the Mantel method(Mantel 1967). Mantel’s approach does not require selection of critical distances. It uses both,time and space distance matrices between all cases. But it should be kept in mind that theMantel test can be insensitive to non-linear associations between time and space distances.Distance measures can be transformed in a number of ways, such as the reciprocaltransformation which reduces the effect of large time and space distances. The Null hypothesisis that the time distances are independent of the space distances. Randomization permutationtechniques can be used to generate a test statistic for the Mantel test. Figure 6 presents theresults of this analysis when applied to the data on tuberculous possums. The scatter plot ofspace distances against temporal distances seems to suggest while the points are scatteredthroughout the plot that there some denser accumulations of cases present. The frequencydistribution of the test statistic under the Null hypothesis on the basis of 500 randompermutations is presented in the left window of Figure 6. It can be concluded that there issignificant space-time interaction.

Page 9: Issues Related to Spatial Data

91

Figure 6: Results from applying the Mantel method to test for time-space interaction between cases of tuberculosisinfection in wild possums

A third approach available is the K-nearest neighbor test of space-time interaction in pointdata. The test statistic indicates the number of case pairs which are K nearest neighbors in timeand space. The statistic is based on an approximate randomization of the Mantel productstatistic. Figure 7 presents the results from applying the K-nearest neighbor method to thepossum tuberculosis data. The map shows the locations of the cases and the arrows indicatek=2 nearest neighbors. The test statistic produced on the basis of 1000 random permutationssuggests that only the cumulative statistic Jk is statistically significant, whereas Jk is not. Thelatter parameter measures the statistical significance from increasing K by 1. The test statisticsupports the presence of space-time interaction, and suggests that the first 5 nearest neighborsare involved in space-time interaction.

Page 10: Issues Related to Spatial Data

92

Figure 7: Results from applying the K-nearest neighbor method to test for time-space interaction between cases oftuberculosis infection in wild possums

Spatially continuous data

The point pattern analyses assessed characteristics of the spatial distributions of points, butmade only limited use of attribute information. With spatially continuous and also area data theanalysis focus shifts towards use of the attribute information, in order to describe their patternin space. Spatially continuous data is also often referred to as geostatistical data. The data isusually collected by sampling at fixed points in space. The main objective of the analysis will beto describe the spatial variation in an attribute value, using the data collected at the sampledpoints. The spatial variation can be modeled as first and second order spatial processes.

Visualization of spatially continuous data

The data values obtained from the sampled locations can be mapped using proportionallyscaled symbols or columns for each sampling point. Column charts will in fact allowpresentation of multiple attribute values at the same point. Overlapping symbols can cause aproblem with interpretation of such maps. But none of these approaches will be able torepresent the underlying continuity of the process studied.

Exploratory analysis of spatially continuous data

The techniques used in this area can be categorized into methods for describing first ordereffects and those for second order effects.

Page 11: Issues Related to Spatial Data

93

Exploratory analysis of first order effects in spatially continuous data

The main techniques in this area are spatial moving averages, tessellation methods and kernelestimation techniques. A spatial moving average interpolates values between a given numberof neighboring sampling points. The more points are used the smoother the created surface willbe. It will be possible to describe global trends. A weighting mechanism can be introduced toaccount for varying distances between sampling points. Alternatively a tessellation of theobserved sample points can be used. This is most commonly done using Delauneytriangulation, also referred to as a triangulated irregular network (TIN). This method assignsto each sampling point a territory in which each point is closer to this sampling point than toany other. The resulting polygon map is called a Dirichlet tessellation and the tiles are knownas Voronoi or Thiessen polygons. Such a TIN can be used to construct a contour map or adigital terrain model (DTM). Figure 8 shows a triangulated irregular network based on heightinformation from a 50 m grid plus break lines. The TIN was then used to create a contour mapand a digital terrain model of possum tuberculosis longitudinal study area. Pfeiffer (1994) usedthis technique to generate polygons from point locations describing the areas where particularMycobacterium bovis strains occurred. The technique can be appropriate for presence/absencetype information, where a smooth surface is not the objective of the interpolation.

TIN Contour map DTM

Figure 8: TIN and derived contour maps and digital terrain model for the possum tuberculosis longitudinal study site

As with point patterns it is also possible to use kernel estimation to convert the attribute datafrom the sampling points into a surface. This time not using the number of events per unit areabut rather the value of the attribute. This technique has been used in geographicalepidemiology to model the relative risk function, measuring local risk relative to the regionalmean (Bithell 1990).Exploratory analysis of second order effects in spatially continuous data

Spatial dependence between attribute values measured at sampled locations is described usingthe covariance function or covariogram. The presence of second order effects would result inpositive covariance between observations a small distance apart and lower covariance orcorrelation if they are further apart. The covariogram describes the function of the covariancefor varying distances h between sample points and the correlogram the correspondingcorrelation. The semi-variogram is a graphical representation of the variation betweensampling points separated by a given distance and direction. For a stationary spatial process allthree describe similar information. Estimates of the semi-variogram are considered to be morerobust to departures from stationarity represented as a general trend in the spatial process. A

Page 12: Issues Related to Spatial Data

94

continuous process without spatial dependence will result in a horizontal line. A stationaryprocess will reach an upper bound, referred to as the sill at a distance h called the range.Theoretically, the intercept with the y-axis should be at a value of 0 variation. In reality,sampling error and small scale variation will result in variability at small distances and thevariogram will meet the y-axis not in the origin. This intercept with the y-axis is called thenugget effect. Variograms which do not reach an upper bound suggest non-stationarity. Figure9 shows an isotropic sample semi-variogram for the proportion of tuberculous possumscaptured at trap sites during the longitudinal study. The shape of the variogram suggests thatthe process is non-stationary, but given the relatively small nugget value there is also likely tobe spatial dependence.

Parameters defining a variogram Example of a variogram

Figure 9: Isotropic semi-variogram for the proportion of tuberculous possums at individual trap sites in the longitudinalstudy

Modeling of spatially continuous data

A number of approaches can be used to model or predict spatially continuous data. For thefirst-order processes trend surface analysis can be used based on ordinary polynomial leastsquares regression. Results have to be treated with caution, because the standard regressionassumptions of independent random errors and heteroscedasticity are likely to be violated.Lessard et al. (1990) used an inverse distance-weighted mathematical algorithm to interpolateclimatic measurements between sample points. Most trend surface models may be able todescribe an overall trend, but are not useful for local prediction. In the presence of weak firstorder, but strong second order effects it is more appropriate to use models fitted tovariograms. Such models can be defined ‘by eye’ and are most commonly based on thespherical, exponential or Gaussian model. The fit of a particular model can be assessedthrough cross-validation. Figure 10 shows an omni-directional exponential variogram modelfor the possum tuberculosis prevalence data. For the model to be valid it would be necessary toremove the non-stationarity through trend regression.

Page 13: Issues Related to Spatial Data

95

Figure 10: Omnidirectional exponential variogram model for possum tuberculosis prevalence data

The variogram model itself does not allow prediction of values. This can be achieved withKriging. This is a weighted moving average technique for estimating the value of a spatiallydistributed variable from adjacent values while considering interdependence expressed in avariogram. It allows the interpolation error to be mapped and from a statistical viewpoint isconsidered to be the most satisfactory method for interpolation (Oliver and Webster 1990).Pfeiffer (1994) used ordinary Kriging to produce a surface of possum population densitybased on possum capture data at sample points (see Figure 11). The omnidirectionalvariogram suggests that this data is more stationary than the tuberculosis prevalence data, butit also shows strong spatial dependence. An exponential model was fitted and used as the basisfor Kriging. The distribution of Kriging errors shows that there are some reasonably higherrors and according to the map they are located in one particular area of the study.

Page 14: Issues Related to Spatial Data

96

Omnidirectional variogram model Histogram of Kriging errors

Map of Kriging errors

Contour map based on Kriging estimates

Figure 11: Variogram model, Kriging errors and estimates for possum density in the longitudinal study on possumtuberculosis epidemiology

A number of multivariate methods can be used for modeling of spatially continuous data.Principal components can be used to combine the information from multiple variables into asmall number of components, each of them representing a particular combination of variablesand explaining a particular proportion of the variation in the data. Eastman and Fulk (1993)used the technique to analyze the information contained in a time series of NDVI maps forAfrica, thereby conducting a space-time analysis. Cliff et al. (1995) discuss the application ofmultidimensional scaling (MDS) to spatial epidemiological data. They use the technique tomap geographical information about measles mortality in Australia and New Zealand asdisease space where points with similar disease risks are closer to each other on the MDS mapeven though they are far removed geographically. Bailey and Gatrell (1995) discuss a range ofother multivariate analysis techniques for spatially continuous data.

Area data

Attribute data which does have values within fixed polygonal zones within a study area isreferred to as area data or lattice data. The areal units can constitute a regular lattice or gridor consist of irregular units. It is usually not required to estimate values as they should bepresent for all areas. The main emphasis with area data is on detection and explanation ofspatial patterns or trends possibly extended to take account of covariates.

Page 15: Issues Related to Spatial Data

97

Visualization of area data

Area data can be visualized using a wide range of techniques. The Choropleth map is probablythe most commonly used tool. Appropriate use of class intervals and colors to represent valuesin a choropleth map is essential. Cartograms or density equalized maps can be used to expressthe importance of particular areas. The analyst has to be aware of the problems which can becaused by the modifiable areal units problem. It is also possible to display several attributes atthe same time, by adding scaled columns or symbols to a choropleth map. Area informationcan be presented together with spatially continuous data for example by draping a choroplethmap over a DTM.

Canton BoundariesLakes

Cattle per Hectare0 - 0.40.4 - 0.60.6 - 0.80.8 - 1.01.0 - 1.2

0 20 40 60 80 100 Kilometers

N

Choropleth map of cattle density in SwitzerlandThiessen polygons representing areas used by possumsinfected with four different strains of Mycobacterium

bovis draped over a DTM

Figure 12: Examples of choropleth maps

Exploration of area data

Informal investigations of hypotheses can be aimed at first order or second order spatialprocesses. Most of these techniques require a methodology for measuring proximity. Possibleapproaches include various distance measures between polygon centroids as well as presenceor length of a shared boundary. For further analyses the proximity information can be describedthrough generation of contiguity or spatial weights matrices. This is quite difficult to achieve incurrently available GIS software and there are only few specialized spatial statistics softwarepackages (e.g. SpaceStat, Regional Research Institute, West Virginia University, Morgantown,West Virginia, U.S.A.) which can perform such operations.

For investigation of first order effects, a spatial weights matrix can be used to estimatespatial moving averages. If the data is available on the basis of a regular grid, the medianpolish may be appropriate. Kernel estimation can also be applied for investigations of firstorder effects in area data.

In the case of second order spatial processes the objective is to explore spatial dependenceof deviations in attribute values from their mean. In the context of area data, this effect isreferred to as spatial autocorrelation and it quantifies the correlation between values of thesame attribute between different locations. The most commonly used techniques for spatialautocorrelation are Moran's I and Geary's C. The first is closely related to the covariogramand the second to the variogram used for spatially continuous data. Power analyses of thesestatistics have been conducted by Walter (1993) who concluded that the power of Moran’s Iwas highest. A correlogram can be used to graphically display the correlation between values

Page 16: Issues Related to Spatial Data

98

at different spatial lags. If the autocorrelation does not decline after a number of lags, itindicates the presence of non-stationarity. The correlogram has similar applications in spatialanalysis as it has in time-series analysis for describing patterns. Hungerford (1991) analyzed thespatial distribution of cattle anaplasmosis between counties within the state of Illinois usingsecond-order analysis and detected significant spatial clustering within the state.

The above mentioned methods do not provide local indicators of spatial association whichwould be useful for identifying so-called hot spots. The Moran scatterplot and spatial lag piesdescribed in Anselin (1994) can be used to describe local patterns of variation visually.Quantitative estimates can be obtained using the G statistic by Getis and Ord (1992) or thelocal indicators of spatial association by Anselin (1995). The latter can be used as an indicatorof local pockets of non-stationarity (hot spots), similar to the G statistic, and also to assess theinfluence of individual data points on the global statistic and to identify outliers. Anselin,Dodson, and Hudak (1993) describe how these different techniques can be combined to form aexploratory spatial analysis system. Figure 13 shows a number of examples used by theseauthors to display local variation (from Anselin’s world wide web sitehttp://lambik2.rri.wvu.edu/esda.htm). The spatial lag pie map superimposes a pie on each areawith the top half of the pie representing the local value and the bottom part the neighboringvalues for this particular variable. It gives the observer an appreciation of the ratio between thelocal value and the surrounding spatial units. The Moran scatterplot shows the original valueof each observation on the x-axis and the value of its spatial lag on the y-axis. The plot can beused to identify outliers or even to conduct local regression to further describe the spatialassociation. These outliers can then be mapped as shown in Figure 13. The map of the areaswith significant LISA statistic indicates the area were there appears to be spatialautocorrelation.

Page 17: Issues Related to Spatial Data

99

Spatial Lag Pie Map Moran scatterplot

Moran scatterplot outliers Map of areas with significant LISA or G statistic

Figure 13: Example of an exploratory spatial data analysis approach

In landscape ecology, approaches have been developed to describe the interactions amongpatches within a landscape mosaic referred to as landscape pattern. Most biological processesand that includes of course diseases are influenced by a multitude of factors which togethermay form a particular pattern. Spatial patterns are particularly difficult to quantify. Ecologistsuse the term landscape structure which describes the spatial relationships between habitatpatches within a landscape (Dunning, Danielson, and Leck 1992). The software FRAGSTATS(McGarigal and Marks, Oregon State University, Corvallis, Oregon, U.S.A.) allows calculationof a wide range of indices and parameters describing landscape structure which could be usedfor further analyses.

Modeling of area data

Modeling techniques are aimed at establishing explanatory relationships between attributevalues of a dependent variables, taking account of the relative spatial arrangement of the areasand other values associated with each area unit. Again, it is possible to focus the analysis onfirst order or second order effects. Multiple ordinary least squares regression can only beused for preliminary exploratory analyses, but suffers from the problem that in the presence ofspatial dependence the errors are not independent and that the variance is unlikely to beconstant. The presence of spatial dependence can be assessed readily using a spatialcorrelogram. A range of spatial regression models have been described by Haining (1990) andthey can be implemented using the SpaceStat software mentioned above. Hungerford (1991)analyzed the relationship between cattle density and anaplasmosis prevalence on a county basis

Page 18: Issues Related to Spatial Data

100

in Illinois using measures of spatial correlation. Perry et al. (1991) used a GIS to investigatethe occurrence of Rhipicephalus appendiculatus in Africa to identify the factors controlling thedistribution of the vector tick which transmits the parasite Theileria parva causing East Coastfever, Corridor disease and January disease in cattle. A number of authors have included spatialdata into multivariate analysis as independent variables. Clifton-Hadley (1993) used spatialdescriptive measures, spatial autocorrelation and distance to particular features of interest toanalyse patterns of occurrence of badger-related tuberculosis breakdowns of cattle herds insouth-west England. Pfeiffer (1994) used a GIS to provide for point locations (cases ofdisease) specific geographical variables such as height above sea level, aspect, slope anddistance to features of interest which were then used as explanatory variables in multivariatestatistical analysis.

In the field of epidemiology, parameters of interest are very often counts or proportionswhich can be modeled using generalised linear modeling techniques rather than ordinaryleast-squares regression. It should be noted though that spatial forms of these models are notwell developed yet. Bailey and Gatrell (1995) suggest introducing covariates into theregression model such as the spatial coordinates or a variable representing regions categorizedbroadly by location to remove the effect of spatial dependence. Glass et al. (1995) developed arisk density map for Lyme disease based on a multiple logistic regression model, but they didnot attempt to remove spatial dependence from the data. A number of different predictivemodeling approaches for spatial data was compared by Williams et al. (1994). They used linearand non-linear discriminant analysis, tree-based induction and neural networks to map tsetsedistributions in Zimbabwe and concluded that while the simpler methods (linear discriminantanalysis and tree-based induction) were less precise, they were easier to interpret. Figure 14presents some preliminary results of a logistic regression analysis for prediction of Theileriaparva presence in an African country (this analysis was conducted by Perry,B.D., Kruska,R.L.,Pfeiffer,D.U. and others at ILRI, Nairobi, Kenya). The regression model includes eightdifferent environmental and land use variables and is based on information collected at randomsample locations throughout the country. The model was used to generate a risk maprepresenting the probability of T.parva presence at a particular location given a number of riskfactors included in the model. This map is presented as a DTM and as a raster map. Inaddition, two additional raster maps are shown which display the lower and upper 95%confidence limits of T.parva presence as predicted by the regression model. The receiveroperating characteristic curve (ROC) characterizing the predictive accuracy of the model couldbe used to adjust the decision making cut-off for the prediction probability balancing sensitivityand specificity as required. In this analysis the possible presence of spatial dependence was nottaken into account.

Page 19: Issues Related to Spatial Data

101

Sampling location

0102030405060708090

100

0 10 20 30 40 50 60 70 80 90 100

Percent of False Positives

Sen

siti

vity

0.32

0.16

0.08

ROC curve for logistic regression model

DTM of predicted probability of Theileria parva presence Raster map of predicted probability

of T.parva presence

Raster map of lower 95% confidence limit ofprobability of T.parva presence

Raster map of upper 95% confidence limit ofprobability of T.parva presence

Figure 14: Results of a multiple logistic regression analysis for prediction of Theileria parva presence

Page 20: Issues Related to Spatial Data

102

The appropriate technique for mapping probabilities or rates is as a measure of relativerisk which could be estimated by dividing the observed risk by an estimate of expected risk.The standardized mortality ratio has been used widely to represent spatial variation of diseaserisk (Elliott, Martuzzi, and Shaddick 1995). Small numbers of observations may result inextreme values. More recently this problem has become less important through the adoption ofempirical Bayes estimation. These techniques require estimates of a prior probabilitydistribution which can for example be based on the overall probabilities across all areas.Bayesian techniques are then used to convert these estimates into posterior probabilityestimates. These can be made spatial by using neighborhood probabilities to derive a priorprobability distribution.

Decision making and spatial data

Spatial data together with other non-spatial information is used for decision making purposes.This has become more difficult because the amount of information available has increasedsubstantially. Specific decision making tools have been developed attempting to simplify theprocess of making the right choice. A recent area of activity has been the adaptation of multi-criteria and multi-objective evaluation techniques to spatial problems. Such systems can takeaccount of uncertainty in the data as well as of the risk of making the wrong decision (Eastmanet al. 1995).

Spatial data has also become an essential component of disease information systems whichare beginning to replace largely manual systems which have been used by decision makers forthe control of endemic and epidemic diseases. The large amounts of data which can beprocessed easily, their objectivity and the quickness of response are some of the advantages ofcomputerized animal disease information systems. GIS provides an essential component ofsuch systems. An example of an animal disease information system is EpiMAN which wasdeveloped in New Zealand for the management of an outbreak of foot-and-mouth disease(Morris et al. 1992).

Epidemiological simulation

GIS can provide geographical data which allows computer simulations of the dynamics ofinfectious diseases for specific geographical locations. Spatial heterogeneity can be representedin simulation models resulting in more realistic representations of reality. There are only fewexamples where this approach has been used in veterinary epidemiology. Sanson (1993)described a model of foot-and-mouth disease which represents inter-farm spread of the diseaseon a true geographical area, using various transmission mechanisms. Pfeiffer (1994) developeda geographic simulation model of the dynamics of bovine tuberculosis infection in wild possumpopulations. The geographical component is a major feature of this model. The model usesvegetation maps to represent the ecological conditions of particular environments.

Conclusion

Spatial data has become an important component of disease investigations. The availability ofgeographic information systems in combination with fast and relatively inexpensive computerhardware leaves the epidemiologist with the responsibility of making effective use of theinformation. While the descriptive techniques for spatial data have been available for a longtime, exploratory and modeling techniques are still a very active area of development and theyare not as accessible to the analyst so that they could become a routine component ofepidemiological analysis.

Page 21: Issues Related to Spatial Data

103

References

Alexander,F.E. and J. Cuzick. 1992. Methods for the assessment of disease clusters.Geographical and environmental Epidemiology: Methods for small-area Studies. EditorsP. Elliott, J. Cuzick, D. English, and R. Stern, 238-50. 382 . Oxford: Oxford UniversityPress.

Anselin,L. 1994. Exploratory spatial data analysis and geographic information systems. NewTools for spatial Analysis. Editor M. Painho, 45-54. Luxembourg: Eurostat.

Anselin,L. 1995. Local indicators of spatial association - LISA. Geographical Analysis 27, no.2: 93-115.

Anselin,L. 1992. Spatial data analysis with GIS: An introduction to application in the socialSciences. 75 . Technical Report Series. Santa Barbara, California: National Center forGeographic Information and Analysis.

Anselin,L., R. F. Dodson, and S. Hudak. 1993. Linking GIS and spatial data analysis inpractice. Geographical Analysis 1: 3-23.

Bailey,T.C. and A. C. Gatrell 1995. Interactive spatial data analysis. Harlow, Essex, England:Longman Group. 413pp

Bithell, J. F. 1990. An application of density estimation to geographical epidemiology.Statistics in Medicine 9: 691-701.

Cliff, A. D., P. Haggett, M. R. Smallman-Raynor, D. F. Stroup, and G. D. Williamson. 1995.The application of multidimensional scaling methods to epidemiological data. StatisticalMethods in Medical Research 4: 102-23.

Clifton-Hadley, R. S. 1993. The use of a geographical information system (GIS) in the controland epidemiology of bovine tuberculosis in south-west England. Proceedings of the Societyfor Veterinary Epidemiology and Preventive Medicine, editor M. V. Thrusfield, 166-79.Society for Veterinary Epidemiology and Preventive Medicine.

Cuzick, J., and R. Edwards. 1990. Spatial clustering for inhomogeneous populations. Journalof the Royal Statistical Society B 52, no. 1: 73-104.

Dunning, J. B., B. J. Danielson, and C. F. Leck. 1992. Ecological processes that affectpopulations in complex landscapes. Oikos 65: 169-75.

Eastman, J. R., and M. Fulk. 1993. Long sequence time series evaluation using standardizedprincipal components. Photogrammetric Engineering and Remote Sensing 59, no. 6: 991-96.

Eastman, J. R., W. Jin, P. A. K. Kyem, and J. Toledano. 1995. Raster procedures for multi-criteria/multi-objective decisions. Photogrammetric Engineering and Remote Sensing 61,no. 5: 539-47.

Elliott, P., M. Martuzzi, and G. Shaddick. 1995. Spatial statistical methods in environmentalepidemiology: a critique. Statistical Methods in Medical Research 4: 137-59.

Getis, A., and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics.Geographical Analysis 24 (3): 189-206.

Glass, G. E., B. S. Schwartz, J. M. Morgan, D. T. Johnson, P. M. Noy, and E. Israel. 1995.Environmental risk factors for Lyme disease identified with geographic information systems.American Journal of Public Health 85, no. 7: 944-48.

Page 22: Issues Related to Spatial Data

104

Haining,R. 1990. Spatial Data Analysis in the social and environmental Sciences. Cambridge:Cambridge University Press.

Hungerford, L. L. 1991. Use of spatial statistics to identify and test significance in geographicdisease patterns. Preventive Veterinary Medicine 11: 237-42.

Izenman, A. J. 1991. Recent developments in nonparametric density estimation. Journal of theAmerican Statistical Association 86 (413): 205-24.

Kingham, S. P., A. C. Gatrell, and B. Rowlingson. 1995. Testing for clustering of health eventswithin a geographical information system framework. Environment and Planning A 27:809-21.

Knox, E. G. 1964. The detection of space-time interaction. Applied Statistics. 13: 25-29.

Lessard, P., R. L`Eplattenier, R. A. I. Norval, K. Kundert, T. T. Dolan, H. Croze, and others.1990. Geographical information systems for studying the epidemiology of cattle diseasescaused by Theileria parva. Veterinary Record 126: 255-62.

Mantel, N. 1967. The detection of disease clustering and a generalized regression approach.Cancer Research. 27 (2): 209-20.

Morris,R.S., Sanson,R.L and Stern,M.W. 1992: EPIMAN - A Decision Support System forManaging a Foot-and-Mouth Disease Epidemic. Proceedings Fifth Annual Meeting of theDutch Society for Veterinary Epidemiology and Economy, Wageningen, 1-35.

Oliver, M. A., and R. Webster. 1990. Kriging: a method of interpolation for geographicalinformation systems. International Journal of Geographical Information Systems 4 (3):313-32.

Openshaw, S. 1990. Automating the search for cancer clusters: a review of problems,progress, and opportunities. Spatial epidemiology. Editor R.W. Thomas, 48-78. PionPublications.

Perry, B. D., R. Kruska, P. Lessard, R. A. I. Norval, and K. Kundert. 1991. Estimating thedistribution and abundance of Rhipicephalus appendiculatus in Africa. PreventiveVeterinary Medicine 11: 261-68.

Pfeiffer, D. U. 1994. The role of a wildlife reservoir in the epidemiology of bovinetuberculosis. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.

Rothman, K. J. 1990. A sobering start for the cluster busters' conference. American Journal ofEpidemiology 132 Sup 1: S6-S13.

Sanson, R.L. 1993. The development of a decision support system for an animal diseaseemergency. Unpublished PhD Thesis. Massey University, Palmerston North, New Zealand.

Waller, L. A., and A. B. Lawson. 1995. The power of focused tests to detect diseaseclustering. Statistics in Medicine 14: 2291-308.

Walter, S. D. 1993. Assessing spatial patterns in disease rates. Statistics in Medicine 12: 1885-94.

Wartenberg, D., and M. Greenberg. 1990. Detecting disease clusters: the importance ofstatistical power. American Journal of Epidemiology 132 Sup 1: S156-S166.

Wartenberg, D., and M. Greenberg. 1993. Solving the cluster puzzle: Clues to follow andpitfalls to avoid. Statistics in Medicine 12: 1763-70.

Page 23: Issues Related to Spatial Data

105

Williams, B., D. Rogers, G. Staton, B. Ripley, and T. Booth. 1994. Statistical modelling ofgeoreferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetationdata. Modelling vector-borne and other parasitic Diseases. Editors B. D. Perry, and J. W.Hansen, 267-80. 369 . Nairobi, Kenya: The International Laboratory for Research onAnimal Diseases.