Acevedo-Arreguin (2005)


    UNIVERSITY OF CALIFORNIA

    SANTA CRUZ

    SPATIAL TEMPORAL STATISTICAL MODELING OF CRIME DATA:

    THE KERNEL CONVOLUTION APPROACH

    A final report submitted in partial satisfaction

of the requirements for the degree of

MASTER OF SCIENCE

in STATISTICS AND STOCHASTIC MODELLING

    By

Luis Antonio Acevedo-Arreguin

    June 2008

The Master's Project of Luis Antonio

Acevedo-Arreguin

is approved:

    ____________________________________

    Professor Bruno Sanso

    ____________________________________

    Professor Herbert Lee


Copyright © by Luis Antonio Acevedo-Arreguin 2008


Spatial Temporal Statistical Modeling of Crime Data: The Kernel Convolution Approach

Luis Antonio Acevedo-Arreguin

Professor Bruno Sanso, Faculty Advisor
Professor Herbert Lee, Academic Advisor

University of California, Santa Cruz
Department of Applied Mathematics and Statistics

June 2008

    Abstract

A Poisson point process with a kernel convolution approach is used to model the intensity of crime events that occurred in the metropolitan area of Cincinnati, OH, during every day of 2006. Departing from the traditional Gaussian process machinery used in statistics, the model investigated in this paper considers the convolution of a Gamma process with a bivariate Gaussian kernel within a Bayesian framework. In addition to a succinct discussion of results, we provide some plots, movies, and the complete code in the R computer programming language.


Spatial Temporal Statistical Modeling of Crime Data: The Kernel Convolution Approach

Introduction: Statistical analysis of spatial-temporal data can be applied to address problems with different degrees of complexity, from testing the hypothesis of complete spatial randomness to identifying clustering of events in space-time. We are interested in the analysis of data {(s_i, t_i) : i = 1, ..., n}, where each s_i denotes the location and t_i the corresponding time of occurrence of an event within a spatial region A and within a time interval (0, T). Peter J. Diggle calls a dataset of this kind a spatio-temporal point pattern, and the underlying stochastic model for the data a spatio-temporal point process [1]. He also discusses the importance of distinguishing in practice data for which the events occur in a space-time continuum, and data for which the time-scale is discrete, either by natural conditions or by aggregating the number of events over a sequence of discrete time-periods. In our case, we attempt different levels of data aggregation in time to find seasonal patterns, and circumscribe the complexity of the problem to that of finding the number of events per unit area and its evolution over time.

Although the hypothesis of complete spatial randomness (CSR) is rarely of any scientific interest, Diggle uses the concept to introduce the definition of the process underlying such a hypothesis:

A homogeneous Poisson process is a point process that satisfies two conditions: the number of events in any planar region A follows a Poisson distribution with mean λ|A|, where |·| denotes area and the constant λ is the intensity, or mean number of events per unit area; and the numbers of events in disjoint regions are independent [1].
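The two conditions in this definition translate directly into a simulation recipe; a minimal R sketch, with illustrative values of the intensity λ and the window A (neither taken from the crime data):

```r
# Simulate a homogeneous Poisson process with intensity lambda
# on a rectangular window A = [x1, x2] x [y1, y2].
simulate_csr <- function(lambda, x1, x2, y1, y2) {
  area <- (x2 - x1) * (y2 - y1)
  n <- rpois(1, lambda * area)   # N(A) ~ Poisson(lambda * |A|)
  # Given N(A) = n, the points are i.i.d. uniform on A
  cbind(x = runif(n, x1, x2), y = runif(n, y1, y2))
}

set.seed(1)
pts <- simulate_csr(lambda = 50, 0, 1, 0, 1)   # about 50 points expected
```

The second condition (independence over disjoint regions) is what allows the window to be simulated in one shot rather than region by region.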

Then, Diggle shows that the intensity function of the process can be defined as

λ(s) = lim_{|ds| → 0} E[N(ds)] / |ds|,   (1)

where ds denotes an infinitesimal neighborhood of the point s, and N(ds) denotes the number of events in ds. However, a more rigorous mathematical definition of a Poisson point process is provided by Moller and Waagepetersen, who state that "the general mathematical theory for point processes is heavily based on measure theory" [5].


Moller and Waagepetersen provide the following definitions and remarks to define the basic properties of Poisson point processes:

We start by considering Poisson point processes defined on a space S ⊆ R^d and specified by a so-called intensity function ρ : S → [0, ∞) which is locally integrable, i.e. ∫_B ρ(ξ) dξ < ∞ for all bounded B ⊆ S. This is by far the most important case for applications.

REMARK 3.1 In the definition below of a Poisson process we use only the intensity measure μ given by

μ(B) = ∫_B ρ(ξ) dξ,   B ⊆ S.

This measure is locally finite, i.e. μ(B) < ∞ for bounded B ⊆ S, and diffuse, i.e. μ({ξ}) = 0 for ξ ∈ S.

DEFINITION 3.1 Let f be a density function on a set B ⊆ S, and let n ∈ N (where N = {1, 2, 3, ...}). A point process X consisting of n i.i.d. points with density f is called a binomial point process of n points in B with density f. We write X ~ binomial(B, n, f) (where ~ means "distributed as"). Note that B in Definition 3.1 has volume |B| > 0, since ∫_B f(ξ) dξ = 1. In the simplest case, |B| < ∞ and each of the n i.i.d. points follows Uniform(B), the uniform distribution on B, i.e. f(ξ) = 1/|B| is the uniform density on B.

DEFINITION 3.2 A point process X on S is a Poisson point process with intensity function ρ if the following properties are satisfied (where μ is given by (3.1)):

(i) for any B ⊆ S with μ(B) < ∞, N(B) ~ po(μ(B)), the Poisson distribution with mean μ(B) (if μ(B) = 0 then N(B) = 0);

(ii) for any n ∈ N and B ⊆ S with 0 < μ(B) < ∞, conditional on N(B) = n, X_B ~ binomial(B, n, f) with f(ξ) = ρ(ξ)/μ(B). We then write X ~ Poisson(S, ρ).

For any bounded B ⊆ S, μ determines the expected number of points in B,

EN(B) = μ(B).


Heuristically, ρ(ξ) dξ is the probability for the occurrence of a point in an infinitesimally small ball with centre ξ and volume dξ [5].

Note that their notation is slightly different from the one used here.

Fortunately for our purposes, as they in turn cite Daley and Vere-Jones (1988, 2003), statistical inference for space-time processes is often simpler than for spatial point processes, a point that is treated further in Section 9.2.5 of Moller and Waagepetersen's book.
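Definition 3.2 doubles as a simulation recipe: draw N(B) from a Poisson with mean μ(B), then place that many i.i.d. points with density ρ(·)/μ(B). A minimal R sketch for an illustrative intensity function on B = [0,1]² (the intensity is made up, not taken from the crime data), using rejection sampling for the point locations:

```r
# Inhomogeneous Poisson process on B = [0,1]^2 via Definition 3.2.
rho <- function(x, y) 100 * exp(-3 * ((x - 0.5)^2 + (y - 0.5)^2))  # illustrative
rho_max <- 100                        # upper bound of rho on B

# mu(B) = integral of rho over B, computed by nested quadrature
mu_B <- integrate(function(x) sapply(x, function(xx)
  integrate(function(y) rho(xx, y), 0, 1)$value), 0, 1)$value

set.seed(2)
n <- rpois(1, mu_B)                   # N(B) ~ Poisson(mu(B))
pts <- matrix(NA_real_, n, 2)
for (i in seq_len(n)) {
  repeat {                            # rejection: accept with prob rho/rho_max
    p <- runif(2)
    if (runif(1) < rho(p[1], p[2]) / rho_max) { pts[i, ] <- p; break }
  }
}
```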

On the other hand, the more complicated issue of spatial cluster modeling is addressed by Andrew B. Lawson and David G. T. Denison. They claim that there are two main approaches to modeling clusters; basically, the difference between those views is whether or not the locations of the aggregations are parameterized [3]. Bayesian cluster modelling and, especially, mixture distribution and nonparametric approaches are given more emphasis because of the development of fast computational algorithms for sampling of complex Bayesian models (most notably Markov chain Monte Carlo algorithms).

Method: The kernel convolution approach has been used for several years to model spatial and temporal processes [7] [4]. For example, Stroud, Muller and Sanso (2001) applied it to model two large environmental datasets, whereas Lee, Higdon, Calder, and Holloman (2004) convolved simple Markov random fields with a smoothing kernel to model some cases in hydrology and aircraft prototype testing. In most applications of the kernel convolution approach, Gaussian kernels were used. The emphasis on Gaussian spatial and space-time models is because they are quite flexible and can be adapted to a wide variety of applications, even where the observed data are markedly non-Gaussian [2].

David Higdon explains the reason for using convolution models:

[Gaussian Markov random field] GMRF models work well for image and lattice data; however, when data are irregularly spaced, a continuous model for the spatial process z(s) is usually preferable. In this section, convolution or, equivalently, kernel models are introduced. These models construct a continuous spatial model z(s) by smoothing out a simple, regularly spaced latent process. In some cases, a GMRF model is used for this latent process.


The convolution process z(s) is determined by specifying a latent process x(s) and a smoothing kernel k(s). We restrict the latent process x(s) to be nonzero at the fixed spatial sites ω_1, ..., ω_m, also in S, and define x = (x_1, ..., x_m)^T where x_j = x(ω_j), j = 1, ..., m. For now, the x_j's are modeled as independent draws from a N(0, 1/λ_x) distribution. The resulting continuous Gaussian process is then

z(s) = ∫_S k(u − s) dx(u) = Σ_{j=1}^{m} k(ω_j − s) x_j

where k(ω_j − ·) is a kernel centered at ω_j [2].
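Higdon's finite-sum construction is straightforward to code; a one-dimensional R sketch with an illustrative grid size, kernel width, and precision λ_x (all made-up values, chosen only to show the mechanics):

```r
# Discrete kernel convolution: z(s) = sum_j k(omega_j - s) * x_j,
# with latent x_j ~ N(0, 1/lambda_x) on a fixed grid of sites omega_j.
m        <- 20
omega    <- seq(0, 1, length.out = m)                    # latent grid sites
lambda_x <- 1                                            # latent precision
k        <- function(d, sigma = 0.1) dnorm(d, 0, sigma)  # Gaussian kernel

set.seed(3)
x <- rnorm(m, 0, 1 / sqrt(lambda_x))   # independent latent draws

z  <- function(s) sum(k(omega - s) * x)        # continuous process z(s)
zs <- sapply(seq(0, 1, by = 0.01), z)          # evaluate on a fine grid
```

Because z(s) is a finite weighted sum of m latent values, inference only has to deal with the m-vector x rather than with a full Gaussian process over all observation sites.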

Then, Higdon provides several cases to illustrate the advantages of this methodology. Likewise, Swall (1999) provides a complete description of the kernel convolution approach and some interesting variations, such as spatially varying kernels, to be applied in cases where stationarity of key parameters is not guaranteed [8]. Here, we propose the kernel convolution approach to model a Poisson point process: the crime events that occurred every day for a year in Cincinnati, Ohio. The modeling considers the convolution of a Gamma process with a bivariate Gaussian kernel within a Bayesian framework. To obtain acceptable mixing in the sampling of the posterior distributions of the crime intensity and the corresponding latent variables, a beta proposal is implemented in the Markov chain Monte Carlo subroutine. Sanso and Schmidt (2004) describe the theoretical aspects of the beta proposal in the implementation of Metropolis-Hastings steps of an MCMC routine [6].

Data processing: The dataset to test the models was constructed from a database that the Cincinnati Police Department maintains at

    http://www.cincinnati-oh.gov/police/pages/-5192-/

with records of crimes committed in Hamilton County for several years. The database reports date, time, and location of crimes, as well as other data that might be useful to characterize the magnitude of the reported event. A UCR code, which seems to be related to a uniform crime reporting code


that the FBI uses nationwide, was assigned to each event. There were more than 70 different UCR codes describing a variety of events such as telephone harassment, vehicle theft, murder, and the like. Although very diverse, this variable was used to reclassify the data in order to be used in the test models.

Specifically, the data corresponding to 2006 were downloaded, address-geocoded, imported into R, and subjected to a simple descriptive statistical analysis. Since the database only reported the street address of each crime, the more than 43,000 records of that year were processed to obtain their geographical coordinates. The process of geocoding was conducted by using online services. The website

    http://www.gpsvisualizer.com

was helpful because, by acquiring a Google API key, users can geocode thousands of records a day. Thus, converting multiple addresses to GPS coordinates only requires a minor modification of the HTML code of the geocoder webpage to include the API key, reduce the google delay value to 0.5 seconds or less, and increase the number of records to be processed at once.

Once geocoding was performed, the data were imported into R and some temporal variables were added, such as the day of the week and the day of the year in which the crime was committed. Also, the UCR codes were transformed into four main categories of events: crimes against people with extreme violence, crimes against people with minor violence, crimes against property, and crimes against the system. For example, the categorical variable crime class, which was incorporated into the database

    CRIME2006_plus3.dat,

was given values from 1 to 4 depending on the category of the crime. A crime with a UCR of 105 (corresponding to murder) was given a crime class value of 1, whereas a crime with a UCR of 1120 (passing bad checks) was given a crime class value of 4, and so on. These new categorical variables allowed us to perform a preliminary analysis, which was summarized in some box plots and time series plots. A complete description of the UCR codes used by the Cincinnati police is provided in the file UCR code description.txt.
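The recoding step can be sketched as follows. The text only specifies part of the mapping (class 1 spans UCR 100–495 and 800–864; murder, UCR 105, maps to 1; passing bad checks, UCR 1120, maps to 4), so the cut-points for classes 2 and 3 below are placeholders, not the actual boundaries used in the project:

```r
# Recode UCR codes into the four crime classes used in the report.
# Class 1 (extreme violence against people) covers UCR 100-495 and
# 800-864, as stated in the text; the boundaries separating classes
# 2, 3, and 4 are placeholders for illustration only.
crime_class <- function(ucr) {
  ifelse((ucr >= 100 & ucr <= 495) | (ucr >= 800 & ucr <= 864), 1,
  ifelse(ucr < 800, 2,          # placeholder boundary: minor violence
  ifelse(ucr < 1000, 3, 4)))    # placeholder boundary: property vs. system
}

crime_class(c(105, 1120))   # murder -> 1, passing bad checks -> 4
```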

From a preliminary analysis of the entire dataset, a cyclical pattern was observed on a plot of the number of crimes both with respect to the day of the week and with respect to the day of the year. The plots showed that the highest incidence of crimes happened around the middle of the year,


whereas the values tended to decrease around the end of the year. Similarly, a higher number of crimes was reported on Mondays than was reported for the rest of the week. A similar pattern emerged when a part of the dataset, the one corresponding to the crimes against people with extreme violence, was plotted with respect to the temporal covariates. Thus, type 1 crimes were selected for statistical modeling. This crime category includes the UCR codes 100 to 495 and 800 to 864, which along with the rest of the data are in the file crime2006 plus3.dat. The file CRIME2006 database description.txt provides more details on the entire dataset.

As part of the data processing, importing maps into R was another task that required a search for mapping resources, both for obtaining the satellite photographs and for georeferencing the imported images. Google, especially

    http://earth.google.com,

was again a good source of satellite images of the study area, in almost the same way that the website

    http://tiger.census.gov

was very helpful, not only for providing the spatial covariates later included in the models, but also for generating maps of any part of the United States by just specifying the GPS coordinates of the area of interest. For example, to generate a map of Hamilton County, whose Tiger code is TGR39061, the user only needs to type the following address in the browser of his or her preference,

http://tiger.census.gov/cgi-bin/mapgen?lat=39.166828&lon=-84.538348&wid=0.290456&ht=0.290456&iwd=480&iht=480

in a single line and without spaces. The parameters included in the link were computed by using the boundary coordinates of Hamilton County, also provided by the Census website (i.e., −84.820305 < longitude < −84.256391 and 39.021600 < latitude < 39.312056). The GPS coordinates in the link correspond to the center of the map; the wid and ht values represent the width and the height of the image in GPS units, whereas the iwd and iht values represent the same dimensions in pixels. Thus, the width and the height of the image were chosen depending on the dimensions of the JPEG image to be generated.
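The link parameters can be computed from the bounding box in a few lines; a small R sketch that reproduces the URL above (the square wid = ht choice and the 480-pixel image size match the values in the link):

```r
# Build a Census Tiger mapgen URL from a county bounding box:
# the map is centered on the box, and wid/ht are extents in GPS units.
tiger_url <- function(lon_min, lon_max, lat_min, lat_max, px = 480) {
  lat <- (lat_min + lat_max) / 2
  lon <- (lon_min + lon_max) / 2
  ht  <- lat_max - lat_min          # latitude extent, reused as width
  sprintf(paste0("http://tiger.census.gov/cgi-bin/mapgen?lat=%f&",
                 "lon=%f&wid=%f&ht=%f&iwd=%d&iht=%d"),
          lat, lon, ht, ht, px, px)
}

# Hamilton County (TGR39061) bounding box from the Census website
u <- tiger_url(-84.820305, -84.256391, 39.021600, 39.312056)
```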


The JPEG file with the map of Cincinnati was later processed in R by using the package rimage. This was required to generate a surface matrix that could be used by the command image as many times as needed without demanding a lot of computational time, and also to facilitate the georeferencing of the JPEG map. Georeferencing was necessary to plot the crime points on a map without transforming the GPS coordinates of each point into another coordinate system. A satellite image of Cincinnati obtained from Google Earth was processed in the same way, thus generating the files Cincinnati map1.dat for the option map1 in the computer program for model 1, and the files Cincinnati map2.dat, long map2.dat, and lat map2.dat for the option map2 in the same program. The option map1 corresponds to the simple road map, whereas the option map2 corresponds to the satellite photograph. These files are required to generate the background on the plots, both for the figures included in this report and for the accompanying video clips.

Model Statement: Under the kernel convolution approach, the intensity λ(s, t) of a point process is modeled as the convolution of a random process Z(u) and a weighting kernel k(s − u) over a grid of u locations. Both the spatial and the temporal covariates are included in the model through multiplicative effects λ_s(s) and λ_t(t), so for a Poisson process on an observation window R, the corresponding expressions for the intensity, the expected number of points, and the likelihood for n points y_i occurring at times t = 1, ..., T are

λ(s, t) = Σ_u k(s − u) Z(u) λ_s(u) λ_t(t)   (2)

Λ_{R,T} = Σ_{t=1}^{T} ∫_R λ(s, t) ds   (3)

L(λ | d) = Π_{t=1}^{T} exp(−Λ_R(t)) Π_{i=1}^{n} λ(y_i, t_i)   (4)

where u indicates a grid location and s indicates a point location. The spatial multiplicative effect λ_s is a function of two spatial covariates, X_1(s), the population density in year 2000 (number of individuals per square mile), and X_2(s), the number of vacant units:

λ_s(u) = exp(θ_1 X_1(u) + θ_2 X_2(u)),   (5)


whereas the temporal multiplicative effect is based on a linear combination of sines and cosines of four temporal covariates,

λ_t(t) = exp( θ_t1 sin(2π t_4/12) + θ_t2 cos(2π t_4/12)
            + θ_t3 sin(2π t_3/52) + θ_t4 cos(2π t_3/52)
            + θ_t5 sin(2π t_2/365) + θ_t6 cos(2π t_2/365)
            + θ_t7 sin(2π t_1/7) + θ_t8 cos(2π t_1/7) )   (6)

where t_1 ∈ {1, ..., 7} is the day of the week (1 for Sunday), t_2 ∈ {1, 2, ..., 365} is the day of the year (1 for January 1st, 2006), t_3 ∈ {1, 2, ..., 52} is the number of the week, and t_4 ∈ {1, 2, ..., 12} is the number of the month (1 for January).
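Equation (2) can be sketched directly. The grid below is 13×10 as in the text and the kernel standard deviation is 52.5% of the grid spacing, while the Gamma draws for Z and the multiplicative effects are illustrative placeholders rather than posterior quantities:

```r
# Intensity of eq. (2): lambda(s,t) = sum_u k(s-u) Z(u) lambda_s(u) lambda_t(t),
# sketched on a unit square; Z, lambda_s, and lambda_t are placeholders.
grid  <- expand.grid(x = seq(0, 1, length.out = 13),
                     y = seq(0, 1, length.out = 10))   # 13 x 10 grid of u sites
sigma <- 0.525 * (1 / 12)   # sd = 52.5% of grid spacing, as stated in the text

set.seed(4)
Z       <- rgamma(nrow(grid), shape = 1, rate = 1)   # latent Gamma process
lam_s_u <- rep(1, nrow(grid))                        # spatial effect (placeholder)
lam_t   <- function(t) 1                             # temporal effect (placeholder)

# Bivariate Gaussian kernel with zero correlation
k <- function(sx, sy, ux, uy) dnorm(sx, ux, sigma) * dnorm(sy, uy, sigma)

lambda <- function(sx, sy, t) {
  sum(k(sx, sy, grid$x, grid$y) * Z * lam_s_u) * lam_t(t)
}

lambda(0.5, 0.5, 1)   # intensity at the window centre on day 1
```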

The kernels over a 13×10 grid were chosen to be bivariate Gaussian with fixed parameters σ²_x and σ²_y, which were estimated so that the elliptical contours of each bivariate Gaussian had one standard deviation from its center (u location), in both the x and y directions, equal to 52.5% of the distance between two grid points in the same row or column. The correlation was set to zero. Associated with these kernels was a Gamma process Z(u) with fixed hyperparameters α and β, and a multiplicative factor φ that played the role of transforming Z(u) into the process φZ(u), with one of its hyperparameters, α or β, acting as a random variable. The corresponding prior distributions for all the parameters of the model were chosen to be

π(Z(u)) ~ Gamma( (7/4) · 2π σ_x σ_y / Σ_u ∫_R k(s − u) ds, 7/4 )   (7)

π(φ) ~ Gamma( (7/4) · 0.0075 / (2π σ_x σ_y), 7/4 )   (8)

π(θ_1) ~ N(0, 0.0001²)   (9)

π(θ_2) ~ N(0, 0.0005²)   (10)

π(θ_tj) ~ N(0, 0.5²)   (11)

where j ∈ {1, 2, ..., 8}. The posterior distributions were sampled by a combination of Gibbs steps and Metropolis-Hastings algorithms.


Results: The model parameters were estimated by using Markov chain Monte Carlo (MCMC). Specifically, for the posterior distribution of Z(u), a beta proposal was implemented to improve the acceptance rate of the proposed value at a new iteration k of the M-H step. Thus, the proposal Z*(u) for a new Z^k(u) was sampled from

Z*(u) / Z^{k−1}(u) ~ Beta( aη/2, a(1 − η)/2 ),   (12)

where η and a were set to 0.95 and 2.5, respectively. This multiplicative random walk seemed to induce fast convergence of the MCMC. More details on this approach can be found in Sanso (2007). The parameter φ was sampled from its posterior Gamma distribution by a Gibbs step. The rest of the parameters were sampled from their corresponding posterior distributions by using M-H with normal proposals. Thus, the proposal distributions for the spatial θs were

θ*_1 ~ N(θ_1^{k−1}, 0.000005²)   (13)

θ*_2 ~ N(θ_2^{k−1}, 0.000250²)   (14)

whereas the proposal distributions for the temporal θs were simply

θ*_tj ~ N(θ_tj^{k−1}, 0.025²),   (15)

where j ∈ {1, 2, ..., 8}. For modeling the daily variation of the intensity λ, 5000 iterations were required for convergence, with a burn-in of 2500. Since the entire computer program was coded in R, the simulation took over 20 hours per run. The code is included in the file ppm llnl ver7a.r.
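The multiplicative beta random walk of eq. (12) can be sketched as a Metropolis-Hastings update. The target below is an arbitrary Gamma density standing in for the actual conditional posterior of Z(u), and the rescaling of the beta ratio by η (so that the proposal is centred near the current value) is an assumption of this sketch; the exact construction in Sanso (2007) may differ:

```r
# One M-H update with a multiplicative beta random-walk proposal.
eta <- 0.95; a <- 2.5
a1  <- a * eta / 2; a2 <- a * (1 - eta) / 2

log_target <- function(z) dgamma(z, shape = 2, rate = 1, log = TRUE)  # stand-in

# proposal: z_star = z * B / eta, with B ~ Beta(a1, a2)
log_q <- function(z_to, z_from) {        # log density of moving z_from -> z_to
  r <- z_to * eta / z_from
  if (r <= 0 || r >= 1) return(-Inf)
  dbeta(r, a1, a2, log = TRUE) + log(eta / z_from)   # Jacobian of rescaling
}

mh_step <- function(z) {
  z_star <- z * rbeta(1, a1, a2) / eta
  log_alpha <- log_target(z_star) - log_target(z) +
               log_q(z, z_star) - log_q(z_star, z)   # Hastings correction
  if (log(runif(1)) < log_alpha) z_star else z
}

set.seed(5)
z <- 1
for (k in 1:1000) z <- mh_step(z)
```

Because the proposal scales the current value instead of adding noise to it, the step size adapts automatically to the magnitude of Z(u), which is what makes this random walk attractive for strictly positive latent variables.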

Once the posterior means of the spatial and temporal parameters were computed, the corresponding multiplicative factors λ_s(u) and λ_t(t) were estimated as

λ_s(u) = exp(0.000141 X_1(u) + 0.004256 X_2(u))   (16)

λ_t(t) = exp( 0.027926 sin(2π t_4/12) − 0.091187 cos(2π t_4/12)
            − 0.222996 sin(2π t_3/52) + 0.226660 cos(2π t_3/52)
            + 0.134605 sin(2π t_2/365) − 0.229748 cos(2π t_2/365)
            + 0.056335 sin(2π t_1/7) + 0.037308 cos(2π t_1/7) )   (17)

which, in conjunction with the baseline intensity λ(u), allow us to make inferences on the expected number of crime events over the region of interest per day. Contour plots of the intensity λ for an area of Cincinnati delimited by the longitudes 84.63W and 84.38W, as well as the latitudes 39.09N and 39.22N, were plotted for each of the 365 days of 2006, and can be observed in the 6-min movie Video-2.wmv.
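With the fitted coefficients of equation (17), the temporal effect can be evaluated for any day; for example, June 25, 2006 was a Sunday, giving t1 = 1, t2 = 176, t3 = 26, t4 = 6 (the same values that appear in the control-panel comments of the source code):

```r
# Fitted temporal multiplicative effect, eq. (17)
lambda_t_hat <- function(t1, t2, t3, t4) {
  exp( 0.027926 * sin(2*pi*t4/12)  - 0.091187 * cos(2*pi*t4/12) -
       0.222996 * sin(2*pi*t3/52)  + 0.226660 * cos(2*pi*t3/52) +
       0.134605 * sin(2*pi*t2/365) - 0.229748 * cos(2*pi*t2/365) +
       0.056335 * sin(2*pi*t1/7)   + 0.037308 * cos(2*pi*t1/7) )
}

v <- lambda_t_hat(t1 = 1, t2 = 176, t3 = 26, t4 = 6)   # June 25, 2006
```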

Conclusions: The model allowed us to obtain a picture of the criminal hot spots of the metropolitan area of Cincinnati, OH, by providing the location, on an actual map, of the various modes of the spatial distribution of the intensity λ and its corresponding evolution over time. By incorporating the information of other spatial variables that were considered constant with respect to time, such as population density or the number of houses for rent, it was possible to visually find the correlation between crime intensity and densely populated areas of Cincinnati.

The model also served the objective of testing new ways to deal with the massive computational resources required to process thousands of data points on hundreds of grid points, by using new proposal distributions for the Metropolis-Hastings steps to get fast convergence of the MCMC. The beta proposal resulted in faster MCMC iterations than the traditional Gaussian proposal. Faster simulations might be obtained by translating the R code to Fortran or C++. We worked the entire computer code in R because of its advantages for educational settings with limited computational resources (i.e., it is open source), and because of its graphical capabilities, which allowed us to follow the MCMC iterations on the computer screen in real time.

This model might be improved by incorporating kernels with parameters varying over space and time, to explore the correlation between crime activity and city infrastructure such as roads or land use. Also, a preliminary summary of criminal activity based on a spatial distribution of events occurring during certain days of the week, month, or year might be incorporated into the model to explore its forecasting potential.

Acknowledgements: This master's project would not have been possible without the support of Dr. William Hanley and his team at the Lawrence Livermore National Laboratory. Likewise, Professors Bruno Sanso and Herbert Lee, as well as Dr. Matt Taddy, were especially important academic advisors for this project to come to a fruitful end.


[Figure: plot residue removed. Panels: "Daily variation of crimes against PEOPLE [Case 1: Extreme Violence]", crime events reported vs. day of the year (Day 1 = Jan 01, 2006); and "Crimes against PEOPLE during the days of the week [Case 1: Extreme Violence]", crime count vs. day of the week (Day 1 = Sunday).]

Figure 2: Crime type 1, which includes events with a high level of violence, especially against people, showed a cyclical pattern like that shown by the entire dataset. There was a high number of crimes reported on Mondays and high rates of criminal activity around the middle of the year.


[Figure: map residue removed. "Cincinnati Crime Data: Mean Intensity Surface Baseline", longitude vs. latitude.]

Figure 3: The mean baseline λ(u), or the mean intensity surface when λ_s(u) = λ_t(t) = 1.


[Figure: map residue removed. "Cincinnati Crime Data: Mean Intensity Surface", Jun/25/2006; observed number of crime events = 13; mean(ND) = 12.69, var(ND) = 0.35.]

Figure 4: The mean intensity surface corresponding to June 25th, 2006. This picture is a frame taken from a movie generated in R and post-processed with Microsoft Windows Media Encoder 9.


[Figure: plot residue removed. "Trace plot of theta[1]" and "Histogram of theta[1]", covariate = population density 2000; mean = 0.000141, var = 2.29e−11; acceptance rate = 0.48.]

Figure 5: The trace plot for the parameter θ_1 shows acceptable mixing (upper panel), whereas the estimated posterior mean of θ_1, whose histogram is depicted in the lower panel, shows that, when used to compute λ_s(u), an increment of 1000 new residents might increase the intensity of crime by around 15%.


[Figure: plot residue removed. "Trace plot of theta[2]" and "Histogram of theta[2]", covariate = vacant units; mean = 0.004256, var = 8.59e−08; acceptance rate = 0.59.]

Figure 6: The trace plot for the parameter θ_2 shows acceptable mixing (upper panel), whereas the estimated posterior mean of θ_2, whose histogram is depicted in the lower panel, shows that, when used to compute λ_s(u), an increment of 10 vacant units might increase the intensity of crime by around 4.3%. In the case that 50 houses or apartments ended at some time with no occupants, the intensity of crime events might increase by 23.7%.


[Figure: plot residue removed. Upper panel: "Observed Number of Crimes" vs. day of the year. Lower panel: "Posterior Mean of Number of Crimes with 95% confidence interval" vs. day of the year.]

Figure 7: The daily variation of the number of crime events, n and Λ_R(t), plotted from the observed data and from the values estimated according to model 1.


[Figure: map residue removed. "Population density 2006 in county 39061".]

Figure 8: Future research might explore kernels with temporally and spatially varying parameters.


    R source code:

    # POISSON PROCESS MODELING

    # MODEL 1: KERNEL CONVOLUTION APPROACH

    # Version 7a

    # Oct 19, 2007

# Luis Acevedo Arreguin
#

####################################################################################

    # MODELING SPECIFICATIONS

    dataset = "complete" # "partial"; "complete"

    covariates = "on" # "on"; "off"

    temporal = "on" #

    display = "on" # "on"; "off"

proposal = "beta" # "gammaloc1"; "gammaloc2"; "beta"; "lognormal"; "direct"; "prior"; "gamma1"; "gamma2"; "gamma3"; "gamma4LR"

    prior_xu = "fixed" # "fixed"; "random"

kernel = "fixed" # "fixed"; "mono"; "multi"

kernel_size = 0.525 # initial kernel ellipse dimension (one standard deviation) as a fraction of grid point separation

    ITER = 5000

    burn = 1/2

    map = "map1" # "map1" = atlas map; "map2" = satellite map

    # SETTING A GRID PxQ (nrow=P, ncol=Q)

P = 13
Q = 10

# DOMAIN COORDINATES (CORRESPONDING TO HAMILTON COUNTY, CODE = TGR39061, FROM CENSUS.GOV)

    x1 = -84.820305

    x2 = -84.256391

y1 = 39.021600
y2 = 39.312056

    # WORKING DIRECTORY

    # setwd("C:/Documents and Settings/Me/Desktop/week 22")

    # setwd("C:/Users/UCB Tiger/Desktop/week 18")

    setwd("C:/Users/abc/Desktop/week 27")

    # setwd("G:/Documents/week 20")

    set.seed(9132)

####################################################################################

    # SUBROUTINES AND FUNCTIONS

# MATT TADDY'S ROUTINE FOR INTERPOLATION AND CONTOURING


    "ezinterp"


    deltax = (x2-x1)/slices

    xi = x1 + deltax/2

    volume = 0

    for(i in 1:slices) {

    m3 = xi*rho*sd2/sd1

    sd3 = sd2*sqrt(1-rho*rho)

    integral2


    t1


    delta_x

    delta_y

    n_star = P*Q

    x_grid = seq(from=(min(x)-delta_x/4), to=(max(x)+delta_x/4), length=Q)

    y_grid = seq(from=(min(y)-delta_y/4), to=(max(y)+delta_y/4), length=P)

    if(covariates == "on") {

    x_grid = seq(from=(min(xu2)+delta_x/10), to=(max(xu2)-delta_x/10), length=Q)

    y_grid = seq(from=(min(yu2)+delta_y/10), to=(max(yu2)-delta_y/10), length=P)

    }

    # area


    y2u2[(P2-1)*Q2 + i]


    t2


    }

    # DATA FOR THE CONTROL PANEL DISPLAY

    # June 25

    # t1_u = rep(1, n_star)

# t2_u = rep(176, n_star)
# t3_u = rep(26, n_star)

    # t4_u = rep(6, n_star)

    # efe_u


    dy


    k1[j,i]


# WHICH

    # THE MCMC IS RUN

    Arx


    }

# THE HYPERPARAMETERS FOR GAMMA PRIORS ARE NAMED BY CONCATENATING THE LETTERS A OR B
# (CORRESPONDING TO ALPHA OR BETA) AND THE RANDOM VARIABLE INITIALS FOR WHICH
# THE MCMC IS RUN. SOMETIMES UNDERSCORES ARE INTRODUCED FOR THE SAKE OF CLARITY

    if(covariates == "off") {

    q1


    ######

    # MCMC IMPLEMENTATION

    sum1prev = 0

    k = 1

    for(k in 1:(ITER-1)) {

    if(temporal == "on") {

    Ls[,k]


    contour(x_grid,y_grid,surface5a,add=TRUE)

    post_NDt


    Lprod[k]


    ##############################################################################

    ######

    # UPDATING HYPERPARAMETERS

    if(prior_xu == "fixed") {

    alpha[k+1]


    Kutheta_star


    } # END covariates M-H

    if(temporal == "on") {

    mt_prior_temp = rep(0, 8)

    sdt_prior_temp


    thetatemp[5,k]*sin(2*pi*t/365) + thetatemp[6,k]*cos(2*pi*t/365)+

    thetatemp_star[7,k]*sin(2*pi*twd/7) + thetatemp[8,k]*cos(2*pi*twd/7))

    NDtemp_star[8]


    for(j in 1:n) {

    Mts[j]


    post_syu


    ##############################################################################

    ######

    post_fxu


    contour(x_grid,y_grid,surface8b,nlevels=NL,add=TRUE)

    points(x,y)

    post_Lu_var


    # y1 = 39.021600

    # y2 = 39.312056

    if(map == "map1") {

    stretch_x = 1.25

    stretch_y = 1.00

    offset_x = 0.00

    offset_y = 0.00}

    if(map == "map2") {

    stretch_x = 0.95

    stretch_y = 1.15

    offset_x = 0.01

    offset_y = -0.005

    }

    rx1 = mean(mx) - stretch_x*(mean(mx)-min(mx)) + offset_x

    rx2 = mean(mx) + stretch_x*(mean(mx)-min(mx)) + offset_x

    ry1 = mean(my) - stretch_y*(mean(my)-min(my)) + offset_y

    ry2 = mean(my) + stretch_y*(mean(my)-min(my)) + offset_y

    rmx


    meanintensity


    par(mfrow = c(2,1))

    hist(exp(1000*theta[1,(1+round(k*burn)):(k-1)]),
         main="Histogram of exp(1000*theta) for covariate 1",
         sub=paste("mean = ",mean(exp(1000*theta[1,(1+round(k*burn)):(k-1)])),
                   " var = ",var(exp(1000*theta[1,(1+round(k*burn)):(k-1)]))),breaks=50)
    hist(exp(1000*theta[2,(1+round(k*burn)):(k-1)]),
         main="Histogram of exp(100*theta) for covariate 2",
         sub=paste("mean = ",mean(exp(100*theta[2,(1+round(k*burn)):(k-1)])),
                   " var = ",var(exp(100*theta[2,(1+round(k*burn)):(k-1)]))),breaks=50)
    hist(exp(1000*theta[3,(1+round(k*burn)):(k-1)]),
         main="Histogram of exp(1000*theta) for covariate 3",
         sub=paste("mean = ",mean(exp(1000*theta[3,(1+round(k*burn)):(k-1)])),
                   " var = ",var(exp(1000*theta[3,(1+round(k*burn)):(k-1)]))),breaks=50)
    hist(exp(1000*theta[4,(1+round(k*burn)):(k-1)]),
         main="Histogram of exp(1000*theta) for covariate 4",
         sub=paste("mean = ",mean(exp(1000*theta[4,(1+round(k*burn)):(k-1)])),
                   " var = ",var(exp(1000*theta[4,(1+round(k*burn)):(k-1)]))),breaks=50)
    hist(exp(1000*theta[5,(1+round(k*burn)):(k-1)]),
         main="Histogram of exp(1000*theta) for covariate 5",
         sub=paste("mean = ",mean(exp(1000*theta[5,(1+round(k*burn)):(k-1)])),
                   " var = ",var(exp(1000*theta[5,(1+round(k*burn)):(k-1)]))),breaks=50)
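    All of these posterior summaries discard an initial burn-in fraction of the chain through the same index arithmetic. A small self-contained sketch (the values of `k` and `burn` are illustrative, not the ones used in the run):

    ```r
    # Indices of the retained MCMC draws after discarding the first `burn`
    # fraction of k iterations, as used in the trace plots and histograms.
    k    <- 10000   # illustrative number of iterations
    burn <- 0.2     # illustrative burn-in fraction
    keep <- (1 + round(k * burn)):(k - 1)
    keep[1]        # 2001: first retained draw
    length(keep)   # 7999 retained draws
    ```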

    ### END LLNL PLOTS

    # DISPLAY PANEL 2

    # post_Lu


    t2_u = rep(t2_date, n_star)

    efe_u


    hist(fxu[index4,(1+round(k*burn)):(k-1)],breaks=100,
         main=paste("Histogram of x(",index4,") with min L(u)"))
    hist(fxu[index3,(1+round(k*burn)):(k-1)],breaks=100,
         main=paste("Histogram of x(",index3,") with max L(u)"))
    plot((1+round(k*burn)):(k-1), tauk[(1+round(k*burn)):(k-1)],type="l",
         main=paste("Trace plot of the kernel tau"))
    hist(tauk[(1+round(k*burn)):(k-1)],main=paste("Histogram of the kernel tau"),breaks=100)

    ##############################################################################

    ######

    # THETA FOR SPATIAL AND TEMPORAL COVARIATES

    par(mfrow=c(2,1))

    cov = 5

    plot((1+round(k*burn)):(k-1), theta[cov,(1+round(k*burn)):(k-1)],type="l",
         main=paste("Trace plot of theta [",cov,"] covariate =",cov_name[cov]),
         sub=paste("mean = ",mean(theta[cov,(1+round(k*burn)):(k-1)])," var = ",
                   var(theta[cov,(1+round(k*burn)):(k-1)])))

    hist(theta[cov,(1+round(k*burn)):(k-1)],
         main=paste("Histogram of theta [",cov,"] covariate =",cov_name[cov]),
         sub=paste("Acceptance:",1+accepttheta[cov]/k),breaks=100)

    cov = 2

    plot((1+round(k*burn)):(k-1), thetatemp[cov,(1+round(k*burn)):(k-1)],type="l",
         main=paste("Trace plot of thetatemp [",cov,"]"),
         sub=paste("mean = ",mean(thetatemp[cov,(1+round(k*burn)):(k-1)])," var = ",
                   var(thetatemp[cov,(1+round(k*burn)):(k-1)])))

    hist(thetatemp[cov,(1+round(k*burn)):(k-1)],
         main=paste("Histogram of thetatemp [",cov,"]"),
         sub=paste("Acceptance:",1+acceptthetatemp[cov]/k),breaks=100)

    ##############################################################################

    ######

    # NUMBER OF EVENTS, N(D), OVER THE STUDY REGION

    par(mfrow=c(2,1))

    if(covariates == "on") NDk


    ,sub=paste("MCMC with burn-in of ",100*burn," % of ",k," iterations")

    ,breaks=100)

    td = 12 # monthly basis: 1 to 12

    monthlab = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

    Mtuk


    post_fxu


    points(xd,yd)

    # DAILY BASIS

    par(mfrow = c(1,1))

    postscript("day_%04d.eps", paper="letter", onefile = FALSE, title=" ")

    # td


    yo=seq(min(yu), max(yu), length=gridlen))

    LL = c(40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 240, 280, 320, 360) # NL = 14

    image(rmx,rmy,map2,
          main=paste("Cincinnati Crime Data: Mean Intensity Surface",monthlab[tmo],"/",mday,"/2006"),
          sub=paste("Observed number of crime events = ",length(t2[t2==td])),pty="s",
          col=color1,
          xlab=paste("Longitude   mean(Ncalc) = ",post_NDt,"  var(Ncalc) = ",var_NDt),
          ylab="Latitude",xlim=X_map,ylim=Y_map)
    contour(surface_mov1,drawlabels=TRUE, levels=LL, add=TRUE, col=color2)
    points(xd,yd,pch=15,col="blue")
    } # end td loop
    dev.off()
    image(surface_mov1,xlab=paste("mean(ND) = ",post_NDt," var(ND) = ",var_NDt),ylab=" ",
          main=paste("Mean Intensity Surface: ",monthlab[td]," 2006"),
          sub=paste("Observed number of crime events = ",length(t4[t4==td])))
    contour(surface_mov1,add=TRUE)
    points(xd,yd)

    ##############################################################################

    ######

    # ADDITIONAL GRAPHS (FOR THE FINAL DOCUMENT)

    # pdf("baseline2006.pdf", paper="letter", onefile = TRUE, title=" ")

    postscript("baseline2006.eps", paper="letter", onefile = TRUE, title=" ")

    Lub


    xd


    lines(1:365, Ncalc+1.96*Nse, col="red")

    lines(1:365, Ncalc-1.96*Nse, col="blue")

    dev.off()

    # THETA FOR SPATIAL AND TEMPORAL COVARIATES

    # pdf("Theta2.pdf", paper="letter", onefile = TRUE, title=" ")

    postscript("Theta2.eps", paper="letter", onefile = TRUE, title=" ")

    par(mfrow=c(2,1))

    cov = 2

    plot((1+round(k*burn)):(k-1), theta[cov,(1+round(k*burn)):(k-1)],type="l",
         main=paste("Trace plot of theta [",cov,"] covariate =",cov_name[cov]),
         xlab = paste("theta [",cov,"]"), ylab=" ",
         sub=paste("mean = ",mean(theta[cov,(1+round(k*burn)):(k-1)])," var = ",
                   var(theta[cov,(1+round(k*burn)):(k-1)])))
    hist(theta[cov,(1+round(k*burn)):(k-1)],xlab = paste("theta [",cov,"]"), ylab=" ",
         main=paste("Histogram of theta [",cov,"] covariate =",cov_name[cov]),
         sub=paste("Acceptance:",1+accepttheta[cov]/k),breaks=100)

    dev.off()

    # pdf("ThetaTemp%02d.pdf", paper="letter", onefile = FALSE, title=" ")

    postscript("ThetaTemp%02d.eps", paper="letter", onefile = FALSE, title=" ")

    par(mfrow=c(2,1))

    for(cov in 1:8) {

    plot((1+round(k*burn)):(k-1), thetatemp[cov,(1+round(k*burn)):(k-1)],type="l",
         main=paste("Trace plot of thetatemp [",cov,"]"),
         xlab = paste("theta [",cov,"]"), ylab=" ",
         sub=paste("mean = ",mean(thetatemp[cov,(1+round(k*burn)):(k-1)])," var = ",
                   var(thetatemp[cov,(1+round(k*burn)):(k-1)])))

    hist(thetatemp[cov,(1+round(k*burn)):(k-1)],xlab = paste("theta [",cov,"]"), ylab=" ",
         main=paste("Histogram of thetatemp [",cov,"]"),
         sub=paste("Acceptance:",1+acceptthetatemp[cov]/k),breaks=100)

    }

    dev.off()

    # EXPLORATORY ANALYSIS

    ITR = length(x)

    i = 1

    j = 1

    crime_count


    crime_weekday


    yj


    NDobs_sector


    A_tauk/B_tauk;A_tauk/B_tauk^2

    sum(Lu_times_Au)

    min(noaccept/k);max(noaccept/k)

    min(1+acceptrx/k);max(1+acceptrx/k)

    min(1+acceptry/k);max(1+acceptry/k)

    min(1+acceptru/k);max(1+acceptru/k)

    min(1+accepttheta/k);max(1+accepttheta/k)

    np;a_eta;delta_eta

    A_alpha;B_alpha;A_alpha/B_alpha;A_alpha/B_alpha^2

    A_beta;B_beta;A_beta/B_beta;A_beta/B_beta^2

    pv1; Nx; Ny

    proposal; kernel; prior_xu; covariates; display

    # END

    ##############################################################################

    ######


    References

    [1] Peter J. Diggle. Statistical Methods for Spatio-Temporal Systems, chapter 1, Spatio-Temporal Point Processes: Methods and Applications, pages 1–45. Chapman & Hall / CRC, 2007.

    [2] David Higdon. Statistical Methods for Spatio-Temporal Systems, chapter 6, A Primer on Space-Time Modeling from a Bayesian Perspective, pages 217–279. Chapman & Hall / CRC, 2007.

    [3] Andrew B. Lawson and David G. T. Denison. Spatial Cluster Modelling, chapter 1, Spatial Cluster Modelling: An Overview, pages 1–19. Chapman & Hall / CRC, 2002.

    [4] Herbert K. H. Lee, Dave M. Higdon, Catherine A. Calder, and Christopher H. Holloman. Efficient models for correlated data via convolutions of intrinsic processes. Statistical Modelling, 5:53–74, 2005.

    [5] Jesper Møller and Rasmus Plenge Waagepetersen. Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall / CRC, 2004.

    [6] Bruno Sansó and Alexandra M. Schmidt. Spatio-temporal models based on discrete convolutions. Technical Report AMS2004-07, University of California, Santa Cruz, 2004.

    [7] Jonathan R. Stroud, Peter Müller, and Bruno Sansó. Dynamic models for spatiotemporal data. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63(4):673–689, 2001.

    [8] Jenise Lynn Swall. Non-Stationary Spatial Modeling Using a Process Convolution Approach. PhD thesis, Duke University, 1999.
