Criminal Incident Data Association Using OLAP Technology

Donald E. Brown & Song Lin
Department of Systems & Information Engineering
University of Virginia

Summary

In this paper, we combine OLAP (Online Analytical Processing) and data mining to associate criminal incidents. The method is tested on a robbery dataset from Richmond, Virginia.

Objectives of Spatial Knowledge Mining

Leverage DBMS (records management), OLAP, & GIS
Find spatial-temporal patterns and relationships in data
Support crime analysis & information sharing

Related Applications - UVa

ReCAP: Regional Crime Analysis Program
  Provides support for regional analysis using an RDBMS
  Requires implementation on each client computer

CARV: Crime Analysis and Reporting in Virginia
  Runs on Citrix Metaframe, so the number of concurrent users is limited

GRASP: Geospatial Repository for Analysis and Safety Planning
  Web interface for a central repository of criminal incident data and geospatial files

Outline

Introduction
Existing studies on OLAP & data mining
Combined approach
Application
Conclusions

Introduction (crime association)

80-20 rule: 20% of the criminals commit 80% of the crimes
How can we link criminal incidents committed by the same criminal?
Start by looking at the same crime types

Theories of criminal behavior (criminology)

Rational choice (Clarke and Cornish)
  Criminals evaluate “benefit” and “risk”, and make rational decisions to maximize “profit”.

Routine activity (Felson)
  A ready criminal
  A suitable target
  Lack of an effective guardian

Theories of criminal behavior (template)

“Template” (Brantingham & Brantingham)
  The environment sends out cues about its characteristics
  Criminals use these cues to evaluate potential targets
  A template is built to associate certain cues with suitable targets
  The template is self-reinforcing and enduring
  A criminal does not have many templates

An operational approach to the theories (template)

Criminal incidents committed by the same person show
  Similar patterns in time
  Similar patterns in space
  Similar patterns in MO (modus operandi)

It is possible to associate incidents committed by the same person by discovering these patterns.

Existing Association Methods & Systems

AREST (Badiru et al.): suspect matching
ViCAP (FBI): incident matching
COPLINK (U. Arizona): links search terms with cases (concept space)

Existing Association Methods & Systems (cont.)

TSM (Brown): total similarity measures
  Could be used for matching both incidents and suspects

SQL: used by analysts in practice

Comments on existing methods

Computer technologies are central to criminal incident association, for example:
  MIS
  Databases
  Information retrieval
  GIS

Comments on existing methods (cont.)

Two additional techniques that enable incident association
  Data warehousing / OLAP
  Data mining

We develop a method that seamlessly integrates OLAP and data mining.

Related Work on OLAP and data mining

OLAP
  Ancestor: OLTP (transactional data)
  OLAP: summary data for analysis

Dimension
  OLAP data is multidimensional
  Dimensions are numeric or categorical attributes
  Hierarchical structures exist in dimensions

Aggregates: sum, count, average, max, min, …
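To make the multidimensional view concrete, here is a minimal Python sketch of a "count" aggregate computed for every cell of a tiny cube. The records, attribute names, and the use of "*" to mean "all values of this dimension" are illustrative assumptions, not data from the study.

```python
from collections import Counter
from itertools import product

# Hypothetical robbery records as (weapon, premise) pairs; illustrative only.
records = [
    ("gun", "street"),
    ("gun", "store"),
    ("gun", "street"),
    ("knife", "store"),
]

def count_cube(rows):
    """Return the 'count' aggregate for every cell; '*' rolls a dimension up."""
    cube = Counter()
    for row in rows:
        # A record contributes to every cell obtained by replacing
        # any subset of its values with the wildcard '*'.
        for mask in product([False, True], repeat=len(row)):
            cell = tuple("*" if hide else v for v, hide in zip(row, mask))
            cube[cell] += 1
    return cube

cube = count_cube(records)
print(cube[("gun", "*")])   # all gun robberies, any premise
print(cube[("*", "*")])     # grand total
```

Other aggregates (sum, average, max, min) could be accumulated the same way by replacing the counter update.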

OLAP and Data Mining

Both are powerful tools to support the decision-making process, but
  OLAP focuses on efficiency; few quantitative analysis methods are used
  Data mining is typically applied to 2-D datasets (spreadsheets), not to multidimensional OLAP data structures

Idea: combine them

Existing studies on combining OLAP and Data mining

Cubegrade problem (Imielinski)
  Generalized version of association rules
  Association rule: the change in the “count” aggregate when imposing another constraint, or performing a “drill-down” operation
  Other aggregates could also be considered

Existing studies on combining OLAP and Data mining (cont.)

Constrained gradient analysis
  Retrieve pairs of OLAP cells that are
    Quite different in aggregates
    Similar in dimension (parents, children, siblings)
  More than one aggregate could be considered simultaneously (e.g., sum and mean).

Existing studies on combining OLAP and Data mining (cont.)

Data-driven exploration (Sarawagi)
  Find “exceptions”
  Mean and standard deviation are calculated for a cell
  If the aggregate of the cell is outside (mean − 2.5σ, mean + 2.5σ), the cell is an exception
  OLAP version of the “3σ” rule
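One simplified reading of the exception test treats a cell as an exception when its aggregate lies more than 2.5 standard deviations from the mean of comparable cells. The sketch below follows that reading only; the function name, the choice of the population standard deviation, and the sample counts are assumptions, and the published method fits a more elaborate model of expected values.

```python
from statistics import mean, pstdev

def flag_exceptions(aggregates, k=2.5):
    """Return the keys whose aggregate lies outside mean +/- k * std."""
    values = list(aggregates.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []           # all cells identical: nothing stands out
    return [key for key, v in aggregates.items() if abs(v - mu) > k * sigma]

# Hypothetical monthly robbery counts for ten sibling cells.
counts = {f"m{i}": 10 for i in range(1, 10)}
counts["m10"] = 100

print(flag_exceptions(counts))
```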

Associating records by finding distinctive values or outliers

Basic idea
  If a group of records has common characteristics, and these “common” characteristics are unusual (“outliers”), we are more confident in asserting that these records come from the same causal mechanism.

Look for distinctive characteristics; the best would be DNA.

OLAP-outlier-based method to associate records

Rationale for distinctive values or outliers
  Weapon used in robberies
    “gun”: very common, hard to associate
    “Japanese sword”: distinctive, so the incidents likely come from the same person

We build an outlier score function to measure this “distinctiveness”
  Higher score → more distinctive → more confident to associate
  It is designed for categorical attributes (MO is important in linking criminal incidents)

Definitions

Cell, Parent, Neighbor
  Cell: a vector of values for some attributes.
  Parent: replace one attribute of the cell with the wildcard element “*”.
  Neighbor: a group of cells having the same parent.

Derived from the OLAP field
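The three definitions can be sketched in a few lines of Python. Cells are represented as tuples with "*" as the wildcard; the function names and the attribute domains are hypothetical and chosen only for illustration.

```python
def parents(cell):
    """All parents of a cell: replace one non-wildcard attribute with '*'."""
    return [
        cell[:k] + ("*",) + cell[k + 1:]
        for k, value in enumerate(cell)
        if value != "*"
    ]

def neighbor(cell, k, domains):
    """All cells sharing the parent obtained by wildcarding attribute k."""
    return [cell[:k] + (v,) + cell[k + 1:] for v in domains[k]]

# Hypothetical attribute domains for a two-dimension example.
domains = [("a1", "a2", "a3", "a4", "a5"), ("b1", "b2", "b3", "b4")]

print(parents(("a5", "b3")))
print(neighbor(("a5", "b3"), 0, domains))
```

A cell with d specified attributes has d parents, one per attribute that can be rolled up.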

Illustration -- Cell

[Figure: grid with Dimension 1 (a1 to a4) on one axis and Dimension 2 (b1 to b4) on the other. Two-dimension cell: (a4, b2). One-dimension cell: (*, b4).]

Illustration --parent

[Figure: grid with columns a1 to a5 and rows b1 to b4.]

Cell (a5, b3) has two parents: (a5, *) and (*, b3)

Illustration -- Neighbor

Neighbor is a collection of cells sharing the same parent

Outlier Score Function

We start building this function from one dimension, and then generalize to higher dimensions. For one dimension, we have the following two observations:
  Values with small probability (frequency) are more “unusual”
  The outlier score is high when the uncertainty level is low.

Observation I

[Figure: bar chart of Count (0 to 50) by HairColor (Blond, Brown, Black, Red, Gray); one bar is marked as an outlier with P = 0.1.]

For attribute “color”, value “blond” covers 10% of the records. Hence, it should get a higher outlier score.

Observation II

[Figure: two bar charts of Count (0 to 80) by HairColor. Left: five values (Blond, Brown, Black, Red, Gray). Right: two values (Blond, Brown).]

Although both of them have frequency = 0.2, the left one is more “unusual”, because its uncertainty level is low.

Observation III

More evidence is better than less: more evidence → higher outlier score

OSF for One Dimension

-log(p) comes from information theory, where p is the probability of a value.
Entropy measures the information in a message (in this case, in a data record).

$$\mathrm{OSF} = \frac{-\log(p)}{\mathrm{Entropy}}$$
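As a minimal sketch, the one-dimension score can be computed as -log(p) divided by the entropy of the attribute's value distribution. The function name, the natural logarithm, the zero-entropy convention, and the hair-color counts below are assumptions made for illustration.

```python
import math
from collections import Counter

def osf_one_dim(values, target):
    """One-dimension outlier score: -log(p) / entropy of the attribute."""
    counts = Counter(values)
    if counts[target] == 0:
        raise ValueError("target value does not occur in the data")
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    if entropy <= 0:
        return 0.0  # assumption: a single-valued attribute is never distinctive
    return -math.log(counts[target] / n) / entropy

# Hypothetical hair-color counts: blond is rare (10%), black is common (40%).
hair = ["blond"] * 10 + ["brown"] * 30 + ["black"] * 40 + ["red"] * 10 + ["gray"] * 10

print(osf_one_dim(hair, "blond"))
print(osf_one_dim(hair, "black"))
```

The rarer value receives the higher score, and for a fixed p the score grows as the entropy (uncertainty) shrinks, which matches the two observations.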

OSF for Higher Dimensions

For any cell, calculate the sum of the OSF of its parent cell and the one-dimension OSF conditional on the neighbor of this cell.
Do this calculation for all parent cells.
Take the maximum as the outlier score for this cell.

$$
f(c) =
\begin{cases}
0, & c = (*, *, \ldots, *) \\
\max_k \left\{ f(\mathrm{parent}(c, k)) + \dfrac{-\log(\mathrm{frequency}(c))}{\mathrm{Entropy}(k^{\text{th}}\ \text{neighbor of}\ c)} \right\}, & \text{otherwise}
\end{cases}
$$
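The recursion can be sketched in Python as follows. Two points are assumptions about details the slide does not spell out: frequency(c) is taken as the share of the parent's records that fall in c, and the entropy is taken over the distribution of the k-th attribute among the parent's records. The records are invented for illustration.

```python
import math
from collections import Counter

def matches(cell, record):
    """True if the record falls inside the cell ('*' matches anything)."""
    return all(c == "*" or c == v for c, v in zip(cell, record))

def outlier_score(cell, records):
    """f(c) = max over k of f(parent(c, k)) + (-log frequency) / entropy."""
    if all(v == "*" for v in cell):
        return 0.0
    if not any(matches(cell, r) for r in records):
        raise ValueError("cell contains no records")
    best = 0.0
    for k, value in enumerate(cell):
        if value == "*":
            continue
        parent = cell[:k] + ("*",) + cell[k + 1:]
        # Distribution of attribute k among the records of the parent cell.
        dist = Counter(r[k] for r in records if matches(parent, r))
        total = sum(dist.values())
        entropy = -sum((n / total) * math.log(n / total) for n in dist.values())
        term = -math.log(dist[value] / total) / entropy if entropy > 0 else 0.0
        best = max(best, outlier_score(parent, records) + term)
    return best

# Hypothetical (weapon, premise) records.
records = [("gun", "street")] * 6 + [("gun", "store")] * 2 + [("sword", "store")] * 2

print(outlier_score(("sword", "*"), records))
print(outlier_score(("gun", "*"), records))
```

Because every added term is non-negative, a cell never scores lower than any of its parents, so specifying more matching attributes can only raise the score.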

Association (using this OLAP-outlier method)

For a pair of incidents (A, B)
  If there is a cell that contains both A and B
  And the outlier score of this cell is large enough (threshold test)
  Then associate them
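A minimal sketch of this test, assuming incidents are tuples of categorical values and that a scoring function for cells is supplied by the caller. The toy score table, the function names, and the use of "greater than or equal" for the threshold test are illustrative assumptions.

```python
from itertools import product

def common_cells(a, b):
    """Every cell (with '*' wildcards) that contains both incidents."""
    # An attribute can be kept only where the two incidents agree.
    options = [(x, "*") if x == y else ("*",) for x, y in zip(a, b)]
    return list(product(*options))

def associated(a, b, score, threshold):
    """Associate the pair if any common cell passes the threshold test."""
    return any(score(cell) >= threshold for cell in common_cells(a, b))

# Hypothetical score table; cells not listed score 0.
toy_scores = {("sword", "*"): 3.2, ("gun", "*"): 0.4}

def toy_score(cell):
    return toy_scores.get(cell, 0.0)

print(associated(("sword", "store"), ("sword", "street"), toy_score, 3.0))
print(associated(("gun", "store"), ("gun", "street"), toy_score, 3.0))
```

With a threshold of 0 every pair is associated, because the all-wildcard cell contains every incident and scores 0.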

Application (dataset)

Applied to a robbery dataset (Richmond, VA, 1998)

Why robbery? For evaluation purposes:
  More multiple offenses than murder
  More known suspects than B & E (breaking and entering)

Attributes

Three attribute groups
  Modus operandi: categorical
  Census features: numeric
  Distance features: numeric

Feature Selection

Redundant features → feature selection
  Cluster features (similar features in the same group)
  Pick a representative feature for each group

Method: k-medoid clustering
  Applicable to a distance matrix
  Returns “medoids”
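The idea of picking one representative per group can be sketched with a simple alternating k-medoids procedure that works directly on a distance matrix. This is a generic textbook-style variant with a naive initialization, not necessarily the implementation used in the study; the toy distances stand in for feature dissimilarities.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternate between assigning items to medoids and updating medoids."""
    n = len(dist)
    medoids = list(range(k))  # naive initialization: first k items (assumption)
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Toy example: six items on a line, forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]

print(k_medoids(dist, 2))
```

The returned indices identify actual items, which is why medoids are convenient as representative features: each one is a real feature rather than an averaged centroid.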

Feature Selection Result

[Figure: scatter plot of the features on Component 1 (-0.8 to 0.6) and Component 2 (-0.6 to 0.4).]

These two components explain 44.25% of the point variability.

Medoids: 1: HUNT, 2: ENRL3, 3: TRANS.PC

Final Selected Features

Medoids
  HUNT (housing unit density)
  ENRL3 (public school enrollment)
    POP3 (population: 12-17): more meaningful (attacker and victims)
  TRAN_PC (transportation expense per capita)
    MHINC (median income)

Discretize

Discretize these numeric features into bins
  Similar to a histogram
  Sturges’ rule for the number of bins
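A minimal sketch of this step, using the common form of Sturges' rule, k = ceil(log2(n) + 1), together with equal-width bins. The equal-width choice and the function names are assumptions; the slide only says the binning is similar to a histogram.

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n) + 1)

def discretize(values):
    """Map each numeric value to an equal-width bin index (0-based)."""
    k = sturges_bins(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # constant feature: a single bin
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(sturges_bins(170))
print(discretize([0, 1, 2, 3, 4, 5, 6, 7]))
```

After this step each numeric feature behaves like a categorical attribute, so it can be used as a dimension of a cell.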

Evaluation

For incidents with known suspects (170)
  Generate all incident pairs
  If a pair of incidents has the same criminal suspect, it is a “true association”

Compare the results given by the algorithm with the “true result”

Evaluation Criteria

Two measures
  Detected true associations
    Larger is better
  Average number of relevant records
    Similar to search engines like Google
    Given one record, the system returns a list
    Take the average of the lengths of all lists
    Shorter is better.

Evaluation Criteria (cont.)

From information retrieval
  Recall: ability to provide relevant items
  Precision: ability to provide only relevant items

The 1st measure is “recall”; the 2nd is equivalent to “precision”
The 2nd also measures the user effort (in further investigation)
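Both measures can be computed in one pass over all incident pairs. The sketch below assumes an association predicate is supplied by the caller; the function name, the incident labels, and the suspect labels are hypothetical.

```python
from itertools import combinations

def evaluate(incidents, suspects, is_associated):
    """Return (detected true associations, average number of relevant records)."""
    n = len(incidents)
    list_lengths = [0] * n  # length of the list returned for each incident
    detected = 0
    for i, j in combinations(range(n), 2):
        if is_associated(incidents[i], incidents[j]):
            list_lengths[i] += 1
            list_lengths[j] += 1
            # A detected pair is a true association if the suspects match.
            if suspects[i] == suspects[j]:
                detected += 1
    return detected, sum(list_lengths) / n

# Hypothetical data: four incidents, two of them by the same suspect.
incidents = ["i1", "i2", "i3", "i4"]
suspects = ["s1", "s1", "s2", "s3"]

# A predicate that associates everything returns every other record each time.
print(evaluate(incidents, suspects, lambda a, b: True))
```

When every pair is associated, each list holds all other incidents, so the average list length is n - 1; this is the least selective end of the trade-off between the two measures.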

Result (OLAP-outlier based)

Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
1           32                           121.04
2           30                           62.54
3           23                           28.38
4           18                           13.96
5           16                           7.51
6           8                            4.25
7           2                            2.29
[missing]   0                            0.00

Result of binary association method (calculating similarity score)

Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
0.5         33                           112.98
0.6         25                           80.05
0.7         15                           45.52
0.8         7                            19.38
0.9         0                            3.97
[missing]   0                            0.00

Comparison Outlier vs. Binary

[Figure: curves for the Similarity and Outlier methods; x-axis: Avg. relevant records (0 to 180); y-axis ticks 0 to 35.]

Comparison (cont.)

Generally, the curve of our method lies above the other one
  Given the same accuracy level, this method returns fewer records
  Keeping the same “length” of the list, this method is more accurate

The other method is better at the tail
  However, that means the average number of relevant records is > 100
  Given that the size is 170, no analyst would investigate 100 incidents.

Generally, the new method is effective.

Comparison (Outlier vs. Simple Combination)

[Figure: curves for the Similarity, Outlier, and Combine methods; x-axis 0 to 200; y-axis ticks 0 to 35.]

WebCAT Implementation

A secure web environment that can read several data formats and translate them into a uniform standard (XML)
Uses free, open-source technology
  ASP, XML, MapServer, SVG, etc.
Provides tools to meet spatial and statistical analysis needs, including association
Provides utilities for querying and reporting

Conclusions

Developed a new data association method for linking criminal incidents that combines
  Concepts in OLAP (multidimensional)
  Ideas in data mining (outlier detection)

Testing with a robbery dataset shows promise
Deployment through WebCAT provides open-source (XML-based) capability for data access and analysis over the web

Questions?
