Time-focused density-based clustering of trajectories of moving objects Margherita D’Auria Mirco Nanni Dino Pedreschi


2

Plan of the talk

Introduction
  Motivations
  Problem & context
  Density-based clustering (OPTICS)

Density-based clustering on trajectories
  Trajectory data model
  Distance measure
  Results

Temporal focusing
  A clustering quality measure
  Heuristics for optimal temporal interval

Conclusions & future work

3

Motivations

Plenty of current and future data sources for spatio-temporal data

Sophisticated analysis methods are required in order to fully exploit them
  Data mining methods
  Which kind of patterns/models?

Main objectives
  A better understanding of the application domain
  An improvement for private and public services

4

Problem & context

A distinguishing case: mobile devices
  PDAs
  Mobile phones
  LBS-enabled devices (may include the two above)

They (can) yield traces of their movement

An important problem:
  Discovering groups of individuals that (approximately) move together in some period of time
  E.g.: detection of traffic jams during rush hours

A candidate data mining reformulation of the problem
  Clustering of individuals’ trajectories

5

Which kind of clustering?

Several alternatives are available

General requirements:
  Non-spherical clusters should be allowed
    E.g.: a traffic jam along a road
    It should be represented as a single cluster whose individuals form a “snake-shaped” group
  Tolerance to noise
  Low computational cost
  Applicability to complex, possibly non-vectorial data

A suitable candidate: density-based clustering
  In particular, we adopt OPTICS

6

A condensed intro to OPTICS

A density threshold is defined through two parameters:
  ε: a neighborhood radius
  MinPts: minimum number of points

Key concepts:
  Core objects
    Objects with an ε-neighborhood that contains at least MinPts objects
  Reachability-distance reach-d(p, q)
    (Simplified definition:) distance between objects p and q

Example:
  Object “q” is a core object if MinPts = 2
  Object “p” is not
  Their reach-d() is shown

[Figure: objects q and p, the ε-neighborhood of q, and reach-d(p, q)]
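The two key concepts can be illustrated with a minimal Python sketch. The function names are hypothetical, Euclidean distance between 2-D points is an assumption, and the ε-neighborhood is assumed to count the object itself. `reach_d` follows the simplified definition given on the slide (plain distance), not the full OPTICS definition.

```python
import math

def neighborhood(points, i, eps):
    """Indices of all objects within distance eps of points[i].

    Assumption: the object itself belongs to its own neighborhood.
    """
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(neighborhood(points, i, eps)) >= min_pts

def reach_d(points, p, q):
    """Simplified reachability distance: the distance between objects p and q."""
    return math.dist(points[p], points[q])

# Toy data: object 0 plays the role of "q", object 2 the role of "p".
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
q_is_core = is_core(points, 0, eps=1.5, min_pts=2)
p_is_core = is_core(points, 2, eps=1.5, min_pts=2)
```

With these toy values, q has one neighbor besides itself inside its ε-neighborhood and therefore qualifies as a core object, while p does not.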

7

A condensed intro to OPTICS

The algorithm:

1. Repeatedly choose a random non-visited object, until a core object is selected

2. Select the core object having the smallest reachability distance from all the visited core objects. If none can be found, go to step 1

Output: the order of visit and the reach-d() of all visited points (reachability plot)

[Figure: reachability plot over the order of visit, showing a “jump” from the left-hand group (0-9) to the right-hand one (10-18); labels: reachability threshold, Cluster 1, Cluster 2]
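The visiting loop and the threshold cut on the reachability plot might be sketched as follows. This is a simplified illustration, not the full OPTICS algorithm. It uses the plain distance as reach-d, scans start objects in index order instead of choosing them at random, and in step 2 visits any object reachable from a visited core object. The names `simple_optics` and `cut_clusters` are hypothetical. Treating groups of a single object as noise is also an assumption.

```python
import math

def simple_optics(points, eps, min_pts):
    """Return (order, reach): visiting order and the reach-d recorded per visit.

    math.inf marks objects not reached from any visited core object.
    """
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    def is_core(i):
        # Assumption: the eps-neighborhood counts the object itself.
        return sum(dist(i, j) <= eps for j in range(n)) >= min_pts

    visited = [False] * n
    best = [math.inf] * n  # smallest reach-d seen so far for each object
    order = []

    for start in range(n):  # step 1 (deterministic scan instead of random choice)
        if visited[start] or not is_core(start):
            continue
        current = start
        while current is not None:
            visited[current] = True
            order.append(current)
            if is_core(current):
                for j in range(n):
                    if not visited[j] and dist(current, j) <= eps:
                        best[j] = min(best[j], dist(current, j))
            # step 2: closest non-visited object reachable from visited core objects
            frontier = [j for j in range(n) if not visited[j] and best[j] < math.inf]
            current = min(frontier, key=lambda j: best[j]) if frontier else None

    order += [i for i in range(n) if not visited[i]]  # objects never reached
    return order, [best[i] for i in order]

def cut_clusters(order, reach, threshold):
    """Split the reachability plot wherever reach-d jumps above the threshold."""
    clusters, group = [], []
    for obj, r in zip(order, reach):
        if r > threshold:
            if len(group) > 1:  # single-object groups are treated as noise
                clusters.append(group)
            group = [obj]
        else:
            group.append(obj)
    if len(group) > 1:
        clusters.append(group)
    return clusters

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0), (50, 0)]
order, reach = simple_optics(points, eps=1.5, min_pts=2)
clusters = cut_clusters(order, reach, threshold=1.2)
```

On this toy input the two dense groups come out as separate clusters and the isolated object at x = 50 is left out as noise.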

8

Applying OPTICS to trajectories

Two key issues have to be solved
  A suitable representation for trajectories is needed
    Which data model for trajectories?
  A means of comparing trajectories has to be provided
    Which distance between objects?
    OPTICS needs a distance to be defined in order to perform range queries

9

A trajectory data model

Raw input data:
  Each trajectory is represented as a set of time-stamped coordinates
  T = (t1, x1, y1), …, (tn, xn, yn) => object position at time ti was (xi, yi)

Data model
  Parametric-spaghetti: linear interpolation between consecutive points
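As an illustration of the linear-interpolation model, the position of an object at an arbitrary time can be recovered from its time-stamped points. The sketch below assumes a trajectory stored as a list of (t, x, y) tuples with strictly increasing timestamps. The function name `position_at` is hypothetical.

```python
from bisect import bisect_right

def position_at(trajectory, t):
    """Position (x, y) at time t, linearly interpolated between samples.

    Returns None when t falls outside the time span of the trajectory.
    """
    times = [point[0] for point in trajectory]
    if not times or t < times[0] or t > times[-1]:
        return None
    i = bisect_right(times, t) - 1  # last sample with timestamp <= t
    if i == len(trajectory) - 1:    # t is exactly the final timestamp
        return (trajectory[i][1], trajectory[i][2])
    t0, x0, y0 = trajectory[i]
    t1, x1, y1 = trajectory[i + 1]
    a = (t - t0) / (t1 - t0)        # fraction of the segment already covered
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

trajectory = [(0, 0.0, 0.0), (10, 10.0, 0.0), (20, 10.0, 10.0)]
halfway_first = position_at(trajectory, 5)
halfway_second = position_at(trajectory, 15)
```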

10

A distance between trajectories

Adopted distance = average distance

$$ D(\tau_1, \tau_2)|_T = \frac{\int_T d(\tau_1(t), \tau_2(t)) \, dt}{|T|} $$

(τ1(t), τ2(t): positions of the two trajectories at time t; T: the time interval considered)

It is a metric => efficient indexing methods allowed
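A numerical approximation of the average distance could look like the sketch below. It assumes Euclidean distance between positions, trajectories given as lists of (t, x, y) tuples with strictly increasing timestamps that both cover the whole interval, and a trapezoidal rule with a fixed number of steps. The function names are hypothetical.

```python
import math

def interpolate(trajectory, t):
    """Linearly interpolated (x, y) position at time t."""
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("t lies outside the time span of the trajectory")

def average_distance(traj1, traj2, t_start, t_end, steps=1000):
    """Trapezoidal approximation of the time-averaged distance over [t_start, t_end]."""
    h = (t_end - t_start) / steps
    total = 0.0
    for k in range(steps + 1):
        t = min(t_end, t_start + k * h)  # guard against floating-point overshoot
        d = math.dist(interpolate(traj1, t), interpolate(traj2, t))
        total += d if 0 < k < steps else d / 2  # end points weigh one half
    return total * h / (t_end - t_start)

east = [(0, 0.0, 0.0), (10, 10.0, 0.0)]
east_shifted = [(0, 0.0, 3.0), (10, 10.0, 3.0)]
north = [(0, 0.0, 0.0), (10, 0.0, 10.0)]

parallel = average_distance(east, east_shifted, 0, 10)
diverging = average_distance(east, north, 0, 10)
```

Two objects moving in parallel 3 units apart have an average distance of 3, while the two diverging objects are at distance t·√2 at time t, which averages to 5·√2 over the interval.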

11

A sample dataset

Set of trajectories forming 4 clusters + noise

Generated by the CENTRE system (KDDLab software)

12

OPTICS vs. HAC & K-means

[Figure: results on the sample dataset; panels: K-means, OPTICS, HAC-average]

13

Temporal focusing

Different time intervals can show different behaviours
  E.g.: objects that are close to each other within a time interval can be far apart in other periods of time

The time interval becomes a parameter
  E.g.: rush hours vs. low traffic times

Problem: significant time intervals are not always known a priori
  An automated mechanism is needed to find them

14

Temporal focusing

The proposed method

1. Provide a notion of interestingness to be associated with time intervals
   We define it in terms of estimated quality of the clustering extracted on the given time interval

2. Formalize the temporal focusing task as an optimization problem
   Discover the time interval that maximizes the interestingness measure

15

A quality measure for density-based clustering

General principle
  High-density clusters separated by low-density noise are preferred

The method
  High-density clusters correspond to low dents in the reachability plot
  => Evaluate the global quality Q of the clustering output as the negated average reachability within clusters (noise is discarded)

[Figure labels: low density, high density, medium density]

Definition: given ε and dataset D, compute Q_{D,ε} as:

$$ Q_{D,\varepsilon} = -R(D, \varepsilon') = -\operatorname{AVG}_{o \in D'} \text{reach-d}(o), \qquad D' = D \setminus \{\text{noise objects}\} $$
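A minimal sketch of the quality measure, computed directly from the values of a reachability plot, is given below. It assumes that noise objects are those whose reach-d exceeds the reachability threshold, and that a plot with no clustered object gets the worst possible score. The function name `quality` is hypothetical.

```python
import math

def quality(reach_values, threshold):
    """Q = minus the average reach-d over non-noise objects.

    Assumption: an object is noise when its reach-d exceeds the threshold.
    """
    kept = [r for r in reach_values if r <= threshold]
    if not kept:
        return -math.inf  # assumption: no clustered object means worst quality
    return -sum(kept) / len(kept)

# Reachability plot with two dents: a denser one (1.0) and a sparser one (2.0).
reach_values = [math.inf, 1.0, 1.0, math.inf, 2.0, 2.0, math.inf]
q = quality(reach_values, threshold=2.5)
q_dense_only = quality(reach_values, threshold=1.5)
```

Lower average reachability inside the clusters gives a value of Q closer to zero, which is why the denser dent alone scores higher than both dents together.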

16

FAQs

How is Q() computed for a given time interval I?
  Step 1: trajectory segments outside I are clipped away
  Step 2: OPTICS is run on the clipped trajectories
  Step 3: Q(I) is computed on the output reachability plot

How is the reachability threshold set for each interval?
  A reachability threshold is needed in order to locate clusters (and noise)
  The threshold for the largest I is manually set by the user
  Thresholds for other intervals I' ⊆ I are computed from the first one by proportionally rescaling w.r.t. average reachability

Is the optimal Q(I) biased towards tiny intervals?
  Yes. The problem has been fixed by defining Q'(I) = Q(I) / log |I|
  => A small decrease in Q(I) is accepted when it yields a much larger I
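The clipping step, the threshold rescaling, and the corrected measure Q'(I) can be sketched as follows. Several details are assumptions, not taken from the slides: trajectories are lists of (t, x, y) tuples with strictly increasing timestamps, the rescaling is a plain ratio of average reachabilities, the logarithm is natural, and |I| > 1 so that the logarithm is positive. All function names are hypothetical.

```python
import math

def clip(trajectory, t_start, t_end):
    """Keep only the part of a trajectory that lies inside [t_start, t_end]."""
    def at(t):
        for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                return (t, x0 + a * (x1 - x0), y0 + a * (y1 - y0))
        return None

    lo = max(t_start, trajectory[0][0])
    hi = min(t_end, trajectory[-1][0])
    if lo >= hi:
        return []  # the trajectory does not overlap the interval
    inner = [p for p in trajectory if lo < p[0] < hi]
    return [at(lo)] + inner + [at(hi)]

def rescale_threshold(base_threshold, base_avg_reach, sub_avg_reach):
    """Threshold for a sub-interval, scaled by the ratio of average reachabilities."""
    return base_threshold * sub_avg_reach / base_avg_reach

def adjusted_quality(q, interval_length):
    """Q'(I) = Q(I) / log |I| (natural logarithm assumed, |I| > 1)."""
    return q / math.log(interval_length)

trajectory = [(0, 0.0, 0.0), (10, 10.0, 0.0), (20, 10.0, 10.0)]
clipped = clip(trajectory, 5, 15)
small_interval = adjusted_quality(-2.0, 10)
large_interval = adjusted_quality(-2.2, 100)
```

In the toy values, the larger interval has a slightly worse Q (-2.2 against -2.0) but a better Q', which matches the intended trade-off.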

17

Experiments

A more complex sample dataset (generated by CENTRE)
  Clear clusters in the central time interval vs. dispersion on the borders

18

Optimizing Q()

Find the optimal Q() by plotting values for all time intervals
  The optimum corresponds to the central time interval
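An exhaustive search over all time intervals might be sketched as below. Time is assumed to be divided into `n_slots` elementary slots, so that an interval is a pair (start, end) of slot boundaries. `toy_quality` is a purely hypothetical stand-in for the real evaluation, which would clip the trajectories, run OPTICS, and compute Q on the result.

```python
def best_interval(n_slots, quality):
    """Evaluate quality(start, end) on every interval and keep the best one.

    Returns (best_score, best_interval, number_of_evaluations).
    """
    best_score, best = None, None
    evaluations = 0
    for start in range(n_slots):
        for end in range(start + 1, n_slots + 1):
            score = quality(start, end)
            evaluations += 1
            if best_score is None or score > best_score:
                best_score, best = score, (start, end)
    return best_score, best, evaluations

def toy_quality(start, end):
    # Hypothetical stand-in with a single optimum at the central interval (3, 7).
    return -(abs(start - 3) + abs(end - 7))

score, interval, evaluations = best_interval(10, toy_quality)
```

With 10 slots there are already 55 intervals to evaluate, and the count grows quadratically with the number of slots.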

19

Heuristics for optimum search

Each Q() value computation requires a run of the OPTICS algorithm

Computing all O(N^2) values is too expensive (N = |{sub-intervals}|)

Alternative approaches are needed

Preliminary tests with a hill-climbing (i.e., greedy) approach:
  Test on the same dataset
  Global optimum found in 70.7% of runs
  Avg. number of steps: 17
  Avg. OPTICS runs: 49

[Figure labels: starting points, local optima, global optimum]
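A greedy search of this kind could be sketched as follows. The slides do not specify the neighborhood used, so moving one end point of the interval by one slot is an assumption, as are the function names. `toy_quality` is a hypothetical stand-in for a full OPTICS run followed by the computation of Q.

```python
def hill_climb(start, n_slots, quality):
    """Greedy ascent over intervals (s, e) with 0 <= s < e <= n_slots.

    Returns (interval, score, quality_evaluations).
    """
    current = start
    current_score = quality(*current)
    runs = 1
    while True:
        s, e = current
        # Assumption: neighbors differ by one slot at one end point.
        candidates = [(s - 1, e), (s + 1, e), (s, e - 1), (s, e + 1)]
        candidates = [(a, b) for a, b in candidates if 0 <= a < b <= n_slots]
        scored = [(quality(a, b), (a, b)) for a, b in candidates]
        runs += len(scored)
        if not scored:
            return current, current_score, runs
        best_score, best = max(scored)
        if best_score <= current_score:  # no neighbor improves: stop here
            return current, current_score, runs
        current, current_score = best, best_score

def toy_quality(start, end):
    # Hypothetical stand-in with a single optimum at the interval (3, 7).
    return -(abs(start - 3) + abs(end - 7))

final, score, runs = hill_climb((0, 10), 10, toy_quality)
```

The toy function has a single peak, so the search always reaches it. On a quality surface with several peaks the same procedure can stop at a local optimum, depending on the starting point.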

20

Conclusions & future work

Summary of the work
  Extension of OPTICS to a trajectory data model & distance
  Definition of the temporal focusing problem
  Definition of a clustering quality measure
  (Preliminary) tests with exhaustive & greedy optimization

Future work
  Experimental validation over broader benchmarks
  Tighter integration between OPTICS and search strategy
  Alternative, domain-specific definitions of quality measures
