Mastering the Spatio-Temporal Knowledge Discovery Process PhD Candidate: Roberto Trasarti PhD Thesis discussion University of Pisa


Page 1: Roberto Trasarti PhD Thesis

Mastering the Spatio-Temporal Knowledge Discovery Process

PhD Candidate:

Roberto Trasarti

PhD Thesis discussion

University of Pisa

Page 2: Roberto Trasarti PhD Thesis

Spatio-Temporal context

Research on moving-object data analysis has recently been fostered by the widespread diffusion of new techniques and systems for monitoring, collecting and storing location-aware data. These data are generated by a wealth of technological infrastructures, such as the Global Positioning System (GPS), the Global System for Mobile Communications (GSM), and sensor networks.

Page 3: Roberto Trasarti PhD Thesis

Knowledge Discovery Process

Knowledge discovery is a multi-step process that involves data preprocessing, pattern mining, and pattern post-processing.

Page 4: Roberto Trasarti PhD Thesis

Motivations

There is no unifying framework in which mining tools are specific components of the knowledge discovery process.

Combining elements from different worlds causes an impedance mismatch.

Data Models ?

Page 5: Roberto Trasarti PhD Thesis

Related Works

The literature offers no proposals that address the problem of a uniform framework.

There are approaches based on moving-object databases, such as SECONDO and Hermes, which provide some primitives.

The thesis work has been inspired by well-known literature on the inductive database vision.

Page 6: Roberto Trasarti PhD Thesis

The proposed Framework

A conceptual framework, the Two-Worlds model, lays the foundation for the proposed data mining query language and the developed system.

This thesis proposes:

A uniform way to represent the entities of the two worlds: data and models

A set of operators between the two worlds

Page 7: Roberto Trasarti PhD Thesis

The object relational database paradigm

Database: D = {S1...Sn}

Schema: Sj = {T1...Tm}

Table: Ti = <a1...ah>

Attribute: ar ∈ A

Attribute types: A = {Numerical, Categorical, Descriptive, Object}

Numerical: a type that describes a number with its precision.

Categorical: a value from a predefined set, in a predefined format.

Descriptive: any string of characters.

Object: a complex type that can contain other attributes, lists and methods.
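The definitions on this slide can be written down as a few plain data structures. The following is only a minimal sketch in Python: the class names and the example table are illustrative assumptions, not part of the thesis or of the GeoPKDD code.

```python
from dataclasses import dataclass, field
from enum import Enum


class AttributeType(Enum):
    """The set A of attribute types listed on the slide."""
    NUMERICAL = "Numerical"
    CATEGORICAL = "Categorical"
    DESCRIPTIVE = "Descriptive"
    OBJECT = "Object"


@dataclass(frozen=True)
class Attribute:
    name: str
    type: AttributeType


@dataclass
class Table:
    """Ti = <a1...ah>: an ordered list of attributes."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)


@dataclass
class Schema:
    """Sj = {T1...Tm}: a collection of tables."""
    name: str
    tables: list[Table] = field(default_factory=list)


@dataclass
class Database:
    """D = {S1...Sn}: a collection of schemas."""
    schemas: list[Schema] = field(default_factory=list)


# Hypothetical example: a trajectory table with a numerical id and an object column.
trajectories = Table("Trajectories", [
    Attribute("id", AttributeType.NUMERICAL),
    Attribute("trajobj", AttributeType.OBJECT),
])
db = Database([Schema("mobility", [trajectories])])
```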

Page 8: Roberto Trasarti PhD Thesis

Object representation of Data and Models

Using the object relational paradigm we represent data and models as objects

The set of attribute types A can be partitioned into three subsets: As, Ad and Am.

Ad (data types, Data World) contains the object types Spatial object, Temporal object and Moving object.

Am (model types, Model World) contains the object types T-Pattern object, Cluster object and Flock object.

Page 9: Roberto Trasarti PhD Thesis

Data Types

Spatial object is an object which has a geometric shape and a position in space.

Temporal object is an object which has an absolute temporal reference and a duration.

Moving object is an object which changes in time and space.

[Figure: two plots with axes x, y and t]

Page 10: Roberto Trasarti PhD Thesis

Data-World

The D-World represents the entities to be analyzed, as well as their properties and mutual relationships.

Intuitively the D-World is the set of entities which describe the trajectory dataset and/or a set of regions and/or a partition of the day.

The D-World is a set of tables defined only by attributes in Ad and As

Page 11: Roberto Trasarti PhD Thesis

Models Types

T-Pattern is a concise description of frequent behaviors, in terms of both space and time

Cluster is the spatio-temporal affinity between a set of moving objects w.r.t. a distance function.

Flock is the spatio-temporal coincidence between a set of moving objects that move together.

[Figure: example T-Pattern over RegionA, RegionB and RegionC, with transition times labelled 10 min and 5 min]

Page 12: Roberto Trasarti PhD Thesis

Model-World

The M-World contains all the movement patterns extracted from the data with their properties and relationships.

The M-World contains the collection of models, unveiled at the different stages of the knowledge discovery process.

The M-World is a set of tables defined only by attributes in Am and As
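The two membership rules (D-World tables use only attributes in Ad and As, M-World tables only attributes in Am and As) can be checked mechanically. The sketch below is an illustration under assumptions: the transcript does not say what As contains, so it is assumed here to hold the standard non-object types, and the type names are written as plain strings chosen for this example.

```python
# Assumed partition of the attribute types (string names are illustrative).
STANDARD_TYPES = {"Numerical", "Categorical", "Descriptive"}      # As (assumption)
DATA_TYPES = {"SpatialObject", "TemporalObject", "MovingObject"}  # Ad
MODEL_TYPES = {"TPattern", "Cluster", "Flock"}                    # Am


def in_d_world(table: dict[str, str]) -> bool:
    """True if every attribute type of the table belongs to Ad or As."""
    return set(table.values()) <= DATA_TYPES | STANDARD_TYPES


def in_m_world(table: dict[str, str]) -> bool:
    """True if every attribute type of the table belongs to Am or As."""
    return set(table.values()) <= MODEL_TYPES | STANDARD_TYPES


# Tables are given as {attribute name: attribute type}.
trajectories = {"id": "Numerical", "trajobj": "MovingObject"}
patterns = {"id": "Numerical", "obj": "TPattern"}
mixed = {"traj": "MovingObject", "pattern": "TPattern"}
```

Under these rules a table that mixes data and model objects, such as `mixed`, belongs to neither world.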

Page 13: Roberto Trasarti PhD Thesis

Two-Worlds Operators

Operators can be intra-world or inter-world, and different classes of operators have been defined for each type.

Page 14: Roberto Trasarti PhD Thesis

Data Constructor Operators

The aim of this class of operators is to build objects in the D-World starting from the raw data.

It realizes the data acquisition step of the knowledge discovery process.

A generic Data Constructor operator is defined as OPconstructor(T, p) → Td

Page 15: Roberto Trasarti PhD Thesis

Model Constructor Operators

This kind of operator realizes the extraction of models from the D-World through data mining algorithms.

A generic Model Constructor operator is defined as OPmining(Td, p) → Tm

Page 16: Roberto Trasarti PhD Thesis

Transformation Operators

Transformation operators are intra-world tasks aimed at manipulating data and models.

These operations are the means for expressing data pre-processing and post-processing tasks.

A generic D-Transformation operator is defined as OPD-Transf(Td, p) → T'd

A generic M-Transformation operator is defined as OPM-Transf(Tm, p) → T'm

Page 17: Roberto Trasarti PhD Thesis

Relation Operators

Relation operators include both intra-world and inter-world operations. Their objective is to create relations between data, between models, and between data and models.

A generic DD-Relation operator is defined as OPDD-Relation(Tdd, f) → TRdd

A generic MM-Relation operator is defined as OPMM-Relation(Tmm, f) → TRmm

A generic DM-Relation operator is defined as OPDM-Relation(Tdm, f) → TRdm
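A relation operator takes a table of object pairs and a predicate f, and returns the pairs that satisfy f. The sketch below illustrates that shape only. The row layout (`left_id`, `left_obj`, `right_id`, `right_obj`) and the use of Python sets as stand-ins for data and model objects are assumptions made for the example.

```python
from typing import Any, Callable

Row = dict[str, Any]
Table = list[Row]
Predicate = Callable[[Any, Any], bool]


def relation_operator(pairs: Table, f: Predicate) -> Table:
    """Keep the id pairs whose two objects satisfy the predicate f."""
    return [
        {"left_id": row["left_id"], "right_id": row["right_id"]}
        for row in pairs
        if f(row["left_obj"], row["right_obj"])
    ]


# Toy objects: sets of region names stand in for real data and model objects.
pairs = [
    {"left_id": 1, "left_obj": {"A", "B", "C"}, "right_id": 10, "right_obj": {"A", "B"}},
    {"left_id": 2, "left_obj": {"A"}, "right_id": 10, "right_obj": {"A", "B"}},
]


def contains(left: set, right: set) -> bool:
    """Toy 'Contains' predicate: the left set is a superset of the right one."""
    return left >= right


result = relation_operator(pairs, contains)
```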

Page 18: Roberto Trasarti PhD Thesis

The predicate f can take many forms. However, the semantics of each predicate depends on the types of the data (resp. model) objects to which it is applied.

Predicates of relation operators

Column order: Spatial Object, Temporal Object, Moving Object, T-Pattern, Cluster, Flock. Each row lists its filled cells from left to right, separated by "|".

Spatial Object: Intersects, Contains, Equals | Intersects, Contains | Intersects, Contains | Intersects, Contains | Intersects, Contains

Temporal Object: Intersects, Contains, Equals | Intersects, Contains | Intersects, Contains

Moving Object: Intersects, Contains, Equals | Intersects, Contains, Entails | Intersects, Contains, Entails | Intersects, Contains, Entails

T-Pattern: Intersects, Contains, Equals

Cluster: Intersects, Contains, Equals

Flock: Intersects, Contains, Equals

The table is divided into DD (data-data), DM (data-model) and MM (model-model) areas.
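As an illustration of how predicate semantics depend on the object type, the sketch below spells out Intersects, Contains and Equals for temporal objects only. It assumes a temporal object is a closed interval given by a start time and a duration. The class and function names are hypothetical and not taken from the GeoPKDD system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalObject:
    start: float     # absolute temporal reference, e.g. seconds since midnight
    duration: float  # length of the interval, in the same unit

    @property
    def end(self) -> float:
        return self.start + self.duration


def intersects(a: TemporalObject, b: TemporalObject) -> bool:
    """The two closed intervals share at least one instant."""
    return a.start <= b.end and b.start <= a.end


def contains(a: TemporalObject, b: TemporalObject) -> bool:
    """Interval b lies entirely inside interval a."""
    return a.start <= b.start and b.end <= a.end


def equals(a: TemporalObject, b: TemporalObject) -> bool:
    """Both intervals have the same start and the same duration."""
    return a.start == b.start and a.duration == b.duration


# Hypothetical partition of the day, in seconds since midnight.
morning = TemporalObject(6 * 3600, 6 * 3600)   # 06:00 to 12:00
rush = TemporalObject(7 * 3600, 2 * 3600)      # 07:00 to 09:00
evening = TemporalObject(18 * 3600, 4 * 3600)  # 18:00 to 22:00
```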

Page 19: Roberto Trasarti PhD Thesis

Data Mining Query Language

We defined a data mining query language to support the user during knowledge discovery tasks.

Three advantages:

The compositionality of the operators

The iterative querying

The repeatability of the process

Page 20: Roberto Trasarti PhD Thesis

DMQL Grammar

DMQL := DataConstructionOperator | ModelConstructionOperator | TransformationOperator | RelationOperator | SQLStandard

TransformationOperator := 'CREATE TRANSFORMATION' TableName 'USING' TransformationName 'FROM (' SqlCall ')' ['SET' Parameters]

RelationOperator := 'CREATE RELATION' TableName 'USING' RelationPredicate 'FROM (' SqlCall ')'

DataConstructionOperator := 'CREATE DATA' TableName 'BUILDING' DataConstructorName 'FROM (' SqlCall ')' ['SET' Parameters]

ModelConstructionOperator := 'CREATE MODELS' TableName 'USING' ModelConstructorName 'FROM (' SqlCall ')' ['SET' Parameters]
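To make the grammar concrete, the following sketch recognizes the four DMQL statement types with a single regular expression and treats everything else as standard SQL. It is an illustrative assumption, not the parser of the GeoPKDD system: the function name `parse_dmql`, the returned dictionary layout, and the rule that parameters are separated by AND (taken from the example queries in these slides) are all choices made for this example.

```python
import re

# One pattern for the four CREATE statements of the grammar.
_DMQL = re.compile(
    r"""^\s*CREATE\s+(?P<kind>DATA|MODELS|TRANSFORMATION|RELATION)\s+
        (?P<table>\S+)\s+
        (?P<verb>BUILDING|USING)\s+
        (?P<name>\S+)\s+
        FROM\s*\((?P<sql>.*)\)\s*
        (?:SET\s+(?P<params>.*?))?\s*$""",
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def parse_dmql(text: str) -> dict:
    """Split a DMQL statement into its parts; fall back to plain SQL."""
    m = _DMQL.match(text)
    if m is None:
        return {"kind": "SQL", "sql": text.strip()}
    kind, verb = m["kind"].upper(), m["verb"].upper()
    # Per the grammar, only CREATE DATA uses BUILDING; the others use USING.
    if (kind == "DATA") != (verb == "BUILDING"):
        raise ValueError(f"CREATE {kind} cannot be combined with {verb}")
    params = {}
    if m["params"]:
        # Per the grammar, RelationOperator has no SET clause.
        if kind == "RELATION":
            raise ValueError("CREATE RELATION does not accept SET parameters")
        for part in re.split(r"\s+AND\s+", m["params"].strip(), flags=re.IGNORECASE):
            key, _, value = part.partition("=")
            params[key.strip()] = value.strip()
    return {
        "kind": kind,
        "table": m["table"],
        "name": m["name"],
        "sql": m["sql"].strip(),
        "params": params,
    }
```

Parameter values are kept as raw strings (for example `2000 m`), because the grammar does not specify their syntax.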

Page 21: Roberto Trasarti PhD Thesis

The Design of the GeoPKDD system

The GeoPKDD system is an implementation of the Two-Worlds model and the Data Mining Query Language.

Page 22: Roberto Trasarti PhD Thesis

Object-Relational Database and Database Manager

As described above, the object-relational database contains the representation of both data and models and grants the power of SQL.

The database manager realizes a middle layer and, using the translation libraries, decouples the system from the underlying database technologies.

Page 23: Roberto Trasarti PhD Thesis

Language Parser and Controller

The parser identifies the various types of queries and builds an execution plan for each, as a sequence of actions for the controller.

Example:

CREATE MODELS ClusteringTable USING OPTICS
FROM (Select t.id, t.trajobj from Trajectories t)
SET OPTICS.distance_method = Route Similarity AND OPTICS.eps = 50 AND OPTICS.min_size = 100

Plan:

1. Retrieve[ Select t.id, t.trajobj from Trajectories t ]

2. Translate[ Data type: Moving point ]

3. Execute[ Mining algorithm: Optics algorithm, Parameters: ... ]

4. Translate[ Model type: Cluster ]

5. Store[ Table Name: ClusteringTable ]
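The five-step plan above can be produced by a small function once the query has been split into its parts. This is a hedged sketch, not the controller of the GeoPKDD system: `Action`, `build_plan` and the `ALGORITHMS` registry are hypothetical names, and the registry holds only the OPTICS entry shown on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    step: str
    detail: str


# Algorithm name -> (input data type, output model type).
# Only the OPTICS entry is taken from the slide; the registry itself is an assumption.
ALGORITHMS = {"OPTICS": ("Moving point", "Cluster")}


def build_plan(table: str, algorithm: str, sql: str, params: dict[str, str]) -> list[Action]:
    """Turn the parts of a CREATE MODELS query into an ordered list of actions."""
    data_type, model_type = ALGORITHMS[algorithm.upper()]
    rendered = ", ".join(f"{key}={value}" for key, value in params.items())
    return [
        Action("Retrieve", sql),
        Action("Translate", f"Data type: {data_type}"),
        Action("Execute", f"Mining algorithm: {algorithm}, Parameters: {rendered}"),
        Action("Translate", f"Model type: {model_type}"),
        Action("Store", f"Table Name: {table}"),
    ]


plan = build_plan(
    "ClusteringTable",
    "OPTICS",
    "Select t.id, t.trajobj from Trajectories t",
    {"OPTICS.eps": "50", "OPTICS.min_size": "100"},
)
```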

Page 24: Roberto Trasarti PhD Thesis

Algorithms Manager

This component is a plug-in module capable of managing different sets of libraries

Each library realizes a different set of operators according to the proposed Two-Worlds framework.

Page 25: Roberto Trasarti PhD Thesis

Algorithms Libraries

Data construction library: Moving object Reconstruction algorithm, Spatial object Builder algorithm, Temporal object Builder algorithm

Model construction library: T-Pattern algorithm, Optics algorithm, T-Flock algorithm

Transformation library: Resampling algorithm, Intersection algorithm, Object filtering, T-Anonymity algorithms

Relation library: All the predicates

CREATE DATA MobilityData BUILDING MOVING_POINTS FROM (SELECT userid,lon,lat,datetime

FROM MobilityRawData ORDER BY userid,datetime)

SET MOVING_POINT.MAX_SPACE_GAP = 2000 m AND MOVING_POINT.MAX_TIME_GAP = 1800 sec

CREATE MODELS Patterns USING T-PATTERN FROM (Select t.id, t.trajobj from Trajectories t) SET T-PATTERN.support = .02 AND T-PATTERN.time = 120 sec

CREATE TRANSFORMATION AnonimizedData USING NWA FROM (SELECT t.id, t.trajobj FROM Trajectories t) SET ANONYMIZATION.K = 10 AND ANONYMIZATION.TIME_SLOT = 600 sec

CREATE RELATION EntailmentTable USING ENTIAL FROM (SELECT t.id, t.trajobj, p.id, p.obj FROM Trajectories t, Patterns p)
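The CREATE DATA query above sets MAX_SPACE_GAP and MAX_TIME_GAP for the moving-object reconstruction. The slides do not define their semantics, so the sketch below assumes a common interpretation: a user's ordered points are cut into a new trajectory whenever two consecutive points are further apart than either threshold. The function names and the use of the haversine distance are assumptions of this example.

```python
import math
from itertools import groupby


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two lon/lat points."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_moving_points(rows, max_space_gap=2000.0, max_time_gap=1800.0):
    """Cut raw observations into trajectories.

    rows: (userid, lon, lat, t_seconds) tuples, sorted by userid and then time,
    as in the ORDER BY clause of the query.
    Returns a list of (userid, [(lon, lat, t), ...]).
    """
    trajectories = []
    for user, group in groupby(rows, key=lambda row: row[0]):
        current = []
        for _, lon, lat, t in group:
            if current:
                prev_lon, prev_lat, prev_t = current[-1]
                too_late = t - prev_t > max_time_gap
                too_far = haversine_m(prev_lon, prev_lat, lon, lat) > max_space_gap
                if too_late or too_far:
                    trajectories.append((user, current))
                    current = []
            current.append((lon, lat, t))
        if current:
            trajectories.append((user, current))
    return trajectories
```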

Page 26: Roberto Trasarti PhD Thesis

Extending the system

The GeoPKDD system can be extended in several ways:

Architecture level: new components

Algorithm level: new algorithms

Types level: new data types or model types

Page 27: Roberto Trasarti PhD Thesis

Add-ons: Reasoning component

This component exploits application domain knowledge encoded in an ontology to infer a semantic interpretation of discovered patterns.

SELECT id, trajobj FROM Trajectories t WHERE SEM_CONCEPT(trajobj) = 'TouristTrajectory'

Page 28: Roberto Trasarti PhD Thesis

Add-ons: Location Prediction

The goal is to construct a predictive model using the set of T-patterns extracted from a set of trajectories.

Given a new trajectory, the predictive model can be used to predict its next location.

[Figure: Trajectory dataset, Local patterns, Prediction Tree]

CREATE TRANSFORMATION TPatternTree USING TPATTERN_TREE FROM (Select p.id, p.TpatternObj FROM PatternTable p)
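The idea of a prediction tree built from T-patterns can be illustrated with a prefix tree over region sequences. The sketch below is a deliberately simplified assumption and not the algorithm used in the thesis: it ignores the transition times of T-patterns, keeps only a support value per node, and predicts by matching the longest suffix of the visited regions. All names, and the regions RegionD and RegionE in the usage example, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    support: float = 0.0
    children: dict[str, "Node"] = field(default_factory=dict)


def build_tree(patterns) -> Node:
    """Build a prefix tree from (region sequence, support) pairs."""
    root = Node()
    for regions, support in patterns:
        node = root
        for region in regions:
            node = node.children.setdefault(region, Node())
            # A shared prefix keeps the highest support seen so far.
            node.support = max(node.support, support)
    return root


def predict_next(root: Node, visited: list[str]):
    """Return the best-supported region after the longest matching suffix of `visited`."""
    for start in range(len(visited)):
        node = root
        for region in visited[start:]:
            node = node.children.get(region)
            if node is None:
                break  # this suffix is not a path in the tree, try a shorter one
        else:
            if node.children:
                return max(node.children.items(), key=lambda kv: kv[1].support)[0]
    return None


tree = build_tree([
    (["RegionA", "RegionB", "RegionC"], 0.05),
    (["RegionA", "RegionB", "RegionD"], 0.02),
    (["RegionB", "RegionE"], 0.03),
])
```

With this tree, a trajectory that has visited RegionA and then RegionB is predicted to continue to RegionC, the child with the highest support.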

Page 29: Roberto Trasarti PhD Thesis

Add-ons: K-Best Map Matching

A new way to perform map matching.

In real cases the shortest-path assumption can be violated when external factors play a role (e.g. traffic congestion).

CREATE DATA K-MobilityData BUILDING K-MOVING_POINTS FROM (SELECT userid, lon, lat, datetime FROM MobilityRawData ORDER BY userid, datetime) SET K-MOVING_POINTS.K = 5 AND K-MOVING_POINTS.MAP = StreetMapFile.wkt

Page 30: Roberto Trasarti PhD Thesis

A Case Study in an Urban Mobility Scenario

A set of experiments performed on a real-world case study demonstrates the capabilities of the GeoPKDD system and how it can be exploited to extract useful knowledge from raw mobility data.

GPS traces

17K private cars

One week of ordinary mobility

200K trips (trajectories)

Milan, Italy

Data donated by

Page 31: Roberto Trasarti PhD Thesis

Demo

GeoPKDD system: equipped with a very simple GUI which enables the user to write DMQL queries and visualize the results.

M-Atlas: the new generation of the GUI, where DMQL is used to build complex analyses by creating scripts.

Page 32: Roberto Trasarti PhD Thesis

Contributions

The contributions of the thesis are:

the creation of a theoretical framework to manage the complex knowledge discovery process on mobility data

the definition of a DMQL which realizes the operators of the framework

the implementation of a real system capable of handling large amounts of data

three extensions of the system: reasoning component, k-best map matching and location prediction algorithms

an extensive study and analysis of a real case study

Page 33: Roberto Trasarti PhD Thesis

Achievements

The GeoPKDD system was one of the two project demonstrators and has been successfully presented in the final review of the GeoPKDD project.

Presented at the European Parliament as one of the selected projects in the Future and Emerging Technologies (FET) program

Published in several conferences such as KDD, ICDM, EDBT, AGILE, etc.

It is used in the collaboration with the Milan Mobility Agency for mobility understanding

It is currently used in collaboration with Orange Telecom for the “Big Paris” project

Page 34: Roberto Trasarti PhD Thesis

Publications

2010

12 Roberto Trasarti, Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Chiara Renso: A Query Language for Mobility Data Mining. International Journal of Data Warehousing and Mining (IJDWM) 2010

11 Mirco Nanni, Roberto Trasarti, Chiara Renso, Fosca Giannotti, Dino Pedreschi: Advanced Knowledge Discovery on Movement Data with the GeoPKDD system. EDBT 2010

2009

10 Mirco Nanni, Roberto Trasarti, Fosca Giannotti: K-BestMatch reconstruction and comparison of trajectory data. SSTDM - ICDM 2009

9 Fosca Giannotti, Roberto Trasarti: Mobility, Data Mining and Privacy: The GeoPKDD Paradigm. SIAM Journal (IM09)

8 Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Chiara Renso, Roberto Trasarti: Mining Social Mobility Behaviors from GPS data. SCMPS - SocialCom 2009

7 Roberto Trasarti, Miriam Baglioni, Chiara Renso: DAMSEL: a System for Progressive Querying and Reasoning on Movement data. FlexDBIST 2009

6 Anna Monreale, Fabio Pinelli, Roberto Trasarti, Fosca Giannotti: WhereNext: a Location Predictor on Trajectory Pattern Mining. KDD 2009. 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

5 Fosca Giannotti, Chiara Renso, Roberto Trasarti: GeoPKDD – Geographic Privacy-aware Knowledge Discovery. FET 2009. The European Future Technologies Conference

4 Miriam Baglioni, Jose de Macedo, Chiara Renso, Roberto Trasarti and Monica Wachowicz: Towards semantic interpretation of movement behavior. 12th AGILE International Conference on Geographic Information Science.

2008

3 Riccardo Ortale, E. Ritacco, Nikos Pelekis, Roberto Trasarti, Gianni Costa, Fosca Giannotti, Giuseppe Manco, Chiara Renso, Yannis Theodoridis: The DAEDALUS Framework: Progressive Querying and Mining of Movement Data. ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008).

2 Fabio Pinelli, Anna Monreale, Roberto Trasarti, Fosca Giannotti: Location prediction within the mobility data analysis environment Daedalus. First International Workshop on Computational Transportation Science (IWCTS). MOBIQUITOUS 2008. ACM digital library

1 Riccardo Ortale, E. Ritacco, Nikos Pelekis, Roberto Trasarti, Gianni Costa, Fosca Giannotti, Giuseppe Manco, Chiara Renso, Yannis Theodoridis: DAEDALUS: A knowledge discovery analysis framework for movement data. SEBD 2008: 191-198

  Journals

  Conference Proceedings

  Demos or Posters

Page 35: Roberto Trasarti PhD Thesis

Thank you

Questions?