TRAJECTORY-BASED POINT OF INTEREST
RECOMMENDATION
by
Geoffrey Benjamin Zenger
B.Sc. (Hons. First Class), Simon Fraser University, 2007
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the School
of
Computing Science
© Geoffrey Benjamin Zenger 2009
SIMON FRASER UNIVERSITY
Fall 2009
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
Last revision: Spring 09
Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Abstract
Existing point of interest (POI) recommendation systems for mobile users only consider a
user’s present spatio-temporal location, and do not utilize a user’s trajectory history. In this
thesis, we identify some essential requirements for a mobile trajectory-based recommenda-
tion system, and present a new framework for trajectory-based POI recommendation. We
construct a k-truncated generalized suffix tree to represent a historical trajectory database,
and use it to execute exact matching recommendation queries. In addition to individual
points of interest, we can recommend generalizations of POIs by using density estimation.
We also consider extensions of our framework. Two variants are developed, allowing for the
execution of fuzzy matching and order-flexible queries. Furthermore, a technique for diver-
sifying recommendations is presented. The resulting system can efficiently and accurately
predict a user’s next visited point given a query, and is demonstrated to be effective and
scalable on two real world datasets.
Keywords: trajectory mining; POI recommendation; recommendation systems; fuzzy
matching; order-flexible matching; recommending generalizations
To Brittany
“You’re beginning with an illogical premise and proceeding
perfectly logically to an illogical conclusion.”
— Donald Rumsfeld, 2001
Acknowledgments
I would like to extend my gratitude to my senior supervisor, Dr. Jian Pei, for guiding me
through the last two years of study and research. Through his creativity, energy, and exper-
tise he has given me a great appreciation for academic research and the joy of conducting
original research. I would also like to thank him for his patience even when work and a
medical emergency distracted me from my academic work.
In addition, I want to thank Dr. Qianping Gu for agreeing to serve on my committee
and to Dr. Joseph Peters for his willingness to serve as one of my supervisors. Through
the courses I took with him and numerous discussions held in his office, Dr. Peters played
an instrumental role in teaching me that research can, and in fact should, be a fun and
enjoyable endeavour.
I would like to thank Michael Tsumura, Ivailo Ivanov, Nebojsa Stefanovic, and everybody
else that I have worked with and worked for at SAP Business Objects for their flexibility
that has allowed me to attend courses at SFU and write this thesis.
I would like to thank my friends and family for supporting my decision to continue my
studies and graciously endure an additional two years in which I had little free time.
Finally, I want to express my gratitude to Brittany, who has cared for me, provided support,
and has graciously served as a sounding board for my research ideas.
Contents
Approval ii
Abstract iii
Dedication iv
Quotation v
Acknowledgments vi
Contents vii
List of Tables xi
List of Figures xii
List of Algorithms xiv
1 Introduction 1
1.1 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Mobile Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Trajectory Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem Description 10
3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Recommendation System Requirements . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Quantifiability of Confidence . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 On-line Recommendation Capability . . . . . . . . . . . . . . . . . . . 15
3.3.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.4 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.5 Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.6 Fuzziness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.7 Order-Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.8 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Satisfying the Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Exact Matching 20
4.1 Exact Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Naïve Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Accounting for POI Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Increasing the Confidence of Each Similar POI . . . . . . . . . . . . . 23
4.2.2 Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.3 Recommending Generalizations . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Spatio-specific Generalized Recommendations . . . . . . . . . . . . . . . . . . 27
4.4 Implementing Exact Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.1 Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.2 k-truncated Generalized Suffix Trees . . . . . . . . . . . . . . . . . . . 32
4.4.3 Computing Point Distance . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.4 Executing Exact Matching Queries . . . . . . . . . . . . . . . . . . . . 35
4.5 Diversification of Recommendations . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Variants 42
5.1 Fuzzy matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 Implementing Fuzzy Matching . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Order-Flexible Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 History-Centric Approach . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Query-Centric Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6 Experimental Results 53
6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.1 Dataset Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.2 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Evaluating Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.3 Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.4 Query Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3.5 Number of Recommendations . . . . . . . . . . . . . . . . . . . . . . . 73
6.3.6 Diversification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.7 Other Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3.8 Effects of k-Truncated Suffix Trees . . . . . . . . . . . . . . . . . . . . 79
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7 Conclusion 83
7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1.1 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1.2 Parallelizing Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1.3 User Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.4 Temporal Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.5 Continuous Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.6 Longer Tails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.1.7 Other Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
A Constructing suffix trees 89
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 Ukkonen’s Algorithm for Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . 89
A.3 Constructing Generalized Suffix Trees . . . . . . . . . . . . . . . . . . . . . . 92
A.4 Constructing k-truncated Generalized Suffix Trees . . . . . . . . . . . . . . . 92
Bibliography 95
List of Tables
6.1 Dataset Trajectory Information . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Dataset POI Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
List of Figures
4.1 Graphical depiction of example . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Example of Gaussian kernel estimation . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Example illustrating the problem of over-generalization . . . . . . . . . . . 27
4.4 Solving over-generalization problem with grid cells . . . . . . . . . . . . . . . 29
4.5 Suffix tree for the word “mississippi$” . . . . . . . . . . . . . . . . . . . . . . 31
4.6 3-truncated suffix tree for the word “mississippi$”. . . . . . . . . . . . . . . . 33
4.7 Example of concept distance. conceptDistance(x, y) = 3 . . . . . . . . . . . . 34
5.1 Demonstrating the Fuzzy Search Radius around a Trajectory . . . . . . . . . 44
6.1 Datasets used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Processed Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Concept Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 INFATI Datasets: Weighted Scores vs. Fuzzy matching radius . . . . . . . . . 65
6.5 Trucks Datasets: Weighted Scores vs. Fuzzy matching radius . . . . . . . . . 66
6.6 INFATI Datasets: Unsatisfiable Queries vs. Fuzzy matching radius . . . . . . 66
6.7 Trucks Datasets: Unsatisfiable Queries vs. Fuzzy matching radius . . . . . . . 67
6.8 INFATI Datasets: Binary Scores vs. Fuzzy matching radius . . . . . . . . . . 67
6.9 Trucks Datasets: Binary Scores vs. Fuzzy matching radius . . . . . . . . . . . 68
6.10 INFATI Datasets: Query Time vs. Fuzzy matching radius . . . . . . . . . . . 68
6.11 Trucks Datasets: Query Time vs. Fuzzy matching radius . . . . . . . . . . . . 69
6.12 INFATI-500: Effect of Query Length on Weighted Score . . . . . . . . . . . . 71
6.13 INFATI-500: Effect of Query Length on Binary Score . . . . . . . . . . . . . 71
6.14 INFATI-500: Effect of Query Length on Query Time . . . . . . . . . . . . . . 72
6.15 INFATI-500: Effect of Query Length on Unsatisfiable Queries . . . . . . . . . 72
6.16 INFATI-500: Effect of the Number of Recommendations on Weighted Score . 74
6.17 INFATI-500: Effect of the Number of Recommendations on Binary Score . . 74
6.18 INFATI-500: Effect of Diversification on Weighted Score . . . . . . . . . . . . 75
6.19 INFATI-500: Effect of Diversification on Binary Score . . . . . . . . . . . . . 76
6.20 INFATI-500: Effect of Spatial Factor on Weighted Score . . . . . . . . . . . . 76
6.21 INFATI-500: Effect of Spatial Factor on Query Time . . . . . . . . . . . . . . 78
6.22 INFATI-500: Effect of Kernel Width on Weighted Score . . . . . . . . . . . . 78
6.23 Effects of Query Length on Suffix Tree Construction Time . . . . . . . . . . . 80
6.24 Effects of Truncation on Suffix Tree Memory Usage . . . . . . . . . . . . . . . 81
6.25 INFATI-500: Effects of Truncation on Query Times . . . . . . . . . . . . . . . 81
A.1 3-truncated suffix tree for the word “mississippi$” . . . . . . . . . . . . . . 93
List of Algorithms
1 Searching Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Processing Next Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Algorithm for Diversifying Recommendations . . . . . . . . . . . . . . . . . . . 40
4 Searching Suffix Trees For Fuzzy Matching . . . . . . . . . . . . . . . . . . . . 46
5 Searching Suffix Trees for Order-Flexible Matching . . . . . . . . . . . . . . . . 51
6 Pseudo-code for Dataset Processing . . . . . . . . . . . . . . . . . . . . . . . . 57
7 Ukkonen’s Algorithm (High Level) . . . . . . . . . . . . . . . . . . . . . . . . . 90
8 Modified Ukkonen’s Algorithm for k-truncated suffix trees . . . . . . . . . . 94
Chapter 1
Introduction
Portable GPS devices, cell phones, and other location-aware mobile devices have become
ubiquitous in recent years. These devices are capable of gathering vast quantities of data
regarding a user’s movements. Each user’s movements constitute a trajectory: a sequence of
points, each with a precise time-stamp and location. Although some may view the gathering
and use of this data as an invasion of personal privacy, the availability of this data opens
new avenues for improving the quality of point of interest recommendation systems.
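To make the notion concrete, a trajectory can be sketched as an ordered, time-stamped sequence of points. The representation below is a minimal illustrative sketch only; the names `TrajectoryPoint` and `duration` are ours, and the formal definitions used by the thesis are given in Chapter 3:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryPoint:
    timestamp: float  # seconds since some epoch
    lat: float        # latitude in decimal degrees
    lon: float        # longitude in decimal degrees

# A trajectory is simply an ordered, time-stamped sequence of points.
Trajectory = List[TrajectoryPoint]

def duration(traj: Trajectory) -> float:
    """Elapsed time covered by a trajectory, in seconds."""
    if len(traj) < 2:
        return 0.0
    return traj[-1].timestamp - traj[0].timestamp
```

For example, a two-point trajectory recorded ten minutes apart has a duration of 600 seconds.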
Current mobile point of interest (POI) recommendation systems take into account the
present location of an individual, along with other attributes of the individual, such as age,
sex, and occupation. However, they are unable to incorporate the recent movements of
an individual and knowledge about historical trajectories into the recommendation process.
This thesis addresses the problem of incorporating a user’s current trajectory, as well as
a database of historical trajectories, into the point of interest recommendation process in
order to improve the quality of the returned recommendations.
Imagine yourself visiting a new city, either as a tourist or for business purposes, and
pulling out your cell phone to enable a point of interest recommendation system. The
research presented in this thesis would allow you to query the recommendation system with
your recently travelled trajectory and be presented with an interesting museum to visit, a
restaurant to eat at, and a store to shop at. For example, the system may determine that
after you visited Science World, Stanley Park, and the Planetarium, the place that you
are most likely to want to visit next is the Van Dusen Gardens. Given this knowledge, a
trajectory-based recommendation system could recommend that you visit the gardens, and
presumably pay for itself by charging the gardens a nominal fee to display an advertisement.
Closely related to recommending points of interest to visitors to a new city is recommending
places to visit in a vast museum to people short on time. For example, a trajectory-based
recommendation system could be used in an
art museum, such as the Prado, where the system could recommend that individuals who
had just spent time viewing Zurbaran’s Agnus Dei and El Greco’s Annunciation may want
to view El Greco’s The Knight with His Hand on His Breast next.
Another application of the research in this thesis is to people’s morning commutes.
We can surmise that there will often be a small set of points of interest, including places
such as coffee shops, newsstands, cafes, and convenience stores, that are frequented during
the morning commute by individuals following certain trajectories. Using a trajectory-
based POI recommendation system, we could combine information about a user’s historical
trajectories along with knowledge of other users’ historical trajectories to recommend points
of interest to an individual during their commutes. By using historical trajectory information
from other users, it would be possible for the recommendation system to recommend points
even when the user takes a novel route to work. This information about which points a user
is most likely to want to visit can be used in two ways. The first is to recommend the
point of interest that the user is most likely to want to visit. The second is to sell
advertising to competitors of the top recommended points of interest in the hope of
shifting the user’s preferences. It is possible that a
viable business could be built on the model of giving away location-aware mobile devices to
commuters for free and having them pay for themselves through advertising revenue.
Beyond the realm of recommending points of interest and mobile ad delivery, there
are other potential applications of a trajectory-based POI recommendation system. For
example, it is plausible that it could be used to predict the movements of tracked animals.
Nonetheless, the commuting and tourism applications are the principal motivation for the
research in this thesis, and the methods contained in this thesis have been developed with
these applications in mind.
The problem of trajectory-based POI recommendation is challenging for two primary
reasons. The first principal challenge is that trajectory-based POI recommendation is a
new problem and there are no previously published requirements for how a trajectory-based
POI recommendation system should behave. Previous research into mobile recommendation
systems does not take into account trajectory information. Furthermore, existing systems
tend to be capable of recommending only specific points of interest, and not generalizations
of points of interest. Both of these limitations are addressed by this thesis.
The second principal challenge is that recommendation queries need to be answered
in real time. It is easy to devise methods that do not remain efficient as the number of
previously observed trajectories grows. However, as the goal is to build a system capable
of executing recommendation queries in mere seconds on a mobile device, we need to make
sure that queries can be answered efficiently even when the results are based on a large
historical database.
1.1 Main Contributions
The main contributions of this thesis are:
• The introduction, motivation, and formalization of the trajectory-based POI recom-
mendation problem. Previous research into mobile recommendation systems does not
incorporate a user’s recent trajectory history into the recommendation process.
• The development of a set of desired properties for a useful trajectory-based POI rec-
ommendation system.
• A practical solution to the trajectory-based POI recommendation problem, built upon
the (k-truncated) generalized suffix tree data structure. This system is capable of
answering fuzzy-matching and order-flexible queries in addition to more basic exact-
matching queries. The framework developed is highly configurable, and can be con-
figured to show a wide variety of behaviours.
• An effective approach for recommending generalizations of points of interest in addition
to specific points of interest. Previous research into mobile recommendation systems
always recommends specific points of interest and is not capable of recommending
generalizations.
• An efficient method for ensuring that the recommendations returned for a given query
are diverse. Diversifying the result set is demonstrated to improve the quality of a
query’s recommendations.
• Experimental evidence that trajectory-based POI recommendation can be performed
efficiently on large datasets and generates higher quality recommendations than exist-
ing recommendation systems based only on a user’s current location.
1.2 Outline
• Chapter 1 (Introduction): Motivates the contents of this thesis, describes the main
contributions, and presents this outline.
• Chapter 2 (Related Work): Overview of past research related to the research pursued
in this thesis, and descriptions of how past research differs from work presented in this
thesis.
• Chapter 3 (Problem Description): Presents technical definitions of all terms used in
this thesis, formal definition of the trajectory-based POI recommendation problem, an
overview of requirements for a useful trajectory-based POI recommendation system,
and a description of the specific methods constructed to satisfy these requirements.
• Chapter 4 (Exact Matching): Description of the exact matching problem, naive confi-
dence measure for trajectory-based POI recommendation, two methods for accounting
for POI similarity, a method for recommending generalizations of POIs, a method for
recommending spatio-localized generalizations, a greedy algorithm for diversifying the
set of recommendations returned, overview of the (k-truncated) generalized suffix tree
data structure, and algorithms for executing recommendation queries.
• Chapter 5 (Variants): Motivation for the fuzzy matching and order-flexible matching
variants, formal definition of these variants, description of an efficient algorithm to
execute fuzzy matching queries, description of two options for defining the order-
flexible matching problem, and presentation of an efficient algorithm to execute order-
flexible matching queries.
• Chapter 6 (Experimental Results): Descriptions and visualizations of the datasets
used for experimentation, algorithms for processing datasets, and experimental results
demonstrating the effectiveness of the methods presented in this thesis.
• Chapter 7 (Conclusion): Proposes future research directions, and summarizes the rest
of the thesis.
• Appendix A (Constructing Suffix Trees): Detailed descriptions of the generalized suffix
tree and k-truncated generalized suffix tree data structures, efficient algorithms for
constructing suffix trees, and examples of suffix tree construction.
Chapter 2
Related Work
The existing research related to this thesis can be grouped into two broad categories: mobile
recommendation systems, and trajectory mining. Due to the popularity of collaborative
filtering in recommendation systems, we briefly discuss the concepts and major ideas, though
collaborative filtering is not used in this thesis.
2.1 Mobile Recommendation Systems
One of the first systems that could arguably fall under the term “mobile recommendation
system” was the “Personalized Shopping Assistant (PSA)” device proposed by Asthana et al.
in 1994 [1]. The PSA was a Walkman-sized wireless device that communicated with a
server over a radio-frequency (RF) link to transceivers placed around a store. Through a
simple user interface, it was able to locate items, engage the customer by telling jokes, and,
crucially for our purposes, direct a customer’s attention to new items or to those
“of particular interest to a particular customer.” For example, knowing that a customer
had recently purchased a VCR, it was able to recommend that the customer buy a video.
Furthermore, although this feature does not appear to have been implemented, the authors
proposed that the PSA would be location-aware, able to recommend only those items near
the customer within the expanse of a vast supermarket. Although extremely primitive, the
PSA implemented the basic functionality found in mobile recommendation systems to the
present day.
Moving forward a decade, mobile technology had developed greatly, to the point where
cellular phones and global positioning system (GPS) devices were becoming nearly ubiquitous.
By the early 2000s, a cheap cell phone could perform every function the PSA could, without
the constraint of being tied to a specific store. With this
additional power available, it became possible to add context to recommendations, and with
the popularity of mobile devices it became worthwhile to aggressively research methods for
delivering meaningful recommendations to mobile devices.
In 2004, van Setten et al. [29] developed COMPASS and proposed combining context-
awareness with recommendation systems, such as those discussed in [21]. According to van
Setten et al., context “is any information that can be used to characterize the situation of an
entity,” where an entity is simply any object, place, or individual relevant to the functioning
of the application. Thus, context could include time, day of week, age of a user, physical
location, or car model being driven. Although both context-awareness and recommendation
systems are “used to provide users with relevant information and/or services”, they are
distinguished by the former being based on a user’s context, and the latter being based on
a user’s interests. The goal of [29] is to provide a system unifying these two concepts.
Although COMPASS is a large system composed of many parts, including a user profiler
and a recommendation engine, it is fundamentally a mobile application that proposes point
of interest recommendations based on a user’s present location, the present time, and other
information about the individual, such as the acceptable price range for a dinner. [29] does
not address the particular recommendation process used, but the authors did perform a user
study of 57 individuals, suggesting that users find context-aware recommendations
useful. No trajectory information is used in their recommendation
system, nor is any historical knowledge about the user taken into account.
One aspect of some mobile recommender systems is the idea of critique-based recommendation.
For example, Nguyen and Ricci [22] discussed how allowing a user to critique the
recommendations made, and incorporating these critiques, can improve future
recommendations. Although critique-based feedback is interesting and useful, the work
presented in this thesis does not incorporate such a mechanism.
In 2006, Horozov et al. proposed a system for personalized POI recommendation known
as “Geowhiz” [12], which, like COMPASS, considers a user’s context when making
recommendations, but goes further, explicitly describing techniques to incorporate
the user’s context into the recommendation process. At the core of Geowhiz is an enhanced
collaborative filtering method that works by taking into account a user’s location. It is an
item-based collaborative filtering method that works by first identifying points of interest
near a user’s present location (within a defined radius), and performing collaborative
filtering only on that set of nearby POIs. As with COMPASS, the context considered is a static
snapshot of the user’s present state, and so the user’s location history is not considered. It
is worth noting that [12] includes a number of useful technical insights, such as how to use
“pseudo-users” to bootstrap a recommendation system, as well as how to use “serendipity”
to introduce a small amount of randomness into the recommendation system.
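The candidate-restriction step at the core of Geowhiz can be sketched as follows. This is our own illustrative reconstruction, not code from [12]; the haversine distance formula and the dictionary-based POI representation are assumptions made for the sake of the example:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_pois(pois, user_lat, user_lon, radius_km):
    """Restrict the candidate set to POIs within radius_km of the user's
    present location; collaborative filtering is then applied to this set only."""
    return [p for p in pois
            if haversine_km(user_lat, user_lon, p["lat"], p["lon"]) <= radius_km]
```

Restricting the candidate set before filtering is what makes the approach location-aware: distant POIs never enter the similarity computation at all.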
Finally, another modern, real-world mobile recommendation system is “CityVoyager”
[27]. Unlike the systems previously discussed, CityVoyager bases its recommendations on
its users’ location history. It does this by identifying its users’ frequent locations, and these
frequent locations are used as input to an item-based collaborative filtering system (see
section 2.2 for a description of item-based collaborative filtering). No user attributes (such
as age or gender) are considered. Although Takeuchi and Sugimoto gauged the quality of
recommendations on a tiny sample of only two users, their results [27] indicate that their
system may be useful. However, once again, the system considers only the present location
of a user, not where he or she is coming from or has been.
2.2 Collaborative Filtering
Collaborative filtering is a technique developed in the 1990s and is found at the root of many
recommendation systems. The two principal categories of collaborative filtering methods
are model-based methods, and memory-based methods [12]. Memory-based (or user-based)
methods, such as the RINGO system [25] work by dynamically computing the relationships
between users each time a query is presented to the system. Historical data for the most
similar users is then used to make a recommendation. Model-based methods, such as item-
based techniques [23], are highly scalable, and work by computing the relationships between
items. They do not require the computation of the relationships between all users on each
query.
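As a concrete illustration of the item-based idea, item-item relationships can be precomputed offline from a ratings matrix, so that no per-query comparison of all users is needed. The sketch below is a minimal, hypothetical example using cosine similarity; it is not the method of [23] verbatim:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two rating dicts keyed by user id."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def item_item_similarities(ratings):
    """Precompute pairwise item similarities from {item: {user: rating}}.
    Doing this offline is what makes item-based methods scalable."""
    items = list(ratings)
    sims = {}
    for i in items:
        for j in items:
            if i < j:
                sims[(i, j)] = cosine_sim(ratings[i], ratings[j])
    return sims
```

At query time, a recommendation for a user reduces to looking up the precomputed similarities for the items that user has already rated, which is far cheaper than recomputing user-user relationships on every query.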
Collaborative filtering remains an active field of research, and although its methods are
not directly used by this thesis, as discussed in section 7.1.1 at the end of this thesis, it would
be an interesting problem to integrate collaborative filtering into the mobile trajectory-based
recommendation system developed in this thesis.
2.3 Trajectory Mining
The goal of this thesis is to devise a framework for a mobile trajectory-aware point of interest
recommendation system. In addition to being built upon research into recommendation
systems, the other field closely related to the content of this thesis is trajectory mining.
Trajectory mining is a very new field of research. There was a modicum of related
research performed in the 1990s, tackling problems such as vehicle classification [4] and
trajectory clustering using regression models [6]. However, these studies approach the
problem of trajectory mining from highly mathematical and statistical stances, respectively.
As a topic of data mining, trajectory mining has only been the subject of
intensive research in the past several years.
Like data mining, research into trajectory mining has tended to focus on the traditional
three pillars of clustering, classification, and pattern mining. For example, Lee, Han, and
Whang [18] introduced a trajectory clustering method known as TRACLUS that uses a
partition-and-group idea to cluster trajectories and generate representative trajectories for
these clusters. Another line of research has been to investigate incremental trajectory
clustering methods, such as those developed by Elnekave et al. [3]. Related to this research into
trajectory clustering methods is convoy discovery based on a method of trajectory simpli-
fication [15], where convoys are sets of trajectories that are density-connected during some
time interval.
Regarding trajectory classification, Lee et al. [17] presented the “TraClass” algorithm
to classify trajectories. The features for the classifier are discovered by performing a region-
based clustering of the trajectories, followed by a trajectory similarity-based clustering step.
Among the applications in mind for this direction of research is to classify whether a boat is
an oil-tanker, a tugboat, a fishing-boat, and so on, and another application is to classify an
animal given its historical trajectories. While interesting, this research is not particularly
relevant to the problem and methods in this thesis.
Giannotti et al. [7] addressed the problem of trajectory pattern mining by using a
“region-of-interest” approach to find trajectories moving between regions of interest. Their
approach to pattern detection is spatial: in a pre-processing phase, each trajectory is reduced
to a sequence of regions of interest. In their approach, temporal differences between visits
matter, but the exact time of a visit does not, and it is not possible for the order
of points in trajectories to be swapped. Similarly, spatial regions matter, but not specific
locations. The work in [7] is relevant to the research contained in this thesis, but the problem tackled is
different, and for our purposes suffers from the serious limitation of only considering regions
and not specific locations. Finally, Gidofalvi and Pedersen [9] mined long trajectories of
moving objects and showed how to identify trips using an SQL-based implementation.
Highly relevant to the problem addressed by this thesis is the research done by Zheng et
al. [31] on mining interesting locations and travel sequences from GPS trajectories. Using
ideas from the HITS (hypertext induced topic search) model developed by Jon Kleinberg
[16], Zheng et al. [31] used a HITS-based inference model to find locations and trajectories
that could be recommended. In particular, they treated users as hubs, and locations as
authorities, and this is used to compute the interest of each location. A very useful appli-
cation of this research would be to devise tour plans for cities, as the methods described
could determine popular tour routings from GPS trajectory data. However, this method
does not allow for queries to be executed of the form “given my historical trajectory Q,
where should I visit next?” which are the main focus of this thesis. In addition, it is worth
noting that their methods incorporate no collaborative filtering aspect and thus do not take
into account any knowledge about the users of the system.
Finally, the line of research perhaps most similar to this thesis is that of Frentzos et al.
[5] on nearest-neighbour searches on moving object databases. One possible approach to the
problem tackled in this thesis would be to find the k nearest neighbours to a query trajectory
and to use them to determine the optimal next points to recommend the querying user to
visit. This is conceptually similar to the fuzzy matching method proposed in chapter 5,
although Frentzos et al. discuss only how to find similar trajectories and do not address the
recommendation process. Whereas the methods in this thesis are generally tied only to the
order of points visited in historical trajectories and use time information only occasionally,
the methods in [5] are intimately tied to time, and work based on the distance between
trajectories over a definite period of time. As a result, their methods are able to compare
trajectories visiting a different number of points in a particular time interval. Lastly, the
methods contained in this thesis are sensitive to the particular point of interest / concept
visited at each trajectory point, whereas the methods of Frentzos et al. are based purely
on spatial and temporal information. Nonetheless, an interesting future research direction
would be to try and merge the ideas in this thesis with the methods used by Frentzos et al.
and to see if an effective system could be designed.
Chapter 3
Problem Description
This thesis addresses the problem of efficiently generating a set of recommended next points
to visit following a given trajectory. In addition to this query trajectory, we have a database
of historical trajectories along with information about the points of interest in the region.
In this chapter, after presenting some necessary definitions, we define the general problem
tackled by the thesis. Following this, we present and motivate a number of requirements
that we believe to be desirable for a point of interest recommendation system to possess,
and then with these requirements define the three specific problems tackled by this thesis.
3.1 Definitions
This section contains a listing of the definitions that will be used to motivate and describe
the problem of trajectory-based point of interest recommendation. Other definitions needed
only for implementing the methods presented later in this thesis will be presented when they
are needed.
Definition 3.1.1. A concept is a tuple c = (name, children, ...) consisting of a string
c.name that is a description of the concept, as well as a set of child concepts c.children that
are contained within c. A concept will generally be referred to by its name. For example,
we could have a concept “Coffee Shop” with child concepts “Starbucks” and “Second Cup”.
Lastly, there exists a function conceptDistance(c, d) that computes the distance between
any two concepts c, d. Let z denote the lowest common ancestor of c and d. If z does
not exist then conceptDistance(c, d) = ∞. Otherwise, conceptDistance(c, d) =
max(depth(c) − depth(z), depth(d) − depth(z)), where depth(x) denotes the depth of x in
the concept hierarchy.
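For illustration, this distance can be sketched directly from the definition. The following Python sketch is not part of the thesis; the class and function names are illustrative, and the hierarchy is represented with simple parent pointers:

```python
import math

class Concept:
    """A concept: a name plus a set of child concepts (Definition 3.1.1)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(c):
    # The path from c up to the root of its tree, inclusive.
    path = [c]
    while c.parent is not None:
        c = c.parent
        path.append(c)
    return path

def depth(c):
    # Number of edges between c and the root of its component.
    return len(ancestors(c)) - 1

def concept_distance(c, d):
    """The larger of the two hop counts down from the lowest common
    ancestor z of c and d, or infinity if no such z exists."""
    anc_d = set(ancestors(d))
    z = next((a for a in ancestors(c) if a in anc_d), None)
    if z is None:
        return math.inf
    return max(depth(c) - depth(z), depth(d) - depth(z))
```

For the “Coffee Shop” example, the distance between the sibling concepts “Starbucks” and “Second Cup” is 1, while the distance to any concept in a different tree of the forest is infinite.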
Definition 3.1.2. A concept hierarchy is a forest of concepts. It is possible, and normal,
for a concept hierarchy to have multiple roots, and the distance between any two concepts
not sharing a common root is defined to be infinity. Given two concepts a, b, if a = b or a is
an ancestor of b then we write that a ≥ b, and say that a is a super-concept of b. The depth
of a concept c, depth(c), is the number of edges between c and the root of the component of
the concept hierarchy containing c.
Definition 3.1.3. A point of interest or POI is a tuple poi = (lon, lat, concept) containing
at minimum, a longitude poi.lon, a latitude poi.lat, and a concept poi.concept. A point
of interest is generally any specific location that a trajectory can visit and that can be
recommended when answering a query.
Definition 3.1.4. A point is the fundamental object used in this thesis. A point p always
has an associated concept p.concept, and there exists a function pointDistance(x, y) that
computes the distance (possibly infinite) between any two points x, y, and so all points are
comparable. This distance measure combines the conceptual and spatial distance between
two points, and a description of how to construct such a measure is presented in section
4.4.3. Two points x, y are said to be similar if pointDistance(x, y) < ∞. Three types of
point are used in this thesis: trajectory points, concept points, and localized points.
Definition 3.1.5. A trajectory point is a tuple p = (poi, ts, ...) containing at minimum, a
point of interest p.poi, and a time-stamp p.ts, along with any other information deemed
relevant. Note that for convenience, we will sometimes refer to a point’s longitude p.lon,
and latitude p.lat, although this notation is merely shorthand for p.poi.lon, and p.poi.lat.
For convenience, we will often use p.concept to refer to the concept associated with p.poi,
p.poi.concept. Note that this refers to a concept, and not a concept point. Every trajectory
point corresponds to a particular point of interest, but this is not a great limitation as it
would be easy to add a notion of “non-recommendable” points of interest. We will see,
however, that for our methods to be efficient, it is desirable to have as few points of interest
as possible.
Definition 3.1.6. A generalized point is any point that can potentially represent other
points. If a generalized point gp contains another point p then we say that gp generalizes p
and can write gp ≥ p. In this thesis we use two types of generalized point: concept points,
and localized points. For convenience, we will also say that a generalized point gp contains
a point of interest poi if gp ≥ q for some trajectory point q with q.poi = poi.
Definition 3.1.7. A concept point is a tuple p = (concept, ...) containing a concept p.concept
along with any relevant information. A concept point p has no spatial location, and is said
to generalize any other point q if p.concept ≥ q.concept.
Definition 3.1.8. A localized point is a tuple p = (concept, region, ...) representing a
concept p.concept in a particular region p.region. A localized point p generalizes another
point q if q lies entirely within p.region and if p.concept ≥ q.concept. In this thesis, regions
associated with localized points are always square, although there is no limitation on the
shape of the region.
Definition 3.1.9. A trajectory is a sequence of trajectory points, t = p1 → p2 → ...→ pn,
where each pi = (poi, ts, ...) is a trajectory point and pi+1.ts ≥ pi.ts for 1 ≤ i < n, and
|t| = n is the length of t.
Definition 3.1.10. A query trajectory is any trajectory presented as input to the trajectory-
based recommendation system. Given a query trajectory q, the objective of this thesis is to
generate a set of recommended next points for the user presenting q to visit. Generally, a
query trajectory will be very short, with |q| ≤ 5 in most cases.
Definition 3.1.11. A trajectory fragment is any substring of a trajectory. That is, given
a trajectory t = p1 → ... → pn of length n, a fragment of t is any trajectory
f = p1+i → p2+i → ... → pm+i where i ≥ 0 and m + i ≤ n. Any trajectory fragment
f = q1 → ... → qm can be written in the form (b : n), where b = q1 → ... → qm−1 is the
body of f, and n = qm is the next point of f.
Definition 3.1.12. A historical trajectory database, denoted tDB is a bag of trajectories
that have been traversed by users of the system some time in the past. The recommendations
for each query will be constructed based on the information in this database.
Definition 3.1.13. Two trajectory fragments f = f1 → ... → fn, and g = g1 → ... → gm
match exactly or have an exact match if |f| = |g| = n, and if fi.poi = gi.poi for all 1 ≤ i ≤ n.
Note that the time-stamps of the trajectory points in f and g are ignored. The longitudes
and latitudes of corresponding trajectory points in f and g will always match if f and g
match exactly because we require that their associated POIs be equal.
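This check is a direct translation of the definition. The sketch below uses hypothetical minimal record types (the thesis does not fix a concrete representation):

```python
from collections import namedtuple

# Minimal stand-ins for the tuples of section 3.1 (illustrative only).
POI = namedtuple("POI", ["lon", "lat", "concept"])
TrajectoryPoint = namedtuple("TrajectoryPoint", ["poi", "ts"])

def exact_match(f, g):
    """Definition 3.1.13: equal lengths and pointwise-equal POIs;
    the time-stamps of the trajectory points are deliberately ignored."""
    return len(f) == len(g) and all(p.poi == q.poi for p, q in zip(f, g))
```

Because corresponding POIs must be equal, two exactly matching fragments automatically agree on longitude and latitude as well.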
Definition 3.1.14. Two trajectory fragments f, g match fuzzily with order k or have a fuzzy
match of order k if |f| = |g| = n, and if fuzzyError = ∑_{i=1}^{n} pointDistance(fi, gi) < k.
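Given some pointDistance function (one construction is described in section 4.4.3), the fuzzy-match check follows directly. In the sketch below the distance function is a supplied parameter, so the example uses a toy distance for illustration:

```python
def fuzzy_match(f, g, k, point_distance):
    """Definition 3.1.14: fragments of equal length whose summed
    pointwise distances (the fuzzy error) stay strictly below k."""
    if len(f) != len(g):
        return False
    fuzzy_error = sum(point_distance(p, q) for p, q in zip(f, g))
    return fuzzy_error < k
```

An exact match is the special case where every pointwise distance is zero, so an exact match is a fuzzy match of any positive order.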
Definition 3.1.15. A recommendation is a point that has been output by a recommendation
system given a query q. Each recommendation r has an associated confidence, where
0 ≤ confidence(r) ≤ 1. The confidence of a recommendation is the estimated probability of
the user visiting r after traversing the query trajectory q.
3.2 Problem Definition
The primary goals of this thesis are to provide a realistic model for framing the problem
of trajectory-based point of interest recommendation and to then describe an efficient and
scalable method for answering recommendation queries. In this section I define the problem
tackled by this thesis at a high level. Later in this chapter the three specific variations of
this problem solved by this thesis will be presented.
To begin, we assume that the following information is available:
• A historical trajectory database tDB
• A database P of points of interest (POIs)
• A concept hierarchy C
• A query trajectory q
Like a traditional search engine, the problem of trajectory-based POI recommendation
is a query-answering problem. However, unlike a traditional search engine where a query
consists of a series of words, here a query is a trajectory that a user of the system has just
traversed. The goal is then to return the top-k points that the user is most likely to desire
to visit next.
Definition 3.2.1. Trajectory-Based POI Recommendation Problem: Given a database tDB
of historical trajectories, a database P of points of interest (POIs), a concept hierarchy
C, and a query trajectory q, find the top-k points most likely to follow q. These top-k
recommendations are known as the recommendations for q.
This is a very general definition of the problem, and one interesting ambiguity in its
statement is that it does not state which trajectories are to be contained in the historical
trajectory database. By varying the contents of the historical trajectory database, we can in
fact construct multiple models of the problem. For example, we could perform recommenda-
tion based on personal history, recommendation based on user group, and recommendation
based on all historical trajectories. In this thesis we will generally be thinking of the last
of these, but all can be done simply by working with a subset of the historical trajectory
database. In section 7.1.1 we will discuss an approach for combining the POI recommenda-
tions generated by these different models.
One important insight that will be useful later in this thesis is that the query trajectory
q will generally be very short. People rarely stop at more than a few points of interest
on a given trip. The length of the query trajectory will determine how easy it is for the
query trajectory to match a trajectory fragment in the historical trajectory database and
thus, how many recommendations will be available. There is a trade-off involved in choosing
the length of query trajectory to use: longer query trajectories may increase the precision
of results, but they may also lead to a lack of diversity in the results, as well as overfitting.
Using long query trajectories rather than shorter ones could result in a system much less
likely to return useful results for rarely traversed trajectories. We will
use some experiments to illuminate this trade-off in chapter 6.
3.3 Recommendation System Requirements
Later in this chapter we will describe the three particular instances of the trajectory-based
POI recommendation problem that are tackled by this thesis. However, before doing so,
we want to first motivate and describe some properties that I believe to be desirable for a
useful trajectory-based recommendation system. The three instances of the recommendation
problem solved by this thesis each incorporate more of these requirements than the previous.
These requirements are:
1. (Quantifiability of Confidence) The confidence of each recommendation must be quan-
tifiable and should range between 0 and 1.
2. (On-line Recommendation Capability) Recommendation queries must execute in real-
time. However, there is no limitation on the amount of pre-processing time.
3. (Scalability) The trajectory-based recommendation system must be scalable to be able
to handle an arbitrarily large historical trajectory database, as well as any number of
simultaneous requests.
4. (Generalization) Highly similar possible recommendations should mutually boost each
others’ confidence.
5. (Diversity) The k recommended points should be diverse.
6. (Fuzziness) The next points of trajectory fragments similar to, but not exactly match-
ing, the query trajectory should factor into the recommendation process.
7. (Order-Flexibility) The order of trajectory points visited very close in time in a tra-
jectory should be ignored.
8. (Personalization) Trajectories in the database belonging to users very similar to the
querying user are more useful for making a recommendation than those of other users.
3.3.1 Quantifiability of Confidence
One highly desirable requirement for any trajectory-based recommendation system is for a
statistically grounded confidence to be assigned to each recommendation. Aside from the
obvious use of ranking recommendations, advertisers may only want to pay for advertising
to a user of the recommendation system if the probability of the user wanting to visit the
advertiser’s establishment is greater than a certain threshold.
3.3.2 On-line Recommendation Capability
To be useful in the real world, a trajectory-based recommendation system must be able
to execute recommendation queries in real time. As the envisioned use of the system is
for people on the move, it is important that queries be satisfied quickly enough that it is
possible for a user to act based on the returned set of recommendations. On the other
hand, like a normal search engine, we can allow for large amounts of pre-processing time
and computational resources. It is desirable to minimize the resources required for pre-
processing the historical trajectory data to be able to satisfy queries, but this is of much
less importance than ensuring that queries can be executed extremely quickly. Even if
queries were to be executed locally on a mobile device rather than on a backend server, any
amount of work could still be performed prior to loading a processed dataset onto the
mobile device.
3.3.3 Scalability
If a trajectory-based recommendation system were to be put into use, the size of the histori-
cal trajectory database could be expected to grow rapidly as the system became increasingly
popular, and it is important that any system be able to scale. There are two dimensions of
scalability to handle. The first is the number of incoming requests, but this is essentially
solved by scaling the hardware used to process requests and will not be mentioned further
in this thesis. The second is the size of the historical trajectory database. It is important
that the time required to answer a recommendation query grows sub-linearly with respect
to the size of the historical database.
If we assume a fixed maximum query length and ignore the fuzziness requirement, it
would be possible for queries to be answered in expected constant time by pre-computing
the results of all possible queries using a hash table to store and retrieve their results.
However, the memory requirements of this approach are prohibitive, and furthermore, it
cannot handle the fuzziness requirement, because incorporating that requirement means the
recommendation system must be able to execute queries with novel query trajectories that
have never been previously observed.
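The dismissed scheme can be made concrete. The sketch below (illustrative only; the names are not from the thesis) pre-computes the top-k answer for every fragment body observed in the database, which shows both why queries become constant-time lookups and why the scheme fails the fuzziness requirement: a novel query body simply misses the table.

```python
from collections import Counter, defaultdict

def precompute_all_queries(tdb, max_len, top_k):
    """Map every observed fragment body (length <= max_len) to the
    top_k next points that followed it, counted over the whole database."""
    table = defaultdict(Counter)
    for t in tdb:                      # t is a sequence of POI ids
        for l in range(1, max_len + 1):
            for i in range(len(t) - l):
                table[tuple(t[i:i + l])][t[i + l]] += 1
    return {body: [p for p, _ in counts.most_common(top_k)]
            for body, counts in table.items()}
```

The number of table entries grows with the number of distinct bodies in tDB, which is exactly the prohibitive memory cost noted above.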
3.3.4 Generalization
Consider a situation in which at a certain street corner there are three coffee shops and
a bank. Further suppose that in all of history, each coffee shop has been visited by 8
individuals after those individuals traversed some trajectory, and that 10 individuals have
visited the bank after traversing the same trajectory. Although if we only look at raw
probabilities we should recommend visiting the bank to a new user who has just traversed
this trajectory, intuitively it seems that we would be better off recommending a coffee shop.
The generalization requirement encapsulates the idea that the presence of a number of
highly similar points of interest in some neighbourhood should bolster our confidence in
recommending each of these points of interest.
3.3.5 Diversity
Building upon the same situation used to motivate the generalization requirement, suppose
that a user wants to see the top 3 recommended points to visit, given her recent trajectory
history. It is possible, working purely from a mathematical standpoint, that the top 3
recommended points could be a Starbucks on one corner, a Second Cup on another corner,
and a Blenz on a third corner of the intersection. The problem with recommending these
three points is that they are too similar, and this decreases the usefulness of the recom-
mendations to the user of the system, and may even discourage advertising as a potential
advertiser may not want to have his ad get lost in a flurry of highly similar ads. The diver-
sity requirement is that the top-k recommendations returned to answer a query should be
diverse when possible.
3.3.6 Fuzziness
Returning to the example where we had three coffee shops on the corners of
an intersection, suppose that very few people have historically visited one of them, perhaps
because it is a new coffee shop. Further suppose that when a user goes to visit the coffee
shop and uses her mobile recommendation system, nobody has ever visited the coffee shop
after following her historical trajectory. If we were to base our system’s recommendations
only on those historical trajectory fragments exactly matching our user’s last few locations
visited, we would not be able to recommend any points of interest for her to visit next.
The solution is for a POI recommendation system to base its recommendations also
on the historical trajectory fragments that are “close to” or “fuzzy matches of” our user’s
last few locations visited. For example, suppose that all users have previously travelled
the trajectory a → b → c, but the current user queries the recommendation system with
the trajectory fragment a → b′, where b and b′ are similar. The requirement of fuzziness
expresses the notion that it should be possible to recommend c given this query, albeit with
diminished confidence due to the fact that the query trajectory does not exactly match the
trajectory in the historical database. Furthermore, the requirement expresses the notion
that even if c were the next point of a historical trajectory fragment exactly matching the
query trajectory, the fact that c is the next point of other historical trajectory fragments
that fuzzy match the query trajectory should bolster our confidence in recommending c.
3.3.7 Order-Flexibility
Imagine a set of commuters who take the subway to work, half of whom visit a coffee shop
followed by a newsstand after they disembark, while the other half visit the newsstand
followed by the coffee shop. Each of these two visits may occur within a minute or two
of each other, and it is this situation we have in mind when thinking of the requirement
of order-flexibility. The order-flexibility requirement expresses the idea that the order of
events visited very close in time should not matter significantly when answering queries, as
the order of these events may carry very little information.
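One hypothetical way to realize this requirement (the thesis develops its own mechanism later; this sketch is purely illustrative, including the window size) is to canonicalize trajectories before matching, sorting any run of points whose time-stamps fall within a small window:

```python
from collections import namedtuple

TrajectoryPoint = namedtuple("TrajectoryPoint", ["poi", "ts"])

def canonicalize(points, window=120):
    """Reorder points so that visits within `window` seconds of the
    previous point are sorted by POI; trajectories differing only in
    the order of near-simultaneous visits then compare equal."""
    out, group = [], []
    for p in sorted(points, key=lambda p: p.ts):
        if group and p.ts - group[-1].ts > window:
            out.extend(sorted(group, key=lambda g: g.poi))
            group = []
        group.append(p)
    out.extend(sorted(group, key=lambda g: g.poi))
    return out
```

Under this sketch, the commuters who visit the coffee shop and the newsstand in either order within a couple of minutes produce the same canonical trajectory.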
3.3.8 Personalization
The remaining piece of information that is likely to be available in a real-world scenario
is knowledge about the users of the system. For example, we may know the gender, oc-
cupation, age, and any number of other facts about each user. This information could be
used to improve the quality of recommendations by allowing us to integrate some form of
collaborative filtering into the recommendation process.
Using the methods in this thesis it would be possible to perform recommendations based
on user groups. For example, given the knowledge that a particular user is a banker, it would
be possible to execute queries for this user based only on the historical trajectories of other
bankers. Similarly, it would be possible to base recommendations for a given user based
only on the historical trajectories of this user. The recommendations based on personal
history, user group, and the entire historical trajectory database could be combined using
a mixture model. Of all the requirements expressed in this section, the requirement of
personalization is the only one not explicitly addressed by this thesis. More ideas on how to
build personalization on top of the methods contained in this thesis can be found in Section
7.1.1.
3.4 Satisfying the Requirements
In this thesis, we address all of the above requirements except for that of personalization.
In the following two chapters we proceed in stages, building up a sequence of solutions
to the trajectory-based point of interest recommendation problem. Each satisfies more re-
quirements than the previous, and thus each solves a particular instance of the general
trajectory-based POI recommendation problem introduced above. The three primary in-
stances of the trajectory-based POI recommendation problem tackled are:
1. Exact matching (Quantifiability of Confidence, On-Line Recommendation Capability,
Scalability, Generalization, Diversity)
2. Fuzzy matching (+ Fuzziness)
3. Order-flexible matching (+ Order-Flexibility)
The exact matching problem is described in chapter 4, and its solution captures the
main technical contributions of this thesis. The fuzzy matching and order-flexible matching
problems are solved in chapter 5, and their solutions build naturally upon the foundation
laid in chapter 4 by the solution to the exact matching problem.
Chapter 4
Exact Matching
This chapter describes how we can achieve all of the requirements of the previous chapter,
except for the requirements of fuzziness and order-flexibility, using the technique of exact
trajectory matching. The exact matching methods contained in this chapter will be extended
in the next chapter to incorporate fuzzy and order-flexible matching as well. The methods
described in this chapter constitute the core contribution of this thesis.
Recall from definition 3.1.13 that an exact match between two trajectory fragments q, s
means that all corresponding points in q and s visit the same point of interest (the time-
stamps of points are ignored). Exact matching then means that given a query trajectory
q, we shall generate the top-k recommended next points of interest (POIs) for q, considering
only the trajectory fragments in the historical trajectory database that exactly match q.
Although the exact matching methods are simple to understand and easy to formulate,
they are still sufficiently complex to motivate the description and use of the principal data
structures and algorithms that will be used to later incorporate fuzziness and order-flexible
queries.
In order for a method to be useful for mobile point of interest recommendation, recall
that recommendation queries must be executed in real-time, but that we are allowed an
arbitrary amount of time to pre-process the historical trajectory database. Thus, this chap-
ter is split into two parts. The first part is a step-wise construction of how to achieve the
requirements of quantifiability, generalization, and diversity, given the set of next points for
the query trajectory q. The second part describes how to pre-process the historical database
efficiently in order to permit efficient query execution, thus meeting the on-line recommen-
dation and scalability requirements, and also describes how to query this pre-processed data
efficiently. This second part primarily relies on the k-truncated generalized suffix tree data
structure, and a brief description of the data structure is contained here, while a more
detailed description, including methods for construction, is contained in appendix A.
4.1 Exact Matching
Let tDB be the trajectory database, consisting of a bag of trajectories, and let q be a query
trajectory, where l = |q| is the length of q. Let H be the set of all trajectory fragments of
length l + 1 in tDB, so that each fragment h = (b : n) ∈ H consists of two parts: a body b
of length l, followed by a next point n. Then, let M = {h = (b : n) ∈ H | exactMatch(q, b)}
be the set of all trajectory fragments in H with a body that exactly matches q. M will be
known as the set of exact matches. With these definitions we can now precisely define the
exact matching problem.
Definition 4.1.1. Exact Matching Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in M . These top-k
next points are known as the recommendations for q.
In order to devise a solution to the exact matching problem satisfying all of our require-
ments, we must come up with a good measure of confidence. This will be done in a few steps.
First, we present a naïve method satisfying only the quantifiability requirement, and then
from this starting point we will show how to handle the generalization requirement. The
diversity requirement will be handled later as a post-processing step that can be executed on
the output from our other methods.
4.1.1 Naïve Approach
As a first step towards devising a good definition of confidence, it is natural to begin with
the raw observed probabilities of each possible recommendation. To begin, given a query
trajectory q, and the set of exact matches M , let N denote the set of all next points of
the trajectory fragments in M . These are the next points of our query trajectory q, and
to compute their confidences we need to define a function, support(x,N) to compute the
number of occurrences of a next point x in N (again, ignoring the point’s time-stamp):
support(x, N) = |{h = (b : n) ∈ M | n = x}| (4.1)
Now we can compute the naïve confidence of recommending each possible next point x:

confidence(x) = support(x, N) / |N| if x ∈ N, and confidence(x) = 0 otherwise. (4.2)
This confidence measure clearly satisfies the requirement of quantifiability, but it does
not satisfy the generalization requirement. To see this, consider the following example. Let
q = a→ b→ c, so that l = 3, and let the trajectory database tDB be:
Body Next Point Support
abc Starbucks-1 2
abc Starbucks-2 2
abc Starbucks-3 2
abc Second Cup 3
Figure 4.1: Graphical depiction of example (a start point followed by the four possible next
points: Starbucks 1, Starbucks 2, Starbucks 3, and Second Cup)
Suppose that all four points are equally distant from each other in space, but that the
conceptual distance between the three Starbucks locations is smaller than that between the
Starbucks locations and the Second Cup. Then the point distance between each pair of
Starbucks locations is smaller than the distance between any of the Starbucks locations and
the Second Cup. Using the confidence measure defined above, the top recommendation
would be “Second Cup”, but we can see that two-thirds of all trips traversing the trajectory
abc led to an individual visiting one of the Starbucks locations. In accordance with the gen-
eralization requirement described in the previous chapter, we want to be able to recommend
a Starbucks location (or “Starbucks” the concept) above the Second Cup location.
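Equations 4.1 and 4.2 applied to this example can be sketched as follows (a hypothetical implementation over the bag of next-point labels; the names are illustrative):

```python
from collections import Counter

def naive_confidences(next_points):
    """support(x, N) / |N| for each next point x observed in the bag N."""
    support = Counter(next_points)      # support(x, N), equation 4.1
    total = len(next_points)            # |N|
    return {x: s / total for x, s in support.items()}

# The example database: the trajectory abc was followed 9 times in total.
N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2
     + ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
conf = naive_confidences(N)
```

Here conf["Second Cup"] = 3/9 outranks each individual Starbucks at 2/9, even though two-thirds of all trips ended at some Starbucks location.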
4.2 Accounting for POI Similarity
To remedy this problem, we need to find a method to account for point of interest (POI)
similarity, and there are two distinct means of accomplishing this goal. The first approach is
to somehow increase the confidence of recommending each of the individual, highly similar
points of interest on account of there being other highly similar POIs nearby. The second
approach is more interesting, and it is to recommend a generalization of the highly similar
points that would encompass all of them. I describe both methods below, and it will be
argued that the latter approach is superior.
4.2.1 Increasing the Confidence of Each Similar POI
The first means of accounting for POI similarity is to increase the confidence of each point
of interest if there are similar points of interest nearby. Suppose that we were to define
a function of two points similarity(x, y) to compute their similarity, where the function
returns a number between 0 and 1. Using this, we could then create a new confidence
measure for a next point n. For example, we could define the confidence of a recommendation
x to be:
confidence(x) = (1 / |N|) ∑_{y ∈ N} support(y, N) × similarity(x, y)        (4.3)
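A sketch of this measure in Python follows; the similarity function and its values here are purely hypothetical placeholders, chosen only to make the Starbucks branches reinforce one another:

```python
from collections import Counter

def weighted_confidence(next_points, x, similarity):
    """Similarity-weighted confidence (Eq. 4.3): each distinct next point
    contributes its support, discounted by its similarity to x."""
    support = Counter(next_points)
    return sum(s * similarity(x, y) for y, s in support.items()) / len(next_points)

def toy_similarity(a, b):
    """Hypothetical similarity: identical POI score 1.0, distinct
    Starbucks branches 0.9, and all other pairs 0.0."""
    if a == b:
        return 1.0
    if a.startswith("Starbucks") and b.startswith("Starbucks"):
        return 0.9
    return 0.0

N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2 +
     ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
# Each Starbucks branch now absorbs support from its siblings and
# overtakes Second Cup:
print(weighted_confidence(N, "Starbucks-1", toy_similarity))  # (2 + 1.8 + 1.8)/9
print(weighted_confidence(N, "Second Cup", toy_similarity))   # 3/9
```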
This measure clearly satisfies the quantifiability requirement: since the maximum possible
sum is |N|, the measure always returns a confidence between 0 and 1. Furthermore, it
appears to satisfy the generalization requirement as well. However, there are a
number of problems with this approach that will lead us to favour the approach presented
in section 4.2.3.
The first problem is that there is no theoretical foundation for computing the confidence
of x in this manner. Yet even disregarding this, there is a conceptual problem with altering
the confidences of specific points of interest. If we alter the confidences of individual points of
interest by incorporating the supports of similar points of interest it becomes very difficult to
interpret the results returned by the recommendation system. This is because it is no longer
possible to infer from the confidence of a point of interest recommendation whether the point
is even well visited. Given a confidence score computed using the measure stated above, it
is unclear what can actually be inferred: the score gives no indication of whether a point is
being recommended only due to its proximity to popular points, rather than because anybody
has ever visited it. In an extreme case, this confidence measure could lead us to recommend
a point with only a single historical visit, even though all of its surrounding points had been
visited hundreds of times. This issue
could be somewhat alleviated by more heavily weighting the contribution of the support
of x in N , perhaps by squaring the result of the similarity function, but we shall see that
there is a better approach that will naturally avoid these problems. Rather than alter the
confidences of individual POI, the new approach will be to recommend a generalization of
points.
4.2.2 Density Estimation
The probability distribution function (PDF) is one of the fundamental concepts in statistics,
as it is both a description of the distribution of a random variable X and a means
of computing the probabilities associated with X. That is, given a probability distribution
function f for the random variable X, it is possible to compute the probability of observing
any value associated with X: in the discrete case by the simple equation Pr(X = a) = f(a),
and for continuous variables by Pr(a < X < b) = ∫_a^b f(x) dx.
For our purposes of computing the confidences of recommending points of interest, if
we knew the probability distribution function fq for the random variable representing all
possible next points following a query trajectory q, then the exact matching problem being
tackled in this chapter would be trivial. The algorithm would simply be to compute fq(x)
for all possible next points x, and to choose those x with the top-k results. Unfortunately,
the PDF is not given to us for every possible query trajectory, and so this simple idea will
not work. However, it is possible to build an estimate of the PDF from observed data; this
procedure is known as density estimation. An excellent resource on
density estimation is [26].
As described by Silverman [26], there are many means of computing density estimates,
which can be grouped into two broad categories: parametric and non-parametric. Parametric
density estimation techniques assume a particular form for the underlying probability
distribution, while non-parametric techniques make no assumptions about the distribution
of the observed data. The most common non-parametric density estimation techniques
include histogram estimation, kernel estimation, and nearest neighbour estimation.
Histogram estimation requires a random variable that represents values that can be
mapped onto the real numbers, and so does not easily apply to the problem of predicting
points of interest. Nearest neighbour estimation, on the other hand, also does not easily apply
to our situation, because many points of interest may be very similar to each other. If we
were to claim that the probability of observing a novel point is the probability of its nearest
neighbour, we could not recommend generalized points that contain many highly related
points, because the predicted probability of observing such a generalized point would be far
too small.
Kernel Estimation
For this thesis we choose kernel estimation, due to its effectiveness in handling unknown
data distributions. Kernel estimation is related to the
process of sampling, in that the predicted probability of observing a given point is based on
the distribution of sample points, where all sample points are equally weighted. However,
kernel estimation is based on the idea that observing a point increases the probability of
observing other points nearby, and consequently, distributes the contribution weight of other
points according to a kernel function, K. Furthermore, the kernel width h (also known as
the smoothing parameter) is introduced to control the effect of the kernel function in the
neighbourhood of each point.
According to Silverman [26], the accuracy of kernel estimation depends much more on
the chosen kernel width h than on the particular kernel function. Considering this, and
due to its broad applicability and common use, we have chosen the Gaussian kernel as our
kernel function.
Definition 4.2.1. A Gaussian kernel is a function Gh(x, y) = (1 / 2π) e^(−d(x, y)² / 2h²).
With this Gaussian kernel, given a set of observed objects S = (y1, y2, ..., yn), it is possible
to estimate the density of a point x using the following density estimation function:

fh(x) = (1 / n) ∑_{i=1}^{n} Gh(x, yi)        (4.4)
Notice that there is no requirement that x be a member of S, and so this density estimation
function allows us to estimate the probability of observing a previously unobserved
point. This will be exploited in section 4.2.3 in order to recommend generalized points.
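The estimator of Definition 4.2.1 and Eq. 4.4 is straightforward to sketch; the one-dimensional sample data and distance function below are hypothetical, chosen only to mirror the shape of the six-point example in Figure 4.2:

```python
import math

def gaussian_kernel(x, y, h, dist):
    """Gaussian kernel G_h of Definition 4.2.1, with kernel width h and
    a caller-supplied distance function dist."""
    return math.exp(-dist(x, y) ** 2 / (2.0 * h ** 2)) / (2.0 * math.pi)

def density_estimate(x, samples, h, dist):
    """Kernel density estimate f_h(x) of Eq. 4.4: the mean kernel value
    between x and every observed sample."""
    return sum(gaussian_kernel(x, y, h, dist) for y in samples) / len(samples)

# Six observed points on a line, loosely as in Figure 4.2:
samples = [1.0, 2.0, 2.5, 4.0, 4.2, 6.0]
d = lambda a, b: abs(a - b)

# x need not be one of the samples; the density is higher near the
# cluster around 2.0-2.5 than far away from every observation:
print(density_estimate(2.2, samples, h=0.5, dist=d))
print(density_estimate(9.0, samples, h=0.5, dist=d))
```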
Figure 4.2 demonstrates how Gaussian kernel estimation works. A Gaussian curve is
constructed around each of the six points lying on the x-axis, and the top curve is the sum of
these six curves. The Gaussian kernel estimate for this dataset is not shown, but as there are
six points in the dataset, it would be one sixth of the sum of the Gaussian curves constructed
Figure 4.2: Example of Gaussian kernel estimation
around each point. In other words, the Gaussian kernel estimate for this dataset is one sixth
the top curve in the figure.
4.2.3 Recommending Generalizations
A better means of accounting for POI similarity is to give our system the ability to
recommend generalizations of POI. In addition to computing the confidence of recommending
each next point n ∈ N as above, we will also compute the confidence of recommending all
generalizations of n. Recall that if gp is a generalized point that contains a POI p, then we
say that gp generalizes p, written gp ≥ p.

Suppose that for a given query there are next points n1, n2, ..., nm, all with a common
conceptual ancestor z, and recall from the previous section that the density estimate of a
point z is the expected probability of observing z. This allows us to compute the confidence
of recommending z as the Gaussian kernel density estimate for z over the bag of next points N.
Recalling that we want the confidences of non-generalized points to not be affected by
other points, using a Gaussian kernel density estimation function fh(x), we can write:
confidence(z) = fh(z)                       if z is a generalized point
confidence(z) = support(z, N) / (2π |N|)    otherwise        (4.5)
This family of measures (there is a different measure for each possible h) has been
selected because it is both simple and theoretically well-founded. Furthermore, if we define
the distance between two points as 0 if they are the same and ∞ otherwise, then the
confidence computed by this family for a point is exactly 1/(2π) times the value computed by
the naïve exact matching confidence measure presented in section 4.1.1. Thus the naïve exact
matching case is just a special case of our method for recommending generalizations.
As a final note about recommending generalized POI, if we were to leave our method
as described above, it would be possible to recommend both a generalization of a point
as well as the point itself. However, this is practically undesirable, as one of our goals is
to recommend a diverse set of POI. What we can do is recommend a generalization z only
if our confidence in it is greater than our confidence in recommending any of its children,
whether an explicit POI or a more specialized generalization.
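Putting Eq. 4.5 and the child-filtering rule together, a minimal sketch might look as follows; the distance values and the `is_generalized` predicate are hypothetical stand-ins for the thesis's concept hierarchy and distance measure:

```python
import math
from collections import Counter

def confidence(z, next_points, h, dist, is_generalized):
    """Confidence of Eq. 4.5: a Gaussian kernel density estimate for
    generalized points, and support(z, N) / (2*pi*|N|) otherwise."""
    n = len(next_points)
    if is_generalized(z):
        return sum(math.exp(-dist(z, y) ** 2 / (2.0 * h ** 2)) / (2.0 * math.pi)
                   for y in next_points) / n
    return Counter(next_points)[z] / (2.0 * math.pi * n)

def recommend_generalization(z, children, next_points, h, dist, is_generalized):
    """Keep generalization z only if its confidence beats every child's."""
    cz = confidence(z, next_points, h, dist, is_generalized)
    return all(cz > confidence(c, next_points, h, dist, is_generalized)
               for c in children)

# Hypothetical distances: "Starbucks" is close to its three branches
# and far from everything else.
def toy_dist(a, b):
    if a == b:
        return 0.0
    if a.startswith("Starbucks") and b.startswith("Starbucks"):
        return 0.5
    return 100.0

N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2 +
     ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
is_gen = lambda p: p == "Starbucks"
branches = ["Starbucks-1", "Starbucks-2", "Starbucks-3"]
# The generalization outranks both its children and Second Cup:
print(recommend_generalization("Starbucks", branches, N, 1.0, toy_dist, is_gen))  # True
print(confidence("Starbucks", N, 1.0, toy_dist, is_gen) >
      confidence("Second Cup", N, 1.0, toy_dist, is_gen))  # True
```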
4.3 Spatio-specific Generalized Recommendations
Using the generalized confidence measure presented in section 4.2.3, we are able to
recommend both trajectory points and concept points without difficulty. However, there is still
a problem to be addressed: the generalized recommendations made so far contain no
spatial information and are purely conceptual. That is, we still have no
mechanism for recommending localized points.
To see why this is a problem, suppose there are three franchises of a popular coffee shop
very near to each other, but that there is another franchise of the same coffee shop across
town. Now suppose that all four of these franchises are next points of our query trajectory
q, and that the common generalization of these four franchises is z. The problem we can see
is that confidence(z) will be low, due to the spatial distance between three of the franchises
and the fourth franchise.
Figure 4.3: Example illustrating the problem of over-generalization
What we want is to be able to recommend a generalized point y that subsumes only the
three franchises that are very nearby to each other, and to be able to disregard the other
POI. However, it would be undesirable for the administrator of the POI recommendation
system to have to manually create an intermediate layer in the POI hierarchy, between the
least generalization and the POI themselves, that indicated spatial proximity. What we
need is to create a dynamic conceptual hierarchy level capable of recommending POI in
close proximity. This will allow us to recommend localized points in addition to trajectory
points and concept points.
To accomplish this, we can overlay the space containing all of our trajectories with four
interleaving grids and give each cell in each of the four grids a cell code. Each of these grids
will have an edge length 2r, and the four grids are offset from each other by r in one or
both dimensions. Given a cell for one of the grids, cells for each of the other grids could
be found by adding r to the longitude of our original cell and leaving the latitude alone, by
adding r to the latitude of our original cell and leaving its longitude unchanged, and finally
by adding r to both the latitude and longitude of our original grid cell.
With these four interleaving grids we can attach a tuple (cell1, cell2, cell3, cell4) to each
point in our trajectories. Two points can then be considered to be in close proximity if they
share a cell code. By using four interleaving grids, we have a trivial method of determining
whether points are nearby to each other because if each grid cell has dimensions 2r × 2r
then any two points of distance no more than r from each other will share a cell code. For
the purposes of distance computations, the location of a localized point associated with a
cell will be taken to be the centroid of all contained points of interest rather than the cell
center.
Definition 4.3.1. The extents of the historical trajectory database is a tuple extents =
(minLon,maxLon,minLat,maxLat) that contains the minimum longitude, maximum lon-
gitude, minimum latitude, and maximum latitude observed on any trajectory point in the
historical trajectory database.
In this thesis, we use the following simple method for computing the cell codes, given a
trajectory point p, the extents extents of the historical trajectory database, and a cell edge
length r. Two helper variables, baseCellCodeLon and baseCellCodeLat, are introduced to
simplify the equations:
baseCellCodeLon = ⌊2(p.lon − extents.minLon) / r − 1⌋
baseCellCodeLat = ⌊2(p.lat − extents.minLat) / r − 1⌋

cell1 = (baseCellCodeLon, baseCellCodeLat)            (4.6)
cell2 = (baseCellCodeLon, baseCellCodeLat + 1)        (4.7)
cell3 = (baseCellCodeLon + 1, baseCellCodeLat)        (4.8)
cell4 = (baseCellCodeLon + 1, baseCellCodeLat + 1)    (4.9)
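The cell-code equations transcribe directly into code. The helper below follows Eqs. 4.6-4.9, and `share_cell` is a hypothetical proximity test that compares codes grid by grid; the coordinates and r in the example are arbitrary illustrative values:

```python
import math

def cell_codes(lon, lat, min_lon, min_lat, r):
    """Cell codes of Eqs. 4.6-4.9 for a point against the four
    interleaving grids."""
    base_lon = math.floor(2.0 * (lon - min_lon) / r - 1.0)
    base_lat = math.floor(2.0 * (lat - min_lat) / r - 1.0)
    return ((base_lon,     base_lat),      # cell 1
            (base_lon,     base_lat + 1),  # cell 2
            (base_lon + 1, base_lat),      # cell 3
            (base_lon + 1, base_lat + 1))  # cell 4

def share_cell(p, q, min_lon, min_lat, r):
    """Two points are considered to be in close proximity iff they
    share a cell code in at least one of the four grids."""
    cp = cell_codes(p[0], p[1], min_lon, min_lat, r)
    cq = cell_codes(q[0], q[1], min_lon, min_lat, r)
    return any(a == b for a, b in zip(cp, cq))

# Nearby points share a code; distant points do not:
print(share_cell((3.0, 3.0), (4.0, 4.0), 0.0, 0.0, r=10.0))    # True
print(share_cell((3.0, 3.0), (50.0, 50.0), 0.0, 0.0, r=10.0))  # False
```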
Figure 4.4 visualizes how grid cells solve the problem of recommending only concept
points, by allowing the recommendation system to recommend concepts in a specific region.
In this figure we can see that shops 1, 2, and 3 all share a grid cell, and so we could
recommend a localized point associated with this grid cell for the concept "shop". Using
only concept points, we would only be able to recommend individual
shops or all of the shops together. Note that all four interleaving grids are shown in the
figure, although all of the lines are shared by two interleaving grids. One cell from each of
the four interleaving grids is shaded in the figure.
Figure 4.4: Solving the over-generalization problem with grid cells
4.4 Implementing Exact Matching
One of the principal goals of this thesis is that all recommendation queries should execute
in real-time, and thus they need to be extremely efficient. However, we are permitted an
arbitrary amount of data pre-processing time. Hence, what is needed is to pre-process our
historical trajectory data, and to store it in some data structure that will permit us to
perform queries efficiently.
Each subsection of this section covers one aspect of efficiently implementing exact matching.
I begin by describing the traditional generalized suffix tree data structure, and then
proceed to the k-truncated generalized suffix tree, which for our purposes is
cheaper to construct and consumes less memory than a traditional generalized suffix tree.
After describing these data structures, I cover the details of computing the distance between
points of interest (and their generalized varieties), as well as how to execute exact matching
queries on a generalized suffix tree (k-truncated or not).
4.4.1 Suffix Trees
First introduced by Peter Weiner in 1973 [30], who referred to them as “position trees”,
suffix trees have become part of the standard data structure tool-box, and have found
wide application in many string algorithms. According to Dan Gusfield, author of the
comprehensive book on suffix trees, “Algorithms on Strings, Trees, and Sequences” [10], the
“classic application” for suffix trees is the substring problem. In the normal formulation,
the problem is to determine whether some string r is a substring of the string on which we
have constructed a suffix tree. This is easy to perform with a suffix tree because a suffix tree
is a tree that contains all suffixes of a given string. This means that solving the substring
problem becomes a simple matter of traversing the suffix tree, matching characters of r to
the characters on edges of the suffix tree until all characters of r have been matched or it is
possible to proceed no further.
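For illustration, the substring query can be demonstrated with a naive suffix trie (not the compact, linear-size suffix tree defined below; this sketch uses quadratic construction, which the real structure avoids):

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie. This naive
    construction is O(|s|^2); a proper suffix tree compacts unary
    chains into labelled edges and builds in (near-)linear time."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, r):
    """Walk down the trie matching the characters of r; r is a
    substring iff every character can be matched."""
    node = trie
    for ch in r:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("mississippi$")
print(is_substring(trie, "issi"))  # True
print(is_substring(trie, "issa"))  # False
```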
Following Gusfield [10], we define a suffix tree:
Definition 4.4.1. A suffix tree T for a string s of length m is a rooted directed tree with
m leaves. Each internal node other than the root has at least two children and each edge
is labelled with a non-empty substring of s. No two edges leading out of a node can have
labels beginning with the same character. The defining characteristic of a suffix tree is that
for any leaf i, the concatenation of the labels of the edges on the path from the root to leaf
i is s_{i..m}, the suffix of s that starts at position i.
Definition 4.4.2. A generalized suffix tree T for a set of strings S = {s1, s2, ..., sn}, where
|sj| = mj, is a suffix tree constructed on a set of strings rather than a single string; it
contains all suffixes of all of the strings in S. Each internal node other than the root has at
least two children and each edge is labelled with a non-empty substring of one or more sj.
No two edges leading out of a node can have labels beginning with the same character. The
defining characteristic of a generalized suffix tree is that for any leaf (i, j), the concatenation
of the labels of the edges on the path from the root to leaf (i, j) is s^j_{i..mj}, the suffix of sj
that starts at position i.
An example of a suffix tree for the word "mississippi$" is presented in figure 4.5. The "$" is
not strictly necessary, but is used so that every suffix has its own leaf node. In this example
we can see the defining characteristic of a suffix tree: the concatenation of the labels of
the edges on the path from the root to a leaf i is s_{i..m}, the suffix of s that starts at position
i. For example, the concatenation of edge labels from the root to node 8 is
"ippi$", which is the suffix of "mississippi$" starting from position 8.
Figure 4.5: Suffix tree for the word "mississippi$"
Suffix trees are well suited to the problem of exact matching, and they will be useful for
the fuzzy matching techniques presented in the next chapter as well. They are both space
and time efficient to construct, and permit us to rapidly query for the next points of
a query trajectory q. Although a linear time method for suffix tree construction was first
discovered in 1973 by Peter Weiner [30], the first online method for suffix tree construction
was published in 1995 by Esko Ukkonen [28]. "Online" in this context means that characters
are added to the suffix tree in the order in which they are presented, which makes it
possible to update the suffix tree with new characters as they are discovered. The methods
used in this thesis are based on Ukkonen's algorithm.
The methods described in this section all have published time and space complexity
bounds of O(m), where m is the sum of lengths of all input strings. However, it is important
to note that this bound is only valid assuming a fixed alphabet. The linear bounds are not
alphabet-independent. As Gusfield describes it: “All linears are equal but some are more
equal than others." [10]. For the purposes of this thesis, the alphabet Σ is not fixed, and
so the real time complexity bound is O(m log |Σ|) [10]. Furthermore, the space complexity
for constructing a suffix tree is in fact O(m|Σ|). It is thus important that the alphabet size
required for a problem be kept as small as possible. If the alphabet size grows linearly with
the length of the strings, then the time required to construct a suffix tree degenerates to
O(m logm), and the space complexity to O(m2). As suffix tree construction can be done
offline this is not too bad from the perspective of running time, but the space requirements
can be prohibitive.
This complexity bound leads us to an important practical consideration, which is that
the alphabet size should be limited as much as possible. For example, given a set of real
trajectories, it is likely that almost all points will be unique if position is measured too
finely, but if points are snapped to a grid such that longitudes and latitudes are rounded to
the nearest 10 metres, then the alphabet size is reduced drastically. In the experimentation
section of this thesis, a different approach is taken: all trajectories are represented as
sequences of visited points of interest, which limits the alphabet size under consideration to
the number of points of interest.
Suffix trees can be used to solve the exact matching problem because given a query
trajectory q, we can traverse down the suffix tree until we match q, and from there we can
determine all possible next points by simply reading the children of that node.
Detailed descriptions of the methods required to construct suffix trees are omitted here
for brevity and can be found in Appendix A. One important practical note, however, is that
many suffix tree implementations use a linked list at each node, and this can slow down
insertion and lookup, because as the tree grows it takes increasingly longer to find
the child node corresponding to the next character of the suffix being inserted. This is an
important consideration given our requirement that queries execute in real-time.
4.4.2 k-truncated Generalized Suffix Trees
An ordinary (generalized) suffix tree can be thought of as having a maximal string depth
equal to the length of the longest string present in the tree. However, for the purposes
of this thesis, a query trajectory q is unlikely to be very long, and furthermore, if it were
long, it is questionable how useful its older trajectory points would be. One assumption that
can be used to reduce the time and memory required to construct and store a suffix tree is
to require that a query trajectory not exceed some fixed length l, and in this case we can
improve on using a plain generalized suffix tree.
The k-truncated generalized suffix tree (kTGST) was first introduced in 2008 by Schulz et
al. [24]. It differs from an ordinary generalized suffix tree in that the construction algorithm
builds a tree of depth at most k and permits searching only for sequences of length at most k.
It was originally introduced as a means of improving bioinformatics algorithms, where the
query strings are typically short DNA, RNA, or amino acid sequences.
Rather than containing all suffixes of all strings added to the tree, the kTGST contains
all substrings of length k (known as k-grams) of the strings added to the tree. This is
exactly what is needed to support the exact matching methods described in this chapter. As
with regular generalized suffix trees, for the sake of brevity, a description of how to construct
kTGSTs has been moved to appendix A.
The kTGST improves on traditional generalized suffix trees in a number of ways: improved
construction time, reduced memory usage, and quicker queries. The improved construction
time and reduced memory usage arise because limiting the strings added to the suffix tree
to k-grams greatly increases the chance of identical strings being added to the tree. The
quicker queries result because in a normal generalized suffix tree, information about the
string id and position of a suffix in a string is stored only in the leaf nodes of the tree, and
so after matching a suffix, we need to traverse the entire subtree below the match. With a
kTGST, if our query has length k, then there is no subtree below the match that needs
exploring, and for queries of length less than k, the subtree that needs to be traversed is
likely to be small compared to the case of regular generalized suffix trees.
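The depth-k idea can be illustrated with the same kind of naive nested-dict trie used earlier (again a sketch under stated assumptions, not the compact kTGST of Schulz et al.): inserting only the first k characters of each suffix bounds the depth at k and lets repeated k-grams collapse into shared paths:

```python
def build_ktruncated_trie(strings, k):
    """Insert, for every suffix of every string, only its first k
    characters. The resulting trie has depth at most k and contains
    exactly the substrings of length <= k (the k-grams plus shorter
    tail suffixes)."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:i + k]:
                node = node.setdefault(ch, {})
    return root

def trie_depth(node):
    """Length of the longest path below node."""
    if not node:
        return 0
    return 1 + max(trie_depth(child) for child in node.values())

trie = build_ktruncated_trie(["mississippi$"], 3)
print(trie_depth(trie))  # 3: the depth never exceeds k
```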
Figure 4.6 contains an example of a k-truncated suffix tree for the word “mississippi$”
with k = 3. Note that some leaf nodes now correspond to multiple locations in the original
word, and that the maximum depth of the suffix tree does not exceed 3.
Figure 4.6: 3-truncated suffix tree for the word "mississippi$"
4.4.3 Computing Point Distance
In sections 4.2.3 and 4.5 we depend on some function dist(x, y) to compute the distance
between two points x and y, where each is either a point of interest, a generalized point, or
a localized generalized point. Note that generalized points have no spatial location, and all
types of point have an associated concept.
Throughout this thesis, the spatial distance sDist(x, y) is taken to mean the Euclidean
distance between two points x, y, although any well defined distance measure could be used
instead. Recall from definition 3.1.1 the definition of concept distance used in this thesis. Let
z be the lowest common ancestor of x and y in the concept hierarchy. If z does not exist
(i.e. x and y have no common ancestor in the concept hierarchy), then the conceptual
distance between x and y is said to be infinite. Otherwise, the conceptual distance cDist(x, y)
between x and y is defined as max(depth(x) − depth(z), depth(y) − depth(z)),
where depth(x) denotes the depth of concept x in the concept hierarchy.
Note that other distance measures, such as the sum of these two depth differences, could be
used. An example of computing cDist is given in figure 4.7.
Figure 4.7: Example of concept distance: cDist(x, y) = 3
To compute the distance between a concept point x and any other point y, we can com-
pute the conceptual distance, cDist(x, y) and normalize it by a user defined generalization
factor gf used to control how distant concept points are from each other and from other
points.
To compute the distance between two points x, y where neither x nor y is a concept point,
we can separately compute the spatial and conceptual distances between them, normalizing
each distance by an arbitrary spatial factor, sf , and concept factor, cf respectively, and
summing the two values.
To summarize, the distance between two points x, y is computed as:
distance(x, y) = cDist(x, y) × gf                        if x or y is a concept point
distance(x, y) = cDist(x, y) × cf + sDist(x, y) × sf     otherwise        (4.10)
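A sketch of Eq. 4.10, together with the concept distance described above, follows; the hierarchy, depths, and factor values are hypothetical, set up to reproduce the example of Figure 4.7:

```python
import math

def c_dist(x, y, parent, depth):
    """Concept distance: the larger of the two depth differences to the
    lowest common ancestor z, or infinity if no common ancestor exists.
    `parent` maps each concept to its parent in the concept hierarchy."""
    ancestors_of_x = set()
    node = x
    while node is not None:
        ancestors_of_x.add(node)
        node = parent.get(node)
    node = y
    while node is not None and node not in ancestors_of_x:
        node = parent.get(node)
    if node is None:
        return math.inf
    z = node
    return max(depth[x] - depth[z], depth[y] - depth[z])

def distance(x, y, cdist, sdist, is_concept, gf, cf, sf):
    """Point distance of Eq. 4.10 with generalization, concept, and
    spatial factors gf, cf, and sf."""
    if is_concept(x) or is_concept(y):
        return cdist * gf
    return cdist * cf + sdist * sf

# Hierarchy of Figure 4.7: A -> B -> x and A -> C -> D -> y.
parent = {"B": "A", "x": "B", "C": "A", "D": "C", "y": "D"}
depth = {"A": 0, "B": 1, "x": 2, "C": 1, "D": 2, "y": 3}
print(c_dist("x", "y", parent, depth))  # 3, as in the figure
```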
The three constants used in this equation, the generalization factor gf, the concept factor
cf, and the spatial factor sf, need to be chosen carefully for the recommendation
system to perform well. Unfortunately, there is no rule that allows us to determine
good values for these constants a priori. The most critical decision is what the relative
weights of these distances should be. For example, by choosing a large spatial factor sf
and a small concept factor cf, we could tune the recommendation system to recommend
generalized points covering trajectory points with highly dissimilar concepts, provided that
the trajectory points are very close to each other. Our recommendation is to try a number
of combinations of these factors and to select the combination that works best for a given
dataset. This may seem arbitrary, but the constants must be chosen for each application,
and by choosing different values it is possible to tune the recommendation system to exhibit
different properties. For
example, by tweaking these parameters we can alter the observed ratios of concept, localized,
and trajectory points recommended. Taken to an extreme, one interesting idea would be to
set the spatial factor near zero in order to answer recommendation queries even for tourists
in cities for which we have no historical data. The values chosen for experimentation in this
thesis are discussed in Chapter 6.
4.4.4 Executing Exact Matching Queries
Given the techniques presented thus far in this chapter, it is simple to execute an exact
matching query. When presented with a query trajectory q, all that is required is to walk
the edges of a generalized suffix tree until we find a matching node. From this node, we can
determine all of the next points N of q, and with this set of next points, all that remains
is to compute the confidences of all possible recommendations according to the measures
defined in sections 4.2.3 and 4.3.
Searching Suffix Trees
Intuitively, searching a suffix tree is very simple, although some details need to be handled
correctly in an actual implementation. To search a suffix tree, one simply walks down its
edges, iterating through each point of the query trajectory, until either all points have been
matched, a mismatch is encountered, or no suitable child is found to continue walking down
the tree. The pseudo-code for our suffix tree searching algorithm is presented in algorithm 1.
Input: A suffix tree node node, a query trajectory q, and an integer len representing how much of q has already been matched to reach node
Output: A set of next points N of q

N ← {}
edgestr ← node.edgestr
edgelen ← node.edgelen
pos ← 0
/* Compare characters until mismatch or end of string reached */
while len < |q| and pos < edgelen and q[len] = edgestr[pos] do
    pos ← pos + 1
    len ← len + 1
if len = |q| then                      /* Matched trajectory */
    /* Traverse subtree to find all next points */
    N ← TraverseSubtree(node)
    return N
else if pos < edgelen then             /* Encountered mismatch */
    return {}
/* Reached end of edge. Need to recurse if node is internal */
if not(node.isALeaf) then
    child ← FindChild(node, q[len])
    if not null(child) then            /* We found a child to continue searching */
        return searchTree(child, q, len)
/* Node is a leaf, or no child exists. Can't recurse */
return {}

Algorithm 1: Searching Suffix Trees
Two important technical notes must be made about the algorithm for searching for all
next points of a query trajectory q:
• Once we have found a node n matching the query trajectory q, we cannot immediately
determine the next points of q. This is because information about the strings / trajectories
stored in the suffix tree is kept only in the leaf nodes, and so we must traverse
the entire subtree below n in order to determine the next points of q. This is performed
by the function TraverseSubtree(node).
If k-truncated generalized suffix trees are being used, the amount of tree that needs
to be traversed below n may be much smaller. This is a major reason why using
k-truncated generalized suffix trees will turn out to be more efficient than using
regular generalized suffix trees.
• In order to implement the FindChild(node, point) method efficiently, it is critical
that the children of node be indexed for fast lookup. For trajectory-based POI
recommendation, the alphabet tends to be extremely large, and it is important that
this lookup be performed using a hash table or a tree, instead of iterating through
all children and comparing each to point. Our implementation uses a binary tree to
perform this lookup.
Processing Next Points
Once we have obtained the bag N of all next points of q, it remains to compute the
confidences of all possible recommendations and to determine the set of recommendations that
should be presented to the user. The processing of next points takes place in two phases:
gathering and diversification.

The gathering phase computes the confidences of all possible recommendations. To
do this, we begin by determining the supports of all next points of q and computing the
confidences of recommending each of these points according to the methods described
earlier in this chapter. The next step is to compute the confidences of recommending all
concept generalizations of the next points of q. The final confidence computation step is
to determine all cells that contain a next point of q, and then, for each cell c, to compute
the confidence of all (now spatio-localized) generalizations of the next points of q contained
in c. This process constructs a set R of all possible recommendations for the query
trajectory q, along with their associated raw confidences. Pseudo-code for this process is
presented in algorithm 2.
The next phase performed in processing next points is the optional diversification of
Input: A bag N of all next points of a query trajectory q
Output: A set R consisting of (point, confidence) pairs, each representing a recommendation and its computed confidence.

/* Phase 1: Compute the confidences of each next point */
R ← computeConfidencesOfTrajPoints(N)
/* Phase 2: Compute the confidences of all concept points */
C ← gatherAllObservedConcepts(N)
R ← R ∪ computeConfidencesOfConceptPoints(C, N)
/* Phase 3: Iterate over all cells and compute the confidences of all localized points */
Cells ← gatherAllObservedCells(N)
for cell ∈ Cells do
    N_cell ← gatherPointsInCell(N, cell)
    R ← R ∪ computeConfidencesOfTrajPoints(N_cell, cell)
    C_cell ← gatherConceptsInCell(N, cell)
    R ← R ∪ computeConfidencesOfConceptPoints(C_cell, N_cell, cell)
return R

Algorithm 2: Processing Next Points
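The gathering phase can be sketched in Python. This is a simplified sketch, not the thesis's implementation: confidence is approximated as support/|N| throughout (the thesis uses kernel density estimates for generalized points), the per-cell phase only localizes concepts, and the concept_of / cell_of lookup tables are hypothetical.

```python
from collections import Counter

def gather_confidences(next_points, concept_of, cell_of):
    """Sketch of Algorithm 2's gathering phase.

    next_points: bag (list, with repeats) of next points of the query.
    concept_of / cell_of: hypothetical maps from a point to its concept
    generalization and to its containing grid cell.
    """
    n = len(next_points)
    recs = {}

    # Phase 1: raw next points, confidence = support / |N|.
    for point, support in Counter(next_points).items():
        recs[point] = support / n

    # Phase 2: concept generalizations (e.g. "Coffee Shop").
    concepts = Counter(concept_of[p] for p in next_points)
    for concept, support in concepts.items():
        recs[("concept", concept)] = support / n

    # Phase 3: per-cell (spatio-localized) generalizations.
    for p in next_points:
        key = ("cell", cell_of[p], concept_of[p])
        recs[key] = recs.get(key, 0) + 1 / n
    return recs

N = ["sb1", "sb1", "sc1"]
concept_of = {"sb1": "coffee", "sc1": "coffee"}
cell_of = {"sb1": 7, "sc1": 9}
R = gather_confidences(N, concept_of, cell_of)
```

With both points generalizing to "coffee", the raw point sb1 ends up with confidence 2/3 while the concept point collects the full support of the bag.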
recommendations. This process is discussed in the next section. The output of the diversi-
fication process is an updated R with new, diversified, confidences.
4.5 Diversification of Recommendations
Although not a theoretical problem, there is a practical problem with the methods described
in the previous sections. In particular, there is the chance of recommending, for example, a
generalized Starbucks POI alongside a generalized Second Cup POI, or a different Starbucks
POI in another region. This is potentially undesirable both for advertisers and for users of
the system. Even though it is likely that an individual will visit, say, one of three coffee
shops, the user may not want all three coffee shops recommended, and similarly, one coffee
shop may not want to spend money advertising just to appear alongside other coffee shops.
In the previous sections, we wanted the recommendation system to recommend a gen-
eralized point only if the points contained in the generalized point are highly related. This
is because we do not want to recommend a generalized point that is too general as that
would not be useful for a user of the recommendation system. However, some of the top-k
recommendations may not be similar enough to recommend together as a generalized point
but nonetheless be similar enough to not want to recommend both of them. For example,
it would be possible for two of the top-k recommendations for a query to be a Starbucks
and a Second Cup that are far enough apart that the confidences for recommending each
of them is greater than the confidence for recommending a generalized point for “Coffee
Shop”. Despite the two coffee shops being too distant from each other for the recommenda-
tion system to recommend a generalized point subsuming both of them, we may not want
to recommend two coffee shops in the case that we could recommend a book store or a
hardware store instead.
This is not a problem with the confidence measure defined above, but a matter of how
to practically apply the confidences as estimated. What we can do is process the list of
recommendations after computing the confidences of each possible next point for a given
query trajectory. The goal is to compute a “diversified confidence” for each next point
that can be used to generate a list of diversified top-k recommendations. The goal is not
to improve the confidences of the recommendations, since the method is guaranteed to
reduce the sum of confidences of the top-k recommendations returned, but merely to ensure
that the recommendations are diverse.
An easy method to implement diversity is to iterate through the list of all recommenda-
tions in order of decreasing confidence, and to decrease the confidence of each recommen-
dation by how similar it is to recommendations already processed. If a recommendation
is very similar to recommendations already made, then it will have its effective confidence
decreased by a large amount, and conversely, if a recommendation is very dissimilar to
recommendations already made, then its effective confidence should not be affected.
The effect of point similarity on the diversification process can be controlled by a di-
versification factor divFac, 0 ≤ divFac ≤ 1. If divFac = 1 then no two recommendations
representing comparable points will both end up with non-zero confidences.
Pseudocode for the diversification algorithm used by this thesis is presented in algorithm
3. This is a straightforward greedy algorithm, and it is easy to see by inspection that it
runs in time O(|R|2), where R is the set of potential recommendations to be returned to
the user.
One objection that can be made to using this means of diversifying the set of rec-
ommendations is that it is completely deterministic, and given a set of recommendations
Input: A set of recommendations R, a diversification factor divFac
Output: A set of recommendations S with confidences adjusted to ensure diversity

S ← {}
sort R by decreasing confidence
for i ← 1 to |R| do
    rec ← R[i]
    for j ← 1 to |S| do
        pointDistance ← distance(rec.point, S[j].point)
        if pointDistance < ∞ then
            divReduction ← pointDistance × (1 − divFac)
            if divReduction < 1 then
                rec.confidence ← rec.confidence × divReduction
    S ← S ∪ {rec}
return S

Algorithm 3: Algorithm for Diversifying Recommendations
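The greedy loop of Algorithm 3 translates directly into Python. The sketch below assumes recommendations are (point, confidence) pairs and that the distance function returns float('inf') for incomparable points, mirroring the pointDistance < ∞ test.

```python
def diversify(recs, distance, div_fac):
    """Greedy diversification in the spirit of Algorithm 3.

    recs: list of (point, confidence) pairs.
    distance: function returning the distance between two points,
              float('inf') for incomparable points.
    div_fac: diversification factor in [0, 1].
    Runs in O(|R|^2): each recommendation is compared against all
    previously accepted ones, and its confidence is scaled down when
    a similar (nearby) recommendation has already been processed.
    """
    result = []
    for point, conf in sorted(recs, key=lambda r: -r[1]):
        for prev_point, _ in result:
            d = distance(point, prev_point)
            if d == float("inf"):
                continue  # incomparable points do not interact
            reduction = d * (1 - div_fac)
            if reduction < 1:
                conf *= reduction
        result.append((point, conf))
    return result

# Hypothetical example: two coffee shops at distance 0.5, a book store
# incomparable to both.
dists = {frozenset(["sb", "sc"]): 0.5}
def distance(a, b):
    return dists.get(frozenset([a, b]), float("inf"))

diversified = diversify([("sb", 0.9), ("sc", 0.8), ("books", 0.5)],
                        distance, div_fac=1.0)
```

With div_fac = 1, the second coffee shop's confidence is driven to zero, illustrating the claim that no two comparable points both keep non-zero confidences.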
always returns the same result. This is objectionable because, using the algorithm presented
above, we may always recommend a Starbucks for some query and never a Second Cup,
even though our confidence in recommending the Second Cup may be only slightly lower
than our confidence in recommending the Starbucks. This is undesirable for users of
the recommendation system, for the establishments that are never recommended by
the recommendation process, and for the operator of the recommendation system, who
may lose out on potential advertising revenue. This problem can be remedied by slightly
randomizing the order in which the points are considered, so that the higher a point's
confidence, the more likely it is to be processed early and have its confidence left untouched.
As mentioned in Section 2.1, this property is known as serendipity. However, we do not
pursue any experimentation along this line in this thesis, as the greedy method presented
is sufficient for our purposes.
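Such a serendipitous processing order could be sketched as follows. This is purely illustrative and not evaluated in the thesis: points are drawn without replacement with probability increasing sharply in confidence, so higher-confidence points tend to be processed first and keep their confidences untouched; the temperature parameter is an invented knob.

```python
import random

def serendipitous_order(recs, temperature=0.1, rng=random):
    """Randomize the diversification order: at each step, draw one
    remaining (point, confidence) pair with weight conf**(1/T), so a
    high-confidence point is very likely, but not certain, to come
    out early."""
    pool = list(recs)
    order = []
    while pool:
        weights = [conf ** (1 / temperature) for _, conf in pool]
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        order.append(pool.pop(pick))
    return order

order = serendipitous_order([("a", 0.9), ("b", 0.5), ("c", 0.1)],
                            rng=random.Random(0))
```

The output is a permutation of the input; feeding it to the greedy diversifier in place of the strict confidence sort would occasionally let a slightly weaker competitor through.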
4.6 Summary
This chapter began by motivating and defining the exact matching problem, and then
proceeded to present a naïve solution to the problem. The naïve solution was revealed to
suffer from an inability to account for POI similarity. Two approaches to accounting for POI
similarity were presented, one that worked by altering the confidences of recommending
similar points of interest, and one that functioned by using density estimation to recommend
concept points. The former was demonstrated to be flawed, and the latter was selected as
the superior approach. Following this, we introduced a method for recommending localized
points using a system of four interleaving grids to determine whether the next points for a
query are spatially nearby.
After describing the conceptual framework for performing exact matching we delved into
the details of implementing exact matching using the (k-truncated) generalized suffix tree
data structure. Furthermore, we presented a concrete equation for computing the distances
between points and provided algorithms for executing recommendation queries. Finally
we presented a simple method for ensuring that the top-k returned recommendations are
diverse.
Chapter 5
Variants
The exact matching techniques described in the previous chapter are useful and are the
core of this thesis. However, they still do not address the fuzziness and order-flexibility
objectives for a useful trajectory-based recommendation system. The goal of this chapter
is to demonstrate how these two limitations can be naturally addressed in a reasonably
efficient manner by building on the exact matching techniques of the previous chapter.
5.1 Fuzzy matching
To repeat some preliminary definitions from previous chapters, let tDB be the trajectory
database, consisting of a bag of trajectories, and let q be a query trajectory, where l = |q|
is the length of q. Furthermore, let H be the set of all trajectory fragments of length l + 1
in tDB, so that each fragment h = (b : n) ∈ H consists of two parts: a body b of length l,
followed by a next point n.
Using the exact matching techniques of the previous chapter, when presented with a
query trajectory q we are only able to recommend next points that follow the body q in H.
This has two principal limitations:
• (No Recommendations) Given a query trajectory q, if q does not appear as the body
of some trajectory fragment in H, then we are unable to recommend any points of
interest.
• (Similar Trajectories) If there is a trajectory fragment s that is very similar to q, and
there is a recommendation for s with very high confidence, then we should be able to
recommend this next point in addition to the next points following q. Depending on
the particular dataset considered, it may be quite common for many very similar, but
non-identical, trajectories to be present in the trajectory database, and the exact
matching methods of the previous chapter are unable to utilize all of this information.
Our goal is for the trajectory-based POI recommendation system to base its recommen-
dations also on the historical trajectory fragments that are “close to” or “fuzzy matches
of” the query trajectory q. The next points of these historical trajectory fragments that
fuzzily match the query trajectory should be considered as possible recommendations when
executing a query. Furthermore, if some point a is the next point of one or more trajectory
fragments similar to q (including an exact match of q), then the support of these similar
trajectory fragments leading to point a should boost our confidence in recommending a.
More concretely, in this section we want to develop a method whereby the next points of
all trajectory fragments in H are considered, and where their contributions are weighted
proportionally to their body’s similarity to the query trajectory q.
Let F denote the set F = {(b : n) ∈ H | similarity(b, q) > 0}. That is, F is the set of all
trajectory fragments in the historical trajectory database that have bodies with a positive
similarity with q. With this definition we can define the fuzzy matching problem analogously
to the exact matching problem defined in the previous chapter.
Definition 5.1.1. Fuzzy Matching Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in F . The effect
of the next point of a fragment f = (b : n) ∈ F on the confidence for a recommendation
must be weighted proportionally to similarity(b, q).
5.1.1 Implementing Fuzzy Matching
One of the primary goals of this thesis is for all methods to be efficient, and for all queries
to be executable in real time. We must be mindful that the method chosen to perform
fuzzy matching remains efficient even on large data sets. The fundamental change required
to perform fuzzy matching is to modify the suffix tree searching algorithm presented in
the previous chapter: add a fuzzyError variable to the input of the algorithm, and relax
the condition of the while loop so that it no longer requires exact matches but instead
allows mismatches, adding the distance between each observed point and the expected
point to the fuzzyError. When processing next points, the fuzzyError will be used to
determine how to weight each next point.
At first glance, it may appear that this is all we need to do to handle the fuzzy matching
case, but there are still two problems that need to be addressed. The first is that the
contribution of the next points of a considered trajectory s was defined in terms of the
similarity between s and the query trajectory q, but the fuzzyError as described can grow
arbitrarily large. The second problem is that, as described, we may need to consider all
trajectories in the trajectory database (only those containing a point at infinite distance
from its corresponding point in q could be excluded), and so as the data set grows, the
time required to search for all possible next points may grow linearly (not exponentially,
since our generalized suffix tree can contain no more nodes than there are characters in
the dataset) to the point that queries can no longer be executed sufficiently quickly.
The solution to both of these problems is to specify a fuzzy search radius fsr, which serves
both as a control to limit the breadth of the search and as a scale for converting the
fuzzyError into a similarity score, so that similarity(s, t) = 1 − min(fsr, fuzzyError)/fsr.
The unit of the fuzzy search radius is the abstract unit of point distance defined in section
4.4.3. By defining a fuzzy search radius, we are able to limit our search space to those
historical trajectory fragments that fuzzily match the query trajectory within error fsr. This
will typically allow us to avoid visiting most of the nodes in our generalized suffix tree.
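The conversion from accumulated error to similarity is a one-liner; a minimal sketch under the definitions above:

```python
def similarity_from_error(fuzzy_error, fsr):
    """similarity(s, t) = 1 - min(fsr, fuzzyError) / fsr.

    fsr is the fuzzy search radius: a cumulative error of 0 gives
    similarity 1 (exact match), and any error at or beyond fsr is
    clamped to similarity 0.
    """
    return 1.0 - min(fsr, fuzzy_error) / fsr

assert similarity_from_error(0.0, 2.0) == 1.0   # exact match
assert similarity_from_error(2.5, 2.0) == 0.0   # beyond the radius
assert similarity_from_error(1.0, 2.0) == 0.5   # halfway
```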
Figure 5.1 is an abstract representation of the space surrounding a query trajectory that
lies within the fuzzy search radius maxDist.
Figure 5.1: Demonstrating the Fuzzy Search Radius around a Trajectory
With this enhancement, it is now possible to efficiently execute fuzzy matching queries.
Recall that in the previous chapter, we defined the kernel estimation function fh to be:

fh(x, S) = (1/n) Σ_{i=1}^{n} Gh(x, S[i])    (5.1)
In equation 5.1 we sum over all possible next points, and in many situations many of
these points may be the same point. Suppose that from S we construct a new set R
consisting of pairs (y, support(y)), where y is a unique point (no two pairs share the
same point y). In addition, let support(y) = |{s ∈ S | s = y}| be the support of y in S.
Furthermore, let totalSupport = Σ_{i=1}^{|R|} support(R[i]) be the sum of the supports
of all unique points of R. At this point, the support for a point is independent of the
query trajectory, and so for now we set support(y, q) = support(y). Recalling that in
order to execute fuzzy matching queries our density estimates will need to depend on
the query trajectory, we can rewrite the definition of fh as:
gh(x, R, q) = (1/totalSupport) Σ_{i=1}^{|R|} support(R[i], q) Gh(x, R[i])    (5.2)
With equation 5.2, it is possible for us to incorporate the next points of trajectory
fragments that do not exactly match the query trajectory, but that have a positive similarity
score with the query trajectory. All that we need to do is to change the means of computing
support(y) for each next point y. For each trajectory fragment (b : n) ∈ F , the contribution
of this fragment towards support(n) will be equal to similarity(b, q), where q is the query
trajectory. This leads us to a revised equation for support(y, q):
support(y, q) = Σ_{(b:n)∈F} [ similarity(b, q) if y = n, else 0 ]    (5.3)
Given the revised equation for support(y, q), equation 5.3, and the reworked kernel
density estimate computed by equation 5.2 we can compute the fuzzy confidence for a given
recommendation z using what is essentially the same confidence measure as was used in the
previous chapter:
fuzzyConfidence(z, q) =
    gh(z, N, q)                   if z is a generalized point
    support(z, N, q) / (2π|N|)    otherwise                    (5.4)
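Equations 5.3 and 5.4 (for the non-generalized case) can be sketched as follows, assuming the fuzzy search has already produced one (next point, similarity) pair per matching fragment in F; the 2π|N| normalizer follows equation 5.4 as printed, with |N| taken as the number of matching fragments.

```python
import math

def fuzzy_support(y, matches):
    """Equation 5.3: sum the body similarities of all fuzzily
    matching fragments (b : n) whose next point n equals y."""
    return sum(sim for nxt, sim in matches if nxt == y)

def fuzzy_confidence(y, matches):
    """Equation 5.4, non-generalized case:
    support(y, q) / (2 * pi * |N|)."""
    return fuzzy_support(y, matches) / (2 * math.pi * len(matches))

# Hypothetical matches: two fragments lead to "a" (one exact, one at
# similarity 0.5) and one fragment leads to "b".
matches = [("a", 1.0), ("a", 0.5), ("b", 1.0)]
```

An exact match contributes a full unit of support, while a fuzzily matching fragment contributes only its body's similarity to the query.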
In order to efficiently compute support(y, q) for each trajectory point y given a query q,
we need to modify the method for searching a generalized suffix tree presented in section
4.4.4. The updated algorithm will output both the next points of all historical trajectory
fragments that fuzzy match the query trajectory q as well as the similarity between q and
these historical trajectory fragments. Pseudo-code for our revised algorithm is presented in
algorithm 4.
Input: A suffix tree node node and a query trajectory q, a fuzzy search radius maxDistance, the current cumulative distance cumDistance, and an integer len representing how much of q has already been matched to reach node
Output: A set of pairs (N, similarity) of next points and similarity scores for each historical trajectory fragment that fuzzy matches q

N ← {}
edgestr ← node.edgestr
edgelen ← node.edgelen
pos ← 0
/* Compare characters until max distance or end of string reached */
while len < |q| and pos < edgelen and cumDistance < maxDistance do
    cumDistance ← cumDistance + pointDistance(q[len], edgestr[pos])
    pos ← pos + 1
    len ← len + 1
if cumDistance ≥ maxDistance then /* Max Distance Exceeded */
    return {}
else if len = |q| then /* Matched trajectory */
    /* Traverse subtree to find all next points */
    N ← TraverseSubtree(node)
    return (N, cumDistance)
/* Reached end of edge. Need to recurse if node is internal */
Results ← {}
if not(node.isALeaf) then
    for child ∈ node.children do
        Results ← Results ∪ searchTree(child, q, maxDistance, cumDistance, len)
return Results

Algorithm 4: Searching Suffix Trees For Fuzzy Matching
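The recursion of Algorithm 4 can be sketched over a plain trie of trajectory bodies instead of a true generalized suffix tree: each trie edge here is a single point, so the per-edge while loop collapses to one comparison, and the "$next" key holding the next points of fragments ending at a node is an invented convention.

```python
def fuzzy_search(trie, query, point_distance, max_dist, cum=0.0, depth=0):
    """Fuzzy search over a trie of fragment bodies.

    Returns (next_points, cumulative_error) pairs for every fragment
    body within max_dist of the query; branches whose accumulated
    error reaches max_dist are pruned, which is what lets the search
    skip most of the structure.
    """
    if cum >= max_dist:
        return []            # pruned: error budget exhausted
    if depth == len(query):  # matched the whole query
        return [(trie.get("$next", []), cum)]
    results = []
    for point, child in trie.items():
        if point == "$next":
            continue
        d = point_distance(query[depth], point)
        results += fuzzy_search(child, query, point_distance,
                                max_dist, cum + d, depth + 1)
    return results

# Hypothetical data: bodies a->b (next point c) and x->b (next point z).
trie = {"a": {"b": {"$next": ["c"]}}, "x": {"b": {"$next": ["z"]}}}
def dist(p, r):
    return 0.0 if p == r else 1.0
```

With a tight radius only the exact body matches; widening the radius lets the search reach the similar body x→b as well, at error 1.0.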
A further optimization could be to perform an A∗-search of the generalized suffix tree,
where we begin by first exploring those paths most similar to the query trajectory q. Our
goal is always to find the top-k recommendations, and we should be able to explore until we
find the top-k recommendations and no other branches explored could possibly contain a
top-k recommendation. In the worst case, all branches of the tree may be explored, although
in practice this is unlikely. The principal reason for not implementing this approach in the
thesis is that if diversification is performed, then the top-k recommendations found before
diversification will likely not be the top-k recommendations after diversification. A possible
resolution would be to search until the top-x recommendations are discovered where x > k,
but this is not pursued any further here.
One optimization that has been suggested is to consider fuzzy matches only against
similar trajectories that lead to the next points of q, the query trajectory. However,
this suggestion is flawed for a simple reason: if q has never been previously observed,
then we would not be able to make any recommendations, even if there were many
previously observed trajectories similar to q.
5.2 Order-Flexible Queries
Despite the power offered by the fuzzy matching technique of the previous section, there is
still a remaining limitation to the method. In particular, the order of points in the trajectory
query q must exactly match the order of points in the body of a trajectory s in the trajectory
database, and this violates the desire expressed in chapter 3 that the order of points close
in time should not matter.
There is an ambiguity in this problem statement in that it does not state whether we’re
looking for points close in time in the query trajectory or in the trajectory database, and this
leads to two approaches to solving the problem of order-flexible queries. The first approach
is the history-centric approach, wherein we match the query trajectory only against the
trajectory database, but allow a degree of out-of-order matching when points in a historical
trajectory are close enough in time. The other approach is the query-centric approach,
wherein we look for points close in time in the query trajectory and use this to generate
permutations of the query trajectory to match against the trajectory database. Considering
the motivation for performing order-flexible queries presented in Chapter 3, we can posit
that in a real situation these two approaches would lead to a similar set of recommendations,
but this would need extensive testing. Due to its much greater efficiency, the query-centric
approach is the one taken by this thesis for experimental purposes.
5.2.1 History-Centric Approach
The history-centric approach is perhaps the more obvious of the two approaches, and would
be an extension of the fuzzy-matching method described earlier in the chapter. That is, given
a query trajectory q, we want to match q against the database of historical trajectories,
performing a form of fuzzy matching whereby q would match (with some error factor) a
historical trajectory h even where q 6= h, but where swapping some points in h that are
close in time to each other would transform h into q. For the purpose of this discussion, we
can further limit the problem by not allowing any point of h to be swapped with any point
other than its immediate predecessor or successor in the trajectory.
At first glance, it may appear that all that we need to do is to match q against the
historical trajectory database, performing some form of look-ahead while matching to see if
swapping two points in the historical trajectory would allow it to match q. This would work
if we did not care about the time difference between points when performing order-flexible
matching. However, as argued in Chapter 3, we do care about the time difference between
visiting points in this scenario.
One proposed solution to handle this would be to attach a list to all nodes in the
generalized suffix tree of all historical trajectories passing through the node, and to keep
track of the temporal differences between a user’s visit to that node and the next node in the
trajectory. Clearly this would greatly increase the memory requirements of storing the suffix
tree, but that is not the most severe problem with this solution. The real problem is that
even with this information we cannot simply search the tree, looking ahead for potential
node swaps of small time distance, because we would have no efficient means of determining
how many historical trajectories followed the path traversed and of determining the support
for any next points. The only solution would be to trace historical trajectories through the
generalized suffix tree, which means that in the worst case, we would need to search for
all historical trajectory fragments of length k in the tree. The time required to do this is
O(n ∗ k) where n is the sum of lengths of all historical trajectories, and k is the length of
the query trajectory.
To see why it is necessary to search for all historical trajectory fragments of length k (k-
grams), consider a historical database consisting of two trajectories, x = y = a → b → c.
Suppose that the difference between x and y is that for x the time difference between the
visits to a and b is 1 minute, and the time between the visits to b and c is 10 minutes.
Similarly, suppose that for y, the time between the visits to a and b is 10 minutes, but the
time between the visits to b and c is only 1 minute. Finally, suppose that the maximum
allowable time difference between two points for them to be swapped is 1 minute. Then,
given a query trajectory q = a→ c→ b, we can see that we match q against a permutation
of y, but we cannot match q to any permutation of x. In order to determine this we need to
separately consider both x and y. The generalized suffix tree as constructed for executing
exact matching and fuzzy matching queries does not contain enough information for us to
determine that only one of x and y can match q when taking the history-centric approach
to order-flexible matching.
In order to examine all historical k-grams in the suffix tree, when constructing the suffix
tree we would need to consider the time-stamps of trajectory points when determining
if trajectory points match. It is highly unlikely for many trajectory points to share the
same time-stamps, and so using a suffix tree offers no advantages over a plain sorted list
of historical trajectory fragments in order to solve the history-centric approach to order
flexible matching. It is possible that there exists an efficient solution to the history-centric
approach, but we have been unable to think of a solution, and we believe that it is unlikely
that an efficient solution exists.
In conclusion, although the history-centric approach to order-flexible matching is intu-
itively reasonable, implementing the approach requires time linearly proportional to the size
of the historical database. This runs contrary to the requirement of on-line recommenda-
tion. We want an efficient and scalable approach where there is no linear factor of the size
of the historical database in the complexity for the method.
5.2.2 Query-Centric Approach
A more scalable method for performing order flexible queries is the query-centric approach.
This approach addresses the problem by repeating the fuzzy matching process with all per-
mutations of the query trajectory q that are temporally close to q. Taking this approach
will allow us to satisfy the order-flexibility requirement while allowing for an efficient imple-
mentation with no linear factors of the size of the historical trajectory database in its time
complexity.
Let Qp = (Qp1, Qp2, ..., Qpl) denote the set of all permutations of the query trajectory q
such that qi is swapped with qi+1 only if qi+1.time− qi.time < maxDiff , where maxDiff
is an arbitrary threshold. Using the same definitions as in the previous section, let H denote
the set of all trajectory fragments in the historical trajectory database tDB. Furthermore,
let OF = (OF1, OF2, ..., OFl) be the set of all sets of trajectory fragments with positive
similarity to each Qpi, so that OFi = {(b : n) ∈ H|similarity(b,Qpi) > 0}.
Definition 5.2.1. Order-Flexible Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in OF . The contri-
butions of the next point of a fragment f = (b : n) ∈ OFi must be weighted proportionally
to similarity(b, q), as well as to orderError(Qpi, q).
The query-centric approach can be implemented as an extension of the fuzzy matching
technique presented in the previous section. It works by first identifying all points in q that
are close in time to the next point in q, building up a vector canSwap, where canSwapi
is a boolean value indicating whether qi+1.time − qi.time < maxDiff , that is, whether
qi is close in time to qi+1. With this information it is possible for us to recursively
generate all permutations of q such that no point x in a permutation has a time-stamp more
than maxDiff later than the successor of x. For simplicity, we require that the index of a
point in a permutation be no more than 1 off from its index in q. This requirement could be
removed, at the cost of a more complex and less efficient implementation.
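The permutation generation can be sketched recursively. This is a hedged sketch of generateAllowedPermutations (which in Algorithm 5 also receives a swap-error parameter, omitted here): only adjacent swaps are allowed, gated by the strict time-difference test, and each permutation is returned together with the number of swaps used to produce it.

```python
def allowed_permutations(points, times, max_diff):
    """Generate (permutation, num_swaps) pairs for the query-centric
    approach.

    Adjacent points q[i], q[i+1] may be swapped only when
    times[i+1] - times[i] < max_diff, and because a swapped pair is
    consumed together, no point moves more than one position from
    its index in q.
    """
    def recurse(i, prefix, swaps):
        if i >= len(points):
            yield prefix, swaps
            return
        if i == len(points) - 1:
            yield prefix + [points[i]], swaps
            return
        # Keep q[i] in place.
        yield from recurse(i + 1, prefix + [points[i]], swaps)
        # Swap q[i] with q[i+1] if they are close enough in time.
        if times[i + 1] - times[i] < max_diff:
            yield from recurse(i + 2,
                               prefix + [points[i + 1], points[i]],
                               swaps + 1)
    return list(recurse(0, [], 0))

# Both gaps of p are small, only the second gap of q is small
# (times are hypothetical, in minutes).
perms_p = allowed_permutations(["a", "b", "c"], [0, 1, 2], 1.5)
perms_q = allowed_permutations(["a", "b", "c"], [0, 10, 11], 1.5)
```

This reproduces the example at the end of the section: p yields three allowed permutations while q yields only two.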
As with the fuzzy matching case described in the previous section, we need to weight the
contributions of the next points resulting from matching an order swapped query trajectory
appropriately. Let swaps(p, q) denote the minimum number of point swaps needed to obtain
a trajectory p from q. The error resulting from searching for p in the generalized suffix tree
rather than q is simply orderError = swaps(p, q)× orderFactor, where orderFactor is an
arbitrary constant. As with the fuzzy matching methods of the previous section, in order
to guarantee that the algorithm runs efficiently we need to limit the search by bounding the
maximum allowable error.
Combining this with the fuzzyError factor introduced in the previous section, the final
error used to weight the contribution of a discovered next point is error = fuzzyError +
orderError.
Although the number of possible combinations of swaps is exponential in the length of
the query trajectory, q, the length of q is expected to be very short (≤ 5 in most applica-
tions), and so there is no exponential blow up to be experienced here. Assuming a fixed
query trajectory length, the query-centric approach will execute in only O(1) times the time
required to execute a fuzzy-matching query. This query-centric approach is the approach
taken by this thesis for experimentation purposes.
Pseudo-code for the basic order-flexible matching algorithm is presented in algorithm 5.
Note that in this pseudo-code the searchTree(...) function call refers to the fuzzy matching
suffix tree search algorithm presented in section 5.1.1. A consequence of this is that our
implementation of the query-centric approach to order-flexible matching is built upon
fuzzy matching. If someone wanted to perform order-flexible matching without allowing
fuzzy matching, this function call could be replaced with a call to the exact matching suffix
tree search algorithm presented in section 4.4.4.
Input: A suffix tree root root and a query trajectory q, a fuzzy search radius maxDistance, a maximum time duration maxOrderDiff for order swapping, and a swap error swapError
Output: A set of pairs (N, similarity) of next points and similarity scores corresponding to each historical trajectory fragment that order-flexibly matches q

/* Phase 1: Generate all allowed permutations of q */
/* X = ((q0, numSwaps0), (q1, numSwaps1), ..., (qm, numSwapsm)) where q0 = q and numSwapsi denotes the number of swaps required to obtain qi from q */
X ← generateAllowedPermutations(q, maxOrderDiff, swapError)
/* Phase 2: Generate results for all permutations */
Results ← {}
for (qi, numSwapsi) ∈ X do
    initialError ← numSwapsi × swapError
    Results ← Results ∪ searchTree(root, qi, maxDistance, initialError, 0)
return Results

Algorithm 5: Searching Suffix Trees for Order-Flexible Matching
To illustrate the simplicity of the query-centric approach to order-flexible matching, con-
sider a historical database consisting of two trajectories, x = a → b → c → d and
y = b → a → c → e. The times between visits in these trajectories are irrelevant. Now,
suppose we have two query trajectories, p = q = a → b → c. In trajectory p, 1 minute
elapses between the visits to a and b, and between the visits to b and c. In trajectory q,
10 minutes elapse between the visits to a and b, but only 1 minute elapses between the
visits to b and c.
Given a maximum time difference of 1 minute, the set of allowed permutations of p is
Pp = (a → b → c, b → a → c, a → c → b), whereas the set of allowed permutations of q
is Qp = (a → b → c, a → c → b). Matching will be performed using each of the allowed
permutations of p and q. We can see that a permutation of p matches the body of x and
another matches the body of y, and so we will be able to recommend both d and e given
the query p. However, no permutation of query q matches y, and so given query q our
trajectory-based recommendation system will only be able to recommend point d.
5.3 Summary
This chapter introduced two extensions of the exact matching problem, the fuzzy matching
problem and the order-flexible matching problem, and provided methods to solve both of them.
The exact matching formulation of the trajectory-based POI recommendation problem was
revealed to suffer from two limitations: no recommendations, and similar trajectories. These
limitations were overcome by defining a fuzzy search radius and incorporating the next
points of all trajectories lying within this radius into the recommendation process. The
search algorithm presented was a simple extension of the algorithm for executing exact
matching queries presented in section 4.4.4. After solving the fuzzy matching problem we
considered the order-flexible matching problem, and found that there are two approaches to
the problem. The first, the history-centric approach, was demonstrated to be intractable,
but the second, the query-centric approach, was demonstrated to be solvable by further
extending the search algorithms used for fuzzy matching and exact matching.
Chapter 6
Experimental Results
The goal of this chapter is to present experimental results demonstrating the efficiency and
effectiveness of the methods presented in chapters 4 and 5. The chapter is divided into
three parts: the first describes the datasets used and how they were processed, the second
describes how the quality of results is evaluated, and the third presents the results of the
experiments along with an analysis of those results.
6.1 Datasets
The experimental results in this thesis have been collected using a number of processed
variants of two publicly available datasets: the INFATI dataset [14], derived from tracking
the movements of cars in a town in northern Denmark, and the trucks dataset [5], which
tracks the movements of a number of trucks in Athens, Greece. The INFATI dataset is split
into two “teams”, and for simplicity we use only the trajectories gathered by “team 1”. The
entirety of the trucks dataset is used. Visualizations of the two datasets are presented in
figure 6.1.
6.1.1 Dataset Processing
Both the INFATI and the trucks datasets are pure trajectory datasets and suffer from
the limitation that although they contain a plethora of trajectory points, each point in a
trajectory does not denote a visit to a point of interest. Rather, each point merely denotes
the location of a vehicle some fixed amount of time following its predecessor. Furthermore,
CHAPTER 6. EXPERIMENTAL RESULTS 54
(a) INFATI dataset (b) Trucks dataset
Figure 6.1: Datasets used
there are no points of interest present in the dataset, and hence to make the datasets useful
for experimenting on the methods presented in chapters 4 and 5, we need to process the
datasets to add points of interest and to map trajectories to these points of interest. In
addition, we will choose a method for probabilistically perturbing the trajectories in our
datasets in order to highlight the differences between the exact matching technique and its
variants.
Dataset Processing Model
Each of the initial datasets (INFATI and trucks) can be thought of as a historical trajectory
database tDB, where each point has its own point of interest. The initial problem to be
solved by processing the datasets is that of choosing a subset of points where we will assume
that real points of interest exist. There are three obvious approaches to solving this problem.
The first approach is to simply choose a random selection of points in the initial dataset
and place a point of interest at the location of each of these points. The second would
be to spatially cluster the trajectory points, and to declare that a reference point for each
of the top-n clusters is a point of interest. Yet another approach is to place points of
interest independently of the given trajectories, either randomly or at regular intervals.
This approach is flawed because most of these points of interest will never be visited, and
real points of interest tend to be clustered into a small subset of the total space. For
the experimentation in this thesis, I have chosen the first approach due to its simplicity
and because it allows us to easily generate processed datasets of varying sizes. We use a
user-defined poiRate, so that every poiRateth observed point in the initial dataset is assumed
to be the location of a point of interest. Each such POI is randomly assigned a leaf concept.
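A minimal sketch of this selection step (Python; the function name and the (location, concept) pair representation are assumptions, not the thesis implementation):

```python
import random

def generate_pois(points, poi_rate, leaf_concepts, rng=None):
    """Place a POI at every poi_rate-th observed point and assign it a
    randomly chosen leaf concept from the concept hierarchy."""
    rng = rng or random.Random()
    return [(p, rng.choice(leaf_concepts))
            for i, p in enumerate(points) if (i + 1) % poi_rate == 0]
```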
Given a set of points of interest, the next problem that needs to be addressed is that of
mapping the initial trajectories to the points of interest. This is done in a straightforward
manner: we compute the distance of each point of an initial trajectory as it is observed to
each point of interest, ignoring all POIs that an earlier point in the initial trajectory has
already mapped to. If there is an unvisited POI poi within some fixed distance maxDist of
the trajectory point, then the trajectory point is mapped to poi. Otherwise, the trajectory
point is ignored.
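This mapping step can be sketched as follows (Python; an illustrative nearest-unvisited-POI scan under the assumption that POIs are plain (x, y) locations, not the thesis implementation):

```python
import math

def map_to_pois(trajectory, pois, max_dist):
    """Map each raw trajectory point to the nearest POI within max_dist
    that no earlier point of the trajectory has mapped to; points with
    no such POI are dropped.  `pois` is a list of (x, y) locations and
    the result is a list of POI indices in visit order."""
    mapped, visited = [], set()
    for px, py in trajectory:
        best, best_d = None, max_dist
        for i, (qx, qy) in enumerate(pois):
            if i in visited:
                continue
            d = math.hypot(px - qx, py - qy)
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            mapped.append(best)
            visited.add(best)
    return mapped
```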
By itself, the processing algorithm described in the previous two paragraphs works quite
well, but suffers from the problem that there is too little variety in the resulting datasets.
This can be attributed to the fact that the initial datasets used were generated by a relatively
small number of individuals, and it can be expected that a dataset generated from tracking
thousands of individuals would show much more variety. As a result of this deficiency, in
order to illuminate the differences between the exact matching technique of chapter 4 and
the fuzzy and order-flexible methods of chapter 5, we need to extend the processing model.
To enhance the fuzziness of the processed datasets, we define a constant splitProb that
denotes the probability of “splitting” a point of interest. Splitting a point of interest will
generate a new POI y from an initial point x, such that y.concept = x.concept, and such
that pointDistance(x, y) < maxSplitDistance. Each time we map an initial trajectory
point p to a POI x, we split x with probability splitProb. If we do not split x, we randomly
map p to x or to one of the POIs previously generated by splitting x, with each of these
POIs selected with equal probability.
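A sketch of this splitting step (Python; the names and the `splits` bookkeeping dictionary are hypothetical, and for simplicity the new POI is offset within a box of half-width maxSplitDistance rather than a strict disc):

```python
import random

def maybe_split(poi, splits, split_prob, max_split_dist, rng):
    """With probability split_prob, derive a new POI near `poi` that
    shares its concept and record it in `splits`; otherwise map to
    `poi` or one of its previously derived splits, chosen uniformly.
    A POI is a ((x, y), concept) pair."""
    (x, y), concept = poi
    if rng.random() < split_prob:
        new = ((x + rng.uniform(-max_split_dist, max_split_dist),
                y + rng.uniform(-max_split_dist, max_split_dist)),
               concept)
        splits.setdefault(poi, []).append(new)
        return new
    return rng.choice([poi] + splits.get(poi, []))
```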
Similarly, to ensure that some trajectories visit points in different orders, we define a
constant orderProb that denotes the probability of swapping any two trajectory points p, q
such that |p.ts−q.ts| < maxTimeDiff . After all initial trajectory points have been mapped
to points of interest, we will make a pass over all generated trajectories, and perform this
order-swapping process.
Pseudo-code describing this processing algorithm can be found in Algorithm 6.
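The order-swapping pass can be sketched in isolation as follows (Python; names are hypothetical, and a seeded generator is assumed for reproducibility):

```python
import random

def swap_pass(traj, timestamps, max_time_diff, order_prob, rng=None):
    """One pass over a mapped trajectory: each point is swapped with
    its successor with probability order_prob whenever the two visits
    are within max_time_diff of each other."""
    rng = rng or random.Random()
    traj = list(traj)
    for i in range(len(traj) - 1):
        if (timestamps[i + 1] - timestamps[i] < max_time_diff
                and rng.random() < order_prob):
            traj[i], traj[i + 1] = traj[i + 1], traj[i]
    return traj
```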
INFATI Processing
The INFATI dataset was processed three times using the processing algorithm on page 57
to create three variant datasets. For each run, the splitting probability was set to 0.025, and
the order swapping probability to 0.33. Furthermore, the maximum distance for mapping a
trajectory point to a POI was set to 50. Finally, the maximum split distance was set to 2, and
the maximum time difference for order swapping to 2 seconds. Clearly the maximum time
difference is impractical for a real world application, but the points in the initial INFATI
dataset are separated by only 1 second, and so the order swapping time difference must be
made accordingly small.
Three processed INFATI datasets were generated: Infati-750, Infati-500, and Infati-
250. These datasets were processed using poiRate values of 750, 500, and 250, respectively.
Detailed information and a visualization of the Infati-500 dataset can be found in section
6.1.2.
Trucks Processing
The Trucks dataset was processed twice to create two variant datasets. As with the INFATI
dataset, for each run, the splitting probability was set to 0.025, and the order swapping
probability to 0.33. However, the Trucks dataset uses a different coordinate system than the
INFATI dataset, and thus the distances need to be adjusted. For processing the Trucks dataset, the
maximum distance for mapping a trajectory point to a POI was set to 0.007. The maximum
split distance was set to 0.0005 and the maximum time difference to 1 minute.
Two processed Trucks datasets were generated: Trucks-100, and Trucks-50. These
datasets were processed using poiRate values of 100 and 50, respectively, and as for the
INFATI dataset, detailed information and a visualization of the Trucks-100 dataset can
be found in the following section.
6.1.2 Dataset Statistics
Taking together the results of processing the INFATI and Trucks datasets, we have five
datasets available for us to experiment with: Infati-750, Infati-500, Infati-250, Trucks-100,
Input: A set of initial trajectories T, a POI rate poiRate, a maximum spatial distance
maxDist, a splitting probability splitProb, a maximum split distance maxSplitDist, an
order-swapping probability orderProb, and a maximum time difference maxTimeDiff
Output: A set of POIs POIs, and a historical trajectory database tDB

/* Phase 1: Generate POIs for every poiRateth point */
initialPOIs ← generatePOIAtEveryNthPoint(T, poiRate)
POIs ← initialPOIs; tDB ← {}

/* Phase 2: Map trajectory points to POIs */
for traj ∈ T do
    visitedPOIs ← {}
    for point ∈ traj do
        nextPoint ← nearestUnvisitedPOI(point, POIs, maxDist, visitedPOIs)
        if exists(nextPoint) then
            /* There is an unvisited POI within maxDist of point */
            if random() < splitProb then
                /* Generate a new POI within maxSplitDist of nextPoint */
                trajPoint ← split(nextPoint, maxSplitDist)
            else
                splitPoints ← all POIs previously derived from nextPoint
                trajPoint ← chooseRandom({nextPoint} ∪ splitPoints)
            POIs ← POIs ∪ {trajPoint}
            visitedPOIs ← visitedPOIs ∪ {trajPoint}
    tDB ← tDB ∪ {visitedPOIs}

/* Phase 3: Swap order of points close in time */
for traj ∈ tDB do
    for point ∈ traj do
        if successor(point).ts − point.ts < maxTimeDiff then
            if random() < orderProb then
                swap(point, successor(point)) in traj

return POIs, tDB

Algorithm 6: Pseudo-code for Dataset Processing
Name        #Traj. Points   #Trajs.   Avg. Length   #Swaps   #Users
Infati-750      35830         1487        24.1        1878       11
Infati-500      54521         1546        35.3        4200       11
Infati-250     107280         1624        66.1       13187       11
Trucks-100      90654         5260        17.2       20272       60
Trucks-50      100520         5339        18.8       23157       60
Table 6.1: Dataset Trajectory Information
Name        Number of POIs   Median POI Visits   Max POI Visits
Infati-750       1812               13                133
Infati-500       2760               13                140
Infati-250       5413               13                163
Trucks-100       3352               18                257
Trucks-50        4776               12                269
Table 6.2: Dataset POI Information
and Trucks-50. This section contains a pair of tables of information about the processed
datasets, as well as visualizations of two of them. Most of the columns in
these tables are self-explanatory.
Table 6.1 contains information about the trajectories in each processed dataset. The
only non-obvious column is “#Swaps”, which lists how many points were swapped during
phase 3 of the processing algorithm.
Table 6.2 contains information about the points of interest in each processed dataset.
The columns that need explanation are “Median POI Visits”, which lists the median number
of visits to each point of interest in the dataset, and “Max POI Visits”, which similarly lists
the maximum number of visits to any POI in the dataset.
Visualizations of Infati-500 and Trucks-100 can be found in figure 6.2. Comparing the
visualizations of these processed datasets with the graphs of the unprocessed initial trajec-
tory data found in figure 6.1, we can observe that they are generally very similar, although
much of the fine detail has been lost. This is not a problem for us as we want trajectories
where each point denotes a visit to a point of interest, and we argue that the rarely visited
regions lost in the processing do not contain points of interest. The concept hierarchy used
(a) INFATI-500 dataset (b) Trucks-100 dataset
Figure 6.2: Processed Datasets
Figure 6.3: Concept Hierarchy (Restaurant: Slow Food {White Spot, Milestone's}, Fast Food {McDonald's, Subway}; Coffee Shop: {Starbucks, Blenz, Second Cup}; Tourism: {Bridge, Skyride, Aquarium})
to generate the processed datasets is presented in figure 6.3.
6.2 Evaluating Quality
In order to analyze the results of our experiments, we need to define appropriate metrics
for evaluating the quality of a recommendation or set of recommendations. In this section
we briefly describe the two scoring metrics used by this thesis to evaluate the quality of
results: binary scoring, and weighted scoring. Before stating the definitions of these two
scoring metrics, we need a preliminary definition for the set of matching POIs.
Definition 6.2.1. The matching POIs of a test trajectory fragment t = (q, n) is defined as
matchingPOIs(q, n) = {p ∈ recommendations(q) | p ≥ n}. That is, matchingPOIs is the
set of all POIs in the recommendations for q that contain the test fragment’s next point n.
Definition 6.2.2. The binary score of a test trajectory fragment t = (q, n) is 1 if any of
the recommendations for q contain the test fragment’s next point n, and 0 otherwise. This
can be concisely expressed as:
binaryScore(q, n) = { 1 if matchingPOIs(q, n) ≠ ∅; 0 otherwise }    (6.1)
The binary score is useful for informing us about how many test points we were able
to make a meaningful recommendation for, but it is a very poor measure of the quality of
recommendations. If we imagine a scenario where every concept in the concept hierarchy
had a common ancestor c, then all our recommendation system would have to do in order
to have a binary score of 1 for every test is return a concept point associated with c as one
of its recommendations. Although we want the sum of binary scores over all test trajectory
fragments to be high, this cannot be the metric that we try to optimize because it can
be optimized by always returning the most general recommendations. What we need is to
devise a measure that incorporates a trade-off between maximizing the number of queries
that can be satisfied and minimizing the uncertainty of each recommendation. For our
purposes, the uncertainty of a recommendation will be equal to the number of distinct POIs
that are contained in the recommendation.
Recall from definition 3.2.1 that P denotes the database of all points of interest. Let us
define a function pois(p) that given a point p returns the set of all points of interest that
could be represented by p.
pois(p) = { {p.poi} if p is a trajectory point; {poi ∈ P | p ≥ poi} if p is a generalized point }    (6.2)
Definition 6.2.3. The weighted score of a test trajectory fragment t = (q, n) is the sum of
the probabilities that n is represented by each of the recommendations for q. The probability
that n is represented by a recommendation p is equal to 1/|pois(p)| if p ≥ n and 0 otherwise.
This can be written concisely as:
weightedScore(q, n) = Σ_{p ∈ matchingPOIs(q, n)} 1 / |pois(p)|    (6.3)
The weighted score measure avoids the problems of the binary score measure. In addition
to maximizing the number of tests for which a valid recommendation is returned, optimizing
the weighted score will minimize the uncertainty of the recommendations returned.
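The two metrics can be stated directly in code (Python sketch; `pois_fn` stands in for the pois(·) function of equation 6.2 and is supplied by the caller):

```python
def binary_score(matching_pois):
    """Equation 6.1: 1 if any matching recommendation exists, else 0."""
    return 1 if matching_pois else 0

def weighted_score(matching_pois, pois_fn):
    """Equation 6.3: sum over matching recommendations p of
    1 / |pois(p)|, where pois_fn(p) returns the set of concrete POIs
    that p could represent."""
    return sum(1.0 / len(pois_fn(p)) for p in matching_pois)
```

A specific trajectory point contributes 1 to the weighted score, while a generalized point covering m POIs contributes only 1/m, which is exactly the uncertainty penalty described above.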
6.3 Experimentation
6.3.1 Implementation
All experiments were implemented in C++. Nearly all of the code was written specifically
for this thesis, with the notable exception of the core algorithm for k-truncated suffix tree
construction, which was adapted and extended from an open source implementation written
by the authors of [24]. In particular, the algorithm was adapted to work on objects other
than character strings, and its memory management concepts were redesigned.
All experimentation was performed on an Intel Core 2 Quad, with 4 GB of RAM.
Although the CPU has four cores, to minimize the risk of processes affecting each other,
such as through cache contention, only one core at a time was utilized for experimentation.
6.3.2 Design
The experimentation on each dataset was performed using the technique of k-fold cross
validation, with k = 10. k-fold cross validation works by dividing the dataset into k mutually
exclusive subsets of roughly equal size, known as folds. Suffix tree construction and testing
are then performed k times. On each iteration, 1 fold is selected as the test set, and the
other k−1 folds are used as the training set that we construct the (l-truncated) generalized
suffix tree on. For more background on this process, please see [11].
In this section, all averages, such as for query time, are taken as the average value over
all k folds. On the other hand, all sums, such as the total score for a dataset given certain
parameter values, are taken as the sum of the values over all k folds. To eliminate a possible
bias, the contents of the folds for each run of each experiment were randomly selected. As a
final note, for these experiments we fold on the users in a dataset, rather than on individual
trajectories.
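Folding on users rather than trajectories can be sketched as follows (Python; the (user_id, trajectory) pair representation is an assumption for illustration):

```python
import random

def user_folds(trajectories, k, rng=None):
    """Partition trajectories into k folds by user, so that all of a
    user's trajectories land in the same fold.  `trajectories` is a
    list of (user_id, trajectory) pairs."""
    rng = rng or random.Random()
    users = sorted({u for u, _ in trajectories})
    rng.shuffle(users)
    fold_of = {u: i % k for i, u in enumerate(users)}
    folds = [[] for _ in range(k)]
    for u, traj in trajectories:
        folds[fold_of[u]].append(traj)
    return folds
```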
For a given set of test trajectories test and a query length l, every trajectory fragment
f = (q : n) in test of length l + 1 is used as a test trajectory. The score (similarly, binary
score, average query time, etc.) for a dataset is defined to be the sum (or average, where
applicable) of the value for the metric in question over all test trajectories in each fold.
Except where otherwise indicated, the following parameters were used to configure the
recommendation algorithm: queryLength = 3, numRecommendations = 3, gf = 0.2,
cf = 0.01, and kernelWidth = 1. Specifically for experiments on the INFATI dataset,
values of sf = 1 and cellEdgeLength = 100 are used. For experiments on the Trucks
dataset, default values of sf = 500 and cellEdgeLength = 0.01 are used. There
is nothing intrinsically special about these values, and they were chosen merely for their
observed effectiveness in leading to accurate recommendations. For experiments on the
INFATI dataset testing the effect of order-flexibility, a maximum time difference of 2 seconds
is used. Similarly, for experiments on the Trucks dataset testing the effect of order-flexibility,
a maximum time difference of 60 seconds is used. Unless otherwise indicated, order-flexible
matching is disabled. When swapping is allowed we always set swapError = 1, except for
when the fuzzy matching radius is set to 0, in which case we have set swapError to 0 in
order to illuminate the effects of allowing order-flexibility when only exact matches between
trajectory points are allowed. Finally, l-truncated suffix trees are used by default, where l
is set to be equal to the query length parameter being used.
In many of the graphs on the following pages, data series are identified by the dataset
the series was generated from. A number of experiments were run twice, once disallowing
order-flexible matching, and once allowing order-flexible matching. For these experiments,
the data series generated when allowing order-flexible matching is distinguished by appending “(o)”
to the name of the dataset. For example, the data series name “INFATI-500” implies
that order-flexible matching was disallowed for that experimental run, whereas the data
series name “INFATI-500 (o)” implies that order-flexible matching was allowed for that
experimental run.
6.3.3 Basic Results
This section contains the results of some baseline experiments using the default configura-
tions for each dataset presented above. The majority of the experimental results described
in subsequent sections are comparisons of variations on the default configuration with the
results described in this section.
This section addresses the following questions:
• How does the fuzzy matching radius affect the weighted and binary scores of a dataset?
How does it affect the time required to execute a query?
• How does the fuzzy matching radius affect the number of queries for which we are
unable to make any recommendation, good or not?
• How does allowing order-flexible matching affect the weighted and binary scores of a
dataset? How does it affect the time required to execute a query?
We begin by looking at how varying the fuzzy matching radius affects the weighted and
binary scores of a dataset, both when allowing and when disallowing order-flexible matching.
Figures 6.4 and 6.8 contain these results for the INFATI dataset, and figures 6.5 and 6.9
contain these results for the Trucks dataset.
In figure 6.4 we see that for the INFATI dataset, the weighted score increases rapidly
with increasing fuzzy matching radius until the fuzzy matching radius reaches approximately
25, after which the weighted score slowly declines with increasing fuzzy matching radius.
Similarly, in figure 6.5 we see that a small fuzzy matching radius results in a higher score
than disallowing fuzzy matching, but that continuing to increase the fuzzy matching radius
results in a quickly diminishing weighted score. This behaviour, in which there is a hump
in the weighted score curve, can be explained as follows. As the fuzzy matching radius is
increased from zero, at first we match only historical trajectory fragments that are highly
similar to the query trajectory, and are in fact only small variations. However, when the
fuzzy matching radius reaches a certain threshold we begin matching against very dissimilar
trajectories in the historical trajectory database, and it is at this point that the slope of the
weighted score curve turns downwards. This demonstrates that the fuzzy matching radius
used in a trajectory-based recommendation system needs to be carefully chosen in order to
maximize the benefits of allowing fuzzy matching.
Another observation that can be derived from figures 6.4 and 6.5 is the effect of
order-flexible matching on scores. In both of these charts we can see that when the fuzzy
matching radius is 0, allowing order-flexible matching results in increased weighted scores,
whereas when the fuzzy matching radius is greater than 0, the weighted scores are lower
than when disallowing order-flexible matching. Just as with setting the fuzzy matching
radius too large, allowing order-flexible matching can lead to a query matching fragments
from highly dissimilar historical trajectories. This problem could be partially alleviated by
using a very large swap error parameter.
Figures 6.6 and 6.7 demonstrate that for both the INFATI and Trucks datasets, in-
creasing the fuzzy search radius monotonically decreases the number of queries for which
we are unable to make any recommendations. However, this says nothing about
the quality of the returned results. An interesting observation in these two charts is that
allowing order-flexible matching results in a significant decrease in the number of unsatisfi-
able queries, even when we are not allowing for fuzzy matching. This observation explains
why allowing order-flexible matching helps when the fuzzy matching radius is set to 0 (recall
from section 6.3.2 that when the fuzzy matching radius is set to 0 we set swapError = 0
as well).
Looking at the figures 6.8 and 6.9 we see that results similar to those for the weighted
score measure can be observed when looking at the binary scores for our datasets. For both
INFATI and Trucks datasets, we see a large increase in the binary score as the fuzzy matching
radius is increased to a certain level. However, once this threshold is exceeded, increasing
the fuzzy matching radius decreases the binary score as the quality of recommendations
becomes worse.
The final basic results concern how varying the fuzzy matching radius and allowing
order-flexibility affect the time required per query. These results are presented
in figures 6.10 and 6.11 for the INFATI and Trucks datasets, respectively. In these figures we
can observe that the growth in the time per query is nearly linear with the fuzzy search radius
for the Trucks dataset. We can see this relationship as well for the INFATI dataset, but
it is interesting to note that for small increases in the fuzzy search radius the query time
Figure 6.4: INFATI Datasets: Weighted Scores vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
does not necessarily increase. This demonstrates that the time per query is not directly
dependent on the search radius, but only on the number of points contained within the
search radius. Regarding allowing order-flexible matching, we can clearly see in these figures
that allowing order-flexible matching significantly increases the time required to execute a
recommendation query.
Figure 6.5: Trucks Datasets: Weighted Scores vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.6: INFATI Datasets: Unsatisfiable Queries vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.7: Trucks Datasets: Unsatisfiable Queries vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.8: INFATI Datasets: Binary Scores vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.9: Trucks Datasets: Binary Scores vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.10: INFATI Datasets: Query Time (ms) vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.11: Trucks Datasets: Query Time (ms) vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
6.3.4 Query Length
All of the basic results in section 6.3.3 were generated using a query length of 3, and this
section addresses the question of how the scores and query times are affected by varying the
query length. For brevity, results are presented only for the INFATI dataset. Results on how
varying the query length affects the memory usage and construction time for k-truncated
generalized suffix trees are presented later, in section 6.3.8.
The first results of this section concern the effect of varying the query length k on the
weighted and binary scores of a dataset. In figure 6.12 we see that, except in the case
where the fuzzy matching radius is 0, the weighted scores are significantly greater for query
lengths of 2 and 3 than they are for a query length of 1. The same result can be observed
for the binary score measure in figure 6.13.
Recalling from Chapter 2 that existing mobile recommendation systems do not incorporate
a user's trajectory history, these results demonstrate that the methods of this thesis
are an improvement over existing methods.
The observant reader will notice that the scores for a query length of 5 are lower than
those for a query length of 1. This is not because the recommendations for long queries
are lower quality than those for short queries, but is due to insufficient historical data. In
figure 6.15 we can observe that when our query length is 5 there are approximately 25,000
queries for which we were unable to make any recommendation. This problem could be
avoided by making recommendations based on the longest suffix of the query trajectory that
can be matched in the historical trajectory database, but this idea is not pursued further
in this thesis.
The final observation in this section is derived from figure 6.14, which compares the
time required to execute a query with the fuzzy matching radius for a variety of different
query lengths. In this figure it is clear that increasing the query length results in shorter
query execution times. This is because increasing the query length decreases the number of
matching trajectory points that we need to consider recommending.
Figure 6.12: INFATI-500: Effect of Query Length on Weighted Score (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.13: INFATI-500: Effect of Query Length on Binary Score (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.14: INFATI-500: Effect of Query Length on Query Time (ms) (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.15: INFATI-500: Effect of Query Length on Unsatisfiable Queries (series: k = 1, 2, 3, 5; x: fuzzy search radius)
6.3.5 Number of Recommendations
All of the baseline results presented in section 6.3.3 were generated after configuring the
recommendation system to return the best 3 recommendations, and it is natural to wonder
how the number of returned recommendations affects the scores of the results for all queries.
Altering the number of recommendations returned does not affect the time required per
query because in order to select the top k recommendations the system as designed needs to
compute the confidences of all possible recommendations. Furthermore, varying k does not
affect the number of unsatisfiable queries because as long as there is at least one possible
recommendation to be made for a query, we say that the query is satisfiable. Thus, all that
we need to look at is the effect of the number of recommendations on the weighted score and
binary score of a dataset.
In figure 6.16 we can see that more recommendations generally leads to a greater weighted
score, although not always. That returning two recommendations performs slightly worse
than returning one recommendation can be explained by noting that the folds were randomly
generated for each experimental run. The differences between these weighted scores are so
small that we believe that this anomaly can be ignored. Looking only at the weighted scores
gives us only weak evidence that increasing the number of recommendations improves the
quality of results of the recommendation system.
Looking at figure 6.17 we see that increasing the number of recommendations greatly
increases the binary score, although each additional recommendation yields a diminishing
benefit. Combining this result with our previous observation
that increasing the number of recommendations returned has only a small effect on the
weighted score, we can draw a number of conclusions. The first is that the best recommen-
dation returned tends to be a specific trajectory point, and the second is that subsequent
recommendations tend to be generalized points. Together with the idea that it is easier for
a user to understand and make decisions based on a small number of recommended points
than a large number, this leads us to recommend that in a real world system between 2 and
4 recommendations be returned to the user. Furthermore, these conclusions lend credibility
to our decision to return 3 recommendations for the baseline results.
Figure 6.16: INFATI-500: Effect of the Number of Recommendations on Weighted Score (series: numRecs = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.17: INFATI-500: Effect of the Number of Recommendations on Binary Score (series: numRecs = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.18: INFATI-500: Effect of Diversification on Weighted Score (series: Not Diversified, Diversified; x: fuzzy search radius)
6.3.6 Diversification
One of the desired requirements for a useful trajectory-based POI recommendation system
described in section 3.3 was the requirement of diversification. Later in the thesis, in section
4.5, a simple greedy algorithm for diversifying a result set was presented. This section
presents results demonstrating that performing diversification does not significantly diminish
the quality of returned results and can in fact increase the scores for a dataset. Note that
for all experiments, the top 3 recommendations are returned.
In both figures 6.18 and 6.19 we can see that enabling diversification boosted both the
weighted and binary scores for the INFATI-500 dataset. This result may appear surprising
at first, but it can be taken as evidence that the results returned by the recommendation
system when diversification is disabled are too similar to each other. Diversification is pro-
viding exactly the benefit that we desired it to provide, which is to increase the probability
that one of the top-k trajectories is similar to the next point visited by a user of the system.
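The flavour of the greedy diversification step can be sketched as follows. This is an illustrative sketch, not the exact algorithm of section 4.5; `candidates`, `dist`, and the trade-off weight `lam` are hypothetical names, with each pick balancing a candidate's confidence against its distance from the results already selected:

```python
def diversify(candidates, dist, k=3, lam=0.5):
    """Greedily pick k results, trading confidence against novelty.

    candidates: list of (point, confidence) pairs, any order.
    dist: distance function between two points.
    lam: weight of the diversity term (0 = pure confidence).
    """
    # Start with the single highest-confidence candidate.
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = [remaining.pop(0)]
    while remaining and len(selected) < k:
        # Score each remaining candidate by its confidence plus its
        # distance to the nearest already-selected point.
        def score(c):
            novelty = min(dist(c[0], s[0]) for s in selected)
            return c[1] + lam * novelty
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [p for p, _ in selected]
```

With a Euclidean `dist`, a near-duplicate of the top result is demoted in favour of a more distant, lower-confidence point, which is exactly the behaviour the experiments above reward.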
Figure 6.19: INFATI-500: Effect of Diversification on Binary Score
(binary score vs. fuzzy search radius; diversified and not diversified curves)
Figure 6.20: INFATI-500: Effect of Spatial Factor on Weighted Score
(weighted score vs. fuzzy search radius; curves for sf = 0.5, 1, 2, and 4)
6.3.7 Other Variations
The framework developed in this thesis has a large number of parameters, and results
demonstrating the effects of the most important parameters were presented in the previous
sections of this chapter. To present results on every parameter would require too much space
and provide little additional insight, and so in this section we present results demonstrating
the effects of what we believe to be the next two most important parameters. These are the
spatial factor sf , and the kernel width kw.
The effects of varying the spatial factor sf on the weighted score of the INFATI-500
dataset are presented in figure 6.20. The first observation drawn from this figure is that
using a very small spatial factor, such as sf = 0.5, results in the optimal fuzzy matching
radius being much smaller than when a larger spatial factor is used. The curves for larger
spatial factors are stretched out. This is unsurprising given our earlier results, but what is
more interesting is that the maximum weighted score depends on the chosen spatial factor.
Furthermore, there is no clear relationship between the maximum weighted score attained
and the chosen spatial factor.
On the other hand, we can observe in figure 6.21 that there is a clear relationship between
the chosen spatial factor and the time required to execute a query. As the spatial factor
increases, the time required to execute a query decreases. This result is expected given that
increasing the spatial factor will decrease the number of trajectories that do not exactly
match the query trajectory but that lie within the fuzzy matching radius.
The other parameter varied in this section is the kernel width kw. As mentioned in
Section 4.2.2, the accuracy of kernel estimation generally depends more on the chosen kernel
width kw than on the particular kernel function chosen. Consequently, we explore the effect
of varying kw here. In figure 6.22 we can see that the selected kernel width has a clear
effect on the weighted score for the INFATI-500 dataset. On the other hand, the effect is
not monotonic, and the optimal kernel width appears to be 2, rather than 1 or 4. In any
real world implementation of the recommendation system the kernel width would need to
be carefully tuned to return optimal results.
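The sensitivity to the kernel width can be illustrated with a generic Gaussian kernel, the kernel family used in this thesis; the exact estimator of section 4.2.2 may normalize differently, so this is only a sketch:

```python
import math

def kernel_weight(d, kw):
    """Gaussian kernel weight of a point at distance d for width kw.

    A generic sketch of Gaussian kernel weighting; a small kw
    concentrates all weight on exact matches, while a larger kw lets
    nearby points contribute, smoothing the density estimate.
    """
    return math.exp(-(d * d) / (2.0 * kw * kw))

def density(query, points, kw):
    """Unnormalized kernel density estimate at `query` from a 1-D sample."""
    return sum(kernel_weight(abs(query - p), kw) for p in points)
```

Doubling kw widens the neighbourhood that influences the estimate: a point at distance 2 receives weight exp(-2) ≈ 0.135 when kw = 1 but exp(-0.5) ≈ 0.607 when kw = 2, which is why the curves in figure 6.22 shift as kw varies.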
Figure 6.21: INFATI-500: Effect of Spatial Factor on Query Time
(time per query in ms vs. fuzzy search radius; curves for sf = 0.5, 1, 2, and 4)
Figure 6.22: INFATI-500: Effect of Kernel Width on Weighted Score
(weighted score vs. fuzzy search radius; curves for kw = 1, 2, and 4)
6.3.8 Effects of k-Truncated Suffix Trees
Until this point, all of our experimentation has been performed using k-truncated generalized
suffix trees, with k set to match the current queryLength parameter. In this section we
compare the use of k-truncated generalized suffix trees with the use of ordinary generalized
suffix trees.
In figure 6.23 we compare the effects of varying k on the time required to construct
a k-truncated generalized suffix tree. Furthermore, we compare this with the time
required to construct an ordinary generalized suffix tree, which is independent of the
length of queries allowed. In this figure we can clearly see that for small values of k,
the construction of k-truncated suffix trees is much more efficient than the construction of
ordinary suffix trees, and that as k increases, the time required to construct a k-truncated
suffix tree converges towards the time required to construct an ordinary suffix tree. Although
the tree construction times shown in this figure are small, for a large real-world system the
time benefits of using a k-truncated suffix tree could be considerable.
In figure 6.24 we show the effects of varying k on the memory required to store a k-
truncated generalized suffix tree, and compare this with the memory required to store a
plain generalized suffix tree. Just as k-truncated suffix trees required much less time to
construct than plain suffix trees for small values of k, they also use much less memory
than plain suffix trees for small values of k. Furthermore, as k
increases, the memory used by a k-truncated suffix tree converges towards the memory used
by a plain suffix tree. In the paper [24] introducing k-truncated generalized suffix trees it is
noted that by using an optimization known as “multi-int leaves” it may be possible to reduce
the memory usage of the k-truncated suffix tree. This optimization works by allowing leaves
to represent locations in multiple strings. Thus, it may be possible to improve the memory
advantage of using k-truncated suffix trees over plain suffix trees, but this is not explored
in this thesis.
Finally, in figure 6.25 we compare the time required to execute queries using truncated
and non-truncated suffix trees. Interestingly, for query lengths of 3 and 5 there is a slight
time advantage to using truncated suffix trees, but when the maximum query length is 2,
queries are executed slightly more quickly on a plain suffix tree. The precise reason for this
is unclear, but we postulate that it is because the internal leaves of a k-truncated
suffix tree are stored in a linked list, and when k = 2 these lists of internal leaves may
Figure 6.23: Effects of Query Length on Suffix Tree Construction Time
(construction time in ms vs. query length k; truncated and non-truncated curves for the
INFATI and Trucks datasets)
grow very long. Regardless, the effect of truncating the suffix tree on the time required
to execute queries is minimal in every case. In [24], the authors find that queries are executed
significantly more quickly on k-truncated suffix trees than on plain suffix trees. This contrasts
with our observations, and the key difference between their experiments and ours
is that the strings (trajectories) stored in our suffix trees are generally fairly short, whereas
the strings (DNA sequences) stored in their suffix trees tend to be thousands of characters
long.
Figure 6.24: Effects of Truncation on Suffix Tree Memory Usage
(memory usage in KB vs. query length k; truncated and non-truncated curves for the
INFATI and Trucks datasets)
Figure 6.25: INFATI-500: Effects of Truncation on Query Times
(time per query in ms vs. fuzzy search radius; truncated and non-truncated (NT) curves
for k = 2, 3, and 5)
6.4 Summary
This chapter began by describing the two real-world datasets used in this thesis, the INFATI
dataset and the Trucks dataset. We then described how we processed these two datasets to
generate five derived datasets that are suitable for use by the trajectory-based recommen-
dation system described in this thesis. Furthermore, a method for evaluating the quality
of returned results was presented. Following these preliminary sections, a large number of
experiments were run and analyzed. The most important observations are:
• (Section 6.3.4) Using a query length greater than 1 leads to increased recommendation
quality. This means that using the framework described in this thesis can produce
higher quality recommendations than existing mobile POI recommendation systems.
• (Section 6.3.3) Allowing fuzzy matching can increase the weighted score of a dataset,
but if the fuzzy matching radius used is too large then the query trajectory will match
too many other trajectories, leading to a decrease in recommendation quality.
• (Section 6.3.3) Allowing order-flexible matching has a significant benefit when fuzzy
matching is disallowed, but when fuzzy matching is allowed the effects are negligible.
• (Section 6.3.3) Queries can be answered quickly, in the order of milliseconds, even
when the fuzzy matching radius is large. However, the query execution time grows
with both the fuzzy matching radius and the size of the historical trajectory database.
• (Section 6.3.5) Returning a small number of recommendations, such as 3, is the optimal
trade-off between ensuring that at least one recommendation is relevant to the user
and not overwhelming the user with recommendations.
• (Section 6.3.6) Using the diversification algorithm presented in section 4.5 can result
in improved weighted and binary scores.
• (Section 6.3.7) Varying the parameters of the recommendation system affects the qual-
ity of the results returned, and the parameters need to be tuned for optimal performance.
However, the system’s performance is relatively stable with respect to these parameters.
• (Section 6.3.8) For small values of k, using k-truncated generalized suffix trees instead
of plain generalized suffix trees reduces the time required for suffix tree construction as
well as the memory required to store the suffix tree.
Chapter 7
Conclusion
Location-aware mobile devices are quickly becoming ubiquitous, and this provides an oppor-
tunity for mobile recommendation systems. Existing research into mobile recommendation
systems has focused on recommending points of interest to a user based on the user’s current
location, ignoring the user’s recent trajectory history. Any method for improving the
quality of the returned recommendations stands to greatly increase the usefulness of a
mobile recommendation system.
The most significant contribution of this thesis is the introduction and formalization of
the trajectory-based POI recommendation problem along with a set of desired requirements
for a useful trajectory-based POI recommendation system and efficient solutions to three
variants of the problem. Beginning with a naive approach to the exact matching problem,
we proceeded to construct a trajectory-based recommendation system framework capable
of recommending concept points and localized points in addition to individual points of in-
terest. Following this construction, the recommendation framework was extended to allow
for fuzzy matching and order-flexible queries to be executed in an efficient manner. For
each of these variants we provided the necessary details for their efficient implementation.
Finally, we demonstrated that the trajectory-based POI recommendation framework de-
veloped in this thesis is both efficient and effective on a group of datasets constructed by
processing two real world datasets. The recommendation system framework constructed in
this thesis is efficient, scalable, highly configurable and capable of generating higher quality
recommendations than a recommendation system that ignores trajectory histories.
CHAPTER 7. CONCLUSION 84
7.1 Future Directions
Many research directions were considered, but not pursued as the main research thrust
of this thesis changed directions during its development. Many of these directions may
nonetheless lead to interesting extensions of the research presented in this thesis, and these
are summarized below.
7.1.1 Personalization
One of the requirements for a useful trajectory-based POI recommendation system presented
in section 3.3 was the requirement of personalization. This requirement expresses the desire
for the recommendations returned for a given query trajectory to be personalized for the
user submitting the query. We foresee two distinct approaches for accomplishing this.
The first approach is to perform collaborative filtering on the set of potential recommen-
dations. This approach implements personalization as a post-processing step where the set
of recommendations returned to the user could be selected based on the points of interest
visited by similar users. Although this approach might work, a more interesting approach
could be developed by integrating personalization more deeply into the recommendation
process.
The second approach is to construct three sets of recommendations. The first set of
recommendations is based on personal history. These recommendations are generated by
performing the recommendation process only on the trajectories in the historical trajectory
database that were generated by the current user. The second set of recommendations
is based on user group history. These recommendations are generated by performing the
recommendation process only on those trajectories in the historical trajectory database that
were generated by users in the same user group as the current user. This user group could
be determined according to user attributes such as occupation or age. The final set of
recommendations is based on the histories of all users, and is the set of recommendations
that we have been computing in this thesis. Notice that all of these sets of recommendations
are generated by the same recommendation process; they are simply based on different
subsets of the historical trajectory database. These three sets of recommendations could be
mixed to determine the final set of recommendations to be returned to the user.
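A minimal sketch of this mixing step might look as follows, assuming each of the three recommendation sets maps a point to a confidence score; the mixing weights here are hypothetical and would need tuning:

```python
def mix_recommendations(personal, group, everyone,
                        weights=(0.5, 0.3, 0.2), k=3):
    """Blend three confidence-scored recommendation sets into one top-k.

    Each input maps point -> confidence; `weights` sets how strongly
    personal, group, and all-user history count (illustrative values).
    """
    combined = {}
    for recs, w in zip((personal, group, everyone), weights):
        for point, conf in recs.items():
            # A point recommended by several subsets accumulates score.
            combined[point] = combined.get(point, 0.0) + w * conf
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return [point for point, _ in ranked[:k]]
```

A point that appears in both the personal and group sets naturally outranks one supported by global history alone, which is the intended effect of the mixing.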
7.1.2 Parallelizing Matching
One of our requirements for a useful trajectory-based recommendation system presented
in section 3.3 was that the system should be highly scalable, and one approach to achieve
this would be to parallelize the matching process. Not only is it possible to parallelize the
suffix tree search methods across multiple machines, they can easily be parallelized across
multiple threads on the same physical machine. As most computers now have multiple CPU
cores, this means that we will be able to more fully utilize the processing power available
to answer trajectory-based top-k recommendation queries.
To parallelize these methods, we can split our set of historical trajectories S into m
disjoint sets, each containing approximately |S|/m trajectories. Next,
we can build a generalized suffix tree (or k-truncated generalized suffix tree) for each of
these sets of trajectories. This will be less space efficient than building a single suffix tree,
but each of these suffix trees will be independent of the others, and so nothing prevents
us from building all of them in parallel. The total running time to build all of the suffix
trees will then be O(m log |Σ|), exactly the same as before, except that this work can be
split amongst all available processors on the current machine (or across machines for truly
massive data sets). Thus, it is possible to arbitrarily split up the pre-processing of our
historical trajectories.
Generating m suffix trees instead of a single suffix tree improves the time required to
construct the suffix trees because each one can be constructed in parallel, and could also
assist by keeping the memory requirements of each tree small. Doing this would also affect
the time required to execute queries, but the effect on query time should generally be
negligible. This is because a query trajectory q is expected to be very short, and the time
required to match a query trajectory q is proportional only to the length of q and not to
the size of the suffix tree (assuming the tree has a fixed alphabet size). Even if we need to
perform m suffix tree matching operations, the running time should not be largely affected.
Furthermore, each of these matching operations can be done in parallel to build up the
set of next points of q, as they are independent of each other. A potential future research
direction would be to implement a parallel matching algorithm and to test its effect on query
execution times.
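The partition-and-merge pattern can be sketched as below. A plain dictionary of fragments stands in for each partition's (truncated) suffix tree, and a thread pool stands in for multiple cores or machines; only the structure of the parallelization is the point of this sketch:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def build_index(trajectories, query_len):
    """Map each fragment of length query_len to a Counter of next points.

    A dictionary stands in for the per-partition suffix tree here; each
    partition's index is built independently, so building them can also
    be done in parallel.
    """
    index = {}
    for t in trajectories:
        for i in range(len(t) - query_len):
            frag = tuple(t[i:i + query_len])
            index.setdefault(frag, Counter())[t[i + query_len]] += 1
    return index

def parallel_next_points(partitions, query):
    """Query every partition's index in parallel and merge the counts."""
    indexes = [build_index(p, len(query)) for p in partitions]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda ix: ix.get(tuple(query), Counter()), indexes)
    merged = Counter()
    for r in results:
        merged += r  # counts from disjoint partitions simply add up
    return merged
```

Because the partitions are disjoint, merging the per-partition counts reproduces exactly the next-point multiset that a single index over all trajectories would give.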
7.1.3 User Feedback
In order to build a trajectory-based POI recommendation system that would be useful for
real human beings we may want to incorporate user feedback into the recommendation
process. This means that it would be possible for users to vote on whether a given rec-
ommendation is relevant or not, and for this vote to affect future recommendations. We
could accomplish this by modifying our confidence measures as described in sections 4.1.1
and 4.2.3 to incorporate relevance feedback. For example, we could multiply the confidence
of each recommendation by the proportion of users who have previously found that recom-
mendation to be useful and relevant. More details on possible mechanisms for incorporating
relevance feedback into recommendation systems can be found in [19].
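A minimal sketch of such a feedback adjustment follows; the smoothing prior is a hypothetical addition so that sparsely voted recommendations are not zeroed out, as the thesis does not prescribe a particular formula:

```python
def feedback_adjusted(confidence, up_votes, total_votes, prior=1.0):
    """Scale a recommendation's confidence by its observed usefulness.

    A smoothed proportion, (up + prior) / (total + 2 * prior), avoids
    discarding recommendations that have received few votes; `prior`
    is an illustrative smoothing constant, not from the thesis.
    """
    usefulness = (up_votes + prior) / (total_votes + 2.0 * prior)
    return confidence * usefulness
```

An unvoted recommendation keeps half of its original confidence rather than none, while a heavily up-voted one approaches full confidence.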
7.1.4 Temporal Constraints
Many points of interest are only likely to be visited at particular times of day. For example,
a breakfast cafe may be visited only in the morning, and a movie theatre may be visited only
in the evening. A breakfast cafe recommendation in the evening is not useful for a user of the
system, and undesirable from the perspective of an advertiser who may have to pay for an
irrelevant recommendation. The trajectory-based POI recommendation system developed
in this thesis is not time-aware and has no means of incorporating temporal constraints,
such as that a breakfast cafe is relevant only in the morning, into the recommendation
process. One potential means of addressing this limitation would be to add an attribute
corresponding to the time of day, such as “morning” or “evening”. This attribute would
be part of the key for a trajectory point, in addition to the point of interest corresponding
to the trajectory point. When searching for a query trajectory in the historical trajectory
database we would need to match both the time of day and the point of interest visited
in order to proceed down a branch of the generalized suffix tree. An interesting research
direction would be to investigate if this is an effective means of incorporating temporal
constraints and whether there are any other means of incorporating temporal constraints
that are more effective.
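The proposed key extension can be sketched as follows; the bucket boundaries are illustrative only:

```python
def time_bucket(hour):
    """Coarse time-of-day attribute; the cut-offs are illustrative."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"

def trajectory_key(poi, hour):
    """Key for a trajectory point: the POI plus a time-of-day bucket.

    Matching a query against the historical database would then
    require both components of the key to agree before descending
    a branch of the generalized suffix tree.
    """
    return (poi, time_bucket(hour))
```

Under this key, a cafe visit at 8:00 and a cafe visit at 20:00 are distinct symbols, so a morning query trajectory can only match morning history.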
7.1.5 Continuous Matching
In section 2.3 we discussed research by Frentzos et al. [5] into nearest-neighbour searches on
moving object databases. Whereas our methods are intimately connected with the number
of points lying on a trajectory, the methods described by Frentzos et al. are intimately
connected with the notion of time. Their methods are able to compute the similarity
between two trajectories over a specific period of time, regardless of the number of points in
each of the two trajectories, and thus can compute the similarity between two trajectories
in a continuous manner. An interesting future research direction would be to develop a
trajectory-based POI recommendation system based on the continuous matching methods
described by Frentzos et al. that satisfies the requirements described in section 3.3.
7.1.6 Longer Tails
Given a trajectory t of length n, and a trajectory fragment f = ti..j , let the tail of f be
defined to be tj..n. That is, the tail of f is the suffix of t starting from the last point in f .
The m-tail of f is defined to be tj..min(j+m,n), the first m points of the tail of f .
Throughout this thesis we have taken a query trajectory q and matched it against the his-
torical trajectory database in order to find all historical trajectory fragments F that match
q (perhaps allowing fuzzy matches or order-flexible matches). Then, for each historical tra-
jectory fragment t ∈ F we consider the next point of t as a potential recommendation for the
query q. In other words, recommendations are based on the 1-tail of each matching histor-
ical trajectory fragment. One future research direction would be to incorporate the m-tail,
rather than just the 1-tail, of each matching historical trajectory fragment into the
recommendation process.
To see why this would be useful, suppose that following a query trajectory q, some people
visit a museum x, but many people visit x following a visit to a cafe y. By only considering
the next points of the historical trajectory fragments matching q, the confidence of recom-
mending x may be low, but by considering the 2-tails of the historical trajectory fragments
matching q the confidence of recommending x could be significantly higher.
Incorporating longer tails into the recommendation process would be a simple extension
of the methods described in this thesis. After matching a query trajectory q in the suffix
tree representing the historical trajectory database, we can easily walk the subtree below
the match in the suffix tree to determine the m-tails of each matching historical trajectory
fragment. The contribution of points in the m-tail should be weighted according to their
position in the tail. For example, the next point of the query trajectory (the first point in
the m-tail) should be given full weight, whereas later points should be given less weight. It
would be a very interesting future research direction to explore the effects of incorporating
longer tails into the recommendation process and to observe if it can significantly improve
recommendation quality.
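A sketch of this extension follows; here the m-tail is taken as the m points following the matched fragment (the thesis' indexing also includes the fragment's last point), and `decay` is a hypothetical positional weighting:

```python
def m_tail(trajectory, j, m):
    """The m points following a fragment whose last matched point has
    0-based index j; shorter near the end of the trajectory."""
    return trajectory[j + 1 : j + 1 + m]

def tail_scores(matches, m, decay=0.5):
    """Accumulate recommendation scores from the m-tail of each match.

    `matches` is a list of (trajectory, end_index) pairs. The i-th
    tail point contributes decay**i, so the immediate next point gets
    full weight and later points count for progressively less.
    """
    scores = {}
    for trajectory, j in matches:
        for i, point in enumerate(m_tail(trajectory, j, m)):
            scores[point] = scores.get(point, 0.0) + decay ** i
    return scores
```

In the museum example above, a museum that is rarely the immediate next point but frequently appears two steps later (after the cafe) accumulates score from the second tail position of many matches.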
7.1.7 Other Directions
This section briefly describes a number of other potential research directions that are too
small to warrant their own section.
• Investigate using alternative distance measures and kernel functions. In this thesis we
use Gaussian kernel estimation for performing density estimation, and use the distance
metric defined in section 4.4.3 to compute point distances. It would be interesting to
investigate if other distance measures and kernel functions could be used to improve
the quality of recommendations.
• Investigate the possibility of allowing concept points to be present in the historical
trajectory database and query trajectory in addition to the trajectory points that we
currently permit. This could be utilized to eliminate user-specific locations such as a
user’s home from the historical trajectory database and query trajectory. Instead of
everybody starting their day at their own home, it could be possible for everybody to
start at the concept point “home”, and this could potentially increase the effectiveness
of the recommendation system.
• Investigate whether weighting the error of fuzzy matches improves the quality of query
results when the query length is large. Intuitively, the most recent point in a query
trajectory is more important than earlier points, and so a potential improvement would
be to weight the distance between a point p in the query trajectory q and a point in a
historical trajectory according to the position of p in q. For example, we could weight
the distance between p and another point by i/k, where i is the index of p in q and
k = |q| is the length of q.
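The position weighting suggested in the last bullet can be sketched as follows, with `dist` a hypothetical point-distance function and the weight of point i taken as i/k so that the most recent query point dominates the error:

```python
def weighted_fuzzy_distance(query, fragment, dist):
    """Position-weighted error between a query and a candidate fragment.

    Point i of the query (1-based) contributes with weight i / |q|,
    so a mismatch on the most recent point costs the most; `dist` is
    a placeholder for the point-distance metric of section 4.4.3.
    """
    k = len(query)
    return sum((i / k) * dist(q, f)
               for i, (q, f) in enumerate(zip(query, fragment), start=1))
```

With the same raw error, a mismatch on the first of three query points contributes only 1/3 as much as the same mismatch on the last point.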
Appendix A
Constructing suffix trees
A.1 Introduction
Definitions of suffix trees, generalized suffix trees, and k-truncated generalized suffix trees
can be found in Chapter 4 in section 4.4.1, and it is recommended that it be read prior
to reading this appendix. This appendix describes an efficient method for their construc-
tion. One useful resource for a description of suffix trees and their construction is Dan
Gusfield’s book, “Algorithms on Strings, Trees, and Sequences” [10]. However, the content
and organization of this appendix are essentially taken from Schultz et al. [24].
Although a linear time method for suffix tree construction was first discovered in 1973
by Peter Weiner [30], the first online method for suffix tree construction was published in
1995 by Esko Ukkonen [28]. “Online” in this context means that characters are added to
the suffix tree in the order in which they are presented, and this means that it is possible
to update the suffix tree with new characters as they are discovered.
The methods described in this appendix all have a published complexity bound of O(m),
where m is the sum of lengths of all input strings. However, it is important to note that this
bound is only valid assuming a fixed alphabet. For the purposes of this thesis, the alphabet
Σ is not fixed, and so the real complexity bound is O(m log |Σ|).
A.2 Ukkonen’s Algorithm for Suffix Trees
Given a string s of length m, Ukkonen’s algorithm processes s from left to right (in the
order that characters are presented) in m phases in order to construct a suffix tree T . In
APPENDIX A. CONSTRUCTING SUFFIX TREES 90
phase i, the substring s1..i and all of its suffixes s2..i, ..., si..i are inserted into T if they are
not already present in T . Furthermore, each phase is divided into extensions, so that the
action of extension j of phase i is to add the substring sj..i into the T if it is not already
present.
Following Gusfield [10], extremely high level pseudo-code for Ukkonen’s algorithm is
presented in algorithm 7.
Input: A string s
Output: A suffix tree T for s
1: T ← tree consisting of a single edge representing s1
2: for i ← 2 to |s| do
       /* Begin phase i */
3:     for j ← 1 to i do
           /* Begin extension j */
4:         Starting from the root, find the end of the path labeled sj..i in the current tree
5:         If needed, extend the path by adding character si
Algorithm 7: Ukkonen’s Algorithm (High Level)
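Algorithm 7 can be mirrored by a naive, runnable construction that inserts every suffix character by character into a trie of dictionaries. Without suffix links and the skip/count trick it is far from linear time, and single-character edges replace the compressed edge labels of a true suffix tree, so this is only a sketch of the phases/extensions structure:

```python
def build_suffix_tree(s):
    """Naive construction mirroring Algorithm 7's phases and extensions.

    In phase i, extension j descends from the root along s[j:i],
    creating missing nodes as it goes (rules 2/3); existing paths are
    simply followed. The result is an uncompressed suffix trie.
    """
    root = {}
    for i in range(1, len(s) + 1):      # phase i: process s[0:i]
        for j in range(i):              # extension j: insert s[j:i]
            node = root
            for ch in s[j:i]:
                node = node.setdefault(ch, {})
    return root

def contains(tree, sub):
    """A substring of s corresponds to a path from the root."""
    node = tree
    for ch in sub:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Every substring query then reduces to a single root-to-node walk of length |sub|, which is the property the matching methods of this thesis rely on.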
Ukkonen’s algorithm’s handling of each extension j for phase i + 1 can be divided into
two distinct parts. The first part is concerned with inserting the string sj..i+1 into the suffix
tree, and the second is concerned with finding the substring sj+1..i in the suffix tree, and
thus preparing for the next extension.
For the first part, we need to know how to insert sj..i+1 into T given that β = sj..i is
already present in the tree. This is done according to three rules:
• Rule 1: The path β from the root ends at a leaf of T . To update T , simply add si+1
to the end of the leaf’s label.
• Rule 2: The path β from the root does not end at a leaf of T , and no path continuing
from β begins with si+1. In this case, a new leaf must be added to T . Note that if β
ends in the middle of an edge, then that edge must be split.
• Rule 3: The path β from the root does not end at a leaf of T , but there is a path
continuing from β beginning with si+1. In this case, there is nothing to do.
It is useful and interesting to note that the first rule is not strictly required, because any
leaf node created by the second rule will always remain a leaf node, and so when the leaf
node is added we can simply set the label of the edge leading to the new leaf to be the
entire suffix from si+1.
As mentioned above, in addition to performing suffix extensions, the other part of Ukko-
nen’s algorithm is concerned with finding the next suffix sj+1..i+1 to be extended. At a high level
this is very straightforward to understand. However, we need to be careful to avoid con-
structing an O(n^2) algorithm, and to avoid this Ukkonen uses a number of implementational
“tricks”.
The first technique, originally proposed in 1976 by McCreight [20], is the suffix link. A
suffix link is a pointer between two internal nodes, N,M of the suffix tree T , such that if
the path to N is xα, and the path to M is α, then there will be a suffix link from N to M .
The suffix link will be denoted as N.link.
At first glance, it seems that these suffix links will be sufficient to find the location in
the tree where the next extension is to be performed. However, we may need to traverse
upwards from our extension point to find the nearest internal node N , and we can then
follow N.link to another internal node M , but from that node may again need to traverse
down the tree along some path γ to find the next extension point. The technique used to
optimize this is known as the skip and count trick.
The key to the skip and count trick is that we are guaranteed that γ is already present
in the tree. Thus, at M we need only look for the child of M whose first character is the
first character of γ. We then either move to this child of M or to the end of γ, if it ends
in the middle of an edge, and this is repeated until we reach the end of γ. The essential
point here is that we can move from node to node (or node to end of γ) using constant time
operations, and so the time to find the location of the next suffix extension is proportional
only to the number of nodes passed through when traversing γ. Using these techniques, it
can be shown that a suffix tree can be constructed in time linear to the length of the input
string, assuming a fixed alphabet.
An important practical note is that many suffix tree implementations use a linked list
at each node, and this can slow down insertion and lookup because, as the tree grows,
it takes increasingly longer to find the child node corresponding to the next character in the
suffix being inserted. This is an important consideration given that one of
our requirements is that queries be executed in real time.
An algorithm block with the pseudo-code for all of the methods found in this section
and the next section can be found at the end of the appendix.
A.3 Constructing Generalized Suffix Trees
The algorithm described in the previous section works when we want to construct a suffix
tree on a single string. However, there are two small enhancements that need to be made
to the algorithm in order to be able to construct generalized suffix trees on multiple strings.
The first change that needs to be made is to add to every leaf an identifier for its source
string. The second change is the use of Internal Leaves, which are linked lists of pairs (string
id, position in string) that serve to indicate which strings a suffix is present in, as well as
the starting position of the suffix in that string. These two minor changes are sufficient to
make the algorithm described in the previous section able to construct generalized suffix
trees. These changes do not affect the complexity of the algorithm described in the previous
section, and it is possible to construct a generalized suffix tree in time O(m), where m is
the sum of lengths of all input strings. One optimization presented by Schultz et al. [24] is
that it can be more efficient to represent all internal leaves using a single node, as this reduces
the number of pointers required to store the suffix tree and can result in faster execution by
increasing reference locality and reducing the number of cache misses.
As a final point for this section, these changes lead to an additional rule for suffix
extension:
• Rule 4: Whenever inserting a suffix leads to a node in the tree T , create a new
internal leaf recording the ID of the string currently being processed and the starting
position of the current suffix in that string, and add it to the node.
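The two changes can be illustrated with a deliberately naive sketch that inserts every suffix of every string into a plain trie (this is not Ukkonen's algorithm and runs in quadratic time; it only shows what the internal-leaf annotations look like). All names here are illustrative.

```python
# Naive generalized suffix trie: each node keeps an "internal leaf" list
# of (string_id, start_position) pairs for every suffix ending there.

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.leaves = []     # internal leaf: (string_id, start_position) pairs

def build_generalized_trie(strings):
    root = Node()
    for sid, s in enumerate(strings):
        s = s + "$"                      # terminator character
        for start in range(len(s)):
            node = root
            for c in s[start:]:
                node = node.children.setdefault(c, Node())
            # Rule 4: record which string this suffix came from and where
            node.leaves.append((sid, start))
    return root
```

A suffix shared by several input strings ends at a single node whose leaf list then holds one (string id, position) pair per occurrence, which is exactly the information the generalized suffix tree needs to report.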
A.4 Constructing k-truncated Generalized Suffix Trees
The two obvious methods to construct a k-truncated generalized suffix tree (kTST) for a set
of strings are, first, to delete subtrees of a full generalized suffix tree and, second, to build
a generalized suffix tree by inserting every k-mer of each input string. However, neither of
these methods is particularly good, with the latter approach having a time complexity of
O(km), where m is the sum of the lengths of all input strings. Continuing to follow Schulz,
Bauer, and Robinson [24], it is possible to construct k-truncated generalized suffix trees in
linear time.
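The O(km) baseline mentioned above can be sketched directly: insert the window of at most k characters starting at every position of every string into a trie. This is only meant to show what a kTST contains; the linear-time construction described next avoids the factor-k blow-up. Names are illustrative.

```python
# Naive O(km) kTST construction: insert every window of size <= k of
# each input string into a trie, recording (string_id, start_position)
# internal leaves at the end of each window.

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.leaves = []     # (string_id, start_position) pairs

def build_ktst_naive(strings, k):
    root = Node()
    for sid, s in enumerate(strings):
        s = s + "$"                          # terminator character
        for start in range(len(s)):
            node = root
            # only the first k characters of each suffix are inserted
            for c in s[start : start + k]:
                node = node.children.setdefault(c, Node())
            node.leaves.append((sid, start))
    return root
```

For "mississippi$" with k = 3 this yields the tree of Figure A.1: no path is deeper than three characters, and a node such as the one spelling "iss" carries internal leaves for both occurrences of that 3-mer.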
To construct the kTST, we move left to right over each input string, considering a window
of size no greater than k. The current string depth is tracked in a variable denoted depth.
Because only a window of size at most k is considered, rather than an entire suffix, we need
to modify rule 2 as follows:
• Rule 2∗: Same as rule 2, except that only the suffix truncated to length k is inserted
into the tree.
In addition to this modified suffix extension rule, a modification to the second part of
Ukkonen's algorithm is required as well. Once the algorithm reaches depth k, it inserts the
k-mer sj..j+k into T , and the next k-mer to be inserted is sj+1..j+k+1. Whereas Ukkonen's
algorithm would normally break and move to the next phase, resetting the extension, what
we do here is increment the phase and advance to the next extension, so that the algorithm
arrives at sj+1..j+k at the end of the extension. This string already exists in the tree, as it
would have been inserted in the previous phase. Note that the algorithm for constructing k-
truncated generalized suffix trees can be used to construct ordinary generalized suffix trees
by simply setting k = ∞. Pseudo-code for the modified Ukkonen's algorithm, adapted
from [24], is presented in Algorithm 8.
Figure A.1: 3-truncated suffix tree for the word "mississippi$".
Input: A string s
Output: A k-truncated suffix tree T for s

m ← length(s)
lastNode ← root(T ), node ← lastNode
j ← 1, depth ← 0
for i ← 1 to m + 1 do                                        /* Phase i */
    for j while j ≤ i and i ≤ m do                           /* Extension j */
        /* Part 1: Insert suffix sj..i+1 into T */
        if si+1 isn't contained in tree at current position then
            if sj..i doesn't end directly at node then
                node ← SplitEdge
            Add leaf to node with edge label starting with si+1    /* Rule 2/2∗ */
            lastNode.link ← node, lastNode ← node
        else
            Move down one character along edge                     /* Rule 3 */
            depth ← depth + 1
            if depth = k then
                Add new internal leaf                              /* Rule 4 */
                i ← i + 1
            break
        /* Part 2: Update current position to sj+1..i */
        if node ≠ root(T ) then
            if sj..i ends directly at node and node has a suffix link then
                node ← node.link
                depth ← depth − 1
            else
                xα ← label between current position and node.parent
                node ← node.parent
                depth ← depth − 1
                if node ≠ root(T ) then
                    node ← node.link
                    γ ← xα
                else
                    γ ← α
                Use skip & count to move back down via γ           /* This alters node */
                if current position is node and lastNode has no suffix link then
                    lastNode.link ← node, lastNode ← node

Algorithm 8: Modified Ukkonen's algorithm for k-truncated suffix trees.
Bibliography
[1] A. Asthana, M. Crauatts, and P. Krzyzanowski. An indoor wireless system for personalized shopping assistance. In WMCSA '94: Proceedings of the 1994 First Workshop on Mobile Computing Systems and Applications, pages 69–74, Washington, DC, USA, 1994. IEEE Computer Society.

[2] Zhixiang Chen, Richard Fowler, Ada W. Fu, and Chunyue Chen. Fast Construction of Generalized Suffix Trees Over a Very Large Alphabet, volume 2697 of Lecture Notes in Computer Science, pages 284–293. Springer Berlin / Heidelberg, 2003.

[3] Sigal Elnekave, Mark Last, and Oded Maimon. Incremental clustering of mobile objects. Data Engineering Workshops, 22nd International Conference on, 0:585–592, 2007.

[4] R. Fraile and S. J. Maybank. Vehicle trajectory approximation and classification. In Paul H. Lewis and Mark S. Nixon, editors, British Machine Vision Conference, 1998.

[5] Elias Frentzos, Kostas Gratsias, Nikos Pelekis, and Yannis Theodoridis. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 11(2):159–193, 2007.

[6] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 63–72, New York, NY, USA, 1999. ACM.

[7] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 330–339, New York, NY, USA, 2007. ACM.

[8] Gyozo Gidofalvi, Xuegang Huang, and Torben B. Pedersen. Privacy-preserving data mining on moving object trajectories. In Proceedings of the 8th International Conference on Mobile Data Management, Mannheim, Germany, May 2007.

[9] Gyozo Gidofalvi and Torben Bach Pedersen. Mining long, sharable patterns in trajectories of moving objects. Geoinformatica, 13(1):27–55, 2009.
[10] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.

[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, January 2006.

[12] Tzvetan Horozov, Nitya Narasimhan, and Venu Vasudevan. Using location for personalized POI recommendations in mobile environments. In SAINT '06: Proceedings of the International Symposium on Applications and the Internet, pages 124–129, Washington, DC, USA, 2006. IEEE Computer Society.

[13] Ming Hua, Jian Pei, Ada W. C. Fu, Xuemin Lin, and Ho-Fung Leung. Efficiently answering top-k typicality queries on large databases. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 890–901. VLDB Endowment, 2007.

[14] Christian S. Jensen, H. Lahrmann, Stardas Pakalnis, and J. Runge. The INFATI data. CoRR, cs.DB/0410001, 2004.

[15] Hoyoung Jeung, Man Lung Yiu, Xiaofang Zhou, Christian S. Jensen, and Heng Tao Shen. Discovery of convoys in trajectory databases. Proc. VLDB Endow., 1(1):1068–1080, 2008.

[16] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.

[17] Jae-Gil Lee, Jiawei Han, Xiaolei Li, and Hector Gonzalez. TraClass: trajectory classification using hierarchical region-based and trajectory-based clustering. Proc. VLDB Endow., 1(1):1081–1094, 2008.

[18] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: a partition-and-group framework. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 593–604, New York, NY, USA, 2007. ACM.

[19] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.

[20] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.

[21] Paul Resnick and Hal R. Varian. Recommender systems. Commun. ACM, 40(3):56–58, 1997.

[22] Francesco Ricci and Quang Nhat Nguyen. Acquiring and revising preferences in a critique-based mobile recommender system. IEEE Intelligent Systems, 22(3):22–29, 2007.
[23] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295, New York, NY, USA, 2001. ACM.

[24] Marcel H. Schulz, Sebastian Bauer, and Peter N. Robinson. The generalised k-truncated suffix tree for time- and space-efficient searches in multiple DNA or protein sequences. Int. J. Bioinformatics Res. Appl., 4(1):81–95, 2008.

[25] Upendra Shardanand and Pattie Maes. Social information filtering: algorithms for automating "word of mouth". In CHI '95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 210–217, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co.

[26] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, April 1986.

[27] Yuichiro Takeuchi and Masanori Sugimoto. CityVoyager: An Outdoor Recommendation System Based on User Location History, volume 4159 of Lecture Notes in Computer Science, pages 625–636. Springer Berlin / Heidelberg, 2006.

[28] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[29] Mark van Setten, Stanislav Pokraev, and Johan Koolwaaij. Context-Aware Recommendations in the Mobile Tourist Application COMPASS, volume 3137 of Lecture Notes in Computer Science, pages 235–244. Springer Berlin / Heidelberg, 2004.

[30] Peter Weiner. Linear pattern matching algorithms. In SWAT '73: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, Washington, DC, USA, 1973. IEEE Computer Society.

[31] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. Mining interesting locations and travel sequences from GPS trajectories. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 791–800, New York, NY, USA, 2009. ACM.